Call Spots

The Call Reference Spots section of the pipeline is a method of gene calling which runs quickly on a small set of spots ($\approx$ 50,000 per tile) of the anchor image. Initially, this was our final mode of gene calling, but it has since been superseded by OMP, which differs from Call Spots in that it runs on many more pixels, regardless of whether they have been detected as a spot.

Despite this, the call spots section is still a crucial part of the pipeline as it estimates several important parameters used in the OMP section.

Some of the most important exported parameters of this section are:

Colour Normalisation Factor $\mathbf{A}$: $(n_{\text{t}} \times n_{\text{r}} \times n_{\text{c}})$ array which multiplies the colours to minimise any systematic brightness variability between different tiles, rounds and channels and maximise spectral separation of dyes,
Bleed Matrix $\mathbf{B}$: $(n_{\text{d}} \times n_{\text{c}})$ array of the typical channel spectrum of each dye.
Bled Codes $\mathbf{K}$: $(n_{\text{g}} \times n_{\text{r}} \times n_{\text{c}})$ array of the expected colour spectrum for each gene.

Algorithm Breakdown

The inputs to the algorithm are:

Raw spot colours $F_{src}$ for all spots $s$ (defined as local maxima of round $r_{\text{anchor}}$, channel $c_{\text{anchor}}$) and the tile $t(s)$ they belong to.
A list of genes $g$ and their associated dye codes $d(g, r)$ for each round $r$. These codes were generated by the Reed-Solomon algorithm, which should minimise the overlap between codes.
A raw bleed matrix $\mathbf{B_{\textrm{raw}}}$ of shape $(n_{\text{dyes}} \times n_{\text{c}})$ obtained from images of free-floating drops of each dye.

0: Preprocessing

The purpose of this step is to approximately equalise the brightness of different tiles, rounds and channels and to remove any background which is constant across rounds from each spot.

We transform the raw spot colours $F_{src}$ as follows:

\[F_{src} \mapsto \tilde{A}_{t(s)rc}F_{src}\]

In the formula above:

The initial normalisation factor $\tilde{A}_{trc}$ is defined as

\[ \tilde{A}_{trc} = \dfrac{1}{\text{Percentile}_s(F_{src}, 95)} \]

for all spots $s$ in tile $t$. This is a good estimate of the scaling factor needed to make the brightest spots in each tile, round and channel have the same intensity.

If background_subtract in the config is set to true (typically false), we also subtract a per-spot, per-channel background from the scaled colours:

\[ F_{src} \mapsto F_{src} - \text{Percentile}_r(F_{src}, 25) \]

For 7 rounds, this is the brightness of the second dimmest round of the scaled spot colours in channel c. This is a good estimate of the constant signal in channel c across all rounds, which we want to remove.
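The scaling and optional background subtraction above can be sketched with NumPy as follows. This is a minimal illustration on random data, not the pipeline's own code: the function names `initial_scale` and `subtract_background` and the array layout `(spots, rounds, channels)` are assumptions made for the example.

```python
import numpy as np

def initial_scale(colours, tile, n_tiles, q=95):
    """Return A_tilde[t, r, c] = 1 / Percentile_s(F_src, q) over the spots in tile t."""
    _, n_rounds, n_channels = colours.shape
    scale = np.ones((n_tiles, n_rounds, n_channels))
    for t in range(n_tiles):
        in_tile = tile == t
        if in_tile.any():
            scale[t] = 1.0 / np.percentile(colours[in_tile], q, axis=0)
    return scale

def subtract_background(colours, q=25):
    """Subtract, for each spot and channel, the q-th percentile across rounds."""
    background = np.percentile(colours, q, axis=1, keepdims=True)
    return colours - background

rng = np.random.default_rng(0)
colours = rng.uniform(1.0, 100.0, size=(200, 7, 4))  # hypothetical (spots, rounds, channels)
tile = rng.integers(0, 2, size=200)                   # t(s) for each spot

a_tilde = initial_scale(colours, tile, n_tiles=2)
scaled = a_tilde[tile] * colours                      # F_src -> A_tilde[t(s), r, c] * F_src
cleaned = subtract_background(scaled)                 # only if background_subtract is true
```

After scaling, the 95th percentile of every tile, round and channel equals 1, which is the sense in which the brightest spots are equalised.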

We define the intensity of spot $s$ as

\[ I(s) = \min_r\Big(\max_c(|F_{src}|)\Big) \]

which will be useful later for refined spot selection by ensuring there is brightness in every sequencing round.
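As a small worked example of the intensity definition, the sketch below evaluates $I(s)$ for two made-up spots. The helper name `spot_intensity` is hypothetical.

```python
import numpy as np

def spot_intensity(colours):
    """I(s) = min over rounds of the max over channels of |F_src|."""
    return np.abs(colours).max(axis=2).min(axis=1)

colours = np.array([
    [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7]],    # bright in every round
    [[0.9, 0.1], [0.05, 0.02], [0.1, 0.7]],  # dim in the second round
])
intensity = spot_intensity(colours)
```

The second spot gets a low intensity because one round has no bright channel, even though its other rounds are as bright as those of the first spot.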

1: Initial Gene Assignment

The purpose of this step is to provide some preliminary gene assignments that will allow us to estimate the bleed matrix and the bled codes. We will work extensively with the bleed matrix in these calculations, but bear in mind that this is the raw bleed matrix $\mathbf{B_{\textrm{raw}}}$.

We'd like to define a probability that spot $s$ (fluorescence $F_{src}$) comes from gene $g$, and we'd like this probability to have the following properties:

The probability of round $r$ being assigned to dye $d$ should be invariant to changes in the overall brightness of $\mathbf{F_{sr}}$,
the probabilities of each round $r$ should be independent.

Why these properties?

Property 1 is desirable because of the way the bridge probes work in the experiment, as shown below.

The 3 probe system.

Upon every mRNA transcript of interest, several padlock-probes are attached. These stay fixed in place throughout the experiment. To illuminate each gene with the expected dye in each round, bridge probes (which are gene-specific on one arm and dye-specific on the other) are transported to all the padlock probes associated with this spot and ligated.

The brightness of each gene in each round is proportional to the amount of bridge probes that have ligated. Unfortunately, the number of bridge probes varies wildly between genes and rounds, giving rise to systematic differences in brightness between genes and rounds.

Normalising the brightnesses of each spot in each round is a way to get around this problem.

Spots assigned to the gene Chrm 1 had many more bridge probes in round 4 than round 3, leading to systematic brightness differences.

Property 2 is only approximately true, but independence between rounds makes the formula for the probability of a spot being assigned to a gene much simpler.

Let:

$\mathbf{f} = (\mathbf{f_1}, \ldots, \mathbf{f_{n_r}}) ^ T$ be the $(n_r \times n_c)$ round-normalised fluorescence matrix of the spot,
$\mathbf{b}_g = (\mathbf{B_{d(g, 1)}}, \ldots, \mathbf{B_{d(g, n_r)}}) ^ T$ be the $(n_r \times n_c)$ round-normalised bled code for gene $g$.

We define the probability of spot $s$ being assigned to gene $g$ as

\[ \mathbb{P}[G = g \mid \mathbf{F} = \mathbf{f}] = \frac{\exp(\kappa \mathbf{b}_g \cdot \mathbf{f})}{\sum_{g'} \exp(\kappa\mathbf{b}_{g'} \cdot \mathbf{f})},\]

where $\kappa$ is a concentration parameter which controls how much the probabilities are spread out among the genes.
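A minimal sketch of this probability, assuming colours stored as a `(spots, rounds, channels)` array and dye codes as an integer array with entries $d(g, r)$. The function name `gene_probabilities` and the toy bleed matrix are illustrative only.

```python
import numpy as np

def gene_probabilities(colours, bleed, dye_codes, kappa=2.0):
    """Softmax over genes of kappa * (b_g . f) for round-normalised f and b_g.

    colours:   (n_spots, n_rounds, n_channels)
    bleed:     (n_dyes, n_channels), one spectrum per dye
    dye_codes: (n_genes, n_rounds) integer array with entries d(g, r)
    """
    # Normalise each round of each spot to unit length.
    norms = np.linalg.norm(colours, axis=2, keepdims=True)
    f = colours / np.where(norms > 0, norms, 1.0)
    # Unit-length dye spectra, then b_g[r] = B[d(g, r)].
    unit_bleed = bleed / np.linalg.norm(bleed, axis=1, keepdims=True)
    codes = unit_bleed[dye_codes]                    # (n_genes, n_rounds, n_channels)
    # Frobenius inner product between every spot and every gene code.
    logits = kappa * np.einsum("src,grc->sg", f, codes)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability only
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

bleed = np.array([[1.0, 0.1, 0.0], [0.1, 1.0, 0.1], [0.0, 0.1, 1.0]])
dye_codes = np.array([[0, 1, 2], [1, 2, 0], [2, 0, 1]])
# A spot that follows gene 0's code, with a different brightness in each round.
spot = np.stack([5.0 * bleed[0], 0.5 * bleed[1], 2.0 * bleed[2]])[None]
probs = gene_probabilities(spot, bleed, dye_codes)
```

Because every round is normalised before the dot product, rescaling any single round of the spot leaves `probs` unchanged, which is property 1 above.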

How to choose $\kappa$?

The parameter $\kappa$ is set by adjusting the config parameter kappa and has default value 2. The value of $\kappa$ controls how much the probabilities are spread out among the genes, but does not influence the gene ordering.

$\kappa = 0$ yields a uniform distribution of probabilities between all genes,
$\kappa \rightarrow \infty$ yields a distribution that tends to 1 for the gene with the maximum dot product and 0 for all others.

Effects of varying $\kappa$ on the probabilities of a single spot.

When working with larger gene panels, all probabilities are spread out more naturally, so it helps to increase $\kappa$ so that probabilities have a consistent interpretation. Our current implementation sets $\kappa = 2$ if $n_g < 200$ and 3 otherwise.
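The effect of $\kappa$ can be checked numerically. In the sketch below the dot products are made-up values for four genes, and `softmax` and `default_kappa` are hypothetical helpers; `default_kappa` only restates the rule described above.

```python
import numpy as np

def softmax(dot_products, kappa):
    """Gene probabilities from the dot products b_g . f for a single spot."""
    logits = kappa * np.asarray(dot_products, dtype=float)
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()

def default_kappa(n_genes):
    """kappa = 2 for panels with fewer than 200 genes and 3 otherwise."""
    return 2 if n_genes < 200 else 3

dot_products = [2.6, 2.1, 1.0, 0.4]   # hypothetical b_g . f values for 4 genes
uniform = softmax(dot_products, kappa=0.0)
default = softmax(dot_products, kappa=2.0)
sharp = softmax(dot_products, kappa=50.0)
```

The ordering of the genes is the same for every positive $\kappa$; only the spread of the probabilities changes.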

Gene Probability Derivation

Dye Probabilities

We'll model the normalised round fluorescence vectors $\mathbf{F_r}$ arising from dye $d$ as being random and distributed according to a von Mises-Fisher distribution with mean $\mathbf{B_d}$ and concentration parameter $\kappa$.

This model has probability density function

\[\mathbb{P}[\mathbf{F_{r}} = \mathbf{f_r} \mid D = d] = M_{\kappa} \exp(\kappa\mathbf{f_r} \cdot \mathbf{B_d}),\]

where $\mathbf{f_r}$ is a unit vector and $M_{\kappa}$ is a normalization constant we don’t need to worry about.

Gene Probabilities

Now let $\mathbf{F} = (\mathbf{F_1}, \ldots, \mathbf{F_{n_r}}) ^ T$ be the $(n_r \times n_c)$ matrix of normalised fluorescence vectors of each round $r$ of a spot $s$. By independence between rounds, the probability of observing the fluorescence $\mathbf{f}$ from a spot of gene $g$ is just the product of the probabilities that each round $r$ is assigned to dye $d(g, r)$. In equations, this simplifies nicely to:

\[ \begin{aligned} \mathbb{P}[\mathbf{F} = \mathbf{f} \mid G = g] &= \prod_r \mathbb{P}[\mathbf{F_{r}} = \mathbf{f_r} \mid D = d(g, r)] \\ &= \prod_r M_{\kappa} \exp \left( \kappa\mathbf{f_r} \cdot \mathbf{B_{d(g, r)}} \right) \\ &= M_{\kappa}^{n_r} \exp \left( \kappa \sum_r \mathbf{f_r} \cdot \mathbf{B_{d(g, r)}} \right) \\ &= M_{\kappa}^{n_r} \exp(\kappa \mathbf{f \cdot b_g}), \end{aligned} \]

where

$\mathbf{f} = (\mathbf{f_1}, \ldots, \mathbf{f_{n_r}}) ^ T$ is the observed round-normalised $(n_r \times n_c)$ fluorescence matrix of the spot,
$\mathbf{b_g} = (\mathbf{B_{d(g, 1)}}, \ldots, \mathbf{B_{d(g, n_r)}}) ^ T$ is the $(n_r \times n_c)$ matrix of the bled code for gene $g$,
the dot product $\mathbf{f \cdot b_g}$ is the Frobenius Inner Product for Matrices, ie: the sum of the elementwise product of the two matrices.

We have so far only defined the probability of $\mathbf{F} = \mathbf{f}$ given $G = g$. We can find the probability of $G = g$ given $\mathbf{F} = \mathbf{f}$ using Bayes' Rule:

\[ \mathbb{P}[G = g \mid \mathbf{F} = \mathbf{f}] = \dfrac{\mathbb{P}[\mathbf{F} = \mathbf{f} \mid G = g] \mathbb{P}[G = g]}{ \mathbb{P}[\mathbf{F} = \mathbf{f}]}. \]

For the priors, we will assume that:

$\mathbb{P}[G = g] = \frac{1}{n_g}$ (ie: all genes are equally likely),
$\mathbb{P}[\mathbf{F} = \mathbf{f}] = \sum_g \mathbb{P}[\mathbf{F} = \mathbf{f} \mid G = g] \mathbb{P}[G = g] = \frac{1}{n_g} \sum_g M_{\kappa}^{n_r} \exp(\kappa \mathbf{b}_g \cdot \mathbf{f})$, (ie: $\mathbf{f}$ comes from one of the genes)

This gives us the final probability:

\[ \mathbb{P}[G = g \mid \mathbf{F} = \mathbf{f}] = \frac{\exp(\kappa \mathbf{b}_g \cdot \mathbf{f})}{\sum_{g'} \exp(\kappa\mathbf{b}_{g'} \cdot \mathbf{f} )}\]

2: Bleed Matrix Calculation

The purpose of this step is to compute an updated estimate of the bleed matrix.

Set some probability threshold $\gamma$ (in the config file $\gamma$ is called gene_prob_thresold and has default value 0.9). Set an intensity threshold $\delta$ (in the config file $\delta$ is called gene_intensity_threshold and has default value 0.2). We define the following sets:

\[ \mathcal{S} = \{ s : p(s) > \gamma \text{ and } I(s) > \delta \}, \]

\[ G_{rd} = \{ g : d(g,r) = d \}, \]

\[ J_{rd} = \{ \mathbf{F_{sr}} \in \mathbb{R}^{n_c}: s \in \mathcal{S}, \ g_s \in G_{rd} \} \]

In words, these can be described as follows:

$\mathcal{S}$ is the set of spots $s$ which have both a high gene probability $p(s)$ and a high intensity $I(s)$,
$G_{rd}$ is the set of genes with dye $d$ in round $r$,
$J_{rd}$ is the set of round $r$ colours of spots in $\mathcal{S}$ whose assigned gene $g_s$ has dye $d$ in round $r$.

By taking the union of $J_{rd}$ across rounds, we end up with a set of reliable colour vector estimates for dye $d$:

\[ \mathcal{J}_d = \bigcup_r J_{rd} \]
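The construction of $\mathcal{J}_d$ can be sketched as below. The function name `colours_for_dye`, its argument names and the toy data are assumptions for illustration; the default thresholds are the values of $\gamma$ and $\delta$ quoted above.

```python
import numpy as np

def colours_for_dye(colours, gene_no, gene_prob, intensity, dye_codes, dye,
                    prob_thresh=0.9, intensity_thresh=0.2):
    """Collect J_d: round colours of reliable spots whose gene has `dye` in that round."""
    reliable = (gene_prob > prob_thresh) & (intensity > intensity_thresh)  # the set S
    rows = []
    for r in range(dye_codes.shape[1]):
        genes_rd = np.flatnonzero(dye_codes[:, r] == dye)                 # G_rd
        keep = reliable & np.isin(gene_no, genes_rd)
        rows.append(colours[keep, r])                                     # J_rd
    return np.concatenate(rows, axis=0)                                   # union over rounds

dye_codes = np.array([[0, 1], [1, 0]])       # gene 0: dyes (0, 1); gene 1: dyes (1, 0)
colours = np.arange(12, dtype=float).reshape(3, 2, 2)   # (spots, rounds, channels)
gene_no = np.array([0, 1, 0])                # assigned gene g_s of each spot
gene_prob = np.array([0.95, 0.99, 0.50])     # p(s)
intensity = np.array([0.5, 0.3, 0.9])        # I(s)
j_0 = colours_for_dye(colours, gene_no, gene_prob, intensity, dye_codes, dye=0)
```

Here the third spot is excluded because its probability is too low, so `j_0` holds round 0 of the first spot and round 1 of the second.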

Why do we find spots like this?

The simpler way to find representative colours for each dye would be to look at the colours for all spots $s$ where $\mathbf{F_{sr}} \cdot \mathbf{B_{raw, d}}$ is above some threshold. This would give us a set of colours which are likely to be from dye $d$.

Our method is better for two reasons:

The raw bleed matrix $\mathbf{B_{raw}}$ is not always a good estimate of the bleed matrix.
The central dyes have very similar colour spectra, so it is difficult to classify which dye the vector comes from by looking at round $r$ alone. By using information from adjacent rounds, we can more confidently ensure that the colours we are looking at are from dye $d$.

Let $\mathbf{J}$ be the $(n_{\text{good spots}} \times n_c)$ matrix form of the set $\mathcal{J}_d$. This just means each row of $\mathbf{J}$ corresponds to a good spot and each column corresponds to a channel. Compute the first singular vectors of $\mathbf{J}$, ie: the optimal unit vectors $\boldsymbol{\omega} \in \mathbb{R}^{n_c}$ and $\boldsymbol{\eta} \in \mathbb{R}^{n_{\text{good spots}}}$ such that

\[ J_{s, c} \approx \lambda \eta_s \omega_c, \]

for some scalar $\lambda$. We then set $\mathbf{B_d} = \boldsymbol{\omega}$, which is a normalised fluorescence vector for dye $d.$
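A sketch of this rank-one fit using `numpy.linalg.svd` on synthetic data. The sign flip is an assumption added for the example: singular vectors are only defined up to sign, so the sketch picks the orientation with a positive sum.

```python
import numpy as np

def dye_spectrum(j_matrix):
    """First right singular vector of J, oriented so that its entries sum to a positive value."""
    _, _, vt = np.linalg.svd(j_matrix, full_matrices=False)
    omega = vt[0]
    if omega.sum() < 0:
        omega = -omega
    return omega

rng = np.random.default_rng(1)
true_spectrum = np.array([0.8, 0.6, 0.0])            # hypothetical unit dye spectrum
brightness = rng.uniform(0.5, 2.0, size=(500, 1))    # each good spot has its own brightness
j_matrix = brightness * true_spectrum + rng.normal(scale=0.01, size=(500, 3))
b_d = dye_spectrum(j_matrix)
```

Since every row of `j_matrix` is a scaled copy of the same spectrum plus a little noise, `b_d` recovers `true_spectrum` regardless of how bright the individual spots are.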

3: Free Bled Code Estimation

The purpose of this step is to estimate a representative colour for each gene $g$, which we call a free bled code $E_{grc}$.

What makes these codes free?

$E_{grc}$ is free in the sense that for each gene $g$, $E_{grc}$ is only determined by spots assigned to gene $g$ and not by spots assigned to other genes.

Our method of estimating the tile-independent free bled codes $\mathbf{E_{g}}$ (and similarly the tile-dependent free bled codes $\mathbf{D_{g,t}}$) should satisfy the following properties:

If we have no spots, we should use a prior vector $\mathbf{E_g} = (\mathbf{B_{d(g, 1)}}, \ldots, \mathbf{B_{d(g, n_r)}}) ^ T$,
We should allow each round $\mathbf{E_{gr}}$ to scale $\mathbf{B_{d(g, r)}}$ easily,
We should allow each round $\mathbf{E_{gr}}$ to change the direction of $\mathbf{B_{d(g, r)}}$ less easily, but still allow it to change.

Why these properties?

Property 1 is necessary because for large gene panels we often have very few reads of each gene, meaning that we have very few samples to compute $\mathbf{E_g}$ and even fewer to compute $\mathbf{D_{g, t}}$.
Property 2 is necessary because, as mentioned previously, different concentrations of bridge probes lead to systematic differences in brightness between genes in different rounds. We want to allow the brightness of each gene in each round to be scaled up or down without needing very many samples to do so.
Property 3 is necessary because sometimes the way that a particular dye is expressed varies from gene to gene. An example of this is when the dyes are not completely washed out between rounds, leading to a small amount of bleedthrough from the previous round.

In the example below, both CPLX2 and FOS have dye 2 in their codes (R5 and R4 respectively), but due to incomplete washout of dye 0 in R3 of CPLX2 these genes have very different codes for dye 2.

CPLX2: Bleedthrough into R5, Dye 2.

FOS: No bleedthrough into R6, D2.

The following mean satisfies all the properties mentioned above. Given $n$ round fluorescence vectors $\mathbf{f_1}, \ldots, \mathbf{f_n}$ and a prior unit vector $\mathbf{b}$, all in $\mathbb{R}^{n_c}$, we define the parallel bayes mean of these as

\[ \mathbf{\bar{F}}_{\alpha\beta}(\mathbf{b}) = \dfrac{\alpha^2}{n + \alpha^2} \mathbf{b} + \dfrac{1}{n + \alpha^2} \bigg( \sum_i \mathbf{f_i \cdot b} \bigg) \mathbf{b} + \dfrac{1}{n+\beta^2} \sum_i \bigg( \mathbf{f_i} - (\mathbf{f_i} \cdot \mathbf{b})\mathbf{b} \bigg). \]

The values $\alpha^2$ and $\beta^2$ are in the config file as concentration_parameter_parallel (default value 10) and concentration_parameter_perpendicular (default value 50) respectively.
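The formula translates directly into a few lines of NumPy. This is a sketch under the assumption that the round colours are stacked as rows of an `(n, n_c)` array; the name `parallel_bayes_mean` and the example vectors are illustrative.

```python
import numpy as np

def parallel_bayes_mean(vectors, prior, alpha_sq=10.0, beta_sq=50.0):
    """Parallel Bayes mean of the rows of `vectors` with unit prior vector `prior`."""
    vectors = np.asarray(vectors, dtype=float).reshape(-1, prior.size)
    n = vectors.shape[0]
    projections = vectors @ prior                        # f_i . b for each spot
    parallel_sum = projections.sum() * prior             # sum_i (f_i . b) b
    perpendicular_sum = vectors.sum(axis=0) - parallel_sum
    parallel = (alpha_sq + projections.sum()) / (n + alpha_sq) * prior
    perpendicular = perpendicular_sum / (n + beta_sq)
    return parallel + perpendicular

prior = np.array([1.0, 0.0, 0.0])
spots = np.array([[2.0, 0.3, 0.0], [1.8, 0.5, 0.1], [2.2, 0.4, -0.1]])
code = parallel_bayes_mean(spots, prior)
no_data = parallel_bayes_mean(np.empty((0, 3)), prior)   # falls back to the prior
```

With only three spots, the component along the prior has already moved from 1 towards the sample value of about 2, while the perpendicular component is still strongly shrunk towards zero.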

How to choose and interpret $\alpha$ and $\beta$?

The formula for $\mathbf{\bar{F}}_{\alpha \beta}(\mathbf{b})$ is quite complicated, but it is actually quite easy to interpret once we understand what it is doing for different values of $\alpha$ and $\beta$.

If $\alpha = \beta = 0$, this is just the average. ie: $\mathbf{\bar{F}}_{0,0}(\mathbf{b}) = \frac{1}{n} \sum_i \mathbf{f_i}$,
if $\alpha^2 = \beta^2 = m$, for some positive integer $m$, this is the average of $\mathbf{f_1}, \ldots, \mathbf{f_n}$ and $m$ copies of $\mathbf{b}$, ie: $\mathbf{\bar{F}}_{\sqrt{m},\sqrt{m}}(\mathbf{b}) = \frac{1}{n + m} \sum_i \mathbf{f_i} + \frac{m}{n + m} \mathbf{b}$.
From the previous point, we see that if $\alpha = \beta = \infty$, this is just the prior vector $\mathbf{b}$, ie: $\mathbf{\bar{F}}_{\infty, \infty}(\mathbf{b}) = \mathbf{b}$.
The component of $\mathbf{\bar{F}}_{\alpha \beta}(\mathbf{b})$ parallel to $\mathbf{b}$ has magnitude $\frac{\alpha^2 + \sum_i \mathbf{f_i} \cdot \mathbf{b}}{n + \alpha^2}$, which means that:
- If $n \ll \alpha^2$ this magnitude is approximately 1, so this component is approximately $\mathbf{b}$,
- If $n \gg \alpha^2$ this magnitude is approximately $\frac{1}{n} \sum_i \mathbf{f_i} \cdot \mathbf{b}$, which is the magnitude of the sample mean $\mathbf{\bar{f}}$ in the direction $\mathbf{b}$.
The component of $\mathbf{\bar{F}}_{\alpha \beta}(\mathbf{b})$ perpendicular to $\mathbf{b}$ is $\frac{1}{n + \beta^2} \sum_i ( \mathbf{f_i} - (\mathbf{f_i} \cdot \mathbf{b})\mathbf{b} )$, which means that:
- If $n \ll \beta^2$ this component is approximately $\mathbf{0}$,
- If $n \gg \beta^2$ this component is approximately $\frac{1}{n} \sum_i ( \mathbf{f_i} - (\mathbf{f_i} \cdot \mathbf{b})\mathbf{b} )$, which is the component of the sample mean $\mathbf{\bar{f}}$ perpendicular to $\mathbf{b}$.

From this analysis, we see that $\alpha^2$ is roughly the number of spots needed to scale $\mathbf{\bar{F}}_{\alpha \beta}(\mathbf{b})$ in the direction of $\mathbf{b}$, and $\beta^2$ is roughly the number of spots needed to scale $\mathbf{\bar{F}}_{\alpha \beta}(\mathbf{b})$ perpendicular to $\mathbf{b}$.

This is why we set $\alpha^2 \ll \beta^2$. We want to easily scale the average in the direction of the prior vector, but not easily change its direction.

We use these to estimate the free bled codes $\mathbf{E_{gr}}$ for each gene $g$ and round $r$ as follows:

Let $\mathbf{f_1}, \ldots, \mathbf{f_n} \in \mathbb{R}^{n_c}$ be the round $r$ fluorescence vectors of spots assigned to gene $g$ with probability greater than $\gamma$ and let $\mathbf{B_{d(g, r)}}$ be the prior unit vector. We then set each round $r$ to have free bled codes $\mathbf{E_{gr}}$ given by

\[ \mathbf{E_{gr}} = \mathbf{\bar{F}}_{\alpha \beta}(\mathbf{B_{d(g, r)}}). \]

The case for $\mathbf{D_{g, t}}$ is exactly analogous, except we use the fluorescence vectors of spots assigned to gene $g$ in tile $t$ with probability greater than $\gamma$.

Derivation of the Parallel Bayes Mean

The formula for the parallel bayes mean is a maximum a posteriori estimate. This means that we view the data as coming from a particular distribution with some unknown mean $\boldsymbol{\mu}$, which we want to estimate. We have some prior beliefs about what $\boldsymbol{\mu}$ should be and how this should vary, which we encode in a prior distribution of potential values for $\boldsymbol{\mu}$. The observed data has a certain probability given $\boldsymbol{\mu}$, and by Bayes' rule each $\boldsymbol{\mu}$ has a probability given the data. The maximum a posteriori estimate $\boldsymbol{\hat{\mu}}$ is the value of $\boldsymbol{\mu}$ which maximises this conditional probability distribution.

Let $\mathbf{F_1}, \ldots, \mathbf{F_n}$ be the round $r$ fluorescence vectors of spots assigned to gene $g$ with high probability and let $\mathbf{B}_{d(g,r)}$ be the prior unit vector.

To begin, assume the vectors $\mathbf{F_1}, \ldots, \mathbf{F_n}$ are i.i.d. normal random variables with mean $\boldsymbol{\mu}$ and covariance $I_{n_c}$, which means the sample mean is also normal with mean $\boldsymbol{\mu}$ and covariance $\frac{I_{n_c}}{n}$. Impose a normal prior on the space of possible means:

\[ \overline{\mathbf{F}} \sim \mathcal{N} \bigg( \boldsymbol{\mu}, \frac{\boldsymbol{I_{n_c}}}{n} \bigg) \]

\[ \boldsymbol{\mu} \sim \mathcal{N}(\mathbf{B}_{d(g,r)}, \Sigma) \]

where

\[ \Sigma = \text{Diag}\left(\frac{1}{\alpha^2}, \frac{1}{\beta^2}, \ldots, \frac{1}{\beta^2}\right), \]

in an orthonormal basis whose first vector is $\mathbf{v}_1 = \mathbf{B}_{d(g,r)}$ and whose remaining vectors are orthogonal to it.

Set $\boldsymbol{\Lambda} =\boldsymbol{\Sigma}^{-1}$, $\mathbf{b} = \mathbf{B_{d(g,r)}}$, and recall that the normal is a conjugate prior, meaning the posterior $\boldsymbol{\mu} \mid \mathbf{\overline{F}}$ is also normal.

To find its mode we'll solve for the zeros of the derivative of its log-density. The log-density of $\boldsymbol{\mu} \mid \mathbf{\overline{F}}$ is given by

\[ \begin{aligned} l(\boldsymbol{\mu}) &= \log P(\boldsymbol{\mu}| \overline{\mathbf{F}} = \mathbf{f}) \\ \\ &= \log P(\boldsymbol{\mu}) + \log P(\overline{\mathbf{F}} = \mathbf{f} | \boldsymbol{\mu}) + C \\ \\ &= -\frac{1}{2} (\boldsymbol{\mu} - \mathbf{b})^T \boldsymbol{\Lambda} (\boldsymbol{\mu} - \mathbf{b}) - \frac{n}{2} (\boldsymbol{\mu} - \mathbf{f})^T (\boldsymbol{\mu} - \mathbf{f}) + C \end{aligned} \]

This has derivative

\[ \frac{\partial l}{\partial \boldsymbol{\mu}} = - \boldsymbol{\Lambda} (\boldsymbol{\mu} - \boldsymbol{b}) - n(\boldsymbol{\mu} - \mathbf{f}) \]

Setting this to $\mathbf{0}$, rearranging for $\boldsymbol{\mu}$ and using the fact that

\[ \boldsymbol{\Lambda} \mathbf{v} = \begin{cases} \alpha^2 \mathbf{v} & \text{if } \mathbf{v} = \lambda\mathbf{b} \\ \\ \beta^2 \mathbf{v} & \text{if } \mathbf{v} \perp \mathbf{b} \end{cases} \]

we get

\[ \begin{aligned} \boldsymbol{\hat{\mu}} &= (\Lambda + nI)^{-1}(\Lambda \mathbf{b} + n\mathbf{f}) \\ \\ &= (\Lambda + nI)^{-1}(\alpha^2 \mathbf{b} + n\mathbf{f}) \\ \\ &= (\Lambda + nI)^{-1}(\alpha^2 \mathbf{b} + n(\mathbf{f} \cdot \mathbf{b})\mathbf{b} + n(\mathbf{f} - (\mathbf{f} \cdot \mathbf{b})\mathbf{b})) \\ \\ &= (\Lambda + nI)^{-1}((\alpha^2 + n \mathbf{f} \cdot \mathbf{b})\mathbf{b} + n(\mathbf{f} - (\mathbf{f} \cdot \mathbf{b})\mathbf{b}))\\ \\ &= \dfrac{(\alpha^2 + n \mathbf{f} \cdot \mathbf{b})}{n + \alpha^2} \mathbf{b} + \dfrac{n}{n+\beta^2} \bigg( \mathbf{f} - (\mathbf{f} \cdot \mathbf{b})\mathbf{b} \bigg) \end{aligned} \]

Plugging in the observed sample mean $\mathbf{f} = \frac{1}{n}\sum_i \mathbf{f_{i, r}}$ yields our estimate $\boldsymbol{\hat{\mu}}$.
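The last step of the derivation can be checked numerically. The sketch below builds the precision matrix $\boldsymbol{\Lambda}$ explicitly for a random unit vector $\mathbf{b}$ and compares the linear-algebra form of $\boldsymbol{\hat{\mu}}$ with the closed form; the dimensions and values are arbitrary choices for the check.

```python
import numpy as np

rng = np.random.default_rng(2)
n_channels, n, alpha_sq, beta_sq = 4, 7, 10.0, 50.0
b = rng.normal(size=n_channels)
b /= np.linalg.norm(b)                      # prior unit vector
f = rng.normal(size=n_channels)             # stands in for the sample mean

# Precision matrix with eigenvalue alpha^2 along b and beta^2 on its orthogonal complement.
precision = beta_sq * np.eye(n_channels) + (alpha_sq - beta_sq) * np.outer(b, b)

# mu_hat = (Lambda + n I)^{-1} (Lambda b + n f)
map_linear = np.linalg.solve(precision + n * np.eye(n_channels), precision @ b + n * f)

# Closed form from the final line of the derivation.
parallel = (alpha_sq + n * (f @ b)) / (n + alpha_sq) * b
perpendicular = n / (n + beta_sq) * (f - (f @ b) * b)
map_closed_form = parallel + perpendicular
```

The two vectors agree to floating-point precision, which confirms that inverting $\boldsymbol{\Lambda} + nI$ amounts to dividing the parallel part by $n + \alpha^2$ and the perpendicular part by $n + \beta^2$.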

4: Round and Channel Normalisation

The purpose of this step is to find a scale factor $V_{rc}$ for each round $r$ and channel $c$ which gets as many of our spots as close as possible to their target values. We will then multiply $V_{rc}$ by the free bled codes $E_{grc}$ to get the constrained bled codes $K_{grc}$.

What makes these codes constrained?

The codes $K_{grc}$ are constrained in the sense that the value of $K_{grc}$ is determined by several genes other than $g$.

These codes have nice global properties, like as many genes as possible being as close as possible to their target values, but will not be representative of the spots assigned to gene $g$. This is addressed in section 5, where we find a scale to get the spots as close as possible to these new constrained bled codes.

The target values work as follows:

$T$ is defined as target_values in the config file as a list of length $n_c$.
$T_c$ is the target value for channel $c$ in its representative dye $d_{\textrm{max}}(c)$,
$d_{\textrm{max}}$ is defined as d_max in the config file as a list of length $n_c$.
$d_{\textrm{max}}(c)$ is the dye we use to represent channel $c$, and we want to get its brightness in channel $c$, $B_{d_{\textrm{max}}(c), c}$ as close as possible to $T_c$.

Any gene that has dye $d_{\textrm{max}}(c)$ in round $r$ will have its free bled code $E_{grc}$ scaled by $V_{rc}$ to get as close as possible to $T_c$. Since $E_{grc}$ is a representative colour for all spots assigned to gene $g$, this will also get the spots as close as possible to their target values.

As in section 2 above, let $G_{rd}$ be the set of genes with dye $d$ in round $r$ and define the loss function

\[ L(V_{rc}) = \sum_{g \in G_{r, \ d_{max}(c)}} \sqrt{N_{g}} \ \bigg( V_{rc} \ E_{grc} - T_{c} \bigg)^2, \]

where $N_g$ is the number of high probability spots assigned to gene $g$. The square root weighting is a heuristic choice: weighting by $N_g$ directly would give too much influence to the most frequent genes. We minimise this loss to obtain the optimal value

\[ V_{rc} = \dfrac{ \sum_{g \in G_{r, \ d_{max}(c) }} \sqrt{N_g} E_{grc} T_{c} } { \sum_{g \in G_{r, \ d_{max}(c) }} \sqrt{N_g} E_{grc}^2 }. \]

Now define the constrained bled codes, which we will just call bled codes

\[ K_{grc} = E_{grc}V_{rc}. \]
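A sketch of the target scale and the resulting bled codes. The function name `target_scale`, the toy codes and the fallback of leaving $V_{rc} = 1$ when no gene has dye $d_{\textrm{max}}(c)$ in round $r$ are assumptions for the example.

```python
import numpy as np

def target_scale(free_codes, n_spots, dye_codes, d_max, target_values):
    """V[r, c] minimising sum_g sqrt(N_g) (V E_grc - T_c)^2 over genes with dye d_max[c] in round r."""
    _, n_rounds, n_channels = free_codes.shape
    scale = np.ones((n_rounds, n_channels))
    weights = np.sqrt(n_spots)
    for r in range(n_rounds):
        for c in range(n_channels):
            genes = dye_codes[:, r] == d_max[c]
            e = free_codes[genes, r, c]
            denominator = np.sum(weights[genes] * e**2)
            if denominator > 0:          # assumption: keep V = 1 if no gene contributes
                scale[r, c] = target_values[c] * np.sum(weights[genes] * e) / denominator
    return scale

# Two genes, two rounds, two channels; dye d is represented by channel d.
free_codes = np.array([[[2.0, 0.2], [0.1, 1.6]],
                       [[0.3, 0.4], [4.0, 0.5]]])   # E[g, r, c]
n_spots = np.array([100, 25])                       # N_g
dye_codes = np.array([[0, 1], [1, 0]])              # d(g, r)
d_max = [0, 1]
target_values = [1.0, 0.8]                          # T_c

scale = target_scale(free_codes, n_spots, dye_codes, d_max, target_values)
bled_codes = free_codes * scale[None]               # K_grc = E_grc V_rc
```

In this toy panel each round and channel has exactly one contributing gene, so the scaled code of that gene hits its target value exactly; with several genes the scale is the weighted least-squares compromise between them.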

5: Tile Normalisation

The purpose of this step is to remove brightness differences between images from different tiles in the same round and channel. We do this by finding a scale factor $Q_{trc}$ for each tile $t$, round $r$ and channel $c$ which gets as many of our spots on tile $t$ as close as possible to $K_{grc}$.

Our method works almost identically to step 4. Let $G_{rd}$ be the genes with dye $d$ in round $r$. Define the loss

\[ L(Q_{trc}) = \sum_{g \in G_{r, \ d_{max}(c)}} \sqrt{N_{g,t}} \ \bigg( Q_{trc} \ D_{gtrc} - K_{grc} \bigg)^2, \]

where $N_{g, t}$ is the number of high probability spots of gene $g$ in tile $t$.

If Q is correcting for tile differences, why does it have indices for $r$ and $c$?

The scale factor $Q_{trc}$ is defined to correct for differences in brightness between tiles, but the way that the brightness varies between tiles is completely independent for different round-channel pairs.

This is because the cause of brightness differences between tiles is largely random from image to image, as can be observed by looking at spots in the overlapping regions of adjacent tiles in the same round and channel.

We minimise this loss to obtain the optimal value

\[ Q_{trc} = \dfrac{ \sum_{g \in G_{r, \ d_{max}(c) }} \sqrt{N_{gt}} \ K_{grc} D_{gtrc}} { \sum_{g \in G_{r,\ d_{max}(c) }} \sqrt{N_{gt}} D_{gtrc}^2 }. \]
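The same weighted regression can be written in vectorised form for all tiles at once. In this sketch the name `tile_scale`, the array layouts and the fallback of $Q_{trc} = 1$ when no gene contributes are assumptions; the synthetic tile codes are built from a known scale so that the result can be checked.

```python
import numpy as np

def tile_scale(tile_codes, bled_codes, n_spots_tile, dye_codes, d_max):
    """Q[t, r, c] minimising sum_g sqrt(N_gt) (Q D_gtrc - K_grc)^2.

    tile_codes:   (n_genes, n_tiles, n_rounds, n_channels), D
    bled_codes:   (n_genes, n_rounds, n_channels), K
    n_spots_tile: (n_genes, n_tiles), N_gt
    """
    # relevant[g, r, c] is True when gene g has dye d_max[c] in round r.
    relevant = dye_codes[:, :, None] == np.asarray(d_max)[None, None, :]
    weights = np.sqrt(n_spots_tile)[:, :, None, None] * relevant[:, None, :, :]
    numerator = np.sum(weights * bled_codes[:, None] * tile_codes, axis=0)
    denominator = np.sum(weights * tile_codes**2, axis=0)
    # Assumption: leave the scale at 1 where no gene contributes.
    return np.divide(numerator, denominator,
                     out=np.ones_like(numerator), where=denominator > 0)

rng = np.random.default_rng(3)
dye_codes = np.array([[0, 1], [1, 0]])                # d(g, r) for 2 genes, 2 rounds
d_max = [0, 1]
bled_codes = rng.uniform(0.5, 1.5, size=(2, 2, 2))    # K[g, r, c]
q_true = rng.uniform(0.5, 2.0, size=(3, 2, 2))        # known scale for 3 tiles
tile_codes = bled_codes[:, None] / q_true[None]       # D = K / Q, so Q D = K exactly
n_spots_tile = rng.integers(1, 50, size=(2, 3))       # N_gt
q_est = tile_scale(tile_codes, bled_codes, n_spots_tile, dye_codes, d_max)
```

Because the synthetic tile codes satisfy $Q_{trc} D_{gtrc} = K_{grc}$ exactly, the regression recovers `q_true`.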

6 and 7: Application of Scales, Computation of Final Scores and Bleed Matrix

All that is left to do is multiply the spot colours $F_{src}$ by the updated normalisation factor $Q_{trc}$ to get the final spot colours: $F_{src} \mapsto Q_{trc} F_{src}$.

We then compute a score between each spot colour $\mathbf{F_s}$ and each gene bled code $K_{grc}$:

\[ \text{scores}(g) = \frac{1}{N_r}\Bigg|\sum_{rc}(\hat{F}_{src}\hat{K}_{grc})\Bigg| \]

where

\[ \hat{F}_{src} = \frac{F_{src}}{\sqrt{\sum_c|F_{src}|^2}}\text{,}\space\space\space \hat{K}_{grc} = \frac{K_{grc}}{\sqrt{\sum_c|K_{grc}|^2}}\text{,}\space\space\space N_r=\sum_r1 \]

The score rewards spots matching to the bled code in multiple rounds.

An intensity for each spot is saved to the notebook and used in the Viewer. It is computed from the final, scaled colours.

\[ \text{intensity}_s = \min_r(\max_c(|\mathbf{F}_{src}|)) \]

This intensity should have a threshold when looking at gene results as it removes poor gene reads caused by colour that is bright in only some of the rounds. From data, a value of 0.15 is reasonable. This is the default threshold for the Viewer.

What could cause brightness in some rounds but not others?

There are many possible explanations: 1) A registration mistake has caused a misalignment in some pixels. 2) An experiment error has failed to light up a gene in a specific round. 3) An experiment error has caused bright artifacts to appear in specific rounds and not others. So we must be robust against missing round brightness. This is especially true for OMP as this runs on every image pixel, which will include background noise.

Then use the best gene score for each spot's assigned gene:

dot_product_gene_no[s] = $\textrm{argmax}_g (\textrm{scores}(g))$
dot_product_gene_score[s] = $\textrm{max}_g (\textrm{scores}(g))$
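The dot product score and the resulting assignment can be sketched as follows. The function name `dot_product_scores` and the toy bled codes are illustrative; the variables `gene_no` and `gene_score` stand in for `dot_product_gene_no` and `dot_product_gene_score`.

```python
import numpy as np

def dot_product_scores(colours, bled_codes):
    """scores[s, g] = |sum_rc F_hat[s, r, c] * K_hat[g, r, c]| / n_rounds."""
    def round_normalise(x):
        # Divide each round by its L2 norm over channels, leaving all-zero rounds at zero.
        norms = np.linalg.norm(x, axis=-1, keepdims=True)
        return x / np.where(norms > 0, norms, 1.0)

    f_hat = round_normalise(colours)
    k_hat = round_normalise(bled_codes)
    n_rounds = colours.shape[1]
    return np.abs(np.einsum("src,grc->sg", f_hat, k_hat)) / n_rounds

bled_codes = np.array([
    [[1.0, 0.1], [0.1, 1.0], [1.0, 0.1]],   # gene 0
    [[0.1, 1.0], [1.0, 0.1], [0.1, 1.0]],   # gene 1
])
# One bright spot of gene 0 and one dim spot of gene 1.
colours = np.stack([3.0 * bled_codes[0], 0.2 * bled_codes[1]])
scores = dot_product_scores(colours, bled_codes)
gene_no = scores.argmax(axis=1)
gene_score = scores.max(axis=1)
```

Both spots score 1 against their own gene despite the very different brightness, since the score only compares round-normalised colours; this is why the separate intensity threshold is still needed.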

We also compute probabilities for each spot $s$ being assigned to gene $g$ as

gene_prob[s, g] = $\dfrac{\exp(\kappa \mathbf{K_{g} \cdot F_s})}{\sum_{g'} \exp(\kappa \mathbf{K_{g'} \cdot F_s})}$,

where $\mathbf{F_s}$ and $\mathbf{K_{g}}$ have both been round-normalised. Finally, with these updated gene assignments, we can compute the final bleed matrix $\mathbf{B}$ in the same way as in step 2.

Diagnostics

Diagnosing the quality of the gene assignments is a crucial part of the pipeline. We provide several diagnostics to help with this:

View Scaling And BG Removal

from coppafisher.plot.call_spots import ViewScalingAndBGRemoval

ViewScalingAndBGRemoval(nb)

(or simply press 'N' in the main results' viewer)

Viewing the background removal and scaling of a subset of isolated spots.

This shows a subset of 10,000 isolated spots in descending order of amount of background. The images on the top row are spot colours, each flattened into a single row and demarcated into channels by the red vertical lines. The plots on the bottom row show the intensity of a bright spot in each round channel.

This plot shows us a few things:

Certain channels have much higher background than others. The final column is a good check that the background has been removed.
Different channels have different baseline brightnesses. Check that the brightnesses in the middle and final column are to your liking and similar to the target values.
The final brightnesses are not all the same: this is because we imposed channel-specific target values in step 4. This is a good check that the target scaling is working as expected.

View Bleed Matrix

import matplotlib.pyplot as plt
from coppafisher.plot.call_spots import ViewBleedMatrix

ViewBleedMatrix(nb.basic_info, nb.call_spots)
plt.show()

(or simply press 'B' in the main results' viewer)

Viewing the bleed matrix.

This viewer shows 3 bleed matrices, each with columns (dyes) normalised.

The first is the raw bleed matrix $\mathbf{B_{raw}}$ which is the initial estimate of the bleed matrix, used for the very first gene assignments in step 1.
The second is the initial bleed matrix $\tilde{\mathbf{B}}$ made from an SVD of high probability spots. This is scaled according to the initial scale factor $\tilde{A}_{trc}$ introduced in step 0. This is why channel 10 is so much brighter than its target value.
The final bleed matrix $\mathbf{B}$ is the bleed matrix estimated from the final gene assignments, on high probability spots. This is scaled according to the final scale factor $A_{trc} = Q_{trc}\tilde{A}_{trc}$. You should be able to see the values are roughly in the same ratios as the target values.

View Free And Constrained Bled Codes

from coppafisher.plot.call_spots import ViewFreeAndConstrainedBledCodes

ViewFreeAndConstrainedBledCodes(nb)

This will pull up a viewer that shows you the free bled codes $E_{grc}$ and the constrained bled codes $K_{grc}$ from the most influential genes for a given round and channel. This is a good way to check if the target scale $V_{rc}$ is working as expected.

To view different rounds and channels, simply scroll.

Example panels: R0C5 and R2C15.

If this works as expected, the constrained bled codes should have values close to their target values and the constrained bled codes should be more homogeneous than the free bled codes. This can be seen in the first image, where R0C5 is initially very bright, but after scaling is much closer to the brightnesses of the other rounds and channels.

View Target Regression

from coppafisher.plot.call_spots import ViewTargetRegression

ViewTargetRegression(nb)

This viewer is similar to the previous one in that it is showing how well the target scaling is working. It does this in a bit more detail, but is a little confusing!

To view different rounds and channels, simply scroll.

Example panels: R0C27 and R4C5.

In the plots above:

Each dot is a gene which has dye $d_{\textrm{max}}(c)$ in round $r$.
The size of the dot is proportional to the number of spots assigned to that gene.
The x-axis values are completely random.
In the leftmost column, the y-axis values are the brightnesses $E_{grc}$ of the genes after the initial scaling $\tilde{A}_{trc}$ but before the target scaling $V_{rc}$.
In the middle column, the y-axis values are the brightnesses $K_{grc}$ of the genes after the target scaling $V_{rc}$.

It is important to check how well each round and channel is being scaled to its target value. R0C27 is pretty good, with most of the genes being pretty concentrated at the target value. R4C5 is much noisier, with many genes consistently too bright or too dim.

View Tile Scale Regression

from coppafisher.plot.call_spots import ViewTileScaleRegression

ViewTileScaleRegression(nb, t)

This function looks at a fixed tile and then shows the regression for the tile scale factor $Q_{trc}$ for each round and channel. Recall that this is the scale factor that multiplies the tile-dependent free bled codes $D_{gtrc}$ to get the constrained bled codes $K_{grc}$.

As in the previous diagnostic, each dot is a gene with dye $d_{\textrm{max}}(c)$ in round $r$ and the size of the dot is proportional to the number of spots assigned to that gene. Unlike the previous diagnostic, in this plot the x-values are not random, but are the brightnesses $D_{gtrc}$ of the genes (averaged from spots which have been multiplied by initial scale $\tilde{A}_{trc}$). The y-values are the brightnesses $K_{grc}$.

A couple of things to note:

Different slopes within the same column (channel) indicate that this is picking up on differences in round brightnesses for this channel on this tile.
If these regressions have a low $R^2$ value, the tile scaling is not working well. This may be a sign of a blank tile or poor registration.


View Scale Factors

from coppafisher.plot.call_spots import ViewScaleFactors

ViewScaleFactors(nb)

This simple viewer shows the target scale $V_{rc}$, the tile scale $Q_{trc}$ and the relative scale $Q_{trc}/V_{rc}$ for each round and channel. What to expect:
- The tile scale $Q_{trc}$ should be close to $V_{rc}$ for each tile $t$. This is because $D_{gtrc}Q_{trc} \approx K_{grc} = E_{grc}V_{rc}$. So if $E_{grc} \approx D_{gtrc}$, then $Q_{trc} \approx V_{rc}$.
- The relative scale measures how much we deviate from the case where $E_{grc} = D_{gtrc}$, and should not have a huge amount of variance. In the plot above the highest value is 0.5 and the lowest is 0.35, which is a good range.

Gene Efficiency Viewer

from coppafisher.plot.call_spots import ViewGeneEfficiencies

ViewGeneEfficiencies(nb, score_threshold=gamma, mode=gene_assignment_mode)

or press `e` in the Viewer.

Each row represents a gene and each column a round. The colour of each cell is the amount of weight that gene $g$ has in round $r$. The ideal case would be homogeneous colours across the rows, indicating that each gene is equally bright in each round, but this is never the case. Look out for:
- Genes with an abnormal number of spots assigned to them. This is often the case for poor quality genes which look a lot like background. If this is the case, the gene probability threshold `gene_prob_thresh` in the config file should be increased.
- Genes with very high or low gene efficiencies. This should not happen if `concentration_parameter_parallel` is sufficiently high, as it should typically need at least 10 spots to scale each dye. If the gene efficiencies are incorrect, OMP will struggle to find the correct gene assignments.

Gene Spots Viewer

from coppafisher.plot.call_spots import GeneSpotsViewer

GeneSpotsViewer(nb, score_threshold=gamma, gene_index=g, mode=gene_assignment_mode)

(or simply click one of the genes in the gene efficiency viewer)

This viewer shows the spots assigned to a particular gene above a certain threshold, under a certain gene assignment mode (either 'anchor', 'prob' or 'omp'). This is the viewer I use the most. It helps me find abnormalities that would be hard to spot otherwise, like the persistent unexpected channel 27 signal in round 0 in the images above. If a particular gene is under or over expressed, this viewer will typically tell us why. It also gives us a very nice representation of which dyes, rounds and channels are clean and which are noisy.