Introduction

In studying the brain we are often confronted with phenomena that involve specific subsets of neurons distributed across many brain regions. Computations, for example, are performed by neuronal networks connecting cells in different parts of the brain and, in development, neurons in different anatomical regions of the brain share the same lineage. An example of each of these types will be considered here, but the challenge is very general: how to measure and analyze multi-region neuronal data with cellular resolution.

In a typical cell-count experiment, gene expression is used to tag the specific cells of interest with a targeted indicator (1). The brain is sliced, an entire stack of brain sections from a single animal is imaged, aligning and registering the images to a standardised brain atlas such as the Allen mouse atlas (25), segmenting the images into anatomical regions and counting the labeled cells in each region. The resulting dataset consists of labeled cell counts across each of ∼10–100 brain regions. This technology is being deployed to address questions in a broad range of neuroscience subfields, for example: memory (68), neurodegenerative disorders (9), social behavior (10), and stress (11).

Cell-counts are often compared across groups of animals which differ by an experimental condition such as drug treatment, genotype or behavioural manipulation. However, the expense and difficulty of the experiment means that the number of animals in each group is often small, ten is a typical number but fewer than ten is not uncommon. This means that these data are undersampled: the dimensionality of the data, which corresponds to the number of brain regions, is much larger than the number of samples, which usually corresponds to the number of animals (Figure 1A). Current statistical methods are not well-suited to these nested wide-but-shallow datasets. Furthermore, due to the complicated preparation and imaging procedure, there is often missing data and there is variability derived from experimental artifacts.

Introduction.

A: Each of N animals produce a cell-count from a total of R brain regions of interest. Cell-count data is typically undersampled with NR. Scientists analyse the brain sections from experiment for positive signals. Here an example section is shown where teal points mark cells expressing the immediate early gene c-Fos. The final cell-count is equal to the sum of these individual items (sagittal brain map taken from the Allen mouse brain atlas: https://mouse.brain-map.org. B: Partial pooling is a hierarchical structure that jointly models observations from some shared population distribution. It is a spectrum that depends on the value of the population variance τ. When τ = 0 there is no variation in the population and each individual observation is modeled as a conditionally independent estimate of some fixed population mean θ (complete pooling). As τ tends to infinity observations do not combine inferential strength but inform an independent estimate γi (no pooling). In between the two extremes combine. Each observation can contribute to the population estimate while simultaneously supporting a local one to effectively model the variance in the data. Observed quantities have been highlighted with a thicker stroke in the graphical model. C: An example of partial pooling on simulated count data. As the population standard deviation increases on the x-axis the individual estimates exp(γi) trace a path from a completely pooled estimate to an unpooled estimate. Circular points give the raw data values. Parameters are exponentiated because the outcomes are Poisson and so parameters are fit on the log scale.

In cell count data there are two obvious sources of noise. The first of these is easy to describe, if a region has an rate that determines how likely a cell is to be marked for counting, then the actual number of marked cells is sampled from a Poisson distribution. The second source of noise is the animal to animal variability of the rate itself and this depends on diverse features of the individual animal and the experiment that are often unrelated to the phenomenon of interest. The challenge is to control for outliers and ‘poor’ data points whose rate is noisy, while extracting as much information as possible about the underlying process. Dealing with outliers is often an opaque and ad hoc procedure, it is also a binary decision, a point is either excluded so it does not contribute, or included, noise and all. This is where partial pooling helps. Partial pooling allows for the simultaneous estimation of individual parameters with the population distribution that describes them. This helps the data to self-regularise and elegantly balances the contribution of informative and weak observations to parameter values.

In recent years Bayesian approaches to data analysis have become powerful alternatives to classical frequentist approaches (1214), including for some types of neuroscience data: such as neurolinguistics (15), neural coding (16, 17), synaptic parameters (1821) and neuronal-circuit connectivity (22, 23). A Bayesian approach is particularly well suited to cell-count data but has not previously been applied to this problem.

A Bayesian approach formalizes the process of scientific inference, it distinguishes the data and a probabilistic mathematical model of the data. This model has a likelihood which gives the probability of the observed data for a given set of model parameters. The model often has a hierarchical structure which we compose to reflect the structure of the experiment and the investigators hypothesis of how the data depends on experimental condition. This hierarchy determines a set of a priori probabilities for the parameter values. The result of Bayesian inference is a probability distribution for these model parameters given the data, termed the posterior.

There are three advantages of a Bayesian approach that we want to emphasis, (1) while traditional multi-level models also allow a hierarchy (24), Bayesian models are more flexible and the role of the model is clearer, (2) since the result of Bayesian inference is a probability distribution over model parameters, it indicates not just the fitted value of a parameter but the uncertainty of the parameter value. Finally (3) Bayesian models tend to make more efficient use of data and therefore improving statistical power.

Here, our aim is to introduce a ‘standard’ Bayesian model for cell count data. We illustrate the application of this model to two datasets, one related to neural activation and the other to developmental lineage. In both cases the Bayesian model produces clearer results than the classical frequentist approach.

Materials and Methods

Data

To illustrate our approach we consider two example applications, one which counts cells active in regions of the recognition memory circuit of rat during a familiarity discrimination task, and the other which examines the distribution of a specific interneuron type in the mouse thalamus.

Case study one – transient neural activity in the recognition memory circuit

The recognition memory network, Figure 3, is a distributed network which has been well studied using a variety of behavioural tasks. It includes the hippocampus (HPC) and perirhinal cortex (PRH), shown to deal with object spatial recognition and familiarity discrimination respectively (2527); medial prefrontal cortex (MPFC), concerned with executive functions such as decision making but also with working memory, and the temporal association cortex (TE2) used for acquisition and retrieval of long-term object-recognition memories (28). The nucleus reuniens (NRe) is also believed to be an important component of the circuit (25, 29), owing to its reciprocal connectivity with both MPFC and HPC (30). In previous studies, lesions of the NRe have been shown to significantly impair long-term but not short-term object-in-place recognition memory (29). The data analysed in this case study were collected to investigate the role of the NRe in the recognition memory circuit through contrasting the neural activation for animals with a lesion in the NRe with neural activation for animals with a sham surgery. The immediate early gene c-Fos is rapidly expressed following strong neural activation and is useful as a marker of transient neural activity. Animals in the experiment performed a familiarity discrimination task (single-item recognition memory), discriminating novel or familiar objects with or without an NRe lesion and the number of cells that expressed c-Fos was counted in regions across the recognition memory circuit. The two-by-two experimental design allocated ten animals to each of the four experiment groups {sham, lesion} × {novel, familiar} and cell counts were recorded from a total of 23 brain regions. The visual cortex V2C and the motor cortex M2C were taken as control regions as they were not expected to show differential c-Fos expression in response to novel or familiar objects (31).

Case study two – Ontogeny of inhibitory neurons in mouse thalamus and hypothalamus

The second dataset comes from a study, (32), that, in part, counted the number of inhibitory interneurons in the thalamocortical regions of mouse. Sox14 is a gene associated with inhibitory neurons in subcortical areas. It is required for the development and migration of local inhibitory interneurons in the dorsal lateral geniculate nucleus (LGd) of the thalamus (32, 33). Consequently, Sox14 is useful for identifying discrete neuronal populations in the thalamus and hypothalamus (or in the diencephalon) (3234). A heterozygous (HET) knock-in mouse line Sox14GFP/+ to mark Sox14 expressing neurons with green fluorescent protein (GFP), was compared to a homozygous knock-out (KO) mouse line Sox14GFP/GFP engineered to block expression of the endogenous Sox14 coding sequence (35). Each animal produced two samples, one for each hemisphere giving a total of ten data points, six belonging to HET (three animals), and four to KO (two animals). Each observation is 50 dimensional, corresponding to fifty individual brain regions in each hemisphere.

Hierarchical modeling

Our goal in both cases is to quantify group differences in the data. We start with a ‘standard’ hierarchical model; this model can be further refined or changed to improve the analysis and to more clearly match the structure of a particular experiment, we will give an example of this for our second dataset. At the bottom of the model are the data themselves, the cell counts yi. The index i runs over the full set of samples, in this case 23 × 10 × 4 datapoints in the first study, and 50 × 6 + 50 × 4 in the second. The basic assumption the model makes is that this count is derived from an underlying propensity, λi > 0, which depends on brain region and, potentially, group:

Hence, in this case, the propensity λi is the mean of the Poisson distribution, and the statistical model describes the dependence of this parameter on brain region and animal. Since λi is strictly positive a log-link function is used:

where

and we have used ‘array notation’ (36), mapping the sample index i to properties of the sample, so r[i] returns the region index of observation i, and similar for g[i] but for groups and animals.

We use partial pooling. Thus, ignoring Ei for now, the rate has been split between two terms, θr[i],g[i] is the fixed effect which is constant across animals and γr[i],a[i] is the random effect which captures the animal to animal variability. While θrg models the mean for log-cell-count for each region, given the condition; γi models variation around this mean. For this reason, γi is assumed to follow a normal distribution with zero mean. The regression term may appear over-parameterized, without θrg the γi could ‘do the work’ of matching the data. However, the model is regularized by a prior; observations with a weak likelihood will have their random effect γi shrunk towards the population location. The amount of regularization depends on the variation in the population, a quantity that is estimated from each likelihood. This is how partial pooling works as an adaptive prior for ‘similar’ parameters (Figure 1B). The data ‘pools’ some evidence while still allowing for individual differences in sample.

The final term is the exposure Ei. Cell counts may be recorded from sections with different areas. The exposure term scales the parameters in the linear model as the recording area increases (13). In our model, the exposure is equal to the logarithm of the recording area; this value is available as part of the experimental data.

The set of parameters τr,g models the population standard deviations of the noise for each region r and animal group g.

Parameter table for the hierarchical model.

When working on the log scale, priors for these parameters are typically derived in terms of multiplicative increases. Since the parameters are positive they are assigned a half-normal distribution

with an appropriately chosen scale s > 1. For our analyses we used s = 1.05 because this gives a HalfNormal distribution with 95% of its density in the interval [0, log(1.1)]. This translates into an approximate 10% variation around exp(θ) at the upper end, which is a moderately informative prior, reflecting our belief that within-group animal variability is small relative to between-group variability. This regularization also helps model inference when the datasets are undersampled.

Horseshoe prior

Cell count data often has outliers, for example due to experimental artifacts. Since by default the likelihood does not account for these outliers, they may cause substantial changes in fitted parameter values. This is demonstrated in Figure 2, where a careless application of the Poisson distribution on data with several zero counts has a large influence on the posterior distribution. There are two general options for dealing with outliers: either modeling them in the likelihood or in the prior. Although the likelihood option is preferred as it is more direct, see our zero-inflation model below, it can be hard to design because it requires knowledge of the outlier generation process. The alternative is via a flexible prior such as the horseshoe (37, 38). This more generic option may be suitable as a default approach if the outliers are poorly understood.

Methods.

A table of partial pooling behaviour for different likelihood and prior combinations. Rows are the two prior choices for the population distribution, and columns the two distributions for the data. Within each cell the expectation of the marginal posterior p(exp(γi)|θ, τ, y) is plotted as a function of τ. The solid black line is the expectation of the marginal posterior p(θ|τ, y) with one standard deviation highlighted in grey. Top left: Combining a normal prior for the population with a Poisson likelihood is unsatisfactory in the presence of a zero observation. The zero observations influence the population mean in an extreme way owing to their high importance under the Poisson likelihood. Bottom left: By changing to a horseshoe prior the problematic zero observations can escape the regularisation machinery. However, regularisation of the estimates with positive observations is much less impactful. Top right: A zero-inflated Poisson likelihood accounts for the zero observations with an added process, reducing the burden on the population estimate to compromise between extreme values. Bottom right: No model.

Recognition memory circuit.

Schematic of the recognition memory network adapted from (31). Bold arrows show the assumed two-way connection between the medial prefrontal cortex and the hippocampus facilitated by the NRe. Colours highlight the HPC (red), MPC (blue) and specific areas of the rhinal cortex (yellow). The NRe was lesioned in the experiment.

The horseshoe prior is a hierarchical prior for sparsity. It introduces an auxiliary parameter κi that multiplies the population scale τ. This construction allows surprising observations far from the bulk of the population density to escape regularisation.

An example of this is given in Figure 2 as the bottom left cell of the 2 × 2 table of models. The horseshoe prior often uses a Cauchy distribution but in our case the heavy tail causes problems for the sampling algorithm, see SI section 8.

Zero inflation

A particular trait of the second dataset is that there are a large number of zero data points (∼6%). Although a zero observation is always possible for a Poisson distribution, for plausible values of the propensity, zeros should be rare. It is likely that for some regions the experiment has not worked as expected and the zeros show that something has ‘gone wrong’ and that the readings are not well described by a Poisson distribution. Here we extend the model to include this possibility. This is a useful elaboration of the ‘standard model’, in the standard model the horseshoe prior ensures that these anomalous readings only have a small effect on the result but it is more informative to extend the model to include them. While this particular extension is specific to these data, it also serves as an example of how a standard Bayesian model can serve as a starting for point for an investigation of the data.

The zero-inflated Poisson model is intended to model a situation were there are zeros unrelated to the Poisson distribution, in this case this might, for example, be the result of an error in the automated registration process that identifies regions and counts their cells. It is a mixture model, if

there is a probability π that yi = 0 and a 1−π probability that yi follows a Poisson distribution with rate λi. Importantly this means there are two ways in which yi can be zero, through the Bernoulli process parameterized by π, or through the Poisson distribution. This has the effect of ‘inflating’ the probability mass at zero with the additional parameter π giving the proportion of extra zeros in the data that could not be explained by the standard Poisson distribution. This distribution can be visualized in Figure 2 and further mathematical details are described in SI 2C.

Model inference

Posterior inference was performed with the probabilistic programming language Stan (39), using its custom implementation of the No-U-Turn (NUTS) sampler (40, 41). For each model, the posterior was sampled using four chains for 8000 iterations, with half of these being attributed to the warm-up phase; giving a total of 16000 samples from the posterior distribution.

Results

We describe differences in estimated counts between groups in terms of log2-fold changes. Fold changes are useful because they prevent differences that are small in absolute magnitude from being masked by regions with high overall expression. Our results compare Bayesian highest density intervals (HDI) with the confidence interval (CI) from an uncorrected Welch’s t-test. The Bayesian HDI is calculated from the posterior distribution and is the smallest width interval that includes a chosen probability, here 0.95 (to correspond to α = 0.05), and summarizes the meaningful uncertainty over a parameter of interest.

Case study one – transient neural activity in the recognition memory circuit

Results for the first dataset are presented in Figure 4 with Figure 4A plotting cell-count differences between the novel and familiar conditions without lesion and Figure 4B with lesion. These data were collected to investigate the role of different hippocampal and adjacent cortical regions in memory. However, some regions of interest, such as the intermediate dendate gyrus (IDG), and the dorsal subiculum (DSUB) look under-powered: in the sham animals, for both regions there is a markedly non-zero difference in expression between the novel and familiar conditions but a wide confidence interval overlapping zero makes the evidence unreliable (orange bars Figure 4A).

Results—Case study 1.

log2-fold differences for each surgery type. The 95% Bayesian HDI is given in green, and the 95% confidence interval calculated from a Welch’s t-test in orange. Horizontal lines within the intervals mark the posterior mean of the Bayesian results, and the raw data means in the t-test case. The x-axis is ordered in terms of decreasing p-value from the significance test and ticks have been color-paired with the nodes in the recognition memory circuit diagram Figure 3. Black ticks are not present in the circuit because they are the control regions in the experiment. A: log2-fold differences between sham-familiar (SF) and sham-novel (SN) groups. B: log2-fold differences between lesion-familiar (LF) and lesion-novel (LN) groups.

In contrast, the Bayesian estimates (green bars Figure 4) produce a clear result. For a number of brain regions in Figure 4A, sham-novel animals have higher expression than sham-familiar ones. These differences disappear in Figure 4B with lesion-novel and lesion-familiar animals showing roughly equal cell counts. This indicates that the difference is only present when the NRe is intact.

Case study two – Ontogeny of inhibitory interneurons of the mouse thalamus

The estimated log2-fold difference in GFP expressing cells between the two genotypes, for each of the fifty brain regions, are plotted in Figure 5. This include the purple and pink 95% HDI from the horseshoe and zero-inflated Poisson models along with the 95% confidence interval arising from a t-test in orange.

Results—Case study 2.

log2 fold differences in GFP positive cells between mouse genotypes, heterozygous (HET) and knockout (KO), for each of the fifty recorded brain regions spread across two rows. The 95% Bayesian HDI is given in purple and pink for the Bayesian horseshoe and zero-inflated model. The 95% confidence interval calculated from a Welch’s t-test in orange. Horizontal lines within the intervals mark the posterior mean of the Bayesian results and the data estimate for the t-test. The x-axis is ordered in terms of decreasing p-value from the significance test.

Zero count observations.

On the left under data: boxplots with medians and interquartile ranges for the raw data for two example brain regions. The shape of each point pairs left and right hemisphere readings in each of the five animals. For both example regions there is a heterozygous animal with zero readings in both hemispheres. On the right under inference: HDIs and confidence intervals are plotted. The Bayesian estimates are not strongly influenced by the zero-valued observations and have means close to the data median. This explains the advantage of the Bayesian results over the confidence interval.

Diagnostics – Poisson.

Standard Poisson model, case study – 1.

Diagnostics – Horseshoe.

Horseshoe model, case study – 2.

Diagnostics – ZIPoisson.

Zero-inflated Poisson, case study – 2.

PPC – Poisson.

Posterior predictive check for the standard Poisson model in case study – 1. A: The proportion of zeroes in the data matches the proportion of zeroes in posterior predictive samples. This proportion is zero. B: The distribution of standard deviations computed over a number of posterior predictive datasets (histogram), aligns with the standard deviation of the data.

PPC – Horseshoe.

Horseshoe model, case study – 2. Posterior predictive check for the standard horseshoe model in case study – 2. A: The proportion of zeroes in the data is larger than those found in posterior predictive datasets. This makes sense, because the likelihood is still a Poisson distribution. B: The distribution of standard deviations computed over a number of posterior predictive datasets (histogram), aligns with the standard deviation of the data.

PPC – ZIPoisson.

Zero-inflated Poisson, case study – 2. A: The proportion of zeroes in the data matches the proportion of zeroes in posterior predictive samples. B: The distribution of standard deviations computed over a number of posterior predictive datasets (histogram), aligns with the standard deviation of the data.

The use of the zero-inflated Poisson distribution uncovers some interesting differences when compared to the t-test confidence intervals. One example of this is the result for tuberomammillary nucleus, dorsal part (TMd). For this region the HET animals have high GFP expression across both hemispheres, yet animal three has a reading of zero for both hemispheres. This injects variability into the standard deviation of the HET group. Consequently, the pooled standard deviation used in the t-test is large and almost certainly guarantees a non-significant result. Furthermore, the sample mean of this region looks nothing like zero, but also nothing like the other two animals with positive counts. In addition to the wide interval, the sign of the difference does not agree with the data. Similarly, the medial preoptic nucleus (MPN) also suffers from poor estimation. Once again, this region contains a single HET animal for which the reading from both hemispheres is zero. The zero-inflated Poisson produces a posterior distribution of the appropriate sign with small uncertainty.

Despite the large difference in interval estimation between the HDI and CI for many brain regions, as the data becomes stronger from the perspective of the frequentist p-value towards the right-hand side of the second row in Figure 5, they become much more compatible. The variation within groups is very small for these regions; further regularization is not necessary and so the impact of partial pooling has been reduced. The sample estimate of the t-test has ‘caught up’ to the regularized estimate because the signal is strong.

Discussion

We have presented a standard workflow for Bayesian analysis of multi-region cell-count data. We propose a likelihood and appropriate priors with a nested hierarchical structure reflecting the structure of the experiment. We applied this to two distinct example datasets and demonstrate that they capture more fruitfully the characteristics of the data when compared to field-standard frequentist analyses.

For both case studies, the Bayesian uncertainty intervals are more precise than the confidence intervals. These confidence intervals tend to be quite wide on these data because of the small sample size and because of violations to their parametric model assumptions. Our workflow uses a horseshoe prior, along with the partial pooling, this allows our model to deal effectively with outliers. Furthermore, for the data sizes presented here, a full Bayesian inference using Stan does not require long computation time, or even particularly high performance hardware. Modern multi-core laptop processors are quite sufficient for this task. Fitting a model typically takes less than an hour.

The workflow we have exhibited here is intended as a standard approach. We believe that, without extension, it will provide a robust model for cell-count data. However, we also suggest that the standard workflow can be a useful first step for a more comprehensive, extended, model when one is required. We have given an example of this for the second dataset where the anomalous zeros prompted us to change the likelihood to a zero-inflated Poisson. There are other possibilities, for example, zero inflation is not the only way to handle an anomaly in the number of zeros: the hurdle model is an alternative, (42). This is not a mixture model, instead it restricts the probability of zeros to some value π with the probabilities for the positive counts coming from a truncated Poisson distribution. The hurdle mode can deflate as well as inflate the probability mass at zero. This did not match the situation in the data we considered but might for other datasets. Another extension might involve tighter priors based on previous experiments. This is likely to be very relevant for cell-count data since these experiments are rarely performed in isolation and so prior information can be leveraged from a history of empirical results.

One obvious elaboration of our model would replace normal distributions with multivariate normal distributions. This would have two advantages. Firstly, correlations are difficult to estimate for under-sampled data. Including correlation matrix priors provides extra information, for example, based on anatomical connectivity, that can aid the statistical estimation of other parameters. Secondly, it would more closely match our understanding of the experiment: we know that activity is likely to be correlated across regions and so it is apposite to include that directly in the model. Unfortunately the problem of finding a suitable prior for the correlation proved insurmountable: the standard Lewandowski-Kurowicka-Joe distribution (43) which has been useful in lower-dimensional situations is too regularising here. This is an area where further work needs to be done.

It is important to highlight that a mixed effects model is not a uniquely Bayesian construction. Indeed, any model that tries to include more sophistication through hierarchical structures, Bayesian or otherwise is useful. However, non-Bayesian models can be complicated and opaque, they are also often more restrictive, for example they often assume normal distributions, and circumventing these restrictions can make the models even less transparent. A Bayesian approach is, at first, unfamiliar; this can make it seem more obscure than better established methods, but, in the long run, Bayesian models are typically clearer and do not involve so many different assumptions and so many fine adjustments.

Code and data availability

The code necessary to run the models presented in this manuscript can be found at our Github respository: https://github.com/SydneyJake/Hierarchical Bayesian Cell Counts. The data for case study one on nucleus reunion lesion are available from https://doi.org/10.5281/zenodo.12787211 (44). The data from case study two on Sox14 expressing neurons are available from https://doi.org/10.5281/zenodo.12787287 (45).

Acknowledgements

We are grateful to Andrew Dowsey and Matthew Nolan for useful discussion and helpful suggestions. We would like to acknowledge funding from the EPSRC (EP/R513179/1 Doctoral Training Partnership PhD studentship to SD and EP/W024020/1 to SS), BBSRC (BB/L02134X/1 to ECW and BB/R007020/1 to AD), Wellcome Trust (206401/Z/17/Z to ECW), Leverhulme Trust (RF-2021-533 to CJH), and MRC (MR/S026630/1 to COD)

Appendix

1. Full model

Here we present the complete mathematical model for each of of the three models applied in the main text. In all cases, the exposure term is only necessary for the first case study. Area was not available for the second so the exposure term was left out.

A. Poisson model

B. Horseshoe model

C. Zero-inflated model

2. Distributions

A. Zero inflated Poisson distribution

The zero inflated Poisson distribution is a mixture distribution with mixing parameter π. The distribution is formally defined below. If,

then

where

Equivalently, the mixture can be specified with an indicator function.

B. Half Normal distribution

The HalfNormal distribution coincides with a zero-mean normal distribution truncated at zero. It has a single scale parameter σ. If,

then

is the probability density function.

3. Additional methods

A. Fold differences

In our results we present differences between experimental groups in terms of log2-fold differences. We calculate this as follows. The parameter of interest θr,g=i is modeled on the natural log scale owing to the log-link function necessary for the Poisson regression. At the average animal the difference,

is the natural log of the ratio of the expected counts. To obtain log2-fold differences we simply change the base by multiplying log Δij by log2(e).

B. Data transformations

To facilitate a comparison with the Bayesian intervals in terms of log2-fold differences, it was necessary to add one to any zero counts before applying the t-test.

C. Non-centered parameterization

Hierarchical models can produce geometry that is difficult for the sampler to explore. Fortunately, there exists a simple reparameterisation known as non-centering that can remedy this problem. In our model, instead of sampling γi directly, we sample the parameter instead, and use it to reconstruct γi. That is, sample from a standard normal distribution,

and reconstruct γi as a deterministic function of sampled values of and τr[i],g[i].

This removes the frustrating joint behaviour between of (γi, τr[i],g[i]), and promotes efficient sampling.

D. Preprocessing – case study 1

In these data some animals produced more than on reading per brain region. Before fitting our model these were summed together to produce a single count; the exposure term was also properly adjusted to correctly reflect the area of the recording site.

4. Software, packages and libraries

  • R version 4.2.1–“Funny-looking-kid”.

  • Computation was performed locally on a Dell XPS 13 7390 laptop (Intel i7-10510U @ 1.80 GHz, 16 GB of RAM; Ubuntu 20.04.4 LTS)

  • Panels composed using inkscape version 1.2.2.

5. Glossary

Here we include a glossary of terms for the brain regions in each case study.

6. Sampler diagnostics

The basic Poisson model (Equation 1) was sampled excellently. Measures of sampling performance such as (46, 47), and effective sample size, were all satisfactory (SI figure 1). Similarly, for the zero-inflated model (Equation 7), no problems were observed for any of the diagnostics (SI figure 2). Contrasting with this, the horseshoe model (Equation 6), exhibited some signs of fitting problems (SI figure 3). Divergences were not observed, and given the longer chain length this is reassuring evidence against biased computation. However for many parameters the effective sample size is much lower than we would like to see. This is reflected in the trace plots: the sampler is not making large jumps across the parameter space implying high auto-correlation and low effective sample size. Unfortunately, the horseshoe is notoriously hard to fit and we resort to brute-force methods such as increasing the number of iteration and reducing the step size of the sampler to improve the inference. In the following three plots diagnostics have been summarised with the following three items:

  • A: The performance of the sampler is illustrated by plotting (R-hat, ideal) against the ratio of the effective number of samples (larger is better) for each parameter in the model. Points represent individual parameters in the model, and have further been colour coded by their type. For example, all θr,g are coloured in green. Points have also been scaled based on how numerous the parameters are so the more numerous parameters have smaller dots, the less numerous, larger.

  • B: A histogram comparing the marginal energy distribution πE, and the transitional energy distribution πΔE of the Hamiltonian. Ideally, these distributions should match each other closely if the posterior distribution has been properly explored by the sampler.

  • C: For each parameter type the parameter with the ‘poorest’ mixing (largest ) are presented with a post-warmup trace plot that overlays the ordered sequence of samples from each of the four chains. Corresponding points in A are marked with a black border and zero transparency.

7. Posterior predictive checking

Posterior predictive checks use the posterior predictive distribution,

where p(θ|y) is the posterior distribution, and p(yrep |θ) is the data distribution for yrep that follows the same form as the likelihood for y. If we can verify that the posterior predictive distribution can generate replicate datasets with similar statistics to the observed data, then we might conclude that our model is consistent with the observed data and useful for answering questions about it. In practice a Monte-carlo approach is used to approximate statistics of the posterior predictive distribution. For example, if T is the test statistic of interest such as the sample mean, then

  1. For 1, …, S.

    1. sample θsp(θ|y)

    2. sample

    3. calculate , where T is the statistic of interest

  2. return

The posterior predictive checks that follow examine two important statistics of count data. 1) The standard deviation of the data (dispersion), as figure a. 2) The proportion of zeroes in the data (zero-inflation), as figure B. The value of the test statistic applied to the observed data T (y) is plotted as a solid purple line, and the distribution of test statistics as a purple histogram.

8. Horseshoe densities

In our model a horseshoe prior was used to allow some γi, typically those informed by yi = 0, to escape regularisation by partial pooling. However, we encountered many problems with the default parameterisation that assigns a HalfCauchy density to the individual inflation parameters κi. In Figure 13A, the proportional conditional posterior density

where θ and τ have been fixed, is plotted when yi lies close to the population mean (left), or when it equals zero (right). Note that the x-axis is and not γi because the model samples a non-centered parameterisation. That is,

captures the deviations of γi around the population mean θ.

Horseshoe densities.

A: Conditional posterior. B: MCMC pair plots. Divergent samples are coloured in pink, non-divergent in blue.

Modified horseshoe densities.

A: The conditional posterior when y = 0 (left) and y ≠ 0 (right). B: MCMC pair plots of samples from the marginal posterior density .

Figure 13B plots samples from the marginal posterior , when fitting to the data y = {770, 820, 713, 541, 0, 0}. Each data point corresponds to one of the six plots in Figure 13B. This small example dataset is in fact the cell-counts for region TMd recorded from the heterozygote group in case study 2. The similarities in geometry can be readily seen with Figure 13A. A large number of divergences were produced by the sampler as demonstrated by the number of pink points compared to the non-divergent transitions in blue. For fixed τ the values can take increase or decrease with smaller or larger κ respectively. The HalfCauchy places an extremely long right tail over κ that frustrates this relationship resulting in a posterior density that is difficult to sample from.

Modified horseshoe

For a Poisson model with parameters modeled on the log scale we consider the Cauchy parameterisation to be too extreme. In light of this we opted for a pragmatic approach, a modification to the original horseshoe by replacing the HalfCauchy distribution with a HalfNormal distribution. The modified horseshoe cuts off the top of the funnels by restricting κ to produce pleasant posterior geometry Figure 13A, B. The modified horseshoe is much easier to sample from, but with the cost of a much more constraining prior over γi.