Bayesian machine learning analysis of single-molecule fluorescence colocalization images

  1. Yerdos A Ordabayev
  2. Larry J Friedman
  3. Jeff Gelles (corresponding author)
  4. Douglas L Theobald (corresponding author)
  1. Department of Biochemistry, Brandeis University, United States
9 figures, 5 tables and 7 additional files

Figures

Figure 1
Example CoSMoS experiment.

(A) Experiment schematic. DNA target molecules labeled with a blue-excited fluorescent dye (blue star) are tethered to the microscope slide surface. RNA polymerase II (Pol II) binder molecules labeled with a green-excited dye (green star) are present in solution. (B) Data collection and preprocessing. After collecting a single image with blue excitation to identify the locations of the DNA molecules, a time sequence of Pol II images was collected with green excitation. Preprocessing of the images includes mapping of the corresponding points in target and binder channels, drift correction, and identification of two sets of areas of interest (AOIs). One set corresponds to locations of target molecules (e.g., purple square); the other corresponds to locations where no target is present (e.g., yellow square). (C) On-target data. Data are time sequences of 14 × 14 pixel AOI images centered at each target molecule. Frames show presence of on-target (e.g., frame 630) and off-target (e.g., frame 645) Pol II molecules. (D) Off-target control data. Control data consists of images collected from randomly selected sites at which no target molecule is present. Such sites can be AOIs in which no fluorescent target molecule is visible (e.g., the yellow square in the DNA channel shown in B). Alternatively, control data can be taken from a recording of a separate control sample to which no target molecules were added. Image data in B, C, and D is from Data set A in Table 1.

Figure 2 with 2 supplements
Depiction of the cosmos probabilistic image model and model parameters.

(A) Example AOI image (from Data set A in Table 1). The AOI image is a matrix of 14 × 14 pixel intensities, shown here both as a 2-D grayscale image and as a 3-D intensity plot. The image contains two spots; one is centered at the target location (image center) and the other is located off-target. (B) Examples of four idealized noise-free image representations (μI). Image representations consist of zero, one, or two idealized spots (μS) superimposed on a constant background (b). Each fluorescent spot is represented as a 2-D Gaussian parameterized by integrated intensity (h), width (w), and position (x, y). The presence of spots is encoded in the binary spot existence indicator m. (C) Simulated idealized images illustrating different values of the target-specific spot state parameter z and index parameter θ. θ = 0 corresponds to the case in which no specifically bound molecule is present (z = 0); θ = 1 or 2 corresponds to the cases in which a specifically bound molecule is present (z = 1) and is spot 1 or 2, respectively. (D) Condensed graphical representation of the cosmos probabilistic model. Model parameters are depicted as circles and deterministic functions as diamonds. Observed image (D) is represented by a shaded circle. Related nodes are connected by edges, with an arrow pointing towards the dependent node (e.g., the shape of each 2-D Gaussian spot μS depends on spot parameters m, h, w, x, and y). Plates (rounded rectangles) contain entities that are repeated for the number of instances displayed at the bottom-right corner: number of total AOIs (N+Nc), frame count (F), and maximum number of spots in a single image (K = 2). Parameters outside of the plates are global quantities that apply to all frames of all AOIs. A more complete version of the graphical model specifying the relevant probability distributions is given in Figure 2—figure supplement 1.
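The idealized image in panel B can be illustrated with a short sketch: a constant background b plus a 2-D Gaussian for every spot whose indicator m equals 1. The pixel-coordinate convention (pixel centers at integer coordinates), the function names, and the numerical values of h, b, x, and y below are assumptions made for illustration only; they are not taken from the Tapqir source code.

```python
import math

def gaussian_spot(P, h, w, x, y):
    """2-D Gaussian spot with integrated intensity h and width w.

    Assumed convention (illustrative only): pixel (i, j) has its center at
    integer coordinates (i, j), and (x, y) uses the same coordinates.
    """
    norm = h / (2.0 * math.pi * w ** 2)
    return [
        [norm * math.exp(-((i - x) ** 2 + (j - y) ** 2) / (2.0 * w ** 2))
         for j in range(P)]
        for i in range(P)
    ]

def ideal_image(P, b, spots):
    """Noise-free image: background b plus every spot with m == 1.

    Each spot is a tuple (m, h, w, x, y).
    """
    image = [[float(b)] * P for _ in range(P)]
    for m, h, w, x, y in spots:
        if m == 1:
            spot = gaussian_spot(P, h, w, x, y)
            for i in range(P):
                for j in range(P):
                    image[i][j] += spot[i][j]
    return image

# Hypothetical example: one spot near the AOI center (m = 1) and one
# absent spot (m = 0) that therefore contributes nothing to the image.
P = 14
spots = [(1, 3000.0, 1.4, 6.5, 6.5), (0, 2000.0, 1.4, 2.0, 10.0)]
mu_I = ideal_image(P, b=150.0, spots=spots)
total_above_background = sum(v - 150.0 for row in mu_I for v in row)
```

Because h is an integrated intensity, the summed signal above background is close to h whenever the spot lies well inside the AOI.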

Figure 2—figure supplement 1
Extended graphical representation of the generative probabilistic model.

Directed factor graph representation (Bishop, 2006) of model parameters and parameter distributions. This diagram is a more complete version of the graphical model shown in Figure 2D; it includes additional parameters (μb, σb, δ) and explicitly specifies the relevant probability distributions. Model parameters are depicted as circles, parameter distributions as small filled squares, and deterministic functions as diamonds. Names of the probability distributions are written next to the squares. Input parameters and output parameters are connected by lines, with an arrow pointing towards the dependent parameter. Observed AOI image (D) is the sum of the noisy photon-dependent image (I) and the photon-independent camera offset (δ). Plates (rounded rectangles) contain nodes that are repeated for the number of instances displayed at the bottom-right corner: number of AOIs (N+Nc), frame count (F), maximum number of spots in a single image (K), and number of image pixels (P × P). The prior for x and y is Uniform for target-nonspecific spots (θ ≠ k) and AffineBeta for target-specific spots (θ = k) (see Figure 2—figure supplement 2).

Figure 2—figure supplement 2
Prior distributions for the x and y spot position parameters.

Prior distributions of x and y for specific and non-specific binding. Probability densities for x and y are defined in the range [-(P+1)/2,(P+1)/2] relative to the target molecule and are conditional on the identity of the spot (specific or non-specific). The width of the peak in the specific distribution is given by σxy, the value of which is learned from the data. Probability densities for x and y are identical.

Figure 3 with 4 supplements
Tapqir analysis and inferred model parameters.

(A,B) Tapqir was applied to simulated data (lamda0.5 parameter set in Supplementary file 1) (A) and to experimental data (Data set A in Table 1) (B). (A) and (B) each show a short extract from a single target location in the data set. The first row shows AOI images for the subset of frames indicated by gray shaded stripes in the plots; image contrast and offset settings are consistent within each panel. The second row shows the locations of spots determined by Tapqir. Spot numbers 1 (blue) and 2 (orange) are assigned arbitrarily and may change from frame to frame. For clarity, only data for spots with a spot probability p(m=1) > 0.5 are shown. Spots predicted to be target-specific (p(θ=k) > 0.5 for spot k) are shown as filled circles. The topmost graphs (green) show the calculated probability that a target-specific spot is present (p(specific)) in each frame. Below are the calculated spot intensities (h), spot widths (w), and locations (x, y) for spot 1 (blue) and spot 2 (orange), and the AOI background intensities (b). Again, for clarity, data are shown only for likely spots (p(m=1) > 0.5). Error bars: 95% CI (credible interval) estimated from a sample size of 500. Some error bars are smaller than the points and thus not visible.

Figure 3—figure supplement 1
Calculated spot probabilities.

The data sets used for panels A and B are identical to those in Figure 3A and B; the first two rows and the p(specific) (green) graph are reproduced from that figure. Blue graphs show the probability of being present (p(m=1)) and of being target-specific (p(θ=1)) for the arbitrarily designated spot 1 in each frame. Orange graphs show the analogous quantities p(m=1) and p(θ=2) for spot 2. For a given image, the probability p(specific) ≡ p(z=1) that any target-specific spot is present is equal to p(θ=1) + p(θ=2).
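The relation between the per-spot probabilities and p(specific) can be written as a one-line sum. The sketch below uses made-up posterior values for θ in three frames; the function name and the numbers are hypothetical.

```python
def p_specific(p_theta):
    """p(specific) = p(z=1) = sum over k >= 1 of p(theta = k).

    p_theta[0] is p(theta = 0) (no target-specific spot); p_theta[k] is the
    probability that spot k is the target-specific one.
    """
    return sum(p_theta[1:])

# Hypothetical per-frame posteriors over theta in {0, 1, 2} (K = 2).
frames = [
    [0.90, 0.08, 0.02],
    [0.05, 0.90, 0.05],
    [0.30, 0.10, 0.60],
]
probs = [p_specific(p) for p in frames]
```

Since the probabilities over θ sum to one, the same quantity equals 1 − p(θ=0).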

Figure 3—figure supplement 2
Reproduction of experimental data by posterior predictive sampling.

Example frames are shown from Data set A (A: SNR = 1.61), Data set B (B: SNR = 3.77), Data set C (C: SNR = 4.23), and Data set D (D: SNR = 3.06) in Table 1. In each panel, the top row shows AOI images selected from the experimental data and the middle row shows corresponding images obtained by sampling from the posterior distributions. Image contrast and offset are consistent within each panel. The bottom row shows pixel intensity distributions from the experimental and posterior prediction images.

Figure 3—figure supplement 3
Tapqir analysis of image data simulated using a broad range of global parameters.

Simulations (see Materials and methods) consist of 16 data sets where values of global parameters (π, λ, σxy, and g) were randomly generated for each data set (Supplementary file 2). Simulated data were fit with Tapqir, and parameter values from the fit (with 95% credible interval estimated from a sample size of 10,000) are plotted against the true parameter values. To guide the eye, dashed lines indicate identical true and fit values. (A) Gain of the camera g. (B) Average target-specific binding probability π. (C) Target non-specific binding density λ. (D) Proximity parameter σxy.
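A credible interval estimated from posterior samples can be sketched as follows. This is a minimal illustration that assumes an equal-tailed interval computed from sorted samples; the caption does not state which kind of interval Tapqir reports, and the Gaussian "posterior" used here is invented for the example.

```python
import random

def credible_interval(samples, level=0.95):
    """Equal-tailed credible interval from posterior samples (nearest rank)."""
    ordered = sorted(samples)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    lo = ordered[int(tail * (n - 1))]
    hi = ordered[int(round((1.0 - tail) * (n - 1)))]
    return lo, hi

# Hypothetical posterior: 10,000 draws from a normal with mean 0.5, sd 0.1.
rng = random.Random(0)
posterior_samples = [rng.gauss(0.5, 0.1) for _ in range(10_000)]
lo, hi = credible_interval(posterior_samples)
```

For a normal posterior the interval should lie close to the mean ± 1.96 standard deviations.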

Figure 3—figure supplement 4
Effect of AOI size on analysis of experimental data.

(A) and (B) each show a short extract from a single target location (AOI 163 in (A) and AOI 0 in (B)) from Data set A (Table 1; SNR = 1.61). Tapqir was applied to the data set using AOI image sizes P of 14 × 14 (first row), 10 × 10 (second row), and 6 × 6 (third row) pixels. Corresponding output p(specific) probabilities are plotted in the graph. Image contrasts in (A) and (B) are different. Unattended calculation times on an AMD Ryzen Threadripper 2990WX with an Nvidia GeForce RTX 2080Ti GPU using CUDA version 11.5 for the different AOI sizes were: 7 h 40 min (P = 14), 3 h 5 min (P = 10), and 2 h 40 min (P = 6).

Figure 4 with 1 supplement
Tapqir performance on simulated data with different SNRs or different non-specific binding densities.

(A–D) Analysis of simulated data over a range of SNR. SNR was varied in the simulations by changing spot intensity h while keeping other parameters constant (Supplementary file 3). (A) Example images showing the appearance of the same target-specific spot simulated with increasing SNR. (B) Mean of Tapqir-calculated target-specific spot probability p(specific) (with 95% CI; see Materials and methods) for the subset of images where target-specific spots are known to be present. (C) Histograms of p(specific) for selected simulations with SNR indicated. Data are shown as stacked bars for images known to have (green, 15%) or not have (gray, 85%) target-specific spots. Count is zero for bins where bars are not shown. (D) Accuracy of Tapqir image classification with respect to presence/absence of a target-specific spot. Accuracy was assessed by MCC, recall, and precision (see Results and Materials and methods sections). (E–G) Same as in (B–D) but for the data simulated over a range of non-specific binding densities λ at fixed SNR = 3.76 (Supplementary file 1). (H) Spot recognition in AOI images containing closely spaced target-specific and non-specific spots. Images were selected from the λ = 1 data set in (E–G). AOI images and spot detection are plotted as in Figure 3, with spot numbers 1 (blue) and 2 (orange) assigned arbitrarily and spots predicted to be target-specific shown as filled circles. (I) Same as in (C) but for the data simulated over a range of non-specific binding densities λ with no target-specific binding (π = 0) (Supplementary file 4).
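The accuracy measures in panels D and G can be computed from a confusion matrix. The sketch below assumes that an image is classified as containing a target-specific spot when p(specific) > 0.5 (the threshold is an assumption here) and uses made-up labels and probabilities.

```python
import math

def classification_metrics(truth, p_specific, threshold=0.5):
    """Return (mcc, recall, precision) for thresholded p(specific) values.

    Ratios with a zero denominator are reported as 0.0 (a convention chosen
    for this sketch).
    """
    predicted = [p > threshold for p in p_specific]
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return mcc, recall, precision

# Hypothetical ground truth (1 = target-specific spot present) and
# hypothetical p(specific) values for ten images.
truth = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
probs = [0.9, 0.8, 0.3, 0.1, 0.6, 0.2, 0.0, 0.1, 0.4, 0.05]
mcc, recall, precision = classification_metrics(truth, probs)
```

Unlike recall and precision, MCC uses all four cells of the confusion matrix, which makes it informative when the classes are unbalanced, as in the 15%/85% split described in panel C.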

Figure 4—figure supplement 1
False negative spot misidentifications by Tapqir and spot-picker method.

The same λ = 1 simulated data set used in Figure 4E–H (lamda1 in Supplementary file 1) was analyzed by Tapqir and spot-picker. The data set contained 418 AOI images containing target-specific spots, of which the 37 shown here were falsely predicted to contain no target-specific spot (3 by Tapqir and 34 by spot-picker). Correct (+) and incorrect (−) predictions by each program are indicated. In all AOI images except AOI 3 frame 109, there is a nearby target non-specific spot in addition to the target-specific one. False negative classifications by spot-picker method are presumably due to the presence of a closely located target non-specific spot that distorts the shape of a target-specific spot. Tapqir, on the other hand, is able to correctly infer the presence of two closely located spots even when they are not completely resolved (Figure 4H). The rare (3 out of 418) false negative classifications by Tapqir likely arise from target-specific spots with centers that deviate from the target location by much more (∼ 0.7 pixels) than the inferred proximity parameter (σxy = 0.2 pixels).

Figure 5
Tapqir analysis of association/dissociation kinetics and thermodynamics.

(A) Chemical scheme for a one-step association/dissociation reaction at equilibrium with pseudo-first-order binding and dissociation rate constants kon and koff, respectively. (B) A simulation of the reaction in (A) and scheme for kinetic analysis of the simulated data with Tapqir. The simulation used SNR = 3.76, kon = 0.02 s−1, koff = 0.2 s−1, and a high target-nonspecific binding frequency λ = 1 (Supplementary file 5, data set kon0.02lamda1). The full data set consists of 100 AOI locations and 1,000 frames each for on-target data and off-target control data. Shown is a short extract of on-target data from a single AOI location in the simulation. Plots show simulated presence/absence of the target-specific spot (blue) and Tapqir-calculated estimate of corresponding target-specific spot probability p(specific) (green). Two thousand binary traces (e.g., black records) were sampled from the p(specific) posterior distribution and used to infer kon and koff using a two-state hidden Markov model (HMM) (see Materials and methods). Each sample trace contains well-defined time intervals corresponding to target-specific spot presence and absence (e.g., Δton and Δtoff). (C,D,E) Kinetic and equilibrium constants from simulations (Supplementary file 5) using a range of kon values and target-nonspecific spot frequencies λ, with constant koff = 0.2 s−1. (C) Values of kon used in simulations (blue) and mean values (and 95% CIs, black) inferred by HMM analysis from the 2,000 posterior samples. Some error bars are smaller than the points and thus not visible. (D) Same as (C) but for koff. (E) Binding equilibrium constants Keq = kon/koff used in simulation (blue) and inferred from Tapqir-calculated π as Keq = π/(1 − π) (black).
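The equilibrium relation in panel E and the trace-sampling step in panel B can be sketched as follows. The sketch assumes that binary traces are drawn independently frame by frame as Bernoulli samples of p(specific); that is an illustrative simplification, and the HMM fit itself is not reproduced here. Function names are hypothetical.

```python
import random

def keq_from_pi(pi):
    """Binding equilibrium constant from the average binding probability."""
    return pi / (1.0 - pi)

def pi_from_rates(k_on, k_off):
    """Equilibrium occupancy of the bound state in a two-state scheme."""
    return k_on / (k_on + k_off)

def sample_binary_traces(p_specific, n_samples, seed=0):
    """Draw binary present/absent traces, one Bernoulli draw per frame."""
    rng = random.Random(seed)
    return [
        [1 if rng.random() < p else 0 for p in p_specific]
        for _ in range(n_samples)
    ]

# Rate constants quoted in the caption: kon = 0.02 s^-1, koff = 0.2 s^-1.
pi_eq = pi_from_rates(0.02, 0.2)
keq = keq_from_pi(pi_eq)          # equals kon / koff

# Hypothetical p(specific) values for a five-frame extract.
traces = sample_binary_traces([0.0, 1.0, 0.5, 1.0, 0.0], n_samples=2000)
```

With these rate constants, π/(1 − π) recovers kon/koff = 0.1, which is the consistency check plotted in panel E.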

Figure 6 with 3 supplements
Extraction of target-binder association kinetics from example experimental data.

Data are from Data set B (SNR = 3.77, λ = 0.1575; see Table 1). (A) Probabilistic rastergram representation of Tapqir-calculated target-specific spot probabilities p(specific) (color scale). AOIs were ordered by decreasing times-to-first-binding. For clarity, only every thirteenth frame is plotted. (B) Time-to-first-binding distribution using Tapqir. Plot shows the cumulative fraction of AOIs that exhibited one or more target-specific binding events by the indicated frame number (green) and fit curve (black). Shading indicates uncertainty. (C) Time-to-first-binding distribution using an empirical spot-picker method (Friedman et al., 2013). The spot-picker method jointly fits first spots observed in off-target control AOIs (yellow) and in on-target AOIs (purple), yielding fit curves (black). (D) Values of kinetic parameters ka, kns, and Af (see text) derived from fits in (B) and (C). Uncertainties reported in (B, C, D) represent 95% credible intervals for Tapqir and 95% confidence intervals for spot-picker (see Materials and methods).
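The cumulative time-to-first-binding curve described in panel B can be built from binary traces as in the sketch below. The traces are invented for illustration, and the curve fitting that yields ka, kns, and Af is not shown.

```python
def time_to_first_binding(trace):
    """Index of the first frame with a target-specific spot, or None."""
    for frame, present in enumerate(trace):
        if present:
            return frame
    return None

def cumulative_first_binding(traces, n_frames):
    """Fraction of AOIs with one or more binding events by each frame."""
    first = [time_to_first_binding(t) for t in traces]
    n_aois = len(traces)
    return [
        sum(1 for t in first if t is not None and t <= frame) / n_aois
        for frame in range(n_frames)
    ]

# Hypothetical binary traces for four AOIs over four frames.
traces = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],   # no binding observed in this AOI
    [1, 0, 0, 0],
]
curve = cumulative_first_binding(traces, n_frames=4)
```

AOIs that never show a binding event stay out of the numerator, so the curve may plateau below 1.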

Figure 6—figure supplement 1
Additional example showing extraction of target-binder association kinetics from experimental data.

Data are from Data set A (SNR = 1.61, λ = 0.2943; see Table 1). Results are plotted as in Figure 6, except that for clarity only every second frame and every third AOI is shown in (A).

Figure 6—figure supplement 2
Additional example showing extraction of target-binder association kinetics from experimental data.

Data are from Data set C (SNR = 4.23, λ = 0.0876; see Table 1). Results are plotted as in Figure 6, except that for clarity only every tenth frame is shown in (A).

Figure 6—figure supplement 3
Additional example showing extraction of target-binder association kinetics from experimental data.

Data are from Data set D (SNR = 3.06, λ = 0.0437; see Table 1). Results are plotted as in Figure 6, except that for clarity only every thirteenth frame and every second AOI is shown in (A).

Figure 7
Extraction of AOI images from raw images.

Pseudocode representation of cosmos model.

Pseudocode representation of cosmos guide.

Tables

Table 1
Experimental data sets.
Data set A: Binder, SNAPf-tagged S. cerevisiae RNA polymerase II labeled with DY549; Target, transcription template DNA containing 5× Gal4 upstream activating sequences and CYC1 core promoter; Conditions, yeast nuclear extract supplemented with Gal4-VP16 activator and NTPs. From Rosen et al., 2020.
Data set B: Binder, 0.1 nM E. coli σ54 RNA polymerase labeled with Cy3; Target, 852 bp DNA containing the glnALG promoter; Conditions, physiological buffer, no NTPs. From Fig. 1E of Friedman et al., 2013.
Data set C: Binder, 0.4 nM E. coli σ54 RNA polymerase labeled with Cy3; Target, 3,591 bp DNA containing the glnALG promoter; Conditions, physiological buffer, no NTPs. From Fig. 3D of Friedman et al., 2013.
Data set D: Binder, 0.15 nM E. coli Cy3-GreB; Target, reconstituted backtracked EC-6 E. coli transcription elongation complex; Conditions, physiological buffer, no NTPs. Randomly selected subset of data set from Tetone et al., 2017.

| Data set | Data set size (a) | SNR | π [95% CI] | λ [95% CI] | g [95% CI] | σxy [95% CI] | Compute time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A | N = 331, Nc = 526, F = 790 | 1.61 | 0.0951 [0.0936, 0.0966] | 0.2943 [0.2924, 0.2963] | 6.645 [6.643, 6.647] | 0.577 [0.573, 0.580] | 7 h 40 m (b); 3 h 50 m (c) |
| B | N = 102, Nc = 127, F = 4407 | 3.77 | 0.0846 [0.0835, 0.0857] | 0.1575 [0.1569, 0.1583] | 11.861 [11.856, 11.865] | 0.476 [0.474, 0.479] | 7 h 40 m (b) |
| C | N = 122, Nc = 157, F = 3855 | 4.23 | 0.0267 [0.0262, 0.0273] | 0.0876 [0.0869, 0.0883] | 16.777 [16.773, 16.782] | 0.404 [0.399, 0.408] | 9 h 15 m (b) |
| D | N = 200, Nc = 200, F = 5622 | 3.06 | 0.0038 [0.0036, 0.0039] | 0.0437 [0.0434, 0.0440] | 18.727 [18.724, 18.731] | 0.451 [0.438, 0.463] | 11 h (b) |

  a. N: number of on-target AOIs; Nc: number of off-target control AOIs; F: number of frames.

  b. Unattended calculation time on an AMD Ryzen Threadripper 2990WX with an Nvidia GeForce RTX 2080Ti GPU using CUDA version 11.5.

  c. Unattended calculation time on an Intel Xeon CPU with an Nvidia Tesla V100-SXM2-16GB GPU using CUDA version 11.2 in a Google Colab Pro account.

Table 2
The effect of AOI size on classification accuracy*.
| AOI dimension, P (pixels) | MCC | Compute time |
| --- | --- | --- |
| 14 | 0.951 | 2 h 10 m |
| 10 | 0.948 | 1 h 25 m |
| 6 | 0.939 | 1 h 20 m |

  1. * Tapqir was applied to the same simulated data set (height1000 parameter set in Supplementary file 3; SNR = 1.25) using different AOI sizes.

  2. The width (w) of the simulated spots (one standard deviation of the 2-D Gaussian) is equal to 1.4 pixels.

  3. Unattended calculation time on an AMD Ryzen Threadripper 2990WX with an Nvidia GeForce RTX 2080Ti GPU using CUDA version 11.5.

Table 3
Variables used in the Tapqir model.
| Symbol | Meaning | Domain | Dimensions |
| --- | --- | --- | --- |
| K | Maximum number of spots per image | | |
| N | Number of on-target AOIs | | |
| Nc | Number of off-target control AOIs | | |
| F | Number of frames | | |
| P | Size of the AOI image in pixels | | |
| g | Camera gain | ℝ>0 | |
| σxy | Proximity | (0, (P+1)/√12) | |
| π | Average target-specific binding probability | [0, 1] | |
| λ | Target-nonspecific binding density | ℝ>0 | |
| μb | Mean background intensity across AOI | ℝ>0 | AOI[N] |
| σb | Standard deviation of background intensity across AOI | ℝ>0 | AOI[N] |
| b | Background intensity | ℝ>0 | AOI[N]×frame[F] |
| z | Target-specific spot presence | {0, 1} | AOI[N]×frame[F] |
| θ | Target-specific spot index | {0, 1, …, K} | AOI[N]×frame[F] |
| m | Spot presence indicator | {0, 1} | spot[K]×AOI[N]×frame[F] |
| h | Integrated spot intensity | ℝ>0 | spot[K]×AOI[N]×frame[F] |
| w | Spot width | [0.75, 2.25] | spot[K]×AOI[N]×frame[F] |
| x | Center of the spot on the x-axis | ℝ | spot[K]×AOI[N]×frame[F] |
| y | Center of the spot on the y-axis | ℝ | spot[K]×AOI[N]×frame[F] |
| μS | 2-D Gaussian spot | ℝ>0 | spot[K]×AOI[N]×frame[F]×pixelX[P]×pixelY[P] |
| μI | Ideal image without offset | ℝ>0 | AOI[N]×frame[F]×pixelX[P]×pixelY[P] |
| δ | Offset signal | ℝ>0 | AOI[N]×frame[F]×pixelX[P]×pixelY[P] |
| I | Observed image without offset signal | ℝ>0 | AOI[N]×frame[F]×pixelX[P]×pixelY[P] |
| D | Observed image (I + δ) | ℝ>0 | AOI[N]×frame[F]×pixelX[P]×pixelY[P] |
| xtarget | Target molecule position on the x-axis | [P/2 − 1, P/2] | AOI[N]×frame[F] |
| ytarget | Target molecule position on the y-axis | [P/2 − 1, P/2] | AOI[N]×frame[F] |
| i | Pixel index on the x-axis | {0, …, P − 1} | pixelX[P] |
| j | Pixel index on the y-axis | {0, …, P − 1} | pixelY[P] |
| W | Width of the raw microscope images in pixels | | |
| H | Height of the raw microscope images in pixels | | |
| Draw | Raw microscope images | ℝ>0 | frame[F]×pixelX[H]×pixelY[W] |
| xtarget,raw | Target molecule position in raw images on the x-axis | [0.5, H − 0.5] | AOI[N]×frame[F] |
| ytarget,raw | Target molecule position in raw images on the y-axis | [0.5, W − 0.5] | AOI[N]×frame[F] |

Table 4
Probability distributions used in the model.
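| Distribution | PDF |
| --- | --- |
| $x \sim \mathrm{AffineBeta}(\mu, \nu, a, b)$ | $\dfrac{y^{\alpha-1}(1-y)^{\beta-1}}{B(\alpha,\beta)}$ where $\alpha = \dfrac{\nu(\mu-a)}{b-a}$, $\beta = \dfrac{\nu(b-\mu)}{b-a}$, and $y = \dfrac{x-a}{b-a}$ |
| $x \sim \mathrm{Bernoulli}(\pi)$ | $\pi^{x}(1-\pi)^{1-x}$ |
| $x \sim \mathrm{Beta}(\alpha, \beta)$ | $\dfrac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)}$ |
| $x \sim \mathrm{Categorical}(p)$ | $\prod_{i=1}^{k} p_i^{[x=i]}$ |
| $x \sim \mathrm{Empirical}(z, p)$ | $\prod_{i=1}^{k} p_i^{[x=z_i]}$ |
| $x \sim \mathrm{Exponential}(\lambda)$ | $\lambda e^{-\lambda x}$ |
| $x \sim \mathrm{Gamma}(\mu, \sigma)$ | $\dfrac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}$ where $\alpha = \dfrac{\mu^2}{\sigma^2}$ and $\beta = \dfrac{\mu}{\sigma^2}$ |
| $x \sim \mathrm{HalfNormal}(\sigma)$ | $\dfrac{\sqrt{2}}{\sigma\sqrt{\pi}} \exp\left(-\dfrac{x^2}{2\sigma^2}\right)$ for $x > 0$ |
| $k \sim \mathrm{TruncPoisson}(\lambda, K)$ | $\begin{cases} 1 - e^{-\lambda}\sum_{i=0}^{K-1} \dfrac{\lambda^{i}}{i!} & \text{if } k = K \\ \dfrac{\lambda^{k} e^{-\lambda}}{k!} & \text{otherwise} \end{cases}$ |
| $x \sim \mathrm{Uniform}(a, b)$ | $\dfrac{1}{b-a}$ for $x \in [a, b]$ |
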
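Three of the less common parameterizations in Table 4 can be checked numerically: Gamma given by mean μ and standard deviation σ, AffineBeta given by mean μ and ν on the interval [a, b], and the truncated Poisson, which places all probability mass for counts of K or more on k = K. The sketch below uses only the Python standard library; the function names and the example values are hypothetical and do not correspond to Tapqir's API.

```python
import math

def gamma_shape_rate(mu, sigma):
    """Gamma(mu, sigma) -> (alpha, beta): alpha = mu^2/sigma^2, beta = mu/sigma^2."""
    return mu ** 2 / sigma ** 2, mu / sigma ** 2

def affine_beta_params(mu, nu, a, b):
    """AffineBeta(mu, nu, a, b) -> (alpha, beta) of the Beta on y = (x-a)/(b-a)."""
    alpha = nu * (mu - a) / (b - a)
    beta = nu * (b - mu) / (b - a)
    return alpha, beta

def trunc_poisson_pmf(k, lam, K):
    """Truncated Poisson on {0, ..., K}: mass for counts >= K is placed on K."""
    if k == K:
        below = sum(lam ** i / math.factorial(i) for i in range(K))
        return 1.0 - math.exp(-lam) * below
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Gamma: the implied mean alpha/beta and variance alpha/beta^2 recover
# mu and sigma^2 (values below are arbitrary).
alpha_g, beta_g = gamma_shape_rate(mu=150.0, sigma=30.0)

# AffineBeta on [-7.5, 7.5], i.e. [-(P+1)/2, (P+1)/2] for P = 14, centered
# at 0 with an arbitrary nu.
alpha_b, beta_b = affine_beta_params(mu=0.0, nu=20.0, a=-7.5, b=7.5)

# Truncated Poisson with K = 2 and lambda = 0.5 sums to one over k = 0, 1, 2.
pmf = [trunc_poisson_pmf(k, lam=0.5, K=2) for k in range(3)]
```

In the AffineBeta parameterization α + β = ν, so ν behaves like a concentration: larger values give a narrower peak around μ.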
Table 5
The effect of mapping precision on classification accuracy*.
| σxy (true) | σxy (fit) [95% CI] | MCC | σxy prior |
| --- | --- | --- | --- |
| 0.2 | 0.21 [0.20, 0.22] | 0.989 | Exponential(1) |
| 1 | 0.96 [0.90, 1.02] | 0.939 | Exponential(1) |
| 1.5 | 1.49 [1.40, 1.59] | 0.890 | Exponential(1) |
| 2 | 1.96 [1.84, 2.09] | 0.834 | Exponential(1) |
| 2 | 1.97 [1.84, 2.09] | 0.834 | Uniform(0, (P+1)/√12) |

  1. * Data were simulated over a range of proximity parameter σxy values at fixed π = 0.15 and λ = 0.15 (Supplementary file 6).

Additional files

Supplementary file 1

Varying non-specific binding rate simulation parameters and corresponding fit values.

https://cdn.elifesciences.org/articles/73860/elife-73860-supp1-v2.xlsx
Supplementary file 2

Randomized simulation parameters and corresponding fit values.

https://cdn.elifesciences.org/articles/73860/elife-73860-supp2-v2.xlsx
Supplementary file 3

Varying intensity (SNR) simulation parameters and corresponding fit values.

https://cdn.elifesciences.org/articles/73860/elife-73860-supp3-v2.xlsx
Supplementary file 4

No target-specific binding and varying non-specific binding rate simulation parameters and corresponding fit values.

https://cdn.elifesciences.org/articles/73860/elife-73860-supp4-v2.xlsx
Supplementary file 5

Kinetic simulation parameters and corresponding fit values.

https://cdn.elifesciences.org/articles/73860/elife-73860-supp5-v2.xlsx
Supplementary file 6

Varying proximity simulation parameters and corresponding fit values.

https://cdn.elifesciences.org/articles/73860/elife-73860-supp6-v2.xlsx
Transparent reporting form
https://cdn.elifesciences.org/articles/73860/elife-73860-transrepform1-v2.docx

Cite this article

  1. Yerdos A Ordabayev
  2. Larry J Friedman
  3. Jeff Gelles
  4. Douglas L Theobald
(2022)
Bayesian machine learning analysis of single-molecule fluorescence colocalization images
eLife 11:e73860.
https://doi.org/10.7554/eLife.73860