Fast and flexible estimation of effective migration surfaces
Abstract
Spatial population genetic data often exhibit ‘isolation-by-distance,’ where genetic similarity tends to decrease as individuals become more geographically distant. The rate at which genetic similarity decays with distance is often spatially heterogeneous due to variable population processes like genetic drift, gene flow, and natural selection. Petkova et al., 2016 developed a statistical method called Estimating Effective Migration Surfaces (EEMS) for visualizing spatially heterogeneous isolation-by-distance on a geographic map. While EEMS is a powerful tool for depicting spatial population structure, it can suffer from slow runtimes. Here, we develop a related method called Fast Estimation of Effective Migration Surfaces (FEEMS). FEEMS uses a Gaussian Markov Random Field model in a penalized likelihood framework that allows for efficient optimization and output of effective migration surfaces. Further, the efficient optimization facilitates the inference of migration parameters per edge in the graph, rather than per node (as in EEMS). With simulations, we show conditions under which FEEMS can accurately recover effective migration surfaces with complex gene flow histories, including those with anisotropy. We apply FEEMS to population genetic data from North American gray wolves and show it performs favorably in comparison to EEMS, with solutions obtained orders of magnitude faster. Overall, FEEMS expands the ability of users to quickly visualize and interpret spatial structure in their data.
Introduction
The relationship between geography and genetics has had enduring importance in evolutionary biology (see Felsenstein, 1982). One fundamental consideration is that individuals who live near one another tend to be more genetically similar than those who live far apart (Wright, 1943; Wright, 1946; Malécot, 1948; Kimura, 1953; Kimura and Weiss, 1964). This phenomenon is often referred to as ‘isolation-by-distance’ (IBD) and has been shown to be a pervasive feature in spatial population genetic data across many species (Slatkin, 1985; Dobzhansky and Wright, 1943; Meirmans, 2012). Statistical methods that use both measures of genetic variation and geographic coordinates to understand patterns of IBD have been widely applied (Bradburd and Ralph, 2019; Battey et al., 2020). One major challenge in these approaches is that the relationship between geography and genetics can be complex. In particular, geographic features can influence migration in localized regions, leading to spatially heterogeneous patterns of IBD (Bradburd and Ralph, 2019).
Multiple approaches have been introduced to model spatially non-homogeneous IBD in population genetic data (McRae, 2006; Duforet‐Frebourg and Blum, 2014; Hanks and Hooten, 2013; Petkova et al., 2016; Bradburd et al., 2018; Al-Asadi et al., 2019; Safner et al., 2011; Ringbauer et al., 2018). Particularly relevant to our proposed approach is the work of Petkova et al., 2016 and Hanks and Hooten, 2013. Both approaches model genetic distance using the ‘resistance distance’ on a weighted graph. This distance metric is inspired by concepts of effective resistance in circuit theory models, or alternatively understood as the commute time of a random walk on a weighted graph or as a Gaussian graphical model (specifically a conditional autoregressive process) (Chandra et al., 1996; Hanks and Hooten, 2013; Rue and Held, 2005). Additionally, the resistance distance approach is a computationally convenient and accurate approximation to spatial coalescent models (McRae, 2006), although it has limitations in asymmetric migration settings (Lundgren and Ralph, 2019).
Hanks and Hooten, 2013 introduced a Bayesian model that uses measured ecological covariates, such as elevation, to predict genetic distances across subpopulations. Specifically, they use a graph-based model for genotypes observed at different spatial locations. Expected genetic distances across subpopulations in their model are given by resistance distances computed from the edge weights. They parameterize the edge weights of the graph to be a function of known biogeographic covariates, linking local geographic features to genetic variation across the landscape.
Concurrently, the Estimating Effective Migration Surfaces (EEMS) method was developed to help interpret and visualize non-homogeneous gene flow on a geographic map (Petkova, 2013; Petkova et al., 2016). EEMS uses resistance distances to approximate the between-subpopulation component of pairwise coalescent times in a ‘stepping-stone’ model of migration and genetic drift (Kimura, 1953; Kimura and Weiss, 1964). EEMS models the within-subpopulation component of pairwise coalescent times with a node-specific parameter. Instead of using known biogeographic covariates to connect geographic features to genetic variation as in Hanks and Hooten, 2013, EEMS infers a set of edge weights (and diversity parameters) that explain the genetic distance data. The inference is based on a hierarchical Bayesian model and a Voronoi-tessellation-based prior to encourage piecewise constant spatial smoothness in the fitted edge weights.
EEMS uses Markov Chain Monte Carlo (MCMC) and outputs a visualization of the posterior mean for effective migration and a measure of genetic diversity for every spatial position of the focal habitat. Regions with relatively low effective migration can be interpreted to have reduced gene flow over time, whereas regions with relatively high migration can be interpreted as having elevated gene flow. EEMS has been applied to multiple systems to describe spatial genetic structure, but despite EEMS’s advances in formulating a tractable solution to investigate spatial heterogeneity in IBD, the MCMC algorithm it uses can be slow to converge, in some cases leading to days of computation time for large datasets (Peter et al., 2020).
The inference problems faced by EEMS and Hanks and Hooten are related to a growing area referred to as ‘graph learning’ (Dong et al., 2019; Mateos et al., 2019). In graph learning, a noisy signal is measured as a scalar value at a set of nodes from the graph, and the aim is then to infer nonnegative edge weights that reflect how spatially ‘smooth’ the signal is with respect to the graph topology (Kalofolias, 2016). In population genetic settings, this scalar could be an allele frequency measured at locations in a discrete spatial habitat with effective migration rates between subpopulations. Like the approach taken by Hanks and Hooten, 2013, one widely used representation of smooth graph signals is to associate the smoothness property with a Gaussian graphical model where the precision matrix has the form of a graph Laplacian (Dong et al., 2016; Egilmez et al., 2016). The probabilistic model defined on the graph signal then naturally gives rise to a likelihood for the observed samples, and thus much of the literature in this area focuses on developing specialized algorithms to efficiently solve optimization problems that allow reconstruction of the underlying latent graph. For more information about graph learning and signal processing in general see the survey papers of Dong et al., 2019 and Mateos et al., 2019.
To position the present work in comparison to the ‘graph learning’ literature, our contributions are twofold. First, in population genetics, it is impractical to collect individual genotypes across all the geographic locations and, as a result, we often work with many, often the majority, of nodes having missing data. As far as we are aware, none of the work in graph signal processing considers this scenario and thus their algorithms are not directly applicable to our setting. In addition, if the number of observed nodes is much smaller than the number of nodes in the graph, one can project the large matrices associated with the graph to the space of observed nodes, therefore allowing for fast and efficient computation. Second, a high proportion of missing nodes in the observed signals can significantly degrade the quality of the reconstructed graph unless it is regularized properly. Motivated by the Voronoi-tessellation-based prior adopted in EEMS (Petkova et al., 2016), we propose regularization that encourages spatial smoothness in the edge weights.
Building on advances in graph learning, we introduce a method, Fast Estimation of Effective Migration Surfaces (FEEMS), that uses optimization to obtain penalized-likelihood-based estimates of effective migration parameters. In contrast to EEMS, which uses a node-specific parameterization of effective migration, we optimize over edge-specific parameters, allowing for more flexible migration processes to be fit, such as spatial anisotropy, in which the migration process is not invariant to rotation of the coordinate system (e.g. migration is more extensive along a particular axis). Although we developed this model as a Gaussian Markov Random Field, the resulting likelihood has key similarities to the EEMS model, in that it is a Wishart distribution that is a function of a genetic distance matrix. Expected genetic distances in both models can be interpreted as ‘resistance distances’ (McRae, 2006).
To fit the model, rather than using MCMC, we develop a fast quasi-Newton optimization algorithm (Nocedal and Wright, 2006) and a cross-validation approach for choosing the penalty parameter used in the penalized likelihood. We demonstrate the method using coalescent simulations and an application to a dataset of gray wolves from North America. The output is comparable to the results of EEMS but is provided in orders of magnitude less time. With this improvement in speed, FEEMS opens up the ability to perform fast exploratory data analysis of spatial population structure.
Results
Overview of FEEMS
Figure 1 shows a visual schematic of the FEEMS method. The input data are genotypes and spatial locations (e.g. latitudes and longitudes) for a set of individuals sampled across a geographic region. We construct a dense spatial grid embedded in geographic space where nodes represent subpopulations, and we assign individuals to nodes based on spatial proximity (see Appendix 1—figure 1 for a visualization of the grid construction and node assignment procedure). The density of the grid is user defined and must be explored to appropriately balance model misspecification and computational burden. As the density of the lattice increases, the model is similar to discrete approximations used for continuous spatial processes, but the increased density comes at the cost of computational complexity.
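The node assignment step can be illustrated with a minimal sketch. It assumes a small square grid and plain Euclidean distance on the coordinates, both chosen only for brevity (the analyses in this paper use a triangular lattice), and the function and variable names are hypothetical rather than taken from the FEEMS software.

```python
import numpy as np

def assign_to_nearest_node(sample_xy, node_xy):
    """Return, for each sample, the index of the closest grid node.

    Assumption: Euclidean distance on the raw coordinates is adequate;
    a real analysis may need a geodesic distance instead.
    """
    diff = sample_xy[:, None, :] - node_xy[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)   # (n_samples, n_nodes) squared distances
    return np.argmin(dist2, axis=1)

# Hypothetical 3 x 3 square grid; node index = 3 * row + column.
xs, ys = np.meshgrid(np.arange(3.0), np.arange(3.0))
node_xy = np.column_stack([xs.ravel(), ys.ravel()])

# Three made-up sample locations (x, y).
sample_xy = np.array([[0.1, 0.2], [1.9, 0.1], [1.2, 1.8]])

node_index = assign_to_nearest_node(sample_xy, node_xy)
# Per-node sample sizes; most nodes end up unobserved (count 0).
samples_per_node = np.bincount(node_index, minlength=len(node_xy))
```

Even in this toy case six of the nine nodes receive no samples, which mirrors the missing-node setting discussed in the Introduction.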
Details on the FEEMS model are described in the Materials and methods section. At a high level, we assume exchangeability of individuals within each subpopulation and estimate allele frequencies, ${\widehat{f}}_{j}\left(k\right)$, for each subpopulation, indexed by $k$, and single nucleotide polymorphism (SNP), indexed by $j$, under a simple Binomial sampling model. We also use the recorded sample sizes at each node to model the precision of the estimated allele frequency. With the estimated allele frequencies in hand, we model the data at each SNP using an approximate Gaussian model whose covariance is, up to constant factors, shared across all SNPs. In other words, after rescaling by SNP-specific variation factors, we assume that the set of observed frequencies at each SNP is an independent realization of the same spatial process. The latent frequency variables, ${f}_{j}\left(k\right)$, are modeled as a Gaussian Markov Random Field (GMRF) with a sparse precision matrix determined by the graph Laplacian and a set of residual variances that vary across SNPs. The pseudoinverse of the graph Laplacian in a GMRF is inherently connected to the notion of resistance distance in an electrical circuit (Hanks and Hooten, 2013) that is often used in population genetics to model the genetic differentiation between subpopulations (McRae, 2006). The graph’s weighted edges, denoted by ${w}_{ij}$ between nodes $i$ and $j$, represent gene flow between the subpopulations (Friedman et al., 2008; Hanks and Hooten, 2013; Petkova et al., 2016). The Gaussian approximation has the advantage that we can analytically marginalize out the latent frequency variables. The resulting likelihood of the observed frequencies shares a number of similarities to that of EEMS (see Materials and methods).
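The link between the Laplacian pseudoinverse and resistance distance can be made concrete with a short sketch. It assumes the standard circuit-theory identity $R_{ij} = L^{+}_{ii} + L^{+}_{jj} - 2L^{+}_{ij}$, with edge weights acting as conductances, and uses a dense pseudoinverse for clarity; it is an illustration on a three-node path graph, not the sparse routine used by FEEMS, and the function name is hypothetical.

```python
import numpy as np

def resistance_distance(W):
    """Pairwise resistance distances from a symmetric weighted adjacency matrix.

    Assumption: weights are conductances, so a low weight means high resistance.
    """
    L = np.diag(W.sum(axis=1)) - W       # graph Laplacian
    L_pinv = np.linalg.pinv(L)           # Moore-Penrose pseudoinverse
    d = np.diag(L_pinv)
    return d[:, None] + d[None, :] - 2.0 * L_pinv

# Path graph 0 - 1 - 2 with unit weights: resistors in series.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
R = resistance_distance(W)               # R[0, 1] is 1.0 and R[0, 2] is 2.0

# Lowering the weight on edge (1, 2) mimics a barrier to migration.
W_barrier = W.copy()
W_barrier[1, 2] = W_barrier[2, 1] = 0.1
R_barrier = resistance_distance(W_barrier)
```

Reducing a single edge weight increases the distance between every pair of nodes separated by that edge, which is the behavior an effective migration surface is meant to visualize.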
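To prevent overfitting, we use penalized maximum likelihood to estimate the edge weights of the graph. Our overall goal is thus to solve the following optimization problem:
$$\widehat{\mathit{\bm{w}}}=\underset{\mathit{\bm{l}}\le \mathit{\bm{w}}\le \mathit{\bm{u}}}{\mathrm{arg\,min}}\;\mathrm{\ell}\left(\mathit{\bm{w}}\right)+{\varphi}_{\lambda}\left(\mathit{\bm{w}}\right),$$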
where $\mathit{\bm{w}}$ is a vector that stores all the unique elements of the weighted adjacency matrix, $\mathit{\bm{l}}$ and $\mathit{\bm{u}}$ are elementwise nonnegative lower and upper bounds for $\mathit{\bm{w}}$, $\mathrm{\ell}\left(\mathit{\bm{w}}\right)$ is the negative log-likelihood function that comes from the GMRF model described above, and ${\varphi}_{\lambda}\left(\mathit{\bm{w}}\right)$ is a penalty that determines how smooth the output migration surface will be, with its strength set by the hyperparameter $\lambda >0$. Writing $\mathcal{V}$ to denote the set of nodes in the graph and $\mathcal{E}\left(i\right)\subset \mathcal{V}$ to denote the subset of nodes that have edges connected to node $i$, our penalty is given by
$${\varphi}_{\lambda}\left(\mathit{\bm{w}}\right)=\frac{\lambda}{2}{\sum}_{i\in \mathcal{V}}{\sum}_{k,\mathrm{\ell}\in \mathcal{E}\left(i\right)}{\left(\mathrm{log}\left({e}^{{w}_{ik}/{\widehat{w}}_{0}}-1\right)-\mathrm{log}\left({e}^{{w}_{i\mathrm{\ell}}/{\widehat{w}}_{0}}-1\right)\right)}^{2}.$$
This function serves to penalize large differences between the weights ${w}_{ik}$ and ${w}_{i\mathrm{\ell}}$ on edges that are adjacent, that is, penalizing differences for any pair of edges that share a common node. The tuning parameter λ controls the overall strength of the penalization placed on the output of the migration surface: if λ is large, the fitted surface will favor a homogeneous set of inferred migration weights on the graph, while if λ is low, more flexible graphs can be fitted to recover richer local structure, but this suffers from the potential for overfitting. The tuning parameter λ is selected by evaluating the model’s performance at predicting allele frequencies at held-out locations using leave-one-out cross-validation (see Materials and methods ‘Leave-one-out cross-validation to select tuning parameters’).
The scale parameter ${\widehat{w}}_{0}$ is chosen by first fitting a ‘constant-$w$’ model, which is a spatially homogeneous isolation-by-distance model constrained to have a single $w$ value for all edges. In the ${\varphi}_{\lambda}$ penalty, for adjacent edges $(i,k)$ and $(i,\mathrm{\ell})$, if ${w}_{ik}$ and ${w}_{i\mathrm{\ell}}$ are large (relative to ${\widehat{w}}_{0}$) then the corresponding term of the penalty is approximately proportional to ${\left({w}_{ik}-{w}_{i\mathrm{\ell}}\right)}^{2}$, penalizing differences among neighboring edges on a linear scale; if instead ${w}_{ik}$ and ${w}_{i\mathrm{\ell}}$ are small relative to ${\widehat{w}}_{0}$, then the penalty is approximately proportional to ${\left(\mathrm{log}\left({w}_{ik}\right)-\mathrm{log}\left({w}_{i\mathrm{\ell}}\right)\right)}^{2}$, penalizing differences on a logarithmic scale. In fact, it is also possible to treat this scale parameter as a second tuning parameter: we can define a penalty function ${\varphi}_{\lambda ,\alpha}\left(\mathit{\bm{w}}\right)=\frac{\lambda}{2}{\sum}_{i\in \mathcal{V}}{\sum}_{k,\mathrm{\ell}\in \mathcal{E}\left(i\right)}{\left(\mathrm{log}\left({e}^{\alpha {w}_{ik}}-1\right)-\mathrm{log}\left({e}^{\alpha {w}_{i\mathrm{\ell}}}-1\right)\right)}^{2}$, and explore the solution across different values of both λ and α. However, we find empirically that choosing $\alpha =1/{\widehat{w}}_{0}$ offers good performance as well as an intuitive interpretation (i.e. scaling edge weights ${w}_{ik}$ with reference to the constant-$w$ model), and allows us to avoid the computational burden of searching a two-dimensional tuning parameter space.
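A direct, unoptimized evaluation of the penalty ${\varphi}_{\lambda ,\alpha}$ may help make the double sum and its two regimes concrete. The sketch below assumes a dense symmetric weight matrix in which an edge is present exactly when its weight is strictly positive; the function names are hypothetical, and the loop is written for readability rather than speed.

```python
import numpy as np

def log_expm1(x):
    """Compute log(exp(x) - 1) stably as x + log(1 - exp(-x)), for x > 0."""
    return x + np.log(-np.expm1(-x))

def smoothness_penalty(W, lam, alpha):
    """Evaluate (lam / 2) * sum_i sum_{k, l in E(i)} (g(w_ik) - g(w_il))^2,
    where g(w) = log(exp(alpha * w) - 1).

    Assumption: W is symmetric and W[i, k] > 0 exactly when (i, k) is an edge.
    """
    total = 0.0
    for i in range(W.shape[0]):
        nbrs = np.flatnonzero(W[i])              # neighbors of node i
        g = log_expm1(alpha * W[i, nbrs])
        total += np.sum((g[:, None] - g[None, :]) ** 2)   # all ordered pairs
    return 0.5 * lam * total

# Path graph 0 - 1 - 2: only node 1 has two incident edges.
W = np.array([[0.0, 2.0, 0.0],
              [2.0, 0.0, 0.5],
              [0.0, 0.5, 0.0]])
lam, alpha = 1.5, 1.0
value = smoothness_penalty(W, lam, alpha)

# Limiting behavior of g: linear for large arguments, logarithmic for small ones.
large_regime = log_expm1(50.0)     # close to 50.0
small_regime = log_expm1(1e-6)     # close to log(1e-6)
```

Because the inner sum runs over ordered pairs, each pair of adjacent edges is counted twice, so the factor of one half leaves a net contribution of λ times the squared difference per pair.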
We use sparse linear algebra routines to efficiently compute the objective function and gradient of our parameters, allowing for the use of widely applied quasi-Newton optimization algorithms (Nocedal and Wright, 2006) implemented in standard numerical computing libraries like scipy (Virtanen et al., 2020) (RRID:SCR_008058). See the Materials and methods section for a detailed description of the statistical models and algorithms used.
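The general pattern of box-constrained quasi-Newton optimization with an analytic gradient can be sketched with scipy. The example below uses L-BFGS-B as one such routine and replaces the FEEMS likelihood with a toy quadratic data-fit term plus a squared-difference smoothness penalty along a chain of edges; the objective, the bounds, and all names are illustrative assumptions, not the model described in this paper.

```python
import numpy as np
from scipy.optimize import minimize

def toy_objective(w, target, lam):
    """Stand-in objective: quadratic fit plus smoothness penalty on neighbors.

    Returns the value and the analytic gradient, as expected when jac=True.
    """
    diff = np.diff(w)
    value = 0.5 * np.sum((w - target) ** 2) + 0.5 * lam * np.sum(diff ** 2)
    grad = w - target
    grad[:-1] -= lam * diff
    grad[1:] += lam * diff
    return value, grad

def fit(target, lam, lower, upper):
    """Minimize the toy objective subject to lower <= w <= upper."""
    w_init = np.full_like(target, target.mean())   # constant initialization
    bounds = [(lower, upper)] * target.size
    return minimize(toy_objective, w_init, args=(target, lam),
                    jac=True, method="L-BFGS-B", bounds=bounds)

# Made-up "edge weights" with one outlying value.
target = np.array([0.2, 0.3, 5.0, 0.25, 0.2])
rough = fit(target, lam=0.0, lower=0.1, upper=3.0)     # no smoothing
smooth = fit(target, lam=100.0, lower=0.1, upper=3.0)  # strong smoothing
```

With no penalty the solution is the target clipped to the bounds, while a large λ pulls all weights toward a common value, echoing the role of λ described above.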
Evaluating FEEMS on ‘out of model’ coalescent simulations
While our statistical model is not directly based on a population genetic process, it is useful to see how it performs on simulated data under the coalescent stepping stone model (Figure 2, also see Appendix 1—figure 2 for additional scenarios). In these simulations we know, by construction, the model we fit (FEEMS) is different from the true model we simulate data under (the coalescent), allowing us to assess the robustness of the fit to a controlled form of model misspecification.
The first migration scenario (Figure 2A–C) is a spatially homogeneous model where all the migration rates are set to a constant value on the graph, which is equivalent to simulating data under a homogeneous isolation-by-distance model. In the second migration scenario (Figure 2D–F), we simulate a non-homogeneous process by representing a geographic barrier to migration, lowering the migration rates by a factor of 10 in the center of the habitat relative to the left and right regions of the graph. Finally, in the third migration scenario (Figure 2G–I), we simulate a pattern that corresponds to anisotropic migration, with edges that point east/west being assigned a fivefold higher migration rate than edges pointing north/south. For each migration scenario, we simulate two sampling designs. In the first ‘dense-sampling’ design (Figure 2B,E,H) we sample individuals for every node of the graph. Next, in the ‘sparse-sampling’ design (Figure 2C,F,I) we sample individuals for only a randomly selected 20% of the nodes.
For each coalescent simulation, we used leave-one-out cross-validation (at the level of sampled nodes) to select the smoothness parameter λ. In the homogeneous migration simulations, the best value for the smoothness parameter, as determined by the grid value with the lowest leave-one-out cross-validation error, is ${\lambda}_{\text{cv}}=100$ in both sampling scenarios with complete and missing data. In the heterogeneous migration simulations ${\lambda}_{\text{cv}}=0.298$ with no missing data and ${\lambda}_{\text{cv}}=37.927$ with missing data. Finally, in the anisotropic simulations with no missing data ${\lambda}_{\text{cv}}=0.298$ and with missing data ${\lambda}_{\text{cv}}=0.042$. We note the magnitude of the selected λ depends on the scale of the loss function so comparisons across different datasets are not generally interpretable.
With regard to the visualizations of effective migration, FEEMS performs best when all the nodes are sampled on the graph, that is, when there is no missing data (Figure 2B,E,H). Interestingly, in the simulated scenarios with many missing nodes, FEEMS can still partly recover the migration history, including the presence of anisotropic migration (Figure 2I). A sampling scheme with a central gap leads to a slightly narrower barrier in the heterogeneous migration scenario (Appendix 1—figure 2I) and for the anisotropic scenario, a degree of over-smoothness in the northern and southern regions of the center of the graph (Appendix 1—figure 2N). For the missing at random sampling design, FEEMS is able to recover the relative edge weights surprisingly well for all scenarios, with the inference being the most challenging when there is anisotropic migration. The potential for FEEMS to recover anisotropic migration is novel relative to EEMS, which was parameterized for fitting non-stationary isotropic migration histories and produces banding patterns perpendicular to the axis of migration when applied to data from anisotropic coalescent simulations (Petkova et al., 2016, Supplementary Figure 2; see also Appendix 1 ‘Edge versus node parameterization’ for a related discussion). Overall, even with sparsely sampled graphs, FEEMS is able to produce visualizations that qualitatively capture the migration history in coalescent simulations.
Application of FEEMS to genotype data from North American gray wolves
To assess the performance of FEEMS on real data, we used a previously published dataset of 111 gray wolves sampled across North America typed at 17,729 SNPs (Schweizer et al., 2016; Appendix 1—figure 5). This dataset has a number of advantageous features that make it a useful test case for evaluating FEEMS: (1) The broad sampling range across North America includes a number of relevant geographic features that, a priori, could conceivably lead to restricted gene flow averaged throughout the population history. These geographic features include mountain ranges, lakes, and islands. (2) The scale of the data is consistent with many studies for non-model systems whose spatial population structure is of interest. For instance, the relatively sparse sampling leads to a challenging statistical problem where there is the potential for many unobserved nodes (subpopulations), depending on the density of the grid chosen.
Before applying FEEMS, we confirmed a signature of spatial structure in the data through regressing genetic distances on geographic distances and top genetic PCs against geographic coordinates (Appendix 1—figures 6, 7, 8, 9). We also ran multiple replicates of ADMIXTURE for $K=2$ to $K=8$, selecting for each $K$ the highest likelihood run among replicates to visualize (Appendix 1—figure 10). As expected in a spatial genetic dataset, nearby samples have similar admixture proportions and continuous gradients of changing ancestries are spread throughout the map (Bradburd et al., 2018). Whether such gradients in admixture coefficients are due to isolation by distance or specific geographic features that enhance or diminish the levels of genetic differentiation is an interpretive challenge. Explicitly modeling the spatial locations and genetic distance jointly using a method like EEMS or FEEMS is exactly designed to explore these types of questions in the data (Petkova, 2013; Petkova et al., 2016).
We first show FEEMS results for four different values of the smoothness parameter λ, from large $\lambda =100$ to small $\lambda =0.0008$ (Figure 3). One interpretation of our regularization penalty is that it encourages fitting models of homogeneous and isotropic migration. When λ is very large (Figure 3A), we see FEEMS fits a model where all of the edge weights on the graph nearly equal the mean value, hence all the edge weights are colored white in the relative log-scale. In this case, FEEMS is fitting a relatively homogeneous migration model where all the estimated edge weights get assigned nearly the same value on the graph. As we sequentially lower the penalty parameter (Figure 3B,C,D), the fitted graph begins to appear more complex and heterogeneous, as expected (discussed further below). Figure 3E shows the cross-validation error for a predefined grid of λ values (also see Appendix 1—figure 6 for visualizations of the fitted versus genetic distance on the full dataset).
The cross-validation approach finds the optimal value of λ to be 2.06. This solution visually appears to have a moderate level of regularization and aligns with several known landscape features (Figure 4). Spatial features in the FEEMS visualization qualitatively match the structure plot output from ADMIXTURE using $K=6$ (Appendix 1—figure 10). We add labels on the figure to highlight a number of pertinent features: (A) St. Lawrence Island, (B) the coastal islands and mountain ranges in British Columbia, (C) the boundary of Boreal Forest and Tundra ecoregions in the Shield Taiga, (D) Queen Elizabeth Islands, (E) Hudson Bay, and (F) Baffin Island. Many of these features were described in Schweizer et al., 2016 by interpretation of ADMIXTURE, PCA, and ${F}_{ST}$ statistics. FEEMS is able to succinctly provide an interpretable view of these data in a single visualization. Indeed, many of these geographic features plausibly impact gray wolf dispersal and population history (Schweizer et al., 2016).
Comparison to EEMS
We also ran EEMS on the same gray wolf dataset. We used default parameters provided by EEMS but set the number of burn-in iterations to $20\times {10}^{6}$, MCMC iterations to $50\times {10}^{6}$, and thinning intervals to 2000. We were unable to run EEMS in a reasonable run time ($\le 3$ days) for the dense spatial grid of 1207 nodes, so we ran EEMS and FEEMS on a sparser graph with 307 nodes.
We find that FEEMS is multiple orders of magnitude faster than EEMS, even when running multiple runs of FEEMS for different regularization settings on both the sparse and dense graphs (Table 1). We note that constructing the graph and fitting the model with very low regularization parameters are the most computationally demanding steps in running FEEMS.
We find that many of the same geographic features that have reduced or enhanced gene flow are concordant between the two methods. The EEMS visualization, qualitatively, best matches solutions of FEEMS with lower λ values (Figure 4, Appendix 1—figure 11); however, based on the ADMIXTURE results, visual inspection in relation to known geographical features, and inspection of the observed vs fitted dissimilarity values (Appendix 1—figures 14, 22), we find these solutions to be less satisfying compared to the FEEMS solution found with λ chosen by leave-one-out cross-validation. We note that in many of the EEMS runs the MCMC appears to not have converged (based on visual inspection of trace plots) even after a large number of iterations.
Discussion
FEEMS is a fast approach that provides an interpretable view of spatial population structure in real datasets and simulations. We want to emphasize that beyond being a fast optimization approach for inferring population structure, our parameterization of the likelihood opens up a number of exciting new directions for improving spatial population genetic inference. Notably, one major difference between EEMS and FEEMS is that in FEEMS each edge weight is assigned its own parameter to be estimated, whereas in EEMS each node is assigned a parameter and each edge is constrained to be the average effective migration between the nodes it connects (see Materials and methods and Appendix 1 ‘Edge versus node parameterization’ for details). The node-based parameterization in EEMS makes it difficult to incorporate anisotropy and asymmetric migration (Lundgren and Ralph, 2019). As we have shown here, FEEMS’s simple and novel parameterization already has potential to fit anisotropic migration (as shown in coalescent simulations) and may be extendable to other more complex migration processes (such as long-range migration, see below).
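A small numerical example illustrates why averaging node parameters restricts the migration patterns that can be represented. The sketch assumes a hypothetical 2 x 2 square lattice with a fivefold east/west versus north/south contrast, and asks, via least squares, how closely any set of node parameters can reproduce those edge weights when each edge is the mean of its two endpoints.

```python
import numpy as np

# Hypothetical 2 x 2 lattice: nodes 0, 1 on the top row and 2, 3 on the bottom row.
edges = [(0, 1), (2, 3),   # east/west edges
         (0, 2), (1, 3)]   # north/south edges

# Node parameterization: each edge weight is the mean of its endpoint parameters.
A = np.zeros((len(edges), 4))
for row, (i, j) in enumerate(edges):
    A[row, i] = A[row, j] = 0.5

# Anisotropic target: east/west weights five times the north/south weights.
target = np.array([5.0, 5.0, 1.0, 1.0])

# Best node parameters in the least-squares sense, and the edge weights they imply.
m_hat, *_ = np.linalg.lstsq(A, target, rcond=None)
fitted = A @ m_hat
misfit = np.linalg.norm(target - fitted)
```

On this lattice the east/west and north/south averages both sum to half the total of the node parameters, so the closest achievable fit assigns the same weight to every edge and the anisotropy is lost. An edge-specific parameterization represents the target exactly.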
One general challenge, which is not unique to this method, is selecting the tuning parameters controlling the strength of regularization (λ in our case). A natural approach is to use cross-validation, which estimates the out-of-sample fit of FEEMS for a particular choice of λ. We used leave-one-out cross-validation, leaving one sampled population out at a time, and find such an approach works well based on the coalescent simulations and application to the North American wolf data. That said, we sometimes found high variability in the selected solution when we used cross-validation with fewer folds (e.g. fivefold versus leave-one-out, results not shown). We expect this happens when the number of sampled populations is small relative to the complexity of the gene flow landscape, and we recommend using leave-one-out cross-validation in general. We also find it useful to fit FEEMS to a sequential grid of regularization parameters and to look at what features are consistent or vary across multiple fits. Informally, one can gain an indication of the strongest features in the data by looking at the order in which they appear in the regularization path, that is, which features overcome the strong penalization of smoothness and are highly supported by the likelihood. For example, early in the regularization path, we see regions of reduced gene flow occurring on the west coast of Canada that presumably correspond to coastal mountain ranges and islands in British Columbia (Figure 3B), and this reduced gene flow appears throughout more flexible fits with lower λ.
An important caveat is that the objective function we optimize is nonconvex, so any visualization output by FEEMS should be considered a local optimum, keeping in mind that a different initialization could give different results. That said, for the datasets investigated, we found the output visualizations were not sensitive to initialization, and thus our default setting is constant initialization fitted under a homogeneous isolation by distance model (see Materials and methods).
When comparing to EEMS, we found FEEMS to be much faster (Table 1). While this is encouraging, care must be taken because the goals and outputs of FEEMS and EEMS have a number of differences. FEEMS fits a sequential grid of solutions for different regularization parameters, whereas EEMS infers a posterior distribution and outputs the posterior mean as a point estimate. FEEMS is not a Bayesian method and, unlike EEMS, which explores the entire landscape of the posterior distribution, FEEMS returns a particular point estimate: a local minimum point of the optimization landscape. Setting the prior hyperparameters in EEMS acts somewhat like a choice of the tuning parameter λ, except that EEMS uses hierarchical priors that in principle allow for exploration of multiple scales of spatial structure in a single run, at the cost of potentially long computation times for adequate MCMC convergence.
Like EEMS, FEEMS is based on an assumed underlying spatial graph of populations exchanging gene flow with neighboring populations. While the inferred migration rates explain the data under an assumed model, it is important for users and readers of FEEMS results to keep in mind the range and density of the chosen grid when interpreting results. We note that using a denser grid has the two potential advantages of providing improved approximation for continuously distributed species, as well as a more flexible model space to fit the data.
Depending on the scale of the analysis and the life history of the species, the process of assuming and assigning a single geographic location for each individual is a potential limitation of the modeling framework used here. For instance, the North American wolves studied here are understood to be generally territorial with individual ranges that are on the scale of 10^{3} km^{2} (Burch et al., 2005), which is small relative to the greater than 10^{6} km^{2} scale of our analysis. Thus, modeling individual wolves with single locations may not generally be problematic. However, at the boundary of the Boreal forest and Tundra, there are wolves with larger annual ranges and seasonal migrations that track caribou herds roughly northsouth over distances of 1000 km (Musiani et al., 2007), and the wolves in the study were sampled in the winter (Musiani et al., 2007; Schweizer et al., 2016). If the samples were instead obtained in the summer, the position of the inferred low migration feature near the boundary of the Boreal Forest (marked 'C' in Figure 4) would presumably shift northward. The general cautionary lesson is that one must be careful when interpreting these maps to consider the life history of dispersal for the organism under study during the interpretation of results. Extending the methodology to incorporate knowledge of uncertainty in position or known dispersal may be an interesting direction for future work.
One natural extension to FEEMS, pertinent to a number of biological systems, is incorporating longrange migration (Pickrell and Pritchard, 2012; Bradburd et al., 2016). In this work, we have used a triangular lattice embedded in geographic space and enforced smoothness in nearby edge weights through penalizing their squared differences (see Materials and methods). We could imagine changing the structure of the graph by adding edges to allow for longrange connections; however, our current regularization scheme would not be appropriate for this setting. Instead, we could imagine adding an additional penalty to the objective, which would only allow a few long range connections to be tolerated. This could be considered to be a combination of two existing approaches for graphbased inference, graphical lasso (GLASSO) and graph Laplacian smoothing, combining the smoothness assumption for nearby connections and the sparsity assumption for longrange connections (Friedman et al., 2008; Wang et al., 2016). Another potential methodological avenue to incorporate longrange migration is to use a ‘greedy’ approach. We could imagine adding longrange edges one a time, guided by refitting the spatial model and taking a datadriven approach to select particular longrange edges to include. The proposed greedy approach could be considered to be a spatial graph analog of TreeMix (Pickrell and Pritchard, 2012).
Another interesting extension would be to incorporate asymmetric migration into the framework of resistance distance and Gaussian Markov Random Field based models. FEEMS, like EEMS, uses a likelihood that is based on resistance distances, which are limited in their ability to model asymmetric migration (Lundgren and Ralph, 2019). Recently, Hanks, 2015 developed a promising new framework for deriving the stationary distribution of a continuous time stochastic process with asymmetric migration on a spatial graph. Interestingly, the expected distance of this process has similarities to the resistance distance-based models, in that it depends on the pseudoinverse of a function of the graph Laplacian. Hanks, 2015 used MCMC to estimate the effect of known covariates on the edge weights of the spatial graph. Future work could adapt this framework into the penalized optimization approach we have considered here, where adjacent edge weights are encouraged to be smooth.
Finally, when interpreted as mechanistic rather than statistical models, both EEMS and FEEMS implicitly assume time-stationarity, so the estimated migration parameters should be considered ‘effective’ in the sense of being averaged over time in a reality where migration rates are dynamic and changing (Pickrell and Reich, 2014). The MAPS method is one recent advance that utilizes long stretches of shared haplotypes between pairs of individuals to perform Bayesian inference of time-varying migration rates and population sizes (Al-Asadi et al., 2019). With the growing ability to extract high-quality DNA from ancient samples, another exciting future direction would be to apply FEEMS to ancient DNA datasets over different time transects in the same focal geographic region to elucidate changing migration histories (Mathieson et al., 2018). There are a number of technical challenges in ancient DNA data that make this a difficult problem, particularly high levels of missing and low-coverage data. Our modeling approach could potentially be more robust, in that it takes allele frequencies as input, which may be estimable from dozens of ancient samples at the same spatial location, in spite of high degrees of missingness (Korneliussen et al., 2014).
In closing, we look back to a review titled ‘How Can We Infer Geography and History from Gene Frequencies?’ published in 1982 (Felsenstein, 1982). In this review, Felsenstein laid out fundamental open problems in statistical inference in population genetic data, a few of which we restate as they are particularly motivating for our work:
For any given covariance matrix, is there a corresponding migration matrix which would be expected to lead to it? If so, how can we find it?
How can we characterize the set of possible migration matrices which are compatible with a given set of observed covariances?
How can we confine our attention to migration patterns which are consistent with the known geometric coordinates of the populations?
How can we make valid statistical estimates of parameters of stepping stone models?
The methods developed here aim to help address these longstanding problems in statistical population genetics and to provide a foundation for future work to elucidate the role of geography and dispersal in ecological and evolutionary processes.
Materials and methods
Model description
See Appendix 1 ‘Mathematical notation’ for a detailed description of the notation used to describe the model. To visualize and model spatial patterns in a given population genetic dataset, FEEMS uses an undirected graph, $\mathcal{G}=(\mathcal{V},\mathcal{E})$ with $\left|\mathcal{V}\right|=d$, where nodes represent subpopulations and edge weights ${\left({w}_{k\ell}\right)}_{(k,\ell)\in \mathcal{E}}$ represent the level of gene flow between subpopulations $k$ and $\ell$. For computational convenience, we assume $\mathcal{G}$ is a highly sparse graph, specifically a triangular grid that is embedded in geographic space around the sample coordinates. We observe a genotype matrix, $\bm{Y}\in {\mathbb{R}}^{n\times p}$, with $n$ rows representing individuals and $p$ columns representing SNPs. We imagine diploid individuals are sampled on the nodes of $\mathcal{G}$ so that ${y}_{ij}\left(k\right)\in \{0,1,2\}$ records the count of some arbitrarily predefined allele in individual $i$, SNP $j$, on node $k\in \mathcal{V}$. We assume a commonly used simple Binomial sampling model for the genotypes:
$$
y_{ij}(k)\mid f_{j}(k)\sim \text{Binomial}\left(2, f_{j}(k)\right), \tag{1}
$$
where conditional on ${f}_{j}\left(k\right)$ for all $j,k$, the ${y}_{ij}\left(k\right)$’s are independent. We then estimate an allele frequency at each node and SNP by maximum likelihood:
$$
\widehat{f}_{j}(k)=\frac{1}{2 n_{k}}\sum_{i=1}^{n_{k}} y_{ij}(k),
$$
where ${n}_{k}$ is the number of individuals sampled at node $k$. We estimate allele frequencies at the $o$ observed nodes out of $d$ total nodes on the graph. From Equation (1), the estimated frequency in a particular subpopulation, conditional on the latent allele frequency, will approximately follow a Gaussian distribution:
$$
\widehat{f}_{j}(k)\mid f_{j}(k)\sim N\left(f_{j}(k),\ \frac{f_{j}(k)\left(1-f_{j}(k)\right)}{2 n_{k}}\right).
$$
Using vector notation, we represent the joint model of estimated allele frequencies as:
$$
\widehat{\bm{f}}_{j}\mid \bm{f}_{j}\sim N_{o}\left(\bm{A}\bm{f}_{j},\ \text{diag}\left(\bm{d}_{\bm{f},\bm{n}}\right)\right), \tag{2}
$$
where $\widehat{\bm{f}}_{j}$ is an $o\times 1$ vector of estimated allele frequencies at observed nodes, $\bm{f}_{j}$ is a $d\times 1$ vector of latent allele frequencies at all the nodes (both observed and unobserved), and $\bm{A}$ is an $o\times d$ node assignment matrix where $\bm{A}_{k\ell}=1$ if the $k$th estimated allele frequency comes from subpopulation $\ell$ and $\bm{A}_{k\ell}=0$ otherwise; $\text{diag}\left(\bm{d}_{\bm{f},\bm{n}}\right)$ denotes an $o\times o$ diagonal matrix whose diagonal elements correspond to the appropriate variance term at observed nodes.
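As an illustration of the sampling model and the assignment matrix, the following minimal sketch computes per-node allele frequency estimates, sample sizes, and $\bm{A}$ from a small genotype matrix. The function name `estimate_node_frequencies` and the `node_of_individual` input are hypothetical and are not taken from the FEEMS package.

```python
import numpy as np

def estimate_node_frequencies(Y, node_of_individual, d):
    """Y: (n, p) allele counts in {0, 1, 2}; node_of_individual: (n,) node index
    of each individual; d: total number of nodes on the graph (assumed inputs)."""
    observed = np.unique(node_of_individual)   # sorted indices of observed nodes
    o, p = observed.size, Y.shape[1]
    F_hat = np.zeros((o, p))                   # estimated allele frequencies
    n = np.zeros(o, dtype=int)                 # sample size per observed node
    A = np.zeros((o, d))                       # node assignment matrix
    for row, k in enumerate(observed):
        members = node_of_individual == k
        n[row] = members.sum()
        # Binomial MLE for diploids: allele count divided by 2 * n_k
        F_hat[row] = Y[members].sum(axis=0) / (2.0 * n[row])
        A[row, k] = 1.0
    return F_hat, n, A

# Three individuals, two SNPs; two individuals on node 3 and one on node 0.
Y = np.array([[0, 2], [1, 2], [2, 0]])
node_of_individual = np.array([3, 3, 0])
F_hat, n, A = estimate_node_frequencies(Y, node_of_individual, d=5)
```

Each row of `A` has a single 1 marking the graph node that the corresponding row of `F_hat` was estimated from.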
To summarize, we estimate allele frequencies from a subset of nodes on the graph and define latent allele frequencies for all the nodes of the graph. The assignment matrix $\bm{A}$ maps these latent allele frequencies to our observations. Our summary statistics (the data) are thus $(\widehat{\bm{F}},\bm{n})$ where $\widehat{\bm{F}}$ is an $o\times p$ matrix of estimated allele frequencies and $\bm{n}$ is an $o\times 1$ vector of sample sizes for every observed node. We assume the latent allele frequencies come from a Gaussian Markov Random Field:
$$
\bm{f}_{j}\sim N_{d}\left(\mu_{j}\mathbf{1},\ \mu_{j}\left(1-\mu_{j}\right)\bm{L}^{\dagger}\right), \tag{3}
$$
where $\bm{L}$ is the graph Laplacian, $\dagger$ represents the pseudoinverse operator, and $\mu_{j}$ represents the average allele frequency across all of the subpopulations. Note that the multiplication by the SNP-specific factor $\mu_{j}\left(1-\mu_{j}\right)$ ensures that the variance of the latent allele frequencies vanishes as the average allele frequency approaches 0 or 1. One interpretation of this model is that the expected squared Euclidean distance between latent allele frequencies on the graph, after being rescaled by $\mu_{j}\left(1-\mu_{j}\right)$, is exactly the resistance distance of an electrical circuit (McRae, 2006; Hanks and Hooten, 2013):
$$
\frac{\mathbb{E}\left[\left(f_{j}(k)-f_{j}(\ell)\right)^{2}\right]}{\mu_{j}\left(1-\mu_{j}\right)}=\left(\bm{o}_{k}-\bm{o}_{\ell}\right)^{\top}\bm{L}^{\dagger}\left(\bm{o}_{k}-\bm{o}_{\ell}\right)=r_{k\ell},
$$
where $\bm{o}_{i}$ is a one-hot vector (i.e. storing a 1 in element $i$ and zeros elsewhere). It is known that the resistance distance $r_{k\ell}$ is equivalent to the expected commute time between nodes $k$ and $\ell$ of a random walker on the weighted graph $\mathcal{G}$ (Chandra et al., 1996). Additionally, the model (Equation 3) forms a Markov random field, and thus any latent allele frequency $f_{j}(k)$ is conditionally independent of all other allele frequencies given its neighbors, which are encoded by the nonzero elements of $\bm{L}$ (Lauritzen, 1996; Koller and Friedman, 2009). Since we use a triangular grid embedded in geographic space to define the graph $\mathcal{G}$, the pattern of nonzero elements is fixed in advance by the structure of the sparse triangular grid.
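The resistance-distance interpretation can be checked numerically on a small graph. The sketch below uses a dense pseudoinverse for clarity, which is only practical for tiny graphs, and the helper names are illustrative. On a three-node path with unit weights, the two edges act like unit resistors in series.

```python
import numpy as np

def graph_laplacian(W):
    # L = diag(W 1) - W for a symmetric weighted adjacency matrix W
    return np.diag(W.sum(axis=1)) - W

def resistance_distance(L, k, l):
    # r_kl = (o_k - o_l)^T L^+ (o_k - o_l), with L^+ the Moore-Penrose pseudoinverse
    L_pinv = np.linalg.pinv(L)
    e = np.zeros(L.shape[0])
    e[k], e[l] = 1.0, -1.0
    return float(e @ L_pinv @ e)

# Path graph 0 - 1 - 2 with unit edge weights.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
L = graph_laplacian(W)
r01 = resistance_distance(L, 0, 1)
r02 = resistance_distance(L, 0, 2)
# Halving every edge weight (less gene flow) doubles the resistance distance.
r02_half = resistance_distance(graph_laplacian(0.5 * W), 0, 2)
```

Lower edge weights therefore translate directly into larger expected genetic distances between the nodes they separate.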
Using the law of total variance formula, we can derive from (Equations 2, 3) an analytic form for the marginal likelihood. Before proceeding, however, we further approximate the model by assuming $\frac{1}{2}f_{j}(k)\left(1-f_{j}(k)\right)\approx \sigma^{2}\mu_{j}\left(1-\mu_{j}\right)$ for all $j$ and $k$ (see Appendix 1 ‘Estimating the edge weights under the exact likelihood model’ for the data model without this approximation). This assumption is mainly for computational purposes and may be a coarse approximation in general. On the other hand, the assumption is not too strong if we exclude SNPs with extremely rare allele frequencies, and more importantly, we find it leads to good empirical performance, both statistically and computationally. With this approximation, the residual variance parameter $\sigma^{2}$ is still unknown and needs to be estimated.
Under (Equation 2, 3), the law of total variance formula leads to specific formulas for the mean and variance structure as given in (Equation 4). With those results, we arrive at the following approximate marginal likelihood:
$$
\widehat{\bm{f}}_{j}\sim N_{o}\left(\mu_{j}\mathbf{1},\ \mu_{j}\left(1-\mu_{j}\right)\left(\bm{A}\bm{L}^{\dagger}\bm{A}^{\top}+\sigma^{2}\,\text{diag}\left(\bm{n}^{-1}\right)\right)\right), \tag{4}
$$
where $\text{diag}\left(\bm{n}^{-1}\right)$ is an $o\times o$ diagonal matrix computed from the sample sizes at observed nodes. We note the marginal distribution of $\widehat{\bm{f}}_{j}$ is not necessarily a Gaussian distribution; however, we use a Gaussian approximation to facilitate computation.
To remove the SNP means we transform the estimated frequencies by a contrast matrix, $\bm{C}\in {\mathbb{R}}^{\left(o-1\right)\times o}$, that is orthogonal to the one-vector:
$$
\bm{C}\widehat{\bm{f}}_{j}\sim N_{o-1}\left(\mathbf{0},\ \mu_{j}\left(1-\mu_{j}\right)\bm{C}\left(\bm{A}\bm{L}^{\dagger}\bm{A}^{\top}+\sigma^{2}\,\text{diag}\left(\bm{n}^{-1}\right)\right)\bm{C}^{\top}\right). \tag{5}
$$
Let $\widehat{\mathbf{\Sigma}}=\frac{1}{p}\widehat{\bm{F}}_{\text{s}}\widehat{\bm{F}}_{\text{s}}^{\top}$ be the $o\times o$ sample covariance matrix of estimated allele frequencies after rescaling, that is, $\widehat{\bm{F}}_{\text{s}}$ is a matrix formed by rescaling the columns of $\widehat{\bm{F}}$ by $\sqrt{\widehat{\mu}_{j}\left(1-\widehat{\mu}_{j}\right)}$, where $\widehat{\mu}_{j}$ is an estimate of the average allele frequency (see above). We can then express the model in terms of the transformed sample covariance matrix:
$$
p\cdot \bm{C}\widehat{\mathbf{\Sigma}}\bm{C}^{\top}\sim \mathcal{W}_{p}\left(\bm{C}\left(\bm{A}\bm{L}^{\dagger}\bm{A}^{\top}+\sigma^{2}\,\text{diag}\left(\bm{n}^{-1}\right)\right)\bm{C}^{\top}\right), \tag{6}
$$
where $\mathcal{W}_{p}$ denotes a Wishart distribution with $p$ degrees of freedom. Note we can equivalently use the sample squared Euclidean distance (often referred to as a genetic distance) as a summary statistic: letting $\widehat{\bm{D}}$ be the genetic distance matrix with $\widehat{\bm{D}}_{k\ell}=\frac{1}{p}\sum_{j=1}^{p}\left(\widehat{f}_{j}(k)-\widehat{f}_{j}(\ell)\right)^{2}/\left(\widehat{\mu}_{j}\left(1-\widehat{\mu}_{j}\right)\right)$, we have
$$
\widehat{\bm{D}}=\mathbf{1}\,\text{diag}\left(\widehat{\mathbf{\Sigma}}\right)^{\top}+\text{diag}\left(\widehat{\mathbf{\Sigma}}\right)\mathbf{1}^{\top}-2\widehat{\mathbf{\Sigma}},
$$
and so
$$
\bm{C}\widehat{\bm{D}}\bm{C}^{\top}=-2\,\bm{C}\widehat{\mathbf{\Sigma}}\bm{C}^{\top},
$$
using the fact that the contrast matrix $\bm{C}$ is orthogonal to the one-vector. Thus, we can use the same spatial covariance model implied by the allele frequencies once we project the distances onto the space of contrasts:
$$
-\frac{p}{2}\,\bm{C}\widehat{\bm{D}}\bm{C}^{\top}\sim \mathcal{W}_{p}\left(\bm{C}\left(\bm{A}\bm{L}^{\dagger}\bm{A}^{\top}+\sigma^{2}\,\text{diag}\left(\bm{n}^{-1}\right)\right)\bm{C}^{\top}\right).
$$
Overall, the negative log-likelihood function implied by our spatial model is the following (ignoring constant terms):
$$
\ell\left(\bm{w};\bm{C}\widehat{\mathbf{\Sigma}}\bm{C}^{\top}\right)=p\cdot \text{tr}\left(\left(\bm{C}\left(\bm{A}\bm{L}^{\dagger}\bm{A}^{\top}+\sigma^{2}\,\text{diag}\left(\bm{n}^{-1}\right)\right)\bm{C}^{\top}\right)^{-1}\bm{C}\widehat{\mathbf{\Sigma}}\bm{C}^{\top}\right)-p\cdot \log\det\left(\left(\bm{C}\left(\bm{A}\bm{L}^{\dagger}\bm{A}^{\top}+\sigma^{2}\,\text{diag}\left(\bm{n}^{-1}\right)\right)\bm{C}^{\top}\right)^{-1}\right), \tag{7}
$$
where $\bm{w}\in {\mathbb{R}}^{m}$ is a vectorized form of the nonzero lower-triangular entries of the weighted adjacency matrix $\bm{W}$ (recall that the graph Laplacian is completely defined by the edge weights, $\bm{L}=\text{diag}\left(\bm{W}\mathbf{1}\right)-\bm{W}$, so there is an implicit dependency here). Since the graph is a triangular lattice, not all subpopulations are connected to each other, so we only need to consider the nonzero entries, which saves computational time.
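To make the likelihood concrete, here is a dense toy sketch of a Wishart-type negative log-likelihood of the form $p\cdot\text{tr}(\bm{M}^{-1}\bm{S})+p\cdot\log\det\bm{M}$, with $\bm{M}$ the contrasted model covariance and $\bm{S}$ the contrasted sample covariance. The particular contrast matrix, the dropped constants, and all function names are assumptions made for illustration; the actual implementation exploits sparsity rather than forming a dense pseudoinverse.

```python
import numpy as np

def contrast_matrix(o):
    # Rows e_i - e_o for i < o; every row sums to zero, so C @ 1 = 0.
    return np.hstack([np.eye(o - 1), -np.ones((o - 1, 1))])

def model_covariance(W, A, n, sigma2):
    # A L^+ A^T + sigma^2 diag(1/n), computed densely (toy sizes only)
    L = np.diag(W.sum(axis=1)) - W
    return A @ np.linalg.pinv(L) @ A.T + sigma2 * np.diag(1.0 / n)

def neg_log_lik(W, A, n, sigma2, Sigma_hat, p):
    C = contrast_matrix(A.shape[0])
    M = C @ model_covariance(W, A, n, sigma2) @ C.T   # model covariance of contrasts
    S = C @ Sigma_hat @ C.T                           # sample covariance of contrasts
    _, logdet_M = np.linalg.slogdet(M)
    # p * tr(M^{-1} S) - p * log det(M^{-1}), constants ignored
    return p * np.trace(np.linalg.solve(M, S)) + p * logdet_M

# Toy example: 4-node cycle with unit weights; nodes 0, 1 and 3 are observed.
W = np.zeros((4, 4))
for k, l in [(0, 1), (1, 2), (2, 3), (0, 3)]:
    W[k, l] = W[l, k] = 1.0
A = np.eye(4)[[0, 1, 3]]
n = np.array([2.0, 1.0, 3.0])
Sigma_hat = model_covariance(W, A, n, sigma2=0.1)     # noiseless "data"
nll_true = neg_log_lik(W, A, n, 0.1, Sigma_hat, p=100)
nll_off = neg_log_lik(2.0 * W, A, n, 0.1, Sigma_hat, p=100)
```

When the sample covariance equals the model covariance, the weights that generated it give a lower value than rescaled weights, as expected for this objective.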
We note our model (Equation 6) assumes that the $p$ SNPs are independent. This assumption is unlikely to hold when datasets are analyzed with SNPs that statistically covary (linkage disequilibrium). However, we note that the degrees-of-freedom parameter does not affect the point estimate produced by FEEMS because it is treated as a constant term in the log-likelihood function.
One key difference between EEMS (Petkova et al., 2016) and FEEMS is how the edge weights are parameterized. In EEMS, each node $k\in \mathcal{V}$ is given an effective migration parameter $m_{k}$ and each edge weight is parameterized as the average over the nodes it connects, that is, $w_{k\ell}=\left(m_{k}+m_{\ell}\right)/2$ for $(k,\ell)\in \mathcal{E}$. FEEMS, on the other hand, assigns a parameter to every nonzero edge weight. The former has fewer parameters, with the specific consequence that it only allows isotropy and imposes an additional degree of similarity among edge weights; in the latter, the edge weights are free to vary apart from the regularization imposed by the penalty. See Appendix 1 ‘Edge versus node parameterization’ and Appendix 1—figures 15, 17 for more details.
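The difference between the two parameterizations can be seen on a four-node path. This is a small illustrative sketch with made-up values, not code from either software package.

```python
import numpy as np

# Path graph 0 - 1 - 2 - 3, edges stored as (k, l) index pairs.
edges = np.array([[0, 1], [1, 2], [2, 3]])

# Node parameterization (EEMS-style): one migration parameter per node,
# and each edge weight is the average of its two endpoint parameters.
m_node = np.array([1.0, 1.0, 0.1, 0.1])
w_node_param = 0.5 * (m_node[edges[:, 0]] + m_node[edges[:, 1]])

# Edge parameterization (FEEMS-style): one free parameter per edge,
# tied together only through the smoothness penalty.
w_edge_param = np.array([1.0, 0.01, 0.1])
```

Under the node parameterization, the middle edge weight is forced to lie between those of its neighbors, whereas the edge parameterization can place a sharp barrier on a single edge.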
Penalty description
As mentioned previously, we would like nearby edge weights on the graph to have similar values. This can be achieved by penalizing differences between all edges connected to the same node, that is, spatially adjacent edges:
$$
\varphi_{\lambda,\alpha}\left(\bm{w}\right)=\frac{\lambda}{2}\sum_{i\in \mathcal{V}}\ \sum_{k,\ell\in \mathcal{E}(i)}\left(\log\left(e^{\alpha w_{ik}}-1\right)-\log\left(e^{\alpha w_{i\ell}}-1\right)\right)^{2},
$$
where, as before, $\mathcal{E}(i)$ denotes the set of edges connected to node $i$. (As mentioned earlier, in practice we choose $\alpha =1/\widehat{w}_{0}$, where $\widehat{w}_{0}$ is the solution for the ‘constant-$w$’ model, but we use the free parameter $\alpha$ here for full generality.) The function $x\mapsto \log\left(e^{x}-1\right)$ (on positive values $x\in (0,\infty)$) is approximately equal to $x$ for $x$ much larger than 1, and is approximately equal to $\log(x)$ for $x$ much smaller than 1. This means that our penalty function effectively penalizes differences on the log scale for edges $(i,k)$ and $(i,\ell)$ with very small weights, but penalizes differences on the original non-log scale for edges with large weights. Using a logarithmic-scale penalty for edges with low weights (rather than simply penalizing $\left(w_{ik}-w_{i\ell}\right)^{2}$) leads to smooth graphs for small edge values, and thus allows for an additional degree of flexibility across orders of magnitude of edge weights. The penalty parameter, $\lambda$, controls the overall contribution of the penalty to the objective function. It is convenient to write the penalty in matrix-vector form, which we will use throughout:
$$
\varphi_{\lambda,\alpha}\left(\bm{w}\right)=\frac{\lambda}{2}\left\Vert \mathbf{\Delta}\log\left(e^{\alpha \bm{w}}-1\right)\right\Vert_{2}^{2}, \tag{8}
$$
where $\mathbf{\Delta}$ is a signed graph incidence matrix derived from an unweighted graph denoting whether pairs of edges are connected to the same node. Specifically, in this expression, we treat $\bm{w}$ as a vector of length $\left|\mathcal{E}\right|$ (i.e. the number of edges), and apply the function $w\mapsto \log\left(e^{\alpha w}-1\right)$ entrywise to this vector. For each pair of adjacent edges $(i,k)$ and $(i,\ell)$ in the graph, there is a corresponding row of $\mathbf{\Delta}$ with the value +1 in the entry corresponding to edge $(i,k)$, a −1 in the entry corresponding to edge $(i,\ell)$, and 0’s elsewhere.
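A minimal sketch of the matrix-vector form of the penalty follows. It builds one row of $\mathbf{\Delta}$ per pair of edges sharing a node and evaluates the squared norm. The $\lambda/2$ scaling, the dense construction, and the function names are assumptions made for illustration.

```python
import numpy as np
from itertools import combinations

def adjacent_edge_incidence(edges, d):
    """One row per pair of edges that share a node: +1 for one edge, -1 for the other."""
    edges_at_node = {i: [] for i in range(d)}
    for e, (k, l) in enumerate(edges):
        edges_at_node[k].append(e)
        edges_at_node[l].append(e)
    rows = []
    for i in range(d):
        for a, b in combinations(edges_at_node[i], 2):
            row = np.zeros(len(edges))
            row[a], row[b] = 1.0, -1.0
            rows.append(row)
    return np.array(rows)

def smooth_penalty(w, Delta, lam, alpha):
    # log(exp(alpha * w) - 1): behaves like alpha * w for large values
    # and like log(alpha * w) for small values
    g = np.log(np.expm1(alpha * w))
    return 0.5 * lam * np.sum((Delta @ g) ** 2)

# Triangle graph: every pair of edges shares exactly one node.
edges = [(0, 1), (1, 2), (0, 2)]
Delta = adjacent_edge_incidence(edges, d=3)
pen_flat = smooth_penalty(np.full(3, 2.0), Delta, lam=1.0, alpha=1.0)
pen_rough = smooth_penalty(np.array([2.0, 0.01, 2.0]), Delta, lam=1.0, alpha=1.0)
```

A constant weight vector incurs zero penalty, while a single low-weight edge next to high-weight edges is penalized on the log scale.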
One might wonder whether it is possible to use the $\ell_{1}$ norm in the penalty form Equation (8) in place of the $\ell_{2}$ norm. While it is known that the $\ell_{1}$ norm might increase local adaptivity and better capture sharp changes in the underlying structure of the latent allele frequencies (e.g. Wang et al., 2016), in our case we found inferior performance when using the $\ell_{1}$ norm rather than the $\ell_{2}$ norm. In particular, our primary application of interest is the regime of highly missing nodes, that is, $o\ll d$, in which case the global smoothing seems somewhat necessary to encourage stable recovery of the edge weights in regions with sparsely observed nodes (see Appendix 1 ‘Smooth penalty with $\ell_{1}$-norm’). In addition, using the penalty $\varphi_{\lambda ,\alpha}\left(\bm{w}\right)$ allows us to implement faster algorithms to solve the optimization problem due to the differentiability of the $\ell_{2}$ norm, and as a result, it leads to better overall computational savings and a simpler implementation.
Optimization
Putting Equation (7) and Equation (8) together, we infer the migration edge weights $\widehat{\bm{w}}$ by minimizing the following penalized negative log-likelihood function:
$$
\widehat{\bm{w}}=\underset{\bm{l}\le \bm{w}\le \bm{u}}{\arg\min}\ \ell\left(\bm{w};\bm{C}\widehat{\mathbf{\Sigma}}\bm{C}^{\top}\right)+\varphi_{\lambda,\alpha}\left(\bm{w}\right), \tag{9}
$$
where $\bm{l},\bm{u}\in {\mathbb{R}}_{+}^{m}$ represent respectively the entrywise lower and upper bounds on $\bm{w}$, that is, we constrain the lower and upper bound of the edge weights to $\bm{l}$ and $\bm{u}$ throughout the optimization. When no prior information is available on the range of the edge weights, we often set $\bm{l}=\mathbf{0}$ and $\bm{u}=+\infty$.
One advantage of the formulation of Equation 9 is the use of the vector form parameterization $\bm{w}\in {\mathbb{R}}_{+}^{m}$ of the symmetric weighted adjacency matrix $\bm{W}\in {\mathbb{R}}_{+}^{d\times d}$. In our triangular graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, the number of nonzero lower-triangular entries is $m=\mathcal{O}\left(d\right)\ll {d}^{2}$, so working directly on the space of vector parameterization saves computational cost. In addition, this avoids the symmetry constraint imposed on the adjacency matrix $\bm{W}$, hence making optimization easier (Kalofolias, 2016).
We solve the optimization problem using a constrained quasi-Newton optimization algorithm, specifically L-BFGS implemented in scipy (Byrd et al., 1995; Virtanen et al., 2020) (RRID:SCR_008058). Since our objective Equation 9 is nonconvex, the L-BFGS algorithm is guaranteed to converge only to a local minimum. Even so, we empirically observe that local minima reached from different initial points are qualitatively similar to each other across many datasets. The L-BFGS algorithm requires gradient and objective values as inputs. Note the naive computation of the objective Equation 9 is computationally prohibitive because inverting the graph Laplacian has complexity $\mathcal{O}\left({d}^{3}\right)$. We take advantage of the sparsity of the graph and the specific structure of the problem to efficiently compute gradient and objective values. In theory, our implementation has computational complexity of $\mathcal{O}\left(do+{o}^{3}\right)$ per iteration which, in the setting of $o\ll d$, is substantially smaller than $\mathcal{O}\left({d}^{3}\right)$. It is possible to achieve $\mathcal{O}\left(do+{o}^{3}\right)$ per-iteration complexity by using a solver that is specially designed for a sparse Laplacian system. In our work, we use sparse Cholesky factorization, which may slightly increase the per-iteration cost (see Appendix Material for the details of the gradient and objective computation).
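The following toy sketch shows how a bound-constrained quasi-Newton fit of this kind might be set up with `scipy.optimize.minimize` and `method="L-BFGS-B"`. It is a simplified stand-in rather than the FEEMS implementation: it uses a dense pseudoinverse, lets scipy approximate the gradient numerically, omits the smoothness penalty (equivalent to $\lambda=0$), and optimizes log-weights on a fully observed four-node cycle.

```python
import numpy as np
from scipy.optimize import minimize

edges = [(0, 1), (1, 2), (2, 3), (0, 3)]        # 4-node cycle, all nodes observed
d = 4
n = np.full(d, 5.0)                             # assumed sample size per node
sigma2 = 0.05                                   # residual variance, held fixed
C = np.hstack([np.eye(d - 1), -np.ones((d - 1, 1))])   # contrast matrix

def contrast_cov(w):
    W = np.zeros((d, d))
    for (k, l), w_e in zip(edges, w):
        W[k, l] = W[l, k] = w_e
    L = np.diag(W.sum(axis=1)) - W
    Sigma = np.linalg.pinv(L) + sigma2 * np.diag(1.0 / n)   # A is the identity here
    return C @ Sigma @ C.T

def objective(z, S):
    # Unpenalized negative log-likelihood (up to constants) in z = log(w)
    M = contrast_cov(np.exp(z))
    _, logdet_M = np.linalg.slogdet(M)
    return np.trace(np.linalg.solve(M, S)) + logdet_M

w_true = np.array([2.0, 0.5, 2.0, 0.5])
S = contrast_cov(w_true)                        # noiseless "data" for the toy fit
z0 = np.zeros(len(edges))                       # start from w = 1 on every edge
bounds = [(np.log(1e-3), np.log(1e3))] * len(edges)
res = minimize(objective, z0, args=(S,), method="L-BFGS-B", bounds=bounds)
w_hat = np.exp(res.x)
```

Because the objective is nonconvex, `res.x` should be read as a local solution; rerunning from other starting points is a simple way to see how sensitive the fit is.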
Estimating the residual variance and edge weights under the null model
We first estimate the residual variance parameter $\sigma^{2}$ via maximum likelihood assuming homogeneous isolation by distance. This corresponds to the scenario where every edge weight in the graph is given the exact same unknown parameter value $w_{0}$. Under this model, we only have two unknown parameters, $w_{0}$ and the residual variance $\sigma^{2}$. We estimate these two parameters by jointly optimizing the marginal likelihood using a Nelder-Mead algorithm implemented in scipy (Virtanen et al., 2020) (RRID:SCR_008058). This requires only likelihood computations, which are efficient due to the sparse nature of the graph. This optimization routine outputs an estimate of the residual variance $\widehat{\sigma}^{2}$ and the null edge weight $\widehat{w}_{0}$, which can be used to construct $\bm{W}\left(\widehat{w}_{0}\right)$ and in turn $\bm{L}\left(\widehat{w}_{0}\right)$.
One strategy we found effective is to fit the model of homogeneous isolation by distance and then fix the estimated residual variance $\widehat{\sigma}^{2}$ throughout later fits of the more flexible penalized models (see Appendix 1 ‘Jointly estimating the residual variance and edge weights’). Additionally, we find that initializing the edge weights to $\widehat{w}_{0}$ is a useful and intuitive strategy for setting the initial values of the entries of $\bm{w}$ to the correct scale.
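A two-parameter null fit of this kind could be sketched with `scipy.optimize.minimize` and `method="Nelder-Mead"`, which needs only objective evaluations. The example below is a dense toy version on a five-edge graph with three observed nodes. The log-parameterization of $w_{0}$ and $\sigma^{2}$ and all names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

edges = [(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)]
d = 4
observed = [0, 1, 3]
n = np.array([4.0, 2.0, 6.0])                   # sample sizes at observed nodes
A = np.eye(d)[observed]                         # node assignment matrix
C = np.hstack([np.eye(len(observed) - 1), -np.ones((len(observed) - 1, 1))])

def contrast_cov(w0, sigma2):
    # Every edge shares the same weight w0 (homogeneous isolation by distance)
    W = np.zeros((d, d))
    for k, l in edges:
        W[k, l] = W[l, k] = w0
    L = np.diag(W.sum(axis=1)) - W
    Sigma = A @ np.linalg.pinv(L) @ A.T + sigma2 * np.diag(1.0 / n)
    return C @ Sigma @ C.T

def null_nll(theta, S):
    # theta = (log w0, log sigma2) keeps both parameters positive
    w0, sigma2 = np.exp(theta)
    M = contrast_cov(w0, sigma2)
    return np.trace(np.linalg.solve(M, S)) + np.linalg.slogdet(M)[1]

S = contrast_cov(1.5, 0.2)                      # noiseless "data" from known values
res = minimize(null_nll, x0=np.zeros(2), args=(S,), method="Nelder-Mead")
w0_hat, sigma2_hat = np.exp(res.x)
```

The fitted `w0_hat` can then serve both as the initial value for every edge weight and as the basis for choosing $\alpha = 1/\widehat{w}_{0}$.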
Leave-one-out cross-validation to select tuning parameters
FEEMS estimates one set of graph edge weights for each setting of the tuning parameters $\lambda$ and $\alpha$, which control the smoothness of the fitted edge weights. Figure 3 shows that the estimated migration surfaces vary substantially depending on the particular choices of the tuning parameters, and indeed, due to the large fraction of unobserved nodes, the model can severely overfit the observed data unless regularized accordingly. To address the issue of selecting the tuning parameters, we propose using leave-one-out cross-validation to assess each fitted model’s generalization ability at held-out locations.
To simplify the notation, we write the model Equation 4 for the estimated allele frequencies in SNP $j$ as
$$
\widehat{\bm{f}}_{j}\sim N_{o}\left(\mu_{j}\mathbf{1},\ \mu_{j}\left(1-\mu_{j}\right)\mathbf{\Sigma}\right),
$$
where
$$
\mathbf{\Sigma}=\bm{A}\bm{L}^{\dagger}\bm{A}^{\top}+\sigma^{2}\,\text{diag}\left(\bm{n}^{-1}\right).
$$
For each fold, we hold out one node from the set of observed nodes in the graph and use the rest of the nodes to fit FEEMS across a sequential grid of regularization parameters. Note that our objective function is nonconvex, so the algorithm converges to different local minima for different regularization parameters, even with the same initial value $\widehat{w}_{0}$. To stabilize the cross-validation procedure, we recommend using a warm start strategy in which one solves the problem for the largest value of the regularization parameter first and uses this solution to initialize the algorithm at the next largest value, and so on. Empirically, we find that using warm starts gives far more reliable model selection than cold starts, where the problems over the sequence of parameters are solved independently with the same initial value $\widehat{w}_{0}$. We suspect that the poor performance of leave-one-out cross-validation without warm starts is attributable to the spatial dependency of allele frequencies and the large fraction of unobserved nodes. Without loss of generality, we assume that the last node has been held out. Rewriting the distribution of the observed frequencies according to the split of observed nodes,
$$
\begin{bmatrix}\widehat{\bm{f}}_{j}^{\text{train}}\\ \widehat{f}_{j}^{\text{val}}\end{bmatrix}\sim N_{o}\left(\mu_{j}\mathbf{1},\ \mu_{j}\left(1-\mu_{j}\right)\begin{bmatrix}\mathbf{\Sigma}_{\text{train},\text{train}} & \mathbf{\Sigma}_{\text{train},\text{val}}\\ \mathbf{\Sigma}_{\text{val},\text{train}} & \mathbf{\Sigma}_{\text{val},\text{val}}\end{bmatrix}\right),
$$
the conditional mean of the observed frequency ${\widehat{f}}_{j}^{\text{val}}$ on the held out node, given the rest, is given by
$$
\widehat{f}_{j}^{\text{val,pred}}=\mu_{j}+\mathbf{\Sigma}_{\text{val},\text{train}}\mathbf{\Sigma}_{\text{train},\text{train}}^{-1}\left(\widehat{\bm{f}}_{j}^{\text{train}}-\mu_{j}\mathbf{1}\right). \tag{10}
$$
Using this formula, we can predict allele frequencies at held-out locations using the fitted graph $\widehat{\bm{L}}=\widehat{\bm{L}}(\lambda ,\alpha )$ for each setting of tuning parameters $\lambda$ and $\alpha$. Note that in Equation (10), the parameters $\mu_{j}$ and $\sigma$ are also unknown, and we use an estimate of the average allele frequency $\widehat{\mu}_{j}$ and the estimated residual variance $\widehat{\sigma}$ from the ‘constant-$w$’ model (they are not dependent on $\lambda$ and $\alpha$). Then we select the tuning parameters $\lambda$ and $\alpha$ that output the minimum prediction error averaged over all SNPs, $\frac{1}{p}\sum_{j}\left\Vert \widehat{f}_{j}^{\text{val,pred}}-\widehat{f}_{j}^{\text{val}}\right\Vert_{2}^{2}$, averaged over all the held-out nodes (with $o$ observed nodes in total). As mentioned earlier, in practice we choose $\alpha =1/\widehat{w}_{0}$ and hence we can use the leave-one-out cross-validation to search for $\lambda$ only, which allows us to avoid the computational cost of searching over the two-dimensional parameter space.
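The two ingredients of this procedure, the Gaussian conditional-mean prediction at a held-out node and the warm-started path over decreasing regularization values, can be sketched as follows. Both helpers are hypothetical. In particular, `fit` stands for any user-supplied routine that returns fitted weights given a penalty value and an initial weight vector.

```python
import numpy as np

def predict_held_out(f_train, mu_hat, Sigma, held_out):
    """Conditional mean of the held-out frequency given the remaining observed nodes.
    The common factor mu (1 - mu) cancels between the two covariance blocks."""
    o = Sigma.shape[0]
    train = np.array([i for i in range(o) if i != held_out])
    S_tt = Sigma[np.ix_(train, train)]          # covariance among training nodes
    S_vt = Sigma[held_out, train]               # covariance of held-out vs. training
    return mu_hat + S_vt @ np.linalg.solve(S_tt, f_train - mu_hat)

def warm_start_path(fit, lambdas, w_init):
    """Fit from the largest to the smallest lambda, reusing each solution as the next start."""
    w, path = w_init, {}
    for lam in sorted(lambdas, reverse=True):
        w = fit(lam, w)
        path[lam] = w
    return path

# Two observed nodes with correlation 0.5; node 1 is held out.
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
pred = predict_held_out(np.array([0.6]), mu_hat=0.5, Sigma=Sigma, held_out=1)

# Stand-in "fit" that just records the order of the calls.
path = warm_start_path(lambda lam, w: w + lam, lambdas=[1.0, 10.0], w_init=0.0)
```

The squared difference between `pred` and the observed held-out frequency, averaged over SNPs and folds, gives the cross-validation error for one value of $\lambda$.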
Comparison between FEEMS and EEMS models
At a high level, we can summarize the differences between FEEMS and EEMS as follows: (1) the likelihood functions of FEEMS and EEMS are slightly different as a function of the graph Laplacian $\bm{L}$; (2) the migration rates are parameterized in terms of edge weights (FEEMS) or in terms of node weights (EEMS); and (3) EEMS is based on Bayesian inference and thus chooses a prior and studies the posterior distribution, while FEEMS is an optimization-based approach and thus chooses a penalty function and minimizes the penalized negative log-likelihood (in particular, the EEMS prior and the FEEMS penalty both aim for locally constant migration surfaces). The last two points were already discussed in the above sections, so here we focus on the difference between the likelihoods of the two methods.
FEEMS develops the spatial model for the genetic differentiation through a Gaussian Markov Random Field, but the resulting likelihood has similarities to that of EEMS (Petkova et al., 2016), which considers the pairwise coalescent times. Using our notation, we can write the EEMS model as
where $\nu \in [o-1,p]$ is the effective degree of freedom, $\sigma^{*}>0$ is the scale nuisance parameter, and $\bm{q}$ is a $d\times 1$ vector of the within-subpopulation coalescent rates. $\widehat{\bm{D}}^{*}$ represents the genetic distance matrix without rescaling, where the $(k,\ell)$th element is given by $\widehat{\bm{D}}_{k\ell}^{*}=\sum_{j=1}^{p}\left(\widehat{f}_{j}(k)-\widehat{f}_{j}(\ell)\right)^{2}/p$. That is, unlike FEEMS, EEMS does not apply the SNP-specific rescaling factor $\mu_{j}\left(1-\mu_{j}\right)$ to account for the vanishing variance of the observed allele frequencies as the average allele frequency approaches 0 or 1.
In Equation (11), the effective degree of freedom $\nu$ is introduced to account for the dependency across SNPs in close proximity. Because EEMS uses a hierarchical Bayesian model to infer the effective migration rates, $\nu$ is estimated alongside the other model parameters. On the other hand, FEEMS uses an optimization-based approach and the degrees of freedom have no influence on the point estimate of the migration rates. Besides the effective degree of freedom and the SNP-specific rescaling by $\mu_{j}\left(1-\mu_{j}\right)$, the EEMS and FEEMS likelihoods are equivalent up to constant factors, as long as only one individual is observed per node and the residual variance $\sigma^{2}$ is allowed to vary across nodes (see Appendix 1 ‘Jointly estimating the residual variance and edge weights’ for details). The constant factors, such as $\sigma^{*}$, can be effectively absorbed into the unknown model parameters $\bm{L}$ and $\bm{q}$ and therefore they do not affect the estimation of effective migration rates, up to constant factors.
Data description and quality control
We analyzed a population genetic dataset of North American gray wolves previously published in Schweizer et al., 2016. For this, we downloaded plink (RRID:SCR_001757) formatted files and spatial coordinates from https://doi.org/10.5061/dryad.c9b25. We removed all SNPs with minor allele frequency less than 5% and with missingness greater than 10%, resulting in a final set of 111 individuals and 17,729 SNPs.
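For readers working directly with a genotype array, the two filters might be expressed in numpy as below. This is an illustrative sketch that treats the two thresholds as separate exclusion criteria and encodes missing genotypes as `np.nan`. It is not the plink-based pipeline used for the analysis.

```python
import numpy as np

def filter_snps(G, maf_min=0.05, miss_max=0.10):
    """G: (n, p) float genotype matrix with entries in {0, 1, 2} and np.nan for missing.
    Keeps SNPs with minor allele frequency >= maf_min and missingness <= miss_max."""
    missing = np.isnan(G)
    miss_rate = missing.mean(axis=0)
    called = (~missing).sum(axis=0)
    # Allele frequency among called genotypes (diploid: 2 alleles per call)
    freq = np.nansum(G, axis=0) / (2.0 * np.maximum(called, 1))
    maf = np.minimum(freq, 1.0 - freq)
    keep = (maf >= maf_min) & (miss_rate <= miss_max)
    return G[:, keep], keep

# Ten individuals, three SNPs: SNP 0 is monomorphic, SNP 1 has 20% missing calls.
G = np.ones((10, 3))
G[:, 0] = 0.0
G[:2, 1] = np.nan
G_filtered, keep = filter_snps(G)
```

Only the third SNP passes both filters in this toy matrix.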
Population structure analyses
We fit the Pritchard, Stephens, and Donnelly (PSD) model and ran principal components analysis on the genotype matrix of North American gray wolves (Price et al., 2006; Pritchard et al., 2000). For the PSD model, we used the ADMIXTURE software (RRID:SCR_001263) on the unnormalized genotypes, running five replicates per choice of $K$, from $K=2$ to $K=8$ (Alexander et al., 2009). For each $K$, we chose the replicate that achieved the highest likelihood to visualize. For PCA, we centered and scaled the genotype matrix and then ran the sklearn (RRID:SCR_019053) implementation of PCA, truncated to compute 50 eigenvectors.
Grid construction
To create a dense triangular lattice around the sample locations, we first define an outer boundary polygon. As a default, we construct the lattice by creating a convex hull around the sample points and then manually trimming the polygon to adhere to the geography of the study organism, balancing the sample point range with the extent of local geography, using the following website https://www.keene.edu/campus/maps/tool/. We often do not exclude internal ‘holes’ in the habitat (e.g. water features for terrestrial animals), and instead let the model fit effective migration rates for those features to the extent they lead to elevated differentiation. We also emphasize the importance of defining the lattice for FEEMS as well as EEMS and suggest it should be carefully curated with prior biological knowledge about the system.
To ensure edges cover an equal area over the entire region, we downloaded and intersected a uniform grid defined on the spherical shape of the Earth (Sahr et al., 2003). These grids are precomputed at a number of different resolutions, allowing a user to test FEEMS at different grid densities, which is an important feature to explore.
Code availability
The code to reproduce the results of this paper and more can be found at https://github.com/jhmarcus/feemsanalysis (Marcus and Ha, 2021a, copy archived at swh:1:rev:f2d7330f25f8a11124db09000918ae38ae00d4a7, Marcus and Ha, 2021b). A Python (RRID:SCR_008394) package implementing the method can be found at https://github.com/Novembrelab/feems.
Appendix 1
Mathematical notation
We denote matrices using bold capital letters $\bm{A}$. Bold lowercase letters are vectors $\bm{a}$, and non-bold lowercase letters are scalars $a$. We denote by $\bm{A}^{-1}$ and $\bm{A}^{\dagger}$ the inverse and (Moore-Penrose) pseudoinverse of $\bm{A}$, respectively. We use $\bm{y}\sim N_{p}(\bm{\mu},\mathbf{\Sigma})$ to express that the random vector $\bm{y}$ is modeled as a $p$-dimensional multivariate Gaussian distribution with fixed parameters $\bm{\mu}$ and $\mathbf{\Sigma}$ and use the conditional notation $\bm{y}\mid \bm{\mu}\sim N_{p}(\bm{\mu},\mathbf{\Sigma})$ if $\bm{\mu}$ is random.
A graph is a pair $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}$ denotes a set of nodes or vertices and $\mathcal{E}\subseteq \mathcal{V}\times \mathcal{V}$ denotes a set of edges. Throughout we assume the graph $\mathcal{G}$ is undirected, weighted, and contains no self loops, that is, $(k,\mathrm{\ell})\in \mathcal{E}\iff (\mathrm{\ell},k)\in \mathcal{E}$ and $(k,k)\notin \mathcal{E}$ and each edge $(k,\mathrm{\ell})\in \mathcal{E}$ is given a weight ${w}_{k\mathrm{\ell}}={w}_{\mathrm{\ell}k}>0$. We write $\mathit{\bm{W}}$ to indicate the symmetric weighted adjacency matrix, that is,
$$
\bm{W}_{k\ell}=\begin{cases} w_{k\ell}, & \text{if } (k,\ell)\in \mathcal{E},\\ 0, & \text{otherwise.}\end{cases}
$$
$\bm{w}\in {\mathbb{R}}^{m}$ is a vectorized form of the nonzero lower-triangular entries of $\bm{W}$ where $m=\left|\mathcal{E}\right|/2$ is the number of nonzero lower-triangular elements. We denote by $\bm{L}=\text{diag}\left(\bm{W}\mathbf{1}\right)-\bm{W}$ the graph Laplacian.
Gradient computation
In practice, we make a change of variable from $\mathit{\bm{w}}\in {\mathbb{R}}_{+}^{m}$ to $\mathit{\bm{z}}=\mathrm{log}\left(\mathit{\bm{w}}\right)\in {\mathbb{R}}^{m}$ and the algorithm is applied to the transformed objective function:
$$
\tilde{\ell}\left(\bm{z}\right)+\tilde{\varphi}_{\lambda,\alpha}\left(\bm{z}\right)=\ell\left(\exp\left(\bm{z}\right)\right)+\varphi_{\lambda,\alpha}\left(\exp\left(\bm{z}\right)\right).
$$
After the change of variable, the objective value remains the same, whereas it follows from the chain rule that $\nabla \left(\tilde{\ell}\left(\bm{z}\right)+\tilde{\varphi}_{\lambda ,\alpha}\left(\bm{z}\right)\right)=\nabla \left(\ell\left(\bm{w}\right)+\varphi_{\lambda ,\alpha}\left(\bm{w}\right)\right)\odot \bm{w}$, where $\odot$ indicates the Hadamard (elementwise) product. For notational convenience, we drop the dependency of $\ell$ on the quantities $\sigma^{2}$ and $\bm{C}\widehat{\mathbf{\Sigma}}\bm{C}^{\top}$. Furthermore, the computation of $\nabla \varphi_{\lambda ,\alpha}\left(\bm{w}\right)$ is relatively straightforward, so in the rest of this section, we discuss only the computation of the gradient of the negative log-likelihood function with respect to $\bm{w}$, that is, $\nabla \ell\left(\bm{w}\right)$.
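The chain-rule identity for the log-transformed variable can be verified numerically on any smooth function of positive weights. The sketch below uses an arbitrary toy function in place of the actual objective and compares the analytic gradient in $\bm{z}$ against central finite differences.

```python
import numpy as np

def f(w):
    # Arbitrary smooth toy objective on positive weights (not the FEEMS objective)
    return np.sum(w ** 2) + np.sum(np.log(w))

def grad_f(w):
    return 2.0 * w + 1.0 / w

def grad_f_tilde(z):
    # Chain rule for w = exp(z): the gradient in z is the gradient in w times w, elementwise
    w = np.exp(z)
    return grad_f(w) * w

z = np.array([0.3, -1.2, 0.7])
eps = 1e-6
numeric = np.array([
    (f(np.exp(z + eps * e)) - f(np.exp(z - eps * e))) / (2.0 * eps)
    for e in np.eye(z.size)
])
analytic = grad_f_tilde(z)
```

Working in $\bm{z}$ removes the positivity constraint on the weights while leaving the objective value unchanged.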
Recall, by definition, the graph Laplacian $\mathit{\bm{L}}$ implicitly depends on the variable $\mathit{\bm{w}}$ through $\mathit{\bm{L}}=\text{diag}\left(\mathit{\bm{W}}\mathrm{\U0001d7cf}\right)\mathit{\bm{W}}$. Throughout we assume the first $o$ rows and columns of $\mathit{\bm{L}}$ correspond to the observed nodes. With this assumption, our node assignment matrix has block structure $\mathit{\bm{A}}=\left[{\mathbf{\mathbf{I}}}_{o\times o}{\mathbf{\hspace{0.33em}0}}_{o\times \left(do\right)}\right]$. To simplify some of the equations appearing later, we introduce the notation: we define
and
Applying the chain rule and matrix derivatives, we can calculate:
where vec is the vectorization operator and $\partial \ell/\partial \text{vec}(\boldsymbol{L})$ and $\partial \text{vec}(\boldsymbol{L})/\partial \boldsymbol{w}^{\top}$ are a $1\times d^{2}$ vector and a $d^{2}\times m$ matrix, respectively, given by
Here, $\boldsymbol{S}$ and $\boldsymbol{T}$ are linear operators that satisfy $\boldsymbol{S}\boldsymbol{w}=\text{diag}(\boldsymbol{W}\mathbf{1})$ and $\boldsymbol{T}\boldsymbol{w}=\boldsymbol{W}$. Note that $\boldsymbol{S}$ and $\boldsymbol{T}$ both have $\mathcal{O}(d)$ many nonzero entries, so we can perform sparse matrix multiplication to efficiently compute the matrix-vector multiplication $\partial \ell/\partial \text{vec}(\boldsymbol{L})\cdot (\boldsymbol{S}-\boldsymbol{T})$. On the other hand, the computation of $\partial \ell/\partial \text{vec}(\boldsymbol{L})$ is more challenging as it requires inverting the full $d\times d$ matrix $\boldsymbol{L}_{\text{full}}$. Next, we develop a procedure that efficiently computes $\partial \ell/\partial \text{vec}(\boldsymbol{L})$. We proceed by dividing the task into multiple steps.
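One possible way to realize the operators $\boldsymbol{S}$ and $\boldsymbol{T}$ as sparse matrices is sketched below. The helper `build_S_T`, the column-stacking convention for vec, and the random placeholder for $\partial \ell/\partial \text{vec}(\boldsymbol{L})$ are assumptions for this example only.

```python
import numpy as np
import scipy.sparse as sp

def build_S_T(d, edges):
    """Sparse d^2 x m operators with S w = vec(diag(W 1)) and T w = vec(W)."""
    m = len(edges)
    s_rows, s_cols, t_rows, t_cols = [], [], [], []
    for e, (k, l) in enumerate(edges):
        # vec stacks columns, so entry (i, j) sits at index i + j * d.
        s_rows += [k + k * d, l + l * d]
        s_cols += [e, e]
        t_rows += [k + l * d, l + k * d]
        t_cols += [e, e]
    S = sp.csr_matrix((np.ones(2 * m), (s_rows, s_cols)), shape=(d * d, m))
    T = sp.csr_matrix((np.ones(2 * m), (t_rows, t_cols)), shape=(d * d, m))
    return S, T

d = 4
edges = [(1, 0), (2, 1), (3, 2), (3, 0)]
w = np.array([1.0, 2.0, 0.5, 3.0])
S, T = build_S_T(d, edges)

# (S - T) w is vec(L); reshape in column-major order to recover L.
L = ((S - T) @ w).reshape(d, d, order="F")

# Placeholder for dl/dvec(L): a random d x d matrix G, vectorized.
rng = np.random.default_rng(0)
G = rng.standard_normal((d, d))
grad_w = (S - T).T @ G.reshape(-1, order="F")  # sparse vector-Jacobian product
```

Each column of $\boldsymbol{S}-\boldsymbol{T}$ has four nonzero entries, so for edge $(k,\ell)$ the product reduces to $G_{kk}+G_{\ell\ell}-G_{k\ell}-G_{\ell k}$.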
1. Computing $\mathbf{\Sigma}^{-1}$
Recalling the block structure $\boldsymbol{A}=\left[\mathbf{I}_{o\times o}\ \ \mathbf{0}_{o\times (d-o)}\right]$ of the node assignment matrix, we can write $\mathbf{\Sigma}$ as:
where $\left(\boldsymbol{L}_{\text{full}}^{-1}\right)_{o\times o}$ denotes the $o\times o$ upper-left block of $\boldsymbol{L}_{\text{full}}^{-1}$. Following Petkova et al., 2016, the inverse $\mathbf{\Sigma}^{-1}$ has the form
for some matrix $\boldsymbol{X}\in \mathbb{R}^{o\times o}$. Equating $\mathbf{\Sigma}\mathbf{\Sigma}^{-1}=\mathbf{I}$, it follows that
Therefore, $\mathbf{\Sigma}^{-1}$ can be obtained by solving the $o\times o$ linear system Equation (15) and plugging the solution into Equation (14). The challenge here is to compute $\left(\boldsymbol{L}_{\text{full}}^{-1}\right)_{o\times o}$ without inverting the full-dimensional matrix $\boldsymbol{L}_{\text{full}}$.
2. Computing $\left(\boldsymbol{L}_{\text{full}}^{-1}\right)_{o\times o}$
Let $\boldsymbol{L}_{\text{full},o\times o}$ be the $o\times o$ block of $\boldsymbol{L}_{\text{full}}$ corresponding to the observed nodes, and similarly let $\boldsymbol{L}_{\text{full},(d-o)\times (d-o)}$ and $\boldsymbol{L}_{\text{full},o\times (d-o)}=\boldsymbol{L}_{\text{full},(d-o)\times o}^{\top}$ be the corresponding blocks of $\boldsymbol{L}_{\text{full}}$. The inverse of $\left(\boldsymbol{L}_{\text{full}}^{-1}\right)_{o\times o}$ is then given by the Schur complement of $\boldsymbol{L}_{\text{full},(d-o)\times (d-o)}$ in $\boldsymbol{L}_{\text{full}}$:
See also Hanks and Hooten, 2013 and Petkova et al., 2016. Since every term in Equation (16) has sparse + rank-1 structure, the matrix multiplications can be performed quickly. In addition, for the term $\left(\boldsymbol{L}_{\text{full},(d-o)\times (d-o)}\right)^{-1}$, we can use the Sherman-Morrison formula so that the inverse is given explicitly by
Hence, in order to compute $\left(\boldsymbol{L}_{\text{full},(d-o)\times (d-o)}\right)^{-1}\boldsymbol{L}_{\text{full},(d-o)\times o}$, we need to solve two systems of linear equations:
Note that the matrix $\boldsymbol{L}_{(d-o)\times (d-o)}$ is sparse, so both systems can be solved efficiently by performing sparse Cholesky factorization on $\boldsymbol{L}_{(d-o)\times (d-o)}$ (Hanks and Hooten, 2013). Alternatively, one can implement fast Laplacian solvers (Vishnoi, 2013) that solve the Laplacian system in time nearly linear in the dimension, $\mathcal{O}(d)$. After we obtain $\left[\left(\boldsymbol{L}_{\text{full}}^{-1}\right)_{o\times o}\right]^{-1}$ via sparse + rank-1 matrix multiplication and sparse Cholesky factorization, we can invert the $o\times o$ matrix to get $\left(\boldsymbol{L}_{\text{full}}^{-1}\right)_{o\times o}$.
3. Computing $\left(\boldsymbol{L}_{\text{full}}^{-1}\right)_{d\times o}$
We write
Using the formula for the inverse of a matrix in block form, the $(d-o)\times o$ block component is given by
Since each of the two terms (A) and (B) has already been computed in the previous step, there is no need to recompute them. In total, this step requires multiplying a $(d-o)\times o$ matrix by an $o\times o$ matrix.
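Steps 2 and 3 can be sketched on a small example as follows. This is a hedged illustration, not the package's implementation: it assumes $\boldsymbol{L}_{\text{full}}=\boldsymbol{L}+\mathbf{1}\mathbf{1}^{\top}/d$ (the defining equation is not reproduced in this section), uses SciPy's sparse LU factorization (`splu`) in place of the sparse Cholesky factorization mentioned in the text, and the function name `inverse_blocks` and the weighted cycle graph are hypothetical.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def inverse_blocks(L, o):
    """Return the o x o and (d-o) x o blocks of inv(L + 11^T/d).

    Assumes L_full = L + 11^T/d and that the first o nodes are observed.
    """
    d = L.shape[0]
    L = sp.csc_matrix(L)
    L_uu = L[o:, o:]                      # sparse block for unobserved nodes
    ones_u = np.ones(d - o)
    # Dense blocks of L_full that touch the observed nodes (sparse + rank-1).
    Lf_oo = L[:o, :o].toarray() + 1.0 / d
    Lf_uo = L[o:, :o].toarray() + 1.0 / d
    lu = splu(L_uu)                       # sparse LU stands in for sparse Cholesky
    U = lu.solve(Lf_uo)                   # system 1: L_uu U = L_full,uo
    u = lu.solve(ones_u)                  # system 2: L_uu u = 1
    # Sherman-Morrison correction for the rank-1 term 11^T/d.
    X = U - np.outer(u, ones_u @ U) / (d + ones_u @ u)
    schur = Lf_oo - Lf_uo.T @ X           # Schur complement
    block_oo = np.linalg.inv(schur)       # small o x o inversion (step 2)
    block_uo = -X @ block_oo              # (d-o) x o block, reusing X (step 3)
    return block_oo, block_uo

d, o = 6, 2
rng = np.random.default_rng(1)
W = np.zeros((d, d))
for k in range(d):                        # weighted cycle graph
    l = (k + 1) % d
    W[k, l] = W[l, k] = rng.uniform(0.5, 2.0)
L = sp.csr_matrix(np.diag(W.sum(axis=1)) - W)

block_oo, block_uo = inverse_blocks(L, o)
dense_inv = np.linalg.inv(L.toarray() + np.ones((d, d)) / d)  # reference only
```

The dense inverse is computed here only to validate the block computations; on a large graph only the sparse factorization and the $o\times o$ inversion would be needed.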
4. Computing the full gradient
Going back to the expression of $\nabla \mathrm{\ell}\left(\mathit{\bm{w}}\right)$ in Equation (13), and noting the block structure of the assignment matrix $\mathit{\bm{A}}$, we have:
Define $\mathbf{\Pi}_{\mathbf{1}}=\mathbf{1}\left(\mathbf{1}^{\top}\mathbf{\Sigma}^{-1}\mathbf{1}\right)^{-1}\mathbf{1}^{\top}\mathbf{\Sigma}^{-1}$, which acts as a sort of projection onto the space of constant vectors with respect to the inner product $\langle \boldsymbol{x},\boldsymbol{y}\rangle =\boldsymbol{x}^{\top}\mathbf{\Sigma}^{-1}\boldsymbol{y}$. Using the identity $\mathbf{I}-\mathbf{\Pi}_{\mathbf{1}}=\mathbf{\Sigma}\boldsymbol{C}^{\top}\left(\boldsymbol{C}\mathbf{\Sigma}\boldsymbol{C}^{\top}\right)^{-1}\boldsymbol{C}$ (McCullagh, 2009), we can write $\boldsymbol{M}$ in terms of $\mathbf{\Pi}_{\mathbf{1}}$:
Since $\mathbf{\Pi}_{\mathbf{1}}$ is a rank-1 matrix, this expression of $\boldsymbol{M}$ allows easier computation. Finally, we can put together Equation (14), Equation (15), Equation (17), and Equation (18) to compute the gradient of the negative log-likelihood function with respect to the graph Laplacian.
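The projection identity used above can be verified numerically. In this sketch, `Sigma` is an arbitrary symmetric positive definite matrix and `C` is a simple successive-difference contrast matrix with $\boldsymbol{C}\mathbf{1}=\mathbf{0}$; both are illustrative stand-ins rather than quantities from the fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)
o = 5
B = rng.standard_normal((o, o))
Sigma = B @ B.T + o * np.eye(o)        # arbitrary SPD stand-in for Sigma
Sigma_inv = np.linalg.inv(Sigma)
one = np.ones((o, 1))

# Contrast matrix: (o-1) x o, each row is e_{i+1} - e_i, so C 1 = 0.
C = np.eye(o)[1:] - np.eye(o)[:-1]

# Pi_1 = 1 (1^T Sigma^{-1} 1)^{-1} 1^T Sigma^{-1}, a rank-1 matrix.
Pi = one @ np.linalg.inv(one.T @ Sigma_inv @ one) @ one.T @ Sigma_inv

lhs = np.eye(o) - Pi
rhs = Sigma @ C.T @ np.linalg.inv(C @ Sigma @ C.T) @ C
```

Because the right-hand side needs an $(o-1)\times (o-1)$ inverse while the left-hand side only needs $\mathbf{\Sigma}^{-1}\mathbf{1}$, the rank-1 form is the cheaper one once $\mathbf{\Sigma}^{-1}$ is available.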
Objective computation
The graph Laplacian $\boldsymbol{L}$ is orthogonal to the vector of ones $\mathbf{1}$, so using the notation introduced in Equation (12), we can express our objective function as
With the identity $\mathbf{I}-\mathbf{\Pi}_{\mathbf{1}}=\mathbf{\Sigma}\boldsymbol{C}^{\top}\left(\boldsymbol{C}\mathbf{\Sigma}\boldsymbol{C}^{\top}\right)^{-1}\boldsymbol{C}$, the trace term is:
The matrix inside the trace has been constructed in the gradient computation, see Equation (18). For the determinant, we use the same approach considered in Petkova et al., 2016. In particular, concatenating $\boldsymbol{C}^{\top}$ and $\mathbf{1}$, the matrix $\left[\boldsymbol{C}^{\top}\ \ \mathbf{1}\right]$ is orthogonal, so it can be shown that
Rearranging terms and using the fact that $\det\left(\boldsymbol{U}^{-1}\right)=\det\left(\boldsymbol{U}\right)^{-1}$ for any invertible matrix $\boldsymbol{U}$, we obtain:
We have computed $\mathbf{\Sigma}^{-1}$ in Equation (14), so each of the terms above can be computed without any additional matrix multiplications. Finally, the signed graph incidence matrix $\mathbf{\Delta}$ defined on the edges of the graph is, by construction, highly sparse with $\mathcal{O}(d)$ many nonzero entries. Hence we implement sparse matrix multiplication to evaluate the penalty function $\varphi_{\lambda,\alpha}(\boldsymbol{w})$ while avoiding the full-dimensional matrix-vector product.
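The determinant step can be illustrated with a numerical check of one standard form of such an identity. The sketch assumes that $\boldsymbol{C}$ has orthonormal rows orthogonal to $\mathbf{1}$, in which case $\log\det(\boldsymbol{C}\mathbf{\Sigma}\boldsymbol{C}^{\top})=\log\det\mathbf{\Sigma}+\log(\mathbf{1}^{\top}\mathbf{\Sigma}^{-1}\mathbf{1})-\log o$. The exact normalization used in the paper's equation may differ, and `Sigma` is again an arbitrary positive definite stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
o = 6
B = rng.standard_normal((o, o))
Sigma = B @ B.T + o * np.eye(o)          # arbitrary SPD stand-in for Sigma
one = np.ones(o)

# Orthonormal contrasts: QR of [1, random] gives a first column proportional
# to 1; the remaining columns are orthonormal and orthogonal to 1.
Q, _ = np.linalg.qr(np.column_stack([one, rng.standard_normal((o, o - 1))]))
C = Q[:, 1:].T                           # (o-1) x o with C 1 = 0, C C^T = I

lhs = np.linalg.slogdet(C @ Sigma @ C.T)[1]
quad = one @ np.linalg.solve(Sigma, one)  # 1^T Sigma^{-1} 1
rhs = np.linalg.slogdet(Sigma)[1] + np.log(quad) - np.log(o)
```

The right-hand side involves only $\mathbf{\Sigma}$ and $\mathbf{\Sigma}^{-1}\mathbf{1}$, so the contrast matrix never has to be formed explicitly when evaluating the objective.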
Estimating the edge weights under the exact likelihood model
When we developed the FEEMS model, we used the approximation $\frac{1}{2}f_{j}(k)\left(1-f_{j}(k)\right)\approx \sigma^{2}\mu_{j}\left(1-\mu_{j}\right)$ for all SNPs $j$ and all nodes $k$ (see Equation 4) and estimated the residual variance $\sigma^{2}$ under the homogeneous isolation-by-distance model. The reason for using this approximation was primarily computational. While the approximation is not too strong if SNPs with rare alleles are excluded, it is also critical that it does not degrade the quality of the estimated migration rates. In this subsection, we introduce an inference procedure for the migration rates under the exact likelihood model and compare it with FEEMS.
Note that without approximation, we can calculate the exact analytical form for the marginal likelihood of the estimated frequency as follows (after removing the SNP means):
where $\{a_{k}\}_{k=1}^{d}$ represents the vector $\boldsymbol{a}=(a_{1},\dots ,a_{d})$. Compared to the model Equation (5), this expression does not introduce the unknown residual variance parameter $\sigma^{2}$; instead, each node has its own residual parameter given by $\left(1-L_{kk}^{\dagger}\right)/2$. Because the residual parameters must be positive, we have to search for graphs that ensure $L_{kk}^{\dagger}\le 1$ for all nodes $k$. We can therefore consider the following constrained optimization problem:
where $\ell_{\text{exact}}$ is the negative log-likelihood function based on Equation (19) and $\varphi_{\lambda,\alpha}$ is the smooth penalty function defined earlier. The main difficulty in solving Equation (20) is that enforcing the constraint $L_{kk}^{\dagger}\le 1$ for all nodes $k\in \mathcal{V}$ requires full computation of the pseudo-inverse of the $d\times d$ matrix $\boldsymbol{L}$, which is computationally demanding. We instead relax the constraint and consider the following form as a proxy for the optimization problem Equation (20):
Note that the constraint $L_{kk}^{\dagger}\le 1$ is now placed at the observed nodes only, which can lead to computational savings if $o\ll d$. The problem Equation (21) can be solved efficiently using any gradient-based algorithm, where we can calculate the gradient of $\ell_{\text{exact}}$ with respect to $\boldsymbol{L}$ as
where $\boldsymbol{M}$ is an $o\times o$ matrix defined in Equation (18), and $\boldsymbol{N}$ is an $o\times d^{2}$ matrix whose rows correspond to the observed subsets of the rows of the $d^{2}\times d^{2}$ matrix $\boldsymbol{L}_{\text{full}}^{-1}\otimes \boldsymbol{L}_{\text{full}}^{-1}$.
Appendix 1—figure 12 shows the result when the penalized maximum likelihood Equation (21) is applied to the North American wolf dataset with a setting of $\lambda =2.06$ (the same value of $\lambda$ as given in Figure 4) and $\alpha =1/\widehat{w}_{0}$, where $\widehat{w}_{0}$ is the solution for the ‘constant-$w$’ model. We can see that the resulting estimated migration surfaces are qualitatively similar to those shown in Figure 4. We also observed similar results between FEEMS and the penalized maximum likelihood Equation (21) across multiple datasets. On the other hand, we found that at the fitted surface the residual variances $1-L_{kk}^{\dagger}$ are not always positive because the constraints are enforced only at the observed nodes. This is problematic because it can cause the model to be ill-defined at the unobserved nodes and make the algorithm numerically unstable. Note that FEEMS avoids this issue by decoupling the residual variance parameter $\sigma^{2}$ from the graph-related parameters $\boldsymbol{w}$. The resulting model Equation (6) also more closely resembles the spatial coalescent model used in EEMS (Petkova et al., 2016), and we thus recommend using FEEMS as the primary method for inferring migration rates.
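To make the constraint concrete, the sketch below computes the diagonal of the Laplacian pseudo-inverse on a toy graph and the implied node-specific residual parameters $(1-L_{kk}^{\dagger})/2$. It relies on the standard identity $\boldsymbol{L}^{\dagger}=(\boldsymbol{L}+\mathbf{1}\mathbf{1}^{\top}/d)^{-1}-\mathbf{1}\mathbf{1}^{\top}/d$ for a connected graph; the function name `laplacian_pinv` and the 4-node cycle are hypothetical choices for illustration.

```python
import numpy as np

def laplacian_pinv(L):
    """Pseudo-inverse of a connected-graph Laplacian via a rank-1 shift."""
    d = L.shape[0]
    J = np.ones((d, d)) / d
    return np.linalg.inv(L + J) - J

# Toy weighted 4-node cycle.
edges = [(1, 0), (2, 1), (3, 2), (3, 0)]
w = np.array([1.0, 2.0, 0.5, 3.0])
W = np.zeros((4, 4))
for (k, l), wk in zip(edges, w):
    W[k, l] = W[l, k] = wk
L = np.diag(W.sum(axis=1)) - W

L_pinv = laplacian_pinv(L)
residual = (1.0 - np.diag(L_pinv)) / 2.0   # node-specific residual parameters
feasible = bool(np.all(np.diag(L_pinv) <= 1.0))
```

A graph is admissible under the exact model only if `feasible` is true at every node, which is why checking the constraint at the observed nodes alone can leave some residual parameters negative.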
Jointly estimating the residual variance and edge weights
One simple strategy we have used throughout the paper was to fit $\sigma^{2}$ first under a model of homogeneous isolation by distance and then fix the residual variance at the resulting estimate $\widehat{\sigma}^{2}$ for later fits of the effective migration rates. Alternatively, we can consider estimating the unknown residual variance simultaneously with the edge weights, instead of fixing it in advance from the estimation of the null model. The hope here is to simultaneously correct the model misspecification and improve the model fit to the data. To develop the framework for simultaneous estimation of the residual variance and edge weights, let us consider a model that generalizes both Equation (6) and Equation (19), that is,
where $\boldsymbol{\sigma}^{2}$ is a $d\times 1$ vector of node-specific residual variance parameters, that is, each deme has its own residual parameter $\sigma_{k}$. If the parameters $\sigma_{k}$ are assumed to be the same across nodes, this reduces to the FEEMS model Equation (6), while setting $\sigma_{k}=\left(1-L_{kk}^{\dagger}\right)/2$ gives the model Equation (19). Then we solve the following optimization problem
where $\ell_{\text{joint}}$ is the negative log-likelihood function based on Equation (22). Note that the optimization searches over both the residual variances and the edge weights. To solve the problem, we can use a quasi-Newton algorithm to optimize the objective function.
Appendix 1—figure 13 shows the fitted graphs with different strategies of estimating the residual variances. Appendix 1—figure 13A shows the result when the model has a single residual variance $\sigma^{2}$, and Appendix 1—figure 13B shows the result when the residual variances are allowed to vary across nodes. In both cases, estimating the residual variances jointly with the edge weights yields outputs comparable to the default setting of fixing the variance in advance from the null model (Figure 4), except that we can further observe reduced effective migration around the Queen Elizabeth Islands, as shown in Appendix 1—figure 13B. In EEMS, in order to estimate the genetic diversity parameters for every spatial location, which play a similar role to the residual variances in FEEMS, a Voronoi-tessellation prior is placed to encourage sharing of information across adjacent nodes and prevent overfitting. Similarly, we can place the spatial smoothness penalty on the residual variances (i.e. $\varphi_{\lambda,\alpha}$ defined on the variable $\boldsymbol{\sigma}^{2}$), but this introduces additional hyperparameters to tune without substantially improving the model’s fit to the data. In this work, we choose to fit the single residual variance $\sigma^{2}$ under the null model and fix it in advance, a simple but effective strategy with apparently good empirical performance.
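The idea of searching over edge weights and a residual variance at the same time with a quasi-Newton method can be sketched on a toy problem. Everything below is an assumption made for illustration: the covariance model $\boldsymbol{L}^{\dagger}+\sigma^{2}\mathbf{I}$ with all nodes observed, the Gaussian-type objective on contrasts, and the noiseless "sample" covariance are stand-ins and do not reproduce Equation (22) or its penalty. The optimizer is SciPy's L-BFGS-B with finite-difference gradients.

```python
import numpy as np
from scipy.optimize import minimize

d = 4
edges = [(1, 0), (2, 1), (3, 2), (3, 0)]
C = np.eye(d)[1:] - np.eye(d)[:-1]            # contrasts that remove the mean

def covariance(theta):
    # theta holds log edge weights followed by the log residual variance.
    w, sigma2 = np.exp(theta[:-1]), np.exp(theta[-1])
    W = np.zeros((d, d))
    for (k, l), wk in zip(edges, w):
        W[k, l] = W[l, k] = wk
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.pinv(L) + sigma2 * np.eye(d)

def neg_loglik(theta, S_hat):
    # Toy objective on contrasts: tr(Omega^{-1} C S C^T) + log det(Omega).
    Omega = C @ covariance(theta) @ C.T
    _, logdet = np.linalg.slogdet(Omega)
    return np.trace(np.linalg.solve(Omega, C @ S_hat @ C.T)) + logdet

theta_true = np.log(np.array([1.0, 2.0, 0.5, 3.0, 0.1]))
S_hat = covariance(theta_true)                 # noiseless stand-in for data
theta0 = np.zeros(len(edges) + 1)              # start from w = 1, sigma^2 = 1

res = minimize(neg_loglik, theta0, args=(S_hat,), method="L-BFGS-B")
f_start = neg_loglik(theta0, S_hat)
f_true = neg_loglik(theta_true, S_hat)
```

Working on the log scale keeps both the weights and the variance positive, so no bound constraints are required in this toy setting.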
Edge versus node parameterization
One of the novel features of FEEMS is its ability to directly fit the edge weights of the graph that best suit the data. This direct edge parameterization may increase the risk of overfitting, but it also allows for more flexible estimation of migration histories. Furthermore, as seen in Figure 2 and Appendix 1—figure 2, it has the potential to recover anisotropic migration processes. This is in contrast to EEMS, wherein every spatial node is assigned an effective migration parameter $m_{k}$ and the migration rate on each edge joining nodes $k$ and $\ell$ is given by the average effective migration $w_{k\ell}=(m_{k}+m_{\ell})/2$. Not surprisingly, by assigning each edge the average of its connected nodes, a form of implicit spatial regularization is imposed because multiple edges connected to the same node share that node’s parameter value. In some cases, this has the desirable property of imposing an additional degree of similarity across edge weights, but at the same time it also restricts the model’s capacity to capture richer structure present in the data (e.g. Petkova et al., 2016, Supplementary Figure 2). To be concrete, Appendix 1—figure 15 displays two different fits of FEEMS based on edge parameterization (Appendix 1—figure 15A) and node parameterization (Appendix 1—figure 15B), run on a previously published dataset of human genetic variation from Africa (see Peter et al., 2020 for details on the description of the dataset). Running FEEMS with a node-based parameterization is straightforward in our framework: all we have to do is reparameterize the edge weights by the average effective migration and solve the corresponding optimization problem (Optimization) with respect to $\boldsymbol{m}$. It is evident from the results that FEEMS with edge parameterization recovers subtle correlations that exist between the annotated demes in the figure, whereas node parameterization fails to recover them.
We also compare the model fit of FEEMS to the observed genetic distance (Appendix 1—figure 16) and find that the edge-based parameterization provides a better fit to the African dataset. Appendix 1—figure 17 further demonstrates that in the coalescent simulations with anisotropic migration, the node parameterization is unable to recover the ground truth of the underlying migration rates even when the nodes are fully observed.
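The node-based reparameterization amounts to a sparse linear map from node parameters to edge weights. A minimal sketch, with a hypothetical helper `averaging_matrix` and a toy 4-node cycle, is:

```python
import numpy as np
import scipy.sparse as sp

def averaging_matrix(d, edges):
    """Sparse m x d matrix B with w = B m, i.e. w_kl = (m_k + m_l) / 2."""
    m = len(edges)
    rows = np.repeat(np.arange(m), 2)      # each edge row has two entries
    cols = np.array(edges).ravel()         # the two endpoints of each edge
    return sp.csr_matrix((np.full(2 * m, 0.5), (rows, cols)), shape=(m, d))

d = 4
edges = [(1, 0), (2, 1), (3, 2), (3, 0)]
m_node = np.array([1.0, 3.0, 5.0, 7.0])    # node-level migration parameters

B = averaging_matrix(d, edges)
w = B @ m_node                             # implied edge weights

# Chain rule: a gradient with respect to w maps to node parameters via B^T.
grad_w = np.ones(len(edges))               # placeholder edge gradient
grad_m = B.T @ grad_w
```

With $d$ node parameters instead of $m$ edge weights, two edges that share both endpoints' values are forced to be equal, which is one way to see why this parameterization cannot represent direction-dependent (anisotropic) rates at a node.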
Smooth penalty with ${\mathrm{\ell}}_{1}$ norm
FEEMS’s primary optimization objective (see Equation 9) is:
where the spatial smoothness penalty is given by an $\ell_{2}$-based penalty function: $\varphi_{\lambda,\alpha}(\boldsymbol{w})=\frac{\lambda}{2}\Vert \mathbf{\Delta}\log\left(e^{\alpha \boldsymbol{w}}-\mathbf{1}\right)\Vert_{2}^{2}$. It is well known that an $\ell_{1}$-based penalty can lead to better locally adaptive fitting and structural recovery than $\ell_{2}$-based penalties (Wang et al., 2016), but at the cost of handling non-smooth objective functions that are often computationally more challenging. In a spatial genetic dataset, one major challenge is to deal with the relatively sparse sampling design where there are many unobserved nodes on the graph. In this statistically challenging scenario, we found that an $\ell_{2}$-based penalty allows for more accurate and reliable estimation of the geographic features.
Specifically, writing $\varphi_{\lambda,\alpha}^{\ell_{1}}(\boldsymbol{w})=\lambda \Vert \mathbf{\Delta}\log\left(e^{\alpha \boldsymbol{w}}-\mathbf{1}\right)\Vert_{1}$, we considered the following alternative composite objective function:
To solve Equation (23), we apply the linearized alternating direction method of multipliers (ADMM) (Boyd, 2010), a variant of the standard ADMM algorithm that iteratively optimizes the augmented Lagrangian over the primal and dual variables. The derivation of the algorithm is a standard calculation, so we omit the detailed description. Contrary to the common belief about the effectiveness of the $\ell_{1}$ norm for structural recovery, the graph recovered by FEEMS using the $\ell_{1}$-based smooth penalty shows less accurate reconstruction of the migration patterns, especially when the sampling design has many locations with missing data on the graph (Appendix 1—figure 18A, Appendix 1—figure 19H). We can see that the $\ell_{1}$-based penalty function is not able to accurately estimate edge weights in regions with little data, partially due to its local adaptation, in contrast to the $\ell_{2}$-based method that regularizes more globally. This suggests that in order to use the $\ell_{1}$ penalty $\varphi_{\lambda,\alpha}^{\ell_{1}}(\boldsymbol{w})$ in the presence of many missing nodes, one may need an additional degree of regularization that encourages global smoothness of the graph’s edge weights, such as a combination of $\varphi_{\lambda,\alpha}^{\ell_{1}}(\boldsymbol{w})$ and $\varphi_{\lambda,\alpha}(\boldsymbol{w})$ (in the same spirit as the elastic net [Zou and Hastie, 2005]), or $\varphi_{\lambda,\alpha}^{\ell_{1}}(\boldsymbol{w})$ on top of a node-based parameterization (see Appendix 1—figure 18B).
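The two penalties discussed in this section can be evaluated with a sparse incidence matrix as in the following sketch. It assumes that $\mathbf{\Delta}$ has one row per pair of edges sharing a node, with entries $+1$ and $-1$; that construction, the helper names, and the toy graph are assumptions for illustration rather than the package's code.

```python
import numpy as np
import scipy.sparse as sp
from itertools import combinations

def edge_incidence(edges):
    """Signed incidence matrix with one row per pair of edges sharing a node."""
    rows, cols, vals = [], [], []
    r = 0
    for a, b in combinations(range(len(edges)), 2):
        if set(edges[a]) & set(edges[b]):   # the two edges touch a common node
            rows += [r, r]
            cols += [a, b]
            vals += [1.0, -1.0]
            r += 1
    return sp.csr_matrix((vals, (rows, cols)), shape=(r, len(edges)))

def smooth_transform(w, alpha):
    # log(exp(alpha * w) - 1), written with expm1 for numerical stability.
    return np.log(np.expm1(alpha * w))

def penalty_l2(w, Delta, lam, alpha):
    return 0.5 * lam * np.sum((Delta @ smooth_transform(w, alpha)) ** 2)

def penalty_l1(w, Delta, lam, alpha):
    return lam * np.sum(np.abs(Delta @ smooth_transform(w, alpha)))

edges = [(1, 0), (2, 1), (3, 2), (3, 0)]    # toy 4-node cycle
Delta = edge_incidence(edges)
lam, alpha = 2.06, 1.0

w_flat = np.ones(4)                          # spatially homogeneous weights
w_rough = np.array([1.0, 2.0, 0.5, 3.0])     # heterogeneous weights
p2_flat = penalty_l2(w_flat, Delta, lam, alpha)
p2_rough = penalty_l2(w_rough, Delta, lam, alpha)
p1_flat = penalty_l1(w_flat, Delta, lam, alpha)
p1_rough = penalty_l1(w_rough, Delta, lam, alpha)
```

Both penalties vanish for constant weights; they differ in how they grow with the differences between neighboring edges, quadratically for the $\ell_{2}$ form and linearly for the $\ell_{1}$ form.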
Coalescent simulations with weak migration
In Figure 2, we evaluated FEEMS by applying it to ‘out-of-model’ coalescent simulations. In these simulations, we generated genotype data under a coalescent model with structured meta-populations organized on a spatial triangular lattice. In a relatively ‘strong’ heterogeneous migration scenario (Figure 2D,E,F), we set the coalescent migration rate to be an order of magnitude lower (10-fold) in the center of the spatial grid than in the left and right regions, emulating a depression in gene-flow caused, for example, by a mountain range or body of water. The variation in migration rates should create a spatially varying covariance structure in the genetic variation data. To get a sense of the level of genetic divergence implied by this simulation setting, we visualized Wright’s fixation index ($F_{ST}$, Patterson’s estimator [Patterson et al., 2012]) plotted against the geographic distance between nodes (Appendix 1—figure 20). We see that in the strong heterogeneous migration simulation there is a clear signal of two clusters of data points (Appendix 1—figure 20B). These clusters correspond to pairwise $F_{ST}$ comparisons of two nodes on the same side of the central depression in gene flow, where gene-flow roughly follows a homogeneous ‘isolation-by-distance’-like pattern, or two nodes across the central depression where gene-flow is reduced, hence increasing the expected $F_{ST}$ between such nodes.
While simulating this strong reduction of gene-flow provides an illustrative and clear example where FEEMS has a lot of signal for accurate inference, we wanted to understand the qualitative performance of FEEMS in a less idealized scenario with weaker signal. To this end, we performed coalescent simulations with only a 25% reduction of gene-flow in the center of the habitat (Appendix 1—figure 21). In Appendix 1—figure 21A, when all the nodes are observed on the spatial graph, FEEMS is still able to detect this subtle reduction of gene-flow. While FEEMS is able to detect this signal, some estimates among the lower-than-average edge weights remain clearly erroneous, implying the fit could benefit from additional smoothing by increasing the level of regularization in the smoothness penalty. In contrast to the strong heterogeneous migration simulations, we see that the pairwise $F_{ST}$ in this weak migration scenario does not obviously show a ‘clustering’-like effect in the data (Appendix 1—figure 20A). The average $F_{ST}$ between all pairs of demes is approximately three times lower (mean $F_{ST}=0.1175$ for the weak heterogeneity simulation versus mean $F_{ST}=0.3411$ for the strong heterogeneity simulation). When the nodes are sparsely observed on the graph in this weak migration simulation, we see that the FEEMS output is overly smooth (Appendix 1—figure 21B). In the absence of data, and thus with only a weak signal for spatial variation in migration, a smooth visualization is arguably a sensible outcome, given that the regularization acts like a prior distribution favoring spatial homogeneity in levels of effective migration.
In practice, weak population structure can be more accurately dissected by increasing the number of informative SNPs included in the analysis (Novembre and Peter, 2016). In conjunction with running FEEMS, we recommend that users create exploratory visualizations such as variograms and PCA bi-plots to assess the level of population structure in their data, and consider the number of SNPs used in the analysis.
Data availability
Genotyping data can be found at https://doi.org/10.5061/dryad.c9b25 and are stored in the FEEMS Python package at https://github.com/Novembrelab/feems (copy archived at https://archive.softwareheritage.org/swh:1:rev:2df82f92ba690f5fd98aee6612b155d973ffb12d).

Genetic subdivision and candidate genes under selection in North American grey wolves. Dryad Digital Repository. https://doi.org/10.5061/dryad.c9b25
References

Estimating recent migration and population-size surfaces. PLOS Genetics 15:e1007908. https://doi.org/10.1371/journal.pgen.1007908

Fast model-based estimation of ancestry in unrelated individuals. Genome Research 19:1655–1664. https://doi.org/10.1101/gr.094052.109

Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3:1–122. https://doi.org/10.1561/2200000016

Spatial population genetics: it's about time. Annual Review of Ecology, Evolution, and Systematics 50:427–449. https://doi.org/10.1146/annurev-ecolsys-110316-022659

Evaluation of wolf density estimation from radiotelemetry data. Wildlife Society Bulletin 33:1225–1236. https://doi.org/10.2193/0091-7648(2005)33[1225:EOWDEF]2.0.CO;2

A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing 16:1190–1208. https://doi.org/10.1137/0916069

The electrical resistance of a graph captures its commute and cover times. Computational Complexity 6:312–340. https://doi.org/10.1007/BF01270385

Learning Laplacian matrix in smooth graph signal representations. IEEE Transactions on Signal Processing 64:6160–6173. https://doi.org/10.1109/TSP.2016.2602809

Learning graphs from data: a signal representation perspective. IEEE Signal Processing Magazine 36:44–63. https://doi.org/10.1109/MSP.2018.2887284

How can we infer geography and history from gene frequencies? Journal of Theoretical Biology 96:9–20. https://doi.org/10.1016/0022-5193(82)90152-7

Circuit theory and model-based inference for landscape connectivity. Journal of the American Statistical Association 108:22–33. https://doi.org/10.1080/01621459.2012.724647

Conference: How to learn a graph from smooth signals. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics. pp. 920–929.

Conference: Stepping stone model of population. Annual Report of the National Institute of Genetics Japan. pp. 62–63.

ANGSD: analysis of next generation sequencing data. BMC Bioinformatics 15:356. https://doi.org/10.1186/s12859-014-0356-4

Are populations like a circuit? Comparing isolation by resistance to a new coalescent-based method. Molecular Ecology Resources 19:1388–1406. https://doi.org/10.1111/1755-0998.13035

Connecting the dots: identifying network structure via graph signal processing. IEEE Signal Processing Magazine 36:16–43. https://doi.org/10.1109/MSP.2018.2890143

The trouble with isolation by distance. Molecular Ecology 21:2839–2846. https://doi.org/10.1111/j.1365-294X.2012.05578.x

Recent advances in the study of fine-scale population structure in humans. Current Opinion in Genetics & Development 41:98–105. https://doi.org/10.1016/j.gde.2016.08.007

Ancient admixture in human history. Genetics 192:1065–1093. https://doi.org/10.1534/genetics.112.145037

Genetic landscapes reveal how human genetic diversity aligns with geography. Molecular Biology and Evolution 37:943–951. https://doi.org/10.1093/molbev/msz280

Book: Inferring Effective Migration From Geographically Indexed Genetic Data. Chicago, United States: The University of Chicago Press.

Toward a new history and geography of human genes informed by ancient DNA. Trends in Genetics 30:377–389. https://doi.org/10.1016/j.tig.2014.07.007

Comparison of Bayesian clustering and edge detection methods for inferring boundaries in landscape genetics. International Journal of Molecular Sciences 12:865–889. https://doi.org/10.3390/ijms12020865

Geodesic discrete global grid systems. Cartography and Geographic Information Science 30:121–134. https://doi.org/10.1559/152304003100011090

Gene flow in natural populations. Annual Review of Ecology and Systematics 16:393–430. https://doi.org/10.1146/annurev.es.16.110185.002141

Lx = b. Foundations and Trends in Theoretical Computer Science 8:1–141. https://doi.org/10.1561/0400000054

Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B 67:301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x
Decision letter

George H Perry, Senior and Reviewing Editor; Pennsylvania State University, United States

Isabel Alves, Reviewer; University of Nantes, France

Wesley Tansey, Reviewer
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Acceptance summary:
The authors of the manuscript present a new implementation of the previously developed statistical method called "Estimating Effective Migration Surfaces", which displays on a geographical map regions of low or high effective migration under a broad model of isolation by distance. In this new implementation migration surfaces are estimated under a penalized-likelihood approach coupled with optimization instead of MCMC leading to faster running times. The new implementation facilitates faster running times to make its usage computationally possible for a wider range of research groups and likely be applied to an even larger number of species/populations.
Decision letter after peer review:
Thank you for submitting your article "Fast and Flexible Estimation of Effective Migration Surfaces" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by George Perry as the Senior and Reviewing Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Isabel Alves (Reviewer #1); Wesley Tansey (Reviewer #2).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
As the editors have judged that your manuscript is of interest, but as described below that additional analyses are required before it is published, we would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). First, because many researchers have temporarily lost access to the labs, we will give authors as much time as they need to submit revised manuscripts. We are also offering, if you choose, to post the manuscript to bioRxiv (if it is not already there) along with this decision letter and a formal designation that the manuscript is "in revision at eLife". Please let us know if you would like to pursue this option. (If your work is more suitable for medRxiv, you will need to post the preprint yourself, as the mechanisms for us to do so are still in development.)
Summary:
The authors of the manuscript present a new implementation of the previously developed statistical method called "Estimating Effective Migration Surfaces", which displays on a geographical map regions of low or high effective migration under a broad model of isolation by distance. In this new implementation migration surfaces are estimated under a penalized-likelihood approach coupled with optimization instead of MCMC leading to faster running times. The new implementation appears very promising as faster running times will make its usage computationally possible for a wider range of research groups and likely be applied to an even larger number of species/populations. Overall, we value the approach for its pragmatism but felt that it falls short at the very end by failing to provide any quantitative, objective way to choose the hyperparameters, which needs to be addressed as per the essential revisions detailed below.
Essential revisions:
1. The authors need to provide principled (or at least reproducible) ways to select the hyperparameters. Specific reviewer comments include:
"This is a major benefit of the L1 penalty. You can use BIC as the model selection criterion in the L1 case since the degrees of freedom are well-described. In the squared L2 case, it's not really possible. The authors discuss the issue and note LOOCV did not produce stable results, but they do not provide any data or examples. A more thorough investigation of hyperparameter settings is needed along with a recommendation that does not rely on biologists' subjective preference of the results on each dataset."
"FEEMS outcomes are very sensitive to user-based settings such as grid density and tuning parameters, as well as to aspects of the real data (eg. sampling design) that may result in an arbitrary choice of the outcome and lead to overinterpretations. I know the authors recommend to explore several combinations of regularization parameters and then compare FEEMS results with clustering/differentiation patterns based on approaches like ADMIXTURE or FST distances in order to support the results, nevertheless it is still difficult to grasp what's the best strategy to assess if a fitted graph is overfitting the observed data or instead is pointing out to a real area of, let's say, low effective migration rate. Sentences like: "…, while setting up the tuning parameter ɑ to a value that we found that worked for multiple data applications.…" (lines: 225-226) or "it is helpful to look more closely at particular solutions that find balance between spatial homogeneity and complexity…" (lines: 243-245) are confusing and make difficult the choice of the final regularization parameters."
"The grid design is another arbitrary aspect of the method whose influence on the identification of regions of low or high migration isn't clear. Imagining one has the computational power to construct a very dense grid, is it worth doing it once there is observed data in 1% of the nodes? Is there a good relationship between density and number of sampled points? Does it affect the outcome?"
"I think the points above would be clearer if the authors would provide for instance, step-by-step guidelines to help future users of FEEMS and referring to specific examples in the manuscript in order to more clearly justify their parameter choice (eg ɑ = 50 line 226)."
2. Please clarify the modeling decision and its comparison to the L1 approach. For instance, isn't smoothing over nodes vs edges really just the same thing with different penalties? The authors form a lifted graph, where edges are now nodes and they penalize differences between neighboring edges. In the L1 penalty case with a squared error loss, the fused lasso / total variation penalty on neighboring edges is equivalent to linear trend filtering on the nodes. The choice then of linear trend filtering could have been replaced with a higher order trend filtering step to achieve the smoothness that the authors seem to say is lacking in the L1 model.
3. Are the data points we observe actually sampled at random? Is some sort of latent confounding likely? For example, maybe wolves migrate based on the season and the scientists collecting the data only look in one spot in one particular time of year?
4. Please clarify the simulation results in Supp Fig19, panels I and J. Without any data points in the orange regions for panel I, the model somehow infers that there is a band of different edge weights. How? In the 1d case, it's as if someone showed you: [5, 5, 5, missing, missing, missing, 5, 5, 5], and you come back and told me [5, 5, 5, 3, 3, 3, 5, 5, 5]. Is this possible?
5. At present, there is no real quantitative assessment of how good the FEEMS solutions are relative to the EEMS solution. This should be provided.
6. It would be useful to provide another example of a heterogeneous migration scenario where the reduction in migration is less than one order of magnitude in order to give an idea to the user of how the method performs in a less heterogeneous scenario (ie the lower bound).
https://doi.org/10.7554/eLife.61927.sa1
Author response
Essential revisions:
1. The authors need to provide principled (or at least reproducible) ways to select the hyperparameters. Specific reviewer comments include:
"This is a major benefit of the L1 penalty. You can use BIC as the model selection criterion in the L1 case since the degrees of freedom are well-described. In the squared L2 case, it's not really possible. The authors discuss the issue and note LOOCV did not produce stable results, but they do not provide any data or examples. A more thorough investigation of hyperparameter settings is needed along with a recommendation that does not rely on biologists' subjective preference of the results on each dataset."
"FEEMS outcomes are very sensitive to user-based settings such as grid density and tuning parameters, as well as to aspects of the real data (eg. sampling design) that may result in an arbitrary choice of the outcome and lead to overinterpretations. I know the authors recommend to explore several combinations of regularization parameters and then compare FEEMS results with clustering/differentiation patterns based on approaches like ADMIXTURE or FST distances in order to support the results, nevertheless it is still difficult to grasp what's the best strategy to assess if a fitted graph is overfitting the observed data or instead is pointing out to a real area of, let's say, low effective migration rate. Sentences like: "…, while setting up the tuning parameter ɑ to a value that we found that worked for multiple data applications.…" (lines: 225-226) or "it is helpful to look more closely at particular solutions that find balance between spatial homogeneity and complexity…" (lines: 243-245) are confusing and make difficult the choice of the final regularization parameters."
"The grid design is another arbitrary aspect of the method whose influence on the identification of regions of low or high migration isn't clear. Imagining one has the computational power to construct a very dense grid, is it worth doing it once there is observed data in 1% of the nodes? Is there a good relationship between density and number of sampled points? Does it affect the outcome?"
"I think the points above would be clearer if the authors would provide for instance, step-by-step guidelines to help future users of FEEMS and referring to specific examples in the manuscript in order to more clearly justify their parameter choice (eg ɑ = 50 line 226)."
We thank the reviewers for these helpful comments on selecting the hyperparameters of the penalty and agree that an automated selection procedure would improve the interpretability and usability of FEEMS. To this end, we have made a number of updates to our modeling approach that allow for fully automated hyperparameter selection and have proven to work well in practice. These developments follow from a simple but effective update to the parameterization of our penalty that allows for a cross-validation approach over the smoothness parameter lambda alone. Specifically, we use the edge weights fitted under a homogeneous migration model to penalize differences in neighboring edge weights on both the linear and log scale relative to a homogeneous fitted parameter, which we call w_{0}. This natural parameterization of the penalty, together with pre-estimation of w_{0} under a simple homogeneous null model, allowed us to effectively remove the alpha parameter from the original penalty, enabling a one-dimensional cross-validation algorithm for selecting lambda in a computationally efficient and reliable manner. For more details please see the updated “Overview of FEEMS” in the Results section and “Penalty description” in the Materials and methods section.
With this new penalty in hand, we used leave-one-out cross-validation to select the smoothness parameter lambda. In the cross-validation algorithm, for each grid value of lambda, we held out an individual observed node (population) on the graph and then predicted the underlying allele frequencies at the held-out node under our fitted spatial model from the remaining training-set nodes (see the new section “Leave-one-out cross-validation to select tuning parameters” in Materials and methods for details). We found that leave-one-out cross-validation over lambda provided satisfactory results that recovered true migration histories in coalescent simulations and aligned with biological expectations in real datasets.
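The hold-out loop described above can be sketched on a toy problem. The following is only an illustration under stated assumptions, not the FEEMS model or its code: the graph is a 1D chain, the "spatial model" is a simple Laplacian smoother standing in for the fitted FEEMS model, and the names chain_laplacian, fit_smoother, loocv_error, and select_lambda are hypothetical. It shows the general pattern of holding out each observed node, predicting it from the remaining nodes, and picking the lambda with the lowest error.

```python
import numpy as np


def chain_laplacian(n):
    """Laplacian of a 1D chain graph with unit edge weights (toy habitat)."""
    lap = np.zeros((n, n))
    for i in range(n - 1):
        lap[i, i] += 1.0
        lap[i + 1, i + 1] += 1.0
        lap[i, i + 1] -= 1.0
        lap[i + 1, i] -= 1.0
    return lap


def fit_smoother(y, train, lap, lam):
    """Minimize sum_{i in train} (y_i - f_i)^2 + lam * f^T L f over all nodes.

    This is a stand-in for the spatial model; it requires lam > 0.
    """
    n = lap.shape[0]
    fit_term = np.zeros((n, n))
    fit_term[train, train] = 1.0  # only training nodes enter the data-fit term
    rhs = np.zeros(n)
    rhs[train] = y[train]
    return np.linalg.solve(fit_term + lam * lap, rhs)


def loocv_error(y, observed, lap, lam):
    """Mean squared prediction error over held-out observed nodes."""
    errors = []
    for held_out in observed:
        train = observed[observed != held_out]
        f_hat = fit_smoother(y, train, lap, lam)
        errors.append((y[held_out] - f_hat[held_out]) ** 2)
    return float(np.mean(errors))


def select_lambda(y, observed, lap, lam_grid):
    """Return the grid value of lambda with the smallest LOOCV error."""
    errors = [loocv_error(y, observed, lap, lam) for lam in lam_grid]
    return lam_grid[int(np.argmin(errors))], errors


# Toy data: a smooth signal observed with noise at every other node.
rng = np.random.default_rng(0)
n_nodes = 20
observed = np.arange(0, n_nodes, 2)
truth = np.sin(np.linspace(0.0, np.pi, n_nodes))
y = truth + 0.05 * rng.standard_normal(n_nodes)
lam_grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam, cv_errors = select_lambda(y, observed, chain_laplacian(n_nodes), lam_grid)
print(best_lam, cv_errors)
```

Because only one parameter is searched, the cost is one fit per grid value per held-out node, which is what makes a one-dimensional search practical.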
We have updated all of the text with descriptions of the new penalty and have reanalyzed the data and reproduced the figures using leave-one-out cross-validation to select lambda. This greatly simplifies the text, and we hope it addresses the concerns raised by the reviewers regarding hyperparameter selection. Figure 3 now shows fitted FEEMS visualizations across the grid points of lambda that were used in leave-one-out cross-validation. We added a new panel, Figure 3E, which shows the cross-validation error for the full grid and highlights the visualized maps for a subset of lambda values. In Figure 4, we now display the solution that achieves the minimum cross-validation error. All coalescent simulation figures have been updated as well. In general, the interpretation and results have not changed with this new penalty and cross-validation procedure, but the ease of use and clarity of FEEMS have greatly improved.
We also thank the reviewers for the suggestion on using the BIC for model selection under the L1 penalty. As mentioned above, for our new penalty, which is still the L2 distance between migration weights on neighboring edges, we have found leave-one-out cross-validation to perform well. Because cross-validation can, in principle, work for both the L1 and L2 penalties, we prefer to use a method that works more "universally" for any penalty, whether it induces exact sparsity or not. We also prefer the L2 penalty for other statistical and computational reasons and expand upon this point in our response in the next section.
In terms of concerns about the grid density, we do not have new solutions for this problem which is also a caveat in the original EEMS method (see Petkova et al. 2016 discussion); however we highlight this issue more prominently with a new paragraph in the discussion.
2. Please clarify the modeling decision and its comparison to the L1 approach. For instance, isn't smoothing over nodes vs edges really just the same thing with different penalties? The authors form a lifted graph, where edges are now nodes and they penalize differences between neighboring edges. In the L1 penalty case with a squared error loss, the fused lasso / total variation penalty on neighboring edges is equivalent to linear trend filtering on the nodes. The choice then of linear trend filtering could have been replaced with a higher order trend filtering step to achieve the smoothness that the authors seem to say is lacking in the L1 model.
In a sense, yes, “smoothing over nodes vs edges is the same thing with different penalties”; however the penalties differ in key ways. In the original parameterization of EEMS, each node was given a parameter and the edge weights were deterministically computed as the average of adjacent connected nodes. While this node-level parameterization reduces the number of parameters needed to be estimated, the edge-level parameterization has two advantages in our view:
1. Each edge is free to take a unique value and that allows for a wider range of anisotropic migration scenarios to be modeled (e.g. spatially homogeneous anisotropy as in Figure 2 right-hand column). That was the main driver for the decision for this new smoothing scheme.
2. By assigning each edge to be the average of connected nodes, a form of implicit spatial regularization is imposed because multiple edges connected to the same node would average that node’s parameter value. We found it more natural to separate out the regularization from the parameterization of the model. This preference led us to adding a smoothness penalty on edge-level parameters.
We thank the reviewer for the suggestion of using higher-order trend filtering. We tested an L1 penalization approach on the edge weights and found that it often fails to give satisfactory results (e.g. Supplementary Materials “Smooth penalty with L1 norm” and Supplementary Figure 18), primarily because the L1 penalty is too locally adaptive in regions with many unobserved locations. In the regime where there is a high degree of missingness in the graph, the global consideration of smoothing the unobserved locations seems to be necessary, and this is the primary driver for using the L2 penalty. We believe that higher-order trend filtering with the L1 penalty may suffer from a similar issue. The smoothness of the L2 penalty also allowed us to employ a quasi-Newton algorithm for optimizing our objective function, which decreased our runtime more than 10-fold compared to first-order methods such as proximal gradient descent and ADMM applied to the L1 formulation of our objective function. In addition, we were able to use a widely used and tested implementation of L-BFGS in scipy, which worked well out of the box and had few algorithmic parameters to tune. Given that we observed satisfactory results with the L2 penalty, in addition to the fast convergence and runtime of the quasi-Newton algorithm, we preferred it over the L1 penalization approach.
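To illustrate why a smooth penalty pairs naturally with a quasi-Newton solver, here is a minimal sketch that minimizes a toy objective with scipy's L-BFGS-B routine. The objective is an assumption chosen for illustration (a squared-error loss plus a squared L2 penalty on differences between neighboring parameters on a chain) and is not the FEEMS likelihood or penalty; difference_matrix and objective are hypothetical helper names. Because both terms are differentiable everywhere, the value and gradient can be handed to the solver directly, with no proximal step.

```python
import numpy as np
from scipy.optimize import minimize


def difference_matrix(m):
    """(m-1) x m matrix D with (D x)_k = x_{k+1} - x_k for chain neighbors."""
    d = np.zeros((m - 1, m))
    idx = np.arange(m - 1)
    d[idx, idx] = -1.0
    d[idx, idx + 1] = 1.0
    return d


def objective(x, z, d, lam):
    """Toy loss 0.5*||x - z||^2 + 0.5*lam*||D x||^2 and its gradient."""
    resid = x - z
    diffs = d @ x
    value = 0.5 * (resid @ resid) + 0.5 * lam * (diffs @ diffs)
    grad = resid + lam * (d.T @ diffs)
    return value, grad


# Noisy step signal standing in for parameters to be smoothed.
rng = np.random.default_rng(1)
m = 30
z = np.concatenate([np.zeros(15), np.ones(15)]) + 0.1 * rng.standard_normal(m)
d = difference_matrix(m)
lam = 5.0

# jac=True tells scipy that objective returns (value, gradient).
result = minimize(objective, x0=np.zeros(m), args=(z, d, lam), jac=True,
                  method="L-BFGS-B", options={"ftol": 1e-12, "gtol": 1e-8})
x_lbfgs = result.x

# This toy problem is quadratic, so a closed-form solution exists for checking.
x_exact = np.linalg.solve(np.eye(m) + lam * (d.T @ d), z)
print(result.nit, np.max(np.abs(x_lbfgs - x_exact)))
```

With an L1 penalty on the differences, the gradient would be undefined wherever neighboring parameters are equal, which is why first-order proximal methods or ADMM are the usual choice in that setting.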
3. Are the data points we observe actually sampled at random? Is some sort of latent confounding likely? For example, maybe wolves migrate based on the season and the scientists collecting the data only look in one spot in one particular time of year?
Regarding whether the data points are sampled at random: in our model we treat the geographic locations of each sample as fixed but the distribution of genetic data as random. We find the method is relatively robust to nonrandom sampling, as long as it is not so sparse as to lose the key signals of differentiation in the data (e.g. Supplementary Figure 2).
Regarding latent confounding, we have expanded the discussion with a paragraph that addresses how a form of confounding between seasonal migration and the sample collection process could be problematic in some datasets. Specifically, we discuss variation in wolf migratory behavior to illustrate the point, and the suggestion helped us add nuance to our discussion of the results.
4. Please clarify the simulation results in Supp Fig19, panels I and J. Without any data points in the orange regions for panel I, the model somehow infers that there is a band of different edge weights. How? In the 1d case, it's as if someone showed you: [5, 5, 5, missing, missing, missing, 5, 5, 5], and you come back and told me [5, 5, 5, 3, 3, 3, 5, 5, 5]. Is this possible?
It is possible, and there are a few ways to understand where the signal for the inference derives from. (1) Consider that in sampled regions one can “learn” a relationship between geographic and genetic distance, and then with that in hand recognize that the observed covariances across a gap of unobserved locations are too large or too small relative to what is seen in the observed data. Regarding the reviewer’s analogy, let’s define the spatial position of each node as its index in the 1D array. Let's further suppose we collected a dataset of samples from nodes 1,2,3 and 7,8,9. First, assume we see that the observed allele frequency covariance for pairs of nodes within 1,2,3 and within 7,8,9 decays by some constant amount for every unit of geographic distance. Then, if the covariance between pairs of nodes spanning 1,2,3 and 7,8,9 is lower than what would be expected given the geographic distance between them, it would provide a signal that the migration rates should be higher for the edges among nodes 1,2,3 and among nodes 7,8,9, and lower for the edges connecting nodes 4,5,6. In contrast, if pairwise covariances between all of the observed nodes decayed over geographic space at a constant rate, we would infer homogeneous migration rates for edges connecting both the missing and non-missing nodes. Our likelihood uses the migration rate parameters to match expected covariances at all nodes to observed covariances, whereas our penalty encourages the migration rates to be smooth, which helps inference in regions with high levels of missing nodes. As a simple one-locus example, consider the allele frequency data observed at the same 1D array of 9 populations: [0.02, 0.01, 0.03, missing, missing, missing, 0.99, 0.98, 0.985]. From such allele frequency data, one can infer that migration rates are unlikely to be homogeneous and, moreover, are likely lower in the region with no observed data.
(2) In a population genetic model, the key determinant of variation in genetic data observed today is where and when the ancestors of the sampled data “coalesce” (i.e. have common ancestry). The genetic ancestors of the present sample can occupy locations where there is no data today, and when they coalesce with each other will be impacted by local migration rates at unsampled locations. That is to say, migration rates at locations where there are no samples today can still impact the genetic data observed. Surprisingly, this implies that, to some extent, one can learn about migration rates even outside the convex hull of the sampled points.
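The 1D argument in point (1) can be made concrete with a small numerical sketch. As an assumption for illustration only, effective resistance on the graph is used here as a proxy for expected genetic dissimilarity between nodes; this is not the FEEMS likelihood, and the edge weights and function names below are hypothetical. The point is that two weight configurations can agree on every within-cluster distance and still differ sharply in the distance across the unsampled gap, so data at nodes 1,2,3 and 7,8,9 alone can distinguish them.

```python
import numpy as np


def chain_laplacian(edge_weights):
    """Weighted Laplacian of a 1D chain; edge k joins node k and node k+1."""
    n = len(edge_weights) + 1
    lap = np.zeros((n, n))
    for k, w in enumerate(edge_weights):
        lap[k, k] += w
        lap[k + 1, k + 1] += w
        lap[k, k + 1] -= w
        lap[k + 1, k] -= w
    return lap


def effective_resistance(lap, i, j):
    """R_ij = L+_ii + L+_jj - 2 * L+_ij, with L+ the Laplacian pseudo-inverse."""
    lap_pinv = np.linalg.pinv(lap)
    return lap_pinv[i, i] + lap_pinv[j, j] - 2.0 * lap_pinv[i, j]


# Nine nodes, eight edges. Indices are 0-based, so node 1 in the text is index 0.
homogeneous = np.ones(8)
# Hypothetical barrier: the four edges touching nodes 4, 5, 6 have reduced weight.
barrier = np.array([1.0, 1.0, 0.25, 0.25, 0.25, 0.25, 1.0, 1.0])

for name, weights in [("homogeneous", homogeneous), ("barrier", barrier)]:
    lap = chain_laplacian(weights)
    within = effective_resistance(lap, 0, 2)  # nodes 1 and 3, both sampled
    across = effective_resistance(lap, 2, 6)  # nodes 3 and 7, spanning the gap
    print(name, round(within, 6), round(across, 6))
```

On a chain, resistances add in series, so the within-cluster value is 2 in both cases while the across-gap value rises from 4 to 16 under the barrier weights.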
5. At present, there is no real quantitative assessment of how good the FEEMS solutions are relative to the EEMS solution. This should be provided.
We now include a new Supplementary Figure 22 with the observed vs fitted dissimilarities output by EEMS when applied to the North American gray wolf dataset. This can be compared to the analogous results for FEEMS already presented in Supplementary Figure 14. The results show that EEMS fits the data relatively well, though there is a collection of points that seem to be poorly fit, as illustrated by a fitted dissimilarity that is systematically higher than what is observed. FEEMS, in contrast, does not seem to have the same cluster of poorly fit samples.
6. It would be useful to provide another example of a heterogeneous migration scenario where the reduction in migration is less than one order of magnitude in order to give an idea to the user of how the method performs in a less heterogeneous scenario (ie the lower bound).
We now include such an example. To assess the performance of FEEMS in a less heterogeneous migration scenario, we applied FEEMS to coalescent simulations where the migration in the center of the habitat (spatial grid) was only reduced by 25% relative to the edges. In Supplementary Figure 21, we can see that FEEMS is still able to recover this reduction of gene flow, but the output visualization is slightly noisier than in the strongly heterogeneous migration simulations in Figure 2. Please refer to a new supplementary section titled “Coalescent simulations with weak migration” for a detailed discussion of the results.
https://doi.org/10.7554/eLife.61927.sa2
Article and author information
Funding
National Science Foundation (DGE1746045)
 Joseph Marcus
National Institute of General Medical Sciences (T32GM007197)
 Joseph Marcus
National Institute of General Medical Sciences (R01GM132383)
 John Novembre
National Science Foundation (TRIPODS Program)
 Wooseok Ha
University of California Berkeley (Institute for Data Science)
 Wooseok Ha
National Science Foundation (DMS1654076)
 Rina Foygel Barber
Office of Naval Research (N000142012337)
 Rina Foygel Barber
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Rena Schweizer for helping us download and process the gray wolf dataset used in the paper, Ben Peter for providing feedback and code for helping to construct the discrete global grids and preparing the human genetic dataset, and Hussein AlAsadi, Peter Carbonetto, Dan Rice for helpful conversations about the optimization and modeling approach. We also acknowledge helpful feedback from Arjun Biddanda, Anna Di Rienzo, Matthew Stephens, the Stephens Lab, the Novembre Lab, and the University of Chicago 4th floor Cummings Life Science Center computational biology community. This study was supported in part by the National Science Foundation via fellowship DGE1746045 and the National Institute of General Medical Sciences via training grant T32GM007197 to JHM and R01GM132383 to JN. WH was partially supported by the NSF via the TRIPODS program and by the Berkeley Institute for Data Science. RFB was supported by the National Science Foundation via grant DMS–1654076, and by the Office of Naval Research via grant N000142012337.
Senior and Reviewing Editor
 George H Perry, Pennsylvania State University, United States
Reviewers
 Isabel Alves, University of Nantes, France
 Wesley Tansey
Publication history
 Received: August 8, 2020
 Accepted: June 7, 2021
 Version of Record published: July 30, 2021 (version 1)
Copyright
© 2021, Marcus et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.