Conjunction of factors triggering waves of seasonal influenza
Abstract
Using several longitudinal datasets describing putative factors affecting influenza incidence and clinical data on the disease and health status of over 150 million human subjects observed over a decade, we investigated the source and the mechanistic triggers of influenza epidemics. We conclude that the initiation of a pan-continental influenza wave emerges from the simultaneous realization of a complex set of conditions. The strongest predictor groups are as follows, ranked by importance: (1) the host population’s socio- and ethno-demographic properties; (2) weather variables pertaining to specific humidity, temperature, and solar radiation; (3) the virus’ antigenic drift over time; (4) the host population’s land-based travel habits, and; (5) recent spatio-temporal dynamics, as reflected in the influenza wave auto-correlation. The models we infer are demonstrably predictive (area under the Receiver Operating Characteristic curve of 80%) when tested with out-of-sample data, opening the door to the potential formulation of new population-level intervention and mitigation policies.
https://doi.org/10.7554/eLife.30756.001
eLife digest
Influenza – or ‘the flu’ – is a contagious disease which sweeps across the globe like clockwork, claiming tens of thousands of lives. This is known as ‘seasonal flu’.
Many scientists have tried to identify the factors that spark these yearly outbreaks. Some past studies have found that seasonal flu occurs when air that is normally humid turns dry, suggesting weather patterns play an important part. Other research has shown that air travel contributes to the flu spreading across the world. However, these studies typically focus on just one or two factors on their own. It is still not clear how exactly these factors combine to drive outbreaks, and then sustain the wave of infection.
To address this, Chattopadhyay et al. analyzed the medical histories of 150 million American people over a decade, combining this information with large datasets describing the different factors that may trigger flu outbreaks. These include detailed data about air travel and weather patterns, as well as census data that describe features of the population. Patterns of movement were also examined, for example by processing billions of Twitter messages “tagged” with a location. Chattopadhyay et al. used all of these datasets to model outbreaks of the flu in the United States, and to see which factors play the biggest role.
It turns out that yearly outbreaks of seasonal flu are a result of a combination of elements. Some factors interact to help trigger the start of the wave, like humid weather in a highly populated area with nearby airports. Other factors, such as how people move, encourage the spread of the infection. Finally, certain features of the population, for example how closely knit a community is, make specific areas of the country more susceptible to the arrival of the disease. Overall, some of the most important elements of the model relate to the characteristics of the populations, the weather, the type of virus, and the number of short-distance journeys (rather than air travel).
Understanding how and why outbreaks occur can help policy-makers design strategies that reduce the spread and impact of seasonal flu, which could potentially save thousands of lives. Ultimately, the model developed by Chattopadhyay et al. could be used to test whether these policies would work before they are implemented in the real world.
https://doi.org/10.7554/eLife.30756.002
Introduction
Seasonal influenza is a serious threat to public health, claiming tens of thousands of lives every year. A large number of past studies have focused on identifying the likely factors responsible for initiating each seasonal disease wave. Typically, each such study focused on one or a few hypothetical factors. Our study aimed at an integrative, joint analysis of numerous suggested disease triggers, comparing their relative importance and possible cooperation in triggering pan-US waves of seasonal influenza infection. The goal of this study was to identify the most informative combinations of statistical predictors associated with the initiation of pan-US influenza infection waves.
Recent computational studies of influenza:
Computational study of the dynamics and factors influencing infectious disease spread began with compartmental models, such as the Susceptible-Infected-Resistant (SIR) model, which traces its origins to the beginning of the last century (Kermack and McKendrick, 1927). Initially a purely theoretical tool, these SIR-style models were subsequently enhanced with population and geographic data, allowing their application to specific cities and the distances between them (Keeling and Rohani, 2002). For instance, one approach, termed ‘gravity wave’ modeling, used geographic, short- and long-range, work-related human movement and demographic data to formulate gravity potentials between US counties in order to infer the dynamics of infection spread (Viboud et al., 2006).
Some studies have focused on one specific factor affecting infection, such as air travel, to simulate the spread of influenza (Colizza et al., 2006); other studies used SIR models, generalized for a collection of interconnected geographic areas, spatial-network or patch models, to model a number of common infections, including influenza, measles, and foot-and-mouth disease (Riley, 2007).
More ambitious network model approaches have simulated the global transmission of infectious disease using high-resolution, worldwide population data and the locations of International Air Transport Association (IATA)-indexed airports (Balcan et al., 2009). Similar to Viboud et al. (2006), the authors of that study computed the global infection-pre-disposing ‘gravity field’ over the network of international airports. This network-based approach was subsequently developed further (Balcan and Vespignani, 2011) through the modeling of ‘phase transition’–that is, the chain-reaction switch of geographic infection status–in complex networks, utilizing approaches introduced in theoretical physics.
Another layer of sophistication was achieved by incorporating rich historical records. For example, Eggo et al. (Eggo et al., 2011) modeled the Spanish influenza epidemic of 1918–1919, using mortality documents from both the UK and the US, explicitly accounting for the size and distances between cities. In the same spirit, Brockmann and Helbing (Brockmann and Helbing, 2013) represented infection as diffusion on a complex network, estimating arrival times for infection across the globe.
Following the formulation of the hypothesis that absolute humidity modulates influenza survival and transmission (Shaman and Kohn, 2009), researchers began incorporating climate variables into SIR-like models (Chowell et al., 2012). More recent dynamic models have incorporated a probabilistic description of influenza infection’s transitions in space and time, accounting for selected demographic confounders (Gog et al., 2014; Charu et al., 2017).
In this study, rather than following the SIR-style modeling tradition, we used statistical epidemiology- and econometric-like approaches, in addition to a causality-network method presented here for the first time. There are some prior studies that are close to ours in spirit (but not in details). For example, (Barreca and Shimshack, 2012) used historical US influenza mortality data (1973–2002) in conjunction with collinear humidity and temperature records to establish county-level statistical associations between variables in the datasets. They concluded that absolute humidity was ‘an especially critical determinant of observed human influenza mortality, even after controlling for temperature.’ Another study, focusing on historical influenza records in the Netherlands, (te Beest et al., 2013) used the number of weekly influenza-like patient visits (transformed into an estimated rate of infection) as a response variable in a regression analysis of climate data. They concluded that the bulk of explained variation (57%) was attributed to the depletion of susceptible hosts during the disease season and non-weather-related ‘between-season effects,’ with only 3% explained by absolute humidity, represented as a continuous predictor variable. Additionally, this study observed that school holidays did not have a statistically significant effect on influenza transmission.
As all causality detection methods come with their own limitations and are imperfect in different ways, we deliberately designed our study to attack the same target problem using three different statistical approaches. Approach 1: A non-parametric Granger analysis (Granger, 1980) focusing on the directionality of infection flows across the US and on whether influenza propagates via long- vs. short-distance travel (we ran the analysis across all pairs of air- and land-travel county neighbors, respectively). Approach 2: A mixed-effect Poisson regression (Hedeker and Gibbons, 2006) explicitly accounting for the auto-correlation of infection waves in time and space, along with the full set of socioeconomic, climate, and geographic predictors. Approach 3: A county-matching, non-parametric analysis to identify the minimum predictive set of factors that distinguish those counties associated with the onset of the influenza season (Morgan and Winship, 2015).
Our study became possible through access to several very large longitudinal datasets: (1) a nine-year collection of insurance records capturing the dynamics of influenza-like illnesses (ILIs) in the United States (Truven MarketScan database, see Materials and methods); (2) temporally collinear, high-resolution weather measurements over every US county; (3) detailed air travel (The United States Bureau of Transportation Statistics, 2010) and geographic proximity data (The United States Census, 2016) showing connectivity between US counties; (4) billions of geo-located Twitter messages reflecting long- and short-distance human movement patterns, and; (5) US census data accounting for US county and county-equivalent population distribution, demographic, and socioeconomic properties (HRSA, 2016). An explicit comparison of the ILI data in the insurance claims to the influenza records provided by the Centers for Disease Control and Prevention (CDC, 2016) showed that the two sources agree well, with insurance claims providing higher data resolution, see Figure 1—figure supplement 1. Curiously, the relationship between the two sources of ILI observations is not linear: We attribute this to the lower resolution of the CDC data. The three types of analysis listed above produce congruent–albeit not identical–results.
Results
The logical flow of our analysis is as follows: (1) We first show that our definition of ILIs corresponds well with CDC data, and that our causality coefficients, defined in Approach 1, have similar meanings to coefficients in the regression analysis; (2) We then explain the outcomes of the analysis according to Approach 1; (3) We then analyze the importance of putative causal factors, Figure 1, as applied to an initiation of influenza season, Figure 2; (4) In Approach 2, we pay special attention to the relative importance of short- and long-distance travel in influenza propagation, Figure 4; (5) We further test the best regression model in terms of predictive accuracy, Figure 5, using disjoint subsets of the data for training and testing, and; (6) We culminate our analysis with a county-matching analysis, Figure 6.
Approach 1: Causality streamlines from non-parametric Granger analysis
Our analysis of health insurance claims covers nine years of influenza cycles (2003 to 2011, inclusive), see Figure 2. We visualized weekly, county-level prevalence as a movie (see Supplement); Figure 2A–H show a few relevant weekly snapshots from different years. The plates in Figure 2A–H, and especially the movie, clearly show that seasonal influenza cycles initiate in the South/Southeastern US and sweep the country from south to north. This pattern is repeated, with some variation, each season.
Figure 3G shows the country-wide propagation dynamics as represented by our computed causality streamlines. The alignment of causality flow vectors into long, continuous streamlines suggests a stable propagation mechanism across the country; the probability of a long sequence of summary movement vectors aligning in direction by mere chance is vanishingly small, especially for longer streamlines.
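To give a feel for why long, aligned streamlines are so unlikely under randomness, consider a simple null model of our own construction (an illustration, not the paper's exact computation), in which each summary causality vector points in a direction drawn uniformly from eight compass sectors:

```python
# Toy null model (our illustration; the 8-sector quantization of directions is
# an assumption): probability that k successive, independently random causality
# vectors all fall within the same compass sector.
def chance_alignment(k, sectors=8):
    # the first vector fixes the sector; each subsequent vector matches it
    # with probability 1/sectors
    return (1.0 / sectors) ** (k - 1)
```

Under this null, even a ten-segment streamline aligns by chance with probability below 10⁻⁸, consistent with the claim that accidental alignment is vanishingly rare.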
Do epidemics originate from the same counties season after season? To answer this question, we follow ‘causality streamlines’ back to their source county. Informally, influenza onset in these source counties has little or no causal dependency on their neighbors. That is, their epidemic states are seemingly caused by factors outside of disease prevalence in other counties. Figure 2K presents the county-specific likelihood of streamline initiation across our nine years of data. To verify that the near-shore position of these source counties is not a mere manifestation of a boundary effect of shore counties (which have no neighbors on the side of a large water body), we carried out identical causality analyses with two different infections, specifically choosing diseases less likely to share etiologies with influenza: HIV and Escherichia coli. The results for both HIV and Escherichia coli infections are shown in Figure 3J and K, which exhibit flow patterns very different from those obtained for influenza. These streamlines almost never originate from the coasts, thus reducing the likelihood that the pattern observed for influenza is a geo-spatial boundary effect. Combined with the exceedingly low probability of chance inference for the streamlines, this strongly supports our conclusion that the epidemics are of coastal origin.
We directly validated our conclusion that influenza waves tend to start in the South by identifying counties which seem to trigger the epidemic. We computed a ‘trigger period’ of five to six weeks for each season, defined as the period immediately preceding an exponential increase in influenza dispersion. To calculate this weekly dispersion, we treated each county as a node in an undirected graph, with an edge connecting two geographically adjacent counties only if they had both reported at least one influenza case in the specified week. We defined dispersion as the size of the largest connected component in this undirected graph. Thus, a trigger period describes the period in which the size of the giant component of the infection graph begins its sharp rise, as shown in Figure 2I, and then proceeds to the seasonal peak. Figure 2J presents the likelihood of a county being part of this largest connected component during the trigger period. In the second approach, we followed causality streamlines back to their source county. Figure 2K presents the county-specific likelihood of streamline initiation across nine years.
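The weekly dispersion computation described above can be sketched in a few lines (a minimal illustration, not the authors' production code; county labels and case counts used in any example are hypothetical):

```python
# Sketch of weekly "dispersion": the size of the largest connected component
# of the county-adjacency graph, where an edge between two geographically
# adjacent counties is kept only if BOTH reported at least one case that week.
from collections import defaultdict

def dispersion(adjacent_pairs, cases):
    """adjacent_pairs: iterable of (county_a, county_b) geographic neighbors.
    cases: dict mapping county -> reported influenza cases this week."""
    graph = defaultdict(set)
    for a, b in adjacent_pairs:
        if cases.get(a, 0) >= 1 and cases.get(b, 0) >= 1:
            graph[a].add(b)
            graph[b].add(a)
    seen, largest = set(), 0
    for start in graph:            # depth-first sweep over components
        if start in seen:
            continue
        stack, comp_size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            comp_size += 1
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        largest = max(largest, comp_size)
    return largest
```

Tracking this quantity week over week is what reveals the sharp rise that defines the trigger period.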
These approaches produced qualitatively similar results (Figure 2J and K). While epidemics seem to start in many places around the country (see the origins of streamlines in Figure 3J and I), they successfully gain traction near large bodies of water (as evidenced by the most likely places of epidemic onset, see Figure 2J and K). Otherwise, they fizzle out before triggering an actual epidemic cycle (see Figure 2J). Seasonal initiation is neither spatially uniform nor simply a reflection of county-specific population density.
Our analysis of the Twitter movement matrix indicates that people most frequently travel between neighboring counties, preferentially towards higher-population-density areas; that is, the maximum-probability movement patterns follow the local gradient of increasing population density (see Figure 4—figure supplement 1). In contrast, the geo-spatially-averaged movement vectors for each county reveal global flows in the movement patterns (see Figure 3H, along with Methods for the calculation of spatial averages).
Figure 3H–I suggest that average movement patterns largely agree with the influenza streamlines: Both patterns, especially in the South/Southeast of the country, are associated with flow pointing away from large bodies of water.
In addition to looking at the direction of short-range travel, we used our non-parametric Granger analysis to investigate the comparative strength of short- vs. long-range influenza propagation. In the first case, we considered the neighborhood map shown in Figure 4A (for a detailed definition of ‘neighbors,’ see Materials and methods), and in the second case, we considered associations between major, airport-bearing counties (see Figure 4B). We then plotted the distribution of the maximum pairwise coefficient of causality, where the maximization is carried out by fixing the source and the target and varying the delay in weeks, after which we attempt to predict the target stream.
Conclusions associated with Approach 1: The inferred causality streamlines computed from the infection time series in all counties (Figure 3) show that epidemics are mostly triggered near large water bodies and flow inland and away. They also illustrate that the US continental Southern states act as ‘sinks’ to a large proportion of these streamlines. (‘Sinks,’ in our definition here, are geographic areas that multiple streamlines converge towards; sinks are especially obvious when we look at the vector representation of causality direction. The opposite of a ‘sink’ is a ‘source,’ defined as an area at which at least one streamline starts.) This might explain the increased prevalence in the designated region. Additionally, the analysis shows that human travel is a very important driver of emergent epidemiological patterns, and that short-range, land-based travel is more important than air-travel. This result is cross-corroborated by our Poisson regression analysis (described next in Approach 2).
Approaches 2 and 3 are motivated by the ‘why’ questions: (1) Why do epidemics initiate where and when they do? and; (2) Why do some disease initiations become epidemics while others do not?
Approach 2: Importance of factors from Poisson regression
We focused on a subset of weeks associated with the initial rise of influenza waves (indicated by the gray bars in Figure 2I, and calculated as discussed earlier). The results from our best-fit model are illustrated in Figure 1A. We selected this particular model out of all those compared in the Bayesian analysis, a few of which we list in Table 5, ranked by their decreasing goodness-of-fit, measured with the Deviance Information Criterion, DIC (see Supplement). From the values of the inferred coefficients corresponding to the different factors, and taking into account their significance levels and credible intervals, we concluded that the roles played by weather variables, particularly humidity, appear to be substantially more complicated than what has been suggested in the literature.
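For reference, the Deviance Information Criterion ranks models by posterior mean deviance plus an effective-parameter penalty. A minimal sketch on a toy (non-mixed) Poisson model follows; the posterior draws are made up, and the paper's actual models are mixed-effect and far richer:

```python
import math

# Illustrative DIC computation on a toy Poisson model (our sketch, not the
# authors' code). `posterior_rates` stands in for MCMC draws of the Poisson
# rate; `counts` for observed weekly case counts.
def poisson_deviance(rate, counts):
    # D(theta) = -2 * log-likelihood of the observed counts
    log_lik = sum(k * math.log(rate) - rate - math.lgamma(k + 1) for k in counts)
    return -2.0 * log_lik

def dic(posterior_rates, counts):
    mean_dev = sum(poisson_deviance(r, counts)
                   for r in posterior_rates) / len(posterior_rates)
    dev_at_mean = poisson_deviance(
        sum(posterior_rates) / len(posterior_rates), counts)
    p_d = mean_dev - dev_at_mean   # effective number of parameters
    return mean_dev + p_d          # lower DIC = better fit/complexity trade-off
```

A degenerate posterior (all draws identical) incurs no complexity penalty, while posterior spread inflates the DIC, which is the mechanism by which the criterion trades goodness-of-fit against model complexity.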
The surprisingly unimportant factors
School schedule was not predictive of influenza onset in our analysis: The estimated p-value and odds ratio strongly suggest that school opening dates are not a significant factor in triggering the seasonal epidemic.
We are not claiming here that closing down schools during the seasonal peak, or during an initial phase of a seasonal epidemic, would not have a beneficial effect on maximum incidence. Rather, the observed epidemiological patterns over the time period we analyzed (2003–2011) do not seem to name ‘school opening times’ as a significant predictive factor–at least in the continental US.
We factored in the effect of vaccination coverage by estimating the cumulative fraction of the population that received the current influenza vaccine, stratified by geo-spatial location and time of inoculation within each influenza season. Our analysis indicated that vaccination coverage is not a significant predictor of the influenza onset/triggering period. This could be a reflection of overall vaccine ineffectiveness, or of the choice of outcome predicted (i.e. vaccination might affect overall infection numbers over the entire outbreak, but not the timing of the trigger). It could also reflect the fact that different influenza types/subtypes have different virulence: a vaccine against H3N2 during an H3N2 year may be more effective, but because H3N2 is more virulent, more people still wind up seeking medical care.
Vaccination coverage failed to reach predictive significance. The variables corresponding to spatio-temporal indicators of the cumulative fraction of the inoculated population are included in our best model (see the last entry in Table 5), but their effect fails to be significant. However, if we drop those variables from the model, the DIC increases. We suggest that the strong dependence between antigenic diversity and vaccination coverage (see Figure 1J–M) is responsible for this effect: Vaccination coverage is important, but its influence is captured by the antigenic variation.
The most important factors
The strongest predictor groups (ranked by importance) are the population’s socio-demographic properties, weather, antigenic drift of the virus, land-based travel, and auto-correlation of influenza waves.
Weather: As far as weather effects are concerned, epidemics tend to originate in places with high mean maximum specific humidity, high average temperature, and low average air pressure, namely, in counties along the Southern and, to a lesser extent, Eastern and Western US coastlines. Additionally, the spread of an epidemic is significantly influenced by a drop in specific humidity up to four weeks before its onset, although this effect is weaker than that of the mean maximum specific humidity. A drop in average temperature one to three weeks prior to the epidemic onset is also significant (consistent with earlier experiments [Lowen et al., 2008]), especially when the temperature drop is accompanied by a decrease in specific humidity, average wind speed, and solar flux. However, high levels of solar flux in the week of onset are also important. This complicated set of weather conditions, a signature of a cold air front (Shaman et al., 2010), is validated by our out-of-sample predictions as increasing the risk of triggering the seasonal epidemic. Total precipitation also plays a positive role.
The weather/humidity paradox
How can colder weather and lower humidity be a predictor of influenza, if influenza epidemic waves tend to start in the South, with warmer climates and higher humidity? Our resolution of this seeming paradox is as follows: The stress is on the drop in both humidity and temperature in those areas with high average annual values of these measurements. A blast of colder, lower-humidity weather in these warm-climate areas has two effects: (1) The influenza virus can stay viable in water droplets longer than in hot, sunny weather, and; (2) Humans tend to interact indoors, in more crowded conditions. Both of these factors are favorable for transmission of the virus to the population at large.
Antigenic variation: Antigenic diversity in HA, NA, M1, and M2 is an important predictor. Interestingly, while increasing diversity in HA, NA, and M1 inhibits the epidemic trigger, higher diversity in M2 enhances it (see Discussion). This peculiar difference in the direction of influence might be a manifestation of the roles played by the individual viral proteins in the viral life-cycle: The first three proteins are directly involved in viral binding to host cell surface receptors, while M2 activity is needed only during HA biosynthesis. Additionally, proteolysis experiments indicated that M2 proton channel activity helped to protect (H1N1)pdm09 HA from premature conformational changes as it traversed low-pH compartments during transport to the cell surface (Alvarado-Facundo et al., 2015).
Travel: Land travel intensity one to three weeks before epidemic onset is a strong predictor. Air travel is also predictive, but its strength is an order of magnitude weaker than that of land travel.
Autocorrelation: An increase in an influenza outbreak’s weekly infection rate one and two weeks before an epidemic onset (in the epidemic source county itself) is predictive of the epidemic wave’s origin.
We have used substantially richer datasets than those used by earlier studies (Shaman et al., 2010; Tamerius et al., 2013), which lends strong statistical support to our conclusions. It also allowed us to disentangle and make precise the contributions from different factors, e.g., mean county humidity vs. drops in humidity before an infection. While we found the former effect to be clearly stronger (in accordance with previously reported results [Shaman et al., 2010]), the other, diverse set of factors was also found to be significant.
Validation of predictive capability
The robustness of our results is established in a number of ways:
In mixed-effect regression (Approach 2), we compared a large number of chosen model variations (see Table 5 for an abridged list, and Supplementary file 3 for the complete enumeration of considered models); the results appear to be qualitatively stable, though the quantitative performance of the models, as measured by DIC, varies somewhat across configurations of regression equations.
We carried out a direct validation of predictive performance by estimating model parameters using the first six seasons and predicting the epidemic trigger locations using the last three (see Figure 5 and Materials and methods). The out-of-sample predictions of influenza incidence are always positively correlated with observed incidence (Plate A). Perhaps more importantly, we obtained good predictability as measured by the area under the curve (AUC of 80%) for the receiver operating characteristic (ROC, see Plates B–D). Plates E–G show that our out-of-sample predictions correctly identified epidemic initiation in the Southern and Southeastern counties of the continental US.
As we discuss in the next section, in Approach 3, we conducted a corroborating matched effect analysis on the counties, using combinations of county-specific factors as a ‘treatment,’ not unlike clinical trials in which patients on a drug regimen are matched to patients receiving a placebo (Morgan and Winship, 2015).
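The ROC/AUC check above reduces to a rank comparison of out-of-sample scores. A self-contained sketch of the rank-based (Mann-Whitney) AUC estimator follows; the scores and labels in the usage are made up, with 1 marking a county that actually initiated an epidemic wave:

```python
# Rank-based (Mann-Whitney) AUC: the probability that a randomly chosen
# positive case outscores a randomly chosen negative one (ties count half).
# Our own minimal implementation, not the authors' validation code.
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to chance-level prediction, so an AUC of 80% indicates substantial out-of-sample discrimination between trigger and non-trigger counties.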
Approach 3: Matching counties and factor combinations
In Approach 3, we investigated combinations of factors presented as ‘treatment’ via a non-parametric, exact-matching analysis of US counties during the weeks of epidemic onset on a season-by-season basis.
First, we collected the list of all counties with a drop in maximum specific humidity during the weeks leading up to an influenza season in a particular year. This is the ‘treated set’: the set of counties that may be thought of as subjected to the positive ‘treatment’ of a drop in specific humidity. We split this set into two, considering counties that also experience increased influenza prevalence during the epidemic onset, and ones that do not (counties with two different values of the outcome variable). The number of counties in these two sets defines the first row of a contingency table. In the second row (the ‘control set’), we focused on counties that do not experience a drop in the maximum specific humidity. However, we only considered counties that have a matching counterpart in the treated set in the following sense: For each county in the control set, we found at least one in the treated set such that the rest of the significant variables (other than specific humidity) had similar variation patterns in both counties. Once we defined the control set, we split it in the manner described for the treated set: We counted the number of control counties that experienced an increased influenza prevalence during epidemic onset, and those which did not. This defined the second row of the contingency table. Finally, we used Fisher’s exact test to compute an odds ratio, along with the test-derived significance (p-value) of the association between the ‘treatment’ and epidemic wave initiation. Furthermore, we defined our treatment to consist of multiple factors simultaneously: specific humidity and its change in the preceding week, along with average temperature and degree of urbanity, see Figure 6.
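The final contingency-table step can be sketched with a self-contained Fisher exact test (our own minimal implementation; the cell counts in any example are invented for illustration, and real analyses involve far larger tables):

```python
from math import comb

# 2x2 Fisher exact test on the matching table described above:
# a = treated counties with epidemic onset, b = treated without,
# c = matched control counties with onset,  d = controls without.
def fisher_exact(a, b, c, d):
    n = a + b + c + d
    row1, col1 = a + b, a + c
    def p_table(x):
        # hypergeometric probability of x in the top-left cell,
        # with row and column margins fixed
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    p_obs = p_table(a)
    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    # two-sided p-value: sum over all tables at least as extreme as observed
    p_two = sum(p_table(x) for x in range(lo, hi + 1)
                if p_table(x) <= p_obs + 1e-12)
    odds = (a * d) / (b * c) if b * c else float('inf')
    return odds, p_two
```

For instance, a table with 8 of 10 'treated' counties showing onset versus 1 of 6 matched controls gives a sample odds ratio of 20 with a two-sided p-value of about 0.035.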
Note that geographic clustering of ‘treated’ and ‘untreated’ counties arose automatically as a result of similar weather patterns being imposed via the constraint of multiple climate variables.
Unlike the mixed-effect regression approach (Approach 2), this matching analysis is non-parametric, and intended to reveal whether multiple factors are, indeed, simultaneously necessary.
The results of Approach 3’s analysis are presented in Figure 6 and Tables 2 and 3. We found that no single variable was able to consistently yield a statistically significant odds ratio greater than one; multiple factors interacted to shape an epidemic trigger (see Table 3 for a few examples). Given the number of significant variables in our best mixed-effect model, an exhaustive search over all combinations was not feasible. Instead, we performed a standard evolutionary search, looking for combinations that yielded a significant odds ratio for individual seasons. Additionally, we considered all seasons together (by simply adding the contingency tables, element-wise) in order to increase the test’s statistical power.
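A hedged sketch of one such evolutionary search over factor subsets follows; it is entirely our illustration: the `score` callback stands in for "significance of the pooled odds ratio under this treatment definition," and any factor names used with it are hypothetical:

```python
import random

# Toy evolutionary search over subsets of candidate factors (our sketch, not
# the authors' exact algorithm). Keeps the top half of the population each
# generation and mutates each survivor by toggling one random factor.
def evolve_subsets(factors, score, generations=200, pop=20, seed=0):
    rng = random.Random(seed)
    population = [frozenset(rng.sample(factors, rng.randint(1, len(factors))))
                  for _ in range(pop)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[:pop // 2]        # elitist selection
        children = []
        for p in parents:
            child = set(p)
            flip = rng.choice(factors)         # mutation: toggle one factor
            child.symmetric_difference_update({flip})
            if child:                          # keep only non-empty subsets
                children.append(frozenset(child))
        population = parents + children
    return max(population, key=score)
```

Because the parents are always retained, the best-scoring subset found so far is never lost, which makes the search a simple but effective way to explore a combinatorially large space of treatment definitions.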
We isolated ten variables in this manner (as shown in Figure 6, Plate B), which included maximum specific humidity and average temperature along with their variations, the degree of urbanity, antigenic variation, and vaccination coverage.
The factors that appeared most often in our analysis are illustrated in Plate A: It appears that maximum specific humidity and average temperature, along with their variations, and the degree of urbanity have the most frequent contribution, followed by antigenic variation and vaccination coverage.
We did see some new factors here that failed to be significant in the regression analysis (Approach 2), e.g., degree of urbanity and vaccination coverage. While vaccination coverage was included as a factor in our best-performing model in Approach 2, it failed to achieve significance, perhaps due to its strong dependence on antigenic variation (see Figure 1J–M). Degree of urbanity was indeed significant for some of the regression models we considered (see Supplementary Information), but failed to be so for the model with the lowest DIC.
Thus, Approach 3 corroborates and strengthens key claims of Approach 2.
The exact set of factors varied somewhat over the seasons; nevertheless, together, they yield significant results when all seasons are considered together. The matching analysis corroborates our results from both the mixed-effect regression and the geographic streamline analyses: The sets of counties initiating the wave are near coasts on the Southern region of the continental US (see Plates A - I in Figure 6).
Local travel vs. long-distance travel and influenza
Our conclusion that local travel is predominantly responsible for disease wave propagation is supported by several lines of analysis.
First, continuous land-movement infection waves are visible in the weekly influenza rate movie; we computed this movie from insurance claim data and made it available with results of this study.
Second, because our all-weeks-included dataset was too large for the R MCMCglmm package to handle, we performed mixed-effect Poisson regression calculations using a 50 percent random sample of all the weeks for which data were available. In this computation, the airport-proximity, fixed-effect coefficient turned out to be statistically significantly negative (see Supplement A, as well as the editable output file ‘flu-50-percent-weeks.txt’).
Third, the results from our Granger-causality inference showed that:
Local county-to-county movements were much more predictive of influenza wave change than airport movements. Comparing Plates E and F in Figure 4, we see that the local movement causality coefficient is, on average, twice as large as that for long-range movement (Figure 4E). Figure 4E shows that the mean long-range causality coefficient is approximately 0.05, whereas it is just over 0.1 for local propagation. As the causality coefficient quantifies the amount of predictability (measured as information in bits) communicated about the target data stream per observed bit in the source data stream, it follows that, on average, every ten bits of sequential incidence data from an influencing location tell us one bit about the unfolding incidence dynamics in the target location. In the case of long-range movement, informativeness is halved: we need, on average, 20 bits to infer one bit about the state of infection. These calculations strongly suggest that local movement is the stronger predictor of influenza infection propagation.
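The bits-per-bit interpretation above can be illustrated with a toy computation: a hedged stand-in for the causality coefficient, measuring the mutual information a lagged source stream carries about a target stream, per bit of source entropy. The function names and data here are illustrative, not the authors' implementation.

```python
import math
from collections import Counter

def entropy(xs):
    # Shannon entropy (in bits) of an empirical distribution over symbols.
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def info_per_bit(source, target, lag=1):
    # Illustrative informativeness of a lagged source stream about a target
    # stream: mutual information divided by source entropy.  A coefficient
    # of 0.1 means roughly ten source bits per inferred target bit.
    s, t = source[:-lag], target[lag:]
    joint = entropy(list(zip(s, t)))
    mi = entropy(s) + entropy(t) - joint
    return mi / entropy(s) if entropy(s) > 0 else 0.0

# Target copies the source with a one-week lag -> maximally informative.
src = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
tgt = [0] + src[:-1]
print(round(info_per_bit(src, tgt), 2))  # 1.0
```

An uncorrelated target would instead drive the ratio toward zero, mirroring the contrast between local and long-range coefficients in Figure 4E.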
While the most frequent value of the computed time delay in influence propagation between counties with large airports is zero weeks, this distribution is significantly flatter compared to that for local, county-to-county influence propagation.
Discussion
The complex relationship among the driving factors that contribute to the trigger and subsequent development of a seasonal epidemic can be summarized with a forest fire metaphor.
The maturation of a forest fire requires the collusion of multiple factors—namely the presence of flammable media, an initiating spark, and a wind current to help spread the fire. Our conceived mapping of this analogy to influenza infection is as follows:
Flammable Media: The Southern US appears to have an unusually high level of social connectivity; it is at least one order of magnitude higher than that in the north of the country (see GSS survey results (Smith et al., 1972) and Table 1). The number of close friends, close friends who are neighbors, and communities of people who all, or mostly, know each other is much higher in the South than in the country at large. Our conjecture is that one manifestation of this high connectivity is the highest apparent percentage of people infected with influenza (as opposed to in other parts of the country).
Initiation Spark: An initial spark for the infection wave is generated by a combination of weather and demographic factors. Specifically, warm, humid places are conducive to influenza wave initiation - particularly in weeks where specific humidity drops. Airport proximity is important, as well as demographic and economic makeup and also the degree of urbanization. Note that the first static condition (warm humid places) is highly correlated with areas in the South with greater social connectivity. It is possible that static meteorological variables (warm mean temperature and high mean humidity) serve as proxies for high social connectivity or other correlated socioeconomic factors.
Wind: The ‘wind’ in this analogy is the collective movement of a large number of people, integrated over time, revealing persistent ‘currents.’ These currents reproducibly point from coastlines and move inwards towards the center of the continent, making them perfect vehicles to transmit the infection inland from the shores.
Each of our three types of computational approaches has their strengths and weaknesses: (1) The Poisson mixed-effect regression allows for the direct comparison of the predictive strength of numerous predictor variables and accounts for spatial and temporal autocorrelation, but relies on strong modeling assumptions; (2) The non-parametric Granger analysis is not limited by restrictive modeling assumptions in our implementation, though it focuses only on trends of infection propagation between counties, and; (3) The county-matching analysis is also model-free, but this freedom comes at the expense of lesser statistical power.
What is new in this study?
The following aspects make our study of influenza triggers new in the influenza literature: (1) Instead of simulating the plausibility of one particular epidemic trigger model with a dynamic disease transmission model, we used formal model selection tools to compare the goodness of fit of hundreds of plausible models; (2) We explicitly attempted to systematically cross-compare the importance of numerous individual factors typically hypothesized to contribute to epidemic onset; (3) To accomplish this, we collected an unprecedented volume of temporal and spatial data on disease dynamics and the dynamics of putative predisposing factors; (4) We used several orthogonal, computational causality-inference techniques (one of which was developed specifically for this study) to probe associations between disease onset and putative epidemic triggers; (5) We tested our best models for their predictive potential and demonstrated that they are, indeed, suitable for forecasting disease waves, and; (6) We combined, for the first time, numerous candidate factors in a single, integrative study.
Convergent conclusions, culled from these radically different techniques, strengthen our claims and make it statistically unlikely that we are observing analysis artifacts. First, the Granger causality analysis results (Approach 1) provide insights into the details of influenza’s epidemiological dynamics. Figure 3G traces out the paths most likely followed by the infection, on average, across the continental US. We note that of the streamlines sink in counties belonging to the Southern states, which matches up well with the streamline-encoded dynamics of weekly disease incidence over nine years (see Figure 3I). What drives this particular causality field’s geometry? While we cannot definitively answer this question, a comparison with the global patterns that emerge from the local mobility data culled from the aforementioned Twitter database offers a tentative explanation (see Figure 3H). Second, contrary to the reported influence of human travel patterns on seasonal epidemics (Viboud et al., 2006) (but consistent with Gog et al., 2014), we find that short-distance travel contributes more significantly to disease spread (see Figure 4). In particular, we find that long-range air travel is important as an epidemic trigger, but once infection waves are triggered, air travel patterns (or proximity to major airports) become less important. Short-range mobility, on the other hand, is apparently important for sustaining infection transmission over each season. Thus, we find short-range travel to be more important for defining the emergent spatio-temporal geometry of infection waves, while proximity to airports is more important for actually triggering an influenza season; the latter loses positive influence once an infection is under way.
This conclusion is justified as follows: (1) When we performed regression calculations using all weeks for which data were available (as opposed to wave initiation weeks only), the airport proximity predictor coefficient turned out to be statistically significantly negative (see Supplement). (2) Results from our Granger-causal inference indicate that, on average, the local, putatively causal connections are far stronger compared to the putatively causal connections between counties within which the major airports are located (see Figure 4C and F). Additionally, from our best mixed-effect regression model (Figure 1A), we find that land connectivity effects are significantly stronger than air connectivity effects. The predictive value of Twitter connectivity, which intuitively captures both local and long-distance travel, lies in-between the land and air connectivity coefficients. Note that Twitter connectivity is represented as a directed graph, where for each pair of counties i and j, the edge weight represents the conditional probability of ending at county j, given that a traveler/Twitter user started her journey in county i. Transition probabilities from county i sum to one over all destination counties j. Therefore, intuitively, the Twitter connectivity graph should have the features of both a land-connectivity and an air-travel graph, which indeed appears to be close to reality. (3) While airport diffusion is a significant factor in our best Poisson model (using data from the initiation period), the causal streamlines (constructed with the complete, all-year incidence data) do not seem to originate from airport-bearing counties.
The role of short-distance travel is particularly crucial in explaining influenza’s time-averaged, geo-spatial prevalence. While the mixed-effect regression analysis explains seasonal initiation in the vicinity of the continental US Southern shores, it might not, by itself, adequately explain its average prevalence patterns across the country.
Also not explained solely by our regression models is the occurrence of relatively high infection prevalence in the central parts of the country. These differences cannot be attributed to long-distance air travel, as discussed before. However, the routes taken by the causality streamlines (as computed by the non-parametric Granger analysis), interpreted as paths followed by an infection on average, suggest an explanation: The close match between the Granger-causal flow and the short-range mobility patterns (derived from Twitter analysis) strongly suggests that average disease prevalence is modulated by short-range mobility.
Rationale for observational analysis
The traditional empirical approach of testing a causal link between a factor and an outcome of an experiment was to vary one factor at a time, while keeping the other factors (experimental conditions) constant. This ‘all the rest of the conditions are equal’ assumption is often referred to by its Latin form as ceteris paribus. R.A. Fisher ([Fisher, 1935], p. 18) noted that, in real-life experiments, perfect ceteris paribus is not achievable ‘because uncontrollable causes which may influence the results are always … innumerable.’ Fisher’s proposed solution to this problem is to design experiments to involve random assignment of treatment (the putative causal factor’s states) to individual trials and then use regression analysis to estimate the value and significance of the putative causal effect.
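Fisher's prescription can be illustrated with a minimal simulation (a hypothetical sketch, not part of the study): treatment is assigned at random so that the "innumerable" uncontrolled causes average out, and a simple difference of means then recovers the causal effect.

```python
import random

def randomized_trial(effect=2.0, n=1000, seed=7):
    # Fisher's remedy for imperfect ceteris paribus: assign treatment at
    # random so uncontrolled causes average out across arms, then estimate
    # the effect as the treated-vs-control difference of means.
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        noise = rng.gauss(0, 1)          # stand-in for uncontrolled causes
        if rng.random() < 0.5:           # random treatment assignment
            treated.append(effect + noise)
        else:
            control.append(noise)
    return sum(treated) / len(treated) - sum(control) / len(control)

print(round(randomized_trial(), 2))  # close to the true effect of 2.0
```

With a confounded (non-random) assignment, the same difference of means would mix the treatment effect with the confounder, which is precisely the failure mode randomization protects against.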
Likewise, hypothesis-driven science, wherein investigators formulate a single, testable hypothesis and design specific experiments to test it, is a core element of the scientific method, and works well in most scientific fields. However, a new challenge emerges in data-rich scientific fields, such as genomics, epidemiology, economics, climate modeling, and astronomy: How do we choose the most promising hypotheses among millions of eligible candidates that potentially fit data? One solution to this challenge is the many-hypotheses approach, a method of automated hypothesis generation in which many hypotheses are systematically produced and simultaneously tested against all available data. This approach is currently used, for example, in whole-genome association or genetic linkage studies, and often enables truly unexpected discoveries. In contrast to the single-hypothesis approach, the many-hypotheses approach explicitly accounts for the large universe of possible hypotheses through calibrated statistical tests, effectively reducing the likelihood of accidentally accepting a mediocre hypothesis as a proxy for the truth (Nuzzo, 2014).
The many-hypotheses approach provides a complement to carefully controlled and highly focused wet laboratory experiments. Running controlled experiments to test a single hypothesis necessarily ignores many of the complexities of a real-world phenomenon; these complexities are necessarily present in large, longitudinal datasets. Of course, the data-driven ‘many-hypotheses’ approach is only one aspect of the broader scientific process progressing toward the development of verifiable general theories.
Agreement and disagreement between methods
Intuitively, we expected that all three approaches would produce similar, if not identical, results. In practice, while the three approaches agreed in most cases, this agreement was not perfect. For example, the highlighted areas (greater incidence) in the first influenza-season snapshots in Figure 2A–H should each match relatively well to the maps in Figure 6B (or at least to some unspecified subset of ‘high incidence treatment counties’). While the county-matching results point to initiation at the coasts, in Figure 2 the 2006-2007 initiation seems to spread from the West Coast, and the 2010 initiation has a scattered pattern across the middle of the US.
The intuitive explanation of this perceived discrepancy is that the matching method agrees with the other analysis types predominantly, but not in all cases. Each analysis has limitations. In the case of the matching analysis, we have less statistical power than in, for example, Poisson regression; matching by numerous parameters reduces the initial set of thousands of counties to a handful of matching ‘treated’ counties (which meet a particular combination of weather and sociodemographic conditions) and ‘untreated’ counties (very similar to ‘treated’ ones in all respects but treatment). The difficult-to-match, ‘weeded out’ counties may happen to be in the coastal areas indicated as the most likely places of influenza wave origin by other analyses.
In the case of the 2007 and 2010 results, the matching analyses pick patterns that are different from those produced by the causality streamline analysis and mixed-effect Poisson regression models.
Figure 6’s Plate B shows the distribution of the treatment counties and matched-non-treatment counties. Note that here, we are not directly predicting initiation, so while the patterns in Figure 2 and Figure 6 should indeed show some similarity, they are not required to match up perfectly. The most similar treated counties do indeed show up in the Southern shores.
Generalizability
Our analysis uses no prior knowledge specific to influenza epidemiology. As such, these methods are not limited by either the pathogen under consideration (influenza), or the geospatial context (United States). The tools developed here are expected to be equally applicable to analyzing general epidemiological dynamics for pathogens other than influenza, unfolding in arbitrary geographical regions. The specific conclusions we draw about the initiation and propagation of the seasonal influenza in US might not hold true for influenza epidemiology in a different geographical context. However, the analysis tools are still applicable. More broadly, our tools delineate a general approach to modeling complex spatio-temporal dynamics, with applications beyond solely disease epidemiology.
We conclude by highlighting the structure of overlapping conclusions delivered by our three approaches. Approach 1 (Granger-causality analysis): suggests that an epidemic tends to begin in the South, near water bodies, and that short-range, land-based travel is more influential than air travel for infection propagation, providing a map of mean infection flow across the continental US. Approach 2 (Poisson regression): identifies significant predictive factors, ranks these factors by importance, suggests that the Southern shores are where the epidemic begins, and corroborates Approach 1’s result on short-range vs. long-range travel. Approach 3 (county-matching): drills down further, localizing the epidemic onset source to the Southeastern shores of the continental US, and identifies a smaller, validated subset of predictive factors.
Materials and methods
The methods and putative predictors identified in our study should be directly applicable to data outside the US. While the balance (relative importance) of putative triggers of influenza waves may vary, the factors themselves should be universal globally, because the virus and host biology are universal.
Candidate factors in influenza initiation
To investigate county-specific variability, we grouped candidate factors into several categories: demographic; human movement; infection state of county neighbors; the county’s own recent state; and climatic.
Major hypotheses regarding putative causal factors affecting infection dynamics can be traced to a handful of earlier publications:
Short- and long-range, work-related human movement (Viboud et al., 2006), including air travel (Colizza et al., 2006; Balcan et al., 2009; Viboud et al., 2006)
Demographic confounders (Gog et al., 2014; Charu et al., 2017)
Social contact among children in schools, or the ‘Return-to-school effect’ (Gog et al., 2014)
Absolute humidity (Shaman and Kohn, 2009)
Other climate variables (Chowell et al., 2012; te Beest et al., 2013)
Host immunity, as affected by vaccination coverage, previous infections, and antigenic variation (Centers for Disease Control and Prevention et al., 2009).
Human movement
We considered two measurements of human movement: (1) The first measurement reflects the proximity of counties to major airports. We computed an exponentially-diffusing influence from counties with major US airports, weighted by passenger enplanements at their respective locations. This accounted for people moving to and from both major airports and neighboring counties, and; (2) The second measurement is a large-scale movement matrix representing people’s week-to-week travels between counties. These data were culled from a complete collection of geo-located Twitter messages, captured over years, and constituted a large-scale, longitudinal sample of individual movements. We used only automatically geo-tagged tweets.
Demographic
Influenza is transmitted through direct contact with infected individuals, via contaminated objects, and via virus-laden aerosols. Thus, human population density (how many people happen to be around) and social connectivity (how many people interact with each other and how frequently [Bedford et al., 2015]) are factors expected to affect local virus incubation and spread. In addition to population density, we considered socioeconomic factors such as county-mean household income, levels of poverty and urbanity, as well as the prevalence of ethnic and age groups. All these socio-demographic and socio-economic data were derived from reports provided by the US Census Bureau (2017; available from https://www.census.gov/geo/reference/ua/uafacts.html).
Return-to-school effect
Social contact among children in schools has been extensively investigated as a determinant of the peak incidence rate. This is one of the few factors that might lend itself to intervention relatively easily, and hence the interest is well-justified. While any reduction in social contact should, in theory, directly impact transmission, the effect of this specific mode of contact on the incidence rate has been difficult to quantify. Predictions of the reduction in the peak incidence associated with reduced social contact were typically 20–60% (Ferguson et al., 2006; Haber et al., 2007), with some studies predicting much larger reductions of 90% (Mniszewski et al., 2008; Ghosh and Heffernan, 2010). Reductions in the cumulative attack rate (AR, ratio of the number of new cases to the size of the population at risk) were usually smaller than those in the peak incidence. Several studies predicted small (10%) or no reduction in the cumulative AR (Ferguson et al., 2006; Haber et al., 2007; Yasuda et al., 2005; Ciofi degli Atti et al., 2008; Yasuda et al., 2008; Kelso et al., 2009; Davey et al., 2008; Rizzo et al., 2008; Vynnycky and Edmunds, 2008; Glass and Barnes, 2007; Lee et al., 2010; Yang et al., 2011; Zhang et al., 2012), whilst a few predicted substantial reductions (e.g. 90%) (Glass et al., 2006; Davey et al., 2008; Elveback et al., 1976; Davey and Glass, 2008). Only two studies (Glass et al., 2006; Lee et al., 2010) predicted that peak incidence might increase markedly under certain circumstances following school closures, for example by 27% if school closures caused a doubling in the number of contacts in the household and community, or by 13% if school systems were closed for two weeks at a prevalence of 1% in the general population. Studies have also investigated the effect of such interventions on children vs. 
adults; one study predicted an overall reduction in the cumulative AR, but an increase of up to 48% in the cumulative AR for adults in some situations (Araz et al., 2012).
While this diverse set of predictions in the literature often pertains to the effect of school closures as an intervention tool, we are more interested in the influence that the current school schedule has, if any, on triggering an epidemic. To answer this specific question, we formulated a simple statistical test to determine whether the timing of return-to-school after summer and winter holidays significantly predicts influenza season initiation. We found insufficient evidence in support of this effect (see Methods and materials).
Climate variables
Specific humidity and a drop in temperature have been suggested as the key drivers in triggering seasonal influenza epidemics (Lowen et al., 2008; Shaman et al., 2010). These initial conclusions were drawn from experiments conducted using an animal model (guinea pig), under controlled laboratory conditions (Lowen et al., 2008), followed by indirect support from epidemiological modeling (Shaman et al., 2010).
Vaccination coverage
Vaccination is widely regarded as our most promising tool to combat influenza, though antigenic variation between seasons makes it difficult to craft an effective vaccination strategy (Boni, 2008). Understanding how the virus will evolve in the short-term is key to finding the correct antigenic match for an upcoming influenza season. Additionally, short-term molecular evolution might rapidly give rise to immune-escape variants that, if detected, might dictate intra-season updates in vaccine composition. More importantly, vaccination itself might exert significant selection pressure to influence antigenic drift. The effect of vaccination on viral evolution has been documented in an avian H5N2 lineage in vaccinated chickens (Lee et al., 2004), suggesting that similar processes might be occurring in human counterparts. The diversity of the surface proteins at any point in time between seasons suggests that our current vaccination strategies can confer only partial immunity; a highly immune or vaccinated population may then exert selection pressure on the viral population to evolve more quickly than usual. Given that influenza moves quickly across geographies and that there are multiple, co-circulating strains that may confer partial cross-protection, changing viral surface proteins represent a ‘moving target’ for the human immune system. There are additional complexities due to early-life immune imprinting in humans.
Accounting for non-vaccination host immunity
Resistance to influenza infection in human hosts arises via two related mechanisms: immunological memory from a previous infection by an identical or sufficiently similar strain, or vaccination against the current strain (or a sufficiently similar strain). In our analysis, we explicitly account for the degree of vaccination coverage. Accounting for host immunity is more difficult, because resistance to infection is not directly observable; however, our analysis also uses the antigenic variation of influenza virus as a proxy for host resistance: If the relative antigenic variation is large, then the susceptibility of non-vaccinated hosts increases, while low antigenic variation increases the probability of encountering a resistant host. In addition, we use the absolute antigenic deviation of the later-season virus from the first-season virus in our data set (winter of 2003–2004) as a predictor, in order to capture longer-term effects of immunological memory. Thus, if the magnitude of the relative drift between two succeeding years is small, we have a likely decrease in susceptibility. Likewise, if the absolute deviation from a few years back starts decreasing, we would also register a decrease in susceptibility. Incorporating these factors in the diverse set of models that we investigate guarantees that we are indeed considering host susceptibility contributions from a wide range of possible mechanisms.
Antigenic Variation
The influenza virus counteracts host immunity via subtle genomic changes over time. The more gradual process, known as antigenic drift, is a manifestation of a gradual accumulation of mutations within viral surface proteins recognized by human antibodies, such as hemagglutinin (HA) and neuraminidase (NA). These mutations are typically introduced during cycles of viral replication (Boni et al., 2004). Most of these mutations are neutral: they do not affect the functional conformation of the viral proteins. However, some of these alterations do, in fact, sufficiently change secondary and tertiary protein structures to have a negative impact on the binding of host antibodies raised in response to previously circulating strains (Webby and Webster, 2001). (Many such mutations also reduce the virus’s viability.) Thus, while a single infection episode is potentially enough to provide long-term host immunity to the invading strain, antigenic variation due to intense selection pressure gives rise to novel viral strains, making re-infections possible within the span of a few years (Andreasen, 2003). This kind of perpetual Red-Queen arms race injects influenza dynamics with auto-correlative dependencies over multiple seasons. It has been suggested that substantial antigenic drift might be associated with more severe, early-onset influenza epidemics, resulting in increased mortality (Treanor, 2004). In contrast to antigenic drift, antigenic shift is an abrupt, major change in virus structure due to gene segment re-assortment that occurs during simultaneous infection of a single host by multiple influenza subtypes (De Clercq, 2006). Antigenic shift results in new versions of viral surface proteins. Antigenic shift due to re-assortment gives rise to novel influenza subtypes that, if capable of sustained human-to-human transmission, can have devastating consequences for human populations, for example, the 2009 H1N1 pandemic (Neumann et al., 2009).
In our analysis, we factor in the potential effect of antigenic variation by estimating the population diversity of the surface proteins – hemagglutinin (HA), neuraminidase (NA), matrix protein 1 (M1), and matrix protein 2 (M2) – as a function of time and geographical sample collection location. The rationale for our focus on these proteins is that HA, NA, and to some degree M1, are all present on the viral surface (Lamb et al., 1985), contribute to viral assembly, and mediate the release of membrane-enveloped particles (Chlanda et al., 2015). M2 has been shown to enhance the HA-pseudovirus infectivity of the pandemic 2009 influenza A virus [(H1N1)pdm09] (Friedman et al., 2017; Alvarado-Facundo et al., 2015).
Data sources
Social connectivity
Data were obtained from the General Social Surveys (GSS), NORC; see Table 1.
Clinical data source
The source of the clinical incidence data used in this study is the Truven Health Analytics MarketScan® Commercial Claims and Encounters Database for the years 2003 to 2012 (Hansen, 2017). The database consists of approved commercial health insurance claims for between 17.5 and 45.2 million people annually, with linkage across years, for a total of approximately 150 million individuals. This US national database contains data contributed by over 150 insurance carriers and large, self-insuring companies. We scanned 4.6 billion inpatient and outpatient service claims and identified almost six billion diagnosis codes. After un-duplication, we identified approximately unique diagnostic codes per individual. We processed the Truven database to obtain the reported weekly number of influenza cases from January 2003 through December 2012, at the spatial resolution of US counties. To define influenza in insurance claims, we used the following set of ICD9 codes: 487.8, 488.12, 488.1, 488.0, 488.01, 488.02, 487.0, 487.1, 488.19, 488.09, 488, 487, and 488.11. We also considered including non-specific codes representing unspecified viral infection (079.99), as suggested in Viboud et al. (2014), and decided against it. (We provide explicit annotation for each code, as well as the rationale for choosing the narrower ICD-code definition of ILIs, in the Supplement, Section S-D.)
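As a hedged sketch of the claim-scanning step, the following illustrates filtering claims against the ICD9 code list above and aggregating weekly county-level counts; the tuple schema is a hypothetical stand-in for the actual Truven record format, not the study's pipeline.

```python
# The thirteen ICD9 codes used to define influenza in the claims data.
FLU_ICD9 = {"487", "487.0", "487.1", "487.8", "488", "488.0", "488.01",
            "488.02", "488.09", "488.1", "488.11", "488.12", "488.19"}

def weekly_flu_counts(claims):
    # claims: iterable of (county_fips, week_index, icd9_code) tuples --
    # a simplified stand-in for the Truven claims schema.  Non-influenza
    # codes (e.g. the excluded non-specific 079.99) are filtered out.
    counts = {}
    for county, week, code in claims:
        if code in FLU_ICD9:
            counts[(county, week)] = counts.get((county, week), 0) + 1
    return counts

claims = [("17031", 5, "487.1"),   # counted: influenza code
          ("17031", 5, "079.99"),  # skipped: unspecified viral infection
          ("06037", 5, "488.11")]  # counted: influenza code
print(weekly_flu_counts(claims))  # {('17031', 5): 1, ('06037', 5): 1}
```

The county FIPS codes and week indices here are illustrative; the real computation runs over billions of service claims.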
A peak percentage of ILI-affected people of may seem high–especially given that it is a much greater percentage than the change in seropositivity over most influenza seasons. Note that our apparent estimates of influenza infection rate are unavoidably inflated, because what we are measuring is the week- and county-specific prevalence of influenza-like illnesses among insured patients who contacted their physician for any reason during the week in question. In other words, the denominator that we used for computing prevalence was almost certain to be much smaller than the overall population of the corresponding county.
Data on antigenic drift for influenza A
Sequence data for this computation was obtained from the National Institute of Allergy and Infectious Diseases (NIAID) Influenza Research Database (IRD) (Zhang et al., 2017) through their web site at http://www.fludb.org.
Data on vaccination coverage
Data on vaccinations were extracted from our EHR database using the procedural codes 90661 and Q2037, which correspond to the dominant influenza vaccines (http://flu.seqirus.com/files/billing_and_coding_guide.pdf).
Data on human air travel
We used a complete, directed graph of passenger air travel for 2010, accounting for the number of passengers transported in each direction (The United States Bureau of Transportation Statistics, 2010). For each county, we computed an air neighborhood network: for counties i and j, the weight of the edge from county i to county j represents the ratio of all passengers who traveled from county i to county j by plane to the total number of passengers who left county i by plane during the year, so that the edge weights out of each county sum to one.
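A minimal sketch of this normalization, assuming a simple dictionary of origin-destination passenger counts (hypothetical data, not the BTS schema):

```python
def air_neighborhood(passenger_counts):
    # passenger_counts: {(origin, dest): annual passengers flown origin -> dest}.
    # Returns row-stochastic edge weights: the fraction of all passengers who
    # left `origin` by plane and arrived at `dest`, so the weights out of
    # each county sum to one.
    totals = {}
    for (origin, _), n in passenger_counts.items():
        totals[origin] = totals.get(origin, 0) + n
    return {(o, d): n / totals[o] for (o, d), n in passenger_counts.items()}

# Toy example with three counties:
flows = {("A", "B"): 300, ("A", "C"): 100, ("B", "A"): 50}
w = air_neighborhood(flows)
print(w[("A", "B")], w[("A", "C")], w[("B", "A")])  # 0.75 0.25 1.0
```

Row-normalizing in this way turns raw passenger volumes into the transition probabilities used throughout the connectivity analyses.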
Data on general human travel patterns in the US
Using the complete Twitter dataset, we aggregated a movement matrix to capture people’s week-to-week travels between counties from geo-located Twitter messages (captured during the period of January 1, 2011 through June 30, 2014). This dataset includes approximately messages and represents user-days of location information. A small, but significant, percentage of Twitter messages are automatically tagged with the author’s current latitude/longitude information, as determined by their mobile device. Each latitude/longitude-annotated tweet was mapped to a FIPS county code based on Tiger/Line shape files from the 2013 Census dataset (http://www.census.gov/geo/maps-data/data/tiger-line.html). In addition, we calculated a variant of our movement matrix to capture seasonality and other temporal dynamics: A set of 52 movement matrices captured weekly, county-to-county movements, observed in each week of the year, aggregating each week based on the corresponding observations, from each year from 2011 through 2014.
This dataset constitutes a large-scale, longitudinal sample of individual movements. We found that the movement patterns are consistent with intuitive expectations and prior studies of large-scale movements; the majority of people in a county remain in the same county week-over-week. Most travel between counties occurs between neighboring counties, and between counties and large metropolitan areas, conditioned on distance and size of the metropolitan area. We represent our movement data as a matrix that captures the likelihood that a person observed in county i during the course of a week will be observed in county j in the following week. We calculate the entries of this matrix as follows: the entry for counties i and j is the number of users observed in county i during one week and in county j during the following week, divided by the total number of users observed in county i during the first of the two weeks.
For each time interval, we computed the mean movement vector from a given county by summing all county-to-county movement vectors, weighted by the proportion of people moving in each direction.
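The movement-matrix estimation described above can be sketched with a toy set of per-user county sightings (the user IDs and FIPS codes below are hypothetical stand-ins for the geo-located Twitter observations):

```python
# Minimal sketch: estimate week-to-week county transition probabilities from
# per-user (week, county) sightings. Each row of the resulting matrix M sums
# to one over observed destinations.

from collections import Counter, defaultdict

# obs[user] = list of (week index, FIPS county) sightings (toy data)
obs = {
    "u1": [(0, "17031"), (1, "17031"), (2, "17043")],
    "u2": [(0, "17031"), (1, "17043")],
}

def movement_matrix(obs):
    """M[i][j] = P(observed in county j next week | in county i this week)."""
    counts = defaultdict(Counter)
    for sightings in obs.values():
        s = sorted(sightings)
        for (w0, c0), (w1, c1) in zip(s, s[1:]):
            if w1 == w0 + 1:                  # consecutive weeks only
                counts[c0][c1] += 1
    return {i: {j: n / sum(c.values()) for j, n in c.items()}
            for i, c in counts.items()}

M = movement_matrix(obs)
```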
To investigate the role of proximity to major airports, we modeled influence diffusion as follows: for a given county, we computed the total contribution obtained by diffusing influences from the major airport-bearing counties, summing over all major airport locations considered each airport's volume of traffic, damped by a term that decays with distance from the county.
The distance in miles between two locations was computed using the Haversine approximation. Small variations in the chosen decay constant do not significantly alter our conclusions. As noted above, proximity to airports had a significant positive influence in sparking seasonal epidemics; the influence is significantly weaker once an outbreak is well under way.
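A sketch of this computation follows. The Haversine distance is standard; the exponential form of the distance decay and the decay constant are illustrative assumptions, since the text states only that a decay constant was fixed and that small variations in it do not change the conclusions:

```python
# Haversine great-circle distance plus a distance-decayed sum of airport
# traffic volumes. The exp(-decay * distance) form and decay value are
# assumptions for illustration, not the paper's exact specification.

from math import radians, sin, cos, asin, sqrt, exp

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (Haversine approximation)."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi, dlmb = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def airport_influence(county, airports, decay=1e-3):
    """Sum each airport's traffic volume, damped by distance to the county.

    county: (lat, lon); airports: iterable of (lat, lon, volume)."""
    return sum(vol * exp(-decay * haversine_miles(*county, lat, lon))
               for lat, lon, vol in airports)
```

For example, a county near O'Hare receives a larger contribution from O'Hare's traffic than a distant county does, all else equal.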
Estimating antigenic diversity
In this study, we measured antigenic diversity as follows: consider the set of amino acid sequences for a given protein (one of HA, NA, M1, or M2), collected in a given year (ranging between 2003 and 2011), in a given state of the continental US. The temporal resolution of the sequence data is thus set to years instead of weeks, and the spatial resolution to states instead of counties. These resolutions are coarse compared to our EHR data on infection incidences, and were set in this manner to maintain sufficient statistical power. For each such set of amino acid sequences, we compute the set of pairwise edit distances:
where d(·,·) is the standard edit distance (also known as the Levenshtein distance) between two sequences. Mathematically, the Levenshtein distance between two strings a and b (of lengths |a| and |b|, respectively) is lev_{a,b}(|a|, |b|), where

lev_{a,b}(i, j) = max(i, j) if min(i, j) = 0, and otherwise
lev_{a,b}(i, j) = min{ lev_{a,b}(i-1, j) + 1, lev_{a,b}(i, j-1) + 1, lev_{a,b}(i-1, j-1) + 1_(a_i ≠ b_j) },

where 1_(a_i ≠ b_j) is the indicator function equal to 0 when a_i = b_j and equal to one otherwise, and lev_{a,b}(i, j) is the distance between the first i characters of a and the first j characters of b.
The antigenic diversity of a protein at a given time and in a given state was then defined as the median of the distribution of the values in this set of pairwise distances. Clearly, as the sequences became more diverse at a point in time and space due to molecular variations brought about by either drifts or shifts, the measure deviated more from zero. Use of the median provided robustness to outliers.
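The recurrence above and the median-of-pairwise-distances measure can be sketched directly; the toy amino-acid strings are illustrative:

```python
# Dynamic-programming Levenshtein distance and the median pairwise-distance
# diversity measure, on hypothetical toy sequences.

from itertools import combinations
from statistics import median

def levenshtein(a, b):
    """Standard edit distance: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution/match
        prev = cur
    return prev[-1]

def antigenic_diversity(seqs):
    """Median of all pairwise edit distances; 0 when sequences coincide."""
    return median(levenshtein(a, b) for a, b in combinations(seqs, 2))
```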
Importantly, here we did not compare temporal changes in viral antigenic makeup directly, but estimated the time-specific diversity in the viral protein primary structure. We expected our diversity measure to be representative of the cumulative changes that occurred within each of the nine influenza seasons.
Estimating vaccination coverage
We incorporated the effect of vaccination coverage by estimating the cumulative fraction of the population that received the current influenza vaccine within the previous 20 week period. This approach to estimating vaccination coverage does not correct for season-specific antigenic match, or the lack thereof. Nevertheless, because we explicitly included measures of antigenic diversity in addition to vaccination coverage, we expected that effects arising from the degree of antigenic match would indeed be factored in; if there is a significant mismatch, we expected the antigenic diversity in that year to be lower, and vice versa. This assumes implicitly that vaccination does indeed play a major role in exerting significant selection pressure, an assumption that was reinforced by our observation of a strong dependency between vaccination coverage and normalized antigenic diversity.
We found that antigenic diversity is quite strongly affected by vaccination coverage. This reflects the theoretical predictions in Boni et al. (Boni et al., 2006), where it is shown that the amount of observed antigenic drift increases as immunity in the host population increases and pressures the virus population to evolve.
Estimating the effect of return-to-school days
The absence of consensus, and the diversity of modeling assumptions pertaining to this effect (described above), make it difficult to validate conclusions in large-scale epidemiological data. We carried out a simple test to determine whether there is statistical evidence that after-holiday, return-to-school periods in August-September and in January predict or trigger the seasonal epidemic.
For this test, we assumed a broad window to cover all such school openings across the continental US, including the last week in August, the entire month of September, two weeks in October, and two weeks in January. We next carried out a Fisher’s exact test to determine whether the overlap between these weeks and the identified ‘trigger period’ (see the corresponding table in the Supplement) for the seasonal epidemic is sufficiently non-random.
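The overlap test can be sketched as a one-sided Fisher's exact test on a 2x2 table of weeks (inside or outside the school-opening window, inside or outside the trigger period), computed from the hypergeometric distribution; the week counts below are toy values, not the study's:

```python
# One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
# probability of an overlap at least as large as a, under independence.
# Toy counts: 9 school-window weeks out of 52, 10 trigger weeks, overlap 6.

from math import comb

def fisher_exact_greater(a, b, c, d):
    """P(overlap >= a) via the hypergeometric distribution."""
    n, K, N = a + b + c + d, a + b, a + c
    return sum(comb(K, k) * comb(n - K, N - k)
               for k in range(a, min(K, N) + 1)) / comb(n, N)

p = fisher_exact_greater(6, 3, 4, 39)
```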
Weather data
The dataset starts with the week beginning December 31st, 2002 and includes 522 weeks, ending exactly with the week ending December 31st, 2012. Temperature and precipitation data come from the 2.5 arcminute (approximately 4 km) PRISM (Oregon State University, 2014) dataset, and other variables (wind speed, specific humidity, surface pressure, and downward incident shortwave solar radiation) come from the 7.5 arcminute (approximately 12 km) Phase 2 North American Land Data Assimilation System (NLDAS-2) dataset (Mitchell, 2004; Cosgrove et al., 2003). These datasets were selected in large part because both are updated in near real-time, making it possible to use them for future monitoring applications. PRISM is released daily, with an approximately 13 hr delay (data for the previous day is released at 1pm EST each day), while NLDAS is released daily, with an approximately 2.5 day delay.
Variables were aggregated to county boundaries based on shapefiles from the GADM database of Global Administrative Areas (Areas, 2014). Where appropriate, we considered both the average daily climate variable (for example, the daily maximum temperature averaged over the week) as well as the maximum and/or minimum of the variable experienced over the week. For precipitation, we considered only the cumulative total precipitation experienced during the week.
Three approaches
Approach 1: Non-parametric Granger analysis of county-specific incidence
Analysis of quantized clinical data
For the causality analysis, we started with an integer-valued incidence time series for each US county, which we first quantized in two steps:
1. Computing the difference series, that is, the weekly change in the number of reported cases.
2. Mapping positive changes to symbol ‘1’ and negative changes to symbol ‘0’.
This mapped each data series to a symbol stream over a binary alphabet. The binary quantization is not a restriction imposed by the inference algorithm; while we do require quantized magnitudes, longer data streams can be encoded with finer alphabets to accommodate an arbitrary precision. For this specific dataset, the relatively short length of the county-specific time-series necessitated a coarse quantization in order for the results to have a meaningful statistical significance.
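The two steps above can be sketched as follows; mapping weeks with no change to '0' is an assumption, since the text specifies only the treatment of positive and negative changes:

```python
# Quantize a weekly incidence series into a binary symbol stream:
# difference the counts, then map rises to '1' and non-rises to '0'
# (the zero-change case is an assumed convention here).

def quantize(counts):
    diffs = [b - a for a, b in zip(counts, counts[1:])]
    return "".join("1" if d > 0 else "0" for d in diffs)

stream = quantize([3, 5, 4, 4, 9])  # toy weekly case counts
```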
Given a pair of such quantized streams, the algorithm described in the Supplement computes two coefficients of causality, one for each direction. Intuitively, the coefficient from one stream to the other is a non-dimensional number between 0 and 1 that quantifies the amount of information that one may acquire about the second stream from observing the first. More specifically, it is the average reduction in the uncertainty of the next predicted symbol in the target stream, in bits per bit, acquired from observed symbols in the source stream. It can be shown that the coefficient of causality is 0 if and only if there is no causal influence from the source to the target in the sense of Granger, and assumes the maximum value 1 if and only if the target is deterministically predictable from the source. Moreover, the coefficients in both directions are 0 if and only if the two streams are statistically independent processes (Chattopadhyay, 2014). It is trivial to produce examples in which the coefficient is large in one direction and nearly 0 in the other, illustrating the ability of the algorithm to capture the asymmetric flow of causal influence in one preferred direction and not the other.
Additionally, whenever we computed the causality coefficient, it was always associated with a time delay: we calculated the coefficient for predicting the target stream some specified number of steps in the future, and the computed coefficient was thus parameterized by this delay. In our analysis, we computed coefficients up to a specified maximum delay in weeks, and, for each pair of counties, selected the optimal delay, which gave rise to the largest coefficient. For more details on the algorithm, see the Supplement.
Computation of causality fields and causality streamlines
To analyze the propagation dynamics of disease, we defined the notion of a Granger causal flow between two counties. Treating county-specific changes in disease prevalence as a time-stamped data stream, we quantified the directional strength of the causal flow between two counties as the degree of predictability of one stream’s future, given the history of the other (see Figure 3A–D). Our new algorithm was able to compute a coefficient of causality: an information-theoretic measure of the information in bits that one stream communicates about another in a direction-specific manner. A strong coefficient of this type suggests that either the disease itself, or an underlying cause, may be propagating from the former county to the latter.
To explore the country-wide propagation dynamics, we stitched together the properly-aligned, adjacent between-county causal flow vectors across the whole US map into causality streamlines, representing the average of multiple spatial infection propagation waves over time. To compute causality streamlines, we needed a precise notion of county neighbors. We considered two counties to be neighbors if either they shared a common border, or if one county was reachable within a line-of-sight distance of 50 miles from any point within the other. The latter condition removed ambiguities as to whether counties touching at a point should still be considered neighbors. Moderate variations in the exact distance value (50 miles) do not significantly change our results. With the definition of the neighborhood map in place, we proceeded to compute the direction-specific coefficients of causality between neighboring counties. Thus, for each county we obtained a set of coefficients, one for each of its neighbors, capturing the degree of causal influence from that county to its respective neighbors. Our algorithm also computed the probability of each coefficient arising by chance alone, and we ignored coefficients for which the probability that two independent processes lacking any causal connection gave rise to the computed value exceeded a preset significance threshold. Once the coefficients had been calculated for each neighbor, we computed the resultant direction of causal influence outflow from that particular county. This was carried out by treating the causality coefficients as weights on the lengths of vectors drawn from the centroid of the considered county to the centroids of its neighbors, and then calculating the resultant vector (see Figure 3).
Viewed systematically across the continental US, these local vectors formed a discernible pattern; we observed the emergence of a non-random structure with clearly discernible paths outlining the ‘causality field’ (see Figure 3, Plates G, J, K). To interpret the plots, note that streamlines start at their thinnest part, and their direction is indicated by the thickening of the line; typically, multiple streamlines coalesce into a river-like pattern.
Inferring statistical ‘Granger-causality’ from data
Granger attempted to obtain a precise definition of causal influence (Granger, 1980) from the following intuitive notion: X is a cause of Y if X has unique information that alters the probabilistic estimate of the immediate future of Y.
Here, we used a new, non-parametric approach to Granger causal inference (Chattopadhyay, 2014) (Approach 1). In contrast to state-of-the-art binary tests (Baek and Brock, 1992; Hiemstra and Jones, 1994), we computed the degree of causal dependence between quantized data streams from stationary ergodic sources in a non-parametric, non-linear setting.
Our approach was significantly more general than common, regression-based implementations of Granger causal inference, and did not involve any autoregressive moving average (ARMA) modeling. All such commonly used techniques impose an a priori linear structure on the data, which then constrains the class of dependency structures we can hope to distill.
True causality, in a Humean sense, cannot be inferred (Hume, 1993; Kant, 1998). Among other reported approaches to ascertaining causal dependency relationships, the work of J. Pearl (Pearl, 2009a) is perhaps most visible, and builds on the paradigm of structural causal models (SCM) (Pearl, 2009b). While Pearl’s work is often claimed to be able to answer causal queries regarding the effects of potential interventions, as well as regarding counterfactuals, our objective in this paper is somewhat different. We are interested in delineating whether infection transmission pathways can be distilled from the patterns of infection’s spatio-temporal incidence.
Approach 2: Poisson regression on putative factors
In Approach 2, we investigated the relative importance of putative factors as follows: consider the total number of patients in a county, stratified by age and gender. Denoting the number of individuals diagnosed with influenza in a given county during a given week as the response, we modeled the within-county disease incidence counts for every county (for which data was available) in the US using the following mixed-effect Poisson regression model:
where the linear predictor comprises an intercept, a fixed-effect design matrix and a vector of fixed effects, a vector of changes in the rate of infection in the county in the weeks prior to the current one together with a vector of auto-correlation fixed effects, a vector of changes in the rate of infection in the neighbors of the county in the weeks prior to the current one (the neighbors are subdivided into land neighbors, Twitter neighbors, and air neighbors) together with a vector of county-neighborhood fixed effects, and random effects with their design matrix. The depth of ‘memory’ of the auto-regression, in weeks, was four in our case.
In this way, the total number of rows in the design matrix is the number of weeks times the number of counties, with county-specific socioeconomic covariates and week-specific weather covariates. For the disease-initiation analysis, we included only a subset of the time series, covering the weeks around the epidemic trigger period.
We did not use an explicit spatial smoothing. The county-level random effects were used to account for possible heterogeneity in disease reporting across counties. However, as predictors, we used average infection rate values of immediate neighbors for the county (one, two, three, and four weeks in the past). This prediction structure afforded an implicit spatial smoothing, dampening spurious fluctuations in infection rates.
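A minimal, fixed-effects-only sketch of the core likelihood follows: Poisson counts with a log link and a population offset, fit by Newton-Raphson on simulated data. The auto-regressive, neighborhood, and random-effect terms of the full model are omitted, and the data are synthetic:

```python
# Poisson regression with offset: y ~ Poisson(exp(X @ beta + log(population))).
# Newton-Raphson on the concave Poisson log-likelihood; toy simulated data.

import numpy as np

def fit_poisson(X, y, offset, n_iter=50):
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean() + 0.5) - offset.mean()  # warm-start intercept
    for _ in range(n_iter):
        mu = np.exp(X @ beta + offset)
        grad = X.T @ (y - mu)                  # score
        hess = X.T @ (X * mu[:, None])         # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])   # intercept + 1 covariate
offset = np.log(rng.integers(1_000, 10_000, size=500).astype(float))
true_beta = np.array([-6.0, 0.5])
y = rng.poisson(np.exp(X @ true_beta + offset))
beta_hat = fit_poisson(X, y, offset)
```

The offset term makes the model a rate model, so counties of different populations are comparable, mirroring the role of the patient-count denominator in the text.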
Out-of-sample prediction and ROC analysis with mixed-effect Poisson regression
We carried out our out-of-sample prediction with the models inferred with mixed-effect regression. The steps were as follows:
We trained the model parameters with data from the trigger periods corresponding to the first six seasons.
Once the model coefficients were identified, we used the model to predict the response variable (influenza incidence) for the last three seasons.
As expected, the predicted incidence does not exactly match the observed out-of-sample data. Nevertheless, we see positive correlation (Figure 5, Plate A). Because we were modeling a necessarily spatio-temporal stochastic process, the predictive ability of the model is difficult to judge from the observed positive correlation alone. To resolve this, we investigated the performance of our model by computing how well it predicted the counties that experience flare-ups during the trigger periods in the out-of-sample data. We reduced this exercise to a classification problem by first choosing a threshold on the number of reported cases per week to define what is meant by a ‘flare-up’ (see description below). To quantify the prediction performance, we constructed ROC curves for each of the three target seasons, for each fixed week, as follows:
We first quantized the incidence data to reduce it to a binary stream. In particular, we chose a threshold (ten cases), such that for each county and each week, we reported a ‘1’ if the number of reported cases was greater than the threshold, and ‘0’ otherwise. Note that this quantization is different from the one used in our non-parametric Granger analysis in Approach 1.
For each county and week, we then ended up with a binary class variable and a decision variable, where the former is the quantized incidence described above, and the latter is the response predicted by the model, normalized between 0 and 1.
For a chosen decision threshold, we determined the predicted class to be ‘1’ if the normalized model response exceeded the threshold, and ‘0’ otherwise.
Comparing the observed and predicted classes, we computed the false positive rate (FPR: defined as the ratio of false positives to the sum of false positives and true negatives), and the true positive rate (TPR: defined as the ratio of true positives to the sum of true positives and false negatives). Finally, we constructed the ROC curve, which shows the relationship of the TPR and the FPR as the decision-threshold is varied.
We constructed Receiver Operating Characteristic (ROC) curves for each week in the out-of-sample period, and estimated the area under the curve (AUC). The AUC measures the ability of the predictor (our model) to correctly classify the counties that would go on to have a disease incidence greater than the initially set threshold (ten cases in our analysis). In the perfect case, we would have an AUC of 100%, which implies that we can achieve zero false positives while getting a 100% true positive rate. Our best model achieves approximately 80% AUC for the trigger weeks, as shown in Figure 5.
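The ROC construction described above can be sketched directly from the definitions: sweep the decision threshold, record (FPR, TPR) pairs, and integrate with the trapezoidal rule. The labels and scores below are toy values:

```python
# ROC curve from first principles: for each candidate threshold, classify
# scores >= threshold as positive, then compute FPR = FP/(FP+TN) and
# TPR = TP/(TP+FN); the AUC is the trapezoidal integral of the curve.

def roc_curve(y_true, scores):
    P = sum(y_true)
    N = len(y_true) - P
    pts = []
    for t in sorted(set(scores)) + [float("inf")]:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        pts.append((fp / N, tp / P))
    return sorted(pts)

def auc(points):
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

points = roc_curve([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```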
Variable and model selection
Approaches 2 and 3 have different limitations; these affect which approach is most effective at which statistical detection task.
Approach 2 (mixed-effect regression analysis) can use the whole dataset for model selection, which can span individual predictors, their combinations, and even interactions between factors (see Table 5; Figure 1—figure supplement 2; Figure 1—figure supplement 3; Figure 1—figure supplement 4; Source data 2; Source data 3; Supplementary file 3; Source code 1). Model selection allows us to choose a model with the right balance of complexity and explanatory power, thus enabling this analysis to detect collinear explanatory variables and drop the weaker predictors that increase model complexity but do not add explanatory power.
For example, the best model in our mixed-effect regression model series by deviance information criterion (DIC) (see Table 5, the last model) includes a term d_max_HUS_min_3 * d_t_avg_min_3 that denotes an interaction between the weekly change in the maximum specific humidity and the weekly change in the average temperature, both of which are weather parameter changes recorded three weeks before the current week. Because it does not split datasets, Approach 2 is more powerful at detecting interactions between explanatory variables and at selecting rather complex models. However, this approach is more susceptible to bias (picking non-causal correlations that are inherent in the raw data) if the data are unbalanced, unlike Approach 3.
We iteratively added and removed random and fixed effects from the model structure, executed the regression algorithm, and plotted the achieved DIC against model complexity. Here, model complexity is simply proxied by the descriptive complexity of the model (the number of factors in the regression equation). We stopped when the drop in DIC stabilized (see Figure 1—figure supplement 3). Notably, the DIC has been shown to select overfits (Ando, 2011; Plummer, 2008; van der Linde, 2012), and does not properly account for model complexity in practice. Our driving idea was the identification of the Pareto front that trades off accuracy (in terms of DIC) against model complexity in a transparent manner. This is, of course, a ‘greedy approach,’ in the sense that we do not guarantee that we have indeed found the ‘best’ possible model. However, because our cross-validation (Figure 5) yielded good results, we deemed the stopping rule to be satisfactory.
Approach 3: County-matching effect analysis
We designed Approach 3 specifically to balance biased observational data. While Approach 3 does not allow for explicit model selection, it is good at detecting simpler combinations of predictors putatively ‘causally’ affecting the outcome variable. However, because this approach involves splitting data into smaller and smaller subsets as the complexity of the predictor function (‘treatment’) grows, it quickly runs out of statistical power.
These two approaches’ properties explain the perceived misalignment of results from the two streams of analysis. For an intuitive understanding of matching approaches, consider specific humidity’s effect on influenza prevalence. The county-matching method’s goal is to deduce associations putatively interpreted as causality relations. For example, consider testing the question of whether counties with higher-than-average mean maximum specific humidity do, indeed, have higher influenza prevalence, in a statistically significant sense, when compared to counties that do not, provided all other factors are held constant.
Let C denote the set of all US counties, and let F be the set of all factors we found to be significant in our mixed-effect regression analysis. For any subset of factors S of F, we denote the complement set as F \ S. Additionally, we define a Boolean treatment function T, where, for a factor f in F and a county c in C, the value T(f, c) is true if the signal or treatment corresponding to the factor f is present in county c. We used the sign of the coefficients obtained in our best mixed-effect regression model to determine what counts as a positive signal. For example, because maximum specific humidity enters with a positive coefficient, if a county experiences a higher-than-average maximum specific humidity, then the corresponding treatment value is true.
To simplify the notation, we used the bare factor symbol to denote a true signal, and its negation to denote a false one.
Finally, for any subset of factors S, we defined a group of sets, which we refer to as the T-sets. The first of these is the treated set, that is, the set of counties which exhibit the signal encoded by S. The second is the matched control set of counties, which lack the signal, but each county in this set has a matching counterpart in the treated set.
The T-sets allow us to set up a contingency table for any chosen subset of factors, as described before. Specifically, we split the treated and control sets into two subsets each, representing those counties which experience a spike in influenza prevalence and those which do not. The contingency table is then subjected to Fisher’s exact test.
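The table construction can be sketched as follows; the county records are hypothetical, and the matching step (pairing each control county with a treated county that agrees on the complement factors) is elided for brevity:

```python
# Build the 2x2 contingency table from treated (signal present) and control
# (signal absent) counties, split by whether an influenza-prevalence spike
# occurred. Toy county records; real controls are matched on the complement
# factor set before this step.

counties = [
    {"name": "A", "signal": True,  "spike": True},
    {"name": "B", "signal": True,  "spike": True},
    {"name": "C", "signal": False, "spike": False},
    {"name": "D", "signal": False, "spike": True},
]

def contingency(counties):
    treated = [c for c in counties if c["signal"]]
    control = [c for c in counties if not c["signal"]]
    return [[sum(c["spike"] for c in treated),
             sum(not c["spike"] for c in treated)],
            [sum(c["spike"] for c in control),
             sum(not c["spike"] for c in control)]]

table = contingency(counties)  # rows: treated/control; cols: spike/no spike
```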
Appendix 1
Non-parametric Granger Causal Inference
Standard Implementations of Granger Causal Inference
Suppose that one is interested in the possibility that a vector series Y_t causes another vector series X_t. Let J_t be an information set available at time t, consisting of terms of a vector series Z_t, that is, J_t = {Z_s : s ≤ t}. J_t is said to be a proper information set with respect to X_t if X_t is included within Z_t. Further, suppose that Z_t does not include any component of Y_t, and define the augmented information set J′_t = {Z_s, Y_s : s ≤ t}.
Inferring causality in the mean is easier, and if one is satisfied with using the minimum mean-square prediction error as the criterion to evaluate incremental predictive power, then one may use linear, one-step-ahead least-squares predictors to obtain an operational procedure: if σ²(X | J) denotes the variance of the one-step forecast error of X_t given an information set J, then Y is a prima facie cause of X with respect to J′ if σ²(X | J′) < σ²(X | J).
Testing for bivariate Granger causality in the mean involves estimating a linear, reduced-form vector autoregression:

X_t = A(L) X_t + B(L) Y_t + ε_t
Y_t = C(L) X_t + D(L) Y_t + η_t

where A(L), B(L), C(L), and D(L) are one-sided lag polynomials in the lag operator L, with roots all distinct and outside the unit circle. The regression errors are assumed to be mutually independent and individually i.i.d. with zero mean and constant variance. We used a standard joint test (F- or χ²-test) to determine whether lagged Y has significant linear predictive power for current X. The null hypothesis that Y does not strictly Granger cause X is rejected if the coefficients of the elements in B(L) are jointly significantly different from zero.
Linear tests presuppose a restrictive and often unrealistic (Darnell and Evans, 1990; Epstein, 1987) structure on data. Brock (Brock, 1991) presents a simple bivariate model to analytically demonstrate the limitations of linear tests in uncovering nonlinear influence. To address this issue, a number of nonlinear tests have been suggested, e.g., with generalized, autoregressive, conditional heteroskedasticity (GARCH) models (Asimakopoulos et al., 2000), using wavelet transforms (Papadimitriou et al., 2003), or heuristic, additive relationships (Chu et al., 2005). However, these approaches often assume a class of allowable non-linearities, thus not quite alleviating the problem of presupposed structure. This is not just an academic issue; Granger causality has been shown to be significantly sensitive to non-linear transformations (Roberts and Nord, 1985).
Non-parametric approaches, such as the Hiemstra-Jones (HJ) test (Hiemstra and Jones, 1994), on the other hand, attempt to completely dispense with presuppositions regarding the causality structure. Given two series X and Y, the HJ test (a modification of the Baek-Brock test [Baek and Brock, 1992]) uses correlation integrals to test whether the probability of similar futures for X, given similar pasts of X, changes significantly if we condition instead on similar pasts of both X and Y simultaneously.
Nevertheless, in order to achieve consistent estimation of the correlation integrals, the data series are required to be ergodic, stationary, and absolutely regular (β-mixing), with an upper bound on the rate at which the β-mixing coefficients approach zero (Denker and Keller, 1983). The additional assumptions beyond ergodicity and stationarity serve to guarantee that sufficiently separated fragments of the data series are nearly independent. The HJ test and its variants (Diks and Panchenko, 2006; Seth and Principe, 2010) have been quite successful in econometrics, uncovering nonlinear causal relations between money and income (Baek and Brock, 1992), aggregate stock returns and macroeconomic factors (Hiemstra and Kramer, 1995), currency future returns (Asimakopoulos et al., 2000), and stock price and trading volume (Hiemstra and Jones, 1994). Surprisingly, despite clear evidence that linear tests typically have low power in uncovering nonlinear causation (Hiemstra and Jones, 1994; Asimakopoulos et al., 2000), the application of non-parametric tests has been limited in areas beyond financial and macroeconomic interests.
1.1 Our Approach: Non-parametric Approach Based on Probabilistic Automata
We use a new, non-parametric Granger causality-based test for quantized processes (Chattopadhyay, 2014).
Going beyond binary hypothesis testing, we quantify the notion of the degree of causal influence between observed data streams, without presupposing any particular model structure. Generative models of causal influence are inferred with no a priori, imposed dynamical structure beyond ergodicity, stationarity, and a form of weak dependence. The explicit generative models may be used for prediction. The proposed inference algorithms are PAC-efficient (Chattopadhyay, 2014), i.e., we are guaranteed to find good models with high probability, and with small sample complexity.
Modeling Framework: Probabilistic Automata and Crossed Probabilistic Automata
In this section, we briefly describe the notion of probabilistic automata, the inference problem, the notion of causal states, and the decision fusion strategy for multiple models.
Self Models
The event catalogue’s quantized data streams may be viewed as sample paths from hidden, quantized stochastic processes that drive the observed dynamics. A self-model for such a stream is a generative model that captures statistically significant symbol patterns that causally determine (in a probabilistic sense) future symbols. A good modeling framework for such self-models are probabilistic finite state automata (PFSA) (Chattopadhyay and Lipson, 2013; Chattopadhyay, 2014).
One such self-model, shown in Figure 1 (Plate A(i)), is an example of a simple PFSA with two states generated from a binary symbol sequence. Note that, for this particular example, a short sequence of observed symbols localizes, or synchronizes, the machine to a particular state, which in turn specifies the probability distribution of the next symbol. Thus, given this stream’s self-model, along with a short history, we can predict the symbol distribution in the next time step. Notably, the states in a PFSA are not the alphabet symbols themselves, but are equivalence classes of symbol sequences (histories) that lead to statistically equivalent futures. This particular example has two states not because the alphabet is binary, but because there are only two distinct distributions that dictate the next symbol from any point in the dynamical evolution; the current context or ‘state’ dictates which of the two distributions over the binary event alphabet is in effect. These contexts are simply equivalence classes of historical sequences that could have potentially led to the present state; the specific history belonging to whichever current equivalence class actually transpired makes no difference in the future. Hence, these classes are dynamical states for the discrete stochastic evolution of the system at hand. The transition structure of the PFSA (represented in the graph as the labeled edges with probabilities) specifies how we move between these classes as new symbols are generated, or as new data are acquired.
The inference problem here is to determine both the number of states and the transition structure given a sufficiently long, quantized data stream. In our unsupervised approach, there are no restrictions on admissible graphs for the inferred PFSA, and no a priori constraints on the number of states that may appear (Chattopadhyay and Lipson, 2013).
Thus, the number of states inferred by our algorithm is a direct measure of the statistical complexity of the underlying process (Crutchfield, 1994). An inferred single-state, self model implies that the stream is a sequence of independently generated symbols (white noise or a Bernoulli process), and is therefore uninformative.
The inference algorithm identifies the number of causal states by searching for distinct ‘contexts’–sequences which, once transpired, lead to a distinct probability distribution of the next symbol. The computation proceeds as follows (Chattopadhyay and Lipson, 2013):
Enumerate the set of all possible symbol sequences up to a given maximum length.
Compute the probability of the future symbol (at a specified time shift) taking each value of the alphabet after a specific string from this set is encountered; call this the next-symbol distribution for that string.
Call the set of probability distributions obtained by this method the candidate distribution set.
Find clusters in this set, such that individual clusters are separated by some pre-specified distance. These clusters represent the causal states, as they are classes of histories (sequences) that lead to identical immediate futures.
Suppose string is in cluster , and sequence is in cluster ; it then follows that, in the inferred PFSA, there is a transition labeled from the corresponding state to state . Carrying out this procedure for each symbol of the alphabet for each inferred cluster or state identifies the complete transition structure of the model.
Once the transition structure is identified, we choose an arbitrary initial state and step through the model as dictated by the input data stream. We count the number of times each edge or transition is traversed, and, by normalizing the count, we arrive at an estimate of the edge probabilities.
This completes the PFSA inference.
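The context-clustering procedure above can be sketched in a few lines of code. This is a hypothetical simplification, not the implementation of Chattopadhyay and Lipson (2013): it estimates next-symbol distributions for short contexts in a binary stream, then greedily merges contexts whose distributions lie within an assumed L1 distance `eps`; the function name and parameters are illustrative only.

```python
from collections import Counter, defaultdict

def infer_pfsa(stream, max_len=3, eps=0.2):
    """Sketch of context-clustering PFSA inference (illustrative only).

    stream: string of '0'/'1' symbols. Returns the representative
    next-symbol distributions (one per inferred causal state) and the
    cluster label assigned to each observed context.
    """
    alphabet = sorted(set(stream))
    # 1. Estimate the next-symbol distribution after every short context.
    dists = {}
    for L in range(1, max_len + 1):
        counts = defaultdict(Counter)
        for i in range(len(stream) - L):
            ctx = stream[i:i + L]
            counts[ctx][stream[i + L]] += 1
        for ctx, c in counts.items():
            total = sum(c.values())
            dists[ctx] = tuple(c[a] / total for a in alphabet)
    # 2. Greedily cluster contexts whose distributions are within eps
    #    (L1 distance); each cluster is a candidate causal state.
    reps, label = [], {}
    for ctx, d in dists.items():
        for j, r in enumerate(reps):
            if sum(abs(a - b) for a, b in zip(d, r)) < eps:
                label[ctx] = j
                break
        else:
            label[ctx] = len(reps)
            reps.append(d)
    return reps, label
```

For a perfectly alternating stream such as '0101…', every context ending in '0' is deterministically followed by '1' and vice versa, so this sketch recovers exactly two causal states.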
Formal Definition of PFSAs and Future Prediction
Mathematically, a PFSA is a 4-tuple $(Q, \Sigma, \delta, \tilde{\pi})$, where $Q$ is the set of states, $\Sigma$ is the symbol alphabet, $\delta: Q \times \Sigma \to Q$ is the transition function specifying the graph structure, such that for any state $q \in Q$ and symbol $\sigma \in \Sigma$, $\delta(q, \sigma)$ is the end state of the transition from state $q$ via symbol $\sigma$, and $\tilde{\pi}: Q \times \Sigma \to [0, 1]$ is the symbol probability function, such that for any state $q \in Q$, $\tilde{\pi}(q, \sigma)$ is the probability of generating the symbol $\sigma$ from the state $q$, with the constraint:

$$\sum_{\sigma \in \Sigma} \tilde{\pi}(q, \sigma) = 1, \quad \forall q \in Q$$
We specify a time shift $\tau$ when we infer a PFSA from a given input stream; this time shift is the delay with which the inferred model makes predictions. Specifically, each transition step in the model is translated to $\tau$ steps in the quantized stream. Thus, if we know that the current state in the model is $q$, then the predicted symbol $\tau$ steps into the future is $\sigma$ with probability $\tilde{\pi}(q, \sigma)$.
Note that, if we have two models with time shifts $\tau_1$ and $\tau_2$ respectively, it follows that with each step the two models make predictions at different points in time in the future.
Note that to make a prediction using a PFSA, we must know its current state. This is non-trivial in general due to the synchronization problem (Chattopadhyay and Lipson, 2013; Chattopadhyay, 2014). We find an approximation of the current state as follows:
Compute the PFSA states' stationary distribution using standard Markov chain tools.
Treat the stationary distribution row vector as the initial state distribution. The rationale here is that if we assume that the model has been operating for a sufficient length of time, the current distribution must be very close to the stationary distribution in the expected sense.
Compute the square matrices $\Gamma_\sigma$, one for each symbol $\sigma \in \Sigma$, such that the $(i, j)$ element of $\Gamma_\sigma$, denoted $(\Gamma_\sigma)_{ij}$, is the probability of going from state $q_i$ to state $q_j$ via the symbol $\sigma$.
Use a relatively short history $h$ of past symbols (before the current time) to update the state distribution as follows:

$$p^{t+1} = \frac{p^t \, \Gamma_{\sigma_t}}{\lVert p^t \, \Gamma_{\sigma_t} \rVert_1}$$

where $\sigma_t$ is the $t^{\mathrm{th}}$ entry of the short history $h$, and $p^t$ is the state distribution at step $t$. It follows from ergodicity that the updated distribution is an estimate of the current state distribution.
The predicted distribution over the symbol alphabet $\tau$ steps in the future (if the time shift for the model is $\tau$) is then simply:

$$\Pr(\sigma) = \sum_{q \in Q} p_q \, \tilde{\pi}(q, \sigma)$$

where $p_q$ is the current probability for state $q$, i.e., the entry corresponding to state $q$ in the estimated state distribution.
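The synchronization-and-prediction recipe above can be illustrated concretely. The following sketch (an assumed data layout, not the authors' implementation) computes the stationary distribution of a small PFSA, updates it with a short observed history, and emits the predicted next-symbol distribution; `Gammas[s]` holds the per-symbol state-transition probabilities and `pi_tilde` the per-state symbol distributions, both names being illustrative.

```python
import numpy as np

def predict_next(Gammas, pi_tilde, history):
    """Estimate the current-state distribution from a short history,
    then predict the next-symbol distribution.

    Gammas: dict mapping each symbol to its |Q| x |Q| matrix of
            state-transition probabilities via that symbol.
    pi_tilde: |Q| x |A| matrix of per-state symbol probabilities.
    """
    # Transition matrix of the underlying Markov chain (sum over symbols).
    T = sum(Gammas.values())
    # Stationary distribution: left eigenvector of T for eigenvalue 1.
    w, v = np.linalg.eig(T.T)
    p = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    p = p / p.sum()
    # Update the state distribution with each observed symbol,
    # renormalizing at every step.
    for s in history:
        p = p @ Gammas[s]
        p = p / p.sum()
    # Predicted symbol distribution: mixture of per-state distributions.
    return p @ pi_tilde
```

With an empty history the prediction is simply the stationary mixture of the per-state distributions; each observed symbol then sharpens the state estimate before prediction.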
Cross Models
Unlike a self model, which attempts to predict future symbols in the same stream, a cross model attempts to predict symbols in a target stream after reading a short history in the source stream. Plate B(i) in Figure 1 illustrates a simple cross model with two states. Note that the model syntax is different from that of a PFSA; in particular, there are no probabilities on the arcs, though each state has a specified output distribution. Thus, if we see a particular short sequence in the source data stream (see Figure 1, Plate B), the cross-model synchronizes to a corresponding state, and the next symbol in the target data stream can then be predicted according to that state's output distribution. We call such models crossed Probabilistic Finite State Automata (XPFSA) (Chattopadhyay, 2014). The states of a cross model or XPFSA have a slightly different interpretation from those of a PFSA; while PFSA states are equivalence classes of histories that lead to identical futures in the same stream, XPFSA states are equivalence classes of histories in the source stream that lead to identical futures in the target stream (the past evolution of the target stream itself does not matter).
XPFSA inference (Chattopadhyay, 2014) is similar in principle to PFSA inference. Here, we have two input data streams: the source stream over the source alphabet, and the target stream over the target alphabet. The broad steps are as follows:
Let the set of all possible sequences up to a length $L$ in the source stream be denoted as $\mathcal{S}_L$.
Compute the probability of a future symbol (at a specified time shift) in the target stream taking each value in the target alphabet after a specific string $x$ from the set $\mathcal{S}_L$ is encountered in the source stream; call this distribution $\phi_x$ for the string $x$.
Call the set of probability distributions obtained in this manner $\mathcal{D}$.
Find clusters in the set $\mathcal{D}$, such that the clusters are separated by some pre-specified distance $\epsilon$. These clusters now represent causal states of the cross-dependence from the source stream to the target stream, because they are classes of histories (sequences) in the source stream that lead to statistically identical immediate futures in the target stream.
Suppose string $x$ is in a cluster corresponding to state $q_i$, and sequence $x\sigma$ is in a cluster corresponding to state $q_j$; it then follows that, in the inferred XPFSA, there is a transition labeled $\sigma$ from state $q_i$ to state $q_j$. Carrying out this procedure for each symbol of the source alphabet and each inferred cluster or state identifies the complete transition structure of the cross model. Note that the cross-model transitions are labelled only with symbols from the source alphabet.
Once we have identified the transition structure, we choose an arbitrary initial state in the XPFSA and step through the model as dictated by the source data stream. Each time we reach a particular state in the cross-model, we note which symbol from the target alphabet transpires in the target stream (at the specified time shift). Once we reach the end of the input source stream, we normalize the symbol-count vectors corresponding to the cross-model states; this determines the output distribution at each state. Note that each output distribution is a probability distribution over the target alphabet. This completes the XPFSA inference.
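The cross-model steps admit the same kind of sketch as the self-model case. The following is a hypothetical simplification, not the algorithm of Chattopadhyay (2014): it clusters source-stream contexts by the distribution they induce over the target stream's next symbol; the function name and parameters (`ctx_len`, `eps`, `shift`) are illustrative assumptions.

```python
from collections import Counter, defaultdict

def infer_xpfsa(source, target, ctx_len=2, eps=0.2, shift=1):
    """Sketch of cross-model (XPFSA) inference: cluster fixed-length
    source contexts by the next-symbol distribution they induce in the
    *target* stream (at the given time shift). Returns representative
    output distributions and the cluster label of each source context.
    """
    tgt_alpha = sorted(set(target))
    # Count target symbols following each source context.
    counts = defaultdict(Counter)
    for i in range(ctx_len, len(source) - shift):
        ctx = source[i - ctx_len:i]
        counts[ctx][target[i + shift - 1]] += 1
    dists = {c: tuple(v[a] / sum(v.values()) for a in tgt_alpha)
             for c, v in counts.items()}
    # Greedy clustering by L1 distance, as in the self-model sketch.
    reps, label = [], {}
    for ctx, d in dists.items():
        for j, r in enumerate(reps):
            if sum(abs(a - b) for a, b in zip(d, r)) < eps:
                label[ctx] = j
                break
        else:
            label[ctx] = len(reps)
            reps.append(d)
    return reps, label
```

If the target stream is a copy of the source, the cross model collapses to the corresponding self model, consistent with the generalization argument below.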
An XPFSA may be thought of as a direct generalization of a PFSA, and any PFSA can be represented as an XPFSA. To illustrate this, note that predicting future symbols in the same stream can be thought of as predicting symbols in a second, but identical, copy of the first stream.
If all histories in a source stream are equivalent in this sense, then we have a single-state XPFSA, which implies that the source cannot provide any new information about what is going to happen in the target stream based on its own history. Hence, that source lacks any causal influence on the target. Thus, as before, the complexity of the inferred models is directly related to the statistical complexity (Crutchfield, 1994) of the learnable dynamical relationships underlying the data streams themselves.
XPFSAs are asymmetric in general; the model for how a source influences a target does not need to be identical when the roles played by the streams are reversed. However, two streams are statistically independent if and only if the XPFSAs in both directions are single state machines (Chattopadhyay, 2014).
This leads us to the notion of the causality coefficient, $\gamma_{A \to B}$, from a data stream $A$ to a data stream $B$. True to Granger's notion of statistical causality, $\gamma_{A \to B}$ rigorously quantifies the amount of additional information we can obtain about the immediate future of stream $B$ through observing stream $A$.
Explicitly, the coefficient is defined as the ratio of the expected change in the entropy of the next-symbol distribution in stream $B$ conditioned on observations of stream $A$, to the entropy of the next-symbol distribution in stream $B$ when no observations are made on stream $A$. The causality coefficient takes values on the closed unit interval, and higher values indicate a stronger predictability of $B$ from $A$, and, therefore, a higher degree of causal influence. We have $\gamma_{A \to B} = 0$ if and only if the inferred machine has a single state, and the streams are statistically independent if and only if $\gamma_{A \to B} = \gamma_{B \to A} = 0$.
Thus, the interpretation of the causality coefficient, which is central to our development, is as follows:
a coefficient of $\gamma$ from stream $A$ to stream $B$ means that we can acquire, on average, $\gamma$ bits of additional information about $B$ for each bit read from $A$, over what is already available from the past history of $B$.
This is very different from computing correlations, and it assumes no model structure a priori for the hidden dynamics driving the data streams.
Importantly, it turns out that inferring self models is simply a special case of cross model inference, in which the target stream is simply a time-shifted version of the source stream itself.
Formal Definition of XPFSAs and Future Prediction
Formally, an XPFSA is also a 4-tuple $(Q, \Sigma, \delta, \tilde{\pi})$, where $Q$ is a set of states and $\Sigma$ is the input alphabet. The transition function $\delta$ is defined as before (in the case of the PFSA), but the symbol probability function is defined as $\tilde{\pi}: Q \times \Sigma' \to [0, 1]$, with

$$\sum_{\sigma' \in \Sigma'} \tilde{\pi}(q, \sigma') = 1, \quad \forall q \in Q$$

where $\Sigma'$ is the target alphabet, which is distinct from $\Sigma$ in general. Note that the symbol distribution specified by $\tilde{\pi}$ is over the target alphabet $\Sigma'$.
As before, there is a time shift $\tau$ associated with an inferred XPFSA, such that each symbol transition in the model maps to $\tau$ steps in the target stream. If we know the current state of the XPFSA to be $q$, and we observe symbol $\sigma$ in the source stream, then the predicted symbol in the target stream ($\tau$ steps in the future) is $\sigma'$, with a probability of $\tilde{\pi}(q', \sigma')$, where $q' = \delta(q, \sigma)$ is the state reached.
As before, determining the current state in the XPFSA is non-trivial. We estimate the current state distribution in a manner similar to that used for self models:
Note that the XPFSA graph does not have probabilities on its edges. However, the corresponding source stream's self model does have symbol probabilities on its edges, and it models the source stream. The problem is that the graph for the self model and that of the cross model might not be identical. We solve this problem using the projective composition of probabilistic automata (Chattopadhyay and Ray, 2008). This operation takes the self model for the source stream and projects it onto the inferred XPFSA graph; the result is a PFSA with the same graph structure as the XPFSA.
Once we obtain this projected PFSA, we can estimate its current state as described in the case of self models (using observed symbols in the source stream).
Let the current state distribution be $p$. Then, the predicted future symbol (at time shift $\tau$) in the target stream is $\sigma'$ with a probability of:

$$\Pr(\sigma') = \sum_{q \in Q} p_q \, \tilde{\pi}(q, \sigma')$$

where $\tilde{\pi}$ is the symbol probability function of the XPFSA, and not of the projected PFSA.
Computation of the Coefficient of Causality
Let $s_A$ and $s_B$ be stationary ergodic processes over finite alphabets $\Sigma_A$ and $\Sigma_B$, respectively. Then, the causal dependence coefficient of $s_B$ on $s_A$, denoted $\gamma_{A \to B}$, is formally defined as the ratio of the expected change in the entropy of the next-symbol distribution in $s_B$ due to observations in $s_A$, to the entropy of the next-symbol distribution in $s_B$ in the absence of any data from $s_A$. We have (Chattopadhyay, 2014):

$$\gamma_{A \to B} = \frac{H(\phi_\lambda) - \mathbb{E}\left[H(\phi_x)\right]}{H(\phi_\lambda)}$$

where the entropy of a discrete probability distribution $p$ is given by $H(p) = -\sum_i p_i \log p_i$, $\lambda$ is the empty string (so that $\phi_\lambda$ is the unconditioned next-symbol distribution in $s_B$), and $\mathbb{E}$ is the expectation operator over all possible sequences $x$ in the source stream. Once we infer the cross model, computing $\gamma_{A \to B}$ is straightforward.
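An empirical analogue of this coefficient can be sketched directly from the two symbol streams, without first building the cross model. The following is illustrative only (fixed-length contexts stand in for the inferred causal states, and the names `ctx_len` and `shift` are assumptions): it measures the fractional reduction in the target's next-symbol entropy afforded by reading source contexts.

```python
import math
from collections import Counter, defaultdict

def entropy(dist):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def causality_coefficient(source, target, ctx_len=2, shift=1):
    """Sketch: fractional reduction in the next-symbol entropy of
    `target` afforded by length-`ctx_len` contexts of `source`."""
    alpha = sorted(set(target))
    marginal = Counter()
    cond = defaultdict(Counter)
    weights = Counter()
    for i in range(ctx_len, len(source) - shift + 1):
        sym = target[i + shift - 1]
        ctx = source[i - ctx_len:i]
        marginal[sym] += 1
        cond[ctx][sym] += 1
        weights[ctx] += 1
    n = sum(marginal.values())
    H0 = entropy([marginal[a] / n for a in alpha])
    if H0 == 0:
        return 0.0
    # Expected conditional entropy, weighted by context frequency.
    Hc = sum(w / n * entropy([cond[c][a] / w for a in alpha])
             for c, w in weights.items())
    return (H0 - Hc) / H0
```

When the target is fully determined by the source's recent history the coefficient approaches 1; for statistically unrelated streams it approaches 0.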
Choice of Quantization Schemes
The weekly time series of county-specific records of the number of reported influenza cases is quantized to a binary stream. If $x_t$ is the number of reported cases for a given fixed county at week index $t$, then the quantized binary stream $s_t$ is obtained as follows:
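The specific thresholding rule is given by the article's equation and is not reproduced here. Purely for illustration, one plausible scheme of this kind, binarizing week-over-week changes, might look like the following; it is an assumption, not the rule actually used:

```python
def quantize_weekly_counts(counts):
    """Hypothetical binarization of a weekly case-count series: emit '1'
    when the count rises relative to the previous week, else '0'. The
    thresholding rule used in the article may differ; this is only one
    plausible scheme for mapping counts to a binary symbol stream."""
    return "".join("1" if b > a else "0"
                   for a, b in zip(counts, counts[1:]))
```

The resulting symbol string is the kind of input consumed by the PFSA and XPFSA inference procedures described above.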
References
- Predictive Bayesian model selection. American Journal of Mathematical and Management Sciences 31:13–38. https://doi.org/10.1080/01966324.2011.10737798
- Dynamics of annual influenza A epidemics with immuno-selection. Journal of Mathematical Biology 46:504–536. https://doi.org/10.1007/s00285-002-0186-2
- Non-linear Granger causality in the currency futures returns. Economics Letters 68:25–30. https://doi.org/10.1016/S0165-1765(00)00219-6
- A General Test for Nonlinear Granger Causality: Bivariate Model.
- Absolute humidity, temperature, and influenza mortality: 30 years of county-level evidence from the United States. American Journal of Epidemiology 176:S114–S122. https://doi.org/10.1093/aje/kws259
- Influenza drift and epidemic size: the race between generating and escaping immunity. Theoretical Population Biology 65:179–191. https://doi.org/10.1016/j.tpb.2003.10.002
- Epidemic dynamics and antigenic evolution in a single season of influenza A. Proceedings of the Royal Society B: Biological Sciences 273:1307–1316. https://doi.org/10.1098/rspb.2006.3466
- Causality, chaos, explanation and prediction in economics and finance. In: Beyond Belief: Randomness, Prediction, and Explanation in Science, pp. 230–279.
- Prevention and control of seasonal influenza with vaccines: recommendations of the Advisory Committee on Immunization Practices (ACIP), 2009. MMWR Recommendations and Reports 58:1–52.
- Human mobility and the spatial transmission of influenza in the United States. PLoS Computational Biology 13:e1005382. https://doi.org/10.1371/journal.pcbi.1005382
- Abductive learning of quantized stochastic processes with probabilistic finite automata. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 371:20110543. https://doi.org/10.1098/rsta.2011.0543
- Structural transformations of probabilistic finite state machines. International Journal of Control 81:820–835. https://doi.org/10.1080/00207170701704746
- Data Driven Methods for Nonlinear Granger Causality: Climate Teleconnection Mechanisms (Book). Pittsburgh, Pennsylvania: Carnegie Mellon University.
- Real-time and retrospective forcing in the North American Land Data Assimilation System (NLDAS) project. Journal of Geophysical Research 108. https://doi.org/10.1029/2002JD003118
- The calculi of emergence: computation, dynamics and induction. Physica D: Nonlinear Phenomena 75:11–54. https://doi.org/10.1016/0167-2789(94)90273-9
- Rescinding community mitigation strategies in an influenza pandemic. Emerging Infectious Diseases 14:365–372. https://doi.org/10.3201/eid1403.070673
- Antiviral agents active against influenza A viruses. Nature Reviews Drug Discovery 5:1015–1025. https://doi.org/10.1038/nrd2175
- On U-statistics and von Mises' statistics for weakly dependent processes. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 64:505–522. https://doi.org/10.1007/BF00534953
- A new statistic and practical guidelines for nonparametric Granger causality testing. Journal of Economic Dynamics and Control 30:1647–1669. https://doi.org/10.1016/j.jedc.2005.08.008
- Spatial dynamics of the 1918 influenza pandemic in England, Wales and the United States. Journal of the Royal Society Interface 8:233–243. https://doi.org/10.1098/rsif.2010.0216
- An influenza simulation model for immunization studies. American Journal of Epidemiology 103:152–165. https://doi.org/10.1093/oxfordjournals.aje.a112213
- A(H1N1)pdm09 influenza infection: vaccine inefficiency. Oncotarget 8:32856–32863. https://doi.org/10.18632/oncotarget.16459
- Targeted social distancing design for pandemic influenza. Emerging Infectious Diseases 12:1671–1681. https://doi.org/10.3201/eid1211.060255
- Spatial Transmission of 2009 Pandemic Influenza in the US. PLoS Computational Biology 10:e1003635. https://doi.org/10.1371/journal.pcbi.1003635
- Testing for causality. Journal of Economic Dynamics and Control 2:329–352. https://doi.org/10.1016/0165-1889(80)90069-X
- Effectiveness of interventions to reduce contact rates during a simulated influenza pandemic. Emerging Infectious Diseases 13:581–589. https://doi.org/10.3201/eid1304.060828
- The Truven Health MarketScan databases for life sciences researchers. Truven Health Analytics, IBM Watson Health.
- Longitudinal Data Analysis. Wiley Series in Probability and Statistics. Hoboken, NJ: Wiley-Interscience.
- Testing for linear and nonlinear Granger causality in the stock price-volume relation. The Journal of Finance 49:1639–1664.
- Nonlinearity and endogeneity in macro-asset pricing. IMF Working Papers 95:1. https://doi.org/10.5089/9781451845082.001
- An Enquiry Concerning Human Understanding (Book). United States: Hackett Publishing Company.
- Critique of Pure Reason. The Cambridge Edition of the Works of Immanuel Kant. New York, NY: Cambridge University Press. https://doi.org/10.1017/CBO9780511804649
- A contribution to the mathematical theory of epidemics. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 115:700–721. https://doi.org/10.1098/rspa.1927.0118
- Simulating school closure strategies to mitigate an influenza epidemic. Journal of Public Health Management and Practice 16:252–261. https://doi.org/10.1097/PHH.0b013e3181ce594e
- Pandemic simulation of antivirals + school closures: buying time until strain-specific vaccine is available. Computational and Mathematical Organization Theory 14:209–221. https://doi.org/10.1007/s10588-008-9027-1
- Counterfactuals and Causal Inference: Methods and Principles for Social Research. Analytical Methods for Social Research, second edition. New York: Cambridge University Press.
- Adaptive, hands-off stream mining. In: Proceedings 2003 VLDB Conference, pp. 560–571. https://doi.org/10.1016/B978-012722442-8/50056-2
- Causal inference in statistics: An overview. Statistics Surveys 3:96–146. https://doi.org/10.1214/09-SS057
- Causality: Models, Reasoning and Inference (2nd edition) (Book). New York, NY, USA: Cambridge University Press. https://doi.org/10.1017/CBO9780511803161
- Scenarios of diffusion and control of an influenza pandemic in Italy. Epidemiology and Infection 136:1650–1657. https://doi.org/10.1017/S095026880800037X
- Causality tests and functional form sensitivity. Applied Economics 17:135–141. https://doi.org/10.1080/00036848500000011
- A test of Granger non-causality based on nonparametric conditional independence (Conference). In: 2010 20th International Conference on Pattern Recognition, pp. 2620–2623.
- Driving factors of influenza transmission in the Netherlands. American Journal of Epidemiology 178:1469–1477. https://doi.org/10.1093/aje/kwt132
- Influenza vaccine: outmaneuvering antigenic shift and drift. New England Journal of Medicine 350:218–220. https://doi.org/10.1056/NEJMp038238
- A Bayesian view of model complexity. Statistica Neerlandica 66:253–271. https://doi.org/10.1111/j.1467-9574.2011.00518.x
- Analyses of the 1957 (Asian) influenza pandemic in the United Kingdom and the impact of school closures. Epidemiology and Infection 136:166–179. https://doi.org/10.1017/S0950268807008369
- Emergence of influenza A viruses. Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences 356:1817–1828. https://doi.org/10.1098/rstb.2001.0997
- Modeling on social spread from immunity. Japanese Journal of Infectious Diseases 58:S14.
- Influenza research database: an integrated bioinformatics resource for influenza virus research. Nucleic Acids Research 45:D466–D474. https://doi.org/10.1093/nar/gkw857
Article and author information
Author details
Funding
National Institutes of Health (U01HL108634-01)
- Jeffrey L Shaman
National Institutes of Health (R01GM100467)
- Jeffrey L Shaman
National Institutes of Health (U01GM110748)
- Jeffrey L Shaman
National Institutes of Health (1P50MH094267)
- Andrey Rzhetsky
Defense Sciences Office, DARPA (W911NF1410333)
- Andrey Rzhetsky
National Institutes of Health (R01HL122712)
- Andrey Rzhetsky
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Erin Gannon, Mark Jit, Prabhat Jha, Elizabeth C Lee, and Margarita Rzhetsky for numerous comments on earlier versions of the manuscript. This work was supported by NIH grants 1P50MH094267, U01HL108634-01, R01HL122712-02, R01GM100467, and U01GM110748, in addition to DARPA contract W911NF1410333, and a gift from Liz and Kent Dauten.
Copyright
© 2018, Chattopadhyay et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Further reading
-
- Epidemiology and Global Health
Artificially sweetened beverages containing noncaloric monosaccharides were suggested as healthier alternatives to sugar-sweetened beverages. Nevertheless, the potential detrimental effects of these noncaloric monosaccharides on blood vessel function remain inadequately understood. We have established a zebrafish model that exhibits significant excessive angiogenesis induced by high glucose, resembling the hyperangiogenic characteristics observed in proliferative diabetic retinopathy (PDR). Utilizing this model, we observed that glucose and noncaloric monosaccharides could induce excessive formation of blood vessels, especially intersegmental vessels (ISVs). The excessively branched vessels were observed to be formed by ectopic activation of quiescent endothelial cells (ECs) into tip cells. Single-cell transcriptomic sequencing analysis of the ECs in the embryos exposed to high glucose revealed an augmented ratio of capillary ECs, proliferating ECs, and a series of upregulated proangiogenic genes. Further analysis and experiments validated that reduced foxo1a mediated the excessive angiogenesis induced by monosaccharides via upregulating the expression of marcksl1a. This study has provided new evidence showing the negative effects of noncaloric monosaccharides on the vascular system and the underlying mechanisms.
-
- Epidemiology and Global Health
- Microbiology and Infectious Disease
Influenza viruses continually evolve new antigenic variants, through mutations in epitopes of their major surface proteins, hemagglutinin (HA) and neuraminidase (NA). Antigenic drift potentiates the reinfection of previously infected individuals, but the contribution of this process to variability in annual epidemics is not well understood. Here, we link influenza A(H3N2) virus evolution to regional epidemic dynamics in the United States during 1997—2019. We integrate phenotypic measures of HA antigenic drift and sequence-based measures of HA and NA fitness to infer antigenic and genetic distances between viruses circulating in successive seasons. We estimate the magnitude, severity, timing, transmission rate, age-specific patterns, and subtype dominance of each regional outbreak and find that genetic distance based on broad sets of epitope sites is the strongest evolutionary predictor of A(H3N2) virus epidemiology. Increased HA and NA epitope distance between seasons correlates with larger, more intense epidemics, higher transmission, greater A(H3N2) subtype dominance, and a greater proportion of cases in adults relative to children, consistent with increased population susceptibility. Based on random forest models, A(H1N1) incidence impacts A(H3N2) epidemics to a greater extent than viral evolution, suggesting that subtype interference is a major driver of influenza A virus infection ynamics, presumably via heterosubtypic cross-immunity.