Abstract
Using several longitudinal datasets describing putative factors affecting influenza incidence and clinical data on the disease and health status of over 150 million human subjects observed over a decade, we investigated the source and the mechanistic triggers of influenza epidemics. We conclude that the initiation of a pancontinental influenza wave emerges from the simultaneous realization of a complex set of conditions. The strongest predictor groups are as follows, ranked by importance: (1) the host population’s socio and ethnodemographic properties; (2) weather variables pertaining to specific humidity, temperature, and solar radiation; (3) the virus’ antigenic drift over time; (4) the host population’€™s landbased travel habits, and; (5) recent spatiotemporal dynamics, as reflected in the influenza wave autocorrelation. The models we infer are demonstrably predictive (area under the Receiver Operating Characteristic curve 80%) when tested with outofsample data, opening the door to the potential formulation of new populationlevel intervention and mitigation policies.
https://doi.org/10.7554/eLife.30756.001eLife digest
Influenza – or ‘the flu’ – is a contagious disease which sweeps across the globe like clockwork, claiming tens of thousands of lives. This is known as ‘seasonal flu’.
Many scientists have tried to identify the factors that spark these yearly outbreaks. Some past studies have found that seasonal flu occurs when air that is normally humid turns dry, suggesting weather patterns play an important part. Other research has shown that air travel contributes to the flu spreading across the world. However, these studies typically focus on just one or two factors on their own. It is still not clear how exactly these factors combine to drive outbreaks, and then sustain the wave of infection.
To address this, Chattopadhyay et al. analyze the medical histories of 150 million American people over a decade, combining this information with large datasets about the different factors that trigger flu outbreaks. This includes detailed data about air travel and weather patterns, as well as census data that describe features of the population. Patterns of movement are also examined, for example by processing billions of Twitter messages “tagged” with a location. Chattopadhyay et al. used all of these datasets to model outbreaks of the flu in the United States, and see which factors play the biggest role.
It turns out that yearly outbreaks of seasonal flu are a result of a combination of elements. Some factors interact to help trigger the start of the wave, like humid weather in a highly populated area with nearby airports. Other factors, such how people move, encourage the spread of the infection. Finally, certain features of the population, for example how closely knitted a community is, make specific areas of the country more susceptible to the arrival of the disease. Overall, some of the most important elements of the model relate to the characteristics of the populations, the weather, the type of virus, and the number of shortdistance journeys (rather than air travel).
Understanding how and why outbreaks occur can help policymakers design strategies that reduce the spread and impact of seasonal flu, which could potentially save thousands of lives. Ultimately, the model developed by Chattopadhyay et al. could be used to test whether these policies would work before they are implemented in the real world.
https://doi.org/10.7554/eLife.30756.002Introduction
Seasonal influenza is a serious threat to public health, claiming tens of thousands of lives every year. A large number of past studies have focused on identifying the likely factors responsible for initiating each seasonal disease wave. Typically, each such study focused on one or a few hypothetical factors. Our study aimed at an integrative, joint analysis of numerous suggested disease triggers, comparing their relative importance and possible cooperation in triggering panUS waves of seasonal influenza infection. The goal of this study was to identify the most informative combinations of statistical predictors associated with the initiation of panUS influenza infection waves.
Recent computational studies of influenza:
Computational study of the dynamics and factors influencing infectious disease spread began with compartmental models, such as the SusceptibleInfectedResistant (SIR) model, which traces its origins to the beginning of the last century (Kermack and McKendrick, 1927). Initially a purely theoretical tool, these SIRstyle models were subsequently enhanced with population and geographic data, allowing their application to specific cities and the distances between them (Keeling and Rohani, 2002). For instance, one approach, termed ‘gravity wave’ modeling, used geographic, short and longrange, workrelated human movement and demographic data to formulate gravity potentials between US counties in order to infer the dynamics of infection spread (Viboud et al., 2006).
Some studies have focused on one specific factor affecting infection, such as air travel, to simulate the spread of influenza (Colizza et al., 2006); other studies used SIR models, generalized for a collection of interconnected geographic areas, spatialnetwork or patch models, to model a number of common infections, including influenza, measles, and footandmouth disease (Riley, 2007).
More ambitious network model approaches have simulated the global transmission of infectious disease using highresolution, worldwide population data and the locations of International Air Transport Association (IATA)indexed airports (Balcan et al., 2009). Similar to (Viboud et al., 2006), the authors of the study computed the global infectionpredisposing ‘gravity field’ over the network of international airports. This networkbased approach was subsequently developed further (Balcan and Vespignani, 2011) through the modeling of ‘phase transition’–that is the chainreaction switch of geographic infection status–in complex networks, utilizing approaches introduced in theoretical physics.
Another layer of sophistication was achieved by incorporating rich historical records. For example, Eggo et al. (Eggo et al., 2011) modeled the Spanish influenza epidemic of 1918–1919, using mortality documents from both the UK and the US, explicitly accounting for the size and distances between cities. In the same spirit, Brockmann and Helbing (Brockmann and Helbing, 2013) represented infection as diffusion on a complex network, estimating arrival times for infection across the globe.
Following the formulation of the hypothesis that absolute humidity modulates influenza survival and transmission (Shaman and Kohn, 2009), researchers began incorporating climate variables into SIRlike models (Chowell et al., 2012). More recent dynamic models have incorporated a probabilistic description of influenza infection’s spatial transitions in space and time, accounting for selected demographic confounders (Gog et al., 2014) and (Charu et al., 2017).
In this study, rather than following the SIRstyle modeling tradition, we used statistical epidemiology and econometriclike approaches, in addition to a causalitynetwork method presented here for the first time. There are some prior studies that are close to ours in spirit (but not in details). For example, (Barreca and Shimshack, 2012) used historical US influenza mortality data (1973–2002) in conjunction with collinear humidity and temperature records to establish countylevel statistical associations between variables in the datasets. They concluded that absolute humidity was ‘an especially critical determinant of observed human influenza mortality, even after controlling for temperature.’ Another study, focusing on historical influenza records in the Netherlands, (te Beest et al., 2013) used the number of weekly influenzalike patient visits (transformed into an estimated rate of infection) as a response variable in a regression analysis of climate data. They concluded that the bulk of explained variation (57%) was attributed to the depletion of susceptible hosts during the disease season and nonweatherrelated ‘betweenseason effects,’ with only 3% explained by absolute humidity, represented as a continuous predictor variable. Additionally, this study observed that school holidays did not have a statistically significant effect on influenza transmission.
As all causality detection methods come with dissimilar limitations and are imperfect in unique ways, we designed our study intentionally to attack the same target problem using three different statistical approaches: Approach 1: A nonparametric Granger analysis (Granger, 1980) focusing on infection flows’ directionalities across the US and whether influenza propagates via long vs. shortdistance travel (we run analysis across all pairs of air and landtravel county neighbors, respectively). Approach 2: A mixedeffect Poisson regression (Hedeker and Gibbons, 2006) explicitly accounting for the autocorrelation of infection waves in time and space, along with the full set of socioeconomic, climate, and geographic predictors. Approach 3: A countymatching, nonparametric analysis to identify the minimum predictive set of factors that distinguish those counties associated with the onset of the influenza season (Morgan and Winship, 2015).
Our study became possible through access to several, very large longitudinal datasets: (1) a nineyear collection of insurance records capturing the dynamics of influenzalike illnesses (ILIs) in the United States (Truven MarketScan database, see Materials and methods); (2) temporally collinear, highresolution weather measurements over every US county; (3) detailed air travel (The United States Bureau of Transportation Statistics, 2010) and geographic proximity data (The United States Census, 2016) showing connectivity between US counties; (4) billions of geolocated Twitter messages reflecting long and shortdistance human movement patterns, and; (5) US census data accounting for US county and countyequivalent population distribution, demographic, and socioeconomic properties (HRSA, 2016). An explicit comparison of the ILI data in the insurance claims to the influenza records provided by the Center for Disease Control and Prevention (CDC, 2016) showed that the two sources agree well ($\rho =0.91,p=3.5\times {10}^{201}$), with insurance claims providing higher data resolution, see Figure 1—figure supplement 1. Curiously, the relationship between the two sources of ILI observations is not linear: We attribute this to the lower resolution of the CDC data. These three types of analysis produce congruent–albeit not identical–results.
Results
The logical flow of our analysis is as follows: (1) We first show that our definition of ILIs corresponds well with CDC data, and that our causality coefficients, defined in Approach 1, have similar meanings to coefficients in the regression analysis; (2) We then explain the outcomes of the analysis according to Approach 1; (3) We then analyze the importance of putative casual factors, Figure 1, as applied to an initiation of influenza season, Figure 2; (4) In Approach 2, we pay special attention to the relative importance of short and longdistance travel in influenza propagation, Figure 4; (5) We further test the best regression model in terms of predictive accuracy, Figure 5, using disjointed data parts for training and testing, and; (6) We culminate our analysis with a countymatching analysis, Figure 6.
Approach 1: Causality streamlines from nonparametric granger analysis
Our analysis of health insurance claims covers nine years of influenza cycles ($2003$ to $2011$, inclusively), see Figure 2. We visualized weekly, countylevel prevalence as a movie (see Supplement); Figure 2A–H show a few relevant weekly snapshots from different years. The plates in Figure 2A–H, and especially the movie, clearly show that seasonal influenza cycles initiate in the South/Southeastern US and sweep the country from south to north. This pattern is repeated, with some variation, each season.
Figure 3G shows the countrywide propagation dynamics as represented by our computed causality streamlines. The alignment of causality flow vectors into long, continuous streamlines suggests a stable propagation mechanism across the country; the probability of a long sequence of summary movement vectors accidentally matching in the direction by mere chance is vanishingly small ($p<{10}^{16}$ for longer streamlines).
Do epidemics originate from the same counties season after season? To answer this question, we follow ‘causality streamlines’ back to their source county. Informally, influenza onset in these source counties has little or no causal dependency on their neighbors. That is, their epidemic states are seemingly caused by factors outside of disease prevalence in other counties. Figure 2K presents the countyspecific likelihood of streamline initiation across our nine years of data. To verify the nearshores position of these source counties is not a mere manifestation of a boundary effect of shore counties (no neighbors at the side of large water body), we carried out identical causality analyses with two different infections, specifically choosing diseases less likely to share etiologies with influenza: HIV and Escherichia coli. The results for both HIV and Escherichia coli infections are shown in Figure 3J and K, which exhibit flow patterns very different from those obtained for influenza. These streamlines almost never originate from the coasts, thus reducing the likelihood that the pattern observed for influenza is a geospatial boundary effect. Combined with the exceedingly low probability ($\sim {10}^{185}$) of chance inference for the streamlines, this strongly supports our conclusion that the epidemics are of coastal origin.
We directly validated our conclusion that influenza waves tend to start in the South by identifying counties which seem to trigger the epidemic. We computed a ‘trigger period’ of five to six weeks for each season, defined as the period immediately preceding an exponential increase in influenza dispersion. To calculate this weekly dispersion, we treated each county as a node in an undirected graph, each with an edge connecting two geographically adjacent counties–only if they had both reported at least one influenza case in the specified week. We defined dispersion as the size of the largest, connected component in this undirected graph. Thus, a trigger period describes the period in which the size of the giant component of the infection graph rises above $250$ counties from being under $100$ as shown in Figure 2I, and then proceeds to the seasonal peak. Figure 2J presents the likelihood of a county being part of this largest, connected component during the trigger period. In the second approach, we followed causality streamlines back to their source county. Figure 2K presents the countyspecific likelihood of streamline initiation across nine years.
These approaches produced qualitatively similar results (Figure 2J and K). While epidemics seem to start in many places around the country (see the origins of streamlines in Figure 3J and I), they successfully gain traction near large bodies of water (as evidenced by the most likely places of epidemic onset, see Figure 2J and K)). Otherwise, they fizzle out before triggering an actual epidemic cycle (see Figure 2J). Seasonal initiation is neither spatially uniform nor simply a reflection of countyspecific population density.
Our analysis of the Twitter movement matrix indicates that people most frequently travel between neighboring counties, preferentially towards higherpopulationdensity areas, which shows that the maximumprobability movement patterns follow the local gradient of increasing population density (see Figure 4—figure supplement 1). In contrast, the geospatiallyaveraged movement vectors for each county reveal global flows in the movement patterns (see Figure 3H, along with Methods for the calculation of spatial averages).
Figure 3H–I suggest that average movement patterns largely agree with the influenza streamlines: Both patterns, especially in the South/Southeast of the country, are associated with flow pointing away from large bodies of water.
In addition to looking at the direction of shortrange travel, we used our nonparametric Granger analysis to investigate the comparative strength of short vs. longrange influenza propagation. In the first case, we considered the neighborhood map shown in Figure 4A (for a detailed definition of ”neighbors,’ see Materials and methods), and the in the second case, we considered associations between major, airportbearing counties (see Figure 4B). We then plotted the distribution of the maximum pairwise coefficient of causality, where the maximization is carried out by fixing the source and the target and varying the delay in weeks, after which we attempt to predict the target stream.
Conclusions associated with Approach 1: The inferred causality streamlines computed from the infection time series in all counties (Figure 3) show that epidemics are mostly triggered near large water bodies and flow inland and away. They also illustrate that the US continental Southern states act as ‘sinks’ to a large proportion of these streamlines. (‘Sinks,’ in our definition here, are geographic areas that multiple streamlines converge towards; sinks are especially obvious when we look at the vector representation of causality direction. The opposite of a ‘sink’ is a ‘source,’ defined as an area at which at least one streamline starts.) This might explain the increased prevalence in the designated region. Additionally, the analysis shows that human travel is a very important driver of emergent epidemiological patterns, and that shortrange, landbased travel is more important than airtravel. This result is crosscorroborated by our Poisson regression analysis (described next in Approach 2).
Approaches 2 and 3 are motivated by the ‘why’ questions: (1) Why do epidemics initiate where and when they do? and; (2) Why do some disease initiations become epidemics while others do not?
Approach 2: Importance of factors from poisson regression
We focused on a subset of weeks associated with the initial rise of influenza waves (indicated by the gray bars in Figure 2I, and calculated as discussed earlier). The results from our bestfit model are illustrated in Figure 1A. We selected this particular model out of a total of $126$ compared in the Bayesian analysis, a few of which we list in Table 5, ranked by their decreasing goodnessoffit, measured with the Deviance Information Criterion, DIC (see Supplement). From the values of the inferred coefficients corresponding to the different factors, and taking into account their significance levels and credible intervals, we concluded that the roles played by weather variables, particularly humidity, appear to be substantially more complicated compared to what has been suggested in the literature.
The surprisingly unimportant factors
School schedule was not predictive of influenza onset in our analysis: We ended up with a $p$value of $0.84$ and an odds ratio of $0.8403$, strongly suggesting that school opening dates are not a significant factor in triggering the seasonal epidemic.
We are not claiming here that closing down schools during the seasonal peak, or during an initial phase of a seasonal epidemic, would not have a beneficial effect on maximum incidence. Rather, the observed epidemiological patterns over the time period we analyzed (2003–2011) do not seem to name ‘school opening times’ as a significant predictive factor–at least in the continental US.
We factored in the effect of vaccination coverage by estimating the cumulative fraction of the population that received the current influenza vaccine stratified by geospatial location and time of inoculation within each influenza season. Our analysis indicated that vaccination coverage is not a significant predictor of influenza onset/triggering period. It could be a reflection of overall vaccine ineffectiveness, or the choice of outcome predicted (i.e. vaccination might effect overall infection numbers over the entire outbreak, but not the timing of the trigger). It could also reflect the fact that different influenza type/subtypes have different virulence–so a vaccine against H3N2 during an H3N2 year, may be more effective, but due to that fact that H3N2 is more virulent, more people still wind up seeking medical care.
Vaccination coverage failed to reach predictive significance. The variables corresponding to spatiotemporal indicators of the cumulative fraction of the inoculated population are included in our best model (see the last entry in Table 5), but their effect fails to be significant. However, if we drop those variables from the model, the DIC increases. We suggest that the strong dependence between antigenic diversity and vaccination coverage (see Figure 1J–M) is responsible for this effect: Vaccination coverage is important, but its influence is captured by the antigenic variation.
The most important factors
The strongest predictor groups (ranked by importance) are the population’s sociodemographic properties, weather, antigenic drift of the virus, landbased travel, and autocorrelation of influenza waves.
Weather As far as weather effects are concerned, epidemics tend to originate in places with high mean, maximum specific humidity, high average temperature, and low average air pressure, namely, in counties at the Southern, and, to a lesser extent, Eastern and Western US coastlines. Additionally, the spread of an epidemic is significantly influenced by a drop in specific humidity up to four weeks before its onset. However, this effect is weaker than the mean maximum specific humidity effect. Drop in average temperature that dips one to three weeks prior to the epidemic onset is also significantly important (this is consistent with earlier experiments [Lowen et al., 2008]), especially when the temperature drop is accompanied by a decrease in specific humidity, average wind speed, and solar flux. However, high levels of solar flux in the week of onset are also important. This complicated set of weather conditions, a signature of the cold air front (Shaman et al., 2010), is validated by our outof sample predictions to increase the risk of triggering the seasonal epidemic. Total precipitation also plays a positive role.
The weather/humidity paradox
How can colder weather and lower humidity be a predictor of influenza, if influenza epidemic waves tend to start in the South with warmer climates and higher humidity? Our resolution of this seeming controversy is as follows: The stress is on the drop in both humidity and temperature in those areas with high average annual values of these measurements. A blast of colder, lowerhumidity weather in these warmclimate areas has two effects: (1) The influenza virus can stay viable in water droplets longer than in hot, sunny weather, and; (2) Humans tend to interact indoors, in more crowded conditions. Both of these factors are favorable for transmission of the virus to the population at large.
Antigenic variation Antigenic diversity for HA, NA, M1, and M2 are important predictors. While HA, NA, and M1 inhibit the trigger, M2 diversity enhances it. This peculiar difference in the direction of influence might be a manifestation of the roles played by the individual viral proteins in its lifecycle.
The first three proteins are directly involved in the viral binding to host cell surface receptors, while M2 activity is needed only during HA biosynthesis. Additionally, proteolysis experiments indicated that M2 proton channel activity helped to protect (H1N1)pdm09 HA from premature conformational changes as it traversed lowpH compartments during transport to the cell surface (AlvaradoFacundo et al., 2015).
We found that antigenic diversity is a significant predictor in all four of the viral proteins we considered. Interestingly, while the increasing diversity found in HA, NA, and M1 inhibits the epidemic trigger, the higher diversity in M2 enhances it (see Discussion).
Travel Land travel intensity one to three weeks before epidemic onset is a strong predictor. Air travel is also predictive, but its strength is an order of magnitude weaker than that of land travel.
Autocorrelation The increase in an influenza outbreak’s infection weekly rate one and two weeks before an epidemic onset (in the epidemic source county itself) is predictive of epidemic wave origin.
We have used substantially richer datasets than those used by earlier studies (Shaman et al., 2010; Tamerius et al., 2013), which lends strong statistical support to our conclusions. It also allowed us to disentangle and make precise the contributions from different factors, $e.g.$, mean county humidity vs. drops in humidity before an infection. While we found the former effect to be clearly stronger (in accordance to previously reported results [Shaman et al., 2010]), the other diverse set of factors were also found to be significant.
Validation of predictive capability
The robustness of our results is established in a number of ways:
In mixedeffect regression (Approach 2), we compared over $120$ chosen model variations (see Table 5 for an abridged list, and Supplementary file 3for the complete enumeration of considered models); the results appear to be qualitatively stable, though the quantitative performance of the models vary somewhat as measured by DIC for different configurations of regression equations.
We carried out a direct validation of predictive performance by estimating model parameters using the first six seasons and predicting the epidemic trigger locations using the last three (see Figure 5 and Materials and methods). The outofsample predictions of influenza incidence are always positively correlated with observed incidence (Plate A). Perhaps more importantly, we obtained good predictability as measured by the area under the curve (AUC $\approx 80\%$) for the receiver operating characteristics (ROC, See Plates BD). Plates EG show that our outofsample predictions correctly identified epidemic initiation in the Southern and Southeastern counties of the continental US.
As we discuss in the next section, in Approach 3, we conducted a corroborating matched effect analysis on the counties, using combinations of countyspecific factors as a ‘treatment,’ not unlike clinical trials in which patients on a drug regimen are matched to patients receiving a placebo (Morgan and Winship, 2015).
Approach 3: Matching counties and factor combinations
In Approach 3, we investigated combinations of factors presented as ‘treatment’ via a nonparametric, exactmatching analysis of US counties during the weeks of epidemic onset on a seasonbyseason basis.
First, we collected the list of all counties with a drop in maximum specific humidity during the weeks leading up to an influenza season in a particular year. This is the ‘treated set’: the set of counties that may be thought of as subjected to the positive ‘treatment’ of a drop in specific humidity. We split this set into two, considering counties that also experience increased influenza prevalence during the epidemic onset, and ones that do not (counties with two different values of the outcome variable). The number of counties in these two sets define the first row of a $2\times 2$ contingency table. In the second row (the ‘control set’), we focused on counties that do not experience a drop in the maximum specific humidity. However, we only considered counties that have a matching counterpart in the treated set in the following sense: For each county in the control set, we found at least one in the treated set such that the rest of the significant variables (other than specific humidity) had similar variation patterns in both counties. Once we defined the control set, we split it in the manner described for the treated set: We counted the number of control counties that experienced an increased influenza prevalence during epidemic onset, and those which did not. This defined the second row of the contingency table. Finally, we used Fisher’s exact test to compute an odds ratio (the odds of realizing these numbers by chance), along with the testderived significance of the association between the ‘treatment’ and epidemic wave initiation (pvalue). Furthermore, we defined our treatment to consist of multiple factors simultaneously, $e.g.$ specific humidity and its change in the preceding week, along with average temperature and degree of urbanity, see Figure 6.
Note that geographic clustering of ‘treated’ and ‘untreated’ counties arose automatically as a result of similar weather patterns being imposed via the constraint of multiple climate variables.
Unlike the mixedeffect regression approach (Approach 2), this matching analysis is nonparametric, and intended to reveal whether multiple factors are, indeed, simultaneously necessary.
The results of Approach 3’s analysis are presented in Figure 6 and Tables 2 and 3. We found that no single variable was able to consistently yield a statistically significant odds ratio greater than one; multiple factors interacted to shape an epidemic trigger (see Table 3 for a few examples). With a total of $47$ significant variables in our best mixedeffect model, an exhaustive search for all combinations was not feasible. Instead, we performed a standard evolutionary search, looking for combinations that yielded a significant odds ratio for individual seasons. Additionally, we considered all seasons together (by simply adding the contingency tables, elementwise) in order to increase the test’s statistical power.
We isolated ten variables (as shown in Figure 6, Plate B) in this manner which included maximum specific humidity and average temperature along with their variations, the degree of urbanity, antigenic variation, and vaccination coverage.
The factors that appeared most often in our analysis are illustrated in Plate A: It appears that maximum specific humidity and average temperature, along with their variations, and the degree of urbanity have the most frequent contribution, followed by antigenic variation and vaccination coverage.
We did see some new factors here that failed to be of significance in the regression analysis (Approach 2), $e.g.$, degree of urbanity and vaccination coverage. While vaccination coverage was included as a factor in our best performing model in Approach 2, it failed to achieve significance, perhaps due to its strong dependence on antigenic variation (see Figure 1J–M). Degree of urbanity was indeed significant for some of the regression models we considered (see Supplementary Information), but failed to be so for the model with the least DIC.
Thus, Approach three corroborates and strengthens key claims of Approach 2.
The exact set of factors varied somewhat over the seasons; nevertheless, together, they yield significant results when all seasons are considered together. The matching analysis corroborates our results from both the mixedeffect regression and the geographic streamline analyses: The sets of counties initiating the wave are near coasts on the Southern region of the continental US (see Plates A  I in Figure 6).
Local travel vs. longdistance travel and influenza
Our conclusion that local travel is predominantly responsible for disease wave propagation is supported by several lines of analysis.
First, continuous landmovement infection waves are visible in the weekly influenza rate movie; we computed this movie from insurance claim data and made it available with results of this study.
Second, because our allweeksincluded dataset was too large for the R MCMCglmm package to handle, we performed mixedeffect Poisson regression calculations using a 50 percent random sample of all the weeks for which data were available. In this computation, the airportproximity, fixedeffect coefficient turned out to be statistically significantly negative (see Supplement A, as well as the editable output file ‘flu50percentweeks.txt’).
Third, the results from our Grangercausality inference showed that:
Local countytocounty movements were much more predictive of influenza wave change than airport movements. In comparing Plates E and F in Figure 4, we see that the local movement causality coefficient ($\gamma $) is, on average, twice as large as that for longrange movement (Figure 4E). Figure 4E shows that the mean longrange causality coefficient is approximately 0.05, whereas it is just over 0.1 in the local propagation. As the causality coefficient quantifies the amount of predictability (measured as information in bits) communicated about the target data stream per observed bit in the source data stream, it follows that, on average, every ten bits of sequential incidence data from an influencing location tells us one bit about the unfolding incidence dynamics in the target location. Therefore, in case of the longrange movement, informativeness is twice as low, so we need on average 20 bits to infer one bit about the state of infection. These calculations strongly suggest that local movement is predictively stronger with regards to influenza infection propagation.
While the most frequent value of the computed time delay in influence propagation between counties with large airports is zero weeks, this distribution is significantly flatter compared to that for local, countytocounty influence propagation.
Discussion
A summary of the complex relationship between the driving factors that contribute to the trigger and subsequent development of a seasonal epidemic can be clarified with a forest fire metaphor.
The maturation of a forest fire requires the collusion of multiple factors—namely the presence of flammable media, an initiating spark, and a wind current to help spread the fire. Our conceived mapping of this analogy to influenza infection is as follows:
Flammable Media: The Southern US appears to have an unusually high level of social connectivity; it is at least one order of magnitude higher than that in the north of the country (see GSS survey results (Smith et al., 1972) and Table 1). The number of close friends, close friends who are neighbors, and communities of people who all, or mostly, know each other is much higher in the South than in the country at large. Our conjecture is that a manifestation of this highconnectivity is the highest apparent percentage of people infected with influenza ($20\%$ as opposed to $4\%$ in other parts of the country).
Initiation Spark: An initial spark for the infection wave is generated by a combination of weather and demographic factors. Specifically, warm, humid places are conducive to influenza wave initiation  particularly in weeks where specific humidity drops. Airport proximity is important, as well as demographic and economic makeup and also the degree of urbanization. Note that the first static condition (warm humid places) is highly correlated with areas in the South with greater social connectivity. It is possible that static meteorological variables (warm mean temperature and high mean humidity) serve as proxies for high social connectivity or other correlated socioeconomic factors.
Wind: The ‘wind’ in this analogy is the collective movement of a large number of people, integrated over time, revealing persistent ‘currents.’ These currents reproducibly point from coastlines and move inwards towards the center of the continent, making them perfect vehicles to transmit the infection inland from the shores.
Each of our three types of computational approaches has their strengths and weaknesses: (1) The Poisson mixedeffect regression allows for the direct comparison of the predictive strength of numerous predictor variables and accounts for spatial and temporal autocorrelation, but relies on strong modeling assumptions; (2) The nonparametric Granger analysis is not limited by restrictive modeling assumptions in our implementation, though it focuses only on trends of infection propagation between counties, and; (3) The countymatching analysis is also modelfree, but this freedom comes at the expense of lesser statistical power.
What is new in this study?
The following aspects make our study of influenza triggers new in the influenza literature: (1) Instead of simulating the plausibility of one particular epidemic trigger model with a dynamic disease transmission model, we used formal model selection tools to compare the goodness of fit of hundreds of plausible models; (2) We explicitly attempted to systematically crosscompare the importance of numerous individual factors typically hypothesized to contribute to epidemic onset; (3) To accomplish this, we collected an unprecedented volume of temporal and spatial data on disease dynamics and the dynamics of putative predisposing factors; (4) We used several orthogonal, computational causalityinference techniques (one of which was developed specifically for this study) to probe associations between disease onset and putative epidemic triggers; (5) We tested our best models for their predictive potential and demonstrated that they are, indeed, suitable for forecasting disease waves, and; (6) We combined, for the first time, numerous candidate factors in a single, integrative study.
Convergent conclusions, culled from these radically different techniques, strengthen our claims and make it statistically unlikely that we are observing analysis artifacts. First, the Granger causality analysis results (Approach 1) provide insights into the details of influenza’s epidemiological dynamics. Figure 3G traces out the paths most likely followed by the infection, on average, across the continental US. We note that $\sim 75\%$ of the streamlines sink in counties belonging to the Southern states, which matches up well with the streamlineencoded dynamics of weekly disease incidence over nine years (see Figure 3I). What drives this particular causality field’s geometry? While we cannot definitively answer this question, a comparison of the global patterns emerges from the local mobility data culled from the aforementioned Twitter database and offers a tentative explanation (see Figure 3H). Second, contrary to reported human travel pattern influence on seasonal epidemics (Viboud et al., 2006) (but consistent with [Gog et al., 2014]), we find that shortdistance travel contributes more significantly to disease spread (see Figure 4). In particular, we find that longrange air travel is important as an epidemic trigger, but once infection waves are triggered, air travel patterns (or proximity to major airports) become less important. Shortrange mobility, on the other hand, is apparently important for sustaining infection transmission over each season. Thus, we find shortrange travel to be more important for defining the emergent spatiotemporal geometry of infection waves, while proximity to airports is more important for actually triggering an influenza season; the latter loses positive influence once an infection is under way. This conclusion is justified as follows: (1) When we performed regression calculations using all weeks for which data were available (as opposed to wave initiation weeks only) the airport proximity predictor coefficient turned out to be statistically significantly negative (see Supplement). (2) Results from our Grangercausal inference indicate that, on average, the local, putatively causal connections are far stronger compared to the putatively causal connections between counties within which the major airports are located (see Figure 4C and F). Additionally, from our best mixedeffect regression model (Figure 1A), we find that land connectivity effects are significantly stronger than air connectivity effects. The predictive value of Twitter connectivity, which intuitively captures both local and longdistance travel, lies inbetween land and air connectivity coefficients. Note that Twitter connectivity is represented as a directed graph, where for each pair of counties, $i$ and $j$, the $(i,j)$ edge weight represents the conditional probability of ending at county $j$, given that a traveler/Twitter user started her journey in county $i$. Transition probabilities from $i$ to $j$ sum to one over all $j$. Therefore, intuitively, the Twitter connectivity graph should have the features of both a landconnectivity and an air travel graph; which indeed appears to be close to reality. 3) While airport diffusion is a significant factor in our best Poisson model (using data from the initiation period), the causal streamlines (constructed with the complete, allyear incidence data) do not seem to originate from airportbearing counties.
The role of shortdistance travel is particularly crucial in explaining influenza’s timeaveraged, geospatial prevalence. While the mixedeffect regression analysis explains seasonal initiation in the vicinity of the continental US Southern shores, it might not, by itself, adequately explain its average prevalence patterns across the country.
Also not explained solely by our regression models is the occurrence of relatively high infection prevalence in the central parts of the country. These differences cannot be attributed to longdistance air travel, as discussed before. However, the routes taken by the causality streamlines (as computed by the nonparametric Granger analysis), interpreted as paths followed by an infection on average, suggest an explanation: The close match between the Grangercausal flow and the shortrange mobility patterns (derived from Twitter analysis) strongly suggest that average disease prevalence is modulated by shortrange mobility.
Rationale for observational analysis
The traditional empirical approach of testing a causal link between a factor and an outcome of an experiment was to vary one factor at a time, while keeping the other factors (experimental conditions) constant. This ‘all the rest of the conditions are equal’ assumption is often referred to by its Latin form as ceteris paribus. R.A. Fisher ([Fisher, 1935], p. 18) noted that, in reallife experiments, perfect ceteris paribus is not achievable ‘because uncontrollable causes which may influence the results are always … innumerable.’ Fisher’s proposed solution to this problem is to design experiments to involve random assignment of treatment (the putative causal factorâ€™s states) to individual trials and then use regression analysis to estimate the value and significance of the putative causal effect.
Likewise, hypothesisdriven science, wherein investigators formulate a single, testable hypothesis and design specific experiments to test it, is a core element of the scientific method, and works well in most scientific fields. However, a new challenge emerges in datarich scientific fields, such as genomics, epidemiology, economics, climate modeling, and astronomy: How do we choose the most promising hypotheses among millions of eligible candidates that potentially fit data? One solution to this challenge is the manyhypotheses approach, a method of automated hypothesis generation in which many hypotheses are systematically produced and simultaneously tested against all available data. This approach is currently used, for example, in wholegenome association or genetic linkage studies, and often enables truly unexpected discoveries. In contrast to the singlehypothesis approach, the manyhypotheses approach explicitly accounts for the large universe of possible hypotheses through calibrated statistical tests, effectively reducing the likelihood of accidentally accepting a mediocre hypothesis as a proxy for the truth (Nuzzo, 2014).
The manyhypotheses approach provides a complement to carefully controlled and highly focused wet laboratory experiments. Running controlled experiments to test a single hypothesis necessarily ignores many of the complexities of a realworld phenomenon; these complexities are necessarily present in large, longitudinal datasets. Of course, the datadriven ‘manyhypotheses’ approach is only one aspect of the broader scientific process progressing toward the development of verifiable general theories.
Agreement and disagreement between methods
Intuitively, we expected that all three approaches would produce similar, if not identical results. In practice, while the three approaches agreed in most cases, this agreement was not perfect. For example, the highlighted areas (greater incidence) in the first influenza season snapshot for Figure 2A–H each should match relatively well to the maps in Figure 6B (or at least some unspecified subset of ‘high incidence treatment counties’). While the countymatching results point to initiation at coasts, in Figure 2, 20062007 initiation seems to spread from the West Coast and, in , 2010 has a scattered pattern across the middle of the US.
The intuitive explanation of perceived discrepancy is that the matching method agrees with other analysis types predominantly, but not in all cases. Each analysis has limitations. In the case of the matching analysis, we have less statistical power than in, for example, Poisson regression; matching by numerous parameters reduces the initial set of thousands of counties to a handful of matching ‘treated’ counties (which meet a particular combination of weather and sociodemographic conditions) and ‘untreated’ counties (very similar to ‘treated’ ones in all respects but treatment). The difficulttomatch, ‘weeded out’ counties may happen to be in the coastal areas indicated as the most likely places of influenza wave origin by other analyses.
In the case of the 2007 and 2010 results, the matching analyses pick patterns that are different from those produced by the causality streamline analysis and mixedeffect Poisson regression models.
Figure 6’s Plate B shows the distribution of the treatment counties and matchednontreatment counties. Note that here, we are not directly predicting initiation, so while the patterns in Figure 2 and Figure 6 should indeed show some similarity, they are not required to match up perfectly. The most similar treated counties do indeed show up in the Southern shores.
Generalizability
Our analysis uses no prior knowledge specific to influenza epidemiology. As such, these methods are not limited by either the pathogen under consideration (influenza), or the geospatial context (United States). The tools developed here are expected to be equally applicable to analyzing general epidemiological dynamics for pathogens other than influenza, unfolding in arbitrary geographical regions. The specific conclusions we draw about the initiation and propagation of the seasonal influenza in US might not hold true for influenza epidemiology in a different geographical context. However, the analysis tools are still applicable. More broadly, our tools delineate a general approach to modeling complex spatiotemporal dynamics, with applications beyond solely disease epidemiology.
We conclude by highlighting the structure of overlapping conclusions delivered by our three approaches. Approach 1: Grangercausality analysis suggests that an epidemic tends to begin in the South, near water bodies and that shortrange, landbased travel is more influential compared to airtravel for infection propagation, providing a map of mean infection flow across the continental US. Approach 2: Poisson regression identifies significant predictive factors, ranks these factors by importance, suggests that Southern shores are where the epidemic begins, and corroborates Approach 1’s result on shortrange vs. longrange travel. Approach 3 (countymatching): This approach drills down further to the epidemic onset source to the Southeastern shores of continental US, and identifies a smaller validated subset of predictive factors.
Materials and methods
The methods and putative predictors identified in our study should be directly applicable to data outside the US. While the balance (relative importance) of putative triggers of influenza waves may vary, the factors themselves should be universal globally, because the virus and host biology are universal.
Candidate factors in influenza initiation
To investigate countyspecific variability, we grouped candidate factors into several categories: demographic, relation to human movement, infection state of county neighbors, and county’s own recent state, and climatic.
Major hypotheses regarding putative causal factors affecting infection dynamics can be traced to a handful of earlier publications:
Short and longrange, workrelated human movement (Viboud et al., 2006), including air travel (Colizza et al., 2006; Balcan et al., 2009; Viboud et al., 2006)
Demographic confounders (Gog et al., 2014; Charu et al., 2017)
Social contact among children in schools, or the ‘Returntoschool effect’ (Gog et al., 2014)
Absolute humidity (Shaman and Kohn, 2009)
Other climate variables (Chowell et al., 2012; te Beest et al., 2013)
Host immunity, as affected by vaccination coverage, previous infections, and antigenic variation (Centers for Disease Control and Prevention et al., 2009).
Human movement
Request a detailed protocolWe considered two measurements of human movement: (1) The first measurement reflects the proximity of counties to major airports. We computed an exponentiallydiffusing influence from counties with major US airports, weighted by passenger enplanements at their respective locations. This accounted for people moving to and from both major airports and neighboring counties, and; (2) The second measurement is a largescale movement matrix representing people’s weektoweek travels between counties. These data were culled from a complete collection of geolocated Twitter messages, captured over $3.5$ years, and constituted a largescale, longitudinal sample of individual movements. We used only automatically geotagged tweets.
Demographic
Request a detailed protocolInfluenza is transmitted through direct contact with infected individuals, via contaminated objects, and virusladen aerosols. Thus, human population density (how many people happen to be around) and social connectivity (how many people interact with each other and how frequently [Bedford et al., 2015]) are factors expected to affect local virus incubation and spread. In addition to population density, we considered socioeconomic factors such as countymean household income, levels of poverty and urbanity, as well as the prevalence of ethnic and age groups. All these sociodemographic and socioeconomic data were derived from reports provided by the US Census ([Web Page]; 2017. Available from: https://www.census.gov/geo/reference/ua/uafacts.html).
Returntoschool effect
Request a detailed protocolSocial contact among children in schools has been extensively investigated as a determinant of the peak incidence rate. This is one of the few factors that might lend itself to intervention relatively easily, and hence the interest is welljustified. While any reduction in social contact should, in theory, directly impact transmission, quantifying the effect of this specific mode of contact on the incidence rate has been difficult to calculate. Predictions of the reduction in the peak incidence associated with reduced social contact were typically 20–60% (Ferguson et al., 2006; Haber et al., 2007), with some studies predicting much larger reductions of $\geqq $90% (Mniszewski et al., 2008; Ghosh and Heffernan, 2010). Reductions in the cumulative attack rate (AR, ratio of the number of new cases to the size of the population at risk) were usually smaller than those in the peak incidence. Several studies predicted small ($\sim $10%) or no reduction in the cumulative AR (Ferguson et al., 2006; Haber et al., 2007; Yasuda et al., 2005; Ciofi degli Atti et al., 2008; Yasuda et al., 2008; Kelso et al., 2009; Davey et al., 2008; Rizzo et al., 2008; Vynnycky and Edmunds, 2008; Glass and Barnes, 2007; Lee et al., 2010; Yang et al., 2011; Zhang et al., 2012), whilst a few predicted substantial reductions (e.g. $\geqq $90%) (Glass et al., 2006; Davey et al., 2008; Elveback et al., 1976; Davey and Glass, 2008). Only two studies (Glass et al., 2006; Lee et al., 2010) predicted that peak incidence might increase markedly under certain circumstances following school closures, for example by 27% if school closures caused a doubling in the number of contacts in the household and community, or by 13% if school systems were closed for two weeks at a prevalence of 1% in the general population. Studies have also investigated the effect of such interventions on children vs. adults; one study predicted an overall reduction in the cumulative AR, but an increase of up to 48% in the cumulative AR for adults in some situations (Araz et al., 2012).
While this diverse set of predictions in the literature often pertains to the effect of school closures as an intervention tool, we are more interested in the influence that the current school schedule has, if any, on triggering an epidemic. To answer this specific question, we formulated a simple statistical test to determine whether the timing of returntoschool after summer and winter holidays significantly predicts influenza season initiation. We found insufficient evidence in support of this effect (see Methods and materials).
Climate variables
Request a detailed protocolSpecific humidity and a drop in temperature have been suggested as the key drivers in triggering seasonal influenza epidemics (Lowen et al., 2008; Shaman et al., 2010). These initial conclusions were drawn from experiments conducted using an animal model (guinea pig), under controlled laboratory conditions (Lowen et al., 2008), followed by indirect support from epidemiological modeling (Shaman et al., 2010).
Vaccination coverage
Request a detailed protocolVaccination is widely regarded as our most promising tool to combat influenza, though antigenic variation between seasons makes it difficult to craft an effective vaccination strategy (Boni, 2008). Understanding how the virus will evolve in the shortterm is key to finding the correct antigenic match for an upcoming influenza season. Additionally, shortterm molecular evolution might rapidly give rise to immuneescape variants that, if detected, might dictate intraseason updates in vaccine composition. More importantly, vaccination itself might exert significant selection pressure to influence antigenic drift. The effect of vaccination on viral evolution has been documented in an avian H5N2 lineage in vaccinated chickens (Lee et al., 2004), suggesting that similar processes might be occurring in human counterparts. The diversity of the surface proteins at any point in time between seasons suggests that our current vaccination strategies are limited to confer partial immunity, which can result in a highly immune or vaccinated population selectively pressurizing the viral population to evolve more quickly than usual. Given that influenza moves quickly across geographies and that there are multiple, cocirculating strains that may confer partial crossprotection, changing viral service proteins represent a ‘moving target’ for the human immune system. There are additional complexities due to earlylife immune imprinting in humans.
Accounting for nonvaccination host immunity
Request a detailed protocolResistance to influenza infection in human hosts arises via two related mechanisms: immunological memory from a previous infection by an identical or sufficiently similar strain, or vaccination against the current strain (or a sufficiently similar strain). In our analysis, we explicitly account for the degree of vaccination coverage. Accounting for host immunity is more difficult, because resistance to infection is not directly observable; however, our analysis also uses the antigenic variation of influenza virus as a proxy to host resistance: If the relative antigenic variation is large, then the susceptibility of nonvaccinated hosts increases, while low antigenic variation increases the probability of encountering a resistant host. In addition, we use the absolute antigenic deviation of the laterseason virus from the firstseason virus in our data set (winter of 2003–2004) as a predictor, in order to capture longerterm effects of immunological memory. Thus, if the magnitude of the relative drift between two succeeding years is small, we have a likely decrease in susceptibility. Likewise, if the absolute deviation from a few years back starts decreasing, we would also register a decrease in susceptibility. Incorporating these factors in the diverse set of models that we investigate, guarantees that we are indeed considering host susceptibility contributions from a wide range of possible mechanisms.
Antigenic Variation
Request a detailed protocolThe influenza virus counteracts host immunity via subtle genomic changes over time. The more gradual process, known as antigenic drift, is a manifestation of a gradual accumulation of mutations within viral surface proteins recognized by human antibodies, such as hemagglutinin (HA) and neuraminidase (NA). These mutations are typically introduced during cycles of viral replication (Boni et al., 2004). Most of these mutations are neutral, $i.e.$ they do not affect the functional conformation of the viral proteins. However, some of these alterations do, in fact, sufficiently change secondary and tertiary protein structures to have a negative impact on the binding of host antibodies raised in response to previously circulating strains (Webby and Webster, 2001). (Many such mutations also reduce the virus’s viability.) Thus, while a single infection episode is potentially enough to provide longterm host immunity to the invading strain, antigenic variation due to intense selection pressure gives rise to novel viral strains, making reinfections possible within the span of a few years (Andreasen, 2003). This kind of perpetual RedQueen arms race injects influenza dynamics with autocorrelative dependencies over multiple seasons. It has been suggested that substantial antigenic drift might be associated with more severe, earlyonset influenza epidemics, resulting in increased mortality (Treanor, 2004). In contrast to antigenic drift, antigenic shift is an abrupt, major change in virus structure due to gene segment reassortment that occurs during simultaneous infection of a single host by multiple influenza subtypes (De Clercq, 2006). Antigenic shift results in new versions of viral surface proteins. Antigenic shift due to reassortment gives rise to novel influenza subtypes that, if capable of sustained humantohuman transmission, can have devastating consequences for human populations, $e.g.$ the 2009 H1N1 pandemic (Neumann et al., 2009).
In our analysis, we factor in the potential effect of antigenic variation by estimating the surface protein’s population diversity – hemagglutinin (HA), neuraminidase (NA), matrix protein 1 (M1), and matrix protien M2 – as a function of time and geographical sample collection location. Our rationale for our focus on these proteins is that HA, NA, and to some degree M1, are all present on the viral surface (Lamb et al., 1985), contribute to viral assembly, and mediate the release of membraneenveloped particles (Chlanda et al., 2015). M2 has been shown to have enhanced the pandemic 2009 influenza A virus [(H1N1)pdm09] (Friedman et al., 2017) HApseudovirus infectivity (AlvaradoFacundo et al., 2015).
Data sources
Social connectivity
Request a detailed protocolData was obtained from the General Social Surveys, NORC, See Table 1, [].
Clinical data source
Request a detailed protocolThe source of the clinical incidence data used in this study is the Truven Health Analytics MarketScan^{®} Commercial Claims and Encounters Database for the years 2003 to 2012 (Hansen, 2017). The database consists of approved commercial health insurance claims for between 17.5 and 45.2 million people annually, with linkage across years, for a total of approximately 150 million individuals. This US national database contains data contributed by over 150 insurance carriers and large, selfinsuring companies. We scanned 4.6 billion inpatient and outpatient service claims and identified almost six billion diagnosis codes. After unduplication, we identified approximately $12.8$ unique diagnostic codes per individual. We processed the Truven database to obtain the reported weekly number of influenza cases over a period of $471$ weeks spanning from January $2003$ to December $2013$, at the spatial resolution of US counties. To define influenza in insurance claims, we used the following set of ICD9 codes: 487.8, 488.12, 488.1, 488.0, 488.01, 488.02, 487.0, 487.1, 488.19, 488.09, 488, 487, and 488.11. We also considered including nonspecific codes representing unspecified viral infection (079.99), as suggested in (Viboud et al., 2014), and decided against it. (We provide explicit annotation for each code, as well as rationale for choosing narrower ICDcode definition of ILIs, in the Supplement Section SD).
A peak percentage of ILIaffected people of $20\%$ may seem high–especially given that it is a much greater percentage than the change in seropositivity over most influenza seasons. Note that our apparent estimates of influenza infection rate are unavoidably inflated, because what we are measuring is the week and countyspecific prevalence of influenzalike illnesses among insured patients who contacted their physician for any reason during the week in question. In other words, the denominator that we used for computing prevalence was almost certain to be much smaller than the overall population of the corresponding county.
Data on antigenic drift for influenza A
Request a detailed protocolSequence data for this computation was obtained from the National Institute of Allergy and Infectious Diseases (NIAID) Influenza Research Database (IRD) (Zhang et al., 2017) through their web site at http://www.fludb.org.
Data on vaccination coverage
Request a detailed protocolData on vaccinations was extracted from our EHR database corresponding to the procedural codes 90661 and Q2037, which corresponds to the dominant influenza vaccines. (http://flu.seqirus.com/files/billing_and_coding_guide.pdf)
Data on human air travel
Request a detailed protocolWe used a complete, directed graph of passenger air travel for 2010, accounting for the number of passengers transported in each direction (The United States Bureau of Transportation Statistics, 2010). For each county, we computed an air neighborhood network: For counties $i$ and $j$, the incoming edge to county $i$ represents the proportion of passengers, and ${p}_{ji}$ represents the ratio of all passengers who traveled from county $i$ to county $j$ by plane,mto the total number of travelers who left county $i$ by plane during the year, so that ${\sum}_{j\ne i}{p}_{ji}=1$.
Data on general human travel patterns in the US
Request a detailed protocolUsing the complete Twitter dataset, we aggregated a movement matrix to capture people’s weektoweek travels between counties from geolocated Twitter messages (captured during the period of January 1, 2011 through June 30, 2014). This dataset includes approximately $1.7\times {10}^{9}$ messages and represents $3\times {10}^{8}$ userdays of location information. A small, but significant, percentage of Twitter messages are automatically tagged with the author’s current latitude/longitude information, as determined by their mobile device. Each latitude/longitudeannotated tweet was mapped to a FIPS county code based on Tiger/Line shape files from the 2013 Census dataset (http://www.census.gov/geo/mapsdata/data/tigerline.html). In addition, we calculated a variant of our movement matrix to capture seasonality and other temporal dynamics: A set of 52 movement matrices captured weekly, countytocounty movements, observed in each week of the year, aggregating each week based on the corresponding observations, from each year from 2011 through 2014.
This dataset constitutes a largescale, longitudinal sample of individual movements. We found that the movement patterns are consistent with intuitive patterns and prior studies of largescale movements; the majority of people in a county remain in the same county weekoverweek. Most travel between counties occurs between neighboring counties, and between counties and large metropolitan areas, conditioned on distance and size of the metropolitan area. We represent our movement data as an $n\times n$ matrix $\mathbf{\mathbf{M}}$ that captures the likelihood that a person observed in county $i$ during the course of a week will be observed in county $j$ in the following week. We calculate the entries ${m}_{i,j}$ of matrix $\mathbf{\mathbf{M}}$ as follows:
For each time interval, we computed the mean movement vector from a given county by summing all countytocounty movement vectors, weighted by the proportion of people moving in each direction.
To investigate the role of proximity to major airports, we modeled influence diffusion as follows: Let ${x}_{i}$ be the ${i}^{th}$ county, and ${v}_{i}$ be the total contribution obtained by diffusing influences from the major airport bearing counties. Let ${x}_{k}^{\star}$ be the ${k}^{th}$ airportbearing county, and let $N$ be the total number of major airport locations considered, in our case, $N=27$. Let ${g}_{k}$ be the volume of traffic for the ${k}^{th}$ major airport location. We then compute ${v}_{i}$ as follows:
where $\mathrm{\Theta}(\cdot ,\cdot )$ is the distance in miles between two locations, we computed using the Haversine approximation. The value of the constant ${C}_{0}$ was chosen to be $0.1$. Small variations in the constant do not significantly alter our conclusions. As noted above, proximity to airports had a significant positive influence in sparking seasonal epidemics; the influence is significantly weaker once an outbreak is well under way.
Estimating antigenic diversity
Request a detailed protocolIn this study, we measured antigenic diversity as follows: Let ${S}_{i,x,t}$ be the set of amino acid sequences for the ${i}^{th}$ protein (one of HA, NA, M1, or M2), collected in year $t$ ($t$ ranging between 2003 and 2011), in state $x$ of the continental US. The temporal resolution of the sequence data is thus set to years instead of weeks, and the spatial resolution to states instead of counties. These resolutions are coarse compared to our EHR data on infection incidences, and is set in this manner to maintain sufficient statistical power. For each such set of amino acid sequences ${S}_{i,x,t}$, we compute the set $D({S}_{i,x,t})$ of pairwise edit distances:
where $\mathcal{L}({s}_{1},{s}_{2})$ is the standard edit distance (also known as the Levenshtein distance) between the sequences ${s}_{1},{s}_{2}$. Mathematically, the Levenshtein distance between two strings $a,b$ (of length $a,b$ respectively) is given by:
where ${1}_{({a}_{i}\ne {b}_{j})}$ is the indicator function equal to 0 when ${a}_{i}={b}_{j}$ and equal to one otherwise, and ${lev}_{a,b}(i,j)$ is the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$.
The antigenic diversity of the ${i}^{th}$ protein at time $t$ in state $x$ was then defined as the median of the distribution of the values in the set $D({S}_{i,x,t})$. Clearly, as the sequences became more diverse at a point in time and space due to molecular variations brought about by either drifts or shifts, the measure deviated more from zero. Use of the median provided robustness to outliers.
Importantly, here we did not compare temporal changes in viral antigenic makeup directly, but estimated the timespecific diversity in the viral protein primary structure. We expected our diversity measure to be representative of the cumulative changes that occurred within each of the nine influenza seasons.
Estimating vaccination coverage
Request a detailed protocolWe incorporated the effect of vaccination coverage by estimating the cumulative fraction of the population that received the current influenza vaccine within the previous 20 week period. This approach to estimating vaccination coverage does not correct for seasonspecific antigenic match, or the lack thereof. Nevertheless, because we explicitly included measures of antigenic diversity in addition to vaccination coverage, we expected that effects arising from the degree of antigenic match would indeed be factored in; if there is a significant mismatch, we expected the antigenic diversity in that year to be less and vice verse. This assumes implicitly that vaccination does indeed play a major role in exerting significant selection pressure, the assumption that was reinforced by our observation of a strong dependency between vaccination coverage and normalized antigenic diversity.
We found that antigenic diversity is quite strongly affected by vaccination coverage. This reflects the theoretical predictions in Boni et al. (Boni et al., 2006), where it is shown that the amount of observed antigenic drift increases as immunity in the host population increases and pressures the virus population to evolve.
Estimating the effect of returntoschool days
Request a detailed protocolThe absence of consensus, and the diversity of modeling assumptions pertaining to this effect (described above) makes it difficult to validate conclusions in large scale epidemiological data. We carried out a simple test to determine whether there is statistical evidence that afterholiday, returntoschool periods in AugustSeptember and in January predict or trigger the seasonal epidemic.
For this test, we assumed a broad window to cover all such school openings across the continental US including the last week in August, the entire month of September, two weeks in October, and two weeks in January. We next carried out a Fisher’s exact test to determine whether the overlap between these weeks and the identified ‘trigger period’ (See Table LABEL:SItabschool) for the seasonal epidemic are sufficiently nonrandom.
Weather data
Request a detailed protocolThe dataset starts with the week beginning December 31 st, 2002 and includes 522 weeks (which ends exactly on the week ending December 31 st, 2012). Temperature and precipitation data come from the 2.5 arcminute (approximately 4 km) PRISM (Oregon State University, 2014) dataset and other variables (wind speed, specific humidity, surface pressure, downward incident, and shortwave solar radiation) come from the 7.5 arcminute (approximately 12 km) Phase 2 North American Land Data Assimilation System (NLDAS2) dataset (Mitchell, 2004; Cosgrove et al., 2003). These datasets were selected in large part due to the fact that both are updated in near realtime, making it possible to use these datasets for future monitoring applications. PRISM is released daily, with an approximately 13 hr delay (data for the previous day is released at 1pm EST each day) while NLDAS is released daily, with an approximately 2.5 day delay).
Variables were aggregated to county boundaries based on shapefiles from the GADM database of Global Administrative Areas (Areas, 2014). Where appropriate, we considered both the average daily climate variable (for example, the daily maximum temperature averaged over the week) as well as the the maximum and/or minimum of the variable experienced over the week. For precipitation, we considered only the cumulative total precipitation experienced during the week.
Three approaches
Approach 1: Nonparametric Ganger analysis of countyspecific incidence
Request a detailed protocolAnalysis of quantized clinical data
For the causality analysis, we started with an integervalued time series for each US county, and to carry out the causality analysis, we first quantized the series in two steps:
1. Computing the difference series, $i.e.$, the weekly change in the number of reported cases.
2. Mapping positive changes to symbol ‘1’ and negative changes to symbol ‘0’.
This mapped each data series to a symbol stream over a binary alphabet. The binary quantization is not a restriction imposed by the inference algorithm; while we do require quantized magnitudes, longer data streams can be encoded with finer alphabets to accommodate an arbitrary precision. For this specific dataset, the relatively short length of the countyspecific timeseries necessitated a coarse quantization in order for the results to have a meaningful statistical significance.
Given a pair of such quantized streams ${s}_{a},{s}_{b}$, the algorithm described in the Supplement computes two coefficients of causality, one for each direction. Intuitively, the coefficient ${\gamma}_{b}^{a}$ from ${s}_{a}$ to ${s}_{b}$ is a nondimensional number between $0$ and $1$ that quantifies the amount of information that one may acquire about the second stream ${s}_{b}$ from observing the first stream (${s}_{a}$). More specifically, ${\gamma}_{b}^{a}$ is the average reduction in the uncertainty of the next predicted symbol in stream ${s}_{b}$ in bits, per bit, acquired from observed symbols in stream ${s}_{a}$. It can be shown that the coefficient of causality ${\gamma}_{b}^{a}$ is $0$ if and only if there is no causal influence from ${s}_{a}$ to ${s}_{b}$ in the sense of Granger, and assumes the maximum value $1$ if and only if ${s}_{b}$ is deterministically predictable from ${s}_{a}$. Moreover, ${\gamma}_{b}^{a}={\gamma}_{a}^{b}=0$ if and only if ${s}_{a}$ and ${s}_{b}$ are statistically independent processes (Chattopadhyay, 2014). It is trivial to produce examples where we would have ${\gamma}_{b}^{a}=0,{\gamma}_{a}^{b}>0$ illustrating the ability of the algorithm to capture the asymmetric flow of causal influence in one preferred direction and not the other.
Additionally, whenever we computed the causality coefficient, it was always associated with a time delay: We calculated the coefficient for predicting the target stream some specified number of steps in the future, and the computed coefficient was thus parameterized by this delay. In our analysis, we computed coefficients up to a maximum delay of $10$ weeks, and, for each pair of counties, selected the optimum delay which gave rise to the largest coefficient. For more details on the algorithm, see the Supplement.
Computation of causality fields and causality streamlines
To analyze the propagation dynamics of disease, we defined the notion of a Granger causal flow between two counties. Treating countyspecific changes in disease prevalence as a timestamped data stream, we quantified the directional strength of the causal flow between two counties as the degree of predictability of one stream’s future, given the history of the other (see Figure 3A–D). Our new algorithm was able to compute a coefficient of causality: an informationtheoretic measure of the information in bits that one stream communicates about another in a directionspecific manner. A strong coefficient of this type suggests that either the disease itself, or an underlying cause, may be propagating from the former county to the latter.
To explore the countrywide propagation dynamics, we stitched together the properlyaligned, adjacent betweencounty causal flow vectors across the whole US map into causality streamlines, representing the average of multiple spatial infection propagation waves over time. To compute causality streamlines, we needed a precise notion of county neighbors. We considered two counties to be neighbors if either they shared a common border, or if one county was reachable within a lineofsight distance of 50 miles from any point within another. The latter condition removed ambiguities as to whether counties touching at a point should be still considered as neighbors. The exact distance value (50 miles) does not significantly change our results, as long as it does not vary significantly. With the definition of the neighborhood map in place, we proceeded to compute the directionspecific coefficients of causality between neighboring counties. It follows that we would obtain a set of coefficients for each county and one for each of its neighbors, capturing the degree of causal influence from a given particular county to its respective neighbors. Our algorithm also computed the probability of each coefficient arising by chance alone, and we ignored coefficients that have more than a $6\%$ probability that two independent processes lacking any causal connection gave rise to the particular computed value of the coefficient. Once the coefficients had been calculated for each neighbor, we computed the resultant direction of causal influence outflow from that particular county. This was carried out by visualizing the causality coefficients as weights on the length of the vectors, from the centroid of the considered county to the centroids of its neighbors. We then calculated the resultant vector (see Figure 3). Viewed systematically across the continental US, these local vectors formed a discernible pattern; we observed the emergence of a nonrandom structure with clearly discernible paths outlining the ‘causality field’ (see Figure 3, Plates G, J, K). To interpret the plots, note that streamlines start at their thinnest part, their direction is indicted with thickening line; typically multiple streamlines coalesce into a riverlike pattern.
Inferring statistical ‘Grangercausality’ from data
Granger attempted to obtain a precise definition of causal influence (Granger, 1980) from the following intuitive notion: $Y$ is a cause of $X$, if it has unique information that alters the probabilistic estimate of the immediate future of $X$.
Here, we used a new, nonparametric approach to Granger causal inference (Chattopadhyay, 2014) (Approach 1). In contrast to stateoftheart binary tests (Baek and Brock, 1992; Hiemstra and Jones, 1994), we computed the degree of causal dependence between quantized data streams from stationary ergodic sources in a nonparametric, nonlinear setting.
Our approach was significantly more general to common, regressionbased implementations of Granger causal inference, and did not involve any autoregressive moving average (ARMA) modeling. All such commonly used techniques impose an a priori linear structure on the data, which then constrains the class of dependency structures we can hope to distill.
True causality, in a Humean sense, cannot be inferred (Hume, 1993; Kant, 1998). Among other reported approaches to ascertaining causal dependency relationships, the work of J. Pearl (Pearl, 2009a) is perhaps most visible, and builds on the paradigm of structural causal models (SCM) (Pearl, 2009b). While Pearl’s work is often claimed to be able to answer causal queries regarding the effects of potential interventions, as well as regarding counterfactuals, our objective in this paper is somewhat different. We are interested in delineating whether infection transmission pathways can be distilled from the patterns of infection’s spatiotemporal incidence.
Approach 2: Poisson regression on putative factors
Request a detailed protocolIn Approach 2, we investigated the relative importance of putative factors as follows: Specifically, let ${N}_{ijk}$ denote the total number of patients in county $i$, who are of age $j$, and gender $k$. Denoting the number of individuals diagnosed with influenza in a given county during given week as ${y}_{ijk}$, we modeled the withincounty disease incidence counts for every county (for which data was available) in the US using the following mixedeffect Poisson regression model:
where, $\alpha $ is the intercept, $\mathbf{\mathbf{X}}$ and $\mathbf{\mathbf{b}}$ are a fixedeffect design matrix and a vector of fixed effects, respectively, ${\mathbf{\mathbf{s}}}_{i}$ is $1\times m$ vector of changes in rate of infection in the ${i}^{th}$ county $1,2,..,m$ weeks prior to the current, ${\mathbf{\mathbf{b}}}_{1,i}$ is a $m\times 1$ vector of autocorrelation fixed effects, ${\mathbf{\mathbf{n}}}_{i}$ is a $1\times (3m)$ vector of changes in the rate of infection in the neighbors of the ${i}^{th}$ county $1,2,..,m$ weeks prior to the current (the neighbors are subdivided into land neighbors, Twitter neighbors, and air neighbors), ${\mathbf{\mathbf{b}}}_{2,i}$ is a $(3m)\times 1$ vector of countyneighborhoods fixed effects, and $\mathbf{\mathbf{v}}$ and $\mathbf{\mathbf{Z}}$ are random effects and their design matrix, respectively. Variable $m$ represents the depth of ‘memory’ of autoregression in weeks, in our case $m=4$.
In this way, the total number of rows in matrix $\mathbf{\mathbf{X}}$ is $510\times 3,143$ (weeks $\times $ counties), with countyspecific socioeconomic covariates and weekspecific weather covariates. For disease initiation analysis, we included only a subset of time series covering approximately $50$ weeks.
We did not use an explicit spatial smoothing. The countylevel random effects were used to account for possible heterogeneity in disease reporting across counties. However, as predictors, we used average infection rate values of immediate neighbors for the county (one, two, three, and four weeks in the past). This prediction structure afforded an implicit spatial smoothing, dampening spurious fluctuations in infection rates.
Outofsample prediction and ROC analysis with mixedeffect poisson regression
We carried our outofsample prediction with the models inferred with mixedeffect regression. The steps were as follows:
We trained the model parameters with data from the trigger periods corresponding to the first six seasons.
Once we identified the coefficient’s variables, we used it to predict the response variable (influenza incidence) for the last three seasons.
As expected, the predicted incidence does not exactly match the observed outofsample data. Nevertheless, we see positive correlation (Figure 5, Plate A). Because we were modeling a necessarily spatiotemporal stochastic process, the predictive ability of the model is difficult to judge simply from the observed positive correlation. To resolve this, we investigated the performance of our model by computing how well it predicted the counties that experience flareups during the triggerperiods in the outofsample data. This exercise is reduced to a classification problem, by first choosing a threshold on the number of reported cases per week to define what is meant by a ‘flareup’ (see description below). To quantify the prediction performance, we constructed ROC curves for each of the three target seasons, for each fixed week, as follows:
We first quantized the incidence data to reduce it to a binary stream. In particular, we chose a threshold (ten), such that for each county, and each week, we reported a ‘1’ if the number of reported cases was greater than the threshold, and ‘0’ otherwise. Note this quantization is different to what we used in carrying out our nonparametric Granger analysis in Approach 1.
For each county $i$, and week $j$, we then ended up with the binary class variable ${X}_{j,i}\in \{0,1\}$ and a decision variable ${Y}_{j,i}\in \mathbb{R}$, where the former is the quantized incidence described above, and ${Y}_{j,i}$ is the response predicted by the model, normalized between $0$ and $1$.
For a chosen decision threshold ${\theta}_{D}\in \mathbb{R}$, we could determine the predicted class ${\widehat{X}}_{j,i}\in \{0,1\}$ as:$\begin{array}{r}{\displaystyle {\hat{X}}_{j,i}=\{\begin{array}{cc}1,& \text{if}{Y}_{j,i}\geqq {\theta}_{D}\\ 0& \text{otherwise}\end{array}}\end{array}$
Comparing the observed and predicted classes, we computed the false positive rate (FPR: defined as the ratio of false positives to the sum of false positives and true negatives), and the true positive rate (TPR: defined as the ratio of true positives to the sum of true positives and false negatives). Finally, we constructed the ROC curve, which shows the relationship of the TPR and the FPR as the decisionthreshold ${\theta}_{D}$ is varied.
We constructed Receiver Operating Characteristic (ROC) curves for each week in the outofsample period, and estimated the area under curve (AUC). The AUC measures the performance of the predictor (our model) to correctly classify the counties that would go on to have a disease incidence greater than the initial set threshold (ten cases in our analysis). In the perfect case, we would have an AUC of $100\%$, which implies that we can achieve zero false positives, while getting a $100\%$ true positive rate. Our best model achieves approximately $80\%$ AUC for the trigger weeks, as shown in Figure 5.
Variable and model selection
Request a detailed protocolApproaches 2 and 3 have different limitations; these affect which approach is most effective at which statistical detection task.
Approach 2 (mixedeffect regression analysis) can use a whole dataset for model selection, that can span individual predictors, their combinations, and even interactions between factors (see Table 5; Figure 1—figure supplement 2; Figure 1—figure supplement 3; Figure 1—figure supplement 4; Source data 2; Source data 3; Supplementray file 3; Source code 1). Model selection allows us to choose a model with the right balance of complexity and explanatory power, thus enabling this analysis to detect collinear explanatory variables and drop the weaker predictors that increase model complexity but do not add explanatory power.
For example, the best model in our mixedeffect regression model series, with deviance information criterion (DIC) $185,926.6$ (see Table 5, the last model), includes a term d_max_HUS_min_3 * d_t_avg_min_3 that denotes an interaction between the weekly change in the maximum specific humidity and the weekly change in the average temperature, both of which are weather parameter changes recorded three weeks before the current week. Because it does not split datasets, Approach two is more powerful at detecting interactions between explanatory variables and rather complicated models. However, this approach is more susceptible to bias (picking noncausal correlations that are inherent in the raw data), if the data are unbalanced, unlike with Approach 3.
We added and removed random effects to the model structure, taking out and adding fixed effects, executing the regression algorithm, and plotting the DIC achieved against model complexity. Here, model complexity is simply proxied by the descriptional complexity in terms of how large the model was (the number of factors in the regression equation). We stopped when the drop in DIC stabilized (See Figure 1—figure supplement 3 ). Notably, the DIC has been shown to select overfits (Ando, 2011; Plummer, 2008; van der Linde, 2012), and does not properly account for model complexity in practice. Our driving idea was the identification of the Pareto front that trades off accuracy (in terms of DIC) with model complexity in a transparent manner. This is, of course, a ‘greedy approach,’ in the sense that we do not guarantee that we have indeed found the ‘best’ possible model. However, because our crossvalidation (Figure 5) yielded good results, we deemed the stopping rule to be satisfactory.
Approach 3: Countymatching effect analysis
Request a detailed protocolWe designed Approach three specifically to balance biased observational data. While Approach three does not allow for explicit model selection, it is good at detecting simpler combinations of predictors putatively ‘causally’ affecting the outcome variable. However, because this approach involves splitting data into smaller and smaller subsets, as the complexity of predictor function (‘treatment’) grows, it quickly runs out of statistical power.
These two approaches’ properties explain the perceived misalignment of results from the two streams of analysis. For an intuitive understanding of matching approaches, consider specific humidity’s effect on influenza prevalence. The countymatching method’s goal is to deduce associations putatively interpreted as causality relations. For example, consider testing the question of whether counties with higherthanaverage mean maximum specific humidity do, indeed, have higher influenza prevalence, in a statistically significant sense, when compared to counties that do not, provided all other factors are held constant.
Let $\mathbb{Y}$ denote the set of all US counties, and let $\mathbb{S}$ be the set of all factors we find to be significant in our mixedeffect regression analysis. Now, for any subset of factors $K\u2ac5\mathbb{S}$, we denote the complement set as $\overline{K}=\mathbb{S}\setminus K$. Additionally, we define the Boolean function $\mathcal{T}:\mathbb{S}\times \mathbb{Y}\to \{true,false\}$, (the treatment function) where for some factors $s\in \mathbb{S}$, and some counties $y\in \mathbb{Y}$, the Boolean value $\mathcal{T}(s,y)$ is true if the signal or treatment corresponding to the factor $s$ is present in county $y$. We used the sign for the coefficients we obtained in our best mixedeffect regression model to determine what counts as a positive signal. For example, because maximum specific humidity (denoted by the variable $max\mathrm{\_}hus\mathrm{\_}avg$) enters with a positive coefficient, if county $y$ experiences a higherthanaverage maximum specific humidity, then $\mathcal{T}(max\mathrm{\_}hus\mathrm{\_}avg,y)$ is true.
To simplify the notation, we used ${\mathcal{T}}_{y}^{s}$ to denote a true signal, and $\mathrm{\neg}{\mathcal{T}}_{y}^{s}$ to denote a false one.
Finally, for any $K\u2ac5\mathbb{S}$, we define the following three sets, which we refer to as the $W$sets:
Clearly, ${W}_{treated}^{K}$ is the treated set, that is, the set of counties which exhibit the signal encoded by the set $K$ is then the matched control set of counties, which lack the signal, but each county in this set has a matching counterpart in ${W}_{treated}^{K}$.
The $W$sets allow us to set up a $2\times 2$ contingency table for any chosen subset of factors $K\u2ac5\mathbb{S}$ as described before. Specifically, we split the sets ${W}_{treated}^{K}$ and ${W}_{matchedcontrol}^{K}$ into two subsets each, representing those counties which experience a spike in influenza prevalence and which do not. The $2\times 2$ contingency table is then subjected to Fisher’s exact test.
Appendix 1
Nonparametric Granger Causal Inference
Standard Implementations of Granger Causal Inference
Suppose that one is interested in the possibility that a vector series ${Y}_{t}$ causes another vector ${X}_{t}$. Let ${J}_{n}$ be an information set available at time $n$, consisting of terms of the vector series ${Z}_{t}$, that is,
${J}_{n}$ is said to be a proper information set with respect to ${X}_{t}$, if ${X}_{t}$ is included within ${Z}_{t}$. Further, suppose that ${Z}_{t}$ does not include any component of ${Y}_{t}$, and define
Inferring causality in the mean is easier, and if one is satisfied with using minimum mean square prediction error as the criterion to evaluate incremental predictive power, then one may use linear, onestepahead least squares predictors to obtain an operational procedure: If $VAR(X{J}_{n})$ is the variance of onestep forecast error of ${X}_{n+1}$ given ${J}_{n}$, then $Y$ is a prima facie cause of $X$ with respect to ${J}_{n}^{\prime}$ if:
Testing for bivariate Granger causality in the mean involves estimating a linear, reducedform vector autoregression:
where $A(L)$, $B(L)$, $C(L)$, and $D(L)$ are onesided lag polynomials in the lag operator $L$ with roots alldistinct, and outside the unit circle. The regression errors ${U}_{X,t},{V}_{Y,t}$ are assumed to be mutually independent and individually i.i.d. with zero mean and constant variance. We used a standard joint test ($F$ or ${\chi}^{2}$test) to determine whether lagged $Y$ has significant linear predictive power for current $X$.
The null hypothesis that $Y$ does not strictly Granger cause $X$ is rejected if the coefficients of the elements in $B(L)$ are jointly significantly different from zero.
Linear tests presuppose restrictive and often unrealistic (Darnell and Evans, 1990; Epstein, 1987) structure on data. Brock (Brock, 1991) presents a simple bivariate model to analytically demonstrate the limitations of linear tests in uncovering nonlinear influence. To address this issue, a number of nonlinear tests have been suggested, $e.g.$, with generalized, autoregressive, conditional heteroskedasticity (GARCH) models (Asimakopoulos et al., 2000), using wavelet transforms (Papadimitriou et al., 2003), or heuristic, additive relationships (Chu et al., 2005). However, these approaches often assume a class of allowable nonlinearities; thus not quite alleviating the problem of presupposed structure. This is not just an academic issue; Granger causality has been shown to be significantly sensitive to nonlinear transformations (Roberts and Nord, 1985).
Nonparametric approaches, $e.g.$ the HiemstraJones (HJ) test (Hiemstra and Jones, 1994) on the other hand, attempt to completely dispense with presuppositions regarding causality structure. Given two series, ${X}_{t}$ and ${Y}_{t}$, the HJ test (which is a modification of the BaekBrock test [Baek and Brock, 1992]) uses correlation integrals to test if the probability of similar futures for ${X}_{t}$ given similar pasts, change significantly if we condition instead on similar pasts for both ${X}_{t}$ and ${Y}_{t}$ simultaneously.
Nevertheless, in order to achieve consistent estimation of the correlation integrals, the data series are required to be ergodic, stationary, and absolutely regular $i.e.\beta $mixing, with an upper bound on the rate at which the $\beta $coefficients approach zero (Denker and Keller, 1983). The additional assumptions beyond ergodicity and stationarity serve to guarantee that sufficiently separated fragments of the data series are nearly independent. The HJ test and its variants (Diks and Panchenko, 2006; Seth and Principe, 2010) have been quite successful in econometrics; uncovering nonlinear causal relations between money and income (Baek and Brock, 1992), aggregate stock returns and macroeconomic factors (Hiemstra and Kramer, 1995), currency future returns (Asimakopoulos et al., 2000) and stock price and trading volume (Hiemstra and Jones, 1994). Surprisingly, despite clear evidence that linear tests typically have low power in uncovering nonlinear causation (Hiemstra and Jones, 1994; Asimakopoulos et al., 2000), the application of nonparametric tests has been limited in areas beyond financial and macroeconomic interests.
1.1 Our Approach: Nonparametric Approach Based on Probabilistic Automata
We use a new, nonparametric Granger causalitybased test for quantized processes (Chattopadhyay, 2014).
Going beyond binary hypothesis testing, we quantify the notion of the degree of causal influence between observed data streams, without presupposing any particular model structure. Generative models of causal influence are inferred with no a priori, imposed dynamical structure beyond ergodicity, stationarity, and a form of weak dependence. The explicit generative models may be used for prediction. The proposed inference algorithms are PACefficient (Chattopadhyay, 2014), $i.e$, we are guaranteed to find good models with high probability, and with small sample complexity.
Modeling Framework: Probabilistic Automata and Crossed Probabilistic Automata
In this section, we briefly describe the notion of probabilistic automata, the inference problem, the notion of causal states, and the decision fusion strategy for multiple models.
Self Models
The event catalogue’s quantized data streams may be viewed as sample paths from hidden, quantized stochastic processes that drive the observed dynamics. A selfmodel for such a stream is a generative model that captures statistically significant symbol patterns that causally determine (in a probabilistic sense) future symbols. A good modeling framework for such selfmodels are probabilistic finite state automata (PFSA) (Chattopadhyay and Lipson, 2013; Chattopadhyay, 2014).
One such self model, shown in Figure 1 (Plate A(i)), is an example of a simple PFSA with two states generated from a binary symbol sequence. Note that, for this particular example, a sequence $11$ localizes or synchronizes the machine to state ${q}_{2}$, which then implies that the next symbol is a $0$ with a probability of $0.1$ and a $1$ with a probability of $0.9$. Thus, given this stream’s selfmodel, along with the short history $11$, we can predict the symbol distribution in the next time step. Notably, the states in a PFSA are not the alphabet symbols themselves, but are equivalence classes of symbol sequences (histories) that lead to statistically equivalent futures. This particular example has two states–not because of the fact that the alphabet is binary, but because there are only two distinct distributions that dictate the next symbol from any point in the dynamical evolution. The symbol value is distributed as either $[\mathrm{0.1\hspace{0.25em}0.9}]$ or $[\mathrm{0.3\hspace{0.25em}0.7}]$ over the binary event alphabet, and the current context or ‘state’ dictates which distribution is in effect chosen. These contexts are simply historical sequence equivalence classes that could have potentially lead to the present state; the specific history belonging to whichever current equivalence class actually transpired makes no difference in the future. Hence, these classes are dynamical states for the discrete stochastic evolution of the system at hand. The transition structure of the PFSA (represented in the graph as the labeled edges with probabilities) specifies how we move between these classes as new symbols are generated, or as new data is acquired.
The inference problem here is to determine both the number of states and the transition structure given a sufficiently long, quantized data stream. In our unsupervised approach, there are no restrictions on admissible graphs for the inferred PFSA, and no a priori constraints on the number of states that may appear (Chattopadhyay and Lipson, 2013).
Thus, the number of states inferred by our algorithm is a direct measure of the statistical complexity of the underlying process (Crutchfield, 1994). An inferred singlestate, self model implies that the stream is a sequence of independently generated symbols (white noise or a Bernoulli process), and is therefore uninformative.
The inference algorithm identifies the number of causal states by searching for distinct ‘contexts’–sequences which, once transpired, lead to a distinct probability distribution of the next symbol. The computation proceeds as follows (Chattopadhyay and Lipson, 2013):
Let the set of all possible sequences up to a length $L$ be denoted as ${S}_{L}$.
Compute the probability of a future symbol (at a specified time shift) being ${\sigma}_{0}$ or ${\sigma}_{1}$ after a specific string $\omega $ from the set ${S}_{L}$ is encountered; call this distribution ${\varphi}_{\omega}$ for the string $\omega $.
Call the set of probability distributions obtained by this method as $\mathrm{\Phi}$.
Find clusters in the set $\mathrm{\Phi}$, such that individual clusters are separated by some prespecified distance $\u03f5>0$. These clusters represent the causal states, as they are classes of histories (sequences) that lead to identical, immediate future.
Suppose string $\omega $ is in cluster ${q}_{i}$, and sequence $\omega {\sigma}_{0}$ is in cluster ${q}_{j}$; it then follows that, in the inferred PFSA, there is a transition labeled ${\sigma}_{0}$ from the corresponding state ${q}_{i}$ to state ${q}_{j}$. Carrying out this procedure for each symbol of the alphabet for each inferred cluster or state identifies the complete transition structure of the model.
Once the transition structure is identified, we choose an arbitrary initial state and step through the model as dictated by the input data stream. We count the number of times each edge or transition is traversed, and, by normalizing the count, we arrive at an estimate of the edge probabilities.
This completes the PFSA inference.
Formal Definition of PFSAs and Future Prediction
Mathematically, a PFSA is a $4$tuple $G=(Q,\mathrm{\Sigma},\delta ,\stackrel{~}{\pi})$, where $Q$ is the set of states, $\mathrm{\Sigma}$ is the symbol alphabet, $\delta :Q\times \mathrm{\Sigma}\to Q$ is the transition function specifying the graph structure such that for any state $q\in Q$, $\delta (q,\sigma )\in Q$ is the end state of the transition from state $q$ via symbol $\sigma $, and $\stackrel{~}{\pi}:\mathrm{\Sigma}\times Q\to [0,1]$ is the symbol probability function such that for any state $q\in Q$, $\stackrel{~}{\pi}(q,\sigma )\in [0,1]$ is the probability of generating the symbol $\sigma \in \mathrm{\Sigma}$ from the state $q$, with the constraint:
We specify a time shift $\mathrm{\Delta}$ when we infer a PFSA $G$ from a given input stream; this time shift is the delay with which the inferred model makes predictions. Specifically, each transition step in $G$ is translated to $\mathrm{\Delta}$ steps in the quantized stream. Thus, if we know that the current state in the model is ${q}_{0}$, then the predicted symbol $\mathrm{\Delta}$ steps for the future are $\sigma $ with probability $\stackrel{~}{\pi}({q}_{0},\sigma )$.
Note that, if we have two models ${G}_{1}=({Q}_{1},\mathrm{\Sigma},{\delta}_{1},{\stackrel{~}{\pi}}_{1})$ and ${G}_{2}=({Q}_{2},\mathrm{\Sigma},{\delta}_{2},{\stackrel{~}{\pi}}_{2})$ with time shifts ${\mathrm{\Delta}}_{1}$ and ${\mathrm{\Delta}}_{2}$ respectively, it follows that with each step the models make predictions at different points in time in the future.
Note that to make a prediction using a PFSA, we must know its current state. This is nontrivial in general due to the synchronization problem (Chattopadhyay and Lipson, 2013; Chattopadhyay, 2014). We find an approximation of the current state as follows:
Compute the he PFSA states’ stationary distribution using standard Markov chain tools.
Treat the stationary distribution row vector $\mathrm{\wp}$ as the initial state distribution. The rationale here is that if we assume that the model has been operating for a sufficient length of time, the current distribution must be very close to the stationary distribution in the expected sense.
Compute the square matrices, ${\mathrm{\Gamma}}_{\sigma}\in {\mathbb{R}}^{Q\times Q}$, for each symbol $\sigma \in \mathrm{\Sigma}$, such that the $i{j}^{th}$ element of ${\mathrm{\Gamma}}_{\sigma}$, denoted as ${{\mathrm{\Gamma}}_{\sigma}}_{ij}$, which signifies the probability of going from state ${q}_{i}$ to state ${q}_{j}$ via the symbol $\sigma $.
Use a relatively short history $h={\sigma}_{1}\mathrm{\cdots}{\sigma}_{\mathrm{\ell}}$ of past symbols (before the current time) to update the state distribution as follows:
where ${h}_{k}$ is the ${k}^{th}$ entry of the short history $h$, and ${\mathrm{\wp}}^{[k]}$ is the state distribution at step $k$. It follows from ergodicity that the updated distribution ${\mathrm{\wp}}^{\mathrm{\ell}}$ is an estimate of the current state distrbution.
The predicted distribution over the symbol alphabet $\mathrm{\Delta}$ steps in the future (if the time shift for the model is $\mathrm{\Delta}$), is then simply:
where ${\mathrm{\wp}}_{q}^{\mathrm{\ell}}$ is the current probability for state $q$, $i.e.$, the entry corresponding to state $q$ in ${\mathrm{\wp}}^{\mathrm{\ell}}$.
Cross Models
Unlike a self model, which attempts to predict future symbols in the same stream, a cross model attempts to predict symbols in a target stream after reading a short history in the source stream. Plate B(i) in Figure 1 illustrates a simple cross model with two states. Note that the model syntax is different from a PFSA; in particular, there are no probabilities on the arcs, though each state has a specified output distribution. Thus, if we see sequence $101101$ in data stream ${s}_{B}$ (See Figure 1, Plate B), then we are in state ${q}_{2}$ in the crossmodel, and hence the next symbol in the data stream ${s}_{A}$ can be predicted to be $0$ with a probability of $0.7$ and $1$ with a probability of $0.3$. We call such models crossed Probabilistic Finite State Automata (XPFSA) (Chattopadhyay, 2014). A cross model state or an XPFSA has a slightly different interpretation from PFSA; while PFSA states are equivalence classes of histories that lead to identical futures in the same stream, the XPFSA are equivalence classes of histories in the source stream that lead to identical futures in the target stream (the future evolution in the target stream does not matter).
XPFSA inference ˜ (Chattopadhyay, 2014) is similar in principle to PFSA inference. Here, we have two input data streams, the source stream ${s}_{A}$ over the source alphabet ${\mathrm{\Sigma}}_{A}$, and the target stream ${s}_{B}$ over the source alphabet ${\mathrm{\Sigma}}_{B}$. The broad steps are as follows:
Let the set of all possible sequences up to a length $L$ in source stream ${s}_{A}$ be denoted as ${S}_{L}^{A}$.
Compute the probability of a future symbol (at a specified time shift) being $\sigma \in {\mathrm{\Sigma}}_{B}$ in the target stream ${s}_{B}$ after a specific string $\omega $ from the set ${S}_{L}^{A}$ is encountered in the source stream; call this distribution ${\varphi}_{\omega}^{{s}_{A},{s}_{B}}$ for the string $\omega $.
Call the set of probability distributions obtained in this manner as ${\mathrm{\Phi}}^{{s}_{A},{s}_{B}}$.
Find clusters in the set ${\mathrm{\Phi}}^{{s}_{A},{s}_{B}}$, such that the clusters are separated by some prespecified distance $\u03f5>0$. These clusters now represent causal states of the crossdependence between the stream (from ${s}_{A}$ to ${s}_{B}$), because they are classes of histories (sequences) in the source stream that leads to identical, immediate futures in the target stream.
Suppose string $\omega $ is in a cluster corresponding to state ${q}_{i}$, and sequence $\omega {\sigma}_{0}$ is in a cluster corresponding to ${q}_{j}$; it then follows that, in the inferred XPFSA, there is a transition labeled ${\sigma}_{0}$ from the corresponding state ${q}_{i}$ to state ${q}_{j}$. Carrying out this procedure for each symbol of the source alphabet for each inferred cluster or state identifies the complete transition structure of the cross model. Note that the crossmodel transitions are only labelled with symbols from the source alphabet.
Once we have identified the transition structure, we choose an arbitrary initial state in the XPFSA and step through the model as dictated by the source data stream. Each time we reach a particular state in the crossmodel, we note which symbol from target alphabet ${\mathrm{\Sigma}}_{B}$ transpires in the target stream (at the specified time shift). Once we reach the end of the input source stream, we normalize the symbol count vectors corresponding to the crossmodel states, and this determines the output distribution at each state. Note that each output distribution is a probability distribution over the target alphabet. This completes the XPFSA inference.
An XPFSA may be thought of as a direct generalization of a PFSA, and any PFSA can be represented as an XPFSA. To illustrate this, note that predicting future symbols in the same stream can be thought of as predicting symbols is a second, but identical, copy of the first stream.
If all histories in a source stream are equivalent in this sense, then we have a single state XPFSA, which implies that the source cannot provide any new information on what is going to happen in the target stream based on its own history. Hence, that source lacks any causal influence on the target. Thus, as before, the complexity of the inferred models is directly related to the statistical complexity (Crutchfield, 1994) of the learnable, dynamical relationships underlying data streams themselves.
XPFSAs are asymmetric in general; the model for how a source influences a target does not need to be identical when the roles played by the streams are reversed. However, two streams are statistically independent if and only if the XPFSAs in both directions are single state machines (Chattopadhyay, 2014).
This leads us to the notion of the causality coefficient, ${\gamma}_{B}^{A}$, from data stream ${s}_{A}$ to data stream ${s}_{B}$. True to Granger’s notion of statistical causality, ${\gamma}_{B}^{A}$ rigorously quantifies the amount of additional information we can obtain about the immediate future in stream ${s}_{B}$ through observing stream ${s}_{A}$.
Explicitly, the coefficient is defined as the ratio of the expected change in the entropy of the nextsymbol distribution in stream ${s}_{B}$ conditioned over observations in the stream ${s}_{A}$ to the entropy of the nextsymbol distribution in stream ${s}_{B}$, conditioned on the fact that no observations are made on stream ${s}_{A}$. We show that causality coefficient ${\gamma}_{B}^{A}$ takes values on the closed unit interval and that higher values indicate a stronger predictability of ${s}_{B}$ from ${s}_{A}$, $i.e.$, therefore, a higher degree of causal influence. We have $\gamma =0$ if and only if the inferred machine has a single state, and streams ${s}_{A},{s}_{B}$ are statistically independent if and only if ${\gamma}_{B}^{A}={\gamma}_{A}^{B}=0$.
Thus, the interpretation of the causality coefficient which is central to our development is as follows:
${\gamma}_{B}^{A}=0.3$ (say) from ${s}_{A}$ to ${s}_{B}$ means that we can acquire, on average, $\mathrm{0.3}\mathit{}b\mathit{}i\mathit{}t\mathit{}s$ of additional information about ${s}_{B}$ from each bit read from ${s}_{A}$, over what is already available from the past history of ${s}_{B}$.
This is very different from computing correlations, and it assumes no model structure a priori for the hidden dynamics driving the data streams.
Importantly, it turns out that inferring self models is simply a special case of cross model inference, in which the target stream is simply a timeshifted version of the first stream.
Formal Definition of XPFSAs and Future Prediction
Formally, an XPFSA is also a $4$tuple ${G}_{A\to B}=(Q,{\mathrm{\Sigma}}_{A},\mathrm{\Delta},\stackrel{~}{\pi})$, where $Q$ is a set of states, ${\mathrm{\Sigma}}_{A}$ is the input alphabet. The transition function $\delta :Q\times {\mathrm{\Sigma}}_{A}\to Q$ is defined as before (in the case of the PFSA), but the symbol probability function is defined as $\stackrel{~}{\pi}:Q\times {\mathrm{\Sigma}}_{B}\to [0,1]$, with
where ${\mathrm{\Sigma}}_{B}$ is the target alphabet, which is distinct from ${\mathrm{\Sigma}}_{A}$ in general. Note that the symbol distribution specified by $\stackrel{~}{\pi}$ is over the target alphabet ${\mathrm{\Sigma}}_{B}$.
As before, there is a time shift $\mathrm{\Delta}$ associated with inferred XPFSA, such that each symbol transition in the model maps to $\mathrm{\Delta}$ steps in the target stream. If we know the current state of the XPFSA to be ${q}_{0}$, and we observe symbol $\sigma \in {\mathrm{\Sigma}}_{A}$ in the source stream, then the predicted symbol in the target stream ($\mathrm{\Delta}$ steps in the future) is ${\sigma}^{\prime}\in {\mathrm{\Sigma}}_{B}$, with a probability of $\stackrel{~}{\pi}(\delta ({q}_{0},\sigma ),{\sigma}^{\prime})$.
As before, determining the current state in the XPFSA is nontrivial. We estimate the current state distribution in a manner similar to that used for self models:
Note that the XPFSA graph does not have probabilities on it edges. However, the corresponding source stream’s self model does have symbol probabilities on its own edges, and it models the source stream. The problem is that the graph for the self model, and that of the cross model, might not be identical. We solve this problem using projective composition of probabilistic automata (Chattopadhyay and Ray, 2008). This operation takes the self model for the source stream ${G}_{A}$, and projects it on the inferred XPFSA graph ${G}_{A\to B}$, and the result is a PFSA ${G}_{A}\overrightarrow{\otimes}{G}_{A\to B}$ with the same structure as that of ${G}_{A\to B}$.
Once we obtain ${G}_{A}\overrightarrow{\otimes}{G}_{A\to B}$, we can estimate its current state as described in the case for self models (using observed symbols in the source stream).
Let the current state distribution be ${\overline{\mathrm{\wp}}}^{\mathrm{\ell}}$. Then, the predicted future symbol (at time shift $\mathrm{\Delta}$) in the target stream is ${\sigma}^{\prime}\in {\mathrm{\Sigma}}_{B}$ with a probability of:
where ${G}_{A}\overrightarrow{\otimes}{G}_{A\to B}$ (note we are using the $\stackrel{~}{\pi}$ from the XPFSA, and not from the projected PFSA).
Computation of the Coefficient of Causality
Let ${\mathscr{H}}_{A},{\mathscr{H}}_{B}$ be stationary ergodic processes over finite alphabets ${\mathrm{\Sigma}}_{A},{\mathrm{\Sigma}}_{B}$ respectively. Then, the causal dependence coefficient of ${\mathscr{H}}_{B}$ on ${\mathscr{H}}_{A}$, denoted as ${\gamma}_{B}^{A}$, is formally defined as the ratio of the expected change in the entropy of the next symbol distribution in ${\mathscr{H}}_{B}$. This is due to observations in ${\mathscr{H}}_{A}$, which show the entropy of the next symbol distribution in ${\mathscr{H}}_{B}$. In the absence of data in ${\mathscr{H}}_{A}$, $i.e.$, we have (Chattopadhyay, 2014):
where the entropy $\mathit{\bm{h}}\left(u\right)$ of discrete probability distribution $u$ is given by ${\sum}_{i}{u}_{i}{\mathrm{log}}_{2}{u}_{i}$, $\lambda $ is the empty string, and $\mathbf{\mathbf{E}}(\cdot )$ is the expectation operator over all possible sequences in the source stream. Once we infer the cross model, computing ${\gamma}_{B}^{A}$ is straightforward.
Choice of Quantization Schemes
The weekly time series of countyspecific records for the number of reported influenza cases is quantized to a binary stream. If ${y}_{k}$ is the number of reported cases for a given fixed county at the week index $k$, then the quantized binary stream ${\overline{y}}_{k}$ is obtained as follows:
References

Predictive bayesian model selectionAmerican Journal of Mathematical and Management Sciences 31:13–38.https://doi.org/10.1080/01966324.2011.10737798

Dynamics of annual influenza A epidemics with immunoselectionJournal of Mathematical Biology 46:504–536.https://doi.org/10.1007/s0028500201862

Nonlinear Granger causality in the currency futures returnsEconomics Letters 68:25–30.https://doi.org/10.1016/S01651765(00)002196

A General Test for Nonlinear Granger Causality: Bivariate ModelA General Test for Nonlinear Granger Causality: Bivariate Model.

Absolute humidity, temperature, and influenza mortality: 30 years of countylevel evidence from the United StatesAmerican Journal of Epidemiology 176:S114–S122.https://doi.org/10.1093/aje/kws259

Influenza drift and epidemic size: the race between generating and escaping immunityTheoretical Population Biology 65:179–191.https://doi.org/10.1016/j.tpb.2003.10.002

Epidemic dynamics and antigenic evolution in a single season of influenza AProceedings of the Royal Society B: Biological Sciences 273:1307–1316.https://doi.org/10.1098/rspb.2006.3466

Beyond Belief: Randomness, Prediction, and Explanation in Science230–279, Causality, chaos, explanation and prediction in economics and finance, Beyond Belief: Randomness, Prediction, and Explanation in Science.

Prevention and control of seasonal influenza with vaccines: recommendations of the Advisory Committee on Immunization Practices (ACIP), 2009MMWR. Recommendations and Reports : Morbidity and Mortality Weekly Report. Recommendations and Reports 58:1–52.

Human mobility and the spatial transmission of influenza in the United StatesPLoS Computational Biology 13:e1005382.https://doi.org/10.1371/journal.pcbi.1005382

Abductive learning of quantized stochastic processes with probabilistic finite automataPhilosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 371:20110543.https://doi.org/10.1098/rsta.2011.0543

Structural transformations of probabilistic finite state machinesInternational Journal of Control 81:820–835.https://doi.org/10.1080/00207170701704746

BookData Driven Methods for Nonlinear Granger Causality: Climate Teleconnection MechanismsPittsburgh, Pennsylvania: Carnegie Mellon University.

Realtime and retrospective forcing in the North American land data assimilation system (NLDAS) projectJournal of Geophysical Research 108:.https://doi.org/10.1029/2002JD003118

The calculi of emergence: computation, dynamics and inductionPhysica D: Nonlinear Phenomena 75:11–54.https://doi.org/10.1016/01672789(94)902739

Rescinding community mitigation strategies in an influenza pandemicEmerging Infectious Diseases 14:365–372.https://doi.org/10.3201/eid1403.070673

Antiviral agents active against influenza A virusesNature Reviews Drug Discovery 5:1015–1025.https://doi.org/10.1038/nrd2175

On Ustatistics and v. mise? statistics for weakly dependent processesZeitschrift fur Wahrscheinlichkeitstheorie Und Verwandte Gebiete 64:505–522.https://doi.org/10.1007/BF00534953

A new statistic and practical guidelines for nonparametric Granger causality testingJournal of Economic Dynamics and Control 30:1647–1669.https://doi.org/10.1016/j.jedc.2005.08.008

Spatial dynamics of the 1918 influenza pandemic in England, Wales and the United StatesJournal of the Royal Society Interface 8:233–243.https://doi.org/10.1098/rsif.2010.0216

An influenza simulation model for immunization studiesAmerican Journal of Epidemiology 103:152–165.https://doi.org/10.1093/oxfordjournals.aje.a112213

A(H1N1)pdm09 influenza infection: vaccine inefficiencyOncotarget 8:32856–32863.https://doi.org/10.18632/oncotarget.16459

Targeted social distancing design for pandemic influenzaEmerging Infectious Diseases 12:1671–1681.https://doi.org/10.3201/eid1211.060255

Spatial Transmission of 2009 Pandemic Influenza in the USPLoS Computational Biology 10:e1003635.https://doi.org/10.1371/journal.pcbi.1003635

Testing for causalityJournal of Economic Dynamics and Control 2:329–352.https://doi.org/10.1016/01651889(80)90069X

Effectiveness of interventions to reduce contact rates during a simulated influenza pandemicEmerging Infectious Diseases 13:581–589.https://doi.org/10.3201/eid1304.060828

The truven health marketscan databases for life sciences researchersTruven Health Ananlytics IBM Watson Health.

Wiley Series in Probability and StatisticsLongitudinal data analysis, Wiley Series in Probability and Statistics, Hoboken, N.J, WileyInterscience.

Testing for linear and nonlinear granger causality in the stock pricevolume relationThe Journal of Finance 49:1639–1664.

Nonlinearity and endogeneity in macroasset pricingIMF Working Papers 95:1.https://doi.org/10.5089/9781451845082.001

BookAn Enquiry Concerning Human UnderstandingUnited States: Hackett Publishing Country.

The Cambridge Edition of the Works of Immanuel KantCritique of pure reason, The Cambridge Edition of the Works of Immanuel Kant, New York, NY, Cambridge University Press, 10.1017/CBO9780511804649.

A contribution to the mathematical theory of epidemicsProceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 115:700–721.https://doi.org/10.1098/rspa.1927.0118

Simulating school closure strategies to mitigate an influenza epidemicJournal of Public Health Management and Practice 16:252–261.https://doi.org/10.1097/PHH.0b013e3181ce594e

Pandemic simulation of antivirals + school closures: buying time until strainspecific vaccine is availableComputational and Mathematical Organization Theory 14:209–221.https://doi.org/10.1007/s1058800890271

Analytical Methods for Social ResearchCounterfactuals and causal inference : methods and principles for social research, Analytical Methods for Social Research, Second edition, New York, Cambridge University Press.

Proceedings 2003 VLDB Conference560–571, Adaptive, handsoff stream mining, Proceedings 2003 VLDB Conference, 10.1016/B9780127224428/500562.

Causal inference in statistics: An overviewStatistics Surveys 3:96–146.https://doi.org/10.1214/09SS057

BookCausality: models, reasoning and inference (2nd edition)New York, NY, USA: Cambridge University Press.https://doi.org/10.1017/CBO9780511803161

Scenarios of diffusion and control of an influenza pandemic in ItalyEpidemiology and Infection 136:1650–1657.https://doi.org/10.1017/S095026880800037X

Causality tests and functional form sensitivityApplied Economics 17:135–141.https://doi.org/10.1080/00036848500000011

ConferenceA test of granger noncausality based on nonparametric conditional independence2010 20th International Conference on Pattern Recognition. pp. 2620–2623.

Driving factors of influenza transmission in the NetherlandsAmerican Journal of Epidemiology 178:1469–1477.https://doi.org/10.1093/aje/kwt132

Influenza vaccineoutmaneuvering antigenic shift and driftNew England Journal of Medicine 350:218–220.https://doi.org/10.1056/NEJMp038238

A Bayesian view of model complexityStatistica Neerlandica 66:253–271.https://doi.org/10.1111/j.14679574.2011.00518.x

Analyses of the 1957 (Asian) influenza pandemic in the United Kingdom and the impact of school closuresEpidemiology and Infection 136:166–179.https://doi.org/10.1017/S0950268807008369

Emergence of influenza A virusesPhilosophical transactions of the Royal Society of London. Series B, Biological sciences 356:1817–1828.https://doi.org/10.1098/rstb.2001.0997

Modeling on social spread from immunityJapanese Journal of Infectious Diseases 58:S14.

Influenza research database: an integrated bioinformatics resource for influenza virus researchNucleic Acids Research 45:D466–D474.https://doi.org/10.1093/nar/gkw857
Decision letter

Mark JitReviewing Editor; London School of Hygiene & Tropical Medicine, and Public Health England, United Kingdom
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
[Editors’ note: a previous version of this study was rejected after peer review, but the authors submitted for reconsideration. The first decision letter after peer review is shown below.]
Thank you for submitting your work entitled "Conjunctions of factors triggering waves of seasonal influenza" for consideration by eLife. Your article has been evaluated by Prabhat Jha (Senior Editor) and four reviewers, one of whom, Mark Jit, is a member of our Board of Reviewing Editors.
All the reviewers and I agreed that this is an interesting, innovative and sophisticated attempt to understand the drivers of seasonal influenza spread throughout continental United States. However, we had wideranging concerns about multiple areas of the paper.
Based on these discussions and the individual reviews below, we have decided to reject the paper as we think that substantial time and work will be needed to address all the concerns. However, if you would like to substantially revise the paper in a way that takes into account the concerns, we would be interested to consider a revised version of the manuscript as a new submission. Please note that we cannot guarantee review of the revised version until we have seen it.
The full reviews are included, but the most important concerns that would need to be addressed in a new submission are the following:
1) Presentation: While eLife does not have strict format requirements, this paper in particular would benefit from a more classic presentation (e.g. Introduction, Materials and methods, Results, Discussion) that provides a fuller description of the background, purpose and how the paper interacts with and adds to the existing (vast) literature on this topic.
2) The data sources and the way they are used need a more detailed description and justification. The definition of ILI was challenged by one of the reviewers and needs to be justified.
3) While the environmental and demographic factors driving seasonality are extensively examined, there is a need also to take into account viral and host immunity factors such as antigenic drift, seasonal vaccine uptake and antigenic match, host susceptibility and betweenhost transmissibility (with the latter heavily influenced by the school calendar).
4) The conclusions (e.g. the "Southern hypothesis" and bodies of water being sources) are interesting but need stronger explanation and justification. The different approaches taken did not all give exactly the same results. Claims of causality in epidemiology require an extremely high level of evidence so in this case more moderate language may be appropriate (e.g. "the association suggests.[…]").
5) It would have been useful to make some attempt to validate the methods used, such as by holding back a validation dataset.
Reviewer #1:
This is an innovative and wellillustrated attempt to understand the drivers of seasonal influenza spread each season throughout continental United States. The techniques used are interesting and the array of datasets that the investigators have compiled are impressive. Nevertheless, I do think that their conclusions place far too much certainty than can be justified from statistical analysis of observational data.
1) The claims about causality in the paper should be tempered. While Approach 1 may historically have been called Granger causal inference, there is a limit to the extent to which causality can be claimed in the absence of (and sometimes even with) experimental data. At best, the manuscript should claim that the associations in the data are consistent with their hypotheses.
2) The concept of streamlines is an interesting one but it's not clear what the underlying mechanisms behind them are. They seem to imply that there is a continuous line linking infection in a particular location with a source in another, usually distant, location. Is it not more consistent with our understanding of infection spread to assume that infection spread is a diffusion process, even though environmental and host factors may enhance diffusion in particular directions?
3) All three approaches used are essentially data mining exercises, followed by posthoc generation of hypotheses to explain the associations found. This does not mean they are invalid – plenty of science works in this way, and some care has been taken to avoid type I errors. Nevertheless, the possibility of bias, confounding, alternative explanations and simply random error should not be discounted. This is particularly the case because there is strong geographic clustering of many of the variables identified as key drivers (e.g. densely populated areas, coastal areas, places with airports, places at the geographical and climactic extremes of the USA tend to be the same places). Hence the language used should be substantially tempered to be more tentative. While the authors have used E. coli and HIV spread as controls to ensure that these display different patterns of spatial spread, the problem is that the seasonality of these pathogens is completely different or (in the case of HIV) largely absent. Looking at a nonrespiratory but still highly seasonal pathogen e.g. norovirus might be interesting.
4) One shortcoming of the data analysis is that although a large number of variables related to environmental and demographic factors are considered, there is little consideration of viral and host characteristics which are key to driving influenza spread. For instance, over the nine years the model is run, the United States has experienced waves of different influenza types, a pandemic (which led to a shift in the influenza A/H1 subtype), different levels of vaccine coverage with varying levels of match to the prevailing strains, as well as changes in immune profiles of individuals over the population. There is no attempt to understand these drivers and capture their spatiotemporal heterogeneity even though there are datasets (e.g. vaccine coverage, serology, virological testing) that could be used to inform this.
Note about presentation:
The presentation of the paper is rather confusing. There is hardly any introductory material; the first section is essential a summary of methods and results. This is followed by a more detailed description of the methods and finally a discussion. While eLife does not have strict rules about manuscript formatting, the paper should at least follow the general conventions of medical/life sciences journals i.e. the Introduction, Materials and methods, Results, Discussion division which will make finding information easier for most readers. A fuller Introduction section setting the background is also needed e.g. why this is an important problem from a clinical/public health point of view, what other attempts have been made to address it and what their findings and shortcomings have been?
Reviewer #2:
This is an interesting Big Data study that aims to uncover the causal factors driving the onset of epidemic influenza at the county level in the United States. Authors rely on multiple datasets including medical claim records, demographic characteristics, human movement inferred from Twitter data, and climate data. Authors employed three different methodological approaches and found a complex relationship with influenza epidemic onset. Although the study is of significant interest, I do have several concerns indicated below.
1) The definition of epidemic onset is loosely defined and is unclear whether the same definition was employed across approaches. They discuss a "trigger period" of several weeks to exponential expansion. Is exponential expansion a necessary condition? This is the response variable in their study and little is discussed about its variation across counties and studied seasons.
2) Patterns from different approaches were not necessarily in agreement as authors discussed. In particular, authors did not attempt to reconstruct seasontoseason variation in influenza susceptibility. A mechanistic approach that accounts for temporal variation in influenza susceptibility is missing (perhaps departing from previous work by Shaman and his colleagues). Authors did not attempt to capture variation in reproduction numbers, vaccination rates, vaccine efficacy, and likely others across seasons.
3) Authors found strong support for a southern origin of influenza in the US. It would have been interesting to predict flu onset across the vast range of southern counties.
4) A validation phase was not attempted. For instance, authors could use the first few years of data for model calibration and the remainder of the data for forecasting purposes. It would have been interesting to quantify the predictive power of their approaches.
Reviewer #3:
This is a methodologically sophisticated paper which cleverly uses medical claims and other large datasets, and three different computational approaches, to explore whether three factors (weather, X and Y) might explain the timing of seasonal influenza in the US. They find that influenza epidemics tend to start in Southern states.
My impression is that the paper is highly sophisticated in terms of analytics, but that the data sources that are used to describe influenza and the possible drivers (such as weather, mobility) are not always well developed and ill described (Truwen) or even possibly inadequate (Twitter). The findings of shortdistance traveling being important driver corroborates findings by Gog et al., another pioneer study using medical claims data to study influenza dissemination, which is nice.
Trusting other reviewers will take on an evaluation of the analytic approaches, I will focus instead on the validity of the data (outcomes, explanatory) side of the matter.
It is my impression that the paper needs substantial revision in order to be ready for publication. For example, data sources and approaches must be far better described and more sensitivity analyses done. The lack of consideration of genetic drift variants and school schedule patterns, along with the likely consequences of regional differences in Truwen data coverage and ICD coding, if any, should be argued.
On the topic of drivers not considered:…
1) One obvious driver of seasonal influenza timing is not mentioned in this paper: the degree of genetic drift of the influenza virus relative to those that circulated in earlier years. That influenza occurs early in big "drift" years is evident from the unusually early onset of the 2003 Fujian H3N2 epidemic, and also of most pandemics in history including the 2009 pandemic. This is not discussed and that is a caveat that should be discussed. If such data are too much trouble, then at the very least consideration of which influenza subtype is dominant in each season would be useful; for example influenza B tends to be epidemic later than influenza A.
a) Could authors do a sensitivity subanalysis of seasons of major drift variants (2003 Fujian and 2009 pandemic flu) to see if their identified drivers also are found for those?
2) Another driver that is not discussed is the school session patterns across the US during the autumn and winter. As it was picked up as a major driver of pandemic spread by Gog et al., 2014, why would it not also be important for seasonal influenza? This should be listed as a shortcoming of the analysis, or authors could explain in the text why they did not think this was important.
On the topic of validity and validation of the observational data underlying the analyses:
3) I am concerned about the robustness of the analysis due to possible low counts of ILI as the authors defined it narrowly as ICD codes 487488 – which could mean much noise in ILI time series at the local level. The authors may benefit from a very similar study of influenza dissemination in the 2009 pandemic in the US, which was based on IMS CMS1500 claims data (Gog et al., 2014); the underlying paper preparing the medical claims data and validating against CDC influenza data (Viboud et al., Plos One 2013). Viboud et al., established that adding 079.9 (ICD9) to the case definition dramatically (by 90%) increased the sensitivity of the case definition, without much cost on the specificity.
a) Yet in the "Flammable media" (Discussion section) the authors discuss how 20% of Southeners have ILI each season. This seems way too high, as most people do not seek care for ILI and 20% (and what would the rate have been if the ILI definition was not so narrow?) is already higher than populationbased seroepidemiological assessments of attack rates and it is well known that most of people with ILI do not seek medical care and that at least half of infections are asymptomatic. So 20% ILI population prevalence in Truwen data is unbelievable.
i) Furthermore, it is difficult to believe that Southern US is better connected than, for example, dense US Northern cities like New York with large scale public transportation where influenza may be easily spread. See my later point about ICD coding preferences –… another worry that seasonal initial levels in southern US is a data artefact somehow, due to for example higher Truwen coverage and different ICD coding practices.
ii) Moreover, the authors comment at some point that the ILI prevalence is higher in the midwest. However, local differences in ICD coding preferences by doctors are likely a better explanation, and so these differences could just be artefacts. For that reason it would be better to not include prevalence in the modeling, but rather base the analysis solely on the initial epidemic phase and/or the peak timing.
4) The Truwen data source and extraction of the ILI and other time series should be described in much more detail. Is it CMS^{1}500 data? Were influenza codes sought among all codes listed or only the firstlisted? Etc. Also, it would be helpful to show graphically the time series data, to allow the reader to evaluate data stability and volume – over time and between locales. Or else, add some summary table that summarizes such observational findings. The Truwen claims database is likely a convenience sample, so it is likely that the coverage increases dramatically over time since 2003, and also that there are substantial differences in coverage by US geographical areas depending on Truwen levels of foot holds. Did the authors see such intrinsic variability in the data (for example, consider another code in summer months? How would such variability in coverage influence the analyses presented here – or is it somehow taken into account?
a) The authors offer a comparison with HIV patterns – but is that appropriate as HIV is a chronic disease – how exactly was this done? Did authors extract only ICD codes for new onset HIV or using longitudinal patientlevel data (nothing in the Materials and methods section about this)? A better comparison would perhaps be something else epidemic but with another mode of transmission such as rotavirus or norovirus diarrhea epidemics?
5) Validity of Twitter data to describe mobility? The authors conclude that shortdistance land travel is an important determinant of the spread of influenza, using Twitter data to determine mobility. But is this defensible, as only older kids and adults have Twitter accounts – yet we know that children are important drivers of influenza (see fx Gog et al., two major factors determining spread of influenza was school schedules and short distance traveling). Please comment on this possibility that the Twitter data are not determinants of children's movements.
6) On the "Initiation spark" (DDiscussion section). The authors conclude that cold weather and low humidity are important drivers. Put together with their finding that most epidemics originate in the Southern United states, one must ask: How could cold weather/low humidity be a predictor of influenza and influenza starts in the South – when Southern states have warmer climate and higher humidity in early winter than Northern states? Would be great if the authors would discuss this possible paradox.
Reviewer #4:
This manuscript presents an analysis of a large dataset of ILI from insurance records to identify spatial patterns in influenza outbreak initiation and spread. The authors present 3 classes of analysis to identify variables correlated with outbreak initiation. Overall, I found this work very interesting but in dire need of restructuring to clarify the main questions and methods. At present, this reads as an explanation of several very innovative approaches to mining these large datasets for patterns, but doesn't lay out a clear goal or hypothesis. Similarly, the Discussion lists an array of patterns that were discovered, but fails to identify a single (or even just a few) key insight that was gained through this approach.
I'm usually not a stickler for presentation of papers in the classic Introduction/Materials and methods/Results/Discussion format – but given that the authors are presenting several novel methods here, I think the manuscript would really benefit from being more classic. At present, some aspects of the methods are described in the "Approach #" sections, but others are presented in the "Materials and methods". I think it would be much clearer to group all the methods together so that the results could themselves be presented in a more concise format that would highlight the key insights.
The figures are excellent, but they are referenced out of order in the text; e.g. Figure 2I gives results from Approach 2 which is presented after the results from Approach 1, which are in Figure 4 and 5.
Overall, I would like to see a more concise statement of the major innovation from this approach and this analysis. At present, this is a list of variables that were identified as significant (albeit using some really interesting methods and excellent data), many of which have already been identified as predictive of flu in the past. What would be really interesting, if possible, would be to illustrate whether this method was in some way more predictive of flu dynamics than conventional methods. For example, could these analyses be carried out on the first 8 years of ILI data and the resulting model used to predict the 9th; if so, does the addition of Approach 1/3 do better than the standard Poisson regression?
[Editors’ note: what now follows is the decision letter after the authors submitted for further consideration.]
Thank you for submitting your article "Conjunction of Factors Triggering Waves of Seasonal Influenza" for consideration by eLife. Your resubmission has been evaluated by Prabhat Jha (Senior Editor) and three reviewers, one of whom, Mark Jit (Reviewer #1), is a member of our Board of Reviewing Editors The following individual involved in review of your submission has agreed to reveal their identity: Elizabeth C Lee (Reviewer #3).
The reviewers have discussed the reviews with one another. We consider that you have undertaken a major effort in putting their paper in an appropriate format and addressing the reviewers' comments and are, in general, satisfied with the revisions made. However, we still have a few remaining concerns that should be addressed by the authors to make their paper suitable for potential publication in eLife.
The Reviewing Editor has drafted this decision to help you prepare a revised submission. If you are able to address at least all the essential points to our satisfaction, we should be able to make a quick decision about the manuscript.
Essential revisions:
1) The literature review is still not adequate. Reviewing the literature is not simply a matter of having a string of references. Ideally a couple of paragraphs would be added to the Introduction or Discussion discussing the history of related statistical models for understanding the timing, spread and burden of seasonal influenza, and how this paper builds on the existing literature.
Also several of the previous reviewers found it difficult to determine the actual aim of the datadriven exercise. While we understand and are sympathetic to the point that this is not a traditional hypothesisdriven study, it still needs an aim if not a hypothesis. Could the last line of the first paragraph be reworked into what the aim of this was? (e.g. is it to determine the most important factors that are associated with initiating influenza transmission in the continental USA at the beginning of each season?)
2) You state that "our more precisely defined ILI data defines better against the CDC dataset" – could details (including quantitative measures of better definition) be given? Both narrower and broader definitions of ILI are imperfect proxies of influenza activity, although in different ways, and it is not a given that a more specific measure is necessarily less biased.
3) The incorporation of antigenic drift, vaccine coverage and school data are important and innovative additions to the paper. However, one missing factor is host immunity not due to vaccination i.e. if there is a strong influenza season in one year with no substantial antigenic drift after that then the population immune to influenza is likely to be greater the following year. Ideally, this could be captured mechanistically by incorporating SIRtype epidemiological models, or by more statistical approaches, ideally informed by measures of prior immunity such as serology. At the very least, the implications of this factor need to be adequately discussed.
4) We still think (as reviewer 3 of the previous round of reviews pointed out) that a 20% influenza infection rate is far too high, even if it incorporates asymptomatic/subclinical infection. This is much greater than the change in seropositivity over most influenza seasons other than pandemic years.
5) Given the spatiotemporal nature of the findings, the authors should include more information about the Truven database. For instance, are the authors able to provide additional details about the spatial distribution of reporting rates (e.g., number of physicians contributing claims in a given county) within the Truven database? While insurance claims do provide high resolution spatial information, the use of raw counts for influenzarelated diagnosis codes could be problematic if there is great spatial or temporal variation in reporting rates.
6) While the methods and results are clearly described, the organization of the paper can be improved to facilitate easier reading. We found ourselves flipping back and forth a lot between the text because details that appeared to be missing earlier on would then appear later in different parts of the text. Figures were also cited out of order across the methods descriptions, and in some cases we don't believe all of the figures were discussed. As a few examples, the methods used to generate Figures 3I3K were not discussed until the Results (subsection “Approach 1: Nonparametric Granger analysis”, fourth and fifth paragraphs) but Figure 3 results are first discussed in the Materials and methods (subsection “Analysis of quantized clinical data (Approach 1)”, first paragraph).
For more concrete advice, we would remove references to resultsrelated figures from the Materials and methods section so that readers examine the figures only after all of the terms and analyses have been explained, and, having a cleaner separation between Materials and methods and Results will improve readability. There were many times where the authors alluded to some small part of the results in the Materials and methods before the description of the analysis had been completed. Conversely, the Results section had relatively detailed methods descriptions, which seemed redundant since the Results followed the Materials and methods. The authors may also consider a greater overhaul of the manuscript structure that groups the primary results in the text and figures (as opposed to organizing the manuscript by methodological approach) and moves many of the supporting or diagnostic figures to the supplement.
7) Beyond improving the organization of the paper, we think the reporting of results for Approach 1 could be better explained and supported in the text. For instance, it was not clear how the claims in the sixth and eighth paragraphs of the subsection “Approach 1: Nonparametric Granger analysis” were demonstrated with the cited figures.
8) Finally, given the emphasis on the novelty of comparing results across three methodological approaches, the authors should discuss the generalizability of their findings to influenza epidemiology beyond the United States and the generalizability of their framework to other contexts with high spatiotemporal resolution data.
[Editors' note: further revisions were requested prior to acceptance, as described below.]
Thank you for submitting your article "Conjunction of Factors Triggering Waves of Seasonal Influenza" for consideration by eLife. Your article has been reviewed by one of the original peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Prabhat Jha as the Senior Editor.
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission. We appreciate your revisions to the organization of the text and interpretation of the results, and think that many of the changes improve the readability and flow of the paper and clarify the goals of the analysis. Nevertheless, we still have remaining concerns about some parts of the analyses and about the sparsity of information for others. Hence we think that the article is still not ready to be published in its current state.
We would like to avoid multiple further rounds of revision, so we can give you one more opportunity to address the remaining issues before a final decision about the manuscript is made.
The remaining issues with the manuscript can be divided into two categories: (i) substantial scientific issues that need to be addressed before the manuscript is published, (ii) recommendations about the description of the work that we strongly encourage you to take but which are optional.
Essential revisions about scientific content:i) (Discussion, second paragraph) It is still difficult to interpret the results related to shortrange and longrange travel. We were unable to find the regression results for the full set of data weeks in the supplement. Also, there is not enough detail in the analysis description to see how Figure 5 Plates CF demonstrate that local connections are more explanatory than airport connections.
ii) The highlighted areas (greater incidence) in the first flu season snapshot for Figures 3AH each should match relatively well to the maps in Figure 7B (or at least some unspecified subset of "high incidence treatment counties"). While the countymatching results to point to initiation at coasts, 20062007 in Figure 3 seems to initiate from the West Coast and 2010 in Figure 7 has a scattered pattern across the middle of the US. Please check this or justify the differences.
iii) Also related to Figure 7, was there also a geographic constraint on the choice of the matched control counties? It is surprising that everything is so geographically clustered.
Recommendations about presentation of content:
i) It would be useful to provide a quantitative example of how to interpret one of the results in Figure 2A.
ii) (Subsection “The weather/humidity paradox”) This humidity paradox has come up quite a bit recently and it may be cleaner to have a fuller treatment of this in the Discussion.
iii) The subsection on Variable and model selection at the end of the Discussion seems out of place. Was this meant to be moved to the Materials and methods?
iv) Please point out the Discussion section describing the generalizability of the findings (as mentioned in the response).
v) The Conclusion section does much more of the storytelling that would be expected from a Discussion. It may be more compelling to kick off the Discussion with the analogy describing the epidemiological findings, and then elaborate on the strengths of using multiple methods to corroborate statistical findings.
vi) You may consider condensing the figures in the main text to those that present primary findings (removing some plates and potentially reducing the total number of figures). The figures are all visually appealing, but every single one is densely packed with primary findings, descriptive information, data processing details, and diagnostics/sensitivity. Some of the main takeaways may get lost.
vii) Figure 2 might be more aptly named "Putative determinants of seasonal influenza onset[…]"
viii) Figure 4, Plate I: Please add an explanation of the red circles to the legend title or caption.
ix) Please provide some explanation of how you developed the 126 models examined in Approach 2. Based on the models listed in Table 5 and in the supplement, it seems that there is substantial variation in the number of predictors included in the model and all of the models appear to include multiple autoregressive terms for the same environmental and travel predictors. What is the motivation for this approach? In addition, it would be good to better understand the random effects in the model (subsection “Approach 2: Poisson Regression on Putative Factors”, first paragraph) and the motivation for this model structure. Does this term include a spatial smoothing component?
https://doi.org/10.7554/eLife.30756.035Author response
[Editors’ note: the author responses to the first round of peer review follow.]
The full reviews are included, but the most important concerns that would need to be addressed in a new submission are the following:
1) Presentation: While eLife does not have strict format requirements, this paper in particular would benefit from a more classic presentation (e.g. Introduction, Materials and methods, Results, Discussion) that provides a fuller description of the background, purpose and how the paper interacts with and adds to the existing (vast) literature on this topic.
As requested, we introduced classic paper structure.
We also increased coverage of literature, for example see new references AlvaradoFacundo et al., 2009; Andreasen, 2003; Araz et al., 2012; Bedford et al., 2015; Boni, 2008; Boni et al., 2004; Chlanda et al., 2015; Rizzo et al., 2008; Shaman and Kohn, 2009; Sham et al., 2010; Tamerius et al., 2013; Treanor, 2004; Vynnycky and Edmunds, 2008; Yasuda et al., 2008; Yasuda, Yoshizawa and Suzuki, 2005.
2) The data sources and the way they are used need a more detailed description and justification. The definition of ILI was challenged by one of the reviewers and needs to be justified.
We expanded descriptions and added a special comment on ILI definition, see subsection “Clinical data source”.
3) While the environmental and demographic factors driving seasonality are extensively examined, there is a need also to take into account viral and host immunity factors such as antigenic drift, seasonal vaccine uptake and antigenic match, host susceptibility and betweenhost transmissibility (with the latter heavily influenced by the school calendar).
As requested, we included into analysis, as requested, viral antigen diversity(subsection “Vaccination Coverage”), human immunization rates(subsections “Data on vaccination coverage” and “Estimating vaccination coverage”), and schools schedule(subsection “Estimating the effect of returntoschool days”).
4) The conclusions (e.g. the "Southern hypothesis" and bodies of water being sources) are interesting but need stronger explanation and justification. The different approaches taken did not all give exactly the same results. Claims of causality in epidemiology require an extremely high level of evidence so in this case more moderate language may be appropriate (e.g. "the association suggests.…").
We reworded our clams to sound more moderately (no causality statement or “putative causality” at most) throughout the text.
5) It would have been useful to make some attempt to validate the methods used, such as by holding back a validation dataset.
As suggested, we added to our analysis experiments with estimating parameters with test data and then testing the model with a heldout dataset. The predictive power of the best models appears reasonably high (subsection “Validation of Predictive Capability”).
Reviewer #1:
[…] 1) The claims about causality in the paper should be tempered. While Approach 1 may historically have been called Granger causal inference, there is a limit to the extent to which causality can be claimed in the absence of (and sometimes even with) experimental data. At best, the manuscript should claim that the associations in the data are consistent with their hypotheses.
Agreed, carefully changed throughout.
2) The concept of streamlines is an interesting one but it's not clear what the underlying mechanisms behind them are. They seem to imply that there is a continuous line linking infection in a particular location with a source in another, usually distant, location. Is it not more consistent with our understanding of infection spread to assume that infection spread is a diffusion process, even though environmental and host factors may enhance diffusion in particular directions?
We absolutely agree, the streamlines show the patterns of dominant flow that are realized via diffusionlike process.
3) All three approaches used are essentially data mining exercises, followed by posthoc generation of hypotheses to explain the associations found. This does not mean they are invalid – plenty of science works in this way, and some care has been taken to avoid type I errors. Nevertheless, the possibility of bias, confounding, alternative explanations and simply random error should not be discounted. This is particularly the case because there is strong geographic clustering of many of the variables identified as key drivers (e.g. densely populated areas, coastal areas, places with airports, places at the geographical and climactic extremes of the USA tend to be the same places). Hence the language used should be substantially tempered to be more tentative.
Agreed, changed.
While the authors have used E. coli and HIV spread as controls to ensure that these display different patterns of spatial spread, the problem is that the seasonality of these pathogens is completely different or (in the case of HIV) largely absent. Looking at a nonrespiratory but still highly seasonal pathogen e.g. norovirus might be interesting.
4) One shortcoming of the data analysis is that although a large number of variables related to environmental and demographic factors are considered, there is little consideration of viral and host characteristics which are key to driving influenza spread. For instance, over the nine years the model is run, the United States has experienced waves of different influenza types, a pandemic (which led to a shift in the influenza A/H1 subtype), different levels of vaccine coverage with varying levels of match to the prevailing strains, as well as changes in immune profiles of individuals over the population. There is no attempt to understand these drivers and capture their spatiotemporal heterogeneity even though there are datasets (e.g. vaccine coverage, serology, virological testing) that could be used to inform this.
We now repeated our analysis, incorporating additional data types, as suggested. (Please see our reply to point #3 raised by the Editor.)
Note about presentation:
The presentation of the paper is rather confusing. There is hardly any introductory material; the first section is essential a summary of methods and results. This is followed by a more detailed description of the methods and finally a discussion. While eLife does not have strict rules about manuscript formatting, the paper should at least follow the general conventions of medical/life sciences journals i.e. the Introduction, Materials and methods, Results, Discussion division which will make finding information easier for most readers. A fuller Introduction section setting the background is also needed e.g. why this is an important problem from a clinical/public health point of view, what other attempts have been made to address it and what their findings and shortcomings have been?
Agreed, the required changes were made, now we have the standard section titles (please see our reply to point #1 raised by the Editor).
Reviewer #2:
This is an interesting Big Data study that aims to uncover the causal factors driving the onset of epidemic influenza at the county level in the United States. Authors rely on multiple datasets including medical claim records, demographic characteristics, human movement inferred from Twitter data, and climate data. Authors employed three different methodological approaches and found a complex relationship with influenza epidemic onset. Although the study is of significant interest, I do have several concerns indicated below.
1) The definition of epidemic onset is loosely defined and is unclear whether the same definition was employed across approaches. They discuss a "trigger period" of several weeks to exponential expansion. Is exponential expansion a necessary condition? This is the response variable in their study and little is discussed about its variation across counties and studied seasons.
We made it clear that the same definition was used across all approaches and clarified the definition of onset, see subsection “Analysis of quantized clinical data (Approach 1)”, first paragraph.
2) Patterns from different approaches were not necessarily in agreement as authors discussed. In particular, authors did not attempt to reconstruct seasontoseason variation in influenza susceptibility. A mechanistic approach that accounts for temporal variation in influenza susceptibility is missing (perhaps departing from previous work by Shaman and his colleagues). Authors did not attempt to capture variation in reproduction numbers, vaccination rates, vaccine efficacy, and likely others across seasons.
In the revised version, we have additional predictors (such as immunization and viral antigenic drift) that bring us closer to the mechanistic interpretation, please see our response to the Editor’s point #3.
3) Authors found strong support for a southern origin of influenza in the US. It would have been interesting to predict flu onset across the vast range of southern counties.
Following the reviewer’s suggestion, we did run a crossvalidation analysis, estimating and testing models on disjoint subsets of data (see subsection “Validation of Predictive Capability”).
4) A validation phase was not attempted. For instance, authors could use the first few years of data for model calibration and the remainder of the data for forecasting purposes. It would have been interesting to quantify the predictive power of their approaches.
Agreed. In the revised version, we attempted splitting data into training and testing (see subsection “Validation of Predictive Capability”).
Reviewer #3:
This is a methodologically sophisticated paper which cleverly uses medical claims and other large datasets, and three different computational approaches, to explore whether three factors (weather, X and Y) might explain the timing of seasonal influenza in the US. They find that influenza epidemics tend to start in Southern states.
My impression is that the paper is highly sophisticated in terms of analytics, but that the data sources that are used to describe influenza and the possible drivers (such as weather, mobility) are not always well developed and ill described (Truwen) or even possibly inadequate (Twitter). The findings of shortdistance traveling being important driver corroborates findings by Gog et al., another pioneer study using medical claims data to study influenza dissemination, which is nice.
Trusting other reviewers will take on an evaluation of the analytic approaches, I will focus instead on the validity of the data (outcomes, explanatory) side of the matter.
It is my impression that the paper needs substantial revision in order to be ready for publication. For example, data sources and approaches must be far better described and more sensitivity analyses done. The lack of consideration of genetic drift variants and school schedule patterns, along with the likely consequences of regional differences in Truwen data coverage and ICD coding, if any, should be argued.
On the topic of drivers not considered:
1) One obvious driver of seasonal influenza timing is not mentioned in this paper: the degree of genetic drift of the influenza virus relative to those that circulated in earlier years. That influenza occurs early in big "drift" years is evident from the unusually early onset of the 2003 Fujian H3N2 epidemic, and also of most pandemics in history including the 2009 pandemic. This is not discussed and that is a caveat that should be discussed. If such data are too much trouble, then at the very least consideration of which influenza subtype is dominant in each season would be useful; for example influenza B tends to be epidemic later than influenza A.
a) Could authors do a sensitivity subanalysis of seasons of major drift variants (2003 Fujian and 2009 pandemic flu) to see if their identified drivers also are found for those?
We now included genetic drift into our modeling as suggested by the reviewer, please see our response to the Editor’s point #3.
2) Another driver that is not discussed is the school session patterns across the US during the autumn and winter. As it was picked up as a major driver of pandemic spread by Gog et al., 2014, why would it not also be important for seasonal influenza? This should be listed as a shortcoming of the analysis, or authors could explain in the text why they did not think this was important.
We included the school seasons in our analysis, please see our response to the Editor’s point #3.
On the topic of validity and validation of the observational data underlying the analyses:
3) I am concerned about the robustness of the analysis due to possible low counts of ILI as the authors defined it narrowly as ICD codes 487488 – which could mean much noise in ILI time series at the local level. The authors may benefit from a very similar study of influenza dissemination in the 2009 pandemic in the US, which was based on IMS CMS1500 claims data (Gog et al., 2014); the underlying paper preparing the medical claims data and validating against CDC influenza data (Viboud et al., Plos One 2013). Viboud et al., established that adding 079.9 (ICD9) to the case definition dramatically (by 90%) increased the sensitivity of the case definition, without much cost on the specificity.
The addition of codes 079.98 and 079.99 does substantially increase the number of unique people with viral infections cases (over all years) from 4,009,982 to 10,682,458. However, after carefully considering description of codes 079.99 and 079.98, we were uncomfortable using them. For example, code 079.99 is associated with the following list of conditions.
 Acute retinal necrosis
 Acute viral bronchiolitis
 Acute viral disease
 Acute viral laryngotracheitis
 Acute viral otitis externa
 Acute viral thyroiditis
 Amantadine resistant virus present
 Arthritis due to viral infection
 Arthritis of hand due to viral infection
 Arthritis of knee due to viral infection
 Arthropathy associated with viral disease
 Boid inclusion body disease
 Cardiomyopathy due to viral infection
 Cardiomyopathy, due to viral infection
 Congenital viral disease
 Congenital viral infection
 Disease of possible viral origin
 Encephalitis due to influenzaspecific virus not identified
 Maternal viral disease complicating pregnancy
 Maternal viral disease in pregnancy
 Neonatal viral infection of skin
 Nonspecific syndrome suggestive of viral illness (finding)
 Oral mucosal viral disease
 Postpartum (after childbirth) viral disease
 Postpartum viral disease
 Postviral depression
 Postviral excessive daytime sleepiness
 Postviral infection debility
 Viral acute pancreatitis
 Viral bronchitis
 Viral carditis
 Viral dermatitis of eyelid
 Viral disease
 Viral disease in childbirth
 Viral disease in pregnancy
 Viral ear infection
 Viral esophagitis
 Viral eye infection
 Viral infection
 Viral infection by site
 Viral lower respiratory infection
 Viral musculoskeletal infection
 Viral myositis
 Viral pleurisy
 Viral respiratory infection
 Viral retinitis
 Viral syndrome
 Viral ulcer of esophagus
 Virus present
 Zanamivir resistant virus present
While we are not disputing Viboud et al.'s conclusions in the paper cited by the reviewer, given the large volume of specific data in our sample (4,009,982 unique people with ILIs), we do not see a compelling reason to add cases with nonspecific codes. We directly evaluated “our” stricterdefined ILIs against the CDC data (temporal prevalence of influenza) and observed a very good correspondence (that deteriorates with addition of nonspecific viral infection codes).
a) Yet in the "Flammable media" (Discussion section) the authors discuss how 20% of Southeners have ILI each season. This seems way too high, as most people do not seek care for ILI and 20% (and what would the rate have been if the ILI definition was not so narrow?) is already higher than populationbased seroepidemiological assessments of attack rates and it is well known that most of people with ILI do not seek medical care and that at least half of infections are asymptomatic. So 20% ILI population prevalence in Truwen data is unbelievable.
We introduced more careful wording, mentioning unreported cases, see page 22, lines 625—626.
i) Furthermore, it is difficult to believe that Southern US is better connected than, for example, dense US Northern cities like New York with large scale public transportation where influenza may be easily spread. See my later point about ICD coding preferences – another worry that seasonal initial levels in southern US is a data artefact somehow, due to for example higher Truwen coverage and different ICD coding practices.
We are talking about empirically ascertained social networks, measuring actual number of contacts among people, see Table 1.
ii) Moreover, the authors comment at some point that the ILI prevalence is higher in the midwest. However, local differences in ICD coding preferences by doctors are likely a better explanation, and so these differences could just be artefacts. For that reason it would be better to not include prevalence in the modeling, but rather base the analysis solely on the initial epidemic phase and/or the peak timing.
In predictors we use a derivative of prevalence, which should be invariant to coding biases. Our crossvalidation results suggest that, even if biases are present in coding, the modeling does capture predictive information.
4) The Truwen data source and extraction of the ILI and other time series should be described in much more detail. Is it CMS^{1}500 data? Were influenza codes sought among all codes listed or only the firstlisted? Etc. Also, it would be helpful to show graphically the time series data, to allow the reader to evaluate data stability and volume – over time and between locales. Or else, add some summary table that summarizes such observational findings. The Truwen claims database is likely a convenience sample, so it is likely that the coverage increases dramatically over time since 2003, and also that there are substantial differences in coverage by US geographical areas depending on Truwen levels of foot holds. Did the authors see such intrinsic variability in the data (for example, consider another code in summer months? How would such variability in coverage influence the analyses presented here – or is it somehow taken into account?
Agreed: we used percent of insured per county to adjust for insurance coverage variation. Also, we used the total number of people described each week for given county as an offset in the Poisson regression. To answer the reviewer’s question about geographic distribution of cases, we have a movie showing ILI spread over nearly a decadeperiod.
a) The authors offer a comparison with HIV patterns – but is that appropriate as HIV is a chronic disease – how exactly was this done? Did authors extract only ICD codes for new onset HIV or using longitudinal patientlevel data (nothing in the Materials and methods section about this)? A better comparison would perhaps be something else epidemic but with another mode of transmission such as rotavirus or norovirus diarrhea epidemics?
The HIV figure is shown only to illustrate that different diseases have distinct patterns of flow. As norovirus and rotavirus infections are much rarer, we have concerns about consistency of coding and testing (i.e., specificity) for these infections.
5) Validity of Twitter data to describe mobility? The authors conclude that shortdistance land travel is an important determinant of the spread of influenza, using Twitter data to determine mobility. But is this defensible, as only older kids and adults have Twitter accounts – yet we know that children are important drivers of influenza (see fx Gog et al., two major factors determining spread of influenza was school schedules and short distance traveling). Please comment on this possibility that the Twitter data are not determinants of children's movements.
We also used a measure of direct proximity of counties to account for land movements, and also explicitly tested the school schedule hypothesis.
6) On the "Initiation spark" (Discussion section). The authors conclude that cold weather and low humidity are important drivers. Put together with their finding that most epidemics originate in the Southern United states, one must ask: How could cold weather/low humidity be a predictor of influenza and influenza starts in the South – when Southern states have warmer climate and higher humidity in early winter than Northern states? Would be great if the authors would discuss this possible paradox.
We added a new paragraph explicitly discussing this “paradox,” see subsection “Is there a paradox here?”.
Reviewer #4:
This manuscript presents an analysis of a large dataset of ILI from insurance records to identify spatial patterns in influenza outbreak initiation and spread. The authors present 3 classes of analysis to identify variables correlated with outbreak initiation. Overall, I found this work very interesting but in dire need of restructuring to clarify the main questions and methods. At present, this reads as an explanation of several very innovative approaches to mining these large datasets for patterns, but doesn't lay out a clear goal or hypothesis. Similarly, the Discussion lists an array of patterns that were discovered, but fails to identify a single (or even just a few) key insight that was gained through this approach.
Agreed, restructured. The first paragraph in Discussion now explains the innovative claims of the manuscript.
I'm usually not a stickler for presentation of papers in the classic Introduction/Materials and methods/Results/Discussion format – but given that the authors are presenting several novel methods here, I think the manuscript would really benefit from being more classic. At present, some aspects of the methods are described in the "Approach #" sections, but others are presented in the "Materials and methods". I think it would be much clearer to group all the methods together so that the results could themselves be presented in a more concise format that would highlight the key insights.
Agreed, done as requested: descriptions of all methods are combined in “Materials and methods,” while “Results” summarize conclusions. Please see reply to the Editor’s comment #1.
The figures are excellent, but they are referenced out of order in the text; e.g. Figure 2I gives results from Approach 2 which is presented after the results from Approach 1, which are in Figure 4 and 5.
Corrected, figures are reordered.
Overall, I would like to see a more concise statement of the major innovation from this approach and this analysis. At present, this is a list of variables that were identified as significant (albeit using some really interesting methods and excellent data), many of which have already been identified as predictive of flu in the past. What would be really interesting, if possible, would be to illustrate whether this method was in some way more predictive of flu dynamics than conventional methods. For example, could these analyses be carried out on the first 8 years of ILI data and the resulting model used to predict the 9th; if so, does the addition of Approach 1/3 do better than the standard Poisson regression?
We added explicit highlights of innovation in our work: see the first paragraph in Discussion, which now explains the innovative claims of the manuscript.
[Editors' note: the author responses to the rereview follow.]
Essential revisions:
1) The literature review is still not adequate. Reviewing the literature is not simply a matter of having a string of references. Ideally a couple of paragraphs would be added to the Introduction or Discussion discussing the history of related statistical models for understanding the timing, spread and burden of seasonal influenza, and how this paper builds on the existing literature.
As requested, we have augmented the manuscript with a short review of prior mathematical modeling approaches to analyzing influenza epidemics (see subsection “Recent computational studies of influenza”, fourth paragraph).
Also several of the previous reviewers found it difficult to determine the actual aim of the datadriven exercise. While we understand and are sympathetic to the point that this is not a traditional hypothesisdriven study, it still needs an aim if not a hypothesis. Could the last line of the first paragraph be reworked into what the aim of this was? (e.g. is it to determine the most important factors that are associated with initiating influenza transmission in the continental USA at the beginning of each season?)
Done as requested, see Introduction, first paragraph.
2) You state that "our more precisely defined ILI data defines better against the CDC dataset" – could details (including quantitative measures of better definition) be given? Both narrower and broader definitions of ILI are imperfect proxies of influenza activity, although in different ways, and it is not a given that a more specific measure is necessarily less biased.
In the revised manuscript, we provide a whole new section dedicated to this issue. We provide definitions of ICD9CM codes, and argue that the narrow definition we chose is most appropriate for our purposes. We argue this while directly accepting that, similar to what is outlined in Viboud et al.’s PLOS ONE article, we have no perfect gold standard to evaluate the quality of this choice; precision appears to be more important for our analysis than sensitivity.
3) The incorporation of antigenic drift, vaccine coverage and school data are important and innovative additions to the paper. However, one missing factor is host immunity not due to vaccination i.e. if there is a strong influenza season in one year with no substantial antigenic drift after that then the population immune to influenza is likely to be greater the following year. Ideally, this could be captured mechanistically by incorporating SIRtype epidemiological models, or by more statistical approaches, ideally informed by measures of prior immunity such as serology. At the very least, the implications of this factor need to be adequately discussed.
We added the following statement to the Materials and methods to address this issue:
“Accounting for nonvaccination host immunity: Resistance to influenza infection in human hosts arises via two related mechanisms: immunological memory from a previous infection by an identical or sufficiently similar strain, or vaccination against the current strain (or a sufficiently similar strain). […] Likewise, if the absolute deviation from years previous starts decreasing, we would also register a decrease in susceptibility. Incorporating these factors in the diverse set of models that we investigate guarantees that we are indeed considering host susceptibility contributions from a wide range of possible mechanisms.
4) We still think (as reviewer 3 of the previous round of reviews pointed out) that a 20% influenza infection rate is far too high, even if it incorporates asymptomatic/subclinical infection. This is much greater than the change in seropositivity over most influenza seasons other than pandemic years.
Agreed.
Our estimates of influenza infection rate are unavoidably inflated. This is because what we are measuring is the week and countyspecific prevalence of influenzalike illnesses among insured patients who contacted their physician for any reason during the week in question. In other words, the denominator that we use for computing prevalence is almost certain to be much smaller than the overall population of the corresponding county.
We introduced the corresponding disclaimer into the manuscript; see the “Flammable Media” paragraph in the Discussion.
5) Given the spatiotemporal nature of the findings, the authors should include more information about the Truven database. For instance, are the authors able to provide additional details about the spatial distribution of reporting rates (e.g., number of physicians contributing claims in a given county) within the Truven database? While insurance claims do provide high resolution spatial information, the use of raw counts for influenzarelated diagnosis codes could be problematic if there is great spatial or temporal variation in reporting rates.
Note that we are using proportions (of people affected with ILIs out of total number of visible people in the MarketScan) rather than raw counts of affected individuals.
We provide a movie that represents influenza prevalence across space and time (each frame is one week), which is the best summary of data we can think of. We supplemented the description of the database with reference to 2017 White Paper, and additional statistics on ILI spatial and temporal distribution in the Supplement (see Figure 1—figure supplement 4C).
6) While the methods and results are clearly described, the organization of the paper can be improved to facilitate easier reading. We found ourselves flipping back and forth a lot between the text because details that appeared to be missing earlier on would then appear later in different parts of the text. Figures were also cited out of order across the methods descriptions, and in some cases we don't believe all of the figures were discussed.
We made a significant effort to streamline and harmonize our exposition of methods and results.
As a few examples, the methods used to generate Figures 3I3K were not discussed until the Results (subsection “Approach 1: Nonparametric Granger analysis”, fourth and fifth paragraphs) but Figure 3 results are first discussed in the Materials and methods (subsection “Analysis of quantized clinical data (Approach 1)”, first paragraph).
Changed as suggested.
For more concrete advice, we would remove references to resultsrelated figures from the Materials and methods section so that readers examine the figures only after all of the terms and analyses have been explained, and, having a cleaner separation between Materials and methods and Results will improve readability. There were many times where the authors alluded to some small part of the results in the Materials and methods before the description of the analysis had been completed.
Done as requested.
Conversely, the Results section had relatively detailed methods descriptions, which seemed redundant since the Results followed the Materials and methods. The authors may also consider a greater overhaul of the manuscript structure that groups the primary results in the text and figures (as opposed to organizing the manuscript by methodological approach) and moves many of the supporting or diagnostic figures to the supplement.
We removed the redundancy.
7) Beyond improving the organization of the paper, we think the reporting of results for Approach 1 could be better explained and supported in the text. For instance, it was not clear how the claims in the sixth and eighth paragraphs of the subsection “Approach 1: Nonparametric Granger analysis” were demonstrated with the cited figures.
We improved the description of Approach 1, providing more extended description.
8) Finally, given the emphasis on the novelty of comparing results across three methodological approaches, the authors should discuss the generalizability of their findings to influenza epidemiology beyond the United States and the generalizability of their framework to other contexts with high spatiotemporal resolution data.
We added a discussion of this generalizability (see subsection Generalizability in Discussion).
[Editors' note: further revisions were requested prior to acceptance, as described below.]
The remaining issues with the manuscript can be divided into two categories: (i) substantial scientific issues that need to be addressed before the manuscript is published, (ii) recommendations about the description of the work that we strongly encourage you to take but which are optional.
Essential revisions about scientific content:i) (Discussion, second paragraph) It is still difficult to interpret the results related to shortrange and longrange travel. We were unable to find the regression results for the full set of data weeks in the supplement. Also, there is not enough detail in the analysis description to see how Figure 5 Plates CF demonstrate that local connections are more explanatory than airport connections.
Our conclusion that local travel is predominantly responsible for disease wave propagation is supported by several lines of analysis.
First, continuous landmovement infection waves are visible in the weekly influenza rate movie; we computed this movie from insurance claim data and made it available with results of this study.
Second, because our allweeksincluded dataset was too large for the R MCMCglmm package to handle, we performed mixedeffect Poisson regression calculations using a 50 percent random sample of all the weeks for which data were available. In this computation, the airport proximity fixedeffect coefficient turned out to be statistically significantly negative (see Source data 2, as well as the editable output file “flu50percentweeks.txt”).
Third, the results from our Grangercausality inference showed that:
1) Local countytocounty movements were much more predictive of influenza wave change than airport movements. In comparing Plates E and F in Figure 5F’s local movement causality coefficient (γ) is, on average, twice as large as that for longrange movement (Figure 5E). Figure 5E shows that the mean longrange causality coefficient is approximately 0.05, whereas it is just over 0.1 in the local propagation. As the causality coefficient quantifies the amount of predictability (measured as information in bits) communicated about the target data stream per observed bit in the source data stream, it follows that, on average, every ten bits of sequential incidence data from an influencing location tells us one bit about the unfolding incidence dynamics in the target location. Therefore, in the longrange movement case, informativeness is twice as low, so we need on average 20 bits to infer one bit about the state of infection. These calculations strongly suggest that local movement is predictively stronger with regards to influenza infection propagation.
2) While the most frequent value of the computed time delay in influence propagation between counties with large airports is zero weeks, this distribution is significantly flatter compared to that for local, countytocounty influence propagation.
We updated the main text of the manuscript with the above prose.
ii) The highlighted areas (greater incidence) in the first flu season snapshot for Figures 3AH each should match relatively well to the maps in Figure 7B (or at least some unspecified subset of "high incidence treatment counties"). While the countymatching results to point to initiation at coasts, 20062007 in Figure 3 seems to initiate from the West Coast and 2010 in Figure 7 has a scattered pattern across the middle of the US. Please check this or justify the differences.
There is no mistake; the resulting patterns are what they turn out to be. The intuitive explanation of perceived discrepancy is that the matching method agrees with other analysis types predominantly, but not in all cases. Each analysis has limitations. In the case of the matching analysis, we have less statistical power than in, for example, Poisson regression; matching by numerous parameters reduces the initial set of thousands of counties to a handful of matching “treated” counties (which meet a particular combination of weather and sociodemographic conditions) and “untreated” counties (very similar to “treated” ones in all respects but treatment). The difficulttomatch, “weeded out” counties may happen to be in the coastal areas indicated as the most likely places of influenza wave origin by other analyses.
In the case of the 2007 and 2010 results, the matching analyses pick patterns that are different from those produced by the causality streamline analysis and mixedeffect Poisson regression models.
Figure 6’s Plate B shows the distribution of the treatment counties and matchednontreatment counties. Note that here, we are not directly predicting initiation, so while the patterns in Figure 3 and Figure 6 should indeed have some similarity, they are not required to match up perfectly. The most similar treated counties do indeed show up in the southern shores.
iii) Also related to Figure 7, was there also a geographic constraint on the choice of the matched control counties? It is surprising that everything is so geographically clustered.
The geographic clustering arose from imposing similar weather patterns from multiple climate variables; we added explanation of this observation to the main text. (See subsection “Approach 3: Matching counties and factor combinations” in Results.)
Recommendations about presentation of content:
i) It would be useful to provide a quantitative example of how to interpret one of the results in Figure 2A.
The fixedeffect regression coefficients plotted in Plate A are shown on logarithmic scale, meaning that the absolute magnitude of predictorspecific effect is obtained by exponentiating the parameter value. A negative coefficient for a predictor variable suggests that influenza rate falls as this factor increases, while a positive coefficient predicts growing rate of infection as the parameter value grows. The integrated influence of individual predictors, under this model, is additive with respect to the countyspecific rate of infection.
For example, a coefficient of 0.6for parameter AVG_PRESS_mean tells us that the average atmospheric pressure has a negative association with the influenza rate. As the mean atmospheric pressure for the county grows, the probability that the county would participate in an infection initiation wave falls. As exp(0.6)=0.54, the rate of infection drops by 46 percent when atmospheric pressure increases by one unit of zerocentered and standarddeviationnormalized atmospheric pressure. Similarly, an increase in the share of a white Hispanic population predicts an increase in influenza rate: A coefficient of 1.3 translates into a exp(1.3) ⋅ 100% – 100% = 267% rate increase, possibly, because of the higher social network connectivity associated with this segment of population.
We put this information to legend of Figure 1.
ii) (Subsection “The weather/humidity paradox”) This humidity paradox has come up quite a bit recently and it may be cleaner to have a fuller treatment of this in the Discussion.
We considered this suggestion; we do not have much to add on this topic, beyond what is already discussed. Apologies.
iii) The subsection on Variable and model selection at the end of the Discussion seems out of place. Was this meant to be moved to the Materials and methods?
Yes, it has been moved.
iv) Please point out the Discussion section describing the generalizability of the findings (as mentioned in the response).
We have added the section: our apologies, it was added, then deleted by mistake.
v) The Conclusion section does much more of the storytelling that would be expected from a Discussion. It may be more compelling to kick off the Discussion with the analogy describing the epidemiological findings, and then elaborate on the strengths of using multiple methods to corroborate statistical findings.
Agreed; it has been changed as suggested.
vi) You may consider condensing the figures in the main text to those that present primary findings (removing some plates and potentially reducing the total number of figures). The figures are all visually appealing, but every single one is densely packed with primary findings, descriptive information, data processing details, and diagnostics/sensitivity. Some of the main takeaways may get lost.
We made an effort to streamline figures, moving whole Figure 1 to the supplement.
vii) Figure 2 might be more aptly named "Putative determinants of seasonal influenza onset[…]"
Agreed. Corrected.
viii) Figure 4, Plate I: Please add an explanation of the red circles to the legend title or caption.
Done.
ix) Please provide some explanation of how you developed the 126 models examined in Approach 2. Based on the models listed in Table 5 and in the supplement, it seems that there is substantial variation in the number of predictors included in the model and all of the models appear to include multiple autoregressive terms for the same environmental and travel predictors. What is the motivation for this approach?
We added and removed random effects to the model structure, taking out and adding fixed effects, running the regression, and plotting the DIC. Model complexity was simply proxied by the descriptional complexity in terms of how large the model was (the number of factors in the regression equation). We stopped when the drop in DIC stabilized.
The driving idea here is the identification of the Pareto front that trades off accuracy (in terms of DIC) with model complexity. This is, of course, a “greedy approach,” in the sense that we do not guarantee that we have indeed found the “best” possible model. However, since our crossvalidation (Figure 5 in main text) yielded good results, we deemed the stopping rule to be satisfactory.
We put this information into the main text (see subsection “Variable and Model Selection” in Materials and methods).
In addition, it would be good to better understand the random effects in the model (subsection “Approach 2: Poisson Regression on Putative Factors”, first paragraph) and the motivation for this model structure. Does this term include a spatial smoothing component?
No, we did not use an explicit spatial smoothing. The countylevel random effects were used to account for possible heterogeneity in disease reporting across counties. However, we used as predictors average values of infection rates of immediate neighbors for the county (1, 2, 3 and 4 weeks in the past), this prediction structure afforded an implicit spatial smoothing, dampening spurious fluctuations in infection rates.
We clarified this point in the main text, see Approach 2 in Materials and methods.
https://doi.org/10.7554/eLife.30756.036Article and author information
Author details
Funding
National Institutes of Health (U01HL10863401)
 Jeffrey L Shaman
National Institutes of Health (R01GM100467)
 Jeffrey L Shaman
National Institutes of Health (U01GM110748)
 Jeffrey L Shaman
National Institutes of Health (1P50MH094267)
 Andrey Rzhetsky
Defense Sciences Office, DARPA (W911NF1410333)
 Andrey Rzhetsky
National Institutes of Health (R01HL122712)
 Andrey Rzhetsky
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Erin Gannon, Mark Jit, Prabhat Jha, Elizabeth C Lee, and Margarita Rzhetsky for numerous comments on earlier versions of the manuscript. This work was supported by NIH grants 1P50MH094267, U01HL10863401, R01HL12271202, R01GM100467, and U01GM110748, in addition to DARPA contract W911NF1410333, and a gift from Liz and Kent Dauten.
Reviewing Editor
 Mark Jit, London School of Hygiene & Tropical Medicine, and Public Health England, United Kingdom
Publication history
 Received: July 26, 2017
 Accepted: February 13, 2018
 Accepted Manuscript published: February 27, 2018 (version 1)
 Version of Record published: March 22, 2018 (version 2)
Copyright
© 2018, Chattopadhyay et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 3,218
 Page views

 449
 Downloads

 24
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, Scopus, PubMed Central.