1. Epidemiology and Global Health
Download icon

Conjunction of factors triggering waves of seasonal influenza

  1. Ishanu Chattopadhyay
  2. Emre Kiciman
  3. Joshua W Elliott
  4. Jeffrey L Shaman
  5. Andrey Rzhetsky  Is a corresponding author
  1. University of Chicago, United States
  2. Microsoft Research, United States
  3. Columbia University, United States
Research Article
  • Cited 1
  • Views 1,527
  • Annotations
Cite as: eLife 2018;7:e30756 doi: 10.7554/eLife.30756

Abstract

Using several longitudinal datasets describing putative factors affecting influenza incidence and clinical data on the disease and health status of over 150 million human subjects observed over a decade, we investigated the source and the mechanistic triggers of influenza epidemics. We conclude that the initiation of a pan-continental influenza wave emerges from the simultaneous realization of a complex set of conditions. The strongest predictor groups are as follows, ranked by importance: (1) the host population’s socio- and ethno-demographic properties; (2) weather variables pertaining to specific humidity, temperature, and solar radiation; (3) the virus’ antigenic drift over time; (4) the host population’€™s land-based travel habits, and; (5) recent spatio-temporal dynamics, as reflected in the influenza wave auto-correlation. The models we infer are demonstrably predictive (area under the Receiver Operating Characteristic curve 80%) when tested with out-of-sample data, opening the door to the potential formulation of new population-level intervention and mitigation policies.

https://doi.org/10.7554/eLife.30756.001

eLife digest

Influenza – or ‘the flu’ – is a contagious disease which sweeps across the globe like clockwork, claiming tens of thousands of lives. This is known as ‘seasonal flu’.

Many scientists have tried to identify the factors that spark these yearly outbreaks. Some past studies have found that seasonal flu occurs when air that is normally humid turns dry, suggesting weather patterns play an important part. Other research has shown that air travel contributes to the flu spreading across the world. However, these studies typically focus on just one or two factors on their own. It is still not clear how exactly these factors combine to drive outbreaks, and then sustain the wave of infection.

To address this, Chattopadhyay et al. analyze the medical histories of 150 million American people over a decade, combining this information with large datasets about the different factors that trigger flu outbreaks. This includes detailed data about air travel and weather patterns, as well as census data that describe features of the population. Patterns of movement are also examined, for example by processing billions of Twitter messages “tagged” with a location. Chattopadhyay et al. used all of these datasets to model outbreaks of the flu in the United States, and see which factors play the biggest role.

It turns out that yearly outbreaks of seasonal flu are a result of a combination of elements. Some factors interact to help trigger the start of the wave, like humid weather in a highly populated area with nearby airports. Other factors, such how people move, encourage the spread of the infection. Finally, certain features of the population, for example how closely knitted a community is, make specific areas of the country more susceptible to the arrival of the disease. Overall, some of the most important elements of the model relate to the characteristics of the populations, the weather, the type of virus, and the number of short-distance journeys (rather than air travel).

Understanding how and why outbreaks occur can help policy-makers design strategies that reduce the spread and impact of seasonal flu, which could potentially save thousands of lives. Ultimately, the model developed by Chattopadhyay et al. could be used to test whether these policies would work before they are implemented in the real world.

https://doi.org/10.7554/eLife.30756.002

Introduction

Seasonal influenza is a serious threat to public health, claiming tens of thousands of lives every year. A large number of past studies have focused on identifying the likely factors responsible for initiating each seasonal disease wave. Typically, each such study focused on one or a few hypothetical factors. Our study aimed at an integrative, joint analysis of numerous suggested disease triggers, comparing their relative importance and possible cooperation in triggering pan-US waves of seasonal influenza infection. The goal of this study was to identify the most informative combinations of statistical predictors associated with the initiation of pan-US influenza infection waves.

Recent computational studies of influenza: 

Computational study of the dynamics and factors influencing infectious disease spread began with compartmental models, such as the Susceptible-Infected-Resistant (SIR) model, which traces its origins to the beginning of the last century (Kermack and McKendrick, 1927). Initially a purely theoretical tool, these SIR-style models were subsequently enhanced with population and geographic data, allowing their application to specific cities and the distances between them (Keeling and Rohani, 2002). For instance, one approach, termed ‘gravity wave’ modeling, used geographic, short- and long-range, work-related human movement and demographic data to formulate gravity potentials between US counties in order to infer the dynamics of infection spread (Viboud et al., 2006).

Some studies have focused on one specific factor affecting infection, such as air travel, to simulate the spread of influenza (Colizza et al., 2006); other studies used SIR models, generalized for a collection of interconnected geographic areas, spatial-network or patch models, to model a number of common infections, including influenza, measles, and foot-and-mouth disease (Riley, 2007).

More ambitious network model approaches have simulated the global transmission of infectious disease using high-resolution, worldwide population data and the locations of International Air Transport Association (IATA)-indexed airports (Balcan et al., 2009). Similar to (Viboud et al., 2006), the authors of the study computed the global infection-pre-disposing ‘gravity field’ over the network of international airports. This network-based approach was subsequently developed further (Balcan and Vespignani, 2011) through the modeling of ‘phase transition’–that is the chain-reaction switch of geographic infection status–in complex networks, utilizing approaches introduced in theoretical physics.

Another layer of sophistication was achieved by incorporating rich historical records. For example, Eggo et al. (Eggo et al., 2011) modeled the Spanish influenza epidemic of 1918–1919, using mortality documents from both the UK and the US, explicitly accounting for the size and distances between cities. In the same spirit, Brockmann and Helbing (Brockmann and Helbing, 2013) represented infection as diffusion on a complex network, estimating arrival times for infection across the globe.

Following the formulation of the hypothesis that absolute humidity modulates influenza survival and transmission (Shaman and Kohn, 2009), researchers began incorporating climate variables into SIR-like models (Chowell et al., 2012). More recent dynamic models have incorporated a probabilistic description of influenza infection’s spatial transitions in space and time, accounting for selected demographic confounders (Gog et al., 2014) and (Charu et al., 2017).

In this study, rather than following the SIR-style modeling tradition, we used statistical epidemiology- and econometric-like approaches, in addition to a causality-network method presented here for the first time. There are some prior studies that are close to ours in spirit (but not in details). For example, (Barreca and Shimshack, 2012) used historical US influenza mortality data (1973–2002) in conjunction with collinear humidity and temperature records to establish county-level statistical associations between variables in the datasets. They concluded that absolute humidity was ‘an especially critical determinant of observed human influenza mortality, even after controlling for temperature.’ Another study, focusing on historical influenza records in the Netherlands, (te Beest et al., 2013) used the number of weekly influenza-like patient visits (transformed into an estimated rate of infection) as a response variable in a regression analysis of climate data. They concluded that the bulk of explained variation (57%) was attributed to the depletion of susceptible hosts during the disease season and non-weather-related ‘between-season effects,’ with only 3% explained by absolute humidity, represented as a continuous predictor variable. Additionally, this study observed that school holidays did not have a statistically significant effect on influenza transmission.

As all causality detection methods come with dissimilar limitations and are imperfect in unique ways, we designed our study intentionally to attack the same target problem using three different statistical approaches: Approach 1: A non-parametric Granger analysis (Granger, 1980) focusing on infection flows’ directionalities across the US and whether influenza propagates via long- vs. short-distance travel (we run analysis across all pairs of air- and land-travel county neighbors, respectively). Approach 2: A mixed-effect Poisson regression (Hedeker and Gibbons, 2006) explicitly accounting for the auto-correlation of infection waves in time and space, along with the full set of socioeconomic, climate, and geographic predictors. Approach 3: A county-matching, non-parametric analysis to identify the minimum predictive set of factors that distinguish those counties associated with the onset of the influenza season (Morgan and Winship, 2015).

Our study became possible through access to several, very large longitudinal datasets: (1) a nine-year collection of insurance records capturing the dynamics of influenza-like illnesses (ILIs) in the United States (Truven MarketScan database, see Materials and methods); (2) temporally collinear, high-resolution weather measurements over every US county; (3) detailed air travel (The United States Bureau of Transportation Statistics, 2010) and geographic proximity data (The United States Census, 2016) showing connectivity between US counties; (4) billions of geo-located Twitter messages reflecting long- and short-distance human movement patterns, and; (5) US census data accounting for US county and county-equivalent population distribution, demographic, and socioeconomic properties (HRSA, 2016). An explicit comparison of the ILI data in the insurance claims to the influenza records provided by the Center for Disease Control and Prevention (CDC, 2016) showed that the two sources agree well (ρ=0.91,p=3.5×10-201), with insurance claims providing higher data resolution, see Figure 1—figure supplement 1. Curiously, the relationship between the two sources of ILI observations is not linear: We attribute this to the lower resolution of the CDC data. These three types of analysis produce congruent–albeit not identical–results.

Results

The logical flow of our analysis is as follows: (1) We first show that our definition of ILIs corresponds well with CDC data, and that our causality coefficients, defined in Approach 1, have similar meanings to coefficients in the regression analysis; (2) We then explain the outcomes of the analysis according to Approach 1; (3) We then analyze the importance of putative casual factors, Figure 1, as applied to an initiation of influenza season, Figure 2; (4) In Approach 2, we pay special attention to the relative importance of short- and long-distance travel in influenza propagation, Figure 4; (5) We further test the best regression model in terms of predictive accuracy, Figure 5, using disjointed data parts for training and testing, and; (6) We culminate our analysis with a county-matching analysis, Figure 6.

Figure 1 with 6 supplements see all
Putative determinants of seasonal influenza onset in the continental US and Poisson mixed-effect regression analysis (Approach 2).

Plate A shows the significant variables along with their computed influence coefficients from the mixed-effect Poisson regression analysis (the best model chosen from 126 different regression equations with different variable combinations). The statistically significant estimates of fixed effects are grouped into several classes: climate variables, economic and demographic variables, auto-regression variables, variables related to travel, and those related to antigenic diversity (see the last entry in Table 5 for the detailed regression equation used. The complete list of all models considered is given in Table S-D7). The fixed-effect regression coefficients plotted in Plate A are shown on a logarithmic scale, meaning that the absolute magnitude of predictor-specific effect is obtained by exponentiating the parameter value. A negative coefficient for a predictor variable suggests that the influenza rate falls as this factor increases, while a positive coefficient predicts a growing rate of infection as the parameter value grows. The integrated influence of individual predictors, under this model, is additive with respect to the county-specific rate of infection. For example, a coefficient of -0.6 for parameter AVG_PRESS_mean tells us that the average atmospheric pressure has a negative association with the influenza rate. As the mean atmospheric pressure for the county grows, the probability that the county would participate in an infection initiation wave falls. As exp(-0.6)=0.54, the rate of infection drops by 46% when atmospheric pressure increases by one unit of zero-centered and standard-deviation-normalized atmospheric pressure. Similarly, an increase in the share of a white Hispanic population predicts an increase in influenza rate: A coefficient of 1.3 translates into a exp(1.3)×100%-100%=267% rate increase, possibly, because of the higher social network connectivity associated with this segment of population. Plates B - I enumerate the average spatial distribution of a few key significant factors considered in Poisson regression: (B) average temperature; (C) average maximum specific humidity; (D) average wind velocity in miles per hour; (E) average solar flux; (F) logarithm of population density (people per square mile); (G) total precipitation; (H) income, and; (I) percent of poor as deviations about the country average. Plates J-M show the strong dependence between our estimated antigenic diversity (normalized, see Definition in text) corresponding to the HA, NA, M1, and M2 viral proteins, and the cumulative fraction of the inoculated population (normalized between 0 and 1), where both sets of variables are geo-spatially and temporally stratified. Pearson’s correlation tests shown in Plates J-M were performed under null hypothesis that there the two quantities (plotted along axes X and Y) are statistically independent (H0:ρ=0).

https://doi.org/10.7554/eLife.30756.003
Figure 2 with 1 supplement see all
Characteristics of seasonal influenza in the continental US An analysis of county-specific, weekly reports on the number of influenza cases for a period of 471 weeks spanning January 2003 to December 2013 (Plates A-H) for recurrent patterns of disease propagation.

In particular, the weeks leading up to that in which an epidemic season peaks (determined by significant infection reports from the maximum number of counties for that season) demonstrate an apparent flow of disease from south to north, which cannot be explained by population density alone (also see movie in Supplement). Plate I illustrates the near-perfect time table for a seasonal epidemic. Plates J and K compare the county-specific initiation probabilities of an influenza season, and the causality streamlines.

https://doi.org/10.7554/eLife.30756.010
Table 1
Social connectivity: The US Southern region appears to have an unusually high level of social connectivity.

(In GSS survey results, the number of close friends, close friends who are neighbors, and number of friends who all or mostly know each other is higher in the South, especially in the East/South/Central census region, than in the country at large.)

https://doi.org/10.7554/eLife.30756.011
WSC (TX, OK, AR, LA)ESC (MS, AL, TN, KY)SA (FL, GA, SC, NC, VA, WV, MD, DC)Country-at-largeWNC (ND, SD, NE, KS, MO, IA, MN) (not in South/Southeast) this is the second most social region following ESC (MS, AL, TN, KY)
Close friends7.2212.768.207.5710.56
Close friends who are neighbors1.023.401.321.453.15
% of friends who all or mostly know each otherAll:20% Mostly: 43%All:18% Mostly: 58%All: 11% Mostly: 52%All: 12 Mostly: 50%All: 16% Mostly: 58%
How often visit closest friends*107151126122129
  1. *Survey options are: lives in household, daily, several times a week, once a week, once a month, several times a year, and less often. These are converted to approximate number of visits per year (see Supplement for more information about the GSS analysis).

Table 2
Fisher’s exact test results on matched treatment combinations
https://doi.org/10.7554/eLife.30756.012
YRdh0dh1dt0htuM1M2V0V1ap-valueOdds ratioLower 99% cnf. bnds.Upper 99% cnf. bnds.
2003XXYYYYXXXXX1.9×10-82.831.734.66
2004XXYYYYXXYXX6.5×10-36.221.08132.03
2005XXYYYYXYXXX3.4×10-68.312.1654.93
2006XXYYYYXYXXX5.3×10-74.561.9612.0
2007YYXYYYXYXXX2.1×10-23.850.8228.16
2008YYYYYXXXXXX1.9×10-35.261.2350.2
2009YYXYYYXXXXX3.1×10-104.782.3810.34
2010XXYYYYXXXYX1.4×10-23.640.9324.27
2011XXYYYYXXYXX4.9×10-114.912.5110.05
All YearsYYYYYYYYYYX7.2×10-93.882.107.89
All YearsYYYYYYYYYYY1.01.00.482.15
Table 3
Fisher’s exact test results on matched treatment combinations
https://doi.org/10.7554/eLife.30756.013
(a)
YR p-value 99%Conf. Bnd.
max hus avg
20030.0036031.0, 4.21
20040.69190.16, 17.63
2005 0.19480.61, 7.89
2006 0.6525 0.28, 3.06
2007 0.3574 0.49, 18.85
2008 0.1030.55, 1.23
2009 0.10670.68, 8.77
2010 0.5318 0.27, 41.03
20110.090540.74, 5.17
ALL YRS1[1]10-41.12, 1.88
t avg mean
20030.064390.81, 3.65
200410.27, 10.62
20050.0033391.17, 123.0
2006 0.81720.29, 4.0
20070.5370.47, 7.42
20080.059850.59, Inf
2009 0.00063371.37, 51.68
2010 0.2853 0.50, 9.28
2011 0.05729 0.85, 3.49
ALL YRS 5.87[1]10-91.36, 2.23
d hus 0
2003 0.5374 0.55, 3.41
200410.27, 11.01
20050.04401 0.81, 7.13
2006 0.0017081.31, Inf
20070.009199 1.0, 37.34
20080.30510.60, 5.92
20090.02726 0.82, 90.16
2010 10.41, 2.78
2011 0.5770.57, 2.77
ALL YRS1.48[1]10-51.12, 1.64
d t avg mean 0
2003 0.0049561.0, 24.12
20040.4450.14, 6.0
20050.001164 1.23, 12.03
2006 0.011980.97, 11.08
20070.011470.96, 11.05
2008 0.08552 0.74, 11.18
20090.06847 0.73, 17.69
20100.082510.15, 1.63
20110.60310.47, 3.55
ALL YRS 4.98[1]10-111.35, 2.06
(b)
YR p-value 99% Conf. Bnd.
hus 1
20030.1652 0.10, 2.43
2004 1 0.22, 24.42
2005 0.002004 0.09, 0.87
20061 0.32, 7.65
2007 0.3890.33, 1.90
20080.02142 0.9, 8.48
2009 0.1822 0.67, 6.06
20100.6005 0.23, 4.22
2011 0.9166 0.6, 1.88
ALL YRS0.07 0.72, 1.06
d hus 2
2003 0.0083 1.01, 5.77
2004 0.790.36, 13.33
2005 0.275 0.71, 2.54
2006 0.24 0.66, 4.36
2007 0.19 0.62, 9.33
2008 0.18 0.65, 7.52
2009 0.53 0.44, 6.25
2010 0.08 0.21, 1.51
20110.59 0.69, 1.87
ALL YRS 0.13 0.78, 1.07
urbanity
20030.0083 1.01, 5.77
2004 0.79 0.36, 13.33
2005 0.2750.71, 2.54
20060.24 0.66, 4.36
2007 0.19 0.62, 9.33
2008 0.180.65, 7.52
20090.53 0.44, 6.25
2010 0.08 0.21, 1.51
2011 0.59 0.69, 1.87
ALL YRS2.2_10-16 3.67, 5.06
airport proximity
2003 0.004956 1.0, 24.12
20040.445 0.14, 6.0
2005 0.0011641.23, 12.03
2006 0.01198 0.97, 11.08
2007 0.01147 0.96, 11.05
2008 0.08552 0.74, 11.18
20090.068470.73, 17.69
2010 0.08251 0.15, 1.63
2011 0.60310.47, 3.55
ALL YRS1_10-161.73, 2.93

Approach 1: Causality streamlines from non-parametric granger analysis

Our analysis of health insurance claims covers nine years of influenza cycles (2003 to 2011, inclusively), see Figure 2. We visualized weekly, county-level prevalence as a movie (see Supplement); Figure 2A–H show a few relevant weekly snapshots from different years. The plates in Figure 2A–H, and especially the movie, clearly show that seasonal influenza cycles initiate in the South/Southeastern US and sweep the country from south to north. This pattern is repeated, with some variation, each season.

Figure 3G shows the country-wide propagation dynamics as represented by our computed causality streamlines. The alignment of causality flow vectors into long, continuous streamlines suggests a stable propagation mechanism across the country; the probability of a long sequence of summary movement vectors accidentally matching in the direction by mere chance is vanishingly small (p<10-16 for longer streamlines).

Computation of causality field, Approach 1 Plates A and B: Incidence data from neighboring counties in Alabama, US.

Plates C and D: Transformation to difference-series, i.e., change in the number of reported cases between weeks. We imposed a binary quantization, with positive changes mapping to ‘1,’ and negative changes mapping to ‘0.’ From a pair of such symbol streams, we computed the direction-specific coefficients of Granger causality (see Supplement). For each county, we obtained a coefficient for each of its neighbors, which captured the degree of influence flowing outward to its respective neighbors (Plate L). We computed the expected outgoing influence by considering these coefficients as representative of the vector lengths from the centroid of the originating county to centroids of its neighbors. Viewed across the continental US, we then observed the emergence of clearly discernible paths outlining the ‘causality field’ (Plate G). The long streamlines shown are highly significant, with the probability of chance occurrence due to accidental alignment of component stitched vectors less than 10-185; while each individual relationship has a chance occurrence probability of 6% (Plates E and F). Plate H: Spatially-averaged travel patterns (see text in Materials and methods) and the sink distribution between expected travel patterns. These patterns (Plate H), along with the inferred causality field (Plate I), match up closely, with sinks showing up largely in the Southern US, explaining the central role played there. In Plate H, the size of the blue circles indicate the percentage of movement streamlines (computed by interpreting the locally averaged movement directions as a vector field) that sink to those locations. In Plate I, the size of the red circles indicate the percentage of causality streamlines that sink to the indicated locations. We note that 75% of the movement streamlines sink in counties belonging to the Southern states, which matches up well with the sinks of the causality streamlines. In Plates J and K show spatial analysis results for two different infections (HIV and E. coli, respectively) and which exhibit very different causality fields, negating the possibility of boundary effects.

https://doi.org/10.7554/eLife.30756.015

Do epidemics originate from the same counties season after season? To answer this question, we follow ‘causality streamlines’ back to their source county. Informally, influenza onset in these source counties has little or no causal dependency on their neighbors. That is, their epidemic states are seemingly caused by factors outside of disease prevalence in other counties. Figure 2K presents the county-specific likelihood of streamline initiation across our nine years of data. To verify the near-shores position of these source counties is not a mere manifestation of a boundary effect of shore counties (no neighbors at the side of large water body), we carried out identical causality analyses with two different infections, specifically choosing diseases less likely to share etiologies with influenza: HIV and Escherichia coli. The results for both HIV and Escherichia coli infections are shown in Figure 3J and K, which exhibit flow patterns very different from those obtained for influenza. These streamlines almost never originate from the coasts, thus reducing the likelihood that the pattern observed for influenza is a geo-spatial boundary effect. Combined with the exceedingly low probability (10-185) of chance inference for the streamlines, this strongly supports our conclusion that the epidemics are of coastal origin.

We directly validated our conclusion that influenza waves tend to start in the South by identifying counties which seem to trigger the epidemic. We computed a ‘trigger period’ of five to six weeks for each season, defined as the period immediately preceding an exponential increase in influenza dispersion. To calculate this weekly dispersion, we treated each county as a node in an undirected graph, each with an edge connecting two geographically adjacent counties–only if they had both reported at least one influenza case in the specified week. We defined dispersion as the size of the largest, connected component in this undirected graph. Thus, a trigger period describes the period in which the size of the giant component of the infection graph rises above 250 counties from being under 100 as shown in Figure 2I, and then proceeds to the seasonal peak. Figure 2J presents the likelihood of a county being part of this largest, connected component during the trigger period. In the second approach, we followed causality streamlines back to their source county. Figure 2K presents the county-specific likelihood of streamline initiation across nine years.

These approaches produced qualitatively similar results (Figure 2J and K). While epidemics seem to start in many places around the country (see the origins of streamlines in Figure 3J and I), they successfully gain traction near large bodies of water (as evidenced by the most likely places of epidemic onset, see Figure 2J and K)). Otherwise, they fizzle out before triggering an actual epidemic cycle (see Figure 2J). Seasonal initiation is neither spatially uniform nor simply a reflection of county-specific population density.

Our analysis of the Twitter movement matrix indicates that people most frequently travel between neighboring counties, preferentially towards higher-population-density areas, which shows that the maximum-probability movement patterns follow the local gradient of increasing population density (see Figure 4—figure supplement 1). In contrast, the geo-spatially-averaged movement vectors for each county reveal global flows in the movement patterns (see Figure 3H, along with Methods for the calculation of spatial averages).

Figure 3H–I suggest that average movement patterns largely agree with the influenza streamlines: Both patterns, especially in the South/Southeast of the country, are associated with flow pointing away from large bodies of water.

In addition to looking at the direction of short-range travel, we used our non-parametric Granger analysis to investigate the comparative strength of short- vs. long-range influenza propagation. In the first case, we considered the neighborhood map shown in Figure 4A (for a detailed definition of ”neighbors,’ see Materials and methods), and the in the second case, we considered associations between major, airport-bearing counties (see Figure 4B). We then plotted the distribution of the maximum pairwise coefficient of causality, where the maximization is carried out by fixing the source and the target and varying the delay in weeks, after which we attempt to predict the target stream.

Figure 4 with 1 supplement see all
Comparing influence of short- and long-distance travel on infection propagation Plate A shows land connectivity visualized as a graph with edges between neighboring counties.

Plate B shows air connectivity as links between airports, with edge thickness proportional to traffic volume. Plate C shows the delay in weeks for the propagation of Granger-causal influence between counties in which major airports are located, and Plate E shows the distribution of the inferred causality coefficient between those same counties. Plates D and F show the delay and the causality coefficient distribution respectively, which we computed by considering spatially neighboring counties. The results show that local connectivity is more important. We reached a similar conclusion using mixed-effect Poisson regression, as shown in Plate G: The inferred coefficients for land connectivity are significantly larger than those for air connectivity, tweet-based connectivity, or exponential diffusion from the top 30 largest airports. The coefficients shown in Plate G are exponentiated, allowing us to visualize probability magnitudes (see ‘Model Definition’).

https://doi.org/10.7554/eLife.30756.016

Conclusions associated with Approach 1: The inferred causality streamlines computed from the infection time series in all counties (Figure 3) show that epidemics are mostly triggered near large water bodies and flow inland and away. They also illustrate that the US continental Southern states act as ‘sinks’ to a large proportion of these streamlines. (‘Sinks,’ in our definition here, are geographic areas that multiple streamlines converge towards; sinks are especially obvious when we look at the vector representation of causality direction. The opposite of a ‘sink’ is a ‘source,’ defined as an area at which at least one streamline starts.) This might explain the increased prevalence in the designated region. Additionally, the analysis shows that human travel is a very important driver of emergent epidemiological patterns, and that short-range, land-based travel is more important than air-travel. This result is cross-corroborated by our Poisson regression analysis (described next in Approach 2).

Approaches 2 and 3 are motivated by the ‘why’ questions: (1) Why do epidemics initiate where and when they do? and; (2) Why do some disease initiations become epidemics while others do not?

Approach 2: Importance of factors from poisson regression

We focused on a subset of weeks associated with the initial rise of influenza waves (indicated by the gray bars in Figure 2I, and calculated as discussed earlier). The results from our best-fit model are illustrated in Figure 1A. We selected this particular model out of a total of 126 compared in the Bayesian analysis, a few of which we list in Table 5, ranked by their decreasing goodness-of-fit, measured with the Deviance Information Criterion, DIC (see Supplement). From the values of the inferred coefficients corresponding to the different factors, and taking into account their significance levels and credible intervals, we concluded that the roles played by weather variables, particularly humidity, appear to be substantially more complicated compared to what has been suggested in the literature.

The surprisingly unimportant factors

School schedule was not predictive of influenza onset in our analysis: We ended up with a p-value of 0.84 and an odds ratio of 0.8403, strongly suggesting that school opening dates are not a significant factor in triggering the seasonal epidemic.

We are not claiming here that closing down schools during the seasonal peak, or during an initial phase of a seasonal epidemic, would not have a beneficial effect on maximum incidence. Rather, the observed epidemiological patterns over the time period we analyzed (2003–2011) do not seem to name ‘school opening times’ as a significant predictive factor–at least in the continental US.

We factored in the effect of vaccination coverage by estimating the cumulative fraction of the population that received the current influenza vaccine stratified by geo-spatial location and time of inoculation within each influenza season. Our analysis indicated that vaccination coverage is not a significant predictor of influenza onset/triggering period. It could be a reflection of overall vaccine ineffectiveness, or the choice of outcome predicted (i.e. vaccination might effect overall infection numbers over the entire outbreak, but not the timing of the trigger). It could also reflect the fact that different influenza type/subtypes have different virulence–so a vaccine against H3N2 during an H3N2 year, may be more effective, but due to that fact that H3N2 is more virulent, more people still wind up seeking medical care.

Vaccination coverage failed to reach predictive significance. The variables corresponding to spatio-temporal indicators of the cumulative fraction of the inoculated population are included in our best model (see the last entry in Table 5), but their effect fails to be significant. However, if we drop those variables from the model, the DIC increases. We suggest that the strong dependence between antigenic diversity and vaccination coverage (see Figure 1J–M) is responsible for this effect: Vaccination coverage is important, but its influence is captured by the antigenic variation.

The most important factors

The strongest predictor groups (ranked by importance) are the population’s socio-demographic properties, weather, antigenic drift of the virus, land-based travel, and auto-correlation of influenza waves.

Weather As far as weather effects are concerned, epidemics tend to originate in places with high mean, maximum specific humidity, high average temperature, and low average air pressure, namely, in counties at the Southern, and, to a lesser extent, Eastern and Western US coastlines. Additionally, the spread of an epidemic is significantly influenced by a drop in specific humidity up to four weeks before its onset. However, this effect is weaker than the mean maximum specific humidity effect. Drop in average temperature that dips one to three weeks prior to the epidemic onset is also significantly important (this is consistent with earlier experiments [Lowen et al., 2008]), especially when the temperature drop is accompanied by a decrease in specific humidity, average wind speed, and solar flux. However, high levels of solar flux in the week of onset are also important. This complicated set of weather conditions, a signature of the cold air front (Shaman et al., 2010), is validated by our out-of sample predictions to increase the risk of triggering the seasonal epidemic. Total precipitation also plays a positive role.

The weather/humidity paradox

How can colder weather and lower humidity be a predictor of influenza, if influenza epidemic waves tend to start in the South with warmer climates and higher humidity? Our resolution of this seeming controversy is as follows: The stress is on the drop in both humidity and temperature in those areas with high average annual values of these measurements. A blast of colder, lower-humidity weather in these warm-climate areas has two effects: (1) The influenza virus can stay viable in water droplets longer than in hot, sunny weather, and; (2) Humans tend to interact indoors, in more crowded conditions. Both of these factors are favorable for transmission of the virus to the population at large.

Antigenic variation Antigenic diversity for HA, NA, M1, and M2 are important predictors. While HA, NA, and M1 inhibit the trigger, M2 diversity enhances it. This peculiar difference in the direction of influence might be a manifestation of the roles played by the individual viral proteins in its life-cycle.

The first three proteins are directly involved in the viral binding to host cell surface receptors, while M2 activity is needed only during HA biosynthesis. Additionally, proteolysis experiments indicated that M2 proton channel activity helped to protect (H1N1)pdm09 HA from premature conformational changes as it traversed low-pH compartments during transport to the cell surface (Alvarado-Facundo et al., 2015).

We found that antigenic diversity is a significant predictor in all four of the viral proteins we considered. Interestingly, while the increasing diversity found in HA, NA, and M1 inhibits the epidemic trigger, the higher diversity in M2 enhances it (see Discussion).

Travel Land travel intensity one to three weeks before epidemic onset is a strong predictor. Air travel is also predictive, but its strength is an order of magnitude weaker than that of land travel.

Autocorrelation The increase in an influenza outbreak’s infection weekly rate one and two weeks before an epidemic onset (in the epidemic source county itself) is predictive of epidemic wave origin.

We have used substantially richer datasets than those used by earlier studies (Shaman et al., 2010; Tamerius et al., 2013), which lends strong statistical support to our conclusions. It also allowed us to disentangle and make precise the contributions from different factors, e.g., mean county humidity vs. drops in humidity before an infection. While we found the former effect to be clearly stronger (in accordance to previously reported results [Shaman et al., 2010]), the other diverse set of factors were also found to be significant.

Validation of predictive capability

The robustness of our results is established in a number of ways:

  1. In mixed-effect regression (Approach 2), we compared over 120 chosen model variations (see Table 5 for an abridged list, and Supplementary file 3for the complete enumeration of considered models); the results appear to be qualitatively stable, though the quantitative performance of the models vary somewhat as measured by DIC for different configurations of regression equations.

  2. We carried out a direct validation of predictive performance by estimating model parameters using the first six seasons and predicting the epidemic trigger locations using the last three (see Figure 5 and Materials and methods). The out-of-sample predictions of influenza incidence are always positively correlated with observed incidence (Plate A). Perhaps more importantly, we obtained good predictability as measured by the area under the curve (AUC 80%) for the receiver operating characteristics (ROC, See Plates B-D). Plates E-G show that our out-of-sample predictions correctly identified epidemic initiation in the Southern and Southeastern counties of the continental US.

  3. As we discuss in the next section, in Approach 3, we conducted a corroborating matched effect analysis on the counties, using combinations of county-specific factors as a ‘treatment,’ not unlike clinical trials in which patients on a drug regimen are matched to patients receiving a placebo (Morgan and Winship, 2015).

Prediction performance with training data from the first six seasons and validation on the last three.

Plate A shows the correlation between the observed incidence and the model-predicted response. We show significant positive correlation, particularly within the trigger periods, between the model predictions and the actual held-out data. This gives us confidence to construct ROC curves for each week. Plates B-D show the ROC curves for the last three weeks of each of the three seasons in the out-of-sample period (potentially, these computations can be repeated for all possible partitions of study weeks into training and test samples). Plates E-G illustrate that the normalized decision variable, which is the normalized response from the model, identifies the South and Southeastern counties as the trigger zones.

https://doi.org/10.7554/eLife.30756.018

Approach 3: Matching counties and factor combinations

In Approach 3, we investigated combinations of factors presented as ‘treatment’ via a non-parametric, exact-matching analysis of US counties during the weeks of epidemic onset on a season-by-season basis.

First, we collected the list of all counties with a drop in maximum specific humidity during the weeks leading up to an influenza season in a particular year. This is the ‘treated set’: the set of counties that may be thought of as subjected to the positive ‘treatment’ of a drop in specific humidity. We split this set into two, considering counties that also experience increased influenza prevalence during the epidemic onset, and ones that do not (counties with two different values of the outcome variable). The number of counties in these two sets define the first row of a 2×2 contingency table. In the second row (the ‘control set’), we focused on counties that do not experience a drop in the maximum specific humidity. However, we only considered counties that have a matching counterpart in the treated set in the following sense: For each county in the control set, we found at least one in the treated set such that the rest of the significant variables (other than specific humidity) had similar variation patterns in both counties. Once we defined the control set, we split it in the manner described for the treated set: We counted the number of control counties that experienced an increased influenza prevalence during epidemic onset, and those which did not. This defined the second row of the contingency table. Finally, we used Fisher’s exact test to compute an odds ratio (the odds of realizing these numbers by chance), along with the test-derived significance of the association between the ‘treatment’ and epidemic wave initiation (p-value). Furthermore, we defined our treatment to consist of multiple factors simultaneously, e.g. specific humidity and its change in the preceding week, along with average temperature and degree of urbanity, see Figure 6.

Results for our analysis involving county-matching (Approach 3).

Plate A illustrates the factor combinations that turn out to be significant over the nine seasons. Notably, for each season, we have multiple, distinct factor sets that turn out to be significant (p<0.05) and yield a greater-than-unity odd ratio. Plotting the probability with which different factors are selected when we look at season-specific county matchings (the top panel in Plate A), we see a corroboration of the conclusions drawn in Approach 2. We find that specific humidity and average temperature, along with their variations, are almost always included. We do see some new factors that fail to be significant in the regression analysis, e.g., degree of urbanity and vaccination coverage. While vaccination coverage is indeed included as a factor in our best performing model, in Approach two it failed to achieve significance, perhaps due to its strong dependence on antigenic variation (see Figure 1J–M). Degree of urbanity is indeed significant for some of the regression models we considered (see Supplementary Information), but was not significant for the model with the smallest DIC. Note that ‘Treatment’ here is defined as a logical combination of weather factors. A treatment is typically a conjunction of several weather variables. For example, the treatment shown in top left panel of Plate B involves a conjunction of: (1) a drop in average temperature during the week of infection; (2) a drop in temperature during the week of infection; (3) a higher-than-average specific humidity; (4) a higher-than-average temperature, and; (5) a high degree of urbanity. With respect to the ‘treatment,’ we can divide counties into three groups: (1) ‘treated counties,’ shown in green; (2) at least one matching county for each of the treated counties (matching counties are very close to the treated counties in all aspects but in treatment, which we called ‘control’ counties), shown in black, and; (3) other counties, shown in grey. The counties in the ‘treatment’ and ‘control’ groups are further subdivided into those counties that initiated an influenza wave and those that have not, resulting in four counts arranged into a two-by-two contingency table. We then used the Fisher exact test to test for association between treatment and influenza onset. Panels in Plate B show both the treated and control sets for the 9 seasons for a subset of chosen factors. The results are significant, as shown in Tables 2 and 3. The variable definitions are given in Table 4. Notably, some of the variables found significant in the regression analysis are not included above, and some which are not found to be significant in the best regression model show up here. This is not to imply that they are not predictive or lack causal influence. The matched treatment approach, as described above, is not very effective if we use more than 10-15 factors simultaneously to define the treated set (for the amount of data we have); this results in a contingency table populated with zero entries.

https://doi.org/10.7554/eLife.30756.019

Note that geographic clustering of ‘treated’ and ‘untreated’ counties arose automatically as a result of similar weather patterns being imposed via the constraint of multiple climate variables.

Table 4
Variables in mixed-effect Poisson regression analysis (Approach 2)
https://doi.org/10.7554/eLife.30756.020
Variable namephysical effect
NTotal number of patient visits given week and county (the offset)
max_HUS_meanMean county-specific maximum specific humidity over nine years
d_max_HUS_iNormalized and zero-centered deviations of maximum humidity, i = 0–4 weeks before
t_avg_meanMean county-specific temperature over nine years
d_t_avg_iNormalized and zero-centered deviations of mean temperature, i = 0–4 weeks before
RSDS_meanMean county-specific solar insolation over nine years
d_RSDS_iNormalized and zero-centered deviations of mean solar insolation, i = 0–4 weeks before
AVG_PRESS meanMean county-specific solar insolation over nine years
d_AVG_PRESS_iNormalized and zero-centered deviations of mean surface pressure, i = 0–4 weeks before
tot_prec_meanMean county-specific total precipitation over nine years
d_tot_prec_iNormalized and zero-centered deviations of mean total precipitation, i = 0–4 weeks before
Wind_avg_meanMean county-specific average wind speed over nine years
d_Wind_avg_iNormalized and zero-centered deviations of average wind speed, i = 0–4 weeks before
IncomeCounty-specific mean income
airport_diffusionInfluence from proximity to airports, modeled as human traffic-weighted exponential diffusion from the 30 largest US airports
Am_Ind% of American Indians in the county
Asian% of Asians
White_Hisp% of Caucasian/Hispanics
W_non_Hisp% of Caucasian/Non-Hispanics
Black_Hisp% of Black/Hispanics
B_non_Hisp% of Black/Non-Hispanics
Pacific% of Pacific Islanders
Insured% of county population insured
Poor% of county population under poverty line
Urban% of county population classified as urban
land_iinfluenza velocity change in the land neighbors of the county i weeks before the current week
tweet_iinfluenza velocity change in the Twitter neighbors of the county i weeks before the current week
air_iinfluenza velocity change in airport neighbors of the county i weeks before the current week
v_ichange in rate of infection in the county itself i weeks from the current
M1Diversity in M1 protein primary structure
M2Diversity in M2 protein primary structure
NADiversity in NA protein primary structure
HADiversity in HA protein primary structure
Cum_vac_per_N_ivaccination coverage in the county cumulated over past 20 weeks i weeks from the current
  1. (a) Definition of variables

Unlike the mixed-effect regression approach (Approach 2), this matching analysis is non-parametric, and intended to reveal whether multiple factors are, indeed, simultaneously necessary.

The results of Approach 3’s analysis are presented in Figure 6 and Tables 2 and 3. We found that no single variable was able to consistently yield a statistically significant odds ratio greater than one; multiple factors interacted to shape an epidemic trigger (see Table 3 for a few examples). With a total of 47 significant variables in our best mixed-effect model, an exhaustive search for all combinations was not feasible. Instead, we performed a standard evolutionary search, looking for combinations that yielded a significant odds ratio for individual seasons. Additionally, we considered all seasons together (by simply adding the contingency tables, element-wise) in order to increase the test’s statistical power.

We isolated ten variables (as shown in Figure 6, Plate B) in this manner which included maximum specific humidity and average temperature along with their variations, the degree of urbanity, antigenic variation, and vaccination coverage.

The factors that appeared most often in our analysis are illustrated in Plate A: It appears that maximum specific humidity and average temperature, along with their variations, and the degree of urbanity have the most frequent contribution, followed by antigenic variation and vaccination coverage.

We did see some new factors here that failed to be of significance in the regression analysis (Approach 2), e.g., degree of urbanity and vaccination coverage. While vaccination coverage was included as a factor in our best performing model in Approach 2, it failed to achieve significance, perhaps due to its strong dependence on antigenic variation (see Figure 1J–M). Degree of urbanity was indeed significant for some of the regression models we considered (see Supplementary Information), but failed to be so for the model with the least DIC.

Thus, Approach three corroborates and strengthens key claims of Approach 2.

The exact set of factors varied somewhat over the seasons; nevertheless, together, they yield significant results when all seasons are considered together. The matching analysis corroborates our results from both the mixed-effect regression and the geographic streamline analyses: The sets of counties initiating the wave are near coasts on the Southern region of the continental US (see Plates A - I in Figure 6).

Local travel vs. long-distance travel and influenza

Our conclusion that local travel is predominantly responsible for disease wave propagation is supported by several lines of analysis.

First, continuous land-movement infection waves are visible in the weekly influenza rate movie; we computed this movie from insurance claim data and made it available with results of this study.

Second, because our all-weeks-included dataset was too large for the R MCMCglmm package to handle, we performed mixed-effect Poisson regression calculations using a 50 percent random sample of all the weeks for which data were available. In this computation, the airport-proximity, fixed-effect coefficient turned out to be statistically significantly negative (see Supplement A, as well as the editable output file ‘flu-50-percent-weeks.txt’).

Third, the results from our Granger-causality inference showed that:

  1. Local county-to-county movements were much more predictive of influenza wave change than airport movements. In comparing Plates E and F in Figure 4, we see that the local movement causality coefficient (γ) is, on average, twice as large as that for long-range movement (Figure 4E). Figure 4E shows that the mean long-range causality coefficient is approximately 0.05, whereas it is just over 0.1 in the local propagation. As the causality coefficient quantifies the amount of predictability (measured as information in bits) communicated about the target data stream per observed bit in the source data stream, it follows that, on average, every ten bits of sequential incidence data from an influencing location tells us one bit about the unfolding incidence dynamics in the target location. Therefore, in case of the long-range movement, informativeness is twice as low, so we need on average 20 bits to infer one bit about the state of infection. These calculations strongly suggest that local movement is predictively stronger with regards to influenza infection propagation.

  2. While the most frequent value of the computed time delay in influence propagation between counties with large airports is zero weeks, this distribution is significantly flatter compared to that for local, county-to-county influence propagation.

Discussion

A summary of the complex relationship between the driving factors that contribute to the trigger and subsequent development of a seasonal epidemic can be clarified with a forest fire metaphor.

The maturation of a forest fire requires the collusion of multiple factors—namely the presence of flammable media, an initiating spark, and a wind current to help spread the fire. Our conceived mapping of this analogy to influenza infection is as follows:

  1. Flammable Media: The Southern US appears to have an unusually high level of social connectivity; it is at least one order of magnitude higher than that in the north of the country (see GSS survey results (Smith et al., 1972) and Table 1). The number of close friends, close friends who are neighbors, and communities of people who all, or mostly, know each other is much higher in the South than in the country at large. Our conjecture is that a manifestation of this high-connectivity is the highest apparent percentage of people infected with influenza (20% as opposed to 4% in other parts of the country).

  2. Initiation Spark: An initial spark for the infection wave is generated by a combination of weather and demographic factors. Specifically, warm, humid places are conducive to influenza wave initiation - particularly in weeks where specific humidity drops. Airport proximity is important, as well as demographic and economic makeup and also the degree of urbanization. Note that the first static condition (warm humid places) is highly correlated with areas in the South with greater social connectivity. It is possible that static meteorological variables (warm mean temperature and high mean humidity) serve as proxies for high social connectivity or other correlated socioeconomic factors.

  3. Wind: The ‘wind’ in this analogy is the collective movement of a large number of people, integrated over time, revealing persistent ‘currents.’ These currents reproducibly point from coastlines and move inwards towards the center of the continent, making them perfect vehicles to transmit the infection inland from the shores.

Each of our three types of computational approaches has their strengths and weaknesses: (1) The Poisson mixed-effect regression allows for the direct comparison of the predictive strength of numerous predictor variables and accounts for spatial and temporal autocorrelation, but relies on strong modeling assumptions; (2) The non-parametric Granger analysis is not limited by restrictive modeling assumptions in our implementation, though it focuses only on trends of infection propagation between counties, and; (3) The county-matching analysis is also model-free, but this freedom comes at the expense of lesser statistical power.

What is new in this study?

The following aspects make our study of influenza triggers new in the influenza literature: (1) Instead of simulating the plausibility of one particular epidemic trigger model with a dynamic disease transmission model, we used formal model selection tools to compare the goodness of fit of hundreds of plausible models; (2) We explicitly attempted to systematically cross-compare the importance of numerous individual factors typically hypothesized to contribute to epidemic onset; (3) To accomplish this, we collected an unprecedented volume of temporal and spatial data on disease dynamics and the dynamics of putative predisposing factors; (4) We used several orthogonal, computational causality-inference techniques (one of which was developed specifically for this study) to probe associations between disease onset and putative epidemic triggers; (5) We tested our best models for their predictive potential and demonstrated that they are, indeed, suitable for forecasting disease waves, and; (6) We combined, for the first time, numerous candidate factors in a single, integrative study.

Convergent conclusions, culled from these radically different techniques, strengthen our claims and make it statistically unlikely that we are observing analysis artifacts. First, the Granger causality analysis results (Approach 1) provide insights into the details of influenza’s epidemiological dynamics. Figure 3G traces out the paths most likely followed by the infection, on average, across the continental US. We note that 75% of the streamlines sink in counties belonging to the Southern states, which matches up well with the streamline-encoded dynamics of weekly disease incidence over nine years (see Figure 3I). What drives this particular causality field’s geometry? While we cannot definitively answer this question, a comparison of the global patterns emerges from the local mobility data culled from the aforementioned Twitter database and offers a tentative explanation (see Figure 3H). Second, contrary to reported human travel pattern influence on seasonal epidemics (Viboud et al., 2006) (but consistent with [Gog et al., 2014]), we find that short-distance travel contributes more significantly to disease spread (see Figure 4). In particular, we find that long-range air travel is important as an epidemic trigger, but once infection waves are triggered, air travel patterns (or proximity to major airports) become less important. Short-range mobility, on the other hand, is apparently important for sustaining infection transmission over each season. Thus, we find short-range travel to be more important for defining the emergent spatio-temporal geometry of infection waves, while proximity to airports is more important for actually triggering an influenza season; the latter loses positive influence once an infection is under way. This conclusion is justified as follows: (1) When we performed regression calculations using all weeks for which data were available (as opposed to wave initiation weeks only) the airport proximity predictor coefficient turned out to be statistically significantly negative (see Supplement). (2) Results from our Granger-causal inference indicate that, on average, the local, putatively causal connections are far stronger compared to the putatively causal connections between counties within which the major airports are located (see Figure 4C and F). Additionally, from our best mixed-effect regression model (Figure 1A), we find that land connectivity effects are significantly stronger than air connectivity effects. The predictive value of Twitter connectivity, which intuitively captures both local and long-distance travel, lies in-between land and air connectivity coefficients. Note that Twitter connectivity is represented as a directed graph, where for each pair of counties, i and j, the (i,j) edge weight represents the conditional probability of ending at county j, given that a traveler/Twitter user started her journey in county i. Transition probabilities from i to j sum to one over all j. Therefore, intuitively, the Twitter connectivity graph should have the features of both a land-connectivity and an air travel graph; which indeed appears to be close to reality. 3) While airport diffusion is a significant factor in our best Poisson model (using data from the initiation period), the causal streamlines (constructed with the complete, all-year incidence data) do not seem to originate from airport-bearing counties.

The role of short-distance travel is particularly crucial in explaining influenza’s time-averaged, geo-spatial prevalence. While the mixed-effect regression analysis explains seasonal initiation in the vicinity of the continental US Southern shores, it might not, by itself, adequately explain its average prevalence patterns across the country.

Also not explained solely by our regression models is the occurrence of relatively high infection prevalence in the central parts of the country. These differences cannot be attributed to long-distance air travel, as discussed before. However, the routes taken by the causality streamlines (as computed by the non-parametric Granger analysis), interpreted as paths followed by an infection on average, suggest an explanation: The close match between the Granger-causal flow and the short-range mobility patterns (derived from Twitter analysis) strongly suggest that average disease prevalence is modulated by short-range mobility.

Rationale for observational analysis

The traditional empirical approach of testing a causal link between a factor and an outcome of an experiment was to vary one factor at a time, while keeping the other factors (experimental conditions) constant. This ‘all the rest of the conditions are equal’ assumption is often referred to by its Latin form as ceteris paribus. R.A. Fisher ([Fisher, 1935], p. 18) noted that, in real-life experiments, perfect ceteris paribus is not achievable ‘because uncontrollable causes which may influence the results are always … innumerable.’ Fisher’s proposed solution to this problem is to design experiments to involve random assignment of treatment (the putative causal factor’s states) to individual trials and then use regression analysis to estimate the value and significance of the putative causal effect.

Likewise, hypothesis-driven science, wherein investigators formulate a single, testable hypothesis and design specific experiments to test it, is a core element of the scientific method, and works well in most scientific fields. However, a new challenge emerges in data-rich scientific fields, such as genomics, epidemiology, economics, climate modeling, and astronomy: How do we choose the most promising hypotheses among millions of eligible candidates that potentially fit data? One solution to this challenge is the many-hypotheses approach, a method of automated hypothesis generation in which many hypotheses are systematically produced and simultaneously tested against all available data. This approach is currently used, for example, in whole-genome association or genetic linkage studies, and often enables truly unexpected discoveries. In contrast to the single-hypothesis approach, the many-hypotheses approach explicitly accounts for the large universe of possible hypotheses through calibrated statistical tests, effectively reducing the likelihood of accidentally accepting a mediocre hypothesis as a proxy for the truth (Nuzzo, 2014).

The many-hypotheses approach provides a complement to carefully controlled and highly focused wet laboratory experiments. Running controlled experiments to test a single hypothesis necessarily ignores many of the complexities of a real-world phenomenon; these complexities are necessarily present in large, longitudinal datasets. Of course, the data-driven ‘many-hypotheses’ approach is only one aspect of the broader scientific process progressing toward the development of verifiable general theories.

Agreement and disagreement between methods

Intuitively, we expected that all three approaches would produce similar, if not identical results. In practice, while the three approaches agreed in most cases, this agreement was not perfect. For example, the highlighted areas (greater incidence) in the first influenza season snapshot for Figure 2A–H each should match relatively well to the maps in Figure 6B (or at least some unspecified subset of ‘high incidence treatment counties’). While the county-matching results point to initiation at coasts, in Figure 2, 2006-2007 initiation seems to spread from the West Coast and, in , 2010 has a scattered pattern across the middle of the US.

The intuitive explanation of perceived discrepancy is that the matching method agrees with other analysis types predominantly, but not in all cases. Each analysis has limitations. In the case of the matching analysis, we have less statistical power than in, for example, Poisson regression; matching by numerous parameters reduces the initial set of thousands of counties to a handful of matching ‘treated’ counties (which meet a particular combination of weather and sociodemographic conditions) and ‘untreated’ counties (very similar to ‘treated’ ones in all respects but treatment). The difficult-to-match, ‘weeded out’ counties may happen to be in the coastal areas indicated as the most likely places of influenza wave origin by other analyses.

In the case of the 2007 and 2010 results, the matching analyses pick patterns that are different from those produced by the causality streamline analysis and mixed-effect Poisson regression models.

Figure 6’s Plate B shows the distribution of the treatment counties and matched-non-treatment counties. Note that here, we are not directly predicting initiation, so while the patterns in Figure 2 and Figure 6 should indeed show some similarity, they are not required to match up perfectly. The most similar treated counties do indeed show up in the Southern shores.

Generalizability

Our analysis uses no prior knowledge specific to influenza epidemiology. As such, these methods are not limited by either the pathogen under consideration (influenza), or the geospatial context (United States). The tools developed here are expected to be equally applicable to analyzing general epidemiological dynamics for pathogens other than influenza, unfolding in arbitrary geographical regions. The specific conclusions we draw about the initiation and propagation of the seasonal influenza in US might not hold true for influenza epidemiology in a different geographical context. However, the analysis tools are still applicable. More broadly, our tools delineate a general approach to modeling complex spatio-temporal dynamics, with applications beyond solely disease epidemiology.

We conclude by highlighting the structure of overlapping conclusions delivered by our three approaches. Approach 1: Granger-causality analysis suggests that an epidemic tends to begin in the South, near water bodies and that short-range, land-based travel is more influential compared to air-travel for infection propagation, providing a map of mean infection flow across the continental US. Approach 2: Poisson regression identifies significant predictive factors, ranks these factors by importance, suggests that Southern shores are where the epidemic begins, and corroborates Approach 1’s result on short-range vs. long-range travel. Approach 3 (county-matching): This approach drills down further to the epidemic onset source to the Southeastern shores of continental US, and identifies a smaller validated subset of predictive factors.

Materials and methods

The methods and putative predictors identified in our study should be directly applicable to data outside the US. While the balance (relative importance) of putative triggers of influenza waves may vary, the factors themselves should be universal globally, because the virus and host biology are universal.

Candidate factors in influenza initiation

To investigate county-specific variability, we grouped candidate factors into several categories: demographic, relation to human movement, infection state of county neighbors, and county’s own recent state, and climatic.

Major hypotheses regarding putative causal factors affecting infection dynamics can be traced to a handful of earlier publications:

Human movement

We considered two measurements of human movement: (1) The first measurement reflects the proximity of counties to major airports. We computed an exponentially-diffusing influence from counties with major US airports, weighted by passenger enplanements at their respective locations. This accounted for people moving to and from both major airports and neighboring counties, and; (2) The second measurement is a large-scale movement matrix representing people’s week-to-week travels between counties. These data were culled from a complete collection of geo-located Twitter messages, captured over 3.5 years, and constituted a large-scale, longitudinal sample of individual movements. We used only automatically geo-tagged tweets.

Demographic

Influenza is transmitted through direct contact with infected individuals, via contaminated objects, and virus-laden aerosols. Thus, human population density (how many people happen to be around) and social connectivity (how many people interact with each other and how frequently [Bedford et al., 2015]) are factors expected to affect local virus incubation and spread. In addition to population density, we considered socioeconomic factors such as county-mean household income, levels of poverty and urbanity, as well as the prevalence of ethnic and age groups. All these socio-demographic and socio-economic data were derived from reports provided by the US Census ([Web Page]; 2017. Available from: https://www.census.gov/geo/reference/ua/uafacts.html).

Return-to-school effect

Social contact among children in schools has been extensively investigated as a determinant of the peak incidence rate. This is one of the few factors that might lend itself to intervention relatively easily, and hence the interest is well-justified. While any reduction in social contact should, in theory, directly impact transmission, quantifying the effect of this specific mode of contact on the incidence rate has been difficult to calculate. Predictions of the reduction in the peak incidence associated with reduced social contact were typically 20–60% (Ferguson et al., 2006; Haber et al., 2007), with some studies predicting much larger reductions of 90% (Mniszewski et al., 2008; Ghosh and Heffernan, 2010). Reductions in the cumulative attack rate (AR, ratio of the number of new cases to the size of the population at risk) were usually smaller than those in the peak incidence. Several studies predicted small (10%) or no reduction in the cumulative AR (Ferguson et al., 2006; Haber et al., 2007; Yasuda et al., 2005; Ciofi degli Atti et al., 2008; Yasuda et al., 2008; Kelso et al., 2009; Davey et al., 2008; Rizzo et al., 2008; Vynnycky and Edmunds, 2008; Glass and Barnes, 2007; Lee et al., 2010; Yang et al., 2011; Zhang et al., 2012), whilst a few predicted substantial reductions (e.g. 90%) (Glass et al., 2006; Davey et al., 2008; Elveback et al., 1976; Davey and Glass, 2008). Only two studies (Glass et al., 2006; Lee et al., 2010) predicted that peak incidence might increase markedly under certain circumstances following school closures, for example by 27% if school closures caused a doubling in the number of contacts in the household and community, or by 13% if school systems were closed for two weeks at a prevalence of 1% in the general population. Studies have also investigated the effect of such interventions on children vs. adults; one study predicted an overall reduction in the cumulative AR, but an increase of up to 48% in the cumulative AR for adults in some situations (Araz et al., 2012).

While this diverse set of predictions in the literature often pertains to the effect of school closures as an intervention tool, we are more interested in the influence that the current school schedule has, if any, on triggering an epidemic. To answer this specific question, we formulated a simple statistical test to determine whether the timing of return-to-school after summer and winter holidays significantly predicts influenza season initiation. We found insufficient evidence in support of this effect (see Methods and materials).

Climate variables

Specific humidity and a drop in temperature have been suggested as the key drivers in triggering seasonal influenza epidemics (Lowen et al., 2008; Shaman et al., 2010). These initial conclusions were drawn from experiments conducted using an animal model (guinea pig), under controlled laboratory conditions (Lowen et al., 2008), followed by indirect support from epidemiological modeling (Shaman et al., 2010).

Vaccination coverage

Vaccination is widely regarded as our most promising tool to combat influenza, though antigenic variation between seasons makes it difficult to craft an effective vaccination strategy (Boni, 2008). Understanding how the virus will evolve in the short-term is key to finding the correct antigenic match for an upcoming influenza season. Additionally, short-term molecular evolution might rapidly give rise to immune-escape variants that, if detected, might dictate intra-season updates in vaccine composition. More importantly, vaccination itself might exert significant selection pressure to influence antigenic drift. The effect of vaccination on viral evolution has been documented in an avian H5N2 lineage in vaccinated chickens (Lee et al., 2004), suggesting that similar processes might be occurring in human counterparts. The diversity of the surface proteins at any point in time between seasons suggests that our current vaccination strategies are limited to confer partial immunity, which can result in a highly immune or vaccinated population selectively pressurizing the viral population to evolve more quickly than usual. Given that influenza moves quickly across geographies and that there are multiple, co-circulating strains that may confer partial cross-protection, changing viral service proteins represent a ‘moving target’ for the human immune system. There are additional complexities due to early-life immune imprinting in humans.

Accounting for non-vaccination host immunity

Resistance to influenza infection in human hosts arises via two related mechanisms: immunological memory from a previous infection by an identical or sufficiently similar strain, or vaccination against the current strain (or a sufficiently similar strain). In our analysis, we explicitly account for the degree of vaccination coverage. Accounting for host immunity is more difficult, because resistance to infection is not directly observable; however, our analysis also uses the antigenic variation of influenza virus as a proxy to host resistance: If the relative antigenic variation is large, then the susceptibility of non-vaccinated hosts increases, while low antigenic variation increases the probability of encountering a resistant host. In addition, we use the absolute antigenic deviation of the later-season virus from the first-season virus in our data set (winter of 2003–2004) as a predictor, in order to capture longer-term effects of immunological memory. Thus, if the magnitude of the relative drift between two succeeding years is small, we have a likely decrease in susceptibility. Likewise, if the absolute deviation from a few years back starts decreasing, we would also register a decrease in susceptibility. Incorporating these factors in the diverse set of models that we investigate, guarantees that we are indeed considering host susceptibility contributions from a wide range of possible mechanisms.

Antigenic Variation

The influenza virus counteracts host immunity via subtle genomic changes over time. The more gradual process, known as antigenic drift, is a manifestation of a gradual accumulation of mutations within viral surface proteins recognized by human antibodies, such as hemagglutinin (HA) and neuraminidase (NA). These mutations are typically introduced during cycles of viral replication (Boni et al., 2004). Most of these mutations are neutral, i.e. they do not affect the functional conformation of the viral proteins. However, some of these alterations do, in fact, sufficiently change secondary and tertiary protein structures to have a negative impact on the binding of host antibodies raised in response to previously circulating strains (Webby and Webster, 2001). (Many such mutations also reduce the virus’s viability.) Thus, while a single infection episode is potentially enough to provide long-term host immunity to the invading strain, antigenic variation due to intense selection pressure gives rise to novel viral strains, making re-infections possible within the span of a few years (Andreasen, 2003). This kind of perpetual Red-Queen arms race injects influenza dynamics with auto-correlative dependencies over multiple seasons. It has been suggested that substantial antigenic drift might be associated with more severe, early-onset influenza epidemics, resulting in increased mortality (Treanor, 2004). In contrast to antigenic drift, antigenic shift is an abrupt, major change in virus structure due to gene segment re-assortment that occurs during simultaneous infection of a single host by multiple influenza subtypes (De Clercq, 2006). Antigenic shift results in new versions of viral surface proteins. Antigenic shift due to re-assortment gives rise to novel influenza subtypes that, if capable of sustained human-to-human transmission, can have devastating consequences for human populations, e.g. the 2009 H1N1 pandemic (Neumann et al., 2009).

In our analysis, we factor in the potential effect of antigenic variation by estimating the surface protein’s population diversity – hemagglutinin (HA), neuraminidase (NA), matrix protein 1 (M1), and matrix protien M2 – as a function of time and geographical sample collection location. Our rationale for our focus on these proteins is that HA, NA, and to some degree M1, are all present on the viral surface (Lamb et al., 1985), contribute to viral assembly, and mediate the release of membrane-enveloped particles (Chlanda et al., 2015). M2 has been shown to have enhanced the pandemic 2009 influenza A virus [(H1N1)pdm09] (Friedman et al., 2017) HA-pseudovirus infectivity (Alvarado-Facundo et al., 2015).

Data sources

Social connectivity

Data was obtained from the General Social Surveys, NORC, See Table 1, [].

Clinical data source

The source of the clinical incidence data used in this study is the Truven Health Analytics MarketScan® Commercial Claims and Encounters Database for the years 2003 to 2012 (Hansen, 2017). The database consists of approved commercial health insurance claims for between 17.5 and 45.2 million people annually, with linkage across years, for a total of approximately 150 million individuals. This US national database contains data contributed by over 150 insurance carriers and large, self-insuring companies. We scanned 4.6 billion inpatient and outpatient service claims and identified almost six billion diagnosis codes. After un-duplication, we identified approximately 12.8 unique diagnostic codes per individual. We processed the Truven database to obtain the reported weekly number of influenza cases over a period of 471 weeks spanning from January 2003 to December 2013, at the spatial resolution of US counties. To define influenza in insurance claims, we used the following set of ICD9 codes: 487.8, 488.12, 488.1, 488.0, 488.01, 488.02, 487.0, 487.1, 488.19, 488.09, 488, 487, and 488.11. We also considered including non-specific codes representing unspecified viral infection (079.99), as suggested in (Viboud et al., 2014), and decided against it. (We provide explicit annotation for each code, as well as rationale for choosing narrower ICD-code definition of ILIs, in the Supplement Section S-D).

A peak percentage of ILI-affected people of 20% may seem high–especially given that it is a much greater percentage than the change in seropositivity over most influenza seasons. Note that our apparent estimates of influenza infection rate are unavoidably inflated, because what we are measuring is the week- and county-specific prevalence of influenza-like illnesses among insured patients who contacted their physician for any reason during the week in question. In other words, the denominator that we used for computing prevalence was almost certain to be much smaller than the overall population of the corresponding county.

Data on antigenic drift for influenza A

Sequence data for this computation was obtained from the National Institute of Allergy and Infectious Diseases (NIAID) Influenza Research Database (IRD) (Zhang et al., 2017) through their web site at http://www.fludb.org.

Data on vaccination coverage

Data on vaccinations was extracted from our EHR database corresponding to the procedural codes 90661 and Q2037, which corresponds to the dominant influenza vaccines. (http://flu.seqirus.com/files/billing_and_coding_guide.pdf)

Data on human air travel

We used a complete, directed graph of passenger air travel for 2010, accounting for the number of passengers transported in each direction (The United States Bureau of Transportation Statistics, 2010). For each county, we computed an air neighborhood network: For counties i and j, the incoming edge to county i represents the proportion of passengers, and pji represents the ratio of all passengers who traveled from county i to county j by plane,mto the total number of travelers who left county i by plane during the year, so that jipji=1.

Data on general human travel patterns in the US

Using the complete Twitter dataset, we aggregated a movement matrix to capture people’s week-to-week travels between counties from geo-located Twitter messages (captured during the period of January 1, 2011 through June 30, 2014). This dataset includes approximately 1.7×109 messages and represents 3×108 user-days of location information. A small, but significant, percentage of Twitter messages are automatically tagged with the author’s current latitude/longitude information, as determined by their mobile device. Each latitude/longitude-annotated tweet was mapped to a FIPS county code based on Tiger/Line shape files from the 2013 Census dataset (http://www.census.gov/geo/maps-data/data/tiger-line.html). In addition, we calculated a variant of our movement matrix to capture seasonality and other temporal dynamics: A set of 52 movement matrices captured weekly, county-to-county movements, observed in each week of the year, aggregating each week based on the corresponding observations, from each year from 2011 through 2014.

This dataset constitutes a large-scale, longitudinal sample of individual movements. We found that the movement patterns are consistent with intuitive patterns and prior studies of large-scale movements; the majority of people in a county remain in the same county week-over-week. Most travel between counties occurs between neighboring counties, and between counties and large metropolitan areas, conditioned on distance and size of the metropolitan area. We represent our movement data as an n×n matrix 𝐌 that captures the likelihood that a person observed in county i during the course of a week will be observed in county j in the following week. We calculate the entries mi,j of matrix 𝐌 as follows:

(1) mi,j=t<T1Popi,tu<mxu,i,txu,j,t+1

For each time interval, we computed the mean movement vector from a given county by summing all county-to-county movement vectors, weighted by the proportion of people moving in each direction.

To investigate the role of proximity to major airports, we modeled influence diffusion as follows: Let xi be the ith county, and vi be the total contribution obtained by diffusing influences from the major airport bearing counties. Let xk be the kth airport-bearing county, and let N be the total number of major airport locations considered, in our case, N=27. Let gk be the volume of traffic for the kth major airport location. We then compute vi as follows:

(2) vi=k=1Ngke-C0Θ(xk,xi)

where Θ(,) is the distance in miles between two locations, we computed using the Haversine approximation. The value of the constant C0 was chosen to be 0.1. Small variations in the constant do not significantly alter our conclusions. As noted above, proximity to airports had a significant positive influence in sparking seasonal epidemics; the influence is significantly weaker once an outbreak is well under way.

Estimating antigenic diversity

In this study, we measured antigenic diversity as follows: Let Si,x,t be the set of amino acid sequences for the ith protein (one of HA, NA, M1, or M2), collected in year t (t ranging between 2003 and 2011), in state x of the continental US. The temporal resolution of the sequence data is thus set to years instead of weeks, and the spatial resolution to states instead of counties. These resolutions are coarse compared to our EHR data on infection incidences, and is set in this manner to maintain sufficient statistical power. For each such set of amino acid sequences Si,x,t, we compute the set D(Si,x,t) of pairwise edit distances:

(3) D(Si,x,t)={y:y=(s1,s2),s1,s2Si,x,t,s1s2}

where (s1,s2) is the standard edit distance (also known as the Levenshtein distance) between the sequences s1,s2. Mathematically, the Levenshtein distance between two strings a,b (of length |a|,|b| respectively) is given by:

(4) (a,b)=leva,b(|a|,|b|), where
(5) leva,b(i,j)={max(i,j) ifmin(i,j)=0,min{leva,b(i-1,j)+1leva,b(i,j-1)+1leva,b(i-1,j-1)+1(aibj) otherwise.

where 1(aibj) is the indicator function equal to 0 when ai=bj and equal to one otherwise, and leva,b(i,j) is the distance between the first i characters of a and the first j characters of b.

The antigenic diversity of the ith protein at time t in state x was then defined as the median of the distribution of the values in the set D(Si,x,t). Clearly, as the sequences became more diverse at a point in time and space due to molecular variations brought about by either drifts or shifts, the measure deviated more from zero. Use of the median provided robustness to outliers.

Importantly, here we did not compare temporal changes in viral antigenic makeup directly, but estimated the time-specific diversity in the viral protein primary structure. We expected our diversity measure to be representative of the cumulative changes that occurred within each of the nine influenza seasons.

Estimating vaccination coverage

We incorporated the effect of vaccination coverage by estimating the cumulative fraction of the population that received the current influenza vaccine within the previous 20 week period. This approach to estimating vaccination coverage does not correct for season-specific antigenic match, or the lack thereof. Nevertheless, because we explicitly included measures of antigenic diversity in addition to vaccination coverage, we expected that effects arising from the degree of antigenic match would indeed be factored in; if there is a significant mismatch, we expected the antigenic diversity in that year to be less and vice verse. This assumes implicitly that vaccination does indeed play a major role in exerting significant selection pressure, the assumption that was reinforced by our observation of a strong dependency between vaccination coverage and normalized antigenic diversity.

We found that antigenic diversity is quite strongly affected by vaccination coverage. This reflects the theoretical predictions in Boni et al. (Boni et al., 2006), where it is shown that the amount of observed antigenic drift increases as immunity in the host population increases and pressures the virus population to evolve.

Estimating the effect of return-to-school days

The absence of consensus, and the diversity of modeling assumptions pertaining to this effect (described above) makes it difficult to validate conclusions in large scale epidemiological data. We carried out a simple test to determine whether there is statistical evidence that after-holiday, return-to-school periods in August-September and in January predict or trigger the seasonal epidemic.

For this test, we assumed a broad window to cover all such school openings across the continental US including the last week in August, the entire month of September, two weeks in October, and two weeks in January. We next carried out a Fisher’s exact test to determine whether the overlap between these weeks and the identified ‘trigger period’ (See Table LABEL:SI-tabschool) for the seasonal epidemic are sufficiently non-random.

Weather data

The dataset starts with the week beginning December 31 st, 2002 and includes 522 weeks (which ends exactly on the week ending December 31 st, 2012). Temperature and precipitation data come from the 2.5 arcminute (approximately 4 km) PRISM (Oregon State University, 2014) dataset and other variables (wind speed, specific humidity, surface pressure, downward incident, and shortwave solar radiation) come from the 7.5 arcminute (approximately 12 km) Phase 2 North American Land Data Assimilation System (NLDAS-2) dataset (Mitchell, 2004; Cosgrove et al., 2003). These datasets were selected in large part due to the fact that both are updated in near real-time, making it possible to use these datasets for future monitoring applications. PRISM is released daily, with an approximately 13 hr delay (data for the previous day is released at 1pm EST each day) while NLDAS is released daily, with an approximately 2.5 day delay).

Variables were aggregated to county boundaries based on shapefiles from the GADM database of Global Administrative Areas (Areas, 2014). Where appropriate, we considered both the average daily climate variable (for example, the daily maximum temperature averaged over the week) as well as the the maximum and/or minimum of the variable experienced over the week. For precipitation, we considered only the cumulative total precipitation experienced during the week.

Three approaches

Approach 1: Non-parametric Ganger analysis of county-specific incidence

Analysis of quantized clinical data

For the causality analysis, we started with an integer-valued time series for each US county, and to carry out the causality analysis, we first quantized the series in two steps:

  1. 1. Computing the difference series, i.e., the weekly change in the number of reported cases.

  2. 2. Mapping positive changes to symbol ‘1’ and negative changes to symbol ‘0’.

This mapped each data series to a symbol stream over a binary alphabet. The binary quantization is not a restriction imposed by the inference algorithm; while we do require quantized magnitudes, longer data streams can be encoded with finer alphabets to accommodate an arbitrary precision. For this specific dataset, the relatively short length of the county-specific time-series necessitated a coarse quantization in order for the results to have a meaningful statistical significance.

Given a pair of such quantized streams sa,sb, the algorithm described in the Supplement computes two coefficients of causality, one for each direction. Intuitively, the coefficient γba from sa to sb is a non-dimensional number between 0 and 1 that quantifies the amount of information that one may acquire about the second stream sb from observing the first stream (sa). More specifically, γba is the average reduction in the uncertainty of the next predicted symbol in stream sb in bits, per bit, acquired from observed symbols in stream sa. It can be shown that the coefficient of causality γba is 0 if and only if there is no causal influence from sa to sb in the sense of Granger, and assumes the maximum value 1 if and only if sb is deterministically predictable from sa. Moreover, γba=γab=0 if and only if sa and sb are statistically independent processes (Chattopadhyay, 2014). It is trivial to produce examples where we would have γba=0,γab>0 illustrating the ability of the algorithm to capture the asymmetric flow of causal influence in one preferred direction and not the other.

Additionally, whenever we computed the causality coefficient, it was always associated with a time delay: We calculated the coefficient for predicting the target stream some specified number of steps in the future, and the computed coefficient was thus parameterized by this delay. In our analysis, we computed coefficients up to a maximum delay of 10 weeks, and, for each pair of counties, selected the optimum delay which gave rise to the largest coefficient. For more details on the algorithm, see the Supplement.

Computation of causality fields and causality streamlines

To analyze the propagation dynamics of disease, we defined the notion of a Granger causal flow between two counties. Treating county-specific changes in disease prevalence as a time-stamped data stream, we quantified the directional strength of the causal flow between two counties as the degree of predictability of one stream’s future, given the history of the other (see Figure 3A–D). Our new algorithm was able to compute a coefficient of causality: an information-theoretic measure of the information in bits that one stream communicates about another in a direction-specific manner. A strong coefficient of this type suggests that either the disease itself, or an underlying cause, may be propagating from the former county to the latter.

To explore the country-wide propagation dynamics, we stitched together the properly-aligned, adjacent between-county causal flow vectors across the whole US map into causality streamlines, representing the average of multiple spatial infection propagation waves over time. To compute causality streamlines, we needed a precise notion of county neighbors. We considered two counties to be neighbors if either they shared a common border, or if one county was reachable within a line-of-sight distance of 50 miles from any point within another. The latter condition removed ambiguities as to whether counties touching at a point should be still considered as neighbors. The exact distance value (50 miles) does not significantly change our results, as long as it does not vary significantly. With the definition of the neighborhood map in place, we proceeded to compute the direction-specific coefficients of causality between neighboring counties. It follows that we would obtain a set of coefficients for each county and one for each of its neighbors, capturing the degree of causal influence from a given particular county to its respective neighbors. Our algorithm also computed the probability of each coefficient arising by chance alone, and we ignored coefficients that have more than a 6% probability that two independent processes lacking any causal connection gave rise to the particular computed value of the coefficient. Once the coefficients had been calculated for each neighbor, we computed the resultant direction of causal influence outflow from that particular county. This was carried out by visualizing the causality coefficients as weights on the length of the vectors, from the centroid of the considered county to the centroids of its neighbors. We then calculated the resultant vector (see Figure 3). Viewed systematically across the continental US, these local vectors formed a discernible pattern; we observed the emergence of a non-random structure with clearly discernible paths outlining the ‘causality field’ (see Figure 3, Plates G, J, K). To interpret the plots, note that streamlines start at their thinnest part, their direction is indicted with thickening line; typically multiple streamlines coalesce into a river-like pattern.

Inferring statistical ‘Granger-causality’ from data

Granger attempted to obtain a precise definition of causal influence (Granger, 1980) from the following intuitive notion: Y is a cause of X, if it has unique information that alters the probabilistic estimate of the immediate future of X.

Here, we used a new, non-parametric approach to Granger causal inference (Chattopadhyay, 2014) (Approach 1). In contrast to state-of-the-art binary tests (Baek and Brock, 1992; Hiemstra and Jones, 1994), we computed the degree of causal dependence between quantized data streams from stationary ergodic sources in a non-parametric, non-linear setting.

Our approach was significantly more general to common, regression-based implementations of Granger causal inference, and did not involve any autoregressive moving average (ARMA) modeling. All such commonly used techniques impose an a priori linear structure on the data, which then constrains the class of dependency structures we can hope to distill.

True causality, in a Humean sense, cannot be inferred (Hume, 1993; Kant, 1998). Among other reported approaches to ascertaining causal dependency relationships, the work of J. Pearl (Pearl, 2009a) is perhaps most visible, and builds on the paradigm of structural causal models (SCM) (Pearl, 2009b). While Pearl’s work is often claimed to be able to answer causal queries regarding the effects of potential interventions, as well as regarding counterfactuals, our objective in this paper is somewhat different. We are interested in delineating whether infection transmission pathways can be distilled from the patterns of infection’s spatio-temporal incidence.

Approach 2: Poisson regression on putative factors

In Approach 2, we investigated the relative importance of putative factors as follows: Specifically, let Nijk denote the total number of patients in county i, who are of age j, and gender k. Denoting the number of individuals diagnosed with influenza in a given county during given week as yijk, we modeled the within-county disease incidence counts for every county (for which data was available) in the US using the following mixed-effect Poisson regression model:

(6) P(yijk|λi,j,k)=λijkyijke-λijkyijk!
with λijk=Nijkexp{α+Xb+sib1,i+nib2,i+Zv},

where, α is the intercept, 𝐗 and 𝐛 are a fixed-effect design matrix and a vector of fixed effects, respectively, 𝐬i is 1×m vector of changes in rate of infection in the ith county 1,2,..,m weeks prior to the current, 𝐛1,i is a m×1 vector of auto-correlation fixed effects, 𝐧i is a 1×(3m) vector of changes in the rate of infection in the neighbors of the ith county 1,2,..,m weeks prior to the current (the neighbors are subdivided into land neighbors, Twitter neighbors, and air neighbors), 𝐛2,i is a (3m)×1 vector of county-neighborhoods fixed effects, and 𝐯 and 𝐙 are random effects and their design matrix, respectively. Variable m represents the depth of ‘memory’ of auto-regression in weeks, in our case m=4.

In this way, the total number of rows in matrix 𝐗 is 510×3,143 (weeks × counties), with county-specific socioeconomic covariates and week-specific weather covariates. For disease initiation analysis, we included only a subset of time series covering approximately 50 weeks.

We did not use an explicit spatial smoothing. The county-level random effects were used to account for possible heterogeneity in disease reporting across counties. However, as predictors, we used average infection rate values of immediate neighbors for the county (one, two, three, and four weeks in the past). This prediction structure afforded an implicit spatial smoothing, dampening spurious fluctuations in infection rates.

Out-of-sample prediction and ROC analysis with mixed-effect poisson regression

We carried our out-of-sample prediction with the models inferred with mixed-effect regression. The steps were as follows:

  1. We trained the model parameters with data from the trigger periods corresponding to the first six seasons.

  2. Once we identified the coefficient’s variables, we used it to predict the response variable (influenza incidence) for the last three seasons.

As expected, the predicted incidence does not exactly match the observed out-of-sample data. Nevertheless, we see positive correlation (Figure 5, Plate A). Because we were modeling a necessarily spatio-temporal stochastic process, the predictive ability of the model is difficult to judge simply from the observed positive correlation. To resolve this, we investigated the performance of our model by computing how well it predicted the counties that experience flare-ups during the trigger-periods in the out-of-sample data. This exercise is reduced to a classification problem, by first choosing a threshold on the number of reported cases per week to define what is meant by a ‘flare-up’ (see description below). To quantify the prediction performance, we constructed ROC curves for each of the three target seasons, for each fixed week, as follows:

  1. We first quantized the incidence data to reduce it to a binary stream. In particular, we chose a threshold (ten), such that for each county, and each week, we reported a ‘1’ if the number of reported cases was greater than the threshold, and ‘0’ otherwise. Note this quantization is different to what we used in carrying out our non-parametric Granger analysis in Approach 1.

  2. For each county i, and week j, we then ended up with the binary class variable Xj,i{0,1} and a decision variable Yj,i, where the former is the quantized incidence described above, and Yj,i is the response predicted by the model, normalized between 0 and 1.

  3. For a chosen decision threshold θD, we could determine the predicted class X^j,i{0,1} as:X^j,i={1, if Yj,iθD0 otherwise

  4. Comparing the observed and predicted classes, we computed the false positive rate (FPR: defined as the ratio of false positives to the sum of false positives and true negatives), and the true positive rate (TPR: defined as the ratio of true positives to the sum of true positives and false negatives). Finally, we constructed the ROC curve, which shows the relationship of the TPR and the FPR as the decision-threshold θD is varied.

  5. We constructed Receiver Operating Characteristic (ROC) curves for each week in the out-of-sample period, and estimated the area under curve (AUC). The AUC measures the performance of the predictor (our model) to correctly classify the counties that would go on to have a disease incidence greater than the initial set threshold (ten cases in our analysis). In the perfect case, we would have an AUC of 100%, which implies that we can achieve zero false positives, while getting a 100% true positive rate. Our best model achieves approximately 80% AUC for the trigger weeks, as shown in Figure 5.

Variable and model selection

Approaches 2 and 3 have different limitations; these affect which approach is most effective at which statistical detection task.

Approach 2 (mixed-effect regression analysis) can use a whole dataset for model selection, that can span individual predictors, their combinations, and even interactions between factors (see Table 5Figure 1—figure supplement 2; Figure 1—figure supplement 3; Figure 1—figure supplement 4; Source data 2; Source data 3; Supplementray file 3; Source code 1). Model selection allows us to choose a model with the right balance of complexity and explanatory power, thus enabling this analysis to detect collinear explanatory variables and drop the weaker predictors that increase model complexity but do not add explanatory power.

Table 5
Different Models Considered and DIC Ranking
https://doi.org/10.7554/eLife.30756.021
Equation used in Poisson regressionDIC
flu LOGN + 1 + max_HUS_mean + d_max_HUS_0 + d_max_HUS_min_1 + d_max_HUS_min_2 + d_max_HUS_min_3 + t_avg_mean + d_t_avg_0 + d_t_avg_min_1 + d_t_avg_min_2 + d_t_avg_min_3 + max_HUS_mean * t_avg_mean + d_max_HUS_0 * d_t_avg_0 + d_max_HUS_min_1 * d_t_avg_min_1 + d_max_HUS_min_2 * d_t_avg_min_2 + d_max_HUS_min_3 * d_t_avg_min_3 + RSDS_mean + d_RSDS_0 + d_RSDS_min_1 + d_RSDS_min_2 + d_RSDS_min_3 + AVG_PRESS_mean + d_AVG_PRESS_0 + d_AVG_PRESS_min_1 + d_AVG_PRESS_min_2 + d_AVG_PRESS_min_3 + Income + airport_diffusion + Am_Ind + Asian + White_Hisp + W_non_Hisp + Black_Hisp + B_non_Hisp + Pacific + Insured+Poor + Urban+v1+v2+land1+land2+land3+land4+tweet1+tweet2+tweet3+tweet4+air1+air2+air3+air4+HA + M1+M2+NA.185942
flu LOGN + 1 + max_HUS_mean + d_max_HUS_0 + d_max_HUS_min_1 + d_max_HUS_min_2 + d_max_HUS_min_3 + d_max_HUS_min_4 + t_avg_mean + d_t_avg_0 + d_t_avg_min_1 + d_t_avg_min_2 + d_t_avg_min_3 + d_t_avg_min_4 + RSDS_mean + d_RSDS_0 + d_RSDS_min_1 + d_RSDS_min_2 + d_RSDS_min_3 + d_RSDS_min_4 + AVG_PRESS_mean + d_AVG_PRESS_0 + d_AVG_PRESS_min_1 + d_AVG_PRESS_min_2 + d_AVG_PRESS_min_3 + d_AVG_PRESS_min_4 + tot_prec_mean + d_tot_prec_0 + d_tot_prec_min_1 + d_tot_prec_min_2 + d_tot_prec_min_3 + d_tot_prec_min_4 + Wind_avg_mean + d_Wind_avg_0 + d_Wind_avg_min_1 + d_Wind_avg_min_2 + d_Wind_avg_min_3 + d_Wind_avg_min_4 + Income + airport_diffusion + Am_Ind + Asian + White_Hisp + W_non_Hisp + Black_Hisp + B_non_Hisp + Pacific + Insured+Poor + Urban+v1+v2+land1+land2+land3+land4+tweet1+tweet2+tweet3+tweet4+air1+air2+air3+air4+HA + M1+M2+NA.+HA * NA.185940.6
flu LOGN + 1 + max_HUS_mean + d_max_HUS_0 + d_max_HUS_min_1 + d_max_HUS_min_2 + d_max_HUS_min_3 + t_avg_mean + d_t_avg_0 + d_t_avg_min_1 + d_t_avg_min_2 + d_t_avg_min_3 + max_HUS_mean * t_avg_mean + d_max_HUS_0 * d_t_avg_0 + d_max_HUS_min_1 * d_t_avg_min_1 + d_max_HUS_min_2 * d_t_avg_min_2 + d_max_HUS_min_3 * d_t_avg_min_3 + RSDS_mean + d_RSDS_0 + d_RSDS_min_1 + d_RSDS_min_2 + d_RSDS_min_3 + AVG_PRESS_mean + d_AVG_PRESS_0 + d_AVG_PRESS_min_1 + d_AVG_PRESS_min_2 + d_AVG_PRESS_min_3 + Income + airport_diffusion + Am_Ind + Asian + White_Hisp + W_non_Hisp + Black_Hisp + B_non_Hisp + Pacific + Insured+Poor + Urban+v1+v2+land1+land2+land3+land4+tweet1+tweet2+tweet3+tweet4air1+air2+air3+air4+HA + M1+M2+NA.+Cum_vac_per_N_0185938.1
flu LOGN + 1 + max_HUS_mean + d_max_HUS_0 + d_max_HUS_min_1 + d_max_HUS_min_2 + d_max_HUS_min_3 + d_max_HUS_min_4 + t_avg_mean + d_t_avg_0 + d_t_avg_min_1 + d_t_avg_min_2 + d_t_avg_min_3 + d_t_avg_min_4 + RSDS_mean + d_RSDS_0 + d_RSDS_min_1 + d_RSDS_min_2 + d_RSDS_min_3 + d_RSDS_min_4 + AVG_PRESS_mean + d_AVG_PRESS_0 + d_AVG_PRESS_min_1 + d_AVG_PRESS_min_2 + d_AVG_PRESS_min_3 + d_AVG_PRESS_min_4 + tot_prec_mean + d_tot_prec_0 + d_tot_prec_min_1 + d_tot_prec_min_2 + d_tot_prec_min_3 + d_tot_prec_min_4 + Wind_avg_mean + d_Wind_avg_0 + d_Wind_avg_min_1 + d_Wind_avg_min_2 + d_Wind_avg_min_3 + d_Wind_avg_min_4 + Income + airport_diffusion + Am_Ind + Asian + White_Hisp + W_non_Hisp + Black_Hisp + B_non_Hisp + Pacific + Insured+Poor + Urban+v1+v2+land1+land2+land3+land4+tweet1+tweet2+tweet3+tweet4+air1+air2+air3+air4+HA + M1+M2+NA.+Cum_vac_per_N_diff_1 + Cum_vac_per_N_diff_2 + Cum_vac_per_N_diff_3 + Cum_vac_per_N_diff_4185935.9
flu LOGN + 1 + max_HUS_mean + d_max_HUS_0 + d_max_HUS_min_1 + d_max_HUS_min_2 + d_max_HUS_min_3 + d_max_HUS_min_4 + t_avg_mean + d_t_avg_0 + d_t_avg_min_1 + d_t_avg_min_2 + d_t_avg_min_3 + d_t_avg_min_4 + RSDS_mean + d_RSDS_0 + d_RSDS_min_1 + d_RSDS_min_2 + d_RSDS_min_3 + d_RSDS_min_4 + AVG_PRESS_mean + d_AVG_PRESS_0 + d_AVG_PRESS_min_1 + d_AVG_PRESS_min_2 + d_AVG_PRESS_min_3 + d_AVG_PRESS_min_4 + tot_prec_mean + d_tot_prec_0 + d_tot_prec_min_1 + d_tot_prec_min_2 + d_tot_prec_min_3 + d_tot_prec_min_4 + Wind_avg_mean + d_Wind_avg_0 + d_Wind_avg_min_1 + d_Wind_avg_min_2 + d_Wind_avg_min_3 + d_Wind_avg_min_4 + Income + airport_diffusion + Am_Ind + Asian + White_Hisp + W_non_Hisp + Black_Hisp + B_non_Hisp + Pacific + Insured+Poor + Urban+v1+v2+land1+land2+land3+land4+tweet1+tweet2+tweet3+tweet4+air1+air2+air3+air4+HA + M1+M2+NA.185933.7
flu LOGN + 1 + max_HUS_mean + d_max_HUS_0 + d_max_HUS_min_1 + d_max_HUS_min_2 + d_max_HUS_min_3 + d_max_HUS_min_4 + t_avg_mean + d_t_avg_0 + d_t_avg_min_1 + d_t_avg_min_2 + d_t_avg_min_3 + d_t_avg_min_4 + RSDS_mean + d_RSDS_0 + d_RSDS_min_1 + d_RSDS_min_2 + d_RSDS_min_3 + d_RSDS_min_4 + AVG_PRESS_mean + d_AVG_PRESS_0 + d_AVG_PRESS_min_1 + d_AVG_PRESS_min_2 + d_AVG_PRESS_min_3 + d_AVG_PRESS_min_4 + tot_prec_mean + d_tot_prec_0 + d_tot_prec_min_1 + d_tot_prec_min_2 + d_tot_prec_min_3 + d_tot_prec_min_4 + Wind_avg_mean + d_Wind_avg_0 + d_Wind_avg_min_1 + d_Wind_avg_min_2 + d_Wind_avg_min_3 + d_Wind_avg_min_4 + Income + airport_diffusion + Am_Ind + Asian + White_Hisp + W_non_Hisp + Black_Hisp + B_non_Hisp + Pacific + Insured+Poor + Urban+v1+v2+land1+land2+land3+land4+tweet1+tweet2+tweet3+tweet4+air1+air2+air3+air4+HA + M1+M2+NA.+Cum_vac_per_N_0185932.3
flu LOGN + 1 + max_HUS_mean + d_max_HUS_0 + d_max_HUS_min_1 + d_max_HUS_min_2 + d_max_HUS_min_3 + d_max_HUS_min_4 + t_avg_mean + d_t_avg_0 + d_t_avg_min_1 + d_t_avg_min_2 + d_t_avg_min_3 + d_t_avg_min_4 + RSDS_mean + d_RSDS_0 + d_RSDS_min_1 + d_RSDS_min_2 + d_RSDS_min_3 + d_RSDS_min_4 + AVG_PRESS_mean + d_AVG_PRESS_0 + d_AVG_PRESS_min_1 + d_AVG_PRESS_min_2 + d_AVG_PRESS_min_3 + d_AVG_PRESS_min_4 + tot_prec_mean + d_tot_prec_0 + d_tot_prec_min_1 + d_tot_prec_min_2 + d_tot_prec_min_3 + d_tot_prec_min_4 + Wind_avg_mean + d_Wind_avg_0 + d_Wind_avg_min_1 + d_Wind_avg_min_2 + d_Wind_avg_min_3 + d_Wind_avg_min_4 + Income + airport_diffusion + Am_Ind + Asian + White_Hisp + W_non_Hisp + Black_Hisp + B_non_Hisp + Pacific + Insured+Poor + Urban+v1+v2+land1+land2+land3+land4+tweet1+tweet2+tweet3+tweet4+air1+air2+air3+air4+HA + M1+M2+NA.+Cum_vac_per_N_0 + Cum_vac_per_N_1 + Cum_vac_per_N_2 + Cum_vac_per_N_3185926.6

For example, the best model in our mixed-effect regression model series, with deviance information criterion (DIC) 185,926.6 (see Table 5, the last model), includes a term d_max_HUS_min_3 * d_t_avg_min_3 that denotes an interaction between the weekly change in the maximum specific humidity and the weekly change in the average temperature, both of which are weather parameter changes recorded three weeks before the current week. Because it does not split datasets, Approach two is more powerful at detecting interactions between explanatory variables and rather complicated models. However, this approach is more susceptible to bias (picking non-causal correlations that are inherent in the raw data), if the data are unbalanced, unlike with Approach 3.

We added and removed random effects to the model structure, taking out and adding fixed effects, executing the regression algorithm, and plotting the DIC achieved against model complexity. Here, model complexity is simply proxied by the descriptional complexity in terms of how large the model was (the number of factors in the regression equation). We stopped when the drop in DIC stabilized (See Figure 1—figure supplement 3 ). Notably, the DIC has been shown to select overfits (Ando, 2011; Plummer, 2008; van der Linde, 2012), and does not properly account for model complexity in practice. Our driving idea was the identification of the Pareto front that trades off accuracy (in terms of DIC) with model complexity in a transparent manner. This is, of course, a ‘greedy approach,’ in the sense that we do not guarantee that we have indeed found the ‘best’ possible model. However, because our cross-validation (Figure 5) yielded good results, we deemed the stopping rule to be satisfactory.

Approach 3: County-matching effect analysis

We designed Approach three specifically to balance biased observational data. While Approach three does not allow for explicit model selection, it is good at detecting simpler combinations of predictors putatively ‘causally’ affecting the outcome variable. However, because this approach involves splitting data into smaller and smaller subsets, as the complexity of predictor function (‘treatment’) grows, it quickly runs out of statistical power.

These two approaches’ properties explain the perceived misalignment of results from the two streams of analysis. For an intuitive understanding of matching approaches, consider specific humidity’s effect on influenza prevalence. The county-matching method’s goal is to deduce associations putatively interpreted as causality relations. For example, consider testing the question of whether counties with higher-than-average mean maximum specific humidity do, indeed, have higher influenza prevalence, in a statistically significant sense, when compared to counties that do not, provided all other factors are held constant.

Let 𝕐 denote the set of all US counties, and let 𝕊 be the set of all factors we find to be significant in our mixed-effect regression analysis. Now, for any subset of factors K𝕊, we denote the complement set as K¯=𝕊K. Additionally, we define the Boolean function 𝒯:𝕊×𝕐{true,false}, (the treatment function) where for some factors s𝕊, and some counties y𝕐, the Boolean value 𝒯(s,y) is true if the signal or treatment corresponding to the factor s is present in county y. We used the sign for the coefficients we obtained in our best mixed-effect regression model to determine what counts as a positive signal. For example, because maximum specific humidity (denoted by the variable max_hus_avg) enters with a positive coefficient, if county y experiences a higher-than-average maximum specific humidity, then 𝒯(max_hus_avg,y) is true.

To simplify the notation, we used 𝒯ys to denote a true signal, and ¬𝒯ys to denote a false one.

Finally, for any K𝕊, we define the following three sets, which we refer to as the W-sets:

(8) WtreatedK={y𝕐:kK𝒯yk}
Wmatched-controlK=
(9) {y𝕐:(¬kK𝒯yk)(yWtrueK(rK¯((𝒯yr𝒯yr)(¬𝒯yr¬𝒯yr))))}
(10) WotherK={yY:(¬kK𝒯yk)}WmatchedcontrolK

Clearly, WtreatedK is the treated set, that is, the set of counties which exhibit the signal encoded by the set K is then the matched control set of counties, which lack the signal, but each county in this set has a matching counterpart in WtreatedK.

The W-sets allow us to set up a 2×2 contingency table for any chosen subset of factors K𝕊 as described before. Specifically, we split the sets WtreatedK and Wmatched-controlK into two subsets each, representing those counties which experience a spike in influenza prevalence and which do not. The 2×2 contingency table is then subjected to Fisher’s exact test.

Appendix 1

Non-parametric Granger Causal Inference

Standard Implementations of Granger Causal Inference

Suppose that one is interested in the possibility that a vector series Yt causes another vector Xt. Let Jn be an information set available at time n, consisting of terms of the vector series Zt, that is,

(11) Jn={Zt:tn}

Jn is said to be a proper information set with respect to Xt, if Xt is included within Zt. Further, suppose that Zt does not include any component of Yt, and define

(12) Jn={(Zt,Yt):tn}

Inferring causality in the mean is easier, and if one is satisfied with using minimum mean square prediction error as the criterion to evaluate incremental predictive power, then one may use linear, one-step-ahead least squares predictors to obtain an operational procedure: If VAR(X|Jn) is the variance of one-step forecast error of Xn+1 given Jn, then Y is a prima facie cause of X with respect to Jn if:

(13) VAR(X|Jn)<VAR(X|Jn)

Testing for bivariate Granger causality in the mean involves estimating a linear, reduced-form vector autoregression:

(14) Xt=A(L)Xt+B(L)Yt+UX,t
(15) Yt=C(L)Xt+D(L)Yt+VY,t

where A(L), B(L), C(L), and D(L) are one-sided lag polynomials in the lag operator L with roots all-distinct, and outside the unit circle. The regression errors UX,t,VY,t are assumed to be mutually independent and individually i.i.d. with zero mean and constant variance. We used a standard joint test (F or χ2-test) to determine whether lagged Y has significant linear predictive power for current X.

The null hypothesis that Y does not strictly Granger cause X is rejected if the coefficients of the elements in B(L) are jointly significantly different from zero.

Linear tests presuppose restrictive and often unrealistic (Darnell and Evans, 1990; Epstein, 1987) structure on data. Brock (Brock, 1991) presents a simple bivariate model to analytically demonstrate the limitations of linear tests in uncovering nonlinear influence. To address this issue, a number of nonlinear tests have been suggested, e.g., with generalized, autoregressive, conditional heteroskedasticity (GARCH) models (Asimakopoulos et al., 2000), using wavelet transforms (Papadimitriou et al., 2003), or heuristic, additive relationships (Chu et al., 2005). However, these approaches often assume a class of allowable non-linearities; thus not quite alleviating the problem of presupposed structure. This is not just an academic issue; Granger causality has been shown to be significantly sensitive to non-linear transformations (Roberts and Nord, 1985).

Non-parametric approaches, e.g. the Hiemstra-Jones (HJ) test (Hiemstra and Jones, 1994) on the other hand, attempt to completely dispense with presuppositions regarding causality structure. Given two series, Xt and Yt, the HJ test (which is a modification of the Baek-Brock test [Baek and Brock, 1992]) uses correlation integrals to test if the probability of similar futures for Xt given similar pasts, change significantly if we condition instead on similar pasts for both Xt and Yt simultaneously.

Nevertheless, in order to achieve consistent estimation of the correlation integrals, the data series are required to be ergodic, stationary, and absolutely regular i.e.β-mixing, with an upper bound on the rate at which the β-coefficients approach zero (Denker and Keller, 1983). The additional assumptions beyond ergodicity and stationarity serve to guarantee that sufficiently separated fragments of the data series are nearly independent. The HJ test and its variants (Diks and Panchenko, 2006; Seth and Principe, 2010) have been quite successful in econometrics; uncovering nonlinear causal relations between money and income (Baek and Brock, 1992), aggregate stock returns and macroeconomic factors (Hiemstra and Kramer, 1995), currency future returns (Asimakopoulos et al., 2000) and stock price and trading volume (Hiemstra and Jones, 1994). Surprisingly, despite clear evidence that linear tests typically have low power in uncovering nonlinear causation (Hiemstra and Jones, 1994; Asimakopoulos et al., 2000), the application of non-parametric tests has been limited in areas beyond financial and macroeconomic interests.

1.1 Our Approach: Non-parametric Approach Based on Probabilistic Automata

We use a new, non-parametric Granger causality-based test for quantized processes (Chattopadhyay, 2014).

Going beyond binary hypothesis testing, we quantify the notion of the degree of causal influence between observed data streams, without presupposing any particular model structure. Generative models of causal influence are inferred with no a priori, imposed dynamical structure beyond ergodicity, stationarity, and a form of weak dependence. The explicit generative models may be used for prediction. The proposed inference algorithms are PAC-efficient (Chattopadhyay, 2014), i.e, we are guaranteed to find good models with high probability, and with small sample complexity.

Modeling Framework: Probabilistic Automata and Crossed Probabilistic Automata

In this section, we briefly describe the notion of probabilistic automata, the inference problem, the notion of causal states, and the decision fusion strategy for multiple models.

Self Models

The event catalogue’s quantized data streams may be viewed as sample paths from hidden, quantized stochastic processes that drive the observed dynamics. A self-model for such a stream is a generative model that captures statistically significant symbol patterns that causally determine (in a probabilistic sense) future symbols. A good modeling framework for such self-models are probabilistic finite state automata (PFSA) (Chattopadhyay and Lipson, 2013; Chattopadhyay, 2014).

One such self model, shown in Figure 1 (Plate A(i)), is an example of a simple PFSA with two states generated from a binary symbol sequence. Note that, for this particular example, a sequence 11 localizes or synchronizes the machine to state q2, which then implies that the next symbol is a 0 with a probability of 0.1 and a 1 with a probability of 0.9. Thus, given this stream’s self-model, along with the short history 11, we can predict the symbol distribution in the next time step. Notably, the states in a PFSA are not the alphabet symbols themselves, but are equivalence classes of symbol sequences (histories) that lead to statistically equivalent futures. This particular example has two states–not because of the fact that the alphabet is binary, but because there are only two distinct distributions that dictate the next symbol from any point in the dynamical evolution. The symbol value is distributed as either [0.1 0.9] or [0.3 0.7] over the binary event alphabet, and the current context or ‘state’ dictates which distribution is in effect chosen. These contexts are simply historical sequence equivalence classes that could have potentially lead to the present state; the specific history belonging to whichever current equivalence class actually transpired makes no difference in the future. Hence, these classes are dynamical states for the discrete stochastic evolution of the system at hand. The transition structure of the PFSA (represented in the graph as the labeled edges with probabilities) specifies how we move between these classes as new symbols are generated, or as new data is acquired.

The inference problem here is to determine both the number of states and the transition structure given a sufficiently long, quantized data stream. In our unsupervised approach, there are no restrictions on admissible graphs for the inferred PFSA, and no a priori constraints on the number of states that may appear (Chattopadhyay and Lipson, 2013).

Thus, the number of states inferred by our algorithm is a direct measure of the statistical complexity of the underlying process (Crutchfield, 1994). An inferred single-state, self model implies that the stream is a sequence of independently generated symbols (white noise or a Bernoulli process), and is therefore uninformative.

The inference algorithm identifies the number of causal states by searching for distinct ‘contexts’–sequences which, once transpired, lead to a distinct probability distribution of the next symbol. The computation proceeds as follows (Chattopadhyay and Lipson, 2013):

  1. Let the set of all possible sequences up to a length L be denoted as SL.

  2. Compute the probability of a future symbol (at a specified time shift) being σ0 or σ1 after a specific string ω from the set SL is encountered; call this distribution ϕω for the string ω.

  3. Call the set of probability distributions obtained by this method as Φ.

  4. Find clusters in the set Φ, such that individual clusters are separated by some pre-specified distance ϵ>0. These clusters represent the causal states, as they are classes of histories (sequences) that lead to identical, immediate future.

  5. Suppose string ω is in cluster qi, and sequence ωσ0 is in cluster qj; it then follows that, in the inferred PFSA, there is a transition labeled σ0 from the corresponding state qi to state qj. Carrying out this procedure for each symbol of the alphabet for each inferred cluster or state identifies the complete transition structure of the model.

  6. Once the transition structure is identified, we choose an arbitrary initial state and step through the model as dictated by the input data stream. We count the number of times each edge or transition is traversed, and, by normalizing the count, we arrive at an estimate of the edge probabilities.

This completes the PFSA inference.

Formal Definition of PFSAs and Future Prediction

Mathematically, a PFSA is a 4-tuple G=(Q,Σ,δ,π~), where Q is the set of states, Σ is the symbol alphabet, δ:Q×ΣQ is the transition function specifying the graph structure such that for any state qQ, δ(q,σ)Q is the end state of the transition from state q via symbol σ, and π~:Σ×Q[0,1] is the symbol probability function such that for any state qQ, π~(q,σ)[0,1] is the probability of generating the symbol σΣ from the state q, with the constraint:

(16) qQ,σΣπ~(q,σ)=1

We specify a time shift Δ when we infer a PFSA G from a given input stream; this time shift is the delay with which the inferred model makes predictions. Specifically, each transition step in G is translated to Δ steps in the quantized stream. Thus, if we know that the current state in the model is q0, then the predicted symbol Δ steps for the future are σ with probability π~(q0,σ).

Appendix 1—figure 1
Intuitive Description of Self and Cross Models.

Plate A illustrates the notion of self-models. Historical data is first represented as a symbol sequence (denoted as ‘Data stream’ in Plate A) using space-time discretization and magnitude quantization. For example, we may use a spatial discretization of ±3 in both latitudes and longitudes, a temporal discretization of 1 week, and a binary magnitude quantization that maps all magnitudes below 4.0 to symbol 0, and all higher magnitudes to symbol 1. This symbol stream then represents a sample path from a hidden, quantized stochastic process. A self-model is a generative model of this data stream, which captures symbol patterns that causally determine (in a probabilistic sense) future symbols. Specifically, our inferred self-model (see Plate A(i) for an example) is a probabilistic, finite state automata (PFSA). Plate B illustrates the notion of cross-models. Instead of inferring a model from a given stream to predict future symbols in the same stream, we now have two symbol streams (Data Stream I and Data Stream II), and the cross-model is essentially a generative model that attempts to predict symbols in one stream by reading historical data in another. Notably, as shown in Plate B(i), the cross-model is syntactically not exactly a PFSA (arcs have no probabilities in the cross-model, but each state has an output distribution). We call such models ”crossed probabilistic finite state automata,’ or XPFSA. Once these models are inferred, they may be used to predict the future evolution of the data streams. Thus, the self-model in Plate A may be initialized with its unique stationary distribution, after which a relatively short observed history would dictate the current distribution on the model states. This, in turn, would yield a distribution over the symbol alphabet in the next time step. For a cross-model, we would be able to obtain future symbol distribution in the second stream, given a short history in the first stream. Note that the cross-model from I II is not necessarily the same as the cross-model in the other direction.

https://doi.org/10.7554/eLife.30756.033

Note that, if we have two models G1=(Q1,Σ,δ1,π~1) and G2=(Q2,Σ,δ2,π~2) with time shifts Δ1 and Δ2 respectively, it follows that with each step the models make predictions at different points in time in the future.

Note that to make a prediction using a PFSA, we must know its current state. This is non-trivial in general due to the synchronization problem (Chattopadhyay and Lipson, 2013; Chattopadhyay, 2014). We find an approximation of the current state as follows:

  1. Compute the he PFSA states’ stationary distribution using standard Markov chain tools.

  2. Treat the stationary distribution row vector as the initial state distribution. The rationale here is that if we assume that the model has been operating for a sufficient length of time, the current distribution must be very close to the stationary distribution in the expected sense.

  3. Compute the square matrices, Γσ|Q|×|Q|, for each symbol σΣ, such that the ijth element of Γσ, denoted as Γσ|ij, which signifies the probability of going from state qi to state qj via the symbol σ.

  4. Use a relatively short history h=σ1σ of past symbols (before the current time) to update the state distribution as follows:

(17) [k]=𝗇𝗈𝗋𝗆𝖺𝗅𝗂𝗓𝖾([k-1]Γhk),k=1,,

where hk is the kth entry of the short history h, and [k] is the state distribution at step k. It follows from ergodicity that the updated distribution is an estimate of the current state distrbution.

The predicted distribution over the symbol alphabet Δ steps in the future (if the time shift for the model is Δ), is then simply:

Pr(σpredicted=σ)=qQqπ~(q,σ)

where q is the current probability for state q, i.e., the entry corresponding to state q in .

Cross Models

Unlike a self model, which attempts to predict future symbols in the same stream, a cross model attempts to predict symbols in a target stream after reading a short history in the source stream. Plate B(i) in Figure 1 illustrates a simple cross model with two states. Note that the model syntax is different from a PFSA; in particular, there are no probabilities on the arcs, though each state has a specified output distribution. Thus, if we see sequence 101101 in data stream sB (See Figure 1, Plate B), then we are in state q2 in the cross-model, and hence the next symbol in the data stream sA can be predicted to be 0 with a probability of 0.7 and 1 with a probability of 0.3. We call such models crossed Probabilistic Finite State Automata (XPFSA) (Chattopadhyay, 2014). A cross model state or an XPFSA has a slightly different interpretation from PFSA; while PFSA states are equivalence classes of histories that lead to identical futures in the same stream, the XPFSA are equivalence classes of histories in the source stream that lead to identical futures in the target stream (the future evolution in the target stream does not matter).

XPFSA inference ˜ (Chattopadhyay, 2014) is similar in principle to PFSA inference. Here, we have two input data streams, the source stream sA over the source alphabet ΣA, and the target stream sB over the source alphabet ΣB. The broad steps are as follows:

  1. Let the set of all possible sequences up to a length L in source stream sA be denoted as SLA.

  2. Compute the probability of a future symbol (at a specified time shift) being σΣB in the target stream sB after a specific string ω from the set SLA is encountered in the source stream; call this distribution ϕωsA,sB for the string ω.

  3. Call the set of probability distributions obtained in this manner as ΦsA,sB.

  4. Find clusters in the set ΦsA,sB, such that the clusters are separated by some pre-specified distance ϵ>0. These clusters now represent causal states of the cross-dependence between the stream (from sA to sB), because they are classes of histories (sequences) in the source stream that leads to identical, immediate futures in the target stream.

  5. Suppose string ω is in a cluster corresponding to state qi, and sequence ωσ0 is in a cluster corresponding to qj; it then follows that, in the inferred XPFSA, there is a transition labeled σ0 from the corresponding state qi to state qj. Carrying out this procedure for each symbol of the source alphabet for each inferred cluster or state identifies the complete transition structure of the cross model. Note that the cross-model transitions are only labelled with symbols from the source alphabet.

  6. Once we have identified the transition structure, we choose an arbitrary initial state in the XPFSA and step through the model as dictated by the source data stream. Each time we reach a particular state in the cross-model, we note which symbol from target alphabet ΣB transpires in the target stream (at the specified time shift). Once we reach the end of the input source stream, we normalize the symbol count vectors corresponding to the cross-model states, and this determines the output distribution at each state. Note that each output distribution is a probability distribution over the target alphabet. This completes the XPFSA inference.

An XPFSA may be thought of as a direct generalization of a PFSA, and any PFSA can be represented as an XPFSA. To illustrate this, note that predicting future symbols in the same stream can be thought of as predicting symbols is a second, but identical, copy of the first stream.

If all histories in a source stream are equivalent in this sense, then we have a single state XPFSA, which implies that the source cannot provide any new information on what is going to happen in the target stream based on its own history. Hence, that source lacks any causal influence on the target. Thus, as before, the complexity of the inferred models is directly related to the statistical complexity (Crutchfield, 1994) of the learnable, dynamical relationships underlying data streams themselves.

XPFSAs are asymmetric in general; the model for how a source influences a target does not need to be identical when the roles played by the streams are reversed. However, two streams are statistically independent if and only if the XPFSAs in both directions are single state machines (Chattopadhyay, 2014).

This leads us to the notion of the causality coefficient, γBA, from data stream sA to data stream sB. True to Granger’s notion of statistical causality, γBA rigorously quantifies the amount of additional information we can obtain about the immediate future in stream sB through observing stream sA.

Explicitly, the coefficient is defined as the ratio of the expected change in the entropy of the next-symbol distribution in stream sB conditioned over observations in the stream sA to the entropy of the next-symbol distribution in stream sB, conditioned on the fact that no observations are made on stream sA. We show that causality coefficient γBA takes values on the closed unit interval and that higher values indicate a stronger predictability of sB from sA, i.e., therefore, a higher degree of causal influence. We have γ=0 if and only if the inferred machine has a single state, and streams sA,sB are statistically independent if and only if γBA=γAB=0.

Thus, the interpretation of the causality coefficient which is central to our development is as follows:

γBA=0.3 (say) from sA to sB means that we can acquire, on average, 0.3bits of additional information about sB from each bit read from sA, over what is already available from the past history of sB.

This is very different from computing correlations, and it assumes no model structure a priori for the hidden dynamics driving the data streams.

Importantly, it turns out that inferring self models is simply a special case of cross model inference, in which the target stream is simply a time-shifted version of the first stream.

Formal Definition of XPFSAs and Future Prediction

Formally, an XPFSA is also a 4-tuple GAB=(Q,ΣA,Δ,π~), where Q is a set of states, ΣA is the input alphabet. The transition function δ:Q×ΣAQ is defined as before (in the case of the PFSA), but the symbol probability function is defined as π~:Q×ΣB[0,1], with

(18) qQ,σΣBπ~(q,σ)=1

where ΣB is the target alphabet, which is distinct from ΣA in general. Note that the symbol distribution specified by π~ is over the target alphabet ΣB.

As before, there is a time shift Δ associated with inferred XPFSA, such that each symbol transition in the model maps to Δ steps in the target stream. If we know the current state of the XPFSA to be q0, and we observe symbol σΣA in the source stream, then the predicted symbol in the target stream (Δ steps in the future) is σΣB, with a probability of π~(δ(q0,σ),σ).

As before, determining the current state in the XPFSA is non-trivial. We estimate the current state distribution in a manner similar to that used for self models:

  1. Note that the XPFSA graph does not have probabilities on it edges. However, the corresponding source stream’s self model does have symbol probabilities on its own edges, and it models the source stream. The problem is that the graph for the self model, and that of the cross model, might not be identical. We solve this problem using projective composition of probabilistic automata (Chattopadhyay and Ray, 2008). This operation takes the self model for the source stream GA, and projects it on the inferred XPFSA graph GAB, and the result is a PFSA GAGAB with the same structure as that of GAB.

  2. Once we obtain GAGAB, we can estimate its current state as described in the case for self models (using observed symbols in the source stream).

  3. Let the current state distribution be ¯. Then, the predicted future symbol (at time shift Δ) in the target stream is σΣB with a probability of:

Pr(σ𝐩𝐫𝐞𝐝𝐢𝐜𝐭𝐞𝐝=σ)=qQ¯qπ~(q,σ)

where GAGAB (note we are using the π~ from the XPFSA, and not from the projected PFSA).

Computation of the Coefficient of Causality

Let A,B be stationary ergodic processes over finite alphabets ΣA,ΣB respectively. Then, the causal dependence coefficient of B on A, denoted as γBA, is formally defined as the ratio of the expected change in the entropy of the next symbol distribution in B. This is due to observations in A, which show the entropy of the next symbol distribution in B. In the absence of data in A, i.e., we have (Chattopadhyay, 2014):

(19) γBA=1-𝐄(𝒉(ϕxA,B))𝒉(ϕλA,B)

where the entropy 𝒉(u) of discrete probability distribution u is given by iuilog2ui, λ is the empty string, and 𝐄() is the expectation operator over all possible sequences in the source stream. Once we infer the cross model, computing γBA is straightforward.

Choice of Quantization Schemes

The weekly time series of county-specific records for the number of reported influenza cases is quantized to a binary stream. If yk is the number of reported cases for a given fixed county at the week index k, then the quantized binary stream y¯k is obtained as follows:

(20) y¯k={1if yk>yk-10otherwise

References

  1. 1
  2. 2
    Predictive bayesian model selection
    1. T Ando
    (2011)
    American Journal of Mathematical and Management Sciences 31:13–38.
    https://doi.org/10.1080/01966324.2011.10737798
  3. 3
  4. 4
  5. 5
    GADM database
    1. GA Areas
    (2014)
    GADM database, http://gadm.org/.
  6. 6
  7. 7
    A General Test for Nonlinear Granger Causality: Bivariate Model
    1. EG Baek
    2. WA Brock
    (1992)
    A General Test for Nonlinear Granger Causality: Bivariate Model.
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
    Beyond Belief: Randomness, Prediction, and Explanation in Science
    1. W Brock
    (1991)
    230–279, Causality, chaos, explanation and prediction in economics and finance, Beyond Belief: Randomness, Prediction, and Explanation in Science.
  16. 16
  17. 17
  18. 18
    Prevention and control of seasonal influenza with vaccines: recommendations of the Advisory Committee on Immunization Practices (ACIP), 2009
    1. Centers for Disease Control and Prevention
    2. AE Fiore
    3. DK Shay
    4. K Broder
    5. JK Iskander
    6. TM Uyeki
    7. G Mootrey
    8. JS Bresee
    9. NJ Cox
    (2009)
    MMWR. Recommendations and Reports : Morbidity and Mortality Weekly Report. Recommendations and Reports 58:1–52.
  19. 19
  20. 20
    Abductive learning of quantized stochastic processes with probabilistic finite automata
    1. I Chattopadhyay
    2. H Lipson
    (2013)
    Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 371:20110543.
    https://doi.org/10.1098/rsta.2011.0543
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
    Data Driven Methods for Nonlinear Granger Causality: Climate Teleconnection Mechanisms
    1. T Chu
    2. D Danks
    3. C Glymour
    (2005)
    Pittsburgh, Pennsylvania: Carnegie Mellon University.
  26. 26
  27. 27
  28. 28
    Real-time and retrospective forcing in the North American land data assimilation system (NLDAS) project
    1. BA Cosgrove
    2. D Lohmann
    3. KE Mitchell
    4. PR Houser
    5. EF Wood
    6. JC Schaake
    7. A Robock
    8. C Marshall
    9. J Sheffield
    10. Q Duan
    11. L Luo
    12. WR Higgins
    13. RT Pinker
    14. DJ Tarpley
    15. J Meng
    (2003)
    Journal of Geophysical Research, 108, 10.1029/2002JD003118.
  29. 29
  30. 30
    The Limits of Econometrics
    1. AC Darnell
    2. LJ Evans
    (1990)
    Aldershot, Hampshire: Gower.
  31. 31
  32. 32
  33. 33
  34. 34
    On U-statistics and v. mise? statistics for weakly dependent processes
    1. M Denker
    2. G Keller
    (1983)
    Zeitschrift fur Wahrscheinlichkeitstheorie Und Verwandte Gebiete 64:505–522.
    https://doi.org/10.1007/BF00534953
  35. 35
  36. 36
  37. 37
  38. 38
    A History of Econometrics
    1. R Epstein
    (1987)
    Amsterdam: North-Holland.
  39. 39
  40. 40
    The Design of Experiments
    1. RA Fisher
    (1935)
    Edinburgh, London: Oliver and Boyd.
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
    Testing for causality
    1. CWJ Granger
    (1980)
    Journal of Economic Dynamics and Control 2:329–352.
    https://doi.org/10.1016/0165-1889(80)90069-X
  47. 47
  48. 48
    The truven health marketscan databases for life sciences researchers
    1. L Hansen
    (2017)
    Truven Health Ananlytics IBM Watson Health.
  49. 49
    Wiley Series in Probability and Statistics
    1. DR Hedeker
    2. RD Gibbons
    (2006)
    Longitudinal data analysis, Wiley Series in Probability and Statistics, Hoboken, N.J, Wiley-Interscience.
  50. 50
    Testing for linear and nonlinear granger causality in the stock price-volume relation
    1. C Hiemstra
    2. JD Jones
    (1994)
    The Journal of Finance 49:1639–1664.
  51. 51
  52. 52
  53. 53
    An Enquiry Concerning Human Understanding
    1. D Hume
    (1993)
    United States: Hackett Publishing Country.
  54. 54
    The Cambridge Edition of the Works of Immanuel Kant
    1. I Kant
    (1998)
    Critique of pure reason, The Cambridge Edition of the Works of Immanuel Kant, New York, NY, Cambridge University Press, 10.1017/CBO9780511804649.
  55. 55
  56. 56
  57. 57
    A contribution to the mathematical theory of epidemics
    1. WO Kermack
    2. AG McKendrick
    (1927)
    Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 115:700–721.
    https://doi.org/10.1098/rspa.1927.0118
  58. 58
  59. 59
  60. 60
  61. 61
  62. 62
    The multi-institution North American land data assimilation system (NLDAS): utilizing multiple gcip products and partners in a continental distributed hydrological modeling system
    1. KE Mitchell
    (2004)
    Journal of Geophysical Research, 109, 10.1029/2003JD003823.
  63. 63
  64. 64
    Analytical Methods for Social Research
    1. SL Morgan
    2. C Winship
    (2015)
    Counterfactuals and causal inference : methods and principles for social research, Analytical Methods for Social Research, Second edition, New York, Cambridge University Press.
  65. 65
  66. 66
  67. 67
    PRISM climate group
    1. Oregon State University
    (2014)
    Accessed August 2016.
  68. 68
    Proceedings 2003 VLDB Conference
    1. S Papadimitriou
    2. A Brockwell
    3. C Faloutsos
    (2003)
    560–571, Adaptive, hands-off stream mining, Proceedings 2003 VLDB Conference, 10.1016/B978-012722442-8/50056-2.
  69. 69
  70. 70
  71. 71
  72. 72
  73. 73
  74. 74
  75. 75
    A test of granger non-causality based on nonparametric conditional independence
    1. S Seth
    2. JC Principe
    (2010)
    2010 20th International Conference on Pattern Recognition. pp. 2620–2623.
  76. 76
  77. 77
  78. 78
  79. 79
  80. 80
  81. 81
  82. 82
    US county adjacency
    1. The United States Census
    (2016)
    Accessed August 2016.
  83. 83
  84. 84
  85. 85
  86. 86
  87. 87
  88. 88
    Emergence of influenza A viruses
    1. RJ Webby
    2. RG Webster
    (2001)
    Philosophical transactions of the Royal Society of London. Series B, Biological sciences 356:1817–1828.
    https://doi.org/10.1098/rstb.2001.0997
  89. 89
  90. 90
  91. 91
    Modeling on social spread from immunity
    1. H Yasuda
    2. N Yoshizawa
    3. K Suzuki
    (2005)
    Japanese Journal of Infectious Diseases 58:S14.
  92. 92
  93. 93

Decision letter

  1. Mark Jit
    Reviewing Editor; London School of Hygiene & Tropical Medicine, and Public Health England, United Kingdom

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

[Editors’ note: a previous version of this study was rejected after peer review, but the authors submitted for reconsideration. The first decision letter after peer review is shown below.]

Thank you for submitting your work entitled "Conjunctions of factors triggering waves of seasonal influenza" for consideration by eLife. Your article has been evaluated by Prabhat Jha (Senior Editor) and four reviewers, one of whom, Mark Jit, is a member of our Board of Reviewing Editors.

All the reviewers and I agreed that this is an interesting, innovative and sophisticated attempt to understand the drivers of seasonal influenza spread throughout continental United States. However, we had wide-ranging concerns about multiple areas of the paper.

Based on these discussions and the individual reviews below, we have decided to reject the paper as we think that substantial time and work will be needed to address all the concerns. However, if you would like to substantially revise the paper in a way that takes into account the concerns, we would be interested to consider a revised version of the manuscript as a new submission. Please note that we cannot guarantee review of the revised version until we have seen it.

The full reviews are included, but the most important concerns that would need to be addressed in a new submission are the following:

1) Presentation: While eLife does not have strict format requirements, this paper in particular would benefit from a more classic presentation (e.g. Introduction, Materials and methods, Results, Discussion) that provides a fuller description of the background, purpose and how the paper interacts with and adds to the existing (vast) literature on this topic.

2) The data sources and the way they are used need a more detailed description and justification. The definition of ILI was challenged by one of the reviewers and needs to be justified.

3) While the environmental and demographic factors driving seasonality are extensively examined, there is a need also to take into account viral and host immunity factors such as antigenic drift, seasonal vaccine uptake and antigenic match, host susceptibility and between-host transmissibility (with the latter heavily influenced by the school calendar).

4) The conclusions (e.g. the "Southern hypothesis" and bodies of water being sources) are interesting but need stronger explanation and justification. The different approaches taken did not all give exactly the same results. Claims of causality in epidemiology require an extremely high level of evidence so in this case more moderate language may be appropriate (e.g. "the association suggests.[…]").

5) It would have been useful to make some attempt to validate the methods used, such as by holding back a validation dataset.

Reviewer #1:

This is an innovative and well-illustrated attempt to understand the drivers of seasonal influenza spread each season throughout continental United States. The techniques used are interesting and the array of datasets that the investigators have compiled are impressive. Nevertheless, I do think that their conclusions place far too much certainty than can be justified from statistical analysis of observational data.

1) The claims about causality in the paper should be tempered. While Approach 1 may historically have been called Granger causal inference, there is a limit to the extent to which causality can be claimed in the absence of (and sometimes even with) experimental data. At best, the manuscript should claim that the associations in the data are consistent with their hypotheses.

2) The concept of streamlines is an interesting one but it's not clear what the underlying mechanisms behind them are. They seem to imply that there is a continuous line linking infection in a particular location with a source in another, usually distant, location. Is it not more consistent with our understanding of infection spread to assume that infection spread is a diffusion process, even though environmental and host factors may enhance diffusion in particular directions?

3) All three approaches used are essentially data mining exercises, followed by post-hoc generation of hypotheses to explain the associations found. This does not mean they are invalid – plenty of science works in this way, and some care has been taken to avoid type I errors. Nevertheless, the possibility of bias, confounding, alternative explanations and simply random error should not be discounted. This is particularly the case because there is strong geographic clustering of many of the variables identified as key drivers (e.g. densely populated areas, coastal areas, places with airports, places at the geographical and climactic extremes of the USA tend to be the same places). Hence the language used should be substantially tempered to be more tentative. While the authors have used E. coli and HIV spread as controls to ensure that these display different patterns of spatial spread, the problem is that the seasonality of these pathogens is completely different or (in the case of HIV) largely absent. Looking at a non-respiratory but still highly seasonal pathogen e.g. norovirus might be interesting.

4) One shortcoming of the data analysis is that although a large number of variables related to environmental and demographic factors are considered, there is little consideration of viral and host characteristics which are key to driving influenza spread. For instance, over the nine years the model is run, the United States has experienced waves of different influenza types, a pandemic (which led to a shift in the influenza A/H1 subtype), different levels of vaccine coverage with varying levels of match to the prevailing strains, as well as changes in immune profiles of individuals over the population. There is no attempt to understand these drivers and capture their spatiotemporal heterogeneity even though there are datasets (e.g. vaccine coverage, serology, virological testing) that could be used to inform this.

Note about presentation:

The presentation of the paper is rather confusing. There is hardly any introductory material; the first section is essential a summary of methods and results. This is followed by a more detailed description of the methods and finally a discussion. While eLife does not have strict rules about manuscript formatting, the paper should at least follow the general conventions of medical/life sciences journals i.e. the Introduction, Materials and methods, Results, Discussion division which will make finding information easier for most readers. A fuller Introduction section setting the background is also needed e.g. why this is an important problem from a clinical/public health point of view, what other attempts have been made to address it and what their findings and shortcomings have been?

Reviewer #2:

This is an interesting Big Data study that aims to uncover the causal factors driving the onset of epidemic influenza at the county level in the United States. Authors rely on multiple datasets including medical claim records, demographic characteristics, human movement inferred from Twitter data, and climate data. Authors employed three different methodological approaches and found a complex relationship with influenza epidemic onset. Although the study is of significant interest, I do have several concerns indicated below.

1) The definition of epidemic onset is loosely defined and is unclear whether the same definition was employed across approaches. They discuss a "trigger period" of several weeks to exponential expansion. Is exponential expansion a necessary condition? This is the response variable in their study and little is discussed about its variation across counties and studied seasons.

2) Patterns from different approaches were not necessarily in agreement as authors discussed. In particular, authors did not attempt to reconstruct season-to-season variation in influenza susceptibility. A mechanistic approach that accounts for temporal variation in influenza susceptibility is missing (perhaps departing from previous work by Shaman and his colleagues). Authors did not attempt to capture variation in reproduction numbers, vaccination rates, vaccine efficacy, and likely others across seasons.

3) Authors found strong support for a southern origin of influenza in the US. It would have been interesting to predict flu onset across the vast range of southern counties.

4) A validation phase was not attempted. For instance, authors could use the first few years of data for model calibration and the remainder of the data for forecasting purposes. It would have been interesting to quantify the predictive power of their approaches.

Reviewer #3:

This is a methodologically sophisticated paper which cleverly uses medical claims and other large datasets, and three different computational approaches, to explore whether three factors (weather, X and Y) might explain the timing of seasonal influenza in the US. They find that influenza epidemics tend to start in Southern states.

My impression is that the paper is highly sophisticated in terms of analytics, but that the data sources that are used to describe influenza and the possible drivers (such as weather, mobility) are not always well developed and ill described (Truwen) or even possibly inadequate (Twitter). The findings of short-distance traveling being important driver corroborates findings by Gog et al., another pioneer study using medical claims data to study influenza dissemination, which is nice.

Trusting other reviewers will take on an evaluation of the analytic approaches, I will focus instead on the validity of the data (outcomes, explanatory) side of the matter.

It is my impression that the paper needs substantial revision in order to be ready for publication. For example, data sources and approaches must be far better described and more sensitivity analyses done. The lack of consideration of genetic drift variants and school schedule patterns, along with the likely consequences of regional differences in Truwen data coverage and ICD coding, if any, should be argued.

On the topic of drivers not considered:…

1) One obvious driver of seasonal influenza timing is not mentioned in this paper: the degree of genetic drift of the influenza virus relative to those that circulated in earlier years. That influenza occurs early in big "drift" years is evident from the unusually early onset of the 2003 Fujian H3N2 epidemic, and also of most pandemics in history including the 2009 pandemic. This is not discussed and that is a caveat that should be discussed. If such data are too much trouble, then at the very least consideration of which influenza subtype is dominant in each season would be useful; for example influenza B tends to be epidemic later than influenza A.

a) Could authors do a sensitivity sub-analysis of seasons of major drift variants (2003 Fujian and 2009 pandemic flu) to see if their identified drivers also are found for those?

2) Another driver that is not discussed is the school session patterns across the US during the autumn and winter. As it was picked up as a major driver of pandemic spread by Gog et al., 2014, why would it not also be important for seasonal influenza? This should be listed as a shortcoming of the analysis, or authors could explain in the text why they did not think this was important.

On the topic of validity and validation of the observational data underlying the analyses:

3) I am concerned about the robustness of the analysis due to possible low counts of ILI as the authors defined it narrowly as ICD codes 487-488 – which could mean much noise in ILI time series at the local level. The authors may benefit from a very similar study of influenza dissemination in the 2009 pandemic in the US, which was based on IMS CMS1500 claims data (Gog et al., 2014); the underlying paper preparing the medical claims data and validating against CDC influenza data (Viboud et al., Plos One 2013). Viboud et al., established that adding 079.9 (ICD9) to the case definition dramatically (by 90%) increased the sensitivity of the case definition, without much cost on the specificity.

a) Yet in the "Flammable media" (Discussion section) the authors discuss how 20% of Southeners have ILI each season. This seems way too high, as most people do not seek care for ILI and 20% (and what would the rate have been if the ILI definition was not so narrow?) is already higher than population-based sero-epidemiological assessments of attack rates and it is well known that most of people with ILI do not seek medical care and that at least half of infections are asymptomatic. So 20% ILI population prevalence in Truwen data is unbelievable.

i) Furthermore, it is difficult to believe that Southern US is better connected than, for example, dense US Northern cities like New York with large scale public transportation where influenza may be easily spread. See my later point about ICD coding preferences –… another worry that seasonal initial levels in southern US is a data artefact somehow, due to for example higher Truwen coverage and different ICD coding practices.

ii) Moreover, the authors comment at some point that the ILI prevalence is higher in the mid-west. However, local differences in ICD coding preferences by doctors are likely a better explanation, and so these differences could just be artefacts. For that reason it would be better to not include prevalence in the modeling, but rather base the analysis solely on the initial epidemic phase and/or the peak timing.

4) The Truwen data source and extraction of the ILI and other time series should be described in much more detail. Is it CMS-1500 data? Were influenza codes sought among all codes listed or only the first-listed? Etc. Also, it would be helpful to show graphically the time series data, to allow the reader to evaluate data stability and volume – over time and between locales. Or else, add some summary table that summarizes such observational findings. The Truwen claims database is likely a convenience sample, so it is likely that the coverage increases dramatically over time since 2003, and also that there are substantial differences in coverage by US geographical areas depending on Truwen levels of foot holds. Did the authors see such intrinsic variability in the data (for example, consider another code in summer months? How would such variability in coverage influence the analyses presented here – or is it somehow taken into account?

a) The authors offer a comparison with HIV patterns – but is that appropriate as HIV is a chronic disease – how exactly was this done? Did authors extract only ICD codes for new onset HIV or using longitudinal patient-level data (nothing in the Materials and methods section about this)? A better comparison would perhaps be something else epidemic but with another mode of transmission such as rotavirus or norovirus diarrhea epidemics?

5) Validity of Twitter data to describe mobility? The authors conclude that short-distance land travel is an important determinant of the spread of influenza, using Twitter data to determine mobility. But is this defensible, as only older kids and adults have Twitter accounts – yet we know that children are important drivers of influenza (see fx Gog et al., two major factors determining spread of influenza was school schedules and short distance traveling). Please comment on this possibility that the Twitter data are not determinants of children's movements.

6) On the "Initiation spark" (DDiscussion section). The authors conclude that cold weather and low humidity are important drivers. Put together with their finding that most epidemics originate in the Southern United states, one must ask: How could cold weather/low humidity be a predictor of influenza and influenza starts in the South – when Southern states have warmer climate and higher humidity in early winter than Northern states? Would be great if the authors would discuss this possible paradox.

Reviewer #4:

This manuscript presents an analysis of a large dataset of ILI from insurance records to identify spatial patterns in influenza outbreak initiation and spread. The authors present 3 classes of analysis to identify variables correlated with outbreak initiation. Overall, I found this work very interesting but in dire need of restructuring to clarify the main questions and methods. At present, this reads as an explanation of several very innovative approaches to mining these large datasets for patterns, but doesn't lay out a clear goal or hypothesis. Similarly, the Discussion lists an array of patterns that were discovered, but fails to identify a single (or even just a few) key insight that was gained through this approach.

I'm usually not a stickler for presentation of papers in the classic Introduction/Materials and methods/Results/Discussion format – but given that the authors are presenting several novel methods here, I think the manuscript would really benefit from being more classic. At present, some aspects of the methods are described in the "Approach #" sections, but others are presented in the "Materials and methods". I think it would be much clearer to group all the methods together so that the results could themselves be presented in a more concise format that would highlight the key insights.

The figures are excellent, but they are referenced out of order in the text; e.g. Figure 2I gives results from Approach 2 which is presented after the results from Approach 1, which are in Figure 4 and 5.

Overall, I would like to see a more concise statement of the major innovation from this approach and this analysis. At present, this is a list of variables that were identified as significant (albeit using some really interesting methods and excellent data), many of which have already been identified as predictive of flu in the past. What would be really interesting, if possible, would be to illustrate whether this method was in some way more predictive of flu dynamics than conventional methods. For example, could these analyses be carried out on the first 8 years of ILI data and the resulting model used to predict the 9th; if so, does the addition of Approach 1/3 do better than the standard Poisson regression?

[Editors’ note: what now follows is the decision letter after the authors submitted for further consideration.]

Thank you for submitting your article "Conjunction of Factors Triggering Waves of Seasonal Influenza" for consideration by eLife. Your resubmission has been evaluated by Prabhat Jha (Senior Editor) and three reviewers, one of whom, Mark Jit (Reviewer #1), is a member of our Board of Reviewing Editors The following individual involved in review of your submission has agreed to reveal their identity: Elizabeth C Lee (Reviewer #3).

The reviewers have discussed the reviews with one another. We consider that you have undertaken a major effort in putting their paper in an appropriate format and addressing the reviewers' comments and are, in general, satisfied with the revisions made. However, we still have a few remaining concerns that should be addressed by the authors to make their paper suitable for potential publication in eLife.

The Reviewing Editor has drafted this decision to help you prepare a revised submission. If you are able to address at least all the essential points to our satisfaction, we should be able to make a quick decision about the manuscript.

Essential revisions:

1) The literature review is still not adequate. Reviewing the literature is not simply a matter of having a string of references. Ideally a couple of paragraphs would be added to the Introduction or Discussion discussing the history of related statistical models for understanding the timing, spread and burden of seasonal influenza, and how this paper builds on the existing literature.

Also several of the previous reviewers found it difficult to determine the actual aim of the data-driven exercise. While we understand and are sympathetic to the point that this is not a traditional hypothesis-driven study, it still needs an aim if not a hypothesis. Could the last line of the first paragraph be reworked into what the aim of this was? (e.g. is it to determine the most important factors that are associated with initiating influenza transmission in the continental USA at the beginning of each season?)

2) You state that "our more precisely defined ILI data defines better against the CDC dataset" – could details (including quantitative measures of better definition) be given? Both narrower and broader definitions of ILI are imperfect proxies of influenza activity, although in different ways, and it is not a given that a more specific measure is necessarily less biased.

3) The incorporation of antigenic drift, vaccine coverage and school data are important and innovative additions to the paper. However, one missing factor is host immunity not due to vaccination i.e. if there is a strong influenza season in one year with no substantial antigenic drift after that then the population immune to influenza is likely to be greater the following year. Ideally, this could be captured mechanistically by incorporating SIR-type epidemiological models, or by more statistical approaches, ideally informed by measures of prior immunity such as serology. At the very least, the implications of this factor need to be adequately discussed.

4) We still think (as reviewer 3 of the previous round of reviews pointed out) that a 20% influenza infection rate is far too high, even if it incorporates asymptomatic/subclinical infection. This is much greater than the change in seropositivity over most influenza seasons other than pandemic years.

5) Given the spatiotemporal nature of the findings, the authors should include more information about the Truven database. For instance, are the authors able to provide additional details about the spatial distribution of reporting rates (e.g., number of physicians contributing claims in a given county) within the Truven database? While insurance claims do provide high resolution spatial information, the use of raw counts for influenza-related diagnosis codes could be problematic if there is great spatial or temporal variation in reporting rates.

6) While the methods and results are clearly described, the organization of the paper can be improved to facilitate easier reading. We found ourselves flipping back and forth a lot between the text because details that appeared to be missing earlier on would then appear later in different parts of the text. Figures were also cited out of order across the methods descriptions, and in some cases we don't believe all of the figures were discussed. As a few examples, the methods used to generate Figures 3I-3K were not discussed until the Results (subsection “Approach 1: Non-parametric Granger analysis”, fourth and fifth paragraphs) but Figure 3 results are first discussed in the Materials and methods (subsection “Analysis of quantized clinical data (Approach 1)”, first paragraph).

For more concrete advice, we would remove references to results-related figures from the Materials and methods section so that readers examine the figures only after all of the terms and analyses have been explained, and, having a cleaner separation between Materials and methods and Results will improve readability. There were many times where the authors alluded to some small part of the results in the Materials and methods before the description of the analysis had been completed. Conversely, the Results section had relatively detailed methods descriptions, which seemed redundant since the Results followed the Materials and methods. The authors may also consider a greater overhaul of the manuscript structure that groups the primary results in the text and figures (as opposed to organizing the manuscript by methodological approach) and moves many of the supporting or diagnostic figures to the supplement.

7) Beyond improving the organization of the paper, we think the reporting of results for Approach 1 could be better explained and supported in the text. For instance, it was not clear how the claims in the sixth and eighth paragraphs of the subsection “Approach 1: Non-parametric Granger analysis” were demonstrated with the cited figures.

8) Finally, given the emphasis on the novelty of comparing results across three methodological approaches, the authors should discuss the generalizability of their findings to influenza epidemiology beyond the United States and the generalizability of their framework to other contexts with high spatiotemporal resolution data.

[Editors' note: further revisions were requested prior to acceptance, as described below.]

Thank you for submitting your article "Conjunction of Factors Triggering Waves of Seasonal Influenza" for consideration by eLife. Your article has been reviewed by one of the original peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Prabhat Jha as the Senior Editor.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission. We appreciate your revisions to the organization of the text and interpretation of the results, and think that many of the changes improve the readability and flow of the paper and clarify the goals of the analysis. Nevertheless, we still have remaining concerns about some parts of the analyses and about the sparsity of information for others. Hence we think that the article is still not ready to be published in its current state.

We would like to avoid multiple further rounds of revision, so we can give you one more opportunity to address the remaining issues before a final decision about the manuscript is made.

The remaining issues with the manuscript can be divided into two categories: (i) substantial scientific issues that need to be addressed before the manuscript is published, (ii) recommendations about the description of the work that we strongly encourage you to take but which are optional.

Essential revisions about scientific content:i) (Discussion, second paragraph) It is still difficult to interpret the results related to short-range and long-range travel. We were unable to find the regression results for the full set of data weeks in the supplement. Also, there is not enough detail in the analysis description to see how Figure 5 Plates C-F demonstrate that local connections are more explanatory than airport connections.

ii) The highlighted areas (greater incidence) in the first flu season snapshot for Figures 3A-H each should match relatively well to the maps in Figure 7B (or at least some unspecified subset of "high incidence treatment counties"). While the county-matching results to point to initiation at coasts, 2006-2007 in Figure 3 seems to initiate from the West Coast and 2010 in Figure 7 has a scattered pattern across the middle of the US. Please check this or justify the differences.

iii) Also related to Figure 7, was there also a geographic constraint on the choice of the matched control counties? It is surprising that everything is so geographically clustered.

Recommendations about presentation of content:

i) It would be useful to provide a quantitative example of how to interpret one of the results in Figure 2A.

ii) (Subsection “The weather/humidity paradox”) This humidity paradox has come up quite a bit recently and it may be cleaner to have a fuller treatment of this in the Discussion.

iii) The subsection on Variable and model selection at the end of the Discussion seems out of place. Was this meant to be moved to the Materials and methods?

iv) Please point out the Discussion section describing the generalizability of the findings (as mentioned in the response).

v) The Conclusion section does much more of the storytelling that would be expected from a Discussion. It may be more compelling to kick off the Discussion with the analogy describing the epidemiological findings, and then elaborate on the strengths of using multiple methods to corroborate statistical findings.

vi) You may consider condensing the figures in the main text to those that present primary findings (removing some plates and potentially reducing the total number of figures). The figures are all visually appealing, but every single one is densely packed with primary findings, descriptive information, data processing details, and diagnostics/sensitivity. Some of the main takeaways may get lost.

vii) Figure 2 might be more aptly named "Putative determinants of seasonal influenza onset[…]"

viii) Figure 4, Plate I: Please add an explanation of the red circles to the legend title or caption.

ix) Please provide some explanation of how you developed the 126 models examined in Approach 2. Based on the models listed in Table 5 and in the supplement, it seems that there is substantial variation in the number of predictors included in the model and all of the models appear to include multiple autoregressive terms for the same environmental and travel predictors. What is the motivation for this approach? In addition, it would be good to better understand the random effects in the model (subsection “Approach 2: Poisson Regression on Putative Factors”, first paragraph) and the motivation for this model structure. Does this term include a spatial smoothing component?

https://doi.org/10.7554/eLife.30756.035

Author response

[Editors’ note: the author responses to the first round of peer review follow.]

The full reviews are included, but the most important concerns that would need to be addressed in a new submission are the following:

1) Presentation: While eLife does not have strict format requirements, this paper in particular would benefit from a more classic presentation (e.g. Introduction, Materials and methods, Results, Discussion) that provides a fuller description of the background, purpose and how the paper interacts with and adds to the existing (vast) literature on this topic.

As requested, we introduced classic paper structure.

We also increased coverage of literature, for example see new references Alvarado-Facundo et al., 2009; Andreasen, 2003; Araz et al., 2012; Bedford et al., 2015; Boni, 2008; Boni et al., 2004; Chlanda et al., 2015; Rizzo et al., 2008; Shaman and Kohn, 2009; Sham et al., 2010; Tamerius et al., 2013; Treanor, 2004; Vynnycky and Edmunds, 2008; Yasuda et al., 2008; Yasuda, Yoshizawa and Suzuki, 2005.

2) The data sources and the way they are used need a more detailed description and justification. The definition of ILI was challenged by one of the reviewers and needs to be justified.

We expanded descriptions and added a special comment on ILI definition, see subsection “Clinical data source”.

3) While the environmental and demographic factors driving seasonality are extensively examined, there is a need also to take into account viral and host immunity factors such as antigenic drift, seasonal vaccine uptake and antigenic match, host susceptibility and between-host transmissibility (with the latter heavily influenced by the school calendar).

As requested, we included into analysis, as requested, viral antigen diversity(subsection “Vaccination Coverage”), human immunization rates(subsections “Data on vaccination coverage” and “Estimating vaccination coverage”), and schools schedule(subsection “Estimating the effect of return-to-school days”).

4) The conclusions (e.g. the "Southern hypothesis" and bodies of water being sources) are interesting but need stronger explanation and justification. The different approaches taken did not all give exactly the same results. Claims of causality in epidemiology require an extremely high level of evidence so in this case more moderate language may be appropriate (e.g. "the association suggests.…").

We reworded our clams to sound more moderately (no causality statement or “putative causality” at most) throughout the text.

5) It would have been useful to make some attempt to validate the methods used, such as by holding back a validation dataset.

As suggested, we added to our analysis experiments with estimating parameters with test data and then testing the model with a held-out dataset. The predictive power of the best models appears reasonably high (subsection “Validation of Predictive Capability”).

Reviewer #1:

[…] 1) The claims about causality in the paper should be tempered. While Approach 1 may historically have been called Granger causal inference, there is a limit to the extent to which causality can be claimed in the absence of (and sometimes even with) experimental data. At best, the manuscript should claim that the associations in the data are consistent with their hypotheses.

Agreed, carefully changed throughout.

2) The concept of streamlines is an interesting one but it's not clear what the underlying mechanisms behind them are. They seem to imply that there is a continuous line linking infection in a particular location with a source in another, usually distant, location. Is it not more consistent with our understanding of infection spread to assume that infection spread is a diffusion process, even though environmental and host factors may enhance diffusion in particular directions?

We absolutely agree, the streamlines show the patterns of dominant flow that are realized via diffusion-like process.

3) All three approaches used are essentially data mining exercises, followed by post-hoc generation of hypotheses to explain the associations found. This does not mean they are invalid – plenty of science works in this way, and some care has been taken to avoid type I errors. Nevertheless, the possibility of bias, confounding, alternative explanations and simply random error should not be discounted. This is particularly the case because there is strong geographic clustering of many of the variables identified as key drivers (e.g. densely populated areas, coastal areas, places with airports, places at the geographical and climactic extremes of the USA tend to be the same places). Hence the language used should be substantially tempered to be more tentative.

Agreed, changed.

While the authors have used E. coli and HIV spread as controls to ensure that these display different patterns of spatial spread, the problem is that the seasonality of these pathogens is completely different or (in the case of HIV) largely absent. Looking at a non-respiratory but still highly seasonal pathogen e.g. norovirus might be interesting.

4) One shortcoming of the data analysis is that although a large number of variables related to environmental and demographic factors are considered, there is little consideration of viral and host characteristics which are key to driving influenza spread. For instance, over the nine years the model is run, the United States has experienced waves of different influenza types, a pandemic (which led to a shift in the influenza A/H1 subtype), different levels of vaccine coverage with varying levels of match to the prevailing strains, as well as changes in immune profiles of individuals over the population. There is no attempt to understand these drivers and capture their spatiotemporal heterogeneity even though there are datasets (e.g. vaccine coverage, serology, virological testing) that could be used to inform this.

We now repeated our analysis, incorporating additional data types, as suggested. (Please see our reply to point #3 raised by the Editor.)

Note about presentation:

The presentation of the paper is rather confusing. There is hardly any introductory material; the first section is essential a summary of methods and results. This is followed by a more detailed description of the methods and finally a discussion. While eLife does not have strict rules about manuscript formatting, the paper should at least follow the general conventions of medical/life sciences journals i.e. the Introduction, Materials and methods, Results, Discussion division which will make finding information easier for most readers. A fuller Introduction section setting the background is also needed e.g. why this is an important problem from a clinical/public health point of view, what other attempts have been made to address it and what their findings and shortcomings have been?

Agreed, the required changes were made, now we have the standard section titles (please see our reply to point #1 raised by the Editor).

Reviewer #2:

This is an interesting Big Data study that aims to uncover the causal factors driving the onset of epidemic influenza at the county level in the United States. Authors rely on multiple datasets including medical claim records, demographic characteristics, human movement inferred from Twitter data, and climate data. Authors employed three different methodological approaches and found a complex relationship with influenza epidemic onset. Although the study is of significant interest, I do have several concerns indicated below.

1) The definition of epidemic onset is loosely defined and is unclear whether the same definition was employed across approaches. They discuss a "trigger period" of several weeks to exponential expansion. Is exponential expansion a necessary condition? This is the response variable in their study and little is discussed about its variation across counties and studied seasons.

We made it clear that the same definition was used across all approaches and clarified the definition of onset, see subsection “Analysis of quantized clinical data (Approach 1)”, first paragraph.

2) Patterns from different approaches were not necessarily in agreement as authors discussed. In particular, authors did not attempt to reconstruct season-to-season variation in influenza susceptibility. A mechanistic approach that accounts for temporal variation in influenza susceptibility is missing (perhaps departing from previous work by Shaman and his colleagues). Authors did not attempt to capture variation in reproduction numbers, vaccination rates, vaccine efficacy, and likely others across seasons.

In the revised version, we have additional predictors (such as immunization and viral antigenic drift) that bring us closer to the mechanistic interpretation, please see our response to the Editor’s point #3.

3) Authors found strong support for a southern origin of influenza in the US. It would have been interesting to predict flu onset across the vast range of southern counties.

Following the reviewer’s suggestion, we did run a cross-validation analysis, estimating and testing models on disjoint subsets of data (see subsection “Validation of Predictive Capability”).

4) A validation phase was not attempted. For instance, authors could use the first few years of data for model calibration and the remainder of the data for forecasting purposes. It would have been interesting to quantify the predictive power of their approaches.

Agreed. In the revised version, we attempted splitting data into training and testing (see subsection “Validation of Predictive Capability”).

Reviewer #3:

This is a methodologically sophisticated paper which cleverly uses medical claims and other large datasets, and three different computational approaches, to explore whether three factors (weather, X and Y) might explain the timing of seasonal influenza in the US. They find that influenza epidemics tend to start in Southern states.

My impression is that the paper is highly sophisticated in terms of analytics, but that the data sources that are used to describe influenza and the possible drivers (such as weather, mobility) are not always well developed and ill described (Truwen) or even possibly inadequate (Twitter). The findings of short-distance traveling being important driver corroborates findings by Gog et al., another pioneer study using medical claims data to study influenza dissemination, which is nice.

Trusting other reviewers will take on an evaluation of the analytic approaches, I will focus instead on the validity of the data (outcomes, explanatory) side of the matter.

It is my impression that the paper needs substantial revision in order to be ready for publication. For example, data sources and approaches must be far better described and more sensitivity analyses done. The lack of consideration of genetic drift variants and school schedule patterns, along with the likely consequences of regional differences in Truwen data coverage and ICD coding, if any, should be argued.

On the topic of drivers not considered:

1) One obvious driver of seasonal influenza timing is not mentioned in this paper: the degree of genetic drift of the influenza virus relative to those that circulated in earlier years. That influenza occurs early in big "drift" years is evident from the unusually early onset of the 2003 Fujian H3N2 epidemic, and also of most pandemics in history including the 2009 pandemic. This is not discussed and that is a caveat that should be discussed. If such data are too much trouble, then at the very least consideration of which influenza subtype is dominant in each season would be useful; for example influenza B tends to be epidemic later than influenza A.

a) Could authors do a sensitivity sub-analysis of seasons of major drift variants (2003 Fujian and 2009 pandemic flu) to see if their identified drivers also are found for those?

We now included genetic drift into our modeling as suggested by the reviewer, please see our response to the Editor’s point #3.

2) Another driver that is not discussed is the school session patterns across the US during the autumn and winter. As it was picked up as a major driver of pandemic spread by Gog et al., 2014, why would it not also be important for seasonal influenza? This should be listed as a shortcoming of the analysis, or authors could explain in the text why they did not think this was important.

We included the school seasons in our analysis, please see our response to the Editor’s point #3.

On the topic of validity and validation of the observational data underlying the analyses:

3) I am concerned about the robustness of the analysis due to possible low counts of ILI as the authors defined it narrowly as ICD codes 487-488 – which could mean much noise in ILI time series at the local level. The authors may benefit from a very similar study of influenza dissemination in the 2009 pandemic in the US, which was based on IMS CMS1500 claims data (Gog et al., 2014); the underlying paper preparing the medical claims data and validating against CDC influenza data (Viboud et al., Plos One 2013). Viboud et al., established that adding 079.9 (ICD9) to the case definition dramatically (by 90%) increased the sensitivity of the case definition, without much cost on the specificity.

The addition of codes 079.98 and 079.99 does substantially increase the number of unique people with viral infections cases (over all years) from 4,009,982 to 10,682,458. However, after carefully considering description of codes 079.99 and 079.98, we were uncomfortable using them. For example, code 079.99 is associated with the following list of conditions.

- Acute retinal necrosis

- Acute viral bronchiolitis

- Acute viral disease

- Acute viral laryngotracheitis

- Acute viral otitis externa

- Acute viral thyroiditis

- Amantadine resistant virus present

- Arthritis due to viral infection

- Arthritis of hand due to viral infection

- Arthritis of knee due to viral infection

- Arthropathy associated with viral disease

- Boid inclusion body disease

- Cardiomyopathy due to viral infection

- Cardiomyopathy, due to viral infection

- Congenital viral disease

- Congenital viral infection

- Disease of possible viral origin

- Encephalitis due to influenza-specific virus not identified

- Maternal viral disease complicating pregnancy

- Maternal viral disease in pregnancy

- Neonatal viral infection of skin

- Nonspecific syndrome suggestive of viral illness (finding)

- Oral mucosal viral disease

- Postpartum (after childbirth) viral disease

- Postpartum viral disease

- Postviral depression

- Postviral excessive daytime sleepiness

- Postviral infection debility

- Viral acute pancreatitis

- Viral bronchitis

- Viral carditis

- Viral dermatitis of eyelid

- Viral disease

- Viral disease in childbirth

- Viral disease in pregnancy

- Viral ear infection

- Viral esophagitis

- Viral eye infection

- Viral infection

- Viral infection by site

- Viral lower respiratory infection

- Viral musculoskeletal infection

- Viral myositis

- Viral pleurisy

- Viral respiratory infection

- Viral retinitis

- Viral syndrome

- Viral ulcer of esophagus

- Virus present

- Zanamivir resistant virus present

While we are not disputing Viboud et al.'s conclusions in the paper cited by the reviewer, given the large volume of specific data in our sample (4,009,982 unique people with ILIs), we do not see a compelling reason to add cases with non-specific codes. We directly evaluated “our” stricter-defined ILIs against the CDC data (temporal prevalence of influenza) and observed a very good correspondence (that deteriorates with addition of non-specific viral infection codes).

a) Yet in the "Flammable media" (Discussion section) the authors discuss how 20% of Southeners have ILI each season. This seems way too high, as most people do not seek care for ILI and 20% (and what would the rate have been if the ILI definition was not so narrow?) is already higher than population-based sero-epidemiological assessments of attack rates and it is well known that most of people with ILI do not seek medical care and that at least half of infections are asymptomatic. So 20% ILI population prevalence in Truwen data is unbelievable.

We introduced more careful wording, mentioning unreported cases, see page 22, lines 625—626.

i) Furthermore, it is difficult to believe that Southern US is better connected than, for example, dense US Northern cities like New York with large scale public transportation where influenza may be easily spread. See my later point about ICD coding preferences – another worry that seasonal initial levels in southern US is a data artefact somehow, due to for example higher Truwen coverage and different ICD coding practices.

We are talking about empirically ascertained social networks, measuring actual number of contacts among people, see Table 1.

ii) Moreover, the authors comment at some point that the ILI prevalence is higher in the mid-west. However, local differences in ICD coding preferences by doctors are likely a better explanation, and so these differences could just be artefacts. For that reason it would be better to not include prevalence in the modeling, but rather base the analysis solely on the initial epidemic phase and/or the peak timing.

In predictors we use a derivative of prevalence, which should be invariant to coding biases. Our cross-validation results suggest that, even if biases are present in coding, the modeling does capture predictive information.

4) The Truwen data source and extraction of the ILI and other time series should be described in much more detail. Is it CMS-1500 data? Were influenza codes sought among all codes listed or only the first-listed? Etc. Also, it would be helpful to show graphically the time series data, to allow the reader to evaluate data stability and volume – over time and between locales. Or else, add some summary table that summarizes such observational findings. The Truwen claims database is likely a convenience sample, so it is likely that the coverage increases dramatically over time since 2003, and also that there are substantial differences in coverage by US geographical areas depending on Truwen levels of foot holds. Did the authors see such intrinsic variability in the data (for example, consider another code in summer months? How would such variability in coverage influence the analyses presented here – or is it somehow taken into account?

Agreed: we used percent of insured per county to adjust for insurance coverage variation. Also, we used the total number of people described each week for given county as an offset in the Poisson regression. To answer the reviewer’s question about geographic distribution of cases, we have a movie showing ILI spread over nearly a decade-period.

a) The authors offer a comparison with HIV patterns – but is that appropriate as HIV is a chronic disease – how exactly was this done? Did authors extract only ICD codes for new onset HIV or using longitudinal patient-level data (nothing in the Materials and methods section about this)? A better comparison would perhaps be something else epidemic but with another mode of transmission such as rotavirus or norovirus diarrhea epidemics?

The HIV figure is shown only to illustrate that different diseases have distinct patterns of flow. As norovirus and rotavirus infections are much rarer, we have concerns about consistency of coding and testing (i.e., specificity) for these infections.

5) Validity of Twitter data to describe mobility? The authors conclude that short-distance land travel is an important determinant of the spread of influenza, using Twitter data to determine mobility. But is this defensible, as only older kids and adults have Twitter accounts – yet we know that children are important drivers of influenza (see fx Gog et al., two major factors determining spread of influenza was school schedules and short distance traveling). Please comment on this possibility that the Twitter data are not determinants of children's movements.

We also used a measure of direct proximity of counties to account for land movements, and also explicitly tested the school schedule hypothesis.

6) On the "Initiation spark" (Discussion section). The authors conclude that cold weather and low humidity are important drivers. Put together with their finding that most epidemics originate in the Southern United states, one must ask: How could cold weather/low humidity be a predictor of influenza and influenza starts in the South – when Southern states have warmer climate and higher humidity in early winter than Northern states? Would be great if the authors would discuss this possible paradox.

We added a new paragraph explicitly discussing this “paradox,” see subsection “Is there a paradox here?”.

Reviewer #4:

This manuscript presents an analysis of a large dataset of ILI from insurance records to identify spatial patterns in influenza outbreak initiation and spread. The authors present 3 classes of analysis to identify variables correlated with outbreak initiation. Overall, I found this work very interesting but in dire need of restructuring to clarify the main questions and methods. At present, this reads as an explanation of several very innovative approaches to mining these large datasets for patterns, but doesn't lay out a clear goal or hypothesis. Similarly, the Discussion lists an array of patterns that were discovered, but fails to identify a single (or even just a few) key insight that was gained through this approach.

Agreed, restructured. The first paragraph in Discussion now explains the innovative claims of the manuscript.

I'm usually not a stickler for presentation of papers in the classic Introduction/Materials and methods/Results/Discussion format – but given that the authors are presenting several novel methods here, I think the manuscript would really benefit from being more classic. At present, some aspects of the methods are described in the "Approach #" sections, but others are presented in the "Materials and methods". I think it would be much clearer to group all the methods together so that the results could themselves be presented in a more concise format that would highlight the key insights.

Agreed, done as requested: descriptions of all methods are combined in “Materials and methods,” while “Results” summarize conclusions. Please see reply to the Editor’s comment #1.

The figures are excellent, but they are referenced out of order in the text; e.g. Figure 2I gives results from Approach 2 which is presented after the results from Approach 1, which are in Figure 4 and 5.

Corrected, figures are reordered.

Overall, I would like to see a more concise statement of the major innovation from this approach and this analysis. At present, this is a list of variables that were identified as significant (albeit using some really interesting methods and excellent data), many of which have already been identified as predictive of flu in the past. What would be really interesting, if possible, would be to illustrate whether this method was in some way more predictive of flu dynamics than conventional methods. For example, could these analyses be carried out on the first 8 years of ILI data and the resulting model used to predict the 9th; if so, does the addition of Approach 1/3 do better than the standard Poisson regression?

We added explicit highlights of innovation in our work: see the first paragraph in Discussion, which now explains the innovative claims of the manuscript.

[Editors' note: the author responses to the re-review follow.]

Essential revisions:

1) The literature review is still not adequate. Reviewing the literature is not simply a matter of having a string of references. Ideally a couple of paragraphs would be added to the Introduction or Discussion discussing the history of related statistical models for understanding the timing, spread and burden of seasonal influenza, and how this paper builds on the existing literature.

As requested, we have augmented the manuscript with a short review of prior mathematical modeling approaches to analyzing influenza epidemics (see subsection “Recent computational studies of influenza”, fourth paragraph).

Also several of the previous reviewers found it difficult to determine the actual aim of the data-driven exercise. While we understand and are sympathetic to the point that this is not a traditional hypothesis-driven study, it still needs an aim if not a hypothesis. Could the last line of the first paragraph be reworked into what the aim of this was? (e.g. is it to determine the most important factors that are associated with initiating influenza transmission in the continental USA at the beginning of each season?)

Done as requested, see Introduction, first paragraph.

2) You state that "our more precisely defined ILI data defines better against the CDC dataset" – could details (including quantitative measures of better definition) be given? Both narrower and broader definitions of ILI are imperfect proxies of influenza activity, although in different ways, and it is not a given that a more specific measure is necessarily less biased.

In the revised manuscript, we provide a whole new section dedicated to this issue. We provide definitions of ICD-9-CM codes, and argue that the narrow definition we chose is most appropriate for our purposes. We argue this while directly accepting that, similar to what is outlined in Viboud et al.’s PLOS ONE article, we have no perfect gold standard to evaluate the quality of this choice; precision appears to be more important for our analysis than sensitivity.

3) The incorporation of antigenic drift, vaccine coverage and school data are important and innovative additions to the paper. However, one missing factor is host immunity not due to vaccination i.e. if there is a strong influenza season in one year with no substantial antigenic drift after that then the population immune to influenza is likely to be greater the following year. Ideally, this could be captured mechanistically by incorporating SIR-type epidemiological models, or by more statistical approaches, ideally informed by measures of prior immunity such as serology. At the very least, the implications of this factor need to be adequately discussed.

We added the following statement to the Materials and methods to address this issue:

“Accounting for non-vaccination host immunity: Resistance to influenza infection in human hosts arises via two related mechanisms: immunological memory from a previous infection by an identical or sufficiently similar strain, or vaccination against the current strain (or a sufficiently similar strain). […] Likewise, if the absolute deviation from years previous starts decreasing, we would also register a decrease in susceptibility. Incorporating these factors in the diverse set of models that we investigate guarantees that we are indeed considering host susceptibility contributions from a wide range of possible mechanisms.

4) We still think (as reviewer 3 of the previous round of reviews pointed out) that a 20% influenza infection rate is far too high, even if it incorporates asymptomatic/subclinical infection. This is much greater than the change in seropositivity over most influenza seasons other than pandemic years.

Agreed.

Our estimates of influenza infection rate are unavoidably inflated. This is because what we are measuring is the week- and county-specific prevalence of influenza-like illnesses among insured patients who contacted their physician for any reason during the week in question. In other words, the denominator that we use for computing prevalence is almost certain to be much smaller than the overall population of the corresponding county.

We introduced the corresponding disclaimer into the manuscript; see the “Flammable Media” paragraph in the Discussion.

5) Given the spatiotemporal nature of the findings, the authors should include more information about the Truven database. For instance, are the authors able to provide additional details about the spatial distribution of reporting rates (e.g., number of physicians contributing claims in a given county) within the Truven database? While insurance claims do provide high resolution spatial information, the use of raw counts for influenza-related diagnosis codes could be problematic if there is great spatial or temporal variation in reporting rates.

Note that we are using proportions (of people affected with ILIs out of total number of visible people in the MarketScan) rather than raw counts of affected individuals.

We provide a movie that represents influenza prevalence across space and time (each frame is one week), which is the best summary of data we can think of. We supplemented the description of the database with reference to 2017 White Paper, and additional statistics on ILI spatial and temporal distribution in the Supplement (see Figure 1—figure supplement 4C).

6) While the methods and results are clearly described, the organization of the paper can be improved to facilitate easier reading. We found ourselves flipping back and forth a lot between the text because details that appeared to be missing earlier on would then appear later in different parts of the text. Figures were also cited out of order across the methods descriptions, and in some cases we don't believe all of the figures were discussed.

We made a significant effort to streamline and harmonize our exposition of methods and results.

As a few examples, the methods used to generate Figures 3I-3K were not discussed until the Results (subsection “Approach 1: Non-parametric Granger analysis”, fourth and fifth paragraphs) but Figure 3 results are first discussed in the Materials and methods (subsection “Analysis of quantized clinical data (Approach 1)”, first paragraph).

Changed as suggested.

For more concrete advice, we would remove references to results-related figures from the Materials and methods section so that readers examine the figures only after all of the terms and analyses have been explained, and, having a cleaner separation between Materials and methods and Results will improve readability. There were many times where the authors alluded to some small part of the results in the Materials and methods before the description of the analysis had been completed.

Done as requested.

Conversely, the Results section had relatively detailed methods descriptions, which seemed redundant since the Results followed the Materials and methods. The authors may also consider a greater overhaul of the manuscript structure that groups the primary results in the text and figures (as opposed to organizing the manuscript by methodological approach) and moves many of the supporting or diagnostic figures to the supplement.

We removed the redundancy.

7) Beyond improving the organization of the paper, we think the reporting of results for Approach 1 could be better explained and supported in the text. For instance, it was not clear how the claims in the sixth and eighth paragraphs of the subsection “Approach 1: Non-parametric Granger analysis” were demonstrated with the cited figures.

We improved the description of Approach 1, providing more extended description.

8) Finally, given the emphasis on the novelty of comparing results across three methodological approaches, the authors should discuss the generalizability of their findings to influenza epidemiology beyond the United States and the generalizability of their framework to other contexts with high spatiotemporal resolution data.

We added a discussion of this generalizability (see subsection Generalizability in Discussion).

[Editors' note: further revisions were requested prior to acceptance, as described below.]

The remaining issues with the manuscript can be divided into two categories: (i) substantial scientific issues that need to be addressed before the manuscript is published, (ii) recommendations about the description of the work that we strongly encourage you to take but which are optional.

Essential revisions about scientific content:i) (Discussion, second paragraph) It is still difficult to interpret the results related to short-range and long-range travel. We were unable to find the regression results for the full set of data weeks in the supplement. Also, there is not enough detail in the analysis description to see how Figure 5 Plates C-F demonstrate that local connections are more explanatory than airport connections.

Our conclusion that local travel is predominantly responsible for disease wave propagation is supported by several lines of analysis.

First, continuous land-movement infection waves are visible in the weekly influenza rate movie; we computed this movie from insurance claim data and made it available with results of this study.

Second, because our all-weeks-included dataset was too large for the R MCMCglmm package to handle, we performed mixed-effect Poisson regression calculations using a 50 percent random sample of all the weeks for which data were available. In this computation, the airport proximity fixed-effect coefficient turned out to be statistically significantly negative (see Source data 2, as well as the editable output file “flu-50-percent-weeks.txt”).

Third, the results from our Granger-causality inference showed that:

1) Local county-to-county movements were much more predictive of influenza wave change than airport movements. In comparing Plates E and F in Figure 5F’s local movement causality coefficient (γ) is, on average, twice as large as that for long-range movement (Figure 5E). Figure 5E shows that the mean long-range causality coefficient is approximately 0.05, whereas it is just over 0.1 in the local propagation. As the causality coefficient quantifies the amount of predictability (measured as information in bits) communicated about the target data stream per observed bit in the source data stream, it follows that, on average, every ten bits of sequential incidence data from an influencing location tells us one bit about the unfolding incidence dynamics in the target location. Therefore, in the long-range movement case, informativeness is twice as low, so we need on average 20 bits to infer one bit about the state of infection. These calculations strongly suggest that local movement is predictively stronger with regards to influenza infection propagation.

2) While the most frequent value of the computed time delay in influence propagation between counties with large airports is zero weeks, this distribution is significantly flatter compared to that for local, county-to-county influence propagation.

We updated the main text of the manuscript with the above prose.

ii) The highlighted areas (greater incidence) in the first flu season snapshot for Figures 3A-H each should match relatively well to the maps in Figure 7B (or at least some unspecified subset of "high incidence treatment counties"). While the county-matching results to point to initiation at coasts, 2006-2007 in Figure 3 seems to initiate from the West Coast and 2010 in Figure 7 has a scattered pattern across the middle of the US. Please check this or justify the differences.

There is no mistake; the resulting patterns are what they turn out to be. The intuitive explanation of perceived discrepancy is that the matching method agrees with other analysis types predominantly, but not in all cases. Each analysis has limitations. In the case of the matching analysis, we have less statistical power than in, for example, Poisson regression; matching by numerous parameters reduces the initial set of thousands of counties to a handful of matching “treated” counties (which meet a particular combination of weather and sociodemographic conditions) and “untreated” counties (very similar to “treated” ones in all respects but treatment). The difficult-to-match, “weeded out” counties may happen to be in the coastal areas indicated as the most likely places of influenza wave origin by other analyses.

In the case of the 2007 and 2010 results, the matching analyses pick patterns that are different from those produced by the causality streamline analysis and mixed-effect Poisson regression models.

Figure 6’s Plate B shows the distribution of the treatment counties and matched-non-treatment counties. Note that here, we are not directly predicting initiation, so while the patterns in Figure 3 and Figure 6 should indeed have some similarity, they are not required to match up perfectly. The most similar treated counties do indeed show up in the southern shores.

iii) Also related to Figure 7, was there also a geographic constraint on the choice of the matched control counties? It is surprising that everything is so geographically clustered.

The geographic clustering arose from imposing similar weather patterns from multiple climate variables; we added explanation of this observation to the main text. (See subsection “Approach 3: Matching counties and factor combinations” in Results.)

Recommendations about presentation of content:

i) It would be useful to provide a quantitative example of how to interpret one of the results in Figure 2A.

The fixed-effect regression coefficients plotted in Plate A are shown on logarithmic scale, meaning that the absolute magnitude of predictor-specific effect is obtained by exponentiating the parameter value. A negative coefficient for a predictor variable suggests that influenza rate falls as this factor increases, while a positive coefficient predicts growing rate of infection as the parameter value grows. The integrated influence of individual predictors, under this model, is additive with respect to the county-specific rate of infection.

For example, a coefficient of -0.6for parameter AVG_PRESS_mean tells us that the average atmospheric pressure has a negative association with the influenza rate. As the mean atmospheric pressure for the county grows, the probability that the county would participate in an infection initiation wave falls. As exp(-0.6)=0.54, the rate of infection drops by 46 percent when atmospheric pressure increases by one unit of zero-centered and standard-deviation-normalized atmospheric pressure. Similarly, an increase in the share of a white Hispanic population predicts an increase in influenza rate: A coefficient of 1.3 translates into a exp(1.3) ⋅ 100% – 100% = 267% rate increase, possibly, because of the higher social network connectivity associated with this segment of population.

We put this information to legend of Figure 1.

ii) (Subsection “The weather/humidity paradox”) This humidity paradox has come up quite a bit recently and it may be cleaner to have a fuller treatment of this in the Discussion.

We considered this suggestion; we do not have much to add on this topic, beyond what is already discussed. Apologies.

iii) The subsection on Variable and model selection at the end of the Discussion seems out of place. Was this meant to be moved to the Materials and methods?

Yes, it has been moved.

iv) Please point out the Discussion section describing the generalizability of the findings (as mentioned in the response).

We have added the section: our apologies, it was added, then deleted by mistake.

v) The Conclusion section does much more of the storytelling that would be expected from a Discussion. It may be more compelling to kick off the Discussion with the analogy describing the epidemiological findings, and then elaborate on the strengths of using multiple methods to corroborate statistical findings.

Agreed; it has been changed as suggested.

vi) You may consider condensing the figures in the main text to those that present primary findings (removing some plates and potentially reducing the total number of figures). The figures are all visually appealing, but every single one is densely packed with primary findings, descriptive information, data processing details, and diagnostics/sensitivity. Some of the main takeaways may get lost.

We made an effort to streamline figures, moving whole Figure 1 to the supplement.

vii) Figure 2 might be more aptly named "Putative determinants of seasonal influenza onset[…]"

Agreed. Corrected.

viii) Figure 4, Plate I: Please add an explanation of the red circles to the legend title or caption.

Done.

ix) Please provide some explanation of how you developed the 126 models examined in Approach 2. Based on the models listed in Table 5 and in the supplement, it seems that there is substantial variation in the number of predictors included in the model and all of the models appear to include multiple autoregressive terms for the same environmental and travel predictors. What is the motivation for this approach?

We added and removed random effects to the model structure, taking out and adding fixed effects, running the regression, and plotting the DIC. Model complexity was simply proxied by the descriptional complexity in terms of how large the model was (the number of factors in the regression equation). We stopped when the drop in DIC stabilized.

The driving idea here is the identification of the Pareto front that trades off accuracy (in terms of DIC) with model complexity. This is, of course, a “greedy approach,” in the sense that we do not guarantee that we have indeed found the “best” possible model. However, since our cross-validation (Figure 5 in main text) yielded good results, we deemed the stopping rule to be satisfactory.

We put this information into the main text (see subsection “Variable and Model Selection” in Materials and methods).

In addition, it would be good to better understand the random effects in the model (subsection “Approach 2: Poisson Regression on Putative Factors”, first paragraph) and the motivation for this model structure. Does this term include a spatial smoothing component?

No, we did not use an explicit spatial smoothing. The county-level random effects were used to account for possible heterogeneity in disease reporting across counties. However, we used as predictors average values of infection rates of immediate neighbors for the county (1, 2, 3 and 4 weeks in the past), this prediction structure afforded an implicit spatial smoothing, dampening spurious fluctuations in infection rates.

We clarified this point in the main text, see Approach 2 in Materials and methods.

https://doi.org/10.7554/eLife.30756.036

Article and author information

Author details

  1. Ishanu Chattopadhyay

    1. Institute of Genomics and Systems Biology, University of Chicago, Chicago, United States
    2. Department of Medicine, University of Chicago, Chicago, United States
    Contribution
    Conceptualization, Resources, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID icon 0000-0001-8339-8162
  2. Emre Kiciman

    Information and Data Science Group, Microsoft Research, Redmond, United States
    Contribution
    Formal analysis, Methodology, Writing—original draft, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID icon 0000-0001-5429-468X
  3. Joshua W Elliott

    Computation Institute, University of Chicago, Chicago, United States
    Contribution
    Resources, Data curation, Formal analysis, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
  4. Jeffrey L Shaman

    Department of Environmental Health Sciences, Mailman School of Public Health, Columbia University, New York, United States
    Contribution
    Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
  5. Andrey Rzhetsky

    1. Institute of Genomics and Systems Biology, University of Chicago, Chicago, United States
    2. Department of Medicine, University of Chicago, Chicago, United States
    3. Computation Institute, University of Chicago, Chicago, United States
    4. Departments of Human Genetics, University of Chicago, Chicago, United States
    Contribution
    Conceptualization, Resources, Software, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    arzhetsky@uchicago.edu
    Competing interests
    No competing interests declared
    ORCID icon 0000-0001-6959-7405

Funding

National Institutes of Health (U01HL108634-01)

  • Jeffrey L Shaman

National Institutes of Health (R01GM100467)

  • Jeffrey L Shaman

National Institutes of Health (U01GM110748)

  • Jeffrey L Shaman

National Institutes of Health (1P50MH094267)

  • Andrey Rzhetsky

Defense Sciences Office, DARPA (W911NF1410333)

  • Andrey Rzhetsky

National Institutes of Health (R01HL122712)

  • Andrey Rzhetsky

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Erin Gannon, Mark Jit, Prabhat Jha, Elizabeth C Lee, and Margarita Rzhetsky for numerous comments on earlier versions of the manuscript. This work was supported by NIH grants 1P50MH094267, U01HL108634-01, R01HL122712-02, R01GM100467, and U01GM110748, in addition to DARPA contract W911NF1410333, and a gift from Liz and Kent Dauten.

Reviewing Editor

  1. Mark Jit, Reviewing Editor, London School of Hygiene & Tropical Medicine, and Public Health England, United Kingdom

Publication history

  1. Received: July 26, 2017
  2. Accepted: February 13, 2018
  3. Accepted Manuscript published: February 27, 2018 (version 1)
  4. Version of Record published: March 22, 2018 (version 2)

Copyright

© 2018, Chattopadhyay et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,527
    Page views
  • 260
    Downloads
  • 1
    Citations

Article citation count generated by polling the highest count across the following sources: PubMed Central, Crossref, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Epidemiology and Global Health
    2. Microbiology and Infectious Disease
    Ellie Sherrard-Smith et al.
    Research Article
    1. Epidemiology and Global Health
    2. Genetics and Genomics
    Atif Rahman et al.
    Tools and Resources