Evaluating distributional regression strategies for modelling self-reported sexual age-mixing

  1. Timothy M Wolock  Is a corresponding author
  2. Seth Flaxman
  3. Kathryn A Risher
  4. Tawanda Dadirai
  5. Simon Gregson
  6. Jeffrey W Eaton
  1. Department of Mathematics, Imperial College London, United Kingdom
  2. MRC Centre for Global Infectious Disease Analysis, School of Public Health, Imperial College London, United Kingdom
  3. London School of Hygiene & Tropical Medicine, United Kingdom
  4. Manicaland Centre for Public Health Research, Biomedical Research and Training Institute, Zimbabwe

Abstract

The age dynamics of sexual partnership formation determine patterns of sexually transmitted disease transmission and have long been a focus of researchers studying human immunodeficiency virus. Data on self-reported sexual partner age distributions are available from a variety of sources. We sought to explore statistical models that accurately predict the distribution of sexual partner ages over age and sex. We identified which probability distributions and outcome specifications best captured variation in partner age and quantified the benefits of modelling these data using distributional regression. We found that distributional regression with a sinh-arcsinh distribution replicated observed partner age distributions most accurately across three geographically diverse data sets. This framework can be extended with well-known hierarchical modelling tools and can help improve estimates of sexual age-mixing dynamics.

Introduction

Patterns in sexual mixing across ages determine patterns of transmission of sexually transmitted infections (STIs). Consequently, sexual age-mixing has been of great interest to researchers studying the human immunodeficiency virus (HIV) since the beginning of the global epidemic. Anderson et al., 1992 used a model of partnership formation to predict that mixing between young women and older men would amplify the already-substantial effect of HIV on population growth. Garnett and Anderson, 1994 used a mathematical model to show that patterns of age-mixing could substantially influence the magnitude and timing of hypothetical epidemic trajectories, whereas Hallett et al., 2007 demonstrated that delaying sexual debut and increasing age-similar partnerships could reduce an individual’s risk of HIV infection in a highly endemic setting.

These modelling studies have been complemented by analyses of survey and population cohort data on age-mixing patterns. Gregson et al., 2002 and Schaefer et al., 2017 observed that individuals with older partners were at greater risk of HIV infection in a general-population cohort in Zimbabwe. Ritchwood et al., 2016 and Maughan-Brown et al., 2016 found that larger age differences were associated with more risky sexual behaviour in surveys of young South African people. Similarly, Akullian et al., 2017 found that partner age was an important risk factor for HIV infection in a cohort study in rural South Africa. On the other hand, Harling et al., 2014 found that age-disparate relationships were not associated greater risk of HIV acquisition in young women in South Africa.

These results underscore the importance of considering age-mixing dynamics when designing and evaluating HIV prevention strategies, and, consequently, the importance of measuring them accurately. For example, an intervention aiming to prevent new HIV infections among young women could be attenuated by high prevalence among older men. Identifying changes in sexual partner age distributions and attributing them to interventions might even be a valuable end by itself, in which case accurate measurement must be complemented by an effective modelling strategy.

Data about sexual partner age-mixing are routinely collected by long-term cohort studies (such as those that comprise the Analysing Longitudinal Population-based HIV/AIDS data on Africa, or ALPHA, Network) and large-scale household surveys (such as the Demographic and Health Surveys) (The DHS program, 2021; Reniers et al., 2016). Typically, these data consist of the respondent’s age and sex and the ages of their sexual partners in the last 12 months. These data are highly variable, skewed, and often deviate substantially from conventional parametric distributions, such as the normal distribution or the gamma distribution (Beauclair et al., 2018).

One may consider statistical modelling approaches for the distribution of partner age as a function of respondent age and sex. Some notable previous approaches to modelling partner age distributions include (Morris, 1991), who developed a log-linear modelling framework to quantify selective mixing from contingency tables, and applied the model to ordinal categorical data on husbands’ and wives’ ages in the United States. Hallett et al. used a log-logistic distribution to continuously model partner age differences for women aged 15–45 years, assuming that the partner age difference distributions did not vary over respondent age. More recently, as an input to a model of Chlamydia trachomatis, Smid et al., 2018 fit skew normal distributions to each age-/sex-specific partner age distribution and used a secondary regression model to smooth the estimated skew normal parameters across respondent age. They observed substantial changes in the estimated skew normal parameters with respect to respondent age. Although this method allows for non-linear variation across respondent age, their two-stage estimation process makes uncertainty propagation complex. Replacing this process with a single ‘distributional’ regression model, in which all distributional parameters (e.g. the location, scale, skewness, etc.) are modelled as functions of data (Kneib and Umlauf, 2017), allows for complex variation across respondent age while still robustly incorporating uncertainty. Another elegant approach has been the development of exponential-family random graph models (ERGMs) to infer full partnership networks from individuals reports of the partnerships (’ego-centric’ observations of the network) (Krivitsky and Morris, 2017). These stochastic methods, along with the broader suite of ERGMs (Hunter et al., 2008a; Hunter et al., 2008b; Krivitsky et al., 2011; Krivitsky and Handcock, 2014), can model social network data accurately with robust incorporation of covariates, and tools exist to incorporate their estimates into epidemic models (Jenness et al., 2018; Morris, 1993).

We focused on evaluating parametric models for continuously representing the distribution of sexual partner ages conditional on the respondent’s age. We were specifically interested in distributions that introduce parameters to control tail weight, which may capture intergenerational mixing that could sustain endemic HIV and STI transmission (Akullian et al., 2017; Schaefer et al., 2017; Harling et al., 2014). This led us to test the ability of the four-parameter ‘sinh-arcsinh’ distribution originally proposed by Jones and Pewsey, 2009 to fit to these data.

We hypothesised that integrating the sinh-arcsinh distribution into a distributional modelling framework would allow us to replicate observed partner age distributions more accurately than prior modelling strategies. We tested this theory by comparing a variety of candidate strategies, which varied along three dimensions: the parametrisation of the dependent variable, the choice of distribution, and the method for incorporating variability across respondent age and sex.

Materials and methods

Data

In this work, we compared a set of probability distributions and regression specifications to identify a modelling strategy that produced stable and accurate estimates of sexual partner age distributions with well-quantified uncertainty. We conducted two model comparison experiments to identify which of a set of strategies best replicated partner age distributions. First, in our probability distribution comparison, we identified which of a set of distribution-dependent variable combinations fit best to age-/sex-specific data subsets, and then, in our distributional regression evaluation, we tested whether distributional regression methods could be used to estimate age-/sex-specific partner age distributions by sharing strength across observations. In distributional regression, all parameters of a distribution can vary with respect to data, so it allowed us to smooth and interpolate even higher order moments of observed partner ages. We divided the model comparison into two separate experiments to make the probability distribution comparison as fair as possible (accounting for the possibility that certain distributions would perform particularly well under certain regression specifications).

We analysed data on sexual partner age distributions from three sources: the Africa Centre Demographic Information System, a health and demographic surveillance site in uMkhanyakude district, South Africa collected by the African Health Research Institute (AHRI) (Gareta et al., 2021; Gareta et al., 2020aGareta et al., 2020b), the Manicaland General Population Cohort in Zimbabwe (Gregson et al., 2017), and the 2016–2017 Demographic and Health Survey (DHS) in Haiti (Institut Haïtien de l’Enfance, 2018).

The AHRI and Manicaland studies are multi-round, open, general population cohort studies designed to measure the dynamics of HIV, sexual risk behaviour, and demographic change in sub-Saharan African settings. We used rounds one through six of the Manicaland study, collected between 1998 and 2013. The AHRI data we used were collected annually between 2004 and 2018. The 2016–17 Haiti DHS was a large, nationally representative household survey conducted in 2016 and 2017. We did not incorporate the weights associated with the survey into this analysis because our primary interest was in statistical modelling of partner age distribution as a function of respondent age, not producing population representative statistics for the Haitian population.

These data sets consisted of individuals’ reports of their own age and sex and the ages of each of their sexual partners from the last year. Let i(1,,N) index reported partnerships, ai[15,64] and si{0,1} be the age and sex of the respondent in partnership i with s=1 indicating female, and pi be the age of non-respondent partner in partnership i. These questionnaires do not ask specifically about partner sex, but self-reporting of non-heterosexual partnerships in these populations is thought to be low (Arias Garcia et al., 2020; World Health Organization and UNAIDS, 2020).

Respondents in each of these data sets are disproportionately likely to report that their partners’ ages are multiples of five or multiples of five away from their own age, leading to distinct ‘heaping’ in the empirical partner age (or age difference) distributions at multiples of five. For example, if a questionnaire asks ‘how many years older or younger is your partner than you?', respondents might be disproportionately likely to report a multiple of five, leading to age differences that are heaped on multiples of five. We tested the sensitivity of our results to heaping by developing a simple ‘deheaping’ algorithm, applying it to the AHRI data, and running each analysis on the deheaped AHRI data. We present these results in Appendix section ‘Age heaping'.

Probability distribution comparison

To identify the best probability distribution for modelling sexual partner age distributions, we split all three data sets into 12 subsets by sex and five-year age bin ranging from 20 to 50, resulting in 36 subsets, and fit a number of distribution-dependent variable combinations to each subset.

Distributions

Request a detailed protocol

We tested five candidate probability distributions: normal, skew normal, beta, gamma, and sinh-arcsinh. Table 1 summarises the domains, parameters, and probability density functions (PDFs) of these distributions. Because the gamma distribution is always right-skewed and men typically partner with women who are younger than them, we transformed data among male respondents to be right-skewed when using the gamma distribution. Specifically, we multiplied the men’s partners’ ages by −1 to reflect the distribution horizontally across the y-axis, and added 150 to the reflected ages to ensure that all resulting values were positive. Similarly, the beta distribution is only defined on the interval (0, 1), so, only when using a beta distribution, we scaled all partner ages to be between zero and one using upper and lower bounds of 0 and 150.

Table 1
Details of the five distributions tested in this analysis.

We define xz=(x-μ)/σ, p(x) to be the standard normal PDF, Φ(x) to be the standard normal cumulative density function, Sϵ,δ(x)=sinh(ϵ+δasinh(x)), and Cϵ,δ(x)=cosh(ϵ+δasinh(x)).

DistributionParametersDomainPDF
Normalμ(location)σ>0(scale)1σ2πexp[-xz2]
Skew normalμ(location)σ>0(scale)ϵ(skewness)2σp(xz)Φ(ϵxz)
Gammak>0(shape)θ>0(scale)+1Γ(k)θkxk-1exp[-xθ]
Betaα>0(left)β>0(right)(0,1)xα-1(1-x)β-1B(α,β)
Sinh-arcinhμ(location)σ>0(scale)ϵ(skewness)δ>0(tail weight)1σ2πδCϵ,δ(xz)1+xz2exp[-Sϵ,δ(xz)22]

The sinh-arcsinh distribution, presented by Jones and Pewsey, 2009, is an extension of Johnson’s distribution (Johnson, 1949). It has four parameters: location, scale, skewness, and tail weight (denoted, μ, σ, ϵ, and δ, respectively), and it can deviate substantially from the normal distribution. Figure 1 plots the density of this distribution with μ=0 and σ=1 for a variety of values of skewness and tail weight.

The sinh-arcsinh density with μ=0, σ=1, and a variety of assumptions about ϵ and δ.

Dependent variable transformations

Request a detailed protocol

We considered the possibility that certain distributions could fit better to particular transformations of the dependent variable (partner age) by testing a set of four potential outcome parametrisations. For example, if X is a positive-valued, right-skewed random variable, then assuming logX is normally distributed might be more effective than assuming that X itself is normal.

Let yi be the dependent variable value for partnership i, and let ai and pi be the respondent age and partner age of partnership i, respectively. We tested the following dependent variables:

  1. Linear age: yi=pi. This is untransformed partner age, included as a baseline. It has the undesirable quality of being able to predict negative ages.

  2. Age difference: yi=pi-ai. If changes in expected partner age are consistent across respondent age then this variable would be more consistent across respondent age than the linear age. This parametrisation also allows for negative partner age predictions.

  3. Log-age: yi=logpi. We can use a log link function to ensure that our predictions will be positive-valued.

  4. Log-ratio: yi=log(pi/ai). Finally, we can combine the link function and differencing approaches by modelling the log of the ratio of partner to respondent age. This variable will only produce positive predictions and, like the age difference variable, should be relatively constant over respondent age.

Because the gamma and beta distributions are not defined on the entire real line, we only fit them with the linear age dependent variable with the previously discussed transformations.

To identify which distribution-dependent variable combination best modelled the characteristics of sexual partner age distributions, we stratified each of our three data sets by sex and 5-year age bin from 20 to 24 through 45–49. We omitted ages 15–19 from the probability distribution comparison because relatively small sample sizes in that group would make reliable comparison difficult. We fit every viable distribution-dependent variable combination to all 36 data set-/sex-/age bin-specific subsets independently. Given that we fit only the linear age dependent variable to the gamma and beta distributions, comprising a total of 504 models (14 per data subset). We fit each model using the brms R package (Bürkner, 2018), defining custom families as necessary.

Distributional regression evaluation

Request a detailed protocol

Given a probability distribution that accurately replicated the non-Gaussian characteristics of partner age distributions, we tested whether or not distributional regression would allow us to pool data across age and sex without sacrificing fit. In distributional regression, we make all of our distributional parameters, not just the mean, functions of data (Kneib and Umlauf, 2017). Taking conventional Bayesian regression as an example, we have

yiN(μi,σ)μi=β𝐗i,

where β and logσ are free parameters. There is an explicit assumption in this model that the standard deviation of the generating distribution is constant across all observations. We can use distributional regression to relax this assumption, making σ a function of data:

yiN(μi,σi)μi=βμ𝐗iμlogσi=βσ𝐗iσ,

where βμ and βσ are now our free parameters. Note that we have not assumed that 𝐗μ=𝐗σ. If 𝐗σ is a column of ones, this model is identical to the conventional case. This approach increases the complexity of the model and requires more data, but, based on previously described characteristics of how the distribution of partnership age distribution changes with age, even a simple model for our distributional parameters could yield large improvements.

In this application, we modelled the log-ratio dependent variable with the sinh-arcsinh distribution, specifying a model for all four distributional parameters:

log(pi/ai)sinh(μi,σi,ϵi,δi)μi=βμ𝐗iμlogσi=βσ𝐗iσϵi=βϵ𝐗iϵlogδi=βδ𝐗iδσi=σiδi,

where βμ, βσ, βϵ, and βδ are free parameters. We placed essentially arbitrary shrinkage priors on all coefficients:

βμ, βσ, βϵ, βδN(0,5).

By varying the specifications of the four design matrices, 𝐗μ, 𝐗σ, 𝐗ϵ, and 𝐗δ, we tested how well a series of increasingly complex distributional regression models fit to each data set. We fit the following models, which varied in the definitions of their four design matrices:

  1. Conventional: linear age-sex interaction for location and constants for all three higher-order parameters

  2. Distributional 1: linear age-sex interaction for location and independent age and sex effects for all other parameters

  3. Distributional 2: linear age-sex interactions for all four parameters

  4. Distributional 3: sex-specific spline with respect to age for location and linear age-sex interactions for all other parameters

  5. Distributional 4: sex-specific splines with respect to age for all four parameters

Table 2 describes all five models. By fitting a wide set of specifications, we hoped to assess whether the additional complexity incurred by distributional regression was valuable. We fit each of the five models to all three data sets, including all respondents aged 15–64 years. We implemented these analyses with brms (Bürkner, 2018), which has deep support for distributional regression. More detailed descriptions of each model are available in the ‘Model specification details’ section of the Appendix. We tested the effect of age heaping on this analysis by fitting to the deheaped AHRI data and report results in the ‘Age heaping’ section of the Appendix.

Table 2
Summary of five models fit in this analysis.
ModelDistributional?LocationOther parameters
ConventionalNoAge-sex interactionConstant
Distributional 1YesAge-sex interactionAge and sex effects
Distributional 2YesAge-sex interactionAge-sex interaction
Distributional 3YesSex-specific splinesAge-sex interaction
Distributional 4YesSex-specific splinesSex-specific splines

Model comparison

Request a detailed protocol

Across both analyses, we used two metrics to measure model fit. First, we calculated the expected log posterior density (ELPD), which estimates the density of the model at a new, unobserved data point (Vehtari et al., 2017). In cases where we wanted to compare across dependent variables, we multiplied the posterior densities of any variables resulting from non-linear transformations of observed partner ages by the Jacobians of the transformations. For example, if our observation model was defined on the log-age dependent variable yi=logpi, we divided the posterior density by pi. We used the loo R package (Vehtari et al., 2020) to calculate ELPD values.

To measure the ability of our models to replicate partner age distributions in an objective and interpretable way, we found the root mean squared error (RMSE) between the observed and posterior predictive quantiles. We calculated quantiles from 10 to 90 in increments of 10 by age bin and sex in the data and in the posterior predictions, and found the error in model prediction of each quantile. This measure tells how well our model predicts the entire distribution in the same units as our predictions. It is equivalent to finding the root mean squared distance from the line of equality in a quantile-quantile (QQ) plot.

Software

We conducted all of these analysis using the R programming language (R Development Core Team, 2020) and the brms library (Bürkner, 2018). We used the loo library to estimate all ELPDs (Vehtari et al., 2020), and produced all plots in this paper with the ggplot2 library (Wickham, 2016). We cannot provide the data we used for this analysis, but we do provide code and data for a simulated case on GitHub (https://github.com/twolock/distreg-illustration; Wolock, 2021, copy archived at swh:1:rev:a7f808f2cde2bb16edde8fdcbfa6e208df7952f9).

Results

The AHRI data included 77,619 partnerships, Manicaland had 58,676, and the Haiti DHS had 12,447. There were 36,033 respondents reporting at least one partnership in the AHRI data, 25,024 in the Manicaland data, and 12,143 in the Haiti DHS, resulting in averages of 2.2, 2.3, and 1.0 partners per respondent, respectively. As an illustrative example of the distribution of partner ages, Figure 2 presents histograms of reported partner ages among women aged 34 years for each of our three data sets. Figure 3 shows the sex- and age bin-specific empirical moments for the three data sets. Mean partner age increased with respondent age consistently for both sexes across all three data sets: among women, mean partner age increased by 26.0, 22.7, and 23.7 years in the AHRI data, Haiti DHS data, and Manicaland data, respectively, between age bins 20–24 and 45–49. However, higher order moments were less consistent: the standard deviation of women’s partners’ ages changed by 2.3, 0.5, and 3.5 years in the AHRI data, Haiti DHS data, and Manicaland data, respectively, between age bins 20–24 and 45–49.

Observed partner age distributions among women aged 34 years in all three data sets.
Observed means, variances, skewnesses, and kurtoses of partner age by 5-year age bin and sex in all three data sets.

Within each data set, there is systematic variation across sex. For example, the standard deviation of partner ages in the Haiti DHS increased by 2.5 years among men and only by 0.5 years among women. These summary statistics illustrate the heterogeneity of partner age distributions across age and sex.

Probability distribution comparison

To identify the probability distribution that most accurately described the variation in sexual partner age distributions, we first determined the dependent variable with the highest ELPD for each distribution-dependent variable combination. Figure 4 illustrates each probability distribution’s best fit to AHRI data among women aged 35–39 with each of the best distribution-specific dependent variables. Results for all 36 data subsets and the 12 deheaped subsets are presented in Appendix 1—table 29.

Observed partner age distributions (grey bars) and posterior predictive partner age distributions (lines) for each probability distribution among women aged 35–39 in the AHRI data set.

Posterior predictive distributions come from fitting each age bin/sex combination independently.

The best dependent variable varied across data subset and probability distribution. Table 3 provides the share of data sets for which each dependent variable has the highest ELPD given each distribution. The log-ratiodependent variable was best in 50.0% of subsets with a normal distribution, but it was best in only 27.8% of subsets with a skew normal distribution. The dependent variable that was best in a plurality of subsets in each probability distribution (i.e. the variable with the highest percentage in each column of Table 3) used a log link function. We restricted all remaining comparisons to each distribution-subset combination’s best dependent variable.

Table 3
Share of subsets in which each dependent variable yields the highest ELPD given each probability distribution (excluding deheaped AHRI data).
VariableNormalSkew normalSinh-arcsinh
Age difference22.2%25.0%16.7%
Linear age8.3%5.6%16.7%
Log-age19.4%41.7%30.6%
Log-ratio50.0%27.8%36.1%

The sinh-arcsinh distribution had the highest ELPD in 35 of 36 data subsets (98%). In 29 of the 35 (83%) cases in which the sinh-arcsinh provided the highest ELPD, the absolute value of the ratio of the difference between the two best ELPDs and the estimated standard error of the difference was greater than 2, indicating that the sinh-arcsinh distribution was significantly better than the alternatives in the majority of cases. In one case, men aged 20–24 in the Haiti DHS, the skew normal distribution resulted in a slightly higher ELPD than the sinh-arcsinh distribution, but the standard error of the difference was greater than the difference. These results were not affected by deheaping the data (Appendix section ‘Age heaping’).

To summarise each distribution’s performance, we calculated the average ELPD and QQ RMSE across the three data sets (Table 4). The sinh-arcsinh distribution had the highest average ELPD and lowest average QQ RMSE in all three data sets. The sinh-arcsinh distribution was, on average, able to predict the empirical quantiles of each data set within half a year of accuracy (0.36, 0.37, and 0.44 years for the AHRI, Haiti DHS, and Manicaland data, respectively). Overlaid QQ plots are presented in Appendix 1—figure 3.

Table 4
Model comparison metrics averaged across all data subsets for all three data sets.

Higher ELPD values indicate better fit. Lower QQ RMSE values indicate more accurate prediction of empirical quantiles. Bolded rows are best across all three data sets.

DistributionAHRIHaiti 2016–17 DHSManicaland
ELPD
Gamma−14847.2−2917.9−13152.8
Beta−14748.0−2896.5−13003.5
Normal−14593.7−2868.4−12856.8
Skew normal−14505.1−2854.0−12778.5
Sinh-arcsinh−14312.5−2839.5−12625.8
QQ RMSE
Gamma0.830.820.95
Beta0.990.821.11
Normal0.820.680.97
Skew normal0.770.650.85
Sinh-arcsinh0.360.370.44

Distributional regression evaluation

We fit all five distributional regression specifications to all three of our data sets with sinh-arcsinh distributions and log-ratio-dependent variables and compared the ELPDs and QQ RMSEs as before (provided in Table 5). Across all three data sets, the most complex distributional model (Distributional 4) had the highest ELPD and lowest QQ RMSE. When fit to the AHRI and Manicaland data sets (but not for the Haiti DHS), the most complex distributional model was at least two standard errors better than the next best model. Notably, the largest ELPD improvements came from moving from conventional regression (Conventional) to the simplest distributional model (improvements of 1646.0 units, 361.0 units, and 2181.2 units in the AHRI, Haiti DHS, and Manicaland data, respectively). Full results are presented in Appendix 1—table 10.

Table 5
ELPD and QQ RMSE values for all five distributional regression models fit to each data set.

The models increase in complexity from Conventional Regression to Distributional Model 4. Bolded ELPD values are more than two standard errors higher than the next best value in the column. Bolded QQ RMSE values are lowest in their column.

ModelAHRIHaiti 2016–17 DHSManicaland
ELPD
Conventional52689.24777.821011.3
Distributional 154335.25140.823192.5
Distributional 254794.85138.723472.1
Distributional 355534.25196.724313.7
Distributional 455841.95207.624516.1
QQ RMSE
Conventional1.301.332.05
Distributional 11.150.981.89
Distributional 21.210.991.80
Distributional 30.930.911.34
Distributional 40.660.841.04

Figure 5 shows the posterior predictive distributions from the conventional regression model and the most complex distributional model among men aged 16 years, 24 years, and 37 years in the AHRI data to illustrate the effect of distributional regression. Not only does the distributional model capture the high peak in the youngest age more accurately, but it also allows the variance of the distributions to change appropriately (beyond the change that naturally results from the log link function).

Observed partner age distributions (grey bars) and posterior predictive partner age distributions (lines) for conventional regression and the most complex distributional model among men aged 16, 24, and 37 years in the AHRI data set.

Posterior predictive distributions come from regression models fit to the entire AHRI data set.

Figure 6 illustrates posterior summaries among men and women in the AHRI data for all four distributional parameters for the conventional regression model, the simplest distributional model, and the most complex distributional model. The red estimates (Conventional Regression) of the three higher order parameters were constant across age and sex, whereas the blue estimates (Distributional Model 1) included independent, linear age and sex effects. The orange estimates (Distributional Model 4) were generated sex-specific splines with respect to age, allowing for flexible variation across age and sex.

Estimated sinh-arcsinh distributional parameters from the conventional regression model, and distributional models 1 and 4 fit to the AHRI data.

‘Conventional’ assumes no variation across age and sex, ‘Distributional 1’ allows for independent age and sex effects, and ‘Distributional 4’ includes sex-specific splines with respect to age.

The third row of plots in Figure 6, which corresponds to the skewness parameter, illustrates the impact of incorporating sex and age effects into the model. The conventional regression model estimated that neither the distrbution for men nor women exhibited much skewness; the estimated parameter value was −0.05 (95% UI: −0.06 to −0.05) regardless of age, with 0.0 corresponding to perfect symmetry. However, when we allowed independent age and sex effects in Distributional Model 1, we estimated that at age 15, women’s skewness was −0.26 (95% UI: −0.27 to −0.25) and men’s was 0.11 (95% UI: 0.10 to 0.12).

The most complex model (Distributional Model 4) inferred sex-specific, non-linear variation with respect to age in all four distributional parameters. The non-linearity was particularly dramatic in the scale parameter among men. The scale value began at 0.05 (95% UI: 0.05–0.06) among 15-year-olds, peaked among 37-year-olds at 0.11 (95% UI: 0.10–0.11), and decreased back down to 0.05 (95% UI: 0.04–0.06) at age 64.

Finally, Figure 7 presents inferred distributional parameters from Distributional Model 4 for both men and women for all three data sets. Based on those plots, the flexible model was justified for most distributional parameters in all three data sets. Were we to continue developing these models, this plot suggests that skewness might only need linear, sex-specific effects with respect to age. Interestingly, the 2016–2017 Haiti DHS and Manicaland estimates exhibit similar patterns across all four parameters, despite the different socio-cultural contexts surrounding partnerships in the two populations. We also note that the DHS does not collect data on adults aged 50 years and older, so our estimates in Haiti from age 50 to age 64 are purely extrapolated.

Estimated sinh-arcsinh distributional parameters for Distributional Model 4 fit to the three main data sets.

Discussion

We found that the sinh-arcsinh distribution reproduced observed sexual partner age distributions better than a number of other possible distributional assumptions across age and sex in three distinct data sets. We integrated this finding into a distributional regression framework using existing statistical modelling software. Even the simplest distributional regression in our set of candidate models far outperformed conventional regression, in which all moments except the first are estimated as constants. Our most complex distributional model fit better than all other models in all three data sets, suggesting that modelling these data benefits from the additional complexity.

These results indicate that distributional regression models with sinh-arcsinh distributions can accurately replicate age-/sex-specific sexual partner age distributions. This approach presents a number of advantages over previous methods. First, like Smid et al., it allows a unique distribution for every age-sex combination. As Figure 3 illustrates, partner age distributions can exhibit substantial, systematic variation across age and sex in any of the first four moments, so we must consider modelling strategies that allow for such variation. Second, distributional regression offers a principled method to propagate uncertainty through this estimation process.

Finally, distributional regression implemented through brms provides access to a deep set of hierarchical modelling tools that could enable estimation in a variety of low-data settings. We evaluated a small set of relatively simple distributional models in this work, but, theoretically, each distributional parameter could have its own, arbitrarily complex hierarchical regression model. Using these tools, one could estimate unique partner age distributions across levels of stratification that are substantively interesting but do not provide sufficient sample size for independent estimation (e.g. study sites or geographic areas).

We have identified several limitations in this approach. First, the amount of data required to produce usefully precise estimates is not tested. Each additional distributional parameter introduces model parameters, so this method is more complex than conventional regression. The sinh-arcsinh distribution did fail to produce the highest ELPD in our smallest data subset (N = 170), but it was not significantly worse than the best distribution. More importantly, by integrating these data into a distributional modelling framework, we gain the ability to impose structure on these parameters, which could easily offset the cost of any additional model parameters.

Interpreting the inferred model parameters in sinh-arcsinh regression can also be difficult. Although conventional regression estimates the effects of covariates on expected values, the sinh-arcsinh distribution is parametrised in terms of a location parameter. This parameter correlates closely with the central tendency of the distribution, but it is not strictly equal to the mean. We can reparametrise the distribution so that we estimate a mean (and therefore effects of covariates on the expected value), but it is not currently possible in the probabilistic programming software that underlies brms.

Third, our analysis assumed that we were operating at a level of stratification at which partnerships are basically comparable, but any number of factors could lead to fundamentally different partner age distributions. For example, we did not control for whether the partnership was same-sex or the type of the partnership (married, casual, etc.). That said, our distributional framework would allow us to incorporate data on any of those factors directly into the model.

There are two sources of possible non-independence in our data sets that were not modelled. First, in the two cohort studies, participants are eligible to participate in multiple survey rounds, in which case the same partnership could be reported multiple times in our data set. Second, many individuals report multiple partners in the same survey, and the age of one partner may be correlated ages of other partners, for example due to individual preferences for partnerships. Because we modelled multiple observations of the same individual as conditionally independent, we anticipate that these correlations may artificially increase the precision of our estimates.

Finally, our model does not address any reporting biases in self-reported partnership data. If certain relationship types are perceived as less socially acceptable, respondents might be less likely to report them, resulting in systematic missingness. Our method could still be appropriate to model the age distribution from the data reported about an under-reported partnership type, but it cannot predict whether or not a given partnership exists. However, if under-reporting correlates with partner age (or age difference), then the empirical distributions will be biased and our method will only smooth and interpolate the biased data.

Despite these limitations, we believe that the strategy we present will work well in future projects that require estimates of partner age distributions. We plan to use these methods to produce age-mixing matrices to inform epidemic models of HIV, but there are a number of additional directions that could be explored. We are specifically interested in leveraging the spatio-temporal structure of the survey data used here. Hierarchical mapping exercises with household survey data are increasingly common in epidemiology, but estimating spatially varying partner age distributions would require an evaluation of how best to model higher order moments over space. We would, for example, need to consider how the variance of partner age distributions varies by urbanicity.

Similarly, population-based studies typically collect far more detailed information on partnerships than we took advantage of here. Relationship type is a key confounder of the association between respondent age and partner age (that we ignored for the purposes of our experiments). We might expect the age distributions of casual partners to vary substantially from those of long-term cohabiting partners. Because we have built our model in an existing regression framework, incorporating new covariates into any of the distributional regression specifications is straightforward. The distribution regression framework with the sinh-arcsinh may also be a useful parametric model for continuous representation of marginal distributions within sexual mixing models or network models, such as the ERGM framework.

We believe that our framework offers a flexible, accurate, and robust method for smoothing and interpolating sexual partner age distributions, but these methods are not specific to partner age distributions. The sinh-arcsinh distribution is relatively easy to implement without incurring high computational cost, so it could be applied in many settings. Even without the distributional regression framework, we have used here, allowing the third and fourth moments of the distribution to vary from the ‘default’ normal values could be valuable across a variety of applications.

Distributional regression is also underutilised in social science applications. We often work with large surveys that would comfortably support models for higher order parameters. Data requirements will vary by application and model, but, as we have shown here, even a simple distributional model can improve fit and avoid biasing estimates.

Appendix 1

Age heaping

Respondents in each of these data sets are disproportionately likely to report that their partners’ ages are multiples of five or multiples of five away from their own age, leading to distinct ‘spikes’ in the empirical partner age (or age difference) distributions at multiples of five. The left panel of Appendix 1—figure 1 illustrates this phenomenon among women aged 24 years in the AHRI data. These spikes, widely referred to as ‘heaping’, could bias our results towards certain probability distributions, so we developed a simple deheaping algorithm, applied it to the AHRI data.

To account for the possibility that heaping affected the results, we developed a simple deheaping algorithm and treated the deheaped AHRI data as a fourth data set. Due to the structure of the questionnaire (‘how many years older or younger is your partner than you?’), the AHRI partner age data exhibit strong heaping on partner ages that are multiples of five years from the respondent’s age. For example, among women aged 24 years, we observe far more partners aged exactly 29 years than expected.

Let ns,a,p be the number of observed partnerships with si=s, ai=a, and pi=p. Fixing age to be a and sex to be s, we can find the expected count at partner age p, n^s,a,p by fitting a Nadaraya-Watson estimator to all ordered pairs (p,ns,a,p) such that p-a is not a multiple of five. We can then find the positive-valued excess counts at all p such that p-ais a multiple of five: es,a,p=max(ns,a,p-n^s,a,p,0).

This quantity, es,a,p, is what the Nadaraya-Watson estimator has identified as number of heaped observations. Fixing p to be a partner age such that (p-a)mod50, we assume that all of the excess mass at p will be allocated to the four partner ages on either side of p. We find the share of es,a,p to be allocated to each of (p-2,,p+2), denoted bs,a,p, as

bs,a,p=ns,a,pi=-22ns,a,p+i,

substituting in n^s,a,p for ns,a,p wherever applicable. Finally, we find the number of individuals to be reassigned from p to each p within two years of y as ds,a,p=bs,a,pes,a,p. Note that each partner age can only ‘receive’ partnerships from its nearest multiple of five and that each multiple of five can only ‘send’ partnerships to itself and the four partner ages on either side of it. For each y within two years of y, we randomly select ds,a,p individuals to move from p to p. We apply this method for both sexes and all respondent ages with at least two observations separately.

Appendix 1—figure 1 illustrates the effect of this process on data among women aged 24 in the AHRI data. This method is quite simple, but it seems to work reasonably well on the AHRI data. Regardless, we do not need a perfect deheaping algorithm for this application; we just need one that will give us a plausibly deheaped version of the AHRI data. If the results differ drastically between the heaped and deheaped data sets (i.e. if one probability distribution works perfectly only on the deheaped data), then we will know that our results are sensitive to irregularities in the data.

Appendix 1—figure 1
Illustration of the effect of the deheaping algorithm on women aged exactly 24 years in the AHRI data.

Dark grey bars correspond to ages identified as potentially heaped (multiples of five away from 24). The red line is the expected count of observations estimated by excluding any potentially heaped ages.

Results

Appendix 1—figure 2 shows the presence of age heaping among women in the AHRI data, as well as the effects of our deheaping algorithm. Visible diagonal lines indicate that women were disproportionately likely to report that the difference between their partner’s age and their own age was a multiple of five. Heaping to partner ages (not partner age differences) would manifest as horizontal lines. As we can see in the right panel, the deheaping procedure resolves the majority of the heaping. We cannot validate the algorithm, but for the purposes of this experiment, simply producing plausibly deheaped age distributions should be sufficient.

Appendix 1—figure 2
Observed sexual partner age distributions among women in the AHRI data.

The left panel is original data, and the right panel is the same data set after deheaping age differences from multiples of five.

Appendix 1—table 1 provides ELPD and QQ RMSE values for all five regression models fit to the deheaped AHRI data. As with the heaped AHRI data, the most complex distributional model had the highest ELPD (58504.0). From these results, we conclude that the presence of heaping in the three main data sets is unlikely to have substantially altered the results of this analysis.

Appendix 1—table 1
ELPD and QQ RMSE values for all five models fit to deheaped AHRI data The models increase in complexity from Conventional Regression to Distributional Model 4.

Bolded ELPD values are more than two standard errors higher than the next best value in the column. Bolded QQ RMSE values are lowest in their column.

ModelAHRI deheaped
ELPD
Conventional55296.2
Distributional 157097.4
Distributional 257503.7
Distributional 358219.2
Distributional 458504.0
QQ RMSE
Conventional1.26
Distributional 11.06
Distributional 21.14
Distributional 30.92
Distributional 40.62

Model specification details

We modelled the log-ratio dependent variable using the four-parameter sinh-arcsinh distribution:

yisinh(μi,σi,ϵi,δi)μi=βμ𝐗iμlogσi=βσ𝐗iσϵi=βϵ𝐗iϵlogδi=βδ𝐗iδσi=σiδi,

where βμ, βσ, βϵ, and βδ are free parameters. We placed essentially arbitrary shrinkage priors on all coefficients:

βμ, βσ, βϵ, βδN(0,5).

First, we fit a conventional regression, in which only the location parameter, μ, is a function of data. Specifically, we allowed for linear sex and age effects and a linear interaction between respondent sex and age (si and ai, respectively) in the model of μ:

𝐗iμ=(1,si,ai,siai)𝐗iσ,𝐗iϵ,𝐗iδ=(1).

In the second model, we allowed the three higher order distributional parameters to vary by age and sex:

𝐗iμ=(1,si,ai,siai)𝐗iσ,𝐗iϵ,𝐗iδ=(1,si,ai).

In the third model, all four distributional parameters had age, sex, and age-sex interaction effects:

𝐗iμ,𝐗iσ,𝐗iϵ,𝐗iδ=(1,si,ai,siai)

To allow for the possibility of non-linear variation with respect to age in the fourth model, we modelled the location parameter using sex-specific natural splines on age:

𝐗iμ=(1,si,ϕ1(ai),,ϕK(ai),siϕ1(ai),,siϕK(ai))𝐗iσ,𝐗iϵ,𝐗iδ=(1,si,ai,siai),

where K is the number of columns in the spline design matrix. By including a second set of basis function values that are multiplied by si, we are estimating an additional, female-specific trend with respect to age.

Finally, we fit a fifth model, in which all four distributional parameters were modelled as sex-specific splines with respect to age:

𝐗iμ,𝐗iσ,𝐗iϵ,𝐗iδ=(1,si,ϕ1(ai),,ϕK(ai),siϕ1(ai),,siϕK(ai)).

Full Results

Appendix 1—figure 3
Overlaid quantile-quantile (QQ) plots for each probability distribution’s best fit to data in all three main data sets.

Presented quantiles range from 10th to 90th in increments of 10. Lines closer to the line of equality indicate better fit to empirical quantiles.

Appendix 1—figure 4
Observed partner age distributions (grey bars) and posterior predictive partner age distributions (lines) for each probability distribution among women in the AHRI data set.

Here, we plot the posterior predicitve distribution associated with each distribution’s highest-ELPD dependent variable.

Appendix 1—figure 5
Observed partner age distributions (grey bars) and posterior predictive partner age distributions (lines) for each probability distribution among men in the AHRI data set.

Here, we plot the posterior predicitve distribution associated with each distribution’s highest-ELPD dependent variable.

Appendix 1—figure 6
Observed partner age distributions (grey bars) and posterior predictive partner age distributions (lines) for each probability distribution among women in the AHRI Deheaped data set.

Here, we plot the posterior predicitve distribution associated with each distribution’s highest-ELPD dependent variable.

Appendix 1—figure 7
Observed partner age distributions (grey bars) and posterior predictive partner age distributions (lines) for each probability distribution among men in the AHRI Deheaped data set.

Here, we plot the posterior predicitve distribution associated with each distribution’s highest-ELPD dependent variable.

Appendix 1—figure 8
Observed partner age distributions (grey bars) and posterior predictive partner age distributions (lines) for each probability distribution among women in the Haiti 2016–17 DHS data set.

Here, we plot the posterior predicitve distribution associated with each distribution’s highest-ELPD dependent variable.

Appendix 1—figure 9
Observed partner age distributions (grey bars) and posterior predictive partner age distributions (lines) for each probability distribution among men in the Haiti 2016–17 DHS data set.

Here, we plot the posterior predicitve distribution associated with each distribution’s highest-ELPD dependent variable.

Appendix 1—figure 10
Observed partner age distributions (grey bars) and posterior predictive partner age distributions (lines) for each probability distribution among women in the Manicaland data set.

Here, we plot the posterior predicitve distribution associated with each distribution’s highest-ELPD dependent variable.

Appendix 1—figure 11
Observed partner age distributions (grey bars) and posterior predictive partner age distributions (lines) for each probability distribution among men in the Manicaland data set.

Here, we plot the posterior predicitve distribution associated with each distribution’s highest-ELPD dependent variable.

Appendix 1—table 2
Full ELPD and QQ RMSE table for women in the AHRI data set.

Higher ELPD values and lower QQ RMSE values are better.

RankModelELPDELPD DiffSE of DiffQQ RMSE
AHRI Female 20-24
1Sinh-arcsinh−31750.940.000.000.32
2Skew normal−32056.39−305.4648.630.47
3Normal−32414.54−663.6160.540.62
4Beta−32953.92−1202.98112.080.77
5Gamma−33461.85−1710.92148.150.80
AHRI Female 25-29
1Sinh-arcsinh−24647.650.000.000.28
2Skew normal−24906.22−258.5743.270.52
3Normal−25238.71−591.0654.820.68
4Beta−25701.13−1053.48114.840.89
5Gamma−25995.81−1348.16132.150.90
AHRI Female 30-34
1Sinh-arcsinh−19831.530.000.000.44
2Skew normal−20200.44−368.9169.400.51
3Normal−20314.79−483.2652.240.80
4Beta−20575.61−744.0867.460.93
5Gamma−20708.35−876.8273.890.91
AHRI Female 35-39
1Sinh-arcsinh−15469.180.000.000.31
2Skew normal−15749.79−280.6153.040.77
3Normal−15834.32−365.1441.230.80
4Beta−16026.51−557.3353.991.18
5Gamma−16087.40−618.2257.061.04
AHRI Female 40-44
1Sinh-arcsinh−12556.610.000.000.45
2Skew normal−12876.71−320.1045.851.27
3Normal−12935.34−378.7352.380.92
4Beta−13137.69−581.0869.181.38
5Gamma−13150.66−594.0562.731.19
AHRI Female 45-49
1Sinh-arcsinh−10059.210.000.000.59
2Skew normal−10391.95−332.7442.751.36
3Normal−10433.64−374.4348.911.53
4Gamma−10527.00−467.7950.721.35
5Beta−10545.33−486.1256.021.58
Appendix 1—table 3
Full ELPD and QQ RMSE table for men in the AHRI data set.

Higher ELPD values and lower QQ RMSE values are better.

RankModelELPDELPD DiffSE of DiffQQ RMSE
AHRI Male 20-24
1Sinh-arcsinh−20428.110.000.000.23
2Skew normal−20499.86−71.7517.120.25
3Normal−20503.89−75.7916.850.22
4Beta−20545.59−117.4923.210.22
5Gamma−20700.24−272.1343.530.29
AHRI Male 25-29
1Sinh-arcsinh−12664.210.000.000.26
2Skew normal−12727.03−62.8217.860.28
3Beta−12739.03−74.8218.650.31
4Normal−12753.25−89.0419.350.29
5Gamma−12788.26−124.0535.070.38
AHRI Male 30-34
1Sinh-arcsinh−9301.030.000.000.29
2Skew normal−9357.18−56.1514.080.43
3Beta−9371.86−70.8316.480.37
4Normal−9385.63−84.6014.670.46
5Gamma−9419.34−118.3135.110.27
AHRI Male 35-39
1Sinh-arcsinh−6746.890.000.000.30
2Skew normal−6812.77−65.8817.730.64
3Normal−6817.86−70.9723.240.70
4Beta−6830.95−84.0617.950.71
5Gamma−6832.47−85.5832.030.44
AHRI Male 40-44
1Sinh-arcsinh−4610.950.000.000.35
2Skew normal−4711.78−100.8418.660.92
3Normal−4713.78−102.8318.820.78
4Gamma−4718.28−107.3324.830.63
5Beta−4742.70−131.7517.281.07
AHRI Male 45-49
1Sinh-arcsinh−3683.470.000.000.34
2Skew normal−3770.59−87.1216.560.81
3Gamma−3776.33−92.8615.560.87
4Normal−3778.84−95.3714.501.17
5Beta−3805.78−122.3117.401.36
Appendix 1—table 4
Full ELPD and QQ RMSE table for women in the AHRI Deheaped data set.

Higher ELPD values and lower QQ RMSE values are better.

RankModelELPDELPD DiffSE of DiffQQ RMSE
AHRI Deheaped Female 20-24
1Sinh-arcsinh−31411.240.000.000.26
2Skew normal−31797.37−386.1353.150.59
3Normal−32179.29−768.0565.500.56
4Beta−32737.57−1326.32118.470.76
5Gamma−33254.17−1842.92155.140.78
AHRI Deheaped Female 25-29
1Sinh-arcsinh−24439.470.000.000.27
2Skew normal−24768.06−328.5946.710.65
3Normal−25104.46−664.9958.320.82
4Beta−25574.33−1134.86119.651.03
5Gamma−25870.30−1430.83137.511.05
AHRI Deheaped Female 30-34
1Sinh-arcsinh−19680.770.000.000.41
2Skew normal−20112.70−431.9472.950.55
3Normal−20228.52−547.7656.190.81
4Beta−20492.23−811.4670.530.92
5Gamma−20624.82−944.0676.980.80
AHRI Deheaped Female 35-39
1Sinh-arcsinh−15381.680.000.000.26
2Skew normal−15703.77−322.0955.300.68
3Normal−15788.73−407.0543.670.82
4Beta−15983.57−601.9056.311.13
5Gamma−16044.22−662.5459.171.04
AHRI Deheaped Female 40-44
1Sinh-arcsinh−12491.910.000.000.25
2Skew normal−12846.63−354.7247.380.99
3Normal−12905.04−413.1254.140.89
4Beta−13109.82−617.9170.961.31
5Gamma−13121.45−629.5364.121.13
AHRI Deheaped Female 45-49
1Sinh-arcsinh−9981.830.000.000.53
2Skew normal−10357.85−376.0145.081.43
3Normal−10401.64−419.8051.571.46
4Gamma−10493.73−511.9052.901.37
5Beta−10513.46−531.6358.211.61
Appendix 1—table 5
Full ELPD and QQ RMSE table for men in the AHRI Deheaped data set.

Higher ELPD values and lower QQ RMSE values are better.

RankModelELPDELPD DiffSE of DiffQQ RMSE
AHRI Deheaped Male 20-24
1Sinh-arcsinh−20310.350.000.000.27
2Skew normal−20429.90−119.5527.090.22
3Normal−20459.73−149.3835.780.29
4Beta−20574.15−263.8075.130.22
5Gamma−20899.52−589.17175.990.27
AHRI Deheaped Male 25-29
1Sinh-arcsinh−12585.540.000.000.28
2Skew normal−12680.59−95.0521.530.44
3Beta−12697.00−111.4623.310.37
4Normal−12701.76−116.2322.960.41
5Gamma−12763.81−178.2741.240.39
AHRI Deheaped Male 30-34
1Sinh-arcsinh−9227.260.000.000.37
2Skew normal−9302.42−75.1616.150.41
3Beta−9318.24−90.9719.070.39
4Normal−9327.58−100.3116.180.41
5Gamma−9372.32−145.0638.270.27
AHRI Deheaped Male 35-39
1Sinh-arcsinh−6694.860.000.000.30
2Skew normal−6774.11−79.2619.320.61
3Normal−6780.69−85.8425.420.44
4Beta−6791.95−97.1019.810.69
5Gamma−6796.41−101.5534.450.40
AHRI Deheaped Male 40-44
1Sinh-arcsinh−4591.040.000.000.49
2Skew normal−4700.54−109.5119.381.16
3Normal−4703.52−112.4919.931.00
4Gamma−4708.43−117.4025.940.89
5Beta−4731.41−140.3717.841.30
AHRI Deheaped Male 45-49
1Sinh-arcsinh−3680.180.000.000.30
2Normal−3796.06−115.8819.241.15
3Skew normal−3797.14−116.9523.481.02
4Gamma−3801.02−120.8324.510.98
5Beta−3817.97−137.7919.371.39
Appendix 1—table 6
Full ELPD and QQ RMSE table for women in the Haiti 2016–17 DHS data set.

Higher ELPD values and lower QQ RMSE values are better.

RankModelELPDELPD DiffSE of DiffQQ RMSE
Haiti 2016-17 DHS Female 20-24
1Sinh-arcsinh−3259.310.000.000.49
2Skew normal−3263.46−4.154.950.53
3Normal−3338.23−78.9219.540.91
4Beta−3441.91−182.6045.771.24
5Gamma−3504.85−245.5453.901.29
Haiti 2016-17 DHS Female 25-29
1Sinh-arcsinh−4447.430.000.000.26
2Skew normal−4471.22−23.788.410.57
3Normal−4527.25−79.8218.720.86
4Beta−4625.97−178.5440.881.23
5Gamma−4678.20−230.7745.811.22
Haiti 2016-17 DHS Female 30-34
1Sinh-arcsinh−4720.120.000.000.44
2Skew normal−4749.57−29.459.060.68
3Normal−4763.78−43.6610.510.62
4Beta−4809.11−88.9917.190.85
5Gamma−4836.82−116.7020.320.83
Haiti 2016-17 DHS Female 35-39
1Sinh-arcsinh−4490.820.000.000.33
2Skew normal−4518.58−27.758.140.57
3Normal−4526.55−35.738.590.73
4Beta−4561.27−70.4513.600.94
5Gamma−4577.84−87.0115.270.86
Haiti 2016-17 DHS Female 40-44
1Sinh-arcsinh−3601.020.000.000.35
2Skew normal−3629.45−28.437.510.83
3Normal−3633.14−32.117.960.71
4Beta−3641.61−40.599.760.73
5Gamma−3644.89−43.8610.470.64
Haiti 2016-17 DHS Female 45-49
1Sinh-arcsinh−3106.270.000.000.39
2Skew normal−3133.10−26.827.680.88
3Gamma−3133.61−27.337.500.68
4Normal−3134.62−28.357.460.81
5Beta−3136.89−30.628.610.88
Appendix 1—table 7
Full ELPD and QQ RMSE table for men in the Haiti 2016–17 DHS data set.

Higher ELPD values and lower QQ RMSE values are better.

RankModelELPDELPD DiffSE of DiffQQ RMSE
Haiti 2016-17 DHS Male 20-24
1Skew normal−468.980.000.000.43
2Sinh-arcsinh−469.60−0.621.120.41
3Normal−475.31−6.334.280.67
4Beta−483.53−14.557.330.65
5Gamma−500.53−31.5513.350.94
Haiti 2016-17 DHS Male 25-29
1Sinh-arcsinh−1386.130.000.000.38
2Skew normal−1390.54−4.413.190.49
3Normal−1395.47−9.344.790.60
4Beta−1407.46−21.327.310.62
5Gamma−1434.18−48.0411.750.79
Haiti 2016-17 DHS Male 30-34
1Sinh-arcsinh−2217.200.000.000.44
2Skew normal−2222.10−4.893.420.69
3Normal−2223.97−6.764.480.45
4Beta−2240.58−23.379.320.52
5Gamma−2281.18−63.9817.910.73
Haiti 2016-17 DHS Male 35-39
1Sinh-arcsinh−2185.960.000.000.28
2Skew normal−2189.87−3.912.670.69
3Beta−2191.05−5.103.680.48
4Normal−2191.11−5.163.550.49
5Gamma−2205.69−19.739.570.52
Haiti 2016-17 DHS Male 40-44
1Sinh-arcsinh−2051.620.000.000.39
2Skew normal−2060.16−8.544.210.72
3Normal−2060.38−8.754.570.69
4Beta−2062.00−10.374.870.70
5Gamma−2063.79−12.175.730.47
Haiti 2016-17 DHS Male 45-49
1Sinh-arcsinh−2138.340.000.000.23
2Normal−2150.53−12.196.380.35
3Skew normal−2151.51−13.175.970.56
4Gamma−2152.88−14.548.930.25
5Beta−2156.13−17.796.140.48
Appendix 1—table 8
Full ELPD and QQ RMSE table for women in the Manicaland data set.

Higher ELPD values and lower QQ RMSE values are better.

RankModelELPDELPD DiffSE of DiffQQ RMSE
Manicaland Female 20-24
1Sinh-arcsinh−16390.770.000.000.31
2Skew normal−16502.01−111.2521.220.44
3Normal−16779.93−389.1637.050.67
4Beta−17111.57−720.8062.020.86
5Gamma−17387.38−996.6176.801.02
Manicaland Female 25-29
1Sinh-arcsinh−18702.500.000.000.53
2Skew normal−18923.04−220.5325.270.94
3Normal−19080.66−378.1636.050.83
4Beta−19405.80−703.3064.971.05
5Gamma−19615.53−913.0376.381.09
Manicaland Female 30-34
1Sinh-arcsinh−16523.810.000.000.48
2Skew normal−16877.96−354.1540.360.87
3Normal−16886.62−362.8036.410.99
4Beta−17021.26−497.4443.601.12
5Gamma−17094.58−570.7649.530.93
Manicaland Female 35-39
1Sinh-arcsinh−14397.760.000.000.48
2Skew normal−14736.64−338.8828.351.25
3Normal−14798.55−400.7936.871.39
4Beta−14824.80−427.0433.021.47
5Gamma−14835.11−437.3534.491.14
Manicaland Female 40-44
1Sinh-arcsinh−12293.130.000.000.68
2Skew normal−12488.28−195.1521.361.03
3Gamma−12500.93−207.8022.181.03
4Normal−12508.91−215.7823.291.28
5Beta−12537.14−244.0125.411.22
Manicaland Female 45-49
1Sinh-arcsinh−9183.030.000.000.56
2Skew normal−9455.87−272.8323.571.68
3Normal−9477.33−294.3023.551.62
4Gamma−9497.31−314.2725.081.44
5Beta−9576.44−393.4032.071.94
Appendix 1—table 9
Full ELPD and QQ RMSE table for men in the Manicaland data set.

Higher ELPD values and lower QQ RMSE values are better.

RankModelELPDELPD DiffSE of DiffQQ RMSE
Manicaland Male 20-24
1Sinh-arcsinh−9770.000.000.000.30
2Skew normal−9895.82−125.8333.350.40
3Normal−10139.11−369.1179.130.49
4Beta−10587.64−817.64181.230.56
5Gamma−11594.58−1824.59388.261.15
Manicaland Male 25-29
1Sinh-arcsinh−13978.590.000.000.40
2Skew normal−13990.39−11.808.510.48
3Normal−14018.60−40.0017.480.45
4Beta−14152.35−173.7648.770.40
5Gamma−14500.47−521.87117.580.55
Manicaland Male 30-34
1Sinh-arcsinh−12949.240.000.000.31
2Skew normal−13016.44−67.2125.010.37
3Normal−13037.46−88.2216.570.49
4Beta−13070.31−121.0754.740.41
5Gamma−13285.47−336.23171.920.42
Manicaland Male 35-39
1Sinh-arcsinh−11496.140.000.000.27
2Skew normal−11528.36−32.229.830.39
3Normal−11530.43−34.299.720.26
4Gamma−11531.75−35.6112.470.24
5Beta−11582.63−86.4912.970.48
Manicaland Male 40-44
1Sinh-arcsinh−8714.060.000.000.35
2Skew normal−8749.78−35.7210.110.51
3Gamma−8777.08−63.0210.380.55
4Normal−8791.45−77.3812.230.76
5Beta−8860.22−146.1618.150.93
Manicaland Male 45-49
1Sinh-arcsinh−7110.270.000.000.42
2Skew normal−7177.03−66.7525.080.76
3Gamma−7213.99−103.7213.021.07
4Normal−7232.04−121.7713.091.28
5Beta−7312.35−202.0818.611.61
Appendix 1—table 10
LOO-CV estimated ELPD values, differences, and standard errors of differences, as well as QQ RMSE values, for all five regression models fit to all four data sets.

The ‘difference’ value of a row is the difference between that row’s ELPD value and data set-specific best ELPD value. Higher ELPD values and lower QQ RMSE values are better.

RankModelELPDELPD DiffSE of DiffQQ RMSE
AHRI
1Distributional 455841.910.000.000.66
2Distributional 355534.16−307.7532.360.93
3Distributional 254794.79−1047.1251.691.21
4Distributional 154335.19−1506.7272.321.15
5Conventional52689.21−3152.70100.591.30
AHRI Deaheaped
1Distributional 458503.980.000.000.62
2Distributional 358219.23−284.7528.640.92
3Distributional 257503.68−1000.3047.141.14
4Distributional 157097.39−1406.5964.481.06
5Conventional55296.25−3207.7399.421.26
Haiti 2016-17 DHS
1Distributional 45207.570.000.000.84
2Distributional 35196.69−10.896.540.91
3Distributional 15140.77−66.8012.270.98
4Distributional 25138.75−68.8312.240.99
5Conventional4777.78−429.8030.541.33
Manicaland
1Distributional 424516.150.000.001.04
2Distributional 324313.74−202.4020.521.34
3Distributional 223472.07−1044.0847.771.80
4Distributional 123192.49−1323.6654.971.89
5Conventional21011.29−3504.8689.012.05

Data availability

Data from the Demographic and Health Surveys are available from the DHS Program website (https://dhsprogram.com/data/available-datasets.cfm). Data from the Africa Centre Demographic Information System are available on request from the AHRI website (https://data.ahri.org/index.php/home). Data from the Manicaland study were used with permission from the study investigators (http://www.manicalandhivproject.org/manicaland-data.html).

The following previously published data sets were used
    1. Institut Haïtien de l'Enfance - IHE/Haiti
    2. ICF
    (2018) The DHS Program, Haiti: Standard DHS, 2016-17
    ID survey-display-503.cfm. Haiti Enquête Mortalité, Morbidité et Utilisation des Services 2016-2017 - EMMUS-VI [Dataset].

References

  1. Conference
    1. Institut Haïtien de l’Enfance
    (2018)
    Haiti enquête mortaité, morbidité et utilisation des services 2016-2017 - EMMUS-VI [dataset]
    IHE/Haiti, ICF [Producers] IHE/Haiti, & ICF.
  2. Report
    1. Kneib T
    2. Umlauf N
    (2017)
    A Primer on Bayesian Distributional Regression Working Paper Technical Report 2017-13
    Working Papers in Economics and Statistics.
    1. Krivitsky PN
    2. Handcock MS
    (2014) A separable model for dynamic networks
    Journal of the Royal Statistical Society: Series B 76:29–46.
    https://doi.org/10.1111/rssb.12014
  3. Software
    1. R Development Core Team
    (2020) R: A Language and Environment for Statistical Computing
    R Foundation for Statistical Computing, Vienna, Austria.
  4. Book
    1. The DHS program
    (2021)
    The DHS Program
    USAID.
  5. Report
    1. Vehtari A
    2. Gabry J
    3. Magnusson M
    4. Yao Y
    5. Bürkner P-C
    6. Paananen T
    7. Gelman A
    (2020)
    Loo: Efficient Leave-One-Out Cross-Validation and WAIC for Bayesian Models
    loo.
  6. Book
    1. World Health Organization and UNAIDS
    (2020)
    Recommended Population Size Estimates of Men Who Have Sex with Men
    World Health Organization.

Decision letter

  1. Eduardo Franco
    Senior Editor; McGill University, Canada
  2. Talía Malagón
    Reviewing Editor; McGill University, Canada
  3. Adam Akullian
    Reviewer; Institute for Disease Modeling, United States

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

This paper demonstrates the use of a sinh-arcsinh distribution to fit sex partner age distributions across different settings, and shows a superior fit compared to other more conventional distributions and regression methods. The methods outlined in this study will be of particular interest to social science researchers examining age mixing patterns, and for infectious disease modelers wishing to model sexual partner age mixing into their frameworks.

Decision letter after peer review:

Thank you for submitting your article "Evaluating distributional regression strategies for modelling self-reported sexual age-mixing" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by a Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Adam Akullian (Reviewer #2).

As is customary in eLife, the reviewers have discussed their critiques with one another. What follows below is the Reviewing Editor's edited compilation of the essential and ancillary points provided by reviewers in their critiques and in their interaction post-review. Please submit a revised version that addresses these concerns directly.

Essential revisions:

1) Include more discussion about biases in self-reported data, including misreporting of partnership characteristics, selection bias, and missing data; also, discuss how these biases may affect the modelling of age mixing patterns.

2) Some methods could be further clarified by including text from the appendix into the main text.

3) Discuss how your results compare with and add to previous statistical frameworks for age mixing, notably log-linear models for mixing and Exponential-family Random Graph Models

Reviewer #1:

In this study, Wolock et al. compare the ability of different distributions to fit sexual age mixing patterns in a variety of settings, and provide a modeling framework for fitting these distributions across all ages without having to partition the data into separate datasets. They convincingly demonstrate that a sinh-arcsinh distribution will better fit the data than many other commonly used distributions, and also explore various ways of parameterizing age differences between partners. In general, the sinh-arcsinh produced both better fit statistics and more visually pleasing fits to the data. Moreover, the results confirm that more complex distributional assumptions with more parameters are in many cases warranted to fit age mixing patterns.

Particular strengths of this study are the exhaustive examination of multiple distributional assumptions, the use of multiple datasets to test the generalizability of results across regions, and use of objective fit statistics to discriminate between models and distributions. I do not see any major weaknesses, apart from maybe the issue of how possible it would be to fit these more complex distributions and models to smaller datasets, where there is less data to inform parameters.

As the authors point out, the methods outlined in this paper could be extended by including further stratification factors predicting age mixing, or potentially also mixing by some other continuous characteristics that are not expected to follow a normal distribution. The authors provided some sample code for fitting these distributions to the data, which should help others implement their proposed methods.

Comments for the authors:

• I had a lot of difficulty following the methods for the second aim, the regression component. I was very unclear on what was the outcome (dependent variables) being modeled and what were the parameters of the model. The appendix on model specifications helped to explain this a bit more. I think the manuscript would benefit from putting the model specification details from the appendix in the main text, as I could not understand what was being done without this further information.

• I find the heaping of ages surprising. This might have occurred due to the way the question about partner age was framed to participants. It might be useful to comment on the wording of this question across the different studies, and if there was a particular way of wording the question that led to more heaping in one study than in others.

• P. 8 I was initially confused as to why there are 36 subsets, given that there are only 2 sexes and 6 age groups between 20 and 50. It was not clear that there was a third stratification (survey). You should mention there are also 3 surveys (2*6*3=36).

• Table 1: the definition in the legend for Ce,d should probably read Ce,d(x).

Reviewer #2:

In "Evaluating distributional regression strategies for modelling self-reported sexual age-mixing" Wolock and colleagues present a statistical modeling framework to estimate the age-distribution of sexual partners from self-reported data. The authors use data from three population-level surveys where respondents report the ages of their most recent sexual partners. The authors use a Bayesian distributional regression approach to fit probability distribution parameters to the data and assess model performance. In this way, the authors demonstrate that a unique distribution of partnership ages can be reconstructed for each age and sex combination. The best fitting distributions and parameterizations are presented in the paper and the authors found that the sinh-arcsinh distribution most accurately reconstructed the distribution of partnership ages for each of the three geographies. Results from this analysis will be highly useful as a component of HIV (or other STI) risk prediction or to reconstruct sexual networks for mathematical models of STI transmission.

The main strength of this analysis is the simplicity of their modeling approach, which uses probability distributions with few parameters to capture the age distribution of sexual partnerships. This approach will allow researchers to easily simulate age- and sex-specific partnership probabilities and inherent uncertainty in those estimates without the need for complex empirical distributions.

Weaknesses of the analysis center around the inherent limitations of self-reported data on sexual partnership age. The authors use data on self-reported age(s) of a respondent's partnership(s) in the last 12 months. The authors don't consider potential reporting biases and missing data associated with self-report that may be differential by age, gender, and relationship type. For example, some individuals may selectively report certain types of partners over others – perhaps stable partnerships versus transient ones. Missing data is quite common for sensitive questions on sexual histories and thus the results presented here should be interpreted with that in mind. It is important to consider that some relationship types may be under-represented and may lead to biased age distributions.

Nonetheless, the results support the conclusions made by the authors. This paper makes an important methodological contribution to the literature on sexual networks and has broad application to understanding the transmission of STI's, especially HIV in diverse settings.

Overall this paper has a sound methodological approach and is clearly written.

1.Page 12, line 241, the number of partnerships is indicated for each geography but not the number of respondents. Would be good to understand how many individuals reported just one partner versus more than one. Also, I assume the same partner could be reported multiple times at different survey rounds for the two longitudinal datasets, which could create pseudo replication. Could the authors discuss how this type of correlated data might influence the uncertainty estimates?

Reviewer #3:

The primary weakness with this manuscript is that the authors appear to be unaware of two robust statistical frameworks that have been used to estimate age mixing (and other forms of heterogeneous mixing) for many years.

The first is the use of log-linear models for mixing matrices (Morris, 1991, 1993). These models can be used to fit a wide array of mixing patterns, both discrete and continuous, and are also quite general. They offer everything from parsimonious summary parameterization (with assessments of goodness of fit) to dummy parameterization for each cell in the matrix to get exact fits. They benefit from the well developed statistical theory and methods that support generalized linear models. While they can be generalized to include random effects for persons, the limitations of these models is that they can not be used to represent "edge dependence" (think monogamy bias – the presence of one spouse ("edge") has a profound effect on the probabilty of another spouse for the same individual). The papers above integrate the resulting estimates into a standard deterministic compartmental model for epidemic simulation.

The second is the more recent Exponential-family Random Graph Models (ERGMs, and their temporal cousins TERGMs – references provided below). These models are also a form of GLM, but extend the log-linear framework with the capability to represent dependent edges. They are designed primarily for use with stochastic network epidemic models, and are sufficiently well established that they serve as the foundation for the EpiModel (Jenness et al. 2018) software package.

In both cases, the benefit of these models is that they are a general, principled statistical framework, with well understood properties and established methods for model testing and evaluation. This enables the analyst to model not only age mixing, but other forms of heterogeneous mixing (e.g., by sex, or geographic location, or race/ethnicity), and, in the case of TERGMs, degree distributions and higher order network configurations, along with heterogeneous edge duration. All of these terms can be estimated jointly, from sampled network data, and simulations from the estimated model will reproduce the sufficient statistics in expectation. This latter property makes these models a natural foundation for epidemic simulation.

As a result, the added value of this distributional regression strategy is unclear. And this statement: "More broadly, no previous work has systematically evaluated the wide variety of distributions potentially available to model partner age distributions" is simply untrue. Both frameworks above provide general methods for fitting and assessing a wide variety of distributions.

References:

Morris, M. (1991). "A log-linear modeling framework for selective mixing." Math Biosc 107: 349-377.

Morris, M. (1993). "Epidemiology and Social Networks: Modeling Structured Diffusion." Soc. Meth. Res. 22(1): 99-126.

Jenness, S. M., S. M. Goodreau and M. Morris (2018). "EpiModel: An R Package for Mathematical Modeling of Infectious Disease over Networks." Journal of Statistical Software 84(8).

Selected refs for ERGMs and TERGMs

Hunter, D. R., M. S. Handcock, C. T. Butts, S. M. Goodreau and M. Morris (2008). "ergm: A Package to Fit, Simulate and Diagnose Exponential-Family Models for Networks." Journal of Statistical Software 42(i03).

Hunter, D. R., S. M. Goodreau and M. S. Handcock (2008). "Goodness of fit of social network models." Journal of the American Statistical Association 103(481): 248-258.

Krivitsky, P. N. and M. S. Handcock (2014). "A separable model for dynamic networks." Journal of the Royal Statistical Society, Series B 76(1): 29-46.

Krivitsky, P. N., M. S. Handcock and M. Morris (2011). "Adjusting for Network Size and Composition Effects in Exponential-Family Random Graph Models." Statistical Methodology 8(4): 319–339.

Krivitsky, P. N. and M. Morris (2017). "Inference for social network models from egocentrically sampled data, with application to understanding persistent racial disparities in HIV prevalence in the US." Annals of Applied Statistics 11(1): 427-455.

https://doi.org/10.7554/eLife.68318.sa1

Author response

Essential revisions:

1) Include more discussion about biases in self-reported data, including misreporting of partnership characteristics, selection bias, and missing data; also, discuss how these biases may affect the modelling of age mixing patterns.

Thank you for this suggestion. The reviewers raise a good point about missingness and bias. Our method cannot resolve the limitations inherent in self-reported data, but, for certain types of missingness, it could be used to smooth and interpolate missing data. We have included a paragraph (lines 456-463) addressing this limitation:

“Finally, our model does not address any reporting biases in self-reported partnership data. If certain relationship types are perceived as less socially acceptable, respondents might be less likely to report them, resulting in systematic missingness. Our method could still be appropriate to model the age distribution from the data reported about an under-reported partnership type, but it cannot predict whether or not a given partnership exists. However, if under-reporting correlates with partner age (or age difference), then the empirical distributions will be biased and our method will only smooth and interpolate the biased data.”

2) Some methods could be further clarified by including text from the appendix into the main text.

Thank you for this feedback. We have moved some of the explanation of the distributional regression methods from the Appendix back into the main text. In lines 235-240 we have included a general description of the models we fit, and in lines 241-245, we have clarified that our five regression models vary in how their design matrices are defined:

“By varying the specifications of the four design matrices, Xµ, Xσ, Xε, and Xδ, we tested how well a series of increasingly complex distributional models fit to each data set. We fit the following models, which varied in the definitions of their four design matrices:”

3) Discuss how your results compare with and add to previous statistical frameworks for age mixing, notably log-linear models for mixing and Exponential-family Random Graph Models

Thank you for this suggestion. We focused our analysis on a relatively narrow problem of identifying continuous parametric models for sexual partner ages conditional on an individual’s age and sex. We agree with the editors and reviewer that it is valuable to connect our work to the broader context of more general and flexible statistical approaches to model sexual partner mixing matrices and sexual network, and greatly appreciate the suggestion.

We have elaborated the introductory paragraph on previous approaches to modelling sexual partner age distributions to also draw connections to this wider literature (lines 52-77; revisions emphasised below):

“One may consider statistical modelling approaches for the distribution of partner age as a function of respondent age and sex. […] These stochastic methods, along with the broader suite of ERGMs (David R. Hunter et al. 2008a; David R Hunter et al. 2008b; Krivitsky and Handcock 2014; Krivitsky, Handcock, and Martina Morris 2011), can model social network data accurately with robust incorporation of covariates, and tools exist to incorporate their estimates into epidemic models (Morris 1993; Jenness et al. 2018).”

In the penultimate paragraph of the introduction (line 78), we have clarified the focus of our analysis in the context of this wider research area:

“We focused on evaluating parametric models for continuously representing the distribution of sexual partner ages conditional on the respondent's age…”

Finally in the Discussion section (line 481), we have noted the potential opportunity to use sinh-arcsinh distribution within more comprehensive models for mixing matrices, such as those described above:

“The distribution regression framework with the sinh-arcsinh may also be a useful parametric model for continuous representation of marginal distributions within sexual mixing models or network models, such as the ERGM framework.”

Reviewer #1:

In this study, Wolock et al. compare the ability of different distributions to fit sexual age mixing patterns in a variety of settings, and provide a modeling framework for fitting these distributions across all ages without having to partition the data into separate datasets. They convincingly demonstrate that a sinh-arcsinh distribution will better fit the data than many other commonly used distributions, and also explore various ways of parameterizing age differences between partners. In general, the sinh-arcsinh produced both better fit statistics and more visually pleasing fits to the data. Moreover, the results confirm that more complex distributional assumptions with more parameters are in many cases warranted to fit age mixing patterns.

Particular strengths of this study are the exhaustive examination of multiple distributional assumptions, the use of multiple datasets to test the generalizability of results across regions, and use of objective fit statistics to discriminate between models and distributions. I do not see any major weaknesses, apart from maybe the issue of how possible it would be to fit these more complex distributions and models to smaller datasets, where there is less data to inform parameters.

As the authors point out, the methods outlined in this paper could be extended by including further stratification factors predicting age mixing, or potentially also mixing by some other continuous characteristics that are not expected to follow a normal distribution. The authors provided some sample code for fitting these distributions to the data, which should help others implement their proposed methods.

Comments for the authors:

• I had a lot of difficulty following the methods for the second aim, the regression component. I was very unclear on what was the outcome (dependent variables) being modeled and what were the parameters of the model. The appendix on model specifications helped to explain this a bit more. I think the manuscript would benefit from putting the model specification details from the appendix in the main text, as I could not understand what was being done without this further information.

Thank you for raising this point. As discussed in our second response to the Essential Revisions above, we have provided a specific definition of our sinh-arcsinh distributional model in the main text (including a definition of the dependent variable) in lines 235-240. In lines 241-245, we have clarified that our five distributional models vary in their definitions of the four design matrices. We hope that these changes clarify the methods.

• I find the heaping of ages surprising. This might have occurred due to the way the question about partner age was framed to participants. It might be useful to comment on the wording of this question across the different studies, and if there was a particular way of wording the question that led to more heaping in one study than in others.

The reviewer is correct that heaping is determined how the questions are phrased. The AHRI survey asks how much older/younger the partner is, so respondents round age differences to multiples of five (as opposed to rounding partner ages themselves). We have included a description of this phenomenon in the main text in lines 151-154: “For example, if a questionnaire asks "how many years older or younger is your partner than you?", respondents might be disproportionately likely to report a multiple of five, leading to age differences that are heaped on multiples of five.”

• P. 8 I was initially confused as to why there are 36 subsets, given that there are only 2 sexes and 6 age groups between 20 and 50. It was not clear that there was a third stratification (survey). You should mention there are also 3 surveys (2*6*3=36).

Thank you for this feedback. We have clarified that the data subsets are data set-, sex-, and age bin-specific in line 214: replaced “all 36 subsets” with “all 36 data set-/sex-/age bin-specific subsets”. In line 160 we have replaced “each data set” with “all three data sets”

• Table 1: the definition in the legend for Cε,δ should probably read Cε,δ(x).

Thank you for pointing out this mistake. We have replaced “Cε,δ” with “Cε,δ(x)” in the caption of Table 1.

Reviewer #2:

[…] Overall this paper has a sound methodological approach and is clearly written. The following are suggestions to improve clarity and understanding:

1.Ppage 12, line 241, the number of partnerships is indicated for each geography but not the number of respondents. Would be good to understand how many individuals reported just one partner versus more than one. Also, I assume the same partner could be reported multiple times at different survey rounds for the two longitudinal datasets, which could create pseudo replication. Could the authors discuss how this type of correlated data might influence the uncertainty estimates?

The reviewer raises an excellent point that we did not address in the original submission. We view both of these forms of structure in the data sets as violations of the independence we have assumed across observations. Because of this pseudo-replication, we suspect that we are attributing too much information to respondents who are repeated in the data and therefore might be deflating estimated uncertainty. We have included the number of individual respondents and the resulting average partners per respondent in lines 292-285: “There were 36,033 respondents reporting at least one partnership in the AHRI data, 25,024 in the Manicaland data, and 12,143 in the Haiti DHS, resulting in averages of 2.2, 2.3, and 1.0 partners per respondent, respectively.”

We have also included a paragraph (lines 447-455) identifying both of these details as a limitation of the method:

“There are two sources of possible non-independence in our data sets that were not modelled. […] Because we modelled multiple observations of the same individual as conditionally independent, we anticipate that these correlations may artificially increase the precision of our estimates.”

Reviewer #3:

The primary weakness with this manuscript is that the authors appear to be unaware of two robust statistical frameworks that have been used to estimate age mixing (and other forms of heterogeneous mixing) for many years.

The first is the use of log-linear models for mixing matrices (Morris, 1991, 1993). These models can be used to fit a wide array of mixing patterns, both discrete and continuous, and are also quite general. They offer everything from parsimonious summary parameterization (with assessments of goodness of fit) to dummy parameterization for each cell in the matrix to get exact fits. They benefit from the well developed statistical theory and methods that support generalized linear models. While they can be generalized to include random effects for persons, the limitations of these models is that they can not be used to represent "edge dependence" (think monogamy bias – the presence of one spouse ("edge") has a profound effect on the probabilty of another spouse for the same individual). The papers above integrate the resulting estimates into a standard deterministic compartmental model for epidemic simulation.

The second is the more recent Exponential-family Random Graph Models (ERGMs, and their temporal cousins TERGMs – references provided below). These models are also a form of GLM, but extend the log-linear framework with the capability to represent dependent edges. They are designed primarily for use with stochastic network epidemic models, and are sufficiently well established that they serve as the foundation for the EpiModel (Jenness et al. 2018) software package.

In both cases, the benefit of these models is that they are a general, principled statistical framework, with well understood properties and established methods for model testing and evaluation. This enables the analyst to model not only age mixing, but other forms of heterogeneous mixing (e.g., by sex, or geographic location, or race/ethnicity), and, in the case of TERGMs, degree distributions and higher order network configurations, along with heterogeneous edge duration. All of these terms can be estimated jointly, from sampled network data, and simulations from the estimated model will reproduce the sufficient statistics in expectation. This latter property makes these models a natural foundation for epidemic simulation.

As a result, the added value of this distributional regression strategy is unclear. And this statement: "More broadly, no previous work has systematically evaluated the wide variety of distributions potentially available to model partner age distributions" is simply untrue. Both frameworks above provide general methods for fitting and assessing a wide variety of distributions.

References:

Morris, M. (1991). "A log-linear modeling framework for selective mixing." Math Biosc 107: 349-377.

Morris, M. (1993). "Epidemiology and Social Networks: Modeling Structured Diffusion." Soc. Meth. Res. 22(1): 99-126.

Jenness, S. M., S. M. Goodreau and M. Morris (2018). "EpiModel: An R Package for Mathematical Modeling of Infectious Disease over Networks." Journal of Statistical Software 84(8).

Selected refs for ERGMs and TERGMs

Hunter, D. R., M. S. Handcock, C. T. Butts, S. M. Goodreau and M. Morris (2008). "ergm: A Package to Fit, Simulate and Diagnose Exponential-Family Models for Networks." Journal of Statistical Software 42(i03).

Hunter, D. R., S. M. Goodreau and M. S. Handcock (2008). "Goodness of fit of social network models." Journal of the American Statistical Association 103(481): 248-258.

Krivitsky, P. N. and M. S. Handcock (2014). "A separable model for dynamic networks." Journal of the Royal Statistical Society, Series B 76(1): 29-46.

Krivitsky, P. N., M. S. Handcock and M. Morris (2011). "Adjusting for Network Size and Composition Effects in Exponential-Family Random Graph Models." Statistical Methodology 8(4): 319–339.

Krivitsky, P. N. and M. Morris (2017). "Inference for social network models from egocentrically sampled data, with application to understanding persistent racial disparities in HIV prevalence in the US." Annals of Applied Statistics 11(1): 427-455.

We thank the Reviewer for these suggestions. As we have indicated in our responses to the Essential Revisions above, we conceptualised the focus of our work on the somewhat narrower problem of continuously modelling the distributional features of sexual partner age mixing. We agree with, and greatly appreciate, the suggestion to draw connection and situate our work in the context of these powerful and flexible statistical approaches to modelling mixing matrices and sexual networks. Our specific revisions, including references to the literature highlighted above, are detailed in our responses to the ‘Essential revisions’ section above.

https://doi.org/10.7554/eLife.68318.sa2

Article and author information

Author details

  1. Timothy M Wolock

    Department of Mathematics, Imperial College London, London, United Kingdom
    Contribution
    Conceptualization, Software, Formal analysis, Visualization, Methodology, Writing - original draft, Writing - review and editing
    For correspondence
    t.wolock18@imperial.ac.uk
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-5898-1014
  2. Seth Flaxman

    Department of Mathematics, Imperial College London, London, United Kingdom
    Contribution
    Conceptualization, Supervision, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-2477-4217
  3. Kathryn A Risher

    1. MRC Centre for Global Infectious Disease Analysis, School of Public Health, Imperial College London, London, United Kingdom
    2. London School of Hygiene & Tropical Medicine, London, United Kingdom
    Contribution
    Conceptualization, Data curation, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-9588-1693
  4. Tawanda Dadirai

    Manicaland Centre for Public Health Research, Biomedical Research and Training Institute, Harare, Zimbabwe
    Contribution
    Data curation, Writing - review and editing
    Competing interests
    No competing interests declared
  5. Simon Gregson

    1. MRC Centre for Global Infectious Disease Analysis, School of Public Health, Imperial College London, London, United Kingdom
    2. Manicaland Centre for Public Health Research, Biomedical Research and Training Institute, Harare, Zimbabwe
    Contribution
    Data curation, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-2707-0714
  6. Jeffrey W Eaton

    MRC Centre for Global Infectious Disease Analysis, School of Public Health, Imperial College London, London, United Kingdom
    Contribution
    Conceptualization, Supervision, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-7728-728X

Funding

Bill and Melinda Gates Foundation (OPP1190661)

  • Jeffrey W Eaton

Bill and Melinda Gates Foundation (OPP1164897)

  • Kathryn A Risher
  • Simon Gregson
  • Jeffrey W Eaton

Medical Research Council (MR/R015600/1)

  • Simon Gregson
  • Jeffrey W Eaton

National Institute of Allergy and Infectious Diseases (R01AI136664)

  • Jeffrey W Eaton

Engineering and Physical Sciences Research Council (EP/V002910/1)

  • Seth Flaxman

Imperial College London (President's PhD Scholarship)

  • Timothy M Wolock

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

JWE, SG, and KAR acknowledge funding support from the Bill and Melinda Gates Foundation (OPP1190661, OPP1164897). JWE and SG acknowledge funding support from the MRC Centre for Global Infectious Disease Analysis (reference MR/R015600/1), jointly funded by the UK Medical Research Council (MRC) and the UK Foreign, Commonwealth and Development Office (FCDO), under the MRC/FCDO Concordat agreement and is also part of the EDCTP2 programme supported by the European Union. JWE also acknowledges funding support from National Institute of Allergy and Infectious Disease of the National Institutes of Health under award number R01AI136664. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. SRF acknowledges funding support from the EPSRC (EP/V002910/1). TMW’s work is funded by the Imperial College President’s PhD Scholarship. We thank all of the people who participated in the three studies that shared their data with us, as well as the survey and data management teams, without whom this work would not be possible.

Ethics

Human subjects: We conducted secondary analysis of previously collected anonymised data in compliance with each data producer's use requirements. Procedures and questionnaires for standard DHS surveys have been reviewed and approved by the ICF International Institutional Review Board (IRB). The Manicaland study was approved by the Medical Research Council of Zimbabwe and the Imperial College Research Ethics Committee. The Africa Centre Demographic Information System PIP surveillance study was approved by Biomedical Research Ethics Committee, University of KwaZulu-Natal, South Africa (BE290/16).

Senior Editor

  1. Eduardo Franco, McGill University, Canada

Reviewing Editor

  1. Talía Malagón, McGill University, Canada

Reviewer

  1. Adam Akullian, Institute for Disease Modeling, United States

Version history

  1. Received: March 11, 2021
  2. Accepted: June 23, 2021
  3. Accepted Manuscript published: June 24, 2021 (version 1)
  4. Version of Record published: July 7, 2021 (version 2)

Copyright

© 2021, Wolock et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 507
    Page views
  • 46
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Timothy M Wolock
  2. Seth Flaxman
  3. Kathryn A Risher
  4. Tawanda Dadirai
  5. Simon Gregson
  6. Jeffrey W Eaton
(2021)
Evaluating distributional regression strategies for modelling self-reported sexual age-mixing
eLife 10:e68318.
https://doi.org/10.7554/eLife.68318

Further reading

    1. Epidemiology and Global Health
    Charumathi Sabanayagam, Feng He ... Ching Yu Cheng
    Research Article Updated

    Background:

    Machine learning (ML) techniques improve disease prediction by identifying the most relevant features in multidimensional data. We compared the accuracy of ML algorithms for predicting incident diabetic kidney disease (DKD).

    Methods:

    We utilized longitudinal data from 1365 Chinese, Malay, and Indian participants aged 40–80 y with diabetes but free of DKD who participated in the baseline and 6-year follow-up visit of the Singapore Epidemiology of Eye Diseases Study (2004–2017). Incident DKD (11.9%) was defined as an estimated glomerular filtration rate (eGFR) <60 mL/min/1.73 m2 with at least 25% decrease in eGFR at follow-up from baseline. A total of 339 features, including participant characteristics, retinal imaging, and genetic and blood metabolites, were used as predictors. Performances of several ML models were compared to each other and to logistic regression (LR) model based on established features of DKD (age, sex, ethnicity, duration of diabetes, systolic blood pressure, HbA1c, and body mass index) using area under the receiver operating characteristic curve (AUC).

    Results:

    ML model Elastic Net (EN) had the best AUC (95% CI) of 0.851 (0.847–0.856), which was 7.0% relatively higher than by LR 0.795 (0.790–0.801). Sensitivity and specificity of EN were 88.2 and 65.9% vs. 73.0 and 72.8% by LR. The top 15 predictors included age, ethnicity, antidiabetic medication, hypertension, diabetic retinopathy, systolic blood pressure, HbA1c, eGFR, and metabolites related to lipids, lipoproteins, fatty acids, and ketone bodies.

    Conclusions:

    Our results showed that ML, together with feature selection, improves prediction accuracy of DKD risk in an asymptomatic stable population and identifies novel risk factors, including metabolites.

    Funding:

    This study was supported by the National Medical Research Council, NMRC/OFLCG/001/2017 and NMRC/HCSAINV/MOH-001019-00. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

    1. Epidemiology and Global Health
    C Kim, Benjamin Chen ... RECOVER Mechanistic Pathways Task Force
    Review Article

    The NIH-funded RECOVER study is collecting clinical data on patients who experience a SARS-CoV-2 infection. As patient representatives of the RECOVER Initiative’s Mechanistic Pathways task force, we offer our perspectives on patient motivations for partnering with researchers to obtain results from mechanistic studies. We emphasize the challenges of balancing urgency with scientific rigor. We recognize the importance of such partnerships in addressing post-acute sequelae of SARS-CoV-2 infection (PASC), which includes ‘long COVID,’ through contrasting objective and subjective narratives. Long COVID’s prevalence served as a call to action for patients like us to become actively involved in efforts to understand our condition. Patient-centered and patient-partnered research informs the balance between urgency and robust mechanistic research. Results from collaborating on protocol design, diverse patient inclusion, and awareness of community concerns establish a new precedent in biomedical research study design. With a public health matter as pressing as the long-term complications that can emerge after SARS-CoV-2 infection, considerate and equitable stakeholder involvement is essential to guiding seminal research. Discussions in the RECOVER Mechanistic Pathways task force gave rise to this commentary as well as other review articles on the current scientific understanding of PASC mechanisms.