Introduction

The fields of quantitative genetics and genetic epidemiology have exploited the shared genetics and environment of siblings in a range of applications, notably to estimate heritability, using theory first developed a century ago (1, 2), and more recently to infer a so-called ‘household effect’ (3), which contributes to genetic risk indirectly via a correlation between genetics and the household environment. Here we leverage trait information on siblings to infer the genetic architecture of those traits, not only at the population-level, but also in relation to whether the genetic liability of each individual - specifically those with extreme trait values - is a result of a few large effect alleles or many small effect alleles.

The genetic architecture of a complex trait is typically inferred from the findings of multiple studies, with genome-wide association studies (GWAS) identifying common variants (4, 5), whole exome or whole genome sequencing studies detecting rare variants (6), and family sequencing studies designed to identify de novo and rare Mendelian mutations (7). The relative contribution of each type of variant to trait heritability is a function of historical selection pressures on the trait in the population (8, 9). If selection has recently acted to increase the average value of a trait, then the lower tail of the trait distribution will be subject to negative selection and may be enriched for large effect rare variants (10), while if the trait is subject to stabilising selection, then both tails of the trait distribution may be enriched for rare variant aetiology (11). This can result in less accurate polygenic scores in the tails of the trait distribution (12), but can also produce dissimilarity between siblings beyond what is expected under polygenicity. For example, studies on the intellectual ability of sibling pairs have demonstrated similarity for average intellectual ability (13), regression-to-the-mean for siblings at the upper tail of the distribution (13), and complete discordance when one sibling is at the lower extreme tail of the distribution (14). Findings from studies such as Reichenberg et al. 2016 (14) are consistent with the presence of de novo alleles of large effect in the trait tails, alongside alternative explanations such as specific environmental exposures. However, no theoretical framework has been developed to formally infer genetic architecture from sibling trait data.

We introduce a theoretical framework that allows widely available sibling trait data in population cohorts and health registries to be leveraged to perform statistical tests that estimate complex genetic architecture in the tails of the trait distribution. These tests can differentiate between polygenic architecture, de novo alleles, and rare variants of large effect (hereafter ‘Mendelian variants’ for shorthand). This framework establishes expectations about the trait distributions of siblings of index individuals with extreme trait values (e.g. the top 1% of the trait) according to polygenic and non-polygenic tail trait architecture (see Figure 1).

Figure 1. Sibling Similarity Under Different Tail Architectures.

From left to right: When individuals have extreme (top 1%) tail values as a result of many common alleles of small effect (polygenicity), then their siblings’ trait values are expected to show regression-to-the-mean (grey). When individuals have extreme trait values as a result of de novo mutations of large effect, then their siblings are expected to have trait values that correspond to the background distribution. When individuals have extreme trait values as a result of a rare variant of large effect inherited from one of their parents (Mendelian), then their siblings are expected to have either similarly extreme trait values or trait values that are drawn from the background distribution, depending on whether or not they inherited the same large effect allele.

Critical to our framework is our derivation of the “conditional-sibling trait distribution”, which describes the trait distribution for one individual given the quantile value of one or more “index” siblings. Our statistical framework, derivation of the conditional-sibling trait distribution, and simulation study, allow us to develop statistical tests to infer genetic architecture from sibling registries (15) without the need for genetic data (see Model & Methods). We validate the statistical power of our tests using simulated data and real data from the UK Biobank. Our novel framework can be extended to applications such as estimating heritability, inferring assortative mating, and characterising historical selection pressures.

Model & Methods

Here we outline a framework that models the conditional-sibling trait distribution, which describes an individual’s trait distribution conditioned on one or more index sibling(s). In this section we describe our derivation of the distribution for a completely polygenic trait, outline development of statistical tests that detect complex genetic architecture via departure from polygenicity, and describe the simulation scheme used to validate and benchmark our results. Full derivations that complement this high-level summary of our methodology are included in the Appendices.

Conditional-Sibling Inference Framework

Assuming completely polygenic architecture, siblings of individuals with extreme trait values are expected to be much less extreme. This regression-to-the-mean can be understood through the two factors (16, 17) that determine inherited genetic liability: (i) average genetic liability of the parents, known as “midparent” liability, and, (ii) random genetic reassortment occurring during meiosis. Assuming that an individual presenting an upper-tail trait value does so due to both high midparent liability and genetic reassortment favouring a higher value, then a sibling sharing common midparent liability - but subject to independent reassortment - is likely less extreme. How much less extreme can be derived by first considering the conditional distribution that relates midparents and their offspring. In the simplest case, for a completely heritable continuous polygenic trait in a large randomly mating population (18), the offspring trait value, s, is normally distributed around the midparent trait value, m, as follows:

$$ s \mid m \sim \mathcal{N}\left(m, \frac{\sigma^2}{2}\right) \tag{1} $$

where σ2 is the population trait variance. Note that the trait variance within families is half that of the population trait variance under neutrality and random mating, which is a key property in quantitative genetics (19-21). For a trait with heritability h2, trait variance can be partitioned into genetic, $\sigma_g^2$, and environmental, $\sigma_e^2$, contributions, such that $\sigma_g^2 = h^2\sigma^2$ and $\sigma_e^2 = (1 - h^2)\sigma^2$. Assuming mean-centred genetic and environmental trait contributions, and a trait variance of 1 (a standard normalised trait), the distribution for offspring conditioned on the midparent genetic liability, mg, is:

$$ s \mid m_g \sim \mathcal{N}\left(m_g, 1 - \frac{h^2}{2}\right) \tag{2} $$

Since p(mg) ∼ 𝒩(0, h2/2), from Bayes’ Theorem it can be shown that the midparent liability conditional on offspring trait value, s, is distributed as follows:

$$ m_g \mid s \sim \mathcal{N}\left(\frac{h^2 s}{2}, \frac{h^2}{2}\left(1 - \frac{h^2}{2}\right)\right) \tag{3} $$

From Equations 2 & 3, the sibling trait distribution conditional on an index sibling can be calculated:

$$ s_2 \mid s_1 \sim \mathcal{N}\left(\frac{h^2 s_1}{2}, 1 - \frac{h^4}{4}\right) \tag{4} $$

Here we note the relative simplicity of this distribution, which means, for example, that under complete heritability, the siblings of an individual with a standard normal trait value of z will have a mean trait value of z/2, with variance equal to three quarters of the population variance. In Results (Figure 4) we illustrate how trait heritability and index quantile determine the conditional-sibling distribution in standard trait space and in population percentile space. In Appendix 1 we provide a full derivation of this distribution (Equation 4) and generalize our analytical results across a range of scenarios, including binary phenotypes and multiple index siblings, increasing the utility of this framework for further applications and theoretical development.
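As a minimal sketch of the polygenic expectation described above, the following Python helper returns the mean and variance of a sibling's standardised trait value given an index sibling's z-value. The function name `conditional_sibling_params` is illustrative and not from the source; the sketch assumes a unit-variance trait, random mating and independent sibling environments, as stated in the text.

```python
def conditional_sibling_params(z_index, h2):
    """Mean and variance of a sibling's standardised trait value,
    given an index sibling at z_index, under a purely polygenic model."""
    if not 0.0 <= h2 <= 1.0:
        raise ValueError("h2 must lie in [0, 1]")
    mean = h2 * z_index / 2.0      # regression towards the population mean
    var = 1.0 - (h2 ** 2) / 4.0    # 1 - h^4/4 in the paper's notation
    return mean, var


# Fully heritable trait, index sibling two SDs above the mean
mean, var = conditional_sibling_params(2.0, 1.0)
print(mean, var)  # 1.0 0.75
```

With h2 = 1 the sibling mean is half the index value and the variance is three quarters of the population variance, matching the worked example in the preceding paragraph.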

Statistical Tests for Complex Tail Architecture

In Figure 2, the strategy employed to develop statistical tests for complex tail architecture is depicted. Our approach corresponds to testing for deviations from the expected conditional-sibling trait distribution under the null hypothesis of polygenicity in the trait tails: excess discordance is indicative of an enrichment of de novo mutations, while excess concordance indicates an enrichment of Mendelian variants, i.e. large effect variants segregating in the population. The heritability, h2, required to define the null distribution, is estimated by maximising the log-likelihood of the conditional-sibling trait distribution (Equation 4) with respect to h2:

$$ \hat{h}^2 = \underset{h^2}{\arg\max}\ \ell(h^2), \qquad \ell(h^2) = -\frac{n}{2}\log\left[2\pi\left(1 - \frac{h^4}{4}\right)\right] - \sum_{i=1}^{n}\frac{\left(s_{2,i} - h^2 s_{1,i}/2\right)^2}{2\left(1 - h^4/4\right)} \tag{5} $$

where s1 and s2 represent index and conditional-sibling trait values, respectively, and n is the number of sibling pairs. This estimation method allows h2 to be estimated for given quantiles of the trait distribution by restricting sibling pair observations to those index siblings in the quantile of interest. To maximise power to detect non-polygenic architecture in the tails of the trait distribution, we estimate “polygenic heritability” from sibling pairs for which the index sibling trait value is between the 5th and 95th percentile (labeled “Distribution Body” in Figure 2). Tests for complex architecture are then performed in relation to index siblings whose trait values are in the tails of the distribution (e.g. the lower and upper 1%). Below, Aq denotes the set of sibling pairs for which the index sibling is in quantile q such that s1 > Φ−1(q) and s1 < Φ−1(q) for the upper and lower tails, respectively, where Φ−1 is the inverse normal cumulative distribution function.
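The estimation step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names (`estimate_h2`, `simulate_pairs`) are hypothetical, the maximisation uses a simple grid search, and the test data are drawn directly from a bivariate normal with sibling correlation h2/2 (the polygenic null) instead of from real sibling records.

```python
import math
import random
from statistics import NormalDist


def estimate_h2(pairs, lower=0.05, upper=0.95, steps=1000):
    """Grid-search MLE of h2 from (index, sibling) z-value pairs whose
    index sibling lies in the body of the distribution."""
    lo_z = NormalDist().inv_cdf(lower)
    hi_z = NormalDist().inv_cdf(upper)
    body = [(s1, s2) for s1, s2 in pairs if lo_z < s1 < hi_z]
    n = len(body)
    # Sufficient statistics, so each likelihood evaluation is O(1)
    s11 = sum(s1 * s1 for s1, _ in body)
    s12 = sum(s1 * s2 for s1, s2 in body)
    s22 = sum(s2 * s2 for _, s2 in body)

    def loglik(h2):
        var = 1.0 - h2 * h2 / 4.0
        rss = s22 - h2 * s12 + h2 * h2 * s11 / 4.0  # sum of (s2 - h2*s1/2)^2
        return -0.5 * n * math.log(2.0 * math.pi * var) - rss / (2.0 * var)

    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=loglik)


def simulate_pairs(n, h2, seed=1):
    """ASSUMPTION: sibling pairs drawn from the polygenic null directly."""
    rng = random.Random(seed)
    r = h2 / 2.0
    pairs = []
    for _ in range(n):
        s1 = rng.gauss(0.0, 1.0)
        s2 = r * s1 + math.sqrt(1.0 - r * r) * rng.gauss(0.0, 1.0)
        pairs.append((s1, s2))
    return pairs


pairs = simulate_pairs(50000, h2=0.6)
h2_hat = estimate_h2(pairs)
print(h2_hat)
```

Restricting to the 5th to 95th percentile does not bias the estimate under the null, because the likelihood is conditional on the index value; it only reduces the information available.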

Figure 2. Identifying Complex Tail Genetic Architecture.

Conditional-sibling z-values plotted against index sibling quantiles. Grey depicts purely polygenic architecture across all index sibling values. In the Lower Tail, an extreme scenario of de novo architecture is shown (in green), resulting in sibling discordance. In the Upper Tail, an extreme scenario of Mendelian architecture is shown (in red), whereby half the siblings are concordant and half discordant, resulting in a bimodal conditional sibling distribution. Statistical tests to identify the presence of each type of complex tail architecture are designed to exploit these characteristics.

Statistical Test for De Novo Architecture

To identify de novo architecture in the tails of the trait distribution, we introduce a parameter, α, to the log-likelihood defined by the conditional-sibling trait distribution (Equation 5):

Values of α > 0 in the lower tail and α < 0 in the upper tail indicate excess regression-to-the-mean and, thus, high sibling discordance, consistent with an enrichment of de novo mutations among the index siblings. The z-statistic of the one-sided score test for α > 0 in the lower quantile, q, relative to the null of α = 0 is (see Appendix 2 for derivation):

For the upper tail test of α < 0, the above is multiplied by -1.
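A score test of this kind can be sketched as below. The sketch rests on an assumption that the text above does not confirm: that α enters as an additive shift of the conditional mean, so that the sibling value is distributed as 𝒩(h2·s1/2 + α, 1 − h4/4). Under that assumption the score statistic reduces to the standardised sum of residuals from the polygenic expectation. The function name `denovo_z` is hypothetical.

```python
import math


def denovo_z(tail_pairs, h2, tail="lower"):
    """Score-test z-statistic for excess regression-to-the-mean.

    ASSUMPTION: alpha is an additive shift of the conditional mean.
    tail_pairs holds (index, sibling) z-values with the index in the tail.
    """
    if tail not in ("lower", "upper"):
        raise ValueError("tail must be 'lower' or 'upper'")
    var = 1.0 - h2 ** 2 / 4.0
    # Residuals from the polygenic expectation h2 * s1 / 2
    resid_sum = sum(s2 - h2 * s1 / 2.0 for s1, s2 in tail_pairs)
    z = resid_sum / math.sqrt(len(tail_pairs) * var)
    # The upper-tail test looks for alpha < 0, so the sign is flipped
    return z if tail == "lower" else -z


# Index siblings at z = -3 whose siblings sit at the population mean
z_lower = denovo_z([(-3.0, 0.0)] * 4, h2=1.0, tail="lower")
```

In both tails a large positive value indicates siblings that are closer to the population mean than polygenicity predicts.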

Statistical Test for Mendelian Architecture

To identify Mendelian architecture in the tails of the trait distribution, we compare the observed and expected tail sibling concordance, defined by the number of sibling pairs for which both siblings have trait values in the tail. For each index sibling in Aq, we calculate the probability that the conditional sibling is also in Aq, which, for the upper tail, is given by:

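$$ \pi_i = P\left(s_2 > \Phi^{-1}(q) \mid s_{1,i}\right) = 1 - \Phi\left(\frac{\Phi^{-1}(q) - h^2 s_{1,i}/2}{\sqrt{1 - h^4/4}}\right) $$

where Φ represents the normal cumulative distribution function. Denoting the mean of $\pi_i$ across all index siblings in Aq by $\pi_0$, the expected sibling concordance is $n\pi_0$, where n is the number of index siblings in Aq. Given an observed number of concordant siblings r, the z-statistic for a one-sided score test for excess concordance is (see Appendix 2 for derivation) given by:

$$ Z = \frac{r - n\pi_0}{\sqrt{n\pi_0(1 - \pi_0)}} $$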
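The concordance test can be sketched in Python for the upper tail. The names `concordance_prob` and `mendelian_z` are hypothetical, and the sketch assumes that the statistic takes the standard binomial score-test form, with the null probability taken as the mean per-pair concordance probability under the polygenic conditional-sibling distribution.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def concordance_prob(s1, h2, q):
    """P(sibling exceeds the upper-tail threshold | index value s1)
    under the polygenic conditional-sibling distribution."""
    threshold = _STD.inv_cdf(q)
    mean = h2 * s1 / 2.0
    sd = math.sqrt(1.0 - h2 ** 2 / 4.0)
    return 1.0 - _STD.cdf((threshold - mean) / sd)


def mendelian_z(tail_pairs, h2, q=0.99):
    """One-sided z for excess upper-tail concordance.

    ASSUMPTION: binomial score-test form with null probability pi0.
    tail_pairs holds (index, sibling) z-values with the index in the tail.
    """
    threshold = _STD.inv_cdf(q)
    n = len(tail_pairs)
    pi0 = sum(concordance_prob(s1, h2, q) for s1, _ in tail_pairs) / n
    r = sum(1 for _, s2 in tail_pairs if s2 > threshold)
    return (r - n * pi0) / math.sqrt(n * pi0 * (1.0 - pi0))


z_concordant = mendelian_z([(3.0, 3.0)] * 10, h2=0.5)
z_discordant = mendelian_z([(3.0, 0.0)] * 10, h2=0.5)
```

A lower-tail version follows by symmetry, replacing the exceedance probability with the probability of falling below Φ−1(q).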

Simulation of Conditional Sibling Data

We perform simulations using publicly available GWAS data on multiple traits to validate our analytical model and to benchmark our tests for complex architecture. Figure 3 depicts the different stages of our simulation procedure. We start by simulating a “parent population” (step A), assigning genotypes based on the allele frequencies of the first 100k SNPs from a trait GWAS. Additive parent liability can then be calculated based on the genotype effect size distribution of the GWAS. Next, parents are randomly paired, their liabilities averaged to produce midparent trait values (step B), and genotypes of two offspring (Equation 1) and corresponding genetic liabilities, G, are calculated (step C) assuming independent reassortment of parental alleles and unlinked SNPs.

Figure 3. Simulation Schematic.

Publicly available GWAS allele frequency and effect size data is used to simulate parent liability (A). Midparent liability (B) is simulated assuming random mating. Offspring genotype and liability (C) is simulated assuming complete recombination. Environmental variation (D) is added to compare with theoretical polygenic conditional sibling distribution. De novo and Mendelian rare-variant effects are simulated (E) to benchmark tests for complex architecture (F).

In step D, we generate offspring trait values for different degrees of heritability by adding an environmental random effect. For heritability, h2, offspring trait values are given by T = hG + E, where the environmental effect E is drawn from a normal distribution with mean 0 and variance (1− h2). The simulated trait has a 𝒩(0, 1) distribution and the squared correlation between the genetic liability and the trait is equal to the heritability.

In step E, we simulate the effect of complex tail architecture on the conditional-sibling trait distribution. We assume that rare variants are sufficiently penetrant to move individuals into the tails of the distribution independent of polygenic liability. We modify sibling trait values for individuals already in the tails (from Step D) to minimise perturbation of the trait distribution. We simulate de novo tail architecture by resampling the less extreme sibling from the background distribution, and simulate Mendelian tail architecture by resampling the less extreme sibling from the background distribution with probability 0.5 and from the same tail as the extreme sibling with probability 0.5.
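Steps B to E can be illustrated with a simplified liability-level simulation. This sketch is not the genotype-based procedure described above: it assumes normally distributed midparent and segregation components (each with variance 1/2) in place of simulated SNP genotypes, and it applies the de novo perturbation to every upper-tail pair, which corresponds to an extreme scenario. All function names are hypothetical.

```python
import math
import random


def simulate_sib_pairs(n, h2, seed=0):
    """Trait values T = h*G + E for n sibling pairs (steps B to D)."""
    rng = random.Random(seed)
    h = math.sqrt(h2)
    env_sd = math.sqrt(1.0 - h2)
    half_sd = math.sqrt(0.5)
    pairs = []
    for _ in range(n):
        midparent = rng.gauss(0.0, half_sd)       # shared by both siblings
        sibs = []
        for _ in range(2):
            g = midparent + rng.gauss(0.0, half_sd)  # independent segregation
            sibs.append(h * g + rng.gauss(0.0, env_sd))
        pairs.append((sibs[0], sibs[1]))
    return pairs


def add_de_novo(pairs, z_cut, seed=0):
    """Step E, upper tail only: for pairs with a sibling above z_cut,
    redraw the less extreme sibling from the background N(0, 1).
    Modified pairs are returned as (extreme, resampled)."""
    rng = random.Random(seed)
    out = []
    for a, b in pairs:
        if max(a, b) > z_cut:
            out.append((max(a, b), rng.gauss(0.0, 1.0)))
        else:
            out.append((a, b))
    return out


def sib_correlation(pairs):
    """Pearson correlation between first and second siblings."""
    n = len(pairs)
    ma = sum(a for a, _ in pairs) / n
    mb = sum(b for _, b in pairs) / n
    cov = sum((a - ma) * (b - mb) for a, b in pairs)
    va = sum((a - ma) ** 2 for a, _ in pairs)
    vb = sum((b - mb) ** 2 for _, b in pairs)
    return cov / math.sqrt(va * vb)


pairs = simulate_sib_pairs(20000, h2=0.8)
perturbed = add_de_novo(pairs, z_cut=2.326)
```

Under this construction the sibling correlation is expected to be close to h2/2, which is the property the conditional-sibling distribution relies on.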

Application to UK Biobank Data

For the UK Biobank analyses, we used six continuous traits (Body Fat, Mean Corpuscular Haemoglobin, Neuroticism, Heel Bone Mineral Density, Monocyte Count, and Sitting Height) with (22) and data on > 4,500 sibling pairs (sibling pairs defined as having kinship coefficient 0.18 - 0.35 and > 0.1% SNPs with 0 IBD to distinguish them from parent-offspring pairs (23, 24)). Outliers with absolute trait value > 6 standard deviations from the mean were removed and then cohort-wide trait values were standardised using a rank-based inverse normal transformation and adjusted for age, sex, recruitment centre, batch covariates and the first 40 principal components. The sub-sample corresponding to the sibling pairs was then re-normalised, and for each sibling pair one was randomly assigned as the index sibling and the other the conditional-sibling. Sibling pairs were then sorted by their index trait value and each sibling binned according to trait percentile rank among all siblings.
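A rank-based inverse normal transformation of the kind mentioned above can be sketched as follows. The rank offset of 0.5 is an assumption; the text does not state which offset convention was used, and other common choices differ slightly in the extreme tails. The function name `rank_inverse_normal` is hypothetical.

```python
from statistics import NormalDist


def rank_inverse_normal(values, offset=0.5):
    """Map values to normal scores via their ranks.

    ASSUMPTION: quantile for rank k of n is (k - offset) / n.
    Ties are broken by input order in this simple sketch.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std = NormalDist()
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = std.inv_cdf((rank - offset) / n)
    return scores


scores = rank_inverse_normal([10.0, 30.0, 20.0])
```

The output preserves the ordering of the input while forcing the marginal distribution to be approximately standard normal, which is what the conditional-sibling framework assumes.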

Results

Here we illustrate the conditional-sibling trait distribution, validate the accuracy of our analytical model using simulation, perform power analyses for our statistical tests for complex genetic tail architecture (see Model & Methods), and apply our tests to trait data on thousands of siblings from the UK Biobank.

Conditional-Sibling Trait Distribution

In Figure 4:A the conditional-sibling trait distribution (Equation 4) is illustrated at different index sibling trait values (ranked percentiles). For an almost entirely heritable polygenic trait (orange), siblings of individuals at the 99th percentile (z = 2.32) have mean z-scores approximately halfway between the population mean and index mean (i.e. z = 1.1). This regression-to-the-mean is greater when trait heritability is lower (blue), assuming (as here) independent environmental risk among siblings.

Figure 4. Conditional-Sibling Trait Distribution under Polygenic Architecture.

A: The conditional-sibling trait distribution according to Equation 4 for index siblings at the 1st, 25th, 50th, 75th, and 99th percentile of the standardised trait distribution, when heritability is high (h2 = 0.95, in orange) and moderate (h2 = 0.5, in blue). When heritability is 0.95, the conditional-sibling expectation is almost half of the index sibling z-score; when heritability is 0.5, the conditional-sibling expectation is equal to 1/4 of the index sibling z-score. B: The conditional distribution transformed into rank space. An individual whose sibling is at the 99th percentile is expected to have a trait value in the 80th percentile when heritability is high and in the 67th percentile when heritability is moderate.

In Figure 4:B the conditional-sibling z-distribution is transformed into percentiles for interpretation in rank space. This distribution is skewed, especially at the tails, due to truncation at extreme quantiles (i.e. siblings cannot be more extreme than the top 1%). For a trait with h2 = 0.95, siblings of individuals at the 99th percentile (z = 2.32) have a mean trait value at the 80th percentile. Note that this is less extreme than the result of transforming their expected z-value into percentile space (Φ(1.1) = 86%), which is a consequence of Jensen’s inequality (25) given that the normal cumulative distribution function is concave above zero.
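The gap between the mean percentile and the percentile of the mean z-value can be checked numerically. The sketch below uses the standard identity that, for X distributed as 𝒩(μ, σ2), the expectation of Φ(X) equals Φ(μ / √(1 + σ2)); this closed form is supplied here for illustration and is not taken from the source. Function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def expected_sibling_percentile(index_percentile, h2):
    """Mean population percentile of a sibling, given the index percentile."""
    z = _STD.inv_cdf(index_percentile)
    mean = h2 * z / 2.0
    var = 1.0 - h2 ** 2 / 4.0
    # E[Phi(X)] for X ~ N(mean, var) equals Phi(mean / sqrt(1 + var))
    return _STD.cdf(mean / math.sqrt(1.0 + var))


def percentile_of_expected_z(index_percentile, h2):
    """Percentile obtained by transforming the sibling's expected z-value."""
    z = _STD.inv_cdf(index_percentile)
    return _STD.cdf(h2 * z / 2.0)


mean_pct = expected_sibling_percentile(0.99, 0.95)
pct_of_mean = percentile_of_expected_z(0.99, 0.95)
print(round(mean_pct, 2), round(pct_of_mean, 2))
```

For h2 = 0.95 and an index sibling at the 99th percentile, the first quantity is roughly 0.80 and the second roughly 0.86 or 0.87 depending on rounding, in line with the values quoted above.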

We compared the theoretical conditional-sibling trait distributions to those generated from simulated data (see Appendix 3) and found that irrespective of trait used to simulate data (e.g. fluid intelligence, height) the two distributions did not differ significantly, suggesting that our analytically derived distributions are a valid model for the conditional-sibling trait distribution (Equation 4).

Power of Statistical Tests to Identify Complex Tail Architecture

Building on the theoretical framework introduced in the Model & Methods and illustrated in the previous section, we develop statistical tests to identify complex architecture in the tails of the trait distribution. These tests leverage the fact that the similarity (or dissimilarity) in trait values among siblings provides information about the genetic architecture underlying the trait (see Figure 1). For example, high-impact de novo mutations generate large dissimilarity between siblings when only one carries the unique mutant allele, while Mendelian variants can create excess similarity in the tails of the distribution when siblings both inherit the same mutant allele.

In Appendix 2, we provide detailed derivations for the statistical tests described at a high level in Model & Methods and explain how they identify tail signatures in contrast to a polygenic background where conditional siblings regress-to-the-mean at a rate proportional to h2/2 (Equation 4). The performance of the two tests evaluated using simulated sibling data is shown in Figure 5. These tests demonstrate that power to identify de novo architecture is greatest when heritability is high, while power to identify Mendelian architecture is greatest when heritability is low. These patterns can be explained by the fact that high heritability should lead to relatively high similarity among siblings, and low heritability to low similarity, under polygenicity. When heritability is estimated near 50% and at least 0.1% of the population has high-impact rare aetiology, both tests are well-powered to identify each class of complex tail architecture.

Figure 5. Power to detect complex tail architecture for different heritability levels, de novo and Mendelian frequencies, and sample sizes.

Simulation assumes highly penetrant de novo and Mendelian frequencies of 0.05%, 0.1%, 0.2%, 0.3%, and 0.5%. The false-positive rate was set at 0.05. Null simulations (red dashed line) demonstrate tests are well calibrated.

Figure 6. Analysis of Six UK Biobank Traits.

Application of statistical tests for Mendelian and de novo tail architecture to sibling trait data of six UK Biobank traits. For each trait the conditional sibling mean is plotted under polygenicity in which heritability is estimated by index siblings lying between the 5th and 95th percentiles. Statistical tests were applied to conditional siblings with index sibling in the upper and lower 1% of the distribution. Significant associations for the Mendelian and de novo tests are shown in red and green respectively. Tail architecture that is not distinct from polygenic expectation is denoted in grey.

Identifying Complex Tail Architecture in UK Biobank Data

We applied our two statistical tests of complex tail architecture to sibling-pair data on six traits from the UK Biobank (26) to illustrate the performance of our tests on real data. For each of the six traits, we tested the trait distribution for normality, estimated the (polygenic) heritability in the body of the trait distribution via Equation 5 (computed between the 5th and 95th percentiles), and performed tests to identify de novo and Mendelian architecture in the lower and upper tails of the distribution of each trait. We identified lower tail de novo architecture in Heel Bone Mineral Density and Monocyte Count, upper tail de novo architecture in Mean Corpuscular Haemoglobin and de novo architecture at both ends of the distribution in Sitting Height. Upper tail Mendelian architecture was identified in Body Fat and no complex tail architecture was detected for Neuroticism. These results support evidence from deep sequencing studies that indicate that rare variants play a substantial role in the genetic aetiology for Sitting Height (27, 28) and Heel Bone Mineral Density (29).

Discussion

In this paper, we present a novel approach to infer the genetic architecture of continuous traits, specifically in the tails of their distributions, from sibling trait data alone. Our approach is based on a theoretical framework that we develop, which derives the expected trait distributions of siblings conditional on the trait value of an index-sibling and the trait heritability, assuming polygenicity. The key intuition underlying the approach is that departures from the expected conditional-sibling trait distribution in relation to index-siblings selected from the trait tails may be due to non-polygenic architecture in the tails.

We demonstrate the validity of our conditional-sibling analytical derivations through simulations and show that our tests for identifying de novo and Mendelian architecture in the tails of trait distributions are well-powered when high-impact alleles are present in the population on the order of 1 out of 1000 individuals. We apply our tests to six traits using UK Biobank data and find evidence for de novo architecture in the distribution tails of heel bone mineral density, monocyte count, mean corpuscular hemoglobin and sitting height, as well as Mendelian architecture in the upper tail of body fat.

There are several areas in which our work could have short-term utility. Firstly, those individuals inferred as having rare variants of high-impact could be followed up in multiple ways to gain individual-level insights. For example, they could undergo clinical genetic testing to identify potential pathogenic variants with effects beyond the examined trait, either in the form of diseases or disorders that the individual has already been diagnosed with or else that they have yet to present with but may be at high future risk for. Furthermore, investigation of their environmental risk profile may indicate an alternative - environmental - explanation for their extreme trait value (see below), rather than the rare genetic architecture inferred by our tests.

Our framework could also help to refine the design of sequencing studies for identifying rare variants of large effect. Such studies either sequence entire cohorts at relatively high cost (30) or else perform more targeted sequencing of individuals in the trait tails with the goal of optimising power per cost (31). However, even the latter approach is usually performed blind to evidence of enrichment of rare variant aetiology in the tails. Since our approach enables identification of individuals most likely to harbour rare variants, then these individuals could be prioritised for (deep) sequencing. Moreover, our ability to distinguish between de novo and Mendelian architecture could influence the broad study design, with the former suggesting that a family trio design may be more effective than population sequencing, which may be favoured if Mendelian architecture is inferred. Furthermore, our approach could be applied as a screening step to prioritise those traits, and corresponding tails, most likely to harbour rare variant architecture. Finally, if sequence data have already been collected, either cohort-wide or using a more targeted design, then our approach could be utilised to increase the power of statistical methods for detecting rare variants by upweighting individuals most likely to harbour rare variants.

This study has several limitations. First and foremost, departures from the expected conditional-sibling trait distributions could be due to environmental risk factors, such as medication use or work-related exposures, rather than rare genetic architecture. For this reason, rejections of the null hypothesis from our tests should be considered only as indicating effects consistent with non-polygenic genetic architecture, alongside alternative explanations such as tail-specific environmental risks. We suggest that further investigation of individuals’ clinical, environmental and genetic profiles is required to achieve greater certainty about the causes of their extreme trait values. Nevertheless, given knowledge that rare variants of high-impact contribute to complex trait architecture, we expect that traits for which we infer non-polygenic architecture will, on average, be more enriched for rare architecture in the tail(s) than other traits. Secondly, our modelling assumes that environmental risk factors of siblings are independent of each other. If in fact shared environmental risk factors contribute significantly to trait similarity among siblings, then our heritability estimates will be upwardly biased. However, this would only impact our tests if the degree of shared environmental risk differed in the trait tails relative to the rest of the trait distribution. Moreover, a large meta-analysis of heritability estimates from twin studies (32) concluded that the contribution of shared environment among siblings (even twins) is insubstantial, and so we might expect this to have limited impact on results from our tests. Thirdly, our modelling assumes random mating and so the results from our tests in relation to traits that may be the subject of assortative mating should be considered with caution. Likewise, our modelling assumes additivity of genetic effects and so, while additivity is well-supported by much statistical genetics research (33), results from our tests should be reconsidered for any traits with evidence for significant non-additive genetic effects.

Our approach not only provides a novel way of inferring genetic architecture (without genetic data) but can do so specifically in the tails of trait distributions, which are most likely to harbour complex genetic architecture, due to selection, and are a key focus in biomedical research given their enrichment for disease. This work could also have broader implications in quantitative genetics since we derive fundamental results about the relationship among family members’ complex trait values. The conditional-sibling trait distribution provides a simple way of understanding the expected trait values of individuals according to their sibling’s trait value, which could be used to answer questions of societal importance and inform future research. For example, it can be used to answer questions such as: as a consequence of genetics alone, how much overlap should there be in the traits of offspring of midparents at the 5th and 95th percentile and how does that contrast with what we observe in highly structured societies? Moreover, further development of the theory described here could lead to a range of other applications, for example, estimating levels of assortative mating, inferring historical selection pressures, and quantifying heritability in specific strata of the population.

Acknowledgements

We thank the participants in the UK Biobank and the scientists involved in the construction of this resource for making the sibling data used in this manuscript available. The work in this manuscript has been conducted using the UK Biobank Resource under application 18177 (Dr O’Reilly). We would also like to thank Dr. Peter Visscher for highlighting some important references.

Appendix 1 Derivation of Conditional Sibling Distributions

Here we derive the distribution that describes the probability density of the “conditional” sibling (S2) given the genetic liability, z-value, case-status, or rank of one or more index siblings relative to the population. In each case we assume a population of unrelated parents and rely on the results from the infinitesimal polygenic model that show that within family variance is normally distributed around midparent genetic liability (average of parents) with half the ancestral trait variance even when selection, drift, population structure or dominance effects alter the between family trait distribution (18, 34, 35).

Case 1) Index Liability Known, Continuous Trait (h2 = 1): P(S2 | S1 = s1)

We begin with the simplest case, a polygenic normally distributed trait which is fully heritable (h2 = 1) where the genetic liability of an index sibling in a population is known. Throughout we denote the midparent, index sibling and conditional sibling by M, S1 and S2. We begin by calculating the midparent distribution conditional on index liability using Bayes theorem:

Then using the following identity (36):

We calculate the conditional sibling distribution similarly:

Thus, as predicted by the infinitesimal polygenic model the conditional sibling liability is normally distributed around the midparent liability distribution with additional variance equal to half the population liability variance.

Case 2) Index Trait Value, Continuous Trait (h2 ≠ 1): P(S2 | S1 = s1)

In this case, the primary result considered in this manuscript, a trait z-value, or equivalently, the percentile rank of an index sibling in a genome-wide association study where the rank-based inverse normal transformation has been applied (37), is known. Transformation to a Z distribution (σ2 = 1) means that for heritability h2 the genetic liability and environmental contributions to trait variance are h2 and 1 − h2, respectively. Similar to the previous case we begin by calculating the conditional midparent liability from Bayes theorem:

Then, we again use Equation 11 to derive the conditional sibling distribution:

Case 3) Multiple Index Trait Values, Continuous Trait: P(S3 | S1 = s1, S2 = s2)

The previous case can also be derived using the joint trait distribution for related individuals:

where the covariance G is the genetic relationship matrix. Thus for a sibling pair

As shown by Bernardo and Smith (38), if X is multivariate normal 𝒩(μ, λ−1), where λ = Σ−1 is the precision matrix, and X is partitioned into x1 and x2, with corresponding partitions of μ and λ of:

then the conditional distribution of x1 given x2 is also normal with mean and precision matrix:

Thus, given the joint distribution for three siblings:

The precision matrix λ = Σ−1 is

And we can calculate the conditional distribution of one sibling given two index siblings using Equation 15:
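The two-index-sibling case can be illustrated numerically. The sketch below works with the covariance matrix directly (a Schur complement) rather than the precision-matrix form cited in the text; the two routes give the same conditional distribution. It assumes standardised traits with a correlation of h2/2 between every pair of siblings, and the function name is hypothetical.

```python
def conditional_on_two_siblings(s1, s2, h2):
    """Mean and variance of a third sibling's z-value given two index siblings.

    ASSUMPTION: unit trait variance and pairwise sibling correlation h2 / 2.
    """
    r = h2 / 2.0
    det = 1.0 - r * r                 # determinant of the 2x2 index block
    # Regression weights: cross-covariance [r, r] times inverse of [[1, r], [r, 1]]
    w1 = (r - r * r) / det
    w2 = (r - r * r) / det
    mean = w1 * s1 + w2 * s2
    var = 1.0 - (w1 * r + w2 * r)     # prior variance minus explained variance
    return mean, var


mean3, var3 = conditional_on_two_siblings(3.0, 3.0, h2=1.0)
```

Each weight simplifies to r / (1 + r), so two concordant index siblings pull the expectation further from the mean than one does, and the conditional variance falls below the single-sibling value of 1 − h4/4.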

Case 4) Binary Trait: P(S2 | S1 = Affected)

Here we again assume an underlying distribution that is 𝒩(0, 1) and made up of genetic and environmental components. However, we only know the index sibling’s status, which, as described under the liability threshold model (39), is equivalent to conditioning on the event where the index sibling’s trait value is above or below a z-value threshold T :

where T = Φ−1(1 − K), Φ−1 is the inverse normal cumulative distribution function, and K is the incidence of the binary trait in the population. Thus, the conditional distribution of one sibling given an Affected index sibling can be calculated by integrating over the normal distribution truncated at T :

The first two moments of this truncated normal (40) are:

Approximating the index sibling distribution using a normal whose moments are taken from this truncated distribution, Equation 17 becomes:

which can be solved using the identity given in Equation 11:

Thus, conditional on an Affected sibling, the probability of concordance is:

which is equivalent to Reich’s (41) correction to Falconer’s (42) approximation where the relationship between the relatives is 0.5 for siblings.

The probability of discordance given an Unaffected sibling can be calculated from Bayes' theorem:

which allows the conditional probability of case status to be determined given an index sibling’s status.
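The binary-trait calculation can be illustrated with a short sketch. It assumes the standard moments of a standard normal truncated below at T (mean i = φ(T)/K and variance 1 − i(i − T)), a sibling liability correlation of 0.5 × h2, and the law of total probability for the Unaffected case. The function name and the values K = 0.01 and h2 = 0.5 are illustrative assumptions.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal liability scale

def sibling_risk(K, h2, relatedness=0.5):
    """Approximate P(S2 affected | S1 affected) and P(S2 affected | S1 unaffected)."""
    T = STD.inv_cdf(1.0 - K)            # liability threshold
    i = STD.pdf(T) / K                  # mean index liability given S1 > T
    var_index = 1.0 - i * (i - T)       # variance of the truncated normal
    rho = relatedness * h2              # sibling liability correlation
    mean_sib = rho * i
    var_sib = 1.0 - rho ** 2 + rho ** 2 * var_index
    p_aff_given_aff = 1.0 - STD.cdf((T - mean_sib) / var_sib ** 0.5)
    # Law of total probability: K = P(A|A) K + P(A|U) (1 - K).
    p_aff_given_unaff = K * (1.0 - p_aff_given_aff) / (1.0 - K)
    return p_aff_given_aff, p_aff_given_unaff

K, h2 = 0.01, 0.5
p_aa, p_au = sibling_risk(K, h2)
p_aa_null, p_au_null = sibling_risk(K, 0.0)
print(p_aa, p_au)
```

With h2 = 0 both conditional probabilities collapse to K, which is a convenient sanity check on the approximation.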

Appendix 2 Statistical Tests for Complex Architecture

Here we describe our statistical tests for complex tail architecture. Our tests identify changes in the conditional sibling distribution, relative to polygenic expectation, when ascertaining on an index sibling in the tail. To carry out these tests we establish a null distribution built on the assumption that indexing on siblings not in the tails reduces the likelihood that either sibling phenotype in the pair is driven by rare variants of large effect. We use the region from the 5th to the 95th percentile to estimate heritability. From the n sibling pairs where the index sibling is between the 5th and 95th percentiles, we calculate the conditional likelihood using Equation 14:

and maximize the log-likelihood with respect to h2:

to obtain a maximum likelihood estimate for h2 that is used to define the null distribution for our statistical tests.
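A minimal sketch of this estimation step follows. It assumes the conditional density takes the form S2 | S1 = s1 ~ 𝒩(h2 s1/2, 1 − h4/4), and it swaps in a simple grid search over h2 in place of a numerical optimiser. The function names, simulated data and true h2 = 0.6 are hypothetical.

```python
import numpy as np

def log_likelihood(h2, s1, s2):
    """Conditional log-likelihood of s2 given s1 under the assumed normal form."""
    r = h2 / 2.0
    var = 1.0 - r ** 2
    resid = s2 - r * s1
    return -0.5 * np.sum(np.log(2 * np.pi * var) + resid ** 2 / var)

def estimate_h2(s1, s2, lower=5, upper=95):
    """Grid-search MLE of h2 using pairs whose index sibling is mid-distribution."""
    lo, hi = np.percentile(s1, [lower, upper])
    keep = (s1 >= lo) & (s1 <= hi)
    grid = np.linspace(0.0, 1.0, 1001)
    ll = [log_likelihood(h, s1[keep], s2[keep]) for h in grid]
    return float(grid[int(np.argmax(ll))])

# Simulated polygenic sibling pairs (illustrative values only).
rng = np.random.default_rng(1)
true_h2, n = 0.6, 50_000
r = true_h2 / 2
s1 = rng.normal(size=n)
s2 = r * s1 + rng.normal(0.0, np.sqrt(1 - r ** 2), n)

h2_hat = estimate_h2(s1, s2)
print(h2_hat)
```

Because the likelihood is conditional on the index sibling's value, restricting to index siblings in the central region does not bias the estimate under the polygenic model.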

Statistical Test for De Novo Architecture

We identify de novo mutations of large effect by testing for discordance between sibs relative to the polygenic null, using the conditional distribution of a sib given an index sib (Equation 14). Since de novo mutations typically result in trait values in the tail of the distribution, the test conditions on those index sibs in a specified upper quantile q of the distribution, i.e. those sib pairs such that S1 > Φ−1(q), defined as the set Aq. We introduce an additional parameter α, where values of α < 0 in the right tail and α > 0 in the left tail are indicative of discordant sibs with trait values closer to the mean, giving a log-likelihood:

The null hypothesis H0 : α = 0 is tested via a score test:

And the score test for H0 : α = 0:
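To make the mechanics of such a score test concrete, the sketch below assumes, purely for illustration, that α enters as an additive shift in the conditional mean, so that S2 | S1 = s1 ~ 𝒩(h2 s1/2 + α, 1 − h4/4). The manuscript's own parameterisation is not reproduced in this text and may differ. The function name, the simulated de novo effect of 4 standard deviations and its 0.5% frequency are all hypothetical.

```python
from statistics import NormalDist
import numpy as np

def denovo_score_test(s1, s2, h2, q=0.99):
    """One-sided score test of alpha = 0 in the upper tail (assumed mean-shift model)."""
    s1, s2 = np.asarray(s1, float), np.asarray(s2, float)
    r = h2 / 2.0
    var = 1.0 - r ** 2
    tail = s1 > NormalDist().inv_cdf(q)      # index sibs in the set A_q
    n = int(tail.sum())
    if n == 0:
        raise ValueError("no index siblings above the chosen quantile")
    resid = s2[tail] - r * s1[tail]
    score = resid.sum() / var                # d logL / d alpha at alpha = 0
    info = n / var                           # Fisher information for alpha
    z = float(score / np.sqrt(info))
    p = NormalDist().cdf(z)                  # small when siblings fall below expectation
    return z, p

rng = np.random.default_rng(3)
h2, n = 0.6, 200_000
r = h2 / 2
s1 = rng.normal(size=n)
s2 = r * s1 + rng.normal(0.0, np.sqrt(1 - r ** 2), n)
z_null, p_null = denovo_score_test(s1, s2, h2)

# Add a large unshared (de novo-like) effect to 0.5% of index siblings.
s1_dn = s1.copy()
carriers = rng.random(n) < 0.005
s1_dn[carriers] += 4.0
z_dn, p_dn = denovo_score_test(s1_dn, s2, h2)
print(z_null, z_dn)
```

In this toy setting the carriers' siblings regress far more strongly towards the mean than the polygenic model predicts, which drives the statistic strongly negative.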

Statistical Test for Mendelian Architecture

Here we test for excess concordance between sibs in the tails of the distribution by testing for an excess number of observed siblings in the tail S2 > Φ−1(q) given the index sib is in the tail S1 > Φ−1(q), where q is the quantile of interest. Denoting the set of index sibs in the tail by Aq and the size of the set by n, under the null of polygenicity we calculate the probability that the conditional sibling exceeds Φ−1(q) from the normal cdf, and compute the mean across index sibs to define the expected concordance under polygenicity, π0:

Denoting the observed concordance (number of sibling pairs both > Φ−1(q)) by r, the binomial log-likelihood (ignoring the constant) is:

Assuming r = such that I is not a function of any particular observation:

And the score test for H0 : π = π0:
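The concordance test can be sketched with the standard binomial score statistic, Z = (r − nπ0) / √(nπ0(1 − π0)). The sketch assumes the polygenic null S2 | S1 = s1 ~ 𝒩(h2 s1/2, 1 − h4/4) when computing π0. The function name and the simulated shared variant (effect of 4 standard deviations, segregating in 0.5% of families) are hypothetical.

```python
from statistics import NormalDist
import numpy as np

STD = NormalDist()

def mendelian_score_test(s1, s2, h2, q=0.99):
    """Score test for excess tail concordance relative to the polygenic null."""
    s1, s2 = np.asarray(s1, float), np.asarray(s2, float)
    t = STD.inv_cdf(q)
    rho = h2 / 2.0
    sd = (1.0 - rho ** 2) ** 0.5
    tail = s1 > t                                  # index sibs in the set A_q
    n = int(tail.sum())
    if n == 0:
        raise ValueError("no index siblings above the chosen quantile")
    # Expected concordance: mean tail probability of the conditional sibling.
    pi0 = float(np.mean([1.0 - STD.cdf((t - rho * x) / sd) for x in s1[tail]]))
    r_obs = int(np.sum(s2[tail] > t))              # observed concordant pairs
    z = (r_obs - n * pi0) / (n * pi0 * (1.0 - pi0)) ** 0.5
    return z, 1.0 - STD.cdf(z), pi0

rng = np.random.default_rng(4)
h2, n = 0.6, 200_000
rho = h2 / 2
s1 = rng.normal(size=n)
s2 = rho * s1 + rng.normal(0.0, np.sqrt(1 - rho ** 2), n)
z_null, p_null, pi0_null = mendelian_score_test(s1, s2, h2)

# A large-effect variant segregating in 0.5% of families, each sib inheriting with p = 1/2.
family = rng.random(n) < 0.005
s1_m = s1 + 4.0 * (family & (rng.random(n) < 0.5))
s2_m = s2 + 4.0 * (family & (rng.random(n) < 0.5))
z_mend, p_mend, pi0_mend = mendelian_score_test(s1_m, s2_m, h2)
print(z_null, z_mend)
```

Pairs in which both siblings inherit the variant land together in the tail more often than the polygenic null allows, which inflates the observed concordance r above nπ0.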

Appendix 3 Model Evaluation

Here we compare our theoretical derivations (Equation 4) that rely on the infinitesimal polygenic model (18) with simulated offspring data (Figure 7). We also compare our model to our empirical simulation (see Model & Methods) that draws allele frequencies and effect sizes from publicly available GWAS data (43) for two traits to produce parent and offspring genotype and genetic liability (equivalent to trait value when h2 = 1) (Figure 8).

Theoretical and simulated conditional expectation and variance in liability (z-score) and rank across index sibling percentiles for conditional siblings, midparents and index siblings. The simulation drew one thousand parent liability values from 𝒩(0, 1); these were randomly paired to produce midparents with liability mi, and two offspring were subsequently drawn from 𝒩(mi, 1/2) and randomly assigned as index and conditional siblings.

For both Fluid Intelligence and Standing Height, GWAS variants (on chromosome one) were used to simulate parent and offspring genotypes and liability values. Plots show that for both traits the offspring distribution is normal and that the sibling distribution is multivariate normal, in line with our theoretical prediction.

These tests demonstrate that our theoretical framework accurately reflects an additive polygenic trait in an outcrossing population. Additionally, these results demonstrate that deviations in the conditional sibling distribution can be interpreted as non-polygenic architecture or quantile-specific environmental effects.
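A genotype-level version of this kind of check can be sketched as follows, for the h2 = 1 case. The allele frequencies and effect sizes here are randomly generated placeholders rather than the GWAS-derived values used in the manuscript, loci are assumed independent, and all names and sizes are hypothetical. The sketch looks at two quantities: the sibling liability correlation (expected to be near 0.5) and the offspring variance around the midparent (expected to be near half the parental variance).

```python
import numpy as np

rng = np.random.default_rng(2)
n_fam, n_loci = 5000, 200
freq = rng.uniform(0.05, 0.95, n_loci)   # placeholder allele frequencies
beta = rng.normal(0.0, 1.0, n_loci)      # placeholder per-allele effects

def draw_parents(n):
    """Two haplotypes per parent, alleles drawn independently at each locus."""
    return (rng.random((n, 2, n_loci)) < freq).astype(np.int8)

def gamete(parent):
    """Transmit one of the two parental alleles at each locus with probability 1/2."""
    pick = rng.integers(0, 2, size=(parent.shape[0], n_loci))
    return np.where(pick == 0, parent[:, 0, :], parent[:, 1, :])

def child(mother, father):
    """Offspring genotype (0, 1 or 2 copies per locus)."""
    return gamete(mother) + gamete(father)

mother, father = draw_parents(n_fam), draw_parents(n_fam)
raw_m = mother.sum(axis=1) @ beta
raw_f = father.sum(axis=1) @ beta

# Standardise all liabilities on the parental scale.
parents = np.concatenate([raw_m, raw_f])
mu, sd = parents.mean(), parents.std()
lm, lf = (raw_m - mu) / sd, (raw_f - mu) / sd
c1 = (child(mother, father) @ beta - mu) / sd
c2 = (child(mother, father) @ beta - mu) / sd

midparent = (lm + lf) / 2
sib_corr = float(np.corrcoef(c1, c2)[0, 1])
seg_var = float(np.var(c1 - midparent))  # variance around the midparent
print(sib_corr, seg_var)
```

Linkage between loci, assortative mating and non-additive effects are all ignored here, so the sketch only illustrates the additive, outcrossing case.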