Introduction

The fields of quantitative genetics and genetic epidemiology have exploited the shared genetics and environment of siblings in many applications, notably to estimate heritability, using theory developed a century ago (1, 2), and more recently to infer a so-called ‘household effect’ (3), which contributes to genetic risk indirectly via correlation between genetics and household environment. Here we leverage sibling trait data to infer genetic architecture departing from polygenicity, specifically affecting the tails of the trait distribution and consistent with enrichment in rare variants of large effect.

The genetic architecture of a complex trait is typically inferred from findings of multiple independent studies: genome-wide association studies (GWAS) identifying common variants (4, 5), whole exome or whole genome sequencing studies detecting rare variants (6), and family sequencing studies designed to identify de novo and rare Mendelian mutations (7). The relative contribution of each type of variant to trait heritability is a function of historical selection pressures on the trait in the population (8, 9). If selection has recently acted to increase the average value of a trait, then the lower tail of the trait distribution may become enriched for large effect rare variants over time because trait-reducing alleles will be subject to negative selection, and so those with large effects - most likely to ‘push’ individuals to the lower tail - will be reduced to low frequencies (10). Similarly, if the trait is subject to stabilising selection, then both tails of the trait distribution may be enriched for rare variant aetiology (11). This can result in less accurate polygenic scores in the tails of the trait distribution (12), and can also produce dissimilarity between siblings beyond what is expected under polygenicity. For example, studies on the intellectual ability of sibling pairs have demonstrated similarity for average intellectual ability (13), regression-to-the-mean for siblings at the upper tail of the distribution (13), and complete discordance when one sibling is at the lower extreme tail of the distribution (14). However, no theoretical framework has been developed to formally infer genetic architecture from sibling trait data.

We introduce a novel theoretical framework that allows widely available sibling trait data in population cohorts and health registries to be leveraged to perform statistical tests that can infer complex genetic architecture in the tails of the trait distribution. These tests are powered to differentiate between polygenic, de novo and Mendelian (i.e. rare variants of large effect) architectures (see Figure 1); while these are simplifications of true complex architectures, our tests allow enrichments of the different architectures to be inferred and compared across traits. This framework establishes expectations about the trait distributions of siblings of index individuals with extreme trait values (e.g. the top 1% of the trait) according to null assumptions of polygenicity, random mating and environmental factors that are either controlled for or have combined Gaussian effects. Critical to our framework is our derivation of the “conditional-sibling trait distribution”, which describes the trait distribution for one individual given the quantile value of one or more “index” siblings. Our statistical framework, derivation of the conditional-sibling trait distribution, and simulation study, allow us to develop statistical tests to infer genetic architecture from data on siblings (15) without the need for genetic data (see Model & Methods). We validate the statistical power of our tests using simulated data and apply our tests across a range of traits from the UK Biobank. We also release sibArc, an open-source software tool that can be used to apply our tests to sibling trait data (16). Our novel framework can be extended to applications such as estimating heritability, characterising mating patterns, and inferring historical selection pressures.

Figure 1. Sibling Similarity Under Different Tail Architectures.

Left to right: When an individual’s extreme trait value (top 1%) is due to many alleles of small effect (“Polygenic”), then their siblings’ trait values are expected to show regression-to-the-mean (grey). When an individual’s extreme trait value is due to a de novo mutation of large effect, then their siblings are expected to have trait values that correspond to the background distribution (green). When an individual’s extreme trait value is the result of an inherited rare allele of large effect (“Mendelian”), then their siblings are expected to have either similarly extreme trait values or trait values that are drawn from the background distribution (red), depending on whether or not they inherited the same large effect allele.

Model & Methods

Here we outline a framework that models the conditional-sibling trait distribution, which describes an individual’s trait distribution conditioned on an index sibling’s trait value. In this section, we describe our derivation of the distribution for a completely polygenic trait, outline development of statistical tests that infer enrichments of rare genetic architectures via departures from polygenicity, and describe the simulation study used to validate and benchmark our results. Full derivations that complement this high-level summary of our methodology, as well as extended results from application of our tests to UK Biobank data, are included in the Appendices.

Conditional-Sibling Inference Framework

Assuming completely polygenic (i.e. infinitesimal) architecture, siblings of individuals with extreme trait values are expected to be less extreme. This regression-to-the-mean can be understood through the two factors (17, 18) that determine inherited genetic variation: (i) the average genetic contribution of the parents (mid-parent) to the trait value, and (ii) random genetic reassortment occurring during meiosis. Assuming that an individual presenting an upper-tail trait value does so due to both a high mid-parent trait value and genetic reassortment favouring a higher value, then a sibling sharing the same mid-parent trait value, but subject to independent reassortment, is likely to be less extreme. How much less extreme can be derived by first considering the conditional distribution that relates mid-parents and their offspring. In the simplest case, for a completely heritable continuous polygenic trait in a large population, the offspring trait values, $s$, are normally distributed around their mid-parent trait value, $m$, irrespective of selection and population structure (19, 20):

$$ s \mid m \sim \mathcal{N}\left(m,\ \sigma_w^2\right) \tag{1} $$

where $\sigma_w^2$ represents within-family variance, which is constant across the population. This result does not require the trait distribution across the population to be Gaussian. If we assume that the population distribution is Gaussian and there is random mating and no selection, then $\sigma_w^2$ is half the population variance (19, 21, 22). For a trait with heritability $h^2$, the trait variance, $\sigma^2$, can be partitioned into genetic, $\sigma_g^2$, and environmental, $\sigma_e^2$, contributions, such that $\sigma_g^2 = h^2\sigma^2$ and $\sigma_e^2 = (1-h^2)\sigma^2$. Assuming mean-centred genetic and environmental trait contributions, and a trait variance of 1 (a standard normalised trait), the distribution for offspring conditioned on the mid-parent genetic trait value, $m_g$, is:

$$ s \mid m_g \sim \mathcal{N}\left(m_g,\ 1-\tfrac{h^2}{2}\right) \tag{2} $$

Since $p(m_g) = \mathcal{N}(0, h^2/2)$, from Bayes’ Theorem it can be shown that the mid-parent genetic trait value conditional on offspring trait value, $s$, is distributed as follows:

$$ m_g \mid s \sim \mathcal{N}\left(\tfrac{h^2}{2}s,\ \tfrac{h^2}{2}\left(1-\tfrac{h^2}{2}\right)\right) \tag{3} $$

From Equations 2 & 3, the sibling trait distribution conditional on an index sibling can be calculated as:

$$ s_2 \mid s_1 \sim \mathcal{N}\left(\tfrac{h^2}{2}s_1,\ 1-\tfrac{h^4}{4}\right) \tag{4} $$

A full derivation is provided in Appendix 1. This derivation via mid-parents can be generalised to cases where assumptions of random mating, Gaussian population trait distribution and no selection do not hold. When these assumptions do hold, then the conditional sibling distribution can also be derived from the joint sibling distribution defined by the relationship matrix (23). In Appendix 1, we use the joint sibling distribution to derive the distribution conditional on two sibling values. From Equation 4, we can infer that under complete heritability, the siblings of an individual with a standard normal trait value of $z$ will have a mean trait value of $z/2$, with variance equal to three quarters of the population variance. In Appendix 1, we further generalise the result to binary phenotypes, increasing the utility of this framework for further applications and theoretical development.
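As an illustration only, the conditional-sibling trait distribution can be sketched in a few lines of Python. The sketch assumes the conditional distribution is normal with mean $h^2 s_1/2$ and variance $1 - h^4/4$ (which gives the three-quarters variance under complete heritability); the function name `conditional_sibling` is hypothetical and is not part of sibArc.

```python
from statistics import NormalDist


def conditional_sibling(s1, h2):
    """Distribution of a sibling's z-value given an index sibling z-value s1.

    Assumes a standardised trait, polygenic architecture, random mating and
    independent environments; h2 is the trait heritability.
    """
    if not 0.0 <= h2 <= 1.0:
        raise ValueError("h2 must lie in [0, 1]")
    mean = h2 * s1 / 2.0
    variance = 1.0 - h2 ** 2 / 4.0
    return NormalDist(mu=mean, sigma=variance ** 0.5)


# Index sibling at the 99th percentile of a fully heritable trait
z99 = NormalDist().inv_cdf(0.99)
dist = conditional_sibling(z99, h2=1.0)
print(round(dist.mean, 3), round(dist.variance, 3))
```

Returning a `NormalDist` object keeps quantiles and tail probabilities of the conditional sibling one method call away.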

Statistical Tests for Complex Tail Architecture

The strategy employed to develop statistical tests for complex tail architecture is depicted in Figure 2. Our approach corresponds to testing for deviations from the expected conditional-sibling trait distribution under the null hypothesis of complete polygenicity in trait tails (see above for other null assumptions): excess discordance is indicative of an enrichment of de novo mutations, while excess concordance indicates an enrichment of Mendelian variants, i.e. large effect variants segregating in the population. The heritability, $h^2$, required to define the null distribution, is estimated by maximising the log-likelihood of the conditional-sibling trait distribution (Equation 4) with respect to $h^2$:

$$ \hat{h}^2 = \underset{h^2}{\arg\max} \sum_{i=1}^{n} \log \mathcal{N}\left(s_{2,i};\ \tfrac{h^2}{2}s_{1,i},\ 1-\tfrac{h^4}{4}\right) \tag{5} $$

where $s_{1,i}$ and $s_{2,i}$ represent the index and conditional-sibling trait values of the $i$-th pair, respectively, and $n$ is the number of sibling pairs. This allows $h^2$ to be estimated for given quantiles of the trait distribution by restricting sibling pair observations to those index siblings in the quantile of interest. To maximise power to detect non-polygenic architecture in the tails of the trait distribution, we estimate “polygenic heritability” from sibling pairs for which the index sibling trait value is between the 5th and 95th percentile (labelled “Distribution Body” in Figure 2).

Figure 2. Identifying Complex Tail Genetic Architecture.

Conditional-sibling z-values plotted against index-sibling quantiles. Grey depicts complete polygenic architecture across index-sibling values. In the lower tail, an extreme scenario of de novo architecture is shown in green, resulting in sibling discordance. In the upper tail, extreme Mendelian architecture is shown in red, whereby siblings are half concordant and half discordant, resulting in a bimodal conditional-sibling trait distribution. Statistical tests to infer each type of complex tail architecture are designed to exploit these expected trait distributions.

Tests for complex architecture are then performed in relation to index siblings whose trait values are in the tails of the distribution (e.g. the lower and upper 1%). Below, $A_q$ denotes the set of sibling pairs for which the index sibling is in quantile $q$, such that $s_1 > \Phi^{-1}(1-q)$ and $s_1 < \Phi^{-1}(q)$ for the upper and lower tails, respectively, where $\Phi^{-1}$ is the inverse normal cumulative distribution function.
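The selection of tail and body sibling pairs described above can be sketched as follows. This is a minimal illustration assuming standardised trait values stored as `(s1, s2)` tuples; the helper names `tail_pairs` and `body_pairs` are hypothetical.

```python
from statistics import NormalDist


def tail_pairs(pairs, q, tail):
    """Return the pairs (s1, s2) whose index sibling s1 lies in the given tail."""
    nd = NormalDist()
    if tail == "upper":
        threshold = nd.inv_cdf(1.0 - q)
        return [(s1, s2) for s1, s2 in pairs if s1 > threshold]
    if tail == "lower":
        threshold = nd.inv_cdf(q)
        return [(s1, s2) for s1, s2 in pairs if s1 < threshold]
    raise ValueError("tail must be 'upper' or 'lower'")


def body_pairs(pairs, lower=0.05, upper=0.95):
    """Return the pairs whose index sibling lies in the distribution body."""
    nd = NormalDist()
    lo, hi = nd.inv_cdf(lower), nd.inv_cdf(upper)
    return [(s1, s2) for s1, s2 in pairs if lo < s1 < hi]


example = [(-3.0, 0.1), (0.0, 0.2), (2.5, 1.0), (2.0, 0.3)]
print(tail_pairs(example, 0.01, "upper"))
print(body_pairs(example))
```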

It should be noted that our estimate of $h^2$ in Equation 5 assumes no effects of shared environment. Polderman and colleagues (24) found limited contribution of shared environment for most complex traits and, critically, our statistical tests are robust to shared environmental effects that are consistent throughout the trait distribution (see Discussion).
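A minimal sketch of the heritability estimate of Equation 5 is given below. It assumes the Gaussian conditional likelihood with mean $h^2 s_1/2$ and variance $1 - h^4/4$, and replaces a numerical optimiser with a simple grid search over $[0, 1]$; the function names, the grid search and the simulated data are illustrative choices, not the sibArc implementation.

```python
import random
from math import log


def estimate_h2(pairs, grid_size=1000):
    """Grid-search maximum-likelihood estimate of h2 from (s1, s2) pairs."""
    n = len(pairs)
    # Sufficient statistics, so each likelihood evaluation is O(1)
    s11 = sum(s1 * s1 for s1, _ in pairs)
    s12 = sum(s1 * s2 for s1, s2 in pairs)
    s22 = sum(s2 * s2 for _, s2 in pairs)

    def neg_log_lik(h2):
        slope = h2 / 2.0
        var = 1.0 - h2 ** 2 / 4.0
        rss = s22 - 2.0 * slope * s12 + slope * slope * s11
        return 0.5 * n * log(var) + rss / (2.0 * var)

    candidates = [k / grid_size for k in range(grid_size + 1)]
    return min(candidates, key=neg_log_lik)


def simulate_pair(h2, rng):
    """Toy sibling pair: shared mid-parent value plus independent residuals."""
    shared = rng.gauss(0.0, (h2 / 2.0) ** 0.5)
    resid_sd = (1.0 - h2 / 2.0) ** 0.5
    return shared + rng.gauss(0.0, resid_sd), shared + rng.gauss(0.0, resid_sd)


rng = random.Random(1)
pairs = [simulate_pair(0.6, rng) for _ in range(20000)]
print(estimate_h2(pairs))
```

Restricting `pairs` to index siblings in the distribution body before calling `estimate_h2` would give the "polygenic heritability" used to define the null.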

Statistical Test for De Novo Architecture

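For inference of de novo architecture in the tails of the trait distribution, we introduce a parameter, $\alpha$, to the log-likelihood defined by the conditional-sibling trait distribution (Equation 5):

$$ \ell(\alpha) = \sum_{i \in A_q} \log \mathcal{N}\left(s_{2,i};\ \alpha + \tfrac{h^2}{2}s_{1,i},\ 1-\tfrac{h^4}{4}\right) \tag{6} $$

where $h^2$ is fixed at the polygenic heritability estimated from the distribution body. Values of $\alpha > 0$ in the lower tail and $\alpha < 0$ in the upper tail indicate excess regression-to-the-mean and, thus, high sibling discordance, consistent with an enrichment of de novo mutations among the index siblings. The z-statistic of the one-sided score test for $\alpha > 0$ in the lower quantile, $q$, relative to the null of $\alpha = 0$ is (see Appendix 2 for derivation):

$$ z = \frac{\sum_{i \in A_q}\left(s_{2,i} - \tfrac{h^2}{2}s_{1,i}\right)}{\sqrt{n\left(1-\tfrac{h^4}{4}\right)}} \tag{7} $$

where $n$ is the number of sibling pairs in $A_q$. For the upper tail test of $\alpha < 0$, the above is multiplied by $-1$.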
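The de novo test can be sketched as below, under the assumption that $\alpha$ enters as an additive shift of the conditional-sibling mean, so that the score statistic is the sum of residuals $s_2 - h^2 s_1/2$ divided by $\sqrt{n(1 - h^4/4)}$. The function names are hypothetical and the input is assumed to be already restricted to the tail of interest.

```python
from math import sqrt
from statistics import NormalDist


def de_novo_z(tail_pairs, h2, lower_tail=True):
    """Score-test z-statistic for excess regression-to-the-mean in one tail.

    tail_pairs: (s1, s2) pairs whose index sibling s1 is in the tail.
    h2: polygenic heritability estimated from the distribution body.
    """
    n = len(tail_pairs)
    residual_sum = sum(s2 - h2 * s1 / 2.0 for s1, s2 in tail_pairs)
    z = residual_sum / sqrt(n * (1.0 - h2 ** 2 / 4.0))
    # In the upper tail, de novo architecture pulls siblings downwards
    return z if lower_tail else -z


def one_sided_p(z):
    """Upper-tail p-value for a standard normal z-statistic."""
    return 1.0 - NormalDist().cdf(z)


lower = [(-3.0, 0.25), (-2.5, 0.375)]  # siblings far less extreme than expected
z = de_novo_z(lower, h2=0.5)
print(z, one_sided_p(z))
```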

Statistical Test for Mendelian Architecture

For inference of Mendelian architecture in the tails of the trait distribution, we compare the observed and expected tail sibling concordance, defined by the number of sibling pairs for which both siblings have trait values in the tail. For each index sibling $i$ in $A_q$, we calculate the probability, $\pi_i$, that the conditional sibling is also in the tail, which, for the upper tail, is given by:

$$ \pi_i = 1 - \Phi\left(\frac{\Phi^{-1}(1-q) - \tfrac{h^2}{2}s_{1,i}}{\sqrt{1-\tfrac{h^4}{4}}}\right) \tag{8} $$

where $\Phi$ represents the normal cumulative distribution function. Denoting the mean of $\pi_i$ across all index siblings in $A_q$ by $\pi_o$, the expected sibling concordance is $n\pi_o$, where $n$ is the number of index siblings in $A_q$. Given an observed number of concordant siblings, $r$, the z-statistic for a one-sided score test for excess concordance is (see Appendix 2 for derivation) given by:

$$ z = \frac{r - n\pi_o}{\sqrt{n\pi_o(1-\pi_o)}} \tag{9} $$
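A sketch of the concordance test follows. It assumes the per-pair tail probability is obtained from the polygenic conditional-sibling distribution (mean $h^2 s_1/2$, variance $1 - h^4/4$) and that the statistic takes the usual binomial score form $(r - n\pi_o)/\sqrt{n\pi_o(1-\pi_o)}$; the function name and the toy data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean


def mendelian_z(tail_pairs, h2, q, upper=True):
    """z-statistic for excess sibling concordance in one tail.

    tail_pairs: (s1, s2) pairs whose index sibling s1 is already in the tail.
    """
    nd = NormalDist()
    sd = sqrt(1.0 - h2 ** 2 / 4.0)
    threshold = nd.inv_cdf(1.0 - q) if upper else nd.inv_cdf(q)
    probs, r = [], 0
    for s1, s2 in tail_pairs:
        below = nd.cdf((threshold - h2 * s1 / 2.0) / sd)
        probs.append(1.0 - below if upper else below)  # expected concordance
        r += (s2 > threshold) if upper else (s2 < threshold)  # observed
    n = len(tail_pairs)
    pi_o = fmean(probs)
    return (r - n * pi_o) / sqrt(n * pi_o * (1.0 - pi_o))


# 10 concordant pairs out of 100, where roughly 5 are expected under polygenicity
toy = [(3.0, 3.0)] * 10 + [(3.0, 0.0)] * 90
print(mendelian_z(toy, h2=0.5, q=0.01))
```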

Simulation of Conditional Sibling Data

We perform simulations using publicly available GWAS data on multiple traits to validate our analytical model and to benchmark our tests for complex architecture. Figure 3 depicts the different stages of our simulation procedure. We start by simulating a “parent population” (step A), utilising the allele frequencies and effect sizes of the first 100k SNPs from a trait GWAS to sample genotypes and subsequently trait values assuming an additive model. Next, parents are randomly paired and their genetic trait values averaged to produce mid-parent trait values (step B), and genotypes of two offspring (Equation 1) and corresponding genetic trait values, G, are calculated (step C) assuming independent reassortment of parental alleles and unlinked SNPs.

Figure 3. Simulation Schematic.

Publicly available GWAS allele frequency and effect size data is used to simulate parent genetic trait value (A). Midparent genetic trait value (B) is simulated assuming random mating. Offspring genotype and genetic trait value (C) is simulated assuming complete recombination. Environmental variation (D) is added to compare with theoretical polygenic conditional sibling distribution. De novo and Mendelian rare-variant effects are simulated (E) to benchmark tests for complex architecture (F).

In step D, we generate offspring trait values for different degrees of heritability by adding a Gaussian environmental effect. For heritability $h^2$, offspring trait values are given by $T = hG + E$, where $G$ is the genetic effect, standardised to have mean 0 and variance 1, and the environmental effect $E$ is drawn from a normal distribution with mean 0 and variance $1 - h^2$. The simulated trait therefore has a $\mathcal{N}(0, 1)$ distribution.
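Steps A to D can be sketched at toy scale as follows. The sketch uses randomly generated allele frequencies and effect sizes in place of real GWAS summary statistics and far fewer SNPs than the 100k described above; all names are hypothetical.

```python
import random
from math import sqrt
from statistics import fmean, pstdev


def simulate_sibling_pairs(freqs, effects, n_families, h2, seed=0):
    """Simulate sibling trait pairs under an additive model with unlinked SNPs."""
    rng = random.Random(seed)

    def sample_parent():
        # Step A: two independent alleles per SNP drawn at the allele frequency
        return [(int(rng.random() < p), int(rng.random() < p)) for p in freqs]

    def sample_offspring(mother, father):
        # Step C: each parent transmits one of its two alleles at random
        return sum(beta * (rng.choice(m) + rng.choice(f))
                   for beta, m, f in zip(effects, mother, father))

    genetic = []
    for _ in range(n_families):
        mother, father = sample_parent(), sample_parent()  # Step B: random mating
        genetic.append((sample_offspring(mother, father),
                        sample_offspring(mother, father)))

    # Step D: standardise G, then T = h * G + E with Var(E) = 1 - h2
    values = [g for pair in genetic for g in pair]
    mu, sd = fmean(values), pstdev(values)
    h, e_sd = sqrt(h2), sqrt(1.0 - h2)

    def trait(g):
        return h * (g - mu) / sd + rng.gauss(0.0, e_sd)

    return [(trait(g1), trait(g2)) for g1, g2 in genetic]


def pearson(pairs):
    """Pearson correlation between the two members of each pair."""
    xs, ys = [a for a, _ in pairs], [b for _, b in pairs]
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


toy_rng = random.Random(42)
freqs = [toy_rng.uniform(0.05, 0.95) for _ in range(50)]  # toy allele frequencies
effects = [toy_rng.gauss(0.0, 1.0) for _ in range(50)]    # toy effect sizes
sim_pairs = simulate_sibling_pairs(freqs, effects, 2000, h2=0.8, seed=1)
print(pearson(sim_pairs))  # expected to be close to h2 / 2
```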

In step E, we simulate the effect of complex tail architecture on the conditional-sibling trait distribution. We assume that rare variants are sufficiently penetrant to move individuals into the tails of the distribution, independent of their polygenic contribution. We, thus, modify sibling trait values for individuals already in the tails (from Step D) to minimise perturbation of the trait distribution. We simulate de novo tail architecture by resampling the less extreme sibling from the background distribution, and simulate Mendelian tail architecture by resampling the less extreme sibling from the background distribution with probability 0.5, and from the same tail as the extreme sibling with probability 0.5.
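The resampling in step E can be sketched as below for the upper tail; the lower tail follows by symmetry after negating trait values. The carrier probability parameter, the function name and the choice to sample "the same tail" from a standard normal truncated at the tail threshold are assumptions made for illustration.

```python
import random
from statistics import NormalDist


def apply_upper_tail_architecture(pairs, q, mode, carrier_prob, seed=0):
    """Resample the less extreme sibling of pairs with a sibling in the upper tail."""
    if mode not in ("de_novo", "mendelian"):
        raise ValueError("mode must be 'de_novo' or 'mendelian'")
    rng = random.Random(seed)
    nd = NormalDist()
    threshold = nd.inv_cdf(1.0 - q)
    result = []
    for s1, s2 in pairs:
        if max(s1, s2) > threshold and rng.random() < carrier_prob:
            if mode == "mendelian" and rng.random() < 0.5:
                # Sibling inherited the same allele: draw from the same tail
                p = min(1.0 - q * (1.0 - rng.random()), 1.0 - 1e-12)
                new_value = nd.inv_cdf(p)
            else:
                # De novo, or Mendelian allele not inherited: background N(0, 1)
                new_value = rng.gauss(0.0, 1.0)
            # Keep the more extreme sibling and replace the other in place
            s1, s2 = (s1, new_value) if s1 >= s2 else (new_value, s2)
        result.append((s1, s2))
    return result


demo = [(0.0, 0.5), (3.0, 0.1), (-0.2, 2.9)]
print(apply_upper_tail_architecture(demo, q=0.01, mode="de_novo", carrier_prob=1.0))
```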

Application to UK Biobank Data

The UK Biobank includes data from over 21,000 siblings (25). To apply our tests to this dataset we began by identifying continuous traits (at least 50 unique values) with at least 5,000 sibling pairs, as defined by a kinship coefficient of 0.18 - 0.35 and > 0.1% of SNPs with IBS0 to distinguish them from parent-offspring pairs (26, 27). After removing outliers with absolute trait value > 8 standard deviations from the mean, we removed all traits with absolute skew or excess kurtosis greater than 0.5 to reduce the likelihood that skewed or heavy-tailed trait distributions impact our statistical tests. The remaining traits were standardised using rank-based inverse normal transformation and adjusted for age, sex, recruitment centre, batch covariates and the first 40 principal components. After this, to ensure a primarily additive polygenic aetiology, we required that traits have heritability (28) and that no single SNP contribute more than 0.01 to $h^2$.
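The shape filter and rank-based inverse normal transformation can be sketched as follows. The rank offset `c` (0.5 here), the arbitrary ordering of tied values and the application of the absolute value to both skew and excess kurtosis are assumptions of this sketch; the text does not specify them.

```python
from statistics import NormalDist, fmean


def rank_inverse_normal(values, c=0.5):
    """Map values to normal scores via their ranks (ties keep input order)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    nd = NormalDist()
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scores[idx] = nd.inv_cdf((rank - c) / (n - 2.0 * c + 1.0))
    return scores


def skew_and_excess_kurtosis(values):
    """Moment-based sample skewness and excess kurtosis."""
    n = len(values)
    mu = fmean(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def passes_shape_filter(values, limit=0.5):
    """True if both |skew| and |excess kurtosis| are within the limit."""
    skew, kurt = skew_and_excess_kurtosis(values)
    return abs(skew) <= limit and abs(kurt) <= limit


print(rank_inverse_normal([3.0, 1.0, 2.0]))
```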

We applied our method to estimate heritability from sibling pairs (Equation 5) across the distribution and within the trait body (5th to 95th percentiles) on the remaining traits, and selected the eighteen traits for which both measures exceeded 30% for further analysis. For each of these eighteen traits, siblings were randomly assigned index and conditional status and both trait tails were tested for departures from polygenicity using our De Novo and Mendelian tests, as well as a general Kolmogorov-Smirnov test (29) to identify departures from the conditional sibling distribution assuming polygenicity (Appendix 2).
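A general goodness-of-fit check of this kind can be sketched with a one-sample Kolmogorov-Smirnov statistic. The sketch assumes that conditional-sibling values are first standardised by the polygenic conditional distribution (mean $h^2 s_1/2$, variance $1 - h^4/4$) and uses the asymptotic Kolmogorov series for the p-value; it is not necessarily how the test in Appendix 2 is implemented, and the function names are hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist


def ks_statistic(pairs, h2):
    """KS distance between standardised sibling residuals and N(0, 1)."""
    nd = NormalDist()
    sd = sqrt(1.0 - h2 ** 2 / 4.0)
    # Probability integral transform: uniform on (0, 1) under the null
    u = sorted(nd.cdf((s2 - h2 * s1 / 2.0) / sd) for s1, s2 in pairs)
    n = len(u)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(u))


def ks_pvalue(d, n, terms=100):
    """Asymptotic two-sided p-value for KS distance d with n observations."""
    lam = sqrt(n) * d
    if lam < 0.2:
        return 1.0  # the series converges slowly here; the p-value is ~1
    total = sum((-1) ** (k - 1) * exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


# Toy data: residuals placed exactly at evenly spaced normal quantiles
sd_null = sqrt(1.0 - 0.5 ** 2 / 4.0)
null_pairs = [(0.0, sd_null * NormalDist().inv_cdf((i + 0.5) / 100)) for i in range(100)]
d_null = ks_statistic(null_pairs, h2=0.5)
print(d_null, ks_pvalue(d_null, len(null_pairs)))
```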

Results

Here we illustrate the conditional-sibling trait distribution, validate the accuracy of our analytical model using simulation, perform power analyses for our statistical tests for complex genetic tail architecture (see Model & Methods), and apply our tests to trait data on thousands of siblings from the UK Biobank.

Conditional-Sibling Trait Distribution

In Figure 4:A the conditional-sibling trait distribution (Equation 4) is illustrated at different index sibling trait values (ranked percentiles). For an almost entirely heritable polygenic trait (orange), siblings of individuals at the 99th percentile (z = 2.33) have mean z-scores approximately halfway between the population mean and index mean (i.e. z = 1.1). This regression-to-the-mean is greater when trait heritability is lower (blue), assuming (as here) independent environmental risk among siblings.

Figure 4. Conditional-Sibling Trait Distribution under Polygenic Architecture.

A: The conditional-sibling trait distribution according to Equation 4 for index siblings at the 1st, 25th, 50th, 75th, and 99th percentile of the standardised trait distribution, when heritability is high ($h^2 = 0.95$, in orange) and moderate ($h^2 = 0.5$, in blue). When heritability is 0.95, the conditional-sibling expectation is almost half of the index sibling z-score; when heritability is 0.5, the conditional-sibling expectation is equal to 1/4 of the index sibling z-score. B: The conditional distribution transformed into rank space. An individual whose sibling is at the 99th percentile is expected to have a trait value in the 80th percentile when heritability is high and in the 67th percentile when heritability is moderate.

In Figure 4:B the conditional-sibling z-distribution is transformed into percentiles for interpretation in rank space. This distribution is skewed, especially at the tails, due to truncation at extreme quantiles (i.e. siblings cannot be more extreme than the top 1%). For a trait with $h^2 = 0.95$, siblings of individuals at the 99th percentile (z = 2.33) have a mean trait value at the 80th percentile. Note that this is less extreme than the result of transforming their expected z-value into percentile space ($\Phi(1.1) = 86\%$), which is a consequence of Jensen’s inequality (30) given that the normal cumulative distribution function is concave above zero.
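The gap between the mean percentile and the percentile of the mean z-value can be checked numerically. The sketch below assumes the conditional-sibling distribution is normal with mean $h^2 s_1/2$ and variance $1 - h^4/4$, and uses the standard identity $E[\Phi(Z)] = \Phi\left(\mu/\sqrt{1+\sigma^2}\right)$ for $Z \sim \mathcal{N}(\mu, \sigma^2)$; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def expected_sibling_percentile(index_percentile, h2):
    """Mean percentile of a sibling whose index sibling is at index_percentile."""
    nd = NormalDist()
    s1 = nd.inv_cdf(index_percentile)
    mean = h2 * s1 / 2.0
    var = 1.0 - h2 ** 2 / 4.0
    # E[Phi(Z)] for Z ~ N(mean, var) equals Phi(mean / sqrt(1 + var))
    return nd.cdf(mean / sqrt(1.0 + var))


def percentile_of_expected_z(index_percentile, h2):
    """Percentile obtained by transforming the sibling's expected z-value."""
    nd = NormalDist()
    return nd.cdf(h2 * nd.inv_cdf(index_percentile) / 2.0)


print(round(expected_sibling_percentile(0.99, 0.95), 3))
print(round(percentile_of_expected_z(0.99, 0.95), 3))
```

The first value is smaller than the second, as expected from Jensen's inequality.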

We compared the theoretical conditional-sibling trait distributions to those generated from simulated data (see Appendix 3) and found that irrespective of trait used to simulate data (e.g. fluid intelligence, height) the two distributions did not differ significantly, suggesting that our analytically derived distributions are a valid model for the conditional-sibling trait distribution (Equation 4).

Power of Statistical Tests to Identify Complex Tail Architecture

Building on the theoretical framework introduced in the Model & Methods and illustrated in the previous section, we develop statistical tests to identify complex architecture in the tails of the trait distribution. These tests leverage the fact that the similarity (or dissimilarity) in trait values among siblings provides information about the underlying genetic architecture (see Figure 1). For example, high-impact de novo mutations generate large dissimilarity between siblings when only one carries the unique mutant allele, while Mendelian variants can create excess similarity in the tails of the distribution when siblings both inherit the same mutant allele.

In Appendix 2, we provide detailed derivations for the statistical tests described at a high level in Model & Methods and explain how they identify tail signatures in contrast to a polygenic background where conditional siblings regress to the mean at a rate proportional to $h^2/2$ (Equation 4). The performance of the two tests evaluated using simulated sibling data is shown in Figure 5. These tests demonstrate that power to identify de novo architecture is greatest when heritability is high, while power to identify Mendelian architecture is greatest when heritability is low. These patterns can be explained by the fact that high heritability should lead to relatively high similarity among siblings, and low heritability to low similarity, under polygenicity. When heritability is estimated near 50% and at least 0.1% of the population has high-impact rare aetiology, both tests are well-powered to identify each class of complex tail architecture.

Figure 5. Power to detect complex tail architecture for different heritability levels, de novo and Mendelian frequencies and sample sizes.

Simulation assumes highly penetrant de novo and Mendelian frequencies of 0.05%, 0.1%, 0.2%, 0.3%, and 0.5%. The false-positive rate was set at 0.05. Null simulations (red dashed line) demonstrate that the tests are well calibrated.

Identifying Complex Tail Architecture in UK Biobank Data

We applied our statistical tests for complex tail architecture to sibling-pair data on eighteen traits from the UK Biobank (31). Here we present results from a set of six traits with varied tail architecture: Sitting Height, Forced Expiratory Volume, Urate, Ankle Spacing, Left Hand Grip Strength, and Waist Circumference. For each trait we estimated conditional sibling heritability via Equation 5, and performed tests to identify de novo and Mendelian architecture in the lower and upper tails of the distribution of each trait. We also performed a Kolmogorov-Smirnov test (29) to provide a general test for any departures from our null model. We observed expected polygenic architecture in both tails for Ankle Spacing. We inferred Mendelian architecture in the lower tail for Urate, and the upper tail for Waist Circumference and Left Hand Grip Strength. De novo architecture was inferred in the lower tail for Forced Expiratory Volume and strongly inferred in both tails for Sitting Height, which is supported by evidence from deep sequencing studies indicating that rare variants play a substantial role in the genetic aetiology for this trait (32, 33). In the lower tail for Left Hand Grip Strength we infer the presence of both de novo (greater than expected mean) and Mendelian (more concordant siblings than expected) architecture. We note that this could occur as a result of highly penetrant variants that are only shared among some siblings or perhaps because, at the extremes, siblings with different handedness are unexpectedly divergent and siblings with matching handedness are unexpectedly concordant. Extended results for all eighteen traits analysed can be found in Appendix 4.

Discussion

In this paper, we present a novel approach to infer the genetic architecture of continuous traits, specifically in the tails of the distributions, from sibling trait data alone. Our approach is based on a theoretical framework that we develop, which derives the expected trait distributions of siblings conditional on the trait value of an index-sibling and the trait heritability, assuming polygenicity, random mating and environmental factors that are either controlled for or have combined Gaussian effects. The key intuition underlying the approach is that departures from the expected conditional-sibling trait distribution in relation to index-siblings selected from the trait tails may be due to non-polygenic architecture in the tails.

We demonstrate the validity of our conditional-sibling analytical derivations through simulations and show that our tests for identifying de novo and Mendelian architecture in the tails of trait distributions are well-powered when large effect alleles are present in the population on the order of 1 out of 1000 individuals. Applying our tests to a subset of well-powered traits in the UK Biobank, we find evidence of complex genetic architecture in at least one tail (p < 0.05) in sixteen of eighteen traits and find de novo architecture occurring more frequently than Mendelian architecture (19 vs 6 of 36 total tails).

There are several areas in which our work could have short-term utility. Firstly, those individuals inferred as having rare variants of large effect could be followed up in multiple ways to gain individual-level insights. For example, they could undergo clinical genetic testing to identify potential pathogenic variants with effects beyond the examined trait, either in the form of diseases or disorders that the individual has already been diagnosed with or else that they have yet to present with but may be at high future risk for. Furthermore, investigation of their environmental risk profile may indicate an alternative - environmental - explanation for their extreme trait value (see below), rather than the rare genetic architecture inferred by our tests.

Our framework could also help to refine the design of sequencing studies for identifying rare variants of large effect. Such studies either sequence entire cohorts at relatively high cost (34) or else perform more targeted sequencing of individuals in the trait tails with the goal of optimising power per cost (35). However, even the latter approach is usually performed blind to evidence of enrichment of rare variant aetiology in the tails. Since our approach enables identification of individuals that may be most likely to harbour rare variants, then these individuals could be prioritised for (deep) sequencing. Moreover, our ability to distinguish between de novo and Mendelian architecture could influence the broad study design, with the former suggesting that a family trio design may be more effective than population sequencing, which may be favoured if Mendelian architecture is inferred. Furthermore, our approach could be applied as a screening step to prioritise those traits, and corresponding tails, most likely to harbour rare variant architecture. Finally, if sequence data have already been collected, either cohort-wide or using a more targeted design, then our approach could be utilised to increase the power of statistical methods for detecting rare variants by upweighting individuals most likely to harbour rare variants.

This study has several limitations. First and foremost, departures from the expected conditional-sibling trait distributions could be due to environmental risk factors, such as medication use or work-related exposures, rather than rare genetic architecture. Thus, while we believe that tail-specific deviations from polygenic expectation are interesting whether they arise primarily from genetic or environmental factors, we caution against over-interpretation of our results. Rejection of the null hypothesis from our tests should be considered only as indicating effects consistent with non-polygenic genetic architecture, alongside alternative explanations such as tail-specific unshared (de novo) or shared (“Mendelian”) environmental risks. We suggest that further investigation of individuals’ clinical, environmental and genetic profiles is required to achieve greater certainty about the causes of their extreme trait values. Nevertheless, given knowledge that rare variants of large effect contribute to complex trait architecture, we expect that traits for which we infer non-polygenic architecture will, on average, be more enriched for rare architecture in the tail(s) than other traits. Secondly, our modelling assumes that environmental risk factors of siblings are independent of each other. If in fact shared environmental risk factors contribute significantly to trait similarity among siblings, then our heritability estimates will be upwardly biased. However, this would only impact our tests if the degree of shared environmental risk differed in the tails relative to the rest of the trait distribution. Moreover, a large meta-analysis of heritability estimates from twin studies (24) concluded that the contribution of shared environment among siblings (even twins) is insubstantial, and so we might expect this to have limited impact on results from our tests. Thirdly, our modelling assumes random mating and so the results from our tests in relation to traits that may be the subject of assortative mating should be considered with caution. Likewise, our modelling assumes additivity of genetic effects and so, while additivity is well-supported by much statistical genetics research (36, 37), results from our tests should be reconsidered for any traits with evidence for significant non-additive genetic effects.

Our approach not only provides a novel way of inferring genetic architecture (without genetic data) but can do so specifically in the tails of trait distributions, which are most likely to harbour complex genetic architecture, due to selection, and are a key focus in biomedical research given their enrichment for disease. This work could also have broader implications in quantitative genetics since we derive fundamental results about the relationship among family members’ complex trait values. The conditional-sibling trait distribution provides a simple way of understanding the expected trait values of individuals according to their sibling’s trait value, which could be used to answer questions of societal importance and inform future research. For example, it can be used to answer questions such as: as a consequence of genetics alone, how much overlap should there be in the traits of offspring of mid-parents at the 5th and 95th percentile and how does that contrast with what we observe in highly structured societies? Moreover, further development of the theory described here could lead to a range of other applications, for example, estimating levels of assortative mating, inferring historical selection pressures, and quantifying heritability in specific strata of the population.

Acknowledgements

We thank the participants in the UK Biobank and the scientists involved in the construction of this resource for making the sibling data used in this manuscript available. The work in this manuscript has been conducted using the UK Biobank Resource under application 18177 (Dr O’Reilly). We would also like to thank Dr. Avi Reichenberg for early discussions and Dr. Peter Visscher for highlighting key references related to the topic, and Dr. Shai Carmi for providing feedback on a draft version of the article.

Appendix 1

Derivation of Conditional Sibling Distributions

Here we derive the distribution that describes the probability density of the “conditional” sibling ($S_2$) given the genetic liability, trait value and case-status of one or more index siblings. In each case we assume a population of unrelated parents and rely on the results from the infinitesimal polygenic model that show that offspring genetic liability is normally distributed around midparent genetic liability (average of parents), with within-family variance equal to half the ancestral trait variance, even when selection, drift, population structure or dominance effects alter the between-family trait distribution (20, 38, 39).

Case 1) Index liability known, continuous trait ($h^2 = 1$): $P(S_2 \mid S_1 = s_1)$

We begin with the simplest case, a polygenic normally distributed trait which is fully heritable (h2 = 1) where the genetic liability of an index sibling in a population is known. Throughout we denote the midparent, index sibling and conditional sibling by M, S1 and S2. We begin by calculating the midparent distribution conditional on index liability using Bayes theorem:

Then using the following identity (40):

we calculate the conditional sibling distribution similarly:

Thus, as predicted by the infinitesimal polygenic model, the conditional sibling liability is normally distributed around the conditional midparent liability distribution, with additional variance equal to half the population liability variance.
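As a numerical sanity check on this case, the following Python sketch simulates families under the stated model. It assumes a unit population liability variance, so that midparents have variance 1/2 and each offspring is drawn around the midparent with within-family variance 1/2. Under those assumptions, combining a conditional midparent variance of 1/4 with the additional 1/2 gives a conditional sibling mean of s1/2 and variance of 3/4. The function names are illustrative and are not taken from the authors' software.

```python
import random
import statistics


def simulate_sibling_pairs(n_families, seed=0):
    """Simulate (index, conditional) sibling liabilities for h2 = 1."""
    rng = random.Random(seed)
    sd_within = 0.5 ** 0.5  # within-family SD: half the population variance
    pairs = []
    for _ in range(n_families):
        mother = rng.gauss(0.0, 1.0)
        father = rng.gauss(0.0, 1.0)
        midparent = 0.5 * (mother + father)  # variance 1/2
        s1 = rng.gauss(midparent, sd_within)
        s2 = rng.gauss(midparent, sd_within)
        pairs.append((s1, s2))
    return pairs


def conditional_sibling_moments_h2_one(s1):
    """Assumed closed form: M | S1 = s1 ~ N(s1/2, 1/4), plus variance 1/2."""
    return 0.5 * s1, 0.25 + 0.5


pairs = simulate_sibling_pairs(200_000)
# Keep conditional siblings whose index sibling has liability close to 1.
window = [s2 for s1, s2 in pairs if 0.9 < s1 < 1.1]
emp_mean = statistics.fmean(window)
emp_var = statistics.variance(window)
print(emp_mean, emp_var, conditional_sibling_moments_h2_one(1.0))
```

The empirical mean and variance in the window should sit close to the closed-form values, up to sampling noise and the finite width of the window.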

Case 2) Index trait value known, continuous trait (h2 ≠ 1): P (S2 | S1 = s1)

In this case, which gives the primary result considered in this manuscript, the trait z-value of an index sibling is known or, equivalently, its percentile rank where the rank-based inverse transformation used in genome-wide association has been applied (41). Transformation to a Z distribution (σ2 = 1) means that for heritability h2 the genetic liability and environmental contributions to trait variance are h2 and 1 − h2, respectively. Similar to the previous case, we begin by calculating the conditional midparent liability from Bayes’ theorem:

Then, we again use Equation 11 to derive the conditional sibling distribution:
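The two steps of this case can be sketched in Python as follows. The closed forms used here are assumptions derived from the setup described above (unit trait variance, genetic variance h2, midparent variance h2/2, no shared environment): the midparent given s1 has mean h2·s1/2 and variance h2/2 − h2²/4, and adding the segregation variance h2/2 and the environmental variance 1 − h2 gives a conditional sibling with mean h2·s1/2 and variance 1 − h2²/4. Function names are hypothetical.

```python
from statistics import NormalDist


def conditional_midparent(s1, h2):
    """Midparent liability given an index trait z-value (assumed closed form)."""
    mean = h2 * s1 / 2.0
    variance = h2 / 2.0 - (h2 ** 2) / 4.0
    return NormalDist(mu=mean, sigma=variance ** 0.5)


def conditional_sibling(s1, h2):
    """Conditional sibling trait distribution given the index z-value."""
    mid = conditional_midparent(s1, h2)
    # Add segregation variance (h2/2) and environmental variance (1 - h2).
    variance = mid.variance + h2 / 2.0 + (1.0 - h2)
    return NormalDist(mu=mid.mean, sigma=variance ** 0.5)


dist = conditional_sibling(s1=2.0, h2=0.5)
print(dist.mean, dist.variance)
```

With h2 = 1 the function reduces to the fully heritable case, with mean s1/2 and variance 3/4.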

Case 3) Multiple index trait values known, continuous trait: P (S3|S1 = s1, S2 = s2)

The conditional sibling distribution can also be derived using the joint trait distribution for related individuals:

where the covariance G is the genetic relationship matrix. Thus for a sibling pair

As shown by Bernardo and Smith (42), if X is multivariate normal 𝒩(µ, λ−1), where λ = Σ−1 is the precision matrix, and X is partitioned into x1 and x2, with corresponding partitions of µ and λ of:

then the conditional distribution of x1 given x2 is also normal with mean and precision matrix:

Thus, given the joint distribution for three siblings:

The precision matrix λ = Σ−1 is

And we can calculate the conditional distribution given two index siblings using Equation 15:
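A minimal Python sketch of this calculation is given below. It assumes standardized trait values (unit variance, zero mean) and a common sibling covariance of h2/2 in the three-sibling covariance matrix, and it uses the standard precision-form conditioning rule for a multivariate normal: the conditional precision is the corresponding diagonal block of λ, and the conditional mean is minus the inverse of that block times the off-diagonal block times the observed values. The inverse of the equicorrelated 3×3 matrix is written in closed form; the function name is illustrative.

```python
def third_sibling_given_two(s1, s2, h2):
    """Mean and variance of S3 given S1 = s1 and S2 = s2 (precision form)."""
    rho = h2 / 2.0  # assumed sibling covariance on the z-scale
    # Closed-form inverse of the 3x3 matrix with 1 on the diagonal, rho elsewhere.
    denom = (1.0 - rho) * (1.0 + 2.0 * rho)
    lam_diag = (1.0 + rho) / denom  # diagonal entry of the precision matrix
    lam_off = -rho / denom          # off-diagonal entry of the precision matrix
    cond_var = 1.0 / lam_diag
    cond_mean = -(lam_off / lam_diag) * (s1 + s2)
    return cond_mean, cond_var


mean3, var3 = third_sibling_given_two(s1=2.0, s2=1.0, h2=0.5)
print(mean3, var3)
```

Under these assumptions the conditional mean simplifies to ρ(s1 + s2)/(1 + ρ) and the variance to 1 − 2ρ²/(1 + ρ), so a second index sibling both sharpens the prediction and reduces the residual variance relative to the single-sibling case.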

Case 4) Binary trait: P (S2|S1 = Affected)

Here we again assume an underlying distribution that is 𝒩(0, 1) and made up of genetic and environmental components. However, we only know the index sibling’s status, which, as described under the liability threshold model (43), is equivalent to conditioning on the event that the index sibling’s trait value is above or below a z-value threshold T :

where T = Φ−1(1−K), Φ−1 is the inverse normal cumulative distribution function, and K is the incidence of the binary trait in the population. Thus, the conditional distribution of one sibling given an Affected index sibling can be calculated by integrating over the normal distribution truncated at T :

The first two moments of this truncated normal (44) are:

Approximating the index sibling distribution using a normal whose moments are taken from this truncated distribution, Equation 17 becomes:

which can be solved using the identity given in Equation 11:

Thus, conditional on an Affected sibling, the probability of concordance is:

which is equivalent to Reich’s (45) correction to Falconer’s (46) approximation where the relationship between the relatives is 0.5 for siblings. The probability of discordance given an Unaffected sibling can be calculated from Bayes’ Theorem:

which allows the conditional probability of case status to be determined given an index sibling’s status.
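The binary-trait calculation can be sketched in Python as below. The sketch assumes the standard moments of a standard normal truncated below at T (mean i = φ(T)/K and variance 1 − i(i − T)), a sibling relationship of 0.5 so that the trait-scale coefficient is h2/2, and the normal approximation to the affected index sibling described above. The probability for an Unaffected index sibling follows from the law of total probability. The function name and return structure are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def sibling_case_probabilities(K, h2):
    """Approximate P(S2 affected | S1 status) under the liability threshold model."""
    T = STD_NORMAL.inv_cdf(1.0 - K)          # liability threshold
    i = STD_NORMAL.pdf(T) / K                # mean liability of affected individuals
    rho = h2 / 2.0                           # assumed sibling coefficient
    mean_s2 = rho * i
    var_s2 = 1.0 - rho * rho * i * (i - T)   # normal approximation to the index sibling
    p_given_affected = 1.0 - NormalDist(mean_s2, var_s2 ** 0.5).cdf(T)
    # Law of total probability: K = P(aff | aff) K + P(aff | unaff) (1 - K).
    p_given_unaffected = K * (1.0 - p_given_affected) / (1.0 - K)
    return p_given_affected, p_given_unaffected


p_aff, p_unaff = sibling_case_probabilities(K=0.01, h2=0.5)
print(p_aff, p_unaff)
```

For a heritable trait the first probability exceeds K and the second falls slightly below it; with h2 = 0 both reduce to K.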

Appendix 2

Statistical Tests for Complex Architecture

Here we describe our statistical tests for complex tail architecture. Our tests identify changes in the conditional sibling distribution when ascertaining on an index sibling in the tail relative to polygenic expectation. To carry out these tests we establish a null distribution built on the assumption that indexing on siblings not in the tails reduces the likelihood that either sibling phenotype in the pair is driven by rare variants of large effect. We use the region from the 5th to the 95th percentile to estimate heritability. From the n sibling pairs where the index sibling lies between the 5th and 95th percentiles, we calculate the conditional likelihood using Equation 14:

and maximize the log-likelihood with respect to h2:

to obtain a maximum likelihood estimate for h2 that is used to define the null distribution for our statistical tests.
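The estimation step can be illustrated with the following Python sketch. It assumes the conditional sibling density is normal with mean h2·s1/2 and variance 1 − h2²/4 on the z-scale, restricts to pairs whose index sibling lies between the 5th and 95th percentiles, and maximizes the log-likelihood by a simple grid search rather than any particular optimizer. The simulated data, function names and grid resolution are illustrative choices only.

```python
import math
import random
from statistics import NormalDist


def log_likelihood(h2, pairs):
    """Sum of log conditional densities of s2 given s1 (assumed normal form)."""
    rho = h2 / 2.0
    var = 1.0 - rho * rho
    const = -0.5 * math.log(2.0 * math.pi * var)
    return sum(const - (s2 - rho * s1) ** 2 / (2.0 * var) for s1, s2 in pairs)


def estimate_h2(pairs, grid_size=99):
    """Grid-search MLE of h2 using index siblings in the central 90%."""
    lower = NormalDist().inv_cdf(0.05)
    upper = NormalDist().inv_cdf(0.95)
    core = [(s1, s2) for s1, s2 in pairs if lower <= s1 <= upper]
    grid = [(k + 1) / (grid_size + 1) for k in range(grid_size)]
    return max(grid, key=lambda h2: log_likelihood(h2, core))


def simulate_pairs(n, h2, seed=1):
    """Simulate polygenic sibling pairs directly from the assumed model."""
    rng = random.Random(seed)
    rho = h2 / 2.0
    resid_sd = math.sqrt(1.0 - rho * rho)
    pairs = []
    for _ in range(n):
        s1 = rng.gauss(0.0, 1.0)
        pairs.append((s1, rng.gauss(rho * s1, resid_sd)))
    return pairs


h2_hat = estimate_h2(simulate_pairs(10_000, h2=0.6))
print(h2_hat)
```

Because the likelihood is conditional on s1, restricting the index siblings to the central region does not bias the estimate under the polygenic model; it only reduces the information available.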

Statistical Test for De Novo Architecture

We identify de novo mutations of large effect by testing for discordance between siblings relative to the polygenic null using the conditional distribution of a sibling given the index sibling (Equation 14). Since de novo mutations typically result in trait values in the tail of the distribution, the test conditions on those index siblings in a specified upper quantile q of the distribution, i.e. those sibling pairs such that s1 > Φ−1(q), defined as the set Aq. We introduce an additional parameter α where values of α < 0 in the right tail and α > 0 in the left tail are indicative of discordant siblings with trait values closer to the mean, giving a log-likelihood:

The null hypothesis H0 : α = 0 is tested via a score test:

And the score test for H0 : α = 0:
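As an illustration of how such a score test can be computed, the Python sketch below makes a simplifying assumption that should be checked against the likelihood above: α is treated as an additive shift to the conditional sibling mean, so that the conditional sibling is normal with mean h2·s1/2 + α and variance 1 − h2²/4. Under that assumed parametrization the score at α = 0 is the sum of residuals divided by the variance, the information is n divided by the variance, and their ratio gives a statistic that is standard normal under the null. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def de_novo_score_test(tail_pairs, h2):
    """Score test of alpha = 0 for a mean shift (assumed parametrization).

    tail_pairs holds (s1, s2) pairs whose index sibling is in the upper tail.
    Returns the z statistic and the lower-tail p-value (alpha < 0).
    """
    rho = h2 / 2.0
    var = 1.0 - rho * rho
    n = len(tail_pairs)
    score = sum(s2 - rho * s1 for s1, s2 in tail_pairs) / var
    information = n / var
    z = score / math.sqrt(information)
    return z, NormalDist().cdf(z)


# Conditional siblings sitting one unit below expectation give a negative z.
shifted = [(s1, 0.25 * s1 - 1.0) for s1 in (2.4, 2.6, 2.8, 3.0)]
z_shift, p_shift = de_novo_score_test(shifted, h2=0.5)
print(z_shift, p_shift)
```

For index siblings in the lower tail the same statistic applies with the sign of the alternative reversed, so the upper-tail probability 1 − Φ(z) would be used instead.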

Statistical Test for Mendelian Architecture

Here we test for excess concordance between siblings in the tails of the distribution by testing for an excess number of observed siblings in the tail S2 > Φ−1(q) given the index sibling is in the tail S1 > Φ−1(q), where q is the quantile of interest. Denoting the set of index siblings in the tail by Aq and the size of the set by n, under the null of polygenicity we calculate the probability that the conditional sibling exceeds Φ−1(q) from the normal cdf and compute the mean across index siblings to define the mean concordance under polygenicity, π0:

Denoting the observed concordance (number of sibling pairs both > Φ−1(q)) by r, the binomial log-likelihood (ignoring the constant) is:

Assuming r = such that I is not a function of any particular observation:

And the score test for H0 : π = π0:
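A hedged Python sketch of this concordance test follows. It assumes the polygenic conditional distribution with mean h2·s1/2 and variance 1 − h2²/4 when computing each index sibling's exceedance probability, averages these to obtain π0, and uses the usual signed binomial score statistic (r − nπ0)/√(nπ0(1 − π0)) with a one-sided p-value for excess concordance. Names are illustrative, and the observed count is called r_obs to avoid clashing with other symbols.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def mendelian_score_test(pairs, q, h2):
    """Binomial score test for excess sibling concordance in the upper tail."""
    threshold = STD_NORMAL.inv_cdf(q)
    rho = h2 / 2.0
    sd = math.sqrt(1.0 - rho * rho)
    tail = [(s1, s2) for s1, s2 in pairs if s1 > threshold]
    n = len(tail)
    # Expected concordance under polygenicity, averaged over index siblings.
    pi0 = sum(1.0 - STD_NORMAL.cdf((threshold - rho * s1) / sd)
              for s1, _ in tail) / n
    r_obs = sum(1 for _, s2 in tail if s2 > threshold)
    z = (r_obs - n * pi0) / math.sqrt(n * pi0 * (1.0 - pi0))
    return {"n": n, "pi0": pi0, "r_obs": r_obs, "z": z,
            "p_value": 1.0 - STD_NORMAL.cdf(z)}


toy_pairs = [(3.0, 3.0), (3.0, 0.0), (2.5, 0.1), (0.0, 3.0)]
result = mendelian_score_test(toy_pairs, q=0.99, h2=0.5)
print(result)
```

The toy input is far too small for the normal approximation to be trusted; it is only meant to show which pairs enter the tail set and how π0 and the observed count are formed.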

Statistical Test for Non-Polygenic Architecture

General departures from polygenicity can be identified based on the degree of discordance of the conditional sibling distribution from its expectation. Assuming polygenicity, given index and conditional sibling pairs (s1, s2), from Equation 4 we can write

Therefore, departures from polygenicity can be tested in trait quantiles by testing the observed distribution of , where Aq is the set of index siblings in the quantile of interest, relative to a standard normal distribution via the Kolmogorov-Smirnov test.
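One way to carry out such a test is sketched below in Python. The sketch assumes that the quantity being tested is the standardized residual (s2 − h2·s1/2)/√(1 − h2²/4) for pairs whose index sibling lies in the upper quantile, which is standard normal under the polygenic model assumed here. It computes the one-sample Kolmogorov-Smirnov statistic directly and uses the asymptotic Kolmogorov series for the p-value, which is an approximation for small samples. Function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def standardized_residuals(pairs, h2, q):
    """Standardize conditional siblings whose index sibling exceeds quantile q."""
    rho = h2 / 2.0
    sd = math.sqrt(1.0 - rho * rho)
    threshold = STD_NORMAL.inv_cdf(q)
    return [(s2 - rho * s1) / sd for s1, s2 in pairs if s1 > threshold]


def ks_against_standard_normal(values):
    """One-sample KS statistic and asymptotic p-value versus N(0, 1)."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = STD_NORMAL.cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    lam = math.sqrt(n) * d
    if lam < 0.3:
        # The series converges slowly here and the p-value is essentially 1.
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(1.0, max(0.0, p))


# Evenly spaced normal quantiles fit well; shifting them by one unit does not.
grid = [STD_NORMAL.inv_cdf((i + 0.5) / 200) for i in range(200)]
d_fit, p_fit = ks_against_standard_normal(grid)
d_shift, p_shift = ks_against_standard_normal([x + 1.0 for x in grid])
print(d_fit, p_fit, d_shift, p_shift)
```

In practice the residuals from `standardized_residuals` would be passed to `ks_against_standard_normal`, with h2 set to the estimate obtained from the central part of the distribution.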

Appendix 3

Model Evaluation

Here we compare our theoretical derivations (Equation 4) that rely on the infinitesimal polygenic model (20) with simulated offspring data (Figure 7). We also compare our model to our empirical simulation (see Model & Methods) that draws allele frequencies and effect sizes from publicly available GWAS data (47) for two traits to produce parent and offspring genotype and genetic liability (equivalent to trait value when h2 = 1) (Figure 8).

Analysis of Six UK Biobank Traits.

Application of statistical tests for Mendelian and de novo tail architecture to sibling trait data of six UK Biobank traits. For each trait the conditional sibling mean is plotted under polygenicity (black line) for the heritability estimated from the data. The red (high) and blue (low) bands represent the expected conditional sibling mean under polygenicity at different heritability values. Statistical tests for de novo architecture, Mendelian architecture, and general departure from polygenicity (Kolmogorov-Smirnov Test, Dist P-val) were applied to conditional siblings with index siblings in the upper and lower 1% of the distribution. Significant associations for the Mendelian and de novo tests are shown in red and green, respectively. Tail architecture that is not distinct from polygenic expectation is denoted in grey.

Theoretical and simulated conditional expectation and variance in liability (z-score) and rank across index sibling percentiles for conditional siblings, midparents and index siblings. The simulation drew one thousand parent liability values from 𝒩(0, 1); these were randomly paired to produce midparents with liability mi, and two offspring were subsequently drawn from 𝒩(mi, 1/2) and randomly assigned as index and conditional siblings.

For both Fluid Intelligence and Standing Height, GWAS variants (on chromosome one) were used to simulate parent and offspring genotypes and liability values. Plots show that for both traits the offspring distribution is normal and that the sibling distribution is multivariate normal, in line with our theoretical prediction.

Application To UKB (Extended).

These tests demonstrate that our theoretical framework accurately reflects an additive polygenic trait in an outcrossing population. Additionally, these results demonstrate that deviations in the conditional sibling distribution can be interpreted as non-polygenic architecture or quantile-specific environmental effects.

Appendix 4

Software Availability and Extended Results

We have made our code, sample data, and a brief tutorial available online at www.sibArc.net (16). Below we display the summary statistics and the tail results for all eighteen UK Biobank traits analyzed and referenced in the manuscript.