Figures and data

Study design and visual abstract of the purpose and benefits of long-read sequencing and a multi-ancestry imputation panel.
a) Long reads that span large SVs enable identification of novel SVs and improve calling of known SVs, both of which better inform locus-to-gene (L2G) approaches. b) Study design, including the sample selection, SV calling, SV imputation in UK Biobank, GWAS, and post-GWAS stages (not exhaustive). For the GWASs in UK Biobank, separate analyses with the same phenotype definitions were carried out using (i) the imputed SNVs and (ii) the imputed SVs (i.e., SV-WAS). The two sets of GWAS summary statistics were then combined for the relevant post-GWAS analyses. L2G/V2G: locus-to-gene/variant-to-gene (carried out on top SNVs identified by Shrine et al. that have an SV deletion within 500 kb); SV2G: SV-to-gene (carried out only on SV deletions that were the most significant variants in their associated loci in the GWASs we carried out in UK Biobank and that also overlap protein-coding genes); 1KGP: 1000 Genomes Project (Phase 1); UKB: UK Biobank.

Characterization of the quality-controlled imputation panel (N=888 samples and n=107,445 SVs).
a) A multi-ancestry, long-read sequencing-based imputation panel enables robust SV imputation in all biobanks, including UK Biobank, which we used as a proof of concept. AMR: Admixed American ancestry; AFR: African ancestry; EUR: White European ancestry; EAS: East Asian ancestry; SAS: South Asian ancestry. b) Sample counts by superpopulation and population code (population code descriptions in Supplementary Table 1), using abbreviations from the 1000 Genomes Project. c) Number of SVs by variant type (DEL: deletion, INS: insertion, DUP: duplication, INV: inversion, BND: break-end) and frequency class (common: 0.05<MAF≤0.5, low frequency: 0.005<MAF≤0.05, rare: 0<MAF≤0.005), with total SV counts by SV type. d) SV size distributions by SV type and frequency class, excluding break-ends, which do not have a size. e) Number of minor SV alleles per individual by superpopulation.
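The frequency classes in panel c can be written out as a small function. This is only an illustrative sketch: the thresholds are taken from the caption, while the function names `minor_allele_frequency` and `maf_class` are hypothetical and do not come from the study's pipeline.

```python
def minor_allele_frequency(alt_allele_frequency: float) -> float:
    """Fold an alternate-allele frequency in [0, 1] into a MAF in [0, 0.5]."""
    if not 0.0 <= alt_allele_frequency <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return min(alt_allele_frequency, 1.0 - alt_allele_frequency)


def maf_class(maf: float) -> str:
    """Assign the frequency class used in the caption.

    rare:          0     < MAF <= 0.005
    low frequency: 0.005 < MAF <= 0.05
    common:        0.05  < MAF <= 0.5
    """
    if not 0.0 < maf <= 0.5:
        # Monomorphic variants (MAF = 0) fall outside all three classes.
        raise ValueError("MAF must lie in (0, 0.5]")
    if maf <= 0.005:
        return "rare"
    if maf <= 0.05:
        return "low frequency"
    return "common"


if __name__ == "__main__":
    for af in (0.001, 0.02, 0.3, 0.97):
        maf = minor_allele_frequency(af)
        print(f"AF={af} MAF={maf:.3f} class={maf_class(maf)}")
```

Because the upper bound of each class is inclusive, a variant with MAF exactly 0.005 is counted as rare and one with MAF exactly 0.05 as low frequency.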

Evaluation of imputation quality.
a) Leave-one-out imputation: sample-wise non-reference concordance stratified by superpopulation and GIAB region type, based on imputed common SVs (top) and imputed common SNVs (bottom). b) Leave-one-out imputation: variant-wise non-reference concordance (top) and r2imp score (bottom) stratified by variant type, MAF class, and GIAB region type. c) Imputation into UK Biobank: r2imp score by variant type and MAF class, for confident regions, all regions, and difficult regions.
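The caption does not define non-reference concordance, so the sketch below rests on a common convention that is assumed here and is not confirmed by the source: genotypes are coded as alternate-allele counts (0, 1, 2), pairs where both the true and imputed genotypes are homozygous reference are excluded, and concordance is the fraction of the remaining pairs that match exactly. The function name `non_reference_concordance` is hypothetical.

```python
from typing import Optional, Sequence


def non_reference_concordance(
    truth: Sequence[int], imputed: Sequence[int]
) -> Optional[float]:
    """Fraction of matching genotypes among pairs that are not both hom-ref.

    Assumed coding: 0 = hom-ref, 1 = het, 2 = hom-alt.
    Returns None when every pair is hom-ref/hom-ref (metric undefined).
    """
    if len(truth) != len(imputed):
        raise ValueError("truth and imputed must have the same length")

    considered = 0
    matches = 0
    for true_gt, imputed_gt in zip(truth, imputed):
        if true_gt == 0 and imputed_gt == 0:
            continue  # concordant hom-ref pairs are excluded by assumption
        considered += 1
        if true_gt == imputed_gt:
            matches += 1

    return matches / considered if considered else None


if __name__ == "__main__":
    truth = [0, 0, 1, 2, 1]
    imputed = [0, 1, 1, 2, 0]
    print(non_reference_concordance(truth, imputed))
```

Under this assumed definition, the same function yields a sample-wise value when applied across the variants of one sample and a variant-wise value when applied across the samples of one variant.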

a) Manhattan plot of nine selected traits, showing only genome-wide (GW) significant SVs that overlap protein-coding genes. Phenotype definitions and the full list of GW-significant SVs are available in Supplementary Tables 12-13 and 14, respectively. b) Region plot showing the FEV1/FVC association of Sniffles2.DEL.3639MF (an 841-base SV deletion in an intron of CFDP1) and nearby SNVs in a GWAS of UK Biobank participants. LD between the variants was calculated using the same population used in the UKB GWASs. Additional details on the SV can be found in Supplementary Table 5.