Admixture into and within sub-Saharan Africa

Wellcome Trust Centre for Human Genetics, United Kingdom; Wellcome Trust Sanger Institute, United Kingdom; Medical Research Council Unit, The Gambia; Royal Victoria Teaching Hospital, The Gambia; Centre National de Recherche et de Formation sur le Paludisme, Burkina Faso; University of Rome La Sapienza, Italy; Navrongo Health Research Centre, Ghana; Komfo Anokye Teaching Hospital, Ghana; University of Buea, Cameroon; KEMRI-Wellcome Trust Research Programme, Kenya; Kilimanjaro Christian Medical College, Tanzania; London School of Hygiene and Tropical Medicine, United Kingdom; College of Medicine, University of Malawi, Malawi; University of Bamako, Mali
Joseph K Pickrell, New York Genome Center and Columbia University, United States

Busby et al. present new genomic data from several African populations that greatly expand our understanding of African genomic diversity.

Essential revisions:

The reviewers raised a number of important points, specifically regarding the interpretation of the analysis. Please address these points at the relevant places in the manuscript. I (the editor) have lightly edited some of the major points raised by the reviewers below.

1) Reviewer #2 argued that the results of the analysis are highly dependent upon the available samples. While the sampling here is better than previously available, relative to the actual scale of Africa several key regions remain undersampled. This fact is acknowledged at several points in the manuscript; however, much of the interpretation of the results is done without considering how sampling may influence the findings. For example, a central focus of the manuscript is on the Bantu expansion, yet no samples are included from the purported area of the Western branch of its spread south. This is a major limitation that needs to be considered at all points.

2) Reviewer #2 argued that the authors' reading of the available literature on African prehistory is a bit thin at times. These analyses will be much more valuable and have a much wider impact if they are better integrated into the reasonably well-developed archaeological and linguistic understanding of African population history.

The primary disconnect between the interpretations presented here and the existing literature is with respect to the Bantu expansion. The authors consider their Cameroon samples to best represent the western branch of the Bantu migration. However, the archaeological and linguistic consensus seems to be that the area of Cameroon is actually the origin of the Bantu expansion. In other words, the Cameroon samples may best represent something like the ancestral population of both the Eastern and Western branches of the expansion. Also, since the Cameroonian samples are the most southern and western samples included from the West African set, it means that, contrary to the authors' assertions, no western branch Bantu populations were included in this study. Considering the Cameroonian samples as representative of the origin, rather than the western part of the expansion, will change the interpretation considerably.

3) MALDER seems to give consistently older dates than GLOBETROTTER. Why? Which should we prefer? The only way to properly address this question is to relate the inferred dates to the known historical or archaeological record (i.e. dates that are independent of population genetic inference). Some contextualisation here would be extremely helpful. Otherwise there is no clear way for the reader to evaluate these findings.

4) In the second paragraph of the subsection “Population movements within Africa and the Bantu expansion”. Tishkoff et al. 2009 showed convincingly, using genomic STR markers, that the Bantu expansion involved a spread of West African people rather than simple cultural diffusion. Thus, the characterisation of this as 1) an ongoing controversy, and 2) one only addressed using uniparental markers is not correct. In fact, Tishkoff et al. did so with even greater population sampling than found in the present study.

5) In the second paragraph of the subsection “Population movements within Africa and the Bantu expansion”. The descriptions of the two linguistic hypotheses are particularly lacking in detail. It is important to put dates on the possible early and late splits, since the magnitude of the difference between these will determine the power to detect a difference genetically. Also, it is stated that recent genetic and linguistic analyses support the late split, but there is no mention of how. This is important for contextualising how this analysis adds to these other findings.

6) In the third paragraph of the subsection “Population movements within Africa and the Bantu expansion”. I do not understand the logic behind the claim that the early split hypothesis predicts that all of the ancestry of Eastern NC speakers comes from Central West Africa. The general consensus is that the Bantu expansion originated in the area of Cameroon or Nigeria ~5 kya, then split into two major forks (Greenberg 1972, Phillipson 1975). The first spread around the Congo forest to the North, entered East Africa in the Great Lakes region and continued down the eastern margin of Africa to Southern Africa. The second went around the Congo Forest down the Western Coast and entered Southern Africa. These two branches are thought to have ultimately reunited in Southern Africa. The finding presented here of Southern African NC ancestry in East African NC speakers might best be explained by the former descending from the latter. It is not possible to detect ancestry from the western branch of the Bantu expansion here because no populations were sampled from the area of Gabon, Angola, etc. I am worried these findings are a sampling artefact.

7) In the subsection “(3) Medieval contact between Asia and the East African Swahili Coast”. Again, it is not clear that this represents the Western Bantu expansion, since the samples come from Cameroon, which is hypothesised to be near the homeland of both the Eastern and Western branches of the expansion.

8) In the subsection “(7) Pre-Bantu pastoralist movements from East to South Africa”. The proposed model here is incongruent with the known archaeology, which shows the Bantu expansion reaching the eastern side of the Congo Forest in the north near Lake Victoria, rather than the south. The archaeology may not be correct; however, given the lack of western Bantu samples included in this study, the conservative interpretation is that these findings are a sampling artefact.

9) In the first paragraph of the Discussion. The adoption of pastoralism began well before 2,000 years ago. The earliest archaeological evidence of cattle keeping may date to as early as 9 kya in the Nile River Valley, and cattle keeping is well established by 6 kya (see Boivin et al. 2010). Pastoralists reach the Koobi Fora region of Kenya by 4.5 kya, and are east of Lake Victoria by 3.8 kya (see Pendergast 2011).

10) In the first paragraph of the Discussion. Why is the GLOBETROTTER analysis more precise? Is there any evidence to support the assertion? Also, even if more precise, is it more accurate?

11) Methods section: The Methods section is often difficult to follow because it frequently mixes results with the methods themselves. If results of one analysis are necessary to inform how the following analysis is to be performed, it would be preferable to lay out the strict methods and the logic of how methodological options were chosen in the Methods section, and then keep the actual progression of results in the Results section.

12) More generally, the manuscript is a bit too long; in particular, the Discussion and Abstract contain a substantial amount of content on genetic epidemiology and infectious disease that is only tangentially related to the analyses performed in the manuscript.

13) The description of the TVD analyses (subsection “Haplotypes reveal subtle population structure”) could be clarified. The comparison of TVD vs. Fst suggests that TVD is more informative ("expected to capture more variability") because recombination rates are faster than mutation rates. However, aren't recombination rates and mutation rates quite similar (both roughly 10^-8 per base per generation)? Also, the difference between TVD and Fst seems to be more than just the amount of information captured; the two metrics must be capturing different types of information, given the empirical differences between their values. More intuition would help. Overall, I found the discussion of TVD rather confusing, especially given that the paragraph ends with a caveat about interpretation – what then should we make of the TVD results in Figure 2A?
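
To illustrate why the two metrics need not track one another, the following is a minimal sketch only, not the authors' pipeline. It assumes TVD is computed as half the summed absolute difference between two populations' normalised copying vectors, and it uses Hudson's ratio-of-averages estimator for Fst purely as a stand-in, since the manuscript's estimator is not specified here. The function names and all toy numbers are hypothetical.

```python
def tvd(copy_a, copy_b):
    """Total variation distance between two copying vectors.

    Each vector gives the amount of haplotype copying from each donor
    group; vectors are normalised to sum to 1 before comparison.
    (Assumed definition: 0.5 * sum of absolute differences.)
    """
    if len(copy_a) != len(copy_b):
        raise ValueError("copying vectors must have the same donors")
    total_a = sum(copy_a)
    total_b = sum(copy_b)
    return 0.5 * sum(
        abs(a / total_a - b / total_b) for a, b in zip(copy_a, copy_b)
    )


def hudson_fst(freqs_1, freqs_2, n1, n2):
    """Hudson's Fst estimator as a ratio of averages over SNPs.

    freqs_1, freqs_2: per-SNP allele frequencies in the two populations.
    n1, n2: number of sampled chromosomes in each population.
    (A stand-in estimator, chosen for illustration only.)
    """
    num = 0.0
    den = 0.0
    for p1, p2 in zip(freqs_1, freqs_2):
        num += (
            (p1 - p2) ** 2
            - p1 * (1 - p1) / (n1 - 1)
            - p2 * (1 - p2) / (n2 - 1)
        )
        den += p1 * (1 - p2) + p2 * (1 - p1)
    return num / den


# Hypothetical toy inputs: copying from three donor groups, and
# allele frequencies at four SNPs.
pop_a_copying = [0.6, 0.3, 0.1]
pop_b_copying = [0.2, 0.3, 0.5]
print(tvd(pop_a_copying, pop_b_copying))

pop_a_freqs = [0.10, 0.50, 0.80, 0.30]
pop_b_freqs = [0.15, 0.45, 0.75, 0.35]
print(hudson_fst(pop_a_freqs, pop_b_freqs, n1=200, n2=200))
```

TVD here depends only on how copying is distributed across donor groups, whereas Fst depends on per-site allele frequency differences, so two populations can differ substantially on one metric and little on the other.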

14) In the main GLOBETROTTER analyses, the authors note that the GLOBETROTTER approach allows them to infer whether Eurasian haplotypes came directly into sub-Saharan Africa or were brought indirectly together with sub-Saharan groups (subsection “Direct and indirect gene flow from Eurasia back into Africa”, second paragraph). Is this statement technically correct? It seems to me that inference of this sort would require looking for colocation of Eurasian + other sub-Saharan haplotypes (passed along together in a single tract that itself was the product of a previous admixture + recombination). Such a signal would indeed be strong evidence of Eurasian haplotypes entering via a previous admixture, but as far as I can tell GLOBETROTTER only looks at copying vectors (vs. relative locations of copied segments).