Cutting Edge: Changing the rules of the game
Genomics researchers have built a Facebook game that allows members of the public to join the effort to understand a disease that has killed millions of ash trees across Europe.
Crowdsourcing encompasses a wide range of activities, all of which involve getting something done using labour contributed by people who enjoy doing the thing you want to be done. It draws strength from community spirit and a sense of involvement in the project, and is made possible by the fact that social networks and mobile computing devices mean that many people are always online and able to contribute very easily.
Scientists have been quick to make use of the power of crowds in both the physical and the life sciences (see Box 1). Possibly the most audacious and visually exciting application of crowdsourcing is the Eyewire project. For Eyewire, researchers have used a laser-guided microscope to capture images of the retina, in particular the neurons within the retina, at different depths. People playing Eyewire have to colour in the neurons within these images so that, eventually, it becomes possible to combine all these two-dimensional images to make a three-dimensional model of the complex network of neurons in the retina.
Other examples of crowdsourcing in science
Examples of crowdsourcing in science range from foldit, an online protein-folding game, to Galaxy zoo, which involves members of the public identifying different types of galaxies in images captured by the Hubble Space Telescope.
A novel use of Facebook for scientific crowdsourcing happened in 2011 when Devin Bloom, a PhD student at the University of Toronto Scarborough, and his colleagues needed to identify the species of 5000 fish that had been caught in the Cuyuni River in Guyana as part of a research project. Pressed for time, they uploaded photographs of the fish onto Facebook and asked other scientists and fish experts to help. Within 24 hr, all 5000 fish had been identified.
Also, the rapid release of genetic data during an outbreak of E. coli in Germany in 2011 was instrumental in the rapid identification of the strain responsible for the outbreak because it allowed bacterial genomicists from around the world to contribute analyses of the data.
More recently, scientists from the Wellcome Trust Sanger Institute mentored budding game developers in a short coding competition to produce a game based on genomics and genetics. And Cancer Research UK is collaborating with Amazon, Facebook and Google to run a ‘GameJam’ to design and develop a mobile phone game that would analyse genetic data as part of the effort to speed up the development of improved treatments for cancer.
The usefulness of using crowdsourcing to do genetics research has been demonstrated by the results of the Phylo project. Phylo is a game in which the four bases in DNA are represented by blocks of different colour: the aim of the game is to move groups of blocks to the left and right in order to find the best possible alignment of groups with each other. Or, in the language of bioinformatics, to work on a computationally intractable problem called ‘multiple sequence alignment’. The game has provided the Phylo team with many thousands of improved solutions to this problem for genes related to disease in humans and 43 other vertebrate species (see Kawrykow et al., 2012; note that ‘Phylo players’ are listed as an author on this paper).
More than a game
Both of these games require a low amount of investment: from the players’ point of view, they can be dipped into in a few spare minutes and played for their intrinsic reward as a game and still return useful results. However, this minor effort per player does not mean that the results are trivial: rather, there are certain types of problems that the human brain and eye can solve much faster than a computer can solve them.
For problems that require a high level of investment in time and application, and perhaps specialist knowledge from a different field, incentives beyond those provided by a game have been provided. For example, Karim Lakhani of Harvard Business School and co-workers ran a prize-based contest in which they invited algorithmic scientists from outside of biomedicine to develop algorithms to annotate (i.e. correctly label) genetic recombination in human T cells (Lakhani et al., 2013). The advances sought here were substantially technical, and the contest brought to the fore a variety of techniques that had not previously been tried by the computational biology community: moreover, some of the entries achieved algorithmic speeds that were faster than existing techniques by a factor of over 1000, and some entries achieved accuracies that were close to the theoretical maximum for the test data set.
A crucial component of this approach, apart from the promise of a prize, was the appeal to a pre-existing community. The contest was launched through the TopCoder.com website, a community of more than 500,000 coders who regularly compete to solve coding problems. To reach this community effectively, the genetics problem the researchers were interested in was recast as an algorithmic problem that was better suited to the target audience.
When we want to make use of crowds to do specialist analyses, one of the challenges is to accommodate their non-expert status (or, in the previous example, their other-expert status) by translating the problem into a form that is readily understandable by the crowd. Also, in many cases non-experts will not appreciate the value of the thing they are working towards, so other goals and rewards must be wrapped around the activity. Game interfaces are able to mix these three factors—a problem that can be understood, a goal and a reward—together. This is the approach that the OpenAshDieBack project—run by researchers at the Sainsbury Laboratory, the John Innes Centre and the Genome Analysis Centre, all in Norwich in the UK—is now taking to gain a better understanding of ash dieback, a fungal disease that has killed millions of ash trees (Fraxinus excelsior) across Europe.
An open approach to genomics
Ash dieback is caused by a fungus, Chalara fraxinea, that emerged in Poland in the early 1990s and spread west across Europe. There is no known treatment for the disease, so when it reached the UK in 2012, the author and colleagues set up the OpenAshDieBack project to rapidly generate genomic sequence data for C. fraxinea and release these data directly to the scientific community (even before we had a chance to analyse them ourselves). This crowdsourcing approach enabled us to gather analyses from scientific experts around the globe, and we now have a variety of new genomic resources, including draft genome assemblies of both F. excelsior and C. fraxinea.
To encourage input from beyond the scientific community we have now developed a game interface that allows non-geneticists and non-scientists to contribute to our genomic analyses by analysing genetic variants within Facebook. The game, which is called Fraxinus, is not merely an exercise in outreach, it is designed to return data that are useful to scientists.
The process of preparing a genome sequence is complex because genomes are extremely long and because existing sequencing techniques have various shortcomings. The genome of C. fraxinea is 63 million bases long and the genome of the ash tree is even longer (approximately 954 million bases). The biochemical process we use to sequence these genomes can give us fragments of less than 200 bases: this means that we have millions of fragments (or ‘reads’) for each genome, and we must assemble these reads together into longer contiguous sections in order to understand the genetic code of that organism.
Individuals from the same species will have DNA sequences that are almost identical. However, the differences between them are important because they contribute to differences in, for example, the ability to cause disease (for pathogens) or the susceptibility to disease (for trees). Therefore, by comparing the DNA sequences from different samples with a reference genome for that species, we should be able to identify the genetic variations that make some strains of C. fraxinea so deadly and/or the variations that make some trees resistant to ash dieback.
Common variants include a simple change in the identity of a base—such as an adenine (A) being replaced by a thymine (T)—or the insertion/deletion of a base compared with the reference (a variant called an indel). A lot of research has gone into developing computer programs that can identify these variants automatically, but computational approaches have some significant shortcomings, especially when identifying indels in certain types of genomes, such as genomes that are highly repetitive or contain a lot of cytosine (C) and guanine (G) bases. In these situations, the visual acuity and excellent pattern recognition skills of humans can be invaluable and can help us to find these elusive and potentially crucial variants.
Game on for genomics
The basic idea of Fraxinus is to compare reads from different samples (which could be pathogens or trees) with the relevant reference genome to identify important variants. To make this challenge manageable, we have selected locations within the reference genome that we know (based on a preliminary analysis of the data) are likely to contain genetic variants and included the bases on either side of these locations to make target patterns that contain a total of 21 bases. We have then extracted all the reads that overlap to some extent with each target pattern. To begin with, the game will only contain data on the pathogens, but the game could be played with data from any species, and we plan to include data on the trees within the next few weeks.
In the game the A, C, G and T bases within both the target and read patterns are represented by coloured leaf shapes (see Figure 1). The aim of the game, for the player, is to align the read patterns with the target pattern: the better the alignment between these two patterns, the higher the player’s score. The player can improve the alignment—and, therefore, his or her score—by shifting the whole read pattern, and then hiding individual leaves (to represent a deletion) or introducing a space between two leaves (an insertion). And if there already is a space between two leaves, it can be increased in size or removed.
The player’s score is recalculated after every action, and play ends when the player decides that he or she cannot improve the score any further. If the player’s score is the highest for that target pattern, the player ‘owns’ the pattern. The player can then either tackle a target pattern that is not owned by another player, or attempt to ‘steal’ a target pattern owned by someone else by getting a higher score for it. Players know that the alignments they create will be stored by the scientists running the game and that their names will also be stored (if they consent) so that they can be credited for their contribution to science.
Parts of the genome of C. fraxinea are rich in G and C bases, which makes it difficult to identify variants with computational techniques, so there is a good chance that the game will supply us with a rich supply of new variants to analyse. With more accurate lists of genetic variants we will be much better placed to assess genetic variability between strains and to carry out experiments to associate particular organismal traits (such as the ability to cause or withstand infection) with their genetic root. Ultimately this will help us to breed trees that are resistant to ash dieback (or other diseases), to identify treatments, and to plan for better management of our woodlands and tree stocks.
By embedding Fraxinus in the Facebook platform, we hope to encourage competition between players and other members of their social network, as this will lead to multiple independent analyses of the same data and to more accurate results. However, anyone attempting to do something similar should be aware that game design is difficult, and it requires input from experienced software engineers, gameplay experts and graphic designers. (We worked with a company called Team Cooper to help us develop Fraxinus.) A key challenge is to bridge the gap between the players and the scientists—if the game is not exciting to play, the project will fail no matter how exciting the science is to the scientists.
We think that with Fraxinus we have created a game that exists in a happy middle ground as a game platform that effectively and simply encapsulates the problem we need to solve, that is enjoyable and competitive for the player and, we hope, can provide us with the results we need.
Article and author information
- Version of Record published: August 13, 2013 (version 1)
© 2013, MacLean
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
- Page views
Article citation count generated by polling the highest count across the following sources: Scopus, Crossref, PubMed Central.
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
- Epidemiology and Global Health
- Genetics and Genomics
Background: To evaluate the utility of polygenic risk scores (PRS) in identifying high-risk individuals, different publicly available PRS for breast (n=85), prostate (n=37), colorectal (n=22) and lung cancers (n=11) were examined in a prospective study of 21,694 Chinese adults.
Methods: We constructed PRS using weights curated in the online PGS Catalog. PRS performance was evaluated by distribution, discrimination, predictive ability, and calibration. Hazard ratios (HR) and corresponding confidence intervals [CI] of the common cancers after 20 years of follow-up were estimated using Cox proportional hazard models for different levels of PRS.
Results: A total of 495 breast, 308 prostate, 332 female-colorectal, 409 male-colorectal, 181 female-lung and 381 male-lung incident cancers were identified. The area under receiver operating characteristic curve for the best performing site-specific PRS were 0.61 (PGS000873, breast), 0.70 (PGS00662, prostate), 0.65 (PGS000055, female-colorectal), 0.60 (PGS000734, male-colorectal) and 0.56 (PGS000721, female-lung), and 0.58 (PGS000070, male-lung), respectively. Compared to the middle quintile, individuals in the highest cancer-specific PRS quintile were 64% more likely to develop cancers of the breast, prostate, and colorectal. For lung cancer, the lowest cancer-specific PRS quintile was associated with 28-34% decreased risk compared to the middle quintile. In contrast, the hazard ratios observed for quintiles 4 (female-lung: 0.95 [0.61-1.47]; male-lung: 1.14 [0.82-1.57]) and 5 (female-lung: 0.95 [0.61-1.47]) were not significantly different from that for the middle quintile.
Conclusions: Site-specific PRSs can stratify the risk of developing breast, prostate, and colorectal cancers in this East Asian population. Appropriate correction factors may be required to improve calibration.
Funding This work is supported by the National Research Foundation Singapore (NRF-NRFF2017-02), PRECISION Health Research, Singapore (PRECISE) and the Agency for Science, Technology and Research (A*STAR). WP Koh was supported by National Medical Research Council, Singapore (NMRC/CSA/0055/2013). CC Khor was supported by National Research Foundation Singapore (NRF-NRFI2018-01). Rajkumar Dorajoo received a grant from the Agency for Science, Technology and Research Career Development Award (A*STAR CDA - 202D8090), and from Ministry of Health Healthy Longevity Catalyst Award (HLCA20Jan-0022). The Singapore Chinese Health Study was supported by grants from the National Medical Research Council, Singapore (NMRC/CIRG/1456/2016) and the U.S. National Institutes of Health [NIH] (R01 CA144034 and UM1 CA182876).
- Developmental Biology
- Genetics and Genomics
The development of tools to manipulate the mouse genome, including knockout and transgenic technology, has revolutionized our ability to explore gene function in mammals. Moreover, for genes that are expressed in multiple tissues or at multiple stages of development, the use of tissue-specific expression of the Cre recombinase allows gene function to be perturbed in specific cell types and/or at specific times. However, it is well known that putative tissue-specific promoters often drive unanticipated 'off target' expression. In our efforts to explore the biology of the male reproductive tract, we unexpectedly found that expression of Cre in the central nervous system resulted in recombination in the epididymis, a tissue where sperm mature for ~1-2 weeks following the completion of testicular development. Remarkably, we not only observed reporter expression in the epididymis when Cre expression was driven from neuron-specific transgenes, but also when Cre expression in the brain was induced from an AAV vector carrying a Cre expression construct. A surprisingly wide range of Cre drivers - including six different neuronal promoters as well as the adipose-specific Adipoq Cre promoter - exhibited off target recombination in the epididymis, with a subset of drivers also exhibiting unexpected activity in other tissues such as the reproductive accessory glands. Using a combination of parabiosis and serum transfer experiments, we find evidence supporting the hypothesis that Cre may be trafficked from its cell of origin to the epididymis through the circulatory system. Together, our findings should motivate extreme caution when interpreting conditional alleles, and suggest the exciting possibility of inter-tissue RNA or protein trafficking in modulation of reproductive biology.