A High Performance Computing (HPC) infrastructure can be used for fast, large-scale, parallel execution of processes and tasks (e.g. genome assembly, sequence alignment and database search). Substantially more powerful than traditional desktop computers, HPC is frequently employed for big data analysis in fields such as genomics, bioinformatics, computational sciences, geology, engineering and astronomy. Its limitations include cost, maintenance and performance capacity.
In Africa, bioinformatics, computational biology and biomedical research are growing rapidly, and more and larger datasets are being generated. Many African institutions are actively producing data, but few have access to HPC to analyse them, because little such infrastructure is available on the continent or accessible to African institutions. This hinders progress in research that depends on analysing these large datasets. The few institutions in Africa with an HPC facility operate with regional and federated access, which limits access from other regions of the continent.
I am passionate about open science, open tools and open data. My aim for the HPC4Africa project is to set up an open-access HPC or compute resource for big data analysis in African institutions (see https://github.com/trustodia). The availability of on-demand cloud computing platforms, such as Amazon Web Services (AWS), Azure and Google Cloud, means that I no longer have to build this compute resource from scratch; instead, I can set up the HPC on one of these existing platforms. This dramatically reduces the technical difficulty of setting up and maintaining the resource.
I first took this idea to Mozilla Global Sprint 2018, where I was able to share and discuss it, and to collaborate with some participants. The major achievement at this event was setting up the current Git repositories that house the project on local hardware infrastructure for development.
eLife’s Innovation Sprint 2019 provided a venue for me to showcase and work on this project (see the Sprint’s project roundup). Working on the project at the Sprint was not always easy: many participants were busy with several other projects and could only dedicate limited time, and the internet connection was sometimes unstable due to high user traffic. Nevertheless, I met Ivo Jimenez, Aziz Khan, Giorgio Sironi and Nick Duffield, who were experienced and skilled in DevOps, database management, HPC administration and user interface design, respectively. We were able to set up a virtual HPC on AWS, which is currently running, and installed some tools and scripts for 16S rRNA data analysis. Through conversations at the event, I also gathered a number of other brilliant alternative solutions for the project, including potentially implementing Popper, an open-source tool for conducting scientific explorations and writing academic articles following a DevOps approach.
I hope that the virtual HPC that I have built can serve as a blueprint for future set-ups. Moving forward, one of the first steps for this project is to apply for grant funding to purchase compute resources on an existing cloud computing platform. Ultimately, I hope that with the new knowledge that I have acquired at the Sprint, I will be able to help other bioinformaticians in Africa access and set up similar virtual HPCs and analyse their data. I also wish to see if I can scale up the current prototype to allow multiple users to access my set-up and use the open bioinformatics tools that I have created, as well as to share their own. This HPC instance can potentially serve as a sandbox for computational biologists and research software developers to build their own open tools, conduct small-scale user testing and gather feedback. In the long run, by encouraging a culture of sharing and openness, I hope that this will allow the African bioinformatics community to see the benefits of open collaboration, incentivise researchers to work together more often, and raise the visibility of African research.
I am using this post as an opportunity to make a call for contributions and collaboration on this project, especially with other researchers and software engineers who would like to explore setting up similar cloud computing resources for the community. I have also recently come across Galaxy, an open-source, web-based platform for data-intensive biomedical research, and am keen to understand whether it could be implemented. I would also love to work with web and software developers on a front-end design and back-end set-up for user account creation and registration for the current instance. A bottleneck to scaling up this project is the lack of funding, so any thoughts on ways to secure financial support for this resource would be very welcome. Finally, I would like to work with relevant communities, for example Data Carpentry, to provide training on implementing cloud infrastructures for bioinformatics analysis. If you are interested in any of these aspects, please let me know via email. I am open to ideas, options and alternatives.
We welcome comments, questions and feedback. Please annotate publicly on the article or contact us at innovation [at] elifesciences [dot] org.
Do you have an idea or innovation to share? Send a short outline for a Labs blogpost to innovation [at] elifesciences [dot] org.