Data Engineer (GCP, AWS, Open-source)

We are now seeking a Data Engineer to work with our Data Scientist in automating and expanding data pipelines, and preparing, transforming and generating insight from the data we have access to.

Closing date for applications is December 28, 2018.

You’ll join a small and enthusiastic team who are passionate about data science, continuous improvement and software quality.

Things like TDD/BDD, lean, continuous delivery, strong collaboration and DevOps are part of our culture. The size of our team means you’ll be involved in all aspects of development, not just data engineering. We are committed to openness both in science and in the products that we release so that we can encourage broad change across the research communication landscape. This means everything we do is open-source and we specifically seek out innovative projects to collaborate with, while also constantly looking at groundbreaking ways to innovate ourselves.

Experience and attributes

You care about data and building data systems, and you believe engineering involves a lot more than just computers, configs and code. You know that getting the best out of a system comes from strong collaboration with the people using it, both on the user side and within your own team. You have developed cloud-based ETL and data warehousing systems using batch or stream processing for ingesting data. You are familiar with Python (or similar) and keep abreast of the latest developments in the data analytics and data science communities. You care about open-source software and see the value in contributing to the community so that you can help make a difference.

Specific responsibilities

We build and operate an open platform for publishing that is designed for re-use by other publishers, scientists and institutions. The applications in this platform are augmented by our data science efforts including:

  • Semantic extraction from PDF and Word documents using ScienceBeam
  • Recommendations of editors and reviewers using PeerScout
  • Data-driven user experiences for automated form completion for xPub
  • Integration with third-party services for improved metrics
  • New, experimental machine learning and artificial intelligence projects

To expand our existing platforms we’re also in the early stages of building a data pipeline to consolidate our different data sources, along with external sources, in a modern, state-of-the-art data analytics platform. We’re using open tools on Google Cloud Platform to achieve that, along with Jupyter notebooks and Google Data Studio to expose the data and insights to our teams. We’re quite new to this, though, so there’s plenty of scope for trying out new and emerging technologies - the role is a good mix of modern big data analytics, preparing data for machine learning and optimising data flows between our existing systems.

Our production systems currently use Python, PHP and Node.js, with our data science apps all in Python using TensorFlow and Apache Beam, and we are evaluating Keras and PyTorch. We use Google Cloud Platform for data engineering and data science, hosting Apache NiFi with Docker on Kubernetes and using Google Cloud Storage, Dataprep, Compute Engine and Cloud Functions where needed. Our publishing platform runs on AWS, but we use containerisation with Docker, which allows us to choose the relevant technology for the task and the people working on it. We’d be interested in you if you’re from any background as long as you have a keen understanding of good development practices and the latest in data engineering techniques.

Here’s a summary of some of the technology we’re currently using but we’re open to new ideas:

  • Open languages and frameworks (Python, Django, PHP, Drupal, Symfony, Ruby, Rails)
  • Infrastructure is defined in code and automated using Terraform, SaltStack, Docker, Helm and Kubernetes
  • Data repositories (BigQuery, Postgres, Redis, Lucene, Elasticsearch)
  • Data processing tools (Apache NiFi, Apache Beam, Apache Airflow)
  • Visualisation and analytics tools (Google Data Studio, Google Colaboratory, Jupyter)
  • Machine Learning tools (TensorFlow, Keras, PyTorch)
  • Open-source continuous integration and testing (Travis CI, Jenkins, Selenium, PHPUnit, Behat, pytest)
  • Monitoring, logging and metrics (New Relic and Loggly)

We know not to spend time doing something that has been done well before, so we use hosted web services and existing tools when we can, unless doing so makes our software less open or harder to reuse.

Terms and conditions

We are a well-supported, not-for-profit, mission-driven organisation. We have a deeply open culture, and ideas are welcome from across the company, so you’ll get the chance to really make a difference. eLife is a great place to work if you care about science, and our modern office environment means you can vary how you like to work. Our approach to flexible working and our ability to work well remotely suit people with commitments outside work and those returning to work. The smaller size of the team means you’ll be able to get involved in many aspects of technology, innovation and scholarly publishing. We also offer:

  • A competitive salary and benefits.
  • 25 days holiday, plus bank holidays.
  • A communicative and inspiring working environment.
  • A newly refurbished, employee-designed modern office near the centre of Cambridge, with cycle parking, shower facilities, standing desks, informal working and relaxation areas, free fresh fruit, a fancy coffee machine and snacks in the kitchen.
  • A sociable and friendly team with interesting social events.
  • The latest in computer equipment, Herman Miller chairs and regular conference visits.
  • A flexible approach to working hours and remote working.
  • Company pension scheme.

Please send your CV and a covering letter explaining your enthusiasm for this position and why you are a great person for this role to hr@elifesciences.org. If you get stuck on what to write then here are some suggestions:

  • Why you would like to join eLife.
  • Where you found us (a job ad, on Twitter, a conference, meeting one of us).
  • Information about any of your interesting GitHub, Bitbucket etc. accounts.
  • Any blogs and email groups you read or would recommend.
  • Any communities and events you attend or are involved with.
  • Any books you’ve read that stood out.
  • Links to Twitter/LinkedIn/personal websites/blogs that you’d like us to read.

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, colour, national origin, gender, sexual orientation, age, marital status, or disability status.