Driving innovation in single-cell analysis on AWS

Computational biology is undergoing a revolution. Recent advances in microfluidic technology have enabled high-throughput, high-dimensional analysis of individual cells at an unprecedented scale. However, analyzing single cells remains a challenging problem. Standard statistical techniques used in genomic analysis do not capture the complexity present in single-cell datasets. Unlocking the potential of single-cell biology requires the development of innovative methods that employ complementary approaches, by researchers with diverse backgrounds and expertise.

Open Problems in Single-Cell Analysis is a community-driven effort, supported by the Chan Zuckerberg Initiative, to drive the development of novel methods that leverage the power of single-cell data. Open Problems provides a framework for researchers to define formalized tasks in single-cell genomics and to benchmark model performance on those tasks using AWS.

Single-cell genomics benchmarking platform

Figure 1. Architecture of single-cell genomics benchmarking platform

A core part of Open Problems is a benchmarking platform that allows anyone to contribute a method to the Open Problems in Single-Cell Analysis GitHub repository, where it is evaluated against other methods using standard datasets. The end-to-end process comprises two stages: (1) the development and debugging of containerized methods, and (2) benchmarking.

Stage one: Development of containerized methods
This is largely a preparatory phase done on local computers, where researchers develop a new method and containerize it for evaluation and benchmarking. The Dockerfiles are committed to GitHub, which triggers a container build workflow that publishes the image to Amazon Elastic Container Registry (Amazon ECR). Amazon ECR serves as a central repository for all containers used by Open Problems in Single-Cell Analysis.
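
As a rough illustration of where a published image ends up, Amazon ECR image URIs follow the pattern `<account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. The sketch below assembles such a URI. The account ID, Region, repository name, and tag are placeholders chosen for illustration, not the values Open Problems uses.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Assemble an Amazon ECR image URI from its parts."""
    # AWS account IDs are 12-digit numbers; reject anything else early.
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit string")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Placeholder values for illustration only.
EXAMPLE_URI = ecr_image_uri("123456789012", "us-west-2", "example-method", "v1")
print(EXAMPLE_URI)
```

A URI of this form is what a job definition would reference when the container is later run.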

Stage two: Benchmarking of computational methods
Benchmarking is done using AWS Batch and Nextflow. Containerized methods in Amazon ECR are deployed to AWS Batch and run in parallel. The Nextflow pipeline handles data download, preprocessing, evaluation of methods, and calculation of metrics. Data is downloaded from remote repositories such as the National Center for Biotechnology Information’s Gene Expression Omnibus (GEO) and stored in Amazon Simple Storage Service (Amazon S3), which gives jobs running in parallel low-latency access to it. Jobs are deployed to Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances using AWS Batch to scale to thousands of benchmarks in a cost-effective manner.
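
In the platform, Nextflow submits jobs to AWS Batch on the pipeline's behalf. To make the fan-out concrete, the sketch below builds one job request per method and dataset pair, using the parameter names accepted by the AWS Batch `SubmitJob` API (`jobName`, `jobQueue`, `jobDefinition`, `containerOverrides`). The queue name, job definition name, method and dataset names, and the `run_benchmark` command are all hypothetical, and nothing is sent to AWS.

```python
from itertools import product


def build_job_requests(methods, datasets, job_queue, job_definition):
    """Return one AWS Batch SubmitJob request (as a dict) per method/dataset pair."""
    job_requests = []
    for method, dataset in product(methods, datasets):
        job_requests.append({
            "jobName": f"benchmark-{method}-{dataset}",
            "jobQueue": job_queue,
            "jobDefinition": job_definition,
            "containerOverrides": {
                # Hypothetical entry point inside the method's container.
                "command": ["run_benchmark", "--method", method, "--dataset", dataset],
            },
        })
    return job_requests


job_requests = build_job_requests(
    methods=["method_a", "method_b"],
    datasets=["dataset_1", "dataset_2", "dataset_3"],
    job_queue="example-spot-queue",
    job_definition="example-benchmark-job",
)
print(len(job_requests))  # 2 methods x 3 datasets = 6 requests
```

Each dictionary could be passed to the boto3 Batch client as `submit_job(**request)`. In practice each method would likely point at its own container image, so a single shared job definition is a simplification.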

The combination of Docker, Nextflow, AWS Batch, and Amazon ECR provides a reliable, scalable, and automated benchmarking platform for Open Problems. The platform levels the playing field, letting diverse research teams around the world rapidly prototype new computational methods using a range of programming tools and massive datasets.

Benchmarking computational methods for single-cell analysis

Nine major tasks are already defined in the Open Problems framework. These tasks range from projecting labels from single-cell reference atlases onto new datasets, to integrating scRNA-seq datasets across batches, to deconvolving spatial transcriptomic spots into their constituent cell types. Computational biologists choose a task of interest and benchmark their method using the AWS Batch and Nextflow pipeline.

Take batch integration as an example: there are currently over 50 methods that integrate single-cell RNA-sequencing data across batches. The Open Problems batch integration benchmark compares 16 of these methods, which range from the earliest approaches proposed for this problem (MNN, Seurat, SAUCIE) to more recent popular methods (scANVI, FastMNN). Each of the 16 methods is run on 19 datasets and evaluated on 11 metrics to produce an output. A sample output of the batch integration benchmark is shown in Figure 2 below.
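
The size of this benchmark follows directly from those numbers. Assuming every method runs on every dataset and every metric is computed for each run, the arithmetic works out as follows:

```python
n_methods, n_datasets, n_metrics = 16, 19, 11

# One run per method/dataset combination.
runs = n_methods * n_datasets

# One metric value per run and metric.
scores = runs * n_metrics

print(runs, scores)  # 304 3344
```

That is 304 method and dataset combinations and 3,344 individual metric values for a single task, which is why running the jobs in parallel matters.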

Figure 2. Sample output of batch integration benchmark

Researchers can extend these methods and metrics using the API for batch integration methods. Adding a method to the Open Problems repository triggers a GitHub Actions workflow that automatically benchmarks it against the existing 16 methods using the AWS Batch and Nextflow pipeline.
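
One common way to structure this kind of extensibility is a registry that methods and metrics add themselves to. The sketch below is a generic illustration in plain Python, not the Open Problems API: the `register` decorator, the `METHODS` and `METRICS` dictionaries, and the toy method and metric are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical registries mapping a name to a method or metric function.
METHODS: Dict[str, Callable] = {}
METRICS: Dict[str, Callable] = {}


def register(registry: Dict[str, Callable], name: str):
    """Decorator that records a function in the given registry under `name`."""
    def decorator(func: Callable) -> Callable:
        registry[name] = func
        return func
    return decorator


def batch_means(values: List[float], batches: List[str]) -> Dict[str, float]:
    """Mean of the values belonging to each batch label."""
    means = {}
    for label in set(batches):
        members = [v for v, b in zip(values, batches) if b == label]
        means[label] = sum(members) / len(members)
    return means


@register(METHODS, "center_per_batch")
def center_per_batch(values: List[float], batches: List[str]) -> List[float]:
    """Toy 'integration': subtract each batch's mean from its values."""
    means = batch_means(values, batches)
    return [v - means[b] for v, b in zip(values, batches)]


@register(METRICS, "batch_mean_gap")
def batch_mean_gap(values: List[float], batches: List[str]) -> float:
    """Toy metric: gap between the largest and smallest batch mean (0 is best)."""
    means = batch_means(values, batches).values()
    return max(means) - min(means)


def benchmark(values: List[float], batches: List[str]) -> Dict[tuple, float]:
    """Score every registered method with every registered metric."""
    return {
        (method_name, metric_name): metric(method(values, batches), batches)
        for method_name, method in METHODS.items()
        for metric_name, metric in METRICS.items()
    }


results = benchmark([1.0, 2.0, 3.0, 11.0, 12.0, 13.0], ["a", "a", "a", "b", "b", "b"])
print(results)
```

With this shape, adding a method is a matter of writing one decorated function, and the benchmark loop picks it up without further changes.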

An international competition using the Open Problems framework

The most recent development from the Open Problems community is a competition for NeurIPS 2021. In this competition, Open Problems presents three critical tasks on multimodal single-cell data using public datasets and a first-of-its-kind multi-omics benchmarking dataset. Teams will predict one modality from another and learn representations of multiple modalities measured in the same cells. Progress will reveal how a common genetic blueprint gives rise to distinct cell types and processes, as a foundation for improving human health. Results from the competition will be hosted on the Open Problems repository, and final model evaluation will be run on AWS.

Conclusion

Open Problems in Single-Cell Analysis is a community-driven project that has taken on the daunting challenge of benchmarking all computational methods in single-cell genomics. This is made possible by the interplay of GitHub Actions workflows, Nextflow pipelines, Docker containerization, and AWS. Every contribution made to Open Problems is reviewed and vetted by a community of over 30 computational biologists around the world.

We invite you to contribute, to compete in the Multimodal Single-Cell Data Integration NeurIPS Competition, and to think big with the Open Problems team in setting a standard for benchmarking in single-cell genomics.
