The Data Toolkit That Can Analyze More Than 1M Cells

Brian P. Dunleavy
FEBRUARY 20, 2018

F. Alexander Wolf, PhD, and his team at the Institute of Computational Biology (ICB) at Helmholtz Zentrum München, the German Research Center for Environmental Health, have already been to the “Data Science Bowl”—a sort of “championship game” for machine learning.

Now, they may be at the forefront of a monumental breakthrough in the analysis of single-cell gene expression data with the advent of SCANPY, a “scalable toolkit” offering “preprocessing, visualization, clustering, pseudotime and trajectory inference, differential expression testing, and simulation of gene regulatory networks.” In fact, SCANPY, which stands for Single-Cell Analysis in Python (the programming language), may be the only currently available software package that can analyze data sets containing more than 1 million cells. Such data sets include the Human Cell Atlas, a reference database of maps, developed by an international team of researchers, that is designed to describe and define the cellular basis of health and disease.

A summary of Wolf’s work with SCANPY to date was published on February 6 in the journal Genome Biology.

“The Human Cell Atlas could profit from SCANPY,” Wolf, team leader in machine learning at ICB, told Healthcare Analytics News™. “Generating a cell atlas of the whole human body poses unseen computational challenges; we're talking about analyzing millions and millions of cells here. SCANPY makes a very good effort of resolving this.”

Wolf—who developed the software with his colleague Philipp Angerer in the Machine Learning Group of Fabian Theis, PhD, professor of mathematical modelling of biological systems at the Technical University of Munich—said the team has been asked to present SCANPY to the computational analysis committee of the Human Cell Atlas later this year. The Human Cell Atlas is only 1 of many “exploding” data sets (to use Wolf’s description) in healthcare research that, to date, have confounded investigators. Currently available software systems for gene-expression analysis simply haven’t been able to process data sets of this magnitude.

A key to SCANPY’s capabilities lies in the programming language upon which it is based. Python, which is more commonly used in the machine learning field, enables software to be more intuitive than conventional biostatistics packages, which are typically written in the R programming language. With Python as its base, SCANPY is able to combine preprocessing, cell visualization, and “pseudotemporal ordering,” tasks previously handled by separate systems, in a single platform. Conventional systems treat each cell as a point in a coordinate system, characterizing it by the expression values of thousands of genes. SCANPY instead uses algorithms (modelled on those used by social media platforms) that represent cells in a graph, mapping each cell by identifying its closest neighbors.
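
To make the neighbor-graph idea concrete, here is a minimal sketch in plain Python. It is not SCANPY's code or its API: the function name `knn_graph`, the toy expression values, and the choice of Euclidean distance are all assumptions made for illustration. Each cell starts out as a list of gene-expression values, and the function reduces it to a short list of its closest neighbors.

```python
import math


def knn_graph(expression, k):
    """Map each cell index to the indices of its k nearest cells.

    Hypothetical helper for illustration only; distances are Euclidean
    and computed by brute force over all pairs of cells.
    """
    graph = {}
    for i, cell in enumerate(expression):
        # Distance from cell i to every other cell in expression space.
        distances = [
            (math.dist(cell, other), j)
            for j, other in enumerate(expression)
            if j != i
        ]
        # Keep only the k closest cells as this cell's graph edges.
        graph[i] = [j for _, j in sorted(distances)[:k]]
    return graph


# Toy data: 6 cells x 3 genes (made-up values forming two distinct groups).
cells = [
    [9.0, 1.0, 0.5],
    [8.5, 1.2, 0.4],
    [9.2, 0.8, 0.6],
    [0.5, 7.0, 8.0],
    [0.7, 7.5, 7.8],
    [0.4, 6.8, 8.3],
]

graph = knn_graph(cells, k=2)
for cell, neighbors in graph.items():
    print(cell, "->", neighbors)
```

In this toy example, the first three cells link only to one another, as do the last three, so the two groups show up in the graph without consulting the raw expression values again. The brute-force comparison of every pair of cells is for clarity only and would not scale to a million cells.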

In assessing its capabilities for the Genome Biology paper, Wolf and his colleagues found that SCANPY could perform specific cell analysis steps several times faster than existing platforms. They believe the platform is capable of analyzing 1.3 million cells in just a few hours, without subsampling.

“Quite generally, as soon as large data sets with many observations arise, [or] when you want to integrate data sets from many studies, SCANPY will either enable this or, if it’s possible already, make it much faster,” Wolf said. “Another goal is to use SCANPY as a back-end for data portals that are now created to simplify analyzing data for non-computational-expert users: visualizing cells, clustering them to find new cell types, finding trajectories and branchings, be it in the context of development, disease progression, or dose response, and finding the genes that mark all these effects in an interactive data exploration.”

Although SCANPY is still very much in the developmental stage, experts within the field believe it could have a significant impact on research in the short term. Martin Hemberg, PhD, of the Wellcome Trust Sanger Institute in Cambridge, United Kingdom, who has expertise in bioinformatics, systems biology, and applied mathematics, told Healthcare Analytics News™ that he can see the software playing a role in “every area” of basic research because “it provides broad support for processing scRNA-seq data.”

“Processing scRNA-seq data remains challenging today for 2 main reasons: 1) the field has not reached a consensus for what is the best practice; and 2) large volumes of data are computationally challenging to analyze,” continued Hemberg, who was not involved with the SCANPY project. “SCANPY provides a massive step forward, and it makes it much more feasible for researchers to analyze data sets that previously were intractable.”
