Zarr: taking the headache out of massive datasets

My primary focus in the last few years has been trying to make gene-level metagenomic analysis practical for microbiome research. Without going into all of the details, one of the biggest challenges for this type of analysis is that there are a lot of genes in the microbiome, and so the data volume becomes massive.

I had generally thought that I had all my tools working as well as I could (in part by optimizing how to create massive DataFrames in memory), but I was still finding that some steps (such as grouping genes by co-abundance) were always going to require a huge amount of memory and take a really long time to read and write to disk. This didn’t really bother me — my feeling was that a matrix with millions of rows (genes) and thousands of columns (samples) is a ton of data and should be hard to work with.

Then I talked to an earth scientist.

One of the great things about being a scientist is when you get to peek over the wall into another discipline and see that they’ve already solved problems which you didn’t realize could be solved. It turns out that there are scientists who use satellites to take pictures of the earth to learn things about volcanoes, glaciers, global warming, and other incredibly important topics. Instead of taking a single picture, they take many pictures over time and build a three-dimensional volume of signal intensity data which makes my metagenomic datasets seem … modest.

The earth scientist, Scott Henderson, recommended that I check out an emerging software project which is being developed to deal with these large, high-dimensional, numeric matrices — Zarr.

The basic pitch for Zarr is that it is incredibly efficient at reading slices out of N-dimensional cubes. It also has some exciting integration with object storage systems like AWS S3, which I haven’t tried out yet, but I mostly like it because it immediately solved two big problems that were holding me back (and which I won’t describe in depth here). For both of these problems I ended up going down the path of trying out some alternate approaches:

  • Feather: Really fast to read and write, but you have to load the entire table at once

  • HDF5: Wonderful Python integration and support for complex data formats, but not efficient to read slices from arbitrary axes

  • Redis: Great support for caching on keys, but extremely slow to load all the data from each slice

  • Zarr: Fast to write, fast to read, and supports indexing across any axis with no real impact on performance (see the sketch just after this list for what that looks like in practice)
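
If that sounds too good to be true, here is a minimal sketch of the pattern in Python with zarr and numpy. The matrix dimensions, chunk shape, and file name are placeholders I’ve made up for illustration, not values from my actual pipeline.

```python
import numpy as np
import zarr

# Hypothetical sizes for a gene-by-sample abundance matrix
n_genes, n_samples = 1_000_000, 2_000

# Create an on-disk, chunked 2-D array (genes x samples)
z = zarr.open(
    "gene_abundance.zarr",
    mode="w",
    shape=(n_genes, n_samples),
    chunks=(10_000, 100),  # each chunk is a small rectangular block of the matrix
    dtype="f4",
)

# Write a block of abundance values (in real life these would come from
# the upstream quantification step, not a random number generator)
z[:10_000, :] = np.random.random((10_000, n_samples)).astype("f4")

# Read thin slices along either axis without loading the whole matrix
one_gene_across_samples = z[42, :]
one_sample_across_genes = z[:, 7]
```

Because the data lives on disk as many small chunks, pulling out a single row or a single column only touches the chunks that overlap that slice, which is what makes indexing across either axis cheap.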

With any young software project, you can read the docs and say to yourself, “well, that sounds nice, but does it work?” Let me be the voice of experience and tell you that yes, it works. So if my problems sound like your problems, think about giving it a try.