Code Discussion

Microbial Pan-Genome Cartography

Many of the projects that I’ve been working on recently have led me to ask questions like, “what bacteria encode this group of genes?” and “what genes are shared across this group of bacteria?” Even for a single species, the patterns of genes across genomes can be complex and beautiful!

Update: All of the maps shown below can now be viewed interactively here.

To make it easier to generate interactive displays to explore these patterns (which I call “genes-in-genomes maps”), I ended up building out a collection of tools which I would love for any researcher to use.

To explain a bit more about this topic, I recorded a short talk based on a presentation I gave at a local microbiome interest group.

In the presentation I walk through a handful of these pan-genome maps, which you can also download and explore for yourself using the links below.

I’m excited to think that this way of displaying microbial pan-genomes might be useful to researchers who are interested in exploring the diversity of the microbial world. If you’d like to talk more about using this approach in your research, please don’t hesitate to be in touch.

Zarr: taking the headache out of massive datasets

My primary focus in the last few years has been trying to make gene-level metagenomic analysis practical for microbiome research. Without going into all of the details, one of the biggest challenges for this type of analysis is that there are a lot of genes in the microbiome, and so the data volume becomes massive.

I had generally thought that I had all my tools working as well as I could (in part by optimizing how to create massive DataFrames in memory), but I was still finding that some steps (such as grouping genes by co-abundance) were always going to require a huge amount of memory and take a really long time to read and write to disk. This didn’t really bother me — my feeling was that a matrix with millions of rows (genes) and thousands of columns (samples) is a ton of data and should be hard to work with.

Then I talked to an earth scientist.

One of the great things about being a scientist is when you get to peek over the wall into another discipline and see that they’ve already solved problems which you didn’t realize could be solved. It turns out that there are scientists who use satellites to take pictures of the earth to learn things about volcanos, glaciers, global warming, and other incredibly important topics. Instead of taking a single picture, they take many pictures over time and build a three dimensional volume of signal intensity data which makes my metagenomic datasets seem … modest.

The earth scientist, Scott Henderson, recommended that I check out an emerging software project which is being developed to deal with these large, high-dimensional, numeric matrices — Zarr.

The basic pitch for Zarr is that it is incredibly efficient at reading slices out of N-dimensional cubes. It also has some exciting integration with object storage systems like AWS S3 which I haven’t tried out yet, but I mostly like it because it immediately solved two big problems which were holding me back (and which I won’t describe in depth here). For both of these problems I ended up going down the path of trying out some alternate approaches:

  • Feather: Really fast to read and write, but you have to load the entire table at once

  • HDF5: Wonderful Python integration and support for complex data formats, but not efficient to read slices from arbitrary axes

  • Redis: Great support for caching on keys, but extremely slow to load all the data from each slice

  • Zarr: Fast to write, fast to read, and supports indexing across any axis with no impact on performance

With any young software project you can read the docs and say to yourself, “well that sounds nice, but does it work?” Let me be the voice of experience and tell you that yes, it works. So if my problems sound like your problems, think about giving it a try.
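To give a flavor of what that looks like in practice, here is a minimal sketch of the usage pattern I'm describing. The array shape, chunk sizes, and file name are made up for illustration, not taken from my actual pipeline.

import numpy as np
import zarr

# Create an on-disk array of gene abundances: rows are genes, columns are samples.
# The chunking determines which slices are cheap to read back later.
z = zarr.open(
    "gene_abundances.zarr",
    mode="w",
    shape=(100_000, 1_000),   # genes x samples (placeholder sizes)
    chunks=(10_000, 100),     # each chunk covers 10k genes and 100 samples
    dtype="f4",
)

# Write a block of simulated data (in practice this would come from alignment results)
z[:10_000, :100] = np.random.random((10_000, 100))

# Read a slice along either axis without loading the whole array into memory
one_sample = z[:, 42]    # every gene in a single sample
one_gene = z[1_234, :]   # a single gene across every sample
print(one_sample.shape, one_gene.shape)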

Automating Your Code

It’s been far too long since I’ve posted, but I wanted to share a small piece of how my work has changed over the last year in case my experience ends up being helpful for anyone.

Automated Actions with Code

For a few years now I’ve gotten used to setting up projects as code repositories (by this I really mean GitHub repositories, but I assume there are other providers to host versioned code). The point of these repositories is to make it easy to track changes to a collection of code, even when a group of people is collaborating and updating different parts of the code. For a while I thought that my sophistication with this system would develop mostly in the areas of managing and tracking these code changes (pull requests, working on branches, etc.), but in the last few months my eyes have been opened to a whole new world of automated tests or “actions.”

Of course, providers like GitLab, CircleCI, and TravisCI have offered tools for automated code execution for some time, but I never ended up setting those systems up myself, and so they seemed a bit too intimidating to start out with. Then, sometime last year GitHub introduced a new part of their website called "Actions" and I started to dive in.

The idea with these actions is that you can automatically execute some compute using the code in your repository. This compute has to be pretty limited: it can't use too many resources or run for too long, but it's more than enough capacity to do some really useful things.

Pipeline Validation

One task I spend time on is building small workflows to help people run bioinformatics. Something that I’ve found very useful with these workflows is that you can set up an action which will automatically run the entire pipeline and report if any errors are encountered. The prerequisite here is that you have to be able to generate testing data which will run in ~5 minutes, but the benefit is that you can test a range of conditions with your code, and not have to worry that your local environment is misleading you. This is also really nice in case you are making a minor change and don’t want to run any tests locally — all the tests run automatically and you just get a nice email if there are any problems. Peace of mind! Here is an example of the configuration that I used for running this type of testing with a recent repository.

Packaging for Distribution

I recently worked on a project for which I needed to wrap up a small script which would run on multiple platforms as a standalone executable. As common as this task is, it’s not something I’ve done frequently and I wasn’t particularly confident in my ability to cross-compile from my laptop for Windows, Ubuntu, and MacOS. Luckily, I was able to figure out how to configure actions which would package up my code for three different operating systems, and automatically attach those executables as assets to tagged releases. This means that all I have to do to release a new set of binaries is to push a tagged commit, and everything else is just taken care of for me.

In the end, I think that spending more time in bioinformatics means figuring out all of the things which you don’t actually have to do and automating them. If you are on the fence, I would highly recommend getting acquainted with some automated testing system like GitHub Actions to see what work you can take off your plate entirely to focus on more interesting things.

The Rise of the Machines: Workflow Managers for Bioinformatics

As with many things these days, it started with Twitter and it went further than I expected.

The other day I wrote a slightly snarky tweet.

There were a handful of responses to this, almost all of them gently pointing out to me that there are a ton of workflow managers out there, some of which are quite good. So, rather than trying to dive further on Twitter (a fool’s errand), I thought I would explain myself in more detail here.

What is “being a bioinformatician”?

“Bioinformatics” is a term that is broadly used by people like me, and really quite poorly defined. In the most literal sense, it is the analysis of data relating to biology or biomedical science. However, the shade of meaning which has emerged emphasizes the scope and scale of the data being analyzed. In 2018, being a bioinformatician means dealing with large datasets (genome sequencing, flow cytometry, RNAseq, etc.) which are made up of a pretty large number of pretty large files. Not only is the data larger than you can fit into Excel (by many orders of magnitude), but it often cannot fit onto a single computer, and it almost always takes a lot of time and energy to analyze.

The aspect of this definition that is useful here is that bioinformaticians tend to:

  1. keep track of and move around a large number of extremely large files (>1 GB individually, hundreds of GB in aggregate)

  2. analyze those files using a “pipeline” of analytical tools — input A is processed by algorithm 1 to produce file B, which is processed by algorithm 2 to produce file C, etc. etc.

Here’s a good counterpoint that was raised to the above:

Good point, now what is a “Workflow Manager”?

A workflow manager is a very specific thing that takes many different forms. At its core, a workflow manager will run a set of individual programs or “tasks” as part of a larger pipeline or “workflow,” automating a process that would typically be executed by (a) a human entering commands manually into the command line, or (b) a “script” containing a list of commands to be executed. There can be a number of differences between a “script” and a “workflow,” but generally speaking the workflow should be more sophisticated, more transportable between computers, and better able to handle the complexities of execution that would simply result in an error for a script.

This is a very unsatisfying definition, because there isn’t a hard and fast delineation between scripts and workflows, and scripts are practically the universal starting place for bioinformaticians as they learn how to get things done with the command line.

Examples of workflow managers (partially culled from the set of responses I got on Twitter):

My Ideal Workflow Manager

I was asked this question, and so I feel slightly justified in laying out my wishlist for a workflow manager:

  • Tasks consist of BASH snippets run inside Docker containers

  • Supports execution on a variety of computational resources: local computer, local clusters (SLURM, PBS), commercial clusters (AWS, Google Cloud, Azure)

  • The dependencies and outputs of a task can be defined by the output files created by the task, so a task isn’t re-run if the output already exists (see the sketch after this list)

  • Support for file storage locally as well as object stores like AWS S3

  • Easy to read, write, and publish to a general computing audience (highly subjective)

  • Easy to set up and get running (highly subjective)
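To make the third bullet a little more concrete, here is a toy sketch of output-based task skipping. This is not the API of any real workflow manager, just an illustration of the behavior I'm asking for, written in plain Python with placeholder commands.

import os
import subprocess

def run_task(command, output_path):
    """Run a shell command only if its expected output file doesn't already exist."""
    if os.path.exists(output_path):
        print(f"Skipping task, output already exists: {output_path}")
        return
    subprocess.run(command, shell=True, check=True)
    assert os.path.exists(output_path), f"Task did not produce {output_path}"

# A tiny two-step "pipeline": each step declares the file it is expected to create,
# so re-running the script only executes the steps whose outputs are missing
run_task("gzip -k input.fastq", "input.fastq.gz")
run_task("md5sum input.fastq.gz > input.fastq.gz.md5", "input.fastq.gz.md5")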

The goal here is to support reproducibility and portability, not just for other researchers in the field, but also for your future self who wants to rerun the same analysis with different samples in a year’s time and doesn’t want to be held hostage to software dependency hell, not to mention the crushing insecurity of not knowing whether new results can be compared to previous ones.

Where are we now?

The state of the field at the moment is that we have about a dozen actively maintained projects that are working in this general direction. Ultimately I think the hardest thing to achieve is the last two bullets on my list. Adding support for services which are highly specialized (such as AWS) necessarily adds a ton of configuration and execution complexity that makes it even harder for a new user to pick up and use a workflow that someone hands to them.

Case in point — I like to run things inside Docker containers using AWS Batch, but this requires that all of the steps of a task (copying the files down from S3, running a long set of commands, checking the outputs, and uploading the results back to S3) be encapsulated in a single command. To that end, I have had to write wrapper scripts for each of my tools and bake them into the Docker image so that they can be invoked in a single command. As a result, I’m stuck using the Docker containers that I maintain, instead of an awesome resource like BioContainers. This is highly suboptimal, and it would be difficult for someone else to elaborate and develop further without completely forking the repo for every single command they want to tweak. Instead, I would much rather that we could all contribute to and use BioContainers, with a workflow system that took care of the complex set of commands executed inside each container.
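For context, those wrapper scripts boil down to something like the sketch below. This is a simplified illustration with placeholder bucket names, paths, and tool command, not the actual code baked into my images.

import subprocess

def run(cmd):
    print(f"Running: {cmd}")
    subprocess.run(cmd, shell=True, check=True)

def wrapper(input_uri, output_uri):
    # 1. Copy the input file down from S3
    run(f"aws s3 cp {input_uri} /scratch/input.fastq.gz")

    # 2. Run the actual analysis inside the container (placeholder command)
    run("some_tool --input /scratch/input.fastq.gz --output /scratch/result.tsv")

    # 3. Check that the expected output exists and is non-empty, then upload it back to S3
    run("test -s /scratch/result.tsv")
    run(f"aws s3 cp /scratch/result.tsv {output_uri}")

if __name__ == "__main__":
    wrapper("s3://my-bucket/inputs/sample1.fastq.gz",
            "s3://my-bucket/outputs/sample1.tsv")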

In the end, I have a lot of confidence that the developers of workflow managers are working towards exactly the end goals that I’ve outlined. This isn’t a highly controversial area, it just requires an investment in computational infrastructure that our R&D ecosystem has always underinvested in. If the NIH decided today that they were going to fund the development and ongoing maintenance of three workflow managers by three independent groups (and their associated OSS communities), we’d have a much higher degree of reproducibility in science, but that hasn’t happened (as far as I know — I am probably making a vast oversimplification here for dramatic effect).

Give workflow managers a try, give back to the community where you can, and let’s all work towards a world where no bioinformatician ever has to run BWA by hand and look up which flag sets the number of threads.

Update on Making CAGs, or The Importance of Good Software

I wrote a little bit recently on why I am so interested in making CAGs from metagenomic data, and I wanted to provide a little update on that topic.

CAGs are "Co-Abundant Genes," which is to say the groups of genes which are found at similar abundances in metagenomic data across many different samples. The inference from the "co-abundance" observation is that those genes are likely present on the same bits of DNA (i.e. chromosomes) across those samples. I am particularly interested in these groups of genes because of the highly mosaic nature of microbial genomic evolution.

Having spent a bit of time on maximizing the computational efficiency of finding CAGs, I have been struck by just how hard it is. Naively, the core computational task requires calculating a similarity (or co-abundance) metric for every pair of genes, which scales quadratically with the number of genes. For 100 genes there are ~5,000 comparisons, for 1,000 genes there are ~500,000, for 10,000 genes there are ~50,000,000 comparisons, etc. etc.

Speaking practically, this means that it takes a long time on a very large computer to get this work done. There are all sorts of tricks to make it faster, including parallelization, code optimization, etc., but at the core it is just a hard task.

However, I happened to ask the Twitter hive mind for help, and Andrew Carroll happened to give me some great advice:

This led me down the road of exploring Approximate Nearest Neighbor algorithms. It turns out that there are a number of different algorithms out there which approximate this task of finding the closest set of points in highly dimensional space, which is exactly what I needed to make CAGs. There's even a website showing you which of these pieces of code work the best.

Interestingly, one of these algorithms (called "annoy") was produced by Spotify, and can be used to rapidly find the nearest neighbor from a large index that can be mmapped for rapid access across multiple threads.

Annoy (Approximate Nearest Neighbors Oh Yeah) 
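Here is a minimal sketch of what using annoy looks like for this kind of problem. The data is random and the parameters are arbitrary, just to show the shape of the API.

import numpy as np
from annoy import AnnoyIndex

n_genes, n_samples = 1_000, 50
abundances = np.random.random((n_genes, n_samples))

# Build an index over the gene abundance vectors using the angular (cosine-like) metric
index = AnnoyIndex(n_samples, "angular")
for i in range(n_genes):
    index.add_item(i, abundances[i].tolist())
index.build(10)  # 10 trees; more trees means better accuracy and a larger index

# For a given gene, fetch its 10 approximate nearest neighbors:
# candidate members of the same co-abundant gene group
neighbors = index.get_nns_by_item(0, 10)
print(neighbors)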

In Closing

The point I want to reinforce is that Good Software matters. Implementing an exact solution was taking me 10 days with an expensive 72-core machine, which gave me little opportunity to iterate over different variables or analyze new datasets quickly. The advantage of using a better piece of software is not just that I can make pretty figures, it's that I can actually do science more quickly, spending less money, and getting more accurate answers. I didn't know that this software was out there, and so it wasn't until I asked some people on Twitter that I found out about it. But now that I know, I'm happy to tell the world to give it a try, and hopefully that will save you all some time and some money, and help you find the answer to whatever questions are important to you.

Detecting viruses from short-read metagenomic data

My very first foray into the microbiome was in graduate school when my advisor (Dr. Rick Bushman) suggested that I try to use a high-throughput genome sequencing instrument (a 454 pyrosequencer) to characterize the viruses present in the healthy human gut. That project ended up delving deeply into the molecular evolution of bacteriophage populations, and gave me a deep appreciation for the unpredictable complexity of viral genomes. 

While I wasn't the first to publish in this area, the general approach has become more popular over the last decade and a number of different groups have put their own spin on it. 

Because of the diversity of viruses, you can always customize and enhance different aspects of the process, from sample collection, purification, and DNA isolation to metagenomic sequencing, bioinformatic analysis, and visualization. It's been very interesting to see how sophisticated some of these methods have become.

On the bioinformatic analysis side of things, I ended up having the most success in those days by assembling each viral genome from scratch and measuring the evolution of that genome over time. More of the bespoke approach to bioinformatics, rather than the assembly line. 

In contrast, these days I am much more interested in computational approaches that allow me to analyze very large numbers of samples with a minimum of human input required. For that reason I don't find de novo assembly to be all that appealing. It's not that the computation can't be done, it's more that I have a hard time imagining how to wrap my brain around the results in a productive way.

One approach that I have been quite happy with, in contrast, is much more simple-minded. Instead of trying to assemble all of the new genomes in a sample, it's much easier to simply align your short reads against a reference database of viral genomes. One of the drawbacks is that you can only detect a virus that has been detected before. On the other hand, one of the advantages is also that you can only detect a virus that has been detected before, meaning that all samples can be rapidly and intuitively compared against each other.

To account for the rapid evolution of viral genomes, I think it's best to do this alignment in protein space against the set of proteins contained in every viral genome. This makes the alignments a bit more robust. 
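As a rough illustration (and not the actual code in the repository linked below), here is how you might tally reads per viral protein from a BLAST- or DIAMOND-style tabular alignment file, assuming the standard 12-column output format.

import pandas as pd

# Standard 12-column tabular alignment output ("outfmt 6"-style)
columns = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
alignments = pd.read_csv("reads_vs_viral_proteins.tsv", sep="\t", names=columns)

# Keep each read's single best alignment (highest bitscore), then count
# how many reads were assigned to each viral protein
best_hits = (
    alignments.sort_values("bitscore", ascending=False)
              .drop_duplicates("qseqid")
)
reads_per_protein = best_hits["sseqid"].value_counts()
print(reads_per_protein.head())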

If you would like to perform this simple read alignment method for viral detection, you can use the code that I've put together at this GitHub repo. There is also a Quay repository hosting a Docker image that you can use to run this analysis without having to install any dependencies. 

This codebase is under active development (version 0.1 at time of writing) so please be sure to test carefully against your controls to make sure everything is behaving well. At some point I may end up publishing more about this method, but it may be just too simple to entice any journal. 

Lastly, I want to point out that straightforward alignment of reads does not account for any number of confounding factors, including but not limited to:

  1. presence of human or host DNA
  2. shared genome segments between extant viruses
  3. novel viral genomes
  4. complex viral quasi-species
  5. integration of prophages into host genomes

There are a handful of tools out there that do try to deal with some of those problems in different ways, quite likely to good effect. However, it's good to remember that with every additional optimization you add a potential confounding factor. For example, it sounds like a good idea to remove human sequences from your sample, but that runs the risk of also eliminating viral sequences that happen to be found with the human genome, such as lab contaminants or integrated viral genome fragments. There are even a few human genes with deep homology to existing viral genes, thought to be due to an ancient integration and subsequent repurposing. All I mean to say here is that sometimes it's better to remove the assumptions from your code, and instead include in your experiment a good set of controls that can be used to robustly eliminate signal coming from, for example, the human genome.

Making sparse DataFrames efficiently

Here's a little story about a computational challenge that I came across recently, and how a little bit of performance profiling went a long way to improve my analysis.

Background

I've been working with a large amount of data recently, and my principal bottleneck hasn't been the raw sequence analysis, but rather analyzing the results across large numbers of samples. For context, I'm working with thousands of samples, each of which contains 10,000 - 10,000,000 observations. The original data was terabytes of raw genomic sequence, so I'm definitely working with a more compact data structure now, but it's still quite a bit of data to work with in memory.

There have been a few different iterations of this workflow, starting with working entirely in memory (which started swapping), then moving to a PostgreSQL database (which was too slow for reads and writes), and now working in memory on a larger machine using sparse matrices (following a helpful suggestion from Jonathan Golob).

What I've really liked about these sparse matrices is that it's incredibly fast to slice the data in different ways, which is what I need to be able to figure out both (a) how similar samples are to each other and (b) how different observations are distributed across samples. In other words, I need to be able to retrieve the data quickly in a column-wise or row-wise manner and sparse matrices really help with this.  Also, my underlying data is very sparse, which is to say that a vast majority of the cells in my table are 0 or N/A. Lastly, they're supported in Pandas, which is really great. 

The Challenge

While it's really quick to query these sparse matrices, I've been finding it difficult to make them. This morning I did a bit of performance profiling to compare a few different ways of constructing these tables and I found that not all construction methods are created equal. You can find my profiling code here, if you're interested (this was run in Pandas 0.20.3). 

The data that I'm working with is a set of observations that are originally constructed in longitudinal format, with every observation as an item in a list that includes the row name, column name, and cell value. What I wanted to see was what the most efficient method would be to make this into a sparse table in wide format, which could be indexed and sliced easily by row or column. 

Options

  1. Sparse DataFrame from nested dictionaries: Make a nested dictionary with the first level of keys as column labels and the second level of keys as row labels, and then convert directly to the SparseDataFrame.

  2. Column-wise addition to sparse DataFrame: Make an empty SparseDataFrame and then sequentially add in the data from each column. 

  3. Dense longitudinal, pivot, then make sparse: Make a dense DataFrame from the longitudinal data, pivot it in dense format, and then make it into a SparseDataFrame.

  4. Sparse longitudinal DataFrame, then pivot: Make a SparseDataFrame in longitudinal format and then pivot it into wide format (a conceptual sketch of this approach follows this list).
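To make the longitudinal-to-wide conversion concrete, here is a sketch of the same idea using scipy.sparse coordinate matrices. This is not the pandas SparseDataFrame code that I actually profiled, just the underlying concept behind option #4.

import pandas as pd
from scipy.sparse import coo_matrix

# Longitudinal observations: one (sample, gene, abundance) record per row
long_df = pd.DataFrame({
    "sample": ["s1", "s1", "s2", "s3"],
    "gene":   ["gA", "gB", "gA", "gC"],
    "value":  [0.5, 0.1, 0.3, 0.9],
})

# Map the sample and gene labels to integer row/column indices
rows = long_df["sample"].astype("category")
cols = long_df["gene"].astype("category")

# Build the wide sparse matrix in one shot from (row, column, value) triples
wide = coo_matrix(
    (long_df["value"], (rows.cat.codes, cols.cat.codes)),
    shape=(len(rows.cat.categories), len(cols.cat.categories)),
)

# CSR is fast for row-wise (per-sample) slices, CSC for column-wise (per-gene) slices
per_sample = wide.tocsr()[0, :]
per_gene = wide.tocsc()[:, 0]
print(per_sample.toarray(), per_gene.toarray())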

Results

I tested each of the above options using four different matrices of varying sizes:

Matrix: 60 rows, 67 cols, 2.463% full

1. Testing sparse DataFrame from nested dictionaries
29.6 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

2. Testing columnwise addition to sparse DataFrame
41.5 ms ± 813 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

3. Testing make dense DataFrame from longitudinal format, pivot it, then convert to sparse
33.1 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

4. Testing make sparse DataFrame from longitudinal format, then pivot it
19 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


Matrix: 100 rows, 100 cols, 9.54% full

1. Testing sparse DataFrame from nested dictionaries
61.2 ms ± 8.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

2. Testing columnwise addition to sparse DataFrame
97.7 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

3. Testing make dense DataFrame from longitudinal format, pivot it, then convert to sparse
57.3 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

4. Testing make sparse DataFrame from longitudinal format, then pivot it
101 ms ± 4.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Matrix: 100 rows, 632 cols, 1.574% full

1. Testing sparse DataFrame from nested dictionaries
1.29 s ± 50.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

2. Testing columnwise addition to sparse DataFrame
1.41 s ± 40.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

3. Testing make dense DataFrame from longitudinal format, pivot it, then convert to sparse
1.19 s ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

4. Testing make sparse DataFrame from longitudinal format, then pivot it
94 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Matrix: 100 rows, 1000 cols, 9.526% full

1. Testing sparse DataFrame from nested dictionaries
2.82 s ± 29.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

2. Testing columnwise addition to sparse DataFrame
3.15 s ± 81.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

3. Testing make dense DataFrame from longitudinal format, pivot it, then convert to sparse
2.8 s ± 72.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

4. Testing make sparse DataFrame from longitudinal format, then pivot it
826 ms ± 30.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Summary

In the results above you can see that in general the fastest method was #4 – making the DataFrame sparse in longitudinal format, and then pivoting it into wide matrix format. However, I need to stress that the relative performance of each of these methods varied widely depending on how sparse the matrix was. In general, option #4 worked better with sparser datasets, but I haven't fully tested the range of parameter space varying the number of rows, columns, and values. 

A larger point that I could also make here is that picking the right data structure for your project is actually a pretty tricky problem. I found here that the advantages of sparse matrices were immense relative to relational databases, but they come with an up-front cost of matrix construction. I'm a biologist, not a computational scientist, and so the only way that I find my way through is by talking to smart people and incorporating their suggestions wherever appropriate. If you come up against a similar problem dealing with largish data, just take a deep breath, ask for help, and then write some tests.