
Automating Your Code

It’s been far too long since I’ve posted, but I wanted to share a small piece of how my work has changed over the last year in case my experience ends up being helpful for anyone.

Automated Actions with Code

For a few years now I’ve gotten used to setting up projects as code repositories (by this I really mean GitHub repositories, but I assume there are other providers to host versioned code). The point of these repositories is to make it easy to track changes to a collection of code, even when a group of people is collaborating and updating different parts of the code. For a while I thought that my sophistication with this system would develop mostly in the areas of managing and tracking these code changes (pull requests, working on branches, etc.), but in the last few months my eyes have been opened to a whole new world of automated tests or “actions.”

Of course, services like GitLab, CircleCI, Travis CI, etc. have offered tools for automated code execution for some time, but I never ended up setting those systems up myself and so they seemed a bit too intimidating to start out with. Then, sometime last year GitHub introduced a new part of their website called “Actions” and I started to dive in.

The idea behind these actions is that you can automatically run some computation using the code in your repository. That computation is fairly limited (it can't use many resources or run for very long), but it's more than enough capacity to do some really useful things.

Pipeline Validation

One task I spend time on is building small workflows to help people run bioinformatics analyses. Something that I’ve found very useful with these workflows is that you can set up an action which will automatically run the entire pipeline and report if any errors are encountered. The prerequisite here is that you have to be able to generate a test dataset which the pipeline can process in ~5 minutes, but the benefit is that you can test a range of conditions with your code, and not have to worry that your local environment is misleading you. This is also really nice in case you are making a minor change and don’t want to run any tests locally — all the tests run automatically and you just get a nice email if there are any problems. Peace of mind! Here is an example of the configuration that I used for running this type of testing with a recent repository.
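To give a flavor of what such an action actually executes, here is a minimal sketch of a smoke-test script that CI could invoke. The pipeline entry point, test profile, and expected output path are placeholders for illustration, not the actual configuration.

```python
#!/usr/bin/env python3
"""Minimal CI smoke test: run the pipeline on tiny test data and fail loudly.

Assumes a Nextflow pipeline with a `test` profile and one expected output file;
both names are placeholders, not the configuration referenced above.
"""
import subprocess
import sys
from pathlib import Path

EXPECTED_OUTPUTS = [Path("results/summary.csv")]  # hypothetical output path


def main() -> int:
    # Run the workflow against the small bundled test dataset (~5 minutes)
    proc = subprocess.run(
        ["nextflow", "run", "main.nf", "-profile", "test"],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        print(proc.stdout)
        print(proc.stderr, file=sys.stderr)
        return proc.returncode

    # Check that the expected outputs were actually produced
    missing = [p for p in EXPECTED_OUTPUTS if not p.exists()]
    if missing:
        print(f"Missing expected outputs: {missing}", file=sys.stderr)
        return 1

    print("Pipeline smoke test passed")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```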

Packaging for Distribution

I recently worked on a project for which I needed to wrap up a small script which would run on multiple platforms as a standalone executable. As common as this task is, it’s not something I’ve done frequently, and I wasn’t particularly confident in my ability to cross-compile from my laptop for Windows, Ubuntu, and macOS. Luckily, I was able to figure out how to configure actions which would package up my code for three different operating systems, and automatically attach those executables as assets to tagged releases. This means that all I have to do to release a new set of binaries is push a tagged commit, and everything else is just taken care of for me.
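As a rough sketch of the per-platform build step, assuming the script in question is Python and PyInstaller is the packaging tool (both assumptions, along with the file names), each OS-specific job could run something like this:

```python
#!/usr/bin/env python3
"""Build a single-file executable for whichever platform this job runs on.

Sketch only: the packaging tool, entry point, and artifact naming are
illustrative assumptions, not the setup from the project described above.
"""
import platform
import subprocess
import sys

SCRIPT = "my_tool.py"  # hypothetical entry point


def main() -> None:
    # Name the artifact after the OS so the release assets are distinguishable
    artifact = f"my_tool-{platform.system().lower()}"
    subprocess.run(
        [sys.executable, "-m", "PyInstaller", "--onefile", "--name", artifact, SCRIPT],
        check=True,
    )
    # PyInstaller writes the binary into dist/ (with .exe appended on Windows)
    print(f"Built dist/{artifact}")


if __name__ == "__main__":
    main()
```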

In the end, I think that spending more time in bioinformatics means figuring out all of the things that you don’t actually have to do yourself, and automating them. If you are on the fence, I would highly recommend getting acquainted with an automated testing system like GitHub Actions to see what work you can take off your plate entirely, so you can focus on more interesting things.

Quick note on workflow managers

After having written a pretty negative assessment of the state of the field for workflow managers (those pieces of software which make it easier to run multiple other pieces of software in a controlled, coordinated manner), I’ve been feeling like I needed to put out an update. The field has changed a lot in the last few months, and I’d like to be less out of date.

A Few Good Options

It turns out that there are a few good options out there: workflow managers that don’t take too long to figure out how to use, which have some cloud computing support, and which have growing communities of users. The two best options I’ve seen so far are Cromwell and Nextflow. Nextflow is pretty popular in Europe and Cromwell is developed at the Broad, so both are reasonable options to try out. I’ve been able to get them both up and running without too much work, but there are some inherent challenges with any workflow manager that I think will always present some stumbling blocks.

Issue 1 — Where do you execute your command?

Fundamentally, a workflow manager executes a set of commands, each of which consumes and produces files. However, the operation of executing a command is completely different depending on whether you’re trying to run it on your laptop, your local HPC, Google Cloud, AWS, or Azure. Each of those execution options comes with its own idiosyncratic settings for permissions, authentication, formatting, etc. A big part of getting up and running with any workflow manager is getting all of those settings configured in just the right way. It’s not glamorous, but it’s important and it takes time.
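To make that concrete, here is a minimal sketch of how differently the same command has to be dispatched depending on where it runs; the SLURM flags and AWS Batch parameters are simplified assumptions, not a complete configuration:

```python
import subprocess


def run_local(command: str) -> None:
    # On a laptop, "execute" just means running the shell command directly
    subprocess.run(command, shell=True, check=True)


def run_slurm(command: str, cpus: int = 4) -> None:
    # On an HPC, the same command has to be wrapped in a scheduler submission
    subprocess.run(
        ["sbatch", f"--cpus-per-task={cpus}", "--wrap", command],
        check=True,
    )


def run_aws_batch(command: str, job_queue: str, job_definition: str) -> None:
    # In the cloud, it becomes an API call with its own auth, queues, and quotas
    import boto3  # requires AWS credentials to already be configured

    boto3.client("batch").submit_job(
        jobName="workflow-task",
        jobQueue=job_queue,
        jobDefinition=job_definition,
        containerOverrides={"command": ["bash", "-c", command]},
    )
```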

Issue 2 — When do you execute your command?

A good workflow manager only executes commands when it’s appropriate — when the inputs are available and the outputs haven’t been produced yet. Doing this properly means that you can restart and rerun workflows without duplicating effort, but that also requires that you can keep track of what commands have been run before. This can also require a bit of effort to configure. As an aside, the traditional training path for bioinformatics folks is to start with BASH scripting, where you run a command when the output files don’t already exist. This is not the method that provides the most reproducible results, and it is also not the method used by Nextflow or Cromwell. I believe that this is the Snakemake model, but I have less experience there. Lots of complexity is hidden inside this issue.
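As a minimal sketch of that file-based model (run the command only if the outputs are missing or stale), which Nextflow and Cromwell replace with their own execution caching:

```python
import subprocess
from pathlib import Path


def run_if_needed(command: str, inputs: list[Path], outputs: list[Path]) -> None:
    """Run `command` only if the outputs are missing or older than any input.

    This is the simple Make/Snakemake-style rule described above; Nextflow and
    Cromwell instead keep a record of prior executions so that a workflow can be
    resumed even when file timestamps alone aren't a reliable guide.
    """
    if all(o.exists() for o in outputs):
        newest_input = max((i.stat().st_mtime for i in inputs), default=0.0)
        oldest_output = min(o.stat().st_mtime for o in outputs)
        if oldest_output >= newest_input:
            print(f"Skipping (up to date): {command}")
            return
    subprocess.run(command, shell=True, check=True)
```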

Issue 3 — Where does your data live?

One of the big attractions of a good workflow manager is being able to run the exact same analysis on my laptop, an HPC, or the cloud. However, you really need to have the data live next to the execution environment — it would be insane to download and upload files from my laptop for every single task that’s executed in the cloud. This means that a big part of getting up and running with a cloud-based workflow manager is getting all of your data organized and accessible in the same system where you want to run the tasks. This takes time, and it means that you really have to commit to a model for execution.

Wrapping Up

While this post is pretty meandering and vague, all I mean to add here is that the area of workflow managers is expanding rapidly and lots of good people are doing great development. That said, the endeavor is fundamentally challenging and it will require a good amount of time to configure everything and get up and running. I encourage you to try out the options that exist and share your experiences with the world. This is the way of the future, and it would be great if we built a better future together.

The Rise of the Machines: Workflow Managers for Bioinformatics

As with many things these days, it started with Twitter and it went further than I expected.

The other day I wrote a slightly snarky tweet

There were a handful of responses to this, almost all of them gently pointing out to me that there are a ton of workflow managers out there, some of which are quite good. So, rather than trying to dive further on Twitter (a fool’s errand), I thought I would explain myself in more detail here.

What is “being a bioinformatician”?

“Bioinformatics” is a term that is broadly used by people like me, and really quite poorly defined. In the most literal sense, it is the analysis of data relating to biology or biomedical science. However, the shade of meaning which has emerged emphasizes the scope and scale of the data being analyzed. In 2018, being a bioinformatician means dealing with large datasets (genome sequencing, flow cytometry, RNAseq, etc.) which are made up of a pretty large number of pretty large files. Not only is the data larger than you can fit into Excel (by many orders of magnitude), but it often cannot fit onto a single computer, and it almost always takes a lot of time and energy to analyze.

The aspect of this definition that is useful here is that bioinformaticians tend to

  1. keep track of and move around a large number of extremely large files (>1 GB individually, 100s of GB in aggregate)

  2. analyze those files using a “pipeline” of analytical tools — input A is processed by algorithm 1 to produce file B, which is processed by algorithm 2 to produce file C, etc. etc.

Here’s a good counterpoint that was raised to the above:

Good point, now what is a “Workflow Manager”?

A workflow manager is a very specific thing that takes many different forms. At its core, a workflow manager will run a set of individual programs or “tasks” as part of a larger pipeline or “workflow,” automating a process that would typically be executed by (a) a human entering commands manually into the command line, or (b) a “script” containing a list of commands to be executed. There can be a number of differences between a “script” and a “workflow,” but generally speaking the workflow should be more sophisticated, more transportable between computers, and better able to handle the complexities of execution that would simply result in an error for a script.

This is a very unsatisfying definition, because there isn’t a hard and fast delineation between scripts and workflows, and scripts are practically the universal starting place for bioinformaticians as they learn how to get things done with the command line.

Examples of workflow managers (partially culled from the set of responses I got on Twitter):

My Ideal Workflow Manager

I was asked this question, and so I feel slightly justified in laying out my wishlist for a workflow manager:

  • Tasks consist of BASH snippets run inside Docker containers

  • Supports execution on a variety of computational resources: local computer, local clusters (SLURM, PBS), commercial cloud services (AWS, Google Cloud, Azure)

  • The dependencies and outputs of a task can be defined by the output files created by the task (so a task isn’t re-run if the output already exists)

  • Support for file storage locally as well as object stores like AWS S3

  • Easy to read, write, and publish to a general computing audience (highly subjective)

  • Easy to set up and get running (highly subjective)

The goal here is to support reproducibility and portability, both for other researchers in the field and for your future self, who wants to rerun the same analysis with different samples in a year’s time and doesn’t want to be held hostage to software dependency hell, not to mention the crushing insecurity of not knowing whether new results can be compared to previous ones.

Where are we now?

The state of the field at the moment is that we have about a dozen actively maintained projects working in this general direction. Ultimately, I think the hardest things to achieve are the last two bullets on my list. Adding support for services which are highly specialized (such as AWS) necessarily adds a ton of configuration and execution complexity, which makes it even harder for a new user to pick up and use a workflow that someone hands to them.

Case in point — I like to run things inside Docker containers using AWS Batch, but this requires that all of the steps of a task (copying the files down from S3, running a long set of commands, checking the outputs, and uploading the results back to S3) be encapsulated in a single command. To that end, I have had to write wrapper scripts for each of my tools and bake them into the Docker image so that they can be invoked in a single command. As a result, I’m stuck using the Docker containers that I maintain, instead of an awesome resource like BioContainers. This is highly suboptimal, and it would be difficult for someone else to elaborate on and develop further without completely forking the repo for every single command they want to tweak. Instead, I would much rather have us all contribute to and use BioContainers, with a workflow system that takes care of the complex set of commands executed inside each container.
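For illustration, a stripped-down version of that kind of wrapper might look like the sketch below; the bucket paths and the command itself are placeholders, and a real wrapper also deals with retries, logging, and per-tool quirks:

```python
#!/usr/bin/env python3
"""Sketch of a task wrapper baked into a Docker image for AWS Batch.

Bucket names, local paths, and the analysis command are placeholders used
only to show the copy-down / run / copy-up pattern described above.
"""
import subprocess
import sys


def run(cmd: list[str]) -> None:
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)


def main(input_s3: str, output_s3: str) -> None:
    # 1. Copy the input file down from S3 into the container's scratch space
    run(["aws", "s3", "cp", input_s3, "/scratch/input.fastq.gz"])

    # 2. Run the actual analysis (placeholder command)
    run(["bash", "-c", "gzip -dc /scratch/input.fastq.gz | wc -l > /scratch/output.txt"])

    # 3. Upload the result back to S3
    run(["aws", "s3", "cp", "/scratch/output.txt", output_s3])


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```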

In the end, I have a lot of confidence that the developers of workflow managers are working towards exactly the end goals that I’ve outlined. This isn’t a highly controversial area, it just requires an investment in computational infrastructure that our R&D ecosystem has always underinvested in. If the NIH decided today that they were going to fund the development and ongoing maintenance of three workflow managers by three independent groups (and their associated OSS communities), we’d have a much higher degree of reproducibility in science, but that hasn’t happened (as far as I know — I am probably making a vast oversimplification here for dramatic effect).

Give workflow managers a try, give back to the community where you can, and let’s all work towards a world where no bioinformatician ever has to run BWA by hand and look up which flag sets the number of threads.

Update on Making CAGs, or The Importance of Good Software

I wrote a little bit recently on why I am so interested in making CAGs from metagenomic data, and I wanted to provide a little update on that topic.

CAGs are "Co-Abundant Genes," which is to say the groups of genes which are found at similar abundances in metagenomic data across many different samples. The inference from the "co-abundance" observation is that those genes are likely present on the same bits of DNA (i.e. chromosomes) across those samples. I am particularly interested in these groups of genes because of the highly mosaic nature of microbial genomic evolution.

Having spent a bit of time on maximizing the computational efficiency of finding CAGs, I have been struck by just how hard it is. Naively, the core computational task requires calculating a similarity (or co-abundance) metric for every pair of genes, which scales quadratically with the number of genes. For 100 genes there are ~5,000 comparisons, for 1,000 genes there are ~500,000, for 10,000 genes there are ~50,000,000 comparisons, etc. etc.
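To put the brute-force version in code, here is a sketch of the all-pairs loop on random data, using Pearson correlation as one possible co-abundance metric; the nested loop is exactly the quadratic cost described above, and it is deliberately the slow way of doing things:

```python
import numpy as np

# Abundance table: one row per gene, one column per sample (random data for illustration)
rng = np.random.default_rng(0)
abundances = rng.random((1000, 50))

n_genes = abundances.shape[0]
pairs = []
for i in range(n_genes):
    for j in range(i + 1, n_genes):
        # Pearson correlation between the two genes' abundance profiles
        r = np.corrcoef(abundances[i], abundances[j])[0, 1]
        pairs.append((i, j, r))

# 1,000 genes already means ~500,000 comparisons; 1M genes would be ~500 billion
print(f"{len(pairs)} pairwise comparisons for {n_genes} genes")
```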

Speaking practically, this means that it takes a long time on a very large computer to get this work done. There are all sorts of tricks to make it faster, including parallelization, code optimization, etc., but at the core it is just a hard task.

However, I asked the Twitter hive mind for help, and Andrew Carroll happened to give me some great advice:

This led me down the road of exploring Approximate Nearest Neighbor algorithms. It turns out that there are a number of different algorithms out there which approximate this task of finding the closest set of points in highly dimensional space, which is exactly what I needed to make CAGs. There's even a website showing you which of these pieces of code work the best.

Interestingly, one of these algorithms (called "annoy") was produced by Spotify, and can be used to rapidly find the nearest neighbors from a large index that can be memory-mapped for shared access across multiple threads.

Annoy (Approximate Nearest Neighbors Oh Yeah) 
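As a minimal sketch of how annoy can be applied to this kind of problem (the metric and number of trees below are illustrative choices, not tuned recommendations):

```python
import numpy as np
from annoy import AnnoyIndex  # pip install annoy

# One abundance vector per gene (random data for illustration)
rng = np.random.default_rng(0)
n_genes, n_samples = 10_000, 50
abundances = rng.random((n_genes, n_samples))

# Build the index; "angular" is a cosine-like distance over the abundance profiles
index = AnnoyIndex(n_samples, "angular")
for i, vec in enumerate(abundances):
    index.add_item(i, vec.tolist())
index.build(50)  # number of trees: more trees = better accuracy, bigger index

# The index can be saved to disk and memory-mapped by many worker processes
index.save("genes.ann")

# Query the approximate nearest neighbors of a single gene
neighbors = index.get_nns_by_item(0, 10)
print(neighbors)
```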

In Closing

The point I want to reinforce is that Good Software matters. Implementing an exact solution was taking me 10 days with an expensive 72-core machine, which gave me little opportunity to iterate over different variables or analyze new datasets quickly. The advantage of using a better piece of software is not just that I can make pretty figures; it's that I can actually do science more quickly, spend less money, and get more accurate answers. I didn't know that this software was out there, and so it wasn't until I asked some people on Twitter that I found out about it. But now that I know, I'm happy to tell the world to give it a try, and hopefully that will save you all some time and some money, and help you answer whatever questions are important to you.