Automating Your Code

It’s been far too long since I’ve posted, but I wanted to share a small piece of how my work has changed over the last year in case my experience ends up being helpful for anyone.

Automated Actions with Code

For a few years now I’ve gotten used to setting up projects as code repositories (by this I really mean GitHub repositories, but I assume there are other providers to host versioned code). The point of these repositories is to make it easy to track changes to a collection of code, even when a group of people is collaborating and updating different parts of the code. For a while I thought that my sophistication with this system would develop mostly in the areas of managing and tracking these code changes (pull requests, working on branches, etc.), but in the last few months my eyes have been opened to a whole new world of automated tests or “actions.”

Of course, providers like GitLab, CircleCI, TravisCI, etc. have been providing tools for automated code execution for some time, but I never ended up setting those systems up myself and so they seemed a bit too intimidating to start out with. Then, sometime last year GitHub introduced a new part of their website called “Actions” and I started to dive in.

The idea with these actions is that you can automatically execute some compute using the code in your repository. This compute has to be pretty limited: it can’t use many resources and can’t run for very long, but it’s more than enough capacity to do some really useful things.

Pipeline Validation

One task I spend time on is building small workflows to help people run bioinformatics. Something that I’ve found very useful with these workflows is that you can set up an action which will automatically run the entire pipeline and report if any errors are encountered. The prerequisite here is that you have to be able to generate testing data which will run in ~5 minutes, but the benefit is that you can test a range of conditions with your code, and not have to worry that your local environment is misleading you. This is also really nice in case you are making a minor change and don’t want to run any tests locally — all the tests run automatically and you just get a nice email if there are any problems. Peace of mind! Here is an example of the configuration that I used for running this type of testing with a recent repository.
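To make the idea concrete, here is a minimal Python sketch of the kind of test script an action could invoke. It is not the configuration mentioned above: the function names (`make_test_fasta`, `run_pipeline`, `check_pipeline`) are hypothetical, and the "pipeline" is a toy stand-in that just counts records and bases. The point is only the shape: generate small deterministic test data, run the workflow, and fail loudly if the result is not what you expect.

```python
import random
import tempfile
from pathlib import Path


def make_test_fasta(path, n_records=5, length=50, seed=0):
    """Write a tiny, deterministic FASTA file to use as test input."""
    rng = random.Random(seed)
    with open(path, "w") as handle:
        for i in range(n_records):
            seq = "".join(rng.choice("ACGT") for _ in range(length))
            handle.write(f">read_{i}\n{seq}\n")


def run_pipeline(fasta_path):
    """Hypothetical stand-in for a real workflow: count records and bases."""
    n_records, n_bases = 0, 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                n_records += 1
            else:
                n_bases += len(line.strip())
    return {"records": n_records, "bases": n_bases}


def check_pipeline():
    """Run the pipeline on generated data and raise if the output is wrong."""
    with tempfile.TemporaryDirectory() as tmp:
        fasta = Path(tmp) / "test.fasta"
        make_test_fasta(fasta, n_records=5, length=50)
        result = run_pipeline(fasta)
    # A failed assertion exits with a non-zero status, which is what a
    # continuous-integration system reports as a failed run.
    assert result == {"records": 5, "bases": 250}, result
    return result


if __name__ == "__main__":
    check_pipeline()
    print("pipeline test passed")
```

In a real repository the stand-in would be replaced by a call to the actual workflow, and the check would compare against outputs you trust.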

Packaging for Distribution

I recently worked on a project for which I needed to wrap up a small script which would run on multiple platforms as a standalone executable. As common as this task is, it’s not something I’ve done frequently and I wasn’t particularly confident in my ability to cross-compile from my laptop for Windows, Ubuntu, and macOS. Luckily, I was able to figure out how to configure actions which would package up my code for three different operating systems, and automatically attach those executables as assets to tagged releases. This means that all I have to do to release a new set of binaries is to push a tagged commit, and everything else is just taken care of for me.

In the end, I think that spending more time in bioinformatics means figuring out all of the things which you don’t actually have to do and automating them. If you are on the fence, I would highly recommend getting acquainted with some automated testing system like GitHub Actions to see what work you can take off your plate entirely to focus on more interesting things.

Quality and Insights – The Human Gut Virome in 2019

There were a couple of good virome papers I read this week, and I thought it was worth commenting on the juxtaposition.

Virome — The collection of viruses which are found in a complex microbial community, such as the human microbiome. NB: Most viruses found in any environment are bacteriophages — the viruses which infect bacteria, and do not infect humans.

Measuring Quality, Measuring Viruses

https://www.nature.com/articles/s41587-019-0334-5

I was excited to see two of my favorite labs collaborating on a virome data quality project: the Bushman lab at the University of Pennsylvania (where I trained) and the Segata lab at the University of Trento (who originally made MetaPhlAn, the first breakthrough software tool for microbial metagenomics). The goal of this work was to measure the quality of virome research projects.

Virome research projects over the last decade have relied on a technological approach in which viral particles are physically isolated from a complex sample and then loaded onto a genome sequencer. There are a variety of experimental approaches which you can use to isolate viruses, including size filtration and density gradient centrifugation, which rely on the fact that viruses are physically quite different from bacterial cells.

The question asked by the researchers in this study was, “How well are these physical isolation methods actually working?” It’s such a good question that I’m surprised (in retrospect) that nobody had asked it before. As someone who has worked a bit in this area, I’m also surprised that I never thought to ask this question before.

Their approach was nice and straightforward: they looked in these datasets for sequences that should rarely be found in a clean viral preparation, namely those belonging to the bacterial ribosome, whose absence is almost required in order to be considered a virus.

They found that the quality of these virome datasets varied extremely widely. You can read the paper for more details, and I am hesitant to post the figures on a public blog, but I really did not expect to see that there were published virome datasets with proportions of ribosomal sequences ranging from 0.001% all the way up to 1%.
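The arithmetic behind that kind of check is simple enough to sketch. The snippet below is only an illustration of the idea (the fraction of reads assigned to bacterial ribosomal genes), not the authors' tool; the function name and the read counts are made up for the example.

```python
def ribosomal_percent(ribosomal_reads, total_reads):
    """Percentage of reads assigned to bacterial ribosomal genes."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= ribosomal_reads <= total_reads:
        raise ValueError("ribosomal_reads must be between 0 and total_reads")
    return 100.0 * ribosomal_reads / total_reads


# Made-up read counts for two hypothetical virome datasets.
datasets = {
    "clean_prep": (10, 1_000_000),
    "contaminated_prep": (10_000, 1_000_000),
}

for name, (ribo, total) in datasets.items():
    print(f"{name}: {ribosomal_percent(ribo, total):.3f}% ribosomal reads")
```

The two hypothetical datasets sit at the two ends of the range quoted above, which is a thousand-fold difference in apparent bacterial contamination.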

Take Home: When you use a laboratory method to study your organism of interest, you need to use some straightforward approach for proving to yourself and others that it is actually working as you expect. For something as challenging and complex as the human virome, this new QC tool might help the field maintain a high standard of quality and avoid misleading or erroneous results.

Viral Dark Matter in IBD

https://www.sciencedirect.com/science/article/pii/S1931312819305335

One of the best talks I saw at ASM Microbe 2019 was from Colin Hill (APC Microbiome Ireland) and so I was happy to see a new paper from that group analyzing the gut virome in the context of Inflammatory Bowel Disease. I was even more gratified to read the abstract and see some really plausible and defensible claims being made in an area which is particularly vulnerable to over-hype.

No Change in Richness in IBD: Microbiome researchers talk a lot about “richness,” which refers to the total number of distinct organisms in a community. This metric can be particularly hard to nail down with viruses because they are incredibly diverse and we have a hard time counting how many “types” there are (or even what being a “type” means). In this paper they used the very careful approach of re-assembling all viral genomes from scratch, rather than comparing against an existing database, and found that there was no difference in the richness of the virome in IBD vs. non-IBD samples. When others have analyzed the same data with methods that relied on reference databases, they found a significant difference in richness, which suggests that the database was confounding the results for those prior studies.
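As a toy illustration of how a reference database can distort a richness estimate, consider the sketch below. It is not the method used in the paper; the contig names, counts, and the two helper functions are invented for the example. Richness is simply the number of distinct entries with a non-zero count, and anything missing from the reference is silently dropped in the database-dependent version.

```python
def richness(counts):
    """Number of distinct viral contigs observed at least once."""
    return sum(1 for n in counts.values() if n > 0)


def reference_richness(counts, reference):
    """Richness when only contigs present in a reference database are counted."""
    return sum(1 for name, n in counts.items() if n > 0 and name in reference)


# Hypothetical sample: two known phages plus contigs with no database match.
sample = {"phage_A": 120, "phage_B": 4, "novel_contig_1": 37, "novel_contig_2": 0}
reference = {"phage_A", "phage_B"}

print("de novo richness:", richness(sample))
print("reference-based richness:", reference_richness(sample, reference))
```

If the share of unmatched viruses differs between patient groups, the reference-based number can differ between groups even when the de novo number does not.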

Changes in Viruses Reflect Bacteria: The authors state that “the changes in virome composition reflected alterations in bacterial composition,” which resonated with me so strongly that I think it merits mentioning again here. Viruses tend to be extremely specific and only infect a subset of strains within a species of bacteria. They are also so diverse that it is hard to even figure out which virus is infecting which bacterium. Therefore, with our current level of understanding and technology, viruses in the human gut are really best approached as a marker of what bacterial strains are present. It’s hard to get anything more concrete than that from sequencing-based approaches, except with some specific examples of well-understood viruses. With that limitation of our knowledge in mind, it is entirely expected that changes in bacteria would be reflected in changes in their viruses. Moreover, in this type of observational study we don’t have any way to figure out which direction the arrow of causality is pointing. I think the authors did a great job of keeping their claims to the limits of our knowledge without over-hyping the results.

There is a lot more to this paper, so I encourage you to read it in more depth and I won’t claim to make a full summary here.

In Summary: This is a fascinating field, with some really great groups doing careful and important work. We know a lot about how little we know, which means that there are even more exciting discoveries on the horizon.

Shiny Microbiome Analysis

Are you a microbiome researcher? Did you do a microbiome experiment? Do you want a quick, first-pass analysis to find out what happened? This post is for you.

When I talk to my collaborators, I find that there is a point people encounter where they have a 16S dataset and they want a quick look at what happened. This is not to generate publication-quality figures and not to take the place of a real-life statistician, but just a rough pass over the data. To help make this possible, I worked with a talented student named Will Frohlich and made a small shinyApp for this first-pass microbiome analysis.

Where To Find It

You can access the app at https://shinymicrobiome.fredhutch.org. The code for the app, example data, and documentation can be found at https://github.com/FredHutch/shinyMicrobiomeAnalysis.

What It Does

The shinyMicrobiome app only does a few basic things:

  • Plot stacked bar graphs showing relative abundance of taxa over samples

  • Plot the number of reads per sample, or group of samples

  • Plot the estimated total number of taxa (using breakaway)

  • Calculate and plot differential abundance by sample group (using corncob)

Disclaimers: Note that the differential abundance calculation is a very naive implementation of a sophisticated tool (corncob), and you will almost certainly get a more accurate answer by running corncob yourself and selecting parameters as appropriate for your study design. Also note that there is no False Discovery Rate correction in the app.
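If you export the p-values and want a rough correction for multiple testing, the Benjamini-Hochberg procedure is a common choice. The sketch below is a plain-Python version written for illustration; it is not part of the app or of corncob, and the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, one per taxon.
p_values = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(p_values, benjamini_hochberg(p_values)):
    print(f"p = {p:.3f}  adjusted = {q:.3f}")
```

Taxa whose adjusted value falls below your chosen threshold are the ones worth a closer look with a properly specified model.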

What You Need

The full description of input data for the app can be found in the GitHub repo, but the short description is that you will need:

  • A metadata sheet (in CSV format) describing what groups (treatment/control, etc) each sample is in. You must have the first column labeled “name” with the sample name, and then you can have as many additional columns as you like.

  • A taxon table (in CSV format) with the number of 16S reads assigned to each taxon for each sample. Each taxon is a row and the first column must be named “tax_name” with the name of the taxon. Each sample is a column, and the name of those columns must match the sample names in the metadata sheet.

Example data with this format can be found in the GitHub repo.
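Before uploading, it can help to check that the two files agree with each other. The following is a minimal sketch that assumes only the rules stated above (first metadata column named “name”, first taxon table column named “tax_name”, and matching sample names). The function names and the inline example data are hypothetical, and the CSV content is passed as text to keep the example self-contained.

```python
import csv
import io


def read_sample_names(metadata_csv):
    """Return the sample names listed in the metadata sheet."""
    reader = csv.DictReader(io.StringIO(metadata_csv))
    if not reader.fieldnames or reader.fieldnames[0] != "name":
        raise ValueError("first metadata column must be named 'name'")
    return [row["name"] for row in reader]


def read_taxon_samples(taxon_csv):
    """Return the sample names used as column headers in the taxon table."""
    header = next(csv.reader(io.StringIO(taxon_csv)), [])
    if not header or header[0] != "tax_name":
        raise ValueError("first taxon table column must be named 'tax_name'")
    return header[1:]


def check_inputs(metadata_csv, taxon_csv):
    """Report sample names that appear in one file but not the other."""
    meta = set(read_sample_names(metadata_csv))
    taxa = set(read_taxon_samples(taxon_csv))
    return {
        "missing_from_taxon_table": sorted(meta - taxa),
        "missing_from_metadata": sorted(taxa - meta),
    }


# Hypothetical example data.
metadata = "name,group\nS1,treatment\nS2,control\nS3,control\n"
taxon_table = "tax_name,S1,S2,S4\nBacteroides,10,20,5\nPrevotella,0,3,8\n"
print(check_inputs(metadata, taxon_table))
```

Both lists should be empty before you expect the plots to make sense.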

Need Help With Raw 16S Data?

Do you have raw 16S data which you would like to transform into these read-count taxon tables? The MaLiAmPi pipeline created by Jonathan Golob is a great way to process 16S data and make these sorts of tables. If you do use MaLiAmPi, the wide-form tables it produces (found in classify/tables/tallies_wide.genus.csv) are properly formatted for use as taxon tables by this app.

Having Problems With The App?

This app is very much a work-in-progress. Please don’t hesitate to reach out if you have any problems, or just file an issue if you think there are some bugs which might be impacting other people. However, there are many things that are really never going to work well for a simple app like this – axis labels will be misplaced, legends may overlap the plot area, statistical tests won’t be exactly suited for your experiment, etc. As long as you approach this app with those expectations, you may end up finding it to be useful.

3D structures of gut bacteria and the human immune system

When I talk to people about my work I sometimes get the question, “Do you really think that the microbiome has a direct effect on human health?” It’s a completely understandable question – the study-of-the-week which makes it into the news cycle tends to just confirm what we already know about the importance of diet and exercise. Then I come across these beautiful papers that show just how intimately connected we are with our gut bacteria. Here’s a good example, and it even comes with a video.

Ladinsky, M.S., et al. Endocytosis of commensal antigens by intestinal epithelial cells regulates mucosal T cell homeostasis. Science. 363(6431). DOI: 10.1126/science.aat4042.

There are some beautiful illustrations and graphics in this paper which I won’t reproduce here, but which I hope you can access from whichever side of the paywall you are on.

Background: Researchers are continuing to find evidence that the type of bacteria in your gut (if you are a mouse or a human) influences the type of your immune response. If you don’t study the immune system, just remember that the immune system responds in different ways to different kinds of pathogens – viruses are different from bacteria, which are different from parasites, etc. Mounting the correct type of response is essential, and it seems that which bacteria you have in your gut has some influence over the nature of those responses.

The Gist: This study focused on the how of the question, the specific molecular mechanism which would explain this observed relationship between bacteria and the immune system. They used one particular type of bacteria (segmented filamentous bacteria, or “SFB”) and showed that these bacteria get so close to human cells that bacterial proteins are actually taken up and can be found inside the human cells. In addition, this movement of bacterial proteins inside human cells causes a shift in the type of response mounted by the immune system.

What Caught My Eye: This paper has a video showing a protrusion of a bacterial cell pushing deep into a human cell, complete with a 3D reconstruction of the physical structure using electron tomography. If you can follow the link above and make it to the video, I highly recommend taking a look.

The biggest story for me in the microbiome these days is that there are a number of great researchers who are starting to figure out some of the specific molecular mechanisms by which the microbiome may influence human health. This makes me more and more optimistic and excited that we will see a day where microbiome-based therapeutics make it into the clinic, which could have a profound impact on a broad range of diseases, from inflammatory bowel disease to colorectal cancer and auto-inflammatory disease. It is exciting to be a part of this effort and try to help as we bring that day closer.

Molecules Mediating Microbial Manipulation of Mouse (and Human) Maladies

Sometime in the last ten years I gave up on the idea of truly keeping up with the microbiome field. In graduate school it was more reasonable because I had the luxury of focusing on viruses in the microbiome, but since then my interests have broadened and the size of the field has continued to expand. These days I try to focus on the subset of papers which are telling the story of either gene-level metagenomics, or the specific metabolites which mediate the biological effect of the microbiome on human health. The other day I happened across a paper which did both, and so I thought it might be worth describing it quickly here.

Brown, EM, et al. Bacteroides-Derived Sphingolipids Are Critical for Maintaining Intestinal Homeostasis and Symbiosis. Cell Host & Microbe 2019 25(5)

As a human, my interest is drawn by stories that confirm my general beliefs about the world, and do so with new specific evidence. Of course this is the fallacy of confirmation bias, but it’s also an accurate description of why this paper caught my eye.

The larger narrative that I see this paper falling into is the one which says that microbes influence human health largely because they produce a set of specific molecules which interact with human cells. By extension, if you happen to have a set of microbes which cannot produce a certain molecule, then your health will be changed in some way. This narrative is attractive because it implies that if we understand which microbes are making which metabolites (molecules), and how those metabolites act on us, then we can design a therapeutic to improve human health.

Motivating This Study

Jumping into this paper, the authors describe a recently emerging literature (which I was unaware of) on how bacterially-produced sphingolipids have been predicted to influence intestinal inflammation like IBD. Very generally, sphingolipids are a diverse class of molecules that can be found in bacterial cell membranes, but which also can be produced by other organisms, and which also can have a signaling effect on human cells. The gist of the prior evidence going into this paper is that

  • people with IBD have lower levels of different sphingolipids in their stool, and

  • genomic analysis of the microbiome of people with IBD predicts that their bacteria are making fewer sphingolipids

Of course, those observations don’t go very far on their own, mostly because there are a ton of things that are different in the microbiome of people with IBD, and so it’s hard to point to any one bacterium or molecule from the bunch and say that it is having a causal role, and isn’t just a knock-on effect from some other cause.

The Big Deal Here

The hypothesis in this study is that one particular type of bacteria, Bacteroides, is producing sphingolipids which reduce inflammation in the host. The experimental system they used was mice that were born completely germ-free, and which were subsequently colonized with strains of Bacteroides that either did or did not have the genes required to make some particular types of sphingolipids. The really cool thing here was that they were able to knock out the gene for sphingolipid production in one specific species of Bacteroides, and so they could see what the effect was of that particular set of genes, while keeping everything else constant. They found a pretty striking result, which is that inflammation was much lower in the mice which were colonized with the strain which was able to make the sphingolipid.

To me, narrowing down the biological effect in an experiment to the difference of a single gene is hugely motivating, and really makes me think that this could plausibly have a role in the overall phenomenon of microbiome-associated inflammation.

The authors rightly point out that sphingolipids might not actually be the molecular messenger having an impact on host physiology — there are a lot of other things different in the sphingolipid-deficient bacteria used here, including carbohydrate metabolism and membrane composition, but it’s certainly a good place to keep looking.

Of course the authors did a bunch of other work in this paper to demonstrate that the experimental system was doing what they said, and they also went on to re-analyze the metabolites from human stool and identify specific sphingolipids that may be produced by these Bacteroides species, but I hope that my short summary gives you an idea of what they are getting at.

All About Those Genes

I think it can be difficult for non-microbiologists to appreciate just how much genetic diversity there is among bacteria. Strains which seem quite similar can have vastly different sets of genes (encoding, for example, a giant harpoon used to kill neighboring cells), and strains which seem quite different may in fact be sharing genes through exotic forms of horizontal gene transfer. With all of this complexity, I find it very comforting when scientists are able to conduct experiments which identify specific molecules and specific genes within the microbiome which have an impact on human health. I think we are moving closer to a world where we are able to use our knowledge of the microbiome to improve human health, and I think studies like this are bringing us closer.

Working with Nextflow at Fred Hutch

I’ve been putting in a bit of work recently trying to make it easier for other researchers at Fred Hutch to use Nextflow as a way to run their bioinformatics workflows, while also getting the benefits of cloud computing and Docker-based computational reproducibility.

You can see some slides describing some of that content here, including a description of the motivation for using workflow managers, as well as a more detailed walk-through of using Nextflow right here at Fred Hutch.

Preprint: Identifying genes in the human microbiome that are reproducibly associated with human disease

I’m very excited about a project that I’ve been working on for a while with Prof. Amy Willis (UW - Biostatistics), and now that a preprint is available I wanted to share some of that excitement with you. Some of the figures are below, and you can look at the preprint for the rest.

Caveat: There are a ton of explanations and qualifications that I have overlooked for the statements below — I apologize in advance if I have lost some nuance and accuracy in the interest of broader understanding.

Big Idea

When researchers look for associations of the microbiome with human disease, they tend to focus on the taxonomic or metabolic summaries of those communities. The detailed analysis of all of the genes encoded by the microbes in each community hasn’t really been possible before, purely because there are far too many genes (millions) to meaningfully analyze on an individual basis. After a good amount of work I think that I have found a good way to efficiently cluster millions of microbial genes based on their co-abundance, and I believe that this computational innovation will enable a whole new approach for developing microbiome-based therapeutics.

Core Innovation

I was very impressed with the basic idea of clustering co-abundant genes (to form CAGs) when I saw it proposed initially by one of the premier microbiome research groups. However, the computational impossibility of performing all-by-all comparisons for millions of microbial genes (with trillions of potential comparisons) ultimately led to an alternate approach which uses co-abundance to identify “metagenomic species” (MSPs), a larger unit that uses an approximate distance metric to identify groups of CAGs that are likely from the same species.

That said, I was very interested in finding CAGs based on strict co-abundance clustering. After trying lots of different approaches, I eventually figured out that I could apply the Approximate Nearest Neighbor family of heuristics to effectively partition the clustering space and generate highly accurate CAGs from datasets with millions of genes across thousands of biological samples. So many details to skip here, but the take-home is that we used a new computational approach to perform dimensionality reduction (building CAGs), which made it reasonable to even attempt gene-level metagenomics to find associations of the microbiome with human disease.

Just to make sure that I’m not underselling anything here, being able to use this new software to perform exhaustive average linkage clustering based on the cosine distance between millions of microbial genes from hundreds of metagenomes is a really big deal, in my opinion. I mostly say this because I spent a long time failing at this, and so the eventual success is extremely gratifying.
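To show what “average linkage clustering based on the cosine distance” means in practice, here is a deliberately naive sketch in plain Python. It compares every pair of clusters at every step, which is exactly the all-by-all cost that makes this infeasible for millions of genes, so it is only usable on toy data. It is not the software described in the preprint and does not use the Approximate Nearest Neighbor heuristics. The gene names, abundance vectors, and the `max_distance` threshold of 0.1 are all hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two abundance vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an all-zero gene as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def average_linkage(cluster_a, cluster_b, abundances):
    """Mean pairwise cosine distance between the members of two clusters."""
    dists = [
        cosine_distance(abundances[g], abundances[h])
        for g in cluster_a
        for h in cluster_b
    ]
    return sum(dists) / len(dists)


def cluster_genes(abundances, max_distance=0.1):
    """Greedily merge the closest pair of clusters until none is close enough."""
    clusters = [[gene] for gene in sorted(abundances)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], abundances)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical abundances of four genes across four samples.
abundances = {
    "geneA": [1, 2, 3, 0],
    "geneB": [2, 4, 6, 0],
    "geneC": [0, 0, 1, 5],
    "geneD": [0, 0, 2, 10],
}
print(cluster_genes(abundances))
```

Genes whose abundances rise and fall together across samples end up in the same group, regardless of their absolute abundance, which is why cosine distance is a natural fit for co-abundance.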

Associating the Microbiome with Disease

We applied this new computational approach to existing, published microbiome datasets in order to find gene-level associations of the microbiome with disease. The general approach was to look for individual CAGs (groups of co-abundant microbial genes) that were significantly associated with disease (higher or lower in abundance in the stool of people with a disease, compared to those people without the disease). We did this for both colorectal cancer (CRC) and inflammatory bowel disease (IBD), mostly because those are the two diseases for which multiple independent cohorts existed with WGS microbiome data.

Discovery / Validation Approach

The core of our statistical analysis of this approach was to look for associations with disease independently across both a discovery and a validation cohort. In other words, we used the microbiome data from one group of 100-200 people to see if any CAGs were associated with disease, and then we used a completely different group of 100-200 people in order to validate that association.
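The logic of a discovery / validation comparison can be sketched in a few lines. This is a simplified illustration rather than the statistical analysis from the preprint: it assumes each cohort has already produced an estimated effect and a p-value per CAG, it uses a hypothetical threshold `alpha`, and it ignores multiple-testing correction. The CAG identifiers and numbers are made up.

```python
def validated_cags(discovery, validation, alpha=0.05):
    """Return CAGs significant in both cohorts with the same direction of effect.

    Each argument maps a CAG identifier to an (estimate, p_value) pair.
    """
    confirmed = []
    for cag, (est_d, p_d) in discovery.items():
        if p_d >= alpha or cag not in validation:
            continue  # not a discovery, or not measured in the second cohort
        est_v, p_v = validation[cag]
        same_direction = (est_d > 0) == (est_v > 0)
        if same_direction and p_v < alpha:
            confirmed.append(cag)
    return sorted(confirmed)


# Hypothetical results from two independent cohorts.
discovery = {
    "CAG1": (1.2, 0.001),
    "CAG2": (-0.8, 0.01),
    "CAG3": (0.5, 0.2),
    "CAG4": (0.9, 0.003),
}
validation = {
    "CAG1": (0.9, 0.004),
    "CAG2": (0.7, 0.02),
    "CAG3": (0.6, 0.01),
    "CAG4": (1.1, 0.3),
}
print(validated_cags(discovery, validation))
```

Only associations that are found in the first cohort and then hold up, in the same direction, in the second cohort survive.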

Surprising Result

Quite notably, those CAGs which were associated with disease in the discovery cohort were also similarly associated with disease in the validation cohort. These were different groups of people, different laboratories, different sample processing protocols, and different sequencing facilities. With all of those differences, I am very hopeful that the consistencies represent an underlying biological reality that is true across most people with these diseases.

Figure 2A: Association of microbial CAGs with host CRC across two independent cohorts.

Developing Microbiome Therapeutics: Linking Genes to Isolates

While it is important to ensure that results are reproducible across cohorts, it is much more important that the results are meaningful and provide testable hypotheses about treating human disease. The aspect of these results I am most excited about is that each of the individual genes that were associated above with CRC or IBD can be directly aligned against the genomes of individual microbial isolates. This allows us to identify those strains which contain the highest number of genes which are associated positively or negatively with disease. It should be noted at this point that observational data does not provide any information on causality — the fact that a strain is more abundant in people with CRC could be because it has a growth advantage in CRC, it could be that it causes CRC, or it could be something else entirely. However, this gives us some testable hypotheses and a place to start for future research and development.

Figure 3C: Presence of CRC-associated genes across a subset of microbial isolates in RefSeq. Color bar shows coefficient of correlation with CRC.

Put simply, I am hopeful that others in the microbiome field will find this to be a useful approach to developing future microbiome therapeutics. Namely,

  1. Start with a survey of people with and without a disease,

  2. Collect WGS data from microbiome samples,

  3. Find microbial CAGs that are associated with disease, and then

  4. Identify isolates in the freezer containing those genes.

That process provides a prioritized list of isolates for preclinical testing, which will hopefully make it a lot more efficient to develop an effective microbiome therapeutic.
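The last step of that list, going from disease-associated genes to a ranked list of isolates, might look something like the sketch below. It assumes the alignment has already been done and summarized as a set of gene identifiers per isolate; the function name, isolate names, gene identifiers, and coefficients are all hypothetical, and ranking by the total number of associated genes is just one reasonable choice.

```python
def rank_isolates(isolate_genes, gene_association):
    """Rank isolates by how many disease-associated genes they carry.

    isolate_genes maps an isolate name to the set of gene IDs it contains.
    gene_association maps a gene ID to its estimated association with disease
    (positive or negative); genes absent from the mapping are ignored.
    """
    rows = []
    for isolate, genes in isolate_genes.items():
        positive = sum(1 for g in genes if gene_association.get(g, 0) > 0)
        negative = sum(1 for g in genes if gene_association.get(g, 0) < 0)
        rows.append((isolate, positive, negative))
    # Most associated genes first; ties broken alphabetically by isolate name.
    return sorted(rows, key=lambda r: (-(r[1] + r[2]), r[0]))


# Hypothetical inputs.
isolate_genes = {
    "isolate_1": {"g1", "g2", "g3"},
    "isolate_2": {"g3"},
    "isolate_3": {"g4", "g5"},
}
gene_association = {"g1": 0.8, "g2": -0.5, "g3": 0.3}

for isolate, positive, negative in rank_isolates(isolate_genes, gene_association):
    print(f"{isolate}: {positive} positively and {negative} negatively associated genes")
```

As with the associations themselves, a high rank says nothing about causality; it only suggests which isolates to pull out of the freezer first.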

Thank You

Your time and attention are appreciated, as always, dear reader. Please do not hesitate to be in touch if you have any questions or would like to discuss anything in more depth.

Bioinformatics: Reproducibility, Portability, Transparency, and Technical Debt

I’ve been thinking a lot about what people are talking about when they talk about reproducibility. It has been helpful to start to break apart the terminology in order to distinguish between some distinct, albeit highly intertwined, concepts.

Bioinformatics: Strictly speaking, analysis of data for the purpose of biological research. In practice, the analysis of large files (GBs) with a series of compiled programs, each of which may have a different set of environmental dependencies and computational resource requirements.

Reproducibility: An overarching concept describing how easily a bioinformatic analysis performed at one time may be able to be executed a second time, potentially by a different person, at a different institution, or on a different set of input data. There is also a strict usage of the term which describes the computational property of an analysis in which the analysis of an identical set of inputs will always produce an identical set of outputs. These two meanings are related, but not identical. Bioinformaticians tend to accept a lack of strict reproducibility (e.g., the order of alignment results may not be consistent when multithreading), but very clearly want to have general reproducibility in which the biological conclusions drawn from an analysis will always be the same from identical inputs.
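The difference between those two meanings can be shown with a small sketch. The example below compares two hypothetical outputs that contain the same alignment records in a different order: a byte-for-byte checksum treats them as different (strict reproducibility fails), while a checksum over the sorted lines treats them as the same. The function names and the example records are made up, and sorting lines is only one possible way to define "the same result".

```python
import hashlib


def strict_digest(text):
    """Checksum of the output exactly as written."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def order_insensitive_digest(text):
    """Checksum of the output after sorting its lines."""
    lines = sorted(text.splitlines())
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


# Two hypothetical runs that report the same records in a different order,
# as can happen when alignments are written by multiple threads.
run_a = "read1\tchr1\t100\nread2\tchr2\t250\n"
run_b = "read2\tchr2\t250\nread1\tchr1\t100\n"

print("strictly identical:", strict_digest(run_a) == strict_digest(run_b))
print(
    "identical ignoring order:",
    order_insensitive_digest(run_a) == order_insensitive_digest(run_b),
)
```

Which comparison is appropriate depends on whether downstream conclusions can be affected by the order of the records.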

Portability: The ability of researchers at different institutions (or in different labs) to execute the same analysis. This aspect of reproducibility is useful to consider because it highlights the difficulties that are encountered when you move between computational environments. Each set of dependencies, environmental variables, file systems, permissions, hardware, etc., is typically quite different and can cause endless headaches. Some people point to Docker as a primary solution to this problem, but it is typical for Docker to be prohibited on HPCs because it requires root access. Operationally, the problem of portability is a huge one for bioinformaticians who are asked by their collaborators to execute analyses developed by other groups, and the reason why we sometimes start to feel like UNIX gurus more than anything else.

Transparency: The ability of researchers to inspect and understand what analyses are being performed. This is more of a global problem in concept than in practice: people like to talk about how they mistrust black box analyses, but I don’t know anybody who has read through the code for BWA searching for potential bugs. At the local level, I think that the level of transparency that people actually need is at the level of the pipeline or workflow. We want to know which individual tools are being invoked, and with what parameters, even if we aren’t qualified (speaking for myself) to debug any Java or C code.

Technical Debt: The amount of work required to mitigate any of the challenges mentioned above. This is the world that we live in which nobody talks about. With infinite time and effort it is possible to implement almost any workflow on almost any infrastructure, but the real question is how much effort it will take. It is important to recognize when you are incurring technical debt that will have to be paid back by yourself or others in the field. My rule of thumb is to think about, for any analysis, how easily I will be able to re-run all of the analyses from scratch when reviewers ask what would be different if we changed a single parameter. If it’s difficult in the slightest for me to do this, it will be almost impossible for others to reproduce my analysis.

Final Thoughts

I’ve been spending a lot of time recently on workflow managers, and I have found that there are quite a number of systems which provide strict computational reproducibility with a high degree of transparency. The point where they fall down, through no fault of their own, is the ease with which they can be implemented on different computational infrastructures. It is just a complete mess to try to run an analysis in the exact same way in a diverse set of environments, and it requires that the development teams for those tools devote time and energy to account for all of those eventualities. In a world where very little funding goes to bioinformatics infrastructure, reproducibility will always be a challenge, but I am hopeful that things are getting better every day.