Musings

Looking Under the Lamppost - Reference Genomes in Metagenomics

In the world of research, a phrase that often comes up is “looking under the lamppost.” It refers to a fundamental bias: we are more likely to pay attention to whatever is illuminated by our analytical methods, even though it may not be what matters most. This phrase always comes to mind when people discuss how best to use reference genomes in microbial metagenomics, and I thought a short explanation might be worthwhile.

The paper that prompted this thought was a recent preprint, “OGUs enable effective, phylogeny-aware analysis of even shallow metagenome community structures” (https://www.biorxiv.org/content/10.1101/2021.04.04.438427v1). For those who aren’t familiar with the last author on this paper, Dr. Rob Knight is one of the most published and influential researchers in the microbiome field and is particularly known for being at the forefront of 16S analysis with the widely used QIIME software suite. In this paper, they describe a method for analyzing another type of data, whole-genome shotgun (WGS) metagenomics, based on the alignment of short genome fragments to a collection of reference genomes.

Rather than focus on this particular paper, I would like to spend my short time talking about the use of reference genomes in metagenomic analysis in general. There are many different methods that use reference genomes in different ways, including alignment-based approaches like the OGU method as well as k-mer-based taxonomic classification, one of the most widely used approaches to WGS analysis. Reference genomes bring many bioinformatic advantages, including the greatly increased speed of analyzing new data against a fixed collection of known organisms. The question really comes down to whether or not we are being misled by looking under the lamppost.

To circle back around and totally unpack the metaphor: when we analyze the microbiome on the basis of a fixed set of reference genomes, we can do a very good job of measuring the organisms we have reference genomes for, but we have very little information about the organisms that are not present in that collection. Particular bioinformatics methods are influenced by this bias to different degrees depending on how they process the raw data, but the underlying principle is inescapable.
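To make the lamppost a bit more concrete, here is a toy Python sketch of what a reference-based classifier does at its core: any read whose k-mers never match the reference collection is simply invisible. Real tools are vastly more sophisticated, and the sequences and function names below are made up purely for illustration.

```python
# Toy illustration of the "lamppost" effect: reads whose k-mers never match the
# reference collection are invisible to a reference-based classifier. This is a
# sketch only; real classifiers use large indexes, taxonomies, and
# lowest-common-ancestor logic.

def kmers(seq, k=21):
    """Yield every k-mer in a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def lamppost_fraction(reads, reference_kmers, k=21):
    """Return the fraction of reads with no k-mer match to the reference set."""
    unmatched = sum(
        1 for read in reads
        if not any(kmer in reference_kmers for kmer in kmers(read, k))
    )
    return unmatched / len(reads) if reads else 0.0

# Hypothetical example: a reference built from one "known" genome fragment, and
# a mix of reads drawn from inside and outside that reference.
known = "ATGACCATGATTACGGATTCACTGGCCGTCGTTTTACAACGTCGTGACTGG"
novel = "TTTTGGGGCCCCAAAATTTTGGGGCCCCAAAATTTTGGGGCCCCAAAA"
reference_kmers = set(kmers(known))
reads = [known[5:40], known[10:45], novel[0:35]]

print(f"Fraction of reads outside the lamppost: {lamppost_fraction(reads, reference_kmers):.2f}")
```

In a real dataset, that unmatched fraction is exactly the part of the community we can say the least about.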

The question you might ask is, “Is it a good idea for me to use a reference-based approach in my WGS analysis?” and, like most good questions, the answer is complex. In the end, it comes down to how well the organisms that matter for your biological system have been characterized by reference genome sequencing. Some organisms have been studied quite extensively, and the revolution in genome sequencing has produced massive numbers of microbial reference genomes in the public domain. In those cases, using reference genomes for those organisms will almost certainly give you higher-quality results than performing your analysis de novo.

The harder question to answer is which organisms matter to your biological system. For many diseases, we might think we have a very good idea of which organisms matter. However, one of the amazing benefits of WGS data is that it provides insight into every organism in a specimen that contains genetic material. This is an incredibly broad pool of organisms to draw from, and we are constantly being surprised by how many organisms are present in the microbiome that we have not yet characterized or generated a reference genome for.

In the end, if you are a researcher who uses WGS data to understand complex microbial communities, my only recommendation is that you approach reference-based metagenomic analysis with an appreciation of the biases it brings along with its efficiency and speed. As a biologist who studies the microbiome, I think there is a lot more diversity out there than we have characterized in the lab, and I am always excited to find out what lies at the furthest reaches from the lamppost.

What's the Matter with Recombination?

Why would I be so bold as to assign required reading on a Saturday morning via Twitter? Because the ideas laid out in this paper have practical implications for many of the computational tools that we use to understand the microbiome. Taxonomy and phylogeny lie at the heart of k-mer-based methods for metagenomics, and they provide the core justification for measuring any marker gene (e.g. 16S) with amplicon sequencing.

Don’t get me wrong, I am a huge fan of both taxonomy and phylogeny as two of the best ways for humans to understand the microbial world, and I’m going to keep using both for the rest of my career. For that reason, I think it’s very important to understand the ways in which these methods can be confounded (that is, the ways in which they can mislead us) by mechanisms like genomic recombination.

What Is Recombination?

Bacteria are amazingly complex creatures, and they do a lot more than just grow and divide. During the course of a bacterial cell’s life, it may end up doing something exciting like:

  • Importing a piece of naked DNA from its environment and just pasting it into its genome;

  • Using a small needle to inject a plasmid (small chromosome) into an adjacent cell; or

  • Packaging up a small piece of its genome into a phage (a protein capsule) that contains all of the machinery needed to travel some distance and then inject that DNA into another bacterium far away.

Not all bacteria do all of these things all of the time, but we know that they do happen. One common feature of these activities is that genetic material is exchanged between cells in a manner other than clonal reproduction (when a single cell splits into two). In other words, these are all forms of ‘recombination.’

How Do We Study the Microbiome?

Speaking for myself, I study the microbiome by analyzing data generated by genome sequencing instruments. We use those instruments to identify a small fraction of the genetic sequences contained in a microbiome sample, and then we draw inferences from those sequences. This may entail amplicon sequencing of the 16S gene, bulk WGS sequencing of a mixed population, or even single-cell sequencing of individual bacterial cells. Across all of these different technological approaches, we are collecting a small sample of the genomic sequences present in a much larger microbiome sample, and we are using that data to gain some understanding of the microbiome as a whole. In order to extrapolate from data to models, we rely on some key assumptions about how the world works, one of which is that bacteria do not frequently recombine.

What Does Recombination Mean to Me?

If you’ve read this far you are either a microbiome researcher or someone very interested in the topic, and so you should care about recombination. As an example, let’s walk through the logical progression of microbiome data analysis:

  • I have observed microbial genomic sequence S in specimen X. This may be an OTU, ASV, or WGS k-mer.

  • The same sequence S can also be observed in specimen Y, but not specimen Z. There may be some nuances of sequencing depth and the limit-of-detection, but I have satisfied myself that for this experiment marker S can be found in X and Y, but not Z.

  • Because bacteria infrequently recombine, I can infer that marker S represents a larger genomic region G which is similarly present in X and Y, but not Z. For 16S that genomic region would be the species- or genus-level core genome, and for WGS it could also include accessory genetic elements such as plasmids. In the simplest rendering, we may give a name to genomic region G which corresponds to the taxonomic label for those organisms which share marker S (e.g. Escherichia coli).

  • When I compare a larger set of samples, I find that the marker S can be consistently found in samples obtained from individuals with disease D (like X and Y) but not in samples from healthy controls (like Z). Therefore I would propose the biological model that organisms containing the larger genomic region G are present at significantly higher relative abundance in the microbiome of individuals with disease D.

In this simplistic rendering I’ve tried to make it clear that the degree to which bacteria recombine will have a practical impact on how much confidence we can have in inferences which rely on the concepts of taxonomy or phylogeny.
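To make that chain of reasoning concrete, here is a toy sketch of the final step: tally where marker S was observed and test whether its presence is associated with disease D. The specimen table is invented for illustration, and the leap from marker S to genomic region G, which is where recombination can mislead us, happens entirely outside of this code.

```python
# Toy version of the inference chain above: count how often marker S co-occurs
# with disease D, then test the association. All data here are invented.
from scipy import stats

# Hypothetical presence/absence of marker S and disease status per specimen.
specimens = {
    #  name: (has_marker_S, has_disease_D)
    "X": (True,  True),
    "Y": (True,  True),
    "Z": (False, False),
    "W": (True,  True),
    "V": (False, False),
    "U": (False, True),
}

# Build the 2x2 contingency table: rows = marker present/absent, cols = disease yes/no.
table = [[0, 0], [0, 0]]
for has_marker, has_disease in specimens.values():
    table[0 if has_marker else 1][0 if has_disease else 1] += 1

odds_ratio, p_value = stats.fisher_exact(table)
print(f"Contingency table: {table}")
print(f"Fisher's exact test p-value: {p_value:.3f}")
```

Everything interesting (and everything fragile) happens in the jump from this kind of statistical association on marker S to a biological claim about region G.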

The key idea here is that when we observe any marker S, we tend to assume that there is a monophyletic group of organisms which share that marker sequence. Monophyly is one of the most important concepts in microbiome analysis, and it also happens to be a lot of fun to say; it’s worth reading up on.

How Much Recombination Is There?

Getting back to the paper that started it all, the authors did a nice job of carefully estimating the frequency of recombination across a handful of bacterial species for which a reasonable amount of data is available. What they found is that recombination rates vary, which matches our mechanistic understanding: the documented mechanisms of recombination differ widely across organisms, and there is undoubtedly a lot more out there that we haven’t characterized yet.

At the end of the day, we have only studied a small fraction of the organisms found in the microbiome. As such, we should bring a healthy dose of skepticism to any key assumption, like a lack of recombination, that we know is not universal.

In conclusion, I am going to continue to use taxonomy and phylogeny every single day that I study the microbiome, but I’m also going to stay alert for the ways that recombination may be misleading me. On a practical note, I am also going to try to use methods like gene-level analysis, which keep a tight constraint on the size of the regions G inferred from any marker S.

Bioinformatics: Reproducibility, Portability, Transparency, and Technical Debt

I’ve been thinking a lot about what people mean when they talk about reproducibility. It has been helpful to break apart the terminology in order to distinguish some conceptually distinct, albeit highly intertwined, ideas.

Bioinformatics: Strictly speaking, analysis of data for the purpose of biological research. In practice, the analysis of large files (GBs) with a series of compiled programs, each of which may have a different set of environmental dependencies and computational resource requirements.

Reproducibility: An overarching concept describing how easily a bioinformatic analysis performed at one time may be executed a second time, potentially by a different person, at a different institution, or on a different set of input data. There is also a strict usage of the term, which describes the computational property that analyzing an identical set of inputs will always produce an identical set of outputs. These two meanings are related, but not identical. Bioinformaticians tend to accept a lack of strict reproducibility (e.g., the order of alignment results may not be consistent when multithreading), but very clearly want general reproducibility, in which the biological conclusions drawn from identical inputs will always be the same.
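As a small illustration of that distinction, here is a sketch that simulates two runs of a multithreaded tool producing the same records in a different order: the outputs fail a strict bit-for-bit comparison but pass once the records are sorted. The file contents are simulated, not the output of any real aligner.

```python
# Strict reproducibility = bit-for-bit identical outputs.
# Looser (often acceptable) reproducibility = same records, any order.
import hashlib
import random

def file_digest(path, sort_lines=False):
    """MD5 of a text file, optionally after sorting its lines."""
    with open(path) as handle:
        lines = handle.readlines()
    if sort_lines:
        lines = sorted(lines)
    return hashlib.md5("".join(lines).encode()).hexdigest()

# Simulate two runs of a multithreaded tool: same records, different order.
records = [f"read_{i}\tchr1\t{100 + i}\n" for i in range(1000)]
for path in ("run1.txt", "run2.txt"):
    shuffled = records[:]
    random.shuffle(shuffled)
    with open(path, "w") as handle:
        handle.writelines(shuffled)

print("Bit-for-bit identical:  ", file_digest("run1.txt") == file_digest("run2.txt"))
print("Same records once sorted:", file_digest("run1.txt", sort_lines=True) == file_digest("run2.txt", sort_lines=True))
```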

Portability: The ability of researchers at different institutions (or in different labs) to execute the same analysis. This aspect of reproducibility is useful to consider because it highlights the difficulties encountered when you move between computational environments. Every environment has its own set of dependencies, environment variables, file systems, permissions, hardware, etc., and those differences can cause endless headaches. Some people point to Docker as a primary solution to this problem, but Docker is typically prohibited on HPC systems because it requires root access. Operationally, portability is a huge problem for bioinformaticians who are asked by their collaborators to execute analyses developed by other groups, and it is the reason why we sometimes start to feel like UNIX gurus more than anything else.

Transparency: The ability of researchers to inspect and understand what analyses are being performed. This is more of a global problem in concept than in practice: people like to talk about how they mistrust black-box analyses, but I don’t know anybody who has read through the code for BWA searching for potential bugs. At the local level, I think the level of transparency that people actually need is at the level of the pipeline or workflow. We want to know which individual tools are being invoked, and with what parameters, even if we aren’t qualified (speaking for myself) to debug any Java or C code.

Technical Debt: The amount of work required to mitigate any of the challenges mentioned above. This is the world we live in, and nobody talks about it. With infinite time and effort it is possible to implement almost any workflow on almost any infrastructure, but the real question is how much effort it will take. It is important to recognize when you are incurring technical debt that will have to be paid back by yourself or others in the field. My rule of thumb, for any analysis, is to think about how easily I will be able to re-run everything from scratch when reviewers ask what would be different if we changed a single parameter. If it is difficult in the slightest for me to do this, it will be almost impossible for others to reproduce my analysis.
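One habit that helps me keep that debt manageable, sketched below as an idea rather than any particular tool: keep every parameter in a single config object and make the whole analysis a function of it, so the reviewer’s “what if you changed X?” becomes a one-line edit and a re-run. The parameter names and placeholder pipeline are hypothetical.

```python
# Sketch: drive the entire analysis from one config object so that re-running
# with a changed parameter is trivial. The parameters and run_pipeline() body
# are placeholders, not a real tool.
from dataclasses import dataclass, asdict
import json

@dataclass
class AnalysisConfig:
    min_read_length: int = 50
    min_quality: int = 20
    reference_db: str = "refdb_v1"

def run_pipeline(config: AnalysisConfig):
    """Placeholder for the full analysis, driven entirely by the config."""
    print(f"Running with parameters: {json.dumps(asdict(config), indent=2)}")
    # ... trimming, alignment, summarization would go here ...

# Original submission
run_pipeline(AnalysisConfig())

# Reviewer asks: what changes if we require higher-quality reads?
run_pipeline(AnalysisConfig(min_quality=30))
```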

Final Thoughts

I’ve been spending a lot of time recently on workflow managers, and I have found that there are quite a number of systems which provide strict computational reproducibility with a high degree of transparency. The point where they fall down, through no fault of their own, is the ease with which they can be implemented on different computational infrastructures. Running an analysis in exactly the same way across a diverse set of environments is a complete mess, and it requires that the development teams for those tools devote time and energy to accounting for all of those eventualities. In a world where very little funding goes to bioinformatics infrastructure, reproducibility will always be a challenge, but I am hopeful that things are getting better every day.

The Blessing and the Curse of Dimensionality

A paper recently caught my eye, and I think it is a great excuse to talk about data scale and dimensionality.

Vatanen, T. et al. The human gut microbiome in early-onset type 1 diabetes from the TEDDY study. Nature 562, 589–594 (2018). (link)

In addition to having a great acronym for a study of child development, the authors sequenced 10,913 metagenomes from 783 children.

This is a ton of data.

If you haven’t worked with a “metagenome” before, it usually consists of about 10-20 million short sequences, each corresponding to 100-300 bases of a microbial genome. It’s a text file with some combination of A, T, C, and G written out over tens of millions of lines, each line being a few hundred letters long. A single metagenome is big. It won’t open in Word. Now imagine you have 10,000 of them. Now imagine you have to make sense of 10,000 of them.
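Just to make that scale concrete, here is a minimal sketch that counts the reads and bases in a FASTA-formatted metagenome. The file path is whatever you point it at; real data usually arrives gzipped, and often as FASTQ with quality lines, which this toy script does not handle.

```python
# Minimal sketch: count reads and bases in a plain-text FASTA file.
import sys

def summarize_fasta(path):
    """Count reads and bases in a FASTA file (no gzip or FASTQ handling here)."""
    n_reads, n_bases = 0, 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                n_reads += 1
            else:
                n_bases += len(line.strip())
    return n_reads, n_bases

if __name__ == "__main__":
    # Usage: python summarize_fasta.py your_metagenome.fasta
    reads, bases = summarize_fasta(sys.argv[1])
    print(f"{reads:,} reads containing {bases:,} bases")
```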

Now, I’m being a bit extreme – there are some ways to deal with the data. However, I would argue that it’s this problem, how to deal with the data, that we could use some help with.

Taxonomic classification

The most effective way to deal with the data is to take each metagenome and figure out which organisms are present. This process is called “taxonomic classification,” and it’s something that people have gotten pretty good at recently. You take all of those short ATCG words, you match them against all of the genomes you know about, and you use that information to make some educated guesses about which organisms are present. This is a biologically meaningful reduction in the data that results in hundreds or thousands of observations per sample. You can also validate these methods by processing “mock communities” and seeing if you get the right answer. I’m a fan.
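As a sketch of what that mock-community check can look like, here is the simplest possible version: compare the classifier’s output proportions against the known input proportions. The numbers below are invented, and real benchmarking uses richer metrics across many samples.

```python
# Bare-bones mock-community check: total absolute error between the known
# input composition and the classifier's output (0 = perfect recovery).
expected = {"E. coli": 0.25, "B. fragilis": 0.25, "S. aureus": 0.25, "P. aeruginosa": 0.25}
observed = {"E. coli": 0.31, "B. fragilis": 0.22, "S. aureus": 0.24, "P. aeruginosa": 0.18, "other": 0.05}

all_taxa = set(expected) | set(observed)
total_error = sum(abs(expected.get(t, 0.0) - observed.get(t, 0.0)) for t in all_taxa)
print(f"Total absolute error vs. mock community: {total_error:.2f}")
```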

With taxonomic classification you end up with thousands of observations (in this case organisms) across however many samples you have. In the TEDDY study they had >10,000 samples, so this dataset has a lot of statistical power (you generally want more samples than observations).

Metabolic reconstruction

The other main way that people analyze metagenomes these days is by quantifying the abundance of each biochemical pathway present in the sample. I won’t talk about this here because my opinions are controversial and it’s best left for another post.

Gene-level analysis

I spend most of my time these days on “gene-level analysis.” This type of analysis tries to quantify every gene present in every genome in every sample. The motivation here is that sometimes genes move horizontally between species, and sometimes different strains within the same species will have different collections of genes. So, if you want to find something that you can’t find with taxonomic analysis, maybe gene-level analysis will pick it up. However, that’s an entirely different can of worms. Let’s open it up.

Every microbial genome contains roughly 1,000 genes. Every metagenome contains a few hundred genomes. So every metagenome contains hundreds of thousands of genes. When you look across a few hundred samples you might find a few million unique genes. When you look across 10,000 samples I can only guess that you’d find tens of millions of unique genes.
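Here is the same arithmetic as a back-of-envelope calculation, using the rough order-of-magnitude figures from the text rather than any measured values.

```python
# Back-of-envelope version of the gene-count arithmetic; every number is a
# rough, order-of-magnitude guess taken from the text, not a measurement.
genes_per_genome = 1_000
genomes_per_metagenome = 300          # "a few hundred genomes"
genes_per_metagenome = genes_per_genome * genomes_per_metagenome
print(f"Genes per metagenome: ~{genes_per_metagenome:,}")   # hundreds of thousands

# Unique genes grow with the number of samples (rough guesses, as in the text).
unique_genes_per_study = {200: 5_000_000, 10_000: 30_000_000}
for n_samples, n_genes in unique_genes_per_study.items():
    print(f"~{n_samples:,} samples -> ~{n_genes:,} unique genes (far more genes than samples)")
```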

Now the dimensionality of the data is all lopsided. We have tens of millions of genes, but they are observed across only about ten thousand samples. A biostatistician would tell us that this is seriously underpowered for making sense of the biology. Basically, the naive approach just doesn’t work for studies with 10,000 samples, which I find pretty daunting.

Dealing with scale

The way that we find success in science is that we take information that a human cannot comprehend and transform it into something that a human can comprehend. We cannot look at a text file with ten million lines and understand anything about that sample, but we can transform it into a list of organisms with names that we can Google. I’m spending a lot of my time trying to do the same thing with gene-level metagenomic analysis, trying to transform it into something that a human can comprehend. This all falls into the category of “dimensionality reduction”: reducing the number of observations per sample while still retaining the biological information we care about. I’ll tell you that this problem is really hard, and I’m not sure I have the single best angle on it. I would absolutely love to have more eyes on the problem.
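As one generic example of what dimensionality reduction looks like on this kind of data, here is a sketch that projects a samples-by-genes abundance matrix down to a handful of components with PCA. PCA is just a stand-in, not the approach I am advocating, and the matrix below is random noise, purely to show the mechanics.

```python
# Generic dimensionality-reduction sketch: project a samples-by-genes count
# matrix down to a few components. The data are random noise for illustration.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_samples, n_genes = 200, 20_000             # genes vastly outnumber samples
gene_abundance = rng.poisson(2.0, size=(n_samples, n_genes)).astype(float)

pca = PCA(n_components=10)
reduced = pca.fit_transform(gene_abundance)  # shape: (200, 10)

print(f"Reduced from {n_genes:,} genes to {reduced.shape[1]} components per sample")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.1%}")
```

In practice the hard part is not the projection itself but keeping the reduced features biologically interpretable, which is exactly where I would love more eyes.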

It increasingly seems like the world is driven by people who try to make sense of large amounts of data, and I would humbly ask anyone who cares about this to think about metagenomic analysis. The data is massive, and we have a hard time figuring out how to make sense of it. We have a lot of good starts, and there are a lot of good people working in this area (too many to list), but I think we could always use more help.

The authors of the paper who analyzed 10,000 metagenomes learned a lot about how the microbiome develops during early childhood, but I’m sure that there is even more we can learn from this data. I am also sure that we are getting close to a world where we have 10X the data per sample, and experiments with 10X the samples. That is a world that I think we are ill-prepared for, and I’m excited to try to build the tools that we will need for it.