
Looking Under the Lamppost - Reference Genomes in Metagenomics

In the world of research, a phrase that often comes up is “looking under the lamppost.” This refers to the fundamental bias in which we are more likely to pay attention to whatever is illuminated by our analytical methods, even though it may not be what matters most. This phrase always comes to mind when people discuss how best to use reference genomes in microbial metagenomics, and I thought that a short explanation might be worthwhile.

The paper which prompted this thought was a recent preprint, “OGUs enable effective, phylogeny-aware analysis of even shallow metagenome community structures” (https://www.biorxiv.org/content/10.1101/2021.04.04.438427v1). For those who aren’t familiar with the last author on this paper, Dr. Rob Knight is one of the most published and influential researchers in the microbiome field, and he is particularly known for being at the forefront of 16S analysis with the widely-used QIIME software suite. In this paper, they describe a method for analyzing another type of data, whole-genome shotgun (WGS) metagenomics; the method is based on aligning short genome fragments to a collection of reference genomes.

Rather than focus on this particular paper, I would rather spend my short time talking about the use of reference genomes in metagenomic analysis in general. There are many different methods which use reference genomes in different ways. This includes alignment-based approaches like the OGU method as well as k-mer based taxonomic classification, which is one of the most widely used approaches to WGS analysis. There are many bioinformatic advantages of using reference genomes, including the greatly increased speed of being able to analyze new data against a fixed collection of known organisms. The question really comes down to whether or not we are being misled by looking under the lamppost.

To circle back around and totally unpack the metaphor, the idea is that when we analyze the microbiome on the basis of a fixed set of reference genomes we are able to do a very good job of measuring the organisms that we have reference genomes for, but we have very little information about the organisms which are not present in that collection. Particular bioinformatics methods are influenced by this bias to different degrees depending on how they process the raw data, but the underlying principle is inescapable.

The question you might ask is, “is it a good idea for me to use a reference-based approach in my WGS analysis?” and like most good questions the answer is complex. In the end, it comes down to how well the organisms that matter for your biological system have been characterized by reference genome sequencing. Some organisms have been studied quite extensively, and the revolution in genome sequencing has resulted in massive numbers of microbial reference genomes in the public domain. In those cases, using reference genomes for those organisms will almost certainly give you higher quality results than performing your analysis de novo.

The harder question to answer is which organisms matter to your biological system. For many diseases, we might think we already have a very good idea. However, one of the amazing benefits of WGS data is that it provides insight into every organism in a specimen that contains genetic material. This is an incredibly broad pool to draw from, and we are constantly surprised by how many organisms are present in the microbiome that we have not yet characterized or generated a reference genome for.

In the end, if you are a researcher who uses WGS data to understand complex microbial communities, my only recommendation is that you approach reference-based metagenomic analysis with an appreciation of the biases it brings along with its efficiency and speed. As a biologist who studies the microbiome, I think that there is a lot more diversity out there than we have characterized in the lab, and I am always excited to find out what lies at the furthest reaches from the lamppost.

What's Big about Small Proteins?

Something interesting has been happening in the world of microbiome research, and it’s all about small proteins.

What’s New?

There was a paper in my weekly roundup of microbiome publications which caught my eye:

Petruschke, H., Schori, C., Canzler, S. et al. Discovery of novel community-relevant small proteins in a simplified human intestinal microbiome. Microbiome 9, 55 (2021).

Reading through the abstract, the authors have “a particular focus on the discovery of novel small proteins with less than 100 amino acids.” While this may seem to be a relatively innocuous statement, I was very interested to see what they found because of some recent innovations in the computational approaches used to study the microbiome.

What’s the Context?

When people study the microbiome, they often only have access to the genome sequences of the bacteria which are present. This is very much the case for the type of metagenomic analysis I focus on, as it is for any approach that takes advantage of the massive amounts of data generated by genome sequencing instruments.

When analyzing bacterial genomes, we are able to predict what genes are contained in each genome using annotation tools designed for this purpose. The most commonly used tool for this task is Prokka, made by Torsten Seemann. Recently, researchers have started to realize that some bacterial proteins were being missed by these types of approaches, since the experimental data used to build the predictive models included very few small proteins.
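To make the idea concrete, here is a minimal sketch of the kind of naive scan that can surface short open reading frames which a length cutoff would otherwise discard. Everything in it (the placeholder contig, the 100 amino acid cutoff, the ATG-to-stop rule) is an illustrative assumption; real annotators like Prokka use trained gene models rather than this simple rule.

```python
# A naive six-frame scan for short open reading frames (<100 amino acids).
# Purely illustrative: real annotation tools do not work this way.

STOP_CODONS = {"TAA", "TAG", "TGA"}

def revcomp(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def short_orfs(seq: str, max_aa: int = 100):
    """Yield (strand, start, length_in_aa) for ATG-initiated ORFs shorter than max_aa."""
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i                      # open a candidate ORF
                elif codon in STOP_CODONS and start is not None:
                    aa_len = (i - start) // 3      # codons between start and stop
                    if aa_len < max_aa:
                        yield strand, start, aa_len
                    start = None                   # close the ORF and keep scanning

contig = "ATGAAATTTGGGTAA" * 10  # placeholder sequence; swap in a real assembled contig
print(sum(1 for _ in short_orfs(contig)), "short ORFs found")
```

The point is simply that short coding sequences are easy to enumerate but hard to trust, which is why the experimental validation described below matters so much.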

Then, in 2019 Dr. Ami Bhatt’s group at Stanford published a high-profile paper making the case that microbiome analyses were systematically omitting small bacterial proteins:

Sberro H, Fremin BJ, Zlitni S, Edfors F, Greenfield N, Snyder MP, Pavlopoulos GA, Kyrpides NC, Bhatt AS. Large-Scale Analyses of Human Microbiomes Reveal Thousands of Small, Novel Genes. Cell. 2019 Aug 22;178(5):1245-1259.e14. doi: 10.1016/j.cell.2019.07.016. Epub 2019 Aug 8. PMID: 31402174; PMCID: PMC6764417.

Around the same time, other groups were publishing studies using complementary experimental approaches which supported the idea that bacteria encode these small genes, and that they are transcribed and translated as bona fide proteins (a few quick examples).

What’s the Point?

The reason I think this story is worth mentioning is that it shines light on part of the foundation of microbiome research. When we conduct a microbiome experiment, we can only make a limited number of measurements. We then do the best job we can to infer the biological features which are relevant to our experimental question. Part of the revolution in microbiome research over the last ten years has been the explosion of metagenomic data which is now available. This research is particularly interesting because it shows us how our analysis of that data may have been missing an entire class of genetic elements: genes which encode proteins less than 100 amino acids in length.

At the end of the day, the message is a positive one: with improved experimental and computational techniques we can now extract more useful and accurate information from existing datasets. I am looking forward to seeing what we are able to find as the field continues to explore this new area of the microbiome!

What's the Matter with Recombination?

Why would I be so bold as to assign required reading on a Saturday morning via Twitter? Because the ideas laid out in this paper have practical implications for many of the computational tools that we use to understand the microbiome. Taxonomy and phylogeny lie at the heart of k-mer based methods for metagenomics, and they are the core justification for measuring any marker gene (e.g. 16S) with amplicon sequencing.

Don’t get me wrong, I am a huge fan of both taxonomy and phylogeny as two of the best ways we have for making sense of the microbial world, and I’m going to keep using both for the rest of my career. For exactly that reason, I think it’s very important to understand the ways in which these methods can be confounded (i.e. the ways they can mislead us) by mechanisms like genomic recombination.

What Is Recombination?

Bacteria are amazingly complex creatures, and they do a lot more than just grow and divide. During the course of a bacterial cell’s life, it may end up doing something exciting like:

  • Importing a piece of naked DNA from its environment and just pasting it into its genome;

  • Using a small needle to inject a plasmid (small chromosome) into an adjacent cell; or

  • Packaging up a small piece of its genome into a phage (protein capsule) which contains all of the machinery needed to travel some distance and then inject that DNA into another bacterium far away.

Not all bacteria do all of these things all of the time, but we know that they do happen. One common feature of these activities is that genetic material is exchanged between cells in a manner other than clonal reproduction (when a single cell splits into two). In other words, these are all forms of ‘recombination.’

How Do We Study the Microbiome?

Speaking for myself, I study the microbiome by analyzing data generated by genome sequencing instruments. We use those instruments to identify a small fraction of the genetic sequences contained in a microbiome sample, and then we draw inferences from those sequences. This may entail amplicon sequencing of the 16S gene, bulk WGS sequencing of a mixed population, or even single-cell sequencing of individual bacterial cells. Across all of these different technological approaches, we are collecting a small sample of the genomic sequences present in a much larger microbiome sample, and we are using that data to gain some understanding of the microbiome as a whole. In order to extrapolate from data to models, we rely on some key assumptions about how the world works, one of which is that bacteria do not frequently recombine.

What Does Recombination Mean to Me?

If you’ve read this far you are either a microbiome researcher or someone very interested in the topic, and so you should care about recombination. As an example, let’s walk through the logical progression of microbiome data analysis:

  • I have observed microbial genomic sequence S in specimen X. This may be an OTU, ASV, or WGS k-mer.

  • The same sequence S can also be observed in specimen Y, but not specimen Z. There may be some nuances of sequencing depth and the limit-of-detection, but I have satisfied myself that for this experiment marker S can be found in X and Y, but not Z.

  • Because bacteria infrequently recombine, I can infer that marker S represents a larger genomic region G which is similarly present in X and Y, but not Z. For 16S that genomic region would be the species- or genus-level core genome, and for WGS it could also be some accessory genetic elements like plasmids, etc. In the simplest rendering, we may give a name to genomic region G which corresponds to the taxonomic label for those organisms which share marker S (e.g. Escherichia coli).

  • When I compare a larger set of samples, I find that the marker S can be consistently found in samples obtained from individuals with disease D (like X and Y) but not in samples from healthy controls (like Z). Therefore I would propose the biological model that organisms containing the larger genomic region G are present at significantly higher relative abundance in the microbiome of individuals with disease D.

In this simplistic rendering I’ve tried to make it clear that the degree to which bacteria recombine will have a practical impact on how much confidence we can have in inferences which rely on the concepts of taxonomy or phylogeny.
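For readers who want the last step of that progression spelled out, here is a minimal sketch of the kind of inference involved. The presence/absence table is invented, and Fisher's exact test stands in for whatever statistical model a given study actually uses; the point is only to show where the "marker S implies region G" assumption enters.

```python
# Toy example: is marker S enriched in specimens from individuals with disease D?
# The data and the choice of Fisher's exact test are illustrative assumptions.
from scipy.stats import fisher_exact

# presence/absence of marker S per specimen, plus disease status
specimens = {
    "X": {"marker_S": True,  "disease_D": True},
    "Y": {"marker_S": True,  "disease_D": True},
    "Z": {"marker_S": False, "disease_D": False},
    "W": {"marker_S": False, "disease_D": False},
}

# build the 2x2 contingency table: rows = marker present/absent, cols = disease yes/no
table = [[0, 0], [0, 0]]
for s in specimens.values():
    row = 0 if s["marker_S"] else 1
    col = 0 if s["disease_D"] else 1
    table[row][col] += 1

odds_ratio, p_value = fisher_exact(table)
print(table, odds_ratio, p_value)
```

Note that the test only ever sees marker S; the leap from "marker S is enriched" to "organisms carrying region G are enriched" is exactly the step that recombination can undermine.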

The key idea here is that when we observe any marker S, we tend to assume that there is a monophyletic group of organisms which share that marker sequence. Monophyly is one of the most important concepts in microbiome analysis, and it also happens to be a lot of fun to say; it’s worth reading up on.

How Much Recombination Is There?

Getting back to the paper that started it all, the authors did a nice job of carefully estimating the frequency of recombination across a handful of bacterial species for which a reasonable amount of data is available. The answer they found is that recombination rates vary, and this answer matches our mechanistic understanding of recombination. The documented mechanisms of recombination vary widely across different organisms, and there is undoubtedly a lot more out there we haven’t characterized yet.

At the end of the day, we have only studied a small fraction of the organisms found in the microbiome. As such, we should bring a healthy dose of skepticism to any key assumption, like a lack of recombination, which we know is not universal.

In conclusion, I am going to continue to use taxonomy and phylogeny every single day that I study the microbiome, but I’m also going to stay alert for how recombination may be misleading me. On a practical note, I am also going to try to use methods like gene-level analysis which keep a tight constraint on the size of regions G which are inferred from any marker S.

Quality and Insights – The Human Gut Virome in 2019

There were a couple of good virome papers I read this week, and I thought it was worth commenting on the juxtaposition.

Virome — The collection of viruses which are found in a complex microbial community, such as the human microbiome. NB: Most viruses found in any environment are bacteriophages — the viruses which infect bacteria, and do not infect humans.

Measuring Quality, Measuring Viruses

https://www.nature.com/articles/s41587-019-0334-5

I was excited to see two of my favorite labs collaborating on a virome data quality project: the Bushman lab at the University of Pennsylvania (where I trained) and the Segata lab at the University of Trento (who originally made MetaPhlAn, the first breakthrough software tool for microbial metagenomics). The goal of this work was to measure the quality of virome research projects.

Virome research projects over the last decade have relied on a technological approach in which viral particles are physically isolated from a complex sample and then loaded onto a genome sequencer. There are a variety of experimental approaches which you can use to isolate viruses, including size filtration and density gradient centrifugation, which rely on the fact that viruses are physically quite different from bacterial cells.

The question asked by the researchers in this study was, “How well are these physical isolation methods actually working?” It’s such a good question that I’m surprised (in retrospect) that nobody had asked it before. As someone who has worked a bit in this area, I’m also surprised that I never thought to ask this question before.

Their approach was nice and straightforward: they looked in these datasets for sequences that should hardly ever be found in a true virome, namely those belonging to the bacterial ribosome, which viral genomes essentially never carry.

They found that the quality of these virome datasets varied extremely widely. You can read the paper for more details, and I am hesitant to post the figures on a public blog, but I really did not expect to see that there were published virome datasets with proportions of ribosomal sequences ranging from 0.001% all the way up to 1%.
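As a back-of-the-envelope version of that QC idea, here is a minimal sketch that flags virome samples with a suspiciously high fraction of reads assigned to ribosomal RNA. The read counts and the 0.1% threshold are made up for illustration, and the published tool does considerably more than this.

```python
# Toy QC check: fraction of reads assigned to bacterial rRNA per virome sample.
# Counts and the 0.1% threshold are illustrative assumptions, not values from the paper.

rrna_read_counts = {"virome_A": 120, "virome_B": 95_000}      # reads hitting rRNA references
total_read_counts = {"virome_A": 12_000_000, "virome_B": 9_500_000}

THRESHOLD = 0.001  # flag samples where more than 0.1% of reads look ribosomal

for sample, total in total_read_counts.items():
    frac = rrna_read_counts[sample] / total
    status = "FLAG: possible bacterial contamination" if frac > THRESHOLD else "ok"
    print(f"{sample}\t{frac:.5%}\t{status}")
```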

Take Home: When you use a laboratory method to study your organism of interest, you need to use some straightforward approach for proving to yourself and others that it is actually working as you expect. For something as challenging and complex as the human virome, this new QC tool might help the field maintain a high standard of quality and avoid misleading or erroneous results.

Viral Dark Matter in IBD

https://www.sciencedirect.com/science/article/pii/S1931312819305335

One of the best talks I saw at ASM Microbe 2019 was from Colin Hill (APC Microbiome Ireland) and so I was happy to see a new paper from that group analyzing the gut virome in the context of Inflammatory Bowel Disease. I was even more gratified to read the abstract and see some really plausible and defensible claims being made in an area which is particularly vulnerable to over-hype.


No Change in Richness in IBD: Microbiome researchers talk a lot about “richness,” which refers to the total number of distinct organisms in a community. This metric can be particularly hard to nail down with viruses because they are incredibly diverse and we have a hard time counting how many “types” there are (or even what being a “type” means). In this paper they used the very careful approach of re-assembling all viral genomes from scratch, rather than comparing against an existing database, and found that there was no difference in the richness of the virome in IBD vs. non-IBD samples. When others have analyzed the same data with methods that relied on reference databases, they found a significant difference in richness, which suggests that the database was confounding the results for those prior studies.
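To make the database-free richness idea concrete, here is a minimal sketch that counts, per sample, the number of de novo assembled viral contigs detected above a coverage threshold. The coverage matrix and the detection cutoff are invented for the example; they are not from the paper.

```python
# Toy richness calculation: number of assembled viral contigs detected per sample.
# The abundance matrix and detection threshold are made-up illustrative values.
import pandas as pd

# rows = de novo assembled viral contigs, columns = samples, values = mean coverage
coverage = pd.DataFrame(
    {
        "IBD_1":     [12.0, 0.0, 3.1, 0.0],
        "IBD_2":     [8.5,  0.2, 0.0, 4.4],
        "control_1": [0.0,  6.7, 2.9, 5.0],
    },
    index=["contig_a", "contig_b", "contig_c", "contig_d"],
)

DETECTION = 1.0  # minimum coverage to call a contig "present"
richness = (coverage >= DETECTION).sum(axis=0)
print(richness)  # one richness value per sample
```

The key design point is that the contigs come from the samples themselves, so richness does not depend on which viruses happen to be in a reference database.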

Changes in Viruses Reflect Bacteria: The authors state that “the changes in virome composition reflected alterations in bacterial composition,” which resonated with me so strongly that I think it merits mentioning again here. Viruses tend to be extremely specific and only infect a subset of strains within a species of bacteria. They are also so diverse that it is hard to even figure out which virus is infecting which bacteria. Therefore, with our current level of understanding and technology, viruses in the human gut are really best approached as a marker of what bacterial strains are present. It’s hard to get anything more concrete than that from sequencing-based approaches, except with some specific examples of well-understood viruses. With that limitation of our knowledge in mind, it is entirely expected that changes in bacteria would be reflected in changes in their viruses. Moreover, in this type of observational study we don’t have any way to figure out which direction the arrow of causality is pointing. I think the authors did a great job of keeping their claims within the limits of our knowledge without over-hyping the results.

There is a lot more to this paper, so I encourage you to read it in more depth and I won’t claim to make a full summary here.

In Summary: This is a fascinating field, with some really great groups doing careful and important work. We know a lot about how little we know, which means that there are even more exciting discoveries on the horizon.

Preprint: Identifying genes in the human microbiome that are reproducibly associated with human disease

I’m very excited about a project that I’ve been working on for a while with Prof. Amy Willis (UW - Biostatistics), and now that a preprint is available I wanted to share some of that excitement with you. Some of the figures are below, and you can look at the preprint for the rest.

Caveat: There are a ton of explanations and qualifications that I have glossed over in the statements below; I apologize in advance if I have lost some nuance and accuracy in the interest of broader understanding.

Big Idea

When researchers look for associations of the microbiome with human disease, they tend to focus on the taxonomic or metabolic summaries of those communities. The detailed analysis of all of the genes encoded by the microbes in each community hasn’t really been possible before, purely because there are far too many genes (millions) to meaningfully analyze on an individual basis. After a good amount of work I think that I have found a good way to efficiently cluster millions of microbial genes based on their co-abundance, and I believe that this computational innovation will enable a whole new approach for developing microbiome-based therapeutics.

Core Innovation

I was very impressed with the basic idea of clustering co-abundant genes (to form CAGs) when I saw it proposed initially by one of the premier microbiome research groups. However, the computational impossibility of performing all-by-all comparisons for millions of microbial genes (with trillions of potential comparisons) ultimately led to an alternate approach which uses co-abundance to identify “metagenomic species” (MSPs), a larger unit that uses an approximate distance metric to identify groups of CAGs that are likely from the same species.

That said, I was very interested in finding CAGs based on strict co-abundance clustering. After trying lots of different approaches, I eventually figured out that I could apply the Approximate Nearest Neighbor family of heuristics to effectively partition the clustering space and generate highly accurate CAGs from datasets with millions of genes across thousands of biological samples. So many details to skip here, but the take-home is that we used a new computational approach to perform dimensionality reduction (building CAGs), which made it reasonable to even attempt gene-level metagenomics to find associations of the microbiome with human disease.

Just to make sure that I’m not underselling anything here, being able to use this new software to perform exhaustive average linkage clustering based on the cosine distance between millions of microbial genes from hundreds of metagenomes is a really big deal, in my opinion. I mostly say this because I spent a long time failing at this, and so the eventual success is extremely gratifying.
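As a toy illustration of the underlying operation, here is a sketch of average linkage clustering of gene abundance profiles by cosine distance on a tiny simulated matrix. This is not the actual software or its approximate-nearest-neighbor heuristics; it is the exact, exhaustive version that becomes infeasible at the scale of millions of genes, which is precisely why those heuristics were needed.

```python
# Toy version of co-abundance gene clustering (CAGs): average linkage on cosine distance.
# Real datasets have millions of genes, which is why exhaustive all-by-all clustering
# needs approximate nearest neighbor heuristics; this sketch is exact and tiny.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# rows = genes, columns = samples, values = relative abundance (simulated)
base = rng.lognormal(size=(3, 20))                 # three "true" abundance profiles
genes = np.repeat(base, repeats=5, axis=0)         # 15 genes, 5 per profile
genes += rng.normal(scale=0.05, size=genes.shape)  # add a little noise

distances = pdist(genes, metric="cosine")          # all-by-all cosine distances
tree = linkage(distances, method="average")        # average linkage clustering
cags = fcluster(tree, t=0.2, criterion="distance") # cut the tree at cosine distance 0.2

print(cags)  # genes sharing a label belong to the same co-abundant gene group
```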

Associating the Microbiome with Disease

We applied this new computational approach to existing, published microbiome datasets in order to find gene-level associations of the microbiome with disease. The general approach was to look for individual CAGs (groups of co-abundant microbial genes) that were significantly associated with disease (higher or lower in abundance in the stool of people with a disease, compared to those people without the disease). We did this for both colorectal cancer (CRC) and inflammatory bowel disease (IBD), mostly because those are the two diseases for which multiple independent cohorts existed with WGS microbiome data.

Discovery / Validation Approach

The core of our statistical analysis of this approach was to look for associations with disease independently across both a discovery and a validation cohort. In other words, we used the microbiome data from one group of 100-200 people to see if any CAGs were associated with disease, and then we used a completely different group of 100-200 people in order to validate that association.
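Here is a minimal sketch of that discovery/validation logic, assuming a CAG-by-sample abundance table and using a rank-based test with FDR correction. The simulated data, the Mann-Whitney test, and the 0.05 cutoffs are assumptions for the sketch; the preprint uses its own statistical model, so treat this purely as an illustration of the workflow.

```python
# Illustrative discovery/validation workflow for CAG-disease associations.
# All inputs, tests, and thresholds here are assumptions made for the example.
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

def test_cags(abundance, is_case):
    """Return one p-value per CAG comparing cases vs. controls (rows = samples)."""
    pvals = []
    for cag in abundance.T:                         # iterate over CAG columns
        _, p = mannwhitneyu(cag[is_case], cag[~is_case], alternative="two-sided")
        pvals.append(p)
    return np.array(pvals)

rng = np.random.default_rng(1)
n_cags = 50
discovery = rng.lognormal(size=(150, n_cags))       # 150 samples in the discovery cohort
validation = rng.lognormal(size=(120, n_cags))      # 120 samples in the validation cohort
disc_case = rng.random(150) < 0.5
val_case = rng.random(120) < 0.5

# spike in a signal: the first five CAGs are more abundant in cases in both cohorts
discovery[disc_case, :5] *= 3
validation[val_case, :5] *= 3

disc_p = test_cags(discovery, disc_case)
hits = multipletests(disc_p, alpha=0.05, method="fdr_bh")[0]   # discovery hits after FDR

val_p = test_cags(validation, val_case)
replicated = hits & (val_p < 0.05)                  # hits that also pass in validation
print(f"{hits.sum()} discovery hits, {replicated.sum()} replicated")
```

The design choice worth noting is that the validation cohort is only asked to confirm the specific CAGs flagged in discovery, which keeps the multiple-testing burden manageable.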

Surprising Result

Quite notably, those CAGs which were associated with disease in the discovery cohort were also similarly associated with disease in the validation cohort. These were different groups of people, different laboratories, different sample processing protocols, and different sequencing facilities. With all of those differences, I am very hopeful that the consistencies represent an underlying biological reality that is true across most people with these diseases.

Figure 2A: Association of microbial CAGs with host CRC across two independent cohorts.

Developing Microbiome Therapeutics: Linking Genes to Isolates

While it is important to ensure that results are reproducible across cohorts, it is much more important that the results are meaningful and provide testable hypotheses about treating human disease. The aspect of these results I am most excited about is that each of the individual genes that were associated above with CRC or IBD can be directly aligned against the genomes of individual microbial isolates. This allows us to identify those strains which contain the highest number of genes which are associated positively or negatively with disease. It should be noted at this point that observational data does not provide any information on causality — the fact that a strain is more abundant in people with CRC could be because it has a growth advantage in CRC, it could be that it causes CRC, or it could be something else entirely. However, this gives us some testable hypotheses and a place to start for future research and development.
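As a sketch of that last step, suppose we already had a table of alignments between disease-associated genes and isolate genomes; ranking isolates is then just a matter of summing per-gene association coefficients per genome. The hits table, gene names, and coefficients below are all invented for illustration and are not from the preprint.

```python
# Toy ranking of isolate genomes by their content of disease-associated genes.
# The alignment table and association coefficients are invented for illustration.
import pandas as pd

# one row per (disease-associated gene, isolate genome) alignment hit
hits = pd.DataFrame(
    {
        "gene":    ["gene_1", "gene_1", "gene_2", "gene_3", "gene_3"],
        "isolate": ["iso_A",  "iso_B",  "iso_A",  "iso_B",  "iso_C"],
    }
)

# estimated coefficient of association with disease for each gene (sign matters)
coef = {"gene_1": 0.8, "gene_2": 0.5, "gene_3": -0.6}

hits["coef"] = hits["gene"].map(coef)
ranking = hits.groupby("isolate")["coef"].sum().sort_values(ascending=False)
print(ranking)  # isolates carrying the most disease-associated gene content rank highest
```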

Figure 3C: Presence of CRC-associated genes across a subset of microbial isolates in RefSeq. Color bar shows coefficient of correlation with CRC.

Put simply, I am hopeful that others in the microbiome field will find this to be a useful approach to developing future microbiome therapeutics. Namely,

  1. Start with a survey of people with and without a disease,

  2. Collect WGS data from microbiome samples,

  3. Find microbial CAGs that are associated with disease, and then

  4. Identify isolates in the freezer containing those genes.

That process provides a prioritized list of isolates for preclinical testing, which will hopefully make it a lot more efficient to develop an effective microbiome therapeutic.

Thank You

Your time and attention are appreciated, as always, dear reader. Please do not hesitate to be in touch if you have any questions or would like to discuss anything in more depth.

Massive unexplored genetic diversity of the human microbiome

When you analyze extremely large datasets, you tend to be guided by your intuition or predictions about how those datasets are composed and how they will behave. Having studied the microbiome for a while, I would say that my primary rule of thumb for what to expect from any new sample is: tons of novel diversity. This week saw the publication of another great paper showing just how true this is.

Extensive Unexplored Human Microbiome Diversity Revealed by Over 150,000 Genomes from Metagenomes Spanning Age, Geography, and Lifestyle


The Approach

If you are new to the microbiome, you may be interested to know that there are basically two approaches to figuring out what microbes (bacteria, viruses, etc.) are in a given sample (e.g. stool). You can either (1) compare all of the DNA in that sample to a reference database of microbial genomes, or (2) try to reassemble the genomes in each sample directly from the DNA.

The thesis of this paper is one that I strongly support: reference databases contain very little of the total genomic content of microbes out there in the world. By extension, they predict that (1) will perform poorly, while (2) will generate a much better representation of which microbes are present.

Testing this idea, the authors analyzed an immense amount of microbiome data (almost 10,000 biological samples!), performing the relatively computationally intensive task of reconstructing genomes (so-called _de novo_ assembly).

The Results

The authors found a lot of things, but the big message is that they were able to reconstruct a *ton* of new genomes from these samples — organisms that had never been sequenced before, and many that don’t really resemble any phyla that we know of. In other words, they found a lot more novel genomic content than even I expected, and I was sure that they would find a lot.


There’s a lot more content here for microbial genome aficionados, so feel free to dig in on your own (yum yum).

Take Home

When you think about what microbes are present in the microbiome, remember that there are many new microbes that we’ve never seen before. Some of those are new strains of clearly recognizable species (e.g. E. coli with a dozen new genes), but some will be novel organisms that have never been cultured or sequenced by any lab.

If you’re a scientist, keep that in mind when you are working in this area. If you’re a human, take hope and be encouraged by the fact that there is still a massive undiscovered universe within us, full of potential and amazing new things waiting to be discovered.

The Blessing and the Curse of Dimensionality

A paper recently caught my eye, and I think it is a great excuse to talk about data scale and dimensionality.

Vatanen, T. et al. The human gut microbiome in early-onset type 1 diabetes from the TEDDY study. Nature 562, 589–594 (2018). (link)

In addition to having a great acronym for a study of child development, they sequenced 10,913 metagenomes from 783 children.

This is a ton of data.

If you haven’t worked with a “metagenome,” it’s usually about 10-20 million short words, each corresponding to 100-300 bases of a microbial genome. It’s a text file with some combination of ATCG written out over tens of millions of lines, with each line being a few hundred letters long. A single metagenome is big. It won’t open in Word. Now imagine you have 10,000 of them. Now imagine you have to make sense out of 10,000 of them.

Now, I’m being a bit extreme – there are some ways to deal with the data. However, I would argue that it’s this problem, how to deal with the data, that we could use some help with.

Taxonomic classification

The most effective way to deal with the data is to take each metagenome and figure out which organisms are present. This process is called “taxonomic classification” and it’s something that people have gotten pretty good at recently. You take all of those short ATCG words, you match them against all of the genomes you know about, and you use that information to make an educated guess about which organisms are present. This is a biologically meaningful reduction in the data that results in hundreds or thousands of observations per sample. You can also validate these methods by processing “mock communities” and seeing if you get the right answer. I’m a fan.
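For readers who have never seen it spelled out, here is a toy version of k-mer based classification: build a lookup from reference k-mers to organism names, then classify each read by which organism its k-mers vote for. The reference sequences, the tiny k, and the majority-vote rule are simplifying assumptions; real classifiers use much longer k-mers, compressed indexes, and lowest-common-ancestor logic.

```python
# Toy k-mer based taxonomic classification (vastly simplified).
from collections import Counter, defaultdict

K = 8  # real tools use much longer k-mers (e.g. 31)

references = {
    "Escherichia coli":     "ATGGCTAGCTAGGATCCGATCGATCGGATACG",
    "Bacteroides fragilis": "TTGACCGGTAACGTTAGCCGTAGGCTAACGTT",
}

# build the k-mer -> set of organisms index
index = defaultdict(set)
for organism, genome in references.items():
    for i in range(len(genome) - K + 1):
        index[genome[i:i + K]].add(organism)

def classify(read: str) -> str:
    """Assign a read to the organism whose reference shares the most k-mers with it."""
    votes = Counter()
    for i in range(len(read) - K + 1):
        for organism in index.get(read[i:i + K], ()):
            votes[organism] += 1
    return votes.most_common(1)[0][0] if votes else "unclassified"

print(classify("GCTAGCTAGGATCCGATCG"))  # matches the toy E. coli reference
print(classify("CCCCCCCCCCCCCCCCCCC"))  # matches nothing, so it is unclassified
```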

With taxonomic classification you end up with thousands of observations (in this case organisms) across however many samples you have. In the TEDDY study they had >10,000 samples, so this dataset has plenty of statistical power (you generally want more samples than observations).

Metabolic reconstruction

The other main way that people analyze metagenomes these days is by quantifying the abundance of each biochemical pathway present in the sample. I won’t talk about this here because my opinions are controversial and it’s best left for another post.

Gene-level analysis

I spend most of my time these days on “gene-level analysis.” This type of analysis tries to quantify every gene present in every genome in every sample. The motivation here is that sometimes genes move horizontally between species, and sometimes different strains within the same species will have different collections of genes. So, if you want to find something that you can’t find with taxonomic analysis, maybe gene-level analysis will pick it up. However, that’s an entirely different can of worms. Let’s open it up.

Every microbial genome contains roughly 1,000 genes. Every metagenome contains a few hundred genomes. So every metagenome contains hundreds of thousands of genes. When you look across a few hundred samples you might find a few million unique genes. When you look across 10,000 samples I can only guess that you’d find tens of millions of unique genes.

Now the dimensionality of the data is all lopsided. We have tens of millions of genes, which are observed across tens of thousands of samples. A biostatistician would tell us that this is seriously underpowered for making sense of the biology. Basically, this is an approach that just doesn’t work for studies with 10,000 samples, which I find to be pretty daunting.

Dealing with scale

The way that we find success in science is that we take information that a human cannot comprehend, and we transform it into something that a human can comprehend. We cannot look at a text file with ten million lines and understand anything about that sample, but we can transform it into a list of organisms with names that we can Google. I’m spending a lot of my time trying to do the same thing with gene-level metagenomic analysis, trying to transform it into something that a human can comprehend. This all falls into the category of “dimensionality reduction”: trying to reduce the number of observations per sample while still retaining the biological information we care about. I’ll tell you that this problem is really hard, and I’m not sure I have the single best angle on it. I would absolutely love to have more eyes on the problem.

It increasingly seems like the world is driven by people who try to make sense of large amounts of data, and I would humbly ask anyone who cares about this to spend some time thinking about metagenomic analysis. The data is massive, and we have a hard time figuring out how to make sense of it. We have a lot of good starts, and there are a lot of good people working in this area (too many to list), but I think we could always use more help.

The authors of the paper who analyzed 10,000 metagenomes learned a lot about how the microbiome develops during early childhood, but I’m sure that there is even more we can learn from this data. I am also sure that we are getting close to a world where we have 10X the data per sample, and experiments with 10X the samples. That is a world that I think we are ill-prepared for, and I’m excited to try to build the tools that we will need for it.