In the world of research, a phrase that often comes up is “looking under the lamppost.” This refers to the fundamental bias in which we are more likely to pay attention to whatever is illuminated by our analytical methods, even though it might not be the most important thing to look at. This phrase always comes to mind when people are discussing how best to use reference genomes in microbial metagenomics, and I thought that a short explanation might be worthwhile.
The paper which prompted this thought was a recent preprint, “OGUs enable effective, phylogeny-aware analysis of even shallow metagenome community structures” (https://www.biorxiv.org/content/10.1101/2021.04.04.438427v1). For those who aren’t familiar with the last author on this paper, Dr. Rob Knight is one of the most published and influential researchers in the microbiome field and is particularly known for being at the forefront of 16S analysis with the widely-used QIIME software suite. In this paper, they describe a method for analyzing another type of data, whole-genome shotgun (WGS) metagenomics, which is based on the alignment of short genome fragments to a collection of reference genomes.
Rather than focus on this particular paper, I would rather spend my short time talking about the use of reference genomes in metagenomic analysis in general. There are many different methods which use reference genomes in different ways. These include alignment-based approaches like the OGU method as well as k-mer based taxonomic classification, which is one of the most widely used approaches to WGS analysis. There are many bioinformatic advantages to using reference genomes, including the greatly increased speed of analyzing new data against a fixed collection of known organisms. The question really comes down to whether or not we are being misled by looking under the lamppost.
To circle back around and totally unpack the metaphor, the idea is that when we analyze the microbiome on the basis of a fixed set of reference genomes we are able to do a very good job of measuring the organisms that we have reference genomes for, but we have very little information about the organisms which are not present in that collection. Particular bioinformatics methods are influenced by this bias to different degrees depending on how they process the raw data, but the underlying principle is inescapable.
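To make the lamppost effect concrete, here is a minimal toy sketch in Python. It is not how any real classifier works: each simulated read simply carries the label of the organism it came from, whereas real tools have to infer that by alignment or k-mer matching. The organism names, the read counts, and the `reference_based_profile` function are all made up for illustration.

```python
from collections import Counter


def reference_based_profile(reads, reference_db):
    """Toy reference-based profiler.

    `reads` is a list of labels naming the organism each read came from
    (a simplification; real reads do not carry their source).
    `reference_db` is the set of organisms we have reference genomes for.
    Returns (relative abundances among assigned reads, fraction unassigned).
    """
    counts = Counter()
    unassigned = 0
    for source in reads:
        if source in reference_db:
            counts[source] += 1
        else:
            # No reference genome: the read cannot be assigned to anything.
            unassigned += 1

    assigned = sum(counts.values())
    proportions = {org: n / assigned for org, n in counts.items()} if assigned else {}
    unassigned_fraction = unassigned / len(reads) if reads else 0.0
    return proportions, unassigned_fraction


# Hypothetical community: half of the reads come from an organism
# that has no reference genome.
reads = ["organism_A"] * 30 + ["organism_B"] * 20 + ["organism_X"] * 50
reference_db = {"organism_A", "organism_B"}

proportions, unassigned_fraction = reference_based_profile(reads, reference_db)
print(proportions)
print(unassigned_fraction)
```

In this made-up community, organism_A accounts for 30% of the reads, but the reference-based profile reports it as 60% of what could be assigned, and organism_X never appears in the results at all. The only hint that something is missing is the unassigned fraction, which is why it is worth checking whenever a tool reports it.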
The question you might ask is, “is it a good idea for me to use a reference-based approach in my WGS analysis?” and like most good questions the answer is complex. In the end, it comes down to how well the organisms that matter for your biological system have been characterized by reference genome sequencing. Some organisms have been studied quite extensively, and the revolution in genome sequencing has resulted in massive numbers of microbial reference genomes in the public domain. In those cases, using reference genomes for those organisms will almost certainly give you higher quality results than performing your analysis de novo.
The harder question to answer is which organisms matter to your biological system. For many diseases, we might think we have a very good idea of which organisms matter. However, one of the amazing benefits of WGS data is that it provides insight into all organisms in a specimen which contain genetic material. This is an incredibly broad pool of organisms to draw from, and we are constantly being surprised by how many organisms are present in the microbiome which we have not yet characterized or generated a reference genome for.
In the end, if you are a researcher who uses WGS data to understand complex microbial communities, my only recommendation is that you approach reference-based metagenomic analysis with an appreciation of the biases that it brings along with its efficiency and speed. As a biologist who studies the microbiome, I think that there is a lot more diversity out there than we have characterized in the lab, and I am always excited to find out what lies at the furthest reaches from the lamppost.