Python-Based Data Viz (With No Installation Required)

In my work I’m constantly trying to find better ways to communicate the results of complex analyses to users, whether they are other scientists or the general public. As the tools for delivering sophisticated apps via the web have become better and better, it has even become possible for someone like me (who is not a web developer) to package up my Python code to be run directly in the web browser.

To try to help other people adopt this technique, I wrote a short tutorial which is now available on Towards Data Science:

https://towardsdatascience.com/python-based-data-viz-with-no-installation-required-aaf2358c881


Dispatches from the microbial frontier of cancer research

One of the aspects of my job that I enjoy the most is being able to support the stellar researchers working here at the Fred Hutch Cancer Center. My own personal goal is that my work analyzing data from the human gut microbiome will one day be used to improve the tools we have for preventing and treating cancer. As part of that effort, I have been collaborating with a brilliant physician-scientist, Dr. Neel Dey, who combines his clinical practice with a biomedical research program to identify the ways in which the microbiome influences colorectal cancer.

Our work together was recently featured in an article by Sabin Russell from the Fred Hutch press office, which I thought did a great job of capturing our recent advances.

Dispatches from the microbial frontier of cancer research



Microbial Pan-Genome Cartography

Many of the projects that I’ve been working on recently have led me to ask questions like, “what bacteria encode this group of genes?” and “what genes are shared across this group of bacteria?” Even for a single species, the patterns of genes across genomes can be complex and beautiful!

Update: All of the maps shown below can now be viewed interactively here.

To make it easier to generate interactive displays to explore these patterns (which I call “genes-in-genomes maps”), I ended up building out a collection of tools which I would love for any researcher to use.
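To make the underlying data structure concrete: at its heart, a genes-in-genomes map is a clustered presence/absence matrix. Here is a minimal sketch of the idea (not the actual tool; pandas and seaborn stand in for the interactive displays, and the table is hypothetical):

```python
# A toy genes-in-genomes map: rows are genomes, columns are genes, and
# each value records the presence (1) or absence (0) of a gene.
import pandas as pd
import seaborn as sns

# Hypothetical table; in practice this comes from aligning a deduplicated
# gene catalog against a collection of genomes.
presence = pd.DataFrame(
    {
        "geneA": [1, 1, 1, 1],
        "geneB": [1, 1, 0, 0],
        "geneC": [0, 1, 1, 0],
        "geneD": [0, 0, 1, 1],
    },
    index=["genome1", "genome2", "genome3", "genome4"],
)

# Cluster on Jaccard distance so that co-occurring genes (and genomes
# with similar gene content) end up next to each other.
g = sns.clustermap(presence, metric="jaccard", method="average", cmap="Blues")
g.savefig("genes_in_genomes.png")
```

Clustering on Jaccard distance is what makes the larger patterns pop out visually: genes which travel together across genomes land in contiguous blocks.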

To explain a bit more about this topic, I recorded a short talk based on a presentation I gave at a local microbiome interest group.

In the presentation I walk through a handful of these pan-genome maps, which you can also download and explore for yourself using the links below.

I’m excited to think that this way of displaying microbial pan-genomes might be useful to researchers who are interested in exploring the diversity of the microbial world. If you’d like to talk more about using this approach in your research, please don’t hesitate to get in touch.

Looking Under the Lamppost - Reference Genomes in Metagenomics

In the world of research, a phrase that often comes up is “looking under the lamppost.” This refers to a fundamental bias: we are more likely to pay attention to whatever our analytical methods illuminate, even though it might not be what matters most. This phrase always comes into my mind when people are discussing how best to use reference genomes in microbial metagenomics, and I thought that a short explanation might be worthwhile.

The paper which prompted this thought was a recent preprint, “OGUs enable effective, phylogeny-aware analysis of even shallow metagenome community structures” (https://www.biorxiv.org/content/10.1101/2021.04.04.438427v1). For those who aren’t familiar with the last author on this paper, Dr. Rob Knight is one of the most published and influential researchers in the microbiome field and is particularly known for being at the forefront of the field of 16S analysis with the widely-used QIIME software suite. In this paper, they describe a method for analyzing another type of data, whole-genome shotgun (WGS) metagenomics, which is based on the alignment of short genome fragments to a collection of reference genomes.

Rather than focus on this particular paper, I would rather spend my short time talking about the use of reference genomes in metagenomic analysis in general. There are many different methods which use reference genomes in different ways. This includes alignment-based approaches like the OGU method as well as k-mer based taxonomic classification, which is one of the most widely used approaches to WGS analysis. There are many bioinformatic advantages of using reference genomes, including the greatly increased speed of being able to analyze new data against a fixed collection of known organisms. The question really comes down to whether or not we are being misled by looking under the lamppost.

To circle back around and totally unpack the metaphor, the idea is that when we analyze the microbiome on the basis of a fixed set of reference genomes we are able to do a very good job of measuring the organisms that we have reference genomes for, but we have very little information about the organisms which are not present in that collection. Particular bioinformatics methods are influenced by this bias to different degrees depending on how they process the raw data, but the underlying principle is inescapable.
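One practical way to estimate how much of a sample lies outside the lamppost is to count the reads which fail to align against your reference collection. Here is a minimal sketch using pysam, assuming a hypothetical aligned.bam which retains unmapped reads:

```python
# Count the fraction of reads with no match in the reference collection,
# i.e. the part of the sample which the lamppost does not illuminate.
# Assumes a hypothetical aligned.bam which retains unmapped reads.
import pysam

mapped = unmapped = 0
with pysam.AlignmentFile("aligned.bam", "rb") as bam:
    for read in bam.fetch(until_eof=True):
        if read.is_secondary or read.is_supplementary:
            continue  # count each read only once
        if read.is_unmapped:
            unmapped += 1
        else:
            mapped += 1

print(f"{unmapped / (mapped + unmapped):.1%} of reads fall outside the reference collection")
```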

The question you might ask is, “is it a good idea for me to use a reference-based approach in my WGS analysis?” and, like most good questions, the answer is complex. In the end, it comes down to how well the organisms that matter for your biological system have been characterized by reference genome sequencing. Some organisms have been studied quite extensively, and the revolution in genome sequencing has resulted in massive numbers of microbial reference genomes in the public domain. In those cases, using reference genomes for those organisms will almost certainly give you higher-quality results than performing your analysis de novo.

The harder question to answer is what organisms matter to your biological system. In many cases, for many diseases, we might think we have a very good idea of what organisms matter. However, one of the amazing benefits of WGS data is that it provides insight into all organisms in a specimen which contain genetic material. This is an incredibly broad pool of organisms to draw from, and we are constantly being surprised by how many organisms are present in the microbiome which we have not yet characterized or generated a reference genome for.

In the end, if you are a researcher who uses WGS data to understand complex microbial communities, my only recommendation is that you approach reference-based metagenomic analysis with an appreciation of the biases that it brings along with its efficiency and speed. As a biologist who studies the microbiome I think that there is a lot more diversity out there than we have characterized in the lab, and I am always excited to find out what lies at the furthest reaches from the lamppost.

What's Big about Small Proteins?

Something interesting has been happening in the world of microbiome research, and it’s all about small proteins.

What’s New?

There was a paper in my weekly roundup of microbiome publications which caught my eye:

Petruschke, H., Schori, C., Canzler, S. et al. Discovery of novel community-relevant small proteins in a simplified human intestinal microbiome. Microbiome 9, 55 (2021).

Reading through the abstract, I saw that the authors have “a particular focus on the discovery of novel small proteins with less than 100 amino acids.” While this may seem to be a relatively innocuous statement, I was very interested to see what they found because of some recent innovations in the computational approaches used to study the microbiome.

What’s the Context?

When people study the microbiome, they often only have access to the genome sequences of the bacteria which are present. This is very much the case for the type of metagenomic analysis I focus on, as it is for any approach which takes advantage of the massive amounts of data generated by genome sequencing instruments.

When analyzing bacterial genomes, we are able to predict what genes are contained in each genome using annotation tools designed for this purpose. The most commonly used tool for this task is Prokka, made by Torsten Seemann. Recently, researchers have started to realize that there are some bacterial proteins which were being missed by these types of approaches, since the experimental data used to build the predictive models did not include a whole collection of small proteins.
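To make that size cutoff concrete, here is a minimal sketch (using Biopython on a hypothetical genome.fasta) of scanning all six reading frames for ORFs which would encode proteins of less than 100 amino acids, the size class which those predictive models tended to miss:

```python
# Scan a genome for short open reading frames (ORFs) encoding proteins
# under 100 amino acids -- the size class that standard annotation
# models have historically tended to miss.
from Bio import SeqIO

MIN_LEN, MAX_LEN = 10, 100  # amino acids; the lower bound is arbitrary

def short_orfs(seq):
    """Yield candidate small proteins from all six reading frames."""
    for strand in (seq, seq.reverse_complement()):
        for frame in range(3):
            # Trim to a multiple of 3 so translate() stays in frame
            sub = strand[frame:]
            sub = sub[: len(sub) - len(sub) % 3]
            protein = sub.translate(table=11)  # bacterial genetic code
            # Each stop-delimited segment is a candidate coding region
            for fragment in str(protein).split("*"):
                start = fragment.find("M")
                if start == -1:
                    continue
                candidate = fragment[start:]
                if MIN_LEN <= len(candidate) < MAX_LEN:
                    yield candidate

# 'genome.fasta' is a stand-in for any assembled genome or contig set
for record in SeqIO.parse("genome.fasta", "fasta"):
    for orf in short_orfs(record.seq):
        print(record.id, len(orf), orf)
```

A naive six-frame scan like this turns up a huge number of candidates, which is exactly why the hard part is the statistical and experimental evidence needed to separate real small proteins from spurious ORFs.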

Then, in 2019 Dr. Ami Bhatt’s group at Stanford published a high-profile paper making the case that microbiome analyses were systematically omitting small bacterial proteins:

Sberro H, Fremin BJ, Zlitni S, Edfors F, Greenfield N, Snyder MP, Pavlopoulos GA, Kyrpides NC, Bhatt AS. Large-Scale Analyses of Human Microbiomes Reveal Thousands of Small, Novel Genes. Cell. 2019 Aug 22;178(5):1245-1259.e14. doi: 10.1016/j.cell.2019.07.016. Epub 2019 Aug 8. PMID: 31402174; PMCID: PMC6764417.

Around the same time, other groups were publishing studies using complementary experimental approaches which supported the idea that bacteria encode these small genes, and that they are transcribed and translated as bona fide proteins (a few quick examples).

What’s the Point?

I think this story is worth mentioning because it shines a light on part of the foundation of microbiome research. When we conduct a microbiome experiment, we can only make a limited number of measurements. We then do the best job we can to infer the biological features which are relevant to our experimental question. Part of the revolution of microbiome research from the last ten years has been the explosion of metagenomic data which is now available. This research is particularly interesting because it shows us how our analysis of that data may have been missing an entire class of genetic elements — genes which encode proteins less than 100 amino acids in length.

At the end of the day, the message is a positive one: with improved analytical techniques we can now extract more useful and accurate information from existing datasets. I am looking forward to seeing what we are able to find as the field continues to explore this new area of the microbiome!

What's the Matter with Recombination?

Why would I be so bold as to assign required reading on a Saturday morning via Twitter? Because the ideas laid out in this paper have practical implications for many of the computational tools that we use to understand the microbiome. Taxonomy and phylogeny lie at the heart of k-mer based methods for metagenomics, and they provide the core justification for measuring any marker gene (e.g. 16S) with amplicon sequencing.

Don’t get me wrong, I am a huge fan of both taxonomy and phylogeny as two of the best ways for humans to understand the microbial world, and I’m going to keep using both for the rest of my career. For that reason, I think it’s very important to understand the ways in which these methods can be confounded (i.e. the ways in which they can mislead us) by mechanisms like genomic recombination.

What Is Recombination?

Bacteria are amazingly complex creatures, and they do a lot more than just grow and divide. During the course of a bacterial cell’s life, it may end up doing something exciting like:

  • Importing a piece of naked DNA from its environment and just pasting it into its genome;

  • Using a small needle to inject a plasmid (small chromosome) into an adjacent cell; or

  • Packaging up a small piece of its genome into a phage (protein capsule) which contains all of the machinery needed to travel a distance and then inject that DNA into another bacterium far away.

Not all bacteria do all of these things all of the time, but we know that they do happen. One common feature of these activities is that genetic material is exchanged between cells in a manner other than clonal reproduction (when a single cell splits into two). In other words, these are all forms of ‘recombination.’

How Do We Study the Microbiome?

Speaking for myself, I study the microbiome by analyzing data generated by genome sequencing instruments. We use those instruments to identify a small fraction of the genetic sequences contained in a microbiome sample, and then we draw inferences from those sequences. This may entail amplicon sequencing of the 16S gene, bulk WGS sequencing of a mixed population, or even single-cell sequencing of individual bacterial cells. Across all of these different technological approaches, we are collecting a small sample of the genomic sequences present in a much larger microbiome sample, and we are using that data to gain some understanding of the microbiome as a whole. In order to extrapolate from data to models, we rely on some key assumptions about how the world works, one of which is that bacteria do not frequently recombine.

What Does Recombination Mean to Me?

If you’ve read this far you are either a microbiome researcher or someone very interested in the topic, and so you should care about recombination. As an example, let’s walk through the logical progression of microbiome data analysis:

  • I have observed microbial genomic sequence S in specimen X. This may be an OTU, ASV, or WGS k-mer.

  • The same sequence S can also be observed in specimen Y, but not specimen Z. There may be some nuances of sequencing depth and the limit-of-detection, but I have satisfied myself that for this experiment marker S can be found in X and Y, but not Z.

  • Because bacteria infrequently recombine, I can infer that marker S represents a larger genomic region G which is similarly present in X and Y, but not Z. For 16S that genomic region would be the species- or genus-level core genome, and for WGS it could also be some accessory genetic elements like plasmids, etc. In the simplest rendering, we may give a name to genomic region G which corresponds to the taxonomic label for those organisms which share marker S (e.g. Escherichia coli).

  • When I compare a larger set of samples, I find that the marker S can be consistently found in samples obtained from individuals with disease D (like X and Y) but not in samples from healthy controls (like Z). Therefore I would propose the biological model that organisms containing the larger genomic region G are present at significantly higher relative abundance in the microbiome of individuals with disease D.

In this simplistic rendering I’ve tried to make it clear that the degree to which bacteria recombine will have a practical impact on how much confidence we can have in inferences which rely on the concepts of taxonomy or phylogeny.
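To sketch out that last inference step in code (with hypothetical presence/absence calls, not real data), the statistical core is just a contingency test of marker presence against disease status:

```python
# Minimal sketch of the final inference step: is marker S observed more
# often in specimens from individuals with disease D? The presence/absence
# calls below are hypothetical.
from scipy.stats import fisher_exact

# Presence of marker S per specimen, after handling sequencing depth
# and limit-of-detection upstream.
disease = {"X": True, "Y": True, "W": True, "V": False}
healthy = {"Z": False, "U": False, "T": True, "R": False}

table = [
    [sum(disease.values()), len(disease) - sum(disease.values())],
    [sum(healthy.values()), len(healthy) - sum(healthy.values())],
]
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3f}")

# Any conclusion about the larger region G rides on the assumption that
# marker S and G stay physically linked, i.e. that recombination is rare.
```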

The key idea here is that when we observe any marker S, we tend to assume that there is a monophyletic group of organisms which share that marker sequence. Monophyly is one of the most important concepts in microbiome analysis, and it also happens to be a lot of fun to say — it’s worth reading up on.

How Much Recombination Is There?

Getting back to the paper that started it all, the authors did a nice job of carefully estimating the frequency of recombination across a handful of bacterial species for which a reasonable amount of data is available. The answer they found is that recombination rates vary, and this answer matches our mechanistic understanding of recombination. The documented mechanisms of recombination vary widely across different organisms, and there is undoubtedly a lot more out there we haven’t characterized yet.

At the end of the day, we have only studied a small fraction of the organisms which are found in the microbiome. As such, we should bring a healthy dose of skepticism to any key assumption, like a lack of recombination, which we know is not universal.

In conclusion, I am going to continue to use taxonomy and phylogeny every single day that I study the microbiome, but I’m also going to stay alert for how recombination may be misleading me. On a practical note, I am also going to try to use methods like gene-level analysis which keep a tight constraint on the size of regions G which are inferred from any marker S.

Geneshot: Identifying microbial genes associated with human health and disease

The focus of my independent research over the last few years has been on how we (the microbiome research community) can use whole-genome shotgun sequencing (WGS) data to efficiently identify which genetic elements within microbes are consistently enriched in the microbiome of humans with particular health or disease states.

A lot of that work is focused on the tractability of the various computational methods that we need in order to perform this process: de novo assembly, gene de-duplication, read mapping, alignment de-duplication, co-abundance clustering, etc. In a couple of cases I’ve worked with collaborators to improve those individual components, but we’ve also spent a lot of time on making all of those pieces work together as part of a cohesive whole (using Nextflow).
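To give a flavor of one of those components, co-abundance clustering groups together genes whose sequencing depth rises and falls in unison across samples, on the logic that they likely travel on the same genomes. Here is a minimal sketch of the idea using simulated data (an illustration only, not the implementation used in the pipeline):

```python
# Toy co-abundance clustering: group genes whose abundances are
# correlated across samples. This sketches the idea only -- the real
# pipeline uses far more scalable approaches.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
n_samples = 20

# Hypothetical abundances: two blocks of genes which co-vary because
# they sit on the same (simulated) genomes.
base1, base2 = rng.random(n_samples), rng.random(n_samples)
abund = pd.DataFrame(
    {f"gene{i}": base1 + rng.normal(0, 0.05, n_samples) for i in range(3)}
    | {f"gene{i}": base2 + rng.normal(0, 0.05, n_samples) for i in range(3, 6)}
)

# Distance = 1 - Pearson correlation between gene abundance profiles
dist = 1 - abund.corr().to_numpy()
condensed = dist[np.triu_indices_from(dist, k=1)]
clusters = fcluster(linkage(condensed, method="average"), t=0.2, criterion="distance")
print(dict(zip(abund.columns, clusters)))  # genes 0-2 and 3-5 form two groups
```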

I’ve been working with my collaborators (Kevin Barry, Amy Willis, Jonathan Golob, and Caroline Kasman) to put together a demonstration of this approach, and I’m happy to say that this has all come together in the form of a preprint which was published this week:

Gene-level metagenomics identifies genome islands associated with immunotherapy response

I’ll use subsequent posts to talk more about the ideas behind this approach to analyzing the microbiome, but for now I’ll just say that I’m extremely excited that we are able to analyze previously-published datasets and identify new gene-level microbiome associations. In this case, we compared the stool microbiome of individuals being treated for metastatic melanoma on the basis of whether they responded to immune checkpoint inhibitor (ICI) therapy. With this approach we identified specific “genome islands” (localized regions of the genome) whose presence in gut bacteria was consistently associated with ICI response across two independent cohorts.

Needless to say, I think this is an extremely exciting finding and I’m looking forward to pushing forward with this research, both on the methods and the microbiome-ICI association. Follow this space for future developments!

Zarr: taking the headache out of massive datasets

My primary focus in the last few years has been trying to make gene-level metagenomic analysis practical for microbiome research. Without going into all of the details, one of the biggest challenges for this type of analysis is that there are a lot of genes in the microbiome, and so the data volume becomes massive.

I had generally thought that I had all my tools working as well as I could (in part by optimizing how to create massive DataFrames in memory), but I was still finding that some steps (such as grouping genes by co-abundance) were always going to require a huge amount of memory and take a really long time to read and write to disk. This didn’t really bother me — my feeling was that a matrix with millions of rows (genes) and thousands of columns (samples) is a ton of data and should be hard to work with.

Then I talked to an earth scientist.

One of the great things about being a scientist is when you get to peek over the wall into another discipline and see that they’ve already solved problems which you didn’t realize could be solved. It turns out that there are scientists who use satellites to take pictures of the earth to learn things about volcanoes, glaciers, global warming, and other incredibly important topics. Instead of taking a single picture, they take many pictures over time and build a three-dimensional volume of signal intensity data which makes my metagenomic datasets seem … modest.

The earth scientist, Scott Henderson, recommended that I check out an emerging software project which is being developed to deal with these large, high-dimensional, numeric matrices — Zarr.

The basic pitch for Zarr is that it is incredibly efficient at reading slices out of N-dimensional cubes. It also has some exciting integration with object storage systems like AWS S3 which I haven’t tried out yet, but I mostly like it because it immediately solved two big problems which were holding me back (and which I won’t describe in depth here). For both of these problems I ended up going down the path of trying out some alternate approaches:

  • Feather: Really fast to read and write, but you have to load the entire table at once

  • HDF5: Wonderful Python integration and support for complex data formats, but not efficient to read slices from arbitrary axes

  • Redis: Great support for caching on keys, but extremely slow to load all the data from each slice

  • Zarr: Fast to write, fast to read, and supports indexing across any axis with no impact on performance
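To make that last point concrete, here is a minimal sketch (with the array sizes shrunk down from the real use case) of writing a chunked Zarr array and then reading slices across either axis:

```python
# Minimal sketch: write a chunked 2-D array (genes x samples) to Zarr,
# then read slices along either axis without loading the whole matrix.
import numpy as np
import zarr

# The real use case is millions of genes by thousands of samples;
# sizes are shrunk here so the sketch runs quickly.
z = zarr.open(
    "abundances.zarr",
    mode="w",
    shape=(10_000, 200),
    chunks=(1_000, 50),  # the chunk shape sets the unit of disk I/O
    dtype="f4",
)
z[:] = np.random.random((10_000, 200)).astype("f4")

# Re-open read-only and slice across arbitrary axes -- only the chunks
# overlapping each slice are actually read from disk.
z = zarr.open("abundances.zarr", mode="r")
one_gene = z[42, :]    # one gene across every sample
one_sample = z[:, 7]   # every gene within one sample
print(one_gene.shape, one_sample.shape)
```

The chunk shape is the knob to tune: because each chunk spans both axes, row-wise and column-wise reads stay equally cheap.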

With any young software project you can read the docs and say to yourself, “well that sounds nice, but does it work?” Let me be the voice of experience and tell you that yes, it works. So if my problems sound like your problems, think about giving it a try.