Minot Lab

Here's a little story about a computational challenge that I came across recently, and how a little bit of performance profiling went a long way to improve my analysis.

Background

I've been working with a large amount of data recently, and my principal bottleneck hasn't been the raw sequence analysis, but rather analyzing the results across large numbers of samples. For context, I'm working with thousands of samples, each of which contains 10,000 - 10,000,000 observations. The raw data was terabytes of raw genomic sequence data, and so I'm definitely working with a more compact data structure, but it's still quite a bit of data to work with in memory.

There have been a few different iterations for this workflow, starting by working in memory (started swapping), then moving to a PostgreSQL database (which was too slow with read/write), and now working in memory on a larger machine using sparse matrices (following a helpful suggestion from Jonathan Golob).

What I've really liked about these sparse matrices is that it's incredibly fast to slice the data in different ways, which is what I need to be able to figure out both (a) how similar samples are to each other and (b) how different observations are distributed across samples. In other words, I need to be able to retrieve the data quickly in a column-wise or row-wise manner and sparse matrices really help with this. Also, my underlying data is very sparse, which is to say that a vast majority of the cells in my table are 0 or N/A. Lastly, they're supported in Pandas, which is really great.

The Challenge

While it's really quick to query these sparse matrices, I've been finding it difficult to make them. This morning I did a bit of performance profiling to compare a few different ways of constructing these tables and I found that not all construction methods are created equal. You can find my profiling code here, if you're interested (this was run in Pandas 0.20.3).

The data that I'm working with is a set of observations that are originally constructed in longitudinal format, with every observation as an item in a list that includes the row name, column name, and cell value. What I wanted to see was what the most efficient method would be to make this into a sparse table in wide format, which could be indexed and sliced easily by row or column.

Options

Sparse DataFrame from nested dictionaries: Make a nested dictionary with the first level of keys as column labels and the second level of keys as row labels, and then convert directly to the SparseDataFrame.
Column-wise addition to sparse DataFrame: Make an empty SparseDataFrame and then sequentially add in the data from each column.
Dense longitudinal, pivot, then make sparse: Make a dense DataFrame from the longitudinal data, pivot it in dense format, and then make it into a SparseDataFrame.
Sparse longitudinal DataFrame, then pivot: Make a SparseDataFrame in longitudinal format and then pivot it in to wide format.

Results

I tested each of the above options using four different matrices of varying sizes:

Matrix: 60 rows, 67 cols, 2.463% full

Testing sparse DataFrame from nested dictionaries
6 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Testing columnwise addition to sparse DataFrame
5 ms ± 813 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Testing make dense DataFrame from longitudinal format, pivot it, then convert to sparse
1 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Testing make sparse DataFrame from longitudinal format, then pivot it
ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


Matrix: 100 rows, 100 cols, 9.54% full

Testing sparse DataFrame from nested dictionaries
2 ms ± 8.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Testing columnwise addition to sparse DataFrame
7 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Testing make dense DataFrame from longitudinal format, pivot it, then convert to sparse
3 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Testing make sparse DataFrame from longitudinal format, then pivot it
ms ± 4.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Matrix: 100 rows, 632 cols, 1.574% full

Testing sparse DataFrame from nested dictionaries
29 s ± 50.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Testing columnwise addition to sparse DataFrame
41 s ± 40.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Testing make dense DataFrame from longitudinal format, pivot it, then convert to sparse
19 s ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Testing make sparse DataFrame from longitudinal format, then pivot it
ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Matrix: 100 rows, 1000 cols, 9.526% full

Testing sparse DataFrame from nested dictionaries
82 s ± 29.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Testing columnwise addition to sparse DataFrame
15 s ± 81.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Testing make dense DataFrame from longitudinal format, pivot it, then convert to sparse
8 s ± 72.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Testing make sparse DataFrame from longitudinal format, then pivot it
ms ± 30.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Summary

In the results above you can see that in general the fastest method was #4 – making the DataFrame sparse in longitudinal format, and then pivoting it into wide matrix format. However, I need to stress that the relative performance of each of these methods varied widely depending on how sparse the matrix was. In general, option #4 worked better with sparser datasets, but I haven't fully tested the range of parameter space varying the number of rows, columns, and values.

A larger point that I could also make here is that picking the right data structure for your project is actually a pretty tricky problem. I found here that the advantages of sparse matrices was immense relative to relational databases, but that they come with an up-front cost of matrix construction. I'm a biologist, not a computational scientist, and so the only way that I find my way through is by talking to smart people and incorporating their suggestions wherever appropriate. If you come up against a similar problem dealing with largish data, just take a deep breath, ask for help, and then write some tests.