pandas

Making sparse DataFrames efficiently

Here's a little story about a computational challenge that I came across recently, and how a little bit of performance profiling went a long way to improve my analysis.

Background

I've been working with a large amount of data recently, and my principal bottleneck hasn't been the raw sequence analysis, but rather analyzing the results across large numbers of samples. For context, I'm working with thousands of samples, each of which contains 10,000 - 10,000,000 observations. The raw data was terabytes of raw genomic sequence data, and so I'm definitely working with a more compact data structure, but it's still quite a bit of data to work with in memory. 

There have been a few different iterations for this workflow, starting by working in memory (started swapping), then moving to a PostgreSQL database (which was too slow with read/write), and now working in memory on a larger machine using sparse matrices (following a helpful suggestion from Jonathan Golob).

What I've really liked about these sparse matrices is that it's incredibly fast to slice the data in different ways, which is what I need to be able to figure out both (a) how similar samples are to each other and (b) how different observations are distributed across samples. In other words, I need to be able to retrieve the data quickly in a column-wise or row-wise manner and sparse matrices really help with this.  Also, my underlying data is very sparse, which is to say that a vast majority of the cells in my table are 0 or N/A. Lastly, they're supported in Pandas, which is really great. 

The Challenge

While it's really quick to query these sparse matrices, I've been finding it difficult to make them. This morning I did a bit of performance profiling to compare a few different ways of constructing these tables and I found that not all construction methods are created equal. You can find my profiling code here, if you're interested (this was run in Pandas 0.20.3). 

The data that I'm working with is a set of observations that are originally constructed in longitudinal format, with every observation as an item in a list that includes the row name, column name, and cell value. What I wanted to see was what the most efficient method would be to make this into a sparse table in wide format, which could be indexed and sliced easily by row or column. 

Options

  1. Sparse DataFrame from nested dictionaries: Make a nested dictionary with the first level of keys as column labels and the second level of keys as row labels, and then convert directly to the SparseDataFrame.

  2. Column-wise addition to sparse DataFrame: Make an empty SparseDataFrame and then sequentially add in the data from each column. 

  3. Dense longitudinal, pivot, then make sparse: Make a dense DataFrame from the longitudinal data, pivot it in dense format, and then make it into a SparseDataFrame.

  4. Sparse longitudinal DataFrame, then pivot: Make a SparseDataFrame in longitudinal format and then pivot it in to wide format. 

Results

I tested each of the above options using four different matrices of varying sizes:

Matrix: 60 rows, 67 cols, 2.463% full

1. Testing sparse DataFrame from nested dictionaries
29.6 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

2. Testing columnwise addition to sparse DataFrame
41.5 ms ± 813 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

3. Testing make dense DataFrame from longitudinal format, pivot it, then convert to sparse
33.1 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

4. Testing make sparse DataFrame from longitudinal format, then pivot it
19 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


Matrix: 100 rows, 100 cols, 9.54% full

1. Testing sparse DataFrame from nested dictionaries
61.2 ms ± 8.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

2. Testing columnwise addition to sparse DataFrame
97.7 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

3. Testing make dense DataFrame from longitudinal format, pivot it, then convert to sparse
57.3 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

4. Testing make sparse DataFrame from longitudinal format, then pivot it
101 ms ± 4.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Matrix: 100 rows, 632 cols, 1.574% full

1. Testing sparse DataFrame from nested dictionaries
1.29 s ± 50.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

2. Testing columnwise addition to sparse DataFrame
1.41 s ± 40.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

3. Testing make dense DataFrame from longitudinal format, pivot it, then convert to sparse
1.19 s ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

4. Testing make sparse DataFrame from longitudinal format, then pivot it
94 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Matrix: 100 rows, 1000 cols, 9.526% full

1. Testing sparse DataFrame from nested dictionaries
2.82 s ± 29.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

2. Testing columnwise addition to sparse DataFrame
3.15 s ± 81.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

3. Testing make dense DataFrame from longitudinal format, pivot it, then convert to sparse
2.8 s ± 72.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

4. Testing make sparse DataFrame from longitudinal format, then pivot it
826 ms ± 30.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Summary

In the results above you can see that in general the fastest method was #4 – making the DataFrame sparse in longitudinal format, and then pivoting it into wide matrix format. However, I need to stress that the relative performance of each of these methods varied widely depending on how sparse the matrix was. In general, option #4 worked better with sparser datasets, but I haven't fully tested the range of parameter space varying the number of rows, columns, and values. 

A larger point that I could also make here is that picking the right data structure for your project is actually a pretty tricky problem. I found here that the advantages of sparse matrices was immense relative to relational databases, but that they come with an up-front cost of matrix construction. I'm a biologist, not a computational scientist, and so the only way that I find my way through is by talking to smart people and incorporating their suggestions wherever appropriate. If you come up against a similar problem dealing with largish data, just take a deep breath, ask for help, and then write some tests.