Genomics data – distributed and searchable by content

MinHashes

Recently there has been a lot of interesting activity around MinHash sketches on Titus Brown's blog (1, 2, 3).

To quickly recap, a MinHash sketch is a data structure which allows approximate set comparisons in constant memory. In essence it stores a fixed number of hash values, e.g. the 1000 smallest hashes of a set's elements, and then compares two sets by the overlap between their hashes rather than the overlap between the original sets.
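As a rough illustration, here is what a minimal bottom-k MinHash could look like in Python. The function names, the sketch size, and the use of MD5 as the hash function are my own choices for this sketch, not taken from any of the posts:

```python
import hashlib
import heapq

def sketch(items, size=1000):
    """Hash every item and keep only the `size` smallest hash values."""
    hashes = {int(hashlib.md5(item.encode()).hexdigest(), 16)
              for item in items}
    return set(heapq.nsmallest(size, hashes))

def jaccard_estimate(sketch_a, sketch_b, size=1000):
    """Estimate the Jaccard similarity of the original sets: take the
    `size` smallest values across both sketches and check what fraction
    of them occur in both."""
    merged = set(heapq.nsmallest(size, sketch_a | sketch_b))
    return len(merged & sketch_a & sketch_b) / len(merged)
```

Two sketches of a thousand hashes each can stand in for sets of millions of elements, which is where the constant-memory property comes from.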

This means that you can calculate a distance between two genomes, or between two sets of sequencing reads, by computing MinHash sketches from their sequence k-mers. For a description of MinHash sketches far better than mine, I recommend reading the Mash paper, which has a very nice and clear introduction to the concepts.
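To turn the similarity estimate into a distance between genomes, the set elements become the sequence's k-mers, and the Jaccard estimate is converted with the formula from the Mash paper, d = -(1/k) * ln(2j / (1 + j)). A sketch of that, reusing the functions above and skipping details such as canonical k-mers:

```python
import math

def kmers(sequence, k=21):
    """Yield all overlapping k-mers of a sequence (no reverse-complement
    handling, which a real tool like Mash would do)."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]

def mash_distance(jaccard, k=21):
    """The Mash distance d = -(1/k) * ln(2j / (1 + j)).
    Assumes jaccard > 0 and the same k used to build the sketches."""
    return -math.log(2 * jaccard / (1 + jaccard)) / k

# Hypothetical usage, with genome_a and genome_b as plain strings:
# d = mash_distance(jaccard_estimate(sketch(kmers(genome_a)),
#                                    sketch(kmers(genome_b))))
```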

Distributed data

The idea that I found the most interesting in Brown's blog posts is that of creating a database of MinHash sketches, referencing datasets either with links to where they can be downloaded or, for non-public data, with contact information for the owner of the data. When I then came across the InterPlanetary File System (IPFS) it struck me that these two technologies could be an extremely interesting match.

Combining them, I imagine it would be possible to build a distributed database of sequencing data that can be searched by content. To make searching efficient, some type of index would of course be needed, and once again this idea is explored by Brown, in the form of sequence Bloom trees.
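To make that a bit more concrete, a toy in-memory version of such a database could map dataset references to their sketches and answer queries with a linear scan. This again reuses the functions above; the index structure and the threshold are made up for illustration, and a real system would use something like the Bloom trees instead of scanning:

```python
def search(index, query_sketch, threshold=0.9):
    """index maps dataset references (e.g. IPFS paths for public data,
    or contact details for private data) to MinHash sketches; return
    every reference whose estimated Jaccard similarity to the query
    sketch passes the threshold."""
    return [ref for ref, sk in index.items()
            if jaccard_estimate(query_sketch, sk) >= threshold]

# Hypothetical usage:
# index = {"/ipfs/<dataset hash>": sketch(kmers(genome)), ...}
# hits = search(index, sketch(kmers(my_reads)))
```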

Preferably the data itself could also be stored in IPFS. For fully public data this seems perfectly feasible; for private data it might be somewhat trickier. For that scenario one could imagine having pointers to "normal" static resources, using private IPFS networks (a feature which appears to be on the roadmap for IPFS), or storing the data encrypted in the public network.

There are already some interesting developments in using IPFS with genomics data. The Cancer Gene Trust project is using IPFS to create a distributed database of cancer variants and clinical data. The project is called Cancer Gene Trust Daemon and is available on GitHub. Super interesting stuff, which as far as I can tell is not directly searchable by content. Right now I'm not sure how well the MinHash sketch technique would extend to something like a variant file, but since it was developed for general document similarity evaluation it should not be impossible. Off the top of my head, I would think it possible to use the variant positions themselves to compute the hashes.
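To be clear, this is pure speculation on my part, but it could look something like hashing a string built from each variant's defining fields instead of from k-mers (the record format below is an assumption):

```python
def variant_keys(records):
    """Turn VCF-like records, here assumed to be (chrom, pos, ref, alt)
    tuples, into strings that can be fed to sketch()."""
    for chrom, pos, ref, alt in records:
        yield f"{chrom}:{pos}:{ref}>{alt}"

# A variant sketch would then just be: sketch(variant_keys(records))
```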

Why a distributed data store?

I think there are a number of reasons why this could be a worthwhile effort. Firstly, there are obvious dangers with keeping scientific data in central repositories – what happens when funding for a project runs out, or the PhD student maintaining it leaves the lab? I'm fairly sure that there are many dead or dying scientific web services out there. In a distributed system the data would (in theory) live on indefinitely in the network.

Secondly, having a distributed system means bandwidth can be used far more efficiently, effectively creating a content delivery network for the data. Any actor who'd like to could set up a mirror of the data in the network, making sure their clients don't need to download the human genome from the other side of the world again and again when it is available on a colleague's workstation.

Finally

Right now I’m mostly excited about both of these technologies and I’d thought I’d share my thoughts. I hope to be able to carve out some time to actually play around with these ideas sometime in the future.

As a final note, I have downloaded the human reference genome (GRCh38/hg38) and uploaded it to IPFS. It is available at this address: https://ipfs.io/ipfs/QmQ3gCx4WqaohRYbE8NFmE2Yc8xmARDPJ3jZNpQdfoWyKQ. If you run an IPFS node you might want to pin it. Enjoy!
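(If you have the IPFS command line tools installed, pinning should, as far as I understand the CLI, be a single command: `ipfs pin add QmQ3gCx4WqaohRYbE8NFmE2Yc8xmARDPJ3jZNpQdfoWyKQ`.)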