Category Archives: genomics

Genomics data – distributed and searchable by content


Recently there has been a lot of interesting activity around MinHash sketches on Titus Brown’s blog (1, 2, 3).

To quickly recap: a MinHash sketch is a data structure that allows approximate set comparisons in constant memory. In essence it stores a fixed number of hash values computed from the elements of a set, e.g. the 1000 smallest such hashes, and then compares two sets by looking at the overlap of their hashes rather than the overlap of the original sets.

This means that you can calculate a distance between two genomes or sets of sequencing reads by computing a MinHash sketch from the k-mers of the sequences. For a full description of MinHash sketches, far better than mine, I recommend reading the Mash paper, which has a very nice and clear introduction to the concepts.
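To make the recap concrete, here is a minimal sketch of a bottom-k MinHash over k-mers in Python. It is only an illustration under some assumptions of mine: the function names, the choice of BLAKE2b as the hash and the default values for `k` and `size` are not taken from Mash or from Brown’s posts, and real tools also collapse each k-mer with its reverse complement, which is left out here.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Stable 64-bit hash of a k-mer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=21, size=1000):
    """Bottom-k sketch: the `size` smallest distinct k-mer hashes, sorted."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(a, b, size=1000):
    """Estimate the Jaccard similarity of two sets from their sketches."""
    # Take the `size` smallest hashes of the merged sketches and count
    # how many of them are present in both sketches.
    union = sorted(set(a) | set(b))[:size]
    if not union:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in union if h in shared) / len(union)


# Toy sequences: they share 4 of 8 distinct 4-mers.
seq_a = "ACGTACGTGG"
seq_b = "ACGTACGTCC"
print(jaccard_estimate(sketch(seq_a, k=4), sketch(seq_b, k=4)))  # 0.5
```

For inputs this small the sketch holds every hash, so the estimate equals the exact Jaccard similarity. For a genome only the `size` smallest hashes are kept, which is what makes the memory use constant.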

Distributed data

The idea that I found the most interesting in Brown’s blog posts is that of creating a database of the MinHashes, referencing datasets either with links to where they can be downloaded or, for non-public data, with contact information for the owner of the data. When I then came across the InterPlanetary File System (IPFS) it struck me that combining these two technologies could make for an extremely interesting match.

With the two combined, I imagine that it would be possible to build a distributed database of sequencing data that can be searched by content. To make this process efficient, some type of index would of course be needed, and once again this idea is explored by Brown in the form of sequence Bloom trees.
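As a rough illustration of what searching by content could look like, the sketch below does a plain linear scan over a small in-memory table of MinHash sketches. Everything in it is an assumption made for the example: the table maps made-up dataset references to toy sketches (short lists of integers standing in for real hash values), and the linear scan is a stand-in for a proper index such as a sequence Bloom tree, not an implementation of one.

```python
def jaccard_estimate(a, b, size):
    """Estimate Jaccard similarity from two bottom-k sketches of hash values."""
    union = sorted(set(a) | set(b))[:size]
    if not union:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in union if h in shared) / len(union)


def search(query, database, size=1000, threshold=0.1):
    """Return (score, reference) pairs at or above threshold, best match first."""
    hits = []
    for reference, stored in database.items():
        score = jaccard_estimate(query, stored, size)
        if score >= threshold:
            hits.append((score, reference))
    return sorted(hits, reverse=True)


# Hypothetical references; in the setting described above each would be a
# download link, an IPFS address or contact information for the data owner.
database = {
    "dataset-a": [1, 2, 3, 4],
    "dataset-b": [3, 4, 5, 6],
    "dataset-c": [7, 8, 9, 10],
}

for score, reference in search([1, 2, 3, 4], database, size=4):
    print(reference, score)
```

A scan like this touches every sketch in the table, so it only works for small collections. That is the reason an index would be needed for anything large.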

Preferably the data itself could also be stored in IPFS. For fully public data this seems perfectly feasible; for private data it might be somewhat trickier. For that scenario one could imagine either having pointers to “normal” static resources or to private IPFS networks (a feature which appears to be on the road map for IPFS), or storing the data encrypted in the public network.

There are already some interesting developments in using IPFS with genomics data. The Cancer Gene Trust project is using IPFS to create a distributed database of cancer variants and clinical data. The software is called Cancer Gene Trust Daemon and is available on GitHub. Super interesting stuff, although as far as I can tell it is not directly searchable by content. Right now I’m not sure how well the MinHash sketch technique would extend to something like a variant file, but since it was developed for general document similarity evaluation I suspect that it should not be impossible. Off the top of my head I would think that it would be possible to use the variant positions themselves to compute the hashes.
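A minimal sketch of that last idea, purely as a thought experiment: each variant is turned into a string key and hashed, and the smallest hashes form the sketch. The key format (chromosome, position, reference and alternate allele), the function names and the toy variants below are all assumptions of mine. Nothing here comes from the Cancer Gene Trust project or from an existing tool.

```python
import hashlib


def variant_key(chrom, pos, ref, alt):
    """Build a string key for one variant (hypothetical format)."""
    return f"{chrom}:{pos}:{ref}>{alt}"


def variant_sketch(variants, size=1000):
    """Bottom-k sketch: the `size` smallest 64-bit hashes of the variant keys."""
    hashes = set()
    for chrom, pos, ref, alt in variants:
        key = variant_key(chrom, pos, ref, alt).encode("utf-8")
        digest = hashlib.blake2b(key, digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return sorted(hashes)[:size]


def similarity(a, b, size=1000):
    """Estimate the Jaccard similarity of two variant sets from their sketches."""
    union = sorted(set(a) | set(b))[:size]
    if not union:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in union if h in shared) / len(union)


# Made-up variants for two samples: two shared out of four distinct.
sample_a = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")]
sample_b = [("chr1", 100, "A", "G"), ("chr2", 50, "G", "A"), ("chr3", 10, "T", "C")]
print(similarity(variant_sketch(sample_a), variant_sketch(sample_b)))  # 0.5
```

Whether this kind of similarity is meaningful for real variant calls, where most positions are shared between any two people, is a separate question that the sketch does not answer.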

Why a distributed data store?

I think that there are a number of reasons why this could be a worthwhile effort. Firstly there are obvious dangers with keeping scientific data in central repositories – what happens when funding for a project runs out or the PhD student leaves the lab? I’m fairly sure that there are many dead or dying scientific web services out there. Having a distributed system would mean that the data will (in theory) live indefinitely in the network.

Secondly, having a distributed system means bandwidth can be used far more efficiently, effectively creating a content delivery network for the data, where any actor who’d like to could set up a mirror of the data in the network. That way their clients would not need to download the human genome from the other side of the world again and again when it is already available on a colleague’s workstation.


Right now I’m mostly excited about both of these technologies and I thought I’d share my thoughts. I hope to be able to carve out some time to actually play around with these ideas sometime in the future.

As a final note, I have downloaded the human reference genome (GRCh38/hg38) and uploaded it to IPFS. It is available under this address: If you run an IPFS node you might want to pin it. Enjoy!

Whole human genome data released under Creative Commons licence

One of the things that I’ve been working a lot on over the last year is setting up pipelines to analyze whole genome sequencing data from human samples. This work is now coming to fruition and one part of that is that we (at the SNP&SEQ Technology Platform) have now released data for our users and others to see. It’s still a work in progress, but most of the pieces are in place at this stage.

The data is being released under a Creative Commons Attribution-NonCommercial 4.0 International License, so as long as you attribute the work to the SNP&SEQ Technology Platform you can use it for non-commercial purposes. You’ll find the data here:

Being a fan of open science, working for an employer that will release data for the benefit of the community makes me jump with joy!

P.S. If you’d like to have a look at the code that makes it all happen, check out the National Genomics Infrastructure GitHub repo.

It’s already been done…

Not that long ago I wrote about the use of CRISPR-Cas9 systems and the ongoing ethical debate on their use in humans. As it turns out, this future is even closer than I might initially have thought. This week a Chinese group published a paper where they used these systems to do genome editing in non-viable human embryos.

The original paper is available here: A Nature news piece on the topic can be found here: Finally, the findings were also covered in Swedish mainstream press (in Swedish) here:

The study itself draws attention to some problems related to genome editing using these techniques. The poor success rate (only 28 of the 86 injected embryos were successfully spliced) and problems with off-target mutations hinder the immediate clinical use of the technique. These are, however, problems that I’m sure will be addressed by technology development.

More importantly I think that this highlights the importance of establishing ethical frameworks for the use of gene-editing. Interestingly the paper was rejected both by Nature and Science “in part because of ethical objections” – it remains to be seen if such objections will hold in the future.

Personally I’m as yet undecided on the morality of carrying out gene editing in embryos. While the promise of cures for heritable diseases is wonderful, it’s easy to see a slippery slope from there into more dubious uses. It also raises the question of what is to be considered a disease or disability.

Today at the SciLifeLab Large Scale Human Genome Sequencing Symposium Dr. Anna Lindstrand spoke about a study indicating CTNND2 as a candidate gene for reading problems. Are such mild disabilities to be considered for gene editing? Maybe not, but where do we draw the line? Once the technology is available I have little doubt that some will want to use it to produce “genetically enhanced” humans.

The Nature news article referenced above ends with the somewhat ominous quote: “A Chinese source familiar with developments in the field said that at least four groups in China are pursuing gene editing in human embryos.” Certainly we are going to hear more on this topic in the future.

CRISPR-Cas9 and William Gibson

“The future is already here — it’s just not very evenly distributed. – William Gibson”

A quote I keep coming back to is one by William Gibson, saying that “the future is already here — it’s just not very evenly distributed.” This came to mind once again when I read the following New York Times piece about the consequences of the potential use of the CRISPR-Cas9 system for genome editing in humans. It builds on an article in Science where the authors urge caution on the use of genome editing in humans before the potential consequences have been thoroughly investigated, and legal as well as ethical frameworks are in place.

What was once considered pure science fiction is now quickly becoming reality, and I think that the future portrayed in the film Gattaca is not as far off as we might once have imagined. In my opinion this is just one more example of how we as a society will have to deal with what it means to be human in an age where we will be able to take charge of the actual physical definition of humanity.

I see in this enormous potential, but also equally big risks. I welcome initiatives such as the above to work proactively on these questions and bring them up on the general agenda. Hopefully that means that these technologies can be used responsibly, and that we might avoid the bleak geno-dystopian future of Gattaca and instead hope for a brighter, more Star Trek-like future.

Genomics Today and Tomorrow

Last week I went to the Genomics Today and Tomorrow event at Norrlands Nation in Uppsala. It was hosted by Uppsala Connect, and aimed at bringing people inside and outside of academia together to connect on questions about the use of “the cloud” in genomics. The major attraction of the day was the talk by Jonathan Bingham from Google Genomics. He presented what they can do for the genomics field considering that they already have extensive experience in working with very large data sets (their search index alone being more than 100 PB).

He pitched two of Google’s projects. Firstly, the Baseline project, which aims at collecting longitudinal data on healthy individuals. This includes genomics data, but also proteomics, epigenetics, clinical data, etc. Secondly, their genomics platform, for which the long-term goal seems to be to integrate many types of data and make them easily queryable.

This of course also ties in in an interesting way with the work that Google has been doing on deep learning, promoting the idea that if it is possible to compile a sufficiently large dataset it would be possible to use those types of techniques to do unsupervised feature extraction from biological data. This example was also brought up by Mikael Huss in his talk later in the day, and is a prospect that I personally find very interesting.

At a high level they are also working on implementing the API standards of the Global Alliance for Genomics and Health, and on running BigQuery on top of genomics data, allowing interactive SQL-style queries over very large data sets.

Of course the topic of privacy was brought up, and Jonathan stressed that this was a question that Google is taking very seriously (and he also jokingly assured us that they’d not be serving any ads based on genomics data). I cannot, however, help but wonder what’s in this for Google. They do get to sell compute time on their cloud infrastructure, but I’m wondering what their play in this market is beyond that. I don’t see it as impossible that they are making a run at the clinical data management market in the long run. With that said, I think there are going to be some major hurdles to that, especially in Europe where legislation surrounding this type of thing seems to be quite restrictive.

Some other highlights of the day were:

  • Mikael Huss’s (SciLifeLab bioinformatics long-term support) presentation of different questions one might want to ask of APIs, and to some extent which providers could answer them at the moment. He also made what I think is a very valid point: the true challenge in the life science field in the future does not lie in the size of the data but in its heterogeneity, a topic that several of the speakers touched upon.
  • Keijo Heljanko’s (Aalto School of Science) rundown of the tools that they are developing for handling NGS data, including Hadoop-BAM, but also SeqPig (NGS data in Apache Pig) and SeqSpork (roughly SeqPig in Spark, kudos by the way for that nice name).
  • Rolf Apweiler of the EMBL-EBI spoke of the “phenotyping centers called hospitals” that we seem to have all over the place and the challenges and prospects of working with data originating from that source in the future. He also mentioned the problem of transferring data between centers due to limited bandwidth and spoke briefly about the EMBL-EBI Embassy cloud. This was presented as one type of solution to this problem, where one can get access to virtual machines within their infrastructure and thus move the computations to the data rather than the other way around.
  • For the last talk of the day it was nice to hear Asta Laiho (Bioinformatics Unit at the Turku Centre for Biotechnology) speaking on the work they do in their sequencing and bioinformatics core facilities, since it resonates very well with what we have been seeing. We also had an interesting chat in the networking session after the talks, comparing notes, and as always it seems that everyone is confronted with similar problems. On the technical side she mentioned the cloud infrastructure ePouta that they were using to scale out their internal infrastructure to the (private) cloud, which I thought sounded like a really nice solution for handling a highly variable workload.

To wrap up I’d like to mention a topic brought up in the panel discussion that I find highly interesting: the implications of making genomic data publicly available. It’s been argued that individuals should be allowed to decide whether to make their own data public. However, since this reveals information not only about that individual but also about their relatives, it becomes a much broader question. I think that establishing ethical guidelines and legal frameworks for this type of question is something that we as a society are going to have to deal with in the coming years.

All in all a very interesting afternoon.