Last week I went to the Genomics Today and Tomorrow event at Norrlands Nation in Uppsala. It was hosted by Uppsala Connect and aimed at bringing people inside and outside academia together to connect on questions about the use of “the cloud” in genomics. The major attraction of the day was the talk by Jonathan Bingham from Google Genomics. He presented what Google can do for the genomics field, given that they already have extensive experience in working with very large data sets (their search index alone being more than 100 PB).
He pitched two of Google's projects. The first is the Baseline project, which aims at collecting longitudinal data on healthy individuals. This includes genomics data, but also proteomics, epigenetics, clinical data, etc. The second is their genomics platform, for which the long-term goal seems to be to integrate many types of data and make them easily queryable.
This of course also ties in, in an interesting way, with the work that Google has been doing on deep learning. It promotes the idea that if a sufficiently large dataset can be compiled, those types of techniques could be used for unsupervised feature extraction from biological data. Mikael Huss also brought up this example in his talk later in the day, and it is a prospect that I personally find very interesting.
At a high level they are also working on implementing the API standards of the Global Alliance for Genomics and Health, and on running BigQuery on top of genomics data, allowing interactive SQL-style queries over very large data sets.
Of course the topic of privacy was brought up, and Jonathan stressed that this is a question that Google is taking very seriously (he also jokingly assured us that they’d not be serving any ads based on genomics data). I cannot help but wonder, however, what’s in this for Google. They do get to sell compute time on their cloud infrastructure, but I’m wondering what their aim in this market is beyond that. I don’t see it as impossible that they are making a move into the clinical data management market in the long run. That said, I think there are going to be some major hurdles to that, especially in Europe, where legislation surrounding this type of thing seems to be quite restrictive.
Some other highlights of the day were:
- Mikael Huss's (SciLifeLab bioinformatics long-term support) presentation of different questions one might want to ask of APIs, and to some extent which providers can answer them at the moment. He also made what I think is a very valid point: the true challenge in the life sciences in the future lies not in the size of the data but in its heterogeneity, a topic that several of the speakers touched upon.
- Keijo Heljanko's (Aalto School of Science) rundown of the tools that they are developing for handling NGS data, including Hadoop-BAM, but also SeqPig (NGS data in Apache Pig) and SeqSpork (roughly SeqPig in Spark, kudos by the way for that nice name).
- Rolf Apweiler of the EMBL-EBI spoke of the “phenotyping centers called hospitals” that we seem to have all over the place, and of the challenges and prospects of working with data originating from that source in the future. He also mentioned the problem of transferring data between centers due to limited bandwidth and spoke briefly about the EMBL-EBI Embassy cloud. This was presented as one type of solution to the problem: one can get access to virtual machines within their infrastructure and thus move the computations to the data rather than the other way around.
- For the last talk of the day it was nice to hear Asta Laiho (Bioinformatics Unit at the Turku Centre for Biotechnology) speak about the work they do in their sequencing and bioinformatics core facilities, since it resonates very well with what we have been seeing. We also had an interesting conversation in the post-talk networking session, comparing notes, and as always it seems that everyone is confronted with similar problems. On the technical side she mentioned the cloud infrastructure ePouta, which they were using to scale out their internal infrastructure to the (private) cloud. I thought it sounded like a really nice solution for handling a highly variable workload.
To wrap up, I’d like to mention a topic from the panel discussion that I find highly interesting: the implications of making genomic data publicly available. It’s been argued that individuals should be allowed to decide whether to make their own data public. However, since this reveals information not only about that individual but also about their relatives, it becomes a much broader question. I think that establishing ethical guidelines and legal frameworks for this type of question is something that we as a society are going to have to deal with in the coming years.
All in all a very interesting afternoon.