Setting up Spark on OpenStack using Ansible

In 2013 I first found out about Spark and was greatly intrigued, to the point that I actually hosted a journal club** on the topic. There have been many discussions on the merits of Spark around the internet, often focusing on Spark’s in-memory processing capabilities. These allow for some rather impressive speed-ups for iterative algorithms compared to the equivalent program written in Hadoop MapReduce. However, what I found to be the main feature was the ability to express programs in a way very similar to non-parallel programs and let the framework figure out the parallelisation.
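To illustrate what I mean by "very similar to non-parallel programs", here is a minimal sketch in plain Python. It is not Spark itself: a thread pool stands in for the framework (and, in CPython, will not actually speed up this kind of computation), and the function names are made up for the example. The point is only that the map-then-reduce shape of the program stays the same when the thing executing the map changes.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add


def square(x):
    return x * x


def sum_of_squares_sequential(numbers):
    # Ordinary, non-parallel version: map a function over the data, then reduce.
    return reduce(add, map(square, numbers), 0)


def sum_of_squares_pooled(numbers, workers=4):
    # Same map-then-reduce shape; only the executor running the map changes.
    # The pool here is a stand-in for a framework that distributes the work.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return reduce(add, pool.map(square, numbers), 0)


# For comparison (not run here), the PySpark RDD version has the same shape,
# assuming an already created SparkContext named `sc`:
#   sc.parallelize(numbers).map(square).reduce(add)
```

Both functions return the same result for the same input; in Spark the framework, rather than the program, decides how the data is partitioned and where each piece is processed.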

While I’ve played around some with Spark on my own machine since then, I haven’t until recently had either the time or the resources to try it out in distributed mode. However, I recently got access to the cloud test-bed Smog, set up by our local HPC center Uppmax.

To get up and running and try to figure out the basic nuts and bolts of Spark, I’ve set up some Ansible playbooks to deploy a basic stand-alone Spark cluster (with HDFS support) to OpenStack, and written up some instructions for how to get it going in this GitHub repo: https://github.com/johandahlberg/ansible_spark_openstack.

I haven’t had time to play around with it that much yet, but I hope to do so and report back here later. I’m especially interested in setting up and using Adam (a Spark framework for genomics) and in trying out the Spark Notebook (an IPython-notebook-like interface for working interactively with Spark).

Finishing off with some acknowledgements, I do want to mention Mikael Huss (@MikaelHuss), who has helped me test out the playbooks and made sure that the instructions are up to scratch, and Sheshan Ali (@zeeshanalishah), who has helped me get up and running with OpenStack.

Check out the repo – pull requests are welcome!

** An informal discussion around a research paper, common in academia. In our group we have one once per week.

 

CRISPR-Cas9 and William Gibson

“The future is already here – it’s just not very evenly distributed.” – William Gibson

A quote I keep coming back to is one by William Gibson, saying that “the future is already here – it’s just not very evenly distributed.” This came to mind once again when I read a New York Times piece about the consequences of the potential use of the CRISPR-Cas9 system for genome editing in humans. It builds on an article in Science where the authors urge caution on the use of genome editing in humans until the potential consequences have been thoroughly investigated, and legal as well as ethical frameworks are in place.

What was once considered pure science fiction is now quickly becoming reality, and I think that the future portrayed in the film Gattaca is not as far off as we might once have imagined. In my opinion this is just one more example of how we as a society will have to deal with what it means to be human in an age where we will be able to take charge of the actual physical definition of humanity.

I see enormous potential in this, but also equally big risks. I welcome initiatives such as the above to work proactively on these questions and bring them up on the general agenda. Hopefully that means that these technologies can be used responsibly, and that we might avoid the bleak geno-dystopian future of Gattaca and instead hope for a brighter, more Star Trek-like future.

Genomics Today and Tomorrow

Last week I went to the Genomics Today and Tomorrow event at Norrlands Nation in Uppsala. It was hosted by Uppsala Connect, and aimed at bringing people inside and outside of academia together to connect on questions about the use of “the cloud” in genomics. The major attraction of the day was the talk by Jonathan Bingham from Google Genomics. He presented what they can do for the genomics field, considering that they already have extensive experience in working with very large data sets (their search index alone being more than 100 PB).

He pitched two of Google’s projects. Firstly, the Baseline project, which aims at collecting longitudinal data on healthy individuals. This includes genomics data, but also proteomics, epigenetics, clinical data, etc. Secondly, their genomics platform, for which the long-term goal seems to be to integrate many types of data and make it easily queryable.

This of course also ties in, in an interesting way, with the work that Google has been doing on deep learning, promoting the idea that if it is possible to compile a sufficiently large dataset, it would be possible to use those types of techniques to do unsupervised feature extraction from biological data. This was an example that was also brought up by Mikael Huss in his talk later in the day, and is a prospect that I personally find very interesting.

At a high level they are also working on implementing the API standards of the Global Alliance for Genomics and Health, and on BigQuery running on top of genomics data, allowing interactive SQL-style queries over very large data sets.

Of course the topic of privacy was brought up, and Jonathan stressed that this is a question that Google is taking very seriously (he also jokingly assured us that they would not be serving any ads based on genomics data). I cannot help wondering, however, what’s in this for Google. They do get to sell compute time on their cloud infrastructure, but I wonder what else they are after in this market. I don’t see it as impossible that they are making a run at the clinical data management market in the long run. That said, I think there are going to be some major hurdles to that, especially in Europe, where legislation surrounding this type of thing seems to be quite restrictive.

Some other highlights of the day were:

  • Mikael Huss’s (SciLifeLab bioinformatics long-term support) presentation of different questions one might want to ask of APIs, and to some extent which providers could answer them at the moment. He also made what I think is a very valid point: that the true challenge in the life science field in the future does not lie in the size of the data but in its heterogeneity, a topic that several of the speakers touched upon.
  • Keijo Heljanko’s (Aalto School of Science) rundown of the tools that they are developing for handling NGS data, including Hadoop-BAM, but also SeqPig (NGS data in Apache Pig) and SeqSpork (roughly SeqPig in Spark, kudos by the way for that nice name).
  • Rolf Apweiler of the EMBL-EBI spoke of the “phenotyping centers called hospitals” that we seem to have all over the place, and the challenges and prospects of working with data originating from that source in the future. He also mentioned the problem of transferring data between centers due to limited bandwidth and spoke briefly about the EMBL-EBI Embassy cloud. This was presented as one type of solution to this problem, where one can get access to virtual machines within their infrastructure and thus move the computations to the data rather than the other way around.
  • For the last talk of the day it was nice to hear Asta Laiho (Bioinformatics Unit at the Turku Centre for Biotechnology) speaking on the work they do in their sequencing and bioinformatics core facilities, since it resonates very well with what we have been seeing. We also had an interesting chat in the networking session afterwards, comparing notes – and as always it seems that everyone is confronted with similar problems. On the technical side she mentioned the cloud infrastructure ePouta that they were using to scale out their internal infrastructure to the (private) cloud, which I thought sounded like a really nice solution for handling a highly variable workload.
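Rolf Apweiler’s point about moving the computations to the data can be made concrete with some rough arithmetic. The sketch below is my own back-of-the-envelope calculation, not something from the talk: the 100 TB dataset and the fully used 1 Gbit/s link are assumed numbers, picked only to show the order of magnitude.

```python
def transfer_time_days(dataset_bytes, link_bits_per_second):
    """Ideal transfer time in days, ignoring protocol overhead and contention."""
    seconds = dataset_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # seconds per day


# Assumed, purely illustrative figures (not from the talk):
DATASET_BYTES = 100 * 10**12  # a hypothetical 100 TB dataset
LINK_BPS = 10**9              # a hypothetical dedicated 1 Gbit/s link

days = transfer_time_days(DATASET_BYTES, LINK_BPS)
print(f"{days:.1f} days")  # prints "9.3 days"
```

Even under these optimistic assumptions the copy takes more than a week, which is why shipping a virtual machine to where the data lives can be the cheaper option.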

To wrap up I’d like to mention a topic brought up in the panel discussion that I found highly interesting: the implications of making genomic data publicly available. It’s been argued that the individual should be allowed to make the decision to make his or her own data public. However, since this will reveal information not only about that individual but also about their relatives, it becomes a much broader question. I think that establishing ethical guidelines and legal frameworks for this type of question is something that we as a society are going to have to deal with in the coming years.

All in all a very interesting afternoon.