In 2013 I first found out about Spark and was greatly intrigued, to the point that I actually hosted a journal club** on the topic. There have been many discussions of the merits of Spark around the internet, often focusing on Spark's in-memory processing capabilities. These allow for some rather impressive speed-ups for iterative algorithms compared to the equivalent programs written in Hadoop MapReduce. However, what I found to be the main feature was the ability to express programs in a way very similar to non-parallel programs, and let the framework figure out the parallelisation.
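To illustrate that point, here is a minimal word-count sketch in Spark's Scala API (the input path is just a placeholder, and `sc` is assumed to be an existing SparkContext). It reads much like ordinary sequential collection code, yet Spark distributes each step across the cluster, and `cache()` keeps the data in memory for reuse:

```scala
// Sketch only: assumes an existing SparkContext `sc` and a placeholder input path.
val counts = sc.textFile("hdfs:///path/to/input.txt")
  .flatMap(line => line.split("\\s+"))   // split each line into words
  .map(word => (word, 1))                // pair each word with a count of 1
  .reduceByKey(_ + _)                    // sum counts per word, in parallel
  .cache()                               // keep the result in memory for reuse

counts.take(10).foreach(println)
```

The same chain of `flatMap`/`map`/`reduceByKey` would work on a local Scala collection in spirit; the parallel execution is entirely the framework's concern.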
While I have played around a bit with Spark on my own machine since then, I haven't until recently had the time or the resources to try it out in distributed mode. Recently, however, I got access to the cloud test-bed Smog, set up by our local HPC centre Uppmax.
To get up and running and figure out the basic nuts and bolts of Spark, I've set up some Ansible playbooks to deploy a basic stand-alone Spark cluster (with HDFS support) to OpenStack, and written up some instructions for how to get it going in this GitHub repo: https://github.com/johandahlberg/ansible_spark_openstack.
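For orientation, deploying a stand-alone Spark cluster with Ansible typically boils down to tasks along these lines. This is an illustrative sketch only, not the actual contents of the repo: the hosts, paths, version and playbook name are all assumptions.

```yaml
# spark.yml -- illustrative sketch; see the repo for the real playbooks.
- hosts: spark-master
  tasks:
    - name: Download a Spark distribution
      get_url:
        url: https://archive.apache.org/dist/spark/spark-1.2.0/spark-1.2.0-bin-hadoop2.4.tgz
        dest: /tmp/spark.tgz
    - name: Unpack it under /opt
      unarchive:
        src: /tmp/spark.tgz
        dest: /opt
        copy: no
    - name: Start the stand-alone master
      command: /opt/spark-1.2.0-bin-hadoop2.4/sbin/start-master.sh

- hosts: spark-workers
  tasks:
    - name: Start a worker pointing at the master
      command: /opt/spark-1.2.0-bin-hadoop2.4/sbin/start-slave.sh spark://spark-master:7077
```

The real playbooks also handle the OpenStack provisioning and the HDFS setup, which this sketch leaves out.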
I haven't had time to play around with it that much yet, but I hope to do so and report back here later. I'm especially interested in setting up and using ADAM (a Spark framework for genomics), and in trying out Spark Notebook (an IPython-Notebook-like interface for working interactively with Spark).
Finishing off with some acknowledgements: I want to thank Mikael Huss (@MikaelHuss), who helped me test the playbooks and make sure the instructions are up to scratch, and Zeeshan Ali (@zeeshanalishah), who helped me get up and running with OpenStack.
Check out the repo – pull requests are welcome!
** An informal discussion around a research paper, common in academia. In our group we have one once per week.