What is a core hour?

Having worked on setting up so-called best-practice pipelines for different genomics analyses, one question I’ve heard a lot is: how long does it take to run? At this point I like to give an answer along the lines of “it takes 40 hours and 640 core-hours”, and I’d argue that the latter is the more interesting of the two numbers. Why is that? I’d like to try to explain this in layman’s terms so that I myself (and possibly others) can point here for an explanation in the future.

In some cases what you want to look at is the actual run time (or the “wall time”, as it’s sometimes called). For example, if you’re setting up a clinical diagnostics pipeline this number might matter a great deal, since it determines how quickly you can deliver a result to the doctor. However, in many research settings it is not the most interesting thing to look at, assuming that you don’t have infinite resources.

Taking a step back and explaining this in a bit more detail: one first needs to understand that a computer today typically has more than one processor core, and many programs will make use of multiple cores at the same time. This gives a simple relation between wall-clock hours and core hours:

core hours = number of hours passed * number of cores used

However, few programs scale linearly with the number of cores, which means that you will end up in a situation of diminishing returns: doubling the core count rarely halves the run time.
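
To make this concrete (the post’s own example, 40 hours of wall time and 640 core-hours, corresponds to booking 16 cores), here is a minimal Python sketch with made-up numbers. The speedup model is Amdahl’s law, a standard way to reason about diminishing returns, and the serial run time and parallel fraction are assumptions for illustration, not measurements:

# Sketch: core-hours and diminishing returns (illustrative numbers only).

def core_hours(wall_hours, cores_booked):
    """Core hours = wall-clock hours * number of cores booked."""
    return wall_hours * cores_booked

def amdahl_speedup(p, n):
    """Amdahl's law: if a fraction p of the work is parallelizable,
    the best possible speedup on n cores is 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)

serial_hours = 160.0   # hypothetical single-core run time
p = 0.95               # hypothetical parallelizable fraction of the work

for n in (1, 4, 16, 64):
    wall = serial_hours / amdahl_speedup(p, n)
    print(f"{n:3d} cores: {wall:6.1f} h wall time, "
          f"{core_hours(wall, n):7.1f} core-hours")

Note how, in this toy model, going from 16 to 64 cores cuts the wall time by less than half while more than doubling the core-hour bill.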

A computer cluster (such as is used by many genomics researchers) consists of multiple such machines where you can run your programs. Typically you will book a number of cores on such a cluster to run your program on, and you will be charged for the number of cores you booked, as if you were using them 100 percent throughout the entire run time of the program.

So unless you have infinite resources you want to use those hours as efficiently as possible, right? Looking at the core hours used rather than the actual run time is one way to make sure that you squeeze as much as possible out of the resources you have. Maybe you have to wait a little longer, but as you move towards, for example, analyzing thousands of whole human genomes, core-hours become an important factor.

If you didn’t get what those computer folks meant when they were talking about core hours, now you know.

Thoughts on BOSC 2015

I won’t be writing a detailed summary of the Bioinformatics Open Source Conference (BOSC) 2015 in Dublin, since there is already an excellent one available here: https://smallchangebio.wordpress.com/2015/07/11/bosc2015day2b/ However, I thought I’d write down some thoughts on the trends that I think I saw.

Before going into the technical stuff, I’d just like to add that the picture above was taken at the panel on diversity in the bioinformatics open source community. It was nice seeing the issue addressed, as it is as important a challenge as any technical one we might currently be facing as a community. This, in addition to the cards used to collect questions for the speakers (instead of the usual stand-up-and-ask format), shows that the BOSC organizers are willing to take this on. Kudos to them for doing so!

Workflows, workflows, workflows….

It’s clear that handling workflows is still an unsolved problem. Having spent considerable time and effort setting up and managing pipelines myself, I truly applaud the ongoing efforts to make things a bit more standardized and interoperable using the Common Workflow Language (CWL). If it would actually become possible to download and run somebody else’s pipeline on more than one platform, that would be truly amazing.

There still seems to be some confusion about the exact nature of CWL and what it aims to do. My understanding is that it will provide a specification, consisting of tool definitions and workflow descriptions, that platform developers can implement in order to make it possible to migrate these between platforms. As yet it seems to be somewhat lacking on the implementation side of things (which is to be expected, since it was announced to the public at this BOSC, if I understand things correctly). I really hope that things will take off on the implementation front, and once they do I want to try my hand at translating some of the things that we have set up into CWL.
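
To make the tool-definition/workflow split concrete, here is a toy sketch in Python. This is not CWL syntax, just an illustration of the separation the specification aims for; the tool and file names are made up:

# Toy illustration of the CWL idea (NOT real CWL syntax): a tool is
# described as data, and a separate "engine" turns it into a command line.

# A tool definition: what the tool is called and how to invoke it.
bwa_mem = {
    "id": "bwa-mem",
    "base_command": ["bwa", "mem"],
    "inputs": ["reference", "reads"],
}

# A workflow description: which tools run, and with what inputs.
workflow = [
    {"tool": bwa_mem, "args": {"reference": "hg19.fa", "reads": "sample.fq"}},
]

def render_command(step):
    """A minimal engine: turn one workflow step into a concrete command."""
    tool = step["tool"]
    return tool["base_command"] + [step["args"][name] for name in tool["inputs"]]

for step in workflow:
    print(" ".join(render_command(step)))
# -> bwa mem hg19.fa sample.fq

Any platform that can interpret the same definitions can run the same workflow; that interoperability is the point.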

In my wildest dreams CWL could also serve as a starting point for building a community that could collaborate on and provide other things, such as:

  • tool repositories (like a Docker Hub for bioinformatics tools), providing containers and tool definitions.
  • collaborative workflow repositories (with workflows that are actually possible to deploy outside their original environment – no more re-implementing the GATK best practice pipeline yet another time).
  • reference data repositories – something that could be to bioinformatics what Maven Central is to Java: a single place from which e.g. reference genomes could be downloaded automatically based on a configuration file. (While writing this I realized that I’d already seen something similar to what I’m describing: Cosmid – so folks, let’s just go and adopt this now!) A rough sketch of the idea follows below.
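
As a minimal sketch of that last point (assuming nothing about how Cosmid actually works; the config entries, URL, and checksum below are entirely made up):

# Sketch of config-driven reference data fetching. The config format,
# names, URL and checksum are hypothetical -- this is the idea, not Cosmid.
import hashlib
import urllib.request
from pathlib import Path

# A project would declare its reference data in a config file;
# here it is inlined as a dict for brevity.
REFERENCES = {
    "hg19": {
        "url": "https://example.org/references/hg19.fa.gz",  # hypothetical
        "sha256": "0000...",  # checksum pins the exact file, Maven-style
    },
}

def fetch(name, dest_dir="references"):
    """Download a named reference once, then verify it by checksum."""
    entry = REFERENCES[name]
    dest = Path(dest_dir) / Path(entry["url"]).name
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(entry["url"], dest)
    digest = hashlib.sha256(dest.read_bytes()).hexdigest()
    if digest != entry["sha256"]:
        raise ValueError(f"checksum mismatch for {name}")
    return dest

# fetch("hg19") would then give every pipeline the same pinned reference.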

Docker everywhere

Docker was mentioned so many times that it eventually became a joke. It does seem to provide a convenient solution to the software packaging problem. However, my own limited experience with Docker tells me that the process of using it would have to be simplified for it to be adopted outside a limited group.

What’s needed is something that lets you run the tool you want with a minimum of overhead. Something like:

run-in-docker <my-favorite-tool> <args>
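
As a sketch of what such a wrapper could do (this is a hypothetical script, not an existing tool; the image names and the convention of mounting the current directory are my own assumptions):

#!/usr/bin/env python3
# Sketch of a hypothetical "run-in-docker" wrapper: map a tool name to a
# container image, mount the working directory, and pass the args through.
import subprocess
import sys
from pathlib import Path

# Illustrative mapping of tool names to container images.
IMAGES = {
    "samtools": "biocontainers/samtools",
    "bwa": "biocontainers/bwa",
}

def main():
    if len(sys.argv) < 2:
        sys.exit("usage: run-in-docker <tool> [args...]")
    tool, args = sys.argv[1], sys.argv[2:]
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{Path.cwd()}:/data",  # expose the current directory
        "-w", "/data",                # and run the tool from there
        IMAGES[tool], tool, *args,
    ]
    sys.exit(subprocess.call(cmd))

if __name__ == "__main__":
    main()

With something like that in place, running run-in-docker samtools view sample.bam would feel almost like having the tool installed natively.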

Until this is possible I don’t think that we are going to see widespread adoption outside platform solutions like Arvados, Galaxy, etc. I guess there are also security issues that would need to be resolved before sysadmins at HPC centers would be willing to install it.

Wrapping up

Going to BOSC was a rewarding experience and an excellent opportunity to get a feel for where the community is heading. A warm thanks to the organizers as well as all the speakers.

Codefest 2015

I had the great pleasure of attending the Codefest 2015 in Dublin. Not only did I get to meet some really nice people but I also got some work done. Below are the projects that I worked on and some general notes.

TEQCviewer
https://github.com/marcou/TEQCviewer

On the first day I joined Mirjam Rehr and Maciej Pajak in working on a Shiny app to visualize targeted sequencing data in an interactive way. Not only was this a nice opportunity to brush up on R and Shiny, but I was also introduced to something new and interesting.

I’d never seen the R package Packrat before. It’s used to create a set of local dependencies for an R project, making sure that updating a dependency in one project does not break other projects that use an older version of the same library. It also bootstraps all the dependencies into the project when you load it, which is an added benefit.

contAdamination
https://github.com/johandahlberg/contAdamination

I’ve been wanting to try out ADAM for some time now, but I hadn’t found the time for it. The Codefest felt like an excellent opportunity to get into something new, so that’s what I did on day two. After a short brainstorming session, Roman Valls Guimerà and I decided to see if we could port (and possibly improve upon) the idea used in FACS of using Bloom filters to find contamination in short-read data.
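
The basic idea, as a minimal Python sketch (this is not the actual FACS or contAdamination code; the k-mer size, filter parameters, and threshold are made-up illustrations):

# Sketch of Bloom-filter contamination screening: build a filter from the
# contaminant's k-mers, then flag reads whose k-mers mostly hit the filter.
import hashlib

class BloomFilter:
    def __init__(self, size=10_000_000, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            h = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def kmers(seq, k=21):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def looks_contaminated(read, contaminant_filter, threshold=0.8):
    """Flag a read if most of its k-mers hit the contaminant's filter."""
    hits = total = 0
    for kmer in kmers(read):
        total += 1
        hits += kmer in contaminant_filter
    return total > 0 and hits / total >= threshold

# Usage: index the contaminant genome once, then stream reads past it.
contaminant = BloomFilter()
for kmer in kmers("ACGTACGTTAGCGGATTACTGACGTACGTAAG"):  # toy "genome"
    contaminant.add(kmer)
print(looks_contaminated("ACGTACGTTAGCGGATTACTGACGT", contaminant))  # True

The appeal of the Bloom filter here is that the whole contaminant reference fits in a small, fixed amount of memory, at the cost of a tunable false-positive rate.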

We got off to a good start and got a lot of fun hacking done. We even got some remote help, both with pull requests and advice from Michael L Heuer and Christian Pérez-Llamas, which was awesome!

Not only did I learn a lot that day, but it also got me excited to keep working on this and see if we can actually make a real product out of it. We’ll see how that goes, but regardless, getting it started at the Codefest was super exciting.

Other things of note

A large part of Codefest 2015 was dedicated to the Common Workflow Language – and it seems that some progress was made in the hackathon. I think it’s going to be very interesting to see where this project goes in the future. If it keeps getting traction in the community, I think it could actually be the solution to the N + 1 pipeline-systems problem in bioinformatics. Wouldn’t that be awesome?

Finally, if you’re in any way interested in open source coding for bioinformatics I’d absolutely recommend going to the Codefest. Hope to see you there in the future!

P.S. Robin Ander also did a write-up of the Codefest here – check it out! (Even later edit:) Guillermo Carrasco also wrote about the Codefest here.