Category Archives: technology

Genomics data – distributed and searchable by content

MinHashes

Recently there has been a lot of interesting activity around MinHash sketches on Titus Brown's blog (1, 2, 3).

To quickly recap: a MinHash sketch is a data structure which allows approximate set comparisons in constant memory. In essence it stores a fixed number of outputs from a hash function, e.g. the 1000 smallest hash values of the set's elements, and then compares two sets by comparing the overlap of the hashes rather than the overlap of the original sets.

This means that you can calculate a distance between two genomes or sets of sequencing reads by computing a MinHash sketch from the sequence k-mers. For a far better description of MinHash sketches than mine I recommend reading the Mash paper, which has a very nice and clear introduction to the concepts.
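
To make the idea a bit more concrete, here is a minimal sketch in Scala (this is not how Mash implements it; the sketch size, k-mer length and choice of hash function are arbitrary placeholders):

import scala.util.hashing.MurmurHash3

object MinHashSketch {

  // Split a sequence into overlapping k-mers.
  def kmers(sequence: String, k: Int = 21): Iterator[String] =
    sequence.sliding(k)

  // Keep the `size` smallest distinct hash values of the k-mers -- this is the sketch.
  def sketch(kmers: Iterator[String], size: Int = 1000): Set[Int] =
    kmers.map(MurmurHash3.stringHash(_)).toSeq.distinct.sorted.take(size).toSet

  // Estimate the similarity of the two underlying sets by looking only at the
  // sketches: take the `size` smallest hashes of the merged sketch and count
  // how many of them occur in both (the usual bottom-k Jaccard estimate).
  def similarity(a: Set[Int], b: Set[Int], size: Int = 1000): Double = {
    val merged = (a ++ b).toSeq.sorted.take(size)
    merged.count(h => a.contains(h) && b.contains(h)).toDouble / merged.size
  }
}

Two genomes could then be compared by sketching their k-mers and calling similarity on the two resulting sketches.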

Distributed data

The idea that I found the most interesting in Brown's blog posts is that of creating a database of the MinHashes, referencing datasets either with links to where they can be downloaded or, for non-public data, with contact information for the owner of the data. When I then came across the Interplanetary File System (IPFS) it struck me that combining these two technologies could make for an extremely interesting match.

Combining the two, I imagine that it would be possible to build a distributed database of sequencing data, searchable by content. To make this efficient some type of index would of course be needed, and once again this idea is explored by Brown in the form of sequence Bloom trees.

Preferably the data itself could also be stored in IPFS. For fully public data this seems perfectly feasible; for private data it might be somewhat trickier. For that scenario one could imagine either having pointers to "normal" static resources or to private IPFS networks (a feature which appears to be on the roadmap for IPFS), or storing the data encrypted in the public network.

There are already some interesting developments in using IPFS with genomics data. The Cancer Gene Trust project is using IPFS to create a distributed database of cancer variants and clinical data. The project is called Cancer Gene Trust Daemon and is available on Github. Super interesting stuff, which as far as I can tell is not directly searchable by content. Right now I'm not sure how well the MinHash sketch technique would extend to something like a variant file, but since it was developed for general document similarity evaluation it should not be impossible. Off the top of my head I would think that it would be possible to use the variant positions themselves to compute the hashes.

Why a distributed data store?

I think that there are a number of reasons why this could be a worthwhile effort. Firstly there are obvious dangers with keeping scientific data in central repositories – what happens when funding for a project runs out or the PhD student leaves the lab? I’m fairly sure that there are many dead or dying scientific web services out there. Having a distributed system would mean that the data will (in theory) live indefinitely in the network.

Secondly, having a distributed system means bandwidth can be used way more efficiently, effectively creating a content delivery network for the data, where any actor who'd like to could set up a mirror of the data in the network, making sure that their clients do not need to download the human genome from the other side of the world again and again when it is available on a colleague's workstation.

Finally

Right now I'm mostly just excited about both of these technologies and thought I'd share my thoughts. I hope to be able to carve out some time to actually play around with these ideas sometime in the future.

As a final note I have downloaded the human reference genome (GRCh38/hg38) and uploaded it to IPFS, it is available at this address: https://ipfs.io/ipfs/QmQ3gCx4WqaohRYbE8NFmE2Yc8xmARDPJ3jZNpQdfoWyKQ. If you run an IPFS node you might want to pin it. Enjoy!

Tutorial: Verifying messages from Facebook using Scala and the Play Framework

Background

I’ve been interested in chat bots for some time and when Facebook launched their API for building bots on their messenger platform I thought I’d play around with it. I have also wanted to look into using Scala with the Play Framework for building a web service, so I decided to combine the two.

Going about building a small messenger bot I realized that there had to be some way to verify that the messages being posted to the webhook I'd set up actually came from Facebook – and I noticed that there is an "X-Hub-Signature" header in the request. While this is not documented in the API docs that I could find, it is documented here:

The HTTP request will contain an X-Hub-Signature header which contains the SHA1 signature of the request payload, using the app secret as the key, and prefixed with sha1=. Your callback endpoint can verify this signature to validate the integrity and origin of the payload
Please note that the calculation is made on the escaped unicode version of the payload, with lower case hex digits. If you just calculate against the decoded bytes, you will end up with a different signature. For example, the string äöå should be escaped to \u00e4\u00f6\u00e5.

Having spent some time figuring out how to set this up, I thought I’d share my results here.

Step 0: How Facebook posts messages to your webhook

Following the tutorial here you can see how to set up a webhook that Facebook will post any messages sent to your page to. What you then need to do is implement an http(s) endpoint that can handle json in the following format being posted to it:

{
  "object":"page",
  "entry":[
    {
      "id":PAGE_ID,
      "time":1466278528,
      "messaging":[
        {
          "sender":{
            "id":"USER_ID"
          },
          "recipient":{
            "id":"PAGE_ID"
          },
          "timestamp":1466278528,
          "message":{
            "mid":"mid.1567764493618:41d102a3e1ae210a38",
            "seq":79,
            "text":"hello, world!"
          }
        }
      ]
    }
  ]
}

Note that in reality the endpoint must be able to handle other types of messages as well, such as a user opting in, delivery confirmations, etc.

Step 1: Creating the model

While Play does allow you to work directly with json, I think it's nice to translate it to some Scala classes. Using the implicit readers/formatters provided by the Play framework this is fairly simple.

Just translate the json structure into a set of Scala case classes with field names matching the keys in the json.

We will later use this to convert the incoming json to a Scala class.
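
To illustrate, here is a minimal sketch of what such a model could look like (the case class and field names below are my own, chosen to mirror the example payload above, and the exact types are assumptions):

import play.api.libs.json.{Json, Reads}

// Case classes mirroring the json payload -- field names match the json keys,
// so Play's macro-generated readers can pick them up automatically.
case class Sender(id: String)
case class Recipient(id: String)
case class Message(mid: String, seq: Long, text: Option[String])
case class Messaging(sender: Sender, recipient: Recipient, timestamp: Long, message: Option[Message])
case class Entry(id: String, time: Long, messaging: Seq[Messaging])
case class MessagingPayload(`object`: String, entry: Seq[Entry])

object MessagingPayload {
  implicit val senderReads: Reads[Sender] = Json.reads[Sender]
  implicit val recipientReads: Reads[Recipient] = Json.reads[Recipient]
  implicit val messageReads: Reads[Message] = Json.reads[Message]
  implicit val messagingReads: Reads[Messaging] = Json.reads[Messaging]
  implicit val entryReads: Reads[Entry] = Json.reads[Entry]
  implicit val payloadReads: Reads[MessagingPayload] = Json.reads[MessagingPayload]
}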

Step 2: Building a simple verification service

As the documentation states, to verify the request we will use a hash-based message authentication code (HMAC). The quick version is that we compute a hash of the message content together with a secret token that we share with Facebook. In this way, we can be sure that the message posted to the webhook actually came from them.

For this tutorial we will implement a very simple SHA1-based verification service. Please note that you should never keep the secret in the source code, but load it from configuration, an environment variable or some other approach which does not leave it exposed when you check your code into version control.
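
Here is a minimal sketch of such a service using javax.crypto (the class name and the way the secret is passed in are my own choices):

import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

class SignatureVerificationService(appSecret: String) {

  // Compute the HMAC-SHA1 of the payload and return it as lower case hex digits,
  // prefixed with "sha1=" so it can be compared directly to the X-Hub-Signature header.
  def signature(payload: String): String = {
    val mac = Mac.getInstance("HmacSHA1")
    mac.init(new SecretKeySpec(appSecret.getBytes("UTF-8"), "HmacSHA1"))
    val digest = mac.doFinal(payload.getBytes("UTF-8"))
    "sha1=" + digest.map("%02x".format(_)).mkString
  }

  // Compare the computed signature to the value of the X-Hub-Signature header.
  def isValid(payload: String, xHubSignature: String): Boolean =
    java.security.MessageDigest.isEqual(
      signature(payload).getBytes("UTF-8"),
      xHubSignature.getBytes("UTF-8"))
}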

Now that we have a way to check the incoming bytes it’s time to create a function to verify the request.

Step 3: Constructing a wrapper function to verify requests

While this tutorial only covers verifying incoming messages (i.e. what Facebook calls receiving messages), there are other events as well that might trigger a callback sending data to the webhook you have registered. Therefore it would be neat to implement this as a function which we can then re-use for the other types of callbacks.

The trick here is that we have to parse the message body – and all the examples of stacking actions on top of each other in the Play documentation assume that only the last action will actually parse the request. This leads us to adopt a slightly different approach.

We will implement this by creating a function which takes a function as an argument (i.e. a higher-order function). In this case the function we accept as an argument will be an action, and that action will only be carried out, i.e. have its function executed, if we successfully validate the request.

Note how we parse the request body as bytes, convert it to a utf8 string (as per the specification), and verify that the signature that our verification service computes is the same as the one sent in the "X-Hub-Signature" header.

Since we parse the body as bytes here, we will have to deal with converting those bytes to something more useful, such as json, in the next step.

Here is roughly what the code for this could look like:
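
(This sketch assumes a Play 2.4/2.5-style controller; the names VerifiedActions, verifiedAction and verificationService are my own, and SignatureVerificationService refers to the hypothetical service sketched in step 2.)

import play.api.mvc._

trait VerifiedActions extends Controller {

  def verificationService: SignatureVerificationService

  // Higher-order function: takes the action body to run once the request has been
  // verified, and wraps it in an Action that checks the X-Hub-Signature header first.
  def verifiedAction(block: (Request[RawBuffer], String) => Result): Action[RawBuffer] =
    Action(parse.raw) { request =>
      val result = for {
        signature <- request.headers.get("X-Hub-Signature")
        bytes     <- request.body.asBytes()
        body      = new String(bytes.toArray, "UTF-8") // the payload as a utf8 string
        if verificationService.isValid(body, signature)
      } yield block(request, body)

      result.getOrElse(Forbidden("Could not verify the X-Hub-Signature of the request"))
    }
}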

Step 4: Receiving messages

Finally we will use what we've created above to receive a message, parse it into a Scala class and respond appropriately, returning status 200 if we have successfully parsed the request.

Note that we convert the bytes to json here, and use the json.validate function to convert it to the model we created in step 1.
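
For illustration, here is a sketch of what the receiving endpoint could look like, building on the hypothetical model and wrapper sketched above:

import play.api.libs.json.{JsError, JsSuccess, Json}
import play.api.mvc._

class WebhookController(val verificationService: SignatureVerificationService)
  extends Controller with VerifiedActions {

  def receiveMessage(): Action[RawBuffer] = verifiedAction { (request, body) =>
    Json.parse(body).validate[MessagingPayload] match {
      case JsSuccess(payload, _) =>
        // Handle the messages in `payload` here, e.g. log them or reply to the sender.
        Ok("Message received")
      case JsError(_) =>
        BadRequest("Could not parse the request body")
    }
  }
}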

Wrapping up

Looking at this all together we get about 160 lines of code including imports and blank lines. Not too bad. There are probably more elegant ways to solve this problem, but this does seem to get the job done.

As a reward for following this tutorial to the end, here is the entire source in one gist.

Using IBM Watson to build a podcast annotator

Some time ago now – back in October – I joined Alvaro Martinez Barrio and Mikael Huss in the Uppsala Hackup Hackathon. We formed the formidable team the "hungry aardvarks" (a name graciously provided by the organizers which was too cool to change). The goal of the hackathon was to play around with the IBM Watson APIs, and it was tons of fun. Since then I've been keeping busy with other things – but now I've finally gotten around to cleaning the code up a bit and placing it on GitHub.

While early on we tried to come up with an idea for something to do in bioinformatics using Watson, we realized that most of the available APIs lent themselves better to something based on machine understanding of natural language. Mikael came up with the idea of an automatic podcast annotator. This sounded like a really cool project and something that we could hack together in a day – so that's what we decided to do.

You can find the code here:

https://github.com/johandahlberg/hungry_aardvarks

Our idea and execution took us to the final – but in the end the win was snatched by the team building a speech-enabled home-automation system on a Raspberry Pi. Super cool project – well worth the victory!

Mikael has already written about Hackup here. Go and check it out! He writes more about our process and what the other teams at the hackathon were up to.

Finally, lots of kudos to the organizer of the hackathon, Jason Dainter, and to IBM for building such a cool system and making it easy to use.

What is a core hour?

Having worked with setting up so-called best practice pipelines for different genomics analyses, one question that I've heard a lot is: how long does it take to run? At this point I like to give an answer along the lines of "it takes 40 hours and 640 core hours", and I'd like to argue that the latter is the more interesting of the two. Why is that? I'd like to try to explain this in layman's terms so that I (and possibly others) can point here for an explanation in the future.

For some cases what you want to look at is the actual run time (or "wall time" as it's sometimes called). For example, if you're setting up a clinical diagnostics pipeline this number might matter a great deal, as it means that you can deliver a result to the doctor quicker. However, for many cases in research this is not the most interesting thing to look at, assuming that you don't have infinite resources.

Taking a step back and explaining this in a bit more detail, one first needs to understand that a computer today typically has more than one processor core (often just called a core), and many programs will make use of multiple cores at the same time. This gives a simple relation between normal hours and core hours:

core hours = number of hours passed * number of cores used
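
To tie this back to the example in the introduction: a pipeline that takes 40 hours of wall time while keeping 16 cores booked has consumed 40 * 16 = 640 core hours.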

However, few programs scale linearly with the number of cores, meaning that you will end up in a situation of diminishing returns: doubling the number of cores rarely halves the run time.

A computer cluster (such as is used by many genomics researchers) consists of multiple such machines where you can run your programs. Typically you will book a number of cores on such a cluster to run your program on, and you will be charged for the number of cores you booked, as if you were using them 100 percent throughout the entire run-time of the program.

So unless you have infinite resources you want to use those hours as efficiently as possible, right? Looking at the core hours used rather than the actual run time is one way to make sure that you squeeze out as much as possible from the resources you have. Maybe you have to wait a little longer, but as you move into, for example, scaling up to handle the analysis of thousands of whole human genomes, core hours become an important factor.

If you didn’t get what those computer folks meant when they were talking about core hours, now you know.

Thoughts on BOSC 2015

I won't be making any detailed summary of the Bioinformatics Open Source Conference (BOSC) 2015 in Dublin since there is already an excellent one available here: https://smallchangebio.wordpress.com/2015/07/11/bosc2015day2b/ However, I thought I'd write down some thoughts on the trends that I think I saw.

Before going into the technical stuff I'd just like to add that the picture above was taken at the panel on diversity in the bioinformatics open source community. It was nice seeing the issue addressed, as it is as important a challenge as any technical one we might currently be facing as a community. This, in addition to the cards used to take questions for speakers (instead of the usual stand-up-and-talk way), shows that the BOSC organizers are willing to take this on. Kudos to them for doing so!

Workflows, workflows, workflows….

It's clear that handling workflows is still an unsolved problem. Having spent considerable time and effort setting up and managing pipelines myself, I truly applaud the ongoing efforts to make things a bit more standardized and interoperable using the Common Workflow Language (CWL). If it would actually become possible in the future to download and run somebody's pipeline on more than one platform, that would be truly amazing.

There still seems to be some confusion about the exact nature of CWL and what it aims at doing. My understanding is that it will provide a specification consisting of tool definitions and workflow descriptions that platform developers can implement in order to make it possible to migrate these between platforms. As of yet it seems to be somewhat lacking on the implementation side of things (which is to be expected since it was announced to the public at this BOSC, if I understand things correctly). I really hope that things will take off on the implementation front, and once they do I want to try my hand at translating some of the things that we have set up into CWL.

In my wildest dreams CWL could also serve as a starting point for building a community which could collaborate on and provide other things, such as:

  • tools repositories (like a Docker Hub for bioinformatics tools), providing containers and tool definitions.
  • collaborative workflow repositories (that are actually possible to deploy outside their exact environment – no more re-implementing the GATK best practice pipeline yet another time).
  • reference data repositories – something that could be to bioinformatics what Maven Central is to Java. A single place from which e.g. reference genomes could be downloaded automatically based on a configuration file. (While writing this I realized that I’d seen something similar to what I was describing: Cosmid – so folks, let’s just go and adopt this now!)

Docker everywhere

Docker was mentioned so many times that eventually it became a joke. It does seem to provide a convenient solution to the software packaging problem – however my own limited experience with Docker tells me that the process of using it would have to be simplified in order to make it adoptable outside of a limited group.

What’s needed is something which allows you to access the tool you want to use with a minimum overhead. Something like:

run-in-docker <my-favorite-tool> <args>

Until this is possible I don't think that we are going to see widespread adoption outside platform solutions like Arvados, Galaxy, etc. I guess there are also issues of security that would need to be resolved before sys-admins at HPC centers would be willing to install it.

Wrapping up

Going to BOSC was a rewarding experience and an excellent opportunity to get a feel for where the community is heading in the future. A warm thanks to the organizers as well as all the speakers.

Codefest 2015

I had the great pleasure of attending the Codefest 2015 in Dublin. Not only did I get to meet some really nice people but I also got some work done. Below are the projects that I worked on and some general notes.

TEQCviewer
https://github.com/marcou/TEQCviewer

On the first day I joined Mirjam Rehr and Maciej Pajak in working on a Shiny app to visualize targeted sequencing data in an interactive way. Not only was this a nice opportunity to brush up on R and Shiny, but I was also introduced to something new and interesting.

I'd never seen the R package Packrat before – it's used to create a set of local dependencies for an R project, thus making sure that updating a dependency in one project does not break other projects that use an older version of the same library. It also bootstraps all the dependencies into the project when you load it, which is just an added benefit.

contAdamination
https://github.com/johandahlberg/contAdamination

I've been wanting to try out ADAM for some time now, but I haven't found the time for it. It felt like the Codefest was an excellent opportunity to get into something new, so that's what I did on day two. After a short brainstorming session, Roman Valls Guimerà and I decided to see if we could port (and possibly improve upon) the idea used in FACS of using Bloom filters to find contamination in short read data.
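
To illustrate the underlying idea (this is a toy sketch, not the actual contAdamination code): fill a Bloom filter, here Guava's implementation, with the k-mers of a suspected contaminant, and flag reads whose k-mers mostly hit the filter. The k-mer length, false positive rate and threshold interpretation are arbitrary placeholders.

import com.google.common.hash.{BloomFilter, Funnels}
import java.nio.charset.StandardCharsets

object ContaminationSketch {

  val k = 21

  // Split a sequence into overlapping k-mers.
  def kmers(sequence: String): Iterator[String] = sequence.sliding(k)

  // Build the filter from the k-mers of the contaminant reference.
  def buildFilter(referenceKmers: Iterator[String], expectedInsertions: Int): BloomFilter[CharSequence] = {
    val filter = BloomFilter.create[CharSequence](
      Funnels.stringFunnel(StandardCharsets.UTF_8), expectedInsertions, 0.01)
    referenceKmers.foreach(kmer => filter.put(kmer))
    filter
  }

  // Fraction of a read's k-mers present in the filter -- a high fraction suggests
  // that the read may come from the contaminant.
  def hitFraction(read: String, filter: BloomFilter[CharSequence]): Double = {
    val all = kmers(read).toSeq
    if (all.isEmpty) 0.0 else all.count(kmer => filter.mightContain(kmer)).toDouble / all.size
  }
}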

We got off to a good start and got a lot of fun hacking done. We even got some remote help, both with pull requests and advice from Michael L Heuer and Christian Pérez-Llamas, which was awesome!

Not only did I learn a lot that day, but it also got me excited to keep working on this and see if we could actually make a real product out of it. We'll see how that goes, but regardless, getting it started at the Codefest was super exciting.

Other things of note

A large part of Codefest 2015 was dedicated to the Common Workflow Language – and it seems that some progress was made during the hackathon. I think it's going to be very interesting to see where this project goes in the future. If it keeps getting traction in the community I think that it could actually be the solution to the N + 1 pipeline problem in bioinformatics. Wouldn't that be awesome?

Finally, if you’re in any way interested in open source coding for bioinformatics I’d absolutely recommend going to the Codefest. Hope to see you there in the future!

P.S. Robin Ander also did a write-up of the Codefest here, check it out! (Even later edit:) Guillermo Carrasco also wrote about the Codefest here.

It’s already been done…

Not that long ago I wrote about the use of CRISPR-Cas9 systems and the ongoing ethical debate on their use in humans. As it turns out, this future is even closer than I might initially have thought. This week a Chinese group published a paper where they used these systems to do genome editing in non-viable human embryos.

The original paper is available here: http://link.springer.com/article/10.1007%2Fs13238-015-0153-5
A Nature News piece on the topic can be found here: http://www.nature.com/news/chinese-scientists-genetically-modify-human-embryos-1.17378
Finally, the findings were also covered in the Swedish mainstream press (in Swedish) here: http://www.dn.se/ekonomi/forsta-manskliga-genmanipulationen-genomford/

The study itself draws attention to some problems related to genome editing using these techniques. The poor success rates (28 out of 86 embryos tested were successfully spliced) and problems with off-target mutations hinder the immediate clinical use of the technique. These are however problems that I’m sure will be addressed by technology development.

More importantly I think that this highlights the importance of establishing ethical frameworks for the use of gene-editing. Interestingly the paper was rejected both by Nature and Science “in part because of ethical objections” – it remains to be seen if such objections will hold in the future.

Personally I'm as of yet undecided on the morality of carrying out gene editing in embryos. While the promise of cures for heritable diseases is wonderful, it's easy to see a slippery slope from there into more dubious uses. It also raises the question of what is to be considered a disease or disability.

Today at the SciLifeLab Large Scale Human Genome Sequencing Symposium, Dr. Anna Lindstrand spoke about a study indicating CTNND2 as a candidate gene for reading problems. Are such mild disabilities to be considered for gene editing? Maybe not – but where do we draw the line? Once the technology is available I have little doubt that some will want to use it to produce "genetically enhanced" humans.

The Nature News article referenced above ends with the somewhat ominous quote: "A Chinese source familiar with developments in the field said that at least four groups in China are pursuing gene editing in human embryos." Certainly we are going to hear more on this topic in the future.

The bioinformatician is dead – long live bioinformatics

Last week I came across a comment article in Nature by Jeffrey Chang titled "Core services: Reward bioinformaticians". I found it highly interesting, but I'm not sure I agree with the conclusions drawn by the author.

Chang argues that "biologists are increasingly finding that questions that are initially based on a single protein or gene quickly expand to require large-scale experiments" and thus the need for bioinformaticians increases. His answer to this is that the scientific community should invest in alternate career paths for bioinformaticians. And while I believe that the analysis of the situation is spot on, I think that there is a different solution to the problem. Rather than creating a separate structure where bioinformaticians act as support to biologists, we should include more applied bioinformatics in the basic training of biologists.

I don't think that anyone would argue against the claim that having a career in physics means having a firm understanding of math. The reason for this is of course that math is an essential tool for a physicist. In the same way I believe that a firm understanding of applied bioinformatics is crucial to the work of a biologist in a high-throughput assay world.

Chang writes:

“To give greater support to researchers, our centre set out to develop a series of standardized services. We documented the projects that we took on over 18 months. Forty-six of them required 151 data-analysis tasks. No project was identical, and we were surprised at how common one-off requests were (see ‘Routinely unique’). There were a few routine procedures that many people wanted, such as finding genes expressed in a disease. But 79% of techniques applied to fewer than 20% of the projects. In other words, most researchers came to the bioinformatics core seeking customized analysis, not a standardized package.”

I think this "routinely unique" situation only strengthens the argument that the researchers asking the questions also need to be able to answer those questions themselves. The view that I've sometimes come across, that bioinformatics is "just a support thing" which should be provided by somebody else, bothers me. It promotes the false image that one can work in an informatics-rich area without investing in the know-how.

To some extent bioinformatics is already included in biology programs, but not to the extent that I think it needs to be. The tools of the trade in bioinformatics need to be in there. Skills such as scripting, working with high-performance computing systems, etc. all need to be taught, in the long run making the bioinformaticians of today redundant.

Of course there will always be a need for specialisation (and I don't want to lose my job), but to argue that bioinformatics is a thing separate from biology I think misses the point. Rather than making ourselves indispensable we should strive to provide biologists with the tools and the knowledge they need to answer the biological questions of tomorrow. That way we will be able to say: the bioinformatician is dead, long live bioinformatics!

Setting up Spark on OpenStack using Ansible

In 2013 I first found out about Spark and was greatly intrigued, to the point that I actually hosted a journal club** on the topic. There have been many discussions of the merits of Spark around the internet, often focusing on Spark's in-memory processing capabilities. These allow for some rather impressive speed-ups for iterative algorithms compared to the equivalent programs written in Hadoop MapReduce; however, what I found to be the main feature was the ability to express programs in a way very similar to non-parallel programs and let the framework figure out the parallelisation.
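
To illustrate what I mean, here is a minimal sketch of a Spark program in Scala, counting k-mers in a set of reads, that reads almost like the corresponding single-machine collections code while Spark takes care of distributing the work (the file path and the k-mer length are arbitrary placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object KmerCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmer-count"))

    val counts = sc.textFile("hdfs:///data/reads.fasta")
      .filter(line => !line.startsWith(">"))  // skip FASTA headers
      .flatMap(read => read.sliding(21))      // split each read into 21-mers
      .map(kmer => (kmer, 1L))
      .reduceByKey(_ + _)                     // count each k-mer across the cluster

    counts.take(10).foreach(println)
    sc.stop()
  }
}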

While I have played around some with Spark on my own machine since then, I haven't until recently had the time or the resources to try it out in distributed mode. However, recently I got access to the cloud test bed Smog, set up by our local HPC center Uppmax.

To get up and running and figure out the basic nuts and bolts of Spark, I've set up some Ansible playbooks to deploy a basic stand-alone Spark cluster (with HDFS support) to OpenStack, and written up some instructions for how to get it going in this GitHub repo: https://github.com/johandahlberg/ansible_spark_openstack.

I haven’t had time to play around with it that much yet, but I hope to do so and report back here later. I’m especially interested in investigating setting up and using Adam (a Spark framework for genomics) and also trying out the Spark Notebook (iPython-notebookish interface for working interactively with Spark).

Finishing off with some acknowledgements, I want to mention Mikael Huss (@MikaelHuss), who has helped me test the playbooks and make sure that the instructions are up to scratch, and Zeeshan Ali (@zeeshanalishah), who has helped me get up and running with OpenStack.

Check out the repo – pull requests are welcome!

** An informal discussion around a research paper, common in academia. In our group we have them once per week.