Genomics data – distributed and searchable by content

MinHashes

Recently there has been a lot of interesting activity around MinHash sketches on Titus Brown's blog (1, 2, 3).

To quickly recap what a MinHash sketch is: it is a data structure that allows approximate set comparisons in constant memory. In essence it stores a number of outputs from hash functions, e.g. 1000 such hashes, and then compares two sets by comparing the overlap of the hashes rather than the overlap of the original sets.

This means that you can calculate a distance between two genomes or sets of sequencing reads by computing a MinHash sketch from the sequence k-mers. For a far better description of MinHash sketches than mine, I recommend reading the Mash paper, which has a very nice and clear introduction to the concepts.
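To make the idea a bit more concrete, here is a toy sketch in Scala (my own illustration, not code from any of the posts or papers referenced) of how a MinHash-style sketch could be built from k-mers and used to estimate the similarity of two sequences. The k-mer length, sketch size and hash function are arbitrary choices, and real tools like Mash do considerably more than this:

import scala.util.hashing.MurmurHash3

object MinHashToy {

  // All overlapping k-mers of a sequence.
  def kmers(sequence: String, k: Int = 21): Iterator[String] =
    sequence.sliding(k)

  // The sketch: the `size` smallest distinct hash values of the k-mer set.
  def sketch(sequence: String, k: Int = 21, size: Int = 1000): Set[Int] =
    kmers(sequence, k)
      .map(kmer => MurmurHash3.stringHash(kmer))
      .toSeq.distinct.sorted.take(size).toSet

  // Bottom-k estimate of the Jaccard similarity: of the `size` smallest
  // hashes in the union of the two sketches, count how many occur in both.
  def estimatedSimilarity(a: String, b: String, k: Int = 21, size: Int = 1000): Double = {
    val (sketchA, sketchB) = (sketch(a, k, size), sketch(b, k, size))
    val smallest = (sketchA ++ sketchB).toSeq.sorted.take(size)
    smallest.count(h => sketchA(h) && sketchB(h)).toDouble / smallest.size
  }
}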

Distributed data

The idea that I found the most interesting in Brown's blog posts is that of creating a database of the MinHashes, referencing datasets either with links to where they can be downloaded or, for non-public data, with contact information for the owner of the data. When I then came across the Interplanetary File System (IPFS) it struck me that combining these two technologies could make for an extremely interesting match.

Combining them, I imagine it would be possible to build a distributed database of sequencing data that can be searched by content. To make this efficient some type of index would of course be needed, and once again this idea is explored by Brown in the form of sequence bloom trees.

Preferably the data itself could also be stored in IPFS. For fully public data this seems perfectly feasible; for private data it might be somewhat trickier. For that scenario one could imagine either having pointers to “normal” static resources, pointing to private IPFS networks (a feature which appears to be on the roadmap for IPFS), or storing the data encrypted in the public network.

There are already some interesting developments in using IPFS with genomics data. The Cancer Gene Trust project is using IPFS to create a distributed database of cancer variants and clinical data. The project is called Cancer Gene Trust Daemon and is available on GitHub. Super interesting stuff, which as far as I can tell is not directly searchable by content. Right now I'm not sure how well the MinHash sketch technique would extend to something like a variant file, but I suspect that since it was developed for general document similarity evaluation it should not be impossible. Off the top of my head I would think that the variant positions themselves could be used to compute the hashes.

Why a distributed data store?

I think there are a number of reasons why this could be a worthwhile effort. Firstly, there are obvious dangers with keeping scientific data in central repositories – what happens when funding for a project runs out or the PhD student leaves the lab? I'm fairly sure that there are many dead or dying scientific web services out there. Having a distributed system would mean that the data will (in theory) live indefinitely in the network.

Secondly, having a distributed system means bandwidth can be used far more efficiently, effectively creating a content delivery network for the data, where any actor who wants to could set up a mirror in the network to make sure that their clients do not need to download the human genome from the other side of the world again and again when it is available on a colleague's workstation.

Finally

Right now I'm mostly just excited about both of these technologies and thought I'd share my thoughts. I hope to be able to carve out some time to actually play around with these ideas sometime in the future.

As a final note, I have downloaded the human reference genome (GRCh38/hg38) and uploaded it to IPFS. It is available at this address: https://ipfs.io/ipfs/QmQ3gCx4WqaohRYbE8NFmE2Yc8xmARDPJ3jZNpQdfoWyKQ. If you run an IPFS node you might want to pin it, as shown below. Enjoy!
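With the ipfs command-line client installed, pinning it should be as simple as:

ipfs pin add QmQ3gCx4WqaohRYbE8NFmE2Yc8xmARDPJ3jZNpQdfoWyKQ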

Tutorial: Verifying messages from Facebook using Scala and the Play Framework

Background

I’ve been interested in chat bots for some time and when Facebook launched their API for building bots on their messenger platform I thought I’d play around with it. I have also wanted to look into using Scala with the Play Framework for building a web service, so I decided to combine the two.

Going about building a small messenger bot, I realized that there had to be some way to verify the messages being posted to the webhook I'd set up – and I noticed that there is an “X-Hub-Signature” header in the request. While this is not documented in the API docs that I could find, it is documented here.

The HTTP request will contain an X-Hub-Signature header which contains the SHA1 signature of the request payload, using the app secret as the key, and prefixed with sha1=. Your callback endpoint can verify this signature to validate the integrity and origin of the payload
Please note that the calculation is made on the escaped unicode version of the payload, with lower case hex digits. If you just calculate against the decoded bytes, you will end up with a different signature. For example, the string äöå should be escaped to \u00e4\u00f6\u00e5.

Having spent some time figuring out how to set this up, I thought I’d share my results here.

Step 0: How Facebook posts messages to your webhook

Following the tutorial here you can see how to set up a webhook that Facebook will post any messages sent to your page to. What you then need to do is implement an http(s) endpoint that can handle json in the following format being posted to it:

{
  "object":"page",
  "entry":[
    {
      "id":PAGE_ID,
      "time":1466278528,
      "messaging":[
        {
          "sender":{
            "id":"USER_ID"
          },
          "recipient":{
            "id":"PAGE_ID"
          },
          "timestamp":1466278528,
          "message":{
            "mid":"mid.1567764493618:41d102a3e1ae210a38",
            "seq":79,
            "text":"hello, world!"
          }
        }
      ]
    }
  ]
}

Note that in reality the endpoint must be able to handle other types of messages as well, such as a user opting in, delivery confirmations, etc.

Step 1: Creating the model

While Play does allow you to work directly with json, I think it's nice to translate it to some Scala classes. Using the implicit readers/formatters provided by the Play framework this is fairly simple.

Just translate the json structure to a set of Scala case classes with field names matching the keys in the json structure.
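A sketch of what such a model could look like (the field types are my best guess from the example payload above, and the exact model in the original code may differ):

import play.api.libs.json.{Json, Reads}

// Case classes mirroring the json structure above. The field names must
// match the keys in the json payload for the macro-generated readers to work.
case class Sender(id: String)
case class Recipient(id: String)
case class Message(mid: String, seq: Long, text: Option[String])
case class Messaging(sender: Sender,
                     recipient: Recipient,
                     timestamp: Long,
                     message: Option[Message])
case class Entry(id: String, time: Long, messaging: Seq[Messaging])
case class MessagePayload(`object`: String, entry: Seq[Entry])

object MessagePayload {
  // Implicit readers generated by the play-json macros, defined in
  // dependency order so that each one can find the readers it needs.
  implicit val senderReads: Reads[Sender] = Json.reads[Sender]
  implicit val recipientReads: Reads[Recipient] = Json.reads[Recipient]
  implicit val messageReads: Reads[Message] = Json.reads[Message]
  implicit val messagingReads: Reads[Messaging] = Json.reads[Messaging]
  implicit val entryReads: Reads[Entry] = Json.reads[Entry]
  implicit val payloadReads: Reads[MessagePayload] = Json.reads[MessagePayload]
}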

We will later use this to convert the incoming json to a Scala class.

Step 2: Building a simple verification service

As the documentation states, to verify the request we will use a hash-based message authentication code. The quick version is that we are computing a hash of the message content together with a secret token that we share with Facebook. In this way, we can be sure that the message posted to the webhook actually came from them.

For this tutorial we will implement a very simple SHA1-based verification service. Please note that you should never keep the secret in the source code, but load it from configuration, an environment variable or some other approach which does not leave it exposed when you check your code into version control.
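A minimal sketch of such a service could look something like the following (the class and method names are mine, and the app secret is passed in as a constructor argument so it can be loaded from wherever you keep it):

import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

// A very simple HMAC-SHA1 verification service. Note that, per the Facebook
// docs quoted above, the signature is calculated on the escaped unicode
// version of the payload, so non-ASCII characters would need extra handling
// that this sketch leaves out.
class SignatureVerificationService(appSecret: String) {

  // Compute the HMAC-SHA1 of the payload, rendered as lower-case hex digits.
  private def hmacSha1Hex(payload: String): String = {
    val mac = Mac.getInstance("HmacSHA1")
    mac.init(new SecretKeySpec(appSecret.getBytes("UTF-8"), "HmacSHA1"))
    mac.doFinal(payload.getBytes("UTF-8")).map("%02x".format(_)).mkString
  }

  // Facebook sends the signature as "sha1=<hex digest>".
  def isValid(payload: String, xHubSignature: String): Boolean =
    xHubSignature == s"sha1=${hmacSha1Hex(payload)}"
}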

Now that we have a way to check the incoming bytes it’s time to create a function to verify the request.

Step 3: Constructing a wrapper function to verify requests

While this tutorial only covers verifying incoming messages (i.e. what Facebook calls receiving messages), there are other events as well that might trigger a callback sending data to the webhook you have registered. Therefore it would be neat to implement this as a function which we can then re-use for the other types of callbacks.

The trick here is that we have to parse the message body – and all the examples of stacking actions on top of each other in the Play documentation assume that only the last action will actually parse the request. This leads us to adopt a slightly different approach.

We will implement this by creating a function which takes a function as an argument (i.e. a higher-order function). In this case the function we accept as an argument will be an action, but this action will only be carried out, i.e. executed, if we successfully validate the request.

Note how we parse the request body as bytes, convert it to a utf8 string (as per the specification), and verify that the signature our verification service computes is the same as the one sent in the “X-Hub-Signature” header.

Since we parse the body as bytes here, we will have to deal with converting those bytes to something more useful, such as json, in the next step.

Here is what the code looks like:
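The sketch below assumes Play 2.5 (where RawBuffer.asBytes returns an Akka ByteString) and uses the SignatureVerificationService from step 2; the object name and the exact error responses are my own choices:

import akka.util.ByteString
import play.api.mvc._
import play.api.mvc.BodyParsers.parse
import play.api.mvc.Results._

// A higher-order wrapper: the `block` given as an argument is only executed
// if the X-Hub-Signature header matches what our verification service
// computes over the raw request body.
object VerifiedAction {

  def apply(verifier: SignatureVerificationService)(block: ByteString => Result): Action[RawBuffer] =
    Action(parse.raw) { request =>
      val result = for {
        signature <- request.headers.get("X-Hub-Signature")
        bytes     <- request.body.asBytes()
      } yield {
        // Convert the raw bytes to a utf8 string and check the signature.
        if (verifier.isValid(bytes.utf8String, signature)) block(bytes)
        else Forbidden("X-Hub-Signature did not match the request payload")
      }
      result.getOrElse(BadRequest("Missing X-Hub-Signature header or request body"))
    }
}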

Step 4: Receiving messages

Finally we will use what we've created above to receive a message, parse it into a Scala class and respond appropriately, returning status 200 if we have successfully parsed the request.

Note that we convert the bytes to json here, and use the json.validate function to convert it to the models we created in step one.
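A sketch of such a controller, tying together the pieces from the previous steps (again, the names are mine, and in real code the app secret should come from configuration rather than an environment variable lookup like this):

import play.api.libs.json.Json
import play.api.mvc._

// The webhook controller: verify the request, parse the bytes as json and
// validate it against the model from step 1.
class WebhookController extends Controller {

  private val verifier =
    new SignatureVerificationService(sys.env.getOrElse("FB_APP_SECRET", ""))

  def receiveMessage: Action[RawBuffer] = VerifiedAction(verifier) { bytes =>
    Json.parse(bytes.utf8String).validate[MessagePayload].fold(
      errors => BadRequest("Could not parse message payload: " + errors.mkString(", ")),
      payload => {
        // Handle the incoming messages here; for now just log any text.
        val texts = payload.entry.flatMap(_.messaging).flatMap(_.message).flatMap(_.text)
        play.api.Logger.info(s"Received message(s): ${texts.mkString(", ")}")
        Ok("Message received")
      }
    )
  }
}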

Wrapping up

Looking at it all together we get about 160 lines of code including imports and blank lines. Not too bad. There are probably more elegant ways to solve this problem, but this does seem to get the job done.

As a reward for following this tutorial to the end, here is the entire source in one gist.

Rural to urban – the primary axis of Swedish politics?

To begin with I'd like to apologize for the mix of languages in this post. The data sources for this post are in Swedish, and I've not taken the time to translate them. However, as I expect that this is mostly of interest to Sweden-based readers anyway, I don't think this will be too much of a problem.

As Mikael remarks in his original post, standard PCA is not suited to analysing compositional data (such as the voting data, where the voting percentages need to add up to one for each municipality). This meant that I wanted to look into using robust principal component analysis for compositional data (available from the “robCompositions” R package) to see how that would influence the results.

Let's begin by having a look at the scoring plots overlaid with some information about the municipalities.


The two figures above show two similar views of the Swedish political landscape. There is a clear trend from rural to urban, with the left-right political spectrum following along approximately the same lines.

A great deal of the variability in the data (77%) is explained by these two principal components. So let's look at their compositions:

Looking at the loadings of the first principal component, we should expect to see municipalities with a relatively higher proportion of votes for the traditionally rural Center Party (C) on the left. On this side we also expect to see municipalities with a high percentage of votes for the right-wing populist/xenophobic party, the Sweden Democrats (SD). On the right we see municipalities with a high percentage of votes for the Liberals (FP) and the Green Party (MP). My interpretation is that this lends strength to the hypothesis that the x-axis shows a separation from rural to urban.

The second principal component interestingly seems to show a separation along traditional Swedish political lines, with the right in the bottom half of the figure and the left in the top.

To get a feel for this, let’s add the names of the municipalities to the PCA.

While this figure is something of a mess, we can clearly see the names of the outliers. And this does seem to make sense: Gällivare and Jokkmokk in the north of Sweden traditionally lean strongly to the left, while Danderyd outside of Stockholm is well known as a stronghold of the Swedish right. On the left we see rural municipalities like Ragunda and Strömsund, and on the right we see the larger cities of Sweden, like Stockholm and Göteborg.

All in all, it seems that my results here concur with Mikael's.

Mikael then goes on to do random forest classification to see if certain key parameters of the different municipalities can predict the voting outcome. I decided to take a more straightforward route here. I think this is especially interesting due to the statements made on the blog Cornucopia (in Swedish) about the correlation between the percentage of votes for SD and the number of asylum seekers in the municipality.

Lars Wideräng claims that there is a strong positive correlation between the number of asylum seekers per capita in a municipality and the SD voting percentage, and a negative correlation with the number of M votes. He states as one possibility that in municipalities with a high number of asylum seekers, voters flee from M over to SD because of this. However, I think this is oversimplifying the situation.

One claim that he makes is that municipalities with right-wing governments accept fewer asylum seekers. Looking at this density plot, this appears to be correct.

When it comes to SD, the Pearson correlation between their votes and the number of asylum seekers is 0.32, and the correlation is in itself statistically significant. Calling this a strong correlation is a bit of a reach, though. See e.g. this link.

Let's go on by looking at the correlation between the number of votes in the municipal elections and 24 different variables.

Both axes in the figure above have been ordered by hierarchical clustering, which makes it easier to discern the patterns. What immediately sticks out is a clear block consisting of Moderaterna (M), the Liberals (FP), MP and Feminist Initiative (FI) – not so surprising since we have already seen this in the first principal component of the PCA above. A relatively higher proportion of votes for these parties seems to correlate with factors indicating larger cities, such as higher median income, higher education levels, higher population and urbanity, lower degrees of unemployment etc. Conversely, the Social Democrats (S) and the Left Party (V) are more popular in less well-off regions.

Finally, to evaluate Wideräng's claim we can turn our eyes to the SD row. First we can conclude that the municipalities with a relatively higher proportion of SD votes are similar to the municipalities with stronger support for S and V. The data referred to at Cornucopia about the asylum seekers is encoded as “asylum.seekers.per.1000.inh” in the figure. Looking at this data, it does not appear to me that it is a very good explanation for the rise of SD. It rather appears that the general education level in the municipality, as shown by the “hasEducation” and “youngUnskilled” columns, provides a stronger explanation. One might speculate that those with higher levels of education have to some degree been vaccinated against the hateful message of SD by their knowledge. Another way to explain this is that the education level is expected to be higher in the larger cities than in rural communities, and that this is what shows up.

Let's investigate the claim further by plotting the number of asylum seekers in the municipality against the number of parliament votes for SD.

This figure shows us two things. Firstly, yes, there appears to be a positive correlation between the number of asylum seekers and the number of votes for SD. Secondly, it shows that while the correlation is statistically significant it does not account for a great deal of the variation. In fact the R-squared value is only 0.1 – meaning that only 10% of the total variation is explained by the linear model.

Taken together, I'd hold this as fairly strong evidence that there is more to this issue than the view taken by Wideräng. However, as opposed to Wideräng, I think it's important to remember that correlation does not imply causation, and that none of the results provided here constitute solid evidence for the true cause of the success of SD.

Speculating based on this data, I'd say that the support for SD appears to be related to a higher degree to the difference between cities and rural areas – which appears to be the primary axis along which Swedish politics runs.

Using IBM Watson to build a podcast annotator

Some time ago now – back in October – I joined Alvaro Martinez Barrio and Mikael Huss in the Uppsala Hackup Hackathon. We formed the formidable team the “hungry aardvarks” (a name graciously provided by the organizers which was too cool to change). The goal of the hackathon was to play around with the IBM Watson APIs, and it was tons of fun. Since then I've been keeping busy with other things – but now I've finally gotten around to cleaning the code up a bit and putting it on GitHub.

While early on we were trying to come up with some idea for something to do in bioinformatics using Watson, we realized that most of the APIs available lent themselves better to doing something based on machine understanding of natural language. Mikael came up with the idea of an automatic podcast annotator. This sounded like a really cool project and something that we could hack together in a day – so that's what we decided to do.

You can find the code here:

https://github.com/johandahlberg/hungry_aardvarks

Our idea and execution took us to the final – but in the end the win was snatched by the team building a speech-enabled home automation system on a Raspberry Pi. Super cool project – well worth the victory!

Mikael has already written about Hackup here. Go and check it out! He writes more about our process and what the other teams at the hackathon were up to.

Finally, lots of kudos to the organizer of the hackathon – Jason Dainter – and to IBM for building such a cool system and making it easy to use.

Understanding the Rwandan genocide of 1994 through data from the Uppsala Conflict Data Program

In this post I'm attempting to understand one of the most horrific events in recent history through the use of data – the Rwandan genocide of 1994. The Uppsala Conflict Data Program provides data on armed conflicts around the world. For this post I've specifically used the UCDP GED 2.0 dataset. It contains 89033 data points, each detailing an event of organized violence in Africa or Asia in the time period 1989 to 2014. It has data on the actors of the conflict, high, low and best estimates of the number of deaths, the location of events, and much more. It's important to note that the UCDP is, as far as I understand, quite conservative in what is included in the dataset, so it will not necessarily include all known events of organized violence.

Firstly I need to caveat this by saying that I have no expertise whatsoever in conflict studies; I simply found the dataset and thought that it would be interesting to explore it a bit. What struck me when I started to poke around, and what inspired me to write this blog post, was how shocked I was by the extreme extent of the Rwandan genocide. While I knew about it, the exact scale had not really become clear to me until I looked at this data.

I began my analysis by looking at the distribution of people killed per event. This is illustrated in the histogram below. Note the log10 scale of the x-axis. This tells us that the vast majority of events have relatively few casualties. However, it also shows that there is a wide range – with a single event in fact having 316744 casualties – and the dataset dryly describing this event as “Government of Rwanda – Civilians”.

Moving on from there I looked at the number of deaths per year. While there is a downward trend (that is however not statistically significant), it once again highlights the events in Rwanda in 1994 as an extreme outlier. According to this data 517764 people were killed in Rwanda in 1994 – which concurs with the estimates provided at Wikipedia of 500000 to 1000000 deaths.

While this post adds nothing new to the understanding of this subject, I hope it shows that even simplistic data analysis can be a powerful tool for understanding something. For me personally the figures provided here tell a nauseating story of extreme human tragedy. I wish I could say that I was sure that this was the last time the world looked on while something like this happened. Sadly I'm not so sure.

An R markdown version of this post, including the code used in the analysis, is available here: https://gist.github.com/johandahlberg/41ad32bc02279bc06b6d

Finally thanks to Henrik Persson and Sara Engström for fact-checking and proof-reading.

What is a core hour?

Having worked with setting up so-called best practice pipelines for different genomics analyses, one question that I've heard a lot is: how long does it take to run? At this point I like to give an answer along the lines of “it takes 40 hours and 640 core-hours”, and I'd argue that the latter is the more interesting of the two. Why is that? I'd like to try to explain this in layman's terms so that I myself (and possibly others) can point here for an explanation in the future.

For some cases what you want to look at is the actual run time (or the “wall time” as it's sometimes called). For example, if you're setting up a clinical diagnostics pipeline this number might matter a great deal, as it means that you can deliver a result to the doctor quicker. However, for many cases in research this is not the most interesting thing to look at, assuming that you don't have infinite resources.

Taking a step back and explaining this in a bit more detail one first needs to understand that a computer today typically has more than one processor (also known as a core), and many programs will make use of multiple cores at the same time. This gives a simple relation between normal hours and core hours:

core hours = number of hours passed * number of cores used
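Using the numbers from the example above, and assuming the pipeline ran on 16 cores for its entire run time, this works out as:

640 core hours = 40 hours * 16 cores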

However, few programs scale linearly with the number of cores, meaning that you will end up in a situation of diminishing returns.

A computer cluster (such as is used by many genomics researchers) consists of multiple such machines on which you can run your programs. Typically you will book a number of cores on such a cluster to run your program, and you will be charged for the number of cores you booked, as if you were using them at 100 percent throughout the entire run time of the program.

So unless you have infinite resources you want to use those hours as efficiently as possible, right? Looking at the core hours used rather than the actual run time is one way to make sure that you squeeze out as much as possible from the resources you have. Maybe you have to wait a little longer, but as you move into, for example, scaling to handle the analysis of thousands of whole human genomes, core hours become an important factor.

If you didn’t get what those computer folks meant when they were talking about core hours, now you know.

Thoughts on BOSC 2015

I won't be making any detailed summary of the Bioinformatics Open Source Conference (BOSC) 2015 in Dublin since there is already an excellent one available here: https://smallchangebio.wordpress.com/2015/07/11/bosc2015day2b/ However, I thought I'd write down some thoughts on the trends that I think I saw.

Before going into the technical stuff I'd just like to add that the picture above was taken at the panel on diversity in the bioinformatics open source community. It was nice seeing the issue addressed, as it is as important a challenge as any technical one we might currently be facing as a community. This, in addition to the cards used to take questions for speakers (instead of the usual stand-up-and-talk way), shows that the BOSC organizers are willing to take this on. Kudos to them for doing so!

Workflows, workflows, workflows….

It's clear that handling workflows is still an unsolved problem. Having spent considerable time and effort setting up and managing pipelines myself, I truly applaud the ongoing efforts to make things a bit more standardized and interoperable using the Common Workflow Language (CWL). If in the future it actually becomes possible to download and run somebody's pipeline on more than one platform, that would be truly amazing.

There still seems to be some confusion about the exact nature of CWL and what it aims to do. My understanding is that it will provide a specification consisting of tool definitions and workflow descriptions that platform developers can implement in order to make it possible to migrate these between platforms. As of yet it seems to be somewhat lacking on the implementation side of things (which is to be expected since, if I understand things correctly, it was announced to the public at this BOSC). I really hope that things will take off on the implementation front, and once they do I want to try my hand at translating some of the things that we have set up into CWL.

In my wildest dreams the CWL could also serve as a starting point for building a community which could collaborate on and provide other things, such as:

  • tool repositories (like a Docker Hub for bioinformatics tools), providing containers and tool definitions.
  • collaborative workflow repositories (that are actually possible to deploy outside their exact environment – no more re-implementing the GATK best practice pipeline yet another time).
  • reference data repositories – something that could be to bioinformatics what Maven Central is to Java. A single place from which e.g. reference genomes could be downloaded automatically based on a configuration file. (While writing this I realized that I’d seen something similar to what I was describing: Cosmid – so folks, let’s just go and adopt this now!)

Docker everywhere

Docker was mentioned so many times that eventually it became a joke. It does seem to provide a convenient solution to the software packaging problem – however my own limited experience with Docker tells me that the process of using it would have to be simplified in order to make it adoptable outside of a limited group.

What’s needed is something which allows you to access the tool you want to use with a minimum overhead. Something like:

run-in-docker <my-favorite-tool> <args>

Until this is possible I don't think we are going to see widespread adoption outside of platform solutions like Arvados, Galaxy, etc. I also guess there are issues of security that would need to be resolved before sysadmins at HPC centers are willing to install it.

Wrapping up

Going to BOSC was a rewarding experience and an excellent opportunity to get a feel for where the community is heading. A warm thanks to the organizers as well as all the speakers.


Codefest 2015

I had the great pleasure of attending the Codefest 2015 in Dublin. Not only did I get to meet some really nice people but I also got some work done. Below are the projects that I worked on and some general notes.

TEQCviewer
https://github.com/marcou/TEQCviewer

On the first day I joined Mirjam Rehr and Maciej Pajak in working on a Shiny app to visualize targeted sequencing data in an interactive way. Not only was this a nice opportunity to brush up on R and Shiny, but I was also introduced to something new and interesting.

I'd never seen the R package Packrat before – it's used to create a set of local dependencies for an R project, thus making sure that updating a dependency in one project does not break other projects that use an older version of the same library. It also bootstraps all the dependencies into the project when you load it, which is an added benefit.

contAdamination
https://github.com/johandahlberg/contAdamination

I've been wanting to try out ADAM for some time now, but I hadn't found the time for it. The Codefest felt like an excellent opportunity to get into something new, so that's what I did on day two. After a short brainstorming session, Roman Valls Guimerà and I decided to see if we could port (and possibly improve upon) the idea used in FACS of using Bloom filters to find contamination in short-read data.

We got off to a good start and got a lot of fun hacking done. We even got some remote help, both with pull requests and advice from Michael L Heuer and Christian Pérez-Llamas, which was awesome!

Not only did I learn a lot during the day, but it also got me excited to keep working on this and see if we can actually make a real product out of it. We'll see how that goes, but regardless, getting it started at the Codefest was super exciting.

Other things of note

A large part of Codefest 2015 was dedicated to the Common Workflow Language – and it seems that some progress was made in the hackathon. I think it's going to be very interesting to see where this project goes in the future. If it keeps getting traction in the community I think that it could actually be the solution to the N + 1 pipeline-system problem in bioinformatics. Wouldn't that be awesome?

Finally, if you’re in any way interested in open source coding for bioinformatics I’d absolutely recommend going to the Codefest. Hope to see you there in the future!

P.S. Robin Ander also did a write-up of the Codefest here – check it out! (Even later edit:) Guillermo Carrasco also wrote about the Codefest here.

Whole human genome data released under Creative Commons licence

One of the things that I’ve been working a lot on over the last year is setting up pipelines to analyze whole genome sequencing data from human samples. This work is now coming to fruition and one part of that is that we (at the SNP&SEQ Technology Platform) have now released data for our users and others to see. It’s still a work in progress, but most of the pieces are in place at this stage.

The data is being released under a Creative Commons Attribution-NonCommercial 4.0 International License, so as long as you attribute the work to the SNP&SEQ Technology Platform you can use it for non-commercial purposes. You'll find the data here:

https://export.uppmax.uu.se/a2009002/opendata/HiSeqX_CEPH/

Being a fan of open science, working for an employer that will release data for the benefit of the community makes me jump with joy!

P.S. If you'd like to have a look at the code that makes it all happen, check out the National Genomics Infrastructure GitHub repo.

It’s already been done…

Not that long ago I wrote about the use of CRISPR-Cas9 systems and the ongoing ethical debate on their use in humans. As it turns out, this future is even closer than I might initially have thought. This week a Chinese group published a paper in which they used these systems to do genome editing in non-viable human embryos.

The original paper is available here: http://link.springer.com/article/10.1007%2Fs13238-015-0153-5 A Nature News piece on the topic can be found here: http://www.nature.com/news/chinese-scientists-genetically-modify-human-embryos-1.17378 Finally, the findings were also covered in the Swedish mainstream press (in Swedish) here: http://www.dn.se/ekonomi/forsta-manskliga-genmanipulationen-genomford/

The study itself draws attention to some problems related to genome editing using these techniques. The poor success rate (28 out of 86 embryos tested were successfully spliced) and problems with off-target mutations hinder the immediate clinical use of the technique. These are, however, problems that I'm sure will be addressed by technology development.

More importantly, I think this highlights the importance of establishing ethical frameworks for the use of gene editing. Interestingly, the paper was rejected by both Nature and Science “in part because of ethical objections” – it remains to be seen if such objections will hold in the future.

Personally I'm as yet undecided on the morality of carrying out gene editing in embryos. While the promise of cures for heritable diseases is wonderful, it's easy to see a slippery slope from there into more dubious uses. It also raises the question of what is to be considered a disease or disability.

Today at the SciLifeLab Large Scale Human Genome Sequencing Symposium, Dr. Anna Lindstrand spoke about a study indicating CTNND2 as a candidate gene for reading problems. Are such mild disabilities to be considered for gene editing? Maybe not – but where do we draw the line? Once the technology is available I have little doubt that some will want to use it to produce “genetically enhanced” humans.

The Nature News article referenced above ends with the somewhat ominous quote: “A Chinese source familiar with developments in the field said that at least four groups in China are pursuing gene editing in human embryos.” We are certainly going to hear more on this topic in the future.