Genomics data – distributed and searchable by content

MinHashes

Recently there has been a lot of interesting activity around MinHash sketches on Titus Brown's blog (1, 2, 3).

To quickly recap, a MinHash sketch is a data structure which allows approximate set comparisons in constant memory. In essence it stores a number of outputs from hash functions, e.g. 1000 such hashes, and then compares two sets by comparing the overlap of the hashes rather than the overlap of the original sets.

This means that you can calculate a distance between two genomes or sets of sequencing reads by computing a MinHash sketch from the sequence k-mers. For a far better description of MinHash sketches than mine I recommend reading the Mash paper, which has a very nice and clear introduction to the concepts.
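To make this concrete, here is a minimal sketch in Scala (my own illustration, not how Mash or other real implementations work): it keeps the smallest k-mer hash values of each sequence as the sketch, and estimates the Jaccard similarity of two sequences from the smallest hashes of the union of the two sketches.

object MinHashExample {

  // Hash each k-mer of a sequence; String.hashCode is a stand-in for the
  // proper hash functions used by real implementations such as Mash.
  def sketch(sequence: String, k: Int = 21, size: Int = 1000): Set[Int] =
    sequence.sliding(k).map(_.hashCode).toSeq.distinct.sorted.take(size).toSet

  // Mash-style estimate of the Jaccard similarity: take the smallest hashes
  // of the union of the two sketches and count how many appear in both.
  def similarity(a: Set[Int], b: Set[Int], size: Int = 1000): Double = {
    val bottomOfUnion = (a ++ b).toSeq.sorted.take(size)
    bottomOfUnion.count(h => a.contains(h) && b.contains(h)).toDouble / bottomOfUnion.size
  }

  def main(args: Array[String]): Unit = {
    val s1 = sketch("ACGTACGTGACTGACTGATCGATCGTACGATCGATCGAT")
    val s2 = sketch("ACGTACGTGACTGACTGATCGATCGTACGTTCGATCGAT")
    println(f"Estimated similarity: ${similarity(s1, s2)}%.2f")
  }
}

The important property is that the sketches have a small, fixed size, so two datasets can be compared without ever touching the original sequences again.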

Distributed data

The idea in Brown's blog posts that I found the most interesting is that of creating a database of the MinHashes, referencing datasets either with links to where they can be downloaded or, for non-public data, with contact information for the owner of the data. When I then came across the Interplanetary File System (IPFS) it struck me that combining these two technologies could make for an extremely interesting match.

Combining them, I imagine it would be possible to build a distributed database of sequencing data that can be searched by content. To make this efficient, some type of index would of course be needed, and once again this idea is explored by Brown in the form of sequence bloom trees.

Preferably the data itself could also be stored in IPFS. For fully public data this seems perfectly feasible; for private data it might be somewhat trickier. For that scenario one could imagine either having pointers to “normal” static resources, using private IPFS networks (a feature which appears to be on the roadmap for IPFS), or storing the data encrypted in the public network.

There are already some interesting developments in using IPFS with genomics data. The Cancer Gene Trust project is using IPFS to create a distributed database of cancer variants and clinical data. The project is called Cancer Gene Trust Daemon and is available on Github. Super interesting stuff, though as far as I can tell it is not directly searchable by content. Right now I’m not sure how well the MinHash sketch technique would extend to something like a variant file, but since it was developed for general document similarity evaluation it should not be impossible. Off the top of my head, I would think that the variant positions themselves could be used to compute the hashes.

Why a distributed data store?

I think that there are a number of reasons why this could be a worthwhile effort. Firstly there are obvious dangers with keeping scientific data in central repositories – what happens when funding for a project runs out or the PhD student leaves the lab? I’m fairly sure that there are many dead or dying scientific web services out there. Having a distributed system would mean that the data will (in theory) live indefinitely in the network.

Secondly, having a distributed system means bandwidth can be used far more efficiently, effectively creating a content delivery network for the data, where any actor who’d like to could set up a mirror of the data in the network, making sure that clients do not need to download the human genome from the other side of the world again and again when it is available on a colleague’s workstation.

Finally

Right now I’m mostly just excited about both of these technologies and thought I’d share my thoughts. I hope to be able to carve out some time to actually play around with these ideas sometime in the future.

As a final note, I have downloaded the human reference genome (GRCh38/hg38) and uploaded it to IPFS; it is available at this address: https://ipfs.io/ipfs/QmQ3gCx4WqaohRYbE8NFmE2Yc8xmARDPJ3jZNpQdfoWyKQ. If you run an IPFS node you might want to pin it. Enjoy!

Tutorial: Verifying messages from Facebook using Scala and the Play Framework

Background

I’ve been interested in chat bots for some time and when Facebook launched their API for building bots on their messenger platform I thought I’d play around with it. I have also wanted to look into using Scala with the Play Framework for building a web service, so I decided to combine the two.

Going about building a small messenger bot, I realized that there had to be some way to verify that the messages being posted to the webhook I’d set up actually came from Facebook – and I noticed that there is an “X-Hub-Signature” header in the request. While this is not documented in the API docs that I could find, it is documented here:

The HTTP request will contain an X-Hub-Signature header which contains the SHA1 signature of the request payload, using the app secret as the key, and prefixed with sha1=. Your callback endpoint can verify this signature to validate the integrity and origin of the payload
Please note that the calculation is made on the escaped unicode version of the payload, with lower case hex digits. If you just calculate against the decoded bytes, you will end up with a different signature. For example, the string äöå should be escaped to \u00e4\u00f6\u00e5.

Having spent some time figuring out how to set this up, I thought I’d share my results here.

Step 0: How Facebook posts messages to your webhook

Following the tutorial here you can see how to set up a webhook that Facebook will post any messages sent to your page to. What you then need to do is implement an http(s) endpoint that can handle json in the following format being posted to it:

{
  "object":"page",
  "entry":[
    {
      "id":PAGE_ID,
      "time":1466278528,
      "messaging":[
        {
          "sender":{
            "id":"USER_ID"
          },
          "recipient":{
            "id":"PAGE_ID"
          },
          "timestamp":1466278528,
          "message":{
            "mid":"mid.1567764493618:41d102a3e1ae210a38",
            "seq":79,
            "text":"hello, world!"
          }
        }
      ]
    }
  ]
}

Note that in reality the endpoint must be able to handle other types of messages as well, such as a user opting in, delivery confirmations, etc.

Step 1: Creating the model

While Play does allow you to work directly with json, I think it’s nice to translate it into some Scala classes. Using the implicit readers/formatters provided by the Play framework this is fairly simple.

Just translate the json structure to a set of Scala case classes with field names matching the keys in the json structure.

We will later use this to convert the incoming json to a Scala class.
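As a sketch of what such a model could look like (the class and field names are my own, chosen to match the keys in the json example above):

import play.api.libs.json.{Json, Reads}

// Case classes mirroring the json structure posted by Facebook.
// The names are chosen to match the keys in the example payload above.
case class Sender(id: String)
case class Recipient(id: String)
case class Message(mid: String, seq: Long, text: Option[String])
case class Messaging(sender: Sender,
                     recipient: Recipient,
                     timestamp: Long,
                     message: Option[Message])
case class Entry(id: String, time: Long, messaging: Seq[Messaging])
case class MessagingEvent(`object`: String, entry: Seq[Entry])

object MessagingEvent {
  // Implicit readers that let us validate incoming json into the model.
  implicit val senderReads: Reads[Sender] = Json.reads[Sender]
  implicit val recipientReads: Reads[Recipient] = Json.reads[Recipient]
  implicit val messageReads: Reads[Message] = Json.reads[Message]
  implicit val messagingReads: Reads[Messaging] = Json.reads[Messaging]
  implicit val entryReads: Reads[Entry] = Json.reads[Entry]
  implicit val eventReads: Reads[MessagingEvent] = Json.reads[MessagingEvent]
}

Using the Json.reads macro keeps the mapping in one place, and json.validate will later tell us exactly which fields failed to parse if a payload does not match the model.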

Step 2: Building a simple verification service

As the documentation states, to verify the request we will use a hash-based message authentication code. The quick version is that we are computing a hash of the message content together with a secret token that we share with Facebook. In this way, we can be sure that the message posted to the webhook actually came from them.

For this tutorial we will implement a very simple SHA1-based verification service. Please note that you should never keep the secret in the source code, but load it from configuration, an environment variable or some other approach which does not leave it exposed when you check your code into version control.
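As a sketch of what such a service could look like (the class name and constructor are my own), one can compute an HMAC-SHA1 over the payload bytes with javax.crypto and compare it to the header value:

import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

// Very simple verification service: computes an HMAC-SHA1 signature of the
// payload using the shared app secret, and compares it to the signature
// sent by Facebook in the "X-Hub-Signature" header.
class SignatureVerificationService(appSecret: String) {

  private def hmacSha1(payload: Array[Byte]): String = {
    val mac = Mac.getInstance("HmacSHA1")
    mac.init(new SecretKeySpec(appSecret.getBytes("UTF-8"), "HmacSHA1"))
    // Lower case hex digits, as the documentation specifies.
    mac.doFinal(payload).map("%02x".format(_)).mkString
  }

  // The header value is prefixed with "sha1=", so we prepend that here
  // and compare against the full header value.
  def isValid(payload: Array[Byte], signatureHeader: String): Boolean =
    ("sha1=" + hmacSha1(payload)) == signatureHeader
}

In a real setting a constant-time comparison, e.g. java.security.MessageDigest.isEqual, would be preferable to the plain string comparison used here.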

Now that we have a way to check the incoming bytes it’s time to create a function to verify the request.

Step 3: Constructing a wrapper function to verify requests

While this tutorial only covers verifying incoming messages (i.e. what Facebook calls receiving messages), there are other events as well that might trigger a callback sending data to the webhook you have registered. Therefore it would be neat to implement this as a function which we can then re-use for the other types of callbacks.

The trick here is that we have to parse the message body – and all the examples of stacking actions on top of each other in the Play documentation assume that only the last action will actually parse the request. This leads us to adopt a slightly different approach.

We will implement this by creating a function which takes a function as an argument (i.e. a higher-order function). In this case the function we accept as an argument is the action body, and it will only be carried out, i.e. executed, if we successfully validate the request.

Note how we parse the request body as bytes, convert it to a UTF-8 string (as per the specification), and verify that the signature that our verification service computes is the same as the one sent in the “X-Hub-Signature” header.

Since we parse the body as bytes here, we will have to convert it to something more useful, such as json, in the next part.

Here is a sketch of what the code could look like (the controller and method names below are my own, and it builds on the verification service from the previous step):
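import play.api.mvc._

// A higher-order function wrapping an action: the wrapped block is only
// executed if the "X-Hub-Signature" header matches the signature we compute
// over the raw request body.
class WebhookController(verificationService: SignatureVerificationService)
  extends Controller {

  def verifiedAction(block: String => Result): Action[RawBuffer] =
    Action(parse.raw) { request =>
      // Read the raw body as bytes before doing anything else with it.
      val bodyBytes: Array[Byte] =
        request.body.asBytes().map(_.toArray).getOrElse(Array.empty)
      request.headers.get("X-Hub-Signature") match {
        // Only run the wrapped block if the signature checks out.
        case Some(signature) if verificationService.isValid(bodyBytes, signature) =>
          block(new String(bodyBytes, "UTF-8"))
        case _ =>
          Forbidden("Invalid or missing X-Hub-Signature")
      }
    }
}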

Step 4: Receiving messages

Finally we will use what we’ve created above to receive a message, parse it into a Scala class, and respond with status 200 if we have successfully parsed the request.

Note that we convert the bytes to json here, and use the json.validate function to convert it into the models we created in step 1.
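A sketch of what this could look like, as a method in the same hypothetical controller as the wrapper above:

// Inside the hypothetical WebhookController from the previous step.
import play.api.libs.json.{JsError, JsSuccess, Json}

def receiveMessage: Action[RawBuffer] = verifiedAction { body =>
  Json.parse(body).validate[MessagingEvent] match {
    case JsSuccess(event, _) =>
      // Here we would actually act on the messages, e.g. reply to the sender.
      Ok
    case JsError(errors) =>
      BadRequest("Could not parse messaging event: " + errors.mkString(", "))
  }
}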

Wrapping up

Looking at this all together we get about 160 lines of code including imports and blank lines. Not too bad. There are probably more elegant ways to solve this problem, but this does seem to get the job done.

As a reward for following this tutorial to the end, here is the entire source in one gist.

Rural to urban – the primary axis of Swedish politics?

To begin with I’d like to apologize for the mix of languages in this post. The data sources for this post are in Swedish, and I’ve not taken the time to translate them. However, as I expect that this is mostly of interest to Sweden-based readers anyway, I don’t think this will be too much of a problem.

As Mikael remarks in his original post, standard PCA is not suited to analysing compositional data (like the voting data, where the voting percentages need to add up to one for each municipality). This meant that I wanted to look into using robust principal component analysis for compositional data (available from the “robCompositions” R package) to see how that would influence the results.

Let's begin by having a look at the score plots overlaid with some information about the municipalities.

The two figures above show two similar views of the Swedish political landscape. There is a clear trend from rural to urban, with the left-right political spectrum following along approximately the same lines.

A great deal of the variability in the data (77%) is explained in these two principal components. So let’s look at their compositions:

Looking at the loadings of the first principal component, we should expect to see municipalities with a relatively higher proportion of votes for the traditionally rural Center Party (C) on the left. On this side we also expect to see municipalities with a high percentage of votes for the right-wing populist/xenophobic party, the Sweden Democrats (SD), while on the right we see municipalities with a high percentage of votes for the Liberals (FP) and the Green Party (MP). My interpretation is that this lends strength to the hypothesis that the x-axis shows a separation from rural to urban.

Interestingly, the second principal component seems to show a separation along traditional Swedish political lines, with the right in the bottom half of the figure and the left in the top part.

To get a feel for this, let’s add the names of the municipalities to the PCA.

While this figure is something of a mess, we can clearly see the names of the outliers, and they do seem to make sense. Gällivare and Jokkmokk in the north of Sweden have traditionally leaned strongly towards the left, while Danderyd outside of Stockholm is well known as a stronghold of the Swedish right. On the left we see rural municipalities like Ragunda and Strömsund, and on the right we see the larger cities of Sweden, like Stockholm and Göteborg.

All in all, it seems that my results here concur with Mikael's.

Mikael then goes on to do random forest classification to see if certain key parameters of the different municipalities can predict the voting outcome. I decided to take a more straightforward route here. I think that this is especially interesting due to the statements made in the blog Cornucopia (in Swedish) about the correlation between the percentage of votes for SD and the number of asylum seekers in the municipality.

Lars Wideräng claims that there is a strong positive correlation between the number of asylum seekers per capita in a municipality and the SD voting percentage, and a negative correlation with the number of M votes. He states as one possibility that in municipalities with a high number of asylum seekers, voters flee from M to SD because of this. However, I think that this is oversimplifying the situation.

One claim that he makes is that municipalities with right-wing governments accept fewer asylum seekers. Looking at this density plot, this appears to be correct.