I recently read an interesting research paper by Johan Bollen and Huina Mao of Indiana University and Xiao-Jun Zeng of the University of Manchester, entitled “Twitter mood predicts the stock market,” that investigated whether “collective mood states derived from large-scale Twitter feeds” correlate with the value of the Dow Jones Industrial Average. They found that their algorithm not only paralleled market changes but predicted the daily up and down movements of the closing value, with a startling 87.6 percent accuracy!
As a provider of Big Data analytics software, we see this type and scale of problem all the time at our customer sites, particularly the correlation of structured and unstructured data. For this particular study, let’s see how easy it is to reproduce this analysis with Datameer Analytics Solution (DAS).
First, let’s download the Dow Jones stock values data. You can get this for free, for example from Yahoo (DJIA). The file is in a simple CSV format showing daily prices. You can also download other data, such as the NYSE Composite index, to experiment with.
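For readers who want to follow along outside DAS, here is a minimal Python sketch of reading such a daily-price CSV. The column names (Date, Open, High, Low, Close, Volume) and the sample values are assumptions for illustration; check the header of the file you actually download.

```python
import csv
import io

# Hypothetical sample in the layout of a daily-price CSV export.
# Column names and values are made up for illustration.
SAMPLE_CSV = """Date,Open,High,Low,Close,Volume
2010-03-02,100.0,102.0,99.0,101.5,1000
2010-03-01,99.0,100.5,98.5,100.0,1200
"""

def load_closes(csv_text):
    """Return {date_string: closing_price} parsed from daily-price CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Date"]: float(row["Close"]) for row in reader}

closes = load_closes(SAMPLE_CSV)
for day in sorted(closes):
    print(day, closes[day])
```

Keying the prices by date makes the later join with the daily mood counts a simple lookup.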
Second, let’s get some Twitter data from the Twitter API feed known as the “fire hose”. For this test, we’ll use raw data (i.e., unfiltered tweets) for the entire month of March 2010.
Let’s load all of this data into DAS. In our new 1.3.x version, you can simply upload a file from your local computer, so let’s load our Dow Jones data this way:

Then let’s load the tweets, via an Import Job, which understands Twitter’s format natively:
This amounts to about 30 GB of compressed data for the month.
Let’s first try to get a more accurate data set by filtering the tweets to US users. This is something that the researchers apparently did not do (“we note that our analysis is not designed to be limited to any particular geographical location”), but it is easy to do with DAS.
We had neither OpinionFinder nor Google-Profile of Mood States at our disposal to perform sentiment analysis (these could make great new functions that could be added via our API some day!), so let’s use a simplified approach instead: take a list of positive terms (a bag-of-words model) and find the tweets that contain these terms.
To do this in DAS, let’s import a list of such terms (such lists are easy to find on the web), create an outer join with our tweets, and then filter to find the tweets that contain these positive words.
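The same bag-of-words idea can be sketched in a few lines of Python. This is not how DAS implements the join and filter; it is only an illustration of the matching logic, and the term list and sample tweets are hypothetical.

```python
import re

# Hypothetical, tiny positive-term list; a real list would be much longer.
POSITIVE_TERMS = {"happy", "great", "love", "awesome"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def is_positive(text, terms=POSITIVE_TERMS):
    """True if at least one token of the tweet is in the positive-term set."""
    return any(token in terms for token in tokenize(text))

# Made-up sample tweets.
tweets = [
    "I love this sunny morning",
    "Stuck in traffic again",
    "What a GREAT game!",
]
positive = [t for t in tweets if is_positive(t)]
print(positive)
```

Matching on whole tokens rather than substrings avoids counting a word like “glove” as a hit for “love”.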

In DAS 1.3.x you can filter with a complex expression directly in the ‘Advanced’ tab:

Now let’s count the positive tweets per day. This is just an aggregation sheet that uses GROUPBY and a count (the result below is the sheet preview, not yet the actual count on the full data set):

This count represents the level of “happiness” mood per day.
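As a rough equivalent of that group-and-count step, here is a small Python sketch. The timestamp format and the sample rows are assumptions; real tweet timestamps would need their own parsing.

```python
from collections import Counter
from datetime import datetime

# Hypothetical (timestamp, text) rows, already filtered to positive tweets.
positive_tweets = [
    ("2010-03-01 08:15:00", "love it"),
    ("2010-03-01 21:40:00", "great day"),
    ("2010-03-02 10:05:00", "so happy"),
]

def count_per_day(rows):
    """Group rows by calendar day and count them: {iso_date: count}."""
    counts = Counter()
    for timestamp, _text in rows:
        day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").date()
        counts[day.isoformat()] += 1
    return dict(counts)

happiness_by_day = count_per_day(positive_tweets)
print(happiness_by_day)
```

Note that a raw count grows with overall tweet volume, so dividing by the total number of tweets per day would be a natural refinement.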
Next, let’s create a new workbook to join the resulting worksheet of “happiness” mood per day with our Dow Jones Industrial Average (ticker ^DJIA) data:

A very helpful feature of DAS is that we can seamlessly exchange the DJIA history with, say, the NYSE Composite index history via the ‘Exchange Datasource’ button and rerun the workbook to test the correlation with data from other exchanges. This requires no further changes or additional work (more details on this later).
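To make the join and the idea of swapping the price source concrete, here is a minimal Python sketch. It assumes both inputs are dictionaries keyed by ISO date strings; all names and numbers are hypothetical and do not reflect DAS internals.

```python
def join_on_date(mood_by_day, close_by_day):
    """Inner join of two {date: value} mappings.

    Returns (date, mood, close) rows sorted by date; days missing from
    either side, such as weekends without trading, are dropped.
    """
    shared_days = sorted(set(mood_by_day) & set(close_by_day))
    return [(day, mood_by_day[day], close_by_day[day]) for day in shared_days]

# Made-up values for illustration only.
mood = {"2010-03-05": 4, "2010-03-06": 9, "2010-03-08": 6}
djia = {"2010-03-05": 101.0, "2010-03-08": 102.5}
nyse = {"2010-03-05": 70.0, "2010-03-08": 70.8}

djia_rows = join_on_date(mood, djia)
nyse_rows = join_on_date(mood, nyse)  # same analysis, different price source
print(djia_rows)
print(nyse_rows)
```

Because the join only depends on the date key, swapping one index for another changes the input and nothing else.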
Here is the resulting sheet of our join:

As you may know, building analyses in DAS works on a sample of the entire data set, which enables users to easily interact with the data until they’re satisfied with the analysis. Now that we are happy with our analysis, let’s run the workbook on the entire data set.
Now let’s graph the tweet “happiness” mood and the DJIA market closing value over the same days and compare:


As the researchers pointed out, we can see a correlation between our Twitter “happiness” index and how the Dow Jones Industrial Average moved up or down between two and six days later. First, note the progressive, parallel mood upswing at (1). Then comes the drop at (2), where a drop in Twitter mood is followed by a drop in the DJIA value. At (3), on March 19, Twitter mood rises quickly and the DJIA value follows. Finally, at (4), a drop in mood on March 22 is followed by a corresponding drop in the DJIA value a few days later. A similar correlation can be found by using NYSE data instead.
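The comparison above is visual. One way to put a number on a “mood leads the market by a few days” relationship is a lagged Pearson correlation, sketched below in Python. This is a swapped-in illustration, not the method used in the paper or in DAS; the lag is counted in rows of the joined sheet, and the series are synthetic, built so that the close follows the mood by exactly two rows.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def lagged_correlation(mood, close, lag):
    """Correlate mood at row t with close at row t + lag (lag >= 0)."""
    if len(mood) != len(close):
        raise ValueError("series must have the same length")
    if lag < 0 or lag > len(mood) - 2:
        raise ValueError("lag leaves fewer than two overlapping rows")
    return pearson(mood[: len(mood) - lag], close[lag:])

# Synthetic series: close[t + 2] == 10 * mood[t], so the best lag is 2.
mood = [1, 3, 2, 5, 4, 6, 3, 7]
close = [0, 0, 10, 30, 20, 50, 40, 60]

best_lag = max(range(4), key=lambda lag: lagged_correlation(mood, close, lag))
print(best_lag)
```

With only a month of data, any such number should be read with the same caution as the chart.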
Disclaimer: this analysis was done on only a month’s worth of data, but it could very easily be expanded to more data with no further changes to the analysis. We also used a very simple sentiment-tracking technique, which could be further improved. Finally, due to the small amount of data, we did not have outliers like the ones the researchers saw (“significant socio-cultural events such as the Presidential election and Thanksgiving, short-lived uptick in positive sentiment specific to those days”), but we could easily have filtered them out had we worked with more data.
Pretty easy, wasn’t it? That simplicity, combining all kinds of data ad hoc while harnessing the power and scalability of Hadoop to extract insights, is what Datameer is all about. I hope you’ve enjoyed this post. You can learn more about Datameer at www.datameer.com/products.
















