Blog Archives

Predicting the stock market with Datameer

I recently read an interesting research paper by Johan Bollen and Huina Mao of Indiana University and Xiao-Jun Zeng of the University of Manchester, entitled “Twitter mood predicts the stock market,” that investigated whether “collective mood states derived from large-scale Twitter feeds” correlated with the value of the Dow Jones Industrial Average. What they found was that their algorithm not only paralleled market changes but predicted them: it called the daily up and down moves in the DJIA’s closing value with a startling 87.6 percent accuracy!

As a provider of Big Data analytics software, we see this type and scale of problem all the time at our customer sites, particularly the correlation of structured and unstructured data.  For this particular study, let’s see how easy it is to reproduce this analysis with Datameer Analytics Solution (DAS).

First, let’s download the Dow Jones stock values data. You can get this freely, from Yahoo for example (DJIA). This is a simple CSV file format showing daily prices. You can also download other data, such as the NYSE Composite index, to experiment with.

Second, let’s get some Twitter data from their API, known as the “fire hose”.  For this test, we’ll use raw data (i.e. unfiltered tweets) for the entire month of March 2010.

Let’s load all of this data into DAS.  In our new 1.3.x version, you can simply upload a file from your local computer, so let’s load our Dow Jones data this way:

Upload file

Then let’s load the tweets, via an Import Job, which understands Twitter’s format natively:

This amounts to about 30 GB of compressed data for the month.

Let’s first try to get a more accurate data set by filtering the tweets to US users. This is something that the researchers apparently did not do (“we note that our analysis is not designed to be limited to any particular geographical location”), but it is easy to do with DAS.

We had neither OpinionFinder nor Google-Profile of Mood States (GPOMS) at our disposal to perform sentiment analysis (these could make great new functions to add via our API some day!), so let’s instead use a simplified approach: take a list of positive terms (a bag-of-words model) and find the tweets that contain them.

To do this in DAS, let’s import a list of such terms (these are easy to find on various web sites), create an outer join with our tweets, and then filter to find the tweets that contain these positive words.

In DAS 1.3.x you can filter with a complex expression directly in the ‘Advanced’ tab:

Now let’s count the positive tweets per day. This is just an aggregation sheet using GROUPBY and counting (the result below is the sheet preview, not yet the actual count on the full data set):

This represents the amount of “happiness” mood by day.
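For readers who want to see the logic outside of a spreadsheet, here is a minimal Python sketch of the same idea: keep the tweets that contain at least one positive term, then count them per day. It is not how DAS does it internally. The term list, the function names and the sample tweets are all made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical positive-term list; a real list would be far longer.
POSITIVE_TERMS = {"happy", "great", "love", "awesome"}


def is_positive(text):
    """Return True if the tweet contains at least one positive term."""
    # Lower-case each word and strip surrounding punctuation before matching.
    words = {w.strip(".,!?:;\"'()").lower() for w in text.split()}
    return not words.isdisjoint(POSITIVE_TERMS)


def positive_tweets_per_day(tweets):
    """Count positive tweets per calendar day.

    `tweets` is an iterable of (created_at, text) pairs,
    where created_at is a datetime.
    """
    counts = Counter()
    for created_at, text in tweets:
        if is_positive(text):
            counts[created_at.date()] += 1
    return dict(counts)


# Invented sample tweets, for illustration only.
sample = [
    (datetime(2010, 3, 1, 9, 30), "What a great morning!"),
    (datetime(2010, 3, 1, 14, 5), "Stuck in traffic again"),
    (datetime(2010, 3, 2, 8, 0), "I love this song"),
    (datetime(2010, 3, 2, 19, 45), "Happy hour with friends"),
]
daily = positive_tweets_per_day(sample)
```

A plain word match like this misses negation (“not happy”) and sarcasm, which is one reason this technique is only a rough stand-in for the tools used in the paper.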

Next, let’s create a new workbook to join the resulting worksheet of “happiness” mood per day with our Dow Jones Industrial Average (ticker ^DJIA) data:

A very helpful feature in DAS is the fact that we can seamlessly exchange the DJIA history with, say, NYSE Composite index history via the ‘Exchange Datasource’ button and rerun the workbook to test the correlation with data from other exchanges. This requires no further changes or additional work (more details on this later).

Here is the resulting sheet of our join:

As you may know, building analyses in DAS works on a sample of the entire data set, which enables users to easily interact with the data until they’re satisfied with the analysis.   Now that we are happy with our analysis, let’s run the workbook on the entire data set.

Now let’s graph the tweet “happiness” mood and the DJIA market closing value over the same days and compare:

As the researchers pointed out, we can see a correlation between our Twitter “happiness” index and how the Dow Jones Industrial Average moved up or down between two and six days later. First, see the progressive parallel mood upswing (1); then the drop at (2), where a drop in Twitter mood is followed by a drop in the DJIA value; then an upswing again on March 19 at (3), where Twitter mood rises quickly, followed by the DJIA value; and finally a drop on March 22, followed by the same drop in the DJIA value a few days later (see (4)). A similar correlation can be found by using NYSE data instead.
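Eyeballing a chart is one thing; if you want a number, one simple option is to shift one series against the other and compute a Pearson correlation at each lag. The sketch below is my own illustration rather than the paper’s method (the researchers used more elaborate techniques), and the two series are invented so that the index follows the mood by exactly two days.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: this divides by zero if either series is constant.
    return cov / sqrt(var_x * var_y)


def lagged_correlation(mood, index, lag):
    """Correlate mood on day t with the index value on day t + lag."""
    if len(mood) != len(index):
        raise ValueError("series must have the same length")
    if lag < 0 or lag > len(mood) - 2:
        raise ValueError("lag leaves fewer than two overlapping points")
    return pearson(mood[:len(mood) - lag], index[lag:])


# Invented data: the index is built to follow the mood two days later.
mood = [1, 3, 2, 5, 4, 6, 5, 8]
index = [100, 100, 110, 130, 120, 150, 140, 160]

by_lag = {lag: lagged_correlation(mood, index, lag) for lag in range(0, 5)}
best_lag = max(by_lag, key=by_lag.get)
```

With only a month of daily values, any such coefficient rests on very few points, so treat it as a hint rather than proof.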

Disclaimer: this analysis was done on only a month’s worth of data, but it could be expanded to more data very easily, with no further changes to the analysis. We also used a very simple sentiment-tracking technique, which could be improved further. Finally, because of the small amount of data, we did not see outliers like the ones the researchers did (“significant socio-cultural events such as the Presidential election and Thanksgiving, short-lived uptick in positive sentiment specific to those days”), but we could easily have filtered them out had we worked with more data.

Pretty easy, wasn’t it?  That simplicity, combining all kinds of data ad hoc while harnessing the power and scalability of Hadoop to extract insights, is what Datameer is all about.  I hope you’ve enjoyed this post, and you can learn more about Datameer at www.datameer.com/products.

Posted in How-to, Uncategorized | Leave a comment

Fishing the Clickstream…

 

Firstly, I’m excited to announce that there’s a major new release of DAS (1.3) available.  1.3 includes, among other things, some powerful tools to perform clickstream analysis in just a few simple steps, and makes visualization of user behavior a breeze.  I wanted to give you an overview of these new tools, and provide some food for thought on how simple it is to extract meaningful insights into visitor behavior from raw web logs, a common use case for DAS and Hadoop.

The goal here is to be able to scrape raw log files from your Apache or IIS web servers and visualize something like this:

This new visualization in DAS, called the “Circular Connection Graph,” tells us the relative density of one-hop clickpaths.  It’s an easy way to measure and visualize click-through rate (CTR) from various campaign landing pages, or to compare the popularity of referring web sites (i.e. marketing partners who drive traffic to your site). But this is just one small fish in the sea of weblogs (see what our customers say about the importance of behavioral analytics).
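To make “relative density of one-hop clickpaths” concrete, here is a small Python sketch that counts (referrer, page) pairs and turns them into shares of all hops. This is only my illustration of the underlying arithmetic, not the code behind the DAS graph, and the page names are invented.

```python
from collections import Counter


def one_hop_shares(hits):
    """Return each (referrer, page) hop's share of all one-hop clickpaths.

    `hits` is an iterable of (referrer, page) pairs taken from the web log.
    Hits with no referrer, and page refreshes, are ignored.
    """
    counts = Counter(
        (ref, page) for ref, page in hits if ref and ref != page
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {hop: n / total for hop, n in counts.items()}


# Invented sample hits, for illustration only.
hits = [
    ("/landing-a", "/signup"),
    ("/landing-a", "/signup"),
    ("/landing-b", "/signup"),
    ("/landing-a", "/pricing"),
    ("", "/landing-a"),          # direct visit, no referrer
    ("/pricing", "/pricing"),    # page refresh
]
shares = one_hop_shares(hits)
```

Here the hop from /landing-a to /signup accounts for half of all one-hop paths, which is the kind of weight a connection graph would draw as a thicker line.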

The real magic for Hadoop and DAS is that this data, when enriched with visitor profiles or other interaction data (think: MySQL, Oracle, Teradata, Twitter), can give you fine-grained, visitor-level insights previously out of reach.  Canned web traffic reports from a traditional application might only give you aggregated data; cloud-based analytics solutions might show you detail in the clickstream, but can’t correlate that behavior with the transaction systems of record that track the rest of the customer lifecycle, namely: purchases, balance history, call center interactions or in-store visits.  There’s more about that here.

Let me show you a bit about what I mean.  With standard web analytics packages, you can easily get answers to the basic questions of web behavior (including popular pages, session duration and clicks per session) with canned reports.  These are straightforward aggregations (roll-ups) which are easily done in DAS, and much easier than in raw Hadoop, where you’d write HiveQL, Pig or MapReduce code.

Here are a few examples of those (click the images if you’d like a larger view).


Thanks to the game-changing economics of Hadoop, you can always afford to save every click.  What does that mean?

1. Raw server logs can be fed into Hadoop, eliminating a separate ETL, modeling or pre-processing stage in the data pipeline.  With DAS, this requires zero coding.

2. Using DAS, key elements of user behavior (not just session stats, but page dwell time and click paths preferred by specific users) can easily be extracted and sliced on any dimension.  That provides insightful stats like what you see below. It could also mean dense visualizations like the one at the top of this post, which can serve up daily insights to the folks responsible for customer acquisition or marketeers managing campaigns.


DAS also gives you flexibility.   First, it separates the wheat from the chaff.  Filtering errors, image requests and page refreshes from the clickstream is simple.  Second, DAS lets you divide and conquer the data pipeline.  Data warehousing expertise can be applied to cleanse, enrich and pre-process the data (e.g. sessionizing traffic your own way, with any timeout), which can then be fed on a platter to the BI and marketing teams to create roll-ups, or to data scientists to look for clusters of visitors or develop predictive models. Finally, you can go wild and join this with anything you can throw at DAS: user profile, demographics, emails from your CRM, Twitter feeds, or last month’s blog post.  Sound like a fantasy?  All you need is a handful of spreadsheets and an imagination.  Click to zoom in on the screenshot below to get a taste.  Or wait for the video I’ll be posting soon.
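If “sessionizing traffic your own way, with any timeout” sounds abstract, here is roughly what it means, sketched in Python. The 30-minute default, the function name and the sample events are my assumptions for illustration; in DAS you would build this with spreadsheet functions rather than code.

```python
from datetime import datetime, timedelta


def sessionize(events, timeout=timedelta(minutes=30)):
    """Assign a per-visitor session number to each event.

    `events` is an iterable of (visitor_id, timestamp) pairs.
    A new session starts whenever the gap since the visitor's
    previous event is longer than `timeout`.
    Returns a list of (visitor_id, session_number, timestamp),
    ordered by visitor and then by time.
    """
    last_seen = {}
    session_no = {}
    result = []
    for visitor, ts in sorted(events, key=lambda e: (e[0], e[1])):
        previous = last_seen.get(visitor)
        if previous is None or ts - previous > timeout:
            session_no[visitor] = session_no.get(visitor, 0) + 1
        last_seen[visitor] = ts
        result.append((visitor, session_no[visitor], ts))
    return result


# Invented sample events, for illustration only.
events = [
    ("a", datetime(2011, 1, 10, 10, 0)),
    ("b", datetime(2011, 1, 10, 10, 5)),
    ("a", datetime(2011, 1, 10, 10, 10)),
    ("a", datetime(2011, 1, 10, 11, 0)),   # 50 minutes later: new session
]
sessions = sessionize(events)
```

Changing `timeout` is all it takes to try a stricter or looser definition of a visit, which is the flexibility the paragraph above is getting at.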

This is clearly a rudimentary example of clickstream analytics, but it’s a starting point that contains valuable nuggets of insight, and it’s easy to extend.  Most importantly, it makes this machine-generated data accessible.  And that’s what data science is all about.

Want to get started today? Contact us for a free trial download, VMware, or turnkey instance in the cloud.

Happy fishing!

Posted in Announcements, How-to, Uncategorized | Leave a comment

It’s About Time…

 

In this post, I’ll show you a thing or two about the powerful capabilities of DAS to perform time series analytics.

Most analysts swimming in today’s sea of unstructured data are poorly served, receiving only daily or weekly canned reports which provide a coarse, aggregated view of what’s happening within their data.  These reports lack the flexibility and granularity necessary to investigate data sets spanning multiple sources, combine structured and unstructured data, and examine them at different levels of detail.

Empowering users with a self-service approach, DAS allows you to be a real data detective.  DAS makes it easy to slice and dice time-sensitive information such as clickstream data, Twitter feeds, game events or even emails and discover hidden trends, whether they be long-term or short-lived.  Let’s take a look into the details of how DAS can take the raw materials and uncover something useful in just a few minutes.

Here’s what we’ll cover:

  • DAS and dates
  • Time series reports by day, hour, and beyond
  • Cleaning up dirty time/date information
  • Fine-grained analysis, down to the minute
  • Visualizing trends in DAS dashboards

Make a date with DAS

Firstly, DAS deals with date and time information natively.  Whatever the source, DAS takes your date/time info and gives you a bona fide date which can be manipulated using a calendar.  This is useful for filtering, allowing you to create different windows in different worksheets.  Here’s a screenshot of that.

Slicing and dicing

DAS provides a number of functions which extract bits and pieces of the date, so you can flexibly assemble slices and summarize them (daily click volume, average size of purchases at 4 am, etc.).  The functions MONTH(), YEAR(), DAY() and HOUR() are simple ways to grab date/time elements and then group the data just the way you like it.  You can also create complex slices based on multiple pieces, such as an hourly report over multiple days, or one that normalizes time info by time zone (if that’s in your data).  A picture is worth a thousand words, but there’s also a full list of date functions here.
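For comparison, here is what the “average size of purchases at 4 am” slice looks like as a short Python sketch: pull the hour out of each timestamp, then group and average. The function name and the purchase records are invented; in DAS the same thing is an HOUR() column plus a group-by.

```python
from collections import defaultdict
from datetime import datetime


def average_by_hour(records):
    """Average purchase amount per hour of day (0-23).

    `records` is an iterable of (timestamp, amount) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])  # hour -> [sum, count]
    for ts, amount in records:
        bucket = totals[ts.hour]
        bucket[0] += amount
        bucket[1] += 1
    return {hour: total / count for hour, (total, count) in totals.items()}


# Invented purchase records, for illustration only.
purchases = [
    (datetime(2011, 1, 3, 4, 15), 20.0),
    (datetime(2011, 1, 4, 4, 50), 30.0),
    (datetime(2011, 1, 3, 13, 0), 12.5),
]
avg = average_by_hour(purchases)
```

Grouping on `(ts.date(), ts.hour)` instead of `ts.hour` alone would give the hourly report over multiple days mentioned above.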

To get at some particulars of the date (like time zone), it’s necessary to use FORMATDATE(), and tell DAS the specific pattern you’re looking for.   There’s an example of that here, and a list of what you can do with that here.

Right about now

Sometimes you need an immediate understanding of what’s happening with your data (and perhaps need to take immediate action).  DAS lets you react based on the time your report is run.  The functions NOW() and TODAY(), and the ability to use offsets (e.g. +7 days, -6 hours) allow you to determine the freshness of the information you’re analyzing, or the proximity of events, and automate your response. For example, if I want to analyze only web traffic from the last twelve hours, I would do something like this.
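The same “last twelve hours” filter, written out as a Python sketch so the logic is explicit: take the report time, subtract the offset, and keep only the hits that fall inside the window. The function name and the sample hits are hypothetical, and the report time is passed in so the example gives the same result every time it runs.

```python
from datetime import datetime, timedelta


def within_last(events, hours, now=None):
    """Keep events whose timestamp falls in the last `hours` hours.

    `events` is an iterable of (timestamp, url) pairs.
    `now` defaults to the current time, i.e. the time the report runs.
    """
    if now is None:
        now = datetime.now()
    cutoff = now - timedelta(hours=hours)
    return [(ts, url) for ts, url in events if cutoff <= ts <= now]


# Invented sample hits and a fixed report time, for illustration only.
report_time = datetime(2011, 2, 14, 18, 0)
hits = [
    (datetime(2011, 2, 14, 17, 30), "/pricing"),
    (datetime(2011, 2, 14, 7, 15), "/signup"),
    (datetime(2011, 2, 14, 5, 45), "/home"),    # older than twelve hours
]
recent = within_last(hits, hours=12, now=report_time)
```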

When a date is not a date

Date/time information isn’t always well-prepared.  Quite often, dates are embedded within other data, as parts of URLs, JSON objects, or even large sections of unstructured text.  Fortunately, DAS lets you construct dates out of any raw text, regardless of format, using ASDATE().  Here’s an example.
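To give a feel for what pulling a date out of raw text involves, here is a Python sketch that finds a yyyy/mm/dd fragment inside a URL and turns it into a real date. The pattern, the function name and the URL are my own assumptions, and this says nothing about how ASDATE() itself is implemented.

```python
import re
from datetime import datetime

# Assumed pattern: a date embedded as yyyy/mm/dd, as in many blog URLs.
DATE_PATTERN = re.compile(r"\d{4}/\d{2}/\d{2}")


def extract_date(text):
    """Return the first yyyy/mm/dd date found in `text`, or None."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(0), "%Y/%m/%d")
    except ValueError:
        # Looked like a date but is not a valid one (e.g. month 13).
        return None


# Hypothetical URL, for illustration only.
found = extract_date("http://example.com/blog/2011/02/14/its-about-time")
```

Other layouts (dates inside JSON, or written out as text) would each need their own pattern and format string.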

Up to the minute

DAS will let you drill into tiny time windows, of any size, to identify irregularities in data that appear for only a few minutes or even seconds.  This can be necessary when you’re monitoring financial markets or feeds from social media sites, or just trying to understand system behavior (like erratic usage patterns or fluctuations in web traffic due to downtime).  While there are a number of ways to do this, the simplest is to put your data into bins (buckets) of the desired size.  To do this, DAS provides GROUPBYBIN(). But first, the date must be converted to a numeric value with TIMESTAMP().  As an example, I can group data into five-minute slices by writing GROUPBYBIN(TIMESTAMP(#Date);300000).  The big number is the size of the bin in milliseconds (5 × 60 × 1,000 = 300,000).  Here’s a screenshot of that.  If you’re still scratching your head with this one, you might want to watch the video.  It’s toward the end.
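If the binning arithmetic is the head-scratcher, this Python sketch spells it out: convert each date to milliseconds since the epoch, then round down to a multiple of the bin size. A five-minute bin is 5 × 60 × 1,000 = 300,000 ms. Labelling each bin by its start time is my choice for the sketch; I’m not claiming DAS labels its bins the same way, and the sample timestamps are invented.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

FIVE_MINUTES_MS = 5 * 60 * 1000  # 300000 milliseconds
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_millis(dt):
    """Milliseconds since the Unix epoch for a timezone-aware datetime."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def bin_start(ts_ms, bin_ms=FIVE_MINUTES_MS):
    """Round a millisecond timestamp down to the start of its bin."""
    return ts_ms - ts_ms % bin_ms


def count_per_bin(timestamps, bin_ms=FIVE_MINUTES_MS):
    """Count events per bin, keyed by the bin's start in milliseconds."""
    return Counter(bin_start(to_millis(ts), bin_ms) for ts in timestamps)


# Invented event times (UTC), for illustration only.
events = [
    datetime(2011, 2, 14, 12, 1, 10, tzinfo=timezone.utc),
    datetime(2011, 2, 14, 12, 4, 59, tzinfo=timezone.utc),
    datetime(2011, 2, 14, 12, 5, 0, tzinfo=timezone.utc),
    datetime(2011, 2, 14, 12, 9, 30, tzinfo=timezone.utc),
]
counts = count_per_bin(events)
```

Passing `bin_ms=60 * 1000` gives one-minute slices instead.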

The big picture

A picture is worth a thousand words. When you’ve got billions of events, that’s a lot to talk about.  Assembling dashboards to visualize time series analytics with DAS is simple and straightforward, and the results can be enlightening.  For starters, let’s take a look at a chart that compares the volume of tweets about two trending topics over the course of a day.

Time Chart

Looks pretty, doesn’t it?  Now let’s have a look at a similar chart which examines a smaller time range, but with a finer-toothed comb: one-minute slices.  See the spike?

That’s something we’d never have seen in the kind of weekly reports you see below.  Yet, those more coarse-grained reports are still useful in providing an overview to executives.  Why is that important? Well, it depends on your use case, but a number of Datameer’s customers are interested in identifying temporary irregularities or patterns that might represent value opportunities or even critical problems they can’t wait to address.

Watch the video:  If you’d like to get a little more hands-on, I’ve posted a live demonstration illustrating all these concepts here.

Posted in How-to, Uncategorized | Leave a comment

Migrating Pig functions to DAS

Say you are using Pig and have written some user-defined functions that work well for you. Now, you want to take advantage of your Pig functions

Continue reading »

Posted in How-to | Leave a comment