Blog Archives

How open data can lead to better parking in San Francisco

If you have ever tried to park in downtown San Francisco during work hours (or in any big city, for that matter), you’ll know what I am talking about: be prepared to circle around for hours. The good news is that the city of San Francisco has released an API for monitoring garage parking information, among other things, at http://www.sfpark.org. In their own words: “SFpark works by collecting and distributing real-time information about where parking is available so drivers can quickly find open spaces.”

This is done via real-time sensors.

As an enthusiastic data analyst, I realized I could use Datameer to get more insight into my parking problem. Sure, SFpark already gives you real-time information about parking (on the first page of http://sfpark.org/), and analysts have already taken a look at pricing.

What I was looking for was a little different: an overall perspective of parking availability and change over a long period of time.

Datameer allows you to bring in data from a variety of sources. In this case SFpark’s API returns JSON data, so we used the built-in adaptor for it. We can easily schedule this import to run automatically, say every 30 minutes or so, to start building our time series analysis.

An interesting new feature of importing is that you can now partition the data, in the same way Hive does, but in an easy, user-friendly way, in this form:

You can then work with a subset of the data, in this case chosen to be on a day-granularity. More information about how to set up partitions here.
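To make the idea of day-granularity partitions concrete, here is a minimal Python sketch of what partitioning amounts to conceptually. It is not how Datameer implements the feature; the `fetched_at` field name and the Hive-style `day=YYYY-MM-DD` key are assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime

def partition_by_day(rows):
    """Group rows under a Hive-style key such as 'day=2012-02-22'.

    Each row is assumed to carry an ISO 8601 timestamp in a
    hypothetical 'fetched_at' field.
    """
    partitions = defaultdict(list)
    for row in rows:
        day = datetime.fromisoformat(row["fetched_at"]).date().isoformat()
        partitions[f"day={day}"].append(row)
    return dict(partitions)

# Made-up sample rows, only to show the grouping.
rows = [
    {"fetched_at": "2012-02-22T09:30:00", "occupied": 310},
    {"fetched_at": "2012-02-22T10:00:00", "occupied": 340},
    {"fetched_at": "2012-02-23T09:30:00", "occupied": 295},
]
parts = partition_by_day(rows)
```

An analysis that only needs one day can then read a single key instead of scanning every snapshot.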

Once the data is imported, we can easily deconstruct the JSON data with our set of JSON functions. The complete way to do this and deal with JSON data downloaded from the web is described in detail in our video section, but basically we end up with something like this:
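For readers without Datameer at hand, the same flattening step can be sketched in plain Python. This is not Datameer’s JSON function set, and the payload shape below (a `garages` list with `name`, `occupied` and `capacity` fields) is a made-up assumption rather than the documented SFpark response format, so the keys would need adjusting to whatever the API actually returns.

```python
import json

# Hypothetical payload: field names and values are illustrative only.
SAMPLE = """
{"fetched_at": "2012-02-22T09:30:00",
 "garages": [
   {"name": "Golden Gateway", "occupied": "310", "capacity": "1000"},
   {"name": "Example Garage", "occupied": "700", "capacity": "800"}
 ]}
"""

def flatten_snapshot(payload):
    """Turn one nested JSON snapshot into flat rows, one per garage."""
    doc = json.loads(payload)
    rows = []
    for garage in doc.get("garages", []):
        rows.append({
            "fetched_at": doc["fetched_at"],
            "name": garage["name"],
            # Counts arrive as strings in this sample, so convert them.
            "occupied": int(garage["occupied"]),
            "capacity": int(garage["capacity"]),
        })
    return rows

flat_rows = flatten_snapshot(SAMPLE)
```

Each snapshot becomes a handful of flat rows, which is the shape a time series analysis needs.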

We can now construct an analysis and visualize the data to infer some statistics, for example the time of day at which the garages in downtown San Francisco are fullest (this was run over a 3-month period in 2012):

It appears from this graph that if you work there, the earlier you arrive the better, because the garages fill up pretty quickly. People seem to start leaving around 2pm (with a maximum availability of 32%), so the general trend seems to be to work early in the day, perhaps because of all the financial institutions in that area?
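The hourly availability behind a chart like this boils down to a group-by on the hour of day. Here is a minimal sketch in Python, assuming flat rows with hypothetical `fetched_at`, `occupied` and `capacity` fields; the sample numbers are invented and are not the figures from our study.

```python
from collections import defaultdict
from datetime import datetime

def availability_by_hour(rows):
    """Return {hour: percent of spaces free}, pooled over all rows in that hour."""
    free = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        hour = datetime.fromisoformat(row["fetched_at"]).hour
        free[hour] += row["capacity"] - row["occupied"]
        total[hour] += row["capacity"]
    # Skip hours with zero recorded capacity to avoid dividing by zero.
    return {h: 100.0 * free[h] / total[h] for h in sorted(total) if total[h]}

# Made-up snapshots for illustration.
snapshots = [
    {"fetched_at": "2012-02-22T09:00:00", "occupied": 90, "capacity": 100},
    {"fetched_at": "2012-02-23T09:30:00", "occupied": 70, "capacity": 100},
    {"fetched_at": "2012-02-22T17:00:00", "occupied": 50, "capacity": 100},
]
hourly = availability_by_hour(snapshots)
```

Pooling free spaces and capacity before dividing weights large garages more heavily than small ones; averaging per-garage percentages instead would be an equally defensible choice.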

After 5pm (1700) the general availability is around 42%, and it gets easier to park after that.

Given that there are around 440,000 total spaces in San Francisco, does the day of the month make a difference in parking space availability? This graph shows that it doesn’t seem so:

The weekends (Feb 26, Mar 3) show the spaces are mostly unoccupied, whereas the average number of spaces occupied on weekdays is around 15,000. Of note, there is an outlier of over 20,000 spaces occupied on Feb 22; we are not sure what happened that day. (Please tell us if you know!)

Let’s see if there is any drastic difference in garage space occupation for this range of days, per garage or area:

It seems like the number of spaces available is fairly evenly distributed for the garages in the dashboard, except for the Golden Gateway one.

Let’s just add a sort and see the top occupied garages and areas to avoid:

Overall, the Leavenworth area and the south Embarcadero road seem to be pretty bad areas for parking.
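Ranking garages like this is just an aggregation followed by a descending sort. A minimal Python sketch, assuming the same hypothetical row fields (`name`, `occupied`, `capacity`) and invented sample values:

```python
def top_occupied(rows, n=3):
    """Return the n garages with the highest pooled occupancy rate."""
    totals = {}
    for row in rows:
        occ, cap = totals.get(row["name"], (0, 0))
        totals[row["name"]] = (occ + row["occupied"], cap + row["capacity"])
    # Occupancy rate per garage, ignoring garages with no recorded capacity.
    rates = {name: occ / cap for name, (occ, cap) in totals.items() if cap}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Made-up sample rows: garage A appears twice and is pooled.
garage_rows = [
    {"name": "A", "occupied": 90, "capacity": 100},
    {"name": "B", "occupied": 50, "capacity": 100},
    {"name": "C", "occupied": 70, "capacity": 100},
    {"name": "A", "occupied": 80, "capacity": 100},
]
worst = top_occupied(garage_rows, n=2)
```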

What if you wanted to see the results of this study for a particular timeframe only, say on a per-day basis, without having to change the analysis? You can simply enable the result set to be partitioned, as demonstrated here, with this nifty sunburst to control the partition level:

This analysis is continuously refreshed with the latest data, so let us know if you want to see the latest results.

This study could be further deepened by looking at the prices each garage is charging, and choosing the lowest-price one along with its availability.

Posted in Big Data Analytics Perspectives

Predicting the stock market with Datameer

I recently read an interesting research paper by Johan Bollen, Huina Mao, and Xiao-Jun Zeng from Indiana University entitled “Twitter mood predicts the stock market,” that investigated whether “collective mood states derived from large-scale Twitter feeds” correlated with the value of the Dow Jones Industrial Average. What they found was that their algorithm not only paralleled market changes, it predicted them, with a startling 87.6 percent accuracy!

As a provider of Big Data analytics software, we see this type and scale of problem all the time at our customer sites, particularly the correlation of structured and unstructured data. For this particular study, let’s see how easy it is to reproduce this analysis with Datameer Analytics Solution (DAS).

First, let’s download the Dow Jones stock values data. You can get this freely, from Yahoo for example (DJIA). This is a simple CSV file format showing daily prices. You can also download other data, such as the NYSE Composite index, to experiment with.
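Outside DAS, reading such a file takes only a few lines of Python. The sketch below assumes a header row with `Date` and `Close` columns, which is how we remember the Yahoo export looking; check the header of the file you actually download. The prices in the sample are invented.

```python
import csv
import io

# Made-up sample in an assumed column layout.
SAMPLE_CSV = """Date,Open,High,Low,Close,Volume
2010-03-01,100.0,102.0,99.0,101.5,1000
2010-03-02,101.5,103.0,101.0,102.25,1200
"""

def load_closes(csv_text):
    """Map each date string to its closing price."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Date"]: float(row["Close"]) for row in reader}

closes = load_closes(SAMPLE_CSV)
```

Swapping in another index, such as the NYSE Composite, only requires a file with the same two columns.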

Second, let’s get some Twitter data from their API, known as the “fire hose”. For this test, we’ll use raw data (i.e. unfiltered tweets) for the entire month of March 2010.

Let’s load all of this data into DAS. In our new 1.3.x version, you can simply upload a file from your local computer, so let’s load our Dow Jones data this way:


Then let’s load the tweets, via an Import Job, which understands Twitter’s format natively:

This amounts to about 30 GB of compressed data for the month.

Let’s first try to get a more accurate data set by filtering the tweets to US users. This is something that the researchers apparently did not do (“we note that our analysis is not designed to be limited to any particular geographical location”), but it is easy to do with DAS.
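As a rough illustration of such a filter outside DAS, here is a Python sketch. The `country_code` key is a hypothetical field name chosen for the example, not a statement about Twitter’s actual record layout, and the sample tweets are invented.

```python
def us_only(tweets):
    """Keep tweets whose (hypothetical) 'country_code' field is 'US'.

    Tweets with a missing or empty country code are dropped.
    """
    return [t for t in tweets if (t.get("country_code") or "").upper() == "US"]

# Made-up sample records.
sample_tweets = [
    {"text": "Great morning in SF", "country_code": "US"},
    {"text": "Bonjour", "country_code": "FR"},
    {"text": "No location set", "country_code": None},
    {"text": "lowercase code", "country_code": "us"},
]
us_tweets = us_only(sample_tweets)
```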

We did not have OpinionFinder or Google-Profile of Mood States at our disposal to perform sentiment analysis (these could make great new functions that could be added via our API some day!), so let’s instead use a simplified approach: take a list of positive terms (a bag-of-words model) and find the tweets that contain these terms.

To do this in DAS, let’s import a list of such terms (these can easily be found on various web sites), create an outer join with our tweets, and then filter to find the tweets that contain these positive words.
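The matching logic itself is simple enough to sketch in Python. This is a set-intersection stand-in for the join-and-filter we built in DAS, not the DAS expression, and the four-word term list is a tiny invented placeholder for a real list of positive terms.

```python
import re

# Illustrative placeholder; a real list would hold many more terms.
POSITIVE_TERMS = {"happy", "great", "love", "awesome"}

def is_positive(text, terms=POSITIVE_TERMS):
    """Return True if the tweet contains at least one positive term.

    Matching is on whole lowercase words, so 'unhappy' does not
    count as a hit for 'happy'.
    """
    words = set(re.findall(r"[a-z']+", text.lower()))
    return not terms.isdisjoint(words)
```

For example, `is_positive("I LOVE this city!")` is true, while `is_positive("So unhappy about parking")` is false. Like any bag-of-words approach, this ignores negation, so “not happy” still counts as positive.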

In DAS 1.3.x you can filter with a complex expression directly in the ‘Advanced’ tab:

Now let’s count the positive tweets per day. This is just an aggregation sheet using GROUPBY and counting (what you see below is the sheet preview result, not yet the actual count on the full data set):

This represents the amount of “happiness” mood by day.
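In plain Python, the same group-and-count can be sketched as follows. The tweets are assumed to be `(timestamp, text)` pairs with ISO 8601 timestamps, and the positivity check is a deliberately tiny stand-in; both are assumptions made for this example.

```python
from collections import Counter

POSITIVE = {"love", "great", "happy"}  # illustrative placeholder list

def contains_positive(text):
    """Tiny stand-in check: any whitespace-separated word in POSITIVE."""
    return any(word in POSITIVE for word in text.lower().split())

def positive_counts_by_day(tweets, predicate=contains_positive):
    """Count tweets per calendar day that satisfy the predicate.

    The first 10 characters of an ISO timestamp are the YYYY-MM-DD date.
    """
    counts = Counter()
    for created_at, text in tweets:
        if predicate(text):
            counts[created_at[:10]] += 1
    return dict(counts)

# Made-up sample tweets.
tweets = [
    ("2010-03-01T08:00:00", "love this"),
    ("2010-03-01T09:00:00", "meh"),
    ("2010-03-02T10:00:00", "great day"),
    ("2010-03-02T11:00:00", "so happy today"),
]
happiness = positive_counts_by_day(tweets)
```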

Next, let’s create a new workbook to join the resulting worksheet of “happiness” mood per day with our Dow Jones Industrial Average (ticker ^DJIA) data:

A very helpful feature in DAS is the fact that we can seamlessly exchange the DJIA history with, say, NYSE Composite index history via the ‘Exchange Datasource’ button and rerun the workbook to test the correlation with data from other exchanges. This requires no further changes or additional work (more details on this later).
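Conceptually, the join matches the two series on their date, and swapping the data source just means passing in a different price series. A minimal Python sketch, with both inputs assumed to be `{date: value}` dictionaries and all numbers invented:

```python
def join_by_day(mood_by_day, close_by_day):
    """Inner join of two {date: value} mappings on their shared dates."""
    shared = sorted(mood_by_day.keys() & close_by_day.keys())
    return [(day, mood_by_day[day], close_by_day[day]) for day in shared]

# Made-up sample series. 2010-03-06 is a Saturday, so it has no close.
mood = {"2010-03-01": 120, "2010-03-02": 95, "2010-03-06": 80}
djia = {"2010-03-01": 101.5, "2010-03-02": 102.25, "2010-03-03": 103.0}
joined = join_by_day(mood, djia)
```

Because this is an inner join, days without a market close (weekends and holidays) drop out of the result, which is worth keeping in mind when comparing the two curves.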

Here is the resulting sheet of our join:

As you may know, analyses in DAS are built on a sample of the entire data set, which enables users to easily interact with the data until they’re satisfied with the analysis. Now that we are happy with our analysis, let’s run the workbook on the entire data set.

Now let’s graph the tweet “happiness” mood and the DJIA market closing value over the same days and compare:

As the researchers pointed out, we can note a correlation between our Twitter “happiness” index and how the Dow Jones Industrial Average went up or down between two and six days later. First see the progressive parallel mood upswing (1), then the drop at (2) (a drop in Twitter mood followed by a drop in the DJIA value), an upswing again on March 19 at (3) (Twitter mood goes up, quickly followed by the DJIA value), then a drop on March 22 followed by the same drop in the DJIA value a few days later (see (4)). A similar correlation can be found by using NYSE data instead.
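We read the correlation off the chart by eye, but a lagged relationship like this can also be quantified. The sketch below computes a Pearson correlation between mood on day t and the close on day t + lag; it is not the method used in the paper. It assumes both series are equally long lists aligned on the same trading days, and the sample values are invented so that the close follows the mood exactly two days later.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(vx * vy)

def lagged_correlation(mood, closes, lag):
    """Correlate mood on day t with the close on day t + lag."""
    if len(mood) != len(closes):
        raise ValueError("series must be aligned and equally long")
    if lag < 0 or lag >= len(mood):
        raise ValueError("lag must be between 0 and len(mood) - 1")
    return pearson(mood[: len(mood) - lag], closes[lag:])

# Made-up series: each close echoes the mood from two days earlier.
mood_series = [1, 3, 2, 5, 4, 6, 7]
close_series = [10, 10, 2, 6, 4, 10, 8]
r = lagged_correlation(mood_series, close_series, lag=2)
```

With real data one would evaluate this for each lag from two to six days and compare the coefficients, bearing in mind that a single month gives very few points per lag.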

Disclaimer: this analysis was done on only a month’s worth of data, but it could be expanded to more data very easily with no further changes to the analysis. We also used a very simple sentiment-tracking technique, which could be further improved. Finally, due to the small amount of data, we did not have outliers in the data as the researchers did (“significant socio-cultural events such as the Presidential election and Thanksgiving, short-lived uptick in positive sentiment specific to those days”), but we could easily have filtered them out had we worked with more data.

Pretty easy, wasn’t it? That simplicity, combining all kinds of data ad hoc while harnessing the power and scalability of Hadoop to extract insights, is what Datameer is all about. I hope you’ve enjoyed this post; you can learn more about Datameer at www.datameer.com/products.

Posted in How-to, Uncategorized

Migrating Pig functions to DAS

Say you are using Pig and have written some user-defined functions that work well for you. Now, you want to take advantage of your Pig functions

Continue reading »

Posted in How-to