Monthly Archives in February 2012

Welcome Microsoft

Hadoop differs from other open source projects in more than just its technology.

Hadoop is the first open source technology to create a big market. Unlike past open source successes such as Linux and MySQL, which brought low-cost alternatives to existing technology markets, Hadoop has led the creation of the multi-billion dollar big data analytics market. While a number of traditional, commercial software and hardware technologies have tried to address big data analytics, they are limited by high cost, a lack of linear scalability, and an inability to process both unstructured and structured data.

Hadoop is revolutionizing the way analytics are done. As a result, commercial interest has exploded: virtually every large company has a Hadoop project, and smaller, innovative social media, gaming and Web 2.0 companies are leveraging the scalability and cost effectiveness of Hadoop to break new ground in interaction analytics. Every major hardware vendor now has some form of Hadoop offering to help meet the demand for more computing power and data storage. Even Oracle, which questioned the relevance of NoSQL databases over the last several years, announced a Hadoop offering a few weeks back.

This week, Datameer and Microsoft are announcing a partnership to bring Datameer’s end-user BI platform to Microsoft’s new Azure-based Hadoop offering. We are excited about this partnership for three reasons. First, Microsoft’s embrace of Hadoop as a key analytics platform expands the Hadoop ecosystem in a big way. Second, Microsoft made the spreadsheet the industry-standard interface for basic number crunching, with over 500,000,000 Excel users. And third, given the success of the spreadsheet user interface for both Microsoft and Datameer, this partnership is a perfect match to bring Hadoop-based spreadsheet analytics to every business user.

We welcome Microsoft to the Hadoop community and look forward to working with them and our joint customers to connect people to the world’s data.

Stefan Groschupf

Posted in Announcements, Big Data Analytics Perspectives

Which Super Bowl XLVI QB is better “in the clutch”?

With Super Bowl XLVI coming up, there has been much debate over the two starting quarterbacks, Tom Brady and Eli Manning, and whether both can be considered “elite”. Tom Brady, without a doubt, has the career stats to back up the elite label. His touchdowns, passing yards, quarterback ratings, and almost all other stats easily eclipse those of Eli Manning. Brady ranks right up there with other quarterback greats such as Aaron Rodgers, Drew Brees, and Brett Favre. But regardless of career stats, when questioned about his place in NFL history, Eli Manning himself said he was “elite”.

This argument gave me the idea of comparing the stats of the two quarterbacks ahead of this weekend’s Super Bowl. But rather than look at overall career totals and averages (because we all know Brady’s overall stats will reign supreme), I decided to compare their stats only “in the clutch”. By “in the clutch”, I mean which quarterback delivered the most when it counted, that is, when the game was on the line. For my definition of “clutch”, I will look at who passed for more touchdowns in the all-important, closing 4th quarter, and also who passed for more touchdowns in the 4th quarter with less than 5 minutes remaining in the game. With the pressure on and the game clock dwindling, which quarterback reigns supreme?

To perform this analysis, I needed stats broken down at the play-by-play level. Most NFL stat sites only give stats at the end-of-game level (i.e., the final box score). I found such play-by-play data at www.advancednflstats.com.

As you can see from the screenshot above, the data is stored at the individual play level, tracking the quarter, the time, and a description of each play. The website had data going back to 2002. Tom Brady’s rookie season was 2000 and Eli Manning’s was 2004, so there was enough data to cover the majority of both quarterbacks’ careers. In fact, there were well over 384,000 plays across all regular-season NFL games dating back to 2002.

Using the trial version of Datameer, I loaded this data and performed some aggregations. I didn’t need to write any code; I simply used the wizards to guide me through my analysis.

My first step was to import or ingest the data. Data from the site came in multiple .csv files, one for each year dating back to 2002.

After specifying the data details, Datameer was able to parse through the data and recognize all the column headers and data types.
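Datameer handled this step through its import wizard. For readers curious what the equivalent looks like outside Datameer, here is a minimal Python sketch that merges one CSV source per season into a single list of plays. The column names (`qtr`, `clock`, `description`) and the sample rows are assumptions made up for illustration, not the actual layout of the downloaded files.

```python
import csv
from typing import Dict, Iterable, List


def load_plays(sources: Dict[int, Iterable[str]]) -> List[dict]:
    """Merge per-season CSV sources into one list of play records.

    `sources` maps a season year to any iterable of CSV lines, such as an
    open file. The column names used here are hypothetical.
    """
    plays = []
    for season, lines in sorted(sources.items()):
        for row in csv.DictReader(lines):   # first line supplies the column headers
            row["season"] = season          # remember which source the play came from
            row["qtr"] = int(row["qtr"])    # csv yields strings, so convert numbers
            plays.append(row)
    return plays


# Made-up rows standing in for two of the per-season files.
season_2010 = [
    "qtr,clock,description",
    "4,03:12,T.Brady pass short right for 5 yards TOUCHDOWN",
]
season_2011 = [
    "qtr,clock,description",
    "1,12:40,E.Manning pass deep left for 30 yards",
]

plays = load_plays({2011: season_2011, 2010: season_2010})
print(len(plays), plays[0]["season"], plays[0]["qtr"])
```

With real files, each value in `sources` would be an open file handle, for example `open(path, newline="")`, where `path` is wherever the season's file was saved.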

Once the data had been imported, I opened up a workbook and linked my play-by-play data into a worksheet.

Since this data represented all plays in the NFL for all teams, I used the filter wizard to get only the records for plays by Tom Brady and Eli Manning.

Since the play-by-play data contained descriptions of each play, I applied another filter to find all the touchdown passing play descriptions, while weeding out interceptions, reversed calls, and non-plays. And since we’re looking for stats “in the clutch”, I’m only concerned with the 4th quarter.
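The two filters above were built with Datameer's filter wizard. As a rough illustration only, the same logic could be written as a Python predicate. The description wording (abbreviated names such as `T.Brady`, and markers such as `TOUCHDOWN`, `INTERCEPTED`, `REVERSED` and `No Play`) is an assumption about how the play text is phrased, and the sample plays are made up.

```python
from typing import List

# Assumed name format. Matching on "E.Manning" rather than "Manning"
# keeps Peyton Manning's plays out of the result.
QUARTERBACKS = ("T.Brady", "E.Manning")
# Assumed markers for plays that should not count as touchdown passes.
EXCLUDE = ("INTERCEPTED", "REVERSED", "No Play")


def is_fourth_quarter_td_pass(play: dict) -> bool:
    """True for a 4th-quarter touchdown pass by one of the two quarterbacks."""
    desc = play["description"]
    return (
        play["qtr"] == 4
        and any(qb in desc for qb in QUARTERBACKS)
        and " pass " in desc
        and "TOUCHDOWN" in desc
        and not any(marker in desc for marker in EXCLUDE)
    )


# Made-up plays covering each rule.
sample_plays: List[dict] = [
    {"qtr": 4, "description": "(3:12) T.Brady pass short right for 5 yards, TOUCHDOWN."},
    {"qtr": 2, "description": "(9:40) E.Manning pass deep left for 30 yards, TOUCHDOWN."},
    {"qtr": 4, "description": "(1:05) E.Manning pass short middle INTERCEPTED, TOUCHDOWN."},
    {"qtr": 4, "description": "(7:30) P.Manning pass short left for 8 yards, TOUCHDOWN."},
    {"qtr": 4, "description": "(2:00) T.Brady pass for 12 yards, TOUCHDOWN. Play REVERSED."},
]

clutch_plays = [p for p in sample_plays if is_fourth_quarter_td_pass(p)]
print(len(clutch_plays))
```

Only the first sample play passes every rule. The others fail on the quarter, the interception marker, the quarterback name, and the reversal marker respectively.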

Next, I created new columns to flag whether each record was a play for Brady or Manning. Since all the details are in the play’s description field, I needed to use a “contains” function to check which quarterback the play was for. By simply double-clicking into an empty column, I was able to launch the wizard to help me configure the “contains” clause on the description field. I created two new columns, one for Brady and one for Manning.
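In plain Python, a “contains” check is simply a substring test. The sketch below adds two boolean columns to a play record. The column names `is_brady` and `is_manning` and the name strings are assumptions for illustration, not what Datameer generates.

```python
def add_qb_flags(play: dict) -> dict:
    """Return a copy of the play with one boolean flag per quarterback."""
    desc = play["description"]
    return {
        **play,
        "is_brady": "T.Brady" in desc,      # substring test, like a "contains" clause
        "is_manning": "E.Manning" in desc,
    }


# Made-up play used only to show the result.
flagged = add_qb_flags(
    {"season": 2011, "description": "(4:10) E.Manning pass deep right, TOUCHDOWN."}
)
print(flagged["is_brady"], flagged["is_manning"])
```

Keeping the two flags as separate columns makes the later per-season counts a simple matter of counting the `True` values in each column.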

Now that I had a record for every play and a flag for both Brady and Manning, I could start the analysis. By creating a new sheet and using the GROUPBY function, I grouped my data by football season.


I then performed a group count on my Brady and Manning boolean flags. One column for Brady and one for Manning.
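The group-and-count step might look like this in Python. This is a sketch of the idea rather than of Datameer's GROUPBY implementation, and the input records, with their `season`, `is_brady` and `is_manning` fields, are made up.

```python
from collections import Counter
from typing import Dict, List


def count_by_season(plays: List[dict]) -> Dict[int, Dict[str, int]]:
    """Count flagged plays per season for each quarterback."""
    brady: Counter = Counter()
    manning: Counter = Counter()
    for play in plays:
        # Booleans add as 0 or 1, so each flag acts as its own counter.
        brady[play["season"]] += play["is_brady"]
        manning[play["season"]] += play["is_manning"]
    seasons = sorted(set(brady) | set(manning))
    return {s: {"brady": brady[s], "manning": manning[s]} for s in seasons}


# Made-up flagged plays, not real statistics.
flagged_plays = [
    {"season": 2010, "is_brady": True, "is_manning": False},
    {"season": 2010, "is_brady": False, "is_manning": True},
    {"season": 2010, "is_brady": True, "is_manning": False},
    {"season": 2011, "is_brady": False, "is_manning": True},
]

totals = count_by_season(flagged_plays)
print(totals)
```

The result has one entry per season with a count for each quarterback, which is the shape needed to plot one line per player.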

I saved this workbook and ran it against the entire dataset.

You can now see the results: the number of passing touchdowns for Brady and Manning in the 4th quarter only. A quick plot on Datameer’s dashboard shows me the following graphs, with red lines for Brady’s stats and blue lines for Manning’s.

Then by following the same process above, but now filtering for touchdowns in the 4th quarter and less than 5 minutes remaining in the game, I get the following graphs:
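The extra time condition can be sketched in Python as follows. It assumes a hypothetical `clock` column holding the time remaining in the quarter as `MM:SS`; the actual data may store the time differently.

```python
def seconds_left(clock: str) -> int:
    """Convert an 'MM:SS' game clock into seconds remaining in the quarter."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + int(seconds)


def is_late_fourth_quarter(play: dict) -> bool:
    """True when the play starts in the 4th quarter with under 5 minutes left."""
    return play["qtr"] == 4 and seconds_left(play["clock"]) < 5 * 60


# Made-up plays around the 5-minute boundary.
late_checks = [
    is_late_fourth_quarter({"qtr": 4, "clock": "04:59"}),
    is_late_fourth_quarter({"qtr": 4, "clock": "05:00"}),
    is_late_fourth_quarter({"qtr": 3, "clock": "01:30"}),
    is_late_fourth_quarter({"qtr": 4, "clock": "00:07"}),
]
print(late_checks)
```

A play at exactly 5:00 is excluded because the definition used here is strictly less than 5 minutes remaining.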

This tells us that Eli Manning is actually better “in the clutch” than Tom Brady. Going back a couple of years, you will see that Manning has, for the most part, matched Brady’s stats. But this is all about “now”: in the 2011 season, Manning threw 15 touchdowns in the 4th quarter and 10 with less than 5 minutes remaining in the game. Brady’s numbers for the same year are 12 TDs in the 4th quarter and 7 TDs with less than 5 minutes remaining in the game. So when the game is on the line and time is running out… Eli Manning is your elite QB!

While this data set is small, it shows how easily one can analyze data using Datameer. And since Datameer runs on Hadoop, we could easily scale up to billions of records.

So go ahead, download our trial version of Datameer and see what interesting stats you can come up with for the Super Bowl. And don’t hesitate to send us your results. Who knows, it might be our next blog post!

http://datameer.com/products/download-trial.html

Posted in Big Data Analytics Perspectives