There is still a vacuum in the machine learning space when it comes to solutions that are both easy and scalable.
At the moment everybody is talking about Hadoop as the de-facto standard for Big Data. Unfortunately Hadoop is not a real-time system. Map-reduce can be used for batch machine learning — think training a Logistic Regression, Support Vector Machine or Neural Network with Batch Gradient Descent (sketched below) — but when it comes to real-time predictions it is not the platform of choice. Additionally, Java is losing its status as the preferred language: new machine learning algorithms are more likely to be developed in R, Scala, Python, Go, etc. There is of course Mahout, which is scalable, but "easy" is not the word that comes to mind.
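To make that concrete, here is a hypothetical sketch of one iteration of Batch Gradient Descent as a Hadoop job: every mapper emits the partial gradient of its input split under the current weight, and a single reducer sums them so the driver can update the weight and resubmit the job. The class names, the "gd.weight" configuration key and the single-feature model are illustrative assumptions, not an existing library:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GradientDescentJob {

  // Mapper: for every "x,y" record in its split, emit the partial gradient
  // (w*x - y) * x of the squared loss under the current weight w.
  public static class GradientMapper
      extends Mapper<LongWritable, Text, NullWritable, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      double w = Double.parseDouble(ctx.getConfiguration().get("gd.weight", "0.0"));
      String[] xy = value.toString().split(",");
      double x = Double.parseDouble(xy[0]);
      double y = Double.parseDouble(xy[1]);
      ctx.write(NullWritable.get(), new DoubleWritable((w * x - y) * x));
    }
  }

  // Reducer: sum all partial gradients; the driver then updates
  // w -= alpha * sum / n and resubmits the job for the next iteration.
  public static class GradientReducer
      extends Reducer<NullWritable, DoubleWritable, NullWritable, DoubleWritable> {
    @Override
    protected void reduce(NullWritable key, Iterable<DoubleWritable> values, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0.0;
      for (DoubleWritable v : values) sum += v.get();
      ctx.write(key, new DoubleWritable(sum));
    }
  }
}

Each pass is a complete Hadoop job that re-reads the training data from disk, which is exactly why Hadoop is fine for batch training but not for real-time work.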
If you want to create your own algorithms but do not want to go down to low-level Java Map-Reduce, there are alternatives like Pig [for the SQL-minded], Cascading [Java, but easy and allows test-driven development! A word-count sketch follows below], Scalding [Scala on top of Cascading, made by Twitter, which can be combined with libraries like Scalala for Matlab-style vector and matrix operations], etc.
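To give a taste of why Cascading counts as "easy", here is a word-count sketch loosely following the Cascading 2.x user guide; the input and output paths are placeholders:

import java.util.Properties;
import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class WordCount {
  public static void main(String[] args) {
    // Source and sink taps: read raw text lines, write "word count" pairs.
    Tap source = new Hfs(new TextLine(new Fields("line")), "input/path");
    Tap sink = new Hfs(new TextLine(), "output/path", SinkMode.REPLACE);

    // Split each line into words, group by word, count each group.
    Pipe assembly = new Pipe("wordcount");
    assembly = new Each(assembly, new Fields("line"),
        new RegexGenerator(new Fields("word"), "\\S+"));
    assembly = new GroupBy(assembly, new Fields("word"));
    assembly = new Every(assembly, new Count(new Fields("count")));

    FlowConnector connector = new HadoopFlowConnector(new Properties());
    Flow flow = connector.connect("word-count", source, sink, assembly);
    flow.complete();
  }
}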
What other options are there?
Storm could be an option for time series, predictions based on a pre-trained model, online learning algorithms, etc. What is missing, however, is an extension like Trident but for distributed machine learning, so that nobody has to reinvent the wheel: a sort of Mahout for Storm.
Spark is another option, although Mesos is still in its early days, and here too a Mahout for Spark would be a good addition. In comparison with Storm, Spark would be ideal for training complex machine learning algorithms that need to iterate millions of times over the same data set, as the sketch below illustrates.
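The reason is Spark's in-memory caching: the training set is loaded once, pinned in cluster RAM, and every subsequent iteration reads it from memory instead of re-scanning HDFS. A minimal sketch with Spark's Java API, where the file path, the single-feature model and the step size are made up for illustration:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeTraining {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[4]", "iterative-gd");
    // Load "x,y" records once and pin them in cluster memory.
    JavaRDD<double[]> points = sc.textFile("hdfs:///data/training.csv")
        .map(line -> {
          String[] xy = line.split(",");
          return new double[] { Double.parseDouble(xy[0]), Double.parseDouble(xy[1]) };
        })
        .cache();
    long n = points.count();
    double w = 0.0;
    for (int i = 0; i < 100; i++) {
      final double wi = w;
      // Each pass reads the cached RDD from RAM, not from disk.
      double gradient = points
          .mapToDouble(p -> (wi * p[0] - p[1]) * p[0])
          .sum();
      w -= 0.01 * gradient / n;
    }
    System.out.println("trained weight: " + w);
    sc.stop();
  }
}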
GraphLab can be an option for those who are looking for social network analytics or other graph-based machine learning.
If you want to work with R you could use packages like Snow or Parallel, but that would mean reinventing a lot of the management of distributed processing nodes: both packages only provide the basic functions to launch external processing nodes and lack professional management of a large cluster. You could also look at RHadoop, as long as you are fine with non-real-time processing on top of Hadoop; Rhipe is an alternative to RHadoop, Segue combines R with Amazon Elastic MapReduce, etc.
Update: an interesting extension for R, pbd, has just been released that promises R execution on over 10,000 cores. Read more about it here.
What is missing?
Simplicity: easy to use and reusable. What is needed is a solution that is cross-platform (R, Scala, Java, Python, Matlab, etc.). With a visual interface like RapidMiner or KNIME that allows 80% of the work to be drag-and-drop. With a reusable library of the most-used algorithms for prediction, clustering, classification, outlier detection, dimension reduction, normalization, etc. Ideally with a marketplace for sharing data and algorithms. With an easy interface to manage your data and create reports, similar to Datameer. Ideally integrated with tools for data cleaning (e.g. Google’s Refine) and ETL (e.g. Pentaho, Talend, Jasper Reports, etc.). But most of all with a powerful distributed engine that allows both batch processing [Hadoop] and real-time processing [e.g. Storm]. And finally with a one-click install.
If my requirements are missing some important aspects, let me know. If you want to construct such a system, please contact me…
In a previous post I mentioned Storm already. Trident is an extension of Storm that makes it an easy-to-use distributed real-time analytics framework for Big Data. Both came out of Twitter: Storm was originally created at BackType, which Twitter acquired and open-sourced, and Trident was added on top of it later.
One of Twitter’s major challenges is keeping statistics on tweets and tweeted URLs. Imagine a famous person tweeting a URL to millions of followers, many of whom retweet it: how do you calculate how many unique Twitter users have seen that URL? This "reach" number is important for features like "Top retweeted URLs".
The answer was Storm, but with the addition of Trident it has become a lot easier to manage. Trident does for Storm what Pig and Cascading do for Hadoop: simplification. Instead of having to create lots of Spouts and Bolts and take care of how messages are distributed, Trident comes with a lot of the work already done.
In a few lines of code you set up a Distributed RPC server, send it URLs, and have it collect the tweeters and their followers and count them. Fail-over and resilience, as well as massively distributed throughput, are built into the platform. You can see it in this example code, which follows the reach example from the Trident tutorial (helper functions like ExpandList and One come from that tutorial):
// Static Trident state: URL -> list of tweeters, tweeter -> list of followers
TridentState urlToTweeters =
    topology.newStaticState(getUrlToTweetersState());
TridentState tweetersToFollowers =
    topology.newStaticState(getTweeterToFollowersState());

topology.newDRPCStream("reach")
    .stateQuery(urlToTweeters, new Fields("args"), new MapGet(), new Fields("tweeters"))
    .each(new Fields("tweeters"), new ExpandList(), new Fields("tweeter"))
    .shuffle() // spread the tweeters evenly over the cluster
    .stateQuery(tweetersToFollowers, new Fields("tweeter"), new MapGet(), new Fields("followers"))
    .each(new Fields("followers"), new ExpandList(), new Fields("follower"))
    .groupBy(new Fields("follower")) // count every follower only once
    .aggregate(new One(), new Fields("one"))
    .aggregate(new Count(), new Fields("reach"));
The possibilities of Trident + Storm, combined with fast scalable datastores like Cassandra, are enormous: everything from real-time counters and filtering to complex event processing and machine learning.
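As an illustration of the real-time counter case, this is essentially the streaming word count from the Trident tutorial. The spout and the Split function are assumed to be defined elsewhere, and the in-memory state factory is a stand-in: a Cassandra-backed StateFactory could be plugged in to make the counts durable.

TridentTopology topology = new TridentTopology();
TridentState wordCounts =
    topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        // Continuously updated counts, queryable via DRPC while the stream runs.
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));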
The Storm concepts of Spout [data generation] and Bolt [data processing] are easily understood by most programmers. Storm is an asynchronous, highly distributed framework, but via a simple distributed RPC server it can easily be called from synchronous code.
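For example, querying the reach topology from ordinary synchronous Java code could look like this minimal sketch (the host name is a placeholder; 3772 is Storm’s default DRPC port):

import backtype.storm.utils.DRPCClient;

public class ReachClient {
  public static void main(String[] args) throws Exception {
    // Connect to the DRPC server; execute() blocks until the topology answers.
    DRPCClient client = new DRPCClient("drpc-host.example.com", 3772);
    String reach = client.execute("reach", "http://example.com/some-article");
    System.out.println("Reach: " + reach);
  }
}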
The only drawback I have seen is that DRPC is focused on Strings (and other primitive types that can be represented as a String). Support for more complex objects (via Kryo, Avro, Protocol Buffers, etc.), or at least raw bytes, would be useful for companies that do not only focus on Tweets.
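Until then, a workaround is to smuggle binary payloads through the String interface yourself. The sketch below is a hypothetical helper using plain JDK serialization plus Base64 purely for illustration; Kryo, Avro or Protocol Buffers would be faster drop-in replacements for the serialization step.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

public final class DrpcCodec {
  // Turn any serializable object into a String that survives the DRPC interface.
  public static String encode(Serializable obj) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(obj);
    }
    return Base64.getEncoder().encodeToString(bytes.toByteArray());
  }

  // Reverse the encoding inside the topology (or in the client).
  public static Object decode(String arg) throws IOException, ClassNotFoundException {
    byte[] raw = Base64.getDecoder().decode(arg);
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
      return in.readObject();
    }
  }
}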
With Hadoop/HBase/Hive, Cassandra, etc. you can store and manipulate petabytes of data. But what if you want nice-looking reports, or want to compare data held in a NoSQL solution with data held elsewhere? The two market leaders in the Open Source business intelligence space are now putting all their firepower into Big Data.
Pentaho Big Data seems to be a bit further ahead. They offer a graphical ETL tool, a report designer and a business intelligence server; these are existing tools to which support for Hadoop HDFS, Map-Reduce, HBase, Hive, Pig, Cassandra, etc. has been added.
Jaspersoft’s Open Source Big Data strategy is a little behind, because the connectors are not yet included in the main product and several are still of beta quality with missing documentation.
Both companies will accelerate the adoption of Big Data, since one of its main obstacles is easy reporting: unstructured data is much harder to format into a well-structured report than structured data. Any solution that makes this possible, and is Open Source as well, is very welcome in times of cost cutting…
In the telecom domain a scalable real-time architecture means paying a lot of money for hardware and licenses: you buy the Oracle RAC solution, build a WebLogic cluster, set up a storage area network, etc.
In the dotcom world things look different. Facebook, Google, Twitter, Yahoo, Amazon, etc. have more active users than any telecom system, yet they have built their architectures on top of open source solutions and commodity servers. Some even build their own software, and sometimes open-source it.
Some of this software has very exotic names: Hadoop, Bigtable, Cassandra, Pig, Elephant-Bird, Dremel, Pregel, Dynamo, etc. Additionally, design decisions are taken that would surprise every IT teacher: "do not normalize", "do not expect immediate consistency", "no transaction support", "store in memory instead of on disk", etc.
However, if you can support 500 million users, 100 million daily hits, 130TB of logs, 20 billion tweet messages, 1 million servers, etc., then you must be doing something right.
The telecom software industry seems to have been isolated from the Internet during the last five years. With the shift to IP, more IT companies are expected to be able to provide telecom solutions. Is this the solution? Not sure! IT companies are also still playing catch-up in the cloud computing domain: few IT solution providers are demonstrating that they now think Map-Reduce instead of middleware.
Google Voice is coming, and most operators still seem more worried about subscriber churn. Google Latitude and Maps demonstrated that with new technology and innovation you can destroy the telecom monopoly on location-based services overnight…
If you are a telecom operator and you are worried, perhaps it is time we talk.