Charles – Chuck – Butler, a colleague at Canonical, wrote a very nice blog post explaining the basics of Big Data. It does not only explain them but it also allows anybody to set up Big Data solutions in minutes via Juju. Really recommended reading:
This is a good example of the power of cloud orchestration. Some expert creates charms and bundles them with Juju and afterwards anybody can easily deploy, integrate and scale this Big Data solution in minutes.
Samuel Cozannet, another colleague, used some of these components to create an open source Tweet sentiment analysis solution that can be deployed in 14 minutes and includes autoscaling, a dashboard, Hadoop, Storm, Kafka, etc. He presented it on the OpenStack Developer Summit in Paris and will be providing instructions for everybody to set it up shortly.
There is currently still a vacuum for easy & scalable solutions in the machine learning space.
At the moment everybody is talking about Hadoop as the de-facto standard for Big Data. Unfortunately Hadoop is not a real-time system. Map-reduce can be used for batch machine learning like training a Logistic Regression/Support Vector Machine/Neural Network, Batch Gradient Descent, etc. However when it comes to real-time predictions it is not the platform of choice. Additionally Java is loosing every day its status of preferred language. New machine learning algorithms are more likely to be developed in R, Scala, Python, Go etc. There is of course Mahout which is scalable but the word “easy” is not a synonym.
If you want to create your own algorithms but do not want to go low-level Java Map-Reduce, then there are some alternatives like Pig [for the SQL-minded], Cascading [Java but easy and allows test driven development!], Scalding [Scala on top of Cascading, made by Twitter. Could be combined with libraries like Scalala for easy vector and matrix similar to Matlab], etc.
What other options are there?
Storm could be an option for time series, predictions based on a pre-trained model, online learning algorithms, etc. However what is missing is an extension like Trident, but for distributed machine learning, that avoids having to reinvent the wheel. A sort of Mahout for Storm.
Spark is another option. But Mesos is still very early days and also here a Mahout for Spark would be a good addition. In comparison with Storm, Spark would be ideal for training complex machine learning algorithms that need to iterate millions of times over the same data set.
Graphlab can be an option for those who are looking for social network analytics or other graph-based machine learning.
If you wanted to work with R then you could use packages like Snow or Parallel. But this would mean you need to reinvent a lot of distributed management of processing nodes. Both packages just incorporate the basic functions to launch some external processing nodes but are lacking professional management of a large cluster. You could also look at RHadoop, as long as you are fine with non-real-time on top of Hadoop. For alternatives for RHadoop you could look at Rhipe. Segue is R + Amazon Elastic Map Reduce, etc.
Update: an interesting extension for R (i.e. pbd) has just been released that promises R execution on over 10.000 cores. Read more about is here.
What is missing?
Simplicity, easy to use & reusable. What is needed is a solution that is cross-platform (R, Scala, Java, Python, Matlab, etc.). With a visual interface like RapidMiner or Knime, that allows 80% of the work to be drag-and-drop. With a re-useable library of the most used algorithms for prediction, clustering, classification, outlier detection, dimension reduction, normalization, etc. Ideally with a marketplace for sharing data and algorithms. With an easy interface to manage your data and create reports, think similar to Datameer. Ideally integrated with tools for data cleaning (e.g. Google’s Refine) and ETL (e.g. Pentaho, Talend, Jasper Reports, etc.). But most of all with a powerful distributed engine that allows both batch processing [Hadoop] and real-time [e.g. Storm]. And finally with a one click install.
If my requirements are missing some important aspects, let me know. If you want to construct such a system, please contact me…
I initially complaint about the complexity of installing Mesos when I was playing around with Spark and Shark. However
when I saw the Twitter Mesos and Framework presentation, I understood why Mesos can be disruptive to how you architect applications in a highly distributed manner typical for Cloud Computing.
You can see the presentation here.
The key is that Twitter combined Mesos with Zookeeper, Linux Control Groups and Google’s Protocol Buffers as well as Spark, Storm and Hadoop. This provides them with a way to easily program services that can be scaled to hundreds of mesos nodes, automatically upgraded and restarted in case of failure. Also resource usage can be controlled via the control groups. Zookeeper manages the configuration. Protocol buffers assure efficient communication between nodes. Services can use Spark and Storm for real-time operations and Hadoop for batch. Developers do not have to worry about scaling the services, deploying them to different nodes, etc. This is handled by the Twitter Framework and Mesos master.
There is only one thing to add: “TWITTER PLEASE OPEN SOURCE YOUR TWITTER FRAMEWORK” or in Twitter language: “#mesos please #opensource #twitterfw now @telruptive “…
The website defines Spark as a MapReduce-like cluster computing framework designed to support low-latency iterative jobs. However it would be easier to say that Spark is Hadoop for real-time.
Spark allows you to run MapReduce jobs together with your data on distributed machines. Unlike Hadoop Spark can distributed your data in slices and store it in memory hence your processing and data are co-located in memory. This gives an enormous performance boost. Spark is more than MapReduce however. It offers a new distributed framework on which different distributed computing paradigms can be modelled. Examples are: Hadoop’s Hive => Shark (40x faster than Hive), Google’s Pregel / Apache’s Giraph => Bagel, etc. An upcoming Spark Streaming is supposed to bring real-time streaming to the framework.
The excellent part
Spark is written in Scala and has a very straight forward syntax to run applications from the command line or via compiled code. The possibilities to run iterative operations over large datasets or very compute intensive operations in parallel, make it ideal for big data analytics and distributed machine learning.
The points for improvement
In order to use Spark, you need to install Mesos. Mesos is a framework for distributed computing that was also developed by Berkeley. So in a sense they are eating their own dog food. Unfortunately Mesos is not written in scala so installing Spark becomes a mix of make’s, ant’s, .sh, XML, properties, .conf, etc. It would not be bad if Mesos would have consistent documentation but due to incubation into Apache the installation process is currently undergoing changes and is not straightforward.
Spark allows to connect to Hadoop, Hbase, etc. However running Hadoop on top of Mesos is “experimental” to say the least. The integration with Hadoop should be lighter. At the end only access to HDFS, SequenceFiles, etc. is required. This should not mean that a complete Hadoop should be installed and Spark should be recompiled for each specific Hadoop version.
If Spark wants to become as successful as Hadoop, then they should learn from Hadoop’s mistakes. Complex installation is a big problem because Spark needs to be installed on many machines. The Spark team should take a look at Ruby’s Rubygems, Node.js’s npm, etc. and make the installation simple, ideally via Scala’s package manager, although it is less popular.
If possible the team should drop Mesos as a prerequisite and make it optional. One of Spark’s competitors is Storm & Trident, you can install a Storm cluster in minutes and have a one click command to run Storm on an EC2 cluster.
It would be nice if there would be an integration SDK that allows extensions to be plugged-in. Integrations with Cassandra, Redis, Memcache, etc. could be developed by others. Also looking at a distribution in which Cassandra’s Brisk is used to mimic Hive and HDFS (a.k.a. CassandraFS) and have it all pre-bundled with Shark, could be an option. Spark’s in-memory execution and read speed, combined with Cassandra’s write speed, should make for a pretty quick and scalable solution. Ideally without the need to fight with namenodes, datanodes, jobtrackers, etc. and other Hadoop hard-to-configure inventions…
The conclusion is that distributed computing and programming is already hard enough by itself. Programmers should be focusing on their algorithms and not need a professional admin to get them started.
All-in-all Spark, Shark, Streaming Spark, Bagel, etc. have a lot of potential, it is just a little bit rough around the edges…
Update: I am reviewing my opinion about Mesos. See the Mesos post.
In a previous post I mentioned Storm already. Trident is an extension of Storm that makes it an easy-to-use distributed real-time analytics framework for Big Data. Both Trident and Storm were developed by Twitter.
One of Twitter’s major problems is to keep statistics of Tweets and Tweeted URLs that get retweeted by millions of followers. Imagine a famous person who tweets a URL to millions of followers. Lots of followers will retweet the URL. So how do you calculate how many Tweeters have seen the URL? This is important for features like “Top retweeted URLs”.
The answer was Storm but with the addition of Trident, it has become a lot easier to manage. Trident is doing to Storm what Pig and Cascading are doing to Hadoop: simplification. Instead of having to create a lot of Spouts and Bolts and take care of how messages are distributed, Trident comes with a lot of the work already done.
In a few lines of code, you set-up a Distributed RPC server, send it URLs, have it collect the tweeters and followers and count them. Fail-over and resiliance as well as massive distribution throughput are build into the platform. You can see it in this example code:
TridentState urlToTweeters =
TridentState tweetersToFollowers =
.stateQuery(urlToTweeters, new Fields("args"), new MapGet(), new Fields("tweeters"))
.each(new Fields("tweeters"), new ExpandList(), new Fields("tweeter"))
.stateQuery(tweetersToFollowers, new Fields("tweeter"), new MapGet(), new Fields("followers"))
.each(new Fields("followers"), new ExpandList(), new Fields("follower"))
.aggregate(new One(), new Fields("one"))
.aggregate(new Count(), new Fields("reach"));
The possibilities of Trident + Storm, combined with fast scalable datastores, like for instance Cassandra, are enormous. Everything from real-time counters, filtering, complex event processing, machine learning, etc.
The Storm concept of Spout [data generation] and Bolt [data processing] can be easily understood by most programmers. Storm is an asynchronous highly distributed framework but with a simple distributed RPC server it can easily be used in synchronous code.
The only drawback I have seen is that DRPC is focused only on Strings (and other primitive types that can be contained in a String). Adding more complex objects (via Kryo, Avro, Protocol Buffers, etc.), or at least bytes, would be useful for companies that do not only focus on Tweets.
Google has open sourced a low-level nosql storing engine that is authored by the creators of mapreduce and bigtable. Definitely worthwhile to keep an eye on: leveldb. Especially for the products that will be incorporating it.
In a previous post I mentioned that open source graph databases where not ready yet. This one looks promising. Especially because the authors are the number three in the social networking space. At least if they provide access to the code and use a business friendly open source license like Apache’s: stigdb.
Twitter is open sourcing storm on September 19th. It has been referred to as the hadoop of realtime processing. All stream related data is likely to see big advantages by using storm.
Update: Storm has been released on github. Check out the wiki pages.