There is currently still a vacuum for easy & scalable solutions in the machine learning space.
At the moment everybody is talking about Hadoop as the de-facto standard for Big Data. Unfortunately Hadoop is not a real-time system. Map-reduce can be used for batch machine learning like training a Logistic Regression/Support Vector Machine/Neural Network, Batch Gradient Descent, etc. However when it comes to real-time predictions it is not the platform of choice. Additionally Java is loosing every day its status of preferred language. New machine learning algorithms are more likely to be developed in R, Scala, Python, Go etc. There is of course Mahout which is scalable but the word “easy” is not a synonym.
If you want to create your own algorithms but do not want to go low-level Java Map-Reduce, then there are some alternatives like Pig [for the SQL-minded], Cascading [Java but easy and allows test driven development!], Scalding [Scala on top of Cascading, made by Twitter. Could be combined with libraries like Scalala for easy vector and matrix similar to Matlab], etc.
What other options are there?
Storm could be an option for time series, predictions based on a pre-trained model, online learning algorithms, etc. However what is missing is an extension like Trident, but for distributed machine learning, that avoids having to reinvent the wheel. A sort of Mahout for Storm.
Spark is another option. But Mesos is still very early days and also here a Mahout for Spark would be a good addition. In comparison with Storm, Spark would be ideal for training complex machine learning algorithms that need to iterate millions of times over the same data set.
Graphlab can be an option for those who are looking for social network analytics or other graph-based machine learning.
If you wanted to work with R then you could use packages like Snow or Parallel. But this would mean you need to reinvent a lot of distributed management of processing nodes. Both packages just incorporate the basic functions to launch some external processing nodes but are lacking professional management of a large cluster. You could also look at RHadoop, as long as you are fine with non-real-time on top of Hadoop. For alternatives for RHadoop you could look at Rhipe. Segue is R + Amazon Elastic Map Reduce, etc.
Update: an interesting extension for R (i.e. pbd) has just been released that promises R execution on over 10.000 cores. Read more about is here.
What is missing?
Simplicity, easy to use & reusable. What is needed is a solution that is cross-platform (R, Scala, Java, Python, Matlab, etc.). With a visual interface like RapidMiner or Knime, that allows 80% of the work to be drag-and-drop. With a re-useable library of the most used algorithms for prediction, clustering, classification, outlier detection, dimension reduction, normalization, etc. Ideally with a marketplace for sharing data and algorithms. With an easy interface to manage your data and create reports, think similar to Datameer. Ideally integrated with tools for data cleaning (e.g. Google’s Refine) and ETL (e.g. Pentaho, Talend, Jasper Reports, etc.). But most of all with a powerful distributed engine that allows both batch processing [Hadoop] and real-time [e.g. Storm]. And finally with a one click install.
If my requirements are missing some important aspects, let me know. If you want to construct such a system, please contact me…
The website defines Spark as a MapReduce-like cluster computing framework designed to support low-latency iterative jobs. However it would be easier to say that Spark is Hadoop for real-time.
Spark allows you to run MapReduce jobs together with your data on distributed machines. Unlike Hadoop Spark can distributed your data in slices and store it in memory hence your processing and data are co-located in memory. This gives an enormous performance boost. Spark is more than MapReduce however. It offers a new distributed framework on which different distributed computing paradigms can be modelled. Examples are: Hadoop’s Hive => Shark (40x faster than Hive), Google’s Pregel / Apache’s Giraph => Bagel, etc. An upcoming Spark Streaming is supposed to bring real-time streaming to the framework.
The excellent part
Spark is written in Scala and has a very straight forward syntax to run applications from the command line or via compiled code. The possibilities to run iterative operations over large datasets or very compute intensive operations in parallel, make it ideal for big data analytics and distributed machine learning.
The points for improvement
In order to use Spark, you need to install Mesos. Mesos is a framework for distributed computing that was also developed by Berkeley. So in a sense they are eating their own dog food. Unfortunately Mesos is not written in scala so installing Spark becomes a mix of make’s, ant’s, .sh, XML, properties, .conf, etc. It would not be bad if Mesos would have consistent documentation but due to incubation into Apache the installation process is currently undergoing changes and is not straightforward.
Spark allows to connect to Hadoop, Hbase, etc. However running Hadoop on top of Mesos is “experimental” to say the least. The integration with Hadoop should be lighter. At the end only access to HDFS, SequenceFiles, etc. is required. This should not mean that a complete Hadoop should be installed and Spark should be recompiled for each specific Hadoop version.
If Spark wants to become as successful as Hadoop, then they should learn from Hadoop’s mistakes. Complex installation is a big problem because Spark needs to be installed on many machines. The Spark team should take a look at Ruby’s Rubygems, Node.js’s npm, etc. and make the installation simple, ideally via Scala’s package manager, although it is less popular.
If possible the team should drop Mesos as a prerequisite and make it optional. One of Spark’s competitors is Storm & Trident, you can install a Storm cluster in minutes and have a one click command to run Storm on an EC2 cluster.
It would be nice if there would be an integration SDK that allows extensions to be plugged-in. Integrations with Cassandra, Redis, Memcache, etc. could be developed by others. Also looking at a distribution in which Cassandra’s Brisk is used to mimic Hive and HDFS (a.k.a. CassandraFS) and have it all pre-bundled with Shark, could be an option. Spark’s in-memory execution and read speed, combined with Cassandra’s write speed, should make for a pretty quick and scalable solution. Ideally without the need to fight with namenodes, datanodes, jobtrackers, etc. and other Hadoop hard-to-configure inventions…
The conclusion is that distributed computing and programming is already hard enough by itself. Programmers should be focusing on their algorithms and not need a professional admin to get them started.
All-in-all Spark, Shark, Streaming Spark, Bagel, etc. have a lot of potential, it is just a little bit rough around the edges…
Update: I am reviewing my opinion about Mesos. See the Mesos post.
Hadoop has run into architectural limitations and the community has started working on the Next Generation Hadoop [NGN Hadoop]. NGN Hadoop has some new management features of which multi-tenant application management is the major one. However the key change is that MapReduce no longer is entangled inside the rest of Hadoop. This will allow Hadoop to be used for MPI, Machine Learning, Master-Worker, Iterative Processing, Graph Processing, etc. New tools to better manage Hadoop are also being incubated, e.g. Ambari and HCatalog.
Why is this important for telecom?
Having one platform that allows massive data storage, peta-byte data analytics, complex parallel computations, large-scale machine learning, big data map reduce processing, etc. all in one multi-tenant set-up means that telecom operators could see massive reductions in their architecture costs together with faster go-to-market, better data intelligence, etc.
Telecom applications, that are redesigned around this new paradigm, can all use one shared back-office architecture. Having data centralized into one large Hadoop cluster instead of tens or hundreds of application-specific databases, will enable unseen data analytics possibilities and bring much-needed efficiencies.
What is needed is that several large operators define this approach as their standard architecture hence telecom solution providers will start incorporating it into their solutions. Commercial support can be easily acquired from companies like Hortonworks, Cloudera, etc.
Having one shared data architecture and multi-tenant application virtualization in the form of a Telco PaaS would allow third-parties to launch new services quickly and cheaply, think days in stead of years…
The dotcoms of this world generate massive amounts of web server logs. Users upload files. Make comments. Vote on items. All this data is unstructured data. Google has just pushed the bar higher with their Google instant that is able to change almost instantly* their search results based on real-time data changes. Percolator is the real-time indexing that makes it all possible. Really impressive to see indexes change almost real-time for a company that moves daily 20 peta bytes of data [= almost 30.000.000 CDs].
So if Google is able to index all our web data, what about our voice and video data? Google Voice brought voice transcription to the general public. Voice transcription is based on machine learning. Every time a voicemail is incorrectly transcribed, users are able to “teach” Google how to do it right. This will make Google’s voice transcription quickly the best trained in the world. From there it is only a small step to also connect it to all Google Voice calls. Next step is to index what you say. Real-time indexing is key to interpret this vast amount of new content! So not only when you send a Gmail message talking about a trip you are planning to Paris, will you get advertisement from travel agents, but also when you talk to your friends about the trip.
Can Google go any further? Yes of course. Don’t forget Google Talk and Android. Two more sources to get voice data from. Google Talk can be used now. Android probably only if you use Google Talk or Google Voice on your mobile given the excessive data charges you would get if you would send “normal circuit” calls to Google.
Where are today’s limits of image recognition? Especially in videos? Taking a video with an Android phone and uploading it to Youtube could also mean that GPS data of where the video was taken could be included. Google Goggles ‘ technology could then find out what it is you were doing. Probably not possible today but let´s wait some months…
What this means is that Google will very likely be able to subsidize more customer services if you are willing to trade-in some extra privacy. “Free” calls and even video calls for Android will definitely set it apart from iPhones, Nokias, etc. Google would need to subsidize wholesale interconnects to other providers, which are not that bad compared to end-user prices. But could marketing dollars also subsidize a monthly data plan? If yes then Google could become an MVNO and offer free phone and data services. This would be a killer feature to subscribe to Google Voice on the mobile. Killer in both senses, but not the positive sense for operators…
Where does this leave operators? Again a piece of their voice and SMS pie will disappear. But also a large piece of monthly subscription fees. Today there is very few operators can do to defend themselves if they don’t change their own rules. Every operator that thinks their assets are sacred, RFQ’s will bring innovation and scalability is about writing a large check to Oracle, will likely suffer. Those that are willing to experiment with disruptive innovations and are open to discuss the previously unthinkable, still have a window of opportunity…
* There are delays between the time a page is updated and the new results being visible due to crawler and indexing delays so real-time indexing does not mean real-time search results.
Google has changed very little to its basic architecture building blocks over the years. Everything runs on top of the Google File System and Bigtable. Except for Google Instant which is reversing Map-Reduce usage, new services have been reusing their existing architecture.
Similar observations can be made for the rest of the main players. So why is it that Telecom operators have not invested in one architecture to launch multiple services? No idea.
One architecture for VAS
The concept is simple. Create one common architecture. This architecture should have multiple components:
- A high-available real-time data store – stores all application and user data
- A right-time data analytics service – allows collective intelligence and data mining
- An asset exposure layer – applications can re-use network assets and get isolated from internal complexities
- Presentation layer – facilitate mobile GUI and Web 2.0 development
- Application Engine – allows applications to run and focus on business logic instead of scaling and integration
- Continuous Deployment – instead of monthly big-bang deployments, incremental daily or weekly releases are possible, even hourly like some dotcoms.
- Unified Administration – one place to know what is happening both technically and business-wise with the applications.
- Long-Tail Business Link – all business and accounting transactions for customers, partners, providers, etc. are centralized.
Having such an architecture in place would allow telco innovations to be brought to market at least ten times faster. Application and service designers would have to focus on business logic and not on the rest. Administrators would have one platform to manage and not a puzzle of systems. Integrations would have to be done ones to a common integration layer.
Building such an architecture should be done in the dotcom style and not a telco RFQ. Only by doing iterative projects which bring together the components can you build an architecture that is really used and not a side project that starts to have its own life.
It even makes sense to open source the architecture. Telco’s business is not about building architectures hence having a common platform that was started by one would benefit the whole industry. It even would give a competitive advantage to the one that started the architecture for knowing it better than any competitor. Of course for this to happen, a telco has to recognize that their future major competitors are not the neighboring telco but a global dotcom…