Archive

Posts Tagged ‘pig’

Scaling Machine Learning

October 17, 2012 1 comment

There is still a vacuum when it comes to easy and scalable solutions in the machine learning space.

At the moment everybody is talking about Hadoop as the de-facto standard for Big Data. Unfortunately Hadoop is not a real-time system. Map-reduce can be used for batch machine learning, like training a Logistic Regression, Support Vector Machine or Neural Network, running Batch Gradient Descent, etc. However, when it comes to real-time predictions it is not the platform of choice. Additionally, Java is losing its status as the preferred language a little more every day; new machine learning algorithms are more likely to be developed in R, Scala, Python, Go, etc. There is of course Mahout, which is scalable, but "easy" is not a word you would associate with it.
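
To make the batch side concrete, here is a minimal sketch of how one Batch Gradient Descent step for logistic regression maps onto that pattern: each data split computes a partial gradient ("map"), and the partial gradients are summed to update the weights ("reduce"). This is plain Java for illustration only, not Hadoop API code, and the class and method names are made up.

import java.util.List;

// Hypothetical sketch, not Hadoop API code: one Batch Gradient Descent step for
// logistic regression, expressed in the map-reduce shape described above.
public class BatchGradientStep {

    // "Map" side: every split of the training data emits one partial gradient.
    static double[] partialGradient(double[][] features, double[] labels, double[] weights) {
        double[] gradient = new double[weights.length];
        for (int i = 0; i < features.length; i++) {
            double z = 0.0;
            for (int j = 0; j < weights.length; j++) z += weights[j] * features[i][j];
            double error = 1.0 / (1.0 + Math.exp(-z)) - labels[i];  // sigmoid(w.x) - label
            for (int j = 0; j < weights.length; j++) gradient[j] += error * features[i][j];
        }
        return gradient;
    }

    // "Reduce" side: sum the partial gradients and take one descent step.
    static double[] update(double[] weights, List<double[]> partialGradients, double learningRate) {
        double[] next = weights.clone();
        for (double[] partial : partialGradients)
            for (int j = 0; j < next.length; j++) next[j] -= learningRate * partial[j];
        return next;
    }
}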

If you want to create your own algorithms but do not want to write low-level Java Map-Reduce, there are some alternatives: Pig [for the SQL-minded], Cascading [Java, but easy and it allows test-driven development!], Scalding [Scala on top of Cascading, made by Twitter; it can be combined with libraries like Scalala for easy vector and matrix operations, similar to Matlab], etc.

What other options are there?
Storm could be an option for time series, predictions based on a pre-trained model, online learning algorithms, etc. What is missing, however, is an extension like Trident but for distributed machine learning, so that you do not have to reinvent the wheel: a sort of Mahout for Storm.

Spark is another option, but Mesos is still in its very early days and here too a Mahout for Spark would be a good addition. Compared with Storm, Spark would be ideal for training complex machine learning algorithms that need to iterate millions of times over the same data set.
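
As a hedged sketch of why Spark suits that iterative profile: the training data is cached in memory once and every iteration re-uses it instead of re-reading it from disk. This assumes the Spark Java API with Java 8 lambdas (newer than this post), and the input path, file format and toy gradient are made up for illustration.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch: cache the training set once, then iterate over it many times in memory.
public class IterativeTraining {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("iterative-training").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Parse once, keep in memory for all following iterations (made-up path/format).
        JavaRDD<double[]> points = sc.textFile("hdfs:///training/points.csv")
                .map(line -> {
                    String[] parts = line.split(",");
                    double[] p = new double[parts.length];
                    for (int i = 0; i < parts.length; i++) p[i] = Double.parseDouble(parts[i]);
                    return p;
                })
                .cache();

        double weight = 0.0;
        for (int iteration = 0; iteration < 100; iteration++) {
            final double w = weight;
            // Each pass is a distributed map + reduce over the cached data set.
            double gradient = points
                    .map(p -> (w * p[0] - p[1]) * p[0])   // toy squared-error gradient
                    .reduce((a, b) -> a + b);
            weight -= 0.01 * gradient;
        }
        System.out.println("learned weight: " + weight);
        sc.stop();
    }
}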

GraphLab can be an option for those who are looking for social network analytics or other graph-based machine learning.

If you want to work with R you could use packages like snow or parallel, but this means reinventing a lot of the distributed management of processing nodes: both packages only provide the basic functions to launch external processing nodes and lack professional management of a large cluster. You could also look at RHadoop, as long as you are fine with non-real-time processing on top of Hadoop. As alternatives to RHadoop there are RHIPE, or Segue for R on Amazon Elastic MapReduce, etc.
Update: an interesting extension for R, pbd, has just been released that promises R execution on over 10,000 cores. Read more about it here.

What is missing?

Simplicity, ease of use and reusability. What is needed is a solution that is cross-platform (R, Scala, Java, Python, Matlab, etc.). With a visual interface like RapidMiner or KNIME that allows 80% of the work to be drag-and-drop. With a reusable library of the most used algorithms for prediction, clustering, classification, outlier detection, dimension reduction, normalization, etc. Ideally with a marketplace for sharing data and algorithms. With an easy interface to manage your data and create reports, similar to Datameer. Ideally integrated with tools for data cleaning (e.g. Google Refine) and ETL (e.g. Pentaho, Talend, Jasper Reports, etc.). But most of all with a powerful distributed engine that allows both batch processing [Hadoop] and real-time processing [e.g. Storm]. And finally with a one-click install.

If my requirements are missing some important aspects, let me know. If you want to construct such a system, please contact me…

Trident Storm, Real-Time Analytics for Big Data

August 13, 2012 4 comments

In a previous post I already mentioned Storm. Trident is an extension of Storm that turns it into an easy-to-use distributed real-time analytics framework for Big Data. Both come out of Twitter (Storm was originally created at BackType, which Twitter acquired).

One of Twitter's major problems is keeping statistics on Tweets and tweeted URLs that get retweeted to millions of followers. Imagine a famous person who tweets a URL to millions of followers; lots of them will retweet it. So how do you calculate how many Twitter users have potentially seen the URL? This number is the "reach" of the URL, and it matters for features like "Top retweeted URLs".

The answer was Storm, but with the addition of Trident it has become a lot easier to manage. Trident does to Storm what Pig and Cascading do to Hadoop: simplification. Instead of having to create lots of Spouts and Bolts and take care of how messages are distributed yourself, Trident comes with a lot of that work already done.

In a few lines of code you set up a Distributed RPC server, send it URLs, and have it collect the tweeters and their followers and count them. Fail-over and resilience, as well as massive distributed throughput, are built into the platform. You can see it in this example code:
// Set up the topology and the two static, pre-computed states
// (from the Trident tutorial; MapGet and Count are Trident built-ins,
// ExpandList and One are small user-defined helpers).
TridentTopology topology = new TridentTopology();
TridentState urlToTweeters =
    topology.newStaticState(getUrlToTweetersState());
TridentState tweetersToFollowers =
    topology.newStaticState(getTweeterToFollowersState());

// For every DRPC "reach" request: look up who tweeted the URL, expand to
// their followers, de-duplicate the followers and count them.
topology.newDRPCStream("reach")
    .stateQuery(urlToTweeters, new Fields("args"), new MapGet(), new Fields("tweeters"))
    .each(new Fields("tweeters"), new ExpandList(), new Fields("tweeter"))
    .shuffle()
    .stateQuery(tweetersToFollowers, new Fields("tweeter"), new MapGet(), new Fields("followers"))
    .parallelismHint(200)
    .each(new Fields("followers"), new ExpandList(), new Fields("follower"))
    .groupBy(new Fields("follower"))
    .aggregate(new One(), new Fields("one"))
    .parallelismHint(20)
    .aggregate(new Count(), new Fields("reach"));

The possibilities of Trident + Storm, combined with fast, scalable datastores like Cassandra, are enormous: everything from real-time counters and filtering to complex event processing and machine learning.
The Storm concepts of Spout [data generation] and Bolt [data processing] can easily be understood by most programmers. Storm is an asynchronous, highly distributed framework, but with a simple Distributed RPC server it can easily be used from synchronous code.
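
To make the Spout/Bolt idea concrete, here is a minimal, hedged sketch of a Bolt written against the backtype.storm API of that era; the class name and field names are invented, and an upstream Spout is assumed to emit one word per tuple.

import java.util.HashMap;
import java.util.Map;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical Bolt [data processing]: receives single words emitted by a
// Spout [data generation] and emits a running count per word.
public class WordCountBolt extends BaseBasicBolt {

    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        count = (count == null) ? 1 : count + 1;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}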

The only drawback I have seen is that DRPC is focused only on Strings (and other primitive types that can be carried in a String). Support for more complex objects (via Kryo, Avro, Protocol Buffers, etc.), or at least raw bytes, would be useful for companies that do not only focus on Tweets.
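
A hedged workaround sketch (not part of Trident or Storm itself, and the helper name is made up): serialize the richer request object to bytes on the client, Base64-encode it into the DRPC argument String, and decode it again inside the topology. Plain Java serialization and the JDK's DatatypeConverter are used here only to keep the sketch self-contained; Kryo, Avro or Protocol Buffers would be the better fit.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import javax.xml.bind.DatatypeConverter;

// Hypothetical helper: tunnel an arbitrary Serializable object through the
// String-only DRPC argument by Base64-encoding its serialized form.
public class DrpcArgCodec {

    public static String encode(Serializable request) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(request);
        }
        return DatatypeConverter.printBase64Binary(bytes.toByteArray());
    }

    public static Object decode(String drpcArg) throws IOException, ClassNotFoundException {
        byte[] raw = DatatypeConverter.parseBase64Binary(drpcArg);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        }
    }
}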

Big Data Apps and Big Data PaaS

March 21, 2012 5 comments

Enterprises no longer lack data; it can be obtained from everywhere. The hard part is converting data into valuable information that triggers positive actions. The problem is that you currently need four experts to get this process up and running:

1) Data ETL expert – is able to extract, transform and load data into a central system.

2) Data Mining expert – is able to suggest great statistical algorithms and able to interpret the results.

3) Big Data programmer – is an expert in Hadoop, Map-Reduce, Pig, Hive, HBase, etc.

4) Business expert – is able to guide the other three experts towards extracting the right information and taking the right actions based on the results.

A Big Data PaaS should focus on making sure that the first three are needed as little as possible. Ideally they are not needed at all.

How could a business expert be enabled in Big Data?

The answer is Big Data Apps and a Big Data PaaS. What if a Big Data PaaS were available, ideally both open source and hosted, that came with a community marketplace for Big Data ETL connectors and Big Data Apps? You would have Big Data ETL connectors for all major databases, Excel, Access, web server logs, Twitter, Facebook, LinkedIn, etc. For a fee, different data sources could be accessed in order to enhance the quality of the data. Companies should be able to easily buy access to other companies' data on a pay-as-you-use basis.

The next step is Big Data Apps. Business experts often have very simple questions: "Which age group is buying my product?", "Which products are also bought by my customers?", etc. Small reusable Big Data Apps could be built by experts and reused by business experts.

A Big Data App example

A medium-sized company sells household appliances. The company has one database with all its customers and another with all its product sales. What if a Big Data App could find which products tend to be sold together, and whether specific customer features (age, gender, customer since, hobbies, income, number of children, etc.) or other features (e.g. time of the year) are significant? Customer data in the company's database could be enhanced with publicly available information (from Facebook, Twitter, LinkedIn, etc.). Perhaps the Big Data App would find out that parents (number of children > 0) whose children like football (Facebook) are 90% more likely to buy waffle makers, pancake makers, oil fryers, etc. three times a year. Local football clubs might organize events three times a year to gain extra funding, and sponsorship, direct mailing, special offers, etc. could all help to attract more parents of football-loving kids to the shop.
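
The analytical core of such a "products sold together" app can be surprisingly small. Below is a hedged, plain-Java sketch (the data layout and class name are invented for illustration) of the simplest form of market-basket analysis: counting how often every pair of products appears in the same sales transaction.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical core of a "products sold together" Big Data App:
// count co-occurrences of product pairs across sales transactions.
public class ProductsSoldTogether {

    public static Map<String, Integer> pairCounts(List<List<String>> transactions) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> basket : transactions) {
            for (int i = 0; i < basket.size(); i++) {
                for (int j = i + 1; j < basket.size(); j++) {
                    // Order the pair so "oil fryer|waffle maker" and the reverse count together.
                    String a = basket.get(i), b = basket.get(j);
                    String key = a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
                    counts.merge(key, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> sales = Arrays.asList(
                Arrays.asList("waffle maker", "pancake maker"),
                Arrays.asList("waffle maker", "pancake maker", "oil fryer"),
                Arrays.asList("oil fryer", "pancake maker"));
        pairCounts(sales).forEach((pair, n) -> System.out.println(pair + " -> " + n));
    }
}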

Each Big Data App would focus on solving one specific problem: "Finding products that are sold together", "Clustering customers based on social aspects", etc. As long as a simple wizard can guide a non-technical user in selecting the right data sources and understanding the results, it can be packaged up as a Big Data App. A marketplace could exist for the best Big Data Apps. External Big Data PaaS platforms could also allow data from different enterprises to be brought together and generate extra revenue, as long as individual persons cannot be identified.

Open Source Big Data Reporting & ETL show promises

March 16, 2012 1 comment

With Hadoop/HBase/Hive, Cassandra, etc. you can store and manipulate petabytes of data. But what if you want to get nice-looking reports or compare data held in a NoSQL solution with data held elsewhere? The two market leaders in the Open Source business intelligence space are now putting all their firepower into Big Data.

Pentaho Big Data seems to be a bit further ahead. They offer a graphical ETL tool, a report designer and a business intelligence server. These are existing tools, but support for Hadoop HDFS, Map-Reduce, HBase, Hive, Pig, Cassandra, etc. has been added.

Jaspersoft's Open Source Big Data strategy is a little behind because the connectors are not yet included in the main product, and several are still of beta quality with missing documentation.

Both companies will accelerate the adoption of Big Data, since one of the main problems with Big Data is easy reporting: unstructured data is much harder to format into a well-structured report than structured data is. Any solution that makes this possible, and is Open Source on top of that, is very welcome in times of cost cutting…

Open Source Solution Index from the Big Dotcoms

January 26, 2012 Leave a comment

The big names in the dotcom world are busy open sourcing some of their secret sauce. It is very important to become familiar with these often strangely named projects because they are responsible for several competitive advantages. Since the list keeps growing, please suggest new solutions in the comments section so they can be added.

Google

Facebook

  • Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store
  • Hive is a data warehouse infrastructure that provides data summarization and ad hoc querying.
  • FlashCache is a general purpose writeback block cache for Linux, developed as a loadable kernel module using the Device Mapper; it sits below the filesystem.
  • HipHop for PHP transforms PHP source code into highly optimized C++. HipHop offers large performance gains and was developed over the past two years.
  • Open Compute Project is an open hardware project that aims to accelerate data center and server innovation while increasing computing efficiency through collaboration on relevant best practices and technical specifications.
  • Scribe is a scalable service for aggregating log data streamed in real time from a large number of servers.
  • Thrift provides a framework for scalable cross-language services development in C++, Java, Python, PHP, and Ruby.
  • Tornado is a relatively simple, non-blocking web server framework written in Python. It is designed to handle thousands of simultaneous connections, making it ideal for real-time Web services.
  • codemod assists with large-scale codebase refactors that can be partially automated but still require human oversight and occasional intervention.
  • Facebook Animation is a JavaScript library for creating customizable animations using DOM and CSS manipulation.
  • Online Schema Change for MySQL lets you alter large database tables without taking your cluster offline.
  • Phabricator is a collection of web applications which make it easier to write, review, and share source code. It is currently available as an early release and is used by hundreds of Facebook engineers every day.
  • PHPEmbed makes embedding PHP truly simple; it is a more accessible and simplified API built on top of the PHP SAPI.
  • phpsh provides an interactive shell for PHP that features readline history, tab completion, and quick access to documentation. It is ironically written mostly in Python.
  • Three20 is an Objective-C library for iPhone developers which provides many of the UI elements and data helpers behind the Facebook iPhone application.
  • XHP is a PHP extension which augments the syntax of the language such that XML document fragments become valid expressions.
  • XHProf is a function-level hierarchical profiler for PHP with a simple HTML-based navigational interface.

Twitter

Twitter has open sourced some complete projects (e.g. FlockDB) but mostly contributes extensions to existing projects. For a full list see here.

Yahoo

  • Apache Traffic Server is a fast, scalable and extensible HTTP/1.1-compliant caching proxy server.
  • Hadoop, THE Big Data platform of the moment, grew up largely inside Yahoo, which also actively contributes to several related projects like Avro and Pig.
  • YUI is a free, open source JavaScript and CSS framework for building richly interactive web applications.

LinkedIn

  • Azkaban is a simple batch scheduler for constructing and running Hadoop jobs or other offline processes.
  • Bobo is a faceted search implementation written purely in Java, as an extension of Apache Lucene.
  • Cleo is a flexible, partial, out-of-order and real-time typeahead search library.
  • DataFu is a Hadoop library for large-scale data processing.
  • Decomposer is a library for massive matrix decompositions.
  • Glu is a deployment automation platform.
  • A set of useful Gradle plugins.
  • The IndexTank indexing engine, plus the IndexTank API, BackOffice, Storefront, and Nebulizer.
  • Kafka is a distributed publish/subscribe messaging system.
  • Kamikaze is a utility package for performing operations on compressed arrays of sorted integers.
  • Krati is a simple persistent data store with very low latency and high throughput.
  • Base utilities shared by all LinkedIn open source projects.
  • A set of utility classes and wrappers around ZooKeeper.
  • Norbert is a library that provides easy cluster management and workload distribution.
  • Sensei is a distributed, elastic, real-time, searchable database.
  • Voldemort is a distributed key-value storage system.
  • Zoie is a real-time search and indexing system built on Apache Lucene.

Scaling to 500 million users

September 10, 2010 Leave a comment

In the telecom domain a scalable real-time architecture means paying a lot of money for hardware and licenses: you buy the Oracle RAC solution, build a WebLogic cluster, set up a storage area network, etc.

In the dotcom world things look different. Facebook, Google, Twitter, Yahoo, Amazon, etc. have more active users than any telecom system, yet they have built their architectures on top of open source solutions and commodity servers. Some even built their own software and in some cases open sourced it.

Some of this software has very exotic names: Hadoop, Bigtable, Cassandra, Pig, Elephant-Bird, Dremel, Pregel, Dynamo, etc. Additionally design decisions are taken that would surprise every IT teacher: “do not normalize”, “do not expect immediate consistency”, “no transaction support”, “store in memory instead of on disk”, etc.

However, if you can support 500 million users, 100 million daily hits, 130TB of logs, 20 billion tweets, 1 million servers, etc., then you must be doing something right.

The telecom software industry seems to have been isolated from the Internet during the last five years. With the shift to IP it is expected that more IT companies will be able to provide telecom solutions. Is this the solution? Not sure! IT companies are still playing catch-up in the cloud computing domain, and few IT solution providers are demonstrating that they now think Map-Reduce instead of middleware.

Google Voice is coming, yet most operators still seem more worried about churning subscribers. Google Latitude and Maps demonstrated that with new technology and innovation you can destroy the telecom monopoly on location-based services overnight…

If you are a telecom operator and you are worried, perhaps it is time we talk.
