Data volumes are growing exponentially. Unstructured data from Twitter, LinkedIn, Mailling Lists, etc. has the potential to transform many industries if it could be combined with structured data. Machine learning, natural language processing, sentiment analysis, etc. everybody talks about them, hardly anybody is really using them at scale. Too many people when they talk about Big Data unfortunately start with the answer and then ask what the problem it. The answer seems to be Hadoop. News flash: Hadoop is not the answer and if you start from the answer to look for problems then you are doing it wrong.
What are Common Data Problems?
Most Big Data problems are about storage and reporting. How do I store all the exponentially growing data in such a way that business managers can get to in seconds when they need it? Ad-hoc reporting, adequate prediction, and making sense of the exponentially growing data stream are the key problems.
Big Data Storage?
Do you have relational data, unstructured data, graph data, etc.? How do you store different types of data and make it available inside an enterprise? The basics for big data storage is cloud storage technology. You want to store any type of data and be able to quickly scale up storage. RedHat did not buy Inktank for $175M because traditional storage has solved all of today’s problems. Premium SAN and other storage technologies are old school. They are too expensive for Big Data. They were designed with the idea that each byte of data is critical for an enterprise. Unfortunately this is no longer the case. You mind loosing transactional sales data. You don’t mind so much loosing sample tweets you bought from Datasift or Apache log files from an internal low-impact server. This is where cloud storage solutions like Inktank’s Ceph allow commodity storage to be built that is reliable, scalable and extremely cost effective. Does this mean you don’t need SANs any more? Wrong again. TV did not kill Radio. Same here.
Cloud storage technologies are needed because each type of data behaves differently. If you have log data that only is appended then HDFS is fine. If you have read-mostly data then a relational database is ideal. If you have write-mostly data then you need to look at NoSQL. If you need heavy read-and-write then you need strong Big Data architecture skills. What is more important: short latency, consistency, reliability, cheap storage, etc.? Each of these means that the solution is different. No latency means in-memory or SSD. Consistency means transactional. Reliability means replication. You can even now find inconsistent databases like BlinkDB. There is no longer one size fits all. Oracle is no longer the answer to everybody’s data questions.
What will companies need? Companies need cloud storage solutions that offer these different storage capabilities like a service. Amazon’s RDS, DynamoDB, S3 and Redshift are examples of what companies need. However companies need more flexibility. They need to be able to migrate their data between public cloud providers to optimise their costs and have added security. They also need to be able to store data in private local clouds or nearby hosted private clouds for latency or regulatory reasons.
The future of ETL & BI
Traditional ETL will see a revolution. ETL never worked. Business managers don’t want to go and ask their IT department to make a change in a star schema in order to import some extra data from the Internet followed by updates to reports and dashboards. Business managers want an easy to use tool that can answer their ad-hoc queries. This is the reason why Tableau Software + Amazon Redshift are growing like crazy. However if your organisation is starting to pump terabytes of data into Redshift, please be warned: The day will come that Amazon sends you a bill that your CxO will not want to pay and he/she will want you to move out of Amazon. What will you do then? Do you have an exit strategy?
The future of ETL and BI will be web tools that any business manager can use to create ad-hoc reports. The Office generation wants to see dynamic HTML5 GUIs that allow them to drag-and-drop data queries into ad-hoc reports and dashboards. If you need training then the tool is too difficult.
These next-generation BI tools will need dynamic back-office solutions that allow storing real-time, graph, blob, historical relational, unstructured, etc. data into a commonly accessible cloud storage solution. Each one will be hosted by a different cloud service but they will all be an API away. Software will be packaged in such a way that it knows how to export its own data. Why do you need to know where Apache stores the access and error logs and in which format? Apache should be able to export whatever interesting information it contains in a standardised way into some deep storage. Machine learning should be used to make decisions on how best to store that data for ad-hoc reporting afterwards. Humans should no longer be involved in this process.
Talking about machine learning. With the volumes of data growing from gigabytes into petabytes, traditional data scientists will not scale. In many companies a data scientist is similar to a report monkey: “Find out why in region X we sold Y% less”, etc. Data scientist should not be synonymous for dynamic report generators. Data scientists should be machine learning experts. They should tell the computer what they want, not how they want it. Today’s data scientists pride themselves they know R, Python, etc. These tools are too low-level to be usable at scale. There are just not enough people in the world to learn R. Data is growing exponentially, R experts at best can grow linear. What we need are machine learning GUI solutions like RapidMiner Studio but supported by Petabyte cloud solutions. A short term solution could be an HTML5 GUI version of RapidMiner Studio that connects to a back-end set of cloud services that use some of the nice Apache Spark extensions for machine learning, streaming, Big Data warehousing/SQL, graph retrieval, etc. or solutions based on Druid.io. For sure there are other solutions possible.
What is important is that companies start realising that data is becoming a strategic weapon. Those companies that are able to collect more of it and convert it into valuable knowledge and wisdom will be tomorrow’s giants. Most average machine learning algorithms become substantially better just by throwing more and more data at them. This means that having a Big Data architecture is not as critical as having the best trained models in the industry and continue to train them. There will be a data divide between the have’s and have-not’s. Google, Facebook, Microsoft and others have been buying any startup that smells like Deep Belief Networks. They have done this with a good reason. They know that tomorrow’s algorithms and models will be more valuable than diamonds and gold. If you want to be one of the have’s then you need to invest in cloud storage now. You need to have massive historical data volumes to train tomorrow’s algorithms and start building the foundations today…
If you just invested a lot of money in a Big Data solution from any of the traditional BI vendors (Teradata, IBM, Oracle, SAS, EMC, HP, etc.) then you are likely to see a sub-optimal ROI in 2013.
Several innovations will come in 2013 that will change the value of Big Data exponentially. Other technology innovations are just waiting for smart start-ups to put them into good use.
The first major innovation will be Google’s Dremel-like solutions coming of age like Impala, Drill, etc. They will allow real-time queries on Big Data and be open source. So you will get a superior offering compared to what is currently available for free.
Cloud-Based Big Data Solutions
The absolute market leader is Amazon with EMR. Elastic Map Reduce is not so much about being able to run a Map Reduce operation in the Cloud but about paying for what you use and not more. The traditional BI vendors are still getting their head around a usage-based licensing for the Cloud. Except a lot of smart startups to come up with really innovative Big Data and Cloud solutions.
Big Data Appliances
You can buy some really expensive Big Data Appliances but also here disruptive players are likely to change the market. GPUs are relatively cheap. Stack them into servers and use something like Virtual OpenCL to make your own GPU virtualization cluster solution. These type of home-made GPU clusters are already being used for security Big Data related work.
Finally Parallella will put a 16-core supercomputer into everybody’s hands for $99. Their 2013 supercomputer challenge is definitely something to keep your eyes on. Their roadmap talks about 64 and 1000 core versions. If Adapteva can keep their promises and flood the market with Parallella’s then expect Parallella Clusters to be 2013 Big Data Appliance.
Distributed Machine Learning
Mahout is a cool project but Map Reduce might not be the best possible architecture to run iterative distributed backpropagation or any other machine learning algorithms. Jubatus looks promising. Also algorithm innovations like HogWild could really change the dynamics for efficient distributed machine learning. This space is definitely ready for more ground-breaking innovations in 2013.
Easier Big Data Tools
This is still a big white spot in the Open Source field. Having Open Source and easy to use drag-and-drop tools for Big Data Analytics would really excel the adoption. We already have some good commercial examples (Radoop = RapidMiner + Mahout, Tableau, Datameer, etc.) but we are missing good Open Source tools.
I am currently looking for new challenges so if you are active in the Big Data space and are looking for a knowledgable senior executive be sure to contact me at maarten at telruptive dot com.
Data Scientist is going to be the sexiest job of the 21st century. However do we really need a new army of Data Scientists or is there an alternative? There might be and it is called data democracy.
What is data democracy?
Data democracy allows all people to have access to all data insight. In an enterprise, data democracy is about enabling knowledge workers to share insights. To avoid the construction of data silos. To democratize tools that enable each co-worker to become a data scientist without needing a PhD in statistics, mathematics, etc. Visual tools that allow “Excel-users” to use Neural Networks, Support Vector Machines, Random Forests, etc. to make predictions, to classify or cluster data, etc. But without the need to understand the underlying computer learning algorithms into great detail. A sort of corporate RapidMiner that scales.
At the same time we also need better visualization tools. Everybody should be able to create infographics easily. Tools that allow ordinary people to create stunning data visualizations that go beyond the boring reports.
Finally we need better tools to find and share data insights. We need a “Databook”. A Facebook to easily find the data insight you need. A tool that allows you to distribute your predictions about next quarter’s sales and to compare them with the predictions of others.
In summary, we need the data scientists of this world to focus on making access to data insight available to every knowledge worker. Simplify instead of algorithmify! Enable everybody to be a data scientist…
There is currently still a vacuum for easy & scalable solutions in the machine learning space.
At the moment everybody is talking about Hadoop as the de-facto standard for Big Data. Unfortunately Hadoop is not a real-time system. Map-reduce can be used for batch machine learning like training a Logistic Regression/Support Vector Machine/Neural Network, Batch Gradient Descent, etc. However when it comes to real-time predictions it is not the platform of choice. Additionally Java is loosing every day its status of preferred language. New machine learning algorithms are more likely to be developed in R, Scala, Python, Go etc. There is of course Mahout which is scalable but the word “easy” is not a synonym.
If you want to create your own algorithms but do not want to go low-level Java Map-Reduce, then there are some alternatives like Pig [for the SQL-minded], Cascading [Java but easy and allows test driven development!], Scalding [Scala on top of Cascading, made by Twitter. Could be combined with libraries like Scalala for easy vector and matrix similar to Matlab], etc.
What other options are there?
Storm could be an option for time series, predictions based on a pre-trained model, online learning algorithms, etc. However what is missing is an extension like Trident, but for distributed machine learning, that avoids having to reinvent the wheel. A sort of Mahout for Storm.
Spark is another option. But Mesos is still very early days and also here a Mahout for Spark would be a good addition. In comparison with Storm, Spark would be ideal for training complex machine learning algorithms that need to iterate millions of times over the same data set.
Graphlab can be an option for those who are looking for social network analytics or other graph-based machine learning.
If you wanted to work with R then you could use packages like Snow or Parallel. But this would mean you need to reinvent a lot of distributed management of processing nodes. Both packages just incorporate the basic functions to launch some external processing nodes but are lacking professional management of a large cluster. You could also look at RHadoop, as long as you are fine with non-real-time on top of Hadoop. For alternatives for RHadoop you could look at Rhipe. Segue is R + Amazon Elastic Map Reduce, etc.
Update: an interesting extension for R (i.e. pbd) has just been released that promises R execution on over 10.000 cores. Read more about is here.
What is missing?
Simplicity, easy to use & reusable. What is needed is a solution that is cross-platform (R, Scala, Java, Python, Matlab, etc.). With a visual interface like RapidMiner or Knime, that allows 80% of the work to be drag-and-drop. With a re-useable library of the most used algorithms for prediction, clustering, classification, outlier detection, dimension reduction, normalization, etc. Ideally with a marketplace for sharing data and algorithms. With an easy interface to manage your data and create reports, think similar to Datameer. Ideally integrated with tools for data cleaning (e.g. Google’s Refine) and ETL (e.g. Pentaho, Talend, Jasper Reports, etc.). But most of all with a powerful distributed engine that allows both batch processing [Hadoop] and real-time [e.g. Storm]. And finally with a one click install.
If my requirements are missing some important aspects, let me know. If you want to construct such a system, please contact me…
M2M sensors are predicted to generate more data than their human counterparts in the coming years. Unfortunately the price that will be paid for moving this traffic will be substantially lower than human data traffic. So it makes sense to think about ways to dramatically reduce the amount of data M2M sensors transmit.
How to do it?
In recent weeks I have been playing with RapidMiner. This program might be soon installed on a lot of Windows machines next to MS-Office. RapidMiner allows complete data mining layman to easily get hidden information out of the data they have at hand in files, Excels, Access, databases, etc.
RapidMiner shows how with some simple drag-and-drop in 5 minutes you can use complex algorithms like Neural Networks, Support Vectors Machines, Bayesian Classifiers, Decision Trees, Genetic Algorithms, etc. to make sense out of data.
The fact that you can easily train an algorithm to take a decision on your behalf could be a key factor to reduce the amount of M2M sensor data. So instead of sending all the data to a central point and making decisions there, you would put intelligence into the sensors.
This artificial sensor intelligence would not only be limited to single sensor failure. By applying genetic and swarm algorithms and copying mother nature, you would be able to have different sensors behave like for instance an ant colony. Individual sensors would start sharing alarm data and if enough or the right sensors agree then they would launch collective alerts.
Wireless technologies based on for instance White Spaces technologies can be used, and are already used for instance in Cambridge, to cheaply have many sensors communicate with one another. Also harvesting techniques should be used to avoid having to install batteries into the sensors.
The last part of the puzzle would be extra features in a M2M PaaS to manage the distribution of intelligence for de-centralized and self-organizing sensor networks. Sensors are likely to send data to a central server in which humans will have to train computers on what type of data is critical. Once the trained models are available, then they can be distributed to sensors. The M2M PaaS would focus from then on, on adjusting the algorithms in case certain alarms were not caught or when alarms were launched unnecessary.
Every company is using Microsoft Office and especially Excel to do some sort of data analytics. However data volumes have grown exponentially and have outgrown Spreadsheets. You need experts in the business domain, in data analytics, in data migration/extraction/transformation/loading, in server management, etc. to get data analytics done on Big Data scale. This makes it expensive and only usable for the happy few.
Why? There must be easier ways to do it.
I think there are. For those unfamiliar with data analytics but eager to learn, you should take a look at a product called RapidMiner. It is close to amazing how a non-expert is able to use Neural Networks, Decision Trees, Support Vector Machines, Genetic Algorithms, etc. and get meaningful results in minutes. The amazing part is also that RapidMiner is open source hence for usage by 1 analyst it is free.
Rapid-i.com, the company behind RapidMiner, also offers server software to run data analytics remotely. It is here where big data opportunities meet easy data analytics. What if RapidMiner data analytics could be ran on hundreds of servers in parallel and you pay by usage just as you pay for any Cloud compute and storage instances?
RapidMiner as a Service
RapidMiner as a Service, RMaaS, would allow millions of business people to be able to analyse Big Data “without Big Investments”. This type of Data Analytics as a Service would provide any SME with the same data analytics tools as large corporations. Data could come from Amazon S3, Amazon’s DynamoDB, Hosted Hadoops, any webservices, any social network, etc.
Visual as a Service
RapidMiner as a Service is only one of the many domain specific tools that could be offered as a visual drag-and-drop Cloud service. VAS as a Service is another example in which complex telecom assets can be easily combined in a drag-and-drop manner. There are many more. These services will be the real revolution of Cloud Computing since they combine IaaS/PaaS/SaaS into a new generation of solutions that bring large savings for new users and potential large revenues for their providers…