Archive for the ‘Distributed Machine Learning’ Category

The Wolf of Wall Things

Hacker News had the following article at the top of its list: The Wolf of Wall Tweet. It describes algorithms that used a rumor and the options market to make millions in seconds. The article also refers to Flash Boys, a famous book about high-frequency trading. The big news, however, is that what the article presents as magical is not magical at all. You can do a lot more, and you will read about examples below.

How does tweet option buying work?

Advances in neural networks have led to Deep Belief Networks (DBNs). In some cases DBNs can do natural language recognition and other types of recognition better than human beings, or at least a lot faster. So a DBN that is trained to read from the Twitter firehose and scan lots of news articles will beat humans on speed. Add an interface to options trading and you have what the article describes.

Taking it to the next level – knowing the future of the economy

What if you knew economic facts with a high degree of certainty before anybody else, and not by fractions of a second but by hours, days, weeks or even months? This is what the Internet of Things combined with DBNs and automated trading can give you. How? Imagine you trade car stocks. There are lots of public street cameras that stream real-time data about what is happening on the roads. Humans look at them to see if there is a lot of traffic. Computers can use computer vision to recognize and count events. So if strategically picked street cameras were hooked up to DBNs, you would be able to count how many trucks leave a factory loaded with cars. You could correlate these counts month after month with the revenue figures of car manufacturers, and then with their stock value.

At the end of a quarter the car manufacturers announce their profits, and a key aspect of their success is how many cars were sold. If you knew weeks or a month in advance that the volume of cars coming out of a factory had picked up dramatically, then you would expect the stock value to go up. If you bought a large quantity of car stocks minutes before the figures come out, trading algorithms would pick up on this and make you lose a lot of the potential profit. However, if you spread purchase orders over weeks in small quantities, HFT algorithms are far less likely to detect your strategy.
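To make the idea concrete, here is a minimal sketch of the two steps described above: correlating camera-based truck counts with reported revenue, and slicing one large purchase into small daily orders. All numbers are made up for illustration, and the function names `pearson` and `split_order` are hypothetical helpers, not part of any trading library.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


def split_order(total_shares, days):
    """Spread one order evenly over `days`; the remainder goes to the first days."""
    base, extra = divmod(total_shares, days)
    return [base + (1 if i < extra else 0) for i in range(days)]


# Hypothetical data: trucks counted per quarter and revenue in billions.
trucks_per_quarter = [1200, 1350, 1100, 1600, 1750]
revenue_per_quarter = [4.1, 4.6, 3.9, 5.3, 5.8]

r = pearson(trucks_per_quarter, revenue_per_quarter)
slices = split_order(10_000, 15)
print(f"correlation: {r:.3f}")
print(f"daily orders: {slices}")
```

A high correlation on historical quarters would only suggest that the truck count is a useful leading indicator; it proves nothing about future prices.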

Street cams are only the beginning

Using street cams would only be the beginning. Add weather sensors and lots of other sensors and you can do magic at large scale: you would have a dashboard of the real economy before anybody else. If you are interested in this subject be sure to reach out on LinkedIn…

10 world changing technology trends

February 13, 2015

1. Block chain
The block chain is the heart of digital currencies like Bitcoin. What most don’t realise yet is that the block chain will be used for managing everything from domain names, artist royalties, escrow contracts, auctions, lotteries, etc. You can do away with middlemen whose only reason for being is making sure they keep on getting a large cut in the value chain. Unless a middleman or governmental institution adds real value, they are in danger of being block chained into the past.
2. Biometric security
A good example is the Nymi, a wearable that listens to your unique heart beat patterns and creates a unique identity. Even if people steal your Nymi, it is of no use since they need your heart to go with it.
3. Deep belief networks
Deep belief networks are the reason why Google’s voice recognition is surprisingly accurate, why Facebook can tag photos automagically, why self-driving cars are becoming possible, etc.
4. Smart labels
They are 1 to 3 millimetres small. They harvest electricity from their environment. They can detect people approaching within half a metre, sometimes even identify them, and identify each product you buy. Your microwave will no longer have to be told how to warm up a frozen meal.
5. Micro-servers
A $35 Raspberry Pi 2 or Odroid is many times more powerful than the first Google server but the size of a credit card. Parallella is $99, the same size, and has almost ten times more cores than the first Google server.
6. Apps and App Stores for Smart Devices
Snappy Ubuntu Core allows developers to create apps like mobile apps but to put them on any smart device from robots & drones to wifi, hubs, industrial gateways, switches, dishwashers, sprinkler controls, etc. Software developers will be able to innovate faster and hardware can be totally repurposed in seconds. A switch can become a robot controller.
7. Edge/proximity/fog clouds
Public clouds often have too much latency for certain use cases. Often connectivity loss is not tolerable. Think about security cameras. In a world where 4K quality IP cameras will become extremely cheap, you want machine learning image recognition to be done locally and not on the other side of the world.
8. Containers and micro-services orchestration
Docker is not new but orchestrating millions of containers and handling super small micro services is still on the bleeding edge.
9. Cheap personalised robots and drones
£35 buys you a robot arm in Maplin in the UK. It is not really useful for major things except for educating the next generation of robot makers. Robots and drones will have apps (point 6), thanks to which personalised robots and drones are happening this year.
10. Smart watches and hubs
Smart hubs know who is in the house, where they are (if you carry a phone, health wearable or smart watch), what their physical state is (heartbeat via smart watch), what your face looks like and what your voice sounds like. Your smart watch will know more about you than you want relatives to know. Today Google knows a husband is getting a divorce before he does [wife searches and uses google maps]. Tomorrow your smart watch will know you are going to have a divorce before you do [heart jumped when you looked at that girl, her heartbeat went wild when you came closer].

Sentiment analysis beyond Tweets

October 18, 2014

Deep belief networks have made it possible to train computers to predict if a sentence is positive, negative or neutral. Most sentiment analysis captures headlines because tweets can be analysed. However are there business applications beyond social networking analytics?

Here are five examples:
1) Investment banking – reading complex reports
The financial industry is shaving off microseconds for high-frequency trading. However these algorithms assume that they can predict what a single big trade will be like. What if supercomputers analysed any governmental report, news feed, etc. in real time, in a fraction of the time a human needs? Initially these algorithms could get the most important data in front of analysts, but there is no reason why automatic algorithms would not be able to make trades. There could be algorithms that look for natural disasters, and others that look at the sentiment of national bank reports.
2) Telecom: detecting defects and reading complaints
What do you do when call quality is bad? You send an SMS to the other person with your message plus some insult about your mobile provider. If your bill is too high, then you call their call centre or open a complaint on the website. Computers can more efficiently detect patterns in this behaviour than humans and can raise alerts before large groups of customers start to complain on Twitter.
3) IT: log processing and intrusion detection
Often strange user behaviour can be detected by analysing the commands that are entered on a command line. Are they neutral, positive or negative? A hacker who is trying to exploit a bug and afterwards edits log files to cover their tracks could be caught because their commands are highly negative.
4) Retail: product reviews
What if a customer starts leaving bad reviews? Or, even worse, average reviews because they feel bad about a certain feature or service but not about the overall experience? Would you rather have a computer tell you in advance or wait until a crowd gathers enough tweets?
5) Politics: election sentiment
Real-time dashboards with sentiments for different candidates by analysing all written press. Find out what voters feel strongly about.
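The three-way classification mentioned at the top of this post (positive, negative or neutral) can be illustrated with a much simpler technique than a deep belief network. The sketch below is a plain word-list classifier with one-word negation handling, not a DBN. The word lists and the `classify` function are hypothetical and far too small for real use; a production system would learn from labelled data.

```python
import re

# Tiny hypothetical word lists, for illustration only.
POSITIVE = {"good", "great", "love", "excellent", "fast", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "slow", "broken", "awful"}
NEGATIONS = {"not", "no", "never"}


def classify(sentence):
    """Return 'positive', 'negative' or 'neutral' for one sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    score = 0
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            # A negation flips the polarity of the word right after it.
            negate = True
            continue
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        if polarity:
            score += -polarity if negate else polarity
        negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for text in [
    "I love this great phone",
    "The call quality is terrible",
    "This is not good",
    "The bill arrived today",
]:
    print(f"{classify(text):8} | {text}")
```

The same function could be pointed at complaints, reviews or press articles; only the word lists (or the trained model replacing them) would change.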

The next IT revolution: micro-servers and local cloud

Have you ever counted the number of Linux devices at home or work that haven’t been updated since they came out of the factory? Your cable/fibre/ADSL modem, your WiFi access point, television sets, NAS storage, routers/bridges, media centres, etc. Typically this class of devices combines a proprietary hardware platform, an embedded proprietary Linux and a proprietary application. If you are lucky you are able to log into a web GUI, often using the admin/admin credentials, and upload a new firmware blob. This firmware blob is frequently hard to locate on hardware suppliers’ websites. No wonder the NSA and others love to look into potential firmware bugs. They are the ideal source of undetected wiretapping.

The next IT revolution: micro-servers
The next IT revolution is about to happen, however. Those proprietary hardware platforms will soon make way for commodity multi-core processors from ARM, Intel, etc. General purpose operating systems will replace legacy proprietary and embedded predecessors. Proprietary and static single purpose apps will be replaced by marketplaces and multiple apps running on one device. Security updates will be sent regularly. Devices and apps will be easy to manage remotely. The next revolution will be around managing millions of micro-servers and the apps on top of them. These micro-servers will behave like a mix of phone apps, Docker containers, and cloud servers. Managing them will be like managing a “local cloud”, sometimes also called fog computing.

Micro-servers and IoT?
Are micro-servers some form of Internet of Things? Yes, they can be, but not all the time. If you have a smart hub that controls your home or office then it is pure IoT. However if you have a router, firewall, fibre modem, micro-antenna station, etc. then the micro-server will just be an improved version of its predecessor.

Why should you care about micro-servers?
If you are a mobile app developer then the micro-servers revolution will be your next battlefield. Local clouds need “Angry Bird”-like successes.
If you are a telecom or network developer then the next generation of micro-servers will give you unprecedented potential to combine traffic shaping with parental control with QoS with security with …
If you are a VC then micro-server solution providers are the type of startup you want to invest in.
If you are a hardware vendor then this is the type of device or SoC you want to build.
If you are a Big Data expert then imagine the new data tsunami these devices will generate.
If you are a machine learning expert then you might want to look at algorithms and models that are easy to execute on constrained devices once they have been trained on potentially thousands of cloud servers and petabytes of data.
If you work in DevOps then your next challenge will be managing and operating millions of constrained servers.
If you are a cloud innovator then you are likely to want to look into SaaS and PaaS management solutions for micro-servers.
If you are a service provider then this is the type of solution you want to be able to manage at scale and easily integrate with.
If you are a security expert then you should start to think about micro-firewalls, anti-micro-viruses, etc.
If you are a business manager then you should think about how new “mega micro-revenue” streams can be obtained or how disruptive “micro-innovations” can give you a competitive advantage.
If you are an analyst or consultant then you can start predicting the next IT revolution and the billions the market will be worth in 2020.
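For the machine learning point in the list above, one common way to make a cloud-trained model cheaper to run on a small device is to store its weights as 8-bit integers. The sketch below shows symmetric int8 quantization of a toy linear model. This is an illustrative technique chosen here, not something the micro-server platforms mentioned in this post prescribe, and the weights are made up.

```python
def quantize(weights):
    """Map float weights to integers in [-127, 127] plus one shared scale factor."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate float weights from the quantized form."""
    return [q * scale for q in q_weights]


def predict(q_weights, scale, inputs):
    """Dot product on the quantized weights, with a single float multiply at the end."""
    accumulator = sum(q * x for q, x in zip(q_weights, inputs))
    return accumulator * scale


# Hypothetical weights of a model trained in the cloud.
trained = [0.5, -1.27, 0.02, 0.9]
q, scale = quantize(trained)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(trained, restored))
print(q, scale, max_error)
```

Each weight now fits in one byte instead of four or eight, at the cost of a rounding error of at most half a quantization step per weight.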

The next steps…
It is still early days but expect some major announcements around micro-servers in the next months…

The future of Big Data is linked to Cloud

Data volumes are growing exponentially. Unstructured data from Twitter, LinkedIn, mailing lists, etc. has the potential to transform many industries if it can be combined with structured data. Everybody talks about machine learning, natural language processing, sentiment analysis, etc., but hardly anybody is really using them at scale. Unfortunately, too many people who talk about Big Data start with the answer and then ask what the problem is. The answer seems to be Hadoop. News flash: Hadoop is not the answer, and if you start from the answer to look for problems then you are doing it wrong.

What are Common Data Problems?

Most Big Data problems are about storage and reporting. How do I store all the exponentially growing data in such a way that business managers can get to it in seconds when they need it? Ad-hoc reporting, adequate prediction, and making sense of the exponentially growing data stream are the key problems.

Big Data Storage?

Do you have relational data, unstructured data, graph data, etc.? How do you store different types of data and make them available inside an enterprise? The basis for Big Data storage is cloud storage technology. You want to store any type of data and be able to quickly scale up storage. Red Hat did not buy Inktank for $175M because traditional storage has solved all of today’s problems. Premium SAN and other storage technologies are old school. They are too expensive for Big Data. They were designed with the idea that each byte of data is critical for an enterprise. Unfortunately this is no longer the case. You mind losing transactional sales data. You don’t mind so much losing sample tweets you bought from Datasift or Apache log files from an internal low-impact server. This is where cloud storage solutions like Inktank’s Ceph allow commodity storage to be built that is reliable, scalable and extremely cost effective. Does this mean you don’t need SANs any more? Wrong again. TV did not kill radio. Same here.

Cloud storage technologies are needed because each type of data behaves differently. If you have log data that is only appended then HDFS is fine. If you have read-mostly data then a relational database is ideal. If you have write-mostly data then you need to look at NoSQL. If you need heavy read-and-write then you need strong Big Data architecture skills. What is more important: low latency, consistency, reliability, cheap storage, etc.? Each of these leads to a different solution. Low latency means in-memory or SSD. Consistency means transactional. Reliability means replication. You can now even find databases like BlinkDB that deliberately trade exactness for speed. There is no longer one size fits all. Oracle is no longer the answer to everybody’s data questions.
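The rules of thumb in the paragraph above can be written down as a small lookup. This is only a sketch of the paragraph’s own mapping; the names `Access`, `recommend` and the priority labels are hypothetical and do not belong to any product.

```python
from enum import Enum


class Access(Enum):
    APPEND_ONLY = "append-only"
    READ_MOSTLY = "read-mostly"
    WRITE_MOSTLY = "write-mostly"
    READ_WRITE_HEAVY = "heavy read-and-write"


# Access pattern -> storage family, as stated in the text above.
STORE_FOR_ACCESS = {
    Access.APPEND_ONLY: "HDFS",
    Access.READ_MOSTLY: "relational database",
    Access.WRITE_MOSTLY: "NoSQL store",
    Access.READ_WRITE_HEAVY: "custom architecture (needs Big Data architecture skills)",
}

# Priority -> technique, as stated in the text above.
TECHNIQUE_FOR_PRIORITY = {
    "low latency": "in-memory or SSD",
    "consistency": "transactions",
    "reliability": "replication",
}


def recommend(access, priorities=()):
    """Return the storage family and the techniques implied by the priorities."""
    unknown = [p for p in priorities if p not in TECHNIQUE_FOR_PRIORITY]
    if unknown:
        raise ValueError(f"unknown priorities: {unknown}")
    return {
        "store": STORE_FOR_ACCESS[access],
        "techniques": [TECHNIQUE_FOR_PRIORITY[p] for p in priorities],
    }


plan = recommend(Access.APPEND_ONLY, ["reliability"])
print(plan)
```

Real decisions involve cost, data volume and team skills as well, so a table like this is a starting point for a discussion rather than an answer.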

What will companies need? Companies need cloud storage solutions that offer these different storage capabilities like a service. Amazon’s RDS, DynamoDB, S3 and Redshift are examples of what companies need. However companies need more flexibility. They need to be able to migrate their data between public cloud providers to optimise their costs and have added security. They also need to be able to store data in private local clouds or nearby hosted private clouds for latency or regulatory reasons.

The future of ETL & BI

Traditional ETL will see a revolution. ETL never worked. Business managers don’t want to go and ask their IT department to make a change in a star schema in order to import some extra data from the Internet, followed by updates to reports and dashboards. Business managers want an easy-to-use tool that can answer their ad-hoc queries. This is the reason why Tableau Software + Amazon Redshift are growing like crazy. However if your organisation is starting to pump terabytes of data into Redshift, please be warned: the day will come when Amazon sends you a bill that your CxO will not want to pay and he/she will want you to move out of Amazon. What will you do then? Do you have an exit strategy?

The future of ETL and BI will be web tools that any business manager can use to create ad-hoc reports. The Office generation wants to see dynamic HTML5 GUIs that allow them to drag-and-drop data queries into ad-hoc reports and dashboards. If you need training then the tool is too difficult.

These next-generation BI tools will need dynamic back-office solutions that allow storing real-time, graph, blob, historical relational, unstructured, etc. data in a commonly accessible cloud storage solution. Each type will be hosted by a different cloud service but they will all be an API away. Software will be packaged in such a way that it knows how to export its own data. Why do you need to know where Apache stores the access and error logs and in which format? Apache should be able to export whatever interesting information it contains in a standardised way into some deep storage. Machine learning should be used to make decisions on how best to store that data for ad-hoc reporting afterwards. Humans should no longer be involved in this process.

Speaking of machine learning: with the volumes of data growing from gigabytes into petabytes, traditional data scientists will not scale. In many companies a data scientist is similar to a report monkey: “Find out why in region X we sold Y% less”, etc. Data scientist should not be synonymous with dynamic report generator. Data scientists should be machine learning experts. They should tell the computer what they want, not how they want it. Today’s data scientists pride themselves on knowing R, Python, etc. These tools are too low-level to be usable at scale. There are just not enough people in the world to learn R. Data is growing exponentially; the number of R experts can at best grow linearly. What we need are machine learning GUI solutions like RapidMiner Studio, but supported by petabyte cloud solutions. A short-term solution could be an HTML5 GUI version of RapidMiner Studio that connects to a back-end set of cloud services that use some of the nice Apache Spark extensions for machine learning, streaming, Big Data warehousing/SQL, graph retrieval, etc., or solutions based on Druid.io. For sure there are other solutions possible.

What is important is that companies start realising that data is becoming a strategic weapon. Those companies that are able to collect more of it and convert it into valuable knowledge and wisdom will be tomorrow’s giants. Most average machine learning algorithms become substantially better just by throwing more and more data at them. This means that having a Big Data architecture is not as critical as having the best trained models in the industry and continuing to train them. There will be a data divide between the haves and have-nots. Google, Facebook, Microsoft and others have been buying any startup that smells like Deep Belief Networks. They have done this with good reason. They know that tomorrow’s algorithms and models will be more valuable than diamonds and gold. If you want to be one of the haves then you need to invest in cloud storage now. You need to have massive historical data volumes to train tomorrow’s algorithms and start building the foundations today…

 

Big Data 2013 Predictions

January 1, 2013

If you just invested a lot of money in a Big Data solution from any of the traditional BI vendors (Teradata, IBM, Oracle, SAS, EMC, HP, etc.) then you are likely to see a sub-optimal ROI in 2013.

Several innovations will come in 2013 that will change the value of Big Data exponentially. Other technology innovations are just waiting for smart start-ups to put them into good use.

Real-Time Hadoop

The first major innovation will be Google Dremel-like solutions such as Impala and Drill coming of age. They will allow real-time queries on Big Data and will be open source. So you will get, for free, an offering superior to what is currently available.

Cloud-Based Big Data Solutions

The absolute market leader is Amazon with EMR. Elastic MapReduce is not so much about being able to run a MapReduce operation in the cloud as about paying for what you use and not more. The traditional BI vendors are still getting their heads around usage-based licensing for the cloud. Expect a lot of smart startups to come up with really innovative Big Data and cloud solutions.

Big Data Appliances

You can buy some really expensive Big Data appliances, but here too disruptive players are likely to change the market. GPUs are relatively cheap. Stack them into servers and use something like Virtual OpenCL to make your own GPU virtualization cluster solution. These types of home-made GPU clusters are already being used for security-related Big Data work.

Also expect more hardware vendors to pack mobile ARM processors into server boxes. Dell, HP, etc. are already doing it. Imagine the potential for Distributed Map Reduce.

Finally, Parallella will put a 16-core supercomputer into everybody’s hands for $99. Their 2013 supercomputer challenge is definitely something to keep your eyes on. Their roadmap talks about 64 and 1000 core versions. If Adapteva can keep its promises and flood the market with Parallellas, then expect Parallella clusters to be the Big Data appliance of 2013.

Distributed Machine Learning

Mahout is a cool project but Map Reduce might not be the best possible architecture to run iterative distributed backpropagation or any other machine learning algorithms. Jubatus looks promising. Also algorithm innovations like HogWild could really change the dynamics for efficient distributed machine learning. This space is definitely ready for more ground-breaking innovations in 2013.
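To see why iterative algorithms fit Map Reduce awkwardly, consider gradient descent written as map and reduce steps. In the sketch below every iteration needs one full map pass over all shards followed by a reduce; on a Hadoop-style cluster each of those passes would be a separate job with its own startup and I/O cost. The functions `map_gradient`, `reduce_gradients` and `train` are hypothetical names for this illustration, the data is made up, and the code runs in a single process rather than on a cluster.

```python
from functools import reduce


def map_gradient(w, shard):
    """Map step: partial gradient of squared error for the model y = w * x on one shard."""
    gradient = sum(2 * (w * x - y) * x for x, y in shard)
    return gradient, len(shard)


def reduce_gradients(a, b):
    """Reduce step: add up partial gradients and example counts."""
    return a[0] + b[0], a[1] + b[1]


def train(shards, learning_rate=0.01, iterations=200):
    """Gradient descent where every iteration is one complete map/reduce round."""
    w = 0.0
    for _ in range(iterations):
        partials = [map_gradient(w, shard) for shard in shards]
        gradient_sum, count = reduce(reduce_gradients, partials)
        w -= learning_rate * gradient_sum / count
    return w


# Hypothetical data generated from y = 3 * x, split over two "nodes".
shards = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0)],
]
w = train(shards)
print(f"learned weight: {w:.6f}")
```

Two hundred iterations mean two hundred map/reduce rounds for a one-parameter model, which is the overhead that online and asynchronous approaches try to avoid.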

Easier Big Data Tools

This is still a big blank spot in the Open Source field. Having Open Source and easy-to-use drag-and-drop tools for Big Data analytics would really accelerate adoption. We already have some good commercial examples (Radoop = RapidMiner + Mahout, Tableau, Datameer, etc.) but we are missing good Open Source tools.

I am currently looking for new challenges, so if you are active in the Big Data space and are looking for a knowledgeable senior executive be sure to contact me at maarten at telruptive dot com.

Jubatus – distributed scalable online machine learning framework

December 16, 2012

Finally a solution for real-time distributed machine learning: Jubatus. Jubatus differs from Mahout and other distributed machine learning solutions in that its focus is real-time instead of batch. It offers algorithms for online classification, regression, recommendation, graph operations (queries, centrality, shortest path), etc. ZooKeeper is used to keep the distributed Jubaclassifiers synchronized. Multiple clients connect to the Jubakeeper (based on ZooKeeper). Jubatus has a plugin framework to convert unstructured data on the fly into feature vectors. Performance seems to scale linearly up to 16 nodes. Jubatus is another solution that Big Data architects should evaluate…
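To illustrate what “online” means here, the sketch below converts text into a feature vector on the fly and updates a classifier one example at a time, without ever storing a batch. It does not use the Jubatus API; `to_features` and `OnlinePerceptron` are hypothetical names, and the perceptron is just one simple online learning algorithm chosen for the example.

```python
from collections import Counter


def to_features(text):
    """Convert raw text into a sparse bag-of-words feature vector."""
    return {token: float(count) for token, count in Counter(text.lower().split()).items()}


class OnlinePerceptron:
    """Binary classifier (+1 / -1) that learns from one example at a time."""

    def __init__(self):
        self.weights = {}

    def score(self, features):
        return sum(self.weights.get(name, 0.0) * value for name, value in features.items())

    def predict(self, features):
        # A score of zero (nothing learned yet) is treated as the negative class.
        return 1 if self.score(features) > 0 else -1

    def update(self, features, label):
        """Adjust weights only when the current prediction is wrong."""
        if self.predict(features) == label:
            return False
        for name, value in features.items():
            self.weights[name] = self.weights.get(name, 0.0) + label * value
        return True


# Hypothetical stream of log messages: +1 means healthy, -1 means problem.
stream = [
    ("backup completed successfully", 1),
    ("backup failure on node", -1),
    ("job completed", 1),
]

model = OnlinePerceptron()
mistakes = sum(model.update(to_features(text), label) for text, label in stream)
print(f"mistakes while learning: {mistakes}")
print(model.predict(to_features("disk failure")))
```

In a distributed setting, several such learners would each see part of the stream and periodically exchange or mix their weights, which is the kind of coordination ZooKeeper helps with.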
