With Hadoop, HBase, Hive, Cassandra, etc. you can store and manipulate petabytes of data. But what if you want good-looking reports, or need to compare data held in a NoSQL solution with data held elsewhere? The two market leaders in the Open Source business intelligence space are now putting all their firepower into Big Data.
Pentaho Big Data seems to be a bit further ahead. They offer a graphical ETL tool, a report designer and a business intelligence server. These are existing tools, but support for Hadoop HDFS, MapReduce, HBase, Hive, Pig, Cassandra, etc. has been added.
Jaspersoft’s Open Source Big Data strategy is a little behind because the connectors are not yet included in the main product; several are still beta quality and lack documentation.
Both companies will accelerate the adoption of Big Data, because the main problem with Big Data today is getting easy reporting out of it: unstructured data is much harder to turn into a well-structured report than structured data. Any solution that makes this possible, and is Open Source on top of that, is very welcome in times of cost cutting…
Hadoop has run into architectural limitations and the community has started working on the Next Generation Hadoop [NGN Hadoop]. NGN Hadoop brings new management features, of which multi-tenant application management is the major one. The key change, however, is that MapReduce is no longer entangled with the rest of Hadoop. This will allow Hadoop to be used for MPI, machine learning, master-worker, iterative processing, graph processing, etc. New tools to better manage Hadoop are also being incubated, e.g. Ambari and HCatalog.
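To make the MapReduce part concrete, here is a minimal sketch of a Hadoop MapReduce job written against the org.apache.hadoop.mapreduce API: a mapper and reducer that count events per subscriber in a comma-separated log. The input format, class names and field layout are illustrative, not taken from any real telecom system.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Counts how many log lines each subscriber produced.
// Input: lines of "subscriberId,eventType,timestamp" (illustrative format).
public class EventCount {

  public static class EventMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text subscriber = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (subscriberId, 1) for every event line.
      String[] fields = value.toString().split(",");
      subscriber.set(fields[0]);
      context.write(subscriber, ONE);
    }
  }

  public static class EventReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum the per-subscriber counts emitted by the mappers.
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}
```

Today a job like this is welded to the MapReduce runtime; with MapReduce decoupled, the same cluster could schedule it next to MPI or graph-processing workloads.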
Why is this important for telecom?
Having one platform that allows massive data storage, petabyte data analytics, complex parallel computations, large-scale machine learning, Big Data MapReduce processing, etc., all in one multi-tenant set-up, means that telecom operators could see massive reductions in their architecture costs, together with faster go-to-market, better data intelligence, etc.
Telecom applications that are redesigned around this new paradigm can all use one shared back-office architecture. Having data centralized in one large Hadoop cluster, instead of tens or hundreds of application-specific databases, will enable data analytics possibilities never seen before and bring much-needed efficiencies.
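As a sketch of what such a shared back office could look like in practice: call records from many applications written into a single HBase table, with the tenant encoded in the row key so each application scans only its own prefix. The table name, column family and record fields below are invented for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Writes a call detail record into one shared, multi-tenant HBase table.
// Row key layout: <tenant>|<subscriber>|<timestamp>, so all data lives in
// a single cluster-wide table but every application only scans its prefix.
public class SharedCdrStore {

  public static void storeCdr(String tenant, String subscriber, long timestamp,
                              String callee, long durationSeconds) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("cdr"))) {
      byte[] rowKey = Bytes.toBytes(tenant + "|" + subscriber + "|" + timestamp);
      Put put = new Put(rowKey);
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("callee"),
                    Bytes.toBytes(callee));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("duration"),
                    Bytes.toBytes(durationSeconds));
      table.put(put);
    }
  }
}
```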
What is needed is for several large operators to define this approach as their standard architecture; telecom solution providers will then start incorporating it into their solutions. Commercial support can easily be acquired from companies like Hortonworks, Cloudera, etc.
Having one shared data architecture and multi-tenant application virtualization in the form of a Telco PaaS would allow third parties to launch new services quickly and cheaply, think days instead of years…
If you are trying to find out what the right hypervisor is for your private cloud or IaaS, then you might be asking the wrong question…
A better question is: do most applications really need an OS and a hypervisor?
One of the companies exploring this area is Joyent. Their SmartOS sits somewhere between a virtual machine and a combined OS + hypervisor. Instead of installing an application server or database on top of an operating system on top of a hypervisor, the Joyent team thought it would be more efficient to remove as many layers as possible between the application/data and the hardware.
According to publicly available videos and material, their SmartOS is based on a telecom technology for highly scalable, low-latency application operations. Unfortunately Google does not seem to be able to answer which telecom technology it is. So if you know the answer, please leave a comment.
The idea of running applications as close to the hardware as possible, while still being able to scale an application over multiple servers, is the ultimate goal of many cloud architects. Joyent claims that their SmartOS runs directly on the hardware. On top of SmartOS you can install virtualization, but ideally you run applications and data stores directly.
The next step would be to combine the operating system and the virtual machine/application server or database server into one. Removing more layers greatly improves performance, as Joyent’s performance tests show.
So the real question is: do we need so many extra layers?
A distributed storage system, a virtualized web server, a virtualized app server, and a distributed SQL-accessible database or NoSQL solution that run straight on the hardware, with a minimal extension to distribute load over multiple machines, would be the ideal IaaS/PaaS architecture. It would give customers what they really need: performance, scalability, low latency, etc. Why add a large set of OS and hypervisor functions that in the end are not strictly necessary?
I have been looking into virtualization, but what I find is mainly operating-system-based virtualization. What I am looking for are application, integration and datastore virtualization solutions. Google’s App Engine and Oracle’s JRockit Virtual Edition come closest to what I am looking for in application virtualization. Why do you need an operating system if you can virtualize your application directly? It would save resources and be more secure.

My ideal solution lets developers write applications and run them on a virtual application server that scales them horizontally over multiple machines. Each application runs in a sandbox, so a badly written or insecure application simply runs out of its own resources and cannot impact other applications. We would need a similar solution for integration. Both would need out-of-the-box support for multi-tenancy, in which either each tenant gets a separate instance or multiple tenants share one instance if the software supports it. Integration should be separated from the application logic, and so should data storage.
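Nothing on the market matches this description one-to-one, but the resource-isolation part of the sandbox idea can at least be sketched. The toy Java class below gives every tenant application its own small, bounded worker pool, so a runaway application exhausts only its own capacity; a real virtual app server would of course also isolate memory, CPU and I/O. All names and limits are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Toy sandbox: each tenant gets a small fixed pool and a bounded queue.
// A runaway tenant fills its own queue and gets rejected; other tenants
// keep their full capacity.
public class TenantSandbox {

  private final Map<String, ExecutorService> pools = new ConcurrentHashMap<>();

  private ExecutorService poolFor(String tenant) {
    return pools.computeIfAbsent(tenant, t ->
        new ThreadPoolExecutor(2, 2, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(100),           // bounded backlog per tenant
            new ThreadPoolExecutor.AbortPolicy()));  // reject when full
  }

  public void submit(String tenant, Runnable task) {
    try {
      poolFor(tenant).execute(task);
    } catch (RejectedExecutionException e) {
      // This tenant ran out of its own resources; nobody else is affected.
      System.err.println("Tenant " + tenant + " over capacity, task dropped");
    }
  }
}
```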
Integration is key because the virtual applications could be running on a public cloud but would have to interact with on-site systems. Extremely high throughput, security, multi-tenancy and resistance to failure are essential. One API can be linked to multiple back-office systems, or to different versions of one. Different versions of an API can be linked to the same back-office system to prepare applications before a major back-office upgrade.
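The version-routing idea can be sketched as a simple dispatch table: every (API, version) pair maps to a back-office endpoint, so a new version can already target the upgraded system while the old version keeps serving the old one. The API names and URLs below are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Maps an (api, version) pair to a back-office endpoint. Two versions of the
// same API can point at different back-office systems (or the same one),
// which lets applications migrate ahead of a major back-office upgrade.
public class ApiVersionRouter {

  private final Map<String, String> routes = new HashMap<>();

  public void route(String api, String version, String backendUrl) {
    routes.put(api + ":" + version, backendUrl);
  }

  public String resolve(String api, String version) {
    String backend = routes.get(api + ":" + version);
    if (backend == null) {
      throw new IllegalArgumentException("Unknown API/version: " + api + " " + version);
    }
    return backend;
  }

  public static void main(String[] args) {
    ApiVersionRouter router = new ApiVersionRouter();
    // v1 keeps hitting the current billing system, v2 already targets the new one.
    router.route("billing", "v1", "https://legacy.example.internal/billing");
    router.route("billing", "v2", "https://upgraded.example.internal/billing");
    System.out.println(router.resolve("billing", "v1"));
    System.out.println(router.resolve("billing", "v2"));
  }
}
```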
A distributed multi-tenant data store should hold all the end-user and application data, ideally in a schema-less manner that avoids having to migrate data for every schema change.
All these virtual elements should be managed by an automated, highly distributed administration layer that lets applications grow or shrink based on demand, assures integration links are always up and re-establishes them if they fail, stores data in a limitless way, etc. But there is more: the administration should allow deploying different versions of the same application or integration, with step-wise migration to new versions and fast roll-backs.
Why do we need all this?
The first company to have such elements at its disposal will have enormous competitive advantages in delivering innovative services quickly. They can launch new applications quickly and scale them to millions of users in hours. They can integrate diverse sources and make them universally available for re-use by multiple applications. They can store data without needing an army of DBAs for every application. They can try out new features and quickly scale them up or kill them. In short, they can innovate on a daily basis.
The Googles of this world understood years ago that a good architecture is a very powerful competitive weapon. There is a valid trend to offshore technical work, but technical work should first be separated into extremely high-value and routine. Never offshore high-value work. Also, never assume that because the resources are expensive, the work must be high-value. Defining and implementing this innovation architecture is extremely high-value. Writing applications on top of it is routine, at least from application number five onwards.
With the world looking more at XML, SOAP and REST these days, it perhaps feels unnatural to think binary again. However, with Protocol Buffers [Protobuf], Thrift, Avro and BSON being used by the large dotcoms, thinking binary feels modern again…
How can we apply binary to telecom? Binary SIP?
SIP is a protocol for handling sessions for voice, video and instant messaging. It is a text-based protocol, similar in style to HTTP. Setting up a SIP session requires a lot of communication between the different parties. What if that communication were substituted by a binary protocol based, for instance, on Protocol Buffers? Google’s Protocol Buffers can dramatically reduce network load and parsing time, by a factor of 10 to 100 compared to regular XML.
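As a sketch of what such a substitution could look like, assume a hypothetical sip.proto compiled with protoc into a SipInvite class; the message layout is invented, and a real binary SIP would need far more fields:

```java
import com.google.protobuf.InvalidProtocolBufferException;

// Assumes a hypothetical sip.proto compiled with protoc:
//
//   message SipInvite {
//     optional string from_uri  = 1;
//     optional string to_uri    = 2;
//     optional string call_id   = 3;
//     optional uint32 cseq      = 4;
//     optional string sdp_offer = 5;
//   }
//
public class BinarySipDemo {

  public static void main(String[] args) throws InvalidProtocolBufferException {
    // Build and serialize: a compact binary form of an INVITE-like message.
    byte[] wire = SipInvite.newBuilder()
        .setFromUri("sip:alice@example.com")
        .setToUri("sip:bob@example.com")
        .setCallId("a84b4c76e66710")
        .setCseq(314159)
        .build()
        .toByteArray();

    System.out.println("Encoded size: " + wire.length + " bytes");

    // Parsing is a single generated call, no text scanning involved.
    SipInvite invite = SipInvite.parseFrom(wire);
    System.out.println("INVITE from " + invite.getFromUri());
  }
}
```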
What would be the advantages and drawbacks:
- Latency – faster parsing and smaller messages reduce latency, which is key in real-time communication.
- Performance – faster parsing and a lower load mean more can be done with less. One server can handle more clients.
- Scalability – distributing the handling of SIP sessions over more machines becomes easier if each transaction can be handled faster.
- No easy debugging – SIP is human-readable, hence debugging is “easier”. In practice, however, tools could be written that make binary debugging workable.
- Syncing client & server – client and server libraries need to be in sync, otherwise parsing cannot be handled. Protocol Buffers ignores unknown fields, so there is some freedom for an old client to connect to a newer server or vice versa; see the sketch after this list.
- Firewalls/Existing equipment – a new binary protocol cannot be interchanged with existing equipment. A SIP to binary-SIP proxy would be necessary.
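To illustrate the syncing point above: Protocol Buffers skips fields it does not recognize at parse time rather than rejecting the message. The sketch below assumes two generated variants of the hypothetical SipInvite message, where v2 adds a priority field the v1 client does not know about.

```java
// Assumes two generated variants of the hypothetical message:
//   v1: from_uri = 1, to_uri = 2
//   v2: from_uri = 1, to_uri = 2, priority = 3  (new field)
public class CompatDemo {

  public static void main(String[] args) throws Exception {
    // A "new" server encodes a message with the extra priority field.
    byte[] wire = SipInviteV2.newBuilder()
        .setFromUri("sip:alice@example.com")
        .setToUri("sip:bob@example.com")
        .setPriority(1)
        .build()
        .toByteArray();

    // An "old" client parses the same bytes with the v1 class. The unknown
    // field (tag 3) is skipped, not rejected, so the call still proceeds.
    SipInviteV1 invite = SipInviteV1.parseFrom(wire);
    System.out.println("Old client still reads: " + invite.getToUri());
  }
}
```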