Hadoop has become the de facto solution for big data processing. Amazon, Twitter, Google, Facebook and every large company that handles sizable data sets uses some flavour of Hadoop.
Hadoop is evolving fast: recently, Hortonworks released HDP 2.1 and Cloudera released CDH 5.0. Both of these distributions have reached a high level of maturity and there are many reasons you might want to consider Hadoop even if your data size is only a fraction of a petabyte:
- Archive your data into your Hadoop cluster instead of taking it offline – Hadoop storage is cheap and reliable. By default, every block of data is replicated across three nodes. And once your data is there, it is still usable: you can connect BI tools, ETL (Extract, Transform and Load) processes and web tools to it.
- Hadoop includes first-class data transformation tools – Traditional ETL tools can end up generating complex and convoluted transformation packages that are difficult to maintain. You might want to consider Pig, Hive and Oozie as an alternative. Pig is a high-level language designed for parallel data transformations. I’d argue that Pig alone makes the setup of a cluster worthwhile; it can make ugly ETL processes maintainable. The underlying data storage options are also more flexible than old-fashioned SQL databases – a change in the schema will not always break all your scripts. And these transformations can be automated as workflows using Oozie and its web UI.
- When you need data mining and advanced analytics, you will have an infrastructure ready – Scaling your databases might not suffice for advanced analytics, because any serious data crunching frequently requires both distributed processing and distributed storage. Hadoop lets you achieve that at low cost. While your database vendor can help you distribute your storage (with a hefty bill), there are not many options for distributed processing.
- IT-wise, you can think of your Hadoop cluster as an appliance – There is little maintenance, and with modern distributions, adding and setting up nodes requires only a couple of clicks in a web UI (including high-availability options). Monitoring is built in, and configuration and tuning are now automated out of the box.
- Microsoft likes Hadoop – Although Hadoop is mostly Java-based, it has become a first-class citizen for Windows and Microsoft applications. Firstly, Azure HDInsight is Hadoop. Secondly, your Hadoop data can be accessed from almost any Windows application through SQL, thanks to the ODBC drivers. Since SQL Server 2012, a bridge for mixing SQL and Hadoop data sets is included. And Hadoop can even run on Windows (the setup is not as mature and straightforward as on Linux, but there is a Hortonworks installer for Windows).
- NoSQL? Included – Hadoop distributions include HBase, a key/value storage solution that is distributed and replicated out of the box. You won’t require any infrastructure changes to get a scalable, low-latency alternative to SQL databases.
- Real-time queries are not quite there yet, but it is happening – While there are connectors to access Hadoop data through SQL, Hadoop queries today are slower than SQL database queries, taking from 10 seconds to a couple of minutes. This situation is evolving fast: in its most recent distribution, Hortonworks includes its Hive add-on, Stinger, which speeds up SQL queries dramatically, and Cloudera has its own query engine called Impala. And if you can move some of your data onto HBase and query it directly, speed won’t be an issue anymore.
- Data exploration – Because Hive queries and Pig scripts are distributed, complex queries can run much faster than on traditional databases. This can be a powerful way to gather new insights by correlating and joining data sets that would have taken down a traditional SQL server.
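To give a flavour of the kind of transformation Pig keeps maintainable, here is a minimal sketch of a Pig Latin script; the input path, field names and schema are hypothetical, not from any real cluster:

```pig
-- Load raw web logs (tab-separated; path and schema are illustrative)
logs = LOAD '/data/raw/weblogs' USING PigStorage('\t')
       AS (ip:chararray, ts:chararray, url:chararray, bytes:long);

-- Keep only page views, then aggregate traffic per URL
pages   = FILTER logs BY url MATCHES '/pages/.*';
by_url  = GROUP pages BY url;
traffic = FOREACH by_url GENERATE group AS url,
                                  COUNT(pages) AS hits,
                                  SUM(pages.bytes) AS total_bytes;

STORE traffic INTO '/data/reports/traffic_by_url';
```

Each statement is a declarative step that Pig compiles into parallel jobs behind the scenes, and a script like this can be scheduled as one step of an Oozie workflow. Compared with a hand-built ETL package, the whole pipeline stays readable, and a new column in the input would not break it.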
It is easy to get started and experiment with either Hortonworks or Cloudera: both provide a fully configured Linux virtual machine with all the services installed. And once you are ready to set up a cluster and integrate Hadoop into your environment, we can help you.