Cloudera claims the enterprise data hub has arrived, adds streaming

Tony Baer, Principal Analyst, Software – Information Management

Cloudera announced this week that “the enterprise data hub has arrived.” Specifically, Cloudera has revised product packaging to make it more convenient (and affordable) for clients to buy the whole platform. As we stated when Cloudera first unveiled the enterprise data hub strategy, the vision is practical, but in the short term the primary hub-oriented roles for which the platform is ready are the “data lake” (where original data is kept intact) and bulk processing tasks (e.g., data transformation).

Nonetheless, Cloudera’s repackaging will make enterprise data hub configurations (e.g., the full Cloudera platform) more accessible and affordable, allowing enterprises to test-drive Hadoop for more use cases. Additionally, Cloudera’s general release of support for the Apache Spark open source project is a key milestone in bringing Fast Data applications to Hadoop.

Evolving Hadoop from island to hub

As we stated last fall in the research note “Cloudera plots its path forward with enterprise data hub strategy,” Cloudera envisions a broader role for Hadoop. Loosely described, the enterprise data hub becomes the place where data lands, is managed in its raw form, and some analytic workloads are run. As the logical successor to the enterprise data warehouse, the data hub would not necessarily be the sole platform for analytics.

Instead, it would be part of an ecosystem where analytics and other workloads (e.g., data transformation, master data management) are apportioned between Hadoop and other platforms including SQL data warehouses, NewSQL, and NoSQL platforms based on criteria such as functionality (e.g., specialized in-database analytics functions), cost, required service level, data locality, and governance.

Is Hadoop ready to become the enterprise data hub?

Hadoop brings several strong advantages to the table. Designed for the scale-out cluster architectures that originated in Internet data centers, Hadoop clusters scale nearly linearly: doubling the number of nodes in a cluster won’t necessarily double the performance of the system, but the end result will be fairly close. Hadoop is economical because it uses commodity infrastructure and open source software. And Hadoop has the flexibility to run multiple forms of workloads that may include, and go beyond, SQL querying.

Ovum believes that Hadoop will mature in the next 2–3 years to become a viable enterprise data hub candidate. For now, it is ready to provide a “data lake” for maintaining raw data and perform a growing range of analytic workloads beyond MapReduce. But capabilities for managing service levels, performance, availability and reliability, information lifecycles, security, and data governance are works in progress – they are either in early versions or still to be developed.

Ovum expects vigorous competition for the next-generation enterprise data hub. Relational platforms are adding Hadoop integration and capability to perform other forms of processing beyond SQL. While Hadoop has the edge as a “data lake” (the place where original data sits), competition will be wide open for other functions of the hub.

Are enterprises ready for Hadoop as enterprise data hub?

Ovum has not conducted any recent formal surveys; however, from ongoing discussions and feedback from our enterprise clients, we have found that many are interested, but only a few are far along the adoption curve. So, positioning Hadoop as an enterprise data hub is ahead of where the bulk of the enterprise market is. But enunciating the roadmap is important for clients to make future plans – with these announcements, Cloudera’s direction is clear.

Repackaging Cloudera Enterprise

Cloudera’s announcements involved simplifying the packaging of its premium modules; in place of an a la carte strategy that discouraged customers from adopting the full platform (which would be essential for data hub deployments), Cloudera now offers three über support editions comprising:

  • Basic, which includes the existing base package plus backup and data recovery (which was formerly an add-on);
  • Flex, which provides the choice of one of the premium add-ons; and
  • Data Hub, which includes all of Cloudera’s products, including HBase (NoSQL database), Impala (interactive SQL query), Search, Spark (see below), and Navigator (data lineage) add-ons.

The new bundling provides a good middle ground for organizations that want just the basic platform, a single premium add-on such as interactive SQL, or the full package for a data hub. Cloudera states that pricing for the Data Hub edition will be more favorable than before, with that edition priced at the equivalent of paying for a second add-on module.

Sparking Fast Data on Hadoop

This is the announcement that really interested us: Cloudera support for the Apache Spark project is now in general release. Spark is a new Apache open source project for high-performance cluster computing that aggressively tiers data into memory. Cloudera is OEM’ing technology developed by Spark project leader Databricks.

Spark can be used for fast, interactive query, and with the related Spark Streaming framework, can be used for stream processing of data for realtime analytics. Spark is one of many emerging Fast Data processing frameworks for Hadoop; its best-known rival is Storm, whose original development team later joined Twitter and whose technology is now backed by Hortonworks. Storm has been around longer, but Cloudera claims Spark’s codebase is more efficient (we will investigate the claim in a forthcoming update on Fast Data).
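To make the streaming model concrete: Spark Streaming popularized the “micro-batch” approach, in which a continuous stream is cut into small batches that are each processed like a normal batch job. The following is a minimal illustrative sketch of that idea in plain Python – it is not Spark’s actual API (DStreams cut batches by wall-clock interval on a cluster), but it shows the per-batch computation pattern, using a count-based batch boundary to stay deterministic.

```python
from collections import Counter
from itertools import islice


def micro_batches(events, batch_size):
    """Group a stream of events into fixed-size micro-batches.

    Spark Streaming's DStream model works similarly, except real DStream
    batches are cut by time interval rather than by count; count is used
    here only to keep this sketch deterministic.
    """
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def streaming_word_count(lines, batch_size):
    """Per-batch word counts: the canonical streaming 'hello world'."""
    for batch in micro_batches(lines, batch_size):
        counts = Counter(word for line in batch for word in line.split())
        yield dict(counts)


# Example: a hypothetical 'stream' of log lines, processed two at a time.
stream = ["error disk full", "ok", "error timeout", "ok ok"]
results = list(streaming_word_count(stream, batch_size=2))
```

Each yielded dictionary corresponds to one micro-batch, so downstream consumers see a steady cadence of small, complete results rather than one monolithic job – which is what makes the model attractive for near-realtime analytics.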

However, put this in perspective: both technologies are still early in their development. So why are Cloudera and Hortonworks so quickly staking out their claims? It is a response to surging interest in streaming on Hadoop. As stated in the Ovum report What is Fast Data?, technology (e.g., the emergence of open source frameworks in place of costly proprietary event-processing products) and price/performance trends are making realtime applications affordable to a wider enterprise audience. For Hadoop, the YARN framework provides the means for allocating workloads across clusters, further clearing the way. For enterprises, this is a signal that the path is open to prototype streaming applications should the right use case exist.
