Cloudera, Making Big Data Even Better

28 Nov, 2019

Organisations have used databases for decades. While databases have historically worked relatively well, they’ve been fairly limited in their capacity for analysis. As advanced analytics became more common and began to branch into the world of data science, more and more tools were introduced to the data and analytics space. While this offered variety in how problems could be solved, there was one main issue: every time a new tool was used for analysis, the data had to be copied.

“Making copies of data introduces a lot of problems,” says Paul Mumby, a technical consultant at OSS Group.

First, there is an issue with efficiency. Duplicating data takes time and resources, both human and monetary.

“There is also an issue with security,” says Mumby. “Your original database may be secure and auditable. You know who is doing what to your core data. Once you copy out that data, you lose all control over it.”

Luckily, Cloudera has solved the problems inherent to duplication. What the Cloudera Hadoop stack brings to the table is a single place where data can reside. Data can then be accessed through the same security layer and the same privacy controls. You retain visibility into what is being done to the data and by whom, while still allowing data analytics solutions to access it.

The Hadoop ecosystem is not unique to Cloudera. Hadoop is an open-source project managed and maintained by the Apache Software Foundation (ASF), a non-profit community of developers and contributors. This means that Hadoop solutions are freely available.

“You can always download the open-source components yourself and build your own stack,” Mumby points out.

However, this takes time and know-how. Cloudera integrates the open-source components for you, delivering them as a single platform where you can store your data and employ multiple approaches to data analytics.

“With Cloudera, you don’t need to figure out how to glue it all together,” says Mumby. “All you have to do is download one thing, hit ‘install’ and you will ultimately get a working environment.”

The ease and breadth of Cloudera’s Hadoop stack make it the most popular distribution of Hadoop globally.

Data in optimisation

To get the best use from Cloudera Hadoop, your organisation needs to be dealing with big data. For example, OSS Group recently helped a New Zealand business improve its operational efficiency by analysing massive amounts of data.

“The company had a lot of complex operations and they wanted to be more efficient,” says Mumby.

As a travel and transportation company, they wanted a more efficient way to plan when they should create a new route and when they should retire an old route. Traditionally, this task required labour-intensive and time-intensive market research.

“They figured they should be able to do it using data,” says Mumby.

They sourced data covering a ten-year period: from their own ticket sales, from Stats NZ and Stats AU, and from competitor data purchased from market agencies. This data included where people were travelling, when and how often, what they were paying and what competitors’ options were on various routes. It would allow them to balance traveller interest against competition for a particular route. They ended up with millions and millions of records.

They first attempted to analyse the data on their existing Oracle platform. This resulted in two key problems.

“Performance in Oracle was not where they wanted it to be and they would have had to spend a tremendous amount of money getting it there,” says Mumby.

Additionally, even testing the queries was impacting other workloads and the rest of their production environment.

“Since they have daily operations happening that are fundamental to the running of their business, they couldn’t tolerate any interruptions.”

The company decided to try an experiment. They got a trial edition of Cloudera and did an initial proof of concept. They quickly realised that the tools allowed them to rapidly solve their problems with a relatively small investment in infrastructure and cost.

“They were able to deliver, at least in a proof of concept form, the answers to the business in a reasonable amount of time,” says Mumby.

Once they realised the potential, Mumby came on board to move the proof of concept to production.

“The potential that gave them in optimising their route network in ways they would not have found otherwise was hugely beneficial to the organisation,” says Mumby.

Cloudera in real time

Cloudera also has an exceptional grasp of real-time data. In the case of the transportation and travel company, in addition to analysing historical data, they wanted to analyse current processes and see what could be done to shorten the turnaround time of vehicles without impacting critical processes. They wanted a dashboard that used real-time data to help operations staff identify delays and lags within the system. Operations staff could then decide how best to allocate resources to speed up turnaround.

“They wanted all this in real time,” says Mumby. “They wanted at any second to see their vehicles being shuffled around and reprioritised in response to real-time data.”

That was something they could not do with their existing technology.

“They could load data to a database and do batch analysis, but analysing data as it's flowing through the enterprise in real time wasn't something they could do with their existing Oracle environment.”

While Oracle does provide mechanisms for real-time analysis, the cost of adding those features and the additional infrastructure required was prohibitive.

“They were already experimenting with streaming in Kafka, and using Message Queues, and these things integrated with minimal effort into Cloudera's Data Lake, allowing them to do both large scale batch analytics and real-time analytics on one platform.”
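The difference between the two styles of analysis can be sketched in a few lines of Python. This is a minimal illustration, not the company's implementation: the `Turnaround` record, the 45-minute delay threshold and the sample values are all hypothetical, and a plain Python list stands in for what would in practice be a live stream such as a Kafka topic.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Turnaround:
    """One completed vehicle turnaround (hypothetical record shape)."""
    vehicle_id: str
    minutes: float


def batch_mean(events: List[Turnaround]) -> Dict[str, float]:
    """Batch style: needs the full data set before it can answer."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for e in events:
        totals[e.vehicle_id] += e.minutes
        counts[e.vehicle_id] += 1
    return {v: totals[v] / counts[v] for v in totals}


class StreamingMean:
    """Streaming style: updates the answer as each event arrives."""

    def __init__(self, delay_threshold: float) -> None:
        # Hypothetical alert level in minutes, chosen only for illustration.
        self.delay_threshold = delay_threshold
        self.counts: Dict[str, int] = defaultdict(int)
        self.means: Dict[str, float] = defaultdict(float)

    def update(self, e: Turnaround) -> bool:
        """Fold one event into the running mean; return True if it is a delay."""
        self.counts[e.vehicle_id] += 1
        n = self.counts[e.vehicle_id]
        # Incremental mean: new_mean = old_mean + (x - old_mean) / n
        self.means[e.vehicle_id] += (e.minutes - self.means[e.vehicle_id]) / n
        return e.minutes > self.delay_threshold


if __name__ == "__main__":
    # Made-up sample data; a real deployment would read from a message stream.
    sample = [
        Turnaround("A", 30.0),
        Turnaround("B", 50.0),
        Turnaround("A", 40.0),
    ]
    print("batch:", batch_mean(sample))

    stream = StreamingMean(delay_threshold=45.0)
    for event in sample:
        if stream.update(event):
            print("delay flagged for vehicle", event.vehicle_id)
    print("streaming:", dict(stream.means))
```

Both approaches arrive at the same averages. The streaming version simply has an answer available after every event, which is what a live operations dashboard needs.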

Using Cloudera, Mumby and his team were able to implement the solution in a relatively short time frame.

“With our Cloudera solution, they were able to reach their target improvement within their target time frame.”

How big does big data need to be?

The above are clear-cut examples of big data – millions of data points from multiple sources spanning a ten-year period in the first case and disparate data sources delivering real-time results in the second case. But how can you tell if YOUR data is big data?

Big data exhibits one (or several) of four properties: volume, velocity, veracity and variety. In the case of the transport company, they ticked the volume and variety boxes of big data. But what about velocity and veracity?

Velocity refers to the speed at which data is being gathered. Big data can be collected from sensors that are recording a constant flow of information.

“There may be very little data,” says Mumby, “but if it’s going very fast, it can definitely fit in the big data category.”

Veracity has to do with whether the data is accurate, precise and trusted. Untrusted data from a possibly corrupt source adds to the complexity of how the data must be dealt with. This pushes it into the big data category.

If your data has one (or all four!) of the above characteristics, chances are it would be better suited to a big-data platform than a conventional database.

Now you know!

By building one set of infrastructure that can serve multiple purposes, organisations can use big data to gain operational efficiency in a number of processes. Cloudera is the gold standard for analysing big data. OSS Group has deployed Cloudera Hadoop stack solutions for transportation and travel companies, insurance companies and telecoms. The solution would also be appropriate in banks and other enterprise spaces. We encourage you to contact us to discuss whether Cloudera would be a good fit for your business.