5 Keys to Creating a Killer Data Lake

by Zachary Zeus
July 20, 2017
It’s been several years since the term “Data Lake” was coined by my friend and Pentaho co-founder James Dixon. The idea remains a hot topic, and a challenge to execute properly. The problem is that too many people think all they need to do is dump data into Hadoop. But what happens when you merely dump your data? You get what you asked for: a dump.

However, when properly executed, the data lake has the potential to transform an organization’s ability to manage structured and unstructured data at scale. This, in turn, allows organizations to more quickly extract value, create insights, and deliver outcomes that help them innovate, increase revenue, improve customer engagement, and streamline operations.

So why isn’t everyone using data lakes to solve big data challenges? Well, because big data is hard. New technologies like Hadoop have become more mature and enterprise-friendly, but creating a data lake to feed an effective data pipeline still isn’t easy. With proper foresight and planning, though, you can set yourself up for success.

5 keys to a killer data lake:

  1. Align to Corporate Strategy:  Innovation initiatives are doomed to failure when not aligned with corporate strategy.
  2. Solid Data Integration Strategy: The tools you used yesterday may not support the requirements of tomorrow’s modern data architectures.
  3. Modern Big Data Onboarding Process: Manage a changing array of data sources, establish repeatable processes at scale, and maintain governance and control over those processes.
  4. Embrace New Data Management Practices: Leverage design patterns and usage models that support modern data architectures.  Apply analytics to all data.
  5. Operationalize Machine Learning Models: Remove bottlenecks. Empower teams of data scientists, engineers and analysts to train, tune, test and deploy predictive and other advanced models.

Align to Corporate Strategy


It’s important to align goals for your data lake with the business strategy of the organization you’re working to support. Understanding the larger operational context will help you identify potential roadblocks, articulate the project’s value within the organization, and gain executive buy-in.

Business strategies tend to revolve around concepts like business acceleration, operational efficiency, and security.  Will your data infrastructure help the organization gain a better understanding of your customer, maximize profits, or develop new products? Will it help to cut costs, modernize your data architecture, apply emerging IoT techniques, mitigate risk and adhere to compliance standards?

align data lake with business strategy

Knowing that you’re doing the right things, and why you’re doing them, is crucial. Take the right precautions and consider the value you add. All of these solutions roll up to bigger strategic organizational goals.

Solid Data Integration Strategy


While the term “big data” is relatively new, the process of collecting and storing large amounts of data for future analysis is old hat. Today, newer integration strategies, approaches and technologies are needed to deal with the unprecedented volume, velocity and variety of the data coming in. And if you don’t have a solid data integration strategy, the potential value of most of that data will never be realized. There just isn’t enough time to process it, prepare it, and analyze it using traditional methods and tools.

In recent years, MapReduce was the execution engine we relied upon for Hadoop. Today, we have Spark for running large-scale data analytics applications. Soon, other options, like Apache Flink, will become more prevalent. So, as you consider your integration strategy, think about the adaptability of your technology investments. For example, will you be able to replace the execution engine that currently powers your data integration and analytics without significant resources and expense? Are your current tools flexible enough to handle changes to security, data governance, and data management processes? It is imperative that your processes are up to date to safeguard your organization. Can your systems successfully capture and manage vast volumes of metadata? Keep in mind that metadata in a big data context is different from what we’re used to with data warehousing. It’s raw and unclean, yet it provides more insight into context.
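
To make the point about swapping execution engines concrete, here is a minimal Python sketch. It assumes a hypothetical `ExecutionEngine` interface of my own invention; `LocalEngine` and `BatchedEngine` are toy stand-ins, not wrappers around MapReduce, Spark or Flink. The idea it illustrates is simply that pipeline logic defined once can run unchanged when the engine underneath it is replaced.

```python
from typing import Callable, Iterable, Protocol

Step = Callable[[dict], dict]


class ExecutionEngine(Protocol):
    """Hypothetical interface: anything that can apply a list of steps to records."""

    def run(self, records: Iterable[dict], steps: list[Step]) -> list[dict]:
        ...


class LocalEngine:
    """Toy engine: applies every step in-process, one record at a time."""

    def run(self, records, steps):
        out = []
        for record in records:
            for step in steps:
                record = step(record)
            out.append(record)
        return out


class BatchedEngine:
    """Toy stand-in for a different engine: works through fixed-size chunks."""

    def __init__(self, batch_size: int = 2):
        self.batch_size = batch_size

    def run(self, records, steps):
        records = list(records)
        out = []
        for start in range(0, len(records), self.batch_size):
            batch = records[start:start + self.batch_size]
            for step in steps:
                batch = [step(r) for r in batch]
            out.extend(batch)
        return out


def trim_name(record: dict) -> dict:
    return {**record, "name": record["name"].strip()}


def add_name_length(record: dict) -> dict:
    return {**record, "name_length": len(record["name"])}


# The pipeline is defined once, independently of any engine.
PIPELINE: list[Step] = [trim_name, add_name_length]


def run_pipeline(engine: ExecutionEngine, records: Iterable[dict]) -> list[dict]:
    return engine.run(records, PIPELINE)
```

With this shape, changing engines is a one-line change at the call site, for example `run_pipeline(BatchedEngine(500), rows)` instead of `run_pipeline(LocalEngine(), rows)`, while the steps themselves stay untouched.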

You’ll also want to rethink how you use and deploy your metadata. Be sure to stay on top of best practices in metadata management, as the landscape is constantly evolving. Understand that as your data moves through the analytic data pipeline, it’ll pass through places where we put data to work. Places like data warehouses, staging areas, and data lakes are used to enable the analytics process. Along the way, the data is cleansed, transformed, blended, visualized and analyzed, among other things (see the image below). The key is not merely performing and managing those steps, but automating them, so that the process is streamlined and overall performance improves significantly. Without automation, we cannot operate at big data scale.

analytic data pipeline

Data warehouses, staging areas, and data lakes are merely landing zones within the data pipeline. Data moves through the data lake and continues on to be blended with other data, visualized and ultimately used in machine learning models.

Modern Big Data Onboarding Process


Data lakes get filled in a lot of ways. Sometimes data is ingested en masse, and other times it’s trickle fed. Either way, you must have a mechanism in place that enables you not only to ingest the data, but also to understand what it contains. Modern data onboarding is more than connecting and loading. The key is to establish repeatable processes that simplify getting data into the data lake, regardless of data type, data source or complexity, while maintaining an appropriate level of governance.

When we think about a traditional integration process, which is something we’ve done forever, relational databases and CSV files come to mind. We’d have some sort of ingestion or integration process that would pull data from that source. To create that process, we’d typically open each source file, define the metadata, and create a procedure that can be executed for that file. But in a big data context, where the sheer number of datasets needing to be onboarded can be huge, creating and mapping every single integration process just isn’t practical or scalable.

A modern onboarding architecture will dynamically derive metadata on the fly, at the time of ingestion. You can literally have 1,000 unique source files feeding through a single ingestion procedure and delivered directly into Hadoop. Some of those files will have metadata in them, others will not. The process of dynamically ingesting data into the data lake is called metadata injection. As you derive the metadata on the fly, you have the opportunity to save that metadata after ingestion, which gives users better insight into the data in the lake.
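
As a rough illustration of deriving metadata at ingestion time, here is a small Python sketch. It is not how Pentaho metadata injection is implemented; the function names, the list of candidate delimiters and the header heuristic are all assumptions made for the example. What it shows is a single `ingest` procedure that accepts delimited sources with or without a header row and returns both the rows and the metadata it worked out along the way.

```python
import csv
import io

# Assumed set of delimiters to try; a real onboarding flow would handle many more formats.
CANDIDATE_DELIMITERS = [",", ";", "|", "\t"]


def guess_delimiter(first_line):
    """Pick the candidate delimiter that occurs most often in the first line."""
    return max(CANDIDATE_DELIMITERS, key=first_line.count)


def infer_type(values):
    """Return 'int', 'float' or 'string' for a column of raw text values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    for type_name, caster in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                caster(v)
        except ValueError:
            continue
        return type_name
    return "string"


def ingest(source_name, text):
    """One ingestion procedure for any delimited source; metadata is derived, not hand-mapped."""
    delimiter = guess_delimiter(text.splitlines()[0])
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    first, rest = rows[0], rows[1:]
    column_types = [infer_type(col) for col in zip(*rest)]
    # Heuristic (an assumption): an all-text first row above a numeric column is a header.
    first_is_text = all(infer_type([cell]) == "string" for cell in first)
    has_header = first_is_text and any(t != "string" for t in column_types)
    if has_header:
        names, data = first, rest
    else:
        names, data = [f"col_{i}" for i in range(len(first))], rows
        column_types = [infer_type(col) for col in zip(*data)]
    metadata = {
        "source": source_name,
        "delimiter": delimiter,
        "has_header": has_header,
        "columns": [{"name": n, "type": t} for n, t in zip(names, column_types)],
        "row_count": len(data),
    }
    return metadata, data
```

Calling `ingest("orders.csv", text)` for every incoming file and storing each returned `metadata` dictionary in a catalog is one way to keep what was learned at ingestion available to users afterwards. The sketch expects non-empty input and does no error handling.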

Establish a modern onboarding strategy

Pentaho metadata injection helps organizations accelerate productivity and reduce risk in complex data onboarding projects by dynamically scaling out from one template to hundreds of actual transformations.

Embrace New Data Management Practices


Big data is a game changer. The advent of new technologies and practices requires us to update integration strategies to keep pace. The key is to adopt early ingestion and adaptive execution on engines like MapReduce, Spark or Flink, which allows for flexibility. Consider the following three components when developing your new data management strategy:

  1. Enable metadata on ingest. It’s important to derive metadata during onboarding and to capture it when the data is pushed into Hadoop. Then adopt streaming data processing where appropriate. If you’re dealing with micro batches or real-time sensitivities, applying that type of processing at the right time is essential.
  2. Model on the fly. Automate the creation of analytic data models so users can easily spin off analytics and build a model enabling the creation of visualizations on-demand, without involving IT, thus speeding time to value.
  3. Modernize your data integration infrastructure. Come to terms with the fact that you can’t get done what needs to be done with the old tools you’ve been using. Make sure you’re extending your data management processing and strategies to all data, and ultimately apply analytics to all that data anywhere along the pipeline. Don’t just worry about where the data sits and what it looks like. Take the next step, which is to ensure that your organization and users have access to the data in the format or analytics needed to drive the business. In short, modernize your infrastructure.
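
For the micro-batch processing mentioned in the first component, here is a minimal Python sketch. The event shape (a dictionary with a `reading` field) and the helper names are assumptions for illustration only; a real deployment would rely on a streaming engine rather than a generator. It shows the basic pattern: group an incoming stream into small batches and run the same processing on each batch as it fills.

```python
from itertools import islice


def micro_batches(events, batch_size):
    """Yield lists of up to batch_size events from any iterable, even an unbounded one."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def summarize(batch):
    """Example per-batch processing: count the events and average a 'reading' field."""
    readings = [event["reading"] for event in batch]
    return {"count": len(readings), "mean": sum(readings) / len(readings)}


def process_stream(events, batch_size=2):
    """Apply the same summary to every micro batch as it arrives."""
    return [summarize(batch) for batch in micro_batches(events, batch_size)]
```

Because `micro_batches` pulls from an iterator lazily, the same code handles a finite file and a continuous feed; only `batch_size` needs tuning to match how time-sensitive the data is.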

Embrace New Data Management Strategies

You can’t rely on your old tools to meet your analytic needs in a big data environment. Extend your data management processing and strategies to all data, and ultimately apply analytics to all that data.

Operationalize Machine Learning Models


Organizations are learning to leverage machine learning to drive real business value. But it’s not just about applying machine learning algorithms; it’s about ensuring that the machine learning workflow is repeatable. As you prepare your data, engineer features, and build, train, tune and test models, make sure you can deploy, operationalize, and update them as part of your production data flows. This is driving transformational changes in organizations that understand the value of the killer data lake.

Data scientists are smart and provide a valuable service. But it’s counter-productive when they operate on their own data and focus on one-off projects. Their work can be transformative, but the results are often not easily integrated back into the production environment. By all means, let your data scientists work in the environments to which they are accustomed, whether that be Spark, Python, R, Weka, or Scala. But understand that having the capability and the infrastructure to operationalize your machine learning models is crucial.
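
One way to picture the hand-off from data science to production is the Python sketch below. It is deliberately tiny and entirely hypothetical: a one-variable least-squares model, a JSON artifact format I made up for the example, and function names that don't come from any particular tool. The point is the workflow: train in one place, export a versioned artifact, then load and score it inside the production data flow without retraining.

```python
import json


def train(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "kind": "linear",
        "version": 1,
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


def export_model(model):
    """Serialize the trained model to a language-neutral artifact (JSON here)."""
    return json.dumps(model, sort_keys=True)


def load_model(artifact):
    """Load an artifact inside the production flow, rejecting unknown model kinds."""
    model = json.loads(artifact)
    if model.get("kind") != "linear":
        raise ValueError("unsupported model kind")
    return model


def score(model, x):
    """Apply the loaded model to one input value."""
    return model["slope"] * x + model["intercept"]
```

Updating the model then means re-running `train` on fresh data, bumping the version and publishing a new artifact; the production scoring code does not change.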

Apply Machine Learning Algorithms

End the ‘gridlock’ associated with machine learning by enabling smooth team collaboration, maximizing limited data science resources and putting predictive models to work on big data faster.

