Big Data

How I’ve Learned to Stop Worrying and Love the Data Lake

By Bill Schmarzo | April 25, 2014

I have to admit that I’ve struggled with the term “Data Lake.” I first heard the term in 2010 in some Pentaho marketing collateral about Hadoop (“Pentaho, Hadoop, and Data Lakes”) and was confused by both the term and the explanations. Maybe I was confused because many of the early discussions were about how Hadoop would obviate (that is, render obsolete) the need for an enterprise data warehouse.

As I explored the role of Hadoop within more organizations’ enterprise data architectures, I came to realize that the data lake isn’t a replacement for the enterprise data warehouse; it complements the enterprise data warehouse. In many cases, the data lake can actually liberate the enterprise data warehouse to do more of what it does best: give business analysts the ability to monitor and analyze the historical performance of the organization. Let me explain how I’ve learned to stop worrying and love the data lake.

Example: Analyzing Point-of-Sale Data

Figure 1: Sample POS Register Receipt

One of the traditional data warehouse use cases is for retailers to analyze point-of-sale (POS) transaction data (see Figure 1) to perform market basket analysis and answer questions such as:

  • What were the average sales per market basket?
  • What is the average margin per market basket?
  • What products appear most often in market baskets?
  • What is the average percentage of Private Label Products per market basket?
  • What is the typical distribution of product categories per market basket?
  • What products tend to sell in combination as part of the same market basket?

And of course, I want to answer these questions across the multitude of business dimensions (using classic business intelligence “by” analysis) such as store location, store demographics, product category, time of year, day of week, time of day, promotion, customer type, customer demographics, etc.
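
To make this concrete, here is a minimal pandas sketch of a few of these market basket questions. It assumes a hypothetical line-item extract (one row per product per basket); the file name and column names are illustrative, not from any real system.

```python
import pandas as pd

# Hypothetical POS line-item extract: one row per product in a market basket.
line_items = pd.read_csv("pos_line_items.csv")
# Assumed columns: basket_id, store_id, product_category, is_private_label,
# sales_amount, margin

# Roll line items up to one row per market basket.
baskets = line_items.groupby("basket_id").agg(
    store_id=("store_id", "first"),
    basket_sales=("sales_amount", "sum"),
    basket_margin=("margin", "sum"),
    item_count=("sales_amount", "size"),
    private_label_pct=("is_private_label", "mean"),
)

# Average sales and margin per market basket, "by" store location.
print(baskets.groupby("store_id")[["basket_sales", "basket_margin"]].mean())

# Product categories that appear most often in market baskets.
print(line_items["product_category"].value_counts().head(10))

# Average percentage of private-label products per market basket.
print(baskets["private_label_pct"].mean())
```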

This sort of POS data is typically stored in the data warehouse, where these types of questions can be answered and the accompanying business intelligence (BI) analysis (drilling up, drilling down, drilling across) and “by” analysis (“I want to see market basket sales by…”) can be performed.

However, what if I wanted to know which sales transactions were scanned into the POS system versus which ones were hand-coded into the POS system? That information might be important for the following reasons:

  • Are there specific products for which the scanner doesn’t work and for which the sales clerk needs to manually enter the UPC code into the POS system? This might indicate that certain Consumer Packaged Goods manufacturers are placing their UPC codes in locations on the product packaging that make them harder for the scanner to read; for example, a UPC code located on the curve of the toilet paper packaging.
  • Are there specific sales clerks that hand-code more products than other sales clerks? This might indicate a training problem.
  • Are there specific scanners for which more products are hand-coded versus machine scanned? This might indicate scanners that need maintenance or replacement.

To answer these questions, I can’t use the POS receipt. I need the “t” logs from the actual transactions that come off the POS cash register. The “t” logs contain the raw data about each transaction: the exact time (hour:minute:second), the sales clerk’s ID, how each item was entered into the system, and so on. The “t” logs provide the detailed data necessary to answer the questions posed above about hand-entering UPC codes into the POS system. Questions like these, where I need access to the detailed transaction logs in their raw format, are the reason I need a data lake.
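
As a rough illustration (not the actual analysis), here is a pandas sketch against a hypothetical “t” log extract, with one row per item entered at the register; the file name and columns (txn_time, register_id, clerk_id, upc, entry_method) are assumptions.

```python
import pandas as pd

# Hypothetical "t" log extract: one row per item entered at the POS register.
tlog = pd.read_csv("tlog_items.csv")
# Assumed columns: txn_time, register_id, clerk_id, upc, entry_method
tlog["hand_keyed"] = tlog["entry_method"].eq("HAND_KEYED")

# UPCs that get hand-keyed most often -- possible packaging / UPC placement issues.
by_upc = tlog.groupby("upc")["hand_keyed"].agg(["mean", "count"])
print(by_upc[by_upc["count"] >= 50].sort_values("mean", ascending=False).head(10))

# Clerks with unusually high hand-key rates -- possible training issue.
print(tlog.groupby("clerk_id")["hand_keyed"].mean().sort_values(ascending=False).head(10))

# Registers/scanners with high hand-key rates -- possible maintenance or replacement.
print(tlog.groupby("register_id")["hand_keyed"].mean().sort_values(ascending=False).head(10))
```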

The data lake is becoming important because I can load my raw, unaltered structured and unstructured data into it as-is, without having to define the data model schema before I can load the data. Think “schema on read” (where I define my data model schema and data requirements when I query the data) versus “schema on load” (where I have to define my data model schema as I load the data into the data repository). Think of it as the difference between the POS receipt analysis and the “t” log analysis.
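
Here is a minimal PySpark sketch of the schema-on-read idea, assuming the raw “t” log files have simply been copied into a hypothetical HDFS directory as-is; the path and field names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# "Load" step: raw t-log files were simply copied into the lake as-is;
# no schema was declared up front (hypothetical HDFS path).
raw = spark.read.json("hdfs:///lake/raw/pos_tlogs/")

# "Read" step: structure is imposed only now, when a question is asked.
hand_key_rates = (
    raw.select(
        F.col("register_id").cast("string").alias("register_id"),
        (F.col("entry_method") == "HAND_KEYED").cast("int").alias("hand_keyed"),
    )
    .groupBy("register_id")
    .agg(F.avg("hand_keyed").alias("hand_key_rate"))
)
hand_key_rates.show()
```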

Central Data Repository:  Data Lake versus Data Warehouse?

Readers of my book “Big Data: Understanding How Data Powers Big Business” and followers of my blog have seen the architectural layout below several times (see Figure 2).

Figure 2: Modern Data Architecture

Rather than make readers reread my previous posts on this topic, let me summarize the differences between the major architectural components: the BI/EDW environment, the analytics environment, and the data store or data lake.

  • The BI/EDW Environment is your traditional data warehouse that supports the business analysts’ questions and the organization’s reporting and dashboard needs. This is a production environment with very predictable loads; it is SLA-driven and heavily governed. The data in the EDW must be 100% accurate or people go to jail. Most organizations look to standardize their data transformation, database, and BI tools at this level in order to drive down costs and ensure an SLA-compliant environment.
  • The Analytics Environment is where your data scientists can self-provision compute environments and the data sources they need in order to freely mine the data. This environment is almost the polar opposite of the BI/EDW environment: it is an exploratory environment with very unpredictable load and usage patterns. It is an environment where the data scientists need to be free to experiment with new data sources, new data transformations, and new analytic models in order to uncover new insights buried in the data and to build predictive and prescriptive models of key business processes. It is loosely governed and typically allows the data scientists to use whichever tools they prefer for their exploration, analysis, and analytic modeling.
  • The Data Lake is the central repository where all the data is loaded “as is.” It should also support data federation to provide access to lightly used data sources that are not a physical part of the data lake, but appear to be. For example, you may not want to download ALL of your detailed social media data from sites such as Facebook, Twitter, Pinterest, Instagram, Tumblr, LinkedIn, Yelp, and Google+ into your data lake, but instead provide a conduit (via the social media site APIs) to those sites for gaining access to their detailed data as needed.

The Hadoop data lake repository can store ALL of the organization’s data “as is” in a low-cost HDFS environment (without the added burden of predefining your data schemas), and then feed both the production enterprise data warehouse / business intelligence environment and the ad hoc, exploratory analytics sandbox as necessary. Get comfortable with the fact that the data lake may contain data that is never intended to reach the data warehouse.

EDW Enhancement Example: Do the ETL/ELT in the Data Lake

Doing ETL (Extract, Transform, Load) within your data warehouse is common today. However, if your data warehouse is already overloaded, why do that batch-centric, data-management-heavy work in such an expensive environment?

Instead, move the ETL processes off your EDW platform and do the ETL/ELT (Extract, Load, Transform) work in an inherently parallel, open source, cost-effective, scale-out environment like Hadoop. Doing the ETL (as well as ELT) within Hadoop lets you leverage that natively parallel environment to bring the appropriate compute capabilities to bear at the appropriate times, getting the job done more quickly and more cost-effectively.

Not only does using Hadoop for your ETL/ELT work make sense from a cost and processing-effectiveness perspective, it also gives you the capability to create new data transformations that are difficult to do with traditional ETL tools. For example, creating new metrics around customer and product performance that leverage frequency (how often), recency (how recently), and sequencing (in what order) can yield new insights that may be better predictors of customer behavior and product performance.
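
As a hedged sketch of what that ELT-style enrichment might look like in Hadoop, the following PySpark snippet computes frequency, recency, and sequencing metrics from a hypothetical raw transaction table in the lake; the HDFS paths and column names are assumptions, not part of any real deployment.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-enrichment-sketch").getOrCreate()

# Hypothetical raw transactions already landed in the lake as-is.
txns = spark.read.parquet("hdfs:///lake/raw/transactions/")
# Assumed columns: customer_id, product_id, txn_ts

# Frequency (how often) and recency (how recently) per customer/product,
# computed in the inexpensive, natively parallel lake environment.
metrics = txns.groupBy("customer_id", "product_id").agg(
    F.count("*").alias("frequency"),
    F.datediff(F.current_date(), F.max("txn_ts")).alias("recency_days"),
)

# Sequencing (in what order): the order in which each customer made purchases.
w = Window.partitionBy("customer_id").orderBy("txn_ts")
sequenced = txns.withColumn("purchase_seq", F.row_number().over(w))

# Publish the enriched tables back to the lake, ready to feed the EDW or sandbox.
metrics.write.mode("overwrite").parquet("hdfs:///lake/enriched/customer_product_rfm/")
sequenced.write.mode("overwrite").parquet("hdfs:///lake/enriched/transactions_sequenced/")
```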

Beware That Your Data Lake Doesn’t Become Your Data Garage

We all know what our garage looks like.  Tons of boxes, some unopened from the previous move, sit buried in the garage.  And in California it’s even worse, as most people park their overly expensive cars in the streets so that they can pack more junk in their garages.

The garage has truly become a dumping ground for everything that we thought at one time or another might be valuable.  The writers for the movie “Raiders of the Lost Ark” got it right when they decided that the best way to hide the invaluable Ark of the Covenant was in a massive warehouse.  Yep, Figure 3 looks like my garage.

Figure 3: Raiders of the Lost Ark

Joe Dossantos here at EMC uses a metaphor to talk about this “finding the data” challenge with respect to the data lake. Let’s say you had the capacity to build a beautiful new library to store every book that has ever been written. If you built it and a truck simply dumped 1 million volumes into the reading room, what value would that be to people looking for a particular book or theme? Here is what we need to take into consideration as we design our data lake:

  • How do you develop the equivalent Dewey Decimal system to help people find the things that they are looking for in the data lake?
  • How can you deliver even more value from understanding the contents (metadata) of the data lake?  Couldn’t you help people understand the general idea of the contents without reading each book?  What is the general opinion of Napoleon?  Was the War of 1812 a good idea?

Solutions to this problem already exist in tools like Apache Solr. With Solr, you can know not only what data is available in the data lake and where it resides, but also understand what that data actually means.
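
For illustration only, here is a small sketch using the open source pysolr client and a hypothetical “datalake_catalog” Solr collection: each data set landed in the lake gets a short catalog document describing what it is, where it lives, and what it contains, which analysts can then search like a card catalog. The collection name, document fields, and paths are all assumptions.

```python
import pysolr

# Hypothetical Solr collection that acts as the data lake's "card catalog".
solr = pysolr.Solr("http://localhost:8983/solr/datalake_catalog", timeout=10)

# When a new data set lands in the lake, register a small catalog document for it.
solr.add(
    [
        {
            "id": "pos_tlogs_2014_04",
            "title": "POS transaction logs, April 2014",
            "path": "hdfs:///lake/raw/pos_tlogs/2014/04/",
            "source_system": "store POS registers",
            "field_names": ["txn_time", "register_id", "clerk_id", "upc", "entry_method"],
            "description": "Raw register t-logs, including scan vs. hand-key entry method",
        }
    ],
    commit=True,
)

# Analysts can then search the catalog instead of spelunking through HDFS paths.
for doc in solr.search("hand-key"):
    print(doc["id"], doc["path"])
```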

Summary

The data lake provides an order-of-magnitude improvement to your data architecture in terms of capabilities and agility. Not only does it free up expensive EDW resources, it also enables a self-sufficient analytics environment whose data requests can be fulfilled without needlessly screwing up the EDW’s service level agreements.

Figure 4: Example Client Architecture

And this is not just wishful thinking. Figure 4 shows a company that is rapidly embracing the data lake concept, not only to free up EDW resources and enable a big data analytics sandbox, but also as the foundation for its future EDW. Lots of interesting and liberating data architecture and data management approaches are going to get blown up. Time to stop worrying and learn to love the data lake.


6 thoughts on “How I’ve Learned to Stop Worrying and Love the Data Lake”

  1. I like and have recommended similar approaches in the past. One point that would strengthen the post, and that I would enjoy seeing a follow-up post on, is how you would suggest addressing the legal and regulatory questions that surround businesses’ use of all types of data, questions that seem to be exacerbated not only by creating data lakes but by making them broadly accessible.

  2. Jonathan, a good point. Data governance, master data management, and data administration are even more important in a data lake world. Traditional data administration capabilities (easily finding existing data in the data lake, controlling and tracking access and usage, data lineage, anonymizing data, adding data tags/metadata, etc.) become even more challenging in a data lake world unless the proper controls, policies, procedures, compliance training, regulations, and tracking are put into place. If you don’t properly manage and control the data lake, you pretty quickly end up with a data dump that is both expensive to manage and liable to expose the organization to PII and other compliance liabilities and risks.

  3. And to make the link between your data lake and your analytics requirements, you can try the ETR concept (Extract, Transform, and Report) using Pentaho tools, where the traditional ETL, with all of its connectors and functionality, can serve as a data source for your reporting tool.

  4. Bill – I like the point you bring up about the data lake and the fear many organizations may have about “downloading the internet.” Your solution of “provide a conduit (via the social media site APIs)” is a great way to reduce the fear of housing all the data of the world, and of subsequently being responsible (as a steward) for it, when all you’re using it for is experimental analytics.

  5. Bill,
    I’m not a big supporter of the data lake metaphor (http://bit.ly/data-lake) – in a lake, individual water drops are indistinguishable and infinitely fluid. This is exactly the opposite of the type of contextual structuring that you declare (and I agree) is necessary.

    A place to store and contextualize as-is raw data is a definite need today. Routing already cleansed and reconciled data from e.g. enterprise apps / MDM systems there seems like additional and unnecessary work to me. It can go directly to the EDW. This pillar architecture is the basis of the logical architecture in Business unIntelligence (http://bit.ly/BunI_Book) – place the different technologies in parallel rather than in series.

    At a technical level, it’s not yet clear to me if Hadoop can be a suitable base technology for all needs – relational DB, streaming, ETL, etc. – that some vendors claim it to be. And I suspect it will take many years before it is as mature as existing 30 year-old systems in terms of reliability, data management, governance and such.

  6. Barry, good points (and a very nice blog). I don’t believe that the data lake will “rip and replace” existing data warehouses any time soon either, but I do believe that the data lake is a strong complement to existing data warehouses. Over time, however, the data lake (with more enhancements and maturation) could become a compelling data warehouse platform – a platform that supports both the data warehouse and business intelligence environment as well as the more ad hoc, exploratory analytics sandbox (check out my blog “Modernizing Your Data Warehouse” https://infocus.dellemc.com/william_schmarzo/modernizing-your-data-warehouse-part-2/).

    I don’t necessarily like the “data lake” term either, because it tends to overlook the importance of data governance, master data management, and other data management disciplines. However, I love the underlying concepts (schema-on-query versus schema-on-load, a common data infrastructure that supports both the data warehouse and analytics environments, ELT versus ETL, leveraging an inexpensive, natively parallel compute environment to do my data transformation and enrichment). And I do think having the data warehouse and analytics environments on the same data platform simplifies integrating the analytic results and insights back into my BI reports and dashboards. Check out my blog “Store Manager Actionable Dashboard” (https://infocus.dellemc.com/william_schmarzo/store-manager-actionable-dashboard/) for an example of how organizations can leverage the insights from their analytics environment to create a new, more actionable business intelligence environment.

    BTW, we already have a couple of customers who are building a “data lake” upon which they are running both their data warehouse and analytics environments. These were rare situations with no existing data warehouse, so no “rip and replace” was required.

    The data lake should be an interesting concept and debate to follow over the next 12 months as the underlying technologies, tools, and methodologies mature. IMHO.