
Modernizing Your Data Warehouse Part 2

By Bill Schmarzo, November 5, 2013

This is a topic I wrote about many moons ago, but it is one that still comes up time and time again. It is of particular significance to organizations that already have a data warehouse and business intelligence investment and want to leverage that investment to introduce advanced analytics. When I get the question of how best to do so, I use the chart below (Figure 1) to highlight a high-level architecture that:

  • Preserves (and can even strengthen) a data warehouse environment
  • Enables the integration of advanced data management and analytics capabilities into the organization

Let’s review the 3 components of the architecture below and walk through a “How To” guide for getting there.


Figure 1: A Modern EDW and Advanced Analytics Architecture


Action 1:  Upgrade Your EDW

Today’s data warehouse platforms are very expensive. As a result, most organizations limit how much data they keep in their data warehouse, opting for 13 months of summarized data versus 13 years of detailed data. Unfortunately, it’s in that detailed data that many invaluable insights about your customers, products, campaigns, partners and operations can be found.

For example, over the past 13 years, the economy has gone through 2 full economic cycles where it was flying high, collapsed, and then climbed back up only to repeat the cycle all over again. By looking at each customer’s product purchase patterns over those 2 economic cycles, you can closely predict when someone is personally impacted by the economic cycle – both positively and negatively. For example, if you’re a grocery chain, you can monitor your customers’ market baskets and uncover changes in their purchase patterns that indicate changes in their economic conditions. You could see a customer moving from expensive to lower-cost products (from steak to hamburger), starting to use more coupons, or increasing their purchases of private label products.

At the individual customer level, the grocery chain probably knows more about economic conditions across the United States than any financial institution or government agency. The grocery chain would be the first to see the shift in buying patterns at the individual level, and could aggregate those buying patterns into higher levels such as geographic/ZIP+4, demographic, and behavioral customer categories.
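
To make that concrete, below is a minimal Python sketch of how those trade-down signals might be computed. The record layout, the specific signals, and the threshold are illustrative assumptions, not a reference to any particular retailer’s systems.

```python
# A minimal sketch (all field names and the threshold are hypothetical) of flagging
# the "steak to hamburger" style trade-down signals described above: compare a
# customer's recent baskets against an earlier baseline period.

from collections import defaultdict

def basket_signals(transactions):
    """Summarize one period of a customer's transactions.

    Each transaction is assumed to look like:
    {"spend": 12.50, "tier": "premium" or "value",
     "private_label": True/False, "coupon_used": True/False}
    """
    totals = defaultdict(float)
    for t in transactions:
        totals["spend"] += t["spend"]
        if t["tier"] == "value":
            totals["value_spend"] += t["spend"]
        if t["private_label"]:
            totals["private_label_spend"] += t["spend"]
        if t["coupon_used"]:
            totals["coupon_trips"] += 1.0
        totals["trips"] += 1.0
    spend = totals["spend"] or 1.0
    trips = totals["trips"] or 1.0
    return {
        "value_share": totals["value_spend"] / spend,            # steak -> hamburger
        "private_label_share": totals["private_label_spend"] / spend,
        "coupon_rate": totals["coupon_trips"] / trips,
    }

def trade_down_flag(baseline_txns, recent_txns, threshold=0.10):
    """Flag a customer whose recent baskets have shifted toward lower-cost products."""
    base, recent = basket_signals(baseline_txns), basket_signals(recent_txns)
    shifts = {k: recent[k] - base[k] for k in base}
    # Any signal rising by more than the threshold suggests new economic pressure.
    return any(delta > threshold for delta in shifts.values()), shifts
```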

Organizations can upgrade their data warehouse capabilities by using modern MPP (massively parallel processing), open source, scale-out technologies that can dramatically drive down enterprise data warehouse (EDW) costs (these modern, scale-out technologies can offer a 20x price advantage over traditional EDW technologies). Plus, you can focus your EDW on doing only EDW work, and off-load non-EDW activities like ETL and advanced analytics to a more appropriate, more cost-effective environment (see Figure 2).


Figure 2: Focus Your EDW on EDW Tasks

Action 2:  Upgrade Your ETL Processes

Doing ETL (Extract, Transform, Load) within your existing data warehouse is common today. However, if your data warehouse is already overloaded and overly expensive, why do that batch-centric, data-management-heavy work in an expensive environment? That’s like having a Tesla haul turnips around the farm.

Instead, optimize your ETL processes by moving ETL off of your expensive EDW platform and doing the ETL/ELT (Extract, Load, Transform) work in an inherently parallel, open source, cost-effective, scale-out environment like Hadoop/MapReduce. Doing the ETL (as well as ELT) within Hadoop allows you to leverage that natively parallel environment to bring the appropriate compute capabilities to bear at the appropriate times to get the job done more quickly and more cost-effectively.
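
As a rough illustration of what doing the transform work in Hadoop can look like, here is a small Hadoop Streaming job sketched in Python. The raw file layout, field names, and paths are hypothetical; the point is simply that the parsing, cleansing, and aggregation run on the Hadoop cluster rather than inside the EDW.

```python
#!/usr/bin/env python
# --- mapper.py ---
# Cleanse and conform raw point-of-sale lines (hypothetical tab-separated layout:
# customer_id, sku, quantity, amount) on the Hadoop cluster instead of in the EDW.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        continue                          # drop malformed records during the transform
    customer_id, sku, qty, amount = fields[:4]
    try:
        print("%s\t%.2f" % (customer_id, float(amount)))
    except ValueError:
        continue                          # skip rows with unparseable amounts

# --- reducer.py ---
# Aggregate the cleansed records per customer before loading the EDW.
import sys

current_key, total = None, 0.0
for line in sys.stdin:
    key, amount = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print("%s\t%.2f" % (current_key, total))
        current_key, total = key, 0.0
    total += float(amount)
if current_key is not None:
    print("%s\t%.2f" % (current_key, total))
```

A pair of scripts like this would typically be submitted with the Hadoop Streaming jar (along the lines of hadoop jar hadoop-streaming.jar -input /raw/pos -output /conformed/pos -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py, with the paths purely illustrative), and the conformed output would then be bulk-loaded into the EDW.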

As we’ve discussed before, not only does using Hadoop for your ETL/ELT work make sense from a cost and processing effectiveness perspective, but it also gives you the capability to create new metrics that are difficult to produce using traditional ETL tools. For example, creating new metrics around customer and product performance that leverage frequency (how often), recency (how recently) and sequencing (in what order) can yield new insights that might be better predictors of customer behaviors and product performance.
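
Here is a simple sketch of what those metrics could look like in code; the record layout and the particular features are illustrative assumptions, not a prescribed model.

```python
# A rough sketch (record layout is a hypothetical example) of per-customer
# frequency (how often), recency (how recently), and sequencing (in what order)
# metrics computed from granular transaction history.

from collections import Counter, defaultdict

def rfs_metrics(transactions, as_of):
    """transactions: iterable of dicts like
    {"customer_id": "...", "purchase_date": datetime.date(...), "category": "..."}
    Returns {customer_id: {"recency_days", "frequency", "top_transition"}}.
    """
    by_customer = defaultdict(list)
    for t in transactions:
        by_customer[t["customer_id"]].append(t)

    metrics = {}
    for cust, txns in by_customer.items():
        txns.sort(key=lambda t: t["purchase_date"])
        recency_days = (as_of - txns[-1]["purchase_date"]).days    # how recently
        frequency = len(txns)                                       # how often
        transitions = Counter(                                      # in what order
            (a["category"], b["category"]) for a, b in zip(txns, txns[1:])
        )
        top = transitions.most_common(1)[0][0] if transitions else None
        metrics[cust] = {
            "recency_days": recency_days,
            "frequency": frequency,
            "top_transition": top,
        }
    return metrics
```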

As a side benefit, this Hadoop front-end data repository can store ALL of the organization’s data in a low-cost HDFS environment as-is (without the added burden of pre-defining your data schemas), and then feed both the production enterprise data warehouse environment and the high-velocity analytics sandbox as necessary (see Figure 3, and the sketch that follows it).


Figure 3: Optimize Your ETL Processes
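
To illustrate the schema-on-read idea described above, here is a minimal Python sketch: raw files land in HDFS untouched, and each downstream consumer, whether the EDW load or an analytics sandbox job, imposes only the schema it needs at read time. File names, paths, and field names are illustrative assumptions.

```python
# A minimal illustration of "land it as-is, apply schema on read": raw files are
# copied into HDFS untouched (e.g., `hadoop fs -put pos_20131105.json /data/raw/pos/`,
# with paths purely illustrative) and a schema is imposed only when a downstream
# consumer (the EDW load job or an analytics sandbox job) actually reads the data.

import json

# Each consumer declares only the fields it cares about; the raw store never changes.
EDW_PROJECTION = ("transaction_id", "customer_id", "purchase_date", "amount")
SANDBOX_PROJECTION = ("customer_id", "purchase_date", "category", "coupon_used")

def read_with_schema(raw_lines, projection):
    """Apply a projection (schema-on-read) over raw JSON-lines records."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue                 # the bad line stays in the raw store; just skip it here
        yield {field: record.get(field) for field in projection}

# Example: the same raw feed can serve both downstream environments.
# with open("pos_20131105.json") as f:
#     edw_rows = list(read_with_schema(f, EDW_PROJECTION))
```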

Action 3:  Introduce Your Analytics Sandbox

This architecture supports the introduction of a separate analytics environment that off-loads the analytics being done today on your overly expensive EDW.  This separate environment also gives the data science team an on-demand, fail-safe environment for quickly addressing immediate business opportunities independent of the EDW’s strict governance and SLA rules.

The analytics environment couldn’t be more different from your data warehouse environment. Your data warehouse environment is a production environment that needs to support the regular (daily, weekly, monthly, quarterly, annual) production of operational and management reports, as well as dashboards that are used to run the business. In order to do that, most data warehouse environments have strict SLAs, are heavily governed, and make use of standard tools in order to meet those production requirements while holding down costs.

An analytics environment, on the other hand, is much more ad hoc and on-demand driven. The analytics environment must support continuous exploration of those new data sources that are essential to uncovering actionable insights about your customers, products and operations. It must allow for the development of new data management and analytic tools/techniques that are the most appropriate for your analytics and data science team (see Figure 4).


Figure 4: Introducing the Analytics Environment

Summary

These are certainly marvelous times to be in the data business. Technology innovations and maturation are providing new architectural and infrastructure options that enable organizations to extend and enhance their existing data warehouse and ETL investments, while introducing an analytics environment that off-loads work that today squanders expensive data warehouse resources. All of this happens while your data science team gets an environment where they are free to explore new data sources and data enrichment techniques to tease out new actionable insights about your customers, products, and operations. Watch this space for continuing technology innovations and the impacts these innovations could have on your data and analytics capabilities.


2 thoughts on “Modernizing Your Data Warehouse Part 2”

  1. ALL data into Hadoop first? Wouldn’t that just introduce extra complexity and latency for transactional data going into the data warehouse (which is, after all, one of the big current challenges of DW)? And surely we shouldn’t talk about modernization of the DW without noting that modern in-memory transaction environments allow a lot more powerful analytics to be done directly in the transaction system, without any duplication at all (Hadoop does not yet support transactions)… [disclaimer: I’m an Innovation Evangelist for SAP]

  2. Timo, I agree. There are many use cases (e.g., real-time bidding, ad serving, fraud detection, algorithmic trading) where you want to run your analytic models at the time of the transaction. And modern in-memory environments (like HANA) are perfect for those use cases. But remember, I’m only “executing” my analytic models in-memory at the time of the transaction; I’m not “developing” my analytic models at the time of the transaction. Consequently, I need an environment where I can have access to ALL the transactions, at the lowest levels of granularity, so that I can use advanced data management techniques (e.g., recency, frequency, sequencing) and advanced analytic modeling to tease out new propensities, tendencies, behaviors, patterns, trends, correlations, cohort relations, etc. That’s why I dump all the data – after that data has passed through the in-memory transaction environment first – into something like Hadoop. Now the data scientists can have access to whatever data they need, at whatever level of granularity, over whatever period of time, whenever they want.

    By the way, as the in-memory environment for those use cases is analyzing the transactions, I also want whatever “score” the in-memory analysis calculated, added to the transaction. That way the data scientists can track and measure the effectiveness of the in-memory analytic models with the goal of constantly fine-tuning those analytic models.

    So to make the architecture map complete, I probably should add an in-memory environment sitting prior to the Hadoop Data Store (which I refuse to call a Data Lake for reasons of adding yet another word without content to the big data discussion). That would address the real-time analytic use cases.

    Timo, does that make sense?

    Sounds like the subject for a follow-up blog on my part. Timo, please let me know if I can reference your question in my follow-up blog. Thanks! Bill