
Industrializing the Data Value Creation Process

By Doug Cackett, EMEA Big Data & IoT Solution Lead, Dell EMC Consulting, July 24, 2018

For organizations to maximize the data value creation process, it’s critical to have a clear line of sight from their business strategy and stakeholders through to the decisions that could be improved by applying machine learning and other techniques to the available data.

In recent months, what we’ve increasingly seen is Chief Data Officers taking a more active role in facilitating that process, focusing more on desired business outcomes and value delivery, and in doing so transforming themselves into Chief Data Monetization Officers. See the related blog, Data Monetization? Cue the Chief Data Monetization Officer.

For those outcomes to be fully realized and to create value on a true industrial scale, organizations need to have a laser focus on the process – automating steps and reducing waste to dramatically reduce the overall cost and time to insight for the production of “analytical widgets” in our “Data Factory”. If you think about it, that’s exactly what we’ve seen happening in the manufacturing world since the very first Model T rolled off the production line at Ford – why should the world of data be any different?

The Data Factory process really consists of 3 key steps. In the rest of this blog, I’ll outline each step and suggest how we might do more to industrialize the process.

Figure 1: Data Value Creation Process

Step 1 – Discover

The first step in the value chain is to explore the data in order to find something, to discover a pattern in the data that we might be able to apply in a subsequent step to create value for the business. Without Discovery, all you have in the data lakes is lots of data. That’s a lot of cost and not a lot of value (none, in fact), so this is perhaps the most important step in the process.

Discovery could be just about anything but most often we will be looking to optimize a customer interaction, such as applying personalization elements to an application to make content or an offer more relevant and compelling. Applying the personalization comes in Step 2, but before we get there, we need to uncover the pattern that allows us to be more personal.

To find patterns in Discovery, the data scientist will iterate through a number of steps to prepare data, build a model and then test it until the best one is found. The process is iterative as many factors can be changed such as the way data is prepared, the algorithm used and its parameters. As a model is a representation of knowledge or reality, it is never perfect. The Data Scientist will be looking for the one that performs the best for that specific business problem.

You can think of the value at this stage as personal value: value to the data scientist in what they have learned, not commercial value to the organization. To create commercial value, we need to operationalize the pattern we found by applying the model. See Step 2 below.

Testing Models with Machine Learning and Data Science

This isn’t meant to be a data science primer but before we move into the Monetize step, it might be helpful to briefly review some of the basics around Data Science.

To keep it simple, let’s imagine we have a classification problem where we are trying to predict which customers are most likely to respond to a particular marketing campaign and we are going to build a classification model using existing sales and customer data so we can do just that.

To avoid over-fitting and to help ensure that the model remains accurate when it is applied to new data, we split our data and keep some back so we can test the model with data it has not seen during training. We can then tabulate the results into a “confusion matrix” and look at the types of error made and the general classification rate. A false positive is where the model predicted a purchase and no purchase was made; a false negative is the other way around, where the model predicted no purchase and a purchase was made.
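As a rough illustration of the hold-back test and the confusion matrix, here is a minimal Python sketch. The helper names and the six sample outcomes are made up for illustration; they are not taken from a real campaign or from any particular data science tool.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows and hold back a fraction for testing (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # Return (training rows, held-back test rows).
    return shuffled[n_test:], shuffled[:n_test]

def confusion_matrix(actual, predicted):
    """Count the four possible outcomes of a purchase / no-purchase prediction."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p and a:
            counts["tp"] += 1  # predicted a purchase, purchase made
        elif p and not a:
            counts["fp"] += 1  # predicted a purchase, none was made
        elif a:
            counts["fn"] += 1  # predicted no purchase, one was made
        else:
            counts["tn"] += 1  # predicted no purchase, none was made
    return counts

def classification_rate(counts):
    """Share of all predictions that were correct."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0

# Made-up outcomes for six held-back customers (True = purchase).
actual = [True, False, True, False, False, True]
predicted = [True, True, False, False, False, True]

matrix = confusion_matrix(actual, predicted)
print(matrix, classification_rate(matrix))
```

In practice the split and the matrix would normally come from whatever modeling library or commercial tool is already in use; the point here is only to show what is being counted.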

Whether any model is good or bad is very contextual. In our case, the 200 false positives may be great if the cost of the campaign is low (email) but may be considered poor if the campaign is expensive or these are our best customers and they’re getting fed up with being plagued with irrelevant offers! The situation is similar with the false negatives. If this is your premium gateway product and there is any chance of someone purchasing it, you may decide this misclassification is OK; however, if it’s a fraud problem and you falsely accuse 300 customers, then that’s not so great. See the blog on Is Data Science Really Science for more on false positives.

Figure 2: Sample Model Prediction (Confusion Matrix)

When we score our testing data, the model makes a prediction of purchase or non-purchase based on a threshold probability, typically 0.5. As well as changing the model algorithm and parameters, one of the other things the Data Scientist might do is to alter the threshold probability or misclassification cost to see how it impacts the errors in the confusion matrix, making adjustments based on required business goals so the best overall model is found.
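To make the threshold idea concrete, the sketch below counts false positives and false negatives at a few candidate thresholds and picks the one with the lowest total misclassification cost. The probabilities, outcomes, costs and function names are all invented for illustration; real costs would come from the business.

```python
def errors_at_threshold(probabilities, actual, threshold=0.5):
    """Return (false positives, false negatives) for a given threshold."""
    pairs = list(zip(probabilities, actual))
    fp = sum(1 for p, a in pairs if p >= threshold and not a)
    fn = sum(1 for p, a in pairs if p < threshold and a)
    return fp, fn

def best_threshold(probabilities, actual, fp_cost, fn_cost, candidates):
    """Pick the candidate threshold with the lowest total misclassification cost."""
    def total_cost(threshold):
        fp, fn = errors_at_threshold(probabilities, actual, threshold)
        return fp * fp_cost + fn * fn_cost
    return min(candidates, key=total_cost)

# Made-up model scores and actual outcomes (True = purchase).
probabilities = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
actual = [True, True, False, True, False, False]
candidates = [0.35, 0.5, 0.65]

# Cheap email campaign: a missed buyer costs more than a wasted email.
cheap_campaign = best_threshold(probabilities, actual, fp_cost=1, fn_cost=10,
                                candidates=candidates)
# Expensive campaign: a wasted contact costs more than a missed buyer.
costly_campaign = best_threshold(probabilities, actual, fp_cost=10, fn_cost=1,
                                 candidates=candidates)
print(cheap_campaign, costly_campaign)
```

With these made-up numbers the cheap campaign favors a lower threshold (contact more people) and the expensive campaign a higher one, which is the trade-off the Data Scientist is tuning.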

Another approach to optimizing marketing campaign effectiveness is to rank results using “expected value” which we calculate by multiplying the probability of a purchase by the expected (purchase) value, often using the customer’s average previous purchase value as a proxy.

For example, we might want to mail the top 10,000 prospects and maximize income from the campaign, so we rank our customers by expected value and select the top 10,000. In this way, someone with a probability of 0.3 but an average purchase value of $1,000 would be higher in our list than someone with a much higher probability of 0.8 and a lower average value of $100 (expected values of $300 vs $80).
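The ranking can be sketched in a few lines of Python. The two customers mirror the figures in the example above; the field names and the function are hypothetical.

```python
def rank_by_expected_value(customers, top_n):
    """Rank customers by probability of purchase x average purchase value."""
    scored = [
        (c["id"], c["probability"] * c["avg_purchase"])
        for c in customers
    ]
    # Highest expected value first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

customers = [
    {"id": "B", "probability": 0.8, "avg_purchase": 100},   # expected value 80
    {"id": "A", "probability": 0.3, "avg_purchase": 1000},  # expected value 300
]

ranked = rank_by_expected_value(customers, top_n=2)
for customer_id, expected_value in ranked:
    print(customer_id, round(expected_value, 2))
```

Customer A comes first despite the lower probability, because the larger average purchase more than makes up for it.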

I’ve just used a simple example here to avoid confusion – the real world is rarely that straightforward, of course. We may need to boost or combine models or tackle unsupervised modeling techniques, such as clustering, that are non-deterministic and therefore require greater skills on the part of the Data Scientist in order to be effective.

Step 2 – Monetize

It’s worth noting that I’m using the word “monetize” here as shorthand for “creating value” from the data. I’m not suggesting selling your data, although that may be the case for a limited set of companies. It may also have nothing to do with actually making money – in the case of a charity or government organization the focus will be on saving costs or improving service delivery – but the broad principles remain the same regardless.

It’s also worth noting that not all of the models coming out of the Discovery step will need to be deployed into an operational system such as an eCommerce engine. It may be that the insights gained can simply help to refine a manual process. For example, a retailer might benefit from looking at the profile of customers purchasing a particular group of fashion products to see how it aligns to the target customer segment identified by the merchandising team.

Having said that, in most cases, more value is likely to be created from applying AI and machine learning techniques to automated processes given the frequency of those decision points.

We will focus more on this aspect in the remaining part of this blog.

For those problems where we are looking to automate processes, the next thing we need to do is to monetize our model by deploying it into an operational context. That is, we set it into our business process to optimize it or to create value in some way such as through personalization. For example, if this was an online shopping application we might be operationalizing a propensity model so we display the most relevant content on pages or return search results ranked in relevance order for our customers. It’s these kinds of data-driven insights that can make a significant difference to the customer experience and profitability.

What we need to do to operationalize the model will depend on a number of factors, such as the type of model, the application that will consume the results of the model and the tooling we’re using. At its simplest, commercial Data Science tools such as Statistica have scoring capabilities built in. At the other end of the spectrum, the output from the Discovery process may well just land in the agile development backlog for implementation into a new or existing scoring framework and associated application.

Step 3 – Optimize

I’ve already mentioned that no machine learning model is perfect and, to further complicate things, a model’s performance will naturally decay over time – like fine wines, some may age delicately, while others will leave a nasty taste before you get them home from the store!

That means we need to monitor our models so we are alerted when performance has degraded beyond acceptable limits. If you have multiple models and decision points in a process, one model may also have a direct impact on another. It is this domino effect of unforeseen events which makes it even more important not to forget this step.
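One simple way to picture this kind of monitoring is a rolling check of recent predictions against a baseline. The sketch below is an assumption about how such a check might look, not a description of any product’s model management feature; the class name, baseline, tolerance and window size are all hypothetical.

```python
from collections import deque

class ModelMonitor:
    """Track recent prediction outcomes and flag degraded performance."""

    def __init__(self, baseline_rate, tolerance=0.05, window=100):
        self.baseline_rate = baseline_rate    # classification rate at deployment
        self.tolerance = tolerance            # acceptable drop below the baseline
        self.outcomes = deque(maxlen=window)  # keeps only the most recent results

    def record(self, predicted, actual):
        """Store whether a prediction turned out to be correct."""
        self.outcomes.append(predicted == actual)

    def current_rate(self):
        """Classification rate over the recent window, or None if no data yet."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self):
        """True when recent performance has fallen beyond the acceptable limit."""
        rate = self.current_rate()
        return rate is not None and rate < self.baseline_rate - self.tolerance

# Made-up usage: a model deployed with an 80% classification rate.
monitor = ModelMonitor(baseline_rate=0.8, tolerance=0.05, window=10)
for correct in [True] * 9 + [False]:
    monitor.record(predicted=True, actual=correct)
print(monitor.current_rate(), monitor.degraded())
```

A real deployment would also need to account for the delay before actual outcomes are known, and for the way one model’s decisions can affect another’s inputs.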

Another area where the Data Scientist will have a role to play is in the refinement of model testing to ensure statistical robustness. To fast track the process, a Data Scientist may combine many branches of a decision tree into a single test to reduce the number of customers needed in the control group when A:B testing to understand model lift.
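For readers less familiar with lift, the sketch below shows one common way it might be measured from a control group and a treated group, together with a standard two-proportion z statistic as a basic robustness check. It does not show the branch-combining technique itself, and the group sizes and purchase counts are invented.

```python
import math

def lift(treated_rate, control_rate):
    """Relative improvement of the treated group over the control group."""
    return treated_rate / control_rate - 1

def two_proportion_z(x_treated, n_treated, x_control, n_control):
    """z statistic for the difference between two conversion rates."""
    p_treated = x_treated / n_treated
    p_control = x_control / n_control
    pooled = (x_treated + x_control) / (n_treated + n_control)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_treated + 1 / n_control))
    return (p_treated - p_control) / std_err

# Made-up test: 1,000 customers in each group.
treated_purchases, treated_size = 70, 1000   # offers chosen by the model
control_purchases, control_size = 50, 1000   # no model applied

model_lift = lift(treated_purchases / treated_size,
                  control_purchases / control_size)
z = two_proportion_z(treated_purchases, treated_size,
                     control_purchases, control_size)
print(round(model_lift, 3), round(z, 3))
```

The smaller the control group, the larger the standard error and the harder it is to tell real lift from noise, which is why the size of the control group matters.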

Having been alerted through this kind of testing that a model has degraded, we’ll need to refresh the model and then re-deploy it as appropriate. In many cases, we may just need to re-build the model with a new set of data before deploying it again. Given that the data and model parameters are going to remain unchanged, this type of task could readily be undertaken by a more junior role than a Data Scientist. If a more complete re-work of the model is required, the task will be put into the Data Scientist backlog funnel and prioritized according to the criticality of the model and its impact on profits. Although there is more work involved than in a simple re-calibration, it will still likely be far quicker than the initial development, given that more is known about the independent variables and most, if not all, of the data preparation will have been completed previously.

Just like in the previous step, if you are using commercial Data Science software to deploy your models, some model management capability will come out of the box. Some may also allow you to automate and report on A:B testing across your website. However, in most instances, additional investments will be required to make the current operational and analytical reporting systems more agile and scalable to meet the challenges placed on them by a modern Digital business. If the business intelligence systems can’t keep pace, you will need to address the issue one way or another!

Industrializing the Process

Techniques and approaches used in modern manufacturing have changed immeasurably since Henry Ford’s day to a point where a typical production line will receive parts from all over the world, arriving just in time to make many different products – all managed on a production line that just doesn’t stop. Looking back at our 3 steps by comparison, it’s clear we have a lot to learn.

A well-worn phrase in the industry is that a Data Scientist will spend 80% of their time wrangling data and only 20% doing the Science. In my experience, Data Scientists spend the majority of their time waiting for the infrastructure, software environment and data they need to even start wrangling (see my related blog, Applying Parenting Skills to Big Data: Provide the Right Tools and a Safe Place to Play…and Be Quick About It!). Delays brought about while new infrastructure is provisioned, software stacks are built, network ports are managed and data is secured all add to the time and cost of each data product you’re creating. As a result, the process is often compromised, with Data Scientists forced to use a shared environment or a standardized toolset. Without careful consideration, we tend to turn what is, after all, a data discovery problem into an IT development one! There’s no ‘just in time’ in this process!

What if you could automate the process and remove barriers in the Discovery phase altogether?

The benefits could be huge! Not only would that make the best use of a skilled resource in limited supply (the Data Scientist), but it would also mean that downstream teams responsible for the Monetize and Optimize steps could schedule their work as the whole process becomes more predictable. In addition to the Data Science workload, what if the environment and toolchain required by the agile development team to Monetize our model (Step 2) could also be automated?

Much can also be done with the data to help to accelerate the assembly process. Many types of machine learning models can benefit from data being presented in a “longitudinal” fashion. It’s typical for each Data Scientist to build and enhance this longitudinal view each time more is discovered about the data. This is another area that can benefit greatly from a more “industrialized view” of things – by standardizing data pre-processing (transformation) steps we improve quality, reduce the skills required and accelerate time to discovery. This is all about efficiency after all, but that also means we must add the necessary process so individual learning can be shared among the Data Science community and the standardized longitudinal view enhanced.
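As one possible reading of a “longitudinal” view, the sketch below turns individual transactions into a single row per customer with one value per time period. That interpretation, along with the field names and sample transactions, is an assumption made for illustration; a standardized view in a real Data Factory would be agreed by the Data Science community.

```python
from collections import defaultdict

def longitudinal_view(transactions, periods):
    """Build one row per customer, holding total spend for each period in order."""
    totals = defaultdict(lambda: defaultdict(float))
    for t in transactions:
        totals[t["customer"]][t["period"]] += t["amount"]
    # Periods with no transactions are filled with 0.0 so every row has the same shape.
    return {
        customer: [by_period.get(period, 0.0) for period in periods]
        for customer, by_period in totals.items()
    }

# Made-up transactions.
transactions = [
    {"customer": "A", "period": "2018-01", "amount": 20.0},
    {"customer": "A", "period": "2018-03", "amount": 5.0},
    {"customer": "A", "period": "2018-01", "amount": 10.0},
    {"customer": "B", "period": "2018-02", "amount": 7.5},
]
periods = ["2018-01", "2018-02", "2018-03"]

view = longitudinal_view(transactions, periods)
for customer, row in sorted(view.items()):
    print(customer, row)
```

Packaging a step like this as a shared, versioned function is the kind of standardization that lets each Data Scientist start from the same view rather than rebuilding it.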

Back to Big Data Strategy

The point we started with was that creating value from data requires broader thinking than just a Big Data strategy. By looking in detail at the 3 steps in the value creation process, organizations can begin to unlock the potential value trapped in their data lakes and industrialize the process to eliminate costs and create greater efficiency with improved time to value.

At Dell EMC, we’re working with many of our customers to assess and industrialize their data value creation process and infrastructure. We’ve also created a standardized architectural pattern, the Elastic Data Platform, which enables companies to provide ‘just in time’ data, tools and environments for Data Scientists and other users to expedite the Discovery process. To learn more, check out this video featuring my colleague Matt Maccaux:

To learn even more about Data Monetization and Elastic Data Platform from Dell EMC experts, read the InFocus blogs:

Driving Competitive Advantage through Data Monetization

Avoid the Top 5 Big Data Mistakes with an Elastic Data Platform

Elastic Data Platform: From Chaos to Olympian Gods

About Doug Cackett


EMEA Big Data & IoT Solution Lead, Dell EMC Consulting

Doug leads the Dell EMC Big Data & IoT Consulting practice in EMEA, engaged in helping our customers deliver more value from their entire information management estate, including modernizing legacy enterprise data warehouses and operational data stores and fully exploiting the power of Big Data and IoT technologies.

Doug has a background and practical experience working with data mining and machine learning tools, as well as designing and delivering large scale information systems for many of the largest companies around the world. Through his combined experience, Doug is uniquely positioned to offer insights and perspective at the intersection of data science and information management.
