Doug Cackett – InFocus Blog | Dell EMC Services

Applying a Factory Model to Artificial Intelligence and Machine Learning

We’ve understood for a long time that organizations who spend more on, and are better at, deriving value from their data using analytics significantly outperform their peers in the market. All of us also know, because we feel it, that the pace of change is ever increasing.  I see this all the time with the customers I work with, many of whom seem to be suffering from the “Red Queen” effect – each having to change and innovate faster just to keep standing still, let alone make progress against a tide of change.

I’ve also had cause to re-read Salim Ismail’s book, “Exponential Organizations”, recently which got me thinking: in order to compete, we should be designing and building solutions that allow us to exponentially grow our capacity to create value from data in order to meet business demand. Importantly though, how do we do that without also exponentially growing infrastructure costs (a bad idea) and the number of Data Scientists employed (an impossible dream)? That’s a really great exam question and one I’d like to explore in this blog.

Scaling Analytics with a Factory Model

I think that one of the reasons Artificial Intelligence and Machine Learning (AI / ML) are on the radar of all CxOs these days is because they’re seen as a way of closing the yawning gap most companies have between their capacity to collect data and their ability to apply it in the form of actionable insights. In other words, there’s a certain amount of ‘magical thinking’ going on here and AI/ML is the new magic being applied.

Our view is that the answer to our exam question lies in a more industrialized process. We have been using a factory model concept with our customers to help them address this central question of scaling efficiently. Take a look at the model in its entirety and then I'll dissect it.

Download the Interactive Infographic [best viewed in full screen mode].

#1: How do you drive innovation with AI/ML technologies?

As AI / ML technologies, packaging, frameworks and tooling are emerging so rapidly, there’s a real need to evaluate these new capabilities with a view to understanding the potential impact they might have on your business. The right place to do that is an R&D Lab.

At this point, we’re just trying to assess the technology and identify the potential business value and market impact. Is it a potentially disruptive technology that we need to start thinking about, perhaps disrupting ourselves before we get disrupted by others?  Or, it may just be a slightly better mousetrap than the one we are already using. By assessing the technology at the edge, we can answer questions around the planning horizon and take the appropriate steps to introduce the technology to the right parts of the business so it can be evaluated more completely.

The most important thing to bear in mind here is that this is a critical business function. It can’t be seen as a purely academic exercise conducted by an isolated team. Disruption is a modern reality and an existential threat to every business – the R&D function is a strategic investment that links you and your business to tomorrow’s world.

Development can flow in both directions, of course. As well as being technology-led, it might be that your Lean Innovation team is scanning the technology horizon to fill engineering gaps in a product that's being brought to market. Close cooperation between these teams resulting in a melting pot of innovation is exactly what's needed to survive and thrive over the long term. Today's pace of change is the slowest you will ever experience – we had all better get used to it!

The goal, regardless of whether development was sparked from the technology or the business side, is for it to become something of significance to the organization. It could be adding a net new product to the current portfolio, together with the corresponding line in the annual chart of accounts, or perhaps a more fundamental change is needed to maximize its potential, spinning it out into a completely new company. If you're interested in reading more around this topic, I'd recommend Geoffrey Moore's book, "Zone to Win".

As strategic developments progress, they will mature from Horizon 3 to Horizon 2 and finally into the more immediate Horizon 1, assuming they continue to be viewed as adding value to the business. At this point, if you haven't already done so, you may like to stop here and quickly read my previous blog, Industrializing the Data Value Creation Process, which looked at a conceptual framework for thinking about the way we extract commercial value from data – it might help you understand the process side of what I'm about to explain.

#2: How do you prioritize Horizon 1 activities?

At its heart, given the infinite demand and finite resources available in most organizations, you need to decide what you are going to spend your time on – this prioritization challenge needs to be based on a combination of factors, including overall strategy, likely value and current business priorities, as well as the availability of the data required.

The data doesn’t need to be available in its final form at this stage of course, but you may need to have at least some accessible to start the discovery process. Besides, data has a nasty habit of tripping you up, as it almost always takes longer to sort out legal and technical issues than you think, so addressing these kinds of challenges before you begin the data discovery work is normally a sound investment.

If data is the new oil, then the first, and most crucial step is discovering the next reserve under the ground. In our case, we’re typically using AI/ML to find a data pattern that can be applied to create commercial value. That discovery process is really crucial so we need to ensure our Data Scientists have the right environment and tools available so we have the best possible chance of finding that oil if it’s down there!

#3: How do you maximize Data Scientist productivity?

We know from experience that one size really doesn't fit all, especially when it comes to Data Science. Some problems will be data heavy and others data light. Some will require extensive time spent data wrangling while others use heavy-weight GPU acceleration to plough through deep and computationally heavy neural networks. Libraries and tooling are also very likely to differ and may be driven by the personal preferences of the Data Scientists doing the work. Now, while you could force them all to use a single environment and one set of tools, why would you do that if your goal is to maximize productivity and employee satisfaction? The very last thing you need if you're trying to scale up the Data Science work you're doing is for your Data Scientists to be walking out of the door because they don't like the setup. While I'm all in favor of standardization where it makes sense, technology has really moved past the point where this is strictly necessary.

If you scale Data Science by crowding all of your Data Scientists around a single production line with just one set of tools and shared resources, they can't help but get in each other's way. Besides, the production line will inevitably run at the pace of the slowest Data Scientist – or worse, the production line may even break because of the experiments one Data Scientist is undertaking.

It’s not that Data Scientists don’t collaborate and work in teams – it’s more that each will be much more productive if you give them a separate isolated environment, tailored specifically to the challenge they are faced with and tools they know. That way they get to independently determine the speed of the production line, which tools they use and how they are laid out. See my related blog Applying Parenting Skills to Big Data: Provide the Right Tools and a Safe Place to Play…and Be Quick About It!.

#4: How do you address data supply chain and quality issues?

Just like on a production line at BMW or Ford, if we want to avoid any interruptions in production we need to ensure our supply chain delivers the right parts just in time for them to be assembled into the end product. In our case this is all about the data, with the end product being a data product of some kind, such as a classification model that could be used to score new data, or perhaps just the scored results themselves.

As we never want to stop the production line or fail our final assembly, we also need to make sure the data is of an acceptable quality level. Since we don't want to do that validation right next to the production line, we need to push the profiling and validation activity as far upstream as we can, so it doesn't interfere with the line itself and any quality problems can be addressed at source.
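
To make that a little more concrete, here is a minimal sketch of what an upstream quality gate might look like, using pandas; the column names and thresholds are purely illustrative assumptions, not part of any particular pipeline.

```python
import pandas as pd

# Illustrative quality rules for an assumed customer feed; adjust to your own schema.
RULES = {
    "customer_id": {"max_null_rate": 0.0},
    "age":         {"max_null_rate": 0.02, "min": 0, "max": 120},
    "last_spend":  {"max_null_rate": 0.05, "min": 0.0},
}

def validate_batch(df: pd.DataFrame) -> list:
    """Return a list of human-readable quality issues; an empty list means the batch may proceed."""
    issues = []
    for col, rule in RULES.items():
        if col not in df.columns:
            issues.append(f"missing column: {col}")
            continue
        null_rate = df[col].isna().mean()
        if null_rate > rule["max_null_rate"]:
            issues.append(f"{col}: null rate {null_rate:.1%} exceeds {rule['max_null_rate']:.1%}")
        if "min" in rule and (df[col].dropna() < rule["min"]).any():
            issues.append(f"{col}: values below {rule['min']}")
        if "max" in rule and (df[col].dropna() > rule["max"]).any():
            issues.append(f"{col}: values above {rule['max']}")
    return issues

if __name__ == "__main__":
    batch = pd.DataFrame({"customer_id": [1, 2, 3],
                          "age": [34, None, 151],
                          "last_spend": [120.0, 0.0, 55.5]})
    for issue in validate_batch(batch):
        print("QUALITY ISSUE:", issue)
```

Running a check like this at the point of ingest means the production line only ever sees batches that have already passed their quality gate.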

#5: How do you scale compute and storage?

With a suitable infrastructure in place, combined with access to the right data and tooling, the Data Scientist is all set to do their work.

In most, if not all cases, the Data Scientist will need read access to governed data that is in scope for the analysis, along with the ability to upload external data assets that could be of value. They will also need to be able to iteratively save this data as the source datasets are integrated, wrangled and additional data facets generated to improve model performance. In traditional environments, this might mean a significant delay and additional costs as data is replicated multiple times for each Data Scientist and each use case, but it doesn't have to happen that way! The other advantage of moving away from a legacy Direct Attach Storage (DAS) approach is that most Network Attached Storage (NAS) and cloud deployments provide copy-on-write snapshot technologies, so replicas take near-zero additional capacity and time to create, with only the changed data consuming any capacity.

While we’re on the topic of cost and scale, of course the other thing you want to independently scale is the compute side of things. As I’ve already mentioned, some workloads will be naturally storage heavy and others compute heavy and storage light.  Data Science discovery projects also tend to be ephemeral in nature, but that’s also true of many production workloads such as ETL jobs. By leveraging the flexibility of virtualized infrastructure and dynamically managing resources, you can scale them up and down to match performance needs.  In this way, you can manage the natural variations in business activity and complimentary workloads to dramatically increase server utilization rates.  That heavy ETL load you process at the end of each financial period could be scaled out massively overnight when the Data Science team isn’t using the resources and scaled back when they are.  Through a virtualized approach, we can create differently shaped environments and make better use of the resources at our disposal. A simple control plane makes operational considerations a non-issue.

Once the discovery task is completed, the Data Scientist will want to save their work for future reference and share any new artefacts with their peers. Assuming they found something of value that needs to be put into production, they can prepare a work package that can be dropped into the Agile Development team’s engineering backlog.

#6: How do you accelerate the time to production?

The Agile Development team will typically include a blend of Data Architects, Agile Developers and junior Data Scientists with work prioritized and picked off the backlog based on available resources, as well as effort estimates and business priorities.

The same rules apply to the Agile Development team as they did for the Data Scientists.  Keeping them busy and effective means making sure they have everything they need at their disposal. Waiting for suitable development and analytical environments to be provisioned or data to be authorized or secured is not a good use of anyone’s time!  Using the same virtualized approach, we can quickly create an environment for the agile team that includes a more limited set of Data Science tooling (for scoring models) and the tool chain needed for the development work.

All provisioned in seconds, not weeks or months.

The next stage in the route to production for our data product will be more formal User Acceptance Testing (UAT). We can use our virtualized as-a-Service provisioning yet again here, only this time, rather than including the Agile Developers’ tool chain, we’ll include the testing infrastructure in the environment build instead.

The other aspect of efficiency worth noting is that, for the most part, the time allocated to the Data Science, development and testing tasks is very predictable. The Data Scientists' work will often be time-boxed – producing the best model possible within a set amount of time. In a traditional approach, additional delays creep in because nobody can predict when the Data Science work will actually start, given the unpredictable nature of provisioning. Addressing this one issue means that each team has a much better chance of sticking to schedule, making the entire process more dependable.

Once development and testing are completed, we need to move our new data product into a production setting. As discussed previously, some workloads are ephemeral in nature while others are not, often because they are stateful or can't simply be resumed if they fail for some reason. Operationalizing the workload means selecting the appropriate environment based on its characteristics and then implementing and instrumenting it appropriately. This is an interesting topic in its own right and worthy of a follow-up blog!

#7: How do you know if the model is performing as designed?

Having changed the business process in some fashion because of our new data product, we need to have some way of monitoring its performance – ensuring our real-world results are as expected, triggering either management attention or a simple model rebuild when performance declines below acceptable limits.

In practice, this can often mean adding a new measure or report to an existing BI solution or real-time monitoring dashboard. To facilitate this, the Agile Development team may have already created an additional SQL view describing performance that can simply be picked up and consumed by the BI team, greatly simplifying implementation.
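
As a simple illustration of the kind of check involved (not a prescription for how your team should build it), a scheduled job might compare recent predictions with actual outcomes and flag the model for attention when accuracy dips below an agreed floor; the schema and threshold below are assumptions.

```python
import pandas as pd

ACCURACY_FLOOR = 0.80  # assumed acceptable limit, e.g. taken from the model's original test performance

def check_model_health(scored: pd.DataFrame) -> dict:
    """scored holds one row per decision with 'predicted' and 'actual' outcomes (hypothetical schema)."""
    accuracy = (scored["predicted"] == scored["actual"]).mean()
    status = "OK" if accuracy >= ACCURACY_FLOOR else "REBUILD_OR_REVIEW"
    return {"accuracy": round(float(accuracy), 3), "status": status}

if __name__ == "__main__":
    recent = pd.DataFrame({
        "predicted": [1, 0, 1, 1, 0, 1, 0, 0],
        "actual":    [1, 0, 0, 1, 0, 1, 1, 0],
    })
    # With these invented outcomes accuracy is 0.75, so the model is flagged for review.
    print(check_model_health(recent))
```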

Putting It All Together

To achieve success with Artificial Intelligence and Machine Learning, it’s critical that you have the right teams within your organization, including Data Scientists, R&D, Lean Innovation, and Agile Development, as well as an industrialized ‘data factory’ process that enables you to get value from your data as efficiently as possible. Technology of course plays a critical role as well, as you need to be able to provision environments quickly and securely, provide tool flexibility, and have the optimal infrastructure in place to support Data Science workloads.

At Dell EMC, we work with customers at all stages of analytics maturity to plan, implement and optimize solutions and infrastructure that enable organizations to drive value from data and support advanced techniques, including artificial intelligence and machine learning. That includes working across the people, process and technology aspects in a grounded and pragmatic fashion to accelerate the time to value and maximize Big Data investments.

If you’re looking for a trusted partner to help you on your analytics journey and take advantage of the latest technologies and techniques, Dell EMC Consulting is here to help. Learn more about our services and solutions and contact your account rep to discuss further. I also welcome you to join the conversation by posting your thoughts or questions below.

Before you go

Make sure to download the Factory Model for Artificial Intelligence and Machine Learning Interactive Infographic [best viewed in full screen mode].

 

Industrializing the Data Value Creation Process

For organizations to maximize the data value creation process, it’s critical to have a clear line of sight from their business strategy and stakeholders through to the decisions that could be improved by applying machine learning and other techniques to the available data.

In recent months, what we’ve increasingly seen is Chief Data Officers taking a more active role in facilitating that process, focusing more on desired business outcomes and value delivery, and in doing so transforming themselves into Chief Data Monetization Officers. See the related blog, Data Monetization? Cue the Chief Data Monetization Officer.

For those outcomes to be fully realized and to create value on a true industrial scale, organizations need to have a laser focus on the process – automating steps and reducing waste to dramatically reduce the overall cost and time to insight for the production of “analytical widgets” in our “Data Factory”. If you think about it, that’s exactly what we’ve seen happening in the manufacturing world since the very first Model T rolled off the production line at Ford – why should the world of data be any different?

The Data Factory process really consists of 3 key steps. In the rest of this blog, I’ll outline each step and suggest how we might do more to industrialize the process.

Figure 1: Data Value Creation Process

Step 1 – Discover

The first step in the value chain is to explore the data in order to find something – to discover a pattern in the data that we might be able to apply in a subsequent step to create value for the business. Without Discovery, all you have in the data lake is lots of data. That's lots of cost and not a lot of value – none, in fact – so this is perhaps the most important step in the process.

Discovery could be just about anything but most often we will be looking to optimize a customer interaction, such as applying personalization elements to an application to make content or an offer more relevant and compelling. Applying the personalization comes in Step 2, but before we get there, we need to uncover the pattern that allows us to be more personal.

To find patterns in Discovery, the Data Scientist will iterate through a number of steps to prepare data, build a model and then test it until the best one is found. The process is iterative because many factors can be changed, such as the way data is prepared, the algorithm used and its parameters. As a model is a representation of knowledge or reality, it is never perfect. The Data Scientist will be looking for the one that performs best for that specific business problem.

You can think about the value at this stage as personal value. Value to the data scientist in what they have learned, not commercial value to the organization. For that to happen, we need to operationalize the pattern we found by applying the model. See step 2 below.

Testing Models with Machine Learning and Data Science

This isn’t meant to be a data science primer but before we move into the Monetize step, it might be helpful to briefly review some of the basics around Data Science.

To keep it simple, let’s imagine we have a classification problem where we are trying to predict which customers are most likely to respond to a particular marketing campaign and we are going to build a classification model using existing sales and customer data so we can do just that.

To avoid over-fitting and to ensure the model remains accurate when new data is applied in the future, we split our data and hold some back so we can test the model with data it has not seen during the training process. We can then tabulate the results into a "confusion matrix" and look at the types of errors made and the general classification rate. A false positive is where the model predicted a purchase but no purchase was made; a false negative is the other way around.

Whether any model is good or bad is very contextual. In our case, the 200 false positives may be perfectly acceptable if the cost of the campaign is low (email), but may be considered poor if the campaign is expensive, or if these are our best customers and they're getting fed up with being plagued by irrelevant offers! The situation is similar with the false negatives. If this is your premium gateway product and there is any chance of someone purchasing it, you may decide this misclassification is OK; however, if it's a fraud problem and you falsely accuse 300 customers, that's not so great. See the blog Is Data Science Really Science for more on false positives.

Figure 2: Sample Model Prediction (Confusion Matrix)

When we score our testing data, the model makes a prediction of purchase or non-purchase based on a threshold probability, typically 0.5. As well as changing the model algorithm and parameters, one of the other things the Data Scientist might do is to alter the threshold probability or misclassification cost to see how it impacts the errors in the confusion matrix, making adjustments based on required business goals so the best overall model is found.
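
To make the threshold discussion concrete, here is a small illustrative sketch using scikit-learn with invented hold-out data; it simply shows how moving the probability cut-off shifts the balance between false positives and false negatives.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical hold-out data: true purchase labels and the model's predicted probabilities.
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 1, 0, 0])
y_prob = np.array([0.9, 0.4, 0.65, 0.2, 0.55, 0.7, 0.3, 0.45, 0.6, 0.1])

# Raising the threshold typically trades false positives for false negatives.
for threshold in (0.5, 0.6):
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn} TN={tn}")
```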

Another approach to optimizing marketing campaign effectiveness is to rank results using “expected value” which we calculate by multiplying the probability of a purchase by the expected (purchase) value, often using the customer’s average previous purchase value as a proxy.

For example, we might want to mail the top 10,000 prospects and maximize income from the campaign so we rank our customers by expected value and select the top 10,000. In this way, someone with a probability of 0.3 but average purchase value of $1000 would be higher in our list than someone with a much higher probability of 0.8 and lower average value of $100 (expected value of 300 vs 80).
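
A minimal sketch of that expected-value ranking is shown below; the customer records are invented, and in practice the probabilities would come from the model and the values from purchase history.

```python
# Rank prospects by expected value = P(purchase) x average historical purchase value.
customers = [
    {"id": "A", "p_purchase": 0.8, "avg_value": 100.0},
    {"id": "B", "p_purchase": 0.3, "avg_value": 1000.0},
    {"id": "C", "p_purchase": 0.6, "avg_value": 150.0},
]

for c in customers:
    c["expected_value"] = c["p_purchase"] * c["avg_value"]

# Customer B (0.3 x 1000 = 300) outranks A (0.8 x 100 = 80), matching the example above.
top = sorted(customers, key=lambda c: c["expected_value"], reverse=True)
print([(c["id"], c["expected_value"]) for c in top])
```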

I’ve just used a simple example here to avoid confusion – the real world is rarely that straight forward of course. We may need to boost or combine models or tackle unsupervised modeling techniques, such as clustering, that are non-deterministic and therefore require greater skills on the part of the Data Scientist in order to be effective.

Step 2 – Monetize

It’s worth noting that I’m using the word “monetize” here as shorthand for “creating value” from the data. I’m not suggesting selling your data, although that may be the case for a limited set of companies. It may also have nothing to do with actually making money – in the case of a charity or government organization the focus will be on saving costs or improving service delivery – but the same broad principles remain the same regardless.

It’s also worth noting that not all of the models coming out of the Discovery step will need to be operationalized into an operational system such as an eCommerce engine. It may be that the insights gained can simply help to refine a manual process. For example, a retailer might benefit from looking at the profile of customers purchasing a particular group of fashion products to see how it aligns to the target customer segment identified by the merchandising team.

Having said that, in most cases, more value is likely to be created from applying AI and machine learning techniques to automated processes given the frequency of those decision points.

We will focus more on this aspect in the remaining part of this blog.

For those problems where we are looking to automate processes, the next thing we need to do is to monetize our model by deploying it into an operational context. That is, we set it into our business process to optimize it or to create value in some way such as through personalization. For example, if this was an online shopping application we might be operationalizing a propensity model so we display the most relevant content on pages or return search results ranked in relevance order for our customers. It’s these kinds of data-driven insights that can make a significant difference to the customer experience and profitability.

What we need to do to operationalize the model will depend on a number of factors, such as the type of model, the application that will consume the results of the model and the tooling we’re using. At its simplest, commercial Data Science tooling like Statistica and others have additional scoring capabilities built in. At the other end of the spectrum, the output from the Discovery process may well just land into the agile development backlog for implementation into a new or existing scoring framework and associated application.
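
At the "build it yourself" end of that spectrum, the scoring framework can start out very simply. The hypothetical sketch below assumes a model serialized during Discovery (model.joblib) and wraps it in a small Flask endpoint; neither the framework nor the file name is mandated by anything above – they are just one way to illustrate the idea.

```python
# A hypothetical minimal scoring endpoint; model.joblib is assumed to be produced during Discovery.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # e.g. a scikit-learn classifier saved by the Data Scientist

@app.route("/score", methods=["POST"])
def score():
    # Expects a JSON payload of the form {"features": [[...], [...]]}
    features = request.get_json()["features"]
    probabilities = model.predict_proba(features)[:, 1]
    return jsonify({"propensity": probabilities.tolist()})

if __name__ == "__main__":
    app.run(port=5000)
```

The consuming application (an eCommerce engine, say) then calls this endpoint at decision time and uses the returned propensities to rank content or offers.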

Step 3 – Optimize

I’ve already mentioned that no machine learning model is perfect and to further complicate things, its performance will naturally decay over time – like fine wines, some may age delicately, while others will leave a nasty taste before you get it home from the store!

That means we need to monitor our models so we are alerted when performance has degraded beyond acceptable limits. If you have multiple models and decision points in a process, one model may also have a direct impact on another. It is this domino effect of unforeseen events which makes it even more important not to forget this step.

Another area where the Data Scientist will have a role to play is in the refinement of model testing to ensure statistical robustness. To fast track the process, a Data Scientist may combine many branches of a decision tree into a single test to reduce the number of customers needed in the control group when A:B testing to understand model lift.
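
For illustration, the arithmetic of an A:B lift test is straightforward: compare the response rate of the model-targeted group with that of a random control group and check that the difference is statistically significant. The counts below are invented and SciPy is assumed to be available.

```python
from scipy.stats import chi2_contingency

# Hypothetical campaign results: [responded, did_not_respond]
treated = [320, 9680]   # customers selected by the model
control = [250, 9750]   # randomly selected control group

treated_rate = treated[0] / sum(treated)
control_rate = control[0] / sum(control)
lift = treated_rate / control_rate

# Chi-squared test of independence on the 2x2 response table.
chi2, p_value, _, _ = chi2_contingency([treated, control])
print(f"lift={lift:.2f}x, p-value={p_value:.4f}")
```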

Having been alerted to a model that has been degraded through this kind of testing, we’ll need to refresh the model and then re-deploy as appropriate. In many cases, we may just need to re-build the model with a new set of data before deploying the model again. Given that the data and model parameters are going to remain unchanged, this type of task could readily be undertaken by a more junior role than a Data Scientist. If a more complete re-work of the model is required, the task will be put into the Data Scientist backlog funnel and prioritized appropriately depending on the criticality of the model and impact on profits.  Although there is more work involved than just a simple re-calibration, it will still likely be far quicker than the initial development given more is known about the independent variables and most, if not all, of the data preparation will have been completed previously.

Just like in the previous step, if you are using commercial Data Science software to deploy your models, some model management capability will come out of the box. Some may also allow you to automate and report on A:B testing across your website. However, in most instances, additional investments will be required to make the current operational and analytical reporting systems more agile and scalable to meet the challenges placed on them by a modern Digital business. If the business intelligence systems can’t keep pace, you will need to address the issue one way or another!

Industrializing the Process

Techniques and approaches used in modern manufacturing have changed immeasurably since Henry Ford’s day to a point where a typical production line will receive parts from all over the world, arriving just in time to make many different products – all managed on a production line that just doesn’t stop. Looking back at our 3 steps by comparison, it’s clear we have a lot to learn.

A well-worn phrase in the industry is that a Data Scientist will spend 80% of their time wrangling data and only 20% doing the science. In my experience, Data Scientists spend the majority of their time waiting for the infrastructure, software environment and data they need to even start wrangling (see my related blog, Applying Parenting Skills to Big Data: Provide the Right Tools and a Safe Place to Play…and Be Quick About It!). Delays brought about while new infrastructure is provisioned, software stacks are built, network ports are managed and data is secured all add to the time and cost of each of the data products you're creating. As a result, the process is often compromised, with Data Scientists forced to use a shared environment or a standardized toolset. Without care and consideration, we tend to turn what is, after all, a data discovery problem into an IT development one! There's no 'just in time' in this process!

What if you could automate the process and remove barriers in the Discovery phase altogether?

The benefits could be huge!  Not only does that make best use of a skilled resource in limited supply (the Data Scientist), but it also means that downstream teams responsible for the Monetize and Optimize steps can schedule their work as the whole process becomes more predictable. In addition to the Data Science workload, what if the environment and toolchain required by the agile development team to Monetize our model (step 2) could also be automated?

Much can also be done with the data to help to accelerate the assembly process. Many types of machine learning models can benefit from data being presented in a “longitudinal” fashion. It’s typical for each Data Scientist to build and enhance this longitudinal view each time more is discovered about the data. This is another area that can benefit greatly from a more “industrialized view” of things – by standardizing data pre-processing (transformation) steps we improve quality, reduce the skills required and accelerate time to discovery. This is all about efficiency after all, but that also means we must add the necessary process so individual learning can be shared among the Data Science community and the standardized longitudinal view enhanced.
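
As one illustrative way of standardizing that longitudinal view, raw transactional events can be pivoted into one row per customer with one column per period, so every Data Scientist starts from the same shape; pandas and the column names here are assumptions for the sketch.

```python
import pandas as pd

# Assumed raw event data: one row per transaction.
events = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 1],
    "month":       ["2018-01", "2018-02", "2018-01", "2018-02", "2018-02"],
    "spend":       [120.0, 80.0, 35.0, 0.0, 40.0],
})

# Standardized longitudinal view: one row per customer, one column per month.
longitudinal = events.pivot_table(
    index="customer_id", columns="month",
    values="spend", aggfunc="sum", fill_value=0.0,
)
print(longitudinal)
```

Once a transformation like this is agreed and shared, each new discovery project starts from the enhanced view rather than rebuilding it from scratch.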

Back to Big Data Strategy

The point we started with was that creating value from data requires broader thinking than just a Big Data strategy. By looking in detail at the 3 steps in the value creation process, organizations can begin to unlock the potential value trapped in their data lakes and industrialize the process to eliminate costs and create greater efficiency with improved time to value.

At Dell EMC, we’re working with many of our customers to assess and industrialize their data value creation process and infrastructure. We’ve also created a standardized architectural pattern, the Elastic Data Platform, which enables companies to provide ‘just in time’ data, tools and environments for Data Scientists and other users to expedite the Discovery process. To learn more, check out this video featuring my colleague Matt Maccaux:

To learn even more about Data Monetization and Elastic Data Platform from Dell EMC experts, read the InFocus blogs:

Driving Competitive Advantage through Data Monetization

Avoid the Top 5 Big Data Mistakes with an Elastic Data Platform

Elastic Data Platform: From Chaos to Olympian Gods

 

Applying Parenting Skills to Big Data: Play with Friends and Learn from Experience

This series of blogs was inspired by a discussion I had with a customer senior executive when I found myself exploring the topic of value creation, big data and data science through the lens of parenting skills – offering a different and relatable way of thinking through the challenges many organizations face.

In the first blog, Applying Parenting Skills to Big Data: Focus on Outcomes and Set Boundaries, I discussed the notion of ‘long term greedy’ and governance. The second one, Applying Parenting Skills to Big Data: Provide the Right Tools and a Safe Place to Play…and Be Quick About It!, covered tools, discovery environments and industrializing the process to realize value from your data as quickly as possible. I’ll finish out the series focusing on roles and responsibilities, working with partners and learning from the analytics journeys and missteps of others.

A Familiar Pattern of Play

Watching kids play has to be one of the greatest treasures in life, and never more so than when they are playing with their best friends, as they've already learned how to play together and what each brings to the party. They might fall into a familiar pattern of play as a result, but it's one that works and yields a happier, more productive experience for all involved.

Try this experiment at home – next time your kids are in full-on game mode with their best friends tell them you and your spouse want to play as well. See what happens to the game then? You’ve crashed their party and the kids are probably not too thrilled with the interruption or your lack of gaming skills!

We find the same to be true for big data solutions. To achieve full digital transformation in the organization, you need to apply data science to every aspect of the business. To do so affordably, you need to minimize the time taken to discover and operationalize insights and increase the cadence of those experiments. What we have found is that in order to achieve that, the people who aren't directly involved need to stand back from the process so as not to disrupt or delay it.

Knowing Your Place in the Sandbox

In the case of enterprise big data, we’re talking about the IT and Security teams. They have a vital part to play in the enablement processes, in provisioning a safe place to play and the tooling that’s needed, but they have no role in what happens during the discovery process.

To be effective and maximize the speed and cadence for discovery and monetization, you need to architect and implement a platform that allows your IT and Security teams to stand well back from the process.

CEOs measure outcomes, not intermediate steps. The business doesn't benefit at all until the pattern – or whatever it is the Data Scientist is looking for in the data – is actually implemented into a system of engagement. In turn, as a CDO I'm looking at eliminating waste and delays in the Discover–Monetize–Optimize chain. There is no greater waste than having your most valued asset – the Data Scientist – sit on their hands while waiting for a suitably sized environment, or for the data they need to use, or the libraries they want to try, or… or…

Get Out of the Plumbing Business

We have experience working with clients at every level of maturity in their Big Data journey and across all industries. Based on this experience, we have built a solution called the Elastic Data Platform (EDP), and while we offer a full portfolio of consulting services, we increasingly find ourselves talking about EDP with customers because it fills the gaps in what they are looking to achieve and enables them to use their existing Big Data infrastructure and investments. It helps them focus on outcomes rather than plumbing.

And just like your children when they play with friends, we have built the solution with our friends, filling some key gaps around standard Hadoop distributions. For instance, we use a tool from Blue Data to spin up Hadoop and other components almost instantly into Docker containers. You can choose between a variety of cluster sizes and configurations with ingress and egress edge nodes, various tools such as SAS Viya and connect these to back-end data sources through a policy engine and enforcement points that allow you to provide full fine-grained access control and redaction. Importantly, these clusters can be spun up and torn down in seconds.
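
For readers who want a feel for the underlying pattern (this is deliberately not the BlueData tooling itself, just the simplest possible analogue), the Docker SDK for Python can spin up and tear down an isolated, fully tooled environment in a handful of lines; the image name comes from the public Jupyter Docker stacks.

```python
# A hedged illustration of the "spin up / tear down in seconds" idea using the Docker SDK for Python.
# It is not the BlueData tooling referenced above -- just the same pattern in its simplest form.
import docker

client = docker.from_env()

# Start an isolated, fully tooled notebook environment (public Jupyter data-science image).
sandbox = client.containers.run(
    "jupyter/datascience-notebook",
    detach=True,
    ports={"8888/tcp": 8888},
    name="ds-sandbox-demo",
)
print("sandbox running:", sandbox.short_id)

# ...the Data Scientist does their discovery work...

# Tear it down when playtime is over.
sandbox.stop()
sandbox.remove()
```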

Learn from Those Who Have Gone Before You

As well as learning from their friends and through play, it’s also important for kids to learn from their elders; people who have been there, seen it and done it all before. Importantly, kids learn both about specifics (look before crossing the road) and more general things that help to shape their views of the world. Both are important as that helps them learn while not getting hampered by things that are inevitably changing around them.

At Dell EMC, we work across a wide range of difficult and challenging environments in every industry. We see technologies on the leading edge of the wave as well as those that have already been well established. We also have a chance to stand back and understand what the fundamentals are – what works, and importantly, what doesn’t.

In many ways, the Elastic Data Platform along with a number of deployment patterns we have for Dell EMC’s underlying technologies underpins that experience. However, we also support our customers in a range of different engagement styles and specialties, whether it’s specifics around particular technologies such as modern AI platforms or current Hadoop tooling or at a much higher strategic level to shape your future direction.

Bringing It All Together

This wraps up my series on parenting skills and how they relate to big data and analytics.  I’ve hit on many of the key points I see organizations consistently challenged by. No doubt there are many other parallels we could draw, so let me know if you have any additional suggestions for the list!

Applying Parenting Skills to Big Data: Provide the Right Tools and a Safe Place to Play…and Be Quick About It!

This series of blogs was inspired by a discussion I had with a customer senior executive when I found myself exploring the topic of value creation, big data and data science through the lens of parenting skills – offering a different and relatable way of thinking through the challenges many organizations face. In the first blog, Applying Parenting Skills to Big Data: Focus on Outcomes and Set Boundaries, I discussed the notion of ‘long term greedy’ and governance. In this blog, we will discuss tools, having a safe place to play and the importance of end-to-end measurement.

Providing the Right Tools

One of the things I’ve learned as a parent is that tools need to be appropriate. When my sons were growing up, they had tools that looked similar to mine yet were age and task appropriate – think plastic toolkit with a plastic hammer, screwdriver and workbench.

The same is true of our data environments. ETL tooling makes perfect sense for enterprise ETL. Standardized BI tooling also. But neither are particularly useful for data science which is a highly iterative and personal journey. Tools also impact the quality and speed at which you can produce things. That perhaps explains why professionals use professional tools and amateur DIY enthusiasts like me use cheaper ones. It also explains why the quality of the results is different!

The Data Scientist Toolkit

If the continued success of your business is a function of your Data Scientists’ ability to find and apply patterns in your data, then you had better make them as productive as you possibly can. Part of that is in giving them the tools that they need, and these are unlikely to be the same as the tools you currently use across your business.

For a modern Data Scientist, the toolkit might include Python and R in a notebook like Jupyter. However, if you were embarking on a facial recognition problem, it's pretty clear that using pre-built tooling such as TensorFlow and dlib would make a lot more sense than trying to build this capability yourself from primitives, as they are more task-specific and productive.
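
As a small, hedged illustration of that point, dlib ships a pre-trained frontal face detector, so a first pass at face detection is a few lines rather than a hand-built model; the image path is a placeholder.

```python
import dlib

# dlib's pre-trained HOG-based frontal face detector -- no model building required.
detector = dlib.get_frontal_face_detector()

image = dlib.load_rgb_image("team_photo.jpg")   # placeholder path
faces = detector(image, 1)                      # 1 = upsample once to help find smaller faces

print(f"found {len(faces)} face(s)")
for face in faces:
    print("face at", face.left(), face.top(), face.right(), face.bottom())
```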

Finding a Safe Place to Play

Where you play is also important. If my sons were going to make a mess, I strangely always preferred them to go to a friend’s house rather than play at ours. In data science, there are some real advantages to having a “clean” environment for each new problem that needs to be tackled. We’ll talk more about this later. But sometimes there may not be enough room for a particular problem so doing the work in Azure or GCP also makes sense, bringing the results back once playtime is over!

Data science is also about experimentation, in the same way that your kids learn as they play. Children not only learn things like balance and fine motor coordination, but also social skills such as collaboration and shared goals. As long as the kids are safe, you can relax and they are free to try new things. For some that might be as simple as jumping on a trampoline, for others an awesome double backflip.

Data Scientists will break things when they play. They will do things and try things that make absolutely no sense and sometimes things are just going to go “bang”. That’s fine. Or at least it’s fine as long as they have an isolated place to play and do their stuff.

Or, put another way: if you don’t give Data Scientists a safe place to play, the only other thing you can do is to either stop them experimenting – the opposite of what you really want – or put them all in the same isolated swamp and let them fight it out.

The equivalent here is having lots of kids share the same trampoline at the same time. If that happens, your enterprise might be safe, but collisions are bound to happen and that’s going to have a measurable impact on productivity. Since our goal is all about trying to make our Data Scientist more productive, that seems like the wrong way to go.

It’s Not Just About the Data Scientists

Up until now we’ve been focusing on data science, however there are other players in the organization that are equally important in our ecosystem.

Once the Data Scientist has done his or her job and discovered something in the data that your organization can create value from, the next task is to monetize that insight by operationalizing it in some form. So, along with the Data Scientists, you’ll have agile development teams that need to evolve the data science into something that is enterprise-grade. We will talk about this further in a future blog but the point to take away is that others will also need environments that offer a safe and isolated place to play and the quicker you can provide them, the better.

Speed Counts

When your kids run a race at school, you press the stopwatch at the beginning and stop it at the end. If it’s a relay involving multiple people, the time that’s recorded is the team time – not just one leg of the race.

Following that thought for a moment, the key time for us in business is from the identification of the business problem through the point at which we have found a pattern in the data to the implementation of that pattern in a business process – in other words, from the idea to operationalizing that insight. It’s not just the discovery, and not just operationalizing it. We measure the time from beginning to end and that includes the time taken to pass the baton between people in the team.

The Clock Is Ticking

So, now for a much told myth – Data Scientists spend 80% of their time preparing data and 20% actually doing data science. Hands up if you believe that? Frankly speaking I don’t, well, at least it’s not quite the whole truth!

In my experience, Data Scientists spend much more of their time waiting for an environment and/or data. By simply eliminating that completely non-productive time, you could push through so many more data science projects. I don’t mean to trivialize other aspects of the process as they are all important – however this issue stands out to me as by far the most critical.

If I was a Chief Data Monetization Officer, I’d be looking at how we need to work on speed in the business and measure that in metrics such as time to discover, time to operationalize and time to manage.

Then I’d look at the key blockers that cause delays in the process and architect those out if possible.Time to provision is what has to happen before the Data Scientist or agile development teams can do ANYTHING and I’ve found that often takes MONTHS in most organizations.

So What Does Good Look Like?

Photo of Usain Bolt courtesy of xm.com.

A friend of mine once came up with what I thought was a fantastic idea. She thought it was impossible to know just how exceptional sprinters like Usain Bolt were because everyone else in the race were also good. She suggested that they should randomly pick someone from the audience to run in lane 9. That way you’d have a reasonable comparison to mark “average” against.

If you want to know what good looks like in the world of big data and data science, it’s the ability to fully provision new analytics environments in minutes.

Months – or even more than a year – is a more typical starting point for many, and that's a real problem. And remember, we're measuring the time taken to give the Data Scientist everything they need, not just some of it – that includes the tools, in an infrastructure that is isolated from their peers, and with the libraries and data they need.

Navigating Your Big Data Journey

At Dell EMC, we offer a comprehensive portfolio of Big Data & IoT Consulting services from strategy through to implementation and ongoing optimization to help our customers accelerate the time to value of their analytics initiatives and maximize their investments. We also help organizations bridge the people, process, and technology needed to realize transformational business outcomes. For example, one of our solutions, the Elastic Data Platform, enables Data Scientists to have tool flexibility and isolated environments to work in that can be provisioned in minutes.

In my next blog, I’ll discuss the value of trusted partners and how to benefit from the experience of others.

Stay tuned and happy parenting!

Applying Parenting Skills to Big Data: Focus on Outcomes and Set Boundaries

I feel like I’ve spent much of my adult life “coaching” people in some form or another, both personally as well as professionally.  In one such conversation with a customer senior executive, I found myself struggling to explain some of the things I thought they needed to do to make their big data and data science program successful. In that moment, it occurred to me that there are many helpful parallels between big data projects and parenting – a topic that many of us can relate to!

Yes, I know on the face of it that sounds rather mad, but bear with me on this one! Over the course of the next few blogs I plan to touch on the following topics:

  • Long term greedy – focus on outcomes and make the right long term investments to get there
  • Governance – too much and too little will stifle progress
  • Tools to thrive and learn – let the situation and the person dictate their discovery tools
  • A safe place to play – you can expect things to get broken so plan accordingly
  • Speed counts – it’s the time from start to finish that matters, not just one leg of the race
  • Valued friendships – complement your solution and fill gaps with tried and tested partners
  • Lessons learned – benefit from those who have gone before you

So, let’s get to it!

Focus on Outcomes

In my family, we have always tried to encourage our kids to adopt an approach of being “long term greedy”.  To me that means staying in it, whatever ‘it’ is, if it helps you achieve your longer term goal – even if you’re tempted to bail out sooner.  Say for example, you stay in a job despite frequent calls from recruiters because you believe long term it will provide the foundation of experience you need to open better doors later.  Long term greedy is about investing upfront in the foundational pieces that will inevitably lead to longer term outcomes and success.

So what does long term greedy look like from a business point of view in relation to big data and analytics?

Well, that depends on who you are.

If you’re the CEO of a bank, you’re focused on outcomes such as cost income ratios and return on assets employed, while the CEO in a retailer is focused on like-for-like sales and margin density rates. Every CEO knows they need to invest to grow top line measures in their business; and data has increasingly been seen as a critical area of investment with the potential to differentiate and drive competitive advantage.

The question though is what you should be investing in to achieve the desired outcomes. I'm going to suggest that just investing in a data lake is not the right thing to do. Data in and of itself creates zero value for the business. We create value by first Discovering something about the data that is of value to the business and then applying it in an operational context to Monetize it. We should invest in the things that help to increase the rate at which we can churn through Discovery backlogs, as well as the speed with which those discoveries can be implemented into operational systems. A data lake may of course be an important part of what's needed to get there, but it's not the outcome.

In fact, in many forward-looking companies the role of Chief Data Officer is continuing to evolve. Rather than just being a custodian and gatekeeper for data, they’ve become more of a Chief Data Monetization Officer – focused on building systems, processes and people that help drive value from the data. These are longer term investments, not just costs that have to be borne by the business.

Set the Right Boundaries

I’ve also learned as a parent that getting governance right is key to success in the long term. Too much governance, too tight a control and your kids won’t be able to go out and explore. They won’t have a chance to feel uncomfortable and know that’s OK. If you’re not careful, they won’t want to stray more than a foot from your side and that’s going to get old very fast.

On the other hand, we all know that kids need boundaries within which to operate. They need to understand what those boundaries are for their own safety and well-being, as well as others.

The same is true when it comes to data.  Too much governance and nobody can get access to the data let alone use it.  Too little and you find data duplicated all over the place.  The net result is that infrastructure, license and support costs spiral out of control along with the data and before you know it, you end up with a data swamp of questionable economic value and a long tail of costs for the business.

Navigating Your Big Data Journey

At Dell EMC, we offer a comprehensive portfolio of Big Data & IoT Consulting services from big data strategy through big data implementation, and ongoing optimization to help our customers accelerate the time to value of their analytics initiatives and maximize their investments. We also help organizations bridge the gap of people, process, and technology needed to realize transformational business outcomes, including defining a strategy and establishing governance.

In my next blog, I'll discuss how the tools that help you learn and explore, along with having a safe place to play, apply to enterprise big data success – for your data scientists in particular.

Stay tuned and happy parenting!
