
4 Key Steps on Your Journey to the Data Lake

By Frank Coleman, Senior Director, DELL EMC Services, October 26, 2015


In my last blog I talked about why you need a Data Lake. Now I’m going to share a few helpful steps on this journey and highlight some “gotchas” to avoid.

Step 1 – Feed the Lake


Understand all the data needs of your company/customer. If you don’t have the data, you are dead in the… wait for it (yes, I’m going there, since we’re talking about a data lake)… water.

I can’t count the number of times I’ve requested data only to find it was missing an integral column in: Quotes, Billings, Bookings, Install Base, Contracts, Logistics, Case Management, Headcount, Expenses, Web traffic, Mobile, Telephony, Training, Industry data, DUNS….

With the data lake I finally have a place to ask questions about my business where I can see what is happening end to end without having to use 10 different BI tools.

  • Mistake #1: Structuring the data into what you think users need. Feed all the data, structured and unstructured. If you don’t, you will always be asked to feed more, and paying by the drip is expensive. Your customers will also stay unhappy and work around your expensive data ingest model by using shadow solutions.
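
To make "feed everything, structure later" concrete, here is a minimal sketch that assumes a simple file-based landing zone. The function names (`land_raw`, `read_with_schema`), the directory layout, and the sample records are all hypothetical and only illustrate the idea: records are stored exactly as received, and each question picks the columns it needs at read time.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_dir: Path, source: str, records: list) -> Path:
    """Append records to the lake exactly as received, one JSON object per line."""
    target = lake_dir / source / "raw.jsonl"
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
    return target


def read_with_schema(path: Path, columns: list) -> list:
    """Apply structure at read time: keep only the columns this question needs."""
    rows = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            # Fields a record never had come back as None instead of failing.
            rows.append({col: record.get(col) for col in columns})
    return rows


# Hypothetical feed: the two records do not even share the same fields.
lake = Path(tempfile.mkdtemp())
path = land_raw(lake, "case_management", [
    {"case_id": 1, "product": "array", "notes": "customer called twice"},
    {"case_id": 2, "product": "switch", "severity": "high"},
])

first_question = read_with_schema(path, ["case_id", "product"])
later_question = read_with_schema(path, ["case_id", "severity", "notes"])
```

Because nothing was dropped at ingest, the later question can use `severity` and `notes` without anyone going back to the source system for another feed.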

Step 2 – Care for the Lake

The initial feed to the data lake is awesome for asking questions and data discovery. But what happens when you discover something? What if you already knew something and now 10 other groups are reinventing the wheel? Or asking the same question but getting different answers?

This is where I use my Lego example. Let’s say you just dumped a bag of Legos on my desk, showed me a picture of the Death Star with no instructions or pre-packaged bags, and said, “Build it, you have everything you need.” That is what a data lake feels like to a team without the skills or shared knowledge to use it.

  • Mistake #2: Oops – you may have a skill gap. I hope your data lake is sitting on one of EMC’s solutions. Sorry, I couldn’t help shamelessly plugging our EMC equipment. All kidding aside, this is what we use at EMC. If your team doesn’t know Hadoop, you probably need to learn it or pay for it “as a Service” to get at the unstructured data.
  • Mistake #3: Create a community, not a competition. Data SMEs are your friends, not the enemy. Everyone wants to say they are in an advanced analytics / data science team, which is great, but you don’t need to rediscover that the earth is round. Take advantage of data SMEs’ learnings and share yours. This can later feed into data governance.

Step 3 – Use the Lake with Analytical Sandboxes

Analytical sandboxes are provisioned spaces for users to discover and build new insight. This is finally a place where you can access the data without a BI layer in the way. You can build and merge many different data sets in ways that were previously impossible.

  • Mistake #4: Where is that column? A huge frustration I’ve run into is only knowing our data through traditional BI tools. These BI tools often transform the data or create custom calculations that don’t exist natively. Understanding what data exists and how the BI data was created can save you an enormous amount of frustration and time.

Heads up: You will run into resistance from people saying you are duplicating efforts and now have to maintain two copies of the logic. Forcing BI tools to do ETL is a huge mistake, as the BI tool will hold you hostage. If you build the logic into the data layer and then drop it into BI, it will run faster and take better advantage of your infrastructure.
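
As a rough illustration of keeping logic in the data layer, the sketch below uses Python’s built-in `sqlite3` module as a stand-in for whatever database sits under your lake. The `cases` and `csat` tables, their columns, and the "score of 8 or higher counts as satisfied" rule are hypothetical examples, not the logic we actually run. The point is that the join and the calculation are defined once, as a view, and every consumer issues a plain query against it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cases (case_id INTEGER PRIMARY KEY, opened_day INTEGER, closed_day INTEGER);
    CREATE TABLE csat (case_id INTEGER, score INTEGER);

    INSERT INTO cases VALUES (1, 0, 2), (2, 1, 9), (3, 3, 4);
    INSERT INTO csat VALUES (1, 9), (2, 4), (3, 8);

    -- The business logic lives here, in the data layer, defined once.
    -- The threshold of 8 is a made-up rule for illustration only.
    CREATE VIEW case_satisfaction AS
    SELECT c.case_id,
           c.closed_day - c.opened_day AS days_to_close,
           s.score,
           CASE WHEN s.score >= 8 THEN 1 ELSE 0 END AS satisfied
    FROM cases AS c
    JOIN csat AS s ON s.case_id = c.case_id;
""")

# A BI tool or a sandbox user now runs a plain SELECT; no logic hides in the report.
report = conn.execute(
    "SELECT case_id, days_to_close, satisfied FROM case_satisfaction ORDER BY case_id"
).fetchall()
```

With this layout, swapping the visualization tool does not mean rewriting the calculation, and two reports built on the same view cannot disagree about what "satisfied" means.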

Step 4 – Feed Back into the Lake

Again, you want to be part of a community. Once you discover or create a model that adds value, feed it back. If you only keep it in your sandbox, you limit the amount of value this insight can produce. Many groups in my company look at similar information through very different lenses, and sharing our findings helps reduce duplicated effort or even dueling data. Success and value, to me, is when we operationalize our findings and build them into workflows and applications and/or change the way we work, not just produce a report with cool visualizations. Your initial discovery or model may not be the final mile. Getting it fed back can help it feed other solutions or use cases. As an example, the data science model you created in your sandbox most likely isn’t going to feed a production app on its own.

  • Mistake #5: I’m taking my data and going home. Creating insight or a fantastic data science model in your sandbox is awesome, but keeping it there limits its value. Many of your initial sandbox users may be former shadow IT groupies, for whom sharing data is not natural or encouraged. Create incentives and reward sharing.
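
One lightweight way to make feeding back the default is a shared catalog where sandbox results are registered with an owner and their source tables. The sketch below is a toy, in-memory version; the `SharedCatalog` class, its method names, and the sample entries are hypothetical and stand in for whatever catalog or governance tool your organization uses.

```python
from dataclasses import dataclass, field


@dataclass
class SharedCatalog:
    """Tiny in-memory stand-in for a shared registry of published findings."""
    entries: dict = field(default_factory=dict)

    def publish(self, name: str, owner: str, source_tables: list, description: str) -> None:
        # Refuse a second definition under the same name to avoid dueling data.
        if name in self.entries:
            existing_owner = self.entries[name]["owner"]
            raise ValueError(f"{name!r} is already published by {existing_owner}")
        self.entries[name] = {
            "owner": owner,
            "source_tables": list(source_tables),
            "description": description,
        }

    def find_by_table(self, table: str) -> list:
        """List published findings built on a given source table."""
        return sorted(
            name for name, entry in self.entries.items()
            if table in entry["source_tables"]
        )


catalog = SharedCatalog()
catalog.publish("case_satisfaction", "services-analytics",
                ["cases", "csat"], "CSAT joined to case resolution time")
catalog.publish("renewal_risk", "contracts-team",
                ["contracts", "cases"], "Renewal likelihood by open case volume")

# Before building something new on case data, check what already exists.
related = catalog.find_by_table("cases")
```

Rejecting a duplicate name forces two groups with the same idea to talk to each other instead of quietly shipping competing versions.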

I hope you found these steps useful and learned how to avoid some land mines. If your experience is different or you have other suggestions, please comment below.

Frank Coleman

About Frank Coleman


Senior Director, DELL EMC Services

Frank is a Senior Director of Business Operations for Dell EMC Services. He is living the world of Big Data in this role, as he is responsible for using advanced data analytics to improve the customer experience with Dell EMC’s services organization.

This role keeps Frank immersed in Big Data, and he is at the cutting edge of using Big Data to solve real business problems. Frank has a strong blend of technical knowledge and business understanding, and has spent the last nine years focused on the business of service.

Under his leadership, EMC was honored in mid-2012 for the third consecutive year with the Technology Services Industry Association (TSIA) STAR Award for “Excellence in the Use of Metrics and Business Intelligence.” Prior to joining EMC, Frank worked in various fields and remote technical support roles.


3 thoughts on “4 Key Steps on Your Journey to the Data Lake”

  1. Hi Frank, I’m enjoying this article because I’m training in Big Data. It clarifies some things but also raises some questions. For example, in the “heads up” section you write, “Forcing BI tools to do ETL is a huge mistake as the BI tool will hold you hostage. By building it into the data layer and then dropping it in to BI, it will run faster and take better advantage of your infrastructure.” What do you mean exactly? Thanks in advance!

    • Many times I see very complex logic done in reporting tools; many tools require a Universe or similar layer between the actual data and the visualization tool. We often like to join many different data sets together, for example Case Data with CSAT data. However, we often use different vendors and different tools to get these results, so you can’t easily access this data to merge it without extracting a .csv file or Excel dump from the tools. If you fed the data raw from the systems, you would be missing some of the complex logic done at the Universe layer, forcing you to use the tool as an ETL. Pulling that complex logic into the data layer makes it easier to merge data sets, and when you layer in the BI tools the reports will run faster because they don’t have to do the complex logic in the Universe or, worse, in the report itself.

      Hope this helps!

      Frank