David Dietrich – InFocus Blog | Dell EMC Services
https://infocus.dellemc.com

2015 EMC World & Federation Business Data Lake
https://infocus.dellemc.com/david_dietrich/2015-emc-world-federation-business-data-lake/
Thu, 30 Apr 2015

In late March, EMC announced its Federation Business Data Lake solution. Earlier in Q1, I also changed roles and joined the Big Data Solutions group as Director of Technical Marketing, where I now focus on the Federation Business Data Lake platform and related technical marketing initiatives.

Although the term has been in use for the past 3-4 years, "Data Lake" still seems new and foreign to many people. Most often, I find a misperception that the data lake is only a data repository. If that were the case, wouldn't it be akin to storage? Or a database, which naturally stores data too? The reality is that EMC is providing a full technology stack to ingest, store, and analyze data, then surface insights and act on them. Below is a high-level view of the architecture.

[Figure: High-level architecture of the EMC Federation Business Data Lake]

As you can see, the storage, databases, and data stores are not the sole focus. Rather, the intent is to give organizations a mechanism that solves data acquisition, data storage (hardware and software), and analytics, and then exposes these data layers via APIs and Cloud Foundry toward the top of the stack. With this reference architecture and set of supporting services, organizations can go from knowing they need to do something with Big Data (but not knowing how to begin, which technologies to choose, or whom to hire to assemble and configure them) to having a ready-made technology stack that addresses these needs in as little as one week.

This represents a massive jump in productivity and time to value, because it changes the questions within the enterprise from "which tools do we need?" and "do we have anyone in IT who can work with each of these strangely named tools?" to "OK, we know the stack now, so which Big Data projects do we want to focus on first, second, and third?" With this new-found understanding, IT can re-orient its relationship with the business units: from one in which IT is expected to troubleshoot problems and configure systems, to one in which it contributes directly to the company's Big Data vision and strategic Big Data initiatives.


This is one of the topics I will discuss during several sessions at EMC World in Las Vegas. I invite you to attend my session – EMC IT: How We’re Creating New Business Value From Big Data & Business Data Lakes – where I will touch on the Federation Business Data Lake that EMC is offering, as well as some of the skills organizations need to take advantage of these tools. To give you an insider’s perspective, I will be joined by Sean Brown, who has spent the last few years building out the big data platform for EMC IT, and he will describe how EMC IT has undergone this transformation.

A Three-Pronged Approach to Enterprise Analytics
https://infocus.dellemc.com/david_dietrich/three-pronged-approach-enterprise-analytics/
Mon, 08 Dec 2014

Over the past 3 years, I have spoken with hundreds of people at many large enterprises doing analytics of one kind or another. While some of these conversations focused on reporting and Business Intelligence, many expressed the desire to move toward data science and Big Data Analytics.

These conversations have revealed that many companies don’t know how to make the shift toward this end of the analytics spectrum. They may know they want to do something, but not what to do or how to get there – only that they have a need. As a colleague of mine remarked, this is akin to a five-year-old asking for lunch. They don’t know what they want, they don’t know what you have to offer, they just know they are hungry.

Similarly, companies are becoming hungry for data science and predictive analytics, but don’t always know what they want or when they have completed their quest. Occasionally, groups hire a data scientist or two into their reporting teams. I have seldom seen this approach succeed. The data scientist tends to be underutilized or is asked to do reporting work (this doesn’t last long), or is managed by someone with expertise in reporting but not in data science, who can provide little guidance on the data scientist's unique set of challenges.

After speaking with numerous companies, I’ve noticed a trend: many organizations now take a three-pronged approach to analytics. Many companies deliberately create three separate groups, with some coupling or coordination among them, as follows.

1. Reporting Team. This is a centralized team that handles regular reporting to the business on key metrics of interest. Enterprises seem to have a nearly insatiable appetite for reports and dashboards; I believe this is partly because Business Intelligence artifacts are understandable and familiar to lay people. These teams generally use some kind of RDBMS, OLAP, or similar, and push data to tools such as Crystal Reports, Business Objects, or more recently Tableau, QlikView, or Spotfire.

  • A sample project might be to create a report of the 10 customers that spend the most.

2. Advanced Analytics Team. Separate from #1 above, this group is composed of quantitative analysts, statisticians, operations researchers, and the like, and is generally managed by someone with expertise in these domains. These people tend to work with structured data (though they don't have to), often using data stores of manageable, though large, size. They often use desktop analytical tools, such as SAS, R, SPSS, Python, and Weka, to create predictive models, propensity models, attrition models, and clusterings, and to execute data mining methods.

  • A sample project might be to run a Kolmogorov-Smirnov test to identify a threshold for marketing to the customers most likely to buy your products, without wasting money on those who probably won’t.
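To make the Kolmogorov-Smirnov idea concrete, here is a small, self-contained Python sketch. The propensity scores, names, and sample sizes are entirely made up for illustration; it simply computes the two-sample KS statistic, the largest gap between two empirical CDFs.

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum vertical
    distance between the two samples' empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # fraction of sample values <= x
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    # The maximum gap always occurs at one of the observed points.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

# Hypothetical propensity scores: buyers tend to score higher than non-buyers.
random.seed(7)
buyers = [random.gauss(0.65, 0.15) for _ in range(500)]
others = [random.gauss(0.40, 0.15) for _ in range(500)]

d = ks_statistic(buyers, others)
print(f"KS distance between buyer and non-buyer score distributions: {d:.2f}")
```

A large KS distance suggests the score separates the two groups well, which is what makes it useful for picking a marketing cutoff.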

 

 

3. Big Data Analytics Team. While this team may employ methods similar to #2 above, a key difference is that it works with fundamentally different kinds of data. The data is messier and largely unstructured, and requires different tool sets to manage, often using HDFS or other NoSQL data stores (HBase, Neo4j, Couchbase, and others). As such, this team must apply some of the advanced analytical methods mentioned in #2 above, but in a new context (Big Data). It therefore must be the most technically savvy of the three groups and be comfortable writing code (SQL, Java, Python, R, and Scala are popular), as it may apply advanced quantitative methods to messier data, or use entirely different sets of techniques, such as those in natural language processing.

  • A sample project might be to identify your company’s most valuable customers. Beyond just reporting on the customers who spend the most, or identifying customers most likely to purchase, a Big Data Analytics team can uncover the customers who exert the most influence over your other customers and encourage them to buy your products.

Using Big Data tools (e.g., Hadoop, Pivotal Greenplum) and Linked Data, one could connect customer reference data and tweets to determine which customers are influencing other customers to speak at your events, which act as references for your customers and prospects, and which referrals result in sales.

It should not be automatically assumed that your most valuable customers—as in the Customer Top 10 list I allude to in #1 above—are those who spend the most in a year. Often, the customers who cause more people to buy your product prove to be the most valuable. A Big Data Analytics team can enable your organization to develop an app for comparing data structures, perform social network analysis at large scales very quickly, and render a result that gives you a more accurate picture of your most valuable customers.
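As a toy sketch of the influence idea (the names and referral edges here are invented; a real team would apply graph measures such as PageRank over far larger data), even a simple out-degree count over a referral graph ranks customers differently than a spend report would:

```python
from collections import defaultdict

# Hypothetical referral edges (influencer, influenced), as might be
# mined from tweets, event invitations, or sales-reference records.
referrals = [
    ("alice", "bob"), ("alice", "carol"), ("alice", "dave"),
    ("bob", "erin"), ("carol", "frank"), ("carol", "grace"),
]

# Simplest possible influence measure: out-degree (direct referrals made).
influence = defaultdict(int)
for influencer, _influenced in referrals:
    influence[influencer] += 1

ranked = sorted(influence.items(), key=lambda kv: kv[1], reverse=True)
for customer, score in ranked:
    print(f"{customer}: {score} direct referrals")
```

Here "alice" tops the influence ranking even if she is nowhere near the top of the spending list.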

Most organizations I’ve spoken with are mainly doing #1 (reporting), a little bit of #2 (advanced analytics), and still getting their feet wet with #3 (Big Data Analytics). How would you assess your organization’s progress in these 3 areas?

 

Big Strata
https://infocus.dellemc.com/david_dietrich/big-strata/
Tue, 21 Oct 2014

[Photo: The Javits Center, New York City]

After returning from Strata + Hadoop World in New York, I am struck with how mainstream Big Data has become. No longer the playground of only a few quiet people with black shirts and ponytails, it has broadened to include business users, leaders, and yes, of course, plenty of engineers, architects, and Data Scientists. Amidst the hype of Big Data, I’ll offer 3 main takeaways from the conference.

 

1. Strata + Hadoop World growth may be keeping pace with data growth.

Two or three years ago, the conference drew about 500 people, a size a midtown Manhattan hotel could accommodate. This year there were 5,500 attendees, which required moving the conference to the Javits Center, a huge convention center on the city’s west side. There was a multitude of sessions across 11 concurrent tracks, an almost dizzying array of choices for any given time slot. Although the main industry firms (and vendors, for that matter) claim there is huge opportunity in Big Data, there are still plenty of naysayers insisting it’s a flash in the pan, or only for an elite minority. This explosion in attendance (10x in two years) signals the pull of Big Data and a growing contingent of supporters who believe in the value it brings.

2. White-Hot Big Data Market

Several years ago, the lone conference message board was populated with notes about finding people’s lost keys, coats, and the like. Then when someone hand-wrote that they wanted to hire a Data Scientist, others quickly followed suit as I mentioned in a previous blog post.

[Photo: A job board at Strata + Hadoop World shows the dramatic growth of Big Data career opportunities]

This year, there were two large boards specifically for job postings. Companies posted typed job openings and their business cards, and of course hand-wrote jobs on the board, divided into areas such as Data Scientist, Data Engineer, Architect, and Business. Clearly, companies are actively looking for people; many held simultaneous recruiting sessions at rooftop bars in the city, day and evening. The competition for talent is intense, with friends of mine commenting that they were invited to several recruiting happy hours by internet giants at once (decisions, decisions…).

3. The Future is Memory

Many people still equate Big Data with Hadoop. That’s so three years ago! The trends have moved from Hadoop (think Apache, Pivotal HD, Cloudera, Hortonworks), to SQL on Hadoop (HAWQ, Impala, Stinger, Hive), to the current trend of in-memory computing (Spark, GemFire, Tachyon, and many other new entrants).

Spark, an in-memory cluster computing framework designed for very high processing performance, was a huge focus at the conference, and I expect it to be for the next few years. In fact, the Strata talks were organized into groups related to Hadoop, Business, and Spark.

As Spark is emerging as an open source standard for in-memory computing, expect other enterprise-grade versions of in-memory databases (and related in-memory management tools, such as Tachyon) to come to the forefront, as happened with Hadoop. More to come on this in my next post.

I always feel like somebody’s watching me….(and I got no privacy)
https://infocus.dellemc.com/david_dietrich/i-always-feel-like-somebodys-watching-me-and-i-got-no-online-privacy/
Mon, 04 Aug 2014

Much like Rockwell sang in the early 80s, online I always feel like somebody’s watching me. Although I don’t think Rockwell was singing about online privacy back then, it is indeed a key issue now.

These days it is difficult to know how to use the web in a way that gets you what you need while maintaining appropriate levels of privacy. If you want to learn about this, read the Federal Trade Commission’s recent report on data brokers, which profiles different data brokers and how they acquire, monetize, and share data. Whose data? Your data. My data. Everything you do, look at, purchase, tweet about, email, and Like leaves data trails that can be combined into profiles, which data brokers monetize by selling them to marketing firms that want to know what products to offer you before you even realize you may want them. This is surprisingly effective.

What can you do about it? Well, you can ask to have links to your past actions removed, as has been requested in the EU.

Since I’ve spent several posts talking about privacy risks, I will now share some ways to improve your privacy online. The first thing to realize is that improving your privacy online requires a change in your usage behavior. Here are a few ways to do this.

Find out How Unique your Browser Is. In casual parlance, many people describe things as a “very unique” idea or product. To my ears this always sounds strange, since uniqueness is generally a binary state: either you are unique or you are not. Or are you? In the case of browser data and identification, you actually can be somewhat unique or identifiable; there are gradations of uniqueness. Here it refers to the degree to which your browser can be uniquely identified. One way to test this is with Panopticlick, which examines your browser to see how unique it is based on the information it shares with the sites it visits. Information your computer shares while on the web, such as the name, operating system, and precise version number of the browser, contributes bits of entropy that can be used to identify you.

Try it here https://panopticlick.eff.org/, and it will test you on the spot if you wish.
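The "bits of entropy" idea can be sketched in a few lines of Python. The attribute frequencies below are hypothetical placeholders, not real Panopticlick measurements, and the sketch assumes the attributes are independent (real fingerprinting analyses must account for correlations):

```python
import math

def surprisal_bits(fraction_sharing):
    """Bits of identifying information revealed by an attribute value
    that a given fraction of all browsers share."""
    return -math.log2(fraction_sharing)

# Hypothetical frequencies, purely for illustration:
attributes = {
    "user-agent string": 1 / 1500,  # shared by 1 in 1,500 browsers
    "screen resolution": 1 / 20,
    "timezone": 1 / 8,
}

total = sum(surprisal_bits(p) for p in attributes.values())
print(f"~{total:.1f} bits, enough to single out 1 in {2 ** total:,.0f} browsers")
```

Rare attribute values contribute many bits; common ones contribute few. Once the bits add up past roughly log2(number of web users), your browser is effectively unique.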


Another option is to install browser add-ons that tell you who is trying to share and use your browser cookies without your knowledge. Popular tools in this area include Ghostery, which limits cookie sharing based on your choices, and Collusion, which maps how cookies are disseminated without your knowledge when you visit certain sites.

Lastly, there is the option to use anonymous search and browsing tools. DuckDuckGo is one choice. Unlike Google or Yahoo, which store search history and provide search assist, auto-complete, and results tailored to your unique profile and history, DuckDuckGo does not store this information and therefore returns the same results for a given query to everyone. This makes it more anonymous than a regular search engine, but you trade away some convenience. The next step is Tor (The Onion Router), a private browsing network that received a lot of attention during the NSA/Snowden events a year ago. Tor lets you browse anonymously through a confined web of relays, intermediating your browsing habits, location, and history. Of course, now that Tor has received some attention, it is being used by more and more people (citizens, lawmakers, and law breakers…), which can affect its speed.


These are not the only solutions, but hopefully they provide some ideas for improving your privacy online and learning about your own browsing habits. A strange thing about using the web is that many people sit at a computer in a quiet office and thus have a false sense of security. In the physical world they operate alone, at a desk, in a quiet place; online, they are connected to millions of other machines, people, and trackers. It is a strange dichotomy between the physical realm and online behavior. Don’t be fooled by the quiet and solitude while you browse the web; pay attention to what you do online.

Aware that many readers work with data, big or small, in future posts I will share points of view regarding data anonymization and de-anonymization, since data sharing is an issue companies will face as they try to become more data science-driven.

Predictive IT: Data Science Looks to the Future
https://infocus.dellemc.com/david_dietrich/predictive-it-data-science-looks-to-the-future/
Tue, 24 Jun 2014

I had the good fortune to contribute to a panel discussion at EMC World 2014 last month, focused on Predictive IT. Moderated by Bill Schmarzo, part of EMC’s Big Data consulting practice, I was joined by three other panelists intimately involved with Big Data across EMC:

  • Krishnakumar Narayanan (“KK”), EMC IT’s Chief Architect and head of IT’s Data-Science-As-a-Service team
  • Frank Coleman, who directs an analytics and reporting team within EMC Customer Service
  • Matt Povey, who helps customers implement Big Data technology solutions
[Photo, from left to right: Frank Coleman, Krishnakumar “KK” Narayanan, David Dietrich, Matthew Povey]

The positive feedback on the session was driven by an engaged audience asking good questions, coupled with a diverse mix of perspectives among the panelists.

Several undercurrents ran through the questions we received from the audience. First, confusion remains about Business Intelligence versus Data Science: when to use each, and when to bring these skills onto your team. This is not surprising, since it’s still early days and customers need assistance transitioning from a traditional analytics and reporting team to more complex data science problem sets. Second, people also need help identifying typical IT use cases for Big Data.

Many of the questions focused on simple ways to get started with Big Data and Data Science. I shared a number of ideas on this, though I tried to boil it down to two main suggestions for those trying to move toward doing more Data Science:

  1. Stop thinking about the past. Start thinking about the future. Typically, reporting and Business Intelligence focus on telling you what happened last year or last quarter. Reframing questions to be about the future—What will we do next year? How are things likely to change next time, and how will we know?—causes people to think in different terms, and to analyze data and situations using different methods. This shift will also require getting people on the team with the skills to apply advanced analytical methods that answer these questions.
  2. Adopt a “test and learn” mindset. Too often, people take the “wait and see” approach (“Let’s see how many new customers we get….”). Instead, consider A/B testing, or find other ways to test ideas and learn from the experiments. This critical shift toward data science needs to happen. Don’t be afraid to experiment or to admit you don’t know something; that’s generally how people get out of their comfort zones and learn new things.
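The "test and learn" point can be made concrete with a minimal sketch of evaluating a hypothetical A/B test using a two-proportion z-score. The conversion counts are invented for illustration, and a real program would also plan sample sizes and significance thresholds up front:

```python
import math

def ab_z_score(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-score comparing the conversion rates of
    variants A and B, using the pooled standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical results: variant B converted 165 of 2,400 visitors,
# variant A converted 120 of 2,400.
z = ab_z_score(conv_a=120, n_a=2400, conv_b=165, n_b=2400)
print(f"z = {z:.2f}")  # |z| > 1.96 suggests significance at the 5% level
```

Rather than waiting to "see how many new customers we get," the team runs both variants, computes the statistic, and lets the experiment decide.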

This Birds-of-a-Feather topic generated some thoughtful dialogue, aided by enthusiastic participation from the audience.  I encourage you to visit the InFocus author pages of my fellow panelist, Frank Coleman, and our moderator, Bill Schmarzo, who regularly address interesting and timely Data Science topics on their respective blogs.

 

Viewing Big Data as Disruptive Innovation
https://infocus.dellemc.com/david_dietrich/viewing-big-data-as-disruptive-innovation/
Sun, 06 Apr 2014

Big Data is a hot topic everywhere you look these days, and many people are making predictions about its impact. As Freakonomics author Steven Levitt points out in his podcast, predictions are a strange thing. Many pundits are rewarded for making wild predictions, which may be correct once in a great while, but they are rarely held accountable for the majority of their predictions, which may be incorrect.

At the risk of making a prediction myself, I agree with the perspective that Big Data is a disruptive force in the marketplace. This means that people need more than new tools, technologies, and skills. They need an open mind, to rethink processes they have followed for a long time, and to change the way they operate. This is risky. First, people must force themselves to adapt. Second, no one likes feeling they were wrong before and now need to do something different. However, I would suggest viewing this a bit differently.

Because I believe that Big Data is indeed a disruptive opportunity, instead of the challenges mentioned above, I suggest readers consider two positive aspects:

1. There is an opportunity to capitalize on the disruptions, market changes and flux occurring in the market.

2. Opportunity abhors a vacuum. If you do not capitalize on the opportunities, expect that others will. These other entrants can be startups, or competitors, but people will step up to try to take advantage of the disruptions.

With this operating assumption that Big Data represents opportunities for disruptive innovation (as opposed to incremental or evolutionary innovation), I decided to revisit Clayton Christensen’s seminal work, The Innovator’s Dilemma. Christensen describes a path forward for new, disruptive innovations, proceeding in four phases:

Phase 1: Performance.  At this stage, there are many new market entrants, a fair amount of chaos and the main focus from customers is on emerging functionality and feature sets. When an invention is new or a technology hits the market, the first thing people crave is strong product performance and features, while ensuring it is doing the new thing they expect.

Phase 2: Reliability. When the market reaches this juncture, people have accepted the feature set and now want stability and reliability in the products. The focus has shifted from ‘does this product do what we expected’ to ‘how reliable is it’.

Phase 3: Convenience. The relevance here for Big Data means making software available on mobile devices, as iPhone apps, or similar. Rather than making software products command-line driven, pleasing user interfaces come into play, and customers start demanding them.

Phase 4: Price. Once the other three steps have been fulfilled, then the market players are on equal footing and competing on price. As the product becomes a commodity and the other criteria are satisfied, price becomes the only differentiator.

With Big Data, I think we are still early in this lifecycle. Most products are in Phase 1, and some are entering Phase 2. Consider Hadoop. Despite the amount of hype and articles written about Hadoop, most people claim they still have not implemented Hadoop. For most, they are still figuring out what it is, how it works, which feature set it has, and which features it needs to be a viable product. With that said, there are certainly organizations that have implemented Hadoop, and are now looking at its reliability. They want to use it as part of their Enterprise Data Warehouse, or as part of a Data Lake. For this to be a reality, Hadoop needs to have some features to make it more reliable and dependable for the enterprise. It is getting there, as active Hadoop users are working on this, as are Hadoop vendors, such as Cloudera and Pivotal. That said, there are many Big Data applications that are not as far along as Hadoop. Expect a similar evolution for the types of tools along this continuum, and for vendors of Hadoop and other somewhat more established Big Data technologies, as they add reliability and begin to think of convenience. YARN is an example of these emerging technologies.

For example, I consider Hadoop to be straddling the first two stages of this continuum, as portrayed below:

[Figure: Continuum of Disruption, with Hadoop straddling the Performance and Reliability phases]

I’m curious to know if others agree with this placement, and where you might put other relatively new technologies, such as Storm.

In summary, despite the many “wild predictions” made about Big Data, the reality is that Big Data is disruptive, and as a disruptive innovation it should follow an established path. People must realize where they are playing and what they are chasing with regard to Big Data. They must determine which phase of disruption they are in, and ensure they are meeting the needs of the current phase, as well as the next phase in the progression. This is important to understand in order to define and implement a successful Big Data strategy and proactively meet market needs.

Three Big Misconceptions About Big Data
https://infocus.dellemc.com/david_dietrich/three-big-misconceptions-about-big-data/
Wed, 15 Jan 2014

As a result of the industry’s growing interest in Big Data, my favorite topic, I did more public speaking in 2013 than in any other year of my career. I delivered 14 talks at industry conferences and events, at universities, and within EMC. Over the course of delivering these talks, a number of comments, questions, and misconceptions about Big Data came up again and again. I felt it would be useful to share some of what I heard, so here are three big misconceptions about Big Data:

 1. The most important thing about Big Data is its size

Big Data is mainly about the size of the data because Big Data is big, right? Well, not exactly, says Gary King of Harvard’s Institute for Quantitative Social Science. Certainly there is more data to work with than in the past (this is the Volume of the “3 Vs” – Volume, Variety, and Velocity), but if people focus mainly on gigabytes, terabytes, and petabytes, they are looking at Big Data mainly as a problem of storage and technology. Although this is definitely important, the more salient aspects of Big Data are typically the other two Vs: Variety and Velocity. Velocity refers to streaming data and very fast data with low latency, accumulating or entering a data repository in a way that enables people to make faster (or even automated) decisions. Streaming data is a big issue, but to me, the Variety piece is the most interesting of the 3 Vs.

[Figure: common sources of Big Data]

The items shown above represent common ways that Big Data is generated. In fact, this illustrates a philosophical issue – it is not just the fact that Big Data has changed, it is more that the definition of what is considered data has changed. That is, most people think of data as rows and columns of numbers, such as Excel spreadsheets, RDBMSs, and data warehouses storing terabytes of structured data. Although this is true, Big Data is predominantly about semi-structured data or unstructured data. Big Data encompasses all of these other things that most people don’t think about when they consider data, such as RFID chips, geospatial sensors in smart phones, images, video files, clickstreams, voice recognition data, and metadata about these data. Certainly we need to find efficient ways to store the volumes of data, but I find that when people begin grasping the variety and velocity of data, they begin to find more innovative ways to use it.

2.  It’s just fine to bring a knife to a gun fight

“OK, but why do I need new tools? Can’t I just analyze Big Data with my existing software tools?” During a panel discussion about using Hadoop to parallelize hundreds or thousands of unstructured data feeds, an audience member asked why he couldn’t simply analyze a large text corpus with SPSS. The reality is that once you grok #1 above, you realize you need new tools that can understand, store, and analyze different kinds of data inputs (images, clickstreams, video, voice prints, metadata, XML…) and process them in a parallel fashion. This is why desktop tools that were adequate for local, in-memory analytics (SPSS, R, WEKA, etc.) buckle under the weight and variety of Big Data sources, and why we now need technologies that can manage these disparate data sources and process them in parallel.
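To make the parallel-processing point concrete, here is a minimal sketch of the split/map/reduce pattern that frameworks like Hadoop generalize. It is illustrative only (a toy word count, not a Hadoop program), using Python’s standard multiprocessing module to process several text feeds at once:

```python
from collections import Counter
from multiprocessing import Pool

def count_words(doc: str) -> Counter:
    """Map step: tokenize one feed and count its words locally."""
    return Counter(doc.lower().split())

def parallel_word_count(docs: list[str]) -> Counter:
    """Fan the per-feed counts out across worker processes, then merge them."""
    with Pool() as pool:
        partials = pool.map(count_words, docs)  # each feed handled by a worker
    total = Counter()
    for partial in partials:                    # reduce step: merge partial counts
        total += partial
    return total
```

The same shape, independent per-chunk work followed by a merge, is what lets Hadoop-style systems scale from a few strings to thousands of real feeds.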

3.  Imperfect data quality must mean that Big Data is worthless

“Yes, but with the Big Data, what about data quality? Isn’t it just ‘garbage in-garbage out’ (GIGO) on a larger scale?”

Big Data can certainly be messy, and data quality is important to any analysis. The key thing to remember, however, is that the data will be inherently noisy: there will be many distractions, anomalies of different kinds, and inconsistencies. The goal is to focus on the amount and variety of data that can be pruned and used for valuable analysis; in other words, find the signal within all of the noise. In some cases, organizations will want to parse and clean large data sources; in other cases this will be less important. Consider Google Trends.

[Image: Google Trends – top searches of 2013]

Google Trends ranks what people are searching for, such as the top Google searches of 2013 shown above. Sifting through the searches and ranking them requires a massive amount of storage, processing power, and robust analytical techniques. This is an example of using Big Data where GIGO is less of the focus.

By this point, many people say things such as “Aha! This sounds like a big change.” Yes! As a colleague of mine says, this suggests a distinction whereby people think of Big Data either as a noun or as a verb. Thinking of Big Data as a noun treats it as “just more stuff” that needs to be stored and accommodated. Treating Big Data as a verb implies action; people in this camp view Big Data as a disruptive force and an impetus to change the way they operate. They use Big Data to test ideas in creative ways and approach business problems analytically, such as by performing A/B testing: consider Google testing 50 shades of blue to find the one Gmail users would click on most, rather than having marketing managers simply guess. They find ways to measure what would seem to be unmeasurable, such as companies and universities finding better ways to automate image classification. They explore ideas in new ways, using data to answer the “what if…” questions.
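Deciding whether one shade of blue truly outperforms another comes down to a simple statistical comparison. Here is a hedged sketch (the function name and all numbers are invented for illustration) of a two-proportion z-test on click-through rates, using only the Python standard library:

```python
from math import erf, sqrt

def ab_test_z(clicks_a: int, views_a: int, clicks_b: int, views_b: int):
    """Two-proportion z-test: is variant B's click rate really different from A's?"""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)           # rate under the null
    se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))          # two-sided
    return z, p_value

z, p = ab_test_z(clicks_a=100, views_a=1000, clicks_b=150, views_b=1000)
# B's 15% rate vs. A's 10% over 1,000 views each gives z ≈ 3.4, p ≈ 0.0007:
# strong evidence the shade really does matter.
```

This is the verb mindset in miniature: run the experiment and let the data answer, rather than guessing.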

The organizations that view Big Data as a verb will be the winners in this race.

The post Three Big Misconceptions About Big Data appeared first on InFocus Blog | Dell EMC Services.

The Genesis of EMC’s Data Analytics Lifecycle
https://infocus.dellemc.com/david_dietrich/the-genesis-of-emcs-data-analytics-lifecycle/
Fri, 01 Nov 2013

When I developed a new Data Analytics Lifecycle for EMC’s Data Science & Big Data Analytics course in 2011, I had no idea how much attention it would receive. Although I have been doing analytical work for most of my career, I needed to do considerable research to create a solid process for others to follow. After some preliminary research, I realized that there were surprisingly few existing frameworks for conducting data analytics.

The best sources that I came across were these:

  • CRISP-DM, which provides useful inputs on ways to frame analytics problems and is probably the most popular approach for data mining that I found.
  • Tom Davenport’s DELTA framework from his text “Analytics at Work.”
  • “MAD Skills: New Analysis Practices for Big Data,” which provided inputs for several of the techniques in Phases three through five of my Data Analytics Lifecycle (model planning, execution, and key findings).
  • Doug Hubbard’s Applied Information Economics (AIE) approach from his work “How to Measure Anything.” The focus of this work differs a bit from a classic data mining approach. Hubbard’s approach emphasizes estimating and measuring for the purpose of making better decisions. It has some very useful ideas, and helps one understand how to approach analytics challenges from a unique angle and treat them more like decision science problems.
  • The Scientific Method. Although it has been in use for centuries, it still provides a solid framework for thinking about and deconstructing problems into their principal parts. One of the most valuable ideas of the scientific method relates to forming hypotheses and finding ways to test ideas.

After reading these other approaches to problem solving, I read additional industry articles, and also interviewed multiple data scientists, including several now at Pivotal Data Science Labs, as well as Nina Zumel, a Data Scientist at an independent company, Win-Vector.

This research fueled the creation of a new model for approaching and solving data science or Big Data problems, which is portrayed in this diagram:

[Diagram: the Data Analytics Lifecycle]

This diagram was designed to convey several key points:

1) Data science projects are iterative. Each phase is not a static stage gate; rather, the cycle reflects the iterative nature of real-world projects.

2) The best gauge of readiness to advance to the next phase is to ask key questions that test whether the team has accomplished enough to move forward.

3) Teams must do the appropriate work both up front and at the end of a project in order to succeed. Too often, teams focus on Phases two through four and want to jump into modeling work before they are ready.
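The shape of the lifecycle can be sketched in a few lines of code. The phase names and gate questions below are paraphrased for illustration (see the course materials for the canonical definitions); the point is simply that a failed gate sends the team backward rather than forward:

```python
# Phase names and gate questions are paraphrased for illustration;
# see the course materials for the canonical definitions.
LIFECYCLE = [
    ("Discovery",           "Do we understand the problem well enough to draft an analytic plan?"),
    ("Data preparation",    "Is there enough good-quality data to start modeling?"),
    ("Model planning",      "Do we have a sound idea of which models to try?"),
    ("Model building",      "Is the model robust enough to report on?"),
    ("Communicate results", "Can we state the key findings clearly for stakeholders?"),
    ("Operationalize",      "Can we pilot the model before a full rollout?"),
]

def run_lifecycle(gate_answers):
    """Walk the phases; a 'no' at any gate sends the team back one phase,
    reflecting the iterative (not stage-gate) nature of real projects."""
    history, i = [], 0
    for passed in gate_answers:
        history.append(LIFECYCLE[i][0])
        i = i + 1 if passed else max(i - 1, 0)
        if i == len(LIFECYCLE):
            break
    return history
```

A run such as `run_lifecycle([True, True, False, True, True, True, True, True])` revisits “Data preparation” and “Model planning” before finishing, which is exactly the looping the diagram is meant to convey.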

I’ve seen people get excited about this approach when taking our data science classes, and they have talked about it online, in blogs, and even in books. Last year, I co-authored a blog series with EMC Fellow Steve Todd describing how to apply the Data Analytics Lifecycle to measure innovation at EMC. This work has been cited many times, both for the project itself (which was mentioned in Business Week) and for the methodology, which was highlighted in CRN magazine. It was also recently featured in Bill Schmarzo’s new book on Big Data.

This Data Analytics Lifecycle was originally developed for EMC’s Data Science & Big Data Analytics course, which was released in early 2012. Since then, people have told me they keep a copy of the course book on their desks as a reference to ensure they are approaching data science projects in a holistic way.

I’m glad that practitioners, theorists, and readers have found this methodology useful. If you would like to learn more about frameworks for approaching Big Data projects, I’d suggest you check out our EMC Education course materials on Data Science and also review some of the resources mentioned above.

The post The Genesis of EMC’s Data Analytics Lifecycle appeared first on InFocus Blog | Dell EMC Services.

Building Data Science Teams at the Data Scientist Meetup
https://infocus.dellemc.com/david_dietrich/building-data-science-teams-at-the-data-scientist-meetup/
Wed, 16 Oct 2013

Big Data today is much like the World Wide Web was in the early 1990s. Back then, the Internet was so new, so different, and so revolutionary, it required completely new operating models, and people struggled to understand what it was, how to use it and how to best take advantage of it.

And because the World Wide Web was so new, there were few formal education programs available. To learn about the Internet and the mysterious World Wide Web, many groups formed organically to learn from each other and help create a community of people who understood and could figure out this new phenomenon together.

Now fast forward to the present day. Today we are dealing with this messy, hazy, and tantalizing thing called Big Data. There are many organizations popping up to help educate people (mine included, as I work in EMC Education Services), and many new university programs emerging to meet the demand and fill the skills gap for Data Scientists. As in the early days of the Internet, many self-forming, informal communities have emerged for people to help each other make sense of this space. One such community is the Meetup.

[Image: Meetup]

Meetups are just what they sound like — groups of people meeting up to talk about all kinds of topics. Much like Big Data itself, the variety and volume of Meetups is simply staggering. In the Boston area alone, there are more than 3,000 Meetups, ranging from groups about hiking, Frisbee contests, food, and technology entrepreneurship to venture capitalists, toy puppy-owner affinity groups, education, data mining, technology and, of course, Big Data.

I have found very active Meetups where people are learning about Big Data, Predictive Analytics, Data Science, and related topics such as Hadoop, Cassandra, and Django. One interesting thing about Meetups is that people attend on their own time, often after a full day of work, sometimes paying attendance fees out of their own pockets or volunteering to help out. In other words, their behavior shows they are genuinely interested in the subject matter.

I volunteered one recent evening to speak at a local Data Scientist Meetup in Cambridge, MA.

[Image: the Data Scientist Meetup group]

The Meetup was held at the Microsoft New England Research and Development Center, affectionately called “The NERD Center.”

[Images: the NERD Center]

As you can see from these photos, the NERD Center is a terrific facility, with large open meeting rooms and a view of the Charles River. My Meetup talk focused on “Building Data Science Teams,” a topic I also presented at EMC World in Las Vegas several months ago. Although my Las Vegas audience was somewhat quiet and asked only a few questions, my Cambridge audience was very vocal: I fielded 30-40 questions over the span of 90 minutes.

[Image: screen capture of the Meetup presentation]

Here is a link to my presentation and to video footage of the Meetup (thanks to Kate Hutchinson for the video recording).

As I speak on this topic, some common themes emerge in the audiences’ questions:

  • How do I learn about Big Data and get started with it?
  • How do I get a Big Data job?
  • What are the roles on a data science team? How does a Data Scientist differ from a Data Engineer or a Database Administrator, and how does this distinction change with new tools, such as Hadoop?
  • What organizational models are there for Big Data and Analytics in an organization?  How do I choose the right model, and what can I infer from each option?
  • How do I deal with sensitive information when the data requires privacy and security?

In future InFocus posts, I will explore these questions. Are these issues relevant to you? If there are other questions you would like to see answered, please feel free to post them in the comments section.

The post Building Data Science Teams at the Data Scientist Meetup appeared first on InFocus Blog | Dell EMC Services.

Big Data vs. Big Dollars
https://infocus.dellemc.com/david_dietrich/big-data-vs-big-dollars/
Tue, 27 Aug 2013

Every day, people trade privacy as a commodity in exchange for goods and services, although they may not realize such an exchange is actually taking place. I touched on this topic a bit in a previous InFocus post, “Big Data vs. Big Brother.” Google’s free services, such as Gmail, are a prime example of this trade-off. People get “free” email, but in exchange they give Google permission to scan their emails and serve targeted advertising back to them based on the content and habits of their email messages.

On a larger scale, many people have had mixed reactions to the recent NSA PRISM program and Edward Snowden’s disclosures. Here is a brief overview of PRISM from Wikipedia; under the program, federal agencies mined available data to determine who posed a potential security risk or threat.

[Image: PRISM overview from Wikipedia]

As you can see, Google was one of many providers sharing data that became an input for the surveillance monitoring.

In the case of PRISM, data monitoring and the mining of Big Data were conducted under the auspices of the public good. I’ve seen many different reactions to the NSA PRISM articles and publicity. Many people seemed shocked that this activity was taking place. For some, it raises concerns that others are reading their emails, Facebook updates and Likes, tweets, and other things in the name of keeping us safe or otherwise monitoring suspicious behavior. Others have commented to me:

“Well, of course the NSA is doing this. What do people think they are paid to do?”

“They can go ahead and read my tweets. I don’t care. I’m doing nothing wrong.”

This is all true – most people are not doing anything wrong, and most are not running large-scale surveillance projects from their basements. However, similar data analysis can have a more direct impact. Consider another activity that generates data: buying a roto-fryer online for a relative and sending it as a holiday gift. People mining your clickstream data may infer that the item is for your own use and offer you similar products.

[Image: roto-fryer]

What if health insurance carriers decided to analyze clickstream data to evaluate risk levels based on your online behavior? For instance, would it be reasonable to assume that people frequenting sites to purchase roto-fryers or downloading recipes from Paula Deen may have less healthy habits than those who frequent webmd.com or sites about exercise and healthy eating? What if insurance prices were influenced by this online browsing or shopping? Is that fair or reasonable? Should you only browse healthy food sites?
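To see how easily such inferences could be mechanized, here is a deliberately naive and purely hypothetical sketch; the site lists are invented, and no insurer is claimed to do exactly this:

```python
# Purely hypothetical: the site lists below are invented for illustration.
RISKY_SITES   = {"fried-food-recipes.example", "deep-fryer-deals.example"}
HEALTHY_SITES = {"webmd.com", "exercise-plans.example"}

def naive_risk_score(visited_domains):
    """+1 for each visit to a 'risky' site, -1 for each 'healthy' one.
    Crude by design: a one-off gift purchase scores the same as a habit."""
    score = 0
    for domain in visited_domains:
        if domain in RISKY_SITES:
            score += 1
        elif domain in HEALTHY_SITES:
            score -= 1
    return score
```

The flaw is visible immediately: one gift-shopping session for a roto-fryer raises the score exactly as much as a genuine habit would.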

Whatever you think of the PRISM case, it does raise some provocative questions when you begin to think about the implications of privacy and Big Data in regard to economic problems and markets that are predicated on shielding information from other participants.

Consider the health insurance industry again. Typically, insurance carriers estimate the amount of risk an individual represents. They can analyze a huge amount of information about someone to ascertain their risk level — Do they smoke? Are they overweight? What is their family history? — and, based on the perceived level of risk, the individual is offered a price for health insurance. For group insurance plans, carriers benchmark the overall perceived level of risk for the group and then offer a group rate. Although a carrier can gauge the aggregate amount of risk for a group, it will not know for sure which specific individual is most likely to contract an illness, and individual risk profiles are not disclosed.

Imagine what would happen if an individual’s profile were known in this case. If the insurer could determine who had the highest likelihood of contracting certain illnesses or diseases, it might choose to insure only certain people and not cover those at high risk. This individual data suddenly becomes a very valuable dataset, allowing insurers to distinguish, in a very specific way, those with a very high likelihood of contracting an expensive illness from those at low risk, and to adjust pricing accordingly. In other words, without a certain level of privacy (and ethics) and protection of sensitive data, the health insurance market would drastically change or break down.
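A toy calculation shows why. Assume four hypothetical group members with known expected annual costs (all numbers invented) and a simple 10% loading. With only aggregate risk known, everyone pays the pooled rate; once individual risk is visible, the insurer can decline the expensive member and quote the rest far less:

```python
def pooled_premium(expected_costs, load=1.1):
    """With only aggregate risk known, everyone pays the same loaded average."""
    return load * sum(expected_costs) / len(expected_costs)

def cherry_picked_premium(expected_costs, cutoff, load=1.1):
    """With individual risk visible, the insurer declines anyone above the cutoff."""
    insured = [c for c in expected_costs if c <= cutoff]
    return load * sum(insured) / len(insured)

costs = [1000, 1200, 1500, 9000]            # invented expected costs; one high-risk member
print(pooled_premium(costs))                # ≈ 3,492.50: everyone subsidizes the high-risk member
print(cherry_picked_premium(costs, 2000))   # ≈ 1,356.67: the high-risk member is simply not covered
```

The pooled market works precisely because the 9,000 is hidden inside the average; reveal it, and the pool unravels.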

Big Data has also been used to infer human behavior at a granular level. Based on the geospatial data in your phone, researchers are able to determine how active you are, how much exercise you get, how often you leave your home, and the social circles in which you travel. An interesting example of this is the Data for Development (D4D) Challenge, which analyzed aggregated cell phone data in Ivory Coast to improve standards of living. If this information were shared in certain contexts, it could increase the likelihood that predictions are made about specific people contracting certain medical conditions. The World Economic Forum refers to this as “Reality Mining” and has a great (and short) report on this topic, which discusses a “new deal” around privacy and data ownership.

These are just a few examples that have sparked my interest in understanding the importance of privacy in the era of Big Data. I am in favor of data sharing because it is what enables Data Scientists to run experiments on data that derive new value and insights. I also believe there are tremendous opportunities to curate and analyze Big Data for the public good. However, we also need to be thoughtful and ethical about how data is shared in order to protect identifying information, and we need to be mindful of the disruptive effect it may have on various markets.

The post Big Data vs. Big Dollars appeared first on InFocus Blog | Dell EMC Services.
