Powering New Insights with a High Performing Data Lake

To discover new insights with analytics, organizations look for correlations across different, combined data sets. To make that possible, they need to give multiple workgroups and stakeholders concurrent access to the data.

Most organizations use a data lake to store their data in its raw, native format. However, building data lakes and managing data storage can be challenging. The areas I most often see organizations struggle with across all types of environments running Hadoop are:

Set Up and Configuration

  • Hadoop services failing due to lack of proper configuration
  • Maintenance of multiple Hadoop environments is challenging and requires more resources

Security and Compliance

  • Lack of consistent, strong security controls for Hadoop and the data lake
  • Inability to integrate the data lake with LDAP or Active Directory

Storage and Compute

  • Low cluster utilization efficiency with varied workloads
  • Difficulty in scaling when the increase in data volumes is faster than anticipated
  • Server sprawl and challenges migrating multi-petabyte namespaces from direct attached storage (DAS) to network attached storage (NAS)
  • Lack of Hadoop tiered storage, so hot and cold data sit together and cause performance issues

Multi-tenancy

  • Difficulty getting the Hadoop cluster to support the differing requirements of Hadoop Distributed File System (HDFS) workflows
  • Challenges in moving data between environments (e.g., dev to prod and prod to dev) for data scientists to use production data in a secure environment

Hadoop on Isilon

Dell EMC’s Isilon Scale Out Network Attached Storage (NAS) makes building data lakes much easier. It offers many features that help organizations reduce maintenance and storage costs by keeping all of their data, including structured, semi-structured and unstructured data, in one place and one file system.

Organizations can then extend the data lake to the cloud and to enterprise edge locations to consolidate and manage data more effectively, easily gain cloud-scale storage capacity, reduce overall storage costs, increase data protection and security, and simplify management of unstructured data.

Data Engineering Makes the Magic Happen

Hadoop is a consumer of Isilon, the data lake where all the data resides. To fully enable the capabilities of Isilon using Hadoop, and integrate the clusters securely and consistently, you need knowledgeable data engineers to set up and configure the environment. To illustrate the point, let’s look back at the common challenge areas and how you can mitigate them with proper data engineering and Hadoop on Isilon.

For more information on multi-tenancy, refer to this white paper.

Implementation Process

To reap the benefits of Hadoop on Isilon, data engineers need to take care of a few things before implementation: secure and protect your critical enterprise data, and simplify your storage capacity and performance management.

From there, the process of installing a Hadoop distribution and integrating it with an Isilon cluster varies by distribution, requirements, objectives, network topology, security policies, and many other factors. There is also a specific process to follow, as illustrated in the diagram below. For example, a supported distribution of Hadoop is installed and configured with Isilon first; only then is Hadoop configured for authentication, and both Hadoop and Isilon are authenticated with Kerberos.
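The ordering in the paragraph above can be written down as a small dependency check. This is only a sketch: the step names and the `check_order` helper are hypothetical labels made up for illustration, not Isilon or Hadoop commands, and the prerequisites encode only the order stated above (install and configure with Isilon first, authentication and Kerberos afterwards).

```python
# Hypothetical step names paraphrasing the paragraph above.
# These are illustrative labels, not real Hadoop or Isilon commands.
PREREQUISITES = {
    "install_hadoop_distribution": [],
    "configure_hadoop_with_isilon": ["install_hadoop_distribution"],
    "configure_hadoop_authentication": ["configure_hadoop_with_isilon"],
    "kerberize_hadoop_and_isilon": ["configure_hadoop_with_isilon"],
}


def check_order(plan):
    """Return a list of ordering problems found in plan (empty if none)."""
    done = set()
    problems = []
    for step in plan:
        # Every prerequisite must already have appeared earlier in the plan.
        for dep in PREREQUISITES.get(step, []):
            if dep not in done:
                problems.append(f"{step} scheduled before {dep}")
        done.add(step)
    return problems


if __name__ == "__main__":
    plan = [
        "install_hadoop_distribution",
        "configure_hadoop_with_isilon",
        "configure_hadoop_authentication",
        "kerberize_hadoop_and_isilon",
    ]
    print(check_order(plan))
```

A real rollout has many more steps and depends on the distribution and environment, so a check like this is useful mainly as a reminder that security configuration comes after the basic integration is working.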

For more information on setting up and managing Hadoop on Isilon, refer to this white paper.

Engaging a Trusted Partner

The good news is that you and your teams don’t need to be experts in data engineering or navigate the implementation and configuration process on your own. Dell EMC Consulting Services can help you optimize your data lake and storage and maximize your investment, whether you’re just getting started with Hadoop on Isilon or have an existing environment that isn’t performing optimally. Our services are delivered by a global team of deeply skilled data engineers and include implementations, migrations, third-party software integrations, ETL offloads, health checks and Hadoop performance optimizations, as outlined in the graphic below.

Hadoop on Isilon, supported by data engineering services, offers a compelling business proposition for organizations looking to better manage their data to drive new insights and support advanced analytics techniques, such as artificial intelligence. If you are interested in learning more about our Hadoop on Isilon services or other Big Data and Analytics Consulting services, please contact your Dell EMC representative.

About the Author: Sudesh Sapra

Sudesh leads Dell Technologies Consulting’s Big Data Solution Engineering practice in North America, bringing 25 years of experience in enterprise architecture and strategic technology consulting. As customers transition to next-generation data processing architectures, he partners with them to successfully adopt those changes and drive business value forward. While his core skills and passion center on data architecture, Sudesh has a breadth of experience across data platforms, business intelligence, enterprise architecture, software development, and application architecture. He has also worked across multiple industries, including financial, healthcare, retail and pharmaceutical.