AI/IoT/Analytics

Don’t Get Rid of Datastage Quite Yet…My Take on ETL vs. Hadoop

By Bill Schmarzo, September 24, 2013

I got the following question from one of my blog postings, and since I’ve gotten similar questions in the past, I thought it might be useful to write a short blog post discussing it. The answer to this question was not simple, and I had to bring in a couple of our data scientists (Dr. Pedro DeSouza and Dr. Wei Lin) to help me construct the most appropriate answers. And as seems to be typical in most of our big data discussions, the answer really is “it depends.”


Question:

I am looking at Hadoop in comparison to traditional ETL (Datastage) for managing mostly structured data. The intent is to feed both data warehouse environments as well as batch integration between systems.

Hadoop doesn’t seem to have some of the more advanced mapping tools that Datastage has so we require more low-level coding to parse the incoming files.

I am trying to make Hadoop do the wrong things – should we choose Datastage instead for the structured data workloads?

Answer:

I don’t think Hadoop (with MapReduce) will replace your traditional ETL (extract, transform, load) tools like Datastage any time soon. Hadoop is a powerful data management framework with scale-out capabilities, and MapReduce provides parsing and aggregating capabilities across massive structured and unstructured data sets. It can perform ETL via custom coding, but it does not easily replace your traditional ETL tool. In fact, I think you’re better off thinking of Hadoop/MapReduce as a complement to your existing ETL tools. It gives you a new tool in your kitbag!
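To make the “ETL via custom coding” point concrete, here is a minimal sketch of the kind of hand-coded parsing and aggregation a MapReduce job requires, written in the Hadoop Streaming style but simulating the shuffle locally so it runs on its own. The pipe-delimited record layout and sample data are hypothetical, purely for illustration.

```python
# Sketch of hand-coded "ETL" in the MapReduce style.
# In a real Hadoop Streaming job, map_record() would read lines from stdin
# and reduce_records() would receive key-sorted pairs; here we simulate
# the shuffle/sort phase in-process. Field positions are hypothetical.
from itertools import groupby
from operator import itemgetter

def map_record(line):
    """Map step: extract (customer_id, amount) from a pipe-delimited record."""
    fields = line.strip().split("|")
    return fields[0], float(fields[2])

def reduce_records(pairs):
    """Reduce step: sum amounts per customer (the transform/aggregate work)."""
    pairs = sorted(pairs, key=itemgetter(0))  # stands in for Hadoop's shuffle/sort
    return {key: sum(amount for _, amount in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

raw = ["C001|2013-09-01|19.50", "C002|2013-09-02|5.00", "C001|2013-09-03|4.50"]
totals = reduce_records(map_record(line) for line in raw)
print(totals)  # {'C001': 24.0, 'C002': 5.0}
```

Note how even this trivial transformation needs hand-written parsing, type conversion, and aggregation logic — exactly the plumbing a tool like Datastage gives you out of the box.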

The ETL tools have lots of built-in functionality for data cleansing, alignment, modeling, and transformation that today would need to be hand-coded in Hadoop.  Better to buy that functionality.

However, a conventional ETL tool would struggle to process even a few hundred GB in a single file. Today’s standard ETL tools are best suited to files up to roughly 50GB. For files larger than that, your best strategy is to hand-code using Hadoop/MapReduce.

File Management

For small files you are better off using conventional ETL than MapReduce, so you can take advantage of the mature functionality already built into ETL tools. Leave the hardcore Hadoop/MapReduce jobs for large files, unstructured data, or complex processing.

Also, there may be select data management and transformation processes that could benefit from the newer ELT (extract, load, transform) development paradigm that leverages the parallel processing and more procedural capabilities of Hadoop/MapReduce.  For example, you could use the ELT process on Hadoop with MapReduce to create advanced data metrics such as frequency, recency, and sequencing.
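To illustrate the kind of advanced metrics mentioned above, here is a hedged sketch of how an ELT pass might derive frequency, recency, and sequencing per customer. Plain Python stands in here for what would be a Hive query or MapReduce job on the cluster; the event data, field names, and “as of” date are all hypothetical.

```python
# Sketch of ELT-derived metrics per customer: frequency (event count),
# recency (days since last event), and sequencing (ordered actions).
# Sample data and field names are hypothetical.
from datetime import date

events = [  # (customer_id, event_date, action)
    ("C001", date(2013, 9, 1), "browse"),
    ("C001", date(2013, 9, 3), "purchase"),
    ("C002", date(2013, 8, 20), "browse"),
]
as_of = date(2013, 9, 24)  # reference date for recency

metrics = {}
for cust, when, action in sorted(events, key=lambda e: (e[0], e[1])):
    m = metrics.setdefault(cust, {"frequency": 0, "recency_days": None, "sequence": []})
    m["frequency"] += 1
    m["recency_days"] = (as_of - when).days  # last (most recent) event wins
    m["sequence"].append(action)

print(metrics["C001"])
# {'frequency': 2, 'recency_days': 21, 'sequence': ['browse', 'purchase']}
```

The same shape of computation parallelizes naturally in Hadoop, since each customer’s metrics can be reduced independently — which is exactly why this kind of ELT workload is a good fit for the cluster.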

One last point: many of the traditional ETL vendors are porting their tools to run on Hadoop in order to take advantage of the processing and cost benefits of the natively parallel Hadoop environment. For example, Talend, Pentaho, and Informatica offer “ETL-like” functionality based on MapReduce jobs. You can use them for large files that are processed on the Hadoop cluster.

All of this seems to indicate that most advanced data management shops are going to need both traditional ETL tools and Hadoop/MapReduce — the right tool for the right job.



8 thoughts on “Don’t Get Rid of Datastage Quite Yet…My Take on ETL vs. Hadoop”

  1. Great article to read! I’d like to read more like it. I also agree with the answer to the question — it really “depends,” so use the right tool for the right job.

  2. Thanks for the article. This is the very question I have been trying to reason through based on recent corporate direction. I would be interested in more discussion on the topic.

  3. I think the expensive and heavyweight ETL tools like DataStage will be ditched in the near future. Less expensive options (like Pentaho, Talend, etc.) will more likely be adopted in their place. The data processing load will be distributed across the ETL tool and Hadoop in such a way that small to medium data volumes are processed in the ETL tool and the huge volumes are handled in Hadoop (as Bill mentioned). BTW, keep an eye on RedPoint — it has strong potential to become a leader in this space.

  4. Varaprasad, thanks for the information. There is a lot happening in this space over the next 18 months, and I will definitely check out RedPoint! Thanks again!

  5. I have a question: when should we use traditional ETL tools, and when should we use Hadoop for the same work — I mean, which one for extract, transform, and load?

  6. Ahmed, the decision definitely needs to be evaluated on a case-by-case basis, but more and more we’re seeing Hadoop as the basis for the ETL/ELT work because it is faster and cheaper than traditional data warehouse architectures. And the added benefit is that many of the industry’s leading ETL vendors now provide their products to run directly on Hadoop.