Migration of Data Integration Process Flow
A common problem in maintaining data integration and transformation systems in corporate data warehouse environments is the need to migrate existing data transformation processes and workflows to new platforms as operational platforms, technical requirements, or data loads change. Such migrations often bring along a change of data transformation paradigm: a conversion from the Extract-Transform-Load (ETL) style to the Extract-Load-Transform (ELT) style. A methodology for such migrations is presented below, along with a description of the tools that implement it.
As the volumes and varieties of structured, semi-structured and unstructured data in an enterprise grow, traditional data integration and ETL (Extract-Transform-Load) tools are becoming the bottleneck of the data warehouse transformation chain. Landing all the data on an ETL server instead of sending it directly to the database adds an extra hop, and processing large amounts of data (joining, merging, performing lookups) anywhere other than in a database is wasteful: set-based processing is exactly what databases are good at.
An approach to eliminating this bottleneck is to load source data directly into the target database. A thin and scalable ELT (Extract-Load-Transform) architecture is achieved through direct data movement and 100% SQL pushdown. The benefit is especially pronounced in parallel data processing environments: replacing an ETL workflow with an ELT workflow that uses set-based parallel processing to perform inserts and updates directly within the database can reduce the end-to-end time of data transformation jobs by orders of magnitude.
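The contrast between the two styles can be sketched as follows. This is a minimal illustration with hypothetical table and column names, not an excerpt from any real tool: ETL pulls rows to the ETL server and transforms them in application code, while ELT emits a single set-based statement for the database to execute.

```python
# Hypothetical example: the same transformation expressed ETL-style
# (row-at-a-time on the ETL server) versus ELT-style (SQL pushdown).

def etl_style(source_rows):
    """ETL: extract rows to the ETL server, transform in application code.
    The result still has to be loaded into the target afterwards (extra hop)."""
    transformed = []
    for row in source_rows:
        transformed.append({
            "customer_id": row["id"],
            "revenue": row["qty"] * row["price"],  # transformation outside the DB
        })
    return transformed

def elt_style():
    """ELT: emit one set-based SQL statement; the database does all the work."""
    return (
        "INSERT INTO dw.fact_sales (customer_id, revenue) "
        "SELECT id, qty * price FROM staging.sales"
    )
```

On an MPP platform the single `INSERT ... SELECT` runs in parallel across all nodes, which is where the order-of-magnitude speedup comes from.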
By taking advantage of MPP (Massively Parallel Processing) data warehouse platforms to perform the necessary transformations directly within the database, the need for legacy ETL tools (Informatica, IBM DataStage, SAS Data Integration Studio) disappears. However, the transition from an ETL mindset to an ELT mindset is not straightforward and requires careful planning.
DataStage is one of the leading ETL products on the BI market. The tool allows integrating data across multiple systems and processing high volumes of data. DataStage has a user-friendly graphical front end for designing jobs that collect, transform, validate and load data from multiple sources, such as enterprise applications (Oracle, SAP, PeopleSoft) and mainframes, into data warehouse systems. The application is capable of integrating metadata across the data environment to maintain consistent analytic interpretations.
The main building blocks of a DataStage data processing job are stages, which perform the actual processing (datasets, transformers, lookups, sorts, aggregators), and links, which connect the stages and define the flow of data between them.
Although this article focuses on IBM DataStage, the same methodology is applicable in case of other ETL tools that expose their process flow metadata, e.g. Informatica, SAS Data Integration Studio, Microsoft SSIS etc.
The idea of migrating from one ETL tool to another is not new. Several applications exist that can convert the transformation process definitions of one such tool into those of another. Typically, however, these tools provide syntax-level conversion only, simply substituting the commands, terms and names used in the source tool with those used in the target tool. The process flow itself does not change at that level of conversion.
Migration from an SMP database platform (Oracle, DB2 etc.) to an MPP platform (Teradata, Greenplum etc.) also necessitates a migration from the ETL paradigm to the ELT paradigm, which in turn requires substantial changes in the process flow. Syntax-level conversion is therefore no longer sufficient: changing process semantics requires semantic-level conversion.
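To make the distinction concrete, syntax-level conversion amounts to little more than token substitution. The sketch below uses a hypothetical Oracle-to-Teradata substitution table (the function pairs are real SQL idioms, but the converter itself is illustrative); note that the statement's structure, and hence the process flow, is left completely untouched.

```python
# Illustration of syntax-level conversion: rename vendor-specific
# functions, change nothing about the process flow. The mapping table
# is a hypothetical, deliberately simplistic example.
import re

ORACLE_TO_TERADATA = {
    "NVL": "COALESCE",
    "SYSDATE": "CURRENT_TIMESTAMP",
    "SUBSTR": "SUBSTRING",  # argument-syntax differences are ignored here
}

def convert_syntax(sql: str) -> str:
    """Token-level substitution of function names, nothing more."""
    pattern = re.compile(r"\b(" + "|".join(ORACLE_TO_TERADATA) + r")\b")
    return pattern.sub(lambda m: ORACLE_TO_TERADATA[m.group(1)], sql)
```

For example, `convert_syntax("SELECT NVL(x, 0) FROM t")` yields `"SELECT COALESCE(x, 0) FROM t"`: the query still reads and writes the same datasets in the same order, which is precisely why this level of conversion cannot turn an ETL flow into an ELT flow.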
The process of semantic conversion (referred to here as 'transmutation') involves the following steps:
| Step | Tools |
| --- | --- |
| 1. Decomposition of the original (source) data transformations into execution units of the lowest granularity necessary and sufficient to represent the processing logic. The decomposed transformation jobs are stored as graphs with stages (datasets and transformers) as nodes and links (workflow) as edges. These graphs represent the ETL-style process flow. Column-level transformations and conditions, as well as custom SQL statements, are decomposed to the level of detail that enables the conversions and transcodings performed in the next steps. | DSX Scanner for scanning, parsing and decoding DSX files. DSX Collector for restructuring the decomposed metadata and storing it in the MMX Repository. |
| 2. Computation of more abstract graphs from the original ones. These graphs contain pure data mappings: the nodes correspond to datasets in the DataStage process flows, and the edges represent calculated data mappings. All data processing stages from the original process flows (transformers, lookups, sorts, aggregators) are reduced to data mapping elements or attributes and eliminated. These transmutated graphs represent the ELT-style process flow. | DI Mapper for conversion of the inherent ETL-style mappings into ELT-style mappings ('transmutation'). DI Router for rule-based calculation of the optimal composition and order of execution (topological sort) for a converted process. |
| 3. Construction of new process flows based on the data mappings. This is accomplished using the tools and methodologies provided by MMX Foundation: the calculated graphs of data mappings, Active Syntax Patterns, scenario- and template-based code construction, and the functionality of the XDTL language. Code construction is a rule-driven process that also involves transcoding of vendor-specific data types, functions and syntax elements from the source processes. The results are functionally equivalent to the original DataStage jobs. | DI Code Constructor for building code fragments (SQL, scripts) and arranging them into the preferred executable format (database functions, shell scripts, ELT packages). |
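The three steps above can be condensed into a toy sketch. The stage and dataset names are hypothetical, and the real DSX Scanner, DI Mapper, DI Router and DI Code Constructor operate on full DSX metadata rather than on this simplified structure; the sketch only shows the shape of each step: a decomposed ETL-style graph, its reduction to dataset-to-dataset mappings, a topological sort for execution order, and template-based code construction.

```python
# Toy model of transmutation, with hypothetical stage names.
from graphlib import TopologicalSorter

# Step 1 (decomposition): ETL-style graph -- stages as nodes, links as edges.
etl_graph = {
    "ds_orders":    {"type": "dataset",     "inputs": []},
    "ds_customers": {"type": "dataset",     "inputs": []},
    "lkp_cust":     {"type": "lookup",      "inputs": ["ds_orders", "ds_customers"]},
    "xfm_revenue":  {"type": "transformer", "inputs": ["lkp_cust"]},
    "ds_fact":      {"type": "dataset",     "inputs": ["xfm_revenue"]},
}

def transmute(graph):
    """Step 2 (mapping): reduce processing stages away, keeping only
    dataset-to-dataset mappings -- the ELT-style graph."""
    def dataset_sources(node):
        sources = set()
        for src in graph[node]["inputs"]:
            if graph[src]["type"] == "dataset":
                sources.add(src)
            else:
                sources |= dataset_sources(src)  # collapse intermediate stages
        return sources
    return {name: dataset_sources(name)
            for name, meta in graph.items() if meta["type"] == "dataset"}

def route(mappings):
    """Step 2 (routing): topological sort yields a valid execution order."""
    return list(TopologicalSorter(mappings).static_order())

def construct_sql(target, sources):
    """Step 3 (code construction): fill a template-based SQL fragment."""
    return f"INSERT INTO {target} SELECT ... FROM {', '.join(sorted(sources))}"
```

Running `transmute(etl_graph)` collapses the lookup and transformer stages, leaving `ds_fact` mapped directly from `ds_orders` and `ds_customers`; `route` then guarantees the sources are populated before the target, and `construct_sql` emits the set-based statement the database will execute.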
Thus, three levels of conversion are actually achieved at the same time: