Use of Stages in Datastage 8.5 or 8x Series

Use of Stages in Datastage 8.5 or 8x Series :


In this we will see the list of Stages and Activities available in DataStage


All stage can be found in the Palette. They are categorized based on the functionality they provide.
Click on the below categories to see the stages under each of them
 

General Stage

DataStage - General Tab in Palette
Annotation: are like comments for a DataStage jobs. They are used in jobs to provide information about the job

Container: As the name suggests are used to contain or hold a group of stage. In cases where the transformation logic involves a lot of stages, container can be used to hold all the transformation logic inside them. This will make the job look a little less complex for new developers.
Here is how the job will look when opened:
 
Here the transformation logic is inside the container stage. It will be opened when you double click on the container stage.
 

Note that it has “Input” and “Output”; this means that the Container stage can handle only a single input and single output.
 
To make the above container Shared, right click on the container and select the option “Convert to Shared”. By doing this, we are making the transformation logic inside the container available for use by any job in the project.



 
Description Annotation: is used to enter the description about the job. Here is an example
DataStage - General Tab in Palette
Link: is used as a connector between two stage. It shows the direction of flow of data in a DataStage job.

Database Stages

These stages are used to connect to various databases. Most of the stages that you see below allow native connectivity to the respective databases.

Database stages

In cases when native connectivity is not available we use the ODBC stage or Dynamic RDBMS stage
ODBC Stage: Extracts data from or loads data into databases that support the industry standard Open Database Connectivity API
Dynamic RDBMS Stage: is a database stage that performs read write operations on the following list of databases
Informix, UDB/DB2, Microsoft SQL/Server, Oracle and Sybase.
Note: We can also use ODBC connection string to establish a connection to a database.
Stored Procedure: stage is dedicated to run stored procedure as the other database stages do not allow this.
Stages used most frequently will be explained in the next lessons.

Development/Debug Stages

Development Debug Stages
As the name suggests, these stages are used for testing and debugging.
Column Generator:
Is used to generate columns
Head: is used to get the n records starting from the first record. Additionally by specifying other options we can also get the n’th record starting from the 1st record.
Peek: is an output stage used in debugging. In instances where we have transformed the data and would like to check if the transformation is working properly before loading data to the target, we use a peek stage. The records that are sent to Peek stage can be either viewed in the job log in Director or can be written to a file.
Row Generator: as the name suggests is used to generate rows of data
Sample: Will take a random subset of input data that it receives and distributes it among output links
Tail: is used to get the n records starting from the last record.
Write Range Map: is used to create a Range Map that can be used while performing range partition.

File Stages

File Stages
These are used to read and write data to various file systems
 
Complex Flat File:This is used to read a file or write to a file, but cannot be used to do both at the same time. It can have multiple output links and a single reject link.
It can be used to read data from one or more complex flat files like MVS data sets with QSAM and VSAM files. We can also read data from files that contain multiple record types.
Data Set stage is used to create Data Sets. These are operating system files and are referred by a control file which has the extension .ds. Data Sets are only readable from within DataStage and are commonly used to improve performance. It allows only a single input or a single output and unlike sequential files they can be executed in parallel
External Source stage allows us to interact with or call external programs. This stage is mostly used to interface with source that are not currently supported by DataStage. It can have a single output link and single reject link
External Target stage like external source stage is used to write data to source that are currently not supported by DataStage. It can have a single input link and a reject link.
File Set stage is used to create files and save them with the extension .fs. It is useful because most flat files have a 2GB limit and using flat file stage we can store data that is more than the limit. Unlike the sequential file stage, this stage runs only in parallel mode. The difference between Data Set and File Set is that, File Set stores the formatting information that describe the format of the files to be read or written.
Lookup File Set stage is used to create file sets that are going to used with Lookup Stage. The extension for files created by this stage is also .fs. Using a Lookup File Set to perform a lookup will result in a significant increase in performance when compared to other stages. The stage can have a single input link and an output link that must always be a reference link
Example:

Sequential File is used read and write data to a single or multiple flat files. It is one the most commonly used stages in DataStage. It can have a single input link or a single output link and a reject link. This stage executes in parallel mode when extracting data from multiple files but will run sequentially when extracting from a single file.

Processing Stages

Processing Stages
These are the stages that perform all the transformations in DataStage. Let us briefly see what each one of the does.
Please note the most frequently used stages will be explained in detail with examples in the next lessons

Aggregator:
stage as the name suggests is used to perform aggregations and grouping functions. Here are some functions that can be performed using Aggregator stage – Sum, Average, Max value, Min value, Group by and more.
Change Capture stage is used to identify the differences between two streams of data. One link is assigned as the before link and the other link is after link. Based on the comparison done using the defined key fields, the stage generates the following codes 0,1,2,3,4
0 -> (duplicate) denotes the record is a copy
1 -> (new record) denotes its a new record and has to be insert
2 -> (delete) denotes the record can be deleted delete
3 -> (edit/modify) denotes the record exists but the value field in after link is different from the one in before link.
Change Apply stage reads a record from the change data set and from the before data set, compares their key column values, and acts accordingly (Insert, Delete, Edit, Copy).
Compare and Difference stages as the names suggest are used to compare the difference between two records based on key values
Copy stage can create multiple output links for a single input link.
Ex: If our input data has only one record
Name Age Sex
ABC 20 F

We can use this stage to create any number of duplicates for this record
Decode and Encode stages are used to encode the data based on a key value and they decode it based on the same value.
Expand stage is used to extract zipped content/files.
Filter stage is used to segregate records to different links based on filer conditions
FTP stage is used to FTP files.
Funnel stage is used to collect records from multiple input links and funnel them to a single output link.
Join, Lookup and Merge are very useful stages and are used to perform all kinds of joins on data. They differ in memory utilization and number of output & reject links that they allow.
will be explained in detail with examples
Modify stage is used to modify the meta data of the columns.
Pivot stage is used to convert rows to columns and columns to rows
Remove Duplicates stage as the name suggests is used to remove duplicate records from the input data.
Slowly Changing Dimension stage is used to implement SCD type 1 & 2 in datastage jobs
Sort stage is used to sort input data and send it to the output. It can also remove duplicates.
Switch stage takes a single input and assigns each input row to an output link based on the value of a selector field. The Switch stage is similar to Switch statement in C or SQL.
Transformer stage is probably the most used stage in DataStage. It has built-in transformation functions and additional ones can also be created by user (Routines)

No comments:

Post a Comment