Datastage Jobs Best Practices for Tuning

Datastage Jobs Best Practices and Performance Tuning

This Blog give you a complete details, how we can improve the performance of datastage Parallel jobs. Best practices we have to follow, while creating the datastage jobs.


This Blog will help you on following topics.
1.   Performance Tuning Guidelines
1.1       General Job Design
1.2       Transformer Stage
1.3       Data grouping Stages
1.4       ODBC Stages


Refer This link as well : Parallel Job Performance Tuning Tips1



1.0 Performance Tuning Guidelines

1.1 General Job Design
Ø  Jobs need to be developed using the modular development approach. Large jobs can be broken down in to smaller modules, which help in improving the performance.

Ø  In scenarios where same data (huge number of records) is to be shared among more than one jobs in the same project, use dataset stage approach instead of re-reading the same data again.

Ø  Eliminate unused columns

Ø  Eliminate unused references

Ø  If the input file has huge number of records and the business logic allows splitting up of the data, then run the job in parallel to have a significant improvement in the performance
                           
1.2 Transformer stage

Ø  Use parallel transformer stage instead of filter/switch stages ( filter/switch stages will take more resources for execution. For egs: in the case of filter stage the were clause will get executed during run time, thus creating the requirement for more resources, there by decaying the job performance)



Figure: Example of using a Transformer stage instead of using a filter stage. The filter condition is given in the constraint section of the transformer stage properties.


Ø  Use BuildOp stage only when the required logic cannot be implemented using the parallel transformer stage.

Ø  Avoid calling routines in derivations in the transformer stage. Implement the logic in derivation. This will avoid the over head of procedure call

Ø  Implement the logic using stage variables and call these stage variables in the derivations. During processing the execution starts with stage variables then constraints and then to individual columns. If ever there is a prerequisite formulae which can be used by both constraints and also individual columns then we can define it in stage variables so that it can be processed once and can be used by multiple records. If ever we require the formulae to be modified for each and every row then it is advisable to place in code in record level than stage variable level



Figure: Example for using stage variables in and using it in the derivations.
                             
1.3 Data grouping stages
Ø  When dealing with stages like Aggregator, Filter etc, always try to use sorted data for better performance


Figure: Sorting the input data on the grouping keys in an aggregator stage

The example shown in the figure is the properties window for an aggregator stage that finds out the sum of a quantity column by grouping on the columns shown above. In such scenarios, we will do sorting of the input data on the same columns so that the records with same/similar values for these grouping columns will come together there by increasing the performance. Also note that if we are using more than one node, then the input dataset should be properly partitioned so that the similar records will be available in the same node.
                                                                                         
1.4 ODBC Stages
Ø  If possible sort the data in ODBC stage itself; this will reduce the over head of DS sorting the data. Don’t use the sort stage when we have ORDER BY clause in ODBC sql

Ø  Select only the required records or Remove the unwanted rows as early, so that the job need not deal with unnecessary records causing performance degrade

Ø  Using a constraint to filter a record is much slower as compared to having a SELECT….WHERE in ODBC stage.  User the power of database where ever possible and reduce the over head for DS.




Figure: Using the User-defined SQL option in ODBC stages to reduce the overhead of datastage by specifying the WHERE and ORDER BY clause in the SQL used to get data.

Ø  Avoid using “like “ operator in user
Ø  defined queries in ODBC stages. But one thing to be noted here is that , if our custom sql requires a must scenario like it is doing a filter on some string pattern, we will be forced to use the like pattern to get the requirement done.
Avoid using
Ø  Stored Proceedures until and unless the functionality cannot be implemented in Data Stage jobs.

No comments:

Post a Comment