Partitioning Considerations for Best Performance

This post explains how to improve the performance of DataStage parallel jobs by choosing appropriate partitioning methods.

See these related posts as well:
1. Datastage Partitioning Methods and Use
2. Datastage Jobs Performance Improvement Tips1
3. Datastage Performance Tuning Tips



1.0 Partitioning considerations:

•  Choose a partitioning method that keeps the number of rows per partition close to equal. An even distribution balances the processing workload across partitions and thereby improves the overall run time. Any stage that processes groups of related records must be partitioned using a keyed partitioning technique (for example the Aggregator, Remove Duplicates, Change Capture, Change Apply, Join and Merge stages, as well as Transformers that process groups of related records).

•  Minimize repartitioning, because it degrades performance; repartition only when the partition distribution is highly skewed. Repartitioning adds network transport overhead and can also disturb an existing even distribution of data among partitions.

•  Specify Hash partitioning for stages that need to process groups of related records. Partitioning keys should include only those key columns that are necessary for proper grouping. If the grouping is on a single integer key column, use Modulus partitioning on that key column. If the data is highly skewed and the key column values and their distribution will not change significantly over time, use the Range partitioning technique.

•  Use Round Robin partitioning to distribute data evenly across all partitions when grouping is not needed. It is strongly recommended when the input data is in sequential mode or is heavily skewed. Same partitioning requires minimal resources and can be used to optimize a job by eliminating repartitioning of data that is already partitioned.

•  When the input data set is sorted in parallel, use the Sort Merge collector, which produces a single sorted stream of rows. When the input data set is sorted in parallel and also Range partitioned, the Ordered collector is the preferred collection method.

•  For a Round Robin partitioned input data set, use the Round Robin collector to reconstruct rows in input order, as long as the data set has not been repartitioned or reduced.

•  Minimize the use of sorts in a job.
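
The collector advice above can be pictured with a small Python sketch. This is not DataStage code: the function names (`sort_merge_collect`, `ordered_collect`, `round_robin_collect`) and the sample partitions are hypothetical, chosen only to mimic the order in which each collector reads rows from its input partitions.

```python
import heapq
from itertools import zip_longest

def sort_merge_collect(partitions, key=None):
    """Merge partitions that are each already sorted into one sorted stream."""
    return list(heapq.merge(*partitions, key=key))

def ordered_collect(partitions):
    """Read all of partition 0, then all of partition 1, and so on."""
    return [row for part in partitions for row in part]

def round_robin_collect(partitions):
    """Take one row from each partition in turn, skipping exhausted ones."""
    missing = object()  # sentinel for partitions that ran out of rows
    out = []
    for group in zip_longest(*partitions, fillvalue=missing):
        out.extend(row for row in group if row is not missing)
    return out

# Illustrative data (assumed, not from a real job).
rows = list(range(10))
rr_parts = [rows[i::3] for i in range(3)]            # round robin partitioned
sorted_parts = [[1, 4, 9], [2, 3, 8], [5, 6, 7]]     # sorted within each partition
range_parts = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]      # sorted and range partitioned

print(round_robin_collect(rr_parts))     # rows come back in input order
print(sort_merge_collect(sorted_parts))  # single sorted stream
print(ordered_collect(range_parts))      # already sorted, no merge needed
print(ordered_collect(sorted_parts))     # not sorted: partitions overlap
```

The last two calls suggest why the Ordered collector only suits range partitioned input: it simply concatenates partitions, so the result is sorted only when each partition holds a distinct, ascending key range. Sort Merge has to compare rows across partitions, which costs more but works for any partitioning.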



Figure: Partitioning tab in the DataStage stage properties
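
To make the trade-off between keyed and keyless partitioning more concrete, here is a minimal Python sketch of Round Robin, Modulus and Hash partitioning on a deliberately skewed key. It is only an illustration: the helper names, the `cust_id` and `amount` columns, and the use of `zlib.crc32` as the hash function are all assumptions, and DataStage's own hashing is not reproduced here.

```python
import zlib

def round_robin(rows, n):
    """Keyless: deal rows to partitions in turn for near-equal row counts."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def modulus(rows, n, key):
    """Keyed on one integer column: partition number = key value mod n."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row[key] % n].append(row)
    return parts

def hash_partition(rows, n, keys):
    """Keyed on one or more columns; crc32 is a stand-in hash (assumption)."""
    parts = [[] for _ in range(n)]
    for row in rows:
        digest = zlib.crc32(repr(tuple(row[k] for k in keys)).encode())
        parts[digest % n].append(row)
    return parts

def sizes(parts):
    """Row count per partition, used here to judge skew."""
    return [len(p) for p in parts]

# Skewed sample: 90 rows share cust_id 7, ten other ids appear once each.
rows = [{"cust_id": 7, "amount": i} for i in range(90)]
rows += [{"cust_id": k, "amount": 0} for k in range(10)]

print(sizes(round_robin(rows, 4)))               # [25, 25, 25, 25]
print(sizes(modulus(rows, 4, "cust_id")))        # [3, 3, 2, 92]
print(sizes(hash_partition(rows, 4, ["cust_id"])))
```

Round Robin gives perfectly even partitions but scatters the rows for `cust_id` 7 across all of them, so it cannot feed a stage that groups on that key. Modulus and Hash keep every key value in a single partition, at the price of inheriting the skew in the data.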
