DataStage Performance Tuning Tips: Stage-wise and Overall Design
Parallelism:
Parallelism in DataStage jobs should be optimized rather than maximized. The degree of parallelism of a DataStage job is determined by the number of nodes defined in the configuration file, for example four-node or eight-node. A configuration file with more nodes generates more processes and therefore adds more processing overhead than a configuration file with fewer nodes. While choosing the configuration file, weigh the benefits of increased parallelism against the losses in processing efficiency (increased overhead and slower start-up time). Ideally, use a configuration file with fewer nodes when the amount of data to be processed is small, and one with more nodes when the data volume is large.
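As an illustration only, a minimal two-node configuration file might look like the following; the host name and resource paths are placeholders and must match your own environment:

    {
        node "node1"
        {
            fastname "etl_host"
            pools ""
            resource disk "/ibm/ds/data/node1" {pools ""}
            resource scratchdisk "/ibm/ds/scratch/node1" {pools ""}
        }
        node "node2"
        {
            fastname "etl_host"
            pools ""
            resource disk "/ibm/ds/data/node2" {pools ""}
            resource scratchdisk "/ibm/ds/scratch/node2" {pools ""}
        }
    }

Pointing $APT_CONFIG_FILE at a two-node, four-node or eight-node file of this shape is how the degree of parallelism is switched per project or per job.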
Partitioning:
Proper partitioning of data is another aspect of DataStage job design that significantly improves overall job performance. Partitioning should be set so that the data flow is balanced, i.e. data is divided nearly equally across partitions and data skew is minimized.
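A quick way to verify that partitioning is balanced is to turn on per-partition record counts and the job score in the job log. The variables below are standard parallel-engine settings; where you set them (project defaults, job parameters or dsenv) depends on your installation:

    export APT_RECORD_COUNTS=True   # log the number of records handled by each operator, per partition
    export APT_DUMP_SCORE=True      # log the job score showing operators, nodes and partitioning used

If one partition consistently carries far more rows than the others, revisit the partitioning keys.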
Memory:
In DataStage jobs that process a high volume of data, the virtual memory settings for the job should be optimized. Jobs often abort when a single lookup has multiple reference links; this happens due to low temporary memory space. In such jobs, $APT_BUFFER_MAXIMUM_MEMORY, $APT_MONITOR_SIZE and $APT_MONITOR_TIME should be set to sufficiently large values.
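For example (the values here are illustrative only and should be sized to the machine), these variables can be exported in dsenv or added as project/job environment variables:

    export APT_BUFFER_MAXIMUM_MEMORY=41943040   # bytes of in-memory buffer per buffer operator (default is 3 MB)
    export APT_MONITOR_SIZE=100000              # rows between job monitor updates
    export APT_MONITOR_TIME=30                  # seconds between job monitor updates (takes precedence over APT_MONITOR_SIZE)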
Performance Tuning in Overall Job Design:
1. While designing DataStage jobs, take care that a single job is not overloaded with Stages.
2. Each extra Stage in a job leaves fewer resources available for every other Stage, which directly affects the job's performance.
3. If possible, big jobs with a large number of Stages should be logically split into smaller units.
4. If a particular Stage has been identified as taking a lot of time in a job, such as a Transformer Stage with complex functionality and a lot of Stage variables and transformations, the job design could place that Stage in a separate job altogether (more resources for the Transformer Stage).
5. While designing jobs, take care that unnecessary column propagation is not done. Columns that are not needed in the job flow should not be propagated from one Stage to another or from one job to the next.
6. As far as possible, RCP (Runtime Column Propagation) should be disabled in jobs. Sorting also needs care: try to minimize the number of sorts in a job.
7. Design a job so that operations are combined around the same sort keys and, if possible, the same hash keys are maintained.
8. A most often neglected option is "don't sort if previously sorted" in the Sort Stage; set this option to "true". This improves Sort Stage performance a great deal.
9. In the Transformer Stage, "Preserve Sort Order" can be used to maintain the sort order of the data and reduce sorting in the job. Use a minimum of Stage variables in a Transformer; the more Stage variables, the lower the performance. An overloaded Transformer can choke the data flow and lead to bad performance or even job failure at some point. To minimize the load on a Transformer, avoid unnecessary function calls. For example, a varchar field holding a date value can be converted to a Date column by simply formatting the input value and relying on implicit data type conversion; the StringToDate function, which converts a string to a Date type, is not needed.
10. Reduce the number of Stage variables used. In a previous project, removing 5 Stage variables and 6 function calls reduced the runtime of a job with 110 million input records from 3 hours to approximately 1 hour 15 minutes.
11. Try to balance the load on Transformers by sharing the transformations across existing Transformers; this ensures a smooth flow of data. If you require type casting, renaming of columns or addition of new columns, use Copy or Modify Stages instead. Whenever you have to use lookups on large tables, look at options such as unloading the lookup tables to datasets, or using user-defined join SQL to reduce the lookup volume with the help of temp tables.
12. The Copy Stage should be used instead of a Transformer for simple operations, including: acting as a job-design placeholder between Stages, renaming columns, dropping columns, and implicit (default) type conversions.
13. "Upsert" works well if the data is sorted on the primary key column of the table being loaded. Alternatively, determine whether the record already exists and handle "Insert" and "Update" separately.
14. It is sometimes possible to rearrange the order of business logic within a job flow to leverage the same sort order, partitioning and groupings.
15. Don't read from a Sequential File using SAME partitioning. Unless more than one source file is specified, this will read the entire file into a single partition, making the entire downstream flow run sequentially (unless it is repartitioned).
16. Use sorted data for the Aggregator Stage.
17. Handle nulls properly, using the Modify Stage.
18. Minimize the warnings in jobs; don't suppress the warnings.
19. Reduce the number of lookups in a job design.
20. Try not to use more than 20 Stages in a job if the expected data volume is very high.
21. Don't use more than 7 lookups in the same Transformer; introduce new Transformers if it exceeds 7 lookups.
22. It is also advisable to reduce the number of Transformers in a job by combining the logic into a single Transformer rather than having multiple Transformers.
23. The Funnel Stage should be run in "continuous" mode, without hindrance.
24. Parameters tuning in the uvconfig file (a combined tuning sketch follows the sub-items below):
a) T30FILE
This parameter determines the maximum number of dynamic hash files that can be opened system-wide on the DataStage system. If the value is too low, expect an error message similar to 'T30FILE table full'. The following command shows the number of dynamic files in use:
echo "`bin/smat -d|wc -l` - 3"|bc
b) GLTABSZ
This parameter defines the size of a row in the group lock table.
The value suggested by IBM is 200.
c) RLTABSZ
This parameter defines the size of a row in the record lock table.
The value suggested by IBM is 200.
d) MFILES
This parameter defines the size of the server engine (DSEngine) rotating file pool. The server engine will logically open and close files at the DataStage application level and physically close them at the OS level when the need arises.
Increase this value if DataStage jobs use a lot of files. Generally, a value of around 250 is suitable. If the value is set too low, then performance issues may occur, as the server engine will make more calls to open and close at the physical OS level in order to map the logical pool to the physical pool.
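Putting these together, a typical tuning pass is sketched below; the values are illustrative only and should be sized with IBM guidance for your system. It edits uvconfig under the DSEngine directory, regenerates the configuration and restarts the engine:

    cd $DSHOME              # DSEngine installation directory
    vi uvconfig             # set, for example: T30FILE 512, GLTABSZ 200, RLTABSZ 200, MFILES 250
    bin/uv -admin -stop     # stop the server engine (make sure no jobs are running)
    bin/uvregen             # regenerate the engine configuration from uvconfig
    bin/uv -admin -start    # restart the server engine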
Performance Analysis of Various Stages in DataStage
1. Sequential File Stage -
The sequential file Stage is a file Stage.
It is the most common I/O Stage used in a DataStage job.
It is used to read data from or write data to one or more flat files.
It can have only one input link or one output link.
It can also have one reject link.
While handling huge volumes of data, this Stage can itself become one of the major bottlenecks, as reading from and writing to this Stage is slow.
Sequential files should be used in the following conditions: when reading a flat file (fixed-width or delimited) in the UNIX environment that has been FTPed from some external system, or when some UNIX operation has to be performed on the file.
Don't use sequential files for intermediate storage between jobs. It causes a performance overhead, as data conversion is needed before writing to and reading from a UNIX file.
To read from this Stage faster, the number of readers per node can be increased (the default value is one).
2. Data Set Stage:
The Data Set is a file Stage, which allows reading data from or writing data to a dataset.
This Stage can have a single input link or a single output link.
It does not support a reject link.
It can be configured to operate in sequential mode or parallel mode.
DataStage parallel extender jobs use datasets to store the data being operated on in a persistent form.
Datasets are operating system files which, by convention, have the suffix .ds.
Datasets are much faster compared to sequential files.
Data is spread across multiple nodes and is referred to by a control file.
Datasets are not UNIX files and no UNIX operation can be performed on them.
Usage of datasets results in good performance in a set of linked jobs.
They help in achieving end-to-end parallelism by writing data in partitioned form and maintaining the sort order.
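Since datasets are not plain UNIX files, they should be inspected and deleted with the orchadmin utility rather than ls/rm. A minimal sketch, assuming $APT_CONFIG_FILE is already set and the dataset path is a placeholder:

    orchadmin ll /data/ds/customers.ds         # list the segment files behind the dataset
    orchadmin describe /data/ds/customers.ds   # show the dataset's schema and partition layout
    orchadmin rm /data/ds/customers.ds         # remove the control file and all its data segments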
3. Lookup Stage –
A Look up Stage is an Active Stage.
It is used to perform a lookup on any parallel job Stage that can output data.
The Lookup Stage can have a reference link, a single input link, a single output link and a single reject link.
The Lookup Stage is faster when the data volume is small.
It can have multiple reference links (if it is a sparse lookup, it can have only one reference link). The optional reject link carries source records that do not have a corresponding entry in the input lookup tables.
The Lookup Stage and the type of lookup should be chosen depending on the functionality and the volume of data.
The sparse lookup type should be chosen only if the primary input data volume is small.
If the reference data volume is large, usage of the Lookup Stage should be avoided, as all reference data is pulled into local memory.
4. Join Stage:
Join Stage performs a join operation on two or more datasets input to the join Stage and produces one output dataset. It can have multiple input links and one Output link.
There can be 3 types of join operations: inner join, left/right outer join and full outer join.
Join should be used when the data volume is high.
It is a good alternative to the Lookup Stage and should be used when handling huge volumes of data.
Join uses a paging method for data matching.
5. Merge Stage:
The Merge Stage is an active Stage.
It can have multiple input links, a single output link, and it supports as many reject links as there are input links.
The Merge Stage takes sorted input.
It combines a sorted master data set with one or more sorted update data sets.
The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record.
A master record and an update record are merged only if both of them have the same values for the merge key column(s) that you specify.
Merge key columns are one or more columns that exist in both the master and update records; there can be more than one merge key column.
For a Merge Stage to work properly, the master dataset and the update dataset should contain unique records.
The Merge Stage is generally used to combine datasets or files.
6. Sort Stage:
The Sort Stage is an active Stage.
The Sort Stage is used to sort the input dataset in either ascending or descending order.
The Sort Stage offers a variety of options: retaining the first or last record when removing duplicates, stable sorting, specifying the algorithm used for sorting to improve performance, and so on.
Even though data can be sorted on a link, the Sort Stage should be used when the data to be sorted is huge.
When we sort data on a link (the sort/unique option), once the data size goes beyond the fixed memory limit, I/O to disk takes place, which incurs an overhead. Therefore, if the volume of data is large, an explicit Sort Stage should be used instead of a sort on the link.
The Sort Stage gives an option to increase the buffer memory used for sorting; this means lower I/O and better performance.
7. Transformer Stage:
The Transformer Stage is an active Stage, which can have a single input link and multiple output links.
It is a very robust Stage with a lot of in-built functionality.
The Transformer Stage always generates C++ code, which is then compiled into a parallel component, so the overheads of using a Transformer Stage are high. Therefore, in any job it is imperative that the use of Transformers is kept to a minimum and other Stages are used instead: a Copy Stage can be used for mapping an input link to multiple output links without any transformations, a Filter Stage can be used for filtering out data based on certain criteria, and a Switch Stage can be used to map a single input link to multiple output links based on the value of a selector field. It is also advisable to reduce the number of Transformers in a job by combining the logic into a single Transformer rather than having multiple Transformers.
8. Funnel Stage –
The Funnel Stage is used to combine multiple inputs into a single output stream. However, the presence of a Funnel Stage reduces the performance of a job; it has been observed to increase job run time by around 30%. When a Funnel Stage is to be used in a large job, it is better to isolate it in a job of its own: write the outputs to datasets and funnel them in a new job. The Funnel Stage should be run in "continuous" mode, without hindrance.