This blog gives you complete details on how to improve the performance of DataStage parallel jobs.
You may like these links as well:
1. Click here to know more DataStage performance tuning tips
2. Click here to know DataStage Partitioning Methods and Use
3. Click here to know DataStage Jobs Performance Improvement Tips 1
4. Click here to know Partitioning Considerations
Some of the most common points for DataStage job performance tuning:
1. Select a suitable configuration file (number of nodes depending on data volume); a sample configuration file follows this list.
2. Size buffer memory correctly.
3. Select the proper partitioning method.
4. Turn off Runtime Column Propagation wherever it is not required.
5. Take care over the sorting of the data.
6. Handle null values (use Modify instead of Transformer).
7. Try to reduce the use of Transformers (use Copy, Filter, Modify instead).
8. Use Datasets instead of Sequential Files in the middle of large jobs.
9. Use at most about 20 stages in a job for best performance.
10. Choose between Join, Lookup and Merge depending on data volume.
11. Stop propagation of unnecessary metadata between stages.
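To illustrate point 1, here is a minimal sketch of a two-node parallel configuration file for a small data volume; the host name and resource paths are placeholders, and for larger volumes you would typically define more nodes:

    {
        node "node1"
        {
            fastname "etl_host"
            pools ""
            resource disk "/data/ds/node1/data" {pools ""}
            resource scratchdisk "/data/ds/node1/scratch" {pools ""}
        }
        node "node2"
        {
            fastname "etl_host"
            pools ""
            resource disk "/data/ds/node2/data" {pools ""}
            resource scratchdisk "/data/ds/node2/scratch" {pools ""}
        }
    }

Pointing each node's disk and scratchdisk at separate physical file systems, where available, also helps avoid I/O contention.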
Points we need to consider:
1. Stage the data coming from ODBC/OCI/DB2 UDB stages or any database on the server using hashed/sequential files for optimum performance.
2. Tune the OCI stage's 'Array Size' and 'Rows per Transaction' values for faster inserts, updates and selects.
3. Tune the 'Project Tunables' in Administrator for better performance.
4. Use sorted data for the Aggregator.
5. Sort the data as much as possible in the database and reduce the use of DS sorts for better job performance.
6. Remove the data and columns not used from the source as early as possible in the job.
7. Work with the DB admin to create appropriate indexes on tables for better performance of DS queries.
8. Convert some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.
9. If an input file has an excessive number of rows and can be split up, use standard logic to run jobs in parallel (a shell sketch follows this list).
10. Before writing a routine or a transform, make sure the required functionality is not already available in one of the standard routines supplied in the SDK or DS Utilities categories. Constraints are generally CPU intensive and take a significant amount of time to process; this is especially the case if the constraint calls routines or external macros, but if it is inline code the overhead will be minimal.
11. Try to put the constraints in the 'Selection' criteria of the jobs themselves. This eliminates unnecessary records before joins are made.
12. Tuning should occur on a job-by-job basis.
13. Use the power of the DBMS.
14. Try not to use a Sort stage when you can use an ORDER BY clause in the database.
15. Using a constraint to filter a record set is much slower than performing a SELECT ... WHERE ... (a SQL sketch follows this list).
16. Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.
17. Minimize the usage of the Transformer (instead use Copy, Modify, Filter or Row Generator).
18. Use SQL code while extracting the data.
19. Handle nulls properly using the Modify stage (a sketch follows this list).
20. Minimize the warnings.
21. Reduce the number of lookups in a job design.
22. Try not to use more than 20 stages in a job if the expected data volume is very high.
23. Use an IPC stage between two passive stages; it reduces processing time.
24. Drop indexes before data loading and recreate them after loading data into the tables (see the SQL sketch after this list).
25. Check the write cache of the hash file. If the same hash file is used both for lookup and as a target, disable this option.
26. If the hash file is used only for lookup, then enable 'Preload to memory'; this will improve performance. Also check the order of execution of the routines.
27. Don't use more than 7 lookups in the same Transformer; introduce new Transformers if it exceeds 7 lookups.
28. Use the 'Preload to memory' option on the hash file output.
29. Use 'Write to cache' on the hash file input.
30. Write into the error tables only after all the Transformer stages.
31. Reduce the width of the input record: remove the columns that you will not use.
32. Cache the hash files you are reading from and writing into. Make sure your cache is big enough to hold the hash files.
33. Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files (see the example after this list).
34. Ideally, if the amount of data to be processed is small, configuration files with a smaller number of nodes should be used, while if the data volume is large, configuration files with a larger number of nodes should be used.
35. Partitioning should be set in such a way as to have a balanced data flow, i.e. the data should be partitioned nearly equally and data skew should be minimized.
36. In DataStage jobs where a high volume of data is processed, the virtual memory settings for the job should be optimized. Jobs often abort in cases where a single lookup has multiple reference links; this happens due to low temporary memory space. In such jobs $APT_BUFFER_MAXIMUM_MEMORY, $APT_MONITOR_SIZE and $APT_MONITOR_TIME should be set to sufficiently large values (see the example after this list).
37. Sequential files should be used when reading a flat file (fixed width or delimited) in a UNIX environment which has been FTPed from some external system, or when some UNIX operation has to be done on the file.
38. Don't use sequential files for intermediate storage between jobs. It causes performance overhead, as data conversion is needed before writing to and reading from a UNIX file.
39. In order to read faster from the stage, the number of readers per node can be increased (the default value is one).
40. Usage of Datasets results in good performance in a set of linked jobs. They help in achieving end-to-end parallelism by writing data in partitioned form and maintaining the sort order.
41. The Lookup stage is faster when the data volume is small. If the reference data volume is large, usage of the Lookup stage should be avoided, as all the reference data is pulled into local memory.
42. The sparse lookup type should be chosen only if the primary input data volume is small.
43. Join should be used when the data volume is high. It is a good alternative to the Lookup stage and should be used when handling huge volumes of data.
44. Even though data can be sorted on a link, the Sort stage is used when the data to be sorted is huge. When we sort data on a link (sort/unique option), once the data size goes beyond the fixed memory limit, I/O to disk takes place, which incurs an overhead. Therefore, if the volume of data is large, an explicit Sort stage should be used instead of a sort on the link. The Sort stage gives an option to increase the buffer memory used for sorting; this means lower I/O and better performance.
45. It is also advisable to reduce the number of Transformers in a job by combining the logic into a single Transformer rather than having multiple Transformers.
46. The presence of a Funnel stage reduces the performance of a job; it can increase the time taken by the job by around 30% (based on observations). When a Funnel stage is to be used in a large job, it is better to isolate it in its own job: write the output to Datasets and funnel them in a new job.
47. The Funnel stage should be run in "continuous" mode, without hindrance.
48. A single job should not be overloaded with stages. Each extra stage put in a job means fewer resources available for every stage, which directly affects the job's performance. If possible, big jobs having a large number of stages should be logically split into smaller units.
49. Unnecessary column propagation should not be done. As far as possible, RCP (Runtime Column Propagation) should be disabled in the jobs.
50. A most often neglected option is "don't sort if previously sorted" in the Sort stage; set this option to "true". This improves Sort stage performance a great deal.
51. In the Transformer stage, "Preserve Sort Order" can be used to maintain the sort order of the data and reduce sorting in the job.
52. Reduce the number of stage variables used.
53. The Copy stage should be used instead of a Transformer for simple operations.
54. "Upsert" works well if the data is sorted on the primary key column of the table which is being loaded.
55. Don't read from a Sequential File using SAME partitioning.
56. By using the hashed file stage we can improve performance. In the hashed file stage we can define the read cache size and write cache size; the default size is 128 MB.
57. Performance can also be improved by using active-to-active links. Here we can improve performance by enabling the row buffer; the default row buffer size is 128 KB.
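For point 9, one common approach is to cut the large input file into pieces with standard UNIX tools and run one invocation of the job per piece. The file names and chunk size below are purely illustrative:

    # split a very large input file into chunks of 10 million lines each;
    # this produces chunk_aa, chunk_ab, ... which separate job invocations can read
    split -l 10000000 /data/input/big_file.dat /data/input/chunk_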
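For points 14, 15 and 18, push the filtering and the ordering into the source database query instead of using a DataStage constraint and a Sort stage. The table and column names here are made up for illustration:

    SELECT cust_id, cust_name, order_amt
    FROM   orders
    WHERE  order_status = 'OPEN'   -- filter in the database instead of a constraint
    ORDER  BY cust_id              -- sort in the database instead of a Sort stage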
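For point 19, the Modify stage can replace nulls without the cost of a Transformer. The sketch below shows the general shape of a Modify specification using handle_null; the column names and replacement values are invented, so check the exact syntax against your version's documentation:

    CUST_NAME = handle_null(CUST_NAME, 'UNKNOWN')
    ORDER_AMT = handle_null(ORDER_AMT, 0)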
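For point 24, dropping and recreating indexes around the load is plain SQL. The index and table names are hypothetical, and the exact syntax varies by database:

    -- before the load
    DROP INDEX idx_orders_cust_id;
    -- ... run the DataStage load job here ...
    -- after the load
    CREATE INDEX idx_orders_cust_id ON orders (cust_id);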
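For point 33, these commands are run against the hashed file from the DataStage engine command prompt; the file name is a placeholder and the available options depend on your engine version:

    ANALYZE.FILE MyLookupHash
    HASH.HELP MyLookupHash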
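For point 36, the variables can be set as job parameters or in the project environment. The values below are only illustrative starting points, not recommendations for every job:

    export APT_BUFFER_MAXIMUM_MEMORY=50000000   # per-buffer memory, in bytes (illustrative value)
    export APT_MONITOR_SIZE=100000              # job monitor reporting interval, in rows (illustrative value)
    export APT_MONITOR_TIME=30                  # job monitor reporting interval, in seconds (illustrative value)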
Many thanks for reading these DataStage job performance tuning tips.