
When to choose Parallel or Server Datastage Jobs

    Situations to choose Parallel or Server Datastage Jobs

  1. The choice between server and parallel jobs depends on time to implement, functionality, and cost.
  2. When there is a lot of functionality to implement for lower data volumes, hardware is limited, and ease of implementation matters, server jobs are a good choice.
  3. Parallel jobs are costlier because they require large-scale hardware and are harder to implement, but they offer extreme processing capability for very high volumes, with a vast array of operators for high-performance data manipulation.
  4. When the data volume is low, it is better to go for a server job, as parallel jobs can have a longer start-up time.
  5. When data volume is high, it is better to choose a parallel job over a server job; the obvious incentive for going parallel is data volume. A parallel job will be much faster than a server job even when it runs on a single node. Parallel jobs can also remove bottlenecks and run across multiple nodes in a cluster for almost unlimited scalability, at which point they become the faster and easier option. A parallel Sort stage is much faster than its server counterpart, and a Transformer stage in a parallel job outperforms the same transformations in a server job: even on a one-node configuration, the compiled parallel Transformer can be around three times faster. Even a one-node configuration without much parallel processing can still see big performance improvements from an Enterprise Edition job, and the improvement can multiply tenfold or more on a two-CPU machine running two nodes in most stages.
  6. Parallel jobs take advantage of both pipeline parallelism and partitioning parallelism.
  7. We can improve the performance of a server job by enabling inter-process row buffering, which lets stages exchange data as soon as it is available on the link. The IPC stage likewise lets one passive stage read data from another as soon as it is available; in other words, stages do not have to wait for the entire set of records to be read before passing rows to the next stage. Link Partitioner and Link Collector stages can be used to achieve a certain degree of partitioning parallelism.
  8. A lookup against a sequential file is possible in parallel jobs but not in server jobs.
  9. Datastage EE jobs are compiled into OSH (Orchestrate Shell script language).
    OSH executes operators: instances of executable C++ classes, pre-built components representing the stages used in Datastage jobs.
    Server jobs are compiled into BASIC, which is interpreted pseudo-code. This is why parallel jobs run faster, even when processed on one CPU.
  10. The major difference between Infosphere Datastage Enterprise and Server edition is that Enterprise Edition (EE) introduces parallel jobs. Parallel jobs support a completely new set of stages, which implement scalable and parallel data processing mechanisms. In most cases parallel jobs and stages look similar to the Datastage Server objects, but their capabilities are very different.
    In rough outline:
    • Parallel jobs are executable Datastage programs, managed and controlled by the Datastage Server runtime environment
    • Parallel jobs have a built-in mechanism for pipelining, partitioning and parallelism. In most cases no manual intervention is needed to implement those techniques optimally.
    • Parallel jobs are a lot faster in ETL tasks such as sorting, filtering and aggregating
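The pipelining and partitioning described in points 6 and 7 can be sketched in plain Python (this is not DataStage code; the stage names, row layout, and partition count are invented for illustration). Hash partitioning splits rows across workers the way a Link Partitioner does, each partition is transformed independently, and the outputs are merged back into one stream as a Link Collector would; the generator shows the pipelining idea, where each row flows to the next stage as soon as it is produced.

```python
from concurrent.futures import ThreadPoolExecutor

def transformer(row):
    """Stand-in for a Transformer stage: derive an upper-cased name column."""
    key, name = row
    return (key, name.upper())

def hash_partition(rows, n):
    """Link Partitioner in miniature: split rows across n partitions by key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[0]) % n].append(row)
    return parts

def run_parallel(rows, n=2):
    """Transform each partition concurrently, then collect the partition
    outputs back into a single stream (the Link Collector role)."""
    parts = hash_partition(rows, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = pool.map(lambda part: [transformer(r) for r in part], parts)
    return [row for part in results for row in part]

def transform_stream(rows):
    """Pipeline parallelism in miniature: each row is passed downstream as
    soon as it arrives, instead of after the whole record set is read."""
    for row in rows:
        yield transformer(row)

rows = [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")]
print(sorted(run_parallel(rows)))
# → [(1, 'ALICE'), (2, 'BOB'), (3, 'CAROL'), (4, 'DAVE')]
```

In a real parallel job the partitions would run as separate OS processes on separate nodes; threads are used here only to keep the sketch self-contained.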
Refer to this link to know more about parallel job stages: Parallel Jobs Stages
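The sequential-file lookup from point 8 can also be sketched in Python (again, an illustration rather than DataStage code; the column layout and sample data are invented). The reference file is read once into an in-memory map, and each input row is then probed against it, which is roughly what a parallel-job Lookup stage does with a sequential-file reference link.

```python
import csv
import io

def build_lookup(ref_file, key_col=0):
    """Read the reference (sequential) file once and index it by the key column."""
    return {row[key_col]: row for row in csv.reader(ref_file)}

def lookup_join(input_rows, lookup):
    """Enrich each input row with its matching reference value, or None
    when the key is not found (an unmatched lookup)."""
    for row in input_rows:
        ref = lookup.get(row[0])
        yield row + (ref[1] if ref else None,)

# Example: a two-column reference file mapping an id to a department.
ref = io.StringIO("1,engineering\n2,finance\n")
table = build_lookup(ref)
out = list(lookup_join([("1", "alice"), ("3", "carol")], table))
print(out)
# → [('1', 'alice', 'engineering'), ('3', 'carol', None)]
```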
