Datastage Interview Related Questions and Answers V1.3
You will find answers to the following questions in this post:
# Details of Data partitioning and collecting methods in Datastage?
# Remove duplicates using the Sort stage and the Remove Duplicates stage, and the difference between them?
# What is the difference between the Copy and Transformer stages?
# What is the use of the Pivot Enterprise stage?
# What is the difference between the Change Capture and Change Apply stages?
------------------------------------------
# Details of Data partitioning and collecting methods in Datastage?
The partitioning mechanism divides the data into smaller segments, which are then processed independently and in parallel by each node. It takes advantage of parallel architectures such as SMP, MPP, grid computing and clusters.
- Auto
- Same
- Round robin
- Hash
- Entire
- Random
- Range
- Modulus
Collecting is the opposite of partitioning and
can be defined as a process of bringing back data partitions into a single
sequential stream (one data partition).
1. Auto
2. Round Robin
3. Ordered
4. Sort Merge
** Data partitioning methods : Datastage supports the following data partitioning methods, which can be implemented in parallel stages:
Auto - default. Datastage Enterprise
Edition decides between using Same or Round Robin partitioning. Typically Same
partitioning is used between two parallel stages and round robin is used
between a sequential and an EE stage.
Same - existing partitioning remains
unchanged. No data is moved between nodes.
Round robin - rows are alternated evenly across partitions. This partitioning method guarantees an exact load balance (the same number of rows processed) between nodes and is very fast.
Hash - rows with the same key column value (or values, for multiple key columns) go to the same partition. Hash is used very often and sometimes improves performance; however, it is important to keep in mind that hash partitioning does not guarantee load balance, and misuse may lead to skewed data and poor performance.
Entire - all rows from a dataset are
distributed to each partition. Duplicated rows are stored and the data volume
is significantly increased.
Random - rows are randomly distributed across partitions.
Range - an expensive refinement of hash partitioning. It is similar to hash, but the partition mapping is user-determined and partitions are ordered. Rows are distributed according to the values in one or more key fields, using a range map (the 'Write Range Map' stage needs to be used to create it). Range partitioning requires processing the data twice, which makes it hard to find a reason for using it.
Modulus - data is partitioned on one specified numeric field by calculating the key value modulo the number of partitions. Not used very often.
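As a rough illustration (outside Datastage, just to show the idea), here is a minimal Python sketch of how Round Robin, Hash and Modulus might assign rows to partitions. The number of partitions and the CUSTOMER_ID key column are assumptions made up for the example:

```python
# Illustrative sketch only (not Datastage code): how Round Robin, Hash and
# Modulus partitioning might assign rows to partitions.
# NUM_PARTITIONS and the CUSTOMER_ID key column are invented for the example.
NUM_PARTITIONS = 4

rows = [{"CUSTOMER_ID": i, "AMOUNT": i * 10.0} for i in range(1, 11)]

def round_robin(rows, n):
    # Rows are dealt out evenly: row 0 -> partition 0, row 1 -> partition 1, ...
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def hash_partition(rows, n, key):
    # Rows with the same key value always land in the same partition.
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

def modulus_partition(rows, n, key):
    # Like hash, but the partition is simply the numeric key value modulo n.
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row[key] % n].append(row)
    return parts

print(round_robin(rows, NUM_PARTITIONS))
print(hash_partition(rows, NUM_PARTITIONS, "CUSTOMER_ID"))
print(modulus_partition(rows, NUM_PARTITIONS, "CUSTOMER_ID"))
```

Note how rows with the same CUSTOMER_ID always land in the same partition under hash and modulus, while round robin simply deals rows out evenly regardless of key.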
** Data collecting methods : A collector combines partitions into a single sequential stream. Datastage Parallel supports the following collecting algorithms:
Auto - the default algorithm reads
rows from a partition as soon as they are ready. This may lead to producing
different row orders in different runs with identical data. The execution is
non-deterministic.
Round Robin - picks rows from the input partitions in strict turn, for instance: first row from partition 0, next from partition 1, and so on, even if other partitions can produce rows faster than partition 1.
Ordered - reads all rows from first
partition, then second partition, then third and so on.
Sort Merge - produces a globally sorted sequential stream from rows that are already sorted within each partition. The output is deterministic on the key columns but non-deterministic on un-keyed columns. The algorithm: always pick the partition that produces the row with the smallest key value.
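To see the difference between Ordered and Sort Merge collecting, here is a small Python sketch (again, not Datastage code; the partition contents and the ID key are invented), assuming each partition is already sorted on the key:

```python
import heapq

# Three partitions, each already sorted on the key "ID" (invented sample data).
partitions = [
    [{"ID": 1}, {"ID": 4}, {"ID": 7}],
    [{"ID": 2}, {"ID": 5}, {"ID": 8}],
    [{"ID": 3}, {"ID": 6}, {"ID": 9}],
]

def ordered_collect(parts):
    # Ordered: read all of partition 0, then all of partition 1, and so on.
    return [row for part in parts for row in part]

def sort_merge_collect(parts, key):
    # Sort Merge: always take the next row with the smallest key value across
    # partitions, producing a globally sorted stream.
    return list(heapq.merge(*parts, key=lambda r: r[key]))

print([r["ID"] for r in ordered_collect(partitions)])           # 1 4 7 2 5 8 3 6 9
print([r["ID"] for r in sort_merge_collect(partitions, "ID")])  # 1 2 3 ... 9
```

Ordered simply concatenates the partitions, so a global order is only preserved if the data was range-partitioned; Sort Merge keeps a globally sorted stream.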
# Remove duplicates using the Sort stage and the Remove Duplicates stage, and the difference between them?
We can remove duplicates using both stages, but in the Sort stage we can also capture the duplicate records, using the Create Key Change Column property (illustrated in the sketch below).
1) The advantage of using the Sort stage over the Remove Duplicates stage is that the Sort stage allows us to capture the duplicate records, whereas the Remove Duplicates stage does not.
2) Using a Sort stage we can only retain the first record of each group. Normally we retain the last record by sorting a field in ascending order and taking the last one; the same result can be achieved with the Sort stage by sorting in descending order and retaining the first record.
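A rough Python sketch of what the Create Key Change Column behaviour looks like (the CUST_ID key and the sample rows are invented): the first row of each key group gets keyChange = 1, the rest get 0, so keeping only the keyChange = 1 rows gives the 'retain first' behaviour while the keyChange = 0 rows are the captured duplicates.

```python
# Illustrative sketch (not Datastage code): sort on a key, emit a key-change
# flag, then keep only the first row per key. CUST_ID and the rows are invented.
rows = [
    {"CUST_ID": 101, "CITY": "Pune"},
    {"CUST_ID": 100, "CITY": "Delhi"},
    {"CUST_ID": 101, "CITY": "Mumbai"},
]

# Sort on the key, as the Sort stage would.
rows.sort(key=lambda r: r["CUST_ID"])

# Add a keyChange-style column: 1 for the first row of each key group, else 0.
prev_key = object()
for row in rows:
    row["keyChange"] = 1 if row["CUST_ID"] != prev_key else 0
    prev_key = row["CUST_ID"]

retained   = [r for r in rows if r["keyChange"] == 1]  # 'retain first' output
duplicates = [r for r in rows if r["keyChange"] == 0]  # captured duplicates
print(retained)
print(duplicates)
```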
# What is the difference between the Copy and Transformer stages?
In a Copy stage there are no constraints or derivations, so it should perform better than a Transformer. If you just want a copy of a dataset, use the Copy stage; if there are business rules to be applied to the dataset, use the Transformer stage.
We also use the Copy stage to change the metadata of the input dataset (for example, renaming a column).
# What is the use of the Pivot Enterprise stage?
The Pivot Enterprise stage is a processing stage that pivots data horizontally and vertically.
Specifying a horizontal pivot operation: Use the Pivot Enterprise stage to horizontally pivot data to map sets of input columns onto single output columns.
Table 1. Input data for a simple horizontal pivot operation

REPID | last_name | Jan_sales | Feb_sales | Mar_sales
100   | Smith     | 1234.08   | 1456.80   | 1578.00
101   | Yamada    | 1245.20   | 1765.00   | 1934.22
Table 2. Output data for a simple horizontal pivot operation

REPID | last_name | Q1sales | Pivot_index
100   | Smith     | 1234.08 | 0
100   | Smith     | 1456.80 | 1
100   | Smith     | 1578.00 | 2
101   | Yamada    | 1245.20 | 0
101   | Yamada    | 1765.00 | 1
101   | Yamada    | 1934.22 | 2
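To make the horizontal pivot concrete, here is a rough Python equivalent of the operation shown in the two tables above (illustrative only; in the real stage the Pivot_index column is generated for you):

```python
# Illustrative sketch (not Datastage code) of the horizontal pivot shown above:
# Jan_sales / Feb_sales / Mar_sales are mapped onto the single Q1sales column,
# and Pivot_index records which input column each output row came from.
input_rows = [
    {"REPID": 100, "last_name": "Smith",  "Jan_sales": 1234.08, "Feb_sales": 1456.80, "Mar_sales": 1578.00},
    {"REPID": 101, "last_name": "Yamada", "Jan_sales": 1245.20, "Feb_sales": 1765.00, "Mar_sales": 1934.22},
]

pivot_columns = ["Jan_sales", "Feb_sales", "Mar_sales"]

output_rows = []
for row in input_rows:
    for index, col in enumerate(pivot_columns):
        output_rows.append({
            "REPID": row["REPID"],
            "last_name": row["last_name"],
            "Q1sales": row[col],
            "Pivot_index": index,
        })

for out in output_rows:
    print(out)
```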
Specifying a
vertical pivot operation: Use the Pivot Enterprise stage to
vertically pivot data and then map the resulting columns onto the output
columns.
Table 3. Input data for a vertical pivot operation

REPID | last_name | Q_sales
100   | Smith     | 1234.08
100   | Smith     | 1456.80
100   | Smith     | 1578.00
101   | Yamada    | 1245.20
101   | Yamada    | 1765.00
101   | Yamada    | 1934.22
Table 4. Output data for a vertical pivot operation

REPID | last_name | Q_sales (January) | Q_sales1 (February) | Q_sales2 (March) | Q_sales_average
100   | Smith     | 1234.08           | 1456.80             | 1578.00          | 1412.96
101   | Yamada    | 1245.20           | 1765.00             | 1934.22          | 1648.14
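And a rough Python equivalent of the vertical pivot (illustrative only): the Q_sales rows within each REPID group are spread back across columns, and an average is computed as an extra output column.

```python
from collections import defaultdict

# Illustrative sketch (not Datastage code) of the vertical pivot shown above.
input_rows = [
    {"REPID": 100, "last_name": "Smith",  "Q_sales": 1234.08},
    {"REPID": 100, "last_name": "Smith",  "Q_sales": 1456.80},
    {"REPID": 100, "last_name": "Smith",  "Q_sales": 1578.00},
    {"REPID": 101, "last_name": "Yamada", "Q_sales": 1245.20},
    {"REPID": 101, "last_name": "Yamada", "Q_sales": 1765.00},
    {"REPID": 101, "last_name": "Yamada", "Q_sales": 1934.22},
]

# Group the Q_sales values by (REPID, last_name).
groups = defaultdict(list)
for row in input_rows:
    groups[(row["REPID"], row["last_name"])].append(row["Q_sales"])

# Spread each group's values across Q_sales, Q_sales1, Q_sales2, ...
# and compute the average as an extra output column.
for (repid, last_name), sales in groups.items():
    out = {"REPID": repid, "last_name": last_name}
    for i, value in enumerate(sales):
        out["Q_sales" if i == 0 else f"Q_sales{i}"] = value
    out["Q_sales_average"] = round(sum(sales) / len(sales), 2)
    print(out)
```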
In the Transformer stage:
Stage Variable - an intermediate processing variable that retains its value while a row is being processed and does not pass its value into a target column.
Derivation - an expression that specifies the value to be passed on to the target column.
Constraint - like a filter condition, which limits the number of records going to the output according to a business rule.
The right order of evaluation is: stage variables, then constraints, then derivations.
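A small Python sketch of that evaluation order (the column names and the business rule are invented for illustration): for each input row the stage variable is computed first, the constraint then decides whether the row passes, and only then are the output derivations evaluated.

```python
# Illustrative sketch (not Datastage code) of Transformer evaluation order:
# stage variables -> constraint -> derivations, evaluated once per input row.
# QTY / PRICE and the business rule are invented for the example.
input_rows = [
    {"QTY": 5,  "PRICE": 10.0},
    {"QTY": 0,  "PRICE": 99.0},
    {"QTY": 12, "PRICE": 3.5},
]

output_rows = []
for row in input_rows:
    # 1. Stage variable: intermediate value, never written to the target itself.
    sv_line_total = row["QTY"] * row["PRICE"]

    # 2. Constraint: filter rows according to a business rule.
    if sv_line_total <= 0:
        continue

    # 3. Derivations: expressions whose results are passed to the target columns.
    output_rows.append({
        "QTY": row["QTY"],
        "LINE_TOTAL": sv_line_total,
        "IS_BULK": "Y" if row["QTY"] >= 10 else "N",
    })

print(output_rows)
```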
# What is the difference between the Change Capture and Change Apply stages?
The Change Capture stage is used to get the difference between two sources, i.e. the after dataset and the before dataset. The source which is used as a reference to capture the changes is called the after dataset. The source in which we are looking for the changes is called the before dataset. Change Capture adds one field, called "change_code", to its output. From this change code one can recognize what kind of change each row represents: delete, insert or edit (update).
The Change Apply stage is used along with the Change Capture stage. It takes the change code from the Change Capture stage and applies all the changes to the before dataset based on that change code.
Change Capture is used to capture
the changes between the two sources.
Change Apply
will apply those changes in the output file.
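To make the change code idea concrete, here is a small Python sketch (not Datastage code; the key and column names are invented) that captures the differences between a before and an after dataset and then applies them back. The change codes follow the conventional values (1 = insert, 2 = delete, 3 = edit):

```python
# Illustrative sketch (not Datastage code) of Change Capture / Change Apply.
# Datasets are keyed on ID; codes: 1 = insert, 2 = delete, 3 = edit.
before = {1: {"ID": 1, "NAME": "Asha"},
          2: {"ID": 2, "NAME": "Ravi"}}
after  = {1: {"ID": 1, "NAME": "Asha"},
          2: {"ID": 2, "NAME": "Ravi K"},
          3: {"ID": 3, "NAME": "Meena"}}

def change_capture(before, after):
    # Compare the after dataset against the before dataset and emit the
    # differences, each tagged with a change_code.
    changes = []
    for key, row in after.items():
        if key not in before:
            changes.append({**row, "change_code": 1})   # insert
        elif row != before[key]:
            changes.append({**row, "change_code": 3})   # edit
    for key, row in before.items():
        if key not in after:
            changes.append({**row, "change_code": 2})   # delete
    return changes

def change_apply(before, changes):
    # Apply the captured changes to the before dataset based on the change code.
    result = dict(before)
    for change in changes:
        row = dict(change)
        code = row.pop("change_code")
        if code in (1, 3):
            result[row["ID"]] = row      # insert or update
        else:
            result.pop(row["ID"], None)  # delete
    return result

changes = change_capture(before, after)
print(changes)
print(change_apply(before, changes))     # reproduces the after dataset
```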