Common Interview Questions and Answers for
DATA WAREHOUSING CONCEPTS
1. Can we look up a table from a Source Qualifier transformation, i.e. an unconnected lookup?
Answer: You cannot perform a lookup from a Source Qualifier directly. However, you can override the SQL in the Source Qualifier to join with the lookup table and perform the lookup.
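A minimal sketch of such a SQL override, assuming a hypothetical ORDERS source table and a CUSTOMER_DIM lookup table:

SELECT o.order_id,
       o.customer_id,
       o.order_amount,
       c.customer_name   -- looked-up value from the joined table
FROM   orders o
LEFT OUTER JOIN customer_dim c
  ON   c.customer_id = o.customer_id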
2. How can we use mapping variables in Informatica? Where do we use them?
Answer: Yes, we can use mapping variables in Informatica. The Informatica server saves the value of a mapping variable to the repository at the end of a session run and uses that value the next time we run the session. A common use is incremental extraction, where the variable holds a watermark (such as the last processed date) that the next run picks up from.
3. Can we override a native SQL query within Informatica? Where do we do it? How do we do it?
Answer: Yes, we can override the native SQL query in the Source Qualifier and Lookup transformations.
In the Lookup transformation the "SQL Override" option can be found in the lookup properties; by using this option we can do it.
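A minimal sketch of a lookup SQL override, assuming a hypothetical PRODUCT_DIM lookup table where only active rows should be cached:

SELECT product_id,
       product_name
FROM   product_dim
WHERE  status = 'ACTIVE'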
4. How to determine what records to extract?
Answer: When addressing a table, some dimension key must reflect the need for a record to be extracted. Mostly it will come from the time dimension (e.g. date >= 1st of the current month) or a transaction status flag (e.g. Order Invoiced Status). A foolproof approach is to add an archive flag to each record, which is reset whenever the record changes.
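A minimal sketch of a time-based extraction filter, assuming a hypothetical ORDERS table with a LAST_UPDATED column:

SELECT *
FROM   orders
WHERE  last_updated >= TRUNC(SYSDATE, 'MM')   -- changed since the 1st of the current month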
5. When do we analyze the tables? How do we do it?
Answer: The ANALYZE statement allows you to validate and compute statistics for an index, table, or cluster. These statistics are used by the cost-based optimizer when it calculates the most efficient plan for retrieval. In addition to its role in statement optimization, ANALYZE also helps in validating object structures and in managing space in your system. You can choose the following operations: COMPUTE, ESTIMATE, and DELETE. Early versions of Oracle7 produced unpredictable results when the ESTIMATE operation was used, so it is best to compute your statistics.
EX:
SELECT OWNER,
       SUM(DECODE(NVL(NUM_ROWS, 9999), 9999, 0, 1)) analyzed,
       SUM(DECODE(NVL(NUM_ROWS, 9999), 9999, 1, 0)) not_analyzed,
       COUNT(TABLE_NAME) total
FROM   dba_tables
WHERE  OWNER NOT IN ('SYS', 'SYSTEM')
GROUP  BY OWNER
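The query above reports how many tables per owner have statistics. The ANALYZE statement itself, sketched against a hypothetical SALES table:

ANALYZE TABLE sales COMPUTE STATISTICS;
ANALYZE TABLE sales ESTIMATE STATISTICS SAMPLE 20 PERCENT;
ANALYZE TABLE sales DELETE STATISTICS;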
6. What are snapshots? What are materialized views & where do we use them?
Answer: Snapshots are read-only copies of a master table located on a remote node, periodically refreshed to reflect changes made to the master table. Snapshots are mirrors or replicas of tables.
Views are built using the columns from one or more tables. A view can be updated, deleted from, or inserted into only if it has a single base table; if the view is based on columns from more than one table, then insert, update, and delete are not possible.
Materialized view: a pre-computed table comprising aggregated or joined data from fact and possibly dimension tables. Also known as a summary or aggregate table.
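A minimal sketch of a materialized view, assuming hypothetical SALES and PRODUCT_DIM tables:

CREATE MATERIALIZED VIEW sales_by_product
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT p.product_name,
       SUM(s.amount) AS total_amount
FROM   sales s, product_dim p
WHERE  s.product_id = p.product_id
GROUP  BY p.product_name;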
7. What are metadata extensions?
Answer : Informatica allows end users and partners to
extend the metadata stored in the repository by associating information with
individual objects in the repository. For example, when you create a mapping,
you can store your contact information with the mapping. You associate
information with repository metadata using metadata extensions.
Informatica
Client applications can contain the following types of metadata extensions:
Vendor-defined.
Third-party application vendors create vendor-defined metadata extensions. You
can view and change the values of vendor-defined metadata extensions, but you
cannot create, delete, or redefine them.
User-defined.
You create user-defined metadata extensions using PowerCenter/PowerMart. You
can create, edit, delete, and view user-defined metadata extensions. You can
also change the values of user-defined extensions.
8. What is partitioning?
What are the types of partitioning?
Ans: If you use PowerCenter,
you can increase the number of partitions in a pipeline to improve session
performance. Increasing the number of partitions allows the Informatica Server
to create multiple connections to sources and process partitions of source data
concurrently.
When
you create a session, the Workflow Manager validates each pipeline in the
mapping for partitioning. You can specify multiple partitions in a pipeline if
the Informatica Server can maintain data consistency when it processes the
partitioned data.
When
you configure the partitioning information for a pipeline, you must specify a
partition type at each partition point in the pipeline.
The
partition type determines how the Informatica Server redistributes data across
partition points.
The
Workflow Manager allows you to specify the following partition types:
Round-robin partitioning. The Informatica Server
distributes data evenly among all partitions. Use round-robin partitioning
where you want each partition to process approximately the same number of rows.
Hash partitioning. The Informatica Server
applies a hash function to a partition key to group data among partitions. If
you select hash auto-keys, the Informatica Server uses all grouped or sorted
ports as the partition key. If you select hash user keys, you specify a number
of ports to form the partition key. Use hash partitioning where you want to
ensure that the Informatica Server processes groups of rows
with
the same partition key in the same partition.
Key range partitioning. You specify one or more ports to form a compound partition key. The Informatica Server passes data to each partition depending on the ranges you specify for each port. Use key range partitioning where the sources or targets in the pipeline are partitioned by key range.
Pass-through partitioning. The Informatica Server
passes all rows at one partition point to the next partition point without
redistributing them. Choose pass-through partitioning where you want to create
an additional pipeline stage to improve performance, but do not want to change
the distribution of data across partitions.
9. Where do we use connected and unconnected lookups?
Answer: If only one return port is needed, we can go for an unconnected lookup; more than one return port is not possible with an unconnected lookup. If more than one return port is needed, go for a connected lookup.
10. Do we need an ETL tool? When do we go for the tools in the market?
Answer: An ETL tool is used to Extract (E) data from multiple source systems (like RDBMS, flat files, mainframes, SAP, XML, etc.), Transform (T) it based on business requirements, and Load (L) it into target locations (like tables, files, etc.).
Need of an ETL tool: an ETL tool is typically required when data is scattered across different systems (like RDBMS, flat files, mainframes, SAP, XML, etc.).
11. What is Full load & Incremental or Refresh load?
Answer:
Full load: completely erasing the contents of one or more tables and reloading them with fresh data.
Incremental load: applying ongoing changes to one or more tables based on a predefined schedule.
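A minimal sketch of the two approaches in SQL, assuming a hypothetical CUSTOMER_DIM target loaded from a CUSTOMER_STG staging table:

-- Full load: erase and reload everything
TRUNCATE TABLE customer_dim;
INSERT INTO customer_dim SELECT * FROM customer_stg;

-- Incremental load: apply only the ongoing changes
MERGE INTO customer_dim d
USING customer_stg s
ON (d.customer_id = s.customer_id)
WHEN MATCHED THEN
  UPDATE SET d.customer_name = s.customer_name
WHEN NOT MATCHED THEN
  INSERT (customer_id, customer_name)
  VALUES (s.customer_id, s.customer_name);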
12. What is the difference between ETL tools and OLAP tools?
Answer: An ETL tool is meant for extracting data from legacy systems and loading it into a specified database, with some process of data cleansing.
Ex: Informatica, DataStage, etc.
OLAP tools are meant for reporting. In OLAP, data is available in a multidimensional model, so you can write a simple query to extract data from the database.
13. Compare ETL & manual development?
Answer:
ETL - the process of extracting data from multiple sources (e.g. flat files, XML, COBOL, SAP, etc.) is much simpler with the help of tools.
Manual - loading data from anything other than flat files and Oracle tables needs more effort.
ETL - high and clear visibility of logic.
Manual - complex and not so user-friendly visibility of logic.
ETL - contains metadata, and changes can be done easily.
Manual - no metadata concept, and changes need more effort.
ETL - error handling, log summary, and load progress make life easier for the developer and maintainer.
Manual - needs maximum effort from a maintenance point of view.
ETL - can handle historic data very well.
Manual - as data grows, the processing time degrades.
These are some differences between manual and ETL development.
14. What is a staging area? Do we need it? What is the purpose of a staging area?
Answer: Data staging is a collection of processes used to prepare source system data for loading a data warehouse. Staging includes the following steps: source data extraction, data transformation (restructuring), data transformation (data cleansing, value transformations), and surrogate key assignments.
15. What are the various transformations available?
Answer:
- Aggregator Transformation
- Expression Transformation
- Filter Transformation
- Joiner Transformation
- Lookup Transformation
- Normalizer Transformation
- Rank Transformation
- Router Transformation
- Sequence Generator Transformation
- Stored Procedure Transformation
- Sorter Transformation
- Update Strategy Transformation
- XML Source Qualifier Transformation
- Advanced External Procedure Transformation
- External Procedure Transformation
16. Where do we use semi-additive and non-additive facts?
Answer:
Additive: a measure that can participate in arithmetic calculations using all or any dimensions. Ex: sales profit.
Semi-additive: a measure that can participate in arithmetic calculations using only some dimensions. Ex: sales amount.
Non-additive: a measure that cannot participate in arithmetic calculations using dimensions.
17. What are active transformations / passive transformations?
Answer:
An active transformation can change the number of rows that pass through it (decrease or increase rows).
A passive transformation cannot change the number of rows that pass through it.
18. What is a mapping, session, worklet, workflow, mapplet?
Answer: A mapping represents the dataflow from sources to targets.
A mapplet creates or configures a set of transformations.
A workflow is a set of instructions that tell the Informatica server how to execute the tasks.
A worklet is an object that represents a set of tasks.
A session is a set of instructions that describe how and when to move data from sources to targets.
19. What are the various tools? Name a few.
Answer: A few are:
- Ab Initio
- DataStage
- Informatica
- Cognos DecisionStream
- Oracle Warehouse Builder
- Business Objects XI (Extreme Insight)
- SAP Business Warehouse
- SAS Enterprise ETL Server
20. What is a three-tier data warehouse?
Answer : A data warehouse can be thought of as a
three-tier system in which a middle system provides usable data in a secure way
to end users. On either side of this middle system are the end users and the
back-end data stores.
21. What are the modules in PowerMart?
Answer:
1. PowerMart Designer
2. Server
3. Server Manager
4. Repository
5. Repository Manager
22. What are the different lookup methods used in Informatica?
Answer:
1. Connected lookup
2. Unconnected lookup
A connected lookup receives input from the pipeline and sends output back to the pipeline, and can return any number of values; it does not contain a return port.
An unconnected lookup can return only one column; it contains a return port.
23. What is a Data Warehouse?
A Data Warehouse is the "corporate memory". Academics will say it is a subject-oriented, point-in-time, inquiry-only collection of operational data.
Typical
relational databases are designed for on-line transactional processing (OLTP)
and do not meet the requirements for effective on-line analytical processing
(OLAP). As a result, data warehouses are designed differently than traditional
relational databases.
24. What is ETL? How does Oracle support the ETL process?
ETL
is the Data Warehouse acquisition processes of Extracting, Transforming (or
Transporting) and Loading (ETL) data from source systems into the data
warehouse.
Oracle
supports the ETL process with their "Oracle Warehouse Builder"
product. Many new features in the Oracle9i database will also make ETL
processing easier. For example:
New MERGE command (also called UPSERT; insert and update information in one step);
External Tables allow users to run SELECT statements on external data files (with pipelining support).
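A minimal sketch of an external table, assuming a hypothetical customers.csv file readable through a directory object named DATA_DIR:

CREATE DIRECTORY data_dir AS '/data/files';  -- hypothetical path

CREATE TABLE customers_ext (
  customer_id   NUMBER,
  customer_name VARCHAR2(100)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('customers.csv')
);

-- An external table can then be queried like any other table:
SELECT * FROM customers_ext;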
25. What is the difference between a data warehouse and a data mart?
This is a heavily debated issue. There are inherent similarities between the basic constructs used to design a data warehouse and a data mart. In general, a Data Warehouse is used on an enterprise level, while a Data Mart is used on a business division/department level. A data mart only contains the subject-specific data required for local analysis.
26. What is the difference between a W/H and an OLTP application?
Typical
relational databases are designed for on-line transactional processing (OLTP)
and do not meet the requirements for effective on-line analytical processing
(OLAP). As a result, data warehouses are designed differently than traditional
relational databases.
Warehouses
are Time Referenced, Subject-Oriented, Non-volatile (read only) and Integrated.
OLTP
databases are designed to maintain atomicity, consistency and integrity (the
"ACID" tests). Since a data warehouse is not updated, these
constraints are relaxed.
27. What is the difference between OLAP, ROLAP, MOLAP and HOLAP?
ROLAP, MOLAP and HOLAP are specialized OLAP (Online Analytical Processing) applications.
ROLAP stands for Relational OLAP. Users see their data organized in cubes with dimensions, but the data is really stored in a Relational Database (RDBMS) like Oracle. Because the RDBMS stores data at a fine-grained level, response times are usually slow.
MOLAP stands for Multidimensional OLAP. Users see their data organized in cubes with dimensions, but the data is stored in a Multi-dimensional database (MDBMS) like Oracle Express Server. In a MOLAP system many queries have a finite, pre-computed answer, and performance is usually fast.
HOLAP stands for Hybrid OLAP; it is a combination of both worlds. Seagate Software's Holos is an example HOLAP environment. In a HOLAP system one will find queries on aggregated data as well as on detailed data.
28. What is the difference between an ODS and a W/H?
An
ODS (Operational Data Store) is an integrated database of operational data. Its
sources include legacy systems and it contains current or near term data. An
ODS may contain 30 to 90 days of information.
A
warehouse typically contains years of data (Time Referenced). Data warehouses
group data by subject rather than by activity (subject-oriented). Other
properties are: Non-volatile (read only) and Integrated.
29. What Oracle tools can be used to design and build a W/H?
Data
Warehouse Builder (or Oracle Data Mart builder), Oracle Designer, Oracle
Express, Express Objects, etc.
30. When should one use an MD-database (multi-dimensional database) and not a relational one?
Data in a multi-dimensional database is stored as business people view it, allowing them to slice and dice the data to answer business questions. When designed correctly, an OLAP database will provide much faster response times for analytical queries.
Normal relational databases store data in two-dimensional tables, and analytical queries against them are normally very slow.
31. What is a star schema? Why does one design this way?
A single "fact table" containing a compound primary key, with one segment for each "dimension," and additional columns of additive, numeric facts.
Why?
- It allows for the highest level of flexibility of metadata
- Low maintenance as the data warehouse matures
- Best possible performance
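A minimal sketch of such a fact table in SQL, assuming hypothetical DATE_DIM and PRODUCT_DIM dimension tables:

CREATE TABLE sales_fact (
  date_key     NUMBER NOT NULL REFERENCES date_dim (date_key),
  product_key  NUMBER NOT NULL REFERENCES product_dim (product_key),
  sales_amount NUMBER,                    -- additive numeric fact
  units_sold   NUMBER,                    -- additive numeric fact
  PRIMARY KEY (date_key, product_key)     -- compound primary key, one segment per dimension
);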
32. When should you use a STAR and when a SNOWFLAKE schema?
The star schema is the simplest data warehouse schema. A snowflake schema is similar to the star schema, but it normalizes the dimension tables to save data storage space. It can be used to represent hierarchies of information.
33. What is the difference between Oracle Express and Oracle Discoverer?
Express is an MD database and development environment. Discoverer is an ad-hoc end-user query tool.
34. How can Oracle Materialized Views be used to speed up data warehouse queries?
With "Query Rewrite" (QUERY_REWRITE_ENABLED=TRUE in INIT.ORA), Oracle can direct queries to use pre-aggregated tables instead of scanning large tables to answer complex queries.
Materialized views in a W/H environment are typically referred to as summaries, because they store summarized data.
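A minimal sketch of turning rewrite on for a session (the materialized view itself must be created with ENABLE QUERY REWRITE, as in the example under question 6):

ALTER SESSION SET QUERY_REWRITE_ENABLED = TRUE;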
35. What Oracle features can be used to optimize my Warehouse system?
The following Oracle features can be used to complement your Warehouse system/database:
From Oracle8i one can transport tablespaces between Oracle databases. Using this feature one can easily "detach" a tablespace for archiving purposes. One can also use this feature to quickly move data from an OLTP database to a Warehouse database.
Data partitioning allows one to split big tables into smaller, more manageable sub-tables (partitions). Data is automatically directed to the correct partition based on data ranges or hash values.
Oracle Materialized Views can be used to pre-aggregate data. The Query Optimizer can direct queries to summary/roll-up tables instead of the detail data tables (query rewrite). This will dramatically speed up warehouse queries and save valuable machine resources.
Oracle Parallel Query can be used to speed up data retrieval by using multiple processes (and CPUs), as sketched below.
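A minimal sketch of a parallel query hint, assuming a hypothetical SALES table:

SELECT /*+ PARALLEL(s, 4) */ SUM(amount)
FROM   sales s;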
36. What is a star schema?
A star schema is a schema in which a fact table is connected to multiple dimensions. In the case of BW there are 16 dimensions, of which a user can create a maximum of 13; 3 dimensions are SAP-defined. The SAP-defined dimensions are Unit, Time and Package ID. Each dimension can consist of a maximum of 255 characteristic InfoObjects.
37. What is a fact?
Facts are numerical values, such as number, amount, quantity, etc., which are used for analysis with respect to dimensions.
38. What are dimensions?
Dimensions are a set of related characteristic objects on which a user wants to analyze the facts. A single fact table can consist of a maximum of 255 characteristic InfoObjects.
39. What is a partition? How many types of partition are there?
Partitioning addresses the key problem of supporting very large tables and indexes by allowing them to be decomposed into smaller, more manageable pieces called partitions.
Types of partition:
Range partition: partitions the data in a table or index according to a range of values.
Hash partition: partitions the data according to a hash function.
Composite partition: partitions the data by range and further subdivides the data into sub-partitions using a hash function.
List partition: partitions the data according to discrete lists of values.
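A minimal sketch of a range-partitioned table, assuming a hypothetical SALES table partitioned by year:

CREATE TABLE sales (
  sale_date DATE,
  amount    NUMBER
)
PARTITION BY RANGE (sale_date) (
  PARTITION p2003 VALUES LESS THAN (TO_DATE('01-JAN-2004', 'DD-MON-YYYY')),
  PARTITION p2004 VALUES LESS THAN (TO_DATE('01-JAN-2005', 'DD-MON-YYYY')),
  PARTITION pmax  VALUES LESS THAN (MAXVALUE)
);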
40. What is an index? How many types of index are there?
Indexes are optional structures associated with tables and clusters.
Types of index:
- B-tree indexes
- B-tree cluster indexes
- Hash cluster indexes
- Reverse key indexes
- Bitmap indexes
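A minimal sketch of creating a B-tree and a bitmap index, assuming the hypothetical SALES_FACT table from question 31:

CREATE INDEX sales_fact_date_idx ON sales_fact (date_key);            -- B-tree (the default index type)
CREATE BITMAP INDEX sales_fact_prod_idx ON sales_fact (product_key);  -- bitmap, suited to low-cardinality columns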
..............................To Be Continued