DW&BI Concepts Interview Questions and Answers


Common Interview Question and Answers for 
DATAWAREHOUSING CONCEPTS

            
1: Can we lookup a table from source qualifier transformation. ie. Unconnected lookup
Answer : You cannot lookup from a source qualifier directly. However, you can override the SQL in the source qualifier to join with the lookup table to perform the lookup.

2.How can we use mapping variables in Informatica? Where do we use them?
Answer :  Yes. we can use mapping variable in Informatica.
The Informatica server saves the value of mapping variable to the repository at the end of session run and uses that value next time we run the session.

3. Can we override a native sql query within Informatica? Where do we do it? How do we do it? 
Answer: Yes, we can override a native sql query in source qualifier and lookup transformation.
In lookup transformation we can find "Sql override" in lookup properties.by using this option we can do this.

4.How to determine what records to extract?
Answer : When addressing a table some dimension key must reflect the need for a record to get extracted. Mostly it will be from time dimension (e.g. date >= 1st of current mth) or a transaction flag (e.g. Order Invoiced Stat). Foolproof would be adding an archive flag to record which gets reset when record changes.
 
5.When do we Analyze the tables? How do we do it? 
Answer :  The ANALYZE statement allows you to validate and compute statistics for an index, table, or cluster. These statistics are used by the cost-based optimizer when it calculates the most efficient plan for retrieval. In addition to its role in statement optimization, ANALYZE also helps in validating object structures and in managing space in your system. You can choose the following operations: COMPUTER, ESTIMATE, and DELETE. Early version of Oracle7 produced unpredicatable results when the ESTIMATE operation was used. It is best to compute your statistics.
EX:
select OWNER,
sum(decode(nvl(NUM_ROWS,9999), 9999,0,1)) analyzed,
sum(decode(nvl(NUM_ROWS,9999), 9999,1,0)) not_analyzed,
count(TABLE_NAME) total
from dba_tables
where OWNER not in ('SYS', 'SYSTEM')
group by OWNER

6. What are snapshots?  What are materialized views & where do we use them?
Answer :  Snapshots are read-only copies of a master table located on a remote node which is periodically refreshed to reflect changes made to the master table. Snapshots are mirror or replicas of tables.

Views are built using the columns from one or more tables. The Single Table View can be updated but the view with multi table cannot be updated.

A View can be updated/deleted/inserted if it has only one base table if the view is based on columns from one or more tables then insert, update and delete is not possible.

Materialized view
A pre-computed table comprising aggregated or joined data from fact and possibly dimensions tables. Also known as a summary or aggregate table.

7. What is the metadata extension?
Answer :  Informatica allows end users and partners to extend the metadata stored in the repository by associating information with individual objects in the repository. For example, when you create a mapping, you can store your contact information with the mapping. You associate information with repository metadata using metadata extensions.

Informatica Client applications can contain the following types of metadata extensions:

Vendor-defined. Third-party application vendors create vendor-defined metadata extensions. You can view and change the values of vendor-defined metadata extensions, but you cannot create, delete, or redefine them.
User-defined. You create user-defined metadata extensions using PowerCenter/PowerMart. You can create, edit, delete, and view user-defined metadata extensions. You can also change the values of user-defined extensions.

8. What is partitioning? What are the types of partitioning?
Ans: If you use PowerCenter, you can increase the number of partitions in a pipeline to improve session performance. Increasing the number of partitions allows the Informatica Server to create multiple connections to sources and process partitions of source data concurrently.

When you create a session, the Workflow Manager validates each pipeline in the mapping for partitioning. You can specify multiple partitions in a pipeline if the Informatica Server can maintain data consistency when it processes the partitioned data.

When you configure the partitioning information for a pipeline, you must specify a partition type at each partition point in the pipeline.
The partition type determines how the Informatica Server redistributes data across partition points.

The Workflow Manager allows you to specify the following partition types:

Round-robin partitioning. The Informatica Server distributes data evenly among all partitions. Use round-robin partitioning where you want each partition to process approximately the same number of rows.
Hash partitioning. The Informatica Server applies a hash function to a partition key to group data among partitions. If you select hash auto-keys, the Informatica Server uses all grouped or sorted ports as the partition key. If you select hash user keys, you specify a number of ports to form the partition key. Use hash partitioning where you want to ensure that the Informatica Server processes groups of rows
with the same partition key in the same partition.
Key range partitioning. You specify one or more ports to form a compound partition key. The Informatica Server passes data to each partition depending on the ranges you specify for each port. Use key range partitioning where the sources or targets in the pipeline are partitioned by key range. For more information, see Key Range Partitioning.
Pass-through partitioning. The Informatica Server passes all rows at one partition point to the next partition point without redistributing them. Choose pass-through partitioning where you want to create an additional pipeline stage to improve performance, but do not want to change the distribution of data across partitions.

9.Where do we use connected and un connected lookups ?
 Answer :  If return port only one then we can go for unconnected. More than one return port is not possible with Unconnected. If more than one return port then go for Connected.

10. Do we need an ETL tool? When do we go for the tools in the market? 
Ans: ETL Tool: It is used to Extract(E) data from multiple source systems(like RDBMS,Flat files,Mainframes,SAP,XML etc) transform(T) them based on Business requirements and Load(L) in target locations.(like tables,files etc).

Need of ETL Tool:

An ETL tool is typically required when data scattered accross different systems.(like RDBMS,Flat files,Mainframes,SAP,XML etc).


11. What is Full load & Incremental or Refresh load?
Answer:
Full Load: completely erasing the contents of one or more tables and reloading with fresh data.
Incremental Load: applying ongoing changes to one or more tables based on a predefined schedule

12. What is the difference between etl tool and olap tools ?
Answer :  ETL tool is ment for extraction data from the legecy systems and load into specified data base with some process of cleansing data.
ex: Informatica,data stage ....etc
OLAP is ment for Reporting purpose.in OLAP data avaliable in Mulitidimectional model. so that u can write smple query to extract data fro the data base.

13. Compare ETL & Manual development? 
Answer :  ETL - The process of extracting data from multiple sources.(ex. flat files,XML, COBOL, SAP etc) is more simpler with the help of tools.
Manual - Loading the data other than flat files and oracle table need more effort.
ETL - High and clear visibilty of logic.
Manual - complex and not so user friendly visibilty of logic.
ETL - Contains Meta data and changes can be done easily.
Manual - No Meta data concept and changes needs more effort.
ETL- Error hadling,log summary and load progess makes life easier for developer and maintainer.
Manual - need maximum effort from maintainance point of view.
ETL - Can handle Historic data very well.
Manual - as data grows the processing time degrads.
These are some differences b/w manual and ETL developement.

14: What is a staging area? Do we need it? What is the purpose of a staging area? 
Ans: Data staging is actually a collection of processes used to prepare source system data for loading a data warehouse. Staging includes the following steps:
Source data extraction, Data transformation (restructuring),
Data transformation (data cleansing, value transformations),
Surrogate key assignments

15. What are the various transformation available ?
Ans : Aggregator Transformation
Expression Transformation
Filter Transformation
Joiner Transformation
Lookup Transformation
Normalizer Transformation
Rank Transformation
Router Transformation
Sequence Generator Transformation
Stored Procedure Transformation
Sorter Transformation
Update Strategy Transformation
XML Source Qualifier Transformation
Advanced External Procedure Transformation
External Transformation

16. Where do we use semi and non additive facts ?
 Answer  :  Additve: A masure can participate arithmatic calulatons using all or any demensions.
Ex: Sales profit
Semi additive: A masure can participate arithmatic calulatons using some demensions.
Ex: Sales amount
Non Additve:A masure can't participate arithmatic calulatons using demensions.

17. What are active transformation / Passive transformations? 
Answer :  
Active transformation can change the number of rows that pass through it. (decrease or increase rows)
Passive transformation can not change the number of rows that pass through it.

 18. What is a mapping, session, worklet, workflow, mapplet? 
Answer :  A mapping represents dataflow from sources to targets.
A mapplet creates or configures a set of transformations.
A workflow is a set of instruction sthat tell the Informatica server how to execute the tasks.
A worklet is an object that represents a set of tasks.
A session is a set of instructions that describe how and when to move data from sources to targets.

 19. What are the various tools? - Name a few. 
Answer :  A few are ....
- Abinitio
- DataStage
- Informatica
- Cognos Decision Stream
- Oracle Warehouse Builder
- Business Objects XI (Extreme Insight)
- SAP Business Warehouse
- SAS Enterprise ETL Server

20: What is a three tier data warehouse? 
Answer :  A data warehouse can be thought of as a three-tier system in which a middle system provides usable data in a secure way to end users. On either side of this middle system are the end users and the back-end data stores.

21. What are the modules in Power Mart? 
1. PowerMart Designer
2. Server
3. Server Manager
4. Repository
5. Repository Manager

21. What are the different Lookup methods used in Informatica? 
Answer :  1. Connected lookup
2. Unconnected lookup

Connected lookup will receive input from the pipeline and sends output to the pipeline and can return any number of values.it does not contain retun port.

Unconnected lookup can return only one column. it containn return port.

22.What is a Data Warehouse?
A Data Warehouse is the "corporate memory". Academics will say it is a subject oriented, point-in-time, inquiry only collection of operational data.
Typical relational databases are designed for on-line transactional processing (OLTP) and do not meet the requirements for effective on-line analytical processing (OLAP). As a result, data warehouses are designed differently than traditional relational databases.


23.What is ETL/ How does Oracle support the ETL process?
ETL is the Data Warehouse acquisition processes of Extracting, Transforming (or Transporting) and Loading (ETL) data from source systems into the data warehouse.
Oracle supports the ETL process with their "Oracle Warehouse Builder" product. Many new features in the Oracle9i database will also make ETL processing easier. For example:

New MERGE command (also called UPSERT, Insert and update information in one step);
External Tables allows users to run SELECT statements on external data files (with pipelining support).


24.What is the difference between a data warehouse and a data mart?
This is a heavily debated issue. There are inherent similarities between the basic constructs used to design a data warehouse and a data mart. In general a Data Warehouse is used on an enterprise level, while Data Marts is used on a business division/department level. A data mart only contains the required subject specific data for local analysis.


25.What is the difference between a W/H and an OLTP application?
Typical relational databases are designed for on-line transactional processing (OLTP) and do not meet the requirements for effective on-line analytical processing (OLAP). As a result, data warehouses are designed differently than traditional relational databases.
Warehouses are Time Referenced, Subject-Oriented, Non-volatile (read only) and Integrated.

OLTP databases are designed to maintain atomicity, consistency and integrity (the "ACID" tests). Since a data warehouse is not updated, these constraints are relaxed.


26.What is the difference between OLAP, ROLAP, MOLAP and HOLAP?
ROLAP, MOLAP and HOLAP are specialized OLAP (Online Analytical Analysis) applications.
ROLAP stands for Relational OLAP. Users see their data organized in cubes with dimensions, but the data is really stored in a Relational Database (RDBMS) like Oracle. The RDBMS will store data at a fine grain level, response times are usually slow.

MOLAP stands for Multidimensional OLAP. Users see their data organized in cubes with dimensions, but the data is store in a Multi-dimensional database (MDBMS) like Oracle Express Server. In a MOLAP system lot of queries have a finite answer and performance is usually critical and fast.

HOLAP stands for Hybrid OLAP, it is a combination of both worlds. Seagate Software's Holos is an example HOLAP environment. In a HOLAP system one will find queries on aggregated data as well as on detailed data.

27. What is the difference between an ODS and a W/H?
An ODS (Operational Data Store) is an integrated database of operational data. Its sources include legacy systems and it contains current or near term data. An ODS may contain 30 to 90 days of information.
A warehouse typically contains years of data (Time Referenced). Data warehouses group data by subject rather than by activity (subject-oriented). Other properties are: Non-volatile (read only) and Integrated.


28.What Oracle tools can be used to design and build a W/H?
Data Warehouse Builder (or Oracle Data Mart builder), Oracle Designer, Oracle Express, Express Objects, etc.


29.When should one use an MD-database (multi-dimensional database) and not a relational one?
Data in a multi-dimensional database is stored as business people views it, allowing them to slice and dice the data to answer business questions. When designed correctly, an OLAP database will provide must faster response times for analytical queries.
Normal relational databases store data in two-dimensional tables and analytical queries against them are normally very slow.



30.What is a star schema? Why does one design this way?
A single "fact table" containing a compound primary key, with one segment for each "dimension," and additional columns of additive, numeric facts.
Why?

It allows for the highest level of flexibility of metadata
Low maintenance as the data warehouse matures
Best possible performance

31.When should you use a STAR and when a SNOW-FLAKE schema?
The star schema is the simplest data warehouse schema. Snow flake schema is similar to the star schema. It normalizes dimension table to save data storage space. It can be used to represent hierarchies of information.


32.What is the difference between Oracle Express and Oracle Discoverer?
Express is an MD database and development environment. Discoverer is an ad-hoc end-user query tool.



33.How can Oracle Materialized Views be used to speed up data warehouse queries?
With "Query Rewrite" (QUERY_REWRITE_ENABLED=TRUE in INIT.ORA) Oracle can direct queries to use pre-aggregated tables instead of scanning large tables to answer complex queries.
Materialized views in a W/H environments is typically referred to as summaries, because they store summarized data.

34.What Oracle features can be used to optimize my Warehouse system?
The following Oracle features can be used to compliment your Warehouse system/database:
From Oracle8i One can transport tablespaces between Oracle databases. Using this feature one can easily "detach" a tablespace for archiving purposes. One can also use this feature to quickly move data from an OLTP database to a Warehouse database.

Data partitioning allows one to split big tables into smaller more manageable sub-tables (partitions). Data is automatically directed to the correct partition based on data ranges or hash values.

Oracle Materialized Views can be used to pre-aggregate data. The Query Optimizer can direct queries to summary/ roll-up tables instead of the detail data tables (query rewrite). This will dramatically speed-up warehouse queries and saves valuable machine resources.

Oracle Parallel Query can be used to speed up data retrieval by using multiple processes (and CPUs).


35. What is star schema .
Star Schema is a schema in which a fact table is connected to multiple dimensions. In case of BW there are 16 dimensions out of which a user can create a maximum of 13 dimensions and 3 dimensions are SAP defined. The SAP defined dimensions are Unit, Time and Package Id. Each dimension can consist of maximum 255 characterstic infoobjects.

36. what is fact
Fact are numerical values such as Number, Amount Quantity etc. which are used for the analysing purpose with respect to dimensions.

37. what is dimensions
Dimensions are a set of related characterstic objects on which a user wants to analyze the facts. A single fact table can consist of maximum 255 characterstic infoobjects..

38. What is Partition. How many type of Partition
Partitioning Address the Key problem of supporting very large table and indexex by allowing to decompose them into smaller and more manageable pieces  called partition
Type of partition .
Range Partition :- In range partition  the data in a table or index according to a range value .
Hash Partition : - partition are the data according to a hash function .
Composite partition :- partition the data by range and further subdivides the data into sub partitions using a hash function .

List Partition :-


39. What is Index . How many type of Index .
Indexes are optional structures associated with tables and cluster.
Type of Index :-
  1. B-Tree indexes
  2. B-Tree cluster indexes
  3. Hash cluster indexes
  4. Reverse key indexes
  5. Bitmap Indexes .
                                                                          ..............................To Be Continued

No comments:

Post a Comment