IBM InfoSphere DataStage Interview Questions
Special / Advanced Stages
IBM InfoSphere DataStage is an ETL tool used for data integration, transformation, and data warehousing. These questions cover its special and advanced stages, from basic concepts to practical scenarios, for interview and certification preparation.
🟢 Pivot Enterprise Stage (Q1–Q8)
Question 1: What is Pivot Enterprise Stage?
Answer:
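The Pivot Enterprise stage is a parallel job stage that pivots data in either direction. A vertical pivot converts rows into columns, collapsing multiple input rows into a single output row with multiple columns. A horizontal pivot does the reverse, spreading the columns of one input row across multiple output rows.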
Question 2: Why is Pivot Stage used?
Answer:
- Data summarization
- Reporting format transformation
- Converting normalized data into denormalized format
Question 3: What are key properties of Pivot Stage?
Answer:
- Pivot Key Column
- Pivot Value Column
- Grouping Columns
Question 4: What is Pivot Key?
Answer:
Column whose values become new column headers.
Question 5: What is Pivot Value Column?
Answer:
Column whose values fill the pivoted columns.
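The roles of the grouping column, pivot key, and pivot value column can be illustrated outside DataStage with a minimal Python sketch. This is not how the stage is implemented; the function name `pivot_rows` and the sample sales data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def pivot_rows(rows, group_col, key_col, value_col):
    """Turn one row per (group, key) into one row per group with a column per key."""
    pivoted = defaultdict(dict)
    for row in rows:
        # The pivot key's value becomes the output column name,
        # and the pivot value fills that column.
        pivoted[row[group_col]][row[key_col]] = row[value_col]
    # Re-attach the grouping column to each output row.
    return [{group_col: group, **cols} for group, cols in pivoted.items()]

# Hypothetical sample data: one row per product and region.
sales = [
    {"product": "A", "region": "East", "amount": 100},
    {"product": "A", "region": "West", "amount": 150},
    {"product": "B", "region": "East", "amount": 80},
]
result = pivot_rows(sales, "product", "region", "amount")
print(result)
```

Here `product` is the grouping column, `region` is the pivot key, and `amount` is the pivot value column. A group with no row for a given key simply lacks that column in this sketch.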
Question 6: Difference between Pivot and Aggregator?
Answer:
- Pivot → restructures data
- Aggregator → summarizes data
Question 7: What happens with duplicate pivot keys?
Answer:
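Rows that share the same grouping values and pivot key compete for the same output cell. Depending on configuration, their values are aggregated into that cell or the extra rows produce additional output.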
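One way to picture the aggregation outcome is the sketch below, which sums values that land on the same cell. Summing is an assumed choice for illustration, and `pivot_with_sum` and the sample rows are hypothetical names and data, not part of DataStage.

```python
from collections import defaultdict

def pivot_with_sum(rows, group_col, key_col, value_col):
    """Pivot rows to columns, summing values that land on the same cell."""
    cells = defaultdict(lambda: defaultdict(int))
    for row in rows:
        # Duplicate (group, key) pairs accumulate instead of overwriting.
        cells[row[group_col]][row[key_col]] += row[value_col]
    return [{group_col: group, **dict(cols)} for group, cols in cells.items()]

# Hypothetical sample data with a duplicate pivot key for product A / East.
dup_sales = [
    {"product": "A", "region": "East", "amount": 100},
    {"product": "A", "region": "East", "amount": 25},
    {"product": "A", "region": "West", "amount": 150},
]
summed = pivot_with_sum(dup_sales, "product", "region", "amount")
print(summed)
```

Without the aggregation step, a plain assignment would silently keep only the last duplicate, which is why the duplicate-handling rule should be decided explicitly.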
Question 8: Real-world use case?
Answer:
Converting sales data stored as one row per region into a single row with one column per region.
🟢 Unpivot Stage (Q9–Q16)
Question 9: What is Unpivot Stage?
Answer:
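Unpivot converts columns into rows (the reverse of a rows-to-columns pivot). In DataStage this is normally done with the Pivot Enterprise stage in horizontal pivot mode rather than with a separately named stage.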
Question 10: Why use Unpivot?
Answer:
- Normalize data
- Prepare data for processing
- Simplify transformations
Question 11: What are key properties?
Answer:
- Input columns
- Output key column
- Output value column
Question 12: Difference between Pivot and Unpivot?
Answer:
- Pivot: Rows → Columns
- Unpivot: Columns → Rows
Question 13: What is key column in Unpivot?
Answer:
Stores original column names.
Question 14: What is value column?
Answer:
Stores values of original columns.
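The key column and value column described above can be sketched in plain Python as follows. This is an illustration under assumed names (`unpivot_row`, `month`, `amount`) with hypothetical sample data, not the stage's implementation.

```python
def unpivot_row(row, id_cols, value_cols, key_name="month", value_name="amount"):
    """Emit one output row per input column listed in value_cols."""
    out = []
    for col in value_cols:
        new_row = {c: row[c] for c in id_cols}  # carry identifying columns through
        new_row[key_name] = col                 # key column: original column name
        new_row[value_name] = row[col]          # value column: that column's value
        out.append(new_row)
    return out

# Hypothetical wide row with one column per month.
wide = {"product": "A", "Jan": 10, "Feb": 20, "Mar": 30}
long_rows = unpivot_row(wide, ["product"], ["Jan", "Feb", "Mar"])
print(long_rows)
```

One input row with three unpivoted columns yields three output rows, which is the row-count growth to keep in mind when sizing downstream stages.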
Question 15: Performance considerations?
Answer:
The operation itself is cheap, but the output grows to roughly input rows × number of unpivoted columns, so downstream stages have more rows to process.
Question 16: Real-world use case?
Answer:
Converting monthly columns (Jan, Feb, Mar) into row-wise format.
🟢 Surrogate Key Generator Stage (Q17–Q24)
Question 17: What is Surrogate Key?
Answer:
A system-generated unique identifier used in data warehouse tables.
Question 18: What is Surrogate Key Generator Stage?
Answer:
Generates unique numeric keys for records.
Question 19: Why use surrogate keys?
Answer:
- Avoid dependency on natural keys
- Improve performance
- Maintain uniqueness
Question 20: What are key properties?
Answer:
- Key column name
- Initial value
- Increment value
Question 21: How does it ensure uniqueness?
Answer:
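By drawing every new value from a single key source (a state file or a database sequence) and incrementing from the last value issued, so no value is handed out twice.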
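The initial value and increment properties can be pictured with a minimal Python sketch. The function `assign_surrogate_keys`, the column name `customer_sk`, and the sample rows are hypothetical; a single in-memory counter stands in for the stage's key source.

```python
from itertools import count

def assign_surrogate_keys(rows, key_col="customer_sk", initial=1, increment=1):
    """Prepend a generated numeric key to each row."""
    keys = count(start=initial, step=increment)  # single counter, never repeats
    return [{key_col: next(keys), **row} for row in rows]

# Hypothetical sample rows without a surrogate key.
customers = [{"name": "Ann"}, {"name": "Bob"}, {"name": "Cy"}]
keyed = assign_surrogate_keys(customers, initial=1000, increment=1)
print(keyed)
```

Uniqueness here follows from using one counter for all rows; the keys carry no business meaning, which is the point of a surrogate key.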
Question 22: Difference between natural key and surrogate key?
Answer:
- Natural key → business key
- Surrogate key → system-generated
Question 23: What happens on job restart?
Answer:
It continues from the last generated key, provided the key source (state file or database sequence) persists between runs.
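The restart behavior can be sketched with a small file-backed counter. The class `FileBackedKeySource` and its JSON file layout are hypothetical stand-ins for a persistent key source; they do not reflect DataStage's actual state file format.

```python
import json
import os
import tempfile

class FileBackedKeySource:
    """Hypothetical persistent key source (not DataStage's state file format)."""

    def __init__(self, path, initial=1, increment=1):
        self.path = path
        self.increment = increment
        if os.path.exists(path):
            # Resume from the value saved by a previous run.
            with open(path) as f:
                self.next_value = json.load(f)["next_value"]
        else:
            self.next_value = initial

    def next_key(self):
        key = self.next_value
        self.next_value += self.increment
        # Persist after every key so a restart never reissues a value.
        with open(self.path, "w") as f:
            json.dump({"next_value": self.next_value}, f)
        return key

state_path = os.path.join(tempfile.mkdtemp(), "customer_sk.state")
first_run = FileBackedKeySource(state_path)
run1 = [first_run.next_key() for _ in range(3)]   # first job run
second_run = FileBackedKeySource(state_path)      # simulated restart
run2 = [second_run.next_key() for _ in range(2)]
print(run1, run2)
```

Writing the file on every key is simple but slow; a real key source would more plausibly reserve keys in blocks, at the cost of possible gaps after a failure.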
Question 24: Real-world use case?
Answer:
Generating a Customer_ID surrogate key for a customer dimension table in a data warehouse.
🟢 Change Capture Stage (CDC) (Q25–Q32)
Question 25: What is Change Capture Stage?
Answer:
Identifies differences between two datasets (source vs target).
Question 26: Why use CDC?
Answer:
- Incremental loading
- Detect inserts, updates, deletes
Question 27: What are change types?
Answer:
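- Insert (change_code 1 by default)
- Update, called Edit in the stage (change_code 3 by default)
- Delete (change_code 2 by default)
- No change, called Copy in the stage (change_code 0 by default)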
Question 28: What inputs are required?
Answer:
- Before dataset
- After dataset
Question 29: How does CDC compare data?
Answer:
Using key columns and value comparison.
Question 30: What is key column in CDC?
Answer:
Used to match records between datasets.
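The comparison logic described above (match on key columns, then compare values) can be sketched in Python. The function `capture_changes` and the sample data are hypothetical, and the numeric codes are an assumption believed to match the stage's default change codes.

```python
# Assumed change codes (believed to match the stage's defaults).
COPY, INSERT, DELETE, EDIT = 0, 1, 2, 3

def capture_changes(before, after, key_col, value_cols):
    """Return (key, change_code) pairs by comparing two datasets on key_col."""
    before_by_key = {r[key_col]: r for r in before}
    after_by_key = {r[key_col]: r for r in after}
    changes = []
    for key in sorted(before_by_key.keys() | after_by_key.keys()):
        old, new = before_by_key.get(key), after_by_key.get(key)
        if old is None:
            changes.append((key, INSERT))   # key only in the after dataset
        elif new is None:
            changes.append((key, DELETE))   # key only in the before dataset
        elif any(old[c] != new[c] for c in value_cols):
            changes.append((key, EDIT))     # key matches, a value differs
        else:
            changes.append((key, COPY))     # key and values match
    return changes

# Hypothetical before and after datasets.
before = [{"id": 1, "city": "Pune"}, {"id": 2, "city": "Delhi"}, {"id": 3, "city": "Goa"}]
after = [{"id": 1, "city": "Pune"}, {"id": 2, "city": "Mumbai"}, {"id": 4, "city": "Agra"}]
changes = capture_changes(before, after, "id", ["city"])
print(changes)
```

This sketch uses dictionaries, so it needs no sorting; a streaming comparison over large inputs would instead walk two key-sorted inputs in step.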
Question 31: Performance considerations?
Answer:
Both inputs must be partitioned and sorted on the key columns → the sort can be expensive on large datasets.
Question 32: Real-world use case?
Answer:
Detecting changes in customer records for incremental ETL.
🟢 Slowly Changing Dimension Stage (SCD) (Q33–Q42)
Question 33: What is Slowly Changing Dimension?
Answer:
Technique to manage historical changes in dimension tables.
Question 34: What is SCD Stage?
Answer:
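A parallel job stage that compares incoming rows with a dimension table and generates the inserts and updates needed to apply Type 1 and Type 2 changes.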
Question 35: Types of SCD?
Answer:
- Type 1 → Overwrite
- Type 2 → History tracking
- Type 3 → Partial history
Question 36: What is SCD Type 1?
Answer:
Updates data without storing history.
Question 37: What is SCD Type 2?
Answer:
Maintains full history using new rows.
Question 38: What is SCD Type 3?
Answer:
Stores limited history (previous values).
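The contrast between Type 1 and Type 3 can be shown with a short Python sketch. The function names, the `previous_` column prefix, and the sample row are hypothetical conventions for illustration.

```python
def apply_type1(row, column, new_value):
    """Type 1: overwrite in place, losing the old value."""
    updated = dict(row)
    updated[column] = new_value
    return updated

def apply_type3(row, column, new_value):
    """Type 3: keep only the immediately previous value in a companion column."""
    updated = dict(row)
    updated["previous_" + column] = row[column]
    updated[column] = new_value
    return updated

# Hypothetical dimension row.
customer = {"customer_id": "C1", "city": "Pune"}
t1 = apply_type1(customer, "city", "Mumbai")
t3 = apply_type3(customer, "city", "Mumbai")
print(t1)
print(t3)
```

A second change under Type 3 would overwrite `previous_city`, which is why this approach holds only limited history.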
Question 39: Key columns in SCD?
Answer:
- Business key
- Surrogate key
- Effective date
Question 40: What is active flag?
Answer:
Indicates current record.
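How the business key, surrogate key, effective date, and active flag work together in a Type 2 change can be sketched as follows. The function `apply_type2`, the column names (`sk`, `business_key`, `end_date`, `is_current`), and the sample data are assumptions for illustration, not the SCD stage's actual behavior or schema.

```python
from datetime import date

def apply_type2(dimension, business_key, new_attrs, change_date):
    """Expire the current row for business_key and append a new current row."""
    next_sk = max((r["sk"] for r in dimension), default=0) + 1
    result = []
    for r in dimension:
        if r["business_key"] == business_key and r["is_current"]:
            # Close out the old version instead of overwriting it.
            r = {**r, "is_current": False, "end_date": change_date}
        result.append(r)
    result.append({
        "sk": next_sk,                  # new surrogate key for the new version
        "business_key": business_key,   # business key stays the same
        **new_attrs,
        "effective_date": change_date,
        "end_date": None,
        "is_current": True,             # active flag marks the current record
    })
    return result

# Hypothetical dimension with one current row.
dim = [{"sk": 1, "business_key": "C1", "address": "12 Oak St",
        "effective_date": date(2020, 1, 1), "end_date": None, "is_current": True}]
dim2 = apply_type2(dim, "C1", {"address": "9 Elm Ave"}, date(2023, 6, 1))
for r in dim2:
    print(r)
```

Both versions share the business key `C1` but have different surrogate keys, so fact rows loaded before the change keep pointing at the old address.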
Question 41: Difference between CDC and SCD?
Answer:
- CDC → detects change
- SCD → manages history
Question 42: Real-world use case?
Answer:
Tracking customer address changes over time.
🟢 Hash File Stage (Q43–Q50)
Question 43: What is Hash File Stage?
Answer:
A server job stage, formally named the Hashed File stage, that stores records in a hashed file so they can be retrieved quickly by key.
Question 44: Why use Hash File Stage?
Answer:
- Fast data retrieval
- Efficient lookup operations
Question 45: What is hashing?
Answer:
Technique to map keys to storage locations.
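The idea of mapping keys to storage locations can be sketched in Python. CRC32 is used here only as a convenient deterministic hash; it is an assumption and not the algorithm a hashed file actually uses. The function names and sample records are hypothetical.

```python
import zlib

def bucket_for(key, num_buckets):
    """Map a key to a bucket index with a deterministic hash (CRC32, assumed)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets

def build_buckets(records, key_col, num_buckets=8):
    """Place each record in the bucket chosen by hashing its key."""
    buckets = [[] for _ in range(num_buckets)]
    for rec in records:
        buckets[bucket_for(rec[key_col], num_buckets)].append(rec)
    return buckets

def lookup(buckets, key_col, key):
    """Scan only the one bucket the key hashes to, not every record."""
    for rec in buckets[bucket_for(key, len(buckets))]:
        if rec[key_col] == key:
            return rec
    return None

# Hypothetical sample records.
records = [{"id": i, "name": f"cust{i}"} for i in range(100)]
buckets = build_buckets(records, "id")
found = lookup(buckets, "id", 42)
print(found)
```

A lookup touches a single bucket, so its cost depends on bucket size rather than on the total number of records.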
Question 46: Types of Hash Files?
Answer:
- Static (fixed number of groups, set when the file is created)
- Dynamic (resizes automatically as data grows)
Question 47: What is primary key in Hash File?
Answer:
The key whose hashed value determines where a record is stored, which allows direct access on lookup.
Question 48: Difference between Hash File and Dataset?
Answer:
- Hash File → used in server jobs, optimized for key-based lookup
- Dataset → used in parallel jobs, partitioned storage for passing data between jobs
Question 49: What is overflow in Hash File?
Answer:
Occurs when a bucket (group) receives more data than it can hold. The excess is written to overflow storage, which slows access.
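Overflow can be pictured with a toy store that gives each bucket a fixed capacity and a separate overflow list. The class `HashedStore`, its capacity setting, and the CRC32 hash are all assumptions for illustration and do not describe the real file layout.

```python
import zlib

class HashedStore:
    """Toy hashed store with fixed-capacity buckets and per-bucket overflow."""

    def __init__(self, num_buckets=4, bucket_capacity=2):
        self.buckets = [[] for _ in range(num_buckets)]
        self.overflow = [[] for _ in range(num_buckets)]
        self.capacity = bucket_capacity

    def _index(self, key):
        return zlib.crc32(str(key).encode("utf-8")) % len(self.buckets)

    def put(self, key, value):
        # Assumes keys are unique; duplicates are not replaced in this sketch.
        i = self._index(key)
        if len(self.buckets[i]) < self.capacity:
            self.buckets[i].append((key, value))
        else:
            self.overflow[i].append((key, value))  # bucket full: spill over

    def get(self, key):
        i = self._index(key)
        # Overflow entries are searched after the primary bucket.
        for k, v in self.buckets[i] + self.overflow[i]:
            if k == key:
                return v
        return None

    def overflow_count(self):
        return sum(len(chain) for chain in self.overflow)

store = HashedStore(num_buckets=4, bucket_capacity=2)
for n in range(20):
    store.put(n, f"row{n}")
print(store.overflow_count())
```

With 20 records and only 8 primary slots, most records end up in overflow, and every lookup for them has to scan the longer overflow list. Sizing the file with enough buckets keeps this from happening.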
Question 50: Real-world use case?
Answer:
Loading reference data into a Hash File once, then using it for fast key-based lookups in large ETL jobs.
