IBM InfoSphere DataStage Interview Questions
Set Q
DataStage Interview Questions
Question 01: What is Slowly Changing Dimension (SCD)?
Answer:
A Slowly Changing Dimension (SCD) is a data warehousing technique for managing changes to dimension data over time. The chosen SCD type determines whether old values are preserved as history or overwritten, based on business requirements.
Question 02: What are the types of SCD?
Answer:
- Type 1 → Overwrite
- Type 2 → Maintain history
- Type 3 → Partial history
Question 03: What is SCD Type 1?
Answer:
In Type 1, old data is overwritten with new data. No history is maintained.
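The overwrite behaviour can be illustrated outside DataStage with a minimal Python sketch. The function name `apply_scd_type1` and the dictionary layout are hypothetical, chosen only for illustration; in a real job this would typically be an update against the dimension table.

```python
def apply_scd_type1(dimension, incoming):
    """Overwrite attributes in place; the previous values are lost."""
    for business_key, attributes in incoming.items():
        # Existing keys are overwritten, new keys are inserted.
        dimension[business_key] = dict(attributes)
    return dimension


# Hypothetical customer dimension keyed by business key.
customers = {"C001": {"city": "Pune"}}
apply_scd_type1(customers, {"C001": {"city": "Mumbai"}})
```

After the call, the dimension holds only the new city; nothing records that the customer was ever in Pune.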
Question 04: What is SCD Type 2?
Answer:
Type 2 maintains full history by inserting a new record for each change and expiring the previous one. Each record typically carries:
- Effective date
- End date
- Active flag
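The expire-and-insert pattern behind these three columns can be sketched in plain Python. This is an illustration under stated assumptions, not DataStage code: the function `apply_scd_type2`, the column names, and the high end date of 9999-12-31 for open records are all hypothetical conventions.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # assumed convention for an open-ended record


def apply_scd_type2(rows, business_key, attributes, change_date):
    """Expire the active row for the key (if changed) and insert a new one."""
    current = next(
        (r for r in rows
         if r["business_key"] == business_key and r["active_flag"] == "Y"),
        None,
    )
    if current is not None:
        if current["attributes"] == attributes:
            return rows  # nothing changed, keep the active row as it is
        current["end_date"] = change_date
        current["active_flag"] = "N"
    rows.append({
        "surrogate_key": max((r["surrogate_key"] for r in rows), default=0) + 1,
        "business_key": business_key,
        "attributes": dict(attributes),
        "effective_date": change_date,
        "end_date": HIGH_DATE,
        "active_flag": "Y",
    })
    return rows


history = []
apply_scd_type2(history, "C001", {"city": "Pune"}, date(2023, 1, 1))
apply_scd_type2(history, "C001", {"city": "Mumbai"}, date(2024, 6, 1))
```

The dimension ends up with two rows for the same business key: the Pune row closed off on the change date, and the Mumbai row marked active with a new surrogate key.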
Question 05: What is SCD Type 3?
Answer:
Stores limited history by keeping the previous value in an additional column alongside the current value.
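A short Python sketch shows why the history is limited. The function `apply_scd_type3` and the `previous_` column prefix are hypothetical names used only for this illustration.

```python
def apply_scd_type3(row, column, new_value):
    """Shift the current value into a 'previous' column, then overwrite it."""
    if row[column] != new_value:
        row[f"previous_{column}"] = row[column]
        row[column] = new_value
    return row


customer = {"customer_id": "C001", "city": "Pune", "previous_city": None}
apply_scd_type3(customer, "city", "Mumbai")
apply_scd_type3(customer, "city", "Delhi")
```

After the second change only Delhi and Mumbai remain on the row; the original value, Pune, is no longer recoverable.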
Question 06: How is SCD implemented in DataStage?
Answer:
Using:
- Surrogate Key Generator
- Lookup stage
- Transformer logic
- Slowly Changing Dimension (SCD) stage (parallel jobs; not available in older versions)
Question 07: What is Effective Date in SCD?
Answer:
The date when a record becomes active.
Question 08: What is Expiry Date?
Answer:
The date when a record becomes inactive.
Question 09: What is Current Flag?
Answer:
Indicates active record (Y/N).
Question 10: When to use SCD Type 2?
Answer:
When full historical tracking is required.
🔹 Change Data Capture (CDC)
Question 11: What is CDC?
Answer:
CDC identifies and captures only the rows that were inserted, updated, or deleted in the source since the last run, instead of reloading the full data set.
Question 12: Why use CDC?
Answer:
- Improves performance
- Reduces data load
- Enables incremental processing
Question 13: Types of CDC?
Answer:
- Timestamp-based
- Trigger-based
- Log-based
Question 14: What is Timestamp-based CDC?
Answer:
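Uses a last-updated timestamp column to extract rows changed since the previous run. It cannot detect hard deletes, because a deleted row leaves no timestamp behind.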
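As a rough illustration, the extraction amounts to filtering on a stored watermark and then advancing it. The function `extract_changes` and the `last_updated` column name are assumptions made for this sketch; in practice the filter is usually a `WHERE` clause in the source query.

```python
from datetime import datetime


def extract_changes(source_rows, last_extract_time):
    """Return rows modified after the watermark, plus the new watermark."""
    changed = [r for r in source_rows if r["last_updated"] > last_extract_time]
    new_watermark = max(
        (r["last_updated"] for r in changed), default=last_extract_time
    )
    return changed, new_watermark


orders = [
    {"order_id": 1, "last_updated": datetime(2024, 1, 1, 9, 0)},
    {"order_id": 2, "last_updated": datetime(2024, 1, 2, 14, 30)},
    {"order_id": 3, "last_updated": datetime(2024, 1, 3, 8, 15)},
]
changed_rows, watermark = extract_changes(orders, datetime(2024, 1, 2, 0, 0))
```

Only orders 2 and 3 are extracted, and the returned watermark would be saved for the next run.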
Question 15: What is Log-based CDC?
Answer:
Reads the database transaction logs to identify committed inserts, updates, and deletes without querying the source tables.
Question 16: How is CDC implemented in DataStage?
Answer:
- Using the Change Capture and Change Apply stages, or the CDC Transaction stage with InfoSphere CDC
- Using SQL queries
- Using comparison logic
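The comparison-logic option can be sketched as a key-by-key comparison of a "before" and an "after" data set. This is a simplified Python illustration, not the implementation of any DataStage stage; the function `capture_changes` and the change-code strings are hypothetical.

```python
def capture_changes(before, after):
    """Classify each business key as insert, update, or delete."""
    changes = []
    for key, row in after.items():
        if key not in before:
            changes.append(("insert", key))
        elif before[key] != row:
            changes.append(("update", key))
        # identical rows produce no change record
    for key in before:
        if key not in after:
            changes.append(("delete", key))
    return changes


yesterday = {"E1": {"dept": "HR"}, "E2": {"dept": "IT"}, "E3": {"dept": "Ops"}}
today = {"E1": {"dept": "HR"}, "E2": {"dept": "Finance"}, "E4": {"dept": "IT"}}
detected = capture_changes(yesterday, today)
```

Here E2 is reported as an update, E4 as an insert, and E3 as a delete, while the unchanged E1 is skipped.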
Question 17: What is Delta Load?
Answer:
Loading only changed data.
Question 18: Difference between Full Load and Incremental Load?
Answer:
| Full Load | Incremental Load |
|---|---|
| All data | Changed data |
| Slower on large volumes | Faster, since less data is moved |
Question 19: What is Soft Delete in CDC?
Answer:
Marking records as deleted instead of removing them.
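A minimal Python sketch of the idea follows, assuming a hypothetical `is_deleted` flag column and a hypothetical `soft_delete` helper; real targets often also store a deletion timestamp.

```python
def soft_delete(target, deleted_keys):
    """Flag rows as deleted instead of removing them from the target."""
    for key in deleted_keys:
        if key in target:
            target[key]["is_deleted"] = "Y"
    return target


products = {
    "P1": {"name": "Widget", "is_deleted": "N"},
    "P2": {"name": "Gadget", "is_deleted": "N"},
}
soft_delete(products, ["P2", "P9"])  # P9 is not in the target and is ignored
active_products = [k for k, v in products.items() if v["is_deleted"] == "N"]
```

The flagged row stays in the table for auditing, and downstream queries filter on the flag to see only active rows.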
Question 20: What are challenges in CDC?
Answer:
- Handling deletes
- Data consistency
- Performance
🔹 Surrogate Keys
Question 21: What is a Surrogate Key?
Answer:
A system-generated unique identifier for dimension tables.
Question 22: Why use Surrogate Keys?
Answer:
- Avoid dependency on business keys
- Improve join performance (compact numeric keys)
- Handle SCD
Question 23: What is Natural Key?
Answer:
A business-defined key (e.g., Employee ID).
Question 24: Difference between Surrogate and Natural Key?
Answer:
| Surrogate Key | Natural Key |
|---|---|
| System-generated | Business-defined |
| Numeric | Can be string |
Question 25: How to generate Surrogate Keys in DataStage?
Answer:
Using:
- Surrogate Key Generator stage
- Sequence logic
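The sequence-logic approach comes down to remembering the last key issued and incrementing from there. The class `KeySequence` below is a hypothetical Python sketch of that idea, not the Surrogate Key Generator stage itself.

```python
class KeySequence:
    """Issue increasing integer keys, resuming from the last used value."""

    def __init__(self, last_used=0):
        self.last_used = last_used

    def next_key(self):
        self.last_used += 1
        return self.last_used


# Assume the previous run ended at key 100 and that value was persisted.
sequence = KeySequence(last_used=100)
new_keys = [sequence.next_key() for _ in range(3)]
```

The design point is that `last_used` must be stored between runs (for example in a file or a database sequence), otherwise a later run would start again from the beginning and produce duplicates.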
Question 26: What is Key Management?
Answer:
Handling uniqueness and sequence of keys.
Question 27: What is Gap in Surrogate Keys?
Answer:
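Missing numbers in the key sequence. Gaps typically appear when a job fails or is rerun after keys were already allocated, or when keys are reserved in blocks and part of a block goes unused. Gaps are harmless, because a surrogate key only needs to be unique.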
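One way a gap can arise is block allocation: a run reserves a range of keys up front and does not use all of them. The Python sketch below is a hypothetical model (the class `BlockAllocator` and the block size of 10 are assumptions), not a description of how DataStage allocates keys internally.

```python
class BlockAllocator:
    """Hand out key ranges in fixed-size blocks."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.next_block_start = 1

    def reserve_block(self):
        start = self.next_block_start
        self.next_block_start += self.block_size
        return range(start, start + self.block_size)


allocator = BlockAllocator(block_size=10)

# First run reserves keys 1-10 but fails after loading three rows.
used_by_first_run = list(allocator.reserve_block())[:3]

# Second run reserves a fresh block, so keys 4-10 are never used.
used_by_second_run = list(allocator.reserve_block())[:2]
```

The keys actually loaded jump from 3 to 11, which is the gap; uniqueness is still preserved.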
Question 28: How to avoid duplicate keys?
Answer:
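- Generate keys from a single source, such as the Surrogate Key Generator stage with a state file or a database sequence
- Persist the last used key value between runs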
Question 29: Can Surrogate Keys be reused?
Answer:
No, they should be unique and not reused.
Question 30: What is Composite Key?
Answer:
Combination of multiple columns as key.
🔹 Hash File Stage
Question 31: What is Hash File Stage?
Answer:
The Hashed File stage is a server job stage that reads and writes hashed files. It is used mainly for fast key-based lookups against reference data.
Question 32: Why use Hash File Stage?
Answer:
- Fast access
- Efficient lookup
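- Well suited to reference data that is looked up repeatedly (very large files may need 64-bit hashed files or a different lookup approach)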
Question 33: How does Hash File work?
Answer:
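It applies a hashing algorithm to the key to decide which group (bucket) a record belongs to, so a record can be stored and retrieved without scanning the whole file.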
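The general mechanism can be sketched with a toy in-memory structure. This is only an illustration of hashing into groups: the class `ToyHashedFile`, the CRC32 hash, and the group sizes are assumptions and do not reflect the actual on-disk format or hashing algorithms of DataStage hashed files.

```python
import zlib


class ToyHashedFile:
    """Store records in groups selected by hashing the key."""

    def __init__(self, group_count=4, group_capacity=2):
        self.groups = [[] for _ in range(group_count)]
        self.group_capacity = group_capacity

    def _group_for(self, key):
        # A deterministic hash of the key picks the group.
        return zlib.crc32(key.encode("utf-8")) % len(self.groups)

    def write(self, key, record):
        group = self.groups[self._group_for(key)]
        for index, (existing_key, _) in enumerate(group):
            if existing_key == key:
                group[index] = (key, record)  # same key overwrites the record
                return
        group.append((key, record))

    def read(self, key):
        # Only one group is searched, not the whole file.
        for existing_key, record in self.groups[self._group_for(key)]:
            if existing_key == key:
                return record
        return None

    def overflowed_groups(self):
        """Count groups holding more records than their nominal capacity."""
        return sum(1 for g in self.groups if len(g) > self.group_capacity)


lookup = ToyHashedFile()
lookup.write("C001", {"city": "Pune"})
lookup.write("C001", {"city": "Mumbai"})
lookup.write("C002", {"city": "Delhi"})
```

Writing the same key twice leaves a single record with the latest values, and a read touches only the group the key hashes to. When too many keys land in one group, it exceeds its capacity, which is the overflow condition that slows lookups.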
Question 34: What are types of Hash Files?
Answer:
- Static
- Dynamic
Question 35: What is Dynamic Hash File?
Answer:
A hashed file (Type 30) that resizes itself automatically, splitting and merging groups as the data volume changes.
Question 36: What is Primary Key in Hash File?
Answer:
Key used for hashing and lookup.
Question 37: What is Overflow in Hash File?
Answer:
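Occurs when a group (bucket) has no room left for a record that hashes to it. The record is placed in an overflow area, which makes later reads slower.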
Question 38: How to improve Hash File performance?
Answer:
- Proper key selection
- Adequate sizing
- Avoid overflow
🔹 Shared Containers
Question 39: What is Shared Container?
Answer:
Reusable job logic stored separately and used across multiple jobs.
Question 40: Why use Shared Containers?
Answer:
- Reusability
- Maintainability
- Standardization
Question 41: How to create Shared Container?
Answer:
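- In the Designer, select the stages and links to reuse, then choose Edit > Construct Container > Shared
- Or create a new shared container from the repository and build the logic inside it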
Question 42: Can Shared Containers have parameters?
Answer:
Yes, they can accept parameters.
Question 43: What is advantage of Shared Container?
Answer:
Centralized logic → Easy updates.
Question 44: What is disadvantage of Shared Container?
Answer:
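- Dependency issues between the container and the jobs that use it
- A change affects every job that uses the container, and those jobs must be recompiled to pick it up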
🔹 Local Containers
Question 45: What is Local Container?
Answer:
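A group of stages and links collapsed into a single container inside one job. It is used mainly to tidy up a complex job design and is not available to other jobs.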
Question 46: Difference between Shared and Local Container?
Answer:
| Shared | Local |
|---|---|
| Reusable across jobs | Within job only |
| Stored separately | Inside job |
Question 47: When to use Local Container?
Answer:
- To simplify a crowded job diagram by grouping related stages
- When the logic is needed only within the same job
Question 48: Can Local Containers be converted to Shared?
Answer:
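Yes. A local container can be converted to a shared container from the Designer, after which it is stored in the repository and can be used by other jobs.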
Question 49: What is best practice for Containers?
Answer:
- Use Shared containers for logic that is common to several jobs
- Use Local containers to simplify the design of a single job
Question 50: Real-world use of Containers?
Answer:
- Standard transformations
- Data cleansing logic
- Reusable pipelines
