IBM InfoSphere DataStage Interview Questions

Set Q

IBM InfoSphere DataStage is an ETL tool used for data integration, transformation, and data warehousing. This set of interview questions covers Slowly Changing Dimensions, Change Data Capture, surrogate keys, the Hash File stage, and shared and local containers.

🔹 Slowly Changing Dimensions (SCD)

Question 01: What is Slowly Changing Dimension (SCD)?

Answer:
A Slowly Changing Dimension (SCD) is a data warehousing technique for handling changes to dimension attributes over time. Depending on business requirements, a change either overwrites the old value or is stored alongside it so that history is preserved.

Question 02: What are the types of SCD?

Answer:

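Type 1 → Overwrite the old value (no history)
Type 2 → Add a new row for each change (full history)
Type 3 → Keep the previous value in an extra column (limited history)

These are the three most commonly used types; others, such as Type 0, Type 4, and Type 6, also exist.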

Question 03: What is SCD Type 1?

Answer:
In Type 1, old data is overwritten with new data. No history is maintained.
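
As a minimal sketch of the overwrite behaviour, assuming the dimension is a plain Python list of dictionaries (the function name `apply_scd_type1` and the `customer_id` and `city` columns are illustrative, not part of DataStage):

```python
def apply_scd_type1(dimension, incoming, key="customer_id"):
    """Overwrite matching dimension rows in place; append rows with new keys."""
    by_key = {row[key]: row for row in dimension}
    for new_row in incoming:
        existing = by_key.get(new_row[key])
        if existing is None:
            row = dict(new_row)
            dimension.append(row)
            by_key[row[key]] = row
        else:
            existing.update(new_row)  # the old values are lost: no history
    return dimension


dim = [{"customer_id": 1, "city": "Pune"}]
apply_scd_type1(dim, [{"customer_id": 1, "city": "Mumbai"}])
print(dim)  # [{'customer_id': 1, 'city': 'Mumbai'}]
```

After the call there is no trace that the customer was ever in Pune, which is the defining property of Type 1.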

Question 04: What is SCD Type 2?

Answer:
Type 2 maintains full history by inserting a new row for each change. Each row version typically carries:

Effective date
End date
Active flag
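
The row-versioning idea can be sketched in Python as follows. This is an illustration under stated assumptions, not DataStage code: the function name, the column names, and the use of 9999-12-31 as an open end date are all hypothetical conventions, and surrogate key assignment is left out for brevity.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # assumed convention for "no end date yet"


def apply_scd_type2(dimension, incoming, load_date, key="customer_id"):
    """Expire the current row of each changed key and append a new current row."""
    current = {r[key]: r for r in dimension if r["active_flag"] == "Y"}
    for new_row in incoming:
        old = current.get(new_row[key])
        if old is not None:
            if all(old.get(col) == val for col, val in new_row.items()):
                continue  # nothing changed, keep the current version
            old["end_date"] = load_date   # close the old version
            old["active_flag"] = "N"
        versioned = dict(new_row, effective_date=load_date,
                         end_date=HIGH_DATE, active_flag="Y")
        dimension.append(versioned)
        current[new_row[key]] = versioned
    return dimension


dim = [{"customer_id": 1, "city": "Pune", "effective_date": date(2023, 1, 1),
        "end_date": HIGH_DATE, "active_flag": "Y"}]
apply_scd_type2(dim, [{"customer_id": 1, "city": "Mumbai"}], date(2024, 6, 1))
# dim now holds two rows: the expired Pune version and the active Mumbai version.
```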

Question 05: What is SCD Type 3?

Answer:
Type 3 stores limited history by keeping the previous value in an additional column, alongside the current value.
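
A minimal sketch of the idea, assuming a hypothetical `previous_<column>` naming convention (the function and column names are illustrative only):

```python
def apply_scd_type3(row, column, new_value):
    """Move the current value into 'previous_<column>', then overwrite it."""
    if row[column] != new_value:
        row["previous_" + column] = row[column]
        row[column] = new_value
    return row


customer = {"customer_id": 1, "city": "Pune", "previous_city": None}
apply_scd_type3(customer, "city", "Mumbai")
apply_scd_type3(customer, "city", "Delhi")
print(customer)  # {'customer_id': 1, 'city': 'Delhi', 'previous_city': 'Mumbai'}
```

After the second change the original value, Pune, is no longer available. Only one prior value survives, which is why the history is described as limited.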

Question 06: How is SCD implemented in DataStage?

Answer:
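Typically with a combination of:

Surrogate Key Generator stage (to assign keys to new rows)
Lookup stage (to match incoming rows against the existing dimension)
Transformer logic (to detect changes and set dates and flags)
Slowly Changing Dimension (SCD) stage (in versions that provide it)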

Question 07: What is Effective Date in SCD?

Answer:
The date when a record becomes active.

Question 08: What is Expiry Date?

Answer:
The date when a record becomes inactive. For the current record it is usually left null or set to a far-future date such as 9999-12-31.

Question 09: What is Current Flag?

Answer:
A column that indicates whether a row is the active (current) version of a record, typically Y or N.

Question 10: When to use SCD Type 2?

Answer:
When full historical tracking is required.

🔹 Change Data Capture (CDC)

Question 11: What is CDC?

Answer:
Change Data Capture (CDC) identifies and captures only the rows that changed (inserts, updates, deletes) instead of reloading the full data set.

Question 12: Why use CDC?

Answer:

Improves performance
Reduces data load
Enables incremental processing

Question 13: Types of CDC?

Answer:

Timestamp-based
Trigger-based
Log-based

Question 14: What is Timestamp-based CDC?

Answer:
Uses a last-updated timestamp column to extract the rows that changed since the previous run.
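
The extraction step can be sketched as a filter plus a stored high-water mark. This is a simplified illustration: `extract_changes` and the `last_updated` column name are assumptions, and in practice the filter would usually be a WHERE clause in the source query.

```python
from datetime import datetime


def extract_changes(rows, last_run):
    """Return the rows modified after last_run, plus the new high-water mark."""
    changed = [r for r in rows if r["last_updated"] > last_run]
    new_mark = max((r["last_updated"] for r in changed), default=last_run)
    return changed, new_mark


source = [
    {"id": 1, "last_updated": datetime(2024, 1, 1)},
    {"id": 2, "last_updated": datetime(2024, 1, 5)},
]
changed, mark = extract_changes(source, datetime(2024, 1, 3))
print([r["id"] for r in changed])  # [2]
```

The returned mark is saved and passed in as `last_run` on the next execution. Note that a row physically deleted from the source has no timestamp left to compare, so this approach on its own cannot detect hard deletes.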

Question 15: What is Log-based CDC?

Answer:
Reads the database transaction logs to detect changes, without querying the source tables.

Question 16: How is CDC implemented in DataStage?

Answer:

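Using CDC stages (for example, the Change Capture stage, which compares a before and an after data set)
Using SQL queries that filter on a timestamp or version column
Using comparison logic (comparing source rows with target rows)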
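
The comparison approach can be illustrated with a key-based diff of two snapshots. This is a plain Python sketch that uses no DataStage API; `compare_snapshots` and the `id` key column are hypothetical names.

```python
def compare_snapshots(before, after, key="id"):
    """Classify rows as inserts, updates, or deletes by comparing on a key."""
    old = {r[key]: r for r in before}
    new = {r[key]: r for r in after}
    inserts = [new[k] for k in new if k not in old]
    deletes = [old[k] for k in old if k not in new]
    updates = [new[k] for k in new if k in old and new[k] != old[k]]
    return inserts, updates, deletes


before = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
after = [{"id": 2, "name": "Robert"}, {"id": 3, "name": "Cy"}]
inserts, updates, deletes = compare_snapshots(before, after)
print(inserts)  # [{'id': 3, 'name': 'Cy'}]
print(updates)  # [{'id': 2, 'name': 'Robert'}]
print(deletes)  # [{'id': 1, 'name': 'Ann'}]
```

Unlike a timestamp filter, a full comparison also finds deleted rows, at the cost of reading both data sets.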

Question 17: What is Delta Load?

Answer:
Loading only the data that has changed since the previous load.

Question 18: Difference between Full Load and Incremental Load?

Answer:

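Full Load	Incremental Load
Loads all data on every run	Loads only data changed since the last run
Slower; run time grows with table size	Faster; run time grows with the volume of changes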

Question 19: What is Soft Delete in CDC?

Answer:
Marking records as deleted instead of removing them.
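
A minimal sketch, assuming a hypothetical `is_deleted` flag column on the target (the function names are illustrative):

```python
def soft_delete(target, deleted_keys, key="id"):
    """Flag rows as deleted instead of removing them from the target."""
    doomed = set(deleted_keys)
    for row in target:
        if row[key] in doomed:
            row["is_deleted"] = "Y"
    return target


def active_rows(target):
    """Return only the rows that have not been soft-deleted."""
    return [r for r in target if r.get("is_deleted") != "Y"]


table = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
soft_delete(table, [1])
print(len(table))          # 2  (the row is still stored)
print(active_rows(table))  # [{'id': 2, 'name': 'Bob'}]
```

Because the row stays in place, downstream queries have to filter on the flag, as `active_rows` does here.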

Question 20: What are challenges in CDC?

Answer:

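Handling deletes (a hard delete leaves no row to detect)
Keeping source and target data consistent
Performance impact of change detection on the source system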

🔹 Surrogate Keys

Question 21: What is a Surrogate Key?

Answer:
A system-generated unique identifier for the rows of a dimension table. It carries no business meaning.

Question 22: Why use Surrogate Keys?

Answer:

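Avoid dependency on business keys, which can change or be reused in the source
Improve join performance by using compact numeric keys
Support SCD Type 2, where one business key has several row versions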

Question 23: What is Natural Key?

Answer:
A business-defined key (e.g., Employee ID).

Question 24: Difference between Surrogate and Natural Key?

Answer:

Surrogate Key	Natural Key
System-generated	Business-defined
Usually numeric	Can be a string
No business meaning	Has business meaning

Question 25: How to generate Surrogate Keys in DataStage?

Answer:
Using:

Surrogate Key Generator stage
Sequence logic (for example, a database sequence or a counter that continues from the last used key)
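
The sequence-logic option can be sketched as a counter that continues from the last key used. The class below is a hypothetical illustration and does not model how the Surrogate Key Generator stage itself stores its state:

```python
from itertools import count


class SurrogateKeyGenerator:
    """Hand out one integer key per natural key, continuing from last_used."""

    def __init__(self, last_used=0):
        self._next = count(last_used + 1)
        self._assigned = {}  # natural key -> surrogate key

    def key_for(self, natural_key):
        if natural_key not in self._assigned:
            self._assigned[natural_key] = next(self._next)
        return self._assigned[natural_key]


gen = SurrogateKeyGenerator(last_used=100)
print(gen.key_for("EMP-001"))  # 101
print(gen.key_for("EMP-002"))  # 102
print(gen.key_for("EMP-001"))  # 101 (same natural key, same surrogate key)
```

Starting from the highest key already in the dimension (`last_used`) is what keeps a new run from producing duplicates of existing keys.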

Question 26: What is Key Management?

Answer:
Ensuring that generated keys stay unique and continue in sequence across job runs.

Question 27: What is Gap in Surrogate Keys?

Answer:
Missing numbers in the key sequence, for example after a job failure or when keys are allocated in blocks and not all of them are used.

Question 28: How to avoid duplicate keys?

Answer:

Use a single key source (such as the Surrogate Key Generator stage or a database sequence)
Persist the last used key value between job runs

Question 29: Can Surrogate Keys be reused?

Answer:
No, they should be unique and not reused.

Question 30: What is Composite Key?

Answer:
A key made up of two or more columns that together identify a row uniquely.

🔹 Hash File Stage

Question 31: What is Hash File Stage?

Answer:
The Hash File stage (named Hashed File stage in the product) is a server job stage that stores data in a hashed file, which gives fast key-based lookups.

Question 32: Why use Hash File Stage?

Answer:

Fast key-based access
Efficient lookup
Well suited to reference data that is looked up repeatedly

Question 33: How does Hash File work?

Answer:
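A hashing algorithm is applied to the key to determine the group (bucket) in which a record is stored, so a lookup goes directly to that group instead of scanning the whole file.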
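
The general principle can be sketched with a small bucketed store. This illustrates hashing in general, not the actual on-disk format or algorithm of a DataStage hashed file. The class name, the bucket count, and the use of CRC32 as the hash function are assumptions made for the example.

```python
import zlib


class HashedStore:
    """Keyed store that hashes each key to one of a fixed number of buckets."""

    def __init__(self, bucket_count=8):
        self.buckets = [[] for _ in range(bucket_count)]

    def _bucket(self, key):
        index = zlib.crc32(str(key).encode("utf-8")) % len(self.buckets)
        return self.buckets[index]

    def write(self, key, record):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, record)  # same key: the last write wins
                return
        bucket.append((key, record))

    def read(self, key):
        # Only one bucket is searched, however many records are stored.
        for existing_key, record in self._bucket(key):
            if existing_key == key:
                return record
        return None


store = HashedStore()
store.write("C001", {"name": "Ann"})
store.write("C001", {"name": "Ann Lee"})
print(store.read("C001"))  # {'name': 'Ann Lee'}
print(store.read("C999"))  # None
```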

Question 34: What are types of Hash Files?

Answer:

Static
Dynamic

Question 35: What is Dynamic Hash File?

Answer:
A hashed file that resizes automatically as the volume of data grows or shrinks.

Question 36: What is Primary Key in Hash File?

Answer:
The key column (or columns) that is hashed to locate a record and that identifies the record uniquely within the file.

Question 37: What is Overflow in Hash File?

Answer:
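Overflow occurs when a bucket (group) is full and further records that hash to it have to be stored in overflow space, which slows access.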
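
A toy model makes the effect visible. It assumes integer keys, a simple modulo hash, and a deliberately tiny bucket capacity. None of these values come from DataStage.

```python
BUCKET_CAPACITY = 2  # assumed tiny capacity so that overflow is easy to see


def load(keys, bucket_count):
    """Place integer keys into fixed-size buckets and collect the overflow."""
    buckets = [[] for _ in range(bucket_count)]
    overflow = []
    for k in keys:
        bucket = buckets[k % bucket_count]
        if len(bucket) < BUCKET_CAPACITY:
            bucket.append(k)
        else:
            overflow.append(k)  # the bucket is full, so the record spills over
    return buckets, overflow


_, spilled = load(range(10), bucket_count=2)
print(spilled)  # [4, 5, 6, 7, 8, 9]

_, spilled = load(range(10), bucket_count=5)
print(spilled)  # []
```

With too few buckets most records spill over. With enough buckets for the same data, nothing overflows.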

Question 38: How to improve Hash File performance?

Answer:

Proper key selection
Adequate sizing
Avoid overflow

🔹 Shared Containers

Question 39: What is Shared Container?

Answer:
A reusable group of stages and links that is stored as a separate object in the repository and can be used in multiple jobs.

Question 40: Why use Shared Containers?

Answer:

Reusability
Maintainability
Standardization

Question 41: How to create Shared Container?

Answer:

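Open or create a job in the Designer and select the stages and links to reuse
Convert the selection to a shared container (Edit > Construct Container > Shared)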

Question 42: Can Shared Containers have parameters?

Answer:
Yes, they can accept parameters.

Question 43: What is advantage of Shared Container?

Answer:
Logic is centralized, so an update is made once in the container rather than in every job.

Question 44: What is disadvantage of Shared Container?

Answer:

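Dependency issues (jobs depend on the container definition)
A change affects every job that uses the container, so those jobs must be retested and typically recompiled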

🔹 Local Containers

Question 45: What is Local Container?

Answer:
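A group of stages and links within a single job, used mainly to simplify the job design. It is available only to the job in which it is created.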

Question 46: Difference between Shared and Local Container?

Answer:

Shared	Local
Reusable across jobs	Within job only
Stored separately	Inside job

Question 47: When to use Local Container?

Answer:

To group related stages and simplify a complex job design
When the logic is needed only within that job

Question 48: Can Local Containers be converted to Shared?

Answer:
Yes. A local container can be converted to a shared container in the Designer.

Question 49: What is best practice for Containers?

Answer:

Use Shared for common logic
Use Local for small logic

Question 50: Real-world use of Containers?

Answer:

Standard transformations
Data cleansing logic
Reusable pipelines
