Data engineering interviews at senior levels go far beyond writing SQL queries. Hiring managers at companies like Flipkart, Swiggy, PhonePe, Zepto, and global MNCs expect you to architect distributed systems, reason about trade-offs under scale, and demonstrate battle-tested patterns for reliability. This article walks through 10 high-signal interview questions along with the complete, production-grade answers that differentiate a ₹50 LPA candidate from the rest.

Q1. Design a SQL Query to Compute Rolling Aggregates Over a Massive Partitioned Table
What the Interviewer Is Testing
Your ability to use window functions efficiently at petabyte scale, while respecting partition pruning and avoiding full table scans.
✅ Proper Solution
Assume a table events partitioned by event_date, with billions of rows:
-- Rolling 7-day sum of revenue per user, partition-aware
SELECT
user_id,
event_date,
revenue,
SUM(revenue) OVER (
PARTITION BY user_id
ORDER BY event_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7d_revenue
FROM events
WHERE event_date BETWEEN DATE_SUB(CURRENT_DATE, 30) AND CURRENT_DATE
-- Partition pruning: only scan last 30 days
ORDER BY user_id, event_date;
Key Design Principles
- Partition pruning first: Always filter on the partition column (event_date) in the WHERE clause before applying window functions. Without this, the engine scans all partitions.
- ROWS vs RANGE: Use ROWS BETWEEN instead of RANGE BETWEEN for rolling aggregates on dense date columns. RANGE can be expensive when there are ties.
- Pre-aggregate before windowing: If the table has multiple events per user per day, first group to daily granularity, then apply the window. This drastically reduces the window function's working set.
- Materialized intermediate views: In BigQuery/Snowflake/Redshift, create a daily-aggregated materialized view. Run rolling aggregates on that view, not raw events.
-- Better pattern: pre-aggregate, then roll
WITH daily_revenue AS (
SELECT user_id, event_date, SUM(revenue) AS daily_rev
FROM events
WHERE event_date BETWEEN DATE_SUB(CURRENT_DATE, 37) AND CURRENT_DATE
GROUP BY user_id, event_date
)
SELECT
user_id,
event_date,
SUM(daily_rev) OVER (
PARTITION BY user_id
ORDER BY event_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7d_revenue
FROM daily_revenue;
Pro tip for candidates: Mention that in Spark SQL you can hint the engine with /*+ BROADCAST(small_table) */ to avoid shuffles on the join preceding the window.
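For example, a minimal PySpark sketch of that hint, assuming an active SparkSession named spark and a hypothetical small dimension table dim_users (the join itself is an illustration, not part of the original query):
# Hypothetical small dimension joined before the window; BROADCAST avoids a shuffle.
rolling = spark.sql("""
    SELECT /*+ BROADCAST(d) */
        e.user_id,
        e.event_date,
        SUM(e.revenue) OVER (
            PARTITION BY e.user_id
            ORDER BY e.event_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS rolling_7d_revenue
    FROM events e
    JOIN dim_users d ON e.user_id = d.user_id
    WHERE e.event_date >= DATE_SUB(CURRENT_DATE(), 30)
""")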
Q2. How Would You Design Idempotent ETL Jobs to Safely Handle Retries and Partial Failures?
What the Interviewer Is Testing
Whether you understand the difference between idempotency and at-least-once delivery, and how to build ETL pipelines that can be safely re-run without data corruption.
✅ Proper Solution
Core Principle: An idempotent operation, when applied multiple times, produces the same result as applying it once. In ETL terms: re-running the job should not create duplicate rows, double-count metrics, or leave partial state.
Strategy 1: Overwrite, Don’t Append
For batch jobs, use INSERT OVERWRITE (or REPLACE INTO / MERGE) scoped to the processing window rather than raw INSERT.
-- Safe pattern: overwrite a specific partition on every run
INSERT OVERWRITE TABLE fact_orders PARTITION (order_date = '2024-11-01')
SELECT order_id, user_id, amount
FROM staging_orders
WHERE order_date = '2024-11-01';
Re-running this job 10 times yields identical results. Raw INSERT INTO would create 10x the rows.
Strategy 2: Use a Job State/Watermark Table
-- Track what has been processed
CREATE TABLE etl_checkpoints (
job_name VARCHAR(100),
batch_key VARCHAR(100), -- e.g., "2024-11-01"
status VARCHAR(20), -- STARTED, SUCCESS, FAILED
started_at TIMESTAMP,
completed_at TIMESTAMP,
PRIMARY KEY (job_name, batch_key)
);
At job start: write STARTED. On success: update to SUCCESS. On retry: check if SUCCESS already exists → skip. This is the at-least-once + idempotent sink pattern used by Airflow, Prefect, and Dagster.
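A minimal Python sketch of that check-then-run flow, assuming the etl_checkpoints table above, a DB-API connection conn, a Postgres-style upsert, and a caller-supplied run_batch function (all assumptions for illustration):
def run_idempotent(conn, job_name: str, batch_key: str, run_batch) -> None:
    cur = conn.cursor()
    cur.execute(
        "SELECT status FROM etl_checkpoints WHERE job_name = %s AND batch_key = %s",
        (job_name, batch_key),
    )
    row = cur.fetchone()
    if row and row[0] == "SUCCESS":
        return  # batch already completed: a retry becomes a no-op

    cur.execute(
        "INSERT INTO etl_checkpoints (job_name, batch_key, status, started_at) "
        "VALUES (%s, %s, 'STARTED', CURRENT_TIMESTAMP) "
        "ON CONFLICT (job_name, batch_key) DO UPDATE SET status = 'STARTED'",
        (job_name, batch_key),
    )
    conn.commit()

    run_batch(batch_key)  # the idempotent load itself, e.g. an INSERT OVERWRITE

    cur.execute(
        "UPDATE etl_checkpoints SET status = 'SUCCESS', completed_at = CURRENT_TIMESTAMP "
        "WHERE job_name = %s AND batch_key = %s",
        (job_name, batch_key),
    )
    conn.commit()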
Strategy 3: Staging → Merge Pattern (Upsert)
Load raw data into a staging table, then MERGE into the target. Duplicate source rows are naturally de-duplicated:
MERGE INTO target_orders AS t
USING staging_orders AS s ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET t.amount = s.amount, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, user_id, amount, created_at)
VALUES (s.order_id, s.user_id, s.amount, s.created_at);
Strategy 4: Content-Addressable Keys for Streaming
Generate deterministic IDs using a hash of business keys so duplicate events produce the same ID and are naturally upserted, not duplicated:
import hashlib, json
def make_idempotency_key(event: dict) -> str:
    canonical = json.dumps({k: event[k] for k in sorted(event)}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()
Q3. How Would You De-duplicate Streaming Data When Events Can Be Replayed?
What the Interviewer Is Testing
Knowledge of stateful stream processing, bloom filters, and the practical trade-off between memory, latency, and correctness.
✅ Proper Solution
Approach 1: Stateful Dedup with a Seen-IDs Store (Flink / Spark Structured Streaming)
# Spark Structured Streaming - dropDuplicates with watermark
from pyspark.sql import functions as F
deduplicated = (
    stream_df
    .withWatermark("event_timestamp", "10 minutes")  # allow 10-min late arrivals
    .dropDuplicates(["event_id"])  # deduplicate within watermark window
)
The watermark tells Spark to garbage-collect state older than 10 minutes, preventing unbounded memory growth.
Approach 2: Redis/DynamoDB as an Idempotency Store
import redis
r = redis.Redis()
TTL_SECONDS = 3600 # match your replay window
def process_event(event):
    key = f"seen:{event['event_id']}"
    # SET NX = set only if not exists; atomic operation
    is_new = r.set(key, 1, ex=TTL_SECONDS, nx=True)
    if is_new:
        write_to_sink(event)
    # else: duplicate, skip silently
Approach 3: Bloom Filter for High-Throughput, Near-Exact Dedup
When you have tens of millions of event IDs per hour, an exact set in Redis becomes expensive. A Bloom filter offers O(1) lookups with a configurable false-positive rate (e.g., 0.1%) at a fraction of the memory cost. Use pybloom_live or the BF.ADD command from the RedisBloom module (included in Redis Stack).
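A toy, standard-library-only sketch of the idea (production pipelines would reach for pybloom_live or RedisBloom instead; the sizing numbers here are illustrative):
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 8_000_000, num_hashes: int = 7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive k independent bit positions from one cryptographic hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

seen = BloomFilter()

def is_probable_duplicate(event_id: str) -> bool:
    if seen.might_contain(event_id):
        return True  # may be a false positive (~0.1% when sized correctly)
    seen.add(event_id)
    return False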
Design Trade-offs Table
| Method | Correctness | Memory | Best For |
|---|---|---|---|
| Spark dropDuplicates + watermark | Exact within window | Medium | Bounded replay windows |
| Redis SET NX | Exact | Higher | Cross-service dedup |
| Bloom Filter | ~99.9% (tunable) | Very Low | Ultra high-throughput |
| DB Upsert (MERGE) | Exact | N/A (disk) | Sink-level dedup |
Q4. How Would You Optimize a Slow Query in a Columnar Data Warehouse Without Modifying the Query Logic?
What the Interviewer Is Testing
Infrastructure-level optimization skills — because in many organizations, data engineers don’t own the BI queries but are expected to make them fast.
✅ Proper Solution
1. Fix Clustering / Sort Keys
In Redshift, ensure the table's sort key matches the query's WHERE and JOIN predicates. In BigQuery, use clustering columns that align with your filter columns. In Snowflake, define a clustering key on your most common filter columns and let Automatic Clustering maintain it (Snowflake advises against extremely high-cardinality clustering keys).
2. Analyze and Update Table Statistics
-- Redshift
ANALYZE table_name;
-- BigQuery: statistics update automatically; use INFORMATION_SCHEMA.TABLE_STORAGE
-- Snowflake
ALTER TABLE orders CLUSTER BY (order_date, region);
3. Add a Materialized View or Aggregate Table
If the query repeatedly joins and aggregates the same large tables, create a pre-computed materialized view. The query optimizer will automatically rewrite against the materialized view if column definitions match (BigQuery and Snowflake do this natively).
4. Partition Pruning Audit
Run EXPLAIN to check if partition pruning is happening. Common culprits that break pruning: wrapping partition columns in functions like DATE(event_timestamp) instead of using the raw partition column, or implicit type casts.
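A quick way to run that audit from PySpark, assuming an active SparkSession named spark and the hypothetical events table from Q1 (partitioned by event_date):
# Pruning-friendly: filter directly on the raw partition column.
spark.sql("""
    EXPLAIN SELECT COUNT(*) FROM events
    WHERE event_date >= '2024-11-01'
""").show(truncate=False)  # plan should show only the matching partitions being read

# Pruning-breaking anti-pattern: the predicate wraps a column in a function,
# so the optimizer cannot map it back to partition metadata.
spark.sql("""
    EXPLAIN SELECT COUNT(*) FROM events
    WHERE DATE(event_timestamp) >= '2024-11-01'
""").show(truncate=False)  # typically scans every partition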
5. Increase Warehouse/Cluster Size (Last Resort)
Before scaling up hardware, exhaust all structural optimizations. Scaling is often 10–20x more expensive than a proper clustering fix.
Q5. Describe Strategies for Handling Late-Arriving Data in a Batch + Streaming Hybrid Pipeline
What the Interviewer Is Testing
Your understanding of the Lambda / Kappa / Delta architecture spectrum and the practical mechanisms for reconciling real-time and historical data.
Strategy 1: Event-Time Watermarks in the Streaming Layer
Configure watermarks that define how late data is tolerated before a window is finalized. In Apache Flink or Spark Structured Streaming:
from pyspark.sql import functions as F

(stream.withWatermark("event_time", "2 hours")
    .groupBy(F.window("event_time", "1 hour"), "user_id")
    .agg(F.sum("amount")))
Events arriving more than 2 hours late are dropped from streaming windows but can be captured by the batch layer.
Strategy 2: Delta Table Time Travel + Batch Reprocessing
Use Delta Lake (or Apache Iceberg / Apache Hudi) as your storage layer. Late-arriving events are appended with their original event_time. A daily batch job re-aggregates affected partitions:
from delta.tables import DeltaTable
# Upsert late-arriving records
delta_table = DeltaTable.forPath(spark, "/data/events")
delta_table.alias("t").merge(
late_events_df.alias("s"),
"t.event_id = s.event_id"
).whenNotMatchedInsertAll().execute()
# Re-run aggregation for affected dates
spark.sql("REFRESH TABLE aggregated_events")
Strategy 3: Grace Periods + Correction Streams
Publish two streams: a provisional stream (low latency, may miss late data) and a corrected stream (T+24h, includes all late arrivals). Downstream dashboards consume both and apply the correction as an update. This is how financial firms reconcile intraday vs. end-of-day reporting.
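A minimal, framework-agnostic Python sketch of how a consumer can apply those corrections (the record shape and field names are assumptions): state is keyed by (window_start, key), and a corrected record with a higher revision simply replaces the provisional one.
dashboard_state: dict = {}

def apply_record(record: dict) -> None:
    key = (record["window_start"], record["user_id"])
    current = dashboard_state.get(key)
    if current is None or record["revision"] > current["revision"]:
        dashboard_state[key] = record  # corrected rows (T+24h) win over provisional ones

apply_record({"window_start": "2024-11-01T10:00", "user_id": "u1",
              "amount": 90.0, "revision": 0})   # provisional, low latency
apply_record({"window_start": "2024-11-01T10:00", "user_id": "u1",
              "amount": 104.5, "revision": 1})  # corrected, includes late arrivals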
Strategy 4: Partition-Level Reprocessing Flags
Maintain a reprocess_flags table. When late data arrives for a given date partition, flag it. A scheduled job scans flags and re-runs only affected partitions, rather than reprocessing everything.
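A PySpark sketch of the flag-driven re-run loop (assumes reprocess_flags is a Delta/Iceberg table that supports SQL UPDATE, and rebuild_partition is your own re-aggregation job; both names are hypothetical):
flagged = spark.sql(
    "SELECT partition_date FROM reprocess_flags WHERE needs_reprocess = true"
).collect()

for row in flagged:
    rebuild_partition(row["partition_date"])  # re-aggregate only this date partition

if flagged:
    date_list = ", ".join(f"'{row['partition_date']}'" for row in flagged)
    spark.sql(
        f"UPDATE reprocess_flags SET needs_reprocess = false "
        f"WHERE partition_date IN ({date_list})"
    )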
Q6. What Happens Internally During a Data Shuffle in Distributed Systems, and How Do You Reduce Its Cost?
What the Interviewer Is Testing
Deep systems knowledge — understanding why certain operations are expensive and how to architect around them.
✅ Proper Solution
What Happens During a Shuffle
A shuffle is triggered by any operation that requires data from multiple partitions to be co-located on the same node — GROUP BY, JOIN, DISTINCT, ORDER BY, repartition(). The process:
- Map phase: Each task computes a hash of the shuffle key and writes output to local disk, partitioned by hash bucket.
- Transfer phase: Reducers on other machines issue HTTP fetch requests to pull their assigned buckets from each mapper’s disk.
- Sort/Reduce phase: Received data is sorted by key, then aggregated.
This involves disk I/O (write + read), network I/O, and serialization — typically the most expensive operation in any distributed job.
How to Reduce Shuffle Cost
- Map-side pre-aggregation (combiner): In Spark, use reduceByKey instead of groupByKey. reduceByKey runs a local aggregation before shuffling, dramatically reducing data volume.
- Broadcast joins: If one side of a join is small (< a few hundred MB), broadcast it to every executor. Zero shuffle. Tune spark.sql.autoBroadcastJoinThreshold (e.g., 100MB).
- Partition on join keys in advance: Pre-partition both tables on the join key and persist. Subsequent joins on the same key require no reshuffle (sort-merge join reuses existing partitioning).
- Avoid unnecessary wide transformations: distinct() before a join is often a shuffle that can be eliminated by fixing the upstream dedup logic.
- AQE (Adaptive Query Execution): Enable Spark's AQE (spark.sql.adaptive.enabled=true) — it dynamically coalesces small shuffle partitions and switches join strategies at runtime. A short sketch of the first and last levers follows this list.
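A short PySpark sketch of those two levers (assuming an active SparkSession named spark; the tiny RDD is purely illustrative):
# Adaptive Query Execution: coalesces small shuffle partitions and can
# switch join strategies at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

pairs = spark.sparkContext.parallelize([("u1", 10), ("u2", 5), ("u1", 7)])

# groupByKey ships every value across the network before aggregating.
totals_slow = pairs.groupByKey().mapValues(sum)

# reduceByKey combines locally on each partition first, shuffling far less data.
totals_fast = pairs.reduceByKey(lambda a, b: a + b)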
Q7. How Would You Detect and Mitigate Data Skew in a Large Spark or Distributed SQL Job?
What the Interviewer Is Testing
Whether you can diagnose performance issues in production and apply targeted fixes, not just throw more resources at the problem.
✅ Proper Solution
Detection
- In the Spark UI, look at the Stages tab → Task Metrics. If a few tasks take 10–100x longer than median, you have skew.
- Check partition sizes: df.groupBy(spark_partition_id()).count().show(). A healthy partition distribution is roughly equal. Skew shows up as one partition having 100M rows while others have 1M.
- Common skew causes: NULL keys on joins, power-law distributed IDs (e.g., one mega-merchant has 50% of all orders), hot dates in time-series data.
Mitigation Strategies
1. Salting: Append a random integer [0–N] to the skewed join key, explode the small table by the same factor, then join.
import pyspark.sql.functions as F
SALT_BUCKETS = 20
# Salt the large (skewed) table
skewed_df = large_df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
skewed_df = skewed_df.withColumn("salted_key", F.concat(F.col("join_key"), F.lit("_"), F.col("salt")))
# Explode the small table across all salt values
small_exploded = small_df.withColumn("salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)])))
small_exploded = small_exploded.withColumn("salted_key", F.concat(F.col("join_key"), F.lit("_"), F.col("salt")))
result = skewed_df.join(small_exploded, "salted_key")
2. Isolate Skewed Keys: Split the dataset into skewed keys + non-skewed keys. Process skewed keys with a broadcast join, non-skewed keys with a sort-merge join. Union results.
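A hedged PySpark sketch of that split-and-union approach, assuming the hot keys have already been profiled and that orders and merchants DataFrames exist (all names are illustrative):
import pyspark.sql.functions as F

HOT_KEYS = ["MEGA_MERCHANT_1", "MEGA_MERCHANT_2"]  # assumed known from profiling

hot_orders = orders.filter(F.col("merchant_id").isin(HOT_KEYS))
rest_orders = orders.filter(~F.col("merchant_id").isin(HOT_KEYS))

# Hot keys: the matching merchant rows are tiny, so broadcast them (no shuffle).
hot_joined = hot_orders.join(
    F.broadcast(merchants.filter(F.col("merchant_id").isin(HOT_KEYS))),
    "merchant_id",
)

# Everything else is not skewed, so a normal sort-merge join is fine.
rest_joined = rest_orders.join(merchants, "merchant_id")

result = hot_joined.unionByName(rest_joined)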
3. Skew Hints: Some engines (notably Databricks Runtime) support explicit skew join hints; the exact syntax varies by platform:
SELECT /*+ SKEW_JOIN(orders, 'merchant_id', ('MEGA_MERCHANT_1', 'MEGA_MERCHANT_2')) */
o.order_id, m.merchant_name
FROM orders o JOIN merchants m ON o.merchant_id = m.merchant_id;
4. AQE Skew Join Optimization: Enable spark.sql.adaptive.skewJoin.enabled=true — Spark 3 will automatically detect and split skewed partitions at runtime.
Q8. Design a Data Ingestion Pipeline That Guarantees Exactly-Once Processing at Scale
What the Interviewer Is Testing
System design maturity — exactly-once is the hardest guarantee in distributed systems. They want to know if you understand why it’s hard and how production systems achieve it.
✅ Proper Solution
Why Exactly-Once Is Hard
A distributed network can drop messages (causing retries → at-least-once) or crash mid-write (causing partial writes). True exactly-once requires coordination between the source, the processor, and the sink — all three layers must participate.
Architecture: Transactional Outbox + Idempotent Sink
[Source DB] → [Outbox Table] → [CDC / Debezium] → [Kafka Topic]
↓
[Flink/Spark Consumer]
↓
[Idempotent Sink (Delta/Iceberg)]
Layer 1 — Transactional Outbox: Writes to the source DB and to an outbox table happen in the same DB transaction. This prevents the “wrote to DB but failed before sending to Kafka” problem.
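A minimal sketch of the outbox write, using Python's built-in sqlite3 purely for illustration (a real system would use the production source database, with Debezium tailing the outbox table):
import json
import sqlite3
import uuid

conn = sqlite3.connect("source.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT PRIMARY KEY, amount REAL)")
conn.execute("CREATE TABLE IF NOT EXISTS outbox (event_id TEXT PRIMARY KEY, payload TEXT)")

def place_order(order_id: str, amount: float) -> None:
    # The business row and the outbox row commit in the SAME transaction,
    # so an event exists if and only if the order exists.
    with conn:  # sqlite3 connection as context manager = one transaction
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
        conn.execute(
            "INSERT INTO outbox VALUES (?, ?)",
            (str(uuid.uuid4()),
             json.dumps({"type": "order_placed", "order_id": order_id, "amount": amount})),
        )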
Layer 2 — Kafka Exactly-Once Semantics: Use Kafka transactions (enable.idempotence=true, transactional.id) to ensure producers don’t duplicate messages on retry, and consumers commit offsets atomically with processing.
Layer 3 — Idempotent Sink: Use Delta Lake’s transactional writes. Flink’s Delta connector writes files + commits a transaction log entry atomically. Re-running from an earlier checkpoint re-commits the same files (idempotent by design).
# Flink with exactly-once checkpoint to Delta Lake
env.enable_checkpointing(60000) # checkpoint every 60s
env.get_checkpoint_config().set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)
env.get_checkpoint_config().enable_externalized_checkpoints(
ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
)
Offset Tracking Pattern for Spark
# Spark Structured Streaming — write offsets + data atomically
query = (
stream_df.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "/checkpoints/pipeline_v1") # atomic offset + data
.start("/data/sink/events")
)
The checkpoint directory stores Kafka offsets. On restart, Spark reads the last committed offset and re-processes from exactly that point, writing to Delta atomically. This achieves effective exactly-once end-to-end.
Q9. How Would You Model Data to Support Both Low-Latency Dashboards and Long-Term Analytical Queries?
What the Interviewer Is Testing
Data modeling maturity — specifically the tension between OLAP optimization (wide denormalized tables) and the need for sub-second dashboard response times.
The Unified Pattern: Multi-Tier Data Model
Tier 1: Raw / Bronze Layer → Normalized, immutable, full history
Tier 2: Curated / Silver Layer → Cleaned, joined, business-entity-centric
Tier 3: Aggregate / Gold Layer → Pre-computed metrics, dashboard-ready
Low-latency dashboards query Tier 3 (Gold). These are pre-aggregated tables updated every 15 minutes to 1 hour. They contain denormalized, ready-to-use metrics: daily_active_users, gmv_by_region_hour, conversion_funnel_by_cohort. Response time: <200ms because data is pre-computed and clustered on the query dimensions.
Long-term analytical queries join across Silver/Bronze layers. Analysts can explore any dimension combination. These queries are expected to take seconds to minutes and are run in batch by analysts, not in real-time by end-users.
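To make the Gold tier concrete, a hedged PySpark sketch (table names, columns, and metrics are illustrative assumptions, not a prescription):
import pyspark.sql.functions as F

# Assumes a Silver-layer table of cleaned orders with event_ts, region, user_id, amount.
silver_orders = spark.table("silver.orders")

gmv_by_region_hour = (
    silver_orders
    .withColumn("report_date", F.to_date("event_ts"))
    .withColumn("report_hour", F.hour("event_ts"))
    .groupBy("report_date", "report_hour", "region")
    .agg(F.sum("amount").alias("gmv"),
         F.countDistinct("user_id").alias("active_users"))
)

# Overwritten on each scheduled run (every 15-60 minutes); dashboards read
# this pre-aggregated, partition-pruned table directly.
(gmv_by_region_hour.write
    .mode("overwrite")
    .partitionBy("report_date")
    .saveAsTable("gold.gmv_by_region_hour"))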
Concrete Modeling Choices
- Gold layer: Denormalized "wide" fact tables with all frequently-queried dimensions embedded. No joins required at query time. Clustered on (report_date, region) for the most common dashboard filters.
- Silver layer: Normalized fact + dimension tables (Kimball-style star schema). Supports ad hoc exploration but requires joins.
- Semantic layer: Use a tool like dbt metrics, Looker LookML, or Cube.dev to define a semantic layer over Silver. Dashboards query the semantic layer (which hits Gold cache first, falls back to Silver).
For Sub-100ms Latency
Introduce a caching layer: Apache Druid, ClickHouse, or Apache Pinot for real-time OLAP. Ingest the Gold layer metrics into Druid/Pinot. Dashboard queries route to Druid. Analytical exploration routes to Snowflake/BigQuery Silver layer. This is the architecture used by LinkedIn (Pinot), Uber (Pinot), and Netflix (Druid).
Q10. What Trade-offs Would You Consider Between Partitioning vs. Bucketing/Clustering?
What the Interviewer Is Testing
The depth of your understanding of physical data layout and how it affects query planning, maintenance overhead, and storage efficiency.
✅ Proper Solution
Partitioning
Data is split into separate directories/files based on column value (e.g., event_date=2024-11-01/). The query engine uses partition metadata to entirely skip irrelevant directories — this is called partition pruning.
When to use: Low-cardinality columns that are almost always in your WHERE clause: date, country, event_type. A table with 3 years of daily partitions has ~1,095 partitions — very manageable.
Pitfalls: Over-partitioning on high-cardinality columns (e.g., user_id) creates millions of tiny files (the “small files problem”), causing metadata overhead that can crush HDFS NameNode or object store list performance. Rule of thumb: target partition sizes of 100MB–1GB.
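A small PySpark sketch of compacting writes into that size range (the DataFrame name and output path are assumptions you would tune per table):
# Compact rows so each event_date partition is written as a few large files,
# instead of hundreds of upstream tasks each emitting a tiny file per date.
(events_df
    .repartition("event_date")  # co-locate each date's rows before the write
    .write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://warehouse/events/"))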
Bucketing (Hive/Spark) / Clustering (BigQuery/Snowflake)
Data within a partition (or the whole table) is sorted and divided by a hash of the column value into a fixed number of buckets. Unlike partitioning, this doesn’t create separate directories — it controls data layout within files.
When to use: High-cardinality join keys (e.g., user_id, order_id) where you frequently join two large tables on the same key. If both tables are bucketed on user_id with the same number of buckets, Spark can perform a bucket join — no shuffle required.
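In Spark, bucketed writes look roughly like this (a sketch; the bucket count and table names are assumptions, and both tables must use the same bucket column and count for the shuffle-free join):
# Write both large tables bucketed on the join key with the same bucket count,
# so a later join can use a bucketed sort-merge join with no shuffle.
(orders_df.write
    .bucketBy(256, "user_id")
    .sortBy("user_id")
    .mode("overwrite")
    .saveAsTable("warehouse.orders_bucketed"))

(users_df.write
    .bucketBy(256, "user_id")
    .sortBy("user_id")
    .mode("overwrite")
    .saveAsTable("warehouse.users_bucketed"))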
Comparison Table
| Dimension | Partitioning | Bucketing / Clustering |
|---|---|---|
| Best for | Range filters (date, region) | Point lookups, joins on high-cardinality keys |
| Pruning mechanism | Skip entire directories | Skip within files (min/max zone maps) |
| Cardinality fit | Low (days, countries) | High (user IDs, order IDs) |
| Shuffle elimination | No | Yes (bucket join in Spark/Hive) |
| Small files risk | High if over-partitioned | Controlled (fixed bucket count) |
| Write overhead | Low | Higher (sort + hash during write) |
The Optimal Pattern: Partition + Bucket Together
-- Hive / Spark SQL: partition by date, bucket by user_id
CREATE TABLE orders (
    order_id BIGINT,
    user_id BIGINT,
    amount DECIMAL(10,2)
)
PARTITIONED BY (order_date DATE)
CLUSTERED BY (user_id) INTO 256 BUCKETS
STORED AS PARQUET;
Queries filtered on order_date get partition pruning. Joins on user_id within a partition get bucket join (no shuffle). You get both benefits.
How to Use These Answers in an Interview
At the ₹50–60 LPA Data Engineer II level, interviewers don’t just want the correct answer — they want to see reasoning under constraints. For every question, practice a structure like:
- Restate the core challenge (shows you understand the problem, not just the answer)
- Give the approach you’d choose first and why
- Proactively mention the trade-offs of your choice
- Mention at least one alternative and when you’d use it instead
This structure transforms a correct answer into a senior-level answer — and that’s the difference between an offer and a rejection at the top of the salary band.
Found this useful? Share it with your network preparing for senior data engineering interviews. Drop your questions in the comments — we answer every one.




