Data engineering interviews at senior levels go far beyond writing SQL queries. Hiring managers at companies like Flipkart, Swiggy, PhonePe, Zepto, and global MNCs expect you to architect distributed systems, reason about trade-offs under scale, and demonstrate battle-tested patterns for reliability. This article walks through 10 high-signal interview questions along with the complete, production-grade answers that differentiate a ₹50 LPA candidate from the rest.

Q1. Design a SQL Query to Compute Rolling Aggregates Over a Massive Partitioned Table
What the Interviewer Is Testing
Your ability to use window functions efficiently at petabyte scale, while respecting partition pruning and avoiding full table scans.
✅ Proper Solution
Assume a table events partitioned by event_date, with billions of rows:
-- Rolling 7-day sum of revenue per user, partition-aware
SELECT
user_id,
event_date,
revenue,
SUM(revenue) OVER (
PARTITION BY user_id
ORDER BY event_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7d_revenue
FROM events
WHERE event_date BETWEEN DATE_SUB(CURRENT_DATE, 30) AND CURRENT_DATE
-- Partition pruning: only scan last 30 days
ORDER BY user_id, event_date;
Key Design Principles
- Partition pruning first: Always filter on the partition column (event_date) in the WHERE clause before applying window functions. Without this, the engine scans all partitions.
- ROWS vs RANGE: Use ROWS BETWEEN instead of RANGE BETWEEN for rolling aggregates on dense date columns. RANGE can be expensive when there are ties.
- Pre-aggregate before windowing: If the table has multiple events per user per day, first group to daily granularity, then apply the window. This drastically reduces the window function's working set.
- Materialized intermediate views: In BigQuery/Snowflake/Redshift, create a daily-aggregated materialized view. Run rolling aggregates on that view, not raw events.
-- Better pattern: pre-aggregate, then roll
WITH daily_revenue AS (
SELECT user_id, event_date, SUM(revenue) AS daily_rev
FROM events
WHERE event_date BETWEEN DATE_SUB(CURRENT_DATE, 37) AND CURRENT_DATE
GROUP BY user_id, event_date
)
SELECT
user_id,
event_date,
SUM(daily_rev) OVER (
PARTITION BY user_id
ORDER BY event_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7d_revenue
FROM daily_revenue;
Pro tip for candidates: Mention that in Spark SQL you can hint the engine with /*+ BROADCAST(small_table) */ to avoid shuffles on the join preceding the window.
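For example, a minimal PySpark sketch of that hint, assuming an active SparkSession named spark and a hypothetical small dimension table dim_users (the join itself is an illustration, not part of the original query):
# Hypothetical small dimension joined before the window; BROADCAST avoids a shuffle.
rolling = spark.sql("""
    SELECT /*+ BROADCAST(d) */
        e.user_id,
        e.event_date,
        SUM(e.revenue) OVER (
            PARTITION BY e.user_id
            ORDER BY e.event_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS rolling_7d_revenue
    FROM events e
    JOIN dim_users d ON e.user_id = d.user_id
    WHERE e.event_date >= DATE_SUB(CURRENT_DATE(), 30)
""")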
Q2. How Would You Design Idempotent ETL Jobs to Safely Handle Retries and Partial Failures?
What the Interviewer Is Testing
Whether you understand the difference between idempotency and at-least-once delivery, and how to build ETL pipelines that can be safely re-run without data corruption.
✅ Proper Solution
Core Principle: An idempotent operation, when applied multiple times, produces the same result as applying it once. In ETL terms: re-running the job should not create duplicate rows, double-count metrics, or leave partial state.
Strategy 1: Overwrite, Don’t Append
For batch jobs, use INSERT OVERWRITE (or REPLACE INTO / MERGE) scoped to the processing window rather than raw INSERT.
-- Safe pattern: overwrite a specific partition on every run
INSERT OVERWRITE TABLE fact_orders PARTITION (order_date = '2024-11-01')
SELECT order_id, user_id, amount
FROM staging_orders
WHERE order_date = '2024-11-01';
Re-running this job 10 times yields identical results. Raw INSERT INTO would create 10x the rows.
Strategy 2: Use a Job State/Watermark Table
-- Track what has been processed
CREATE TABLE etl_checkpoints (
job_name VARCHAR(100),
batch_key VARCHAR(100), -- e.g., "2024-11-01"
status VARCHAR(20), -- STARTED, SUCCESS, FAILED
started_at TIMESTAMP,
completed_at TIMESTAMP,
PRIMARY KEY (job_name, batch_key)
);
At job start: write STARTED. On success: update to SUCCESS. On retry: check if SUCCESS already exists → skip. This is the at-least-once + idempotent sink pattern used by Airflow, Prefect, and Dagster.
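A minimal Python sketch of that check-then-run flow, assuming the etl_checkpoints table above, a DB-API connection conn, a Postgres-style upsert, and a caller-supplied run_batch function (all assumptions for illustration):
def run_idempotent(conn, job_name: str, batch_key: str, run_batch) -> None:
    cur = conn.cursor()
    cur.execute(
        "SELECT status FROM etl_checkpoints WHERE job_name = %s AND batch_key = %s",
        (job_name, batch_key),
    )
    row = cur.fetchone()
    if row and row[0] == "SUCCESS":
        return  # batch already completed: a retry becomes a no-op

    cur.execute(
        "INSERT INTO etl_checkpoints (job_name, batch_key, status, started_at) "
        "VALUES (%s, %s, 'STARTED', CURRENT_TIMESTAMP) "
        "ON CONFLICT (job_name, batch_key) DO UPDATE SET status = 'STARTED'",
        (job_name, batch_key),
    )
    conn.commit()

    run_batch(batch_key)  # the idempotent load itself, e.g. an INSERT OVERWRITE

    cur.execute(
        "UPDATE etl_checkpoints SET status = 'SUCCESS', completed_at = CURRENT_TIMESTAMP "
        "WHERE job_name = %s AND batch_key = %s",
        (job_name, batch_key),
    )
    conn.commit()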
Strategy 3: Staging → Merge Pattern (Upsert)
Load raw data into a staging table, then MERGE into the target. Duplicate source rows are naturally de-duplicated:
MERGE INTO target_orders AS t
USING staging_orders AS s ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET t.amount = s.amount, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, user_id, amount, created_at)
VALUES (s.order_id, s.user_id, s.amount, s.created_at);
Strategy 4: Content-Addressable Keys for Streaming
Generate deterministic IDs using a hash of business keys so duplicate events produce the same ID and are naturally upserted, not duplicated:
import hashlib, json
def make_idempotency_key(event: dict) -> str:
    canonical = json.dumps({k: event[k] for k in sorted(event)}, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()
Q3. How Would You De-duplicate Streaming Data When Events Can Be Replayed?
What the Interviewer Is Testing
Knowledge of stateful stream processing, bloom filters, and the practical trade-off between memory, latency, and correctness.
✅ Proper Solution
Approach 1: Stateful Dedup with a Seen-IDs Store (Flink / Spark Structured Streaming)
# Spark Structured Streaming - dropDuplicates with watermark
from pyspark.sql import functions as F
deduplicated = (
    stream_df
    .withWatermark("event_timestamp", "10 minutes")  # allow 10-min late arrivals
    .dropDuplicates(["event_id"])  # deduplicate within watermark window
)
The watermark tells Spark to garbage-collect state older than 10 minutes, preventing unbounded memory growth.
Approach 2: Redis/DynamoDB as an Idempotency Store
import redis
r = redis.Redis()
TTL_SECONDS = 3600 # match your replay window
def process_event(event):
    key = f"seen:{event['event_id']}"
    # SET NX = set only if not exists; atomic operation
    is_new = r.set(key, 1, ex=TTL_SECONDS, nx=True)
    if is_new:
        write_to_sink(event)
    # else: duplicate, skip silently
Approach 3: Bloom Filter for High-Throughput, Near-Exact Dedup
When you have tens of millions of event IDs per hour, an exact set in Redis becomes expensive. A Bloom filter offers O(1) lookups with a configurable false-positive rate (e.g., 0.1%) at a fraction of the memory cost. Use pybloom_live or the BF.ADD command from the RedisBloom module (included in Redis Stack).
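A toy, standard-library-only sketch of the idea (production pipelines would reach for pybloom_live or RedisBloom instead; the sizing numbers here are illustrative):
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 8_000_000, num_hashes: int = 7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive k independent bit positions from one cryptographic hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

seen = BloomFilter()

def is_probable_duplicate(event_id: str) -> bool:
    if seen.might_contain(event_id):
        return True  # may be a false positive (~0.1% when sized correctly)
    seen.add(event_id)
    return False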
Design Trade-offs Table
| Method | Correctness | Memory | Best For |
|---|---|---|---|
| Spark dropDuplicates + watermark | Exact within window | Medium | Bounded replay windows |
| Redis SET NX | Exact | Higher | Cross-service dedup |
| Bloom Filter | ~99.9% (tunable) | Very Low | Ultra high-throughput |
| DB Upsert (MERGE) | Exact | N/A (disk) | Sink-level dedup |
Q4. How Would You Optimize a Slow Query in a Columnar Data Warehouse Without Modifying the Query Logic?
What the Interviewer Is Testing
Infrastructure-level optimization skills — because in many organizations, data engineers don’t own the BI queries but are expected to make them fast.
✅ Proper Solution
1. Fix Clustering / Sort Keys
In Redshift, ensure the table's sort key matches the query's WHERE and JOIN predicates. In BigQuery, use clustering columns that align with your filter columns. In Snowflake, define a clustering key on your most common filter columns and let Automatic Clustering maintain it (Snowflake advises against extremely high-cardinality clustering keys).
2. Analyze and Update Table Statistics
-- Redshift
ANALYZE table_name;
-- BigQuery: statistics update automatically; use INFORMATION_SCHEMA.TABLE_STORAGE
-- Snowflake
ALTER TABLE orders CLUSTER BY (order_date, region);
3. Add a Materialized View or Aggregate Table
If the query repeatedly joins and aggregates the same large tables, create a pre-computed materialized view. The query optimizer will automatically rewrite against the materialized view if column definitions match (BigQuery and Snowflake do this natively).
4. Partition Pruning Audit
Run EXPLAIN to check if partition pruning is happening. Common culprits that break pruning: wrapping partition columns in functions like DATE(event_timestamp) instead of using the raw partition column, or implicit type casts.
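A quick way to run that audit from PySpark, assuming an active SparkSession named spark and the hypothetical events table from Q1 (partitioned by event_date):
# Pruning-friendly: filter directly on the raw partition column.
spark.sql("""
    EXPLAIN SELECT COUNT(*) FROM events
    WHERE event_date >= '2024-11-01'
""").show(truncate=False)  # plan should show only the matching partitions being read

# Pruning-breaking anti-pattern: the predicate wraps a column in a function,
# so the optimizer cannot map it back to partition metadata.
spark.sql("""
    EXPLAIN SELECT COUNT(*) FROM events
    WHERE DATE(event_timestamp) >= '2024-11-01'
""").show(truncate=False)  # typically scans every partition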
5. Increase Warehouse/Cluster Size (Last Resort)
Before scaling up hardware, exhaust all structural optimizations. Scaling is often 10–20x more expensive than a proper clustering fix.
Q5. Describe Strategies for Handling Late-Arriving Data in a Batch + Streaming Hybrid Pipeline
What the Interviewer Is Testing
Your understanding of the Lambda / Kappa / Delta architecture spectrum and the practical mechanisms for reconciling real-time and historical data.
Strategy 1: Event-Time Watermarks in the Streaming Layer
Configure watermarks that define how late data is tolerated before a window is finalized. In Apache Flink or Spark Structured Streaming:
from pyspark.sql import functions as F

(stream.withWatermark("event_time", "2 hours")
    .groupBy(F.window("event_time", "1 hour"), "user_id")
    .agg(F.sum("amount")))
Events arriving more than 2 hours late are dropped from streaming windows but can be captured by the batch layer.
Strategy 2: Delta Table Time Travel + Batch Reprocessing
Use Delta Lake (or Apache Iceberg / Apache Hudi) as your storage layer. Late-arriving events are appended with their original event_time. A daily batch job re-aggregates affected partitions:
from delta.tables import DeltaTable
# Upsert late-arriving records
delta_table = DeltaTable.forPath(spark, "/data/events")
delta_table.alias("t").merge(
late_events_df.alias("s"),
"t.event_id = s.event_id"
).whenNotMatchedInsertAll().execute()
# Re-run aggregation for affected dates
spark.sql("REFRESH TABLE aggregated_events")
Strategy 3: Grace Periods + Correction Streams
Publish two streams: a provisional stream (low latency, may miss late data) and a corrected stream (T+24h, includes all late arrivals). Downstream dashboards consume both and apply the correction as an update. This is how financial firms reconcile intraday vs. end-of-day reporting.
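A minimal, framework-agnostic Python sketch of how a consumer can apply those corrections (the record shape and field names are assumptions): state is keyed by (window_start, key), and a corrected record with a higher revision simply replaces the provisional one.
dashboard_state: dict = {}

def apply_record(record: dict) -> None:
    key = (record["window_start"], record["user_id"])
    current = dashboard_state.get(key)
    if current is None or record["revision"] > current["revision"]:
        dashboard_state[key] = record  # corrected rows (T+24h) win over provisional ones

apply_record({"window_start": "2024-11-01T10:00", "user_id": "u1",
              "amount": 90.0, "revision": 0})   # provisional, low latency
apply_record({"window_start": "2024-11-01T10:00", "user_id": "u1",
              "amount": 104.5, "revision": 1})  # corrected, includes late arrivals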
Strategy 4: Partition-Level Reprocessing Flags
Maintain a reprocess_flags table. When late data arrives for a given date partition, flag it. A scheduled job scans flags and re-runs only affected partitions, rather than reprocessing everything.
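A PySpark sketch of the flag-driven re-run loop (assumes reprocess_flags is a Delta/Iceberg table that supports SQL UPDATE, and rebuild_partition is your own re-aggregation job; both names are hypothetical):
flagged = spark.sql(
    "SELECT partition_date FROM reprocess_flags WHERE needs_reprocess = true"
).collect()

for row in flagged:
    rebuild_partition(row["partition_date"])  # re-aggregate only this date partition

if flagged:
    date_list = ", ".join(f"'{row['partition_date']}'" for row in flagged)
    spark.sql(
        f"UPDATE reprocess_flags SET needs_reprocess = false "
        f"WHERE partition_date IN ({date_list})"
    )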
Q6. What Happens Internally During a Data Shuffle in Distributed Systems, and How Do You Reduce Its Cost?
What the Interviewer Is Testing
Deep systems knowledge — understanding why certain operations are expensive and how to architect around them.
✅ Proper Solution
What Happens During a Shuffle
A shuffle is triggered by any operation that requires data from multiple partitions to be co-located on the same node — GROUP BY, JOIN, DISTINCT, ORDER BY, repartition(). The process:
- Map phase: Each task computes a hash of the shuffle key and writes output to local disk, partitioned by hash bucket.
- Transfer phase: Reducers on other machines issue HTTP fetch requests to pull their assigned buckets from each mapper’s disk.
- Sort/Reduce phase: Received data is sorted by key, then aggregated.
This involves disk I/O (write + read), network I/O, and serialization — typically the most expensive operation in any distributed job.
How to Reduce Shuffle Cost
- Map-side pre-aggregation (combiner): In Spark, use reduceByKey instead of groupByKey. reduceByKey runs a local aggregation before shuffling, dramatically reducing data volume.
- Broadcast joins: If one side of a join is small (< a few hundred MB), broadcast it to every executor. Zero shuffle. Tune spark.sql.autoBroadcastJoinThreshold (e.g., 100MB).
- Partition on join keys in advance: Pre-partition both tables on the join key and persist. Subsequent joins on the same key require no reshuffle (sort-merge join reuses existing partitioning).
- Avoid unnecessary wide transformations: distinct() before a join is often a shuffle that can be eliminated by fixing the upstream dedup logic.
- AQE (Adaptive Query Execution): Enable Spark's AQE (spark.sql.adaptive.enabled=true) — it dynamically coalesces small shuffle partitions and switches join strategies at runtime. A short sketch of the first and last levers follows this list.
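A short PySpark sketch of those two levers (assuming an active SparkSession named spark; the tiny RDD is purely illustrative):
# Adaptive Query Execution: coalesces small shuffle partitions and can
# switch join strategies at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

pairs = spark.sparkContext.parallelize([("u1", 10), ("u2", 5), ("u1", 7)])

# groupByKey ships every value across the network before aggregating.
totals_slow = pairs.groupByKey().mapValues(sum)

# reduceByKey combines locally on each partition first, shuffling far less data.
totals_fast = pairs.reduceByKey(lambda a, b: a + b)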
Q7. How Would You Detect and Mitigate Data Skew in a Large Spark or Distributed SQL Job?
What the Interviewer Is Testing
Whether you can diagnose performance issues in production and apply targeted fixes, not just throw more resources at the problem.
✅ Proper Solution
Detection
- In the Spark UI, look at the Stages tab → Task Metrics. If a few tasks take 10–100x longer than median, you have skew.
- Check partition sizes: df.groupBy(spark_partition_id()).count().show(). A healthy partition distribution is roughly equal. Skew shows up as one partition having 100M rows while others have 1M.
- Common skew causes: NULL keys on joins, power-law distributed IDs (e.g., one mega-merchant has 50% of all orders), hot dates in time-series data.
Mitigation Strategies
1. Salting: Append a random integer [0–N] to the skewed join key, explode the small table by the same factor, then join.
import pyspark.sql.functions as F
SALT_BUCKETS = 20
# Salt the large (skewed) table
skewed_df = large_df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
skewed_df = skewed_df.withColumn("salted_key", F.concat(F.col("join_key"), F.lit("_"), F.col("salt")))
# Explode the small table across all salt values
small_exploded = small_df.withColumn("salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)])))
small_exploded = small_exploded.withColumn("salted_key", F.concat(F.col("join_key"), F.lit("_"), F.col("salt")))
result = skewed_df.join(small_exploded, "salted_key")
2. Isolate Skewed Keys: Split the dataset into skewed keys + non-skewed keys. Process skewed keys with a broadcast join, non-skewed keys with a sort-merge join. Union results.
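A hedged PySpark sketch of that split-and-union approach, assuming the hot keys have already been profiled and that orders and merchants DataFrames exist (all names are illustrative):
import pyspark.sql.functions as F

HOT_KEYS = ["MEGA_MERCHANT_1", "MEGA_MERCHANT_2"]  # assumed known from profiling

hot_orders = orders.filter(F.col("merchant_id").isin(HOT_KEYS))
rest_orders = orders.filter(~F.col("merchant_id").isin(HOT_KEYS))

# Hot keys: the matching merchant rows are tiny, so broadcast them (no shuffle).
hot_joined = hot_orders.join(
    F.broadcast(merchants.filter(F.col("merchant_id").isin(HOT_KEYS))),
    "merchant_id",
)

# Everything else is not skewed, so a normal sort-merge join is fine.
rest_joined = rest_orders.join(merchants, "merchant_id")

result = hot_joined.unionByName(rest_joined)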
3. Skew Hints: Some engines (notably Databricks Runtime) support explicit skew join hints; the exact syntax varies by platform:
SELECT /*+ SKEW_JOIN(orders, 'merchant_id', ('MEGA_MERCHANT_1', 'MEGA_MERCHANT_2')) */
o.order_id, m.merchant_name
FROM orders o JOIN merchants m ON o.merchant_id = m.merchant_id;
4. AQE Skew Join Optimization: Enable spark.sql.adaptive.skewJoin.enabled=true — Spark 3 will automatically detect and split skewed partitions at runtime.
Q8. Design a Data Ingestion Pipeline That Guarantees Exactly-Once Processing at Scale
What the Interviewer Is Testing
System design maturity — exactly-once is the hardest guarantee in distributed systems. They want to know if you understand why it’s hard and how production systems achieve it.
✅ Proper Solution
Why Exactly-Once Is Hard
A distributed network can drop messages (causing retries → at-least-once) or crash mid-write (causing partial writes). True exactly-once requires coordination between the source, the processor, and the sink — all three layers must participate.
Architecture: Transactional Outbox + Idempotent Sink
[Source DB] → [Outbox Table] → [CDC / Debezium] → [Kafka Topic]
↓
[Flink/Spark Consumer]
↓
[Idempotent Sink (Delta/Iceberg)]
Layer 1 — Transactional Outbox: Writes to the source DB and to an outbox table happen in the same DB transaction. This prevents the “wrote to DB but failed before sending to Kafka” problem.
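A minimal sketch of the outbox write, using Python's built-in sqlite3 purely for illustration (a real system would use the production source database, with Debezium tailing the outbox table):
import json
import sqlite3
import uuid

conn = sqlite3.connect("source.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT PRIMARY KEY, amount REAL)")
conn.execute("CREATE TABLE IF NOT EXISTS outbox (event_id TEXT PRIMARY KEY, payload TEXT)")

def place_order(order_id: str, amount: float) -> None:
    # The business row and the outbox row commit in the SAME transaction,
    # so an event exists if and only if the order exists.
    with conn:  # sqlite3 connection as context manager = one transaction
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, amount))
        conn.execute(
            "INSERT INTO outbox VALUES (?, ?)",
            (str(uuid.uuid4()),
             json.dumps({"type": "order_placed", "order_id": order_id, "amount": amount})),
        )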
Layer 2 — Kafka Exactly-Once Semantics: Use Kafka transactions (enable.idempotence=true, transactional.id) to ensure producers don’t duplicate messages on retry, and consumers commit offsets atomically with processing.
Layer 3 — Idempotent Sink: Use Delta Lake’s transactional writes. Flink’s Delta connector writes files + commits a transaction log entry atomically. Re-running from an earlier checkpoint re-commits the same files (idempotent by design).
# Flink with exactly-once checkpoint to Delta Lake
env.enable_checkpointing(60000) # checkpoint every 60s
env.get_checkpoint_config().set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)
env.get_checkpoint_config().enable_externalized_checkpoints(
ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
)
Offset Tracking Pattern for Spark
# Spark Structured Streaming — write offsets + data atomically
query = (
stream_df.writeStream
.format("delta")
.outputMode("append")
.option("checkpointLocation", "/checkpoints/pipeline_v1") # atomic offset + data
.start("/data/sink/events")
)
The checkpoint directory stores Kafka offsets. On restart, Spark reads the last committed offset and re-processes from exactly that point, writing to Delta atomically. This achieves effective exactly-once end-to-end.
Q9. How Would You Model Data to Support Both Low-Latency Dashboards and Long-Term Analytical Queries?
What the Interviewer Is Testing
Data modeling maturity — specifically the tension between OLAP optimization (wide denormalized tables) and the need for sub-second dashboard response times.
The Unified Pattern: Multi-Tier Data Model
Tier 1: Raw / Bronze Layer → Normalized, immutable, full history
Tier 2: Curated / Silver Layer → Cleaned, joined, business-entity-centric
Tier 3: Aggregate / Gold Layer → Pre-computed metrics, dashboard-ready
Low-latency dashboards query Tier 3 (Gold). These are pre-aggregated tables updated every 15 minutes to 1 hour. They contain denormalized, ready-to-use metrics: daily_active_users, gmv_by_region_hour, conversion_funnel_by_cohort. Response time: <200ms because data is pre-computed and clustered on the query dimensions.
Long-term analytical queries join across Silver/Bronze layers. Analysts can explore any dimension combination. These queries are expected to take seconds to minutes and are run in batch by analysts, not in real-time by end-users.
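To make the Gold tier concrete, a hedged PySpark sketch (table names, columns, and metrics are illustrative assumptions, not a prescription):
import pyspark.sql.functions as F

# Assumes a Silver-layer table of cleaned orders with event_ts, region, user_id, amount.
silver_orders = spark.table("silver.orders")

gmv_by_region_hour = (
    silver_orders
    .withColumn("report_date", F.to_date("event_ts"))
    .withColumn("report_hour", F.hour("event_ts"))
    .groupBy("report_date", "report_hour", "region")
    .agg(F.sum("amount").alias("gmv"),
         F.countDistinct("user_id").alias("active_users"))
)

# Overwritten on each scheduled run (every 15-60 minutes); dashboards read
# this pre-aggregated, partition-pruned table directly.
(gmv_by_region_hour.write
    .mode("overwrite")
    .partitionBy("report_date")
    .saveAsTable("gold.gmv_by_region_hour"))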
Concrete Modeling Choices
- Gold layer: Denormalized "wide" fact tables with all frequently-queried dimensions embedded. No joins required at query time. Clustered on (report_date, region) for the most common dashboard filters.
- Silver layer: Normalized fact + dimension tables (Kimball-style star schema). Supports ad hoc exploration but requires joins.
- Semantic layer: Use a tool like dbt metrics, Looker LookML, or Cube.dev to define a semantic layer over Silver. Dashboards query the semantic layer (which hits Gold cache first, falls back to Silver).
For Sub-100ms Latency
Introduce a caching layer: Apache Druid, ClickHouse, or Apache Pinot for real-time OLAP. Ingest the Gold layer metrics into Druid/Pinot. Dashboard queries route to Druid. Analytical exploration routes to Snowflake/BigQuery Silver layer. This is the architecture used by LinkedIn (Pinot), Uber (Pinot), and Netflix (Druid).
Q10. What Trade-offs Would You Consider Between Partitioning vs. Bucketing/Clustering?
What the Interviewer Is Testing
The depth of your understanding of physical data layout and how it affects query planning, maintenance overhead, and storage efficiency.
✅ Proper Solution
Partitioning
Data is split into separate directories/files based on column value (e.g., event_date=2024-11-01/). The query engine uses partition metadata to entirely skip irrelevant directories — this is called partition pruning.
When to use: Low-cardinality columns that are almost always in your WHERE clause: date, country, event_type. A table with 3 years of daily partitions has ~1,095 partitions — very manageable.
Pitfalls: Over-partitioning on high-cardinality columns (e.g., user_id) creates millions of tiny files (the “small files problem”), causing metadata overhead that can crush HDFS NameNode or object store list performance. Rule of thumb: target partition sizes of 100MB–1GB.
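A small PySpark sketch of compacting writes into that size range (the DataFrame name and output path are assumptions you would tune per table):
# Compact rows so each event_date partition is written as a few large files,
# instead of hundreds of upstream tasks each emitting a tiny file per date.
(events_df
    .repartition("event_date")  # co-locate each date's rows before the write
    .write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://warehouse/events/"))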
Bucketing (Hive/Spark) / Clustering (BigQuery/Snowflake)
Data within a partition (or the whole table) is sorted and divided by a hash of the column value into a fixed number of buckets. Unlike partitioning, this doesn’t create separate directories — it controls data layout within files.
When to use: High-cardinality join keys (e.g., user_id, order_id) where you frequently join two large tables on the same key. If both tables are bucketed on user_id with the same number of buckets, Spark can perform a bucket join — no shuffle required.
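In Spark, bucketed writes look roughly like this (a sketch; the bucket count and table names are assumptions, and both tables must use the same bucket column and count for the shuffle-free join):
# Write both large tables bucketed on the join key with the same bucket count,
# so a later join can use a bucketed sort-merge join with no shuffle.
(orders_df.write
    .bucketBy(256, "user_id")
    .sortBy("user_id")
    .mode("overwrite")
    .saveAsTable("warehouse.orders_bucketed"))

(users_df.write
    .bucketBy(256, "user_id")
    .sortBy("user_id")
    .mode("overwrite")
    .saveAsTable("warehouse.users_bucketed"))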
Comparison Table
| Dimension | Partitioning | Bucketing / Clustering |
|---|---|---|
| Best for | Range filters (date, region) | Point lookups, joins on high-cardinality keys |
| Pruning mechanism | Skip entire directories | Skip within files (min/max zone maps) |
| Cardinality fit | Low (days, countries) | High (user IDs, order IDs) |
| Shuffle elimination | No | Yes (bucket join in Spark/Hive) |
| Small files risk | High if over-partitioned | Controlled (fixed bucket count) |
| Write overhead | Low | Higher (sort + hash during write) |
The Optimal Pattern: Partition + Bucket Together
-- Hive / Spark SQL: partition by date, bucket by user_id
CREATE TABLE orders (
    order_id BIGINT,
    user_id BIGINT,
    amount DECIMAL(10,2)
)
PARTITIONED BY (order_date DATE)
CLUSTERED BY (user_id) INTO 256 BUCKETS
STORED AS PARQUET;
Queries filtered on order_date get partition pruning. Joins on user_id within a partition get bucket join (no shuffle). You get both benefits.
How to Use These Answers in an Interview
At the ₹50–60 LPA Data Engineer II level, interviewers don’t just want the correct answer — they want to see reasoning under constraints. For every question, practice a structure like:
- Restate the core challenge (shows you understand the problem, not just the answer)
- Give the approach you’d choose first and why
- Proactively mention the trade-offs of your choice
- Mention at least one alternative and when you’d use it instead
This structure transforms a correct answer into a senior-level answer — and that’s the difference between an offer and a rejection at the top of the salary band.
Found this useful? Share it with your network preparing for senior data engineering interviews. Drop your questions in the comments — we answer every one.




