
Introduction

If you’re preparing for a Data Engineering role at J.P. Morgan with around 4 years of experience, you’re not just competing on technical knowledge — you’re being evaluated on how you think about data in a regulated, high-stakes financial environment. J.P. Morgan’s interview process is known for repeatedly testing a specific set of scenarios around pipeline design, SQL, data quality, governance, and incident handling. This guide covers all 12 of those questions with complete, senior-level answers written with banking context in mind.


Question 1: How would you design a data pipeline to process millions of financial transactions with high accuracy and low latency?

Category: System Design

This is one of the most common opening questions at J.P. Morgan. They want to see that you think about throughput, accuracy, and fault tolerance simultaneously — not as trade-offs.

Architecture Blueprint

Ingestion Layer: Use Apache Kafka with partitioning by account ID or instrument ID. This ensures ordering within partitions and horizontal scalability.

Stream Processing: Apache Flink or Spark Structured Streaming for real-time computation — windowed aggregations, balance checks, fraud scoring.

Exactly-once semantics: Enable Flink’s checkpointing combined with Kafka transactions to guarantee no duplicate writes or missed records.

Storage Layer: Write to Delta Lake (ACID compliance) or Apache Iceberg for audit trails. Keep hot data in Redis (sub-millisecond reads) or Cassandra (typically single-digit-millisecond reads) for low-latency lookups.

Schema Registry: Use Confluent Schema Registry with Avro to enforce schema contracts across producers and consumers.

Monitoring: Dead-letter queues for failed records, Prometheus and Grafana for lag and throughput dashboards, PagerDuty alerts for SLA breaches.
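The ordering guarantee from the ingestion layer above rests on one idea: every event with the same key lands in the same partition. The following is a minimal sketch of that idea in plain Python, not Kafka client code. The partition count, the CRC32 hash, and the names `partition_for` and `route` are assumptions made for illustration; Kafka's default partitioner uses a different hash function.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 6  # hypothetical partition count


def partition_for(account_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (CRC32 is an assumption here)."""
    return zlib.crc32(account_id.encode("utf-8")) % num_partitions


def route(events):
    """Append each (account_id, sequence_no) event to its partition's log."""
    partitions = defaultdict(list)
    for account_id, sequence_no in events:
        partitions[partition_for(account_id)].append((account_id, sequence_no))
    return partitions


events = [("ACC-1", 1), ("ACC-2", 1), ("ACC-1", 2), ("ACC-3", 1), ("ACC-1", 3)]
logs = route(events)

# All ACC-1 events sit in one partition, in the order they were produced.
acc1_log = [seq for acc, seq in logs[partition_for("ACC-1")] if acc == "ACC-1"]
```

Because ordering only holds within a partition, the choice of key (account ID versus instrument ID) decides which events can be processed in strict sequence.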

Key Tools to Mention

Kafka · Apache Flink · Delta Lake · Confluent Schema Registry · Redis · Prometheus · Grafana

J.P. Morgan Angle: Always mention regulatory requirements such as MiFID II and Dodd-Frank, which call for transaction audit trails and accurate, synchronized timestamps (MiFID II, for example, requires millisecond or finer granularity depending on the type of trading activity). This signals finance domain awareness and separates you from generic engineering candidates.


Question 2: Explain the differences between OLTP and OLAP systems and their use cases in banking.

Category: Fundamentals

Dimension | OLTP | OLAP
Purpose | Day-to-day transaction processing | Analytics and business intelligence
Data model | Normalized (3NF) | Denormalized (Star/Snowflake schema)
Query type | INSERT / UPDATE / DELETE on small rows | SELECT aggregations over millions of rows
Latency | Milliseconds | Seconds to minutes (acceptable)
Concurrency | Thousands of concurrent users | Fewer users, heavier queries
Banking examples | ATM withdrawals, wire transfers, KYC updates | Risk reporting, PnL dashboards, regulatory capital calculations
Tools used at JPM | Oracle DB, SQL Server, Sybase | Snowflake, Redshift, BigQuery, Teradata

Pro Tip: Mention HTAP (Hybrid Transactional/Analytical Processing) systems like Google Spanner or TiDB that blur the line between OLTP and OLAP — increasingly relevant in real-time trading analytics at major banks.


Question 3: Write a SQL query to identify accounts with transactions exceeding a specified threshold within a 24-hour period.

Category: SQL

This tests your ability to write efficient SQL for financial fraud detection or AML (Anti-Money Laundering) use cases.

-- Identify accounts exceeding $50,000 in total transactions within any 24-hour window

WITH transaction_windows AS (
    SELECT
        account_id,
        transaction_id,
        transaction_amount,
        transaction_timestamp,
        -- Sum all transactions from same account within 24 hours of current row
        SUM(transaction_amount) OVER (
            PARTITION BY account_id
            ORDER BY transaction_timestamp
            RANGE BETWEEN INTERVAL '24 hours' PRECEDING
                  AND CURRENT ROW
        ) AS rolling_24h_total
    FROM transactions
    WHERE status = 'COMPLETED'
      AND transaction_timestamp >= NOW() - INTERVAL '7 days'  -- limit scan
)

SELECT
    account_id,
    MAX(rolling_24h_total) AS peak_24h_total,
    COUNT(*) AS breaching_transaction_count  -- rows whose trailing 24h total exceeded the threshold
FROM transaction_windows
WHERE rolling_24h_total > 50000
GROUP BY account_id
ORDER BY peak_24h_total DESC;

Why this approach: A rolling window (RANGE BETWEEN with an interval offset) is more accurate than GROUP BY date because it catches activity that straddles midnight. Interval-based RANGE frames are dialect-specific (the syntax shown is PostgreSQL-style), so check what your engine supports. Filtering on status = 'COMPLETED' avoids counting pending or reversed transactions, which AML monitoring typically excludes.
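To see what the trailing 24-hour frame computes, here is a small pure-Python sketch of the same logic using a sliding window per account. The function name `peak_rolling_totals`, the tuple layout, and the sample data are assumptions for illustration. It reproduces the peak total per account, not every per-row value the SQL would return.

```python
from collections import deque
from datetime import datetime, timedelta


def peak_rolling_totals(transactions, window=timedelta(hours=24)):
    """Return {account_id: highest trailing-window total}.

    `transactions` is an iterable of (account_id, timestamp, amount) tuples.
    The window ending at each transaction covers `timestamp - window`
    up to and including `timestamp`.
    """
    peaks = {}
    state = {}
    for account_id, ts, amount in sorted(transactions, key=lambda t: (t[0], t[1])):
        acct = state.setdefault(account_id, {"rows": deque(), "total": 0})
        acct["rows"].append((ts, amount))
        acct["total"] += amount
        # Evict rows that have fallen out of the trailing window.
        while acct["rows"][0][0] < ts - window:
            _, old_amount = acct["rows"].popleft()
            acct["total"] -= old_amount
        peaks[account_id] = max(peaks.get(account_id, 0), acct["total"])
    return peaks


sample = [
    ("ACC-1", datetime(2024, 1, 1, 23, 0), 30000),
    ("ACC-1", datetime(2024, 1, 2, 1, 0), 25000),   # two hours later, next calendar day
    ("ACC-2", datetime(2024, 1, 1, 9, 0), 40000),
    ("ACC-2", datetime(2024, 1, 2, 10, 0), 40000),  # 25 hours later
]
peaks = peak_rolling_totals(sample)
flagged = {account for account, peak in peaks.items() if peak > 50000}
```

ACC-1 is flagged even though neither calendar day exceeds the threshold on its own, which is exactly the case a GROUP BY on date would miss.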


Question 4: A trade-processing pipeline is producing duplicate records. How would you identify and eliminate the duplicates?

Category: Data Quality

Step 1 — Identify Duplicates

-- Find duplicate trades by natural business key
SELECT
    trade_id,
    execution_timestamp,
    instrument_id,
    quantity,
    price,
    COUNT(*) AS duplicate_count
FROM trade_records
GROUP BY
    trade_id, execution_timestamp,
    instrument_id, quantity, price
HAVING COUNT(*) > 1
ORDER BY duplicate_count DESC;

Step 2 — Deduplicate Using ROW_NUMBER

WITH ranked_trades AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY trade_id
            ORDER BY ingestion_timestamp DESC
        ) AS rn
    FROM trade_records
)
SELECT * FROM ranked_trades
WHERE rn = 1;

Prevention Strategy

Idempotent consumers: Use Kafka consumer group offsets combined with exactly-once semantics in Flink to prevent reprocessing the same message.

Unique constraints: Add a UNIQUE(trade_id, source_system) constraint at the database level as a safety net.

Upsert pattern: Use Delta Lake’s MERGE INTO or Iceberg’s upsert to safely handle re-delivered messages without creating duplicates.

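Bloom filters: For very high-volume streams, use a distributed Bloom filter backed by Redis as a fast pre-check on whether a trade_id was already processed. Bloom filters can return false positives, so confirm any "already seen" answer against the authoritative store before discarding a trade. A "not seen" answer is always safe to write.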
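The Bloom filter pre-check can be sketched in a few lines. This is an in-memory stand-in, assuming a dictionary as the authoritative store and SHA-256 as the hash; the class names `BloomFilter` and `DedupingSink` are hypothetical and no Redis API is used.

```python
import hashlib


class BloomFilter:
    """Tiny in-memory Bloom filter; a stand-in for a Redis-backed one."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


class DedupingSink:
    def __init__(self):
        self.bloom = BloomFilter()
        self.store = {}  # authoritative store keyed by (trade_id, source_system)

    def write(self, trade: dict) -> bool:
        """Return True if the trade was written, False if it was a duplicate."""
        key = (trade["trade_id"], trade["source_system"])
        bloom_key = f"{key[0]}|{key[1]}"
        # A negative Bloom answer is definitive; a positive must be confirmed.
        if self.bloom.might_contain(bloom_key) and key in self.store:
            return False
        self.store[key] = trade
        self.bloom.add(bloom_key)
        return True


sink = DedupingSink()
first = sink.write({"trade_id": "T1", "source_system": "OMS", "qty": 100})
replay = sink.write({"trade_id": "T1", "source_system": "OMS", "qty": 100})
other = sink.write({"trade_id": "T1", "source_system": "FIX", "qty": 100})
```

The filter saves a store lookup for the common case of a new trade; the store lookup on a positive answer is what keeps a false positive from silently dropping a legitimate trade.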


Question 5: A source system unexpectedly changes a column’s datatype, causing pipeline failures. How would you prevent and handle such issues?

Category: Resilience

This is a schema drift scenario — extremely common in enterprise banking environments where upstream systems, often legacy, change without prior notice.

Prevention Measures

Schema Registry: Enforce schema compatibility (BACKWARD, FORWARD, or FULL mode) via Confluent Schema Registry. Producers must register their schema before publishing to any topic.

Contract testing: Use tools like Great Expectations or dbt tests to validate incoming data schema against expected column types before any transformation runs.

Schema evolution policies: Allow nullable new columns as additive changes. Reject breaking changes such as type narrowing or column removal.

Handling When It Happens

Safe cast with fallback: Attempt a safe cast of the column. On failure, route the record to a quarantine table with the original payload and a descriptive error message.

Schema evolution in storage: Delta Lake’s mergeSchema = true option can auto-absorb compatible changes like new nullable columns.

Dead-letter topic: Publish unprocessable records to a Kafka dead-letter queue (DLQ) containing the original payload, error reason, column name, and timestamp.

Immediate alerting: Fire a PagerDuty or Slack alert with the column name, old type, new type, and source system name. Never let schema failures fail silently.
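The safe-cast-with-quarantine step described above might look like the following sketch. The expected schema, the function names, and the quarantine record layout are assumptions for illustration. A real pipeline would write quarantined rows to a table or a dead-letter topic instead of a list.

```python
from datetime import datetime, timezone
from decimal import Decimal, InvalidOperation

# Hypothetical contract for incoming trade records.
EXPECTED_SCHEMA = {"trade_id": str, "quantity": int, "price": Decimal}


def safe_cast(value, target_type):
    """Cast strictly, raising ValueError on anything that does not convert cleanly."""
    if value is None:
        raise ValueError("value is missing")
    try:
        # Going through str() rejects lossy conversions such as int(10.7).
        return target_type(str(value))
    except (ValueError, InvalidOperation) as exc:
        raise ValueError(f"cannot cast {value!r} to {target_type.__name__}") from exc


def process(record, quarantine):
    """Return the typed record, or None after routing it to quarantine."""
    clean = {}
    for column, target_type in EXPECTED_SCHEMA.items():
        try:
            clean[column] = safe_cast(record.get(column), target_type)
        except ValueError as exc:
            quarantine.append({
                "payload": record,  # keep the original payload untouched
                "column": column,
                "error": str(exc),
                "quarantined_at": datetime.now(timezone.utc).isoformat(),
            })
            return None
    return clean


quarantine = []
good = process({"trade_id": "T1", "quantity": "100", "price": "101.25"}, quarantine)
bad = process({"trade_id": "T2", "quantity": "1e3", "price": "99"}, quarantine)
```

Keeping the original payload alongside the column name and error message is what makes the quarantined record replayable once the schema change has been agreed.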

Senior-level answer: Mention implementing a formal data contract between producer and consumer teams — a written or codified schema SLA so upstream teams must notify consumers and version their changes before deploying to production.


Question 6: Explain the concept of data lineage and why it is important in financial institutions.

Category: Data Governance

Data lineage is the complete record of where data originates, how it flows through systems, how it is transformed at each step, and where it ultimately lands — a full provenance trail for every data asset in the organization.

Why It’s Critical in Finance

Regulatory compliance: BCBS 239 (Basel Committee’s Risk Data Aggregation principles) mandates that banks must be able to trace any figure in a risk report back to its originating transaction. Without automated lineage, this is a manual and error-prone process.

Audit trails: Regulators such as the SEC, FCA, and RBI require documented proof of how reported figures were calculated. Lineage provides that proof in an auditable, automated form.

Impact analysis: When a source system changes its schema or logic, lineage maps tell you exactly which downstream reports, dashboards, and models are affected — preventing surprise production failures.

Incident root-cause analysis: When a PnL figure looks wrong at end of day, lineage allows engineers to trace the data backward through every transformation to pinpoint exactly where corruption or loss occurred.
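Impact analysis and root-cause tracing are both graph traversals over the lineage map, one forward and one backward. A minimal sketch, assuming a hand-written lineage graph with made-up dataset names (lineage tools expose this through their own APIs, which are not shown here):

```python
from collections import deque

# Hypothetical lineage graph: each dataset feeds the datasets in its list.
LINEAGE = {
    "trades_raw": ["trades_clean"],
    "trades_clean": ["positions", "pnl_daily"],
    "positions": ["risk_report"],
    "pnl_daily": ["pnl_dashboard"],
    "risk_report": [],
    "pnl_dashboard": [],
}


def reachable(graph, start):
    """Breadth-first search returning every node reachable from `start`."""
    seen = set()
    queue = deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(graph.get(node, []))
    return seen


def downstream(graph, dataset):
    """Impact analysis: everything affected if `dataset` changes."""
    return reachable(graph, dataset)


def upstream(graph, dataset):
    """Root-cause tracing: everything `dataset` was derived from."""
    reversed_graph = {node: [] for node in graph}
    for source, targets in graph.items():
        for target in targets:
            reversed_graph[target].append(source)
    return reachable(reversed_graph, dataset)


impacted = downstream(LINEAGE, "trades_clean")
sources = upstream(LINEAGE, "risk_report")
```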

Tools Used at Scale

Apache Atlas · DataHub (by LinkedIn) · OpenLineage · dbt lineage graph · Collibra · Alation

J.P. Morgan context: As a global systemically important bank, JPM falls under BCBS 239, and banks of that size commonly rely on golden-source frameworks to maintain data lineage across thousands of pipelines. Mentioning BCBS 239 by name signals deep domain knowledge and genuine finance industry experience.


Question 7: A downstream risk reporting application is receiving stale data. How would you investigate the issue?

Category: Troubleshooting

Walk through a structured investigation rather than jumping to a single conclusion. Interviewers are assessing your systematic debugging methodology as much as your technical knowledge.

Investigation Checklist — Answer in This Order

  1. Check pipeline health: Is the ETL or streaming job still running? Check your job scheduler (Airflow DAG, Control-M) for failures, warnings, or SLA misses.
  2. Check consumer lag: For Kafka-based pipelines, check consumer group lag. A steadily growing lag means the consumer is falling behind the producer, and the downstream app is reading from an outdated offset.
  3. Inspect data watermarks: In Flink or Spark Structured Streaming, check the event-time watermark. If it has stopped advancing, a stuck partition or idle source could be causing it.
  4. Check source freshness: Query the source system directly. The problem may originate upstream — a delayed batch job or a broken Change Data Capture (CDC) connector could be starving the pipeline.
  5. Audit the transformation layer: Check the dbt model’s last successful run time, or the last DWH refresh timestamp. A missed scheduled run is the single most common cause of stale downstream data.
  6. Check the caching layer: If a Redis or Memcached cache sits between the database and the reporting app, verify the TTL configuration. Stale cache entries can serve old data even when the underlying records are fresh.
  7. Review resource contention: Database lock contention or a long-running analytical query could be blocking writes to the reporting table entirely.

Closing answer to give: “Once the root cause is identified, I would implement data freshness SLAs monitored via Great Expectations or dbt source tests, with automated Slack alerts if data hasn’t been refreshed within an agreed SLA window — say, 15 minutes for risk reports.”
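The freshness SLA in that closing answer reduces to comparing the last load time with an agreed window. A minimal sketch, assuming a 15-minute SLA and a hypothetical `check_freshness` helper (this is not the Great Expectations or dbt API):

```python
from datetime import datetime, timedelta, timezone


def check_freshness(last_loaded_at, now, sla=timedelta(minutes=15)):
    """Report the data's age and whether it breaches the freshness SLA."""
    age = now - last_loaded_at
    return {"age_seconds": int(age.total_seconds()), "breached": age > sla}


now = datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc)
fresh = check_freshness(datetime(2024, 1, 2, 8, 50, tzinfo=timezone.utc), now)
stale = check_freshness(datetime(2024, 1, 2, 8, 30, tzinfo=timezone.utc), now)
```

Passing `now` in explicitly keeps the check deterministic and easy to test; a scheduler would supply the current UTC time and raise the alert when `breached` is true.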


Question 8: How would you design a historical data archive strategy while maintaining query performance?

Category: Architecture

Tiered Storage Strategy

Hot tier (0–90 days): Keep data in the primary RDBMS or Snowflake — fully indexed, partitioned by date, optimized for sub-second queries by trading desks.

Warm tier (90 days to 3 years): Move to Delta Lake or Apache Iceberg on cloud object storage (S3 or Azure ADLS). Apply Z-ordering on frequently queried columns such as account_id and trade_date for fast reads without full table scans.

Cold tier (3+ years): Compress and move to S3 Glacier or Azure Archive Storage. Data can still be restored for querying, but retrieval takes minutes to hours depending on the retrieval option, which is acceptable for regulatory audit requests that are planned in advance.
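The tier boundaries above translate directly into an age-based rule that an archival job could apply per partition. A minimal sketch, with the function name `storage_tier` and the use of 365-day years as assumptions:

```python
from datetime import date

HOT_MAX_DAYS = 90
WARM_MAX_DAYS = 3 * 365  # three years, ignoring leap days for simplicity


def storage_tier(record_date: date, today: date) -> str:
    """Pick the storage tier for data of a given business date."""
    age_days = (today - record_date).days
    if age_days <= HOT_MAX_DAYS:
        return "hot"
    if age_days <= WARM_MAX_DAYS:
        return "warm"
    return "cold"


today = date(2024, 6, 30)
tiers = {
    "last_month": storage_tier(date(2024, 6, 1), today),
    "last_year": storage_tier(date(2023, 1, 1), today),
    "five_years_ago": storage_tier(date(2019, 1, 1), today),
}
```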

Query Performance Techniques

Partitioning: Partition Parquet files by year, month, and day. Query engines such as Spark and Athena can skip entire partitions for date-bounded queries, reducing scan costs dramatically.

Columnar format: Parquet and ORC store data column-by-column, enabling column pruning and predicate pushdown. Analytical queries that read only a few columns of a wide table can skip most of the I/O.

Data compaction: Regularly compact small Parquet files into larger ones using Delta Lake’s OPTIMIZE command. This reduces metadata overhead and improves read parallelism.

Column statistics: Maintain column-level min/max statistics to allow file pruning at the query planner level. Use result caching for repeated management reports.
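Partition pruning, the first technique above, can be illustrated without a query engine: given a date-bounded query, only the matching year/month/day directories need to be read. A minimal sketch, assuming Hive-style partition paths and hypothetical helper names:

```python
from datetime import date, timedelta


def partition_path(d: date) -> str:
    """Hive-style partition directory for one business date."""
    return f"year={d.year}/month={d.month:02d}/day={d.day:02d}"


def partitions_to_scan(all_partitions, start: date, end: date):
    """Keep only the partitions that fall inside the query's date range."""
    wanted = set()
    current = start
    while current <= end:
        wanted.add(partition_path(current))
        current += timedelta(days=1)
    return sorted(p for p in all_partitions if p in wanted)


# One month of daily partitions; the query only touches three days.
january = [partition_path(date(2024, 1, 1) + timedelta(days=i)) for i in range(31)]
scan = partitions_to_scan(january, date(2024, 1, 10), date(2024, 1, 12))
```

Here 28 of the 31 partitions are never opened, which is the effect engines such as Spark and Athena achieve when the filter is on the partition columns.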

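Finance context: Under MiFID II and SEC Rule 17a-4, banks must retain certain transaction records for multiple years (5 to 7 years is the range usually quoted). Design your tiers so that data a compliance team is likely to query stays in a tier that can answer within the agreed response time. A 30-second target for 5-year-old trade data, for example, is only realistic if that data is kept in the warm tier rather than in archive-class storage.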


Question 9: Multiple applications depend on the same dataset, but each requires different refresh frequencies. How would you design the data architecture?

Category: Architecture

This is a Data Mesh or Data Hub design question. The answer is a layered serving architecture where each consumer gets a purpose-built view of the canonical data.

Single source of truth: Maintain one canonical dataset in Delta Lake or Snowflake — the gold layer. No consumer reads raw or silver-layer data directly.

Kafka topics for real-time consumers: Applications needing sub-second refreshes (e.g., trading dashboards, fraud systems) subscribe to Kafka CDC streams directly from the source system.

Materialized views for near-real-time consumers: For apps needing 15-minute or hourly refreshes, create scheduled materialized views or incremental dbt models that consume from the gold layer.

Batch exports for daily consumers: Run nightly Airflow DAGs to generate date-stamped snapshots in Parquet or CSV format for heavy analytics applications and regulatory reporting jobs.

Data virtualization layer (optional): Tools like Trino or Dremio can federate queries across multiple layers without physically moving data — useful for ad-hoc analysis across tiers.

Key principle: Each consumer should be unaware of how the data is refreshed internally. They consume from their dedicated serving layer. This decouples producer cadences from consumer cadences and enables independent scaling of each layer.
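One way to make the layering concrete is to route each consumer to a serving layer based on the staleness it can tolerate. A minimal sketch, where the thresholds, consumer names, and layer labels are all assumptions chosen to mirror the three options above:

```python
from datetime import timedelta

# Hypothetical cut-offs; real values come from each consumer's SLA.
STREAM_MAX_STALENESS = timedelta(minutes=1)
VIEW_MAX_STALENESS = timedelta(hours=1)


def serving_layer(max_staleness: timedelta) -> str:
    """Choose the cheapest serving layer that still meets the freshness need."""
    if max_staleness <= STREAM_MAX_STALENESS:
        return "kafka_stream"
    if max_staleness <= VIEW_MAX_STALENESS:
        return "materialized_view"
    return "batch_snapshot"


CONSUMERS = {
    "fraud_scoring": timedelta(seconds=1),
    "risk_dashboard": timedelta(minutes=15),
    "regulatory_report": timedelta(days=1),
}
ROUTES = {name: serving_layer(staleness) for name, staleness in CONSUMERS.items()}
```

Keeping this mapping in configuration means a consumer can move to a fresher or cheaper layer without any change on the producer side.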


Question 10: A critical data load fails just before market opening. What steps would you take to recover quickly and safely?

Category: Incident Response

This question tests your incident management maturity. In a bank, a market-opening data failure can cost trading desks millions of dollars per minute of delay.

Immediate Response — First 5 Minutes

  1. Declare the incident immediately: Notify the on-call team and all relevant stakeholders via PagerDuty or Slack. Do not troubleshoot silently. Time wasted diagnosing alone is time the business is flying blind.
  2. Assess blast radius: Identify which downstream systems are affected — trading desks, risk engines, compliance dashboards. Prioritize response by business impact, not by technical interest.
  3. Check for partial loads: Determine whether any data was partially written. A partial load is often more dangerous than no data at all, because downstream systems may produce inconsistent results without knowing the data is incomplete.

Recovery Options — In Priority Order

  1. Retry the load: If the failure was transient (network timeout, lock contention), retry the idempotent pipeline job immediately using the same source snapshot. Monitor closely for recurrence.
  2. Roll back to last good state: If partial data was written, trigger Delta Lake’s RESTORE command to the last known good version before the failed load began.
  3. Serve T-1 data with a staleness flag: As a business continuity measure, serve yesterday’s data to downstream systems with a clear data-as-of timestamp banner. This is better than serving no data or incorrect data.
  4. Escalate for manual intervention: If automated retry and rollback both fail, escalate to senior engineers with direct database access for a targeted manual fix.
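The priority order above can be encoded in a runbook helper so the on-call engineer is not improvising under pressure. This is a minimal sketch: `recover`, its callable parameters, and the returned labels are hypothetical, and a real version would add backoff between retries and emit alerts at each step.

```python
class TransientError(Exception):
    """Stand-in for timeouts, lock contention, and similar retryable failures."""


def recover(load, rollback, serve_previous_day, max_retries=2):
    """Apply the recovery options in priority order and report what happened."""
    for attempt in range(1, max_retries + 1):
        try:
            load()  # must be idempotent and read the same source snapshot
            return f"loaded_on_attempt_{attempt}"
        except TransientError:
            continue  # transient failure: try again
        except Exception:
            break  # non-transient failure: retrying will not help
    try:
        rollback()  # restore the last known good version
    except Exception:
        return "escalate"
    serve_previous_day()  # continuity: T-1 data with a staleness flag
    return "rolled_back_serving_t_minus_1"


attempts = {"count": 0}


def flaky_load():
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise TransientError("network timeout")


def broken_load():
    raise ValueError("schema mismatch")


def failing_rollback():
    raise RuntimeError("restore failed")


retry_result = recover(flaky_load, lambda: None, lambda: None)
rollback_result = recover(broken_load, lambda: None, lambda: None)
escalate_result = recover(broken_load, failing_rollback, lambda: None)
```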

Post-Incident Actions

  • Complete a Root Cause Analysis (RCA) document within 24 hours
  • Add pre-market validation checks to the pipeline to catch failures earlier
  • Improve runbooks so the next engineer on call can recover 50% faster
  • Implement circuit breakers to isolate the failure before it cascades

Never answer this as: “I would fix the bug and re-run the job.” J.P. Morgan wants to hear about business continuity planning, rollback strategies, stakeholder communication, and systemic prevention — not just a technical fix.


Question 11: How would you handle late-arriving and out-of-order events in a streaming data pipeline?

Category: Streaming

Late arrivals and out-of-order events are fundamental streaming challenges — especially in financial trading where network delays between exchange feeds and internal systems are common and unpredictable.

Core Concept: Event Time vs Processing Time

Event time is when the transaction actually happened, recorded in the event’s own timestamp field.

Processing time is when your pipeline receives and processes the event.

Always use event time for financial aggregations. Processing time produces incorrect settlement figures and wrong regulatory numbers whenever there is any delivery delay.

Handling Strategy

Watermarks: In Apache Flink or Spark Structured Streaming, define the watermark as the maximum event time seen so far minus a bounded out-of-orderness delay (for example, 5 minutes). The engine treats a time window as complete, and emits its result, once the watermark passes the end of that window.

Allowed lateness: Configure an additional grace period beyond the watermark during which late records can still update already-emitted window results. This is particularly useful for near-real-time risk aggregations.

Side outputs / late data stream: Records arriving after the allowed lateness threshold are routed to a separate side output stream for manual review, correction, or reprocessing in a batch layer.

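Lambda or Kappa architecture: In a Lambda design, a batch layer recomputes and corrects streaming results once all late data has arrived. In a Kappa design, the same correction is made by replaying the retained event log through the streaming job. Either is a common pattern for end-of-day settlement and daily PnL reconciliation.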

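// TransactionEvent is a placeholder name for the stream's element type.
stream
  .assignTimestampsAndWatermarks(
    WatermarkStrategy
      // The watermark trails the highest event time seen by 5 minutes.
      .<TransactionEvent>forBoundedOutOfOrderness(Duration.ofMinutes(5))
      // The assigner receives the element and the record's previous timestamp.
      .withTimestampAssigner((event, recordTimestamp) -> event.getEventTimestamp())
  )
  .keyBy(event -> event.getAccountId())
  .window(TumblingEventTimeWindows.of(Time.hours(1)))
  .allowedLateness(Time.minutes(10))      // accept updates for 10 more minutes
  .sideOutputLateData(lateOutputTag)      // anything later goes to a side output
  .aggregate(new SumTransactionsAggregator());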

Key insight: The watermark lateness setting is always a business trade-off, not just a technical one. A 5-minute wait improves accuracy but delays results by 5 minutes. For end-of-day settlement you can afford longer waits. For real-time fraud detection you may accept minor inaccuracy in exchange for speed. Frame this trade-off explicitly in your answer.
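To make the interaction between the watermark, allowed lateness, and the late-data side output easier to reason about, here is a simplified single-key simulation in plain Python. It is a sketch of the semantics only, not Flink code: the watermark is recomputed after every event rather than periodically, and the constants and the `run` function are assumptions for illustration.

```python
from collections import defaultdict

WINDOW = 3600            # 1-hour tumbling windows, in seconds
OUT_OF_ORDERNESS = 300   # watermark trails the max event time by 5 minutes
ALLOWED_LATENESS = 600   # fired windows accept updates for 10 more minutes


def run(events):
    """Process (event_time, amount) pairs in arrival order.

    Returns (emitted, late): emitted holds (window_start, total, kind) results,
    late holds events that arrived after the allowed lateness expired.
    """
    totals = defaultdict(int)
    fired = set()
    emitted, late = [], []
    max_event_time = float("-inf")
    for event_time, amount in events:
        watermark = max_event_time - OUT_OF_ORDERNESS
        window_start = event_time - event_time % WINDOW
        window_end = window_start + WINDOW
        if window_end + ALLOWED_LATENESS <= watermark:
            late.append((event_time, amount))  # too late: side output
            continue
        totals[window_start] += amount
        if window_start in fired:
            # Window already emitted: publish a corrected result.
            emitted.append((window_start, totals[window_start], "late_update"))
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - OUT_OF_ORDERNESS
        for start in sorted(totals):
            if start not in fired and start + WINDOW <= watermark:
                fired.add(start)
                emitted.append((start, totals[start], "on_time"))
    return emitted, late


arrivals = [(100, 10), (3500, 20), (4000, 5), (3000, 7), (4600, 1), (3590, 3)]
emitted, late = run(arrivals)
```

In this run the first window fires with a total of 30 once the watermark passes 3600, is corrected to 37 by an event that arrives within the allowed lateness, and the final event for that window is diverted because it arrives after the lateness period has expired.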


Question 12: Write a SQL query to calculate the daily average transaction amount for each customer.

Category: SQL

A clean SQL question — but write it the way a senior engineer would, with edge-case handling and business context baked in.

Option 1 — Simple Daily Average Per Customer

SELECT
    customer_id,
    DATE(transaction_timestamp)       AS transaction_date,
    COUNT(transaction_id)             AS transaction_count,
    ROUND(AVG(transaction_amount), 2) AS avg_transaction_amount,
    SUM(transaction_amount)           AS total_daily_amount
FROM transactions
WHERE
    status = 'COMPLETED'          -- Exclude pending and reversed transactions
    AND transaction_amount > 0    -- Exclude credits and refunds if required
GROUP BY
    customer_id,
    DATE(transaction_timestamp)
ORDER BY
    customer_id,
    transaction_date DESC;

Option 2 — Rolling 7-Day Average (More Useful for Analytics)

SELECT
    customer_id,
    transaction_date,
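    -- ROWS counts the customer's last 7 days that have transactions,
    -- not 7 calendar days; fill in missing dates first if gaps matter.
    AVG(daily_total) OVER (
        PARTITION BY customer_id
        ORDER BY transaction_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7d_avg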
FROM (
    SELECT
        customer_id,
        DATE(transaction_timestamp) AS transaction_date,
        SUM(transaction_amount)     AS daily_total
    FROM transactions
    WHERE status = 'COMPLETED'
    GROUP BY customer_id, DATE(transaction_timestamp)
) daily_summary;
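A likely follow-up to Option 2 is how a row-based frame behaves when a customer has days with no transactions. The sketch below contrasts the two interpretations in plain Python. Both function names are hypothetical, and treating missing days as zero is an assumption about the business definition, not the only valid one.

```python
from datetime import date, timedelta


def rolling_avg_rows(daily_totals, n=7):
    """Mimic ROWS BETWEEN n-1 PRECEDING AND CURRENT ROW over days that have data."""
    days = sorted(daily_totals)
    result = {}
    for i, day in enumerate(days):
        window = days[max(0, i - n + 1): i + 1]
        result[day] = sum(daily_totals[d] for d in window) / len(window)
    return result


def rolling_avg_calendar(daily_totals, n=7):
    """Average over the last n calendar days, counting missing days as zero."""
    result = {}
    for day in sorted(daily_totals):
        window = [day - timedelta(days=k) for k in range(n)]
        result[day] = sum(daily_totals.get(d, 0) for d in window) / n
    return result


# A customer who transacted on 1 January and then not again until 20 January.
daily = {date(2024, 1, 1): 70, date(2024, 1, 20): 140}
by_rows = rolling_avg_rows(daily)
by_calendar = rolling_avg_calendar(daily)
```

On 20 January the row-based version still averages in the 1 January total from nearly three weeks earlier, while the calendar version only looks at the preceding seven days. Stating which definition the business wants is part of a senior-level answer.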

Final Tips for Your J.P. Morgan Interview

Lead with business impact. Every technical answer should connect back to what it means for trading, risk, compliance, or regulation. Engineers who only speak in terms of tools rarely pass the final round.

Name regulatory frameworks. BCBS 239, MiFID II, Dodd-Frank, and AML regulations come up naturally in JPM interviews. Candidates who reference them signal that they can operate in a regulated environment without constant hand-holding.

Structure your answers. Use the format: problem → approach → tools → trade-offs → how you’d monitor it. J.P. Morgan interviewers are trained to look for structured thinking.

Prepare for follow-ups. Every question above will have a follow-up like “what if the volume increased 10x?” or “how would you handle a failure at step 3?” Think through the failure modes of every design you propose.
