SQL Skills

A/B Test Analysis in SQL — What Interviewers Actually Ask

Significance, lift, novelty effects, segmentation. Here's how to write the SQL — and explain the results — without sounding like a stats textbook.

12 min read·SQL Skills

A/B test analysis is where many strong SQL candidates suddenly look uncertain in interviews. The SQL itself is rarely the issue — the queries are usually short. What trips candidates up is the layer around the SQL: defining the metric correctly, reasoning about significance, spotting confounds, recommending whether to ship.

This guide walks through what interviewers actually ask about A/B testing in SQL — the six question shapes — with code for each and the senior moves that earn credit. By the end you'll be able to handle the question without dropping into stats-textbook mode.

The Mental Model

Every A/B test analysis question has three layers:

Compute the metric for each variant. Usually one CTE.
Compare them. Lift, absolute difference, ratio.
Decide what the comparison means. Is it significant? Is it real? Should we ship?

Most SQL prep tools test only layer 1. Real interviews test all three. The senior signal is layering 2 and 3 onto the SQL you already know how to write.

Pattern 1 — The Basic Variant Comparison

The question: "Given an A/B test events table, compute the conversion rate for each variant."

SELECT
  variant,
  COUNT(DISTINCT user_id) AS users,
  COUNT(DISTINCT CASE WHEN converted THEN user_id END) AS converters,
  ROUND(
    COUNT(DISTINCT CASE WHEN converted THEN user_id END)::numeric
    / NULLIF(COUNT(DISTINCT user_id), 0)
    * 100,
    2
  ) AS conversion_rate
FROM experiment_events
WHERE experiment_id = 'checkout_redesign_v3'
GROUP BY variant
ORDER BY variant;

The senior moves:

DISTINCT user_id because the same user can have multiple events; we want unique users per variant
Cast to numeric before dividing — integer division will silently truncate to 0
NULLIF guards against empty variants
Filter on the specific experiment so the analysis doesn't pick up events from other tests

Pattern 2 — Computing Lift

The question: "What's the lift of treatment over control?"

WITH variant_metrics AS (
  SELECT
    variant,
    COUNT(DISTINCT user_id) AS users,
    COUNT(DISTINCT CASE WHEN converted THEN user_id END)::numeric
      / NULLIF(COUNT(DISTINCT user_id), 0) AS conversion_rate
  FROM experiment_events
  WHERE experiment_id = 'checkout_redesign_v3'
  GROUP BY variant
)
SELECT
  control.conversion_rate     AS control_rate,
  treatment.conversion_rate   AS treatment_rate,
  treatment.conversion_rate - control.conversion_rate                          AS absolute_lift,
  (treatment.conversion_rate - control.conversion_rate) / control.conversion_rate AS relative_lift
FROM
  (SELECT conversion_rate FROM variant_metrics WHERE variant = 'control')   AS control,
  (SELECT conversion_rate FROM variant_metrics WHERE variant = 'treatment') AS treatment;

The senior signal: distinguishing absolute lift (1.2 percentage points) from relative lift (10% improvement). These are different numbers for the same comparison. Interviewers love asking "which one would you report to the team?" — the right answer depends on the audience, but you should be able to articulate both.

Pattern 3 — Sample Size and Significance Awareness

The question: "Is this difference statistically significant?"

You're not expected to compute p-values in SQL. You ARE expected to know when sample size is too small to trust the result.

WITH variant_metrics AS (
  SELECT
    variant,
    COUNT(DISTINCT user_id) AS users,
    COUNT(DISTINCT CASE WHEN converted THEN user_id END) AS converters,
    COUNT(DISTINCT CASE WHEN converted THEN user_id END)::numeric
      / NULLIF(COUNT(DISTINCT user_id), 0) AS conversion_rate
  FROM experiment_events
  WHERE experiment_id = 'checkout_redesign_v3'
  GROUP BY variant
)
SELECT
  variant,
  users,
  converters,
  conversion_rate,
  -- Approximate standard error of the proportion
  SQRT(conversion_rate * (1 - conversion_rate) / NULLIF(users, 0)) AS standard_error
FROM variant_metrics;

The senior signal: you can sketch the math but you also flag the caveat: "This standard error assumes a binomial proportion. With a 1.2-point lift and standard errors around 0.3 percentage points per variant, the 95% confidence interval on the difference probably crosses zero. I'd want to run a proper z-test or hand off to a stats tool before claiming significance."

That sentence — "I can compute a rough estimate in SQL but I'd validate it elsewhere" — is the move that separates senior from junior.

Pattern 4 — Segment Analysis (Heterogeneous Effects)

The question: "Does the lift hold up across segments?"

SELECT
  variant,
  device_platform,
  COUNT(DISTINCT user_id) AS users,
  COUNT(DISTINCT CASE WHEN converted THEN user_id END)::numeric
    / NULLIF(COUNT(DISTINCT user_id), 0) AS conversion_rate
FROM experiment_events e
JOIN users u ON u.id = e.user_id
WHERE experiment_id = 'checkout_redesign_v3'
GROUP BY variant, device_platform
ORDER BY device_platform, variant;

The senior signal: acknowledging Simpson's Paradox out loud. An aggregate "treatment wins by 1 point" can hide that treatment wins by 3 points on mobile and loses by 1 point on desktop. Senior analysts always check segment heterogeneity before recommending a ship.

Also worth mentioning: segment-level significance bars are stricter. If you split into 4 segments, each segment has 25% of the sample, so the per-segment significance test is way underpowered.

Pattern 5 — Novelty Effect Detection

The question: "Did the lift fade over the experiment window?"

SELECT
  variant,
  DATE_TRUNC('week', exposure_time) AS week,
  COUNT(DISTINCT user_id) AS users,
  COUNT(DISTINCT CASE WHEN converted THEN user_id END)::numeric
    / NULLIF(COUNT(DISTINCT user_id), 0) AS conversion_rate
FROM experiment_events
WHERE experiment_id = 'checkout_redesign_v3'
GROUP BY variant, DATE_TRUNC('week', exposure_time)
ORDER BY week, variant;

The senior signal: novelty effects mean users behave differently because the variant is new, not because it's better. If treatment outperformed in week 1 but converged with control by week 3, the lift might be temporary. The senior move: when a 2-week test shows a lift, recommend running it a third week to check for fade.

Pattern 6 — Guardrail Metrics

The question: "The primary metric is up — but did anything else regress?"

WITH outcomes AS (
  SELECT
    variant,
    COUNT(DISTINCT user_id) AS users,
    COUNT(DISTINCT CASE WHEN converted     THEN user_id END)::numeric
      / NULLIF(COUNT(DISTINCT user_id), 0) AS conversion_rate,
    AVG(time_to_checkout_ms)::numeric AS avg_checkout_ms,
    COUNT(DISTINCT CASE WHEN errored      THEN user_id END)::numeric
      / NULLIF(COUNT(DISTINCT user_id), 0) AS error_rate
  FROM experiment_events
  WHERE experiment_id = 'checkout_redesign_v3'
  GROUP BY variant
)
SELECT * FROM outcomes;

The senior signal: guardrail metrics are the things you don't want to regress while you optimize the primary metric. Checkout time, error rate, support tickets, refund rate. A treatment that lifts conversion by 1 point but doubles error rate isn't a winner.

When you see only the primary metric in an A/B test, ask: "What are the guardrails we're protecting?" That question alone signals senior thinking.

The Six Follow-Up Questions

After you've written any A/B analysis query, interviewers reliably ask some version of:

"How long did this test run? Was it long enough?" Underpowered tests lie.
"Did you check for novelty effects?" Week 1 results can mislead.
"What about segments?" Aggregate wins can hide segment losses.
"What were the guardrails?" Did you check error rate, latency, etc.?
"What's the cost of being wrong?" A shipping decision with high cost of failure needs higher confidence than one with low cost.
"What would you do next?" This is the most-weighted follow-up. Don't say "ship it." Say "I'd want X confirming data point first, and I'd ship to 25% to monitor in production."

The candidates who land senior offers don't just write the SQL. They walk the interviewer through what they'd recommend, with the caveats, and what they'd want before declaring a decision final.

The Stats Layer Worth Knowing

You don't need to be a statistician. You do need to know:

Z-test for proportions. What it does (compares two rates), when to use it (binary outcome A/B test), and when it fails (small samples, non-normal data).
Sample size estimation. For a given baseline rate, expected lift, and significance threshold, roughly how many users per variant do you need. There are calculators online — just know the order of magnitude.
Multiple comparison correction. If you ran 20 segment analyses, one is likely to show a "significant" result by chance alone. Real significance requires correction.

Knowing these three concepts at a conversational level — not the math — is plenty for SQL interview rounds.

The Mistake That Costs Offers

The single failure mode that comes up most: a candidate writes the SQL, sees treatment outperforming, and recommends shipping. Without checking sample size. Without checking segments. Without mentioning guardrails. Without asking what the cost of being wrong is.

The fix is a mental checklist that runs after every A/B analysis:

Is the sample big enough?
Does the lift hold up by segment?
Did the lift fade over time?
Did anything else regress?
What's the cost of being wrong?

Articulate that checklist out loud as part of your answer, and you'll be in the top tier of candidates for any data analyst loop at FAANG, AI lab, or growth-stage SaaS company.

A/B test SQL is short. A/B test reasoning is the interview.

Practice next

Explore SQL challenges

100+ challenges across Growth, SaaS, Marketing, Product, and Finance — graded by AI, ranked by difficulty.

Explore SQL challenges

Back to all resources