Visual Regression Strategy

Snapshots work until a team owns them

The /visual-regression lab teaches how to take snapshots. This one is about everything else: who updates baselines, why diffs flake, how reviews happen, and what thresholds mean. Five patterns that keep a visual suite useful past the first month.

The decay problem: visual suites die silently. Week 1: every test green. Week 4: 5 % flake. Week 12: 40 % flake, everyone skips review. Week 24: someone runs --update-snapshots globally. The suite is now decorative. Below are the five disciplines that prevent it.

1. Baseline governance — who updates what, when

Baselines are the source of truth. A culture of auto-updating them compromises that truth.

Problem: Tests fail, someone runs --update-snapshots, and the build ships green with whatever UI drift happened to be in that run. A month later, the baseline shows an 8-pixel logo shift nobody noticed. Tomorrow a real regression hides the same way.

Playwright example

// playwright.config.ts: explicit baseline strategy
import { defineConfig } from '@playwright/test';

export default defineConfig({
  expect: {
    toHaveScreenshot: {
      // Default threshold: strict for critical flows
      maxDiffPixelRatio: 0.01,
      animations: 'disabled',
      caret: 'hide',
    },
  },
});

# .github/workflows/visual.yml: baselines are ONLY updated via an explicit branch + PR
# (excerpt: triggers, runs-on, checkout and install steps omitted)
jobs:
  visual:
    steps:
      - run: pnpm playwright test --project=visual
        # NO --update-snapshots here
      - if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: visual-diffs
          path: test-results/

  # Separate job: manually triggered, updates baselines, opens a PR for review
  update-baselines:
    if: github.event_name == 'workflow_dispatch'
    steps:
      - run: pnpm playwright test --project=visual --update-snapshots
      - uses: peter-evans/create-pull-request@v5
        with:
          title: 'chore: update visual baselines'
          branch: visual-baseline-update
          body: 'Review each changed snapshot before approving.'
Rule of thumb: No one auto-updates baselines on CI. Baseline updates go through a dedicated PR that a human reviews diff by diff. This is the one discipline that makes visual regression actually useful in a team.
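
The rule can also be enforced mechanically. The following is a minimal sketch, not part of Playwright: the function name checkBaselineChanges, the result shape, and the file pattern are assumptions made for illustration. It takes a branch name and a list of changed files (for example from `git diff --name-only`) and flags snapshot images that changed anywhere other than the dedicated baseline branch.

```typescript
// Hypothetical guard: snapshot files may only change on the baseline-update branch.
// BASELINE_BRANCH mirrors the `branch:` value used in the workflow above; the
// pattern assumes Playwright's default `<spec>-snapshots/` directories.
const BASELINE_BRANCH = 'visual-baseline-update';
const SNAPSHOT_PATTERN = /-snapshots\/.+\.png$/;

interface GuardResult {
  ok: boolean;
  offendingFiles: string[];
}

function checkBaselineChanges(branch: string, changedFiles: string[]): GuardResult {
  // On the dedicated branch, snapshot changes are expected and go to human review.
  if (branch === BASELINE_BRANCH) {
    return { ok: true, offendingFiles: [] };
  }
  const snapshotChanges = changedFiles.filter((file) => SNAPSHOT_PATTERN.test(file));
  return { ok: snapshotChanges.length === 0, offendingFiles: snapshotChanges };
}

// Example: a feature branch that quietly touched a baseline image.
const featureBranchResult = checkBaselineChanges('feature/new-header', [
  'src/Header.tsx',
  'tests/checkout.spec.ts-snapshots/header-chromium-linux.png',
]);
```

A CI step could fail the build when `ok` is false and print `offendingFiles`, so a baseline change outside the review PR is caught before merge.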

2. Flaky diff sources — timing, fonts, cursor, GPU

Visual flakes are ~10× more common than functional flakes. Most have the same 5 causes.

Problem: Same code, same viewport, different pixels. Tests go red randomly. Triage finds no real change. Team learns to ignore the reporter. Regression slips through on day 30.

Playwright example

// The five common flake sources and their fixes
// (snippets run inside a test body, where `page` and `expect` are in scope)

// 1. Animation: always disable for baseline snapshots.
//    locator.screenshot() allows animations unless told otherwise.
await page.locator('.hero').screenshot({ animations: 'disabled' });

// 2. Caret: the text input caret blinks and varies between runs
await expect(page).toHaveScreenshot({ caret: 'hide' });

// 3. Fonts: system fallbacks vary between CI images. Force one family and
//    wait until loading has finished (assumes 'Inter' is available to the page).
await page.addStyleTag({
  content: `* { font-family: 'Inter', system-ui, sans-serif !important; }`,
});
await page.evaluate(async () => { await document.fonts.ready; });

// 4. Scroll position: implicit scrolling between steps leaves ghost state
await page.evaluate(() => window.scrollTo(0, 0));

// 5. Live data (timestamps, relative dates): freeze or mask.
//    Install the clock before page.goto() so the app never sees real time.
await page.clock.install({ time: new Date('2026-04-21T10:00:00Z') });

// Plus: mask known-dynamic regions so the diff ignores them
await expect(page).toHaveScreenshot({
  mask: [page.locator('[data-live-timestamp]')],
  maskColor: '#000000',
});

Rule of thumb: Before accepting any visual diff as a 'real' regression, rule out the five standard flake sources. If a test passes locally but flakes on CI, the cause is almost always in this list — not in your app code.
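
"Same code, different pixels" can be detected from run history before anyone triages by hand. This is a rough sketch under stated assumptions: the VisualRun record and the findFlakyTests helper are hypothetical, and the run data would have to come from your own CI reports. A test counts as flaky when the same commit produced both a pass and a fail.

```typescript
type Outcome = 'passed' | 'failed';

// Hypothetical record of one CI execution of one visual test.
interface VisualRun {
  testTitle: string;
  commitSha: string;
  outcome: Outcome;
}

interface OutcomeFlags {
  title: string;
  passed: boolean;
  failed: boolean;
}

function findFlakyTests(runs: VisualRun[]): string[] {
  // Track which outcomes were seen for each (test, commit) pair.
  const seen: Record<string, OutcomeFlags | undefined> = {};
  for (const run of runs) {
    const key = `${run.testTitle}@${run.commitSha}`;
    const flags = seen[key] ?? { title: run.testTitle, passed: false, failed: false };
    if (run.outcome === 'passed') {
      flags.passed = true;
    } else {
      flags.failed = true;
    }
    seen[key] = flags;
  }
  // Same commit with both outcomes means the pixels changed, not the code.
  const flaky: string[] = [];
  for (const key of Object.keys(seen)) {
    const flags = seen[key];
    if (flags && flags.passed && flags.failed && flaky.indexOf(flags.title) === -1) {
      flaky.push(flags.title);
    }
  }
  return flaky.sort();
}

const sampleRuns: VisualRun[] = [
  { testTitle: 'hero', commitSha: 'abc123', outcome: 'passed' },
  { testTitle: 'hero', commitSha: 'abc123', outcome: 'failed' },
  { testTitle: 'footer', commitSha: 'abc123', outcome: 'failed' },
  { testTitle: 'footer', commitSha: 'def456', outcome: 'passed' },
];
const flakyTitles = findFlakyTests(sampleRuns);
```

Here only `hero` is reported: `footer` failed on one commit and passed on another, which is consistent with a real change followed by a fix.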

3. Review workflow — 4-step baseline approval

Every baseline update is a code review event. Treat it like a prod deploy — explicit, checklisted, reversible.

Problem: Visual diffs without explicit review become noise. The team develops a 'just approve, it's probably fine' reflex. Real regressions merge. Baseline quality decays until someone nukes the whole suite.

Playwright example

// The 4-step review checklist (commit as .github/VISUAL_REVIEW.md)

## Before approving a baseline update

1. **Diff the IMAGE, not just the file**
   - Open the baseline vs actual side-by-side
   - Zoom in on the changed region
   - Ask: is the visual change intentional?

2. **Trace the intent**
   - Which commit in the PR caused this change?
   - Does the PR description mention the UI change?
   - If not, someone introduced a silent UI change and it needs discussion

3. **Classify the change**
   - INTENTIONAL: approve, update baseline, note in CHANGELOG
   - UNINTENTIONAL: reject, open bug
   - AMBIGUOUS: ask in Slack #design for a second opinion

4. **Cross-browser check**
   - Does the change affect chromium only, or firefox + webkit too?
   - A chromium-only diff often means a GPU or font bug — not a real design change

// Reviewer quality metric: baseline update acceptance rate
// < 30 % accepted = noisy test suite OR noisy UI change process
// > 90 % accepted = rubber-stamping (probably)

Rule of thumb: Every baseline PR must have ≥1 reviewer who opened the actual diff image. 'LGTM' without image-review defeats the suite. If you skip image review, you may as well delete the tests.
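
The acceptance-rate metric from the checklist is easy to compute once baseline PR outcomes are counted. A small sketch, assuming you track how many proposed snapshot changes were accepted out of the total reviewed; the function name and the 'healthy' label for the middle band are illustrative choices, while the 30 % and 90 % cut-offs come from the checklist above.

```typescript
type ReviewHealth = 'no-data' | 'noisy' | 'healthy' | 'rubber-stamping';

function classifyAcceptanceRate(accepted: number, total: number): ReviewHealth {
  if (total <= 0) return 'no-data';
  const rate = accepted / total;
  // Below 30 %: noisy test suite or noisy UI change process.
  if (rate < 0.3) return 'noisy';
  // Above 90 %: reviewers are probably approving without looking.
  if (rate > 0.9) return 'rubber-stamping';
  return 'healthy';
}

const lastQuarter = classifyAcceptanceRate(19, 20); // 95 % accepted
```

Treat the result as a prompt for a conversation rather than a verdict: a team with few, well-planned UI changes can legitimately sit above 90 %.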

4. Cross-browser drift — chromium vs firefox vs webkit

Same DOM, same CSS, different pixels. Each browser renders fonts, antialiasing, and scrollbars slightly differently.

Problem: One baseline image cannot cover three browsers. If you pin to chromium, you miss webkit regressions. If you store three baselines, updates are 3× work. Most teams pick one, ignore the rest, and find out in production.

Playwright example

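// playwright.config.ts: per-browser baselines and thresholds via project scoping
import { defineConfig } from '@playwright/test';

export default defineConfig({
  projects: [
    {
      // Canonical browser: full suite on every PR, strict threshold
      name: 'chromium',
      use: { browserName: 'chromium' },
      expect: { toHaveScreenshot: { maxDiffPixelRatio: 0.01 } },
    },
    {
      // Smoke subset only; looser threshold for inherent pixel drift
      name: 'firefox',
      use: { browserName: 'firefox' },
      grep: /@visual-smoke/,
      expect: { toHaveScreenshot: { maxDiffPixelRatio: 0.02 } },
    },
    {
      name: 'webkit',
      use: { browserName: 'webkit' },
      grep: /@visual-smoke/,
      expect: { toHaveScreenshot: { maxDiffPixelRatio: 0.02 } },
    },
  ],
});

// Per-project snapshot suffix: Playwright names files automatically
// checkout.spec.ts-snapshots/
//   ├── header-chromium-linux.png
//   ├── header-firefox-linux.png
//   └── header-webkit-linux.png

// Strategy: run the full visual suite on chromium (every PR, strict threshold),
// run a subset (critical flows only) on firefox + webkit (nightly, looser threshold).
// Nightly scheduling lives in CI, e.g. a scheduled job passing --project=firefox --project=webkit.

// checkout.spec.ts ('/checkout' is an assumed route for illustration)
import { test, expect } from '@playwright/test';

test('checkout visual: chromium strict', { tag: '@visual-strict' }, async ({ page }) => {
  await page.goto('/checkout');
  // No override: the project-level threshold applies
  await expect(page.getByRole('banner')).toHaveScreenshot('header.png');
});

test('checkout visual: all browsers', { tag: '@visual-smoke' }, async ({ page }) => {
  await page.goto('/checkout');
  // 0.03 catches major layout breaks only
  await expect(page).toHaveScreenshot('checkout.png', { maxDiffPixelRatio: 0.03 });
});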

Rule of thumb: Pick one canonical browser for pixel-perfect visual tests (usually chromium). Run a smaller cross-browser smoke subset with looser thresholds. Triple baselines only for truly critical flows (checkout, auth, payment).
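
The maintenance trade-off behind this rule is simple arithmetic. The sketch below compares how many baseline images a team has to keep reviewed under each approach; the SuiteShape type and the sample numbers (120 screenshots, 15 of them critical) are invented for illustration.

```typescript
interface SuiteShape {
  totalScreenshots: number;     // screenshots in the full visual suite
  criticalScreenshots: number;  // the smoke subset (checkout, auth, payment, ...)
  extraBrowsers: number;        // browsers beyond the canonical one
}

// Every screenshot stored once per browser.
function baselinesForFullMatrix(suite: SuiteShape): number {
  return suite.totalScreenshots * (1 + suite.extraBrowsers);
}

// Full suite on the canonical browser, critical subset on the others.
function baselinesForSmokeSubset(suite: SuiteShape): number {
  return suite.totalScreenshots + suite.criticalScreenshots * suite.extraBrowsers;
}

const exampleSuite: SuiteShape = { totalScreenshots: 120, criticalScreenshots: 15, extraBrowsers: 2 };
const fullMatrixCount = baselinesForFullMatrix(exampleSuite);   // 360
const smokeSubsetCount = baselinesForSmokeSubset(exampleSuite); // 150
```

With these assumed numbers the subset strategy keeps cross-browser coverage of the critical flows at well under half the baseline count of the full matrix.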

5. Threshold tuning — strictness vs noise

maxDiffPixelRatio is the dial: 0.01 means up to 1 % of pixels may differ. Set a sensible default, then override per context; never rely on one global value for everything.

Problem: One global threshold is either too strict (every run fails on subpixel AA) or too lax (real regressions hide). A 0.03 threshold lets a 2-pixel logo shift slide. Teams ratchet looser until the suite loses meaning.
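
To see why a lax ratio hides a small shift, it helps to turn the ratio into a pixel count. The sketch below is a worked estimate, not Playwright's comparison algorithm: it assumes a 1280×720 screenshot and a 200×60 logo (both made-up sizes) and uses the fact that every pixel changed by a horizontal shift lies inside the union of the old and new bounding boxes.

```typescript
interface Size {
  width: number;
  height: number;
}

// Upper bound on the share of pixels that can change when an element shifts
// horizontally by `shiftPx`: at most (width + shift) * height pixels are touched.
function worstCaseShiftRatio(element: Size, shiftPx: number, viewport: Size): number {
  const changedAtMost = (element.width + Math.abs(shiftPx)) * element.height;
  return changedAtMost / (viewport.width * viewport.height);
}

const viewportSize: Size = { width: 1280, height: 720 }; // assumed screenshot size
const logoSize: Size = { width: 200, height: 60 };       // assumed logo size

// 202 * 60 = 12,120 pixels out of 921,600, roughly 0.013
const logoShiftRatio = worstCaseShiftRatio(logoSize, 2, viewportSize);

const hiddenAtLaxThreshold = logoShiftRatio <= 0.03; // even the worst case passes
```

Even if every pixel in that region changed, the diff stays under 3 % of the screenshot, so a 0.03 ratio cannot catch it. Screenshotting the logo's container instead of the whole page shrinks the denominator and makes the same shift far more visible to the comparison.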

Playwright example

// Per-test threshold, documented WHY
import { test, expect } from '@playwright/test';

test('legal disclaimer renders exactly', async ({ page }) => {
  await page.goto('/tos');
  // Strict — legal text must not drift; regulatory requirement
  await expect(page.getByTestId('tos-body')).toHaveScreenshot('tos.png', {
    maxDiffPixelRatio: 0.001,  // 0.1 %
  });
});

test('animated hero is acceptable', async ({ page }) => {
  await page.goto('/');
  // Looser — known animation, disabled but antialiasing drifts
  await expect(page.getByRole('banner')).toHaveScreenshot('hero.png', {
    maxDiffPixelRatio: 0.02,  // 2 %
  });
});

test('map tile render', async ({ page }) => {
  await page.goto('/map'); // route assumed for illustration
  // Very lax: external map tiles will never be pixel-identical
  await expect(page.locator('#map')).toHaveScreenshot('map.png', {
    maxDiffPixelRatio: 0.1,   // 10 %
  });
});

// Decision table for choosing thresholds:
//   Legal / contract / regulatory text  → 0.001 (0.1 %)
//   Core product UI, typography         → 0.01  (1 %)
//   Animated / complex gradients        → 0.02  (2 %)
//   Third-party embeds (maps, video)    → 0.05-0.10 (5-10 %)
//   Cross-browser snapshots             → 0.02-0.03 (2-3 %)

Rule of thumb: Default to 0.01 (1 % pixel drift). Tighten for regulatory / legal text (0.001). Loosen for animated content (0.02) and third-party embeds (0.05-0.10). Comment WHY next to each override; future maintainers will ratchet looser otherwise.

Flake sources quick reference

Before calling any visual diff a “regression”, check the standard causes first:

Source	Fix
Animations	`animations: 'disabled'` in snapshot options
Caret / cursor	`caret: 'hide'` in snapshot options
Font fallbacks	Pre-load the font and wait for `document.fonts.ready`
Scroll position	Explicit `scrollTo(0, 0)` before the snapshot
Live timestamps	`page.clock.install` plus a mask over timestamp regions
GPU antialiasing	Pin CI to software rasterisation (no hardware GPU)
System scrollbar	Screenshot an element instead of the viewport, or mask the scrollbar
