
Playwright E2E Behind a VPN: How I Built a Reliable Staging Test Suite from Scratch

A practical guide to running Playwright E2E tests against a WireGuard-protected staging environment. Page Object Model, fixture-based auth, parallel sharding, and how I got from 67% to 97% test pass rate.

10 min read
by Andrii Peretiatko
Playwright · E2E Testing · WireGuard · CI/CD

The Challenge

The staging environment for a SaaS platform I worked on sits behind a WireGuard VPN. No VPN, no access to the API. Simple in theory; brutal in practice for CI.

When I took over the E2E suite, the pass rate was 67%. Not because the features were broken — because the tests were:

  • Connecting to staging without VPN → instant 503
  • Sharing a single auth session across 50 parallel tests → race conditions
  • Using XPath selectors → 2-5 second timeouts on every action
  • No retry logic → one flaky network call = failed test

The goal: 95%+ pass rate, under 8 minutes wall-clock time, zero false failures from infrastructure.


The Investigation: Categorizing Failures

I ran 200 test executions and categorized every failure:

Failure Distribution (67% pass rate baseline)

| Failure Type | Count | % of Failures |
|---|---|---|
| VPN not connected / 503 | 38 | 57.6% |
| Auth session collision | 12 | 18.2% |
| XPath timeout (>5s) | 8 | 12.1% |
| Actual bugs in the app | 5 | 7.6% |
| Network flakiness | 3 | 4.5% |

93% of failures were infrastructure problems, not real bugs. The app was working fine.


The Solution

1. WireGuard Pre-Connection in GitHub Actions

The CI runner needs to be on the VPN before any test touches staging. I solved this with a dedicated VPN step before Playwright runs:
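A sketch of that workflow step, assuming an Ubuntu runner (the `WG_CONFIG` secret name is from this setup; the staging health-check URL is a placeholder):

```yaml
# .github/workflows/e2e.yml (excerpt) — runs before the Playwright step
- name: Connect to WireGuard VPN
  run: |
    sudo apt-get update && sudo apt-get install -y wireguard
    # WG_CONFIG holds the full wg0.conf content
    echo "${{ secrets.WG_CONFIG }}" | sudo tee /etc/wireguard/wg0.conf > /dev/null
    sudo wg-quick up wg0

- name: Verify VPN tunnel is routing traffic
  run: |
    # Fail fast if the tunnel is up but not yet routing
    curl --fail --retry 5 --retry-delay 3 --retry-all-errors \
      https://staging.example.com/health
```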

The WG_CONFIG secret stores the full wg0.conf content. The curl health check with --retry 5 ensures the VPN tunnel is fully established before tests start.

Impact:

  • VPN-related failures: 38 → 0 (eliminated)

2. Fixture-Based Isolated Auth Sessions

The original suite had a global.setup.ts that logged in once and saved storageState.json. All 50 tests read from the same file. Two tests modifying session state = race condition.

Before: shared global auth state
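A reconstructed sketch of that pattern (selectors and env var names are illustrative):

```typescript
// global.setup.ts — logs in once; every worker then reads the same file
import { chromium } from '@playwright/test';

async function globalSetup() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://staging.example.com/login');
  await page.fill('#email', process.env.E2E_USER!);
  await page.fill('#password', process.env.E2E_PASS!);
  await page.click('button[type="submit"]');
  // One shared session file for all 50 parallel tests — the race condition lives here
  await page.context().storageState({ path: 'storageState.json' });
  await browser.close();
}

export default globalSetup;
```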

After: per-worker isolated auth via fixtures
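A sketch of the fixture, following Playwright's one-account-per-worker pattern. The `/auth/login` endpoint is from this setup; the `token` field, `auth_token` key, and env var names are assumptions:

```typescript
// fixtures.ts — per-worker auth: one API login per worker, injected as storageState
import { test as base, request } from '@playwright/test';
import fs from 'fs';
import path from 'path';

export const test = base.extend<{}, { workerStorageState: string }>({
  // Every test in this worker picks up the worker's own session file
  storageState: ({ workerStorageState }, use) => use(workerStorageState),

  workerStorageState: [
    async ({}, use, workerInfo) => {
      const file = path.join(
        workerInfo.project.outputDir,
        `.auth/worker-${workerInfo.workerIndex}.json`,
      );
      // Log in over the API — ~48ms vs ~2,100ms through the UI form
      const api = await request.newContext({ baseURL: process.env.STAGING_URL });
      const res = await api.post('/auth/login', {
        data: { email: process.env.E2E_USER, password: process.env.E2E_PASS },
      });
      const { token } = await res.json();
      await api.dispose();
      // Inject the token as localStorage for the staging origin
      fs.mkdirSync(path.dirname(file), { recursive: true });
      fs.writeFileSync(file, JSON.stringify({
        cookies: [],
        origins: [{
          origin: process.env.STAGING_URL,
          localStorage: [{ name: 'auth_token', value: token }],
        }],
      }));
      await use(file);
    },
    { scope: 'worker' },
  ],
});
```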

Impact:

  • Auth collision failures: 12 → 0
  • Login time per test: 2,100ms → 48ms (API auth vs UI auth)
  • Saved: ~102 seconds across 50 tests

3. Page Object Model with data-testid Selectors

XPath selectors break when CSS class names change. I replaced every selector with data-testid attributes and wrapped pages in the Page Object Model.

Before: fragile XPath
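A reconstructed illustration of the old style (the class names are hypothetical):

```typescript
// Brittle: breaks the moment a CSS class is renamed
await page.locator('//div[contains(@class, "search-bar")]//input').fill('invoices');
await page.locator('//button[contains(@class, "btn-primary")]').first().click();
await page.waitForTimeout(3000); // "wait and hope"
```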

After: Page Object Model with stable selectors
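A minimal page object sketch (the `SearchPage` class and test IDs are illustrative, not the actual app's):

```typescript
// pages/SearchPage.ts — selectors live in one place, tests stay declarative
import type { Page, Locator } from '@playwright/test';

export class SearchPage {
  readonly searchInput: Locator;
  readonly submitButton: Locator;
  readonly results: Locator;

  constructor(page: Page) {
    this.searchInput = page.getByTestId('search-input');
    this.submitButton = page.getByTestId('search-submit');
    this.results = page.getByTestId('search-result');
  }

  async search(query: string) {
    await this.searchInput.fill(query);
    await this.submitButton.click();
    // Auto-waiting locator: no sleeps, no XPath timeouts
    await this.results.first().waitFor();
  }
}
```

In a spec, this collapses to `await new SearchPage(page).search('invoices')` — when a class name changes, only the page object needs updating.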

Impact:

  • Selector timeout failures: 8 → 0
  • Average selector resolution: 2,500ms → 80ms (97% reduction)
  • Test maintenance time: reduced by ~60% (CSS refactors don't break tests)

4. Playwright Configuration for Staging

Key decisions:

  • retries: 2 — 3 consecutive failures = real bug, not network noise
  • trace: 'on-first-retry' — traces generated only when needed, keeps CI artifacts small
  • workers: 4 per shard — matches CI runner CPU without over-subscribing
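The decisions above map onto a config roughly like this (a sketch; `STAGING_URL` is a placeholder env var):

```typescript
// playwright.config.ts — staging profile
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: 2,                         // 3 consecutive failures = real bug
  workers: 4,                         // matches the CI runner's CPUs per shard
  use: {
    baseURL: process.env.STAGING_URL,
    trace: 'on-first-retry',          // traces only when something went wrong
    testIdAttribute: 'data-testid',
  },
  // Blob reports per shard, merged into one HTML report afterwards
  reporter: process.env.CI ? 'blob' : 'html',
});
```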

5. API-Level Regression Tests for Search Quality

Pure UI E2E tests are slow for validating ML-powered features. I added lightweight API tests that assert on relevance scores directly:
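A sketch of such a test using Playwright's `request` fixture (the `/api/search` endpoint, `relevance_score` field, and 0.8 threshold are illustrative assumptions):

```typescript
// search-quality.spec.ts — assert on relevance scores straight from the API
import { test, expect } from '@playwright/test';

test('top result for a known query stays relevant', async ({ request }) => {
  const res = await request.get('/api/search', {
    params: { q: 'quarterly report' },
  });
  expect(res.ok()).toBeTruthy();

  const { results } = await res.json();
  // Guard the model, not the pixels: score thresholds catch regressions in seconds
  expect(results.length).toBeGreaterThan(0);
  expect(results[0].relevance_score).toBeGreaterThan(0.8);
});
```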

Impact:

  • Catches model regressions in < 5 seconds vs 60+ seconds for full UI flow
  • Blocked a bad configuration change that dropped accuracy by 8% before it merged

Final Results

Before vs After Comparison

| Metric | Before | After | Improvement |
|---|---|---|---|
| Pass rate | 67% | 97% | +30pp |
| Wall-clock time (50 tests) | 24 min | 7 min | 71% faster |
| VPN-related failures | 57.6% of fails | 0% | Eliminated |
| Auth collision failures | 18.2% of fails | 0% | Eliminated |
| Average selector time | 2,500ms | 80ms | 97% faster |
| False failure rate | 33% | 2% | 94% reduction |
| Tests catching real bugs | 5 | 23 | 4.6× more signal |

Engineer Time Saved

The 33% false failure rate meant engineers were re-running pipelines 2-3× per day. After the fix:

  • Re-runs per day: 8 → 1 (the 2% genuine flakes)
  • Engineer time saved: ~2 hours/day across the team

Key Takeaways

1. Categorize Your Failures Before Fixing Them

93% of my failures were infrastructure. Fixing the app would have done nothing. Categorize first.

2. VPN in CI Is Solved With wg-quick + Health Check

A curl --retry 5 health check against the protected endpoint before tests start is all you need. Don't assume the VPN is up — verify it.

3. API-Level Auth Is 40× Faster Than UI Login

Clicking through a login form: ~2,100ms. A POST to /auth/login + localStorage.setItem: 48ms. Use the API.

4. Page Objects Pay Off After the Third Refactor

They feel like over-engineering on day one. After the third CSS class rename breaks your selectors, you'll be glad you have them.

5. data-testid Is a Contract, Not a Convenience

It's a contract between the frontend team and QA. Add data-testid to every interactive element at build time, not retroactively after tests break.

6. 2 Retries in CI Is the Right Number

  • 0 retries: flaky network kills your pipeline
  • 2 retries: 3 consecutive failures = real bug
  • 3+ retries: you're hiding real failures

Tools & Technologies Used

  • E2E Framework: Playwright 1.44
  • VPN: WireGuard with wg-quick in GitHub Actions
  • CI/CD: GitHub Actions with matrix sharding (5 shards × 4 workers)
  • Auth Strategy: API-level login + localStorage injection
  • Selectors: data-testid throughout, plus getByRole for accessibility checks
  • Reporting: Playwright blob reporter + HTML merge
  • API Tests: Playwright request fixture against REST endpoints

What's Next?

  1. Visual regression: Percy/Argos for pixel-diff testing on key UI components
  2. Contract testing: Pact consumer-driven contracts between frontend and backend
  3. Chaos testing: Simulate VPN drops mid-test to validate reconnection behavior
  4. Test impact analysis: Run only tests affected by changed files

Conclusion

Going from 67% to 97% pass rate was not about writing better tests — it was about fixing the infrastructure those tests run on. The VPN connection, the auth session isolation, and the selector stability accounted for 93% of failures.

Once the false failures were gone, the remaining 3% were real bugs. That's the whole point: a test suite you trust is worth infinitely more than one you ignore.

If your E2E suite is unreliable, don't add more retries — categorize the failures first.