Playwright E2E Behind a VPN: How I Built a Reliable Staging Test Suite from Scratch

The Challenge

The staging environment for a SaaS platform I worked on sits behind a WireGuard VPN. No VPN, no access to the API. Simple in theory; brutal in practice for CI.

When I took over the E2E suite, the pass rate was 67%. The features were not broken; the suite was:

Connecting to staging without the VPN → instant 503
Sharing a single auth session across 50 parallel tests → race conditions
Using XPath selectors → 2-5 second timeouts on every action
Running without retry logic → one flaky network call = failed test

The goal: 95%+ pass rate, under 8 minutes wall-clock time, zero false failures from infrastructure.

The Investigation: Categorizing Failures

I ran 200 test executions and categorized every failure:

Failure Distribution (67% pass rate baseline)

Failure Type	Count	% of Failures
VPN not connected / 503	38	57.6%
Auth session collision	12	18.2%
XPath timeout (>5s)	8	12.1%
Actual bugs in the app	5	7.6%
Network flakiness	3	4.5%

About 92% of failures (61 of 66) were infrastructure problems, not real bugs. The app was working fine.
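The arithmetic behind that breakdown is easy to reproduce. A minimal sketch, using only the counts from the table above; the `infrastructure` flag on each row is my own labeling of which categories count as infrastructure:

```typescript
// Failure counts copied from the table above (200 executions in total).
const failures = [
  { type: "VPN not connected / 503", count: 38, infrastructure: true },
  { type: "Auth session collision", count: 12, infrastructure: true },
  { type: "XPath timeout (>5s)", count: 8, infrastructure: true },
  { type: "Actual bugs in the app", count: 5, infrastructure: false },
  { type: "Network flakiness", count: 3, infrastructure: true },
];

const totalRuns = 200;
const totalFailures = failures.reduce((sum, f) => sum + f.count, 0);
const infraFailures = failures
  .filter((f) => f.infrastructure)
  .reduce((sum, f) => sum + f.count, 0);

// Percentage rounded to one decimal place.
const pct = (part: number, whole: number): number =>
  Math.round((part / whole) * 1000) / 10;

const passRate = pct(totalRuns - totalFailures, totalRuns);
const infraShare = pct(infraFailures, totalFailures);

console.log(`failures: ${totalFailures} of ${totalRuns} runs`);
console.log(`pass rate: ${passRate}%`);
console.log(`infrastructure share of failures: ${infraShare}%`);
```

Keeping the tally in code makes it cheap to re-run the same categorization after each fix and see which bucket shrank.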

The Solution

1. WireGuard Pre-Connection in GitHub Actions

The CI runner needs to be on the VPN before any test touches staging. I solved this with a dedicated VPN step before Playwright runs:

The WG_CONFIG secret stores the full wg0.conf content. The curl health check with --retry 5 ensures the VPN tunnel is fully established before tests start.
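The same verify-before-testing idea can also be expressed on the Playwright side. The sketch below is not the workflow step itself; it is a TypeScript illustration of a retrying health check that could run in a global setup file. The function name `waitForStaging`, the fixed delay between attempts, and the rule that any non-2xx status triggers a retry are assumptions of this sketch (curl's own retry rules are narrower).

```typescript
// A probe performs one request against the protected endpoint
// and resolves with the HTTP status code.
type Probe = () => Promise<number>;

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Resolves with the number of attempts used, or throws once
// the initial attempt plus `retries` retries have all failed.
async function waitForStaging(
  probe: Probe,
  retries = 5,
  delayMs = 2000,
): Promise<number> {
  let lastError = "no attempt made";
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const status = await probe();
      if (status >= 200 && status < 300) return attempt + 1;
      lastError = `HTTP ${status}`;
    } catch (err) {
      // Connection refused, DNS failure, timeout: the tunnel is likely not up yet.
      lastError = err instanceof Error ? err.message : String(err);
    }
    if (attempt < retries) await sleep(delayMs);
  }
  throw new Error(
    `Staging unreachable after ${retries + 1} attempts (${lastError}); is the VPN up?`,
  );
}

// Possible usage in a global setup file on Node 18+ (healthUrl is hypothetical):
//   await waitForStaging(() => fetch(healthUrl).then((r) => r.status));
```

Failing fast here turns a wall of confusing 503s into a single, clearly labeled infrastructure error.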

Impact:

VPN-related failures: 38 → 0 (eliminated)

2. Fixture-Based Isolated Auth Sessions

The original suite had a global.setup.ts that logged in once and saved storageState.json. All 50 tests read from the same file. Two tests modifying session state = race condition.

Before: shared global auth state

After: per-worker isolated auth via fixtures
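A minimal sketch of the pieces a per-worker auth fixture needs, assuming one staging account per parallel worker and a login endpoint that returns a token. The account naming scheme, the `token` response field, and the `auth_token` localStorage key are all hypothetical; only the /auth/login path and the localStorage approach come from this write-up. The small interfaces are structural stand-ins so the sketch runs without Playwright installed; Playwright's APIRequestContext is compatible with `ApiClient`.

```typescript
interface ApiClient {
  post(
    url: string,
    options: { data: unknown },
  ): Promise<{ ok(): boolean; json(): Promise<any> }>;
}

interface Account {
  email: string;
  password: string;
}

// Shape accepted by Playwright's storageState option.
interface StorageState {
  cookies: never[];
  origins: { origin: string; localStorage: { name: string; value: string }[] }[];
}

// Hypothetical naming scheme: each parallel worker gets its own account,
// so no two workers ever mutate the same session.
function accountForWorker(parallelIndex: number, password: string): Account {
  return { email: `e2e-worker-${parallelIndex}@example.com`, password };
}

// Log in over HTTP instead of clicking through the login form.
async function loginViaApi(
  api: ApiClient,
  baseURL: string,
  account: Account,
): Promise<string> {
  const response = await api.post(`${baseURL}/auth/login`, { data: account });
  if (!response.ok()) throw new Error(`Login failed for ${account.email}`);
  const body = await response.json();
  return body.token; // assumed response field
}

// Build an in-memory storage state that injects the token into localStorage.
function storageStateFor(origin: string, token: string): StorageState {
  return {
    cookies: [],
    origins: [{ origin, localStorage: [{ name: "auth_token", value: token }] }],
  };
}

// Wiring idea: in a file that extends Playwright's base `test`, a fixture
// declared with { scope: 'worker' } calls loginViaApi once per worker using
// test.info().parallelIndex, and the built-in `storageState` fixture is
// overridden to return storageStateFor(origin, token).
```

Because the state is built per worker and never written to a shared file, there is nothing left for parallel tests to race on.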

Impact:

Auth collision failures: 12 → 0
Login time per test: 2,100ms → 48ms (API auth vs UI auth)
Saved: ~102 seconds across 50 tests

3. Page Object Model with data-testid Selectors

The XPath selectors were tied to DOM structure and CSS class names, so they broke whenever the markup changed. I replaced every selector with data-testid attributes and wrapped pages in the Page Object Model.

Before: fragile XPath

After: Page Object Model with stable selectors
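A minimal sketch of what such a page object can look like. The `SearchPage` class, the /search route, and the test ids are hypothetical examples, not the suite's real ones. `PageLike` and `LocatorLike` are structural stand-ins so the sketch runs without Playwright installed; Playwright's `Page` and `Locator` are compatible with them.

```typescript
interface LocatorLike {
  fill(value: string): Promise<void>;
  click(): Promise<void>;
  count(): Promise<number>;
}

interface PageLike {
  goto(url: string): Promise<unknown>;
  getByTestId(testId: string): LocatorLike;
}

// For contrast, the kind of selector this replaces (illustrative only):
//   //div[@class='search-box']/form/div[2]/button

class SearchPage {
  private readonly page: PageLike;

  constructor(page: PageLike) {
    this.page = page;
  }

  // Every selector lives here, so a markup change is a one-line fix.
  private get queryInput(): LocatorLike {
    return this.page.getByTestId("search-input");
  }

  private get submitButton(): LocatorLike {
    return this.page.getByTestId("search-submit");
  }

  async open(): Promise<void> {
    await this.page.goto("/search");
  }

  async searchFor(query: string): Promise<void> {
    await this.queryInput.fill(query);
    await this.submitButton.click();
  }

  async resultCount(): Promise<number> {
    return this.page.getByTestId("search-result").count();
  }
}
```

In a real test, the `page` fixture is passed straight to the constructor: `new SearchPage(page)`.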

Impact:

Selector timeout failures: 8 → 0
Average selector resolution: 2,500ms → 80ms (97% reduction)
Test maintenance time: reduced by ~60% (CSS refactors don't break tests)

4. Playwright Configuration for Staging

Key decisions:

retries: 2 — 3 consecutive failures = real bug, not network noise
trace: 'on-first-retry' — traces generated only when needed, keeps CI artifacts small
workers: 4 per shard — matches CI runner CPU without over-subscribing
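The decisions above map onto a small set of Playwright config options. A sketch under stated assumptions: the `stagingConfig` helper and the `STAGING_URL` variable name are hypothetical, and building the object from an `env` argument is only there so the values can be checked without Playwright installed.

```typescript
// Builds the options discussed above from an environment map.
function stagingConfig(env: Record<string, string | undefined>) {
  const inCI = Boolean(env.CI);
  return {
    // One initial attempt plus two retries: three failures in a row is a real bug.
    retries: 2,
    // Four workers per shard, sized to the CI runner.
    workers: 4,
    // Blob reports can be merged across shards; HTML is handier locally.
    reporter: inCI ? "blob" : "html",
    use: {
      baseURL: env.STAGING_URL, // hypothetical variable name
      // Record a trace only when a test is retried for the first time.
      trace: "on-first-retry" as const,
      // Attribute used by page.getByTestId(); this is Playwright's default.
      testIdAttribute: "data-testid",
    },
  };
}

// In playwright.config.ts this would be passed to defineConfig:
//   export default defineConfig(stagingConfig(process.env));
```

Sharding itself is not part of the config; it is selected per CI job with the `--shard` command-line option.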

5. API-Level Regression Tests for Search Quality

Pure UI E2E tests are slow for validating ML-powered features. I added lightweight API tests that assert on relevance scores directly:
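A minimal sketch of such a check, assuming a search endpoint that returns ranked results with a numeric score. The /api/search path, the response shape, and the golden cases are hypothetical. `SearchApi` is a structural stand-in that Playwright's `request` fixture is compatible with, so the sketch runs without Playwright installed.

```typescript
interface SearchApi {
  get(url: string): Promise<{ ok(): boolean; json(): Promise<any> }>;
}

interface SearchHit {
  id: string;
  score: number;
}

// A query with a known-good top result and the lowest acceptable score for it.
interface GoldenCase {
  query: string;
  expectedTopId: string;
  minScore: number;
}

// Returns one message per regression; an empty list means every case passed.
async function findRelevanceRegressions(
  api: SearchApi,
  baseURL: string,
  cases: GoldenCase[],
): Promise<string[]> {
  const regressions: string[] = [];
  for (const c of cases) {
    const response = await api.get(
      `${baseURL}/api/search?q=${encodeURIComponent(c.query)}`,
    );
    if (!response.ok()) {
      regressions.push(`${c.query}: request failed`);
      continue;
    }
    const body = await response.json();
    const results: SearchHit[] = Array.isArray(body.results) ? body.results : [];
    const top = results[0];
    if (!top || top.id !== c.expectedTopId) {
      regressions.push(`${c.query}: expected ${c.expectedTopId} as top result`);
    } else if (top.score < c.minScore) {
      regressions.push(`${c.query}: score ${top.score} below ${c.minScore}`);
    }
  }
  return regressions;
}
```

Inside a Playwright test, the returned list can be asserted directly with `expect(regressions).toEqual([])`, which prints every failing query in one run.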

Impact:

Catches model regressions in < 5 seconds vs 60+ seconds for full UI flow
Blocked a bad configuration change that dropped accuracy by 8% before it merged

Final Results

Before vs After Comparison

Metric	Before	After	Improvement
Pass rate	67%	97%	+30pp
Wall-clock time (50 tests)	24 min	7 min	71% faster
VPN-related failures	57.6% of fails	0%	Eliminated
Auth collision failures	18.2% of fails	0%	Eliminated
Average selector time	2,500ms	80ms	97% faster
False failure rate	~31%	2%	~93% reduction
Tests catching real bugs	5	23	4.6× more signal

Engineer Time Saved

With nearly a third of runs failing for infrastructure reasons, engineers were re-running pipelines 2-3× per day. After the fix:

Re-runs per day: 8 → 1 (the 2% genuine flakes)
Engineer time saved: ~2 hours/day across the team

Key Takeaways

1. Categorize Your Failures Before Fixing Them

About 92% of my failures were infrastructure. Fixing the app would have done nothing. Categorize first.

2. VPN in CI Is Solved With wg-quick + Health Check

A curl --retry 5 health check against the protected endpoint before tests start is all you need. Don't assume the VPN is up — verify it.

3. Authenticate Through the API, Not the UI

Clicking through a login form: ~2,100ms. A POST to /auth/login + localStorage.setItem: 48ms. Use the API.

4. Page Objects Pay Off After the Third Refactor

They feel like over-engineering on day one. After the third CSS class rename breaks your selectors, you'll be glad you have them.

5. data-testid Is a Contract, Not a Convenience

It's a contract between the frontend team and QA. Add data-testid to every interactive element when the component is first built, not retroactively after tests break.

6. 2 Retries in CI Is the Right Number

0 retries: flaky network kills your pipeline
2 retries: 3 consecutive failures = real bug
3+ retries: you're hiding real failures

Tools & Technologies Used

E2E Framework: Playwright 1.44
VPN: WireGuard with wg-quick in GitHub Actions
CI/CD: GitHub Actions with matrix sharding (5 shards × 4 workers)
Auth Strategy: API-level login + localStorage injection
Selectors: data-testid by default, getByRole for accessibility checks
Reporting: Playwright blob reporter + HTML merge
API Tests: Playwright request fixture against REST endpoints

What's Next?

Visual regression: Percy/Argos for pixel-diff testing on key UI components
Contract testing: Pact consumer-driven contracts between frontend and backend
Chaos testing: Simulate VPN drops mid-test to validate reconnection behavior
Test impact analysis: Run only tests affected by changed files

Going from a 67% to a 97% pass rate was not about writing better tests; it was about fixing the infrastructure those tests run on. The VPN connection, auth session isolation, selector stability, and network flakiness together accounted for about 92% of failures.

Once most of the false failures were gone, a red build usually meant a real bug. That's the whole point: a test suite you trust is worth infinitely more than one you ignore.

If your E2E suite is unreliable, don't add more retries — categorize the failures first.