Playwright E2E Behind a VPN: How I Built a Reliable Staging Test Suite from Scratch
A practical guide to running Playwright E2E tests against a WireGuard-protected staging environment. Covers the Page Object Model, fixture-based auth, parallel sharding, and how I got from a 67% to a 97% pass rate.
The Challenge
The staging environment for a SaaS platform I worked on sits behind a WireGuard VPN. No VPN, no access to the API. Simple in theory; brutal in practice for CI.
When I took over the E2E suite, the pass rate was 67%. Not because the features were broken — because the tests were:
- Connecting to staging without VPN → instant 503
- Sharing a single auth session across 50 parallel tests → race conditions
- Using XPath selectors → 2-5 second timeouts on every action
- No retry logic → one flaky network call = failed test
The goal: 95%+ pass rate, under 8 minutes wall-clock time, zero false failures from infrastructure.
The Investigation: Categorizing Failures
I ran 200 test executions and categorized every failure:
Failure Distribution (67% pass rate baseline)
| Failure Type | Count | % of Failures |
|---|---|---|
| VPN not connected / 503 | 38 | 57.6% |
| Auth session collision | 12 | 18.2% |
| XPath timeout (>5s) | 8 | 12.1% |
| Actual bugs in the app | 5 | 7.6% |
| Network flakiness | 3 | 4.5% |
About 92% of failures (61 of 66) were infrastructure or test-suite problems, not real bugs. The app itself was mostly working fine.
The Solution
1. WireGuard Pre-Connection in GitHub Actions
The CI runner needs to be on the VPN before any test touches staging. I solved this with a dedicated VPN step before Playwright runs:
The WG_CONFIG secret stores the full wg0.conf content. The curl health check with --retry 5 ensures the VPN tunnel is fully established before tests start.
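The same "verify, don't assume" check can also be done from Node, for example in a setup script that runs before the suite. The following is a minimal sketch, not the original CI step: the function name, the health URL passed to it, and the attempt and delay defaults are assumptions chosen to mirror the curl --retry 5 behaviour described above.

```typescript
// Sketch: poll a staging health endpoint until the VPN tunnel carries traffic.
// waitForStaging, its defaults, and the injected fetchImpl are illustrative
// assumptions, not part of the original pipeline.

type FetchLike = (url: string) => Promise<{ ok: boolean; status: number }>;

async function waitForStaging(
  healthUrl: string,
  attempts = 5,
  delayMs = 2000,
  fetchImpl: FetchLike = (url) => fetch(url), // global fetch, Node 18+
): Promise<number> {
  let lastError = "no attempts made";
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const res = await fetchImpl(healthUrl);
      if (res.ok) return attempt; // tunnel is up and staging answers
      lastError = `HTTP ${res.status}`;
    } catch (err) {
      // Connection errors are expected while the tunnel is still coming up.
      lastError = err instanceof Error ? err.message : String(err);
    }
    if (attempt < attempts) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error(`Staging unreachable after ${attempts} attempts: ${lastError}`);
}
```

Failing loudly here means a broken tunnel shows up as one clear setup error instead of dozens of 503 test failures.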
Impact:
- VPN-related failures: 38 → 0 (eliminated)
2. Fixture-Based Isolated Auth Sessions
The original suite had a global.setup.ts that logged in once and saved storageState.json. All 50 tests read from the same file. Two tests modifying session state = race condition.
Before: shared global auth state
After: per-worker isolated auth via fixtures
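The isolation idea can be sketched as two small helpers plus a worker-scoped fixture. This is an illustration under assumptions, not the suite's actual code: the account naming scheme, the auth_token localStorage key, and the .auth/ file layout are all hypothetical. The fixture wiring is shown as a comment so the helpers stay dependency-free.

```typescript
// Sketch of per-worker auth state. Account names, the "auth_token" key, and
// the .auth/ directory are assumptions for illustration.

interface WorkerAccount {
  email: string;
  statePath: string;
}

// Map a worker's parallelIndex (0, 1, 2, ...) to its own account and its own
// state file, so no two workers read or write the same session.
function accountForWorker(parallelIndex: number): WorkerAccount {
  return {
    email: `e2e-worker-${parallelIndex}@example.com`,
    statePath: `.auth/worker-${parallelIndex}.json`,
  };
}

// Build the JSON shape Playwright accepts as storageState: a cookie list plus
// localStorage entries grouped by origin.
function buildStorageState(origin: string, token: string) {
  return {
    cookies: [],
    origins: [{ origin, localStorage: [{ name: "auth_token", value: token }] }],
  };
}

// Possible wiring with a worker-scoped fixture (requires @playwright/test):
//
// import { test as base } from "@playwright/test";
// export const test = base.extend<{}, { workerStorageState: string }>({
//   storageState: ({ workerStorageState }, use) => use(workerStorageState),
//   workerStorageState: [async ({}, use) => {
//     const account = accountForWorker(test.info().parallelIndex);
//     // Log in through the API here, then write buildStorageState(...) as
//     // JSON to account.statePath before handing the path to the tests.
//     await use(account.statePath);
//   }, { scope: "worker" }],
// });
```

Because the fixture is worker-scoped, the API login runs once per worker rather than once per test, and each worker's state file is private to it.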
Impact:
- Auth collision failures: 12 → 0
- Login time per test: 2,100ms → 48ms (API auth vs UI auth)
- Saved: ~102 seconds across 50 tests ((2,100 − 48) ms × 50 ≈ 102.6 s)
3. Page Object Model with data-testid Selectors
XPath selectors that depend on DOM structure or CSS class names break whenever the markup changes. I replaced every selector with data-testid attributes and wrapped pages in the Page Object Model.
Before: fragile XPath
After: Page Object Model with stable selectors
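A page object built on data-testid selectors might look like the sketch below. The LoginPage class, the /login path, and the test IDs are hypothetical. The two small interfaces list only the Page and Locator methods the class uses, which keeps the sketch dependency-free; Playwright's own Page type should satisfy them structurally.

```typescript
// Sketch of a page object using data-testid selectors. The class name, route,
// and test IDs are illustrative assumptions.

// Only the Locator/Page methods this page object needs.
interface LocatorLike {
  fill(value: string): Promise<void>;
  click(): Promise<void>;
}

interface PageLike {
  goto(url: string): Promise<unknown>;
  getByTestId(testId: string): LocatorLike;
}

class LoginPage {
  private readonly page: PageLike;

  constructor(page: PageLike) {
    this.page = page;
  }

  async open(): Promise<void> {
    await this.page.goto("/login");
  }

  // Tests call login() and never see a selector, so a markup change is fixed
  // in this one place.
  async login(email: string, password: string): Promise<void> {
    await this.page.getByTestId("login-email").fill(email);
    await this.page.getByTestId("login-password").fill(password);
    await this.page.getByTestId("login-submit").click();
  }
}
```

In a real test the object would be constructed as new LoginPage(page) with the page fixture Playwright provides.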
Impact:
- Selector timeout failures: 8 → 0
- Average selector resolution: 2,500ms → 80ms (97% reduction)
- Test maintenance time: reduced by ~60% (CSS refactors don't break tests)
4. Playwright Configuration for Staging
Key decisions:
- retries: 2 (three consecutive failures mean a real bug, not network noise)
- trace: 'on-first-retry' (traces are generated only when needed, which keeps CI artifacts small)
- workers: 4 per shard (matches the CI runner's CPU count without over-subscribing)
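Put together, these decisions correspond to a config along the following lines. This is a sketch rather than the original file: the baseURL is a placeholder, and the blob reporter is included on the assumption that shard reports are merged afterwards.

```typescript
// playwright.config.ts (sketch). baseURL is a placeholder; retries, workers,
// and trace reflect the decisions listed above.
const config = {
  retries: 2, // a test must fail three times in a row to fail the run
  workers: 4, // per shard
  reporter: "blob", // one blob report per shard, merged into HTML later
  use: {
    baseURL: "https://staging.example.com",
    trace: "on-first-retry", // record a trace only when a test is retried
  },
};

export default config;
```

Sharding itself is selected on the command line rather than in the config: each CI matrix job runs something like npx playwright test --shard=1/5, and a final job combines the blob reports with npx playwright merge-reports --reporter html.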
5. API-Level Regression Tests for Search Quality
Pure UI E2E tests are slow for validating ML-powered features. I added lightweight API tests that assert on relevance scores directly:
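An API-level relevance check can be sketched as follows. Everything specific here is hypothetical: the /api/search path, the results and score fields, and any threshold passed in. The RequestLike interface lists only the methods used, and Playwright's request fixture should satisfy it structurally.

```typescript
// Sketch: assert on search relevance at the API level. The endpoint path and
// the response shape are assumptions for illustration.

interface SearchHit {
  id: string;
  score: number;
}

// Only the request/response methods this check needs.
interface ResponseLike {
  ok(): boolean;
  json(): Promise<unknown>;
}

interface RequestLike {
  get(url: string): Promise<ResponseLike>;
}

// Fetch the highest-ranked result for a query.
async function topHit(request: RequestLike, query: string): Promise<SearchHit> {
  const res = await request.get(`/api/search?q=${encodeURIComponent(query)}`);
  if (!res.ok()) throw new Error(`Search request failed for "${query}"`);
  const body = (await res.json()) as { results: SearchHit[] };
  if (body.results.length === 0) throw new Error(`No results for "${query}"`);
  return body.results[0];
}

// Fail if the wrong document ranks first or its score drops below a floor.
function assertRelevant(hit: SearchHit, expectedId: string, minScore: number): void {
  if (hit.id !== expectedId) throw new Error(`Expected ${expectedId} first, got ${hit.id}`);
  if (hit.score < minScore) throw new Error(`Score ${hit.score} is below ${minScore}`);
}
```

A handful of fixed query and expected-result pairs run through these two functions gives a fast regression signal without opening a browser.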
Impact:
- Catches model regressions in < 5 seconds vs 60+ seconds for full UI flow
- Blocked a bad configuration change that dropped accuracy by 8% before it merged
Final Results
Before vs After Comparison
| Metric | Before | After | Improvement |
|---|---|---|---|
| Pass rate | 67% | 97% | +30pp |
| Wall-clock time (50 tests) | 24 min | 7 min | 71% faster |
| VPN-related failures | 57.6% of fails | 0% | Eliminated |
| Auth collision failures | 18.2% of fails | 0% | Eliminated |
| Average selector time | 2,500ms | 80ms | 97% faster |
| False failure rate | 33% | 2% | 94% reduction |
| Tests catching real bugs | 5 | 23 | 4.6× more signal |
Engineer Time Saved
The 33% false failure rate meant engineers were re-running pipelines 2-3× per day. After the fix:
- Re-runs per day: 8 → 1 (the 2% genuine flakes)
- Engineer time saved: ~2 hours/day across the team
Key Takeaways
1. Categorize Your Failures Before Fixing Them
About 92% of my failures were infrastructure. Fixing the app would have done nothing. Categorize first.
2. VPN in CI Is Solved With wg-quick + Health Check
A curl --retry 5 health check against the protected endpoint before tests start is all you need. Don't assume the VPN is up — verify it.
3. API-Level Auth Is ~44× Faster Than UI Login
Clicking through a login form: ~2,100ms. A POST to /auth/login + localStorage.setItem: 48ms. Use the API.
4. Page Objects Pay Off After the Third Refactor
They feel like over-engineering on day one. After the third CSS class rename breaks your selectors, you'll be glad you have them.
5. data-testid Is a Contract, Not a Convenience
It's a contract between the frontend team and QA. Add data-testid to every interactive element at build time, not retroactively after tests break.
6. 2 Retries in CI Is the Right Number
- 0 retries: flaky network kills your pipeline
- 2 retries: 3 consecutive failures = real bug
- 3+ retries: you're hiding real failures
Tools & Technologies Used
- E2E Framework: Playwright 1.44
- VPN: WireGuard with wg-quick in GitHub Actions
- CI/CD: GitHub Actions with matrix sharding (5 shards × 4 workers)
- Auth Strategy: API-level login + localStorage injection
- Selectors: data-testid exclusively, getByRole for accessibility
- Reporting: Playwright blob reporter + HTML merge
- API Tests: Playwright request fixture against REST endpoints
What's Next?
- Visual regression: Percy/Argos for pixel-diff testing on key UI components
- Contract testing: Pact consumer-driven contracts between frontend and backend
- Chaos testing: Simulate VPN drops mid-test to validate reconnection behavior
- Test impact analysis: Run only tests affected by changed files
Conclusion
Going from a 67% to a 97% pass rate was not about writing better tests. It was about fixing the infrastructure those tests run on. The VPN connection, auth session isolation, and selector stability accounted for about 88% of failures (58 of 66), or about 92% once network flakiness is included.
With the infrastructure failures gone, a red build almost always points to a real bug or to the roughly 2% of residual flakiness. That's the whole point: a test suite you trust is worth infinitely more than one you ignore.
If your E2E suite is unreliable, don't add more retries — categorize the failures first.