Been running tests for months and getting mixed results. Some wins look obvious but then don’t hold up when I scale.
What’s your process for actually validating significance? Feel like I’m just guessing half the time.
I check significance differently depending on the test.
For landing pages, I wait for 95% confidence AND run it for 2 weeks minimum. But here’s what got me early on - seasonal stuff screws everything up. I had a ‘winner’ that was just catching a holiday spike.
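For reference, here's roughly what that 95% check is doing under the hood - a standard two-proportion z-test, stdlib only. The conversion counts are made up, and your testing tool may use a fancier method (sequential testing, Bayesian, etc.), so treat this as a sketch:

```python
from math import sqrt, erf

def z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (illustrative)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

p = z_test(conv_a=480, n_a=10000, conv_b=560, n_b=10000)
significant = p < 0.05  # i.e. 95% confidence
```

Hitting p < 0.05 is necessary, not sufficient - which is why the 2-week minimum matters on top of it.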
Now I always break down the results after. Check different traffic sources, devices, time periods. If your winner only works for one slice, it’s probably just noise.
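The breakdown itself is simple - compute the lift per slice and see whether it holds everywhere or one segment is carrying the whole result. Sketch below with invented segment counts:

```python
# (control conversions, control visitors, variant conversions, variant visitors)
segments = {
    "mobile":  (120, 3000, 160, 3000),
    "desktop": (300, 6000, 305, 6000),
    "paid":    (60, 1000, 95, 1000),
}

def lift(conv_a, n_a, conv_b, n_b):
    """Relative lift of the variant over control."""
    return (conv_b / n_b) / (conv_a / n_a) - 1

lifts = {seg: lift(*counts) for seg, counts in segments.items()}
# Flag which slices show a meaningful lift (>5% here, arbitrary threshold)
positive = [seg for seg, l in lifts.items() if l > 0.05]
```

If `positive` comes back with only one slice out of many, the aggregate "win" is suspect.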
I also ignore anything that wins by less than 10%. Even if it hits significance, seasonal changes and measurement errors will eat the real impact. Focus on bigger wins instead.
After months of testing, I've learned sample size is everything. Run a sample size calculator before you launch - you need your baseline conversion rate and the minimum lift you're targeting.
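If you want to see what those calculators are doing, here's the standard two-proportion formula at 95% confidence / 80% power. The baseline and lift numbers are just examples:

```python
from math import ceil, sqrt

def sample_size(baseline, min_lift, z_alpha=1.96, z_beta=0.84):
    """Visitors needed per variant to detect a relative `min_lift` over `baseline`."""
    p1 = baseline
    p2 = baseline * (1 + min_lift)
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) / (p2 - p1) ** 2
    return ceil(n)

# 5% baseline, targeting a 10% relative lift -> ~31k visitors per variant
n = sample_size(baseline=0.05, min_lift=0.10)
```

Notice how fast `n` blows up as the target lift shrinks - chasing small lifts on low-traffic pages is usually a waste of a test slot.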
Here’s the real test: send all traffic to your winner and let it run for a full business cycle. If the lift disappears, you didn’t actually win.
Don’t trust whatever your testing tool says is significant. Run tests until you get 1000+ conversions per variant - that’s the bare minimum. The real test happens afterward. If your ‘winner’ tanks when you scale up or loses its lift after 2-4 weeks, it was never a real win. I validate by running the winner for twice as long as the original test. If the improvement sticks, it’s legit. If not, back to the drawing board.
Test it out for longer. Don’t rely on luck.
Confidence levels are key. Run tests for two weeks minimum. You might still get burned, so stay cautious.