Saturday, April 28, 2018

Get Started with Testing (A Work-Related Post)

from armydre2008
In Marketing, testing is a great way to home in on what resonates with your customers.

This is not meant to make you an expert in testing, but hopefully it will get you started with simple testing or help you create better tests if you feel like your testing isn't yielding the results you expected.  If you pursue testing, don't let this be the only article you read on the subject.

Always Be Testing

When you're not testing, you're not learning or improving. There is a dangerous catch though - do not test just to say that you're testing.  Tests should build upon earlier learnings and should push you towards valuable corporate objectives (sales, newsletter signups, etc.) and not towards less quantifiable results (like "time spent on site" if you're not an online game or video platform).

Your testing should be coordinated with others who are testing and you should be documenting all of your testing so that others can learn from it and leverage it as the basis for other tests.

Also, be ready to re-test. When I worked at a video game company, if we tested on one video game's website and had good results, we'd test again on another video game website to see if it was a universal truth or if it was only valid for the audience of the one kind of website. When I worked at a non-profit, we repeated tests against different products because we knew that what worked for recurring giving might not work for one-time-gift appeals.

Vocabulary

Let's start with a few basic terms:

Hypothesis - this is your theory, created in advance, that describes what you are testing and what you expect the outcome to be. This should be detailed and precise.

Test - an attempt to find the relative performance of two or more distinct items over a set period of time.

---

Experiences - this is your control plus all challengers.  I like to label these A for the Control and B, C, D, etc. for the Challengers - this is where the A/B/N comes from.  (The labels help if you've got a table of results and little space.)

Control - this is the as-is experience.  This is your current website or email or advertisement or store layout or whatever.

Challenger - this is an alternative experience, a change that you want to test against the control. On a website, it might be using blue buttons instead of green.  You can have multiple challengers in a single test.

---

Metrics / Success Metrics - these are the one or more things you're measuring.  I like to label these 1, 2, 3, etc. The labeling also helps if you've got a table of results but little space. If your tests are self-optimizing, metric 1 should be the one that triggers the optimization.  Often you're measuring rates (click rate, sign-up rate, purchase rate), not a raw number.

Conversion - this is my catch-all term for a tally mark counted towards a success metric. So it may be a click on a button, a view of a video, an actual sale. This is the term most likely to get me angry comments suggesting I'm a hack and that I should be calling them something else.  Comment away.

Results - this should prove or disprove your hypothesis.  The only failed test is one that is inconclusive. Otherwise, even a test that performs contrary to your hypothesis is still a chance to learn.  If you find that you are consistently wrong in your hypothesis, you may need to let someone else start coming up with future hypotheses.

Statistical Significance - this tells you that you can trust your results.  This is the likelihood that if you repeated the test you'd get the same results.  This is a really important topic as failing to achieve statistical significance makes your tests worthless and leaves you defenseless if someone challenges your results.  This can't be stressed enough.

Lift/Drag - typically you report the performance of the Challenger relative to the Control; lift or drag tells you the direction.  If the result isn't statistically significant, substitute the terms "Trend Lift" or "Trend Drag."  (There's a short worked sketch at the end of this vocabulary list.)

Winner/Loser - the experience that performs best (or worst) against your desired success metric, with statistical significance.

Added Learnings - this is something you've observed, but not something you were specifically testing for.  If these seem valuable, it is important to write a hypothesis and give them their own test.

---

Multivariate Testing - a whole different topic. This is where you're testing multiple items at the same time. It might be button color, wording and photos. The math is far more complex and it's harder to know which item or items caused the change.  Best done with a website testing tool.

Self-Optimizing - some website testing tools will automatically start to show the trending winner more often. If you're testing to make money, this helps you start to make more money quickly.
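
To make the rate and lift/drag vocabulary concrete, here's a minimal Python sketch with made-up numbers; the function names are mine, not from any particular testing tool.

```python
# Minimal sketch: turning raw counts into a rate and a lift/drag figure.
# The numbers are made up for illustration.

def conversion_rate(conversions, impressions):
    """A rate is conversions divided by impressions (click rate, sign-up rate, etc.)."""
    return conversions / impressions

def lift(control_rate, challenger_rate):
    """Positive means lift, negative means drag, relative to the control."""
    return (challenger_rate - control_rate) / control_rate

control_rate = conversion_rate(conversions=300, impressions=10_000)     # Experience A
challenger_rate = conversion_rate(conversions=330, impressions=10_000)  # Experience B

print(f"A: {control_rate:.2%}  B: {challenger_rate:.2%}  "
      f"lift: {lift(control_rate, challenger_rate):+.1%}")
# A: 3.00%  B: 3.30%  lift: +10.0%
```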

Ground Rules

Having some ground rules that apply to all testing will be beneficial when results are hard to understand, when someone's pressuring you for results, when you're wondering if you need to invalidate a test and start over, or when you're trying to decide if a test is even worth performing.  Create your own to meet your specific needs.

Here are some of the types I've used in the past.

1. A test must run for at least one month and no more than three months.

Why: You need a reasonable timeframe. If you try to read the results too quickly, it could be a fluke or be influenced by things beyond your control - though it is possible that you've hit on some magic and found something so in need of optimizing that it was an instant success.  But if you drag things out too long, again, things beyond your control can influence your test.  If you can't get to statistical significance in three months, it would also suggest that you're (a) testing something that has no impact on your customers, (b) not getting enough traffic, or (c) not being bold enough in your testing.

2. The test must receive at least 10,000 impressions before any results can be assessed.  

Why: Similar to #1, this helps make sure that you've got a big enough audience.

3. A test must receive at least 5,000 impressions a week to remain viable.

Why: If the volume is too low, it may not be worth it to test.  Either you're running too many concurrent tests or you're optimizing for something that may be of low value.

4. An experience must have at least 300 conversions.

Why: Similar to #3, tries to make sure you're not spending time on low value tests.  Of course, if each conversion is the sale of an Aston Martin, that's different. And an example of why you should establish your own ground rules.

5. Anything less than 3% lift or drag is considered "no change."

Why: You want big changes.  You want to take the kinds of risks that give you solid learnings.  If you can only push the needle a small amount, you're being too conservative in your testing.

6. Results over a year old should be re-tested.

Why: Things can change so much in the course of a year. Technology, world events, consumer tastes are ever evolving. What worked a year ago may make no sense now.

7. A test cannot have more than 4 experiences (control + 3 challengers).  

Why: If you can't pick the best four experiences, you may need to think about what you're really trying to measure. It might be that instead of questioning whether your buttons should be red, blue, green, purple, orange or yellow, you should first be testing whether you should have cold or hot colors.

8. Tests cannot cross seasonal events or major calendar events relevant to your company.

Why: Seasonal spikes and big promotions change who shows up and how they behave, which will skew your results.  This is especially important if you're in a different group from your advertising team and may not know what campaigns are running.

9. Control/Baseline/As-Is has to be measured during the test.

Why: As the old saying goes, past performance is no guarantee of future results.  You need to know how your control is performing at the same time as the challengers. If you can't do this, your test results will be suspect.
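
If you report on your tests programmatically, you might encode ground rules like these as simple checks so every test gets judged the same way. A rough sketch using the example thresholds above; the function and its wording are my own invention, so adapt them to your own rules.

```python
# Rough sketch: encoding example ground rules as a checklist.
# Thresholds mirror the sample rules above; swap in your own.

MIN_TOTAL_IMPRESSIONS = 10_000   # rule 2
MIN_WEEKLY_IMPRESSIONS = 5_000   # rule 3
MIN_CONVERSIONS = 300            # rule 4
NO_CHANGE_BAND = 0.03            # rule 5: within +/-3% counts as "no change"

def check_ground_rules(total_impressions, weekly_impressions, conversions, lift):
    """Return a list of reasons the test can't be called yet (empty list = looks good)."""
    problems = []
    if total_impressions < MIN_TOTAL_IMPRESSIONS:
        problems.append("not enough total impressions to assess results")
    if weekly_impressions < MIN_WEEKLY_IMPRESSIONS:
        problems.append("weekly volume too low to remain viable")
    if conversions < MIN_CONVERSIONS:
        problems.append("not enough conversions for this experience")
    if abs(lift) < NO_CHANGE_BAND:
        problems.append("lift/drag is within the no-change band")
    return problems

print(check_ground_rules(total_impressions=12_000, weekly_impressions=4_000,
                         conversions=350, lift=0.02))
# ['weekly volume too low to remain viable', 'lift/drag is within the no-change band']
```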

Let's Test!


Step 1 - Form a Hypothesis

"I believe that green buttons on the product page will result in more clicks to put items in the shopping cart than our current blue buttons.  A previous test (link here) showed that when we changed the button size from 20px tall to 40px resulted in more clicks (and more sales). This further tests the notion that our buttons are often overlooked."

Dissecting the hypothesis:


  • Control - blue buttons on the product page
  • Challenger - green buttons on the product page
  • Theory - blue buttons are overlooked by our audience ultimately causing fewer sales
  • Builds on - past test to increase button size
  • Expected outcome - higher click rate for challenger 
  • Measurable metrics - clicks, visitors per experience
  • Baseline/As-Is - click rate (clicks over visitors) for the blue button control during the test timeframe

Plan carefully to avoid creating a reporting nightmare (see below).
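
One way to head off that nightmare is to write every test plan down in a consistent structure before you build anything, which also gives others the documentation to build on. A minimal sketch; the field names are illustrative, not from any particular tool.

```python
# Minimal sketch: one consistent structure for documenting a test plan.
# Field names are illustrative, not from any particular testing tool.
from dataclasses import dataclass

@dataclass
class TestPlan:
    hypothesis: str
    control: str                # Experience A
    challengers: list           # Experiences B, C, D, ...
    success_metrics: list       # metric 1 is the primary (trigger) metric
    builds_on: str = ""         # link to the earlier test this extends
    expected_outcome: str = ""

button_color_test = TestPlan(
    hypothesis="Green buttons will get more add-to-cart clicks than blue buttons.",
    control="Blue add-to-cart button",
    challengers=["Green add-to-cart button"],
    success_metrics=["Click rate (clicks / visitors per experience)"],
    builds_on="Button size test (20px vs 40px)",
    expected_outcome="Higher click rate for the challenger",
)
```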

Step 2 - Build the test

You'll need to make sure you can serve up the different experiences and measure the outcome.  If it's a website, can the website display the different experiences and track the results independently?  You need to know how many people saw the blue button and how many people saw the green button and how many clicked on each button.

And then you need an easy way to get to the results.  If your dev team is building this for you, the work's not done until you have a way to see the results and a way to turn the test on and off.  There are a number of tools that can make easy work out of this so that you don't need a dev effort to do a website test.

Of course, if it's an email, it's just a matter of splitting the audience and sending different versions to each segment. If you're testing store layouts, you might need multiple stores running each design.

Clear away any other factors that could influence your test.  If some of your acquisition funnel traffic performs differently, steer them clear of the test and do not include them in the test results.
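
If you're building the split yourself, one common pattern is to hash a stable visitor ID into a bucket so a returning visitor always sees the same experience. A simplified sketch, not tied to any particular testing tool:

```python
# Simplified sketch: stable assignment of visitors to experiences.
# Hashing the visitor ID means the same visitor always gets the same experience.
import hashlib

EXPERIENCES = ["A (blue button)", "B (green button)"]

def assign_experience(visitor_id, test_name="button_color_test"):
    digest = hashlib.sha256(f"{test_name}:{visitor_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(EXPERIENCES)
    return EXPERIENCES[bucket]

print(assign_experience("visitor-12345"))  # same input, same experience every time
```

However you do the split, log an impression for whichever experience was shown and a conversion when the visitor clicks, so the rates can be calculated per experience.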

Step 3 - Check the results

Based on the traffic, I like to present results every 1 to 2 weeks during a test.  Sometimes other stakeholders may decide to end a test, either because the results aren't telling us anything, or because we've reached statistical significance/confidence and it's time to move on to the next test.

You are always measuring in pairs.  If you have two experiences and one success metric, it's one measurement.

Click Rate of Blue Buttons vs Green Buttons =
A vs B (blue vs green)

But, if you get more complex, your assessments need to get more complex.  Four experiences and one success metric is six measurements:

Click Rate of Blue Buttons vs Green Buttons vs Red Buttons vs Yellow Buttons =
A vs B (blue vs green)
A vs C (blue vs red)
A vs D (blue vs yellow)
B vs C (green vs red)
B vs D (green vs yellow)
C vs D (red vs yellow)

Two experiences and two success metrics is two measurements:

Click Rate of Blue Buttons vs Green Buttons
and Sales Rate of Blue Buttons vs Green Buttons =

Clicks: A vs B (blue vs green)
Sales: A vs B (blue vs green)

And of course four experiences and two success metrics is twelve measurements.
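
In general, the number of measurements is "experiences choose two" times the number of success metrics, so it grows quickly. A quick way to sanity-check the counts above:

```python
# Quick check: pairwise measurements = C(experiences, 2) * success metrics.
from math import comb

def measurements(experiences, metrics):
    return comb(experiences, 2) * metrics

print(measurements(2, 1))  # 1
print(measurements(4, 1))  # 6
print(measurements(2, 2))  # 2
print(measurements(4, 2))  # 12
```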


The Super Very Good Absolutely Important Notion of Statistical Significance

In short, Statistical Significance asks "If we run this test again and again and again, will we get the same results?" It involves a lot of heavy lifting of your pairs of data.  You may hear talk of "p-values" but for our purposes, we simplify with lift/drag or winner/loser and "confidence."  The best site on the internet to make this measurement is kissmetrics.com/growth-tools/ab-significance-test/

How it works: You punch in your impressions and clicks and it will give you conversion, lift and confidence.  In one example, Experience (B) is performing better just from the conversion rate (which you can calculate yourself with Excel), but even though it has a lift of 2%, it's not yet statistically significant.


But compare that with another example where the lift is greater: the confidence is greater too, and now we've reached statistical significance.  (You should feel good when confidence exceeds 95%.)


Enter your own numbers into the tool to see how the results change.
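
If you want to sanity-check a calculator like that (or fold the comparison into your own reporting), the usual math behind this kind of A/B calculator is a two-sided, two-proportion z-test. Here's a minimal sketch with made-up numbers; the tool linked above may differ in its exact method.

```python
# Minimal sketch of the math behind a typical A/B significance calculator:
# a two-sided, two-proportion z-test. The sample numbers are made up.
from math import sqrt, erf

def ab_significance(imps_a, convs_a, imps_b, convs_b):
    rate_a = convs_a / imps_a
    rate_b = convs_b / imps_b
    lift = (rate_b - rate_a) / rate_a
    # Pooled standard error for the difference between two proportions.
    pooled = (convs_a + convs_b) / (imps_a + imps_b)
    se = sqrt(pooled * (1 - pooled) * (1 / imps_a + 1 / imps_b))
    z = (rate_b - rate_a) / se
    # Confidence that the difference is real (1 minus the two-sided p-value).
    confidence = erf(abs(z) / sqrt(2))
    return rate_a, rate_b, lift, confidence

rate_a, rate_b, lift, confidence = ab_significance(10_000, 300, 10_000, 360)
print(f"A: {rate_a:.2%}  B: {rate_b:.2%}  lift: {lift:+.1%}  confidence: {confidence:.1%}")
# With these made-up numbers the lift is +20% and the confidence clears the 95% bar.
```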

If you work with someone who regularly likes to question facts or "go with their gut," statistical significance is a powerful argument.

Happy Testing!

Your first tests are like the biggest chisels and hammers applied to a block of granite - they get you to a rough outline. But you need to build upon those tests with new tests of greater precision and focus. Refine with more specific tests or for more specific audiences, and soon a picture will emerge.  Because if you don't refine, all you'll end up with is a bunch of rocks and dust.

One last thought... what if you just have data?

If you had no hand in deciding what was captured, how it was captured, when it was captured or who the participants being measured were, then all you can do is attempt to make forensic guesses about what happened -- which you then need to validate with new tests.

(Cross-posted to LinkedIn)
