Effective A/B Testing

Making money with statistics

Ben Tilly

Pictage

Contents

  1. Overview of A/B testing
  2. The G-test
  3. Odds and Ends
  4. Development considerations
  5. Reporting
  6. Gotchas
  7. How Long?
  8. Multiple tests
  9. Many versions
  10. Statistics overview
  11. The dollar challenge
  12. The z-test
  13. End

1. Overview of A/B Testing

What is A/B Testing?

Why A/B test?

What Can You A/B Test?

A/B tests do not substitute for

2. The G-test

What is the G-test?

Limitations of the G-test

What to measure

Arrange your measurements

        Yes      No      Total
A       $a_yes   $a_no   $a
B       $b_yes   $b_no   $b
Total   $yes     $no     $total
  • $a_yes = # in A who are yes
  • $a_no = # in A who are no
  • $b_yes = # in B who are yes
  • $b_no = # in B who are no

Scary Math Part 1 - Addition

        Yes      No      Total
A       $a_yes   $a_no   $a
B       $b_yes   $b_no   $b
Total   $yes     $no     $total
  • $a = $a_yes + $a_no
  • $b = $b_yes + $b_no
  • $yes = $a_yes + $b_yes
  • $no = $a_no + $b_no
  • $total = $a + $b (or $yes + $no)

Scary Math Part 2 - Expectations

        Yes        No        Total
A       $e_a_yes   $e_a_no   $a
B       $e_b_yes   $e_b_no   $b
Total   $yes       $no       $total
  • $e_a_yes = $a * $yes / $total
  • $e_a_no = $a * $no / $total
  • $e_b_yes = $b * $yes / $total
  • $e_b_no = $b * $no / $total

Scary Math Part 3 - G-test

Scary Math Part 4 - Calculation

We have 4 measurements and 4 expectations. So we have 4 G-test terms. We add them:
my $g_test
    = 2 * ( $a_yes * log( $a_yes / $e_a_yes )
          + $a_no  * log( $a_no  / $e_a_no  )
          + $b_yes * log( $b_yes / $e_b_yes )
          + $b_no  * log( $b_no  / $e_b_no  )
        );
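
Parts 1 through 4 can be collected into one function. The following is a minimal sketch rather than the talk's own code: the name `g_test`, the variables `$a_total` and `$b_total` (the slides' `$a` and `$b`, renamed to avoid Perl's sort variables), and the zero-count guard are assumptions added for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: G-test statistic for a 2x2 table of counts.
sub g_test {
    my ($a_yes, $a_no, $b_yes, $b_no) = @_;

    # Part 1: row, column, and grand totals.
    my $a_total = $a_yes + $a_no;     # $a in the slides
    my $b_total = $b_yes + $b_no;     # $b in the slides
    my $yes     = $a_yes + $b_yes;
    my $no      = $a_no  + $b_no;
    my $total   = $a_total + $b_total;

    # Parts 2-4: pair each measurement with its expectation and add the terms.
    my $sum = 0;
    for my $cell ( [ $a_yes, $a_total * $yes / $total ],
                   [ $a_no,  $a_total * $no  / $total ],
                   [ $b_yes, $b_total * $yes / $total ],
                   [ $b_no,  $b_total * $no  / $total ] ) {
        my ($measured, $expected) = @$cell;
        next if $measured == 0;    # assumption: treat 0 * log(0) as 0
        $sum += $measured * log( $measured / $expected );
    }
    return 2 * $sum;
}

# Example: 100 of 1000 converted in A, 130 of 1000 in B.
printf "G = %.2f\n", g_test( 100, 900, 130, 870 );
```

When A and B convert at exactly the same rate, every measurement equals its expectation, each logarithm is zero, and the function returns 0.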

Scary Math Part 5: Interpretation

use Statistics::Distributions qw(chisqrprob);
my $p = chisqrprob(1, $g_test);
  1. If the samples are all independent...
  2. and the measured values are all at least 10...
  3. and the real performance of A and B is the same...
  4. then $p ≈ prob(G-test should be > $g_test)
  5. If $p is "small", conclude #3 likely wrong
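
If Statistics::Distributions is not installed, the same p-value can be sketched with core Perl alone. This is a substitute for the `chisqrprob(1, $g_test)` call, not the talk's method: it relies on the identity that the upper tail of a chi-square distribution with 1 degree of freedom at x equals erfc(sqrt(x/2)), and it assumes Perl 5.22 or later, where POSIX provides `erfc`. The name `p_value_1df` and the 0.05 cutoff are illustrative assumptions.

```perl
use strict;
use warnings;
use POSIX qw(erfc);    # assumption: Perl 5.22 or later

# Hypothetical helper: upper-tail probability of a chi-square
# distribution with 1 degree of freedom, evaluated at $g_test.
sub p_value_1df {
    my ($g_test) = @_;
    return 1 if $g_test <= 0;
    return erfc( sqrt( $g_test / 2 ) );
}

my $p = p_value_1df(4.43);
printf "p = %.3f\n", $p;

# 0.05 is only an example threshold for "small".
print $p < 0.05 ? "A and B likely differ\n" : "no clear difference yet\n";
```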

How Small Is "Small"?

3. Odds and Ends

Recap of A/B setup

Recap of G-test evaluation

What if I don't have Perl?

What if I don't want Perl?

Show Me

4. Development considerations

Tests are temporary

Streamline development

Programming API

A sufficient API

my $version
    = get_or_create_test_version(
          $person_id, "some test name",
          ["some version", "another version", ...],
          [4, 1, ...] # optional, defaults to [1, 1, ...]
      );

Behind the scenes

Complications behind the scenes

Code outline for function

sub get_or_create_test_version {
    my ($person_id, $test, $versions, $weights) = @_;
    my $version = production_test_override($test)
                || get_test_from_cache($person_id, $test);
    return $version if $version;
    $version = get_test_from_database($person_id, $test);
    if (not $version) {
        # choose_random_version should NOT use $person_id
        $version = qa_test_override($test)
                 || choose_random_version($versions, $weights)
                 || return;
        save_test_to_database($person_id, $test, $version);
    }
    save_test_to_cache($person_id, $test, $version);
    return $version;
}
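
The outline leaves `choose_random_version` undefined. One possible body, as a sketch only: draw a random point across the summed weights and walk the list until the point falls inside a version's slice. This is an assumption, not the talk's implementation. It takes only the versions and weights, so the choice cannot depend on `$person_id`, and it returns nothing for an empty list so that the outline's `|| return` applies.

```perl
use strict;
use warnings;

# Hypothetical implementation of the helper named in the outline.
sub choose_random_version {
    my ($versions, $weights) = @_;
    return unless $versions && @$versions;

    # Default to equal weights, as in [1, 1, ...].
    $weights ||= [ (1) x @$versions ];

    my $total = 0;
    $total += $_ for @$weights;
    return unless $total > 0;

    # Pick a point in [0, $total) and find the slice that contains it.
    my $pick = rand($total);
    for my $i ( 0 .. $#$versions ) {
        $pick -= $weights->[$i] // 0;
        return $versions->[$i] if $pick < 0;
    }
    return $versions->[-1];    # guard against floating-point edge cases
}

# With weights [4, 1], "some version" is chosen about 80% of the time.
print choose_random_version( [ "some version", "another version" ], [ 4, 1 ] ), "\n";
```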

Naming your tests

Naming example

In the database

Programming recap

5. Reporting

Time for reporting

The conversion funnel

Your conversion funnel

Examples of metrics

Too many metrics?

Reporting is work

Simplify custom reporting

log_test_activity(
    $person_id, $test_name, $test_version, $action
);

If you have a data warehouse

6. Gotchas

Is that it?

Compare apples to apples

Be careful when changing the mix

What is wrong with this?

Beware hidden correlations!

Guarantee independence

People don't like UI changes

Schedule Compression

Wrong metric

That's it!

7. How Long?

A simpler example

Coin observations

A/B test simulation

Best Case Example

Low Conversion Example

Low Lift Example

A/B test simulation conclusions

More conclusions

8. Multiple tests

Running multiple tests

Running multiple tests at once

Random assignment is essential

Bad combinations are a lesser concern

Tip: Do many small tests

9. Many versions

A/B/C... testing the wrong way

Why is this wrong?

Extreme example

A/B/C... testing take 2

Finding the right fudge factor

A/B/C... testing the right way

10. Statistics overview

Back to theory

Some basic terms

Distribution of the fair coin

Basic properties

Estimating E(X) and Var(X)

The normal distributions

Central Limit Theorem

100 coin flips

Convergence to the mean

What is the Chi-square distribution?

History of the G-test

The Yates' continuity correction

11. The dollar challenge

How much money did we make?

A/B Testing Setup

First attempt, sidestep

Second attempt, standard statistics

What does business data look like?

What about at the extreme?

Ignoring the extreme

12. The z-test

Third attempt, z-test

z-test cont.

  (mean(A) - mean(B)) / sqrt(Var(A)/count(A) + Var(B)/count(B))
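
The formula above translates directly into Perl. The sketch below is illustrative only: the names `mean`, `variance`, and `z_score` are hypothetical, each sample is assumed to be a list of per-person values (for example, dollars spent), and the variance is estimated with the unbiased n - 1 divisor, which is an assumption about the intended estimate of Var(X).

```perl
use strict;
use warnings;

# Mean of a list of numbers.
sub mean {
    my @x = @_;
    my $sum = 0;
    $sum += $_ for @x;
    return $sum / scalar(@x);
}

# Unbiased sample variance (divides by n - 1).
sub variance {
    my @x = @_;
    my $m  = mean(@x);
    my $ss = 0;
    $ss += ( $_ - $m )**2 for @x;
    return $ss / ( scalar(@x) - 1 );
}

# z statistic for the difference in means of two samples.
sub z_score {
    my ($sample_a, $sample_b) = @_;
    my $spread = variance(@$sample_a) / scalar(@$sample_a)
               + variance(@$sample_b) / scalar(@$sample_b);
    return ( mean(@$sample_a) - mean(@$sample_b) ) / sqrt($spread);
}

printf "z = %.2f\n", z_score( [ 1, 2, 3, 4, 5 ], [ 2, 3, 4, 5, 6 ] );
```

Each sample needs at least two values, and the two samples must not both be constant, or the divisions fail.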

Limitations

Reconsider the G-test?

End

About this presentation

Questions?

The next slides have some of the questions that came up during and after the talk.

And there were questions

How long have you been doing A/B testing?

I think my first A/B test was in late 2003. So nearly 5 years.

What organizational impacts are there?

A/B testing may initially be threatening to those who perceive themselves as experts. So do your best to have your first couple of A/B tests be fairly easy wins with the product manager on board. Once you have financial results it is easier to handle people who don't like discovering that their pronouncements are not infallible.

"What you refuse to see, is your worst trap" has more to say on why people don't like these challenges.

And more questions

How do you budget the time for A/B testing?

Generally the time to set up A/B tests comes out of the regular development schedule. That is why I made such a big deal out of making it as lightweight as possible for the developers. There are a lot of demands on developers, and schedule pressure is one of them. You want to avoid giving them reasons to push back.

Currently my job is reporting. Reporting on A/B tests is one of my responsibilities. However, if you don't have a dedicated person for reporting (we didn't when I first worked on A/B testing), then the time to develop reports will again come out of your development budget, which is a good reason to try to reuse reporting from one project to another.

And more questions

What if we want to do this in Java?

All of the code presented is easily written in any language, including Java, except for the call to chisqrprob. My suggestion there is to get Rhino and then you can embed statistics-distributions.js in your Java program and call it. You only need to call one function.

For other languages you can set up a command line utility or a web service. Unless you want to port the utility again... If you want to port the library then I suggest starting with the JavaScript version. It is more consistently parenthesized so there will be fewer possible gotchas.

And more questions

What do you think of Google's Website Optimizer?

If you want to get going on A/B testing and don't want to worry about doing any of the reporting, are concerned you'll get it wrong, etc, then by all means use it. However be aware that it has a number of major limitations. It works by using JavaScript to inject static content into your page. The fact that it is static makes it hard to have the bit that you're A/B testing be dynamic. You can do it, by having it inject some JavaScript that dynamically rewrites your HTML, but that is hackish.

There are other limitations. It doesn't support multiple metrics. It can't be used to A/B test email content.

If none of those are a problem for you, then by all means use it. However I think the flexibility of doing it yourself is worthwhile.

And more questions

What about A/B testing large pieces of functionality?

You can A/B test anything. However with large projects it may not make sense during development to try to hide it behind an A/B test. In that case you could just release the project and then start A/B testing features within it.

Which way you go is up to you.

And a bug

How to calculate the fudge factor?

Due to a bug in a simulation, I had calculated the fudge factors wrong for n-way chi-square and G-tests. My thanks to Lukas Trötzmüller for catching this many years later, and my apologies to people who relied on that mistake over the years.