With so many experiments, including our hypothetical upside down box art example, we need to be thoughtful about what our metrics are telling us. Suppose we look at click-through rate: the fraction of members in each experience who click on a title. On its own, this metric can be a misleading measure of whether the new UI is a success, as members may simply click on a title in the upside down experience in order to read it more easily. In that case, we would also want to measure what fraction of members subsequently navigate away from that title, versus going on to play it.
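To make this concrete, here is a minimal sketch of the two metrics described above. The event record, field names, and numbers are all hypothetical and for illustration only; real experimentation pipelines are far more involved.

```python
from dataclasses import dataclass

# Hypothetical, simplified per-member event record (illustrative only).
@dataclass
class TitleEvent:
    member_id: str
    clicked: bool
    played: bool  # did the member go on to play the title after clicking?

def engagement_metrics(events: list[TitleEvent]) -> dict[str, float]:
    """Click-through rate alone can mislead; also measure what happens after the click."""
    n = len(events)
    clicks = [e for e in events if e.clicked]
    plays_after_click = [e for e in clicks if e.played]
    return {
        "click_through_rate": len(clicks) / n if n else 0.0,
        "play_rate_given_click": len(plays_after_click) / len(clicks) if clicks else 0.0,
    }

# Upside down cell: many clicks (just to read the title), few follow-on plays.
upside_down = [
    TitleEvent("m1", clicked=True, played=False),
    TitleEvent("m2", clicked=True, played=False),
    TitleEvent("m3", clicked=True, played=True),
    TitleEvent("m4", clicked=False, played=False),
]
print(engagement_metrics(upside_down))
```

A high click-through rate paired with a low play rate given a click is exactly the pattern that would reveal the upside down clicks as curiosity rather than genuine interest.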
These metrics include measures of member engagement with Netflix: are the ideas we are testing helping our members choose Netflix as their entertainment destination on any given night?
There is a lot of statistics involved as well: how big a difference counts as significant? How many members do we need in an experiment to detect an effect of a given size? How do we analyze the data most efficiently? Here we focus on the high-level intuition, and will describe some of those details in later posts.
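As a taste of the statistics involved, the question "how many members do we need to detect an effect of a given size?" can be answered with a standard power calculation. The sketch below uses the textbook normal approximation for comparing two proportions; it is a generic illustration, not Netflix's actual methodology, and the rates plugged in are made up.

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p1: float, p2: float,
                                alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate members needed per group to detect a shift from rate p1 to p2
    with a two-sided test (standard normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Detecting a small lift (20% -> 21%) takes far more members
# than detecting a large one (20% -> 25%).
print(sample_size_two_proportions(0.20, 0.21))
print(sample_size_two_proportions(0.20, 0.25))
```

The takeaway: the smaller the effect you care about, the larger the experiment must be, which is one reason sample-size planning happens before an A/B test launches.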
Hold everything else constant
Because we create our control (“A”) and treatment (“B”) groups using random assignment, we can be confident that, on average, the two groups are balanced on all dimensions that may be meaningful to the test. Random assignment ensures, for example, that the average length of Netflix membership is not meaningfully different between the control and treatment groups, nor are content preferences, primary language selection, and so on. The only remaining difference between the groups is the new experience we are testing, which ensures that our estimate of the impact of that experience is unbiased.
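One common way to implement this kind of random assignment is to hash a stable member identifier together with a test name, so each member lands in a stable group that is independent of tenure, language, tastes, or anything else. This is a generic sketch of the technique, not a description of Netflix's actual allocation system; the function and test names are made up.

```python
import hashlib

def assign_group(member_id: str, test_name: str) -> str:
    """Deterministic pseudo-random assignment: the hash depends only on the
    member id and test name, so groups are stable across sessions and
    balanced, on average, on every member attribute."""
    digest = hashlib.sha256(f"{test_name}:{member_id}".encode()).hexdigest()
    return "B" if int(digest, 16) % 2 else "A"

# The same member always gets the same group for a given test...
print(assign_group("member-123", "upside_down_test"))
# ...and across many members, the split is close to 50/50.
frac_b = sum(assign_group(str(i), "upside_down_test") == "B"
             for i in range(10_000)) / 10_000
print(f"fraction in B: {frac_b:.3f}")
```

Hashing rather than storing a random draw also means different tests (different `test_name` values) slice the population independently, so running one experiment does not bias another.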
To appreciate why this matters, consider another way we could make a decision: roll out the new upside down box art experience (discussed above) to all Netflix members and watch for any major change in our metrics. If there is a positive change, or no evidence of a meaningful change, we keep the new experience; if there is evidence of a negative change, we roll back to the previous product experience.
Suppose we did that (again, this is hypothetical!) and flipped the switch to the upside down experience on day 16 of the month. What would we conclude if we collected the following data?
The data looks great: we launch a new product experience and member engagement jumps substantially! But given only this data, plus the knowledge that the product UI flips all the box art upside down, how confident can you be that the new experience is actually good for our members?
Do we really know that the new product experience caused the increase in engagement? Is any other explanation plausible?
What if you also knew that, on the same day the (hypothetical) upside down product experience launched, Netflix released a hit title, such as a new season of Stranger Things or Bridgerton, or a hit movie like Army of the Dead? Now we have several possible explanations for the increase in engagement: it could be the new product experience, it could be the hit title that is all over social media, it could be both. Or it could be something else entirely. The bottom line is that we do not know whether the new product experience caused the increase in engagement.
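A tiny simulation makes the confounding concrete. All the numbers below (baseline engagement, the lift from the title launch, the effect of the upside down UI) are invented for illustration; the point is only that a before/after comparison mixes the two effects together.

```python
import random

random.seed(7)

# Hypothetical daily engagement in hours viewed per member (illustrative only).
BASELINE = 2.0          # typical daily engagement
TITLE_LAUNCH_LIFT = 0.4 # boost from the hit title
UI_EFFECT = -0.1        # true (negative) effect of the upside down UI

def daily_engagement(day: int, upside_down: bool) -> float:
    value = BASELINE + random.gauss(0, 0.05)  # day-to-day noise
    if day >= 16:        # the hit title launches on day 16...
        value += TITLE_LAUNCH_LIFT
    if upside_down:      # ...the same day the upside down UI ships to everyone
        value += UI_EFFECT
    return value

pre = [daily_engagement(d, upside_down=False) for d in range(1, 16)]
post = [daily_engagement(d, upside_down=True) for d in range(16, 31)]
lift = sum(post) / len(post) - sum(pre) / len(pre)
# The measured "lift" blends the title launch (+0.4) with the UI change (-0.1):
# the pre/post design reports a gain even though the UI change itself is harmful.
print(f"measured pre/post lift: {lift:.2f} hours")
```

No amount of staring at the pre/post data can separate the two effects, because both changed at the same moment; that separation is exactly what the randomized control group provides.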
Suppose that, instead, we run an A/B test of the upside down box art experience, where one group of members receives the current product (“A”) and the other receives the upside down product (“B”) over the same month, and we collect the following data:
In this case, we are led to a different conclusion: the upside down product results in generally lower engagement (no surprise!), and both groups see an increase in engagement when the hit title is released.
A/B tests let us make causal statements. We introduced the upside down product experience only to Group B, and because members were randomly assigned to Groups A and B, everything else is held constant, on average, between the two groups. We can therefore conclude, with high probability (more details in a later post), that the upside down product caused the decrease in engagement.
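The phrase "with high probability" is where significance testing comes in. As a generic illustration (not Netflix's actual methodology), a two-proportion z-test compares an engagement rate between the two groups; the counts below are invented.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(successes_a: int, n_a: int,
                          successes_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided z-test for a difference in rates between control (A)
    and treatment (B). Returns (z statistic, p-value)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: fewer members in Group B (upside down) engaged.
z, p = two_proportion_z_test(successes_a=5600, n_a=10_000,
                             successes_b=5200, n_b=10_000)
print(f"z = {z:.2f}, p-value = {p:.2g}")
```

A negative z statistic with a tiny p-value says the drop in Group B is very unlikely to be random-assignment noise, which, combined with randomization holding everything else constant, supports the causal claim.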
This hypothetical example is extreme, but the broader lesson is that there is always something we cannot control for. If we roll out an experience to everyone and measure only one metric before and after the change, there may be relevant differences between the two periods that prevent us from drawing a causal conclusion. Maybe it is a new title that becomes a hit. Maybe it is a new product partnership that unlocks Netflix for more users to enjoy. There is always something we do not know about. Running A/B tests, wherever possible, lets us establish causality and change the product with confidence that our members have, through their actions, voted for the change.
It all starts with an idea
An A/B test starts with an idea: some change we could make to the UI, to the personalization systems that help members find content, to the signup flow for new members, or to any other part of the Netflix experience that we believe will produce a positive result for our members. Some of the ideas we test are incremental innovations, such as ways to improve the text copy that appears in the Netflix product; others are more ambitious, such as the test that led to the “Top 10” list that Netflix now shows in the UI.
Like all the innovations we release to Netflix members around the world, the Top 10 list started as an idea that was turned into a testable hypothesis. Here, the core idea was that surfacing the titles that are popular in each country would benefit our members in two ways. First, by communicating what is popular, we could help members connect with one another through conversations about trending titles, creating a sense of shared experience. Second, by fulfilling the underlying human desire to be part of a shared conversation, we could help members choose great content to watch.