
Interpreting A/B Test Results: False Positives and Statistical Significance
October 2021, by Netflix Technology Blog


How do we use the p-value to determine whether there is statistically significant evidence that the coin is unfair, or that our new product experience improves member satisfaction? It comes back to the 5% false positive rate that we agreed to accept at the outset: we conclude there is a statistically significant effect if the p-value is less than 0.05. This formalizes the intuition that we should reject the null hypothesis that the coin is fair if our result would be unlikely to occur under the assumption of a fair coin. In the example of observing 55 heads in 100 coin flips, we calculated a p-value of about 0.32. Because the p-value is greater than our 0.05 significance level, we conclude that there is not statistically significant evidence that the coin is unfair.
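To make that 0.32 concrete, here is a minimal sketch in Python (using scipy, which the post itself does not reference) of a two-sided test of 55 heads in 100 flips against a fair-coin null, using the normal approximation to the binomial; an exact binomial test would give a slightly larger value, around 0.37.

```python
from scipy.stats import norm

n, heads, p_null = 100, 55, 0.5

# Under the null, the number of heads is Binomial(n, 0.5):
# mean n*p and standard deviation sqrt(n*p*(1-p)).
mean = n * p_null
sd = (n * p_null * (1 - p_null)) ** 0.5  # = 5 for n = 100

# Two-sided p-value: probability of a result at least as extreme
# as 55 heads, in either direction, under the fair-coin null.
z = (heads - mean) / sd        # = 1.0
p_value = 2 * norm.sf(abs(z))  # ~0.32
print(p_value)
```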

There are two possible conclusions we can draw from an experiment or A/B test: either we conclude there is an effect ("the coin is unfair", "the Top 10 feature increases member satisfaction"), or we conclude there is insufficient evidence to say there is an effect ("we cannot conclude the coin is unfair"). It's much like a jury trial, where the two possible outcomes are "guilty" or "not guilty", and "not guilty" is very different from "innocent". Likewise, this (frequentist) approach to A/B testing does not allow us to conclude that there is no effect: we never conclude the coin is fair, or that the new product feature has no impact on our members. We simply conclude that we have not gathered enough evidence to reject the null hypothesis that there is no difference. In the coin example above, we observed 55% heads in 100 flips and concluded we did not have sufficient evidence to call the coin unfair. Critically, we did not conclude that the coin was fair; after all, if we gathered more evidence, say by flipping the same coin 1000 times, we might find sufficiently compelling evidence to reject the null hypothesis of fairness.
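To see why more evidence can change the conclusion, here is a small sketch (same hypothetical setup and normal approximation as above) showing that the same 55% heads rate, observed over 1000 flips instead of 100, does fall below the 0.05 threshold.

```python
from scipy.stats import norm

def two_sided_p(heads, n, p_null=0.5):
    """Two-sided p-value under the fair-coin null (normal approximation)."""
    sd = (n * p_null * (1 - p_null)) ** 0.5
    z = (heads - n * p_null) / sd
    return 2 * norm.sf(abs(z))

print(two_sided_p(55, 100))    # ~0.32: not significant at the 0.05 level
print(two_sided_p(550, 1000))  # ~0.002: reject the null of a fair coin
```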

There are two more concepts in A/B testing that are closely related to p-values: the rejection region for an experiment, and the confidence interval for an observation. We cover both in this section, building on the coin example from above.

Rejection regions. Another way to build a decision rule for a test is in terms of a "rejection region": the set of values for which we would conclude that the coin is unfair. To calculate the rejection region, we once again assume the null hypothesis is true (the coin is fair), and then define the rejection region as the set of least likely outcomes, with total probability no greater than 0.05. The rejection region consists of the outcomes that are the most extreme if the null hypothesis is true: the outcomes where the evidence against the null hypothesis is strongest. If an observation falls in the rejection region, we conclude that there is statistically significant evidence that the coin is not fair, and we reject the null hypothesis. In the case of the simple coin experiment, the rejection region corresponds to observing fewer than 40% or more than 60% heads (shown in Figure 3 with blue shaded bars). We call the boundaries of the rejection region, here 40% and 60% heads, the critical values of the test.
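As an illustration, here is a hedged sketch of how those 40% and 60% critical values could be computed for 100 flips from the exact binomial distribution; because the binomial is discrete, the tail probabilities don't split into exactly 2.5% each, but the cut-offs land at the values quoted above.

```python
from scipy.stats import binom

n, p_null, alpha = 100, 0.5, 0.05

# Split the 5% false positive budget across the two tails and find
# the most extreme outcomes under the fair-coin null.
lower = binom.ppf(alpha / 2, n, p_null)      # 40
upper = binom.ppf(1 - alpha / 2, n, p_null)  # 60

# Rejection region: fewer than 40 heads or more than 60 heads.
print(f"Reject the null if heads < {lower:.0f} or heads > {upper:.0f}")
```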

There is a duality between the rejection region and the p-value, and both lead to the same decision: the p-value is less than 0.05 if and only if the observation lies in the rejection region.

Confidence intervals. So far, we have built the decision rule by first positing the null hypothesis, which is always a statement of no change or equivalence ("the coin is fair" or "the product innovation does not affect member satisfaction"). We then define the possible outcomes under this null hypothesis and compare our observation to that distribution. To understand confidence intervals, it helps to flip the problem around and focus on the observation. We then run a thought exercise: given the observation, which values of the null hypothesis would lead us to decide not to reject, assuming we've fixed a 5% false positive rate? For our coin flipping example, the observation is 55% heads in 100 flips, and we do not reject the null of a fair coin. Nor would we reject the null hypothesis that the probability of heads is 47.5%, 50%, or 60%. There is a whole range of values for which we would not reject the null, from about 45% to 65% heads (Figure 4).

This range of values is a confidence interval: the set of values of the null hypothesis that we would not reject, given the data from the test. Because we mapped out the interval using tests at the 5% significance level, we've created a 95% confidence interval. The interpretation is that, under repeated experiments, the confidence intervals will cover the true value (here, the actual probability of heads) 95% of the time.
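Here is a minimal sketch of that thought exercise in code, assuming scipy's exact binomial test (whose interval edges may differ slightly from the figure's roughly 45%-to-65% range): sweep over candidate null values and keep the ones the test would not reject.

```python
import numpy as np
from scipy.stats import binomtest

heads, n, alpha = 55, 100, 0.05

# For each candidate "true" probability of heads, test the observation
# against that null; the values we fail to reject form the interval.
candidates = np.arange(0.30, 0.81, 0.005)
kept = [p0 for p0 in candidates
        if binomtest(heads, n, p0).pvalue >= alpha]

print(f"~{min(kept):.3f} to {max(kept):.3f}")  # roughly 0.45 to 0.65
```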

As with the rejection region, there is a duality between the confidence interval and the p-value, and both lead to the same decision: the 95% confidence interval does not cover the null value if and only if the p-value is less than 0.05, and in both cases we reject the null hypothesis of no effect.

Figure 4: Constructing a confidence interval by mapping out the set of values that, when used to define a null hypothesis, would not be rejected for the given observation.

Using the thought exercise of flipping coins, we've built up intuition about false positives, statistical significance and p-values, rejection regions, confidence intervals, and the two decisions we can make based on the data from a test. These core concepts and intuitions map directly to comparing a treatment and a control experience in an A/B test. We define a null hypothesis of no difference: the "B" experience does not alter member satisfaction. We then run the same thought exercise: what are the possible outcomes, and their associated probabilities, for the difference in metric values between the treatment and control groups, assuming there is no difference in member satisfaction? We can then compare the observation from the experiment to this distribution, just as in the coin example, calculate a p-value, and draw a conclusion about the test. And just as in the coin example, we can define rejection regions and calculate confidence intervals.
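To show how the same machinery carries over, here is a hedged sketch of an A/B comparison on a made-up binary satisfaction metric. All counts below are hypothetical, and the pooled two-proportion z-test is one standard choice for this comparison, not necessarily the test Netflix uses.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical data: members in each group who hit a satisfaction proxy.
control_hits, control_n = 5_000, 10_000      # made-up numbers
treatment_hits, treatment_n = 5_150, 10_000  # made-up numbers

# Null hypothesis: no difference between treatment and control,
# so we pool the two groups to estimate the common rate.
p_pool = (control_hits + treatment_hits) / (control_n + treatment_n)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / treatment_n))

diff = treatment_hits / treatment_n - control_hits / control_n
z = diff / se
p_value = 2 * norm.sf(abs(z))

# 95% confidence interval for the difference (unpooled standard error).
p_c, p_t = control_hits / control_n, treatment_hits / treatment_n
se_diff = np.sqrt(p_c * (1 - p_c) / control_n + p_t * (1 - p_t) / treatment_n)
ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)

print(f"p-value: {p_value:.3f}, 95% CI for the difference: {ci}")
```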

But false positives are only one of the two types of mistakes we can make when acting on test results. In the next post in this series, we'll cover the other type of error, false negatives, and the closely related concept of statistical power. Follow the Netflix Tech Blog to stay up to date.


