Written by Minal Mishra
The quality of a client application is of paramount importance to digital products around the world, as it is the primary way customers interact with a brand. At Netflix, we invest significantly in ensuring that new versions of our applications are well tested. However, Netflix is available for streaming on thousands of types of devices and is powered by hundreds of microservices that are deployed independently, which makes extensive internal testing extremely challenging. It therefore becomes important to supplement our release decisions with strong evidence gathered from the field during the update process.
Our team was formed to mine health signals from the field so that new versions of client applications can be evaluated quickly. Our investment in systems that enable this approach has accelerated the pace of development, which in turn has led to improved development practices and application quality. The goal of this blog post is to highlight the areas of investment behind this vision and the challenges we face today.
We are all familiar with the benefits of releasing small changes frequently. Doing so helps strike a healthy balance between speed and quality. The challenge for client applications is that every instance runs on a Netflix member's device, and the signals are derived from a firehose of events sent by devices around the world. Depending on the type of client, we need to determine the right strategy for sampling consumer devices and provide a system that lets different client engineering teams look for their own signals. For example, the sampling strategy for a mobile application differs from that for a smart TV. In contrast, a server application runs on servers that are generally identical, and a routing abstraction can be used to serve sampled traffic to the new version. The signal used to evaluate a new version then comes from a few thousand homogeneous servers rather than from millions of heterogeneous devices.
A widely accepted strategy for client applications is to introduce new versions of software gradually instead of releasing them to all users at once. This is known as a staged or phased rollout. The approach has two main advantages.
- First, if something fails catastrophically, the release can be paused for triage, which limits the number of affected customers.
- Second, backend services and infrastructure can be scaled intelligently as adoption ramps up.
This chart represents a counter metric that shows version adoption over the course of a staged rollout. The percentage of devices that have switched to version N+1 increases gradually. In the past, client engineering teams would visually monitor their metric dashboards during this period to evaluate signals as more and more consumers moved to the newer version of their application.
The chart here shows the client-side error rate over the same period as the version migration. We observe that the metric for the new version N+1 stabilizes as the rollout ramps up and approaches 100%, whereas the metric for the current version N becomes noisy over the same timeframe. Trying to compare any metric during this period can be a futile endeavor: in this case there was no customer-impacting shift in the error rate, yet we cannot tell that from the chart. Typically, teams time-shift one metric over the other to visually detect deviations, but time can still be a confounder. Staged rollouts have many advantages, but waiting for a new version to reach a critical level of adoption carries a significant opportunity cost.
So we brought the science of controlled testing, which has long been used to evaluate features, into this decision framework. The main goal of A/B testing is to design a robust experiment that yields repeatable results and enables us to make sound decisions about whether or not to launch a product feature (read more about A/B testing on Netflix here). For application updates, we apply an extreme version of the A/B test: we test the entire application. The new version may include user-facing features that are designed to be A/B tested and sit behind a feature flag. Most of the time, however, what ships in the application is new obvious improvements, simple bug fixes, performance enhancements, productization of previous A/B test results, logging, and so on. If we apply the A/B test methodology (or client canaries, as we prefer to call them to differentiate them from traditional feature-based A/B tests), the allocation looks the same for both versions at any point in time.
This chart shows the allocation to the new and baseline versions increasing over time. Most users are already on the baseline version, so we randomly "allocate" a fraction of those users to serve as the control group. This ensures that there is no sample mismatch between the treatment and control groups. It then becomes easy to compare client-side error rates for both versions visually, and even to apply statistical inference to change the conversation from "we think" there is a shift in the metric to "we know".
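The equal-sized allocation described above can be illustrated with a minimal sketch. It assumes a deterministic hash of a device identifier is used to place devices in treatment, control, or neither. The function name `allocate`, the `canary_id` salt, and the `canary_fraction` parameter are hypothetical and are not taken from Netflix's allocation service.

```python
import hashlib
from collections import Counter


def allocate(device_id: str, canary_id: str, canary_fraction: float) -> str:
    """Return 'treatment', 'control', or 'unallocated' for a device.

    canary_fraction is the total share of devices taking part in the canary.
    It is split evenly so that both groups grow at the same rate.
    All names here are illustrative assumptions.
    """
    digest = hashlib.sha256(f"{canary_id}:{device_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a stable value in [0, 1).
    u = int(digest[:8], 16) / 0x100000000
    if u < canary_fraction / 2:
        return "treatment"  # receives version N+1
    if u < canary_fraction:
        return "control"    # stays on version N, but is tracked as control
    return "unallocated"


devices = [f"device-{i}" for i in range(20_000)]
counts = Counter(allocate(d, "canary-example", 0.10) for d in devices)
print(counts)
```

Because the same hash decides both groups, treatment and control are sampled from the same population over the same time window, which is what removes the sample mismatch seen in a staged rollout.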
There are, however, differences between feature A/B tests at Netflix and the incremental product changes evaluated by client canaries. The main differences are a shorter runtime, multiple executions of the analysis (sometimes concurrent with allocation), and the use of data to support the null hypothesis. The runtime is predetermined, which in effect serves as the stopping rule for client canaries. Unlike feature A/B tests at Netflix, we limit evidence collection to a few hours so that we can release updates within a working day. We analyze the metrics repeatedly while evidence is still being gathered, so that severe regressions are caught quickly rather than only after all the evidence is in.
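As an illustration of comparing an error rate between control and treatment, here is a minimal sketch of a two-sided, two-proportion z-test using only the Python standard library. It is a textbook test chosen for illustration, not necessarily the one used in production, and the function name and the example counts are made up.

```python
import math


def two_proportion_z_test(errors_control: int, n_control: int,
                          errors_treatment: int, n_treatment: int):
    """Return (z, two-sided p-value) for a difference in error rates.

    The null hypothesis is that both versions have the same error rate.
    """
    p_control = errors_control / n_control
    p_treatment = errors_treatment / n_treatment
    pooled = (errors_control + errors_treatment) / (n_control + n_treatment)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so nothing to distinguish
    z = (p_treatment - p_control) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: sessions with at least one error, per group.
z, p = two_proportion_z_test(errors_control=100, n_control=10_000,
                             errors_treatment=200, n_treatment=10_000)
print(f"z = {z:.2f}, p = {p:.2g}")
```

This is a fixed-horizon test: its false positive guarantee holds only when it is evaluated once, at a sample size chosen in advance.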
Any A/B test can be divided into three main steps: allocation, metric collection and analysis. We use orchestration to connect these steps and to manage client applications through the A/B test life cycle, which reduces the cognitive burden of deploying them frequently.
Once a new application is packaged, tested and published, allocation is the first step. Since time is of the essence here, we rely on dynamic allocation and allocate devices that come to the service during the canary period based on predefined rules. For this purpose we use the same allocation service that is used for all experimentation at Netflix.
However, for applications that sit behind an external app store (mobile apps, for example), we only have access to the staged rollout solutions that the app stores provide. We can control the percentage of users who receive the updated app, and that percentage can increase over time. To mimic the client canary solution, we created a synthetic allocation service that samples devices after the app update has been installed. The service tries to allocate to the control group a device whose profile matches that of a device seen in the treatment group, which was allocated by the app store's staged rollout solution. This ensures that we control for the key variables that are likely to affect the analysis.
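One way to picture this profile matching is the sketch below. It assumes each device is described by a small dictionary and that a "profile" is the tuple of device model and OS version. Both the fields and the `match_control` function are hypothetical, since the post does not say which variables the synthetic allocation service actually matches on.

```python
import random
from collections import Counter, defaultdict


def match_control(treatment, baseline_pool, key, seed=0):
    """Pick one baseline device per treatment device with the same profile.

    Devices are drawn without replacement. Treatment devices with no
    remaining match in the pool are skipped.
    """
    rng = random.Random(seed)
    pool_by_profile = defaultdict(list)
    for device in baseline_pool:
        pool_by_profile[key(device)].append(device)
    for candidates in pool_by_profile.values():
        rng.shuffle(candidates)  # random choice among equally good matches

    control = []
    for device in treatment:
        candidates = pool_by_profile.get(key(device))
        if candidates:
            control.append(candidates.pop())
    return control


def profile(device):
    # Assumed matching key; the real variables are not given in the post.
    return (device["model"], device["os"])


treatment = [
    {"id": "t1", "model": "phone-a", "os": "14"},
    {"id": "t2", "model": "phone-a", "os": "14"},
    {"id": "t3", "model": "phone-b", "os": "13"},
]
baseline_pool = [
    {"id": f"b{i}", "model": model, "os": os_version}
    for i, (model, os_version) in enumerate(
        [("phone-a", "14")] * 5 + [("phone-b", "13")] * 5 + [("phone-c", "12")] * 5
    )
]
control = match_control(treatment, baseline_pool, key=profile)
print(Counter(profile(d) for d in control))
```

Matching on profile keeps the mix of device types the same in both groups, so a difference in a metric is less likely to be caused by one group simply containing different hardware.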
Metrics are a fundamental component of client canaries and A/B tests because they give us the insights we need to make decisions. In our case, the metrics must be computed in real time from millions of user events sent to our service. At Netflix's scale, we need to process the event streams on a scalable platform such as Mantis and store the time-series data in Apache Druid. To be economical with the time-series data, we keep metrics for a sliding window of a few weeks and roll them up to 1-minute granularity.
Another challenge is enabling client application engineers to contribute metrics and dimensions, since they are the ones who know what could be a valuable insight. To do this, our real-time metric data pipeline provides the right abstractions to hide the complexity of a distributed stream processing system, and it also allows these contributions to be reused in the offline computation for feature A/B test evaluation. Both benefits give client engineers additional incentive to contribute. In addition, this brings us closer to consistent metric definitions across the real-time and offline systems.
Since we accept contributions, we need proper checks in place to ensure that the data pipeline is reliable and robust. Changes in user events, stream processing jobs or even the platform can affect metrics, so it is essential that we actively monitor the data pipelines and ingestion.
Historically, we have relied on conventional statistical tests built into Kayenta to detect metric shifts when releasing new versions of applications. This has served us well over the last few years, but at Netflix we always want to improve. Here are some reasons to explore alternative solutions:
- Under the hood, automated canary analysis (ACA) uses a fixed-horizon statistical hypothesis test, which is subject to the peeking problem because the analysis is performed repeatedly during the canary period. Without a correction, this can void our false positive guarantees, and the correction itself is a function of the number of peeks, which is not known in advance. This often leads to more false positives in the results.
- Because canaries run for a limited time, rare-event metrics such as crashes are often missing from the control or treatment group and therefore may not be evaluated.
- Our intuition suggests that any form of metric compression, such as aggregating to 1-minute granularity, leads to a loss of statistical power in the analysis. The tradeoff is that we need more time to detect metric shifts with confidence.
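The peeking problem in the first point can be demonstrated with a small simulation. The sketch below runs an A/A comparison, in which there is no true difference, and rejects as soon as any peek crosses the critical value. It then repeats the exercise with a Bonferroni correction, used here only as one simple example of a correction that depends on the number of peeks. The setup (standard normal batch differences, known variance) and the function names are simplifying assumptions, not a description of how Kayenta works.

```python
import math
import random
from statistics import NormalDist


def false_positive_rate(num_peeks, z_crit, num_trials=2000, seed=7):
    """Share of A/A trials that are rejected at any of num_peeks looks."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(num_trials):
        cumulative = 0.0
        for k in range(1, num_peeks + 1):
            # Standardized difference contributed by one more batch of data
            # under the null hypothesis (no real difference between groups).
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / math.sqrt(k)
            if abs(z) > z_crit:
                rejections += 1
                break  # the canary would have been flagged at this peek
    return rejections / num_trials


def bonferroni_z_crit(alpha, num_peeks):
    """Two-sided critical value when alpha is split across num_peeks looks."""
    return NormalDist().inv_cdf(1 - alpha / (2 * num_peeks))


alpha = 0.05
single_look = NormalDist().inv_cdf(1 - alpha / 2)
print("1 peek, uncorrected:  ", false_positive_rate(1, single_look))
print("20 peeks, uncorrected:", false_positive_rate(20, single_look))
print("20 peeks, Bonferroni: ", false_positive_rate(20, bonferroni_z_crit(alpha, 20)))
```

The corrected critical value needs `num_peeks` as an input, which illustrates why a correction is hard to apply when the number of analyses is not known in advance.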
We are actively working on a promising solution to address some of these limitations and look forward to sharing more in the future.
Today, most of Netflix's client applications use the client canary model to update continuously. We have seen a significant increase in adoption of this method over the last 4 years, as shown by the growth in client canary counts in this graph.
Time constraints, speed and quality requirements have created a number of challenges in the domain of frequent client application updates that our team at Netflix aims to address. We described some of the metric-related challenges in a previous post, "How Netflix uses Druid for real-time insights to ensure a high quality experience". We intend to share more in future deep dives into the challenges and solutions in the allocation, analysis and orchestration space.