By Christos G. Bumpis, Chao Chen, Anush K. Moorthy and Ji Lee
Measuring video quality at scale is an essential component of the Netflix streaming pipeline. Perceptual quality measurements are used to drive video encoding optimizations, compare video codecs, run A/B tests, and optimize streaming QoE decisions. In particular, the VMAF metric is at the core of improving streaming video quality for Netflix members. It has become a de facto standard for perceptual quality measurement within Netflix and, thanks to its open-source nature, across the video industry.
As VMAF evolves and is integrated into more encoding and streaming workflows within Netflix, we need scalable ways to foster video quality innovation. For example, when we design a new version of VMAF, we need to roll it out effectively across the entire Netflix catalog of movies and TV shows. This article explains how we designed microservices and workflows on top of the Cosmos platform to support such video quality innovation.
Until recently, video quality measurements were generated as part of our Reloaded production system. This system processes incoming media files, such as video, audio and subtitles, and makes them playable on the streaming service. Reloaded is a mature and scalable system, but its monolithic architecture can slow down rapid innovation. More importantly, within Reloaded, video quality is measured together with video encoding. This tight coupling means that the following cannot be achieved without re-encoding:
A) Rolling out new video quality algorithms
B) Maintaining the data quality of our catalog (for example, through bug fixes)
Re-encoding the entire catalog to generate updated quality scores is extremely costly and therefore infeasible. Such coupling problems abound in our Reloaded architecture, so the Media Cloud Engineering and Encoding Technologies teams have been working together on a solution that addresses many of the concerns with our previous architecture. We call this system Cosmos.
Cosmos is a computing platform for workflow-driven, media-centric microservices. Cosmos offers several benefits highlighted in the linked blog post, such as separation of concerns, independent deployments, observability, rapid prototyping and productization. Here, we describe how we architected the video quality service on Cosmos and how we managed the migration of video quality computations from Reloaded to Cosmos while running a production system.
In Cosmos, all video quality computations are performed by an independent microservice called the Video Quality Service (VQS). VQS takes two videos as input, a source and its derivative, and returns the measured perceptual quality of the derivative. The measured quality can be a single value, when only a single metric's output is needed (e.g., VMAF), or multiple perceptual quality scores, when the request asks for them (e.g., VMAF and SSIM).
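To make the input and output shape concrete, here is a minimal sketch of what a request and result could look like. The class names, field names and validation logic are assumptions made for illustration; they are not the actual VQS data models.

```python
from dataclasses import dataclass, field

# Metrics named in this article; the real service may support others.
SUPPORTED_METRICS = {"VMAF", "PSNR", "SSIM"}


@dataclass(frozen=True)
class QualityRequest:
    """Hypothetical request: a source, its derivative, and the metrics wanted."""
    source: str                      # illustrative identifier of the source video
    derivative: str                  # illustrative identifier of the derivative
    metrics: tuple = ("VMAF",)       # a single metric by default

    def __post_init__(self):
        if not self.metrics:
            raise ValueError("at least one metric is required")
        unknown = set(self.metrics) - SUPPORTED_METRICS
        if unknown:
            raise ValueError(f"unsupported metrics: {sorted(unknown)}")


@dataclass
class QualityResult:
    """Hypothetical result: one score slot per requested metric."""
    derivative: str
    scores: dict = field(default_factory=dict)   # metric name -> score


def empty_result(request):
    """Build a result skeleton with one unfilled slot per requested metric."""
    return QualityResult(request.derivative, {m: None for m in request.metrics})
```

A request for a single metric yields a result with one score, while a request for `("VMAF", "SSIM")` yields one slot per metric, mirroring the single-value and multi-value cases described above.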
VQS, like most Cosmos services, consists of three domain-specific and scale-agnostic layers. Each layer is built on top of a corresponding scale-aware Cosmos subsystem. There is an external-facing API layer (Optimus), a rule-based video quality workflow layer (Plato) and a serverless compute layer (Stratum). Inter-layer communication is based on our internally developed and maintained Timestone queuing system. The figure below shows each layer and the corresponding Cosmos subsystem in parentheses.
- The VQS API layer exposes two endpoints: one to request a quality measurement (measureQuality) and one to retrieve quality results asynchronously (getQuality).
- The VQS workflow layer consists of rules that determine how to measure video quality. Similar to chunk-based encoding, the VQS workflow consists of chunk-based quality calculations followed by an assembly step. This enables us to use our scale to increase throughput and reduce latency. The chunk-based quality step calculates the quality for each chunk, and the assembly step combines the results of all the quality calculations. For example, if we have two chunks with two and three frames and VMAF scores of [50, 60] and [80, 70, 90], respectively, the assembly step combines the scores into [50, 60, 80, 70, 90]. The chunking rule calls the chunk-based quality calculation function in Stratum (see below) for all the chunks of the video, and the assembly rule calls the assembly function.
- The VQS Stratum layer consists of two functions, which perform the chunk-based quality calculation and the assembly.
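The chunk-then-assemble idea in the list above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, fixed-size chunking is an assumption (the article does not specify a chunking rule), and the mean is just one plausible way to pool per-frame scores into an aggregate.

```python
def plan_chunks(num_frames, chunk_size):
    """Split frames [0, num_frames) into consecutive (start, end) ranges.

    Fixed-size chunking is an assumption made for this sketch.
    """
    if num_frames < 0 or chunk_size <= 0:
        raise ValueError("num_frames must be >= 0 and chunk_size > 0")
    return [(start, min(start + chunk_size, num_frames))
            for start in range(0, num_frames, chunk_size)]


def assemble(chunk_scores):
    """Concatenate per-frame scores from each chunk, preserving chunk order."""
    return [score for chunk in chunk_scores for score in chunk]


def mean_score(frame_scores):
    """Pool per-frame scores into one aggregate value (arithmetic mean)."""
    if not frame_scores:
        raise ValueError("no frame scores to pool")
    return sum(frame_scores) / len(frame_scores)


# The example from the text: two chunks with two and three frames.
frame_level = assemble([[50, 60], [80, 70, 90]])
print(frame_level)              # [50, 60, 80, 70, 90]
print(mean_score(frame_level))  # 70.0
```

Because assembly only concatenates in chunk order, each chunk can be scored independently and in any order, which is what makes the approach parallelizable.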
The following trace graph from our observability portal, Nirvana, sheds more light on how VQS works. The request provides the source and the derivative whose quality is to be calculated and asks VQS to return quality scores using VMAF, PSNR and SSIM as quality metrics.
Here is a step-by-step description of the processes involved:
1. VQS is called using the measureQuality endpoint. The VQS API layer translates the external request into VQS-specific data models.
2. The workflow begins. Here, based on the length of the video, throughput and latency requirements, available scale, etc., the VQS workflow decides to split the quality calculation into two chunks and, therefore, creates two messages (one for each chunk) to be executed independently by the chunk-based quality calculation Stratum function. All three requested quality metrics will be calculated for each chunk.
3. The quality calculation for each chunk begins. The figure does not show the start time of each chunk separately; however, each chunk quality calculation starts and completes (annotated as 3a and 3b) independently, based on resource availability.
3b. When all the chunk quality calculations are complete, Plato initiates assembly.
4. Assembly begins, with a separate call to the assembler Stratum function for each metric. As before, the start time of each metric's assembly may vary. This separation of computation allows us to fail partially, return early, and scale independently depending on metric complexity.
4a and 4b. The assembly of two of the metrics (such as PSNR and SSIM) is complete.
4c and 5. The assembly of VMAF is complete, and thus the entire workflow is complete. The quality results are now available to the caller through the getQuality endpoint.
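The steps above can be sketched as a small fan-out and fan-in program. This is not how Plato or Stratum are implemented; it uses Python's standard thread pool as a stand-in for independently scheduled function calls. `run_workflow`, `assemble_metric` and the injected `score_chunk` callable are hypothetical names, and the real metric computation is deliberately left to the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def assemble_metric(metric, per_chunk_scores):
    """Concatenate one metric's per-frame scores across chunks, in chunk order."""
    return [s for chunk in per_chunk_scores for s in chunk[metric]]


def run_workflow(chunks, metrics, score_chunk, max_workers=4):
    """Score every chunk independently, then assemble each metric independently.

    score_chunk(chunk, metrics) is supplied by the caller and must return a
    dict mapping metric name -> list of per-frame scores for that chunk.
    Returns (results, errors) so that one metric can fail without losing
    the others.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Steps 2-3: one task per chunk; all requested metrics per chunk.
        # pool.map preserves chunk order even if tasks finish out of order.
        # An exception raised while scoring a chunk propagates from here.
        per_chunk = list(pool.map(lambda c: score_chunk(c, metrics), chunks))
        # Step 4: one assembly task per metric, started only after all
        # chunk computations have finished.
        futures = {m: pool.submit(assemble_metric, m, per_chunk)
                   for m in metrics}

    results, errors = {}, {}
    for metric, future in futures.items():
        try:
            results[metric] = future.result()
        except Exception as exc:  # partial failure: keep the other metrics
            errors[metric] = repr(exc)
    return results, errors
```

Assembling each metric in its own task is what lets a cheap metric finish and be returned before an expensive one, and lets a failure in one metric leave the others intact.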
The above is a simplified picture of the workflow. In reality, the actual design is extremely flexible and supports a variety of features, including computing different quality metrics, adaptive chunking strategies, producing quality at different temporal granularities (frame-level, segment-level and aggregate) and measuring quality for different use cases, such as different device types (like a phone), SDR, HDR and others.
Although VQS is a dedicated video quality microservice that addresses the coupling with video encoding mentioned above, there is another aspect that needs to be addressed. The entire Reloaded system is currently being migrated to Cosmos. This is a big, cross-team effort, which means that some applications are still in Reloaded while others have already moved to Cosmos. How can we take advantage of VQS when some applications that consume video quality measurements are still in Reloaded? In other words, how do we manage to live in both worlds?
To make this possible, we’ve created a number of “bridging” workflows that allow us to route video quality traffic from Reloaded to Cosmos. Each of these workflows also acts as a translator from Reloaded data models to the appropriate Cosmos-service data models. Meanwhile, Cosmos-only workflows can be integrated with VQS without the need for bridging. This allows us not only to operate in both worlds and provide existing video quality features, but also to roll out new features everywhere (for both Reloaded and Cosmos customer applications).
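As a rough illustration of the bridging idea, the sketch below translates a legacy-style job description into a request for the new service before forwarding it, while native callers skip the translation. Every field name here (`src_file`, `enc_file`, `quality_metrics`) and both function names are invented for this example; the actual Reloaded and Cosmos data models are not described in this article.

```python
def reloaded_to_vqs(legacy_job):
    """Translate a hypothetical Reloaded-style job dict into a VQS-style request."""
    return {
        "source": legacy_job["src_file"],
        "derivative": legacy_job["enc_file"],
        # Normalize metric names; default to VMAF when none are listed.
        "metrics": [m.upper() for m in legacy_job.get("quality_metrics", ["vmaf"])],
    }


def measure_quality(request, call_vqs, origin="cosmos"):
    """Route a request to VQS, bridging first if it comes from Reloaded.

    call_vqs is an injected callable standing in for the real service call.
    """
    if origin == "reloaded":
        request = reloaded_to_vqs(request)
    return call_vqs(request)
```

Keeping the translation in one adapter function means the service itself only ever sees one request shape, regardless of which system the caller lives in.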
To complete our design, we need to solve one last puzzle. While we have a way to call VQS, the VQS output is designed to avoid the centralized data modeling of Reloaded. For example, VQS relies on the Netflix Media Database (NMDB) for storing and indexing quality scores, while the Reloaded system uses a mix of non-queryable data models and files. To assist in our transition, we have introduced another Cosmos microservice: the Document Conversion Service (DCS). DCS is responsible for converting between Cosmos data models and Reloaded data models. Further, DCS interfaces with NMDB and is therefore able to convert from the data store to Reloaded file-based data and vice versa. DCS has several other endpoints that perform similar data conversions when needed, so that the Roman-riding described above can happen gracefully.
We have migrated almost all of our video quality computations from Reloaded to Cosmos. VQS currently represents the largest workload powered by the Cosmos platform. Video quality in Cosmos has matured, and we are investing in making VQS more flexible and efficient. In addition to supporting existing video quality features, all of our new video quality features have been developed in VQS. Stay tuned for more details on these algorithmic innovations.
This work has been made possible with the help of many stunning Netflix colleagues. We would like to thank George Yeh and Susanna Suredi for their contributions to the Reloaded-Cosmos bridge development, Ameya Bhasani and Frank San Miguel for their contributions to strengthening VQS at scale, and Susie Zia for her help with performance analysis. We also thank the Media Content Playback team, the Media Compute/Storage Infrastructure team, and the entire Cosmos platform team, who brought Cosmos to life and wholeheartedly supported our venture into Cosmos.
If you are interested in becoming a member of our team, we are hiring! Our current job postings can be found here: