Auto-Diagnosis and Remediation in the Netflix Data Platform By Netflix Technology Blog | January 2022

By Vikram Srivastava and Marcelo Mayworm

Netflix has one of the most complex data platforms in the cloud, on which our data scientists and engineers run batch and streaming workloads. As our subscriber base grows globally and Netflix enters the world of gaming, the number of batch workflows and real-time data pipelines is increasing rapidly. The data platform is built on top of several distributed systems, and due to the inherent nature of these systems, workloads are bound to experience failures at various stages. Troubleshooting these failures is not trivial: it requires collecting logs and metrics from several different systems and analyzing them to identify the root cause. At our scale, even a small percentage of failed workloads can create a substantial operational support burden for the data platform team when manual steps are involved in troubleshooting. And we cannot discount the productivity impact this has on data platform users.

This has motivated us to be proactive in detecting and remediating failed workloads in our production environments, avoiding the pitfalls that can slow down our teams. To address these concerns, we have been building an auto-diagnosis and remediation system called Pensive in the data platform. It aims to diagnose issues with failed and slow workloads and remediate them without human intervention wherever possible. As the platform keeps growing, with diverse situations and issues that can disrupt workloads, Pensive has to be proactive in detecting broad problems at the platform level in real time and in determining their impact across workloads.

The Pensive infrastructure consists of two separate systems to support batch and streaming workloads. This blog explores these two systems and how they perform auto-diagnosis and remediation across our big data platform and real-time infrastructure.

Batch Pensive Architecture

Batch workflows on the data platform are run using a scheduler service that launches containers on the Netflix container management platform to execute the workflow steps. These steps launch jobs on clusters running Apache Spark and Presto via Genie. If a workflow step fails, the scheduler asks Pensive to diagnose the step's error. Pensive collects logs for the failed jobs launched by the step from the relevant data platform components and then extracts the stack traces. Pensive relies on a regular-expression-based rule engine that has been curated over time. The rules encode information about whether an error is due to a platform issue or a user error, and whether the error is transient. If a rule's regular expression matches, Pensive returns that information about the error to the scheduler. If the error is transient, the scheduler retries the step a few more times with exponential backoff.
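A minimal sketch of such a regex-based rule engine and the retry loop might look like the following. The rules, pattern strings, and retry parameters here are illustrative assumptions, not Netflix's actual curated rule set:

```python
import re
import time
from dataclasses import dataclass

# Hypothetical rules in the spirit of Pensive's curated rule set.
# The patterns and labels below are illustrative, not the real rules.
@dataclass
class Rule:
    pattern: re.Pattern
    source: str        # "platform" or "user"
    transient: bool

RULES = [
    Rule(re.compile(r"Container killed .* exceeding memory limits"), "user", False),
    Rule(re.compile(r"Connection refused|NoRouteToHost"), "platform", True),
]

def diagnose(stack_trace: str):
    """Return the first rule matching the stack trace, or None if unclassified."""
    for rule in RULES:
        if rule.pattern.search(stack_trace):
            return rule
    return None

def run_with_retries(step, max_retries=3, base_delay=1.0):
    """Retry a step on transient errors with exponential backoff,
    mimicking what the scheduler does with Pensive's diagnosis."""
    for attempt in range(max_retries):
        try:
            return step()
        except Exception as exc:
            rule = diagnose(str(exc))
            if rule is None or not rule.transient:
                raise                                  # non-transient: fail the step
            time.sleep(base_delay * 2 ** attempt)      # exponential backoff
    return step()                                      # final attempt
```

In a real deployment the stack traces would come from the collected job logs rather than exception strings, but the classify-then-retry shape is the same.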

The most important part of Pensive is the set of rules used to classify errors. We need to evolve them as the platform evolves to keep the percentage of errors Pensive cannot classify low. Initially, rules were added on an ad-hoc basis in response to requests from platform component owners and users. We have since moved to a more systematic approach in which unknown errors are fed into a machine learning process that clusters them and proposes new regular expressions for commonly occurring errors. We take the suggestions to the relevant platform component owners, who then classify the source of the error and whether it is transient. In the future, we want to automate this step as well.
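As a rough illustration of that clustering step, one simple approach is to normalize away volatile tokens (numbers, hex IDs) so recurring errors collapse into one template, then turn frequent templates back into candidate regexes for human review. The normalization rules and frequency threshold here are our own illustrative choices, not the actual ML pipeline:

```python
import re
from collections import Counter

def normalize(message: str) -> str:
    """Collapse volatile tokens so similar error messages fall into one cluster."""
    message = re.sub(r"0x[0-9a-f]+", "<HEX>", message)
    message = re.sub(r"\d+", "<NUM>", message)
    return message

def suggest_patterns(unclassified: list[str], min_count: int = 2) -> list[str]:
    """Propose candidate regexes for frequently recurring unknown errors."""
    clusters = Counter(normalize(m) for m in unclassified)
    suggestions = []
    for template, count in clusters.most_common():
        if count < min_count:
            break  # most_common is sorted, so everything after is rarer
        # Turn the template back into a regex a component owner can review.
        pattern = (re.escape(template)
                   .replace("<NUM>", r"\d+")
                   .replace("<HEX>", r"0x[0-9a-f]+"))
        suggestions.append(pattern)
    return suggestions
```

Each suggested pattern would still need a component owner to label its source and transience before it joins the rule set.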

Platform-wide problem identification

Pensive classifies errors for individual workflow step failures, but by performing real-time analytics on the errors it detects, using Apache Kafka and Apache Druid, we can quickly identify platform issues affecting many workflows. As soon as individual diagnoses are stored in a Druid table, our monitoring and alerting system, called Atlas, aggregates them every minute and sends out alerts if there is a sudden increase in the number of failures due to platform errors. This has dramatically reduced the time it takes to detect issues in hardware or bugs in recently rolled out data platform software.
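The core of that per-minute aggregation can be sketched as follows. In production the aggregation runs over a Druid table and alerting is handled by Atlas; the event shape and threshold below are illustrative assumptions:

```python
from collections import Counter
from datetime import datetime

def platform_alerts(events, threshold=10):
    """Count platform-classified failures per minute and return the
    minutes whose failure count exceeds the alerting threshold.

    events: iterable of (timestamp, error_source) tuples, where
    error_source is "platform" or "user" per Pensive's diagnosis.
    """
    per_minute = Counter()
    for ts, source in events:
        if source == "platform":                 # user errors don't page the platform team
            per_minute[ts.replace(second=0, microsecond=0)] += 1
    return sorted(minute for minute, n in per_minute.items() if n > threshold)
```

A sudden cluster of platform-attributed failures in a single minute is exactly the signal that distinguishes a platform-wide incident from scattered user errors.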

Streaming Pensive Architecture

Apache Flink powers real-time stream processing jobs on the Netflix data platform. Most of these Flink jobs run under a managed platform called Keystone, which abstracts away the underlying Flink job details and lets users consume data from Apache Kafka streams and publish it to different data stores such as Elasticsearch and Apache Iceberg on AWS S3.

Since the data platform operates the Keystone pipelines, users expect platform issues to be detected and remediated by the Keystone team without any involvement on their end. Furthermore, because Kafka streams retain data only for a limited period, there is time pressure to resolve issues quickly to avoid data loss.

For each Flink job running as part of a Keystone pipeline, we monitor a metric indicating how far behind the Kafka producer the Flink consumer is. If it crosses a threshold, Atlas sends a notification to Streaming Pensive.
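A minimal sketch of that lag check, assuming per-partition producer and consumer offsets are available (in our setup these would come from Atlas metrics; the function and threshold names are hypothetical):

```python
def should_notify(consumer_offsets, producer_offsets, max_lag):
    """Return True if any partition's consumer lag exceeds max_lag.

    consumer_offsets / producer_offsets: dicts mapping partition id to
    the latest committed consumer offset and latest produced offset.
    """
    return any(
        producer_offsets[p] - consumer_offsets.get(p, 0) > max_lag
        for p in producer_offsets
    )
```

Checking every partition matters: a single hot or stuck partition can fall behind while the aggregate lag still looks healthy.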

Like its batch counterpart, Streaming Pensive also has a rule engine to diagnose errors. However, in addition to logs, Streaming Pensive has rules that check metric values for the multiple components of a Keystone pipeline. The issue may occur in the source Kafka stream, the main Flink job, or the sink to which the Flink job is writing data. Streaming Pensive diagnoses the issue and tries to remediate it automatically when it happens. Here are some examples of what we can do automatically:

  • If Streaming Pensive finds that one or more Flink TaskManagers are running out of memory, it can redeploy the Flink cluster with more TaskManagers.
  • If Streaming Pensive finds an unexpected increase in the rate of incoming messages on the source Kafka cluster, it can increase the topic retention size and duration so that we do not lose any data while the consumer is lagging. If the spike goes away after some time, Streaming Pensive can revert the retention changes. Otherwise, it pages the job owner to investigate whether there is a bug causing the increased rate or whether the consumers need to be reconfigured to handle the higher rate.
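The remediations above can be thought of as a dispatch from a diagnosis to an automated action, with paging as the fallback. The diagnosis names and action strings below are hypothetical placeholders, not Streaming Pensive's actual interface:

```python
# Illustrative dispatch table; diagnosis keys and actions are made up
# to mirror the two remediation examples described above.
REMEDIATIONS = {
    "taskmanager_oom": "redeploy Flink cluster with more TaskManagers",
    "kafka_ingest_spike": "increase Kafka topic retention size and duration",
}

def remediate(diagnosis: str) -> str:
    """Apply a known automated fix, or escalate to the on-call team."""
    return REMEDIATIONS.get(diagnosis, "page the on-call team")
```

Keeping the mapping explicit makes it easy to audit which failure modes are handled automatically and which still escalate to a human.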

Although our success rate is very high, there are still cases where automation is not possible. If manual intervention is needed, Streaming Pensive pages the relevant component team to take timely action to resolve the issue.

Pensive has had a significant impact on the usability of the Netflix data platform. It helps engineering teams reduce their operational burden, freeing them up to tackle more complex and challenging problems. But our work is far from done; we have a long roadmap ahead of us. Some of the features and extensions we plan to work on are:

  • Batch Pensive currently diagnoses only failed jobs; we want to expand its scope to diagnose why jobs have slowed down.
  • Automatically tune the configuration of batch workflows so that they finish successfully, or finish faster and use fewer resources where possible. One example where this can help dramatically is Spark jobs, for which memory tuning is a significant challenge.
  • Expand Pensive with machine learning classifiers.
  • The streaming platform has recently added Data Mesh, and we need to expand Streaming Pensive to cover it.

This could not have been accomplished without the help of the Big Data Compute and Real-Time Data Infrastructure teams within the Netflix data platform. They have been great partners for us as we work to improve the Pensive infrastructure.
