Towards a Reliable Device Management Platform | By Netflix Technology Blog, August 2021


The RAE is effectively configured to be a router that the devices under test (DUTs) are connected to. On the RAE, there is a service called the Local Registry, which is responsible for detecting, onboarding, and maintaining information about all devices connected to the LAN side of the RAE. When a new hardware device is connected, the Local Registry detects it and collects information about it, such as networking information and the ESN. At periodic intervals, the Local Registry probes each device to check on its connection status. As the features and characteristics of a device change over time, these changes are saved into the Local Registry and simultaneously published upstream to the Device Management Platform's control plane. In addition to attribute changes, the Local Registry publishes a complete snapshot of the device record upstream at regular intervals as a form of state reconciliation. These checkpoint events enable faster state reconstruction by consumers of the data feed while guarding against missed updates.
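To make the two kinds of upstream events concrete, here is a minimal Scala sketch of what they might look like; the field and type names are illustrative assumptions, not the platform's actual schema:

```scala
import java.time.Instant

// Illustrative data model only; field names are assumptions, not Netflix's schema.
final case class DeviceRecord(
  esn: String,                     // Electronic Serial Number of the device
  ipAddress: String,               // networking info collected at onboarding
  raeSerial: String,               // serial number of the RAE the device is attached to
  attributes: Map[String, String], // device features and characteristics
  lastSeen: Instant                // updated by the periodic connectivity probe
)

sealed trait DeviceEvent { def deviceSessionId: String }

// Published upstream whenever one or more device attributes change.
final case class AttributesUpdated(
  deviceSessionId: String,
  changed: Map[String, String]
) extends DeviceEvent

// Published upstream at regular intervals as a state-reconciliation checkpoint;
// carries a full snapshot so consumers can rebuild state quickly without
// replaying every intermediate update.
final case class Checkpoint(
  deviceSessionId: String,
  snapshot: DeviceRecord
) extends DeviceEvent
```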

On the cloud side, a service called the Cloud Registry ingests the device information updates published by Local Registry instances, processes them, and subsequently pushes materialized data into a datastore backed by CockroachDB. CockroachDB was chosen as the backing datastore because it offers SQL capabilities and our data model for device records was normalized. In addition, unlike other SQL stores, CockroachDB is designed from the ground up to be horizontally scalable, which addresses our concerns about the Cloud Registry's ability to scale with the number of devices onboarded onto the Device Management Platform.

MQTT forms the basis of the control plane for the Device Management Platform. MQTT is an OASIS standard messaging protocol for the Internet of Things (IoT) and is designed as an extremely lightweight yet reliable publish/subscribe messaging transport with a small code footprint and minimal network bandwidth requirements, which makes it ideal for connecting remote devices. MQTT clients connect to the MQTT broker and send messages prefixed with a topic. The broker, in turn, is responsible for receiving all messages, filtering them, determining who is subscribed to which topic, and sending the messages to the subscribed clients accordingly. The key features that make MQTT appealing to us are its support for hierarchical topics, client authentication and authorization, per-topic ACLs, and bi-directional request/response message patterns, all of which are important for the business use cases of our control plane.
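As a minimal illustration of this publish/subscribe model, the following Scala sketch uses the Eclipse Paho MQTT client; the broker URL, credentials, and topic names are placeholders, not the platform's actual configuration:

```scala
import org.eclipse.paho.client.mqttv3.{IMqttMessageListener, MqttClient, MqttConnectOptions, MqttMessage}
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence

object MqttPubSubSketch extends App {
  // Placeholder broker URL and credentials.
  val client = new MqttClient("ssl://mqtt.example.com:8883", MqttClient.generateClientId(), new MemoryPersistence)
  val opts = new MqttConnectOptions()
  opts.setUserName("device-user")        // client authentication
  opts.setPassword("secret".toCharArray)
  client.connect(opts)

  // Subscribe: the broker filters messages and delivers only matching topics.
  // '+' is a single-level wildcard in MQTT topic filters.
  client.subscribe("rae/+/dut/updates", new IMqttMessageListener {
    override def messageArrived(topic: String, msg: MqttMessage): Unit =
      println(s"[$topic] ${new String(msg.getPayload)}")
  })

  // Publish: messages are sent to a topic; the broker routes them to subscribers.
  client.publish("rae/R1234567/dut/updates", new MqttMessage("""{"esn":"NFLX-001"}""".getBytes))
}
```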

Inside the control plane, device commands and device information updates are prefixed with a topic string that includes both the RAE serial number and the device_session_id, a UUID associated with a device session. Embedding these two pieces of information in the topic of every message allows us to enforce per-topic ACLs and control which RAEs and DUTs users can see and interact with, providing security and isolation against other users' devices.
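A sketch of how such topics might be constructed; the exact topic layout is an assumption for illustration, as the post does not spell it out:

```scala
import java.util.UUID

// Hypothetical topic layout; the platform's actual scheme is not public.
object DeviceTopics {
  // Embed the RAE serial number and the device session UUID in every topic,
  // so per-topic ACLs can scope visibility to a single user's RAE and DUTs.
  def updatesTopic(raeSerial: String, deviceSessionId: UUID): String =
    s"rae/$raeSerial/sessions/$deviceSessionId/updates"

  def commandsTopic(raeSerial: String, deviceSessionId: UUID): String =
    s"rae/$raeSerial/sessions/$deviceSessionId/commands"

  // An ACL pattern granting a user access only to their own RAE's topics.
  // '#' is MQTT's multi-level wildcard.
  def aclPattern(raeSerial: String): String = s"rae/$raeSerial/#"
}
```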

Since Kafka is a supported messaging platform at Netflix, a bridge is established between the two protocols so that cloud-side services can communicate with the control plane. Through the bridge, MQTT messages are converted directly into Kafka records, where the record key is set to the MQTT topic on which the message was received. Since device information updates published on MQTT contain the device_session_id in the topic, all device information updates for a given device session effectively land on the same Kafka partition, giving us a well-defined message order for our consumption.
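The bridge itself is internal to Netflix, but its key idea can be sketched as follows: keying each Kafka record by the MQTT topic (which embeds the device_session_id) makes all updates for a session hash to the same partition. Names and settings below are placeholders:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.{ByteArraySerializer, StringSerializer}

// Sketch of the bridge's core idea, not the actual internal implementation.
class MqttToKafkaBridge(bootstrapServers: String, kafkaTopic: String) {
  private val props = new Properties()
  props.put("bootstrap.servers", bootstrapServers)
  private val producer =
    new KafkaProducer[String, Array[Byte]](props, new StringSerializer, new ByteArraySerializer)

  // Called for every message arriving on the MQTT side.
  def onMqttMessage(mqttTopic: String, payload: Array[Byte]): Unit = {
    // Record key = MQTT topic (which embeds the device_session_id), so all
    // updates for a given device session land on the same Kafka partition.
    producer.send(new ProducerRecord[String, Array[Byte]](kafkaTopic, mqttTopic, payload))
  }
}
```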

In addition to carrying the regular messages between users and DUTs, the control plane is stress-tested at roughly hourly intervals: around 3,000 ephemeral MQTT clients are created to connect to the MQTT brokers and generate flash traffic. This acts as a canary test to verify that the brokers are online and able to handle the sudden influx of client connections and high message loads. As a result, the traffic load on the Device Management Platform's control plane is very dynamic over time.
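A rough sketch of what such a flash-traffic canary could look like; the client count matches the figure above, but the broker URL, topic, and structure are illustrative assumptions:

```scala
import org.eclipse.paho.client.mqttv3.{MqttClient, MqttMessage}
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence

// Minimal canary sketch; a real canary would open these connections
// concurrently to generate a genuine burst, and verify delivery.
object ControlPlaneCanary extends App {
  val brokerUrl = "ssl://mqtt.example.com:8883" // placeholder

  (1 to 3000).foreach { i =>
    val client = new MqttClient(brokerUrl, s"canary-$i", new MemoryPersistence)
    client.connect()
    client.publish("canary/ping", new MqttMessage(s"hello-$i".getBytes))
    client.disconnect() // clients are transient: connect, publish, tear down
  }
}
```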

At Netflix, we focus on building solutions that use paved-path tooling as much as possible (see posts here and here). In particular, the flavor of Spring Boot Native maintained by the Runtime team forms the basis of many web services developed within Netflix (the Cloud Registry included). The Netflix Spring package comes with all the integrations needed for applications to work seamlessly within the Netflix ecosystem. Of these, the Kafka integration is the most relevant for this blog post.

Given the system setup described above, we came up with a list of fundamental business requirements that the Cloud Registry's Kafka-based device updates processing solution must address.

Since the processing workload varies significantly over time, the solution must first and foremost scale with the message load by providing the back-pressure support defined in the Reactive Streams specification; in other words, the solution should be able to switch between push-based and pull-based flow depending on whether the downstream is able to keep up with the rate of message production.
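As a toy illustration of Reactive Streams back-pressure in Akka Streams (the toolkit underlying Alpakka-Kafka): in the sketch below, the effectively unbounded source is pulled only as fast as the throttled downstream signals demand, rather than pushing and overflowing a buffer. The names and rates are illustrative:

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.duration._

object BackPressureDemo extends App {
  implicit val system: ActorSystem = ActorSystem("demo")

  // Elements are demanded (pulled) by the downstream at 10 per second;
  // the source never produces faster than demand allows.
  Source(1 to Int.MaxValue)
    .throttle(elements = 10, per = 1.second)
    .runWith(Sink.foreach(println))
}
```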

Correct ingestion semantics for device information updates hinge on the manner in which the messages are produced. Since message ordering is guaranteed within a Kafka partition, and all updates from a given device session are assigned to the same partition, in-order processing of updates can be enforced per device as long as only one thread is allocated per partition. At the same time, events arriving on different partitions should be processed in parallel for maximum throughput.
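Alpakka-Kafka can express this requirement directly: committablePartitionedSource emits one sub-stream per assigned partition, and running each with mapAsync(1) preserves per-partition order while distinct partitions proceed in parallel. A hedged sketch, with placeholder settings, topic names, and a stubbed processing function:

```scala
import akka.actor.ActorSystem
import akka.kafka.{CommitterSettings, ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.{Committer, Consumer}
import akka.stream.scaladsl.Sink
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
import scala.concurrent.Future

object PartitionedProcessingSketch extends App {
  implicit val system: ActorSystem = ActorSystem("cloud-registry")
  import system.dispatcher

  val consumerSettings =
    ConsumerSettings(system, new StringDeserializer, new ByteArrayDeserializer)
      .withBootstrapServers("kafka.example.com:9092") // placeholder
      .withGroupId("cloud-registry")                  // placeholder

  def processUpdate(payload: Array[Byte]): Future[Unit] = Future.unit // stub

  Consumer
    // One inner Source per assigned Kafka partition.
    .committablePartitionedSource(consumerSettings, Subscriptions.topics("device-updates"))
    .mapAsyncUnordered(parallelism = 16) { case (_, partitionSource) =>
      partitionSource
        // mapAsync(1): exactly one in-flight message per partition,
        // preserving per-device-session ordering.
        .mapAsync(1)(msg => processUpdate(msg.record.value).map(_ => msg.committableOffset))
        .runWith(Committer.sink(CommitterSettings(system)))
    }
    .runWith(Sink.ignore)
}
```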

If the underlying KafkaConsumer crashes due to transient system or network events, it should be restarted automatically. If an exception is thrown while processing a message, the exception should be gracefully caught, and message consumption should continue unimpeded after the offending message is dropped.
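Akka Streams provides both behaviors out of the box: RestartSource recreates a failed source with exponential backoff, and a resuming supervision strategy skips elements whose processing throws. A sketch with a stand-in source and stubbed parsing:

```scala
import akka.actor.ActorSystem
import akka.stream.{ActorAttributes, RestartSettings, Supervision}
import akka.stream.scaladsl.{RestartSource, Sink, Source}
import scala.concurrent.duration._

object ResilientConsumptionSketch extends App {
  implicit val system: ActorSystem = ActorSystem("cloud-registry")

  // Stand-in for the real Kafka source; assume it can fail transiently.
  def kafkaSource: Source[String, _] = Source(List("a", "b", "c"))

  def parse(raw: String): String = raw // stub; may throw on a malformed message

  RestartSource
    // Recreate the source with exponential backoff whenever it fails,
    // e.g. because the underlying KafkaConsumer crashed.
    .onFailuresWithBackoff(RestartSettings(minBackoff = 1.second, maxBackoff = 30.seconds, randomFactor = 0.2)) { () =>
      kafkaSource
    }
    .map(parse)
    // Resume (drop the offending element) if processing throws.
    .withAttributes(ActorAttributes.supervisionStrategy(Supervision.resumingDecider))
    .runWith(Sink.foreach(println))
}
```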

When a service is redeployed, or its instance group is resized, the application must shut down gracefully. For example, processor shutdowns should be invocable from outside the Kafka consumption context so that the application can be stopped gracefully. In addition, since Kafka messages are generally pulled in batches by the KafkaConsumer, the implemented solution should, upon receiving the shutdown signal, drain and finish processing all the already-fetched messages in its internal queue before shutting down.
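Alpakka-Kafka's DrainingControl exists precisely for this: it combines the stream's Consumer.Control with its completion future, so a shutdown call stops polling, finishes the in-flight messages, commits their offsets, and then completes. A sketch with placeholder settings:

```scala
import akka.Done
import akka.actor.ActorSystem
import akka.kafka.{CommitterSettings, ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.{Committer, Consumer}
import akka.kafka.scaladsl.Consumer.DrainingControl
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object GracefulShutdownSketch extends App {
  implicit val system: ActorSystem = ActorSystem("cloud-registry")
  import system.dispatcher

  val settings = ConsumerSettings(system, new StringDeserializer, new ByteArrayDeserializer)
    .withBootstrapServers("kafka.example.com:9092") // placeholder
    .withGroupId("cloud-registry")                  // placeholder

  def process(payload: Array[Byte]): Future[Unit] = Future.unit // stub

  val control: DrainingControl[Done] =
    Consumer
      .committableSource(settings, Subscriptions.topics("device-updates"))
      .mapAsync(1)(msg => process(msg.record.value).map(_ => msg.committableOffset))
      .toMat(Committer.sink(CommitterSettings(system)))(DrainingControl.apply)
      .run()

  // Invoked from outside the Kafka consumption context (e.g., on SIGTERM):
  // stop polling, finish and commit the already-fetched messages, then stop.
  sys.addShutdownHook {
    Await.result(control.drainAndShutdown(), 30.seconds)
  }
}
```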

As mentioned earlier, Spring is widely employed as the paved-path solution for developing services at Netflix, and the Cloud Registry is a Spring Boot Native application. Thus, the implemented solution must integrate with the Netflix Spring facilities for authentication and metrics support – the former for access to the Kafka clusters and the latter for service monitoring and alerting. In addition, the lifecycle management of the implemented solution must be integrated into Spring's lifecycle management.
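For the lifecycle integration specifically, one way to tie a running stream into Spring's lifecycle is a SmartLifecycle bean. The following sketch is an assumption about the wiring, not the Cloud Registry's actual code; UpdatePipeline is a hypothetical handle to the materialized stream:

```scala
import akka.Done
import org.springframework.context.SmartLifecycle
import org.springframework.stereotype.Component
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

// Hypothetical handle to the materialized stream (e.g., wrapping a DrainingControl).
trait UpdatePipeline {
  def run(): Unit                      // materialize the processing stream
  def drainAndShutdown(): Future[Done] // drain in-flight messages, then stop
}

// The bean starts the stream with the application context and drains it when
// Spring shuts the context down, from outside the Kafka consumption context.
@Component
class DeviceUpdateProcessorLifecycle(pipeline: UpdatePipeline) extends SmartLifecycle {
  @volatile private var running = false

  override def start(): Unit = {
    pipeline.run()
    running = true
  }

  override def stop(): Unit = {
    Await.result(pipeline.drainAndShutdown(), 30.seconds)
    running = false
  }

  override def isRunning: Boolean = running
}
```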

The implemented solution must be friendly enough to support long-term maintenance. This means it must be at least unit- and functionally-testable for rapid and iterative feedback-driven development, and the code must be reasonably ergonomic to lower the learning curve for new maintainers.
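On testability: because Akka Streams stages are plain values, individual processing Flows can be unit-tested in isolation with the akka-stream-testkit probes. The stage under test below is illustrative:

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Source}
import akka.stream.testkit.scaladsl.TestSink

object FlowSpecSketch extends App {
  implicit val system: ActorSystem = ActorSystem("test")

  // Hypothetical processing stage under test.
  val parseFlow: Flow[String, String, _] = Flow[String].map(_.toUpperCase)

  // Drive the stage with known input and assert on its output.
  Source(List("a", "b"))
    .via(parseFlow)
    .runWith(TestSink.probe[String])
    .request(2)
    .expectNext("A", "B")
    .expectComplete()

  system.terminate()
}
```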

Many frameworks are available for integrating reliable stream processing into web services (for example, Kafka Streams, Spring KafkaListener, Project Reactor, Flink, and Alpakka-Kafka, to name a few). We chose Alpakka-Kafka as the basis of the Kafka processing solution for the following reasons.

  1. Alpakka-Kafka turned out to satisfy all of the system requirements we laid out, including the need for Netflix Spring integration. It provides advanced and fine-grained control over stream processing, including automatic back-pressure support and stream supervision.
  2. Compared with the alternatives that could meet all of our system requirements, Akka is a much more lightweight framework, and its integration into a Spring Boot application is relatively short and concise. In addition, Akka and Alpakka-Kafka code is much more terse than that of the other solutions, which reduces the learning curve for maintainers.
  3. The maintenance costs over time for an Alpakka-Kafka-based solution are much lower than for the other solutions, because both Akka and Alpakka-Kafka are mature ecosystems in terms of documentation and community support, having been around for at least 12 and 6 years, respectively.

The construction of the Alpakka-Kafka-based processing pipeline can be summarized in the sketch below:
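The original post presented this as a figure, which is not reproduced here. As a rough stand-in, the following hedged sketch combines the earlier pieces into one pipeline: a committable partitioned source for per-session ordering, a resuming supervision strategy for bad messages, and a DrainingControl for graceful shutdown. All names and settings are placeholders, not the Cloud Registry's actual code:

```scala
import akka.actor.ActorSystem
import akka.kafka.{CommitterSettings, ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.{Committer, Consumer}
import akka.kafka.scaladsl.Consumer.DrainingControl
import akka.stream.{ActorAttributes, Supervision}
import akka.stream.scaladsl.Sink
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object DeviceUpdatePipelineSketch extends App {
  implicit val system: ActorSystem = ActorSystem("cloud-registry")
  import system.dispatcher

  val consumerSettings =
    ConsumerSettings(system, new StringDeserializer, new ByteArrayDeserializer)
      .withBootstrapServers("kafka.example.com:9092") // placeholder
      .withGroupId("cloud-registry")                  // placeholder

  def processUpdate(payload: Array[Byte]): Future[Unit] = Future.unit // stub

  val control =
    Consumer
      // One sub-stream per partition: per-device-session ordering is preserved
      // by mapAsync(1), while distinct partitions are processed in parallel.
      .committablePartitionedSource(consumerSettings, Subscriptions.topics("device-updates"))
      .mapAsyncUnordered(parallelism = 16) { case (_, partitionSource) =>
        partitionSource
          .mapAsync(1)(msg => processUpdate(msg.record.value).map(_ => msg.committableOffset))
          // Drop messages whose processing throws, instead of failing the stream.
          .withAttributes(ActorAttributes.supervisionStrategy(Supervision.resumingDecider))
          .runWith(Committer.sink(CommitterSettings(system)))
      }
      .toMat(Sink.ignore)(DrainingControl.apply)
      .run()

  // Graceful shutdown: drain and commit in-flight messages before stopping.
  sys.addShutdownHook {
    Await.result(control.drainAndShutdown(), 30.seconds)
  }
}
```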


