Alok Tyagi, Hariharan Anantakrishnan, Ivan Porto Carrero and Kirti Laxminarayan
Netflix has developed a network observability sidecar called Flow Exporter that uses eBPF tracepoints to capture TCP flows in near real time. Using less than 1% of the CPU and memory on an instance, this highly performant sidecar provides flow data at scale for network insight.
The cloud network infrastructure that Netflix uses today consists of AWS services such as VPC, DirectConnect, VPC Peering, Transit Gateways, and NAT Gateways, along with devices owned by Netflix. The Netflix software infrastructure is a large distributed ecosystem made up of specialized functional tiers that run on AWS and Netflix-owned services. While we strive to keep the ecosystem simple, relying on such a variety of technologies inevitably leads to challenges such as:
- App dependency and data flow mapping: The number of microservices grows by the day. Without understanding of, and visibility into, an application's dependencies and data flows, it is difficult for both service owners and central teams to identify systemic problems.
- Path validation: Netflix's velocity of change across the production streaming and studio environments can leave services unable to communicate with other resources.
- Service segmentation: The ease of cloud deployment has led to the organic growth of multiple AWS accounts, deployment practices, interconnection practices, etc. Without network visibility, it is difficult to improve our reliability, security, and capacity posture.
- Network availability: The expected continued growth of our ecosystem makes it difficult to understand our network bottlenecks and the limits we may be approaching.
Cloud Network Insight is a suite of solutions that provides both operational and analytical insight into the cloud network infrastructure to address the problems identified above. By collecting, accessing, and analyzing network data from a variety of sources such as VPC flow logs, ELB access logs, and eBPF flow logs, we can provide network insight to users and central teams through multiple data visualization tools such as Lumen and Atlas.
Flow Exporter is a sidecar that uses eBPF tracepoints to capture TCP flows in near real time on the instances that power the Netflix microservices architecture.
An eBPF flow log record represents one or more network flows, with their TCP/IP statistics, that occur within a variable aggregation interval.
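To make the idea of an aggregation interval concrete, here is a minimal sketch of rolling individual flow events up into one record per flow key and time window. The record layout, the field names, and the `aggregate` function are assumptions made for illustration; the text does not describe the actual eBPF flow log schema or how Flow Exporter aggregates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowKey:
    # Hypothetical key: the fields that identify a flow in this sketch.
    src_ip: str
    dst_ip: str
    dst_port: int


@dataclass
class FlowLogRecord:
    # Hypothetical layout; the real flow log schema is not given in the text.
    key: FlowKey
    window_start: int  # start of the aggregation interval, in epoch seconds
    flow_count: int = 0
    bytes_sent: int = 0
    bytes_received: int = 0


def aggregate(events, interval_seconds):
    """Roll flow events up into one record per (key, interval).

    Each event is a tuple: (timestamp, FlowKey, bytes_sent, bytes_received).
    """
    records = {}
    for timestamp, key, sent, received in events:
        # Align the timestamp to the start of its aggregation window.
        window_start = timestamp - (timestamp % interval_seconds)
        record = records.setdefault(
            (key, window_start), FlowLogRecord(key, window_start)
        )
        record.flow_count += 1
        record.bytes_sent += sent
        record.bytes_received += received
    return list(records.values())
```

Because the interval is a parameter, the same logic covers a variable aggregation interval: a shorter interval yields more, finer-grained records, while a longer one trades resolution for volume.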
The sidecar is implemented with highly performant eBPF and carefully chosen transport protocols so that it consumes less than 1% of the CPU and memory on any instance in our fleet. The choice among transport protocols such as gRPC, HTTPS, and UDP is made at runtime, based on the characteristics of the instance.
The runtime behavior of Flow Exporter can be managed dynamically through configuration changes made via Fast Properties. Flow Exporter also publishes various operational metrics to Atlas. These metrics are visualized using Lumen, a self-service dashboarding infrastructure.
Flow Collector is a regional service that ingests and enriches flows. IP addresses in the cloud can move from one EC2 instance or Titus container to another over time. We use Sonar to attribute an IP address to a specific application at a specific time. Sonar is an IPv4 and IPv6 address identity tracking service.
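The core of this kind of attribution is a lookup keyed on both the IP address and the time. A minimal sketch, assuming an in-memory timeline per IP address, might look like the following. The `IpOwnershipTimeline` class and its methods are hypothetical and only illustrate the idea; they are not the interface of the identity tracking service.

```python
import bisect
from collections import defaultdict


class IpOwnershipTimeline:
    """Tracks which application held each IP address over time (hypothetical)."""

    def __init__(self):
        # ip -> parallel, time-sorted lists of assignment times and app names
        self._times = defaultdict(list)
        self._apps = defaultdict(list)

    def record_assignment(self, ip, timestamp, app):
        """Record that `app` took over `ip` at `timestamp` (epoch seconds)."""
        index = bisect.bisect_right(self._times[ip], timestamp)
        self._times[ip].insert(index, timestamp)
        self._apps[ip].insert(index, app)

    def owner_at(self, ip, timestamp):
        """Return the application that held `ip` at `timestamp`, or None."""
        times = self._times.get(ip)
        if not times:
            return None
        # Find the latest assignment at or before the requested time.
        index = bisect.bisect_right(times, timestamp) - 1
        return self._apps[ip][index] if index >= 0 else None
```

Keeping the assignments sorted by time lets a binary search find the owner, which matters when the same address changes hands many times.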
Flow Collector consumes two data streams: IP address change events from Sonar, delivered via Kafka, and eBPF flow log data from the Flow Exporter sidecars. It attributes the flow data in real time with application metadata from Sonar. The attributed flows are pushed to Keystone, which routes them to the Hive and Druid datastores.
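The two-stream design can be sketched as a merge of time-ordered events, where address changes update a small piece of state and flows are enriched against it. The version below is a simplified in-memory illustration: the tuple layouts, the `attribute_flows` function, and the `"unknown"` fallback are all assumptions, and plain Python iterables stand in for the Kafka and Keystone streams.

```python
import heapq


def attribute_flows(ip_changes, flows):
    """Merge two time-ordered streams and yield flows enriched with app names.

    ip_changes: iterable of (timestamp, ip, app), sorted by timestamp.
    flows: iterable of (timestamp, src_ip, dst_ip, byte_count), sorted by timestamp.
    """
    current_owner = {}  # ip -> application that currently holds it

    # Tag each event with its stream. Tag 0 sorts before tag 1, so an address
    # change at time t is applied before a flow observed at the same time t.
    tagged_changes = ((ts, 0, (ip, app)) for ts, ip, app in ip_changes)
    tagged_flows = ((ts, 1, (src, dst, size)) for ts, src, dst, size in flows)

    for ts, tag, payload in heapq.merge(tagged_changes, tagged_flows):
        if tag == 0:
            ip, app = payload
            current_owner[ip] = app
        else:
            src, dst, size = payload
            yield {
                "timestamp": ts,
                "src_app": current_owner.get(src, "unknown"),
                "dst_app": current_owner.get(dst, "unknown"),
                "bytes": size,
            }
```

Processing both streams in timestamp order means each flow is attributed to whichever application held the address when the flow occurred, not to whichever holds it now.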
The attributed flow data drives a variety of use cases within Netflix, such as network monitoring and network usage forecasting, available through Lumen dashboards, and machine-learning-based network segmentation. Security and other partner teams also use the data for insight and incident analysis.
Providing insight into the cloud network infrastructure with eBPF flow logs at scale is made possible by eBPF and a highly scalable, efficient flow collection pipeline. After several iterations of the architecture and some tuning, the solution has proven that it can scale.
We currently ingest and enrich billions of eBPF flow logs per hour and provide visibility into our cloud ecosystem. The enriched data allows us to analyze networks across a variety of dimensions (such as availability, performance, and security) so that applications can effectively deliver their data payloads across a globally distributed cloud-based ecosystem.