Snare was built from the ground up to handle Netflix's massive scale. We currently process millions of log records per minute and analyze these events to perform in-house custom detection. We collect findings from a number of sources, including AWS Security Hub, AWS Config Rules, and our own in-house custom detections. Once ingested, findings are enriched with additional metadata collected from Netflix's internal data sources. Finally, findings are evaluated against our suppression rules and sent to our control plane for triage and remediation.
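As a rough illustration of the enrich-then-suppress flow, here is a minimal sketch in Python. The field names, rule shapes, and helper functions are all hypothetical — Snare's actual schema and rule engine are internal:

```python
# Hypothetical sketch of the enrich-and-suppress pipeline stage; field names
# and rule shapes are illustrative, not Snare's real internal schema.

def enrich(finding: dict, metadata: dict) -> dict:
    """Attach internal context (e.g. resource owner, environment) to a raw finding."""
    context = metadata.get(finding["resource_id"], {})
    enriched = dict(finding)
    enriched["owner"] = context.get("owner", "unknown")
    enriched["environment"] = context.get("env", "unknown")
    return enriched

def is_suppressed(finding: dict, rules: list) -> bool:
    """A finding is suppressed if all key/value pairs of any rule match it."""
    return any(all(finding.get(k) == v for k, v in rule.items()) for rule in rules)

# Example: suppress a known-noisy finding type in the test environment.
rules = [{"type": "open_security_group", "environment": "test"}]
metadata = {"sg-123": {"owner": "team-a", "env": "test"}}
finding = {"resource_id": "sg-123", "type": "open_security_group"}

enriched = enrich(finding, metadata)
print(is_suppressed(enriched, rules))  # True: the suppression rule matches
```

Findings that survive suppression would then be forwarded to the control plane; suppressed ones are dropped or archived.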
We’ve been building, deploying, and operating Snare for almost a year, and in that time we’ve seen tremendous improvements in our cloud security detections. Many findings are remediated automatically; others generate Slack alerts that loop in on-call engineers for triage via the Snare UI. One big improvement was a direct time savings for our detection squad. Using Snare, we’ve been able to tune and consolidate our false-positive-prone detections across our ingestion stream, leading to an average reduction of 73.5% in false positives. With this extra time, we’ve been able to focus on new detections and new features for Snare.
When it comes to new detections, we’ve more than doubled our number of internal detections and onboarded a number of detection sources from security vendors. The Snare framework lets us write detections quickly and efficiently by abstracting all of the plumbing and configuration away from us. Detection authors only need to be concerned with their actual detection logic; everything else is managed for them.
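To give a feel for the "authors only write detection logic" idea, here is a minimal sketch of such a framework. The decorator, registry, and function names are hypothetical, not Snare's real API:

```python
# Hypothetical sketch of a detection framework: authors supply only the
# matching logic; registration and execution are handled by the framework.
# All names here are illustrative, not Snare's actual interfaces.

DETECTIONS = []

def detection(name):
    """Decorator that registers a detection function with the framework."""
    def wrap(fn):
        DETECTIONS.append((name, fn))
        return fn
    return wrap

@detection("public-s3-bucket")
def public_bucket(event: dict) -> bool:
    # The author writes only this predicate; everything else is plumbing.
    return event.get("service") == "s3" and event.get("acl") == "public-read"

def run_detections(event: dict) -> list:
    """Framework plumbing: run every registered detection against an event."""
    return [name for name, fn in DETECTIONS if fn(event)]

print(run_detections({"service": "s3", "acl": "public-read"}))  # ['public-s3-bucket']
```

In a real system the framework would also own ingestion, scheduling, enrichment, and result delivery, which is exactly the work the author no longer has to think about.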
On the security vendor side, we’ve worked most notably with AWS to ensure that services like GuardDuty and Security Hub are first-class citizens as detection sources. Integrating with Security Hub was a key design decision from the beginning, because we get a great deal of leverage from receiving AWS security findings in a standard format and in a centralized location. Security Hub has played an integral role in our platform, and it has made evaluating and onboarding new AWS security services easier. Our plumbing between Security Hub and Snare is managed through AWS Organizations, along with EventBridge rules deployed in each region, to route all findings into our centralized Snare platform.
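A per-region EventBridge rule for Security Hub findings might look roughly like the sketch below. The event pattern values (`aws.securityhub`, `Security Hub Findings - Imported`) are the standard ones EventBridge emits for Security Hub; the rule name and target are placeholders, not our actual configuration:

```python
import json

# Sketch of a per-region EventBridge rule that matches Security Hub findings
# so they can be forwarded to a central location. Rule/target names below are
# placeholders, not Netflix's actual setup.
event_pattern = {
    "source": ["aws.securityhub"],
    "detail-type": ["Security Hub Findings - Imported"],
}

# With boto3, the rule would be attached roughly as follows (not executed here):
#   events = boto3.client("events")
#   events.put_rule(Name="forward-securityhub-findings",
#                   EventPattern=json.dumps(event_pattern))
#   events.put_targets(Rule="forward-securityhub-findings",
#                      Targets=[{"Id": "central", "Arn": "<central event bus ARN>"}])
print(json.dumps(event_pattern))
```

Deploying the same rule in every region, with the central event bus as the target, is what funnels all findings into one place for processing.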
One area we are investing in heavily is our automated remediation capability. We looked at a few different options, including fully automated remediations, human-triggered remediations, and automated playbooks that collect additional data during incident triage. Because of the flexible DAGs we can create and the simple “wait” / “task token” functionality, we decided to use AWS Step Functions as our execution environment, allowing humans to be looped in when approval or input is needed.
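The task-token pattern is what lets an execution pause for a human. A minimal sketch of such a state, expressed as a Python dict in Amazon States Language, might look like this (the Lambda name and state names are hypothetical):

```python
# Sketch of a Step Functions state that pauses for human approval using the
# task-token service integration; function and state names are placeholders.
wait_for_approval = {
    "Type": "Task",
    # The ".waitForTaskToken" suffix tells Step Functions to pause this
    # execution until someone calls SendTaskSuccess / SendTaskFailure.
    "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
    "Parameters": {
        "FunctionName": "notify-oncall",  # hypothetical approval-request Lambda
        "Payload": {"token.$": "$$.Task.Token", "finding.$": "$.finding"},
    },
    "Next": "Remediate",
}

# The approver's tooling would later resume the execution, e.g.:
#   sfn = boto3.client("stepfunctions")
#   sfn.send_task_success(taskToken=token, output='{"approved": true}')
print(wait_for_approval["Resource"])
```

The notification Lambda receives the task token, hands it to the on-call engineer (for example via a Slack action), and the workflow stays paused until the token is redeemed.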
Building on Step Functions, we created a four-step remediation process: pre-processing, decision, remediation, and post-processing. The pre- and post-processing steps can be used to handle out-of-band resource checks or any other work needed to ensure a successful remediation. The decision step performs a final pre-flight check before remediation: it can reach out to a human, verify that the resource still exists, and so on. The remediation step is where the actual remediation takes place. We’ve used this with great success, enabling misconfigured resources across our infrastructure to be corrected automatically in near real time and letting us create new, fully automated incident response playbooks. We are still exploring new ways to use it and are excited about how we can evolve our approach in the near future.
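The four-step flow could be expressed in Amazon States Language roughly as the sketch below. The Lambda ARNs and the decision condition are placeholders; the real definitions are internal:

```python
import json

# Hypothetical ASL sketch of the four-step remediation flow described above:
# pre-processing -> decision -> remediation -> post-processing.
# Lambda ARNs and the choice condition are placeholders.
state_machine = {
    "StartAt": "PreProcess",
    "States": {
        "PreProcess": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111111111111:function:pre-process",
            "Next": "Decision",
        },
        "Decision": {
            "Type": "Choice",
            "Choices": [
                # Final pre-flight check: only remediate if the resource
                # still exists (set by the pre-processing step).
                {"Variable": "$.resource_exists", "BooleanEquals": True,
                 "Next": "Remediate"},
            ],
            # If the resource is already gone, skip straight to cleanup.
            "Default": "PostProcess",
        },
        "Remediate": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111111111111:function:remediate",
            "Next": "PostProcess",
        },
        "PostProcess": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111111111111:function:post-process",
            "End": True,
        },
    },
}
print(sorted(state_machine["States"]))
```

The Choice state is what makes the DAG dynamic: the path through the machine depends on what the previous function returned, which matches the routing behavior described above.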
Image of a remediation that enables S3 Block Public Access on a non-compliant bucket. Each Choice state allows dynamic routing to different stages based on the output of the previous function. Wait states are used when human intervention / approval is required.