Why is no one talking about maintenance in detection engineering?

Full Report

As a detection engineer, you may recognize the following situations:

- A client reports that the detection you spent the whole day meticulously perfecting is suddenly producing numerous false positives.
- The tuning that worked flawlessly last year is now deprecated and, worse yet, creates blind spots.
- Another team attempts to deploy a custom detection using your deployment pipeline, only to find themselves debugging your code instead.
- The detection documentation that you (and ChatGPT 😉) worked so hard to put together now confuses the SOC team rather than providing clarity.

In this first blog of a series, we’ll explore the concept of maintenance, its critical importance, the conventional wisdom of “if it ain’t broke, don’t fix it,” and the paradox that keeps me awake at night.

Before proceeding further, let’s establish a shared understanding of what maintenance truly entails.

Defining maintenance

Maintenance as a term in software engineering is not a new thing. Thousands upon thousands of articles explore the importance of maintaining software and provide practical guidance on implementing maintenance methodologies. Software engineers and developers know that maintenance is one of the most critical things when you deliver software.

According to various sources:

Software maintenance refers to the process of modifying and updating software after its initial development and deployment, to correct faults, improve performance or other attributes, add new features to meet evolving user requirements, or adapt to a changed environment.

Software maintenance is an ongoing process that is essential for the longevity of a software system, to keep it effective, adaptable and relevant in an ever-evolving technological landscape.

However, according to Wikipedia, software maintenance is:

- often considered lower skilled and less rewarding than new development.
- not as well studied as other phases of the software life cycle, despite comprising the majority of costs.

Ouch.
It hurts my feelings that maintenance is considered “lower skilled”, as I spent a big part of my career maintaining and tuning detections. And I still do.

Software maintenance has distinct categories, but I’m not writing an article about software maintenance in general. So, let’s dive deep into the world of maintenance in detection engineering, and I’ll try to apply the categories of software maintenance directly to detection engineering.

Transitioning to detection engineering

Modern detection engineering embraces the ‘detection-as-code’ paradigm, where detection rules and logic are managed as software artifacts. As a result, fundamental software maintenance principles such as readability, maintainability, testability and modularity apply directly to detection engineering.

As with software engineering, detection engineering follows a structured process.

An example of a detection lifecycle, from SafeBreach’s “Detection Engineering: A Comprehensive Guide” blog

The final step of this never-ending cycle is maintenance. You may also see maintenance referred to as optimization or, more simply, tuning.

Validating the importance of detection maintenance through its categories

In general, I don’t understand the notion of ‘If it ain’t broke, don’t fix it.’ Instead, I’m a fan of a different idea, one I came up with myself: ‘Even if it’s not broken, fix it.’ I promise this will make sense by the end of this blog post.

Let’s go through the maintenance categories, with some detection engineering specific examples, to highlight their importance. The core categories of software maintenance are:

- Corrective
- Adaptive
- Perfective
- Preventive

Corrective is the reactive phase of maintenance. It’s usually urgent and most of the time initiated by a client (if you’re unlucky, multiple clients).
An example of corrective maintenance in detection engineering is when a detection rule is created to monitor multiple scenarios, such as changes in security policies within Azure DevOps, under the assumption that all actions will generate the same event logs. If the detection is implemented without comprehensive testing across all scenarios, it may later be discovered that certain policy changes produce different telemetry data. This results in the detection rule only partially covering the intended scenarios, necessitating corrective maintenance to address the oversight.

Adaptive maintenance is next. Adaptive maintenance relates to the situation where you need to update something to comply with new software requirements. An example of this category is when Microsoft changes the API that many of your tools are using, so now your deployment pipelines don’t work. In an ideal world, adaptive maintenance should be proactive, as you can anticipate changes like that, but it tends to be reactive. It’s a race between you and your clients. Ideally, you get the error first and “win”: you do the update and proactively inform your client. When the client gets it first, you “lose” and update reactively.

Sometimes, there’s a thin line between corrective and adaptive maintenance. As an example: all of a sudden, all your “DeviceProcessEvents”-based detections are triggering false positives, simply because you haven’t specified the ActionType “ProcessCreated” in your rule. You didn’t expect that our favorite company, Microsoft, would introduce new ActionTypes, such as “ProcessCreatedAggregatedReport”. The change in the detection query is minor (you are just adding a new line of code), but it’s equally important, as you need to fix all affected detections in your central repository and all deployed detections at your clients.

My favorite one is up next.
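
As a short aside on the “DeviceProcessEvents” example above, here is a minimal Python sketch of why a missing ActionType filter suddenly produces extra hits, and of how you might look for affected rules in a central repository. It is only an illustration under stated assumptions: the in-memory `ProcessEvent` model, the `whoami.exe` rule logic, the rule names and the simple string check are all hypothetical and stand in for real telemetry, real queries and a real rule library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessEvent:
    """Simplified, hypothetical stand-in for one row of process telemetry."""
    action_type: str  # mirrors the ActionType value discussed above
    file_name: str    # illustrative field for the process image name


def naive_rule(events):
    """Match on the process name only, with no ActionType filter."""
    return [e for e in events if e.file_name.lower() == "whoami.exe"]


def pinned_rule(events):
    """Same logic, pinned to the ActionType the rule was designed for."""
    return [e for e in naive_rule(events) if e.action_type == "ProcessCreated"]


def rules_missing_action_type(rules: dict[str, str]) -> list[str]:
    """Return names of rules that query DeviceProcessEvents without
    mentioning ActionType at all (a deliberately crude text check)."""
    return sorted(
        name
        for name, query in rules.items()
        if "DeviceProcessEvents" in query and "ActionType" not in query
    )


# Sample telemetry: the second event uses the newly introduced ActionType.
events = [
    ProcessEvent("ProcessCreated", "whoami.exe"),
    ProcessEvent("ProcessCreatedAggregatedReport", "whoami.exe"),
    ProcessEvent("ProcessCreated", "notepad.exe"),
]

naive_hits = naive_rule(events)
pinned_hits = pinned_rule(events)
print(len(naive_hits), len(pinned_hits))  # prints "2 1"

# Hypothetical rule library, keyed by rule name.
rule_library = {
    "whoami_discovery": 'DeviceProcessEvents | where FileName =~ "whoami.exe"',
    "whoami_discovery_pinned": (
        'DeviceProcessEvents | where ActionType == "ProcessCreated" '
        '| where FileName =~ "whoami.exe"'
    ),
    "network_logon": 'DeviceLogonEvents | where LogonType == "Network"',
}

flagged = rules_missing_action_type(rule_library)
print(flagged)  # prints "['whoami_discovery']"
```

The unpinned rule picks up the aggregated report event as an extra hit, while the pinned rule keeps its original behavior. A text check like `rules_missing_action_type` is far too naive for production use, but it shows how a one-line fix can be turned into a repeatable sweep across the whole library instead of a per-client firefight.
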
Perfective maintenance focuses on improving the detection query or documentation, adding new features, or everything together. It’s proactive and sometimes initiated based on client feedback. Examples include removing old and deprecated tuning code from your clients’ detections, re-aligning detection thresholds to match the client’s latest environment changes, or making the detection more resilient by enhancing the core detection query.

Last but not least is preventive maintenance. As the name suggests, with this type of maintenance you look into the future and take measures to prevent errors or to improve quality without waiting for feedback from a client. It’s a proactive, planned and periodic type of maintenance that tries to make detections more stable through optimization. An example of preventive maintenance is to go back and improve the documentation of old detections, as new blind spots or false positives could have emerged.

I think by now it’s pretty clear that doing maintenance is not just an optional or voluntary action, but rather a mandatory, pre-planned and structured process.

A paradox is forming the current state of maintenance

The question that naturally arises now is: why are we wasting time explaining the obvious? Everyone knows we should maintain our detections, documentation, internal tools, deployment pipelines, etc. Even if it’s ‘not sexy.’

There is an interesting paradox here.

Even though we all, especially blue team people, are familiar with at least one variation of the detection engineering lifecycle, I have the impression that nobody talks about maintenance in the industry. No articles, no talk on Twitter, no one seems to care about one of the most crucial things in detection engineering.

I would imagine that all detection teams have some kind of methodology they follow when it comes to maintenance. But then again, if that’s the case, it’s all behind closed doors.
I also get the sense that for some less mature detection engineering teams, maintenance isn’t a priority. And to some extent, that’s understandable.

Thus, whether due to a reluctance to share or the inexperience of less mature teams, we arrive at the current state of maintenance in detection engineering.

Next steps

You might be asking, where are we going from here?

As I mentioned at the beginning, this blog is part of a blog series focused on shedding more light on maintenance in detection engineering. The goal of this first blog is to spark a discussion around this topic and, hopefully, encourage people in the industry to share more.

More specifically, here are some intriguing questions worth exploring:

- How do different teams approach maintenance?
- Are there any universal tuning principles we could follow?
- How much time does your team spend on each of the four types of maintenance?
- What metrics are tracked to measure the effectiveness of the rule set? Is a data-driven approach the best way forward?
- Do these metrics contribute to the maturity of the detection engineering process? If so, how?
- What are the maturity models for maintenance? Are they different from those briefly outlined in Elastic’s article on the detection engineering maturity model?
- Do teams keep historical data to gain a deeper understanding of rule behavior by collecting data from multiple client environments?
- Do automated tuning solutions (e.g., using ML) actually work, or do they just add to the problem, increasing the need for maintenance?

*Note*: There are some resources out there that support the notion that we neglect maintenance in favor of innovation and novelty, a side effect of capitalism.
However, that topic is beyond the scope of this article.

Conclusion

Let me begin the conclusion with a quote that I saw repeatedly while researching for this blog:

Maintenance is often the hardest job and at the same time the least glamorous.

Whatever the reason might be, detection teams are focusing more on recruiting talent, refining processes, developing a concrete detection strategy, building a detection backlog to prioritize detection needs, automating deployment to fully leverage the detection-as-code approach, and creating the highest-quality, most resilient detections possible. Meanwhile, maintenance seems to have slipped through the cracks.

This first blog post is an attempt to restore focus on maintenance as a crucial aspect of detection engineering. Perhaps if we consciously discuss this topic more, we won’t need alert fatigue case studies in the future (great article, by the way) or maintenance-related roles like fine-tuning engineers. In my opinion, we need to tackle the alert fatigue problem collectively rather than relying on small, temporary fixes.

A man can dream, right?

Thank you for taking the time to explore this topic with me. I’m eager to hear your comments on this one. Drop a comment or DM me on social media (X, LinkedIn). In the meantime, I’m working on further improving our Sentry Detect managed detection engineering service. 😃

Resources

- Detection Engineering: A Comprehensive Guide
- Wikipedia’s definition of software maintenance
- Elastic’s Detection Engineering Behavior Maturity Model (DEBMM)
- Hail the maintainers
- Why Do People Neglect Maintenance?
- Anton’s Alert Fatigue: The Study
- SOC Alert Fatigue and The Need for Dedicated Finetuning Engineer Role

Why is no one talking about maintenance in detection engineering? was originally published in FalconForce on Medium, where people are continuing the conversation by highlighting and responding to this story.

Analysis Summary