Full Report
As a detection engineer, you may recognize the following situations:

- A client reports that the detection you spent the whole day meticulously perfecting is suddenly producing numerous false positives.
- The tuning that worked flawlessly last year is now deprecated and, worse yet, creates blind spots.
- Another team attempts to deploy a custom detection using your deployment pipeline, only to find themselves debugging your code instead.
- The detection documentation that you (and ChatGPT 😉) worked so hard to put together now confuses the SOC team rather than providing clarity.

In this first blog of a series, we’ll explore the concept of maintenance, its critical importance, the conventional wisdom of “if it ain’t broke, don’t fix it,” and the paradox that keeps me awake at night.

Before proceeding further, let’s establish a shared understanding of what maintenance truly entails.

Defining maintenance

Maintenance as a term in software engineering is not a new thing. Thousands upon thousands of articles explore the importance of maintaining software and provide practical guidance on implementing maintenance methodologies. Software engineers and developers know that maintenance is one of the most critical things when you deliver software.

According to various sources:

Software maintenance refers to the process of modifying and updating software after its initial development and deployment, to correct faults, improve performance or other attributes, add new features to meet evolving user requirements, or adapt to a changed environment.

Software maintenance is an ongoing process that is essential for the longevity of a software system, to keep it effective, adaptable and relevant in an ever-evolving technological landscape.

However, according to Wikipedia, software maintenance is:

- often considered lower skilled and less rewarding than new development.
- not as well studied as other phases of the software life cycle, despite comprising the majority of costs.

Ouch.
It hurts my feelings that maintenance is considered “lower skilled”, as I have spent a big part of my career maintaining and tuning detections. And I still do.

Software maintenance has distinct categories, but I’m not writing an article about software maintenance in general. So, let’s dive deep into the world of maintenance in detection engineering, and I’ll try to apply the categories of software maintenance directly to detection engineering.

Transitioning to detection engineering

Since modern detection engineering embraces the ‘detection-as-code’ paradigm, where detection rules and logic are managed as software artifacts, these fundamental software maintenance principles (readability, maintainability, testability, modularity and others) apply directly to detection engineering.

As with software engineering, detection engineering follows a structured process.

Figure: An example of a detection lifecycle, from SafeBreach’s “Detection Engineering: A Comprehensive Guide” blog

The final step of this never-ending cycle is maintenance. You may also see maintenance referred to as optimization or, more simply, tuning.

Validating the importance of detection maintenance through its categories

In general, I don’t understand the notion of ‘If it ain’t broke, don’t fix it.’ Instead, I’m a fan of a different idea, one I came up with myself: ‘Even if it’s not broken, fix it.’ I promise this will make sense by the end of this blog post.

Let’s go through the maintenance categories, with some detection engineering specific examples, to highlight their importance. The core categories of software maintenance are:

- Corrective
- Adaptive
- Perfective
- Preventive

Corrective is the reactive phase of maintenance. It’s usually urgent and most of the time initiated by a client (if you’re unlucky, multiple clients).
An example of corrective maintenance in detection engineering is when a detection rule is created to monitor multiple scenarios, such as changes in security policies within Azure DevOps, under the assumption that all actions will generate the same event logs. If the detection is implemented without comprehensive testing across all scenarios, it may later be discovered that certain policy changes produce different telemetry data. This results in the detection rule only partially covering the intended scenarios, necessitating corrective maintenance to address the oversight.

Adaptive maintenance is next. Adaptive maintenance covers situations where you need to update something to comply with new software requirements. An example of this category is when Microsoft changes an API that many of your tools use, so your deployment pipelines stop working. In an ideal world, adaptive maintenance would be proactive, as you can expect and watch for changes like that, but it tends to be reactive. It’s a race condition between you and your clients. Ideally, you get the error first and “win”: you make the update and proactively inform your client. When the client gets it first, you “lose” and update reactively.

Sometimes, there’s a thin line between corrective and adaptive maintenance. As an example: all of a sudden, all your “DeviceProcessEvents”-based detections are triggering false positives, just because you haven’t specified the ActionType “ProcessCreated” in your rule. You didn’t expect that our favorite company, Microsoft, would introduce new ActionTypes such as “ProcessCreatedAggregatedReport”. The change in the detection query is minor (you are just adding one line of code), but it’s equally important, as you need to fix all affected detections in your central repository and all deployed detections at your clients.

My favorite one is up next.
Perfective maintenance focuses on improving the detection query or documentation, or adding new features, or everything together. It’s proactive and sometimes initiated based on client feedback. Examples include removing old and deprecated tuning code from your clients’ detections, re-aligning detection thresholds to match the client’s latest environment changes, or making the detection more resilient by enhancing the core detection query.

Last but not least is preventive maintenance. As the name suggests, with this type of maintenance you look into the future and take measures to prevent errors or to improve quality without waiting for feedback from a client. It’s a proactive, planned and periodic type of maintenance that tries to make detections more stable through optimization. An example of preventive maintenance is to go back and improve the documentation of old detections, as new blind spots or false positives could have emerged.

I think by now it’s pretty clear that doing maintenance is not just an optional or voluntary action, but rather a mandatory, pre-planned and structured process.

A paradox is forming the current state of maintenance

The question that naturally arises now is: why are we wasting time explaining the obvious? Everyone knows we should maintain our detections, documentation, internal tools, deployment pipelines, etc. Even if it’s ‘not sexy.’

There is an interesting paradox here.

Even though we all, especially blue team people, are familiar with at least a variation of the detection engineering lifecycle, I have the impression that nobody in the industry talks about maintenance. No articles, no talk on Twitter, no one seems to care about one of the most crucial things in detection engineering.

I would imagine that all detection teams have some kind of methodology they follow when it comes to maintenance. But then again, if that’s the case, it’s all behind closed doors.
I also get the sense that for some less mature detection engineering teams, maintenance isn’t a priority. And to some extent, that’s understandable.

Thus, whether due to a reluctance to share or the inexperience of less mature teams, we arrive at the current state of maintenance in detection engineering.

Next steps

You might be asking, where are we going from here?

As I mentioned at the beginning, this blog is part of a blog series focused on shedding more light on maintenance in detection engineering. The goal of this first blog is to spark a discussion around this topic and, hopefully, encourage people in the industry to share more.

More specifically, here are some intriguing questions worth exploring:

- How do different teams approach maintenance?
- Are there any universal tuning principles we could follow?
- How much time does your team spend on each of the four types of maintenance?
- What metrics are tracked to measure the effectiveness of the rule set? Is a data-driven approach the best way forward?
- Do these metrics contribute to the maturity of the detection engineering process? If so, how?
- What are the maturity models for maintenance? Are they different from those briefly outlined in Elastic’s article on the detection engineering maturity model?
- Do teams keep historical data to gain a deeper understanding of rule behavior by collecting data from multiple client environments?
- Do automated tuning solutions (e.g., using ML) actually work, or do they just add to the problem, increasing the need for maintenance?

*Note*: There are some resources out there that support the notion that we neglect maintenance in favor of innovation and novelty, a side effect of capitalism.
However, that topic is beyond the scope of this article.

Conclusion

Let me begin the conclusion with a quote that I saw repeatedly while researching for this blog:

“Maintenance is often the hardest job and at the same time the least glamorous.”

Whatever the reason might be, detection teams are focusing more on recruiting talent, refining processes, developing a concrete detection strategy, building a detection backlog to prioritize detection needs, automating deployment to fully leverage the detection-as-code approach, and creating the highest-quality, most resilient detections possible. Meanwhile, maintenance seems to have slipped through the cracks.

This first blog post is an attempt to restore focus on maintenance as a crucial aspect of detection engineering. Perhaps if we consciously discuss this topic more, we won’t need alert fatigue case studies in the future (great article, by the way) or maintenance-related roles like fine-tuning engineers. In my opinion, we need to tackle the alert fatigue problem collectively rather than relying on small, temporary fixes.

A man can dream, right?

Thank you for taking the time to explore this topic with me. I’m eager to hear your comments on this one. Drop a comment or DM me on social media (X, LinkedIn). In the meantime, I’m working on further improving our Sentry Detect managed detection engineering service. 😃

Resources

- Detection Engineering: A Comprehensive Guide
- Wikipedia’s definition of software maintenance
- Elastic’s Detection Engineering Behavior Maturity Model (DEBMM)
- Hail the maintainers
- Why Do People Neglect Maintenance?
- Anton’s Alert Fatigue: The Study
- SOC Alert Fatigue and The Need for Dedicated Finetuning Engineer Role

“Why is no one talking about maintenance in detection engineering?” was originally published in FalconForce on Medium, where people are continuing the conversation by highlighting and responding to this story.
Analysis Summary
# Best Practices: Detection Engineering Maintenance
## Overview
These recommendations focus on establishing a structured, ongoing maintenance process for Detection-as-Code artifacts. Maintenance is crucial for ensuring the longevity, effectiveness, adaptability, and relevance of security detections, despite often being neglected in the industry in favor of new development.
## Key Recommendations
### Immediate Actions (Reactive/Corrective & Adaptive)
1. **Establish an Urgent Triage Process for Reactive Issues:** Implement a documented, high-priority channel (e.g., dedicated ticketing queue or on-call rotation) for addressing issues flagged by the SOC team or clients that result in immediate false positives or detection failures (Corrective Maintenance).
2. **Audit Recently Deployed Detections for Missing Telemetry Assumptions:** Immediately review new or complex detections that cover multiple scenarios (e.g., broad cloud configuration changes) to ensure all telemetry paths and event formats are accounted for, preventing immediate performance degradation or partial coverage.
3. **Define an Adaptive Change Monitoring Cadence:** Assign specific personnel to monitor vendor release notes (e.g., Microsoft API changes) relevant to data sources used in detections. Document a rapid response plan for known breaking changes.
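
The adaptive monitoring in item 3 could be partly automated by comparing the values observed in a telemetry field against a recorded baseline, so that a vendor-introduced value surfaces before clients report false positives. The following is a minimal Python sketch under stated assumptions: the function name `find_new_values`, the baseline set, and the sample events are hypothetical and are not taken from the source article or from any vendor API.

```python
from collections import Counter
from typing import Iterable, Mapping

# Hypothetical baseline: the ActionType values the detections were written against.
KNOWN_ACTION_TYPES = {"ProcessCreated"}


def find_new_values(
    events: Iterable[Mapping[str, str]],
    field: str,
    known: set[str],
) -> dict[str, int]:
    """Return values of `field` that are absent from `known`, with their counts."""
    observed = Counter(e[field] for e in events if field in e)
    return {value: n for value, n in observed.items() if value not in known}


# Illustrative sample events, not real telemetry.
sample_events = [
    {"ActionType": "ProcessCreated", "FileName": "cmd.exe"},
    {"ActionType": "ProcessCreatedAggregatedReport", "FileName": "svchost.exe"},
    {"ActionType": "ProcessCreatedAggregatedReport", "FileName": "svchost.exe"},
]

drift = find_new_values(sample_events, "ActionType", KNOWN_ACTION_TYPES)
print(drift)
```

A non-empty result would be a prompt to review affected detections; it does not by itself say whether a new value is benign.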
### Short-term Improvements (1-3 months)
1. **Integrate Software Maintenance Principles into SDLC:** Formally adopt core software engineering principles (readability, modularity, testability) for all new detection code. Ensure all committed rules use standardized template structures.
2. **Initiate Cross-Scenario Testing for Critical Detections:** For existing high-fidelity detections, conduct targeted testing to simulate different action types or environmental variables that might result in unexpected telemetry—addressing potential blind spots created by undocumented platform changes.
3. **Standardize Documentation Review:** Schedule mandatory Perfective Maintenance sprints quarterly to systematically review and enhance the documentation for the top 20% of utilized detection rules, focusing on clarity for the SOC team and detailing data source dependencies.
4. **Begin Deprecation Cleanup:** Conduct an initial sweep to identify and remove or archive outdated tuning code, deprecated event fields, or rule logic that is no longer necessary or is actively causing noise.
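
The cross-scenario testing in item 2 can be sketched as a small harness that runs one detection predicate over a sample event per scenario and lists the scenarios it misses. This is only an illustration in Python: the event shape, the operation names, and the helper `uncovered_scenarios` are hypothetical placeholders, not real Azure DevOps audit fields.

```python
from typing import Callable, Mapping

Event = Mapping[str, str]


def detects_policy_change(event: Event) -> bool:
    """Toy detection that assumes every policy change uses one operation name."""
    return event.get("operation") == "Policy.Modified"


# Hypothetical scenarios: one sample event per way the behaviour can occur.
SCENARIOS: dict[str, Event] = {
    "branch_policy_changed": {"operation": "Policy.Modified"},
    "org_policy_changed": {"operation": "OrgPolicy.Updated"},
}


def uncovered_scenarios(
    detection: Callable[[Event], bool],
    scenarios: Mapping[str, Event],
) -> list[str]:
    """Return the names of scenarios the detection fails to match."""
    return sorted(name for name, event in scenarios.items() if not detection(event))


missed = uncovered_scenarios(detects_policy_change, SCENARIOS)
print(missed)
```

Keeping one sample event per scenario next to the rule makes the "same event logs" assumption explicit and testable.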
### Long-term Strategy (3+ months)
1. **Establish a Proactive Preventive Maintenance Schedule:** Dedicate a fixed, recurring percentage of engineering capacity (e.g., 20% of each sprint) specifically to Preventive Maintenance. This time should be used for planned reviews of older, stable detections to improve resilience against future environmental shifts.
2. **Develop Data-Driven Tuning Metrics:** Define and begin tracking specific Key Performance Indicators (KPIs) for rule performance (e.g., False Positive Rate (FPR), Time To Detect (TTD) for controlled tests). Use historical data to inform threshold adjustments (Perfective Maintenance).
3. **Implement Version Control and Rollback Capacity:** Ensure all detection logic, configuration files, and custom deployment scripts are managed under robust version control. Verify the pipeline supports seamless, reliable rollbacks to previous stable versions following major updates or environment shifts.
4. **Document Maintenance Allocation:** Track and report on the time spent across the four maintenance categories (Corrective, Adaptive, Perfective, Preventive) to build organizational awareness of maintenance costs and necessity.
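
As a rough illustration of the data-driven tuning metrics in item 2, the Python sketch below computes, per rule, the share of triaged alerts that were closed as false positives and flags rules above a threshold. The definition of the metric, the rule names, the disposition labels, and the 0.5 threshold are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Iterable

# (rule_id, disposition) pairs; rule names and dispositions are illustrative.
ALERTS = [
    ("rule_a", "false_positive"),
    ("rule_a", "true_positive"),
    ("rule_a", "false_positive"),
    ("rule_a", "false_positive"),
    ("rule_b", "true_positive"),
]


def false_positive_share(alerts: Iterable[tuple[str, str]]) -> dict[str, float]:
    """Share of each rule's triaged alerts that were closed as false positives."""
    totals: dict[str, int] = defaultdict(int)
    false_positives: dict[str, int] = defaultdict(int)
    for rule_id, disposition in alerts:
        totals[rule_id] += 1
        if disposition == "false_positive":
            false_positives[rule_id] += 1
    return {rule: false_positives[rule] / totals[rule] for rule in totals}


def rules_needing_tuning(shares: dict[str, float], threshold: float) -> list[str]:
    """Rules whose false-positive share exceeds `threshold`."""
    return sorted(rule for rule, share in shares.items() if share > threshold)


fp_shares = false_positive_share(ALERTS)
print(fp_shares)
print(rules_needing_tuning(fp_shares, threshold=0.5))
```

Tracked over time and per client environment, the same numbers can show whether a tuning change actually reduced noise.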
## Implementation Guidance
### For Small Organizations
- **Focus on Corrective/Adaptive Reactivity:** Since resources are limited, prioritize immediate triage for client-reported issues (Corrective) and stay ahead of known platform changes (Adaptive) through dedicated monitoring.
- **Adopt the "Fix It Anyway" Mindset:** When modifying any detection, apply Perfective maintenance to immediately update its documentation and remove any obsolete tuning code, bundling fixes into development tasks.
- **Simple Documentation Standard:** Utilize clear, concise READMEs within the code repository for each detection to ensure quick context transfer when personnel changes occur.
### For Medium Organizations
- **Formalize Maintenance Categories:** Introduce explicit tasks in the ticketing/backlog system aligned with the four maintenance categories. This forces prioritization visibility.
- **Introduce Periodic Pre-Release Validation:** Before deploying major updates to detection repositories, implement a small set of automated regression tests to catch scenario failures (Corrective/Adaptive) that might have been introduced by recent changes.
- **Define Clear Ownership:** Assign primary and secondary ownership for specific detection families, ensuring continuous monitoring and scheduled Perfective review cycles for those areas.
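
Once backlog items carry one of the four maintenance categories, the time split across them can be reported with very little code. The Python sketch below assumes a hypothetical export of `(category, hours)` pairs from the ticketing system; the `Category` enum and the sample figures are illustrative only.

```python
from collections import Counter
from enum import Enum
from typing import Iterable


class Category(Enum):
    CORRECTIVE = "corrective"
    ADAPTIVE = "adaptive"
    PERFECTIVE = "perfective"
    PREVENTIVE = "preventive"


# Hypothetical backlog entries: (category, hours spent).
TICKETS = [
    (Category.CORRECTIVE, 6.0),
    (Category.ADAPTIVE, 2.0),
    (Category.CORRECTIVE, 4.0),
    (Category.PERFECTIVE, 8.0),
]


def allocation(tickets: Iterable[tuple[Category, float]]) -> dict[Category, float]:
    """Return each category's share of total maintenance hours (0.0 to 1.0)."""
    hours: Counter = Counter()
    for category, spent in tickets:
        hours[category] += spent
    total = sum(hours.values())
    if total == 0:
        return {category: 0.0 for category in Category}
    return {category: hours[category] / total for category in Category}


time_shares = allocation(TICKETS)
for category, share in time_shares.items():
    print(f"{category.value}: {share:.0%}")
```

A persistently high corrective share with no preventive time is the kind of signal this report is meant to make visible.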
### For Large Enterprises
- **Implement Mature Detection-as-Code Pipeline:** Ensure CI/CD pipelines automatically enforce code quality standards (readability, modularity) before rule deployment.
- **Establish Environmental Shadowing/Historical Data Retention:** Maintain infrastructure to collect historical telemetry data or run new rules in a shadowed, non-alerting mode within client environments to proactively identify potential false positives before full deployment (Preventive).
- **Develop Maturity Modeling:** Use tracked metrics (time allocation per maintenance type, trend in FPR) to assess the maturity of the detection engineering process and justify dedicated resources for ongoing optimization.
## Configuration Examples
*Note: The source article discusses concepts rather than specific configuration commands. The table illustrates the kind of change each maintenance category typically implies:*
| Maintenance Type | Required Configuration/Code Action Example |
| :--- | :--- |
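| **Corrective** | Adding an explicit `ActionType == "ProcessCreated"` filter to `DeviceProcessEvents`-based detection queries that lacked it, so that newly introduced action types such as `ProcessCreatedAggregatedReport` no longer trigger false positives. (The source article notes this case sits on the line between corrective and adaptive.) |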
| **Adaptive** | Updating all detection queries that rely on a deprecated API endpoint URL or field name to use the modern replacement identified in vendor release notes. |
| **Perfective** | Refactoring a complex detection query into smaller, reusable modules to improve readability and simplify future tuning changes. |
| **Preventive** | Applying consistent baseline threshold settings across an entire family of similar network-based detections to standardize behavior before a new trend emerges. |
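
The `ActionType` case from the source article lends itself to a simple repository check: flag any rule that queries `DeviceProcessEvents` without pinning `ActionType`. The Python sketch below is a text heuristic under stated assumptions; it does not parse KQL, it only recognizes the `ActionType == "..."` form, and the rule names and query bodies are hypothetical.

```python
import re

# Matches an explicit equality filter such as: ActionType == "ProcessCreated"
ACTION_TYPE_FILTER = re.compile(r"\bActionType\s*==\s*[\"']\w+[\"']")


def missing_action_type_filter(query: str) -> bool:
    """True if the query reads DeviceProcessEvents without pinning ActionType."""
    uses_table = re.search(r"\bDeviceProcessEvents\b", query) is not None
    return uses_table and ACTION_TYPE_FILTER.search(query) is None


# Illustrative rule bodies; rule names are hypothetical.
RULES = {
    "unpinned_rule": 'DeviceProcessEvents\n| where FileName =~ "rundll32.exe"',
    "pinned_rule": (
        "DeviceProcessEvents\n"
        '| where ActionType == "ProcessCreated"\n'
        '| where FileName =~ "rundll32.exe"'
    ),
}

flagged = sorted(name for name, q in RULES.items() if missing_action_type_filter(q))
print(flagged)
```

Run against the central repository, a check like this gives a list of affected detections to fix once and redeploy to clients, rather than discovering them one alert at a time.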
## Compliance Alignment
The principles inherent in structured maintenance directly support adherence to the following security frameworks by ensuring continuous operational effectiveness:
* **NIST SP 800-53 (Rev. 5):** Supports **RA-5 (Vulnerability Monitoring and Scanning)** and **CA-7 (Continuous Monitoring)** by ensuring controls remain relevant and effective against the evolving threat landscape.
* **ISO/IEC 27001:** Aligns with the requirement for **A.12.1.2 (Change Management)** in Annex A of the 2013 edition (control 8.32 in the 2022 edition), as detection logic is critical software infrastructure requiring controlled modification and review.
* **CIS Critical Security Controls (v8):** Supports **Control 17 (Incident Response Management)** by ensuring that detection mechanisms are tuned to provide actionable, high-fidelity alerts, reducing alert fatigue and improving SOC efficiency.
## Common Pitfalls to Avoid
1. **The "If It Ain't Broke, Don't Fix It" Mentality:** This leads to outdated rules, platform dependency breakage, and future emergency maintenance overload. Embrace the proactive stance: "Even if it’s not broken, fix it (improve it)."
2. **Neglecting Documentation Review:** Assuming documentation authored during initial development remains accurate. Outdated documentation leads to SOC confusion and inefficient incident response.
3. **Treating Maintenance as "Lower Skilled":** Devaluing maintenance leads to the assignment of less experienced personnel, which increases the likelihood that necessary updates introduce new bugs and, in turn, more corrective maintenance.
4. **Focusing Only on New Detections:** Ignoring the established, deployed codebase results in accumulated technical debt, where legacy rules become increasingly resistant to updates and contribute heavily to blind spots and noise.
## Resources
- **Reference Framework:** SafeBreach’s Detection Engineering Lifecycle (as a model for incorporating maintenance as the final, continuous step).
- **Maturity Guidance (Conceptual):** Review Elastic’s Detection Engineering Behavior Maturity Model (DEBMM) as a benchmark to contextualize the required maturity level for maintenance processes.
- **Cultural Resources (Conceptual):** Research on "Hail the maintainers" initiatives to promote the value of sustaining engineering work.