**Session Date/Time:** 11 Sep 2024 14:00

# [NMOP](../wg/nmop.html)

## Summary

The NMOP virtual meeting focused on network anomaly detection, featuring detailed presentations from operators (Swisscom, Bell Canada, Orange) and a research institute (INSA) on their real-world experiences, challenges, and proposed solutions. The session also included updates on two recently adopted working group drafts: "Architecture for Network Anomaly Detection Framework" and "Experiment on Network Anomaly Life Cycle," along with a discussion on terminology. Key themes included the need for robust telemetry, standardized data, improved correlation, and continuous learning in anomaly detection systems.

## Key Discussion Points

* **Operator Experiences & Requirements for Anomaly Detection:**
    * **Swisscom (Thomas Graf) - Network Incident Postmortem:**
        * Presented a postmortem of a network incident during a maintenance window involving IS-IS and SRv6 configuration changes, leading to application connectivity loss (VoIP, mobile control plane).
        * Anomaly detection systems observed traffic drops, flow count spikes, and topology changes, but did not alert the Network Operation Center (NOC) due to current system limitations (e.g., auto-profiling missing in the real-time version). The NOC was alerted by application teams.
        * Identified critical gaps: direct IS-IS control plane visibility (beyond BGP-LS redistribution), forwarding plane path visibility (e.g., RFC 7799 passive hybrid type 1), and correlation of configuration changes (e.g., via NETCONF transaction IDs) with observed network behavior.
        * Reproducing the issue in a lab environment helped identify the root cause.
        * Alerting delay is currently 120-180 seconds due to flow aggregation, which is considered acceptable.
    * **Bell Canada (Dan Beland) - Journey to Monitoring:**
        * Shared Bell's progress since starting with RFC 9232, highlighting a significant mindset shift towards centralizing IPFIX, BMP, and YANG-Push telemetry, and standardizing analytics.
        * Identified challenges with legacy systems, multi-vendor data standardization, and data quality issues.
        * Highlighted a partnership with Swisscom to share experiences and push for industry standardization.
        * Presented two outages: one where a BNG backup failed to pick up (also observed by Swisscom), and a "mother of all" outage with multiple link failures. In both cases, detecting "something is down" was fast, but correlating it to customer impact and understanding the root cause quickly was the challenge.
        * Emphasized the need for real-time anomaly detection, comprehensive network topology views, and accelerated troubleshooting, with a target reaction time of 7-8 minutes for critical issues.
        * Stressed that standardization at the network node level (across vendors and platforms) is crucial for future closed-loop automation.
    * **Orange (Leonel Tadeu) - Knowledge Graph for Incident Management:**
        * Introduced `draft-tadeu-nmop-knowledge-graphs-cross-operator-incident-management`, proposing a Knowledge Graph (KG) framework for anomaly detection and incident management.
        * The goal is to learn incident signatures and remediation procedures, and facilitate their sharing.
        * Proposed constructing KGs by combining "digital map" concepts with operational data, OSS data, and YANG-based configuration data.
        * Advocated for a "YANG-KG Semantic Generalization" strategy, using a network of ontologies (including a higher-level meta-ontology) to abstract and integrate diverse YANG models for better sharing and behavioral modeling, rather than just direct translation.
        * Presented a six-use-case roadmap for experimental validation of this strategy using semantic web technologies.
    * **INSA (Alex Lopez) - Anomaly Detection in ISP:**
        * Presented a collaborative project (NII, Swisscom, INSA, BMC) on a rule-based anomaly detection platform mimicking operator actions for BGP/MPLS VPN environments.
        * The platform aggregates "concern scores" from various checks (e.g., traffic counters, prefix withdrawals, interface state changes) to alert the NOC.
        * Current focus extends to monitoring Autonomous Systems for disruptions (e.g., lost top talker) and anomalies (e.g., traffic shifting from setpeer to transit providers, impacting costs).
        * Demonstrated two use cases: detecting a lost top talker (based on IPFIX and BGP data) and identifying traffic shifts leveraging BGP communities.
        * Highlighted that operators want not just alerts, but also understanding of *why* they are alerted, advocating for IETF standards and open solutions based on operator feedback.
* **NMOP WG Documents:**
    * **Architecture for Network Anomaly Detection Framework (Wan Ting):**
        * Recapped `draft-ietf-nmop-anomalydetection-framework`, which proposes a generic architecture to standardize comparison, exchange, and integration of different anomaly detection systems.
        * The architecture aims to extract common components, focusing on network-specific anomaly detection across forwarding, control, and management planes.
        * Described four phases: Data Collection & Processing, Anomaly Detection, Refining/Learning, and Replay.
        * Acknowledged feedback from the last IETF, including suggestions for document structure optimization and terminology consolidation (e.g., "service," "customer," "symptom").
        * A suggestion was made to map the architecture to common business processes (e.g., Incident Management, Network Design) for greater tractability.
    * **Experiment on Network Anomaly Life Cycle (Vincenzo Sciancalepore):**
        * Presented `draft-ietf-nmop-anomalydetection-lifecycle`, which defines a continuous improvement process for anomaly detection systems through three stages: Detection, Validation, and Refinement.
        * The core contribution is codifying learning into "labels" that describe "symptoms" (referencing `anomaly-metadata` draft) and a network anomaly data model.
        * Introduced a "Label Store" component for managing these labels across various actors (Network Engineers, Automatic Detectors, Data Scientists, Automatic Refiners).
        * Highlighted requirements for the API: semantic consistency, human/machine readability, interoperability, and enabling expert validation.
        * Presented the open-source project "Antagonist" as a proof of concept, demonstrating human-based and machine learning-based detection and refinement, using "confidence score" and "concern score" for prioritization and feedback.
        * The roadmap includes validating with rule-based approaches (e.g., SAN) and integrating with the Swisscom lab. The authors seek feedback from other operators on whether the cycle and data model fit their processes.
    * **Terminology Discussion (Adrian Farrel):**
        * Adrian brought up the `terminology` draft, asking for feedback on the definition of "anomaly" and its reference to the newly adopted `anomaly-architecture` draft.
        * He also encouraged review of "incident" (which references the incident YANG draft) and "symptom" definitions.
        * There was a strong sense among participants that precise terminology is crucial for avoiding miscommunication and facilitating common understanding in the working group.

## Decisions and Action Items

* **Decision:** `draft-ietf-nmop-anomalydetection-framework` was adopted by the NMOP working group.
* **Action Item:** Authors of `draft-ietf-nmop-anomalydetection-framework` to address comments from the last IETF and incorporate suggestions (e.g., terminology consolidation, mapping to business processes).
* **Action Item:** Adrian Farrel to update references in the `terminology` draft to point to the adopted `anomaly-architecture` draft.
* **Action Item:** Working group members are encouraged to review the `terminology` draft (`anomaly`, `incident`, `symptom` definitions) and provide feedback on the mailing list.
* **Action Item:** Mahesh to provide a reference for a PhD student working on correlating configuration changes to outages, with a suggestion for them to present their work to the NMOP WG.

## Next Steps

* **Operator Contributions:**
    * Continue documenting and sharing "lessons learned" and "quick wins" from operator outage experiences.
    * Further investigate means for auto-generation of correlation and efficient mitigation, exploring candidate technical approaches and the utility of data annotation.
* **Knowledge Graph Experiment:**
    * Call for contributions and collaboration on the Knowledge Graph experiment, with a focus on anomaly detection. Michael Gasser expressed interest and noted his recently posted `draft-gasser-nmop-knowledge-graph-framework-for-net-ops`.
    * Leonel Tadeu and Thomas Graf noted the potential of Knowledge Graphs to aid causality analysis.
* **WG Document Progress:**
    * **Architecture Framework:** Continue refining the `draft-ietf-nmop-anomalydetection-framework` based on feedback, potentially adding more application examples and open-source code references.
    * **Anomaly Life Cycle:** Finalize validation of the "Antagonist" proof of concept with rule-based approaches (e.g., SAN) by November 2024. Integrate with Swisscom lab environment and seek feedback from other operators on the life cycle process and data model.
    * **Terminology:** Continue the discussion on the `terminology` draft on the mailing list to ensure clarity and precision for key concepts.
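
For readers unfamiliar with the "concern score" idea from the INSA presentation, the following is a minimal illustrative sketch of how scores from several checks might be combined into one alerting decision. It is not taken from the presented platform or from any draft: the weighted-average rule, the `0.5` threshold, the check names, and the `CheckResult`, `aggregate_concern`, and `should_alert` names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CheckResult:
    """Outcome of one rule-based check (hypothetical structure)."""
    name: str      # illustrative check identifier
    score: float   # concern score, assumed to lie in [0, 1]
    weight: float  # assumed relative importance of the check


def aggregate_concern(results: Sequence[CheckResult]) -> float:
    """Combine per-check concern scores with a weighted average.

    The weighted average is an assumption; a real platform may use a
    different rule (for example the maximum score).
    """
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        return 0.0
    return sum(r.score * r.weight for r in results) / total_weight


def should_alert(results: Sequence[CheckResult], threshold: float = 0.5) -> bool:
    """Alert the NOC when the aggregate concern reaches the assumed threshold."""
    return aggregate_concern(results) >= threshold


# Illustrative values only, loosely following the checks named in the minutes.
results = [
    CheckResult("traffic_counter_drop", score=0.9, weight=2.0),
    CheckResult("prefix_withdrawals", score=0.6, weight=1.0),
    CheckResult("interface_state_change", score=0.0, weight=1.0),
]
overall = aggregate_concern(results)
print(f"aggregate concern: {overall:.2f}, alert: {should_alert(results)}")
```

Keeping the per-check results alongside the aggregate is one way to address the point that operators want to understand *why* they are alerted: the individual scores show which checks contributed to the decision.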