Session Date/Time: 17 Mar 2026 01:00
IRTFOPEN
Summary
The IRTFOPEN session at IETF 125 focused on the intersection of internetworking research and Artificial Intelligence (AI). Chair Dirk Kutscher opened the session by highlighting the goal of identifying constructive roles for the IRTF in addressing challenges posed by distributed AI systems, agent communication, and the merging of distributed computing with networking. The session featured three technical presentations covering disaggregated LLM inference architectures, reliability engineering for AI infrastructure, and the security and naming requirements for AI agent communication.
Key Discussion Points
Disaggregated Architecture for LLM Inference (Mooncake)
Mincheng Cheng presented Mooncake, a KV cache-centric disaggregated architecture designed to handle the increasing costs of LLM inference.
- Challenges: Large-scale inference is constrained by GPU supply, high costs, and response-time targets: time to first token (TTFT) and time between tokens (TBT). Scaling laws (data, model size, context length) exacerbate these issues.
- Architecture: Mooncake separates the prefill stage (computation-bound) from the decoding stage (memory-bandwidth-bound). It treats the pooled DRAM and SSDs across GPU servers as a single large KV cache memory pool.
- Transfer Engine: A core component utilizing multi-NIC topology, zero-copy RDMA, and GPU-Direct RDMA.
- Evolution (Next Gen): Development of a unified memory segment abstraction to handle heterogeneous hardware and protocols. New features include dynamic per-packet/segment load balancing (adaptive routing) and self-healing mechanisms (falling back from RDMA to TCP in case of network failure).
- Discussion: Participants (including Dave and Xing Jiang) queried the applicability of RDMA over Ethernet (RoCE) versus wide-area networking and the integration of congestion control. Mincheng noted that while currently focused on data center RDMA, the system is becoming increasingly network-aware to handle stragglers and latency-critical Expert Parallelism (EP) traffic.
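The self-healing behaviour noted above (falling back from RDMA to TCP on network failure) can be illustrated with a minimal sketch. This is not Mooncake's API: `Transport`, `transfer_with_fallback`, and the two simulated backends are hypothetical names, and the "RDMA" path here is simply a function that always fails.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


class TransportError(Exception):
    """Raised by a transport when a transfer cannot complete."""


@dataclass
class Transport:
    name: str
    send: Callable[[bytes], int]  # returns bytes sent, raises TransportError


def transfer_with_fallback(payload: bytes, transports: List[Transport]) -> Tuple[str, int]:
    """Try each transport in preference order; return (name, bytes_sent)."""
    errors = []
    for transport in transports:
        try:
            return transport.name, transport.send(payload)
        except TransportError as exc:
            errors.append(f"{transport.name}: {exc}")
    raise TransportError("all transports failed: " + "; ".join(errors))


# Simulated backends (assumptions for illustration only).
def failing_rdma(payload: bytes) -> int:
    raise TransportError("link down")


def tcp_send(payload: bytes) -> int:
    return len(payload)


used, sent = transfer_with_fallback(
    b"kv-cache-block",
    [Transport("rdma", failing_rdma), Transport("tcp", tcp_send)],
)
print(used, sent)  # tcp 14
```

A production transfer engine would presumably also probe the failed path and switch back once it recovers; that is omitted here.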
Reliability Engineering – Challenges in Networking for AI
Hong Xu discussed the necessity of autonomous reliability engineering for AI infrastructure.
- Complexity: The massive scale (e.g., 100k+ GPUs) and co-designed software/hardware stacks make manual root-cause analysis nearly impossible.
- Proposed Solutions:
  - Open Arena: A proposed open benchmarking environment for injecting failures and testing diagnostic agents.
  - TSGuard: A diagnostic agent developed with Microsoft Azure that uses a tiered approach (RAG for recurring issues, reasoning for complex new incidents).
  - MyCraft: A fine-grained tracing tool for collective communications (e.g., AllReduce) that provides real-time progress logs to identify "stuck" operations.
- Research Needs: Developing taxonomies for failures, managing long context lengths for diagnostic agents, and ensuring cross-model communication robustness.
- Discussion: Rod questioned the overhead of these monitoring tools. Hong Xu clarified that MyCraft’s impact on training latency is minimal (<1%). Roberta raised concerns regarding LLM hallucinations in troubleshooting; Hong Xu emphasized the importance of "scaffolding" and verifying model guesses against benchmarking scripts.
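The idea of using progress logs to spot "stuck" collective operations can be sketched as follows. This is an illustration under assumed inputs, not MyCraft's log format or algorithm: the `Progress` record, the `find_stuck` helper, and the timeout rule are all hypothetical, and the records are assumed to belong to a single in-flight collective.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Progress:
    rank: int          # participant in the collective
    chunks_done: int   # cumulative chunks completed so far
    timestamp: float   # seconds


def find_stuck(records: List[Progress], now: float, timeout: float) -> List[int]:
    """Return ranks whose chunk counter has not advanced within `timeout`."""
    last_chunks: Dict[int, int] = {}
    last_advance: Dict[int, float] = {}
    for rec in sorted(records, key=lambda r: r.timestamp):
        # Only a strictly larger counter counts as progress.
        if rec.chunks_done > last_chunks.get(rec.rank, -1):
            last_chunks[rec.rank] = rec.chunks_done
            last_advance[rec.rank] = rec.timestamp
    return sorted(r for r, t in last_advance.items() if now - t > timeout)


records = [
    Progress(0, 1, 0.0), Progress(1, 1, 0.0),
    Progress(0, 2, 1.0), Progress(1, 1, 1.0),
    Progress(0, 3, 2.0), Progress(1, 1, 2.0),
]
stuck = find_stuck(records, now=2.5, timeout=1.0)
print(stuck)  # [1]
```

Rank 1 keeps reporting but never advances past one chunk, so it is flagged; rank 0 advanced 0.5 s ago and is not.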
On AI Agent Communication
Lixia Zhang addressed the networking and security requirements for autonomous AI agents.
- New Front of Networking: Agent communication patterns are not inherently new (I-to-J), but their scale, extreme dynamics (transient agents), and multi-party autonomy challenge current paradigms.
- Security Failures: Existing security solutions (Web PKI, OAuth, centralized CAs) are viewed as "late add-ons" that lack a unified namespace. The reliance on centralized cloud providers for identity and tokenization is a bottleneck for real-time agent interaction.
- Proposals:
  - Unified Namespace: Advocated for the DNS namespace to serve as a foundation for agents, users, and organizations.
  - Decentralized Trust: Move toward localized trust management and decentralized CAs to handle the lifecycle of billions of agents.
  - 3As: Re-invigorating Authentication, Authorization, and Audit (accountability) as core architectural requirements.
- Physical Agents: Noted that physical agents (robots/industrial AI) introduce real-time constraints and the risk of physical damage, necessitating even stricter safety and communication guarantees.
- Discussion: Dave questioned if agents could simply use their owner's identity. Lixia argued that while an owner provides the "party" for accountability, agents need their own cryptographic identities for autonomous actions and delegation chains.
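The notion of a delegation chain rooted in an owner's DNS name can be made concrete with a small structural sketch. It is an assumption-laden illustration, not a proposal from the talk: the `Delegation` record, `chain_allows`, and the example names under `example.org` are hypothetical, and signature verification (which real per-agent cryptographic identities would require) is deliberately omitted.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Delegation:
    issuer: str              # DNS-style name of the delegating party
    subject: str             # DNS-style name of the agent receiving authority
    scopes: FrozenSet[str]   # actions the subject may perform


def is_under(name: str, zone: str) -> bool:
    """True if `name` equals `zone` or is a subdomain of it."""
    return name == zone or name.endswith("." + zone)


def chain_allows(chain: List[Delegation], root: str, action: str) -> bool:
    """Check issuer/subject linkage, namespace containment, and scope narrowing.

    NOTE: structural checks only; no signatures are verified in this sketch.
    """
    expected_issuer = root
    allowed: Optional[FrozenSet[str]] = None
    for link in chain:
        if link.issuer != expected_issuer or not is_under(link.subject, root):
            return False
        if allowed is not None and not link.scopes <= allowed:
            return False  # a link may narrow scopes but never widen them
        allowed = link.scopes
        expected_issuer = link.subject
    return allowed is not None and action in allowed


chain = [
    Delegation("example.org", "planner.agents.example.org", frozenset({"read", "order"})),
    Delegation("planner.agents.example.org", "buyer.agents.example.org", frozenset({"order"})),
]
print(chain_allows(chain, "example.org", "order"))  # True
print(chain_allows(chain, "example.org", "read"))   # False
```

The owner remains the accountable root of the chain, while each agent is named separately and holds only the scopes delegated to it.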
Discussion: Internetworking Challenges for AI
Dirk Kutscher moderated a general discussion on the IRTF's role.
- Data Sharing: There is a strong need for shared datasets of network failures, though privacy and proprietary concerns remain a hurdle.
- Taxonomy: A sense of the room indicated interest in developing a formal taxonomy of failures for AI networking (e.g., NIC vs. switch vs. software stack).
- Coordination: Recognition that agentic AI provides a significant push toward decentralized networking (DINRG) and requires architectural shifts rather than incremental protocol changes.
Decisions and Action Items
- No formal consensus is established in IRTF sessions; however, there was a clear interest in further exploring a "failure taxonomy" and decentralized identity for agents.
Next Steps
- A follow-up session is scheduled for later in the week to continue the discussion on specific research topics.
- Chair Dirk Kutscher will summarize the themes from this session to inform future IRTF activity planning regarding AI workloads.
Session Date/Time: 18 Mar 2026 06:00
IRTFOPEN
Summary
The IRTFOPEN session at IETF 125 included the IRTF Chair's administrative updates, the presentation of two Applied Networking Research Prize (ANRP) awards, and a structured discussion regarding internetworking challenges for Artificial Intelligence. The session highlighted research into transport-level encryption for data centers, burstiness control for real-time communication, and the future role of the IRTF in addressing AI-driven systems architecture, naming, and trust.
Key Discussion Points
IRTF Chair’s Presentation
Dirk Kutscher provided an overview of recent IRTF activities:
- Workshops: Reports were shared from the "Internetworking Challenges for AI" workshop (CoNEXT, Dec 2024) and the HKUST Internet Research Workshop.
- Travel Grants: Grants are available for early-career academics and PhD students. The deadline for the July meeting in Vienna is March 27.
- Applied Networking Research Workshop (ANRW): The call for papers is open until April 17.
- ANRP Awards: Six awards were made out of 70 nominations for 2025. Two award-winning papers were presented during this session.
- Slides: Chair's Presentation
ANRP: Designing Transport Level Encryption for Data Center Networks
Tianchi Gao presented SMT (Secure Message Transport), a protocol designed to provide TLS 1.3-level security for data center RPCs while avoiding TCP-induced head-of-line (HoL) blocking.
- Technical Details: SMT uses a message-based abstraction rather than a byte stream. It leverages existing hardware offloading (TSO and TLS offload) by using a packet format that mimics TCP headers for the NIC while using a non-TCP IP protocol number. It implements a per-message record sequence number space to allow out-of-order delivery without breaking replay protection.
- Performance: Evaluation showed SMT outperforms kTLS by 20-30% in unloaded latency and significantly improves tail latency in NVMe-over-Fabric and Redis tests.
- Discussion:
  - Stuart Cheshire questioned the frequency of HoL blocking in well-managed data centers using DCTCP. Tianchi Gao clarified that even without loss, core-sharing and large message interference remain issues.
  - Antoine Fressancourt queried the use of TLS for key exchange and why SMT didn't use lower-level cryptographic offloading. Tianchi Gao noted that integrated TSO/TLS offload in the NIC minimizes latency compared to external crypto engines.
- Slides: Designing Transport Level Encryption for Data Center Networks
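The per-message record sequence number space mentioned in the technical details can be illustrated with a simple replay check. This is a sketch under stated assumptions, not SMT's wire format or state machine: `ReplayGuard` and its `(message_id, record_seq)` key are hypothetical, and a real implementation would bound the state it keeps.

```python
from collections import defaultdict
from typing import DefaultDict, Set


class ReplayGuard:
    """Track seen record numbers separately for each message."""

    def __init__(self) -> None:
        self._seen: DefaultDict[int, Set[int]] = defaultdict(set)

    def accept(self, message_id: int, record_seq: int) -> bool:
        """Return True for a fresh record, False for a replayed one."""
        seen = self._seen[message_id]
        if record_seq in seen:
            return False
        seen.add(record_seq)
        return True


guard = ReplayGuard()
# Records from two messages arrive interleaved and out of order;
# the last two arrivals are duplicates.
arrivals = [(1, 0), (2, 0), (1, 2), (1, 1), (2, 0), (1, 2)]
results = [guard.accept(m, s) for m, s in arrivals]
print(results)  # [True, True, True, True, False, False]
```

Because each message has its own sequence space, record 0 of message 2 is accepted even though record 0 of message 1 was already seen, and reordering within or across messages does not trigger a false replay.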
ANRP: Sending Burstiness Control for High-Quality Real-Time Communication
Xiangjie Huang presented ACE (Adaptive Control of Burstiness and Encoding), a system to manage pacing delay in Real-Time Communication (RTC).
- Technical Details: ACE addresses "pacing delay" caused by bursty video frames overshooting network buffers. It has two parts. ACE-N (network side) adaptively adjusts the token bucket size based on the estimated network queue size. ACE-C (encoder side) increases encoding complexity to shrink outlier frames (tail frames) instead of reducing quality (CBR).
- Results: Deployment at ByteDance showed a 15% reduction in stall rates for cloud gaming.
- Discussion:
  - Christian Huitema noted the tension between BBR-style pacing and video frame bursts, suggesting the trade-off of CPU for bandwidth is a promising research direction.
  - Stuart Cheshire commented on the relevance to L4S deployment, where applications must pace traffic to avoid penalties from queue protection functions.
- Slides: Sending Burstiness Control for High-Quality Real-Time Communication
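A token bucket whose size follows an estimate of network queuing, as described for ACE-N, might look like the sketch below. The adaptation rule is an assumption made for illustration (the bucket shrinks linearly as the estimated queue approaches a budget); the class name, parameters, and rule are hypothetical and are not taken from the ACE paper.

```python
class AdaptiveTokenBucket:
    """Token-bucket pacer whose burst allowance tracks a queue estimate."""

    def __init__(self, rate_bps: float, min_bucket: int, max_bucket: int) -> None:
        self.rate = rate_bps / 8.0      # refill rate in bytes per second
        self.min_bucket = min_bucket    # smallest allowed burst, bytes
        self.max_bucket = max_bucket    # largest allowed burst, bytes
        self.bucket = max_bucket        # current burst allowance, bytes
        self.tokens = float(max_bucket)
        self.last = 0.0                 # time of the last refill, seconds

    def update_bucket(self, queue_bytes: int, queue_budget: int) -> None:
        """Assumed rule: less queue headroom means a smaller bucket."""
        headroom = max(0.0, 1.0 - queue_bytes / queue_budget)
        self.bucket = int(self.min_bucket + headroom * (self.max_bucket - self.min_bucket))
        self.tokens = min(self.tokens, self.bucket)

    def try_send(self, size: int, now: float) -> bool:
        """Refill tokens for elapsed time, then send if enough are available."""
        self.tokens = min(self.bucket, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False


tb = AdaptiveTokenBucket(rate_bps=8_000_000, min_bucket=1500, max_bucket=15000)
first = tb.try_send(12000, now=0.0)                      # large burst fits while the queue is empty
tb.update_bucket(queue_bytes=30000, queue_budget=30000)  # queue estimate hits the budget
second = tb.try_send(3000, now=0.001)                    # burst no longer fits
third = tb.try_send(1500, now=0.001)                     # one MTU-sized packet still goes out
print(first, second, third)  # True False True
```

With an empty queue the pacer lets most of a frame out at once, which keeps pacing delay low; once the queue estimate rises, the same frame is spread out packet by packet.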
Internetworking Challenges for AI (Follow-up Discussion)
Dirk Kutscher summarized the discussion from the previous day's dedicated session, highlighting three research themes:
- Frameworks: Testbeds, benchmarks, and datasets for reproducible AI networking research.
- Naming/Identity: Principled trust delegation and naming for autonomous AI agents.
- Co-design: Integration of networking and distributed computing (e.g., KV-cache-centric architectures).
Discussion Participants:
- Rodney Grubbs questioned whether these were uniquely AI problems or broader systems architecture issues, noting that Large Language Models (LLMs) represent significant, hard-to-migrate state.
- Dave Plonka emphasized the distinction between "trust delegation" and "responsibility," arguing that humans must remain responsible for agent actions.
- Lixia Zhang (presenter of On AI Agent Communication) argued that AI agents change the system qualitatively due to their collaborative nature and the need for sub-millisecond local trust in edge scenarios (e.g., manufacturing robots).
- Christian Huitema warned against focusing only on naming (the "lamppost effect") while ignoring the massive centralization caused by the cost of training models.
- Wes Hardaker and Colin Perkins discussed the IRTF's role, suggesting it serves best as a venue to report novel research results (similar to MAPRG) rather than trying to coordinate fast-moving industrial engineering.
- Lars Eggert noted that a dedicated venue for AI research in the IRTF would be beneficial for the overall IETF community to guide newcomers looking for AI topics.
Referenced Materials:
- Disaggregated Architecture for LLM Inference (Mooncake)
- Reliability Engineering – Challenges in Networking for AI
- On AI Agent Communication
Decisions and Action Items
- None (the IRTF does not make standards-track decisions).
Next Steps
- Travel Grant Applications: Interested PhD students and early-career academics should apply by March 27.
- ANRW Submission: Research papers for the summer workshop in Vienna must be submitted by April 17.
- Continued AI Discussion: Participants are encouraged to move the discussion on AI internetworking challenges to the IRTF mailing list.
- SMT Protocol: Tianchi Gao signaled intent to submit an Internet-Draft on SMT before the next IETF meeting.