
Session Date/Time: 17 Mar 2026 01:00

Dirk Kutscher: Okay, good morning. This is IRTF Open, the first of two sessions this week. We want to dedicate this session to a discussion of internetworking research challenges for AI. We put this together cooperatively: this is Antoine Fressancourt, who helped with the organization, and I'm Dirk Kutscher, the IRTF chair.

And quickly, before we get started: you already agreed to this when you registered for the IETF, but we are operating under the IETF IPR rules, so let us know promptly if you are aware of any IPR that applies to what is discussed here. This session is being recorded and streamed live. There are also a few rules regarding privacy and code of conduct: please observe the IETF Code of Conduct and the Anti-Harassment procedures, as well as the IRTF Code of Conduct, which makes additional statements about ethics in research and a few other things. And please use the Meetecho tools and the queue management system for the discussion later.

A quick reminder, which I think is particularly in order since the topic is being discussed so widely and intensively this week: there is a lot going on around AI, agent communication, and so on. We are here at the Internet Research Task Force, so nothing we do will result directly in standards. We are doing research: we want to enable experiments and gain insights that could later inform IETF standards work, for example. That is why we are often less concerned with short-term standards or engineering solutions and more with bigger research challenges. That is also a little bit the spirit of this meeting.

We will have AI note-taking as well, but we still need a note-taker in the room who can capture the gist of the discussion. Can we get a volunteer? Ah, thank you very much, that's great. We'll use the online note-taking tool, and anybody is welcome to chip in additional observations.

Okay, so why are we holding this meeting? We have been discussing research challenges in networking for AI for quite a while. We have held side meetings on this, and it is obviously one of the major topics in networking and distributed systems research at the moment. The goal of this meeting is to discuss further and find out what the IRTF could do constructively in this field: what would be a good role for the IRTF, and what would be good topics? As these systems become more pervasive, more distributed, and deployed in more scenarios, not only proprietary data centers, questions arise as to how this relates to internetworking, protocol engineering, and protocol design, but perhaps also to new approaches to the overall stack in this field, for example at the intersection of distributed computing and networking. We hope to gain more insights, discuss this further with your help, and then potentially come to some conclusions in the near future.

In case you are not aware, there was a previous ACM CoNEXT workshop in December last year that we organized, and I think it went pretty well. We had really diverse presentations from different companies and academics working in this field, and I left that workshop quite confident that there is really interesting research to be done here.

Okay, so this is the agenda for today. We have three talks: one by Mincheng Cheng on "Disaggregated Architecture for LLM Inference"; one by Hong Xu on "Reliability Engineering – Challenges in Networking for AI"; and one by Lixia Zhang on "AI Agent Networking." As you can see, we left quite a bit of time for discussion. After each of these talks we'll have the usual Q&A, and at the end we hope to discuss the overall topic with you; we are looking forward to your suggestions and comments. Is there any other suggestion, any need for agenda bashing? Great. Okay, then I'll bring on the first speaker in just a moment. Mincheng, can you hear me? Okay, great. Give me just a second. Okay, thanks very much for coming.

Mincheng Cheng: Okay, thank you for the invitation. It's my honor to present here. I come from the distributed systems community, and our distributed systems are now more and more intertwined with the network. So it's my honor to present some of the new challenges, and how we solve them, in order to enable a disaggregated architecture for large language model inference.

As you know, just yesterday, or rather this morning, GTC 2026 took place, and Jensen Huang also said that we are now in the era of LLM inference. With the growth of so-called LLM agents, the cost of inference is now much, much larger than that of training. At the beginning, with ChatGPT, we were just chatting with the AI, which is essentially a single-turn response with short inputs and short outputs. But now we are using agents, such as OpenDevin, which execute multi-turn, very complex LLM interactions. Each interaction has very long inputs, to gather all the context needed to infer the corresponding output, and the output, due to the thinking pattern, may involve a lot of reasoning, maybe 1K or 2K tokens, before the final result. This difference has made the inference cost much, much larger than the training cost, and we are now focusing on how to reduce it.

To reach higher intelligence, LLM researchers typically rely on the so-called scaling laws. We currently scale three main factors: first, we gather more high-quality data; second, we train larger models; and third, we enable longer context. Because of agentic workloads, the context is now often longer than 64K or even 128K tokens, much longer than in the original use cases. But this kind of scaling not only brings higher intelligence; it also brings three challenges. The first is a lack of GPU supply: even though GPU supply is still increasing, it is not enough. The second is higher inference cost: if you run something like OpenDevin from your laptop, you know that tokens are still very expensive. The third is longer response time: we want interactive responses, but currently you might give the system one task and get the result back 10 or 20 minutes later.

So the problem becomes very apparent: how can we improve the inference MFU [Model FLOPS Utilization], that is, maximize the FLOPS utilization of our GPUs? And second, how can we reach this goal without sacrificing the user experience? If we increase MFU but at the cost that each request only outputs 10 tokens per second, that leads to even longer response times, which is not what users want.

One of the most important solutions is to use heterogeneous hardware. For example, in China we have two kinds of GPUs, both shaped by export restrictions: the H800, which is better for FLOPS, for computation; and the H20, which is better for bandwidth because its computation is restricted. We also have traditional DRAM and LPDDR solutions attached to the CPU, which are better for capacity, because even though DDR is more expensive than last year, it is still less expensive than HBM. And this is just an illustration: as the hardware ecosystem grows, there are even more kinds of heterogeneous hardware. For example, at today's GTC, Jensen presented the LPU, the Groq-based solution, which is best for bandwidth. And earlier they presented the CPX solution, which uses GDDR: not that good at bandwidth, but with a lot of TFLOPS, so very good at computation. How to exploit the different strengths offered by different kinds of devices is one of the main goals of this disaggregated architecture for LLM inference.

This is possible because of two critical properties of large language model inference. The first is that it has two distinct stages. One is prefill, which we optimize for Time To First Token (TTFT): because it processes all input tokens in parallel, it is heavily compute-bound. The other is decode, where we optimize for Time Between Tokens (TBT): because of the auto-regressive nature of large language models, it outputs one token per step and consumes a lot of bandwidth. And because the long context produces a lot of KV cache, it also consumes a lot of memory capacity. So one stage, prefill, consumes computation, and the other stage, decode, consumes bandwidth.
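To make the two stages concrete, here is a minimal sketch of a generic transformer inference loop. The model API (`prefill`, `decode_step`, `eos_token_id`) is hypothetical, not any particular engine's interface:

```python
# Minimal sketch of the two inference stages (hypothetical model API).

def generate(model, prompt_tokens, max_new_tokens):
    # Prefill: one parallel pass over the whole prompt (compute-bound).
    # Produces the KV cache and the first output token; this drives TTFT.
    kv_cache, token = model.prefill(prompt_tokens)
    output = [token]

    # Decode: auto-regressive, one token per step (bandwidth-bound).
    # Each step re-reads the whole KV cache; the step rate drives TBT.
    for _ in range(max_new_tokens - 1):
        kv_cache, token = model.decode_step(token, kv_cache)
        output.append(token)
        if token == model.eos_token_id:
            break
    return output
```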

The second property is the KV cache. Also because of the auto-regressive nature of large language models, every token only attends to the tokens before it, never to any token after it. So if two requests share the same prefix, they can skip the computation for those shared tokens and directly reuse the intermediate result produced before, which we call the KV cache. And because we are now using agents (Claude, OpenDevin, and so on), all of which try their best to preserve the prefix, even though each request may have 64K or even 128K input tokens, 80% or even 90% of the prefix will be the same as in the previous turn. So the KV cache hit ratio can be very high, as high as 80% to 90%, and reusing the KV cache is one of the most important ways to reduce the cost of LLM inference.
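As an illustration, here is a sketch of prefix-based KV cache reuse. The `cache_pool` object and its `longest_prefix` lookup are hypothetical names, not the actual Mooncake API:

```python
# Sketch of prefix-based KV cache reuse (illustrative, not Mooncake's API).
# If a new request shares a prefix with a cached one, only the suffix
# needs prefilling; the shared prefix's KV cache is fetched instead.

def prefill_with_reuse(model, cache_pool, tokens):
    # Longest cached prefix of `tokens`, e.g. keyed by token-block hashes.
    hit_len, cached_kv = cache_pool.longest_prefix(tokens)
    # With agentic workloads, hit_len is often 80-90% of len(tokens),
    # so only a small suffix is actually computed here.
    kv = model.prefill(tokens[hit_len:], past_kv=cached_kv)
    cache_pool.put(tokens, kv)  # make the full prefix reusable next turn
    return kv
```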

With these properties, which are inherent to large language models, we can naturally use different hardware for the different stages of LLM inference: devices better at computation (such as the H800 or the CPX solution) for the prefill stage, devices better at bandwidth for the decode stage, and devices better at capacity (DRAM, SSD) for holding a lot of KV cache for future reuse.

Based on this basic idea, we have built a flourishing open-source community called Mooncake. The left part of the slide shows the Mooncake architecture: it runs on a wide range of hardware, can manage a very complex network topology, and contains a lot of functionality supporting other communities in LLM inference. We are honored that just this month we officially joined the PyTorch ecosystem, and we are in close collaboration with many other projects in the Torch community.

This project originated in a collaboration between my group at Tsinghua University and Moonshot AI, the producer of the Kimi K2 and K2.5 series of models, currently among the state-of-the-art open-weight large language models. And because Moonshot AI and Kimi were focused on long context from day one, we experienced the challenges I mentioned much, much earlier than other providers.

Based on these challenges, we made three contributions in our proposed Mooncake architecture. The first is PD disaggregation [Prefill-Decode Disaggregation]. In terms of openly published papers, we were not the first to publish the idea of PD disaggregation; that goes back to work from Microsoft and Peking University. But we were the first to openly report a successfully deployed, very large-scale PD-disaggregated LLM serving architecture. The second is a KV cache-centric architecture that pools the DRAM of all the GPU servers into one very large memory pool holding all the KV cache, to maximize KV cache reuse. And the third is a set of scheduling policies specially designed for these scenarios.

I will discuss these one by one. First, PD disaggregation: as you can see, we use different device pools for the different stages. Originally this PD disaggregation was only used by Moonshot in real production, but now it is the status quo, because we are now serving very large language models, pioneered by DeepSeek and later Kimi K2 and K2.5; these are terabyte-scale models. In this case you typically need different parallelism strategies: for example, a small parallelism strategy such as TP8 for prefill, and a large EP [Expert Parallelism], or so-called wide EP, strategy for the decode stage. So this disaggregation not only brings the benefit of exploiting heterogeneous hardware; it also gives us the flexibility to use different strategies at different stages. This has become the status quo for serving large language models.

Second, the KV cache. The KV cache is actually very large. Originally we said we need a lot of VRAM because we have large models, but actually the KV cache is the real killer of your HBM capacity: it grows linearly with your context length, linearly with your model size, and linearly with your batch size. As a result, it grows very fast, because all three of these parameters are growing. In real production we may have terabytes or even petabytes of reusable KV cache, so we need a very large memory pool to store it.
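As a back-of-the-envelope illustration (with made-up model parameters, not any specific production model), the per-request KV cache size can be computed like this:

```python
# KV cache sizing sketch. Per token, the cache stores K and V for every
# layer and KV head, so it grows linearly in context length, model depth,
# and batch size. All parameters below are illustrative.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, dtype_bytes,
                   context_len, batch_size):
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * context_len * batch_size

# Example: 64 layers, 8 KV heads (GQA), head_dim 128, FP16, 128K context.
per_request = kv_cache_bytes(64, 8, 128, 2, 128 * 1024, 1)
print(per_request / 2**30)  # ~32 GiB of KV cache for a single request
```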

And gradually Mooncake has come to be used for more than PD disaggregation. As you can see in the middle, we have the Mooncake transfer engine for PD disaggregation, transferring data from the prefill stage to the decode stage. We also use the transfer engine to share KV cache, transferring it from DRAM and SSD to the inference engine. And now we are also building Mooncake Elastic EP, which will be used for EP parallelism communication. Many different solutions are all built on this transfer engine; the transfer engine is the core component of Mooncake. It is deeply tied to the network, and how to use the network well is one of our most important challenges.

And it is not only an academic paper that received a Best Paper award at USENIX FAST; it is also a system widely deployed at Moonshot AI, Zhipu AI, Alibaba, Huawei, NVIDIA, and many other companies, running on not just tens or hundreds of thousands of GPUs but possibly millions. It is integrated with the two famous inference engines, vLLM and SGLang, both of which rely on our transfer engine to orchestrate their distributed inference. You may know that originally both vLLM and SGLang were best suited to the single-machine, multi-GPU scenario. But now, as we have discussed, we need distributed, disaggregated serving, so they need a network orchestration layer, and they collaborate with our community to make this possible. We have also collaborated with the Dynamo system from NVIDIA to standardize this part of the network communication and distributed orchestration.

In the second part of my talk, I will briefly describe what Mooncake is and what we are building for its next version. First, it must transfer data very fast. Second, it must store as much KV cache as possible. And finally, as a widely used library, it should be easy to use.

For the first part, the transfer engine takes advantage of the multi-NIC [Network Interface Card] topology. You should not use just one network interface card: every machine now has eight, and you should pool them to make the best use of the aggregated bandwidth. And because the aggregated bandwidth is so high, you must reduce your CPU cost as much as possible, otherwise you will be bounded either by CPU contention or by local DRAM bandwidth. So we are obsessed with the zero-copy property: we ensure a single end-to-end copy, from the source to the destination, with no intermediate buffers. As a result, we are much faster than existing solutions based on TCP or gRPC; we build on RDMA or even GPU Direct RDMA.

Second, we pool the DRAM of all the GPU servers, and we are now extending this to the SSD devices on those servers, to build a very large pool for the KV cache, so it can be reused across different turns of your agent. And finally, we provide a very simple, easy-to-use API: only Put and Get, very similar to a Redis key-value store, but zero-copy. You give it pointers directly to the source memory blocks and to the destination blocks, and we ensure a zero-copy transfer. I think this is different from existing solutions.
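The shape of such an interface might look like the following sketch; these class and method names are illustrative, not the actual Mooncake API:

```python
# Hypothetical shape of a zero-copy Put/Get store. Unlike a Redis client,
# values are never staged through an intermediate buffer: callers hand over
# raw pointers to registered memory, and the engine moves the bytes
# directly between source and destination (e.g. via RDMA).

class TransferStore:
    def __init__(self, engine):
        self.engine = engine  # wraps the multi-NIC RDMA transfer engine

    def put(self, key: str, src_ptr: int, length: int) -> None:
        # Allocate space for `key` in the pooled DRAM/SSD segment, then
        # write straight out of the caller's buffer: one copy, end to end.
        dst_ptr = self.engine.allocate(key, length)
        self.engine.transfer(src_ptr, dst_ptr, length)

    def get(self, key: str, dst_ptr: int, length: int) -> None:
        # Symmetric: bytes land directly in the caller's buffer.
        src_ptr = self.engine.lookup(key)
        self.engine.transfer(src_ptr, dst_ptr, length)
```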

This is the basis of Mooncake, and as you can see, it enabled a 500% increase in the SLO [Service Level Objective]-aware throughput, which we call the goodput, of Moonshot AI's serving infrastructure.

But this was the first version of Mooncake. After one or two years of large-scale deployment, we found we were facing many more challenges. The first is that the network itself is now also heterogeneous: we have RDMA across machines and, within the rack, interconnects such as NVLink. And we now have different GPU vendors, each producing its own kind of interconnect. This heterogeneity is a new problem we need to solve in the next version of the Mooncake transfer engine.

The reason is that in our original design, for simplicity, we chose a so-called imperative pattern: at start time, the transfer engine is statically bound to a certain protocol, whether it wants to use RDMA or NVLink. But now we want a fusion of all these different network channels.

Second, the system was originally deployed in small, separate serving instances, each containing relatively few GPUs, so reliability was not a huge problem. But now we have wide EP, with a lot of GPUs tightly connected to each other, so the failure blast radius becomes very large: if one GPU crashes, it can cascade and make many GPUs stop working. Ensuring fault tolerance and resilience is therefore the second problem we need to handle.

Here are several illustrations of these problems. As you can see, we were originally bound either to NVLink or to RDMA, but on current systems two GPUs have multiple paths between them, and we want to make use of the aggregated bandwidth of all of them. The second issue is static chunk scheduling: we evenly slice each payload into fixed-size chunks and statically scatter the chunks across the different network cards to use the aggregated bandwidth. But there are always stragglers, so bandwidth utilization is not as good as expected, especially in the long tail at the end of a transfer. Resilience is also a problem: if one network interface card crashes, we still have seven left, and we want communication to remain available to the upper-layer applications. But originally we would simply raise an error, and the upper-layer engines do not handle this kind of error, so they crash and require a full restart. That is no longer acceptable; we need a more resilient solution.

Based on these new challenges, we are actively developing the next version of the transfer engine, Transfer Engine NG. Its basic idea is a unified segment abstraction. We unify all kinds of resources (DRAM, VRAM, SSD) into memory segments, because all these memories, and even disks, can be addressed through a memory buffer and accessed randomly. Even though they are provided by different devices, we treat them as logically identical memory segments with random-access capability, and we register them into a global namespace. This unified abstraction makes better orchestration of heterogeneous hardware possible.
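A minimal sketch of such a unified segment abstraction might look as follows; the types and fields are illustrative assumptions, not Mooncake's actual data structures:

```python
# Sketch of a unified "memory segment" abstraction over heterogeneous
# devices. DRAM, VRAM, and SSD all expose a randomly addressable buffer,
# so they can share one registration path and one global namespace.

from dataclasses import dataclass
from enum import Enum

class Medium(Enum):
    DRAM = 1
    VRAM = 2
    SSD = 3

@dataclass
class Segment:
    name: str        # key in the global namespace
    medium: Medium   # backing device; logically all segments look alike
    base: int        # base address (or byte offset for SSD-backed segments)
    length: int      # capacity in bytes

class GlobalNamespace:
    def __init__(self):
        self.segments = {}

    def register(self, seg: Segment):
        # After registration, any node can address (name, offset, length)
        # without caring which physical medium backs the segment.
        self.segments[seg.name] = seg
```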

In the first step, we automatically probe the hardware information and map each network interface card to each memory segment. Using NUMA [Non-Uniform Memory Access] awareness, PCIe root complex awareness, and other topology awareness, we choose the network interface card with the best affinity. But even with the best choice, we still enable a whole chain of fallbacks: if the best route is not available, you can use the second route, and if the second route is not available, the third. In this new implementation, each route can be bound to a different protocol implementation: maybe the best route is over NVLink, the second best over RDMA, and as a final fallback we can still use TCP if there is some disaster in the RDMA network.
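The control flow of such a fallback chain could look like this sketch; the route and topology objects are hypothetical:

```python
# Sketch of topology-aware route selection with a fallback chain.
# Routes are ranked by affinity (same PCIe root complex / NUMA node
# first), and each route may be backed by a different protocol:
# NVLink, then RDMA, then TCP as a last resort.

class TransferError(Exception):
    pass

def transfer_with_fallback(segment, payload, topology):
    # Lowest affinity cost first, e.g. NVLink on the same root complex,
    # then an RDMA NIC on the same NUMA node, then a TCP path.
    routes = sorted(topology.paths_to(segment), key=lambda r: r.affinity_cost)
    for route in routes:
        try:
            return route.send(payload)
        except TransferError:
            continue  # this channel failed; fall back to the next route
    raise RuntimeError("all routes to segment failed")
```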

That is the first part. The second is that we enable dynamic per-request orchestration. This idea is very popular in traditional data center networking: per-packet routing, where each packet in a TCP stream is routed independently. In the original implementation of our transfer engine, we statically bound the route, so if that route became unavailable or congested, performance dropped sharply. In the new version, at the upper layer, we enable dynamic, telemetry-driven adaptive scheduling: we continuously probe the current load of each route, split the stream into many chunks, and route each chunk independently. With this solution and a simple congestion-based routing formula, we can achieve an even more balanced network.
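A minimal sketch of this telemetry-driven spraying, with hypothetical route objects and an illustrative chunk size:

```python
# Sketch of telemetry-driven per-chunk spraying. Instead of statically
# round-robining fixed chunks over NICs, each chunk goes to the currently
# least-loaded route, which evens out stragglers and congestion.

CHUNK = 256 * 1024  # illustrative chunk size

def spray(routes, payload):
    # route.inflight is assumed to be refreshed by background telemetry.
    ops = []
    for off in range(0, len(payload), CHUNK):
        route = min(routes, key=lambda r: r.inflight)  # least-loaded route
        route.inflight += CHUNK
        ops.append(route.send_async(payload[off:off + CHUNK], off))
    for op in ops:
        op.wait()  # a slow route now only delays its own chunks
```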

We also address cross-process fairness, because we are now enabling multi-tenancy on our network. The network does not carry just one kind of traffic: we use it for PD disaggregation, for KV cache sharing, and also for EP, so it is used in different ways by different users. We therefore need to enable fairness and a certain level of multi-tenancy SLO guarantees.

And last but not least, we need self-healing. Currently, if one network interface card fails, we can automatically fall back to another route; this is again an implementation-level optimization. And even if the entire RDMA network becomes unavailable to the inference engine, we can still fall back to TCP and continue working.

For the final result, we compared the next version of the transfer engine with the original one and also with an NVIDIA counterpart. As you can see, with all these optimizations the throughput is better, and the gain comes mainly from per-chunk dynamic spraying and independent routing. We also achieve lower latency along with the higher throughput. And we are extending its use beyond inference into reinforcement learning: RL has a training stage and an inference stage, and the large model checkpoints must be transferred from the training servers to the inference servers. This transfer is also very bandwidth-bound, and the next version of our transfer engine delivers much higher throughput in this scenario.

All our work is based on an open-source community, kvcache.ai and the Mooncake community. If you are interested in discussing or contributing, you are welcome to join. Thank you.

Dirk Kutscher: Thanks very much. Great talk. Okay, we do have time for questions. I know that's a lot of stuff to digest, but I think there could be questions. If not, I have one to start off. You mentioned your new development, the new version with the unified abstraction. I know there are other industry solutions these days, like Unified Bus; you have probably heard about this. How does it compare to that?

Mincheng Cheng: Yeah, actually, I think the unified abstraction is there to abstract away all the differences between providers. The reason we built a new abstraction is what we call the zero-copy property: we do not want many layers stacked, we want one, only one layer. So we built it ourselves and use it directly, so that we can make the best use of zero-copy. We borrowed the idea from existing solutions, but because of the implementation, we needed to rebuild it.

Dirk Kutscher: I see. A quick follow-up: you mentioned multi-tenancy as one of the objectives. Are you considering TCP at all, or do you think we need other protocols for that?

Mincheng Cheng: Currently, most of the traffic is carried over RDMA. In terms of multi-tenancy, there are two scenarios. The first is within a single tenant, within one model-serving deployment: we have three different kinds of traffic and we need priorities. For example, the EP traffic must have the highest priority because it is latency-critical, while the others are more bandwidth-critical. That is tenancy within a single use. The second is that we are serving multiple models: different models use different GPUs, and they scale in and scale out. That is the second kind of tenancy. But they all use RDMA.

Dirk Kutscher: Okay, thanks. Okay, we have Dave in the queue.

Dave: Yeah, thanks for the talk. Super interesting. I'm Dave from Wischnet. That was a lot to take in. I was fascinated by the part about the NICs and RDMA and kind of building up; it was kind of like bigger, better, faster, more for everything in the data center. And that seems good for lots of workloads, not just LLMs. Did I understand correctly that what this is targeted at is this KV cache layer that all LLMs will benefit from? That's the first thing I was wondering. And then, mapping it onto what the IETF or the IRTF might do: is the RDMA you're talking about all Ethernet, or does this map onto something clearly in the IETF domain that you suggest we focus on, for workloads that could be wide-area as well?

Mincheng Cheng: Yeah. For the first question: the so-called KV cache-centric architecture was proposed by us, and PD disaggregation is essentially transferring KV cache from prefill to decode, while KV cache reuse is of course centered on the KV cache. So KV cache is one of the dominant payloads moved around these GPU data centers. The second is model weights: the transfer engine is currently used to move model weights in reinforcement learning scenarios from the training stage to the inference stage. A third, growing scenario is activations, which are used to stabilize RL inference: typically we need to transfer not only the weights but also, for example, activations produced in training to the decoding side to enable more stable RL inference. And fourth, we are currently exploring CPU checkpoints, because we are now serving a lot of CPU-side agent harnesses, which need to halt, resume, and scale. That leads to scaling of their state, the CPU DRAM state of all these harnesses, and for that we also use the transfer engine. So essentially the transfer engine is a component that makes the best use of the high-bandwidth network, and we keep exploring new scenarios that need this kind of bandwidth. That is the first question.

For the second question: even though the transfer engine was initially designed entirely from a distributed-systems view, at a very high level, we are now dealing more and more with low-level hardware properties: RDMA, InfiniBand, the priorities of different network flows, and how to get better telemetry about workload and link availability. And as you can see, we borrow many ideas, such as packet spraying and independent per-packet routing; these ideas are well established in the networking community, and we are just rebuilding them in our transfer engine. Maybe some of these properties can be integrated into or adopted by network protocols, and we can have better co-design. I think the current era is an era of co-design: our systems are co-designed with the algorithms and co-designed with the hardware. So there are a lot of future directions.

Dave: Oh, thanks so much. Thank you.

Dirk Kutscher: Okay, we have Xing Jiang in the queue.

Xing Jiang: Hi, can you hear me?

Dirk Kutscher: Yes, we can.

Xing Jiang: Okay, thank you. A question about the KV cache application: do you consider network awareness, and also the interplay between, for example, traffic priority and congestion control, for the KV cache transfers?

Mincheng Cheng: Yeah, actually I think congestion control is becoming more and more important. As I mentioned, I come from the distributed systems community, so originally, at small scale, we were not aware of network congestion, and the first version of the transfer engine was congestion-oblivious. But with larger deployments, we found that congestion always leads to high P99 latency, and that is what we are working on today. Especially for the EP traffic, which is very latency-critical, congestion has become a very important problem. One reason, as I mentioned, is bandwidth-bound traffic mixed with latency-critical traffic; that is one source of congestion. The second is stragglers among the network interface cards: periodically we find that some NICs do not transfer as fast as others. That is the second source of stragglers.

Xing Jiang: Yeah, thank you.

Mincheng Cheng: Thank you.

Dirk Kutscher: Looks like we have one more question.

Questioner: Ah, yeah. Thank you, Mincheng. I have a question. So far we have mainly discussed the inference scenario, right? I'd like to ask about the new challenges and opportunities in the agent area. What new challenges will the KV cache face, whether in agent serving or in agentic RL?

Mincheng Cheng: Yeah, I think the basic problem is that everything becomes larger and larger. When we reach the limits, all the small problems become big problems. As I mentioned, the average context length is now 64K and growing toward 100K, so the KV cache is getting larger. Originally we only kept KV cache in DRAM; now we are exploring SSD, and SSD is a totally different problem from DRAM: how to manage the I/O, how to manage the latency. Management becomes an even more important problem. Also, originally, reusing KV cache was always better than recomputing it. But now, because the KV cache is so much larger, if your KV cache has been evicted to a cold layer such as SSD and you have a very strict TTFT SLO, you may need to choose recomputation to meet that SLO. So the scheduling decisions become even harder than before. When the scale grows, the problems grow. And again, we now have multi-node NVLink: how do we handle it, how do we manage it alongside RDMA? In which part should we use multi-node NVLink, and in which part RDMA? And in China we have maybe 10 different GPU vendors; how do you manage that complexity? All of these have become real problems in the current era.

Questioner: Hmm, thank you. I come from Huawei, and our team is really working on this topic; I think we can discuss it later. Thank you.

Mincheng Cheng: Yeah, of course, of course.

Dirk Kutscher: Great stuff. Thanks very much again, Mincheng. Great talk.

[Applause]

Dirk Kutscher: Okay, moving on to the next talk. Let me see, I'm having some Meetecho hiccups at the moment. Give me a second. Reload. Okay, thank you. So we're delighted to have Hong Xu next. He's a professor at CUHK in Hong Kong, and he will talk about "Reliability Engineering – Challenges in Networking for AI." Hong, you can control the slides on your side with your cursor keys. Welcome.

Hong Xu: All right. All right. Can you guys hear me?

Dirk Kutscher: Yes, we can. This is quite loud.

Hong Xu: Okay. Thank you for having me here. This is actually my first time joining IETF and IRTF. I'm going to talk about a topic that's a little bit different in terms of focus: instead of talking about performance, efficiency, and so on, I decided to talk about reliability engineering, focusing on networking but really on AI infrastructure in general.

All right, so in terms of networking infrastructure for AI, as we all know, there is a lot of rapid development in this space. If we look at the different levels of the networking stack, there has been a lot of innovation in both hardware and software technologies. At the server level and the rack level, we've seen many new interconnect technologies: at the server level things like CXL, NVLink, and Unified Bus, and at the rack level superpods built on various interconnect technologies from NVIDIA and Huawei, as well as some open interconnect standards.

At the cluster level, in the networking fabric, there has also been innovation in integrating optics, which is really becoming a necessity as we go for high bandwidth to transfer model weights, KV cache, and so on. Most critically, the scale of the infrastructure, not just the networking but every part of the AI infrastructure, is really huge, and that is becoming a problem not just for performance and efficiency but also for fault tolerance and reliability. We are starting to see a lot of work coming out of industry that looks at, for example, training over 100,000 GPUs and how to handle fault tolerance from the training framework's perspective. All of this is the background that underpins the importance of reliability and of how we solve the reliability challenge.

Essentially, there are three main things that make reliability especially challenging for AI workloads. One is the scale factor I just mentioned. Another is the complexity of the infrastructure: not just the hardware technologies, but also the fact that we are moving more and more toward a co-design approach. When we have a new hardware technology like Unified Bus, we don't just develop the hardware; there is a whole software stack and a whole ecosystem designed around it. And there is a lot of innovation at the upper layers too, at the machine learning systems or frameworks level. As Mincheng just mentioned, in their new-generation Mooncake transfer engine they want to dynamically bind and utilize different interfaces and networking stacks for various purposes. You can imagine that with these kinds of ambitious technologies, the complexity of maintaining the software and hardware stack grows, and when there is a failure, pinpointing exactly which part of this complex software-hardware stack is causing the problem becomes really, really challenging.

That is challenge number one. Reason number two is that all of the training workloads, and increasingly the inference workloads, especially with agents, are highly distributed, and they are distributed over an ever larger number of GPUs and networking links. This means that whenever a small hiccup causes a job to restart, even partially, it incurs a huge financial cost. So we would really love to minimize this downtime and be able to fundamentally resolve the root cause of the problem.

The third reason why reliability engineering is especially challenging these days is that the mainstream practice for root-causing or diagnosing problems and then resolving them is still a heavily manual process. It is obviously very tedious, and fundamentally it is very difficult for us to keep up with the rapidly evolving landscape: the hardware technologies, the software stack, and the things happening at the upper layers that try to be more aware of the infrastructure and optimize around it. On the right you can see statistics on different types of incidents that we were able to collect from some GPU clusters at Microsoft Azure. This covers 2023 to 2024, so it might differ a bit from the current situation, but you can see the various types of incidents reported by customers running AI workloads. A lot of them are related to GPUs, but many are also related to software, to networking, and to other components.

So basically, in this talk I want to share and pitch our vision of an autonomous framework for troubleshooting and diagnosing AI infrastructure. The overall vision is very simple: we want to build an agentic framework, something like an OpenDevin-style system, that can manage NetOps, or the entire AI infrastructure, in terms of diagnosis and troubleshooting.

To realize this vision, I believe there are three fundamental challenges, and I want to explain what they are and how we can potentially solve them. A lot of this is still undefined, which is exactly why I want to talk about it here: I hope more people can join forces and discuss it.

The most basic, most essential challenge for this vision, in my opinion, is evaluation. If we want a system or an agent that can automatically troubleshoot incidents, we need a reproducible, comprehensive, and scientific way to systematically measure the effectiveness of such a system. And that is essentially quite lacking at the moment, in my opinion. I'll talk about it a bit more on the next slide.

Once we have such an evaluation framework, we can move on to question number two: how do we design such an agent? How do we engineer its workflow so that it can handle various types of incidents? When an incident is straightforward, we don't have to think too much; we can probably even use simple pattern matching. But when the incident is genuinely new and involves different parts of the software and hardware stack, then we have to spend a lot of time thinking and do a lot of real work to pinpoint the problem. That is really challenging to me.

That's the second part. The third part is the systems part: once we figure out what the design paradigm for the agent should be, how can we serve and run this agent efficiently and reliably?

These are the potential solutions we see for those three questions. For evaluation, we believe it is essential to have some kind of open arena, very similar to LM Arena and the other open arenas we have for the models themselves, but this time for NetOps or AIOps agents. We want to build an open system that can inject and reproduce all sorts of failures and incidents, so that we can do standardized, reproducible benchmarking. That is essentially what the evaluation challenge calls for.

To be more concrete, there are a couple of smaller but still critical questions here. First, if we want such an arena, we must have datasets that are realistic, that come from real infrastructures and data center networks, and that cover various kinds of problems. We can then use these datasets to inject failures. That is the first problem.

The second problem in this task is that we also need to build an environment in which the agents can actually play, can interact. When a failure is injected, the agents should be able to observe it and troubleshoot by invoking different tools and scripts on the hosts, the switches, and the other parts of the infrastructure. That by itself is a very difficult systems task.

Challenge number three is that we also want this arena to integrate the production tools we have been using for a long time: things like PinMatch, and things like MyCraft, which is a system we developed with Bytedance for tracing collective communications. There are many other troubleshooting and monitoring tools already deployed in production, and we want our arena to be able to invoke those services as well, whenever needed. So that's the first challenge.

The second challenge is the design paradigm for the agent: how should its workflow be designed? I think there are at least three basic questions here. One is that we really have to handle the context-length problem; otherwise this simply won't work. Imagine we're dealing with a difficult problem: there will be a lot of troubleshooting, we collect a lot of data from all kinds of logs and systems, and we look at many metrics. That is a lot of context, and only a small part of it will turn out to be useful, but we don't know which part in advance. If we just blindly feed all this data as context to the model or to the various agents, it won't be efficient and it won't scale. So how do we solve this problem? One possible shape of a solution is sketched below.
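As a hypothetical illustration only (this is not a design from the talk), one way to tame the context is to filter and summarize each log source against the current hypothesis before anything reaches the model; `summarize` and `relevance` here stand for arbitrary scoring functions, e.g. small model calls:

```python
# Sketch of hypothesis-driven context reduction for a troubleshooting
# agent. Raw logs are filtered per source and distilled into digests,
# so the model sees a compact view instead of everything collected.

def build_context(hypothesis, log_sources, summarize, relevance):
    context = []
    for source in log_sources:
        # Keep only entries scored relevant to the current hypothesis.
        entries = [e for e in source.read() if relevance(e, hypothesis) > 0.5]
        context.append(summarize(source.name, entries))  # per-source digest
    return "\n".join(context)  # far smaller than the raw logs
```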

Problem number two is the reasoning part: how should we design these agents so that they think in proportion to the difficulty of the problem? As I said before, if the problem is straightforward, you don't have to spend a lot of time. But if the problem is genuinely challenging, you might have to go through a lot of effort. You can think about whether the agent should reason about root causes sequentially, or in parallel: for instance, "there are three potential hypotheses for this problem, maybe I should go and verify each of them independently in parallel," and even when checking the network, check various parts of it independently in parallel to speed up the process. So that's number two.

And problem number three is that when we are dealing with incidents, especially newer ones that haven't come up often and start to emerge due to new systems and new hardware, we really want this agent system to be able to learn from the limited experience of interacting with these incidents. That is going to be really interesting as well.

Right. The last part is the software system itself: how can we design a software system that supports this type of troubleshooting agent so that it is more efficient and more reliable? You can imagine there are tons of open questions here; again, I'll just list three. One, I think, is cross-model communication, which is actually fundamental to all kinds of agentic workloads. When you have multiple agents and multiple models interacting with each other as part of the workflow to solve a problem, you have to communicate across these models. And it's not just a context-length limitation; there is a more fundamental question. Communicating across models using KV cache, or essentially using tokens, is one potential way. But you could also think about other ways that don't rely on the token or embedding abstraction, using some other abstraction that only models can understand, different from tokens, which are essentially designed for people to consume. So I believe there are really interesting fundamental challenges here. That's cross-model communication.

The second issue is the robustness and assurance of such a system itself. You could apply red-teaming techniques here to understand how the troubleshooting agent itself might fail, and what to do when it fails. And the third problem is full Dev-and-Ops-cycle support for this type of agent, so that we don't just run the agent but can also support functionality such as versioning, testing, and rolling back to an older version when things don't work out. When going to production, these things matter a lot.

Right. I've talked about a lot of open challenges. In the remaining time I'll briefly cover some of our work that tries to make progress toward this vision. We've done a few things. One is the open arena I just talked about; this is still ongoing work. We are working with some of the biggest AI companies, collecting data and trying to build this open arena that can inject failures and so on.

We've developed TSGuard, a troubleshooting guard, which will appear at FSE this year, a top software engineering conference. It is a user-centric troubleshooting agent for AI workloads, a joint work with Azure. We've also deployed a system called NetOps AI, also a troubleshooting agent but targeting networks, which has been running at a leading AI company in China. And finally, I'll briefly introduce MyCraft, a tracing tool for collective communications that is deployed and running at Bytedance. Okay, I'll be quick.

First, TSGuard. This work really started a couple of years ago, in 2024, when I was on sabbatical at Microsoft Research. We had this idea of seeing whether LLMs could help us troubleshoot the incidents coming from Azure's customers, with a particular focus on their AI workloads and AI infrastructure. That is essentially the motivation for this work.

Because of some limitations in working with Azure, what we ended up building is a user-centric agent: our agent can only observe the incidents reported by the customers; it is unable to interact directly with the infrastructure. So based solely on the customer's incident report, we perform a diagnosis and try to make an educated guess about the root cause.

In terms of design, we adopt a two-phase approach, which is fairly intuitive. Phase one is an offline knowledge-building process to build the foundation, the brain, of the agent, and phase two is the actual online troubleshooting phase.

In the offline phase, we look at the one-year period of incident data I showed you before and derive a taxonomy that structures all the different types of incident we've seen in the past, their root causes, and essentially how to deal with each type. We also add some domain-specific rules and knowledge particular to Azure into this taxonomy. That becomes the basis for phase two, the online diagnosis.

The diagnosis system is designed around a tiered heuristic, as sketched below. When an incident comes in, we first quickly run a pattern-matching workflow to see whether this is a recurring incident we have observed before and already know how to handle; that tier is resolved using RAG and related techniques. If we find that it is not a recurring incident but a complex one, we move to Tier 2, the slow path: the agent step by step comes up with different hypotheses and verifies each of them to determine what the problem really is. During this process it can deploy various benchmarking tools, although, as I said, we couldn't experiment on Azure directly, so we experimented in an emulated environment. And the last tier, Tier 3, is for incidents that are completely new to us: we spend even more time, generate more possibilities, and evaluate them with a larger budget.
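The control flow of that tiered idea might look like the following sketch; this is illustrative pseudo-structure, not the actual TSGuard code, and the `kb`/`agent` interfaces and the similarity threshold are assumptions:

```python
# Sketch of a tiered fast/slow-path diagnosis workflow.

def diagnose(incident, kb, agent):
    # Tier 1: fast path. RAG over the incident taxonomy to see whether
    # this is a recurring incident with a known resolution.
    match = kb.retrieve(incident.report)
    if match and match.similarity > 0.9:
        return match.known_root_cause

    # Tier 2: slow path. Generate hypotheses one by one and verify each,
    # possibly by running benchmarks in an emulated environment.
    for hyp in agent.hypothesize(incident, kb):
        if agent.verify(hyp, incident):
            return hyp

    # Tier 3: a completely new incident. Widen the search and spend a
    # larger reasoning budget before answering.
    return agent.deep_analysis(incident, budget="high")
```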

Essentially, we evaluated this design against various baselines available at the time, and we were able to show that, in terms of F1 score and other metrics, TSGuard achieves much better diagnostic accuracy across various types of incidents.

The second piece of work is NetOps AI, a deployed system running at a leading company. The figure is not a real screenshot of the system, because we were not allowed to show that; it was generated by AI but conveys the same idea. We want to show what this system can do nowadays. It is actually more than a troubleshooting AI; it is more like a portal for NetOps. It integrates a lot of functionality, such as context-dependent metadata queries: you can query various parts of the network directly, as if you were interacting with an LLM. You can also do end-to-end diagnosis, telemetry analysis, and other things. We will try our best to collect some data and write up a technical report about the experience of using this system in production, but that is still ongoing.

The last work I want to share is MyCraft, which is also about reliability engineering. It is about building an actual tracing system as part of the infrastructure, so that we can provide more information about what went wrong; it mostly focuses on training. Essentially, MyCraft provides fine-grained tracing for collective communications. By collective-level tracing we mean that MyCraft provides real-time logs generated periodically, especially while a collective communication remains unfinished. Previously, many tracing tools would only give you things like timestamps when an operation finished; if the operation could not finish, or was stuck in the middle, you wouldn't get anything. MyCraft is different: it gives you real-time progress logs even when your collective communication is stuck. With this fine-grained information, you can detect problems much more quickly, and you can even identify potential root causes.
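To illustrate the difference between finish-time tracing and real-time progress logging, here is a small hypothetical wrapper; this is not MyCraft's actual implementation, and `run_collective` / `progress` stand in for whatever the tracing hooks observe:

```python
# Sketch: emit progress periodically even if a collective never finishes,
# so a stuck all-reduce still leaves a trail for root-cause analysis.

import threading
import time

def traced_collective(op_name, run_collective, progress, interval=5.0):
    done = threading.Event()

    def heartbeat():
        # Fires every `interval` seconds until the collective completes.
        while not done.wait(interval):
            print(f"[trace] {op_name} in progress: {progress()} bytes moved")

    t = threading.Thread(target=heartbeat, daemon=True)
    t.start()
    try:
        return run_collective()  # e.g. wraps an all-reduce call
    finally:
        done.set()
        print(f"[trace] {op_name} finished or aborted at {time.time()}")
```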

MyCraft has been running since October 2024, and we collected data over a two-month period. In the SOSP paper we reported many experiments and a lot of data, but here I'll just quickly show the trigger-time and root-cause-analysis-time distributions for that two-month dataset. We were able to perform over 1,200 root cause analyses; 90% of the problems were detected within 15 seconds of onset, and 60% of the analyses completed within 20 seconds. So it is very efficient.

Okay, that's everything I wanted to talk about. I believe it's a great time to work on reliability engineering, not just for networking but for the entire AI infrastructure, and we welcome all kinds of discussion and collaboration. If you are willing to share datasets on incidents and failures, we would love that. And just before I close, I want to quickly mention that in one of yesterday's sessions, my student gave a talk about another of our projects, on LLM-based network configuration management. If you are interested, please check that out as well. All right. Thank you.

Dirk Kutscher: Great stuff. Thanks very much, Hong. Great talk. [Applause] Thank you. Let's see whether there are questions. And I have Rod in the queue.

Rod: Thanks. This looks like really good work. The one thing I didn't get from the talk is what the overhead is on the production systems themselves for the monitoring and tools that are actually used for this.

Hong Xu: Ah, so you mean things like MyCraft and PinMatch?

Rod: Yes, the whole set of things you talked about. You talked about three different things, and you talked about the computational cost of doing the analysis, but you didn't talk about the cost of actually collecting the data.

Hong Xu: Right, I understand. In terms of these individual systems, like PinMatch and MyCraft, their overhead is actually quite small. For example, for MyCraft, I don't have the slide here, but in the paper we show that its impact on a training job is very minimal, less than 1% in terms of additional latency overhead. The MyCraft system itself just runs on a server, and although it sees a lot of data, in terms of the actual resources we need to spend processing that data, the overhead is not that big of a problem. But in terms of the entire agentic system that processes each of these incidents, I don't have very concrete data about that overhead. What I know is that you have to provision some additional GPU servers just for those agents. But I think the overhead is not a huge issue here.

Rod: That's what I was expecting. I was just wondering if there was something you weren't saying. So that's actually really encouraging. But what about the ability of the system to detect problems with itself, with its own monitoring tools and things like that? Can you do that as well with this data gathering?

Hong Xu: Right, that's a very interesting question. We haven't tried that, actually, but I believe we could. Yeah, I believe we could try that.

Rod: Thanks. By the way, love the picture on the whiteboard.

Hong Xu: [Laughing] That's from my daughter.

Dirk Kutscher: Okay, another question, Roberta.

Roberta: Hello, Professor. Thank you for the presentation. I have some questions related to security and also to hallucinations. You're using cloud bot and you're using agents. I imagine all the agents that have been mentioned are generative agents.

Hong Xu: Yes.

Roberta: How are you dealing with the manipulation of hard data that might happen, say 13 to 17 percent of the time, because of hallucinations when the agents are reading hard data such as log files and those kinds of registries in order to do the training? And in the agents you use for troubleshooting, and maybe remediation, I'm imagining here, how are you circumventing this risk?

Hong Xu: Yeah, great question. Obviously this applies to all agent workloads, but it's probably more critical when we are using agents to maintain infrastructure. I think there are a lot of things you can do to safeguard against and alleviate this hallucination problem. A lot of it comes down to how you build the so-called scaffolding, or harness, around your agents, and how you structure the workflow. You can use different paradigms, like skills and other things, and when you write your skill, you can mandate that the model has to verify all of the possibilities against benchmarking scripts or other data that it actually collects. I think a lot of that is quite helpful in preventing or alleviating hallucinations, and that's also what people have been doing when deploying these kinds of things. Yeah.
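
To make the verification idea concrete, here is a minimal sketch of the kind of check such scaffolding might perform: before acting on an agent's diagnosis, require that the hard data it cites actually appears in the collected logs. The function name, the regex, and the fact patterns are all invented for this illustration; real harnesses are far more elaborate.

```python
import re

def verify_diagnosis(claim, raw_logs):
    """Toy guardrail: before acting on an LLM's diagnosis, require that
    every piece of 'hard data' it cites (here: interface names and
    error counts) literally appears in the collected logs."""
    cited_facts = re.findall(r"[A-Za-z]+\d+/\d+|\b\d+ errors\b", claim)
    missing = [f for f in cited_facts if f not in raw_logs]
    return (len(missing) == 0), missing

logs = "eth1/3 flapped; counters show 42 errors on eth1/3"
ok, _ = verify_diagnosis("Root cause: eth1/3 reported 42 errors", logs)
# -> ok is True: every cited fact is corroborated by the raw data.

ok, missing = verify_diagnosis("Root cause: eth2/9 reported 7 errors", logs)
# -> ok is False: the harness refuses to act on uncorroborated "facts".
```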

Roberta: Are you actively working on scaffolding for those cases? Because what I've been seeing in research, and in presentations here, is that people implicitly trust that the existing scaffolding is already good enough to take care of some of those questions. But in security, for defense, we have a really huge issue: the problem dimension is infinite because it's non-deterministic, right? When you train an agent's scaffolding on one type of problem, one type of issue, changing that issue just a little leaves you with effectively infinite versions of it that might trigger a response from the agent that was not in the training, not like the original one. It would solve a lot of the issues we have been seeing here, so this is why I was asking. I don't know if you can share; we can talk later.

Hong Xu: Right, right, sure. I'm less aware of the security side. But in terms of troubleshooting here, there are a lot of additional ways for you to verify, even if the model actually hallucinates. That could be the last line of defense here. But that is only for this scenario, for this particular task. Yeah.

Roberta: Thank you, Professor.

Hong Xu: Yeah.

Dirk Kutscher: Okay, I have another question. The open arena topic, that's actually interesting. We have also had some previous discussion on the extent to which testbed and data-set-sharing activities could be useful in, let's say, an IRTF context. I just wanted to ask your opinion. I mean, these data sets are often a bit difficult to share, right? There is lots of private data in them. Do you see a chance for such an activity, testbeds, data set sharing, in an open environment like the IRTF?

Hong Xu: Yeah, I think this is really, really fundamental. If we were able to put together such a data set, it would be really fundamental to research and to industry. For the IETF or IRTF, I think it would be possible to come up with a taxonomy of sorts, so that we have a way to categorize all the possible networking-related failures. Maybe it's a switch failure, and within switch failures, what the different types are, versus host-side problems, and so on and so forth. Then there are software stack problems and others. So I think it's possible to come up with a taxonomy first. And with that taxonomy, it might then be easier for different people, even companies, to contribute incident data in terms of what the failure pattern is and how exactly the failure looks. Of course, we are working with some companies, and they are willing to share at least some actual failure data that they believe is quite common, so it's not really proprietary, so to speak. But when it comes to other parts of the infra, it might become more sensitive. So I think for networking it's probably easier in that regard.
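
For illustration, a shared taxonomy plus incident schema could start as simply as the sketch below. All the category names and fields here are placeholders invented for this example, not anything proposed in the session.

```python
from dataclasses import dataclass
from enum import Enum

class FailureClass(Enum):
    # Top level of a hypothetical shared taxonomy; real categories
    # would be agreed by the community, these are placeholders.
    SWITCH = "switch"            # e.g. port flap, ASIC fault, PSU
    HOST_NIC = "host_nic"        # e.g. link down, firmware hang
    SOFTWARE_STACK = "software"  # e.g. collective timeout, driver bug
    OPTICS_CABLE = "optics"      # e.g. CRC errors, degraded link

@dataclass
class IncidentRecord:
    """Schema a contributor could fill in without exposing
    proprietary topology: just the failure pattern and signature."""
    failure_class: FailureClass
    subtype: str            # finer-grained label within the class
    symptom: str            # what the operator observed
    signature: str          # anonymized log/telemetry pattern
    time_to_detect_s: float

example = IncidentRecord(
    failure_class=FailureClass.SWITCH,
    subtype="port_flap",
    symptom="periodic collective-communication stalls",
    signature="link up/down events on spine-facing port",
    time_to_detect_s=12.0,
)
```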

Dirk Kutscher: Yeah, maybe the most challenging part would be things regarding IB, InfiniBand, which is really a black box. Yeah.

Hong Xu: True.

Dirk Kutscher: Okay, that makes sense. Thanks very much. Okay, great. Thank you very much again, Hong. Great talk.

Hong Xu: Yeah, thank you. Thank you. [Applause]

Dirk Kutscher: And now something different. So the next talk will be by Lixia Zhang, and I don't have her slides on the data tracker yet. It’s not synchronized. Give me a second. Reload. Okay, thank you. It works. Just need to get the slides in.

Lixia Zhang: Hello everyone. This is Lixia Zhang from UCLA. I uploaded the slides late because we are doing research in real time, right?

Dirk Kutscher: And I don't get them here, but I can maybe share them for you and then—

Lixia Zhang: Let me just get started. Talking about networking: this is 2026, exactly 40 years of IETF activities. I attended the very first IETF, back in 1986. At the time, we were really struggling with just how we could interconnect all the computers. I remember that in those days the two challenges were routing and congestion control; bandwidth was low and traffic grew exponentially. The first time you do anything, you don't know how to do it, and you learn through practice. [Tries to change slides] If I can turn into the display mode... I figured it out. Great, yeah.

So what we are doing today is networking AI agents. AI agents, I tell my students, are really the new frontier of networking. It's not that I've gotten into AI per se; I'm still doing networking after 40 years. I really enjoyed the previous talk, with all these exciting new challenges but also new results on how to move AI forward. But like anything else, they need networking. So networking people come here to help move AI forward.

Now the question is: since both hosts and agents need networking, what's new and what's the same? That's what I want to talk about first; then we can identify new challenges. [Fumbles with the slide clicker] Oh, okay. I figured it out. Yeah.

So, what we are doing today is networking AI agents. I want to talk about the new challenges in networking and identify some things we can hopefully all work on together. This is the IRTF, so I'm not talking about what we must do today. What we must do today is keep the work going: AI is already there, agents are already there, they are talking now, even though it's far from perfect and there are many challenges. But meanwhile, as the research community, we move forward and look further out, to see how we can improve the current situation.

Next slide. Talking about networking, what do we do? Put plainly, we deliver packets. That's the bottom line of networking. But there are different ways of delivery. People normally say one-to-many, many-to-many. I think AI actually adds something I put there as I-to-J: I can be some subset of the entities on the network and J some other subset, and I and J could be the full set or part of it. On day one, TCP/IP was point-to-point; that's what we did, and that's what we still do today, point-to-point delivery. And for agents, I think we also started with point-to-point communication. But soon the internet needed to scale data dissemination. I remember clearly that in the late 80s, IP multicast was developed, intended to scale, say, audio/video dissemination. In reality, IP multicast turned out to be a real challenge; that's another story, and I'd be happy to share with whoever is interested exactly what happened. What we use now to support dissemination is actually CDNs, content distribution networks. Now AI also wants multicast support. How do we do that? Yesterday I had some discussions with really first-line engineers, and clearly people have different views on that. But nevertheless, talking about communication patterns, the patterns themselves are not new. I don't have to talk about N-by-N or N-to-1; generally speaking, what we do in practice is multiple one-to-one communications to fulfill the need for I-to-J communication. That's just a matter of fact.

If we move to the next slide: what's another thing that's not new? Security. I must say that on day one, back in 1986, everyone knew security was important; it just wasn't the problem we were actively working on at that moment. These days you hear talks where people start by saying that TCP/IP didn't have security designed in, as if that were ignorance. I really want to set that straight: it was not ignorance. My former advisor, David Clark, keeps saying that we were fully mindful of security; it's just that we didn't know what the threats were and what problems we had to solve right then. That's why TCP/IP didn't start with security. But very quickly, in the early 90s, e-commerce came online and people needed secure transactions yesterday. So solutions developed quickly: SSL with the Web PKI, certificate authorities, this and that. And later, because money tends to bring evils, spam, DDoS attacks, and all sorts of other things came along, and what we did was put some kind of filter at all the different levels, firewalls, this and that, trying to block the attacks. So what we ended up with is different solutions at different levels, a scattered collection of solutions. Do they interact? You know the answer, right? And sometimes they actually counter each other. I heard a story of a company that wanted to deploy, not CDNs, I'm so sorry, VPNs for enterprises. One big hurdle they ran into was the firewall, because VPNs need to punch a hole in the firewall. That costs real engineering money, so they were told: pay us for punching the hole in the firewall so that you can deploy your security solution called VPN.

What I want to say is that security was a late add-on, and we don't really have a unified namespace to talk about what the threats are and who the bad parties are, so as to develop collaborative solutions. Therefore we end up with this fragmented solution space. There's one thing I probably forgot to put on the slides, so let me add it. TCP/IP started with just addressing. Back in 1981, when RFC 791 and RFC 793 were rolled out, we just had IP addresses to deliver packets. At that time there was no DNS yet; there were informal things going on, but if you look at the RFC series, the first DNS specification I remember was actually published in either 1983 or 1984. So DNS was added late, at least as a standard. At the very beginning, all hosts got IP addresses, so we had this statement: anyone can send packets to anyone else. But as DNS came along, we somehow lost that. Who gets DNS names? Servers, websites. Do you have a DNS name? Maybe 0.1% of you do, but the majority of us do not. Now, when you get into security, the very first thing is authentication: who you are. And that leads to the second question: what you can do. Without a unified identifier space, that's how our security solutions got fragmented. We all have security in cyberspace: you log on to the web, you have single sign-on. What's your identity? Not a DNS name. Servers are authenticated by DNS names; what about you? By and large, my personal understanding is that users are identified by email addresses. Why email addresses? Because an email address is globally unique and identifies you. It's really a DNS name plus a little part, your user ID, assigned by the email provider. So there is this implication: who owns your identity? The application provider. You do not have an application-independent identifier. I just state that as a fact.

If we go to the next slide... I should probably go faster. So what's really new? It's not the communication patterns, and it's not that there's no security. We do have security, but as I go on to talk about what's new about agent communications, you'll see that the previous security solutions really get stretched. These days I keep inventing triangles. Yesterday I talked about internet naming having three requirements: the requirement triangle. What I put down here are the agent networking patterns, or behaviors. Number one, they need scale. Number two, they are dynamic. And number three, taken together, it's multi-party interaction: if our vision comes true, it's really agents working with each other. Let's go through them. On agent scale, people say we're going to have billions of agents, probably soon, and then trillions. That sounds like a very big number. But as I put on the slide, today's internet already has billions of human users, and in terms of devices, there are 40 or 50 billion already. So billions is not a big number in networking terms; we handle that now. What's new is the next two. First, the dynamics: a user doesn't just show up for a minute and disappear again, nor does a device. But an agent can. Some agents are going to stay around for years, maybe, but some will be so transient they last at the sub-second level. And we do not really have a way to deal with this kind of agent, to identify them. Especially: how do we make an agent that lasts even 500 milliseconds accountable for whatever actions it performed during that short lifetime? The next thing is interactions. Right before I went to the airport for this IETF, I had dinner with a group of people who develop AI agents for their application-specific usage. They need agent group communications. So they asked me: why is this hard? Look at us, we have groups in WhatsApp; why can't you do it for agents? Because users do not have direct interactions as of today; we interact through the cloud. That's how WhatsApp groups work. But agents cannot do this interaction indirectly through the cloud, because they have real-time requirements.

So if you put all three things together, I think they all point to one conclusion: we need decentralized networking, decentralized identity management, decentralized trust. I'm not saying this just because I co-chair DINRG together with Dirk; I'm trying to be objective in pointing out that agentic AI will give a big push toward decentralization, away from where we are today, which is more or less cloud-centric networking.

Next slide. I mentioned three things, but there are two more that put further stress on the existing security solutions. Number one is the delegation chain. Me as the user will delegate to an agent: you do this. That agent, of course, is going to delegate to other agents for complicated work. Now, do we have security solutions that support this kind of multi-stage delegation? The other thing, together with delegation, is autonomy. You want the agent to work because you don't do it yourself; the agent is going to act on your behalf, but autonomously. And we still need agents to be accountable, so how do we make that work? They no longer have prior per-action permission, and people say: let's do tokens, OAuth tokens, give each action a token. Right. See if we can do that within that time limit and at that scale. I may be ignorant of current practice in security, but my understanding is that all the token servers are actually in the cloud. So the orders-of-magnitude requirements in dynamics, interaction, and scale really suggest it may not be feasible to just carry what we have today over to agentic networking.
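
To make the multi-stage delegation problem concrete, here is a minimal sketch of a delegation chain in which each grant is signed by its issuer, chains to its parent grant, and may only narrow the scope. The principal names, the token format, and the stdlib HMAC signatures (standing in for real public-key signatures) are all assumptions made for this illustration, not a proposed protocol.

```python
import hashlib
import hmac
import json

# Each principal holds a secret; a real design would use public-key
# signatures, this HMAC stand-in just keeps the sketch runnable.
KEYS = {"user:alice": b"k1", "agent:planner": b"k2"}

def grant(issuer, subject, scope, parent=None):
    """Issuer signs a statement delegating `scope` to `subject`,
    chaining to the parent grant (if any) by its digest."""
    body = {"iss": issuer, "sub": subject, "scope": scope,
            "parent": hashlib.sha256(parent).hexdigest() if parent else None}
    blob = json.dumps(body, sort_keys=True).encode()
    sig = hmac.new(KEYS[issuer], blob, hashlib.sha256).hexdigest()
    return json.dumps({"body": body, "sig": sig}).encode()

def verify_chain(chain, root_issuer):
    """Walk the chain: each link must be issued by the subject of the
    previous link, scopes may only narrow, and the first link must
    come from the accountable root principal."""
    expected_issuer, parent, scope = root_issuer, None, None
    for token in chain:
        rec = json.loads(token)
        body = rec["body"]
        blob = json.dumps(body, sort_keys=True).encode()
        want = hmac.new(KEYS[body["iss"]], blob, hashlib.sha256).hexdigest()
        if body["iss"] != expected_issuer:
            return False
        if not hmac.compare_digest(rec["sig"], want):
            return False
        if parent is not None and body["parent"] != hashlib.sha256(parent).hexdigest():
            return False
        if scope is not None and not set(body["scope"]) <= set(scope):
            return False  # a delegate cannot widen its authority
        expected_issuer, parent, scope = body["sub"], token, body["scope"]
    return True

t1 = grant("user:alice", "agent:planner", ["read", "book"])
t2 = grant("agent:planner", "agent:booking", ["book"], parent=t1)
assert verify_chain([t1, t2], "user:alice")
```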

Next slide, if I remember what it is. Yeah. I want to bring up another thing: physical agents. Some companies came to me talking about this. They say the IETF has lots of discussions about agents, but it seems to be all about the models, right? Language-model agents. What about physical agents? Physical agents on manufacturing floors, physical agents in hospitals. As a matter of fact, I heard there are already deployments. Now, what's different between software agents and physical agents? I listed a few things. Number one: I use agents, but still one-to-one. Physical agents, I think, will by nature be multi-party. I talked to some manufacturers; they really want to turn all their currently statically programmed robots into physical agents so that they will intelligently work together, equipped with sensors, and also interact with humans. Then there are real-time requirements, particularly when interacting with humans, because if you get the timing wrong, that can cause physical damage. And that relates to direct physical interaction. Think about making a model call: there can be problems, your call can fail, but that is a software-system failure. You may say it can cause huge damage, right? Your agent somehow, behind your back, gave $300 to somebody else. That is real damage, but hopefully such soft damage can be fixed by the software system. Physical damage, on the other hand, is much more difficult to fix. If your robot rolls over something and destroys it, I doubt that is easily fixable.

Moving on to the next slide. The next point I want to highlight is that agent scale, dynamics, and multi-party interaction really expose structural failures of the existing security solutions. These are not performance issues; it's not about how fast your crypto can run, issuing certificates to agents or even revoking them. Rather, it's structural problems that make carrying today's security solutions into the agent era seem an unpromising direction. Let me give some specific examples. Take the namespace. So far we've dealt with humans; we get our email addresses. Do your agents get email addresses? Probably not. You think, oh, we can give them one. Sure, from whom? My email address is a Gmail address; would you ask Google to give your agent a Gmail address? You have to figure out something else. So question number one: how are you going to name agents? And today's solutions, OAuth, SSH keys, API tokens, all of that: do all these mixed identifier approaches apply to agents? We talked about agents interacting. Today, I believe agent interaction is still within scoped domains, but people envision that tomorrow we'll have inter-domain agentic communication. How do we deal with the identity issue if your domain uses something as identity that's different from my domain? The next thing is certificates, which I already mentioned. You can run really fast and give everyone a certificate; there's the scale. How many certificate authorities do we have today? Ask the security experts, they give me some numbers: someone says 100, someone says 300, it doesn't matter, some number in the hundreds. Would these CAs issue certificates to agents? You have to do something very different. I'm not going to fall into the trap of certificate revocation; I think you can spend three days just talking about how well those solutions work. The next thing is authorization. We have single sign-on on the web; we get a token, and the token grants authority. Say I install new software; Gmail as my OAuth server gives me a token so this new software can access some part of my data. Right. But I question whether any of the existing mechanisms, or any combination of them, would be suitable to secure agentic systems.

Dirk Kutscher: We have to move a little bit faster.

Lixia Zhang: Oh sure, I'm almost done. I don't know how many slides are left. So, starting point: yesterday Dirk and I gave a joint talk about why naming matters in this future integrated AGI era. We said: if you want to fix the security problem, if you want to fix the scalability problem, let's start by fixing the naming problem. We advocate the notion that the DNS namespace should serve as a unifying namespace for everything: not just for organizations, not just for applications, as with today's websites, but for organizations, users, applications, and agents as well. This unified namespace provides the foundation for talking about security: who is who, and what this party can do. But a name alone is not enough. A DNS name is very meaningful, ucla.edu, you know what it is, but you cannot just claim a name is yours; you have to prove it, so you have to combine the name with crypto. Identity is really a name plus a crypto certificate. Now, who gives you that certificate? That gets into the trust anchor question. Today we get certificates, by and large, from the CAs. Tomorrow, and we really promote this direction, as I said earlier the triangle points to decentralization, we need decentralized trust management. Go quick. Okay.

So if we want to get there, we start with naming and with decentralized trust. We need to build new tools to do that. Today we don't have tools that support the notion that when you get a name, you have an automated toolset to help you manage names. I can easily get a name from our department, lixia.cs.ucla.edu. But if I gave names to the 10 servers I have, I wouldn't know how to manage them; I don't have tools for that. Especially for agents: I fire off 20 agents, one lives for three months, another for three seconds. I don't have tools to manage that. And associated with that is decentralized trust management. We talked about certificates; agents definitely need certificates, issued by me if I am their owner. But how do I manage those certificates, not by hand, but automatically and yet securely? And this points to the next item: the 3As, this notion of authentication, authorization, and audit for accountability. The term has been around for over 20 years, as long as I've known it, but in reality we don't have a principled way to realize it. And now, if we want the agentic AI dream to come true, we need to put it into practice. Next slide; I think it's my last one.
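
As a toy picture of the "automated toolset" gap described here, the sketch below shows an owner minting agent names under its own DNS namespace and issuing short-lived certificates whose expiry matches the agent's expected lifetime. Everything here, the class names, the name format, and using expiry in place of revocation, is a hypothetical illustration, not a proposed design.

```python
import time
from dataclasses import dataclass

@dataclass
class AgentCert:
    """Name-plus-key binding issued locally by the agent's owner;
    hypothetical fields, standing in for a real certificate."""
    name: str          # e.g. "agent7.lixia.cs.ucla.edu"
    pubkey: str
    issuer: str        # the owner, acting as local trust anchor
    not_after: float   # short lifetime: expiry doubles as revocation

class OwnerIssuer:
    """Toy automated toolset: the owner mints names under its own
    namespace and issues certs whose lifetime matches the agent's
    expected lifetime (seconds for transient agents)."""
    def __init__(self, namespace):
        self.namespace = namespace
        self.serial = 0
        self.issued = {}

    def spawn_agent(self, pubkey, lifetime_s):
        self.serial += 1
        name = f"agent{self.serial}.{self.namespace}"
        cert = AgentCert(name, pubkey, self.namespace,
                         time.time() + lifetime_s)
        self.issued[name] = cert
        return cert

    def is_valid(self, name):
        cert = self.issued.get(name)
        return cert is not None and time.time() < cert.not_after

issuer = OwnerIssuer("lixia.cs.ucla.edu")
c = issuer.spawn_agent(pubkey="pk...", lifetime_s=3)  # 3-second agent
```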

Ah, a call for action. We are the IRTF. The IETF can standardize protocols quickly, because agents are here, agent communication is here, and we need something to at least guide common practice. But in the IRTF, we need to look at the next N years and where we want to be. If I've convinced you that applying what we have today won't carry us into the future we envision, now is the time, if it's not already late, to start looking for directions toward new solutions. I will skip the details. Oh, this is my last slide. No, these are the last ones. I think I mentioned all of this earlier: DNS names and localized trust. The only new thing is that we need to engage the regulatory bodies early about how agentic AI should be regulated, before the current market forces influence the regulations too much. Now, the takeaway, this is the summary. For agentic networking, we need to start with global naming, semantically useful DNS names, and localized trust. What's new here is not one-to-one or one-to-many communication; what's new is the special requirements for supporting agentic networking. And today's collection of security solutions won't carry us far enough toward our dreams, so we need to start developing new solutions. I already mentioned two suggested research investments; we really should get started quickly. I'll just end by saying that agentic AI, agentic communication, is really the new frontier of networking, and a good opportunity for us to think from the ground up and try to build things. As I mentioned, today's system is what it is because we started in an entirely different world; when I started, there were maybe a few thousand machines worldwide. Today it's a totally different world. For 40-some years we've just built on top of the same foundation, address-based communication, adding things as we needed them. Agentic AI gives us a strong push to ask: can we still live on these incremental changes? That's the big question. So we can discuss the answers.

Dirk Kutscher: Great, thanks Lixia. [Applause] So on that inspiring note, do we have questions for Lixia? Let’s take the next question. A quick reminder for people in the room, if you want to ask questions please use the Meetecho queue system. But next we take Wes online.

Dave: Okay, it's not Wes, it's Dave; Wes is just running the remote room login. So I may have vastly misunderstood some of the things you were saying about naming, so let me propose what my understanding is: you're proposing that agents have their own naming structure. And what I wonder is, isn't it a whole lot simpler if agents operating on my behalf operate with names associated with my own identity rather than some independent identity? Doesn't that factor the whole problem out in a much cleaner way? So explain what I'm misunderstanding here.

Lixia Zhang: We say that agent actions need to be accountable. Now, who is the party behind the agent that takes that accountability? That is the question. Whether it's your identity or some other party's identity that you use for accountability, that can be your answer. But fundamentally, you need an authentic identity for accountability. Whether it's exactly yours, someone else's, or your organization's: those are implementation details.

Dave: Ah, I don't see them as implementation details. Okay, thank you.

Lixia Zhang: Policy details. [Applause]

Dirk Kutscher: Thanks very much again, Lixia. [Applause] So we took a bit more time for the discussion after the talks, and we intentionally didn't cut it short, because the discussions were very good and led to many insights on potential topics for internet research. I took some notes and just want to check whether I got everything.

We talked about testbeds and data sets in Hong Xu's talk, a topic that had been brought up previously as a potentially good activity for an IRTF-like group. We talked about all these new system designs at the intersection of distributed computing and networking: KV-cache-centric systems, unified protocols or unified abstractions. Just now we talked about different agent communication patterns, group communication, and low-latency interactive communication. And a very important topic was brought up around security: naming, identity, trust, and delegation for agents. And also preparing for a world where the scale we are now used to would be challenged by dynamic instantiations and dynamic coalition forming between agents.

I think these topics all have quite good potential for really good research that would go beyond, say, the current protocol engineering aspects discussed in the IETF. So here is what I will do: we have another IRTF open session tomorrow at 2:00 p.m., the first afternoon session, and I have reserved some time for continuing this discussion. I'll try to put together a summary of today's meeting and present it briefly, and I hope people who are interested can come to that meeting and continue the discussion there. I'd like to thank our speakers and everybody who participated and came today. Thanks very much. [Applause] Meeting adjourned.


Session Date/Time: 18 Mar 2026 06:00

This is a transcript of the recorded session.

Dirk Kutscher: Okay, we're going to get started. Please take your seats. Hello, welcome to the second IRTF open session this week. Glad to see you. My name is Dirk Kutscher, I'm the IRTF chair. So, you've signed up to our Code of Conduct and our Note Well, so I'll leave this on for a second. This is about IPR, that if you're involved in anything, or you know anything, please inform us shortly. This meeting and all the other meetings are recorded and live-streamed. And so, in general, the IETF has a set of rules that apply to these meetings. In particular, the Code of Conduct, RFC 7154, and the anti-harassment procedures, RFC 7776. They also apply to us. And in addition, we have a Code of Conduct that is more specific on research ethics, and that's 9775.

Dirk Kutscher: And, yeah, the Internet Research Task Force is not doing standards. We don't have working groups. We don't produce standard track RFCs. So, the purpose here is to conduct research that will hopefully be useful for the internet. And, yeah, most of the work we do is done with the intent to generate insights for the community and, for example, enabling experiments. We currently have 16 research groups in the IRTF. 11 of them are meeting this week. Please look at the IRTF agenda to find the meeting times. And, actually, this is not a current slide. Let me skip this.

Dirk Kutscher: I wanted to report on a few things that have happened. So, we held a workshop at CoNEXT in December last year on "Internetworking Challenges for AI," so a topic that we also discussed this week. Antoine Fressancourt, my co-chair there, and I, we published a report, and you see the link on the slide. If you're interested, check it out. We also held the HKUST Internet Research Workshop on Friday before this week. So, that's a workshop that just invites ideas, lightning talks, and discussions on new topics, or topics that people find particularly interesting. And, yeah, it went fairly well, I think. So, you see the photo there, and topics that were discussed: IoT compute continuum and the end-to-end principle, sustainability, source buffer management for low latency, and then some discussions on agentic AI communication. So, we typically organize these workshops here in the area before the March IETF, which is often in Asia. And so, we'll likely do this again next year, maybe even for two days, depending on who can host us. So, we'll announce this on the IRTF announcement mailing list and other channels. If you're interested and in the area, feel free to attend that workshop. So, it's open for everybody, you can just register and come.

Dirk Kutscher: Okay. So, as usual, we had the possibility to offer some travel grants for people to come to this meeting. So, we offer travel grants to early career academics and PhD students from underrepresented groups to attend our meetings here and also the whole IETF. So, the idea is, yeah, to make it easier to get people to attend the IETF, maybe for the first time or maybe first two times. And so, these travel grants, yeah, pay for the travel accommodation and the IETF registration fee. And so, the expectation is that people come and then stay here for the whole week, make interactions with, you know, IETF/IRTF folks, attend our meetings, and then, yeah, find a good way to contribute. And so, this has worked relatively well in the past, so many of the travel grant winners there were able to find a good topic that they could then later work on and then continue to attend the IETF. So, we'll also offer travel grants for our next meeting, that is going to be in Vienna, not in March, sorry, it's in July. And the deadline is on March 27th, so that's relatively soon. So, it's the end of the week after the meeting. So, if you're interested, please make sure you apply and share this information with people you might think could be interested.

Dirk Kutscher: And so, we also have the Applied Networking Research Workshop (ANRW) every year at our summer IETF. So, that will be our next meeting in Vienna. So, the co-chairs for this workshop are Thomas Schmidt and Suresh Krishnan. And there's a call for papers that is currently open, deadline April 17th. And, yeah, please submit your best work in applied networking research to ANRW. It's a really nice event, so it's like one day during the IETF week. Right.

Dirk Kutscher: And, yeah, so we have the Applied Networking Research Prize (ANRP), where we want to recognize the best recent results in applied networking. The idea is that there is good published work in, let's say, top-tier conferences, and we'd like to bring that to this community and encourage more discussion about it. Maybe this could start up some new work in the IRTF, or it fits into things we are already doing. And it also comes with support for attending our meetings. We had 70 nominations for '25, and we made six awards. At each meeting we will present two of these papers and invite the paper authors to present here. We have a very supportive award committee; thanks to everybody who helped with selecting the papers. The ANRP is kindly supported by ISOC, Comcast, and NBC Universal. Thanks very much for the support.

Dirk Kutscher: And, yeah, today we have two exciting talks on the agenda. The first will be by Tianchi Gao on his work "Designing Transport Level Encryption for Data Center Networks". The second will be by Xiangjie Huang on his work "Sending Burstiness Control for High-Quality Real-Time Communication". We will have these two talks now, and after the talks I'd like to continue the discussion that we started yesterday on internetworking challenges for AI. We had a dedicated meeting with really good presentations and ideas, and some good discussion that crystallized, I think, around a few good topics, and we'd like to use the rest of our time today to discuss this more with you. And now I'd like to welcome Tianchi Gao. His paper title is "Designing Transport Level Encryption for Data Center Networks." Tianchi is a PhD student at the School of Informatics at the University of Edinburgh, supervised by Michio Honda. He received his Master's degree in computer science at Edinburgh as well, with a Class Prize, and previously interned under Professor Yang-Gyu Sun at the Advanced Institute of Information Technologies at Peking University. His work focuses on designing high-performance, secure data center networking systems. He's currently working on SMT, a data center transport protocol with transport-level encryption that is designed as a generic secure transport and provides strong confidentiality while maintaining low latency and high throughput for a wide range of data center applications. He also contributed to XO, a framework for remote TCP connection offload that improves scalability and reduces host overhead. Welcome, Tianchi. Give me a second to bring up your slides. Okay, welcome again. Looking forward to your talk.

Tianchi Gao: Hello. Yeah, thank you, Dirk, for the introduction. So, everybody, I'm Tianchi Gao, a fourth-year PhD student at the University of Edinburgh. I'm here to present my work on designing transport-level encryption for data center networks. This work was done at the University of Edinburgh with my colleagues, and the paper is going to appear at IEEE Security and Privacy later this year, in May in San Francisco. So, let's get started.

Tianchi Gao: So, first, let me give some background on encryption in the internet. Before 2010, people were already starting to be aware of encryption in the internet, but encryption was exceptional; it was quite rare. The major turning point came after 2010, when Google started rolling out default SSL encryption. Then there was wider adoption of encrypted transports like SSL and TLS in the internet. More recently, people have started trying to encrypt everything: there are natively encrypted transport designs like QUIC and ECH, and people have put DNS over HTTPS. And most recently, people have started working on post-quantum-resistant encryption methodologies.

Tianchi Gao: But what about data centers? That's the internet; everyone knows the internet is not safe. But is the data center really safe? Of course not. In 2013, Snowden revealed that the NSA actually captured intra-data-center and inter-data-center traffic. People then realized that we actually need encryption even inside the data center, and that's what happens today: TLS is also used for microservices inside data centers. As we can show, Meta has a post about their practice of using TLS in their data centers. It's also quite clear that the data center is not really trustworthy: there may be malicious insiders and compromised tenants, which also makes us understand that we need encryption in the data center.

Tianchi Gao: However, the encryption protocol currently used in data centers is mostly TLS, which is based on TCP, and that is unfit for data centers. In data centers, the major workload is RPC, and we need high-throughput, low-latency RPCs for small and big messages. For that, software overhead and head-of-line blocking really matter. Let me give you an example. Say one client sends a request to a server, and that server partitions the request across several servers. These servers send their replies back. But if, say, one of the three is slow, the server in the middle has to wait for the straggler before it can send its reply. That's why tail latency is so important in the data center.

Tianchi Gao: But why does TCP/TLS in particular have head-of-line blocking problems in the data center? One example is packet loss: if we multiplex messages on one TCP connection and one message is lost, the application has to wait for it before it can be handed a later message that has already arrived. Another example is a large message in the flow: a smaller message behind it has to wait until the previous big message is fully received. And even if we don't multiplex messages over one TCP connection and give each message its own connection, the connections themselves share CPU cores, so one connection may delay another.

Tianchi Gao: That's the problem I want to solve, and that's why I designed SMT, Secure Message Transport. SMT has several design goals. First of all, because it's a secure message transport, we of course want it to be secure: we aim to meet the same security guarantees as TLS 1.3, based on the same trust model, but with a message-based abstraction. What does a message-based abstraction look like? This is what an SMT packet looks like: we have a source port, destination port, message ID, and message length in the transport-layer header. In the application layer, we also have a message framing header that indicates the offset within the message. By using a message-based transport, we can avoid head-of-line blocking by sending messages out of order.
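
To visualize the header fields just listed, here is a hypothetical packing of an SMT-style packet. The talk names the fields (source port, destination port, message ID, message length, and an application-layer message offset), but the widths, ordering, and the `make_packet` helper below are assumptions for illustration, not the actual SMT wire format.

```python
import struct

# Hypothetical on-wire layout for the fields named in the talk; the
# actual SMT widths and ordering are defined in the paper, not here.
# !  network byte order
# H  source port (16 bits)      H  destination port (16 bits)
# Q  message ID (64 bits)       I  message length (32 bits)
SMT_HDR = struct.Struct("!HHQI")
FRAMING = struct.Struct("!I")   # application-layer message offset

def make_packet(sport, dport, msg_id, msg_len, offset, chunk):
    hdr = SMT_HDR.pack(sport, dport, msg_id, msg_len)
    return hdr + FRAMING.pack(offset) + chunk

# Out-of-order send: chunks of one message can hit the wire in any
# order, since (msg_id, offset) lets the receiver reassemble.
pkt = make_packet(40000, 443, msg_id=7, msg_len=8192, offset=4096,
                  chunk=b"\x00" * 1024)
```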

Tianchi Gao: SMT can also leverage existing hardware offloads opportunistically. We already have TCP segmentation offload and TLS offload for TCP, and we want to utilize them. Also, unlike QUIC, SMT does not encrypt the transport header; we keep it as plaintext, because in the data center we generally have fewer attackers, and we really want in-network computing so we can load-balance messages in a fine-grained way.

Tianchi Gao: Oh, what's happening? Okay. We designed SMT as shown in the figure. We lay it out as a new transport protocol; we didn't design it with UDP encapsulation, to make better use of the hardware, and we really wanted to design it as a new transport. Some people may ask: why not just use QUIC? QUIC actually doesn't fit. QUIC is designed for the internet, where latency is at the millisecond level and head-of-line blocking is mainly on the path, and where there are a lot of middleboxes and old hardware that interfere. And for hardware offloading, because many QUIC endpoints are client devices, offloading isn't that essential there; the load is not that high.

Tianchi Gao: But our protocol SMT is expected to run in the data center, where the latency requirement is at the microsecond level, and head-of-line blocking can happen both on the paths and on the cores. However, the data center is a more controlled environment, so we have less middlebox interference. And to support high-throughput, low-latency RPCs, we really need hardware offloading.

Tianchi Gao: What about other options, since QUIC doesn't work? There have been proposals for encrypted transports like tcpcrypt and TCPS. However, their common problem is that they both inherit TCP's problems, including fundamental head-of-line blocking. There are also proposals for message-based transports, but they are not encrypted, for example SRD from AWS and Falcon from Google. However, they are both hardware transports: they depend heavily on the hardware, and they are not generic. There have been generic message-based transport proposals from academia, like NDP and Homa. Among them, Homa provides a generic message-based transport interface with a reasonable Linux implementation. So our idea is to use Homa as a base, as a middle ground, to design an encrypted message-based transport.

Tianchi Gao: Since we already have Homa and we already have TLS, why not just stack TLS over Homa? That actually doesn't work, because TLS assumes an in-order byte stream. For example, say we do one TLS handshake and then encrypt three messages with record sequence numbers 0, 1, and 2. Because a message-based transport sends and receives messages out of order, a later encrypted message can arrive earlier, but the receiver doesn't know that and still tries to decrypt with the earlier record sequence number, so decryption fails. Another option would be to do a handshake for each message, but that's really far too slow.
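
The failure mode described above can be demonstrated in a few lines. In TLS 1.3, each record's AEAD nonce is derived from an implicit per-record sequence number, so a receiver expecting records in order cannot authenticate a record that arrives early. Here is a small sketch using the `cryptography` package's AES-GCM primitive; the key, IV, and message contents are made up for the demo.

```python
from cryptography.exceptions import InvalidTag
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key, iv = AESGCM.generate_key(bit_length=256), b"\x00" * 12
aead = AESGCM(key)

def record_nonce(seq):
    # TLS 1.3 style: XOR the 64-bit record sequence number into the IV.
    return bytes(a ^ b for a, b in zip(iv, seq.to_bytes(12, "big")))

# Sender encrypts three messages under stream-style sequence numbers.
records = [aead.encrypt(record_nonce(seq), msg, None)
           for seq, msg in enumerate([b"msg0", b"msg1", b"msg2"])]

# Message-based transport: record 2 arrives first, but a stream-based
# receiver still expects sequence number 0 -> authentication fails.
try:
    aead.decrypt(record_nonce(0), records[2], None)
except InvalidTag:
    print("out-of-order arrival breaks naive TLS-over-Homa")
```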

Tianchi Gao: Let me introduce the architecture of SMT. For SMT, the handshake is done in the application: we assume the application has already agreed on a key pair beforehand, via a non-encrypted message-based transport, and then sets the key in our transport. The application can then send plaintext messages of different sizes to the SMT socket, and SMT will deliver them on the wire concurrently and encrypt them inherently. Each box here is a packet, which has a message ID, message length, offset, and payload; the offset and payload are actually encrypted.

Tianchi Gao: Okay, there are two main design components in SMT. The first is the message format, which we base on the TLS record-layer protocol because we want to preserve existing TLS offload with TSO. The second is a per-message record sequence number space, where transport integration enables replay protection, to achieve similar security guarantees.

Tianchi Gao: So, let's dive into the SMT message and packet format, and how we reach the goal of enabling TSO with TLS offload over a message-based transport. First, let me explain what TSO and TLS offload are. On the left-hand side, there is a big packet with a header. When we send this big packet into the NIC, the NIC chunks it into MTU-sized packets, copying the header into each packet, and it does the encryption at the same time.
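
A software model of the segmentation step just described might look like this; the `tso_segment` helper and the byte-string header are illustrative stand-ins for what the NIC does in hardware (the real NIC also rewrites per-packet fields and performs encryption in the same pass).

```python
def tso_segment(header: bytes, payload: bytes, mtu_payload: int):
    """Software model of TSO: the NIC receives one oversized packet,
    chunks the payload to MTU size, and replicates the header onto
    every chunk (encryption would happen in the same pass)."""
    return [header + payload[off:off + mtu_payload]
            for off in range(0, len(payload), mtu_payload)]

# One 4500-byte payload with a 1500-byte MTU budget -> three packets,
# each carrying a copy of the original header.
packets = tso_segment(b"HDR|", b"x" * 4500, mtu_payload=1500)
assert len(packets) == 3 and all(p.startswith(b"HDR|") for p in packets)
```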

Tianchi Gao: Our key finding is that TLS offload with TSO, which was originally invented for TCP/TLS, actually works for non-TCP packets, that is, when the protocol number in the IP header is not TCP, as long as the transport header format looks sufficiently like TCP. That's why we have this packet format for SMT: a protocol number not equal to TCP in the IP layer, and in the transport layer a message ID, message length, and also something called the TSO offset, which I'll explain later. The TLS records are wrapped up with the message offset and the actual message payload. One packet can contain a TLS header, message offset, and message payload, and the packets inside a TLS record can have just a message offset and message payload. And one TSO segment can carry multiple TLS records, up to 64 KB.

Tianchi Gao: Now let me explain how segmentation and reassembly actually work for SMT. Say we have one message spread over two TSO segments, each with one TLS record. On the right-hand side is the sending side. After the packet goes through the NIC, it is chunked to MTU size: each TSO segment becomes three packets, one carrying the TLS header, one without any TLS framing, and one carrying the TLS trailer. On the receiving side, these two sets of three packets are reconstructed back into TSO segments, already encrypted, based on the IP ID and the TSO offset we set on the sending side. Then we can decrypt: the encrypted part is decrypted based on the known TLS record sequence number, and we reconstruct the message payload itself based on the message offset header.

Tianchi Gao: The other main component of SMT is the per-message record sequence number space. Our goal is out-of-order message delivery, but the problem is having multiple sequence number spaces: if messages just share one TLS handshake, we can create duplicate sequence numbers, which breaks the replay protection provided by TLS. Our solution is to avoid duplicates by incorporating a message ID that is unique within the session. There is a 64-bit record sequence number field in TLS; we put the message ID in the higher bits of this field and the index of the record within the message in the lower bits. On the right-hand side is an example: we have three messages, and the first record of message 0 has index 0, the second has index 1, so they use record sequence numbers composed from (0, 0) and (0, 1), and the same applies to the other messages.
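
The composite sequence number is straightforward bit packing; here is a minimal sketch of the 48/16 split described in the talk (the function names are mine, not SMT's).

```python
REC_IDX_BITS = 16                   # low bits: record index in message
MSG_ID_BITS = 64 - REC_IDX_BITS     # high bits: per-session message ID

def composite_seq(msg_id: int, rec_idx: int) -> int:
    """Pack (message ID, record index) into the 64-bit TLS record
    sequence number field so no two records in a session collide."""
    assert 0 <= rec_idx < (1 << REC_IDX_BITS)
    assert 0 <= msg_id < (1 << MSG_ID_BITS)
    return (msg_id << REC_IDX_BITS) | rec_idx

def split_seq(seq: int):
    return seq >> REC_IDX_BITS, seq & ((1 << REC_IDX_BITS) - 1)

# Message 0's records get 0x0000, 0x0001, ...; message 1's records get
# 0x10000, 0x10001, ...: unique across the whole handshake session.
assert composite_seq(1, 0) == 1 << 16
assert split_seq(composite_seq(5, 7)) == (5, 7)
```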

Tianchi Gao: However, this leads to another problem: we need to trade off between the maximum message size and how many messages we can serve in one TLS handshake session, since we are splitting the record sequence number field in two, one part for the message ID and one for the record index. The bit allocation can be flexible to support larger message sizes. Looking at the figure on the right: with fewer bits for the record index, we can handle more messages per handshake, but the maximum message size gets smaller. For example, with only 8 bits for the record index, the maximum message size is only about 400 KB, which is not good. So currently we opt for the point circled in red: we allocate 16 bits for the record index and 48 bits for the message ID. In that case, we can support messages up to around 100 MB and around 300 trillion messages per handshake session.
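
The trade-off arithmetic can be checked directly. Assuming roughly MTU-sized (about 1,500-byte) TLS records, which is an assumption on my part that makes the talk's numbers line up, the figures follow:

```python
RECORD_BYTES = 1500  # assumption: roughly MTU-sized TLS records

def tradeoff(rec_idx_bits: int):
    max_msg = (1 << rec_idx_bits) * RECORD_BYTES   # bytes per message
    max_msgs = 1 << (64 - rec_idx_bits)            # messages/handshake
    return max_msg, max_msgs

for bits in (8, 16, 24):
    size, count = tradeoff(bits)
    print(f"{bits:2d} index bits -> max message {size / 2**20:8.1f} MiB, "
          f"{count:.1e} messages/session")
# 8 bits  ->  ~0.4 MiB (the ~400 KB point mentioned in the talk)
# 16 bits -> ~93.8 MiB (~100 MB) and ~2.8e14 (~300 trillion) messages
```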

Tianchi Gao: So, let's take an overview of SMT. Compared to TLS 1.3, SMT provides the same authentication and confidentiality, and it actually adds an integrity check on top of Homa, since Homa doesn't provide any checksum. Ordering in SMT is per message: we only have ordering within each message, where the record index increments monotonically. For replay protection, the composite ID guarantees that every message ID appears at most once, so we have a unique space. And for length concealment, we can still preserve the existing TLS padding mechanisms.

Tianchi Gao: Now, to support TLS offload: TLS offload in Mellanox NICs was designed for in-order delivery, so supporting it for out-of-order delivery poses another main challenge. The per-message record sequence number is essential for TLS offload, because the NIC expects records to arrive in order. The NIC has its own counter for the TLS record sequence number: every time a record comes in, the counter inside the NIC increments by one. We can update the counter inside the NIC, but not automatically. For example, we can attach a dummy packet before the actual record, like S1, to update its counter to 1, and the same applies to R3, to update the NIC counter to 3. But a message-based transport can send messages in any order: if two cores send S1/R1 and S3/R3 concurrently, the serialization on the wire may come out as S1, R3, or something else, which breaks the counter synchronization. That's why we need the per-message record sequence number, so each sender can use a separate flow context. Here we use C1 and C2 as separate flow contexts, or crypto engines, in the NIC, so the NIC has separate counters for each message, and we recycle them.
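
A toy model makes the counter-desynchronization problem visible. The `NicCounter` class below is an invented stand-in for the NIC's in-order TLS record counter, not a driver API.

```python
class NicCounter:
    """Toy model of the NIC's in-order TLS record counter."""
    def __init__(self, start=0):
        self.expected = start
    def submit(self, rec_seq):
        ok = (rec_seq == self.expected)   # offload only works in order
        self.expected = rec_seq + 1
        return ok

# Two cores emit records for messages A (seqs 0,1) and B (seqs 0,1);
# the wire serializes them interleaved.
wire = [("A", 0), ("B", 0), ("A", 1), ("B", 1)]

shared = NicCounter()
print([shared.submit(seq) for _, seq in wire])
# -> [True, False, True, False]: one shared counter desynchronizes.

contexts = {"A": NicCounter(), "B": NicCounter()}  # per-message context
print([contexts[msg].submit(seq) for msg, seq in wire])
# -> all True: separate counters track each message independently.
```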

Tianchi Gao: The implementation of SMT is currently around 3,000 lines of patch on the Homa Linux kernel module and 300 lines of code on the Mellanox NIC driver to support the encryption offload, and it's open-sourced at this link. We are working on the next version, with more features and better performance. Now let's briefly go through some evaluation. We compare SMT with existing TCP/kTLS, with TLS offload both enabled and disabled. For unloaded latency, SMT outperforms TLS by 20 to 30 percent with hardware offload and 16 to 35 percent without it. We also measured loaded throughput for different numbers of concurrent messages and different message sizes. For small messages, 64 bytes, SMT shows 16 to 40 percent improvement, and the same applies to 1 KB. However, for larger messages, because Homa is not really optimized for large messages, SMT is slightly slower than existing TCP/kTLS, but we are working on it.

Tianchi Gao: We also ported SMT to actual applications. For example, we ported it to Redis, and we show that SMT outperforms kTLS by 5 to 13 percent with TLS offload and 8 to 17 percent without it, across different workloads. We also show SMT's performance with an in-kernel application, not only user-space applications: we ported NVMe-over-Fabrics to SMT and compared with NVMe-over-TCP. For P50 latency, there is up to 77 percent improvement with offload and 15 percent without it. The improvement is even bigger at P99, because SMT works better on tail latency: up to 60 and 21 percent with and without TLS offload, respectively.

Tianchi Gao: Here are some discussion points and future directions. SMT ensures message integrity, which Homa lacks. As I mentioned before, we currently use the IP ID to reconstruct the chunked packets of a message, but the IP ID doesn't exist in IPv6, so there we can only have TSO segments of at most two MTUs. We want to solve that somehow, maybe with some indication in the TCP sequence number field for non-TCP packets. Another thing is receive-side offload: currently SMT can only do send-side TLS offload, because we are unable to make the NIC decrypt a packet that is not TCP unless we can modify the firmware. We are working on this. Another topic is PSP. I don't know whether you know PSP; it's something like IPsec. We are also thinking of supporting it. Also, because we preserve the plaintext message ID and message length, we are planning more in-network computing applications, for example working with MTP, a protocol presented at NSDI 2025. And regarding post-quantum questions, SMT is automatically post-quantum resistant, because the handshake part just follows the TLS handshake, and the cipher we use is just AES-256-GCM.

Tianchi Gao: The conclusion is: Secure Message Transport, SMT, is a message-based encrypted transport for data centers. We already have a Linux kernel implementation as a Homa Linux extension, and we are also working on a BSD implementation—a colleague will present it at BSDCan later this year. The ongoing work is improving the performance and implementation, and we are trying to submit an Internet-Draft for SMT for the next IETF. Please see more details in our papers, and I'm very happy if you come talk to me in person—we can discuss. I'm also super happy to hear any criticism. Okay, thank you.

Dirk Kutscher: Thanks. Great work, Tianchi. Okay, we'll open it up for questions. Rod was first.

Speaker 1: Hi, Tianchi. Thanks, that was a really good talk, and just from a quick look at the paper, it looks really good. I actually want to go back to the underlying assumptions. Have you done any work on analyzing whether there are actual existing threats in the data center environment? How strong is the underlying proof of the motivation for doing this work?

Tianchi Gao: So, existing data center networks have mostly adopted TLS already. After the NSA revelations, Google and Meta all have their own ways to do encryption in the data center. There is also literature showing that inside the data center, particularly in multi-tenant clouds, a tenant can be compromised, right? And there are zero-day attack threats—ways to actually escape from the virtual machine to the hypervisor, and then it can snoop on packets from other tenants.

Speaker 1: And so that's actually an attack at the packet level on the wire, as opposed to a cross-VM attack?

Tianchi Gao: Oh, for cross-VM there are different vendors, and you don't know whether there's a bug on the vendor side, right? Like for...

Speaker 1: Between virtual machines inside the same node. That actually seems like the bigger vulnerability to me, but I've only been thinking about this—I haven't looked at it, so I don't know.

Tianchi Gao: Yeah, there are, there are. Okay.

Stewart Cheshire: I'm Stewart Cheshire from Apple. Thank you for this presentation. I have a couple of suggestions of things for you to look at and explore. One comment I have is that ever since I was a student, people have been talking about head-of-line blocking being a problem. And it's the kind of thing where you can draw a picture and see that it seems like a huge problem and it's really unfair that these packets are waiting because of head-of-line blocking. But the question is how often that happens. I did some work years ago with a professor from Franklin and Marshall College in Pennsylvania, a guy called Jana Iyengar. And if you look for expired internet drafts with "Minion" in the name, he...

Tianchi Gao: Yes, we know Minion. Minion has a separator between the actual messages. But we found that Minion is still based on TCP and needs some modification to TCP, which is kind of hard. So we thought maybe it's easier to just have a new protocol, right, instead of a hacked TCP.

Stewart Cheshire: So you've answered one of my questions, which is that you're aware of Minion, which had some similarities in that it wasn't really TCP but kept the TCP header format so that things like TSO could be used.

Tianchi Gao: Yes—actually, our work was inspired a lot by Minion. Thank you.

Stewart Cheshire: That's nice to hear. We implemented that—Jana actually did a sabbatical at Apple and we implemented it—and one of the unexpected findings, the reason the drafts are now expired, is that there were lots of teams at Apple who thought they wanted out-of-order delivery, and when we gave it to them and they tried to use it, they found it's much harder than they realized to actually write their code to handle data when it arrives out of order. That code is hard to write, and it runs rarely, so it doesn't get very well tested. And just my last comment, on the subject of it being rare: head-of-line blocking is caused by packet loss, and you started off by talking about that. The question I would ask is how common packet loss is in a data center, because if you're running your data center right with DCTCP, there should be no packet loss. And...

Tianchi Gao: Yeah, packet loss is not that common, but a large message can still delay later messages on the same connection. And another thing: if you consider a core handling TCP connections, load balancing on TCP can only be done at per-connection granularity, right? We cannot just dispatch the messages to different cores. That's actually the fundamental reason on the computing side.

Stewart Cheshire: All right. Thank you.

Antoine Fressancourt: Hello. Antoine. I read your article several times, and it's very nice.

Tianchi Gao: Oh, thank you.

Antoine Fressancourt: I have two questions, actually, from reading the paper and looking at the presentation, regarding the fact that in your design goals you target exchanging messages in the data center with very low latency. The first is: you rely on TLS for key exchange.

Tianchi Gao: Yes.

Antoine Fressancourt: And isn't this introducing a long delay to establish the key at the beginning of flow establishment?

Tianchi Gao: So, yes. I mean, we need some way to have agreed keys.

Antoine Fressancourt: Okay, so then you do a 0-RTT and...

Tianchi Gao: We can have resumption for later flows, of course. But, I mean, we need a way to exchange keys. A TLS handshake can be one way, or we can have a pre-shared key—whatever, as long as we find a secure way. And I think TLS is the most common standard for handshakes nowadays.

Antoine Fressancourt: The other one is: in the design of your protocol, you make lots of design decisions around the fact that you're going to use the TCP offloading capability for encryption. Why didn't you use lower-level offloading, like offloading the cryptographic function directly, for the custom-designed protocol that you're building anyway?

Tianchi Gao: The key reason is that we want to do the TSO segmentation and encryption together: we just give it to the NIC, and the NIC chunks and encrypts. Of course, we could let some encryption engine, like Intel QAT, do that, but then we'd need to call it, get the data back, and then give it to the NIC, which adds latency. And since we already made it work, why not?

Antoine Fressancourt: Okay. My assumption was that the second way would have lower latency, but if it's not, perfect.

Tianchi Gao: No—our hack didn't introduce more latency compared to TCP with TLS, as shown in the figure. Because our stack is lighter than kernel TLS, we actually add less latency on top of the plaintext transport. The way we made the Mellanox NIC work with message-based TLS offloading didn't introduce more latency compared to TCP TLS offloading.

Antoine Fressancourt: Okay, perfect. Thank you.

Dirk Kutscher: Okay, let's thank Tianchi once again for the great paper and the very nice talk. So, SMT... (handing out award) Okay, great talk. Thank you very much. Can you stay a bit after the session for another photo? Of course. Okay. Great.

Dirk Kutscher: Now I'd like to welcome Xiangjie Huang. Xiangjie Huang is a first-year PhD student at HKUST, advised by Professor Zili Meng. Prior to starting his doctoral studies, he obtained his Master's degree with a focus on video coding. He's deeply passionate about advancing next-generation real-time communication systems and improving them in every possible aspect. Today he will talk about ACE: Sending Burstiness Control for High-Quality Real-Time Communication. Welcome, Xiangjie.

Xiangjie Huang: Thank you, Professor Dirk. Okay, good afternoon, everyone. My name is Xiangjie, from Hong Kong UST, and it's a great pleasure to deliver this talk to you. I'm a first-year PhD student, and this work, ACE, is a paper we published at SIGCOMM last year. In this work, we made some improvements to the real-time communication system. Real-time communication, compared to, say, data center networking, is kind of a niche field, so I'm glad I can share with you some of our experience of how we improved it.

Xiangjie Huang: In this paper, ACE, we made some improvements, and to do this, we proposed a new control dimension we call sending burstiness. Since this is not a conference talk, I want to discuss with you some of the real-world motivation behind it and some of the experience we had when applying it to large-scale deployments. So, let's get started.

Xiangjie Huang: First, real-time communication—some of you may be familiar with this term, some of you may not, so I will explain briefly. Real-time communication, RTC, is the set of video transmission protocols and techniques underlying a lot of daily applications. For example, the daily video phone call when you want to call your friends in video mode, and the video conferencing that we use every day and are using now. There are a lot of other applications, such as emerging ones like cloud video gaming, and some even futuristic ones like remote surgery and the teleoperation of robots and vehicles. And there are many others not listed on the slide—for example, AR/VR streaming, when you want to stream a video from a service to the VR/AR headset.

Xiangjie Huang: So, real-time communication has become indispensable in our lives. And why is it hard? What is different between real-time communication and video on demand or traditional video streaming? The main difference is that communication in RTC is more interactive. In other words, when you play a video game in the cloud, the game runs on the server, and for every command you make from your keyboard you want instant feedback from the server—you want to see the next rendered frame. So the latency requirement of RTC is typically strict, and there is no buffering allowed when transmitting these video streams. That is the main difference.

Xiangjie Huang: In this slide, what I want to show you is the frustrating latency issues in RTC. A lot of researchers, including us, identify tail latency issues as the main obstacle on the path to future applications. Here I list some of the papers. As you can see, these papers are from recent years, from NSDI and SIGCOMM, and they all have one thing in common: a large-scale measurement on production-level RTC applications. And if you look at the values they report for the tail latency metric, the values are still very high. For example—the third paper here, I don't know why its position is shifted—but for paper two, you can see the stall frequency it reports is still one stall event per minute. That translates to a user-noticeable video freeze every minute. Imagine using this RTC system for remote surgery: it would be not only a frustrating experience but also very dangerous.

Xiangjie Huang: So this directly motivated us to look at where the latency comes from, and it directly motivated this paper. In this paper, we break down the latency into parts. Let me show you our evaluation on WebRTC. The left side of this figure reports the scenario where the end-to-end latency is less than 100 milliseconds—most of the cases, the low-latency cases that are not in the tail. For most cases, the latency components don't show much difference in proportions. But in the tail cases, where latency is greater than 200 milliseconds, the pacing delay takes the lead among the latency components. That is our evaluation on WebRTC, and we also validated it against online data we collected from ByteDance Douyin cloud gaming, which directly shows that the pacing delay contributes 100-millisecond stalls.

Xiangjie Huang: So, what is the pacing delay here, and how does it arise? Let's look at it in more detail, starting with the sending pipeline of RTC. The RTC stack spans layers from the application down to the network layer, and there are two important components here: one is rate control, the other is the video encoder. Ideally, the rate control constantly monitors the network and determines a rate, and that rate is set as the target rate of the video encoder. The video frames we are going to transmit are then compressed following this target bitrate, segmented into data packets, and finally sent to the network.
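
As a rough sketch of that pipeline, here is what one frame's journey looks like; the object interfaces (rate_control, encoder, network) are invented for illustration and are not WebRTC's actual API:

```python
# Minimal sketch of the RTC sending pipeline described above (illustrative only).
def send_one_frame(rate_control, encoder, network, raw_frame, mtu=1200):
    target_bps = rate_control.current_rate()         # rate control monitors the network
    encoded = encoder.encode(raw_frame, target_bps)  # compress toward the target bitrate
    for offset in range(0, len(encoded), mtu):       # segment into data packets...
        network.send(encoded[offset:offset + mtu])   # ...and hand them to the network
```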

Xiangjie Huang: (The slides don't have any animations, right? No, they get converted to PDF. Oh, that's fine—I'll just use my finger to point. Thank you.) So, the video will finally be compressed into packets and sent to the network. This seems fine. But in reality, the sending pattern is not as smooth as we think, because the video comes in frames. In the real RTC sending pipeline, the frames are captured frame by frame, so for a 30 FPS video that's one frame every 33 milliseconds—we get a burst every 33 milliseconds. And these bursts can be large. Think about it: if we want to stream a 30 Mbps video at 30 FPS, one single frame can be over 100 packets. So there will be bursts across the whole sending timescale.
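
The burst-size arithmetic can be checked quickly; the per-packet payload below is an assumption (a typical RTP/QUIC-sized payload), not a number from the talk:

```python
# Back-of-the-envelope burst size for one frame of a 30 Mbps, 30 FPS stream.
bitrate_bps = 30e6                    # 30 Mbps video
fps = 30                              # frames per second
payload_bytes = 1200                  # assumed per-packet payload

frame_bytes = bitrate_bps / 8 / fps   # 125,000 bytes per average frame
packets = frame_bytes / payload_bytes
print(f"{frame_bytes:.0f} B/frame ≈ {packets:.0f} packets")  # ≈ 104 packets
```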

Xiangjie Huang: These bursts can be exacerbated by one more thing: content complexity. If we think about the real video frames that are going to be transmitted, the frames all differ in content. For example, among the four frames here, the third frame is quite different from the previous two, which results in an even larger frame size for the third frame. If we directly send this frame into the network, it is considered risky—risky in the sense of overshooting the bottleneck network buffers. That's why a straightforward solution to this problem is to add a queue on the transport side and set the target rate given to the video encoder to the output rate of this queue. In that case, the video can be transmitted into the network very smoothly.

Xiangjie Huang: This sounds great, but as long as there is a queue, there will be possible delays. (Because there is no animation, I'll just talk through it.) If a frame is exactly the average frame size—the dotted line—the delay of this queue will be exactly the frame interval. But if the frame is large—say, twice the average size—the delay of this queue will be twice the frame interval. So the larger the frame, the larger the queue gets. This is where the pacing delay that I showed in the previous slides comes from.
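
A minimal sketch of that relationship, assuming an otherwise empty pacer queue drained at the average rate (avg_frame_size × fps):

```python
# Pacing delay of one frame when the pacer drains at the average frame rate.
def pacing_delay_ms(frame_bytes, avg_frame_bytes, fps=30):
    frame_interval_ms = 1000 / fps
    return (frame_bytes / avg_frame_bytes) * frame_interval_ms

print(pacing_delay_ms(125_000, 125_000))  # average frame -> 33.3 ms (one interval)
print(pacing_delay_ms(250_000, 125_000))  # double-size frame -> 66.7 ms (two intervals)
```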

Xiangjie Huang: Okay. In this slide, I want to show you why this matters more today. (Forgive me—there are actually three reasons on this slide, but they were revealed through animations, so only one is shown.) The first reason is the network: in short, network RTT is shrinking. Thanks to 5G and to the development of network infrastructure such as edge servers, the RTT is now shrinking, especially for services like cloud gaming. In the past the RTT was large, so propagation delay took the lead; now it's the queuing delay. And if the queue forms on the sender side, the queuing delay is pacing delay. This is the main reason why pacing delay is now taking the lead.

Xiangjie Huang: Another reason, even though it's not shown here, is content variability. If we are streaming a gaming video, the content moves very fast, so the fluctuation in frame size will also be large. And the third reason we identify is users' quality requirements. Users now ask for very high-resolution video, so the average frame size gets larger—moving from 1080p to 4K, for example, increases the average frame size. Looking at the second and third reasons together: the average frame size is getting larger and its fluctuation is getting larger. So in the tail cases we will occasionally experience very large frames that push the pacing latency into the tail. That's the trend.

Xiangjie Huang: And how can we deal with it? In this study, we first wanted to explore whether we can just let these bursty frames go. By letting go, I mean just sending them into the network without further consideration. If we just let these packets go into the network, it becomes a comparison between two queues: if the queue doesn't form on the sender side, it forms on the network side. You can think of the buffer here as the bottleneck buffer inside the network. So it becomes a comparison of these two queues.

Xiangjie Huang: First, let's think about situation one, where the pacing rate is less than the network bandwidth. In that case, it is better to let the packets go into the network, because they will be delivered faster—there is no point in letting them wait in the pacing queue. But there is another circumstance. (This should finally show up—sorry.) The other circumstance is that the queue size of the network buffer is unpredictable. If we send a pretty large frame that fills this network buffer and packets get lost—or the frame is larger than the buffer—then we'll have subsequent retransmissions, and the latency goes to the tail. In that case, it would be better to let the packets stay in the pacing buffer. So, as you can see, either way—whether we just let the packets go or keep them in the pacer—there is only one conclusion: a lack of burstiness management. This is the main claim we make in this paper: we want to manage the burstiness that we send into the network. That's the key insight.

Xiangjie Huang: To do this, we proposed ACE. In ACE, we want to manage the burstiness from both the enqueue and dequeue perspectives of the queue. ACE consists of two control mechanisms. One is called ACE-N—N means network side—where we adapt the sending burstiness. And on the encoder side, ACE-C adapts the encoding complexity to get a smoother frame size. As you can see, the problem of the pacing queue is induced by the frame size, so we want the frames produced to be as smooth as possible; and on the network side, we sometimes want to send burstily into the network and sometimes keep pacing. That's the main idea of ACE: the two mechanisms work together so that all of this burstiness is managed.

Xiangjie Huang: Let's talk about some details of the network side, ACE-N. On the network side, we want to control the sending pattern, and the main insight is to adaptively control the sending burstiness. We ask two simple questions. The first: is the network buffer empty? If it is, it's best to let the packets go, so we can burst more aggressively. The second: do we risk overshooting the network buffer? If we do, it's dangerous, so we want to shrink the burst and pace more strictly.

Xiangjie Huang: This guides our design of ACE-N. In ACE-N, we estimate the queue size to adapt a token bucket size. The token bucket controls how much burst you can let go into the network, and we adapt the bucket size. And what's the input of our algorithm? We take the queuing size as the input. Ideally, we would want to know the available buffer size inside the network, but that is not possible—we never know how large the queue inside the network can get. The only thing we can estimate is the actual queuing size. So we estimate the queuing size and increase the bucket size to probe, and if we detect loss or a growing queue—meaning the network queue is probably near the top—we decrease it to avoid loss. We did have a lot of other design details to make this more complete and cover the corner cases, but in the simplest terms, that is our design philosophy for ACE-N.
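
A minimal sketch of that control loop; the constants and the halve/step-up rules below are assumptions standing in for the paper's full algorithm:

```python
# Sketch of ACE-N's adaptive token bucket (illustrative constants and rules).
class AdaptiveTokenBucket:
    def __init__(self, init_burst=30_000, min_burst=1_500, max_burst=150_000):
        self.burst = init_burst        # bytes allowed to leave in one burst
        self.min_burst = min_burst
        self.max_burst = max_burst

    def on_feedback(self, loss_detected, est_queue_bytes, queue_threshold=50_000):
        if loss_detected or est_queue_bytes > queue_threshold:
            # Network queue is likely near the top: shrink the burst, pace strictly.
            self.burst = max(self.min_burst, self.burst // 2)
        else:
            # Queue looks empty: probe upward to allow a larger burst.
            self.burst = min(self.max_burst, self.burst + 1_500)

bucket = AdaptiveTokenBucket()
bucket.on_feedback(loss_detected=False, est_queue_bytes=0)       # probe upward
bucket.on_feedback(loss_detected=True, est_queue_bytes=80_000)   # back off
```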

Xiangjie Huang: Next, ACE-C, the encoding part. For the encoder, our goal is to reduce the size of the large frames—we want to make them as smooth as possible. So we ask the encoder, seeing a large frame here: is there any way you can make this frame a little smaller? And the encoder says yes, there is an approach—the current approach. If the encoder wants to make a frame smaller, it uses a strict rate control mechanism. But this comes at a cost: with strict rate control you actually apply more lossy compression to the frames, which ends up in worse quality. That is not what we want. In other words, if we treat those frames with more lossy compression, we get a blurry frame, and the blur even affects the subsequent frames. Our evaluation shows that the strict rate control mechanism reduces the VMAF score of the video by 10 to 15 points.

Xiangjie Huang: What we chose to do is different: we adapt the encoding complexity to enable a smoother stream. Let me explain what encoding complexity means. The encoding complexity is controlled by a set of video coding techniques: the more advanced techniques we use, the higher the coding complexity, but the smaller the frame size. More encoding complexity increases the encoding delay, but because the frame is smaller, the pacing delay decreases. So a trade-off is possible: if we can increase the encoding time a little but decrease the queuing time by more, there is a net time saving. That's the design philosophy of ACE-C: we adaptively control the encoding complexity to minimize the total time, encoding time plus queuing time.
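
A sketch of that selection; the preset table is made up for illustration, since real encode times and frame sizes depend on the encoder, content, and hardware:

```python
# Pick the encoding complexity that minimizes encode time + pacing (queuing) time.
PRESETS = [
    # (name, encode_ms, expected_frame_bytes) -- illustrative numbers
    ("fast",   4.0, 250_000),   # low complexity: quick encode, large frame
    ("medium", 8.0, 180_000),
    ("slow",  14.0, 140_000),   # high complexity: slow encode, small frame
]

def pick_preset(pacing_rate_bytes_per_ms):
    def total_latency(preset):
        _, encode_ms, frame_bytes = preset
        return encode_ms + frame_bytes / pacing_rate_bytes_per_ms
    return min(PRESETS, key=total_latency)

print(pick_preset(62_500))  # ample bandwidth (~500 Mbps): cheap encode wins
print(pick_preset(3_750))   # tight bandwidth (~30 Mbps): spend CPU to shrink frames
```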

Xiangjie Huang: One last thing I want to mention about the ACE-C control: actually, 95 percent of the frames are fine—we don't need extra encoding work to control them. What matters is the tail frames, the fewer than 5 percent of oversized frames that eventually cause stalls and push the latency to the tail. Those are the frames ACE-C targets: we spend more encoding effort on them to make them smaller. That's the key philosophy of ACE-C.

Xiangjie Huang: Let me show you the evaluation results. The main evaluation was conducted with Mahimahi emulation using real-world Wi-Fi and cellular traces. We compared ACE against several baselines, some from encoding rate control mechanisms and some from WebRTC's pacing or bursty sending mechanisms. As you can see, ACE strikes a balance in between. The best-quality baseline is native WebRTC pacing: it shows great quality, but because of the large frames, its P95 latency is the highest. Compared to that, ACE reduces latency by 33 percent while maintaining similar quality. And compared to the strict bitrate-following baselines, ACE has similar latency, but because it allows frames to preserve quality, its VMAF score is 12 points higher. So this slide shows that in our evaluation, controlling the burstiness yields results superior to both the pacing/bursty baselines and the CBR rate control baselines.

Xiangjie Huang: In this slide, I want to show you what we found in different environments in the experiment. The key experience is that sending burstiness control benefits Wi-Fi scenarios and gaming videos the most. First, the scenario: in the Wi-Fi traces, the P95 latency is reduced the most, by 43 percent. For the cellular ones, the benefit is still there but much smaller. We infer that the gains are less pronounced on cellular traces because the bandwidth fluctuations there are higher; in that case, congestion control is more to blame, so the ACE algorithm cannot play the decisive role. The figure on the right shows the evaluation on different video types. For game video, the P95 latency reduction can exceed 70 percent, while for lecture video the benefit is about 30 percent. That is because game video has a lot of movement, and a lot of movement comes with large fluctuations in frame size—the cases where ACE performs best. So the lesson here is: pay attention to burstiness management when the CC performance is already good and the content is varied.

Xiangjie Huang: In this slide, what I want to show is our experience going from the paper to real-world deployment. We first evaluated on the campus Wi-Fi, implementing our algorithm on WebRTC with a software encoder, x264—a widely used one. The performance was good, the implementation was fine, and it was really easy to deploy on WebRTC. Then we moved the real-world experiment to one of our collaborators, ByteDance Douyin cloud gaming, to see whether the deployment on their online system would also go well. ACE-N, the network part, deployed very simply, and we got a 15 percent stall rate reduction, which is very good.

Xiangjie Huang: But for ACE-C, the encoder part, we faced some limitations, and that is something I want to share with you. First, production services usually use hardware encoders, and hardware encoders lack control space in their encoding parameter settings. They always set strict rate control—in other words, they don't want to preserve quality; they trade the quality away. That is what limits ACE-C from going from experimental work to a real online implementation.

Xiangjie Huang: Finally, I'll show you some lessons we learned when deploying ACE-N online. The first lesson: we need clear boundaries with the CCA. This is required. The task of ACE-N is to manage the frame-level outliers, but it needs to build on top of a CCA foundation. The CCA is a sophisticated, carefully designed algorithm, and you should trust it: the CCA determines the target rate set for the video encoder, and ACE-N should work as another layer on top of the CCA, targeting the large-frame outliers that the video encoder has already produced. We should be clear about that, because the first time we wanted to test it, the engineers wanted to replace the CCA with our algorithm, and that is not the way. Never try to replace the CCA; deal with the outliers. The other experience is carrier throttling. In the deployment, we faced carrier throttling scenarios, especially at the end of the month when the carrier wants to limit and shape your traffic. Carrier throttling always comes with shallow buffers—the buffer in the network is set really shallow—so in those cases, zero burst is allowed. The core idea of this paper is to allow, to some extent, some burst to be sent into the network; but if the carrier is throttling you, never send any burst into the network—strictly pace. Those are two of the experiences we had when deploying this system.

Xiangjie Huang: And finally, the takeaways of what we have learned by doing this project, these RTC improvements. First, in modern RTC we should pay attention to this: the sender-side pacing latency can be a first-order problem. Sometimes it's not even the transmission latency itself—the latency can come from the sender side. Second, the average bitrate is not enough; we should pay attention to the burstiness. That's why we claim we should manage the burstiness. Third, strict pacing and blind bursting are both suboptimal; we should adapt in between. And the final one is for the encoder: cross-layer control is most useful when it preserves quality, rather than trading quality away. If you're using a system where quality is not important at all, burstiness management will be useless too. That's all for the presentation. I'm very open to questions—if you have anything to discuss, I'll be really happy. Thank you.

Dirk Kutscher: Thank you very much, Xiangjie. Great talk. Yeah, we do have time for questions. Yes, Christian.

Christian Huitema: First, let me tell you, I really enjoyed your talk. It's very interesting. I have been working on problems like that on and off for about 40 years, and it's refreshing to see your approach. The one thing that I've seen in the pacing algorithms implemented in transport protocols is that the leaky bucket is typically tuned to a pacing delay that happens to be very short. If you look at the default setup for something like BBR, for example, the quantum—which is basically the equivalent of the leaky bucket in BBR—is set to 1 or 2 milliseconds. And so I wondered whether you are using this kind of CC algorithm or you're doing something else.

Xiangjie Huang: Let me ask one question. What you mean is that in BBR, the sending time will be 1 or 2 milliseconds? Is that right?

Christian Huitema: No, that is not it. BBR can send millions of packets if you let it. But basically, the purpose of those leaky bucket algorithms is that they remember you did not use the network completely in the past, so they give you the right to send a little bit more now. And there's a tension there, because if the little bit more is too much, you're going to create a queue in the network, etc. So most of those algorithms are pretty conservative, and the little bit more they allow is typically the equivalent of 1 or 2 milliseconds of bandwidth. What you want here is more like the equivalent of, say, 15 or 20 milliseconds. And that's a big tension, because you also want to do this kind of trade-off in proportion to the round-trip time of the network. Since the network capacity is changing all the time, you are going to learn the new capacity basically every RTT. And as you learn every RTT, the CC algorithm will not want to give you too many tokens, because it's afraid that if the capacity changes within those 20 milliseconds, it would have given too much. So there is a big tension. And as you said, this kind of stuff should be discussed at the same time—basically, you should speak to the people doing the design of the CCA, because there are lots of interesting interactions there. But I really liked the idea: what you are doing is essentially trading bandwidth for CPU. And yeah, that's a very nice idea. That's very good.

Xiangjie Huang: Yeah, thank you. About the CCA question you raised: I actually think that if you're considering a CCA like BBR, it becomes very complicated, because BBR also wants to specify the pacing rate. For example, BBR wants to set a congestion window, but it also wants to set a pacing rate, right? If you look at it that way, BBR would be very complicated.

Christian Huitema: It's not that complicated, actually. What you have to know is that they set a pacing rate. And you know, queues are always either empty or full: they are empty if the pacing rate is higher than the sending rate, and they are full otherwise. So basically, you just have to read the pacing rate from BBR and make sure that the video is encoded at a slightly lower rate.

Xiangjie Huang: Oh, yeah, yeah, yeah. That's true. That's true.

Christian Huitema: Yeah.

Xiangjie Huang: Yeah. What I was going to explain is that in a WebRTC-like structure, it's GCC, right? GCC is a relatively simple congestion control mechanism: it only determines a rate, and it sets this rate as the target rate for the video encoder. It never specifies how much burst you are allowed to put into the network. That is what I would say.

Christian Huitema: Yeah. And maybe the solution is to set the target encoding rate lower than the network capacity.

Xiangjie Huang: Yeah, that's definitely a solution. But it comes at the cost of quality loss, and that is not what we want, right? We could set the rate a little bit lower so that the pacing latency issue never happens, but then the quality is sacrificed.

Christian Huitema: Okay. Look, there are many people in the queue, and I don't want to monopolize it. I just want to say thank you, and yes, there is plenty of work to do there.

Xiangjie Huang: Thank you. Thank you.

Stewart Cheshire: Stewart Cheshire from Apple. Great presentation, thank you. I found the paper online; I'm going to print it out and read it. I think there is much interesting work to explore in this space. I've talked with other people running into similar problems with L4S being deployed by Comcast and other operators right now. It's great for keeping latency low, and part of the way it keeps latency low is by encouraging flows to deliver smooth traffic. If you mark your traffic ECT(1) to opt into the low-latency service but then send big bursts, you're kind of cheating and not being a good citizen, so you get penalized by the queue protection function. So it becomes important, if applications want the benefit of low latency, that they pace their traffic appropriately. And that's why I find this really interesting: to know what the right amount of pacing is to get that balance right. So, thank you.

Xiangjie Huang: Thank you. Yeah, this is also something I hadn't thought about before.

Dirk Kutscher: Okay, next one in line, Mike.

Speaker 2: Hello. Yes, thank you for the presentation. This was excellent. I also plan to get a copy of the paper, and maybe that will answer my questions. But while I have the opportunity to ask directly: the 15 percent stall rate reduction that you saw in the cloud gaming setup—was that just from the network side of this work, ACE-N?

Xiangjie Huang: Yes, that's true. Because we didn't really get ACE-C deployed, unfortunately.

Speaker 2: That's very impressive. And one further question: what techniques are you using to estimate the bottleneck buffer size in the network?

Xiangjie Huang: It's actually a very simple strategy. You can calculate it as a delay multiplied by a rate, right? The queue size can be calculated as a delay multiplied by a rate. The delay we select is the min RTT, and the rate we select is the network capacity. We use a simple but efficient algorithm to estimate the network capacity, similar to packet trains or packet pairs. We use packet pairs to estimate the rate, so we can get a more accurate network capacity, and we multiply it by the min RTT, which finally gives the estimate. I think this is a widely used technique for estimating the queue size in the network, and we just borrowed the idea.
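
A minimal sketch of that estimate under the stated assumptions—packet-pair capacity multiplied by the minimum RTT; the numbers are illustrative:

```python
# Queue-size estimate: capacity (from a packet pair) multiplied by min RTT.
def packet_pair_capacity(pkt_bytes, gap_seconds):
    """Bottleneck capacity from the receive-side spacing of a back-to-back pair."""
    return pkt_bytes / gap_seconds                # bytes per second

def queue_size_estimate(min_rtt_s, capacity_bytes_per_s):
    return min_rtt_s * capacity_bytes_per_s       # bytes

cap = packet_pair_capacity(1_500, 120e-6)         # 1500 B spaced 120 us -> 100 Mbps
print(queue_size_estimate(0.02, cap))             # 20 ms min RTT -> 250,000 bytes
```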

Speaker 2: Very cool. Thank you.

Dirk Kutscher: Thank you. And next one is James. James, do you want to ask a question?

James: Hello. (noise)

Dirk Kutscher: So there are some problems with your microphone. We cannot hear you. Sorry, James. There's something wrong with your microphone, we can't hear you. (James trying again) No, it still doesn't work. James, please reach out to Xiangjie offline and ask your question. Sorry about that. Okay, don't run away. Let's thank Xiangjie once again. (handing out award) Thank you very much. Great talk. Can you stay a bit after the session for another photo? Of course. Okay. Great.

Dirk Kutscher: Great. Yeah, so thanks everyone—thanks to both of you for the great presentations. Let's move on and resume our discussion from yesterday. I'm not sure whether everyone was in the IRTF Open session yesterday, but we had a session discussing Internetworking Challenges for AI, and I just want to present the results and enable a bit more discussion about this. We had three talks yesterday on different topics that we felt could lead to quite interesting research challenges that go beyond the current engineering we see in other places. We had one presentation by Ming-Fing Chang on the Disaggregated Architecture for LLM Inference (Mooncake)—something that came out of the Kimi K2 design that's using a KV-cache-centric disaggregation architecture. The talk covered the design and its use in a single data center environment, but also alluded to scaling it up to more distributed settings for workloads beyond one data center. The second talk was by Hong Xu on Reliability Engineering – Challenges in Networking for AI. It discussed many things—failure detection, measurements, and so on—and also brought up the idea of having better testbed and benchmark systems in this space for evaluating systems better and performing reproducible experiments. And finally, we had a talk by Lisha Jiang, On AI Agent Communication, which took a more principled position on important naming, identity, and trust delegation topics. I'm sure you could not avoid all the agentic AI discussions in side meetings this week—there was also the BoF this morning. In this last talk, it turned out that there are many mechanisms being discussed these days, such as using DNS for discovery and looking up additional properties. But fundamentally, what's really important is to establish trust leveraging a human-readable established namespace, such as the DNS namespace, and then to be able to delegate trust on a fine-grained basis, because there will be many agents and you want to authorize them really carefully. So that was the idea behind that talk.

Dirk Kutscher: What we distilled from these discussions at the meeting was mainly these three topics that people felt could be quite fruitful to study and develop further. First, the testbed, benchmark, and dataset topic: that seems to be a very good activity that is difficult for other groups, like the IETF, or even for individual academic papers, to do. If there was a chance to develop such frameworks enabling experiments and benchmarking of new systems, we believe that could be really quite useful for the community. Second, the principled naming, identity, and trust delegation concept that I just talked about: this would require some more principled research to develop solid foundations, and of course also experiments with these systems, which would go well beyond the current engineering efforts as far as we understand them at the moment. And thirdly, this is a space where distributed computing and networking expertise have to come together to think about a co-design of these architectures. The KV-cache-centric architecture is just one example where you really need to understand networking performance, distributed systems concepts, and storage to some extent. One other topic that came up in Ming-Fing's talk is a unified transport that is able to deal with heterogeneous hardware systems and, in the end, also networks. So these were the main three things, in our observation, that came out of the meeting yesterday. Now I'd like to open the floor for further comments and ideas on this. Maybe people had other or additional observations they would like to share.

Dirk Kutscher: And I see Rod in the queue, yeah.

Rodney Grubbs: Sure, why not? I always have comments and questions. Thinking big picture here, I think it's actually a really exciting time again in systems architecture and implementations, for a bunch of reasons—even though my own research is headed in a different direction these days, so I'm more of a dilettante in this than I was a couple of decades ago. But I'm not sure to what extent we have to drag the buzzword "AI" into everything. This seems to be somewhat broader than just AI as an issue, particularly when you think about a lot of the work on re-architecting things around microservices, allowing them to relocate within the network, the orchestration of in-network computation, and all of these things. The only thing I can really think of from a systems point of view that's dramatically different about that workload is that the LLMs represent a very large investment in both data and computation that's maybe a little harder to migrate around than some of these things. It may just be that an LLM represents such a large and hard-to-divide chunk of state, data, and work that maybe we can't migrate it, in which case maybe we don't have quite as many of those issues as we do with the smaller microservices. I don't know—just thinking off the top of my head.

Dirk Kutscher: Yeah, surely. I mean, of course you could be more specific, right? When we say AI, what we often mean is collective communication abstractions, but this may not be the only class of interactions that we could investigate, yeah.

Dave Plonka: Dave Plonka with Akamai. Thanks for doing this summary, Dirk. The second one is the one that's most interesting to me, and you coded it as trust delegation. I don't see an opportunity for trust delegation; it's more trust in delegation—or, where does the responsibility lie when things are delegated to agents? What I mean by that: in Hong Xu's presentation and in Lisha's presentation, both seemed to touch on what the researcher's role, or the person's role, would be in these kinds of systems. That's a meta-level thing. I think a lot of people in the IETF groups—I haven't heard from them, so I shouldn't speak for them—might focus on the meat of how AI systems work in networking, and I wonder if we should make sure we think about what the role of the researcher is. Are they just in the evaluation phase? Are they in the design phase? How do they handle that they are really responsible for what those agents do afterwards? And maybe the simple idea is: what's the parallel here to security considerations and privacy considerations? Where are the responsibility considerations, so that you don't delegate trust to something that can't be trusted? It's definitely a meta thing, so I don't know where it goes in the IRTF, but I want us to think about and work through that, because I'm scared of any talk where they don't talk about what the roles are, don't talk about the trust delegation, and don't talk about how you can't delegate trust—you have to keep the responsibility with the person that deployed the thing. Okay.

Dirk Kutscher: Yeah, thanks. Good point. Brian remotely.

Brian Trammell: Wait a minute, hold on, I'm on the wrong camera—let me come over then. So yeah, plus one to both of those points. I think of the three things here, the most interesting and, I think, the hardest nut to crack is this principled naming, identity, and trust delegation in a way that will deploy, right? I was really, really inspired by Lisha's talk yesterday, and I spent most of today trying to do my day job while thinking about the outcomes of that. There's some engineering in the IETF that's nibbling at the corners of this, I think—like the WIMSE working group; I missed it this time, and I'm not sure if anyone who was here was there looking at the fit. But for the trust and responsibility delegation, there's another missing primitive in the IETF stack, which is the "on behalf of," right? It's not just me, it's not just you or the agent or whatever. There's a lot of attention on AI at the moment; I'm not sure how long that lasts—maybe for a very long time, maybe for a very long time at a lower intensity. But I think there are fundamental issues here that we should dig into, and it would be a shame if we tied them to just a part of the problem. So I think AI shows us that there's a challenge here, and I look forward to essentially working on the challenge that Lisha set us yesterday.

Dirk Kutscher: Okay, thanks Brian. Okay, we have another remote question. Not sure if it's from Wes or already someone in the room. Ah, it's actually from... okay, Wes.

Wes Hardaker: Yeah, I think the second one is the interesting case, but it needs a much better definition before you dive into it. Specifically, you need the gap analysis that I don't think is being done with the things that are in there right now. We have really good naming systems. The only thing we don't do yet is naming on the client side, which really becomes the AI agent side—except there is some work on that; I mean, I chair DANCE, so that's a good example of where engineering has been done lately to do that kind of thing. There's been a lot of discussion about identity. What people haven't aligned on yet is how this is different from the token-based authentication mechanisms that we hand to a bunch of software today—like when we create an account and then hand out tokens to things. Why does that not work for agents? I'm not saying that it does, right? But nobody's really done the analysis yet of what the new thing is that really becomes research. Trust delegation almost seems like a legal framework, but as I mentioned to people the other day, we're already doing trust delegation all the time, right? When you have your home automation system with whatever company creates it, you're delegating OAuth 2-like credentials or whatever to turn on your lights. So there's already a bunch of agents that do that. We're just empowering the agents to do a lot more than we ever have before, and it's no longer rule-based—it's some other black-box decision-making ability. But there needs to be a really good analysis of what's new there that can't use the existing engineering pieces. And again, I'm not saying there isn't anything, but until I see that, it's hard to figure out whether this is an area we need to work in first, if that makes sense.

Dirk Kutscher: Okay, thanks. I think Lisha can answer part of his questions.

Lisha Jiang: I actually came here to answer Rod's question, but for Wes: I think the token stuff works great as of now. For AI, think about delegations three levels deep—I wonder where your auth server would be and how many of them you would need. For now, we only have one level of delegation. If you go further down—an agent can delegate to another agent to do things, and further down—I'm not sure how the system would work. But back to Rod: actually, I agree with Rod. We are facing a systems challenge, yes, and I just want to explain why the word AI is relevant here. It is because AI changes the system qualitatively, in terms of scale and dynamics. As I mentioned—I forget whether it was Monday or Tuesday—today we already have billions of internet users and tens of billions of devices. So it's not the number that matters when people say, "Oh, we're going to have billions of agents, so it's a big problem." It's how those numbers work. We call today's internet users "eyeballs." Eyeballs don't interact with each other; they all look up to the sky, to the cloud. But agents change that nature, because agents are supposed to be collaborative, working together. We never work together directly. Like the example I mentioned: right before I got onto the plane, I was dragged into a dinner with a bunch of people who want to use agentic systems to help them develop some application-specific use cases. They asked me: why can't our agents talk together as a group, since we already have that, like a WhatsApp group? I should give them credit—they are not network people, so they didn't know that all of today's group-based applications work through the cloud. They didn't mean for their agents to have to talk to the cloud to get group behavior—but that's how we do grouping today. And I don't think that's going to work for agents, especially when you talk about physical agents: they cannot afford that delay. There's also the scalability question: how many agents are out there, and how is the cloud actually going to deal with that? I was joking with someone during lunch that if three robots on the manufacturing floor have to consult the cloud to get tokens, then before you get the tokens, they've already smashed into each other—and then how do you recover from that crash? So I agree, it's really a systems challenge, but so far we have dealt with it through the great cloud, and I think agents will push us toward a new, cloud-independent solution. Thank you.

Dirk Kutscher: Thank you. And we have Christian again.

Christian Huitema: Christian Huitema. Yeah, I mean, I think we have a real issue there. I am a bit concerned when I see the IRTF, of these three options, focusing on the naming issue. Because, you know, it feels like the old story of the philosopher who has lost his key and is looking for it under the lamppost, saying, "Hey, I know it didn't fall here, but here I have light, and I might find it if it did fall here." Naming and identity are things we have been working on for many, many years. They are certainly useful, and I like the work that went on in DINRG—it's very nice. But let's face it: the big problem we have with these systems is that they are massively centralized systems today. And they are massively centralized because it takes billions to train a new model. That's the big, big problem. So what I would like to see is some kind of movement in which this kind of huge task is broken into little pieces, and we can enable that. And the things that enable that are topics 1 and 3, actually. So I am a bit concerned about remaining under the lamppost when the big problem is somewhere else.

Dirk Kutscher: Okay, thanks for this perspective. I get what you're saying. I mean, of course there is massive infrastructure scale and centralization in the systems today. On the other hand, for agent communication—okay, we don't know all of this yet—but agents are the distillation of the training; these are inference systems. They may use centralized inference, but not necessarily: there are also open-source models and so on. So there might also be potential for other system architectures, although, of course, I see the tendency towards centralization. Thanks, Christian. And we have Wes again.

Wes Hardaker: Yeah, I think the other thing the IRTF has to consider is that the IRTF is not a quick-moving thing, right? The problem space of the IRTF is five, ten years out, maybe. So which elements of this will not be handled by the commercial organizations on a much faster timeline, because they need it tomorrow? I think the harder problem in this entire list is figuring out which elements are not going to be handled without something like the IRTF doing a cross-collaboration, large-scale, long-term solution, versus where industry is going to have to solve the problem quickly on its own anyway, such that there's no way the IRTF could even participate in that discussion.

Dirk Kutscher: Right, I get it. On the other hand, take yesterday's presentations—for example, Hong Xu's work, which is of course done in industry collaboration. Having people bring this state-of-the-art work here, enabling more discussion, and maybe also enabling analysis of some problems could potentially still be useful, I think. Lisha again?

Lisha Jiang: I just want to get back to Wes on his questions. I have two comments. Number one, if industry goes down the direction we believe is the right direction, so much the better. Yes, I agree they have more power, more money, whatever—they have more pressure to get the work done. But for the second bullet, about using the DNS name, or a name derived from DNS, as the unifying namespace: I haven't seen industry actually mention that. And also the concept of local trust—I haven't seen that yet. So I think the IRTF really should try to explore this research direction, and if industry agrees with it, that's great. I also want to make a clarification: the IRTF is not doing the research per se. I think we are a coordination platform where individual teams bring their research results for exchange and development. We are not coming here three times a year to do the research per se—for that, I believe it would be very slow. Thank you.

Dirk Kutscher: Thanks, Lisha. And Lars.

Lars Eggert: Lars Eggert, Mozilla. I want to agree with the last thing that Lisha said. Stepping back a bit, or stepping up a bit, however you want to think about it: I think AI is a huge topic in both industry and academia at the moment, right? And the IETF struggles to identify pieces that are ready for standardization. But I think it would be very useful for the overall IETF/IRTF family to have a venue in this space, to give people the opportunity to come and present relevant work from either industry or academia. We tried this a couple of times for other big fuzzy topics—we had an NFVRG a while ago, we had an SDNRG. Those functioned okay for a while, and then they didn't. This potentially has the same problem. But that problem is, I think, fixable by having extremely diligent chairs who make sure that what gets presented actually is of interest to many and actually brings in new ideas, new thoughts, or new things from industry. Research groups, much more than working groups, live and die with the chairing, and if we did something in this space, that would be even more true than for many others. But I think it would be useful to the overall IETF if we had something for people who are interested in AI—and many people are at the moment—to come and participate in. I think I talked to you, Wes, yesterday: Wes runs the Guides program, where new participants can get paired up with a mentor. Apparently we had very many who asked for a mentor to guide them to AI topics in the IETF, and we have very few mentors or topics we can guide those people to. So I think a venue like this would be helpful for the org. Thank you.

Dirk Kutscher: Thanks, Lars. So we can take two short statements, one from Wes and one from Colin.

Wes Hardaker: Mine is very quick. Lisha, you and I agree: the IRTF is not where research is done, it's where research should be reported, and I absolutely agree that presenting long-term, novel research in a group, just like what happens in MAPRG, is a wonderful use of the IRTF. The trick is getting enough interesting long-term research results that are novel, as opposed to "this is what industry just did"—that, I think, is the hard part.

Dirk Kutscher: Hmm. Okay. Thanks, Wes. Colin.

Colin Perkins: Hi, Colin Perkins. Oh, there's the camera—all right, I was looking at the wrong place. So yeah, I agree with a lot of what has been said. Where the IRTF offers the most benefit is when it connects the researchers and the standards community. One aspect of that, perhaps the passive aspect, is in allowing people to come and present work which has happened elsewhere: making the standards community aware of the interesting research that's happening, and making the academic research community aware of some of the issues arising in industry. I think a group in the MAPRG style, as you said, is potentially very useful there. Where I am perhaps struggling more to see where we can offer value is in the more active coordination of research. As someone said, that's something which has to take place on a longer timescale to avoid just being completely overtaken by events, and I think that's something where we need to think a little harder.

Dirk Kutscher: Okay, great. Yeah, thanks everybody for this discussion and these really valuable points. We'll take this in and continue the discussion. There is a mailing list that can be used for further discussion—please try to get on it; I will also send another message on IRTF-announce about it. And that concludes the meeting today. Thanks, everybody. Great meeting. See you soon.