
Session Date/Time: 18 Mar 2026 06:00

Jonathan Lennox: Are we using Meetecho's minute tool, or does somebody want to take notes? Okay. Mostly thumbs up for Meetecho, and nobody is putting their head on the block to take notes, so unless somebody disagrees, I'll call that consensus. Marius, do you agree?

Marius Kleidl: Hello everyone.

Jonathan Lennox: Hello. All right, we have most of our presenters here. It's actually a fairly good turnout for this meeting, despite us having an enormous room. I'm not sure we needed an auditorium. I feel like maybe I should do a song and dance routine while I'm up here, but I won't subject you to that. All right, hold on. Did I unshare the slides? I think I must have when I reloaded. Yeah, sorry, let me put the chair slides back. All right, hello everybody, welcome to AVTCore. This session is being recorded. I am your co-chair Jonathan Lennox, and remotely here we have Marius Kleidl.

All right, so first thing, of course, is the Note Well. By participating in the IETF, you have agreed to follow IETF processes and policies. There's a QR code here if you need more information about those processes and policies. Notably, we have an anti-harassment policy, so please treat everybody with dignity and respect and behave in a professional manner; if you feel these are being violated, please contact the chairs, the ombudsteam, or the ADs as appropriate. There are also IPR policies: please be aware of your obligations under them and declare any IPR you're required to declare. I won't explain the details here because it's complicated, but please understand them and comply with them. And then: this session is being recorded, so you will appear on the audio and video of the session.

All right, meeting tips: if you're in the room, please sign in to Meetecho, either from the agenda or via one of the QR codes in the room. There might be other ways, but those are the two I know about. You can use the light client if you just want to join the queue and see the slides, or the full client if you also want to see video or any screen sharing we happen to do. But please keep your audio and video off unless you're chairing, presenting, or speaking to the session. If you're in the room and you join the full client, make sure to turn audio playback off using the volume control at the bottom. If you're remote, use of a headset is strongly recommended, because echo cancellation is good but not perfect. And everybody, please state your name each time you begin speaking, because the recording does not show the queue, so we won't be able to use that after the fact to see who was speaking.

Here are some resources; you can click on them, but you probably know all of these already. And here's our agenda for the day. We're doing APV first, then H.265, then ARF, RTP Frame Acknowledgement, SFrame, and at the end we have two things that are not strictly speaking in our charter, but we're probably the closest working group that's still meeting. They're both related to WebRTC enhancements. I think we said one of them is going to be Justin and one of them is going to be Philipp, but I don't see Justin—oh no, Justin is here. Good, Justin's here now. So Justin will be presenting one, and—is this the order you want, Philipp, for those? I know you were saying one order if—

Philipp Hancke: Yes, that order works.

Jonathan Lennox: Wonderful. Great, thank you. Okay, and then here is our draft status. We have two documents with the RFC Editor and two that have completed working group last call, one of which I list as "shepherd write-up in progress," which might be a little optimistic. But I think, Marius, you said you thought it'd be simple, so you could take the shepherd role on that? Yeah, that should be a simple one. draft-ietf-avtcore-rtcp-green-metadata is currently in working group last call, so if you haven't sent in your opinion on that yet, please read it, review it, and send in any comments. We have a number of adopted drafts, some of which we're talking about today and some of which we aren't, and two newly adopted drafts, both of which I think we're talking about today, but now I've lost track. Anyway, our first presentation will be APV. Okay, you are here in person. Great. Let me switch the slide deck. Do you want to use the clicker, or do you want me to control the slides?

Suhyeok Jeong: I'll try.

Jonathan Lennox: All right, let's see. I think I have to transfer slide control to the clicker and this should now work. You probably want the microphone higher and make sure it's turned on.

Suhyeok Jeong: Hey. There we go, that sounds good. Better. Okay, thank you. So this is the payload format for APV. As mentioned, it was just newly adopted, so I want to go over the basic concepts of the draft again so that I can get more comments, or complaints, whatever they may be. APV, as a reminder if you haven't heard about it, is a video codec for professional use. Professional use mainly means you are not just encoding once: you record and encode video, then edit it, decode it, and encode it again, so there are multiple decoding and re-encoding steps applied to that bitstream. That means you need to preserve the content throughout those re-encoding steps. It was proposed as an informational RFC and has now been published as RFC 9924. Thank you to everybody who provided good feedback on that draft. This draft is about the payload format for it, so I need to introduce the structure of the encoded video bitstream. It's a very simple structure. The access unit (AU) is a common term used by most video codecs; it's one instance at the same time. An access unit is composed of a series of PBUs (primitive bitstream units). There can be more than one PBU inside an AU; you can put in whatever you like. You know the AU size, so you can keep counting how much has passed and work out how many PBUs are there. A PBU is constructed as a PBU size, a header, and data. The size tells you how big it is, and the type in the PBU header tells you what kind of data is in the PBU. You can see the details in RFC 9924. Mainly it's either video data—which I'll explain a little more—or metadata. For the video data, you can have the main video. And since we're talking about a professional video codec, there can also be a small thumbnail, so you don't have to decode everything to see what's in the scene. And you can have multiple frames with different angles, so you can have multi-view content and so on. Those are the main PBU types. In most cases you will see the PBU type that encodes the frame data—the actual video frame, not metadata. If you encode a video frame, it is mainly subdivided into multiple tiles. Each tile's data is encoded in a PBU of that type: you have a frame header, which can be repeated for each tile, and then tile size, tile data, tile size, tile data combinations. At the end you can have filler data and so on. So for video coding data, the tile is the smallest thing you will see in the bitstream; a series of tiles is packed into a PBU, and multiple PBUs become one AU.
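
A minimal sketch of walking the structure just described (AU size, then PBUs, each with its own size, a typed header, and data). The field widths used here (32-bit sizes, a 1-byte type) are illustrative assumptions; the actual layout is defined in RFC 9924, not here.

```python
import struct

def walk_access_unit(au: bytes):
    """Walk one APV access unit: an AU size, then a series of PBUs, each
    carrying its own size, a header with a type (frame data, metadata, ...)
    and the data. Field widths are assumptions for illustration only."""
    au_size = struct.unpack_from(">I", au, 0)[0]            # assumed 32-bit AU size field
    offset, end = 4, min(len(au), 4 + au_size)
    pbus = []
    while offset + 5 <= end:
        pbu_size = struct.unpack_from(">I", au, offset)[0]  # assumed 32-bit PBU size
        pbu_type = au[offset + 4]                           # assumed 1-byte type in the PBU header
        data = au[offset + 5 : offset + 4 + pbu_size]
        pbus.append((pbu_type, data))
        offset += 4 + pbu_size                              # next PBU starts after size + body
    return pbus
```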

So, as you can easily imagine, the payload format mainly describes how to generate payloads out of that AU bitstream, and there are two ways provided in this draft. The first is simple mode, which means it should be simple for packetization and de-packetization: you don't really care what's inside, and whatever your MTU size or remaining bit budget for the payload is, you just chop the bitstream there and send it. Nothing is restricted other than the AU boundary—each AU starts at the beginning of a payload. The second way is low delay mode, for when you want to start decoding immediately whenever you receive a payload or packet, which means certain boundaries must be aligned with the start of the payload. Here we say the beginning of a tile size field—you saw that mostly you will have a PBU with tiles inside—so not the PBU header or the PBU start, but the tile start, should be aligned with the payload header, so that each individual tile can be received and discovered easily and your decoding delay is minimized.

Payload header usage—packet header usage—is very simple and straightforward, because the codec doesn't have temporal coding: there are no P frames or B frames, everything is an I frame. So all you need to indicate in the header is the capture time, and we usually use a 90 kHz clock for the timestamp. In the payload header, you indicate which packetization mode you are using—simple mode or low delay mode—and you have a counter for how many packets you need to receive to get back the original data the sender sent as one unit. So in the operation mode field you see the mode, and then you have a payload type indication saying where you are: the start of a payload fragment, the end of a fragment, or somewhere in the middle. Yep.
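
A small sketch of the per-packet information described here: the packetization mode, the packet count needed to reassemble the fragmented unit, and the fragment position. The names and encodings below are hypothetical, not taken from the draft.

```python
from dataclasses import dataclass
from enum import Enum

class PacketizationMode(Enum):
    SIMPLE = 0      # arbitrary chopping, only AU boundaries are preserved
    LOW_DELAY = 1   # each payload starts at a tile-size boundary

class FragmentPosition(Enum):
    START = 0
    MIDDLE = 1
    END = 2
    COMPLETE = 3

@dataclass
class ApvPayloadHeader:
    """Illustrative container for the fields the presentation describes;
    the wire layout and field names are defined by the draft, not here."""
    mode: PacketizationMode
    packet_count: int            # packets needed to rebuild the original unit
    position: FragmentPosition   # start / middle / end of the fragment
```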

And this one is about whether the frame header is repeated or not. As I mentioned, in the low delay mode you align with the start of a tile, and you need a frame header to decode that tile. So you can optionally repeat the frame header with each tile size/tile data pair. You can also indicate that the frame header is static, so you have multiple frames going on but each frame has the same frame header, and so on. Yep.

Mo Zanaty: Just a quick suggestion: I would try to avoid any field names that collide with the RTP header. So calling something payload type may be horribly confusing.

Suhyeok Jeong: Okay, okay. Good suggestion. Maybe I will call this APV payload type or something like that, or format.

Mo Zanaty: Yeah. And also for consistency, you know, whether you use access units or frames, you know, pick one and then go with it.

Suhyeok Jeong: Okay, yeah. That's a good suggestion, thank you. For media type registration, we are thinking about registering one type for APV. Mainly we need to indicate profile, level, and band. Band further subdivides the level, because there's a huge range within a level, so you might want that finer subdivision. SDP: we don't think this codec will be used for any kind of bidirectional communication, so the SDP parameter section simply says you announce this and you are not expecting any feedback or negotiation in that section.

I think that's it. This has been adopted as a working group item, so I expect more comments, complaints, or criticism; I'll be happy to receive those and get going. I don't think I have a GitHub repository yet for this one, but I don't know whether we actually need one or not.

Jonathan Lennox: Yeah, I mean it's up to you if you'd rather not work in a GitHub model, that's fine. So I have one comment as an individual which I'm afraid just occurred to me now or I would have emailed you about it. So what kind of packet rate are you expecting for this?

Suhyeok Jeong: Oh, several hundred megabits.

Jonathan Lennox: Yeah, because with some previous payload formats that are intended for extremely high packet rates, they've added an extended sequence number, since there's a worry that you can wrap the 2^16 sequence number space within the time packets might be expected to linger on the network. If that's the case, you might want to look at, for example, the RTP payload format for uncompressed video, which is obviously the worst case—RFC 4175. That had one.
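
A rough check of the concern raised here, using assumed numbers (a 500 Mbps stream and ~1400-byte payloads; neither figure is from the discussion beyond "several hundred megabits").

```python
# How quickly does a 16-bit RTP sequence number wrap at these rates?
bitrate_bps = 500_000_000     # "several hundred megabits" -> assume 500 Mbps
payload_bytes = 1400          # assumed payload size near a typical MTU

packets_per_second = bitrate_bps / (payload_bytes * 8)
wrap_seconds = 2**16 / packets_per_second
print(f"{packets_per_second:.0f} pkt/s -> sequence number wraps every {wrap_seconds:.2f} s")
# ~44600 pkt/s -> the 16-bit space wraps in roughly 1.5 s, which is the
# motivation for RFC 4175-style extended sequence numbers.
```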

Suhyeok Jeong: Okay. Yeah, that's also a good suggestion. Let me run the numbers, and if wrap-around happens often then definitely we need to increase it.

Jonathan Lennox: Yeah. That sounds good. All right, anybody else had any comments on this? Please join the queue. All right, sounds good. Thank you.

Suhyeok Jeong: Thank you.

Jonathan Lennox: All right. So let's see, how do I do this? Do I have to first—first I take back slide control, I think. And then I stop the slide, and then I start new slides. It's all very complicated. Now H.265, I think. And Philipp, I think I can hand you slide control if you want, Philipp, or do you want me to drive?

Philipp Hancke: If you could drive, that would be good. It's just one slide.

Jonathan Lennox: All right. I'll drive.

Philipp Hancke: Good, thank you. All right. It's been a while since the last update. Bernard Aboba was the driving force behind this draft, and he passed away last February. I would like to welcome Jianlin from Intel as co-author. He's been driving the implementation of this in Chrome, so I think he deserves to be co-author.

This is mostly a status update. All previous open issues were resolved. If you want to look at what happened, this is in PR 33 on GitHub. HEVC in WebRTC has shipped both in Chrome 136 (hardware only) and in Safari 18. We do have RPSI specified, which is a change since last version, but it's not implemented in browsers. We do have it specified, so we should be good on that front if browsers ever add support for that, they know how to behave. I think the ask for the working group is if there are any open issues that we should be dealing with before moving this forward. I found one issue in the introduction, but that was mostly nits. Any questions or comments? Mo?

Mo Zanaty: I'm assuming that for the WebRTC use case, you're doing in-band parameter sets on all of the key frames, correct?

Philipp Hancke: Yes.

Mo Zanaty: The one issue that we opened because of Mock was we discovered that for WebCodecs, when you specify HEVC, you don't get in-band parameter sets, even though there is a specification for it. It says if you use this particular type—I forget whether it's hvc1 or hev1, one of those two is supposed to be in-band, but yet you don't get it. I assume that's not going to be a problem for this because you're just going to always hardcode in-band parameter sets all the time?

Philipp Hancke: Yes.

Mo Zanaty: Okay. So I guess it'd just be more of an issue to open in the WebCodecs, the W3C issue tracker instead.

Philipp Hancke: Yes.

Jonathan Lennox: All right. So I would suggest then if—just publish a new version with fixing the nits you had in the introduction and then sounds like we're ready for working group last call.

Philipp Hancke: We'll do, thank you.

Jonathan Lennox: Okay, great. Thank you. And we're moving on at a nice clip today. So that's—so we should have plenty of time for the stuff at the end, which is always nice. And what's next should be ARF, right? Okay. Do you want—okay. I'll give control to the clicker. You should have slide control now. Is it not working? Huh. Has battery died? I'll drive the slides then. I'll take control of this one. All right, go ahead.

Suhyeok Jeong: So this is the outline for today. Next slide, please. Yeah, this slide briefly summarizes the history of the ARF draft. First of all, thanks to everyone who supported the working group adoption call. Based on that, the document has been updated as a working group draft. We also made a small change to the title by adding the word "stream" to better reflect the scope of this draft. Next slide, please.

We also had many discussions with people involved in the MPEG group. Based on those discussions, we updated the abstract and introduction so that the terminology is better aligned with the MPEG specifications. Next slide, please. In addition, we added some references to clarify the original sources. Next slide, please.

Another part is related to the avatar animation unit. We expect that the avatar animation unit may be updated depending on the outcome of the next MPEG meeting, so we are planning to update the payload header sections based on that discussion. Next slide, please.

Yeah, so suggestions and feedback are always welcome. And as I mentioned, we are planning to update the draft and eventually request working group last call when MPEG starts the FDIS process for the avatar spec.

Jonathan Lennox: Do you know roughly a timeline of that?

Suhyeok Jeong: Pardon?

Jonathan Lennox: Do you know roughly when that's likely to be?

Suhyeok Jeong: I'm not sure.

Jonathan Lennox: Okay, that's fine. Just wondering if they had any timelines on their side of things.

Suhyeok Jeong: As far as I know, it's at the DIS stage. So, yeah.

Jonathan Lennox: Okay, that's fine. So yeah, sounds like you have a good plan. So anybody have any comments? Otherwise, sounds good and I guess let us know when they've reached—when they've got FDIS and we'll proceed. Sounds good. Thank you.

Suhyeok Jeong: Thank you.

Jonathan Lennox: All right. Next up is RTP Frame Acknowledgement version 2. Let me make sure I have the right version of your slides. This should be version 2. Let's see if we have the clicker working now. All right, see if that works now. It's still not working. Hmm, very strange. All right. I'll drive the slides.

Gurtej Kanwar: Okay, thank you. Good afternoon, everyone. I'm Gurtej from Apple, and along with Eric and Sridhar, we are the authors of RTP Frame Acknowledgement. The background here is that this is a mechanism through which a video sender can find out about frames not being decoded or being lost, so that it can encode subsequent frames with a well-known reference that it knows the receiver has decoded. This prevents large keyframes from having to be sent when there is loss, and you can recover more quickly from that loss. Can you go to the next slide, please.

This draft has not yet been adopted, but the authors have been making progress on GitHub. We have 9 open issues, and 15 have been closed, so quite a lot of progress has been made since the last IETF. I wanted to go over some notable topics, so next slide, please.

The first thing we added, which we discussed at the previous IETF, is a mechanism for receiver-triggered resynchronization requests. If the receiver notices that its decoder is being starved—frames are coming in but it cannot decode them because it does not have the reference—it can wait for some timeout and then request a resync. To account for this, we updated the feedback format to contain a one-byte flags field. Right now only one bit is used, the R flag, which represents a resync request. This can be sent as an unsolicited RTCP message from the receiver to the sender, and it can contain the last decoded frame ID and the status vector up to the last received frame. The sender can then encode the next frame referencing the frames available at the receiver, using the status vector.
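
A minimal sketch of the information carried in that feedback message, as described above. The field names and widths here are hypothetical; the actual wire format is in the draft.

```python
from dataclasses import dataclass, field

@dataclass
class FrameAckFeedback:
    """Illustrative container for the feedback contents described in the
    presentation: the R flag from the one-byte flags field, the last frame
    the receiver decoded, and a status vector (one entry per frame, 1 if
    decoded) up to the last received frame."""
    resync: bool
    last_decoded_frame_id: int
    status_vector: list = field(default_factory=list)

# A receiver-triggered resync, roughly matching the description above:
resync_request = FrameAckFeedback(resync=True, last_decoded_frame_id=20,
                                  status_vector=[1])
```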

York: I'm curious. So we have RFC 4585 that defines Reference Picture Selection Indication (RPSI), which allows you to precisely point to which frame you should be using as a reference for subsequent encoding after a frame has been lost. This seems to be providing similar functionality, and I'm wondering whether there are specific things you have in mind that aren't covered by this past document, or whether—I just didn't have a chance to read this, I've come across this for the first time now—so I'm just curious, so please enlighten me on that.

Gurtej Kanwar: Yeah, so we also have Eric here who can chime in, but my understanding is that the RPSI draft is not being implemented for some of the newer codecs, like AV1, AV2.

York: But it's orthogonal to the codec, I think.

Gurtej Kanwar: It is specified in the payload format, I think. Yeah.

Jonathan Lennox: Yeah, RPSI specifies what you select based on internal codec details. This is intended to be codec-independent.

York: Codec-independent is clearly good, even though RPSI is partly codec-independent, it's not specific to a certain—I was just curious, and if there is a similar design in spirit then maybe it should be following a similar structure, if that helps to make implementations easier to carry over. Thank you.

Gurtej Kanwar: We can go to the next slide, please. We've also added a section on SDP negotiation. We defined an extension map attribute for the new RTP header extension; this is how the sender marks frames and tells the receiver that it cares about feedback for those frames. And we added a new RTCP feedback attribute for the new RTCP feedback message that you saw on the previous slide. The feedback message attribute also has a parameter called resync-timeout, through which the sender and receiver can agree on use-case-specific timeouts. For certain low-latency applications you might want the receiver to be more aggressive about requesting resyncs, and the sender can plan its feedback cadence accordingly. Next slide, please.

Next I wanted to go over some examples; we've also added these to the draft to explain the different modalities of the feedback request and the feedback mechanism. In the first example, the sender is transmitting several frames and then requests feedback to confirm which frames have been decoded at the media receiver. Here the sender sent three frames, timestamps 0, 100, and 200, with frame IDs 0, 1, and 2 marked in the extensions; it was just marking the frame IDs, it did not request any feedback at that point. When it sent the fourth frame, it marked it with frame ID 3 but also set the FFR field to 10, which means it is requesting feedback and provides a range: it starts from frame ID 0 with length 4, so it's basically requesting the range 0 to 3. The media receiver responds with a status vector—in this case all ones, because all four frames were received. This is the simplest case. Next slide, please.

This is just another variant, showing that you might have several frames that you don't even tag with a frame ID, so you send them without this specific RTP header extension. Then the sender sent a fourth frame with some ID—4 in this case—and also set the FFR field to 01, which means it is implicitly requesting feedback for this frame. The media receiver responded with feedback of length 1 and vector 1, because it received the frame. Next slide, please.

Okay, so now this shows how you can recover from frame loss. We have two modalities here; one is a sender-side recovery mechanism. Here the sender, with every frame that it marks, keeps requesting feedback for the last three frame IDs. In this case the example starts from frame ID 10, but there were two frames before it. The media receiver responds with feedback saying, yes, I've received all three—status vector 111, all of them have been decoded. Next slide, please.

Okay, so the next frame, let's assume it's fully lost in transit. So—and next slide, please. When the sender sends the next frame, again as always, it's requesting the feedback for that frame and the previous three frames. The receiver will respond with a status vector 100 saying that I did receive frame ID 10 or timestamp 1000, and the two frames after that, either I have not received completely or I have received them but I have not been able to decode them, both of those are currently represented as a zero in the vector. So it's responding with 100. So at this point, the sender, once it receives this feedback, it can encode its next frame with the known reference, which would in this case be the first frame with timestamp 1000, and the receiver will be able to decode that frame.
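
A tiny sketch of the sender-side decision described here: pick the newest frame the receiver reports as decoded and use it as the reference for the next encoded frame. The helper name is hypothetical, not from the draft.

```python
def pick_reference(fb_start_id, status_vector):
    """Given feedback covering frames fb_start_id .. fb_start_id + N - 1,
    return the newest frame reported as decoded (None if none were)."""
    ref = None
    for i, decoded in enumerate(status_vector):
        if decoded:
            ref = fb_start_id + i
    return ref

# The loss example above: feedback starts at frame 10 with vector [1, 0, 0],
# so the sender encodes its next frame against frame 10.
assert pick_reference(10, [1, 0, 0]) == 10
```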

Next slide, please, thank you. This shows the receiver-triggered resync requests we talked about earlier. Here, again, the first frame was sent with a frame ID and implicit feedback was requested for it; the receiver immediately responded with yes, I've decoded it. Next slide, please. Say the receiver got the next frame but it was not complete and could not be recovered through FEC or RTX, so it was only partially received and could not be decoded. Next slide, please. It received another frame, this one fully received, but it references the previous frame, so it cannot be decoded either. After a timeout, the receiver can decide: okay, my decoder has been starved for too long, I'm going to send a resync request, which is what it does. In that request it indicates that it did receive the frame with identifier 20, which was the first frame in this flow, with a status vector of length 1 and value 1. Next slide, please. The sender is then able to encode the next frame using that well-known frame as reference, and decoding can continue. Next slide, please.

As next steps, we want to keep making progress on the remaining GitHub issues. We're happy to get more eyes on this, and we'll be happy to address any feedback you have. Also, at this point we think the draft is in a state where the main structure is there, and we would like to call for adoption. We think this work is very relevant to this working group.

Mo Zanaty: Yeah, I think the goal is admirable to have a general mechanism. But I think the reason why RPSI is codec-specific is because when you get into the details, you realize that you can't do this in general for any codec. The codec has to have explicit support for long-term reference frames, and you have to say how do you mark the long-term reference frames. And when you say rewind to this other reference frame, you can't use an RTP timestamp, you have to refer to the codec's internal frame ID. So then you have to have these bindings between your generic message and the individual codec that actually support it, and then how do you map those frame IDs and the support for long-term reference frames and things like that. So it almost seems like you still need a binding of this to AV1, 264, 5, and everything else.

Gurtej Kanwar: Yes. I think the binding is required in the software on the sender and the receiver, but this provides a generic mechanism on the wire which, once you have the bindings, it can work across any codec.

Mo Zanaty: Are you planning to put like common bindings for common codecs into this spec, or do you plan another spec to show those bindings?

Gurtej Kanwar: We were not planning to put it in this spec, but we can discuss that, that's valid feedback. Perhaps like in the future as new codecs come in or with old codecs we can discuss enhancing those.

Altanai: Sorry, did I unmute? Yeah. Hi, Altanai Sisodia. What do you think is the potential for amplification attacks when somebody's requesting a large window of frames? Because that's basically just going to overrun the budget for data.

Gurtej Kanwar: Can you repeat that, so like what's the attack vector?

Altanai: Yeah, if you go to the diagrams—let's go to the first sequence diagram that you showed, please. Yes, any is good. Okay, let's stay on this. This one shows FB-start 0, FB-length 4, right? If we put FB-length to its maximum, then the RTCP feedback is going to ask for feedback on that many frames back. That seems like scope to launch an amplification attack, as in—

Gurtej Kanwar: No, no, sorry. This is not actually requesting frames back. What this is just doing—this slide is just showing that I want feedback for this range of frames. It will—one tag—like one feedback request will generate a response with one RTCP feedback message, which can have a status vector which can contain like that range of frames. And just like from an attack vector perspective, this is like—it's between the two peers and everything RTCP, RTP is like all authenticated. So yeah.

Jonathan Lennox: I think, Altanai—Jonathan Lennox, just interjecting as an individual—I think there does need to be guidance on how much a receiver needs to remember for the vector.

Gurtej Kanwar: Absolutely. We have an open issue about this, yes. Great, great, great. Thank you.

Stefan: So I will spare you my video for tonight. I want to echo roughly what Mo said. I think trying to work towards a generic mechanism is too ambitious a goal. A lot of this looks like it would work well enough with AV1 and probably AV2. So maybe you want to go for a limited scope rather than repeating the mistake we have made in the past, frankly, of envisioning mechanisms for future codecs that may not even have been defined yet, in the hope of creating a generic solution. We'll look at this a little more closely once you have a working group draft, because at that point these comments may carry a little more weight. But I appreciate the work here, and I have no objections to accepting this as a working group item. Thank you.

Gurtej Kanwar: Thank you. We have looked at this—we have similar proprietary mechanisms in place for HEVC and H.264, so we know this can definitely cover that case, and we've looked at AV1 and AV2 as well. Yes, in the future there might be a codec that we cannot fit under this umbrella, so it would be good to get eyes on the draft and feedback about that.

Eric: Yes. I just wanted to comment on the topic we just discussed. I think you can actually be codec-agnostic, as long as the sender sends a group of packets with a specific timestamp and the marker bit to say this is a frame you can decode—that is the smallest unit we're dealing with here. As long as the codec lets you identify a piece of data that has been decoded and you mark it with a frame ID, the receiver can say "I have decoded this piece of data," and the sender can use whatever codec it wants; it can even be part of a frame, if the codec can decode a slice independently. Then it knows the receiver has decoded that piece of the bitstream it sent, and since it knows the bitstream it is sending, whatever codec format it is, it can reason about which state the decoder is in with respect to that data. Thank you, that's just my comment.

Jonathan Lennox: All right, yeah. Jonathan Lennox as an individual. Somewhat reflecting what Eric just said, I'm curious how this works in the SVC case, where you have multiple frames with a single timestamp corresponding to different spatial layers. Are each of those marked as different frames in this, or are they the same frame, and how does that work?

Gurtej Kanwar: Eric, do you want to take that?

Jonathan Lennox: So for the case of a spatial scalability SVC, if you have multiple logical frames within a single access unit, so there's—so they all have the same timestamp but they're different logical units. Do they get—

Gurtej Kanwar: Those would be sent over different SSRCs?

Jonathan Lennox: No, same SSRC.

Gurtej Kanwar: Okay. Eric, do you want to take that?

Eric: Yes. Maybe I wasn't precise enough when I said a timestamp and a marker bit, but the idea is essentially the same. Each frame, even when you're doing S-modes, will be identifiable at the RTP level as a decodable unit. To each of those the sender can attach one of these frame IDs and know when the receiver has received and decoded that part—which could be part of an SVC frame—because the receiver will respond that it has decoded that piece of data. That's also essentially why we want a vector of frames, not just the last decoded frame: if you have multiple independent spatial layers within a single SSRC, you need to know the state of each layer independently to know what you can use as a reference. Okay, thank you.

Mo Zanaty: Yeah, I think Jonathan's question was really do the timestamps—you have multiple frames now per timestamp, so your assumption that timestamp maps to frame, you know, becomes invalid, so I think—

Gurtej Kanwar: I don't think this is making that assumption.

Mo Zanaty: Does the draft assume that timestamp has one frame?

Gurtej Kanwar: No, it does not.

Mo Zanaty: So you can have 10 frames in the same timestamp.

Gurtej Kanwar: Yep. That's no problem.

Mo Zanaty: Okay. And then to clarify, Eric, what I meant earlier is that I agree the mechanism for the signaling can be generic, but for the mechanism to actually work, you have to have a binding to specific codecs. How would you implement this for, say, 264, 265, AV1, AV2? To have a full solution, you need to specify how this is used by those codecs—in the 264 case, what MMCO message you send to be able to tag your frame with this frame ID, and how that maps to the 264 frame ID. AV1 is even more complicated, because there are bitstream frame IDs and there are also RTP payload format frame IDs that are different from the bitstream frame IDs; now you have a third frame ID concept in this feedback message. So I think something that specifies those bindings is needed to have a full solution. Yes, it can have a general RTCP feedback message on the wire, but the whole solution needs, just like RPSI, bindings to the actual codec to make it work.

Gurtej Kanwar: It would be super helpful if you could also leave some comments on GitHub with your thoughts on that.

Jonathan Lennox: Yeah. Again speaking as an individual, I think an implementer needs to understand the binding between the codec-specific frames and the frame numbers on the wire, but it's not clear to me that needs to be specified. I think that can be entirely a matter for implementer innovation and development. Because the receiver is dumb, you don't need deep standardization of how that works, as long as the sender knows what it's doing and does something sensible. And if it doesn't, it won't work well, and then you shouldn't do that, you should do something better.

All right, any other comments? Are people generally in favor of us doing a call for adoption here? Any objections to doing a call for adoption? All right, I see general nods for doing a call for adoption, so we'll put that on our to-do list. I guess you've mostly been working out of your GitHub repo, so make sure you have an actual draft published in the—

Gurtej Kanwar: We have.

Jonathan Lennox: Okay, good. Is that up to date with your latest version?

Gurtej Kanwar: Yes.

Jonathan Lennox: Okay, yeah. Great. All right, so it sounds like we can do a call for adoption on that with the version you have published.

Gurtej Kanwar: Sounds good. Thank you.

Jonathan Lennox: All right. All right. So what do we have next? Okay, next up we have SFrame, I think, is that right? Yeah. All right, so Yoann. Yoann, do you want me to hand you slide control?

Yoann: Uh yeah, that would be good. I don't have a clicker, so.

Jonathan Lennox: Okay, yeah. Yes. All right, you should be able to control the slide.

Yoann: Yep. Next slide, please. Oh, you want me to—okay. Oh yeah, that's right. Hmm, working. Cool. So this is an update: we updated the draft based on feedback from the last IETF meeting, and I'm presenting the current status. The main change is that, as discussed previously, we added a new T bit (type bit) for either raw or pre-packetized content: 0 means raw as defined in WebRTC encoded transform, and 1 means pre-packetized, meaning you used a media-specific packetizer before doing encryption. We also have the pleasure of welcoming a new co-author, Aaron Rosenberg, who isn't here. Aaron is working on the Chips specifications—the Matter effort where SFrame is being used for camera streaming—so they're using this draft in their specification and implementation as well.

With the new T bit, we have an asymmetry between sender and receiver, which is quite good. Basically the sender has the freedom to do what it wants—use raw or packetized, and decide how to split a frame, per packet or per frame—and the receiver just follows the sender's decision based on what it receives. This simplifies the API at the W3C level, and we hope it will also help implementations. The spec now defines four algorithms: generation of SFrame RTP packets on the sender side; per-frame sending, which is what current user agents support; per-packet sending, which Webex is using as well; and on the receiver side, one algorithm for processing an RTP packet. So that's about it.

The last thing we added is per-SSRC key derivation. It stems from the Chips specification, where they want a different encryption key for each SSRC. What they do is have a base key, and then you can derive per-SSRC keys by using the SSRC as a salt on the base key. There's also a section on SFrame ratcheting integration. SFrame ratcheting is defined in the SFrame RFC, and the idea here is to use the SSRC-derived key as the base key for ratcheting; that's also described in the specification. It's of course optional—it's really up to the application—but it seems like good guidance, and I believe the Chips/Matter implementation will support these things.
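
A minimal sketch of the per-SSRC derivation idea described here, assuming an HKDF-SHA-256 derivation with the SSRC as salt; the actual KDF, labels, and key lengths are defined by the draft and the SFrame RFC, not by this sketch.

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

def derive_ssrc_key(base_key: bytes, ssrc: int) -> bytes:
    """Illustrative only: derive a per-SSRC key from the base key using
    the SSRC as the HKDF salt. The derived key could then serve as the
    base key for SFrame ratcheting, as described above."""
    hkdf = HKDF(
        algorithm=hashes.SHA256(),
        length=32,                        # assumed key length
        salt=ssrc.to_bytes(4, "big"),     # SSRC used as the salt
        info=b"sframe-per-ssrc-key",      # hypothetical label
    )
    return hkdf.derive(base_key)
```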

So that's everything new in the draft. We made some progress on the W3C API side; it's still under discussion but making progress. On the sender side, we'll have support for either per-frame or per-packet sending—it's really the application that decides. The plan is to have a native construct, meaning the user agent implements the packetization, encryption and so on itself, with support for per-frame and per-packet. For JS support, meaning you do the encryption using WebRTC encoded transform via JavaScript, support will be limited to per-frame sending for now, and we'll see whether there's interest in extending it. On the receiver side, native support will cover per-frame and per-packet, and JS support will also be limited to per-frame for now. That's where we are.

Talking about implementation efforts: there's libdatachannel prototyping being conducted as part of the Matter working group. I don't have the latest status, but I think it's working. It's limited to per-frame sending and per-frame receiving for now—it's not supporting per-packet—but in terms of payload it's following the latest draft. We envision libwebrtc prototyping. I haven't checked lately where it is, but design decisions are being made on how to do that effort, and I'm guessing it will be used by user agents like Chromium and Safari in the future; we'll see. The idea there is to support both per-packet and per-frame, that's the scope.

And finally the last slide. We want to finalize the document before going to last call. Although we have algorithms, we think we could simplify them a little bit—that's feedback we received. I'm guessing we might want to add some guidelines on RTP header extensions in the case of per-frame encryption, or when an SFU is re-fragmenting an SFrame packet: whether you copy all RTP header extensions or not. My guess is we will say something like: if header extensions are useful for packet switching, they should be in all packets; otherwise you do not need to copy them. Something like that; we cannot give more precise guidelines because it might be specific to each RTP header extension. And we plan to gather implementation feedback, and once we have that we might request last call. I don't have a specific schedule, but I'm hoping we can get to last call before the end of this year. That's it for the status update.

Jonathan Lennox: All right, thank you. Does anybody have any comments, questions, anything to say? All right, sounds good. Yeah, and that sounds like a reasonable timeline. This is definitely something where I would like to have some implementation experience before we go forward, because it's not simple. I mean, it's simpler than it could be, but there's still a lot there, and I'd like to know—

Yoann: Yeah, the good thing is there are two implementation efforts: the Matter one, where you do not need APIs in browsers, so I'm guessing it will be easier and faster; the other one still needs the W3C API to get the full thing, so it will take a little more time.

Jonathan Lennox: All right, sounds good. Thank you.

Yoann: Okay, thank you.

Jonathan Lennox: All right. So what do we have next? Okay, next up we have—Justin, is that right? Do I have this right? Yes. All right. Let's see. Philipp's slides, let me swap that. Sorry about that. This one, STUN Protocol for Embedding DTLS. Is that right? And do you want slide control, Justin?

Justin Uberti: Uh if it's easy, yeah sure.

Jonathan Lennox: Yeah, all right. That should be—you should have slide control.

Justin Uberti: All right, you got the slides control. All right. Okay, so called STUN Protocol for Embedding DTLS, or short name SPED.

So, a refresher on WebRTC. WebRTC basically sets up a sandwich of protocols. First there's ICE, which establishes a valid selected pair, and then DTLS runs over that selected pair. But these operations happen serially: first the ICE process needs to establish the selected pair, then DTLS is sent over that ICE connection. The DTLS handshake also has its own, different semantics for how packet loss is treated: ICE basically has monotonic pacing, DTLS has exponential backoff.

The way we try to solve this—not having ICE and DTLS run serially—is to send the DTLS handshake packets inside STUN by embedding them in STUN attributes. This can save an RTT. It works with all versions of DTLS, and with DTLS 1.3, which has a 1.5-RTT handshake, it actually delivers performance comparable with SDES, because all the DTLS transmission happens during the actual ICE process. Even more, we get reliability under packet loss by using ICE for retransmissions rather than the DTLS exponential timer. The downside: we now have larger STUN packets, and if you have a lot of pairs, you might end up sending bigger packets onto the network.

But let's go to the ASCII art here. In this ladder diagram you can see the difference from vanilla WebRTC on the left, where these are serialized: you send offer/answer, that's a round trip; you send the STUN binding request and response for ICE, that's a round trip; and then DTLS 1.2 takes two round trips, so it's four round trips total before you can send any media. With SPED, we embed that DTLS into the ICE process, so we only have three round trips. With DTLS 1.3, as I noted, it's even better because the handshake has one fewer round trip, so instead of three round trips we can get down to two, where the entire DTLS handshake is essentially covered by the ICE connectivity check exchange. Essentially, as I said, this gets us back to where we could have been with SDES, so we get the benefit of DTLS and have our cake and eat it too.

The overall mechanism is fairly straightforward. We define a new STUN attribute for carrying the DTLS packet. When the DTLS layer pushes out packets to be sent on the wire, the next ICE check that goes out takes such a packet and sends it, round-robin across the ICE checks. We add another new STUN attribute for an ack mechanism: the ICE checks that are sent also include acks of packets received via this mechanism, which ensures we can acknowledge packets right away rather than relying on the exponential backoff. There are no SDP changes here and no offer/answer negotiation of this; you can tell just from the ICE checks—if there's no attribute for the DTLS packets, you know the other side does not support this mechanism and you can stop sending. We have a few open questions, but the number one thing is where we should progress this document. Anyway, that's the overall summary of the mechanism. I'll open it up for questions.
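
A small sketch of appending such an attribute to an outgoing check, using the standard STUN TLV attribute format (16-bit type, 16-bit length, value padded to a 32-bit boundary, per RFC 8489). The attribute codepoint below is a placeholder; the real one would come from the draft and IANA.

```python
import struct

HYPOTHETICAL_DTLS_ATTR = 0xC0FF  # placeholder attribute type, not the draft's codepoint

def encode_stun_attribute(attr_type: int, value: bytes) -> bytes:
    """Encode one STUN attribute in TLV form: 16-bit type, 16-bit length,
    value padded to a 4-byte boundary. A SPED sender would append such an
    attribute carrying the pending DTLS handshake data to its Binding request."""
    padding = b"\x00" * ((4 - len(value) % 4) % 4)
    return struct.pack("!HH", attr_type, len(value)) + value + padding

dtls_flight = b"..."  # whatever the DTLS stack wants to send next
attribute = encode_stun_attribute(HYPOTHETICAL_DTLS_ATTR, dtls_flight)
```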

Harald Alvestrand: So I would just note that Google has implemented the mechanism, with a few divergences from the—from what's in the draft, but yes it works and we definitely want to ship it.

Mo Zanaty: Um yeah, I think this is a good direction, and one note on the larger STUNs. I think that's actually a benefit because in the video case, we often see the initial rate estimate is very bad because of tiny STUNs and now if you get something that more reflects the traffic you're about to pump, you get a better rate estimate from your first round trip.

Justin Uberti: Interesting.

Gorry Fairhurst: I see—I see a question about where you might want to do this. I think the first thing to clarify is that you do want to do this, because it seems like it could be really useful. And then I'll talk with the chairs of this group and I guess TSVWG to see who should take on the work if we decide to do it.

Jonathan Lennox: Yeah. I think those are probably the two reasonable candidates among groups that are still open. For us it might be a charter change—I'd have to think about it—but TSVWG might work, though there's also a lot going on there. All right. But it sounds like people do feel this would be useful. Is there anybody who feels this should not be done somewhere in the IETF? If not, it sounds like we should talk with Gorry and the TSVWG chairs to figure out where it should be done, but it sounds like people like the idea. All right. Thank you. And Gorry—

Gorry Fairhurst: So can I check—can I check who might want to implement this? Because I think I heard people at the mic, but I wasn't quite sure.

Jonathan Lennox: I think Harald said that Google in Chrome is doing something that is similar to this though not exactly the same and they think they'd rather do something that was standardized. But I will let Harald speak for himself.

Harald Alvestrand: To be precise, Jonas from Google has been developing this together with Fippo and Justin, and currently we have some disagreements on details on what the mechanism should be, but our goal is to converge on a single specification and ship that.

Gorry Fairhurst: That is always a good input to have for an IETF work item. Thank you.

Justin Uberti: And we at OpenAI are implementing this into the Pion framework for Golang version of WebRTC.

Jonathan Lennox: Also good. So what I suggest is that I follow up with the chairs here first, we discuss the charter and where the best place is to put this work, and then the appropriate set of chairs will be able to do an adoption call.

Justin Uberti: Sounds good. Great.

Jonathan Lennox: All right. Thank you. And Gorry—great. In that case, I think we can move on to Philipp. Yep. To the other thing which we don't know where it should go, but possibly the same. Oh wait, no. That's the one we just did, sorry. Hold on, I think I just shared the same one again. Hold on. This is SNAP. Yes, sorry. There we are. And Philipp, do you want slide control? No, he wants me to have slide control. All right. I'm not sure we hear you right now, Philipp.

Philipp Hancke: Now it should be working.

Jonathan Lennox: Now we hear you, yes.

Philipp Hancke: Okay, I'll let you drive. It's just three slides, so it should be easy. Next slide. So like the SPED work, this is something that sits between working groups. I think it's more of an SCTP topic, so I tagged it for TSVWG, but it does use SDP for the solution, so the question is again what the right working group is. I'll let the chairs figure that out. As a reminder, WebRTC uses SCTP for data channels, with the Data Channel Establishment Protocol, RFC 8832. This runs over DTLS after ICE is done. We think the time to open a data channel in WebRTC can be reduced. One way to reduce it is combining ICE and DTLS—that's the other draft. This draft is about reducing the time to open the data channel itself. DCEP already allows sending data without waiting for an acknowledgment, and we have negotiated channels, so you can basically skip the whole negotiation when possible.

Next slide, please. So again ASCII art. On the left, we have the vanilla WebRTC case: SDP offer answer one round trip, ICE one round trip, DTLS 1.2 another two round trips, and then SCTP does this init, init-ack, cookie, and cookie-ack. Which brings us up to six round trip times, and we can reduce that on the right side to four by skipping the SCTP init.

Next slide, please. What we do is take the SCTP init and init-ack, which are either one or two RTTs—the spec says it can be one; what I've seen in WebRTC implementations was two—and do this with the upfront SDP exchange. We basically take the init chunk, base64-encoded, and put it into a single attribute in the SDP, which is reasonable because we only have a single SCTP m-line, so it doesn't blow up the SDP size too much. It still does feature negotiation, which fits the SDP purpose, and from there you can jump straight to the data channel establishment protocol or negotiated channels. We have two implementations: one in libwebrtc and Chromium—still behind a flag because we want to avoid shipping it without getting IETF eyes on it—and the same for Pion: we have an SCTP and WebRTC implementation pending, waiting for some feedback before merging and shipping. Does anyone have questions?
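
A one-line sketch of the SDP carriage described here. The attribute name "sctp-init" is a placeholder for illustration, not the name used in the draft.

```python
import base64

def sctp_init_to_sdp_attribute(init_chunk: bytes) -> str:
    """Illustrative only: base64-encode the SCTP INIT chunk and carry it in
    a single SDP attribute on the (single) SCTP m-line, as the presentation
    describes. The attribute name here is hypothetical."""
    return "a=sctp-init:" + base64.b64encode(init_chunk).decode("ascii")

# Example (dummy bytes standing in for a real INIT chunk):
print(sctp_init_to_sdp_attribute(b"\x01\x00\x00\x14" + b"\x00" * 16))
```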

Jonathan Lennox: I'll ask a question as an individual, Jonathan Lennox. I don't know how much implementations actually support it, but I think in theory WebRTC kind of supports this, as I recall: there's the issue of what happens if the offer is forked. Then you're sending the same SCTP init to more than one destination. Is that going to be a problem?

Philipp Hancke: Um, I think it will always end up with a single offer answer pair.

Jonathan Lennox: Yeah, so a single offer might have multiple answers, and then you have a single, you know, um—

Philipp Hancke: That's a good question. I need to think about it.

Justin Uberti: I was thinking that the DTLS connection would only be established with a single remote peer, so I think it's probably not likely to be an actual issue in practice. And even if it is parallel forked, I think the init packet would still be valid nonetheless.

Jonathan Lennox: I don't know enough about SCTP to know if that has randomness or anything like that. Harald?

Harald Alvestrand: And the problematic case is of course PR-answer. Sigh. If you get a PR-answer with one init response chunk and then an answer with a different one, I don't know what state you end up in. When you get a PR-answer and an answer with different signatures, that just causes an ICE restart—which unfortunately is actually used in practice to handle multi-answer situations. So we could just say no to SCTP init in PR-answer; that would be tempting. But we probably need to address it.

Magnus: Yes. I think the problem here is, as you say, that if you get multiple answers, you really need to know you're binding the SCTP init-ack to the actual endpoint you're establishing ICE connectivity with, so that you can actually continue when you come to the after-DTLS part with the cookie echo and cookie ack. Otherwise it will not work, because this does carry SCTP configuration data—there could potentially be differences in parameters and so on—and your construction of the verification tags in the packets for the second round needs the first round to match. So you have a problem here: you need to know that you're talking to the same endpoint you exchanged the offer/answer with when you reach the post-DTLS part. Otherwise, I think this is fine. Init/init-ack is intended to be stateless on the responding side for SCTP, so it should be able to take the cookie sent in this and verify that it works. Otherwise you would just get an SCTP abort—"this doesn't work, go back"—which doesn't mean you can't set up, assuming you have an implementation that, when it gets an abort from the server side because the cookie doesn't match ("it's not for me, I don't understand it"), does a new init. Cool.

Gorry Fairhurst: Yeah, same sort of answer to the previous question of which working group. Let's discuss this, but it looks like this needs discussion here in AVTCore to work out what exactly is being proposed, although I really suspect that any work to change SCTP should be done in TSVWG.

Jonathan Lennox: Yeah, please discuss here. Yeah, go ahead. The main issue is that I think a lot of the complexity comes out of the SDP parts. If MMusic still existed, this would definitely go to MMusic, but it doesn't. I'm not sure who has inherited MMusic—arguably it's TSVWG or arguably it's us—but certainly anything that needs a lot of SCTP knowledge is going to be TSVWG, and there's not a lot of SDP knowledge there. So it's tricky.

Gorry Fairhurst: Well, okay. This is why the IETF exists and we will—we will figure out how to solve the problem. If it's—if it's what people really want to do, we'll find a solution.

Jonathan Lennox: Yeah. It certainly sounds like there's interest in this, I would say. But—Magnus, did you have more to say?

Magnus Westerlund: Yes, Magnus Westerlund. I do think it's primarily a mapping question—primarily how SDP offer/answer interacts with the ICE procedures and so on. The SCTP part is probably fairly straightforward; it's not really changed, you just move the message from one place to another, which has some implications for what could happen, but I think you can explain what this would correspond to on an IP network: if you tried to talk to another client, it would basically say, oh, I sent the wrong thing to the wrong server, and that's what's going to happen. So I think it's not necessarily so much about SCTP. You definitely need SCTP knowledge to sanity-check it—for example by inviting Michael Tuxen and others to look at this—but it's probably better to focus this in MMusic/AVTCore-land.

Jonathan Lennox: Yeah, agreed. All right. Any other comments on this draft? It sounds like there is interest in doing it; we still need to figure out where, but it might be here. If we do it here it might require some charter updates, but that's something we could take care of if necessary. All right, any other comments on this topic or any other business? I think this was our last presentation, unless I have forgotten something. In that case, let's go over what the chairs' work items are. Marius, do you have notes there, or should we—

Marius Kleidl: Yeah. So for the WebRTC profile, we're gonna issue a last call once the new version is published. For frame acknowledgement, I think we can start a call for adoption as soon as the current one runs out, so I don't think we have to wait for that. And for the last two presentations, we have to figure out where to put this. I think that's all that we have to do on our end.

Jonathan Lennox: Okay. Did we say we were going to do a working group last call on—no, APV still has some work to do, right? So, I'm trying to think: was there something else where we said we're ready for working group last call, or am I misremembering?

Marius Kleidl: I don't think I have recorded anything on my list.

Jonathan Lennox: What did we say about ARF? I'm trying to remember.

Gurtej Kanwar: Um—okay. Do you want to come to the mic? Yeah.

Suhyeok Jeong: We'll request it later, yeah.

Jonathan Lennox: Say again?

Suhyeok Jeong: We will request it later, I mean, maybe after we—

Jonathan Lennox: Oh right, right. We're waiting on the MPEG update. Okay, perfect. Thank you. Great. All right, so we have our work items as chairs; expect those in the next little bit. And as usual, probably expect an interim sometime roughly halfway between now and Vienna, so I'd guess probably May, though we'll need to figure out what works for people.

All right. Thank you all very much. Any other business? Otherwise, we can call it a day and you can wander down to—those who are here in person can wander down towards the snacks in the plenary. And everybody else can go do whatever is appropriate for their time of day, whether that's finish their day or get some sleep or whatever. All right. Thank you all very much.

Marius Kleidl: Thank you all and enjoy your snacks.