Markdown Version | Session Recording
Session Date/Time: 28 Sep 2022 14:00
COINRG
Summary
The COINRG interim meeting focused primarily on presenting and discussing ongoing research efforts and new ideas in computing in the network. Key topics included novel programmable hardware for in-network machine learning, distributed coordination for in-network functions, updates on use cases, new distributed learning architectures, and the application of Extensible In-band Processing (EIP) to machine-learning feature exchange. The chairs also provided updates on the status of research group documents and milestones, emphasizing COINRG's role as a research group fostering innovation rather than a standards-setting body.
Key Discussion Points
- COINRG Scope: The chairs reiterated that COINRG is a research group focused on fostering new ideas and research in computing in the network, not for developing standards. The agenda reflected this focus on research projects and emerging concepts.
- RG Document Status and Milestones:
  - Two existing RG documents require updates, with potential discussion on advancing them.
  - A new draft on "Distributed Learning Architecture with Edge-Cloud Collaboration" was presented.
  - Several other drafts have expired, and their future needs to be addressed (either left expired or moved forward).
  - Two milestones are currently late. The chairs plan to review and propose a new set of milestones that better reflect the dynamic nature of the computing-in-the-network field.
- Trio ML: In-Network Straggler Mitigation for Distributed ML:
  - Myriam presented research on leveraging the thread-based architecture of Juniper Networks' Trio chipset to mitigate "stragglers" (slow servers) in distributed machine learning training.
  - The approach achieved 1.8x faster training time and 1.6x faster time-to-accuracy compared to Tofino-based solutions in environments with stragglers.
  - Discussion confirmed that the training iteration times used were realistic and acknowledged that stragglers can be caused by factors beyond network delay (e.g., server load, garbage collection). The benefits of the solution were identified as specific to Trio's architecture and its efficient straggler-mitigation capabilities.
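The core idea behind in-network straggler mitigation can be sketched in a few lines: the aggregating device proceeds once enough workers have reported, dropping and rescaling for the stragglers rather than waiting on them. This is a hedged, host-side illustration only; the function name, threshold, and rescaling policy below are assumptions for exposition, not Trio's actual on-chip mechanism.

```python
# Minimal sketch of cutoff-based straggler mitigation during in-network
# gradient aggregation. All names and thresholds are illustrative; the
# real system runs on the Trio chipset's threads, not in host Python.

def aggregate_with_straggler_cutoff(updates, min_workers, total_workers):
    """Sum gradient updates, proceeding once `min_workers` have reported.

    `updates` maps worker id -> gradient vector (None = straggler that
    missed the deadline). Missing contributions are dropped and the sum
    is rescaled, so a slow server cannot stall the whole iteration.
    """
    arrived = {w: g for w, g in updates.items() if g is not None}
    if len(arrived) < min_workers:
        raise RuntimeError("too few workers reported before the deadline")
    dim = len(next(iter(arrived.values())))
    summed = [sum(g[i] for g in arrived.values()) for i in range(dim)]
    # Rescale so the aggregate approximates the full-participation sum.
    scale = total_workers / len(arrived)
    return [x * scale for x in summed]

# Worker 2 is a straggler and is excluded from this iteration.
result = aggregate_with_straggler_cutoff(
    {0: [1.0, 2.0], 1: [3.0, 4.0], 2: None}, min_workers=2, total_workers=3)
```

The rescaling step is one plausible policy; other designs simply forward the partial sum and let workers compensate.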
- Daiso: Distributed Coordination for In-Network Computations:
  - Atharv presented Daiso, a solution for orchestrating in-network computations in Named Function Networking (NFN), which is built on Information-Centric Networking (ICN).
  - Daiso aims to overcome the limitations of NFN's purely local decision-making by enabling nodes to form "synchronization groups" and exchange state information.
  - The system operates in four phases: neighbor node discovery, synchronization group formation, synchronization (using State Vector Sync, SVS), and coordination (altering forwarding, load balancing, function replication).
  - Simulations showed improved function placement and reduced completion time, with optimization strategies to manage network overhead.
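The four phases can be illustrated with a toy topology: discover neighbors, form a group, share state within it, then use that shared state for a coordination decision such as function placement. Class names, the load metric, and the placement rule below are invented for this sketch; the actual system runs over NFN/ICN with SVS-based synchronization.

```python
# Illustrative walk-through of Daiso's four phases on a toy topology.
# All identifiers here are assumptions made for this example.

class Node:
    def __init__(self, name, neighbors, load):
        self.name, self.neighbors, self.load = name, neighbors, load
        self.known_state = {}  # state learned from the sync group

def form_sync_group(nodes, seed):
    # Phases 1-2: discover the seed's neighbors and form a sync group.
    group = {seed} | set(nodes[seed].neighbors)
    # Phase 3: exchange state (here just a load value) within the group.
    state = {n: nodes[n].load for n in group}
    for n in group:
        nodes[n].known_state = dict(state)
    return group

def place_function(nodes, group):
    # Phase 4: coordination - replicate the function on the
    # least-loaded group member instead of deciding purely locally.
    return min(group, key=lambda n: nodes[n].load)

nodes = {
    "a": Node("a", ["b", "c"], load=0.9),
    "b": Node("b", ["a"], load=0.2),
    "c": Node("c", ["a"], load=0.5),
}
group = form_sync_group(nodes, "a")
target = place_function(nodes, group)
```

With only local knowledge, the overloaded node "a" would have executed the function itself; group-wide state lets it delegate to the least-loaded member.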
- Use Cases for In-network Computing (Draft Update):
  - Dirk provided an update on the expired "Use Cases for In-network Computing" draft.
  - The draft currently collects and structures use cases into four main groups, refines terminology, and starts an analysis of research questions.
  - Key questions posed to the RG included:
    - Should terminology from this and the coin-coaches draft be collected in a separate document?
    - Should the analysis section be part of this document or a separate effort?
    - Does the RG want this work to be continued, requiring resubmission and a potential last call for publication, and are there more contributors?
  - The chair expressed a sense that a separate terminology draft would be a good idea, that the analysis might be better placed in a separate document, and that the work is important to continue given the field's dynamic nature and history.
- Distributed Learning Architecture with Edge-Cloud Collaboration (New Draft):
  - Chaoli presented a new draft proposing a distributed learning architecture for AI models, leveraging edge-cloud collaboration.
  - The approach involves segmenting AI models for independent training across edge and cloud layers, with the cloud handling model determination and standardization.
  - The goal is to balance compute load across network tiers, improve model accuracy, and reduce pressure on edge devices.
  - The presenter expressed an intent to provide a compute-balance model for training in networks.
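Model segmentation across tiers can be sketched as a split forward pass: the edge runs the first segment and ships the intermediate activation to the cloud, which runs the remainder. The layer contents and split point below are arbitrary assumptions for illustration, not anything specified in the draft.

```python
# Toy sketch of edge-cloud model segmentation: the model is split into
# an edge segment and a cloud segment, each runnable independently.
# Weights and the split point are invented for this example.

def linear(weights, x):
    """Apply one dense layer (matrix-vector product, no bias)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

EDGE_SEGMENT = [[1.0, 0.0], [0.0, 1.0]]   # layer kept on the edge device
CLOUD_SEGMENT = [[2.0, 0.0], [0.0, 3.0]]  # layer kept in the cloud

def edge_forward(x):
    # Runs on the edge; only the activation crosses the network.
    return linear(EDGE_SEGMENT, x)

def cloud_forward(activation):
    # Runs in the cloud, completing the model.
    return linear(CLOUD_SEGMENT, activation)

output = cloud_forward(edge_forward([1.0, 1.0]))
```

The compute-balance question then becomes where to place the split so that edge load, cloud load, and the size of the transmitted activation are jointly acceptable.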
- Data Operations in Network (DOIN):
  - Yiting discussed "Data Operations in Network," focusing on scenarios where network devices perform simple, line-speed computing tasks.
  - Three scenarios were highlighted: AI aggregation (basic operations like sum/arbitration), lock tests (compare-and-swap for lock management), and sequence management (fetch-and-op for packet ordering).
  - The presenter emphasized the need for general mechanisms to route computational packets to the correct device, specify operations, and describe structured data.
  - Discussion ensued regarding the term "data operation," with a sense that the operations described were closer to "atomic computations" than pure "data operations." The focus on low-level operations was clarified as essential for line-rate performance.
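The two primitives named above have well-known semantics, sketched here on a toy register store. The class and register names are illustrative; on a real device these would be line-rate register operations in hardware, not Python.

```python
# Sketch of the two "atomic computation" primitives discussed:
# compare-and-swap (lock management) and fetch-and-add, one concrete
# form of fetch-and-op (sequence management). Names are illustrative.

class DeviceRegisters:
    def __init__(self):
        self.regs = {}

    def compare_and_swap(self, key, expected, new):
        """Set regs[key] = new iff it currently equals `expected`."""
        current = self.regs.get(key, 0)
        if current == expected:
            self.regs[key] = new
            return True
        return False

    def fetch_and_add(self, key, delta):
        """Return the current value, then add `delta` to it."""
        current = self.regs.get(key, 0)
        self.regs[key] = current + delta
        return current

dev = DeviceRegisters()
acquired = dev.compare_and_swap("lock", expected=0, new=1)  # lock taken
blocked = dev.compare_and_swap("lock", expected=0, new=1)   # already held
seq0 = dev.fetch_and_add("seq", 1)  # next sequence number for a packet
seq1 = dev.fetch_and_add("seq", 1)
```

Because each operation reads and writes a register in one step, concurrent packets observe consistent lock and sequence state, which is why such low-level primitives suit line-rate processing.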
- EIP in ML Context (Extensible In-band Processing):
  - Stefano provided an update on using EIP for machine learning in networking, extending previous work on per-packet ML inference (Taurus).
  - The proposal involves a distributed architecture where feature extraction occurs at one node, and the encoded features are transmitted via EIP (an IPv6 hop-by-hop option) to another node for ML inference.
  - A "lightweight standardization" approach was suggested: standardize the framework for exchanging "Encoded Feature Representation" (EFR) records while leaving the specific feature content open to innovation.
  - Discussion touched on the challenge of separating feature extraction from ML inference, especially in deep learning, with the presenter arguing for the practicality of pre-chosen, flow-level features for scalable in-network inference.
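The "standardize the envelope, not the features" idea can be sketched as a TLV that an EFR record could ride in. The option type value, feature IDs, and field widths below are invented for illustration; the actual EIP/EFR encoding is up to the draft, which deliberately leaves the feature content open.

```python
import struct

# Sketch of packing an "Encoded Feature Representation" (EFR) record as
# a type/length/value blob of the kind an IPv6 Hop-by-Hop option could
# carry. The option type and record layout are placeholder assumptions.

EFR_OPTION_TYPE = 0x3E  # placeholder, not an assigned option type

def pack_efr(features):
    """Pack a list of (feature_id, value) pairs into a TLV."""
    body = b"".join(struct.pack("!BH", fid, val) for fid, val in features)
    return struct.pack("!BB", EFR_OPTION_TYPE, len(body)) + body

def unpack_efr(blob):
    """Recover the (feature_id, value) pairs from a packed TLV."""
    _opt_type, length = struct.unpack("!BB", blob[:2])
    body = blob[2:2 + length]
    return [struct.unpack("!BH", body[i:i + 3]) for i in range(0, length, 3)]

# e.g. feature 1 = mean packet size, feature 2 = flow count (assumed IDs)
efr = pack_efr([(1, 1400), (2, 42)])
```

Only the framing (type, length, id/value layout) would be standardized; what the feature IDs mean is left to the extractor and the inference node to agree on.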
Decisions and Action Items
- RG Milestones: The chairs will work on reviewing and proposing a new set of milestones for the research group.
- Use Cases for In-network Computing Draft: The authors will resubmit the draft. Discussion is requested on the COINRG mailing list regarding:
  - The continuation of the draft's work.
  - Whether terminology from this and other related drafts should be collected in a separate document.
  - Whether the analysis section should remain within this document or be moved to a separate document.
- Distributed Learning Architecture Draft: The presenter is encouraged to initiate a discussion on the COINRG mailing list about how this new draft could evolve within the group.
- Data Operations in Network: Discussion regarding the concepts and terminology is encouraged on the COINRG mailing list. The presenter intends to pursue a draft in the IETF.
- EIP in ML Context: The presenter will continue work on a position paper and extend the EIP draft to include this use case. Updates will be submitted to the COINRG mailing list for comments.
Next Steps
- Continued engagement on the COINRG mailing list for the "Use Cases for In-network Computing," "Distributed Learning Architecture," and "Data Operations in Network" topics.
- The chairs will prepare for the next COINRG meeting, which is planned for London.
- The chairs will coordinate to finalize the meeting minutes.