Session Date/Time: 14 May 2024 18:00
TOOLS
Summary
This TOOLS Working Group session covered the Mailman 3 transition issues, the strategic direction for IETF infrastructure, and planned application migrations. Key decisions include a shift away from AWS EKS/Kubernetes for most core services, the separation of the identity provider from the Data Tracker to enable two-factor authentication (2FA), and an upcoming significant infrastructure cutover. The team also addressed lessons learned about communication during service disruptions and ongoing efforts to improve system observability.
Key Discussion Points
Mailman 3 Transition and Delivery Issues
- The Mailman 3 transition itself was smooth, but significant issues arose from the interaction between Postfix, Postconfirm, and Mailman 3.
- Problems included messages to list owners looping, and messages that appeared to the sender to have been sent but were never delivered to the lists or archives.
- During a three-hour period on the Thursday a week before the meeting, no messages were delivered to lists. These issues are believed to have been resolved since Thursday 19:00 UTC.
- Some messages were saved but are difficult to recover due to a large volume of unrelated files in the same directory. The recommendation is for individuals to resend any important messages that may have been lost.
- The difficulty in diagnosis was exacerbated by Postconfirm's tendency to swallow error messages, leading to a lack of observability. Future plans include rewriting Postconfirm to improve logging and transparency.
- Mark, from the Sirius team, acknowledged the challenges and apologized for the disruption. He noted that although extensive testing was done, the specific line-ending issue between Postconfirm and Mailman 3 was unanticipated and difficult to diagnose. The pressure to deploy an interim system was also a factor.
- Warren expressed appreciation for the team's efforts, given the complexity of the Mailman 2 to Mailman 3 migration and the existing "weird hackery" in Mailman 2.
- A key communication lesson was the need to over-communicate more aggressively with the broader community during disruptive periods, not just with list owners and moderators.
- A significant usability issue with Mailman 3 was raised: the inability to jump directly from a moderation notification to the held message queue. A feature request has been submitted to Mailman 3 maintainers to restore this functionality.
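The notes above do not describe the line-ending issue or the swallowed errors in detail. As a purely illustrative sketch, assuming the mismatch involved mixed LF and CRLF line terminators (an assumption, not something confirmed in the session), a filter that hands a message from one component to the next could normalize line endings and log failures instead of discarding them. The names `normalize_line_endings`, `hand_off`, and the `mailfilter` logger are hypothetical and are not part of Postconfirm or Mailman 3.

```python
import logging
import re

logger = logging.getLogger("mailfilter")  # hypothetical logger name

# Matches CRLF first, then a lone CR or a lone LF.
_ANY_NEWLINE = re.compile(rb"\r\n|\r|\n")


def normalize_line_endings(raw: bytes, newline: bytes = b"\r\n") -> bytes:
    """Rewrite every CRLF, lone CR, or lone LF in `raw` as `newline`."""
    return _ANY_NEWLINE.sub(lambda _match: newline, raw)


def hand_off(raw: bytes, deliver) -> bool:
    """Normalize a message and pass it to `deliver`.

    Returns True on success. On failure the exception is logged with a
    traceback (rather than silently dropped) and False is returned.
    """
    try:
        deliver(normalize_line_endings(raw))
        return True
    except Exception:
        logger.exception("delivery failed for %d-byte message", len(raw))
        return False
```

Here `deliver` stands in for whatever callable passes the message to the next stage. The point of the sketch is only that a failure leaves a log entry an operator can find.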
Tools Team Retreat and Infrastructure Strategy
- A recent retreat led to a reassessment of the IETF's cloud strategy. The team decided to move most core services away from Amazon Web Services (AWS) EKS/Kubernetes because it had proven excessively complex and time-consuming.
- The team is moving services to DigitalOcean and Azure, where it is seeing faster and more promising results.
- The tools roadmap has been updated and is available online. Robert acknowledged that the cards on the roadmap need more detailed descriptions.
Identity Provider (IDP) Separation
- A major decision from the retreat is to separate the identity provider from the Data Tracker.
- The current OIDC super-provider package in Django lacks features like two-factor authentication (2FA), which has been frequently requested.
- The new IDP will be based on Keycloak, which will enable 2FA and support different authentication modes while maintaining existing authorization logic.
"Tools Team" Terminology Discussion
- The use of "tools team" has drifted: it sometimes refers to the community members on the call and other times to the LLC-hired staff. The team is looking for clearer terminology. One suggestion was to prefer "team" for community-inclusive groups, similar to "gen-art art team."
Upcoming Infrastructure Transition
- The IETF is in the "final weeks" of the infrastructure transition, with a significant cutover planned for major applications: the Data Tracker, mail archive, main www.ietf.org site, and I-D-Tracker services.
- This cutover will require close coordination and an outage measured in hours (initially estimated at 3-4 hours, with final tests to refine this).
- Critical details being finalized include mail transport to the Data Tracker when it's on a different host and interactions with external APIs (e.g., at ARIN).
- Sirius will move the core mail processing pipeline to the cloud after the Mailman 3 dust settles, and the timing will be communicated widely once it is known.
- The RFC Production Center (rscpc) server and origin for rfc-editor.ietf.org is planned to move to the cloud tomorrow morning (relative to the meeting).
- Significant work has been done on migrating static websites to cloud buckets (e.g., Cloudflare) and simplifying complex Apache redirect configurations, moving redirect logic to the edge.
- Monitoring and alerting infrastructure has been significantly improved, providing greater visibility.
- An infrastructure diagram is available for review, with an intent to refine it based on feedback.
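The notes do not show the redirect rules themselves. As a rough sketch of what moving redirect logic out of Apache configuration and toward the edge can look like, the rules can be kept as a plain data table that a small function evaluates for each request. The paths and targets below are invented placeholders, not actual IETF redirects, and `resolve_redirect` is a hypothetical helper rather than any CDN's API.

```python
from typing import Optional

# Hypothetical rules: exact-path matches are checked before prefix matches.
EXACT_REDIRECTS = {
    "/old-page.html": "/new-page/",
}
PREFIX_REDIRECTS = {
    "/archive/": "https://archive.example.org/",
    "/archive/2020/": "https://archive.example.org/y2020/",
}


def resolve_redirect(path: str) -> Optional[str]:
    """Return the redirect target for `path`, or None if no rule applies."""
    if path in EXACT_REDIRECTS:
        return EXACT_REDIRECTS[path]
    # Try the longest prefix first so more specific rules win.
    for prefix in sorted(PREFIX_REDIRECTS, key=len, reverse=True):
        if path.startswith(prefix):
            return PREFIX_REDIRECTS[prefix] + path[len(prefix):]
    return None
```

Keeping the rules as data makes them easier to review and test than a long chain of rewrite directives, which is one plausible reason to simplify them this way.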
Discussion on Mail System Hosting
- John Levine raised concerns about the sending reputation of DigitalOcean and Azure and recommended against using them for outbound mail. AWS SES (Simple Email Service) was noted as reliable for outbound mail.
- The team confirmed that AWS SES will continue to be used for outbound mail, while other mail processing components (receivers, submission points, Mailman 3 origin) are being evaluated for DigitalOcean or Azure.
Personnel and Other Updates
- Matthew Hollow, a front-end specialist, will join the LLC programming staff in just over a week. This is expected to be the last direct programming staff addition for some time.
- A new Meetecho feature for handling disruptive participants will be implemented before IETF 120. Efforts are underway to train session chairs and update documentation to ensure smooth adoption.
Decisions and Action Items
- Decision: The IETF infrastructure strategy will shift away from AWS EKS/Kubernetes for most core services, migrating to DigitalOcean and Azure.
- Decision: The Identity Provider (IDP) will be separated from the Data Tracker and based on Keycloak to enable two-factor authentication (2FA) and other features.
- Action Item: Individuals who sent important mail during the Mailman 3 disruption (particularly before Thursday, a week prior to the meeting) should resend those messages.
- Action Item: Robert to add more detailed descriptions to the cards on the tools roadmap.
- Action Item: The team will communicate widely regarding the planned significant infrastructure cutover outage, including refined time estimates.
- Action Item: The infrastructure diagram will be refined based on community feedback.
- Action Item: Follow up on the feature request submitted to the Mailman 3 maintainers to restore direct links from moderation notifications to the held message queue.
Next Steps
- Continue efforts to backfill and recover undelivered messages from the Mailman 3 transition, though this is a complex and time-consuming task.
- Finalize mail transport details for the Data Tracker and external API integrations.
- Schedule and widely communicate the date for the main infrastructure cutover of Data Tracker, mail archive, www.ietf.org, and I-D-Tracker services.
- Sirius will begin work on migrating the core mail processing pipeline to the cloud once Mailman 3 has stabilized.
- The rscpc (RFC Production Center server) is planned for migration to the cloud tomorrow morning (relative to the meeting date).
- Matthew Hollow will start as a front-end specialist in just over a week.
- The Meetecho disruptive participant feature will be implemented, and chairs will be trained before IETF 120.