Name: Enabling Secure Container Checkpointing for Distributed Model Training - Radostin Stoyanov, University of Oxford
Start: 2025-08-28T15:25:00+0200
End: 2025-08-28T15:50:00+0200

Thursday August 28, 2025 15:25 - 15:50 CEST

TBA

In AI and machine learning, model training has become an increasingly complex and resource-intensive task. Training jobs often run for days or weeks, distributed across multiple nodes with expensive GPU accelerators. Container checkpointing is a crucial technique for fault tolerance: it mitigates the impact of hardware and software failures by periodically saving the state of a computation and resuming from the last checkpoint after a failure. While support for checkpointing has recently been integrated into Kubernetes, coordinating checkpoint/restore across multiple containers and nodes remains a challenge. In this talk, we discuss how we have extended container runtimes and CRIU to synchronize checkpointing operations among multiple container instances in Kubernetes clusters. The talk also covers how we enable efficient end-to-end encryption for sensitive data in checkpoints and how this work integrates with existing container platforms.

Speakers

Radostin Stoyanov

PhD Student, University of Oxford

Radostin Stoyanov is a PhD student in the Scientific Computing research group at the University of Oxford and a Software Engineer on the Core Kernel Team at Red Hat. His research focuses on improving the resilience and performance of HPC and cloud computing systems.

Session Presentation

Container Plumbing Days 2025
