Thursday August 28, 2025 15:25 - 15:50 CEST
In AI and machine learning, model training has become increasingly complex and resource-intensive. Training jobs often run for days or weeks, distributed across multiple nodes with expensive GPU accelerators. Container checkpointing is a crucial technique for fault tolerance: it mitigates the impact of hardware and software failures by periodically saving the state of a computation and resuming from the last checkpoint when a failure occurs. While checkpointing support has recently been integrated into Kubernetes, coordinating checkpoint/restore across multiple containers and nodes remains a challenge. In this talk, we discuss how we have extended container runtimes and CRIU to synchronize checkpointing operations among multiple container instances in Kubernetes clusters. The talk also covers how we enable efficient end-to-end encryption for sensitive data in checkpoints, and how this work integrates with existing container platforms.
Speakers

Radostin Stoyanov

PhD Student, University of Oxford
Radostin Stoyanov is a PhD student in the Scientific Computing research group at the University of Oxford and a Software Engineer on the Core Kernel Team at Red Hat. His research focuses on improving the resilience and performance of HPC and cloud computing systems.
