Type: Session Presentation
Thursday, August 28
 

15:25 CEST

Enabling Secure Container Checkpointing for Distributed Model Training - Radostin Stoyanov, University of Oxford
Thursday August 28, 2025 15:25 - 15:50 CEST
In the field of AI and machine learning, model training has become an increasingly complex and resource-intensive task. Training jobs often run for days or weeks, distributed across multiple nodes with expensive GPU accelerators. Container checkpointing is a crucial technique for implementing fault tolerance: it mitigates the impact of hardware and software failures by periodically saving the state of computations and resuming from the last checkpoint when a failure occurs. While support for checkpointing has recently been integrated into Kubernetes, enabling checkpoint/restore coordination across multiple containers and nodes remains a challenge. In this talk, we will discuss how we have extended container runtimes and CRIU to synchronize checkpointing operations among multiple container instances in Kubernetes clusters. The talk will cover how we enable efficient end-to-end encryption for sensitive data in checkpoints and the integration with existing container platforms.
Speakers
Radostin Stoyanov

PhD Student, University of Oxford
Radostin Stoyanov is a PhD student in the Scientific Computing research group at the University of Oxford, and a Software Engineer on the Core Kernel Team at Red Hat. His research focuses on improving the resilience and performance of HPC and cloud computing systems.
Thursday August 28, 2025 15:25 - 15:50 CEST
TBA

15:50 CEST

KubeVirt on the Loose: Kubernetes-Powered VM Migrations That Defy Gravity - Ronny Issac, Kubermatic GmbH
Thursday August 28, 2025 15:50 - 16:15 CEST
This session is about freeing your workloads from the shackles of traditional maintenance windows. By dynamically relocating running Virtual Machines (VMs) between Kubernetes nodes, KubeVirt keeps apps online during upgrades, scaling, or node failures.
Beneath the surface, KubeVirt orchestrates memory transfers and status checks, making VMs appear gravity-defying. “Pre-copy” sends most pages while VMs stay active, followed by a quick “stop-and-copy.” A “domain notify pipe” coordinates source and destination. For high dirty rates, “auto-converge” throttles CPU or “post-copy” starts the VM on the target, fetching pages on demand. Dedicated “migration0” and configurable parameters (e.g. completionTimeoutPerGiB) prevent stalls and manage bandwidth.
Learn best practices for migration timeouts, TLS toggling, and traffic isolation. We’ll also explore real-world scenarios like draining nodes for maintenance or using auto-converge on heavy VMs.
Speakers
Ronny Issac

Engineering Team Lead, Kubermatic GmbH
I've spent 10+ years in infrastructure at Hewlett Packard Enterprise, Nutanix, and AWS. Now, as Engineering Team Lead at Kubermatic, I collaborate with bright minds on Kubernetes and beyond, applying cutting-edge trends and hands-on strategies to build robust, scalable cloud-native...
Thursday August 28, 2025 15:50 - 16:15 CEST
TBA

16:25 CEST

Arm-ing the Future: Multi-Arch Container Builds with BuildKit, Kaniko & Skopeo - Anshika Tiwari, AWS; Aditya Soni, Forrester Research
Thursday August 28, 2025 16:25 - 16:50 CEST
As ARM adoption grows for cloud-native workloads, thanks to its efficiency and cost advantages, multi-architecture container support is becoming a must-have, not a nice-to-have. But cross-arch builds come with their quirks: emulation headaches, manifest missteps, and pipeline slowdowns.

In this talk, Aditya and Anshika walk you through a CNCF-aligned, open source workflow to automate multi-arch builds using BuildKit, Kaniko, and Skopeo, producing reproducible, OCI-compliant images for both ARM and x86_64. They'll cover everything from base image strategy to manifest creation, registry publishing, and even image signing with cosign.

What you’ll learn:

1. How to configure and run multi-arch builds using open source tools like Kaniko and BuildKit.
2. How to use Skopeo to inspect, sync, and validate multi-arch manifests.
3. How to sign and verify images with cosign for trusted distribution.

By the end, you’ll have a solid, cloud-native blueprint to build once, run anywhere—from cloud to edge, x86 to ARM.
Speakers
Aditya Soni

DevOps/SRE, CNCF Ambassador, Forrester Research
Aditya Soni is a DevOps/SRE professional. He has worked with product- and service-based companies, including Red Hat and Searce, and is currently a DevOps Engineer II at Forrester Research. He holds AWS, GCP, Azure, Red Hat, and Kubernetes certifications. He is a CNCF Ambassador...
Anshika Tiwari

CSA - Cloud Engineer, AWS
Anshika is a passionate DevOps/SRE Engineer who is always eager to learn and implement cloud-native solutions. She has contributed to streamlining deployment processes and enhancing system reliability. She is eager to share her experiences and insights at conferences, contributing to...
Thursday August 28, 2025 16:25 - 16:50 CEST
TBA
 