
Hi, we are the Stanford MAST team


Our Publications

2024

ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation

Swapnil Gandhi, Mark Zhao, Athinagoras Skiadopoulos, Christos Kozyrakis

ACM SIGOPS Symposium on Operating Systems Principles (SOSP), 2024

