Resiliency at Scale: Managing Google's TPUv4 Machine Learning Supercomputer [pdf] (usenix.org)
4 points by lnyan 2 years ago