Computer Engineering Seminar

Efficient Processor Fault Tolerance

Sudhanva Gurumurthi
Assistant Professor
University of Virginia

Silicon reliability is one of the most important challenges facing the microprocessor industry today. Processors must be designed to protect against soft errors as well as lifetime reliability phenomena such as Negative Bias Temperature Instability (NBTI). However, many fault protection techniques entail significant performance, power, and area overheads. In this talk, I will present three techniques that help reduce these overheads while still allowing reliability goals to be met. I will first show how to reduce the performance overheads of Redundant Multi-Threading (RMT), a popular soft error protection technique. RMT protects microarchitectural structures within a certain “Sphere of Replication” by redundantly executing instructions within the Sphere and checking the outputs of instructions that leave it. I will present a technique called “SlicK” that implements redundancy at the granularity of backward slices of these output instructions and allows slices to be selectively dropped from the redundant thread to improve performance while still providing high soft error coverage. I will then discuss how to determine the Architectural Vulnerability Factor (AVF) of hardware structures at runtime, which allows redundancy mechanisms to be tuned based on application behavior. I will present regression-based runtime AVF prediction techniques for an RMT-based processor and show how they can be used to craft AVF-aware RMT policies. Finally, I will present a circuit-level technique called “Recovery Boosting” that can significantly enhance NBTI recovery for PMOS devices in the memory cells of high-speed SRAM arrays while imposing little performance, area, or power overhead.

Sudhanva Gurumurthi is an Assistant Professor in the Computer Science Department at the University of Virginia. He received his BE degree from the College of Engineering Guindy, Chennai, India in 2000 and his PhD from Penn State in 2005, both in the field of Computer Science and Engineering. Sudhanva's research interests include processor fault tolerance and storage systems. He has served on the program committees of several top computer architecture and systems conferences including ISCA, ASPLOS, HPCA, FAST, and SIGMETRICS, and he is the Associate Editor-in-Chief of Computer Architecture Letters. Sudhanva has held research positions at the IBM Austin Research Lab and Intel Corporation and is currently a faculty consultant for Intel. Sudhanva is a recipient of the NSF CAREER Award and has received several research awards from NSF, Google, Intel, and HP. He is a member of the ACM and the IEEE. More details about his research are available on his homepage: http://www.cs.virginia.edu/~gurumurthi/

Sponsored by

Tom Wenisch