Online Fault Tolerance for FPGA Logic Blocks

Most adaptive computing systems use reconfigurable hardware in the form of field programmable gate arrays (FPGAs). For these systems to be fielded in harsh environments where high reliability and availability are a must, the applications running on the FPGAs must tolerate hardware faults that may oc...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:IEEE transactions on very large scale integration (VLSI) systems 2007-02, Vol.15 (2), p.216-226
Hauptverfasser: Emmert, J.M., Stroud, C.E., Abramovici, M.
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Most adaptive computing systems use reconfigurable hardware in the form of field programmable gate arrays (FPGAs). For these systems to be fielded in harsh environments where high reliability and availability are a must, the applications running on the FPGAs must tolerate hardware faults that may occur during the lifetime of the system. In this paper, we present new fault-tolerant techniques for FPGA logic blocks, developed as part of the roving self-test areas (STARs) approach to online testing, diagnosis, and reconfiguration . Our techniques can handle large numbers of faults (we show tolerance of over 100 logic faults via actual implementation on an FPGA consisting of a 20 times 20 array of logic blocks). A key novel feature is the reuse of defective logic blocks to increase the number of effective spares and extend the mission life. To increase fault tolerance, we not only use nonfaulty parts of defective or partially faulty logic blocks, but we also use faulty parts of defective logic blocks in nonfaulty modes. By using and reusing faulty resources, our multilevel approach extends the number of tolerable faults beyond the number of currently available spare logic resources. Unlike many column, row, or tile-based methods, our multilevel approach can tolerate not only faults that are evenly distributed over the logic area, but also clusters of faults in the same local area. Furthermore, system operation is not interrupted for fault diagnosis or for computing fault-bypassing configurations. Our fault tolerance techniques have been implemented using ORCA 2C series FPGAs which feature incremental dynamic runtime reconfiguration
ISSN:1063-8210
1557-9999
DOI:10.1109/TVLSI.2007.891102