Peer reviewed
  • Radiation-Tolerant Deep Lea...
    Maillard, Pierre; Chen, Yanran P.; Vidmar, Jason; Fraser, Nicholas; Gambardella, Giulio; Sawant, Minal; Voogel, Martin L.

    IEEE Transactions on Nuclear Science, 04/2023, Volume: 70, Issue: 4
    Journal Article

    This article presents a platform and design approach for enabling radiation-tolerant deep learning acceleration on static random access memory (SRAM)-based 20-nm Kintex UltraScale field-programmable gate arrays (FPGAs), for terrestrial and high-radiation environments. The platform is suitable for deep neural network (DNN) implementations, with an emphasis on image classification, and includes solutions that mitigate both radiation-induced single-event functional interrupts (SEFIs) and network datapath corruptions. To mitigate SEFIs, the platform combines Xilinx's deep learning processing unit (DPU) IP, a triple modular redundancy (TMR) MicroBlaze soft processor IP, and the soft error mitigation (SEM) IP. In addition, a technique known as fault aware training (FAT) was applied to mitigate single-event effects in the datapath. Test results are presented from a high-energy proton beam (>60 MeV) experiment using the ResNet-18 convolutional neural network (CNN) for image classification. The single-event upset (SEU) rate, system-level SEFI rate, and neural network classification/datapath performance are compared between the radiation-tolerant platform and a standard, nonmitigated approach. Results show that datapath classification errors dominate the system response (90%) versus SEFIs (10%). Compared with standard nonmitigated training techniques, the radiation-tolerant platform using FAT methods shows dramatic improvements in overall system response: the overall single-event cross section was reduced by half, and a 40% reduction in misclassification errors was observed. Datapath events with classification accuracy degradation larger than 5% were completely mitigated. The SEFI rate was reduced by 100× with the implemented solutions and can be further reduced by optimizing the physical separation between TMR modules.
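
    The abstract names triple modular redundancy (TMR) as one of the SEFI mitigations. As a generic illustration of the voting idea only, and not the article's MicroBlaze TMR implementation, the sketch below applies a bitwise 2-of-3 majority vote to three redundant copies of a word. The function names `majority_vote` and `flip_bit` and the 8-bit example value are hypothetical, chosen for this example.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority over three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Model a single-event upset by inverting one bit of a word."""
    return word ^ (1 << bit)


# Hypothetical 8-bit value held in three redundant copies.
golden = 0b1011_0010

# One copy is upset: the voter still returns the original value.
single_upset = flip_bit(golden, 5)
recovered = majority_vote(golden, single_upset, golden)

# Two copies are upset in the same bit: the voter is outvoted and
# returns the corrupted value, so the error is not masked.
double_upset = majority_vote(single_upset, single_upset, golden)
```

    A single corrupted copy is masked, but an event that corrupts the same bit in two copies is not. This is consistent with the abstract's remark that the SEFI rate can be reduced further by optimizing the physical separation between TMR modules.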