X-GDR BarCamp on the Challenges of Implementing AI – Security, Reliability, Sustainability, and New Technologies

### Reliability of Spiking Neural Network VLSI Implementations

Haralampos-G. STRATIGOPOULOS Sorbonne Université, CNRS, LIP6 Paris, France



### Outline

- Introduction to brain-inspired computing and SNNs
- Testability and reliability framework for SNNs:
  - Fault modeling
  - Fault injection frameworks
  - Reliability analysis
  - Testing strategies
  - Fault tolerance strategies
- Conclusions

#### Brain-inspired neuromorphic computing

- Brain is the most brilliant computing machine
- Very "green"
- Computational power efficiency orders of magnitude higher than computers
- Brain has augmented capabilities (learning, producing ideas, error resilience, ...)





#### Spiking neural networks (SNNs)



#### Von Neumann vs. neuromorphic architecture



- SNN is a dynamic system  $\rightarrow$  maps well to speech and image recognition
- More energy efficient as it is asynchronous in operation
- Speed of computation improved due to event processing
- Challenges in learning remain
- Challenges in developing hardware

C. D. Schuman, Nature Computational Science, 2022









#### Landscape of neuromorphic hardware

Figure: designs arranged by platform class and by orientation, from neuroscience research to application-driven.

| Platform class      | Examples                                                                       |
|---------------------|--------------------------------------------------------------------------------|
| Large-scale systems | Neurogrid, BrainScaleS, DYNAPs, SpiNNaker, Loihi, TrueNorth                    |
| ASICs               | CAVIAR, ROLLS, ODIN (C. Frenkel, TBioCAS'18), μBrain                           |
| FPGA                | S2N2, Minitaur, ConvNet, SpinalFlow, Spiker, E<sup>3</sup>NE, SyncNN, FireFly  |

Other designs cited in the figure: W. Guo, TNNLS'22; N. Abderrahmane, Neural Netw.'20



|               | SpiNNaker | Loihi | TrueNorth |
|---------------|-----------|-------|-----------|
| Neurons/chip  | 36K       | 130K  | 1M        |
| Synapses/chip | 2.8M      | 130M  | 256M      |
| Cores/chip    | 144       | 128   | 4096      |
| Chips/board   | 56        | 768   | 4096      |
| Neurons       | 2.5B      | 100M  | 4B        |
| Synapses      | 200B      | 100B  | 1T        |



Neggaz, D&T'20

#### Hardware-level faults




#### Neural networks reliability



Source: Paolo Rech, University of Trento



**Critical fault** 

**Fault-free** 

#### Testability and Reliability Framework



#### I&F spiking neuron circuit



#### Transistor-level fault simulation

- 1000 MC runs using PDK
- Defect simulation using DefectSim by Siemens (S. Sunter, TCAS-I'16)
- Two types of faulty behaviors:
  - Catastrophic: neuron is non-functional (observed for 31 defects)
  - Parametric: output spike train shows timing variations (observed for the MC runs and for 15 defects)
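The two classes of faulty behavior can be illustrated at the behavioral level. The following is a minimal sketch, assuming a discrete-time leaky integrate-and-fire model; the function `simulate_if_neuron`, its parameters, and the way faults are modeled (a dead neuron that never fires, and a threshold shift that changes spike timing) are illustrative assumptions, not the transistor-level circuit or the DefectSim flow.

```python
def simulate_if_neuron(inputs, threshold=1.0, leak=0.9, fault=None):
    """Return the time steps at which the neuron spikes.

    fault: None (fault-free), "dead" (catastrophic, never fires), or a
    float giving a multiplicative threshold shift (parametric).
    """
    if isinstance(fault, float):
        threshold *= fault              # parametric deviation of the threshold
    v = 0.0                             # membrane potential
    spikes = []
    for t, i_in in enumerate(inputs):
        v = leak * v + i_in             # leaky integration of the input
        if fault != "dead" and v >= threshold:
            spikes.append(t)
            v = 0.0                     # reset after firing
    return spikes


stimulus = [0.4] * 12
nominal = simulate_if_neuron(stimulus)              # spikes at steps 2, 5, 8, 11
dead = simulate_if_neuron(stimulus, fault="dead")   # no spikes at all
shifted = simulate_if_neuron(stimulus, fault=1.5)   # fewer, later spikes
```

In this toy model the catastrophic fault removes the spike train entirely, while the parametric fault keeps the neuron functional but moves the spikes in time.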



#### Software Fault Injection Framework

- SNNs are modeled in Python using primitives from the Spike LAYer Error Reassignment (SLAYER) and PyTorch frameworks

S. B. Shrestha, NeurIPS'18

- Fault injection framework built on top of the SLAYER and PyTorch frameworks
- Fault injection and simulation are performed by customizing the flow of computations according to the faulty behavior
- Single and multiple faults
- Extensible fault model library
- Acceleration of large-scale fault simulation: early stopping, late start, GPU
- Metric: drop in classification accuracy on the test set
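The idea of customizing the computation flow according to the faulty behavior, and of scoring each fault by the accuracy drop, can be sketched as follows. This is a toy stand-in written in plain Python, not the SLAYER/PyTorch framework: the `classify` model, the dead-neuron fault, and the dataset are all hypothetical and serve only to show the shape of a fault injection campaign.

```python
def classify(sample, weights, dead=frozenset()):
    """Toy stand-in for an SNN with one hidden layer of spike counters.

    Hidden neuron j emits sample[j] spikes unless it is dead; class
    scores are weighted sums of the hidden spike counts.
    """
    hidden = [0 if j in dead else s for j, s in enumerate(sample)]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    return scores.index(max(scores))


def accuracy(dataset, weights, dead=frozenset()):
    hits = sum(classify(x, weights, dead) == y for x, y in dataset)
    return hits / len(dataset)


def fault_campaign(dataset, weights, n_hidden):
    """Inject one dead-neuron fault at a time; return the accuracy drop per fault."""
    baseline = accuracy(dataset, weights)
    return {j: baseline - accuracy(dataset, weights, dead={j})
            for j in range(n_hidden)}


weights = [[1, 0, 1], [0, 1, 0]]    # 2 classes, 3 hidden neurons
dataset = [([5, 1, 0], 0), ([0, 4, 1], 1), ([2, 1, 0], 0), ([1, 3, 1], 1)]
drops = fault_campaign(dataset, weights, n_hidden=3)
```

Here faults in hidden neurons 0 and 1 each cost half of the accuracy, while a fault in neuron 2 causes no drop, which is the kind of critical/benign distinction the metric is meant to expose.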













#### Faults occurring before training



- Networks can compensate for a high fault rate if the faults occur before training

#### SNN hardware accelerator design framework



- End-to-end, automated model-to-VHDL synthesis of an arbitrary SNN
- FPGA implementation
- Fully synthesizable for an ASIC implementation
- Will be released as open-source

#### SNN hardware accelerator architecture



#### SNN hardware experimentation platform



#### Reliability analysis of SNN hardware accelerator


- Each node is configurable through a set of 8-bit parameters
- Parameters are stored in memory blocks inside the node:

| Memory              | Purpose                                       |
|---------------------|-----------------------------------------------|
| Splitter Parameters | Input split information to the first layer    |
| Router Parameters   | Routing information in the nodes' mesh        |
| Neuron Parameters   | Key features of the neurons within the node   |
| Kernel Parameters   | Structural characteristics of the kernels     |
| Synapse Weights     | Values of the synaptic weights                |

Figure: bit-flips injected into the 8-bit parameters stored in a node (router with North, East, South and West ports and routing table; convolutional unit; configuration block).

- Fault model: bit-flips in memories
  - Single bit-flips across different bit positions
  - Multiple bit-flips with a BER probability
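The bit-flip fault model can be sketched in a few lines. This is a minimal illustration, assuming the 8-bit parameters are handled as unsigned integers and that, in the multiple-fault case, every bit flips independently with probability equal to the BER; the helper names `flip_bit` and `inject_ber` are hypothetical and are not taken from the accelerator framework.

```python
import random


def flip_bit(value, position):
    """Flip one bit of an 8-bit parameter (position 0 is the LSB)."""
    return (value ^ (1 << position)) & 0xFF


def inject_ber(memory, ber, rng):
    """Flip each bit of each 8-bit word independently with probability `ber`."""
    faulty = []
    for word in memory:
        for position in range(8):
            if rng.random() < ber:
                word = flip_bit(word, position)
        faulty.append(word)
    return faulty


weights = [0b01110101, 0b00000000, 0b11111111]

# Single bit-flips: one faulty copy of the first word per bit position.
single = [flip_bit(weights[0], p) for p in range(8)]

# Multiple bit-flips: seeded generator so that a campaign is repeatable.
multiple = inject_ber(weights, ber=0.05, rng=random.Random(0))
```

Sweeping the bit position in the single-fault case shows how much more a flip in a high-order bit perturbs a parameter than a flip in a low-order bit.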

#### Reliability analysis results

T. Spyrou, DATE'22



#### Reliability analysis results (cont'd)

T. Spyrou, DATE'22





- Use existing samples in training/testing sets or craft new samples that can detect faults
- Fault is detected if responses of nominal/faulty chips differ

#### ATPG based on ranking fault detection capability of samples

S. Elsayed, TCAD'23

- Assess the fault coverage of an input sample with no fault simulation
- Fault coverage ∝ prediction confidence
- **Proposed criterion:** difference in output spikes between top-1 and top-2 classes
- Rank samples based on confidence in ascending order
- Add samples to the test set according to the ranking until fault coverage is maximized
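The ranking step can be sketched as follows. This is a minimal illustration under stated assumptions: `confidence` implements the proposed criterion (top-1 minus top-2 output spike count), while `build_test_set`, the sample identifiers, and the `detects` map are hypothetical. The `detects` map stands in for whatever is used to measure the cumulative coverage of the selected samples; the ranking itself uses only fault-free outputs.

```python
def confidence(spike_counts):
    """Difference in output spikes between the top-1 and top-2 classes."""
    top1, top2 = sorted(spike_counts, reverse=True)[:2]
    return top1 - top2


def build_test_set(outputs, detects, n_faults):
    """Add samples in ascending order of confidence until coverage stops growing.

    outputs: {sample_id: output spike counts of the fault-free network}
    detects: {sample_id: set of fault ids detected by the sample}
    """
    ranking = sorted(outputs, key=lambda s: confidence(outputs[s]))
    chosen, covered = [], set()
    for sample in ranking:
        if len(covered) == n_faults:
            break                       # all faults covered, stop adding
        chosen.append(sample)
        covered |= detects[sample]
    return chosen, len(covered) / n_faults


outputs = {"a": [30, 2, 1], "b": [12, 11, 3], "c": [9, 15, 13], "d": [20, 5, 4]}
detects = {"a": {0}, "b": {0, 1, 2}, "c": {2, 3}, "d": {1}}
test_set, coverage = build_test_set(outputs, detects, n_faults=4)
```

In this toy data the two least confident samples, "b" and "c", already cover all four faults, so the confident samples "d" and "a" are never added.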

#### Results on SNN hardware accelerator

S. Elsayed, TCAD'23

**Single Bit Flips** 

**Multiple bit flips** 



- The global cumulative fault coverage curves quickly reach 100%
- 6 samples suffice to detect all critical faults and a high percentage of benign faults

#### T. Spyrou, ETS'23



- Symptom detection
- Test parameter is the cumulative spike count at feature map output
- Use a system of two one-class classifiers for mapping test parameters to a decision
- One-shot decision (fault or no fault) with high confidence
- If the confidence is low, execute a replay operation to resolve the ambiguity

#### Training with faults: dropout

N. Srivastava, JMLR'14; T. Spyrou, DATE'21



- Training with dropout: temporarily removing neurons during training along with their connections
- Nullifies the effect of dead neuron faults in all hidden layers:
  - Distribution of computational load among the neurons of the network
  - More uniform and sparse spiking activity across the network
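Temporarily removing neurons during training can be sketched as a random mask over a layer's outputs. The code below is a minimal illustration, assuming the common "inverted dropout" variant that rescales the surviving activations during training; the function name and parameters are illustrative, and how dropout is applied to spike trains in the actual SNN training flow is not specified here.

```python
import random


def dropout(activations, p, rng, training=True):
    """Zero each neuron output with probability p during training.

    Survivors are rescaled by 1/(1-p) so the expected layer output is
    unchanged; at inference time the layer is left untouched.
    """
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)                      # seeded for repeatability
hidden = [1.0] * 1000
masked = dropout(hidden, p=0.5, rng=rng)    # roughly half the neurons silenced
unchanged = dropout(hidden, p=0.5, rng=rng, training=False)
```

A neuron silenced by the mask behaves like a dead neuron for that training step, which is why a network trained this way learns not to depend on any single neuron.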

#### On-line testing using in-situ monitors

- Count the number of spikes a neuron produces between two successive inputs
- A saturated neuron will produce spikes with higher frequency than usual: counter overflows before an incoming spike resets it again
- Exploits temporal dependency between the input and output of a spiking neuron
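The monitor's behavior can be sketched as a small counter. This is a behavioral illustration only, assuming an event list in time order and a hypothetical 3-bit counter; the actual monitor is a hardware block and its counter width is not given here.

```python
def monitor(events, counter_bits=3):
    """Return True if the output-spike counter overflows.

    events: "in" (input spike) and "out" (output spike) events in time
    order. The counter counts output spikes and is reset by every input
    spike; it overflows when it reaches 2**counter_bits.
    """
    limit = 2 ** counter_bits
    count = 0
    for event in events:
        if event == "in":
            count = 0               # an incoming spike resets the counter
        elif event == "out":
            count += 1
            if count >= limit:
                return True         # overflow: neuron flagged as saturated
    return False


healthy = ["in", "in", "out", "in", "in", "out", "in"]
saturated = ["in"] + ["out"] * 10 + ["in"]

healthy_flag = monitor(healthy)        # False: counter is reset in time
saturated_flag = monitor(saturated)    # True: 10 output spikes with no input
```

A healthy neuron needs input activity to fire, so its counter is reset before it can overflow; a saturated neuron fires regardless of the input and trips the flag.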



T. Spyrou, DATE'21

#### Error recovery using fault masking

- Saturated neurons are more critical than dead neurons, and dead neurons can be nullified using dropout
- "Fault Hopping" concept: a saturated neuron fault is translated to a dead neuron fault
- A single transistor is added to the neuron to switch it off when a saturation "Flag" signal is raised
- Dead neurons do not consume energy



#### Redundancy-based fault tolerance

- Triple Modular Redundancy:
  - 3 identical neurons vote for the decision of each class
  - majority decides
- The output layer is usually a very small part of the whole network (0.57% for the N-MNIST SNN and 0.04% for IBM's Gesture SNN)
- Area overhead is negligible
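Majority voting over three copies of a neuron can be sketched as follows. This is a minimal illustration, assuming the vote is taken on binary spike outputs at each time step; the function names are hypothetical and the voter's actual hardware implementation is not described here.

```python
def majority(a, b, c):
    """Majority of three binary spike outputs."""
    return (a and b) or (a and c) or (b and c)


def tmr_vote(replicas):
    """Vote step by step over the spike trains of three identical neurons."""
    r0, r1, r2 = replicas
    return [int(majority(a, b, c)) for a, b, c in zip(r0, r1, r2)]


fault_free = [0, 1, 0, 0, 1]
dead_copy = [0, 0, 0, 0, 0]         # one replica never fires
saturated_copy = [1, 1, 1, 1, 1]    # one replica fires at every step

masked_dead = tmr_vote([fault_free, dead_copy, fault_free])
masked_saturated = tmr_vote([fault_free, fault_free, saturated_copy])
```

In both cases the voted spike train equals the fault-free one, since a single faulty replica is always outvoted by the two healthy copies.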







T. Spyrou, DATE'21

#### Multiple fault scenario



#### Astrocyte neural networks



Neuron #2 under 80% fault rate with temporary faults. (80% - severely damaged)

J. Harkin & M. Trefzer, Tutorial DATE 23

#### Conclusions

- SNNs for neuromorphic edge computing
- SNN hardware accelerators are emerging
- Frameworks for accelerator design and fault injection
- Testability and fault tolerance concepts still at an early stage
- Acknowledgments:
  - Collaboration with the University of Seville
  - PhD Students: Sarah Elsayed, Theofilos Spyrou, Spyridon Raptis, Paul Kling
  - Sorbonne Center for Artificial Intelligence (SCAI)
  - ANR RE-TRUSTING
  - Horizon Europe dAIEDGE

#### Further reading

- H.-G. Stratigopoulos, T. Spyrou, and S. Raptis, "Testing and reliability of spiking neural networks: A review of the state-of-the-art," Proc. *IEEE Int. Symp. Defect Fault Toler. VLSI Nanotechnol. Syst. (DFT)*, Juan-les-Pins, France, Oct. 2023.
- F. Su, C. Liu, and H.-G. Stratigopoulos, "Testability and dependability of AI hardware: Survey, trends, challenges, and perspectives," *IEEE Des. Test*, vol. 40, no. 2, pp. 8–58, Apr. 2023.