#### "Dark Silicon and Dependability"

by Jörg Henkel

CES – Chair for Embedded Systems, KIT Karlsruhe

... with Muhammad Shafique and Hussam Amrouch

J. Henkel, Keynote @ CADS 2015, Tehran, Oct 8th. 2015

#### ces.itec.kit.edu

#### **Overview**

- What is Dark Silicon
- Interplay of Power Density, Temperature and Dependability
- Mitigating Dark Silicon

## What is Dark Silicon?




#### Let's go back to 1865 ...

In economics, the Jevons paradox (/ˈdʒɛvənz/; sometimes Jevons effect) occurs when technological progress increases the efficiency with which a resource is used (reducing the amount necessary for any one use), but the rate of consumption of that resource rises because of increasing demand.



[Src: Wikipedia]

#### So, how is this related to Dark Silicon?



## **Dennard Scaling ... (or failure thereof)**



(Src: "Dennard Scaling")

## Dark Silicon: Depends on Point of View ...

Dark silicon as a function of power constraint ...



#### **Temperature, Power and TDP**

#### Example:

- 16 cores with area 5.3 mm<sup>2</sup>
- Threshold temperature: 80°C
- Power budget: 90 W

Thermal maps of the 4 × 4 core grid (color scale: 50 °C to 100 °C). Cells without a power value are dark (inactive) cores.

#### 8 active cores

Highest temperature: 80.0 °C

|                    |                    |                    |         |
|--------------------|--------------------|--------------------|---------|
| 11.27 W<br>78.9 °C | 11.27 W<br>79.5 °C | 11.27 W<br>77.8 °C | 59.5 °C |
| 11.27 W<br>79.5 °C | 11.27 W<br>80.0 °C | 11.27 W<br>77.6 °C | 59.4 °C |
| 11.27 W<br>77.8 °C | 11.27 W<br>77.6 °C | 60.9 °C            | 58.1 °C |
| 59.5 °C            | 59.4 °C            | 58.1 °C            | 57.0 °C |

#### 12 active cores

Highest temperature: 73.2 °C (highest recovered cell value; the fourth row of this map is not recoverable)

|                   |                   |                   |                   |
|-------------------|-------------------|-------------------|-------------------|
| 7.52 W<br>71.6 °C | 7.52 W<br>72.2 °C | 7.52 W<br>71.4 °C | 59.5 °C           |
| 7.52 W<br>72.5 °C | 7.52 W<br>73.2 °C | 7.52 W<br>72.6 °C | 7.52 W<br>70.3 °C |
| 7.52 W<br>72.3 °C | 7.52 W<br>72.9 °C | 7.52 W<br>71.4 °C | 59.6 °C           |

#### 16 active cores

Highest temperature: 69.5 °C

|                   |                   |                   |                   |
|-------------------|-------------------|-------------------|-------------------|
| 5.64 W<br>67.8 °C | 5.64 W<br>68.5 °C | 5.64 W<br>68.5 °C | 5.64 W<br>67.8 °C |
| 5.64 W<br>68.5 °C | 5.64 W<br>69.5 °C | 5.64 W<br>69.5 °C | 5.64 W<br>68.5 °C |
| 5.64 W<br>68.5 °C | 5.64 W<br>69.5 °C | 5.64 W<br>69.5 °C | 5.64 W<br>68.5 °C |
| 5.64 W<br>67.8 °C | 5.64 W<br>68.5 °C | 5.64 W<br>68.5 °C | 5.64 W<br>67.8 °C |





S. Pagani, H. Khdr, W. Munawar, J.-J. Chen, M. Shafique, M. Li, J. Henkel, "TSP: Thermal Safe Power - Efficient power budgeting for Many-Core Systems in Dark Silicon", (CODES+ISSS), 2014.
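The example spreads roughly the same total power over different numbers of active cores. The following minimal Python sketch tabulates that trade-off using the per-core power values and cell temperatures readable in the thermal maps of this example. The variable names are illustrative, and the 12-core peak is the highest value that survives in the extracted map, so treat it as an assumption.

```python
# Values read from the thermal maps of the example:
# (active cores, power per active core in W, highest cell temperature in deg C).
TOTAL_CORES = 16
THRESHOLD_C = 80.0
scenarios = [(8, 11.27, 80.0), (12, 7.52, 73.2), (16, 5.64, 69.5)]


def summarize(active, watts_per_core, peak_c):
    """Derive total power, dark fraction, and thermal headroom for one map."""
    return {
        "active": active,
        "total_power_w": round(active * watts_per_core, 2),
        "dark_fraction": (TOTAL_CORES - active) / TOTAL_CORES,
        "headroom_c": round(THRESHOLD_C - peak_c, 1),
    }


summary = [summarize(*s) for s in scenarios]
for row in summary:
    print(row)
```

All three configurations consume about 90 W in total, yet only the 8-core configuration reaches the 80 °C threshold. A single chip-wide power budget therefore says little about thermal safety unless the number and placement of active cores are also known.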



# **Temperature Effects**



[Figure labels: High Power Density, Temperature Effects, Dark Silicon, mitigate]

### **Thermal Gradients**

- Due to: a) Low-frequency power change, b) Workload change, c) Power management
- Affects MTTF



#### **Spatial Thermal Gradient Analysis**




### **Example: Thermal Cycling in FPGAs**



Activity migration between two cores at the rate of 154 MCycle (plot over time [sec])

## Summary:

Dark Silicon is a thermal problem due to high power density:

- Average temperature
- Peak temperature
- Spatial thermal gradients
- Temporal thermal gradients

=> These accelerate aging and jeopardize dependability!



# **Aging Effects**




### Temperature and Aging Decrease Dependability



Technology scaling has made aging-induced reliability degradation a major concern






#### Temperature and Aging decrease Dependability

#### Thermal:

- Accelerate aging mechanisms
- Degrade performance
- Increase leakage power
- Necessitate expensive cooling



## Aging Effects: Electro-Migration, Variability

Process variations and electromigration can result in hillocks and holes

$$MTTF = Aj^{-n}e^{\left(\frac{\varphi}{kT}\right)}$$

- Holes lead to open-circuit failures, hillocks to short-circuit failures
- Failures may be temperature dependent due to material expansion
  - Holes may function normally at high temperatures but fail at low temperatures
  - Hillocks may function normally at low temperatures but short circuit at high temperatures
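The MTTF formula above is Black's equation. A short Python sketch can show how strongly it reacts to temperature and current density. It reads the symbols in the usual way (A a constant, j the current density, n a model exponent, φ the activation energy, k the Boltzmann constant, T the absolute temperature). All numeric parameter values below are hypothetical and chosen only for illustration.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def black_mttf(a, j, n, phi_ev, temp_k):
    """Black's equation: MTTF = A * j**(-n) * exp(phi / (k * T))."""
    return a * j ** (-n) * math.exp(phi_ev / (BOLTZMANN_EV_PER_K * temp_k))


# Hypothetical parameters (not taken from the slides).
A, J, N, PHI_EV = 1.0, 1.0e6, 2.0, 0.7
T_55C, T_75C = 55.0 + 273.15, 75.0 + 273.15

# Lifetime ratio between a 55 deg C and a 75 deg C interconnect, same current.
thermal_ratio = black_mttf(A, J, N, PHI_EV, T_55C) / black_mttf(A, J, N, PHI_EV, T_75C)

# Lifetime ratio when the current density doubles at constant temperature.
current_ratio = black_mttf(A, J, N, PHI_EV, T_55C) / black_mttf(A, 2 * J, N, PHI_EV, T_55C)

print(f"MTTF(55 C) / MTTF(75 C) = {thermal_ratio:.2f}")
print(f"MTTF(j) / MTTF(2j)      = {current_ratio:.2f}")
```

Under these assumed parameters a 20 K increase shortens the expected lifetime by roughly a factor of four, the same order as doubling the current density with n = 2. Only the ratios matter here, because the constant A cancels out.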



Hole/crack



Hillock



## Aging Effects: NBTI

- Negative Bias Temperature Instability
  - Breakdown of Si-H bonds at the silicon-oxide interface due to voltage/thermal stress
    - $\rightarrow$  causes interface traps
- Affects mostly P-MOSFETs because of negative gate bias
  - Effect in N-MOSFETs is negligible
- Despite research focus: NBTI is not yet fully understood!



## **Aging Effects: NBTI**

NBTI manifests itself as a shift in V<sub>th</sub>

- Causes increase in transistor delay
- Delay faults are responsible for NBTI-induced bit-flips and the resulting circuit failures
- Recovery effect in periods of no stress
  - When voltage and temperature are low, V<sub>th</sub> can shift back towards its original value
  - Full recovery from a stress period is only possible in infinite time
    → In practice, the overall V<sub>th</sub> shift increases monotonically over longer periods, e.g. months/years
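To illustrate how a V<sub>th</sub> shift turns into additional delay, the following sketch uses the alpha-power law approximation, in which gate delay is proportional to V<sub>dd</sub> / (V<sub>dd</sub> − |V<sub>th</sub>|)<sup>α</sup>. The slides do not name this model, so it is an assumption here, as are all numeric values (supply voltage, fresh threshold voltage, α, and the size of the shift).

```python
def relative_delay(vdd, vth, alpha):
    """Gate delay up to a constant factor, per the alpha-power law approximation."""
    return vdd / (vdd - vth) ** alpha


# Hypothetical operating point and NBTI-induced shift (magnitudes, in volts).
VDD, VTH_FRESH, ALPHA = 1.0, 0.30, 1.3
DELTA_VTH = 0.05

delay_increase = (
    relative_delay(VDD, VTH_FRESH + DELTA_VTH, ALPHA)
    / relative_delay(VDD, VTH_FRESH, ALPHA)
)
print(f"Aged delay / fresh delay = {delay_increase:.3f}")
```

With these assumed numbers a 50 mV shift slows the gate by about 10 %. That is enough to violate timing on a path designed with a smaller guardband.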



#### **Aging Effects: NBTI and Temperature**

- Temperature plays an important role in NBTI modeling
- Higher temperatures increase the shift in threshold voltage
- ΔV<sub>th</sub> is approximately 50% higher at 75°C than at 55°C
- NBTI effect at 75°C is approximately equal to alternating between 85°C and 25°C
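The 50% figure can be turned into a rough temperature sensitivity. The sketch below assumes that the threshold voltage shift follows an Arrhenius-type dependence, ΔV<sub>th</sub> ∝ exp(−E<sub>a</sub> / kT). That functional form is an assumption and is not stated on the slide. It then solves for the effective activation energy E<sub>a</sub> implied by the two temperatures.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def effective_activation_energy(ratio, t_low_c, t_high_c):
    """Solve ratio = exp(Ea/k * (1/T_low - 1/T_high)) for Ea, in eV."""
    t_low_k = t_low_c + 273.15
    t_high_k = t_high_c + 273.15
    return BOLTZMANN_EV_PER_K * math.log(ratio) / (1.0 / t_low_k - 1.0 / t_high_k)


# Slide figure: the shift at 75 deg C is about 1.5x the shift at 55 deg C.
ea_ev = effective_activation_energy(1.5, 55.0, 75.0)
print(f"Effective activation energy: {ea_ev:.3f} eV")
```

Under this assumption the slide's numbers correspond to an effective activation energy of about 0.2 eV.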




So, how to *accurately* model temperature (aging etc.) effects ... ?

#### **Temperature through abstraction levels**



## Temperature/Aging Effects Ex: 6T SRAM cell






### **Example: Register File**

It has a relatively small area footprint and is frequently accessed

→ High power density → high temperature → stronger temperature/aging effects

For most applications, the hotspot is located at the register file





Reliability relevant parameters

- Static Noise Margin (SNM): susceptibility to noise
- Read Access Time (RAT): providing correct data in time
- Critical Charge (Qcrit): susceptibility to radiation



- Key reliability aspects
  - Static Noise Margin (SNM): susceptibility to noise during read operations











#### **Aging Impact in 6-T SRAM Cells**



- Key reliability aspects
  - Read Access Time (RAT): providing correct data in time
  - Critical Charge (Qcrit): susceptibility to radiation
  - Static Noise Margin (SNM): susceptibility to noise






## **NBTI Impact on aging: 6-T SRAM Cell**

- Static Noise Margin (SNM) is one of the critical reliability metrics in an SRAM cell.
- It represents the immunity of SRAM against noise during the read or write operation
- NBTI strongly affects the SNM, making the SRAM more susceptible to failure



SRAM transfer characteristics during read operation in the case of  $\alpha$  = 0.3 over 11 years



SRAM transfer characteristics during read operation in the case of  $\alpha$  = 0.5 over 11 years



SRAM transfer characteristics during read operation in the case of  $\alpha$  = 0.5 during the first year

[Src: IBM, KIT]

#### Summary: Temperature/Aging in 6-T SRAM Cell

- On-chip temperatures directly stimulate underlying mechanisms behind aging phenomena and/or directly influence dependability (instantaneously or long term)
- In the following: how multiple simultaneous temperature/aging mechanisms may interact.

## Interaction of Temperature/Aging Effects






#### **Interaction of Aging Effects**



#### **Register File Failure Maps**

Hussam Amrouch, Victor van Santen, Thomas Ebi, Volker Wenzel, Jörg Henkel, "Towards Interdependencies of Aging Mechanisms", IEEE/ACM Int'l Conference on CAD (**ICCAD**), 2014.




# **Analyzing Temperature**






### IR-Camera for thermal Evaluation

- DIAS Pyroview IR Camera
  - Spatial resolution with macro lens: around 50 µm
    - Limited by the camera's IR spectral range of 8 µm to 14 µm
  - Temperature range configurable: -20 °C to 120 °C or 0 °C to 500 °C
  - Sampling rate of 50 Hz
    - Camera transmits 50 frames per second over Ethernet in real time
  - 384x288 pixels
  - Comprehensive SDK for accessing camera functionality
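The camera figures above allow a quick back-of-the-envelope check of what the setup must handle. The sketch below derives the pixel rate, a raw data rate, and the imaged area. The 16 bits per pixel are an assumption, since the slide does not state the sample width. Reading the "around 50 µm" resolution as the footprint of one pixel is also an assumption.

```python
# Figures from the camera description.
WIDTH_PX, HEIGHT_PX = 384, 288
FRAME_RATE_HZ = 50
PIXEL_FOOTPRINT_UM = 50   # assumption: spatial resolution taken as per-pixel footprint

# Assumption, not stated on the slide.
BITS_PER_PIXEL = 16

pixels_per_frame = WIDTH_PX * HEIGHT_PX
pixels_per_second = pixels_per_frame * FRAME_RATE_HZ
raw_mbit_per_s = pixels_per_second * BITS_PER_PIXEL / 1e6
field_of_view_mm = (
    WIDTH_PX * PIXEL_FOOTPRINT_UM / 1000,
    HEIGHT_PX * PIXEL_FOOTPRINT_UM / 1000,
)

print(f"{pixels_per_frame} pixels/frame, {pixels_per_second} pixels/s")
print(f"raw stream: {raw_mbit_per_s:.1f} Mbit/s")
print(f"imaged area: {field_of_view_mm[0]:.1f} mm x {field_of_view_mm[1]:.1f} mm")
```

Under these assumptions the raw stream stays below 100 Mbit/s, and the imaged area of roughly 19 mm × 14 mm covers a typical processor die.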



### **Analyzing Temperature of CPUs**

#### Challenges:

Infrared thermography of ASIC chips requires:

- Removing the chip cooling unit to expose the measured die.
- Building an alternative, IR-transparent cooling unit that:
  - allows the infrared radiation emitted from the chip to reach the thermal camera.
  - concurrently prevents the chip from burning up.

#### **Analyzing Temperature: Basic Setup**



- The setup continuously chills the measured chip from its bottom side, i.e. through the PCB to which the chip is attached.
- Thermoelectric (Peltier) technology is employed because it can provide a stable and controlled source of cooling.







Water heat sink cooling the hot side of the Peltier device

#### **Bottom-side cooling**



No layer on top of the measured chip

→ the camera can directly and clearly capture the infrared emissions

Example of a captured infrared thermal image of the Intel Atom dual-core (45 nm) running at 1.8 GHz



#### Thermal (real-time) Video of an 8-Core Processor




# Mitigating Dark Silicon: Overview






## **Mitigating Dark Silicon**




## **Handling Dark Silicon**

#### Tradeoff between NTC, Sprinting/Boosting, STC



#### Power Management Paradigms



### Mitigating the Power Density and Dark Silicon: Dark Silicon Patterning

Minimizing peak temperature => more effective use of the power budget
=> allows for further parallelization and multi-threading



### Mitigating the Power Density and Dark Silicon: Dark Silicon Patterning

Spatial and temporal shutdown -> minimizing peak temperature



## **Mitigating Dark Silicon**

#### **Thermal Safe Power (TSP)**

(Abstract from temperature using efficient power budgets)

#### STC / NTC vs. Boosting

(Constant frequency vs. control-loop based boosting)



#### Dark Silicon Management

(Patterning and Resource Management)

[Figure: thermal map, color scale 64 °C to 82 °C]

## **Instruction Vulnerability**

- Spatial vulnerability: probability of an error depending upon the area of specific processor resources used by the instructions
- Temporal vulnerability: probability of a fault depending upon the vulnerable periods of an instruction in a certain pipeline stage







#### Motivation: Varying Instruction Vulnerability & Error Masking Properties



# Turning Dark Silicon Problem into a Solution!

- Leverage the available dark silicon with reliability-specialized cores that offer distinct degrees of reliability, i.e., protection against soft errors
  - Multiple "iso-ISA reliability-heterogeneous cores"
  - Higher protection against soft errors => more power and area



#### Example: Reliability-Heterogeneous Cores



# Turning Dark Silicon Problem into a Solution!

- Within the chip's TDP constraint, only a subset of cores can be powered on at run time; the remaining cores stay dark
- A run-time system to manage reliability under thermal constraints.



## **ASER: Adaptive Soft Error Resilience**

- Design-Time: reliability-heterogeneous core customization
  - Formulated as the Bounded Knapsack Problem
- Run-Time: adaptive soft error resiliency manager allocates a set of cores to concurrent applications under the given TDP constraint
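The slide states only that core customization is formulated as a Bounded Knapsack Problem. The objective, weights, and bounds used by ASER are not given here. The following generic dynamic-programming sketch shows the problem class with hypothetical core types, where each type has an integer area cost, a reliability benefit, and a maximum number of instances. It is not the ASER formulation itself.

```python
from typing import List, Tuple


def bounded_knapsack(items: List[Tuple[int, int, int]], capacity: int) -> Tuple[int, List[int]]:
    """Solve a bounded knapsack.

    items: (weight, value, max_count) per item type, weights are positive integers.
    Returns the best total value and the chosen count per item type.
    """
    # best[c] holds (value, counts) using only the item types processed so far.
    best = [(0, [0] * len(items)) for _ in range(capacity + 1)]
    for idx, (weight, value, max_count) in enumerate(items):
        new_best = list(best)  # taking zero copies of this type is always allowed
        for c in range(capacity + 1):
            for k in range(1, max_count + 1):
                if k * weight > c:
                    break
                prev_value, prev_counts = best[c - k * weight]
                candidate = prev_value + k * value
                if candidate > new_best[c][0]:
                    counts = list(prev_counts)
                    counts[idx] = k
                    new_best[c] = (candidate, counts)
        best = new_best
    return best[capacity]


# Hypothetical core types: (area units, reliability benefit, max instances).
core_types = [(5, 3, 4), (14, 10, 2), (6, 4, 3)]
best_value, chosen = bounded_knapsack(core_types, capacity=40)
print(best_value, chosen)
```

For this toy instance the area budget is best spent on two instances each of the second and third core type. In a real design flow the weights and values would come from synthesis results and vulnerability analysis.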



### Reliability-Heterogeneous Cores: Synthesis Results

| Core | 250 MHz: area [gate eq. ×10<sup>5</sup>] | 250 MHz: leakage [mW] | 250 MHz: dynamic [mW] | 250 MHz: total [mW] | 1 GHz: area [gate eq. ×10<sup>5</sup>] | 1 GHz: leakage [mW] | 1 GHz: dynamic [mW] | 1 GHz: total [mW] |
|------|-------|-------|--------|--------|-------|-------|--------|--------|
| C1   | 4.83  | 4.36  | 78.19  | 82.55  | 5.00  | 4.76  | 269.58 | 274.34 |
| C2   | 4.99  | 4.50  | 79.75  | 84.25  | 5.20  | 4.98  | 274.91 | 279.89 |
| C3   | 13.62 | 12.26 | 223.39 | 235.65 | 14.26 | 13.83 | 767.73 | 781.56 |
| C4   | 5.41  | 4.88  | 86.46  | 91.35  | 5.58  | 5.23  | 298.21 | 303.44 |
| C5   | 13.77 | 12.40 | 224.96 | 237.36 | 14.46 | 14.08 | 773.52 | 787.60 |
| C6   | 5.56  | 5.03  | 88.02  | 93.05  | 5.77  | 5.52  | 303.45 | 308.98 |
| C7   | 14.19 | 12.79 | 231.73 | 244.51 | 14.94 | 14.35 | 796.26 | 810.61 |
| C8   | 14.35 | 12.93 | 233.30 | 246.23 | 15.02 | 14.56 | 801.79 | 816.34 |

- TSMC 45nm technology library
- Different process corners & frequencies

Reliability Savings are 20%-60% compared to state-of-the-art

| Core | Protection                                   |
|------|----------------------------------------------|
| C1   | Baseline core                                |
| C2   | Pipeline TMR                                 |
| C3   | Cache TMR                                    |
| C4   | Register File TMR                            |
| C5   | Pipeline TMR + Cache TMR                     |
| C6   | Pipeline TMR + Register File TMR             |
| C7   | Cache TMR + Register File TMR                |
| C8   | Pipeline TMR + Cache TMR + Register File TMR |
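To make the cost of each protection level easier to compare, the following sketch normalizes the 250 MHz synthesis results to the baseline core C1. The numbers are copied from the synthesis table above. The dictionary layout and names are illustrative.

```python
# 250 MHz results from the synthesis table:
# core -> (area in 1e5 gate equivalents, total power in mW)
cores_250mhz = {
    "C1": (4.83, 82.55),
    "C2": (4.99, 84.25),
    "C3": (13.62, 235.65),
    "C4": (5.41, 91.35),
    "C5": (13.77, 237.36),
    "C6": (5.56, 93.05),
    "C7": (14.19, 244.51),
    "C8": (14.35, 246.23),
}

base_area, base_power = cores_250mhz["C1"]
overhead = {
    name: (area / base_area, power / base_power)
    for name, (area, power) in cores_250mhz.items()
}

for name, (area_x, power_x) in overhead.items():
    print(f"{name}: {area_x:.2f}x area, {power_x:.2f}x total power")
```

The normalized values show that pipeline and register file TMR add only a few percent, whereas every variant that includes cache TMR (C3, C5, C7, C8) costs close to three times the baseline area and power. This is why a mix of differently protected cores fits a fixed area and TDP better than protecting every core fully.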

## **Mitigating Dark Silicon**

#### **Thermal Safe Power (TSP)**

(Abstract from temperature using efficient power budgets)

#### STC / NTC vs. Boosting

(Constant frequency vs. control-loop based boosting)

#### Dark Silicon Management (Patterning and Resource Management)



#### **Reliability: trade-off aging <-> SER**



# Hayat: Harnessing Dark Silicon and Variability for Aging Optimization



#### Conclusions

 "Dark Silicon" is a problem triggered through high power density => hardware is operated at thermal limits

- Temperature decreases reliability
- "Dark Silicon" can be minimized/exploited through:
  - Efficient dark silicon management under peak power and thermal constraints
  - New thermal safe power budgets
  - Scalable power and thermal management
  - Increasing different forms of heterogeneities: functional, power, reliability, etc.
  - Increasing reliability
  - **...**

Power/energy efficiency and reliability should jointly be optimized at multiple HW and SW layers of the system stack

#### If all this is considered, good chance there is no Dark Silicon problem at all!

#### Some of our recent publications on Dark Silicon

- Dennis Gnad, Muhammad Shafique, Florian Kriebel, Semeen Rehman, Duo Sun, Jörg Henkel, "Hayat: harnessing dark silicon and variability for aging deceleration and balancing", DAC 2015.
- Heba Khdr, Santiago Pagani, Muhammad Shafique, Jörg Henkel, "Thermal constrained resource management for mixed ILP-TLP workloads in dark silicon chips", DAC 2015:179
- S. Pagani, H. Khdr, W. Munawar, J.-J. Chen, M. Shafique, M. Li, J. Henkel, "TSP: Thermal Safe Power - Efficient power budgeting for Many-Core Systems in Dark Silicon", IEEE International Conference on Hardware-Software Codesign and System Synthesis (CODES+ISSS), 2014, Best Paper Award.
- Hussam Amrouch, Victor van Santen, Thomas Ebi, Volker Wenzel, Jörg Henkel, "Towards Interdependencies of Aging Mechanisms", IEEE/ACM Int'l Conference on CAD (ICCAD), 2014.
- Muhammad Shafique, Siddharth Garg, Tulika Mitra, Sri Parameswaran, Jörg Henkel, "Dark Silicon as a Challenge for Hardware/Software Co-Design", IEEE International Conference on Hardware-Software Codesign and System Synthesis (CODES+ISSS), 2014.
- M. Shafique, S. Garg, D. Marculescu, J. Henkel, "The EDA Challenges in the Dark Silicon Era", ACM/IEEE/EDA 51st Design Automation Conference (DAC), 2014.
- F. Kriebel, S. Rehman, D. Sun, M. Shafique, J. Henkel, "ASER: Adaptive Soft Error Resilience for Reliability-Heterogeneous Processors in the Dark Silicon Era", ACM/IEEE/EDA 51st Design Automation Conference (DAC), 2014.
- H. Bokhari, H. Javaid, M. Shafique, J. Henkel, S. Parameswaran, "darkNoC: Designing Energy Efficient Network-on-Chip with Multi-Vt Cells for Dark Silicon", ACM/IEEE/EDA 51st Design Automation Conference (DAC), 2014.
- Semeen Rehman, Muhammad Shafique, Florian Kriebel, Jörg Henkel, "Reliable software for unreliable hardware: embedded code generation aiming at reliability", IEEE International Conference on Hardware-Software Codesign and System Synthesis (CODES+ISSS), 2011, Best Paper Award.

#### **Acknowledgements**





#### Partly Funded by InvasIC: http://invasic.de/

#### Partly Funded by Dependable Embedded Systems: http://spp1500.itec.kit.edu/

# Thank you for your attention!

#### Tools Download: http://ces.itec.kit.edu/download/
