Fault-Tolerant Design презентация

Содержание

1. Fault-Tolerant Design
2. What is this chapter about? Gives Overview
3. Fault-Tolerant Design Introduction Fundamentals of Fault Tolerance
4. Introduction Fault Tolerance Ability of system to
5. Faults Permanent Faults Due to manufacturing defects,
6. Temporary Faults Transient Errors (Non-recurring errors) Cause
7. Redundancy Fault Tolerance requires some form of redundancy Time Redundancy Hardware Redundancy Information Redundancy
8. Time Redundancy Perform Same Operation Twice See
9. Hardware Redundancy Replicate hardware and compare outputs
10. Information Redundancy Encode outputs with error detecting
11. Failure Rate λ(t) = Component failure rate Measured in FITS (failures per 109 hours)
12. System Failure Rate System constructed from components
13. Reliability If component working at time 0
14. Reliability for Series System Series System All
15. System Reliability with Redundancy System reliability with
16. Mean-Time-to-Failure (MTTF) Average time before system fails
17. Maintainability If system failed at time 0
18. Repair Rate and MTTR μ = rate
19. Availability System Availability Fraction of time system is operational
20. Availability Telephone Systems Required to have system
21. Coding Theory Coding Using more bits than
22. Block Code Message = Data Being Encoded
23. Block Code To detect errors, some redundancy
24. Separable Block Code Separable n-bit blocks partitioned
25. Example of Separable Block Code (4,3) Parity
26. Example of Non-Separable Block Code One-Hot Code
27. Linear Block Codes Special class Modulo-2 sum
28. Linear Block Codes Generator Matrix, G kxn
29. Systematic Block Code First k-bits correspond to
30. Distance of Code Distance between two codewords
31. Error Correcting Codes Code with distance 3
32. Hamming Code For any value of n
33. Error Correction in Hamming Code Syndrome, s
34. Example of Error Correction For (7,3) Hamming
35. SEC-DED Code Make SEC Hamming Code SEC-DED
36. Example of Error Correction For (7,4) SEC-DED
37. Hsiao Code Weight of column Number of
38. Example of Hsiao Code (7,3) Hsiao Code Uses weight-1 and weight-3 columns
39. Unidirectional Errors Errors in block of data
40. Unidirectional Error Detecting Codes All unidirectional error
41. Two-Rail Code Two-Rail Code One check bit
42. Berger Codes Lowest redundancy of separable AUED
43. Berger Codes Codewords for (5,3) Berger Code
44. Berger Codes If 8 information bits Berger
45. Constant Weight Codes Constant Weight Codes Non-separable,
46. Constant Weight Codes Number codewords in m-out-of-n
47. Example 6-out-of-12 constant weight code
48. Constant Weight Codes Advantage Less redundancy than
49. Burst Error Burst Error Common, multi-bit errors
50. Cyclic Codes Special class of linear code
51. Cyclic Redundancy Check (CRC) Code Most widely
53. CRC Code Linear block code Has G-matrix
54. CRC Code Example (6,4) CRC code generated by g(x)=x2+1
55. Systematic CRC Codes To obtain systematic CRC
56. Galois Division Example Encode m(x)=x2+x with g(x)=x2+1
58. Generating Check Bits for CRC Code Use
59. Checking CRC Codeword Checking Received Codeword for
60. Selecting Generator Polynomial Key issue for CRC
61. Commonly Used CRC Generators
62. Fault Tolerance Schemes Adding Fault Tolerance to
63. Hardware Redundancy Involves replicating hardware units At
64. Static Redundancy Masks faults so no erroneous
65. Triple Module Redundancy (TMR) Well-known static redundancy
66. TMR Reliability and MTTF TMR works if
67. Comparison with Simplex Neglecting fault rate of
68. Comparison with Simplex Crossover point
69. N-Modular Redundancy (NMR) NMR N modules along
70. Interwoven Logic Replace each gate with 4
71. Interwoven Logic with 4 NOR Gates
72. Example of Error on Third Y Input
73. Dynamic Redundancy Involves Detecting fault Locating faulty
74. Unpowered (Cold) Spares Advantage Extends lifetime of
75. Unpowered (Cold) Spares One cold spare doubles
76. Powered (Hot) Spares Can use spares for
77. Pair-and-a-Spare Avoids halting system to run diagnostic procedure when fault occurs
78. TMR/Simplex When one module in TMR fails
79. Comparison of Reliability vs Time
80. Hybrid Redundancy Combines both static and dynamic
81. TMR with Spares If TMR module fails
82. Self-Purging Redundancy Uses threshold voter instead of
83. Self-Purging Redundancy
84. Self-Purging Redundancy Compared with 5MR Self-purging with
85. Time Redundancy Advantage Less hardware Drawback Cannot
86. Repeated Execution Repeat operation twice Simplest time
87. Repeated Execution Requires mechanism for storing and
88. Multi-threaded Redundant Execution Can use in processor-based
89. Multiple Sampling of Outputs Done at circuit-level
90. Multiple Sampling of Outputs Simple approach using two latches
91. Multiple Sampling of Outputs Approach using stability checker at output
92. Diverse Recomputation Use same hardware, but perform
93. Information Redundancy Based on Error Detecting and
94. Error Detection Error detecting codes used to
95. Rollback Requires adding storage to save previous
96. Checkpoint Execution divided into set of operations
97. Error Detection Encode outputs of circuit with error detecting code Non-codeword output indicates error
98. Self-Checking Checker Has two outputs Normal error-free
99. Totally Self-Checking Checker Requires three properties Code
100. Duplicate-and-Compare Equality checker indicates error Undetected error
101. Single-Bit Parity Code Totally self-checking checker formed by removing final gate from XOR tree
102. Single-Bit Parity Code Cannot detect even bit
103. Parity-Check Codes Each check bit is parity
104. Parity-Check Codes For c check bits and
105. Checker for Parity-Check Codes Constructed from single-bit parity checkers and two-rail checkers
106. Two-Rail Checkers Totally self-checking two-rail checker
107. Berger Codes Inverter-free circuit Inverters only at
108. Constant Weight Codes Non-separable with lower redundancy
109. Error Correction Information redundancy can also be
110. Error Correction Memories very dense and prone
111. Memory ECC Architecture
112. Hamming Code for ECC RAM
113. Memory ECC SEC-DED generally very effective Memory
114. Memory Scrubbing Every location in memory read
115. Multiple-Bit Upsets (MBU) Can occur due to
117. Concluding Remarks Many different fault-tolerant schemes Choosing
118. Concluding Remarks As technology scales Circuits increasingly

Слайд 1Chapter 3
Fault-Tolerant Design

Слайд 2What is this chapter about?
Gives Overview of Fault-Tolerant Design

Focus on
Basic Concepts

in Fault-Tolerant Design
Metrics Used to Specify and Evaluate Dependability
Review of Coding Theory
Fault-Tolerant Design Schemes
Hardware Redundancy
Information Redundancy
Time Redundancy
Examples of Fault-Tolerant Applications in Industry

Слайд 3Fault-Tolerant Design
Introduction
Fundamentals of Fault Tolerance
Fundamentals of Coding Theory
Fault Tolerant Schemes
Industry Practices
Concluding

Remarks

Слайд 4Introduction
Fault Tolerance
Ability of system to continue error-free operation in presence of

unexpected fault
Important in mission-critical applications
E.g., medical, aviation, banking, etc.
Errors very costly
Becoming important in mainstream applications
Technology scaling causing circuit behavior to become less predictable and more prone to failures
Needing fault tolerance to keep failure rate within acceptable levels

Слайд 5Faults
Permanent Faults
Due to manufacturing defects, early life failures, wearout failures
Wearout failures

due to various mechanisms
e.g., electromigration, hot carrier degradation, dielectric breakdown, etc.
Temporary Faults
Only present for short period of time
Caused by external disturbance or marginal design parameters

Слайд 6Temporary Faults
Transient Errors (Non-recurring errors)
Cause by external disturbance
e.g., radiation, noise, power

disturbance, etc.
Intermittent Errors (Recurring errors)
Cause by marginal design parameters
Timing problems
e.g., races, hazards, skew
Signal integrity problems
e.g., crosstalk, ground bounce, etc.

Слайд 7Redundancy
Fault Tolerance requires some form of redundancy
Time Redundancy
Hardware Redundancy
Information Redundancy

Слайд 8Time Redundancy
Perform Same Operation Twice
See if get same result both times
If

not, then fault occurred
Can detect temporary faults
Cannot detect permanent faults
Would affect both computations
Advantage
Little to no hardware overhead
Disadvantage
Impacts system or circuit performance

Слайд 9Hardware Redundancy
Replicate hardware and compare outputs
From two or more modules
Detects both

permanent and temporary faults
Advantage
Little or no performance impact
Disadvantage
Area and power for redundant hardware

Слайд 10Information Redundancy
Encode outputs with error detecting or correcting code
Code selected to

minimize redundancy for class of faults
Advantage
Less hardware to generate redundant information than replicating module
Drawback
Added complexity in design

Слайд 11Failure Rate
λ(t) = Component failure rate
Measured in FITS (failures per 109

hours)

Слайд 12System Failure Rate
System constructed from components
No Fault Tolerance
Any component fails, whole

system fails

Слайд 13Reliability
If component working at time 0
R(t) = Probability still working at

time t

Exponential Failure Law
If failure rate assumed constant
Good approximation if past infant mortality period

Слайд 14Reliability for Series System
Series System
All components need to work for system

to work

Слайд 15System Reliability with Redundancy
System reliability with component B in Parallel
Can tolerate

one component B failing

Слайд 16Mean-Time-to-Failure (MTTF)
Average time before system fails
Equal to area under reliability curve

For

Exponential Failure Law

Слайд 17Maintainability
If system failed at time 0
M(t) = Probability repaired and operational

at time t
System repair time divided into
Passive repair time
Time for service engineer to travel to site
Active repair time
Time to locate failing component, repair/replace, and verify system operational
Can be improved through designing system so easy to locate failed component and verify

Слайд 18Repair Rate and MTTR
μ = rate at which system repaired
Analogous to

failure rate λ
Maintainability often modeled as

Mean-Time-to-Repair (MTTR) = 1/μ

Слайд 19Availability
System Availability
Fraction of time system is operational

Слайд 20Availability
Telephone Systems
Required to have system availability of 0.9999 (“four nines”)
High-Reliability Systems
May

require 7 or more nines
Fault-Tolerant Design
Needed to achieve such high availability from less reliable components

Слайд 21Coding Theory
Coding
Using more bits than necessary to represent data
Provides way to

detect errors
Errors occur when bits get flipped
Error Detecting Codes
Many types
Detect different classes of errors
Use different amounts of redundancy
Ease of encoding and decoding data varies

Слайд 22Block Code
Message = Data Being Encoded
Block code
Encodes m messages with n-bit

codeword

If no redundancy
m messages encoded with log2(m) bits
minimum possible

Слайд 23Block Code
To detect errors, some redundancy needed
Space of distinct 2n blocks

partitioned into codewords and non-codewords
Can detect errors that cause codeword to become non-codeword
Cannot detect errors that cause codeword to become another codeword

Слайд 24Separable Block Code
Separable
n-bit blocks partitioned into
k information bits directly representing message
(n-k)

check bits
Denoted (n,k) Block Code
Advantage
k-bit message directly extracted without decoding
Rate of Separable Block Code = k/n

Слайд 25Example of Separable Block Code
(4,3) Parity Code
Check bit is XOR of

3 message bits
message 101 → codeword 1010
Single Bit Parity

Слайд 26Example of Non-Separable Block Code
One-Hot Code
Each Codeword has single 1
Example of

8-bit one-hot
10000000, 01000000, 00100000, 00010000 00001000, 00000100, 00000010, 00000001
Redundancy = 1 - log2(8)/8 = 5/8

Слайд 27Linear Block Codes
Special class
Modulo-2 sum of any 2 codewords also codeword
Null

space of (n-k)xn Boolean matrix
Called Parity Check Matrix, H
For any n-bit codeword c
cHT = 0
All 0 codeword exists in any linear code

Слайд 28Linear Block Codes
Generator Matrix, G
kxn Matrix
Codeword c for message m
c =

mG

GHT = 0

Слайд 29Systematic Block Code
First k-bits correspond to message
Last n-k bits correspond to

check bits
For Systematic Code
G = [Ikxk : Pkx(n-k)]
H = [I(n-k)x(n-k) : PT(n-k)xk]
Example

Слайд 30Distance of Code
Distance between two codewords
Number of bits in which they

differ
Distance of Code
Minimum distance between any two codewords in code
If n=k (no redundancy), distance = 1
Single-bit parity, distance = 2
Code with distance d
Detect d-1 errors
Correct up to ⎣(d-1)/2⎦ errors

Слайд 31Error Correcting Codes
Code with distance 3
Called single error correcting (SEC) code
Code

with distance 4
Called single error correcting and double error detecting (SEC-DED) code
Procedure for constructing SEC code
Described in [Hamming 1950]
Any H-matrix with all columns distinct and no all-0 column is SEC

Слайд 32Hamming Code
For any value of n
SEC code constructed by
setting each column

in H equal to binary representation of column number (starting from 1)
Number of rows in H equal to ⎡log2(n+1)⎤
Example of SEC Hamming Code for n=7

Слайд 33Error Correction in Hamming Code
Syndrome, s
s = HvT for received vector

v
If v is codeword
Syndrome = 0
If v non-codeword and single-bit error
Syndrome will match one of columns of H
Will contain binary value of bit position in error

Слайд 34Example of Error Correction
For (7,3) Hamming Code
Suppose codeword 0110011 has one-bit

error changing it to 1110011

Слайд 35SEC-DED Code
Make SEC Hamming Code SEC-DED
By adding parity check over all

bits
Extra parity bit
1 for single-bit error
0 for double-bit error
Makes possible to detect double bit error
Avoid assuming single-bit error and miscorrecting it

Слайд 36Example of Error Correction
For (7,4) SEC-DED Hamming Code
Suppose codeword 0110011 has

two-bit error changing it to 1010011
Doesn’t match any column in H

Слайд 37Hsiao Code
Weight of column
Number of 1’s in column
Constructing n-bit SEC-DED Hsiao

Code
First use all possible weight-1 columns
Then all possible weight-3 columns
Then weight-5 columns, etc.
Until n columns formed
Number check bits is ⎡log2(n+1)⎤
Minimizes number of 1’s in H-matrix
Less hardware and delay for computing syndrome
Disadvantage: Correction logic more complex

Слайд 38Example of Hsiao Code
(7,3) Hsiao Code
Uses weight-1 and weight-3 columns

Слайд 39Unidirectional Errors
Errors in block of data which only cause 0→1 or

1→0, but not both
Any number of bits in error in one direction
Example
Correct codeword 111000
Unidirectional errors could cause
001000, 000000, 101000 (only 1→0 errors)
Non-unidirectional errors
101001, 011001, 011011 (both1→0 and 0→1)

Слайд 40Unidirectional Error Detecting Codes
All unidirectional error detecting (AUED) Codes
Detect all unidirectional

errors in codeword
Single-bit parity is not AUED
Cannot detect even number of errors
No linear code is AUED
All linear codes must contain all-0 vector, so cannot detect all 1→0 errors

Слайд 41Two-Rail Code
Two-Rail Code
One check bit for each information bit
Equal to complement

of information bit
Two-Rail Code is AEUD
50% Redundancy
Example of (6,3) Two-Rail Code
Message 101 has Codeword 101010
Set of all codewords
000111, 001110, 010101, 011100, 100110, 101010, 110001, 111000

Слайд 42Berger Codes
Lowest redundancy of separable AUED codes
For k information bits, log2(k+1)

check bits
Check bits equal to binary representation of number of 0’s in information bits
Example
Information bits 1000101
log2(7+1)=3 check bits
Check bits equal to 100 (4 zero’s)

Слайд 43Berger Codes
Codewords for (5,3) Berger Code
00011, 00110, 01010, 01101, 10010, 10101,

11001, 11100
If unidirectional errors
Contain 1→0 errors
increase 0’s in information bits
can only decrease binary number in check bits
Contain 0→1 errors
decrease 0’s in information bits
can only increase binary number in check bits

Слайд 44Berger Codes
If 8 information bits
Berger code requires log2⎡8+1⎤=4 check bits

(16,8) Two-Rail

Code
Requires 50% redundancy
Redundancy advantage of Berger Code
Increases as k increased

Слайд 45Constant Weight Codes
Constant Weight Codes
Non-separable, but lower redundancy than Berger
Each codeword

has same number of 1’s
Example 2-out-of-3 constant weight code
110, 011, 101
AEUD code
Unidirectional errors always change number of 1’s

Слайд 46Constant Weight Codes
Number codewords in m-out-of-n code

Codewords maximized when m close

to n/2 as possible
n/2-out-of-n when n even
(n/2-0.5 or n/2+0.5)-out-of-n when n odd
Minimizes redundancy of code

Слайд 47Example
6-out-of-12 constant weight code

12-bit Berger Code
Only 28 = 256 codewords

Слайд 48Constant Weight Codes
Advantage
Less redundancy than Berger codes

Disadvantage
Non-separable
Need decoding logic
to convert codeword

back to binary message

Слайд 49Burst Error
Burst Error
Common, multi-bit errors tend to be clustered
Noise source affects

contiguous set of bus lines
Length of burst error
number of bits between first and last error
Wrap around from last to first bit of codeword
Example: Original codeword 00000000
00111100 is burst error length 4
00110100 is burst error length 4
Any number of errors between first and last error

Слайд 50Cyclic Codes
Special class of linear code
Any codeword shifted cyclically is another

codeword
Used to detect burst errors
Less redundancy required to detect burst error than general multi-bit errors
Some distance 2 codes can detect all burst errors of length 4
detecting all possible 4-bit errors requires distance 5 code

Слайд 51Cyclic Redundancy Check (CRC) Code
Most widely used cyclic code
Uses binary alphabet

based on GF(2)
CRC code is (n,k) block code
Formed using generator polynomial, g(x)
called code generator
degree n-k polynomial (same degree as number of check bits)

Слайд 52

Слайд 53CRC Code
Linear block code
Has G-matrix and H-matrix
G-matrix shifted version of generator

polynomial

Слайд 54CRC Code Example
(6,4) CRC code generated by g(x)=x2+1

Слайд 55Systematic CRC Codes
To obtain systematic CRC code
codewords formed using Galois division
nice

because LFSR can be used for performing division

Слайд 56Galois Division Example
Encode m(x)=x2+x with g(x)=x2+1
Requires dividing m(x)xn-k =x4+x3 by g(x)

Remainder

r(x)=x+1
c(x) = m(x)xn-k+r(x) = (x2+x)(x2)+x+1 = x4+x3+x+1

Слайд 57

Слайд 58Generating Check Bits for CRC Code
Use LFSR
With characteristic polynomial equal to

g(x)
Append n-k 0’s to end of message
Example: m(x)=x2+x+1 and g(x)=x3+x+1

Слайд 59Checking CRC Codeword
Checking Received Codeword for Errors
Shift codeword into LFSR
with same

characteristic polynomial as used to generate it
If final state of LFSR non-zero, then error

Слайд 60Selecting Generator Polynomial
Key issue for CRC Codes
If first and last bit

of polynomial are 1
Will detect burst errors of length n-k or less
If generator polynomial is multiple of (x+1)
Will detect any odd number of errors
If g(x) = (x+1)p(x) where p(x) primitive of degree n-k-1 and n < 2n-k-1
Will detect single, double, triple, and odd errors

Слайд 61Commonly Used CRC Generators

Слайд 62Fault Tolerance Schemes
Adding Fault Tolerance to Design
Improves dependability of system
Requires redundancy
Hardware
Time
Information

Слайд 63Hardware Redundancy
Involves replicating hardware units
At any level of design
gate-level, module-level, chip-level,

board-level
Three Basic Forms
Static (also called Passive)
Masks faults rather than detects them
Dynamic (also called Active)
Detects faults and reconfigures to spare hardware
Hybrid
Combines active and passive approaches

Слайд 64Static Redundancy
Masks faults so no erroneous outputs
Provides uninterrupted operation
Important for real-time

systems
No time to reconfigure or retry operation
Simple self-contained
No need to update or rollback system state

Слайд 65Triple Module Redundancy (TMR)
Well-known static redundancy scheme
Three copies of module
Use majority

voter to determine final output
Error in one module out-voted by other two

Слайд 66TMR Reliability and MTTF
TMR works if any 2 modules work
Rm =

reliability of each module
Rv = reliability of voter

MTTF for TMR

Слайд 67Comparison with Simplex
Neglecting fault rate of voter

TMR has lower MTTF, but
Can

tolerate temporary faults
Higher reliability for short mission times

Слайд 68Comparison with Simplex
Crossover point

RTMR > Rsimplex when
Mission time shorter than 70%

of MTTF

Слайд 69N-Modular Redundancy (NMR)
NMR
N modules along with majority voter
TMR special case
Number of

failed modules masked = ⎣(N-1)/2⎦
As N increases, MTTF decreases
But, reliability for short missions increases
If goal only to tolerate temporary faults
TMR sufficient

Слайд 70Interwoven Logic
Replace each gate
with 4 gates using inconnection pattern that automatically

corrects errors
Traditionally not as attractive as TMR
Requires lots of area overhead
Renewed interest by researchers investigating emerging nanoelectronic technologies

Слайд 71Interwoven Logic with 4 NOR Gates

Слайд 72Example of Error on Third Y Input

Слайд 73Dynamic Redundancy
Involves
Detecting fault
Locating faulty hardware unit
Reconfiguring system to use spare fault-free

hardware unit

Слайд 74Unpowered (Cold) Spares
Advantage
Extends lifetime of spares
Equations
Assume spare not failing until powered
Perfect

reconfiguration capability

Слайд 75Unpowered (Cold) Spares
One cold spare doubles MTTF
Assuming faults always detected and

reconfiguration circuitry never fails
Drawback of cold spare
Extra time to power and initialize
Cannot be used to help in detecting faults
Fault detection requires either
periodic offline testing
online testing using time or information redundancy

Слайд 76Powered (Hot) Spares
Can use spares for online fault detection
One approach is

duplicate-and-compare
If outputs mismatch then fault occurred
Run diagnostic procedure to determine which module is faulty and replace with spare
Any number of spares can be used

Слайд 77Pair-and-a-Spare
Avoids halting system to run diagnostic procedure when fault occurs

Слайд 78TMR/Simplex
When one module in TMR fails
Disconnect one of remaining modules
Improves MTTF

while retaining advantages of TMR when 3 good modules
TMR/Simplex
Reliability always better than either TMR or Simplex alone

Слайд 79Comparison of Reliability vs Time

Слайд 80Hybrid Redundancy
Combines both static and dynamic redundancy
Masks faults like static
Detects and

reconfigures like dynamic

Слайд 81TMR with Spares
If TMR module fails
Replace with spare
can be either hot

or cold spare
While system has three working modules
TMR will provide fault masking for uninterrupted operation

Слайд 82Self-Purging Redundancy
Uses threshold voter instead of majority voter
Threshold voter outputs 1

if number of input that are 1 greater than threshold
Otherwise outputs 0
Requires hot spares

Слайд 83Self-Purging Redundancy

Слайд 84Self-Purging Redundancy
Compared with 5MR
Self-purging with 5 modules
Tolerate up to 3 failing

modules (5MR cannot)
Cannot tolerate two modules simultaneously failing (5MR can)
Compared with TMR with 2 spares
Self-purging with 5 modules
simpler reconfiguration circuitry
requires hot spares (3MR w/spares can use either hot or cold spares)

Слайд 85Time Redundancy
Advantage
Less hardware
Drawback
Cannot detect permanent faults
If error detected
System needs to rollback

to known good state before resuming operation

Слайд 86Repeated Execution
Repeat operation twice
Simplest time redundancy approach
Detects temporary faults occurring during

one execution (but not both)
Causes mismatch in results
Can reuse same hardware for both executions
Only one copy of functional hardware needed

Слайд 87Repeated Execution
Requires mechanism for storing and comparing results of both executions
In

processor, can store in memory or on disk and use software to compare
Main cost
Additional time for redundant execution and comparison

Слайд 88Multi-threaded Redundant Execution
Can use in processor-based system that can run multiple

threads
Two copies of thread executed concurrently
Results compared when both complete
Take advantage of processor’s built-in capability to exploit processing resources
Reduce execution time
Can significantly reduce performance penalty

Слайд 89Multiple Sampling of Outputs
Done at circuit-level
Sample once at end of normal

clock cycle
Same again after delay of Δt
Two samples compared to detect mismatch
Indicates error occurred
Detect fault whose duration is less than Δt
Performance overhead depends on
Size of Δt relative to normal clock period

Слайд 90Multiple Sampling of Outputs
Simple approach using two latches

Слайд 91Multiple Sampling of Outputs
Approach using stability checker at output

Слайд 92Diverse Recomputation
Use same hardware, but perform computation differently second time
Can detect

permanent faults that affects only one computation
For arithmetic or logical operations
Shift operands when performing second computation [Patel 1982]
Detects permanent fault affecting only one bit-slice

Слайд 93Information Redundancy
Based on Error Detecting and Correcting Codes
Advantage
Detects both permanent and

temporary faults
Implemented with less hardware overhead than using multiple copies of module
Disadvantage
More complex design

Слайд 94Error Detection
Error detecting codes used to detect errors
If error detected
Rollback to

previous known error-free state
Retry operation

Слайд 95Rollback
Requires adding storage to save previous state
Amount of rollback depends on

latency of error detection mechanism
Zero-latency error detection
rollback implemented by preventing system state from updating
If errors detected after n cycles
need rollback restoring system to state at least n clock cycles earlier

Слайд 96Checkpoint
Execution divided into set of operations
Before each operation executed
checkpoint created where

system state saved
If any error detected during operation
rollback to last checkpoint and retry operation
If multiple retries fail
operation halts and system flags that permanent fault has occurred

Слайд 97Error Detection
Encode outputs of circuit with error detecting code
Non-codeword output indicates

error

Слайд 98Self-Checking Checker
Has two outputs
Normal error-free case (1,0) or (0,1)
If equal to

each other, then error (0,0) or (1,1)
Cannot have single error indicator output
Stuck-at 0 fault on output could never be detected

Слайд 99Totally Self-Checking Checker
Requires three properties
Code Disjoint
all codeword inputs mapped to codeword

outputs
Fault Secure
for all codeword inputs, checker in presence of fault will either procedure correct codeword output or non-codeword output (not incorrect codeword)
Self-Testing
For each fault, at least one codeword input gives error indication

Слайд 100Duplicate-and-Compare
Equality checker indicates error
Undetected error can occur only if common-mode fault

affecting both copies
Only faults after stems detected
Over 100% overhead (including checker)

Слайд 101Single-Bit Parity Code
Totally self-checking checker formed by removing final gate from

XOR tree

Слайд 102Single-Bit Parity Code
Cannot detect even bit errors
Can ensure no even bit

errors by generating each output with independent cone of logic
Only single bit errors can occur due to single point fault
Typically requires a lot of overhead

Слайд 103Parity-Check Codes
Each check bit is parity for some set of output

bits
Example: 6 outputs and 3 check bits

Слайд 104Parity-Check Codes
For c check bits and k functional outputs
2ck possible parity

check codes
Can choose code based on structure of circuit to minimize undetected error combinations
Fanouts in circuit determine possible error combinations due to single-point fault

Слайд 105Checker for Parity-Check Codes
Constructed from single-bit parity checkers and two-rail checkers

Слайд 106Two-Rail Checkers
Totally self-checking two-rail checker

Слайд 107Berger Codes
Inverter-free circuit
Inverters only at primary inputs
Can be synthesized using only

algebraic factoring [Jha 1993]
Only unidirectional errors possible for single point faults
Can use unidirectional code
Berger code gives 100% coverage

Слайд 108Constant Weight Codes
Non-separable with lower redundancy
Drawback: need decoding logic to convert

codeword back to its original binary value
Can use for encoding states of FSM
No need for decoding logic

Слайд 109Error Correction
Information redundancy can also be used to mask errors
Not as

attractive as TMR because logic for predicting check bits very complex
However, very good for memories
Check bits stored with data
Error do not propagate in memories as in logic circuits, so SEC-DED usually sufficient

Слайд 110Error Correction
Memories very dense and prone to errors
Especially due to single-event

upsets (SEUs) from radiation
SEC-DED check bits stored in memory
32-bit word, SEC-DED requires 7 check bits
Increases size of memory by 7/32=21.9%
64-bit word, SEC-DED requires 8 check bits
Increases size of memory by 8/64=12.5%

Слайд 111Memory ECC Architecture

Слайд 112Hamming Code for ECC RAM

Z
1

Z
2

Z
3

Z
4

Z
5

Z
6

Parity Group 1

Parity Group 2

Parity Group 3

Parity Group 4

Слайд 113Memory ECC
SEC-DED generally very effective
Memory bit-flips tend to be independent and

uniformly distributed
If bit-flip occurs, gets corrected next time memory location accessed
Main risk is if memory word not access for long time
Multiple bit-flips could accumulate

Слайд 114Memory Scrubbing
Every location in memory read on periodic basis
Reduces chance of

multiple errors accumulating in a memory word
Can be implemented by having memory controller cycle through memory during idle periods

Слайд 115Multiple-Bit Upsets (MBU)
Can occur due to single SEU
Typically occur in adjacent

memory cells
Memory interleaving used
To prevent MBUs from resulting in multiple bit errors in same word

Слайд 116

Слайд 117Concluding Remarks
Many different fault-tolerant schemes
Choosing scheme depends on
Types of faults to

be tolerated
Temporary or permanent
Single or multiple point failures
etc.
Design constraints
Area, performance, power, etc.

Слайд 118Concluding Remarks
As technology scales
Circuits increasingly prone to failure
Achieving sufficient fault tolerance

will be major design issue

Скачать презентацию

Fault-Tolerant Design презентация

Содержание

Слайд 1Chapter 3Fault-Tolerant Design

Слайд 2What is this chapter about?Gives Overview of Fault-Tolerant DesignFocus onBasic Concepts

Слайд 3Fault-Tolerant DesignIntroductionFundamentals of Fault ToleranceFundamentals of Coding TheoryFault Tolerant SchemesIndustry PracticesConcluding

Слайд 4IntroductionFault ToleranceAbility of system to continue error-free operation in presence of

Слайд 5FaultsPermanent FaultsDue to manufacturing defects, early life failures, wearout failuresWearout failures

Слайд 6Temporary FaultsTransient Errors (Non-recurring errors)Cause by external disturbancee.g., radiation, noise, power

Слайд 7RedundancyFault Tolerance requires some form of redundancyTime RedundancyHardware RedundancyInformation Redundancy

Слайд 8Time RedundancyPerform Same Operation TwiceSee if get same result both timesIf

Слайд 9Hardware RedundancyReplicate hardware and compare outputsFrom two or more modulesDetects both

Слайд 10Information RedundancyEncode outputs with error detecting or correcting codeCode selected to

Слайд 11Failure Rateλ(t) = Component failure rateMeasured in FITS (failures per 109

Слайд 12System Failure RateSystem constructed from componentsNo Fault ToleranceAny component fails, whole

Слайд 13ReliabilityIf component working at time 0R(t) = Probability still working at

Слайд 14Reliability for Series SystemSeries SystemAll components need to work for system

Слайд 15System Reliability with RedundancySystem reliability with component B in ParallelCan tolerate

Слайд 16Mean-Time-to-Failure (MTTF)Average time before system failsEqual to area under reliability curveFor

Слайд 17MaintainabilityIf system failed at time 0M(t) = Probability repaired and operational

Слайд 18Repair Rate and MTTRμ = rate at which system repairedAnalogous to

Слайд 19AvailabilitySystem AvailabilityFraction of time system is operational

Слайд 20AvailabilityTelephone SystemsRequired to have system availability of 0.9999 (“four nines”)High-Reliability SystemsMay

Слайд 21Coding TheoryCodingUsing more bits than necessary to represent dataProvides way to

Слайд 22Block CodeMessage = Data Being EncodedBlock codeEncodes m messages with n-bit

Слайд 23Block CodeTo detect errors, some redundancy neededSpace of distinct 2n blocks

Слайд 24Separable Block CodeSeparablen-bit blocks partitioned intok information bits directly representing message(n-k)

Слайд 25Example of Separable Block Code(4,3) Parity CodeCheck bit is XOR of

Слайд 26Example of Non-Separable Block CodeOne-Hot CodeEach Codeword has single 1Example of

Слайд 27Linear Block CodesSpecial classModulo-2 sum of any 2 codewords also codewordNull

Слайд 28Linear Block CodesGenerator Matrix, Gkxn MatrixCodeword c for message mc =

Слайд 29Systematic Block CodeFirst k-bits correspond to messageLast n-k bits correspond to

Слайд 30Distance of CodeDistance between two codewordsNumber of bits in which they

Слайд 31Error Correcting CodesCode with distance 3Called single error correcting (SEC) codeCode

Слайд 32Hamming CodeFor any value of nSEC code constructed bysetting each column

Слайд 33Error Correction in Hamming CodeSyndrome, ss = HvT for received vector

Слайд 34Example of Error CorrectionFor (7,3) Hamming CodeSuppose codeword 0110011 has one-bit

Слайд 35SEC-DED CodeMake SEC Hamming Code SEC-DEDBy adding parity check over all

Слайд 36Example of Error CorrectionFor (7,4) SEC-DED Hamming CodeSuppose codeword 0110011 has

Слайд 37Hsiao CodeWeight of columnNumber of 1’s in columnConstructing n-bit SEC-DED Hsiao

Слайд 38Example of Hsiao Code(7,3) Hsiao CodeUses weight-1 and weight-3 columns

Слайд 39Unidirectional ErrorsErrors in block of data which only cause 0→1 or

Слайд 40Unidirectional Error Detecting CodesAll unidirectional error detecting (AUED) CodesDetect all unidirectional

Слайд 41Two-Rail CodeTwo-Rail CodeOne check bit for each information bitEqual to complement

Слайд 42Berger CodesLowest redundancy of separable AUED codesFor k information bits, log2(k+1)

Слайд 43Berger CodesCodewords for (5,3) Berger Code00011, 00110, 01010, 01101, 10010, 10101,

Слайд 44Berger CodesIf 8 information bitsBerger code requires log2⎡8+1⎤=4 check bits(16,8) Two-Rail

Слайд 45Constant Weight CodesConstant Weight CodesNon-separable, but lower redundancy than BergerEach codeword

Слайд 46Constant Weight CodesNumber codewords in m-out-of-n codeCodewords maximized when m close

Слайд 47Example6-out-of-12 constant weight code12-bit Berger CodeOnly 28 = 256 codewords

Слайд 48Constant Weight CodesAdvantageLess redundancy than Berger codesDisadvantageNon-separableNeed decoding logicto convert codeword

Слайд 49Burst ErrorBurst ErrorCommon, multi-bit errors tend to be clusteredNoise source affects

Слайд 50Cyclic CodesSpecial class of linear codeAny codeword shifted cyclically is another

Слайд 51Cyclic Redundancy Check (CRC) CodeMost widely used cyclic codeUses binary alphabet