Fault-Tolerant Design презентация

Содержание

What is this chapter about? Gives Overview of Fault-Tolerant Design Focus on Basic Concepts in Fault-Tolerant Design Metrics Used to Specify and Evaluate Dependability Review of Coding Theory Fault-Tolerant Design

Слайд 1Chapter 3
Fault-Tolerant Design


Слайд 2What is this chapter about?
Gives Overview of Fault-Tolerant Design

Focus on
Basic Concepts

in Fault-Tolerant Design
Metrics Used to Specify and Evaluate Dependability
Review of Coding Theory
Fault-Tolerant Design Schemes
Hardware Redundancy
Information Redundancy
Time Redundancy
Examples of Fault-Tolerant Applications in Industry


Слайд 3Fault-Tolerant Design
Introduction
Fundamentals of Fault Tolerance
Fundamentals of Coding Theory
Fault Tolerant Schemes
Industry Practices
Concluding

Remarks

Слайд 4Introduction
Fault Tolerance
Ability of system to continue error-free operation in presence of

unexpected fault
Important in mission-critical applications
E.g., medical, aviation, banking, etc.
Errors very costly
Becoming important in mainstream applications
Technology scaling causing circuit behavior to become less predictable and more prone to failures
Needing fault tolerance to keep failure rate within acceptable levels

Слайд 5Faults
Permanent Faults
Due to manufacturing defects, early life failures, wearout failures
Wearout failures

due to various mechanisms
e.g., electromigration, hot carrier degradation, dielectric breakdown, etc.
Temporary Faults
Only present for short period of time
Caused by external disturbance or marginal design parameters

Слайд 6Temporary Faults
Transient Errors (Non-recurring errors)
Cause by external disturbance
e.g., radiation, noise, power

disturbance, etc.
Intermittent Errors (Recurring errors)
Cause by marginal design parameters
Timing problems
e.g., races, hazards, skew
Signal integrity problems
e.g., crosstalk, ground bounce, etc.

Слайд 7Redundancy
Fault Tolerance requires some form of redundancy
Time Redundancy
Hardware Redundancy
Information Redundancy


Слайд 8Time Redundancy
Perform Same Operation Twice
See if get same result both times
If

not, then fault occurred
Can detect temporary faults
Cannot detect permanent faults
Would affect both computations
Advantage
Little to no hardware overhead
Disadvantage
Impacts system or circuit performance

Слайд 9Hardware Redundancy
Replicate hardware and compare outputs
From two or more modules
Detects both

permanent and temporary faults
Advantage
Little or no performance impact
Disadvantage
Area and power for redundant hardware

Слайд 10Information Redundancy
Encode outputs with error detecting or correcting code
Code selected to

minimize redundancy for class of faults
Advantage
Less hardware to generate redundant information than replicating module
Drawback
Added complexity in design

Слайд 11Failure Rate
λ(t) = Component failure rate
Measured in FITS (failures per 109

hours)



Слайд 12System Failure Rate
System constructed from components
No Fault Tolerance
Any component fails, whole

system fails



Слайд 13Reliability
If component working at time 0
R(t) = Probability still working at

time t

Exponential Failure Law
If failure rate assumed constant
Good approximation if past infant mortality period



Слайд 14Reliability for Series System
Series System
All components need to work for system

to work




Слайд 15System Reliability with Redundancy
System reliability with component B in Parallel
Can tolerate

one component B failing





Слайд 16Mean-Time-to-Failure (MTTF)
Average time before system fails
Equal to area under reliability curve



For

Exponential Failure Law




Слайд 17Maintainability
If system failed at time 0
M(t) = Probability repaired and operational

at time t
System repair time divided into
Passive repair time
Time for service engineer to travel to site
Active repair time
Time to locate failing component, repair/replace, and verify system operational
Can be improved through designing system so easy to locate failed component and verify

Слайд 18Repair Rate and MTTR
μ = rate at which system repaired
Analogous to

failure rate λ
Maintainability often modeled as


Mean-Time-to-Repair (MTTR) = 1/μ



Слайд 19Availability
System Availability
Fraction of time system is operational


Слайд 20Availability
Telephone Systems
Required to have system availability of 0.9999 (“four nines”)
High-Reliability Systems
May

require 7 or more nines
Fault-Tolerant Design
Needed to achieve such high availability from less reliable components

Слайд 21Coding Theory
Coding
Using more bits than necessary to represent data
Provides way to

detect errors
Errors occur when bits get flipped
Error Detecting Codes
Many types
Detect different classes of errors
Use different amounts of redundancy
Ease of encoding and decoding data varies

Слайд 22Block Code
Message = Data Being Encoded
Block code
Encodes m messages with n-bit

codeword



If no redundancy
m messages encoded with log2(m) bits
minimum possible



Слайд 23Block Code
To detect errors, some redundancy needed
Space of distinct 2n blocks

partitioned into codewords and non-codewords
Can detect errors that cause codeword to become non-codeword
Cannot detect errors that cause codeword to become another codeword

Слайд 24Separable Block Code
Separable
n-bit blocks partitioned into
k information bits directly representing message
(n-k)

check bits
Denoted (n,k) Block Code
Advantage
k-bit message directly extracted without decoding
Rate of Separable Block Code = k/n

Слайд 25Example of Separable Block Code
(4,3) Parity Code
Check bit is XOR of

3 message bits
message 101 → codeword 1010
Single Bit Parity





Слайд 26Example of Non-Separable Block Code
One-Hot Code
Each Codeword has single 1
Example of

8-bit one-hot
10000000, 01000000, 00100000, 00010000 00001000, 00000100, 00000010, 00000001
Redundancy = 1 - log2(8)/8 = 5/8




Слайд 27Linear Block Codes
Special class
Modulo-2 sum of any 2 codewords also codeword
Null

space of (n-k)xn Boolean matrix
Called Parity Check Matrix, H
For any n-bit codeword c
cHT = 0
All 0 codeword exists in any linear code

Слайд 28Linear Block Codes
Generator Matrix, G
kxn Matrix
Codeword c for message m
c =

mG

GHT = 0

Слайд 29Systematic Block Code
First k-bits correspond to message
Last n-k bits correspond to

check bits
For Systematic Code
G = [Ikxk : Pkx(n-k)]
H = [I(n-k)x(n-k) : PT(n-k)xk]
Example




Слайд 30Distance of Code
Distance between two codewords
Number of bits in which they

differ
Distance of Code
Minimum distance between any two codewords in code
If n=k (no redundancy), distance = 1
Single-bit parity, distance = 2
Code with distance d
Detect d-1 errors
Correct up to ⎣(d-1)/2⎦ errors

Слайд 31Error Correcting Codes
Code with distance 3
Called single error correcting (SEC) code
Code

with distance 4
Called single error correcting and double error detecting (SEC-DED) code
Procedure for constructing SEC code
Described in [Hamming 1950]
Any H-matrix with all columns distinct and no all-0 column is SEC

Слайд 32Hamming Code
For any value of n
SEC code constructed by
setting each column

in H equal to binary representation of column number (starting from 1)
Number of rows in H equal to ⎡log2(n+1)⎤
Example of SEC Hamming Code for n=7



Слайд 33Error Correction in Hamming Code
Syndrome, s
s = HvT for received vector

v
If v is codeword
Syndrome = 0
If v non-codeword and single-bit error
Syndrome will match one of columns of H
Will contain binary value of bit position in error

Слайд 34Example of Error Correction
For (7,3) Hamming Code
Suppose codeword 0110011 has one-bit

error changing it to 1110011



Слайд 35SEC-DED Code
Make SEC Hamming Code SEC-DED
By adding parity check over all

bits
Extra parity bit
1 for single-bit error
0 for double-bit error
Makes possible to detect double bit error
Avoid assuming single-bit error and miscorrecting it

Слайд 36Example of Error Correction
For (7,4) SEC-DED Hamming Code
Suppose codeword 0110011 has

two-bit error changing it to 1010011
Doesn’t match any column in H




Слайд 37Hsiao Code
Weight of column
Number of 1’s in column
Constructing n-bit SEC-DED Hsiao

Code
First use all possible weight-1 columns
Then all possible weight-3 columns
Then weight-5 columns, etc.
Until n columns formed
Number check bits is ⎡log2(n+1)⎤
Minimizes number of 1’s in H-matrix
Less hardware and delay for computing syndrome
Disadvantage: Correction logic more complex

Слайд 38Example of Hsiao Code
(7,3) Hsiao Code
Uses weight-1 and weight-3 columns


Слайд 39Unidirectional Errors
Errors in block of data which only cause 0→1 or

1→0, but not both
Any number of bits in error in one direction
Example
Correct codeword 111000
Unidirectional errors could cause
001000, 000000, 101000 (only 1→0 errors)
Non-unidirectional errors
101001, 011001, 011011 (both1→0 and 0→1)

Слайд 40Unidirectional Error Detecting Codes
All unidirectional error detecting (AUED) Codes
Detect all unidirectional

errors in codeword
Single-bit parity is not AUED
Cannot detect even number of errors
No linear code is AUED
All linear codes must contain all-0 vector, so cannot detect all 1→0 errors

Слайд 41Two-Rail Code
Two-Rail Code
One check bit for each information bit
Equal to complement

of information bit
Two-Rail Code is AEUD
50% Redundancy
Example of (6,3) Two-Rail Code
Message 101 has Codeword 101010
Set of all codewords
000111, 001110, 010101, 011100, 100110, 101010, 110001, 111000

Слайд 42Berger Codes
Lowest redundancy of separable AUED codes
For k information bits, log2(k+1)

check bits
Check bits equal to binary representation of number of 0’s in information bits
Example
Information bits 1000101
log2(7+1)=3 check bits
Check bits equal to 100 (4 zero’s)

Слайд 43Berger Codes
Codewords for (5,3) Berger Code
00011, 00110, 01010, 01101, 10010, 10101,

11001, 11100
If unidirectional errors
Contain 1→0 errors
increase 0’s in information bits
can only decrease binary number in check bits
Contain 0→1 errors
decrease 0’s in information bits
can only increase binary number in check bits

Слайд 44Berger Codes
If 8 information bits
Berger code requires log2⎡8+1⎤=4 check bits



(16,8) Two-Rail

Code
Requires 50% redundancy
Redundancy advantage of Berger Code
Increases as k increased



Слайд 45Constant Weight Codes
Constant Weight Codes
Non-separable, but lower redundancy than Berger
Each codeword

has same number of 1’s
Example 2-out-of-3 constant weight code
110, 011, 101
AEUD code
Unidirectional errors always change number of 1’s

Слайд 46Constant Weight Codes
Number codewords in m-out-of-n code


Codewords maximized when m close

to n/2 as possible
n/2-out-of-n when n even
(n/2-0.5 or n/2+0.5)-out-of-n when n odd
Minimizes redundancy of code



Слайд 47Example
6-out-of-12 constant weight code



12-bit Berger Code
Only 28 = 256 codewords




Слайд 48Constant Weight Codes
Advantage
Less redundancy than Berger codes

Disadvantage
Non-separable
Need decoding logic
to convert codeword

back to binary message

Слайд 49Burst Error
Burst Error
Common, multi-bit errors tend to be clustered
Noise source affects

contiguous set of bus lines
Length of burst error
number of bits between first and last error
Wrap around from last to first bit of codeword
Example: Original codeword 00000000
00111100 is burst error length 4
00110100 is burst error length 4
Any number of errors between first and last error

Слайд 50Cyclic Codes
Special class of linear code
Any codeword shifted cyclically is another

codeword
Used to detect burst errors
Less redundancy required to detect burst error than general multi-bit errors
Some distance 2 codes can detect all burst errors of length 4
detecting all possible 4-bit errors requires distance 5 code

Слайд 51Cyclic Redundancy Check (CRC) Code
Most widely used cyclic code
Uses binary alphabet

based on GF(2)
CRC code is (n,k) block code
Formed using generator polynomial, g(x)
called code generator
degree n-k polynomial (same degree as number of check bits)




Слайд 53CRC Code
Linear block code
Has G-matrix and H-matrix
G-matrix shifted version of generator

polynomial



Слайд 54CRC Code Example
(6,4) CRC code generated by g(x)=x2+1



Слайд 55Systematic CRC Codes
To obtain systematic CRC code
codewords formed using Galois division
nice

because LFSR can be used for performing division



Слайд 56Galois Division Example
Encode m(x)=x2+x with g(x)=x2+1
Requires dividing m(x)xn-k =x4+x3 by g(x)






Remainder

r(x)=x+1
c(x) = m(x)xn-k+r(x) = (x2+x)(x2)+x+1 = x4+x3+x+1

Слайд 58Generating Check Bits for CRC Code
Use LFSR
With characteristic polynomial equal to

g(x)
Append n-k 0’s to end of message
Example: m(x)=x2+x+1 and g(x)=x3+x+1

Слайд 59Checking CRC Codeword
Checking Received Codeword for Errors
Shift codeword into LFSR
with same

characteristic polynomial as used to generate it
If final state of LFSR non-zero, then error

Слайд 60Selecting Generator Polynomial
Key issue for CRC Codes
If first and last bit

of polynomial are 1
Will detect burst errors of length n-k or less
If generator polynomial is multiple of (x+1)
Will detect any odd number of errors
If g(x) = (x+1)p(x) where p(x) primitive of degree n-k-1 and n < 2n-k-1
Will detect single, double, triple, and odd errors

Слайд 61Commonly Used CRC Generators


Слайд 62Fault Tolerance Schemes
Adding Fault Tolerance to Design
Improves dependability of system
Requires redundancy
Hardware
Time
Information


Слайд 63Hardware Redundancy
Involves replicating hardware units
At any level of design
gate-level, module-level, chip-level,

board-level
Three Basic Forms
Static (also called Passive)
Masks faults rather than detects them
Dynamic (also called Active)
Detects faults and reconfigures to spare hardware
Hybrid
Combines active and passive approaches

Слайд 64Static Redundancy
Masks faults so no erroneous outputs
Provides uninterrupted operation
Important for real-time

systems
No time to reconfigure or retry operation
Simple self-contained
No need to update or rollback system state

Слайд 65Triple Module Redundancy (TMR)
Well-known static redundancy scheme
Three copies of module
Use majority

voter to determine final output
Error in one module out-voted by other two

Слайд 66TMR Reliability and MTTF
TMR works if any 2 modules work
Rm =

reliability of each module
Rv = reliability of voter


MTTF for TMR




Слайд 67Comparison with Simplex
Neglecting fault rate of voter



TMR has lower MTTF, but
Can

tolerate temporary faults
Higher reliability for short mission times



Слайд 68Comparison with Simplex
Crossover point




RTMR > Rsimplex when
Mission time shorter than 70%

of MTTF



Слайд 69N-Modular Redundancy (NMR)
NMR
N modules along with majority voter
TMR special case
Number of

failed modules masked = ⎣(N-1)/2⎦
As N increases, MTTF decreases
But, reliability for short missions increases
If goal only to tolerate temporary faults
TMR sufficient

Слайд 70Interwoven Logic
Replace each gate
with 4 gates using inconnection pattern that automatically

corrects errors
Traditionally not as attractive as TMR
Requires lots of area overhead
Renewed interest by researchers investigating emerging nanoelectronic technologies

Слайд 71Interwoven Logic with 4 NOR Gates


Слайд 72Example of Error on Third Y Input


Слайд 73Dynamic Redundancy
Involves
Detecting fault
Locating faulty hardware unit
Reconfiguring system to use spare fault-free

hardware unit

Слайд 74Unpowered (Cold) Spares
Advantage
Extends lifetime of spares
Equations
Assume spare not failing until powered
Perfect

reconfiguration capability



Слайд 75Unpowered (Cold) Spares
One cold spare doubles MTTF
Assuming faults always detected and

reconfiguration circuitry never fails
Drawback of cold spare
Extra time to power and initialize
Cannot be used to help in detecting faults
Fault detection requires either
periodic offline testing
online testing using time or information redundancy

Слайд 76Powered (Hot) Spares
Can use spares for online fault detection
One approach is

duplicate-and-compare
If outputs mismatch then fault occurred
Run diagnostic procedure to determine which module is faulty and replace with spare
Any number of spares can be used

Слайд 77Pair-and-a-Spare
Avoids halting system to run diagnostic procedure when fault occurs


Слайд 78TMR/Simplex
When one module in TMR fails
Disconnect one of remaining modules
Improves MTTF

while retaining advantages of TMR when 3 good modules
TMR/Simplex
Reliability always better than either TMR or Simplex alone

Слайд 79Comparison of Reliability vs Time


Слайд 80Hybrid Redundancy
Combines both static and dynamic redundancy
Masks faults like static
Detects and

reconfigures like dynamic

Слайд 81TMR with Spares
If TMR module fails
Replace with spare
can be either hot

or cold spare
While system has three working modules
TMR will provide fault masking for uninterrupted operation

Слайд 82Self-Purging Redundancy
Uses threshold voter instead of majority voter
Threshold voter outputs 1

if number of input that are 1 greater than threshold
Otherwise outputs 0
Requires hot spares

Слайд 83Self-Purging Redundancy


Слайд 84Self-Purging Redundancy
Compared with 5MR
Self-purging with 5 modules
Tolerate up to 3 failing

modules (5MR cannot)
Cannot tolerate two modules simultaneously failing (5MR can)
Compared with TMR with 2 spares
Self-purging with 5 modules
simpler reconfiguration circuitry
requires hot spares (3MR w/spares can use either hot or cold spares)

Слайд 85Time Redundancy
Advantage
Less hardware
Drawback
Cannot detect permanent faults
If error detected
System needs to rollback

to known good state before resuming operation

Слайд 86Repeated Execution
Repeat operation twice
Simplest time redundancy approach
Detects temporary faults occurring during

one execution (but not both)
Causes mismatch in results
Can reuse same hardware for both executions
Only one copy of functional hardware needed


Слайд 87Repeated Execution
Requires mechanism for storing and comparing results of both executions
In

processor, can store in memory or on disk and use software to compare
Main cost
Additional time for redundant execution and comparison

Слайд 88Multi-threaded Redundant Execution
Can use in processor-based system that can run multiple

threads
Two copies of thread executed concurrently
Results compared when both complete
Take advantage of processor’s built-in capability to exploit processing resources
Reduce execution time
Can significantly reduce performance penalty

Слайд 89Multiple Sampling of Outputs
Done at circuit-level
Sample once at end of normal

clock cycle
Same again after delay of Δt
Two samples compared to detect mismatch
Indicates error occurred
Detect fault whose duration is less than Δt
Performance overhead depends on
Size of Δt relative to normal clock period

Слайд 90Multiple Sampling of Outputs
Simple approach using two latches


Слайд 91Multiple Sampling of Outputs
Approach using stability checker at output


Слайд 92Diverse Recomputation
Use same hardware, but perform computation differently second time
Can detect

permanent faults that affects only one computation
For arithmetic or logical operations
Shift operands when performing second computation [Patel 1982]
Detects permanent fault affecting only one bit-slice

Слайд 93Information Redundancy
Based on Error Detecting and Correcting Codes
Advantage
Detects both permanent and

temporary faults
Implemented with less hardware overhead than using multiple copies of module
Disadvantage
More complex design

Слайд 94Error Detection
Error detecting codes used to detect errors
If error detected
Rollback to

previous known error-free state
Retry operation


Слайд 95Rollback
Requires adding storage to save previous state
Amount of rollback depends on

latency of error detection mechanism
Zero-latency error detection
rollback implemented by preventing system state from updating
If errors detected after n cycles
need rollback restoring system to state at least n clock cycles earlier

Слайд 96Checkpoint
Execution divided into set of operations
Before each operation executed
checkpoint created where

system state saved
If any error detected during operation
rollback to last checkpoint and retry operation
If multiple retries fail
operation halts and system flags that permanent fault has occurred

Слайд 97Error Detection
Encode outputs of circuit with error detecting code
Non-codeword output indicates

error

Слайд 98Self-Checking Checker
Has two outputs
Normal error-free case (1,0) or (0,1)
If equal to

each other, then error (0,0) or (1,1)
Cannot have single error indicator output
Stuck-at 0 fault on output could never be detected


Слайд 99Totally Self-Checking Checker
Requires three properties
Code Disjoint
all codeword inputs mapped to codeword

outputs
Fault Secure
for all codeword inputs, checker in presence of fault will either procedure correct codeword output or non-codeword output (not incorrect codeword)
Self-Testing
For each fault, at least one codeword input gives error indication

Слайд 100Duplicate-and-Compare
Equality checker indicates error
Undetected error can occur only if common-mode fault

affecting both copies
Only faults after stems detected
Over 100% overhead (including checker)

Слайд 101Single-Bit Parity Code
Totally self-checking checker formed by removing final gate from

XOR tree

Слайд 102Single-Bit Parity Code
Cannot detect even bit errors
Can ensure no even bit

errors by generating each output with independent cone of logic
Only single bit errors can occur due to single point fault
Typically requires a lot of overhead

Слайд 103Parity-Check Codes
Each check bit is parity for some set of output

bits
Example: 6 outputs and 3 check bits

Слайд 104Parity-Check Codes
For c check bits and k functional outputs
2ck possible parity

check codes
Can choose code based on structure of circuit to minimize undetected error combinations
Fanouts in circuit determine possible error combinations due to single-point fault

Слайд 105Checker for Parity-Check Codes
Constructed from single-bit parity checkers and two-rail checkers


Слайд 106Two-Rail Checkers
Totally self-checking two-rail checker


Слайд 107Berger Codes
Inverter-free circuit
Inverters only at primary inputs
Can be synthesized using only

algebraic factoring [Jha 1993]
Only unidirectional errors possible for single point faults
Can use unidirectional code
Berger code gives 100% coverage

Слайд 108Constant Weight Codes
Non-separable with lower redundancy
Drawback: need decoding logic to convert

codeword back to its original binary value
Can use for encoding states of FSM
No need for decoding logic

Слайд 109Error Correction
Information redundancy can also be used to mask errors
Not as

attractive as TMR because logic for predicting check bits very complex
However, very good for memories
Check bits stored with data
Error do not propagate in memories as in logic circuits, so SEC-DED usually sufficient


Слайд 110Error Correction
Memories very dense and prone to errors
Especially due to single-event

upsets (SEUs) from radiation
SEC-DED check bits stored in memory
32-bit word, SEC-DED requires 7 check bits
Increases size of memory by 7/32=21.9%
64-bit word, SEC-DED requires 8 check bits
Increases size of memory by 8/64=12.5%

Слайд 111Memory ECC Architecture


Слайд 112Hamming Code for ECC RAM

Z
1

Z
2

Z
3

Z
4

Z
5

Z
6


Z

7


Z

8


c

1


c

2


c

3


c

4









































Parity Group 1


1


1


0


1


1


0


1


0


1


0


0


0











































Parity Group 2


1


0


1


1


0


1


1


0


0


1


0


0











































Parity Group 3


0


1


1


1


0


0


0


1


0


0


1


0











































Parity Group 4


0


0


0


0


1


1


1


1


0


0


0


1










































































Слайд 113Memory ECC
SEC-DED generally very effective
Memory bit-flips tend to be independent and

uniformly distributed
If bit-flip occurs, gets corrected next time memory location accessed
Main risk is if memory word not access for long time
Multiple bit-flips could accumulate

Слайд 114Memory Scrubbing
Every location in memory read on periodic basis
Reduces chance of

multiple errors accumulating in a memory word
Can be implemented by having memory controller cycle through memory during idle periods

Слайд 115Multiple-Bit Upsets (MBU)
Can occur due to single SEU
Typically occur in adjacent

memory cells
Memory interleaving used
To prevent MBUs from resulting in multiple bit errors in same word


Слайд 117Concluding Remarks
Many different fault-tolerant schemes
Choosing scheme depends on
Types of faults to

be tolerated
Temporary or permanent
Single or multiple point failures
etc.
Design constraints
Area, performance, power, etc.



Слайд 118Concluding Remarks
As technology scales
Circuits increasingly prone to failure
Achieving sufficient fault tolerance

will be major design issue

Обратная связь

Если не удалось найти и скачать презентацию, Вы можете заказать его на нашем сайте. Мы постараемся найти нужный Вам материал и отправим по электронной почте. Не стесняйтесь обращаться к нам, если у вас возникли вопросы или пожелания:

Email: Нажмите что бы посмотреть 

Что такое ThePresentation.ru?

Это сайт презентаций, докладов, проектов, шаблонов в формате PowerPoint. Мы помогаем школьникам, студентам, учителям, преподавателям хранить и обмениваться учебными материалами с другими пользователями.


Для правообладателей

Яндекс.Метрика