

July 14-16, 2009

## AA129 - Reliability in Embedded Systems

Safety Standards and Self Tests

#### **Christopher Temple** Automotive Systems Technology Manager





### **Overview**

- Introduction
- IEC61508 Safety Standard
- ISO26262 Safety Standard (draft)
- MCU Safety Continuum
- Basic Core Self-Test
- Summary







## Introduction





### Freescale Introduces Product Longevity Program

- The embedded market needs long-term product support, which allows OEMs to provide assurance to their customers
- Freescale has a longstanding track record of providing long-term production support for our products
- Freescale is pleased to introduce a formal product longevity program for the market segments we serve
  - For the automotive and medical segments, Freescale will manufacture select devices for a minimum period of 15 years
  - For all other market segments in which Freescale participates, Freescale will manufacture select devices for a minimum period of 10 years
- A list of applicable Freescale products is available at www.freescale.com.



### **Automotive Safety and Functional Safety**



#### "Safety is freedom from unacceptable risk" (IEC 61508)

Freescale ™ and the Freescale logo are trademarks of Freescale Semiconductor, Inc. All other product or service names are the property of their respective owners. © Freescale Semiconductor, Inc. 2009.




### **Evolution of Functional Safety Approaches**



### **From Components to Integrated Systems**



#### **Systems and Standards**

Semiconductor manufacturers are moving towards becoming safety system suppliers

Industry-wide cooperation and standardization are emerging to harmonize system-related aspects across the industry





### **Industry-wide Initiatives for Standards**



## **Role of Safety Standards**

- Standards are emerging as a framework to establish metrics and a value network
- IEC61508
  - V1 since late 1990s, V2 announced
  - Safety lifecycle defined
  - Recommended and mandatory practices
- ISO26262
  - Current draft, release expected ~2011
  - Refinement of IEC61508 to comply with needs specific to the application sector of E/E systems within road vehicles







## **IEC61508 Safety Standard**







## The Seven Parts of IEC 61508

- 1: General Requirements
- 2: Requirements for electrical / electronic / programmable electronic safety-related systems (means HW)
- 3: Software Requirements
- 4: Definitions and abbreviations
- 5: Examples of methods for the determination of safety integrity levels
- 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3
- 7: Overview of techniques and measures



Note: Parts 1 to 4 are normative; Parts 5 to 7 are informative.

## How does IEC61508 define Functional Safety?

- Safety
  - "freedom from unacceptable risk"
- Risk
  - "combination of the probability of occurrence of harm and the severity of that harm"
- Harm
  - "physical injury or damage to the health of people either directly or indirectly as a result of damage to property or to the environment"
- Functional safety
  - "part of the overall safety relating to the equipment under control (EUC) and the EUC control system which depends on the correct functioning of the electrical/electronic/programmable electronic (E/E/PE) safety-related systems, other technology related safety-related systems and external risk reduction facilities"

## **Quantitative Requirements of IEC61508**

#### IEC 61508

- Four Safety Integrity Levels (SIL)
- Two key metrics
  - Probability of dangerous failure per hour (PFH)
  - Safe Failure Fraction (SFF)
- Hardware redundancy in formulas (HFT)

|                | SIL 1             | SIL 2             | SIL 3             |
|----------------|-------------------|-------------------|-------------------|
| PFH [1/h]      | <10 <sup>-5</sup> | <10 <sup>-6</sup> | <10 <sup>-7</sup> |
| SFF<br>(HFT=0) | >=60%             | >=90%             | >=99%             |
| SFF<br>(HFT=1) | -                 | >=60%             | >=90%             |

Note: Table adapted for typical automotive application
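The table can be read as a simple lookup. The following is a minimal sketch, assuming the thresholds exactly as printed above and reading the "-" entry for SIL 1 at HFT=1 as "no minimum SFF"; the function name `max_sil` is hypothetical and not part of the standard.

```python
# Thresholds copied from the table above; "-" is read as "no minimum" (assumption).
PFH_LIMIT = {1: 1e-5, 2: 1e-6, 3: 1e-7}   # PFH must stay below this value [1/h]
SFF_MIN = {
    0: {1: 0.60, 2: 0.90, 3: 0.99},       # HFT = 0
    1: {1: 0.00, 2: 0.60, 3: 0.90},       # HFT = 1
}


def max_sil(pfh: float, sff: float, hft: int) -> int:
    """Return the highest SIL (1..3) whose PFH and SFF targets are both met, else 0."""
    achieved = 0
    for sil in (1, 2, 3):
        if pfh < PFH_LIMIT[sil] and sff >= SFF_MIN[hft][sil]:
            achieved = sil
    return achieved


if __name__ == "__main__":
    # Same device metrics, with and without hardware redundancy.
    print(max_sil(pfh=5e-8, sff=0.95, hft=0))
    print(max_sil(pfh=5e-8, sff=0.95, hft=1))
```

The example illustrates the trade-off in the table: added hardware fault tolerance relaxes the SFF needed for a given SIL.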

#### Safety Integrity Levels

- SIL: "discrete level for specifying the safety integrity requirements of the safety functions to be allocated to the E/E/PE safety-related systems, where safety integrity level 4 has the highest level of safety integrity and safety integrity level 1 has the lowest"
- Approaches to determine the SIL
  - Quantitative methods: such as via probability of a dangerous failure per hour for continuous mode of operation
  - Qualitative methods: such as risk graph or hazardous event severity matrix



#### **Key Metrics**

- Probability of dangerous failure per hour (PFH)
  - Target values depend on the mode of the system (low demand versus *high demand/continuous*), the complexity of the system (Type A (simple) versus *Type B (complex)*) and additional *customer requirements*
- Safe Failure Fraction
  - the ratio of the average rate of safe failures plus dangerous detected failures of the system to the total average failure rate of the system



### **Safe Failure Fraction and Diagnostic Coverage**



- Note: SFF is computed from the **rates** (approximately, the probabilities) of the different failure classes
  - $\mathrm{SFF} = (\sum \lambda_{S} + \sum \lambda_{DD})/(\sum \lambda_{S} + \sum \lambda_{DD} + \sum \lambda_{DU})$
  - Where:
    - $\sum \lambda_{S}$: total rate of safe failures
    - $\sum \lambda_{DD}$: total rate of dangerous detected failures
    - $\sum \lambda_{DU}$: total rate of dangerous undetected failures
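As a worked illustration of the formula, here is a minimal sketch that computes SFF from the three summed failure rates. It also computes diagnostic coverage using the usual definition DC = λ_DD / (λ_DD + λ_DU), which is not spelled out on this slide. The rates in the example are made-up numbers in FIT, not device data.

```python
def safe_failure_fraction(lambda_s: float, lambda_dd: float, lambda_du: float) -> float:
    """SFF = (safe + dangerous detected) / (safe + dangerous detected + dangerous undetected)."""
    total = lambda_s + lambda_dd + lambda_du
    if total <= 0:
        raise ValueError("total failure rate must be positive")
    return (lambda_s + lambda_dd) / total


def diagnostic_coverage(lambda_dd: float, lambda_du: float) -> float:
    """DC = dangerous detected / all dangerous failures (usual definition, assumed here)."""
    dangerous = lambda_dd + lambda_du
    if dangerous <= 0:
        raise ValueError("dangerous failure rate must be positive")
    return lambda_dd / dangerous


if __name__ == "__main__":
    # Hypothetical rates in FIT (failures per 1e9 hours); only the ratios matter.
    print(safe_failure_fraction(400.0, 540.0, 60.0))
    print(diagnostic_coverage(540.0, 60.0))
```

Because both quantities are ratios, the unit of the rates cancels out as long as all three use the same one.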

### **IEC61508 Safety Lifecycle**





## **Outline for Designing a Safe System**

- **Risk Analysis**
  - Which unintended situations (hazards) can occur?
  - How likely is a hazard?
  - How dangerous is a hazard?
  - How controllable is the system in case of a hazard?
- **Requirements**
  - How to *mitigate* the hazards?
  - Are the safety functions executed correctly?
- Resulting targets and metrics
  - Safety Integrity Level 1..4
  - Dangerous failure rate
  - Diagnostic Coverage (DC)
  - Safe Failure Fraction (SFF)



Refine the system until the remaining risk is below the highest acceptable risk



## What the Standard Says for Hardware Components

| Hardware safety integrity                                                                                                                                                | Systematic safety integrity                                                                                                                                                                                                                    | Avoidance of systematic failures<br>during the different phases of the<br>lifecycle (relating to processes)                                                                                                                                                                                                            |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <ul> <li>Faults or failures to be analyzed<br/>in the derivation of safe failure<br/>fraction</li> <li>Faults or failures to be detected<br/>during operation</li> </ul> | <ul> <li>Techniques and measures to control:</li> <li>systematic failures caused by hardware and software design</li> <li>systematic failures caused by environmental stress or influences</li> <li>systematic operational failures</li> </ul> | <ul> <li>Recommendations to avoid<br/>mistakes:</li> <li>during specification of E/E/PES<br/>requirements</li> <li>during E/E/PES design and<br/>development</li> <li>during E/E/PES integration</li> <li>during E/E/PES operation and<br/>maintenance procedures</li> <li>during E/E/PES safety validation</li> </ul> |
| <ul> <li>Recommended</li> <li>Highly recommended</li> <li>Mandatory measures</li> </ul>                                                                                  | <ul> <li>Recommended</li> <li>Highly recommended</li> <li>Mandatory measures</li> </ul>                                                                                                                                                        | <ul> <li>Recommended</li> <li>Highly recommended</li> <li>Mandatory techniques</li> </ul>                                                                                                                                                                                                                              |
| Guidelines for assessing the<br>maximum diagnostic coverage<br>considered achievable through<br>various techniques                                                       | Guidelines for assessing the<br>effectiveness of techniques and<br>measures to control systematic<br>failures                                                                                                                                  | Guidelines for assessing the<br>effectiveness of techniques and<br>measures to avoid systematic<br>failures                                                                                                                                                                                                            |



## Conclusion

- Applying all measures to achieve hardware safety integrity for a specific Safety Integrity Level would make a system far too expensive
- The right choice of measures is required
- (Effective!) use of error detection and diagnostic capabilities to detect dangerous failures
  - Error detection measures
    - Stop errors from propagating beyond component boundary
    - Error correction (compensation)
    - Shut down (fail-silent)
  - Self test measures
    - Ensure that the device is free from dormant faults
    - Software self-test, various BIST mechanisms





## **ISO26262 Safety Standard (draft)**





## The Nine Parts of ISO26262

- ISO 26262 is the adaptation of IEC61508 for the automotive industry
- ISO 26262 applies to safety-related E/E systems installed in road vehicles of class M, N and O (see 70/156/EEC)
- ISO 26262 consists of the following parts:
  - Part 1: Glossary
  - Part 2: Management of functional safety
  - Part 3: Concept phase
  - Part 4: Product development: system level
  - Part 5: Product development: hardware level
  - Part 6: Product development: software level
  - Part 7: Production and operation
  - Part 8: Supporting processes
  - Part 9: ASIL-oriented and safety-oriented analyses (analysis techniques)



## Objective

- ISO 26262 addresses hazards caused by safety related E/E systems due to malfunctions, excluding nominal performances of active and passive safety systems
  - Provides an automotive safety lifecycle (management, development, production, operation, service, decommissioning) and supports tailoring the necessary activities during these lifecycle phases
  - Provides an automotive specific risk-based approach for determining risk classes (Automotive Safety Integrity Levels, ASILs)
  - Uses ASILs for specifying the item's necessary safety requirements for achieving an acceptable residual risk
  - Provides requirements for validation and confirmation measures to ensure that a sufficient and acceptable level of safety is achieved



## **Quantitative Requirements ISO26262**

#### ISO 26262

- Four Automotive SILs (ASIL)
- Three key metrics
  - Probability of violation of safety goals (PVSG)
  - Single Point Fault Metric
  - Latent Fault Metric
- Hardware redundancy in structural modeling

| -             | ASIL B                        | ASIL C            | ASIL D            |
|---------------|-------------------------------|-------------------|-------------------|
| PVSG<br>[1/h] | <10 <sup>-7</sup><br>(recom.) | <10 <sup>-7</sup> | <10 <sup>-8</sup> |
| SPFM          | >90%                          | >97%              | >99%              |
| LFM           | >60%                          | >80%              | >90%              |
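The three targets have to hold together. The sketch below lists which ASILs a given set of hardware metrics would satisfy, assuming the draft values printed in the table above (the final standard may differ); `asils_met` is a hypothetical helper name.

```python
# ASIL: (PVSG upper bound [1/h], SPFM lower bound, LFM lower bound), from the table above.
TARGETS = {
    "B": (1e-7, 0.90, 0.60),   # PVSG value is only recommended for ASIL B
    "C": (1e-7, 0.97, 0.80),
    "D": (1e-8, 0.99, 0.90),
}


def asils_met(pvsg: float, spfm: float, lfm: float) -> list[str]:
    """Return every ASIL (B, C, D) whose three targets are all met."""
    return [
        asil
        for asil, (pvsg_max, spfm_min, lfm_min) in TARGETS.items()
        if pvsg < pvsg_max and spfm > spfm_min and lfm > lfm_min
    ]


if __name__ == "__main__":
    print(asils_met(pvsg=5e-8, spfm=0.98, lfm=0.85))
```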

#### Automotive Safety Integrity Levels

- ASIL: "One of four classes to specify the item's necessary safety requirements for achieving an acceptable residual risk with D representing the highest and A the lowest class"
- Approaches to determine the ASIL
  - Focus on qualitative methods: such as risk graph or hazardous event severity matrix → see next slide

## **Determining Required ASIL**

| Class of severity                                  | Class of probability of exposure regarding operational situations | Controllability by driver: C1 (simple) | C2 (normal) | C3 (difficult, uncontrollable) |
|----------------------------------------------------|-------------------------------------------------------------------|----------------------------------------|-------------|--------------------------------|
| S1: Light and moderate injuries                    | E1 (very low)                                                     | QM                                     | QM          | QM                             |
|                                                    | E2 (low)                                                          | QM                                     | QM          | QM                             |
|                                                    | E3 (medium)                                                       | QM                                     | QM          | A                              |
|                                                    | E4 (high)                                                         | QM                                     | A           | B                              |
| S2: Severe and life threatening injuries (survival probable) | E1 (very low)                                           | QM                                     | QM          | QM                             |
|                                                    | E2 (low)                                                          | QM                                     | QM          | A                              |
|                                                    | E3 (medium)                                                       | QM                                     | A           | B                              |
|                                                    | E4 (high)                                                         | A                                      | B           | C                              |
| S3: Life threatening injuries, fatal injuries      | E1 (very low)                                                     | QM                                     | QM          | A                              |
|                                                    | E2 (low)                                                          | QM                                     | A           | B                              |
|                                                    | E3 (medium)                                                       | A                                      | B           | C                              |
|                                                    | E4 (high)                                                         | B                                      | C           | D                              |
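The table translates directly into a lookup. This is a minimal sketch that encodes the rows exactly as shown above (draft values); the names `ASIL_TABLE` and `required_asil` are hypothetical.

```python
# (severity, exposure) -> ASIL for controllability classes (C1, C2, C3), as in the table above.
ASIL_TABLE = {
    ("S1", "E1"): ("QM", "QM", "QM"),
    ("S1", "E2"): ("QM", "QM", "QM"),
    ("S1", "E3"): ("QM", "QM", "A"),
    ("S1", "E4"): ("QM", "A", "B"),
    ("S2", "E1"): ("QM", "QM", "QM"),
    ("S2", "E2"): ("QM", "QM", "A"),
    ("S2", "E3"): ("QM", "A", "B"),
    ("S2", "E4"): ("A", "B", "C"),
    ("S3", "E1"): ("QM", "QM", "A"),
    ("S3", "E2"): ("QM", "A", "B"),
    ("S3", "E3"): ("A", "B", "C"),
    ("S3", "E4"): ("B", "C", "D"),
}
_COLUMN = {"C1": 0, "C2": 1, "C3": 2}


def required_asil(severity: str, exposure: str, controllability: str) -> str:
    """Look up the required ASIL (or QM) for one hazardous event."""
    return ASIL_TABLE[(severity, exposure)][_COLUMN[controllability]]


if __name__ == "__main__":
    # Fatal injuries possible, high exposure, uncontrollable by the driver.
    print(required_asil("S3", "E4", "C3"))
```

The table is regular: each step up in severity, exposure or controllability class raises the result by at most one level, and ASIL D is only reached for S3/E4/C3.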



## **Quantitative Requirements ISO26262**

#### **Key Metrics**

- Probability of violation of safety goals
  - Equivalent to PFH in IEC61508
- Single Point Fault Metric
  - Quantifies how many potentially immediately dangerous faults are either safe or detected
- Latent Fault Metric
  - Quantifies how many potentially dangerous faults that do not yet influence the application are either safe or detected → under discussion, consult standard!






## **Quantitative Requirements of IEC61508 versus ISO26262**

#### IEC 61508

- Four Safety Integrity Levels (SIL)
- Two key metrics
  - Probability of dangerous failure per hour (PFH)
  - Safe Failure Fraction (SFF)
- Hardware redundancy in formulas (HFT)

|                | SIL 1             | SIL 2             | SIL 3             |
|----------------|-------------------|-------------------|-------------------|
| PFH [1/h]      | <10 <sup>-5</sup> | <10 <sup>-6</sup> | <10 <sup>-7</sup> |
| SFF<br>(HFT=0) | >=60%             | >=90%             | >=99%             |
| SFF<br>(HFT=1) | -                 | >=60%             | >=90%             |

Note: Table adapted for typical automotive application

#### ISO 26262

- Four Automotive SILs (ASIL)
- Three key metrics
  - Probability of violation of safety goals (PVSG)
  - Single Point Fault Metric
  - Latent Fault Metric
- Hardware redundancy in structural modeling

| -             | ASIL B                        | ASIL C            | ASIL D            |
|---------------|-------------------------------|-------------------|-------------------|
| PVSG<br>[1/h] | <10 <sup>-7</sup><br>(recom.) | <10 <sup>-7</sup> | <10 <sup>-8</sup> |
| SPFM          | >90%                          | >97%              | >99%              |
| LFM           | >60%                          | >80%              | >90%              |





# **MCU Safety Continuum**





### **Integrated Safety Features**









#### **Processor Core** — Performance



Example: Freescale e200 core family built on Power Architecture<sup>®</sup> technology

#### **Example: Increased pipeline depth**

- Pipelines of typically seven or more stages allow more instructions to be processed per clock cycle
- Most instructions provide single cycle execution
- Integer and floating point multiply and multiply-accumulate in three clocks, fully pipelined

#### **Example: Dual instruction issue**

- Two execution units allow parallel processing of instructions

#### **Example: Instruction and data cache**

- I-cache to speed up executable instruction fetch
- D-cache to speed up data fetch and store
- TLB to improve the speed of virtual address translation

#### **Example: SIMD unit and FPU**

- Provides DSP capabilities
- Executes an operation on two separate sets of data

## **Processor Core** — Safety



Example: Freescale e200 core family built on Power Architecture<sup>®</sup> technology

#### Example: Memory management unit (MMU)

- Optimization of self test coverage by using different virtual addresses without relocating customer application data and code
- The MMU can be used to block erroneous accesses caused by faults in the core (exception generation)

#### Example: Multiple input shift register (MISR)

- Method for verifying all intermediate results of a set of architected registers at the end of an instruction stream
- Introduction of MISR improves observability of the core resulting in:
  - Increased self test coverage
  - Faster detection of dormant faults
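To show why a MISR improves observability, here is a minimal software model, assuming a 32-bit register with linear feedback. The feedback polynomial (the CRC-32 generator) and all names are assumptions for illustration; the actual e200 hardware implementation is not described in this presentation.

```python
MASK32 = 0xFFFFFFFF
POLY = 0x04C11DB7  # assumed feedback polynomial, chosen only for illustration


def misr_step(state: int, value: int) -> int:
    """Shift the register by one bit with feedback, then fold in one 32-bit input word."""
    msb = state >> 31
    state = (state << 1) & MASK32
    if msb:
        state ^= POLY
    return state ^ (value & MASK32)


def misr_signature(values, seed: int = 0) -> int:
    """Compress a stream of intermediate results into one 32-bit signature."""
    state = seed
    for value in values:
        state = misr_step(state, value)
    return state


if __name__ == "__main__":
    golden = [0x11111111, 0x22222222, 0xDEADBEEF, 0x00000000]
    faulty = list(golden)
    faulty[1] ^= 0x00000400  # one flipped bit in one intermediate result
    print(hex(misr_signature(golden)), hex(misr_signature(faulty)))
```

Every intermediate result influences the final signature, so a fault that corrupts only a temporary value is still visible at the end of the instruction stream.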



## **Memories and Crossbar — Safety**



Example: Typical 32-bit MPC55/56xx processor

#### Example: Memory protection unit

- Monitors all system bus transactions and evaluates the appropriateness of each transfer
- Pre-programmed region descriptors define memory spaces and associated access rights
- Unmapped references are terminated with a protection error response
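The region-descriptor check can be sketched as follows. This is a simplified model, assuming that an access is granted if any matching region allows it; the descriptor layout, the address ranges and the function name are hypothetical and do not reflect the MPC55/56xx register interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionDescriptor:
    start: int          # first address of the region
    end: int            # last address of the region (inclusive)
    rights: frozenset   # subset of {"r", "w", "x"}


def access_allowed(regions, address: int, kind: str) -> bool:
    """Return True if some region covers the address and grants this kind of access.

    A False result models termination with a protection error response,
    which also covers references that hit no region at all (unmapped).
    """
    return any(r.start <= address <= r.end and kind in r.rights for r in regions)


if __name__ == "__main__":
    regions = [
        RegionDescriptor(0x0000_0000, 0x000F_FFFF, frozenset("rx")),  # code (example range)
        RegionDescriptor(0x4000_0000, 0x4000_FFFF, frozenset("rw")),  # data (example range)
    ]
    print(access_allowed(regions, 0x4000_0010, "w"))
    print(access_allowed(regions, 0x0000_0100, "w"))
    print(access_allowed(regions, 0x8000_0000, "r"))
```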

#### Example: Error-correcting code

- Used to detect failures of data stored in flash/SRAM
- Typical solution for correcting bit flips caused by soft errors (soft error rate, SER)
- ECC module (64 data bits + 8 ECC bits) can:
  - Correct all single-bit errors
  - Detect all double-bit faults
  - Detect several faults affecting >2 bits
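The behaviour listed above matches a single-error-correcting, double-error-detecting (SECDED) code. The sketch below uses a textbook extended Hamming (72,64) construction purely for illustration; it is an assumption that this resembles the device's ECC, whose actual check matrix is not given here.

```python
CODE_BITS = 71  # Hamming positions 1..71; index 0 holds the overall parity bit
DATA_POS = [p for p in range(1, CODE_BITS + 1) if p & (p - 1)]  # non powers of two (64 slots)
PARITY_POS = [1 << i for i in range(7)]                          # 1, 2, 4, ..., 64


def _syndrome(word) -> int:
    """XOR of the positions of all set bits; zero for a valid Hamming code word."""
    s = 0
    for pos in range(1, CODE_BITS + 1):
        if word[pos]:
            s ^= pos
    return s


def encode(data: int) -> list[int]:
    """Encode 64 data bits into a 72-bit word (list of 0/1 values)."""
    word = [0] * (CODE_BITS + 1)
    for i, pos in enumerate(DATA_POS):
        word[pos] = (data >> i) & 1
    s = _syndrome(word)
    for i, pos in enumerate(PARITY_POS):
        word[pos] = (s >> i) & 1     # forces the syndrome to zero
    word[0] = sum(word) & 1          # makes the overall parity even
    return word


def decode(word):
    """Return (status, data); status is ok, corrected, double_error or uncorrectable."""
    word = list(word)
    s = _syndrome(word)
    odd = sum(word) & 1
    if s == 0 and not odd:
        status = "ok"
    elif odd and s <= CODE_BITS:
        word[s] ^= 1                 # s == 0 means the overall parity bit itself flipped
        status = "corrected"
    elif odd:
        status = "uncorrectable"     # odd number of flips pointing outside the word
    else:
        status = "double_error"      # even parity but non-zero syndrome
    data = sum(word[pos] << i for i, pos in enumerate(DATA_POS))
    return status, data


if __name__ == "__main__":
    w = encode(0x0123456789ABCDEF)
    w[17] ^= 1
    print(decode(w)[0])
```

As with any SECDED code, faults affecting three or more bits are only partly detected; some of them look like a correctable single-bit error.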







Example: Typical 32-bit MPC55/56xx processor

# **Communication — Safety**

#### Example: FlexRay<sup>TM</sup> networking

- FlexRay master controller directly linked to the crossbar
- Replicated transmission of safety-relevant data by single/dual channel FlexRay support with 2.5, 5 and 10 Mbit/s data rates
- Message buffer stored and protected in dedicated memory partition located in system memory

#### Example: Safety port

- Controller area network (CAN)-type interface supporting high bandwidth for fast MCU-MCU communication
- Bit rate up to 7.5 Mbit/s
- 32 message buffers of 0 to 8 bytes data length





## **Power Supply and Clock — Safety**



Example: Typical 32-bit MPC55/56xx processor

#### Example: Power supply

- Monitoring of internal and external supply voltages
- Over- and undervoltage detection
- Testing capability of monitoring circuitry, e.g., for detection of dormant faults

#### Example: Clock and monitoring

- Clock monitoring for system and periphery clock:
  - Loss of crystal or PLL clock
  - PLL frequency higher/lower than reference
- Redundant clock generation with internal RC oscillator
- Glitch filtering with on-chip PLL
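One common way to implement such clock monitoring is to count edges of the monitored clock during a window defined by an independent reference (for example the internal RC oscillator). The sketch below models only the comparison; the function name, the 5% tolerance and the frequencies are assumptions, not device parameters.

```python
def clock_status(edges_counted: int, f_expected_hz: float, f_ref_hz: float,
                 ref_cycles: int, tolerance: float = 0.05) -> str:
    """Classify the monitored clock from an edge count taken over `ref_cycles` reference cycles."""
    expected = f_expected_hz / f_ref_hz * ref_cycles
    if edges_counted == 0:
        return "loss_of_clock"
    if edges_counted > expected * (1 + tolerance):
        return "too_high"
    if edges_counted < expected * (1 - tolerance):
        return "too_low"
    return "ok"


if __name__ == "__main__":
    # 64 MHz PLL clock checked against a 16 MHz reference over 1000 reference cycles.
    for count in (4000, 4400, 3500, 0):
        print(count, clock_status(count, 64e6, 16e6, 1000))
```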





## Software – Safety



Example: Typical 32-bit MPC55/56xx processor

#### Example: Core self test – basic

- Coverage: instruction-set based, all addressing modes
- Integration: mostly interruptible, low integration effort
- Safety: not fault graded, determined behavior in fault-free case
- For PPC instruction set

#### Example: Core self test – advanced

- Coverage: stuck-at fault model, based on physics of failure
- Integration: partly interruptible, can be adjusted to application/OS specifics
- Safety: detailed test coverage provided, fault graded, determined behavior in fault-free and faulty case
- For selected PPC devices





# **Basic Core Self-Test**





## **CST** with Instruction Coverage Metric

|                               | BookE instructions | VLE instructions | SPE instructions     |
|-------------------------------|--------------------|------------------|----------------------|
| Instruction coverage*         | ~83% to ~98%       | ~86% to ~98%     | Estimated 85% to 99% |
| Code size (bytes)             | < 10k              | < 5k             | In development       |
| Execution time (clock cycles) | < 6000             | < 5000           | In development       |
| Supported PPC core: Z6        | Supported          | Supported        | In development       |
| Supported PPC core: Z3        | Supported          | Supported        | In development       |
| Supported PPC core: Z1        | Supported          | Supported        | Not applicable       |
| Supported PPC core: Z0        | Not applicable     | Supported        | Not applicable       |

\* Variability caused by whether instructions or operations (performed by instructions) are considered, and whether MMU and cache configuration instructions/operations are taken into account or not



## **Basic Operating Principle**

#### Application

- Triggers test execution
- Selects subset of tests to perform
- Checks actual versus expected result

#### Self test API

- Saves application context
- Prepares core and device for testing
- Calls atomic tests
- Checks results
- Restores application context
- Compresses atomic test results into one 32-bit signature

#### Atomic test

- Short piece of assembly code
- Optimized to activate and propagate faults in different core modules




## **Potential Issues beyond the Self-Test Software**



## **Mitigation Measures**

|    |                          | Can be caught by  |                         |                                       |
|----|--------------------------|-------------------|-------------------------|---------------------------------------|
|    |                          | Basic<br>Watchdog | Intelligent<br>watchdog | Application<br>check and<br>signature |
| 1  | Test not triggered       | $\checkmark$      |                         |                                       |
| 2  | Wrong test triggered     |                   |                         | $\checkmark$                          |
| 3  | Runaway                  |                   |                         | $\checkmark$                          |
| 4  | Wrong atomic test called |                   |                         | $\checkmark$                          |
| 5  | Atomic Test Runaway      | $\checkmark$      |                         |                                       |
| 6  | Test result falsified    |                   |                         | $\checkmark$                          |
| 7  | Check fails              |                   |                         | $\checkmark$                          |
| 8  | Compression fails        |                   |                         | $\checkmark$                          |
| 9  | Error handling fails     |                   | $\checkmark$            |                                       |
| 10 | Result falsified         |                   |                         | $\checkmark$                          |
| 11 | Application check fails  |                   | $\checkmark$            |                                       |

- Watchdog and redundant result check
  - External to core
  - May be device internal, however (coprocessor, ETPU, etc.)
- Application check
  - Unique result for each atomic test



## **Overall Operating Principle**







# Summary





## Summary

- Safety standards are becoming key for the design of new controller solutions and influence the architecture of virtually all building blocks
- Freescale sees safety, and in particular functional safety, as a key paradigm of next-generation electronic vehicle systems
- Freescale is continuously expanding its controller, analog and sensor product portfolio to address the needs of these systems in line with IEC61508 and ISO26262





Q&A

Thank you for attending this presentation. We'll now take a few moments to collect the audience's questions and then begin the question and answer session.

