Error Reporting and Handling: follow up on error classification


Paoloni, Gabriele <gabriele.paoloni@...>
 

Hi All

 

I am following up on the first TODO of the reference summary at the bottom.

[TODO]: investigate where these definitions can be found, so that we have a common reference and a common understanding

 

Please review and give your feedback.
@Chris please see the ‘next steps’ on this specific point; it would be good to provide similar detail as I did for Intel (mainly to understand what represents a hazard and what does not)

 

Thanks

Gab

------------------

Follow-up

------------------

 

We can use the definition in ACPI specs 2.8, Appendix N, Table 54 (Error Record Header):

Field:        Error Severity
Byte offset:  12
Byte length:  4
Description:  Indicates the severity of the error condition. The severity of the
              error record corresponds to the most severe error section.
              0 - Recoverable (also called non-fatal uncorrected)
              1 - Fatal
              2 - Corrected
              3 - Informational
              All other values are reserved.
              Note that severity of "Informational" indicates that the record
              could be safely ignored by error handling software.
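
As a minimal sketch of how the Error Severity field could be decoded: the offset (12), the length (4) and the four values come from the table above, while the little-endian byte order and the names `ErrorSeverity` and `read_severity` are assumptions made for illustration only.

```python
import struct
from enum import IntEnum


class ErrorSeverity(IntEnum):
    """Severity values as listed in the Error Record Header table."""
    RECOVERABLE = 0      # also called non-fatal uncorrected
    FATAL = 1
    CORRECTED = 2
    INFORMATIONAL = 3    # record can be safely ignored


SEVERITY_OFFSET = 12     # byte offset from the table


def read_severity(header: bytes) -> ErrorSeverity:
    """Read the 4-byte severity field (assumed little-endian) from a record header."""
    (raw,) = struct.unpack_from("<I", header, SEVERITY_OFFSET)
    return ErrorSeverity(raw)   # raises ValueError for reserved values


# Hypothetical header: 12 zero bytes followed by severity = 1
example = bytes(12) + struct.pack("<I", 1)
print(read_severity(example).name)   # FATAL
```

Reserved values are rejected rather than mapped to a default, so that an unknown severity is never silently treated as harmless.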

From the UEFI Specs we have:

Machine Check Exception (MCE): {0xE8F56FFE, 0x919C, 0x4cc5, {0xBA, 0x88, 0x65, 0xA
0x49, 0x13, 0xBB}}
A Machine Check Exception is a processor-generated exception class interrup
to system software of the presence of a fatal or recoverable error condition

 

Non-Maskable Interrupt (NMI): {0x5BAD89FF, 0xB7E6, 0x42c9, {0x81, 0x4A, 0xCF, 0x24, 0x85,
0xD6, 0xE9, 0x8A}}
Non-Maskable Interrupts are used on X64 platforms to report fatal or recoverable
platform error conditions. NMIs are reported via interrupt vector 2 on IA32 and X64
processor architecture platforms.

 

Synchronous External Abort (SEA): {0x9A78788A, 0xBBE8, 0x11E4, {0x80, 0x9E, 0x67, 0x61,
0x1E, 0x5D, 0x46, 0xB0}}
Synchronous External Aborts represent precise processor error conditions on ARM
systems (uncorrectable and/or recoverable) as described in D3.5 of the ARMv8 ARM
reference manual. This notification may be triggered by one of the following
scenarios: cache parity error, cache ECC error, external bus error, micro-architectural
error, data poisoning, and other platform errors.

 

SError Interrupt (SEI): {0x5C284C81, 0xB0AE, 0x4E87, {0xA3, 0x22, 0xB0, 0x4C, 0x85, 0x62,
0x43, 0x23}}
SError Interrupts represent asynchronous imprecise (or possibly precise) processor
error conditions on ARM systems (corrected, uncorrectable, and recoverable) as
described in D3.5 of the ARM ARM reference manual. This notification may be
triggered by one of the following scenarios: cache parity error, cache ECC error,
external bus error, micro-architectural error, data poisoning, and other platform
errors.

 

Platform Error Interrupt (PEI): {0x09A9D5AC, 0x5204, 0x4214, {0x96, 0xE5, 0x94, 0x99, 0x2E,
0x75, 0x2B, 0xCD}}
Platform Error Interrupts represent asynchronous imprecise platform error conditions
on ARM systems that may be triggered by the following scenarios: system memory
ECC error, ECC errors in system cache (e.g. shared high-level caches), vendor specific
chip errors, external platform errors.
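
To make the notification types above easier to work with, here is a rough sketch of a GUID lookup. The GUID values are copied from the text; the MCE entry is left out because its GUID is truncated above. The helper names (`efi_guid`, `notification_name`) are hypothetical, and the raw byte layout is assumed to be the usual mixed-endian EFI_GUID layout.

```python
import struct
import uuid


def efi_guid(d1, d2, d3, d4):
    """Build a UUID from the {Data1, Data2, Data3, {Data4[8]}} notation used above."""
    return uuid.UUID(bytes=struct.pack(">IHH", d1, d2, d3) + bytes(d4))


# Notification type GUIDs as quoted in the text (MCE omitted: truncated above)
NOTIFICATION_TYPES = {
    efi_guid(0x5BAD89FF, 0xB7E6, 0x42C9,
             [0x81, 0x4A, 0xCF, 0x24, 0x85, 0xD6, 0xE9, 0x8A]): "NMI",
    efi_guid(0x9A78788A, 0xBBE8, 0x11E4,
             [0x80, 0x9E, 0x67, 0x61, 0x1E, 0x5D, 0x46, 0xB0]): "SEA",
    efi_guid(0x5C284C81, 0xB0AE, 0x4E87,
             [0xA3, 0x22, 0xB0, 0x4C, 0x85, 0x62, 0x43, 0x23]): "SEI",
    efi_guid(0x09A9D5AC, 0x5204, 0x4214,
             [0x96, 0xE5, 0x94, 0x99, 0x2E, 0x75, 0x2B, 0xCD]): "PEI",
}


def notification_name(raw: bytes) -> str:
    """Map 16 raw GUID bytes (assumed mixed-endian EFI_GUID layout) to a short name."""
    return NOTIFICATION_TYPES.get(uuid.UUID(bytes_le=raw), "unknown")
```

For example, `notification_name(guid.bytes_le)` for the SEA GUID returns "SEA", and any GUID not in the table returns "unknown".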

 

 

From a HW specific point of view on Intel this would map to: https://www.intel.it/content/www/it/it/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html

FATAL Errors: RIPV bit in IA32_MCG_STATUS MSR not set.

 

Uncorrected Errors (SW Recoverable):

When IA32_MCG_CAP[24] is set, a UCR error is indicated by the following bit settings in the IA32_MCi_STATUS register:

Valid (bit 63) = 1
UC (bit 61) = 1
PCC (bit 57) = 0

There are 3 further classifications of uncorrected errors:

  1. Uncorrected no action required (UCNA) - UCNA errors require no action from system software to continue
    execution. A UCNA error is indicated with UC=1, PCC=0, S=0 and AR=0 in the IA32_MCi_STATUS register.
  2. Software recoverable action optional (SRAO) - An SRAO error, when signaled as a machine check, is indicated with
    UC=1, PCC=0, S=1, EN=1 and AR=0 in the IA32_MCi_STATUS register. System software
    needs to inspect the MCA error code fields in the IA32_MCi_STATUS register to identify the specific recovery
    action for a given SRAO error. If MISCV and ADDRV are not set, it is recommended that no system software
    error recovery be performed; however, system software can resume execution.
  3. Software recoverable action required (SRAR) - SRAR errors indicate that the error was detected and raised
    at the point of the consumption in the execution flow. An SRAR error is indicated with UC=1, PCC=0, S=1,
    EN=1 and AR=1 in the IA32_MCi_STATUS register. System software needs to inspect the MCA error code
    fields in the IA32_MCi_STATUS register to identify the specific recovery action for a given SRAR error.

 

Corrected Errors: UC bit clear in IA32_MCi_STATUS; Errors reported through CMCI interrupt
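
The bit settings above can be sketched as a small classifier. This is only an illustration under stated assumptions: the Valid, UC and PCC bit positions are the ones quoted above, whereas S = bit 56 and AR = bit 55 are my recollection of the SDM and should be checked against it. The EN bit and the RIPV check in IA32_MCG_STATUS (fatal errors) are deliberately not modelled, and IA32_MCG_CAP[24] is assumed to be set.

```python
# Bit masks for IA32_MCi_STATUS
VAL = 1 << 63   # Valid (from the text)
UC = 1 << 61    # Uncorrected (from the text)
PCC = 1 << 57   # Processor context corrupt (from the text)
S = 1 << 56     # Signaling (assumed bit position)
AR = 1 << 55    # Action required (assumed bit position)


def classify_mci_status(status: int) -> str:
    """Classify one bank status value following the rules quoted above."""
    if not status & VAL:
        return "invalid"
    if not status & UC:
        return "corrected"
    if status & PCC:
        return "uncorrected, not UCR"     # UC=1, PCC=1 is outside the UCR class
    s, ar = bool(status & S), bool(status & AR)
    if not s and not ar:
        return "UCNA"
    if s and not ar:
        return "SRAO"
    if s and ar:
        return "SRAR"
    return "unclassified"                 # S=0, AR=1 is not listed above


print(classify_mci_status(VAL | UC | S | AR))   # SRAR
```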

-------------------

Summary: In my view

-          Recoverable errors as in UEFI specs are the same as UCR errors in Intel SDM

-          [Fatal errors] and [UCR – SRAR] both represent a hazard to be handled within the PST/FTTI.
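
Written as a lookup, the two points above could look like the sketch below. It only encodes my reading of this email: the class labels and function names are hypothetical, the Fatal and Corrected mappings are inferred from the headings used earlier, and the hazard assumption has no specific application context behind it.

```python
# Intel UCR sub-classes, treated as the UEFI "Recoverable" severity
UCR_CLASSES = {"UCNA", "SRAO", "SRAR"}

# Classes assumed here to be a hazard to be handled within the PST/FTTI
ASSUMED_HAZARDS = {"fatal", "SRAR"}


def uefi_severity(intel_class: str) -> str:
    """Map an Intel error class label to the UEFI severity name."""
    if intel_class in UCR_CLASSES:
        return "Recoverable"
    if intel_class == "fatal":
        return "Fatal"
    if intel_class == "corrected":
        return "Corrected"
    raise ValueError(f"unknown error class: {intel_class}")


def is_assumed_hazard(intel_class: str) -> bool:
    """True for the classes assumed hazardous in the summary above."""
    return intel_class in ASSUMED_HAZARDS
```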

 

Next Steps

  • On ARM, try to do a detailed analysis of the error classifications as done here for Intel
  • Clarify what PEIs (Platform Error Interrupts) are used for and whether they are used to report hazardous errors

 

 

****************************************************************************************

***********          REFERENCE   SUMMARY BELOW  *********************************************

****************************************************************************************

 

 

Summary:

  • Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM
  • Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not having been consumed yet)
    [TODO]: investigate where these definitions can be found, so that we have a common reference and a common understanding
  • For Fatal errors neither ARM nor Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)
    [TODO]: analyze these paths better. It is not clear whether, for fatal errors, these paths differ between the GHES+APEI enabled and GHES+APEI disabled cases
  • For non Fatal errors Chris presented a SW recovery path that is implemented in GHES.c.
    [TODO]: investigate what happens if FW First (i.e. GHES and APEI) is not enabled, on both IA and ARM (on Intel these seem to be reported by the EDAC driver plugging into the MCE handler; e.g.:
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/i10nm_base.c#n319)
  • Correctable Errors: we are aligned on the role of EDAC in counting CEs.
    [TODO]: understand whether a systematic capability is required in the reporting of latent faults

 

 

Gab

 

---------------------------------------------------------------------
INTEL CORPORATION ITALIA S.p.A. con unico socio
Sede: Milanofiori Palazzo E 4
CAP 20094 Assago (MI)
Capitale Sociale Euro 104.000,00 interamente versato
Partita I.V.A. e Codice Fiscale  04236760155
Repertorio Economico Amministrativo n. 997124
Registro delle Imprese di Milano nr. 183983/5281/33
Soggetta ad attivita' di direzione e coordinamento di
INTEL CORPORATION, USA

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.


Christopher Temple
 

Hi Gab,

 

I don’t understand why we need all those details. The question about what constitutes a hazard is not resolved in the details.

 

We should first double check the summary - here are some comments:

 

  1. “Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM”

 

Yes, there are very low-level differences in the HW (for example Arm does not have an NMI), but ELISA really shouldn’t be arguing at that level.

 

The assumption from our side was that the general error handling flow (at the level shown on Tuesday) essentially captures what happens in either architecture. As we pointed out, the BIOS is in fact the same.

 

 

  2. “Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not having been consumed yet)”

 

One cannot designate a fatal error as hazard w/o considering

·       the interface at which the fatal error is observed,

·       the transaction that is affected, and

·       the application use case.

 

Assume, for example, a system consisting of the kernel, one or more safety critical applications and one or more non-safety critical applications.

 

A fatal memory read error (i.e. memory controller cannot return valid data to the core) occurring during the execution of the non-safety critical application, for example, will not lead to a violation of a safety goal and is hence no direct hazard.

A fatal memory read error occurring during the execution of the safety critical application will lead to loss of service – this will only lead to a violation of the safety goal if the system has been designed such that loss of this service leads to a hazard. However, as there are many (more likely) error cases that can lead to loss of service, provisions for this case need to be in place in any case.

Similarly, fatal memory read errors can also occur during other operations, such as DMA transfers, which require different reasoning.
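
The reasoning for the instruction-execution case can be sketched as a small predicate. This is a simplified illustration only: the type and field names are hypothetical, and other transactions such as DMA transfers are deliberately not covered.

```python
from dataclasses import dataclass


@dataclass
class FatalReadError:
    """Context of a fatal memory read error during instruction execution."""
    hits_safety_critical_app: bool      # which application was executing
    loss_of_service_is_hazardous: bool  # property of the system design / use case


def violates_safety_goal(err: FatalReadError) -> bool:
    """A violation needs both: a safety critical app is hit AND its loss is hazardous."""
    return err.hits_safety_critical_app and err.loss_of_service_is_hazardous


# Error in a non-safety critical application: no direct hazard
print(violates_safety_goal(FatalReadError(False, True)))   # False
```

The point of the sketch is that the error class alone is not an input: the outcome depends entirely on the affected application and on the system design.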

 

  3. “For Fatal errors neither ARM nor Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)”

This needs a more nuanced consideration.

 

Neither Arm nor Intel based systems use the Linux EDAC subsystem for the direct error reaction to fatal memory read errors that occur during instruction execution.

 

Why am I so particular?

·       Firstly, because the general term “EDAC” is often used to refer to the complete sphere of HW and SW execution around error detection and control.

o   So with this understanding of the term “EDAC” the correct statement would be that ”in both Arm and Intel based systems the HW EDAC subsystem initiates the error reaction to fatal memory read errors that occur during instruction execution”.

·       Secondly, because the term “rely” does not relate to a specific instance.

o   Both Arm and Intel based systems can use the Linux EDAC subsystem for a posteriori diagnosis of what happened.

o   So in the a posteriori phase both Arm and Intel based systems can make use of it, i.e. “rely” on it.

 

  4. “For non Fatal errors Chris presented a SW recovery path that is implemented in GHES.c.”

We explained the basic execution flow in the case of fatal memory read errors and in the case of non-fatal memory read errors – I would not call either of these a “SW recovery path”, as nothing is recovered.

 

 

  5. “Correctable Errors: we are aligned on the role of EDAC in counting CEs.”

As explained the EDAC counts much more:

I didn’t see that this was ever disputed.

 

What we did challenge was the idea to drive a SW driven error reaction from HW_EVENT_ERR_FATAL by placing the Linux EDAC subsystem into a critical path in the event of fatal memory read errors that occur during instruction execution.

 

We raised two arguments:

  • the termination path via the hardware is faster;
  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

 

Is this disputed any longer?

 

We should try to keep the level of detail needed as simple as possible.

 

I’m happy to explain our thoughts on Tuesday.

 

Best regards

Chris

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Friday, 24 July 2020 13:18
To: myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; Christopher Temple <Christopher.Temple@...>; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 



Paoloni, Gabriele <gabriele.paoloni@...>
 

Hi Chris

 

To be honest, from my point of view it is important to just assign some safety requirements to Linux to get started with the safety analyses. I would expect detailed and more correct TSRs to come from domain specific WGs.
Hence I proposed EDAC (because if you rely on NMI error handling instead of MCA, then EDAC is responsible for error handling): from my personal point of view I would be happy to start by assigning some safety
reqs, maybe not complete and not fully correct, that would at least allow us to focus on the safety analyses (FFI, FMEA or FTA). There I believe we’ll find common challenges regardless of the specific Linux subsystem;
challenges that we can solve, and whose solutions we can apply to other scenarios once we have them from the domain specific WGs.

 

However, EDAC was challenged and we have now moved on to analyzing error reporting and handling specifically for MCA and the ARM counterpart of MCA (see Tony’s and your presentation last Tue). My end goal is still to come up
with some TSRs that can be accepted and that would allow us to move on. Hence I think we need to:

  a) Classify the errors and decide which we can ‘assume’ to be a hazard
  b) See how these ‘hazardous’ errors are handled and if it makes sense to rely on Linux (see email from Lukas “[…] How does the HW react to a memory fault, that was caused by a required memory access for a specific operation of the safety function?”)
  c) Following on from b), define TSRs to be allocated to Linux

 

In my previous analysis of a), since I don’t have a specific context for a specific application running on a specific system, I just assume that a fatal error, if not reported, is a hazard. The same goes for UE SRAR, because there the error has been consumed and an action shall be taken to avoid incorrect data.

Having said that, the reason why I asked ARM to do a similar analysis is to move on to the next steps b) and c), where we go and check the Linux paths involved in handling the respective errors.

 

With respect to your points below

  • the termination path via the hardware is faster;

Do all systems have a termination path via HW? Wouldn’t it be good to rely on Linux, or to analyse how much we can rely on Linux?
And BTW, is it really faster or safer? See the previous discussion with Corey.

  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

Even if you rely on FW to handle errors you would have the same problem (see previous discussion with Corey).

 

BTW, in my view it is crucial to assign safety reqs to Linux in order to start the analyses; these may not be fully correct in some specific contexts, or not complete, but they’ll allow us to go on with the safety analyses and to start looking at the challenges that we’ll also encounter later on, when TSRs come from domain specific WGs.
If we continue like this we will go round in circles and will not be able to work on the core of this WG.

 

Thanks

Gab

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Friday, July 24, 2020 6:49 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I don’t understand why we need all those details. The question about what constitutes a hazard is not resolved in the details.

 

We should first double check the summary - here some comments:

 

  1. “Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM”

 

Yes, there are very low level differences in the HW (for example Arm does not have an NMI), but ELISA really shouldn’t be arguing at that level.

 

The assumption from our side was that the general error handling flow (at the level shown on Tuesday) essentially captures what happens in either architecture. As we pointed out the BIOS  is in fact the same.

 

 

  1. “Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not been consumed already)”

 

One cannot designate a fatal error as hazard w/o considering

·         the interface at which the fatal error is observed,

·         the transaction that is affected, and

·         the application use case.

 

Assume, for example, a system consisting of the kernel, one or more safety critical applications and one or more non-safety critical applications.

 

A fatal memory read error (ie. memory controller cannot return valid data to the core) occurring during the execution of the non-safety critical application, for example, will not lead to a violation of a safety goal and is hence no direct hazard.

 

A fatal memory read error occurring during the execution of the safety critical application will lead to loss of service – this will only lead to a violation of the safety goal if the system has been designed such that loss of this service leads to a hazard. However, as there are many (more likely) error cases that can lead to loss of service provisions for this case need to be in place in any case.

 

Similarly fatal memory read errors can also occur during other operations, such as DMA transfers, which require different reasoning.

 

  1. “For Fatal error neither ARM or Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)”

This needs a more nuanced consideration.

 

Neither Arm nor Intel based systems use the Linux EDAC subsystem for the direct error reaction to fatal memory read errors that occur during instruction execution.

 

Why am I so particular?

·         Firstly, because the general term “EDAC” is often used to refer to the complete sphere of HW and SW execution around error detection and control.

o   So with this understanding of the term “EDAC” the correct statement would be that ”in both Arm and Intel based systems the HW EDAC subsystem initiates the error reaction to fatal memory read errors that occur during instruction execution”.

·         Secondly, because the term “rely” does not relate to a specific instance.

o   Both Arm and Intel based systems can use the Linux EDAC subsystem for a posteriori diagnosis of what happened.

o   So in the a posteriori phase both Arm and Intel based systems can make use of it, i.e. “rely” on it.

 

  1. For non Fatal Error Chris presented a SW recovery path that is implemented in GHES.c.”

We explained the basic execution flow in the case of fatal memory read errors and in the case of non-fatal memory read errors  – I would not call either of this a “SW recovery path”, as nothing is recovered.

 

 

  1. “Correctable Errors: we are aligned on the role of EDAC in counting CEs.”

As explained the EDAC counts much more:

I didn’t see that this was ever disputed.

 

What we did challenge was the idea to drive a SW driven error reaction from HW_EVENT_ERR_FATAL by placing the Linux EDAC subsystem into a critical path in the event of fatal memory read errors that occur during instruction execution.

 

We raised two arguments:

  • the termination path via the hardware is faster;
  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

 

Is this disputed any longer?

 

We should try to keep the level of detail needed as simple as possible.

 

I’m happy to explain our thoughts on Tuesday.

 

Best regards

Chris

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Freitag, 24. Juli 2020 13:18
To: myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; Christopher Temple <Christopher.Temple@...>; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi All

 

I am following up on the first TODO of the reference summary at the bottom.

[TODO]: to investigate on where these definitions can be found to have a common reference and a common understanding

 

Please review and give your feedbacks.
@Chris please see the ‘next steps’ on this specific point; it would be good to provide similar detail as I did for Intel (mainly to understand what represent a hazard and what does not)

 

Thanks

Gab

------------------

Follow-up

------------------

 

We can use definition as in ACPI specs 2.8 AppendixN – Table54 Error Record Header:

Error Severity

12

4

Indicates the severity of the error condition. The severity of the
error record corresponds to the most severe error section.
0 - Recoverable (also called non-fatal uncorrected)
1 - Fatal
2 - Corrected
3 - Informational
All other values are reserved.
Note that severity of "Informational" indicates that the record
could be safely ignored by error handling software.

From the UEFI Specs we have:

Machine Check Exception (MCE): {0xE8F56FFE, 0x919C, 0x4cc5, {0xBA, 0x88, 0x65, 0xA
0x49, 0x13, 0xBB}}
A Machine Check Exception is a processor-generated exception class interrup
to system software of the presence of a fatal or recoverable error condition

 

Non-Maskable Interrupt (NMI): {0x5BAD89FF, 0xB7E6, 0x42c9, {0x81, 0x4A, 0xCF, 0x24, 0x85,
0xD6, 0xE9, 0x8A}}
Non-Maskable Interrupts are used on X64 platforms to report fatal or recoverable
platform error conditions. NMIs are reported via interrupt vector 2 on IA32 and X64
processor architecture platforms.

 

Synchronous External Abort (SEA): {0x9A78788A, 0xBBE8, 0x11E4, {0x80, 0x9E, 0x67, 0x61,
0x1E, 0x5D, 0x46, 0xB0}}
Synchronous External Aborts represent precise processor error conditions on ARM
systems (uncorrectable and/or recoverable) as described in D3.5 of the ARMv8 ARM
reference manual. This notification may be triggered by one of the following
scenarios: cache parity error, cache ECC error, external bus error, micro-architectural
error, data poisoning, and other platform errors.

 

SError Interrupt (SEI): {0x5C284C81, 0xB0AE, 0x4E87, {0xA3, 0x22, 0xB0, 0x4C, 0x85, 0x62,
0x43, 0x23}}
SError Interrupts represent asynchronous imprecise (or possibly precise) processor
error conditions on ARM systems (corrected, uncorrectable, and recoverable) as
described in D3.5 of the ARM ARM reference manual. This notification may be
triggered by one of the following scenarios: cache parity error, cache ECC error,
external bus error, micro-architectural error, data poisoning, and other platform
errors.

 

Platform Error Interrupt (PEI): {0x09A9D5AC, 0x5204, 0x4214, {0x96, 0xE5, 0x94, 0x99, 0x2E,
0x75, 0x2B, 0xCD}}
Platform Error Interrupts represent asynchronous imprecise platform error conditions
on ARM systems that may be triggered by the following scenarios: system memory
ECC error, ECC errors in system cache (e.g. shared high-level caches), vendor specific
chip errors, external platform errors.
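To make the notification types above easier to work with, here is a small C sketch that maps a notification-type GUID to its short name. The struct layout and all names are hypothetical illustrations, not an existing API. The MCE entry is left out on purpose because its GUID is truncated in the excerpt above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory form of a GUID: {d1, d2, d3, {d4[0..7]}}. */
struct guid {
    uint32_t d1;
    uint16_t d2;
    uint16_t d3;
    uint8_t  d4[8];
};

struct notify_type {
    const char *name;
    struct guid id;
};

/* GUID values copied from the excerpts above (MCE omitted: truncated). */
static const struct notify_type notify_types[] = {
    { "NMI", { 0x5BAD89FF, 0xB7E6, 0x42c9,
               { 0x81, 0x4A, 0xCF, 0x24, 0x85, 0xD6, 0xE9, 0x8A } } },
    { "SEA", { 0x9A78788A, 0xBBE8, 0x11E4,
               { 0x80, 0x9E, 0x67, 0x61, 0x1E, 0x5D, 0x46, 0xB0 } } },
    { "SEI", { 0x5C284C81, 0xB0AE, 0x4E87,
               { 0xA3, 0x22, 0xB0, 0x4C, 0x85, 0x62, 0x43, 0x23 } } },
    { "PEI", { 0x09A9D5AC, 0x5204, 0x4214,
               { 0x96, 0xE5, 0x94, 0x99, 0x2E, 0x75, 0x2B, 0xCD } } },
};

/* Returns the short name for a known notification type, or NULL. */
const char *notify_type_name(const struct guid *g)
{
    size_t n = sizeof(notify_types) / sizeof(notify_types[0]);

    for (size_t i = 0; i < n; i++) {
        const struct guid *k = &notify_types[i].id;

        /* Compare field by field so struct padding never matters. */
        if (g->d1 == k->d1 && g->d2 == k->d2 && g->d3 == k->d3 &&
            memcmp(g->d4, k->d4, sizeof(g->d4)) == 0)
            return notify_types[i].name;
    }
    return NULL;
}
```

A lookup like this only identifies how an error was signalled; whether the error is hazardous still depends on the severity and on the system context.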

 

 

From a HW specific point of view on Intel this would map to: https://www.intel.it/content/www/it/it/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html

FATAL Errors: RIPV bit in IA32_MCG_STATUS MSR not set.

 

Uncorrected Errors (SW Recoverable):

When IA32_MCG_CAP[24] is set, a UCR error is indicated by the following bit settings in the IA32_MCi_STATUS register:

Valid (bit 63) = 1
UC (bit 61) = 1
PCC (bit 57) = 0

There are 3 further classifications of uncorrected errors:

  1. Uncorrected no action required (UCNA) - UCNA errors require no action from system software to continue
    execution. A UCNA error is indicated with UC=1, PCC=0, S=0 and AR=0 in the IA32_MCi_STATUS register.
  2. Software recoverable action optional (SRAO) - An SRAO error, when signaled as a machine check, is indicated with
    UC=1, PCC=0, S=1, EN=1 and AR=0 in the IA32_MCi_STATUS register. System software
    needs to inspect the MCA error code fields in the IA32_MCi_STATUS register to identify the specific recovery
    action for a given SRAO error. If MISCV and ADDRV are not set, it is recommended that no system software
    error recovery be performed; however, system software can resume execution.
  3. Software recoverable action required (SRAR) - SRAR errors indicate that the error was detected and raised
    at the point of the consumption in the execution flow. An SRAR error is indicated with UC=1, PCC=0, S=1,
    EN=1 and AR=1 in the IA32_MCi_STATUS register. System software needs to inspect the MCA error code
    fields in the IA32_MCi_STATUS register to identify the specific recovery action for a given SRAR error.
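The bit combinations above can be written down as a small decoder. This is only a sketch of the classification as described in this mail, not the Linux MCE handler. The enum and function names are hypothetical. The positions of the EN, S and AR bits (60, 56, 55) are not given in the text above and are taken from the Intel SDM layout of IA32_MCi_STATUS.

```c
#include <stdint.h>

#define MCG_CAP_SER_P   (1ULL << 24) /* IA32_MCG_CAP[24] */
#define MCI_STATUS_VAL  (1ULL << 63) /* Valid */
#define MCI_STATUS_UC   (1ULL << 61) /* Uncorrected */
#define MCI_STATUS_EN   (1ULL << 60) /* Error reporting enabled (SDM) */
#define MCI_STATUS_PCC  (1ULL << 57) /* Processor context corrupt */
#define MCI_STATUS_S    (1ULL << 56) /* Signaling (SDM) */
#define MCI_STATUS_AR   (1ULL << 55) /* Action required (SDM) */

/* Hypothetical classification result for one bank status value. */
enum mc_class {
    MC_NOT_VALID,  /* Valid bit clear: nothing logged */
    MC_CORRECTED,  /* UC clear */
    MC_UCNA,       /* uncorrected, no action required */
    MC_SRAO,       /* software recoverable, action optional */
    MC_SRAR,       /* software recoverable, action required */
    MC_NOT_UCR     /* uncorrected, but not one of the UCR cases above */
};

enum mc_class classify_mci_status(uint64_t mcg_cap, uint64_t status)
{
    if (!(status & MCI_STATUS_VAL))
        return MC_NOT_VALID;
    if (!(status & MCI_STATUS_UC))
        return MC_CORRECTED;

    /* UCR classes only exist when IA32_MCG_CAP[24] is set and PCC=0. */
    if (!(mcg_cap & MCG_CAP_SER_P) || (status & MCI_STATUS_PCC))
        return MC_NOT_UCR;

    int s  = (status & MCI_STATUS_S)  != 0;
    int ar = (status & MCI_STATUS_AR) != 0;
    int en = (status & MCI_STATUS_EN) != 0;

    if (!s && !ar)
        return MC_UCNA;
    if (s && en && !ar)
        return MC_SRAO;
    if (s && en && ar)
        return MC_SRAR;

    /* Any other combination is not listed in the text above. */
    return MC_NOT_UCR;
}
```

Note that this looks at a single IA32_MCi_STATUS value only; the fatal case described above is decided by the RIPV bit in IA32_MCG_STATUS, which this sketch does not read.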

 

Corrected Errors: UC bit clear in IA32_MCi_STATUS; Errors reported through CMCI interrupt

-------------------

Summary: In my view

-          Recoverable errors as in UEFI specs are the same as UCR errors in Intel SDM

-          [Fatal errors] and [UCR – SRAR] both represent a hazard to be handled within the PST/FTTI.

 

Next Steps

  • On ARM, do a detailed analysis of the error classifications, as done here for Intel
  • Clarify what PEIs (Platform Error Interrupts) are used for and whether they are used to report hazardous errors

 

 

****************************************************************************************

***********          REFERENCE   SUMMARY BELOW  *********************************************

****************************************************************************************

 

 

Summary:

  • Chris presented the error handling, focusing on how ARM behaves but trying to find a common path for both Intel and ARM
  • Fatal errors represent a direct hazard, whereas non-fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory, with the memory not having been consumed yet)
    [TODO]: investigate where these definitions can be found, so that we have a common reference and a common understanding
  • For fatal errors neither ARM nor Intel relies on EDAC (for ARM64 these are handled in do_sea and do_serror, whereas for Intel these are handled in the MCE)
    [TODO]: analyze these paths better. It is not clear whether, for fatal errors, these paths differ between GHES+APEI enabled and GHES+APEI disabled
  • For non-fatal errors Chris presented a SW recovery path that is implemented in GHES.c.
    [TODO]: investigate what happens if FW First (i.e. GHES and APEI) is not enabled on both IA and ARM (on Intel these seem to be reported by the EDAC driver plugging into the MCE handler; e.g.:
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/i10nm_base.c#n319)
  • Correctable Errors: we are aligned on the role of EDAC in counting CEs.
    [TODO]: understand whether a systematic capability is required in the reporting of latent faults

 

 

Gab

 

---------------------------------------------------------------------
INTEL CORPORATION ITALIA S.p.A. con unico socio
Sede: Milanofiori Palazzo E 4
CAP 20094 Assago (MI)
Capitale Sociale Euro 104.000,00 interamente versato
Partita I.V.A. e Codice Fiscale  04236760155
Repertorio Economico Amministrativo n. 997124
Registro delle Imprese di Milano nr. 183983/5281/33
Soggetta ad attivita' di direzione e coordinamento di
INTEL CORPORATION, USA

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.



Christopher Temple
 

Hi Gab,

 

I agree – we need useful safety requirements for Linux, and it would be great if we received some safety requirements from the domains that are closer to the real world.

 

The question at hand was whether it makes sense to include the Linux EDAC subsystem in the mitigation of SPFs. The envisioned use case consists of the kernel, a safety application and an EDAC monitor app.

 

The conclusion from the discussions seems to be that at least in the use case drafted by the architecture WG some weeks ago this doesn’t really add value.

 

I’ve summarized the conclusion in the table:

 

 

IMHO we have gained some very valuable insights from the discussion, ranging from inaccuracies in the descriptions on kernel.org to the need to understand how Linux interacts with hardware in failure mode situations.

 

The learning for this use case seems to be that

  • the safety integrity for SPF really resides in the Linux execution in the synchronous path and not in the asynchronous path (including firmware);
  • the interaction with an external safeing device is better done via some watchdog mechanism through the safety application;
  • the asynchronous path is important for latent faults (as you write, we still need to clarify whether the full symmetric integrity of the safety goal is needed in this case).

 

I hope I got it right.

 

Let’s discuss tomorrow.

 

Best regards

Chris

 

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Samstag, 25. Juli 2020 10:20
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

To be honest, from my point of view it is important to just assign some safety requirements to Linux to get started with the safety analyses. I would expect detailed and more correct TSRs to come from the domain-specific WGs.
Hence I proposed EDAC (because if you rely on NMI error handling instead of MCA, then EDAC is responsible for error handling). From my personal point of view I would be happy to start by assigning some safety reqs, maybe not complete and not fully correct, but that at least would allow us to focus on the safety analyses (FFI, FMEA or FTA). There I believe we’ll find common challenges regardless of the specific Linux subsystem; challenges that we can solve, and whose solutions we can apply to other scenarios once we have them from the domain-specific WGs.

 

However, EDAC was challenged, and we have now moved on to analyzing error reporting and handling specifically for MCA and the ARM counterpart of MCA (see Tony’s and your presentation last Tue). My end goal is still to come up with some TSRs that can be accepted and that would allow us to move on. Hence I think we need to:

  a. Classify the errors and decide which we can ‘assume’ to be a hazard
  b. See how these ‘hazardous’ errors are handled and if it makes sense to rely on Linux (see email from Lukas “[…] How does the HW react to a memory fault, that was caused by a required memory access for a specific operation of the safety function?”)
  c. Following on from b), define TSRs to be allocated to Linux

 

In my previous analysis for a), since I don’t have a specific context for a specific application running on a specific system, I just assume that a fatal error, if not reported, is a hazard, and so is a UE SRAR, because for UE SRAR the error has been consumed and an action shall be taken to avoid incorrect data.

Having said that, the reason why I asked ARM to do a similar analysis is to move on to the next steps b) and c), where we go and check the Linux paths involved in handling the respective errors.

 

With respect to your points below

  • the termination path via the hardware is faster;

Do all systems have a termination path via HW? Wouldn’t it be good to rely on Linux, or to analyse how much we can rely on Linux?
And BTW is it really faster or safer (see previous discussion with Corey)?

  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

Even if you rely on FW to handle errors you would have the same problem (see previous discussion with Corey)

 

BTW in my view it is crucial to assign safety reqs to Linux in order to start the analyses; these may not be fully correct in some specific contexts, or not complete, but they’ll allow us to go on with the safety analyses and to start looking at the challenges that we’ll also encounter later on, as TSRs come from the domain-specific WGs.
If we continue like this we will go around in circles and will not be able to work on the core of this WG.

 

Thanks

Gab

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Friday, July 24, 2020 6:49 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I don’t understand why we need all those details. The question about what constitutes a hazard is not resolved in the details.

 

We should first double check the summary - here are some comments:

 

  1. “Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM”

 

Yes, there are very low level differences in the HW (for example Arm does not have an NMI), but ELISA really shouldn’t be arguing at that level.

 

The assumption from our side was that the general error handling flow (at the level shown on Tuesday) essentially captures what happens in either architecture. As we pointed out, the BIOS is in fact the same.

 

 

  2. “Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not been consumed already)”

 

One cannot designate a fatal error as hazard w/o considering

·       the interface at which the fatal error is observed,

·       the transaction that is affected, and

·       the application use case.

 

Assume, for example, a system consisting of the kernel, one or more safety critical applications and one or more non-safety critical applications.

 

A fatal memory read error (i.e. the memory controller cannot return valid data to the core) occurring during the execution of the non-safety critical application, for example, will not lead to a violation of a safety goal and is hence no direct hazard.

 

A fatal memory read error occurring during the execution of the safety critical application will lead to loss of service – this will only lead to a violation of the safety goal if the system has been designed such that loss of this service leads to a hazard. However, as there are many (more likely) error cases that can lead to loss of service, provisions for this case need to be in place anyway.

 

Similarly fatal memory read errors can also occur during other operations, such as DMA transfers, which require different reasoning.

 

  3. “For Fatal error neither ARM or Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)”

This needs a more nuanced consideration.

 

Neither Arm nor Intel based systems use the Linux EDAC subsystem for the direct error reaction to fatal memory read errors that occur during instruction execution.

 

Why am I so particular?

·       Firstly, because the general term “EDAC” is often used to refer to the complete sphere of HW and SW execution around error detection and control.

o   So with this understanding of the term “EDAC” the correct statement would be that ”in both Arm and Intel based systems the HW EDAC subsystem initiates the error reaction to fatal memory read errors that occur during instruction execution”.

·       Secondly, because the term “rely” does not relate to a specific instance.

o   Both Arm and Intel based systems can use the Linux EDAC subsystem for a posteriori diagnosis of what happened.

o   So in the a posteriori phase both Arm and Intel based systems can make use of it, i.e. “rely” on it.

 

  4. “For non Fatal Error Chris presented a SW recovery path that is implemented in GHES.c.”

We explained the basic execution flow in the case of fatal memory read errors and in the case of non-fatal memory read errors – I would not call either of these a “SW recovery path”, as nothing is recovered.

 

 

  5. “Correctable Errors: we are aligned on the role of EDAC in counting CEs.”

As explained the EDAC counts much more:

I didn’t see that this was ever disputed.

 

What we did challenge was the idea of driving a SW-driven error reaction from HW_EVENT_ERR_FATAL by placing the Linux EDAC subsystem into a critical path in the event of fatal memory read errors that occur during instruction execution.

 

We raised two arguments:

  • the termination path via the hardware is faster;
  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

 

Is this disputed any longer?

 

We should try to keep the level of detail needed as simple as possible.

 

I’m happy to explain our thoughts on Tuesday.

 

Best regards

Chris

 

 




Paoloni, Gabriele <gabriele.paoloni@...>
 

Hi Chris

 

Thanks, pls see GP inline

 

Gab

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Monday, July 27, 2020 11:51 AM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I agree – we need useful safety requirements for Linux, and it would be great if we received some safety requirements from the domains that are closer to the real world.

GP: good we’re on the same page here 😊

 

The question at hand was whether it makes sense to include the Linux EDAC subsystem in the mitigation of SPFs. The envisioned use case consists of the kernel, a safety application and an EDAC monitor app.

GP: From our last discussions it seems that we are transitioning from EDAC to the memory fault handling in Linux, but in order to assign safety reqs we need to

  • Understand error types and define which ones we ‘assume’ to be dangerous for a generic safety app running on top
  • Identify which parts of Linux are involved in reporting and handling the dangerous errors depending on the different HWs (e.g. ARM vs Intel vs Others) and configurations (e.g. ACPI APEI on vs ACPI APEI off)
  • Allocate SReqs to Linux accordingly

 

The conclusion from the discussions seems to be that at least in the use case drafted by the architecture WG some weeks ago this doesn’t really add value.

 

I’ve summarized the conclusion in the table:

 

 

IMHO we have gained some very valuable insights from the discussion, ranging from inaccuracies in the descriptions on kernel.org to the need to understand how Linux interacts with hardware in failure mode situations.

GP: Right now, if we take away the EDAC framework and rely on Linux to kill itself or the safety app, we need to allocate SReqs to Linux accordingly. So yes, you are right, we are transitioning from EDAC to a different model

 

The learning for this use case seems to be that

  • the safety integrity for SPF really resides in the Linux execution in the synchronous path and not in the asynchronous path (including firmware);

GP: Here I think we need to investigate further. From an OS point of view, exceptions are always asynchronous. Looking at your slide, what you call the async flow is related to the handling of UE SW Recoverable errors. On IA I think that a part of these can be considered dangerous, as the data has already been consumed; hence, if no corrective action is taken, we can have data corruption and hence a hazard. As you can see at the bottom of this thread, I am asking for a similar investigation on ARM to make sure that what you call ‘async flow’ is always related to errors that do not represent a hazard

  • the interaction with an external safeing device is done better via some watchdog mechanism through the safety application.

GP: I agree on the WD, yes

  • the asynchronous path is important for latent faults (as you write, we still need to clarify whether the full symmetric integrity of the safety goal is needed in this case)

 

I hope I got it right.

 

Let’s discuss tomorrow.

GP: OK, I’ll save 15min for the second part of the meeting as we have already the Linux Kernel map on the agenda.

 

Best regards

Chris

 

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Samstag, 25. Juli 2020 10:20
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

To be honest, from my point of view it is important simply to assign some safety requirements to Linux so that we can get started with the safety analyses. I would expect more detailed and more correct TSRs to come from the domain-specific WGs.
That is why I proposed EDAC (if you rely on NMI error handling instead of MCA, then EDAC is responsible for error handling). Personally I would be happy to start by assigning some safety
reqs, maybe not complete and not fully correct, that would at least allow us to focus on the safety analyses (FFI, FMEA or FTA). There I believe we’ll find common challenges regardless of the specific Linux subsystem:
challenges that we can solve, and whose solutions we can apply to other scenarios once we have them from the domain-specific WGs.

 

However, EDAC was challenged and we have now moved on to analyzing error reporting and handling specifically for MCA and its ARM counterpart (see Tony’s and your presentation last Tue). My end goal is still to come up
with some TSRs that can be accepted and that would allow us to move on. Hence I think we need to:

  a) Classify the errors and decide which we can ‘assume’ to be a hazard
  b) See how these ‘hazardous’ errors are handled and whether it makes sense to rely on Linux (see email from Lukas: “[…] How does the HW react to a memory fault, that was caused by a required memory access for a specific operation of the safety function?”)
  c) Following on from b), define TSRs to be allocated to Linux

 

In my previous analysis of a), since I don’t have a specific context (a specific application running on a specific system), I simply assume that a fatal error, if not reported, is a hazard. The same goes for UE SRAR, because there the error has already been consumed and an action shall be taken to avoid incorrect data.

Having said that, the reason why I asked ARM to do a similar analysis is to move on to the next steps b) and c), where we go and check the Linux paths involved in handling the respective errors.

 

With respect to your points below

  • the termination path via the hardware is faster;

Do all systems have a termination path via HW? Wouldn’t it be good to rely on Linux, or at least to analyse how much we can rely on Linux?
And BTW is it really faster or safer? See the previous discussion with Corey.

  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

Even if you rely on FW to handle errors you would have the same problem (see previous discussion with Corey)

 

BTW in my view it is crucial to assign safety reqs to Linux in order to start the analyses. These may not be complete, or fully correct in some specific contexts, but they’ll allow us to go on with the safety analyses and to start looking at the challenges that we’ll also encounter later on, when TSRs come from the domain-specific WGs.
If we continue like this we will keep going round in circles and will not be able to work on the core of this WG.

 

Thanks

Gab

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Friday, July 24, 2020 6:49 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I don’t understand why we need all those details. The question about what constitutes a hazard is not resolved in the details.

 

We should first double check the summary. Here are some comments:

 

  1. “Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM”

 

Yes, there are very low level differences in the HW (for example Arm does not have an NMI), but ELISA really shouldn’t be arguing at that level.

 

The assumption from our side was that the general error handling flow (at the level shown on Tuesday) essentially captures what happens in either architecture. As we pointed out, the BIOS is in fact the same.

 

 

  2. “Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not been consumed already)”

 

One cannot designate a fatal error as hazard w/o considering

·         the interface at which the fatal error is observed,

·         the transaction that is affected, and

·         the application use case.

 

Assume, for example, a system consisting of the kernel, one or more safety critical applications and one or more non-safety critical applications.

 

A fatal memory read error (i.e. the memory controller cannot return valid data to the core) occurring during the execution of the non-safety critical application, for example, will not lead to a violation of a safety goal and is hence no direct hazard.

 

A fatal memory read error occurring during the execution of the safety critical application will lead to loss of service. This will only lead to a violation of the safety goal if the system has been designed such that loss of this service leads to a hazard. However, as there are many (more likely) error cases that can lead to loss of service, provisions for this case need to be in place anyway.

 

Similarly fatal memory read errors can also occur during other operations, such as DMA transfers, which require different reasoning.
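
The three cases above can be written down as a small decision function. This is only an illustrative sketch: the names `Context`, `Verdict`, `FatalReadError` and `assess_fatal_read` are hypothetical and do not come from Linux or from any specification, and the only cases covered are the ones discussed above.

```python
from dataclasses import dataclass
from enum import Enum


class Context(Enum):
    # Where the fatal memory read error is observed (hypothetical labels).
    NON_SAFETY_APP = "non-safety critical application"
    SAFETY_APP = "safety critical application"
    OTHER = "other operation, e.g. a DMA transfer"


class Verdict(Enum):
    NO_DIRECT_HAZARD = "no direct hazard"
    HAZARD = "violation of the safety goal"
    NEEDS_SEPARATE_ANALYSIS = "requires different reasoning"


@dataclass(frozen=True)
class FatalReadError:
    context: Context
    # System design property: does loss of the safety service itself
    # lead to a hazard? Only meaningful when context is SAFETY_APP.
    loss_of_service_is_hazardous: bool = False


def assess_fatal_read(err: FatalReadError) -> Verdict:
    """Apply the case distinction from the text to one fatal read error."""
    if err.context is Context.NON_SAFETY_APP:
        # No safety goal is violated by losing a non-safety application.
        return Verdict.NO_DIRECT_HAZARD
    if err.context is Context.SAFETY_APP:
        # The error causes loss of service; whether that is a hazard
        # depends on how the system was designed.
        if err.loss_of_service_is_hazardous:
            return Verdict.HAZARD
        return Verdict.NO_DIRECT_HAZARD
    # DMA transfers and similar operations are not covered here.
    return Verdict.NEEDS_SEPARATE_ANALYSIS
```

The point of the sketch is that the verdict takes the application context as an input; the error severity alone does not determine it.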

 

  3. “For Fatal error neither ARM or Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)”

This needs a more nuanced consideration.

 

Neither Arm nor Intel based systems use the Linux EDAC subsystem for the direct error reaction to fatal memory read errors that occur during instruction execution.

 

Why am I so particular?

·         Firstly, because the general term “EDAC” is often used to refer to the complete sphere of HW and SW execution around error detection and control.

o   So with this understanding of the term “EDAC” the correct statement would be that ”in both Arm and Intel based systems the HW EDAC subsystem initiates the error reaction to fatal memory read errors that occur during instruction execution”.

·         Secondly, because the term “rely” does not relate to a specific instance.

o   Both Arm and Intel based systems can use the Linux EDAC subsystem for a posteriori diagnosis of what happened.

o   So in the a posteriori phase both Arm and Intel based systems can make use of it, i.e. “rely” on it.

 

  4. “For non Fatal Error Chris presented a SW recovery path that is implemented in GHES.c.”

We explained the basic execution flow in the case of fatal memory read errors and in the case of non-fatal memory read errors. I would not call either of these a “SW recovery path”, as nothing is recovered.

 

 

  5. “Correctable Errors: we are aligned on the role of EDAC in counting CEs.”

As explained the EDAC counts much more:

I didn’t see that this was ever disputed.

 

What we did challenge was the idea to drive a SW driven error reaction from HW_EVENT_ERR_FATAL by placing the Linux EDAC subsystem into a critical path in the event of fatal memory read errors that occur during instruction execution.

 

We raised two arguments:

  • the termination path via the hardware is faster;
  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

 

Is this disputed any longer?

 

We should try to keep the level of detail needed as simple as possible.

 

I’m happy to explain our thoughts on Tuesday.

 

Best regards

Chris

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Friday, 24 July 2020 13:18
To: myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; Christopher Temple <Christopher.Temple@...>; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi All

 

I am following up on the first TODO of the reference summary at the bottom.

[TODO]: to investigate on where these definitions can be found to have a common reference and a common understanding

 

Please review and give your feedback.
@Chris please see the ‘next steps’ on this specific point; it would be good to provide a similar level of detail as I did for Intel (mainly to understand what represents a hazard and what does not)

 

Thanks

Gab

------------------

Follow-up

------------------

 

We can use the definitions in ACPI specs 2.8, Appendix N, Table 54 (Error Record Header):

Field:        Error Severity
Byte offset:  12
Byte length:  4
Description:  Indicates the severity of the error condition. The severity of the
              error record corresponds to the most severe error section.
              0 - Recoverable (also called non-fatal uncorrected)
              1 - Fatal
              2 - Corrected
              3 - Informational
              All other values are reserved.
              Note that severity of "Informational" indicates that the record
              could be safely ignored by error handling software.
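
As a minimal sketch of how this field could be decoded, the snippet below reads the 4-byte severity at byte offset 12 and picks the most severe of several section severities. Two things are assumptions that the table does not state: the little-endian byte order, and the ranking Informational < Corrected < Recoverable < Fatal (the numeric encoding itself is not ordered by severity). All function names are hypothetical.

```python
import struct
from enum import IntEnum
from typing import Iterable, Optional


class Severity(IntEnum):
    RECOVERABLE = 0      # also called non-fatal uncorrected
    FATAL = 1
    CORRECTED = 2
    INFORMATIONAL = 3


# Assumed ranking from least to most severe (not given by the encoding).
_RANK = {
    Severity.INFORMATIONAL: 0,
    Severity.CORRECTED: 1,
    Severity.RECOVERABLE: 2,
    Severity.FATAL: 3,
}


def decode_severity(raw: int) -> Optional[Severity]:
    """Map a raw field value to a Severity; reserved values give None."""
    try:
        return Severity(raw)
    except ValueError:
        return None


def read_record_severity(header: bytes) -> Optional[Severity]:
    """Read the Error Severity field: 4 bytes at byte offset 12.

    Little-endian byte order is an assumption of this sketch.
    """
    (raw,) = struct.unpack_from("<I", header, 12)
    return decode_severity(raw)


def most_severe(sections: Iterable[Severity]) -> Severity:
    """Return the most severe entry according to the assumed ranking."""
    return max(sections, key=_RANK.__getitem__)
```

For example, a record whose sections are Corrected, Recoverable and Informational would carry the record severity Recoverable under this ranking.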

From the UEFI Specs we have:

Machine Check Exception (MCE): {0xE8F56FFE, 0x919C, 0x4cc5, {0xBA, 0x88, 0x65, 0xA
0x49, 0x13, 0xBB}}
A Machine Check Exception is a processor-generated exception class interrup
to system software of the presence of a fatal or recoverable error condition

 

Non-Maskable Interrupt (NMI): {0x5BAD89FF, 0xB7E6, 0x42c9, {0x81, 0x4A, 0xCF, 0x24, 0x85,
0xD6, 0xE9, 0x8A}}
Non-Maskable Interrupts are used on X64 platforms to report fatal or recoverable
platform error conditions. NMIs are reported via interrupt vector 2 on IA32 and X64
processor architecture platforms.

 

Synchronous External Abort (SEA): {0x9A78788A, 0xBBE8, 0x11E4, {0x80, 0x9E, 0x67, 0x61,
0x1E, 0x5D, 0x46, 0xB0}}
Synchronous External Aborts represent precise processor error conditions on ARM
systems (uncorrectable and/or recoverable) as described in D3.5 of the ARMv8 ARM
reference manual. This notification may be triggered by one of the following
scenarios: cache parity error, cache ECC error, external bus error, micro-architectural
error, data poisoning, and other platform errors.

 

SError Interrupt (SEI): {0x5C284C81, 0xB0AE, 0x4E87, {0xA3, 0x22, 0xB0, 0x4C, 0x85, 0x62,
0x43, 0x23}}
SError Interrupts represent asynchronous imprecise (or possibly precise) processor
error conditions on ARM systems (corrected, uncorrectable, and recoverable) as
described in D3.5 of the ARM ARM reference manual. This notification may be
triggered by one of the following scenarios: cache parity error, cache ECC error,
external bus error, micro-architectural error, data poisoning, and other platform
errors.

 

Platform Error Interrupt (PEI): {0x09A9D5AC, 0x5204, 0x4214, {0x96, 0xE5, 0x94, 0x99, 0x2E,
0x75, 0x2B, 0xCD}}
Platform Error Interrupts represent asynchronous imprecise platform error conditions
on ARM systems that may be triggered by the following scenarios: system memory
ECC error, ECC errors in system cache (e.g. shared high-level caches), vendor specific
chip errors, external platform errors.

 

 

From a HW specific point of view on Intel this would map to: https://www.intel.it/content/www/it/it/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html

FATAL Errors: RIPV bit in IA32_MCG_STATUS MSR not set.

 

Uncorrected Errors (SW Recoverable):

When IA32_MCG_CAP[24] is set, a UCR error is indicated by the following bit settings in the IA32_MCi_STATUS register:

Valid (bit 63) = 1
UC (bit 61) = 1
PCC (bit 57) = 0

            There are 3 further classifications of uncorrected errors:

  1. Uncorrected no action required (UCNA) - UCNA errors require no action from system software to continue
    execution. A UCNA error is indicated with UC=1, PCC=0, S=0 and AR=0 in the IA32_MCi_STATUS register.
  2. Software recoverable action optional (SRAO) - An SRAO error when signaled as a machine check is indicated with
    UC=1, PCC=0, S=1, EN=1 and AR=0 in the IA32_MCi_STATUS register. System software
    needs to inspect the MCA error code fields in the IA32_MCi_STATUS register to identify the specific recovery
    action for a given SRAO error. If MISCV and ADDRV are not set, it is recommended that no system software
    error recovery be performed; however, system software can resume execution.
  3. Software recoverable action required (SRAR) - SRAR errors indicate that the error was detected and raised
    at the point of the consumption in the execution flow. An SRAR error is indicated with UC=1, PCC=0, S=1,
    EN=1 and AR=1 in the IA32_MCi_STATUS register. System software needs to inspect the MCA error code
    fields in the IA32_MCi_STATUS register to identify the specific recovery action for a given SRAR error.

 

Corrected Errors: UC bit clear in IA32_MCi_STATUS; Errors reported through CMCI interrupt
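
The bit tests above can be sketched as follows. This is an illustration, not kernel code: the positions of Valid, UC and PCC are the ones quoted above, while the positions used for EN (bit 60), S (bit 56), AR (bit 55) and RIPV (bit 0 of IA32_MCG_STATUS) are taken from the Intel SDM register layout and are not stated in this email. The sketch also assumes IA32_MCG_CAP[24] is set, so that S and AR are meaningful. The function names and returned labels are hypothetical.

```python
# IA32_MCi_STATUS bits. VAL, UC and PCC are quoted in the text above;
# EN, S and AR follow the Intel SDM layout (an assumption of this sketch).
VAL = 1 << 63
UC = 1 << 61
EN = 1 << 60
PCC = 1 << 57
S = 1 << 56
AR = 1 << 55

# IA32_MCG_STATUS: RIPV (restart IP valid), bit 0 per the Intel SDM.
RIPV = 1 << 0


def is_fatal(mcg_status: int) -> bool:
    """Fatal as defined above: RIPV not set in IA32_MCG_STATUS."""
    return not (mcg_status & RIPV)


def classify_bank(status: int) -> str:
    """Classify one IA32_MCi_STATUS value, assuming IA32_MCG_CAP[24]=1."""
    if not status & VAL:
        return "no valid error"
    if not status & UC:
        return "corrected"
    if status & PCC:
        # UC=1 with PCC=1 falls outside the UCR definition (PCC=0).
        return "uncorrected, not UCR (PCC=1)"
    s, ar, en = bool(status & S), bool(status & AR), bool(status & EN)
    if not s and not ar:
        return "UCNA"
    if s and en and not ar:
        return "SRAO"
    if s and en and ar:
        return "SRAR"
    return "unclassified"
```

A bank status with Valid, UC, EN, S and AR set and PCC clear would be reported as SRAR by this sketch.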

-------------------

Summary: In my view

-          Recoverable errors as in UEFI specs are the same as UCR errors in Intel SDM

-          [Fatal errors] and [UCR – SRAR] both represent a hazard to be handled within the PST/FTTI.
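
To make this working assumption explicit, it could be recorded as below. The sketch is hypothetical throughout: the class labels "FATAL" and "SRAR", the function names and the millisecond time budget are placeholders, and the actual PST/FTTI value depends on the system and use case.

```python
# Error classes assumed to be hazardous in this summary (labels are
# placeholders chosen for this sketch).
ASSUMED_HAZARDOUS = frozenset({"FATAL", "SRAR"})


def is_assumed_hazard(error_class: str) -> bool:
    """True if the class is one we 'assume' to be a hazard."""
    return error_class.upper() in ASSUMED_HAZARDOUS


def handled_in_time(detected_ms: float, reacted_ms: float, budget_ms: float) -> bool:
    """Check that the reaction happened within the time budget.

    budget_ms stands for whichever of PST/FTTI applies; its value is
    system specific and not defined here.
    """
    elapsed = reacted_ms - detected_ms
    return 0 <= elapsed <= budget_ms
```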

 

Next Steps

  • On ARM, try to do a detailed analysis of the error classifications, as done here for Intel
  • Clarify what PEIs (Platform Error Interrupts) are used for and whether they are used to report hazardous errors

 

 

****************************************************************************************

***********          REFERENCE   SUMMARY BELOW  *********************************************

****************************************************************************************

 

 

Summary:

  • Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM
  • Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not been consumed already)
    [TODO]: to investigate on where these definitions can be found to have a common reference and a common understanding
  • For Fatal error neither ARM or Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)
    [TODO]: to better analyze these paths. It is not clear if for fatal errors these paths are different in the case GHES+APEI enabled vs GHES+APEI disabled
  • For non Fatal Error Chris presented a SW recovery path that is implemented in GHES.c.
    [TODO]: to investigate what happens if FW First (i.e. GHES and APEI) is not enabled on both IA and ARM (on Intel these seem to be reported by the EDAC driver plugging into the MCE handler; e.g.:
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/i10nm_base.c#n319
  • Correctable Errors: we are aligned on the role of EDAC in counting CEs.
    [TODO]: to understand if a systematic capability is required in the reporting of latent faults

 

 

Gab

 

---------------------------------------------------------------------
INTEL CORPORATION ITALIA S.p.A. con unico socio
Sede: Milanofiori Palazzo E 4
CAP 20094 Assago (MI)
Capitale Sociale Euro 104.000,00 interamente versato
Partita I.V.A. e Codice Fiscale  04236760155
Repertorio Economico Amministrativo n. 997124
Registro delle Imprese di Milano nr. 183983/5281/33
Soggetta ad attivita' di direzione e coordinamento di
INTEL CORPORATION, USA

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.



Christopher Temple
 

You write

“From an OS point of view exceptions are always asynchronous.”

 

That is correct since the OS doesn’t know when a fatal instruction memory read would occur. The exception occurs synchronously with the fatal instruction memory read to inform the CPU that there is no valid instruction available to execute.

 

Regarding the error reaction to fatal instruction memory reads you write

“On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard.”

 

What instruction do you see the x86 executing, if no instruction can be returned by the memory controller?

 

As stated early in the discussion, data read/write has its own challenges; let’s agree on the instruction read aspect first.

 

Best regards

Chris

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Monday, 27 July 2020 13:19
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

Thanks, pls see GP inline

 

Gab

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Monday, July 27, 2020 11:51 AM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I agree: we need useful safety requirements for Linux, and it would be great if we received some safety requirements from the domains that are closer to the real world.

GP: good we’re on the same page here 😊

 

The question at hand was whether it makes sense to include the Linux EDAC subsystem in the mitigation of SPFs. The envisioned use case consists of the kernel, a safety application and an EDAC monitor app.

GP: From our last discussions it seems that we are transitioning from EDAC to the memory fault handling in Linux, but in order to assign safety reqs we need to

  • Understand error types and define which ones we ‘assume’ to be dangerous for a generic safety app running on top
  • Identify which parts of Linux are involved in reporting and handling the dangerous errors depending on the different HWs (e.g. ARM vs Intel vs Others) and configurations (e.g. ACPI APEI on vs ACPI APEI off)
  • Allocate SReqs to Linux accordingly

 

The conclusion from the discussions seems to be that at least in the use case drafted by the architecture WG some weeks ago this doesn’t really add value.

 

I’ve summarized the conclusion in the table:

 

 

Imho I think we have gained some very valuable insights from the discussion, which range from inaccuracies in the descriptions on kernel.org to the need to understand how Linux interacts with hardware in failure mode situations.

GP: Right now if we take away the EDAC framework and if we rely on Linux to kill itself or the safety app we need to allocate Linux with SReqs accordingly. So yes, you are right, we are transitioning from EDAC to a different model

 

The learning for this use case seems to be that

  • the safety integrity for SPF really resides on the Linux execution in the synchronous path and not on the asynchronous path (including firmware);

GP: Here I think we need to investigate further. From an OS point of view exceptions are always asynchronous. Looking at your slide, what you call the async flow relates to the handling of UE SW Recoverable errors. On IA I think that
a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard. As you can see at the bottom of this thread, I am
asking for a similar investigation on ARM to make sure that what you call ‘async flow’ always relates to errors that do not represent a hazard.

  • the interaction with an external safeing device is done better via some watchdog mechanism through the safety application.

GP: I agree on the WD, yes

  • the asynchronous path is important for latent faults (as you write we still need to clarify if the full symmetric integrity of the safety goal is needed in this case)

 

I hope I got it right.

 

Let’s discuss tomorrow.

GP: OK, I’ll reserve 15 min in the second part of the meeting, as we already have the Linux Kernel map on the agenda.

 

Best regards

Chris

 

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Saturday, 25 July 2020 10:20
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

To be honest, from my point of view it is important simply to assign some safety requirements to Linux so that we can get started with the safety analyses. I would expect more detailed and more correct TSRs to come from the domain-specific WGs.
That is why I proposed EDAC (if you rely on NMI error handling instead of MCA, then EDAC is responsible for error handling). Personally I would be happy to start by assigning some safety
reqs, maybe not complete and not fully correct, that would at least allow us to focus on the safety analyses (FFI, FMEA or FTA). There I believe we’ll find common challenges regardless of the specific Linux subsystem:
challenges that we can solve, and whose solutions we can apply to other scenarios once we have them from the domain-specific WGs.

 

However, EDAC was challenged and we have now moved on to analyzing error reporting and handling specifically for MCA and its ARM counterpart (see Tony’s and your presentation last Tue). My end goal is still to come up
with some TSRs that can be accepted and that would allow us to move on. Hence I think we need to:

  a) Classify the errors and decide which we can ‘assume’ to be a hazard
  b) See how these ‘hazardous’ errors are handled and whether it makes sense to rely on Linux (see email from Lukas: “[…] How does the HW react to a memory fault, that was caused by a required memory access for a specific operation of the safety function?”)
  c) Following on from b), define TSRs to be allocated to Linux

 

In my previous analysis of a), since I don’t have a specific context (a specific application running on a specific system), I simply assume that a fatal error, if not reported, is a hazard. The same goes for UE SRAR, because there the error has already been consumed and an action shall be taken to avoid incorrect data.

Having said that, the reason why I asked ARM to do a similar analysis is to move on to the next steps b) and c), where we go and check the Linux paths involved in handling the respective errors.

 

With respect to your points below

  • the termination path via the hardware is faster;

Do all systems have a termination path via HW? Wouldn’t it be good to rely on Linux, or at least to analyse how much we can rely on Linux?
And BTW is it really faster or safer? See the previous discussion with Corey.

  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

Even if you rely on FW to handle errors you would have the same problem (see previous discussion with Corey)

 

BTW in my view it is crucial to assign safety reqs to Linux in order to start the analyses. These may not be complete, or fully correct in some specific contexts, but they’ll allow us to go on with the safety analyses and to start looking at the challenges that we’ll also encounter later on, when TSRs come from the domain-specific WGs.
If we continue like this we will keep going round in circles and will not be able to work on the core of this WG.

 

Thanks

Gab

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Friday, July 24, 2020 6:49 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I don’t understand why we need all those details. The question about what constitutes a hazard is not resolved in the details.

 

We should first double check the summary. Here are some comments:

 

  1. “Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM”

 

Yes, there are very low level differences in the HW (for example Arm does not have an NMI), but ELISA really shouldn’t be arguing at that level.

 

The assumption from our side was that the general error handling flow (at the level shown on Tuesday) essentially captures what happens in either architecture. As we pointed out, the BIOS is in fact the same.

 

 

  2. “Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not been consumed already)”

 

One cannot designate a fatal error as hazard w/o considering

·       the interface at which the fatal error is observed,

·       the transaction that is affected, and

·       the application use case.

 

Assume, for example, a system consisting of the kernel, one or more safety critical applications and one or more non-safety critical applications.

 

A fatal memory read error (i.e. the memory controller cannot return valid data to the core) occurring during the execution of the non-safety critical application, for example, will not lead to a violation of a safety goal and is hence no direct hazard.

 

A fatal memory read error occurring during the execution of the safety critical application will lead to loss of service. This will only lead to a violation of the safety goal if the system has been designed such that loss of this service leads to a hazard. However, as there are many (more likely) error cases that can lead to loss of service, provisions for this case need to be in place anyway.

 

Similarly fatal memory read errors can also occur during other operations, such as DMA transfers, which require different reasoning.

 

  3. “For Fatal error neither ARM or Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)”

This needs a more nuanced consideration.

 

Neither Arm nor Intel based systems use the Linux EDAC subsystem for the direct error reaction to fatal memory read errors that occur during instruction execution.

 

Why am I so particular?

·       Firstly, because the general term “EDAC” is often used to refer to the complete sphere of HW and SW execution around error detection and control.

o   So with this understanding of the term “EDAC” the correct statement would be that “in both Arm and Intel based systems the HW EDAC subsystem initiates the error reaction to fatal memory read errors that occur during instruction execution”.

·       Secondly, because the term “rely” does not relate to a specific instance.

o   Both Arm and Intel based systems can use the Linux EDAC subsystem for a posteriori diagnosis of what happened.

o   So in the a posteriori phase both Arm and Intel based systems can make use of it, i.e. “rely” on it.

 

  4. “For non Fatal Error Chris presented a SW recovery path that is implemented in GHES.c.”

We explained the basic execution flow in the case of fatal memory read errors and in the case of non-fatal memory read errors – I would not call either of these a “SW recovery path”, as nothing is recovered.

 

 

  5. “Correctable Errors: we are aligned on the role of EDAC in counting CEs.”

As explained the EDAC counts much more:

I didn’t see that this was ever disputed.

 

What we did challenge was the idea to drive a SW driven error reaction from HW_EVENT_ERR_FATAL by placing the Linux EDAC subsystem into a critical path in the event of fatal memory read errors that occur during instruction execution.

 

We raised two arguments:

  • the termination path via the hardware is faster;
  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

 

Is this disputed any longer?

 

We should try to keep the level of detail needed as simple as possible.

 

I’m happy to explain our thoughts on Tuesday.

 

Best regards

Chris

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Friday, 24 July 2020 13:18
To: myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; Christopher Temple <Christopher.Temple@...>; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi All

 

I am following up on the first TODO of the reference summary at the bottom.

[TODO]: to investigate on where these definitions can be found to have a common reference and a common understanding

 

Please review and give your feedbacks.
@Chris please see the ‘next steps’ on this specific point; it would be good to provide similar detail as I did for Intel (mainly to understand what represent a hazard and what does not)

 

Thanks

Gab

------------------

Follow-up

------------------

 

We can use definition as in ACPI specs 2.8 AppendixN – Table54 Error Record Header:

Mnemonic: Error Severity
Byte Offset: 12
Byte Length: 4
Description: Indicates the severity of the error condition. The severity of the
error record corresponds to the most severe error section.
    0 - Recoverable (also called non-fatal uncorrected)
    1 - Fatal
    2 - Corrected
    3 - Informational
All other values are reserved.
Note that severity of "Informational" indicates that the record
could be safely ignored by error handling software.
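As a minimal sketch of how this encoding could be decoded, the following Python models the four severity values. The numeric values come from the table; the names `ErrorSeverity`, `decode_severity` and `most_severe`, and the ranking used to pick the "most severe" section (fatal > recoverable > corrected > informational), are assumptions made for illustration. Note that the numeric order is not the severity order.

```python
from enum import IntEnum


class ErrorSeverity(IntEnum):
    RECOVERABLE = 0     # also called non-fatal uncorrected
    FATAL = 1
    CORRECTED = 2
    INFORMATIONAL = 3   # may be safely ignored by error handling software


# Assumed ranking, least to most severe (not spelled out in the quoted text).
_RANK = [
    ErrorSeverity.INFORMATIONAL,
    ErrorSeverity.CORRECTED,
    ErrorSeverity.RECOVERABLE,
    ErrorSeverity.FATAL,
]


def decode_severity(raw: int) -> ErrorSeverity:
    """Map the 4-byte field value to a severity; other values are reserved."""
    try:
        return ErrorSeverity(raw)
    except ValueError:
        raise ValueError(f"reserved severity value: {raw}") from None


def most_severe(section_severities) -> ErrorSeverity:
    """Record severity corresponds to the most severe error section."""
    return max((decode_severity(s) for s in section_severities), key=_RANK.index)


print(most_severe([2, 0, 3]).name)
```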

From the UEFI Specs we have:

Machine Check Exception (MCE): {0xE8F56FFE, 0x919C, 0x4cc5, {0xBA, 0x88, 0x65, 0xA
0x49, 0x13, 0xBB}}
A Machine Check Exception is a processor-generated exception class interrupt
to system software of the presence of a fatal or recoverable error condition.

 

Non-Maskable Interrupt (NMI): {0x5BAD89FF, 0xB7E6, 0x42c9, {0x81, 0x4A, 0xCF, 0x24, 0x85,
0xD6, 0xE9, 0x8A}}
Non-Maskable Interrupts are used on X64 platforms to report fatal or recoverable
platform error conditions. NMIs are reported via interrupt vector 2 on IA32 and X64
processor architecture platforms.

 

Synchronous External Abort (SEA): {0x9A78788A, 0xBBE8, 0x11E4, {0x80, 0x9E, 0x67, 0x61,
0x1E, 0x5D, 0x46, 0xB0}}
Synchronous External Aborts represent precise processor error conditions on ARM
systems (uncorrectable and/or recoverable) as described in D3.5 of the ARMv8 ARM
reference manual. This notification may be triggered by one of the following
scenarios: cache parity error, cache ECC error, external bus error, micro-architectural
error, data poisoning, and other platform errors.

 

SError Interrupt (SEI): {0x5C284C81, 0xB0AE, 0x4E87, {0xA3, 0x22, 0xB0, 0x4C, 0x85, 0x62,
0x43, 0x23}}
SError Interrupts represent asynchronous imprecise (or possibly precise) processor
error conditions on ARM systems (corrected, uncorrectable, and recoverable) as
described in D3.5 of the ARM ARM reference manual. This notification may be
triggered by one of the following scenarios: cache parity error, cache ECC error,
external bus error, micro-architectural error, data poisoning, and other platform
errors.

 

Platform Error Interrupt (PEI): {0x09A9D5AC, 0x5204, 0x4214, {0x96, 0xE5, 0x94, 0x99, 0x2E,
0x75, 0x2B, 0xCD}}
Platform Error Interrupts represent asynchronous imprecise platform error conditions
on ARM systems that may be triggered by the following scenarios: system memory
ECC error, ECC errors in system cache (e.g. shared high-level caches), vendor specific
chip errors, external platform errors.
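For reference, the notification types listed above could be looked up by GUID roughly as follows. This is a sketch only: the helper `guid`, the table `NOTIFICATION_TYPES` and the function `notification_name` are hypothetical names, the GUIDs are rendered in their canonical text form (not the in-memory layout), and the MCE entry is omitted because its GUID is truncated in the text above.

```python
import struct
import uuid


def guid(d1: int, d2: int, d3: int, d4: list) -> uuid.UUID:
    """Build a UUID from the {Data1, Data2, Data3, {Data4[8]}} notation."""
    return uuid.UUID(bytes=struct.pack(">IHH", d1, d2, d3) + bytes(d4))


# Values copied from the list above; names on the right are the abbreviations used there.
NOTIFICATION_TYPES = {
    guid(0x5BAD89FF, 0xB7E6, 0x42C9, [0x81, 0x4A, 0xCF, 0x24, 0x85, 0xD6, 0xE9, 0x8A]): "NMI",
    guid(0x9A78788A, 0xBBE8, 0x11E4, [0x80, 0x9E, 0x67, 0x61, 0x1E, 0x5D, 0x46, 0xB0]): "SEA",
    guid(0x5C284C81, 0xB0AE, 0x4E87, [0xA3, 0x22, 0xB0, 0x4C, 0x85, 0x62, 0x43, 0x23]): "SEI",
    guid(0x09A9D5AC, 0x5204, 0x4214, [0x96, 0xE5, 0x94, 0x99, 0x2E, 0x75, 0x2B, 0xCD]): "PEI",
}


def notification_name(g: uuid.UUID) -> str:
    return NOTIFICATION_TYPES.get(g, "unknown")


print(notification_name(uuid.UUID("9a78788a-bbe8-11e4-809e-67611e5d46b0")))
```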

 

 

From a HW specific point of view on Intel this would map to: https://www.intel.it/content/www/it/it/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html

FATAL Errors: RIPV bit in IA32_MCG_STATUS MSR not set.

 

Uncorrected Errors (SW Recoverable):

When IA32_MCG_CAP[24] is set, a UCR error is indicated by the following bit settings in the IA32_MCi_STATUS register:

Valid (bit 63) = 1
UC (bit 61) = 1
PCC (bit 57) = 0

There are 3 further classifications of uncorrected errors:

  1. Uncorrected no action required (UCNA) - UCNA errors require no action from system software to continue
    execution. A UCNA error is indicated with UC=1, PCC=0, S=0 and AR=0 in the IA32_MCi_STATUS register.
  2. Software recoverable action optional (SRAO) - An SRAO error when signaled as a machine check is indicated with
    UC=1, PCC=0, S=1, EN=1 and AR=0 in the IA32_MCi_STATUS register. System software
    needs to inspect the MCA error code fields in the IA32_MCi_STATUS register to identify the specific recovery
    action for a given SRAO error. If MISCV and ADDRV are not set, it is recommended that no system software
    error recovery be performed; however, system software can resume execution.
  3. Software recoverable action required (SRAR) - SRAR errors indicate that the error was detected and raised
    at the point of the consumption in the execution flow. An SRAR error is indicated with UC=1, PCC=0, S=1,
    EN=1 and AR=1 in the IA32_MCi_STATUS register. System software needs to inspect the MCA error code
    fields in the IA32_MCi_STATUS register to identify the specific recovery action for a given SRAR error.

 

Corrected Errors: UC bit clear in IA32_MCi_STATUS; Errors reported through CMCI interrupt
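The bit settings above can be turned into a small classifier. This is a sketch for discussion, not kernel code: the function name `classify_mci_status` and the returned labels are made up here, it assumes IA32_MCG_CAP[24] is set, and the bit positions of S (56) and AR (55) are taken from the Intel SDM rather than from this mail. The EN bit is ignored for simplicity.

```python
# Bit positions in IA32_MCi_STATUS. Valid, UC and PCC are given in the text;
# S and AR positions are taken from the Intel SDM, not from this thread.
VAL = 1 << 63
UC = 1 << 61
PCC = 1 << 57
S = 1 << 56
AR = 1 << 55


def classify_mci_status(status: int) -> str:
    """Classify a raw IA32_MCi_STATUS value (assumes IA32_MCG_CAP[24] is set)."""
    if not status & VAL:
        return "no valid error logged"
    if not status & UC:
        return "corrected"
    if status & PCC:
        # Valid=1, UC=1, PCC=1 is outside the UCR definition quoted above.
        return "uncorrected, not UCR (PCC=1)"
    signaled, action_required = bool(status & S), bool(status & AR)
    if not signaled and not action_required:
        return "UCNA"
    if signaled and not action_required:
        return "SRAO"
    if signaled and action_required:
        return "SRAR"
    return "undefined (S=0, AR=1)"


print(classify_mci_status(VAL | UC | S | AR))
```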

-------------------

Summary: In my view

-        Recoverable errors as in UEFI specs are the same as UCR errors in Intel SDM

-        [Fatal errors] and [UCR – SRAR] both represent a hazard to be handled within the PST/FTTI.

 

Next Steps

  • On ARM, try to do a detailed analysis of the error classifications as done here for Intel
  • Clarify what PEIs (Platform Error Interrupts) are used for and whether they are used to report hazardous errors

 

 

****************************************************************************************

***********          REFERENCE   SUMMARY BELOW  *********************************************

****************************************************************************************

 

 

Summary:

  • Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM
  • Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory, with the memory not having been consumed yet)
    [TODO]: to investigate on where these definitions can be found to have a common reference and a common understanding
  • For Fatal errors neither ARM nor Intel relies on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)
    [TODO]: to better analyze these paths. It is not clear if for fatal errors these paths are different in the case GHES+APEI enabled vs GHES+APEI disabled
  • For non Fatal Error Chris presented a SW recovery path that is implemented in GHES.c.
    [TODO]: to investigate what happens if FW First (i.e. GHES and APEI) is not enabled on both IA and ARM (on Intel these seem to be reported by the EDAC driver plugging into the MCE handler; e.g.:
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/i10nm_base.c#n319)
  • Correctable Errors: we are aligned on the role of EDAC in counting CEs.
    [TODO]: to understand if a systematic capability is required in the reporting of latent faults

 

 

Gab

 

---------------------------------------------------------------------
INTEL CORPORATION ITALIA S.p.A. con unico socio
Sede: Milanofiori Palazzo E 4
CAP 20094 Assago (MI)
Capitale Sociale Euro 104.000,00 interamente versato
Partita I.V.A. e Codice Fiscale  04236760155
Repertorio Economico Amministrativo n. 997124
Registro delle Imprese di Milano nr. 183983/5281/33
Soggetta ad attivita' di direzione e coordinamento di
INTEL CORPORATION, USA

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.



Paoloni, Gabriele <gabriele.paoloni@...>
 

Hi Chris

 

Are your async errors the same as UE SW Recoverable on Intel?

If yes, on IA I think these could be dangerous if not recovered within the FTTI or PST. For instance, the Intel SDM explanation of UE SRAR is quoted at the bottom; in my view, SW failing to take action, or worse, taking the wrong recovery action, could result in corruption and hence a hazard here.

Maybe for some reason on ARM it is safe not to handle UEs within the PST/FTTI… What I am asking is to build the picture for both ARM and IA, so that later on we can define which SW parts of Linux are assigned safety relevant tasks (i.e. safety reqs).

 

Thanks

Gab

 

 

SRAR Error And Affected Logical Processors
The affected logical processor is the one that has detected and raised an SRAR error at the point of the consumption in the execution flow. The affected logical processor should find the Data Load or the Instruction Fetch error
information in the IA32_MCi_STATUS register that is reporting the SRAR error.
Table 15-20 list the actionable scenarios that system software can respond to an SRAR error on an affected logical
processor according to RIPV and EIPV values:
Recoverable-Continuable SRAR Error (RIPV=1, EIPV=1):
For Recoverable-Continuable SRAR errors, the affected logical processor should find that both the
IA32_MCG_STATUS.RIPV and the IA32_MCG_STATUS.EIPV flags are set, indicating that system software may
be able to restart execution from the interrupted context if it is able to rectify the error condition. If system
software cannot rectify the error condition then it must treat the error as a recoverable error where restarting
execution with the interrupted context is not possible. Restarting without rectifying the error condition will
result in most cases with another SRAR error on the same instruction.

Recoverable-not-continuable SRAR Error (RIPV=0, EIPV=x):
For Recoverable-not-continuable errors, the affected logical processor should find that either
— IA32_MCG_STATUS.RIPV= 0, IA32_MCG_STATUS.EIPV=1, or
— IA32_MCG_STATUS.RIPV= 0, IA32_MCG_STATUS.EIPV=0

In either case, this indicates that the error is detected at the instruction pointer saved on the stack for this
machine check exception and restarting execution with the interrupted context is not possible. System
software may take the following recovery actions for the affected logical processor:
The current executing thread cannot be continued. System software must terminate the interrupted
stream of execution and provide a new stream of execution on return from the machine check handler
for the affected logical processor.
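The two SRAR scenarios in the excerpt above can be sketched as a decision on the RIPV and EIPV flags. This is an illustration only: the function name `srar_scenario` and the returned strings are hypothetical, and the bit positions (RIPV = bit 0, EIPV = bit 1 of IA32_MCG_STATUS) are taken from the Intel SDM rather than from the excerpt.

```python
# Bit positions in IA32_MCG_STATUS (from the Intel SDM, not from the excerpt).
RIPV = 1 << 0   # restart IP valid
EIPV = 1 << 1   # error IP valid


def srar_scenario(mcg_status: int) -> str:
    """Map RIPV/EIPV to the SRAR scenarios described in the excerpt."""
    ripv = bool(mcg_status & RIPV)
    eipv = bool(mcg_status & EIPV)
    if ripv and eipv:
        # Execution may restart only if software can rectify the error condition.
        return "recoverable-continuable"
    if not ripv:
        # EIPV may be 0 or 1; the interrupted stream of execution must be terminated.
        return "recoverable-not-continuable"
    return "not covered by the excerpt (RIPV=1, EIPV=0)"


print(srar_scenario(RIPV | EIPV))
```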

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Tuesday, July 28, 2020 9:29 AM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

You write

“From an OS point of view exceptions are always asynchronous.”

 

That is correct since the OS doesn’t know when a fatal instruction memory read would occur. The exception occurs synchronously with the fatal instruction memory read to inform the CPU that there is no valid instruction available to execute.

 

Regarding the error reaction to fatal instruction memory reads you write

“On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard.”

 

What instruction do you see the x86 executing, if no instruction can be returned by the memory controller?

 

As stated earlier in the discussion, data read/write has its own challenges – let’s agree on the instruction read aspect first.

 

Best regards

Chris

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Monday, 27 July 2020 13:19
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

Thanks, pls see GP inline

 

Gab

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Monday, July 27, 2020 11:51 AM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I agree – we need useful safety requirements for Linux and it would be great if we would receive some closer to real world safety requirements from the domains.

GP: good we’re on the same page here 😊

 

The question at hand was whether it makes sense to include the Linux EDAC subsystem in the mitigation of SPFs. The envisioned use case consists of the kernel, a safety application and an EDAC monitor app.

GP: From our last discussions it seems that we are transitioning from EDAC to the memory fault handling in Linux, but in order to assign safety reqs we need to

  • Understand error types and define which ones we ‘assume’ to be dangerous for a generic safety app running on top
  • Identify which parts of Linux are involved in reporting and handling the dangerous errors depending on the different HWs (e.g. ARM vs Intel vs Others) and configurations (e.g. ACPI APEI on vs ACPI APEI off)
  • Allocate SReqs to Linux accordingly

 

The conclusion from the discussions seems to be that at least in the use case drafted by the architecture WG some weeks ago this doesn’t really add value.

 

I’ve summarized the conclusion in the table:

 

 

Imho I think we have gained some very valuable insights from the discussion, which range from inaccuracies in the descriptions on kernel.org to the need to understand how Linux interacts with hardware in failure mode situations.

GP: Right now if we take away the EDAC framework and if we rely on Linux to kill itself or the safety app we need to allocate Linux with SReqs accordingly. So yes, you are right, we are transitioning from EDAC to a different model

 

The learning for this use case seems to be that

  • the safety integrity for SPF really resides on the Linux execution in the synchronous path and not on the asynchronous path (including firmware);

GP: Here I think we need to investigate better. From an OS point of view exceptions are always asynchronous. Now, looking at your slide, what you call async flow is related to the handling of UE SW Recoverable. On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard. As you can see at the bottom of this thread, I am asking to do a similar investigation on ARM to make sure that what you call ‘async flow’ is always related to errors that do not represent a hazard.

  • the interaction with an external safeing device is done better via some watchdog mechanism through the safety application.

GP: I agree on the WD, yes

  • the asynchronous path is important for latent faults (as you write we still need to clarify if the full symmetric integrity of the safety goal is needed in this case)

 

I hope I got it right.

 

Let’s discuss tomorrow.

GP: OK, I’ll save 15min for the second part of the meeting as we have already the Linux Kernel map on the agenda.

 

Best regards

Chris

 

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Saturday, 25 July 2020 10:20
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

To be honest, from my point of view it is important to just assign some safety requirements to Linux to get started with the safety analyses. I would expect detailed and more correct TSRs to come from domain specific WGs.
Hence I proposed EDAC (because if you rely on NMI error handling instead of MCA, then EDAC is responsible for error handling): from my personal point of view I would be happy to start by assigning some safety reqs, maybe not complete and not fully correct, but that at least would allow us to focus on the safety analyses (FFI, FMEA or FTA), where I believe we’ll find common challenges regardless of the specific Linux subsystem; challenges that we can solve, and whose solutions we can apply to other scenarios once we have them from the domain specific WGs.

 

However EDAC was challenged and now we moved to analyze error reporting and handling specifically for MCA and the ARM counterpart of MCA (see Tony and your presentation on last Tue). My end goal is still to come up
with some TSRs that can be accepted and that would allow us to move on. Hence I think we need to:

  a. Classify the errors and decide which we can ‘assume’ to be a hazard
  b. See how these ‘hazardous’ errors are handled and if it makes sense to rely on Linux (see email from Lukas “[…] How does the HW react to a memory fault, that was caused by a required memory access for a specific operation of the safety function?”)
  c. Following on from b), define TSRs to be allocated to Linux

 

In my previous analysis of a), since I don’t have a specific context (a specific application running on a specific system), I simply assume that a fatal error, if not reported, is a hazard, and so is a UE SRAR, because for UE SRAR the error has been consumed and an action must be taken to avoid incorrect data.

Having said that the reason why I asked ARM to do a similar analysis is to move on to the next steps b) and c) where we go and check the Linux paths involved in handling the respective errors.

 

With respect to your points below

  • the termination path via the hardware is faster;

Do we have a termination path via HW for all systems? Wouldn’t it be good to rely on Linux, or to analyse how much we can rely on Linux?
And BTW, is it really faster or safer… see previous discussion with Corey?

  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

Even if you rely on FW to handle errors you would have the same problem (see previous discussion with Corey)

 

BTW in my view it is crucial to assign Linux safety reqs in order to start the analyses; these may not be fully correct in some specific contexts, or not complete, but they’ll allow us to go on with the safety analyses and to start looking at the challenges that we’ll also encounter later on, as TSRs come from domain specific WGs.
If we continue like this we go back in circles and we are not able to work on the core of this WG.

 

Thanks

Gab

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Friday, July 24, 2020 6:49 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I don’t understand why we need all those details. The question about what constitutes a hazard is not resolved in the details.

 

We should first double check the summary - here some comments:

 

  1. “Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM”

 

Yes, there are very low level differences in the HW (for example Arm does not have an NMI), but ELISA really shouldn’t be arguing at that level.

 

The assumption from our side was that the general error handling flow (at the level shown on Tuesday) essentially captures what happens in either architecture. As we pointed out, the BIOS is in fact the same.

 

 

  2. “Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not been consumed already)”

 

One cannot designate a fatal error as hazard w/o considering

·         the interface at which the fatal error is observed,

·         the transaction that is affected, and

·         the application use case.

 

Assume, for example, a system consisting of the kernel, one or more safety critical applications and one or more non-safety critical applications.

 

A fatal memory read error (i.e. the memory controller cannot return valid data to the core) occurring during the execution of the non-safety critical application, for example, will not lead to a violation of a safety goal and is hence no direct hazard.

 

A fatal memory read error occurring during the execution of the safety critical application will lead to loss of service – this will only lead to a violation of the safety goal if the system has been designed such that loss of this service leads to a hazard. However, as there are many (more likely) error cases that can lead to loss of service, provisions for this case need to be in place anyway.

 

Similarly, fatal memory read errors can also occur during other operations, such as DMA transfers, which require different reasoning.

 

  3. “For Fatal error neither ARM or Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)”

This needs a more nuanced consideration.

 

Neither Arm nor Intel based systems use the Linux EDAC subsystem for the direct error reaction to fatal memory read errors that occur during instruction execution.

 

Why am I so particular?

·         Firstly, because the general term “EDAC” is often used to refer to the complete sphere of HW and SW execution around error detection and control.

o   So with this understanding of the term “EDAC” the correct statement would be that “in both Arm and Intel based systems the HW EDAC subsystem initiates the error reaction to fatal memory read errors that occur during instruction execution”.

·         Secondly, because the term “rely” does not relate to a specific instance.

o   Both Arm and Intel based systems can use the Linux EDAC subsystem for a posteriori diagnosis of what happened.

o   So in the a posteriori phase both Arm and Intel based systems can make use of it, i.e. “rely” on it.

 

  4. “For non Fatal Error Chris presented a SW recovery path that is implemented in GHES.c.”

We explained the basic execution flow in the case of fatal memory read errors and in the case of non-fatal memory read errors – I would not call either of these a “SW recovery path”, as nothing is recovered.

 

 

  5. “Correctable Errors: we are aligned on the role of EDAC in counting CEs.”

As explained the EDAC counts much more:

I didn’t see that this was ever disputed.

 

What we did challenge was the idea to drive a SW driven error reaction from HW_EVENT_ERR_FATAL by placing the Linux EDAC subsystem into a critical path in the event of fatal memory read errors that occur during instruction execution.

 

We raised two arguments:

  • the termination path via the hardware is faster;
  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

 

Is this disputed any longer?

 

We should try to keep the level of detail needed as simple as possible.

 

I’m happy to explain our thoughts on Tuesday.

 

Best regards

Chris

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Friday, 24 July 2020 13:18
To: myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; Christopher Temple <Christopher.Temple@...>; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi All

 

I am following up on the first TODO of the reference summary at the bottom.

[TODO]: to investigate on where these definitions can be found to have a common reference and a common understanding

 

Please review and give your feedback.
@Chris please see the ‘next steps’ on this specific point; it would be good to provide similar detail as I did for Intel (mainly to understand what represents a hazard and what does not)

 

Thanks

Gab

------------------

Follow-up

------------------

 

We can use the definitions in ACPI specs 2.8 Appendix N, Table 54 (Error Record Header):

Field: Error Severity
Byte offset: 12
Byte length: 4
Description: Indicates the severity of the error condition. The severity of the
error record corresponds to the most severe error section.
  0 - Recoverable (also called non-fatal uncorrected)
  1 - Fatal
  2 - Corrected
  3 - Informational
All other values are reserved.
Note that severity of "Informational" indicates that the record
could be safely ignored by error handling software.
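
A minimal sketch of how error handling software could decode this field, assuming the record header is available as raw bytes, that the numbers 12 and 4 quoted above are the byte offset and byte length, and that the field is stored little-endian. The names `ErrorSeverity`, `decode_severity` and `can_be_ignored` are hypothetical and only serve the illustration.

```python
import struct
from enum import IntEnum


class ErrorSeverity(IntEnum):
    """Severity values as listed in the quoted Error Record Header field."""
    RECOVERABLE = 0    # also called non-fatal uncorrected
    FATAL = 1
    CORRECTED = 2
    INFORMATIONAL = 3


SEVERITY_OFFSET = 12   # byte offset quoted above
SEVERITY_LENGTH = 4    # byte length quoted above


def decode_severity(header: bytes) -> ErrorSeverity:
    """Extract the severity from a raw record header (little-endian assumed)."""
    if len(header) < SEVERITY_OFFSET + SEVERITY_LENGTH:
        raise ValueError("header too short to contain the severity field")
    (raw,) = struct.unpack_from("<I", header, SEVERITY_OFFSET)
    try:
        return ErrorSeverity(raw)
    except ValueError:
        # All other values are reserved according to the quoted text.
        raise ValueError(f"reserved severity value {raw}") from None


def can_be_ignored(severity: ErrorSeverity) -> bool:
    """Only 'Informational' records may be safely ignored."""
    return severity is ErrorSeverity.INFORMATIONAL
```

With this sketch a header whose severity field holds 1 decodes to `ErrorSeverity.FATAL`, and any value above 3 is rejected as reserved.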

From the UEFI Specs we have:

Machine Check Exception (MCE): {0xE8F56FFE, 0x919C, 0x4cc5, {0xBA, 0x88, 0x65, 0xA
0x49, 0x13, 0xBB}}
A Machine Check Exception is a processor-generated exception class interrupt [...]
to system software of the presence of a fatal or recoverable error condition

 

Non-Maskable Interrupt (NMI): {0x5BAD89FF, 0xB7E6, 0x42c9, {0x81, 0x4A, 0xCF, 0x24, 0x85,
0xD6, 0xE9, 0x8A}}
Non-Maskable Interrupts are used on X64 platforms to report fatal or recoverable
platform error conditions. NMIs are reported via interrupt vector 2 on IA32 and X64
processor architecture platforms.

 

Synchronous External Abort (SEA): {0x9A78788A, 0xBBE8, 0x11E4, {0x80, 0x9E, 0x67, 0x61,
0x1E, 0x5D, 0x46, 0xB0}}
Synchronous External Aborts represent precise processor error conditions on ARM
systems (uncorrectable and/or recoverable) as described in D3.5 of the ARMv8 ARM
reference manual. This notification may be triggered by one of the following
scenarios: cache parity error, cache ECC error, external bus error, micro-architectural
error, data poisoning, and other platform errors.

 

SError Interrupt (SEI): {0x5C284C81, 0xB0AE, 0x4E87, {0xA3, 0x22, 0xB0, 0x4C, 0x85, 0x62,
0x43, 0x23}}
SError Interrupts represent asynchronous imprecise (or possibly precise) processor
error conditions on ARM systems (corrected, uncorrectable, and recoverable) as
described in D3.5 of the ARM ARM reference manual. This notification may be
triggered by one of the following scenarios: cache parity error, cache ECC error,
external bus error, micro-architectural error, data poisoning, and other platform
errors.

 

Platform Error Interrupt (PEI): {0x09A9D5AC, 0x5204, 0x4214, {0x96, 0xE5, 0x94, 0x99, 0x2E,
0x75, 0x2B, 0xCD}}
Platform Error Interrupts represent asynchronous imprecise platform error conditions
on ARM systems that may be triggered by the following scenarios: system memory
ECC error, ECC errors in system cache (e.g. shared high-level caches), vendor specific
chip errors, external platform errors.
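
To make the comparison across notification types easier, the error classes named in the quoted definitions can be collected in a small lookup table. This is only a sketch transcribing the text above; the names `REPORTED_CLASSES` and `notifications_for` are hypothetical, and PEI is left out because the quoted text lists trigger scenarios for it but no error classes.

```python
# Error classes each notification type may report, transcribed from the
# definitions quoted above. PEI is intentionally omitted (no classes given).
REPORTED_CLASSES = {
    "MCE": {"fatal", "recoverable"},
    "NMI": {"fatal", "recoverable"},
    "SEA": {"uncorrectable", "recoverable"},
    "SEI": {"corrected", "uncorrectable", "recoverable"},
}


def notifications_for(error_class):
    """Return the notification types that may report the given error class."""
    return sorted(
        name for name, classes in REPORTED_CLASSES.items()
        if error_class in classes
    )
```

For example, `notifications_for("corrected")` yields only SEI, which matches the quoted text: among the listed types only SError Interrupts mention corrected errors.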

 

 

From a HW specific point of view on Intel this would map to: https://www.intel.it/content/www/it/it/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html

FATAL Errors: RIPV bit in IA32_MCG_STATUS MSR not set.

 

Uncorrected Errors (SW Recoverable):

When IA32_MCG_CAP[24] is set, a UCR error is indicated by the following bit settings in the IA32_MCi_STATUS register:

Valid (bit 63) = 1
UC (bit 61) = 1
PCC (bit 57) = 0

There are 3 further classifications of uncorrected errors:

  1. Uncorrected no action required (UCNA) - UCNA errors require no action from system software to continue
    execution. A UCNA error is indicated with UC=1, PCC=0, S=0 and AR=0 in the IA32_MCi_STATUS register.
  2. Software recoverable action optional (SRAO) - An SRAO error when signaled as a machine check is indicated with
    UC=1, PCC=0, S=1, EN=1 and AR=0 in the IA32_MCi_STATUS register. System software
    needs to inspect the MCA error code fields in the IA32_MCi_STATUS register to identify the specific recovery
    action for a given SRAO error. If MISCV and ADDRV are not set, it is recommended that no system software
    error recovery be performed; however, system software can resume execution.
  3. Software recoverable action required (SRAR) - SRAR errors indicate that the error was detected and raised
    at the point of the consumption in the execution flow. An SRAR error is indicated with UC=1, PCC=0, S=1,
    EN=1 and AR=1 in the IA32_MCi_STATUS register. System software needs to inspect the MCA error code
    fields in the IA32_MCi_STATUS register to identify the specific recovery action for a given SRAR error.
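
The bit combinations above can be sketched as a small classifier. This is an illustration only, not how Linux implements it: the function name `classify_mci_status` is hypothetical, the bit positions for EN (60), S (56) and AR (55) are not given in the text above and are assumed here from the Intel SDM layout (please verify), and the caller is assumed to have already checked that IA32_MCG_CAP[24] is set.

```python
# Bit positions in IA32_MCi_STATUS. VAL, UC and PCC are quoted above;
# EN, S and AR are assumptions to be checked against the Intel SDM.
VAL = 1 << 63
UC = 1 << 61
EN = 1 << 60
PCC = 1 << 57
S = 1 << 56
AR = 1 << 55


def classify_mci_status(status):
    """Classify a raw IA32_MCi_STATUS value (assumes IA32_MCG_CAP[24] is set).

    EN is not evaluated here; the quoted text requires EN=1 only when an
    SRAO/SRAR error is signaled as a machine check.
    """
    if not status & VAL:
        return "invalid"
    if not status & UC:
        return "corrected"
    if status & PCC:
        return "uncorrected, processor context corrupt"
    signaled = bool(status & S)
    action_required = bool(status & AR)
    if not signaled and not action_required:
        return "UCNA"
    if signaled and not action_required:
        return "SRAO"
    if signaled and action_required:
        return "SRAR"
    return "not covered by the quoted text"  # S=0, AR=1
```

In this sketch only the "SRAR" result and the PCC=1 case would fall under the errors that the summary below treats as hazards; how PCC=1 relates to the RIPV-based fatal definition is left open here.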

 

Corrected Errors: UC bit clear in IA32_MCi_STATUS; Errors reported through CMCI interrupt

-------------------

Summary: In my view

-          Recoverable errors as in UEFI specs are the same as UCR errors in Intel SDM

-          [Fatal errors] and [UCR – SRAR] both represent a hazard to be handled within the PST/FTTI.

 

Next Steps

  • On ARM, do a detailed analysis of the error classifications as done here for Intel
  • Clarify what PEIs (Platform Error Interrupts) are used for and whether they are used to report hazardous errors

 

 

****************************************************************************************

***********          REFERENCE   SUMMARY BELOW  *********************************************

****************************************************************************************

 

 

Summary:

  • Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM
  • Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not been consumed already)
    [TODO]: to investigate on where these definitions can be found to have a common reference and a common understanding
  • For Fatal error neither ARM or Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)
    [TODO]: to better analyze these paths. It is not clear if for fatal errors these paths are different in the case GHES+APEI enabled vs GHES+APEI disabled
  • For non Fatal Error Chris presented a SW recovery path that is implemented in GHES.c.
    [TODO]: to investigate what happens if FW First (i.e. GHES and APEI) is not enabled on both IA and ARM (on Intel these seem to be reported by the EDAC driver plugging into the MCE handler; e.g.:
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/i10nm_base.c#n319 )
  • Correctable Errors: we are aligned on the role of EDAC in counting CEs.
    [TODO]: to understand if a systematic capability is required in the reporting of latent faults

 

 

Gab

 

---------------------------------------------------------------------
INTEL CORPORATION ITALIA S.p.A. con unico socio
Sede: Milanofiori Palazzo E 4
CAP 20094 Assago (MI)
Capitale Sociale Euro 104.000,00 interamente versato
Partita I.V.A. e Codice Fiscale  04236760155
Repertorio Economico Amministrativo n. 997124
Registro delle Imprese di Milano nr. 183983/5281/33
Soggetta ad attivita' di direzione e coordinamento di
INTEL CORPORATION, USA

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.



Christopher Temple
 

I have never used the term “async errors”. What are “async errors”?

 

Best regards

Chris

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Tuesday, 28 July 2020 16:45
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

Are your async errors the same as UE SW Recoverable on Intel?

If yes, on IA I think these could be dangerous if not recovered within the FTTI or PST; for instance, the Intel SDM gives the UE SRAR explanation quoted at the
bottom, and in my view SW failing to take action or (worse) taking the wrong recovery action could result in corruption and hence a hazard here.

Maybe for some reason on ARM it is safe not to handle UE within the PST/FTTI… what I am asking is to build the picture for both ARM and IA so that later on
we can define which SW parts of Linux are assigned safety relevant tasks (i.e. safety reqs)

 

Thanks

Gab

 

 

SRAR Error And Affected Logical Processors
The affected logical processor is the one that has detected and raised an SRAR error at the point of the consumption in the execution flow. The affected logical processor should find the Data Load or the Instruction Fetch error
information in the IA32_MCi_STATUS register that is reporting the SRAR error.
Table 15-20 list the actionable scenarios that system software can respond to an SRAR error on an affected logical
processor according to RIPV and EIPV values:
Recoverable-Continuable SRAR Error (RIPV=1, EIPV=1):
For Recoverable-Continuable SRAR errors, the affected logical processor should find that both the
IA32_MCG_STATUS.RIPV and the IA32_MCG_STATUS.EIPV flags are set, indicating that system software may
be able to restart execution from the interrupted context if it is able to rectify the error condition. If system
software cannot rectify the error condition then it must treat the error as a recoverable error where restarting
execution with the interrupted context is not possible. Restarting without rectifying the error condition will
result in most cases with another SRAR error on the same instruction.

Recoverable-not-continuable SRAR Error (RIPV=0, EIPV=x):
For Recoverable-not-continuable errors, the affected logical processor should find that either
— IA32_MCG_STATUS.RIPV= 0, IA32_MCG_STATUS.EIPV=1, or
— IA32_MCG_STATUS.RIPV= 0, IA32_MCG_STATUS.EIPV=0

In either case, this indicates that the error is detected at the instruction pointer saved on the stack for this
machine check exception and restarting execution with the interrupted context is not possible. System
software may take the following recovery actions for the affected logical processor:
The current executing thread cannot be continued. System software must terminate the interrupted
stream of execution and provide a new stream of execution on return from the machine check handler
for the affected logical processor.
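
The two SRAR cases quoted above can be sketched as a decision on the RIPV and EIPV flags. This is a minimal illustration, not the kernel's machine check handler: the function name `srar_recovery_class` is hypothetical, and the bit positions of RIPV (bit 0) and EIPV (bit 1) in IA32_MCG_STATUS are not stated in the quoted text and are assumed here from the Intel SDM (please verify).

```python
# Bit positions in IA32_MCG_STATUS (assumed, not stated in the quoted text).
RIPV = 1 << 0   # restart IP valid
EIPV = 1 << 1   # error IP valid


def srar_recovery_class(mcg_status):
    """Map IA32_MCG_STATUS flags to the SRAR cases quoted above."""
    ripv = bool(mcg_status & RIPV)
    eipv = bool(mcg_status & EIPV)
    if ripv and eipv:
        # Execution may be restarted from the interrupted context,
        # but only if software is able to rectify the error condition.
        return "recoverable-continuable"
    if not ripv:
        # EIPV may be 0 or 1: the interrupted stream of execution must be
        # terminated and a new one provided on return from the handler.
        return "recoverable-not-continuable"
    return "not covered by the quoted text"  # RIPV=1, EIPV=0
```

Note that "recoverable-continuable" is conditional: per the quoted text, restarting without rectifying the error will in most cases raise another SRAR error on the same instruction.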

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Tuesday, July 28, 2020 9:29 AM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

You write

“From an OS point of view exceptions are always asynchronous.”

 

That is correct since the OS doesn’t know when a fatal instruction memory read would occur. The exception occurs synchronously with the fatal instruction memory read to inform the CPU that there is no valid instruction available to execute.

 

Regarding the error reaction to fatal instruction memory reads you write

“On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard.”

 

What instruction do you see the x86 executing, if no instruction can be returned by the memory controller?

 

As stated early in the discussion, data read/write has its own challenges; let’s agree on the instruction read aspect first.

 

Best regards

Chris

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Monday, 27 July 2020 13:19
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

Thanks, pls see GP inline

 

Gab

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Monday, July 27, 2020 11:51 AM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I agree – we need useful safety requirements for Linux and it would be great if we would receive some closer to real world safety requirements from the domains.

GP: good we’re on the same page here 😊

 

The question at hand was whether it makes sense to include the Linux EDAC subsystem in the mitigation of SPFs. The envisioned use case consists of the kernel, a safety application and an EDAC monitor app.

GP: From our last discussions it seems that we are transitioning from EDAC to the memory fault handling in Linux, but in order to assign safety reqs we need to

  • Understand error types and define which ones we ‘assume’ to be dangerous for a generic safety app running on top
  • Identify which parts of Linux are involved in reporting and handling the dangerous errors depending on the different HWs (e.g. ARM vs Intel vs Others) and configurations (e.g. ACPI APEI on vs ACPI APEI off)
  • Allocate SReqs to Linux accordingly

 

The conclusion from the discussions seems to be that at least in the use case drafted by the architecture WG some weeks ago this doesn’t really add value.

 

I’ve summarized the conclusion in the table:

 

 

Imho I think we have gained some very valuable insights from the discussion, which range from inaccuracies in the descriptions on kernel.org to the need to understand how Linux interacts with hardware in failure mode situations.

GP: Right now if we take away the EDAC framework and if we rely on Linux to kill itself or the safety app we need to allocate Linux with SReqs accordingly. So yes, you are right, we are transitioning from EDAC to a different model

 

The learning for this use case seems to be that

  • the safety integrity for SPF really resides on the Linux execution in the synchronous path and not on the asynchronous path (including firmware);

GP: Here I think we need to investigate better. From an OS point of view exceptions are always asynchronous. Now looking at your slide what you call async flow is related to the handling of UE SW Recoverable. On IA I think that
a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard. Now as you can see at the bottom of this thread I am
asking to do a similar investigation on ARM to make sure that what you call ‘async flow’ is always related to errors that do not represent a hazard

  • the interaction with an external safeing device is done better via some watchdog mechanism through the safety application.

GP: I agree on the WD, yes

  • the asynchronous path is important for latent faults (as you write we still need to clarify if the full symmetric integrity of the safety goal is needed in this case)

 

I hope I got it right.

 

Let’s discuss tomorrow.

GP: OK, I’ll save 15min for the second part of the meeting as we have already the Linux Kernel map on the agenda.

 

Best regards

Chris

 

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Saturday, 25 July 2020 10:20
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

To be honest from my point of view it is important to just assign some safety requirements to Linux to get started with the safety analyses. I would expect detailed and more correct TSRs to come from domain specific WGs.
Hence I proposed EDAC (because if you rely on NMI error handling instead of MCA then EDAC is responsible for error handling) because from my personal point of view I would just be happy to start with assigning some safety
reqs, maybe not complete, not fully correct but that at least would allow us to focus on the safety analyses (FFI, FMEA or FTA) where I believe we’ll find common challenges regardless of which specific subsystems of Linux;
challenges that we can solve and where we can apply the solution to other scenarios once we have them from the domain specific WGs.

 

However EDAC was challenged and now we moved to analyze error reporting and handling specifically for MCA and the ARM counterpart of MCA (see Tony and your presentation on last Tue). My end goal is still to come up
with some TSRs that can be accepted and that would allow us to move on. Hence I think we need to:

  a) Classify the errors and decide which we can ‘assume’ to be a hazard
  b) See how these ‘hazardous’ errors are handled and if it makes sense to rely on Linux (see email from Lukas “[…] How does the HW react to a memory fault, that was caused by a required memory access for a specific operation of the safety function?”)
  c) Following on from b), define TSRs to be allocated to Linux

 

In my previous analysis of a), since I don’t have a specific context for a specific application running on a specific system I just assume that a fatal error if not reported is a hazard as well as UE SRAR because for UE SRAR the error has been consumed and an action shall be taken to avoid incorrect data.

Having said that the reason why I asked ARM to do a similar analysis is to move on to the next steps b) and c) where we go and check the Linux paths involved in handling the respective errors.

 

With respect to your points below

  • the termination path via the hardware is faster;

Do all systems have a termination path via HW? Wouldn’t it be good to rely on Linux, or to analyse how much we can rely on Linux?
And BTW is it really faster or safer? See previous discussion with Corey.

  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

Even if you rely on FW to handle errors you would have the same problem (see previous discussion with Corey)

 

BTW in my view it is crucial to assign Linux with safety reqs to start analyses; these may not be fully correct in some specific contexts or not complete but they’ll allow us to go on with safety analyses and to start looking at the challenges that we’ll also encounter later on as TSRs will come from domain specific WGs.
If we continue like this we go back in circles and we are not able to work on the core of this WG.

 

Thanks

Gab

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Friday, July 24, 2020 6:49 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I don’t understand why we need all those details. The question about what constitutes a hazard is not resolved in the details.

 

We should first double check the summary - here some comments:

 

  1. “Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM”

 

Yes, there are very low level differences in the HW (for example Arm does not have an NMI), but ELISA really shouldn’t be arguing at that level.

 

The assumption from our side was that the general error handling flow (at the level shown on Tuesday) essentially captures what happens in either architecture. As we pointed out, the BIOS is in fact the same.

 

 

  2. “Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not been consumed already)”

 

One cannot designate a fatal error as hazard w/o considering

·       the interface at which the fatal error is observed,

·       the transaction that is affected, and

·       the application use case.

 

Assume, for example, a system consisting of the kernel, one or more safety critical applications and one or more non-safety critical applications.

 

A fatal memory read error (i.e. memory controller cannot return valid data to the core) occurring during the execution of the non-safety critical application, for example, will not lead to a violation of a safety goal and is hence no direct hazard.

 

A fatal memory read error occurring during the execution of the safety critical application will lead to loss of service. This will only lead to a violation of the safety goal if the system has been designed such that loss of this service leads to a hazard. However, as there are many (more likely) error cases that can lead to loss of service, provisions for this case need to be in place in any case.

 

Similarly fatal memory read errors can also occur during other operations, such as DMA transfers, which require different reasoning.
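
The reasoning in the example above can be sketched as a small decision function. This is only an illustration of the argument, not a hazard analysis: the type `FatalReadError`, the context labels and the function `assess` are hypothetical, and any context not discussed above (for example DMA transfers) is deliberately left as requiring separate reasoning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FatalReadError:
    # Where the affected transaction occurred (hypothetical labels):
    # "non_safety_app", "safety_app", or anything else (e.g. "dma").
    context: str
    # Property of the system design, not of the error itself.
    loss_of_service_is_hazard: bool = False


def assess(err):
    """Return the consequence of a fatal memory read error per the example."""
    if err.context == "non_safety_app":
        # No safety goal is violated, hence no direct hazard.
        return "no direct hazard"
    if err.context == "safety_app":
        # The safety application is lost; whether that is a hazard depends
        # on how the system was designed.
        if err.loss_of_service_is_hazard:
            return "hazard via loss of service"
        return "loss of service, no hazard"
    # Other operations such as DMA transfers require different reasoning.
    return "needs separate analysis"
```

The point of the sketch is that the same error severity yields different results depending on the affected transaction and the application use case.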

 

  3. “For Fatal error neither ARM or Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)”

This needs a more nuanced consideration.

 

Neither Arm nor Intel based systems use the Linux EDAC subsystem for the direct error reaction to fatal memory read errors that occur during instruction execution.

 

Why am I so particular?

·       Firstly, because the general term “EDAC” is often used to refer to the complete sphere of HW and SW execution around error detection and control.

o   So with this understanding of the term “EDAC” the correct statement would be that ”in both Arm and Intel based systems the HW EDAC subsystem initiates the error reaction to fatal memory read errors that occur during instruction execution”.

·       Secondly, because the term “rely” does not relate to a specific instance.

o   Both Arm and Intel based systems can use the Linux EDAC subsystem for a posteriori diagnosis of what happened.

o   So in the a posteriori phase both Arm and Intel based systems can make use of it, i.e. “rely” on it.

 

  4. “For non Fatal Error Chris presented a SW recovery path that is implemented in GHES.c.”

We explained the basic execution flow in the case of fatal memory read errors and in the case of non-fatal memory read errors. I would not call either of these a “SW recovery path”, as nothing is recovered.

 

 

  5. “Correctable Errors: we are aligned on the role of EDAC in counting CEs.”

As explained the EDAC counts much more:

I didn’t see that this was ever disputed.

 

What we did challenge was the idea to drive a SW driven error reaction from HW_EVENT_ERR_FATAL by placing the Linux EDAC subsystem into a critical path in the event of fatal memory read errors that occur during instruction execution.

 

We raised two arguments:

  • the termination path via the hardware is faster;
  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

 

Is this disputed any longer?

 

We should try to keep the level of detail needed as simple as possible.

 

I’m happy to explain our thoughts on Tuesday.

 

Best regards

Chris

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Friday, 24 July 2020 13:18
To: myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; Christopher Temple <Christopher.Temple@...>; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi All

 

I am following up on the first TODO of the reference summary at the bottom.

[TODO]: to investigate on where these definitions can be found to have a common reference and a common understanding

 

Please review and give your feedback.
@Chris please see the ‘next steps’ on this specific point; it would be good to provide similar detail as I did for Intel (mainly to understand what represents a hazard and what does not)

 

Thanks

Gab

------------------

Follow-up

------------------

 

We can use the definitions in ACPI specs 2.8 Appendix N, Table 54 (Error Record Header):

Field: Error Severity
Byte offset: 12
Byte length: 4
Description: Indicates the severity of the error condition. The severity of the
error record corresponds to the most severe error section.
  0 - Recoverable (also called non-fatal uncorrected)
  1 - Fatal
  2 - Corrected
  3 - Informational
All other values are reserved.
Note that severity of "Informational" indicates that the record
could be safely ignored by error handling software.
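
As an illustration of the severity field quoted above, here is a minimal Python sketch. The four values and the "most severe section wins" rule come from the quoted table; the names `CperSeverity`, `record_severity` and `can_be_ignored`, and the explicit ranking table, are assumptions made for this example only. The ranking is needed because the numeric encoding is not ordered by severity (Recoverable is 0, Corrected is 2).

```python
from enum import IntEnum


class CperSeverity(IntEnum):
    """Error Severity values from the record header quoted above."""
    RECOVERABLE = 0    # also called non-fatal uncorrected
    FATAL = 1
    CORRECTED = 2
    INFORMATIONAL = 3


# Assumed ranking, least to most severe. The raw numbers cannot be compared
# directly because the encoding is not ordered by severity.
_RANK = {
    CperSeverity.INFORMATIONAL: 0,
    CperSeverity.CORRECTED: 1,
    CperSeverity.RECOVERABLE: 2,
    CperSeverity.FATAL: 3,
}


def record_severity(section_severities):
    """Return the most severe of the given per-section severity values."""
    sections = [CperSeverity(value) for value in section_severities]
    if not sections:
        raise ValueError("a record needs at least one error section")
    return max(sections, key=_RANK.__getitem__)


def can_be_ignored(severity):
    """Only 'Informational' records may be safely ignored."""
    return CperSeverity(severity) is CperSeverity.INFORMATIONAL
```

Reserved values (anything outside 0 to 3) raise `ValueError` in this sketch, since the enum does not define them.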

From the UEFI Specs we have:

Machine Check Exception (MCE): {0xE8F56FFE, 0x919C, 0x4cc5, {0xBA, 0x88, 0x65, 0xA
0x49, 0x13, 0xBB}}
A Machine Check Exception is a processor-generated exception class interrupt used to signal
to system software the presence of a fatal or recoverable error condition.

 

Non-Maskable Interrupt (NMI): {0x5BAD89FF, 0xB7E6, 0x42c9, {0x81, 0x4A, 0xCF, 0x24, 0x85,
0xD6, 0xE9, 0x8A}}
Non-Maskable Interrupts are used on X64 platforms to report fatal or recoverable
platform error conditions. NMIs are reported via interrupt vector 2 on IA32 and X64
processor architecture platforms.

 

Synchronous External Abort (SEA): {0x9A78788A, 0xBBE8, 0x11E4, {0x80, 0x9E, 0x67, 0x61,
0x1E, 0x5D, 0x46, 0xB0}}
Synchronous External Aborts represent precise processor error conditions on ARM
systems (uncorrectable and/or recoverable) as described in D3.5 of the ARMv8 ARM
reference manual. This notification may be triggered by one of the following
scenarios: cache parity error, cache ECC error, external bus error, micro-architectural
error, data poisoning, and other platform errors.

 

SError Interrupt (SEI): {0x5C284C81, 0xB0AE, 0x4E87, {0xA3, 0x22, 0xB0, 0x4C, 0x85, 0x62,
0x43, 0x23}}
SError Interrupts represent asynchronous imprecise (or possibly precise) processor
error conditions on ARM systems (corrected, uncorrectable, and recoverable) as
described in D3.5 of the ARM ARM reference manual. This notification may be
triggered by one of the following scenarios: cache parity error, cache ECC error,
external bus error, micro-architectural error, data poisoning, and other platform
errors.

 

Platform Error Interrupt (PEI): {0x09A9D5AC, 0x5204, 0x4214, {0x96, 0xE5, 0x94, 0x99, 0x2E,
0x75, 0x2B, 0xCD}}
Platform Error Interrupts represent asynchronous imprecise platform error conditions
on ARM systems that may be triggered by the following scenarios: system memory
ECC error, ECC errors in system cache (e.g. shared high-level caches), vendor specific
chip errors, external platform errors.

 

 

From a HW specific point of view on Intel this would map to: https://www.intel.it/content/www/it/it/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html

FATAL Errors: RIPV bit in IA32_MCG_STATUS MSR not set.

 

Uncorrected Errors (SW Recoverable):

When IA32_MCG_CAP[24] is set, a UCR error is indicated by the following bit settings in the IA32_MCi_STATUS register:

Valid (bit 63) = 1
UC (bit 61) = 1
PCC (bit 57) = 0

There are 3 further classifications of uncorrected errors:

  1. Uncorrected no action required (UCNA): UCNA errors require no action from system software to continue
    execution. A UCNA error is indicated with UC=1, PCC=0, S=0 and AR=0 in the IA32_MCi_STATUS register.
  2. Software recoverable action optional (SRAO): An SRAO error when signaled as a machine check is indicated with
    UC=1, PCC=0, S=1, EN=1 and AR=0 in the IA32_MCi_STATUS register. System software
    needs to inspect the MCA error code fields in the IA32_MCi_STATUS register to identify the specific recovery
    action for a given SRAO error. If MISCV and ADDRV are not set, it is recommended that no system software
    error recovery be performed; however, system software can resume execution.
  3. Software recoverable action required (SRAR): SRAR errors indicate that the error was detected and raised
    at the point of the consumption in the execution flow. An SRAR error is indicated with UC=1, PCC=0, S=1,
    EN=1 and AR=1 in the IA32_MCi_STATUS register. System software needs to inspect the MCA error code
    fields in the IA32_MCi_STATUS register to identify the specific recovery action for a given SRAR error.
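
The bit patterns listed above can be turned into a small decoder. The following Python sketch is an illustration only, not kernel code: the bit positions for Valid, UC and PCC are the ones quoted above, while the positions for EN (60), S (56) and AR (55) are taken from the Intel SDM register layout and do not appear in this thread. The function name and the returned labels are made up for the example.

```python
# Bit positions in IA32_MCi_STATUS. VAL, UC and PCC are quoted above; EN, S
# and AR come from the Intel SDM layout, not from this thread.
MCI_STATUS_VAL = 1 << 63
MCI_STATUS_UC = 1 << 61
MCI_STATUS_EN = 1 << 60
MCI_STATUS_PCC = 1 << 57
MCI_STATUS_S = 1 << 56
MCI_STATUS_AR = 1 << 55
MCG_CAP_SER_P = 1 << 24  # IA32_MCG_CAP[24], software error recovery support


def classify_mci_status(status, mcg_cap=MCG_CAP_SER_P):
    """Classify one IA32_MCi_STATUS value using the patterns quoted above."""
    if not status & MCI_STATUS_VAL:
        return "no valid error logged"
    if not status & MCI_STATUS_UC:
        return "corrected"
    if status & MCI_STATUS_PCC:
        # UC=1, PCC=1 is outside the UCR patterns quoted above.
        return "uncorrected, not UCR"
    if not mcg_cap & MCG_CAP_SER_P:
        # Assumption: without IA32_MCG_CAP[24], S and AR carry no UCR meaning.
        return "uncorrected, no UCR sub-classification"
    s = bool(status & MCI_STATUS_S)
    ar = bool(status & MCI_STATUS_AR)
    en = bool(status & MCI_STATUS_EN)
    if not s and not ar:
        return "UCNA"
    if s and en and not ar:
        return "SRAO"
    if s and en and ar:
        return "SRAR"
    return "unclassified"
```

For example, `classify_mci_status(MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN | MCI_STATUS_S | MCI_STATUS_AR)` falls into the SRAR branch, which is the class treated as a hazard in the summary.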

 

Corrected Errors: UC bit clear in IA32_MCi_STATUS; errors are reported through CMCI.

-------------------

Summary: In my view

-        Recoverable errors as in UEFI specs are the same as UCR errors in Intel SDM

-        [Fatal errors] and [UCR – SRAR] both represent a hazard to be handled within the PST/FTTI.

 

Next Steps

  • On ARM, try to do a detailed analysis of the error classifications as done here for Intel
  • Clarify what PEIs (Platform Error Interrupts) are used for and whether they are used to report hazardous errors

 

 

****************************************************************************************

***********          REFERENCE   SUMMARY BELOW  *********************************************

****************************************************************************************

 

 

Summary:

  • Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM
  • Fatal errors represent a direct hazard, whereas non-fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory, with the memory not having been consumed yet)
    [TODO]: investigate where these definitions can be found, so that we have a common reference and a common understanding
  • For fatal errors neither ARM nor Intel relies on EDAC (for ARM64 these are handled in do_sea and do_serror, whereas for Intel these are handled in the MCE)
    [TODO]: better analyze these paths. It is not clear whether, for fatal errors, these paths differ between GHES+APEI enabled and GHES+APEI disabled
  • For non-fatal errors Chris presented a SW recovery path that is implemented in GHES.c.
    [TODO]: investigate what happens if FW First (i.e. GHES and APEI) is not enabled on both IA and ARM (on Intel these seem to be reported by the EDAC driver plugging into the MCE handler; e.g.:
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/i10nm_base.c#n319)
  • Correctable Errors: we are aligned on the role of EDAC in counting CEs.
    [TODO]: understand whether a systematic capability is required in the reporting of latent faults

 

 

Gab

 

---------------------------------------------------------------------
INTEL CORPORATION ITALIA S.p.A. con unico socio
Sede: Milanofiori Palazzo E 4
CAP 20094 Assago (MI)
Capitale Sociale Euro 104.000,00 interamente versato
Partita I.V.A. e Codice Fiscale  04236760155
Repertorio Economico Amministrativo n. 997124
Registro delle Imprese di Milano nr. 183983/5281/33
Soggetta ad attivita' di direzione e coordinamento di
INTEL CORPORATION, USA

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.



Paoloni, Gabriele <gabriele.paoloni@...>
 

I mean errors associated with what you call ‘async path’… how do you classify these on ARM?

 

On Intel I have provided a clear extensive classification of errors….

 

Thanks

Gab

 

From: Christopher Temple <Christopher.Temple@...>
Sent: Tuesday, July 28, 2020 6:16 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

I have never used the term “async errors”. What are “async errors”?

 

Best regards

Chris

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Tuesday, 28 July 2020 16:45
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

Are your async errors the same as UE SW Recoverable on Intel?

If yes, on IA I think these could be dangerous if not recovered within the FTTI or PST; for instance, the Intel SDM has the UE SRAR explanation quoted at the bottom, and in my view SW failing to take action, or worse, taking the wrong recovery action, could result in corruption and hence a hazard here.

Maybe for some reason on ARM it is safe not to handle UE within the PST/FTTI… What I am asking is to build the picture for both ARM and IA so that later on we can define which SW parts of Linux are assigned safety-relevant tasks (i.e. safety reqs)

 

Thanks

Gab

 

 

SRAR Error And Affected Logical Processors

The affected logical processor is the one that has detected and raised an SRAR error at the point of the consumption in the execution flow. The affected logical processor should find the Data Load or the Instruction Fetch error information in the IA32_MCi_STATUS register that is reporting the SRAR error.

Table 15-20 lists the actionable scenarios in which system software can respond to an SRAR error on an affected logical processor according to RIPV and EIPV values:

Recoverable-Continuable SRAR Error (RIPV=1, EIPV=1):
For Recoverable-Continuable SRAR errors, the affected logical processor should find that both the IA32_MCG_STATUS.RIPV and the IA32_MCG_STATUS.EIPV flags are set, indicating that system software may be able to restart execution from the interrupted context if it is able to rectify the error condition. If system software cannot rectify the error condition then it must treat the error as a recoverable error where restarting execution with the interrupted context is not possible. Restarting without rectifying the error condition will result in most cases with another SRAR error on the same instruction.

Recoverable-not-continuable SRAR Error (RIPV=0, EIPV=x):
For Recoverable-not-continuable errors, the affected logical processor should find that either
- IA32_MCG_STATUS.RIPV=0, IA32_MCG_STATUS.EIPV=1, or
- IA32_MCG_STATUS.RIPV=0, IA32_MCG_STATUS.EIPV=0

In either case, this indicates that the error is detected at the instruction pointer saved on the stack for this machine check exception and restarting execution with the interrupted context is not possible. System software may take the following recovery actions for the affected logical processor:
  • The current executing thread cannot be continued. System software must terminate the interrupted stream of execution and provide a new stream of execution on return from the machine check handler for the affected logical processor.
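
The two SRAR scenarios quoted above can be summarised in a short Python sketch. It assumes the IA32_MCG_STATUS layout from the Intel SDM (RIPV is bit 0, EIPV is bit 1), which is not stated in this thread; the function names and returned labels are hypothetical and serve only to illustrate the decision.

```python
# IA32_MCG_STATUS flag positions, taken from the Intel SDM (not from this thread).
MCG_STATUS_RIPV = 1 << 0  # restart instruction pointer valid
MCG_STATUS_EIPV = 1 << 1  # error instruction pointer valid


def srar_scenario(mcg_status):
    """Map RIPV/EIPV to the SRAR scenarios described in the quoted text."""
    ripv = bool(mcg_status & MCG_STATUS_RIPV)
    eipv = bool(mcg_status & MCG_STATUS_EIPV)
    if ripv and eipv:
        return "recoverable-continuable"
    if not ripv:
        # EIPV may be 0 or 1 here; the interrupted context cannot be restarted.
        return "recoverable-not-continuable"
    # RIPV=1, EIPV=0 is not one of the two scenarios quoted above.
    return "not covered by the quoted text"


def may_resume_interrupted_context(mcg_status, error_rectified):
    """Resuming is only an option if the error is continuable AND was rectified."""
    return srar_scenario(mcg_status) == "recoverable-continuable" and error_rectified
```

In the not-continuable case, or when the error cannot be rectified, the quoted text requires system software to terminate the interrupted stream of execution; this is the step where a missing or wrong software action could matter for the PST/FTTI discussion.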

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Tuesday, July 28, 2020 9:29 AM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

You write

“From an OS point of view exceptions are always asynchronous.”

 

That is correct since the OS doesn’t know when a fatal instruction memory read would occur. The exception occurs synchronously with the fatal instruction memory read to inform the CPU that there is no valid instruction available to execute.

 

Regarding the error reaction to fatal instruction memory reads you write

“On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard.”

 

What instruction do you see the x86 executing, if no instruction can be returned by the memory controller?

 

As stated earlier in the discussion, data read/write has its own challenges – let’s agree on the instruction read aspect first.

 

Best regards

Chris

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Monday, 27 July 2020 13:19
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

Thanks, pls see GP inline

 

Gab

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Monday, July 27, 2020 11:51 AM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I agree – we need useful safety requirements for Linux and it would be great if we would receive some closer to real world safety requirements from the domains.

GP: good we’re on the same page here 😊

 

The question at hand was whether it makes sense to include the Linux EDAC subsystem in the mitigation of SPFs. The envisioned use case consists of the kernel, a safety application and an EDAC monitor app.

GP: From our last discussions it seems that we are transitioning from EDAC to the memory fault handling in Linux, but in order to assign safety reqs we need to

  • Understand error types and define which ones we ‘assume’ to be dangerous for a generic safety app running on top
  • Identify which parts of Linux are involved in reporting and handling the dangerous errors depending on the different HWs (e.g. ARM vs Intel vs Others) and configurations (e.g. ACPI APEI on vs ACPI APEI off)
  • Allocate SReqs to Linux accordingly

 

The conclusion from the discussions seems to be that at least in the use case drafted by the architecture WG some weeks ago this doesn’t really add value.

 

I’ve summarized the conclusion in the table:

 

 

Imho I think we have gained some very valuable insights from the discussion, which range from inaccuracies in the descriptions on kernel.org to the need to understand how Linux interacts with hardware in failure mode situations.

GP: Right now if we take away the EDAC framework and if we rely on Linux to kill itself or the safety app we need to allocate Linux with SReqs accordingly. So yes, you are right, we are transitioning from EDAC to a different model

 

The learning for this use case seems to be that

  • the safety integrity for SPF really resides on the Linux execution in the synchronous path and not on the asynchronous path (including firmware);

GP: Here I think we need to investigate better. From an OS point of view exceptions are always asynchronous. Now, looking at your slide, what you call async flow is related to the handling of UE SW Recoverable. On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard. As you can see at the bottom of this thread, I am asking for a similar investigation on ARM to make sure that what you call ‘async flow’ is always related to errors that do not represent a hazard

  • the interaction with an external safeing device is done better via some watchdog mechanism through the safety application.

GP: I agree on the WD, yes

  • the asynchronous path is important for latent faults (as you write we still need to clarify if the full symmetric integrity of the safety goal is needed in this case)

 

I hope I got it right.

 

Let’s discuss tomorrow.

GP: OK, I’ll save 15min for the second part of the meeting as we have already the Linux Kernel map on the agenda.

 

Best regards

Chris

 

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Saturday, 25 July 2020 10:20
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

To be honest from my point of view it is important to just assign some safety requirements to Linux to get started with the safety analyses. I would expect detailed and more correct TSRs to come from domain specific WGs.
Hence I proposed EDAC (because if you rely on NMI error handling instead of MCA then EDAC is responsible for error handling) because from my personal point of view I would just be happy to start with assigning some safety
reqs, maybe not complete, not fully correct but that at least would allow us to focus on the safety analyses (FFI, FMEA or FTA) where I believe we’ll find common challenges regardless of which specific subsystems of Linux;
challenges that we can solve and where we can apply the solution to other scenarios once we have them from the domain specific WGs.

 

However EDAC was challenged and now we moved to analyze error reporting and handling specifically for MCA and the ARM counterpart of MCA (see Tony and your presentation on last Tue). My end goal is still to come up
with some TSRs that can be accepted and that would allow us to move on. Hence I think we need to:

  a) Classify the errors and decide which we can ‘assume’ to be a hazard
  b) See how these ‘hazardous’ errors are handled and if it makes sense to rely on Linux (see email from Lukas “[…] How does the HW react to a memory fault, that was caused by a required memory access for a specific operation of the safety function?”)
  c) Following on from b), define TSRs to be allocated to Linux

 

In my previous analysis of a), since I don’t have a specific context for a specific application running on a specific system I just assume that a fatal error if not reported is a hazard as well as UE SRAR because for UE SRAR the error has been consumed and an action shall be taken to avoid incorrect data.

Having said that the reason why I asked ARM to do a similar analysis is to move on to the next steps b) and c) where we go and check the Linux paths involved in handling the respective errors.

 

With respect to your points below

  • the termination path via the hardware is faster;

Do all systems have a termination path via HW? Wouldn’t it be good to rely on Linux, or to analyse how much we can rely on Linux?
And BTW is it really faster or safer… see previous discussion with Corey?

  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

Even if you rely on FW to handle errors you would have the same problem (see previous discussion with Corey)

 

BTW in my view it is crucial to assign Linux safety reqs to start the analyses; these may not be fully correct in some specific contexts, or not complete, but they’ll allow us to go on with the safety analyses and to start looking at the challenges that we’ll also encounter later on as TSRs come from domain specific WGs.
If we continue like this we will go round in circles and will not be able to work on the core of this WG.

 

Thanks

Gab

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Friday, July 24, 2020 6:49 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I don’t understand why we need all those details. The question about what constitutes a hazard is not resolved in the details.

 

We should first double check the summary - here some comments:

 

  1. “Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM”

 

Yes, there are very low level differences in the HW (for example Arm does not have an NMI), but ELISA really shouldn’t be arguing at that level.

 

The assumption from our side was that the general error handling flow (at the level shown on Tuesday) essentially captures what happens in either architecture. As we pointed out the BIOS  is in fact the same.

 

 

  2. “Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not been consumed already)”

 

One cannot designate a fatal error as hazard w/o considering

·         the interface at which the fatal error is observed,

·         the transaction that is affected, and

·         the application use case.

 

Assume, for example, a system consisting of the kernel, one or more safety critical applications and one or more non-safety critical applications.

 

A fatal memory read error (i.e. the memory controller cannot return valid data to the core) occurring during the execution of the non-safety critical application, for example, will not lead to a violation of a safety goal and is hence no direct hazard.

 

A fatal memory read error occurring during the execution of the safety critical application will lead to loss of service – this will only lead to a violation of the safety goal if the system has been designed such that loss of this service leads to a hazard. However, as there are many (more likely) error cases that can lead to loss of service, provisions for this case need to be in place in any case.

 

Similarly fatal memory read errors can also occur during other operations, such as DMA transfers, which require different reasoning.
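
The reasoning in the three cases above can be written down as a small decision function. This Python sketch is only a restatement of the argument under simplifying assumptions: the context labels and the `loss_of_service_is_hazardous` flag are hypothetical names, and anything not discussed above (for example the kernel itself or DMA transfers) is deliberately left as needing its own analysis.

```python
def assess_fatal_read_error(affected_context, loss_of_service_is_hazardous=False):
    """Restate the hazard argument for a fatal memory read error.

    affected_context: hypothetical label for what was executing when the
    error occurred ("non_safety_app", "safety_app", or anything else).
    loss_of_service_is_hazardous: whether the system was designed such that
    losing the safety application's service leads to a hazard.
    """
    if affected_context == "non_safety_app":
        # No safety goal is violated, hence no direct hazard.
        return "no direct hazard"
    if affected_context == "safety_app":
        # The immediate effect is loss of service; whether that is a hazard
        # depends on the system design.
        if loss_of_service_is_hazardous:
            return "hazard: loss of service violates the safety goal"
        return "loss of service, no safety goal violation"
    # DMA transfers, the kernel and other cases require different reasoning.
    return "needs separate analysis"
```

The sketch makes the dependency explicit: the same hardware event yields different answers depending on the transaction affected and the application use case.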

 

  3. “For Fatal error neither ARM or Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)”

This needs a more nuanced consideration.

 

Neither Arm nor Intel based systems use the Linux EDAC subsystem for the direct error reaction to fatal memory read errors that occur during instruction execution.

 

Why am I so particular?

·         Firstly, because the general term “EDAC” is often used to refer to the complete sphere of HW and SW execution around error detection and control.

o   So with this understanding of the term “EDAC” the correct statement would be that ”in both Arm and Intel based systems the HW EDAC subsystem initiates the error reaction to fatal memory read errors that occur during instruction execution”.

·         Secondly, because the term “rely” does not relate to a specific instance.

o   Both Arm and Intel based systems can use the Linux EDAC subsystem for a posteriori diagnosis of what happened.

o   So in the a posteriori phase both Arm and Intel based systems can make use of it, i.e. “rely” on it.

 

  4. “For non Fatal Error Chris presented a SW recovery path that is implemented in GHES.c.”

We explained the basic execution flow in the case of fatal memory read errors and in the case of non-fatal memory read errors – I would not call either of these a “SW recovery path”, as nothing is recovered.

 

 

  5. “Correctable Errors: we are aligned on the role of EDAC in counting CEs.”

As explained the EDAC counts much more:

I didn’t see that this was ever disputed.

 

What we did challenge was the idea to drive a SW driven error reaction from HW_EVENT_ERR_FATAL by placing the Linux EDAC subsystem into a critical path in the event of fatal memory read errors that occur during instruction execution.

 

We raised two arguments:

  • the termination path via the hardware is faster;
  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

 

Is this disputed any longer?

 

We should try to keep the level of detail needed as simple as possible.

 

I’m happy to explain our thoughts on Tuesday.

 

Best regards

Chris

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Friday, 24 July 2020 13:18
To: myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; Christopher Temple <Christopher.Temple@...>; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi All

 

I am following up on the first TODO of the reference summary at the bottom.

[TODO]: investigate where these definitions can be found, so that we have a common reference and a common understanding

 

Please review and give your feedback.
@Chris please see the ‘next steps’ on this specific point; it would be good to provide a similar level of detail as I did for Intel (mainly to understand what represents a hazard and what does not)

 

Thanks

Gab

------------------

Follow-up

------------------

 

We can use the definitions in ACPI specs 2.8, Appendix N – Table 54 Error Record Header:

Field: Error Severity
Byte offset: 12
Byte length: 4
Description:

Indicates the severity of the error condition. The severity of the
error record corresponds to the most severe error section.
0 - Recoverable (also called non-fatal uncorrected)
1 - Fatal
2 - Corrected
3 - Informational
All other values are reserved.
Note that severity of "Informational" indicates that the record
could be safely ignored by error handling software.
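
As a rough illustration of the severity encoding quoted above, the values can be captured in a small enumeration. This is only a sketch: the names `ErrorSeverity`, `decode_severity` and `can_be_ignored` are made up for this example and are not taken from the specification or from Linux.

```python
from enum import IntEnum


class ErrorSeverity(IntEnum):
    """Severity values as listed in the Error Record Header text above."""
    RECOVERABLE = 0    # also called non-fatal uncorrected
    FATAL = 1
    CORRECTED = 2
    INFORMATIONAL = 3


def decode_severity(raw: int) -> ErrorSeverity:
    """Decode a raw severity value; all other values are reserved."""
    try:
        return ErrorSeverity(raw)
    except ValueError:
        raise ValueError(f"reserved severity value: {raw}") from None


def can_be_ignored(severity: ErrorSeverity) -> bool:
    """Only 'Informational' records may be safely ignored, per the note above."""
    return severity is ErrorSeverity.INFORMATIONAL
```

Note that the numeric values are not ordered by severity (Corrected is 2, Fatal is 1), so they should be compared by name rather than with `<` or `>`.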

From the UEFI Specs we have:

Machine Check Exception (MCE): {0xE8F56FFE, 0x919C, 0x4cc5, {0xBA, 0x88, 0x65, 0xA
0x49, 0x13, 0xBB}}
A Machine Check Exception is a processor-generated exception class interrup
to system software of the presence of a fatal or recoverable error condition

 

Non-Maskable Interrupt (NMI): {0x5BAD89FF, 0xB7E6, 0x42c9, {0x81, 0x4A, 0xCF, 0x24, 0x85,
0xD6, 0xE9, 0x8A}}
Non-Maskable Interrupts are used on X64 platforms to report fatal or recoverable
platform error conditions. NMIs are reported via interrupt vector 2 on IA32 and X64
processor architecture platforms.

 

Synchronous External Abort (SEA): {0x9A78788A, 0xBBE8, 0x11E4, {0x80, 0x9E, 0x67, 0x61,
0x1E, 0x5D, 0x46, 0xB0}}
Synchronous External Aborts represent precise processor error conditions on ARM
systems (uncorrectable and/or recoverable) as described in D3.5 of the ARMv8 ARM
reference manual. This notification may be triggered by one of the following
scenarios: cache parity error, cache ECC error, external bus error, micro-architectural
error, data poisoning, and other platform errors.

 

SError Interrupt (SEI): {0x5C284C81, 0xB0AE, 0x4E87, {0xA3, 0x22, 0xB0, 0x4C, 0x85, 0x62,
0x43, 0x23}}
SError Interrupts represent asynchronous imprecise (or possibly precise) processor
error conditions on ARM systems (corrected, uncorrectable, and recoverable) as
described in D3.5 of the ARM ARM reference manual. This notification may be
triggered by one of the following scenarios: cache parity error, cache ECC error,
external bus error, micro-architectural error, data poisoning, and other platform
errors.

 

Platform Error Interrupt (PEI): {0x09A9D5AC, 0x5204, 0x4214, {0x96, 0xE5, 0x94, 0x99, 0x2E,
0x75, 0x2B, 0xCD}}
Platform Error Interrupts represent asynchronous imprecise platform error conditions
on ARM systems that may be triggered by the following scenarios: system memory
ECC error, ECC errors in system cache (e.g. shared high-level caches), vendor specific
chip errors, external platform errors.
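
To make the open question about PEI easier to see, the notification descriptions quoted above can be transcribed into a lookup table. This is a sketch only: the table and the helper `may_report` are hypothetical names for this example, and the entries simply repeat the wording of the quoted text (the PEI description does not name any severity, so it is left empty).

```python
from typing import Optional

# Severities each notification type is described as reporting, transcribed
# from the quoted text above. An empty set means the text does not say.
NOTIFICATION_SEVERITIES = {
    "MCE": {"fatal", "recoverable"},
    "NMI": {"fatal", "recoverable"},
    "SEA": {"uncorrectable", "recoverable"},
    "SEI": {"corrected", "uncorrectable", "recoverable"},
    "PEI": set(),
}


def may_report(notification: str, severity: str) -> Optional[bool]:
    """Return True/False if the quoted text states it, None if it is silent."""
    severities = NOTIFICATION_SEVERITIES[notification]
    if not severities:
        return None
    return severity in severities
```

For example, `may_report("PEI", "fatal")` returns `None`, which is exactly the gap that still needs to be clarified for ARM.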

 

 

From a HW specific point of view on Intel this would map to: https://www.intel.it/content/www/it/it/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html

FATAL Errors: RIPV bit in IA32_MCG_STATUS MSR not set.

 

Uncorrected Errors (SW Recoverable):

When IA32_MCG_CAP[24] is set, a UCR error is indicated by the following bit settings in the IA32_MCi_STATUS register:

Valid (bit 63) = 1
UC (bit 61) = 1
PCC (bit 57) = 0

There are 3 further classifications of uncorrected errors:

  1. Uncorrected no action required (UCNA) - UCNA errors require no action from system software to continue
    execution. A UCNA error is indicated with UC=1, PCC=0, S=0 and AR=0 in the IA32_MCi_STATUS register.
  2. Software recoverable action optional (SRAO) - An SRAO error when signaled as a machine check is indicated with
    UC=1, PCC=0, S=1, EN=1 and AR=0 in the IA32_MCi_STATUS register. System software
    needs to inspect the MCA error code fields in the IA32_MCi_STATUS register to identify the specific recovery
    action for a given SRAO error. If MISCV and ADDRV are not set, it is recommended that no system software
    error recovery be performed; however, system software can resume execution.
  3. Software recoverable action required (SRAR) - SRAR errors indicate that the error was detected and raised
    at the point of the consumption in the execution flow. An SRAR error is indicated with UC=1, PCC=0, S=1,
    EN=1 and AR=1 in the IA32_MCi_STATUS register. System software needs to inspect the MCA error code
    fields in the IA32_MCi_STATUS register to identify the specific recovery action for a given SRAR error.

 

Corrected Errors: UC bit clear in IA32_MCi_STATUS; Errors reported through CMCI interrupt
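
The bit settings listed above can be turned into a small decision function. This is a minimal sketch under stated assumptions: the bit positions for Valid, UC and PCC are the ones quoted above, while the positions used for S (bit 56) and AR (bit 55) are not given in this email and come from my reading of the Intel SDM, so please double check them. The function name `classify_mci_status` is made up for this example, the EN bit is not checked, and the UCR classes only apply when IA32_MCG_CAP[24] is set.

```python
# Bit positions in IA32_MCi_STATUS.
VAL = 1 << 63   # quoted above
UC = 1 << 61    # quoted above
PCC = 1 << 57   # quoted above
S = 1 << 56     # assumption, not stated in the text above
AR = 1 << 55    # assumption, not stated in the text above


def classify_mci_status(status: int) -> str:
    """Classify a raw IA32_MCi_STATUS value along the lines listed above."""
    if not status & VAL:
        return "invalid"
    if not status & UC:
        return "corrected"
    if status & PCC:
        # UC=1 with PCC=1 is outside the UCR classes listed above.
        return "uncorrected, PCC=1 (not UCR)"
    s = bool(status & S)
    ar = bool(status & AR)
    if not s and not ar:
        return "UCNA"
    if s and not ar:
        return "SRAO"
    if s and ar:
        return "SRAR"
    # S=0 with AR=1 is not one of the listed combinations.
    return "unclassified"
```

For instance, `classify_mci_status(VAL | UC | S | AR)` yields `"SRAR"`, the class that the summary below treats as a hazard alongside fatal errors.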

-------------------

Summary: In my view

-          Recoverable errors as in UEFI specs are the same as UCR errors in Intel SDM

-          [Fatal errors] and [UCR – SRAR] both represent a hazard to be handled within the PST/FTTI.

 

Next Steps

  • On ARM, try to do a detailed analysis of the error classifications as done here for Intel
  • Clarify what PEIs (Platform Error Interrupts) are used for and whether they are used to report hazardous errors

 

 

****************************************************************************************

***********          REFERENCE   SUMMARY BELOW  *********************************************

****************************************************************************************

 

 

Summary:

  • Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM
  • Fatal errors represent a direct hazard whereas non-fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not having been consumed yet)
    [TODO]: to investigate on where these definitions can be found to have a common reference and a common understanding
  • For fatal errors neither ARM nor Intel relies on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)
    [TODO]: to better analyze these paths. It is not clear if for fatal errors these paths are different in the case GHES+APEI enabled vs GHES+APEI disabled
  • For non-fatal errors Chris presented a SW recovery path that is implemented in GHES.c.
    [TODO]: to investigate what happens if FW First (i.e. GHES and APEI) is not enabled on both IA and ARM (on Intel these seem to be reported by the EDAC driver plugging into the MCE handler; e.g.:
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/i10nm_base.c#n319)
  • Correctable Errors: we are aligned on the role of EDAC in counting CEs.
    [TODO]: to understand if a systematic capability is required in the reporting of latent faults

 

 

Gab

 

---------------------------------------------------------------------
INTEL CORPORATION ITALIA S.p.A. con unico socio
Sede: Milanofiori Palazzo E 4
CAP 20094 Assago (MI)
Capitale Sociale Euro 104.000,00 interamente versato
Partita I.V.A. e Codice Fiscale  04236760155
Repertorio Economico Amministrativo n. 997124
Registro delle Imprese di Milano nr. 183983/5281/33
Soggetta ad attivita' di direzione e coordinamento di
INTEL CORPORATION, USA

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.



Christopher Temple
 

Hi Gab,

 

There are two contexts, in which the terms synchronous and asynchronous are used, that need to be differentiated.

 

The first context - that I have been referring to so far - is about the synchronous flow and the asynchronous flow in respect to the memory controller error reporting.

  • The synchronous flow ensures that the transaction gets completed.
  • In the case of a fatal memory read the core gets informed inherently through the synchronous flow that the memory read from DRAM failed.
  • The exception is raised immediately since the CPU needs to know that no valid instruction (or data) could be returned.
  • This is where imho the critical error handling should take place – due to the immediate nature the execution is started within the FTTI – it happens consecutively to the failed read.

 

 

To my understanding the basic principle is the same on x86 and Arm. In both architectures the situation is resolved w/o the Linux EDAC subsystem.

 

The second context is about whether the CPU classifies the abort exception that it receives through the synchronous flow as synchronous-external-abort or asynchronous-external-abort.

  • This is a quite detailed uArchitecture aspect of how the abort is handled inside the CPU.
  • “Synchronous precise” means that all instructions preceding the exception were executed and no instruction after the exception has been executed.
    • In this case one could try to re-execute the failed memory interaction upon rectification of the error.
  • “Asynchronous precise” and “asynchronous imprecise” pertain to out-of-order execution, where the CPU may have rearranged the instruction execution.
    • Since in this case the CPU may already have executed subsequent instructions, it is hard or impossible to tell from the instruction pointer exactly which instructions had been executed when the exception occurred (compared to what you would see in the assembly code compiled from C).
  • Hence the terms “precise” and “imprecise”, which distinguish the different cases.

 

Within the Arm architecture it is up to the CPU to choose and different uArchitectures will do different things since some devices support out-of-order execution, while others don’t.

 

As far as I can tell this is the context that you have copied into the email below.

 

I would say

  • Recoverable-Continuable SRAR Error (RIPV=1, EIPV=1) relates to "Synchronous precise"
    • “System software may be able to restart execution from the interrupted context if it is able to rectify the error condition.”
  • Recoverable-not-continuable SRAR Error (RIPV=0, EIPV=x) Relates to the “Asynchronous” cases
    • “In either case, this indicates that the error is detected at the instruction pointer saved on the stack for this machine check exception and restarting execution with the interrupted context is not possible.”

 

Again the basic principle seems pretty much the same on x86 and Arm. In both architectures the situation is resolved w/o the Linux EDAC subsystem.

 

I’m not sure, if the synchronous-external-abort and asynchronous-external-abort is really relevant for our discussion (unless we want to attempt some complex recovery stuff in pipelined out-of-order CPUs, but – for the record - this can be insanely tricky – best to proceed as stated in the x86 text below “System software must terminate the interrupted stream of execution").

 

Where do you see the relevance of this aspect?

 

Earlier on you wrote “On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard.”

 

What instruction do you see the x86 executing, if no instruction can be returned by the memory controller?

 

Best regards

Chris

 

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Tuesday, 28 July 2020 18:25
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

I mean errors associated to what you call ‘async path’….how do you classify these on ARM?

 

On Intel I have provided a clear extensive classification of errors….

 

Thanks

Gab

 

From: Christopher Temple <Christopher.Temple@...>
Sent: Tuesday, July 28, 2020 6:16 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

I have never used the term “async errors”. What are “async errors”?

 

Best regards

Chris

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Tuesday, 28 July 2020 16:45
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

Are your async errors the same as UE SW Recoverable on Intel?

If yes, on IA I think these could be dangerous if not recovered within the FTTI or PST; for instance in the Intel SDM we have the UE SRAR explanation at the
bottom and in my view SW failing to take action or (worse) taking the wrong recovery action could result in corruption->hazard here.

Maybe for some reasons on ARM it is safe to not handle UE within the PST/FTTI…what I am asking is to build the picture for both ARM and IA so that later on
we can define which SW parts of Linux are assigned with safety relevant tasks (i.e. with safety reqs)

 

Thanks

Gab

 

 

SRAR Error And Affected Logical Processors
The affected logical processor is the one that has detected and raised an SRAR error at the point of the consumption in the execution flow. The affected logical processor should find the Data Load or the Instruction Fetch error
information in the IA32_MCi_STATUS register that is reporting the SRAR error.
Table 15-20 list the actionable scenarios that system software can respond to an SRAR error on an affected logical
processor according to RIPV and EIPV values:
Recoverable-Continuable SRAR Error (RIPV=1, EIPV=1):
For Recoverable-Continuable SRAR errors, the affected logical processor should find that both the
IA32_MCG_STATUS.RIPV and the IA32_MCG_STATUS.EIPV flags are set, indicating that system software may
be able to restart execution from the interrupted context if it is able to rectify the error condition. If system
software cannot rectify the error condition then it must treat the error as a recoverable error where restarting
execution with the interrupted context is not possible. Restarting without rectifying the error condition will
result in most cases with another SRAR error on the same instruction.

Recoverable-not-continuable SRAR Error (RIPV=0, EIPV=x):
For Recoverable-not-continuable errors, the affected logical processor should find that either
— IA32_MCG_STATUS.RIPV= 0, IA32_MCG_STATUS.EIPV=1, or
— IA32_MCG_STATUS.RIPV= 0, IA32_MCG_STATUS.EIPV=0

In either case, this indicates that the error is detected at the instruction pointer saved on the stack for this
machine check exception and restarting execution with the interrupted context is not possible. System
software may take the following recovery actions for the affected logical processor:
The current executing thread cannot be continued. System software must terminate the interrupted
stream of execution and provide a new stream of execution on return from the machine check handler
for the affected logical processor.
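
The two SRAR cases quoted above boil down to a small decision on RIPV and EIPV. The following is a sketch only, to help the discussion: the function name `srar_action` and the returned strings are invented for this example, and the bit positions (RIPV as bit 0, EIPV as bit 1 of IA32_MCG_STATUS) are not stated in the quoted text but come from my reading of the Intel SDM.

```python
# Bit positions in IA32_MCG_STATUS. These are not given in the quoted text;
# treat them as an assumption to be checked against the SDM.
RIPV = 1 << 0  # restart IP valid
EIPV = 1 << 1  # error IP valid


def srar_action(mcg_status: int, error_rectified: bool) -> str:
    """Map RIPV/EIPV to the recovery options described in the quoted text."""
    ripv = bool(mcg_status & RIPV)
    eipv = bool(mcg_status & EIPV)
    if ripv and eipv:
        # Recoverable-continuable: restart is allowed only once the error
        # condition has been rectified.
        if error_rectified:
            return "restart interrupted context"
        return "terminate interrupted stream"
    if not ripv:
        # Recoverable-not-continuable, regardless of EIPV.
        return "terminate interrupted stream"
    # RIPV=1, EIPV=0 is not one of the two SRAR cases quoted above.
    return "not covered by the quoted text"
```

In three of the four covered combinations the outcome is termination of the interrupted stream; only RIPV=1, EIPV=1 with a rectified error allows a restart.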

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Tuesday, July 28, 2020 9:29 AM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

You write

“From an OS point of view exceptions are always asynchronous.”

 

That is correct since the OS doesn’t know when a fatal instruction memory read would occur. The exception occurs synchronously with the fatal instruction memory read to inform the CPU that there is no valid instruction available to execute.

 

Regarding the error reaction to fatal instruction memory reads you write

“On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard.”

 

What instruction do you see the x86 executing, if no instruction can be returned by the memory controller?

 

As stated early in the discussion, data read/write has its own challenges – let’s agree on the instruction read aspect first.

 

Best regards

Chris

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Monday, 27 July 2020 13:19
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

Thanks, pls see GP inline

 

Gab

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Monday, July 27, 2020 11:51 AM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I agree – we need useful safety requirements for Linux and it would be great if we would receive some closer to real world safety requirements from the domains.

GP: good we’re on the same page here 😊

 

The question at hand was whether it makes sense to include the Linux EDAC subsystem in the mitigation of SPFs. The envisioned use case consists of the kernel, a safety application and an EDAC monitor app.

GP: From our last discussions it seems that we are transitioning from EDAC to the memory fault handling in Linux, but in order to assign safety reqs we need to

  • Understand error types and define which ones we ‘assume’ to be dangerous for a generic safety app running on top
  • Identify which parts of Linux are involved in reporting and handling the dangerous errors depending on the different HWs (e.g. ARM vs Intel vs Others) and configurations (e.g. ACPI APEI on vs ACPI APEI off)
  • Allocate SReqs to Linux accordingly

 

The conclusion from the discussions seems to be that at least in the use case drafted by the architecture WG some weeks ago this doesn’t really add value.

 

I’ve summarized the conclusion in the table:

 

 

Imho I think we have gained some very valuable insights from the discussion, which range from inaccuracies in the descriptions on kernel.org to the need to understand how Linux interacts with hardware in failure mode situations.

GP: Right now if we take away the EDAC framework and if we rely on Linux to kill itself or the safety app we need to allocate Linux with SReqs accordingly. So yes, you are right, we are transitioning from EDAC to a different model

 

The learning for this use case seems to be that

  • the safety integrity for SPF really resides on the Linux execution in the synchronous path and not on the asynchronous path (including firmware);

GP: Here I think we need to investigate better. From an OS point of view exceptions are always asynchronous. Now looking at your slide, what you call async flow is related to the handling of UE SW Recoverable. On IA I think that
a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard. Now as you can see at the bottom of this thread I am
asking to do a similar investigation on ARM to make sure that what you call ‘async flow’ is always related to errors that do not represent a hazard

  • the interaction with an external safeing device is done better via some watchdog mechanism through the safety application.

GP: I agree on the WD, yes

  • the asynchronous path is important for latent faults (as you write we still need to clarify if the full symmetric integrity of the safety goal is needed in this case)

 

I hope I got it right.

 

Let’s discuss tomorrow.

GP: OK, I’ll save 15min for the second part of the meeting as we have already the Linux Kernel map on the agenda.

 

Best regards

Chris

 

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Saturday, 25 July 2020 10:20
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

To be honest from my point of view it is important to just assign some safety requirements to Linux to get started with the safety analyses. I would expect detailed and more correct TSRs to come from domain specific WGs.
Hence I proposed EDAC (because if you rely on NMI error handling instead of MCA then EDAC is responsible for error handling) because from my personal point of view I would just be happy to start with assigning some safety
reqs, maybe not complete, not fully correct but that at least would allow us to focus on the safety analyses (FFI, FMEA or FTA) where I believe we’ll find common challenges regardless of which specific subsystems of Linux;
challenges that we can solve and where we can apply the solution to other scenarios once we have them from the domain specific WGs.

 

However EDAC was challenged and now we moved to analyze error reporting and handling specifically for MCA and the ARM counterpart of MCA (see Tony and your presentation on last Tue). My end goal is still to come up
with some TSRs that can be accepted and that would allow us to move on. Hence I think we need to:

  a. Classify the errors and decide which we can ‘assume’ to be a hazard
  b. See how these ‘hazardous’ errors are handled and if it makes sense to rely on Linux (see email from Lukas “[…] How does the HW react to a memory fault, that was caused by a required memory access for a specific operation of the safety function?”)
  c. Following on b) define TSRs to be allocated to Linux

 

In my previous analysis of a), since I don’t have a specific context for a specific application running on a specific system, I just assume that a fatal error, if not reported, is a hazard, and so is UE SRAR, because for UE SRAR the error has been consumed and an action shall be taken to avoid incorrect data.

Having said that the reason why I asked ARM to do a similar analysis is to move on to the next steps b) and c) where we go and check the Linux paths involved in handling the respective errors.

 

With respect to your points below

  • the termination path via the hardware is faster;

Do all systems have a termination path via HW? Wouldn’t it be good to rely on Linux, or at least to analyse how much we can rely on Linux?
And BTW is it really faster or safer? See the previous discussion with Corey.

  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

Even if you rely on FW to handle errors you would have the same problem (see previous discussion with Corey)

 

BTW in my view it is crucial to assign Linux with safety reqs to start analyses; these may not be fully correct in some specific contexts or not complete but they’ll allow us to go on with safety analyses and to start looking at the challenges that we’ll also encounter later on as TSRs will come from domain specific WGs.
If we continue like this we go around in circles and we are not able to work on the core of this WG.

 

Thanks

Gab

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Friday, July 24, 2020 6:49 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I don’t understand why we need all those details. The question about what constitutes a hazard is not resolved in the details.

 

We should first double check the summary - here are some comments:

 

  1. “Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM”

 

Yes, there are very low level differences in the HW (for example Arm does not have an NMI), but ELISA really shouldn’t be arguing at that level.

 

The assumption from our side was that the general error handling flow (at the level shown on Tuesday) essentially captures what happens in either architecture. As we pointed out, the BIOS is in fact the same.

 

 

  2. “Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not been consumed already)”

 

One cannot designate a fatal error as hazard w/o considering

  • the interface at which the fatal error is observed,
  • the transaction that is affected, and
  • the application use case.

 

Assume, for example, a system consisting of the kernel, one or more safety critical applications and one or more non-safety critical applications.

 

A fatal memory read error (i.e. the memory controller cannot return valid data to the core) occurring during the execution of the non-safety critical application, for example, will not lead to a violation of a safety goal and is hence not a direct hazard.

 

A fatal memory read error occurring during the execution of the safety critical application will lead to loss of service. This will only lead to a violation of the safety goal if the system has been designed such that loss of this service leads to a hazard. However, as there are many (more likely) error cases that can lead to loss of service, provisions for this case need to be in place anyway.

 

Similarly, fatal memory read errors can also occur during other operations, such as DMA transfers, which require different reasoning.
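
The reasoning above can be written down as a small decision function. This is only a sketch of the argument, not an agreed classification: the type `FatalReadContext`, its field names and the returned labels are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FatalReadContext:
    """Context in which a fatal memory read error is observed (hypothetical model)."""
    transaction: str                    # e.g. "instruction_execution" or "dma"
    safety_critical: bool               # is the affected application safety critical?
    loss_of_service_is_hazardous: bool  # system design property, set by the integrator


def assess_fatal_read(ctx: FatalReadContext) -> str:
    """Return an illustrative verdict for a fatal memory read error."""
    # Other transactions (e.g. DMA) need their own reasoning.
    if ctx.transaction != "instruction_execution":
        return "needs separate analysis"
    # Non-safety critical application: no safety goal is violated.
    if not ctx.safety_critical:
        return "no direct hazard"
    # Safety critical application: the error leads to loss of service,
    # which is a hazard only if the system was designed that way.
    if ctx.loss_of_service_is_hazardous:
        return "hazard"
    return "loss of service, no safety goal violation"
```

For example, `assess_fatal_read(FatalReadContext("instruction_execution", False, True))` yields "no direct hazard", which mirrors the non-safety critical case described above.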

 

  3. “For Fatal error neither ARM or Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)”

This needs a more nuanced consideration.

 

Neither Arm nor Intel based systems use the Linux EDAC subsystem for the direct error reaction to fatal memory read errors that occur during instruction execution.

 

Why am I so particular?

  • Firstly, because the general term “EDAC” is often used to refer to the complete sphere of HW and SW execution around error detection and control.
    • So with this understanding of the term “EDAC” the correct statement would be that “in both Arm and Intel based systems the HW EDAC subsystem initiates the error reaction to fatal memory read errors that occur during instruction execution”.
  • Secondly, because the term “rely” does not relate to a specific instance.
    • Both Arm and Intel based systems can use the Linux EDAC subsystem for a posteriori diagnosis of what happened.
    • So in the a posteriori phase both Arm and Intel based systems can make use of it, i.e. “rely” on it.

 

  4. “For non Fatal Error Chris presented a SW recovery path that is implemented in GHES.c.”

We explained the basic execution flow in the case of fatal memory read errors and in the case of non-fatal memory read errors. I would not call either of these a “SW recovery path”, as nothing is recovered.

 

 

  5. “Correctable Errors: we are aligned on the role of EDAC in counting CEs.”

As explained the EDAC counts much more:

I didn’t see that this was ever disputed.

 

What we did challenge was the idea of driving a SW-driven error reaction from HW_EVENT_ERR_FATAL by placing the Linux EDAC subsystem into the critical path in the event of fatal memory read errors that occur during instruction execution.

 

We raised two arguments:

  • the termination path via the hardware is faster;
  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

 

Is this disputed any longer?

 

We should try to keep the level of detail needed as simple as possible.

 

I’m happy to explain our thoughts on Tuesday.

 

Best regards

Chris

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Friday, 24 July 2020 13:18
To: myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; Christopher Temple <Christopher.Temple@...>; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi All

 

I am following up on the first TODO of the reference summary at the bottom.

[TODO]: to investigate on where these definitions can be found to have a common reference and a common understanding

 

Please review and give your feedback.
@Chris please see the ‘next steps’ on this specific point; it would be good to provide similar detail as I did for Intel (mainly to understand what represents a hazard and what does not)

 

Thanks

Gab

------------------

Follow-up

------------------

 

We can use the definition as in ACPI specs 2.8 Appendix N – Table 54 Error Record Header:

Field: Error Severity
Byte offset: 12
Byte length: 4
Description:
Indicates the severity of the error condition. The severity of the
error record corresponds to the most severe error section.
0 - Recoverable (also called non-fatal uncorrected)
1 - Fatal
2 - Corrected
3 - Informational
All other values are reserved.
Note that severity of "Informational" indicates that the record
could be safely ignored by error handling software.
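
As a small illustration of the encoding quoted above, the following Python sketch decodes the Error Severity field and derives a record severity from its sections. The numeric values are the ones listed in the table; the names `CperSeverity`, `decode_severity` and `record_severity`, and the ranking Fatal > Recoverable > Corrected > Informational, are assumptions made for this sketch. Note that the numeric values are not ordered by severity, so a plain numeric comparison would give the wrong answer.

```python
from enum import IntEnum


class CperSeverity(IntEnum):
    """Error Severity values as listed in the table above."""
    RECOVERABLE = 0    # also called non-fatal uncorrected
    FATAL = 1
    CORRECTED = 2
    INFORMATIONAL = 3


# Assumed ranking (higher means more severe); not taken from the spec text.
_RANK = {
    CperSeverity.INFORMATIONAL: 0,
    CperSeverity.CORRECTED: 1,
    CperSeverity.RECOVERABLE: 2,
    CperSeverity.FATAL: 3,
}


def decode_severity(raw: int) -> CperSeverity:
    """Map a raw field value to a severity; all other values are reserved."""
    try:
        return CperSeverity(raw)
    except ValueError:
        raise ValueError(f"reserved severity value: {raw}") from None


def record_severity(section_severities):
    """Severity of the record = severity of its most severe section."""
    decoded = [decode_severity(raw) for raw in section_severities]
    if not decoded:
        raise ValueError("an error record needs at least one section")
    return max(decoded, key=_RANK.__getitem__)
```

For instance, a record with one corrected and one recoverable section would be reported as recoverable under this ranking.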

From the UEFI Specs we have:

Machine Check Exception (MCE): {0xE8F56FFE, 0x919C, 0x4cc5, {0xBA, 0x88, 0x65, 0xA
0x49, 0x13, 0xBB}}
A Machine Check Exception is a processor-generated exception class interrup
to system software of the presence of a fatal or recoverable error condition

 

Non-Maskable Interrupt (NMI): {0x5BAD89FF, 0xB7E6, 0x42c9, {0x81, 0x4A, 0xCF, 0x24, 0x85,
0xD6, 0xE9, 0x8A}}
Non-Maskable Interrupts are used on X64 platforms to report fatal or recoverable
platform error conditions. NMIs are reported via interrupt vector 2 on IA32 and X64
processor architecture platforms.

 

Synchronous External Abort (SEA): {0x9A78788A, 0xBBE8, 0x11E4, {0x80, 0x9E, 0x67, 0x61,
0x1E, 0x5D, 0x46, 0xB0}}
Synchronous External Aborts represent precise processor error conditions on ARM
systems (uncorrectable and/or recoverable) as described in D3.5 of the ARMv8 ARM
reference manual. This notification may be triggered by one of the following
scenarios: cache parity error, cache ECC error, external bus error, micro-architectural
error, data poisoning, and other platform errors.

 

SError Interrupt (SEI): {0x5C284C81, 0xB0AE, 0x4E87, {0xA3, 0x22, 0xB0, 0x4C, 0x85, 0x62,
0x43, 0x23}}
SError Interrupts represent asynchronous imprecise (or possibly precise) processor
error conditions on ARM systems (corrected, uncorrectable, and recoverable) as
described in D3.5 of the ARM ARM reference manual. This notification may be
triggered by one of the following scenarios: cache parity error, cache ECC error,
external bus error, micro-architectural error, data poisoning, and other platform
errors.

 

Platform Error Interrupt (PEI): {0x09A9D5AC, 0x5204, 0x4214, {0x96, 0xE5, 0x94, 0x99, 0x2E,
0x75, 0x2B, 0xCD}}
Platform Error Interrupt represent asynchronous imprecise platform error conditions
on ARM systems that may be triggered by the following scenarios: system memory
ECC error, ECC errors in system cache (e.g. shared high-level caches), vendor specific
chip errors, external platform errors.

 

 

From a HW specific point of view on Intel this would map to: https://www.intel.it/content/www/it/it/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html

FATAL Errors: RIPV bit in IA32_MCG_STATUS MSR not set.

 

Uncorrected Errors (SW Recoverable):

When IA32_MCG_CAP[24] is set, a UCR error is indicated by the following bit settings in the IA32_MCi_STATUS register:

Valid (bit 63) = 1
UC (bit 61) = 1
PCC (bit 57) = 0

There are 3 further classifications of uncorrected errors:

  1. Uncorrected no action required (UCNA) - UCNA errors require no action from system software to continue
    execution. A UCNA error is indicated with UC=1, PCC=0, S=0 and AR=0 in the IA32_MCi_STATUS register
  2. Software recoverable action optional (SRAO): An SRAO error when signaled as a machine check is indicated with
    UC=1, PCC=0, S=1, EN=1 and AR=0 in the IA32_MCi_STATUS register. System software
    needs to inspect the MCA error code fields in the IA32_MCi_STATUS register to identify the specific recovery
    action for a given SRAO error. If MISCV and ADDRV are not set, it is recommended that no system software
    error recovery be performed however, system software can resume execution.
  3. Software recoverable action required (SRAR) - SRAR errors indicate that the error was detected and raised
    at the point of the consumption in the execution flow. An SRAR error is indicated with UC=1, PCC=0, S=1,
    EN=1 and AR=1 in the IA32_MCi_STATUS register. System software needs to inspect the MCA error code
    fields in the IA32_MCi_STATUS register to identify the specific recovery action for a given SRAR error.

 

Corrected Errors: UC bit clear in IA32_MCi_STATUS; Errors reported through CMCI interrupt
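
To make the bit combinations above easier to compare, here is a minimal Python sketch that classifies a raw IA32_MCi_STATUS value. It assumes IA32_MCG_CAP[24] is set. The positions of Valid (63), UC (61) and PCC (57) are the ones given above; the positions used for EN (60), S (56) and AR (55) come from the IA32_MCi_STATUS layout in the Intel SDM and are not stated in this email. The function name and the returned labels are illustrative only.

```python
VAL = 1 << 63   # Valid
UC = 1 << 61    # Uncorrected error
EN = 1 << 60    # Error reporting enabled (position from the SDM, not from this email)
PCC = 1 << 57   # Processor context corrupt
S = 1 << 56     # Signaling (position from the SDM, not from this email)
AR = 1 << 55    # Action required (position from the SDM, not from this email)


def classify_mci_status(status: int) -> str:
    """Classify an IA32_MCi_STATUS value following the list above."""
    if not status & VAL:
        return "INVALID"
    if not status & UC:
        return "CORRECTED"
    if status & PCC:
        # UC=1 with PCC=1 is outside the UCR definition given above.
        return "UC_PCC"
    signaled = bool(status & S)
    action_required = bool(status & AR)
    enabled = bool(status & EN)
    if not signaled and not action_required:
        return "UCNA"
    if signaled and enabled and not action_required:
        return "SRAO"
    if signaled and enabled and action_required:
        return "SRAR"
    # Combinations not listed above (e.g. S=0 with AR=1).
    return "UNCLASSIFIED"
```

A status word with Valid, UC, EN, S and AR set and PCC clear is reported as "SRAR", i.e. the case flagged in the summary below as a hazard to be handled within the PST/FTTI.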

-------------------

Summary: In my view

-        Recoverable errors as in UEFI specs are the same as UCR errors in Intel SDM

-        [Fatal errors] and [UCR – SRAR] both represent a hazard to be handled within the PST/FTTI.

 

Next Steps

  • On ARM try to do a detailed analysis of the error classifications as done here for Intel
  • Clarify what PEIs (Platform Error Interrupts) are used for and whether they are used to report hazardous errors

 

 

****************************************************************************************

***********          REFERENCE   SUMMARY BELOW  *********************************************

****************************************************************************************

 

 

Summary:

  • Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM
  • Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not been consumed already)
    [TODO]: to investigate on where these definitions can be found to have a common reference and a common understanding
  • For Fatal error neither ARM or Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)
    [TODO]: to better analyze these paths. It is not clear if for fatal errors these paths are different in the case GHES+APEI enabled vs GHES+APEI disabled
  • For non Fatal Error Chris presented a SW recovery path that is implemented in GHES.c.
    [TODO]: to investigate what happens if FW First (i.e. GHES and APEI) is not enabled on both IA and ARM. On Intel these seem to be reported by the EDAC driver plugging into the MCE handler; e.g.:
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/i10nm_base.c#n319
  • Correctable Errors: we are aligned on the role of EDAC in counting CEs.
    [TODO]: to understand if a systematic capability is required in the reporting of latent faults

 

 

Gab

 

---------------------------------------------------------------------
INTEL CORPORATION ITALIA S.p.A. con unico socio
Sede: Milanofiori Palazzo E 4
CAP 20094 Assago (MI)
Capitale Sociale Euro 104.000,00 interamente versato
Partita I.V.A. e Codice Fiscale  04236760155
Repertorio Economico Amministrativo n. 997124
Registro delle Imprese di Milano nr. 183983/5281/33
Soggetta ad attivita' di direzione e coordinamento di
INTEL CORPORATION, USA

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.



Paoloni, Gabriele <gabriele.paoloni@...>
 

Hi Chris

 

I think the point here is that the HW relies on SW actions being taken (properly) in order to fix the errors and continue the program execution, or to kill the execution of the program (in the Intel case recoverable-continuable or recoverable-not-continuable respectively); in either case the SW has sensitive tasks to perform that, if not done properly, may lead to a hazard. Do you agree with my point here?


Now I am not talking about EDAC anymore (EDAC could be relevant if errors are reported through NMI instead of MCA); instead I am trying to find out the Linux code paths involved in the handling of such errors, the reason being that a safety critical app may rely on these error handling paths to claim its integrity…

 

Thanks

Gab

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Tuesday, July 28, 2020 7:41 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

There are two contexts, in which the terms synchronous and asynchronous are used, that need to be differentiated.

 

The first context - that I have been referring to so far - is about the synchronous flow and the asynchronous flow with respect to the memory controller error reporting.

  • The synchronous flow ensures that the transaction gets completed.
  • In the case of a fatal memory read the core gets informed inherently through the synchronous flow that the memory read from DRAM failed.
  • The exception is raised immediately since the CPU needs to know that no valid instruction (or data) could be returned.
  • This is where imho the critical error handling should take place – due to the immediate nature the execution is started within the FTTI – it happens consecutively to the failed read.

 

 

To my understanding the basic principle is the same on x86 and Arm. In both architectures the situation is resolved w/o the Linux EDAC subsystem.

 

The second context is about whether the CPU classifies the abort exception that it receives through the synchronous flow as synchronous-external-abort or asynchronous-external-abort.

  • This is a quite detailed uArchitecture aspect of how the abort is handled inside the CPU.
  • “Synchronous precise” means that all instructions preceding the exception were executed and no instruction after the exception has been executed.
    • In this case one could try to re-execute the failed memory interaction upon rectification of the error.
  • “Asynchronous precise” and “asynchronous imprecise” pertain to out-of-order execution, where the CPU may have rearranged the instruction execution.
    • Since in this case the CPU may have already executed subsequent instructions the correlation between the instruction pointer and a precise statement, which instructions have been executed when the exception occurs, is hard or not possible (compared to what you would be seeing in the assembly code compiled from C).
  • Hence the terms “precise” and “imprecise”, which are needed to distinguish the different cases precisely.

 

Within the Arm architecture it is up to the CPU to choose and different uArchitectures will do different things since some devices support out-of-order execution, while others don’t.

 

As far as I can tell this is the context that you have copied into the email below.

 

I would say

  • Recoverable-Continuable SRAR Error (RIPV=1, EIPV=1) relates to "Synchronous precise"
    • “System software may be able to restart execution from the interrupted context if it is able to rectify the error condition.”
  • Recoverable-not-continuable SRAR Error (RIPV=0, EIPV=x) relates to the “Asynchronous” cases
    • “In either case, this indicates that the error is detected at the instruction pointer saved on the stack for this machine check exception and restarting execution with the interrupted context is not possible.”

 

Again the basic principle seems pretty much the same on x86 and Arm. In both architectures the situation is resolved w/o the Linux EDAC subsystem.

 

I’m not sure whether the distinction between synchronous-external-abort and asynchronous-external-abort is really relevant for our discussion (unless we want to attempt some complex recovery in pipelined out-of-order CPUs, but, for the record, this can be insanely tricky; best to proceed as stated in the x86 text below: “System software must terminate the interrupted stream of execution”).

 

Where do you see the relevance of this aspect?

 

Earlier on you wrote “On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard.”

 

What instruction do you see the x86 executing, if no instruction can be returned by the memory controller?

 

Best regards

Chris

 

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Tuesday, 28 July 2020 18:25
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

I mean errors associated with what you call the ‘async path’. How do you classify these on ARM?

 

On Intel I have provided a clear extensive classification of errors….

 

Thanks

Gab

 

From: Christopher Temple <Christopher.Temple@...>
Sent: Tuesday, July 28, 2020 6:16 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

I have never used the term “async errors”. What are “async errors”?

 

Best regards

Chris

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Tuesday, 28 July 2020 16:45
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

Are your async errors the same as UE SW Recoverable on Intel?

If yes, on IA I think these could be dangerous if not recovered within the FTTI or PST; for instance, the Intel SDM gives the UE SRAR explanation quoted at the
bottom, and in my view SW failing to take action or, worse, taking the wrong recovery action could result in corruption and hence a hazard here.

Maybe for some reason on ARM it is safe not to handle UEs within the PST/FTTI. What I am asking is to build the picture for both ARM and IA so that later on
we can define which SW parts of Linux are assigned safety-relevant tasks (i.e. safety reqs).

 

Thanks

Gab

 

 

SRAR Error And Affected Logical Processors
The affected logical processor is the one that has detected and raised an SRAR error at the point of the consumption in the execution flow. The affected logical processor should find the Data Load or the Instruction Fetch error
information in the IA32_MCi_STATUS register that is reporting the SRAR error.
Table 15-20 list the actionable scenarios that system software can respond to an SRAR error on an affected logical
processor according to RIPV and EIPV values:
Recoverable-Continuable SRAR Error (RIPV=1, EIPV=1):
For Recoverable-Continuable SRAR errors, the affected logical processor should find that both the
IA32_MCG_STATUS.RIPV and the IA32_MCG_STATUS.EIPV flags are set, indicating that system software may
be able to restart execution from the interrupted context if it is able to rectify the error condition. If system
software cannot rectify the error condition then it must treat the error as a recoverable error where restarting
execution with the interrupted context is not possible. Restarting without rectifying the error condition will
result in most cases with another SRAR error on the same instruction.

Recoverable-not-continuable SRAR Error (RIPV=0, EIPV=x):
For Recoverable-not-continuable errors, the affected logical processor should find that either
— IA32_MCG_STATUS.RIPV= 0, IA32_MCG_STATUS.EIPV=1, or
— IA32_MCG_STATUS.RIPV= 0, IA32_MCG_STATUS.EIPV=0

In either case, this indicates that the error is detected at the instruction pointer saved on the stack for this
machine check exception and restarting execution with the interrupted context is not possible. System
software may take the following recovery actions for the affected logical processor:
The current executing thread cannot be continued. System software must terminate the interrupted
stream of execution and provide a new stream of execution on return from the machine check handler
for the affected logical processor.
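
The two SRAR cases in the SDM excerpt above can be summarised in a few lines of Python. This is a sketch only: RIPV and EIPV are bits 0 and 1 of IA32_MCG_STATUS, while the function names, the `rectified` parameter and the returned labels are hypothetical and are not taken from the SDM or from the Linux kernel.

```python
RIPV = 1 << 0   # Restart IP valid (IA32_MCG_STATUS bit 0)
EIPV = 1 << 1   # Error IP valid (IA32_MCG_STATUS bit 1)


def srar_kind(mcg_status: int) -> str:
    """Name the SRAR case for the RIPV/EIPV values on the affected logical processor."""
    ripv = bool(mcg_status & RIPV)
    eipv = bool(mcg_status & EIPV)
    if ripv and eipv:
        return "recoverable-continuable"
    if not ripv:
        # RIPV=0 with EIPV=0 or EIPV=1
        return "recoverable-not-continuable"
    # RIPV=1, EIPV=0 is not one of the two cases in the excerpt.
    return "not-listed"


def srar_action(mcg_status: int, rectified: bool) -> str:
    """Pick the action described in the excerpt for an SRAR error."""
    # Restart is only possible if the error is continuable AND system
    # software managed to rectify the error condition; otherwise the
    # interrupted stream of execution has to be terminated.
    if srar_kind(mcg_status) == "recoverable-continuable" and rectified:
        return "restart interrupted context"
    return "terminate interrupted stream of execution"
```

The sketch makes visible that, in both cases, the outcome depends on SW taking the right decision: a continuable error that SW cannot rectify has to be treated like the not-continuable one.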

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Tuesday, July 28, 2020 9:29 AM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

You write

“From an OS point of view exceptions are always asynchronous."

 

That is correct since the OS doesn’t know when a fatal instruction memory read would occur. The exception occurs synchronously with the fatal instruction memory read to inform the CPU that there is no valid instruction available to execute.

 

Regarding the error reaction to fatal instruction memory reads you write

“On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard.”

 

What instruction do you see the x86 executing, if no instruction can be returned by the memory controller?

 

As stated early in the discussion, data read/write has its own challenges – let’s agree on the instruction read aspect first.

 

Best regards

Chris

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Monday, 27 July 2020 13:19
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

Thanks, pls see GP inline

 

Gab

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Monday, July 27, 2020 11:51 AM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I agree – we need useful safety requirements for Linux, and it would be great if we received some safety requirements from the domains that are closer to the real world.

GP: good we’re on the same page here 😊

 

The question at hand was whether it makes sense to include the Linux EDAC subsystem in the mitigation of SPFs. The envisioned use case consists of the kernel, a safety application and an EDAC monitor app.

GP: From our last discussions it seems that we are transitioning from EDAC to the memory fault handling in Linux, but in order to assign safety reqs we need to

  • Understand error types and define which ones we ‘assume’ to be dangerous for a generic safety app running on top
  • Identify which parts of Linux are involved in reporting and handling the dangerous errors depending on the different HWs (e.g. ARM vs Intel vs Others) and configurations (e.g. ACPI APEI on vs ACPI APEI off)
  • Allocate SReqs to Linux accordingly

 

The conclusion from the discussions seems to be that at least in the use case drafted by the architecture WG some weeks ago this doesn’t really add value.

 

I’ve summarized the conclusion in the table:

 

 

Imho I think we have gained some very valuable insights from the discussion, which range from inaccuracies in the descriptions on kernel.org to the need to understand how Linux interacts with hardware in failure mode situations.

GP: Right now if we take away the EDAC framework and if we rely on Linux to kill itself or the safety app we need to allocate Linux with SReqs accordingly. So yes, you are right, we are transitioning from EDAC to a different model

 

The learning for this use case seems to be that

  • the safety integrity for SPF really resides on the Linux execution in the synchronous path and not on the asynchronous path (including firmware);

GP: Here I think we need to investigate further. From an OS point of view exceptions are always asynchronous. Looking at your slide, what you call the async flow is related to the handling of UE SW Recoverable errors. On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard. As you can see at the bottom of this thread, I am asking for a similar investigation on ARM to make sure that what you call ‘async flow’ always relates to errors that do not represent a hazard.

  • the interaction with an external safeing device is done better via some watchdog mechanism through the safety application.

GP: I agree on the WD, yes

  • the asynchronous path is important for latent faults (as you write we still need to clarify if the full symmetric integrity of the safety goal is needed in this case)

 

I hope I got it right.

 

Let’s discuss tomorrow.

GP: OK, I’ll save 15min for the second part of the meeting as we have already the Linux Kernel map on the agenda.

 

Best regards

Chris

 

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Saturday, 25 July 2020 10:20
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

To be honest, from my point of view it is important to just assign some safety requirements to Linux to get started with the safety analyses. I would expect detailed and more correct TSRs to come from domain specific WGs.
Hence I proposed EDAC (if you rely on NMI error handling instead of MCA, then EDAC is responsible for error handling). From my personal point of view I would be happy to start by assigning some safety reqs, maybe not complete and not fully correct, that at least allow us to focus on the safety analyses (FFI, FMEA or FTA). There I believe we’ll find common challenges regardless of which specific Linux subsystems are involved: challenges that we can solve, with solutions that we can apply to other scenarios once we have them from the domain specific WGs.

 

However, EDAC was challenged and we have now moved on to analyzing error reporting and handling specifically for MCA and the ARM counterpart of MCA (see Tony’s and your presentation last Tue). My end goal is still to come up with some TSRs that can be accepted and that would allow us to move on. Hence I think we need to:

  a) Classify the errors and decide which we can ‘assume’ to be a hazard
  b) See how these ‘hazardous’ errors are handled and if it makes sense to rely on Linux (see email from Lukas “[…] How does the HW react to a memory fault, that was caused by a required memory access for a specific operation of the safety function?”)
  c) Following on from b), define TSRs to be allocated to Linux

 

In my previous analysis of a), since I don’t have a specific context for a specific application running on a specific system, I just assume that a fatal error, if not reported, is a hazard, and so is UE SRAR, because for UE SRAR the error has been consumed and an action shall be taken to avoid incorrect data.

Having said that, the reason why I asked ARM to do a similar analysis is to move on to the next steps b) and c), where we go and check the Linux paths involved in handling the respective errors.

 

With respect to your points below

  • the termination path via the hardware is faster;

Do we have a termination path via HW for all systems? Wouldn’t it be good to rely on Linux, or to analyse how much we can rely on Linux?
And BTW, is it really faster or safer? See the previous discussion with Corey.

  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

Even if you rely on FW to handle errors you would have the same problem (see previous discussion with Corey)

 

BTW in my view it is crucial to assign Linux safety reqs in order to start the analyses; these may not be fully correct in some specific contexts, or not complete, but they’ll allow us to go on with the safety analyses and to start looking at the challenges that we’ll also encounter later on, when TSRs come from domain specific WGs.
If we continue like this we go round in circles and we are not able to work on the core of this WG.

 

Thanks

Gab

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Friday, July 24, 2020 6:49 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I don’t understand why we need all those details. The question about what constitutes a hazard is not resolved in the details.

 

We should first double check the summary - here some comments:

 

  1. “Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM”

 

Yes, there are very low level differences in the HW (for example Arm does not have an NMI), but ELISA really shouldn’t be arguing at that level.

 

The assumption from our side was that the general error handling flow (at the level shown on Tuesday) essentially captures what happens in either architecture. As we pointed out, the BIOS is in fact the same.

 

 

  2. “Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not been consumed already)”

 

One cannot designate a fatal error as hazard w/o considering

·         the interface at which the fatal error is observed,

·         the transaction that is affected, and

·         the application use case.

 

Assume, for example, a system consisting of the kernel, one or more safety critical applications and one or more non-safety critical applications.

 

A fatal memory read error (i.e. the memory controller cannot return valid data to the core) occurring during the execution of the non-safety critical application, for example, will not lead to a violation of a safety goal and is hence not a direct hazard.

 

A fatal memory read error occurring during the execution of the safety critical application will lead to loss of service – this will only lead to a violation of the safety goal if the system has been designed such that loss of this service leads to a hazard. However, as there are many (more likely) error cases that can lead to loss of service, provisions for this case need to be in place anyway.

 

Similarly fatal memory read errors can also occur during other operations, such as DMA transfers, which require different reasoning.
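
The case distinction above can be written down as a small sketch. This is only an illustration of the reasoning in this email, under the assumption that the three contexts named here are the only ones considered; the names `Context` and `fatal_read_error_effect` are made up for the example and do not come from any kernel source.

```python
from enum import Enum


class Context(Enum):
    """Where the fatal memory read error is observed (hypothetical labels)."""
    NON_SAFETY_APP = "non-safety critical application"
    SAFETY_APP = "safety critical application"
    OTHER = "other operation, e.g. a DMA transfer"


def fatal_read_error_effect(context, loss_of_service_is_hazardous):
    """Return the effect of a fatal memory read error for the given context.

    loss_of_service_is_hazardous reflects the system design: it is True only
    if losing the safety application's service leads to a hazard.
    """
    if context is Context.NON_SAFETY_APP:
        return "no violation of a safety goal"
    if context is Context.SAFETY_APP:
        if loss_of_service_is_hazardous:
            return "loss of service, violating the safety goal"
        return "loss of service, no violation of the safety goal"
    # DMA transfers and similar operations need their own reasoning.
    return "needs separate reasoning"
```

The point of the sketch is that the error class alone does not decide the outcome; the affected context and the application use case are inputs as well.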

 

  3. “For Fatal error neither ARM or Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)”

This needs a more nuanced consideration.

 

Neither Arm nor Intel based systems use the Linux EDAC subsystem for the direct error reaction to fatal memory read errors that occur during instruction execution.

 

Why am I so particular?

·         Firstly, because the general term “EDAC” is often used to refer to the complete sphere of HW and SW execution around error detection and control.

o   So with this understanding of the term “EDAC” the correct statement would be that ”in both Arm and Intel based systems the HW EDAC subsystem initiates the error reaction to fatal memory read errors that occur during instruction execution”.

·         Secondly, because the term “rely” does not relate to a specific instance.

o   Both Arm and Intel based systems can use the Linux EDAC subsystem for a posteriori diagnosis of what happened.

o   So in the a posteriori phase both Arm and Intel based systems can make use of it, i.e. “rely” on it.

 

  4. “For non Fatal Error Chris presented a SW recovery path that is implemented in GHES.c.”

We explained the basic execution flow in the case of fatal memory read errors and in the case of non-fatal memory read errors – I would not call either of these a “SW recovery path”, as nothing is recovered.

 

 

  5. “Correctable Errors: we are aligned on the role of EDAC in counting CEs.”

As explained the EDAC counts much more:

I didn’t see that this was ever disputed.

 

What we did challenge was the idea to drive a SW driven error reaction from HW_EVENT_ERR_FATAL by placing the Linux EDAC subsystem into a critical path in the event of fatal memory read errors that occur during instruction execution.

 

We raised two arguments:

  • the termination path via the hardware is faster;
  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

 

Is this disputed any longer?

 

We should try to keep the level of detail needed as simple as possible.

 

I’m happy to explain our thoughts on Tuesday.

 

Best regards

Chris

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Friday, 24 July 2020 13:18
To: myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; Christopher Temple <Christopher.Temple@...>; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi All

 

I am following up on the first TODO of the reference summary at the bottom.

[TODO]: to investigate on where these definitions can be found to have a common reference and a common understanding

 

Please review and give your feedback.
@Chris please see the ‘next steps’ on this specific point; it would be good to provide similar detail as I did for Intel (mainly to understand what represents a hazard and what does not)

 

Thanks

Gab

------------------

Follow-up

------------------

 

We can use definition as in ACPI specs 2.8 AppendixN – Table54 Error Record Header:

Field: Error Severity
Byte offset: 12
Byte length: 4
Description: Indicates the severity of the error condition. The severity of the
error record corresponds to the most severe error section.
0 - Recoverable (also called non-fatal uncorrected)
1 - Fatal
2 - Corrected
3 - Informational
All other values are reserved.
Note that severity of "Informational" indicates that the record
could be safely ignored by error handling software.
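
To make the encoding concrete, here is a minimal sketch of how the severity values quoted above could be decoded. The numeric values come from the quoted table; the names `ErrorSeverity`, `record_severity` and `can_be_ignored`, and the ranking used to pick the "most severe" section, are assumptions made for this example only.

```python
from enum import IntEnum


class ErrorSeverity(IntEnum):
    """Severity values as listed in the Error Record Header text above."""
    RECOVERABLE = 0    # also called non-fatal uncorrected
    FATAL = 1
    CORRECTED = 2
    INFORMATIONAL = 3


# Assumed ranking from least to most severe. The numeric encoding itself is
# not ordered by severity, so a separate rank is needed.
_RANK = {
    ErrorSeverity.INFORMATIONAL: 0,
    ErrorSeverity.CORRECTED: 1,
    ErrorSeverity.RECOVERABLE: 2,
    ErrorSeverity.FATAL: 3,
}


def record_severity(section_severities):
    """Return the record severity: that of the most severe error section."""
    return max((ErrorSeverity(s) for s in section_severities),
               key=_RANK.__getitem__)


def can_be_ignored(severity):
    """Informational records may be safely ignored by error handling software."""
    return ErrorSeverity(severity) is ErrorSeverity.INFORMATIONAL
```

Reserved values (anything other than 0 to 3) raise `ValueError` in this sketch rather than being silently accepted.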

From the UEFI Specs we have:

Machine Check Exception (MCE): {0xE8F56FFE, 0x919C, 0x4cc5, {0xBA, 0x88, 0x65, 0xA
0x49, 0x13, 0xBB}}
A Machine Check Exception is a processor-generated exception class interrup
to system software of the presence of a fatal or recoverable error condition

 

Non-Maskable Interrupt (NMI): {0x5BAD89FF, 0xB7E6, 0x42c9, {0x81, 0x4A, 0xCF, 0x24, 0x85,
0xD6, 0xE9, 0x8A}}
Non-Maskable Interrupts are used on X64 platforms to report fatal or recoverable
platform error conditions. NMIs are reported via interrupt vector 2 on IA32 and X64
processor architecture platforms.

 

Synchronous External Abort (SEA): {0x9A78788A, 0xBBE8, 0x11E4, {0x80, 0x9E, 0x67, 0x61,
0x1E, 0x5D, 0x46, 0xB0}}
Synchronous External Aborts represent precise processor error conditions on ARM
systems (uncorrectable and/or recoverable) as described in D3.5 of the ARMv8 ARM
reference manual. This notification may be triggered by one of the following
scenarios: cache parity error, cache ECC error, external bus error, micro-architectural
error, data poisoning, and other platform errors.

 

SError Interrupt (SEI): {0x5C284C81, 0xB0AE, 0x4E87, {0xA3, 0x22, 0xB0, 0x4C, 0x85, 0x62,
0x43, 0x23}}
SError Interrupts represent asynchronous imprecise (or possibly precise) processor
error conditions on ARM systems (corrected, uncorrectable, and recoverable) as
described in D3.5 of the ARM ARM reference manual. This notification may be
triggered by one of the following scenarios: cache parity error, cache ECC error,
external bus error, micro-architectural error, data poisoning, and other platform
errors.

 

Platform Error Interrupt (PEI): {0x09A9D5AC, 0x5204, 0x4214, {0x96, 0xE5, 0x94, 0x99, 0x2E,
0x75, 0x2B, 0xCD}}
Platform Error Interrupt represent asynchronous imprecise platform error conditions
on ARM systems that may be triggered by the following scenarios: system memory
ECC error, ECC errors in system cache (e.g. shared high-level caches), vendor specific
chip errors, external platform errors.
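
As an illustration, the notification types quoted above can be kept in a small lookup table keyed by GUID. This is a sketch only: the helper names `guid` and `notification_name` are made up here, and the MCE entry is left out because its GUID is truncated in the quoted text.

```python
import uuid


def guid(d1, d2, d3, d4):
    """Build a UUID from the {Data1, Data2, Data3, {Data4[8]}} notation above."""
    node = int.from_bytes(bytes(d4[2:]), "big")
    return uuid.UUID(fields=(d1, d2, d3, d4[0], d4[1], node))


# Only GUIDs that appear complete in the quoted text are listed.
NOTIFICATION_TYPES = {
    guid(0x5BAD89FF, 0xB7E6, 0x42C9,
         [0x81, 0x4A, 0xCF, 0x24, 0x85, 0xD6, 0xE9, 0x8A]): "NMI",
    guid(0x9A78788A, 0xBBE8, 0x11E4,
         [0x80, 0x9E, 0x67, 0x61, 0x1E, 0x5D, 0x46, 0xB0]): "SEA",
    guid(0x5C284C81, 0xB0AE, 0x4E87,
         [0xA3, 0x22, 0xB0, 0x4C, 0x85, 0x62, 0x43, 0x23]): "SEI",
    guid(0x09A9D5AC, 0x5204, 0x4214,
         [0x96, 0xE5, 0x94, 0x99, 0x2E, 0x75, 0x2B, 0xCD]): "PEI",
}


def notification_name(g):
    """Map a notification-type GUID to its short name, or 'unknown'."""
    return NOTIFICATION_TYPES.get(g, "unknown")
```

Note that this works on the textual form of the GUID; how the bytes are laid out in a raw error record is a separate question that the sketch does not address.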

 

 

From a HW specific point of view on Intel this would map to: https://www.intel.it/content/www/it/it/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html

FATAL Errors: RIPV bit in IA32_MCG_STATUS MSR not set.

 

Uncorrected Errors (SW Recoverable):

When IA32_MCG_CAP[24] is set, a UCR error is indicated by the following bit settings in the IA32_MCi_STATUS register:

Valid (bit 63) = 1
UC (bit 61) = 1
PCC (bit 57) = 0

There are 3 further classifications of uncorrected errors:

  1. Uncorrected no action required (UCNA) - UCNA errors require no action from system software to continue
    execution. A UCNA error is indicated with UC=1, PCC=0, S=0 and AR=0 in the IA32_MCi_STATUS register
  2. Software recoverable action optional (SRAO): An SRAO error when signaled as a machine check is indicated with
    UC=1, PCC=0, S=1, EN=1 and AR=0 in the IA32_MCi_STATUS register. System software
    needs to inspect the MCA error code fields in the IA32_MCi_STATUS register to identify the specific recovery
    action for a given SRAO error. If MISCV and ADDRV are not set, it is recommended that no system software
    error recovery be performed however, system software can resume execution.
  3. Software recoverable action required (SRAR) - SRAR errors indicate that the error was detected and raised
    at the point of the consumption in the execution flow. An SRAR error is indicated with UC=1, PCC=0, S=1,
    EN=1 and AR=1 in the IA32_MCi_STATUS register. System software needs to inspect the MCA error code
    fields in the IA32_MCi_STATUS register to identify the specific recovery action for a given SRAR error.
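
The bit combinations above can be checked mechanically. The following is a minimal sketch of such a classification, assuming IA32_MCG_CAP[24] is set. The positions of Valid, UC and PCC are the ones given above; the positions used for EN, S and AR (60, 56, 55) are not stated in this email and are taken from my reading of the SDM, so please double check them. The function name and the returned labels are made up for the example.

```python
# Bit positions in IA32_MCi_STATUS. VAL, UC and PCC are as listed above;
# EN, S and AR are assumptions based on the SDM register layout.
VAL, UC, EN, PCC, S, AR = 63, 61, 60, 57, 56, 55


def bit(value, pos):
    """Return bit `pos` of `value` as 0 or 1."""
    return (value >> pos) & 1


def classify_mci_status(status):
    """Classify a raw IA32_MCi_STATUS value using the bit settings above."""
    if not bit(status, VAL):
        return "no valid error"
    if not bit(status, UC):
        return "corrected"
    if bit(status, PCC):
        return "uncorrected, processor context corrupt"
    # UC=1 and PCC=0: an uncorrected recoverable (UCR) error.
    # EN is not checked here; it only matters for how the error is signaled.
    s, ar = bit(status, S), bit(status, AR)
    if s == 0 and ar == 0:
        return "UCNA"
    if s == 1 and ar == 0:
        return "SRAO"
    if s == 1 and ar == 1:
        return "SRAR"
    return "reserved"
```

For example, a value with VAL, UC, EN, S and AR set and PCC clear is reported as SRAR by this sketch, which is the class assumed to be a hazard in the summary.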

 

Corrected Errors: UC bit clear in IA32_MCi_STATUS; Errors reported through CMCI interrupt

-------------------

Summary: In my view

-          Recoverable errors as in UEFI specs are the same as UCR errors in Intel SDM

-          [Fatal errors] and [UCR – SRAR] both represent a hazard to be handled within the PST/FTTI.

 

Next Steps

  • On ARM, try to do a detailed analysis of the error classifications as done here for Intel
  • Clarify what PEIs (Platform Error Interrupts) are used for and whether they are used to report hazardous errors

 

 

****************************************************************************************

***********          REFERENCE   SUMMARY BELOW  *********************************************

****************************************************************************************

 

 

Summary:

  • Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM
  • Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not been consumed already)
    [TODO]: to investigate on where these definitions can be found to have a common reference and a common understanding
  • For Fatal error neither ARM or Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)
    [TODO]: to better analyze these paths. It is not clear if for fatal errors these paths are different in the case GHES+APEI enabled vs GHES+APEI disabled
  • For non Fatal Error Chris presented a SW recovery path that is implemented in GHES.c.
    [TODO]: to investigate what happens if FW First (i.e. GHES and APEI) is not enabled on both IA and ARM (on Intel these seem to be reported by the EDAC driver plugging into the MCE handler), e.g.:
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/i10nm_base.c#n319
  • Correctable Errors: we are aligned on the role of EDAC in counting CEs.
    [TODO]: to understand if a systematic capability is required in the reporting of latent faults

 

 

Gab

 

---------------------------------------------------------------------
INTEL CORPORATION ITALIA S.p.A. con unico socio
Sede: Milanofiori Palazzo E 4
CAP 20094 Assago (MI)
Capitale Sociale Euro 104.000,00 interamente versato
Partita I.V.A. e Codice Fiscale  04236760155
Repertorio Economico Amministrativo n. 997124
Registro delle Imprese di Milano nr. 183983/5281/33
Soggetta ad attivita' di direzione e coordinamento di
INTEL CORPORATION, USA

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.


Christopher Temple
 

Hi Gab,

 

Absolutely, and I don’t think there was ever any doubt that critical SW exists that has to perform sensitive tasks.

 

The key task is finding the critical SW.

 

At the beginning of the discussion there was a strong focus on the Linux EDAC subsystem and that the Linux EDAC subsystem would be the critical SW path in the event of fatal memory errors.

 

I think now with more collective understanding in the team of how Linux interacts with the hardware it is clear that the critical SW path for fatal memory errors is in the exception handling.

 

In the discussion we came across numerous interesting findings, for example:

  • The descriptions on kernel.org were never developed with the intent to be read with a safety context in mind and are therefore not as precise as one would like them to be.
  • The basic behaviour between different hardware architectures is not fundamentally different.
    • However, it is important to understand the details – in some cases subtle differences exist and these need to be understood.
  • The safety dependency on Linux depends on how Linux is used in specific applications.
    • It is possible to integrate Linux in a way that the safety dependency remains small; it is also possible to integrate it in a way that the safety dependency is big.

 

We shouldn’t lose sight of the fact that if the Linux EDAC subsystem is used to boost availability in light of non-fatal memory faults, it still plays an important role in the overall safety story via the asynchronous flow.

You wanted to clarify what safety activities from the ISO standard need to be covered in this case.

 

Best regards

Chris

 

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Wednesday, 29 July 2020 01:29
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

I think the point here is that the HW relies on SW actions being taken (properly) in order to either fix the errors and continue the program execution, or kill the execution of the program (in the Intel case recoverable-continuable or recoverable-not-continuable respectively). In either case the SW has sensitive tasks to perform that, if not done properly, may lead to a hazard. Do you agree with my point here?


Now I am not talking about EDAC anymore (EDAC could be relevant if errors are reported through NMI instead of MCA). Instead I am trying to find out which Linux code paths are involved in the handling of such errors, the reason being that a safety critical app may rely on these error handling paths to claim its integrity…

 

Thanks

Gab

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Tuesday, July 28, 2020 7:41 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

There are two contexts, in which the terms synchronous and asynchronous are used, that need to be differentiated.

 

The first context - that I have been referring to so far - is about the synchronous flow and the asynchronous flow in respect to the memory controller error reporting.

  • The synchronous flow ensures that the transaction gets completed.
  • In the case of a fatal memory read the core gets informed inherently through the synchronous flow that the memory read from DRAM failed.
  • The exception is raised immediately since the CPU needs to know that no valid instruction (or data) could be returned.
  • This is where imho the critical error handling should take place – due to the immediate nature the execution is started within the FTTI – it happens consecutively to the failed read.

 

 

To my understanding the basic principle is the same on x86 and Arm. In both architectures the situation is resolved w/o the Linux EDAC subsystem.

 

The second context is about whether the CPU classifies the abort exception that it receives through the synchronous flow as synchronous-external-abort or asynchronous-external-abort.

  • This is a quite detailed uArchitecture aspect of how the abort is handled inside the CPU.
  • “Synchronous precise” means that all instructions preceding the exception were executed and no instruction after the exception has been executed.
    • In this case one could try to re-execute the failed memory interaction upon rectification of the error.
  • “Asynchronous precise” and “asynchronous imprecise” pertain to out-of-order execution, where the CPU may have rearranged the instruction execution.
    • Since in this case the CPU may have already executed subsequent instructions, it is hard or not possible to correlate the instruction pointer with a precise statement of which instructions have been executed when the exception occurs (compared to what you would be seeing in the assembly code compiled from C).
  • Hence the terms “precise” and “imprecise”, which are used to distinguish the different cases.

 

Within the Arm architecture it is up to the CPU to choose, and different uArchitectures will do different things, since some devices support out-of-order execution while others don’t.

 

As far as I can tell this is the context that you have copied into the email below.

 

I would say

  • Recoverable-Continuable SRAR Error (RIPV=1, EIPV=1) relates to "Synchronous precise"
    • “System software may be able to restart execution from the interrupted context if it is able to rectify the error condition.”
  • Recoverable-not-continuable SRAR Error (RIPV=0, EIPV=x) relates to the “Asynchronous” cases
    • “In either case, this indicates that the error is detected at the instruction pointer saved on the stack for this machine check exception and restarting execution with the interrupted context is not possible.”

 

Again the basic principle seems pretty much the same on x86 and Arm. In both architectures the situation is resolved w/o the Linux EDAC subsystem.

 

I’m not sure whether the distinction between synchronous-external-abort and asynchronous-external-abort is really relevant for our discussion (unless we want to attempt some complex recovery in pipelined out-of-order CPUs, but, for the record, this can be insanely tricky; best to proceed as stated in the x86 text below: “System software must terminate the interrupted stream of execution”).

 

Where do you see the relevance of this aspect?

 

Earlier on you wrote “On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard.”

 

What instruction do you see the x86 executing, if no instruction can be returned by the memory controller?

 

Best regards

Chris

 

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Dienstag, 28. Juli 2020 18:25
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

I mean errors associated to what you call ‘async path’….how do you classify these on ARM?

 

On Intel I have provided a clear extensive classification of errors….

 

Thanks

Gab

 

From: Christopher Temple <Christopher.Temple@...>
Sent: Tuesday, July 28, 2020 6:16 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

I have never used the term “async errors”. What are “async errors”?

 

Best regards

Chris

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Dienstag, 28. Juli 2020 16:45
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

Are your async errors the same as UE SW Recoverable on Intel?

If yes, on IA I think these could be dangerous if not recovered within the FTTI or PST; for instance in the Intel SDM we have the UE SRAR explanation at the bottom, and in my view SW failing to take action, or worse, taking the wrong recovery action, could result in corruption->hazard here.

Maybe for some reason on ARM it is safe to not handle UE within the PST/FTTI… what I am asking is to build the picture for both ARM and IA so that later on we can define which SW parts of Linux are assigned safety relevant tasks (i.e. safety reqs).

 

Thanks

Gab

 

 

SRAR Error And Affected Logical Processors
The affected logical processor is the one that has detected and raised an SRAR error at the point of the consumption in the execution flow. The affected logical processor should find the Data Load or the Instruction Fetch error
information in the IA32_MCi_STATUS register that is reporting the SRAR error.
Table 15-20 list the actionable scenarios that system software can respond to an SRAR error on an affected logical
processor according to RIPV and EIPV values:
Recoverable-Continuable SRAR Error (RIPV=1, EIPV=1):
For Recoverable-Continuable SRAR errors, the affected logical processor should find that both the
IA32_MCG_STATUS.RIPV and the IA32_MCG_STATUS.EIPV flags are set, indicating that system software may
be able to restart execution from the interrupted context if it is able to rectify the error condition. If system
software cannot rectify the error condition then it must treat the error as a recoverable error where restarting
execution with the interrupted context is not possible. Restarting without rectifying the error condition will
result in most cases with another SRAR error on the same instruction.

Recoverable-not-continuable SRAR Error (RIPV=0, EIPV=x):
For Recoverable-not-continuable errors, the affected logical processor should find that either
— IA32_MCG_STATUS.RIPV= 0, IA32_MCG_STATUS.EIPV=1, or
— IA32_MCG_STATUS.RIPV= 0, IA32_MCG_STATUS.EIPV=0

In either case, this indicates that the error is detected at the instruction pointer saved on the stack for this
machine check exception and restarting execution with the interrupted context is not possible. System
software may take the following recovery actions for the affected logical processor:
The current executing thread cannot be continued. System software must terminate the interrupted
stream of execution and provide a new stream of execution on return from the machine check handler
for the affected logical processor.
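
As a rough illustration of the RIPV/EIPV scenarios quoted above, here is a minimal Python sketch that decodes the two flags from an IA32_MCG_STATUS value. The bit positions (RIPV = bit 0, EIPV = bit 1) follow the Intel SDM; the function name and the returned strings are hypothetical and are not taken from the SDM or from any Linux API.

```python
# Bit positions in IA32_MCG_STATUS (Intel SDM): RIPV = bit 0, EIPV = bit 1.
RIPV_BIT = 0
EIPV_BIT = 1


def srar_scenario(mcg_status: int) -> str:
    """Map RIPV/EIPV of an SRAR error to the scenario names quoted above.

    Hypothetical helper for illustration only; the returned strings are
    not defined by the SDM or by Linux.
    """
    ripv = (mcg_status >> RIPV_BIT) & 1
    eipv = (mcg_status >> EIPV_BIT) & 1
    if ripv and eipv:
        # Restart is possible, but only after the error has been rectified.
        return "recoverable-continuable"
    if not ripv:
        # EIPV may be 0 or 1; the interrupted stream must be terminated.
        return "recoverable-not-continuable"
    # RIPV=1, EIPV=0 is not covered by the text quoted in this mail.
    return "not covered by quoted text"


if __name__ == "__main__":
    for value in (0b11, 0b10, 0b00, 0b01):
        print(f"RIPV/EIPV bits {value:02b}: {srar_scenario(value)}")
```

The sketch only names the scenario; what the handler actually does (rectify and restart, or terminate the interrupted stream) is the part that would carry safety requirements.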

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Tuesday, July 28, 2020 9:29 AM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

You write

“From an OS point of view exceptions are always asynchronous.”

 

That is correct since the OS doesn’t know when a fatal instruction memory read would occur. The exception occurs synchronously with the fatal instruction memory read to inform the CPU that there is no valid instruction available to execute.

 

Regarding the error reaction to fatal instruction memory reads you write

“On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard.”

 

What instruction do you see the x86 executing, if no instruction can be returned by the memory controller?

 

As stated early in the discussion, data read/write has its own challenges – let’s agree on the instruction read aspect first.

 

Best regards

Chris

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Montag, 27. Juli 2020 13:19
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

Thanks, pls see GP inline

 

Gab

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Monday, July 27, 2020 11:51 AM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I agree – we need useful safety requirements for Linux and it would be great if we would receive some closer to real world safety requirements from the domains.

GP: good we’re on the same page here 😊

 

The question at hand was whether it makes sense to include the Linux EDAC subsystem in the mitigation of SPFs. The envisioned use case consists of the kernel, a safety application and an EDAC monitor app.

GP: From our last discussions it seems that we are transitioning from EDAC to the memory fault handling in Linux, but in order to assign safety reqs we need to

  • Understand error types and define which ones we ‘assume’ to be dangerous for a generic safety app running on top
  • Identify which parts of Linux are involved in reporting and handling the dangerous errors depending on the different HWs (e.g. ARM vs Intel vs Others) and configurations (e.g. ACPI APEI on vs ACPI APEI off)
  • Allocate SReqs to Linux accordingly

 

The conclusion from the discussions seems to be that at least in the use case drafted by the architecture WG some weeks ago this doesn’t really add value.

 

I’ve summarized the conclusion in the table:

 

 

Imho I think we have gained some very valuable insights from the discussion, which range from inaccuracies in the descriptions on kernel.org to the need to understand how Linux interacts with hardware in failure mode situations.

GP: Right now if we take away the EDAC framework and if we rely on Linux to kill itself or the safety app we need to allocate Linux with SReqs accordingly. So yes, you are right, we are transitioning from EDAC to a different model

 

The learning for this use case seems to be that

  • the safety integrity for SPF really resides on the Linux execution in the synchronous path and not on the asynchronous path (including firmware);

GP: Here I think we need to investigate better. From an OS point of view exceptions are always asynchronous. Now, looking at your slide, what you call async flow is related to the handling of UE SW Recoverable. On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard. Now, as you can see at the bottom of this thread, I am asking to do a similar investigation on ARM to make sure that what you call ‘async flow’ is always related to errors that do not represent a hazard

  • the interaction with an external safeing device is done better via some watchdog mechanism through the safety application.

GP: I agree on the WD, yes

  • the asynchronous path is important for latent faults (as you write we still need to clarify if the full symmetric integrity of the safety goal is needed in this case)

 

I hope I got it right.

 

Let’s discuss tomorrow.

GP: OK, I’ll save 15min for the second part of the meeting as we have already the Linux Kernel map on the agenda.

 

Best regards

Chris

 

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Samstag, 25. Juli 2020 10:20
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

To be honest, from my point of view it is important to just assign some safety requirements to Linux to get started with the safety analyses. I would expect detailed and more correct TSRs to come from the domain specific WGs.
Hence I proposed EDAC (because if you rely on NMI error handling instead of MCA then EDAC is responsible for error handling). From my personal point of view I would be happy to start by assigning some safety reqs, maybe not complete and not fully correct, that would at least allow us to focus on the safety analyses (FFI, FMEA or FTA). There I believe we’ll find common challenges regardless of which specific Linux subsystems are involved: challenges that we can solve, and whose solutions we can apply to other scenarios once we get them from the domain specific WGs.

 

However EDAC was challenged and now we have moved on to analyzing error reporting and handling specifically for MCA and the ARM counterpart of MCA (see Tony’s and your presentation last Tuesday). My end goal is still to come up with some TSRs that can be accepted and that would allow us to move on. Hence I think we need to:

  a) Classify the errors and decide which we can ‘assume’ to be a hazard
  b) See how these ‘hazardous’ errors are handled and if it makes sense to rely on Linux (see email from Lukas “[…] How does the HW react to a memory fault, that was caused by a required memory access for a specific operation of the safety function?”)
  c) Following on from b), define TSRs to be allocated to Linux

 

In my previous analysis of a), since I don’t have a specific context for a specific application running on a specific system, I just assume that a fatal error, if not reported, is a hazard, and so is UE SRAR, because for UE SRAR the error has been consumed and an action shall be taken to avoid incorrect data.

Having said that the reason why I asked ARM to do a similar analysis is to move on to the next steps b) and c) where we go and check the Linux paths involved in handling the respective errors.

 

With respect to your points below

  • the termination path via the hardware is faster;

Do we have a termination path via HW for all systems? Wouldn’t it be good to rely on Linux, or to analyse how much we can rely on Linux?
And BTW is it really faster or safer… see previous discussion with Corey?

  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

Even if you rely on FW to handle errors you would have the same problem (see previous discussion with Corey)

 

BTW in my view it is crucial to assign Linux with safety reqs to start analyses; these may not be fully correct in some specific contexts or not complete, but they’ll allow us to go on with safety analyses and to start looking at the challenges that we’ll also encounter later on as TSRs come from the domain specific WGs.
If we continue like this we go around in circles and we are not able to work on the core of this WG.

 

Thanks

Gab

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Friday, July 24, 2020 6:49 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I don’t understand why we need all those details. The question about what constitutes a hazard is not resolved in the details.

 

We should first double check the summary - here are some comments:

 

  1. “Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM”

 

Yes, there are very low level differences in the HW (for example Arm does not have an NMI), but ELISA really shouldn’t be arguing at that level.

 

The assumption from our side was that the general error handling flow (at the level shown on Tuesday) essentially captures what happens in either architecture. As we pointed out the BIOS  is in fact the same.

 

 

  2. “Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not been consumed already)”

 

One cannot designate a fatal error as hazard w/o considering

  • the interface at which the fatal error is observed,
  • the transaction that is affected, and
  • the application use case.

 

Assume, for example, a system consisting of the kernel, one or more safety critical applications and one or more non-safety critical applications.

 

A fatal memory read error (i.e. the memory controller cannot return valid data to the core) occurring during the execution of the non-safety critical application, for example, will not lead to a violation of a safety goal and is hence not a direct hazard.

 

A fatal memory read error occurring during the execution of the safety critical application will lead to loss of service – this will only lead to a violation of the safety goal if the system has been designed such that loss of this service leads to a hazard. However, as there are many (more likely) error cases that can lead to loss of service, provisions for this case need to be in place anyway.

 

Similarly fatal memory read errors can also occur during other operations, such as DMA transfers, which require different reasoning.

 

  3. “For Fatal error neither ARM or Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)”

This needs a more nuanced consideration.

 

Neither Arm nor Intel based systems use the Linux EDAC subsystem for the direct error reaction to fatal memory read errors that occur during instruction execution.

 

Why am I so particular?

  • Firstly, because the general term “EDAC” is often used to refer to the complete sphere of HW and SW execution around error detection and control.
    • So with this understanding of the term “EDAC” the correct statement would be that “in both Arm and Intel based systems the HW EDAC subsystem initiates the error reaction to fatal memory read errors that occur during instruction execution”.
  • Secondly, because the term “rely” does not relate to a specific instance.
    • Both Arm and Intel based systems can use the Linux EDAC subsystem for a posteriori diagnosis of what happened.
    • So in the a posteriori phase both Arm and Intel based systems can make use of it, i.e. “rely” on it.

 

  4. “For non Fatal Error Chris presented a SW recovery path that is implemented in GHES.c.”

We explained the basic execution flow in the case of fatal memory read errors and in the case of non-fatal memory read errors – I would not call either of these a “SW recovery path”, as nothing is recovered.

 

 

  5. “Correctable Errors: we are aligned on the role of EDAC in counting CEs.”

As explained the EDAC counts much more:

I didn’t see that this was ever disputed.

 

What we did challenge was the idea to drive a SW driven error reaction from HW_EVENT_ERR_FATAL by placing the Linux EDAC subsystem into a critical path in the event of fatal memory read errors that occur during instruction execution.

 

We raised two arguments:

  • the termination path via the hardware is faster;
  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

 

Is this disputed any longer?

 

We should try to keep the level of detail needed as simple as possible.

 

I’m happy to explain our thoughts on Tuesday.

 

Best regards

Chris

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Freitag, 24. Juli 2020 13:18
To: myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; Christopher Temple <Christopher.Temple@...>; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi All

 

I am following up on the first TODO of the reference summary at the bottom.

[TODO]: to investigate on where these definitions can be found to have a common reference and a common understanding

 

Please review and give your feedback.
@Chris please see the ‘next steps’ on this specific point; it would be good to provide similar detail as I did for Intel (mainly to understand what represents a hazard and what does not)

 

Thanks

Gab

------------------

Follow-up

------------------

 

We can use the definition in ACPI specs 2.8 Appendix N – Table 54 Error Record Header:

Error Severity (byte offset 12, byte length 4)

Indicates the severity of the error condition. The severity of the
error record corresponds to the most severe error section.
0 - Recoverable (also called non-fatal uncorrected)
1 - Fatal
2 - Corrected
3 - Informational
All other values are reserved.
Note that severity of "Informational" indicates that the record
could be safely ignored by error handling software.
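
To make the rule "the severity of the error record corresponds to the most severe error section" concrete, here is a minimal Python sketch. The numeric encoding is taken from the table quoted above. The ranking from least to most severe (Informational < Corrected < Recoverable < Fatal) is my assumption for illustration, since the numeric values themselves are not ordered by severity; the names `ErrorSeverity` and `record_severity` are hypothetical.

```python
from enum import IntEnum


class ErrorSeverity(IntEnum):
    """Error Severity encoding from the table quoted above."""
    RECOVERABLE = 0    # also called non-fatal uncorrected
    FATAL = 1
    CORRECTED = 2
    INFORMATIONAL = 3


# Assumed ranking, least to most severe. The quoted table gives only the
# encoding; this ordering is an illustrative assumption.
_RANK = {
    ErrorSeverity.INFORMATIONAL: 0,
    ErrorSeverity.CORRECTED: 1,
    ErrorSeverity.RECOVERABLE: 2,
    ErrorSeverity.FATAL: 3,
}


def record_severity(section_severities):
    """Return the most severe of the given per-section severity values."""
    # Reserved values (anything outside 0..3) raise ValueError here.
    sections = [ErrorSeverity(value) for value in section_severities]
    if not sections:
        raise ValueError("an error record needs at least one section")
    return max(sections, key=_RANK.__getitem__)


if __name__ == "__main__":
    print(record_severity([2, 0, 3]).name)
```

Note that comparing the raw integers would rank Informational (3) above Fatal (1), which is why the sketch uses an explicit ranking table.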

From the UEFI Specs we have:

Machine Check Exception (MCE): {0xE8F56FFE, 0x919C, 0x4cc5, {0xBA, 0x88, 0x65, 0xA
0x49, 0x13, 0xBB}}
A Machine Check Exception is a processor-generated exception class interrup
to system software of the presence of a fatal or recoverable error condition

 

Non-Maskable Interrupt (NMI): {0x5BAD89FF, 0xB7E6, 0x42c9, {0x81, 0x4A, 0xCF, 0x24, 0x85,
0xD6, 0xE9, 0x8A}}
Non-Maskable Interrupts are used on X64 platforms to report fatal or recoverable
platform error conditions. NMIs are reported via interrupt vector 2 on IA32 and X64
processor architecture platforms.

 

Synchronous External Abort (SEA): {0x9A78788A, 0xBBE8, 0x11E4, {0x80, 0x9E, 0x67, 0x61,
0x1E, 0x5D, 0x46, 0xB0}}
Synchronous External Aborts represent precise processor error conditions on ARM
systems (uncorrectable and/or recoverable) as described in D3.5 of the ARMv8 ARM
reference manual. This notification may be triggered by one of the following
scenarios: cache parity error, cache ECC error, external bus error, micro-architectural
error, data poisoning, and other platform errors.

 

SError Interrupt (SEI): {0x5C284C81, 0xB0AE, 0x4E87, {0xA3, 0x22, 0xB0, 0x4C, 0x85, 0x62,
0x43, 0x23}}
SError Interrupts represent asynchronous imprecise (or possibly precise) processor
error conditions on ARM systems (corrected, uncorrectable, and recoverable) as
described in D3.5 of the ARM ARM reference manual. This notification may be
triggered by one of the following scenarios: cache parity error, cache ECC error,
external bus error, micro-architectural error, data poisoning, and other platform
errors.

 

Platform Error Interrupt (PEI): {0x09A9D5AC, 0x5204, 0x4214, {0x96, 0xE5, 0x94, 0x99, 0x2E,
0x75, 0x2B, 0xCD}}
Platform Error Interrupts represent asynchronous imprecise platform error conditions
on ARM systems that may be triggered by the following scenarios: system memory
ECC error, ECC errors in system cache (e.g. shared high-level caches), vendor specific
chip errors, external platform errors.
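
For cross-referencing these notification types against logs or kernel sources, it can help to convert the structured GUID notation above into the canonical string form. The following Python sketch does that with the standard `uuid` module. Only the four GUIDs that appear complete in this mail are included; the MCE GUID is left out because it is truncated above. The helper name `efi_guid` and the table `NOTIFY_TYPES` are hypothetical.

```python
import uuid


def efi_guid(data1, data2, data3, data4):
    """Build a uuid.UUID from the {Data1, Data2, Data3, {Data4[8]}} notation."""
    if len(data4) != 8:
        raise ValueError("Data4 must contain exactly 8 bytes")
    # The last six bytes of Data4 form the 48-bit 'node' field.
    node = int.from_bytes(bytes(data4[2:]), "big")
    return uuid.UUID(fields=(data1, data2, data3, data4[0], data4[1], node))


# Notification types as listed in this mail (MCE omitted: GUID truncated).
NOTIFY_TYPES = {
    "NMI": efi_guid(0x5BAD89FF, 0xB7E6, 0x42C9,
                    [0x81, 0x4A, 0xCF, 0x24, 0x85, 0xD6, 0xE9, 0x8A]),
    "SEA": efi_guid(0x9A78788A, 0xBBE8, 0x11E4,
                    [0x80, 0x9E, 0x67, 0x61, 0x1E, 0x5D, 0x46, 0xB0]),
    "SEI": efi_guid(0x5C284C81, 0xB0AE, 0x4E87,
                    [0xA3, 0x22, 0xB0, 0x4C, 0x85, 0x62, 0x43, 0x23]),
    "PEI": efi_guid(0x09A9D5AC, 0x5204, 0x4214,
                    [0x96, 0xE5, 0x94, 0x99, 0x2E, 0x75, 0x2B, 0xCD]),
}


def notify_type_name(guid_string):
    """Look up a notification type by canonical GUID string, or return None."""
    wanted = uuid.UUID(guid_string)
    for name, guid in NOTIFY_TYPES.items():
        if guid == wanted:
            return name
    return None


if __name__ == "__main__":
    for name, guid in NOTIFY_TYPES.items():
        print(f"{name}: {guid}")
```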

 

 

From a HW specific point of view on Intel this would map to: https://www.intel.it/content/www/it/it/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html

FATAL Errors: RIPV bit in IA32_MCG_STATUS MSR not set.

 

Uncorrected Errors (SW Recoverable):

When IA32_MCG_CAP[24] is set, a UCR error is indicated by the following bit settings in the IA32_MCi_STATUS register:

Valid (bit 63) = 1
UC (bit 61) = 1
PCC (bit 57) = 0

There are 3 further classifications of uncorrected errors:

  1. Uncorrected no action required (UCNA) - UCNA errors require no action from system software to continue
    execution. A UCNA error is indicated with UC=1, PCC=0, S=0 and AR=0 in the IA32_MCi_STATUS register.
  2. Software recoverable action optional (SRAO): An SRAO error when signaled as a machine check is indicated with
    UC=1, PCC=0, S=1, EN=1 and AR=0 in the IA32_MCi_STATUS register. System software
    needs to inspect the MCA error code fields in the IA32_MCi_STATUS register to identify the specific recovery
    action for a given SRAO error. If MISCV and ADDRV are not set, it is recommended that no system software
    error recovery be performed; however, system software can resume execution.
  3. Software recoverable action required (SRAR) - SRAR errors indicate that the error was detected and raised
    at the point of the consumption in the execution flow. An SRAR error is indicated with UC=1, PCC=0, S=1,
    EN=1 and AR=1 in the IA32_MCi_STATUS register. System software needs to inspect the MCA error code
    fields in the IA32_MCi_STATUS register to identify the specific recovery action for a given SRAR error.
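
The bit combinations above can be written down as a small decision function. This is a minimal Python sketch, assuming IA32_MCG_CAP[24] is set so that the S and AR bits are meaningful. The bit positions (VAL=63, UC=61, EN=60, PCC=57, S=56, AR=55) follow the Intel SDM; the function names and the returned strings are hypothetical, and combinations not described in this mail are reported as "unclassified" rather than guessed.

```python
# Bit positions in IA32_MCi_STATUS (Intel SDM, machine-check architecture).
VAL, UC, EN, PCC, S, AR = 63, 61, 60, 57, 56, 55


def _bit(value: int, pos: int) -> int:
    return (value >> pos) & 1


def status_with(*bits: int) -> int:
    """Build a status value with the given bit positions set (for examples)."""
    return sum(1 << b for b in bits)


def classify_mci_status(status: int) -> str:
    """Classify an IA32_MCi_STATUS value per the bit settings listed above.

    Assumes IA32_MCG_CAP[24] is set. Hypothetical helper; the returned
    strings are illustrative labels only.
    """
    if not _bit(status, VAL):
        return "no valid error logged"
    if not _bit(status, UC):
        return "corrected"
    if _bit(status, PCC):
        # UC=1 with PCC=1 falls outside the UCR class described above.
        return "uncorrected, processor context corrupt"
    s, ar, en = _bit(status, S), _bit(status, AR), _bit(status, EN)
    if s == 0 and ar == 0:
        return "UCNA"
    if s == 1 and en == 1:
        return "SRAR" if ar else "SRAO"
    # Combinations not described in this mail.
    return "unclassified"


if __name__ == "__main__":
    print(classify_mci_status(status_with(VAL, UC, EN, S, AR)))
```

In terms of the summary below, only the "SRAR" result and the PCC=1 case would be treated as hazards to be handled within the PST/FTTI.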

 

Corrected Errors: UC bit clear in IA32_MCi_STATUS; Errors reported through CMCI interrupt

-------------------

Summary: In my view

-        Recoverable errors as in UEFI specs are the same as UCR errors in Intel SDM

-        [Fatal errors] and [UCR – SRAR] both represent a hazard to be handled within the PST/FTTI.

 

Next Steps

  • On ARM try to do a detailed analysis of the error classifications as done here for Intel
  • Clarify what PEIs (Platform Error Interrupts) are used for and whether they are used to report hazardous errors

 

 

****************************************************************************************

***********          REFERENCE   SUMMARY BELOW  *********************************************

****************************************************************************************

 

 

Summary:

  • Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM
  • Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not been consumed already)
    [TODO]: to investigate on where these definitions can be found to have a common reference and a common understanding
  • For Fatal error neither ARM or Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)
    [TODO]: to better analyze these paths. It is not clear if for fatal errors these paths are different in the case GHES+APEI enabled vs GHES+APEI disabled
  • For non Fatal Error Chris presented a SW recovery path that is implemented in GHES.c.
    [TODO]: to investigate what happens if FW First (i.e. GHES and APEI) is not enabled on both IA and ARM. On Intel these seem to be reported by the EDAC driver plugging into the MCE handler; e.g.:
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/i10nm_base.c#n319
  • Correctable Errors: we are aligned on the role of EDAC in counting CEs.
    [TODO]: to understand if a systematic capability is required in the reporting of latent faults

 

 

Gab

 

---------------------------------------------------------------------
INTEL CORPORATION ITALIA S.p.A. con unico socio
Sede: Milanofiori Palazzo E 4
CAP 20094 Assago (MI)
Capitale Sociale Euro 104.000,00 interamente versato
Partita I.V.A. e Codice Fiscale  04236760155
Repertorio Economico Amministrativo n. 997124
Registro delle Imprese di Milano nr. 183983/5281/33
Soggetta ad attivita' di direzione e coordinamento di
INTEL CORPORATION, USA

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.


Paoloni, Gabriele <gabriele.paoloni@...>
 

Hi Chris

 

Yes, I think we are quite aligned on the overall scenario. Now, with respect to the points below, I think that in order to move forward we should do what is marked in orange

 

  • The basic behaviour between different hardware architectures is not fundamentally different.
    • However, it is important to understand the details – in some cases subtle differences exist and these need to be understood.

Here we need to understand (starting with Intel and ARM) which HW errors may eventually represent a hazard. The reason is to identify the respective error reporting and handling paths in Linux and to highlight the HW-independent paths. On this point: on ARM, if we have an async external abort error, is SW called to take action? If it is not, could we have a hazard? (Action includes possibly killing the app or the OS.)

 

  • The safety dependency on Linux depends on how Linux is used in specific applications.
    • It is possible to integrate Linux in a way that the safety dependency remains small; it is also possible to integrate it in a way that the safety dependency is big.

Right, we need to discuss this. In my personal opinion, if we commonly rely on Linux to handle errors we should continue to do so, especially if we cannot demonstrate that doing it in FW is safer.

In short here my point is “Ok, of course we can delegate safety critical tasks to dedicated FW or HW….but then what is the value of ELISA 😊 ?”

 

Thanks

Gab

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Wednesday, July 29, 2020 10:50 AM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

Absolutely, and I don’t think there was ever any doubt that critical SW exists that has to perform sensitive tasks.

 

The key task is finding the critical SW.

 

At the beginning of the discussion there was a strong focus on the Linux EDAC subsystem and that the Linux EDAC subsystem would be the critical SW path in the event of fatal memory errors.

 

I think now with more collective understanding in the team of how Linux interacts with the hardware it is clear that the critical SW path for fatal memory errors is in the exception handling.

 

In the discussion we came across numerous interesting findings, for example:

  • The descriptions on kernel.org were never developed with the intent to be read with a safety context in mind and are therefore not as precise as one would like them to be.
  • The basic behaviour between different hardware architectures is not fundamentally different.
    • However, it is important to understand the details – in some cases subtle differences exist and these need to be understood.
  • The safety dependency on Linux depends on how Linux is used in specific applications.
    • It is possible to integrate Linux in a way that the safety dependency remains small; it is also possible to integrate it in a way that the safety dependency is big.

 

We shouldn’t lose sight of the fact that if the Linux EDAC subsystem is used to boost availability in light of non-fatal memory faults, it still plays an important role in the overall safety story via the asynchronous flow.

 

You wanted to clarify what safety activities from the ISO standard need to be covered in this case.

 

Best regards

Chris

 

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Wednesday, July 29, 2020 1:29 AM
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

I think the point here is that the HW relies on SW actions being taken (properly) in order to fix the errors and continue the program execution, or to kill the execution of the program (in the Intel case, recoverable-continuable or recoverable-not-continuable respectively); in either case the SW has sensitive tasks to perform that, if not done properly, may lead to a hazard. Do you agree with my point here?


Now I am not talking about EDAC anymore (EDAC could be relevant if errors are reported through NMI instead of MCA); instead I am trying to find out the Linux code paths involved in the handling of such errors, the reason being that a safety critical app may rely on these error handling paths to claim its integrity…

 

Thanks

Gab

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Tuesday, July 28, 2020 7:41 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

There are two contexts, in which the terms synchronous and asynchronous are used, that need to be differentiated.

 

The first context - that I have been referring to so far - is about the synchronous flow and the asynchronous flow with respect to the memory controller error reporting.

  • The synchronous flow ensures that the transaction gets completed.
  • In the case of a fatal memory read the core gets informed inherently through the synchronous flow that the memory read from DRAM failed.
  • The exception is raised immediately since the CPU needs to know that no valid instruction (or data) could be returned.
  • This is where imho the critical error handling should take place – due to the immediate nature the execution is started within the FTTI – it happens consecutively to the failed read.

 

 

To my understanding the basic principle is the same on x86 and Arm. In both architectures the situation is resolved w/o the Linux EDAC subsystem.

 

The second context is about whether the CPU classifies the abort exception that it receives through the synchronous flow as synchronous-external-abort or asynchronous-external-abort.

  • This is a quite detailed uArchitecture aspect of how the abort is handled inside the CPU.
  • “Synchronous precise” means that all instructions preceding the exception were executed and no instruction after the exception has been executed.
    • In this case one could try to re-execute the failed memory interaction upon rectification of the error.
  • “Asynchronous precise” and “asynchronous imprecise” pertain to out-of-order execution, where the CPU may have rearranged the instruction execution.
    • Since in this case the CPU may have already executed subsequent instructions, the correlation between the instruction pointer and a precise statement of which instructions have been executed when the exception occurs is hard or not possible (compared to what you would be seeing in the assembly code compiled from C).
  • Hence the terms “precise” and “imprecise”, which keep the notation of the different cases precise.
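To make the definition of “precise” above concrete, here is a toy model. It is only a sketch: the per-instruction `executed` flags and the function name are assumptions for illustration and do not correspond to any architectural register or kernel interface.

```python
def is_precise(executed, fault_index):
    """Check the 'precise' condition for a toy instruction stream.

    executed[i] is True if instruction i (in program order) has been
    executed when the exception is taken; fault_index is the position
    of the instruction that triggered the exception.

    Precise: every instruction before the faulting one has executed
    and no instruction after it has.
    """
    before = executed[:fault_index]
    after = executed[fault_index + 1:]
    return all(before) and not any(after)


# In-order case: everything before the fault ran, nothing after it did.
in_order = [True, True, False, False]
# Out-of-order case: a later instruction already ran before the exception.
reordered = [True, True, False, True]
```

Here `is_precise(in_order, 2)` holds while `is_precise(reordered, 2)` does not, which is the situation where the instruction pointer no longer tells you exactly what has been executed.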

 

Within the Arm architecture it is up to the CPU to choose and different uArchitectures will do different things since some devices support out-of-order execution, while others don’t.

 

As far as I can tell this is the context that you have copied into the email below.

 

I would say

  • Recoverable-Continuable SRAR Error (RIPV=1, EIPV=1) relates to "Synchronous precise"
    • “System software may be able to restart execution from the interrupted context if it is able to rectify the error condition.”
  • Recoverable-not-continuable SRAR Error (RIPV=0, EIPV=x) relates to the “Asynchronous” cases
    • “In either case, this indicates that the error is detected at the instruction pointer saved on the stack for this machine check exception and restarting execution with the interrupted context is not possible.”

 

Again the basic principle seems pretty much the same on x86 and Arm. In both architectures the situation is resolved w/o the Linux EDAC subsystem.

 

I’m not sure whether the distinction between synchronous-external-abort and asynchronous-external-abort is really relevant for our discussion (unless we want to attempt some complex recovery stuff in pipelined out-of-order CPUs, but – for the record – this can be insanely tricky – best to proceed as stated in the x86 text below: “System software must terminate the interrupted stream of execution”).

 

Where do you see the relevance of this aspect?

 

Earlier on you wrote “On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard.”

 

What instruction do you see the x86 executing, if no instruction can be returned by the memory controller?

 

Best regards

Chris

 

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Tuesday, July 28, 2020 6:25 PM
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

I mean errors associated to what you call ‘async path’….how do you classify these on ARM?

 

On Intel I have provided a clear extensive classification of errors….

 

Thanks

Gab

 

From: Christopher Temple <Christopher.Temple@...>
Sent: Tuesday, July 28, 2020 6:16 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

I have never used the term “async errors”. What are “async errors”?

 

Best regards

Chris

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Tuesday, July 28, 2020 4:45 PM
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

Are your async errors the same as UE SW Recoverable on Intel?

If yes, on IA I think these could be dangerous if not recovered within the FTTI or PST; for instance, in the Intel SDM we have the UE SRAR explanation at the bottom, and in my view SW failing to take action, or (worse) taking the wrong recovery action, could result in corruption and hence a hazard here.

Maybe for some reason on ARM it is safe not to handle UE within the PST/FTTI… What I am asking is to build the picture for both ARM and IA so that later on we can define which SW parts of Linux are assigned safety relevant tasks (i.e. safety reqs).

 

Thanks

Gab

 

 

SRAR Error And Affected Logical Processors
The affected logical processor is the one that has detected and raised an SRAR error at the point of the consumption in the execution flow. The affected logical processor should find the Data Load or the Instruction Fetch error
information in the IA32_MCi_STATUS register that is reporting the SRAR error.
Table 15-20 list the actionable scenarios that system software can respond to an SRAR error on an affected logical
processor according to RIPV and EIPV values:
Recoverable-Continuable SRAR Error (RIPV=1, EIPV=1):
For Recoverable-Continuable SRAR errors, the affected logical processor should find that both the
IA32_MCG_STATUS.RIPV and the IA32_MCG_STATUS.EIPV flags are set, indicating that system software may
be able to restart execution from the interrupted context if it is able to rectify the error condition. If system
software cannot rectify the error condition then it must treat the error as a recoverable error where restarting
execution with the interrupted context is not possible. Restarting without rectifying the error condition will
result in most cases with another SRAR error on the same instruction.

Recoverable-not-continuable SRAR Error (RIPV=0, EIPV=x):
For Recoverable-not-continuable errors, the affected logical processor should find that either
— IA32_MCG_STATUS.RIPV= 0, IA32_MCG_STATUS.EIPV=1, or
— IA32_MCG_STATUS.RIPV= 0, IA32_MCG_STATUS.EIPV=0

In either case, this indicates that the error is detected at the instruction pointer saved on the stack for this
machine check exception and restarting execution with the interrupted context is not possible. System
software may take the following recovery actions for the affected logical processor:
The current executing thread cannot be continued. System software must terminate the interrupted
stream of execution and provide a new stream of execution on return from the machine check handler
for the affected logical processor.
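The two SRAR cases quoted above can be told apart from the RIPV and EIPV flags alone. The following is a minimal sketch, assuming the caller has already established from IA32_MCi_STATUS that the error is an SRAR; the function name and the returned strings are illustrative, not taken from the SDM or from the Linux MCE code. The bit positions (RIPV = bit 0, EIPV = bit 1 of IA32_MCG_STATUS) are as defined in the SDM.

```python
RIPV = 1 << 0  # IA32_MCG_STATUS bit 0: restart IP valid
EIPV = 1 << 1  # IA32_MCG_STATUS bit 1: error IP valid


def classify_srar(mcg_status):
    """Map an IA32_MCG_STATUS value to the SRAR case in the quoted text.

    Assumes the error has already been identified as an SRAR.
    """
    ripv = bool(mcg_status & RIPV)
    eipv = bool(mcg_status & EIPV)
    if ripv and eipv:
        # SW may restart the interrupted context if it can rectify the error.
        return "recoverable-continuable"
    if not ripv:
        # RIPV=0, EIPV=x: the interrupted stream of execution must be terminated.
        return "recoverable-not-continuable"
    # RIPV=1, EIPV=0 is not one of the two SRAR cases quoted above.
    return "not covered by the quoted SRAR text"
```

Note that the "continuable" case still falls back to "not continuable" handling if SW cannot rectify the error condition, as the quoted text states; that decision is outside this sketch.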

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Tuesday, July 28, 2020 9:29 AM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

You write

“From an OS point of view exceptions are always asynchronous."

 

That is correct since the OS doesn’t know when a fatal instruction memory read would occur. The exception occurs synchronously with the fatal instruction memory read to inform the CPU that there is no valid instruction available to execute.

 

Regarding the error reaction to fatal instruction memory reads you write

“On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard.”

 

What instruction do you see the x86 executing, if no instruction can be returned by the memory controller?

 

As stated early in the discussion, data read/write has its own challenges – let’s agree on the instruction read aspect first.

 

Best regards

Chris

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Monday, July 27, 2020 1:19 PM
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

Thanks, pls see GP inline

 

Gab

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Monday, July 27, 2020 11:51 AM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I agree – we need useful safety requirements for Linux and it would be great if we would receive some closer to real world safety requirements from the domains.

GP: good we’re on the same page here 😊

 

The question at hand was whether it makes sense to include the Linux EDAC subsystem in the mitigation of SPFs. The envisioned use case consists of the kernel, a safety application and an EDAC monitor app.

GP: From our last discussions it seems that we are transitioning from EDAC to the memory fault handling in Linux, but in order to assign safety reqs we need to

  • Understand error types and define which ones we ‘assume’ to be dangerous for a generic safety app running on top
  • Identify which parts of Linux are involved in reporting and handling the dangerous errors depending on the different HWs (e.g. ARM vs Intel vs Others) and configurations (e.g. ACPI APEI on vs ACPI APEI off)
  • Allocate SReqs to Linux accordingly

 

The conclusion from the discussions seems to be that at least in the use case drafted by the architecture WG some weeks ago this doesn’t really add value.

 

I’ve summarized the conclusion in the table:

 

 

Imho I think we have gained some very valuable insights from the discussion, which range from inaccuracies in the descriptions on kernel.org to the need to understand how Linux interacts with hardware in failure mode situations.

GP: Right now if we take away the EDAC framework and if we rely on Linux to kill itself or the safety app we need to allocate Linux with SReqs accordingly. So yes, you are right, we are transitioning from EDAC to a different model

 

The learning for this use case seems to be that

  • the safety integrity for SPF really resides on the Linux execution in the synchronous path and not on the asynchronous path (including firmware);

GP: Here I think we need to investigate better. From an OS point of view exceptions are always asynchronous. Now, looking at your slide, what you call async flow is related to the handling of UE SW Recoverable. On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard. Now, as you can see at the bottom of this thread, I am asking to do a similar investigation on ARM to make sure that what you call ‘async flow’ is always related to errors that do not represent a hazard.

  • the interaction with an external safeing device is done better via some watchdog mechanism through the safety application.

GP: I agree on the WD, yes

  • the asynchronous path is important for latent faults (as you write we still need to clarify if the full symmetric integrity of the safety goal is needed in this case)

 

I hope I got it right.

 

Let’s discuss tomorrow.

GP: OK, I’ll save 15min for the second part of the meeting as we have already the Linux Kernel map on the agenda.

 

Best regards

Chris

 

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Saturday, July 25, 2020 10:20 AM
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

To be honest, from my point of view it is important to just assign some safety requirements to Linux to get started with the safety analyses. I would expect detailed and more correct TSRs to come from domain specific WGs.
That is why I proposed EDAC (if you rely on NMI error handling instead of MCA, then EDAC is responsible for error handling): from my personal point of view I would be happy to start by assigning some safety reqs, maybe not complete and not fully correct, that would at least allow us to focus on the safety analyses (FFI, FMEA or FTA). There I believe we’ll find common challenges regardless of the specific Linux subsystem; challenges that we can solve, applying the solution to other scenarios once we have them from the domain specific WGs.

 

However, EDAC was challenged and we have now moved to analyzing error reporting and handling specifically for MCA and the ARM counterpart of MCA (see Tony’s and your presentation last Tue). My end goal is still to come up with some TSRs that can be accepted and that would allow us to move on. Hence I think we need to:

  a) Classify the errors and decide which we can ‘assume’ to be a hazard
  b) See how these ‘hazardous’ errors are handled and if it makes sense to rely on Linux (see email from Lukas “[…] How does the HW react to a memory fault, that was caused by a required memory access for a specific operation of the safety function?”)
  c) Following on from b), define TSRs to be allocated to Linux

 

In my previous analysis of a), since I don’t have a specific context for a specific application running on a specific system, I just assume that a fatal error, if not reported, is a hazard, as is UE SRAR, because for UE SRAR the error has been consumed and an action shall be taken to avoid incorrect data.

Having said that the reason why I asked ARM to do a similar analysis is to move on to the next steps b) and c) where we go and check the Linux paths involved in handling the respective errors.

 

With respect to your points below

  • the termination path via the hardware is faster;

Do we have a termination path via HW on all systems? Wouldn’t it be good to rely on Linux, or to analyse how much we can rely on Linux?
And BTW, is it really faster or safer? See the previous discussion with Corey.

  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

Even if you rely on FW to handle errors you would have the same problem (see previous discussion with Corey)

 

BTW in my view it is crucial to assign Linux with safety reqs to start analyses; these may not be fully correct in some specific contexts or not complete, but they’ll allow us to go on with safety analyses and to start looking at the challenges that we’ll also encounter later on as TSRs come from domain specific WGs.
If we continue like this we go back in circles and we are not able to work on the core of this WG.

 

Thanks

Gab

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Friday, July 24, 2020 6:49 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I don’t understand why we need all those details. The question about what constitutes a hazard is not resolved in the details.

 

We should first double check the summary. Here are some comments:

 

  1. “Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM”

 

Yes, there are very low level differences in the HW (for example Arm does not have an NMI), but ELISA really shouldn’t be arguing at that level.

 

The assumption from our side was that the general error handling flow (at the level shown on Tuesday) essentially captures what happens in either architecture. As we pointed out, the BIOS is in fact the same.

 

 

  2. “Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not been consumed already)”

 

One cannot designate a fatal error as hazard w/o considering

·         the interface at which the fatal error is observed,

·         the transaction that is affected, and

·         the application use case.

 

Assume, for example, a system consisting of the kernel, one or more safety critical applications and one or more non-safety critical applications.

 

A fatal memory read error (i.e. the memory controller cannot return valid data to the core) occurring during the execution of the non-safety critical application, for example, will not lead to a violation of a safety goal and is hence not a direct hazard.

 

A fatal memory read error occurring during the execution of the safety critical application will lead to loss of service – this will only lead to a violation of the safety goal if the system has been designed such that loss of this service leads to a hazard. However, as there are many (more likely) error cases that can lead to loss of service, provisions for this case need to be in place anyway.

 

Similarly, fatal memory read errors can also occur during other operations, such as DMA transfers, which require different reasoning.
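
The reasoning above can be sketched as a small decision function. This is only an illustrative model, not part of any analysis done in the WG: the names `FatalReadError` and `is_direct_hazard`, the context labels, and the use of `None` for "needs separate reasoning" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FatalReadError:
    # Hypothetical labels for what was executing when the read failed:
    # "non_safety_app", "safety_app", or anything else (e.g. "dma").
    context: str
    # System design property: does loss of the affected service lead to a hazard?
    loss_of_service_is_hazardous: bool = False


def is_direct_hazard(err: FatalReadError) -> Optional[bool]:
    """Return True/False where the text above gives an answer, else None."""
    if err.context == "non_safety_app":
        # No safety goal is violated, hence no direct hazard.
        return False
    if err.context == "safety_app":
        # Loss of service; a hazard only if the system was designed that way.
        return err.loss_of_service_is_hazardous
    # Other operations (e.g. DMA transfers) require different reasoning.
    return None
```

The point of the sketch is that the error class alone is not an input that decides the outcome; the affected transaction and the application design are.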

 

  3. “For Fatal error neither ARM or Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)”

This needs a more nuanced consideration.

 

Neither Arm nor Intel based systems use the Linux EDAC subsystem for the direct error reaction to fatal memory read errors that occur during instruction execution.

 

Why am I so particular?

·         Firstly, because the general term “EDAC” is often used to refer to the complete sphere of HW and SW execution around error detection and control.

o   So with this understanding of the term “EDAC” the correct statement would be that ”in both Arm and Intel based systems the HW EDAC subsystem initiates the error reaction to fatal memory read errors that occur during instruction execution”.

·         Secondly, because the term “rely” does not relate to a specific instance.

o   Both Arm and Intel based systems can use the Linux EDAC subsystem for a posteriori diagnosis of what happened.

o   So in the a posteriori phase both Arm and Intel based systems can make use of it, i.e. “rely” on it.

 

  4. “For non Fatal Error Chris presented a SW recovery path that is implemented in GHES.c.”

We explained the basic execution flow in the case of fatal memory read errors and in the case of non-fatal memory read errors – I would not call either of these a “SW recovery path”, as nothing is recovered.

 

 

  5. “Correctable Errors: we are aligned on the role of EDAC in counting CEs.”

As explained the EDAC counts much more:

I didn’t see that this was ever disputed.

 

What we did challenge was the idea to drive a SW driven error reaction from HW_EVENT_ERR_FATAL by placing the Linux EDAC subsystem into a critical path in the event of fatal memory read errors that occur during instruction execution.

 

We raised two arguments:

  • the termination path via the hardware is faster;
  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

 

Is this disputed any longer?

 

We should try to keep the level of detail needed as simple as possible.

 

I’m happy to explain our thoughts on Tuesday.

 

Best regards

Chris

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Friday, July 24, 2020 1:18 PM
To: myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; Christopher Temple <Christopher.Temple@...>; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi All

 

I am following up on the first TODO of the reference summary at the bottom.

[TODO]: to investigate on where these definitions can be found to have a common reference and a common understanding

 

Please review and give your feedback.
@Chris please see the ‘next steps’ on this specific point; it would be good to provide similar detail as I did for Intel (mainly to understand what represents a hazard and what does not)

 

Thanks

Gab

------------------

Follow-up

------------------

 

We can use definition as in ACPI specs 2.8 AppendixN – Table54 Error Record Header:

Field: Error Severity
Byte Offset: 12
Byte Length: 4
Description: Indicates the severity of the error condition. The severity of the
error record corresponds to the most severe error section.
0 - Recoverable (also called non-fatal uncorrected)
1 - Fatal
2 - Corrected
3 - Informational
All other values are reserved.
Note that severity of "Informational" indicates that the record
could be safely ignored by error handling software.
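
As a reading aid, the severity encoding above can be written down as a small Python sketch. It is not taken from any specification text or kernel source. Assumptions: the field is read as a little-endian 32-bit value at byte offset 12 of the record header, the helper names `read_severity` and `most_severe` are hypothetical, and the ranking used to pick the "most severe" section (Fatal, then Recoverable, then Corrected, then Informational) is an interpretation, since the numeric values themselves are not ordered by severity.

```python
import struct
from enum import IntEnum


class ErrorSeverity(IntEnum):
    RECOVERABLE = 0     # also called non-fatal uncorrected
    FATAL = 1
    CORRECTED = 2
    INFORMATIONAL = 3   # may be safely ignored by error handling software


# Assumed ranking, most severe first (an interpretation, see lead-in).
_RANK = [
    ErrorSeverity.FATAL,
    ErrorSeverity.RECOVERABLE,
    ErrorSeverity.CORRECTED,
    ErrorSeverity.INFORMATIONAL,
]


def read_severity(header: bytes) -> ErrorSeverity:
    """Decode the 4-byte severity field at byte offset 12 (assumed little-endian)."""
    (raw,) = struct.unpack_from("<I", header, 12)
    return ErrorSeverity(raw)  # raises ValueError for reserved values


def most_severe(sections):
    """Pick the record severity from the per-section severities."""
    return min(sections, key=_RANK.index)
```

Note that a plain numeric `max()` over the raw values would wrongly rank Informational (3) above Fatal (1), which is why the sketch uses an explicit ranking.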

From the UEFI Specs we have:

Machine Check Exception (MCE): {0xE8F56FFE, 0x919C, 0x4cc5, {0xBA, 0x88, 0x65, 0xA
0x49, 0x13, 0xBB}}
A Machine Check Exception is a processor-generated exception class interrup
to system software of the presence of a fatal or recoverable error condition

 

Non-Maskable Interrupt (NMI): {0x5BAD89FF, 0xB7E6, 0x42c9, {0x81, 0x4A, 0xCF, 0x24, 0x85,
0xD6, 0xE9, 0x8A}}
Non-Maskable Interrupts are used on X64 platforms to report fatal or recoverable
platform error conditions. NMIs are reported via interrupt vector 2 on IA32 and X64
processor architecture platforms.

 

Synchronous External Abort (SEA): {0x9A78788A, 0xBBE8, 0x11E4, {0x80, 0x9E, 0x67, 0x61,
0x1E, 0x5D, 0x46, 0xB0}}
Synchronous External Aborts represent precise processor error conditions on ARM
systems (uncorrectable and/or recoverable) as described in D3.5 of the ARMv8 ARM
reference manual. This notification may be triggered by one of the following
scenarios: cache parity error, cache ECC error, external bus error, micro-architectural
error, data poisoning, and other platform errors.

 

SError Interrupt (SEI): {0x5C284C81, 0xB0AE, 0x4E87, {0xA3, 0x22, 0xB0, 0x4C, 0x85, 0x62,
0x43, 0x23}}
SError Interrupts represent asynchronous imprecise (or possibly precise) processor
error conditions on ARM systems (corrected, uncorrectable, and recoverable) as
described in D3.5 of the ARM ARM reference manual. This notification may be
triggered by one of the following scenarios: cache parity error, cache ECC error,
external bus error, micro-architectural error, data poisoning, and other platform
errors.

 

Platform Error Interrupt (PEI): {0x09A9D5AC, 0x5204, 0x4214, {0x96, 0xE5, 0x94, 0x99, 0x2E,
0x75, 0x2B, 0xCD}}
Platform Error Interrupt represent asynchronous imprecise platform error conditions
on ARM systems that may be triggered by the following scenarios: system memory
ECC error, ECC errors in system cache (e.g. shared high-level caches), vendor specific
chip errors, external platform errors.

 

 

From a HW specific point of view on Intel this would map to: https://www.intel.it/content/www/it/it/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html

FATAL Errors: RIPV bit in IA32_MCG_STATUS MSR not set.

 

Uncorrected Errors (SW Recoverable):

When IA32_MCG_CAP[24] is set, a UCR error is indicated by the following bit settings in the IA32_MCi_STATUS register:

Valid (bit 63) = 1
UC (bit 61) = 1
PCC (bit 57) = 0

            There are 3 further classifications of uncorrected errors:

  1. Uncorrected no action required (UCNA) - UCNA errors require no action from system software to continue
    execution. A UCNA error is indicated with UC=1, PCC=0, S=0 and AR=0 in the IA32_MCi_STATUS register
  2. Software recoverable action optional (SRAO): An SRAO error when signaled as a machine check is indicated with
    UC=1, PCC=0, S=1, EN=1 and AR=0 in the IA32_MCi_STATUS register. System software
    needs to inspect the MCA error code fields in the IA32_MCi_STATUS register to identify the specific recovery
    action for a given SRAO error. If MISCV and ADDRV are not set, it is recommended that no system software
    error recovery be performed however, system software can resume execution.
  3. Software recoverable action required (SRAR) - SRAR errors indicate that the error was detected and raised
    at the point of the consumption in the execution flow. An SRAR error is indicated with UC=1, PCC=0, S=1,
    EN=1 and AR=1 in the IA32_MCi_STATUS register. System software needs to inspect the MCA error code
    fields in the IA32_MCi_STATUS register to identify the specific recovery action for a given SRAR error.

 

Corrected Errors: UC bit clear in IA32_MCi_STATUS; Errors reported through CMCI interrupt
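
The bit patterns listed above can be sketched as a small classifier. This is only an illustration in Python, not kernel code. The function name `classify_mci_status` and the returned labels are hypothetical. The positions of EN (bit 60), S (bit 56) and AR (bit 55) are not stated in this thread; they follow the IA32_MCi_STATUS layout in the Intel SDM and should be checked against the manual. The sketch also assumes IA32_MCG_CAP[24] is set, so that S and AR are meaningful, and it does not check EN.

```python
# IA32_MCi_STATUS bit masks. VAL, UC and PCC are given in the text above;
# EN, S and AR are assumed from the Intel SDM register layout.
VAL = 1 << 63
UC = 1 << 61
EN = 1 << 60
PCC = 1 << 57
S = 1 << 56
AR = 1 << 55


def classify_mci_status(status: int) -> str:
    """Classify one IA32_MCi_STATUS value (assumes IA32_MCG_CAP[24] = 1)."""
    if not status & VAL:
        return "invalid"            # register holds no valid error
    if not status & UC:
        return "corrected"          # UC bit clear
    if status & PCC:
        return "not-UCR"            # UC=1, PCC=1: outside the UCR class
    signaled = bool(status & S)
    action_required = bool(status & AR)
    if not signaled and not action_required:
        return "UCNA"               # uncorrected, no action required
    if signaled and not action_required:
        return "SRAO"               # software recoverable, action optional
    if signaled and action_required:
        return "SRAR"               # software recoverable, action required
    return "undefined"              # S=0, AR=1 is not one of the listed cases
```

For example, `classify_mci_status(VAL | UC | EN | S | AR)` yields `"SRAR"`, the class that the summary below groups with fatal errors.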

-------------------

Summary: In my view

-          Recoverable errors as in UEFI specs are the same as UCR errors in Intel SDM

-          [Fatal errors] and [UCR – SRAR] both represent a hazard to be handled within the PST/FTTI.

 

Next Steps

  • On ARM, try to do a detailed analysis of the error classifications as done here for Intel
  • Clarify what PEIs (Platform Error Interrupts) are used for and whether they are used to report hazardous errors

 

 

****************************************************************************************

***********          REFERENCE   SUMMARY BELOW  *********************************************

****************************************************************************************

 

 

Summary:

  • Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM
  • Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not been consumed already)
    [TODO]: to investigate on where these definitions can be found to have a common reference and a common understanding
  • For Fatal error neither ARM or Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)
    [TODO]: to better analyze these paths. It is not clear if for fatal errors these paths are different in the case GHES+APEI enabled vs GHES+APEI disabled
  • For non Fatal Error Chris presented a SW recovery path that is implemented in GHES.c.
    [TODO]: to investigate what happens if FW First (i.e. GHES and APEI) is not enabled on both IA and ARM (on Intel these seem to be reported by the EDAC driver plugging into the MCE handler; e.g.:
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/i10nm_base.c#n319
  • Correctable Errors: we are aligned on the role of EDAC in counting CEs.
    [TODO]: to understand if a systematic capability is required in the reporting of latent faults

 

 

Gab

 

---------------------------------------------------------------------
INTEL CORPORATION ITALIA S.p.A. con unico socio
Sede: Milanofiori Palazzo E 4
CAP 20094 Assago (MI)
Capitale Sociale Euro 104.000,00 interamente versato
Partita I.V.A. e Codice Fiscale  04236760155
Repertorio Economico Amministrativo n. 997124
Registro delle Imprese di Milano nr. 183983/5281/33
Soggetta ad attivita' di direzione e coordinamento di
INTEL CORPORATION, USA

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.



Christopher Temple
 

Hi Gab,

 

Here we need to understand (to start with Intel and ARM) which HW errors may eventually represent a hazard. The reason is to identify respective error reporting and handling paths in Linux and highlight HW independent paths.

 

It is impossible to determine hazards w/o assuming some application properties. W/o any assumptions literally any HW error may eventually represent a hazard.

  • The term “hazard” describes the dangerous behaviour of a system as a whole as perceived from the context.  
  • This is also reflected by the definition of “hazard” in ISO 26262, namely “potential source of harm caused by malfunctioning behaviour of the item”.

 

On this point on ARM if we have an async external abort error is SW called to take action?

 

As explained this is equivalent to the Recoverable-not-continuable SRAR Error. (The abort is the exception rather than the error).

  • "In either case, this indicates that the error is detected at the instruction pointer saved on the stack for this machine check exception and restarting execution with the interrupted context is not possible."
  • "System software must terminate the interrupted stream of execution"

 

Earlier on you wrote “On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard.”

What instruction do you see the x86 executing, if no instruction can be returned by the memory controller?

 

In my personal opinion, if we commonly rely on Linux to handle errors we should continue to do so, especially if we cannot demonstrate that doing it in FW is safer.

 

We need to base conclusions on an accurate understanding of how the Linux services are initiated.

In both x86 and Arm the fatal memory error gets handled through an exception and not through the Linux EDAC subsystem. As explained no FW gets executed.

 

Where do you see that the fatal error is handled in FW?

 

In short here my point is "Ok, of course we can delegate safety critical tasks to dedicated FW or HW….but then what is the value of ELISA 😊 ?"

 

Nobody is delegating safety critical tasks to FW or HW – we are simply looking at the reality of how both x86 and Arm behave (which in this case is identical).

 

The objective of ELISA is to enable the use of Linux in safety applications. W/o this discussion someone may have missed the criticality of the exception handling, while having assumed that the Linux EDAC subsystem is the critical SW in the assumed application.

 

Isn’t this value?

 

Best regards

Chris

 

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Wednesday, July 29, 2020 11:21 AM
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

Yes I think we are quite aligned on the overall scenario. Now with respect to the points below in order to move forward I think we should do what is marked in orange

 

  • The basic behaviour between different hardware architectures is not fundamentally different.
    • However, it is important to understand the details – in some cases subtle differences exist and these need to be understood.

Here we need to understand (to start with Intel and ARM) which HW errors may eventually represent a hazard. The reason is to identify the respective error reporting and handling paths in Linux and to highlight HW-independent paths. On this point: on ARM, if we have an async external abort error, is SW called to take action? If it isn’t, could we have a hazard? (Action possibly includes killing the app or the OS.)

 

  • The safety dependency on Linux depends on how Linux is used in specific applications.
    • It is possible to integrate Linux in a way that the safety dependency remains small, it is also possible to integrate it in a way that the safety dependency is big.

Right, we need to discuss this. In my personal opinion, if we commonly rely on Linux to handle errors we should continue to do so, especially if we cannot demonstrate that doing it in FW is safer.

In short here my point is “Ok, of course we can delegate safety critical tasks to dedicated FW or HW….but then what is the value of ELISA 😊 ?”

 

Thanks

Gab

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Wednesday, July 29, 2020 10:50 AM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

Absolutely, and I don’t think there was ever any doubt that critical SW exists that has to perform sensitive tasks.

 

The key task is finding the critical SW.

 

At the beginning of the discussion there was a strong focus on the Linux EDAC subsystem and that the Linux EDAC subsystem would be the critical SW path in the event of fatal memory errors.

 

I think now with more collective understanding in the team of how Linux interacts with the hardware it is clear that the critical SW path for fatal memory errors is in the exception handling.

 

In the discussion we came across numerous interesting findings, for example:

  • The descriptions on kernel.org were never developed with the intent to be read with a safety context in mind and are therefore not as precise as one would like them to be.
  • The basic behaviour between different hardware architectures is not fundamentally different.
    • However, it is important to understand the details – in some cases subtle differences exist and these need to be understood.
  • The safety dependency on Linux depends on how Linux is used in specific applications.
    • It is possible to integrate Linux in a way that the safety dependency remains small, it is also possible to integrate it in a way that the safety dependency is big.

 

We shouldn’t lose sight of the fact that if the Linux EDAC subsystem is used to boost availability in light of non-fatal memory faults, it still plays an important role in the overall safety story via the asynchronous flow.

 

You wanted to clarify what safety activities from the ISO standard need to be covered in this case.

 

Best regards

Chris

 

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Wednesday, July 29, 2020 1:29 AM
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

I think the point here is that the HW relies on SW actions being taken (properly) in order to either fix the errors and continue the program execution or kill the execution of the program (in the Intel case recoverable-continuable or recoverable-not-continuable respectively); in either case the SW has sensitive tasks to perform that, if not done properly, may lead to a hazard. Do you agree with my point here?


Now I am not talking about EDAC anymore (EDAC could be relevant if errors are reported through NMI instead of MCA); instead I am trying to find out the Linux code paths involved in the handling of such errors, the reason being that a safety critical app may rely on these error handling paths to claim its integrity…

 

Thanks

Gab

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Tuesday, July 28, 2020 7:41 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

There are two contexts, in which the terms synchronous and asynchronous are used, that need to be differentiated.

 

The first context - that I have been referring to so far - is about the synchronous flow and the asynchronous flow in respect to the memory controller error reporting.

  • The synchronous flow ensures that the transaction gets completed.
  • In the case of a fatal memory read the core gets informed inherently through the synchronous flow that the memory read from DRAM failed.
  • The exception is raised immediately since the CPU needs to know that no valid instruction (or data) could be returned.
  • This is where imho the critical error handling should take place – due to the immediate nature the execution is started within the FTTI – it happens consecutively to the failed read.

 

 

To my understanding the basic principle is the same on x86 and Arm. In both architectures the situation is resolved w/o the Linux EDAC subsystem.

 

The second context is about whether the CPU classifies the abort exception that it receives through the synchronous flow as synchronous-external-abort or asynchronous-external-abort.

  • This is a quite detailed uArchitecture aspect of how the abort is handled inside the CPU.
  • “Synchronous precise” means that all instructions preceding the exception were executed and no instruction after the exception has been executed.
    • In this case one could try to re-execute the failed memory interaction upon rectification of the error.
  • “Asynchronous precise” and “asynchronous imprecise” pertain to out-of-order execution, where the CPU may have rearranged the instruction execution.
    • Since in this case the CPU may have already executed subsequent instructions the correlation between the instruction pointer and a precise statement, which instructions have been executed when the exception occurs, is hard or not possible (compared to what you would be seeing in the assembly code compiled from C).
  • Hence the terms “precise” and “imprecise”, which are used to distinguish the different cases precisely.

 

Within the Arm architecture it is up to the CPU to choose and different uArchitectures will do different things since some devices support out-of-order execution, while others don’t.

 

As far as I can tell this is the context that you have copied into the email below.

 

I would say

  • Recoverable-Continuable SRAR Error (RIPV=1, EIPV=1) relates to "Synchronous precise"
    • “System software may be able to restart execution from the interrupted context if it is able to rectify the error condition.”
  • Recoverable-not-continuable SRAR Error (RIPV=0, EIPV=x) relates to the “Asynchronous” cases
    • “In either case, this indicates that the error is detected at the instruction pointer saved on the stack for this machine check exception and restarting execution with the interrupted context is not possible.”

 

Again the basic principle seems pretty much the same on x86 and Arm. In both architectures the situation is resolved w/o the Linux EDAC subsystem.

 

I’m not sure whether the distinction between synchronous-external-abort and asynchronous-external-abort is really relevant for our discussion (unless we want to attempt some complex recovery in pipelined out-of-order CPUs, but, for the record, this can be insanely tricky; best to proceed as stated in the x86 text below: “System software must terminate the interrupted stream of execution”).

 

Where do you see the relevance of this aspect?

 

Earlier on you wrote “On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard.”

 

What instruction do you see the x86 executing, if no instruction can be returned by the memory controller?

 

Best regards

Chris

 

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Tuesday, 28 July 2020 18:25
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

I mean errors associated with what you call ‘async path’… How do you classify these on ARM?

 

On Intel I have provided a clear extensive classification of errors….

 

Thanks

Gab

 

From: Christopher Temple <Christopher.Temple@...>
Sent: Tuesday, July 28, 2020 6:16 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

I have never used the term “async errors”. What are “async errors”?

 

Best regards

Chris

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Tuesday, 28 July 2020 16:45
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

Are your async errors the same as UE SW Recoverable on Intel?

If yes, on IA I think these could be dangerous if not recovered within the FTTI or PST; for instance, the Intel SDM gives the UE SRAR explanation quoted at the bottom, and in my view SW failing to take action or (worse) taking the wrong recovery action could result in corruption and hence a hazard here.

Maybe for some reason on ARM it is safe not to handle UE within the PST/FTTI… What I am asking is to build the picture for both ARM and IA so that later on we can define which SW parts of Linux are assigned safety-relevant tasks (i.e. safety reqs).

 

Thanks

Gab

 

 

SRAR Error And Affected Logical Processors
The affected logical processor is the one that has detected and raised an SRAR error at the point of the consumption in the execution flow. The affected logical processor should find the Data Load or the Instruction Fetch error
information in the IA32_MCi_STATUS register that is reporting the SRAR error.
Table 15-20 list the actionable scenarios that system software can respond to an SRAR error on an affected logical
processor according to RIPV and EIPV values:
Recoverable-Continuable SRAR Error (RIPV=1, EIPV=1):
For Recoverable-Continuable SRAR errors, the affected logical processor should find that both the
IA32_MCG_STATUS.RIPV and the IA32_MCG_STATUS.EIPV flags are set, indicating that system software may
be able to restart execution from the interrupted context if it is able to rectify the error condition. If system
software cannot rectify the error condition then it must treat the error as a recoverable error where restarting
execution with the interrupted context is not possible. Restarting without rectifying the error condition will
result in most cases with another SRAR error on the same instruction.

Recoverable-not-continuable SRAR Error (RIPV=0, EIPV=x):
For Recoverable-not-continuable errors, the affected logical processor should find that either
— IA32_MCG_STATUS.RIPV= 0, IA32_MCG_STATUS.EIPV=1, or
— IA32_MCG_STATUS.RIPV= 0, IA32_MCG_STATUS.EIPV=0

In either case, this indicates that the error is detected at the instruction pointer saved on the stack for this
machine check exception and restarting execution with the interrupted context is not possible. System
software may take the following recovery actions for the affected logical processor:
The current executing thread cannot be continued. System software must terminate the interrupted
stream of execution and provide a new stream of execution on return from the machine check handler
for the affected logical processor.
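
As a rough illustration of the two SRAR scenarios quoted above, the following minimal sketch maps the RIPV and EIPV flags to the two cases. The function name and the return strings are hypothetical, and the bit positions of RIPV and EIPV within IA32_MCG_STATUS are an assumption taken from the Intel SDM rather than from the excerpt; they should be checked before any real use.

```python
# Bit positions in IA32_MCG_STATUS (assumed from the Intel SDM; verify before use).
MCG_STATUS_RIPV = 1 << 0  # restart IP valid
MCG_STATUS_EIPV = 1 << 1  # error IP valid


def classify_srar(mcg_status: int) -> str:
    """Map RIPV/EIPV to the two SRAR scenarios described in the quoted text."""
    ripv = bool(mcg_status & MCG_STATUS_RIPV)
    eipv = bool(mcg_status & MCG_STATUS_EIPV)
    if ripv and eipv:
        # Execution may be restarted, but only if the error condition is rectified.
        return "recoverable-continuable"
    if not ripv:
        # EIPV may be 0 or 1; the interrupted stream must be terminated.
        return "recoverable-not-continuable"
    # RIPV=1, EIPV=0 is not one of the two scenarios in the quoted text.
    return "not covered by the quoted text"
```

Note that even in the continuable case the quoted text requires the error to be rectified first; otherwise it has to be treated like the not-continuable case.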

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Tuesday, July 28, 2020 9:29 AM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

You write

“From an OS point of view exceptions are always asynchronous.”

 

That is correct since the OS doesn’t know when a fatal instruction memory read would occur. The exception occurs synchronously with the fatal instruction memory read to inform the CPU that there is no valid instruction available to execute.

 

Regarding the error reaction to fatal instruction memory reads you write

“On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard.”

 

What instruction do you see the x86 executing, if no instruction can be returned by the memory controller?

 

As stated earlier in the discussion, data read/write has its own challenges – let’s agree on the instruction read aspect first.

 

Best regards

Chris

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Monday, 27 July 2020 13:19
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

Thanks, pls see GP inline

 

Gab

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Monday, July 27, 2020 11:51 AM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I agree – we need useful safety requirements for Linux and it would be great if we would receive some closer to real world safety requirements from the domains.

GP: good we’re on the same page here 😊

 

The question at hand was whether it makes sense to include the Linux EDAC subsystem in the mitigation of SPFs. The envisioned use case consists of the kernel, a safety application and an EDAC monitor app.

GP: From our last discussions it seems that we are transitioning from EDAC to the memory fault handling in Linux, but in order to assign safety reqs we need to

  • Understand error types and define which ones we ‘assume’ to be dangerous for a generic safety app running on top
  • Identify which parts of Linux are involved in reporting and handling the dangerous errors depending on the different HWs (e.g. ARM vs Intel vs Others) and configurations (e.g. ACPI APEI on vs ACPI APEI off)
  • Allocate SReqs to Linux accordingly

 

The conclusion from the discussions seems to be that at least in the use case drafted by the architecture WG some weeks ago this doesn’t really add value.

 

I’ve summarized the conclusion in the table:

 

 

Imho I think we have gained some very valuable insights from the discussion, which range from inaccuracies in the descriptions on kernel.org to the need to understand how Linux interacts with hardware in failure mode situations.

GP: Right now if we take away the EDAC framework and if we rely on Linux to kill itself or the safety app we need to allocate Linux with SReqs accordingly. So yes, you are right, we are transitioning from EDAC to a different model

 

The learning for this use case seems to be that

  • the safety integrity for SPF really resides on the Linux execution in the synchronous path and not on the asynchronous path (including firmware);

GP: Here I think we need to investigate further. From an OS point of view exceptions are always asynchronous. Looking at your slide, what you call async flow is related to the handling of UE SW Recoverable. On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard. As you can see at the bottom of this thread, I am asking to do a similar investigation on ARM to make sure that what you call ‘async flow’ is always related to errors that do not represent a hazard.

  • the interaction with an external safeing device is done better via some watchdog mechanism through the safety application.

GP: I agree on the WD, yes

  • the asynchronous path is important for latent faults (as you write we still need to clarify if the full symmetric integrity of the safety goal is needed in this case)

 

I hope I got it right.

 

Let’s discuss tomorrow.

GP: OK, I’ll save 15min for the second part of the meeting as we have already the Linux Kernel map on the agenda.

 

Best regards

Chris

 

 

From: Paoloni, Gabriele <gabriele.paoloni@...>
Sent: Saturday, 25 July 2020 10:20
To: Christopher Temple <Christopher.Temple@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Chris

 

To be honest from my point of view it is important to just assign some safety requirements to Linux to get started with the safety analyses. I would expect detailed and more correct TSRs to come from domain specific WGs.
Hence I proposed EDAC (because if you rely on NMI error handling instead of MCA then EDAC is responsible for error handling) because from my personal point of view I would just be happy to start with assigning some safety
reqs, maybe not complete, not fully correct but that at least would allow us to focus on the safety analyses (FFI, FMEA or FTA) where I believe we’ll find common challenges regardless of which specific subsystems of Linux;
challenges that we can solve and where we can apply the solution to other scenarios once we have them from the domain specific WGs.

 

However EDAC was challenged and now we moved to analyze error reporting and handling specifically for MCA and the ARM counterpart of MCA (see Tony and your presentation on last Tue). My end goal is still to come up
with some TSRs that can be accepted and that would allow us to move on. Hence I think we need to:

  a) Classify the errors and decide which we can ‘assume’ to be a hazard
  b) See how these ‘hazardous’ errors are handled and if it makes sense to rely on Linux (see email from Lukas “[…] How does the HW react to a memory fault, that was caused by a required memory access for a specific operation of the safety function?”)
  c) Following on from b), define TSRs to be allocated to Linux

 

In my previous analysis of a), since I don’t have a specific context (a specific application running on a specific system), I simply assume that a fatal error, if not reported, is a hazard, and so is UE SRAR, because for UE SRAR the error has been consumed and an action must be taken to avoid incorrect data.

Having said that the reason why I asked ARM to do a similar analysis is to move on to the next steps b) and c) where we go and check the Linux paths involved in handling the respective errors.

 

With respect to your points below

  • the termination path via the hardware is faster;

Do we have a termination path via HW on all systems? Wouldn’t it be good to rely on Linux, or to analyse how much we can rely on Linux?
And BTW is it really faster or safer… see previous discussion with Corey?

  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

Even if you rely on FW to handle errors you would have the same problem (see previous discussion with Corey)

 

BTW in my view it is crucial to assign safety reqs to Linux to start the analyses; these may not be fully correct in some specific contexts, or not complete, but they’ll allow us to go on with safety analyses and to start looking at the challenges that we’ll also encounter later on as TSRs come from domain specific WGs.
If we continue like this we will keep going in circles and will not be able to work on the core of this WG.

 

Thanks

Gab

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Christopher Temple
Sent: Friday, July 24, 2020 6:49 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: Re: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

I don’t understand why we need all those details. The question about what constitutes a hazard is not resolved in the details.

 

We should first double check the summary - here are some comments:

 

  1. “Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM”

 

Yes, there are very low level differences in the HW (for example Arm does not have an NMI), but ELISA really shouldn’t be arguing at that level.

 

The assumption from our side was that the general error handling flow (at the level shown on Tuesday) essentially captures what happens in either architecture. As we pointed out the BIOS is in fact the same.

 

 

  2. “Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not having been consumed yet)”

 

One cannot designate a fatal error as hazard w/o considering

  • the interface at which the fatal error is observed,
  • the transaction that is affected, and
  • the application use case.

 

Assume, for example, a system consisting of the kernel, one or more safety critical applications and one or more non-safety critical applications.

 

A fatal memory read error (ie. memory controller cannot return valid data to the core) occurring during the execution of the non-safety critical application, for example, will not lead to a violation of a safety goal and is hence no direct hazard.

 

A fatal memory read error occurring during the execution of the safety critical application will lead to loss of service – this will only lead to a violation of the safety goal if the system has been designed such that loss of this service leads to a hazard. However, as there are many (more likely) error cases that can lead to loss of service provisions for this case need to be in place in any case.

 

Similarly fatal memory read errors can also occur during other operations, such as DMA transfers, which require different reasoning.
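
The reasoning in the example above can be written down as a small decision function. This is only a sketch of the argument as stated: the function name, the context labels and the result strings are hypothetical, and anything other than the two application cases (kernel, DMA transfers, and so on) is deliberately left as needing its own analysis.

```python
def fatal_read_error_outcome(context: str, loss_of_service_is_hazard: bool = False) -> str:
    """Outcome of a fatal memory read error, depending on what was executing.

    context: "non_safety_app" or "safety_app" (hypothetical labels); any other
    value stands for cases such as DMA transfers that need different reasoning.
    loss_of_service_is_hazard: whether the system was designed such that losing
    the safety application's service leads to a hazard.
    """
    if context == "non_safety_app":
        return "no direct hazard"
    if context == "safety_app":
        if loss_of_service_is_hazard:
            return "loss of service, violates safety goal"
        return "loss of service, no violation of safety goal"
    return "needs separate reasoning"
```

The point of the sketch is that the error severity alone is not an input that decides the outcome; the affected context and the system design are.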

 

  3. “For Fatal error neither ARM nor Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)”

This needs a more nuanced consideration.

 

Neither Arm nor Intel based systems use the Linux EDAC subsystem for the direct error reaction to fatal memory read errors that occur during instruction execution.

 

Why am I so particular?

  • Firstly, because the general term “EDAC” is often used to refer to the complete sphere of HW and SW execution around error detection and control.
    • So with this understanding of the term “EDAC” the correct statement would be that “in both Arm and Intel based systems the HW EDAC subsystem initiates the error reaction to fatal memory read errors that occur during instruction execution”.
  • Secondly, because the term “rely” does not relate to a specific instance.
    • Both Arm and Intel based systems can use the Linux EDAC subsystem for a posteriori diagnosis of what happened.
    • So in the a posteriori phase both Arm and Intel based systems can make use of it, i.e. “rely” on it.

 

  4. “For non Fatal Error Chris presented a SW recovery path that is implemented in GHES.c.”

We explained the basic execution flow in the case of fatal memory read errors and in the case of non-fatal memory read errors – I would not call either of these a “SW recovery path”, as nothing is recovered.

 

 

  5. “Correctable Errors: we are aligned on the role of EDAC in counting CEs.”

As explained the EDAC counts much more:

I didn’t see that this was ever disputed.

 

What we did challenge was the idea to drive a SW driven error reaction from HW_EVENT_ERR_FATAL by placing the Linux EDAC subsystem into a critical path in the event of fatal memory read errors that occur during instruction execution.

 

We raised two arguments:

  • the termination path via the hardware is faster;
  • if the fatal memory read error affects the kernel or the process executing the Linux EDAC subsystem then the SW driven error reaction will never be executed.

 

Is this disputed any longer?

 

We should try to keep the level of detail needed as simple as possible.

 

I’m happy to explain our thoughts on Tuesday.

 

Best regards

Chris

 

 

From: safety-architecture@... <safety-architecture@...> On Behalf Of Paoloni, Gabriele via lists.elisa.tech
Sent: Friday, 24 July 2020 13:18
To: myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; Christopher Temple <Christopher.Temple@...>; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi All

 

I am following up on the first TODO of the reference summary at the bottom.

[TODO]: to investigate on where these definitions can be found to have a common reference and a common understanding

 

Please review and give your feedback.
@Chris please see the ‘next steps’ on this specific point; it would be good to provide similar detail as I did for Intel (mainly to understand what represents a hazard and what does not)

 

Thanks

Gab

------------------

Follow-up

------------------

 

We can use the definition as in ACPI specs 2.8 Appendix N – Table 54 Error Record Header:

Field: Error Severity
Byte offset: 12
Byte length: 4
Description:

Indicates the severity of the error condition. The severity of the
error record corresponds to the most severe error section.
0 - Recoverable (also called non-fatal uncorrected)
1 - Fatal
2 - Corrected
3 - Informational
All other values are reserved.
Note that severity of "Informational" indicates that the record
could be safely ignored by error handling software.
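
To make the severity encoding above concrete, here is a minimal sketch that models the four values and derives a record severity from its section severities. The class and function names are hypothetical. The ranking used for "most severe" (Fatal, then Recoverable, then Corrected, then Informational) is an assumption; the quoted text only says that the record takes the severity of the most severe section.

```python
from enum import IntEnum


class ErrorSeverity(IntEnum):
    """Error Severity values as listed in the quoted table."""
    RECOVERABLE = 0    # also called non-fatal uncorrected
    FATAL = 1
    CORRECTED = 2
    INFORMATIONAL = 3  # may be safely ignored by error handling software


# Assumed ranking, most severe first; the quoted text does not spell it out.
_RANK = [
    ErrorSeverity.FATAL,
    ErrorSeverity.RECOVERABLE,
    ErrorSeverity.CORRECTED,
    ErrorSeverity.INFORMATIONAL,
]


def record_severity(section_severities):
    """Return the most severe section severity under the assumed ranking.

    Raises ValueError for reserved values and for an empty list.
    """
    sections = [ErrorSeverity(s) for s in section_severities]
    return min(sections, key=_RANK.index)
```

Note that the numeric values are not ordered by severity (Recoverable is 0 and Fatal is 1), so comparing the raw integers would give the wrong answer.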

From the UEFI Specs we have:

Machine Check Exception (MCE): {0xE8F56FFE, 0x919C, 0x4cc5, {0xBA, 0x88, 0x65, 0xA
0x49, 0x13, 0xBB}}
A Machine Check Exception is a processor-generated exception class interrup
to system software of the presence of a fatal or recoverable error condition

 

Non-Maskable Interrupt (NMI): {0x5BAD89FF, 0xB7E6, 0x42c9, {0x81, 0x4A, 0xCF, 0x24, 0x85,
0xD6, 0xE9, 0x8A}}
Non-Maskable Interrupts are used on X64 platforms to report fatal or recoverable
platform error conditions. NMIs are reported via interrupt vector 2 on IA32 and X64
processor architecture platforms.

 

Synchronous External Abort (SEA): {0x9A78788A, 0xBBE8, 0x11E4, {0x80, 0x9E, 0x67, 0x61,
0x1E, 0x5D, 0x46, 0xB0}}
Synchronous External Aborts represent precise processor error conditions on ARM
systems (uncorrectable and/or recoverable) as described in D3.5 of the ARMv8 ARM
reference manual. This notification may be triggered by one of the following
scenarios: cache parity error, cache ECC error, external bus error, micro-architectural
error, data poisoning, and other platform errors.

 

SError Interrupt (SEI): {0x5C284C81, 0xB0AE, 0x4E87, {0xA3, 0x22, 0xB0, 0x4C, 0x85, 0x62,
0x43, 0x23}}
SError Interrupts represent asynchronous imprecise (or possibly precise) processor
error conditions on ARM systems (corrected, uncorrectable, and recoverable) as
described in D3.5 of the ARM ARM reference manual. This notification may be
triggered by one of the following scenarios: cache parity error, cache ECC error,
external bus error, micro-architectural error, data poisoning, and other platform
errors.

 

Platform Error Interrupt (PEI): {0x09A9D5AC, 0x5204, 0x4214, {0x96, 0xE5, 0x94, 0x99, 0x2E,
0x75, 0x2B, 0xCD}}
Platform Error Interrupts represent asynchronous imprecise platform error conditions
on ARM systems that may be triggered by the following scenarios: system memory
ECC error, ECC errors in system cache (e.g. shared high-level caches), vendor specific
chip errors, external platform errors.
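
For anyone who wants to match these notification types in tooling, the sketch below converts the {Data1, Data2, Data3, {Data4[8]}} notation used above into standard UUIDs and builds a lookup table. The helper names are hypothetical. The MCE entry is left out because its GUID is truncated in the text above.

```python
import uuid


def efi_guid(d1, d2, d3, d4):
    """Build a UUID from the {Data1, Data2, Data3, {Data4[8]}} notation."""
    node = int.from_bytes(bytes(d4[2:]), "big")  # last six bytes of Data4
    return uuid.UUID(fields=(d1, d2, d3, d4[0], d4[1], node))


# Notification type GUIDs as listed in the text above (MCE omitted: truncated there).
NOTIFICATION_TYPES = {
    efi_guid(0x5BAD89FF, 0xB7E6, 0x42C9,
             [0x81, 0x4A, 0xCF, 0x24, 0x85, 0xD6, 0xE9, 0x8A]): "NMI",
    efi_guid(0x9A78788A, 0xBBE8, 0x11E4,
             [0x80, 0x9E, 0x67, 0x61, 0x1E, 0x5D, 0x46, 0xB0]): "SEA",
    efi_guid(0x5C284C81, 0xB0AE, 0x4E87,
             [0xA3, 0x22, 0xB0, 0x4C, 0x85, 0x62, 0x43, 0x23]): "SEI",
    efi_guid(0x09A9D5AC, 0x5204, 0x4214,
             [0x96, 0xE5, 0x94, 0x99, 0x2E, 0x75, 0x2B, 0xCD]): "PEI",
}


def notification_name(guid: uuid.UUID) -> str:
    """Return the short name for a notification type GUID, or "unknown"."""
    return NOTIFICATION_TYPES.get(guid, "unknown")
```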

 

 

From a HW specific point of view on Intel this would map to: https://www.intel.it/content/www/it/it/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html

FATAL Errors: RIPV bit in IA32_MCG_STATUS MSR not set.

 

Uncorrected Errors (SW Recoverable):

When IA32_MCG_CAP[24] is set, a UCR error is indicated by the following bit settings in the IA32_MCi_STATUS register:

Valid (bit 63) = 1
UC (bit 61) = 1
PCC (bit 57) = 0

There are 3 further classifications of uncorrected errors:

  1. Uncorrected no action required (UCNA) - UCNA errors require no action from system software to continue
    execution. A UCNA error is indicated with UC=1, PCC=0, S=0 and AR=0 in the IA32_MCi_STATUS register
  2. Software recoverable action optional (SRAO): An SRAO error when signaled as a machine check is indicated with
    UC=1, PCC=0, S=1, EN=1 and AR=0 in the IA32_MCi_STATUS register. System software
    needs to inspect the MCA error code fields in the IA32_MCi_STATUS register to identify the specific recovery
    action for a given SRAO error. If MISCV and ADDRV are not set, it is recommended that no system software
    error recovery be performed; however, system software can resume execution.
  3. Software recoverable action required (SRAR) - SRAR errors indicate that the error was detected and raised
    at the point of the consumption in the execution flow. An SRAR error is indicated with UC=1, PCC=0, S=1,
    EN=1 and AR=1 in the IA32_MCi_STATUS register. System software needs to inspect the MCA error code
    fields in the IA32_MCi_STATUS register to identify the specific recovery action for a given SRAR error.
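
The bit patterns listed above can be turned into a small decoder. This is a sketch under stated assumptions: the function name and result strings are hypothetical; the positions of VAL, UC and PCC come from the text above, while the positions of EN, S and AR are taken from my reading of the Intel SDM and should be verified. It also assumes IA32_MCG_CAP[24] is set, since the S and AR bits are only meaningful in that case.

```python
# Bit positions in IA32_MCi_STATUS. VAL, UC and PCC are given in the text above;
# EN, S and AR are assumed from the Intel SDM and should be verified.
VAL = 1 << 63
UC = 1 << 61
EN = 1 << 60
PCC = 1 << 57
S = 1 << 56
AR = 1 << 55


def classify_mci_status(status: int) -> str:
    """Classify an IA32_MCi_STATUS value using the patterns listed above."""
    if not status & VAL:
        return "invalid"
    if not status & UC:
        return "corrected"
    if status & PCC:
        return "not UCR"  # PCC=1 falls outside the UCR pattern (UC=1, PCC=0)
    s, ar, en = bool(status & S), bool(status & AR), bool(status & EN)
    if not s and not ar:
        return "UCNA"
    if s and en and not ar:
        return "SRAO"
    if s and en and ar:
        return "SRAR"
    return "unclassified"
```

For example, `classify_mci_status(VAL | UC | S | EN | AR)` yields `"SRAR"`, the class that the summary below treats as a hazard to be handled within the PST/FTTI.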

 

Corrected Errors: UC bit clear in IA32_MCi_STATUS; Errors reported through CMCI interrupt

-------------------

Summary: In my view

-        Recoverable errors as in UEFI specs are the same as UCR errors in Intel SDM

-        [Fatal errors] and [UCR – SRAR] both represent a hazard to be handled within the PST/FTTI.

 

Next Steps

  • On ARM try to do a detailed analysis of the error classifications as done here for Intel
  • Clarify what PEIs (Platform Error Interrupts) are used for and whether they are used to report hazardous errors

 

 

****************************************************************************************

***********          REFERENCE   SUMMARY BELOW  *********************************************

****************************************************************************************

 

 

Summary:

  • Chris presented the error handling focusing on how ARM behaves but trying to find a common path for both Intel and ARM
  • Fatal errors represent a direct hazard whereas non Fatal errors (meant as Uncorrected SW recoverable) represent a possible hazard if not recovered (here it seems that we have a HW failure in memory with the memory not having been consumed yet)
    [TODO]: to investigate on where these definitions can be found to have a common reference and a common understanding
  • For Fatal error neither ARM nor Intel rely on EDAC (for ARM64 these are handled in do_sea and do_serror whereas for Intel these are handled in the MCE)
    [TODO]: to better analyze these paths. It is not clear if for fatal errors these paths are different in the case GHES+APEI enabled vs GHES+APEI disabled
  • For non Fatal Error Chris presented a SW recovery path that is implemented in GHES.c.
    [TODO]: to investigate what happens if FW First (i.e. GHES and APEI) is not enabled on both IA and ARM (on Intel these seem to be reported by the EDAC driver plugging into the MCE handler; e.g.:
    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/edac/i10nm_base.c#n319)
  • Correctable Errors: we are aligned on the role of EDAC in counting CEs.
    [TODO]: to understand if a systematic capability is required in the reporting of latent faults

 

 

Gab

 

---------------------------------------------------------------------
INTEL CORPORATION ITALIA S.p.A. con unico socio
Sede: Milanofiori Palazzo E 4
CAP 20094 Assago (MI)
Capitale Sociale Euro 104.000,00 interamente versato
Partita I.V.A. e Codice Fiscale  04236760155
Repertorio Economico Amministrativo n. 997124
Registro delle Imprese di Milano nr. 183983/5281/33
Soggetta ad attivita' di direzione e coordinamento di
INTEL CORPORATION, USA

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.



Paoloni, Gabriele <gabriele.paoloni@...>
 

Hi Chris

 

Many thanks, I am responding point by point below.

 

>> It is impossible to determine hazards w/o assuming some application properties. W/o any assumptions literally any HW error may eventually represent a hazard.

As you know, right now we do not have a clear use case or safety app. Maybe we can agree on a subset of errors that could likely be related to a hazard, so that we can evaluate them and assign Safety Reqs to the related Linux handling routines.
BTW this is the critical point we need to agree on to make progress, I think…

 

>> As explained this is equivalent to the Recoverable-not-continuable SRAR Error. (The abort is the exception rather than the error).

Right, I am trying to summarize below:

 

Error Type | Intel | ARM | SW Action | Could it be a hazard if no SW action is taken?
-----------|-------|-----|-----------|-----------------------------------------------
Fatal | MCA Fatal | Fatal (Synchronous Flow) | Panic (to see if it is possible here to just kill the app) | Yes
Uncorrected (SW Recoverable - Non Continuable) | UE Recoverable-not-continuable SRAR Error (RIPV=0, EIPV=x) | Asynchronous precise and imprecise | Kill the execution flow (i.e. kill the app) | Yes
Uncorrected (SW Recoverable - Continuable) | UE Recoverable-Continuable SRAR Error (RIPV=1, EIPV=1) | Synchronous precise | Rectify the error and recover execution at the saved instruction pointer | Yes
Uncorrected (SW No Action Required or Optional) | UE UCNA or SRAO | Anything like this on ARM? | Maybe Logging? | No
Corrected Errors | CE | Not sure how these are reported on ARM | Logging (through EDAC or also other?) | No
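
To make the classification above easier to reason about when walking the Linux paths, here is a minimal sketch in C that encodes the five rows as a lookup table. All identifiers (err_class, sw_action, err_policy, lookup_policy, needs_path_analysis) are hypothetical and exist only for illustration; they are not kernel APIs. The entries that are still open questions in the table ("Maybe Logging?") are encoded as plain logging and flagged as tentative in the comments.

```c
#include <stdbool.h>
#include <stddef.h>

/* One value per row of the summary table (hypothetical names). */
enum err_class {
    ERR_FATAL,
    ERR_UC_NOT_CONTINUABLE, /* Uncorrected, SW recoverable, non continuable */
    ERR_UC_CONTINUABLE,     /* Uncorrected, SW recoverable, continuable */
    ERR_UC_NO_ACTION,       /* Uncorrected, SW action not required/optional */
    ERR_CORRECTED,
    ERR_CLASS_COUNT
};

/* Expected SW reaction, as listed in the "SW Action" column. */
enum sw_action {
    ACT_PANIC,
    ACT_KILL_FLOW,
    ACT_RECOVER_AND_RESUME,
    ACT_LOG
};

struct err_policy {
    enum sw_action action;
    bool hazard_if_unhandled; /* last column of the table */
};

static const struct err_policy policy[ERR_CLASS_COUNT] = {
    [ERR_FATAL]              = { ACT_PANIC,              true  },
    [ERR_UC_NOT_CONTINUABLE] = { ACT_KILL_FLOW,          true  },
    [ERR_UC_CONTINUABLE]     = { ACT_RECOVER_AND_RESUME, true  },
    [ERR_UC_NO_ACTION]       = { ACT_LOG,                false }, /* tentative */
    [ERR_CORRECTED]          = { ACT_LOG,                false },
};

/* Returns the policy for a class, or NULL for an out-of-range value. */
const struct err_policy *lookup_policy(enum err_class c)
{
    if ((int)c < 0 || (int)c >= ERR_CLASS_COUNT)
        return NULL;
    return &policy[c];
}

/* True for the rows whose handling paths would need safety analysis. */
bool needs_path_analysis(enum err_class c)
{
    const struct err_policy *p = lookup_policy(c);
    return p != NULL && p->hazard_if_unhandled;
}
```

With this encoding, needs_path_analysis() is true exactly for the first three rows, which are the ones proposed for further analysis.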

 

>> What instruction do you see the x86 executing, if no instruction can be returned by the memory controller?
Maybe I used the wrong wording; my point is that for those UEs where a SW action is required, if that action is not taken, or if the wrong one is taken, we could possibly have a hazard.

 

 

>> We need to base conclusions on an accurate understanding of how the Linux services are initiated.

>> In both x86 and Arm the fatal memory error gets handled through an exception and not through the Linux EDAC subsystem. As explained no FW gets executed.

I agree, and if you look at the table above, in my view as next steps we should analyse the SW paths involved in handling the errors associated with the first 3 rows.

 

>> Where do you see that the fatal error is handled in FW?

As we discussed, if you enable FW First, theoretically you could handle everything in FW and never pass the exception up to the OS; however, as Tony pointed out, usually the FW only does some logging before handing over to the OS.

 

>> The objective of ELISA is to enable the use of Linux in safety applications. W/o this discussion someone may have missed the criticality of the exception handling, while having assumed that the Linux EDAC subsystem is the critical SW in the assumed application.

>> Isn’t this value?

Oh yeah! And in fact, as we have now got much more clarity from an error classification point of view, we can move on with the other TODOs in the plan at the very bottom, i.e. analysing the Linux paths associated with handling the first three rows of the table above… do you agree on this?

 

Many Thanks

Gab

 

From: Christopher Temple <Christopher.Temple@...>
Sent: Wednesday, July 29, 2020 2:26 PM
To: Paoloni, Gabriele <gabriele.paoloni@...>; myu@...; artem_mygaiev@...; hartkopp@...; doris_wild@...; Kate Stewart <kstewart@...>; jochen.kall@...; tglx@...; Gurvitz, Eli (Mobileye) <eli.gurvitz@...>; Copperman, Elana (Mobileye) <elana.copperman@...>; slotosch@...; Paccapeli, Roberto <roberto.paccapeli@...>; Iacaruso, Maurizio <maurizio.iacaruso@...>; mbeltran@...; Ghosh, Joyabrata <joyabrata.ghosh@...>; Dellosa, Stefano <stefano.dellosa@...>; afaerber@...; lukas.bulwahn@...; Antonio Priore <Antonio.Priore@...>; safety-architecture@...; yasushi.ando@...; dposner@...; aymeric.rateau@...; Jean-Francois CULAT <jean-francois.culat.e@...>; James Morse <James.Morse@...>
Subject: RE: [ELISA Safety Architecture WG] Error Reporting and Handling: follow up on error classification

 

Hi Gab,

 

Here we need to understand (to start with Intel and ARM) which HW errors may eventually represent a hazard. The reason is to identify respective error reporting and handling paths in Linux and highlight HW independent paths.

 

It is impossible to determine hazards w/o assuming some application properties. W/o any assumptions literally any HW error may eventually represent a hazard.

  • The term “hazard” describes the dangerous behaviour of a system as a whole as perceived from the context.  
  • This is also reflected by the definition of “hazard” in ISO 26262, namely “potential source of harm caused by malfunctioning behaviour of the item”.

 

On this point, on ARM, if we have an async external abort error, is SW called to take action?

 

As explained this is equivalent to the Recoverable-not-continuable SRAR Error. (The abort is the exception rather than the error).

  • "In either case, this indicates that the error is detected at the instruction pointer saved on the stack for this machine check exception and restarting execution with the interrupted context is not possible."
  • "System software must terminate the interrupted stream of execution"
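
As a rough illustration of the distinction quoted above, the following C sketch separates the continuable and not-continuable SRAR cases from the RIPV and EIPV flags, using the combinations given in this thread (RIPV=0, EIPV=x versus RIPV=1, EIPV=1). It assumes the error has already been identified as SRAR by other means, which are not modeled here. The macros are defined locally as bits 0 and 1 of the IA32_MCG_STATUS value; the function and enum names are hypothetical.

```c
#include <stdint.h>

/* IA32_MCG_STATUS flag bits, defined locally for this sketch. */
#define MCG_STATUS_RIPV (1ULL << 0) /* restart IP valid */
#define MCG_STATUS_EIPV (1ULL << 1) /* error IP valid */

enum srar_kind {
    SRAR_NOT_CONTINUABLE, /* RIPV=0, EIPV=x: terminate the interrupted flow */
    SRAR_CONTINUABLE,     /* RIPV=1, EIPV=1: may resume at the saved IP */
    SRAR_OTHER            /* RIPV=1, EIPV=0: not one of the two cases above */
};

/* Classify an SRAR error from the raw MCG_STATUS value (hypothetical helper). */
enum srar_kind classify_srar(uint64_t mcg_status)
{
    int ripv = (mcg_status & MCG_STATUS_RIPV) != 0;
    int eipv = (mcg_status & MCG_STATUS_EIPV) != 0;

    if (!ripv)
        return SRAR_NOT_CONTINUABLE; /* EIPV is a don't-care here */
    if (eipv)
        return SRAR_CONTINUABLE;
    return SRAR_OTHER;
}
```

For the not-continuable case the only safe reaction is to terminate the interrupted stream of execution, which is why the saved context cannot simply be resumed.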

 

Earlier on you wrote “On IA I think that a part of these can be considered dangerous as the data has already been consumed, hence if no corrective action is taken, we can have data corruption and hence a hazard.”

What instruction do you see the x86 executing, if no instruction can be returned by the memory controller?

 

In my personal opinion, if we commonly rely on Linux to handle errors we should continue to do so, especially if we cannot demonstrate that doing it in FW is safer.

 

We need to base conclusions on an accurate understanding of how the Linux services are initiated.

In both x86 and Arm the fatal memory error gets handled through an exception and not through the Linux EDAC subsystem. As explained no FW gets executed.

 

Where do you see that the fatal error is handled in FW?