What to do in response to a kernel warning


Shuah Khan
 

All,

This is an active thread about "What to do in response to a kernel
warning" on the Linux kernel mailing lists.

Lukas and others from ELISA have been participating. Give it a read.
Alexander Popov called out ELISA for input and feedback on his
alternative to the big-hammer kernel/panic_on_warn sysctl knob: he
proposes adding a kernel/pkill_on_warn knob that kills the threads
and processes that cause the warning, as opposed to taking the system
down.

Give it a read - if you can't access it now, it will be available without
subscription in a week.

https://lwn.net/Articles/876209/

thanks,
-- Shuah


Lukas Bulwahn
 

Thanks, Shuah, for pointing out the LWN article.

Alex Popov pulled us into a kernel discussion this week on a specific
kernel feature proposal, remarking that it is what safety-critical
systems need.

In short, Alexander Popov suggested that warnings in the kernel need a
refined run-time treatment. I disagreed with him and stated that I
expect panic_on_warn to be turned on in the kernel for
safety-critical systems, and that a safety-critical system would never
try to continue to operate after a WARN(): the risk of malfunction is
larger than the benefit of continued operation.
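
For reference, the current panic_on_warn setting can be inspected from user space. The following is a minimal sketch, assuming a Linux system with procfs mounted at /proc; the function name is illustrative and not taken from the discussion.

```python
from pathlib import Path
from typing import Optional

# Standard procfs location of the kernel.panic_on_warn sysctl on Linux.
PANIC_ON_WARN = Path("/proc/sys/kernel/panic_on_warn")


def panic_on_warn_enabled(path: Path = PANIC_ON_WARN) -> Optional[bool]:
    """Return True/False for the current setting, or None if it cannot be read."""
    try:
        text = path.read_text().strip()
    except OSError:
        # Not Linux, procfs not mounted, or no permission to read the file.
        return None
    try:
        return int(text) != 0
    except ValueError:
        # Unexpected content; treat as unknown rather than guessing.
        return None


if __name__ == "__main__":
    print("panic_on_warn:", panic_on_warn_enabled())
```

Writing 1 to the same file as root enables the behaviour at run time; whether that is the right choice for a given system is exactly the question under discussion here.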

All of this is of course largely a hypothesis based on my
understanding of the requirements of safety-critical systems that may
ever rely on Linux.

I would of course be interested in:

- Do we all agree that setting panic_on_warn is the reasonable choice
for the kernel configuration of the safety-critical systems we are
discussing? Are there arguments against setting panic_on_warn that I
am not aware of or that I misjudged?

- Which warnings and kernel panics do you encounter in your current
test and (early) production systems when switching on panic_on_warn?
We can support each other here to debug and resolve them
appropriately.

Please share such information. I am confident that ELISA contributors
could support your development (clean-up) activities if information
on encountered but unresolved warnings is shared.
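
One way to collect such information is to inventory the warning sites found in saved kernel logs. The sketch below is only an illustration: it assumes the usual "WARNING: CPU: <n> PID: <n> at <file>:<line> ..." header line that the kernel prints for a warning, and the sample log lines and file paths in it are made up.

```python
import re
from collections import Counter

# Assumed header format of a kernel warning in the log:
#   WARNING: CPU: <n> PID: <n> at <file>:<line> <symbol>+<off>/<size>
WARN_RE = re.compile(r"WARNING: CPU: \d+ PID: \d+ at (?P<loc>\S+:\d+)")


def warning_locations(log_text: str) -> Counter:
    """Count kernel warnings per source location found in log_text."""
    return Counter(m.group("loc") for m in WARN_RE.finditer(log_text))


if __name__ == "__main__":
    # Made-up sample lines; real input would be saved dmesg or journal output.
    sample = (
        "[   12.000001] WARNING: CPU: 0 PID: 1 at drivers/foo/bar.c:42 bar_init+0x10/0x20\n"
        "[   13.000002] usb 1-1: new device found\n"
        "[   14.000003] WARNING: CPU: 1 PID: 77 at drivers/foo/bar.c:42 bar_init+0x10/0x20\n"
        "[   15.000004] WARNING: CPU: 0 PID: 9 at net/baz/qux.c:7 qux_rx+0x4/0x8\n"
    )
    for location, count in warning_locations(sample).most_common():
        print(f"{count:3d}  {location}")
```

With panic_on_warn enabled, each boot ends at the first warning, so such an inventory would have to be gathered across boots or on a test system with the knob turned off.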


Lukas


elana.copperman@...
 

Thanks, Shuah, for sharing this important information.
From my experience (and we should hear from others as well!), panic_on_warn as a hard rule is too restrictive. For example, if an autonomous car is driving at 200 km/h on a German highway, a kernel panic triggered by a warning could be life-threatening to the car's passengers. What is necessary is appropriate handling for panic_on_warn that enables the integrator to define the follow-up behavior: for example, switch to degraded functionality, switch to a fault-handling application, or panic when relevant. Even killing the specific threads causing the warning should not be the only option. It would then be the integrator's responsibility to configure the appropriate behavior per use case.
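
The integrator-defined follow-up behavior described above could be pictured as a policy table. The sketch below is purely hypothetical: no such kernel interface exists or is implied, and the category names, the `Response` values and `choose_response` are invented for illustration.

```python
from enum import Enum, auto
from typing import Dict


class Response(Enum):
    """Hypothetical follow-up actions an integrator might configure."""
    DEGRADE = auto()        # switch to degraded functionality
    FAULT_HANDLER = auto()  # hand over to a fault-handling application
    PANIC = auto()          # take the system down


def choose_response(
    category: str,
    policy: Dict[str, Response],
    default: Response = Response.PANIC,
) -> Response:
    """Look up the configured response; unlisted categories get the default."""
    return policy.get(category, default)


if __name__ == "__main__":
    # Hypothetical per-use-case policy; the category names are made up.
    policy = {
        "infotainment": Response.DEGRADE,
        "sensor-fusion": Response.FAULT_HANDLER,
    }
    for category in ("infotainment", "sensor-fusion", "memory-management"):
        print(category, "->", choose_response(category, policy).name)
```

Defaulting to `PANIC` for anything not explicitly listed is a design choice of this sketch: it keeps today's panic_on_warn behavior unless the integrator has consciously decided otherwise.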
I will join the thread when it will be publicly available.
Regards
Elana



Lukas Bulwahn
 

On Sun, Nov 21, 2021 at 9:44 AM Elana Copperman
<Elana.Copperman@...> wrote:

> Thanks, Shuah, for sharing this important information.
> From my experience (and we should hear from others as well!), panic_on_warn as a hard rule is too restrictive. For example, if an autonomous car is driving at 200 km/h on a German highway, a kernel panic triggered by a warning could be life-threatening to the car's passengers. What is necessary is appropriate handling for panic_on_warn that enables the integrator to define the follow-up behavior: for example, switch to degraded functionality, switch to a fault-handling application, or panic when relevant. Even killing the specific threads causing the warning should not be the only option. It would then be the integrator's responsibility to configure the appropriate behavior per use case.

Sorry, Elana, the argument you are presenting above---with the
autonomous car---is very vivid, but hardly reflects reality. What you
are writing above suggests that there is no surrounding system and no
system engineering in place that ensures degradation and passenger
safety within a system that consists of multiple ECUs in a vehicle
network.

Of course, it is possible to design a system like the one you are
sketching above, where a single warning leads to a life-threatening
event. But then kernel warnings are not really the problem; the
problem is that the system's safety design is so weak that the
integrator's business is at risk if this system is distributed in
large numbers to others.

> I will join the thread when it will be publicly available.

There is some misconception about the discussion on the linux-kernel
mailing list: the thread is already public. The discussion has
actually already moved on; I think the arguments against the suggested
pkill_on_warn were overwhelming, and alternatives are now being
discussed.

Only the LWN.net article, which summarizes the discussion, becomes
available to the wider public a week after publication. Of course,
anyone who has a relevant stake in the overall kernel development has
an LWN.net subscription---which really does not cost much---to
understand and follow closely what is happening.


The argument above did not convince me: I still think that, with the
current policies on when a warning is emitted in the kernel,
panic_on_warn is the only reasonable option.

I would of course support anyone who goes through all warnings in the
kernel and tries to identify exactly which operations are still
functional after each warning, e.g., which system calls would still
work, which processes might still be functional, which fault
operations may still work. That is a very complex code investigation
for quite little benefit compared to following a "panic_on_warn"
behaviour, but it certainly can be done and is worth presenting if
somebody does it in an informed, structured and systematic way.

I suggest somebody describe all activities required, and estimate the
complexity of those activities, to build a fail-operational system with
Linux on modern hardware in a single-channel system. Then, one might
have a convincing argument to do some refined handling of warnings, or
to just make all the kernel functions fail-operational by modifying
their failure behavior to not emit a warning at all.

Good luck.

Lukas


elana.copperman@...
 

Lukas, please read carefully and align with what I wrote.  
In any case, we should park this thread, and move discussion to the LWN thread.
Thanks
Elana


From: Lukas Bulwahn <lukas.bulwahn@...>
Sent: Monday, November 22, 2021 10:58 AM
To: Elana Copperman <Elana.Copperman@...>
Cc: Shuah Khan <skhan@...>; devel@... <devel@...>
Subject: Re: [ELISA Technical Community] What to do in response to a kernel warning
 

On Sun, Nov 21, 2021 at 9:44 AM Elana Copperman
<Elana.Copperman@...> wrote:
>
> Thanks, Shuah, for sharing this important information.
> From my experience (and we should hear from others as well!), panic_on_warn as a hard rule is too restrictive.  For example, if an autonomous car is driving at 200 KM per hour on a German highway, the kernel panic from a warning could be life threatening to the car passengers.  What is necessary is appropriate handling for panic_on_warn, to enable the integrator to define follow up behavior:  For example, switch to degraded functionality, or switch to a fault handling application, or panic when relevant.  Even killing the specific threads causing the warning should not be the only option.  It would then be the integrator's responsibility to configure the appropriate behavior per use case.


Sorry, Elana, this argument you are presenting above---with the
autonomous car---is very pictorial, but hardly meets reality. What you
are writing above suggests that there is no surrounding system and no
system engineering in place that ensures degradation and passenger
safety within the system that consists of multiple ECUs in a vehicle
network.
>> That is not at all what was prescribed above.
>> panic_on_warn leads to kernel panic on warning.  And that is too restrictive, as explained above.  If the kernel panics, it will override any system engineering and system features which have been defined.
>> Fault handling for warnings needs to be more specific (i.e., not panic).
>> And as I wrote above, you need some fault handling application or other system support, switching to degraded functionality or a fault handling application.

Of course, it is possible to design such a system you are sketching
above, where a single warning leads to a life-threatening event, 
>> NO, as already stated - kernel panic on warning is the wrong behavior.
>> Correct system behavior is outlined above, in original email thread.

...
The discussion actually
has already moved on; I think the arguments against the suggested
pkill_on_warn were overwhelming and now alternatives are discussed.
>> Good.
>> OK, this discussion should continue in that context.  Thanks


Lukas Bulwahn
 

On Mon, Nov 22, 2021 at 10:19 AM Elana Copperman
<Elana.Copperman@...> wrote:

> Lukas, please read carefully and align with what I wrote.
> In any case, we should park this thread, and move discussion to the LWN thread.

Good luck.

Lukas


Jochen Kall
 

Hi Elana,

I am not sure what you are referring to, but in any case I think Lukas is right; I also don't see how the example supports your position.

If availability is safety-critical, a system without redundancy (Linux-based or not) is not a feasible design pattern anyway, and if we are talking about a fail-safe system, shutting off cleanly when in doubt is the way to go from a safety perspective.

Best regards

Jochen

 



--

Kind regards
Jochen Kall

--
Dr. rer. nat. Jochen Kall
Functional Safety

ITK Engineering GmbH
Im Speyerer Tal 6
76761 Rülzheim

Tel.: +49 7272 7703-546
Fax: +49 7272 7703-100
Mobile: +491734957776

mailto:jochen.kall@...

______________________________________________________________

ITK Engineering GmbH | Im Speyerer Tal 6 | 76761 Rülzheim
Tel.: +49 7272 7703-0 | Fax: +49 7272 7703-100
mailto:info@... | http://www.itk-engineering.de

Chairman of the Supervisory Board: Dr. Rudolf Maier
Executive Board: Michael Englert (Chairman), Bernd Gohlicke
Registered Office: 76761 Rülzheim
Registered Court: Amtsgericht Landau, HRB 32046
VAT-ID-No. DE 813165046


elana.copperman@...
 

OK, if this is the focus, then we can/should continue here, and not simply annoy LWN.
Jochen, redundancy will not help here. If all systems enable panic_on_warn, the redundant systems will all quickly fail.
And availability is very much a concern, not only for automotive AV, but also for medical devices and many other safety-critical systems.
panic_on_warn is simply too restrictive and is not "shutting off cleanly"; it is pulling the plug on a running device.

What is needed is a well-defined fault handling mechanism which handles faults appropriately, in a more specific way.
Shutting down a safety-critical device that requires high availability, for any warning whatsoever, is not an option.
As a rule, warnings should be eliminated by sufficient testing and static analysis before the system hits the road. And if some warnings remain in the deployed device, the most practical solution is logging and a software (OTA) update.



Jochen Kall
 

Hi Elana,

 

you misunderstood my point.

What I meant is the following:

If shutting off is not safe for a system, as is apparently the case in your example, you simply cannot design it with a single controller. Rather, you need several of them, monitoring each other so that if one fails another can take over, unless you have a magical piece of hardware that can never fail to run it on.

(That consideration, by the way, is totally independent of the OS used; it applies to all of them.)

For such a system, however, continuing to operate a compromised path makes the whole thing potentially less safe.
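
The mutual-monitoring arrangement described above can be sketched as a heartbeat check. This is only an illustration under stated assumptions: the `Controller` type, `select_active`, the priority ordering and the timeout value are all hypothetical, and a real design would also have to address common-cause failures across the controllers.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Controller:
    name: str
    last_heartbeat: float  # time of the last heartbeat, in seconds


def select_active(
    controllers: Sequence[Controller], now: float, timeout: float
) -> Optional[str]:
    """Return the first controller, in priority order, with a fresh heartbeat.

    A controller that has shut down (for example after a panic) stops
    sending heartbeats and is skipped once its last heartbeat is older
    than the timeout.
    """
    for controller in controllers:
        if now - controller.last_heartbeat <= timeout:
            return controller.name
    # No controller is alive: the surrounding system must enter its safe state.
    return None


if __name__ == "__main__":
    # The primary stopped sending heartbeats at t=10.0; the backup is alive.
    controllers = [Controller("primary", 10.0), Controller("backup", 10.9)]
    print(select_active(controllers, now=11.0, timeout=0.5))
```

In this picture, a controller that stops cleanly on a fault is simply replaced by its peer, whereas a controller that keeps running on a compromised path continues to look healthy to the monitor.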

 

In a fail-safe system where shutting off leads to the safe state, the same conclusion applies.

That's why I agree with Lukas's position that it is probably not very important for safety applications to have this capability, and why I believe the example you gave does not support your position.

That of course does not mean it's a bad idea to have this capability in general (I am not qualified to judge that ^^), but in safety systems we'd probably switch it off anyway for the reasons outlined above, that's all.

 

Jochen

 



elana.copperman@...
 

Jochen, I understand your point exactly.
Even with multiple controllers, you do not solve the problem.
Let me explain in more detail:
  1. Assume you have a safety-critical system with a requirement for high availability and, as we agreed, redundant HW.
  2. panic_on_warn in all the controllers is an overreaction. That is my key point.
  3. If all the controllers are Linux-based with panic_on_warn enabled, you risk a situation where multiple controllers fail because, instead of handling the warnings, their only option is to panic and crash.
A warning indicates a "compromised path", but not necessarily a path which should lead to a panic.
panic_on_warn is the wrong way to deal with warnings in a high-availability system. You cannot ignore warnings, of course, but there must be more fine-tuned fault handling, and redundancy won't help to avoid that problem.
It is not merely "not very important for safety applications to have this capability"; it is potentially dangerous if multiple controllers fail simultaneously on warnings.
This claim is true for any safety-critical system with a requirement for high availability and graceful degradation. And this claim is not limited to a specific use case (automotive AV was only an example of such a system).



From: devel@... <devel@...> on behalf of Jochen Kall <jochen.kall@...>
Sent: Monday, November 22, 2021 6:46 PM
To: Elana Copperman <Elana.Copperman@...>; Lukas Bulwahn <lukas.bulwahn@...>
Cc: Shuah Khan <skhan@...>; devel@... <devel@...>
Subject: Re: [ELISA Technical Community] What to do in response to a kernel warning
 

Hi Elana,

 

you misunderstood my point.

What I meant is the following:

If shutting off is not safe for a system as it apparently is the case in your example, you simply can not design it with a single controller, but rather you need several of them, monitoring each other such that if one fails, another can take over, unless you have a magical piece of hardware that can never fail to run it on.

(That consideration btw is totally independent of the OS used, it applies to all of them)

For such a system however, continuing to operate a compromised path makes the whole thing potentially less safe.

 

In a fail safe system where shutting off leads to the safe state, the same conclusion applies.

That’s why I agree with Lukas position that it is probably not very important for safety applications to have this capability and why I believe the example you gave does not support your position.

That of course does not mean it’s a bad idea to have this capability in general (Not qualified to judge that ^^), but in safety systems, we’d probably switch it off anyways for the reasons outlined above, that’s all.

 

Jochen

 

Von: Elana Copperman <Elana.Copperman@...>
Gesendet: Montag, 22. November 2021 17:20
An: Jochen Kall <Jochen.Kall@...>; Lukas Bulwahn <lukas.bulwahn@...>
Cc: Shuah Khan <skhan@...>; devel@...
Betreff: Re: [ELISA Technical Community] What to do in response to a kernel warning

 

OK, if this is the focus - then we can /should continue here, and not simply annoy LWN.

Jochen, redundancy will not help here.  If all systems enable panic_on_warn, the redundant systems will all quickly fail.  

And availability is very much a concern, not only for Automotive AV, but also for medical devices and many other safety critical systems.

Panic_on_warn is simply too restrictive and is not "shutting off cleanly"; it is pulling the plug on a running device.

 

What is needed is a well-defined fault handling mechanism which handles faults appropriately, in a more specific way.

Shutting down a safety-critical device on which requires high availability for any warning, is not an option.

As a rule, warnings should be eliminated by sufficient testing and static analysis before the system hits the road.  And if some warning(s) remains in the deployed device, the most practical solution is logging and software (OTA) update.  

 


From: Jochen Kall <Jochen.Kall@...>
Sent: Monday, November 22, 2021 4:03 PM
To: Elana Copperman <Elana.Copperman@...>; Lukas Bulwahn <lukas.bulwahn@...>
Cc: Shuah Khan <skhan@...>; devel@... <devel@...>
Subject: AW: [ELISA Technical Community] What to do in response to a kernel warning

 

Hi Elana,

 

not sure what you refer to, but in any case, I think Lukas is right; I also don't see how the example supports your position.

If availability is safety-critical, a system without redundancy (Linux-based or not) is not a feasible design pattern anyway, and if we are talking about a fail-safe system, shutting off cleanly when in doubt is the way to go from a safety perspective.

 

Best regards

Jochen

 

From: devel@... <devel@...> On behalf of elana.copperman@...
Sent: Monday, 22 November 2021 10:20
To: Lukas Bulwahn <lukas.bulwahn@...>
Cc: Shuah Khan <skhan@...>; devel@...
Subject: Re: [ELISA Technical Community] What to do in response to a kernel warning

 

Lukas, please read carefully and align with what I wrote.  

In any case, we should park this thread, and move discussion to the LWN thread.

Thanks

Elana

 


From: Lukas Bulwahn <lukas.bulwahn@...>
Sent: Monday, November 22, 2021 10:58 AM
To: Elana Copperman <Elana.Copperman@...>
Cc: Shuah Khan <skhan@...>; devel@... <devel@...>
Subject: Re: [ELISA Technical Community] What to do in response to a kernel warning

 

On Sun, Nov 21, 2021 at 9:44 AM Elana Copperman
<Elana.Copperman@...> wrote:
>
> Thanks, Shuah, for sharing this important information.
> From my experience (and we should hear from others as well!), panic_on_warn as a hard rule is too restrictive. For example, if an autonomous car is driving at 200 km/h on a German highway, a kernel panic from a warning could be life-threatening to the passengers. What is necessary is appropriate handling for panic_on_warn that enables the integrator to define follow-up behavior: for example, switch to degraded functionality, switch to a fault-handling application, or panic when relevant. Even killing the specific threads causing the warning should not be the only option. It would then be the integrator's responsibility to configure the appropriate behavior per use case.
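
The integrator-defined follow-up behavior proposed in the paragraph above could be pictured as a policy table. The following Python sketch is purely illustrative: it does not correspond to any existing kernel interface, and the `Action` names, the warning sources in `POLICY` and the `action_for` helper are all hypothetical.

```python
from enum import Enum


class Action(Enum):
    LOG_ONLY = "log_only"
    DEGRADE = "degrade"              # switch to degraded functionality
    FAULT_HANDLER = "fault_handler"  # hand over to a fault-handling application
    PANIC = "panic"


# Hypothetical integrator-defined policy: warning source -> follow-up action.
POLICY = {
    "hibernate": Action.PANIC,
    "camera_driver": Action.DEGRADE,
    "diagnostics": Action.LOG_ONLY,
}


def action_for(source, policy=POLICY, default=Action.PANIC):
    """Look up the action for a warning source; unknown sources get the default."""
    return policy.get(source, default)


print(action_for("camera_driver"))   # Action.DEGRADE
print(action_for("unknown_source"))  # Action.PANIC
```

The default of `Action.PANIC` for unlisted sources is a design choice made for this sketch: anything the integrator has not explicitly analyzed falls back to the most conservative reaction.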


Sorry, Elana, this argument you are presenting above---with the
autonomous car---is very pictorial, but hardly meets reality. What you
are writing above suggests that there is no surrounding system and no
system engineering in place that ensures degradation and passenger
safety within the system that consists of multiple ECUs in a vehicle
network.

>> That is not at all what was prescribed above.

>> panic_on_warn leads to kernel panic on warning.  And that is too restrictive, as explained above.  If the kernel panics, it will override any system engineering and system features which have been defined.

>> Fault handling for warnings needs to be more specific (i.e., not panic).

>> And as I wrote above, you need some fault handling application or other system support, switching to degraded functionality or a fault handling application.


Of course, it is possible to design such a system you are sketching
above, where a single warning leads to a life-threatening event, 

>> NO, as already stated - kernel panic on warning is the wrong behavior.

>> Correct system behavior is outlined above, in original email thread.

 

...

The discussion actually
has already moved on; I think the arguments against the suggested
pkill_on_warn were overwhelming and now alternatives are discussed.
>> Good.

>> OK, this discussion should continue in that context.  Thanks


--

Best regards
Jochen Kall

 

--
Dr. rer. nat. Jochen Kall

Functional Safety

 

ITK Engineering GmbH
Im Speyerer Tal 6
76761 Rülzheim

 

Tel.: +49 7272 7703-546
Fax: +49 7272 7703-100

Mobil:+491734957776

 

mailto:jochen.kall@...

 

______________________________________________________________

 

ITK Engineering GmbH | Im Speyerer Tal 6 | 76761 Rülzheim

Tel.: +49 7272 7703-0 | Fax: +49 7272 7703-100

mailto:info@... | http://www.itk-engineering.de

 

Vorsitzender des Aufsichtsrats/Chairman of the Supervisory Board:

Dr. Rudolf Maier

Geschäftsführung/Executive Board:

Michael Englert (Vorsitzender/Chairman), Bernd Gohlicke

Sitz der Gesellschaft/Registered Office: 76761 Rülzheim

Registergericht/Registered Court: Amtsgericht Landau, HRB 32046

USt.-ID-Nr./VAT-ID-No. DE 813165046


Sudip Mukherjee
 

iiuc, a WARN_ON() is used by the kernel when it sees something
it does not expect, so the system is in an unknown state.

For example, the WARN_ON() at
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/hibernate.c?h=v5.16-rc2#n106
gives a warning when the system is about to hibernate but sees that
the number of CPUs is not equal to 1, which means more than one CPU is
still active and the system is thus in an undefined state.

Would you want to continue running the system when the CPU states are
undefined, or would you want the system to panic so that the
supervisory system can then bring it back to a known good state?
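
Whether a WARN_ON() like this one leads to a panic depends on the panic_on_warn sysctl discussed in this thread. As a minimal sketch, the current setting can be checked from user space, assuming the usual procfs location `/proc/sys/kernel/panic_on_warn`; the helper name `panic_on_warn_enabled` is made up for this example.

```python
from pathlib import Path


def panic_on_warn_enabled(path="/proc/sys/kernel/panic_on_warn"):
    """Return True if the sysctl is non-zero, False if it is zero,
    and None if the file cannot be read or parsed (e.g. not on Linux)."""
    try:
        text = Path(path).read_text().strip()
    except OSError:
        return None
    try:
        return int(text) != 0
    except ValueError:
        return None


# Prints True, False or None depending on the machine it runs on.
print(panic_on_warn_enabled())
```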


--
Regards
Sudip

On 22/11/2021 7:36 pm, elana.copperman@... wrote:
Jochen, I understand your point exactly.
Even with multiple controllers, you do not solve the problem.
Let me explain in more detail:

1. Assume you have a safety-critical system with a requirement for high
availability and, as we agreed below, redundant HW.
2. panic_on_warn in all the controllers is an overreaction. And that is
my key point.
3. If all the controllers are Linux-based, with panic_on_warn, you risk
a situation where multiple controllers fail because, instead of
handling the warnings, their only option is to panic and crash.

A warning is a "compromised path" but not necessarily a path which
should lead to panic.
panic_on_warn is the wrong way to deal with warnings in a
high-availability system. You cannot ignore warnings, of course, but
there must be more fine-tuned fault handling, and redundancy won't help
to avoid that problem.
It is not only "not very important for safety applications to have this
capability"; it is potentially dangerous if multiple controllers fail
simultaneously on warnings.
This claim is true for any safety-critical system with a requirement for
high availability and graceful degradation. And this claim is not
limited to a specific use case (automotive AV was only an example of
such a system).
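
The concern in point 3 is a common-cause failure. The toy Python model below contrasts it with an independent fault, under the simplifying assumption that a controller either panics on a warning or keeps running; the `surviving` helper and the controller names are hypothetical.

```python
def surviving(controllers, warning_hits, panic_on_warn):
    """Return the controllers still running after the warnings fired.

    warning_hits is the set of controller names on which a warning fired.
    With panic_on_warn set, every controller that hit a warning goes down.
    """
    if not panic_on_warn:
        return list(controllers)
    return [c for c in controllers if c not in warning_hits]


controllers = ["A", "B", "C"]

# Common cause: the same trigger fires the warning on every controller.
print(surviving(controllers, {"A", "B", "C"}, panic_on_warn=True))  # []

# Independent fault: only one controller hits the warning.
print(surviving(controllers, {"A"}, panic_on_warn=True))  # ['B', 'C']
```

"Still running" here only means "has not panicked". The model says nothing about whether a controller that continues after a warning can still be trusted, which is the objection raised elsewhere in this thread.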


elana.copperman@...
 

Sudip, I agree that this is certainly the appropriate reaction in some specific instances.
The argument is that panicking on every warning is an overreaction. It is an academic exercise which will be rejected by any reasonable system architect.


elana.copperman@...
 

Sorry for hitting enter too quickly.
Now here is a challenge: how do we define safe and practical requirements for a "safe_panic_on_warn"? That is the type of resolution that may provide the balance I am looking for.

-----Original Message-----
From: devel@... <devel@...> On Behalf Of Elana Copperman
Sent: Monday, November 22, 2021 10:47 PM
To: Sudip Mukherjee <sudip.mukherjee@...>; Jochen Kall <jochen.kall@...>; Lukas Bulwahn <lukas.bulwahn@...>
Cc: Shuah Khan <skhan@...>; devel@...
Subject: Re: [ELISA Technical Community] What to do in response to a kernel warning

Sudip, I agree that this is certainly the appropriate reaction in some specific instances.
The argument is that panic on every warning is over reaction. It is an academic exercise which will be rejected by any reasonable system architect.

-----Original Message-----
From: Sudip Mukherjee <sudip.mukherjee@...>
Sent: Monday, November 22, 2021 10:29 PM
To: Elana Copperman <Elana.Copperman@...>; Jochen Kall <jochen.kall@...>; Lukas Bulwahn <lukas.bulwahn@...>
Cc: Shuah Khan <skhan@...>; devel@...
Subject: Re: [ELISA Technical Community] What to do in response to a kernel warning

EXTERNAL EMAIL: Do not click any links or open any attachments unless you trust the sender and know the content is safe.

iiuc, a WARN_ON() will be used by the kernel when it sees something which it is not expecting and so the system is in an unknown state.

For example, the WARN_ON() at
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/hibernate.c?h=v5.16-rc2#n106
which will give a warning when the system is about to hibernate but it sees the number of CPU is not equal to 1 which means more than 1 CPU is still active and thus is an undefined state.

Will you want to continue running the system when the CPU states are in an undefined state or will you want the system to panic so that the supervisory system can then bring the system back to a known good state?


--
Regards
Sudip


On 22/11/2021 7:36 pm, elana.copperman@... wrote:
Jochen, I understand your point exactly.
Even with multiple controllers, you do not solve the problem. Let me
explain in more detail:

1. Assume you have a safety critical system with requirement for high
availability.  And as we agreed below, including redundant HW. 2.
panic_on_warn in all the controllers is over reaction.  And that is
my key point.
3. If all the controllers are Linux based, with panic_on_warn, you risk
a situation where multiple controllers will fail, because instead of
handling the warnings - their only option is to panic and crash.

A warning is a "compromised path" but not necessarily a path which
should lead to panic.
panic_on_warn is the wrong way to deal with warnings in a
high-availability system.   You cannot ignore warnings, of course; but
there must be more fine-tuned fault handling - and redundancy won't
help to avoid that problem.
It is not only "not very important for safety applications to have
this capability", it is potentially dangerous if multiple controllers
fail simultaneously on warnings.
This claim is true for any safety critical system with requirement for
high availability and graceful degradation.  And this claim is not
limited to a specific use case (automotive AV was only an example of
such a system).


----------------------------------------------------------------------
--
*From:* devel@... <devel@...> on behalf of
Jochen Kall <jochen.kall@...>
*Sent:* Monday, November 22, 2021 6:46 PM
*To:* Elana Copperman <Elana.Copperman@...>; Lukas Bulwahn
<lukas.bulwahn@...>
*Cc:* Shuah Khan <skhan@...>; devel@...
<devel@...>
*Subject:* Re: [ELISA Technical Community] What to do in response to a
kernel warning
 

Hi Elana,

 

you misunderstood my point.

What I meant is the following:

If shutting off is not safe for a system as it apparently is the case
in your example, you simply can not design it with a single
controller, but rather you need several of them, monitoring each other
such that if one fails, another can take over, unless you have a
magical piece of hardware that can never fail to run it on.

(That consideration btw is totally independent of the OS used, it
applies to all of them)

For such a system however, continuing to operate a compromised path
makes the whole thing potentially less safe.

 

In a fail safe system where shutting off leads to the safe state, the
same conclusion applies.

That's why I agree with Lukas position that it is probably not very
important for safety applications to have this capability and why I
believe the example you gave does not support your position.

That of course does not mean it's a bad idea to have this capability
in general (Not qualified to judge that ^^), but in safety systems,
we'd probably switch it off anyways for the reasons outlined above, that's all.

 

Jochen

* *

*Von:* Elana Copperman <Elana.Copperman@...>
*Gesendet:* Montag, 22. November 2021 17:20
*An:* Jochen Kall <Jochen.Kall@...>; Lukas Bulwahn
<lukas.bulwahn@...>
*Cc:* Shuah Khan <skhan@...>; devel@...
*Betreff:* Re: [ELISA Technical Community] What to do in response to a
kernel warning

 

OK, if this is the focus - then we can/should continue here, and not
simply annoy LWN.

Jochen, redundancy will not help here.  If all systems enable
panic_on_warn, the redundant systems will all quickly fail.

And availability is very much a concern, not only for Automotive AV,
but also for medical devices and many other safety-critical systems.

panic_on_warn is simply too restrictive and is not "shutting off
cleanly"; it is pulling the plug on a running device.

What is needed is a well-defined fault handling mechanism which
handles faults appropriately, in a more specific way.

Shutting down a safety-critical device which requires high
availability, for any warning, is not an option.

As a rule, warnings should be eliminated by sufficient testing and
static analysis before the system hits the road.  And if some
warning(s) remain in the deployed device, the most practical solution
is logging and a software (OTA) update.
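
As a rough illustration of the logging half of that suggestion, the sketch below scans kernel log text for warning headers so they can be collected and reported. It assumes the `WARNING: CPU: ... PID: ... at <file>:<line>` header that Linux kernels commonly print for warnings; the regular expression, the function name `find_warnings`, and the sample log line are illustrative and not taken from this thread.

```python
import re

# Assumed header format of a kernel warning line, e.g.
# "WARNING: CPU: 2 PID: 417 at drivers/foo/bar.c:88 bar_init+0x1c/0x40"
WARN_HEADER = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at (?P<file>\S+):(?P<line>\d+)"
)


def find_warnings(log_text):
    """Return one dict per warning header found in the given log text."""
    found = []
    for entry in log_text.splitlines():
        match = WARN_HEADER.search(entry)
        if match:
            found.append({
                "cpu": int(match["cpu"]),
                "pid": int(match["pid"]),
                "file": match["file"],
                "line": int(match["line"]),
            })
    return found
```

A deployed device could feed such records into whatever reporting channel the integrator already has; how a warning is then handled remains the open question of this thread.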

 

----------------------------------------------------------------------

*From:* Jochen Kall <Jochen.Kall@...>
*Sent:* Monday, November 22, 2021 4:03 PM
*To:* Elana Copperman <Elana.Copperman@...>; Lukas Bulwahn
<lukas.bulwahn@...>
*Cc:* Shuah Khan <skhan@...>; devel@...
*Subject:* Re: [ELISA Technical Community] What to do in response to a
kernel warning

 

Hi Elana,

 

not sure what you refer to, but in any case, I think Lukas is right; I
also don't see how the example supports your position.

If availability is safety-critical, a system without redundancy
(Linux-based or not) is not a feasible design pattern anyway, and if
we are talking about a fail-safe system, shutting off cleanly when in
doubt is the way to go from a safety perspective.

 

Best regards

Jochen

 

*From:* devel@... <devel@...> *On behalf of*
elana.copperman@...
*Sent:* Monday, November 22, 2021 10:20
*To:* Lukas Bulwahn <lukas.bulwahn@...>
*Cc:* Shuah Khan <skhan@...>; devel@...
*Subject:* Re: [ELISA Technical Community] What to do in response to a
kernel warning

 

Lukas, please read carefully and align with what I wrote.

In any case, we should park this thread, and move discussion to the
LWN thread.

Thanks

Elana

 

----------------------------------------------------------------------

*From:* Lukas Bulwahn <lukas.bulwahn@...>
*Sent:* Monday, November 22, 2021 10:58 AM
*To:* Elana Copperman <Elana.Copperman@...>
*Cc:* Shuah Khan <skhan@...>; devel@...
*Subject:* Re: [ELISA Technical Community] What to do in response to a
kernel warning

 

On Sun, Nov 21, 2021 at 9:44 AM Elana Copperman
<Elana.Copperman@...> wrote:

> Thanks, Shuah, for sharing this important information.
> From my experience (and we should hear from others as well!),
> panic_on_warn as a hard rule is too restrictive.  For example, if an
> autonomous car is driving at 200 km per hour on a German highway, the
> kernel panic from a warning could be life-threatening to the car
> passengers.  What is necessary is appropriate handling for
> panic_on_warn, to enable the integrator to define follow-up behavior:
> for example, switch to degraded functionality, or switch to a fault
> handling application, or panic when relevant.  Even killing the
> specific threads causing the warning should not be the only option.
> It would then be the integrator's responsibility to configure the
> appropriate behavior per use case.

Sorry, Elana, this argument you are presenting above---with the
autonomous car---is very pictorial, but hardly meets reality. What you
are writing above suggests that there is no surrounding system and no
system engineering in place that ensures degradation and passenger
safety within the system that consists of multiple ECUs in a vehicle
network.

>> That is not at all what was prescribed above.
>> panic_on_warn leads to kernel panic on warning.  And that is too
>> restrictive, as explained above.  If the kernel panics, it will
>> override any system engineering and system features which have been
>> defined.
>>
>> Fault handling for warnings needs to be more specific (i.e., not panic).
>> And as I wrote above, you need some fault handling application or
>> other system support, switching to degraded functionality or a fault
>> handling application.

Of course, it is possible to design such a system you are sketching
above, where a single warning leads to a life-threatening event,

>> NO, as already stated - kernel panic on warning is the wrong behavior.
>> Correct system behavior is outlined above, in original email thread.

...

The discussion actually
has already moved on; I think the arguments against the suggested
pkill_on_warn were overwhelming and now alternatives are discussed.

>> Good.
>> OK, this discussion should continue in that context.  Thanks

--

Kind regards
Jochen Kall

--
Dr. rer. nat. Jochen Kall
Functional Safety

ITK Engineering GmbH
Im Speyerer Tal 6
76761 Rülzheim

Tel.: +49 7272 7703-546
Fax: +49 7272 7703-100
Mobile: +491734957776

jochen.kall@...

______________________________________________________________

ITK Engineering GmbH | Im Speyerer Tal 6 | 76761 Rülzheim
Tel.: +49 7272 7703-0 | Fax: +49 7272 7703-100
info@... | http://www.itk-engineering.de

Chairman of the Supervisory Board: Dr. Rudolf Maier
Executive Board: Michael Englert (Chairman), Bernd Gohlicke
Registered Office: 76761 Rülzheim
Registered Court: Amtsgericht Landau, HRB 32046
VAT-ID-No. DE 813165046

Paul Sherwood
 

/* diving in with no idea how hot/deep the water is... */

Wouldn't best practice be just to insist on no warnings in production for safe operation? Any ignorable warnings that occur during testing should be identified and explicit action taken, as code. Any warning never seen in testing should not be ignorable.

On 2021-11-22 20:50, elana.copperman@... wrote:
Sorry for quick hit on enter.
Now here is a challenge: how do we define safe and practical
requirements for a "safe_panic_on_warn"? That is the type of
resolution which may provide the balance which I am looking for.

-----Original Message-----
From: devel@... <devel@...> On Behalf Of
Elana Copperman
Sent: Monday, November 22, 2021 10:47 PM
To: Sudip Mukherjee <sudip.mukherjee@...>; Jochen Kall
<jochen.kall@...>; Lukas Bulwahn
<lukas.bulwahn@...>
Cc: Shuah Khan <skhan@...>; devel@...
Subject: Re: [ELISA Technical Community] What to do in response to a
kernel warning

Sudip, I agree that this is certainly the appropriate reaction in some
specific instances.
The argument is that panic on every warning is an overreaction. It is
an academic exercise which will be rejected by any reasonable system
architect.

-----Original Message-----
From: Sudip Mukherjee <sudip.mukherjee@...>
Sent: Monday, November 22, 2021 10:29 PM
To: Elana Copperman <Elana.Copperman@...>; Jochen Kall
<jochen.kall@...>; Lukas Bulwahn
<lukas.bulwahn@...>
Cc: Shuah Khan <skhan@...>; devel@...
Subject: Re: [ELISA Technical Community] What to do in response to a
kernel warning

iiuc, a WARN_ON() will be used by the kernel when it sees something
which it is not expecting, and so the system is in an unknown state.

For example, the WARN_ON() at
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/hibernate.c?h=v5.16-rc2#n106
will give a warning when the system is about to hibernate but sees
that the number of CPUs is not equal to 1, which means more than one CPU
is still active and thus the system is in an undefined state.

Would you want to continue running the system when the CPU states are
in an undefined state, or would you want the system to panic so that the
supervisory system can then bring the system back to a known good
state?
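
The choice posed above can be made concrete with a small userspace model. This is only a sketch in Python, not kernel code: `warn_on`, `SimulatedPanic` and `hibernate_step` are hypothetical names, and the model assumes only what the thread describes, namely that a warning is recorded when an unexpected condition is seen and that a panic_on_warn-style switch turns that warning into a stop.

```python
class SimulatedPanic(Exception):
    """Stands in for a kernel panic in this userspace model."""


def warn_on(condition, what, panic_on_warn, log):
    """Model of a WARN_ON()-style check.

    Records a warning when the condition is true, escalates to a
    simulated panic when panic_on_warn is set, and otherwise returns
    the condition so the caller can branch on it.
    """
    if condition:
        log.append(f"WARNING: {what}")
        if panic_on_warn:
            raise SimulatedPanic(what)
    return condition


def hibernate_step(cpus_active, panic_on_warn, log):
    """Hypothetical step that expects exactly one CPU to be active."""
    if warn_on(cpus_active != 1, "more than one CPU active", panic_on_warn, log):
        # Without panic_on_warn, execution carries on in the unexpected state.
        return "continued in unexpected state"
    return "state saved"
```

With `panic_on_warn` unset the model logs the warning and keeps going; with it set the caller never sees a return value, which is the point where a supervisory system would have to take over.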
--
Regards
Sudip

On 22/11/2021 7:36 pm, elana.copperman@... wrote:
Jochen, I understand your point exactly.
Even with multiple controllers, you do not solve the problem. Let me
explain in more detail:

 1. Assume you have a safety-critical system with a requirement for high
    availability and, as we agreed below, including redundant HW.
 2. panic_on_warn in all the controllers is an overreaction.  And that is
    my key point.
 3. If all the controllers are Linux based, with panic_on_warn, you risk
    a situation where multiple controllers will fail, because instead of
    handling the warnings - their only option is to panic and crash.

A warning is a "compromised path" but not necessarily a path which
should lead to panic.
panic_on_warn is the wrong way to deal with warnings in a
high-availability system.  You cannot ignore warnings, of course; but
there must be more fine-tuned fault handling - and redundancy won't
help to avoid that problem.
It is not only "not very important for safety applications to have
this capability", it is potentially dangerous if multiple controllers
fail simultaneously on warnings.
This claim is true for any safety-critical system with a requirement for
high availability and graceful degradation.  And this claim is not
limited to a specific use case (automotive AV was only an example of
such a system).


elana.copperman@...
 

Trying once more:
Warnings MUST be managed.  No argument about that.
Panic is NOT feasible for every warning.  That is the point.  



From: Paul Sherwood <paul.sherwood@...>
Sent: Tuesday, November 23, 2021 6:58 PM
To: Elana Copperman <Elana.Copperman@...>
Cc: Sudip Mukherjee <sudip.mukherjee@...>; Jochen Kall <jochen.kall@...>; Lukas Bulwahn <lukas.bulwahn@...>; Shuah Khan <skhan@...>; devel@... <devel@...>
Subject: Re: [ELISA Technical Community] What to do in response to a kernel warning
 
/* diving in with no idea how hot/deep the water is... */

Wouldn't best practice be just to insist on no warnings in production
for safe operation? Any ignorable warnings that occur during testing
should be identified and explicit action taken, as code. Any warning
never seen in testing should not be ignorable.



Paul Sherwood
 

On 2021-11-23 18:41, elana.copperman@... wrote:
> Warnings MUST be managed. No argument about that.
> Panic is NOT feasible for every warning. That is the point.

Sorry if I'm missing it, but I think maybe we're agreeing?

Folks taking responsibility for safety could/should

- review all warnings in the code, and take steps to manage them in their use case

and/or

- perform risk analysis for all warnings discovered during testing, and put appropriate mitigations in place
- panic for all warnings not otherwise handled
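
The approach in the list above (handle the warnings that have been reviewed, panic on everything else) can be sketched as a small decision table. This is only an illustration under stated assumptions: the class names, the reaction categories, and the warning-site string are hypothetical and do not correspond to any existing kernel or ELISA interface.

```python
from enum import Enum
from typing import Dict


class Reaction(Enum):
    CONTINUE = "continue"  # reviewed and judged acceptable for this use case
    DEGRADE = "degrade"    # mitigation: switch to degraded functionality
    PANIC = "panic"        # default for anything not otherwise handled


class WarningPolicy:
    """Maps reviewed warning sites to a reaction; unknown sites panic."""

    def __init__(self, reviewed: Dict[str, Reaction]) -> None:
        self._reviewed = dict(reviewed)

    def react(self, site: str) -> Reaction:
        # Anything that was not explicitly reviewed falls back to PANIC.
        return self._reviewed.get(site, Reaction.PANIC)


# Hypothetical entry: one warning site that a risk analysis has covered.
policy = WarningPolicy({"drivers/example/foo.c:123": Reaction.DEGRADE})
print(policy.react("drivers/example/foo.c:123").value)
print(policy.react("kernel/unreviewed.c:42").value)
```

The point of the sketch is the default: a warning is only treated as something other than fatal once somebody has reviewed it and recorded a decision.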

br
Paul


elana.copperman@...
 

1000% agree with every word you wrote.
Panic_on_warn is just not a single solution for all warnings.
The point is that we need some tweaking on panic_on_warn in order to have a solution which is derived from safety requirements, feasible, and a true contribution to Linux kernel safety.



Luis Chamberlain
 

On Tue, Nov 23, 2021 at 07:50:36PM +0000, elana.copperman@... wrote:
> 1000% agree with every word you wrote.
> Panic_on_warn is just not a single solution for all warnings.
> The point is that we need some tweaking on panic_on_warn in order to have a solution which is derived from safety requirements, feasible, and a true contribution to Linux kernel safety.

The patch that was actually merged in place of the "panic on warn"
proposal was "panic: use error_report_end tracepoint on warnings":

https://lore.kernel.org/all/20211115085630.1756817-1-elver@google.com/T/#u

And so you can monitor the error_report event tracepoint and look for
warnings. That should make it easier to monitor for kernel warnings
in userspace without having to poll the kernel logs.
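
As a rough illustration of what such userspace monitoring could look like, here is a minimal Python sketch. It assumes that tracefs is mounted at /sys/kernel/tracing, that the event is exposed as events/error_report/error_report_end, and that each event is printed as "error_report_end: [<detector>] <hex id>" with the detector name "warning" for kernel warnings. These details should be checked against the running kernel; the function names are made up for this example.

```python
import re
from typing import Iterable, Iterator, Optional

# Assumed tracefs location; some systems use /sys/kernel/debug/tracing instead.
TRACEFS = "/sys/kernel/tracing"
EVENT_ENABLE = TRACEFS + "/events/error_report/error_report_end/enable"
TRACE_PIPE = TRACEFS + "/trace_pipe"

# Assumed event text: "error_report_end: [<detector>] <id in hex>".
_EVENT_RE = re.compile(r"error_report_end: \[(\w+)\] ([0-9a-fA-F]+)")


def parse_warning(line: str) -> Optional[str]:
    """Return the hex id of a warning report, or None for any other line."""
    match = _EVENT_RE.search(line)
    if match and match.group(1) == "warning":
        return match.group(2)
    return None


def warnings_in(lines: Iterable[str]) -> Iterator[str]:
    """Yield the id of every warning report found in a stream of trace lines."""
    for line in lines:
        warning_id = parse_warning(line)
        if warning_id is not None:
            yield warning_id


def monitor() -> None:
    """Enable the event and block on trace_pipe (requires root)."""
    with open(EVENT_ENABLE, "w") as enable:
        enable.write("1")
    with open(TRACE_PIPE) as pipe:
        for warning_id in warnings_in(pipe):
            print("kernel warning reported, id", warning_id)
```

Calling monitor() as root blocks until events arrive, so there is no need to poll the kernel log; what to do once a warning is seen is left to the fault handling of the system.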

Marco, can you extend the documentation on how to use this so that
safety-critical folks can start using it?

Luis


Shuah Khan
 

On 11/23/21 12:47 PM, Paul Sherwood wrote:
> On 2021-11-23 18:41, elana.copperman@... wrote:
>> Warnings MUST be managed.  No argument about that.
>> Panic is NOT feasible for every warning.  That is the point.
> Sorry if I'm missing it, but I think maybe we're agreeing?
>
> Folks taking responsibility for safety could/should
>
> - review all warnings in the code, and take steps to manage them in their use case

If anybody is curious about the scope of this work:

git grep WARN_ON shows about 18486 usages (not counting the definition). It isn't
surprising, as WARN_ON is used as a debug mechanism by some drivers. In general,
WARN_ON use is discouraged and it is supposed to be used only in cases where there
is no other choice but to panic. However, in reality there are several usages that are just
for debug.
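
To get a feel for how these usages are distributed, a count per top-level directory can be sketched in Python as below. This is an approximation under stated assumptions: it only looks at .c and .h files, counts every WARN_ON* call-like occurrence (including macro definitions), and the function name is made up for this example, so its totals will not exactly match the git grep figure quoted above.

```python
import os
import re
from collections import Counter

# Matches WARN_ON, WARN_ON_ONCE and other WARN_ON* names followed by "(".
WARN_RE = re.compile(r"\bWARN_ON\w*\s*\(")


def count_warn_on(tree: str) -> Counter:
    """Count WARN_ON* occurrences per top-level directory of a source tree."""
    counts: Counter = Counter()
    for root, _dirs, files in os.walk(tree):
        for name in files:
            if not name.endswith((".c", ".h")):
                continue
            path = os.path.join(root, name)
            with open(path, errors="replace") as source:
                hits = len(WARN_RE.findall(source.read()))
            if hits:
                # First path component relative to the tree, e.g. "kernel".
                top = os.path.relpath(path, tree).split(os.sep)[0]
                counts[top] += hits
    return counts
```

Running count_warn_on() on a kernel checkout and calling most_common() on the result shows which subsystems carry the bulk of the review work.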

thanks,
-- Shuah


Lukas Bulwahn
 

On Tue, Nov 23, 2021 at 10:30 PM Shuah Khan <skhan@...> wrote:

> On 11/23/21 12:47 PM, Paul Sherwood wrote:
>> On 2021-11-23 18:41, elana.copperman@... wrote:
>>> Warnings MUST be managed. No argument about that.
>>> Panic is NOT feasible for every warning. That is the point.
>> Sorry if I'm missing it, but I think maybe we're agreeing?
>>
>> Folks taking responsibility for safety could/should
>>
>> - review all warnings in the code, and take steps to manage them in their use case
> If anybody is curious about the scope of this work:
>
> git grep WARN_ON shows about 18486 usages (not counting the definition). It isn't
> surprising, as WARN_ON is used as a debug mechanism by some drivers. In general,
> WARN_ON use is discouraged and it is supposed to be used only in cases where there
> is no other choice but to panic. However, in reality there are several usages that
> are just for debug.

And a big thanks to Elana for starting this; there is no need to look at
all 18486 usages at first.

It would be a good start to just look at the 1614 WARN_ON* in the
kernel/ directory.

Please either eliminate them in the code by handling them in place in
that context, or describe a reliable reaction that can be executed in
user-space and brings the kernel back into a proper operating mode
despite the state after the existing warning.

Also, https://syzkaller.appspot.com/upstream will point you to
warnings that we encountered during our kernel testing campaigns.

Good luck.

Lukas

> thanks,
> -- Shuah