Re: What to do in response to a kernel warning


Lukas Bulwahn
 

On Sun, Nov 21, 2021 at 9:44 AM Elana Copperman
<Elana.Copperman@...> wrote:

Thanks, Shuah, for sharing this important information.
From my experience (and we should hear from others as well!), panic_on_warn as a hard rule is too restrictive. For example, if an autonomous car is driving at 200 KM per hour on a German highway, the kernel panic from a warning could be life threatening to the car passengers. What is necessary is appropriate handling for panic_on_warn, to enable the integrator to define follow up behavior: For example, switch to degraded functionality, or switch to a fault handling application, or panic when relevant. Even killing the specific threads causing the warning should not be the only option. It would then be the integrator's responsibility to configure the appropriate behavior per use case.

Sorry, Elana, the argument you present above, with the autonomous
car, is very vivid, but it hardly matches reality. What you write
suggests that there is no surrounding system and no system
engineering in place to ensure degradation and passenger safety
within a system that consists of multiple ECUs in a vehicle network.

Of course, it is possible to design a system like the one you sketch
above, where a single warning leads to a life-threatening event. But
then kernel warnings are really not the problem; the problem is that
the system's safety design is so weak that the integrator's business
is at risk if this system is distributed in large numbers to others.

I will join the thread when it will be publicly available.
There is some misconception about the discussion on the linux-kernel
mailing list: the thread is already public. The discussion has
actually already moved on; I think the arguments against the suggested
pkill_on_warn were overwhelming, and alternatives are now being
discussed.

Only the LWN.net article, which summarizes the discussion, becomes
available to the wider public a week after publication. Of course,
anyone who has a relevant stake in overall kernel development has an
LWN.net subscription, which really does not cost much, in order to
understand and follow closely what is happening.


The argument above did not convince me: I still think that, with the
current policies on when a warning is emitted in the kernel,
panic_on_warn is the only reasonable option.
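
For anyone who wants to check how a given system is currently
configured, here is a minimal sketch. It assumes a Linux system that
exposes the kernel.panic_on_warn sysctl at its usual procfs path; the
helper names are made up for illustration and are not part of any
existing tool.

```python
from pathlib import Path
from typing import Optional

# Usual procfs location of the kernel.panic_on_warn sysctl (assumed present).
PANIC_ON_WARN = Path("/proc/sys/kernel/panic_on_warn")


def parse_panic_on_warn(text: str) -> bool:
    """Interpret the sysctl file contents: "0" means off, "1" means on."""
    return int(text.strip()) != 0


def read_panic_on_warn(path: Path = PANIC_ON_WARN) -> Optional[bool]:
    """Return the current setting, or None if the file cannot be read."""
    try:
        return parse_panic_on_warn(path.read_text())
    except OSError:
        # Not a Linux system, procfs not mounted, or no permission.
        return None


if __name__ == "__main__":
    state = read_panic_on_warn()
    print({True: "enabled", False: "disabled", None: "unavailable"}[state])
```

The same setting can be changed at runtime by writing 0 or 1 to that
file as root, or set at boot with the panic_on_warn kernel parameter.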

I would of course support anyone who goes through all warnings in the
kernel and tries to identify exactly which operations are still
functional after each warning, e.g., which system calls would still
work, which processes might still be functional, and which fault
operations may still work. That is a very complex code investigation
for quite little benefit compared to following a "panic_on_warn"
behaviour. Still, it certainly can be done, and it is worth presenting
if somebody does it in an informed, structured and systematic way.

I suggest that somebody describe all the activities required to build
a fail-operational system with Linux on modern hardware in a
single-channel system, and estimate the complexity of those
activities. Then one might have a convincing argument for some refined
handling of warnings, or for just making all the kernel functions
fail-operational by modifying their failure behavior so that no
warning is emitted at all.

Good luck.

Lukas
