Re: [ELISA Safety Architecture WG] [PATCH v2 0/2] Introduce the pkill_on_warn parameter

Lukas Bulwahn

On Tue, Nov 16, 2021 at 9:41 AM Petr Mladek <pmladek@...> wrote:

On Tue 2021-11-16 10:52:39, Alexander Popov wrote:
On 15.11.2021 18:51, Gabriele Paoloni wrote:
On 15/11/2021 14:59, Lukas Bulwahn wrote:
On Sat, Nov 13, 2021 at 7:14 PM Alexander Popov <alex.popov@...> wrote:
On 13.11.2021 00:26, Linus Torvalds wrote:
On Fri, Nov 12, 2021 at 10:52 AM Alexander Popov <alex.popov@...> wrote:
Killing the process that hit a kernel warning complies with the Fail-Fast
principle [1]. pkill_on_warn sysctl allows the kernel to stop the process when
the **first signs** of wrong behavior are detected.
In summary, I am not supporting pkill_on_warn. I would support the
other points I mentioned above, i.e., a good enforced policy for use
of warn() and any investigation to understand the complexity of
panic() and reducing its complexity if triggered by such an
Hi Alex

I also agree with the summary that Lukas gave here. From my experience,
safety systems are always guarded by an external flow monitor (e.g. a
watchdog) that triggers in case the safety-relevant workloads slow down
or block (for any reason); given this condition of use, a system that
goes into the panic state is always safe, since the watchdog would
trigger and drive the system automatically into a safe state.
So I also don't see a clear advantage of having pkill_on_warn();
actually, on the flip side, it seems to me that such a feature could
introduce more risk, as it kills only the threads of the process that
caused the kernel warning whereas the other processes are trusted to
run on a weaker Kernel (does killing the threads of the process that
caused the kernel warning always fix the Kernel condition that led to
the warning?)
Lukas, Gabriele, Robert,
Thanks for showing this from the safety point of view.

The part about believing in panic() functionality is amazing :)
Nothing is 100% reliable.

With my printk() maintainer hat on, the current panic() implementation
is less reliable because it tries hard to provide some debugging
information, for example, the error message, backtrace, registers,
flushing of pending messages to the console, and a crash dump.

See the panic() implementation: the reboot is done by emergency_restart().
The rest is about dumping the information.

Well, the information is important. Otherwise, it is really hard to
fix the problem.

From my experience, especially the access to consoles is not fully
safe. The reliability might improve a lot when a lockless console
is used. I guess that using non-volatile memory for the log buffer
might be even more reliable.

I am not familiar with the code under emergency_restart(). I am not
sure how reliable it is.

Yes, safety-critical systems depend on the robust ability to restart.
If I wanted to implement a super-reliable panic(), I would
use some external device that would cause a power reset when
the watched device is not responding.
Petr, that is basically the common system design taken.

The whole challenge then remains to show that:

Once panic() has been invoked, the watched device does not signal being
alive unintentionally while panic() is stuck in its shutdown
routines. That requires panic(), or another shutdown routine, to still
reliably do something that ensures the kernel routine that makes the
watched device signal no longer does so.
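
To make that requirement concrete, here is a minimal toy model in C. It is only a sketch under stated assumptions: the struct and function names (watchdog, device, device_heartbeat(), device_panic(), simulate()) are hypothetical and do not correspond to any kernel or watchdog driver API. The model assumes the panic path sets a flag before doing anything else, and that the heartbeat path checks this flag, so a panic() that hangs in its shutdown routines can no longer signal being alive and the external watchdog trips.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an external watchdog; not a kernel API. */
struct watchdog {
    unsigned timeout_ticks;   /* ticks allowed without a heartbeat */
    unsigned ticks_since_pet; /* ticks elapsed since the last heartbeat */
    bool tripped;             /* latched once the safe state is forced */
};

/* Hypothetical model of the watched device. */
struct device {
    bool panicked;            /* set as the very first step of the panic path */
};

void watchdog_init(struct watchdog *wd, unsigned timeout_ticks)
{
    assert(timeout_ticks > 0);
    wd->timeout_ticks = timeout_ticks;
    wd->ticks_since_pet = 0;
    wd->tripped = false;
}

/* Heartbeat path: refuses to signal "alive" once the panic path has started. */
bool device_heartbeat(const struct device *dev, struct watchdog *wd)
{
    if (dev->panicked)
        return false;
    if (!wd->tripped)         /* a tripped watchdog stays tripped */
        wd->ticks_since_pet = 0;
    return true;
}

/* Panic path: gate the heartbeat first; the rest may hang (assumption). */
void device_panic(struct device *dev)
{
    dev->panicked = true;
}

/* One watchdog clock tick; returns true once the safe state is forced. */
bool watchdog_tick(struct watchdog *wd)
{
    if (!wd->tripped && ++wd->ticks_since_pet >= wd->timeout_ticks)
        wd->tripped = true;
    return wd->tripped;
}

/*
 * Run up to `ticks` cycles; the device panics at cycle `panic_at` and then
 * hangs. Returns the cycle at which the watchdog tripped, or -1 if it
 * never tripped.
 */
int simulate(unsigned timeout_ticks, unsigned panic_at, unsigned ticks)
{
    struct watchdog wd;
    struct device dev = { false };

    watchdog_init(&wd, timeout_ticks);
    for (unsigned t = 0; t < ticks; t++) {
        if (t == panic_at)
            device_panic(&dev);
        device_heartbeat(&dev, &wd);
        if (watchdog_tick(&wd))
            return (int)t;
    }
    return -1;
}
```

In this model, simulate(3, 5, 20) trips the watchdog shortly after the panic, while a run that never panics keeps the watchdog quiet. The hard part in a real kernel is showing that the equivalent of the panicked check holds on every path that can signal the watchdog.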


Best Regards,

PS: I do not believe much in the pkill approach either.

It is similar to the OOM killer. And I always had to restart the
system when it was triggered.

Also, the kernel is not prepared for the situation where external
code kills a kthread. And kthreads are used by many subsystems
to handle work that has to be done asynchronously and/or in
process context. And I guess that kthreads are a non-trivial
source of WARN().
