Event: ELISA Safety-Architecture Weekly Meeting - 11/30/2021 #cal-reminder

safety-architecture@lists.elisa.tech Calendar <noreply@...>
 

Reminder: ELISA Safety-Architecture Weekly Meeting

When:
11/30/2021
1:00pm to 2:00pm
(UTC+00:00) UTC

Where:
https://zoom.us/j/95775114472?pwd=VDFzTjRjNW8yd3ZOQWVLS1ZpWFlEUT09

Organizer: myu@... myu@...

Description:
──────────

ELISA Project is inviting you to a scheduled Zoom meeting.

Join Zoom Meeting
https://zoom.us/j/95775114472?pwd=VDFzTjRjNW8yd3ZOQWVLS1ZpWFlEUT09

Meeting ID: 957 7511 4472
Passcode: 289297
One tap mobile
+16465588656,,95775114472#,,,,,,0#,,289297# US (New York)
+13017158592,,95775114472#,,,,,,0#,,289297# US (Germantown)

Dial by your location
+1 646 558 8656 US (New York)
+1 301 715 8592 US (Germantown)
+1 312 626 6799 US (Chicago)
+1 669 900 6833 US (San Jose)
+1 253 215 8782 US (Tacoma)
+1 346 248 7799 US (Houston)
855 880 1246 US Toll-free
877 369 0926 US Toll-free
+1 587 328 1099 Canada
+1 647 374 4685 Canada
+1 647 558 0588 Canada
+1 778 907 2071 Canada
+1 204 272 7920 Canada
+1 438 809 7799 Canada
855 703 8985 Canada Toll-free
Find your local number: https://zoom.us/u/aQIrrAQlD


ww47 Agenda

Gabriele Paoloni
 

Hi All

Since we spent last week's session discussing the goals for the next quarter, today we'll continue the discussions on the Kernel FFI (Eliphaz) and on the integrity of the telltale safety app process (Gab).

Thanks
Gab


[ELISA Workshop] Dynamic Memory Allocation in Safety Related Context

Raffaele Giannessi
 

Hi WG participants,
after our talk on dynamic memory allocation at the last ELISA workshop, we prepared feedback on the comments received in the chat panel.
Please find attached the answers to each question, which may also be used as a roadmap for our next contributions.
Thanks for your interest.

Raffaele Giannessi
Industrial PhD Evidence srl


Event: ELISA Safety-Architecture Weekly Meeting - 11/23/2021 #cal-reminder

safety-architecture@lists.elisa.tech Calendar <noreply@...>
 

Reminder: ELISA Safety-Architecture Weekly Meeting

When:
11/23/2021
1:00pm to 2:00pm
(UTC+00:00) UTC

Where:
https://zoom.us/j/95775114472?pwd=VDFzTjRjNW8yd3ZOQWVLS1ZpWFlEUT09

Organizer: myu@... myu@...


ww46 Agenda

Gabriele Paoloni
 

Hi All

For today I'd like to discuss the following topics:
- where we are with respect to the next quarter's goals and the proposed activity planning
- continuation of Kernel FFI
- continuation of Process Address Space protection

Thanks
Gab


Re: [PATCH v2 0/2] Introduce the pkill_on_warn parameter

Lukas Bulwahn
 

On Tue, Nov 16, 2021 at 9:41 AM Petr Mladek <pmladek@...> wrote:

> On Tue 2021-11-16 10:52:39, Alexander Popov wrote:
> > On 15.11.2021 18:51, Gabriele Paoloni wrote:
> > > On 15/11/2021 14:59, Lukas Bulwahn wrote:
> > > > On Sat, Nov 13, 2021 at 7:14 PM Alexander Popov <alex.popov@...> wrote:
> > > > > On 13.11.2021 00:26, Linus Torvalds wrote:
> > > > > > On Fri, Nov 12, 2021 at 10:52 AM Alexander Popov <alex.popov@...> wrote:
> > > > > Killing the process that hit a kernel warning complies with the Fail-Fast
> > > > > principle [1]. pkill_on_warn sysctl allows the kernel to stop the process when
> > > > > the **first signs** of wrong behavior are detected.
> > > >
> > > > In summary, I am not supporting pkill_on_warn. I would support the
> > > > other points I mentioned above, i.e., a good enforced policy for use
> > > > of warn() and any investigation to understand the complexity of
> > > > panic() and reducing its complexity if triggered by such an
> > > > investigation.
> > >
> > > Hi Alex
> > >
> > > I also agree with the summary that Lukas gave here. From my experience,
> > > safety systems are always guarded by an external flow monitor (e.g. a
> > > watchdog) that triggers in case the safety-relevant workloads slow down
> > > or block (for any reason); given this condition of use, a system that
> > > goes into the panic state is always safe, since the watchdog would
> > > trigger and drive the system automatically into a safe state.
> > > So I also don't see a clear advantage of having pkill_on_warn();
> > > actually, on the flip side, it seems to me that such a feature could
> > > introduce more risk, as it kills only the threads of the process that
> > > caused the kernel warning, whereas the other processes are trusted to
> > > run on a weaker kernel (does killing the threads of the process that
> > > caused the kernel warning always fix the kernel condition that led to
> > > the warning?)
> >
> > Lukas, Gabriele, Robert,
> > Thanks for showing this from the safety point of view.
> >
> > The part about believing in panic() functionality is amazing :)
>
> Nothing is 100% reliable.
>
> With printk() maintainer hat on, the current panic() implementation
> is less reliable because it tries hard to provide some debugging
> information, for example, error message, backtrace, registers,
> flush pending messages on console, crashdump.
>
> See panic() implementation, the reboot is done by emergency_restart().
> The rest is about dumping the information.
>
> Well, the information is important. Otherwise, it is really hard to
> fix the problem.
>
> From my experience, especially the access to consoles is not fully
> safe. The reliability might improve a lot when a lockless console
> is used. I guess that using non-volatile memory for the log buffer
> might be even more reliable.
>
> I am not familiar with the code under emergency_restart(). I am not
> sure how reliable it is.
>
> > Yes, safety critical systems depend on the robust ability to restart.
>
> If I wanted to implement a super-reliable panic() I would
> use some external device that would cause a power-reset when
> the watched device is not responding.

Petr, that is basically the system design that is commonly taken.

The whole challenge then remains to show the following:

Once panic() has been invoked, the watched device must not
unintentionally signal that it is alive while panic() is stuck in its
shutdown routines. That requires panic(), or some other shutdown
routine, to reliably ensure that the kernel routine that makes the
watched device signal liveness no longer does so.
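
The watchdog argument above can be illustrated with a small simulation. This is only a toy sketch in Python, not kernel or driver code: the class `Watchdog`, the function `time_to_safe_state`, and all timing values are hypothetical names and numbers chosen for illustration. It shows that the external watchdog drives the system to a safe state only if heartbeats really stop once panic() has been invoked.

```python
from dataclasses import dataclass


@dataclass
class Watchdog:
    """Toy external watchdog: trips if no heartbeat arrives within timeout_ms."""
    timeout_ms: int
    last_kick_ms: int = 0
    tripped: bool = False

    def kick(self, now_ms: int) -> None:
        # A heartbeat from the watched device resets the deadline.
        if not self.tripped:
            self.last_kick_ms = now_ms

    def poll(self, now_ms: int) -> bool:
        # Once tripped, the watchdog stays tripped (latched safe state).
        if now_ms - self.last_kick_ms > self.timeout_ms:
            self.tripped = True
        return self.tripped


def time_to_safe_state(panic_at_ms, heartbeat_after_panic,
                       timeout_ms=100, period_ms=10, horizon_ms=1000):
    """Return the time at which the watchdog trips, or None if it never does."""
    wd = Watchdog(timeout_ms=timeout_ms)
    for now in range(0, horizon_ms + 1, period_ms):
        panicked = now >= panic_at_ms
        # Hypothetical failure mode: the heartbeat path keeps running
        # even though panic() is stuck in its shutdown routines.
        if not panicked or heartbeat_after_panic:
            wd.kick(now)
        if wd.poll(now):
            return now
    return None
```

With `heartbeat_after_panic=False` the watchdog trips shortly after the timeout expires; with `heartbeat_after_panic=True` it never trips within the simulated horizon, which is exactly the case that has to be excluded by analysis.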


Lukas

> Best Regards,
> Petr
>
>
> PS: I do not believe much in the pkill approach either.
>
> It is similar to the OOM killer. And I always had to restart the
> system when it was triggered.
>
> Also, the kernel is not prepared for the situation that external
> code kills a kthread. And kthreads are used by many subsystems
> to handle work that has to be done asynchronously and/or in
> process context. And I guess that kthreads are a non-trivial
> source of WARN().


Re: [PATCH v2 0/2] Introduce the pkill_on_warn parameter

Lukas Bulwahn
 

On Tue, Nov 16, 2021 at 7:37 AM Christophe Leroy
<christophe.leroy@...> wrote:



> On 15/11/2021 at 17:06, Steven Rostedt wrote:
> > On Mon, 15 Nov 2021 14:59:57 +0100
> > Lukas Bulwahn <lukas.bulwahn@...> wrote:
> >
> > > 1. Allow a reasonably configured kernel to boot and run with
> > > panic_on_warn set. Warnings should only be raised when something is
> > > not configured as the developers expect it or the kernel is put into a
> > > state that generally is _unexpected_ and has been exposed little to
> > > the critical thought of the developer, to testing efforts and use in
> > > other systems in the wild. Warnings should not be used for something
> > > informative, which still allows the kernel to continue running in a
> > > proper way in a generally expected environment. To my knowledge,
> > > there are some kernels in production that run with panic_on_warn; so,
> > > IMHO, this requirement is generally accepted (we might of course
> >
> > To me, WARN*() is the same as BUG*(). If it gets hit, it's a bug in the
> > kernel and needs to be fixed. I have several WARN*() calls in my code, and
> > it's all because the algorithms used are expected to prevent the condition
> > in the warning from happening. If the warning triggers, it means either that
> > the algorithm is wrong or my assumption about the algorithm is wrong. In
> > either case, the kernel needs to be updated. All my tests fail if a WARN*()
> > gets hit (anywhere in the kernel, not just my own).
> >
> > After reading all the replies and thinking about this more, I find
> > pkill_on_warn actually worse than not doing anything. If you are
> > concerned about exploits from warnings, the only real solution is
> > panic_on_warn. Yes, it brings down the system, but really, it has to be
> > brought down anyway, because it is in need of a kernel update.
>
> We also have LIVEPATCH to avoid bringing down the system for a kernel
> update, don't we? So I wouldn't expect bringing down a vital system
> just for a WARN.
>
> As far as I understand from
> https://www.kernel.org/doc/html/latest/process/deprecated.html#bug-and-bug-on,
> WARN() and WARN_ON() are meant to deal with those situations as
> gracefully as possible, allowing the system to continue running the best
> it can until a human-controlled action is taken.
>
> So I'd expect the WARN/WARN_ON to be handled, and I agree that
> pkill_on_warn seems dangerous and irrelevant, probably more dangerous
> than doing nothing, especially as the WARN may trigger for a reason
> which has nothing to do with the running thread.

Christophe,

I agree that it is a reasonable goal that WARN() should allow users "to
deal with those situations as gracefully as possible, allowing the system
to continue running the best it can until a human-controlled action is
taken."

However, that makes me wonder even more: which functionality must the
system still provide correctly after a WARN() invocation so that the
human can take action? How can the kernel indicate to all user
applications that a certain functionality is no longer working, and how
adaptive can those user applications really be? Making that explicit
for every WARN() invocation seems tricky and probably also quite
error-prone. So, in the end, after a WARN(), you end up with the
uncomfortable feeling of running a system where some things work and
some do not, and it might be insecure (the whole system security
concept is invalidated because security features do not work, security
holes are opened, etc.) or other surprises may happen.

panic_on_warn implements a simple version of that "run as gracefully
as possible" policy: we assume that stopping the kernel is _graceful_,
and we just assume that the functionality "panic shuts down the system"
still works properly after any WARN() invocation. Once the system is
shut down, the human can take action and switch it into some (remote)
diagnostic mode for further analysis and repair.
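
For readers who want to check which policy a running system uses, the following Python sketch reads the `kernel.panic_on_warn` sysctl through procfs. The helper name `panic_on_warn_enabled` is hypothetical; the sketch assumes a Linux system where `/proc/sys/kernel/panic_on_warn` is present and readable, and it does not change the setting.

```python
from pathlib import Path

# procfs view of the kernel.panic_on_warn sysctl.
PANIC_ON_WARN = Path("/proc/sys/kernel/panic_on_warn")


def panic_on_warn_enabled(path: Path = PANIC_ON_WARN) -> bool:
    """Return True if the sysctl file at `path` holds a non-zero integer."""
    return int(path.read_text().strip()) != 0
```

The path is a parameter so that the function can be exercised against an ordinary file; on a system without procfs the call raises `OSError`, which a caller would have to handle.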

I am wondering whether that policy and that assumption hold for all
WARN() invocations in the kernel. I would hope that we can answer this
question, which is much simpler than getting a precise answer on what
"as graceful as possible" actually means.

Lukas


Re: [PATCH v2 0/2] Introduce the pkill_on_warn parameter

Lukas Bulwahn
 

On Tue, Nov 16, 2021 at 8:52 AM Alexander Popov <alex.popov@...> wrote:

> Lukas, Gabriele, Robert,
> Thanks for showing this from the safety point of view.
>
> The part about believing in panic() functionality is amazing :)
> Yes, safety critical systems depend on the robust ability to restart.

Well, there is really a lot of thought and willingness for engineering
effort to address the fact that there must be high confidence that the
shutdown with panic() really works.

The proper start and restart of the kernel is actually another
issue... but there, various sanity checks are possible before the
system switches into a mode that could potentially harm people (cause
physical damage, directly or indirectly).

Lukas

> Best regards,
> Alexander


Re: [PATCH v2 0/2] Introduce the pkill_on_warn parameter

Robert Krutsch
 

We can always kill on warnings, but if these warnings are very frequent, nobody is going to buy the implementation. If you have statistics showing how frequently these warnings arise (in some typical system) and can argue that the impact is minimal, it would be easier to accept.
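
One way to gather such statistics is to count warning splats in saved kernel logs. The Python sketch below is a minimal illustration, not an established tool: the function name `count_warnings` is hypothetical, and the regular expression assumes the common first line of a warning splat of the form `WARNING: CPU: <n> PID: <n> at <file>:<line> ...`; other log layouts would need a different pattern.

```python
import re
from collections import Counter

# Matches the usual first line of a kernel warning splat, e.g.
# "WARNING: CPU: 1 PID: 42 at kernel/foo.c:123 foo_func+0x1a/0x30"
WARN_RE = re.compile(r"WARNING: CPU: \d+ PID: \d+ at (\S+:\d+)")


def count_warnings(log_text: str) -> Counter:
    """Count kernel warnings per source location in a dmesg-style log."""
    return Counter(m.group(1) for m in WARN_RE.finditer(log_text))
```

Feeding it the saved output of `dmesg` from a set of representative systems would give a per-location frequency table that can support or weaken the availability argument.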

As Lukas was saying, we currently do not rate availability so highly, but in the coming years this could become a key ask.

Even in consumer HW today, this kind of reset mindset is avoided (we have RAS features in all architectures nowadays).

//Robert 



On Mon, Nov 15, 2021 at 3:00 PM Lukas Bulwahn <lukas.bulwahn@...> wrote:
On Sat, Nov 13, 2021 at 7:14 PM Alexander Popov <alex.popov@...> wrote:
>
> On 13.11.2021 00:26, Linus Torvalds wrote:
> > On Fri, Nov 12, 2021 at 10:52 AM Alexander Popov <alex.popov@...> wrote:
> >>
> >> Hello everyone!
> >> Friendly ping for your feedback.
> >
> > I still haven't heard a compelling _reason_ for this all, and why
> > anybody should ever use this or care?
>
> Ok, to sum up:
>
> Killing the process that hit a kernel warning complies with the Fail-Fast
> principle [1]. pkill_on_warn sysctl allows the kernel to stop the process when
> the **first signs** of wrong behavior are detected.
>
> By default, the Linux kernel ignores a warning and proceeds the execution from
> the flawed state. That is opposite to the Fail-Fast principle.
> A kernel warning may be followed by memory corruption or other negative effects,
> like in CVE-2019-18683 exploit [2] or many other cases detected by the SyzScope
> project [3]. pkill_on_warn would prevent the system from the errors going after
> a warning in the process context.
>
> At the same time, pkill_on_warn does not kill the entire system like
> panic_on_warn. That is the middle way of handling kernel warnings.
> Linus, it's similar to your BUG_ON() policy [4]. The process hitting BUG_ON() is
> killed, and the system proceeds to work. pkill_on_warn just brings a similar
> policy to WARN_ON() handling.
>
> I believe that many Linux distros (which don't hit WARN_ON() here and there)
> will enable pkill_on_warn because it's reasonable from the safety and security
> points of view.
>
> And I'm sure that the ELISA project by the Linux Foundation (Enabling Linux In
> Safety Applications [5]) would support the pkill_on_warn sysctl.
> [Adding people from this project to CC]
>
> I hope that I managed to show the rationale.
>

Alex, officially and formally, I cannot talk for the ELISA project
(Enabling Linux In Safety Applications) by the Linux Foundation and I
do not think there is anyone that can confidently do so on such a
detailed technical aspect that you are raising here, and as the
various participants in the ELISA Project have not really agreed on
such a technical aspect being one way or the other and I would not see
that happening quickly. However, I have spent quite some years on the
topic of "what are the right and important topics for using Linux in
safety applications"; so here are my five cents:

One of the general assumptions about safety applications and safety
systems is that the malfunction of a function within a system is more
critical, i.e., more likely to cause harm to people, directly or
indirectly, than the unavailability of the system. So, before
"something potentially unexpected happens"---which can have arbitrary
effects and hence effects difficult to foresee and control---, it is
better to just shutdown/silence the system, i.e., design a fail-safe
or fail-silent system, as the effect of shutdown is pretty easily
foreseeable during the overall system design and you could think about
what the overall system does, when the kernel crashes the usual way.

So, that brings us to what a user would expect from the kernel in a
safety-critical system: Shutdown on any event that is unexpected.

Here, I currently see panic_on_warn as the closest existing feature to
indicate any event that is unexpected and to shutdown the system. That
requires two things for the kernel development:

1. Allow a reasonably configured kernel to boot and run with
panic_on_warn set. Warnings should only be raised when something is
not configured as the developers expect it or the kernel is put into a
state that generally is _unexpected_ and has been exposed little to
the critical thought of the developer, to testing efforts and use in
other systems in the wild. Warnings should not be used for something
informative, which still allows the kernel to continue running in a
proper way in a generally expected environment. To my knowledge,
there are some kernels in production that run with panic_on_warn; so,
IMHO, this requirement is generally accepted (we might of course
discuss the one or other use of warn) and is not too much to ask for.

2. Really ensure that the system shuts down when it hits warn and
panic. That requires that the execution path for warn() and panic() is
not overly complicated (stuffed with various bells and whistles).
Otherwise, warn() and panic() could fail in various complex ways and
potentially keep the system running, although it should be shut down.
Some people in the ELISA Project looked a bit into why they believe
panic() shuts down a system but I have not seen a good system analysis
and argument why any third person could be convinced that panic()
works under all circumstances where it is invoked or that at least,
the circumstances under which panic really works are properly
documented. That is a central aspect for using Linux in a
reasonably-designed safety-critical system. That is possibly also
relevant for security, as you might see an attacker obtain information
because it was possible to "block" the kernel shutting down after
invoking panic() and hence, the attacker could obtain certain
information that was only possible because 1. the system got into an
inconsistent state, 2. it was detected by some check leading to warn()
or panic(), and 3. the system's security engineers assumed that the
system must have been shutting down at that point, as panic() was
invoked, and hence, this would be disallowing a lot of further
operations or some specific operations that the attacker would need to
trigger in that inconsistent state to obtain information.

To your feature, Alex, I do not see the need to have any refined
handling of killing a specific process when the kernel warns; stopping
the whole system is the better and more predictable thing to do. I
would prefer if systems, which have those high-integrity requirements,
e.g., in a highly secure system---where stopping any unintended information
flow matters more than availability---or in fail-silent environments
in safety systems, can use panic_on_warn. That should address your
concern above of handling certain CVEs as well.
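
The three ways of handling a warning discussed in this thread can be contrasted in a toy model. The Python sketch below is purely illustrative: `WarnPolicy` and `survivors` are hypothetical names, processes are just strings, and the model does not represent kernel state at all. It only makes visible which processes keep running under each policy, which is where the predictability argument comes from.

```python
from enum import Enum


class WarnPolicy(Enum):
    CONTINUE = "continue"      # default behaviour: log the warning and carry on
    PKILL = "pkill_on_warn"    # proposed: kill the process that hit the warning
    PANIC = "panic_on_warn"    # existing: panic the whole system


def survivors(processes: set[str], offender: str, policy: WarnPolicy) -> set[str]:
    """Return the processes still running after `offender` hits a warning."""
    if policy is WarnPolicy.PANIC:
        # The whole system stops; nothing keeps running.
        return set()
    if policy is WarnPolicy.PKILL:
        # Only the offending process is removed; the rest continue on a
        # kernel whose state may still be affected by the warning's cause.
        return processes - {offender}
    return set(processes)
```

Under `PANIC` the outcome is the same for every warning, whereas under `PKILL` and `CONTINUE` the remaining processes depend on a kernel state the model cannot vouch for.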

In summary, I am not supporting pkill_on_warn. I would support the
other points I mentioned above, i.e., a good enforced policy for use
of warn() and any investigation to understand the complexity of
panic() and reducing its complexity if triggered by such an
investigation.

Of course, the listeners and participants in the ELISA Project are
very, very diverse and still on a steep learning curve, i.e., what
does the kernel do, how complex are certain aspects in the kernel, and
what are reasonable system designs that are in reach for
certification. So, there might be some stakeholders in the ELISA
Project that consider availability of a Linux system safety-critical,
i.e., if the system with a Linux kernel is not available, something
really bad (harmful to people) happens. I personally do not know how
these stakeholders could (ever) argue the availability of a complex
system with a Linux kernel, with the availability criteria and the
needed confidence (evidence and required methods) for exposing anyone
to such system under our current societal expectations on technical
systems (you would need to show sufficient investigation of the
kernel's availability for a certification), but that does not stop
anyone looking into it... Those stakeholders should really speak for
themselves, if they see any need for such a refined control of
"something unexpected happens" (an invocation of warn) and more
generally what features from the kernel are needed for such systems.


Lukas






Re: [PATCH v2 0/2] Introduce the pkill_on_warn parameter

Gabriele Paoloni
 

On 15/11/2021 14:59, Lukas Bulwahn wrote:
On Sat, Nov 13, 2021 at 7:14 PM Alexander Popov <alex.popov@...> wrote:

On 13.11.2021 00:26, Linus Torvalds wrote:
On Fri, Nov 12, 2021 at 10:52 AM Alexander Popov <alex.popov@...> wrote:

Hello everyone!
Friendly ping for your feedback.
I still haven't heard a compelling _reason_ for this all, and why
anybody should ever use this or care?
Ok, to sum up:

Killing the process that hit a kernel warning complies with the Fail-Fast
principle [1]. pkill_on_warn sysctl allows the kernel to stop the process when
the **first signs** of wrong behavior are detected.

By default, the Linux kernel ignores a warning and proceeds the execution from
the flawed state. That is opposite to the Fail-Fast principle.
A kernel warning may be followed by memory corruption or other negative effects,
like in CVE-2019-18683 exploit [2] or many other cases detected by the SyzScope
project [3]. pkill_on_warn would prevent the system from the errors going after
a warning in the process context.

At the same time, pkill_on_warn does not kill the entire system like
panic_on_warn. That is the middle way of handling kernel warnings.
Linus, it's similar to your BUG_ON() policy [4]. The process hitting BUG_ON() is
killed, and the system proceeds to work. pkill_on_warn just brings a similar
policy to WARN_ON() handling.

I believe that many Linux distros (which don't hit WARN_ON() here and there)
will enable pkill_on_warn because it's reasonable from the safety and security
points of view.

And I'm sure that the ELISA project by the Linux Foundation (Enabling Linux In
Safety Applications [5]) would support the pkill_on_warn sysctl.
[Adding people from this project to CC]

I hope that I managed to show the rationale.
Alex, officially and formally, I cannot talk for the ELISA project
(Enabling Linux In Safety Applications) by the Linux Foundation and I
do not think there is anyone that can confidently do so on such a
detailed technical aspect that you are raising here, and as the
various participants in the ELISA Project have not really agreed on
such a technical aspect being one way or the other and I would not see
that happening quickly. However, I have spent quite some years on the
topic of "what are the right and important topics for using Linux in
safety applications"; so here are my five cents:

One of the general assumptions about safety applications and safety
systems is that the malfunction of a function within a system is more
critical, i.e., more likely to cause harm to people, directly or
indirectly, than the unavailability of the system. So, before
"something potentially unexpected happens"---which can have arbitrary
effects and hence effects difficult to foresee and control---, it is
better to just shutdown/silence the system, i.e., design a fail-safe
or fail-silent system, as the effect of shutdown is pretty easily
foreseeable during the overall system design and you could think about
what the overall system does, when the kernel crashes the usual way.

So, that brings us to what a user would expect from the kernel in a
safety-critical system: Shutdown on any event that is unexpected.

Here, I currently see panic_on_warn as the closest existing feature to
indicate any event that is unexpected and to shutdown the system. That
requires two things for the kernel development:

1. Allow a reasonably configured kernel to boot and run with
panic_on_warn set. Warnings should only be raised when something is
not configured as the developers expect it or the kernel is put into a
state that generally is _unexpected_ and has been exposed little to
the critical thought of the developer, to testing efforts and use in
other systems in the wild. Warnings should not be used for something
informative, which still allows the kernel to continue running in a
proper way in a generally expected environment. To my knowledge,
there are some kernels in production that run with panic_on_warn; so,
IMHO, this requirement is generally accepted (we might of course
discuss the one or other use of warn) and is not too much to ask for.

2. Really ensure that the system shuts down when it hits warn and
panic. That requires that the execution path for warn() and panic() is
not overly complicated (stuffed with various bells and whistles).
Otherwise, warn() and panic() could fail in various complex ways and
potentially keep the system running, although it should be shut down.
Some people in the ELISA Project looked a bit into why they believe
panic() shuts down a system but I have not seen a good system analysis
and argument that would convince any third person that panic()
works under all circumstances where it is invoked or that, at least,
the circumstances under which panic() really works are properly
documented. That is a central aspect for using Linux in a
reasonably-designed safety-critical system. That is possibly also
relevant for security: an attacker might obtain information
because it was possible to "block" the kernel from shutting down after
invoking panic(). That would require that 1. the system got into an
inconsistent state, 2. this was detected by some check leading to warn()
or panic(), and 3. the system's security engineers assumed that the
system must be shutting down at that point, as panic() was
invoked, which would rule out many further operations, or some
specific operations, that the attacker needs to trigger in that
inconsistent state to obtain information.

To your feature, Alex, I do not see the need to have any refined
handling of killing a specific process when the kernel warns; stopping
the whole system is the better and more predictable thing to do. I
would prefer if systems which have those high-integrity requirements,
e.g., in highly secure environments---where stopping any unintended
information flow matters more than availability---or in fail-silent
environments in safety systems, can use panic_on_warn. That should address your
concern above of handling certain CVEs as well.

In summary, I am not supporting pkill_on_warn. I would support the
other points I mentioned above, i.e., a good enforced policy for use
of warn() and any investigation to understand the complexity of
panic() and reducing its complexity if triggered by such an
investigation.

Hi Alex

I also agree with the summary that Lukas gave here. From my experience,
safety systems are always guarded by an external flow monitor (e.g. a
watchdog) that triggers in case the safety-relevant workloads slow down
or block (for any reason); given this condition of use, a system that
goes into the panic state is always safe, since the watchdog would
trigger and drive the system automatically into a safe state.
So I also don't see a clear advantage of having pkill_on_warn;
actually, on the flip side, it seems to me that such a feature could
introduce more risk, as it kills only the threads of the process that
caused the kernel warning whereas the other processes are trusted to
run on a weakened kernel (does killing the threads of the process that
caused the kernel warning always fix the kernel condition that led to
the warning?)
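
The external flow monitor described above can be sketched as a toy simulation. This is only an illustration under stated assumptions: the names FlowMonitor and FakeClock, the 1.0 s timeout and the 0.5 s kick period are all hypothetical, and a real system would rely on an independent hardware watchdog rather than on software running on the same kernel.

```python
import time


class FlowMonitor:
    """Toy external watchdog: expects a kick at least every timeout_s seconds."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_kick = clock()
        self.safe_state = False

    def kick(self):
        # Called by the safety-relevant workload to show it is still making progress.
        # Once the safe state is latched, further kicks are ignored.
        if not self.safe_state:
            self._last_kick = self._clock()

    def poll(self):
        # Called on the monitor side; latches the safe state on a missed deadline.
        if self._clock() - self._last_kick > self.timeout_s:
            self.safe_state = True
        return self.safe_state


class FakeClock:
    """Manually advanced clock so the example runs instantly and deterministically."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()
monitor = FlowMonitor(timeout_s=1.0, clock=clock)

# Healthy workload: kicks every 0.5 s, so the monitor never triggers.
for _ in range(4):
    clock.advance(0.5)
    monitor.kick()
    assert monitor.poll() is False

# The kernel panics (or the workload blocks): no more kicks arrive.
clock.advance(1.5)
print("safe state entered:", monitor.poll())
```

The point of the sketch is that the monitor does not need to know why the kicks stopped; a panic, a hang and a slow workload all look the same from the outside and all end in the safe state.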

Thanks
Gab


Re: [PATCH v2 0/2] Introduce the pkill_on_warn parameter

Lukas Bulwahn
 

On Sat, Nov 13, 2021 at 7:14 PM Alexander Popov <alex.popov@...> wrote:

On 13.11.2021 00:26, Linus Torvalds wrote:
On Fri, Nov 12, 2021 at 10:52 AM Alexander Popov <alex.popov@...> wrote:

Hello everyone!
Friendly ping for your feedback.
I still haven't heard a compelling _reason_ for this all, and why
anybody should ever use this or care?
Ok, to sum up:

Killing the process that hit a kernel warning complies with the Fail-Fast
principle [1]. The pkill_on_warn sysctl allows the kernel to stop the process when
the **first signs** of wrong behavior are detected.

By default, the Linux kernel ignores a warning and proceeds with execution from
the flawed state. That is the opposite of the Fail-Fast principle.
A kernel warning may be followed by memory corruption or other negative effects,
like in the CVE-2019-18683 exploit [2] or many other cases detected by the SyzScope
project [3]. pkill_on_warn would protect the system from the errors that follow
a warning in process context.

At the same time, pkill_on_warn does not kill the entire system like
panic_on_warn. That is the middle way of handling kernel warnings.
Linus, it's similar to your BUG_ON() policy [4]. The process hitting BUG_ON() is
killed, and the system proceeds to work. pkill_on_warn just brings a similar
policy to WARN_ON() handling.
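
The three ways of handling a warning that are compared above can be put side by side in a small toy model. This is not kernel code: WarnPolicy and handle_warning are hypothetical names, the process list is invented, and pkill_on_warn is the sysctl proposed in this patch series rather than an existing kernel feature. The model also ignores that the proposal only applies to warnings raised in process context.

```python
from enum import Enum


class WarnPolicy(Enum):
    CONTINUE = "default"            # log the warning and keep running
    KILL_PROCESS = "pkill_on_warn"  # proposed in this thread, not an existing sysctl
    PANIC = "panic_on_warn"         # existing sysctl: a warning leads to panic()


def handle_warning(policy, processes, offender):
    """Return (system_running, surviving_processes) after `offender` hits a warning."""
    if policy is WarnPolicy.PANIC:
        # The whole system stops; nothing survives.
        return False, []
    if policy is WarnPolicy.KILL_PROCESS:
        # Only the offending process is removed; everything else keeps running
        # on a kernel that has already reported an unexpected state.
        return True, [p for p in processes if p != offender]
    # Default: nothing is stopped at all.
    return True, list(processes)


procs = ["safety_app", "logger", "media_player"]
for policy in WarnPolicy:
    print(policy.value, handle_warning(policy, procs, "media_player"))
```

In this model the middle policy is the only one where some processes continue while another has been killed, which is exactly the situation the replies in this thread question.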

I believe that many Linux distros (which don't hit WARN_ON() here and there)
will enable pkill_on_warn because it's reasonable from the safety and security
points of view.

And I'm sure that the ELISA project by the Linux Foundation (Enabling Linux In
Safety Applications [5]) would support the pkill_on_warn sysctl.
[Adding people from this project to CC]

I hope that I managed to show the rationale.

Alex, officially and formally, I cannot speak for the ELISA project
(Enabling Linux In Safety Applications) by the Linux Foundation and I
do not think there is anyone that can confidently do so on such a
detailed technical aspect that you are raising here, as the
various participants in the ELISA Project have not really agreed on
such a technical aspect being one way or the other and I would not see
that happening quickly. However, I have spent quite some years on the
question of "what are the right and important topics for using Linux in
safety applications"; so here are my five cents:

One of the general assumptions about safety applications and safety
systems is that the malfunction of a function within a system is more
critical, i.e., more likely to cause harm to people, directly or
indirectly, than the unavailability of the system. So, before
"something potentially unexpected happens"---which can have arbitrary
effects and hence effects difficult to foresee and control---, it is
better to just shut down/silence the system, i.e., design a fail-safe
or fail-silent system, as the effect of a shutdown is pretty easily
foreseeable during the overall system design and you can think about
what the overall system does when the kernel crashes the usual way.

So, that brings us to what a user would expect from the kernel in a
safety-critical system: Shut down on any event that is unexpected.

Here, I currently see panic_on_warn as the closest existing feature to
indicate any event that is unexpected and to shut down the system. That
requires two things from kernel development:

1. Allow a reasonably configured kernel to boot and run with
panic_on_warn set. Warnings should only be raised when something is
not configured as the developers expect it or the kernel is put into a
state that generally is _unexpected_ and has been exposed little to
the critical thought of the developer, to testing efforts and use in
other systems in the wild. Warnings should not be used for something
informative, which still allows the kernel to continue running in a
proper way in a generally expected environment. To my knowledge,
there are some kernels in production that run with panic_on_warn; so,
IMHO, this requirement is generally accepted (we might of course
discuss one or another use of warn) and is not too much to ask for.

2. Really ensure that the system shuts down when it hits warn and
panic. That requires that the execution path for warn() and panic() is
not overly complicated (stuffed with various bells and whistles).
Otherwise, warn() and panic() could fail in various complex ways and
potentially keep the system running, although it should be shut down.
Some people in the ELISA Project looked a bit into why they believe
panic() shuts down a system but I have not seen a good system analysis
and argument that would convince any third person that panic()
works under all circumstances where it is invoked or that, at least,
the circumstances under which panic() really works are properly
documented. That is a central aspect for using Linux in a
reasonably-designed safety-critical system. That is possibly also
relevant for security: an attacker might obtain information
because it was possible to "block" the kernel from shutting down after
invoking panic(). That would require that 1. the system got into an
inconsistent state, 2. this was detected by some check leading to warn()
or panic(), and 3. the system's security engineers assumed that the
system must be shutting down at that point, as panic() was
invoked, which would rule out many further operations, or some
specific operations, that the attacker needs to trigger in that
inconsistent state to obtain information.
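
As a small aid for the first requirement, whether a running Linux system has panic_on_warn set can be read from /proc/sys/kernel/panic_on_warn. The sketch below is a minimal check under assumptions: the helper name panic_on_warn_enabled is hypothetical, and it only reports the configuration. It says nothing about the second requirement, i.e., whether panic() reliably stops the system once invoked.

```python
from pathlib import Path

# Standard procfs location of the sysctl on Linux.
PANIC_ON_WARN = Path("/proc/sys/kernel/panic_on_warn")


def panic_on_warn_enabled(path=PANIC_ON_WARN):
    """Return True/False for the sysctl value, or None if it cannot be determined."""
    try:
        text = Path(path).read_text().strip()
    except OSError:
        # Not Linux, kernel without this sysctl, or /proc not mounted.
        return None
    try:
        return int(text) != 0
    except ValueError:
        # Unexpected file content; report "unknown" rather than guess.
        return None


state = panic_on_warn_enabled()
if state is None:
    print("panic_on_warn: unknown (sysctl not readable)")
elif state:
    print("panic_on_warn: enabled")
else:
    print("panic_on_warn: disabled, warnings will not stop the system")
```

The path is a parameter so that the check can be exercised against a plain file, for example in a test environment without procfs.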

To your feature, Alex, I do not see the need to have any refined
handling of killing a specific process when the kernel warns; stopping
the whole system is the better and more predictable thing to do. I
would prefer if systems which have those high-integrity requirements,
e.g., in highly secure environments---where stopping any unintended
information flow matters more than availability---or in fail-silent
environments in safety systems, can use panic_on_warn. That should address your
concern above of handling certain CVEs as well.

In summary, I am not supporting pkill_on_warn. I would support the
other points I mentioned above, i.e., a good enforced policy for use
of warn() and any investigation to understand the complexity of
panic() and reducing its complexity if triggered by such an
investigation.

Of course, the listeners and participants in the ELISA Project are
very, very diverse and still on a steep learning curve, i.e., what
does the kernel do, how complex are certain aspects in the kernel, and
what are reasonable system designs that are in reach for
certification. So, there might be some stakeholders in the ELISA
Project that consider availability of a Linux system safety-critical,
i.e., if the system with a Linux kernel is not available, something
really bad (harmful to people) happens. I personally do not know how
these stakeholders could (ever) argue the availability of a complex
system with a Linux kernel, with the availability criteria and the
needed confidence (evidence and required methods) for exposing anyone
to such a system under our current societal expectations on technical
systems (you would need to show sufficient investigation of the
kernel's availability for a certification), but that does not stop
anyone looking into it... Those stakeholders should really speak for
themselves, if they see any need for such a refined control of
"something unexpected happens" (an invocation of warn) and more
generally what features from the kernel are needed for such systems.


Lukas


Event: ELISA Safety-Architecture Weekly Meeting - 11/16/2021 #cal-reminder

safety-architecture@lists.elisa.tech Calendar <noreply@...>
 

Reminder: ELISA Safety-Architecture Weekly Meeting

When:
11/16/2021
1:00pm to 2:00pm
(UTC+00:00) UTC

Where:
https://zoom.us/j/95775114472?pwd=VDFzTjRjNW8yd3ZOQWVLS1ZpWFlEUT09

Organizer: myu@... myu@...



Re: Summer time confusion

Gabriele Paoloni
 

We safely hold meetings in lockstep ;-)


On Tue, Nov 2, 2021 at 1:29 PM Jochen Kall <Jochen.Kall@...> wrote:

Hi Gab, Min

I think I know why, it seems we keep up the proud ELISA tradition for another year 😝

In my calendar all ELISA meetings show up twice for this week, once with and once without time shift, soo…

Anyways, from next week on everything looks right again.

Cheers

Jochen

From: safety-architecture@... <safety-architecture@...> On behalf of Gabriele Paoloni
Sent: Tuesday, November 2, 2021 13:14
To: safety-architecture@...
Subject: [ELISA Safety Architecture WG] Summer time confusion

Hi All

I am getting a few messages from people who are online now.

Min sent out the updated meeting series on Fri, so the meeting is at 2pm CET (following the EU time change)

Thanks

Gab


Re: Summer time confusion

Jochen Kall
 

Hi Gab, Min

I think I know why, it seems we keep up the proud ELISA tradition for another year 😝

In my calendar all ELISA meetings show up twice for this week, once with and once without time shift, soo…

Anyways, from next week on everything looks right again.

Cheers

Jochen

From: safety-architecture@... <safety-architecture@...> On behalf of Gabriele Paoloni
Sent: Tuesday, November 2, 2021 13:14
To: safety-architecture@...
Subject: [ELISA Safety Architecture WG] Summer time confusion

Hi All

I am getting a few messages from people who are online now.

Min sent out the updated meeting series on Fri, so the meeting is at 2pm CET (following the EU time change)

Thanks

Gab


Summer time confusion

Gabriele Paoloni
 

Hi All

I am getting a few messages from people who are online now.
Min sent out the updated meeting series on Fri, so the meeting is at 2pm CET (following the EU time change)

Thanks
Gab


ww44 Agenda

Gabriele Paoloni
 

Hi All

Today we'll cover the following topics:
- Kernel FFI from Mobileye approach
- Integrity of the safety app process address space

Thanks
Gab


Event: ELISA Safety-Architecture Weekly Meeting - 11/02/2021 #cal-reminder

safety-architecture@lists.elisa.tech Calendar <noreply@...>
 

Reminder: ELISA Safety-Architecture Weekly Meeting

When:
11/02/2021
1:00pm to 2:00pm
(UTC+00:00) UTC

Where:
https://zoom.us/j/95775114472?pwd=VDFzTjRjNW8yd3ZOQWVLS1ZpWFlEUT09

Organizer: myu@... myu@...



Cancelled Event: ELISA Safety-Architecture Weekly Meeting - Tuesday, November 9, 2021 #cal-cancelled

safety-architecture@lists.elisa.tech Calendar <noreply@...>
 

Cancelled: ELISA Safety-Architecture Weekly Meeting

This event has been cancelled.

When:
Tuesday, November 9, 2021
12:00pm to 1:00pm
(UTC+00:00) UTC

Where:
https://zoom.us/j/95775114472?pwd=VDFzTjRjNW8yd3ZOQWVLS1ZpWFlEUT09

Organizer: myu@... myu@...



Updated Event: ELISA Safety-Architecture Weekly Meeting - Tuesday, November 2, 2021 #cal-invite

safety-architecture@lists.elisa.tech Calendar <noreply@...>
 

ELISA Safety-Architecture Weekly Meeting

When:
Tuesday, November 2, 2021
1:00pm to 2:00pm
(UTC+00:00) UTC

Where:
https://zoom.us/j/95775114472?pwd=VDFzTjRjNW8yd3ZOQWVLS1ZpWFlEUT09

Organizer: myu@... myu@...



today's meeting cancelled

Gabriele Paoloni
 

Hi All

Unfortunately, I had a last-minute issue and I need to cancel today's meeting.
We will reconvene next week to continue the investigation into the integrity of the process address space.

Thanks and Regards
Gab
