What to do in response to a kernel warning
Shuah Khan
All,
This is an active thread about "What to do in response to a kernel warning" on the Linux kernel mailing lists. Lukas and others from ELISA have been participating.

Alexander Popov called out ELISA for input and feedback on his take on replacing the big-hammer approach of the kernel/panic_on_warn sysctl knob: he proposes adding a kernel/pkill_on_warn knob that kills the threads and processes that cause the warning, as opposed to taking the whole system down.

Give it a read. If you can't access it now, it will be available without a subscription in a week.

https://lwn.net/Articles/876209/

thanks,
-- Shuah
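For readers who have not used the knob: on a running Linux system the current setting is exposed through procfs as /proc/sys/kernel/panic_on_warn (0 means log the warning and continue, 1 means panic). A minimal sketch for checking it is below; the helper name `read_panic_on_warn` is invented for illustration, and pkill_on_warn is only the proposal under discussion, so no such file should be assumed to exist.

```python
from pathlib import Path
from typing import Optional

# procfs location of the kernel.panic_on_warn sysctl.
PANIC_ON_WARN = Path("/proc/sys/kernel/panic_on_warn")


def read_panic_on_warn(path: Path = PANIC_ON_WARN) -> Optional[bool]:
    """Return True if the knob is non-zero, False if it is 0, None if unreadable."""
    try:
        return int(path.read_text().strip()) != 0
    except (OSError, ValueError):
        # Not Linux, no procfs, no permission, or unexpected file content.
        return None


if __name__ == "__main__":
    print("panic_on_warn:", read_panic_on_warn())
```

The path is a parameter so the function can be exercised against an ordinary file in a test, without touching the real sysctl.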
|
On Fri, Nov 19, 2021 at 5:58 PM Shuah Khan <skhan@...> wrote:
Thanks, Shuah, for pointing out the LWN article.

Alex Popov pulled us into a kernel discussion this week on a specific kernel feature proposal, with the remark that this is what safety-critical systems need.

In short, Alexander Popov suggested that warnings in the kernel need a refined run-time treatment. I disagreed with him and stated that I expect panic_on_warn to be turned on in the kernel for safety-critical systems, and that a safety-critical system would never try to continue to operate after a WARN(): the risk of malfunction is larger than the benefit of continued operation.

All of this is of course largely a hypothesis based on my understanding of the requirements of safety-critical systems that may ever rely on Linux.

I would of course be interested in:

- Do we all agree that setting panic_on_warn is the reasonable choice for this kernel configuration for the safety-critical systems we are discussing? Are there arguments not to set panic_on_warn that I am not aware of or that I misjudged?

- Which warnings and kernel panics do you encounter in your current test and (early) production systems when switching on panic_on_warn? We can support each other here to debug and resolve them appropriately.

Please share such information. I am confident that ELISA contributors could support your development (clean-up) activities if information on warnings that are known and encountered, but still unresolved, is shared.

Lukas
|
elana.copperman@...
Thanks, Shuah, for sharing this important information.
From my experience (and we should hear from others as well!), panic_on_warn as a hard rule is too restrictive. For example, if an autonomous car is driving at 200 km/h on a German highway, the kernel panic from a warning could be life-threatening to the car's passengers.

What is necessary is appropriate handling for panic_on_warn, to enable the integrator to define follow-up behavior: for example, switch to degraded functionality, or switch to a fault handling application, or panic when relevant. Even killing the specific threads causing the warning should not be the only option. It would then be the integrator's responsibility to configure the appropriate behavior per use case.

I will join the thread when it will be publicly available.

Regards
Elana
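Elana's suggestion of integrator-defined follow-up behavior can be pictured as a lookup from a class of warning to a reaction. The sketch below is purely illustrative: the warning classes, the `Action` names and the `react` helper are invented here, the kernel does not classify its warnings this way, and the choice of panic as the default for unclassified warnings is an assumption of this sketch rather than something stated in the message.

```python
from enum import Enum
from typing import Mapping


class Action(Enum):
    DEGRADE = "switch to degraded functionality"
    FAULT_HANDLER = "switch to a fault handling application"
    KILL_TASK = "kill the threads that caused the warning"
    PANIC = "panic"


# Hypothetical integrator policy: warning class -> configured reaction.
POLICY: Mapping[str, Action] = {
    "sensor_driver": Action.DEGRADE,
    "filesystem": Action.FAULT_HANDLER,
    "memory_management": Action.PANIC,
}


def react(warning_class: str,
          policy: Mapping[str, Action] = POLICY,
          default: Action = Action.PANIC) -> Action:
    """Return the configured reaction; unclassified warnings get `default`."""
    return policy.get(warning_class, default)


if __name__ == "__main__":
    for cls in ("sensor_driver", "never_classified"):
        print(cls, "->", react(cls).value)
```

The table itself is trivial; the hard part in practice would be justifying each entry, that is, showing that the system really is still in a usable state after a given warning.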
|
On Sun, Nov 21, 2021 at 9:44 AM Elana Copperman
<Elana.Copperman@...> wrote:
Sorry, Elana, this argument you are presenting above---with the autonomous car---is very pictorial, but hardly meets reality. What you are writing above suggests that there is no surrounding system and no system engineering in place that ensures degradation and passenger safety within the system that consists of multiple ECUs in a vehicle network.

Of course, it is possible to design such a system you are sketching above, where a single warning leads to a life-threatening event. But then kernel warnings are really not the problem; the problem is that the system's safety design is so weak that the integrator's business is at risk if this system is distributed in large numbers to others.

> I will join the thread when it will be publicly available.

There is some misconception about the discussions on the linux-kernel mailing list: the thread is already public. The discussion actually has already moved on; I think the arguments against the suggested pkill_on_warn were overwhelming and now alternatives are discussed.

Only the LWN.net article, which summarizes the discussion, becomes available to the wider public a week after publication. Of course, anyone who has a relevant stake in overall kernel development has an LWN.net subscription (which really does not cost much) to understand and follow closely what is happening.

The argument above did not convince me: I still think that, with the current policies on when a warning is emitted in the kernel, panic_on_warn is the only reasonable option.

I would of course support anyone who goes through all warnings in the kernel and tries to identify exactly which operations are still functional after each warning, e.g., which system calls would still work, which processes might still be functional, which fault operations may still work. That is a very complex code investigation for quite little benefit compared to following a "panic_on_warn" behaviour, but it certainly can be done and is worth presenting if somebody does it in an informed, structured and systematic way.

I suggest somebody describe all activities required, and estimate the complexity of those activities, to build a fail-operational system with Linux on modern hardware in a single-channel system. Then one might have a convincing argument for some refined handling of warnings, or for making all the kernel functions fail-operational by modifying their failure behavior to not emit a warning at all.

Good luck.

Lukas
|
elana.copperman@...
Lukas, please read carefully and align with what I wrote.
In any case, we should park this thread, and move discussion to the LWN thread.
Thanks
Elana
From: Lukas Bulwahn <lukas.bulwahn@...>
Sent: Monday, November 22, 2021 10:58 AM
To: Elana Copperman <Elana.Copperman@...>
Cc: Shuah Khan <skhan@...>; devel@... <devel@...>
Subject: Re: [ELISA Technical Community] What to do in response to a kernel warning

On Sun, Nov 21, 2021 at 9:44 AM Elana Copperman <Elana.Copperman@...> wrote:
> Thanks, Shuah, for sharing this important information.
> From my experience (and we should hear from others as well!), panic_on_warn as a hard rule is too restrictive. For example, if an autonomous car is driving at 200 KM per hour on a German highway, the kernel panic from a warning could be life threatening to the car passengers. What is necessary is appropriate handling for panic_on_warn, to enable the integrator to define follow up behavior: For example, switch to degraded functionality, or switch to a fault handling application, or panic when relevant. Even killing the specific threads causing the warning should not be the only option. It would then be the integrator's responsibility to configure the appropriate behavior per use case.

Sorry, Elana, this argument you are presenting above---with the autonomous car---is very pictorial, but hardly meets reality. What you are writing above suggests that there is no surrounding system and no system engineering in place that ensures degradation and passenger safety within the system that consists of multiple ECUs in a vehicle network.

>> That is not at all what was prescribed above.
>> panic_on_warn leads to kernel panic on warning. And that is too restrictive, as explained above. If the kernel panics, it will override any system engineering and system features which have been defined.
>> Fault handling for warnings needs to be more specific (i.e., not panic).
>> And as I wrote above, you need some fault handling application or other system support, switching to degraded functionality or a fault handling application.

Of course, it is possible to design such a system you are sketching above, where a single warning leads to a life-threatening event,

>> NO, as already stated - kernel panic on warning is the wrong behavior.
>> Correct system behavior is outlined above, in original email thread.

...

The discussion actually has already moved on; I think the arguments against the suggested pkill_on_warn were overwhelming and now alternatives are discussed.

>> Good.
>> OK, this discussion should continue in that context. Thanks
|
|
On Mon, Nov 22, 2021 at 10:19 AM Elana Copperman <Elana.Copperman@...> wrote:

Good luck.

Lukas
|
Jochen Kall
Hi Elana,
not sure what you refer to, but in any case, I think Lukas is right; I also don’t see how the example supports your position. If availability is safety-critical, a system without redundancy (Linux-based or not) is not a feasible design pattern anyway, and if we are talking about a fail-safe system, shutting off cleanly when in doubt is the way to go from a safety perspective.
Best regards Jochen
--
Kind regards

Functional Safety
ITK Engineering GmbH
Tel.: +49 7272 7703-546 Mobile: +491734957776
______________________________________________________________
ITK Engineering GmbH | Im Speyerer Tal 6 | 76761 Rülzheim
Tel.: +49 7272 7703-0 | Fax: +49 7272 7703-100
mailto:info@... | http://www.itk-engineering.de
Chairman of the Supervisory Board: Dr. Rudolf Maier
Executive Board: Michael Englert (Chairman), Bernd Gohlicke
Registered Office: 76761 Rülzheim
Registered Court: Amtsgericht Landau, HRB 32046
VAT-ID-No. DE 813165046
|
elana.copperman@...
OK, if this is the focus, then we can/should continue here, and not simply annoy LWN.
Jochen, redundancy will not help here. If all systems enable panic_on_warn, the redundant systems will all quickly fail.
And availability is very much a concern, not only for automotive AV, but also for medical devices and many other safety-critical systems.
panic_on_warn is simply too restrictive and is not "shutting off cleanly"; it is pulling the plug on a running device.
What is needed is a well-defined fault handling mechanism which handles faults appropriately, in a more specific way.
Shutting down a safety-critical device which requires high availability, for any warning, is not an option.
As a rule, warnings should be eliminated by sufficient testing and static analysis before the system hits the road. And if some warnings remain in the deployed device, the most practical solution is logging and a software (OTA) update.
|
|
Jochen Kall
Hi Elana,
you misunderstood my point.

What I meant is the following: if shutting off is not safe for a system, as is apparently the case in your example, you simply cannot design it with a single controller. Rather, you need several of them, monitoring each other such that if one fails, another can take over, unless you have a magical piece of hardware that can never fail to run it on. (That consideration, btw, is totally independent of the OS used; it applies to all of them.)

For such a system, however, continuing to operate a compromised path makes the whole thing potentially less safe.

In a fail-safe system where shutting off leads to the safe state, the same conclusion applies.

That’s why I agree with Lukas's position that it is probably not very important for safety applications to have this capability, and why I believe the example you gave does not support your position.

That of course does not mean it’s a bad idea to have this capability in general (not qualified to judge that ^^), but in safety systems we’d probably switch it off anyway for the reasons outlined above, that’s all.
Jochen
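The redundancy argument above can be made concrete with a small sketch of a monitor that picks the channel to trust. Everything in it is an assumption for illustration: the `Controller` record, the heartbeat timestamps and the 0.1 s timeout are invented, and a real system would implement this in an independent supervisory element rather than in a Python script.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Controller:
    name: str
    last_heartbeat: float  # time of the last heartbeat, in the monitor's clock (s)


def select_active(controllers: Iterable[Controller], now: float,
                  timeout: float = 0.1) -> Optional[Controller]:
    """Return the first controller with a fresh heartbeat, or None if there is none."""
    for controller in controllers:
        if now - controller.last_heartbeat <= timeout:
            return controller
    # No healthy channel is left: the surrounding system has to enter its safe state.
    return None


if __name__ == "__main__":
    primary = Controller("primary", last_heartbeat=0.0)      # panicked, heartbeat is stale
    secondary = Controller("secondary", last_heartbeat=0.95)
    chosen = select_active([primary, secondary], now=1.0)
    print(chosen.name if chosen else "enter safe state")
```

A channel that panics simply stops sending heartbeats and is skipped. The `None` branch is where the disagreement in this thread sits: if every channel runs the same kernel and trips over the same warning, no healthy channel remains.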
|
elana.copperman@...
Jochen, I understand your point exactly.
Even with multiple controllers, you do not solve the problem.
Let me explain in more detail:

1. Assume you have a safety-critical system with a requirement for high availability and, as we agreed below, redundant HW.
2. panic_on_warn in all the controllers is an overreaction. And that is my key point.
3. If all the controllers are Linux-based, with panic_on_warn, you risk a situation where multiple controllers will fail, because instead of handling the warnings, their only option is to panic and crash.

A warning is a "compromised path" but not necessarily a path which should lead to panic.
panic_on_warn is the wrong way to deal with warnings in a high-availability system. You cannot ignore warnings, of course; but there must be more fine-tuned fault handling, and redundancy won't help to avoid that problem.
It is not only "not very important for safety applications to have this capability"; it is potentially dangerous if multiple controllers fail simultaneously on warnings.
This claim is true for any safety-critical system with a requirement for high availability and graceful degradation. And this claim is not limited to a specific use case (automotive AV was only an example of such a system).
|
Sudip Mukherjee
IIUC, a WARN_ON() will be used by the kernel when it sees something which it is not expecting, and so the system is in an unknown state.

For example, the WARN_ON() at
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/hibernate.c?h=v5.16-rc2#n106
will give a warning when the system is about to hibernate but sees that the number of CPUs is not equal to 1, which means more than one CPU is still active and the system is thus in an undefined state.

Will you want to continue running the system when the CPU states are in an undefined state, or will you want the system to panic so that the supervisory system can then bring the system back to a known good state?

--
Regards
Sudip

On 22/11/2021 7:36 pm, elana.copperman@... wrote:
> Jochen, I understand your point exactly.
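The two outcomes Sudip contrasts can be shown with a small userspace simulation. This is not kernel code: `warn_on`, `KernelPanic` and the `log` list are stand-ins invented for this sketch. It only mirrors the general behaviour of WARN_ON(), which reports when its condition is true, evaluates to that condition, lets execution continue, and leads to a panic instead when panic_on_warn is set.

```python
from typing import List


class KernelPanic(Exception):
    """Stand-in for panic(); in the kernel that call does not return."""


def warn_on(condition: bool, what: str, *, panic_on_warn: bool, log: List[str]) -> bool:
    """Record a warning if `condition` holds; escalate to a panic if configured."""
    if condition:
        log.append(f"WARNING: {what}")
        if panic_on_warn:
            raise KernelPanic(what)
    # Like WARN_ON(), hand the condition back so the caller can branch on it.
    return bool(condition)


if __name__ == "__main__":
    log: List[str] = []
    online_cpus = 2  # hypothetical value for the hibernate example
    if warn_on(online_cpus != 1, "num_online_cpus() != 1", panic_on_warn=False, log=log):
        print("continuing in an unexpected state:", log)
```

With `panic_on_warn=False` the caller carries on after the warning, which is the situation Sudip's question is about; with `panic_on_warn=True` the function never returns to the caller.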
|
elana.copperman@...
Sudip, I agree that this is certainly the appropriate reaction in some specific instances.
The argument is that panic on every warning is an overreaction. It is an academic exercise which will be rejected by any reasonable system architect.
|
elana.copperman@...
Sorry for quick hit on enter.
Now here is a challenge: how do we define safe and practical requirements for a "safe_panic_on_warn"? That is the type of resolution which may provide the balance which I am looking for.
|
Paul Sherwood
/* diving in with no idea how hot/deep the water is... */
Wouldn't best practice be just to insist on no warnings in production for safe operation?

Any ignorable warnings that occur during testing should be identified and explicit action taken, as code. Any warning never seen in testing should not be ignorable.

On 2021-11-22 20:50, elana.copperman@... wrote:
> Sorry for quick hit on enter.
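One way to read "explicit action taken, as code" is a registry of warning sites that were observed and reviewed during testing, each with its own handler, while anything outside the registry is escalated. The sketch below assumes exactly that reading; the site string, the handler and the use of "panic" as the escalation are all hypothetical.

```python
from typing import Callable, Mapping

# Hypothetical registry built from test campaigns: warning site -> reviewed handler.
REVIEWED: Mapping[str, Callable[[], str]] = {
    "drivers/example/foo.c:123": lambda: "logged; example device reset",
}


def handle_warning(site: str,
                   reviewed: Mapping[str, Callable[[], str]] = REVIEWED) -> str:
    """Run the reviewed handler for a known site; escalate anything unknown."""
    handler = reviewed.get(site)
    if handler is None:
        # Never seen in testing, so not ignorable; "panic" stands in for
        # whatever the system's safe reaction is.
        return "panic"
    return handler()


if __name__ == "__main__":
    print(handle_warning("drivers/example/foo.c:123"))
    print(handle_warning("kernel/never/seen.c:1"))
```

The default matters more than the entries: a site missing from the registry is treated as the unknown state, not as something that may be skipped.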
|
elana.copperman@...
Trying once more:
Warnings MUST be managed. No argument about that.
Panic is NOT feasible for every warning. That is the point.
From: Paul Sherwood <paul.sherwood@...>
Sent: Tuesday, November 23, 2021 6:58 PM To: Elana Copperman <Elana.Copperman@...> Cc: Sudip Mukherjee <sudip.mukherjee@...>; Jochen Kall <jochen.kall@...>; Lukas Bulwahn <lukas.bulwahn@...>; Shuah Khan <skhan@...>; devel@... <devel@...> Subject: Re: [ELISA Technical Community] What to do in response to a kernel warning EXTERNAL EMAIL: Do not click any links or open any attachments unless you trust the sender and know the content is safe.
/* diving in with no idea how hot/deep the water is... */ Wouldn't best practice be just to insist on no warnings in production for safe operation? Any ignorable warnings that occur during testing should be identified and explicit action taken, as code. Any warning never seen in testing should not be ignorable. On 2021-11-22 20:50, elana.copperman@... wrote: > Sorry for quick hit on enter. > Now here is a challenge, how do we define safe and practical > requirements for a "safe_panic_on_warn"? that is the type of > resolution which may provide the balance which I am looking for. > > > -----Original Message----- > From: devel@... <devel@...> On Behalf Of > Elana Copperman > Sent: Monday, November 22, 2021 10:47 PM > To: Sudip Mukherjee <sudip.mukherjee@...>; Jochen Kall > <jochen.kall@...>; Lukas Bulwahn > <lukas.bulwahn@...> > Cc: Shuah Khan <skhan@...>; devel@... > Subject: Re: [ELISA Technical Community] What to do in response to a > kernel warning > > Sudip, I agree that this is certainly the appropriate reaction in some > specific instances. > The argument is that panic on every warning is over reaction. It is > an academic exercise which will be rejected by any reasonable system > architect. > > -----Original Message----- > From: Sudip Mukherjee <sudip.mukherjee@...> > Sent: Monday, November 22, 2021 10:29 PM > To: Elana Copperman <Elana.Copperman@...>; Jochen Kall > <jochen.kall@...>; Lukas Bulwahn > <lukas.bulwahn@...> > Cc: Shuah Khan <skhan@...>; devel@... > Subject: Re: [ELISA Technical Community] What to do in response to a > kernel warning > > EXTERNAL EMAIL: Do not click any links or open any attachments unless > you trust the sender and know the content is safe. > > iiuc, a WARN_ON() will be used by the kernel when it sees something > which it is not expecting and so the system is in an unknown state. 
>
> For example, the WARN_ON() at
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/hibernate.c?h=v5.16-rc2#n106
> will give a warning when the system is about to hibernate but sees that the number of CPUs is not equal to 1, which means more than one CPU is still active and thus the system is in an undefined state.
>
> Will you want to continue running the system when the CPU states are in an undefined state, or will you want the system to panic so that the supervisory system can then bring the system back to a known good state?
>
> --
> Regards
> Sudip
>
> On 22/11/2021 7:36 pm, elana.copperman@... wrote:
>> Jochen, I understand your point exactly.
>> Even with multiple controllers, you do not solve the problem. Let me explain in more detail:
>>
>> 1. Assume you have a safety critical system with a requirement for high availability. And, as we agreed below, including redundant HW.
>> 2. panic_on_warn in all the controllers is an overreaction. And that is my key point.
>> 3. If all the controllers are Linux based, with panic_on_warn, you risk a situation where multiple controllers will fail, because instead of handling the warnings, their only option is to panic and crash.
>>
>> A warning is a "compromised path" but not necessarily a path which should lead to panic.
>> panic_on_warn is the wrong way to deal with warnings in a high-availability system. You cannot ignore warnings, of course; but there must be more fine-tuned fault handling, and redundancy won't help to avoid that problem.
>> It is not only "not very important for safety applications to have this capability", it is potentially dangerous if multiple controllers fail simultaneously on warnings.
>> This claim is true for any safety critical system with a requirement for high availability and graceful degradation. And this claim is not limited to a specific use case (automotive AV was only an example of such a system).
>>
>> ------------------------------------------------------------------------
>> From: devel@... <devel@...> on behalf of Jochen Kall <jochen.kall@...>
>> Sent: Monday, November 22, 2021 6:46 PM
>> To: Elana Copperman <Elana.Copperman@...>; Lukas Bulwahn <lukas.bulwahn@...>
>> Cc: Shuah Khan <skhan@...>; devel@... <devel@...>
>> Subject: Re: [ELISA Technical Community] What to do in response to a kernel warning
>>
>> Hi Elana,
>>
>> you misunderstood my point.
>>
>> What I meant is the following:
>>
>> If shutting off is not safe for a system, as is apparently the case in your example, you simply cannot design it with a single controller; rather, you need several of them, monitoring each other such that if one fails, another can take over, unless you have a magical piece of hardware that can never fail to run it on.
>>
>> (That consideration, btw, is totally independent of the OS used; it applies to all of them.)
>>
>> For such a system, however, continuing to operate a compromised path makes the whole thing potentially less safe.
>>
>> In a fail safe system where shutting off leads to the safe state, the same conclusion applies.
>>
>> That's why I agree with Lukas' position that it is probably not very important for safety applications to have this capability, and why I believe the example you gave does not support your position.
>>
>> That of course does not mean it's a bad idea to have this capability in general (not qualified to judge that ^^), but in safety systems we'd probably switch it off anyway for the reasons outlined above, that's all.
>>
>> Jochen
>>
>> From: Elana Copperman <Elana.Copperman@...>
>> Sent: Monday, November 22, 2021 17:20
>> To: Jochen Kall <Jochen.Kall@...>; Lukas Bulwahn <lukas.bulwahn@...>
>> Cc: Shuah Khan <skhan@...>; devel@...
>> Subject: Re: [ELISA Technical Community] What to do in response to a kernel warning
>>
>> OK, if this is the focus, then we can/should continue here, and not simply annoy LWN.
>>
>> Jochen, redundancy will not help here. If all systems enable panic_on_warn, the redundant systems will all quickly fail.
>>
>> And availability is very much a concern, not only for Automotive AV, but also for medical devices and many other safety critical systems.
>>
>> Panic_on_warn is simply too restrictive and is not "shutting off cleanly"; it is pulling the plug on a running device.
>>
>> What is needed is a well-defined fault handling mechanism which handles faults appropriately, in a more specific way.
>>
>> Shutting down a safety-critical device which requires high availability, for any warning, is not an option.
>>
>> As a rule, warnings should be eliminated by sufficient testing and static analysis before the system hits the road. And if some warning(s) remain in the deployed device, the most practical solution is logging and a software (OTA) update.
>>
>> ------------------------------------------------------------------------
>>
>> From: Jochen Kall <Jochen.Kall@...>
>> Sent: Monday, November 22, 2021 4:03 PM
>> To: Elana Copperman <Elana.Copperman@...>; Lukas Bulwahn <lukas.bulwahn@...>
>> Cc: Shuah Khan <skhan@...>; devel@... <devel@...>
>> Subject: AW: [ELISA Technical Community] What to do in response to a kernel warning
>>
>> Hi Elana,
>>
>> not sure what you refer to, but in any case, I think Lukas is right; I also don't see how the example supports your position.
>>
>> If availability is safety critical, a system without redundancy (Linux based or not) is not a feasible design pattern anyway, and if we are talking about a fail safe system, shutting off cleanly when in doubt is the way to go from a safety perspective.
>>
>> Best regards
>>
>> Jochen
>>
>> From: devel@... <devel@...> on behalf of elana.copperman@...
>> Sent: Monday, November 22, 2021 10:20
>> To: Lukas Bulwahn <lukas.bulwahn@...>
>> Cc: Shuah Khan <skhan@...>; devel@...
>> Subject: Re: [ELISA Technical Community] What to do in response to a kernel warning
>>
>> Lukas, please read carefully and align with what I wrote.
>>
>> In any case, we should park this thread, and move discussion to the LWN thread.
>>
>> Thanks
>>
>> Elana
>>
>> ------------------------------------------------------------------------
>>
>> From: Lukas Bulwahn <lukas.bulwahn@...>
>> Sent: Monday, November 22, 2021 10:58 AM
>> To: Elana Copperman <Elana.Copperman@...>
>> Cc: Shuah Khan <skhan@...>; devel@... <devel@...>
>> Subject: Re: [ELISA Technical Community] What to do in response to a kernel warning
>>
>> On Sun, Nov 21, 2021 at 9:44 AM Elana Copperman <Elana.Copperman@...> wrote:
>>>
>>> Thanks, Shuah, for sharing this important information.
>>> From my experience (and we should hear from others as well!), panic_on_warn as a hard rule is too restrictive. For example, if an autonomous car is driving at 200 km per hour on a German highway, the kernel panic from a warning could be life threatening to the car passengers. What is necessary is appropriate handling for panic_on_warn, to enable the integrator to define follow-up behavior: for example, switch to degraded functionality, or switch to a fault handling application, or panic when relevant. Even killing the specific threads causing the warning should not be the only option. It would then be the integrator's responsibility to configure the appropriate behavior per use case.
>>
>> Sorry, Elana, this argument you are presenting above---with the autonomous car---is very pictorial, but hardly meets reality. What you are writing above suggests that there is no surrounding system and no system engineering in place that ensures degradation and passenger safety within the system that consists of multiple ECUs in a vehicle network.
>>
>>>> That is not at all what was prescribed above.
>>>> panic_on_warn leads to kernel panic on warning. And that is too restrictive, as explained above. If the kernel panics, it will override any system engineering and system features which have been defined.
>>>> Fault handling for warnings needs to be more specific (i.e., not panic).
>>>> And as I wrote above, you need some fault handling application or other system support, switching to degraded functionality or a fault handling application.
>>
>> Of course, it is possible to design such a system you are sketching above, where a single warning leads to a life-threatening event,
>>
>>>> NO, as already stated - kernel panic on warning is the wrong behavior.
>>>> Correct system behavior is outlined above, in original email thread.
>>
>> ...
>>
>> The discussion actually has already moved on; I think the arguments against the suggested pkill_on_warn were overwhelming and now alternatives are discussed.
>>
>>>> Good.
>>>> OK, this discussion should continue in that context. Thanks
>>
>> --
>> Kind regards
>> Jochen Kall
>>
>> --
>> Dr. rer. nat. Jochen Kall
>> Functional Safety
>> ITK Engineering GmbH
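For readers who want to check how a given system is configured: the knob debated in this thread is the `kernel.panic_on_warn` sysctl, which on a Linux system with procfs mounted can be read from `/proc/sys/kernel/panic_on_warn`. The following is only a minimal Python sketch; the helper name `panic_on_warn_enabled` is made up for illustration and is not part of any kernel tooling.

```python
from pathlib import Path

# Location of the sysctl on a Linux system with procfs mounted.
PANIC_ON_WARN = Path("/proc/sys/kernel/panic_on_warn")


def panic_on_warn_enabled(path=PANIC_ON_WARN):
    """Return True if the sysctl file at `path` holds a non-zero value.

    `path` is a parameter so the function can be exercised against an
    ordinary file; the helper itself is hypothetical.
    """
    return Path(path).read_text().strip() != "0"
```

Calling `panic_on_warn_enabled()` with no argument reads the live setting. Changing the value requires root, for example through `sysctl kernel.panic_on_warn=1` or the `panic_on_warn` boot parameter.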
|
Paul Sherwood
On 2021-11-23 18:41, elana.copperman@... wrote:
> Warnings MUST be managed. No argument about that.

Sorry if I'm missing it, but I think maybe we're agreeing?

Folks taking responsibility for safety could/should

- review all warnings in the code, and take steps to manage them in their use case

and/or

- perform risk analysis for all warnings discovered during testing, and put appropriate mitigations in place
- panic for all warnings not otherwise handled

br
Paul
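The policy Paul outlines (handle every reviewed warning explicitly, panic on everything else) can be sketched as a small lookup. This is only an illustration in Python, not kernel code: the names `Action`, `decide`, and the example warning site are hypothetical, and how a real system would identify a warning site or execute a mitigation is left open, as it is in the thread.

```python
from enum import Enum


class Action(Enum):
    MITIGATE = "mitigate"  # warning was reviewed and has a defined reaction
    PANIC = "panic"        # warning was never reviewed: fail to the safe state


def decide(site, reviewed):
    """Pick a reaction for a warning raised at `site`.

    `reviewed` maps warning sites (hypothetical "file:line" strings) to the
    mitigation chosen during code review or risk analysis. Any site that is
    missing from the mapping falls through to PANIC.
    """
    mitigation = reviewed.get(site)
    if mitigation is None:
        return Action.PANIC, "warning not otherwise handled"
    return Action.MITIGATE, mitigation


# Hypothetical table built from review and test campaigns.
REVIEWED = {
    "drivers/example/foo.c:123": "log and switch to degraded mode",
}
```

With this table, `decide("drivers/example/foo.c:123", REVIEWED)` selects the recorded mitigation, while any site that was never reviewed yields `Action.PANIC`. That matches the third bullet above: panic is the default, not the only option.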
|
elana.copperman@...
1000% agree with every word you wrote.
Panic_on_warn is just not a single solution for all warnings. The point is that we need some tweaking on panic_on_warn in order to have a solution which is derived from safety requirements, feasible, and a true contribution to Linux kernel safety.

-----Original Message-----
From: Paul Sherwood <paul.sherwood@...>
Sent: Tuesday, November 23, 2021 9:48 PM
To: Elana Copperman <Elana.Copperman@...>
Cc: Sudip Mukherjee <sudip.mukherjee@...>; Jochen Kall <jochen.kall@...>; Lukas Bulwahn <lukas.bulwahn@...>; Shuah Khan <skhan@...>; devel@...
Subject: Re: [ELISA Technical Community] What to do in response to a kernel warning

On 2021-11-23 18:41, elana.copperman@... wrote:
> Warnings MUST be managed. No argument about that.

Sorry if I'm missing it, but I think maybe we're agreeing?

Folks taking responsibility for safety could/should

- review all warnings in the code, and take steps to manage them in their use case

and/or

- perform risk analysis for all warnings discovered during testing, and put appropriate mitigations in place
- panic for all warnings not otherwise handled

br
Paul
|
Luis Chamberlain
On Tue, Nov 23, 2021 at 07:50:36PM +0000, elana.copperman@... wrote:
> 1000% agree with every word you wrote.

The patch that was actually merged in place of the "panic on warn" proposal was "panic: use error_report_end tracepoint on warnings":

https://lore.kernel.org/all/20211115085630.1756817-1-elver@google.com/T/#u

So one can monitor the error_report event tracepoint and look for warnings. That should make it easier to monitor for kernel warnings in userspace without having to poll the kernel logs.

Marco, can you extend the documentation on how to use this so safety-critical folks can start using it?

Luis
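As a rough idea of what such userspace monitoring could look like, here is a minimal Python sketch that filters trace output for warning reports. It rests on assumptions that are not stated in this thread: that tracefs is mounted at `/sys/kernel/tracing`, that the event has been enabled through `events/error_report/error_report_end/enable`, and that each event is printed as `error_report_end: [<detector>] <hex id>` with the detector shown as `warn` for warnings. The names `ErrorReport`, `parse_reports`, `warnings_only`, and `watch` are invented for this example.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Assumed tracefs location; older setups may use /sys/kernel/debug/tracing.
TRACE_PIPE = "/sys/kernel/tracing/trace_pipe"

# Assumed textual form of the event: "error_report_end: [warn] ffffffff8123abcd"
_EVENT = re.compile(
    r"error_report_end:\s*\[(?P<detector>\w+)\]\s+(?P<id>(?:0x)?[0-9a-fA-F]+)"
)


class ErrorReport(NamedTuple):
    detector: str   # e.g. "warn", "kasan", "kfence"
    report_id: int  # value printed by the tracepoint, parsed as hex


def parse_reports(lines: Iterable[str]) -> Iterator[ErrorReport]:
    """Yield one ErrorReport per trace line that matches the assumed format."""
    for line in lines:
        match = _EVENT.search(line)
        if match:
            yield ErrorReport(match["detector"], int(match["id"], 16))


def warnings_only(lines: Iterable[str]) -> Iterator[ErrorReport]:
    """Keep only the reports whose detector is "warn"."""
    return (r for r in parse_reports(lines) if r.detector == "warn")


def watch(path: str = TRACE_PIPE) -> None:
    """Block on the trace pipe and print warnings as they arrive (needs root)."""
    with open(path) as pipe:
        for report in warnings_only(pipe):
            print(f"kernel warning reported, id={report.report_id:#x}")
```

Reading `trace_pipe` blocks until events arrive, so a monitor built this way reacts to a warning without polling the kernel log. What it should do next (degrade, restart a service, request a reboot) is the system-level decision debated above.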
|
Shuah Khan
On 11/23/21 12:47 PM, Paul Sherwood wrote:
> On 2021-11-23 18:41, elana.copperman@... wrote:
>> Warnings MUST be managed. No argument about that.
> Sorry if I'm missing it, but I think maybe we're agreeing?

If anybody is curious about the scope of this work:

git grep WARN_ON shows about 18486 usages (excluding the definition). That isn't surprising, as WARN_ON is used as a debug mechanism by some drivers.

In general, WARN_ON use is discouraged and is supposed to be limited to cases where there is no other choice but to panic. In reality, however, there are several usages that are just for debug.

thanks,
-- Shuah
|
On Tue, Nov 23, 2021 at 10:30 PM Shuah Khan <skhan@...> wrote:
And a big thanks to Elana for starting this; there is no need to look at all 18486 usages at first. It would be a good start to just look at the 1614 WARN_ON* in the kernel/ directory. Please either eliminate them in the code by handling them in place in that context, or describe a reliable reaction that can be executed in user-space and brings the kernel back into a proper operating mode despite the state after the existing warning.

Also, https://syzkaller.appspot.com/upstream will point you to warnings that we encountered during our kernel testing campaigns.

Good luck.

Lukas
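To get a feel for how figures like these break down by subsystem, the following Python sketch counts lines containing `WARN_ON` per top-level directory of a source tree. It is an approximation, not a replacement for `git grep`: it only looks at `.c` and `.h` files, counts matching lines rather than call sites, and the function name `count_warn_on` is made up for this example. The numbers quoted above were not produced with it.

```python
import os
import re
from collections import Counter

# Substring match, so WARN_ON_ONCE and similar variants are counted as well.
_PATTERN = re.compile(r"WARN_ON")


def count_warn_on(root):
    """Count lines mentioning WARN_ON under `root`, grouped by top-level entry."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith((".c", ".h")):
                continue
            path = os.path.join(dirpath, name)
            # First path component below root, e.g. "kernel" or "drivers".
            top = os.path.relpath(path, root).split(os.sep)[0]
            with open(path, errors="replace") as source:
                counts[top] += sum(1 for line in source if _PATTERN.search(line))
    return counts
```

Run against a kernel checkout, `count_warn_on("linux")["kernel"]` gives a number comparable in spirit to the kernel/ figure above, and `count_warn_on("linux").most_common(5)` shows where most of the usages are concentrated.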
|