RFC - Discovering Linux kernel subsystems used by a workload


Shuah Khan
 

All,

Please review the document that outlines the process to get
insight into the resources used by a workload.

Shefali Sharma and I identified a process for gathering fine-grained
information about the system resources necessary to run a
generic workload on Linux.

This process can then be applied to any workload, including
individual commands and important use-cases in that workload.
For example, it can show which subsystems are used when a user
queries the insulin pump status while the OpenAPS workload is running.

In addition, this process can be used by System Integrators to gain
insight into the resources used by their workloads.

Please review and give us feedback. Once this review is complete, we
will upload the document to GitHub.

https://docs.google.com/document/d/1OgbTDFdrWtQTCYoRwNIZMQPhGhnHbyughLXrLfiaTi4/edit#

thanks,
-- Shuah & Shefali


Jonathan Moore <jandcmoore@...>
 

Very interesting, and some good information about the tools available, thank you.

Have you given any thought to, or do you have information that supports, the veracity of the measurements? I.e., do the results actually match what is going on? How do we verify that? Are the results between tools generally in agreement, or are some results tuned for different workloads? Does the point in time of measurement make a difference? Can peak load, e.g. during application startup, be differentiated from a low-demand mode? What instances have you seen where the reports are too low or too high, e.g. situations where the reported measurements exceed what is actually available? What effect does measuring have on the system/task being measured? Can these measurements be made at the same time as high-stress workloads without impacting the workload? How does one deal with Nyquist? When a task is zombie/dead/lost, do the measurements indicate this, and how quickly? Do you have a set of 'test' loads to explore all of this?

That might be enough questions for now. :-)

Jonathan


On Wed, Aug 3, 2022, 12:55 PM Shuah Khan <skhan@...> wrote:
[...]







Shuah Khan
 

Please see inline.

On 8/5/22 11:22 AM, Jonathan Moore wrote:
> Very interesting, and some good information about the tools available, thank you.
> Have you given any thought to, or do you have information that supports, the veracity of the measurements? I.e., do the results actually match what is going on?

The goal is to get insight into the system calls and ioctls invoked
by a workload. Results do match the system activity for that
workload. Running strace on the "ls" command will tell you the system
activity for that command.
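
To make the "ls" example concrete, here is a minimal sketch of how strace output could be reduced to a per-system-call count. It is not taken from the document under review. The helper names `count_syscalls` and `trace_command` are hypothetical, and the sketch assumes strace is installed and that its default line-oriented output format is used (with the `-f` and `-o` options to follow child processes and write the trace to a file).

```python
import os
import re
import subprocess
import tempfile
from collections import Counter

# Matches the system call name at the start of an strace line. With -f the
# line may be prefixed by a bare PID ("1234 ") or by "[pid 1234] ".
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def count_syscalls(trace_text):
    """Count how often each system call name appears in strace output."""
    counts = Counter()
    for line in trace_text.splitlines():
        match = SYSCALL_RE.match(line)
        if match:  # skips signal ("---") and exit ("+++") lines
            counts[match.group(1)] += 1
    return counts


def trace_command(argv):
    """Run argv under strace (assumed to be installed) and count its calls."""
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "trace.log")
        # -f follows child processes, -o writes the trace to a file.
        subprocess.run(["strace", "-f", "-o", log, *argv],
                       check=True, stdout=subprocess.DEVNULL)
        with open(log) as fh:
            return count_syscalls(fh.read())
```

Calling `trace_command(["ls"])` and printing `most_common()` on the result would list the system calls made by "ls", most frequent first. Calls such as ioctl show up by name, so the same count covers both system calls and ioctls at the level of the call itself.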

> How do we verify that? Are the results between tools generally in agreement, or are some results tuned for different workloads? Does the point in time of measurement make a difference? Can peak load, e.g. during application startup, be differentiated from a low-demand mode? What instances have you seen where the reports are too low or too high, e.g. situations where the reported measurements exceed what is actually available? What effect does measuring have on the system/task being measured? Can these measurements be made at the same time as high-stress workloads without impacting the workload? How does one deal with Nyquist? When a task is zombie/dead/lost, do the measurements indicate this, and how quickly? Do you have a set of 'test' loads to explore all of this?

The only tool we are using here is strace, and the same tool is used on
three workloads. Results are not tuned for a workload.

Please keep in mind, the goal here is understanding the system footprint
of a workload. It isn't the goal to see how a workload behaves under varying
loads.

The goal is to give system integrators a tool and a process to follow to
get insight into their workloads. This information can then be used to
develop a plan for gathering evidence for certification.

Hope this helps clarify the goals of this work.

thanks,
-- Shuah