Yoav Etsion Tal Ben-Nun Dan Tsafrir Dror G. Feitelson
School of Computer Science and Engineering
The Hebrew University, 91904 Jerusalem, Israel
0.98
Klogger is a framework for post mortem analysis of the Linux kernel behavior. It is designed so that developers can easily define events they want logged and log them in the kernel code. At runtime, the Klogger infrastructure is enabled for the period of time the developer wishes to analyze, logging all events into a special log file, which can later be analyzed.
Klogger excels at it's low overhead and flexibility, allowing researchers to analyze performance bottlenecks that are very difficult to reach with standard logging tools, while not requiring the user to become a kernel hacker.
This document is an attempt to allow the research community to gain experience with Klogger, while possibly providing feedback about the tool and its features.
Klogger flexibility is embodied in its support for different schemata -- different disjoint event sets, dedicated to the analysis of different subsystems. Schemata enable users to analyze only the subsystem that interests them, and create new Klogger schemata specific for their research (read more about schemata in sections 3 and 7).
We are trying to collect a host of schemata which will allow both developers and researchers to study the various kernel subsystems without having to dive into the actual implementations. The best example is a storage systems researcher who does not have to fully understand the implementation of the subsystem and driver in one kernel or another, as long as he can get information about the blocks read from/written to storage, their order, timings etc. By supplying a schema logging the major events of the storage subsystem, we can allow more people to be involved in cutting edge research. The same goes for most parts of modern kernels -- networking, scheduling, synchronization, file systems, etc.
Klogger is logging framework, and as such, needs some efficient way to save runtime events onto persistent storage. This section will try to overview the basic mechanism used by Klogger. (defining events and logging them is described in Section 3).
Klogger employs per CPU memory resident buffers, to which logging actually takes place. The logging code first checks if Klogger is enabled. If so, space is allocated on the relevant CPU's buffer, and the event's payload is saved there in the form of a C structure. The buffer is configured to not only be memory resident, but also uses Intel's write-combining cache policy [5], allowing efficient sequential writes and preventing the data from polluting the caches.
Each CPU is also attached with a special thread in charge of dumping the memory buffer to disk whenever the latter fills. These threads run at the highest real-time priority in the system, under the SCHED_FIFO scheduling policy [4,1]. Klogger does not actually wait for the buffer to fill, but rather has a user settable low-water mark indicating what percent of the buffer should be used as a reserve. Whenever the low-water mark is reached, Klogger signals the Linux scheduler to run the appropriate CPU's thread, in order to empty the buffer.
When using Klogger to analyze a subsystem that can be affected by the memory dump operation, events caused by Klogger's own action might cause the buffer to fill up, thus causing the loss of other events. Such cases can be identified by gaps in the logged events' serial numbers. Another possible problem is if logging occurs too often the overhead caused by emptying the buffer might affect the measurement. For such cases, Klogger provides the user with an interface to increase the buffer size (defaulting at 4MB per CPU) and change the low-water mark. This interface is described in Section 5.5. Experience shows that increasing the buffer is sufficient even in the more extreme cases and allows for logging millions of events per second.
One warning though: albeit a very efficient framework, we must strictly advise against using Klogger in production systems. Klogger is designed to be a framework used by researchers and developers. As such, some guidelines that are usually required by subsystem designers were relaxed -- for example: the decision to allow some minimal (and even negligible) overhead even when logging is disabled.
A more comprehensive discussion of Klogger's design and implementation is available in Section 8 and in [3].
A Klogger schema defines a specific facet of Klogger. One way of looking at it is that the Klogger framework supplies the mechanism -- how to log -- while the schemata supply the policy -- what to log.
Technically, when we refer to a Klogger schema we refer two files: a configuration file defining the events to be logged and their payload, with the other file being a kernel patch containing the invocations of klogger -- where are the events defined in the first file actually get logged.
The next two sections will describe how to define and log events. The first will describe the syntax of the Klogger configuration file, and the second will describe how to log events.
Describing the syntax of the configuration file is split into three parts:
To clear up the picture, we will accompany those sections with a simple example: an event named SCHEDOUT which indicates that the CPU scheduler has switched out the running process, logging its pid.
Klogger's configuration files reside in the kernel source root and are named ``.klogger.conf.NAME'' -- where NAME is the name of the schema they define. The separation into multiple configuration files allows great flexibility when adding or removing logging schemata. All Klogger enabled kernels must have at least one configuration file, named ``.klogger.conf.base'' which is included alongside the main Klogger kernel patch and contains definitions for the build in Klogger events, such as when it was enabled/disabled, what processes are running when it's enabled, etc. This file can serve as a basic example of a Klogger configuration file.
Events are declared in a syntax similar to C structs. The basic template is:
event EVENTNAME { typeA varname1; typeB varname2; ... typeN varnameX; }with the various types known to Klogger listed in Table 1. Note that the usage of strings is highly discouraged since it requires the use of buffer copy functions, dramatically increasing the logging overhead (though strings up to the size of the underlying hardware's cache line are optimized into simple long assignments).
|
In our SCHEDOUT example, the event will thus be defined as:
event SCHEDOUT { int pid }and will appear in the final text log as:
{ header => { "type" => "SCHEDOUT", "serial" => "119", "timestamp" => "1032071755760", }, "pid" => "1073", },
Though extracting the text based log is described in Section 5.2, for now simply note that each event appears as a Perl hash. The hash contains all the fields described in the event definition and a header containing the event type (as text), its serial number in the log and a timestamp taken from the hardware's cycle counter.
Enumerations are declared in a syntax similar to C enums. The basic template is:
enum enumerationtype { one, minusone = -1, four = 4, TWO = 2, Three }The type ``enumerationtype'' may be later used in an event as a type. Integer values will automatically be converted to their equivalent strings when decoding Klogger's output.
Event inheritance is an advanced feature of Klogger, enabling analysts the definitions of inclusive groups of events, thereby defining different log levels.
The inheritance model, allows schema authors to specify the levels of the various events logged by Klogger and include log levels in other log levels using the ``include'' command.
Similar to C++ syntax, the currently used log level will be declared before all log level declarations with the ``using'' command.
Furthermore, when a schema author would like to omit an event from an included log level, he may use the ``omit'' command. Such action could be done in order to filter out information unneeded in a specific log level, thereby reducing the information generated by Klogger.
Event inheritance syntax is as follows:
using two loglevel one < event ONE_DO { ... } event ONE_SIMPLE { ... } > loglevel two < include one omit ONE_SIMPLE event TWO_DETAILED { ... } >
A more detailed example of log levels can be found in the network schema's configuration file.
Logging events is performed using a special C macro called klogger. First, we'll need to include the main Klogger header file:
#include <linux/klogger.h>
The syntax for the klogger macro resembles that of printf, with the first argument being the event name, followed by values for each of the event fields as ordered in the event description. The call for our general event template will thus be:
klogger(EVENTNAME, varname1, varname2, ... , varnameX);
This call is converted by the C preprocessor into an inlined event specific logging function. However, If the kernel is compiled without Klogger support, this call converts into a null expression by the preprocessor.
Sometimes special code is needed (like using temporary variables) for non-trivial logging. Such code should be protected using an ifdef statement checking the existence of the CONFIG_KLOGGER macro, so it can be eliminated if the Klogger is not configured into the kernel.
Returning to our SCHEDOUT example, the call logging the event will simply be:
klogger(SCHEDOUT, task->pid);assuming task is a variable of type struct task_struct* pointing to the process that was just preempted off the CPU.
Most modern processors support hardware performance counters, measuring various aspects of the processor's operations. These counters can provide information varying from the number of cache misses for each cache level, through the number of retired instructions, and sometimes even the number of memory operations queued on the processor. Needless to say, these counters are processor dependant. Moreover, these counters may not even be backward compatible with previous versions of the same architecture.
To make Klogger as efficient as possible, all pieces of code accessing the hardware performance counters are inlined. The result is that all abstractions used are taking place at compile time, incurring no runtime overhead from abstract indirections (as oppose to the approach taken by other tools, such as PAPI [2]). In that context, user should be warned that since the counters are processor model dependant, code compiled with performance counter support for one model may not run on another! Simply put, kernels compiled to utilize PentiumIII's performance counters might not even boot on a PentiumIV machine.
In order to use the hardware counters, we must add an architecture section to the configuration file, specifying the underlying hardware and binding Klogger's virtual counters to specific events supported by the hardware.
The basic structure is
arch ARCH_NAME { counter1 HARDWARE_COUNTER_NAME_A; counter2 HARDWARE_COUNTER_NAME_B; ... counterN HARDWARE_COUNTER_NAME_X; }where ARCH_NAME is the architecture name (i.e. PentiumPro, PentiumIII, PentiumIV, etc.), 1counter1...N are the virtual counter names, and HARDWARE_COUNTER_NAME_A...X are the hardware specific counter names. The list and descriptions of the hardware specific counter names supported for the given architecture can be viewed by running the command
./scripts/klogger-autogen.pl --list-countersfrom the Klogger's kernel source root. For example, the counters currently supported for the PentiumIV architecture are listed in table 2.
|
Let's return to our SCHEDOUT example for a moment. Imagine we want Klogger to log not only the preempted process's pid, but also to log the number of overall L2 cache misses it has suffered.
Since we only want the event occurring when the process was running (regardless of any interrupts handled by the kernel during that time), we need to bind the first virtual counter to the underlying hardware's counter of the L2 cache misses occurring at the user protection level -- the l2_cache_misses_user counter. As such, the architecture dependant section in the configuration file would look like this:
arch PentiumIV { counter1 l2_cache_misses_user; }
Also, we would add another 64bit integral field to the SCHEDOUT event, saving the counter's value at the moment of preemption:
event SCHEDOUT { int pid ulonglong l2_cache_misses_user }which means the textual log will hold the event as
{ header => { "type" => "SCHEDOUT", "serial" => "119", "timestamp" => "1032071755760", }, "pid" => "1073", "l2_cache_misses_user" => "35678014", },
The next section describes how to read the hardware's performance counters, and how to log our extended SCHEDOUT event.
Reading the performance counters is done by two types of functions: The first type is generic, accepting which counter to read as an argument. The second type is logical, with the event name being part of the function's name. This separation allows the programmer two different modus operandi:
Both types are implemented using compile time generated inlined functions. The reasoning behind this implementation is that while it would be simpler to just generate a logical to physical counter map and have a function use that map, such method involves more overhead and cache pollution.
The first, counter-based interface is defined as:
unsigned long klogger_get_counter(unsigned long ctr); unsigned long long klogger_get_counter_ll(unsigned long ctr);where the first function returns just the lower 32 bits of the 64 bit counter, and the second function returns the entire 64 bit number. The ctr argument is a macro, named KLOGGER_COUNTERN where N is the counter number -- i.e. KLOGGER_COUNTER1, KLOGGER_COUNTER4, etc.
The second function type is declared as:
unsigned long klogger_get_EVENTNAME() unsigned long klogger_get_EVENTNAME_ll()where EVENTNAME is the actual name used in the configuration file.
Returning once again to our SCHEDOUT example, we can read the counter we defined using either function type. We bound logical counter 1 to the l2_cache_misses_user event. Thus, using the first function type, logging would take place using the following line:
klogger(SCHEDOUT, task->pid, klogger_get_counter(KLOGGER_COUNTER1);
If we want to use the function based on the event names, we just use the following line:
klogger(SCHEDOUT, task->pid, klogger_get_l2_cache_misses_user());
Sometimes the logging is not completely straightforward, involving some code preparing the data to be logged. In such cases, it may be important to know for which architecture the kernel was compiled. For that reason we use the KLOGGER_ARCH_COMPILED macro which is set to one of the klogger_arch_t enumeration values, defined in include/asm/klogger.h.
The PentiumIV architecture's performance counters support many different events, only few of which we support (Table 2). Adding support for other counters is not difficult, but rather time consuming. As such, we have decided to take the lazy road -- each programmer is welcome to add support for the events she needs.
Adding support is very simple. It only involves copying bit configuration from the long list of events listed in Appendix A of [5] (available online at http://developer.intel.com/design/Pentium4/documentation.htm) into the Perl hash named %EVENTS found in KERNEL_ROOT/scripts/klogger/PentiumIV.pm (the module implementing the hardware counters for the PentiumIV architecture). The copying is straight forward as each entry in the hash uses similar naming as that in the architecture's manual, with only two new fields: the event name, and its description. Please send us any new hash entries you add so it can be added to the main distribution.
For simplicity, let us assume you have opened the Klogger distribution into a directory called KLOGGER.
Adding Klogger support is a simple process:
foo[ linux-2.6.9 ] patch -p1 < KLOGGER/klogger-2.6.9/patch/klogger-2.6.9.patch foo[ linux-2.6.9 ] cp KLOGGER/klogger-2.6.9/patch/klogger.conf.base .klogger.conf.base
foo[ linux-2.6.9 ] patch -p1 < \ KLOGGER/klogger-2.6.9/schemata/sched/klogger-scheduler-schema-2.6.9.patch foo[ linux-2.6.9 ] cp KLOGGER/klogger-2.6.9/schemata/sched/klogger.conf.sched \ .klogger.conf.sched
Now compile the kernel, and log like the wind...
Klogger offers a control interface through the /proc filesystem (this is actually sysctl options exported through the /proc filesystem). All the control files reside under /proc/sys/klogger. The convention is that event names are in upper case, so all files whose name consists of upper case letters represent events, while lower-cased files represent other configuration options.
Note that Klogger's setting cannot change while it is enabled!
Klogger's main control switch -- toggling logging on or off -- is done via the /proc/sys/klogger/enabled file. By writing the text string ``1'' or ``0'' into that file, you can enable or disable logging, respectively.
Logs are saved in binary form into per-CPU files residing under the /tmp directory. These dump files are named /tmp/kernlog.dump.cpuN where N is the CPU number. The dump files might get very big, depending on the rate by which events are logged -- as such, make sure you have adequate space on the filesystem inside which /tmp resides.
Note of caution for SMP users: inter-CPU synchronization is based on the Linux kernel synchronization, and we have not tested its accuracy, nor the drift between CPU clocks. While the assumption that the drift is very small due to the fact that different CPUs on the same machine experience the same operating conditions (temprature/power), the same does not hold for the accuracy of the mechanism resetting all CPU clocks during Linux boot time. This is definitely on our TODO list.
When building Klogger with a specific set of schemata, a decoding utility is created for that specific kernel able to decode the binary log into a more human friendly, text formatted file. As mentioned earlier, the text format is basically an array of Perl Hashes, each represeting an event.
Assuming your Klogger kernel sources reside under KERN_ROOT, the decoder is compiled into
KERN_ROOT/kdecoder/klogger_decoder
This tool's operation is very simple: its only command line argument is the binary log, whose textual conversion is printed into the standard output. As such, the usual command line would look like:
KERN_ROOT/kdecoder/klogger_decoder /tmp/kernlog.dump.cpu0 > kernlog.cpu0.txt
Sometimes a Klogger is compiled with a broad set of events, but during certain measurement one would like to log only a subset of these. As such, Klogger is equipped with a simple mechanism to block logging of specific events. Under the /proc/sys/klogger directory, you can find a special file named after each of the events compiled in. This special file is an event specific toggle switch. By writing the strings ``0'' or ``1''you can block or unblock logging of the specific event, respectively. Note that the overhead imposed by an event's logging code when it is disabled -- either specifically, or because Klogger is disabled entirely -- is similar.
Klogger has several base events, used to account for invocation times, buffer dump times, and the state of the system when the logging starts -- i.e. all the processes in the system.
These events are:
Changing the buffer size is done via the /proc/sys/klogger/buffer_size file by simply writing a new size in units of MB (the size refers to the buffer of each CPU, so the actual memory allocated is , where is the number of processors in the system). Reading the file shows the buffer size allocated for each CPU.
For example, if we want to change the buffer size to 16MB per CPU right after boot the command set would be:
foo[ /proc/sys/klogger ] cat buffer_size 4MB/CPU foo[ /proc/sys/klogger ] echo 16 > buffer_size foo[ /proc/sys/klogger ] cat buffer_size 16MB/CPU(note that although writing to the file is done with only numbers, when reading the file the units are appended to the data).
To avoid stressing the memory by allocating large, physically continuous buffers Klogger uses virtually continuous buffers. However, there is a limit on the kernel's virtual memory size, which is set to 128MB on Linux 2.6.9 kernels. If the total memory allocated exceeds this limit, you can boot the machine with the special vmalloc boot time flag, resetting the kernel's virtual memory limit. For example, when using 128MB buffers on a 4-way machine, we added the vmalloc=640MB boot parameter so the kernel will have enough virtual memory for both the needed by Klogger, alongside the original virtual memory space it uses.
Resetting the low-water mark is done in a similar manner, using the /proc/sys/klogger/lowwater file. The number written to that file is the percent of the memory buffer space below which the buffer will be emptied (default is 10%). For example:
foo[ /proc/sys/klogger ] cat lowwater 10% foo[ /proc/sys/klogger ] echo 5 > lowwater foo[ /proc/sys/klogger ] cat lowwater 5% foo[ /proc/sys/klogger ]
If the kernel was configured with Klogger's internal benchmarking mechanism (kernel option [Kernel hacking --> KLogger benchmarks] enabled), you will find a directory called /proc/sys/klogger/benchmarks, in which there is a file matching each event's name. The file's contents is the event's logging overhead, in CPU cycles.
When a file is read, the matching event is logged several times to measure the average overhead. The number of iterations can be tuned using the special /proc/sys/klogger/benchmarks/iterations file (default is 1000 iterations).
For example, on our PentiumIV machine clocked at 2.8GHz the overhead of the KLOGGER_START event is 234 cycles (83.6 nanoseconds!), while the overhead of the more complex PROCESS_EXISTS (which contain character strings) is 16614 cycles (5.93us).
<433>>
Perl is a scripting language providing rapid development for both text analysis and execution of specific workloads. As such, Klogger includes a Perl modules to simplify both logging of workloads and the analysis of the logs.
When examining the effects of specific workloads on kernel behavior using Klogger, Perl can be used to automatically generate these workloads by running sequences of the needed applications. Afterwards, Perl can serve as a great tool to analyze log files, generate statistics and look for kernel behavioral patterns expressed by the various events.
In order for Perl to load the module, add the directory in which the module resides to the PERL5LIB environment variable.
The first group can be regarded as Klogger control functions, used in automated workload generating scripts. These are just wrappers for the basic shell commands introduces so far:
The second group of functions is used for log analysis:
Using the log analysis functions, analysis scripts can be built around the code depicted in Algorithm 1.
As previously described, a major goal when designing Klogger was to separate the mechanism from the policy. While the core of the framework is regarded as the mechanism, the schemata are the available policies. In this section we describe the schemata that are supplied with Klogger itself. We hope users will develop more schemata, sharing them with us and the community, so that future versions of Klogger will deliver a variety of subsystems' schemata.
The stopwatch schema is brought here mainly to serve an example. As its name suggests, it is simply a stopwatch: it has two events, EVENT_START and EVENT_STOP, marking the beginning and end of some events, respectively. Since it is a generic stopwatch, the schema does not even has a matching kernel patch, simply because it is aimed a measuring anything -- just place the klogger calls before and after the event you wish to measure.
Each events has an id field, to match a start event to its adjoining stop event, and to separate different events from each other. By checking the timestamp in the headers of both events, you can measure the cycles the event in question took.
To measure events, simply use the following template in your kernel code:
klogger(EVENT_START, user_chosen_event_id); /* The event in question... */ klogger(EVENT_STOP, user_chosen_event_id);
Use this basic schema as a template, adding more data to the event payload to get a feeling of klogger's operation.
The scheduler schema consists 8 events depicting all process states (to view the exact payload of each process look at the schema's configuration file):
This schema tries to catch the essence of Linux's locking mechanism -- the subsystem used to achieve inter-processor mutual exclusion. Linux uses the busy-wait approach for locking, and three major lock type:
A more detailed discussion of locking in Linux can be found in [1] (and a general discussion in [6]).
The events included in this schema are:
The networking schema enables the user to run networking based tests and analyze the data being transferred through the layers. Currently supported layers are the socket layer, the protocol(TCP/UDP/ICMP) layer and the IP layer.
The schema consists of 6 enumerations and 38 events, separated to log levels using Klogger's event inheritance system, the top level being the highest layer. That way, an analyst may choose the amount of logging he requires of Klogger and receive the wanted results in compile time. No modifications to kernel code are required.
Currently, we only support the IPv4 protocol stack.
TBD (read the technical report [3] for now)
This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.71)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -notop_navigation -nobottom_navigation -split 0 -local_icons -mkdir -dir html Klogger-HOWTO.tex
The translation was initiated by Yoav Etsion on 2007-09-09