The Klogger HOWTO

Yoav Etsion Tal Ben-Nun Dan Tsafrir Dror G. Feitelson

School of Computer Science and Engineering

The Hebrew University, 91904 Jerusalem, Israel

0.98

What is Klogger?

Klogger is a framework for post mortem analysis of the Linux kernel behavior. It is designed so that developers can easily define events they want logged and log them in the kernel code. At runtime, the Klogger infrastructure is enabled for the period of time the developer wishes to analyze, logging all events into a special log file, which can later be analyzed.

Klogger excels at it's low overhead and flexibility, allowing researchers to analyze performance bottlenecks that are very difficult to reach with standard logging tools, while not requiring the user to become a kernel hacker.

This document is an attempt to allow the research community to gain experience with Klogger, while possibly providing feedback about the tool and its features.

Klogger flexibility is embodied in its support for different schemata -- different disjoint event sets, dedicated to the analysis of different subsystems. Schemata enable users to analyze only the subsystem that interests them, and create new Klogger schemata specific for their research (read more about schemata in sections 3 and 7).

We are trying to collect a host of schemata which will allow both developers and researchers to study the various kernel subsystems without having to dive into the actual implementations. The best example is a storage systems researcher who does not have to fully understand the implementation of the subsystem and driver in one kernel or another, as long as he can get information about the blocks read from/written to storage, their order, timings etc. By supplying a schema logging the major events of the storage subsystem, we can allow more people to be involved in cutting edge research. The same goes for most parts of modern kernels -- networking, scheduling, synchronization, file systems, etc.

How does Klogger Work?

Klogger is logging framework, and as such, needs some efficient way to save runtime events onto persistent storage. This section will try to overview the basic mechanism used by Klogger. (defining events and logging them is described in Section 3).

Klogger employs per CPU memory resident buffers, to which logging actually takes place. The logging code first checks if Klogger is enabled. If so, space is allocated on the relevant CPU's buffer, and the event's payload is saved there in the form of a C structure. The buffer is configured to not only be memory resident, but also uses Intel's write-combining cache policy [5], allowing efficient sequential writes and preventing the data from polluting the caches.

Each CPU is also attached with a special thread in charge of dumping the memory buffer to disk whenever the latter fills. These threads run at the highest real-time priority in the system, under the SCHED_FIFO scheduling policy [4,1]. Klogger does not actually wait for the buffer to fill, but rather has a user settable low-water mark indicating what percent of the buffer should be used as a reserve. Whenever the low-water mark is reached, Klogger signals the Linux scheduler to run the appropriate CPU's thread, in order to empty the buffer.

When using Klogger to analyze a subsystem that can be affected by the memory dump operation, events caused by Klogger's own action might cause the buffer to fill up, thus causing the loss of other events. Such cases can be identified by gaps in the logged events' serial numbers. Another possible problem is if logging occurs too often the overhead caused by emptying the buffer might affect the measurement. For such cases, Klogger provides the user with an interface to increase the buffer size (defaulting at 4MB per CPU) and change the low-water mark. This interface is described in Section 5.5. Experience shows that increasing the buffer is sufficient even in the more extreme cases and allows for logging millions of events per second.

One warning though: albeit a very efficient framework, we must strictly advise against using Klogger in production systems. Klogger is designed to be a framework used by researchers and developers. As such, some guidelines that are usually required by subsystem designers were relaxed -- for example: the decision to allow some minimal (and even negligible) overhead even when logging is disabled.

A more comprehensive discussion of Klogger's design and implementation is available in Section 8 and in [3].

Klogger Schemata

A Klogger schema defines a specific facet of Klogger. One way of looking at it is that the Klogger framework supplies the mechanism -- how to log -- while the schemata supply the policy -- what to log.

Technically, when we refer to a Klogger schema we refer two files: a configuration file defining the events to be logged and their payload, with the other file being a kernel patch containing the invocations of klogger -- where are the events defined in the first file actually get logged.

The next two sections will describe how to define and log events. The first will describe the syntax of the Klogger configuration file, and the second will describe how to log events.

Klogger Configuration File

Describing the syntax of the configuration file is split into three parts:

Configuration file naming (Section 3.1.1).
Defining Klogger events (Section 3.1.2).
Using the optional hardware performance counters (Section 3.3).

To clear up the picture, we will accompany those sections with a simple example: an event named SCHEDOUT which indicates that the CPU scheduler has switched out the running process, logging its pid.

Configuration File Naming

Klogger's configuration files reside in the kernel source root and are named ``.klogger.conf.NAME'' -- where NAME is the name of the schema they define. The separation into multiple configuration files allows great flexibility when adding or removing logging schemata. All Klogger enabled kernels must have at least one configuration file, named ``.klogger.conf.base'' which is included alongside the main Klogger kernel patch and contains definitions for the build in Klogger events, such as when it was enabled/disabled, what processes are running when it's enabled, etc. This file can serve as a basic example of a Klogger configuration file.

Declaring Events

Events are declared in a syntax similar to C structs. The basic template is:

event EVENTNAME {
    typeA varname1;
    typeB varname2;
    ...
    typeN varnameX;
}

with the various types known to Klogger listed in Table 1. Note that the usage of strings is highly discouraged since it requires the use of buffer copy functions, dramatically increasing the logging overhead (though strings up to the size of the underlying hardware's cache line are optimized into simple long assignments).

Table 1: The basic types available in a Klogger event, and their C equivalents.

Klogger Type	C Type
short	short
ushort	unsigned short
int	int
long	long
longlong	long long
ushort	unsigned short
uint	unsigned int
ulong	unsigned long
ulonglong	unsigned long long
stringN	char[N]

In our SCHEDOUT example, the event will thus be defined as:

event SCHEDOUT {
    int pid
}

and will appear in the final text log as:

{
    header => {
            "type"      => "SCHEDOUT",
            "serial"    => "119",
            "timestamp" => "1032071755760",
    },
    "pid"               => "1073",
},

Though extracting the text based log is described in Section 5.2, for now simply note that each event appears as a Perl hash. The hash contains all the fields described in the event definition and a header containing the event type (as text), its serial number in the log and a timestamp taken from the hardware's cycle counter.

Declaring Enumerations

Enumerations are declared in a syntax similar to C enums. The basic template is:

enum enumerationtype {
    one,
    minusone = -1,
    four = 4,
    TWO = 2,
    Three
}

The type ``enumerationtype'' may be later used in an event as a type. Integer values will automatically be converted to their equivalent strings when decoding Klogger's output.

Event Inheritance

Event inheritance is an advanced feature of Klogger, enabling analysts the definitions of inclusive groups of events, thereby defining different log levels.

The inheritance model, allows schema authors to specify the levels of the various events logged by Klogger and include log levels in other log levels using the ``include'' command.

Similar to C++ syntax, the currently used log level will be declared before all log level declarations with the ``using'' command.

Furthermore, when a schema author would like to omit an event from an included log level, he may use the ``omit'' command. Such action could be done in order to filter out information unneeded in a specific log level, thereby reducing the information generated by Klogger.

Event inheritance syntax is as follows:

using two

loglevel one <
   event ONE_DO {
     ...
   }

   event ONE_SIMPLE {
     ...
   }
>

loglevel two <
   include one

   omit ONE_SIMPLE

   event TWO_DETAILED {
     ...
   }
>

A more detailed example of log levels can be found in the network schema's configuration file.

Logging Events

Logging events is performed using a special C macro called klogger. First, we'll need to include the main Klogger header file:

#include <linux/klogger.h>

The syntax for the klogger macro resembles that of printf, with the first argument being the event name, followed by values for each of the event fields as ordered in the event description. The call for our general event template will thus be:

	klogger(EVENTNAME, varname1, varname2, ... , varnameX);

This call is converted by the C preprocessor into an inlined event specific logging function. However, If the kernel is compiled without Klogger support, this call converts into a null expression by the preprocessor.

Sometimes special code is needed (like using temporary variables) for non-trivial logging. Such code should be protected using an ifdef statement checking the existence of the CONFIG_KLOGGER macro, so it can be eliminated if the Klogger is not configured into the kernel.

Returning to our SCHEDOUT example, the call logging the event will simply be:

	klogger(SCHEDOUT, task->pid);

assuming task is a variable of type struct task_struct* pointing to the process that was just preempted off the CPU.

Using Hardware Performance Counters

Most modern processors support hardware performance counters, measuring various aspects of the processor's operations. These counters can provide information varying from the number of cache misses for each cache level, through the number of retired instructions, and sometimes even the number of memory operations queued on the processor. Needless to say, these counters are processor dependant. Moreover, these counters may not even be backward compatible with previous versions of the same architecture.

To make Klogger as efficient as possible, all pieces of code accessing the hardware performance counters are inlined. The result is that all abstractions used are taking place at compile time, incurring no runtime overhead from abstract indirections (as oppose to the approach taken by other tools, such as PAPI [2]). In that context, user should be warned that since the counters are processor model dependant, code compiled with performance counter support for one model may not run on another! Simply put, kernels compiled to utilize PentiumIII's performance counters might not even boot on a PentiumIV machine.

Binding Hardware Performance Counters

In order to use the hardware counters, we must add an architecture section to the configuration file, specifying the underlying hardware and binding Klogger's virtual counters to specific events supported by the hardware.

The basic structure is

arch ARCH_NAME {
        counter1 HARDWARE_COUNTER_NAME_A;
        counter2 HARDWARE_COUNTER_NAME_B;
	...
        counterN HARDWARE_COUNTER_NAME_X;
}

where ARCH_NAME is the architecture name (i.e. PentiumPro, PentiumIII, PentiumIV, etc.), ¹counter1...N are the virtual counter names, and HARDWARE_COUNTER_NAME_A...X are the hardware specific counter names. The list and descriptions of the hardware specific counter names supported for the given architecture can be viewed by running the command

./scripts/klogger-autogen.pl --list-counters

from the Klogger's kernel source root. For example, the counters currently supported for the PentiumIV architecture are listed in table 2.

Table 2: Hardware performance counters currently supported for the PentiumIV architecture. For full descriptions and the list of all counters supported by the hardware please refer to [5].

Counter Name	Counter Description
dtlb_miss	Page walks for a data TLB miss, all protection levels.
dtlb_miss_os	Page walks for a data TLB miss, OS protection level.
dtlb_miss_user	Page walks for a data TLB miss, user protection level.
itlb_miss	Page walks for an instruction TLB miss, all protection levels.
itlb_miss_os	Page walks for an instruction TLB miss, OS protection level.
itlb_miss_user	Page walks for an instruction TLB miss, user protection level.
l1_cache_misses	L1 cache misses, all protection levels.
l1_cache_misses_os	L1 cache misses, OS protection level.
l1_cache_misses_user	L1 cache misses, user protection level.
l2_cache_misses	L2 cache misses, all protection levels.
l2_cache_misses_os	L1 cache misses, OS protection level.
l2_cache_misses_user	L1 cache misses, user protection level.

Let's return to our SCHEDOUT example for a moment. Imagine we want Klogger to log not only the preempted process's pid, but also to log the number of overall L2 cache misses it has suffered.

Since we only want the event occurring when the process was running (regardless of any interrupts handled by the kernel during that time), we need to bind the first virtual counter to the underlying hardware's counter of the L2 cache misses occurring at the user protection level -- the l2_cache_misses_user counter. As such, the architecture dependant section in the configuration file would look like this:

arch PentiumIV {
        counter1 l2_cache_misses_user;
}

Also, we would add another 64bit integral field to the SCHEDOUT event, saving the counter's value at the moment of preemption:

event SCHEDOUT {
    int pid
    ulonglong l2_cache_misses_user
}

which means the textual log will hold the event as

{
    header => {
            "type"           => "SCHEDOUT",
            "serial"         => "119",
            "timestamp"      => "1032071755760",
    },
    "pid"                    => "1073",
    "l2_cache_misses_user"   => "35678014",
},

The next section describes how to read the hardware's performance counters, and how to log our extended SCHEDOUT event.

Reading Hardware Performance Counters

Reading the performance counters is done by two types of functions: The first type is generic, accepting which counter to read as an argument. The second type is logical, with the event name being part of the function's name. This separation allows the programmer two different modus operandi:

Read what ever is in a specific counter to generate log, allowing changes to the configuration file to change the counter's type, with no need to modify the code itself.
Require that a logical event be defined in some counter, when the logical event has specific importance.

Both types are implemented using compile time generated inlined functions. The reasoning behind this implementation is that while it would be simpler to just generate a logical to physical counter map and have a function use that map, such method involves more overhead and cache pollution.

The first, counter-based interface is defined as:

unsigned long      klogger_get_counter(unsigned long ctr);
unsigned long long klogger_get_counter_ll(unsigned long ctr);

where the first function returns just the lower 32 bits of the 64 bit counter, and the second function returns the entire 64 bit number. The ctr argument is a macro, named KLOGGER_COUNTERN where N is the counter number -- i.e. KLOGGER_COUNTER1, KLOGGER_COUNTER4, etc.

The second function type is declared as:

unsigned long      klogger_get_EVENTNAME()
unsigned long      klogger_get_EVENTNAME_ll()

where EVENTNAME is the actual name used in the configuration file.

Returning once again to our SCHEDOUT example, we can read the counter we defined using either function type. We bound logical counter 1 to the l2_cache_misses_user event. Thus, using the first function type, logging would take place using the following line:

klogger(SCHEDOUT, task->pid, klogger_get_counter(KLOGGER_COUNTER1);

If we want to use the function based on the event names, we just use the following line:

klogger(SCHEDOUT, task->pid, klogger_get_l2_cache_misses_user());

Sometimes the logging is not completely straightforward, involving some code preparing the data to be logged. In such cases, it may be important to know for which architecture the kernel was compiled. For that reason we use the KLOGGER_ARCH_COMPILED macro which is set to one of the klogger_arch_t enumeration values, defined in include/asm/klogger.h.

Adding Support for Other PentiumIV Performance Counters

The PentiumIV architecture's performance counters support many different events, only few of which we support (Table 2). Adding support for other counters is not difficult, but rather time consuming. As such, we have decided to take the lazy road -- each programmer is welcome to add support for the events she needs.

Adding support is very simple. It only involves copying bit configuration from the long list of events listed in Appendix A of [5] (available online at http://developer.intel.com/design/Pentium4/documentation.htm) into the Perl hash named %EVENTS found in KERNEL_ROOT/scripts/klogger/PentiumIV.pm (the module implementing the hardware counters for the PentiumIV architecture). The copying is straight forward as each entry in the hash uses similar naming as that in the architecture's manual, with only two new fields: the event name, and its description. Please send us any new hash entries you add so it can be added to the main distribution.

Compiling a Klogger Enabled Kernel

For simplicity, let us assume you have opened the Klogger distribution into a directory called KLOGGER.

Adding Klogger support is a simple process:

Patching klogger
Add the Klogger framework to the kernel using the base patch, and copy the base configuration file to the kernel root:

foo[ linux-2.6.9 ] patch -p1 < KLOGGER/klogger-2.6.9/patch/klogger-2.6.9.patch
foo[ linux-2.6.9 ] cp KLOGGER/klogger-2.6.9/patch/klogger.conf.base .klogger.conf.base

Adding schemata
Now you can either create your own schema, or use a ready made one supplied with the distribution (for example, the scheduler schema):

foo[ linux-2.6.9 ] patch -p1 < \
KLOGGER/klogger-2.6.9/schemata/sched/klogger-scheduler-schema-2.6.9.patch
foo[ linux-2.6.9 ] cp KLOGGER/klogger-2.6.9/schemata/sched/klogger.conf.sched \
.klogger.conf.sched

Enable KLogger in the kernel configuration
Enable [Kernel hacking --> KLogger] in the kernel configuration menus. A second option, [Kernel hacking --> KLogger benchmarks] also enables runtime measurements of each event's logging overhead (more on benchmarks in Section 5.6).

Now compile the kernel, and log like the wind...

Using Klogger at Runtime

Klogger offers a control interface through the /proc filesystem (this is actually sysctl options exported through the /proc filesystem). All the control files reside under /proc/sys/klogger. The convention is that event names are in upper case, so all files whose name consists of upper case letters represent events, while lower-cased files represent other configuration options.

Note that Klogger's setting cannot change while it is enabled!

Enable/Disable Logging

Klogger's main control switch -- toggling logging on or off -- is done via the /proc/sys/klogger/enabled file. By writing the text string ``1'' or ``0'' into that file, you can enable or disable logging, respectively.

Logs are saved in binary form into per-CPU files residing under the /tmp directory. These dump files are named /tmp/kernlog.dump.cpuN where N is the CPU number. The dump files might get very big, depending on the rate by which events are logged -- as such, make sure you have adequate space on the filesystem inside which /tmp resides.

Note of caution for SMP users: inter-CPU synchronization is based on the Linux kernel synchronization, and we have not tested its accuracy, nor the drift between CPU clocks. While the assumption that the drift is very small due to the fact that different CPUs on the same machine experience the same operating conditions (temprature/power), the same does not hold for the accuracy of the mechanism resetting all CPU clocks during Linux boot time. This is definitely on our TODO list.

Binary to Text Conversion

When building Klogger with a specific set of schemata, a decoding utility is created for that specific kernel able to decode the binary log into a more human friendly, text formatted file. As mentioned earlier, the text format is basically an array of Perl Hashes, each represeting an event.

Assuming your Klogger kernel sources reside under KERN_ROOT, the decoder is compiled into

KERN_ROOT/kdecoder/klogger_decoder

This tool's operation is very simple: its only command line argument is the binary log, whose textual conversion is printed into the standard output. As such, the usual command line would look like:

KERN_ROOT/kdecoder/klogger_decoder /tmp/kernlog.dump.cpu0 > kernlog.cpu0.txt

Enable/Disable Specific Events

Sometimes a Klogger is compiled with a broad set of events, but during certain measurement one would like to log only a subset of these. As such, Klogger is equipped with a simple mechanism to block logging of specific events. Under the /proc/sys/klogger directory, you can find a special file named after each of the events compiled in. This special file is an event specific toggle switch. By writing the strings ``0'' or ``1''you can block or unblock logging of the specific event, respectively. Note that the overhead imposed by an event's logging code when it is disabled -- either specifically, or because Klogger is disabled entirely -- is similar.

Default Events

Klogger has several base events, used to account for invocation times, buffer dump times, and the state of the system when the logging starts -- i.e. all the processes in the system.

These events are:

KLOGGER_START
Logged whenever Klogger is enabled (the first event logged), saving the kernel version and the size of the memory resident buffer used per CPU.
KLOGGER_FINISH
The reciprocal of the previous event. Simply marks when Klogger was stopped.
PROCESS_EXISTS
When Klogger is enabled, all existing processes are logged, being the mainline of the system's status. This event is logged for each existing process, holding the process's general information: pid, scheduling priority, scheduling policy, and its command line. Admittedly, this event is pretty expensive overhead-wise, it is only logged once as part of the log's header.
LOGGING_EVENT
A complement to the system status is Klogger's status, mainly the events that are being logged. Again, a LOGGING_EVENT is saved for each event name Klogger is configured to log.
DUMP_BEGIN
This event is logged whenever Klogger dumps its memory resident buffer to the disk. Its main purpose is that post mortem analysis of the log can isolate event implicitly generated by Klogger itself. For example, when analyzing disk traffic, one would like to discard traffic generated by the measurement itself. This event also saved the dump index -- the index of the dump during the current logging session.
DUMP_FINISH
Again, a reciprocal of the former, indicating the end of the buffer dump. Its payload also includes the number of bytes actually dumped.
MARK
A special event being used to mark places in the log -- for example, a measurement can have several phases, whose transitions the user would like mark in the log. Generating this event is done by writing some number string into a special file: /proc/sys/klogger/mark. Doing this will cause the event to be logged, along with the logging process's pid and the number written into the file. The written number can be used by the user to separate the different MARK events.

Buffer Size and Low Water Mark

Changing the buffer size is done via the /proc/sys/klogger/buffer_size file by simply writing a new size in units of MB (the size refers to the buffer of each CPU, so the actual memory allocated is , where is the number of processors in the system). Reading the file shows the buffer size allocated for each CPU.

For example, if we want to change the buffer size to 16MB per CPU right after boot the command set would be:

foo[ /proc/sys/klogger ] cat buffer_size 
4MB/CPU
foo[ /proc/sys/klogger ] echo 16 > buffer_size 
foo[ /proc/sys/klogger ] cat buffer_size 
16MB/CPU

(note that although writing to the file is done with only numbers, when reading the file the units are appended to the data).

To avoid stressing the memory by allocating large, physically continuous buffers Klogger uses virtually continuous buffers. However, there is a limit on the kernel's virtual memory size, which is set to 128MB on Linux 2.6.9 kernels. If the total memory allocated exceeds this limit, you can boot the machine with the special vmalloc boot time flag, resetting the kernel's virtual memory limit. For example, when using 128MB buffers on a 4-way machine, we added the vmalloc=640MB boot parameter so the kernel will have enough virtual memory for both the needed by Klogger, alongside the original virtual memory space it uses.

Resetting the low-water mark is done in a similar manner, using the /proc/sys/klogger/lowwater file. The number written to that file is the percent of the memory buffer space below which the buffer will be emptied (default is 10%). For example:

foo[ /proc/sys/klogger ] cat lowwater 
10%
foo[ /proc/sys/klogger ] echo 5 > lowwater 
foo[ /proc/sys/klogger ] cat lowwater 
5%
foo[ /proc/sys/klogger ]

Event Benchmarks

If the kernel was configured with Klogger's internal benchmarking mechanism (kernel option [Kernel hacking --> KLogger benchmarks] enabled), you will find a directory called /proc/sys/klogger/benchmarks, in which there is a file matching each event's name. The file's contents is the event's logging overhead, in CPU cycles.

When a file is read, the matching event is logged several times to measure the average overhead. The number of iterations can be tuned using the special /proc/sys/klogger/benchmarks/iterations file (default is 1000 iterations).

For example, on our PentiumIV machine clocked at 2.8GHz the overhead of the KLOGGER_START event is 234 cycles (83.6 nanoseconds!), while the overhead of the more complex PROCESS_EXISTS (which contain character strings) is 16614 cycles (5.93us).

<433>>

Special ``Mark'' Event

Marking a point in time

Klogger Perl Module

Perl is a scripting language providing rapid development for both text analysis and execution of specific workloads. As such, Klogger includes a Perl modules to simplify both logging of workloads and the analysis of the logs.

When examining the effects of specific workloads on kernel behavior using Klogger, Perl can be used to automatically generate these workloads by running sequences of the needed applications. Afterwards, Perl can serve as a great tool to analyze log files, generate statistics and look for kernel behavioral patterns expressed by the various events.

In order for Perl to load the module, add the directory in which the module resides to the PERL5LIB environment variable.

Control Functions

The first group can be regarded as Klogger control functions, used in automated workload generating scripts. These are just wrappers for the basic shell commands introduces so far:

klogger_start_logging()
As the same says: starts logging! (an equivalent to writing ``1'' into /proc/sys/klogger/enabled).
klogger_stop_logging()
Another surprise: the function simply stops the logging! (equivalent to writing ``1'' into /proc/sys/klogger/enabled).
klogger_dump2txt(kernel_home, dumpfile, txtfile)
Convert a binary dump file into the textual log format. The arguments are:
- kernel_home: the path to Klogger's kernel source tree. Needed for the binary2text decoder.
- dumpfile: the binary file to be converted.
- txtfile: the output text file.

Log Analysis Functions

The second group of functions is used for log analysis:

klogger_open_log(text_log_filename)
Opens the file text_log_filename for analysis. Returns a handle to the open log, or the undefined value if the open failed (the handle can only be used with the following functions).
klogger_next_event(log_handle)
Given a handle to an open log, return the next event in the log. The return value is a Perl reference to a Hash containing the next event in the log (as depicted in the textual file format). When reaching the end of the log the undefined value is returned.
klogger_get_processes(log_handle)
Returns a reference to a Hash containing the data about the running processes, indexed by the processes pids.
klogger_close_log(log_handle)
Close the open log associated with the given handle.

Using the log analysis functions, analysis scripts can be built around the code depicted in Algorithm 1.

Current Schemata

As previously described, a major goal when designing Klogger was to separate the mechanism from the policy. While the core of the framework is regarded as the mechanism, the schemata are the available policies. In this section we describe the schemata that are supplied with Klogger itself. We hope users will develop more schemata, sharing them with us and the community, so that future versions of Klogger will deliver a variety of subsystems' schemata.

Stopwatch Schema

The stopwatch schema is brought here mainly to serve an example. As its name suggests, it is simply a stopwatch: it has two events, EVENT_START and EVENT_STOP, marking the beginning and end of some events, respectively. Since it is a generic stopwatch, the schema does not even has a matching kernel patch, simply because it is aimed a measuring anything -- just place the klogger calls before and after the event you wish to measure.

Each events has an id field, to match a start event to its adjoining stop event, and to separate different events from each other. By checking the timestamp in the headers of both events, you can measure the cycles the event in question took.

To measure events, simply use the following template in your kernel code:

klogger(EVENT_START, user_chosen_event_id);
/* The event in question... */
klogger(EVENT_STOP, user_chosen_event_id);

Use this basic schema as a template, adding more data to the event payload to get a feeling of klogger's operation.

Scheduler Schema

The scheduler schema consists 8 events depicting all process states (to view the exact payload of each process look at the schema's configuration file):

TRY_TO_WAKEUP
A process was supposed to have been woken up. The TRY prefix emphasize that the process might have already been awake (for example, a running process have been sent a signal).
REMOVE_FROM_RUNQ
A process has became non-runnable (exited, blocked on I/O) and has been removed from the run queue.
ADD_TO_RUNQ
A process has became runnable and was added to the run queue.
SCHEDOUT
A process was preempted as part of a context switch.
SCHEDIN
A process was allocated a CPU as part of a context switch.
FORK
The fork system call was called.
EXEC
The exec system call was called.
EXIT
A process exited.

Locking Schema

This schema tries to catch the essence of Linux's locking mechanism -- the subsystem used to achieve inter-processor mutual exclusion. Linux uses the busy-wait approach for locking, and three major lock type:

spinlock: only allow one processor to execute a critical section at any given time.
read/write spinlock: asymmetric locks, which allow multiple read-only contexts to execute simultaneously, while writers contexts get exclusive access.
Big Kernel Lock (BKL), a relic from early SMP support in Linux, which was meant to be a transitional solution from the monolithic support (only one CPU running kernel code at any given time) to the fine grained support (separate locks for each global data structure). Although the BKL has been deemed a deprecated feature, it is still widely used in some parts of the kernel.

A more detailed discussion of locking in Linux can be found in [1] (and a general discussion in [6]).

The events included in this schema are:

SPINLOCK_INIT
A spinlock was initialized.
SPINLOCK_FINISH
A spinlock was released. This event also saves the time in which the lock was acquired.
PREEMPT_SPINLOCK_FINISH
A spinlock was released on kernel preemption. This event also saves the time in which the lock was acquired.
RWLOCK_INIT
A read/write lock was initialized.
READLOCK_FINISH
A spinlock was released from reader context. This event also saves the time in which the lock was acquired.
WRITELOCK_FINISH
A spinlock was released from writer context. This event also saves the time in which the lock was acquired.
PREEMPT_WRITELOCK_FINISH
A spinlock was released from writer context on kernel preemption. This event also saves the time in which the lock was acquired.
BKL_LOCK_FINISH
The BKL was released (again, also saving the in which time it was acquired).

Networking Schema

The networking schema enables the user to run networking based tests and analyze the data being transferred through the layers. Currently supported layers are the socket layer, the protocol(TCP/UDP/ICMP) layer and the IP layer.

The schema consists of 6 enumerations and 38 events, separated to log levels using Klogger's event inheritance system, the top level being the highest layer. That way, an analyst may choose the amount of logging he requires of Klogger and receive the wanted results in compile time. No modifications to kernel code are required.

Currently, we only support the IPv4 protocol stack.

Socket level:

SOCKET_CREATE
A socket has been created due to a request. The socket's new address in memory is logged (and could be used to track socket activities) and its domain and protocol.
SOCKET_CLOSE
A socket has been closed and will not be active anymore.
SOCKET_ASSIGNADDR
A socket has been assigned to an IP address and port and was set to communicate using a specified protocol.
SOCKET_LISTEN
A socket has been assigned to be a listening socket and will listen for data.
SOCKET_ACCEPT
A socket has been accepted to connect to another socket.
SOCKET_WRITE
A socket is writing data to its target.
SOCKET_RECEIVE
A socket is receiving (processed) data from its target.

Connection level:

SOCKET_BIND
(REPLACES SOCKET_ASSIGNADDR) A socket has been bound to an IP address and port and was set to communicate using a specified protocol.
SOCKET_CONNECT
(REPLACES SOCKET_ASSIGNADDR) A socket has connected to an IP address and port and was set to communicate using a specified protocol.
CONNECTION_HANDSHAKE
A socket has begun in a protocol connection handshake.
CONNECTION_CLOSE
A socket has closed its connection with the other socket it was connected to.

Protocol level:

TCP_SENDSYN
A SYN packet has been sent from the specified TCP socket.
TCP_RECVSYN
A SYN packet has been received to the specified TCP socket.
TCP_SENDSYNACK
A SYN-ACK packet has been sent from the specified TCP socket.
TCP_RECVSYNACK
A SYN-ACK packet has been received to the specified TCP socket.
TCP_SENDACK
An ACK packet has been sent from the specified TCP socket.
TCP_RECVACK
An ACK packet has been received to the specified TCP socket.
TCP_SEND
The specified amount of data is being sent from the specified TCP socket.
TCP_SENDPACKET
A fragmented/whole (unspecified) TCP packet is being queued for sending in the specified TCP socket.
TCP_RECEIVE
The specified amount of data (processed) is being received to the specified TCP socket.
TCP_RECEIVEPACKET
A fragmented/whole (specified) TCP packet has been received to the specified TCP socket and is queued for processing.
TCP_RECV_URGENTDATA
The specified amount of data (processed) is being received to the specified TCP socket. The packet carries the URG (urgent) flag and is processed before other non-urgent data.
TCP_DISCONNECT
A TCP socket has disconnected from its target.
UDP_PUSHPENDINGFRAMES
A UDP socket is pushing all of its pending frames to a specified socket buffer.
UDP_FLUSHPENDINGFRAMES
There has been a ``cork'' in the specified UDP socket and therefore all of the pending frames are being flushed immediately.
UDP_QUEUESOCKETBUFFER
A UDP socket is queueing IP packets received from a specified socket buffer for processing.
UDP_RECEIVE
The specified amount of data (processed) is being received to the specified UDP socket.
UDP_SEND
The specified amount of data is being sent from the specified UDP socket.
UDP_CONNECT
A UDP socket has connected to its target using the UDP protocol.
UDP_DISCONNECT
A UDP socket has disconnected from its target.
ICMP_SEND
An ICMP packet, containing constant data (specified as enumerated code types), is being sent.
ICMP_REPLY
An ICMP packet reply packet is being sent back to the ICMP packet sender.
ICMP_RECEIVE
An ICMP packet, containing constant data (specified as enumerated code types), has been received.
ICMP_PUSHPENDINGFRAMES
An ICMP socket is pushing all of its pending frames to a specified socket buffer.

IP Level:

IP_PUSHPENDINGFRAMES
Pushed pending frames from a protocol have been received from a specified socket into the specified socket buffer.
IP_FLUSHPENDINGFRAMES
There has been a ``cork'' in the specified socket buffer and therefore all of the pending frames are being flushed immediately and won't be sent.
IP_SENDPACKET
The specified IP socket buffer has sent a packet. Whether it's a fragment or not, its fragment offset, packet size, total size, the requesting PID and the current MTU are all logged.
IP_RECEIVEPACKET
The specified IP socket buffer has received a packet (unprocessed). Whether it's a fragment or not, its fragment offset, packet size, total size, the requesting PID and the current MTU are all logged. NOTE: The total size of the de-fragmented data is not applicable until the last fragment arrives.

Design and Implementation

TBD (read the technical report [3] for now)

Future Work

Complete this document.
More schemata
Add support for hardware performance counters on a variety of architectures.
Make Klogger available at boot time.
Let the user change the output log file name.
Verify cycle synchronization on SMPs.

Bibliography

1: D. P. Bovet and M. Cesati.
Understanding the Linux Kernel.
O'Reilly & Associates, 2nd edition, 2003.
2: S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci.
A scalable cross-platform infrastructure for application performance tuning using hardware counters.
In Supercomputing, Nov 2000.
3: Y. Etsion, D. Tsafrir, S. Kirkpatrick, and D. G. Feitelson.
Fine grained kernel logging with klogger: Experience and insights.
In 2nd ACM EuroSys, pages 259-272, Mar 2007.
4: B. O. Gallmeister.
Posix. 4: Programming for the Real World.
O'Reilly & Associates, January 1995.
5: Intel Corp.
IA-32 Intel Achitecture Software Developr's Manual. Vol. 3: System Programming Guide.
6: C. Schimmel.
UNIX Systems for Modern Architectures.
Addison Wesley, 1994.

About this document ...

The Klogger HOWTO

This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.71)

The command line arguments were:
latex2html -notop_navigation -nobottom_navigation -split 0 -local_icons -mkdir -dir html Klogger-HOWTO.tex

The translation was initiated by Yoav Etsion on 2007-09-09

Footnotes

... etc.),¹: Embarrassingly, the only architecture currently supported is the PentiumIV. However, adding new architectures is not a very difficult task, but requires a thorough reading of the relevant chapter in [5]. We would be glad to assist anyone willing to undertake such a task...

Yoav Etsion 2007-09-09