Parallel Workloads Archive: Standard Workload Format

The Standard Workload Format

The standard workload format (swf) was defined in order to ease the use of workload logs and models. With it, programs that analyze workloads or simulate system scheduling need only be able to parse a single format, and can be applied to multiple workloads.

The standard workload format abides by the following principles:

The files are portable and easy to parse:
- Each workload is stored in a single ASCII file.
- Each job (or roll) is represented by a single line in the file.
- Lines contain a predefined number of fields, which are mostly integers, separated by whitespace. Fields that are irrelevant for a specific log or model appear with a value of -1.
- Comments are allowed, as identified by lines that start with a `;'. In particular, files are expected to start with a set of header comments that define the environment or model.
The same format is used for models and logs. This implies that in each context certain fields may be redundant; for example, logs do not contain data about feedback and depdencies among jobs, which might appear in models, whereas models do not contain data about wait times in queues.
The format is completely defined, with no scope for user extensability. Thus you are guaranteed to be able to parse any file that adheres to the standard, and multiple competing and incompatible extensions are avoided. If experience shows that important attributes have been left out, they will be included in the future by creating an updated version of the standard.

Versions

The first version to be published was version 2.

The current version, described here, is version 2.2.

Version Additions

2.1 Add status 5 (canceled job)

MaxProcs header comment to support SMP nodes

Format for queue information to help automatic parsing

2.2 Add MaxQueues header comment

Add MaxPartitions header comment

Added TimeZoneString header comment

Deprecated the buggy TimeZone header comment (replaced by TimeZoneString)

Changed format of StartTime and EndTime be like that of 'date'

The Data Fields

Job Number -- a counter field, starting from 1.
Submit Time -- in seconds. The earliest time the log refers to is zero, and is usually the submittal time of the first job. The lines in the log are sorted by ascending submittal times. It makes sense for jobs to also be numbered in this order.
Wait Time -- in seconds. The difference between the job's submit time and the time at which it actually began to run. Naturally, this is only relevant to real logs, not to models.
Run Time -- in seconds. The wall clock time the job was running (end time minus start time).
We decided to use "wait time" and "run time" instead of the equivalent "start time" and "end time" because they are directly attributable to the scheduler and application, and are more suitable for models where only the run time is relevant.
Note that when values are rounded to an integral number of seconds (as often happens in logs) a run time of 0 is possible and means the job ran for less than 0.5 seconds. On the other hand it is permissable to use floating point values for time fields.
Number of Allocated Processors -- an integer. In most cases this is also the number of processors the job uses; if the job does not use all of them, we typically don't know about it.
Average CPU Time Used -- both user and system, in seconds. This is the average over all processors of the CPU time used, and may therefore be smaller than the wall clock runtime. If a log contains the total CPU time used by all the processors, it is divided by the number of allocated processors to derive the average.
Used Memory -- in kilobytes. This is again the average per processor.
Requested Number of Processors.
Requested Time. This can be either runtime (measured in wallclock seconds), or average CPU time per processor (also in seconds) -- the exact meaning is determined by a header comment. In many logs this field is used for the user runtime estimate (or upper bound) used in backfilling. If a log contains a request for total CPU time, it is divided by the number of requested processors.
Requested Memory (again kilobytes per processor).
Status 1 if the job was completed, 0 if it failed, and 5 if cancelled. If information about chekcpointing or swapping is included, other values are also possible. See usage note below. This field is meaningless for models, so would be -1.
User ID -- a natural number, between one and the number of different users.
Group ID -- a natural number, between one and the number of different groups. Some systems control resource usage by groups rather than by individual users.
Executable (Application) Number -- a natural number, between one and the number of different applications appearing in the workload. in some logs, this might represent a script file used to run jobs rather than the executable directly; this should be noted in a header comment.
Queue Number -- a natural number, between one and the number of different queues in the system. The nature of the system's queues should be explained in a header comment. This field is where batch and interactive jobs should be differentiated: we suggest the convention of denoting interactive jobs by 0.
Partition Number -- a natural number, between one and the number of different partitions in the systems. The nature of the system's partitions should be explained in a header comment. For example, it is possible to use partition numbers to identify which machine in a cluster was used.
Preceding Job Number -- this is the number of a previous job in the workload, such that the current job can only start after the termination of this preceding job. Together with the next field, this allows the workload to include feedback as described below.
Think Time from Preceding Job -- this is the number of seconds that should elapse between the termination of the preceding job and the submittal of this one.

Consistency and data quality

It is recommended to take measures to ensure that the data is consistent, and this is indeed done for logs on this site. For example, it is verified that fields like wait time, runtime, and number of processors do not contain negative values, and that the number of processors specified does not exceed the number available in the system. However, such checks are not made for fields representing requested resources, so for example the field of requested processors may indeed specify more than the number of processors available in the system.

Data quality problems are discussed in the paper Experience with the Parallel Workloads Archive (by D. G. Feitelson, D. Tsafrir, and D. Krakov, published in J. Parallel & Distributed Comput. 74(10), pp. 2967-2982, Oct 2014; DOI 10.1016/j.jpdc.2014.06.013).

Usage of the Status field

The main usage of the status field is to note the job's status. This isn't as straightforward as it sounds.

The simple case is jobs that complete normally, and have status 1.

The harder case is jobs that don't complete normally. This can happen for several reasons:

The job failed (e.g. segmentation fault). This is given status 0.
The job was cancelled by the user (like ^C in Unix). This is given status 5. Note that cancelled jobs may have positive runtimes and processors if cancelled after they started to run, or 0 or -1 if cancelled while waiting in the queue.
The job was killed by the system (e.g. because it exceeded its requested run time). This may be given different status values in different logs; it will typically be 0 or 5, but might also be 1.

Note also that the distinction between failure / cancellation / killing is not necessarily accurate, as the distinction typically does not appear in the original logs.

If a log contains information about checkpoints and swapping out of jobs, a job can have multiple lines in the log. In fact, we propose that the job information appear twice. First, there will be one line that summarizes the whole job: its submit time is the submit time of the job, its runtime is the sum of all partial runtimes, and its code is 0 or 1 according to the completion status of the whole job. In addition, there will be separate lines for each instance of partial execution between being swapped out. All these lines have the same job ID and appear consecutively in the log. Only the first has a submit time; the rest only have a wait time since the previous burst. The completed code for all these lines is 2, meaning "to be continued"; the completion code for the last such line is 3 or 4, corresponding to completion or being killed. It should be noted that such details are only useful for studying the behavior of the logged system, and are not a feature of the workload. Such studies should ignore lines with completion codes of 0 and 1, and only use lines with 2, 3, and 4. For workload studies, only the single-line summary of the job should be used, as identified by a code of 0 or 1.

To summarize, the status field codes are (or should be) as follows:

0 Job Failed

1 Job completed successfully

2 This partial execution will be continued

3 This is the last partial execution, job completed

4 This is the last partial execution, job failed

5 Job was cancelled (either before starting or during run)

Usage of the preceding / think fields

The last two fields work as follows. Suppose we know that a.out, job number 123, should start ten seconds after the termination of gcc x.c, which is job number 120. We could give job number 123 a submittal time that is 10 seconds after the submittal time plus run time of job 120, but this wouldn't be right -- changing the scheduler might change the wait time of job 120 and spoil the connection. The solution is to use "preceding job" and "think time" fields to save such relationships between jobs explicitly. In this example, for job number 123 we'll put 120 in its preceding job number field, and 10 in its think time from preceding job field. In a workload log, it is possible to include both the submittal time and the precedence information; models can include precedences with dependent jobs that don't have independent arrival times.

Header Comments

The first lines of the log may (or rather, shold) be of the comments with the format ";Label: Value". These are special header comments with a fixed format, used to define global aspects of the workload. Predefined labels are:

Version: Version number of the standard format the file uses. The format described here is version 2.
Computer: Brand and model of computer
Installation: Location of installation and machine name
Acknowledge: Name of person(s) to acknowledge for creating/collecting the workload.
Information: Web site or email that contain more information about the workload or installation.
Conversion: Name and email of whoever converted the log to the standard format.
MaxJobs: Integer, total number of jobs in this workload file.
MaxRecords: Integer, total number of records in this workload file. If no checkpointing/swapping information is included, there is one record per job, and this is equal to MaxJobs. But with chekpointing/swapping there may be multiple records per job.
Preemption: Enumerated, with four possible values. 'No' means that jobs run to completion, and are represented by a single line in the file. 'Yes' means that the execution of a job may be split into several parts, and each is represented by a separate line. 'Double' means that jobs may be split, and their information appears twice in the file: once as a one-line summary, and again as a sequence of lines representing the parts, as suggested above. 'TS' means time slicing is used, but no details are available.
UnixStartTime: When the log starts, in Unix time (seconds since the epoch)
TimeZone: DEPRECATED and replaced by TimeZoneString.
A value to add to times given as seconds since the epoch. The sum can then be fed into gmtime (Greenwich time function) to get the correct date and hour of the day. The default is 0, and then gmtime can be used directly. Note: do not use localtime, as then the results will depend on the difference between your time zone and the installation time zone.
TimeZoneString: Replaces the buggy and now deprecated TimeZone.
TimeZoneString is a standard UNIX string indicating the time zone in which the log was generated; this is actually the name of a zoneinfo file, e.g. "Europe/Paris". All times within the SWF file are in this time zone. For more details see the usage note below.
StartTime: When the log starts, in human readable form, in this standard format: Tue Feb 21 18:44:15 IST 2006 (as printed by the UNIX 'date' utility).
EndTime: When the log ends (the last termination), formatted like StartTime.
MaxNodes: Integer, number of nodes in the computer. List the number of nodes in different partitions in parentheses if applicable.
MaxProcs: Integer, number of processors in the computer. This is different from MaxNodes if each node is an SMP. List the number of processors in different partitions in parentheses if applicable.
MaxRuntime: Integer, in seconds. This is the maximum that the system allowed, and may be larger than any specific job's runtime in the workload.
MaxMemory: Integer, in kilobytes. Again, this is the maximum the system allowed.
AllowOveruse: Boolean. 'Yes' if a job may use more than it requested for any resource, 'No' if it can't.
MaxQueues: Integer, number of queues used.
Queues: A verbal description of the system's queues. Should explain the queue number field (if it has known values). As a minimum it should be explained how to tell between a batch and interactive job.
Queue: A description of a single queue in the following format: queue-number queue-name (optional-details). This should be repeated for all the queues.
MaxPartitions: Integer, number of partitions used.
Partitions: A verbal description of the system's partitions, to explain the partition number field. For example, partitions can be distinct parallel machines in a cluster, or sets of nodes with different attributes (memory configuration, number of CPUs, special attached devices), especially if this is known to the scheduler.
Partition: Description of a single partition.
Note: There may be several notes, describing special features of the log. For example, "The runtime is until the last node was freed; jobs may have freed some of their nodes earlier".

Usage of the TimeZone and TimeZoneString fields

The TimeZone header comment is DEPRECATED and replaced by TimeZoneString.

TimeZoneString is a standard UNIX string indicating the time zone in which the log was generated.

All specified epoch times within the SWF file relate to this time zone, e.g. in the LPC log, UnixStartTime=1091615532 (= seconds since epoch) translates to "Wed Aug 04 12:32:12 CEST 2004" in France time (where the log was generated).
Indeed, the TimeZoneString of the LPC log is Europe/Paris. This must be the name of a standard zoneinfo file, which is usually found under the /usr/share/zoneinfo/ directory in UNIX systems (that is, the file /usr/share/zoneinfo/Europe/Paris must exist).

If you want to convert an SWF time to local a time (e.g. to know the time-of-day in which a job J was submitted), then in Perl (for example) you do the following:

        use POSIX;
	$ENV{TZ} = 'US/Pacific';  # a file under /usr/share/zoneinfo/
	POSIX::tzset();           # new timezone takes effect
        my $UnixStartTime = ...;  # start time of log in seconds since epoch
        my $submit        = ...;  # SWF submit time of J
	my ($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst)
            = localtime($UnixStartTime + $submit);

If you want to do the same in C:

        #include <time.h>
        #include <stdlib.h>

        setenv("TZ", ":Europe/Paris", 1/*overwrite*/);
	tzset();

        time_t unixstart = ...; // UnixStartTime
	time_t submit    = ...; // SWF submit time of J
	time_t sum       = unixstart + submit;

        struct tm *datetime;
	datetime = localtime( &sum );

The reason TimeZone is buggy and was replaced by TimeZoneString is that with the former we completely ignore daylight saving periods.
Note that in contrast to TimeZone, if we use TimeZoneString then we should use localtime (not gmtime).
Also, note that contrary to common belief, the epoch is defined to be 00:00:00 on January 1, 1970, in London (that is, it's not true that every time-zone counts seconds since 00:00:00 on January 1, 1970 on that time zone).

Parsing

To support usage of the standard workload format, we have an example Perl script for parsing it.

Acknowledgements

The standard workload format was proposed by David Talby and refined through discussions with Dror Feitelson, James Patton Jones, and others.

If you use it, you can refer to the following paper:
Steve J. Chapin, Walfredo Cirne, Dror G. Feitelson, James Patton Jones, Scott T. Leutenegger, Uwe Schwiegelshohn, Warren Smith, and David Talby, "Benchmarks and Standards for the Evaluation of Parallel Job Schedulers". In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (Eds.), Springer-Verlag, 1999, Lect. Notes Comput. Sci. vol. 1659, pp. 66-89; DOI 10.1007/3-540-47954-6_4.
Don't forget to also point to this web page, which contains the most up-to-date version of the definition.

The improvements in handling time zones are due to Dan Tsafrir.

Back to the Parallel Workloads Archive home page

feit@cs.huji.ac.il / Feb 21, 2006

Version	Additions
2.1	Add status 5 (canceled job)
	MaxProcs header comment to support SMP nodes
	Format for queue information to help automatic parsing
2.2	Add MaxQueues header comment
	Add MaxPartitions header comment
	Added TimeZoneString header comment
	Deprecated the buggy TimeZone header comment (replaced by TimeZoneString)
	Changed format of StartTime and EndTime be like that of 'date'

0	Job Failed
1	Job completed successfully
2	This partial execution will be continued
3	This is the last partial execution, job completed
4	This is the last partial execution, job failed
5	Job was cancelled (either before starting or during run)