Parallel Workloads Archive: LLNL Thunder

The LLNL Thunder log

System: Linux Cluster (Thunder) at LLNL
Duration: Feb 2007 to Jun 2007
Jobs: 128,662

This log contains several months' worth of accounting records from a large Linux cluster called Thunder, installed at Lawrence Livermore National Lab. For more information about Linux clusters at LLNL, see URL https://computing.llnl.gov/tutorials/linux_clusters/. This specific cluster has 1024 nodes, each with 4 processors, for a total of 4096 processors.

At the time this log was recorded, Thunder was considered a "capacity" computing resource, meaning that it was intended for running large numbers of small to medium-sized jobs. This is in contrast with the newer Atlas cluster, which is a "capability" machine, used for running large parallel jobs that cannot execute on lesser machines.

Note that the log does not include arrival information, only start times.

The LLNL Thunder workload log was graciously provided by Moe Jette, who also helped with background information and interpretation. If you use this log in your work, please use a similar acknowledgment.

Downloads:

  LLNL-Thunder-2007-0 (2.3 MB, gz): original log
  LLNL-Thunder-2007-1.swf (1.4 MB, gz): converted log
  LLNL-Thunder-2007-1.1-cln.swf (1.3 MB, gz): cleaned log -- RECOMMENDED, see usage notes

Papers Using this Log:

This log was used in the following papers: [thebe09] [pascual09] [minh11] [kleineweber11] [yuan11] [garg11] [lindsay12] [etinski12] [gomezm13] [liang13] [ming13] [rajbhandary13] [tian14] [feitelson14] [lic14] [lucarelli17]

System Environment

Thunder is a 1024-node Linux cluster. Each node has 4 Intel IA-64 Itanium processors clocked at 1.4 GHz and 8 GB of memory. The nodes are connected by a Quadrics network. When it was installed in 2004, this was the #2 machine on the Top500 list.

The nodes are divided into the following partitions:

  login: 4 nodes
  debug: 16 nodes
  batch: 986 nodes
  file servers: 16 nodes
  metadata servers: 2 nodes

The data in the log pertains to jobs that ran on the debug and batch partitions. Scheduling is performed with the LCRM and Slurm resource management systems. For more information about Slurm, see URL https://computing.llnl.gov/LCdocs/slurm/.

Log Format

The original log is available as LLNL-Thunder-2007-0.

This file contains one line per completed job in the Slurm format. The fields are

  1. JobId=<number>
  2. UserId=xxxxx(<number>) - the x's hide the actual user name to preserve privacy
  3. Name=<string> - name of executable (script), could be empty
  4. JobState=<status>
  5. Partition=<string>
  6. TimeLimit=<number> - in minutes
  7. StartTime=<date and time>
  8. EndTime=<date and time>
  9. NodeList=<string> - comma separated list of single nodes and ranges
  10. NodeCnt=<number>
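
As an illustration of the field list above, here is a minimal parsing sketch in Python. It assumes that each record is a single line of whitespace-separated KEY=VALUE pairs in the order listed, which is an assumption and not something stated by the log's providers. The names parse_record, to_int, and SAMPLE are hypothetical, and the sample record is made up, with placeholders standing in for the date and node-list values because their exact formats are not described here. Those three fields are therefore kept as raw text.

```python
import re

# Matches KEY=VALUE pairs; a value runs until the next " KEY=" or end of line,
# so an empty Name= or a name containing spaces is still captured.
FIELD_RE = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|\s*$)")


def to_int(text):
    """Return int(text), or None if the value is missing or not a plain number."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


def parse_record(line):
    """Turn one job record into a dict (assumed layout, see lead-in)."""
    raw = dict(FIELD_RE.findall(line))
    # UserId looks like xxxxx(<number>); keep only the numeric ID.
    user = re.search(r"\((\d+)\)", raw.get("UserId", ""))
    limit_min = to_int(raw.get("TimeLimit"))
    return {
        "JobId": to_int(raw.get("JobId")),
        "User": int(user.group(1)) if user else None,
        "Name": raw.get("Name", ""),
        "JobState": raw.get("JobState"),
        "Partition": raw.get("Partition"),
        # TimeLimit is given in minutes; convert to seconds.
        "TimeLimitSec": None if limit_min is None else limit_min * 60,
        # Date and node-list formats are not specified here: keep raw text.
        "StartTime": raw.get("StartTime"),
        "EndTime": raw.get("EndTime"),
        "NodeList": raw.get("NodeList"),
        "NodeCnt": to_int(raw.get("NodeCnt")),
    }


# Made-up record for illustration only; START, END and NODES are placeholders.
SAMPLE = ("JobId=1001 UserId=xxxxx(160) Name= JobState=COMPLETED Partition=batch "
          "TimeLimit=30 StartTime=START EndTime=END NodeList=NODES NodeCnt=5")

if __name__ == "__main__":
    print(parse_record(SAMPLE))
```

Splitting on the next KEY= rather than on plain whitespace is a design choice that tolerates an empty or multi-word Name field. A name that itself contains a " word=" sequence would still be split incorrectly.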

Conversion Notes

The converted log is available as LLNL-Thunder-2007-1.swf. The conversion from the original format to SWF was done by a log-specific parser in conjunction with a more general converter module (version 3).

Usage Notes

The original log contains several flurries of very high activity by individual users, which may not be representative of normal usage. These were removed in the cleaned version. It is recommended that the cleaned version be used.
The cleaned log is available as LLNL-Thunder-2007-1.1-cln.swf.

A flurry is a burst of very high activity by a single user. The filters used to remove the three flurries that were identified are

  user=160 and job>19279 and job<19453 (173 jobs)
  user=79 and job>47409 and job<58080 (6539 jobs)
  user=40 and job>109910 and job<110858 (911 jobs)

Note that the filters were applied to the original log, and unfiltered jobs remain untouched. As a result, in the filtered logs job numbering is not consecutive.
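
The three filters can be expressed as a short Python sketch. It assumes that "user" and "job" refer to the SWF user ID (field 12) and job number (field 1), and that it is applied to the lines of the converted SWF file, where header comment lines start with ";". The names FLURRIES, is_flurry, and clean_swf are hypothetical. This is not the archive's own cleaning tool, and its output is not guaranteed to match the published cleaned log, whose header comments may also differ.

```python
# (user, lower job bound, upper job bound); both bounds are exclusive,
# mirroring "job>LOW and job<HIGH" in the filters above.
FLURRIES = [
    (160, 19279, 19453),
    (79, 47409, 58080),
    (40, 109910, 110858),
]


def is_flurry(job, user):
    """True if the job falls inside one of the three flurry filters."""
    return any(user == u and low < job < high for u, low, high in FLURRIES)


def clean_swf(lines):
    """Yield SWF lines, dropping jobs that match a flurry filter.

    Comment lines (starting with ';') and blank lines pass through unchanged.
    Job numbers are not renumbered, so gaps remain in the output.
    """
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(";"):
            yield line
            continue
        fields = stripped.split()
        job = int(fields[0])    # SWF field 1: job number
        user = int(fields[11])  # SWF field 12: user ID
        if not is_flurry(job, user):
            yield line
```

A typical use would be to open the uncompressed SWF file, pass the file object to clean_swf, and write the yielded lines to a new file.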

Further information on flurries and the justification for removing them can be found in:

The Log in Graphics

File LLNL-Thunder-2007-1.swf

Graphs: weekly cycle; daily cycle; burstiness and active users; job size and runtime histograms; job size vs. runtime scatterplot; utilization

File LLNL-Thunder-2007-1.1-cln.swf

Graphs: weekly cycle; daily cycle; burstiness and active users; job size and runtime histograms; job size vs. runtime scatterplot; utilization

