Parallel Workloads Archive: LLNL uBGL

The LLNL uBGL log

System: A small BlueGene/L system at LLNL
Duration: Nov 2006 to Jun 2007
Jobs: 112,611

This log contains several months' worth of accounting records from a small BlueGene/L system installed at Lawrence Livermore National Lab. This is not the big BlueGene machine that was no. 1 on the Top500 list. For more information about the BlueGene/L machines at LLNL, see https://asc.llnl.gov/computing_resources/bluegenel/. The specific machine from which this log is derived had 2 midplanes, with a total of 2048 processors.

The log is from the machine's initial unclassified stage, when it was not really stable yet; most of the jobs failed, and thus this workload is not representative of normal production use. As soon as the machine became stable it was converted to classified use, and no more data is available.

The LLNL uBGL workload log was graciously provided by Moe Jette, who also helped with background information and interpretation. If you use this log in your work, please use a similar acknowledgment.

Downloads:

LLNL-uBGL-2006-0 1.2 MB gz original log
LLNL-uBGL-2006-2.swf 750 KB gz converted log
LLNL-uBGL-2006-1.swf 750 KB gz OLD VERSION of converted log (replaced 11 Dec 2011)

Papers Using this Log:

This log was used in the following papers: [thebe09] [liuz10] [feitelson14] [lic14]

System Environment

uBGL is a 2048 processor BlueGene/L system, with two midplanes of 1024 processors each. Users can request to run jobs on a subset of a midplane's processors, so midplanes can be shared by several jobs at a time.

Scheduling is performed with the Slurm resource management system.

Log Format

The original log is available as LLNL-uBGL-2006-0.

This file contains one line per completed job in the Slurm format. The fields are

  1. JobId=<number>
  2. UserId=xxxxxx(<number>) - the x's hide the actual user name to preserve privacy
  3. Name=<string> - name of executable (script), could be empty
  4. JobState=<status>
  5. Partition=<string>
  6. TimeLimit=<number> - in minutes
  7. StartTime=<date and time>
  8. EndTime=<date and time>
  9. NodeList=<string> - midplanes, could be (null)
  10. NodeCnt=<number> - midplanes
  11. Connection=<string>
  12. Reboot=no - OPTIONAL FIELD, affects numbering of subsequent ones
  13. Rotate=yes
  14. MaxProcs=<number> - max procs to actually use
  15. Geometry=<string> - midplanes
  16. Start=None
  17. Block_ID=<string> - or unassigned
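Since each record is a sequence of space-separated Key=Value pairs, a parser need not hard-code field positions, which is convenient given that the optional Reboot field shifts the numbering of subsequent fields. A minimal sketch in Python, assuming the format listed above; the sample line is invented for illustration:

```python
import re

def parse_slurm_record(line):
    """Parse one line of the uBGL Slurm accounting log into a dict.

    Assumes fields appear as space-separated Key=Value pairs, e.g.
    JobId=123 UserId=xxxxxx(1001) Name=run.sh JobState=FAILED ...
    Keying by name sidesteps the shifting field numbers caused by
    the optional Reboot field.
    """
    record = {}
    for match in re.finditer(r'(\w+)=(\S*)', line):
        key, value = match.groups()
        record[key] = value
    return record

# Invented sample record for illustration only:
sample = ("JobId=123 UserId=xxxxxx(1001) Name=run.sh JobState=FAILED "
          "Partition=pdebug TimeLimit=30 NodeCnt=1")
rec = parse_slurm_record(sample)
```

A dictionary lookup such as rec['JobState'] then works regardless of whether the optional fields are present.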

Conversion Notes

The converted log is available as LLNL-uBGL-2006-2.swf. The conversion from the original format to SWF was done subject to the notes below. The only difference between conversion 1 and conversion 2 is in the wait time field, which was set to 0 in conversion 1 and to -1 in conversion 2.
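The wait time cannot be computed because the original records contain StartTime and EndTime but no submission time; the run time, by contrast, follows directly from the two timestamps. A small sketch of that derivation, assuming an ISO-like timestamp format (the actual format used in the log may differ):

```python
from datetime import datetime

# SWF uses -1 to denote an unknown value; with no submit time in the
# original log, the wait time must be marked unknown.
WAIT_TIME_UNKNOWN = -1

def runtime_seconds(start, end, fmt="%Y-%m-%dT%H:%M:%S"):
    """Run time in seconds from StartTime and EndTime strings.

    The timestamp format is an assumption for illustration; adjust
    fmt to match the actual log.
    """
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

# Invented timestamps for illustration:
rt = runtime_seconds("2006-11-15T10:00:00", "2006-11-15T10:05:30")
```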

The conversion was done by a log-specific parser in conjunction with a more general converter module.

Usage Notes

A huge number of jobs failed: 101,331 out of a total of 112,611, i.e. 90% (including 1290 that were cancelled). Of these, the vast majority were very short: 99,401 failed jobs ran for less than 5 seconds. This is attributed to the fact that the machine was new and unstable at the time this log was recorded. It obviously affects the usability of the data, and implies it is not representative of normal production use.
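When working with the converted log, these failed jobs can be identified via the SWF status field. A hedged sketch, assuming the standard SWF layout (whitespace-separated fields, run time in field 4, status in field 11, where 0 = failed, 1 = completed, 5 = cancelled) and using invented sample records:

```python
def summarize_swf(lines):
    """Count failed/cancelled and very short failed jobs in an SWF file.

    Assumes the standard SWF layout: run time is field 4 and status is
    field 11 (1-indexed); header lines start with ';'.
    """
    total = failed = short_failed = 0
    for line in lines:
        if not line.strip() or line.lstrip().startswith(';'):
            continue  # skip header comments and blank lines
        fields = line.split()
        total += 1
        runtime, status = float(fields[3]), int(fields[10])
        if status != 1:  # failed (0) or cancelled (5)
            failed += 1
            if runtime < 5:  # the very short failures noted above
                short_failed += 1
    return total, failed, short_failed

# Invented sample records for illustration only:
sample_lines = [
    "; SWF header comment",
    "1 10 -1 3 512 -1 -1 512 1800 -1 0 1 1 1 1 1 -1 -1",
    "2 20 -1 600 1024 -1 -1 1024 1800 -1 1 2 1 1 1 1 -1 -1",
    "3 30 -1 2 512 -1 -1 512 1800 -1 5 1 1 1 1 1 -1 -1",
]
total, failed, short_failed = summarize_swf(sample_lines)
```

Applied to the full converted log, such a filter makes it easy to exclude the failed jobs, or to study them separately.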

The log contains abnormally high activity by two main users: the activity of one is concentrated near the beginning, while that of the other is spread throughout the log. However, a cleaned version has not been prepared, as this seemed ill-advised for a log that does not represent production work anyway.

The Log in Graphics

File LLNL-uBGL-2006-2.swf

Graphs shown: weekly cycle, daily cycle, burstiness and active users, job size and runtime histograms, job size vs. runtime scatterplot, utilization.
