The ANL Intrepid log

System: Blue Gene/P (Intrepid) at ANL
Duration: Jan 2009 to Sept 2009
Jobs: 68,936

This log contains several months' worth of accounting records from a large Blue Gene/P system called Intrepid. Intrepid is a 557 TF, 40-rack Blue Gene/P system deployed at the Argonne Leadership Computing Facility (ALCF) at Argonne National Laboratory. The system comprises 40,960 quad-core nodes (163,840 cores in total), together with associated I/O nodes, storage servers, and an I/O network. It debuted as No. 3 in the TOP 500 supercomputer list released in June 2008 and was ranked No. 13 in the list released in November 2010.

Intrepid has been in full production since the beginning of 2009. The system is used primarily for scientific and engineering computing. The vast majority of the use is allocated to awardees of the DOE Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program. For more information about the system at ANL, see URL http://www.alcf.anl.gov/resources/storage.php.

The workload log from ANL Intrepid was graciously provided by Susan Coghlan (smc@alcf.anl.gov) from ALCF at ANL and Narayan Desai (desai@mcs.anl.gov) from MCS at ANL. It was converted to SWF and made available by Wei Tang (wtang6@iit.edu) from Illinois Institute of Technology.

Downloads:

ANL-Intrepid-2009-1.swf 0.9 MB gz converted log
failure data 40 MB zip RAS log available from the CFDR

System Environment

Intrepid (Blue Gene/P) – the ALCF production machine for open science research
  • 40,960 quad-core nodes
  • 163,840 cores available for computation
  • 80 terabytes memory (2 GB per node, 512 MB per core)
  • 557 teraflops
  • 640 additional I/O nodes

The log contains the first 8 months' workload on the 40-rack production Intrepid. Each rack houses 1024 nodes, representing 4096 processor cores and 2 TB of memory. Like other Blue Gene/P systems, Intrepid groups nodes into partitions, and each job is executed in a separate partition. In 8 of the racks the minimal partition size is 64 nodes (256 cores). In the remaining racks the minimal size is 512 nodes (2048 cores). Partitions of fewer than 512 nodes are used only for development jobs. Scheduling is performed with the Cobalt resource management system. For more information about Cobalt, see URL http://trac.mcs.anl.gov/projects/cobalt/.
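The node, core, and memory figures quoted above follow from a few per-rack constants. The short Python sketch below simply reproduces that arithmetic as a sanity check; the constant and function names are illustrative and do not come from the log or from Cobalt.

```python
# Per-rack and per-node constants as stated in the system description.
RACKS = 40
NODES_PER_RACK = 1024
CORES_PER_NODE = 4       # quad-core nodes
MEM_PER_NODE_GB = 2

def nodes_to_cores(nodes: int) -> int:
    """Convert a partition size in nodes to a size in cores."""
    return nodes * CORES_PER_NODE

cores_per_rack = nodes_to_cores(NODES_PER_RACK)             # 4096 cores
mem_per_rack_tb = NODES_PER_RACK * MEM_PER_NODE_GB / 1024   # 2.0 TB
total_nodes = RACKS * NODES_PER_RACK                        # 40,960 nodes
total_cores = nodes_to_cores(total_nodes)                   # 163,840 cores
total_mem_tb = RACKS * mem_per_rack_tb                      # 80.0 TB

# Minimal partition sizes mentioned in the text, expressed in cores.
min_partition_cores_small = nodes_to_cores(64)    # 256
min_partition_cores_large = nodes_to_cores(512)   # 2048
```

Because the SWF fields count processors (cores) rather than nodes, a conversion like `nodes_to_cores` is useful when relating job sizes in the log to partition sizes given in nodes.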

In parallel to the job log available here, there is also a RAS log on the Computer Failure Data Repository. This enables the joint analysis of how failures affect jobs.

Papers Using this Log:

This log was used in the following papers:
[tang10] [tang11] [zheng11]

Log Format

The original log is not available.

The available file contains one line per completed job in the SWF format. The valid fields are:

1 Job number
2 Submit time (in seconds)
3 Wait time (in seconds)
4 Running time (in seconds)
5 Number of allocated processors
8 Requested number of processors
9 Requested running time (in seconds)
12 User ID
15 Queue number
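As an illustration of how the fields listed above might be read, here is a minimal Python sketch of an SWF reader. It assumes the standard SWF layout (18 whitespace-separated fields per job, header and comment lines starting with `;`, and -1 for unknown values) and keeps only the nine fields listed as valid for this log. The `Job` type, the `parse_swf` function, and the sample record are hypothetical and not taken from the log.

```python
from typing import Iterable, Iterator, NamedTuple

class Job(NamedTuple):
    job_number: int        # field 1
    submit_time: int       # field 2, seconds
    wait_time: int         # field 3, seconds
    run_time: int          # field 4, seconds
    allocated_procs: int   # field 5
    requested_procs: int   # field 8
    requested_time: int    # field 9, seconds
    user_id: int           # field 12
    queue_number: int      # field 15

# Zero-based positions of the valid fields within an SWF line.
VALID_FIELDS = (0, 1, 2, 3, 4, 7, 8, 11, 14)

def parse_swf(lines: Iterable[str]) -> Iterator[Job]:
    """Yield one Job per data line, skipping comments and blank lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        parts = line.split()
        if len(parts) < 15:
            continue  # too short to hold all valid fields; skip it
        # int(float(...)) tolerates values written as "3600.0".
        yield Job(*(int(float(parts[i])) for i in VALID_FIELDS))

# Made-up sample record for demonstration only (not from the log).
sample = [
    "; Comment line, ignored by the parser",
    "1 0 7 3600 2048 -1 -1 2048 7200 -1 -1 3 -1 -1 1 -1 -1 -1",
]
jobs = list(parse_swf(sample))
```

To read the real file, the same function can be given an open text-mode file handle (for a gzipped download, one opened with `gzip.open(path, "rt")`), where `path` is wherever the log was saved.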

Conversion Notes

The converted log is available as ANL-Intrepid-2009-1.swf. The conversion from the original format to SWF was done subject to the following.

Usage Notes

Field 8 is the requested number of processors which is provided by the user. It need not correspond to a partition size. This is most probably the number of processors the job will actually use.

Field 5 is the number of processors actually allocated, which is larger than or equal to field 8 and corresponds to the partition size. When allocating nodes, the Cobalt scheduler chooses the smallest partition that can accommodate the job. If you want to analyze fragmentation on Blue Gene/P, you need to look at the allocations (field 5); if you are using this data to simulate user requests regardless of the constraints of the architecture, you should use the requests (field 8).

Specifically, in the log 30,948 jobs got more processors than they requested. In some cases this difference was extreme; for example, there are jobs that requested 40,960 processors but were allocated the full 163,840 processors in the machine.
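The gap between requests (field 8) and allocations (field 5) can be summarized with a few lines of Python. The sketch below is one possible way to do so, not the method used to produce the figure above: it takes (requested, allocated) pairs, skips entries marked unknown with -1, and reports how many jobs received more processors than they asked for and the largest allocation-to-request ratio. The function name and the sample pairs are hypothetical.

```python
from typing import Dict, Iterable, Tuple

def over_allocation(pairs: Iterable[Tuple[int, int]]) -> Dict[str, float]:
    """Summarize how allocations (field 5) exceed requests (field 8)."""
    total = 0          # jobs with both values known
    over = 0           # jobs with allocated > requested
    worst_ratio = 1.0  # largest allocated / requested seen
    for requested, allocated in pairs:
        if requested <= 0 or allocated <= 0:
            continue  # -1 marks an unknown value in SWF
        total += 1
        if allocated > requested:
            over += 1
            worst_ratio = max(worst_ratio, allocated / requested)
    return {"jobs": total, "over_allocated": over, "worst_ratio": worst_ratio}

# Made-up pairs for demonstration; the last one mirrors the extreme case
# described in the text (40,960 requested, 163,840 allocated).
summary = over_allocation([
    (2048, 2048),
    (1000, 2048),
    (40960, 163840),
    (-1, 512),
])
```

Run over the full log, the `over_allocated` count can be compared against the figure quoted above.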

The Log in Graphics

File ANL-Intrepid-2009-1.swf

  • weekly cycle
  • daily cycle
  • burstiness and active users
  • job size and runtime histograms
  • job size vs. runtime scatterplot
  • utilization
  • offered load
  • performance

