Parallel Workloads Archive: CTC SP2

The Cornell Theory Center (CTC) IBM SP2 log

System: 512-node IBM SP2
Duration: July 1996 through May 1997
Jobs: 79,302

This log contains 11 months' worth of accounting records for the 512-node IBM SP2 located at the Cornell Theory Center (CTC). Apparently, only 338 nodes were used for the batch jobs in the log. Scheduling on this machine was performed by EASY and LoadLeveler. For more information about CTC, see http://www.tc.cornell.edu/.

The workload log from the CTC SP2 was graciously provided by Dan Dwyer (dwyer@tc.cornell.edu) from the Cornell Theory Center, a high-performance computing center at Cornell University, Ithaca, New York, USA. The information below was provided by Steve Hotovy. If you use this log in your work, please use a similar acknowledgment. Also, please send a notice of your work to cal@tc.cornell.edu.

In addition to the production log from July 1996 to May 1997, an early log covering 75,944 jobs during June 1995 to April 1996 is also available. This is the log used by Hotovy in his analysis of the evolution of the workload soon after the machine was installed ([hotovy96]). During this period only LoadLeveler was used.

Downloads:

CTC-SP2-1996-0 3.6 MB gz original log
CTC-SP2-1996-3.swf 1.5 MB gz converted log
CTC-SP2-1996-3.1-cln.swf 1.5 MB gz cleaned log -- RECOMMENDED, see usage notes
CTC-SP2-1996-1.swf 1.5 MB gz OLD VERSION of converted log (replaced 1 Aug 2006)
CTC-SP2-1996-1.1-cln.swf 1.5 MB gz OLD VERSION of cleaned log (replaced 1 Aug 2006)
CTC-SP2-1996-2.swf 1.5 MB gz OLD VERSION of converted log (replaced 30 Nov 2011)
CTC-SP2-1996-2.1-cln.swf 1.5 MB gz OLD VERSION of cleaned log (replaced 30 Nov 2011)
CTC-SP2-1995-2.swf 1.4 MB gz the early log

System Environment

Of the 512 nodes in the system, 430 are dedicated to running batch jobs (but see usage notes below). The remainder of the nodes are used for interactive jobs, I/O nodes, special projects, and system testing. The log pertains to the batch partition.

The CTC SP2 is heterogeneous in the sense that not all 512 nodes are identical. The actual configurations of the 430 nodes in the batch partition are as follows:
Node type   Memory
            128MB  256MB  512MB  1024MB  2048MB
Thin          352     30      0       0       0
Wide            0     22     21       4       1

Update (3 June 2013):

The link given above for data about the system is no longer available, but a snapshot from 1997 is available on the Internet Archive. In particular, this includes a page specifying the details of the SP system's configuration. This indicates that the system was divided into several distinct pools that were scheduled in different ways. Specifically, pool 4 was scheduled by EASY-LL, and included 21 racks of 16 thin nodes each, plus 27 nodes from additional racks. Given that 16x21=336, this may be the actual partition and size that gave rise to this log. This also matches the usage data as shown below. If the nodes from the other racks are included, the size is 363, but then the typical usage level is only 0.93.

(Thanks to Dan Tsafrir for digging this up.)

Papers Using these Logs:

These logs were used in the following papers:
[hotovy96] [downey97a] [downey97c] [downey98b] [smith98] [schwiegelshohn98b] [downey99] [squillante99] [krallmann99] [talby99b] [cirne00] [mualem01] [feitelson01] [cirne01b] [streit02] [srinivasan02] [srinivasan02b] [lawson02] [ernemann02] [sabin03] [shmueli03] [ernemann03] [islam03] [feitelson03a] [song04] [schroeder04] [streit04] [aridor04] [england04] [feitelson04b] [feitelson05b] [feitelson05c] [feitelson05d] [tsafrir05b] [dutot05] [heine05] [sabin05] [shmueli05] [zilber05] [feitelson06a] [tsafrir06a] [tsafrir06b] [shmueli06] [franke06] [sabin06] [ranjan06] [tsafrir07a] [feitelson07a] [tsafrir07b] [talby07] [shmueli07] [ranjan08] [iosup08] [feitelson08] [goh08] [shmueli09] [feitelson09] [folling09] [guim09] [minh09] [thebe09] [aida09] [tsafrir10] [yuan11] [lindsay12] [liux12] [utrera12] [niu12] [krakov12] [kumar12] [klusacek12] [etinski12] [ababneh12] [zakay13] [liang13] [chen13] [krakov13] [rajbhandary13] [cao14] [kumar14] [zakay14] [feitelson14] [liu15] [carastans17] [wang18] [soysal19]

Log Format

The original log is available as CTC-SP2-1996-0.

This file contains one line per completed job with the following white-space separated fields:

Conversion Notes

The converted log is available as CTC-SP2-1996-3.swf. The conversion from the original format to SWF was done subject to the following. The conversion was done by a log-specific parser in conjunction with a more general converter module.

The only difference between conversion 3 (reflected in CTC-SP2-1996-3.swf) and conversion 2 (CTC-SP2-1996-2.swf) is the assumed size of the machine: in conversion 3 it is set to 338.

The differences between conversion 2 (reflected in CTC-SP2-1996-2.swf) and conversion 1 (CTC-SP2-1996-1.swf) are

The converted early log is available as CTC-SP2-1995-2.swf. The conversion from the original format to SWF was done subject to the following.

The difference between CTC-SP2-1995-2.swf and CTC-SP2-1995-1.swf is 1 second in the arrival times of the 98 jobs that had negative wait times.

Usage Notes

From the utilization plot of log CTC-SP2-1996-2 it is apparent that the utilization is actually capped at around 78.4%. This implies that the actual batch partition size used was probably 338 nodes, and not 430. This evidence was strong enough to warrant the production of the CTC-SP2-1996-3 version, where the size is indeed set to 338. In the early log it seems to have really been 430, but there is also a period where the utilization is nearly double what it should be. [See update above indicating the real number may actually be 336.]
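The arithmetic behind these size estimates can be checked with a short sketch. It uses only the figures quoted in this page (the 78.4% cap, the nominal 430 nodes, and the 21 racks of 16 nodes plus 27 extra nodes from the update above); the variable names are illustrative, and reading the 0.93 usage level as the capped usage divided by 363 is an assumption.

```python
# Back-of-the-envelope check of the partition-size estimates quoted above.
NOMINAL_BATCH_NODES = 430   # nominal batch partition size
UTILIZATION_CAP = 0.784     # observed cap when 430 nodes are assumed

# Number of busy nodes that the cap corresponds to.
implied_nodes = UTILIZATION_CAP * NOMINAL_BATCH_NODES

# Pool 4 as described in the 2013 update.
RACKS, NODES_PER_RACK, EXTRA_NODES = 21, 16, 27
full_racks = RACKS * NODES_PER_RACK
with_extra = full_racks + EXTRA_NODES

print(f"busy nodes implied by the cap: {implied_nodes:.1f}")
print(f"pool 4, full racks only:       {full_racks}")
print(f"pool 4, with extra nodes:      {with_extra}")
# Assumption: the "typical usage level" is the capped usage over the pool size.
print(f"cap relative to {with_extra} nodes:    {implied_nodes / with_extra:.2f}")
```

The implied figure of roughly 337 busy nodes falls between the 336 and 338 candidates, which is why the cap alone does not settle which of the two is the real partition size.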

The original log contains a flurry of activity by one user which may not be representative of normal usage. This has been removed in the cleaned version of the log, and it is recommended that this version be used.
The cleaned log is available as CTC-SP2-1996-3.1-cln.swf.

A flurry is a burst of very high activity by a single user. In this case, it involved 2080 jobs. The filter used to remove it was

user=135 and job>47420 and job<50308
Note that the filter was applied to the original log, and unfiltered jobs remain untouched. As a result, in the cleaned log job numbering is not consecutive.
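As a minimal sketch of how such a filter could be applied, the following reads SWF lines and drops the jobs matched by the expression above. It assumes that the job number is the first SWF field, that the user ID is the twelfth, and that header lines start with a semicolon; the function names are illustrative and this is not the archive's own cleaning tool.

```python
from typing import Iterable, Iterator, List

JOB_FIELD = 0    # job number is the 1st SWF field (0-based index 0)
USER_FIELD = 11  # user ID is the 12th SWF field (0-based index 11)


def is_flurry_job(fields: List[str]) -> bool:
    """Return True if the record matches user=135 and job>47420 and job<50308."""
    job = int(fields[JOB_FIELD])
    user = int(fields[USER_FIELD])
    return user == 135 and 47420 < job < 50308


def remove_flurry(lines: Iterable[str]) -> Iterator[str]:
    """Yield every line except the job records matched by the filter."""
    for line in lines:
        stripped = line.strip()
        # Header comments and blank lines are passed through unchanged.
        if not stripped or stripped.startswith(";"):
            yield line
            continue
        if not is_flurry_job(stripped.split()):
            yield line
```

Because matching records are simply skipped and the others are passed through untouched, the surviving jobs keep their original numbers, which is the source of the gap in job numbering mentioned above.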

Further information on flurries and the justification for removing them can be found in:


The Log in Graphics

File CTC-SP2-1996-2.swf

Utilization with 430 nodes: this is the utilization graph when assuming 430 nodes, showing that the utilization has a pronounced upper limit of 0.78 and implying that the actual partition size is smaller.
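One way to look for such an upper limit directly in an SWF file is to compute the largest number of processors in use at any moment. The sketch below is an assumption-laden illustration rather than the method used to draw the graph: it takes fields 2 to 5 of each record as submit time, wait time, run time, and allocated processors, and skips records with missing (negative or zero) values.

```python
from typing import Iterable, List, Tuple


def peak_busy_processors(swf_lines: Iterable[str]) -> int:
    """Return the maximum number of processors in use at the same time."""
    events: List[Tuple[int, int]] = []
    for line in swf_lines:
        line = line.strip()
        # Skip blank lines and header comments.
        if not line or line.startswith(";"):
            continue
        fields = line.split()
        submit, wait, run, procs = (int(float(x)) for x in fields[1:5])
        # Records with unknown values cannot be placed on the timeline.
        if wait < 0 or run <= 0 or procs <= 0:
            continue
        start = submit + wait
        events.append((start, procs))         # processors acquired at start
        events.append((start + run, -procs))  # processors released at end
    # Tuples sort by time, then by delta, so releases precede
    # acquisitions that happen at the same instant.
    events.sort()
    busy = peak = 0
    for _, delta in events:
        busy += delta
        peak = max(peak, busy)
    return peak
```

Dividing the result by an assumed machine size, for example 430, gives the peak utilization under that assumption.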

File CTC-SP2-1996-3.swf

weekly cycle, daily cycle, burstiness and active users, job size and runtime histograms, job size vs. runtime scatterplot, utilization, offered load, performance

File CTC-SP2-1996-3.1-cln.swf (cleaned)

weekly cycle, daily cycle, burstiness and active users, job size and runtime histograms, job size vs. runtime scatterplot, utilization, offered load, performance

File CTC-SP2-1995-2.swf (early log)

weekly cycle, daily cycle, burstiness and active users, job size and runtime histograms, job size vs. runtime scatterplot, utilization, offered load, performance

