Parallel Workloads Archive: Sandia Ross

The Sandia/Ross log

System: CPlant Cluster with 48 cabinets of 32 nodes
Duration: Dec 2001 to Jan 2005
Jobs: 85,355

This log contains some three years' worth of accounting records from the Sandia Ross cluster, which constituted phase III of the CPlant project. It was installed in 2000 and comprised 48 scalable units from Compaq, each with 32 nodes. Note, however, that this size was probably reduced later. This implies that the load in different parts of the log may be quite different, making it unsuitable for bulk usage in simulations; see the usage notes below.

The workload log from the Sandia Ross cluster was graciously provided by Jon Stearley (jrstear@sandia.gov). If you use this log in your work, please use a similar acknowledgment.

Downloads:

Sandia-Ross-2001-1.swf (1.5 MB gz) -- converted log
Sandia-Ross-2001-1.1-cln.swf (1.2 MB gz) -- cleaned log; RECOMMENDED, see usage notes

System Environment

The Sandia CPlant project was a relatively early large-scale cluster intended to replace an MPP.

The log available here is from phase III of the project. Initially this was composed of 48 cabinets. Each cabinet held 32 Compaq DS10L servers, for a total of 1536 servers, of which 1524 were used to run parallel jobs. However, it seems that this number was later reduced. Each node had a 466 MHz Alpha 21264 (EV6) microprocessor and 256 MB of ECC SDRAM. Each cabinet also had a service node used for management.

The nodes were connected by a Myrinet gigabit network. Each cabinet also had an Ethernet network.

System software included a parallel job launcher called yod, a compute-node daemon process called PCT on each node, and a system-wide compute-node allocator called bebopd, which worked with PBS.

Papers Using this Log:

This log was used in the following papers:
[feitelson14]

Log Format

The log is available directly in the Standard Workload Format (SWF).
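
SWF is plain text: lines beginning with ';' are header comments, and every other line describes one job as 18 whitespace-separated fields, with -1 denoting missing values. As a rough illustration (a sketch, not part of the archive's tooling), a minimal Python reader might look as follows; the field names follow the standard SWF definition:

    # Minimal SWF reader: yields one record per job as a dict.
    # Field names follow the Standard Workload Format definition.
    SWF_FIELDS = [
        "job", "submit", "wait", "runtime", "procs_alloc", "avg_cpu",
        "mem_used", "procs_req", "time_req", "mem_req", "status",
        "user", "group", "app", "queue", "partition", "prev_job", "think_time",
    ]

    def read_swf(path):
        with open(path) as f:
            for line in f:
                if line.startswith(";") or not line.strip():
                    continue  # skip header comments and blank lines
                # Most fields are integers, but a few (e.g. average CPU
                # time) may be fractional, so parse everything as float.
                yield dict(zip(SWF_FIELDS, map(float, line.split())))

    jobs = list(read_swf("Sandia-Ross-2001-1.swf"))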

Conversion Notes

The conversion was performed by Stephanie McAllister using scripts written by Gerald Sabin. No information is available about any problems encountered in the conversion process.

However, checking the log with an SWF parser in conjunction with a general converter module did turn up some anomalies.

Usage Notes

From the utilization graph of Sandia-Ross-2001-1.swf it appears that the effective size of the machine was reduced in the fall of 2002. After that date the number of processors seems to be around 992 (31 cabinets). Another small reduction in size may have occurred a year later.
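
One way to check this is to compute processor usage over time from the log itself. The following rough sketch (reusing the read_swf helper shown above; the weekly granularity and the nominal capacity of 1524 processors are assumptions, not archive tooling) credits each job's processor-seconds to the week in which it started:

    from collections import defaultdict

    WEEK = 7 * 24 * 3600  # seconds per week

    def weekly_utilization(jobs, capacity=1524):
        # Approximation: a job's procs * runtime is credited to the week
        # in which it started; jobs are not split across week boundaries,
        # which is good enough to reveal a drop in effective machine size.
        busy = defaultdict(float)
        for j in jobs:
            if j["runtime"] <= 0 or j["procs_alloc"] <= 0:
                continue  # skip records with missing data (-1)
            start = j["submit"] + j["wait"]
            busy[int(start // WEEK)] += j["procs_alloc"] * j["runtime"]
        return {week: sec / (capacity * WEEK)
                for week, sec in sorted(busy.items())}

A sustained drop to about 65% of the earlier level would correspond to the reduction from 1524 to roughly 992 processors.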

The original log contains quite a few flurries of activity by three users, which may not be representative of normal usage. These have been removed in the cleaned version of the log, Sandia-Ross-2001-1.1-cln.swf, and it is recommended that this version be used.

A flurry is a burst of very high activity by a single user. The filters used to remove the three flurries that were identified are

user=38 and ((job>5843 and job<9472) or (job>34042 and job<36017)) (2593 jobs)
user=84 and ((job>10178 and job<24056) or (job>25398 and job<29185)) (10600 jobs)
user=175 and (job>50166 and job<70468) (14280 jobs)
In total, 27473 jobs were removed. Note that the filters were applied to the original log and unfiltered jobs remain untouched; as a result, job numbering in the cleaned log is not consecutive. A sketch of how such filtering could be reproduced is shown below.
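
As an illustration (a sketch only, assuming the read_swf helper above; the archive's own cleaning scripts are not published here), the filters translate directly into Python:

    # (user, lower, upper): jobs with lower < job < upper submitted by
    # that user are considered part of a flurry and removed.
    FLURRY_FILTERS = [
        (38, 5843, 9472), (38, 34042, 36017),
        (84, 10178, 24056), (84, 25398, 29185),
        (175, 50166, 70468),
    ]

    def is_flurry(job):
        return any(job["user"] == u and lo < job["job"] < hi
                   for u, lo, hi in FLURRY_FILTERS)

    cleaned = [j for j in read_swf("Sandia-Ross-2001-1.swf")
               if not is_flurry(j)]
    # Surviving jobs keep their original job numbers, so numbering
    # in the cleaned log is not consecutive.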

Further information on flurries and the justification for removing them can be found in the papers on workload flurries referenced by the Parallel Workloads Archive.

The Log in Graphics

Sandia-Ross-2001-1.swf: graphs of the weekly cycle, daily cycle, burstiness and active users, job size and runtime histograms, job size vs. runtime scatterplot, utilization, offered load, and performance.

Sandia-Ross-2001-1.1-cln.swf: the same set of graphs for the cleaned log.

