Parallel Workloads Archive: MetaCentrum2

The MetaCentrum 2 log

System:	MetaCentrum Czech National Grid
Duration:	Jan 2013 to Apr 2015
Jobs:	5,731,100

This log contains over two years worth of accounting records from the national grid of the Czech republic, called MetaCentrum. It is a longer log from a later period compared to the original MetaCentrum log.

The MetaCentrum grid is composed of a varying number of clusters, each with several multiprocessor machines with multicore CPUs or GPUs. Importantly, the the scheduling system underwent a significant reconfiguration in the middle of this period, which is the subject of a paper based on this log.

For more information about the system, see URL http://metavo.metacentrum.cz/en/index.html.

The MetaCentrum workload log was graciously provided by Czech National Grid Infrastructure MetaCentrum. If you use this log in your work, please use a similar acknowledgment. It was made available via the web page of Dalibor Klusacek, which also includes data about the configuration, specifying 19 clusters with 495 nodes and 8412 cores in total (however, the log appears to contain some jobs that ran on additional clusters as well). To acknowledge Dalibor's work please consider citing the paper that introduced this log:

D. Klusacek, S. Toth, and G. Podolnikova, ``Real-life Experience with Major Reconfiguration of Job Scheduling System''. In Job Scheduling Strategies for Parallel Processing, May 2015.

Downloads:

METACENTRUM-2013-1.swf	72 MB gz	original log in augmented SWF format as received
METACENTRUM-2013-3.swf	58 MB gz	re-converted log
METACENTRUM-2013-2.swf	58 MB gz	OLD VERSION of re-converted log (replaced 16 Sep 2015)

(May need to click with right mouse button to save to disk)

Papers Using this Log:

This log was used in the following papers:
[klusacek15]

System Environment

MetaCentrum is composed of up to around 30 Linux clusters, with different configurations. Some were changed during the logging period.

no.	Cluster	From	To	NxC	Cores	Mem/node (GB)	GPUs/node
1	ajax.zcu.cz	start	end	1x8	8	72	-
2	alela.feec.vutbr.cz	start	5-Oct-2013	12x8	96	32	-
3	doom.metacentrum.cz	30-Sep-2013	end	30x16	480	67	2xGPU
4	eru.ruk.cuni.cz	start	end	2x32	64	264	-
5	gram.zcu.cz	start	end	10x16	160	67	4xGPU
6	haldir.metacentrum.cz	2-Apr-2013	end	1x64	64	1040	-
7	hda.cerit-sc.cz (zapat)	start	end	112x16	1792	134	-
8	hdb.cerit-sc.cz (zigur)	22-Apr-2013	end	32x8	256	134	-
9	hdc.cerit-sc.cz (zegox)	22-Apr-2013	end	48x12	576	94	-
10	hermes.metacentrum.cz	19-Feb-2013	end	11x8	88	14	-
11a	hildor.prf.jcu.cz	start	11-Feb-2013 (renamed)	26x16	416	67	-
11b	hildor.metacentrum.cz	11-Feb-2013	end	26x16	416	67	-
12	konos.fav.zcu.cz	start	end	9x12	108	24	2xGPU
13	losgar.ics.muni.cz	start	end	2x48	96	64	-
14	loslab.ics.muni.cz	start	end	14x12	168	12	-
15	luna.fzu.cz	start	end	47x16	752	96	-
16	mandos.ics.muni.cz	start	end	14x64	896	264	-
17	manegrot.ics.muni.cz	16-Dec-2014	end	4x32	128	512	-
18	manwe.ics.muni.cz	start	end	7x16	112	66	-
19	minos.zcu.cz	start	end	49x12	588	20	-
20	mudrc.metacentrum.cz	17-May-2014	end	12x4	48	3	-
21	nympha.zcu.cz	start	end	19x8	152	14	-
22a	perian1-20.ncbr.muni.cz	start	29-May-2014	20x8	160	25	-
22b	perian21-40.ncbr.muni.cz	start	end	20x8	160	25	-
22c	perian41-56.ncbr.muni.cz	start	end	16x12	192	50	-
23	quark.video.muni.cz	start	end	3x8	24	18	-
24	ramdal.ics.muni.cz	start	end	1x32	32	1058	-
25	skirit.ics.muni.cz	start	5-Oct-2013	28x4	112	3	-
26	tarkil.cesnet.cz	start	end	28x8	224	22	-
27	ungu.cerit-sc.cz	12-Dec-2013	end	1x288	288	6144	-
28	urga.cerit-sc.cz	19-Nov-2014	end	1x384	384	6144	-
29a	zewura.cerit-sc.cz	start	12-Nov-2014 (split)	20x80	1600	512	-
29b	zewura.cerit-sc.cz	12-Nov-2014	end	8x80	640	512	-
30	zebra.cerit-sc.cz	12-Nov-2014	end	12x80	960	512	-
31	zorg.cerit-sc.cz	11-Dec-2014	end	4x10	40	1536	-
32	kalpa.fzu.cz	6-Nov-2013	end	2x24	48	256	-

The notation NxC means N nodes with C cores each; Cores is the total cores.

Jobs could run on processors from more than one cluster. While relatively rare, this did happen for 7011 jobs in the log.

Scheduling is done with TORQUE with a custom built scheduler, employing a system of general queues served by two scheduling servers. The scheduler uses common approaches such as backfilling and fairshare. Documentation is available on the MetaCentrum site. The main queues are as follows:

Queue Priority Time limit

q_2h 50 2h

q_4h 500 4h

q_1d 50 24h

q_2d 50 48h

q_4d 50 96h

q_1w 50 168h

q_2w 50 336h

q_2w_plus 50 720/1488h

backfill 20 24h

short 50 2h

normal 50 24h

long 50 720h

uv 30 96h

gpu 75 24h

gpu_long 55 168h

In addition, there are multiple special queues for specific users and groups and for administrative purposes. The full list is available in the MetaCentrum documentation.

Queue	Priority	Time limit
q_2h	50	2h
q_4h	500	4h
q_1d	50	24h
q_2d	50	48h
q_4d	50	96h
q_1w	50	168h
q_2w	50	336h
q_2w_plus	50	720/1488h
backfill	20	24h
short	50	2h
normal	50	24h
long	50	720h
uv	30	96h
gpu	75	24h
gpu_long	55	168h

The above data is valid for the second half of the log, from January 2014. Note that nearly all the queues have the same priority (50). The practical effect is that jobs are prioritized just by fairshare, and queues are basically only used to define various per-user/group limits. Thus, the system operates over one "virtual" queue which is ordered by fairshare. Before January 2014, there was a fixed queue ordering, where the highest priority was for "long" (70), followed by "short" (60), "normal" (50) and "backfill" (20). Fairshare was only used "locally", within a given queue. The changes in configuration are described in detail in the paper which introduced the log [klusacek15]. The change in configuration apparently led to a change in utilization as seen in the figures below.

Importantly, data about the specific requests made by users is included as an additional field in the original log. This is a ':'-separated list of properties, such as the number of nodes and cores requested, the architecture, and specific clusters to use or to avoid. The possible properties and the mapping of properties to clusters is available in the MetaCentrum documentation. This is considered important as it enables evaluations that take all these different constraints into account.

Log Format

The original log is available as METACENTRUM-2013-1.swf although it does not completely adhere to the SWF format. Note that fields that do not contain valid information are identically 1 instead of -1. Moreover, at the end of each line, 3 additional fields are included:

The list of nodes from which cores were allocated. Each node is indicated by a a string including the cluster name and node number, e.g. "minos25-1.zcu.cz". These are separated by commas (,) and surrounded bu curly braces ({,}). If multiple cores are allocated from the same node the node name will appear multiple times in the list.
The queue name. The queue field of the SWF standard is not used.
The job parameters as given to the "qsub" command. This is a ':'-separated list of properties, which often includes some of the following:
- A number indicating the number of nodes requested
- ppn=n meaning n processors (cores) per node
- The architecture, typically x86_64
- Cluster names to use or cluster names preceded by ^ to avoid
- "excl" meaning nodes should be allocated exclusively even if not all cores are being used

Conversion Notes

The converted log is available as METACENTRUM-2013-2.swf. The conversion from the original format to valid SWF was done subject to the following.

The original SWF format reserves a field for "partition", which is used to represent the cluster on which the job ran. In order to accommodate jobs that ran on more than one cluster this field was allowed to contain a comma-separated list of clusters.
There is no status information so it is unknown if any jobs were killed.
The conversion loses the following data, that cannot be represented in the SWF:
- The number of clusters used by each job.
- The number of nodes and cores per node requested by each job.
- The precise list of processors allocated to the job, and which clusters and nodes they belong to.
- The properties requested by the application.
The following anomalies were identified in the conversion:
- 8,975 jobs were recorded as requesting 0 memory; this was changed to -1.
- 182,742 jobs were recorded as requesting 0 runtime; this was changed to -1.
- 124,486 jobs got more runtime than they requested. in 27,834 cases the extra runtime was larger than 1 min.
- The requested time did not match the queue limit in 2,141,108 jobs. This is OK as users are allowed to specify a time limit rather than rely on the queue limit.
- All the jobs in the log passed a sanity check to ensure the length of the list of assigned processors was equal to the given number of assigned processors. this failed in 989 jobs because a full node was allocated exclusively ("excl" flag) even though not all the cores were required. The numbers did not match also in 290 other jobs.
- Moreover, the number of allocated cores did not match the product of the requested nodes and ppn in 7,128 jobs.

The difference between the first conversion (reflected in METACENTRUM-2013-2.swf) and the second conversion (reflected in METACENTRUM-2013-3.swf) is that additional information about the clusters (partitions) and as a result the numbering of partitions changed. The conversion was done by a log-specific parser in conjunction with a more general converter module. This version of the converter can handle multi-partition (multi-cluster) jobs.

Usage Notes

Large scale flurries apparently exist but have not been cleaned yet.

The log contains all the jobs that started in the logging period, which is all of 2013-2014. Some of these jobs are extremely long, as the maximal runtime allowed on this system is 30 days. Therefore edge effects may happen at both ends of the log, where the logged data does not represent the actual load faithfully. In particular, all the jobs executing in 2015 are actually leftovers from 2014.

The Log in Graphics

File METACENTRUM-2013-3.swf

weekly cycle daily cycle burstiness and active users job size and runtime histograms job size vs. runtime scatterplot clusters utilization offered load performance

Parallel Workloads Archive - Logs