Parallel Systems -- Exercise 3

Course number 67605 --- 2002/2003

Exercise 3: Backfilling and User Runtime Estimates

The simplest way to schedule parallel jobs is to run them first-come-first-served (FCFS): the jobs are kept in a queue in order of arrival, and each is executed whenever enough processors become available. A simple optimization is to schedule jobs out of order if this improves performance.

A specific algorithm that is often used is the EASY backfilling algorithm. If the first queued job requires more processors than are currently free, a reservation is made for some future time at which enough processors will be available. Then the rest of the queue is scanned, and other smaller jobs are allowed to run, provided they will not violate this reservation.
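In outline, one scheduling pass of EASY backfilling might look like the following Python sketch. This is my own illustration, not the code of any particular simulator; the class and function names are assumptions, and real implementations differ in details (e.g. how the "extra" processors at the reservation time are tracked).

```python
from dataclasses import dataclass

@dataclass
class Job:
    id: int
    procs: int       # processors requested
    estimate: float  # user runtime estimate

def easy_schedule(queue, free_procs, running, now):
    """One scheduling pass of EASY backfilling.

    `queue` is the FCFS wait queue; `running` is a list of
    (estimated_finish_time, procs) pairs for executing jobs.
    Returns the jobs started at time `now`.
    """
    started = []
    # Plain FCFS: start queued jobs in order while they fit.
    while queue and queue[0].procs <= free_procs:
        job = queue.pop(0)
        free_procs -= job.procs
        running.append((now + job.estimate, job.procs))
        started.append(job)
    if not queue:
        return started
    # The head job does not fit: reserve the earliest time at which
    # enough processors will be free, assuming running jobs end at
    # their estimated finish times.
    head = queue[0]
    avail, shadow_time, extra = free_procs, now, 0
    for finish, procs in sorted(running):
        avail += procs
        if avail >= head.procs:
            shadow_time = finish          # the reservation time
            extra = avail - head.procs    # spare procs at that time
            break
    # Backfill: a waiting job may jump ahead if it fits now and either
    # ends before the reservation or uses only the spare processors.
    for job in queue[1:]:
        fits_now = job.procs <= free_procs
        harmless = now + job.estimate <= shadow_time or job.procs <= extra
        if fits_now and harmless:
            queue.remove(job)
            free_procs -= job.procs
            if now + job.estimate > shadow_time:
                extra -= job.procs
            running.append((now + job.estimate, job.procs))
            started.append(job)
    return started
```

Note that the reservation is computed from the *estimated* finish times of the running jobs, which is exactly where the quality of the user estimates enters the picture.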

In order to know when processors will become available, and whether jobs are short enough to terminate in time, the system needs to know job runtimes. For this, estimates of the runtimes are requested from users: when a job is submitted, the user gives an estimate of how long it will run. If it is short, it has a chance to backfill. But if it runs for longer than the estimate, the system may kill it so as not to violate a previous reservation.

The interesting thing is that inaccurate estimates, which overestimate the runtime by a factor of 3 to 10, turn out to lead to better performance than accurate estimates. The goal of this exercise is to try and understand how this happens. This is an open research question.

What to do

The exercise is divided into two parts.

Part 1: Reproduce old results

The goal of the first part is to get acquainted with the simulation software, and see the performance results.

The simulator

To make things simpler, I suggest you use a simple event-based simulator I have been using for several years. This simulator reads a workload file as input, and simulates the execution of the specified jobs. Initially, an arrival event is created for each job. Then the simulator enters the event-execution loop, in which it extracts the earliest event, advances simulation time to this event's time, and performs the event, potentially creating new events.

If the event is a job arrival, the scheduler is called. The scheduler adds the new job to the set of waiting jobs it knows about (which might be empty), and tries to schedule it or another job. Scheduling means that nodes are allocated, and a termination event is created for the termination time of the job. Note that this is based on the real termination time, not the estimated one.

If the event is a termination, the job's statistics are recorded. Then the scheduler is called, to see if it can make use of the freed nodes.

The simulator can run the input workload as is, or it can change the load factor. I recommend you start with the code as is (i.e. run the simulator without the -l flag). But you may also experiment with loads in the range of 0.7 to 0.9. To increase the load, the simulator simply multiplies the interarrival times by a factor so that the jobs arrive faster, and hence cause a higher average load. Again, this code exists; you just need to set the desired load level(s) for the load loop in main, and run with -l.
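The load-scaling idea is simple enough to state in a few lines. Here is a hedged sketch (my own function name and interface, not the simulator's): shrinking every interarrival gap by the ratio of current to target load makes jobs arrive proportionally faster.

```python
def scale_load(arrivals, current_load, target_load):
    """Rescale arrival times so the average load rises from
    current_load to target_load (both in (0, 1]).  A factor below 1
    compresses the interarrival gaps, so jobs arrive faster."""
    factor = current_load / target_load
    scaled, prev, t = [], 0.0, 0.0
    for a in arrivals:
        t += (a - prev) * factor   # shrink this interarrival gap
        prev = a
        scaled.append(t)
    return scaled
```

For example, doubling the load halves every gap between consecutive arrivals while preserving their order.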

The software is available (locally) at ~perf/sim.

Workloads

To drive the simulation you need a workload file. This should be in "standard workload format", which is an ASCII file with one line per job, including various fields like arrival time, runtime, requested time, etc. The simulator understands this format.

You should use workload files from real production machines that run the EASY scheduler, and therefore have the required data, and specifically, user runtime estimates. You can find such workloads in the Parallel Workloads Archive. Use the logs from the SP2 machines installed at CTC, KTH, and SDSC.

Runs

Run simulations of the EASY scheduler and one or more of these workloads, and tabulate the results. Then modify the simulator to use accurate runtimes instead of the original user estimates. This is all done in the initialization, as the data is read into the simulator (the code actually exists, you just need to find it and turn it on...). Now repeat the runs. Are the results better or worse? And what if you use synthetic estimates that are, say, 10 or 100 times higher than the real runtime? Or what if you take the original estimates but just multiply them by 3?
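The estimate variants listed above amount to a small substitution at load time. As a hedged sketch (the mode names are mine, not the simulator's), the choice can be expressed like this:

```python
def make_estimate(runtime, user_estimate, mode, factor=1.0):
    """Produce the estimate handed to the scheduler as the workload
    is read in.  Modes: 'user' keeps the original user estimate,
    'exact' substitutes the real runtime, 'scaled-real' multiplies
    the real runtime by `factor`, 'scaled-user' multiplies the
    original user estimate by `factor`."""
    if mode == 'user':
        return user_estimate
    if mode == 'exact':
        return runtime
    if mode == 'scaled-real':
        return runtime * factor
    if mode == 'scaled-user':
        return user_estimate * factor
    raise ValueError(f"unknown mode: {mode}")
```

For example, the "10 times the real runtime" experiment is mode 'scaled-real' with factor=10, and "original estimates times 3" is 'scaled-user' with factor=3.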

Hint: in at least some of these cases, you should see unexpected results: that performance is better when estimates are farther from the real runtimes. Try to find at least one such case (i.e. combination of workload and type of estimates).

Part 2: Try to understand what is going on

In the first part you were supposed to observe that accurate estimates are not necessarily better than overestimates, and that the original user estimates are also bad. The goal of the second part is to try and understand how these somewhat strange results come about.

What you need to do is to monitor exactly what happens during the run. Here are some possible hints.

The really hard part, of course, is to try to put the pieces together and come up with an explanation of how the different types of estimates lead to different qualities of performance. Better yet, can you test whether your explanation is correct?

Submit

Send an email with your results to parsys@cs. Please use plain ASCII text, not word attachments! You can add graphs as gif/jpeg attachments if you think they are useful.

In your email, outline the experiments you ran, the performance results you got, and what you learnt from them. In part 1, you should just make a few runs and reproduce the results described above. In part 2, you should design additional experiments, or collect additional data, that will help explain these results. If you indeed come up with a good explanation that's great. If you don't, at least explain what you tried to do and what you did find.

Due date for part 1: Thursday, May 8. I'll send you a reply email with feedback.

Due date for part 2: Friday, May 23.