The dynamics of backfilling: solving the mystery of why
increased inaccuracy may help
Dan Tsafrir, Dror G. Feitelson
IEEE International Symposium on Workload Characterization
(IISWC)
October 2006,
San Jose, California,
pages 131–141
Abstract ,
BibTeX ,
PDF ,
Definitive
Parallel job scheduling with backfilling requires users to provide
runtime estimates, used by the scheduler to better pack the jobs.
Studies of the impact of such estimates on performance have
modeled them using a “badness factor” f &ge 0 in an attempt to
capture their inaccuracy (given a runtime r, the estimate is
uniformly distributed in [r, (f+1) · r]). Surprisingly,
inaccurate estimates (f>0) yielded better performance than
accurate ones (f=0). We explain this by a “heel and toe”
dynamics that, with f>0, cause backfilling to approximate
shortest-job first scheduling. We further find the effect of
systematically increasing f is V-shaped: average wait time and
slowdown initially drop, only to rise later on. This happens
because higher fs create bigger “holes” in the schedule
(longer jobs can backfill) and increase the randomness (more long
jobs appear as short), thus overshadowing the initial heel-and-toe
preference for shorter jobs.
The bottom line is that artificial inaccuracy generated by
multiplying (real or perfect) estimates by a factor is (1) just a
scheduling technique that trades off fairness for performance, and
is (2) ill-suited for studying the effect of real
inaccuracy. Real estimates are modal (90% of the jobs use the
same 20 estimates) and bounded by a maximum (usually the most
popular estimate). Therefore, when performing an evaluation,
“increased inaccuracy” should translate to increased modality.
Unlike multiplying, this indeed worsens performance as one would
intuitively expect.
@InProceedings{tsafrir06-bfdyn,
| author |
= |
{Dan Tsafrir and Dror G. Feitelson}, |
| title |
= |
{The dynamics of backfilling: solving the mystery of why
increased inaccuracy may help}, |
| booktitle |
= |
{IEEE International Symposium on Workload Characterization
(IISWC)}, |
| year |
= |
2006, |
| month |
= |
{October}, |
| pages |
= |
{131--141}, |
| address |
= |
{San Jose, California} |
}
Modeling, evaluating, and improving the performance of
supercomputer scheduling
[PhD thesis]
Dan Tsafrir
Technical Report 2006-78, School of Computer Science and
Engineering, the Hebrew University
September 2006,
Jerusalem, Israel
Abstract ,
BibTeX ,
PDF ,
The most popular scheduling policy for parallel systems is FCFS
with backfilling (a.k.a. “EASY” scheduling), where short jobs
are allowed to run ahead of their time provided they do not delay
previously queued jobs (or at least the first queued job). This
mandates users to provide estimates of how long jobs will run, and
jobs that violate these estimates are killed so as not to violate
subsequent commitments. The de-facto standard of evaluating the
impact of inaccurate estimates on performance has been to use a
“badness factor” f &ge 0, such that given a runtime r, the
associated estimate is uniformly distributed in [r, r ·
(f+1)], or is simply r·(f+1). The underlying assumption
was that bigger fs imply worse information.
Surprisingly, inaccurate estimates (f>0) yield better
performance than accurate ones (f=0), a fact that has repeatedly
produced statements like “inaccurate estimates actually improve
performance” or “what the scheduler doesn't know won't hurt
it”, in many independent studies. This has promoted the
perception that estimates are “unimportant”. At the same time,
other studies noted that real user estimates are inaccurate, and
that system-generated predictions based on history can do better.
But predictions were never incorporated into production
schedulers, partially due the aforementioned perception that
inaccuracy actually helps, partially because suggested predictors
were too complex, and partially because underprediction is
technically unacceptable, as users will not tolerate jobs being
killed just because system predictions were too short. All
attempts to solve the latter technicality yielded algorithms that
are inappropriate for many supercomputing settings (e.g. using
preemption, assuming all jobs are restartable, etcetera).
This work has four major contributions. First, we show
that the “inaccuracy helps” common wisdom is merely an
unwarranted artifact of the erroneous manner in which inaccurate
estimates have been modeled, and that increased accuracy actually
improves performance. Specifically, previously observed
improvements turn out to be due to a “heel and toe” dynamics
that, with f>0, cause backfilling to approximate shortest-job
first scheduling. We show that multiplying estimates by a factor
translates to trading off fairness for performance, and that this
reasoning works regardless of whether the values being multiplied
are actual runtimes (“perfect estimates”) or the flawed
estimates that are supplied by users. We further show that the
more accurate the values we multiply, the better the resulting
performance. Thus, better estimates actually improve performance,
and multiplying is in fact a scheduling policy that
exercises the fairness/performance tradeoff. Regardless,
multiplying is anything but representative of real inaccuracy, as
outlined next.
Our second contribution is developing a more
representative model of estimates that, from now on, will allow
for a valid evaluation of the effect of inaccurate estimates. It
is largely based on noting that human users repeatedly use the
same “round” values (ten minutes, one hour etc.) and on the
invariant that 90% of the jobs use the same 20 estimates.
Importantly, the most popular estimate is typically the maximal
allowed. As a result, the jobs associated with this estimate
cannot be backfilled, and indeed, the more this value is used, the
more EASY resembles plain FCFS. Thus, to artificially increase
the inaccuracy one should e.g. associate more jobs with the
maximum (a realistic manipulation), not multiply by a
greater factor (a bogus boost of performance).
Our third contribution exploits the above understandings
to devise a new scheduler that is able to automatically improve
the quality of estimates and put this into productive use in the
context of EASY, while preserving its attractive simple batch
essence and refraining from any unacceptable assumptions.
Specifically, the problem of underprediction is solved by
divorcing kill-time from the runtime prediction, and correcting
predictions adaptively at runtime as needed, if they are proved
wrong. The result is a surprisingly simple scheduler, which
requires minimal deviations from current practices, and behaves
exactly like EASY as far as users are concerned. Nevertheless, it
achieves significant improvements in performance, predictability,
and accuracy. Notably, this is based on a very simple runtime
predictor that just averages the runtimes of the last two jobs by
the same user; counterintuitively, our results indicate that using
recent data is more important than saving and mining the history
for similar jobs, as was done by previous work. For further
performance enhancements, we propose to exploit the “heel and
toe” understanding: explicitly using a shortest job
backfilled first (SJBF) backfilling order. This directly
leads to a performance improvements similar to those previously
attributed to stunts like multiplying estimates. By still
preserving FCFS as the basis, we maintain EASY's appeal and enjoy
both worlds: a fair scheduler that nevertheless backfills
effectively.
Finally, our fourth contribution has broader
applicability, that transcends the supercomputing domain. All of
the above results are based on the standard methodology of
modeling and simulating real activity logs of production systems,
which is routinely practiced in system-related research. The
overwhelmingly accepted assumption underlying this methodology is
that such real workloads are representative and reliable. We
show, however, that real workloads may also contain anomalies that
make them non-representative and unreliable. This is a special
case of multi-class workloads, where one class is the “real”
workload which we wish to use in the evaluation, and the other
class contaminates the log with “bogus” data. We provide
several examples of this situation, including an anomaly we call
“workload flurries”: surges of activity with a repetitive
nature, caused by a single user, that dominate the workload for a
relatively short period. Using a workload with such anomalies in
effect emphasizes rare and unique events (e.g. occurring for a
few days out of two years of logged data), and risks optimizing
the design decision for the anomalous workload at the expense of
the normal workload. Thus, we claim that such anomalies should be
removed from the workload before it is used in evaluations, and
that ignoring them is actually an unjustifiable approach.
@PhdThesis{tsafrir06-phd,
| author |
= |
{Dan Tsafrir}, |
| title |
= |
{Modeling, evaluating, and improving the performance of
supercomputer scheduling}, |
| school |
= |
{School of Computer Science and
Engineering, the Hebrew University}, |
| year |
= |
2006, |
| month |
= |
{September}, |
| address |
= |
{Jerusalem, Israel}, |
| note |
= |
{Technical Report 2006-78} |
}
Workload sanitation for performance evaluation
Dror G. Feitelson, Dan Tsafrir
IEEE International Symposium on Performance Analysis of
Systems and Software (ISPASS)
March 2006,
Austin, Texas,
pages 221–230
Abstract ,
BibTeX ,
PDF ,
Definitive
The performance of computer systems depends, among other things,
on the workload. Performance evaluations are therefore often done
using logs of workloads on current productions systems, under the
assumption that such real workloads are representative and
reliable; likewise, workload modeling is typically based on real
workloads. We show, however, that real workloads may also contain
anomalies that make them non-representative and unreliable. This
is a special case of multi-class workloads, where one class is the
“real” workload which we wish to use in the evaluation, and the
other class contaminates the log with “bogus” data. We provide
several examples of this situation, including a previously
unrecognized type of anomaly we call “workload flurries”: surges
of activity with a repetitive nature, caused by a single user,
that dominate the workload for a relatively short period. Using a
workload with such anomalies in effect emphasizes rare and unique
events (e.g. occurring for a few days out of two years of logged
data), and risks optimizing the design decision for the anomalous
workload at the expense of the normal workload. Thus we claim that
such anomalies should be removed from the workload before it is
used in evaluations, and that ignoring them is actually an
unjustifiable approach.
@InProceedings{feitelson06-clean,
| author |
= |
{Dror G. Feitelson and Dan Tsafrir}, |
| title |
= |
{Workload sanitation for performance evaluation}, |
| booktitle |
= |
{IEEE International Symposium on Performance Analysis of
Systems and Software (ISPASS)}, |
| year |
= |
2006, |
| month |
= |
{March}, |
| pages |
= |
{221--230}, |
| address |
= |
{Austin, Texas} |
}