The evaluation process
New in this year's competition, participants will receive performance reports and will be able to submit improved versions of their solvers throughout the competition period. A leaderboard will be updated continuously. The evaluation criteria for the different tasks are as follows:
- PR: The score for the partition function is evaluated only on models for which we were able to obtain exact answers (given extensive time and memory resources). It is defined as follows:
Denote the exact partition function by Z* and the solver's approximation by Z_s. The score is |log(Z*/Z_s)|.
- MPE: The performance of the most probable explanation (MPE) estimate is computed relative to the performance of the other competitors, a simple asynchronous belief propagation (BP) baseline, and a default result. This lets us evaluate the task on models where the MPE cannot be computed exactly. The default result is the assignment that maximizes only the single-variable factors (taking the first value for any variable that has no such factor). The score is calculated as follows.
Denote the energy of a solution x by E(x).
The energy is E(x) = -∑_a log f_a(X_a = x_a), where the sum runs over the factors f_a of the model.
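We denote the best result by x* = arg min_{s ∈ S} E(x_s), where S denotes the set of all solvers; because the energy is a negative log-potential, the lowest energy corresponds to the most probable assignment. We denote the standard BP result by x_bp and the default result by x_def.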
Solver scores are relative to the BP or the default solution.
The score will be:
- MAR: The score for the marginals is evaluated only on models for which we were able to obtain exact answers. It is calculated as follows.
Denote the exact marginal of the i-th variable taking the value x_i by P*(X_i = x_i).
In the same way, denote the solver's marginal by P_s(X_i = x_i).
The score will be:
- BEL: The score for the learning task is calculated only on models whose exact marginals we could evaluate in a relatively short time.
The marginals of the learned network are computed exactly and compared with the true marginals of the model.
The score is calculated as in the MAR task.
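The quantities defined in words above (the PR score, the energy of an assignment, and the default MPE assignment) can be sketched in a few lines of Python. This is only an illustration, not the organizers' evaluation code: the factor representation (a scope tuple plus a dictionary of potentials), the function names, and the choice to multiply several single-variable factors on the same variable are all assumptions made here.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical factor representation (an assumption, not the competition format):
# (scope, table), where scope lists variable indices and table maps a tuple of
# values for those variables to a non-negative potential.
Factor = Tuple[Tuple[int, ...], Dict[Tuple[int, ...], float]]


def pr_score(log_z_exact: float, log_z_solver: float) -> float:
    """|log(Z*/Z_s)|, computed from log-values to avoid overflow."""
    return abs(log_z_exact - log_z_solver)


def energy(factors: Sequence[Factor], x: Sequence[int]) -> float:
    """E(x) = -sum_a log f_a(X_a = x_a); infinite if any potential is zero."""
    total = 0.0
    for scope, table in factors:
        potential = table[tuple(x[i] for i in scope)]
        if potential == 0.0:
            return math.inf
        total -= math.log(potential)
    return total


def default_assignment(factors: Sequence[Factor], num_vars: int) -> List[int]:
    """Maximize the single-variable factors only; use value 0 otherwise."""
    x = [0] * num_vars  # "first value" for variables without a unary factor
    unary: Dict[int, Dict[int, float]] = {}
    for scope, table in factors:
        if len(scope) == 1:
            scores = unary.setdefault(scope[0], {})
            for (value,), potential in table.items():
                # Assumption: several unary factors on one variable are multiplied.
                scores[value] = scores.get(value, 1.0) * potential
    for var, scores in unary.items():
        # max() keeps the first maximum, so ties go to the lowest value.
        x[var] = max(sorted(scores), key=scores.__getitem__)
    return x


if __name__ == "__main__":
    factors = [
        ((0,), {(0,): 0.2, (1,): 0.8}),
        ((0, 1), {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 4.0, (1, 1): 0.5}),
    ]
    x_def = default_assignment(factors, num_vars=2)
    print(x_def, energy(factors, x_def))
    print(pr_score(math.log(10.0), math.log(8.0)))
```

Working with log Z rather than Z itself is a design choice of this sketch: partition functions of large models easily exceed floating-point range, while the difference of logs gives the same score.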