[Termtools] Upcoming competition - discussion during Nancy Workshop

Johannes Waldmann waldmann at imn.htwk-leipzig.de
Mon May 14 09:23:34 CEST 2007


Discussion on the upcoming (and future) Termination Competition
with participants of the CoLoR Workshop, Nancy, May 11th
(with authors/members of Aprove, TPA, TTT, Jambox, Matchbox).


1) The competition runs on a subset of TPDB; the random distribution
over subsets (directories) is determined by the committee:
General agreement. Aart suggested random selection from the complete TPDB.

I said that some parts (subdirectories) will probably be run in full,
while others will be subject to random selection.

BTW: I request that my SRS problem set be fully included in the TPDB,
and that random selection then be applied; details to be discussed
(e.g. run size-11 SRS fully, size 12 by random selection).


2) Count YES and/or NO: in general, the idea of summing both counts
(to determine the overall winner) was favoured.
The impression was that this had been the original intention,
but that it was neglected in last year's report.


3) Measuring the difficulty of problems: general agreement on:

a) determine the weight of "old" problems from last year's performance
b) the evaluation of a prover depends on current performance

That means: for (a) we compute a "value" for each problem;
according to (b), this value is distributed among this year's solvers.

For us (the committee) this means we have to give concrete formulas
for (a) and (b). The simplest choice: compute the weight as proposed by
Claude, i.e. weight = 1 - #solvers/#participants. We do not want these
weights as annotations in the problem files (because the weights will change).

Suggestion: we run last year's provers (now) on the new problems
to obtain an initial difficulty rating.
If we do not want this extra run,
we could instead use the current performance for the weighting.
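
To make this concrete, here is a minimal sketch (in Python, purely for
illustration; the data layout and the equal split of a problem's weight
among its solvers are assumptions on my part, not an agreed format):

    # Illustrative sketch only; not an agreed format or implementation.
    # (a) weight of an "old" problem from last year's performance,
    #     using Claude's formula: 1 - #solvers / #participants.
    def problem_weight(solvers_last_year, participants_last_year):
        return 1.0 - solvers_last_year / participants_last_year

    # (b) evaluation depends on current performance: each problem's weight
    #     is distributed among this year's solvers (here: split equally).
    def prover_scores(weight, solved_by):
        # weight: problem -> weight; solved_by: problem -> set of provers
        score = {}
        for problem, solvers in solved_by.items():
            if not solvers:
                continue  # an unsolved problem contributes no points
            share = weight[problem] / len(solvers)
            for prover in solvers:
                score[prover] = score.get(prover, 0.0) + share
        return score

    # Example: a problem solved last year by 2 of 6 provers has weight 2/3;
    # if three provers solve it this year, each of them receives 2/9.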



4) Discussion on "checkme".

4.1 The current rules (as written) would allow
dummy checkme executables that just print "TRUE". This is not intended.
To make the intention clearer, the following is suggested
(by all participants): "checkme" should in fact be "generate":
it reads a proof and produces an input file for some established proof
checker, possibly referring to some established libraries.
This input file shall be made visible on the competition web site.
The organizer (not the "checkme") then calls the proof checker on that file.
(For this year, we expect that all participants use Coq and CoLoR.
Technical detail: this would rely on Coq and CoLoR being installed by Claude,
and on agreement on Coq command line arguments, if any - it should be possible
to specify everything within the Coq input.)
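
Purely as an illustration of that organizer-side step (a hedged sketch:
the file name and helper function are hypothetical; coqc is Coq's standard
batch checker, and all options are assumed to live inside the generated file):

    # Hypothetical sketch: the "generate" step has already produced e.g.
    # proof_cert.v, a Coq input file referring to the installed CoLoR
    # library; the organizer then runs the proof checker on that file.
    import subprocess

    def certify(coq_file):
        # coqc checks the file; paths, options etc. are assumed to be
        # specified within the Coq input itself, as discussed above.
        result = subprocess.run(["coqc", coq_file],
                                capture_output=True, text=True)
        return result.returncode == 0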

4.2 Since it is allowed to submit different versions of a prover
to the uncertified and the certified parts of the competition,
and Adam intends to do so, both versions need to be run anyway.
The suggestion (by all) is to make these two runs separately
(first, run the complete uncertified competition;
then, run the complete certified competition).
(No time would be saved by merging these two runs.)

4.3 On the timeout for checkme. Adam says he expects an average run time
of five seconds, but has some examples taking 75 seconds.
So he could live with 1 minute, but 5 minutes would be better.
I think we can take the 5-minute limit, since it seems it will seldom be
needed. (The prover itself should finish within one minute anyway.)

4.4 How to determine the winner of the "certified" competition?
Do we award points for emitting proofs or for certifying proofs?
If (say) TPA emits an XML proof that is transformed by Rainbow into a Coq
proof, does the reward go to TPA, Rainbow, CoLoR, or Coq? Most of the
(termination-specific) work is done in Rainbow and CoLoR.


5) Timing. Since there seems to be no reliable way to measure exactly
the time used by subprocesses, we expect to use wall clock time.


General discussion on further competitions:

The committee should be more open; there is no need for confidentiality.
This holds especially if the committee members are themselves participants.
Otherwise, the committee should not be allowed to take part in the competition.
Alternative suggestion: each year's winner automatically becomes the organizer
of the next competition, but is not allowed to take part.

Timing: there are several applications (e.g. completion)
where one indeed wants a very fast answer. On the other hand, there
are "research" problems where one wants to allow a lot of time.
It was already suggested (by Jörg) that each prover gets some total time
which it can distribute freely over all the problems.
(Aart, Adam, Johannes agree.)
One simple strategy would be to run with a small timeout (a few seconds)
on all problems, and then distribute the remaining time evenly over the
unsolved problems; a sketch follows below. More elaborate schemes are of
course possible.
The technical problem of how to specify the set of inputs
can be solved (e.g. have one file containing the problem file names).
The disadvantage (according to Aprove) is that this gives more room
for parameter tuning, requiring much work and thus distracting from
"real research".

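A rough sketch of that simple strategy (illustrative only; run_prover,
its return convention and the concrete numbers are hypothetical, not a
proposed implementation):

    # Sketch of the simple budget strategy described above. run_prover is
    # a hypothetical function (problem, timeout_seconds) -> "YES"/"NO"/None.
    def run_with_budget(run_prover, problems, total_budget, first_timeout=5):
        results = {}
        # Phase 1: small timeout on every problem.
        for p in problems:
            results[p] = run_prover(p, first_timeout)
        # Phase 2: spend the remaining budget evenly on the unsolved
        # problems (for simplicity, assume phase 1 used its full timeouts).
        unsolved = [p for p in problems if results[p] is None]
        remaining = total_budget - first_timeout * len(problems)
        if unsolved and remaining > 0:
            per_problem = remaining / len(unsolved)
            for p in unsolved:
                results[p] = run_prover(p, per_problem)
        return results
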
Weighting: one could possibly use an additional adjustment according
to the time used for proofs (Aprove does not agree).


(The above summary of the discussion
has been read and approved by the participants.)

Best regards, Johannes Waldmann.

