[Termtools] Upcoming competition - discussion during Nancy Workshop
Claude Marché
Claude.Marche at lri.fr
Mon May 14 15:46:30 CEST 2007
Thanks Johannes for this summary. Let me give a few answers.
>>>>> "Johannes" == Johannes Waldmann <waldmann at imn.htwk-leipzig.de> writes:
Johannes> 1) Competition runs on a subset of TPDB,
Johannes> random distribution on subsets (directories) is determined by committee:
Johannes> General agreement. Aart suggested random selection from complete TPDB,
Johannes> I said, some parts (subdirs) will probably be done in full,
Johannes> and others will be subject to random selection.
Johannes> BTW: I request that my SRS problem set be fully included with TPDB,
Johannes> and random selection shall be applied, details to be discussed
Johannes> (e.g. run size 11 SRS fully, size 12 randomly).
To explain the context a bit: this year we received many new problems
to add to the TPDB for the next competition. If we include everything,
it is no longer reasonable to run the competition on the full TPDB.
The current proposal is to accept a large TPDB, but to use only a
randomly selected set of problems from some parts of it, namely:
- the SRSs of size 12 submitted by Johannes
- the TRSs generated from context-sensitive TRSs or Maude programs
  submitted by Salvador Lucas.
If nobody complains, this is what will be done. The number (or the
percentage) of problems randomly selected remains to be decided.
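For concreteness, a minimal sketch of what such a selection could look
like (the variable names and the 20% rate below are made up; the
actual percentage is still to be decided):

    import random

    # Sketch: pick a fixed fraction of the problems of one TPDB
    # subdirectory; the other directories would be used in full.
    def select_problems(problems, fraction, seed=2007):
        # a fixed seed makes the selection reproducible afterwards
        rng = random.Random(seed)
        k = max(1, round(len(problems) * fraction))
        return sorted(rng.sample(problems, k))

    # e.g. select_problems(srs_size_12, 0.2) for a made-up 20% rate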
Johannes> 2) count YES and/or NO: generally, the idea of summing both counts
Johannes> (to determine the overall winner) was favoured.
Johannes> It was the impression that this was the original intention,
Johannes> but it had been neglected in last year's report.
It will be easy to add an overall score summing both YES and NO answers.
Johannes> 3) measure difficulty of problems: general agreement for:
Johannes> a) determine weight of "old" problems from last year's performance
Johannes> b) evaluation of prover depends on current performance
Johannes> that means: for (a) we compute a "value" of each problem,
Johannes> according to (b), this value is distributed among this year's solvers,
Johannes> For us (the committee) this means we have to give concrete formulas
Johannes> for (a) and (b). Simplest: compute weight as proposed by Claude
Johannes> (1 - #solvers/#participants). We don't want these weights as annotations
Johannes> in the problems (because weights will change).
Johannes> Suggestion: we run last year's provers (now) on the new problems
Johannes> to have an initial difficulty rating.
Johannes> If we don't want to have this extra run,
Johannes> then we could use the current performance for weighting.
if "we" means Claude Marché, then the answer is: no extra run.
My suggestion for this year:
1) compute the weight of each problem as above;
2) give both the normal score and the weighted score.
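For concreteness, a sketch of 1) and 2) with the formula above (the
function and variable names are illustrative, not an agreed format):

    # weight of a problem, from last year's results: a problem solved
    # by every prover weighs 0, one solved by no prover weighs 1
    def weight(num_solvers, num_participants):
        return 1.0 - num_solvers / num_participants

    # solved: the problems a prover answered (YES or NO);
    # weights: problem name -> weight as computed above
    def scores(solved, weights):
        normal = len(solved)
        weighted = sum(weights[p] for p in solved)
        return normal, weighted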
About putting the weights in the TPDB itself: I think it should be
done, but it is not that important.
Johannes> 4) Discussion on "checkme".
Johannes> 4.1 The current rules (as written) would allow
Johannes> dummy checkme executables that just print "TRUE". This
Johannes> is not intended.
Before writing the rules for the "certifying option", I discussed by
email with potential participants (Xavier Urbain for A3PAT, Frederic
Blanqui and Adam Koprowski for CoLoR), and I thought we had more or
less agreed. After publishing the rules, I received no remarks about
that. Now I realize that it was misunderstood: the rules I intended
do not accept a checkme that merely prints TRUE.
Johannes> To make this intention more clear, the following is suggested
Johannes> (by all participants): "checkme" should in fact be "generate":
Johannes> it reads a proof and produces an input file for some established proof
Johannes> checker, possibly referring to some established
Johannes> libraries.
NO. The "runme" script should generate that file. It can be in any
format, such as the XML format expected by the CoLoR project.
Johannes> This input file shall be made visible on the competition web site.
YES
Johannes> The organizer (not the "checkme") then calls the proof checker on that
Johannes> file.
NO. The "checkme" is called automatically with the generated file as
argument. It can do several steps of verification, such as
1) generate a Coq script from the XML file
2) run coqc on the generated file
3) checks coqc return status, and return 0 if everything went OK (as
said in the rules document)
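For illustration only, here is a sketch of such a checkme in Python;
the "rainbow" converter and its command line are made up, only the
three steps and the exit-status convention come from the rules:

    import subprocess, sys

    def checkme(proof_xml):
        # 1) generate a Coq script from the XML file ("rainbow" is a
        #    placeholder for whatever converter the participant ships)
        gen = subprocess.run(["rainbow", proof_xml, "-o", "proof.v"])
        if gen.returncode != 0:
            return 1
        # 2) run coqc on the generated file
        coq = subprocess.run(["coqc", "proof.v"])
        # 3) return 0 iff coqc accepted the proof
        return 0 if coq.returncode == 0 else 1

    if __name__ == "__main__":
        sys.exit(checkme(sys.argv[1]))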
The important points are:
1) the process in the checkme file must be explained in a short text
(as said in the rules document), so that the organizers may be more
or less convinced that the certification procedure is safe;
2) it should also be explained how the format of the file generated by
"runme" is related to the input problem, so that the organizers can be
convinced that the checking procedure is run on the right problem.
Johannes> (For this year, we expect that all participants use Coq
Johannes> and CoLoR.
Wrong: CiME/A3PAT will use Coq but not CoLoR.
Johannes> Technical detail: this would rely on Coq and Color installed by Claude,
Johannes> and agreement on Coq command line arguments, if any - it should be possible
Johannes> to specify everything within the Coq input.)
About Coq: as I did for Java, I can manage to install a standard Coq
version, such as the current one, 8.1.
On the other hand, I don't want to take care of unstable versions of
Coq or of Coq libraries. In particular, participants wishing to use
Coq as a proof-checking tool should provide all needed libraries.
Johannes> 4.2 Since it is allowed to submit different versions of provers
Johannes> to uncertified and certified part of competition,
Johannes> and Adam intends to do so, both versions need to be run anyway.
Johannes> Suggestion (by all) is to make these two runs separately
Johannes> (first, run complete uncertified competition;
Johannes> then, run complete certified competition.)
Johannes> (No time would be won by merging these two runs.)
OK for two separate runs. And the certifying version can be
different; in particular, I guess it is not necessary to look for a
termination proof based on a criterion that is not supported by the
checking procedure.
Johannes> 4.3 on timeout for checkme. Adam says he expects average run time
Johannes> of five seconds, but has some examples with 75 seconds.
Johannes> So he could live with 1 minute, but 5 minute would be better.
Johannes> I think we could take 5 min limit since it seems it will seldom be needed.
Johannes> (The prover itself should finish in one minute anyway.)
I discussed that a little before. A 5-minute time limit is OK for
me.
Johannes> 4.4 how to determine the winner of the "certified" competition?
Johannes> Do we award points for emitting proofs or for certifying
Johannes> proofs?
For certifying, of course: if an emitted proof is rejected, it
counts as 0.
Johannes> If (say) TPA emits an XML proof that is transformed by Rainbow to a Coq
Johannes> proof,
Johannes> reward goes to TPA, Rainbow, CoLoR, Coq? Most of the (termination specific)
Johannes> work is done in Rainbow and CoLoR.
Rewards go to the registered participants. When registering, please
tell me who the authors are, which tool you wrote yourself, and which
external tools you use. The same remark applies to tools that call an
external SAT solver, for example.
Johannes> 5) Timing. Since there seems to be no solution for exactly
Johannes> measuring time for subprocesses, we expect to use wall clock time.
Yes, if no other solution is found.
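For reference, measuring wall-clock time around one prover run could
look like the following sketch (the command and the 60-second limit
are placeholders for the real setup):

    import subprocess, time

    def timed_run(cmd, timeout_s=60):
        # wall-clock timing with a hard timeout on the subprocess
        start = time.monotonic()
        try:
            proc = subprocess.run(cmd, capture_output=True,
                                  timeout=timeout_s, text=True)
            answer = proc.stdout.splitlines()[0] if proc.stdout else ""
        except subprocess.TimeoutExpired:
            answer = "TIMEOUT"
        return answer, time.monotonic() - start

    # e.g. timed_run(["./runme", "problem.trs"])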
Johannes> General discussion on further competitions:
I will give only a general answer to the remarks below, in the next
mail.
Johannes> The committee should be more open; there is no need for confidentiality.
Johannes> Especially so if the committee themselves are
Johannes> participants.
Johannes> Otherwise, the committee should not be allowed to take part in the competition.
Johannes> Alternate suggestion: each year's winner automatically is organizer
Johannes> for next competition, but is not allowed to take part.
Johannes> Timing: there are several applications (e.g. completion)
Johannes> where one indeed wants a very fast answer. On the other hand there
Johannes> are "research" problems where one wants to allow a lot of time.
Johannes> It was already suggested (by Jörg) that each prover gets some total time
Johannes> which it can distribute on all the problems freely.
Johannes> (Aart, Adam, Johannes agree).
Johannes> One simple strategy would be to run with small timeout (few seconds)
Johannes> on all problems, and then distribute remaining time evenly on
Johannes> unsolved problems. More elaborate schemes are of course possible.
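[Just to fix ideas, the two-pass scheme quoted above could be sketched
as follows; all numbers and names are illustrative, this is not a
proposal of the committee:

    # every problem first gets a short slot; the leftover budget is
    # then shared evenly among the problems still unsolved
    def budgets(problems, solved_in_pass1, total_time, first_pass=5.0):
        times = {p: first_pass for p in problems}
        unsolved = [p for p in problems if p not in solved_in_pass1]
        remaining = total_time - first_pass * len(problems)
        if unsolved and remaining > 0:
            extra = remaining / len(unsolved)
            for p in unsolved:
                times[p] += extra
        return times
]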
Johannes> The technical problem of how to specify the set of inputs
Johannes> can be solved (e.g. have one file containing problem file names).
Johannes> The disadvantage (according to Aprove) is that this gives more room
Johannes> for tuning of parameters, requiring much work, thus distracting from
Johannes> "real research".
Johannes> Weighting: could possibly use additional adjustment according
Johannes> to time used for proofs (Aprove does not agree).
Johannes> (The above summary of the discussion
Johannes> has been read and approved by the participants.)
Johannes> Best regards, Johannes Waldmann.
--
Claude Marché | tel: +33 1 72 92 59 69
INRIA Futurs - ProVal | mobile: +33 6 33 14 57 93
Parc Orsay Université - ZAC des Vignes | fax: +33 1 74 85 42 29
3, rue Jacques Monod - Bâtiment N | http://www.lri.fr/~marche/
F-91893 ORSAY Cedex |