[Termtools] Upcoming competition - discussion during Nancy Workshop
Claude Marché
Claude.Marche at lri.fr
Mon May 14 15:46:30 CEST 2007
Thanks Johannes for this summary. Let me give a few answers.
>>>>> "Johannes" == Johannes Waldmann <waldmann at imn.htwk-leipzig.de> writes:
Johannes> 1) Competition runs on a subset of TPDB,
Johannes> random distribution on subsets (directories) is determined by committee:
Johannes> General agreement. Aart suggested random selection from complete TPDB,
Johannes> I said, some parts (subdirs) will probably be done in full,
Johannes> and others will be subject to random selection.
Johannes> BTW: I request that my SRS problem set be fully included with TPDB,
Johannes> and random selection shall be applied, details to be discussed
Johannes> (e.g. run size 11 SRS fully, size 12 randomly).
To explain the context a bit: this year we received many new problems
to add to the TPDB for the next competition. If we include everything,
it is no longer reasonable to run the competition on the full TPDB.
The current proposal is to accept a large TPDB, but to use only a
randomly selected set of problems from some parts of it, namely:
- the SRSs of size 12 submitted by Johannes
- the TRSs generated from context-sensitive TRSs or Maude programs
  submitted by Salvador Lucas.
If nobody complains, this is what will be done. The number (or the
percentage) of problems randomly selected remains to be decided.
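For concreteness, a minimal sketch of what such a selection could look
like (the variable names and the 20% rate below are made up; the
actual percentage is still to be decided):

    import random

    # Sketch: pick a fixed fraction of the problems of one TPDB
    # subdirectory; the other directories would be used in full.
    def select_problems(problems, fraction, seed=2007):
        # a fixed seed makes the selection reproducible afterwards
        rng = random.Random(seed)
        k = max(1, round(len(problems) * fraction))
        return sorted(rng.sample(problems, k))

    # e.g. select_problems(srs_size_12, 0.2) for a made-up 20% rate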
Johannes> 2) count YES and/or NO: generally, the idea of summing both counts
Johannes> (to determine the overall winner) was favoured.
Johannes> It was the impression that this was the original intention,
Johannes> but it had been neglected in last year's report.
It will be easy to add an overall score summing both YES and NO answers.
Johannes> 3) measure difficulty of problems: general agreement for:
Johannes> a) determine weight of "old" problems from last year's performance
Johannes> b) evaluation of prover depends on current performance
Johannes> that means: for (a) we compute a "value" of each problem,
Johannes> according to (b), this value is distributed among this year's solvers,
Johannes> For us (the committee) this means we have to give concrete formulas
Johannes> for (a) and (b). Simplest: compute weight as proposed by Claude
Johannes> (1 - #solvers/#participants). We don't want these weights as annotations
Johannes> in the problems (because weights will change).
Johannes> Suggestion: we run last year's provers (now) on the new problems
Johannes> to have an initial difficulty rating.
Johannes> If we don't want to have this extra run,
Johannes> then we could use the current performance for weighting.
if "we" means Claude Marché, then the answer is: no extra run.
My suggestion for this year:
1) compute the weight of each problem as above;
2) give both the normal score and the weighted score.
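For concreteness, a sketch of 1) and 2) with the formula above (the
function and variable names are illustrative, not an agreed format):

    # weight of a problem, from last year's results: a problem solved
    # by every prover weighs 0, one solved by no prover weighs 1
    def weight(num_solvers, num_participants):
        return 1.0 - num_solvers / num_participants

    # solved: the problems a prover answered (YES or NO);
    # weights: problem name -> weight as computed above
    def scores(solved, weights):
        normal = len(solved)
        weighted = sum(weights[p] for p in solved)
        return normal, weighted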
About putting the weights in the TPDB itself: I think it should be
done, but it is not that important.
Johannes> 4) Discussion on "checkme".
Johannes> 4.1 The current rules (as written) would allow
Johannes> dummy checkme executables that just print "TRUE". This
Johannes> is not intended.
Before writing the rules for the "certifying option", I discussed by
email with potential participants (Xavier Urbain for A3PAT, Frederic
Blanqui and Adam Koprowski for CoLoR), and I thought we had more or
less agreed. After publishing the rules, I received no remarks about
that. Now I realize that it was misunderstood: the rules I intended
do not accept a checkme that merely prints TRUE.
Johannes> To make this intention more clear, the following is suggested
Johannes> (by all participants): "checkme" should in fact be "generate":
Johannes> it reads a proof and produces an input file for some established proof
Johannes> checker, possibly referring to some established
Johannes> libraries.
NO. The "runme" script should generate that file. It can be in any
format, such as the XML format expected by the CoLoR project.
Johannes> This input file shall be made visible on the competition web site.
YES
Johannes> The organizer (not the "checkme") then calls the proof checker on that
Johannes> file.
NO. The "checkme" is called automatically with the generated file as
argument. It can do several steps of verification, such as
1) generate a Coq script from the XML file
2) run coqc on the generated file
3) checks coqc return status, and return 0 if everything went OK (as
said in the rules document)
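For illustration only, here is a sketch of such a checkme in Python;
the "rainbow" converter and its command line are made up, only the
three steps and the exit-status convention come from the rules:

    import subprocess, sys

    def checkme(proof_xml):
        # 1) generate a Coq script from the XML file ("rainbow" is a
        #    placeholder for whatever converter the participant ships)
        gen = subprocess.run(["rainbow", proof_xml, "-o", "proof.v"])
        if gen.returncode != 0:
            return 1
        # 2) run coqc on the generated file
        coq = subprocess.run(["coqc", "proof.v"])
        # 3) return 0 iff coqc accepted the proof
        return 0 if coq.returncode == 0 else 1

    if __name__ == "__main__":
        sys.exit(checkme(sys.argv[1]))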
The important points are:
1) the process in the checkme file must be explained in a short text
(as said in the rules document), so that the organizers may be more
or less convinced that the certification procedure is safe;
2) it should also be explained how the format of the file generated by
"runme" is related to the input problem, so that the organizers can be
convinced that the checking procedure is run on the right problem.
Johannes> (For this year, we expect that all participants use Coq
Johannes> and CoLoR.
Wrong: CiME/A3PAT will use Coq but not CoLoR.
Johannes> Technical detail: this would rely on Coq and Color installed by Claude,
Johannes> and agreement on Coq command line arguments, if any - it should be possible
Johannes> to specify everything within the Coq input.)
About Coq: as I did for Java, I can manage to install a standard Coq
version, such as the current one, 8.1.
On the other hand, I don't want to take care of unstable versions of
Coq or of Coq libraries. In particular, participants wishing to use
Coq as a proof-checking tool should provide all needed libraries.
Johannes> 4.2 Since it is allowed to submit different versions of provers
Johannes> to uncertified and certified part of competition,
Johannes> and Adam intends to do so, both versions need to be run anyway.
Johannes> Suggestion (by all) is to make these two runs separately
Johannes> (first, run complete uncertified competition;
Johannes> then, run complete certified competition.)
Johannes> (No time would be won by merging these two runs.)
OK for two separate runs. And the certifying version can be
different; in particular, I guess it is not necessary to look for a
termination proof based on a criterion that is not supported by the
checking procedure.
Johannes> 4.3 on timeout for checkme. Adam says he expects average run time
Johannes> of five seconds, but has some examples with 75 seconds.
Johannes> So he could live with 1 minute, but 5 minute would be better.
Johannes> I think we could take 5 min limit since it seems it will seldom be needed.
Johannes> (The prover itself should finish in one minute anyway.)
I discussed that a little before. A 5-minute time limit is OK for
me.
Johannes> 4.4 how to determine the winner of the "certified" competition?
Johannes> Do we award points for emitting proofs or for certifying
Johannes> proofs?
For certifying, of course: if an emitted proof is rejected, it
counts as 0.
Johannes> If (say) TPA emits an XML proof that is transformed by Rainbow to a Coq
Johannes> proof,
Johannes> reward goes to TPA, Rainbow, CoLoR, Coq? Most of the (termination specific)
Johannes> work is done in Rainbow and CoLoR.
Rewards go to the registered participants. When registering, please
tell me who the authors are, which tool you wrote yourself, and which
external tools you use. The same remark applies to tools that call an
external SAT solver, for example.
Johannes> 5) Timing. Since there seems to be no solution for exactly
Johannes> measuring time for subprocesses, we expect to use wall clock time.
Yes, if no other solution is found.
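For reference, measuring wall-clock time around one prover run could
look like the following sketch (the command and the 60-second limit
are placeholders for the real setup):

    import subprocess, time

    def timed_run(cmd, timeout_s=60):
        # wall-clock timing with a hard timeout on the subprocess
        start = time.monotonic()
        try:
            proc = subprocess.run(cmd, capture_output=True,
                                  timeout=timeout_s, text=True)
            answer = proc.stdout.splitlines()[0] if proc.stdout else ""
        except subprocess.TimeoutExpired:
            answer = "TIMEOUT"
        return answer, time.monotonic() - start

    # e.g. timed_run(["./runme", "problem.trs"])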
Johannes> General discussion on further competitions:
I will give only a general answer to the remarks below, in the next
mail.
Johannes> The committee should be more open; there is no need for confidentiality.
Johannes> Especially so if the committee themselves are
Johannes> participants.
Johannes> Otherwise, the committee should not be allowed to take part in the competition.
Johannes> Alternate suggestion: each year's winner automatically is organizer
Johannes> for next competition, but is not allowed to take part.
Johannes> Timing: there are several applications (e.g. completion)
Johannes> where one indeed wants a very fast answer. On the other hand there
Johannes> are "research" problems where one wants to allow a lot of time.
Johannes> It was already suggested (by Jörg) that each prover gets some total time
Johannes> which it can distribute on all the problems freely.
Johannes> (Aart, Adam, Johannes agree).
Johannes> One simple strategy would be to run with small timeout (few seconds)
Johannes> on all problems, and then distribute remaining time evenly on
Johannes> unsolved problems. More elaborate schemes are of course possible.
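[Just to fix ideas, the two-pass scheme quoted above could be sketched
as follows; all numbers and names are illustrative, this is not a
proposal of the committee:

    # every problem first gets a short slot; the leftover budget is
    # then shared evenly among the problems still unsolved
    def budgets(problems, solved_in_pass1, total_time, first_pass=5.0):
        times = {p: first_pass for p in problems}
        unsolved = [p for p in problems if p not in solved_in_pass1]
        remaining = total_time - first_pass * len(problems)
        if unsolved and remaining > 0:
            extra = remaining / len(unsolved)
            for p in unsolved:
                times[p] += extra
        return times
]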
Johannes> The technical problem of how to specify the set of inputs
Johannes> can be solved (e.g. have one file containing problem file names).
Johannes> The disadvantage (according to Aprove) is that this gives more room
Johannes> for tuning of parameters, requiring much work, thus distracting from
Johannes> "real research".
Johannes> Weighting: could possibly use additional adjustment according
Johannes> to time used for proofs (Aprove does not agree).
Johannes> (The above summary of the discussion
Johannes> has been read and approved by the participants.)
Johannes> Best regards, Johannes Waldmann.
--
Claude Marché | tel: +33 1 72 92 59 69
INRIA Futurs - ProVal | mobile: +33 6 33 14 57 93
Parc Orsay Université - ZAC des Vignes | fax: +33 1 74 85 42 29
3, rue Jacques Monod - Bâtiment N | http://www.lri.fr/~marche/
F-91893 ORSAY Cedex |