[Termtools] Upcoming competition

Zantema, H. h.zantema at TUE.nl
Mon May 14 15:36:07 CEST 2007


Hi,

Here are some comments on the results of the discussion some of us had
in Nancy last week:
 
******************************

1) Competition runs on a subset of TPDB;
the random selection within subsets (directories) is determined by the committee:
General agreement. Aart suggested random selection from the complete TPDB;
I said that some parts (subdirectories) will probably be run in full,
and others will be subject to random selection.

BTW: I request that my SRS problem set be fully included in TPDB,
and that random selection be applied; details to be discussed
(e.g. run the size 11 SRSs in full, the size 12 ones randomly).

COMMENT:

I made a selection of around 20% from the set of size 12 SRSs, namely
all entries of the list whose numbers are divisible by 5.

After that we agreed to make a random selection from the (otherwise too
unbalanced) list of standard TRSs obtained from transforming
context-sensitive termination problems; so now it would indeed be
consistent with this policy to take the full list of size 12 SRSs
instead of the choice I made.

Proposal:

* From the TRCSR systems in the standard TRSs (151 old ones from TPDB06
and 318 new ones submitted by Salvador Lucas) we make a random selection
of 40%.

* From the full list of 335 size 12 SRSs over <= 3 symbols that could
not be solved by any of the 2006 tools, we make a random selection of
20%.

* All other termination problems are run in full.

These random selections should be made just before running the
competition, using a procedure that is made public.
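
For illustration, here is a minimal sketch (in Python) of how such a
reproducible public procedure could look; the seed value and the
problem-list names are assumptions for the example, not agreed choices:

    # Reproducible random selection with a published seed (sketch).
    import random

    def select(problems, fraction, seed):
        # Sort first so the result does not depend on directory
        # listing order, then sample with the published seed.
        rng = random.Random(seed)
        k = round(len(problems) * fraction)
        return sorted(rng.sample(sorted(problems), k))

    # Hypothetical usage with the fractions proposed above:
    # trcsr_sel = select(trcsr_problems, 0.40, seed=2007)
    # srs12_sel = select(srs12_problems, 0.20, seed=2007)

Publishing the seed together with the script would make the selection
verifiable by everyone after the fact.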

******************************

2) Count YES and/or NO: generally, the idea of summing both counts
(to determine the overall winner) was favoured.
The impression was that this had been the original intention,
but that it had been neglected in last year's report.

COMMENT:

I am not aware of this. We call it a termination competition, so
obviously the results on termination proofs should be counted. Apart
from that, it was decided from the beginning in 2004 also to count 'NO'
results. Adding 'YES' and 'NO' results and presenting this as a final
result would be something new.
As I said, I am not aware that this was ever the intention. In any case,
such a change should not be made so shortly before the deadline.

******************************

3) Measure difficulty of problems: general agreement on:

a) determine the weight of "old" problems from last year's performance
b) evaluation of a prover depends on current performance

That means: for (a) we compute a "value" for each problem;
according to (b), this value is distributed among this year's solvers.

For us (the committee) this means we have to give concrete formulas
for (a) and (b). The simplest option: compute the weight as proposed by
Claude (1 - #solvers/#participants). We do not want these weights as
annotations in the problems (because the weights will change).

Suggestion: we run last year's provers (now) on the new problems
to have an initial difficulty rating.
If we do not want this extra run,
we could use the current performance for weighting.
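
As a small illustration of Claude's weight formula (in Python; the
solver counts below are made up):

    # weight(p) = 1 - #solvers(p) / #participants  (Claude's proposal).
    participants = 6
    solvers = {"p1": 6, "p2": 3, "p3": 0}   # hypothetical counts

    weights = {p: 1 - s / participants for p, s in solvers.items()}
    # p1 -> 0.0 (solved by all, worth nothing), p2 -> 0.5,
    # p3 -> 1.0 (unsolved, full weight)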

COMMENT:

A problem with this is that several different formulas for computing
difficulty sound reasonable, and we should all agree on the choice we
make here. To keep it simple and to avoid dependence on earlier tools,
I propose not to involve the 2006 results or the 2006 tools. To keep
the results integer, we should avoid dividing by a constant.

Proposal:
For every subcategory, for every problem solved, the number of points
for a tool is

	Number of tools that did not solve this problem.

This is #participants - #solvers, so Claude's formula multiplied by the
constant #participants. The total score for a tool is the total number
of points obtained in this way by the tool.
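
For concreteness, a sketch of this scoring rule in Python; the results
table is made-up illustration data:

    # Each solved problem earns (#participants - #solvers) points.
    results = {
        "toolA": {"p1", "p2", "p3"},
        "toolB": {"p1"},
        "toolC": {"p2", "p3"},
    }
    participants = len(results)
    problems = set().union(*results.values())
    solvers = {p: sum(p in s for s in results.values())
               for p in problems}

    scores = {tool: sum(participants - solvers[p] for p in solved)
              for tool, solved in results.items()}
    # Here every problem has 2 solvers out of 3 participants, so
    # toolA scores 3, toolB scores 1, toolC scores 2.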

If we agree, this formula should be put on the termination competition
web page within a few days; otherwise we should keep the simple rule of
1 point per solution. Since we are close to the deadlines, I think we
should only change these basic rules if everyone agrees. If any of the
participants of the 2007 competition disagrees, he/she is asked to send
me an email about this.

******************************


5) Timing. Since there seems to be no way to measure time for
subprocesses exactly, we expect to use wall-clock time.

COMMENT:

OK; let's fix the time-out at 1 minute.
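
For illustration, a minimal sketch (in Python) of enforcing a 1-minute
wall-clock time-out on a prover run; the prover command and problem
file are placeholders:

    import subprocess

    try:
        # timeout counts wall-clock seconds, matching the decision above.
        proc = subprocess.run(["./prover", "problem.trs"],
                              capture_output=True, text=True, timeout=60)
        lines = proc.stdout.strip().splitlines()
        answer = lines[0] if lines else "MAYBE"
    except subprocess.TimeoutExpired:
        answer = "TIMEOUT"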



            Best regards, Hans Zantema.

_________________________________________________________
Dr Hans Zantema            
Technische Universiteit Eindhoven, Department of Computer Science 
P.O. Box 513, 5600 MB Eindhoven, The Netherlands 
e-mail: H.Zantema at tue.nl, homepage: www.win.tue.nl/~hzantema
office: Hoofdgebouw room 6.73, tel: (040)2472749


