[Termination tools] Rules of the WST competition

H. Zantema hzantema at win.tue.nl
Mon Apr 26 18:01:14 CEST 2004


On Mon, 26 Apr 2004, Claude Marche wrote:

> My idea was then that such a "competition" should be more formally
> organized so that: 
> . tools should all be run on the same computer
> . tools should be run fully automatically, as they would be run by a
>   "naive" user of the tool
> . tools should all be run on the same large set of examples.
> 
> The large set of examples exists now: the TPDB, with the common
> format. I have now coded some wrapper that is able to run tools
> without user interaction, so I'm ready to start the process as soon as
> I have the tools themselves.  This wrapper may be interrupted and
> restarted later on, and new tools may be added as new participants in
> the middle of the process. I think a few days are enough to run all
> tools on the whole TPDB, and for me this should be the "competition",
> whose results would be very interesting from a scientific point of view.
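Such a wrapper could look roughly as follows. This is a minimal sketch only: the tool command lines, the `.trs` file layout, and the 60-second timeout are my assumptions for illustration, not the actual setup.

```python
import subprocess
from pathlib import Path

# Assumed configuration: name -> command line of each participating tool.
TOOLS = {"tool-a": ["./tool-a"]}
TIMEOUT = 60   # seconds per problem (assumed)

def run_all(problem_dir, results_file):
    """Run every tool on every problem, appending one line per result.

    Finished results are re-read on startup, so the wrapper may be
    interrupted and restarted, and new tools may be added mid-process.
    """
    path = Path(results_file)
    done = set(path.read_text().splitlines()) if path.exists() else set()
    with open(results_file, "a") as out:
        for problem in sorted(Path(problem_dir).glob("**/*.trs")):
            for name, cmd in TOOLS.items():
                key = f"{name}\t{problem}\t"
                if any(line.startswith(key) for line in done):
                    continue   # already answered in an earlier run
                try:
                    proc = subprocess.run(cmd + [str(problem)],
                                          capture_output=True, text=True,
                                          timeout=TIMEOUT)
                    # First output line is taken as the verdict.
                    verdict = proc.stdout.split("\n", 1)[0].strip() or "MAYBE"
                except subprocess.TimeoutExpired:
                    verdict = "TIMEOUT"
                out.write(key + verdict + "\n")
                out.flush()
```

The append-and-resume scheme is what makes interruption harmless: a restarted run skips every (tool, problem) pair that already has a recorded verdict.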

I am happy that after some weeks of silence there is now a reaction
on the termtools list. I think everyone agrees that the points you
mention above are a big step forward compared to last year's
competition, and you did a great job there. Thanks!

> Now, some of you would like this "competition" to become some "sport"
> where, as we all know, what is important is to win and not to
> participate. This is not my idea of this "competition". I don't want
> to spend time tracking possible cheats, such as recognizing the shape
> of a given TRS and looking it up in a database of examples which are already
> known to be terminating. I don't think there is any tool doing this
> anyway. I don't want to spend time in trying to find a way of
> selecting a set of examples that will be agreed by all participants.

I think a main goal of the competition is to be able to compare the 
power of various tools in a more objective way than by reading papers
about these tools. I am afraid that this desired objectivity is
hard to reach if

* the competition is only on systems from some database while 
  the moderators of this database are participants too, and

* there is an obvious way to perform well in the competition by a 
  simple fake program.
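To make that second worry concrete, here is a deliberately silly sketch (my own, not any existing tool): a "prover" that never inspects its input and always claims termination. If only occurrences of the word 'YES' were counted, it would score on every terminating problem in the database while containing no termination technique at all.

```python
import sys

def fake_prover(problem_text: str) -> str:
    """A deliberately fake 'termination tool': it ignores the problem
    entirely and simply claims termination.  A competition that only
    counts occurrences of the word 'YES' would reward it."""
    return "YES"

if __name__ == "__main__":
    # Read (and discard) the problem from stdin, then answer.
    print(fake_prover(sys.stdin.read()))
```

Of course it also answers 'YES' on non-terminating systems, which is exactly why a 'YES' should be backed by a checkable proof.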

Other areas show the pattern: chess programs once did only brute-force
computation, but now also use databases of well-known results.
I do not see why this could not happen in the termination community too,
and then this format would no longer work.

> Some of you suggested that each participant should select a set
> of examples, say 10 or 20. I don't think it is a good idea: this rule will
> encourage people to find examples which are solved by their own tool,
> but not by the others, and the results will not reflect the real
> abilities of each tool. 
I fully agree that research in our area should not concentrate only
on finding examples that can be treated by one tool and not by
another. But for understanding why one tool sometimes outperforms
another, there is nothing wrong with occasionally searching for such
examples, not only for competition-as-a-sport but also
from a scientific point of view.

> Other suggested that examples should be chosen
> by independent people, why not, but who? Not me because I'm a
> participant. It could be hard anyway to find people who would be
> completely impartial. 

I would prefer some 'jury' of people in the area who are not involved
with tools. Their task could also be to judge whether a generated
'YES' is just a word or summarizes a valid proof. Indeed, one cannot
expect these people to be willing to do a great deal of work, like
generating many new systems. But even if such a jury is mainly
honorary, it would make the competition more serious than merely
counting the number of occurrences of the word 'YES' in a generated
text.

> So which solution remains ? We could benefit
> from the CASC experience (The competition of theorem provers of CADE),
> which has been run successfully for years: the solution found by the
> organizers is to select problems randomly from 
> the TPTP database. We could do the same using the TPDB, but the TPDB
> is unfortunately much younger than the TPTP, has much less problems,
> so is probably not yet "representative" of termination problems (if it
> means something).
This is indeed a problem. For some categories, like term rewriting,
there is already quite a list, but for other categories (like string
rewriting) the list is still too poor to serve as the basis for a
serious competition.
 
> I think the best thing for future WST competitions would be to
> encourage growth of the TPDB (with cleaning, removing duplicates,
> organizing problems differently, etc.), so that a selection of random
> problems would become the best way of running the WST competition.
> Unfortunately, Albert Rubio had to give up his task of
> maintaining the TPDB, I had to do the porting to the new format
> myself, hence the TPDB did not grow much since last year. Regarding
> that, I would like to thank the people who sent me mails pointing out
> mistakes and proposing new examples. I think there are no more mistakes
> now. Any idea for organizing the TPDB differently, any suggestions, are
> welcome.  Volunteers are welcome.
 
Indeed, moderating the TPDB is a substantial ongoing job. I think it is
good if more than one person is responsible for it. I assume there will
be candidates for this job (including me). Let's decide in Aachen.

> So, for this year, I will follow my first idea of running all tools on
> all the TPDB. If some of you are willing to set up another
> "competition" on a set of examples that will come from another source,
> please do so.  If this set of examples is made available in the TPDB
> format, I can run my wrapper on it; it is no problem for me since it
> runs automatically.

I agree that, for the categories having sufficiently many systems in
the TPDB, this is the best decision for the moment. Other formats have
advantages and disadvantages, but it is not wise to change the format
drastically only a few weeks before the competition.

However, as I already mentioned, for string rewriting the database
is still too poor, and maybe for other categories too. In order to
force the extension of these categories, I propose that for them each
participant may submit 10 systems before some date
(say, one week before WST). These systems will then be added to
the database, and the competition will be run on the full database
including the new systems. I hope that term rewriting tools
will also participate in string rewriting and submit new systems.

One final point: I would like it to be decided that only valid 'YES'
answers count, and 'NO' answers do not. As I said before, techniques for
proving non-termination are quite different from those for proving
termination, and these areas should not be mixed. As most tools have no
facilities for proving non-termination, this would hardly influence the
competition, but it is good to be clear about the rules. Can you adjust
the rules accordingly?
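To illustrate the distinction in the TPDB's own notation (two toy systems of my own, not taken from the database): the one-rule system

```
(VAR x)
(RULES f(x) -> x)
```

terminates, so a valid 'YES' would count; whereas

```
(VAR x)
(RULES f(x) -> f(f(x)))
```

admits the infinite rewrite sequence f(x) -> f(f(x)) -> f(f(f(x))) -> ..., so the correct answer is 'NO', and under the rule proposed above that answer would simply not score.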

		Best regards, Hans Zantema.

+--------------------------------------+-----------------------------+
|                                      |                             |
|   Dr Hans Zantema                    |   Hoofdgebouw   kamer 6.73  |
|   Faculteit Wiskunde en Informatica  |   tel (040)2472749          |
|   Technische Universiteit Eindhoven  |   fax (040)2468508          |
|   Postbus 513     5600 MB Eindhoven  |   e-mail  H.Zantema at tue.nl  |
|   The Netherlands                    |   www.win.tue.nl/~hzantema  |
|                                      |                             |
+--------------------------------------+-----------------------------+


