Evaluation

Measuring the Competition Results

Case Law Competition Results (Tasks 1 and 2)

For Tasks 1 and 2, the evaluation metrics will be precision, recall, and F-measure:

Precision:

\frac{\text{the number of correctly retrieved cases (paragraphs) for all queries}}{\text{the number of retrieved cases (paragraphs) for all queries}}

Recall:

\frac{\text{the number of correctly retrieved cases (paragraphs) for all queries}}{\text{the number of relevant cases (paragraphs) for all queries}}

F-measure:

\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

In the evaluation of Tasks 1 and 2, we use micro-averaging (the evaluation measure is calculated over the pooled results of all queries) rather than macro-averaging (the evaluation measure is calculated for each query and then averaged). A minimal sketch of this computation is given below.
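The following sketch illustrates how the micro-averaged precision, recall, and F-measure could be computed from per-query sets of retrieved and relevant case (paragraph) IDs. The function and argument names are hypothetical and not part of the official evaluation scripts.

```python
def micro_prf(retrieved_by_query, relevant_by_query):
    """Micro-averaged precision, recall, and F-measure.

    Both arguments map a query ID to a set of case (paragraph) IDs.
    Counts are pooled over all queries before the measures are computed.
    """
    correct = retrieved = relevant = 0
    for qid, retrieved_ids in retrieved_by_query.items():
        relevant_ids = relevant_by_query.get(qid, set())
        correct += len(retrieved_ids & relevant_ids)
        retrieved += len(retrieved_ids)
        relevant += len(relevant_ids)
    precision = correct / retrieved if retrieved else 0.0
    recall = correct / relevant if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```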

Statute Law Competition Results (Tasks 3 and 4)

For Task 3, the evaluation metrics will be precision, recall, and F2-measure. Since the retrieval process is a pre-processing step that selects candidate articles for the subsequent entailment phase, we put emphasis on recall. The measures are defined as:

Precision:

\text{average of } \left( \frac{\text{the number of correctly retrieved articles for each query}}{\text{the number of retrieved articles for each query}} \right)

Recall:

\text{average of } \left( \frac{\text{the number of correctly retrieved articles for each query}}{\text{the number of relevant articles for each query}} \right)

F2-measure:

\frac{5 \times \text{Precision} \times \text{Recall}}{4 \times \text{Precision} + \text{Recall}}

In addition to the above evaluation measures, ordinary information retrieval measures such as Mean Average Precision (MAP) and R-precision can be used to discuss the characteristics of the submitted results, as sketched below.
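For reference, the following is a minimal sketch of Average Precision and R-precision for a single query (MAP is the mean of the per-query Average Precision values). The inputs, a ranked list of article IDs and a set of relevant article IDs, as well as the function names, are illustrative assumptions rather than official scoring code.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average Precision for one query: mean of precision@k over the
    ranks k at which a relevant article is retrieved."""
    hits, precisions = 0, []
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def r_precision(ranked_ids, relevant_ids):
    """Precision at rank R, where R is the number of relevant articles."""
    r = len(relevant_ids)
    if r == 0:
        return 0.0
    return len(set(ranked_ids[:r]) & relevant_ids) / r
```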

In COLIEE 2024, the final evaluation score over all queries is calculated by macro-averaging (the evaluation measure is calculated for each query and the average over queries is used as the final score) instead of micro-averaging (the evaluation measure is calculated over the pooled results of all queries).
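The sketch below illustrates the macro-averaged precision and recall for Task 3, with the F2-measure computed from the averaged values as in the formulas above. It assumes per-query sets of retrieved and relevant article IDs and at least one query; the names are hypothetical, not the official evaluation code.

```python
def macro_prf2(retrieved_by_query, relevant_by_query):
    """Macro-averaged precision and recall, plus the F2-measure.

    Precision and recall are computed per query and averaged over all
    queries; F2 is then derived from the averaged precision and recall.
    """
    precisions, recalls = [], []
    for qid, relevant_ids in relevant_by_query.items():
        retrieved_ids = retrieved_by_query.get(qid, set())
        correct = len(retrieved_ids & relevant_ids)
        precisions.append(correct / len(retrieved_ids) if retrieved_ids else 0.0)
        recalls.append(correct / len(relevant_ids) if relevant_ids else 0.0)
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    f2 = (5 * precision * recall / (4 * precision + recall)
          if (4 * precision + recall) else 0.0)
    return precision, recall, f2
```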

For Task 4, the evaluation measure will be accuracy, with respect to whether the yes/no question was correctly confirmed:

Accuracy:

\frac{\text{the number of queries which were correctly confirmed as true or false}}{\text{the number of all queries}}
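A short sketch of the Task 4 accuracy computation, assuming dictionaries that map query IDs to predicted and gold yes/no labels; the names are illustrative only.

```python
def accuracy(predicted_by_query, gold_by_query):
    """Fraction of queries whose yes/no answer was correctly confirmed."""
    correct = sum(1 for qid, gold in gold_by_query.items()
                  if predicted_by_query.get(qid) == gold)
    return correct / len(gold_by_query)
```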