Measuring the Competition Results
Case Law Competition Results (Tasks 1 and 2)
For Tasks 1 and 2, the evaluation metrics will be precision, recall, and F-measure:
\[
\text{Precision} = \frac{\text{the number of correctly retrieved cases (paragraphs) for all queries}}{\text{the number of retrieved cases (paragraphs) for all queries}}
\]
\[
\text{Recall} = \frac{\text{the number of correctly retrieved cases (paragraphs) for all queries}}{\text{the number of relevant cases (paragraphs) for all queries}}
\]
\[
\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\]
In the evaluation of Tasks 1 and 2, we simply use the micro-average (the evaluation measure is calculated from the pooled results of all queries) rather than the macro-average (the evaluation measure is calculated for each query and then averaged).
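To make the micro-average concrete, here is a minimal Python sketch. The function name and data layout are illustrative assumptions, not the official evaluation script: each query maps to the set of retrieved and the set of relevant case (paragraph) IDs, and the counts are pooled over all queries before the ratios are taken.

```python
from typing import Dict, Set, Tuple


def micro_prf(retrieved: Dict[str, Set[str]],
              relevant: Dict[str, Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F-measure: counts are
    pooled over all queries before the ratios are computed."""
    tp = sum(len(retrieved.get(q, set()) & relevant[q]) for q in relevant)
    n_retrieved = sum(len(retrieved.get(q, set())) for q in relevant)
    n_relevant = sum(len(relevant[q]) for q in relevant)
    precision = tp / n_retrieved if n_retrieved else 0.0
    recall = tp / n_relevant if n_relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical example: two queries with retrieved and relevant case IDs.
retrieved = {"q1": {"c1", "c2"}, "q2": {"c3"}}
relevant = {"q1": {"c1"}, "q2": {"c3", "c4"}}
print(micro_prf(retrieved, relevant))  # (0.666..., 0.666..., 0.666...)
```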
Statute Law Competition Results (Tasks 3 and 4)
For Task 3, the evaluation metrics will be precision, recall, and F2-measure. Since the IR process is a pre-processing step that supplies candidate articles to the subsequent entailment phase, we put emphasis on recall:
\[
\text{Precision} = \text{average of}\ \frac{\text{the number of correctly retrieved articles for each query}}{\text{the number of retrieved articles for each query}}
\]
\[
\text{Recall} = \text{average of}\ \frac{\text{the number of correctly retrieved articles for each query}}{\text{the number of relevant articles for each query}}
\]
\[
\text{F2-measure} = \text{average of}\ \frac{5 \times \text{Precision} \times \text{Recall}}{4 \times \text{Precision} + \text{Recall}}
\]
In addition to the above evaluation measures, ordinary information retrieval measures such as Mean Average Precision (MAP) and R-precision can be used to discuss the characteristics of the submitted results.
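As an illustration of these supplementary measures, here is a short sketch of per-query average precision and R-precision; the helper names and data layout are assumptions for illustration, not official code. MAP is simply average precision averaged over all queries.

```python
from typing import List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision for one query: mean of precision@k over the
    ranks k at which a relevant document appears."""
    hits = 0
    total = 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


def r_precision(ranked: List[str], relevant: Set[str]) -> float:
    """R-precision for one query: precision at rank R, where R is the
    number of relevant documents."""
    r = len(relevant)
    return len(set(ranked[:r]) & relevant) / r if r else 0.0
```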
In COLIEE 2024, the final evaluation score over all queries is calculated with the macro-average (the evaluation measure is calculated for each query, and the average over queries is used as the final score) instead of the micro-average (the evaluation measure is calculated from the pooled results of all queries).
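The following sketch contrasts with the micro-average shown earlier: each measure, including F2, is computed per query and then averaged. Again, the function name and data layout are illustrative assumptions rather than the official evaluation script.

```python
from statistics import mean
from typing import Dict, Set, Tuple


def macro_prf2(retrieved: Dict[str, Set[str]],
               relevant: Dict[str, Set[str]]) -> Tuple[float, float, float]:
    """Macro-averaged precision, recall, and F2-measure: each measure is
    computed per query and then averaged over the queries."""
    ps, rs, f2s = [], [], []
    for q in relevant:
        ret = retrieved.get(q, set())
        tp = len(ret & relevant[q])
        p = tp / len(ret) if ret else 0.0
        r = tp / len(relevant[q]) if relevant[q] else 0.0
        # F2 (beta = 2) weights recall more heavily than precision.
        f2 = (5 * p * r) / (4 * p + r) if (4 * p + r) else 0.0
        ps.append(p)
        rs.append(r)
        f2s.append(f2)
    return mean(ps), mean(rs), mean(f2s)
```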
For Task 4, the evaluation measure will be accuracy, with respect to whether the yes/no question was correctly confirmed:
\[
\text{Accuracy} = \frac{\text{the number of queries correctly confirmed as true or false}}{\text{the number of all queries}}
\]
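A minimal sketch of the accuracy computation, assuming hypothetical dictionaries that map query IDs to gold and predicted yes/no labels:

```python
from typing import Dict


def accuracy(predicted: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of yes/no queries whose entailment label is correct."""
    correct = sum(1 for q, label in gold.items() if predicted.get(q) == label)
    return correct / len(gold) if gold else 0.0


# Hypothetical example with "Y"/"N" labels per query ID.
print(accuracy({"q1": "Y", "q2": "N"}, {"q1": "Y", "q2": "Y"}))  # 0.5
```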