COLIEE 2026
Evaluation

Measuring the Competition Results

Case Law Competition Results (Tasks 1 and 2)

For Tasks 1 and 2, the evaluation metrics will be precision, recall, and F-measure:

Precision:

\frac{\text{the number of correctly retrieved cases (paragraphs) for all queries}}{\text{the number of retrieved cases (paragraphs) for all queries}}

Recall:

\frac{\text{the number of correctly retrieved cases (paragraphs) for all queries}}{\text{the number of relevant cases (paragraphs) for all queries}}

F-measure:

\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

In the evaluation of Tasks 1 and 2, we use the micro-average (the evaluation measure is calculated over the pooled results of all queries) rather than the macro-average (the evaluation measure is calculated for each query and the results are then averaged).
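
As a concrete illustration, here is a minimal Python sketch of the micro-averaged computation; the per-query gold and retrieved sets are hypothetical inputs, not the official evaluation script:

    def micro_prf(retrieved, relevant):
        """Micro-averaged precision, recall and F-measure over all queries.

        retrieved, relevant: dicts mapping query id -> set of
        case (paragraph) ids.
        """
        tp = sum(len(retrieved[q] & relevant[q]) for q in relevant)
        n_retrieved = sum(len(retrieved[q]) for q in relevant)
        n_relevant = sum(len(relevant[q]) for q in relevant)
        precision = tp / n_retrieved if n_retrieved else 0.0
        recall = tp / n_relevant if n_relevant else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

Because the counts are pooled before the ratios are taken, queries with many relevant cases (paragraphs) carry more weight than they would under macro-averaging.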

Statute Law Competition Results (Tasks 3 and 4)

For Task 3, the primary evaluation metric is accuracy, defined as the proportion of correct entailment results accompanied by sufficient relevant articles (Recall = 1). In the case of a tie in accuracy between two or more teams, the F2-measure will be used as the secondary evaluation metric.

Accuracy:

\frac{\text{the number of queries which were correctly predicted as true or false using sufficient relevant articles}}{\text{the number of all queries}}

Precision:

\text{average of} \left( \frac{\text{the number of correctly retrieved articles for each query}}{\text{the number of retrieved articles for each query}} \right)

Recall:

\text{average of} \left( \frac{\text{the number of correctly retrieved articles for each query}}{\text{the number of relevant articles for each query}} \right)

F2-measure:

\frac{5 \times \text{Precision} \times \text{Recall}}{4 \times \text{Precision} + \text{Recall}}
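
The following sketch shows how both Task 3 measures could be computed together; the per-query record layout (correct_label, retrieved, relevant) is an assumption for illustration, not the official submission format:

    def task3_scores(results):
        """results: one dict per query with keys
        'correct_label' (bool, entailment label predicted correctly),
        'retrieved' and 'relevant' (sets of article ids).
        """
        # Accuracy: the entailment label must be correct AND every
        # relevant article must have been retrieved (per-query Recall = 1).
        n_correct = sum(
            1 for r in results
            if r["correct_label"] and r["relevant"] <= r["retrieved"]
        )
        accuracy = n_correct / len(results)

        # Macro-averaged precision and recall over queries, combined
        # into the F2-measure used as the tie-breaker.
        precision = sum(
            (len(r["retrieved"] & r["relevant"]) / len(r["retrieved"]))
            if r["retrieved"] else 0.0
            for r in results
        ) / len(results)
        recall = sum(
            (len(r["retrieved"] & r["relevant"]) / len(r["relevant"]))
            if r["relevant"] else 0.0
            for r in results
        ) / len(results)
        f2 = (5 * precision * recall / (4 * precision + recall)
              if precision + recall else 0.0)
        return accuracy, f2

Note that a query counts toward accuracy only when its entailment label is correct and all of its relevant articles were retrieved, while the F2 tie-breaker weights recall more heavily than precision.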

For Task 4, the evaluation measure will be accuracy, with respect to whether the yes/no question was correctly confirmed:

Accuracy:

\frac{\text{the number of queries which were correctly confirmed as true or false}}{\text{the number of all queries}}
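
Since Task 4 is scored purely on label agreement, a sketch reduces to a single ratio over parallel lists of gold and predicted labels (hypothetical inputs); the same computation applies to the Tort Prediction accuracy below:

    def accuracy(gold, predicted):
        """Fraction of queries whose true/false label was correctly confirmed."""
        return sum(g == p for g, p in zip(gold, predicted)) / len(gold)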

Legal Judgment Prediction for Japanese Tort Cases (Pilot Task)

Tort Prediction

For the Tort Prediction part of the Pilot Task, the evaluation measure will be accuracy, with respect to whether the True/False label for court_decision was correctly predicted:

Accuracy:

\frac{\text{the number of instances which were correctly predicted as true or false}}{\text{the number of all instances}}

Rationale Extraction

For the Rationale Extraction part of the Pilot Task, the evaluation metric will be the F1-measure with respect to the True label for is_accepted:

Precision:

\frac{\text{the number of claims correctly predicted as True}}{\text{the number of claims predicted as True}}

Recall:

\frac{\text{the number of claims correctly predicted as True}}{\text{the number of claims whose gold labels are True}}

F-measure:

\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

We first calculate the F-measure for each tort instance, then average it over all tort instances. (A tort instance corresponds to a line in a submission JSON file.)
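
A minimal sketch of this per-instance (macro-averaged) F1, assuming each tort instance is given as parallel lists of gold and predicted is_accepted labels (this data layout is an assumption, not the official JSON schema):

    def rationale_f1(instances):
        """instances: list of (gold, pred) pairs, each a list of boolean
        is_accepted labels for the claims of one tort instance.
        """
        f_scores = []
        for gold, pred in instances:
            tp = sum(g and p for g, p in zip(gold, pred))
            n_pred_true = sum(pred)
            n_gold_true = sum(gold)
            precision = tp / n_pred_true if n_pred_true else 0.0
            recall = tp / n_gold_true if n_gold_true else 0.0
            f = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
            f_scores.append(f)                 # F-measure per tort instance...
        return sum(f_scores) / len(f_scores)   # ...then averaged over instances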