Measuring the Competition Results
Case Law Competition Results (Tasks 1 and 2)
For Tasks 1 and 2, the evaluation metrics will be precision, recall, and F-measure:
Precision: (the number of correctly retrieved cases (paragraphs) for all queries) / (the number of retrieved cases (paragraphs) for all queries)
Recall: (the number of correctly retrieved cases (paragraphs) for all queries) / (the number of relevant cases (paragraphs) for all queries)
F-measure: (2 × Precision × Recall) / (Precision + Recall)
In the evaluation of Task 1 and Task 2, we use micro-averaging (the evaluation measure is calculated using the results of all queries) rather than macro-averaging (the evaluation measure is calculated for each query and then averaged).
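To make the micro-averaging concrete, below is a minimal sketch (not the official evaluation script) of how these three scores could be computed; `retrieved` and `relevant` are hypothetical dictionaries mapping each query ID to the set of returned and gold case (paragraph) IDs.

```python
def micro_prf(retrieved, relevant):
    """Micro-averaged precision, recall, and F-measure over all queries."""
    # Pool the counts over all queries before computing the ratios.
    tp = sum(len(retrieved[q] & relevant[q]) for q in relevant)  # correctly retrieved
    n_retrieved = sum(len(retrieved[q]) for q in relevant)       # retrieved
    n_relevant = sum(len(relevant[q]) for q in relevant)         # relevant
    precision = tp / n_retrieved if n_retrieved else 0.0
    recall = tp / n_relevant if n_relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```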
Statute Law Competition Results (Tasks 3 and 4)
For Task 3, the evaluation metrics will be precision, recall, and F2-measure. Because the retrieval step is a pre-process that supplies candidate articles to the entailment step, we put emphasis on recall:
Precision: average of (the number of correctly retrieved articles for each query) / (the number of retrieved articles for each query)
Recall: average of (the number of correctly retrieved articles for each query) / (the number of relevant articles for each query)
F2-measure: (5 × Precision × Recall) / (4 × Precision + Recall)
In addition to the above evaluation measures, ordinary information retrieval measures such as Mean Average Precision and R-precision can be used for discussing the characteristics of the submitted results.
In the evaluation of Task 3, we use macro-averaging (the evaluation measure is calculated for each query and then averaged) to calculate the final evaluation score.
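For comparison with the micro-averaged sketch above, here is a minimal sketch of the macro-averaged Task 3 scores under the same hypothetical data layout (each query ID mapped to a set of retrieved or relevant article IDs); it is an illustration, not the official scorer.

```python
def macro_prf2(retrieved, relevant):
    """Macro-averaged precision, recall, and F2-measure over queries."""
    precisions, recalls, f2s = [], [], []
    for q in relevant:
        tp = len(retrieved[q] & relevant[q])
        p = tp / len(retrieved[q]) if retrieved[q] else 0.0
        r = tp / len(relevant[q]) if relevant[q] else 0.0
        # F2 weights recall more heavily than precision (beta = 2).
        f2 = 5 * p * r / (4 * p + r) if (4 * p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f2s.append(f2)
    n = len(relevant)
    return sum(precisions) / n, sum(recalls) / n, sum(f2s) / n
```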
For Task 4, the evaluation measure will be accuracy, with respect to whether the yes/no question was correctly confirmed:
Accuracy: (the number of queries whose yes/no answer was correctly confirmed) / (the number of all queries)
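A minimal sketch of this accuracy, assuming hypothetical dictionaries that map each query ID to its predicted and gold yes/no answer:

```python
def task4_accuracy(predicted, gold):
    """Fraction of queries whose yes/no answer was confirmed correctly."""
    # predicted / gold: hypothetical dicts of query ID -> "Y" or "N".
    correct = sum(1 for q in gold if predicted.get(q) == gold[q])
    return correct / len(gold)
```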
Legal Judgment Prediction for Japanese Tort cases (Pilot Task)
Tort Prediction
For Tort Prediction in the Pilot Task, the evaluation measure will be accuracy, with respect to whether the True/False label for court_decision was correctly predicted:
Accuracy: (the number of tort instances whose court_decision label was correctly predicted) / (the number of all tort instances)
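A minimal sketch of this accuracy, not the official scorer: it assumes a JSON-Lines layout with one tort instance per line and that both files list the instances in the same order; the court_decision field name comes from the description above, everything else is an assumption.

```python
import json

def load_jsonl(path):
    # One JSON object (tort instance) per line.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def tort_prediction_accuracy(pred_path, gold_path):
    preds, golds = load_jsonl(pred_path), load_jsonl(gold_path)
    # Assumes the prediction and gold files list tort instances in the same order.
    correct = sum(p["court_decision"] == g["court_decision"]
                  for p, g in zip(preds, golds))
    return correct / len(golds)
```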
Rationale Extraction
For Rationale Extraction in the Pilot Task, the evaluation metric will be the F-measure (F1) with respect to the True labels for is_accepted:
Precision: (the number of correctly predicted True labels for is_accepted) / (the number of predicted True labels)
Recall: (the number of correctly predicted True labels for is_accepted) / (the number of gold True labels)
F-measure: (2 × Precision × Recall) / (Precision + Recall)
We first calculate the F-measure for each tort instance, then average it over all tort instances. (A tort instance corresponds to a line in a submission JSON file.)
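The per-instance averaging can be sketched as follows (not the official scorer); representing each tort instance's predicted and gold True labels for is_accepted as sets of claim identifiers is an assumption made for illustration.

```python
def rationale_f1(pred_true, gold_true):
    # pred_true / gold_true: sets of identifiers whose is_accepted label is True
    # within a single tort instance (hypothetical representation).
    tp = len(pred_true & gold_true)
    precision = tp / len(pred_true) if pred_true else 0.0
    recall = tp / len(gold_true) if gold_true else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def rationale_score(instances):
    # instances: list of (pred_true, gold_true) pairs, one per tort instance,
    # i.e. one per line of the submission JSON file; the final score is the
    # F-measure averaged over all tort instances.
    return sum(rationale_f1(p, g) for p, g in instances) / len(instances)
```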