Measuring the Competition Results
Case Law Competition Results (Tasks 1 and 2)
For Tasks 1 and 2, the evaluation metrics will be precision, recall, and F-measure:
Precision: (the number of correctly retrieved cases (paragraphs) for all queries) / (the number of retrieved cases (paragraphs) for all queries)
Recall: (the number of correctly retrieved cases (paragraphs) for all queries) / (the number of relevant cases (paragraphs) for all queries)
F-measure: (2 × Precision × Recall) / (Precision + Recall)
In the evaluation of Tasks 1 and 2, we use micro-averaging (the evaluation measure is calculated over the pooled results of all queries) rather than macro-averaging (the evaluation measure is calculated for each query and then averaged).
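As a concrete illustration, the following sketch computes the micro-averaged scores from per-query sets of retrieved and relevant case (paragraph) IDs; the function and argument names are illustrative and not part of any official evaluation script.

```python
def micro_prf(retrieved, relevant):
    """Micro-averaged precision, recall and F-measure for Tasks 1 and 2.

    retrieved, relevant: dicts mapping a query id to the set of retrieved
    (resp. relevant) case or paragraph ids. Counts are pooled over all
    queries before the ratios are taken (micro-average).
    """
    tp = sum(len(retrieved.get(q, set()) & relevant[q]) for q in relevant)
    n_retrieved = sum(len(retrieved.get(q, set())) for q in relevant)
    n_relevant = sum(len(relevant[q]) for q in relevant)
    precision = tp / n_retrieved if n_retrieved else 0.0
    recall = tp / n_relevant if n_relevant else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```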
Statute Law Competition Results (Tasks 3 and 4)
For Task 3, the primary evaluation metric is accuracy, defined as the proportion of correct entailment results accompanied by sufficient relevant articles (Recall = 1). In the case of a tie in accuracy between two or more teams, the F2-measure will be used as the secondary evaluation metric.
Accuracy: (the number of queries whose result is correct and whose retrieved articles include all relevant articles (Recall = 1)) / (the number of all queries)
Precision: average over all queries of (the number of correctly retrieved articles for each query) / (the number of retrieved articles for each query)
Recall: average over all queries of (the number of correctly retrieved articles for each query) / (the number of relevant articles for each query)
F2-measure: average over all queries of (5 × Precision × Recall) / (4 × Precision + Recall), using each query's Precision and Recall
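The sketch below shows one way these Task 3 numbers could be computed, assuming per-query sets of retrieved and relevant article IDs plus a per-query flag marking whether the result was judged correct; the per-query averaging of F2 and all names are our reading of the definitions, not an official script.

```python
def task3_scores(retrieved, relevant, result_correct):
    """Task 3 sketch: accuracy plus macro-averaged precision, recall and F2.

    retrieved, relevant: dicts mapping a query id to the set of retrieved
    (resp. relevant) article ids.
    result_correct: dict mapping a query id to True if the result for that
    query was judged correct.
    """
    n = len(relevant)
    accurate = 0
    precisions, recalls, f2s = [], [], []
    for q in relevant:
        tp = len(retrieved.get(q, set()) & relevant[q])
        p = tp / len(retrieved[q]) if retrieved.get(q) else 0.0
        r = tp / len(relevant[q]) if relevant[q] else 0.0
        f2 = 5 * p * r / (4 * p + r) if (4 * p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f2s.append(f2)
        # A query counts toward accuracy only if its result is correct and
        # every relevant article was retrieved (per-query recall = 1).
        if result_correct.get(q, False) and r == 1.0:
            accurate += 1
    return (accurate / n if n else 0.0,
            sum(precisions) / n if n else 0.0,
            sum(recalls) / n if n else 0.0,
            sum(f2s) / n if n else 0.0)
```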
For Task 4, the evaluation measure will be accuracy, with respect to whether the yes/no question was correctly confirmed:
Accuracy: (the number of queries which were correctly confirmed as yes or no) / (the number of all queries)
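Task 4 accuracy reduces to a single ratio; a minimal sketch, assuming per-query "Y"/"N" answers and illustrative names:

```python
def task4_accuracy(predicted, gold):
    """Task 4 sketch: fraction of yes/no questions answered correctly.

    predicted, gold: dicts mapping a query id to "Y" or "N".
    """
    correct = sum(predicted.get(q) == gold[q] for q in gold)
    return correct / len(gold) if gold else 0.0
```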
Legal Judgment Prediction for Japanese Tort Cases (Pilot Task)
Tort Prediction
For the Tort Prediction part of the Pilot Task, the evaluation measure will be accuracy, with respect to whether the True/False label for court_decision was correctly predicted:
Accuracy: (the number of tort instances whose court_decision label was correctly predicted) / (the number of all tort instances)
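A minimal sketch of the Tort Prediction score, assuming the submission and gold files are JSON Lines with one tort instance per line, aligned by line order, and that each object carries the boolean court_decision field mentioned above; the file-handling conventions are assumptions.

```python
import json

def tort_prediction_accuracy(pred_path, gold_path):
    """Pilot task sketch: accuracy of the predicted court_decision labels.

    Assumes one JSON object per line in both files, aligned by line order.
    """
    with open(pred_path, encoding="utf-8") as f:
        preds = [json.loads(line)["court_decision"] for line in f if line.strip()]
    with open(gold_path, encoding="utf-8") as f:
        golds = [json.loads(line)["court_decision"] for line in f if line.strip()]
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds) if golds else 0.0
```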
Rationale Extraction
For the Rationale Extraction part of the Pilot Task, the evaluation metric will be the F1-measure with respect to the True label for is_accepted:
Precision: (the number of correctly predicted True labels for is_accepted) / (the number of predicted True labels)
Recall: (the number of correctly predicted True labels for is_accepted) / (the number of gold True labels)
F-measure: (2 × Precision × Recall) / (Precision + Recall)
We first calculate the F-measure for each tort instance, then average it over all tort instances. (A tort instance corresponds to a line in a submission JSON file.)
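To make the per-instance averaging concrete, here is a sketch assuming each tort instance is represented as a list of predicted and a list of gold is_accepted booleans; the data layout and names are assumptions, not the official submission format.

```python
def rationale_extraction_f1(pred_labels, gold_labels):
    """Pilot task sketch: F-measure for the True class of is_accepted,
    computed per tort instance and then averaged over all instances.

    pred_labels, gold_labels: lists with one entry per tort instance; each
    entry is a list of booleans, one per is_accepted item in that instance.
    """
    f_scores = []
    for preds, golds in zip(pred_labels, gold_labels):
        tp = sum(p and g for p, g in zip(preds, golds))
        pred_true = sum(preds)
        gold_true = sum(golds)
        precision = tp / pred_true if pred_true else 0.0
        recall = tp / gold_true if gold_true else 0.0
        f = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
        f_scores.append(f)
    return sum(f_scores) / len(f_scores) if f_scores else 0.0
```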