Evaluation

Measuring the Competition Results

Case Law Competition Results (Tasks 1 and 2)

For Tasks 1 and 2, the evaluation metrics will be precision, recall, and F-measure:

Precision:

\frac{\text{the number of correctly retrieved cases (paragraphs) for all queries}}{\text{the number of retrieved cases (paragraphs) for all queries}}

Recall:

\frac{\text{the number of correctly retrieved cases (paragraphs) for all queries}}{\text{the number of relevant cases (paragraphs) for all queries}}

F-measure:

\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

In the evaluation of Task 1 and Task 2, we simply use the micro-average (the evaluation measure is calculated from the pooled results of all queries) rather than the macro-average (the evaluation measure is calculated for each query and then averaged).
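As a reference, here is a minimal sketch of how such micro-averaged scores could be computed. It is not the official evaluation script; the function name and the input format (gold and retrieved sets of case or paragraph IDs keyed by query ID) are illustrative assumptions.

```python
def micro_prf(gold, retrieved):
    """Micro-averaged precision, recall, and F-measure.

    gold, retrieved: dicts mapping a query ID to a set of case
    (or paragraph) IDs. Counts are pooled over all queries before
    the ratios are taken (micro-average).
    """
    tp = sum(len(gold[q] & retrieved.get(q, set())) for q in gold)
    n_retrieved = sum(len(r) for r in retrieved.values())
    n_relevant = sum(len(g) for g in gold.values())

    precision = tp / n_retrieved if n_retrieved else 0.0
    recall = tp / n_relevant if n_relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```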

Statute Law Competition Results (Tasks 3 and 4)

For Task 3, the evaluation metrics will be precision, recall, and the F2-measure (since the retrieval process is a pre-process that selects candidate articles to be passed to the entailment process, we put emphasis on recall):

Precision:

\text{average of} \left( \frac{\text{the number of correctly retrieved articles for each query}}{\text{the number of retrieved articles for each query}} \right)

Recall:

\text{average of} \left( \frac{\text{the number of correctly retrieved articles for each query}}{\text{the number of relevant articles for each query}} \right)

F2-measure:

\frac{5 \times \text{Precision} \times \text{Recall}}{4 \times \text{Precision} + \text{Recall}}

In addition to the above evaluation measures, ordinary information retrieval measures such as Mean Average Precision and R-precision may be used to discuss the characteristics of the submitted results.

In the evaluation of Task 3, we use the macro-average (the evaluation measure is calculated for each query and then averaged) to calculate the final evaluation score.
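A minimal sketch of this macro-averaged computation is shown below; it is only meant to illustrate the averaging order (per-query scores first, then the mean). The function name and the per-query gold/retrieved sets of article IDs are assumed inputs, not part of the official tooling.

```python
def macro_f2(gold, retrieved):
    """Macro-averaged precision, recall, and F2 for Task 3.

    gold, retrieved: dicts mapping a query ID to a set of article IDs.
    Each measure is computed per query and then averaged over queries
    (macro-average).
    """
    precisions, recalls, f2s = [], [], []
    for q, g in gold.items():
        r = retrieved.get(q, set())
        tp = len(g & r)
        p = tp / len(r) if r else 0.0
        rec = tp / len(g) if g else 0.0
        f2 = (5 * p * rec / (4 * p + rec)) if (4 * p + rec) else 0.0
        precisions.append(p)
        recalls.append(rec)
        f2s.append(f2)
    n = len(gold)
    return sum(precisions) / n, sum(recalls) / n, sum(f2s) / n
```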

For Task 4, the evaluation measure will be accuracy, with respect to whether the yes/no question was correctly confirmed:

Accuracy:

\frac{\text{the number of queries which were correctly confirmed as true or false}}{\text{the number of all queries}}
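As an illustration only, assuming the gold and predicted Yes/No answers are available as parallel lists of booleans, the accuracy reduces to:

```python
def accuracy(gold_labels, predicted_labels):
    """Fraction of queries whose Yes/No answer is predicted correctly."""
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)

# Example: two of three queries answered correctly -> 0.666...
accuracy([True, False, True], [True, True, True])
```

The same computation applies to the Tort Prediction accuracy in the Pilot Task below, with queries replaced by tort instances.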

Legal Judgment Prediction for Japanese Tort Cases (Pilot Task)

Tort Prediction

For the Tort Prediction part of the Pilot Task, the evaluation measure will be accuracy, with respect to whether the True/False label for court_decision was correctly predicted:

Accuracy:

\frac{\text{the number of instances which were correctly predicted as true or false}}{\text{the number of all instances}}

Rationale Extraction

For the Rationale Extraction part of the Pilot Task, the evaluation metric will be the F1-measure with respect to the True label of is_accepted:

Precision:

\frac{\text{the number of claims correctly predicted as True}}{\text{the number of claims predicted as True}}

Recall:

\frac{\text{the number of claims correctly predicted as True}}{\text{the number of claims whose gold labels are True}}

F-measure:

\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

We first calculate the F-measure for each tort instance, then average it over all tort instances. (A tort instance corresponds to a line in a submission JSON file.)
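A sketch of this instance-level averaging is given below, assuming each tort instance provides parallel lists of gold and predicted True/False labels for its claims; the function name and input structure are illustrative assumptions rather than the official evaluation code.

```python
def rationale_f1(instances):
    """Average, over tort instances, of the per-instance F1 on the True label.

    instances: iterable of (gold_labels, predicted_labels) pairs, one pair
    per tort instance, where each element is a list of booleans for the
    is_accepted label of the claims in that instance.
    """
    f1s = []
    for gold, pred in instances:
        tp = sum(g and p for g, p in zip(gold, pred))
        n_pred_true = sum(pred)
        n_gold_true = sum(gold)
        precision = tp / n_pred_true if n_pred_true else 0.0
        recall = tp / n_gold_true if n_gold_true else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1s.append(f1)
    return sum(f1s) / len(f1s) if f1s else 0.0
```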
