# SciEval: A Benchmark for Automatic Evaluation of K–12 Science Instructional Materials
SciEval is the first benchmark dataset for Automatic Instructional Materials Evaluation (AIME) — a generative AI task where large language models evaluate K–12 science instructional materials by producing quality scores and evidence-based justifications aligned with the EQuIP rubric.
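To make the task concrete, here is a minimal sketch of an AIME evaluation loop in Python. The `generate` callable, prompt wording, and output format are illustrative assumptions, not the protocol used in the paper:

```python
from dataclasses import dataclass

@dataclass
class CriterionJudgment:
    criterion: str  # EQuIP rubric criterion code
    score: int      # 0 (N/A), 1 (Inadequate), 2 (Adequate), 3 (Extensive)
    evidence: str   # justification grounded in the lesson text

PROMPT = """You are evaluating a K-12 science lesson against EQuIP criterion {criterion}.

Lesson text:
{lesson}

Reply with a score from 0-3 on the first line, then a justification
that cites specific pages of the lesson."""

def evaluate_lesson(lesson: str, criteria: list[str], generate) -> list[CriterionJudgment]:
    """Score one lesson on every criterion; `generate` maps a prompt to model text."""
    judgments = []
    for criterion in criteria:
        reply = generate(PROMPT.format(criterion=criterion, lesson=lesson))
        score_line, _, evidence = reply.partition("\n")
        judgments.append(CriterionJudgment(criterion, int(score_line.strip()), evidence.strip()))
    return judgments
```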
The dataset consists of NGSS-aligned science lessons sourced from OpenSciEd, annotated by trained science education researchers through a multi-round process with structured adjudication. Each lesson is evaluated across 13 criteria spanning three-dimensional learning design, instructional supports, and student progress monitoring.
| Column | Description |
|---|---|
| ID | Unique identifier (e.g., `Course_lesson_N_Criterion`) |
| File | Source PDF filename |
| Criterion | EQuIP rubric criterion code |
| Score | Rating: 0 (N/A), 1 (Inadequate), 2 (Adequate), 3 (Extensive) |
| Evidence | Detailed justification with page references |
| Pos_Evidence | Positive evidence examples |
| Neg_Evidence | Negative evidence / gaps |
| Advice | Reviewer recommendations |
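A minimal loading sketch for these fields, assuming each annotation file is a CSV whose header row uses exactly the column names above (the on-disk format here is an assumption):

```python
import csv
from pathlib import Path

def load_annotation(path) -> list[dict]:
    """Read one per-lesson annotation file (13 rows, one per EQuIP criterion)."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        row["Score"] = int(row["Score"])  # ratings are 0-3
    return rows

def scores_for_criterion(annotation_dir: str, criterion: str) -> list[int]:
    """Collect every rating a single rubric criterion received across lessons."""
    return [
        row["Score"]
        for path in sorted(Path(annotation_dir).glob("*.csv"))
        for row in load_annotation(path)
        if row["Criterion"] == criterion
    ]
```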
| Property | Value |
|---|---|
| Instructional units | 32 |
| Total pages | 4,499 |
| Avg. pages per lesson | 16.5 |
| Avg. words per lesson | ~5,908 |
| Score 0 (N/A) | 41.8% of all ratings |
| Score 3 (Extensive) | 54.2% of non-N/A (active) ratings |
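The score distribution can be recomputed from the annotation files directly; a sketch under the same hypothetical CSV layout as above:

```python
import csv
from collections import Counter
from pathlib import Path

def score_distribution(annotation_dir: str) -> dict[int, float]:
    """Fraction of all criterion ratings at each score level 0-3."""
    counts = Counter()
    for path in Path(annotation_dir).glob("*.csv"):
        with open(path, newline="", encoding="utf-8") as f:
            counts.update(int(row["Score"]) for row in csv.DictReader(f))
    total = sum(counts.values())
    return {score: counts[score] / total for score in sorted(counts)}
```

Run over the full release, score 0 should come out near 41.8% of all ratings, matching the table above.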
Each lesson is evaluated across 13 criteria organized into three categories: three-dimensional learning design, instructional supports, and student progress monitoring.
Download the SciEval dataset to begin your research.
Each annotation file contains 13 rows (one per EQuIP criterion) with scores, evidence, and reviewer justifications. The PDFs and annotations are split into 218 training and 55 test documents via stratified sampling.
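A sketch of how a 218/55 stratified split can be produced with scikit-learn's `train_test_split`; the stratification key (the instructional unit each lesson belongs to) and the seed are assumptions, not the authors' documented procedure:

```python
from sklearn.model_selection import train_test_split

def split_documents(doc_ids: list[str], units: list[str], seed: int = 0):
    """Hold out 55 test documents, stratified by instructional unit (assumed key)."""
    train_ids, test_ids = train_test_split(
        doc_ids,
        test_size=55,    # 273 documents total -> 218 train / 55 test
        stratify=units,  # one stratum label per document
        random_state=seed,
    )
    return train_ids, test_ids
```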
If you use SciEval in your research, please cite our paper:
```bibtex
@inproceedings{scieval2026,
  title     = {SciEval: A Benchmark for Automatic Evaluation of K-12 Science Instructional Materials},
  author    = {Anonymous Authors},
  booktitle = {Proceedings of the 27th International Conference on Artificial Intelligence in Education (AIED)},
  year      = {2026}
}
```