# SciEval: A Benchmark for Automatic Evaluation of K–12 Science Instructional Materials
SciEval is the first benchmark dataset for Automatic Instructional Materials Evaluation (AIME) — a generative AI task where large language models evaluate K–12 science instructional materials by producing quality scores and evidence-based justifications aligned with the EQuIP rubric.
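To make the task concrete, here is a minimal sketch of an AIME evaluation loop in Python. The `generate` callable, prompt wording, and output format are illustrative assumptions, not the protocol used in the paper:

```python
from dataclasses import dataclass

@dataclass
class CriterionJudgment:
    criterion: str  # EQuIP rubric criterion code
    score: int      # 0 (N/A), 1 (Inadequate), 2 (Adequate), 3 (Extensive)
    evidence: str   # justification grounded in the lesson text

PROMPT = """You are evaluating a K-12 science lesson against EQuIP criterion {criterion}.

Lesson text:
{lesson}

Reply with a score from 0-3 on the first line, then a justification
that cites specific pages of the lesson."""

def evaluate_lesson(lesson: str, criteria: list[str], generate) -> list[CriterionJudgment]:
    """Score one lesson on every criterion; `generate` maps a prompt to model text."""
    judgments = []
    for criterion in criteria:
        reply = generate(PROMPT.format(criterion=criterion, lesson=lesson))
        score_line, _, evidence = reply.partition("\n")
        judgments.append(CriterionJudgment(criterion, int(score_line.strip()), evidence.strip()))
    return judgments
```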
The dataset consists of NGSS-aligned science lessons sourced from OpenSciEd, annotated by trained science education researchers through a multi-round process with structured adjudication. Each lesson is evaluated across 13 criteria spanning three-dimensional learning design, instructional supports, and student progress monitoring.
| Column | Description |
|---|---|
| ID | Unique identifier (e.g., `Course_lesson_N_Criterion`) |
| File | Source PDF filename |
| Criterion | EQuIP rubric criterion code |
| Score | Rating: 0 (N/A), 1 (Inadequate), 2 (Adequate), 3 (Extensive) |
| Evidence | Detailed justification with page references |
| Pos_Evidence | Positive evidence examples |
| Neg_Evidence | Negative evidence / gaps |
| Advice | Reviewer recommendations |
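A minimal loading sketch for these fields, assuming each annotation file is a CSV whose header row uses exactly the column names above (the on-disk format here is an assumption):

```python
import csv
from pathlib import Path

def load_annotation(path) -> list[dict]:
    """Read one per-lesson annotation file (13 rows, one per EQuIP criterion)."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        row["Score"] = int(row["Score"])  # ratings are 0-3
    return rows

def scores_for_criterion(annotation_dir: str, criterion: str) -> list[int]:
    """Collect every rating a single rubric criterion received across lessons."""
    return [
        row["Score"]
        for path in sorted(Path(annotation_dir).glob("*.csv"))
        for row in load_annotation(path)
        if row["Criterion"] == criterion
    ]
```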
| Property | Value |
|---|---|
| Instructional units | 32 |
| Total pages | 4,499 |
| Avg. pages per lesson | 16.5 |
| Avg. words per lesson | ~5,908 |
| Score 0 (N/A) | 41.8% of all ratings |
| Score 3 (Extensive) | 54.2% of non-N/A (active) ratings |
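The score distribution can be recomputed from the annotation files directly; a sketch under the same hypothetical CSV layout as above:

```python
import csv
from collections import Counter
from pathlib import Path

def score_distribution(annotation_dir: str) -> dict[int, float]:
    """Fraction of all criterion ratings at each score level 0-3."""
    counts = Counter()
    for path in Path(annotation_dir).glob("*.csv"):
        with open(path, newline="", encoding="utf-8") as f:
            counts.update(int(row["Score"]) for row in csv.DictReader(f))
    total = sum(counts.values())
    return {score: counts[score] / total for score in sorted(counts)}
```

Run over the full release, score 0 should come out near 41.8% of all ratings, matching the table above.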
Each lesson is evaluated across 13 criteria organized into three categories: three-dimensional learning design, instructional supports, and student progress monitoring.
Download the SciEval dataset to begin your research.
Each annotation file contains 13 rows (one per EQuIP criterion) with scores, evidence, and reviewer justifications. The PDFs and annotations are split into 218 training and 55 test documents via stratified sampling.
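A sketch of how a 218/55 stratified split can be produced with scikit-learn's `train_test_split`; the stratification key (the instructional unit each lesson belongs to) and the seed are assumptions, not the authors' documented procedure:

```python
from sklearn.model_selection import train_test_split

def split_documents(doc_ids: list[str], units: list[str], seed: int = 0):
    """Hold out 55 test documents, stratified by instructional unit (assumed key)."""
    train_ids, test_ids = train_test_split(
        doc_ids,
        test_size=55,    # 273 documents total -> 218 train / 55 test
        stratify=units,  # one stratum label per document
        random_state=seed,
    )
    return train_ids, test_ids
```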
If you use SciEval in your research, please cite our paper:
```bibtex
@inproceedings{scieval2026,
  title     = {SciEval: A Benchmark for Automatic Evaluation of K-12 Science Instructional Materials},
  author    = {Anonymous Authors},
  booktitle = {Proceedings of the 27th International Conference on Artificial Intelligence in Education (AIED)},
  year      = {2026}
}
```