AIED 2026

SciEval

A Benchmark for Automatic Evaluation of K–12 Science Instructional Materials

About

SciEval is the first benchmark dataset for Automatic Instructional Materials Evaluation (AIME) — a generative AI task where large language models evaluate K–12 science instructional materials by producing quality scores and evidence-based justifications aligned with the EQuIP rubric.

The dataset consists of NGSS-aligned science lessons sourced from OpenSciEd, annotated by trained science education researchers through a multi-round process with structured adjudication. Each lesson is evaluated across 13 criteria spanning three-dimensional learning design, instructional supports, and student progress monitoring.

Dataset at a Glance

  • 273 lesson-level materials
  • 3,549 criterion-level annotations
  • 13 EQuIP rubric criteria
  • 89.7% inter-rater agreement

Dataset Details

Annotation Schema

Column          Description
ID              Unique identifier (e.g., Course_lesson_N_Criterion)
File            Source PDF filename
Criterion       EQuIP rubric criterion code
Score           Rating: 0 (N/A), 1 (Inadequate), 2 (Adequate), 3 (Extensive)
Evidence        Detailed justification with page references
Pos_Evidence    Positive evidence examples
Neg_Evidence    Negative evidence / gaps
Advice          Reviewer recommendations
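A minimal sketch of reading one annotation file with pandas, assuming the annotations ship as CSV with the columns above (the filename and CSV format are assumptions; adapt to the released layout):

import pandas as pd

# Hypothetical filename; substitute an actual annotation file from the release.
ann = pd.read_csv("annotations/Course_lesson_1.csv")

# Each annotation file holds 13 rows, one per EQuIP criterion.
assert len(ann) == 13
assert set(ann.columns) >= {
    "ID", "File", "Criterion", "Score",
    "Evidence", "Pos_Evidence", "Neg_Evidence", "Advice",
}

# Inspect one criterion-level annotation.
print(ann.loc[ann["Criterion"] == "I.A", ["Score", "Evidence"]])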

Corpus Statistics

Property                  Value
Instructional units       32
Total pages               4,499
Avg. pages per lesson     16.5
Avg. words per lesson     ~5,908
Score 0 (N/A)             41.8% of all annotations
Score 3 (Extensive)       54.2% of non-N/A (active) annotations
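Note that the two score shares are computed on different bases: Score 0 as a fraction of all 3,549 annotations, Score 3 as a fraction of the non-N/A remainder. A sketch of that computation, assuming the criterion-level CSVs sit under a directory named annotations/ (the path pattern is an assumption):

import glob
import pandas as pd

# Concatenate every criterion-level annotation row.
all_anns = pd.concat(pd.read_csv(f) for f in glob.glob("annotations/*.csv"))

na_share = (all_anns["Score"] == 0).mean()        # reported: 41.8% of all rows
active = all_anns[all_anns["Score"] > 0]          # exclude N/A ratings
extensive_share = (active["Score"] == 3).mean()   # reported: 54.2% of active rows

print(f"Score 0 (N/A): {na_share:.1%}  Score 3 of active: {extensive_share:.1%}")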

Train / Test Split

  • 218 training PDFs
  • 55 test PDFs

EQuIP Rubric Criteria

Each lesson is evaluated across 13 criteria organized into three categories (a machine-readable mapping of the codes follows the list):

I: NGSS 3D Design

  • I.A Explaining Phenomena / Designing Solutions
  • I.B.1 Science & Engineering Practices (SEPs)
  • I.B.2 Disciplinary Core Ideas (DCIs)
  • I.B.3 Crosscutting Concepts (CCCs)
  • I.C Integrating the Three Dimensions

II: Instructional Supports

  • II.A Relevance and Authenticity
  • II.B Student Ideas
  • II.C Scientific Accuracy

III: Monitoring Student Progress

  • III.A Monitoring 3D Performance
  • III.B Formative Assessment
  • III.C Scoring Guidance
  • III.D Unbiased Tasks / Items
  • III.E Opportunity to Learn
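For grouping criterion-level annotations by category, the 13 codes can be expressed as a simple mapping. This is a convenience sketch, not part of the release; the codes and category names are taken directly from the listing above:

# EQuIP criterion codes grouped by category.
EQUIP_CRITERIA = {
    "I":   ["I.A", "I.B.1", "I.B.2", "I.B.3", "I.C"],     # NGSS 3D Design
    "II":  ["II.A", "II.B", "II.C"],                      # Instructional Supports
    "III": ["III.A", "III.B", "III.C", "III.D", "III.E"], # Monitoring Student Progress
}

# Reverse lookup: criterion code -> category.
CATEGORY_OF = {code: cat for cat, codes in EQUIP_CRITERIA.items() for code in codes}

assert len(CATEGORY_OF) == 13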

Getting Started

Download the SciEval dataset to begin your research.

Each annotation file contains 13 rows (one per EQuIP criterion) with scores, evidence, and reviewer justifications. The PDFs and annotations are split into 218 training and 55 test documents via stratified sampling.
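As a quick sanity check after downloading, the split sizes and per-file row counts can be verified directly. The train/ and test/ directory names below are assumptions about the release layout:

import glob
import pandas as pd

train_files = glob.glob("train/annotations/*.csv")  # hypothetical layout
test_files = glob.glob("test/annotations/*.csv")

# The documented split: 218 training and 55 test documents.
assert len(train_files) == 218 and len(test_files) == 55

# Every annotation file should carry one row per EQuIP criterion.
for path in train_files + test_files:
    rows = len(pd.read_csv(path))
    assert rows == 13, f"{path}: expected 13 criterion rows, got {rows}"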

Citation

If you use SciEval in your research, please cite our paper:

@inproceedings{scieval2026,
  title     = {SciEval: A Benchmark for Automatic Evaluation of
               K-12 Science Instructional Materials},
  author    = {Anonymous Authors},
  booktitle = {Proceedings of the 27th International Conference on
               Artificial Intelligence in Education (AIED)},
  year      = {2026}
}