PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation

Evaluating Automated Real-World Slide Generation with Fine-Grained, Instance-Specific Criteria.

Xin-Sheng Chen1, Jiayu Zhu1, Pei-lin Li1, Hanzheng Wang1, Shuojin Yang1†, Meng-Hao Guo1
1 Tsinghua University
† Corresponding author
Fine-Grained Evaluation · Instance-Specific Rubric · Material-Grounded · Full Deck Generation Task

📄 Abstract

Slides serve as a critical medium for conveying information in presentation-oriented scenarios such as academia, education, and business. Despite their importance, creating high-quality slide decks remains time-consuming and cognitively demanding. Recent advances in generative models, such as Nano Banana Pro, have made automated slide generation increasingly feasible. However, existing evaluations of slide generation are often coarse-grained and rely on holistic judgments, making it difficult to accurately assess model capabilities or track meaningful advances in the field. In practice, the lack of fine-grained, verifiable evaluation criteria poses a critical bottleneck for both research and real-world deployment.

In this paper, we propose PresentBench, a fine-grained, rubric-based benchmark for evaluating automated real-world slide generation. It contains 238 evaluation instances, each supplemented with background materials required for slide creation. Moreover, we manually design an average of 54.1 checklist items per instance, each formulated as a binary question, to enable fine-grained, instance-specific evaluation of the generated slide decks.

Extensive experiments show that PresentBench provides more reliable evaluation results than existing methods, and exhibits significantly stronger alignment with human preferences. Furthermore, our benchmark reveals that NotebookLM significantly outperforms other slide generation methods, highlighting substantial recent progress in this domain.

💡 Why PresentBench?

1. Instance-Specific, Fine-Grained Criteria

Existing evaluation frameworks often adopt instance-agnostic scoring schemes, typically relying on a judging paradigm that poses the same set of general questions for every slide deck. Such evaluations fail to account for instance-specific content, making it difficult to assess whether a slide generation system truly follows the intended input.

PresentBench establishes fine-grained checklist items tailored to each slide deck instance. On average, each instance is associated with more than 50 specifically designed atomic evaluation items, converting vague qualitative grading into verifiable binary checks.
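As an illustration of what such atomic, binary checks look like, here is a minimal sketch; the item texts and the scoring snippet are hypothetical illustrations, not items drawn from the benchmark:

```python
# Hypothetical checklist items for one instance (illustrative only; the
# real benchmark averages 54.1 manually designed items per instance).
checklist = [
    "Does the deck contain a dedicated slide for the method overview?",   # material-independent
    "Does the results slide report the exact figures from the source paper?",  # material-dependent
]

# Each item receives a binary verdict; the score is the fraction satisfied.
verdicts = [True, False]
score = 100 * sum(verdicts) / len(verdicts)
print(score)  # 50.0
```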

2. Authentic, Grounded Scenarios

A large portion of prior work focuses on isolated subtasks or reference-free settings without grounding the task in concrete background materials. This creates a mismatch between evaluation settings and real-world usage.

For each instance in PresentBench, we curate authoritative background materials, such as top-tier conference papers, university course textbooks, and financial reports, and require systems to generate slides grounded in these materials. This design ensures that every task reflects realistic, end-to-end slide generation scenarios based on authentic sources.


Comparison of coarse-grained, instance-agnostic (M)LLM-as-a-Judge evaluation frameworks and PresentBench.


Performance comparison of various slide generation systems on the PPTEval evaluation framework and PresentBench. PresentBench adopts a stricter scoring scheme and poses a greater challenge to slide generation systems.

📊 Experiment Results

🏆 Leaderboard

| Method | Total | Academia | Advertising | Education | Economics | Talk |
|---|---|---|---|---|---|---|
| NotebookLM | **62.5** | **68.6** | **54.9** | **55.0** | **58.2** | **69.2** |
| Manus 1.6 | *57.8* | *64.0* | *52.4* | 50.7 | *52.8* | *63.0* |
| Tiangong | 54.7 | 59.2 | 44.5 | *53.7* | 46.5 | 59.8 |
| Zhipu | 53.6 | 57.5 | 41.0 | 52.5 | 47.6 | 59.0 |
| PPTAgent v2 | 50.2 | 53.3 | 46.7 | 46.1 | 46.1 | 56.6 |
| Gamma | 49.2 | 54.4 | 46.7 | 47.8 | 35.1 | 56.3 |
| Doubao | 48.0 | 50.3 | 42.9 | 45.4 | 44.0 | 54.7 |
| Qwen | 35.9 | 39.4 | 31.9 | 36.6 | 26.5 | 38.6 |

Note: Comparative results across five domains. The highest score in each column is shown in bold, and the second-highest in italics. Evaluation results are mean aggregated scores over the 238 instances.

📐 Dimension-wise Results

| Method | Presentation Fundamentals | Visual Design and Layout | Content Completeness | Content Correctness | Content Fidelity |
|---|---|---|---|---|---|
| NotebookLM | **81.0** | **62.8** | *67.8* | **56.0** | 45.1 |
| Manus 1.6 | *80.1* | *53.7* | 63.6 | 46.2 | *45.4* |
| Gamma | 66.9 | 22.6 | 54.3 | *47.7* | **54.1** |
| Doubao | 71.8 | 40.7 | 58.2 | 34.7 | 34.8 |
| Tiangong | 77.7 | 47.2 | **68.8** | 45.7 | 34.3 |
| Zhipu | 73.3 | 40.6 | 63.0 | 47.1 | 44.1 |
| Qwen | 53.1 | 21.9 | 29.7 | 29.9 | 44.6 |
| PPTAgent v2 | 79.8 | 44.4 | 60.2 | 37.9 | 28.8 |

Note: Comparative results across five evaluation dimensions. The highest score in each column is shown in bold, and the second-highest in italics.



✨ Highlights

Slide Generation is Still Challenging

Even the best-performing system only reaches an overall score of 62.5, indicating that grounded, end-to-end slide authoring is far from solved.

The primary difficulty lies in long-context distillation: inputs average 22.2k tokens (approximately 34 pages), requiring models to read, select, synthesize, and organize information across many dispersed facts.

Open-Source Systems Lag Behind

The representative open-source framework, PPTAgent (50.2), significantly trails behind NotebookLM (62.5) and Manus (57.8).

This gap likely arises not only from differences in backbone models but also from proprietary end-to-end pipelines, including slide-specific long-context planning, grounding mechanisms, and advanced layout and rendering engines.

Visual Design is a Primary Bottleneck and Differentiator

While many systems achieve high Fundamentals scores (around 70–80), their design and layout scores are much lower, with most systems scoring in the 40s.

Even strong content generators (such as Manus) still lag in layout quality, suggesting that better visual design will require dedicated layout and rendering pipelines, not just stronger models.

Material Grounding Remains Challenging

Content Completeness is notably higher than Content Correctness, meaning systems can assemble the required content structure but frequently introduce factual errors.

Content Fidelity also remains challenging even for strong systems (e.g., NotebookLM 45.1, Manus 45.4), pointing to persistent ungrounded details and hallucinations.

⚙️ Construction & Evaluation Workflow

By decomposing slide evaluation into verifiable, instance-specific checklist items and aggregating decisions via principled scoring mechanisms, PresentBench provides reliable and interpretable signals.

Construction Stage 1 — Data Source Curation

The dataset comprises 238 high-quality evaluation instances, covering five major thematic categories: Academia, Education, Economics, Talk, and Advertising.

  • Experts manually inspect and filter all data sources to ensure correctness, relevance, and suitability.
  • The average input length is approximately 22.2k tokens, with an average of 34.0 pages of material, requiring slide generation systems to process long-context information effectively.

Construction Stage 2 — Instruction & Checklist

To rigorously evaluate slide deck generation, we craft a highly constrained, instance-specific instruction and a corresponding fine-grained checklist for each evaluation instance.

  • The instructions impose strict constraints on overall structure, faithfulness, presentation quality, visual layout, and audience tone.
  • The checklist is organized into two tiers: material-independent (assessing fundamental logic and visual layout) and material-dependent (verifying content completeness, accuracy, and fidelity).
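A sketch of how the two tiers and five dimensions could be organized; the field names and item texts here are our illustration, not the released schema:

```python
# Hypothetical checklist layout for one instance (illustrative only).
checklist = {
    "material_independent": {
        "Presentation Fundamentals": ["Does every slide have a clear title?"],
        "Visual Design and Layout": ["Is all text free of overflow beyond slide borders?"],
    },
    "material_dependent": {
        "Content Completeness": ["Is each major section of the source material covered?"],
        "Content Correctness": ["Do all cited figures match the source numerically?"],
        "Content Fidelity": ["Does the deck avoid claims absent from the material?"],
    },
}

# Five evaluation dimensions in total, matching the dimension-wise results.
n_dims = len(checklist["material_independent"]) + len(checklist["material_dependent"])
print(n_dims)  # 5
```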

The construction and evaluation workflow of PresentBench.

Evaluation — Generation & Judgment

During evaluation, each slide generation system produces a slide deck from the instance's instruction and its accompanying background materials.

  1. A judge model utilizes the structured checklist to conduct the evaluation.
  2. Each checklist item is verified independently and assigned a discrete verdict (e.g., satisfied or violated), along with localized evidence.
  3. An aggregated score is computed for each evaluation dimension, calculated as the average completion rate of all checklist items within that specific dimension.
  4. The final score for each evaluation instance is computed as the average of the five dimension scores.
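As a worked example of this aggregation, averaging NotebookLM's five dimension scores from the dimension-wise table is consistent with its leaderboard total:

```python
# NotebookLM's dimension scores (Fundamentals, Design, Completeness,
# Correctness, Fidelity) from the dimension-wise results table.
dimension_scores = [81.0, 62.8, 67.8, 56.0, 45.1]

# Final score = unweighted mean of the five dimension scores.
overall = sum(dimension_scores) / len(dimension_scores)
print(round(overall, 1))  # 62.5, matching the leaderboard total
```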

🗂️ Dataset Statistics

Distribution of PresentBench across domains and sources.

Distribution of the number of input tokens (left) and checklist items (right).
| Statistic | Number |
|---|---|
| Total evaluation instances | 238 |
| • Academia | 91 |
| • Advertising | 16 |
| • Economics | 41 |
| • Education | 60 |
| • Talk | 30 |
| • English | 219 |
| • Chinese | 19 |
| Average input tokens | 22.2 × 10³ |
| • Average instruction tokens | 2.3 × 10³ |
| • Average material tokens | 19.9 × 10³ |
| Average material pages | 34.0 |
| Average checklist items | 54.1 |
| • Avg. Presentation Fundamentals items | 13.0 |
| • Avg. Visual Design and Layout items | 17.0 |
| • Avg. Content Completeness items | 12.6 |
| • Avg. Content Correctness items | 11.5 |
| • Avg. Content Fidelity items | dynamic |

Key statistics of PresentBench. The number of Content Fidelity items varies across slide decks and is therefore reported as dynamic.

🔍 PresentBench Example

Background Materials

Instructions


Generated Slide Deck (by NotebookLM)

Evaluation Rubrics & Results


🧠 Evaluation Protocol

PROMPT_PREFIX = r"""You are an expert in evaluating talk or presentation slides.
An AI agent is tasked with creating a complete, comprehensive, and logically-structured slide deck suitable for a talk or presentation.
Your task is to **evaluate the slides generated by that AI agent based on the requirement provided below**. The AI-generated slides are provided to you as File 1, and the material that the AI agent relied on is provided to you in the subsequent files.
Please indicate whether the generated slides meet the specified requirement by answering "yes" or "no". If no, provide a clear explanation of why it does not meet the requirement. If possible, reference specific slides (e.g., Slide 3, Slide 5) in your explanation.
If the slides fall anywhere between fully meeting and fully failing the requirement (i.e., partially meet it), you MUST classify the answer as "no". Only slides that fully satisfy the requirement with no exceptions may receive "yes".
Your answer must include a `\boxed{...}`, where `...` is "yes" or "no". Aside from this requirement, there are no restrictions on the response format.

Below is the requirement.
---

"""

# Material-Dependent Evaluation Protocol
import re

def evaluate_checklist_item(judge_model, slides, background_material, criteria):
    prompt = PROMPT_PREFIX + criteria
    response = judge_model.generate(prompt, context=[slides, background_material])
    
    if match := re.search(r'\\boxed\{([^}]+)\}', response):
        answer = match.group(1).strip().lower()
        if answer == "yes":
            return 1.0
    
    return 0.0

# Dimension Aggregation
def compute_dimension_score(dimension_results):
    score = sum(dimension_results) / len(dimension_results)
    return score

# Overall Score
def compute_overall_score(dimension_scores):
    score = sum(dimension_scores) / len(dimension_scores) # len(dimension_scores) == 5
    return score

To reduce the judge model's cognitive burden and thereby improve evaluation reliability, we evaluate each checklist item in a separate call to the judge model.
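The strict verdict parsing above can be exercised end-to-end with stubbed judge responses (the stub replies below are illustrative, not real judge outputs):

```python
import re

def parse_verdict(response: str) -> float:
    """Extract the \\boxed{...} verdict; strict: anything but 'yes' scores 0."""
    if match := re.search(r'\\boxed\{([^}]+)\}', response):
        if match.group(1).strip().lower() == "yes":
            return 1.0
    return 0.0

# Stubbed judge outputs for three hypothetical checklist items.
responses = [
    r"The deck covers all required sections. \boxed{yes}",
    r"Slide 3 misstates a key figure. \boxed{no}",
    r"Malformed reply with no verdict at all.",  # missing box -> 0.0
]
scores = [parse_verdict(r) for r in responses]
print(scores)  # [1.0, 0.0, 0.0]
```

Note that a partially satisfied item, like a malformed or negative verdict, contributes 0.0, which is what makes the scoring scheme strict.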

📚 BibTeX

@misc{chen2026presentbenchfinegrainedrubricbasedbenchmark,
      title={PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation}, 
      author={Xin-Sheng Chen and Jiayu Zhu and Pei-lin Li and Hanzheng Wang and Shuojin Yang and Meng-Hao Guo},
      year={2026},
      eprint={2603.07244},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.07244}, 
}