PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation

Evaluating Automated Real-World Slide Generation with Fine-Grained, Instance-Specific Criteria.

Xin-Sheng Chen1, Jiayu Zhu1, Pei-lin Li1, Hanzheng Wang1, Shuojin Yang1†, Meng-Hao Guo1
1 Tsinghua University
† Corresponding author
Fine-Grained Evaluation · Instance-Specific Rubric · Material-Grounded · Full Deck Generation Task

📄 Abstract

Slides serve as a critical medium for conveying information in presentation-oriented scenarios such as academia, education, and business. Despite their importance, creating high-quality slide decks remains time-consuming and cognitively demanding. Recent advances in generative models, such as Nano Banana Pro, have made automated slide generation increasingly feasible. However, existing evaluations of slide generation are often coarse-grained and rely on holistic judgments, making it difficult to accurately assess model capabilities or track meaningful advances in the field. In practice, the lack of fine-grained, verifiable evaluation criteria poses a critical bottleneck for both research and real-world deployment.

In this paper, we propose PresentBench, a fine-grained, rubric-based benchmark for evaluating automated real-world slide generation. It contains 238 evaluation instances, each supplemented with background materials required for slide creation. Moreover, we manually design an average of 54.1 checklist items per instance, each formulated as a binary question, to enable fine-grained, instance-specific evaluation of the generated slide decks.

Extensive experiments show that PresentBench provides more reliable evaluation results than existing methods, and exhibits significantly stronger alignment with human preferences. Furthermore, our benchmark reveals that NotebookLM significantly outperforms other slide generation methods, highlighting substantial recent progress in this domain.

💡 Why PresentBench?

1. Instance-Specific, Fine-Grained Criteria

Existing evaluation frameworks often adopt instance-agnostic scoring schemes, typically relying on a judging paradigm that poses the same set of general questions for every slide deck. Such evaluations fail to account for instance-specific content, making it difficult to assess whether a slide generation system truly follows the intended input.

PresentBench establishes fine-grained checklist items tailored to each slide deck instance. On average, each instance is associated with more than 50 specifically designed atomic evaluation items, converting vague qualitative grading into verifiable binary checks.
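As an illustration of what such atomic, binary checks look like, here is a minimal sketch; the item texts and the scoring snippet are hypothetical illustrations, not items drawn from the benchmark:

```python
# Hypothetical checklist items for one instance (illustrative only; the
# real benchmark averages 54.1 manually designed items per instance).
checklist = [
    "Does the deck contain a dedicated slide for the method overview?",   # material-independent
    "Does the results slide report the exact figures from the source paper?",  # material-dependent
]

# Each item receives a binary verdict; the score is the fraction satisfied.
verdicts = [True, False]
score = 100 * sum(verdicts) / len(verdicts)
print(score)  # 50.0
```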

2. Authentic, Grounded Scenarios

A large portion of prior work focuses on isolated subtasks or reference-free settings without grounding the task in concrete background materials. This creates a mismatch between evaluation settings and real-world usage.

For each instance in PresentBench, we curate authoritative background materials, such as top-tier conference papers, university course textbooks, and financial reports, and require systems to generate slides grounded in these materials. This design ensures that every task reflects realistic, end-to-end slide generation scenarios based on authentic sources.


Comparison of coarse-grained, instance-agnostic (M)LLM-as-a-Judge evaluation frameworks and PresentBench.


Performance comparison of various slide generation systems on the PPTEval evaluation framework and PresentBench. PresentBench adopts a stricter scoring scheme and poses a greater challenge to slide generation systems.

📊 Experiment Results

🏆 Leaderboard

| Method | Total | Academia | Advertising | Education | Economics | Talk |
|---|---|---|---|---|---|---|
| NotebookLM | **62.5** | **68.6** | **54.9** | **55.0** | **58.2** | **69.2** |
| Manus 1.6 | *57.8* | *64.0* | *52.4* | 50.7 | *52.8* | *63.0* |
| Tiangong | 54.7 | 59.2 | 44.5 | *53.7* | 46.5 | 59.8 |
| Zhipu | 53.6 | 57.5 | 41.0 | 52.5 | 47.6 | 59.0 |
| PPTAgent v2 | 50.2 | 53.3 | 46.7 | 46.1 | 46.1 | 56.6 |
| Gamma | 49.2 | 54.4 | 46.7 | 47.8 | 35.1 | 56.3 |
| Doubao | 48.0 | 50.3 | 42.9 | 45.4 | 44.0 | 54.7 |
| Qwen | 35.9 | 39.4 | 31.9 | 36.6 | 26.5 | 38.6 |

Note: Comparative results across five domains. The highest score in each column is shown in bold, and the second-highest in italics. Evaluation results are mean aggregated scores over the 238 instances.

📐 Dimension-wise Results

| Method | Presentation Fundamentals | Visual Design and Layout | Content Completeness | Content Correctness | Content Fidelity |
|---|---|---|---|---|---|
| NotebookLM | **81.0** | **62.8** | *67.8* | **56.0** | 45.1 |
| Manus 1.6 | *80.1* | *53.7* | 63.6 | 46.2 | *45.4* |
| Gamma | 66.9 | 22.6 | 54.3 | *47.7* | **54.1** |
| Doubao | 71.8 | 40.7 | 58.2 | 34.7 | 34.8 |
| Tiangong | 77.7 | 47.2 | **68.8** | 45.7 | 34.3 |
| Zhipu | 73.3 | 40.6 | 63.0 | 47.1 | 44.1 |
| Qwen | 53.1 | 21.9 | 29.7 | 29.9 | 44.6 |
| PPTAgent v2 | 79.8 | 44.4 | 60.2 | 37.9 | 28.8 |

Note: Comparative results across five evaluation dimensions. The highest score in each column is shown in bold, and the second-highest in italics.



✨ Highlights

Slide Generation is Still Challenging

Even the best-performing system only reaches an overall score of 62.5, indicating that grounded, end-to-end slide authoring is far from solved.

The primary difficulty lies in long-context distillation: inputs average 22.2k tokens (approximately 34 pages), requiring models to read, select, synthesize, and organize information across many dispersed facts.

Open-Source Systems Lag Behind

The representative open-source framework, PPTAgent (50.2), significantly trails behind NotebookLM (62.5) and Manus (57.8).

This gap likely arises not only from differences in backbone models but also from proprietary end-to-end pipelines, including slide-specific long-context planning, grounding mechanisms, and advanced layout and rendering engines.

Visual Design is a Primary Bottleneck and Differentiator

While many systems achieve high Fundamentals scores (around 70–80), their design and layout scores are much lower, with most systems scoring in the 40s.

Even strong content generators (such as Manus) still lag in layout quality, suggesting that better visual design will require dedicated layout and rendering pipelines, not just stronger models.

Material Grounding Remains Challenging

Content Completeness is notably higher than Content Correctness, meaning systems can assemble the required content structure but frequently introduce factual errors.

Content Fidelity also remains challenging even for strong systems (e.g., NotebookLM 45.1, Manus 45.4), pointing to persistent ungrounded details and hallucinations.

⚙️ Construction & Evaluation Workflow

By decomposing slide evaluation into verifiable, instance-specific checklist items and aggregating decisions via principled scoring mechanisms, PresentBench provides reliable and interpretable signals.

Construction Stage 1 — Data Source Curation

The dataset comprises 238 high-quality evaluation instances, covering five major thematic categories: Academia, Education, Economics, Talk, and Advertising.

  • Experts manually inspect and filter all data sources to ensure correctness, relevance, and suitability.
  • The average input length is approximately 22.2k tokens, with an average of 34.0 pages of material, requiring slide generation systems to process long-context information effectively.

Construction Stage 2 — Instruction & Checklist

To rigorously evaluate slide deck generation, we craft a highly constrained, instance-specific instruction and a corresponding fine-grained checklist for each evaluation instance.

  • The instructions impose strict constraints on overall structure, faithfulness, presentation quality, visual layout, and audience tone.
  • The checklist is organized into two tiers: material-independent (assessing fundamental logic and visual layout) and material-dependent (verifying content completeness, accuracy, and fidelity).
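A sketch of how the two tiers and five dimensions could be organized; the field names and item texts here are our illustration, not the released schema:

```python
# Hypothetical checklist layout for one instance (illustrative only).
checklist = {
    "material_independent": {
        "Presentation Fundamentals": ["Does every slide have a clear title?"],
        "Visual Design and Layout": ["Is all text free of overflow beyond slide borders?"],
    },
    "material_dependent": {
        "Content Completeness": ["Is each major section of the source material covered?"],
        "Content Correctness": ["Do all cited figures match the source numerically?"],
        "Content Fidelity": ["Does the deck avoid claims absent from the material?"],
    },
}

# Five evaluation dimensions in total, matching the dimension-wise results.
n_dims = len(checklist["material_independent"]) + len(checklist["material_dependent"])
print(n_dims)  # 5
```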

The construction and evaluation workflow of PresentBench.

Evaluation — Generation & Judgment

During evaluation, each slide generation system produces a slide deck from the instance's instruction and its accompanying background materials.

  1. A judge model utilizes the structured checklist to conduct the evaluation.
  2. Each checklist item is verified independently and assigned a discrete verdict (e.g., satisfied or violated), along with localized evidence.
  3. An aggregated score is computed for each evaluation dimension, calculated as the average completion rate of all checklist items within that specific dimension.
  4. The final score for each evaluation instance is computed as the average of the five dimension scores.
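As a worked example of this aggregation, averaging NotebookLM's five dimension scores from the dimension-wise table is consistent with its leaderboard total:

```python
# NotebookLM's dimension scores (Fundamentals, Design, Completeness,
# Correctness, Fidelity) from the dimension-wise results table.
dimension_scores = [81.0, 62.8, 67.8, 56.0, 45.1]

# Final score = unweighted mean of the five dimension scores.
overall = sum(dimension_scores) / len(dimension_scores)
print(round(overall, 1))  # 62.5, matching the leaderboard total
```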

🗂️ Dataset Statistics

Distribution of PresentBench across domains and sources.

Distribution of the number of input tokens (left) and checklist items (right).
| Statistic | Number |
|---|---|
| Total evaluation instances | 238 |
| • Academia | 91 |
| • Advertising | 16 |
| • Economics | 41 |
| • Education | 60 |
| • Talk | 30 |
| • English | 219 |
| • Chinese | 19 |
| Average input tokens | 22.2 × 10³ |
| • Average instruction tokens | 2.3 × 10³ |
| • Average material tokens | 19.9 × 10³ |
| Average material pages | 34.0 |
| Average checklist items | 54.1 |
| • Avg. Presentation Fundamentals items | 13.0 |
| • Avg. Visual Design and Layout items | 17.0 |
| • Avg. Content Completeness items | 12.6 |
| • Avg. Content Correctness items | 11.5 |
| • Avg. Content Fidelity items | dynamic |

Key statistics of PresentBench. The number of Content Fidelity items varies across slide decks and is therefore reported as dynamic.

🔍 PresentBench Example

Background Materials

Instructions


Generated Slide Deck (by NotebookLM)

Evaluation Rubrics & Results


🧠 Evaluation Protocol

PROMPT_PREFIX = r"""You are an expert in evaluating talk or presentation slides.
An AI agent is tasked with creating a complete, comprehensive, and logically-structured slide deck suitable for a talk or presentation.
Your task is to **evaluate the slides generated by that AI agent based on the requirement provided below**. The AI-generated slides are provided to you as File 1, and the material that the AI agent relied on is provided to you in the subsequent files.
Please indicate whether the generated slides meet the specified requirement by answering "yes" or "no". If no, provide a clear explanation of why it does not meet the requirement. If possible, reference specific slides (e.g., Slide 3, Slide 5) in your explanation.
If the slides fall anywhere between fully meeting and fully failing the requirement (i.e., partially meet it), you MUST classify the answer as "no". Only slides that fully satisfy the requirement with no exceptions may receive "yes".
Your answer must include a `\boxed{...}`, where `...` is "yes" or "no". Aside from this requirement, there are no restrictions on the response format.

Below is the requirement.
---

"""

# Material-Dependent Evaluation Protocol
import re

def evaluate_checklist_item(judge_model, slides, background_material, criteria):
    prompt = PROMPT_PREFIX + criteria
    response = judge_model.generate(prompt, context=[slides, background_material])
    
    if match := re.search(r'\\boxed\{([^}]+)\}', response):
        answer = match.group(1).strip().lower()
        if answer == "yes":
            return 1.0
    
    return 0.0

# Dimension Aggregation
def compute_dimension_score(dimension_results):
    score = sum(dimension_results) / len(dimension_results)
    return score

# Overall Score
def compute_overall_score(dimension_scores):
    score = sum(dimension_scores) / len(dimension_scores) # len(dimension_scores) == 5
    return score

To reduce the judge model's cognitive burden and thereby improve evaluation reliability, we evaluate each checklist item in a separate call to the judge model.
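The strict verdict parsing above can be exercised end-to-end with stubbed judge responses (the stub replies below are illustrative, not real judge outputs):

```python
import re

def parse_verdict(response: str) -> float:
    """Extract the \\boxed{...} verdict; strict: anything but 'yes' scores 0."""
    if match := re.search(r'\\boxed\{([^}]+)\}', response):
        if match.group(1).strip().lower() == "yes":
            return 1.0
    return 0.0

# Stubbed judge outputs for three hypothetical checklist items.
responses = [
    r"The deck covers all required sections. \boxed{yes}",
    r"Slide 3 misstates a key figure. \boxed{no}",
    r"Malformed reply with no verdict at all.",  # missing box -> 0.0
]
scores = [parse_verdict(r) for r in responses]
print(scores)  # [1.0, 0.0, 0.0]
```

Note that a partially satisfied item, like a malformed or negative verdict, contributes 0.0, which is what makes the scoring scheme strict.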

📚 BibTeX

@misc{chen2026presentbenchfinegrainedrubricbasedbenchmark,
      title={PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation}, 
      author={Xin-Sheng Chen and Jiayu Zhu and Pei-lin Li and Hanzheng Wang and Shuojin Yang and Meng-Hao Guo},
      year={2026},
      eprint={2603.07244},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.07244}, 
}