Evaluating Automated Real-World Slide Generation with Fine-Grained, Instance-Specific Criteria.
Slides serve as a critical medium for conveying information in presentation-oriented scenarios such as academia, education, and business. Despite their importance, creating high-quality slide decks remains time-consuming and cognitively demanding. Recent advances in generative models, such as Nano Banana Pro, have made automated slide generation increasingly feasible. However, existing evaluations of slide generation are often coarse-grained and rely on holistic judgments, making it difficult to accurately assess model capabilities or track meaningful advances in the field. In practice, the lack of fine-grained, verifiable evaluation criteria poses a critical bottleneck for both research and real-world deployment.
In this paper, we propose PresentBench, a fine-grained, rubric-based benchmark for evaluating automated real-world slide generation. It contains 238 evaluation instances, each supplemented with background materials required for slide creation. Moreover, we manually design an average of 54.1 checklist items per instance, each formulated as a binary question, to enable fine-grained, instance-specific evaluation of the generated slide decks.
Extensive experiments show that PresentBench provides more reliable evaluation results than existing methods, and exhibits significantly stronger alignment with human preferences. Furthermore, our benchmark reveals that NotebookLM significantly outperforms other slide generation methods, highlighting substantial recent progress in this domain.
Existing evaluation frameworks often adopt instance-agnostic scoring schemes, typically relying on a judging paradigm that poses the same set of general questions to all slide decks. Such evaluations fail to account for instance-specific content, making it difficult to assess whether a slide generation system truly follows the intended input.
PresentBench establishes fine-grained checklist items tailored to each slide deck instance. On average, each instance is associated with more than 50 specifically designed atomic evaluation items, converting vague qualitative grading into verifiable binary checks.
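The idea of converting qualitative grading into verifiable binary checks can be sketched as follows. This is a minimal illustration, not the benchmark's released code: the `ChecklistItem` class, the example questions, and the simple pass-rate aggregation are all assumptions (the paper describes "principled scoring mechanisms" without specifying them here).

```python
from dataclasses import dataclass

@dataclass
class ChecklistItem:
    question: str   # instance-specific, binary-answerable check
    passed: bool    # judge's verdict for a given generated deck

def score_instance(items):
    """Aggregate binary checks into a 0-100 instance score (simple pass rate,
    an assumed aggregation for illustration)."""
    if not items:
        raise ValueError("instance has no checklist items")
    return 100.0 * sum(i.passed for i in items) / len(items)

# Hypothetical instance: 3 of 4 checks pass -> 75.0
items = [
    ChecklistItem("Does the title slide name the source paper?", True),
    ChecklistItem("Is the ablation table from the paper included?", False),
    ChecklistItem("Are all figures drawn from the source material?", True),
    ChecklistItem("Does the deck end with a takeaway slide?", True),
]
print(score_instance(items))  # -> 75.0
```

In a real instance the checklist averages 54.1 items, so each deck's score rests on dozens of independent verdicts rather than one holistic grade.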
A large portion of prior work focuses on isolated subtasks or reference-free settings without grounding the task in concrete background materials. This creates a mismatch between evaluation settings and real-world usage.
For each instance in PresentBench, we curate authoritative background materials, such as top-tier conference papers, university course textbooks, and financial reports, and require systems to generate slides grounded in these materials. This design ensures that every task reflects realistic, end-to-end slide generation scenarios based on authentic sources.
Comparison of coarse-grained, instance-agnostic (M)LLM-as-a-Judge evaluation frameworks and PresentBench.
Performance comparison of various slide generation systems on the PPTEval evaluation framework and PresentBench. PresentBench adopts a stricter scoring scheme and poses a greater challenge to slide generation systems.
| Method | Total | Academia | Advertising | Education | Economics | Talk |
|---|---|---|---|---|---|---|
| NotebookLM | **62.5** | **68.6** | **54.9** | **55.0** | **58.2** | **69.2** |
| Manus 1.6 | *57.8* | *64.0* | *52.4* | 50.7 | *52.8* | *63.0* |
| Tiangong | 54.7 | 59.2 | 44.5 | *53.7* | 46.5 | 59.8 |
| Zhipu | 53.6 | 57.5 | 41.0 | 52.5 | 47.6 | 59.0 |
| PPTAgent v2 | 50.2 | 53.3 | 46.7 | 46.1 | 46.1 | 56.6 |
| Gamma | 49.2 | 54.4 | 46.7 | 47.8 | 35.1 | 56.3 |
| Doubao | 48.0 | 50.3 | 42.9 | 45.4 | 44.0 | 54.7 |
| Qwen | 35.9 | 39.4 | 31.9 | 36.6 | 26.5 | 38.6 |
Note: Comparative results across five domains. The highest score in each column is shown in **bold** and the second-highest in *italics*. Evaluation results are mean scores aggregated over 238 instances.
| Method | Presentation Fundamentals | Visual Design and Layout | Content Completeness | Content Correctness | Content Fidelity |
|---|---|---|---|---|---|
| NotebookLM | **81.0** | **62.8** | *67.8* | **56.0** | 45.1 |
| Manus 1.6 | *80.1* | *53.7* | 63.6 | 46.2 | *45.4* |
| Gamma | 66.9 | 22.6 | 54.3 | *47.7* | **54.1** |
| Doubao | 71.8 | 40.7 | 58.2 | 34.7 | 34.8 |
| Tiangong | 77.7 | 47.2 | **68.8** | 45.7 | 34.3 |
| Zhipu | 73.3 | 40.6 | 63.0 | 47.1 | 44.1 |
| Qwen | 53.1 | 21.9 | 29.7 | 29.9 | 44.6 |
| PPTAgent v2 | 79.8 | 44.4 | 60.2 | 37.9 | 28.8 |
Note: Comparative results across five evaluation dimensions. The highest score in each column is shown in **bold** and the second-highest in *italics*.
Even the best-performing system only reaches an overall score of 62.5, indicating that grounded, end-to-end slide authoring is far from solved.
The primary difficulty lies in long-context distillation: inputs average 22.2k tokens (approximately 34 pages), requiring models to read, select, synthesize, and organize information across many dispersed facts.
The representative open-source framework, PPTAgent (50.2), significantly trails behind NotebookLM (62.5) and Manus (57.8).
This gap likely arises not only from differences in backbone models but also from proprietary end-to-end pipelines, including slide-specific long-context planning, grounding mechanisms, and advanced layout and rendering engines.
While many systems achieve high Fundamentals scores (around 70–80), their design and layout scores are much lower, with most systems scoring in the 40s.
Even strong content generators (such as Manus) still lag in layout quality, suggesting that better visual design will require dedicated layout and rendering pipelines, not just stronger models.
Content Completeness scores are notably higher than Content Correctness scores, indicating that systems often assemble the expected structure yet frequently introduce factual errors.
Content Fidelity also remains challenging even for strong systems (e.g., NotebookLM 45.1, Manus 45.4), pointing to persistent ungrounded details and hallucinations.
By decomposing slide evaluation into verifiable, instance-specific checklist items and aggregating decisions via principled scoring mechanisms, PresentBench provides reliable and interpretable signals.
The dataset comprises 238 high-quality evaluation instances, covering five major thematic categories: Academia, Education, Economics, Talk, and Advertising.
To rigorously evaluate slide deck generation, we craft a highly constrained, instance-specific instruction and a corresponding fine-grained checklist for each evaluation instance.
The construction and evaluation workflow of PresentBench.
During evaluation, each slide generation system produces a slide deck from the instance's instruction and its accompanying background materials.
To reduce the judge model's cognitive burden and thereby improve evaluation reliability, we evaluate each checklist item in a separate call to the judge model.
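The per-item judging loop described above can be sketched roughly as follows. Everything here is hypothetical: the prompt wording, the function names, and the stub judge (a keyword check standing in for a real model API call) are illustrative assumptions, not the benchmark's actual implementation.

```python
def judge_item(slide_deck_text, question, call_judge):
    """Ask the judge model ONE binary question per call, so each verdict
    is reached without interference from the other checklist items."""
    prompt = (
        "You are grading a slide deck against a single criterion.\n"
        f"Slide deck:\n{slide_deck_text}\n\n"
        f"Question (answer strictly YES or NO): {question}"
    )
    answer = call_judge(prompt)  # one separate judge call per checklist item
    return answer.strip().upper().startswith("YES")

def evaluate_checklist(slide_deck_text, questions, call_judge):
    """Run every checklist item in its own call; return the list of verdicts."""
    return [judge_item(slide_deck_text, q, call_judge) for q in questions]

# Stub judge for illustration: answers YES iff the question mentions
# "instance count" (a real system would call a judge-model API here).
def stub_judge(prompt):
    return "YES" if "instance count" in prompt else "NO"

deck = "Slide 1: PresentBench overview. Slide 2: 238 instances."
verdicts = evaluate_checklist(
    deck,
    ["Does the deck state the instance count?",
     "Does the deck show a results table?"],
    stub_judge,
)
print(verdicts)  # -> [True, False]
```

Isolating items this way trades extra API calls for reliability: with 54.1 items per instance on average, each deck costs dozens of judge calls, but no single call has to track more than one criterion.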
@misc{chen2026presentbenchfinegrainedrubricbasedbenchmark,
title={PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation},
author={Xin-Sheng Chen and Jiayu Zhu and Pei-lin Li and Hanzheng Wang and Shuojin Yang and Meng-Hao Guo},
year={2026},
eprint={2603.07244},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.07244},
}