CHI 2026 • Barcelona, Spain • April 13–17
AnnotateGPT

Designing Human–AI Collaboration in
Pen-Based Document Annotation

Benedict Leung • Mariana Shimabukuro • Christopher Collins

Ontario Tech University, Canada

Figure 1. A high-level overview of the interaction design of AnnotateGPT. (a) The user manually annotates the document. (b) Tapping on an annotation will activate the assistant. (c) The assistant will guess the purpose of the annotation. (d) Selecting a purpose will prompt the assistant to provide further annotations (yellow highlights) based on the selected purpose. (e) Users can read, verify and continue the feedback.

Abstract

Providing high-quality feedback on writing is cognitively demanding, requiring reviewers to identify issues, suggest fixes, and ensure consistency. We introduce AnnotateGPT, a system that uses pen-based annotations as an input modality for AI agents to assist with essay feedback. AnnotateGPT enhances feedback by interpreting handwritten annotations and extending them throughout the document.

One AI agent classifies the purpose of each annotation, which is confirmed or corrected by the user. A second AI agent uses the confirmed purpose to generate contextually relevant feedback for other parts of the essay. In a study with 12 novice teachers annotating essays, we compared AnnotateGPT with a baseline pen-based tool without AI support.

Our findings demonstrate how reviewers used annotations to regulate AI feedback generation, refine AI suggestions, and incorporate AI-generated feedback into their review process. We highlight design implications for AI-augmented feedback systems, including balanced human-AI collaboration and using pen annotations as subtle interaction.

annotation • digital pen • LLM • feedback • human–AI collaboration

Why AnnotateGPT?

Existing annotation tools fall short of supporting real feedback workflows. Teachers value handwritten feedback for its personal tone, but face persistent challenges, and students struggle to benefit from current practices.

Limitations of Prior Annotation Tools

Static Marks Only

Digital tools like Adobe Acrobat reduce annotations to highlights and sticky notes, treating them as final outputs rather than interactive, actionable inputs.

Sentence-Level Corrections

AI tools like Grammarly provide grammar and style suggestions, but prioritize correctness over capturing a reviewer’s intent or integrating with annotation workflows.

Domain-Specific & Rigid

Prior pen-based systems like XLibris [Golovchinsky et al. 1999] and Metatation [Mehta et al. 2017] rely on heuristic pattern matching and cannot infer annotation purpose beyond predefined, narrow contexts.

Challenges in Providing Feedback to Students

Poor Legibility

Physical and temporal constraints degrade the legibility of handwritten annotations, leaving students struggling to interpret feedback that is hard to read and lacks clarity.

Time Pressure

Time constraints and the need to return work promptly prevent teachers from providing detailed, thoughtful feedback, reducing the quality of feedback students receive.

Inconsistent Quality

Feedback tends to fixate on surface-level errors rather than providing balanced guidance, leaving students without constructive direction on organization, coherence, or argumentation.

Our Approach

Where prior tools stop at static marks or sentence-level corrections, AnnotateGPT treats handwritten marks as clues for AI collaboration. It interprets clusters of pen strokes, infers their likely purpose, and generates contextually relevant feedback across the document. By grounding AI generation in teacher-provided annotations, AnnotateGPT preserves educators’ voices and styles, addressing the limitations of existing tools while providing students with feedback that is legible, timely, and constructive.

Key Contributions

01

AnnotateGPT System

A pen-based system that integrates LLMs into document annotation through purpose inference and feedback propagation. Pen strokes are classified, clustered, and interpreted by two AI agents to generate actionable feedback.

02

Empirical Insights

A study with 12 novice teachers revealing how reviewers manage, appropriate, and negotiate AI-augmented feedback, including diverse annotation workflows and shifting annotation behaviours.

03

Design Implications

Design implications for AI-augmented feedback systems, framing pen annotations not only as feedback but as a broader interaction design paradigm for human-AI communication.

See It in Action

Watch how AnnotateGPT transforms pen-based annotations into AI-augmented feedback.

How AnnotateGPT Works

A six-step pipeline from pen strokes to AI-generated feedback.

a

Annotate

The user annotates the document with a digital pen: highlights, circles, underlines, or handwritten notes.

b

Cluster Strokes

Pen strokes are grouped using hierarchical agglomerative clustering based on spatiotemporal distance (see the sketch after Figure 2).

c

Activate Assistant

Tapping a cluster reveals a hidden assistant marker, the entry point for AI support.

d

Infer Purpose

The assistant captures two images of the cluster (with and without the underlying text), and the LLM proposes four possible annotation purposes, personalized via RAG.

e

Select Purpose

The user selects a purpose (or types their own). AnnotateGPT remembers choices for future inferences.

f

Generate & Verify

AI generates context-specific feedback across the document. Users accept, reject, or mark as helpful.

Figure 2. An overview of AnnotateGPT’s framework: (a) The user first annotates the document. (b) AnnotateGPT then clusters the pen strokes based on spatiotemporal distance, representing an annotation. (c) The user taps on the cluster/annotation to activate and open the assistant. (d) The assistant captures two images from the cluster, one with the underlying text and one without, and makes four guesses about the annotation’s purpose. (e) The user then selects a purpose, which AnnotateGPT will remember for future inferences. (f) Finally, AnnotateGPT generates annotations based on the selected purpose and (g) highlights them on the document.
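
To make the clustering step (b) concrete, here is a minimal TypeScript sketch, assuming each stroke is summarized by a centroid position and a start timestamp. Single-linkage merging, the space/time weighting, and the merge threshold are illustrative assumptions, not AnnotateGPT's actual parameters.

// Sketch of step (b): grouping pen strokes into annotations with
// single-linkage agglomerative clustering over a spatiotemporal distance.

interface Stroke {
  x: number; // centroid x (px)
  y: number; // centroid y (px)
  t: number; // timestamp of the stroke's first point (ms)
}

// Spatiotemporal distance: Euclidean distance in space plus a weighted time
// gap, so strokes drawn close together in space and time merge first.
function distance(a: Stroke, b: Stroke, timeWeight = 0.05): number {
  const spatial = Math.hypot(a.x - b.x, a.y - b.y);
  const temporal = Math.abs(a.t - b.t) * timeWeight;
  return spatial + temporal;
}

// Start with one cluster per stroke and repeatedly merge the closest pair
// until no pair of clusters is closer than `threshold`. Each surviving
// cluster is treated as one annotation.
function clusterStrokes(strokes: Stroke[], threshold = 60): Stroke[][] {
  const clusters: Stroke[][] = strokes.map((s) => [s]);

  while (clusters.length > 1) {
    let best = { i: -1, j: -1, d: Infinity };
    for (let i = 0; i < clusters.length; i++) {
      for (let j = i + 1; j < clusters.length; j++) {
        // Single linkage: distance between the two closest member strokes.
        for (const a of clusters[i]) {
          for (const b of clusters[j]) {
            const d = distance(a, b);
            if (d < best.d) best = { i, j, d };
          }
        }
      }
    }
    if (best.d > threshold) break; // no pair is close enough to merge
    clusters[best.i] = clusters[best.i].concat(clusters[best.j]);
    clusters.splice(best.j, 1);
  }
  return clusters;
}

// Usage: two nearby strokes form one annotation; the distant third stroke
// stays in its own cluster.
const annotations = clusterStrokes([
  { x: 100, y: 200, t: 0 },
  { x: 120, y: 205, t: 400 },
  { x: 540, y: 620, t: 9000 },
]); // -> 2 clusters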

User Interface Design

Built as a Next.js web application with GPT-4o integration.
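
For readers curious how the GPT-4o purpose-inference call (step d) might be wired into a Next.js route handler, here is a hedged sketch using the OpenAI Node SDK. The route path, prompt wording, and response shape are assumptions, and the paper's RAG-based personalization is approximated here by passing previously confirmed purposes as plain-text context.

// app/api/infer-purpose/route.ts — an illustrative sketch, not AnnotateGPT's
// actual code; names, prompt, and JSON shape are assumptions.
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function POST(req: Request) {
  // Two cropped screenshots of the tapped cluster (one with the underlying
  // essay text, one with strokes only) plus the user's previously confirmed
  // purposes, used here as lightweight personalization context.
  const { imageWithText, imageStrokesOnly, priorPurposes = [] } = await req.json();

  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "You infer why a reviewer made a handwritten annotation on an essay. " +
          'Return JSON of the form {"purposes": [four short candidate purposes]}.',
      },
      {
        role: "user",
        content: [
          { type: "text", text: `Previously confirmed purposes: ${priorPurposes.join(", ")}` },
          { type: "image_url", image_url: { url: imageWithText } },
          { type: "image_url", image_url: { url: imageStrokesOnly } },
        ],
      },
    ],
  });

  // The client shows these four guesses; the user confirms one or types their own.
  return Response.json(JSON.parse(completion.choices[0].message.content ?? "{}"));
}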

Study Results

A within-subjects study with 12 pre-service teachers comparing AnnotateGPT to a baseline digital annotation tool.

Participants: 12 pre-service teachers

Annotating window: 25 min annotating + 5 min to finalize ratings

Purpose inference accuracy: 41 of 65 overall (63%) • Explicit: 17/17 • Telegraphic: 24/48

Other measures: annotation ratings (accepted, rejected, and marked helpful; average per user), strokes per annotation (baseline vs. AnnotateGPT), and annotations per question (baseline vs. AnnotateGPT).

Types of Annotations Observed

Annotations fall along two dimensions: Form (telegraphic vs. explicit) describes whether marks are personal shorthand or clear textual feedback. Purpose (micro vs. macro) captures whether the annotation targets fine-grained features or broader structural aspects.

Form Telegraphic ↔ Explicit

Telegraphic

Personal opaque codings such as quick marks, highlights, circles, or crossing-out without written explanation.

Baseline: 93 • User: 73 • Inferred: 48
Explicit

Clear and explicit meaning, usually handwritten textual comments, corrections, and detailed feedback.

Baseline: 380 • User: 86 • Inferred: 17

Purpose Micro ↔ Macro

Grammar

336 total
  • Tense: verb form consistency
  • Preposition: phrase usage
  • Punctuation: commas, periods
  • Capitalization: case rules
Baseline: 110 • User: 31 • Inferred: 13 • Generated: 182

Vocabulary

274 total
  • Word Choice: context-fitting words
  • Spelling: correct spelling
  • Collocation: natural word pairings
Baseline: 93 • User: 33 • Inferred: 9 • Generated: 139

Sentence Structure

538 total
  • Clarity: unclear phrasing
  • Run-ons: improperly joined
  • Fragments: incomplete sentences
Baseline: 157 • User: 59 • Inferred: 17 • Generated: 305

Organization & Coherence

552 total
  • Logical Flow: disjointed ideas
  • Paragraphing: grouping ideas
Baseline: 94 • User: 26 • Inferred: 25 • Generated: 407

Task Achievement

46 total
  • Completeness: addresses the prompt
  • Encouragement: constructive tone
Baseline: 19 • User: 10 • Inferred: 1 • Generated: 16

Observed Workflows

Participants developed distinct strategies for creating annotations and interacting with AI-generated feedback, revealing how users naturally negotiate agency with the assistant.

Annotation Workflows

How participants created and submitted their annotations

Annotate → Interpret
10 / 12

Make one annotation, then have the assistant interpret it and wait for the result.

Annotate K → Interpret K
9 / 12

Make K annotations, then have the assistant interpret them concurrently in a batch.

Annotate Follow-up
5 / 12

Annotate to fill in gaps where AnnotateGPT’s highlights didn’t cover.

Interaction Workflows

How participants engaged with AI-generated annotations

Generate → Verify
6 / 12

Generate annotations from one annotation, then verify them afterwards.

Generate K → Verify
9 / 12

Generate annotations from K annotations and verify all at once in bulk.

Verify → Comment
7 / 12

Verify the annotation and continue to comment, extending the AI feedback.

Figure 8. Example of how P6 filled in gaps left by the automated annotations, adding manual annotations around the AI-generated highlights.

Shifted Annotation Behaviour

With AnnotateGPT, participants shifted to telegraphic annotations (quick marks instead of full comments), relying on the AI to generate detailed feedback.

“I could let [AnnotateGPT] come up with the comments.” — P4
“It could scan the whole document.” — P9
“[It could] look at all for me, so I could focus on structure.” — P10

Broader Coverage

AnnotateGPT surfaced issues participants would have missed. AI-generated feedback addressed organization & coherence more often than in the baseline condition.

“It could look at a bigger scope than me.” — P1
“It caught other points that I may have missed.” — P12

Higher-Quality Feedback

Baseline feedback only stated problems (e.g., “awk”), while AnnotateGPT explicitly stated reasons and suggested fixes.

“It gave better feedback than I would write.” — P4
“It would help me come up with more ideas to edit English work.” — P9

Over-Reliance & Agency

Some participants began delegating interpretive work to the AI, reducing their own engagement; annotation density and duration dropped after the first question. Batched workflows, and marking manually before turning to the AI, helped preserve ownership.

“The AI would write for me in a way.” — P5
“Go through it and mark it first, and then use the assistance as the secondary tool.” — P5
“I felt more responsible when annotating alone.” — P1

What Teachers Said

Novice teachers highlighted how AnnotateGPT addressed real challenges in providing feedback to students:

“I graded more in total using [AnnotateGPT] and was able to find things a lot faster.” — P3
“It gave better feedback than I would write” and “knows the correct terms to use.” — P4 & P11
“This would be helpful… especially for beginner writers.” — P12

Baseline vs. AnnotateGPT

Figure 9. Screenshots of the first page for P9. The left side is with the baseline, and the right side is with AnnotateGPT. It demonstrates different annotation approaches, with the baseline annotations focusing on identifying issues such as unclear phrasing and negative tone using textual feedback, while in the AnnotateGPT condition, P9 only used highlights with no textual feedback.

Design Implications

I

Balancing Agency & Over-Reliance

Batched workflows promote deliberate engagement over continuous delegation. Marking manually first and then using AI fosters ownership. Systems could track behavioural metrics (stroke density, duration) to nudge users toward more explicit annotations before invoking AI.
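
As one hypothetical way to implement such a nudge, the TypeScript sketch below watches recent annotation behaviour and flags when marks become sparse and quick while AI use rises; the window size and thresholds are illustrative assumptions, not measures from the study.

// Sketch of a behavioural-metric nudge; thresholds are assumptions.
interface AnnotationEvent {
  strokeCount: number; // strokes in this annotation
  durationMs: number;  // time spent drawing it
  usedAI: boolean;     // whether the assistant was invoked on it
}

// If the last few annotations are mostly single-stroke marks, drawn quickly,
// and nearly always handed to the assistant, suggest writing a more explicit
// annotation before invoking AI again.
function shouldNudge(recent: AnnotationEvent[], windowSize = 5): boolean {
  const last = recent.slice(-windowSize);
  if (last.length < windowSize) return false;

  const avgStrokes = last.reduce((sum, e) => sum + e.strokeCount, 0) / last.length;
  const avgDuration = last.reduce((sum, e) => sum + e.durationMs, 0) / last.length;
  const aiRate = last.filter((e) => e.usedAI).length / last.length;

  return avgStrokes < 2 && avgDuration < 2000 && aiRate > 0.8;
}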

II

LLMs as Cognitive Augmentation

AnnotateGPT externalizes phrasing and formatting work, allowing educators to focus cognitive effort on identifying issues rather than constructing responses.

III

Supporting Equity in Education

AI-augmented feedback helps novice teachers deliver timely, accurate, and curriculum-aligned feedback, even at scale, levelling the playing field.

IV

Annotation as Interaction Paradigm

Annotations transform from private marks into expressive inputs that shape AI system behaviour, creating a shared language between humans and AI.

Beyond Education

Annotations as a universal interaction paradigm for AI systems.

Editing Generative Content

Annotations as spatial constraints for refining AI-generated images, defining action flows for video generation, and creating & editing user interfaces, offering a low-friction alternative to repeated prompting.

Annotating the World

On-screen annotations on camera feeds enable users to direct AI attention precisely, like writing “Pie?” next to apples to ask which are best for baking. Discreet, precise, contextual.

Figure 15. Sample use cases of using annotations: Left: editing and refining generative content, including (a) refining image generation, (b) making action flows for video generation, and (c) creating and editing user interfaces. Right: interacting with a phone using on-screen annotations to form complex queries, such as (a) which apples are best for apple pies, (b) translation, and (c) saving an item for later reference. All sample queries are sourced from ChatGPT and paired with the corresponding image.

Authors

Benedict Leung

Researcher, MSc

Ontario Tech University

Mariana Shimabukuro

Researcher, Associate Professor

Ontario Tech University

Christopher Collins

Professor

Ontario Tech University

Cite Our Work

BibTeX:
@inproceedings{leung2026annotategpt,
  title     = {AnnotateGPT: Designing Human--AI Collaboration 
               in Pen-Based Document Annotation},
  author    = {Leung, Benedict and Shimabukuro, Mariana 
               and Collins, Christopher},
  booktitle = {Proceedings of the 2026 CHI Conference on 
               Human Factors in Computing Systems (CHI '26)},
  year      = {2026},
  publisher = {ACM},
  address   = {New York, NY, USA},
  location  = {Barcelona, Spain},
  doi       = {10.1145/3772318.3790867},
  pages     = {26}
}