CHI 2026 • Barcelona, Spain • April 13–17
AnnotateGPT

Designing Human–AI Collaboration in
Pen-Based Document Annotation

Benedict Leung • Mariana Shimabukuro • Christopher Collins

Ontario Tech University, Canada

Figure 1. A high-level overview of the interaction design of AnnotateGPT. (a) The user manually annotates the document. (b) Tapping on an annotation will activate the assistant. (c) The assistant will guess the purpose of the annotation. (d) Selecting a purpose will prompt the assistant to provide further annotations (yellow highlights) based on the selected purpose. (e) Users can read, verify and continue the feedback.

Abstract

Providing high-quality feedback on writing is cognitively demanding, requiring reviewers to identify issues, suggest fixes, and ensure consistency. We introduce AnnotateGPT, a system that uses pen-based annotations as an input modality for AI agents to assist with essay feedback. AnnotateGPT enhances feedback by interpreting handwritten annotations and extending them throughout the document.

One AI agent classifies the purpose of each annotation, which is confirmed or corrected by the user. A second AI agent uses the confirmed purpose to generate contextually relevant feedback for other parts of the essay. In a study with 12 novice teachers annotating essays, we compared AnnotateGPT with a baseline pen-based tool without AI support.

Our findings demonstrate how reviewers used annotations to regulate AI feedback generation, refine AI suggestions, and incorporate AI-generated feedback into their review process. We highlight design implications for AI-augmented feedback systems, including balanced human-AI collaboration and using pen annotations as subtle interaction.

annotation • digital pen • LLM • feedback • human–AI collaboration

Why AnnotateGPT?

Existing annotation tools fall short of supporting real feedback workflows. Teachers value handwritten feedback for its personal tone, but face persistent challenges, and students struggle to benefit from current practices.

Limitations of Prior Annotation Tools

Static Marks Only

Digital tools like Adobe Acrobat reduce annotations to highlights and sticky notes, treating them as final outputs rather than interactive, actionable inputs.

Sentence-Level Corrections

AI tools like Grammarly provide grammar and style suggestions, but prioritize correctness over capturing a reviewer’s intent or integrating with annotation workflows.

Domain-Specific & Rigid

Prior pen-based systems like XLibris [Golovchinsky et al. 1999] and Metatation [Mehta et al. 2017] rely on heuristic pattern matching and cannot infer annotation purpose beyond predefined, narrow contexts.

Challenges in Providing Feedback to Students

Poor Legibility

Physical and temporal constraints degrade the legibility of handwritten annotations, leaving students struggling to interpret feedback that is hard to read and lacks clarity.

Time Pressure

Time constraints and the need to return work promptly prevent teachers from providing detailed, thoughtful feedback, reducing the quality of feedback students receive.

Inconsistent Quality

Feedback tends to fixate on surface-level errors rather than providing balanced guidance, leaving students without constructive direction on organization, coherence, or argumentation.

Our Approach

Where prior tools stop at static marks or sentence-level corrections, AnnotateGPT treats handwritten marks as clues for AI collaboration. It interprets clusters of pen strokes, infers their likely purpose, and generates contextually relevant feedback across the document. By grounding AI generation in teacher-provided annotations, AnnotateGPT preserves educators’ voices and styles, addressing the limitations of existing tools while providing students with feedback that is legible, timely, and constructive.

Key Contributions

01

AnnotateGPT System

A pen-based system that integrates LLMs into document annotation through purpose inference and feedback propagation. Pen strokes are classified, clustered, and interpreted by two AI agents to generate actionable feedback.

02

Empirical Insights

A study with 12 novice teachers revealing how reviewers manage, appropriate, and negotiate AI-augmented feedback, including diverse annotation workflows and shifting annotation behaviours.

03

Design Implications

Design implications for AI-augmented feedback systems, framing pen annotations not only as feedback but as a broader interaction design paradigm for human-AI communication.

See It in Action

Watch how AnnotateGPT transforms pen-based annotations into AI-augmented feedback.

How AnnotateGPT Works

A six-step pipeline from pen strokes to AI-generated feedback.

a

Annotate

The user annotates the document with a digital pen: highlights, circles, underlines, or handwritten notes.

b

Cluster Strokes

Pen strokes are grouped using hierarchical agglomerative clustering based on spatiotemporal distance (see the sketch after Figure 2).

c

Activate Assistant

Tapping a cluster reveals a hidden assistant marker, the entry point for AI support.

d

Infer Purpose

The assistant captures two images of the cluster (with and without the underlying text), and the LLM proposes four possible annotation purposes, personalized via RAG.

e

Select Purpose

The user selects a purpose (or types their own). AnnotateGPT remembers choices for future inferences.

f

Generate & Verify

AI generates context-specific feedback across the document. Users accept, reject, or mark as helpful.

Figure 2. An overview of AnnotateGPT’s framework: (a) The user first annotates the document. (b) AnnotateGPT then clusters the pen strokes based on spatiotemporal distance, representing an annotation. (c) The user taps on the cluster/annotation to activate and open the assistant. (d) The assistant captures two images from the cluster, one with the underlying text and one without, and makes four guesses about the annotation’s purpose. (e) The user then selects a purpose, which AnnotateGPT will remember for future inferences. (f) Finally, AnnotateGPT generates annotations based on the selected purpose and (g) highlights them on the document.
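
To make the clustering step (b) concrete, here is a minimal TypeScript sketch, assuming each stroke is summarized by a centroid position and a start timestamp. Single-linkage merging, the space/time weighting, and the merge threshold are illustrative assumptions, not AnnotateGPT's actual parameters.

// Sketch of step (b): grouping pen strokes into annotations with
// single-linkage agglomerative clustering over a spatiotemporal distance.

interface Stroke {
  x: number; // centroid x (px)
  y: number; // centroid y (px)
  t: number; // timestamp of the stroke's first point (ms)
}

// Spatiotemporal distance: Euclidean distance in space plus a weighted time
// gap, so strokes drawn close together in space and time merge first.
function distance(a: Stroke, b: Stroke, timeWeight = 0.05): number {
  const spatial = Math.hypot(a.x - b.x, a.y - b.y);
  const temporal = Math.abs(a.t - b.t) * timeWeight;
  return spatial + temporal;
}

// Start with one cluster per stroke and repeatedly merge the closest pair
// until no pair of clusters is closer than `threshold`. Each surviving
// cluster is treated as one annotation.
function clusterStrokes(strokes: Stroke[], threshold = 60): Stroke[][] {
  const clusters: Stroke[][] = strokes.map((s) => [s]);

  while (clusters.length > 1) {
    let best = { i: -1, j: -1, d: Infinity };
    for (let i = 0; i < clusters.length; i++) {
      for (let j = i + 1; j < clusters.length; j++) {
        // Single linkage: distance between the two closest member strokes.
        for (const a of clusters[i]) {
          for (const b of clusters[j]) {
            const d = distance(a, b);
            if (d < best.d) best = { i, j, d };
          }
        }
      }
    }
    if (best.d > threshold) break; // no pair is close enough to merge
    clusters[best.i] = clusters[best.i].concat(clusters[best.j]);
    clusters.splice(best.j, 1);
  }
  return clusters;
}

// Usage: two nearby strokes form one annotation; the distant third stroke
// stays in its own cluster.
const annotations = clusterStrokes([
  { x: 100, y: 200, t: 0 },
  { x: 120, y: 205, t: 400 },
  { x: 540, y: 620, t: 9000 },
]); // -> 2 clusters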

User Interface Design

Built as a Next.js web application with GPT-4o integration.
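
For readers curious how the GPT-4o purpose-inference call (step d) might be wired into a Next.js route handler, here is a hedged sketch using the OpenAI Node SDK. The route path, prompt wording, and response shape are assumptions, and the paper's RAG-based personalization is approximated here by passing previously confirmed purposes as plain-text context.

// app/api/infer-purpose/route.ts — an illustrative sketch, not AnnotateGPT's
// actual code; names, prompt, and JSON shape are assumptions.
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function POST(req: Request) {
  // Two cropped screenshots of the tapped cluster (one with the underlying
  // essay text, one with strokes only) plus the user's previously confirmed
  // purposes, used here as lightweight personalization context.
  const { imageWithText, imageStrokesOnly, priorPurposes = [] } = await req.json();

  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "You infer why a reviewer made a handwritten annotation on an essay. " +
          'Return JSON of the form {"purposes": [four short candidate purposes]}.',
      },
      {
        role: "user",
        content: [
          { type: "text", text: `Previously confirmed purposes: ${priorPurposes.join(", ")}` },
          { type: "image_url", image_url: { url: imageWithText } },
          { type: "image_url", image_url: { url: imageStrokesOnly } },
        ],
      },
    ],
  });

  // The client shows these four guesses; the user confirms one or types their own.
  return Response.json(JSON.parse(completion.choices[0].message.content ?? "{}"));
}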

Study Results

A within-subjects study with 12 pre-service teachers comparing AnnotateGPT to a baseline digital annotation tool.

Participants: 12 pre-service teachers

Annotating window: 25 min annotating + 5 min to finalize ratings

Purpose inference accuracy: 41 of 65 overall (63%) • Explicit: 17/17 • Telegraphic: 24/48

Other measures: annotation ratings (accepted, rejected, and marked helpful; average per user), strokes per annotation (baseline vs. AnnotateGPT), and annotations per question (baseline vs. AnnotateGPT).

Types of Annotations Observed

Annotations fall along two dimensions: Form (telegraphic vs. explicit) describes whether marks are personal shorthand or clear textual feedback. Purpose (micro vs. macro) captures whether the annotation targets fine-grained features or broader structural aspects.

Form Telegraphic ↔ Explicit

Telegraphic

Personal opaque codings such as quick marks, highlights, circles, or crossing-out without written explanation.

Baseline: 93 • User: 73 • Inferred: 48
Explicit

Clear and explicit meaning, usually handwritten textual comments, corrections, and detailed feedback.

Baseline: 380 • User: 86 • Inferred: 17

Purpose Micro ↔ Macro

Grammar

336 total
  • Tense: verb form consistency
  • Preposition: phrase usage
  • Punctuation: commas, periods
  • Capitalization: case rules
Baseline: 110 • User: 31 • Inferred: 13 • Generated: 182

Vocabulary

274 total
  • Word Choice: context-fitting words
  • Spelling: correct spelling
  • Collocation: natural word pairings
Baseline: 93 • User: 33 • Inferred: 9 • Generated: 139

Sentence Structure

538 total
  • Clarity: unclear phrasing
  • Run-ons: improperly joined
  • Fragments: incomplete sentences
Baseline: 157 • User: 59 • Inferred: 17 • Generated: 305

Organization & Coherence

552 total
  • Logical Flow: disjointed ideas
  • Paragraphing: grouping ideas
Baseline: 94 • User: 26 • Inferred: 25 • Generated: 407

Task Achievement

46 total
  • Completeness: addresses the prompt
  • Encouragement: constructive tone
Baseline: 19 • User: 10 • Inferred: 1 • Generated: 16

Observed Workflows

Participants developed distinct strategies for creating annotations and interacting with AI-generated feedback, revealing how users naturally negotiate agency with the assistant.

Annotation Workflows

How participants created and submitted their annotations

Annotate → Interpret
10 / 12

Make one annotation, then have the assistant interpret it and wait for the result.

Annotate K → Interpret K
9 / 12

Make K annotations, then have the assistant interpret them concurrently in a batch.

Annotate Follow-up
5 / 12

Annotate to fill in gaps where AnnotateGPT’s highlights didn’t cover.

Interaction Workflows

How participants engaged with AI-generated annotations

Generate → Verify
6 / 12

Generate annotations from one annotation, then verify them afterwards.

Generate K → Verify
9 / 12

Generate annotations from K annotations and verify all at once in bulk.

Verify → Comment
7 / 12

Verify the annotation and continue to comment, extending the AI feedback.

Figure 8. Example of how P6 filled in gaps left by the automated annotations, adding manual annotations around the AI-generated highlights.

Shifted Annotation Behaviour

With AnnotateGPT, participants shifted to telegraphic annotations (quick marks instead of full comments), relying on the AI to generate detailed feedback.

“I could let [AnnotateGPT] come up with the comments.” — P4
“It could scan the whole document.” — P9
“[It could] look at all for me, so I could focus on structure.” — P10

Broader Coverage

AnnotateGPT surfaced issues participants would have missed. AI-generated feedback addressed organization & coherence more often than in the baseline condition.

“It could look at a bigger scope than me.” — P1
“It caught other points that I may have missed.” — P12

Higher-Quality Feedback

Baseline feedback only stated problems (e.g., “awk”), while AnnotateGPT explicitly stated reasons and suggested fixes.

“It gave better feedback than I would write.” — P4
“It would help me come up with more ideas to edit English work.” — P9

Over-Reliance & Agency

Some participants began delegating interpretive work to the AI, reducing their own engagement; annotation density and duration dropped after the first question. Batched workflows, and marking manually before turning to the AI, helped preserve ownership.

“The AI would write for me in a way.” — P5
“Go through it and mark it first, and then use the assistance as the secondary tool.” — P5
“I felt more responsible when annotating alone.” — P1

What Teachers Said

Novice teachers highlighted how AnnotateGPT addressed real challenges in providing feedback to students:

“I graded more in total using [AnnotateGPT] and was able to find things a lot faster.” — P3
“It gave better feedback than I would write” and “knows the correct terms to use.” — P4 & P11
“This would be helpful… especially for beginner writers.” — P12

Baseline vs. AnnotateGPT

Figure 9. Screenshots of the first page for P9. The left side is with the baseline, and the right side is with AnnotateGPT. It demonstrates different annotation approaches, with the baseline annotations focusing on identifying issues such as unclear phrasing and negative tone using textual feedback, while in the AnnotateGPT condition, P9 only used highlights with no textual feedback.

Design Implications

I

Balancing Agency & Over-Reliance

Batched workflows promote deliberate engagement over continuous delegation. Marking manually first and then using AI fosters ownership. Systems could track behavioural metrics (stroke density, duration) to nudge users toward more explicit annotations before invoking AI.
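
As one hypothetical way to implement such a nudge, the TypeScript sketch below watches recent annotation behaviour and flags when marks become sparse and quick while AI use rises; the window size and thresholds are illustrative assumptions, not measures from the study.

// Sketch of a behavioural-metric nudge; thresholds are assumptions.
interface AnnotationEvent {
  strokeCount: number; // strokes in this annotation
  durationMs: number;  // time spent drawing it
  usedAI: boolean;     // whether the assistant was invoked on it
}

// If the last few annotations are mostly single-stroke marks, drawn quickly,
// and nearly always handed to the assistant, suggest writing a more explicit
// annotation before invoking AI again.
function shouldNudge(recent: AnnotationEvent[], windowSize = 5): boolean {
  const last = recent.slice(-windowSize);
  if (last.length < windowSize) return false;

  const avgStrokes = last.reduce((sum, e) => sum + e.strokeCount, 0) / last.length;
  const avgDuration = last.reduce((sum, e) => sum + e.durationMs, 0) / last.length;
  const aiRate = last.filter((e) => e.usedAI).length / last.length;

  return avgStrokes < 2 && avgDuration < 2000 && aiRate > 0.8;
}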

II

LLMs as Cognitive Augmentation

AnnotateGPT externalizes phrasing and formatting work, allowing educators to focus cognitive effort on identifying issues rather than constructing responses.

III

Supporting Equity in Education

AI-augmented feedback helps novice teachers deliver timely, accurate, and curriculum-aligned feedback, even at scale, levelling the playing field.

IV

Annotation as Interaction Paradigm

Annotations transform from private marks into expressive inputs that shape AI system behaviour, creating a shared language between humans and AI.

Beyond Education

Annotations as a universal interaction paradigm for AI systems.

Editing Generative Content

Annotations as spatial constraints for refining AI-generated images, defining action flows for video generation, and creating & editing user interfaces, offering a low-friction alternative to repeated prompting.

Annotating the World

On-screen annotations on camera feeds enable users to direct AI attention precisely, like writing “Pie?” next to apples to ask which are best for baking. Discreet, precise, contextual.

Figure 15. Sample use cases of using annotations: Left: editing and refining generative content, including (a) refining image generation, (b) making action flows for video generation, and (c) creating and editing user interfaces. Right: interacting with a phone using on-screen annotations to form complex queries, such as (a) which apples are best for apple pies, (b) translation, and (c) saving an item for later reference. All sample queries are sourced from ChatGPT and paired with the corresponding image.

Authors

Benedict Leung

Researcher, MSc

Ontario Tech University

Mariana Shimabukuro

Researcher, Associate Professor

Ontario Tech University

Christopher Collins

Professor

Ontario Tech University

Cite Our Work

BibTeX:
@inproceedings{leung2026annotategpt,
  title     = {AnnotateGPT: Designing Human--AI Collaboration 
               in Pen-Based Document Annotation},
  author    = {Leung, Benedict and Shimabukuro, Mariana 
               and Collins, Christopher},
  booktitle = {Proceedings of the 2026 CHI Conference on 
               Human Factors in Computing Systems (CHI '26)},
  year      = {2026},
  publisher = {ACM},
  address   = {New York, NY, USA},
  location  = {Barcelona, Spain},
  doi       = {10.1145/3772318.3790867},
  pages     = {26}
}