Designing Human–AI Collaboration in
Pen-Based Document Annotation
Ontario Tech University, Canada
Providing high-quality feedback on writing is cognitively demanding, requiring reviewers to identify issues, suggest fixes, and ensure consistency. We introduce AnnotateGPT, a system that uses pen-based annotations as an input modality for AI agents to assist with essay feedback. AnnotateGPT enhances feedback by interpreting handwritten annotations and extending them throughout the document.
One AI agent classifies the purpose of each annotation, which is confirmed or corrected by the user. A second AI agent uses the confirmed purpose to generate contextually relevant feedback for other parts of the essay. In a study with 12 novice teachers annotating essays, we compared AnnotateGPT with a baseline pen-based tool without AI support.
Our findings demonstrate how reviewers used annotations to regulate AI feedback generation, refine AI suggestions, and incorporate AI-generated feedback into their review process. We highlight design implications for AI-augmented feedback systems, including balanced human–AI collaboration and the use of pen annotations as a subtle interaction modality.
Existing annotation tools fall short of supporting real feedback workflows. Teachers value handwritten feedback for its personal tone, but face persistent challenges, and students struggle to benefit from current practices.
Digital tools like Adobe Acrobat reduce annotations to highlights and sticky notes, treating them as final outputs rather than interactive, actionable inputs.
AI tools like Grammarly provide grammar and style suggestions, but prioritize correctness over capturing a reviewer’s intent or integrating with annotation workflows.
Prior pen-based systems like XLibris [Golovchinsky et al. 1999] and Metatation [Mehta et al. 2017] rely on heuristic pattern matching and cannot infer annotation purpose beyond predefined, narrow contexts.
Physical and temporal constraints degrade the legibility of handwritten annotations, leaving students struggling to interpret feedback that is hard to read and unclear.
Time pressure and the need to return work promptly keep teachers from providing detailed, thoughtful feedback, lowering the quality of feedback students receive.
Feedback tends to fixate on surface-level errors rather than providing balanced guidance, leaving students without constructive direction on organization, coherence, or argumentation.
Where prior tools stop at static marks or sentence-level corrections, AnnotateGPT treats handwritten marks as clues for AI collaboration. It interprets clusters of pen strokes, infers their likely purpose, and generates contextually relevant feedback across the document. By grounding AI generation in teacher-provided annotations, AnnotateGPT preserves educators’ voices and styles, addresses the limitations of existing tools, and provides students with feedback that is legible, timely, and constructive.
A pen-based system that integrates LLMs into document annotation through purpose inference and feedback propagation. Pen strokes are classified, clustered, and interpreted by two AI agents to generate actionable feedback.
A study with 12 novice teachers revealing how reviewers manage, appropriate, and negotiate AI-augmented feedback, including diverse annotation workflows and shifting annotation behaviours.
Design implications for AI-augmented feedback systems, framing pen annotations not only as feedback but as a broader interaction design paradigm for human-AI communication.
Watch how AnnotateGPT transforms pen-based annotations into AI-augmented feedback.
A six-step pipeline from pen strokes to AI-generated feedback.
The user annotates the document with a digital pen: highlights, circles, underlines, or handwritten notes.
Pen strokes are grouped using hierarchical agglomerative clustering based on spatiotemporal distance (see the sketch after Figure 2).
Tapping a cluster reveals a hidden assistant marker, the entry point for AI support.
The assistant captures two images of the annotation, with and without the underlying text, and the LLM proposes four possible purposes, personalized via RAG.
The user selects a purpose (or types their own). AnnotateGPT remembers choices for future inferences.
AI generates context-specific feedback across the document. Users accept, reject, or mark as helpful.
Figure 2. An overview of AnnotateGPT’s framework: (a) The user first annotates the document. (b) AnnotateGPT then clusters the pen strokes based on spatiotemporal distance, representing an annotation. (c) The user taps on the cluster/annotation to activate and open the assistant. (d) The assistant captures two images from the cluster, one with the underlying text and one without, and makes four guesses about the annotation’s purpose. (e) The user then selects a purpose, which AnnotateGPT will remember for future inferences. (f) Finally, AnnotateGPT generates annotations based on the selected purpose and (g) highlights them on the document.
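Step 2 is sketched below in TypeScript, matching the Next.js implementation. The Stroke fields, the distance weighting, and the merge cutoff are illustrative assumptions; the paper only specifies hierarchical agglomerative clustering over spatiotemporal distance.

```typescript
// A minimal sketch of stroke clustering (step 2). Field names, alpha, and the
// cutoff are assumptions for illustration, not values from AnnotateGPT.
interface Stroke {
  x: number; // stroke centroid, x (px)
  y: number; // stroke centroid, y (px)
  t: number; // timestamp of the stroke's first point (ms)
}

// Combined spatial + temporal distance; alpha trades milliseconds against pixels.
function spatioTemporalDistance(a: Stroke, b: Stroke, alpha = 0.05): number {
  const spatial = Math.hypot(a.x - b.x, a.y - b.y);
  const temporal = Math.abs(a.t - b.t);
  return spatial + alpha * temporal;
}

// Single-linkage agglomerative clustering: repeatedly merge the two closest
// clusters until no pair is closer than the cutoff.
function clusterStrokes(strokes: Stroke[], cutoff = 120): Stroke[][] {
  const clusters: Stroke[][] = strokes.map((s) => [s]);
  while (clusters.length > 1) {
    let best = { i: -1, j: -1, d: Infinity };
    for (let i = 0; i < clusters.length; i++) {
      for (let j = i + 1; j < clusters.length; j++) {
        for (const a of clusters[i]) {
          for (const b of clusters[j]) {
            const d = spatioTemporalDistance(a, b);
            if (d < best.d) best = { i, j, d };
          }
        }
      }
    }
    if (best.d > cutoff) break; // nothing left that is close enough to merge
    clusters[best.i] = clusters[best.i].concat(clusters[best.j]);
    clusters.splice(best.j, 1);
  }
  return clusters;
}
```

Single linkage keeps strokes drawn close together in space and time, such as the letters of one handwritten comment, in the same cluster; each resulting cluster is treated as one annotation.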
Built as a Next.js web application with GPT-4o integration.
The interface comprises three components: (b) a toolbar with colour palette, highlighter, and pen; (c) the document; and (d) a specialized scrollbar. Designed for the Microsoft Surface Studio with simultaneous thumb and pen interaction.
Tapping an annotation opens an assistant marker displaying suggested purposes. The marker uses colour-coded states: waiting, processing, and done.
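A minimal sketch of the purpose-inference call (step 4) is shown below, assuming the official openai Node SDK; the prompt wording and the RAG step that retrieves the reviewer's past purpose choices are simplified to a plain string, and inferPurposes is a hypothetical helper, not the system's actual API.

```typescript
// Hypothetical purpose-inference helper; assumes the official "openai" SDK.
import OpenAI from "openai";

const openai = new OpenAI();

export async function inferPurposes(
  annotationWithTextPng: string, // data URL: the stroke cluster over the underlying text
  strokesOnlyPng: string,        // data URL: the same strokes without the text
  pastChoices: string[]          // previously confirmed purposes (stands in for the RAG step)
): Promise<string[]> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text:
              "These two images show a handwritten annotation on an essay, with and " +
              "without the underlying text. Propose four short, distinct purposes for " +
              "this annotation, one per line. The reviewer has previously confirmed " +
              `purposes such as: ${pastChoices.join("; ")}.`,
          },
          { type: "image_url", image_url: { url: annotationWithTextPng } },
          { type: "image_url", image_url: { url: strokesOnlyPng } },
        ],
      },
    ],
  });
  const text = response.choices[0].message.content ?? "";
  return text.split("\n").map((line) => line.trim()).filter(Boolean).slice(0, 4);
}
```

The four returned purposes populate the assistant marker; whichever purpose the reviewer confirms (or types) is stored and fed into later inferences.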
Generated annotations appear as yellow highlights. Users tap to view the feedback, then accept it, reject it, or mark it as helpful. A reply box extends the conversation.
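A minimal sketch of the second agent, which extends the confirmed purpose across the essay, follows; the JSON shape, verbatim-excerpt anchoring, and propagateFeedback helper are assumptions for illustration, not the system's exact schema.

```typescript
// Hypothetical feedback-propagation helper; assumes the official "openai" SDK.
import OpenAI from "openai";

const openai = new OpenAI();

interface GeneratedFeedback {
  excerpt: string; // text span to highlight, copied verbatim from the essay
  comment: string; // feedback stating the issue and a suggested fix
}

export async function propagateFeedback(
  essayText: string,
  confirmedPurpose: string
): Promise<GeneratedFeedback[]> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "You extend a reviewer's annotation purpose to the rest of an essay. " +
          'Respond with a JSON object {"feedback": [{"excerpt": ..., "comment": ...}, ...]}, ' +
          "where each excerpt is copied verbatim from the essay.",
      },
      {
        role: "user",
        content: `Annotation purpose: ${confirmedPurpose}\n\nEssay:\n${essayText}`,
      },
    ],
  });
  const raw = response.choices[0].message.content ?? '{"feedback": []}';
  return (JSON.parse(raw).feedback ?? []) as GeneratedFeedback[];
}
```

Each returned excerpt can then be located in the document and rendered as a yellow highlight for the reviewer to accept, reject, or mark as helpful.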
A within-subjects study with 12 pre-service teachers comparing AnnotateGPT to a baseline digital annotation tool.
Annotations fall along two dimensions: Form (telegraphic vs. explicit) describes whether marks are personal shorthand or clear textual feedback. Purpose (micro vs. macro) captures whether the annotation targets fine-grained features or broader structural aspects.
Telegraphic: personal, opaque codings such as quick marks, highlights, circles, or crossing-out without written explanation.
Explicit: clear, self-explanatory meaning, usually handwritten textual comments, corrections, and detailed feedback.
Participants developed distinct strategies for creating annotations and interacting with AI-generated feedback, revealing how users naturally negotiate agency with the assistant.
How participants created and submitted their annotations
How participants engaged with AI-generated annotations
Figure 8. Example of how P6 filled in the gaps of the automated annotations, adding their own annotations around the AI-generated highlights.
With AnnotateGPT, participants shifted to telegraphic annotations (quick marks instead of full comments), relying on the AI to generate detailed feedback.
“I could let [AnnotateGPT] come up with the comments.” — P4
“It could scan the whole document.” — P9
“[It could] look at all for me, so I could focus on structure.” — P10
AnnotateGPT surfaced issues participants would have missed, and its feedback addressed organization and coherence more than the baseline.
“It could look at a bigger scope than me.” — P1
“It caught other points that I may have missed.” — P12
Baseline feedback only stated problems (e.g., “awk”), while AnnotateGPT explicitly stated reasons and suggested fixes.
“It gave better feedback than I would write.” — P4
“It would help me come up with more ideas to edit English work.” — P9
Some participants began delegating interpretive work to the AI, reducing their own engagement: annotation density and duration dropped after the first question. Batched workflows, and marking the document first before turning to the AI, helped preserve ownership.
“The AI would write for me in a way.” — P5
“Go through it and mark it first, and then use the assistance as the secondary tool.” — P5
“I felt more responsible when annotating alone.” — P1
Novice teachers highlighted how AnnotateGPT addressed real feedback challenges for students:
“I graded more in total using [AnnotateGPT] and was able to find things a lot faster.” — P3
“It gave better feedback than I would write” and “knows the correct terms to use.” — P4 & P11
“This would be helpful… especially for beginner writers.” — P12
Figure 9. Screenshots of the first page for P9. The left side is with the baseline, and the right side is with AnnotateGPT. It demonstrates different annotation approaches, with the baseline annotations focusing on identifying issues such as unclear phrasing and negative tone using textual feedback, while in the AnnotateGPT condition, P9 only used highlights with no textual feedback.
Batched workflows promote deliberate engagement over continuous delegation. Marking manually first and then using AI fosters ownership. Systems could track behavioural metrics (stroke density, duration) to nudge users toward more explicit annotations before invoking AI.
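One possible form of that nudge is sketched below; the thresholds and session averaging are illustrative assumptions, not values from the study.

```typescript
// Hypothetical behavioural-metric check: nudge the reviewer toward a more
// explicit annotation when their marking becomes much sparser and quicker
// than earlier in the session.
interface StrokeStats {
  strokeCount: number; // strokes in the current annotation
  durationMs: number;  // time spent drawing it
}

function shouldNudgeForExplicitAnnotation(
  current: StrokeStats,
  sessionHistory: StrokeStats[]
): boolean {
  if (sessionHistory.length === 0) return false;
  const avgStrokes =
    sessionHistory.reduce((sum, s) => sum + s.strokeCount, 0) / sessionHistory.length;
  const avgDuration =
    sessionHistory.reduce((sum, s) => sum + s.durationMs, 0) / sessionHistory.length;
  // Illustrative thresholds: half the reviewer's session averages.
  return current.strokeCount < 0.5 * avgStrokes && current.durationMs < 0.5 * avgDuration;
}
```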
AnnotateGPT externalizes phrasing and formatting work, allowing educators to focus cognitive effort on identifying issues rather than constructing responses.
AI-augmented feedback helps novice teachers deliver timely, accurate, and curriculum-aligned feedback, even at scale, levelling the playing field.
Annotations transform from private marks into expressive inputs that shape AI system behaviour, creating a shared language between humans and AI.
Annotations as a universal interaction paradigm for AI systems.
Annotations as spatial constraints for refining AI-generated images, defining action flows for video generation, and creating & editing user interfaces, offering a low-friction alternative to repeated prompting.
On-screen annotations on camera feeds enable users to direct AI attention precisely, like writing “Pie?” next to apples to ask which are best for baking. Discreet, precise, contextual.
Figure 15. Sample use cases of annotations. Left: editing and refining generative content, including (a) refining image generation, (b) making action flows for video generation, and (c) creating and editing user interfaces. Right: interacting with a phone using on-screen annotations to form complex queries, such as (a) which apples are best for apple pies, (b) translation, and (c) saving an item for later reference. All sample queries are sourced from ChatGPT and paired with the corresponding image.
@inproceedings{leung2026annotategpt,
title = {AnnotateGPT: Designing Human--AI Collaboration
in Pen-Based Document Annotation},
author = {Leung, Benedict and Shimabukuro, Mariana
and Collins, Christopher},
booktitle = {Proceedings of the 2026 CHI Conference on
Human Factors in Computing Systems (CHI '26)},
year = {2026},
publisher = {ACM},
address = {New York, NY, USA},
location = {Barcelona, Spain},
doi = {10.1145/3772318.3790867},
pages = {26}
}