Most UI-to-code benchmarks focus on static layouts. Interaction2Code asks about interaction instead: can multimodal LLMs (MLLMs) generate code that implements the dynamic, stateful webpage behaviors observable across screenshots? The dataset pairs interactive prototyping images with structured interaction metadata, so evaluators can measure not only static layout fidelity but also whether generated code reproduces interaction transitions.
What Sets It Apart
- Snapshot-to-snapshot interaction pairs: each webpage folder contains screenshots of successive states (e.g., 0.png → 1.png) plus an action.json describing the source/destination images, tag types, and visual change types. This enables evaluation of state transitions rather than single-frame rendering, so you can test whether a model captures behavior such as navigation, component addition, or visual property changes.
- Interaction diversity and scale: the benchmark comprises 127 unique webpages and 374 distinct interactions spanning ~15 webpage types and 31 interaction categories, providing breadth across common interactive patterns (buttons, image updates, new components, text/position changes).
- End-to-end evaluation focus: the dataset is distributed with example generation scripts (GitHub repo) and a paper describing metrics and prompting methods, so it supports reproducible comparisons of prompting strategies and MLLM pipelines for interactive code generation.
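The folder layout described above (numbered screenshots plus an action.json per webpage) can be walked with a few lines of Python. The following is a minimal sketch, not the dataset's official loader: the key names `annotations`, `src`, `dst`, `tag_type`, and `visual_type` are assumptions made for illustration, and the sample record is invented, so check a real action.json and adjust the keys before relying on it.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Transition:
    """One before/after screenshot pair with its interaction labels."""
    src_image: Path
    dst_image: Path
    tag_types: tuple
    visual_types: tuple


def parse_transitions(page_dir, metadata):
    """Turn parsed action.json content into Transition objects.

    ASSUMPTION: the key names used here ("annotations", "src", "dst",
    "tag_type", "visual_type") are hypothetical placeholders.
    """
    page_dir = Path(page_dir)
    transitions = []
    for entry in metadata.get("annotations", []):
        transitions.append(
            Transition(
                src_image=page_dir / f"{entry['src']}.png",
                dst_image=page_dir / f"{entry['dst']}.png",
                tag_types=tuple(entry.get("tag_type", [])),
                visual_types=tuple(entry.get("visual_type", [])),
            )
        )
    return transitions


def load_page(page_dir):
    """Read action.json from a webpage folder and parse its transitions."""
    page_dir = Path(page_dir)
    with open(page_dir / "action.json", encoding="utf-8") as fh:
        return parse_transitions(page_dir, json.load(fh))


# Invented sample record standing in for one webpage's action.json.
sample = {
    "annotations": [
        {"src": "0", "dst": "1",
         "tag_type": ["button"], "visual_type": ["new component"]},
    ]
}
pairs = parse_transitions("webpage_1", sample)
for pair in pairs:
    print(pair.src_image, "->", pair.dst_image, pair.tag_types, pair.visual_types)
```

Keeping the parsing step separate from file I/O makes it easy to test the key mapping against a hand-written record before pointing `load_page` at the real folders.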
Who It's For & Trade-offs
Great fit if you want to evaluate or compare multimodal LLMs on producing interactive frontend behavior (e.g., prompting strategies, chain-of-thought vs. direct prompts, code synthesis for React). It’s useful for benchmark-driven research, prompt engineering, and small-scale model probing.
Look elsewhere if you need training-scale corpora or full-fidelity real-world production websites: the dataset is benchmark-oriented (127 webpages) and centers on prototyping screenshots + annotated interactions rather than large-scale HTML/CSS/JS crawl data. Also, interactions are encoded via metadata files (action.json) and example generators, so integration into other pipelines may require format adaptation.
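As an illustration of the format adaptation mentioned above, the sketch below flattens one webpage's interaction metadata into JSON Lines records that a generic evaluation harness could consume. Both the input keys (`annotations`, `src`, `dst`, `tag_type`, `visual_type`) and the output schema (`id`, `before`, `after`, `labels`) are hypothetical choices for this example, not part of the dataset or its tooling.

```python
import json


def to_jsonl_records(page_id, metadata):
    """Flatten one page's interaction metadata into flat dict records.

    ASSUMPTION: input key names and the output schema are illustrative only.
    """
    records = []
    for index, entry in enumerate(metadata.get("annotations", [])):
        # Merge both label lists into one sorted, de-duplicated list.
        labels = sorted(
            set(entry.get("tag_type", [])) | set(entry.get("visual_type", []))
        )
        records.append(
            {
                "id": f"{page_id}-{index}",
                "before": f"{page_id}/{entry['src']}.png",
                "after": f"{page_id}/{entry['dst']}.png",
                "labels": labels,
            }
        )
    return records


# Invented sample record standing in for one webpage's action.json.
sample = {
    "annotations": [
        {"src": "0", "dst": "1",
         "tag_type": ["button"], "visual_type": ["new component"]},
    ]
}
lines = [json.dumps(record) for record in to_jsonl_records("webpage_1", sample)]
print("\n".join(lines))
```

One record per interaction, rather than per webpage, keeps each evaluation example independent, which suits a benchmark that counts interactions individually.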
