Aria-UI: Visual Grounding for GUI Instructions

Key Features of Aria-UI

✨ Versatile Grounding Instruction Understanding:
Aria-UI handles diverse grounding instructions and interprets varied instruction formats well, so it adapts robustly to dynamic scenarios and can be paired with diverse planning agents.
📝 Context-aware Grounding:
Aria-UI effectively leverages historical input, whether in pure text or text-image-interleaved formats, to improve grounding accuracy.
⚡ Lightweight and Fast:
Aria-UI is a mixture-of-experts model with 3.9B activated parameters per token. It efficiently encodes GUI inputs of variable sizes and aspect ratios, with support for ultra-high resolutions.
🎉 Superior Performance:
Aria-UI sets new state-of-the-art results on offline and online agent benchmarks. In particular, Aria-UI ranks 🏆 1st on AndroidWorld with a 44.8% task success rate and 🥉 3rd on OSWorld with a 15.2% task success rate. (Dec. 2024)
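
To illustrate how one grounding prediction can apply to screenshots of variable sizes and aspect ratios, here is a minimal sketch. It assumes the model reports a target point on a resolution-independent grid; the 0 to 1000 range, the `Screenshot` class, and the `to_pixels` helper are all hypothetical and are not taken from the text above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Screenshot:
    """Hypothetical container for a screenshot's pixel dimensions."""
    width: int
    height: int


# Assumption: predicted points live on a 0..NORM_RANGE grid that does not
# depend on the screenshot resolution. The value 1000 is illustrative only.
NORM_RANGE = 1000


def to_pixels(norm_x: float, norm_y: float, shot: Screenshot) -> tuple[int, int]:
    """Map a normalized point to pixel coordinates, clamped to the image."""
    if not (0 <= norm_x <= NORM_RANGE and 0 <= norm_y <= NORM_RANGE):
        raise ValueError("normalized coordinates must lie in [0, NORM_RANGE]")
    # Scale each axis independently so any aspect ratio is handled.
    px = min(shot.width - 1, int(norm_x / NORM_RANGE * shot.width))
    py = min(shot.height - 1, int(norm_y / NORM_RANGE * shot.height))
    return px, py


phone = Screenshot(width=1080, height=2400)
desktop = Screenshot(width=3840, height=2160)
print(to_pixels(500, 250, phone))    # (540, 600)
print(to_pixels(500, 250, desktop))  # (1920, 540)
```

Scaling each axis separately means the same normalized prediction lands on the corresponding relative position of a tall phone screen or a wide desktop screen.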

Abstract

Digital agents that automate tasks across different platforms by directly manipulating GUIs are increasingly important. For these agents, grounding language instructions to target elements remains a significant challenge because of their reliance on HTML or AXTree inputs. In this paper, we introduce Aria-UI, a large multimodal model specifically designed for GUI grounding. Aria-UI adopts a pure-vision approach, eschewing reliance on auxiliary inputs. To adapt to heterogeneous planning instructions, we propose a scalable data pipeline that synthesizes diverse and high-quality instruction samples for grounding. To handle dynamic contexts during task execution, Aria-UI incorporates textual and text-image interleaved action histories, enabling robust context-aware reasoning for grounding. Aria-UI sets new state-of-the-art results across offline and online agent benchmarks, outperforming both vision-only and AXTree-reliant baselines.
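
The difference between a textual and a text-image interleaved action history can be sketched as follows. This is only an illustration of the idea: the `Step` class, the `build_prompt` helper, the dictionary layout, and the example history are hypothetical and do not reflect Aria-UI's actual input format.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One past agent step. Field names are hypothetical."""
    action: str                            # textual description of the action
    screenshot_path: Optional[str] = None  # used only in interleaved mode


def build_prompt(instruction: str, history: list[Step], interleaved: bool) -> list[dict]:
    """Return a chat-style content list: history first, then the instruction."""
    content: list[dict] = []
    for i, step in enumerate(history, start=1):
        # In interleaved mode, each past screenshot precedes its action text.
        if interleaved and step.screenshot_path is not None:
            content.append({"type": "image", "path": step.screenshot_path})
        content.append({"type": "text", "text": f"Step {i}: {step.action}"})
    content.append({"type": "text", "text": f"Instruction: {instruction}"})
    return content


# Hypothetical history leading up to one of the showcase instructions.
history = [
    Step("open the Settings app", "step1.png"),
    Step("tap the account banner"),
]
text_only = build_prompt("enable iCloud Photos", history, interleaved=False)
mixed = build_prompt("enable iCloud Photos", history, interleaved=True)
print(len(text_only), len(mixed))  # 3 4
```

In the text-only variant the history costs a few tokens per step, while the interleaved variant also carries the earlier screenshots, which gives the model more visual context at the price of a longer input.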

Showcases

check the stars for Byte Blaze / empathy-prompts.

enable iCloud Photos.

stop the service.

check color palette.

BibTeX


    @article{ariaui,
          title={Aria-UI: Visual Grounding for GUI Instructions}, 
          author={Yuhao Yang and Yue Wang and Dongxu Li and Ziyang Luo and Bei Chen and Chao Huang and Junnan Li},
          year={2024},
          journal={arXiv preprint arXiv:2412.16256},
    }