mcp-screen v1.1: GUI Grounding Model Research
Date: 2026-05-31
Author: Research subagent (Claude Opus 4.8)
Question: Is the OmniParser ONNX model we run at the top of the May 2026 GUI-grounding charts? Should we swap it?
Grounding in the live code: Section 2 below was verified against the live `/home/andrew/mcp-screen/grounding.py` and `README.md` this session (not just the task brief). Confirmed: the model file is `/home/andrew/mcp-screen/models/onnx/model.onnx` (12,136,163 bytes ≈ 11.6 MiB); it is loaded via `onnxruntime` `CPUExecutionProvider` with `intra_op_num_threads = MCP_SCREEN_CPU_THREADS` (default 6) and `inter_op_num_threads = 1`; OCR is RapidOCR; OpenCV does letterbox + NMS. The `worldmodel.py` file exists alongside but is the dHash cache, not a model (see below).
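The loading configuration confirmed above could look roughly like the sketch below. It is an illustration under assumptions, not a copy of `grounding.py`: the helper names `resolve_cpu_threads` and `make_session` are hypothetical, and the fallback rules for a missing or invalid `MCP_SCREEN_CPU_THREADS` value are assumed. The `onnxruntime` calls used (`SessionOptions`, `InferenceSession` with `providers=["CPUExecutionProvider"]`) are the library's standard Python API.

```python
import os


def resolve_cpu_threads(env=None, default=6):
    """Read MCP_SCREEN_CPU_THREADS, falling back to the documented default of 6.

    Assumption: the parsing and fallback rules are illustrative only; the real
    grounding.py may validate the variable differently.
    """
    env = os.environ if env is None else env
    raw = env.get("MCP_SCREEN_CPU_THREADS", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def make_session(model_path, env=None):
    """Build a CPU-only onnxruntime session with the thread settings above."""
    # Imported lazily so resolve_cpu_threads() has no third-party dependency.
    import onnxruntime as ort

    opts = ort.SessionOptions()
    opts.intra_op_num_threads = resolve_cpu_threads(env)  # parallelism inside one op
    opts.inter_op_num_threads = 1                         # no parallelism across ops
    return ort.InferenceSession(
        model_path, sess_options=opts, providers=["CPUExecutionProvider"]
    )
```

Calling `make_session("models/onnx/model.onnx")` would then return a session pinned to the CPU provider with the configured thread counts.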
1. TL;DR
No — the OmniParser ONNX we run is not, and was never designed to be, "top of the charts," and that is fine. Recommendation: STAY (CPU-bound), with a small targeted re-evaluation. Confidence: high on the "stay" call, medium-high on the exact 2026 leaderboard numbers.
Every model at the top of the May 2026 GUI-grounding leaderboards (ScreenSpot-Pro, OSWorld-G, UI-Vision) is a multi-billion-parameter vision-language model — UI-Venus-1.5-30B-A3B (69.6% ScreenSpot-Pro), Gelato-30B-A3B (63.9%), GTA1-72B (58.4%), or zoom-in pipelines built on 8B–32B VLMs (KV-Ground, MAI-UI-32B at ~77–81% on the official board). None of these are CPU-deployable under our hard constraints (no GPU, 6 CPU threads, dual 4K = 7680×2160). A 30B MoE or 72B dense VLM doing multi-second-to-multi-minute-per-frame inference on CPU is a non-starter for an interactive screen driver. OmniParser's icon_detect (a ~11.6 MB YOLOv8 ONNX) + RapidOCR is in a completely different weight class: it is a lightweight detector, not a grounding VLM, and it is the correct class of model for our deployment envelope. The leaderboards do not contain a CPU-runnable model that would beat our current pipeline by enough to justify the latency/RAM blowup. The honest gap is that we lack our own numbers on our own benchmark (dual-4K, real apps), so we should measure before any future swap — but there is no swap target that fits the box today.
2. What we actually run
Two things are frequently conflated; only one is an ML model.
- `worldmodel.py` is NOT an ML model. It is a SQLite dHash (perceptual-hash) cache of the perception layer's output. It stores prior grounding results keyed by a perceptual hash of the screen region so that unchanged regions can be served from cache instead of re-running detection. No leaderboard applies to it. It is not a competitor to any grounding model and is explicitly out of scope for this benchmarking exercise. Benchmarking it against UI-Venus etc. would be a category error.
- The only ML model in scope is the perception / grounding layer, which is:
  - OmniParser `icon_detect`: a YOLOv8 single-class detector exported to ONNX at `models/onnx/model.onnx` (verified 12,136,163 bytes ≈ 11.6 MiB). Job: detect interactable UI elements / icons and emit bounding boxes. Loaded via `onnxruntime` `CPUExecutionProvider`, `intra_op_num_threads=MCP_SCREEN_CPU_THREADS` (default 6), `inter_op_num_threads=1`. Inference is letterbox → `sess.run` → threshold → NMS at `imgsz=640`. No torch.
  - RapidOCR (`from rapidocr import RapidOCR`, v3 output object with `.boxes` / `.txts` / `.scores`): text detection + recognition, also ONNX/CPU.
  - OpenCV: letterboxing, `cv2.dnn.NMSBoxes`, and geometry glue. A classical-CV Canny+contour path (`cv_regions`) exists only as a fallback when the OmniParser ONNX / onnxruntime is unavailable.
- All of the above run ONNX / CPU-only. `annotate()` then names each icon box by the OCR text inside it and draws the Set-of-Marks overlay; the host LLM picks the numbered mark.
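The threshold and NMS stages of the detector path can be illustrated in plain Python. This is a dependency-free sketch of greedy non-maximum suppression, not the production path (which the text above says goes through `cv2.dnn.NMSBoxes`). The function names and the `score_thr=0.25` / `iou_thr=0.5` defaults are assumptions. Boxes here are corner pairs `(x1, y1, x2, y2)`, whereas OpenCV's helper expects `(x, y, w, h)` rectangles.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def threshold_and_nms(boxes, scores, score_thr=0.25, iou_thr=0.5):
    """Return indices of kept boxes, highest score first.

    Step 1 drops boxes below score_thr; step 2 greedily keeps the best
    remaining box and suppresses any box overlapping a kept one by more
    than iou_thr. Both thresholds are illustrative assumptions.
    """
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thr),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thr for k in kept):
            kept.append(i)
    return kept
```

Because the detector is single-class, one NMS pass over all boxes is enough; a multi-class detector would normally run it per class.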
The model's actual job is pure screen grounding / element detection — output is bounding boxes + click coordinates for UI elements. It does not need to be a VLM agent or planner. mcp-screen's host LLM does the reasoning ("which element, why, what next"); the perception layer only has to answer "where are the clickable things and what text is on screen." This framing matters enormously for the swap question: the leaderboard SOTA models are end-to-end grounding VLMs that take a natural-language instruction and return a coordinate. That is a different and heavier task than "detect all elements," and it inherently requires a language model in the loop. Our architecture deliberately keeps the language model out of the perception layer.
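The division of labor described above (the detector finds boxes, OCR supplies names, the host LLM picks a numbered mark) could be sketched as follows. This is a minimal illustration, not the actual `annotate()` code. The function `build_marks`, the axis-aligned OCR boxes, the center-containment rule, and the `"icon"` fallback name are all assumptions.

```python
def center(box):
    """Center point of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def contains(box, point):
    """True if the point lies inside the box (edges included)."""
    x1, y1, x2, y2 = box
    px, py = point
    return x1 <= px <= x2 and y1 <= py <= y2


def build_marks(icon_boxes, ocr_items):
    """Number each detected box and name it by the OCR text inside it.

    icon_boxes: list of (x1, y1, x2, y2) detector boxes.
    ocr_items:  list of (box, text) pairs; boxes are assumed axis-aligned.
    Returns the list the host LLM would choose from; no language model is
    involved at this layer.
    """
    marks = []
    for number, box in enumerate(icon_boxes, start=1):
        texts = [text for ocr_box, text in ocr_items if contains(box, center(ocr_box))]
        cx, cy = center(box)
        marks.append({
            "mark": number,
            "name": " ".join(texts) or "icon",  # hypothetical fallback label
            "click": (round(cx), round(cy)),
        })
    return marks
```

The host LLM then only has to answer with a mark number, and the click coordinate comes from the stored box center.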
Deployment envelope (hard constraints): No GPU. CPU-only. 6 CPU threads. Linux. Driving dual 4K monitors (7680×2160 combined). Any model needing a GPU, tens of GB of RAM, or multi-second-per-frame CPU latency is effectively a non-starter for interactive use.
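To make the resolution constraint concrete, the letterbox arithmetic for a 7680×2160 frame at `imgsz=640` can be worked through in a few lines. This is a sketch of standard letterbox geometry (uniform scale plus centered padding). The function names are hypothetical, and whether `grounding.py` letterboxes the whole desktop in one pass or works per monitor or per region is not stated here and is not assumed.

```python
def letterbox_params(src_w, src_h, imgsz=640):
    """Uniform scale and centered padding that fit src into imgsz x imgsz."""
    scale = min(imgsz / src_w, imgsz / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x = (imgsz - new_w) / 2.0
    pad_y = (imgsz - new_h) / 2.0
    return scale, pad_x, pad_y


def to_source(box, scale, pad_x, pad_y):
    """Map an (x1, y1, x2, y2) box from model space back to source pixels."""
    x1, y1, x2, y2 = box
    return (
        (x1 - pad_x) / scale,
        (y1 - pad_y) / scale,
        (x2 - pad_x) / scale,
        (y2 - pad_y) / scale,
    )


# Dual-4K desktop: the scale is 1/12, so the image occupies 640 x 180
# of the model input with 230 rows of padding above and below.
scale, pad_x, pad_y = letterbox_params(7680, 2160)
```

At a scale of 1/12, a 32-pixel toolbar icon would span under 3 pixels of model input if the full desktop were fed in a single pass, which is one reason the measurement at "our actual downscale" matters.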
3. May 2026 leaderboard landscape
Three benchmarks differentiate 2026 grounding SOTA. The original ScreenSpot and ScreenSpot-v2 are saturated (~92–96% for top open models) and no longer discriminative; ignore them for model selection.
ScreenSpot-Pro (high-res professional apps; the headline benchmark)
ScreenSpot-Pro launched in 2025 with a brutal baseline: the best model scored only 18.9% (OS-Atlas-7B; confirmed against the official leaderboard JSON and the launch paper arXiv:2504.07981). Scores have since climbed dramatically. There are two conflicting pictures of the top, and the discrepancy is important and honestly reported here:
Picture A — the OFFICIAL ScreenSpot-Pro leaderboard (gui-agent.github.io/grounding-leaderboard, the canonical JSON parsed in full during adversarial verification). Top entries are GUI-specialist / zoom-in pipelines, not general chat models:
| Rank | Model | ScreenSpot-Pro |
|---|---|---|
| 1 | KV-Ground-8B + Qwen3.5-27B Consistency Router | 80.9% |
| 2 | KV-Ground-GuiOwl1.5-0315-8B-ZoomIn | 80.5% |
| 3 | Holo2-235B-A22B (Agentic) | 78.5% |
| 4 | MAI-UI-32B | 77.5% |
| 5 | AdaZoom-GUI-4B | 76.8% |
On this official board the maximum score is 80.9%, there is no Claude Opus / GPT-5 / Gemini / "Muse Spark" entry near the top, the only Claude entry is "Claude (Computer Use)" at 17.1% (rank 73/86), and GPT-5 entries are 18.5% / 6.0%.
Picture B — third-party aggregators (benchlm.ai, llm-stats.com) that claimed frontier general VLMs lead at ~84–88% (Claude Opus 4.8 87.9%, GPT-5.4 85.4%, Gemini 3.1 Pro 84.4%, Meta "Muse Spark" 84.1%). This picture is REFUTED. Adversarial verification against the official benchmark data found none of those models appear on the official board; the figures look fabricated/synthetic. Both aggregator sites are reachable (HTTP 200) but their content directly conflicts with the authoritative source. Do not rely on the "frontier VLMs top ScreenSpot-Pro at ~88%" claim — it is UNVERIFIED at best and contradicted by the primary source.
Dedicated open grounding models on ScreenSpot-Pro (verified per-variant, primary sources):
| Model | Params | ScreenSpot-Pro | Source |
|---|---|---|---|
| UI-Venus-1.5-30B-A3B (MoE) | 30B MoE (A3B active) | 69.6% | arXiv:2602.09082 (confirmed verbatim) |
| UI-Venus-1.5-8B | 8B | 68.4% | arXiv:2602.09082 |
| MAI-UI-32B | 32B | 67.9% (paper) / 77.5% (official board, w/ pipeline) | arXiv:2602.09082; official board |
| Gelato-30B-A3B | 30B MoE | 63.88% | github.com/mlfoundations/Gelato |
| GTA1-72B | 72B | 58.4% | HF HelloKKMe/GTA1-72B |
| UI-Venus-1.5-2B | 2B | 57.7% | arXiv:2602.09082 |
| LASER (GTA1-7B based) | 7B + method | 55.7% | emergentmind (UNVERIFIED secondary) |
| UI-Venus-Ground-7B (v1.0) | 7B | 50.8% | arXiv:2508.10833 |
| GTA1-7B | 7B | 50.1% | HF GTA1 |
| UI-AGILE-7B | 7B | 48.7% | emergentmind (UNVERIFIED secondary) |
| GUI-G2-72B | 72B | 47.5% | emergentmind (UNVERIFIED secondary) |
| Phi-Ground | <10B | 43.2% | arXiv:2507.23779 |
OSWorld-G (xlang-ai, NeurIPS 2025; 564 fine-grained samples)
Launch baseline Jedi-7B 54.1%. 2026 leaders (verified): UI-Venus-1.5-30B 70.6% (76.4% on OSWorld-G-Refined), Gelato-30B 69.15% / 74.65% Refined, MEGA-GUI 68.63%, GTA1-7B 67.7% (notably edging GTA1-72B's 66.7%), MAI-UI-32B 67.6%, GTA1-32B 65.2%.
UI-Vision (ServiceNow, ICML 2025; hardest desktop benchmark)
Launch leader UI-TARS-72B managed only 31.4% element grounding / 24.3% layout grounding; Claude-3.7-Sonnet 9.48% element. By Feb 2026 UI-Venus-1.5-30B reaches 69.0% element grounding (prior best MAI-UI-32B 59.1%). Phi-Ground earlier reported 27.2%.
Other / newer
- VenusBench-GD (arXiv:2512.16501; multi-platform, 6,100+ pairs, 97+ apps): UI-Venus-1.5-30B 75.0%. (Confidence: medium — vendor benchmark.)
- ScreenSpot-v2 / ScreenSpot: saturated, ~92–96% top; not discriminative.
Lineage summary (dedicated-grounding SOTA over time)
UI-TARS-1.5 (~61.6%, mid-2025) → GTA1-72B (58.4%) / UI-Venus-Ground-72B (61.9%, Aug 2025) → Gelato-30B (63.9%, Nov 2025) → UI-Venus-1.5-30B-A3B (69.6%, Feb 2026, current verified dedicated SOTA).
4. Where OmniParser stands
OmniParser is in a different category and is effectively absent from the modern grounding leaderboards — by design. Key facts:
- OmniParser V2 (Microsoft, Feb 2025) is the latest version. No V3 / v2.1 was found as of May 2026 — development appears stalled (last substantive milestones: Feb 2025 V2 checkpoints, March 2025 OmniTool integration). This is a real risk signal: the model we depend on is not actively advancing.
- OmniParser is not an instruction-conditioned grounding VLM. Its `icon_detect` is a YOLO detector ("find all interactable elements") paired with OCR/captioning. The ScreenSpot-Pro/OSWorld-G/UI-Vision leaderboards measure "given a natural-language instruction, return the single correct coordinate," which is a VLM task. OmniParser's detector does not directly compete on that metric, so it does not appear at the top of those boards.
- Where OmniParser-style detection has been measured against modern grounding tasks (and where general detectors are referenced in the ScreenSpot-Pro launch context), pure-detection approaches land far below the 50–70% instruction-grounding scores of 2026 VLMs. The 18.9% launch baseline era is roughly the world OmniParser's detector lives in for the instruction-following metric, but again, that is the wrong metric for our architecture.
The honest read: On the leaderboards' own task (instruction → coordinate), OmniParser is surpassed by every 2026 VLM. On our task (detect-all-elements, hand the list + OCR to the host LLM, let the LLM pick), OmniParser is competitive and appropriate, because the heavy "which element matches the instruction" reasoning is done by mcp-screen's LLM, not the perception model. We are not running the same race the leaderboards score.
5. CPU-deployable alternatives
Ruthless assessment against the box: no GPU, 6 CPU threads, dual 4K (7680×2160), interactive latency. For an interactive screen driver, "feasible" means roughly sub-second to low-single-digit seconds per frame on CPU and a few GB RAM, not tens.
| Model | Params / size | ONNX / CPU feasible? | Grounding acc (best benchmark) | License | Verdict for our constraints |
|---|---|---|---|---|---|
| OmniParser icon_detect (current) | YOLOv8, ~11.6 MB ONNX | Yes — already running | N/A as VLM; detector-class | MIT (icon_detect; OmniParser repo) | Baseline. Fits the box. Keep unless beaten on CPU. |
| RapidOCR v3 (current) | small ONNX (det+rec) | Yes — already running | OCR, not grounding | Apache-2.0 | Keep; complementary, not a grounding competitor. |
| UI-Venus-1.5-30B-A3B | 30B MoE (~3B active) | No — MoE still loads 30B weights (~60 GB fp16 / ~15–30 GB quantized); CPU latency many seconds–minutes/frame | 69.6% SSPro / 70.6% OSWorld-G | Check repo (inclusionAI) | Non-starter. GPU-class memory + latency. |
| Gelato-30B-A3B | 30B MoE | No — same MoE memory wall on CPU | 63.9% SSPro | Check repo (mlfoundations) | Non-starter. |
| GTA1-72B | 72B dense | No — absurd on CPU | 58.4% SSPro | Check HF card | Non-starter. |
| GTA1-7B / UI-Venus-Ground-7B / Phi-Ground (<10B) | 7–8B VLM | Marginal-to-No — 7B VLM on 6 CPU threads at 4K input is multi-second to tens-of-seconds/frame; ~8–16 GB RAM quantized | 43–51% SSPro | Varies | Effectively a non-starter for interactive use. Possible for batch/offline only. Even quantized (GGUF/INT4) a 7B VLM at high-res input will not hit interactive latency on 6 CPU threads. |
| UI-Venus-1.5-2B | 2B VLM | Borderline — smallest credible grounding VLM; quantized ~1.5–3 GB; still likely 1–5 s/frame on CPU at downscaled input, slower at native 4K | 57.7% SSPro | Check repo | Re-evaluate candidate only — the single leaderboard model worth CPU-benchmarking, but expect it to be too slow at dual-4K native res; would require aggressive downscaling/tiling. |
| AdaZoom-GUI-4B / KV-Ground-8B pipelines | 4–8B + zoom-in | No — multi-pass zoom pipelines multiply latency; VLM core | 76.8–80.9% (official board) | Varies | Non-starter on CPU. The high official scores come from the expensive multi-crop pipeline. |
| Keep YOLO detector, retrain/upgrade detector (e.g., newer YOLO export, or fine-tune icon_detect on our app set) | ~10–50 MB ONNX | Yes | N/A (detection) | our choice | Best within-envelope improvement path if we want gains without leaving CPU. |
Bottom line for this section: There is no model on the 2026 grounding leaderboards that is both (a) meaningfully better at our actual task and (b) runnable on CPU at dual-4K interactive latency. The grounding-VLM SOTA is purchased entirely with GPU-class compute. Our constraints exclude the entire top of every board.
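The memory figures in the table follow from simple arithmetic on parameter count and bits per weight, sketched below. The function names are hypothetical, and the estimate covers weights only: activations, KV cache, and runtime overhead are ignored, so real RAM use would be higher. For a MoE model the total parameter count is used, since all experts must be resident even though only about 3B are active per token.

```python
OUR_MODEL_BYTES = 12_136_163  # verified size of models/onnx/model.onnx


def weight_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (10**9 bytes), weights only."""
    return params_billion * bits_per_weight / 8.0


def size_ratio_vs_ours(params_billion, bits_per_weight):
    """How many times larger the candidate's weights are than our ONNX file."""
    return weight_gb(params_billion, bits_per_weight) * 1e9 / OUR_MODEL_BYTES


def our_model_mib():
    """Our detector's size in MiB (2**20 bytes)."""
    return OUR_MODEL_BYTES / 2**20


# 30B MoE: 60 GB at fp16, 30 GB at 8-bit, 15 GB at 4-bit.
moe_30b = [weight_gb(30, bits) for bits in (16, 8, 4)]
# 72B dense at fp16: 144 GB.
dense_72b_fp16 = weight_gb(72, 16)
```

By this estimate, even a 4-bit 30B model carries over a thousand times the weight bytes of the current detector, which is the "different weight class" point in numeric form.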
6. Recommendation: STAY (with a bounded re-evaluation)
Call: STAY-AND-RE-EVALUATE. Confidence: high on STAY; the "re-evaluate" rider exists because we lack our own numbers, not because a swap target exists.
Why STAY
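- No CPU-deployable model beats us at our task by enough to justify churn. Every leaderboard leader is a multi-billion-parameter grounding VLM (4B to 235B among the official board's top five) that needs a GPU for interactive latency. On 6 CPU threads at 7680×2160, a 7B VLM is expected to take multiple seconds to minutes per frame, and a 30B MoE or 72B dense model is out of reach entirely (reasoned from model size, not measured). That kills interactivity, which is the product.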
- The leaderboards score a different task than we run. They score instruction→coordinate (VLM). We run detect-all-elements (lightweight detector) + OCR, and let mcp-screen's host LLM do the instruction reasoning. Swapping to a grounding VLM would also mean moving language reasoning into the perception layer, duplicating what our LLM already does — a Four-Ds "Different/Difficult" violation with no payoff under our compute budget.
- The `worldmodel.py` dHash cache already absorbs much of the per-frame cost by short-circuiting unchanged regions, further reducing the marginal benefit of a heavier per-frame model.
- The benchlm.ai/llm-stats "frontier VLMs top the board at ~88%" narrative is refuted/synthetic, so there is no credible signal that a drop-in better model even exists, let alone a CPU-runnable one.
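How a perceptual-hash cache short-circuits unchanged regions can be shown with a small sketch. It is not the contents of `worldmodel.py`: the `RegionCache` class, the table schema, the exact-match lookup (a real cache might tolerate a small Hamming distance), and the pre-downscaled 9×8 grayscale grid input are all assumptions. The hash itself is the standard difference hash: one bit per horizontally adjacent pixel pair.

```python
import json
import sqlite3


def dhash(gray):
    """64-bit difference hash of an 8-row x 9-column grayscale grid.

    Each bit is 1 when a pixel is brighter than its right-hand neighbour.
    Downscaling the screen region to 9x8 is assumed to happen upstream.
    """
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


class RegionCache:
    """Hypothetical SQLite store mapping a region hash to grounding output."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS regions (hash TEXT PRIMARY KEY, result TEXT)"
        )

    def get(self, h):
        # Stored as hex text because a 64-bit unsigned value can exceed
        # SQLite's signed INTEGER range.
        row = self.db.execute(
            "SELECT result FROM regions WHERE hash = ?", (format(h, "016x"),)
        ).fetchone()
        return row[0] if row else None

    def put(self, h, result):
        self.db.execute(
            "INSERT OR REPLACE INTO regions VALUES (?, ?)", (format(h, "016x"), result)
        )
        self.db.commit()


def ground_with_cache(cache, gray, detect):
    """Return (result, was_cache_hit); detect() runs only on a miss."""
    key = dhash(gray)
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit), True
    result = detect()
    cache.put(key, json.dumps(result))
    return result, False
```

On a mostly static desktop, most regions would take the cache-hit branch, so the detector's per-frame cost is paid only where pixels actually changed.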
Why the rider (what to actually do)
We are flying blind on our own numbers. Before v1.1 ships, run a small, honest, CPU, 6-thread, dual-4K benchmark:
- (a) Latency/RAM ceiling test: measure current OmniParser+RapidOCR ms/frame and RAM at native 4K and at our actual downscale, on 6 threads. Establishes the budget any replacement must beat.
- (b) Accuracy on OUR distribution: build a small held-out set of real screenshots from the apps we actually drive (dual-4K, our toolchain) with ground-truth click targets. Score current pipeline (element-detected recall + click-coordinate hit rate). This is the number that matters, not ScreenSpot-Pro.
- (c) One swap probe: quantized UI-Venus-1.5-2B (or a 2–4B grounding VLM) on CPU — measure ms/frame at our resolution. If (and only if) it lands under the latency ceiling from (a) AND beats (b) by a wide margin, reconsider. Expectation: it will fail the latency test at dual-4K; document the number so the question is closed with data.
- (d) Within-envelope upgrade probe: evaluate a newer/fine-tuned YOLO export of icon_detect on (b). This is the realistic path to gains without leaving CPU.
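Probes (a) through (c) could be scored with a harness along these lines. It is a sketch under assumptions, not an existing tool. The function names, the case dictionary layout, the center-in-target rule for "element detected", and the `min_gain=0.10` definition of "a wide margin" are all hypothetical. The click-hit rule (the predicted point falls inside the ground-truth box) mirrors the point-in-box criterion used by ScreenSpot-style benchmarks. RAM measurement is left out.

```python
import statistics
import time


def latency_ms(run_frame, frames, warmup=1):
    """Time run_frame(frame) per frame in milliseconds, after a short warmup."""
    for frame in frames[:warmup]:
        run_frame(frame)
    samples = []
    for frame in frames:
        start = time.perf_counter()
        run_frame(frame)
        samples.append((time.perf_counter() - start) * 1000.0)
    return {"median_ms": statistics.median(samples), "max_ms": max(samples)}


def inside(point, box):
    """True if the point lies within the (x1, y1, x2, y2) box."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


def box_center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def score(cases):
    """Recall and click hit rate over a non-empty list of cases.

    Each case has a ground-truth 'target' box, the detected 'boxes', and
    the final 'click' point chosen by the pipeline (or None).
    """
    detected = sum(
        any(inside(box_center(b), c["target"]) for b in c["boxes"]) for c in cases
    )
    hits = sum(c["click"] is not None and inside(c["click"], c["target"]) for c in cases)
    return {"recall": detected / len(cases), "hit_rate": hits / len(cases)}


def should_reconsider(candidate, baseline, latency_ceiling_ms, min_gain=0.10):
    """Gate from probe (c): under the latency ceiling AND better by a wide margin."""
    fast_enough = candidate["median_ms"] <= latency_ceiling_ms
    much_better = candidate["hit_rate"] - baseline["hit_rate"] >= min_gain
    return fast_enough and much_better
```

Both conditions must hold for a swap to be worth revisiting. A candidate that is more accurate but over the latency ceiling fails the gate, which is the expected outcome for a 2B VLM at dual-4K.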
Migration sketch (only if a probe unexpectedly wins)
If a 2–4B grounding VLM ever passes (a)+(b), grounding.py would change from "run YOLO detector → NMS → return all boxes; LLM picks" to "send instruction + (tiled/downscaled) screenshot to the grounding VLM → receive coordinate(s)." That inverts the data flow (instruction must reach the perception layer), requires tiling logic for 7680×2160 input, and likely an ONNX/GGUF export + quantization step. It is a significant refactor, not a model-file swap — another reason the bar must be cleared with real numbers first.
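The tiling logic mentioned above might look like the following sketch. The 1280-pixel tile size, 128-pixel overlap, and function names are hypothetical, chosen only to show the mechanics: cover the desktop with overlapping tiles, run the model per tile, and translate each returned point back to desktop coordinates.

```python
def tile_origins(length, tile, overlap):
    """Start offsets that cover [0, length) with tiles of size `tile`.

    Neighbouring tiles overlap by at least `overlap` pixels; the last tile
    is shifted back so it ends exactly at `length`.
    """
    if overlap >= tile:
        raise ValueError("overlap must be smaller than the tile size")
    if tile >= length:
        return [0]
    step = tile - overlap
    starts = list(range(0, length - tile, step))
    starts.append(length - tile)
    return starts


def tiles(width, height, tile=1280, overlap=128):
    """All (x, y, w, h) tile rectangles for a width x height desktop."""
    w, h = min(tile, width), min(tile, height)
    return [
        (x, y, w, h)
        for y in tile_origins(height, tile, overlap)
        for x in tile_origins(width, tile, overlap)
    ]


def to_desktop(point, tile_rect):
    """Translate a tile-local (x, y) point into desktop coordinates."""
    return (point[0] + tile_rect[0], point[1] + tile_rect[1])


# With these assumed settings a 7680 x 2160 desktop needs 7 x 2 = 14 tiles.
dual_4k_tiles = tiles(7680, 2160)
```

With these settings one frame becomes 14 model passes, so any per-pass VLM latency is multiplied accordingly unless the instruction or the dHash cache can narrow the search to a few tiles.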
7. Honesty & limitations
- Knowledge cutoff was January 2026. Everything dated Feb–May 2026 (UI-Venus-1.5, the May 2026 leaderboard snapshots, OmniParser "no V3 as of May 2026") rests on live web search performed late May 2026, synthesized from the provided sweep + adversarial-verification inputs — not on my training data.
- REFUTED / do-not-trust: The claim that frontier general VLMs (Claude Opus 4.8 87.9%, GPT-5.x ~85%, Gemini 3.x ~84%, Meta "Muse Spark" ~84%) top ScreenSpot-Pro is contradicted by the official leaderboard JSON (max 80.9%, no such entries; only Claude entry 17.1%, GPT-5 entries 18.5%/6.0%). benchlm.ai and llm-stats.com figures appear fabricated/synthetic. I did not build any recommendation on them.
- CONFIRMED (primary sources): ScreenSpot-Pro 18.9% launch baseline (OS-Atlas-7B, arXiv:2504.07981); UI-Venus-1.5-30B-A3B = 69.6% SSPro / 70.6% OSWorld-G / 75.0% VenusBench-GD, is MoE, beats MAI-UI-32B 67.9%/67.6% and GTA1-32B 65.2% (arXiv:2602.09082, read verbatim); Gelato-30B figures (mlfoundations repo); GTA1 figures (HF card / arXiv:2507.05791); OSWorld-G / UI-Vision launch baselines and 2026 leaders.
- UNVERIFIED secondary (treat with caution): LASER 55.7%, UI-AGILE-7B 48.7%, GUI-G2-72B 47.5%, MEGA-GUI 68.63%, and exact GTA1-32B numbers — sourced from emergentmind / secondary aggregators, not primary tables. VenusBench-GD is a vendor benchmark (medium confidence).
- Verified against live code this session: `grounding.py` and `README.md` were read directly. Confirmed the OmniParser YOLOv8 ONNX at `models/onnx/model.onnx` (12,136,163 bytes ≈ 11.6 MiB), CPU-only onnxruntime with default 6 intra-op threads, RapidOCR, OpenCV NMS, and the detector → OCR-naming → Set-of-Marks flow. The one detail NOT pinned down: the exact upstream OmniParser checkpoint version / training date of the bundled `model.onnx` (the file is a bare export with no embedded version string surfaced here). Low-risk, but worth recording the provenance hash in v1.1.
- No CPU latency numbers were measured. All "non-starter on CPU" verdicts are reasoned from model size + the no-GPU/6-thread/dual-4K envelope, not from benchmarks I ran. The Section 6 rider exists precisely to replace that reasoning with measurements.
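Recording the provenance hash suggested above takes only the standard library. The sketch below is one possible way to do it; the function name `file_provenance` and the returned dictionary layout are assumptions, and no hash value for the real `model.onnx` is claimed here.

```python
import hashlib


def file_provenance(path, chunk_size=1 << 20):
    """Return the byte size and SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return {"path": path, "bytes": size, "sha256": digest.hexdigest()}
```

Running it on `models/onnx/model.onnx` and checking that `bytes` equals 12,136,163 would tie the recorded digest to the file verified in this session.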
8. Sources
- https://gui-agent.github.io/grounding-leaderboard/ — official ScreenSpot-Pro / grounding leaderboard (authoritative; refutes the frontier-VLM claim)
- https://arxiv.org/abs/2504.07981 — ScreenSpot-Pro launch paper (18.9% baseline)
- https://benchlm.ai/benchmarks/screenSpotPro — third-party aggregator (content REFUTED/unreliable)
- https://llm-stats.com/benchmarks/screenspot-pro — aggregator (figures uncorroborated by official source)
- https://llm-stats.com/benchmarks/osworld-g
- https://llm-stats.com/benchmarks/screenspot
- https://arxiv.org/abs/2602.09082 — UI-Venus-1.5 Technical Report (current dedicated SOTA)
- https://arxiv.org/html/2602.09082v1
- https://ui-venus.github.io/UI-Venus-1.5/
- https://huggingface.co/inclusionAI/UI-Venus-1.5-30B-A3B
- https://arxiv.org/abs/2508.10833 — UI-Venus 1.0 Technical Report
- https://huggingface.co/papers/2508.10833
- https://github.com/inclusionAI/UI-Venus
- https://huggingface.co/inclusionAI/UI-Venus-Ground-72B
- https://huggingface.co/inclusionAI/UI-Venus-Ground-7B
- https://github.com/mlfoundations/Gelato — Gelato-30B-A3B
- https://huggingface.co/mlfoundations/Gelato-30B-A3B
- https://www.marktechpost.com/2025/11/10/gelato-30b-a3b-a-state-of-the-art-grounding-model-for-gui-computer-use-tasks-surpassing-computer-grounding-models-like-gta1-32b/
- https://arxiv.org/abs/2507.05791 — GTA1 paper
- https://huggingface.co/HelloKKMe/GTA1-72B — GTA1-72B card
- https://github.com/Yan98/GTA1
- https://www.alphaxiv.org/overview/2507.05791
- https://arxiv.org/abs/2507.23779 — Phi-Ground tech report
- https://arxiv.org/abs/2505.13227 — Jedi / OSWorld-G paper
- https://osworld-grounding.github.io/ — OSWorld-G project page
- https://github.com/xlang-ai/osworld-g
- https://arxiv.org/abs/2503.15661 — UI-Vision paper
- https://uivision.github.io/
- https://arxiv.org/abs/2509.02544 — UI-TARS-2
- https://arxiv.org/pdf/2511.13087 — MEGA-GUI
- https://ui-venus.github.io/VenusBench-GD/ — VenusBench-GD
- https://arxiv.org/pdf/2512.16501 — VenusBench-GD paper
- https://www.emergentmind.com/topics/screenspot-pro — secondary method-level scores (UNVERIFIED)
- https://github.com/QwenLM/Qwen3-VL/issues/1576 — Qwen3-VL out-of-box grounding sensitivity
- https://github.com/microsoft/OmniParser — OmniParser repo (V2 latest, Feb 2025)
- https://huggingface.co/microsoft/OmniParser-v2.0
- https://www.microsoft.com/en-us/research/articles/omniparser-for-pure-vision-based-gui-agent/
- https://learnopencv.com/omniparser-v2-bridging-vision-and-action-for-gui-agents/
- https://github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding
- https://huggingface.co/datasets/likaixin/ScreenSpot-Pro
- https://arxiv.org/abs/2507.22291