mcp-screen v1.1: GUI Grounding Model Research

Date: 2026-05-31
Author: Research subagent (Claude Opus 4.8)
Question: Is the OmniParser ONNX model we run at the top of the May 2026 GUI-grounding charts? Should we swap it?

Grounding in the live code: Section 2 below was verified against the live /home/andrew/mcp-screen/grounding.py and README.md this session (not just the task brief). Confirmed: the model file is /home/andrew/mcp-screen/models/onnx/model.onnx (12,136,163 bytes ≈ 11.6 MiB); it is loaded via onnxruntime CPUExecutionProvider with intra_op_num_threads = MCP_SCREEN_CPU_THREADS (default 6) and inter_op_num_threads = 1; OCR is RapidOCR; OpenCV does letterbox + NMS. The worldmodel.py file exists alongside but is the dHash cache, not a model (see below).
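As a minimal sketch of the loading configuration confirmed above, the session setup could look like the following. The helper names `cpu_threads` and `make_session` are hypothetical, and the way invalid or empty values of MCP_SCREEN_CPU_THREADS are handled is an assumption; only the provider, the two thread settings, and the model path come from the verified code.

```python
import os


def cpu_threads(default: int = 6) -> int:
    """Read the intra-op thread count from MCP_SCREEN_CPU_THREADS.

    Falling back to `default` on missing, non-numeric, or non-positive
    values is an assumption, not confirmed behaviour of grounding.py.
    """
    raw = os.environ.get("MCP_SCREEN_CPU_THREADS", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def make_session(model_path: str = "models/onnx/model.onnx"):
    """Build a CPU-only onnxruntime session (hypothetical helper)."""
    # Imported lazily so cpu_threads() works without onnxruntime installed.
    import onnxruntime as ort

    opts = ort.SessionOptions()
    opts.intra_op_num_threads = cpu_threads()  # threads used inside one operator
    opts.inter_op_num_threads = 1              # no parallelism across operators
    return ort.InferenceSession(
        model_path, sess_options=opts, providers=["CPUExecutionProvider"]
    )
```

Listing only `CPUExecutionProvider` keeps the session from silently selecting another backend if one happens to be installed.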

1. TL;DR

No — the OmniParser ONNX we run is not, and was never designed to be, "top of the charts," and that is fine. Recommendation: STAY (CPU-bound), with a small targeted re-evaluation. Confidence: high on the "stay" call, medium-high on the exact 2026 leaderboard numbers.

Every model at the top of the May 2026 GUI-grounding leaderboards (ScreenSpot-Pro, OSWorld-G, UI-Vision) is a multi-billion-parameter vision-language model: UI-Venus-1.5-30B-A3B (69.6% ScreenSpot-Pro), Gelato-30B-A3B (63.9%), GTA1-72B (58.4%), or a zoom-in pipeline built on an 8B–32B VLM (KV-Ground and MAI-UI-32B, at ~77–81% on the official board). None of these is CPU-deployable under our hard constraints (no GPU, 6 CPU threads, dual 4K = 7680×2160). A 30B MoE or 72B dense VLM that needs multiple seconds to multiple minutes per frame on CPU is a non-starter for an interactive screen driver.

OmniParser's icon_detect (a ~11.6 MiB YOLOv8 ONNX) plus RapidOCR is in a completely different weight class: it is a lightweight detector, not a grounding VLM, and it is the correct class of model for our deployment envelope. The leaderboards contain no CPU-runnable model that would beat our current pipeline by enough to justify the latency and RAM cost. The honest gap is that we lack our own numbers on our own benchmark (dual-4K, real apps), so we should measure before any future swap. Today, though, no swap target fits the box.

2. What we actually run

Two things are frequently conflated; only one is an ML model.

worldmodel.py is NOT an ML model. It is a SQLite dHash (perceptual-hash) cache of the perception layer's output. It stores prior grounding results keyed by a perceptual hash of the screen region so that unchanged regions can be served from cache instead of re-running detection. No leaderboard applies to it. It is not a competitor to any grounding model and is explicitly out of scope for this benchmarking exercise. Benchmarking it against UI-Venus etc. would be a category error.
The only ML model in scope is the perception / grounding layer, which is:
OmniParser icon_detect — a YOLOv8 single-class detector exported to ONNX at models/onnx/model.onnx (verified 12,136,163 bytes ≈ 11.6 MiB). Job: detect interactable UI elements / icons and emit bounding boxes. Loaded via onnxruntime CPUExecutionProvider, intra_op_num_threads=MCP_SCREEN_CPU_THREADS (default 6), inter_op_num_threads=1. Inference is letterbox→sess.run→threshold→NMS at imgsz=640. No torch.
RapidOCR (from rapidocr import RapidOCR, v3 output object with .boxes/.txts/.scores) — text detection + recognition, also ONNX/CPU.
OpenCV — letterboxing, cv2.dnn.NMSBoxes, and geometry glue. A classical-CV Canny+contour path (cv_regions) exists only as a fallback when the OmniParser ONNX / onnxruntime is unavailable.
All of the above run CPU-only, and both models (icon_detect and RapidOCR) run as ONNX. annotate() then names each icon box by the OCR text inside it and draws the Set-of-Marks overlay; the host LLM picks the numbered mark.
The model's actual job is pure screen grounding / element detection — output is bounding boxes + click coordinates for UI elements. It does not need to be a VLM agent or planner. mcp-screen's host LLM does the reasoning ("which element, why, what next"); the perception layer only has to answer "where are the clickable things and what text is on screen." This framing matters enormously for the swap question: the leaderboard SOTA models are end-to-end grounding VLMs that take a natural-language instruction and return a coordinate. That is a different and heavier task than "detect all elements," and it inherently requires a language model in the loop. Our architecture deliberately keeps the language model out of the perception layer.
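The letterbox→sess.run→threshold→NMS sequence can be illustrated without any dependencies. This is a sketch rather than the code in grounding.py: the pure-Python `nms` below stands in for cv2.dnn.NMSBoxes, all function names are hypothetical, and the score and IoU thresholds are placeholder values because the real ones are not recorded here.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def letterbox_params(width: int, height: int, imgsz: int = 640) -> Tuple[float, float, float]:
    """Return (scale, pad_x, pad_y) for fitting a frame into an imgsz x imgsz square."""
    scale = min(imgsz / width, imgsz / height)
    pad_x = (imgsz - width * scale) / 2
    pad_y = (imgsz - height * scale) / 2
    return scale, pad_x, pad_y


def to_screen(box: Box, scale: float, pad_x: float, pad_y: float) -> Box:
    """Map a box from letterboxed model space back to screen pixels."""
    x1, y1, x2, y2 = box
    return ((x1 - pad_x) / scale, (y1 - pad_y) / scale,
            (x2 - pad_x) / scale, (y2 - pad_y) / scale)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        score_thr: float = 0.3, iou_thr: float = 0.5) -> List[int]:
    """Greedy NMS: drop low scores, then keep the best box of each overlapping group."""
    order = sorted((i for i, s in enumerate(scores) if s >= score_thr),
                   key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep
```

If the full 7680×2160 desktop were letterboxed in a single pass, the scale factor would be 640/7680 = 1/12, so a 48-pixel-wide icon would span about 4 pixels in model space. That arithmetic is one reason the downscale actually used matters for the benchmark in Section 6.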

Deployment envelope (hard constraints): No GPU. CPU-only. 6 CPU threads. Linux. Driving dual 4K monitors (7680×2160 combined). Any model needing a GPU, tens of GB of RAM, or multi-second-per-frame CPU latency is effectively a non-starter for interactive use.

3. May 2026 leaderboard landscape

Three benchmarks differentiate 2026 grounding SOTA. The original ScreenSpot and ScreenSpot-v2 are saturated (~92–96% for top open models) and no longer discriminative; ignore them for model selection.

ScreenSpot-Pro (high-res professional apps; the headline benchmark)

ScreenSpot-Pro launched in 2025 with a brutal baseline: the best model scored only 18.9% (OS-Atlas-7B; confirmed against the official leaderboard JSON and the launch paper arXiv:2504.07981). Scores have since climbed dramatically. There are two conflicting pictures of the top, and the discrepancy is important and honestly reported here:

Picture A — the OFFICIAL ScreenSpot-Pro leaderboard (gui-agent.github.io/grounding-leaderboard, the canonical JSON parsed in full during adversarial verification). Top entries are GUI-specialist / zoom-in pipelines, not general chat models:

Rank	Model	ScreenSpot-Pro
1	KV-Ground-8B + Qwen3.5-27B Consistency Router	80.9%
2	KV-Ground-GuiOwl1.5-0315-8B-ZoomIn	80.5%
3	Holo2-235B-A22B (Agentic)	78.5%
4	MAI-UI-32B	77.5%
5	AdaZoom-GUI-4B	76.8%

On this official board the maximum score is 80.9%, there is no Claude Opus / GPT-5 / Gemini / "Muse Spark" entry near the top, the only Claude entry is "Claude (Computer Use)" at 17.1% (rank 73/86), and GPT-5 entries are 18.5% / 6.0%.

Picture B: third-party aggregators (benchlm.ai, llm-stats.com) claim that frontier general VLMs lead at ~84–88% (Claude Opus 4.8 87.9%, GPT-5.4 85.4%, Gemini 3.1 Pro 84.4%, Meta "Muse Spark" 84.1%). This picture is REFUTED. Adversarial verification against the official benchmark data found that none of those models appear on the official board; the figures look fabricated or synthetic. Both aggregator sites are reachable (HTTP 200), but their content directly conflicts with the authoritative source. Do not rely on the "frontier VLMs top ScreenSpot-Pro at ~88%" claim: it is contradicted by the primary source.

Dedicated open grounding models on ScreenSpot-Pro (verified per-variant, primary sources):

Model	Params	ScreenSpot-Pro	Source
UI-Venus-1.5-30B-A3B (MoE)	30B MoE (A3B active)	69.6%	arXiv:2602.09082 (confirmed verbatim)
UI-Venus-1.5-8B	8B	68.4%	arXiv:2602.09082
MAI-UI-32B	32B	67.9% (paper) / 77.5% (official board, w/ pipeline)	arXiv:2602.09082; official board
Gelato-30B-A3B	30B MoE	63.88%	github.com/mlfoundations/Gelato
GTA1-72B	72B	58.4%	HF HelloKKMe/GTA1-72B
UI-Venus-1.5-2B	2B	57.7%	arXiv:2602.09082
LASER (GTA1-7B based)	7B + method	55.7%	emergentmind (UNVERIFIED secondary)
UI-Venus-Ground-7B (v1.0)	7B	50.8%	arXiv:2508.10833
GTA1-7B	7B	50.1%	HF GTA1
UI-AGILE-7B	7B	48.7%	emergentmind (UNVERIFIED secondary)
GUI-G2-72B	72B	47.5%	emergentmind (UNVERIFIED secondary)
Phi-Ground	<10B	43.2%	arXiv:2507.23779

OSWorld-G (xlang-ai, NeurIPS 2025; 564 fine-grained samples)

Launch baseline Jedi-7B 54.1%. 2026 leaders (verified): UI-Venus-1.5-30B 70.6% (76.4% on OSWorld-G-Refined), Gelato-30B 69.15% / 74.65% Refined, MEGA-GUI 68.63%, GTA1-7B 67.7% (notably edging GTA1-72B's 66.7%), MAI-UI-32B 67.6%, GTA1-32B 65.2%.

UI-Vision (ServiceNow, ICML 2025; hardest desktop benchmark)

Launch leader UI-TARS-72B managed only 31.4% element grounding / 24.3% layout grounding; Claude-3.7-Sonnet 9.48% element. By Feb 2026 UI-Venus-1.5-30B reaches 69.0% element grounding (prior best MAI-UI-32B 59.1%). Phi-Ground earlier reported 27.2%.

Other / newer

VenusBench-GD (arXiv:2512.16501; multi-platform, 6,100+ pairs, 97+ apps): UI-Venus-1.5-30B 75.0%. (Confidence: medium — vendor benchmark.)
ScreenSpot-v2 / ScreenSpot: saturated, ~92–96% top; not discriminative.

Lineage summary (dedicated-grounding SOTA over time)

UI-TARS-1.5 (~61.6%, mid-2025) → GTA1-72B (58.4%) / UI-Venus-Ground-72B (61.9%, Aug 2025) → Gelato-30B (63.9%, Nov 2025) → UI-Venus-1.5-30B-A3B (69.6%, Feb 2026, current verified dedicated SOTA).

4. Where OmniParser stands

OmniParser is in a different category and is effectively absent from the modern grounding leaderboards — by design. Key facts:

OmniParser V2 (Microsoft, Feb 2025) is the latest version. No V3 / v2.1 was found as of May 2026 — development appears stalled (last substantive milestones: Feb 2025 V2 checkpoints, March 2025 OmniTool integration). This is a real risk signal: the model we depend on is not actively advancing.
OmniParser is not an instruction-conditioned grounding VLM. Its icon_detect is a YOLO detector ("find all interactable elements") paired with OCR/captioning. The ScreenSpot-Pro/OSWorld-G/UI-Vision leaderboards measure "given a natural-language instruction, return the single correct coordinate," which is a VLM task. OmniParser's detector does not directly compete on that metric, so it does not appear at the top of those boards.
Where OmniParser-style detection has been measured on modern grounding tasks (and where general detectors are referenced in the ScreenSpot-Pro launch context), pure-detection approaches land far below the 50–70% instruction-grounding scores of 2026 VLMs. On the instruction-following metric, OmniParser's detector belongs roughly to the era of the 18.9% launch baseline. Again, that is the wrong metric for our architecture.

The honest read: On the leaderboards' own task (instruction → coordinate), OmniParser is surpassed by every 2026 VLM. On our task (detect-all-elements, hand the list + OCR to the host LLM, let the LLM pick), OmniParser is competitive and appropriate, because the heavy "which element matches the instruction" reasoning is done by mcp-screen's LLM, not the perception model. We are not running the same race the leaderboards score.
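The detect-then-name flow described above (detect all elements, attach OCR text, hand a numbered list to the host LLM) can be sketched as follows. This is an illustration under stated assumptions, not the annotate() implementation: the containment rule (an OCR box belongs to an icon when its center lies inside the icon box), the axis-aligned OCR boxes, and the output field names are all hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def center_inside(inner: Box, outer: Box) -> bool:
    """True when the center of `inner` lies within `outer`."""
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]


def name_icons(icon_boxes: List[Box], ocr: List[Tuple[Box, str]]) -> List[Dict]:
    """Number each detected box and label it with any OCR text found inside it."""
    marks = []
    for number, icon in enumerate(icon_boxes, start=1):
        words = [text for box, text in ocr if center_inside(box, icon)]
        marks.append({
            "mark": number,                      # the Set-of-Marks number shown to the LLM
            "label": " ".join(words) or None,    # None when no text falls inside the box
            "click": ((icon[0] + icon[2]) / 2, (icon[1] + icon[3]) / 2),
        })
    return marks
```

The perception layer emits every candidate; choosing which mark matches the instruction stays with the host LLM, which is the division of labor this section argues for.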

5. CPU-deployable alternatives

Ruthless assessment against the box: no GPU, 6 CPU threads, dual 4K (7680×2160), interactive latency. For an interactive screen driver, "feasible" means roughly sub-second to low-single-digit seconds per frame on CPU and a few GB RAM, not tens.

Model	Params / size	ONNX / CPU feasible?	Grounding acc (best benchmark)	License	Verdict for our constraints
OmniParser icon_detect (current)	YOLOv8, ~11.6 MiB ONNX	Yes (already running)	N/A as VLM; detector-class	AGPL per the OmniParser model card (icon_detect inherits the YOLO license; the caption models are MIT)	Baseline. Fits the box. Keep unless beaten on CPU.
RapidOCR v3 (current)	small ONNX (det+rec)	Yes — already running	OCR, not grounding	Apache-2.0	Keep; complementary, not a grounding competitor.
UI-Venus-1.5-30B-A3B	30B MoE (~3B active)	No — MoE still loads 30B weights (~60 GB fp16 / ~15–30 GB quantized); CPU latency many seconds–minutes/frame	69.6% SSPro / 70.6% OSWorld-G	Check repo (inclusionAI)	Non-starter. GPU-class memory + latency.
Gelato-30B-A3B	30B MoE	No — same MoE memory wall on CPU	63.9% SSPro	Check repo (mlfoundations)	Non-starter.
GTA1-72B	72B dense	No — absurd on CPU	58.4% SSPro	Check HF card	Non-starter.
GTA1-7B / UI-Venus-Ground-7B / Phi-Ground (<10B)	7–8B VLM	Marginal-to-No — 7B VLM on 6 CPU threads at 4K input is multi-second to tens-of-seconds/frame; ~8–16 GB RAM quantized	43–51% SSPro	Varies	Effectively a non-starter for interactive use. Possible for batch/offline only. Even quantized (GGUF/INT4) a 7B VLM at high-res input will not hit interactive latency on 6 CPU threads.
UI-Venus-1.5-2B	2B VLM	Borderline — smallest credible grounding VLM; quantized ~1.5–3 GB; still likely 1–5 s/frame on CPU at downscaled input, slower at native 4K	57.7% SSPro	Check repo	Re-evaluate candidate only — the single leaderboard model worth CPU-benchmarking, but expect it to be too slow at dual-4K native res; would require aggressive downscaling/tiling.
AdaZoom-GUI-4B / KV-Ground-8B pipelines	4–8B + zoom-in	No — multi-pass zoom pipelines multiply latency; VLM core	76.8–80.9% (official board)	Varies	Non-starter on CPU. The high official scores come from the expensive multi-crop pipeline.
Keep YOLO detector, retrain/upgrade detector (e.g., newer YOLO export, or fine-tune icon_detect on our app set)	~10–50 MB ONNX	Yes	N/A (detection)	our choice	Best within-envelope improvement path if we want gains without leaving CPU.
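The memory figures in the table can be sanity-checked with weights-only arithmetic: parameter count times bytes per weight. The sketch below is a lower bound under that assumption; it ignores activations, KV cache, the vision encoder's working memory, and runtime overhead, so real resident memory will be higher. The precision labels and function name are illustrative.

```python
# Bytes stored per weight at each precision (weights only).
BYTES_PER_WEIGHT = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Weights-only footprint in decimal GB: (params * 1e9 * bytes) / 1e9."""
    return params_billion * BYTES_PER_WEIGHT[precision]


for name, params in [("30B MoE", 30), ("72B dense", 72), ("7B", 7), ("2B", 2)]:
    row = ", ".join(f"{p}: {weight_gb(params, p):.1f} GB" for p in BYTES_PER_WEIGHT)
    print(f"{name:10s} {row}")
```

For the 30B MoE this gives 60 GB at fp16 and 15–30 GB at int4/int8, matching the table. A mixture-of-experts model activates only a fraction of its weights per token, which reduces compute but not the weights that must be resident.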

Bottom line for this section: There is no model on the 2026 grounding leaderboards that is both (a) meaningfully better at our actual task and (b) runnable on CPU at dual-4K interactive latency. The grounding-VLM SOTA is purchased entirely with GPU-class compute. Our constraints exclude the entire top of every board.

6. Recommendation: STAY (with a bounded re-evaluation)

Call: STAY-AND-RE-EVALUATE. Confidence: high on STAY; the "re-evaluate" rider exists because we lack our own numbers, not because a swap target exists.

Why STAY

No CPU-deployable model beats us at our task by enough to justify churn. Every leaderboard leader is a multi-billion-parameter grounding VLM (4B to 235B on the official board) that needs GPU-class compute. On 6 CPU threads at 7680×2160, we expect a 7B VLM to take multiple seconds to minutes per frame, and a 30B MoE or 72B dense model to be impractical. That kills interactivity, which is the product.
The leaderboards score a different task than we run. They score instruction→coordinate (VLM). We run detect-all-elements (lightweight detector) + OCR, and let mcp-screen's host LLM do the instruction reasoning. Swapping to a grounding VLM would also mean moving language reasoning into the perception layer, duplicating what our LLM already does — a Four-Ds "Different/Difficult" violation with no payoff under our compute budget.
The worldmodel.py dHash cache already absorbs much of the per-frame cost by short-circuiting unchanged regions, which further reduces the marginal benefit of a heavier per-frame model.
The benchlm.ai/llm-stats "frontier VLMs top the board at ~88%" narrative is refuted/synthetic, so there is no credible signal that a drop-in better model even exists, let alone a CPU-runnable one.
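For readers unfamiliar with the dHash cache referenced in the list above, a difference hash can be sketched in a few lines. This shows the general technique only: how worldmodel.py actually resizes regions, what hash size it uses, and what distance counts as "unchanged" are not recorded here, so the 8×8 grid and the nearest-neighbour `shrink` are assumptions.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale values


def shrink(img: Gray, width: int, height: int) -> Gray:
    """Nearest-neighbour downsample (a stand-in for a proper image resize)."""
    src_h, src_w = len(img), len(img[0])
    return [[img[y * src_h // height][x * src_w // width] for x in range(width)]
            for y in range(height)]


def dhash(img: Gray, size: int = 8) -> int:
    """Difference hash: one bit per horizontally adjacent pixel pair on a (size+1) x size grid."""
    small = shrink(img, size + 1, size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")
```

Because each bit encodes only whether brightness falls from one cell to the next, a uniform brightness shift leaves the hash unchanged, while a real layout change flips bits. A cache would compare the Hamming distance against a threshold before deciding to skip detection.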

Why the rider (what to actually do)

We are flying blind on our own numbers. Before v1.1 ships, run a small, honest, CPU, 6-thread, dual-4K benchmark:
- (a) Latency/RAM ceiling test: measure current OmniParser+RapidOCR ms/frame and RAM at native 4K and at our actual downscale, on 6 threads. Establishes the budget any replacement must beat.
- (b) Accuracy on OUR distribution: build a small held-out set of real screenshots from the apps we actually drive (dual-4K, our toolchain) with ground-truth click targets. Score current pipeline (element-detected recall + click-coordinate hit rate). This is the number that matters, not ScreenSpot-Pro.
- (c) One swap probe: quantized UI-Venus-1.5-2B (or a 2–4B grounding VLM) on CPU; measure ms/frame at our resolution. If (and only if) it lands under the latency ceiling from (a) AND beats (b) by a wide margin, reconsider. Expectation: it will fail the latency test at dual-4K; document the number so the question is closed with data.
- (d) Within-envelope upgrade probe: evaluate a newer/fine-tuned YOLO export of icon_detect on (b). This is the realistic path to gains without leaving CPU.
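The measurements in (a) through (c) could be collected with a harness along these lines. It is a sketch under assumptions: `run_frame` is a hypothetical callable wrapping whatever pipeline is being timed, the 10-point accuracy margin is a placeholder for the unquantified "wide margin", and the function names are illustrative.

```python
import statistics
import time
from typing import Callable, Dict, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Point = Tuple[float, float]


def latency_ms(run_frame: Callable[[object], object],
               frames: Sequence[object], warmup: int = 2) -> Dict[str, float]:
    """Time run_frame on every frame after a short warmup; report ms statistics."""
    for frame in frames[:warmup]:
        run_frame(frame)  # warm caches and lazy initialisation; not timed
    samples = []
    for frame in frames:
        start = time.perf_counter()
        run_frame(frame)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(round(0.95 * (len(samples) - 1))))
    return {"median": statistics.median(samples),
            "p95": samples[p95_index],
            "max": samples[-1]}


def click_hit_rate(clicks: Sequence[Point], targets: Sequence[Box]) -> float:
    """Fraction of predicted click points that land inside their ground-truth box."""
    hits = sum(1 for (x, y), (x1, y1, x2, y2) in zip(clicks, targets)
               if x1 <= x <= x2 and y1 <= y <= y2)
    return hits / len(targets) if targets else 0.0


def swap_is_justified(cand_ms: float, ceiling_ms: float,
                      cand_hit: float, base_hit: float, margin: float = 0.10) -> bool:
    """Rule from (c): under the latency ceiling AND better by a wide margin."""
    return cand_ms <= ceiling_ms and cand_hit >= base_hit + margin
```

On Linux, peak memory for the RAM half of (a) can be read after the run with `resource.getrusage(resource.RUSAGE_SELF).ru_maxrss`, which reports kibibytes.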

Migration sketch (only if a probe unexpectedly wins)

If a 2–4B grounding VLM ever passes (a)+(b), grounding.py would change from "run YOLO detector → NMS → return all boxes; LLM picks" to "send instruction + (tiled/downscaled) screenshot to the grounding VLM → receive coordinate(s)." That inverts the data flow (instruction must reach the perception layer), requires tiling logic for 7680×2160 input, and likely an ONNX/GGUF export + quantization step. It is a significant refactor, not a model-file swap — another reason the bar must be cleared with real numbers first.
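To give a sense of what the tiling logic mentioned above involves, here is a minimal sketch that covers a frame with overlapping square tiles. The 1280-pixel tile size and 128-pixel overlap are placeholder assumptions, not values from grounding.py or from any model's documentation, and the function names are hypothetical.

```python
from typing import List, Tuple

Tile = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in screen pixels


def axis_starts(length: int, tile: int, overlap: int) -> List[int]:
    """Start offsets along one axis so that tiles overlap and reach the far edge."""
    if length <= tile:
        return [0]
    stride = tile - overlap  # overlap must be smaller than tile
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # final tile sits flush with the far edge
    return starts


def tiles(width: int, height: int, tile: int = 1280, overlap: int = 128) -> List[Tile]:
    """Row-major list of tiles covering a width x height frame."""
    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in axis_starts(height, tile, overlap)
            for x in axis_starts(width, tile, overlap)]
```

With these placeholder values a 7680×2160 frame needs 7 columns by 2 rows, or 14 tiles, so a per-tile VLM would run 14 passes per frame before any merging of results. That multiplication is part of why the latency bar in Section 6 is hard to clear.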

7. Honesty & limitations

Knowledge cutoff was January 2026. Everything dated Feb–May 2026 (UI-Venus-1.5, the May 2026 leaderboard snapshots, OmniParser "no V3 as of May 2026") rests on live web search performed late May 2026, synthesized from the provided sweep + adversarial-verification inputs — not on my training data.
REFUTED / do-not-trust: The claim that frontier general VLMs (Claude Opus 4.8 87.9%, GPT-5.x ~85%, Gemini 3.x ~84%, Meta "Muse Spark" ~84%) top ScreenSpot-Pro is contradicted by the official leaderboard JSON (max 80.9%, no such entries; only Claude entry 17.1%, GPT-5 entries 18.5%/6.0%). benchlm.ai and llm-stats.com figures appear fabricated/synthetic. I did not build any recommendation on them.
CONFIRMED (primary sources): ScreenSpot-Pro 18.9% launch baseline (OS-Atlas-7B, arXiv:2504.07981); UI-Venus-1.5-30B-A3B = 69.6% SSPro / 70.6% OSWorld-G / 75.0% VenusBench-GD, is MoE, beats MAI-UI-32B 67.9%/67.6% and GTA1-32B 65.2% (arXiv:2602.09082, read verbatim); Gelato-30B figures (mlfoundations repo); GTA1 figures (HF card / arXiv:2507.05791); OSWorld-G / UI-Vision launch baselines and 2026 leaders.
UNVERIFIED secondary (treat with caution): LASER 55.7%, UI-AGILE-7B 48.7%, GUI-G2-72B 47.5%, MEGA-GUI 68.63%, and exact GTA1-32B numbers — sourced from emergentmind / secondary aggregators, not primary tables. VenusBench-GD is a vendor benchmark (medium confidence).
Verified against live code this session: grounding.py and README.md were read directly. Confirmed the OmniParser YOLOv8 ONNX at models/onnx/model.onnx (12,136,163 bytes ≈ 11.6 MiB), CPU-only onnxruntime with default 6 intra-op threads, RapidOCR, OpenCV NMS, and the detector→OCR-naming→Set-of-Marks flow. The one detail NOT pinned down: the exact upstream OmniParser checkpoint version / training date of the bundled model.onnx (the file is a bare export with no embedded version string surfaced here). Low-risk, but worth recording the provenance hash in v1.1.
No CPU latency numbers were measured. All "non-starter on CPU" verdicts are reasoned from model size + the no-GPU/6-thread/dual-4K envelope, not from benchmarks I ran. The Section 6 rider exists precisely to replace that reasoning with measurements.
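Recording the provenance hash suggested above takes only the standard library. The sketch below computes a SHA-256 digest and byte size for a file; the function names and the shape of the record are hypothetical, and where v1.1 stores the result is left open.

```python
import hashlib
from pathlib import Path


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large models are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(path: str) -> dict:
    """Size and digest for a model file, suitable for pinning in release notes."""
    return {"file": path, "bytes": Path(path).stat().st_size, "sha256": file_sha256(path)}
```

Running `provenance_record("models/onnx/model.onnx")` on the deployed box should report 12,136,163 bytes if the file matches the one verified this session; the digest then pins the exact export even though the upstream checkpoint version is unknown.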

8. Sources

https://gui-agent.github.io/grounding-leaderboard/ — official ScreenSpot-Pro / grounding leaderboard (authoritative; refutes the frontier-VLM claim)
https://arxiv.org/abs/2504.07981 — ScreenSpot-Pro launch paper (18.9% baseline)
https://benchlm.ai/benchmarks/screenSpotPro — third-party aggregator (content REFUTED/unreliable)
https://llm-stats.com/benchmarks/screenspot-pro — aggregator (figures uncorroborated by official source)
https://llm-stats.com/benchmarks/osworld-g
https://llm-stats.com/benchmarks/screenspot
https://arxiv.org/abs/2602.09082 — UI-Venus-1.5 Technical Report (current dedicated SOTA)
https://arxiv.org/html/2602.09082v1
https://ui-venus.github.io/UI-Venus-1.5/
https://huggingface.co/inclusionAI/UI-Venus-1.5-30B-A3B
https://arxiv.org/abs/2508.10833 — UI-Venus 1.0 Technical Report
https://huggingface.co/papers/2508.10833
https://github.com/inclusionAI/UI-Venus
https://huggingface.co/inclusionAI/UI-Venus-Ground-72B
https://huggingface.co/inclusionAI/UI-Venus-Ground-7B
https://github.com/mlfoundations/Gelato — Gelato-30B-A3B
https://huggingface.co/mlfoundations/Gelato-30B-A3B
https://www.marktechpost.com/2025/11/10/gelato-30b-a3b-a-state-of-the-art-grounding-model-for-gui-computer-use-tasks-surpassing-computer-grounding-models-like-gta1-32b/
https://arxiv.org/abs/2507.05791 — GTA1 paper
https://huggingface.co/HelloKKMe/GTA1-72B — GTA1-72B card
https://github.com/Yan98/GTA1
https://www.alphaxiv.org/overview/2507.05791
https://arxiv.org/abs/2507.23779 — Phi-Ground tech report
https://arxiv.org/abs/2505.13227 — Jedi / OSWorld-G paper
https://osworld-grounding.github.io/ — OSWorld-G project page
https://github.com/xlang-ai/osworld-g
https://arxiv.org/abs/2503.15661 — UI-Vision paper
https://uivision.github.io/
https://arxiv.org/abs/2509.02544 — UI-TARS-2
https://arxiv.org/pdf/2511.13087 — MEGA-GUI
https://ui-venus.github.io/VenusBench-GD/ — VenusBench-GD
https://arxiv.org/pdf/2512.16501 — VenusBench-GD paper
https://www.emergentmind.com/topics/screenspot-pro — secondary method-level scores (UNVERIFIED)
https://github.com/QwenLM/Qwen3-VL/issues/1576 — Qwen3-VL out-of-box grounding sensitivity
https://github.com/microsoft/OmniParser — OmniParser repo (V2 latest, Feb 2025)
https://huggingface.co/microsoft/OmniParser-v2.0
https://www.microsoft.com/en-us/research/articles/omniparser-for-pure-vision-based-gui-agent/
https://learnopencv.com/omniparser-v2-bridging-vision-and-action-for-gui-agents/
https://github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding
https://huggingface.co/datasets/likaixin/ScreenSpot-Pro
https://arxiv.org/abs/2507.22291