Local vs Cloud Vision Models for Construction Documentation — A 7-Model Benchmark
AI · Computer Vision · Construction Tech · LLM · Benchmark


Vinicius Fonseca · April 1, 2026 · 18 min read


I run a small construction company in the DC metro area. At any given moment we're managing 6 active renovation projects — kitchens, structural work, permitting, subcontractors. Every day generates a flood of content: photos from the job site, PDFs from engineers and the county, text message conversations, handwritten inspection notes.

For years, "finding a photo" meant scrolling through a WhatsApp chat. "Looking up a spec" meant downloading a PDF and searching by keyword and hoping. "Remembering what the inspector said" meant... hoping someone saved that voice memo.

I built an AI system called Reef to fix this. The core idea is simple: every photo, PDF, audio note, and message that comes through project WhatsApp groups gets automatically cataloged with an AI description, embedded for semantic search, and made queryable in plain English.

The harder question was: which AI model should actually do the work?

This article is the answer — a benchmark of 7 vision models across speed, cost, and real-world construction documentation quality. The findings changed the architecture entirely, and one result in particular (the PDF accuracy test) is something I haven't seen covered anywhere else.


The Problem: Construction Documentation at Scale

A mid-size renovation project generates roughly 20–40 media files per day across all communication channels. That's 600–1,200 files a month, per project. Multiply across 6 concurrent projects and you have a documentation problem that no human can keep up with manually.

The stakes aren't trivial:

  • A missed photo of a framing defect means the issue gets sealed behind drywall
  • An unread structural spec means a contractor installs the wrong steel angle
  • A forgotten permit condition means a failed inspection at the worst possible moment

The goal for Reef was to catalog everything automatically: photos with scene descriptions, PDFs with extracted text, voice notes with transcriptions — all searchable in plain English, all updated within 60 seconds of a message arriving.

But to do that at scale without breaking the bank, I needed to understand what each model actually costs, how fast it is, and — critically — where it fails.


Test Setup

Hardware:

  • NVIDIA RTX 3070 Ti (8GB VRAM) for local inference
  • LM Studio as the local model server
  • Google Gemini API (cloud)
  • Anthropic API (cloud)

Test dataset:

  • 10 real construction images from active DC-area renovation projects
  • Images included: kitchen demolition mid-progress, deck structure viewed from below, attic framing, drywall installation, site excavation, permit documents, handwritten inspection notices, text message screenshots with product codes, contractor invoices

Scoring rubric (5 dimensions, 10 points each, 50 total):

| Dimension | What It Measures |
|---|---|
| Scene Accuracy | Does the description match what's actually in the image? |
| Technical Detail | Construction-specific identification: trades, materials, phases |
| Text Extraction | Accuracy on any readable text in the image |
| Practical Utility | Would this description help find the image later? |
| Hallucination Penalty | Deductions for invented details |

Every image was scored independently by me, blind to which model produced which output.
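To make the rubric concrete, here's one way to encode a single score in code. This is a sketch: the field names mirror the table above, and the "penalty dimension starts at 10 and loses points per invented detail" encoding is my framing of the deduction rule, not a formal spec.

```python
from dataclasses import dataclass

@dataclass
class RubricScore:
    """One model's score on one image; each positive dimension is 0-10."""
    scene_accuracy: int
    technical_detail: int
    text_extraction: int
    practical_utility: int
    hallucination_deductions: int  # points deducted for invented details

    def total(self) -> int:
        # The fifth dimension starts at 10 and loses a point per invention,
        # so a clean output still tops out at 50.
        hallucination_score = max(0, 10 - self.hallucination_deductions)
        return (self.scene_accuracy + self.technical_detail
                + self.text_extraction + self.practical_utility
                + hallucination_score)
```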


The 7 Models We Tested

| Model | Type | Avg Speed | Cost/Image | Quality Score |
|---|---|---|---|---|
| Gemma 3 4B | Local | 0.84s | $0 | 33/50 |
| Qwen2.5-VL-7B | Local | 4.22s | $0 | 31/50 |
| Gemma 3 12B | Local | 10.67s | $0 | 34/50 |
| Gemini 2.5 Flash | Cloud | 2.31s | ~$0.002 | 37/50 |
| Gemini 3 Flash Preview | Cloud | 22.8s | ~$0.003 | unstable |
| Claude Haiku 4.5 | Cloud | 1.90s | ~$0.006 | 32/50 |
| Claude Opus 4.6 | Cloud | 13.2s | ~$0.06 | 49/50 |

Three local models (zero cost, run entirely on my hardware), four cloud models. Let me walk through what I actually found.


Speed: Where Local Models Surprised Me

The throughput numbers tell the real story for bulk cataloging:

  • Gemma 3 4B: 71 images/minute — fastest in the test
  • Claude Haiku 4.5: 32 images/minute — fast cloud option
  • Gemini 2.5 Flash: 26 images/minute — solid cloud speed
  • Qwen2.5-VL-7B: 14 images/minute — slower local option
  • Gemma 3 12B: 5.6 images/minute — bigger model, not worth the wait
  • Claude Opus 4.6: 4.5 images/minute — quality has a cost
  • Gemini 3 Flash Preview: 2.6 images/minute — unstable, not production-ready

Gemma 3 4B processing at 71 images per minute — on a consumer RTX 3070 Ti, for free — was the biggest surprise. For a cataloging workload where most images are getting a standard scene description, that throughput matters.
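The per-minute figures above follow directly from the average latencies, assuming images are processed strictly one at a time (no batching):

```python
# Throughput is just 60 seconds divided by average per-image latency,
# under a sequential-processing assumption.
def images_per_minute(avg_latency_s: float) -> float:
    return 60.0 / avg_latency_s

# e.g. Gemma 3 4B at 0.84s/image -> ~71 images/minute,
#      Claude Opus 4.6 at 13.2s/image -> ~4.5 images/minute
```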

The Gemini 3 Flash Preview (the newest at time of testing) was disqualified from production consideration: response times were wildly inconsistent, ranging from 8 seconds to 45 seconds, with occasional incomplete outputs. Not ready.


Quality: What the Scores Don't Tell You

The aggregate scores are useful but they hide important detail. Let me show you some actual outputs.

Construction Photo: Deck Structure From Below

This is a photo looking up at a deck from underneath — you can see the joists, ledger board, beam pocket connections, and some weathering on the existing lumber.

Gemma 3 4B (33/50 avg):

"View from below a wood deck structure. Visible: floor joists running perpendicular to ledger, existing beam with pocket connection, weathered pressure-treated lumber. Possible sister joist visible on left. No visible rot on accessible members."

Gemini 2.5 Flash (37/50 avg):

"Underside of exterior wood deck. Joists visible, ledger attachment appears present. Some older wood members with weathering. Construction or inspection context."

Claude Opus 4.6 (49/50 avg):

"Photograph taken from below an existing wood-framed deck, looking upward. The structural framing is clearly visible: floor joists at regular spacing (approximately 16" OC based on visual estimation) attached to what appears to be a ledger board against the house foundation. One joist appears to have been sistered — a new 2× member is lag-bolted alongside an existing one, which is common when the original shows rot, checking, or inadequate bearing. Several joists show surface checking and grey weathering consistent with untreated or end-grain exposed lumber. No visible active rot or insect damage from this vantage. Recommend closer inspection of the ledger connection and joist hangers before any load-bearing modifications."

The quality gap is real. Opus doesn't just describe — it identifies a sistered joist, estimates spacing, notes what should be inspected next. For a project report or a compliance document, that level of detail matters.

But for cataloging 185 images a day at $0.06 each, Opus costs $11/day just for photos. Gemma 3 at $0 costs nothing. The question becomes: does cataloging need Opus-quality output?

For search and retrieval — no. "Deck, joists, weathered lumber, pocket connection, sister joist" in a Gemma description is entirely sufficient to find that photo when someone searches "deck framing inspection." Opus is overkill for the catalog layer.

Handwritten Inspection Disapproval Notice

This was a handwritten note from Eric, a Montgomery County inspector, marking a disapproval on rough framing — specific items listed by hand.

Every model produced a general description. Only Claude Opus 4.6 extracted the text verbatim:

"MONTGOMERY COUNTY DEPARTMENT OF PERMITTING SERVICES — INSPECTION DISAPPROVAL. Items: (1) Header at master bedroom window — insufficient bearing, min 3-1/2" required; (2) Blocking missing at mid-span, per IRC Table R802.4.2; (3) Rim joist hangers not installed per schedule. Re-inspection required before cover. Inspector: Eric [redacted], DPS. Date: [date visible on document]."

That's not just a description — it's an extractable record. The ability to search "what did Eric flag at the Quincy Street renovation" and get the actual punch list items verbatim is genuinely useful. This is where escalating to Opus earns its cost.

Text Message Screenshot: Product Code Accuracy Test

A photo of a text message conversation with a product code for a tile order — "Item #8067" — in the middle of a long message thread.

  • Qwen2.5-VL-7B: Extracted "8067" — correct ✅
  • Gemma 3 12B: Extracted "3067" — wrong ❌

This matters. If the catalog records the wrong SKU, and someone later searches for that tile to reorder for a punch-list repair, they order the wrong product. The 1-digit error from a 12-billion-parameter model that's supposed to be better than the 4B version was one of the more frustrating findings.

Qwen2.5-VL-7B consistently outperformed Gemma 3 12B on text extraction despite being smaller. Worth noting for any use case where OCR accuracy matters more than scene description quality.

Permit Document: Where Qwen Failed

One of the test images was a building permit page — dense text, official format, permit number visible.

Qwen2.5-VL-7B returned a 400 error. The model crashed on this image format — likely a resolution or aspect ratio edge case. Not production-safe for document types you can't fully predict.

Claude Opus 4.6 extracted the permit number, issued date, work description, property address, licensed contractor number, and expiration date — all correctly. Complete, structured, accurate.

The lesson: local models are less robust on edge-case document formats. In production, you want a fallback.
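In code, that fallback is a simple try/except around the local call. This is a sketch of the pattern, not Reef's actual implementation; the `local` and `cloud` callables stand in for whatever clients you use (LM Studio's OpenAI-compatible API, Anthropic's SDK, etc.).

```python
from typing import Callable

def describe_image(image_path: str,
                   local: Callable[[str], str],
                   cloud: Callable[[str], str]) -> dict:
    """Try the local model first; escalate to cloud only on failure."""
    try:
        return {"model": "local", "description": local(image_path)}
    except Exception as exc:  # 400 errors, timeouts, unsupported formats
        return {"model": "cloud",
                "description": cloud(image_path),
                "local_error": str(exc)}
```

The point is that the cloud path only costs money when the local path actually breaks, which in my testing was the exception rather than the rule.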


The PDF Accuracy Problem — This One Changes Everything

This is the finding I didn't expect, and it has the biggest practical implications.

Construction PDFs contain structural specifications, dimensions, code references, and tolerances. Getting them wrong isn't a minor inconvenience — it's a liability.

We tested each model on a set of structural engineering plans (28 pages, ARCHICAD format). The task: extract the lintel specifications from Section IV F — a table specifying what steel angle size is required for different opening widths in masonry walls.

The results:

| Specification | Actual Value | Gemma 3 4B | Claude Opus 4.6 | Gemini 3.1 Flash-Lite |
|---|---|---|---|---|
| 0"–3'-0" angle | 3-1/2" × 3-1/2" × 5/16" | ❌ Hallucinated | ❌ 3-1/2" × 3-1/2" × 3/16" | ✅ Correct |
| 3'-1"–5'-0" angle | 4" × 3-1/2" × 5/16" | ❌ Hallucinated | ❌ 3-1/2" × 3-1/2" × 5/16" | ✅ Correct |
| 5'-1"–6'-0" angle | 6" × 3-1/2" × 5/16" | ❌ Hallucinated | ❌ 3-1/2" × 3-1/2" × 5/16" | ✅ Correct |
| Min. bearing | 6" | ❌ Made up | ❌ 4" | ✅ Correct |
| Rebar | #5 bar | ❌ Made up | ❌ #4 bar | ✅ Correct |
| Terminology | "Loose lintels" | ❌ Made up | ❌ "lintels limits" | ✅ Correct |
| Score | — | 0/6 | 0/6 | 6/6 |

Gemma 3 4B fabricated entire sections that didn't exist — fake tables, fake dimensions, fake addresses. Complete confabulation. This is expected: local vision models aren't designed for precise document OCR.

Claude Opus 4.6 — this one stung. Opus gets 49/50 on photo quality. It's the best general-purpose vision model I tested. And it got zero correct on structural specs. It got the right format (the table structure) but missed every dimension. The 3/16" versus 5/16" error on a load-bearing steel angle is not a rounding difference — it's a completely wrong specification that could fail a structural inspection or worse.

Gemini 3.1 Flash-Lite went 6/6. Perfect. Every dimension, every bearing spec, every term — verbatim from the document.

Why? Gemini processes PDFs by first rendering each page at 400 DPI, then applying its document understanding capabilities to the high-resolution image. It extracted 14,254 characters from a single structural spec page — every paragraph, every table cell, every ASTM reference number. And it does this at $0.25 per million input tokens, which is effectively free for construction PDF volumes.

This is the core finding that determined the architecture: Gemini is the only model in this test that can be trusted with structural specifications. Not the most expensive model. Not the local model. The mid-tier cloud model with native document understanding.


The API Outage That Proved Local-First

On March 31st, Anthropic's API experienced a capacity issue for approximately 5.5 hours (15:12–20:46 UTC).

During that window:

  • 156 API calls failed — 80 from Reef attempting Opus analysis, 76 from other tools
  • Every group photo that came in during that window went unprocessed
  • The system was effectively blind to project activity for most of an afternoon

The local Gemma 3 4B model running on my hardware? Zero downtime. Zero failures. It doesn't care about Anthropic's servers.

This isn't a hypothetical risk. API outages happen. When 93% of your image cataloging is handled by a local model that runs regardless of external availability, your documentation pipeline keeps running. When it's cloud-only, an outage means gaps in your construction records.


The Three-Model Architecture: What's Actually Running

Based on everything above, Reef uses a three-model architecture designed around a simple principle: match each task to the model that handles it best, at the lowest cost that meets the quality bar.

The Pipeline

📱 WhatsApp Project Group
    → OpenClaw Gateway
    → 🔀 Media Triage Router
        ├─ 📷 Bulk Photos → Gemma 3 4B (local) · 71 img/min · $0 · fast, private, free
        ├─ 📄 PDF Documents → Gemini 3.1 Flash-Lite (cloud) · 400 DPI · ~$0.01/page · structured extraction, 6/6 accuracy
        └─ 🔍 Critical / Compliance → Claude Opus 4.6 (cloud) · ~$0.06/img · 49/50 quality · deep analysis, on-demand
    → 🧬 Qwen3 Embeddings → SQLite
    → 📊 Live Project Dashboard (60s refresh)

Model Assignments

Photos → Gemma 3 4B (local)

Free. 71 images per minute. Good enough for cataloging. And critically — it stays on the network. No photo from a job site ever leaves the local infrastructure.

PDFs → Gemini 3.1 Flash-Lite (cloud)

The only model that can be trusted with structural specifications, permit documents, and engineering drawings. 400 DPI rendering. 100% accuracy on the lintel test. Costs essentially nothing for the volume of PDFs a small construction company generates.

Critical analysis → Claude Opus 4.6 (cloud, on-demand only)

Not in the bulk pipeline. Reef escalates to Opus only when something needs deep analytical judgment: building a compliance report, writing an estimate from photos, or when a permit document needs to be cross-referenced with site photos to check conformance. For those deliverables, Opus's 49/50 quality earns the $0.06/image.
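The triage logic reduces to a small routing function. A sketch, with illustrative tier names rather than Reef's actual identifiers:

```python
# Route an incoming WhatsApp media item to a model tier.
# The model names are the ones benchmarked above; the string identifiers
# and the `critical` flag are illustrative placeholders.
def route_media(media_type: str, critical: bool = False) -> str:
    if critical:
        return "claude-opus"        # on-demand deep analysis only
    if media_type == "pdf":
        return "gemini-flash-lite"  # document extraction path
    if media_type in ("image", "photo"):
        return "gemma-3-4b-local"   # bulk catalog path: free, private
    return "gemma-3-4b-local"       # default: local cataloging
```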


Cost Analysis: What This Actually Costs to Run

Daily Operating Cost (Current Production)

| Component | Model | Daily Volume | Daily Cost |
|---|---|---|---|
| Photo cataloging | Gemma 3 4B (local) | ~185 images | $0 |
| PDF pipeline processing | Gemini 3.1 Flash-Lite | ~10 pages | ~$0.01 |
| PDF on-demand queries | Gemini 3.1 Flash-Lite | ~20 pages | ~$0.02 |
| Flagged review (20%) | Claude Haiku 4.5 | ~37 images | $0.22 |
| Compliance/report analysis (5%) | Claude Opus 4.6 | ~9 images | $0.54 |
| Total | | | ~$0.79/day |

Monthly Cost Scenarios

| Strategy | Monthly Cost | Notes |
|---|---|---|
| Cloud-only (Opus for everything) | $333 | Previous approach — overkill |
| Cloud-only (Haiku 4.5) | $33 | Lower quality, still cloud-dependent |
| Current: local + Gemini + tiered Opus | ~$24 | Best accuracy, lowest real cost |
| All-local (Gemma + Gemini free tier only) | ~$0.90 | If no Opus escalation needed |

The previous approach used Claude Opus for everything — photos, PDFs, analysis. The three-model architecture:

  • Saves $310/month on photo processing alone (local Gemma replaces cloud Opus)
  • Adds PDF accuracy that didn't exist before (Opus was failing on structural specs)
  • Survives API outages without gaps in the construction record

The ~$24/month figure is with real Opus escalation for report-grade analysis. If your projects don't require deliverable documents — just cataloging and search — you can run the whole thing for under a dollar a month.
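The daily and monthly figures above check out arithmetically; the per-item prices are the ones from the benchmark tables:

```python
# Daily cost breakdown: volume x per-item price for each tier.
daily = {
    "photos_local":  185 * 0.0,    # Gemma 3 4B, runs on my hardware
    "pdf_pipeline":  0.01,         # ~10 pages, Gemini Flash-Lite
    "pdf_on_demand": 0.02,         # ~20 pages, interactive queries
    "haiku_review":  37 * 0.006,   # 20% of photos flagged for review
    "opus_reports":  9 * 0.06,     # 5% escalated for deliverables
}
daily_total = sum(daily.values())  # ~ $0.79/day
monthly = daily_total * 30         # ~ $24/month
```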


What's Running Right Now

Reef has been in production since early 2026 across active renovation projects in the DC area.

Current stats:

  • 558+ media files processed through the inbound pipeline
  • 228+ catalog entries with vector embeddings for semantic search
  • 6+ active projects with dedicated WhatsApp groups
  • 1-minute refresh — new messages and media appear in project journals within 60 seconds

Search capability: Catalog entries are embedded using Qwen3 Embedding 4B (local, 2560-dimensional vectors) stored in SQLite. Search understands meaning, not just keywords:

  • "water damage" → finds excavation and sump pump photos across all projects (81.4% match)
  • "footing pocket detail CMU block" → finds the structural PDF page with that spec (75.0% match)
  • "permit number 1220 Quincy" → returns the extracted permit document with all fields

Keyword boosting ensures exact model numbers, product codes, and permit references get promoted over semantic matches when there's a conflict.
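The boosting rule can be sketched as a re-ranking step on top of cosine similarity. The boost value, field names, and whitespace tokenization here are illustrative, not Reef's actual parameters:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a))
                  * math.sqrt(sum(y * y for y in b)))

def search(query: str, query_vec: list[float],
           entries: list[dict], boost: float = 0.2) -> list[dict]:
    """Rank by semantic similarity, then promote exact-token matches
    so codes like '8067' outrank fuzzy semantic neighbors."""
    tokens = set(query.lower().split())
    scored = []
    for e in entries:
        score = cosine(query_vec, e["vec"])
        if tokens & set(e["text"].lower().split()):
            score += boost  # exact keyword hit: promote over pure semantics
        scored.append({**e, "score": score})
    return sorted(scored, key=lambda e: e["score"], reverse=True)
```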

Interactive PDF queries: Beyond the automated pipeline, Reef supports real-time document questions. Ask "what are the lintel specs in Section IV F of the structural drawings?" and Reef renders the page at 400 DPI, passes it to Gemini, and returns the verbatim specifications in seconds. This is the tool that paid for itself the first time a subcontractor needed a spec confirmed on-site.


Conclusion: Catalog Everything Locally, Escalate What Matters

The seven-model benchmark produced a counterintuitive result: the best architecture isn't the most expensive model applied to everything, and it isn't all-local to save costs. It's a tiered system where each model does what it's actually good at.

Gemma 3 4B is fast, free, and good enough for bulk photo cataloging. For a workload where you're processing 185 images a day just to make them searchable, "good enough" is exactly right.

Gemini 3.1 Flash-Lite is the only model in this test that can reliably extract structural specifications from PDFs. That finding — that the best-quality general vision model (Opus) scores 0/6 on the same test where Gemini scores 6/6 — is the most important thing in this article. If you're using an LLM to read structural drawings, you need to verify it can actually read them. Most can't.

Claude Opus 4.6 is the best analytical model and earns its cost on deliverables. Not on bulk cataloging.

The construction industry generates enormous amounts of documentation and has largely unsolved problems around search, access, and memory across project teams. Reef is a specific solution to a specific workflow — but the underlying architecture applies to any environment where you need continuous, affordable AI processing of mixed media with high accuracy on structured documents.

The total infrastructure cost for 6 active renovation projects, continuous cataloging, and on-demand analysis: approximately $24/month.

If you're running something similar, or trying to figure out which models to use for your own construction documentation pipeline, I'm happy to dig into specifics. The benchmark data and Reef's architecture are both available to discuss.


Benchmarks conducted March–April 2026. Hardware: RTX 3070 Ti, 8GB VRAM, LM Studio. Models tested: Gemma 3 4B, Qwen2.5-VL-7B, Gemma 3 12B, Gemini 2.5 Flash, Gemini 3 Flash Preview, Claude Haiku 4.5, Claude Opus 4.6. PDF accuracy testing conducted on 28-page structural engineering plans. All construction project references anonymized.
