Multimodal ScamCheck — Screenshot & Image Scam Analysis (OCR + Vision + Semantic Retrieval)

Production multimodal scam-intelligence for ScamCheck: screenshot/image upload, lightweight OCR (Cloud Vision + Gemini fallback), deterministic fraud-signal detection, gated deep Gemini-vision analysis, and semantic comparison against known scam clusters via BigQuery VECTOR_SEARCH. Cost-gated, serverless, scale-to-zero.

June 3, 2026 · by Anis Ansari, Founder, A Square Solutions · 5 min read

#scamcheck #multimodal #ocr #vision #vertex-ai #bigquery #fraud-detection #api


ScamCheck now analyzes screenshots (WhatsApp/Telegram/Instagram DMs, fake UPI/payment confirmations, phishing UIs, banking SMS) in addition to text. The pipeline is cost-gated: cheap OCR + deterministic signals run first; expensive Gemini-vision inference fires only when the verdict is ambiguous.

Pipeline (cost-gated)


screenshot ─► [1] OCR (Cloud Vision TEXT_DETECTION → Gemini fallback)  ── cheap
                  → text + word bounding boxes (highlight regions)
           ─► [2] enrich(text): scam category/severity/tactics + trust signals  ── 0 Vertex
                  + visual-heuristic detectors (fake payment, OTP, KYC phish, urgency, impersonation…)
           ─► [3] embedQuery(text) → VECTOR_SEARCH over scam corpus  ── cheap (cached)
                  → similar known scam patterns
           ─► [4] GATED deep Gemini-vision verdict  ── expensive, only if riskScore ∈ [25,70] or forceDeep
                  → blended risk + rationale
           ─► verdict + riskScore + confidence + signals + regions + similar (JSON)

Lightweight first (task 6): OCR via Cloud Vision (≈$1.5/1k images, returns boxes) with a Gemini multimodal fallback if Vision isn't enabled. Deterministic detectors and trust scoring cost zero Vertex calls.
Expensive only when needed: the deep gemini-2.5-flash vision pass runs only for mid-risk/ambiguous screenshots (or forceDeep), preserving scale-to-zero.
Preserves architecture: Vertex AI, BigQuery VECTOR_SEARCH, Cloud Run, and the semantic enrichment pipeline are reused, not replaced.
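The gate in step [4] can be sketched as a small predicate. This is an illustration only: `shouldRunDeepVision` and `DEEP_BAND` are hypothetical names, and treating both band edges as inclusive is an assumption read off the `[25,70]` notation rather than something the source confirms.

```typescript
// Hypothetical names; only the [25,70] band and the forceDeep flag come from the text.
const DEEP_BAND = { min: 25, max: 70 } as const;

/** Returns true when the expensive deep-vision pass should run. */
function shouldRunDeepVision(riskScore: number, forceDeep: boolean = false): boolean {
  // The caller explicitly asked for the deep pass.
  if (forceDeep) return true;
  // Assumption: both band edges are inclusive.
  return riskScore >= DEEP_BAND.min && riskScore <= DEEP_BAND.max;
}

// A clearly high-risk screenshot skips the deep pass; an ambiguous one triggers it.
console.log(shouldRunDeepVision(84), shouldRunDeepVision(40));
```

Keeping the band in one constant makes the cost-versus-sensitivity trade-off a one-line change.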

Modules

File	Role
`lib/scam-intel/ocr.ts`	`ocrImage()` — Cloud Vision TEXT_DETECTION (+word boxes) → Gemini fallback. Multilingual (en/hi hints).
`lib/scam-intel/multimodal.ts`	`analyzeScreenshot()` — orchestrates OCR → enrichment + visual detectors → semantic similarity → gated deep vision → verdict.
`lib/intelligence/enrichment.ts`	Reused for scam category/severity/tactics + trust signals on the OCR text.
`lib/store/bigquery.ts`	`vectorSearch(..., { sourceTypes })` for "similar known scams".

Visual / textual detectors

Fake payment/UPI confirmation · OTP/PIN/CVV request · KYC/verification phishing · urgency/pressure · brand/authority impersonation · lottery/reward/job bait · suspicious links/shorteners · move-to-WhatsApp/call-this-number. Each contributes to a 0–100 risk score; danger signals weigh more than warnings.
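A minimal sketch of how deterministic detectors could feed a 0–100 score follows. The signal shape mirrors the `visualSignals` entries in the API response; the two regex patterns, the `Detector` type, and the weights (30 for danger, 12 for warning) are hypothetical values chosen for illustration, not the ones ScamCheck uses.

```typescript
type Severity = "danger" | "warning";

interface VisualSignal {
  id: string;
  label: string;
  severity: Severity;
  evidence: string;
}

interface Detector {
  id: string;
  label: string;
  severity: Severity;
  pattern: RegExp;
}

// Hypothetical detectors: two of the listed categories, with illustrative patterns.
const DETECTORS: Detector[] = [
  { id: "otp_request", label: "OTP / PIN / CVV request", severity: "danger", pattern: /\b(?:otp|pin|cvv)\b/i },
  { id: "urgency", label: "Urgency / pressure", severity: "warning", pattern: /\b(?:immediately|urgent|within \d+ hours?)\b/i },
];

// Hypothetical weights: danger signals count more than warnings.
const WEIGHTS: Record<Severity, number> = { danger: 30, warning: 12 };

/** Runs every detector once over the OCR text and records the matched evidence. */
function detectSignals(text: string): VisualSignal[] {
  const found: VisualSignal[] = [];
  for (const d of DETECTORS) {
    const match = d.pattern.exec(text);
    if (match) {
      found.push({ id: d.id, label: d.label, severity: d.severity, evidence: match[0] });
    }
  }
  return found;
}

/** Sums the per-signal weights and caps the result at 100. */
function scoreSignals(signals: VisualSignal[]): number {
  const raw = signals.reduce((sum, s) => sum + WEIGHTS[s.severity], 0);
  return Math.min(100, raw);
}

const signals = detectSignals("Your account will be blocked immediately. Share the OTP now.");
console.log(signals.map((s) => s.id), scoreSignals(signals));
```

Because this layer is pure string matching, it costs no Vertex calls and can run offline in tests.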

API contract

`POST /api/scam-intel/screenshot` (public, rate-limited 12/min)

JSON: { "imageBase64": "<...>", "mime": "image/png", "forceDeep": false }
multipart/form-data: field image (file) [+ forceDeep]
Limits: PNG/JPEG/WebP, ≤ 6 MB.
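As a client-side illustration, the JSON variant of the request could be assembled and pre-checked like this. The helper names (`buildScreenshotRequest`, `base64ByteLength`) are hypothetical, and reading "6 MB" as 6 × 1024 × 1024 bytes is an assumption; the server remains the authority on limits.

```typescript
const ENDPOINT = "/api/scam-intel/screenshot";
const ALLOWED_MIME = ["image/png", "image/jpeg", "image/webp"];
// Assumption: "6 MB" is read as 6 * 1024 * 1024 bytes; the source does not say which.
const MAX_BYTES = 6 * 1024 * 1024;

interface ScreenshotRequestBody {
  imageBase64: string;
  mime: string;
  forceDeep: boolean;
}

/** Decoded size of a padded, whitespace-free base64 string. */
function base64ByteLength(b64: string): number {
  const padding = b64.endsWith("==") ? 2 : b64.endsWith("=") ? 1 : 0;
  return Math.floor((b64.length * 3) / 4) - padding;
}

/** Validates the documented limits locally, then returns arguments for fetch(). */
function buildScreenshotRequest(imageBase64: string, mime: string, forceDeep: boolean = false) {
  if (ALLOWED_MIME.indexOf(mime) === -1) {
    throw new Error("Unsupported image type: " + mime);
  }
  if (base64ByteLength(imageBase64) > MAX_BYTES) {
    throw new Error("Image exceeds the 6 MB limit");
  }
  const body: ScreenshotRequestBody = { imageBase64, mime, forceDeep };
  return {
    url: ENDPOINT,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}
```

A caller would then pass the result to `fetch(req.url, req.init)` and parse the JSON response.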

Example response:

{ "verdict":"likely_scam","riskScore":84,"confidence":0.82,
  "ocr":{"text":"…","engine":"cloud-vision","lang":"en","wordCount":42},
  "regions":[{"text":"OTP","x":120,"y":340,"w":60,"h":28}],
  "classification":{"category":"otp_fraud","severity":"high","tactics":["urgency","impersonation"]},
  "trust":{"score":10,"band":"standard"},
  "visualSignals":[{"id":"otp_request","label":"OTP / PIN / CVV request","severity":"danger","evidence":"do not share otp"}],
  "similar":[{"id":"…","title":"…","url":"…","confidence":0.79,"confidenceBand":"high"}],
  "deepAnalysisUsed":true,"deepAnalysis":"Spoofed bank UI requesting OTP; classic account-takeover." }

Always returns structured JSON (wrapped by lib/api/json.ts — no HTML error pages).

UI

app/scamcheck/screenshot/page.tsx — drag/drop or tap-to-upload, screenshot preview with highlighted suspicious regions (scaled Vision boxes), editable OCR text, verdict + risk + confidence, fraud-signal list, and similar known scams. Mobile-friendly, dark theme. Images are processed in-request and not stored.

v2 additions — entities, trust score, telemetry, caching

Entity extraction (lib/scam-intel/extract-entities.ts, deterministic): phone numbers, URLs, link shorteners/risky TLDs, UPI VPAs, ₹ amounts, QR/payment-collect references, urgency + impersonation markers. Entity risk feeds the score.
Pipeline fan-out (task 5): OCR text is fed into the trustscore pipeline (computeTrustScore → trustScore, scamProbability, AI explanation) and into the semantic-search and scam-intel similarity pipelines (VECTOR_SEARCH over scam sources).
Richer verdict: verdict, riskScore, scamProbability, trustScore, explanation, safetyAdvice[], entities, visualSignals, regions, similar, classification, deepAnalysisUsed.
Cost optimization (task 10): the image is SHA-256 hashed; the full verdict, OCR result, and query embedding are each cached by image hash (24h) — a duplicate upload costs zero OCR/embedding/vision calls (cached: true). (Pixel resize/compression is delegated to Gemini's server-side downsampling since the runtime has no image lib; uploads are capped at 6 MB.)
BigQuery telemetry (task 8): logImageAnalysis() writes one best-effort row per scan to scam_image_analysis (verdict, risk, scam probability, trust, category, entity counts, deep-used) — self-ensuring table, never blocks the response.
Test fixtures (task 11): lib/scam-intel/__fixtures__/screenshot-scams.ts (fake SBI SMS, fake courier customs, fake UPI refund, fake KYC) + runnable scripts/test-screenshot-fixtures.mjs asserting expected signals/entities/risk (4/4 passing, offline).
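The duplicate-upload caching described under cost optimization above can be sketched as a SHA-256 key plus a 24-hour TTL. The `HashCache` class and `analyzeWithCache` helper are hypothetical, and the in-memory `Map` is an assumption made for brevity: the source does not say where the cache lives, and a scale-to-zero service would likely need an external store for entries to survive between instances.

```typescript
import { createHash } from "node:crypto";

const TTL_MS = 24 * 60 * 60 * 1000; // 24 hours, as stated in the text

interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

// Hypothetical in-memory cache with an injectable clock so expiry is testable.
class HashCache<T> {
  private entries: Map<string, CacheEntry<T>>;
  private now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.entries = new Map<string, CacheEntry<T>>();
    this.now = now;
  }

  get(key: string): T | undefined {
    const hit = this.entries.get(key);
    if (!hit) return undefined;
    if (hit.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, expiresAt: this.now() + TTL_MS });
  }
}

/** Hex SHA-256 digest of the raw image bytes, used as the cache key. */
function imageHash(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

/** Runs `analyze` only on a cache miss; duplicates return the stored result. */
function analyzeWithCache<T>(
  bytes: Uint8Array,
  cache: HashCache<T>,
  analyze: () => T
): { result: T; cached: boolean } {
  const key = imageHash(bytes);
  const hit = cache.get(key);
  if (hit !== undefined) return { result: hit, cached: true };
  const result = analyze();
  cache.set(key, result);
  return { result, cached: false };
}
```

Hashing the bytes rather than a filename means a re-uploaded or renamed copy of the same screenshot still hits the cache.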

v3 hardening — calibration, multilingual, URL/QR intel, eval, dashboard

Confidence calibration (lib/scam-intel/calibration.ts): uncertainty penalty (high raw risk + thin evidence → pulled toward neutral, confidence cut), evidence-weighted boost, source-reliability weighting, and a low-confidence needs_review fallback when OCR text is too sparse. Together these guard against over-confident verdicts.
Multilingual (en / Hindi / Hinglish / transliterated): detectors match "KYC update karo", "account block ho jayega", "OTP bhejo", "refund ke liye", plus Devanagari.
URL/domain intel (lib/scam-intel/url-intel.ts): punycode, non-ASCII homoglyphs, brand look-alikes (edit-distance vs SBI/HDFC/ICICI/Paytm/PhonePe/Amazon/India Post…), shorteners, suspicious TLDs, raw-IP URLs, digit-substitution, excessive subdomains.
QR / payment intel: QR/collect-request detection and UPI VPA extraction feed the risk score, since receiving a genuine credit never requires the recipient to scan a QR code or approve a collect request.
Visual spoof detection: the gated Gemini-vision prompt explicitly checks fake bank/courier logos, fake WhatsApp/UPI system screens, and fake payment-success screens — retrieval-grounded with the top similar corpus scams to reduce hallucination.
Explainability: explainability { whyFlagged, evidence[], matchingPatterns[], confidenceReasoning[] }.
Observability: per-stage timings { ocrMs, embedMs, vectorMs, deepMs, totalMs } + estCostUsd returned and logged (event=multimodal.analyzed).
Abuse protection: per-minute (12) + per-day (200) IP caps, 6 MB limit, and magic-byte sniffing (rejects payloads whose bytes aren't a real PNG/JPEG/WebP regardless of declared mime).
Evaluation: labeled dataset under /datasets/{scam,legit}-samples/ + scripts/benchmark-scamcheck.mjs (precision/recall/F1/FP/FN/entity accuracy; offline deterministic layer = P/R/F1 1.0 on the 16-sample set, live mode for OCR/retrieval).
Dashboard: GET /api/scam-intel/dashboard?days=30 (ADMIN) — totals, verdict + category distribution, OCR failures, deep-vision usage, avg risk/scam-probability, entity totals, daily trend (from scam_image_analysis).
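The magic-byte sniffing mentioned under abuse protection above can be illustrated with the well-known file signatures for the three accepted formats. The function names are hypothetical and this is a sketch of the general technique, not the project's actual check.

```typescript
type SniffedMime = "image/png" | "image/jpeg" | "image/webp";

/** True when `bytes` contains `signature` starting at `offset`. */
function hasSignature(bytes: Uint8Array, signature: number[], offset: number = 0): boolean {
  if (bytes.length < offset + signature.length) return false;
  return signature.every((b, i) => bytes[offset + i] === b);
}

/** Identifies the image type from its leading bytes, ignoring any declared mime. */
function sniffImageMime(bytes: Uint8Array): SniffedMime | null {
  // PNG: fixed 8-byte signature.
  if (hasSignature(bytes, [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "image/png";
  // JPEG: starts with the SOI marker FF D8 followed by FF.
  if (hasSignature(bytes, [0xff, 0xd8, 0xff])) return "image/jpeg";
  // WebP: "RIFF", a 4-byte size, then "WEBP" at offset 8.
  if (
    hasSignature(bytes, [0x52, 0x49, 0x46, 0x46]) &&
    hasSignature(bytes, [0x57, 0x45, 0x42, 0x50], 8)
  ) {
    return "image/webp";
  }
  return null; // not a recognised image: reject the upload
}
```

A handler would reject the request when `sniffImageMime` returns `null`, whatever mime the client declared.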

Operational guidance

Enable Cloud Vision (gcloud services enable vision.googleapis.com) for cheap OCR + region boxes; without it, the Gemini OCR fallback still works (slightly higher cost, no boxes).
Env: VERTEX_VISION_MODEL (deep, default gemini-2.5-flash), VERTEX_OCR_VISION_MODEL (OCR fallback). Application Default Credentials (ADC) on Cloud Run require no key files.
Cost control: deep vision is gated to ambiguous risk; raise/lower the [25,70] band in multimodal.ts to trade cost vs sensitivity. Query embeddings are cached.
Monetization: a clean public, rate-limited endpoint — a ScamCheck Pro "scan a screenshot" feature; per-scan metering via the existing rate-limit + usage layers.
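The two environment variables listed in the guidance above could be read with a small helper like the one below. `loadVisionConfig` and `VisionConfig` are hypothetical names; the only default taken from the text is `gemini-2.5-flash` for the deep model, and since no default is documented for the OCR fallback it is left undefined here.

```typescript
interface VisionConfig {
  deepModel: string;
  ocrFallbackModel: string | undefined;
}

const DEFAULT_DEEP_MODEL = "gemini-2.5-flash"; // documented default for the deep pass

/** Reads model names from an env-like object, treating blank values as unset. */
function loadVisionConfig(env: Record<string, string | undefined>): VisionConfig {
  const deep = (env["VERTEX_VISION_MODEL"] || "").trim();
  const ocr = (env["VERTEX_OCR_VISION_MODEL"] || "").trim();
  return {
    deepModel: deep || DEFAULT_DEEP_MODEL,
    // No default is documented for the OCR fallback, so none is assumed.
    ocrFallbackModel: ocr || undefined,
  };
}
```

In a Node runtime the helper would typically be called as `loadVisionConfig(process.env)`.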

Frequently asked questions

What is multimodal ScamCheck?

It analyses screenshots and images for scams using OCR, AI vision, and semantic retrieval against known scam campaigns.

Are uploaded screenshots stored?

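No. Images are processed in-request and are not stored. Only derived results (the verdict, OCR text, and query embedding) are cached for 24 hours, keyed by the image's SHA-256 hash.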
