Large-scale evaluation of the ScamCheck multimodal scam-detection pipeline: a 1,000-sample synthetic corpus (en/hi/hinglish/mixed, 10 scam + 7 legit categories), precision/recall/F1, per-language and per-category breakdown, adversarial robustness, leaderboard analytics, caching/stress harnesses, cost model, scaling path, and known weaknesses.
Production evaluation of multimodal ScamCheck. Nothing is mocked: the offline numbers below are produced by the real deterministic detection + entity + URL-intel + calibration layer (the cheap layer that gates the live pipeline), run over a 1,000-sample corpus. Metrics that require live inference (OCR fidelity, hallucination rate, latency, semantic-retrieval relevance) are produced by the live harnesses against a deployed instance and are marked as such — they are not estimated here.
- scripts/gen-synthetic-corpus.mjs → datasets/synthetic/{scam,legit}.jsonl: 500 scam + 500 legit samples, seeded and reproducible.
- scripts/render-screenshots.mjs emits SVG SMS/WhatsApp/bank-UI mockups (rasterize to PNG for live Gemini vision).

node scripts/eval-corpus.mjs

samples: 1000 (500 scam / 500 legit)
TP=467 FP=0 TN=500 FN=33
precision = 1.000 recall = 0.934 F1 = 0.966 accuracy = 0.967
Scam recall by language: en 1.000 · mixed 1.000 · hinglish 0.914 · hi 0.796

Scam recall by category: fake_kyc/bank_alert/courier/job/crypto/whatsapp/ecommerce = 1.000 · fake_payment 0.82 · investment 0.78 · upi_refund 0.74

Entity-extraction accuracy (fixtures): 1.000
Zero false positives across 500 legitimate messages — including legit OTP ("do not share"), debit alerts, payment-success screens, and KYC-at-branch notices — which is the hard case. This validates the calibration + the precision fixes (fake_payment targets unsolicited credit/QR, not debit alerts; otp_request targets sharing solicitation; kyc_phish is a topic warn, not a standalone danger).
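The headline metrics follow directly from the confusion-matrix counts reported above. As a quick check, the sketch below recomputes them with the standard definitions, treating "scam" as the positive class. The function name `binaryMetrics` is illustrative and is not taken from scripts/eval-corpus.mjs.

```javascript
// Confusion-matrix counts copied from the offline run reported above.
const counts = { tp: 467, fp: 0, tn: 500, fn: 33 };

// Standard binary-classification metrics; "scam" is the positive class.
// binaryMetrics is a hypothetical helper, not the project's own function.
function binaryMetrics({ tp, fp, tn, fn }) {
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn);
  const f1 = (2 * precision * recall) / (precision + recall);
  const accuracy = (tp + tn) / (tp + fp + tn + fn);
  return { precision, recall, f1, accuracy };
}

const metrics = binaryMetrics(counts);

// Round to three decimals for display, matching the report format.
const rounded = Object.fromEntries(
  Object.entries(metrics).map(([name, value]) => [name, value.toFixed(3)])
);
console.log(rounded);
// { precision: '1.000', recall: '0.934', f1: '0.966', accuracy: '0.967' }
```

With FP = 0, precision is exactly 1, so F1 is driven entirely by the 33 missed scam samples.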
Obfuscated URLs (hxxp://, sbi-kyc[.]xyz), unicode homoglyphs (Cyrillic ѕ),
digit substitution (0 for o), spaced O T P, and emoji spam applied to all
500 scam samples:
adversarial scam recall = 0.852 (426/500)
Roughly an 8-point drop from the clean-text recall of 0.934. URL-intel still catches punycode, homoglyph, and [.] obfuscation; the residual loss is on spaced or segmented keywords, which the live Gemini-vision layer recovers.
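To make the perturbations concrete, here is a minimal sketch of how the obfuscations listed above could be applied to a clean sample. The helper names and exact rewrite rules are assumptions for illustration; the real harness may compose or parameterize them differently.

```javascript
// Hypothetical perturbation helpers mirroring the obfuscations listed above.
// Names and exact rules are illustrative assumptions, not the project's code.

// http(s):// -> hxxp(s)://, and "." before a TLD-like token -> "[.]"
const defangUrl = (text) =>
  text
    .replace(/http(s?):\/\//gi, "hxxp$1://")
    .replace(/\.(?=[a-z]{2,}\b)/gi, "[.]");

// Latin "s" -> Cyrillic "ѕ" (U+0455), which looks nearly identical.
const homoglyph = (text) => text.replace(/s/g, "\u0455");

// Letter "o" -> digit "0".
const digitSub = (text) => text.replace(/o/g, "0");

// Break a keyword into spaced letters, e.g. "OTP" -> "O T P".
const spaceKeyword = (text, keyword) =>
  text.replaceAll(keyword, keyword.split("").join(" "));

const sample = "Share OTP at http://sbi-kyc.xyz to avoid suspension";
const perturbed = spaceKeyword(defangUrl(sample), "OTP");
console.log(perturbed);
// Share O T P at hxxp://sbi-kyc[.]xyz to avoid suspension

// Recall is detected / total; the run above reports 426 of 500.
const adversarialRecall = 426 / 500;
console.log(adversarialRecall.toFixed(3)); // 0.852
```

Keeping each perturbation as a separate pure function makes it easy to measure recall per obfuscation type as well as for the combined set.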
Live equivalents are served by GET /api/scam-intel/dashboard over the
scam_image_analysis BigQuery telemetry.
- SCAMCHECK_URL=… node scripts/stress-test.mjs: 100 concurrent screenshots, a 50× duplicate-upload cache benchmark (reports cache hit rate + embedding reuse), oversized/malicious-payload abuse checks, and a multilingual classification burst. Reports p50/p90/p99 latency, throughput, and 429s.
- SCAMCHECK_URL=… node scripts/eval-corpus.mjs streams samples to the live API (adds OCR/Gemini/VECTOR_SEARCH behaviour).
- The /api/scam-intel/screenshot response carries timings { ocrMs, embedMs, vectorMs, deepMs, totalMs } + estCostUsd, and logs event=multimodal.analyzed to Cloud Logging.

| Stage | Cost | Notes |
|---|---|---|
| OCR (Cloud Vision) | ~₹0.12 | or Gemini-OCR fallback ~₹0.05–0.15 |
| Query embedding | ~₹0.002 | ₹0 on cache hit |
| VECTOR_SEARCH | ~₹0 | free tier covers light traffic |
| Deep Gemini vision (gated ~30–40%) | ~₹0.15–0.40 | only fires on ambiguous risk |
| Blended | ~₹0.15–0.40 / scan | duplicate uploads ≈ ₹0 (hash cache) |
10k scans/month ≈ ₹1,500–4,000; the OTP-floor + calibration reduce deep-vision firing on clear-cut cases, lowering the blended cost.
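The arithmetic behind these figures can be sketched as a simple expected-value model. The stage costs and gate rates come from the table above; the model itself is an assumption. It uses the Cloud Vision OCR price, treats VECTOR_SEARCH as free, and ignores cache hits, so it is a rough cross-check rather than the project's actual cost code.

```javascript
// Per-scan stage costs in INR, taken from the table above.
const ocr = 0.12;
const embedding = 0.002;
const deepVision = { low: 0.15, high: 0.40 }; // cost when the deep stage fires
const gateRate = { low: 0.30, high: 0.40 };   // share of scans that reach it

// Expected cost of one uncached scan: always-on stages plus the gated stage.
// This formula is an illustrative assumption, not the project's cost model.
const expectedPerScan = (gate, deep) => ocr + embedding + gate * deep;

const perScanLow = expectedPerScan(gateRate.low, deepVision.low);
const perScanHigh = expectedPerScan(gateRate.high, deepVision.high);
console.log(perScanLow.toFixed(3), perScanHigh.toFixed(3)); // 0.167 0.282

// Monthly estimate from the stated blended band (0.15 to 0.40 INR per scan).
const monthly = (scans, perScan) => Math.round(scans * perScan);
console.log(monthly(10_000, 0.15), monthly(10_000, 0.40)); // 1500 4000
```

Under these assumptions the expected per-scan cost (about ₹0.17 to ₹0.28) falls inside the stated blended band, and the monthly range is the band multiplied by scan volume.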
--min-instances 0 until steady traffic, then 1 to remove cold starts · add a
BigQuery vector index past a few thousand corpus rows · Pub/Sub→BigQuery
telemetry sink at high volume · widen the deep-vision gate band as the cheap
layer's precision is proven · batch-embed at ingestion only.
A production pipeline to turn real screenshots into a queryable campaign corpus (run against live GCP; the deterministic layers are verified offline):
- POST /api/scam-intel/corpus-ingest (admin): OCR + Gemini vision → scam_corpus vector table. Driver: scripts/build-scam-corpus.mjs ./datasets/real-images (bounded concurrency, dedup by hash, live cost tracking).
- lib/scam-intel/fingerprint.ts: stable campaign id + human label ("Fake SBI KYC Suspension", "Courier Customs Fee Scam", "Refund QR Collection Scam", …) from category + spoofed brand + domain + wording skeleton.
- lib/scam-intel/campaign-clustering.ts: union-find over shared domain / UPI / phone / wording-skeleton / fingerprint; link-shorteners excluded as non-distinctive.
- scamCorpusNearest: every upload returns its nearest known campaigns + cluster confidence (campaign history).
- GET /api/scam-intel/leaderboard: top spoofed brands, scam domains, UPI IDs, phones, campaigns, and 7-day fastest-growing campaigns.

Clustering quality (offline, node scripts/eval-clustering.mjs, 500 scam):
clusters = 44 (26 multi-member) weighted purity = 1.000
distinct fingerprints = 22
top campaigns: Work-From-Home Job (50), Crypto Airdrop/Seed-Phrase (50),
Fake Ecommerce Refund (50), Fake Customer-Support (48), Fake KYC Suspension (46) …
This evaluation surfaced a bug, now fixed: link-shorteners (bit.ly/tinyurl/t.me) were merging unrelated campaigns. Excluding them from the join key lifted weighted purity to 1.000 on category-correlated domains.
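The effect of the shortener exclusion can be reproduced on toy data. The sketch below is a minimal union-find over shared join keys plus a weighted-purity score (majority-label count summed over clusters, divided by sample count). The sample data is made up, the key format (`domain:…`, `phone:…`) is hypothetical, and the purity definition is assumed to match what scripts/eval-clustering.mjs reports.

```javascript
// Minimal union-find (disjoint set) with path halving.
class UnionFind {
  constructor(n) {
    this.parent = Array.from({ length: n }, (_, i) => i);
  }
  find(x) {
    while (this.parent[x] !== x) {
      this.parent[x] = this.parent[this.parent[x]];
      x = this.parent[x];
    }
    return x;
  }
  union(a, b) {
    this.parent[this.find(a)] = this.find(b);
  }
}

// Shorteners named in the text; "tinyurl.com" is assumed for "tinyurl".
const SHORTENERS = new Set(["bit.ly", "tinyurl.com", "t.me"]);

// Cluster samples that share any join key (domain, UPI id, phone, ...).
function cluster(samples, { excludeShorteners }) {
  const uf = new UnionFind(samples.length);
  const firstSeen = new Map(); // join key -> index of first sample carrying it
  samples.forEach((sample, i) => {
    for (const key of sample.keys) {
      const domain = key.replace(/^domain:/, "");
      if (excludeShorteners && SHORTENERS.has(domain)) continue;
      if (firstSeen.has(key)) uf.union(i, firstSeen.get(key));
      else firstSeen.set(key, i);
    }
  });
  return samples.map((_, i) => uf.find(i));
}

// Weighted purity: sum of each cluster's majority-label count, divided by N.
function weightedPurity(samples, roots) {
  const byCluster = new Map();
  samples.forEach((sample, i) => {
    const tally = byCluster.get(roots[i]) ?? new Map();
    tally.set(sample.label, (tally.get(sample.label) ?? 0) + 1);
    byCluster.set(roots[i], tally);
  });
  let majority = 0;
  for (const tally of byCluster.values()) majority += Math.max(...tally.values());
  return majority / samples.length;
}

// Toy data, made up for illustration (not drawn from the corpus).
const samples = [
  { label: "fake_kyc", keys: ["domain:sbi-kyc.xyz", "domain:bit.ly"] },
  { label: "fake_kyc", keys: ["domain:sbi-kyc.xyz"] },
  { label: "courier", keys: ["domain:bit.ly", "phone:+910000000001"] },
  { label: "courier", keys: ["phone:+910000000001"] },
];

const merged = weightedPurity(samples, cluster(samples, { excludeShorteners: false }));
const split = weightedPurity(samples, cluster(samples, { excludeShorteners: true }));
console.log(merged, split); // 0.5 1
```

With the shared bit.ly key included, both campaigns collapse into one cluster and purity drops to 0.5. Skipping shortener keys keeps the two campaigns apart and restores purity to 1.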
Serverless, scale-to-zero, no GPU; Vertex AI + BigQuery + Cloud Run only. All evaluation uses the live pipeline's real logic; nothing is mocked.