Vertex AI Gemini — Setup & Provider Architecture

How AI Execution Lab uses Vertex AI Gemini 2.5 (Flash + Pro) exclusively: dependency-free service-account auth, model-tier routing with automatic fallback, graceful rate-limit handling, token usage + cost tracking, Vertex quota monitoring, batched multilingual embeddings, and a deterministic mock fallback. Includes the exact env vars, IAM roles, and a go-live checklist.

May 30, 2026 · by Anis Ansari, Founder, A Square Solutions · 3 min read

#vertex-ai #gemini #firebase #firestore #cost #rate-limiting #scamcheck #production

The platform talks to Vertex AI exclusively — no direct Generative Language (API-key) calls. The provider abstraction is unchanged for callers: getProvider().generate(), generateJSON(), and the embeddings API keep their old signatures, so nothing downstream broke.

Why Vertex

Enterprise auth (service account / Workload Identity), VPC-SC, data residency, and one billing surface with Firestore.
Access to Gemini 2.5 Flash (fast, cheap) and Gemini 2.5 Pro (deep analysis).

Model tiers & routing

Tier	Model (default)	Used for
`flash`	`gemini-2.5-flash`	SEO, GEO, social, reel, FAQ, classification — everything fast
`pro`	`gemini-2.5-pro`	The long-form article (deep analysis)

Callers select a tier via generate(prompt, { tier: 'pro' }) or { deep: true }. Default is Flash. The article generator uses DEEP_TIER (default pro); set DEEP_TIER=flash to cut article cost ~4× for low-cost mode.
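The selection rules above can be sketched as a small pure function. This is a minimal illustration, not the code in the repo: resolveTier, resolveModel, and the Env type are hypothetical names, and mapping { deep: true } to DEEP_TIER is an assumption based on how the article generator is described.

```typescript
type Tier = "flash" | "pro";

// Hypothetical env shape; only the variable names come from the Environment section.
type Env = Record<string, string | undefined>;

interface GenerateOptions {
  tier?: Tier; // an explicit tier always wins
  deep?: boolean; // assumed shorthand for "use the deep-analysis tier"
}

const DEFAULT_MODELS: Record<Tier, string> = {
  flash: "gemini-2.5-flash",
  pro: "gemini-2.5-pro",
};

function parseTier(value: string | undefined, fallback: Tier): Tier {
  return value === "flash" || value === "pro" ? value : fallback;
}

// Explicit tier first, then the deep flag (routed through DEEP_TIER), else Flash.
function resolveTier(opts: GenerateOptions = {}, env: Env = {}): Tier {
  if (opts.tier) return opts.tier;
  if (opts.deep) return parseTier(env.DEEP_TIER, "pro");
  return "flash";
}

// Model IDs can be overridden per tier via VERTEX_FLASH_MODEL / VERTEX_PRO_MODEL.
function resolveModel(tier: Tier, env: Env = {}): string {
  const override = tier === "flash" ? env.VERTEX_FLASH_MODEL : env.VERTEX_PRO_MODEL;
  return override || DEFAULT_MODELS[tier];
}
```

With DEEP_TIER=flash in the environment, resolveTier({ deep: true }, env) returns "flash", which is the low-cost mode described above.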

Automatic fallback: on persistent 429/5xx, a call retries with exponential backoff + jitter, then falls back across tiers once (flash↔pro) before failing. This is graceful rate-limit handling, not silent data loss — every retry/fallback is logged (ai.retry, ai.tier_fallback).
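A rough sketch of that retry-then-fallback flow, under stated assumptions: errors carry a numeric status field, backoff uses "full jitter" with a 500 ms base and 8 s cap, and each tier gets three attempts. Those numbers and the names generateWithFallback, backoffMs, and RetryDeps are illustrative, not taken from the codebase; only the ai.retry and ai.tier_fallback event names come from the text.

```typescript
type Tier = "flash" | "pro";

// Assumed error shape: any thrown value with a numeric HTTP `status`.
function statusOf(err: unknown): number | undefined {
  const status = (err as { status?: unknown } | null)?.status;
  return typeof status === "number" ? status : undefined;
}

function isRetryable(err: unknown): boolean {
  const s = statusOf(err);
  return s !== undefined && (s === 429 || s >= 500);
}

// Full-jitter backoff: a random delay in [0, min(cap, base * 2^attempt)).
function backoffMs(attempt: number, baseMs = 500, capMs = 8000, rand: () => number = Math.random): number {
  return Math.floor(rand() * Math.min(capMs, baseMs * 2 ** attempt));
}

interface RetryDeps {
  maxAttempts?: number; // attempts per tier
  sleep?: (ms: number) => Promise<void>; // injectable so tests do not wait
  log?: (event: string, data: Record<string, unknown>) => void;
}

async function generateWithFallback<T>(
  call: (tier: Tier) => Promise<T>,
  tier: Tier,
  deps: RetryDeps = {},
): Promise<T> {
  const maxAttempts = deps.maxAttempts ?? 3;
  const sleep = deps.sleep ?? ((ms: number) => new Promise<void>((r) => setTimeout(r, ms)));
  const log = deps.log ?? (() => {});

  const attemptTier = async (t: Tier): Promise<T> => {
    for (let attempt = 0; ; attempt++) {
      try {
        return await call(t);
      } catch (err) {
        // Non-retryable errors and exhausted attempts bubble up unchanged.
        if (!isRetryable(err) || attempt >= maxAttempts - 1) throw err;
        const delayMs = backoffMs(attempt);
        log("ai.retry", { tier: t, attempt: attempt + 1, delayMs });
        await sleep(delayMs);
      }
    }
  };

  try {
    return await attemptTier(tier);
  } catch (err) {
    if (!isRetryable(err)) throw err;
    // Fall back across tiers exactly once, then let the failure surface.
    const fallback: Tier = tier === "flash" ? "pro" : "flash";
    log("ai.tier_fallback", { from: tier, to: fallback });
    return attemptTier(fallback);
  }
}
```

Jitter matters here because many queued jobs tend to hit a 429 at the same moment; randomizing the delay keeps them from retrying in lockstep.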

Authentication (dependency-free)

Two modes, checked in order (lib/ai/vertex-auth.ts):

VERTEX_ACCESS_TOKEN — a pre-minted OAuth token (e.g. Workload Identity on Cloud Run). Simplest.
GOOGLE_SERVICE_ACCOUNT_JSON (raw or base64) — we sign a JWT with node:crypto (RS256) and exchange it for an access token at Google's token endpoint, cached ~55 min. No google-auth-library needed.
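For readers who have not done the service-account flow by hand, here is a minimal sketch of the second mode. It follows Google's documented JWT-bearer exchange (RS256 assertion posted to the OAuth token endpoint with the cloud-platform scope), but the function names, the module-level cache, and the raw-vs-base64 detection are assumptions for illustration and may differ from lib/ai/vertex-auth.ts. It assumes Node 18+ for the global fetch.

```typescript
import { createSign } from "node:crypto";

interface ServiceAccount {
  client_email: string;
  private_key: string;
}

const TOKEN_URL = "https://oauth2.googleapis.com/token";
const SCOPE = "https://www.googleapis.com/auth/cloud-platform";

const b64url = (input: string): string => Buffer.from(input).toString("base64url");

// Accepts either the raw JSON key or a base64-encoded copy of it.
function parseServiceAccount(raw: string): ServiceAccount {
  const trimmed = raw.trim();
  const text = trimmed.startsWith("{") ? trimmed : Buffer.from(trimmed, "base64").toString("utf8");
  return JSON.parse(text) as ServiceAccount;
}

// Builds the signed JWT assertion: base64url(header).base64url(claims).signature
function buildAssertion(sa: ServiceAccount, nowSec: number): string {
  const header = b64url(JSON.stringify({ alg: "RS256", typ: "JWT" }));
  const claims = b64url(
    JSON.stringify({ iss: sa.client_email, scope: SCOPE, aud: TOKEN_URL, iat: nowSec, exp: nowSec + 3600 }),
  );
  const signingInput = `${header}.${claims}`;
  const signature = createSign("RSA-SHA256").update(signingInput).sign(sa.private_key, "base64url");
  return `${signingInput}.${signature}`;
}

// Hypothetical in-memory cache; tokens last 60 min, so ~55 min leaves a safety margin.
let cached: { token: string; expiresAtMs: number } | null = null;

async function getAccessToken(sa: ServiceAccount): Promise<string> {
  if (cached && Date.now() < cached.expiresAtMs) return cached.token;
  const res = await fetch(TOKEN_URL, {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: new URLSearchParams({
      grant_type: "urn:ietf:params:oauth:grant-type:jwt-bearer",
      assertion: buildAssertion(sa, Math.floor(Date.now() / 1000)),
    }),
  });
  if (!res.ok) throw new Error(`token exchange failed: HTTP ${res.status}`);
  const json = (await res.json()) as { access_token: string };
  cached = { token: json.access_token, expiresAtMs: Date.now() + 55 * 60 * 1000 };
  return cached.token;
}
```

The returned token is then sent as an Authorization: Bearer header on Vertex requests, the same way a pre-minted VERTEX_ACCESS_TOKEN would be.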

Required IAM

Grant the service account roles/aiplatform.user on the project. (Firestore uses its own FIREBASE_* credentials.)

Token usage, cost & quota

Every live call records usageMetadata (lib/ai/usage.ts):

Per-call cost estimated from a pricing table (USD / 1M tokens), emitted as a structured ai.usage log line.
Daily counters (_ai_usage_daily) via cheap increments — no table scans.
Quota monitor (_ai_quota) tracks tokens/min vs VERTEX_TOKENS_PER_MIN; nearLimit flips at 85% so the system can back off before hard 429s.
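To make the nearLimit behavior concrete, here is a simplified in-memory sketch of a tokens-per-minute monitor. The real monitor is described as persisting to _ai_quota; this version keeps samples in an array purely for illustration, and the class and field names are hypothetical. The 85% threshold and the VERTEX_TOKENS_PER_MIN limit are from the text.

```typescript
interface QuotaSnapshot {
  tokensPerMin: number; // tokens recorded in the trailing 60 seconds
  limit: number;
  utilization: number; // tokensPerMin / limit
  nearLimit: boolean; // true once utilization reaches the threshold
}

class QuotaMonitor {
  private samples: Array<{ atMs: number; tokens: number }> = [];
  private readonly limit: number;
  private readonly threshold: number;

  constructor(limit: number, threshold = 0.85) {
    this.limit = limit;
    this.threshold = threshold;
  }

  // Call once per live request with the total token count from usageMetadata.
  record(tokens: number, nowMs: number = Date.now()): void {
    this.samples.push({ atMs: nowMs, tokens });
  }

  snapshot(nowMs: number = Date.now()): QuotaSnapshot {
    // Drop samples older than the sliding one-minute window.
    const cutoff = nowMs - 60000;
    this.samples = this.samples.filter((s) => s.atMs > cutoff);
    const tokensPerMin = this.samples.reduce((sum, s) => sum + s.tokens, 0);
    const utilization = this.limit > 0 ? tokensPerMin / this.limit : 0;
    return { tokensPerMin, limit: this.limit, utilization, nearLimit: utilization >= this.threshold };
  }
}
```

A caller can check snapshot().nearLimit before dispatching a request and delay or downgrade the call, rather than waiting for Vertex to return a hard 429.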

Cost + quota appear live on /ops/analytics.

Pricing constants are estimates — confirm against the current Vertex pricing page and override via VERTEX_FLASH_IN_PER_M, VERTEX_PRO_OUT_PER_M, etc.
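The per-call estimate is simple arithmetic over the token counts in usageMetadata. The sketch below shows the shape of it with placeholder prices that are deliberately NOT real Vertex rates. The override names VERTEX_FLASH_OUT_PER_M and VERTEX_PRO_IN_PER_M are inferred from the pattern of the two names given above, and estimateCostUsd, priceFor, and envNumber are hypothetical helpers.

```typescript
type Tier = "flash" | "pro";
type Env = Record<string, string | undefined>;

// Field names match the usageMetadata object returned by Gemini responses.
interface UsageMetadata {
  promptTokenCount?: number;
  candidatesTokenCount?: number;
}

interface Price {
  inPerM: number; // USD per 1M input tokens
  outPerM: number; // USD per 1M output tokens
}

// Placeholder numbers for illustration only. They are NOT real Vertex prices.
const PLACEHOLDER_PRICES: Record<Tier, Price> = {
  flash: { inPerM: 1, outPerM: 2 },
  pro: { inPerM: 10, outPerM: 20 },
};

function envNumber(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  return Number.isFinite(n) && n >= 0 ? n : fallback;
}

// Resolves e.g. VERTEX_FLASH_IN_PER_M / VERTEX_PRO_OUT_PER_M, falling back to the table.
function priceFor(tier: Tier, env: Env = {}): Price {
  const prefix = `VERTEX_${tier.toUpperCase()}`;
  const base = PLACEHOLDER_PRICES[tier];
  return {
    inPerM: envNumber(env[`${prefix}_IN_PER_M`], base.inPerM),
    outPerM: envNumber(env[`${prefix}_OUT_PER_M`], base.outPerM),
  };
}

function estimateCostUsd(tier: Tier, usage: UsageMetadata, env: Env = {}): number {
  const price = priceFor(tier, env);
  const inTokens = usage.promptTokenCount ?? 0;
  const outTokens = usage.candidatesTokenCount ?? 0;
  return (inTokens * price.inPerM + outTokens * price.outPerM) / 1e6;
}
```

Input and output tokens are priced separately because output is typically the more expensive side, so a single blended rate would misstate the cost of long generations.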

Embeddings

Embeddings use the Vertex :predict endpoint with text-multilingual-embedding-002, which handles both English and Hindi and returns 768-dimensional vectors compared by cosine distance. embedBatch() packs up to 25 texts per request, so batching means fewer calls and lower cost. When Vertex is unconfigured, it falls back to a deterministic, Devanagari-aware hash embedding, so dedup, clustering, and search stay testable offline.
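The batching and the offline fallback can be sketched together. This is one possible construction, not the project's actual algorithm: the fallback here is a signed feature-hashing embedding using FNV-1a, and "Devanagari-aware" is approximated by splitting on whitespace and on the danda marks (U+0964, U+0965) so Hindi words stay whole. The embedLive parameter is a hypothetical stand-in for the real Vertex :predict call.

```typescript
const EMBED_DIMS = 768;
const MAX_BATCH = 25;

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// 32-bit FNV-1a over UTF-16 code units; deterministic across runs and machines.
function fnv1a(text: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Signed feature hashing: each token adds +1 or -1 to one bucket, then L2-normalize.
function hashEmbedding(text: string, dims = EMBED_DIMS): number[] {
  const vec = new Array<number>(dims).fill(0);
  const tokens = text.toLowerCase().split(/[\s.,!?;:"'()\[\]{}|\u0964\u0965]+/).filter(Boolean);
  for (const token of tokens) {
    const h = fnv1a(token);
    const sign = (h & 1) === 0 ? 1 : -1;
    vec[(h >>> 1) % dims] += sign;
  }
  const norm = Math.sqrt(vec.reduce((s, x) => s + x * x, 0));
  return norm === 0 ? vec : vec.map((x) => x / norm);
}

// Sends at most MAX_BATCH texts per live request; uses the hash fallback when no
// live embedder is supplied (i.e. Vertex is unconfigured).
async function embedBatch(
  texts: string[],
  embedLive?: (batch: string[]) => Promise<number[][]>,
): Promise<number[][]> {
  const out: number[][] = [];
  for (const batch of chunk(texts, MAX_BATCH)) {
    const vectors = embedLive ? await embedLive(batch) : batch.map((t) => hashEmbedding(t));
    out.push(...vectors);
  }
  return out;
}
```

Hash embeddings carry no semantic meaning, only token overlap, so they are suitable for exercising the dedup and search plumbing in tests but not for judging retrieval quality.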

Environment

# Vertex AI (required for live)
VERTEX_PROJECT_ID=your-gcp-project          # falls back to FIREBASE_PROJECT_ID
VERTEX_LOCATION=us-central1                 # or 'global'
GOOGLE_SERVICE_ACCOUNT_JSON={...}           # OR VERTEX_ACCESS_TOKEN=...
# optional overrides
VERTEX_FLASH_MODEL=gemini-2.5-flash
VERTEX_PRO_MODEL=gemini-2.5-pro
VERTEX_EMBED_MODEL=text-multilingual-embedding-002
VERTEX_TOKENS_PER_MIN=100000
DEEP_TIER=pro                               # set 'flash' for cheap mode
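As an illustration of how these variables might be resolved at startup, here is a hedged sketch. loadVertexConfig, generateContentUrl, and the VertexConfig shape are hypothetical; the us-central1 and 100000 defaults simply mirror the example values above and may not match the real defaults. Returning null is this sketch's way of signalling that the caller should use the deterministic mock provider. The URL format follows the public Vertex AI REST convention for Google publisher models.

```typescript
type Env = Record<string, string | undefined>;

interface VertexConfig {
  projectId: string;
  location: string;
  auth: { mode: "access_token"; token: string } | { mode: "service_account"; json: string };
  tokensPerMin: number;
}

// Returns null when Vertex is not configured, so the caller can use the mock provider.
function loadVertexConfig(env: Env): VertexConfig | null {
  const projectId = env.VERTEX_PROJECT_ID || env.FIREBASE_PROJECT_ID;
  const token = env.VERTEX_ACCESS_TOKEN;
  const json = env.GOOGLE_SERVICE_ACCOUNT_JSON;
  if (!projectId || (!token && !json)) return null;

  // Auth modes are checked in order: a pre-minted token wins over the JSON key.
  const auth: VertexConfig["auth"] = token
    ? { mode: "access_token", token }
    : { mode: "service_account", json: json as string };

  const perMin = Number(env.VERTEX_TOKENS_PER_MIN);
  return {
    projectId,
    location: env.VERTEX_LOCATION || "us-central1", // assumed default
    auth,
    tokensPerMin: perMin > 0 ? perMin : 100000, // assumed default
  };
}

// Regional locations use a "<location>-" host prefix; 'global' uses the bare host.
function generateContentUrl(cfg: VertexConfig, model: string): string {
  const host =
    cfg.location === "global" ? "aiplatform.googleapis.com" : `${cfg.location}-aiplatform.googleapis.com`;
  return (
    `https://${host}/v1/projects/${cfg.projectId}/locations/${cfg.location}` +
    `/publishers/google/models/${model}:generateContent`
  );
}
```

Passing the environment in as an argument, rather than reading process.env inside the function, keeps the resolver testable without touching real credentials.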

Go-live checklist

Create a service account, grant roles/aiplatform.user, download the JSON key.
Set GOOGLE_SERVICE_ACCOUNT_JSON (base64 is convenient) + VERTEX_PROJECT_ID + VERTEX_LOCATION in Vercel.
Enable the Vertex AI API on the project.
curl /api/health → expect "ai":"live", provider: vertex:….
Generate one bundle → check /ops/analytics shows non-zero tokens + cost.

Verification status

Mock mode: full pipeline verified (build, typecheck, autopilot, queue) with the deterministic provider.
Vertex mode: auth/generation/embedding/usage code paths are typechecked and wired; live confirmation requires your service account (steps 4 and 5 of the go-live checklist).
