How AI Execution Lab uses Vertex AI Gemini 2.5 (Flash + Pro) exclusively: dependency-free service-account auth, model-tier routing with automatic fallback, graceful rate-limit handling, token usage + cost tracking, Vertex quota monitoring, batched multilingual embeddings, and a deterministic mock fallback. Includes the exact env vars, IAM roles, and a go-live checklist.
The platform talks to Vertex AI exclusively — no direct Generative Language (API-key) calls. The provider abstraction is unchanged for callers: getProvider().generate(), generateJSON(), and the embeddings API keep their old signatures, so nothing downstream broke.
| Tier | Model (default) | Used for |
|---|---|---|
| flash | gemini-2.5-flash | SEO, GEO, social, reel, FAQ, classification: everything fast |
| pro | gemini-2.5-pro | The long-form article (deep analysis) |
Callers select a tier via generate(prompt, { tier: 'pro' }) or { deep: true }. Default is Flash. The article generator uses DEEP_TIER (default pro); set DEEP_TIER=flash to cut article cost ~4× for low-cost mode.
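The tier selection described above can be sketched as a small pure function. This is an illustration only: `resolveTier`, the `Env` type, and the precedence rule (explicit `tier` wins over `deep`, and `deep: true` resolves to `DEEP_TIER`) are assumptions, not the actual code in the provider.

```typescript
type Tier = 'flash' | 'pro';
type Env = Record<string, string | undefined>;

interface GenerateOptions {
  tier?: Tier;
  deep?: boolean;
}

// Hypothetical helper: DEEP_TIER defaults to 'pro'; only the literal 'flash' switches it.
function resolveDeepTier(env: Env): Tier {
  return env.DEEP_TIER === 'flash' ? 'flash' : 'pro';
}

// Assumed precedence: explicit tier, then deep flag, then the Flash default.
function resolveTier(opts: GenerateOptions, env: Env): Tier {
  if (opts.tier) return opts.tier;
  if (opts.deep) return resolveDeepTier(env);
  return 'flash';
}
```

With `DEEP_TIER=flash` in the environment, `resolveTier({ deep: true }, env)` would return `'flash'`, which is the low-cost mode for articles.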
Automatic fallback: on persistent 429/5xx, a call retries with exponential backoff + jitter, then falls back across tiers once (flash↔pro) before failing. This is graceful rate-limit handling, not silent data loss — every retry/fallback is logged (ai.retry, ai.tier_fallback).
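A minimal sketch of this retry-then-fallback behaviour, under stated assumptions: `HttpError`, `withRetry`, `generateWithFallback`, and the `RetryOptions` shape are hypothetical names, and "full jitter" is one common jitter strategy rather than necessarily the one the platform uses. Only the log event names (`ai.retry`, `ai.tier_fallback`) come from the text above.

```typescript
type Tier = 'flash' | 'pro';

// Hypothetical error type carrying the HTTP status of a failed call.
class HttpError extends Error {
  status: number;
  constructor(status: number) {
    super(`HTTP ${status}`);
    this.status = status;
  }
}

interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  log: (event: string, data: Record<string, unknown>) => void;
}

// Only 429 and 5xx are treated as transient.
const isRetryable = (err: unknown): boolean =>
  err instanceof HttpError && (err.status === 429 || err.status >= 500);

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(call: () => Promise<T>, tier: Tier, o: RetryOptions): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (!isRetryable(err) || attempt >= o.maxAttempts) throw err;
      // Exponential backoff with full jitter: random delay in [0, base * 2^(attempt-1)).
      const delayMs = Math.random() * o.baseDelayMs * 2 ** (attempt - 1);
      o.log('ai.retry', { tier, attempt, delayMs: Math.round(delayMs) });
      await sleep(delayMs);
    }
  }
}

// Retries on the requested tier, then falls back to the other tier exactly once.
async function generateWithFallback<T>(
  callTier: (tier: Tier) => Promise<T>,
  tier: Tier,
  o: RetryOptions,
): Promise<T> {
  try {
    return await withRetry(() => callTier(tier), tier, o);
  } catch (err) {
    if (!isRetryable(err)) throw err;
    const other: Tier = tier === 'flash' ? 'pro' : 'flash';
    o.log('ai.tier_fallback', { from: tier, to: other });
    return withRetry(() => callTier(other), other, o);
  }
}
```

Non-retryable errors (for example a 400) propagate immediately, so a malformed request never triggers a fallback to the other tier.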
Two modes, checked in order (lib/ai/vertex-auth.ts):
1. VERTEX_ACCESS_TOKEN: a pre-minted OAuth token (e.g. Workload Identity on Cloud Run). Simplest.
2. GOOGLE_SERVICE_ACCOUNT_JSON (raw or base64): we sign a JWT with node:crypto (RS256) and exchange it for an access token at Google's token endpoint, cached ~55 min. No google-auth-library needed.

Grant the service account roles/aiplatform.user on the project. (Firestore uses its own FIREBASE_* credentials.)
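The service-account path can be sketched roughly as follows. This is not the contents of lib/ai/vertex-auth.ts: the function names, the module-level cache, and the choice of the `cloud-platform` scope are assumptions. The token endpoint and the `jwt-bearer` grant type are the standard Google OAuth 2.0 service-account flow.

```typescript
import { createSign } from 'node:crypto';

interface ServiceAccount {
  client_email: string;
  private_key: string;
}

const TOKEN_URL = 'https://oauth2.googleapis.com/token';

// Accepts the env value as raw JSON or as base64-encoded JSON.
function parseServiceAccount(value: string): ServiceAccount {
  const trimmed = value.trim();
  const text = trimmed.startsWith('{')
    ? trimmed
    : Buffer.from(trimmed, 'base64').toString('utf8');
  return JSON.parse(text) as ServiceAccount;
}

const b64url = (s: string): string => Buffer.from(s, 'utf8').toString('base64url');

// Builds and signs an RS256 JWT assertion valid for one hour.
function buildSignedJwt(sa: ServiceAccount, nowSec: number): string {
  const header = { alg: 'RS256', typ: 'JWT' };
  const claims = {
    iss: sa.client_email,
    scope: 'https://www.googleapis.com/auth/cloud-platform', // assumed scope
    aud: TOKEN_URL,
    iat: nowSec,
    exp: nowSec + 3600,
  };
  const signingInput = `${b64url(JSON.stringify(header))}.${b64url(JSON.stringify(claims))}`;
  const signature = createSign('RSA-SHA256').update(signingInput).sign(sa.private_key);
  return `${signingInput}.${signature.toString('base64url')}`;
}

// Simple in-process cache; ~55 min keeps us inside the one-hour token lifetime.
let cached: { token: string; expiresAtMs: number } | null = null;

async function getAccessToken(sa: ServiceAccount): Promise<string> {
  if (cached && Date.now() < cached.expiresAtMs) return cached.token;
  const res = await fetch(TOKEN_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body: new URLSearchParams({
      grant_type: 'urn:ietf:params:oauth:grant-type:jwt-bearer',
      assertion: buildSignedJwt(sa, Math.floor(Date.now() / 1000)),
    }),
  });
  if (!res.ok) throw new Error(`token exchange failed: ${res.status}`);
  const json = (await res.json()) as { access_token: string };
  cached = { token: json.access_token, expiresAtMs: Date.now() + 55 * 60 * 1000 };
  return cached.token;
}
```

The returned token would then be sent as `Authorization: Bearer <token>` on Vertex requests. Keeping the signing step in a separate pure function makes it testable without any network access.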
Every live call records usageMetadata (lib/ai/usage.ts):
- An ai.usage log line.
- A daily aggregate (_ai_usage_daily), updated via cheap increments with no table scans.
- A quota tracker (_ai_quota) that tracks tokens/min vs VERTEX_TOKENS_PER_MIN; nearLimit flips at 85% so the system can back off before hard 429s.

Cost + quota appear live on /ops/analytics.
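To make the nearLimit rule concrete, here is a minimal in-memory sketch of a per-minute token counter. The real tracker persists to `_ai_quota`; the `createQuotaTracker` name, the fixed one-minute bucket, and the in-memory state are assumptions made for illustration. The `usageMetadata` field names are those returned by Gemini responses.

```typescript
// Subset of the usageMetadata object returned with a Gemini response.
interface UsageMetadata {
  promptTokenCount?: number;
  candidatesTokenCount?: number;
  totalTokenCount?: number;
}

interface QuotaStatus {
  tokens: number;
  limit: number;
  nearLimit: boolean;
}

// Hypothetical tracker: counts tokens in the current wall-clock minute.
function createQuotaTracker(tokensPerMin: number, threshold = 0.85) {
  let minute = -1;
  let tokens = 0;
  return function record(usage: UsageMetadata, nowMs: number): QuotaStatus {
    const current = Math.floor(nowMs / 60_000);
    if (current !== minute) {
      // A new minute started: reset the bucket.
      minute = current;
      tokens = 0;
    }
    tokens +=
      usage.totalTokenCount ??
      ((usage.promptTokenCount ?? 0) + (usage.candidatesTokenCount ?? 0));
    return { tokens, limit: tokensPerMin, nearLimit: tokens >= threshold * tokensPerMin };
  };
}
```

A caller could check `nearLimit` after each call and delay or downshift the next request, which is the "back off before hard 429s" behaviour described above.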
Pricing constants are estimates; confirm against the current Vertex pricing page and override via VERTEX_FLASH_IN_PER_M, VERTEX_PRO_OUT_PER_M, etc.
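The cost arithmetic is simple enough to sketch. The assumptions here: `*_PER_M` means "price per million tokens", the two variables not named above (`VERTEX_FLASH_OUT_PER_M`, `VERTEX_PRO_IN_PER_M`) follow the same naming pattern, and `readRates` / `estimateCost` are hypothetical helpers. No real prices are included; defaults are passed in by the caller.

```typescript
type Tier = 'flash' | 'pro';
type Env = Record<string, string | undefined>;

interface Rates {
  inPerM: number;  // price per 1M input tokens
  outPerM: number; // price per 1M output tokens
}

// Reads an override from env, falling back when unset, empty, or not a valid number.
function readRates(tier: Tier, env: Env, defaults: Rates): Rates {
  const prefix = tier === 'pro' ? 'VERTEX_PRO' : 'VERTEX_FLASH';
  const num = (name: string, fallback: number): number => {
    const raw = env[name];
    if (raw === undefined || raw.trim() === '') return fallback;
    const value = Number(raw);
    return Number.isFinite(value) && value >= 0 ? value : fallback;
  };
  return {
    inPerM: num(`${prefix}_IN_PER_M`, defaults.inPerM),
    outPerM: num(`${prefix}_OUT_PER_M`, defaults.outPerM),
  };
}

// cost = input/1e6 * inPerM + output/1e6 * outPerM
function estimateCost(inputTokens: number, outputTokens: number, rates: Rates): number {
  return (inputTokens / 1e6) * rates.inPerM + (outputTokens / 1e6) * rates.outPerM;
}
```

For example, with illustrative rates of 1 per million input tokens and 2 per million output tokens, a call using 1,000,000 input and 500,000 output tokens costs 1 + 1 = 2 in whatever currency the rates are expressed in.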
Vertex :predict with text-multilingual-embedding-002 (handles English + Hindi), 768-d, COSINE. embedBatch() packs up to 25 texts per request (real batching → fewer calls, lower cost). Falls back to a deterministic hash embedding (Devanagari-aware) when Vertex is unconfigured, so dedup/clustering/search stay testable offline.
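A rough sketch of the batching and the offline fallback, under assumptions: `chunk`, `mockEmbed`, and the `send` callback are hypothetical, and the hashing scheme (FNV-1a over tokens, signed feature hashing, L2 normalisation) is one plausible way to build a deterministic embedding, not necessarily the one used in the codebase. "Devanagari-aware" is interpreted here as keeping the U+0900 to U+097F block inside tokens.

```typescript
const BATCH_SIZE = 25;
const DIMS = 768;

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// 32-bit FNV-1a hash over UTF-16 code units.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Deterministic stand-in embedding: hash each token into a signed bucket, then normalise.
function mockEmbed(text: string, dims = DIMS): number[] {
  const vec: number[] = new Array<number>(dims).fill(0);
  // Split on anything that is not a Latin letter, digit, or Devanagari character.
  const tokens = text.toLowerCase().split(/[^a-z0-9\u0900-\u097F]+/).filter(Boolean);
  for (const token of tokens) {
    const h = fnv1a(token);
    const sign = (h & 0x80000000) !== 0 ? 1 : -1;
    const idx = h % dims;
    vec[idx] = (vec[idx] ?? 0) + sign;
  }
  const norm = Math.hypot(...vec) || 1;
  return vec.map((v) => v / norm);
}

// Packs up to 25 texts per request; uses the mock when no transport is configured.
async function embedBatch(
  texts: string[],
  send?: (batch: string[]) => Promise<number[][]>,
): Promise<number[][]> {
  if (!send) return texts.map((t) => mockEmbed(t));
  const out: number[][] = [];
  for (const batch of chunk(texts, BATCH_SIZE)) out.push(...(await send(batch)));
  return out;
}
```

Because the mock vectors are unit length, cosine similarity reduces to a dot product, and identical texts always map to identical vectors, which is what keeps dedup and clustering tests reproducible offline.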
# Vertex AI (required for live)
VERTEX_PROJECT_ID=your-gcp-project # falls back to FIREBASE_PROJECT_ID
VERTEX_LOCATION=us-central1 # or 'global'
GOOGLE_SERVICE_ACCOUNT_JSON={...} # OR VERTEX_ACCESS_TOKEN=...
# optional overrides
VERTEX_FLASH_MODEL=gemini-2.5-flash
VERTEX_PRO_MODEL=gemini-2.5-pro
VERTEX_EMBED_MODEL=text-multilingual-embedding-002
VERTEX_TOKENS_PER_MIN=100000
DEEP_TIER=pro # set 'flash' for cheap mode
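As an illustration of how these variables might be resolved at startup, here is a small loader. Everything about it is an assumption: the `loadVertexConfig` name, the `VertexConfig` shape, treating the values listed under "optional overrides" as defaults, and the rule that missing project, location, or credentials selects mock mode. Only the variable names and the FIREBASE_PROJECT_ID fallback come from the listing above.

```typescript
type Env = Record<string, string | undefined>;
type Tier = 'flash' | 'pro';
type AuthMode = 'access_token' | 'service_account';

interface VertexConfig {
  mode: 'live' | 'mock';
  projectId?: string;
  location?: string;
  auth?: AuthMode;
  flashModel: string;
  proModel: string;
  embedModel: string;
  tokensPerMin: number;
  deepTier: Tier;
}

function loadVertexConfig(env: Env): VertexConfig {
  const projectId = env.VERTEX_PROJECT_ID || env.FIREBASE_PROJECT_ID;
  const location = env.VERTEX_LOCATION;
  // VERTEX_ACCESS_TOKEN is checked first, then the service-account JSON.
  const auth: AuthMode | undefined = env.VERTEX_ACCESS_TOKEN
    ? 'access_token'
    : env.GOOGLE_SERVICE_ACCOUNT_JSON
      ? 'service_account'
      : undefined;
  const tokensPerMin = Number(env.VERTEX_TOKENS_PER_MIN);
  return {
    mode: projectId && location && auth ? 'live' : 'mock',
    projectId,
    location,
    auth,
    flashModel: env.VERTEX_FLASH_MODEL || 'gemini-2.5-flash',
    proModel: env.VERTEX_PRO_MODEL || 'gemini-2.5-pro',
    embedModel: env.VERTEX_EMBED_MODEL || 'text-multilingual-embedding-002',
    // Assumed default, taken from the example value in the listing.
    tokensPerMin: Number.isFinite(tokensPerMin) && tokensPerMin > 0 ? tokensPerMin : 100000,
    deepTier: env.DEEP_TIER === 'flash' ? 'flash' : 'pro',
  };
}
```

In a Node process this would be called as `loadVertexConfig(process.env)`; taking the environment as a parameter keeps the function easy to test.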
1. Create a service account, grant it roles/aiplatform.user, download the JSON key.
2. Set GOOGLE_SERVICE_ACCOUNT_JSON (base64 is convenient) + VERTEX_PROJECT_ID + VERTEX_LOCATION in Vercel.
3. curl /api/health → expect "ai":"live", provider: vertex:….
4. Confirm /ops/analytics shows non-zero tokens + cost.