LAB · confirmed

Gemini Structured Output Reliability — Prompt Iteration Experiment

Testing whether embedding an exact JSON schema plus an explicit format constraint in the prompt reduces malformed output frequency in Gemini 1.5-flash. Three prompt iterations were tested against a minimal baseline during the ScamCheck and TrustSeal builds. The final prompt (embedded schema plus format suppression) reduced parse failures from ~6% to <1% of calls.

February 15, 2026 · by Anis Ansari, Founder, A Square Solutions · 6 min read

Hypothesis

If we embed the exact expected JSON schema and explicit format-suppression instructions in the Gemini prompt, then parse failure frequency (SyntaxError on JSON.parse) will decrease from ~6% observed baseline to under 1%, because language models produce decorated output by default in the absence of explicit format constraints.

#gemini #structured-output #firebase-functions #scamcheck #trustseal #experiment #json #prompt-engineering


Gemini 1.5-flash does not reliably produce raw JSON without explicit format constraints. The model's default behavior — wrapping output in markdown code fences, prepending explanation text — is correct for chat contexts and incorrect for API consumers calling JSON.parse(). This experiment tested three prompt structures against observed parse failure rates during ScamCheck and TrustSeal development (February 2026).

Level: Intermediate · Time to implement: 2h (iteration + testing) · Data: production data from ScamCheck and TrustSeal Cloud Functions (real user inputs, real Gemini responses logged)

1. Hypothesis

If we embed the exact expected JSON schema and explicit format-suppression instructions ("no code fences, no text before or after the JSON object") into the Gemini prompt,

Then parse failure frequency — measured as SyntaxError events on JSON.parse() in the Cloud Function logs — will decrease from the ~6% observed baseline to under 1%,

Because language models produce decorated output (markdown fences, explanatory text) by default when the output format is not explicitly constrained. The model is following a learned behavior from training data where code and structured data are presented in fenced blocks. Explicit contradiction of this behavior in the prompt shifts the model's output distribution toward raw, parseable JSON.

2. Baseline: Iteration 0

Prompt structure (minimal):

Analyze whether the following message is a scam. Return a JSON response.

Input: {userInput}

Observed behavior:

Clean JSON returned: ~94% of calls
Markdown code fence wrapping: ~4% of calls
Explanation prefix/suffix: ~2% of calls
Parse failure rate: ~6%

Sample failure output:

```json
{"verdict":"LIKELY_SCAM","probability":0.85,"signals":["urgency","financial_request"],"explanation":"This message contains several high-confidence scam indicators including financial urgency and impersonation signals."}
```

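The model reasoned correctly but decorated the output. JSON.parse() throws a SyntaxError on the very first character, the opening backtick of the code fence. The exact message depends on the runtime; older V8 versions report "Unexpected token ` in JSON at position 0".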
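The three buckets in the observed-behavior list can be reproduced from logged responses with a small classifier. The sketch below is an illustration, not the logging code used in ScamCheck or TrustSeal: classifyOutput, summarize, and the sample responses are hypothetical names and data introduced here, and the bucketing rule (a response that opens with a fence counts as fence wrapping, any other parse failure counts as prefix/suffix decoration) is an assumption.

```javascript
// Built with repeat() so this snippet contains no literal fence sequence.
const FENCE = '`'.repeat(3);

// Hypothetical helper: buckets one raw model response.
function classifyOutput(raw) {
  const text = raw.trim();
  try {
    JSON.parse(text);
    return 'clean';
  } catch (err) {
    // Only parse failures are classified further; anything else is rethrown.
    if (!(err instanceof SyntaxError)) throw err;
  }
  // A response that opens with a code fence counts as fence wrapping.
  if (text.startsWith(FENCE)) return 'fenced';
  // Everything else: text before or after the JSON, or no JSON at all.
  return 'decorated';
}

// Tally the buckets over a list of logged responses and compute the failure rate.
function summarize(responses) {
  const counts = { clean: 0, fenced: 0, decorated: 0 };
  for (const raw of responses) counts[classifyOutput(raw)] += 1;
  const failures = counts.fenced + counts.decorated;
  const failureRate = responses.length ? failures / responses.length : 0;
  return { ...counts, failureRate };
}

// Made-up sample data for illustration only.
const sample = [
  '{"verdict":"UNCERTAIN","probability":0.5,"signals":[],"explanation":"Too short."}',
  FENCE + 'json\n{"verdict":"LIKELY_SCAM","probability":0.85,"signals":[],"explanation":"x"}\n' + FENCE,
  'Here is the analysis: {"verdict":"UNLIKELY_SCAM","probability":0.1,"signals":[],"explanation":"y"}',
  '{"verdict":"UNLIKELY_SCAM","probability":0.2,"signals":[],"explanation":"z"}',
];

console.log(summarize(sample)); // clean: 2, fenced: 1, decorated: 1, failureRate: 0.5
```

Running the same tally after each prompt change gives a like-for-like failure rate, provided the inputs are comparable between runs.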

3. Iteration 1: Schema Description in Prose

Prompt structure:

Analyze whether the following message is a scam. Return a JSON response with the following fields:
- verdict: "LIKELY_SCAM", "UNLIKELY_SCAM", or "UNCERTAIN"
- probability: a number between 0 and 1
- signals: an array of detected signal strings
- explanation: a plain language explanation

Input: {userInput}

Observed behavior:

Clean JSON returned: ~96% of calls
Markdown code fence wrapping: ~3% of calls
Explanation prefix/suffix: ~1% of calls
Parse failure rate: ~4%

The improvement was marginal (~6% to ~4%): describing the schema in prose did not eliminate decorated output. The model still treated the request as a "write code/data for me" task and applied its learned formatting defaults.

4. Iteration 2: Exact Schema Embedded (Without Format Suppression)

Prompt structure:

You are a scam detection system. Analyze the following input and return ONLY a JSON object.

Required output format:
{
  "verdict": "LIKELY_SCAM" | "UNLIKELY_SCAM" | "UNCERTAIN",
  "probability": <number between 0 and 1>,
  "signals": [<array of detected signal strings>],
  "explanation": "<plain language explanation for non-technical users>"
}

Input: {userInput}

Observed behavior:

Clean JSON returned: ~98% of calls
Markdown code fence wrapping: ~1.5% of calls
Explanation prefix/suffix: ~0.5% of calls
Parse failure rate: ~2%

Substantial improvement from embedding the literal schema shape. The model mapped its output directly to the provided template structure. Residual failures came from inputs that triggered long reasoning chains about ambiguous signals.
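For illustration, the {userInput} placeholder in these prompts can be filled with plain string substitution. This is a minimal sketch under assumptions: PROMPT_TEMPLATE and buildPrompt are hypothetical names, and the article does not show how the Cloud Functions actually assemble the prompt. The template text mirrors the Iteration 2 prompt.

```javascript
// Hypothetical template constant; the text mirrors the Iteration 2 prompt.
const PROMPT_TEMPLATE = [
  'You are a scam detection system. Analyze the following input and return ONLY a JSON object.',
  '',
  'Required output format:',
  '{',
  '  "verdict": "LIKELY_SCAM" | "UNLIKELY_SCAM" | "UNCERTAIN",',
  '  "probability": <number between 0 and 1>,',
  '  "signals": [<array of detected signal strings>],',
  '  "explanation": "<plain language explanation for non-technical users>"',
  '}',
  '',
  'Input: {userInput}',
].join('\n');

// Hypothetical helper: substitutes the user's text into the template.
function buildPrompt(userInput) {
  // A replacer function is used so that sequences such as "$&" or "$1" in the
  // user's text are inserted literally instead of being treated as
  // replacement patterns by String.prototype.replace.
  return PROMPT_TEMPLATE.replace('{userInput}', () => userInput);
}

console.log(buildPrompt('You won $& 1,000,000! Click here now.'));
```

Keeping the template in one constant also keeps the schema shape identical across calls, which makes failure rates comparable from one iteration to the next.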

5. Iteration 3: Schema + Explicit Format Suppression (Production Prompt)

Prompt structure:

You are a scam detection system. Analyze the following input and return ONLY a JSON object with no other text, no markdown formatting, no code fences, and no explanation text outside the JSON.

Required output format (return ONLY this structure):
{
  "verdict": "LIKELY_SCAM" | "UNLIKELY_SCAM" | "UNCERTAIN",
  "probability": <number between 0 and 1>,
  "signals": [<array of signal type strings from the taxonomy below>],
  "explanation": "<1-2 sentence plain language explanation for non-technical users>"
}

Signal taxonomy (use only these values in the signals array):
URGENCY_PRESSURE | FINANCIAL_REQUEST | IMPERSONATION | PHISHING_LINK | 
UNSOLICITED_CONTACT | PRIZE_LOTTERY | ROMANCE_SCAM | TECHNICAL_SUPPORT_SCAM |
INVESTMENT_FRAUD | IDENTITY_THEFT_ATTEMPT | LOGICAL_INCONSISTENCY | OTHER

Edge case instructions:
- If the user is describing a past scam experience (not submitting active suspicious content), set verdict to UNCERTAIN and note in explanation that you cannot verify past events
- If input is in a language other than English, analyze in that language and return explanation in English
- If input is too short to analyze confidently (<10 words), set verdict to UNCERTAIN

Return ONLY the JSON object. Do not wrap it in code fences. Do not add any text before or after the JSON.

Input: {userInput}

Observed behavior (production data, February–May 2026):

Clean JSON returned: ~99.2% of calls
Markdown code fence wrapping: ~0.6% of calls (reduced, primarily on very long inputs)
Explanation prefix/suffix: ~0.2% of calls
Parse failure rate: ~0.8%

The explicit "Do not wrap it in code fences. Do not add any text before or after the JSON" instruction reduced the residual failure rate from ~2% (Iteration 2) to ~0.8%. Combined with the pre-parse cleaning layer (described below), end-to-end parse success reached 100% across logged production inputs.
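A successful JSON.parse() only shows that the response is syntactically valid, not that it follows the schema and signal taxonomy given in the prompt. One way to check that, sketched here as an assumption rather than as part of the production code described in this article, is a small shape validator. VERDICTS, SIGNALS, and validateResult are hypothetical names; the allowed values are copied from the Iteration 3 prompt.

```javascript
// Allowed values copied from the Iteration 3 prompt.
const VERDICTS = ['LIKELY_SCAM', 'UNLIKELY_SCAM', 'UNCERTAIN'];
const SIGNALS = [
  'URGENCY_PRESSURE', 'FINANCIAL_REQUEST', 'IMPERSONATION', 'PHISHING_LINK',
  'UNSOLICITED_CONTACT', 'PRIZE_LOTTERY', 'ROMANCE_SCAM', 'TECHNICAL_SUPPORT_SCAM',
  'INVESTMENT_FRAUD', 'IDENTITY_THEFT_ATTEMPT', 'LOGICAL_INCONSISTENCY', 'OTHER',
];

// Hypothetical validator: returns a list of problems.
// An empty list means the parsed object matches the schema in the prompt.
function validateResult(result) {
  if (result === null || typeof result !== 'object' || Array.isArray(result)) {
    return ['result is not a JSON object'];
  }
  const errors = [];
  if (!VERDICTS.includes(result.verdict)) {
    errors.push('verdict is not one of the allowed values');
  }
  const p = result.probability;
  if (typeof p !== 'number' || !Number.isFinite(p) || p < 0 || p > 1) {
    errors.push('probability must be a number between 0 and 1');
  }
  if (!Array.isArray(result.signals)) {
    errors.push('signals must be an array');
  } else {
    for (const s of result.signals) {
      if (!SIGNALS.includes(s)) errors.push('unknown signal: ' + String(s));
    }
  }
  if (typeof result.explanation !== 'string' || result.explanation.trim() === '') {
    errors.push('explanation must be a non-empty string');
  }
  return errors;
}

// Made-up examples for illustration only.
const good = {
  verdict: 'LIKELY_SCAM',
  probability: 0.85,
  signals: ['URGENCY_PRESSURE', 'FINANCIAL_REQUEST'],
  explanation: 'The message pressures the reader to send money quickly.',
};
const bad = { verdict: 'SCAM', probability: 1.5, signals: ['urgency'], explanation: '' };

console.log(validateResult(good)); // []
console.log(validateResult(bad));  // four problems, one per field
```

What to do with a response that parses but fails validation (retry, fall back to UNCERTAIN, or log and surface an error) is a product decision this sketch leaves open.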

6. Pre-Parse Cleaning (Defense-in-Depth)

Prompt optimization reduces failure frequency but does not eliminate it. The production implementation adds a cleaning layer before JSON.parse():

JavaScript

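function cleanGeminiOutput(raw) {
  let text = raw.trim()
  // Strip a leading code fence (with or without a "json" tag) and a trailing fence
  text = text.replace(/^```(?:json)?\s*/i, '').replace(/\s*```$/, '').trim()
  // Keep everything from the first "{" to the last "}". This drops any
  // explanation text before or after the object. It assumes the response
  // holds a single top-level JSON object, not an array or several objects.
  const firstBrace = text.indexOf('{')
  const lastBrace  = text.lastIndexOf('}')
  if (firstBrace !== -1 && lastBrace > firstBrace) {
    text = text.slice(firstBrace, lastBrace + 1)
  }
  // The result is still a string; the caller passes it to JSON.parse()
  return text
}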

With Iteration 3 prompt + pre-parse cleaning: 0 parse failures across all logged production inputs (ScamCheck and TrustSeal, February–May 2026).
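The cleaning step can be wired into parsing roughly as follows. This is a sketch, not the production handler: parseGeminiJson and its return shape are hypothetical, and cleanGeminiOutput is repeated with the same logic as above so the snippet runs on its own. Trying the raw text first and recording whether cleaning was needed is an assumption added here so the residual failure rate stays visible in logs.

```javascript
// Built with repeat() so this snippet contains no literal fence sequence.
const FENCE = '`'.repeat(3);
const LEADING_FENCE = new RegExp('^' + FENCE + '(?:json)?\\s*', 'i');
const TRAILING_FENCE = new RegExp('\\s*' + FENCE + '$');

// Same logic as the cleaning function above.
function cleanGeminiOutput(raw) {
  let text = raw.trim();
  text = text.replace(LEADING_FENCE, '').replace(TRAILING_FENCE, '').trim();
  const firstBrace = text.indexOf('{');
  const lastBrace = text.lastIndexOf('}');
  if (firstBrace !== -1 && lastBrace > firstBrace) {
    text = text.slice(firstBrace, lastBrace + 1);
  }
  return text;
}

// Hypothetical wrapper: parse the raw text, fall back to the cleaned text,
// and report a failure instead of throwing if both attempts fail.
function parseGeminiJson(raw) {
  try {
    return { ok: true, value: JSON.parse(raw), cleaned: false };
  } catch (err) {
    if (!(err instanceof SyntaxError)) throw err;
  }
  try {
    return { ok: true, value: JSON.parse(cleanGeminiOutput(raw)), cleaned: true };
  } catch (err) {
    if (!(err instanceof SyntaxError)) throw err;
    return { ok: false, error: err.message, cleaned: true };
  }
}

// Made-up responses for illustration only.
const fenced = FENCE + 'json\n{"verdict":"UNCERTAIN","probability":0.5}\n' + FENCE;
const prefixed = 'Here is the result: {"verdict":"LIKELY_SCAM","probability":0.9} Hope this helps.';

console.log(parseGeminiJson(fenced).value.verdict);   // UNCERTAIN
console.log(parseGeminiJson(prefixed).value.verdict); // LIKELY_SCAM
console.log(parseGeminiJson('no json here').ok);      // false
```

Logging the cleaned flag separates prompt-level reliability (how often the raw text parses) from end-to-end reliability (how often parsing succeeds after cleaning).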

7. Results Summary

Iteration	Approach	Parse Failure Rate
0 — Baseline	Minimal prompt ("Return JSON")	~6%
1 — Prose schema	Field list in prose	~4%
2 — Embedded schema	Exact JSON shape in prompt	~2%
3 — Schema + suppression	Schema + "no code fences" instruction	~0.8%
3 + pre-parse cleaning	Production implementation	~0%

8. Conclusions

Schema in prompt outperforms prose description. Giving the model a literal JSON template to mirror is more effective than describing the schema in English. The model maps its output directly to the provided structure.

Explicit format suppression works. "Do not wrap it in code fences" is not redundant — it measurably reduces the residual failure rate beyond what schema embedding alone achieves.

Pre-parse cleaning is still required. Even with the most constrained prompt, model updates, edge case inputs, or unusual runtime conditions can produce decorated output. The pre-parse cleaning layer provides the final line of defense with zero user-facing impact.

Two-layer defense is the production pattern: tight prompt constraints (reduce failure frequency) + pre-parse cleaning (handle residual failures). Neither layer alone is sufficient for production reliability.
