The Blueprint for Scaling Generative AI at Intuit: Frameworks, Failures, and Future-Proof APIs

As enterprises race to integrate generative AI into their operations, few have navigated the journey as methodically as Intuit. Under the guidance of Merrin Kurian, a principal engineer leading the company's AI transformation, Intuit has developed a robust infrastructure stack that balances innovation with reliability. The key insights from their approach—ranging from a simple yet powerful scaling framework to pragmatic failure mode analysis—offer a masterclass for any organization building its own GenAI foundation.

The 'Fixed, Flexible, Free' Framework

At the heart of Intuit’s scaling strategy lies a tripartite framework that Kurian calls “Fixed, Flexible, Free.” This model was designed to unify over 8,000 developers across the company while respecting the diverse needs of different product teams.

Source: www.infoq.com

Fixed: A core set of non-negotiable components—security protocols, data governance rules, and base model access—that every team must adopt. This ensures a consistent safety baseline and prevents fragmentation.
Flexible: Standardized templates and pipelines for common AI workflows (like retrieval-augmented generation or fine-tuning) that teams can customize without reinventing the wheel. This reduces duplication of effort while allowing domain-specific adaptations.
Free: The freedom to experiment with novel architectures, custom agents, and third-party tools outside the standard stack, as long as they plug into the organization’s monitoring and evaluation backbone. This fosters innovation without losing visibility.
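One way to picture the three tiers is as a policy check over a team's stack. The following is a minimal sketch under stated assumptions: the component names, the TeamStack type, and check_stack are hypothetical and are not taken from GenOS; only the three tier rules come from the description above.

```python
from dataclasses import dataclass, field

# Hypothetical identifiers; the article names categories, not component IDs.
FIXED = {"security_protocols", "data_governance", "base_model_access"}
FLEXIBLE_TEMPLATES = {"rag_pipeline", "fine_tuning_pipeline"}


@dataclass
class TeamStack:
    team: str
    components: set = field(default_factory=set)
    reports_to_monitoring: bool = False


def check_stack(stack: TeamStack) -> list:
    """Return a list of policy violations; an empty list means compliant."""
    problems = []
    # Fixed: every team must adopt all of these.
    missing = FIXED - stack.components
    if missing:
        problems.append(f"missing fixed components: {sorted(missing)}")
    # Free: anything outside the standard stack is allowed only if the team
    # still plugs into the monitoring and evaluation backbone.
    custom = stack.components - FIXED - FLEXIBLE_TEMPLATES
    if custom and not stack.reports_to_monitoring:
        problems.append(f"custom components without monitoring: {sorted(custom)}")
    return problems


compliant = TeamStack("tax", FIXED | {"rag_pipeline"}, reports_to_monitoring=True)
noncompliant = TeamStack("labs", {"security_protocols", "custom_agent"})
print(check_stack(compliant))     # []
print(check_stack(noncompliant))  # two violations
```

Flexible templates need no rule of their own in this sketch: they are simply recognized as standard components, so customizing them never triggers the monitoring requirement that applies to the Free tier.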

This layered approach helped Intuit scale its GenOS (Generative AI Operating System) from a handful of pilot projects to a company-wide platform supporting thousands of simultaneous experiments.

Scaling GenOS Across Thousands of Developers

Intuit’s GenOS is not a single product but a set of interconnected services (model serving, prompt management, observability, and evaluation) exposed via APIs. The platform now supports over 3,500 experiments conducted by teams ranging from tax preparation to small business accounting. Achieving this scale required three critical investments:

Self-service tooling: Developers can spin up new agent prototypes with minimal intervention from the central AI platform team. This reduced onboarding from weeks to hours.
Unified telemetry: Every experiment logs to a central dashboard that tracks latency, cost, user feedback, and safety metrics. Teams can compare their agents against organizational benchmarks.
Graduation gates: Not every experiment goes to production. Intuit uses a staged review process where experiments must pass automated tests for performance, fairness, and robustness before being deployed to real users.
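A graduation gate of this kind can be sketched as a set of scored checks that must all clear a threshold. The gate names follow the three test areas listed above; the GateResult type, the scores, and the thresholds are hypothetical values chosen for illustration, not Intuit's actual criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    name: str
    score: float      # result of an automated test suite, 0.0 to 1.0
    threshold: float  # hypothetical minimum score required to pass

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold


def can_graduate(results):
    """An experiment graduates only if every gate passes.

    Returns (ok, names_of_failed_gates).
    """
    failed = [r.name for r in results if not r.passed]
    return (not failed, failed)


results = [
    GateResult("performance", 0.92, 0.90),
    GateResult("fairness", 0.81, 0.85),
    GateResult("robustness", 0.88, 0.80),
]
ok, failed = can_graduate(results)
print(ok, failed)  # False ['fairness']
```

Returning the names of the failed gates, rather than a bare boolean, keeps a rejected experiment informative for the team that owns it.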

This systematic approach turned a chaotic deluge of ideas into a manageable pipeline of vetted AI improvements.

Enabling 3,500+ Experiments: Lessons Learned

One surprising insight from Kurian’s presentation is that volume alone isn’t success. Many experiments fail—and that’s by design. Intuit deliberately encourages rapid, cheap failures in earlier stages to surface the most viable innovations later. The key is making those failures informative, not costly.

Taming Agent Failure Modes

No GenAI infrastructure discussion is complete without confronting the elephant in the room: agents fail. Kurian categorized critical failure modes into three buckets:

Hallucination and drift: Agents invent facts or contradict internal knowledge bases, especially when prompts drift over time.
Tool misuse: Agents call the wrong API, misuse parameters, or fall into loops that degrade user experience.
Evaluation blindness: Traditional metrics (like BLEU or ROUGE) fail to capture subtle quality issues, leading to false confidence.
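The article does not say how Intuit detects tool-call loops. One common mitigation, shown here purely as an assumption, is to count identical calls within a session and refuse further repeats past a limit. ToolLoopGuard and its max_repeats parameter are hypothetical names.

```python
import json
from collections import Counter


class ToolLoopGuard:
    """Blocks an agent from repeating the same tool call too many times."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self._seen = Counter()

    def allow(self, tool: str, args: dict) -> bool:
        # Serialize with sorted keys so argument order does not matter.
        key = (tool, json.dumps(args, sort_keys=True))
        self._seen[key] += 1
        return self._seen[key] <= self.max_repeats


guard = ToolLoopGuard(max_repeats=2)
call = {"customer_id": "c-42"}
print([guard.allow("get_invoices", call) for _ in range(3)])  # [True, True, False]
```

This only catches exact repeats. Loops in which the agent varies its arguments slightly would need a looser notion of similarity or a per-session cap on total calls.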

To combat these, Intuit adopted a strategy called “LLM-as-a-Judge”—using a separate, more powerful language model to evaluate the outputs of production agents. This evaluation is not just pass/fail; it provides structured feedback on criteria like helpfulness, factuality, and safety. The judge model runs both offline (on historical batches) and online (as a lightweight guardrail for critical workflows). This hybrid approach catches regressions that rule-based detectors miss.
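The structured-feedback idea can be sketched as follows. This is an illustration under assumptions, not Intuit's implementation: the 1 to 5 scale, the prompt wording, the Verdict type, and the call_model parameter (any function that sends a prompt to a judge model and returns its text reply) are all hypothetical. The three criteria are the ones named above.

```python
import json
from dataclasses import dataclass
from typing import Callable

CRITERIA = ("helpfulness", "factuality", "safety")


@dataclass
class Verdict:
    scores: dict      # criterion -> integer from 1 (poor) to 5 (excellent)
    rationale: str

    def passes(self, minimum: int = 3) -> bool:
        return all(self.scores[c] >= minimum for c in CRITERIA)


def build_prompt(question: str, answer: str) -> str:
    return (
        "Rate the answer on " + ", ".join(CRITERIA) + " from 1 to 5. "
        'Reply with JSON: {"scores": {...}, "rationale": "..."}\n'
        f"Question: {question}\nAnswer: {answer}"
    )


def judge(question: str, answer: str, call_model: Callable[[str], str]) -> Verdict:
    """Ask the judge model for scores and validate its structured reply."""
    data = json.loads(call_model(build_prompt(question, answer)))
    scores = {c: int(data["scores"][c]) for c in CRITERIA}
    for criterion, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{criterion} score out of range: {score}")
    return Verdict(scores, str(data.get("rationale", "")))


# Stand-in for a real judge model so the sketch runs offline.
def fake_model(prompt: str) -> str:
    return json.dumps({
        "scores": {"helpfulness": 4, "factuality": 2, "safety": 5},
        "rationale": "States a figure that is not in the knowledge base.",
    })


verdict = judge("What is the filing deadline?", "It is in June.", fake_model)
print(verdict.passes())  # False
```

Validating the judge's reply before trusting it matters because the judge is itself a language model and can return malformed or out-of-range output.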

Implementing LLM-as-a-Judge Effectively

Kurian emphasized that the judge must itself be calibrated. Intuit uses a combination of human-annotated golden datasets and automated consistency checks to prevent the judge from introducing its own biases. The evaluation pipeline also includes adversarial testing where known failure cases are injected to verify the judge remains robust.
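Calibration against a golden dataset is often summarized with an agreement statistic. The sketch below uses raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The choice of kappa and the sample labels are assumptions made for illustration; the article does not say which statistics Intuit uses.

```python
from collections import Counter


def agreement(human, judge):
    """Fraction of items where the judge matches the human label."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    p_o = agreement(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    # Chance agreement if both raters labeled independently at their own rates.
    p_e = sum((h_counts[label] / n) * (j_counts[label] / n) for label in labels)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Made-up labels: human annotations versus judge verdicts on the same items.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
model = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
print(agreement(human, model))               # 0.75
print(round(cohens_kappa(human, model), 3))  # 0.467
```

Here the judge agrees with humans on 6 of 8 items, but because both label "pass" most of the time, chance alone would yield about 0.53 agreement, so kappa is considerably lower than the raw figure.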

Building Tool-Ready APIs for the Future

A recurring theme in Kurian’s talk was the necessity of designing APIs that agents can consume autonomously. Traditional APIs designed for human developers often assume that the caller will follow documentation perfectly. Agents, however, are less reliable callers: they can misinterpret descriptions, omit required parameters, or fail on minor version changes.

Intuit’s solution was to create “tool-ready” API specifications that include:

Structured input/output schemas (using JSON Schema) that agents can parse deterministically.
Self-describing endpoints that expose their own capabilities via a machine-readable manifest (e.g., OpenAPI augmented with usage hints).
Idempotency and retry logic built into the contract, so agents can safely repeat requests without duplicating side effects.
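The three properties above can be sketched together. Everything named here is hypothetical: the get_invoices tool, the manifest field names (usage_hint, idempotent), and the IdempotentExecutor class are illustrations rather than Intuit's format. The validator is a deliberately small stand-in that checks only required keys and primitive types; a real system would use a full JSON Schema validator.

```python
import uuid

# Hypothetical tool manifest; field names are illustrative only.
GET_INVOICES_TOOL = {
    "name": "get_invoices",
    "description": "List invoices for one customer, newest first.",
    "usage_hint": "Call once per customer; do not guess customer_id.",
    "idempotent": True,
    "input_schema": {
        "type": "object",
        "required": ["customer_id"],
        "properties": {
            "customer_id": {"type": "string"},
            "limit": {"type": "integer"},
        },
    },
}

_PY_TYPES = {"string": str, "integer": int, "boolean": bool}


def validate_args(schema: dict, args: dict) -> list:
    """Check required keys and primitive types; returns error messages."""
    errors = [
        f"missing required parameter: {name}"
        for name in schema.get("required", [])
        if name not in args
    ]
    for name, value in args.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unknown parameter: {name}")
            continue
        expected = _PY_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it for integers.
        if not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        ):
            errors.append(f"{name} must be of type {spec['type']}")
    return errors


class IdempotentExecutor:
    """Replays the stored result when the same idempotency key is retried."""

    def __init__(self, handler):
        self._handler = handler
        self._results = {}
        self.handler_calls = 0

    def execute(self, idempotency_key: str, args: dict):
        if idempotency_key not in self._results:
            self.handler_calls += 1
            self._results[idempotency_key] = self._handler(args)
        return self._results[idempotency_key]


# An agent that times out and retries reuses the same key.
executor = IdempotentExecutor(lambda a: {"invoices": [], "customer": a["customer_id"]})
key = str(uuid.uuid4())
call = {"customer_id": "c-42", "limit": 10}
print(validate_args(GET_INVOICES_TOOL["input_schema"], call))  # []
first = executor.execute(key, call)
retry = executor.execute(key, call)
print(first == retry, executor.handler_calls)  # True 1
```

Returning explicit error messages, instead of raising on the first problem, gives an agent something concrete to correct on its next attempt.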

This forward-thinking design has allowed Intuit to onboard new AI capabilities—like natural language querying of financial data—without rewriting existing agent frameworks.

Conclusion: A Pragmatic Path to GenAI at Scale

Merrin Kurian’s blueprint for Intuit’s GenAI infrastructure stack is a rare blend of ambition and pragmatism. The Fixed-Flexible-Free framework provides a mental model for balancing standardization and creativity; the scaling story demonstrates how to move from pilot to pervasive without chaos; the failure-mode analysis reminds us that realistic testing is non-negotiable; and the tool-ready APIs ensure the infrastructure evolves alongside the agents. For any organization building its own GenAI stack, the lesson is clear: invest in the system that supports the system, and let the experiments flow.