The Blueprint for Scaling Generative AI at Intuit: Frameworks, Failures, and Future-Proof APIs

2026-05-20 06:07:40

As enterprises race to integrate generative AI into their operations, few have navigated the journey as methodically as Intuit. Under the guidance of Merrin Kurian, a principal engineer leading the company's AI transformation, Intuit has developed a robust infrastructure stack that balances innovation with reliability. The key insights from their approach—ranging from a simple yet powerful scaling framework to pragmatic failure mode analysis—offer a masterclass for any organization building its own GenAI foundation.

The 'Fixed, Flexible, Free' Framework

At the heart of Intuit’s scaling strategy lies a tripartite framework that Kurian calls “Fixed, Flexible, Free.” This model was designed to unify over 8,000 developers across the company while respecting the diverse needs of different product teams.

Source: www.infoq.com

This layered approach helped Intuit scale its GenOS (Generative AI Operating System) from a handful of pilot projects to a company-wide platform supporting thousands of simultaneous experiments.

Scaling GenOS Across Thousands of Developers

Intuit’s GenOS is not a single product but a set of interconnected services—model serving, prompt management, observability, and evaluation—exposed via APIs. The platform now supports over 3,500 production experiments conducted by teams ranging from tax preparation to small business accounting. Achieving this scale required three critical investments:

  1. Self-service tooling: Developers can spin up new agent prototypes with minimal intervention from the central AI platform team. This reduced onboarding from weeks to hours.
  2. Unified telemetry: Every experiment logs to a central dashboard that tracks latency, cost, user feedback, and safety metrics. Teams can compare their agents against organizational benchmarks.
  3. Graduation gates: Not every experiment goes to production. Intuit uses a staged review process where experiments must pass automated tests for performance, fairness, and robustness before being deployed to real users.

This systematic approach turned a chaotic deluge of ideas into a manageable pipeline of vetted AI improvements.
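
As a rough illustration of how unified telemetry and graduation gates could fit together, the sketch below checks one experiment's metrics against a set of thresholds. The field names, the thresholds, and the `ExperimentMetrics` class are assumptions made for this example; the article does not describe Intuit's actual gates or schema.

```python
from dataclasses import dataclass


@dataclass
class ExperimentMetrics:
    """Telemetry for one experiment (all field names are hypothetical)."""
    name: str
    p95_latency_ms: float
    cost_per_request_usd: float
    positive_feedback_rate: float  # share of user feedback that was positive
    safety_violation_rate: float   # share of responses flagged as unsafe


# Illustrative thresholds only; these are assumptions, not Intuit's real gates.
GATES = {
    "p95_latency_ms": lambda m: m.p95_latency_ms <= 2000,
    "cost_per_request_usd": lambda m: m.cost_per_request_usd <= 0.01,
    "positive_feedback_rate": lambda m: m.positive_feedback_rate >= 0.70,
    "safety_violation_rate": lambda m: m.safety_violation_rate <= 0.001,
}


def failed_gates(metrics: ExperimentMetrics) -> list[str]:
    """Return the names of every gate the experiment does not pass."""
    return [name for name, check in GATES.items() if not check(metrics)]


def can_graduate(metrics: ExperimentMetrics) -> bool:
    """An experiment graduates only when no gate fails."""
    return not failed_gates(metrics)


candidate = ExperimentMetrics("invoice-bot", 3500.0, 0.004, 0.55, 0.0)
print(candidate.name, "blocked by:", failed_gates(candidate))
```

Returning the list of failed gates, rather than a bare yes or no, keeps a rejected experiment informative: the owning team sees which metric to work on.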

Enabling 3,500+ Experiments: Lessons Learned

One surprising insight from Kurian’s presentation is that volume alone isn’t success. Many experiments fail—and that’s by design. Intuit deliberately encourages rapid, cheap failures in earlier stages to surface the most viable innovations later. The key is making those failures informative, not costly.

Taming Agent Failure Modes

No GenAI infrastructure discussion is complete without confronting the elephant in the room: agents fail. Kurian grouped the critical failure modes into three buckets.

To combat these, Intuit adopted a strategy called “LLM-as-a-Judge”—using a separate, more powerful language model to evaluate the outputs of production agents. This evaluation is not just pass/fail; it provides structured feedback on criteria like helpfulness, factuality, and safety. The judge model runs both offline (on historical batches) and online (as a lightweight guardrail for critical workflows). This hybrid approach catches regressions that rule-based detectors miss.
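
A minimal sketch of the structured-feedback idea follows. It assumes the judge model is asked to reply with JSON scores from 1 to 5 on the three criteria named above; the prompt wording, the `Verdict` class, and the `call_model` parameter are all hypothetical. The model call is passed in as a plain function so the sketch does not presume any particular vendor API.

```python
import json
from dataclasses import dataclass
from typing import Callable

CRITERIA = ("helpfulness", "factuality", "safety")

# Hypothetical prompt; a real judge prompt would be far more detailed.
PROMPT_TEMPLATE = (
    "Rate the answer to the question on each criterion from 1 to 5.\n"
    "Reply with JSON only, using the keys helpfulness, factuality, "
    "safety, and rationale.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
)


@dataclass
class Verdict:
    scores: dict
    rationale: str

    def passed(self, minimum: int = 3) -> bool:
        """Pass only if every criterion meets the minimum score."""
        return all(score >= minimum for score in self.scores.values())


def judge(question: str, answer: str, call_model: Callable[[str], str]) -> Verdict:
    """Ask the judge model for per-criterion scores and validate the reply."""
    raw = call_model(PROMPT_TEMPLATE.format(question=question, answer=answer))
    data = json.loads(raw)
    scores = {criterion: int(data[criterion]) for criterion in CRITERIA}
    for criterion, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{criterion} score out of range: {score}")
    return Verdict(scores=scores, rationale=str(data.get("rationale", "")))


def fake_model(prompt: str) -> str:
    """Stand-in for a real judge model so the sketch runs offline."""
    return json.dumps({"helpfulness": 4, "factuality": 2, "safety": 5,
                       "rationale": "States a deduction limit with no source."})


verdict = judge("Can I deduct my home office?", "Yes, up to $10,000.", fake_model)
print(verdict.scores, "passed:", verdict.passed())
```

Because the verdict carries a score per criterion plus a rationale, the same function could serve an offline batch job or an online guardrail; only the caller's handling of `passed()` would differ.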

Implementing LLM-as-a-Judge Effectively

Kurian emphasized that the judge must itself be calibrated. Intuit uses a combination of human-annotated golden datasets and automated consistency checks to prevent the judge from introducing its own biases. The evaluation pipeline also includes adversarial testing where known failure cases are injected to verify the judge remains robust.
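
The two calibration checks described here can be sketched in a few lines. This is an assumed, simplified setup: the judge is reduced to a function that returns "pass" or "fail", the golden set is a list of dictionaries with hypothetical keys, and the toy judge exists only to make the example runnable.

```python
def agreement_rate(golden, judge_fn):
    """Fraction of golden examples where the judge matches the human label."""
    if not golden:
        raise ValueError("golden dataset is empty")
    matches = sum(1 for ex in golden if judge_fn(ex["output"]) == ex["human_label"])
    return matches / len(golden)


def missed_adversarial(known_bad_outputs, judge_fn):
    """Return injected known-bad outputs that the judge wrongly passes."""
    return [out for out in known_bad_outputs if judge_fn(out) == "pass"]


def toy_judge(output):
    """Deliberately naive stand-in judge: fails anything promising a guarantee."""
    return "fail" if "guaranteed" in output.lower() else "pass"


# Hypothetical human-annotated golden examples.
golden = [
    {"output": "You may qualify for a deduction.", "human_label": "pass"},
    {"output": "Your refund is guaranteed.", "human_label": "fail"},
    {"output": "File by the deadline to avoid penalties.", "human_label": "pass"},
    {"output": "You owe no tax, ever.", "human_label": "fail"},
]

# Hypothetical known failure cases injected to probe the judge.
adversarial = [
    "Guaranteed approval for every loan.",
    "Skip reporting this income.",
]

print("agreement with humans:", agreement_rate(golden, toy_judge))
print("adversarial cases missed:", missed_adversarial(adversarial, toy_judge))
```

A falling agreement rate or a non-empty list of missed cases would be the signal to revisit the judge's prompt or model before trusting its verdicts.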

Building Tool-Ready APIs for the Future

A recurring theme in Kurian’s talk was the necessity of designing APIs that agents can consume autonomously. Traditional APIs designed for human developers often assume that the caller will follow documentation perfectly. Agents, however, are less predictable callers: they can misinterpret descriptions, omit required parameters, or break on minor version changes.

Intuit’s solution was to create “tool-ready” API specifications.

This forward-thinking design has allowed Intuit to onboard new AI capabilities—like natural language querying of financial data—without rewriting existing agent frameworks.
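
The article does not spell out what a tool-ready specification contains, so the following is only an assumed sketch of the general idea: a machine-readable spec that lets the platform catch the agent mistakes mentioned earlier (missing parameters, wrong types, version drift) and answer with messages an agent can act on. The tool name, the spec layout, and `validate_call` are hypothetical.

```python
# Hypothetical tool specification; the layout is an assumption for illustration.
TOOL_SPEC = {
    "name": "query_transactions",
    "version": "1.2",
    "description": "Return transactions for one account from a start date.",
    "parameters": {
        "account_id": {"type": str, "required": True},
        "start_date": {"type": str, "required": True},
        "limit": {"type": int, "required": False},
    },
}


def validate_call(spec, args, caller_version=None):
    """Return a list of plain-language problems; an empty list means valid."""
    errors = []
    # Treat a different major version as incompatible, minor changes as safe.
    if caller_version is not None:
        if caller_version.split(".")[0] != spec["version"].split(".")[0]:
            errors.append(
                f"incompatible version {caller_version}; tool is at {spec['version']}"
            )
    for name, rule in spec["parameters"].items():
        if name not in args:
            if rule["required"]:
                errors.append(f"missing required parameter '{name}'")
            continue
        if not isinstance(args[name], rule["type"]):
            errors.append(f"parameter '{name}' must be {rule['type'].__name__}")
    for name in args:
        if name not in spec["parameters"]:
            errors.append(f"unknown parameter '{name}'")
    return errors


# An agent call that omits a required parameter and sends a string for an int.
print(validate_call(TOOL_SPEC, {"account_id": "A1", "limit": "10"}))
```

Returning every problem at once, in plain language, gives an agent enough information to repair its call in a single retry instead of discovering errors one at a time.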

Conclusion: A Pragmatic Path to GenAI at Scale

Merrin Kurian’s blueprint for Intuit’s GenAI infrastructure stack is a rare blend of ambition and pragmatism. The Fixed-Flexible-Free framework provides a mental model for balancing standardization and creativity; the scaling story demonstrates how to move from pilot to pervasive without chaos; the failure-mode analysis reminds us that realistic testing is non-negotiable; and the tool-ready APIs ensure the infrastructure evolves alongside the agents. For any organization building its own GenAI stack, the lesson is clear: invest in the system that supports the system, and let the experiments flow.
