Building self-evolving AI systems @ Sentient
At Sentient, I've worked across AI products, evaluation systems, and enterprise tooling, all shaped by one broader bet: AGI will likely be built in the open, and reasoning frameworks will play a central role in getting there.
My role has been to help turn that thesis into usable systems - from consumer AI products like Sentient Chat, to recursive research frameworks like ROMA, to benchmark platforms like Arena, to enterprise evaluation and improvement loops through Eldros. Across all of it, the goal has been the same: make AI systems more capable, more measurable, and more useful in real-world tasks.
TL;DR
- Managed a team of 5 engineers at Sentient
- Helped turn research output into products, benchmarks, and enterprise tooling
- Built and scaled Sentient Chat to 11M users
- Built AI search and research systems that outperformed Perplexity and ChatGPT on FRAMES
- Co-led engineering on Arena, a benchmark platform with 4,700+ participants in Cohort 0
- Helped ship ROMA (featured at NeurIPS 2025) and EvoSkills, a framework for continuous agent skill evolution
- Led enterprise product direction for Eldros, which improved business-requirement adherence by 30% for a leading bank in India
Context
Sentient Foundation exists to pursue AGI through open source ($85M seed from Founders Fund, Pantera Capital, Franklin Templeton).
That sounds abstract until it becomes an engineering constraint. The company's core bet is that progress will come not only from larger models, but from reasoning frameworks that are open, reproducible, and extensible enough for outside researchers and builders to contribute to.
That meant we were not just building products. We were building infrastructure around a research thesis.
The tension was constant:
- research ambition vs. production reliability
- openness vs. competitive advantage
- speed vs. rigor
My team sat in the middle of those trade-offs.
My role
I managed a team of 5 engineers working across product, prototyping, and tooling.
In practice, my role was to turn frontier research into systems people could actually use. That meant working across engineering, design, research, and enterprise needs - sometimes shipping consumer-facing products, sometimes building evaluation infrastructure, and sometimes packaging research into usable workflows for internal and external teams.
A large part of the job was deciding where the real leverage was:
- what deserved to become a product
- what needed better evaluation
- what should remain research
- and when a successful product was still the wrong thing to keep building
Phase 1: Can LLMs reason? Start with games.
One of our earliest questions was whether LLM-based agents could show emergent reasoning in dynamic, multi-agent environments.
We built a harness that allowed teams to configure agents ranging from simple prompt wrappers to more complex ReAct- and AutoGen-style setups, then put them into live scenarios. One of the earliest experiments was a two-day hackathon where agents played Mafia against each other.
Mid-2024, neither open-source nor closed-source models showed the kind of emergent behavior we were looking for. But the harness itself turned out to be valuable. It became the foundation for much of what followed: experimentation, evaluation, and eventually more structured reasoning systems.
Even when the models are not ready, the infrastructure you build to test them can still become core product and research scaffolding.
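To make the harness idea concrete, here is a minimal sketch of the shape such a system can take: agents are just named policies (a prompt wrapper is a one-line lambda; a ReAct-style agent would be a richer callable), and the harness runs them round-robin through a scenario while recording a transcript. All names here are hypothetical illustrations, not the actual Sentient codebase.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    name: str
    act: Callable[[str], str]  # observation -> action string

class MafiaScenario:
    """Toy stand-in for a live multi-agent game environment."""
    def __init__(self, players):
        self.alive = set(players)
        self.votes = {}

    def observe(self, name):
        return f"alive: {sorted(self.alive)}"

    def apply(self, name, action):
        # Only one verb in this toy scenario: "vote <player>"
        if action.startswith("vote "):
            self.votes[name] = action.split(" ", 1)[1]

@dataclass
class Harness:
    agents: list = field(default_factory=list)
    transcript: list = field(default_factory=list)

    def register(self, agent):
        self.agents.append(agent)

    def run(self, scenario, rounds=1):
        # Round-robin turn loop; every (agent, observation, action)
        # triple is logged for later evaluation.
        for _ in range(rounds):
            for agent in self.agents:
                obs = scenario.observe(agent.name)
                action = agent.act(obs)
                self.transcript.append((agent.name, obs, action))
                scenario.apply(agent.name, action)
        return self.transcript
```

The transcript log is the important part: it is the raw material that later evaluation and benchmarking work can score, which is why the harness outlived the experiment itself.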
Phase 2: Sentient Chat
A consumer product for open-source AI
One of the biggest products I helped build was Sentient Chat - a web experience designed to make open-source models accessible under one roof.
The goal was not just to create another chat UI. It was to prove that open models, when paired with the right product and agent infrastructure, could compete on usefulness.
I owned the product from both engineering and design perspectives. We started with basic inference, then expanded into search, deep research, and agent-style workflows.
Sentient Chat served 11M users.
That scale changed the nature of the work. Once you are serving millions of users, the challenge stops being feature velocity and becomes reliability, latency, model behavior, product clarity, and operational discipline.
Search, research, and systems quality
As the product evolved, we shipped:
- search, where the system retrieved, ranked, and synthesized current information
- deep research, where agents decomposed problems and synthesized longer reports
- multi-model comparisons, which gave both users and researchers insight into model strengths and failure modes
Our AI search and research systems outperformed Perplexity and ChatGPT on FRAMES, using open-source models.
That result mattered because it was not just a model win. It was a systems win. Retrieval, prompting, orchestration, and reasoning structure made the difference.
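The retrieve-rank-synthesize shape of that pipeline can be sketched in a few lines. This is a deliberately naive illustration, keyword overlap standing in for a real retriever and a dict standing in for an LLM synthesis call; the function names are hypothetical, not the production system.

```python
def retrieve(query, corpus):
    # Placeholder retriever: score documents by keyword overlap.
    # A real system would use dense embeddings and live web search.
    q = set(query.lower().split())
    return [(doc, len(q & set(doc.lower().split()))) for doc in corpus]

def rank(scored, k=3):
    # Keep the top-k documents that matched at all.
    return [doc for doc, s in sorted(scored, key=lambda x: -x[1])[:k] if s > 0]

def synthesize(query, docs):
    # Placeholder for the LLM call that writes an answer
    # grounded in the ranked sources.
    return {"query": query, "sources": docs}

def answer(query, corpus):
    return synthesize(query, rank(retrieve(query, corpus)))
```

The benchmark point stands on exactly this structure: each stage (retrieval quality, ranking, synthesis prompting, orchestration between them) is independently tunable, which is where system-level wins over a single monolithic model come from.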
Why we shut it down
By conventional product logic, Sentient Chat was a success:
- millions of users
- strong product engagement
- benchmark-beating performance
But Sentient was not trying to become a consumer AI company.
Every engineer maintaining Chat was an engineer not building reasoning infrastructure. Every operational cycle spent serving 11M users was attention diverted from the company's core thesis.
So in December 2025, we made the call to sunset it.
It was a hard decision, but the right one. Sentient Chat validated distribution and systems quality. It did not validate the long-term mission. Once we stepped back from it, the team moved much faster on the work that mattered more to the company's thesis.
Phase 3: From product to framework
Once we pulled back from the consumer product layer, the work became much more focused.
ROMA
One of the major artifacts that came out of this phase was ROMA, a recursive framework for long-running deep research tasks. It allowed agents to decompose complex problems, follow multiple threads, synthesize findings, and iterate over time.
ROMA was later featured at NeurIPS 2025.
EvoSkills
We also worked on EvoSkills, a framework that allowed agents to evolve their own SKILLS files based on performance feedback.
The important idea was simple: instead of retraining the model, the system could observe where it failed, write new skills, update existing skills, and improve how it approached future tasks. That made self-improvement more practical, faster to test, and easier to iterate on.
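The core loop can be sketched as a skill store that accepts lessons extracted from failed runs and re-renders the SKILLS context the agent reads on its next attempt. This is a minimal hypothetical sketch of the idea, not the EvoSkills implementation.

```python
class SkillStore:
    def __init__(self):
        self.skills = {}  # skill name -> accumulated instruction text

    def render(self):
        # What gets prepended to the agent's context on the next run,
        # analogous to a SKILLS file.
        return "\n".join(f"## {k}\n{v}" for k, v in sorted(self.skills.items()))

    def evolve(self, failures):
        # failures: list of (skill_name, lesson) pairs distilled from
        # evaluation runs. New skills are created; existing ones are
        # appended to rather than retrained.
        for name, lesson in failures:
            prev = self.skills.get(name, "")
            self.skills[name] = (prev + "\n" if prev else "") + f"- {lesson}"
        return self.render()
```

Because the improvement lives in text rather than weights, each iteration is cheap to test: run the evals, diff the skills, rerun.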
Together, ROMA and EvoSkills helped consolidate the broader thesis: self-evolving agents powered by reasoning.
Phase 4.1: Arena
As the company's focus sharpened, I co-led engineering on Arena - a platform where researchers and builders compete on benchmarks and evaluations built around real-world agent failure modes.
The important difference was that these were not toy benchmarks. They were built around problems AI systems still struggle with in practice.
For Cohort 0, Arena brought in 4,700+ participants globally, focused on back-office problem sets on top of OfficeQA.
This work taught me that building the platform is only half the challenge. The benchmark design itself is a research problem. Good evaluations need to be hard, fair, useful, and grounded in tasks people actually care about.
Phase 4.2: Eldros
Eldros came out of a simple realization: we had built several pieces of adjacent research and tooling that, when combined, could become a strong enterprise product.
We had:
- research into how agents fail and how context can manipulate behavior
- systems like EvoSkills that supported continuous improvement
- evaluation infrastructure that could simulate and benchmark performance
- experience packaging complex research ideas into usable products
Eldros combines that into a product workflow:
- the customer gives us their AI agent
- they define business requirements and evaluation rubrics
- the system runs simulations and benchmarks
- it identifies gaps and suggests improvements
- and, where access is available, it can implement those changes and keep the loop running
We are currently focused on AI voice bots in banking, insurance, and pharma.
In an early deployment with a leading bank in India, Eldros drove:
- 30% improvement in business-requirement adherence
- better audio and emotion understanding
- no loss of conversational context
- adherence to PII standards
That was one of the clearest validations of the thesis: self-improving systems only matter when they are grounded in real operational constraints.
Internal tools and leverage
A quieter but important part of my work at Sentient has been building systems that reduce dependence on product engineering and help other teams move faster.
Design system for agentic coding
I built a workflow where designers can iterate on the design system with AI, make changes in Figma, and - once approved - automatically generate component code, test cases, and design-system updates.
The system also updates SKILLS files so agents know how and where to use each component.
The goal is to open-source this once it is battle-tested enough internally.
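The generation step at the end of that workflow can be sketched as a single mapping from an approved component spec to two artifacts: a code stub and a SKILLS entry telling agents when to use it. The spec shape and names here are hypothetical, and the real pipeline generates tests and full implementations rather than stubs.

```python
def component_artifacts(spec):
    # spec: approved design change, e.g.
    # {"name": "Badge", "props": ["label", "tone"], "purpose": "status labels"}
    props = ", ".join(spec["props"])
    # Generated component stub (target language would be the app's own).
    code = f"export function {spec['name']}({{ {props} }}) {{ /* generated */ }}"
    # SKILLS entry so agents know how and where to use the component.
    skill = f"Use <{spec['name']}> for {spec['purpose']}. Props: {props}."
    return code, skill
```

Emitting the SKILLS entry alongside the code is the key design choice: the design system stays legible to agents, not just to humans reading Storybook.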
GTM self-sufficiency
I also built tooling and processes that let the go-to-market team create their own web pages, videos, and social assets without filing engineering tickets.
That made them faster and made my team less of a bottleneck. Two members of the GTM team also started automating parts of their own workflows, which is one of the more satisfying forms of leverage: not just building tools, but helping people work differently.
Outcomes
- Managed a team of 5 engineers
- Scaled Sentient Chat to 11M users
- Built AI search and research systems that outperformed Perplexity and ChatGPT on FRAMES
- Helped ship ROMA, featured at NeurIPS 2025
- Helped build EvoSkills, focused on continuous agent skill evolution
- Co-led engineering on Arena, with 4,700+ participants in Cohort 0
- Led enterprise product direction for Eldros
- Improved enterprise voice bot business-requirement adherence by 30% for a leading bank in India
- Built internal systems that reduced design-to-code friction and increased GTM self-sufficiency
What I'd do differently
I would have sunset Sentient Chat sooner.
The 11M users metric was impressive, but it validated distribution more than mission. In hindsight, we should have defined the exit criteria much earlier so the decision to wind it down felt like execution, not debate.
I underestimated how hard benchmark design would be.
Building the infrastructure for evaluation is one problem. Designing evaluations that are actually difficult, fair, and useful is another. If I were doing Arena again, I would staff dedicated evaluation design earlier.
I would have pushed for deeper instrumentation sooner.
We had strong usage signals, but not enough structured data around which features drove retention, where users churned, or which failure modes mattered most. Better telemetry would have made several product decisions easier.
What I learned
The real challenge in self-evolving AI systems is not theatrics. It is disciplined iteration.
What matters most is:
- what you measure
- how you define failure
- what feedback loop you trust
- how quickly teams can turn research into usable systems
That is where the actual engineering and product challenge lives.
Related work
- ShopOS - zero-to-one GenAI commerce, building on the same enterprise AI instincts
- Scapic / Flipkart - where the product-at-scale thinking started