Building self-evolving AI systems @ Sentient
At Sentient, I've worked across AI products, evaluation systems, and enterprise tooling, all shaped by one broader bet: AGI will likely be built in the open, and reasoning frameworks will play a central role in getting there.
My role has been to help turn that thesis into usable systems - from consumer AI products like Sentient Chat, to recursive research frameworks like ROMA, to benchmark platforms like Arena, to enterprise evaluation and improvement loops through Eldros. Across all of it, the goal has been the same: make AI systems more capable, more measurable, and more useful in real-world tasks.
TL;DR
- Managed a team of 5 engineers at Sentient
- Helped turn research output into products, benchmarks, and enterprise tooling
- Built and scaled Sentient Chat to 11M users
- Built AI search and research systems that outperformed Perplexity and ChatGPT on FRAMES
- Co-led engineering on Arena, a benchmark platform with 4,700+ participants in Cohort 0
- Helped ship ROMA (featured at NeurIPS 2025) and EvoSkills, a framework for continuous agent skill evolution
- Led enterprise product direction for Eldros, which improved business-requirement adherence by 30% for a leading bank in India
Context
Sentient Foundation exists to pursue AGI through open source ($85M seed from Founders Fund, Pantera Capital, Franklin Templeton).
That sounds abstract until it becomes an engineering constraint. The company's core bet is that progress will come not only from larger models, but from reasoning frameworks that are open, reproducible, and extensible enough for outside researchers and builders to contribute to.
That meant we were not just building products. We were building infrastructure around a research thesis.
The tension was constant:
- research ambition vs. production reliability
- openness vs. competitive advantage
- speed vs. rigor
My team sat in the middle of those trade-offs.
My role
I managed a team of 5 engineers working across product, prototyping, and tooling.
In practice, my role was to turn frontier research into systems people could actually use. That meant working across engineering, design, research, and enterprise needs - sometimes shipping consumer-facing products, sometimes building evaluation infrastructure, and sometimes packaging research into usable workflows for internal and external teams.
A large part of the job was deciding where the real leverage was:
- what deserved to become a product
- what needed better evaluation
- what should remain research
- and when a successful product was still the wrong thing to keep building
Phase 1: Can LLMs reason? Start with games.
One of our earliest questions was whether LLM-based agents could show emergent reasoning in dynamic, multi-agent environments.
We built a harness that allowed teams to configure agents ranging from simple prompt wrappers to more complex ReAct- and AutoGen-style setups, then put them into live scenarios. One of the earliest experiments was a two-day hackathon where agents played Mafia against each other.
Mid-2024, neither open-source nor closed-source models showed the kind of emergent behavior we were looking for. But the harness itself turned out to be valuable. It became the foundation for much of what followed: experimentation, evaluation, and eventually more structured reasoning systems.
Even when the models are not ready, the infrastructure you build to test them can still become core product and research scaffolding.
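To make the harness idea concrete, here is a minimal sketch of the shape such a system can take: agents are just named policies (a prompt wrapper is a one-line lambda; a ReAct-style agent would be a richer callable), and the harness runs them round-robin through a scenario while recording a transcript. All names here are hypothetical illustrations, not the actual Sentient codebase.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    name: str
    act: Callable[[str], str]  # observation -> action string

class MafiaScenario:
    """Toy stand-in for a live multi-agent game environment."""
    def __init__(self, players):
        self.alive = set(players)
        self.votes = {}

    def observe(self, name):
        return f"alive: {sorted(self.alive)}"

    def apply(self, name, action):
        # Only one verb in this toy scenario: "vote <player>"
        if action.startswith("vote "):
            self.votes[name] = action.split(" ", 1)[1]

@dataclass
class Harness:
    agents: list = field(default_factory=list)
    transcript: list = field(default_factory=list)

    def register(self, agent):
        self.agents.append(agent)

    def run(self, scenario, rounds=1):
        # Round-robin turn loop; every (agent, observation, action)
        # triple is logged for later evaluation.
        for _ in range(rounds):
            for agent in self.agents:
                obs = scenario.observe(agent.name)
                action = agent.act(obs)
                self.transcript.append((agent.name, obs, action))
                scenario.apply(agent.name, action)
        return self.transcript
```

The transcript log is the important part: it is the raw material that later evaluation and benchmarking work can score, which is why the harness outlived the experiment itself.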
Phase 2: Sentient Chat
A consumer product for open-source AI
One of the biggest products I helped build was Sentient Chat - a web experience designed to make open-source models accessible under one roof.
The goal was not just to create another chat UI. It was to prove that open models, when paired with the right product and agent infrastructure, could compete on usefulness.
I owned the product from both engineering and design perspectives. We started with basic inference, then expanded into search, deep research, and agent-style workflows.
Sentient Chat served 11M users.
That scale changed the nature of the work. Once you are serving millions of users, the challenge stops being feature velocity and becomes reliability, latency, model behavior, product clarity, and operational discipline.
Search, research, and systems quality
As the product evolved, we shipped:
- search, where the system retrieved, ranked, and synthesized current information
- deep research, where agents decomposed problems and synthesized longer reports
- multi-model comparisons, which gave both users and researchers insight into model strengths and failure modes
Our AI search and research systems outperformed Perplexity and ChatGPT on FRAMES, using open-source models.
That result mattered because it was not just a model win. It was a systems win. Retrieval, prompting, orchestration, and reasoning structure made the difference.
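The retrieve-rank-synthesize shape of that pipeline can be sketched in a few lines. This is a deliberately naive illustration, keyword overlap standing in for a real retriever and a dict standing in for an LLM synthesis call; the function names are hypothetical, not the production system.

```python
def retrieve(query, corpus):
    # Placeholder retriever: score documents by keyword overlap.
    # A real system would use dense embeddings and live web search.
    q = set(query.lower().split())
    return [(doc, len(q & set(doc.lower().split()))) for doc in corpus]

def rank(scored, k=3):
    # Keep the top-k documents that matched at all.
    return [doc for doc, s in sorted(scored, key=lambda x: -x[1])[:k] if s > 0]

def synthesize(query, docs):
    # Placeholder for the LLM call that writes an answer
    # grounded in the ranked sources.
    return {"query": query, "sources": docs}

def answer(query, corpus):
    return synthesize(query, rank(retrieve(query, corpus)))
```

The benchmark point stands on exactly this structure: each stage (retrieval quality, ranking, synthesis prompting, orchestration between them) is independently tunable, which is where system-level wins over a single monolithic model come from.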
Why we shut it down
By conventional product logic, Sentient Chat was a success:
- millions of users
- strong product engagement
- benchmark-beating performance
But Sentient was not trying to become a consumer AI company.
Every engineer maintaining Chat was an engineer not building reasoning infrastructure. Every operational cycle spent serving 11M users was attention diverted from the company's core thesis.
So in December 2025, we made the call to sunset it.
It was a hard decision, but the right one. Sentient Chat validated distribution and systems quality. It did not validate the long-term mission. Once we stepped back from it, the team moved much faster on the work that mattered more to the company's thesis.
Phase 3: From product to framework
Once we pulled back from the consumer product layer, the work became much more focused.
ROMA
One of the major artifacts that came out of this phase was ROMA, a recursive framework for long-running deep research tasks. It allowed agents to decompose complex problems, follow multiple threads, synthesize findings, and iterate over time.
ROMA was later featured at NeurIPS 2025.
EvoSkills
We also worked on EvoSkills, a framework that allowed agents to evolve their own SKILLS files based on performance feedback.
The important idea was simple: instead of retraining the model, the system could observe where it failed, write new skills, update existing skills, and improve how it approached future tasks. That made self-improvement more practical, faster to test, and easier to iterate on.
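The core loop can be sketched as a skill store that accepts lessons extracted from failed runs and re-renders the SKILLS context the agent reads on its next attempt. This is a minimal hypothetical sketch of the idea, not the EvoSkills implementation.

```python
class SkillStore:
    def __init__(self):
        self.skills = {}  # skill name -> accumulated instruction text

    def render(self):
        # What gets prepended to the agent's context on the next run,
        # analogous to a SKILLS file.
        return "\n".join(f"## {k}\n{v}" for k, v in sorted(self.skills.items()))

    def evolve(self, failures):
        # failures: list of (skill_name, lesson) pairs distilled from
        # evaluation runs. New skills are created; existing ones are
        # appended to rather than retrained.
        for name, lesson in failures:
            prev = self.skills.get(name, "")
            self.skills[name] = (prev + "\n" if prev else "") + f"- {lesson}"
        return self.render()
```

Because the improvement lives in text rather than weights, each iteration is cheap to test: run the evals, diff the skills, rerun.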
Together, ROMA and EvoSkills helped consolidate the broader thesis: self-evolving agents powered by reasoning.
Phase 4.1: Arena
As the company's focus sharpened, I co-led engineering on Arena - a platform where researchers and builders compete on benchmarks and evaluations built around real-world agent failure modes.
The important difference was that these were not toy benchmarks. They were built around problems AI systems still struggle with in practice.
For Cohort 0, Arena brought in 4,700+ participants globally, focused on back-office problem sets on top of OfficeQA.
This work taught me that building the platform is only half the challenge. The benchmark design itself is a research problem. Good evaluations need to be hard, fair, useful, and grounded in tasks people actually care about.
Phase 4.2: Eldros
Eldros came out of a simple realization: we had built several pieces of adjacent research and tooling that, when combined, could become a strong enterprise product.
We had:
- research into how agents fail and how context can manipulate behavior
- systems like EvoSkills that supported continuous improvement
- evaluation infrastructure that could simulate and benchmark performance
- experience packaging complex research ideas into usable products
Eldros combines that into a product workflow:
- the customer gives us their AI agent
- they define business requirements and evaluation rubrics
- the system runs simulations and benchmarks
- it identifies gaps and suggests improvements
- and, where access is available, it can implement those changes and keep the loop running
We are currently focused on AI voice bots in banking, insurance, and pharma.
In an early deployment with a leading bank in India, Eldros drove:
- 30% improvement in business-requirement adherence
- better audio and emotion understanding
- no loss of conversational context
- adherence to PII standards
That was one of the clearest validations of the thesis: self-improving systems only matter when they are grounded in real operational constraints.
Internal tools and leverage
A quieter but important part of my work at Sentient has been building systems that reduce dependence on product engineering and help other teams move faster.
Design system for agentic coding
I built a workflow where designers can iterate on the design system with AI, make changes in Figma, and - once approved - automatically generate component code, test cases, and design-system updates.
The system also updates SKILLS files so agents know how and where to use each component.
The goal is to open-source this once it is battle-tested enough internally.
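The generation step at the end of that workflow can be sketched as a single mapping from an approved component spec to two artifacts: a code stub and a SKILLS entry telling agents when to use it. The spec shape and names here are hypothetical, and the real pipeline generates tests and full implementations rather than stubs.

```python
def component_artifacts(spec):
    # spec: approved design change, e.g.
    # {"name": "Badge", "props": ["label", "tone"], "purpose": "status labels"}
    props = ", ".join(spec["props"])
    # Generated component stub (target language would be the app's own).
    code = f"export function {spec['name']}({{ {props} }}) {{ /* generated */ }}"
    # SKILLS entry so agents know how and where to use the component.
    skill = f"Use <{spec['name']}> for {spec['purpose']}. Props: {props}."
    return code, skill
```

Emitting the SKILLS entry alongside the code is the key design choice: the design system stays legible to agents, not just to humans reading Storybook.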
GTM self-sufficiency
I also built tooling and processes that let the go-to-market team create their own web pages, videos, and social assets without filing engineering tickets.
That made them faster and made my team less of a bottleneck. Two members of the GTM team also started automating parts of their own workflows, which is one of the more satisfying forms of leverage: not just building tools, but helping people work differently.
Outcomes
- Managed a team of 5 engineers
- Scaled Sentient Chat to 11M users
- Built AI search and research systems that outperformed Perplexity and ChatGPT on FRAMES
- Helped ship ROMA, featured at NeurIPS 2025
- Helped build EvoSkills, focused on continuous agent skill evolution
- Co-led engineering on Arena, with 4,700+ participants in Cohort 0
- Led enterprise product direction for Eldros
- Improved enterprise voice bot business-requirement adherence by 30% for a leading bank in India
- Built internal systems that reduced design-to-code friction and increased GTM self-sufficiency
What I'd do differently
I would have sunset Sentient Chat sooner.
The 11M users metric was impressive, but it validated distribution more than mission. In hindsight, we should have defined the exit criteria much earlier so the decision to wind it down felt like execution, not debate.
I underestimated how hard benchmark design would be.
Building the infrastructure for evaluation is one problem. Designing evaluations that are actually difficult, fair, and useful is another. If I were doing Arena again, I would staff dedicated evaluation design earlier.
I would have pushed for deeper instrumentation sooner.
We had strong usage signals, but not enough structured data around which features drove retention, where users churned, or which failure modes mattered most. Better telemetry would have made several product decisions easier.
What I learned
The real challenge in self-evolving AI systems is not theatrics. It is disciplined iteration.
What matters most is:
- what you measure
- how you define failure
- what feedback loop you trust
- how quickly teams can turn research into usable systems
That is where the actual engineering and product challenge lives.
Related work
- ShopOS - zero-to-one GenAI commerce, building on the same enterprise AI instincts
- Scapic / Flipkart - where the product-at-scale thinking started