370 / May 15, 2026

Why Your AI is Still a Demo: Lessons from Braintrust’s Field CTO

46 Minutes


About the Episode

85% of AI teams will hit a serious production failure this year. The only thing separating them from the 15% who don’t? Evals.

After nearly two decades of building AI systems at Microsoft, Facebook, and Dropbox, Ameya Bhatawdekar is now Field CTO at Braintrust, the AI observability platform used by Airtable, Notion, Stripe, Dropbox, Vercel, Cloudflare, Lovable, and Replit.

We discuss a shift that most teams underestimate. The winners in AI are not just shipping faster. They are building systems that behave predictably, improve continuously, and earn user trust over time. As traditional monitoring breaks down in a probabilistic world, observability now requires learning how an AI system reasons, not just how it performs. This leads to a new paradigm where agents are no longer just executing tasks, but also analyzing and debugging other agents.

The episode also traces the evolution of machine learning itself. From feature engineering to deep learning to transformers, each leap increased capability and reduced control. Evaluation is now where control sits.

Ameya is clear on one point. Moving fast with weak evaluations feels like velocity, but it compounds into technical debt, unpredictable failures, and ultimately a loss of user trust. The teams that win are the ones that invest early in rigor, especially in understanding context, which is quickly becoming the hardest and most critical layer in AI systems.

If you are a founder or engineer moving beyond the demo phase and trying to build durable, high-quality AI systems, this episode will change how you think about shipping.

Watch all other episodes on The Neon Podcast – Neon

Or view it on our YouTube Channel at The Neon Show – YouTube

Transcript

Siddhartha Ahluwalia 0:54
Hi, this is Siddhartha Ahluwalia, your host at NEON Show and Managing Partner at NEON Fund, a fund that invests at the pre-seed and seed stages in the best enterprise AI companies across the US-India corridor, like Atomicwork, SpotDraft, CloudSEK. Today I have with me Ameya. Ameya, welcome to the NEON Show.

Ameya Bhatawdekar 1:14
Thank you.

Siddhartha Ahluwalia 1:14
You are in a very unique position, Field CTO at Braintrust. So Braintrust, for our audience, is one of the leading companies in the eval space for AI.

I’ll ask you to explain, first, what is a Field CTO, and second, what is eval?

Ameya Bhatawdekar 1:29
Sounds good. So my role at Braintrust is to help our customers who are building Gen AI systems to build their AI in a very predictable, repeatable way such that they can ship high quality AI into production. And so my role is to work with our customers on technical strategy, the way we deploy and use the evals platform and observability systems, how we build out these feedback loops to help the company build the muscle to ship production quality AI.

And so I lead the post-sales field engineering team as well. And so we work with our customers to not only onboard to Braintrust, but use it in a way that can help them build these AI systems effectively.

Siddhartha Ahluwalia 2:20
Can you describe now eval and observability? What do they mean in this context?

Ameya Bhatawdekar 2:25
Absolutely. And so, as we have seen this revolution of Gen AI over the last few years, what has happened is we moved from the world of training models to do very specific things where we would take large training data sets, build these algorithms and then train these models to do a certain task in a predictable fashion, with a high degree of accuracy. After the Gen AI revolution, the way now we build intelligent applications is we take models off the shelf.

These are general purpose models. They can reason on a variety of tasks. And then we condition the models to work a specific way by doing prompt engineering, by doing context engineering.

You’re really making sure that the models work for a particular use case in a predictable fashion. And so when we are not training these models and we’re conditioning them, we want to make sure that our instructions work well across the board for all the anticipated and not yet anticipated ways in which people are going to interact with these systems. And so the way you do that is by building eval data sets.

These eval data sets, they describe the system in terms of the expected inputs and the expected outputs. And so if your AI is now able to work across those eval data sets, you have a high degree of confidence that it will work well in production. That’s evals. You also want to make sure that the system works as expected in production. And so we want to know exactly how people are interacting with your systems, how they’re working with your agents, in which cases does your system work well, and in which cases it doesn’t. And then you want to look at those cases where it didn’t quite work as expected and use that insight to continually improve your intelligence system.

Can you fine tune your prompts? Can you do better context engineering? Can you change something upstream in your system to address those shortcomings and not regress your existing capabilities? And so that’s where observability plays a big part. So when you’re building Gen AI systems, you really want that feedback loop of observability that helps you build better evals, that helps you ship better AI.
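
As a rough illustration of the eval data sets described in this answer, here is a minimal Python sketch. It is not Braintrust's API: `EvalCase`, `exact_match`, `run_eval`, and `toy_task` are hypothetical names, and `toy_task` is a stand-in for a real LLM-backed system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One row of an eval data set: an expected input and its expected output."""
    input: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Deterministic scorer: 1.0 if the normalized strings match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(task: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run the task on every case and return the mean score."""
    scores = [exact_match(task(case.input), case.expected) for case in cases]
    return sum(scores) / len(scores)


def toy_task(question: str) -> str:
    """Hypothetical stand-in for a prompt plus model call."""
    answers = {"capital of france?": "Paris", "2 + 2?": "4"}
    return answers.get(question.lower(), "I don't know")


cases = [
    EvalCase("Capital of France?", "paris"),
    EvalCase("2 + 2?", "4"),
    EvalCase("Capital of Peru?", "Lima"),
]
score = run_eval(toy_task, cases)
print(f"mean score: {score:.2f}")
```

The third case fails on purpose: a data set that only contains inputs the system already handles says little about how it will behave in production.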

Siddhartha Ahluwalia 5:00
Can you give an example using one of your customers, like what agents they are shipping? And how does Braintrust play a role in them shipping production-grade agents?

Ameya Bhatawdekar 5:11
Yeah. So we have a lot of customers, a lot of companies that are building some pretty phenomenal AI systems. They all are leveraging large language models.

They are building complex agentic systems that are reasoning, that are leveraging a lot of tools and interacting with LLMs to fulfill the user intent. And so we have systems where people can work with their agents to perform specific tasks. The agents are able to understand the user intent and perform actions to fulfill those user intents.

Some of these are transactional. Some of these are systems that provide answers to queries. Some of these are systems that do work on the user’s behalf.

An example would be, there are systems that are being used by special industries where they are trying to harvest meaningful information from their corpus of work documents to generate new content that is going to help them accelerate certain processes. For example, I’ve seen systems where construction companies are able to now put together effective proposals using complex engineering drawings, architectural plans, specifications to submit proposals for new RFPs. There’s a very specific format in which these companies are building out that content. It needs to be accurate. It needs to be standardized, the certain formats that it needs to follow. And the AI systems are able to help these companies fulfill that task with a high degree of automation.

There are other agents that are able to go in and manage your subscriptions for you. They’re able to make choices on your behalf in terms of which options they should exercise on your behalf. For example, there are agents that can help you set up your itinerary, come up with recommendations for certain flights or certain hotels, and help you make that discovery cycle far shorter so that knowing your preferences, knowing your constraints, they can pick the best combination of these flights, these hotels for you. So there’s lots of interesting agents that are being built out there.

Siddhartha Ahluwalia 7:54
So let’s take the example of some of the early adopters like Notion, Zapier, Stripe, Airtable, and Instacart. What does eval look like in practice for them?

Ameya Bhatawdekar 8:05
Yeah. So all of these companies are building agents, intelligent systems that are performing specialized tasks for whatever use cases they have. And so all of them build evals that reflect how their systems are expected to behave in production. They’re all following similar patterns. They’re building these eval datasets. These eval datasets are exercised every time any component in their systems is modified.

Someone changes a prompt, someone changes some logic in some of the upstream components. The eval datasets ensure that any regressions are caught early in the process. These eval datasets, they are not static.

As the teams look at their logs, at how their systems are working in the real world, in the production use cases, they’re able to leverage those insights to continually augment their eval datasets. And so these eval datasets are now highly reflective of how the system is going to perform in the real world and continue to help them improve their AI. And so the same pattern is fairly common across all of these companies.

You can see how they have agents and systems that are specific to the offerings, the products they have in the market. And you can see the kind of evals that they would have built to make sure that those systems work as expected.
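
The pattern in this answer (re-run the evals on every change, catch regressions early, and grow the data set from production logs) could be sketched as follows. This is an illustration under assumptions, not how any of the named companies implement it: the function names, the per-case score dictionaries, and the log fields (`id`, `input`, `score`, `corrected_output`) are all hypothetical.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """Return ids of eval cases whose score dropped by more than `tolerance`."""
    return sorted(
        case_id
        for case_id, old_score in baseline.items()
        if candidate.get(case_id, 0.0) < old_score - tolerance
    )


def add_failures_to_dataset(
    dataset: dict[str, dict],
    production_logs: list[dict],
    threshold: float = 0.5,
) -> int:
    """Copy low-scoring production interactions into the eval data set.

    Returns the number of new cases added.
    """
    added = 0
    for log in production_logs:
        if log["score"] < threshold and log["id"] not in dataset:
            dataset[log["id"]] = {
                "input": log["input"],
                "expected": log["corrected_output"],
            }
            added += 1
    return added


# Scores per eval case before and after a prompt change (made-up numbers).
baseline = {"refund": 1.0, "cancel": 1.0, "upgrade": 0.5}
candidate = {"refund": 1.0, "cancel": 0.0, "upgrade": 0.75}
regressions = find_regressions(baseline, candidate)

dataset = {"refund": {"input": "I want a refund", "expected": "refund_flow"}}
logs = [
    {"id": "log-1", "input": "pause my plan", "score": 0.2,
     "corrected_output": "pause_flow"},
    {"id": "log-2", "input": "I want a refund", "score": 0.9,
     "corrected_output": "refund_flow"},
]
added = add_failures_to_dataset(dataset, logs)
print(regressions, added)
```

A change that improves one case ("upgrade") while breaking another ("cancel") is exactly the situation a per-case comparison surfaces and an average score can hide.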

Siddhartha Ahluwalia 9:44
Got it. And why is observability absolutely necessary for AI when classic monitoring was enough for normal software?

Ameya Bhatawdekar 9:52
Yeah. I think monitoring gives you a great understanding of how your systems are working in terms of their performance from a latency perspective, from a cost perspective. And those are important dimensions to measure. I think in the case of these probabilistic systems, you also want to have a deep understanding of how your systems are behaving in response to these user intents that are typically being specified in natural language. There’s a lot of variance in how people can express the same thing. And so you want to make sure that your AI system is robust enough to handle all of the various ways in which people express their intent and that it is able to fulfill the intent accurately.

And so it’s important for AI systems to have observability that goes beyond just these performance metrics. These AI systems need to log the entire trace of how the AI reasoned on the initial input. What were the tool calls it made? How did it interact with the LLMs? How did it ultimately generate its response? And did that response actually meet the user intent? Was it accurate? Was it correct? Were there any hallucinations in the generated content or not?

And so you really need to capture all of that information to be able to effectively evaluate the quality of that response and as a result, your entire AI system. So that’s why observability becomes a pretty key aspect of building these kinds of intelligent systems.
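
To make "log the entire trace" concrete, here is a minimal sketch of a trace as a tree of nested steps, each recording its input, output, and duration. The `Span` and `Tracer` classes are hypothetical and hand-rolled for illustration; they are not the Braintrust SDK, and the tool and LLM calls are faked with literal values.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step in a trace: a task, a tool call, or an LLM call."""
    name: str
    kind: str
    input: object = None
    output: object = None
    duration_s: float = 0.0
    children: list["Span"] = field(default_factory=list)


class Tracer:
    """Records a single trace as a tree of nested spans."""

    def __init__(self) -> None:
        self.root = None
        self._stack = []

    @contextmanager
    def span(self, name, kind, input=None):
        current = Span(name=name, kind=kind, input=input)
        if self._stack:
            # Nested call: attach to the span that is currently open.
            self._stack[-1].children.append(current)
        else:
            self.root = current
        self._stack.append(current)
        start = time.perf_counter()
        try:
            yield current
        finally:
            current.duration_s = time.perf_counter() - start
            self._stack.pop()


tracer = Tracer()
with tracer.span("answer_question", "task", input="Where is my order?") as root:
    with tracer.span("lookup_order", "tool", input={"order_id": "A1"}) as tool:
        tool.output = {"status": "shipped"}  # faked tool result
    with tracer.span("draft_reply", "llm", input="status=shipped") as llm:
        llm.output = "Your order has shipped."  # faked model output
    root.output = llm.output

for child in tracer.root.children:
    print(child.kind, child.name, child.output)
```

Latency still falls out of the same record (`duration_s`), but the inputs and outputs at each step are what let someone judge whether the response met the user's intent.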

Siddhartha Ahluwalia 11:33
And your title is very interesting. Field CTO. What does it mean?

Ameya Bhatawdekar 11:38
Yeah. So all of my career, I’ve been building AI systems. I’ve built internet scale, enterprise scale AI systems at Microsoft, at Facebook, at Dropbox. And over time, I’ve observed how these AI systems are built, how they’re deployed, how they’re operationalized. And I started leveraging Braintrust about a year ago while I was at Dropbox. And it was quite the step function change for us where we were able to move from a fairly messy way of building enterprise AI systems powered by Gen AI to a more systematic way of how we were building evals, how we were doing observability. And we were able to go from this chaotic state to a much more systematic way of doing things. And I see a lot of companies, a lot of folks who are building great AI products. They’re kind of in the state that I was in about a year ago. And so I was working closely with the Braintrust team.

And I felt like there would be a great opportunity for me to work with a lot of world-class AI teams on building production quality AI systems in a way that I could share some of the learnings that I had accrued over my career and help them go from that messy primordial state of building these AI systems to doing something that was going to set them on the path of shipping production quality AI systems fairly quickly. And so that was sort of the draw for me. So my role at Braintrust is to work with our AI partners, our AI customers to help them not only onboard to the Braintrust platform and use it effectively, but to systematize the way in which they build and ship AI, in a way that can guarantee high quality AI in a predictable, repeatable fashion.

Siddhartha Ahluwalia 13:52
Got it. And why did you choose to go from building AI systems to being on the field side, where you’re not building but selling now?

Ameya Bhatawdekar 14:03
Yeah. As I said, I’ve been building AI systems for a long time. I started building my first machine learning models that were shipped in the operating system in the browsers in 2008. So it’s been 18 years of building and shipping AI systems. And that has been a wonderful journey. Now I’m at the point where I think I can amplify the impact that I have and not just limited to one company, but work with many, many, many world-class teams that are doing very exciting things in the space of AI. And so the opportunity to work with a lot of different teams across various domains, building a variety of different AI solutions was very appealing to me.

I was also always on the product side. And so this was sort of my opportunity to build my muscles and grow a few more folds in my brain as I learned about the sales world. And so this was a really interesting opportunity for me.

Siddhartha Ahluwalia 15:05
Now you work on pre-sales or post-sales?

Ameya Bhatawdekar 15:08
I work on post-sales. Well, I lead the post-sales field engineering team. So my team works with our customers to deploy Braintrust and to build AI solutions. And so we have solutions architects, as well as AI engineers who help our customers through the entire journey of onboarding to an evals platform, operationalizing observability, and then really integrating Braintrust into their AI SDLC so that they are doing evals in a fairly systematic way, that they have operationalized how they do error analysis, that they have built all the integrations that they need to automate a lot of this workflow. So my team helps our customers go through those phases. But my role is talking to interesting teams.

I really enjoy talking to any team, from those that are either considering Braintrust or considering onboarding to an evals platform to teams that have built AI systems and are looking at how they can now have another step function improvement in their processes. So I work with everyone.

Siddhartha Ahluwalia 16:34
When an AI product fails or an AI agent fails, what are the few things that could have gone wrong?

Ameya Bhatawdekar 16:42
Yeah. So these are all probabilistic systems. So when an agent is reasoning on the task, the agent is trying to understand the user intent. The agent is trying to understand how it should sequence the set of tasks that it needs to perform. It needs to invoke tools. The tools are making calls to third-party systems, returning certain results. And then these agents are trying to piece together all of that information, and then again, continue their reasoning to generate the final outcome. And so as you can see, this is a probabilistic system with a long chain of steps that need to be orchestrated and executed correctly to yield the final outcome. And so a lot of things could go wrong.

Maybe the tools don’t work out as well as expected. Maybe the orchestration got certain things wrong. Maybe the sequence of tasks wasn’t quite right. Maybe there weren’t specific enough instructions for the AI to reason in a way that the system was expected to. So there’s a lot of things that could go wrong. There are also a lot of operational things that could go wrong. A call to an external system may fail. And so you really have a lot of moving parts when you’re building out complex agentic systems, and you could have a failure across any of those. And so you really want to have that full observability, that ability to trace exactly how an agent took that user intent, orchestrated the plan, executed it, and then generated the final answer. So you want to know exactly where it missed a step or failed to yield the final answer.

Siddhartha Ahluwalia 18:27
So are you hinting that what you are building on the eval and observability side is capturing, like, there are very interesting articles by foundation people that went viral on context graphs and context engineering. Is that what you’re trying to capture in eval and observability, the context of an agent and why the agent did what it did?

Ameya Bhatawdekar 18:50
Yeah, I mean, capturing the entire context is really important. When you want to observe, like when you want to understand exactly what is it that the agent did over the course of its execution, you want to capture all the context that the agent had access to. That way, anyone who’s debugging and troubleshooting can follow the details to understand exactly how the context was built.

Was something missing in the context? Was it interpreted by the underlying system correctly? And so we are building a system where you can capture all of that context within the trace of the conversation.

So you can imagine a conversation where an agent gets an instruction from the user, then performs multiple steps, yields an answer, and then that conversation can continue. And so you can have these long running conversations where there’s a lot of context that’s being generated and built. And all of that is really important to understand how the agent behaved and for aiding with troubleshooting and debugging.

Siddhartha Ahluwalia 19:59
So really, rather than going through a log, today a developer, or even a non-developer, can really ask an agent why it did what it did through a chat-based mechanism.

Ameya Bhatawdekar 20:10
Well, yes. I mean, if you have all the context, you can have other agents introspect, like look at that context and try to understand how the agent worked out what it did. And so we are getting into this very interesting phase where you have agents that can analyze and understand and give you insights into the performance of other agents.

We have an agent built into the tool itself in Braintrust. It’s called Loop and Loop has access to your entire project context. It can access all your production logs. It can look at all your eval datasets. It can look at all the prompts that you have.

It can look at all the ways in which you are evaluating those agentic interactions, where you can have custom scorers that can evaluate the quality of your agentic interactions. Because it has all that context, it is able to reason on how your agentic system is behaving. It can identify areas for improvement. It can help you generate synthetic data that can be used to augment your eval datasets. And so, yeah, you can now have agents analyze other agents and give you recommendations.

Siddhartha Ahluwalia 21:24
This has never happened before, right, in the history of engineering? Like on how do you debug?

Ameya Bhatawdekar 21:34
Yeah, I mean, debugging is pretty interesting in these cases because you really want to understand how every component of a particular trace of a conversation performed. All of that data is logged into your observability system. So in Braintrust, you can see every step that was performed by the agent as a span.

And so you can have the full context of the span. But you can also have scorers that can evaluate the execution of a particular span. So you can have scorers, which are essentially functions that evaluate the quality of a particular action. You can define those as deterministic functions, you know, implemented in code. Or you can use LLMs as judges. But then you can evaluate, like, how did each span perform?

And that can give you a fairly good way to zero in on problematic areas of your agents. So within a particular span, within a particular trace, you can quickly figure out where the agent went wrong because your scorers can now point you to that particular place fairly effectively. But then you can also look at understanding all your production traces holistically. You can do clustering analysis to see what trends you’re observing in your agents as they work across multiple different interactions. You know, where do they get things right? Where do they get things wrong? How do they get things wrong? Why do they get things wrong? So you can now do that kind of holistic analysis on all your data to not only do, you know, debugging and troubleshooting on a particular instance, but get sort of an overall understanding of the behavior of your agent across a wide variety of use cases and categories.
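
The idea of scorers as "functions that evaluate the quality of a particular action" can be sketched in a few lines. This is an assumption-laden illustration, not Braintrust's scorer interface: the trace is a plain list of dictionaries, and `non_empty`, `tool_returned_status`, and `first_failing_span` are hypothetical names. An LLM-as-judge scorer would take the same shape, a function from a step to a number, with a model call inside.

```python
# A trace as an ordered list of steps; the field names are illustrative.
trace = [
    {"name": "plan", "output": "1. look up order 2. reply"},
    {"name": "lookup_order", "output": {"status": None}},
    {"name": "draft_reply", "output": "Your order has shipped."},
]


def non_empty(span: dict) -> float:
    """Deterministic scorer: 1.0 if the step produced any output at all."""
    return 1.0 if span["output"] else 0.0


def tool_returned_status(span: dict) -> float:
    """Deterministic scorer for the tool step: the status field must be set."""
    out = span["output"]
    return 1.0 if isinstance(out, dict) and out.get("status") else 0.0


# Which scorer applies to which step.
scorers = {
    "plan": non_empty,
    "lookup_order": tool_returned_status,
    "draft_reply": non_empty,
}


def first_failing_span(trace, scorers, threshold=0.5):
    """Return the name of the earliest step scoring below `threshold`."""
    for span in trace:
        scorer = scorers.get(span["name"])
        if scorer is not None and scorer(span) < threshold:
            return span["name"]
    return None


culprit = first_failing_span(trace, scorers)
print(culprit)
```

Here the final reply looks fine on its own; only the per-step score shows that it was written on top of a tool call that returned no status.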

Siddhartha Ahluwalia 23:43
So would you classify Braintrust also as a debugging tool then? Or not yet?

Ameya Bhatawdekar 23:49
It is. It is a very effective debugging tool. You know, it tells you if your agent made a mistake and it gives you all the context and gives you all the clues to understand exactly where that mistake happened, where did things go wrong. And so, yeah, it is a very, very useful debugging tool for folks who are working on building these AI systems or continuing to improve these AI systems.

Siddhartha Ahluwalia 24:16
And is vibe coding going through evals? Like people who are using vibe coding, are they adopting tools like Braintrust?

Ameya Bhatawdekar 24:23
I think so. Like, I mean, it’s really important that you have control over the application that you have built by vibe coding. Vibe coding makes it really easy for you to take an idea and build an application.

But how do you know that your application is going to work great? Like, especially if your application is encoding agentic capabilities, right? And so when you’re building these intelligent agentic applications using vibe coding, evals almost become existential.

You know, that’s the only way you have a high degree of confidence that what you’ve built is going to work well. We are starting to see a lot of platforms that provide no code or low code ways of building agents. They are looking at embedding Braintrust in their platform to make it available to these no code, low code developers who are building agentic systems to test and ensure that whatever they have built will work as expected.

And so, you know, even if you’re a no code, low code developer, it’s really important that you have the rigor of building out evals, making sure what you’ve built works well for those evals and continue updating those evals as you collect more data from the real world usage of those systems.

Siddhartha Ahluwalia 25:54
You would have seen one team ship fast with weak evals and another team move slow with strong evals. Which one wins in the long term?

Ameya Bhatawdekar 26:04
Yeah, I mean, you can ship very quickly without doing any evals or by doing a very cursory job at sort of doing evals. But it’s going to result in some technical debt. You’re going to see more frequent regressions, things that used to work stop working.

Any upstream changes can have a dramatic impact on the overall outcome. And so while you can start shipping things fast, shipping things predictably with high quality becomes really hard. Right. And so if you want to build something that is going to scale, that is going to consistently deliver a high quality experience, then, you know, you have to invest in high quality evals. That’s just non-negotiable. And you can see that, you know, users are very sensitive to these systems and the quality of their output. You know, it’s very easy for someone to lose faith in an AI system when they see certain mistakes or they’re having these rough experiences due to quality issues in their interactions with these systems. And so users are not very forgiving. And if you make mistakes too often, you’re going to see churn in usage.

Siddhartha Ahluwalia 27:45
So for enterprises, you know, failure is very expensive. What makes an enterprise trust one AI system and reject another, even if they use the same model underneath?

Ameya Bhatawdekar 27:58
So these intelligent systems, these AI systems are not just a model, right? There’s a lot of layering that happens on top of these models. These systems have to really deliver specific capabilities or specific experiences that help people do certain specific tasks, right?

And so you’re looking at fairly complex systems that have a model at the heart of it, but there is a lot of engineering that happens on top of these systems. You want to make sure that you’re able to collect the right context, that you’re able to translate the user queries into rephrased queries that have a higher probability of being successfully executed. You want to make sure that the instructions that you’re providing to your AI systems are accurate, that they are comprehensive, that they ensure that these systems don’t hallucinate or generate unexpected results.

You want to have the right guardrails in place. And so it’s not just the model. The model is a part of the system, but there’s a lot of engineering that happens on top of these models and you have to get it all right in order to have a high quality system that is at heart a probabilistic model.

Siddhartha Ahluwalia 29:25
Do you think AI startups that are building AI, for example, you know, agentic companies that are providing customer support agents, actually know how good or bad their product is or they’re guessing?

Ameya Bhatawdekar 29:37
Well, if you’re developing based on vibes, then you’re guessing, right? I think it’s going to be very hard to ship something in the real world and expect it to work well if you haven’t stress tested it. If you’ve just tested it, you know, with vibes, chances are your system may not work well across the variety of ways in which people are going to try to interact with your system.

And so I do think that systems that are built with a high degree of rigor have a good chance of becoming useful systems that your customers will engage with. Otherwise, they end up looking like demos.

Siddhartha Ahluwalia 30:24
Can you tell us, in absolute layman’s terms, what are the different ML systems that you worked on in the last 20 years of your career before joining Braintrust? And what was the impact those systems created?

Ameya Bhatawdekar 30:39
Absolutely. So I’ll sort of talk about the three phases of ML that I’ve experienced in my career.

I think the first phase was classic ML: the classic probabilistic algorithms. They are mostly algorithms that were popular in the 2000s. These were logistic regression, decision trees, clustering algorithms, and these were fairly effective. At the time, they did a pretty good job in terms of providing intelligent systems that could do things like risk management and risk detection. They could do spam detection. They could do things like churn analysis, anomaly detection. So they were fairly effective. The way you built these algorithms was you took a lot of data and you worked with the data. You did what you call feature engineering on that data to transform the data into a shape and size that the algorithms could effectively work with to produce high-quality predictive systems. And so at that time, a lot of emphasis was on feature engineering. People used to come up with a lot of different ways in which you would modify the shape and size of the data.

Siddhartha Ahluwalia 32:21
I believe search became really popular starting 20 years ago with Amazon personalizing search. Every company wanted to have a search.

Ameya Bhatawdekar 32:30
Yes, search. With these algorithms, you could implement search algorithms. You could implement personalization systems, recommendation systems.

Siddhartha Ahluwalia 32:37
Yeah, recommendation systems were very popular and all were based on these old ML models.

Ameya Bhatawdekar 32:42
Old ML models, collaborative filtering. And so it was a very interesting time to learn a lot of how ML worked from first principles because you could really understand under the covers what was happening, how you were taking these data, running them through these algorithms, the kind of systematic way in which they were processing the data to generate the final outputs. Comparatively, the interpretability of the systems was much higher.

I think in the 2010s, we saw the wave of neural networks. With the abundance of computing and data, we were now able to train much more sophisticated, much more complex algorithms that could do things that the previous generation of algorithms struggled with. So they could do a really good job of understanding and working with visual data.

They could do a much better job of understanding and working with sequential data. So you had different training techniques that came into place where you were now doing representation learning, you were doing reinforcement learning. And so what that meant was the algorithms were now doing a lot of that feature engineering that previously machine learning engineers would do.

And so the algorithms were now processing data at a much higher volume. They were way more complex and they could now do way more complex tasks. So they could do visual reasoning, they could do machine translation, they could do image detection, facial detection. All kinds of amazing things were now possible.

Siddhartha Ahluwalia 34:20
I’ve worked on facial recognition using neural networks in 2011 and it was a project for the government of India for facial recognition across all airports in India.

Ameya Bhatawdekar 34:32
That’s phenomenal.

Siddhartha Ahluwalia 34:33
Yeah.

Ameya Bhatawdekar 34:34
And it was a significant breakthrough. The ability to do facial recognition would have been technically not quite feasible like five years before.

Siddhartha Ahluwalia 34:46
Yeah, I think the timing was right. And back then gesture recognition became a big thing. Retina recognition became a big thing. If you’ve seen older movies like Angels & Demons, all those CERN labs had retina recognition.

Ameya Bhatawdekar 35:01
Yeah, you saw these pose-detection algorithms become mainstream. You had products like Kinect.

Siddhartha Ahluwalia 35:09
Flutter became popular, if you remember.

Ameya Bhatawdekar 35:11
Yeah, yeah, yeah, yeah. That takes me back quite a few years, yeah. But the whole thing over there was like you now had these really powerful algorithms, but they were a lot less interpretable.

They were very complex neural networks and they took a lot of data to train. And by that time, a lot of companies had acquired, had amassed a lot of data. And these algorithms were more compute intensive, but GPUs were now becoming the way you trained and inferenced these models.

Siddhartha Ahluwalia 35:44
Yeah, all the labs globally, specifically like Stanford and like many, many even in India, they started sharing their open data sets.

Ameya Bhatawdekar 35:53
Yes, you started seeing the emergence of these fairly large data sets, right? And so now the emphasis was on how do you, you know, do some degree of algorithm exploration. But, you know, a lot of these algorithms were used as is with a few tweaks.

Siddhartha Ahluwalia 36:11
It was only reserved for specialists, right?

Ameya Bhatawdekar 36:15
Nobody… The researchers, the machine learning researchers, were now shaping and, you know, building these algorithms while the majority of the practitioners were using these algorithms. But they were focusing a lot on building out these data sets and training these models.

Siddhartha Ahluwalia 36:31
I think in the last three to four years, it’s really been the democratization of AI.

Ameya Bhatawdekar 36:37
Absolutely. And so what has happened is, like, in 2017 the famous transformer paper came out, and transformers had a seminal impact on the…

Siddhartha Ahluwalia 36:49
Can you explain to our audience who don’t belong to this world, what are transformers?

Ameya Bhatawdekar 36:53
Well, the transformer was an architecture that allowed these models to understand very long sequences of words, of data, right? Previously, you had these algorithms that could understand sequences of data, you know, take a long sentence and translate it.

But as the sentence grew longer, you know, it could probably deal with like 50 words or 100 words well. As you start getting into like, you know, a paragraph or a page of like thousands of words, these algorithms really struggled to understand the entire context and pay attention to all of the information that was presented to them, right? They would pay more attention to the most recent words and then forget about the words that occurred further in the past.

And so there were very, you know, great limitations to the amount of data that these algorithms could process. Transformers allowed these models to really work well with large context that was being provided to them. So these models could now process, you know, a paragraph, a page, a whole chapter, or a whole book to produce the outputs in a meaningful, coherent way.

And so transformers was this big seminal moment where the world kind of changed.
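To make the idea of paying attention to all of the information concrete, here is a minimal sketch of scaled dot-product self-attention, the core operation in a transformer. It is an illustration under stated assumptions, not how any production model is implemented: the two-dimensional token vectors are made up, and each vector acts as its own query, key, and value because the learned weight matrices are left out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Scaled dot-product self-attention without learned weights.

    Simplification (assumption): each vector is its own query, key, and value.
    Returns the mixed output vectors and the attention weights per position.
    """
    scale = math.sqrt(len(vectors[0]))
    outputs, all_weights = [], []
    for query in vectors:
        # Score this position against every position, near or far.
        weights = softmax([dot(query, key) / scale for key in vectors])
        # The output is a weighted average of all value vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(len(query))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy sequence: the first and last tokens are similar,
# with three unrelated tokens in between.
sequence = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(sequence)
print([round(w, 3) for w in weights[-1]])
```

In this toy sequence the last token gives the first token the same weight it gives itself, even though they sit at opposite ends. Distance does not reduce the score, which is the contrast with the older sequence models that favored the most recent words.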

Siddhartha Ahluwalia 38:21
I’ll give you an example from one of our portfolio companies, Budy.
So they have built AI for senior care living homes. And this is the ambient AI, which exists in all the systems, including CRM. And it doesn’t have a UI. But what could happen is, for example, they build an AI teammate that can tell a salesperson that, hey, the AI will create your tasks on your email or your calendar and tell you that this is a senior that you need to reach out to because they have been considering home, or this is a child that you need to reach out to because they have been exploring home for their grandparents. And this is the right moment because, you know, usually people start getting anxious after one year or one and a half years and they’re not able to find anything. And because AI can parse through their entire records, so they are able to create these tasks.

But what happens is, let’s say it adapts to the context also. For example, there’s a book called Never Split the Difference. It’s a very popular book in sales.

So if, at a senior care living home, the CRO uses terminology from Never Split the Difference, and culturally the team uses this, the AI is able to absorb the entire, you know, language of the team and uses quotes from Never Split the Difference. People are amazed. Like, how did it adapt to our culture?

Ameya Bhatawdekar 39:51
Yeah, it’s pretty amazing because ultimately all these algorithms, they are essentially trying to predict the next word. Yeah. Given all the previous words that it has been given, it’s trying to predict the next word.

And initially, this used to be a highly statistical probabilistic operation. You know, these models would look at, you know, they would be trained on a large corpus of information, you know, like a Wikipedia-sized dataset. And the idea was that these algorithms are learning certain statistical patterns to predict what the next word is likely to be.
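The statistical flavor of next-word prediction can be sketched with a bigram model, which is a deliberate simplification: it looks only at the single previous word, whereas the models discussed here condition on all previous words. The tiny corpus and the function names below are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each word follows each other word in the text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Estimate P(next word | previous word) from the raw counts."""
    following = counts.get(prev.lower(), Counter())
    total = sum(following.values())
    if total == 0:
        return {}  # the model never saw this word during training
    return {word: count / total for word, count in following.items()}

# Hypothetical miniature training corpus.
corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigram(corpus)
probs = next_word_probs(model, "the")
print(max(probs, key=probs.get), probs)
```

A model like this can only repeat frequencies it has seen. It has no way to prefer a grammatical or factually correct continuation over a merely frequent one.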

But as these algorithms have been trained on more and more data, and as they have become more and more complex, with the number of parameters that make them up growing exponentially, we have seen some very emergent properties in these algorithms, right? You know, at their heart, they are a probabilistic next word prediction system. But they now seem to encode, you know, an understanding of grammar.

So the probability of the next word is not just based on sort of the statistical occurrence in the training corpus, but they are now also able to apparently reason on whether this next word will make that entire sentence grammatically correct or not, right? They are starting to understand facts, for example, right? They are not only able to say what the next word is going to be, but they’re able to now say, like, what is the probability, like, is it going to be factually accurate?

Because that word being factually accurate gives it a higher probability than another word, which also occurs very frequently in the corpus given that particular sentence, but is factually inaccurate. So, for example, if the sentence is, the boiling point of water is 100 degrees centigrade, right? So the model will likely, you know, predict 100 degrees centigrade instead of 102 degrees centigrade, right?

And so we are starting to, it looks like the models are starting to learn facts. The model is starting to learn to reason, right? Like, even if it’s not a fact, it looks like the models are starting to understand whether a given proof is correct compared to an incorrect proof.

And so if you have a statement that looks like a proof, you know, and one word completes it so that the proof is correct, while another word was probably, you know, highly probable for completing that sentence but makes the proof wrong, the model will pick the right proof, right? And so it seems like the models have built this reasoning capability. And, you know, there’s an argument to be made that, you know, how can models learn these kinds of reasoning capabilities?

But there’s also a counter argument that the internet is big. Any question, practically any question that you may have asked has probably been answered on the internet. And so, like, as these models have grown in size, as they’ve been trained on a lot more data, you’re starting to see these very interesting emergent properties of intelligence in these models.

Siddhartha Ahluwalia 43:41
And in the last, you know, we are coming towards the conclusion of the podcast. Where do you think these models and capabilities are heading in ’26 and ’27?

Ameya Bhatawdekar 43:54
You know, I’ve stopped trying to predict where these are going to be a year away from now, because they seem to be rapidly developing impressive capabilities day over day. You know, we’ve already seen models are now getting extremely good at reasoning. They’re getting really good at orchestrations. They have gotten really good at solving analytical problems, like coding, for example. These models have become really good at generating code. These models continue to get good at all of these capabilities at an accelerated pace.

So it’s hard for me to say exactly what these models will be able to do a year from now. But if you look at sort of the trajectory of these models, there was a lot of work being done, maybe about a year ago, to get these models to orchestrate complex tasks correctly. People built these very fancy frameworks and systems that were fairly complex and complicated to improve the orchestration capability of the model.

And, you know, there were some very impressive engineering feats that happened as a result of that. But now the models are able to do that natively. And so it’s pretty amazing that we have just, you know, built models that are now able to do these kinds of planning and execution tasks that, like, we would not have imagined about a year ago that they would be capable of doing.

You know, like if you were working in the early days of GPT-3.5, people spent a lot of time trying to make the models work with a very small context window of 4096 tokens with very weak instruction following. And a lot of what people built then is now completely unnecessary. The models just do that inherently.
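One common workaround from that period can be sketched as trimming a conversation so it fits inside the 4096-token window. This is a rough illustration, not a description of any specific framework: the whitespace-based token counter is a crude stand-in for a real tokenizer, and the `reserved_for_reply` budget of 512 tokens is a hypothetical value.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def fit_to_window(messages, max_tokens=4096, reserved_for_reply=512):
    """Keep the most recent messages that fit in the context window.

    max_tokens comes from the 4096-token window mentioned above;
    reserved_for_reply is a hypothetical allowance for the model's answer.
    """
    budget = max_tokens - reserved_for_reply
    kept = []
    # Walk backwards so the newest messages are kept first.
    for message in reversed(messages):
        cost = count_tokens(message)
        if cost > budget:
            break  # everything older than this is dropped
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept

# Hypothetical conversation: an old, very long message and two newer ones.
history = ["word " * 3000, "word " * 500, "word " * 100]
trimmed = fit_to_window(history)
print(len(trimmed), sum(count_tokens(m) for m in trimmed))
```

Dropping the oldest messages is the simplest policy. Summarizing them instead was another approach, at the cost of an extra model call.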

And so, you know, who knows what the models are going to be doing a year from now, but I’m sure it will be impressive.

Siddhartha Ahluwalia 46:13
Thank you so much Ameya. I loved the conversation.

Ameya Bhatawdekar 46:15
Thank you so much.

Siddhartha Ahluwalia 46:16
I love, you know, the candidness you brought to the conversation and how simple you made it.

Ameya Bhatawdekar 46:22
Thank you so much for having me.