I’m assuming basic familiarity with large language models (LLMs) as I write this one. If you’d like a refresher, check out ‘A product builder’s introduction to large language models’.
We’ve come quite far using LLMs’ raw capabilities to build some very cool products. Better content personalisation, search, question answering and more have proven to deliver great customer value. What we haven’t seen much of in B2C production yet is LLM agents (this post will start with the basics and cover what agents are, and how to implement them!) - and for good reason!
The Agent landscape is about to change though, and I predict that in the next year or so we will all be very comfortable using LLM agents in our production systems. Before we jump in, let’s dive a bit deeper into how we’ve been using LLMs in production so far.
So far, we’ve seen three distinct ‘waves’ of LLM usage, each increasing in complexity and usefulness.
Wave 1: Basic text generation
The first wave, starting early 2023, focused on basic text generation by leveraging the capabilities of LLMs to:
Understand text intent (what is the topic being discussed in this text, what is the sentiment, …)
Summarise text (summarise this very long article into a TL;DR – pretty handy for the article you are reading now ;) )
Generate new text (describe ‘Paris’ to a young family wanting to go for the first time)
Answer questions (are there long queues at the Eiffel tower?)
Through a combination of these, have chat dialogues with users (plan a trip to Paris!).
From a B2C industry perspective, this led to some pretty useful stuff, including enhancements to search, personalised content generation, question answering capabilities, and level 1 frontline customer support chatbots. These capabilities remain useful today, and some of the largest realised impact of LLMs throughout the industry builds on them.
Why we need more
One limitation these capabilities present is the LLM’s lack of real-time world knowledge - for example, LLMs by themselves cannot answer “what’s the weather in Paris today?”. Because LLMs are trained on static data snapshots, real-time answers are simply not possible.
The other limitation: the LLM has general world knowledge, but not the business-specific knowledge that’s proprietary to you - for example, what’s your refund policy in case of a cancellation?
Many different approaches have been tried to work around these problems - including augmenting LLMs with pre-retrieved, real-time data placed directly in prompts (retrieve the weather for Paris from a weather API and make it available in the prompt), or retrieval augmentation systems to provide more bounded context (give the LLM access to your specific customer service procedures via data retrieved from your customer service manuals and added to prompts).
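To make the first approach concrete, here is a minimal sketch of the ‘retrieve and stuff it into the prompt’ pattern (the weather endpoint and get_weather helper are placeholders for whatever data source you actually use; the OpenAI call follows the pre-1.0 SDK style used in the examples later in this post):

import openai   # pre-1.0 openai SDK, matching the function calling example below
import requests

def get_weather(city):
    # Placeholder: call whichever weather API you actually use
    resp = requests.get("https://api.example.com/weather", params={"city": city})
    return resp.json()["summary"]  # e.g. "18°C, light rain"

# Retrieve the real-time data up front and paste it into the prompt
weather_today = get_weather("Paris")

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": f"You are a travel assistant. Today's weather in Paris: {weather_today}."},
        {"role": "user", "content": "What's the weather in Paris today?"},
    ],
)
print(response["choices"][0]["message"]["content"])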
These approaches work fine for a number of use cases, but as the amount of data grows, they begin to hit limitations. Imagine you want to respond to that weather query for any location across the globe: you would need to retrieve the weather everywhere, at every point in time, and pass it all into the prompt - pretty unmanageable. Plus, you can’t really ‘do’ much with this approach in terms of taking action. For our customer service use case, we can retrieve what the correct procedure is (let’s say it’s ‘initiate a refund’), but we can’t actually act on it.
The second wave of using LLMs, relying on function and tool calling, is a response to these limitations.
Wave 2: Function calling
We identified in the last wave a few ‘functions’ or tools the LLM needs to be able to use dynamically in order to get over its limitations. For our weather use case, it needs to be able to dynamically access a weather API. For our customer service use case, the LLM needs to be able to dynamically access product flows to initiate cancellations and refunds.
With function calling, we give the LLM access to these capabilities. We rely on the LLM’s discretion to ‘call’ these functions when necessary, by analysing what our users want and which function is the right one to call to produce that response.
Ok, that sounds like a beautiful system - why haven’t we been doing this extensively for the last couple of years? Because of this: we rely on the LLM’s discretion to ‘call’ these functions when necessary. At the dawn of the ChatGPT era, LLMs were simply not good enough at using their discretion and following structure / prompts effectively. My early attempts (as well as many others’) at function calling, way back in 2023, did not yield good enough outputs. That landscape has since changed drastically with better performing LLMs.
So how does this work in practice? In wave 1, our LLM prompts had two components - a ‘system prompt’ that defined the system behaviour for the LLM (“you are a helpful customer service agent”), and a ‘user prompt’, which was the user’s dialog / context (“can I cancel my order?”). Function calling introduces a third input - a list of functions available to the LLM. Here’s an example:
import openai  # pre-1.0 openai SDK; newer SDK versions expose the same idea via client.chat.completions.create

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        # System prompt: defines the assistant's behaviour
        {"role": "system", "content": "You are a customer service assistant..."},
        # User prompt: the customer's request
        {"role": "user", "content": "I want to cancel order #12345"}
    ],
    # The third input: functions the LLM is allowed to call
    functions=[
        {
            "name": "check_order_status",
            "description": "Validates order exists and gets current status",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string"}
                },
                "required": ["order_id"]
            }
        },
        # ... other function definitions ...
    ]
)
"role": "system", "content": is our system prompt.
"role": "user", "content": is our user prompt
functions = is our list of functions available to the LLM, notably in our example, the ‘check order status’ function. Conceptually, just as simple as normal prompt writing.
The beauty of this setup? The LLM knows exactly what tools it has (from the function definitions) and how to use them (from the prompt), but can still maintain a natural conversation flow with users. Your backend then implements these functions to actually perform the requested actions.
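To make that concrete, here is a rough sketch of the backend handling, continuing from the response above (check_order_status is a stand-in implementation; the message shapes follow the pre-1.0 openai SDK used in the example):

import json
import openai

def check_order_status(order_id):
    # Stand-in implementation: look the order up in your own systems
    return {"order_id": order_id, "status": "shipped", "cancellable": False}

message = response["choices"][0]["message"]

if message.get("function_call"):
    # The LLM decided a function is needed - we execute it ourselves
    name = message["function_call"]["name"]
    args = json.loads(message["function_call"]["arguments"])
    result = check_order_status(**args) if name == "check_order_status" else {}

    # Send the result back so the LLM can phrase a natural reply to the user
    followup = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a customer service assistant..."},
            {"role": "user", "content": "I want to cancel order #12345"},
            message,  # the assistant turn containing the function call
            {"role": "function", "name": name, "content": json.dumps(result)},
        ],
    )
    print(followup["choices"][0]["message"]["content"])
else:
    # No function needed - the LLM answered directly
    print(message["content"])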
This approach gives you the best of both worlds: structured, predictable function calls for your backend, and natural, helpful interactions for your users.
The key to good function calling prompts is being specific about the tools available and crystal clear about how they should be used. It's like giving someone a toolbox and an instruction manual - they need to know both what tools they have and how to use them properly.
Limitations of function calling
The main limitation of function calling is that the LLM can only work with the functions you've defined. If a user asks "what's the best time to visit Paris" but you haven't defined a function for seasonal tourism data, the LLM is stuck with generic knowledge from its training data.
This is why function calling, while powerful, can feel rigid. Every possible action needs to be anticipated and defined upfront. Want your assistant to help with hotel bookings too? That's a whole new set of functions to define. Want it to check local events? More functions. The system doesn't figure out new capabilities on its own - it can only use the tools you've explicitly given it.
The beauty of LLMs so far has been their non-deterministic, natural, ask-me-anything kind of outputs. Function calling, while making LLMs more useful, takes away from this non-deterministic beauty and brings back the rigidity of deterministic coding.
LLM Agents are an effective response to this limitation.
Wave 3: The era of LLM Agents
The idea behind agents is simple. In the Wave 2 examples, we defined a list of functions and gave our LLM access to these tools; the LLM could then determine which tool to use.
To make the limitations of function calling clearer, let’s say we give our LLM access to two tools - a weather API (‘what’s the weather like in Paris on a given day?’) and a booking API (a list of hotel bookings made by our user). In our function calling setup, we can’t answer the question “should I carry an umbrella today?”, because our weather function needs a definite location and day, and our booking API has no weather information.
An agent, however, is smarter. It will first check if the user is already in Paris today, by calling the booking API. If it finds that the user is in Paris today, it will check the weather for Paris today. If the weather indicates rain, it will respond with ‘yes, it’s wise to carry an umbrella today!’.
So, Agent LLMs go beyond simple function calling by being able to:
Plan steps to achieve a goal
Choose their own tools and approach
Learn from and adapt to results
Maintain context and memory
Make decisions about what to do next
Setting up an Agent
The setup is quite uncomplicated, at least conceptually. We have the same 3 components - the system prompt, the user prompt, and the list of tools.
Agent prompts are different though - they focus on goals and reasoning rather than specific steps. Let’s look at this through an e-commerce example, here is what an agent prompt would likely look like:
You are a shopping assistant helping customers find and purchase products.
Your goal is to help customers find what they want and complete purchases.
For each customer interaction:
1. Understand their needs and preferences
2. Think through the best approach to help them
3. Use available tools as needed
4. Adapt your approach based on results
5. Keep track of important details they share
Explain your thinking process and what you're doing to help them.
This prompt gives the LLM a lot more agency to reason and figure out what combination of tools can help.
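Conceptually, the wiring behind such an agent is a loop: call the LLM, execute whichever tool it asks for, feed the result back, and repeat until it produces a final answer. Here is a hedged sketch of that loop (the tool implementations are placeholder stubs, the function definitions are abbreviated versions of the ones in the practical example below, and the call style follows the pre-1.0 openai SDK used earlier):

import json
import openai

AGENT_PROMPT = "You are a shopping assistant helping customers find and purchase products..."

def search_products(**kwargs):
    # Placeholder stub: query your product catalogue
    return [{"id": "shoe123", "name": "Trail Runner", "price": 89}]

def check_inventory(**kwargs):
    # Placeholder stub: query your inventory system
    return {"in_stock": True, "delivery_days": 3}

TOOLS = {"search_products": search_products, "check_inventory": check_inventory}
FUNCTION_DEFS = [
    {"name": "search_products", "description": "Search product catalog, returns matching items",
     "parameters": {"type": "object", "properties": {"query": {"type": "string"}}}},
    {"name": "check_inventory", "description": "Check current stock levels for a product",
     "parameters": {"type": "object", "properties": {"product_id": {"type": "string"}}}},
]

def run_agent(user_message, max_steps=5):
    messages = [{"role": "system", "content": AGENT_PROMPT},
                {"role": "user", "content": user_message}]
    for _ in range(max_steps):  # cap the loop so the agent cannot spin forever
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo", messages=messages, functions=FUNCTION_DEFS)
        message = response["choices"][0]["message"]
        if not message.get("function_call"):
            return message["content"]  # no more tool calls - the agent is done
        # Execute the tool the agent chose and feed the result back to it
        name = message["function_call"]["name"]
        args = json.loads(message["function_call"]["arguments"] or "{}")
        result = TOOLS[name](**args)
        messages.append(message)
        messages.append({"role": "function", "name": name, "content": json.dumps(result)})
    return "Sorry, I couldn't complete that request."

print(run_agent("I need running shoes under $100 that can be delivered by next week"))

Most agent frameworks are, at their core, a more robust version of this loop.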
Agents use the same fundamental structure as function calling. The prompts, however, make them markedly different by enabling the agent to perform:
Dynamic Planning
Function calling: Follows preset workflows
Agents: Create their own plans based on goals
Tool Usage
Function calling: Uses tools in predefined ways
Agents: Creatively combines tools to solve problems
Memory and Context
Function calling: Handles one interaction at a time
Agents: Maintains context across conversations
Problem Solving
Function calling: Fixed decision trees
Agents: Adaptive problem solving
Here's how an agent handles a complex request:
User: "I need gift ideas for my mom who loves gardening"
Agent Thought Process:
1. Gather more information about preferences
2. Search multiple categories
3. Consider seasonality
4. Check reviews and ratings
5. Verify availability
[Asks clarifying questions about budget and specific interests]
[Searches across multiple categories]
[Checks inventory for top options]
[Creates curated list with reasoning]
[Adapts suggestions based on user feedback]
A practical example
Let’s build out that e-commerce agent a bit more fully.
Here is how I would define the tools:
{
    "tools": [
        {
            "name": "search_products",
            "description": "Search product catalog, returns matching items",
            "parameters": {...}
        },
        {
            "name": "check_inventory",
            "description": "Check current stock levels for a product",
            "parameters": {...}
        }
    ]
}
Here’s the prompt:
You are a shopping assistant that can help customers find products, check availability, and place orders.
You can:
- Search the product catalog
- Check inventory levels
- Place orders
- Calculate shipping times
You cannot:
- Modify existing orders
- Apply custom discounts
- Access customer payment details
And some error handling:
{
    "error": "NOT_AUTHORIZED",
    "message": "Agent cannot access customer payment information",
    "allowed_actions": ["redirect_to_payment_page"]
}
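On the backend, the ‘you cannot’ rules are best enforced in code rather than trusted to the prompt. Here is a sketch of a tool dispatcher that returns structured errors like the one above to the agent, so it can recover (e.g. by redirecting the user) instead of failing silently - the permission check and tool registry are assumptions for illustration, not any specific library’s API:

def call_tool(name, args, tools, allowed_tools):
    """Dispatch a tool call from the agent, enforcing the 'you cannot' rules in code."""
    if name not in allowed_tools:
        # Return a structured error instead of raising, so the agent can adapt
        return {
            "error": "NOT_AUTHORIZED",
            "message": f"Agent cannot call {name}",
            "allowed_actions": ["redirect_to_payment_page"],
        }
    try:
        return tools[name](**args)
    except Exception as exc:
        return {"error": "TOOL_FAILED", "message": str(exc)}

# Whatever comes back - result or error - gets appended to the conversation as a
# {"role": "function", ...} message, exactly like a normal tool result.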
And that’s it! Here’s how the agent would behave in practice:
The user prompt:
User: "I need running shoes under $100 that can be delivered by next week"
The agent's thinking process might look like:
Thought: I need to find suitable shoes and check delivery options. Let me break this down:
1. Search for running shoes with price filter
2. Check inventory for quick delivery
3. Verify shipping times
4. Present options to user
Action: Let me search first
[calls search_products({
    "query": "running shoes",
    "filters": {
        "price_max": 100,
        "category": "athletic/running"
    }
})]
Thought: Found several options. Need to check stock and delivery for each
[calls check_inventory({
    "product_id": "shoe123",
    "location": "nearest_warehouse"
})]
Thought: This option has stock and can be delivered in time. Let me show it to the user and ask about size preferences...
As you can see in the example above, the power of agents comes from their ability to:
Break down complex problems
Choose appropriate tools
Adapt their approach based on results
Maintain context throughout the interaction
Make decisions about what information they need
This flexibility makes agents particularly good for tasks where the path to the solution isn't straightforward or needs to adapt based on what's discovered along the way.
Why don’t we have agents everywhere yet?
Despite their apparent potential and how radically helpful they could be, we haven't seen agents take over yet.
The main challenges fall into three buckets:
Technical Reliability
Agents are unpredictable - they might take unexpected actions or get stuck in loops. We’ve essentially got only the prompts guiding the agents, and we rely on the reasoning capabilities of the underlying models to ‘figure out’ the problem.
Error handling is complex because agents can attempt things in ways you didn't anticipate - given the amount of freedom and agency, you have far less control on the outcome.
Cost management is tricky, since agents might make multiple API calls to figure things out - if they’re stuck in a loop or if the prompt is suboptimal, the cost will really spiral out of control.
Each tool integration needs robust validation, rate limiting, and error states - adding further complexity and need for trial-and-error, domain expertise, etc.
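In practice, teams wrap deterministic guardrails around the agent loop - for instance a hard cap on steps and a rough cost budget, as in this sketch (the per-token price is illustrative, not a real rate):

MAX_STEPS = 5
MAX_COST_USD = 0.50

def within_budget(step, tokens_used, usd_per_1k_tokens=0.002):  # illustrative rate
    # Stop the agent before a runaway loop burns through the budget
    estimated_cost = tokens_used / 1000 * usd_per_1k_tokens
    return step < MAX_STEPS and estimated_cost < MAX_COST_USD

# Inside the agent loop:
#   if not within_budget(step, total_tokens):
#       return "Sorry, I couldn't finish that - handing you over to a human."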
Product Complexity
Building good agent UX is hard - users get frustrated when agents take too long to think, and can abandon workflows or even churn away from the product. A few workarounds for the UX do exist, including providing the user greater visibility into ‘what’ the LLM is doing.
Setting the right expectations is crucial - users either expect too much or too little with agents. A well made agent is often underutilised because the user is not primed to explore all of its capabilities. An agent that is good at one task can often appear to overpromise because the UX / copy is not well done.
Recovery flows are complex - what happens when an agent fails halfway through a task? Users can get frustrated when these flows are not well managed, as they may need to start all over again without guarantees of success.
You need extensive logging and monitoring to understand what your agents are actually doing, to make prompt adjustments and debugging even remotely possible.
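Even a lightweight structured log of every agent step - which tool was called, with what arguments, and what came back - goes a long way. A minimal sketch (the log fields are just a suggestion):

import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("agent")

def log_step(conversation_id, step, tool_name, args, result):
    # One structured line per tool call; ship these to your log pipeline of choice
    logger.info(json.dumps({
        "conversation_id": conversation_id,
        "step": step,
        "tool": tool_name,
        "args": args,
        "result_preview": str(result)[:200],
        "timestamp": time.time(),
    }))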
Business Risk
Agents can make costly mistakes if not properly constrained - imagine if your agent determines that the best possible result for every customer’s cancelled order is a full refund.
The development overhead is significant compared to simple function calling - as I mentioned, you need some degree of experience handling LLMs before venturing into agents.
Many use cases work fine with simpler approaches - while we have discussed quite a few examples using agents here, there are more pragmatic ways to start - for improving user experience as well as for gaining efficiencies.
The ROI isn’t clear yet for many applications, and the costs are high. Industry estimates currently suggest it can cost an agent up to $20 to accomplish, at human quality, a task that a human can do for $6. The unit economics will get better, but they’re not there yet!
That said, I think we're at an interesting inflection point. As the tooling improves and we figure out the right patterns for building with agents, we'll likely see more adoption in specific use cases where the benefits outweigh the complexity.
For the next writeup, I’ll focus a bit on some examples of agents in the wild, as well as when and for what problems you should consider using agents - and also when not to use them. We will go into the details of the product building flows, the metrics and setups to use, etc, as I continue writing this series.
Meanwhile, if this was helpful, here are two related reads:
If you would like to start with the basics of LLMs, check out this writeup:
If you would like to read about writing good prompts, check out this one: