Experiments with LLMs

LLMs are all the rage right now. How big the hype cycle is, I find hard to comment on. But one thing I am confident about: this innovation is a clear inflexion point. The world from here on is going to look very different. If you have used LLM-based tools, you can probably feel it, too.

I have been a user of LLM-based tools for a while now. ChatGPT, Perplexity and GitHub Copilot are regular tools in my toolbox. Besides these, I try new AI tools that come my way every now and then. I have been reading (as much as I could) about developments in the applied AI world, and discussing ideas with peers and in my consulting. However, I have been lagging behind in building something with LLMs myself. I wanted to make sure I ended the year by taking the first step in this super exciting space.

I am a big believer in experiential learning. As always, I started learning with the objective of building something useful. In this case, I wanted to build a tool that would categorize all my monthly expenses from my credit card and bank statements and tag them as business expenses for accounting.

Trying out different models and frameworks

Experiments with Langchain and OpenAI

Langchain is a model-agnostic framework for building applications with LLMs. It works with most LLM backends, like OpenAI, Cohere, Anthropic, etc. I had been wanting to try Langchain for a while. I also prefer starting with a framework that does most of the job and lets me switch between options quickly, instead of getting into the specifics of every LLM provider's APIs. So that's what I did.
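To make that concrete, here is a minimal sketch of what the model-agnostic part buys you. Import paths have shifted between Langchain versions, so treat this as illustrative rather than exact:

```python
# A minimal sketch of why the model-agnostic idea appealed to me (import paths
# vary between Langchain versions; this matches the older langchain.llms style).
from langchain.llms import OpenAI, Cohere

prompt = "Suggest a spending category for this transaction: MTA*NYCT PAYGO"

# Swapping the backend is a one-line change; the rest of the code stays the same.
llm = OpenAI(temperature=0)      # needs OPENAI_API_KEY in the environment
# llm = Cohere(temperature=0)    # needs COHERE_API_KEY in the environment

print(llm.predict(prompt))
```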

I decided to use Langchain and OpenAI to run some experiments and first learn how the framework works and what I can do with it. The first hurdle wasn't even technical: I was not able to add credits to my OpenAI account using my Indian credit card, and my free credits had expired, so I couldn't do much. I was itching to move forward quickly, so that's where I stopped with OpenAI and started looking for other options.

Experiments with Cohere Classification API

I started exploring Cohere, which provides free credits. I tried Cohere with Langchain. While fiddling with Cohere, I discovered that it also has an API for classification, which seemed relevant to what I was trying to do - classify financial transactions into categories. Langchain does not support classification, so this is where I moved out of Langchain to Cohere's Python SDK. While the "hello world" of classification seemed promising, it didn't really work. I tried to provide a large enough dataset of my previously categorized transactions, fiddled around with the format in which I was providing my transactions, etc. But the results were very off. I even tried to classify a slice of exactly the same data that I had provided to Cohere's Classification API as the learning dataset. Even that wasn't getting classified correctly. I couldn't make sense of it.
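For reference, my classification experiment looked roughly like this. This is a sketch against the Cohere Python SDK of the time; the transaction names, labels and API key are placeholders, and the Example import path differs between SDK versions:

```python
import cohere
from cohere.responses.classify import Example  # older SDK; newer versions expose cohere.ClassifyExample

co = cohere.Client("YOUR_COHERE_API_KEY")  # placeholder key

# A few previously categorized transaction names, used as labelled examples.
examples = [
    Example("ZOMATO", "Food"),
    Example("SWIGGY", "Food"),
    Example("KAYAK.COM", "Travel"),
    Example("MTA*NYCT PAYGO", "Travel"),
]

# Transactions from the current statement that need a category.
response = co.classify(
    inputs=["UBER TRIP HELP.UBER.COM", "ZOMATO"],
    examples=examples,
)

for c in response.classifications:
    print(c.input, "->", c.prediction)
```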

I left my experiments with Cohere's Classification API, understanding that classification works better on language constructs (sentences, phrases, etc.). What I was providing were names of transactions (for example, Kayak.com, Zomato, MTA*NYCT PAYGO, etc.). Some of these can be quite cryptic, depending on what the merchant is called in the payment provider's system. Many of the transactions are going to be just new names every month. Names don't mean anything unless backed by a large dataset. Hence, their classification is a tough problem.

But I don't think I understand classification fully yet. Why Cohere was not able to correctly classify exactly the same transactions that I had provided as the learning dataset is beyond me. I have to dive deeper into how classification algorithms really work.

Back to Langchain and OpenAI

After trying Cohere's Classification API, I was back to Langchain and OpenAI as that seemed like the only reasonable option for getting started. I figured out a way to add credits to my account somehow.

I started working with GPT-3.5-based models. In fact, I started with the exact models used in Langchain's docs (why not). My transactions dataset was in CSV format. I wanted to categorize each transaction and get the result back as CSV with an additional "CATEGORY" column so that I could use the result in Google Spreadsheets.

Interestingly, Langchain has a concept called Output Parsers, which you can use to parse outputs into specific formats like lists, JSON, etc. It also works with Python-native data structures like Pandas DataFrames, Enums, Pydantic models, etc. This was interesting, and I wondered how Langchain does it. It turns out it just does really smart Prompt Engineering to instruct the model to return its results in the expected format as a text response, and then uses its own parsers to parse them.
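You can see this trick for yourself by printing the format instructions that one of the built-in parsers injects into the prompt:

```python
# A peek at how Langchain's output parsers work: mostly prompt engineering
# on the way in, plain-text parsing on the way back.
from langchain.output_parsers import CommaSeparatedListOutputParser

parser = CommaSeparatedListOutputParser()

# This string gets appended to your prompt, telling the model how to format its answer
# (roughly: "Your response should be a list of comma separated values, eg: `foo, bar, baz`").
print(parser.get_format_instructions())

# And this turns the model's raw text back into a Python list.
print(parser.parse("Food, Travel, Business"))
# -> ['Food', 'Travel', 'Business']
```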

This was an interesting learning because I could do a lot with it as a prompt engineering pattern of my own. For example, Langchain does not ship a CSV output parser. I could have worked with Pydantic models, but that seemed like a lot of work; I wanted to use CSV and Pandas because that's what I am most comfortable with. So, I started instructing the model to return its results in a format that would be easy for me to parse.
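Applying the same pattern by hand looked roughly like this sketch. The prompt wording is illustrative, and the column names are just the ones from my statements:

```python
import io
import pandas as pd

# Hand-written format instructions, in the spirit of Langchain's output parsers.
PROMPT_TEMPLATE = """Categorize each of the following transactions.
Return ONLY CSV with a header row and the columns: DATE,PLACE,AMOUNT,CATEGORY.
Do not add any explanation before or after the CSV.

{transactions_csv}
"""

def parse_model_csv(response_text: str) -> pd.DataFrame:
    """Parse the model's CSV-formatted text response into a DataFrame."""
    return pd.read_csv(io.StringIO(response_text))
```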

Once prompting for the right output format was solved, I moved on to doing something more meaningful. Very quickly, I ran into the limited context window of the model I was using, because of my large context dataset (400+ transactions) and input dataset to categorize (100-130 transactions). That got me to explore other OpenAI models with larger context windows.

Even with the models with the largest context windows available for public use, I was still running into the limit. So, the next obvious step was to break the input into smaller chunks and categorize them chunk by chunk. But some chunks would invariably come back incomplete, mostly ending abruptly mid-line. As much as I tried to debug it, Langchain surfaces limited information from backend LLM providers like OpenAI, so there wasn't enough to go on. Something like Langsmith could have helped with debugging the interactions with OpenAI, but that's available in closed beta only. I faced a roadblock again.

From here on, the only viable option I had was to work with OpenAI's APIs directly instead of using Langchain.

An important learning in this process was that getting the desired results from low-level LLM APIs usually involves multiple steps and interactions; because of limitations like small context windows, you can't expect one API call to do everything. You then wrap those interactions in a higher-level function to build an API for your own application. A case in point is breaking the input into chunks, getting results per chunk, and stitching them together. That's just one example; multi-step interactions come up in several ways. Read on.
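In code, the chunking-and-stitching interaction looks roughly like this sketch, where categorize_chunk stands in for whatever function actually sends one chunk to the model:

```python
import pandas as pd

CHUNK_SIZE = 25  # small enough to stay well within the context window

def categorize_in_chunks(transactions: pd.DataFrame, categorize_chunk) -> pd.DataFrame:
    """Split the input into chunks, categorize each chunk separately,
    and stitch the partial results back into a single DataFrame."""
    results = []
    for start in range(0, len(transactions), CHUNK_SIZE):
        chunk = transactions.iloc[start:start + CHUNK_SIZE]
        results.append(categorize_chunk(chunk))  # one model interaction per chunk
    return pd.concat(results, ignore_index=True)
```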

OpenAI Assistants FTW!

While trying to use OpenAI's APIs directly, I remembered that OpenAI recently released Custom GPTs and OpenAI Assistants. Both do something similar: give ChatGPT custom instructions to build your own GPTs (or Assistants) and then interact with them through a user interface (in the case of Custom GPTs) or APIs (in the case of Assistants). They can also do things like retrieval from your proprietary datasets. This seemed relevant and something I had been wanting to play with, so I decided to try OpenAI Assistants. I still had to chunk my inputs, but I found that I wasn't getting the truncated, incomplete outputs I was getting with Langchain. I am not quite sure why the outputs were truncated in the case of Langchain, which also uses OpenAI's public APIs. Maybe Assistants uses another API under the hood that works slightly differently and is more reliable than the public APIs? I can't say for sure.
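For anyone who hasn't used them, the basic shape of talking to an Assistant over the API is roughly this. It's a sketch against the Assistants beta API as it was when I tried it; the assistant's name, instructions and model are just my illustrative choices:

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Create an assistant with custom instructions (the Assistants API was in beta
# when I wrote this, so details may have changed since).
assistant = client.beta.assistants.create(
    name="Expense Categorizer",
    instructions="You categorize credit card and bank transactions into spending categories.",
    model="gpt-4-1106-preview",
)

# Each conversation happens on a thread; messages and runs are attached to it.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Categorize these transactions: ...",  # one chunk of transactions goes here
)

run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)

# Poll until the run finishes, then read the assistant's reply off the thread.
while run.status in ("queued", "in_progress"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)
```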

Now that I was not struggling with truncated outputs, it was time to tune my results and make them more accurate. This was an interesting journey of learning Prompt Engineering. I have documented some of my personal learnings about prompt engineering in the rest of this article.

Things I learned about Prompt Engineering

Name your input and intermediate results to make subsequent prompting easier

I was working with multiple datasets - the historically categorized transactions and the transactions to categorize. In fact, the end-to-end process involved transforming the inputs into another form and working with that.

Initially, my prompts would refer to the different datasets loosely as "initial dataset", "categorized dataset", "output in the previous message", etc. I often ended up confusing the assistant and got incorrect results. Here is one of my bad prompts:

Categorize all the transactions in the output of the previous message.

Search for transactions in the provided dataset. If you don't find a transaction in the dataset I provided, search for it by PLACE in your own dataset. Then use this information to categorize the transaction.

Bad Prompt

Instead, explicitly name your datasets and inputs so that they are easy to refer to later. For example, I now use prompts like this:

The attached file is a set of historical transactions to use as reference for categorization.

Each transaction is in the following format: {DATE}, Time: {TIME}, Place: {PLACE}, Amount: {AMOUNT}, CR/DR: {CR/DR}, Category: {CATEGORY}

Let's call this data "historical transactions".

Improved Prompt

Another prompt:

Here is the file with the transactions which are not categorized. The file is in CSV format. Load this file to work with it. Format every transaction in it into the following format: Place: {PLACE}, Amount: {AMOUNT}, CR/DR: {CR/DR}

Let's call this "current transactions".

First Prompt

And then, I will use these references in my future prompts:

Categorize all the transactions in the "current transactions" by looking up similar transactions in "historical transactions".

Second Prompt

Treat the model as someone who has no context and can do one very specific thing well

Treat the model as if you have hired someone who has no idea about your job, work and goals. It's like hiring someone to do a one-off job on Fiverr. This person can work really well if you are very specific, but when given complex tasks, this person might get confused because they don't have the context of the problem and clarity about your expectations of them.

I wanted my assistant to categorize all my transactions according to transactions that I have already categorized in the previous months. If something is not present in my previously categorized transactions, I wanted it to use its own knowledge (the internet) to categorize the transactions. Here is the prompt that I used:

Categorize all the transactions in the input dataset.

Search for transactions by PLACE. If you don't find a transaction
in the historical dataset I provided, search for it in your own dataset.

Example Prompt

This turned out to be too complex, and the result was never accurate. It would either use my historical transactions or its own knowledge, but not both.

What's a better way? Break it down. Here is an improved version.

First prompt:

Categorize each transaction in "formatted current transactions" by looking up similar transactions by PLACE in "historical transactions". If you don't find a matching transaction in "historical transactions", categorize it as PENDING.

The output should be in the following format: Place: {PLACE}, Amount: {AMOUNT}, CR/DR: {CR/DR}, Category: {CATEGORY}

First Prompt

Second follow-up prompt:

Great. Now, categorize all the PENDING transactions with your own dataset and knowledge base. Preserve categories for transactions that have been already categorized and are not marked PENDING. The output should be in the same format as the previous message with all 50 transactions.

Second Prompt

The results were more accurate and more predictable.

Be as specific as possible for accuracy in results

Since I was trying to categorize transactions, I expected the count of categorized transactions to be the same as the count of transactions in the input. Interestingly, the model would either decide to output only the first 5 transactions to show me a sample of its work and get my approval, or it would show all the transactions but erratically drop a few of them.

To make sure that the model does its job correctly and you don't go around in loops trying to get accurate results, make your expectations of the results explicit and precise. Here is an example:

Categorize all the transactions...

The output should be in the following format: Place: {PLACE}, Amount: {AMOUNT}, CR/DR: {CR/DR}, Category: {CATEGORY}

Also, the output should have all 50 transactions.

Example Prompt

Notice that I specify the expected output format and the exact number of transactions I expect in the output. This reduced both the random disappearance of transactions from the output and the model's tendency to show only the top 5 or 10 transactions.
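When driving this through code rather than typing prompts by hand, it helps to derive those specifics from the input itself instead of hard-coding them. A small sketch, assuming current_transactions is the chunk being sent:

```python
def build_categorization_prompt(current_transactions) -> str:
    """Spell out the expected output format and the exact number of output rows."""
    count = len(current_transactions)
    return (
        'Categorize all the transactions in "current transactions".\n'
        "The output should be in the following format: "
        "Place: {PLACE}, Amount: {AMOUNT}, CR/DR: {CR/DR}, Category: {CATEGORY}\n"
        f"Also, the output should have all {count} transactions."
    )
```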

Cross-question the model's responses to reason and improve your prompts

You might start with a set of prompts that you think might work. But the result might not be what you expected. In that case, you can cross-question the model to understand why the output is not what you expected. This can help you improve your own understanding of how the model is processing your input and instructions. Once you have a better understanding, you can improve your prompts. This is called Chain of Thought reasoning. Consider this example:

Categorize each transaction in "formatted current transactions" by looking up similar transactions by PLACE in "historical transactions". If you don't find a matching transaction in the historical transactions, categorize it as PENDING.

The output should be in the following format: Place: {PLACE}, Amount: {AMOUNT}, CR/DR: {CR/DR}, Category: {CATEGORY}

The output should also have all the 50 transactions.

We will call this dataset "first pass".

First Prompt

In my case, the response had all the transactions categorized as PENDING. This was unexpected as I knew that my input had some transactions that were present in my "historical transactions" dataset. So, my follow-up prompt was to reason about it. Here was my next prompt:

You have marked everything as PENDING. Didn't you find any transaction in the historical dataset?

Second Prompt

The model responded by saying this:

Apologies for the confusion. It seems that I mistakenly used a placeholder mapping for historical transactions instead of the actual historical transactions dataset. Let me correct this and categorize the transactions again using the correct historical transactions dataset.

...

Response to Second Prompt

Now, this helped me understand that my First Prompt was ambiguous. Using this kind of reasoning, I was able to iterate further on my prompts and make them more accurate. If I were doing this through code, I could also inject automated tests at intermediate steps and use a fallback prompt to have the model improve its results.
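A rough sketch of what such an automated intermediate check with a fallback prompt could look like (send_prompt is a placeholder for however you actually talk to the model):

```python
def categorize_with_check(send_prompt, prompt: str, expected_count: int, max_retries: int = 2) -> str:
    """Send a categorization prompt, check that no transactions went missing,
    and fall back to a corrective prompt if they did."""
    response = send_prompt(prompt)
    for _ in range(max_retries):
        output_lines = [line for line in response.splitlines() if line.strip()]
        if len(output_lines) == expected_count:
            break
        # Corrective fallback prompt: tell the model what was wrong with its last answer.
        response = send_prompt(
            f"Your previous output had {len(output_lines)} transactions, "
            f"but it must contain exactly {expected_count}. Please redo it with all transactions."
        )
    return response
```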

What's next?

This has been an exciting learning experience. I am a little better at understanding how to build LLM-based applications. There are still some areas that can be improved in this assistant to make the output more deterministic and accurate. I will keep working on that.

The next set of things I am excited to learn:

  • Using Retrieval in OpenAI's Assistants to further fine-tune my assistant
  • Exploring RAG by implementing a vector database myself and going down the rabbit hole of learning all about vector embeddings, cosine similarity and retrieval
  • Running open-source models instead of using OpenAI's models and testing for accuracy using an automated (or semi-automated) process

Update: OpenAI's Prompt Engineering Guide

It turns out that OpenAI has written a useful guide on Prompt Engineering to get accurate results, which overlaps quite a bit with what I have learned. While it would have been helpful to read this before I did everything myself, the engineer in me is happy that I was able to figure out most of this on my own 😄