Happy New Year!

I’m so excited for 2024. Ankit and I are re-igniting Pocket Labs and are planning on releasing a product of our own very soon.

This year I will have the time and freedom to commit to publishing more, narrating my work and launching new ideas.

We’re going to have Fun on the Internet™.

Applying LLMs to private data

Over the past year I have been focused on how to apply LLMs to private data. At first I thought training on top of an existing LLM would make sense, based on my previous machine learning experience. It quickly became clear that this wouldn’t deliver the results one would expect. The way to go was prompt engineering: building a long, carefully constructed context window that establishes the basis for an eventual question. LlamaIndex is a framework that does just that.

LlamaIndex is a simple, flexible data framework for connecting custom data sources to large language models. The framework provides connectors for a variety of data sources, the ability to index that data, and query/chat engines to interface with it.

My goal was to create a natural language search prototype over a synthetic data set. I used ChatGPT to generate the synthetic data set as an SQLite database that ran in memory. I configured LlamaIndex to use that database and set an OpenAI API key to utilize ChatGPT 3.5 as the executing LLM. That was it.

I was able to ask complex questions and get text answers that were relevant to the data set. Complex questions appeared to work as long as they followed the naming found within the SQL schema. Not every question worked, but most did out of the box. It became clear that if I cleaned up some of the naming in the schema, the LLM would make fewer mistakes.

Performance wasn’t great. Simpler questions yielded faster response times. Broader questions that had to consider the entire data set took much longer to execute. At its fastest it took a second or so to respond, and at its slowest it took 30 seconds or more. Some very broad queries failed to execute.

Context window size matters. When I increased the context window to 16K for the ChatGPT 3.5 model, the answers improved. GPT-4 was much slower, but the responses were better than GPT-3.5.

So how does LlamaIndex actually achieve this? After turning on verbose debugging and examining the sparse code base it was clear: LlamaIndex was asking ChatGPT to do just about everything. Every database query was written by ChatGPT and all logic was handled by ChatGPT. LlamaIndex was just there to ask, execute the query ChatGPT told it to and then supply the response back.

The query engine process worked in the following way for every natural language search question:

  • LlamaIndex: Prompt ChatGPT with context window containing entire database, include the question asked and then request an SQL query as part of the response.
  • ChatGPT: Responds with the SQL query to execute.
  • LlamaIndex: Executes SQL query on database and provides results as response to ChatGPT and asks for synthesis.
  • ChatGPT: Repeats the steps above if there is another query it would like to execute. Otherwise synthesizes all of the data gathered in a text summary response.
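The loop above can be sketched in a few lines. This is a minimal illustration, not LlamaIndex's actual code: the `llm` argument is any function that takes a prompt and returns text, `fake_llm` is a hypothetical canned stand-in so the example runs offline, and the sketch sends only the table schema and does a single query pass rather than repeating.

```python
import sqlite3
from typing import Callable

SQL_INSTRUCTION = "Reply with one SQLite SELECT statement only."

def schema_text(conn: sqlite3.Connection) -> str:
    """Collect the CREATE TABLE statements so the LLM can see the naming."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
    ).fetchall()
    return "\n".join(row[0] for row in rows)

def answer(conn: sqlite3.Connection, question: str, llm: Callable[[str], str]) -> str:
    # Step 1: ask the LLM to write the query.
    sql = llm(
        f"Schema:\n{schema_text(conn)}\n\nQuestion: {question}\n{SQL_INSTRUCTION}"
    ).strip()
    # Guard (an addition in this sketch): only run read-only statements.
    if not sql.lower().startswith("select"):
        raise ValueError(f"refusing to run non-SELECT statement: {sql!r}")
    # Step 2: execute whatever the LLM wrote.
    rows = conn.execute(sql).fetchall()
    # Step 3: hand the rows back and ask for a text synthesis.
    return llm(f"Question: {question}\nSQL: {sql}\nRows: {rows}\nAnswer in one sentence.")

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model; returns canned responses."""
    if SQL_INSTRUCTION in prompt:
        return "SELECT COUNT(*) FROM orders WHERE status = 'shipped'"
    return "Canned summary: " + prompt.split("Rows: ")[1].split("\n")[0]

# Tiny in-memory data set, in the spirit of the prototype described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("shipped",), ("pending",), ("shipped",)],
)
result = answer(conn, "How many orders have shipped?", fake_llm)
print(result)
```

Swapping `fake_llm` for a real API call is the only change needed to try this against a model, which is also why schema naming matters so much: the schema text is all the model has to go on.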

It was illuminating to see what it was like to hand the keys over to an LLM, from having it write every query and synthesize the response to the weird and simple mistakes it made. Ambiguous language and synonymous terms that it failed to interpret were major issues. I had to add 50+ specific prompt directives to get it to do what I wanted for most queries. In many ways it was like working with a junior developer.

New startups can’t acquire cloud resources

This is more of a public service announcement than a post but it’s worth stating: new cloud accounts at AWS, Google and beyond have severe service quota limitations on resources. Nearly all services have a service quota default of zero when the new account is created. All services that require accelerated GPU resources will certainly start with a quota of zero.

There are many reasons to create a new cloud account. The formation of a new entity, the formation of a new product, a merger, the need to separate data from code and access control are all valid reasons. The problem: a new cloud account is effectively a clean slate without a payment history. As a result cloud providers assume you can’t afford expensive resources or don’t need them yet.

Requests can be made to increase service quotas. Sometimes they are granted, other times you have to justify them, and often you may be denied, especially if there is no payment history.

I recently set up a new AWS account and was stunned to find that I could not launch any EC2 instance with more than 5 virtual CPUs, period. Any instance beyond the most basic web server would require a service quota request. EC2 wasn’t the only problem either: service quotas across the board were extremely restrictive. Requests for quota increases took weeks of back and forth and promises that this new entity could in fact afford the resources. Frustrated by the lack of progress, I checked out Google Cloud Platform and found the exact same situation: service quotas were also all set to zero and also required complaining.

At the venture studio we are launching several startups a year and many projects require accelerated instances or services. The lesson: create new cloud accounts for new entities early and make requests for services before you need them. Do this across cloud providers to ensure you have the resources when it’s time to build, even if they are not on your preferred cloud platform.
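On AWS, checking a quota and filing the increase request early can be scripted. The sketch below assumes a boto3 Service Quotas client (`boto3.client("service-quotas")`) and uses its `get_service_quota` and `request_service_quota_increase` calls; the `_StubClient` class and the `L-EXAMPLE` quota code are hypothetical placeholders so the example runs without an AWS account. Real quota codes have to be looked up per service, for example in the Service Quotas console.

```python
from typing import Any

def ensure_quota(client: Any, service_code: str, quota_code: str, desired: float) -> str:
    """Request a quota increase if the current value is below `desired`."""
    current = client.get_service_quota(
        ServiceCode=service_code, QuotaCode=quota_code
    )["Quota"]["Value"]
    if current >= desired:
        return f"ok: quota already {current}"
    response = client.request_service_quota_increase(
        ServiceCode=service_code, QuotaCode=quota_code, DesiredValue=desired
    )
    return f"requested {desired}: {response['RequestedQuota']['Status']}"

class _StubClient:
    """Hypothetical stand-in that mimics the two response shapes used above."""

    def __init__(self, value: float) -> None:
        self.value = value

    def get_service_quota(self, ServiceCode: str, QuotaCode: str) -> dict:
        return {"Quota": {"Value": self.value}}

    def request_service_quota_increase(
        self, ServiceCode: str, QuotaCode: str, DesiredValue: float
    ) -> dict:
        return {"RequestedQuota": {"Status": "PENDING", "DesiredValue": DesiredValue}}

# A brand-new account: quota of zero, so a request gets filed.
status_new = ensure_quota(_StubClient(0.0), "ec2", "L-EXAMPLE", 32.0)
# A mature account: quota already high enough, nothing to do.
status_old = ensure_quota(_StubClient(256.0), "ec2", "L-EXAMPLE", 32.0)
print(status_new)
print(status_old)
```

With real credentials, the stub would be replaced by `boto3.client("service-quotas")`. The request still goes through the provider's review, so the script only saves the clicking, not the waiting.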

The problem appears to be specific to new accounts. I have a long-running AWS account that I created in 2009 with a long and substantial payment history for several non-trivial services I still maintain. That account has no issues with service quotas, with levels set to more than I would ever need. At a time when GPUs are being fought over, having a mature AWS account is meaningfully important.

B2B AI will require self hosted LLMs

Business to business AI solutions that handle sensitive or large amounts of business customer data will have to host their own LLM infrastructure for both security and performance reasons.

Many B2B solutions handle not only the data of their direct business customer but potentially sensitive information of that business’s customers. If your AI solution could potentially throw any of that information into the plaintext of an LLM context window at a third-party sub-processor like OpenAI, you could be taking on tremendous risk. Depending on the agreement with the third party, that data could be retained and trained on or used in other ways internally to improve their models, and thus potentially leak information to users in future models. Additionally, your customer data is potentially totally exposed in the event of a data leak at the sub-processor.

Trusting a third-party sub-processor with your customer data is not new, but the data patterns with LLMs are. LLMs potentially require a LOT of text-based input in order to operate. This input is called the Context Window and can vary in potential length from 1,000 to 32,000 tokens. A token is a word or part of a word. In order to get the output you seek from an LLM on behalf of a customer, you may be curating a very long and detailed Context Window containing a convenient, human-readable representation of data that may bring together historical and sensitive personally identifiable information about a particular person. A data leak of these Context Windows could be very damaging.

An AI feature may require a long, multi-turn conversation with an LLM that involves sending and receiving large amounts of information that takes significant time to generate. A lot of time is spent transmitting and waiting on information. These calls will take even more time if they are being sent to a third-party sub-processor because they will have to travel over the internet. Many are using OpenAI in this way right now. The benefit now is how quickly you can stand up a solution: just wire up a few OpenAI calls and you have an AI product! The downside, though, is the time spent in input/output, the waiting on responses and, from my experience, totally unpredictable performance and throughput.

The solution to these two significant problems is self hosting LLMs. Customer data can be kept secure within your own internal network and not have to travel to a third party processor. Additionally, the cost of transit can be eliminated by keeping all calls in-house. Another benefit is delivering throughput guarantees for your users by controlling the resources involved.

The challenge at the moment is standing up this solution. Running your own large server instance is also possible but complex to stand up and scale. Services like AWS Bedrock have the potential to really be the solution we need to deliver on security and performance.

AWS Bedrock

AWS has just launched a new service called Bedrock for running foundation and custom generative AI models. At first glance this looks like an extremely convenient way to deploy open source foundational models and customizations of them. Additionally, it would allow businesses to utilize LLMs within their account and private VPCs without having to ship customer data in a zillion plaintext calls to 3rd parties like OpenAI.

The available models to start:

Llama 2 support is coming soon. I’m really interested in where the pricing lands on this.

Pricing across the models is a mix of per-token and time-based (hourly) pricing. Unfortunately, it appears to be one or the other, and some models are hourly only and just as expensive as everything else on SageMaker. Throughput can be purchased in provisioned amounts.

Amazon now has multiple efforts with SageMaker and Bedrock for deploying AI services. Bedrock looks like what I’ve been searching for as a developer but SageMaker may still be required for training and other tasks.


Large language models are slow and require scarce expensive resources. The inference process is bottlenecked by available GPU memory. The entire large language model itself needs to fit into video memory along with enough of a buffer to support live inference sessions of varying length. The result is an anemic throughput of 1-3 requests per second even with some fairly expensive hardware.
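A rough back-of-the-envelope calculation shows why memory is the bottleneck. The sketch below assumes a nominal 7B-parameter model in 16-bit precision with 32 layers and a hidden size of 4096 (the published Llama 2 7B shape), a hypothetical 24 GiB card, and sessions of 2,048 tokens. It ignores activations and framework overhead, so real numbers will be worse.

```python
GIB = 2**30

def weights_gib(params_billion: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the model weights, in GiB."""
    return params_billion * 1e9 * bytes_per_param / GIB

def kv_cache_bytes_per_token(layers: int, hidden_size: int, bytes_per_value: int = 2) -> int:
    """Each token stores one key and one value vector per layer."""
    return 2 * layers * hidden_size * bytes_per_value

def max_sessions(gpu_gib: float, params_billion: float, bytes_per_param: float,
                 layers: int, hidden_size: int, tokens_per_session: int) -> int:
    """How many full-length sessions fit in what is left after the weights."""
    free_bytes = gpu_gib * GIB - params_billion * 1e9 * bytes_per_param
    per_session = kv_cache_bytes_per_token(layers, hidden_size) * tokens_per_session
    return max(0, int(free_bytes // per_session))

# Assumed 7B model shape: 32 layers, hidden size 4096, 2 bytes per value.
print(f"weights: {weights_gib(7, 2):.1f} GiB")
print(f"KV cache per token: {kv_cache_bytes_per_token(32, 4096) / 2**20:.1f} MiB")
print("sessions on a 24 GiB card:", max_sessions(24, 7, 2, 32, 4096, 2048))
```

Under these assumptions each token of context costs about half a megabyte of video memory, so a single long session can consume a gigabyte on its own. That is the pressure behind the low request throughput.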

Scaling AI services is going to take a lot of innovation. That’s why projects like vLLM are so important early on in this new generative AI cycle.

vLLM’s PagedAttention algorithm allows hosted LLMs to use non-contiguous blocks of memory during inference sessions. This means that the same hardware can efficiently use all of its video memory and support many more concurrent users. The solution seems obvious but it’s critically important.
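The core idea can be shown with a toy allocator. This is only an illustration of block-based bookkeeping, not vLLM's implementation: the `PagedKVCache` class and its method names are made up for this sketch, and it tracks block ids rather than real GPU memory.

```python
class PagedKVCache:
    """Toy block table: each sequence maps to a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        # A new block is claimed only when the last one is full.
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def release(self, seq_id: str) -> None:
        # Finished sequences hand their blocks straight back to the pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
# Two sessions generating tokens in an interleaved fashion.
for _ in range(3):
    cache.append_token("a")
    cache.append_token("b")
print(cache.block_tables)  # each sequence owns non-adjacent blocks

cache.release("a")
cache.append_token("c")    # a new session reuses a block freed by "a"
print(cache.block_tables)
```

Because no sequence needs one contiguous region sized for its maximum length, memory is claimed a block at a time and freed blocks are reused immediately. That is where the extra concurrency comes from.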

Latency and throughput are two huge challenges for generative AI. Generation is too slow for most users and the extremely low throughput will send infrastructure costs skyrocketing. Hardware will improve over time, but more needle-moving leaps like vLLM will need to arrive in order to deploy scalable AI services.

Open Source LLMs

I am currently working on an AI project and have been paying very close attention to the developments in open source LLMs since March of this year when Meta’s LLM “Llama” was leaked. I was able to get the small 7B model to run locally on my PC on an RTX 3080 10GB video card. While the 7B model leaves a lot to be desired I had something that felt like ChatGPT running locally. Incredible.

Meta has since formally released a second version of the Llama model along with a code generating model and a lot more ceremony and documentation. It’s a smart strategy. Meta is looking to influence the majority of the LLM market through a bold open source strategy. Everyone is using Llama as a foundation to build and train their models. Meta stands to benefit not only from the outsourced development but also from its position as a resourceful leader in the space.

Things have cooled off a bit since the summer but it is still extremely hard to find instances that can run the largest Llama 2 70B model. Luckily, I was able to get access this week to a large enough instance on AWS SageMaker to do some evaluations. Across the cloud services it can cost between $5k and $15k per month for an instance large enough to run the Llama 2 70B model. Microsoft Azure currently has the most competitive value in terms of pricing, along with Google. Locking in reserved instances for 1 year can bring the cost down to $4k/month.

AWS and Google tend to have service quota limits set at zero for all GPU-related services. If you just set up a new cloud account for a startup, you will not have access to any GPU resources and will need to explicitly ask for service quota increases. This can be really frustrating and there is a chance your increase request will not be honored.

I’ll have a lot more to say about these open source LLMs in the near future as I attempt to use them in production.

Hello World

Take cartridge out. Slap the Nintendo several times. Blow into the cartridge. Place cartridge back into the Nintendo.

Power on.

Hello world.

I’d like to get back to narrating my work, shipping products and publishing more open source code.

I’m not on any social networks anymore so follow me and subscribe here.