Large language models are slow and require scarce, expensive resources. The inference process is bottlenecked by available GPU memory. The entire model needs to fit into video memory, with enough headroom left over to hold the state of live inference sessions of varying length. The result is an anemic throughput of 1-3 requests per second, even on fairly expensive hardware.
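To make that memory squeeze concrete, here is a rough back-of-the-envelope sketch. Every number in it is an assumption picked for illustration, not a measurement: a 40 GB GPU, a hypothetical 13-billion-parameter model stored in 16-bit precision with 40 layers, 40 attention heads and a head dimension of 128, and 2,048 tokens of space reserved up front for each session. It also ignores activations and framework overhead, so real headroom would be smaller.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of attention key-value state stored for one token."""
    # Factor of 2: each layer keeps one key vector and one value vector.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_concurrent_sequences(gpu_bytes, weight_bytes, tokens_reserved_per_seq, per_token_bytes):
    """How many sessions fit if each one reserves a fixed, contiguous token budget."""
    free_bytes = gpu_bytes - weight_bytes
    if free_bytes <= 0:
        return 0  # the model weights alone do not fit
    return free_bytes // (tokens_reserved_per_seq * per_token_bytes)


# Assumed, illustrative figures (not taken from any benchmark).
GPU_BYTES = 40 * 10**9        # 40 GB card
WEIGHT_BYTES = 13 * 10**9 * 2  # 13B parameters at 2 bytes each

per_token = kv_bytes_per_token(num_layers=40, num_kv_heads=40, head_dim=128)
sessions = max_concurrent_sequences(GPU_BYTES, WEIGHT_BYTES, 2048, per_token)

print(per_token)  # 819200 bytes, roughly 0.8 MB per token
print(sessions)   # 8
```

Under these assumptions the weights take 26 GB and every session reserves about 1.7 GB whether or not it ever uses it, which leaves room for only a handful of concurrent users.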
Scaling AI services is going to take a lot of innovation. That's why projects like vLLM are so important this early in the new generative AI cycle.
vLLM’s PagedAttention algorithm allows hosted LLMs to store their per-session attention state in non-contiguous blocks of memory during inference. Memory is handed out in small blocks as a session grows instead of being reserved up front in one large contiguous chunk, so the same hardware can use nearly all of its video memory and support many more concurrent users. The solution seems obvious, but it's critically important.
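The idea can be sketched with a toy block allocator. This is not vLLM's implementation; the class name `PagedKVCache`, its methods, and the tiny block counts are all hypothetical, chosen only to show how a per-session block table lets a session grow into whatever physical blocks happen to be free.

```python
class PagedKVCache:
    """Toy allocator: each session maps to a list of fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size                # tokens per block
        self.free_blocks = list(range(num_blocks))  # physical block ids not in use
        self.block_tables = {}                      # session id -> list of block ids
        self.seq_lens = {}                          # session id -> tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token, grabbing a new block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.block_size:  # every owned block is full
            if not self.free_blocks:
                raise MemoryError("no free blocks left")
            table.append(self.free_blocks.pop(0))   # any free block will do
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished session's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):             # two sessions generating tokens in lockstep
    cache.append_token("a")
    cache.append_token("b")

print(cache.block_tables)      # {'a': [0, 2], 'b': [1, 3]}
cache.free("a")
print(cache.free_blocks)       # [0, 2]
```

Each session ends up owning blocks that are not adjacent, and a finished session's blocks go straight back into the pool for the next user. At most one partially filled block per session is wasted, instead of a whole worst-case reservation.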
Latency and throughput are two huge challenges for generative AI. Generation is too slow for most users, and the extremely low throughput will send infrastructure costs skyrocketing. Hardware will improve over time, but more needle-moving leaps like vLLM will need to arrive before AI services can be deployed at scale.