vAttention

Dynamic Memory Management for Serving LLMs without PagedAttention

Efficient management of GPU memory is essential for high-throughput LLM inference. Prior systems reserved KV-cache memory for a request ahead of time, which wasted capacity due to internal fragmentation. Inspired by demand paging, vLLM proposed PagedAttention to enable dynamic memory allocation for the KV-cache. This approach eliminates fragmentation and improves serving throughput. However, to allocate physical memory dynamically, PagedAttention changes the layout of the KV-cache from contiguous to non-contiguous virtual memory. As a consequence, attention kernels must be rewritten to support paging, and the serving framework must implement a memory manager. This introduces both performance and programming overheads, and makes it harder to adopt state-of-the-art attention kernels.

We solve these problems by storing the KV-cache in contiguous virtual memory and leveraging OS support for on-demand allocation of physical memory. This retains the contiguous KV-cache layout that unmodified attention kernels expect, while still allocating physical memory only as it is needed.
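At a high level, the idea is to reserve a contiguous range of virtual addresses for a request's KV-cache up front and attach physical GPU memory to it only as the sequence grows. The sketch below illustrates this using the CUDA driver's virtual memory management API (cuMemAddressReserve, cuMemCreate, cuMemMap, cuMemSetAccess). It is a simplified illustration rather than the actual vAttention code: error handling is omitted, and names such as KVCacheRegion, reserve(), and grow() are made up for this example.

```cpp
// Sketch: contiguous virtual reservation + on-demand physical backing.
// Assumes cuInit() and a current CUDA context have already been set up.
#include <cuda.h>
#include <vector>
#include <cstddef>

struct KVCacheRegion {
    CUdeviceptr base = 0;       // start of the contiguous virtual range
    size_t reserved = 0;        // total virtual bytes reserved
    size_t mapped = 0;          // bytes currently backed by physical memory
    size_t granularity = 0;     // minimum mapping size for this device
    int device = 0;
    std::vector<CUmemGenericAllocationHandle> handles;

    // Reserve virtual address space only; no physical memory is used yet.
    void reserve(size_t max_bytes, int dev) {
        device = dev;
        CUmemAllocationProp prop = {};
        prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        prop.location.id = device;
        cuMemGetAllocationGranularity(&granularity, &prop,
                                      CU_MEM_ALLOC_GRANULARITY_MINIMUM);
        reserved = ((max_bytes + granularity - 1) / granularity) * granularity;
        cuMemAddressReserve(&base, reserved, 0 /*alignment*/, 0 /*addr*/, 0);
    }

    // Map one more physical page at the tail of the virtual range,
    // e.g. when the KV-cache outgrows its currently mapped capacity.
    void grow() {
        if (mapped >= reserved) return;  // virtual reservation exhausted
        CUmemAllocationProp prop = {};
        prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        prop.location.id = device;

        CUmemGenericAllocationHandle handle;
        cuMemCreate(&handle, granularity, &prop, 0);
        cuMemMap(base + mapped, granularity, 0, handle, 0);

        CUmemAccessDesc access = {};
        access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        access.location.id = device;
        access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
        cuMemSetAccess(base + mapped, granularity, &access, 1);

        handles.push_back(handle);
        mapped += granularity;
    }
};
```

In this sketch, a serving framework would call reserve() once per request with the maximum KV-cache size and call grow() whenever decoding runs out of mapped capacity; attention kernels simply see one contiguous buffer starting at base and need no paging logic.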


You can try it out here:

Check out the following paper to learn more:

  1. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
     Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, and Ashish Panwar
     arXiv preprint arXiv:2405.04437, 2024