RAG With Autoscaling: Better Performance With Lower Costs For pgvector
How Neon dynamically extends memory for HNSW index build
Neon’s autoscaling, now GA and available in all pricing plans, enables Postgres instances to dynamically scale up for the high memory and CPU demands of HNSW index builds, avoiding constant overprovisioning. With memory extension through disk swaps, Neon efficiently handles large index builds even with lower resource limits, ensuring optimal performance. Start using Neon autoscaling by creating a free account.
Querying large datasets is slow at scale because scanning an entire table for every request is inefficient. Take, for example, running a vector similarity search on a 1 million-row dataset on an 8-CPU Neon instance. The query takes approximately 3 seconds to complete.
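The sketch below shows the shape of such a query; the documents table, embedding column, and :query_embedding parameter are placeholder names rather than the exact schema used in this test.

```sql
-- Hypothetical schema used for illustration:
-- documents(id bigint, content text, embedding vector(1536)).
-- :query_embedding stands in for the embedding of the user's question.
SELECT id, content
FROM documents
ORDER BY embedding <=> :query_embedding   -- cosine distance, closest vectors first
LIMIT 5;
```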
Vector similarity search is essential for Retrieval-Augmented Generation (RAG) apps and allows large language models (LLMs) to generate better-quality output. However, a 3-second query execution time can negatively impact user experience.
This article explores the challenges of scaling vector similarity searches and how to optimize for them using pgvector and Neon’s autoscaling capabilities.
If we EXPLAIN ANALYZE the above query, we observe that it performs a sequential scan (Seq Scan), meaning that the database calculates the distance between the query vector and every vector in the table before returning the closest matches. It’s like asking a person to read the entire dictionary whenever they’re looking for a specific word.
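For reference, this is roughly what that looks like, using the same placeholder schema as above; the plan in the comments is abbreviated and illustrative, not the exact output of the test.

```sql
EXPLAIN ANALYZE
SELECT id, content
FROM documents
ORDER BY embedding <=> :query_embedding
LIMIT 5;

-- Abbreviated, illustrative plan showing the sequential scan:
--  Limit
--    ->  Sort  (Sort Key: ((embedding <=> $1)))
--          ->  Seq Scan on documents  (rows=1000000)
```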
Speeding up queries with efficient HNSW index builds
Luckily for us, Postgres and pgvector come with indexes such as HNSW, which, in this example, speed up vector similarity search queries by a factor of 3,000.
To speed up vector similarity search queries, the HNSW index builds a multi-layer graph by calculating the distances among all vectors. However, the index build operation requires a large amount of memory and multiple CPUs for parallelization. In the example below, we used 8 (7+1) workers and 8GB of maintenance_work_mem to fit and build the index. Note that pgvector can still build the index with insufficient maintenance_work_mem, but it will take significantly longer to complete.
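A minimal sketch of such a build, assuming the same placeholder documents.embedding column and cosine distance as above, with pgvector’s default HNSW parameters:

```sql
-- Give the build enough memory to hold the graph and enable parallel workers.
SET maintenance_work_mem = '8GB';
SET max_parallel_maintenance_workers = 7;  -- 7 workers + 1 leader = 8 processes

-- Build the HNSW index with cosine distance, matching the <=> queries above.
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops);
```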
Using the query below, we confirm that the index size is approximately 8GB.
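One way to check, assuming the auto-generated index name from the sketch above:

```sql
-- Check the on-disk size of the HNSW index (the name documents_embedding_idx
-- assumes the default name generated by the CREATE INDEX above).
SELECT pg_size_pretty(pg_relation_size('documents_embedding_idx'));
```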
Index build operations are essential, but they are only run to initialize the index and to maintain it in the case of heavy updates. HNSW’s dependence on CPU and memory resources for these infrequent index build operations makes scaling vector similarity search and RAG quite expensive. The ideal scenario is to scale up your Postgres instance to build the index, then scale it back down to serve your customers cost-efficiently.
This is why we added support for index build operations to Neon Autoscaling.
How Neon’s autoscaling optimizes HNSW index builds
Neon Autoscaling dynamically resizes your Postgres instance based on load, allowing you to consistently meet your application’s demands while saving on cost. In the previous example, we would have to keep a minimum of 8GB of memory and 2 CPUs allocated at all times. With Neon, all we need to do is specify the minimum and maximum resources the application needs and let autoscaling handle the resizing. We can allocate as little as ¼ of a shared CPU and up to a maximum of 8 CPUs.
Using the index build query below, we inform Postgres that we want to use 8 workers and allocate 8GB of maintenance_work_mem. The autoscaler-agent then detects the increased CPU load and allocates additional resources.
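Again assuming the placeholder documents table, the build itself is unchanged; the difference is that it now runs on an autoscaling compute:

```sql
-- Same hypothetical build as before, now on an autoscaling compute (¼ CU min, 8 CU max).
-- maintenance_work_mem is requested upfront; the autoscaler reacts to the CPU load.
SET maintenance_work_mem = '8GB';
SET max_parallel_maintenance_workers = 7;

CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops);
```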
However, on Neon, CPU and memory scale together and are measured in compute units (CUs): 1 CU = 1 vCPU and 4GB of memory, 2 CUs = 2 vCPUs and 8GB of memory, and so on.
The challenge with scaling index builds such as pgvector’s is that maintenance_work_mem is allocated upfront. So how do we fit 8GB of memory into 1GB? To overcome this, Neon uses disk swaps, a technique that extends the apparent amount of physical memory by using disk storage. This means that when we run SET maintenance_work_mem = '8GB' with ¼ CU, we let Postgres “believe” it has access to 8GB of memory by allocating disk swap, even though its current limit is 1GB. As the index build progresses, we save the artifacts to disk at every CU increment and swap them back into memory.
Conclusion
Neon’s autoscaling feature is beneficial for handling the intensive resource demands of index build operations. It allows for dynamic resizing of Postgres instances based on load, optimizing both performance and cost. The ability to scale up for resource-heavy tasks, like creating an HNSW index, and scale down during normal operations ensures that the system remains cost-efficient without compromising on performance.
Autoscaling is available for all Neon databases, including the Free plan. Create a Neon account today if you haven’t already. Join us on Discord, follow us on X, and let us know what you think.