Though it's worth noting that the license is AGPL. So if the idea is for this to take over for pgvecto.rs, it's an important data point for those building SaaS products.
It will make pgvector the only permissively licensed option, given it has the same license as Postgres.
Could you talk about how updates are handled? My understanding is that IVF can struggle if you're doing a lot of inserts/updates after index creation, as the data needs to be incrementally re-clustered (or the entire index needs to be rebuilt) in order to ensure the clusters continue to reflect the shape of your data?
We don’t perform any reclustering. As you said, users would need to rebuild the index if they want to recluster. However, based on our observations, the speed remains acceptable even with significant data growth. We ran a simple experiment using nlist=1 on the GIST dataset: top-10 retrieval took less than twice as long as with nlist=4096. This is because only the quantized vectors (with 32x compression) need to be inserted into the posting list, and only the quantized vector distances require additional computation. The quantized vector computation accounts for only a small share of the time; most of it is spent on re-ranking using full-precision vectors. Let's say the breakdown is approximately 20% for quantized vector computations and 80% for full-precision vector computations. Then even if the time for quantized vector computations triples, the overall query time would increase by only about 40%.
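To make the arithmetic in that estimate explicit, here is a minimal sketch. The 20%/80% split is the rough breakdown quoted above, not a measurement, and the function name is just for illustration.

```python
def relative_query_time(quantized_share: float, slowdown: float) -> float:
    """Query time relative to the original (1.0 means unchanged).

    quantized_share: fraction of the original query time spent on
                     quantized-vector distance computations.
    slowdown:        factor by which that part gets slower.
    The re-ranking share (1 - quantized_share) is assumed to stay constant.
    """
    rerank_share = 1.0 - quantized_share
    return quantized_share * slowdown + rerank_share


# Rough breakdown from the comment: 20% quantized, 80% full-precision re-ranking.
ratio = relative_query_time(quantized_share=0.20, slowdown=3.0)
print(f"query time grows by about {(ratio - 1) * 100:.0f}%")
```

The same function shows why the effect stays bounded: the quantized scan would have to get five times slower before the total query time doubles.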
If the data distribution shifts, the optimal solution would be to rebuild the index. We believe that HNSW also experiences challenges with data distribution to some extent. However, without rebuilding, our observations suggest that users are more likely to experience slightly longer query times rather than a significant loss in recall.
I'm the project lead for VectorChord. I have tested ScaNN on AlloyDB Omni but have struggled to achieve reasonable recall on the GIST 1M dataset, with results peaking at only around 0.8. The limited documentation makes it challenging to understand the underlying causes of this performance.
Additionally, I couldn’t find any performance benchmarks for ScaNN integrated with PostgreSQL, particularly in comparison to pgvector or to its standalone version. The publicly available metrics focus exclusively on query-only indexing outside of the database.
On our side, we’ve implemented the fastscan kernel for bit vector scanning, which is considered one of ScaNN’s key advantages.
The “external index build” idea seems pretty interesting. How does it work with updates to the underlying data (e.g., new embeddings being added)? For that matter, I guess, how do incremental updates to pgvector’s HNSW indexes work?
IVF indexing can be divided into two phases: computing the centroids (KMeans), and assigning each point to its nearest centroid to build the inverted lists. The most time-consuming part is the KMeans stage, which can be greatly accelerated with a GPU: 1M 960-dim vectors can be clustered in less than 10s.
We do the KMeans phase externally, and the assignment phase inside Postgres. The KMeans part depends only on the data distribution, not on any specific data point. So we can cluster a sample of the data, and inserting or deleting data won't affect the KMeans result significantly.
For updates, it's just a matter of assigning the new vector to a specific cluster and appending it to the corresponding list. It's very light compared to inserting into HNSW.
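For readers who want to see the shape of this, here is a toy pure-Python sketch of the two phases plus the insert path. It is not VectorChord's code: `kmeans`, `IVFIndex`, and the tiny 2-D dataset are made up for illustration, and a real system would run KMeans on a sample (possibly on a GPU) and store quantized vectors in the lists.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, v):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i], v))


def kmeans(sample, k, iters=10, seed=0):
    """Phase 1: plain Lloyd's algorithm on a sample of the data."""
    rng = random.Random(seed)
    centroids = rng.sample(sample, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in sample:
            groups[nearest(centroids, v)].append(v)
        # Recompute each centroid as the mean of its group; keep the old
        # centroid if a group ended up empty.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


class IVFIndex:
    def __init__(self, centroids):
        self.centroids = centroids
        self.lists = [[] for _ in centroids]  # one inverted list per centroid

    def insert(self, vec_id, v):
        """Phase 2 and later updates: append to the nearest centroid's list."""
        self.lists[nearest(self.centroids, v)].append((vec_id, v))

    def search(self, q, nprobe=1, k=1):
        """Scan the nprobe closest lists and return the k nearest entries."""
        order = sorted(range(len(self.centroids)),
                       key=lambda i: sq_dist(self.centroids[i], q))
        candidates = [item for i in order[:nprobe] for item in self.lists[i]]
        return sorted(candidates, key=lambda item: sq_dist(item[1], q))[:k]


data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
index = IVFIndex(kmeans(data, k=2))
for i, v in enumerate(data):
    index.insert(i, v)

# A later insert touches one list only; the centroids are not recomputed.
index.insert(len(data), [10.5, 10.5])
hits = index.search([10.4, 10.4], nprobe=1, k=1)
```

The point of the sketch is the `insert` method: it is one nearest-centroid lookup and one append, with no graph to repair.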
The cost to store a static set of 400k 768-dimension vectors is also $1 a month on Datastax's AstraDB. However, for that $1, AstraDB replicates the data 3x instead of storing it on a single machine.
It's hard to compare costs against the serverless pricing model, as writes and reads cost extra. According to the pricing page, Datastax charges $4000 to write 100M 768-dim vectors, and 10M queries cost $300, which is only about 4 QPS. In comparison, VectorChord can achieve 100 QPS on a $250 instance.
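A quick back-of-the-envelope check of those numbers, as a sketch. It assumes the 10M queries are spread evenly over a 30-day month, that the $250 is a monthly price, and that the instance runs at its full 100 QPS the whole time, none of which is stated above.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


def average_qps(queries_per_month: float) -> float:
    """Average queries per second if the load is spread evenly."""
    return queries_per_month / SECONDS_PER_MONTH


def monthly_queries(qps: float) -> float:
    """Queries served in a month at a sustained rate."""
    return qps * SECONDS_PER_MONTH


serverless_qps = average_qps(10_000_000)    # the "about 4 QPS" figure
instance_capacity = monthly_queries(100)    # queries/month at 100 QPS

# Cost per million queries under the assumptions above.
serverless_cost_per_million = 300 / 10
instance_cost_per_million = 250 / (instance_capacity / 1_000_000)

print(f"serverless: {serverless_qps:.2f} QPS, ${serverless_cost_per_million:.2f} per 1M queries")
print(f"instance:   ${instance_cost_per_million:.2f} per 1M queries at full load")
```

The instance figure is a best case: at lower utilization the per-query cost rises proportionally, while the serverless price stays flat.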
I am still waiting for a good pattern for using multivector embeddings like ColBert and ColPali in Postgres. I get that it's fun to optimize single-vector stuff, but multivector is that happy middle ground between single vector and reranker that seems to be validated only in specialized, exotic search DBs like Vespa.
1. Uses half-vecs, so you cut down everything by half with no recall loss
2. Uses token pooling with hierarchical clustering at 3, so you further cut things down by 2/3 with <1% loss
3. Everything is on Postgres and pgvector, so you can do all the Postgres stuff and decrease corpus size by document metadata filtering
4. We have a 5000+ pages corpus in production with <3 seconds latency.
5. We benchmark against the Vidore leaderboard, and are very near SOTA
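To put points 1 and 2 together, here is a rough storage estimate. The 1,000 vectors per page and 128 dimensions are hypothetical numbers picked for illustration, not figures from the list above; only the "half precision" and "pool factor 3" parts come from it.

```python
def bytes_per_page(vectors_per_page: int, dims: int,
                   bytes_per_value: int, pool_factor: int = 1) -> int:
    """Storage for one page's multi-vector embedding.

    pool_factor=3 means roughly every 3 token vectors are merged into 1.
    """
    pooled_vectors = -(-vectors_per_page // pool_factor)  # ceiling division
    return pooled_vectors * dims * bytes_per_value


# Hypothetical page: 1,000 token vectors of 128 dimensions each.
baseline = bytes_per_page(1000, 128, bytes_per_value=4)                  # float32
optimized = bytes_per_page(1000, 128, bytes_per_value=2, pool_factor=3)  # half + pooling

print(f"baseline:  {baseline} bytes per page")
print(f"optimized: {optimized} bytes per page ({optimized / baseline:.0%} of baseline)")
```

Under these assumptions the two steps compound to roughly one sixth of the original footprint, whatever the actual per-page vector count is.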
I really like the idea of ColPali and products building on it, but I am still unsure about the applications for which it makes the most sense. We mostly deal with reports that are 80-90% text and 10-20% figures and tables. Does a vision-first approach make sense in this context? My sense is that text-based embeddings are better in mostly-text contexts. Layout, for example, is pretty much irrelevant but plays into vision-based approaches. What is your sense about this?
So - the synthetic QA datasets in Vidore are exactly like that: 90% text, 10% charts/tables. OCR + BM25 is at ~90% NDCG@5, which is pretty decent. ColPali/ours is at ~98%.
It is a small upgrade, but one nonetheless. The complexity and cost of multi-vectors *might* not make this worth it; it really depends on how accuracy-critical the task is.
For example, one of our customers does this over FDA monographs, which are 95%+ text and 5% tables. The misses in text-based pipelines were extremely painful, even though there weren't that many. So the migration made sense to them.
There's no easy way to index ColBert multi-vectors in a scalable way that I know of. Vespa seems to rely heavily on binary quantization, which can cost a lot in recall loss. And for most cases, using ColBert as a reranker is good enough, as in the pgvector example you posted.
Seems like doing a proper relational 1:N chunk:multiple-vectors foreign key, binarization, and a clever join or multistage CTE would get us pretty close to useful.
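Here is a toy Python sketch of what that multistage idea could look like, written outside the database so the logic is easy to follow. Everything in it is an assumption for illustration: `rows` stands in for a 1:N table of (chunk_id, token vector), stage 1 plays the role of a cheap Hamming-distance scan over binarized vectors, and stage 2 reranks the surviving chunks with full-precision MaxSim. It says nothing about how well this scales.

```python
def binarize(v):
    """Pack the sign bits of a float vector into a single int."""
    bits = 0
    for x in v:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs, doc_vecs):
    """ColBERT-style late interaction: sum over query tokens of the best match."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def search(rows, query_vecs, n_candidates=2, k=1):
    # In a real setup the binary codes would be stored and indexed, not recomputed.
    codes = [(chunk_id, binarize(v)) for chunk_id, v in rows]

    # Stage 1: for each query token, keep the chunks owning the closest codes.
    candidates = set()
    for q in query_vecs:
        qc = binarize(q)
        closest = sorted(codes, key=lambda r: hamming(qc, r[1]))[:n_candidates]
        candidates.update(chunk_id for chunk_id, _ in closest)

    # Stage 2: gather all vectors of each candidate chunk (the "join")
    # and rerank with full-precision MaxSim.
    by_chunk = {}
    for chunk_id, v in rows:
        if chunk_id in candidates:
            by_chunk.setdefault(chunk_id, []).append(v)
    scored = [(maxsim(query_vecs, vecs), chunk_id) for chunk_id, vecs in by_chunk.items()]
    return sorted(scored, reverse=True)[:k]


rows = [
    ("a", [1.0, 0.5, -0.2, -1.0]),
    ("a", [0.9, -0.1, 0.3, -0.8]),
    ("b", [-1.0, -0.5, 0.2, 1.0]),
    ("b", [-0.7, 0.4, -0.6, 0.9]),
]
query = [[1.0, 0.4, -0.1, -0.9]]
result = search(rows, query)
```

The recall risk mentioned above lives entirely in stage 1: a chunk that never makes it into `candidates` cannot be rescued by the reranker, so `n_candidates` is the knob that trades speed for recall.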
I am ok with it being less efficient as the dev ux will be amazing. Vespa ops (even in their cloud) are a complete nightmare compared to postgres
Would you be willing to speculate on how VectorChord's ingestion and query performance might compare to Elasticsearch/OpenSearch for dense vector and sparse vector search use cases, particularly when dealing with larger full text data sets (>5M records)?
In the LAION-5M benchmark, we’ve compared our performance against ElasticSearch and OpenSearch. However, comparing ingestion performance is more challenging due to differences in architecture. Both ElasticSearch and OpenSearch, like most vector databases, use the concept of shards. Each shard represents a separate vector index, and queries aggregate results across these shards. Larger shards lead to faster queries but come with higher resource requirements and slower update speeds.
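As a generic illustration of the shard model described above (not ElasticSearch's or OpenSearch's actual implementation), each shard answers the query from its own index and a coordinator merges the partial results. The brute-force shard search and the function names here are placeholders.

```python
import heapq


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_shard(shard, query, k):
    """Local top-k for one shard; brute force stands in for the shard's index."""
    return heapq.nsmallest(k, ((sq_dist(v, query), vec_id) for vec_id, v in shard))


def search_all(shards, query, k):
    """Coordinator: query every shard, then merge the partial top-k lists."""
    partials = [search_shard(shard, query, k) for shard in shards]
    return heapq.nsmallest(k, (hit for part in partials for hit in part))


shards = [
    [(0, [0.0]), (1, [5.0])],   # shard 1
    [(2, [1.0]), (3, [9.0])],   # shard 2
]
top = search_all(shards, query=[0.9], k=2)
top_ids = [vec_id for _, vec_id in top]
```

Every query pays for one index lookup per shard plus the merge, which is why fewer, larger shards tend to query faster while being more expensive to build and update.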
It’s also worth noting that ElasticSearch has implemented RaBitQ support for HNSW. So it's difficult to compare without running actual benchmarks. However, ElasticSearch typically requires at least double, if not triple, the memory size of the vector dataset to maintain system stability. In contrast, PostgreSQL can achieve a stable system with far fewer resources—for example, 32GB of memory is sufficient to manage 100 million vectors efficiently.
From my perspective, queries would be faster than in ElasticSearch thanks to the extensive optimizations, and updates (insert and delete) would be much, much faster because we use IVF instead of HNSW.
I'm the project lead for both projects. We're still in the process of supporting all the features from pgvecto.rs in VectorChord (int8, vectors with more than 2000 dimensions, etc.). We'll provide migration docs for pgvecto.rs users moving to VectorChord. Users will have a better experience with VectorChord thanks to its better integration with the Postgres storage system.
We will stop supporting pgvecto.rs early next year when everything on VectorChord is ready.
In five pages of text, we never get to learn what a Vector is (in this context), why we’d want to store one in pgsql, or why it costs so much to store them compared to anything else you’d store there.
For an example of how you can communicate with domain experts, while still giving everyone else some form of clue as to what the hell you’re talking about, check out the link to the product that this thing claims to be a successor to:
That's because this product isn't for you then. My team has been evaluating vector databases for years and everything on the VectorChord page resonated with me. We run one of the world's largest vector databases and we'll likely benchmark vectorchord to see if it lives up to its promises here.
Hi, we’re here to help if you need assistance (via GitHub issue, Discord, or email). Could you let us know the scale of your vectors—are they 1B or 10B?
Wow. It never occurred to me that this might be anything but the landing page of a product.
The title here, the presentation on the page itself. Everything screams "landing page". I had to go back on a desktop browser to see the word "blog" in the url bar, and mentally shift those graphics and little islands of text around until I can view it from that lens. If it's really just a sub-product of the main product that they're talking about, then yeah, it makes more sense in that context.
But my answer to your question would still be "Yes". Absolutely. If you're a product, the job of your blog is to convince people coming off the street that they need your thing, even if they didn't realize it yet.
Step one of that process is to not bounce them back to the street without any idea what they're looking at.