VectorChord: Store 400k Vectors for $1 in PostgreSQL

dmezzetti · on Dec 5, 2024

This looks like an interesting project.

Though it's worth noting that the license is AGPL. So if the idea is for this to take over for pgvecto.rs, it's an important data point for those building SaaS products.

It will make pgvector the only permissively licensed option, given it has the same license as Postgres.

gaocegege · on Dec 11, 2024

Hi, now we adopted a dual license model (AGPLv3 and ELv2).

whakim · on Dec 5, 2024

Could you talk about how updates are handled? My understanding is that IVF can struggle if you're doing a lot of inserts/updates after index creation, as the data needs to be incrementally re-clustered (or the entire index needs to be rebuilt) in order to ensure the clusters continue to reflect the shape of your data?

VoVAllen · on Dec 5, 2024

We don’t perform any reclustering. As you said, users would need to rebuild the index if they want to recluster. However, based on our observations, the speed remains acceptable even with significant data growth. We did a simple experiment using nlist=1 on the GIST dataset, the top-10 retrieval results took less than twice the time compared to using nlist=4096. This is because only the quantized vectors (with a 32x compression) need to be inserted into the posting list, and only quantized vector distances need more computations. And the quantized vector computation only accounts for a small amount of time. Most of the time is spent on re-ranking using full-precision vectors. Let's say the breakdown is approximately 20% for quantized vector computations and 80% for full-precision vector computations. So even if the time for quantized vector computations triples, the overall increase in query time would be only about 40%.

If the data distribution shifts, the optimal solution would be to rebuild the index. We believe that HNSW also experiences challenges with data distribution to some extent. However, without rebuilding, our observations suggest that users are more likely to experience slightly longer query times rather than a significant loss in recall.

marcyb5st · on Dec 5, 2024

Awesome work! But aren't the comparisons missing ScaNN [1, 2]? I think it's the overall SOTA [3] at the moment regarding vector indexing.

[1] https://github.com/google-research/google-research/tree/mast...

[2] Also available on something like AlloyDB on GCP: https://cloud.google.com/alloydb/docs/ai/store-index-query-v...

[3] https://ann-benchmarks.com/glove-100-angular_10_angular.html

Disclaimer: Working for Google, but nowhere close to Databases.

VoVAllen · on Dec 5, 2024

I'm the project lead for VectorChord. I have tested ScaNN on AlloyDB Omni but have struggled to achieve reasonable recall on the GIST 1M dataset, with results peaking at only around 0.8. The limited documentation makes it challenging to understand the underlying causes of this performance.

Additionally, I couldn’t find any performance benchmarks for ScaNN integrated with PostgreSQL, particularly in comparison to pgvector or its standalone. The publicly available metrics focus exclusively on query-only indexing outside of the database.

On our side, we’ve implemented the fastscan kernel for bit vector scanning, which is considered as one of ScaNN’s key advantages.

marcyb5st · on Dec 5, 2024

Thanks for the explanation!

Really appreciate it and it makes perfect sense.

7qW24A · on Dec 5, 2024

The “external index build” idea seems pretty interesting. How does it work with updates to the underlying data (e.g., new embeddings being added)? For that matter, I guess, how do incremental updates to pgvector’s HNSW indexes work?

VoVAllen · on Dec 5, 2024

The IVF indexing can be considered into two phases, computing the centroids (KMeans), and assigning each point to the centroids as the inverted lists. The most time-consuming part is at the KMeans stage, and can be greatly accelerated with GPU. 1M 960dim vec can be clustered in less than 10s. We did the KMeans phase externally, and the assignment phase inside postgres. The KMeans part depends only on the data distribution, not on any specific data. So we can do sampling on the data, and inserting/deleting the data won't affect the KMeans result significantly. For the update, it's just a matter of assigning the new vector to a specific cluster and appending it to the corresponding list. It's very light compared to inserting in hnsw

estebarb · on Dec 5, 2024

It would have been nice a comparison with pgvectorscale, which uses binary quantization and StreamingDiskANN.

gaocegege · on Dec 6, 2024

We will look into this. We've tried it before, but it took too long to build the index.

_mmarshall · on Dec 5, 2024

The cost to store a static set of 400k 768-dimension vectors is also $1 a month on Datastax's AstraDB. However, for that $1, AstraDB replicates the data 3x instead of storing it on a single machine.

Here is a link to the cost calculator. Note that the calculator includes cost of ingestion, but the article only mentions storage costs, not ingestion costs: https://www.datastax.com/pricing/vector-search?cloudProvider...

Disclaimer: I work on vectorsearch/AstraDB at DataStax.

VoVAllen · on Dec 5, 2024

It's hard to compare the cost with the Serverless pricing model, as write and read have extra costs. On the pricing page, datastax costs $4000 to write 100M 768-dim vectors. And 10M query will cost $300, which is only 4 QPS. As comparison, VectorChord can achieve 100 QPS on $250 instance.

xieydd · on Dec 5, 2024

The pricing estimator is fake, the bill is real : ).

tarasglek · on Dec 5, 2024

I am still waiting for a good pattern for using multivector embeddings like ColBert and ColPali in postgres. I get that its fun to optimize single vector stuff, but multivector is that happy middleground between single vector and reranker that seems to be only validated in specialized exotic search dbs like Vespa

jonathan-adly · on Dec 5, 2024

I would like to throw our project in the ring that solves this problem: https://github.com/tjmlabs/ColiVara

1. Uses half-vecs, so you cut down everything by half with no recall loss 2. Uses token pooling with hierarchial clustering at 3, so, you further cut down things by 2/3rd with <1% loss 3. Everything is on Postgres and pgvector, so you can do all the Postgres stuff and decrease corpus size by document metadata filtering 4. We have a 5000+ pages corpus in production with <3 seconds latency. 5. We benchmark against the Vidore leaderboard, and very near SOTA

You can read about half-vecs here: https://jkatz05.com/post/postgres/pgvector-scalar-binary-qua...

Hierarchical token pooling: https://www.answer.ai/posts/colbert-pooling.html

And how we implemented them here: https://blog.colivara.com/

__jl__ · on Dec 5, 2024

I really like the idea of ColPali and products building on it but I am still unsure about the applications for which it makes most sense. We mostly deal with reports that are 80-90% text, 10-20% figures and tables. Does a vision first approach makes sense in this context? My sense is that text-based embeddings are better in mostly text contexts. Layout, for example, is pretty much irrelevant but plays into vision-based approaches. What is your sense about this?

jonathan-adly · on Dec 5, 2024

So - the synthetic QAs datasets in the Vidore datasets are exactly like that 90% text, 10% charts/tables. OCR + BM25 is at ~90% NCDG@5 which is pretty decent. ColPali/Ours is at ~98%.

It is a small upgrade, but one nonetheless. The complexity, and the cost of multi-vectors *might* not make this worth it, really depends on how accuracy-critical the task is.

For example, one of our customers who does this over FDA monographs, which is like 95%+ text, and 5% tables - they misses were extremely painful - even though there weren't that many in text-based pipelines. So, the migrations made sense to them.

tarasglek · on Dec 6, 2024

Wow, I didn't expect an answer with a product and source. This looks really good, gonna give you guys a try.

VoVAllen · on Dec 5, 2024

There's no easy way to index ColBert multi-vectors in a scalable way that I know of. Vespa seems to rely heavily on binary quantization, which can cost a lot in recall loss. And for most cases, using ColBert as a reranker is good enough, as the pgvector example you posted.

tarasglek · on Dec 5, 2024

Seems like like doing a proper relational 1:N chunk:multiple-vectors foreign key, binarization and a clever join or multistage CTE would get us pretty close to useful.

I am ok with it being less efficient as the dev ux will be amazing. Vespa ops (even in their cloud) are a complete nightmare compared to postgres

tarasglek · on Dec 5, 2024

Looks like that there is recent work to make a pgvector example for this https://github.com/pgvector/pgvector-python/blob/master/exam...

gaocegege · on Dec 5, 2024

Hey everyone! We’ve developed a new PostgreSQL extension that supports 400k vectors for just $1. Check it out!

rkuzsma · on Dec 5, 2024

Would you be willing to speculate on how VectorChord's ingestion and query performance might compare to Elasticsearch/OpenSearch for dense vector and sparse vector search use cases, particularly when dealing with larger full text data sets (>5M records)?

VoVAllen · on Dec 5, 2024

In the LAION-5M benchmark, we’ve compared our performance against ElasticSearch and OpenSearch. However, comparing ingestion performance is more challenging due to differences in architecture. Both ElasticSearch and OpenSearch, like most vector databases, use the concept of shards. Each shard represents a separate vector index, and queries aggregate results across these shards. Larger shards lead to faster queries but come with higher resource requirements and slower update speeds.

It’s also worth noting that ElasticSearch has implemented RaBitQ support for HNSW. So it's difficult to compare without running actual benchmarks. However, ElasticSearch typically requires at least double, if not triple, the memory size of the vector dataset to maintain system stability. In contrast, PostgreSQL can achieve a stable system with far fewer resources—for example, 32GB of memory is sufficient to manage 100 million vectors efficiently.

From my perspective, it would be faster in query comparing to ElasticSearch due to the extensive optimizations. And much much faster with the updates (insert and delete) due to using IVF instead of HNSW.

curl-up · on Dec 5, 2024

Does this mean you won't support pgvecto.rs anymore?

VoVAllen · on Dec 5, 2024

I'm the project lead for both projects. We're still in the process of supporting all the function from pgvecto.rs in VectorChord (int8, more than 2000 dim vec, etc.). We'll provide the migration docs for pgvecto.rs users to VectorChord. User will have better experience with VectorChord due to better integration with postgres storage system. We will stop supporting pgvecto.rs early next year when everything on VectorChord is ready.

nextworddev · on Dec 5, 2024

What dimension vectors are we talking here

jesperwe · on Dec 5, 2024

First paragraph of tfa mentions "768-dimensional vectors"

jasonkester · on Dec 5, 2024

In five pages of text, we never get to learn what a Vector is (in this context), why we’d want to store one in pgsql, or why it costs so much to store them compared to anything else you’d store there.

For an example of how you can communicate with domain experts, while still giving everyone else some form of clue as to what this hell you’re talking about, check out the link to the product that this thing claims to be a successor to:

https://pgvecto.rs/

That starts off by telling us what it is and what it does.

lja · on Dec 5, 2024

That's because this product isn't for you then. My team has been evaluating vector databases for years and everything on the VectorChord page resonated with me. We run one of the world's largest vector databases and we'll likely benchmark vectorchord to see if it lives up to its promises here.

gaocegege · on Dec 5, 2024

Hi, we’re here to help if you need assistance (via GitHub issue, Discord, or email). Could you let us know the scale of your vectors—are they 1B or 10B?

lja · on Dec 5, 2024

10B, feel free to email me luke@ the domain in my profile.

thelittleone · on Dec 5, 2024

Do you have a blog per chance? Or any recommended reading on pre processing / data chunking strategies to improve results?

VoVAllen · on Dec 5, 2024

I recently came across a project that looks promising: [WordLlama](https://github.com/dleemiller/WordLlama?tab=readme-ov-file#s...). It appears to be well-suited for semantic chunking, though I haven’t had a chance to try it out yet.

redskyluan · on Dec 5, 2024

Maybe check this https://zilliz.com/pricing

Yan easily store 1B data into Zilliz Serverless and the cost is incredible cheap

xieydd · on Dec 5, 2024

You cannot compare Serverless with dedicated instance, this is unfair. A large number of writes and queries will eat up your wallet.

redskyluan · on Dec 6, 2024

That's true. Now we can offer the new dedicated Cost-optimized CU，pricing is

$0.318/hour ($228.96/month)

Which means you can store 20M 768 dimension data in 228$ per month

lja · on Dec 5, 2024

Zilliz gave us a near 6-figure a month quote for our database.

Raed667 · on Dec 5, 2024

And every hosting provider doesn't start by teaching you what is HTML

curl-up · on Dec 5, 2024

So they should start every one of their posts, on that same site, with a summary of what is available on the homepage?

jasonkester · on Dec 5, 2024

Wow. It never occurred to me that this might be anything but the landing page of a product.

The title here, the presentation on the page itself. Everything screams "landing page". I had to go back on a desktop browser to see the word "blog" in the url bar, and mentally shift those graphics and little islands of text around until I can view it from that lens. If it's really just a sub-product of the main product that they're talking about, then yeah, it makes more sense in that context.

But my answer to your question would still be "Yes". Absolutely. If you're a product, the job of your blog is to convince people coming off the street that they need your thing, even if they didn't realize it yet.

Step one of that process is to not bounce them back to the street without any idea what they're looking at.