I don't think I've ever seen this mentioned anywhere, but if
you need a unique ID for an entity with not a lot of records
planned (≤10,000,000), why not use a random int64 with
a simple for loop on the application side to catch
the occasional collisions? Are there any downsides besides
making the application side a tiny bit more complex?
That’s the UUID approach, but worse. According to the birthday problem[1], you’re 50% likely to get a collision in 65 bit numbers after about 5 billion insertions. That’s not an awful lot. Replace that with a 128-bit UUID and you’d have to insert 22,000,000,000,000,000,000 rows to get a 50% chance. That’s probably less likely than a cosmic ray flipping a random bit in RAM and corrupting the index that way.
The post qualified as <= 10,000,000 total records. For that number of records, there's about a chance of about 0.00001 that you get a collision, assuming good randomness.
I'm just answering the poster's question directly; but in the general case, I agree with you. The cognitive overhead of dealing with the various "what ifs" usually aren't worth the couple bytes or cycles that you could save.
Getting a collision with this approach doesn’t matter — the whole point is to loop if you do get a collision. The only issue is getting a long string of sequential collisions, which is highly unlikely.
8 extra bytes per row and per foreign key reference relative to an int64 can add up quickly especially if the row is small. I agree it’s not typically the right trade off but it’s not as absolute as you claim.
for what it's worth, YouTube still uses 11 character base64 strings for their video ids, which are assumed to be 64-bit ints. They also allow unlisted videos, which people usually take to mean "semi-private".
It's an interesting tradeoff. The UX of the smaller YouTube video id links is probably of some benefit to them. Plus they have private videos for when you really don't want your video to be viewed, with unlisted being the middle ground of easy sharing but also keeping it exclusive.
Sure, and it makes sense there: write a service that returns unique 64 bit ints and encapsulate the complexity inside that one location. That’s easier than making every `insert` in your app code have to do a `while not unique` loop.
> if you need a unique ID for an entity with not a lot of records planned (≤10,000,000), why not use a random int64 with a simple for loop on the application side to catch the occasional collisions?
What’s the use case for this where UUIDv4 or sequential ID isn’t better? Because it sounds like a solution in search of a problem.
> Are there any downsides besides making the application side a tiny bit more complex?
I wouldn't say it's a particularly great upside, but I've found tooling doesn't tend to work as well with UUID keys as it does integer keys. E.g., Hasura won't map UUID keys to the GraphQL ID type [1], which makes working with them unnecessarily difficult. Arguably, the issue should be fixed with the tool (or a different tool chosen), but there's only so many battles to pick and sometimes that decision is out of your control.
UUIDv4 has performance implications. But I agree, if you are already coupled to the DB (due to the check loop), generally you might as well use sequential.
Database sequences comply with the ACID properties of the transactional processing. Generating your own IDs and adding "a simple for loop" means that you lose that capability for no good reason.
If you have 10 million rows, you're looking at 16MB for the storage of a UUID, vs 8MB for storage of a 64 bit int.
I have a web service where I need to generate access tokens of five characters. I just generate a random one and check to see if it’s already in use. Been working fine for years.
I should have been more clear. Here are the details:
If you read the linked doc, you'll see an EXCEPT clause. That can be used for a retry loop inserting into a table with a UNIQUE constraint. No read locks are necessary, because the UNIQUE constraint will catch violations safely (regardless of concurrent activity), and the retry loop can simply retry until that succeeds. For instance:
CREATE TABLE u(i INT8 UNIQUE);
-- insert random unique value in the range 0..n
-- into table u, retrying if it's already present
--
-- NOTE: this will not terminate if 0..n are all
-- present
CREATE OR REPLACE FUNCTION insert_uniq(n INT8)
RETURNS VOID
LANGUAGE plpgsql AS $$
DECLARE
x INT8;
BEGIN
<<retry_loop>>
LOOP
BEGIN
x := (random() * n)::int8;
INSERT INTO u VALUES(x);
RAISE NOTICE 'inserted unique value %', x;
EXIT retry_loop;
EXCEPTION
WHEN unique_violation THEN
RAISE NOTICE 'collision with value %; retrying', x;
END;
END LOOP;
END;
$$;
This will obviously loop forever if 0..n are all occupied, but if you choose n as (2::numeric^63 - 1)::int8, that won't happen.
Reducing space is less about pure storage amount but rather about the fact that having a not-too-large datatype (e.g. native integer) for keys generally improves all kinds of performance as the indexes are more compact and better fit in caches, comparison is trivial so joins are faster, etc.
The goal is to have a non-sequential ID (for example,
to hide the information about the actual size of the data
set) in a situation where one needs to support a storage
that doesn't have a native support for UUIDs (that isn't just “convert to TEXT”) and reuse as much of the schema
as possible without [ab]using ORMs. For example,
an application that has to support SQLite as well as
PostgreSQL and maybe some other storages.