Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I don't think I've ever seen this mentioned anywhere, but if you need a unique ID for an entity with not a lot of records planned (≤10,000,000), why not use a random int64 with a simple for loop on the application side to catch the occasional collisions? Are there any downsides besides making the application side a tiny bit more complex?


That’s the UUID approach, but worse. According to the birthday problem[1], you’re 50% likely to get a collision in 65 bit numbers after about 5 billion insertions. That’s not an awful lot. Replace that with a 128-bit UUID and you’d have to insert 22,000,000,000,000,000,000 rows to get a 50% chance. That’s probably less likely than a cosmic ray flipping a random bit in RAM and corrupting the index that way.

[1] https://en.wikipedia.org/wiki/Birthday_problem#Probability_t...


The post qualified as <= 10,000,000 total records. For that number of records, there's about a chance of about 0.00001 that you get a collision, assuming good randomness.


Sure, but stuff always grows, and the experiment gets run a bunch of times. Why not go with the built-in solution and then not have to worry about it?


I'm just answering the poster's question directly; but in the general case, I agree with you. The cognitive overhead of dealing with the various "what ifs" usually aren't worth the couple bytes or cycles that you could save.


Getting a collision with this approach doesn’t matter — the whole point is to loop if you do get a collision. The only issue is getting a long string of sequential collisions, which is highly unlikely.


But now you’ve tried code complexity for a few bytes if storage. That’s just not worth it.


8 extra bytes per row and per foreign key reference relative to an int64 can add up quickly especially if the row is small. I agree it’s not typically the right trade off but it’s not as absolute as you claim.


This is why we need 2TB drives now, when we used to get by with 2GB.


Back in my day we measured things in Ks, not Ms or Gs or Ts.

Get off my lawn...

But seriously, UUIDs work, they don't need application code to avoid collisions. If you want something a bit more compact/shardable, use ULIDs.


for what it's worth, YouTube still uses 11 character base64 strings for their video ids, which are assumed to be 64-bit ints. They also allow unlisted videos, which people usually take to mean "semi-private".

It's an interesting tradeoff. The UX of the smaller YouTube video id links is probably of some benefit to them. Plus they have private videos for when you really don't want your video to be viewed, with unlisted being the middle ground of easy sharing but also keeping it exclusive.


Sure, and it makes sense there: write a service that returns unique 64 bit ints and encapsulate the complexity inside that one location. That’s easier than making every `insert` in your app code have to do a `while not unique` loop.


If you have this situation, it’s not even worth the time or effort to not just use UUIDs. You’re adding complication for no gain whatsoever.


> if you need a unique ID for an entity with not a lot of records planned (≤10,000,000), why not use a random int64 with a simple for loop on the application side to catch the occasional collisions?

What’s the use case for this where UUIDv4 or sequential ID isn’t better? Because it sounds like a solution in search of a problem.

> Are there any downsides besides making the application side a tiny bit more complex?

Are there any upsides to warrant the complexity?


I wouldn't say it's a particularly great upside, but I've found tooling doesn't tend to work as well with UUID keys as it does integer keys. E.g., Hasura won't map UUID keys to the GraphQL ID type [1], which makes working with them unnecessarily difficult. Arguably, the issue should be fixed with the tool (or a different tool chosen), but there's only so many battles to pick and sometimes that decision is out of your control.

[1] -- https://github.com/hasura/graphql-engine/issues/3578


UUIDv4 has performance implications. But I agree, if you are already coupled to the DB (due to the check loop), generally you might as well use sequential.

Seems too easy to screw up.


Database sequences comply with the ACID properties of the transactional processing. Generating your own IDs and adding "a simple for loop" means that you lose that capability for no good reason.

If you have 10 million rows, you're looking at 16MB for the storage of a UUID, vs 8MB for storage of a 64 bit int.

Both of those are entirely cacheable.


I have a web service where I need to generate access tokens of five characters. I just generate a random one and check to see if it’s already in use. Been working fine for years.


You can also do the loop on the server side using PL/pgSQL:

https://www.postgresql.org/docs/current/plpgsql-control-stru...


That's not that trivial. You can't just loop to get a unique ID. Maybe if you lock the whole table for reads first, which is quite drastic.


I should have been more clear. Here are the details:

If you read the linked doc, you'll see an EXCEPT clause. That can be used for a retry loop inserting into a table with a UNIQUE constraint. No read locks are necessary, because the UNIQUE constraint will catch violations safely (regardless of concurrent activity), and the retry loop can simply retry until that succeeds. For instance:

    CREATE TABLE u(i INT8 UNIQUE);

    -- insert random unique value in the range 0..n                                                                
    -- into table u, retrying if it's already present                                                              
    --                                                                                                             
    -- NOTE: this will not terminate if 0..n are all                                                               
    -- present                                                                                                     
    CREATE OR REPLACE FUNCTION insert_uniq(n INT8)
      RETURNS VOID
      LANGUAGE plpgsql AS $$
    DECLARE
      x INT8;
    BEGIN
      <<retry_loop>>
      LOOP
        BEGIN
          x := (random() * n)::int8;
          INSERT INTO u VALUES(x);
          RAISE NOTICE 'inserted unique value %', x;
          EXIT retry_loop;
        EXCEPTION
          WHEN unique_violation THEN
            RAISE NOTICE 'collision with value %; retrying', x;
        END;
      END LOOP;
    END;
    $$;
This will obviously loop forever if 0..n are all occupied, but if you choose n as (2::numeric^63 - 1)::int8, that won't happen.


Thank you.


The point of guids/uuids is no collision... Ever... Including for large datasets. Even when you merge multiple together in 10 years.

For your use-case you could use incremental IDs.


Is the goal here to save space?


Reducing space is less about pure storage amount but rather about the fact that having a not-too-large datatype (e.g. native integer) for keys generally improves all kinds of performance as the indexes are more compact and better fit in caches, comparison is trivial so joins are faster, etc.


The goal is to have a non-sequential ID (for example, to hide the information about the actual size of the data set) in a situation where one needs to support a storage that doesn't have a native support for UUIDs (that isn't just “convert to TEXT”) and reuse as much of the schema as possible without [ab]using ORMs. For example, an application that has to support SQLite as well as PostgreSQL and maybe some other storages.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: