
I'm using DuckDB for a side project and it keeps getting more powerful. Ironically, we originally envisioned using SQLite for the project (we're not doing OLAP processing), but DuckDB is faster and more feature complete.


My issue with DuckDB is:

- unstable execution (random crashes)

- out-of-memory errors where I would've hoped DuckDB would gracefully take the slow route to completion when memory runs out (I tried all the different config settings)


The article mentioned that DuckDB keeps improving very quickly. The next couple of months of DuckDB are all about stabilization, with no new features getting added. Once it is robust enough it will be declared "1.0". My guess is that will be in late April.

You mentioned OOMs; this has been a focus for a while and has gotten steadily better over the past few releases. 0.9 added spill-to-disk to prevent most OOMs, and 0.10, released a couple of weeks ago, fixes a bunch more memory usage problems. The storage format, which another commenter brought up, is now fully backwards compatible.

I'd suggest giving it another try, especially once 1.0 comes out.
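
For anyone tuning this, the knobs that govern the spill-to-disk behavior look roughly like this (a sketch; the 4GB cap and the directory are placeholder values):

```sql
-- Cap DuckDB's total memory usage; work exceeding the cap spills to disk (0.9+)
SET memory_limit = '4GB';
-- Directory where spill files are written (placeholder path)
SET temp_directory = '/tmp/duckdb_spill';
```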


It might be getting better, but the examples are currently so egregious that it's tough to keep giving DuckDB a chance.

Here's an example of a query that should never, ever run out of memory, but absolutely will in the latest DuckDB:

  COPY
    (
      SELECT
        rs.my_int,
        rs.my_bigint
      FROM
        READ_PARQUET('s3://some/folder/my-large-files-*.parquet')
        AS rs
    )
  TO
    '/my/home/folder/my-large-file.parquet'
    (
      FORMAT PARQUET,
      ROW_GROUP_SIZE 100000,
      COMPRESSION 'ZSTD'
    )
  ;
This query should simply read the two selected columns, based on the Parquet metadata, and stream the data to disk.

And yet it will try to load the data into memory before crashing.
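
For what it's worth, one setting that has helped with buffering in large COPY ... TO Parquet jobs is disabling insertion-order preservation, which lets DuckDB emit row groups as they're produced instead of holding results back (a sketch; it may or may not apply to this exact case):

```sql
-- Allow DuckDB to reorder output rows so it can stream instead of buffer
SET preserve_insertion_order = false;
```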


Does it fail on nightly?

There were some recent fixes: https://github.com/duckdb/duckdb/issues/10737


I've been testing DuckDB's ability to scan multi-TB Parquet datasets in S3, and I have to say I've been pretty impressed. I've run some pretty hairy SQL (window functions, multi-table joins, etc.), stuff that takes less time in Athena, but not by that much. Coupled with its ability to pull and join that data with information in RDBMSs like MySQL, it's a really compelling tool. Strangely, the least performant operations were the MySQL lookups (I had to set SET GLOBAL mysql_experimental_filter_pushdown=true;). Anyway, definitely worth another look. I'm using v 9.2.
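
For context, those MySQL lookups go through DuckDB's mysql extension; a minimal setup looks roughly like this (a sketch; connection-string values and the alias are placeholders):

```sql
INSTALL mysql;
LOAD mysql;
-- Attach a MySQL database so its tables can be queried and joined directly
ATTACH 'host=localhost user=reader database=mydb' AS mysqldb (TYPE mysql);
-- Push filters down to MySQL instead of scanning whole tables
SET GLOBAL mysql_experimental_filter_pushdown = true;
```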


- each version breaks previous format and renders it unusable


We heard your feedback! Backward compatibility was just implemented! Version 0.9 is actually fully readable by 0.10. With version 1.0 coming in a few months, this will be readable for several years' worth of version updates.


> readable for several years

Why not just make shims to migrate dbs for future compatibility? So you could read db 1.0 in v2.0 but only insofar as to migrate it to v2. The implication that you don't want to promise backwards read compatibility feels antithetical to a db driver.

For example, if I have an ancient MSSQL db that was started in 2001, I'm confident that I can grab the latest MSSQL driver and still use it. I don't have to track down MSSQL 2007 to migrate incrementally. Not sure about Postgres or MySQL, but I assume it's the same there. SQLite is definitely backwards read compatible.
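
For the record, the path DuckDB documents for crossing incompatible storage versions is a logical dump and reload rather than an in-place shim (paths are placeholders):

```sql
-- In the old DuckDB version:
EXPORT DATABASE '/tmp/dump' (FORMAT PARQUET);
-- In the new DuckDB version, against a fresh database file:
IMPORT DATABASE '/tmp/dump';
```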


Postgres:

> Major versions usually change the internal format of system tables and data files. These changes are often complex, so we do not maintain backward compatibility of all stored data.

https://www.postgresql.org/support/versioning/


You're confusing a network protocol client (the MSSQL "driver") with an on-disk format. You can't upgrade the MSSQL server from 2001 to current in-place: https://learn.microsoft.com/en-us/sql/database-engine/instal...


That seems entirely fair for pre-1.0 software.


Every time I've tried to use DuckDB I've made it segfault, so I'm simply using DataFusion instead; Rust saves the day there.

Three separate occasions with different uses all leading to crashes in the first hour of using DuckDB is enough that I frankly see no point in trying it again; I don't expect it to ever magically become reliable.


I haven't used duckdb since I got OOM on my dataset too. I think I will try again on 1.0.



