No one who has done serious work across these database platforms believes the vendors restrict benchmarks because they're behind or uncompetitive. They both make very capable, competitive database systems.
The problem, as cited in the linked page, is that "It takes much more work to refute bad benchmarks than to produce them". We regularly see people post egregiously flawed benchmarks on HN, usually to demonstrate some preconceived notion or other. And while that can usually be ascribed to ignorance or sloppiness, it's trivial for vendors to contrive benchmarks purpose-built to make their own product look good and the competition look bad, however artificial and unrealistic the scenario.
DeWitt clauses are an abomination. They should not exist. But I get why they exist and are threatened, even if they are basically never actually enforced (seriously though -- are they ever actually enforced?).
I think even flawed benchmarks are good benchmarks.
Because real world code isn't optimized to the max. Because real world code makes bad or even wrong assumptions. Because there are usually many more requirements on real world code than "be as fast as you can".
Additionally, those flawed benchmarks make you think about a lot of interesting details. Details you usually don't think about when producing real world code.
It's a bit like statistics: "Don't trust any statistics you haven't fabricated yourself." Still, statistics are useful in general.
There are flawed benchmarks, and there are completely wrong benchmarks.
I've seen so many benchmarks where the timing or loop overhead dominated to the point that they weren't measuring anything but the cost of reading the clock.
I've seen benchmarks where they run a single time without warming up the disk cache, so whichever benchmark they happen to run first shows as 2 orders of magnitude slower.
I've seen benchmarks where the standard deviation was much larger than the difference between the times reported, but they didn't bother to check for this.
I've seen benchmarks where they don't even use the same algorithm, and the difference in the algorithm chosen dominates any other concerns.
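The timing-overhead and standard-deviation failure modes are easy to reproduce. A minimal Python sketch (the `trivial_op` workload is invented for illustration):

```python
import statistics
import time

def trivial_op():
    """A hypothetical operation so cheap that timer overhead dominates."""
    return 1 + 1

# Pitfall 1: timing each call individually mostly measures the clock.
per_call = []
for _ in range(1000):
    start = time.perf_counter()
    trivial_op()
    per_call.append(time.perf_counter() - start)

# Timing an empty span shows how much of that is pure timer overhead.
overhead = []
for _ in range(1000):
    start = time.perf_counter()
    overhead.append(time.perf_counter() - start)

# Pitfall 2: report the spread, not just the mean. If the standard
# deviation swamps the difference between two candidates, the "winner"
# is noise.
mean_call = statistics.mean(per_call)
mean_overhead = statistics.mean(overhead)
stdev_call = statistics.stdev(per_call)
print(f"mean per call:  {mean_call:.3e}s")
print(f"timer overhead: {mean_overhead:.3e}s")
print(f"stdev:          {stdev_call:.3e}s")

# A sounder approach: time one large batch and divide, so the two
# clock reads are amortized over many iterations.
start = time.perf_counter()
for _ in range(100_000):
    trivial_op()
batch = (time.perf_counter() - start) / 100_000
print(f"batched per call: {batch:.3e}s")
```

With a workload this cheap, the per-call numbers are typically dominated by `perf_counter` overhead, while the batched figure is far smaller and far more stable.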
It's entirely possible that the net utility of such benchmarks being published is negative; enough people come away with the initial wrong conclusion without critically investigating it.
It's almost certainly true that the net value to the company whose product looks bad in the benchmarks is negative, which is why these clauses exist.
If the benchmarks were the kind of bad that you allude to in your second statement it would probably be pretty reasonable for the reasons you state, and overall you would probably end up with a distribution of benchmarks that approximated overall performance. What about intentionally crafted, nigh pathological examples? With the kind of flaws that don't crop up in real world code? I don't think those have value.
Actually, I think I'm trying to distinguish between flawed benchmarks like you are talking about and actively malignant benchmarks.
Nah, flawed data is pretty much unusable. Like, if I were to benchmark some really short java program by restarting the JVM each time, do I get to say that it is slow? No, because I measured some utterly stupid metric.
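The JVM-restart example generalizes: if every run pays a one-time setup cost, the benchmark measures startup rather than the workload. A hedged Python analogue, with `lru_cache` standing in for JIT warm-up (all names invented):

```python
import time
from functools import lru_cache

@lru_cache(maxsize=None)
def expensive_setup(n):
    # Stands in for JIT compilation / cache warming: slow once, then free.
    return sum(i * i for i in range(n))

def workload(n):
    return expensive_setup(n) + 1

# "Cold" measurement: clear the cache before every run, analogous to
# restarting the JVM each time.
cold = []
for _ in range(5):
    expensive_setup.cache_clear()
    start = time.perf_counter()
    workload(200_000)
    cold.append(time.perf_counter() - start)

# "Warm" measurement: pay setup once, then measure steady state.
workload(200_000)  # warm-up run, discarded
warm = []
for _ in range(5):
    start = time.perf_counter()
    workload(200_000)
    warm.append(time.perf_counter() - start)

print(f"cold median: {sorted(cold)[2]:.3e}s")
print(f"warm median: {sorted(warm)[2]:.3e}s")
```

The cold numbers are orders of magnitude worse, yet they say nothing about the steady-state performance a long-running service would actually see.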
I was on a team that migrated an app from SQL Server to Postgres some years ago, and performance dropped by around 30%.
Of course, due to the benchmark rules, I'm not allowed to post about it... ;)
That being said, in our application it came down to four main issues:
- Postgres doesn't allow optimizer hints, and the optimizer generally produced worse plans than Microsoft's (e.g. CTEs were an optimization fence in Postgres).
- Postgres at the time did not support parallel queries.
- Postgres connection handling is terrible. Pgbouncer helps but introduces its own problems.
- Database bloat / vacuum issues
Generally I found SQL Server would perform very well with no tuning and minimal maintenance. And generally, if it doesn't do the right thing, a query hint will fix it. Postgres required a lot more configuration and testing to obtain acceptable performance.
We still switched to Postgres. It has lots of other benefits - and none of the licensing.
> Generally I found SQL Server would perform very well
Every database engine makes different trade-offs, or does some things better than others.
I've found that SQL Server handles complex queries over large data sets fantastically well, easily outperforming Oracle or open-source DBMSes in like-for-like scenarios. It often papers over missing indexes shockingly well, because it automatically uses temporary hash indexes that it can build in parallel. Up to about ten million rows this can be fast enough that developers don't even notice that they forgot an index!
However, it has a very high minimum latency that cannot be fixed through any mechanism: not via RDMA, shared memory, named pipes, local loopback, or any other means. It always takes SQL Server about 125-150 microseconds to respond to a query, no matter how trivial, such as "SELECT 1". This makes it terrible as a KV store, and makes some types of ORM very sluggish when chasing references in code.
Similarly, there have been several articles published saying that while Postgres is well-rounded and feature-rich, it has high write amplification that makes it an absolute no-go past a certain scale. Many orgs have written long blog posts about how they started with PG and were forced to migrate to MySQL to reach the required write throughput.
I feel that the DBMS space is still immature, and there are very fundamental things missing. E.g.: there are no enterprise-ready open source pure in-memory relational database engines with capabilities comparable to SAP HANA. I've also yet to see a DBMS with decent programmability, so that developers don't need to write App->ORM->DB layers over and over.
The first two are already fixed, and the third is fixed in v14.
The fourth will likely never be fixed, as it is a byproduct of the extremely fine-grained transactional capabilities in Postgres (which bring the advantage of fewer deadlocks). However, if you'd rather not make that tradeoff, I believe there is a backend in development designed to reduce bloat, using the new pluggable storage engine API.
I'm aware some of these things have changed, but equally SQL server has likely developed since then too.
That said, for #1, are you saying PG now allows query hints, or just that the specific example is fixed? There are still a lot of cases where the optimiser does the wrong thing, and with PG you're powerless to fix it.
With #3, do you have any more information about that? I can't find anything on a brief Google.
I was specifically referring to the optimization fence on CTEs. PG still doesn't have query hints, and it seems like the core team has an ideological opposition to building query hint capabilities. Their stance has been that query hints often lead to suboptimal behavior because queries often don't change over time, but data sets do, so relying on statistics allows your query plans to change with different data. A query hint, on the other hand, may lead to an optimal query execution today, but a terrible one tomorrow.
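For concreteness: since PostgreSQL 12 the fence is opt-in, with `WITH ... AS MATERIALIZED` forcing the old behavior and `AS NOT MATERIALIZED` letting the planner inline the CTE. A small sketch of the query shape using Python's bundled sqlite3 (which also inlines CTEs by default); the table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 30)])

# Plain CTE: modern planners (PostgreSQL 12+, SQLite) may inline this
# into the outer query instead of materializing it as a fence.
rows = conn.execute("""
    WITH big_orders AS (
        SELECT id, amount FROM orders WHERE amount > 15
    )
    SELECT COUNT(*) FROM big_orders
""").fetchone()
print(rows[0])  # 2

# In PostgreSQL 12+ the old fence can be requested explicitly:
#   WITH big_orders AS MATERIALIZED (...) ...
# and inlining can be forced with:
#   WITH big_orders AS NOT MATERIALIZED (...) ...
```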
Over the years Postgres has steadily chipped away at almost all of the reasons why I would ever want query hints. Only two frustrations remain:
* The optimizer is not aware of TOAST storage. This means that for tables with lots of large objects (geospatial in my case), the optimizer only sees the cost of the table with a pointer to the large object, as opposed to the cost of pulling the table with the large object.
* The optimizer only has a single statically configured cost for each function and aggregate, but actual costs can vary dramatically based on the size of the data inputs, the size of the aggregate "window", etc.
Overall, these are rarely a big enough deal for me to not choose Postgres. I see the default planner behavior as better than the default planners for both Oracle and SQL Server, but it is still possible that the lack of query hints could be a deal breaker. If you get a chance, give it another try...I think you might be surprised at how far Postgres has come since the 9.X days.
As far as number 3, I guess I should clarify that they've dramatically reduced the connection overhead, but I shouldn't have said "fixed". It may or may not be reduced enough for your requirements. Here is a blog post with some links to various commits: https://www.depesz.com/2020/08/25/waiting-for-postgresql-14-...
I do actually use Postgres still, despite its limitations.
I dislike their ideological opposition to query hints, partially because I've been bitten enough by the optimiser deciding in production to change the query plan to something that kills performance. If anything, I'd rather be able to pin the entire query plan! Obviously a better optimiser is ideal, but query plan changes are a big risk to app stability that are hard to mitigate. AWS Aurora for PGSQL actually has a query plan management system which helps with this.
Anyway, the difference between PG and the others doesn't make up for the licensing costs. Even in the specific cases where MSSQL does far, far better, the licensing costs still eliminate most of the benefit.
The query hint thing is a serious issue. I get the concern that people won't report issues and so on, but on heavily loaded DBs, having no decent operational response to a bad plan change is really scary. In an ideal world, something like Oracle's plan stability would be pretty helpful.
This is extremely easy to believe.