“Our Apache Hadoop-based data platform ingests hundreds of petabytes of analytical data with minimum latency and stores it in a data lake built on top of the Hadoop Distributed File System (HDFS).” “Did you just tell me to go fuck myself, Bob?” “I believe I did.”
This isn't especially strange or complicated. I guess you're just trying to be witty? But it comes across as pretty ignorant.
Yeah if Uber was born today and didn't want to use any on-prem resources they'd use S3, but HDFS is the best alternative to object storage out there, and there's a huge ecosystem of tools around it. If you're running your own datacenter, there's not any serious alternative.
I don't understand what you're trying to say here. Is the methodology they're using outdated? Or perhaps their approach seems naive? Or are you saying something else?
The whole section on how GCs work is far from accurate. It only describes how one GC, CMS, works or worked because it had been removed. The permanent generation does not exist anymore since Java 8, etc
And i'm far from being an expert in GCs, but the basics are just wrong.
Maybe they have to use some kind of legacy JVM library that doesn't work on Java 8? It looks like they're a GoLang shop, so I'm not sure why we'd assume their Java stuff is representative of anything other than what they have to do because some useful JVM library for mapping or something needs to be supported.
The problem is that Java 7 is 10 years old, it means that you are missing 10 years of optimization. For GCs that's a lot.
GCs before ~2010 have been optimized for throughput more than for latency. Since then, you can choose.
Since 10 years, G1 let you set your maximum pause time and you get a huge warning if it miss that target, ZGC or Shenandoah are from the beginning latency first, throughput second.
That's for Hadoop 3 - Uber is on CDH 5 last I read, which has min support for Java 7 and it sounds like that's what they're using if the CMS GC was being used (unless it just happened to work better for them in Java 8+).
Yikes. Fortunately Java is a supported language at Uber, though some teams are more serious about it than others. Scala is also accepted because of Spark.
I guess I misunderstood things then. It's difficult to reconcile why they'd be using a legacy version of Java that required these specific GC tunings if they weren't forced to use it because of libraries that couldn't be replicated or replaced in GoLang. Sounds like it's a much bigger mess there than what I had assumed (I guess I was giving them too much of the benefit of the doubt).
Well, at least they take the cake when it comes to running CMS with the largest heap size?
It is also quite ironic that they paid for Zing C4, and then came up with "P90 average RPC queue time" as the metric, ignoring Gil Tene's wisdom that you should never average percentiles.
Another problematic statement is this:
> which in turn decreases request latency and increases throughput
Gil Tene wants you to measure p99.99 or pmax, so that you will pay for Zing C4 to get lower tail latency, while sacrificing some throughput. If you end up getting both lower tail latency and more throughput, then something is already very wrong with the existing setup (i.e. the full GC).
Large companies operate almost entirely on legacy inefficient crap. It's often the case that the thing that makes them the most money is running on giant old barely-supported piles of shit.
People constantly underestimate the fact that >80% of the cost of software is maintenance. Upgrades and patches and redesigns/refactors and following modern conventions costs 4x more than the cost it took to build it. But nobody puts that in their budget. And then later people find this huge pile of legacy shit, and think, wow, this company really screwed up. But in fact this is totally normal.
By the way, this is the case because of how we develop software and hardware, not because they are some inscrutable element that can't possibly be built once and maintained for a lifetime (like, say, a building). We just don't build it to last.
I guess the same was said about those non-mainframe kids 30 years ago. Tech gets cheaper and better. You can rent CPU time instead of buying a server. It is just that simple.
And overpay in orders of magnitude on a 5 year depreciation schedule. And the amount of fun you are going to have getting charged for the S3 API calls and pulling that data out of S3 for processing would make anyone reasonable CFO head spin 360 degrees every minute.
Depends on multiple factors:
- S3 or compatible would be trivial choice for storage
- if on-prem is a must there are multiple options, generally something with erasure codes (it is a game changer for storage)
So far I have been using enterprise storage (that has some potential problems when mounted as nfs volumes), works for petabytes, already decouples storage from compute.
More recently I was experimenting with MinIO. No conclusion so far.
The problems are with Hadoop:
- unfortunate design choices (namenode??)
- extremely unfortunate implementation (I probably spent more time in the Hadoop codebase than any other, found many bugs, some I could fix, most I couldn’t)
I think I have migrated away from Hadoop 10 PB worth of data infra in the last 5 years, mostly to AWS, some to Azure. Average cost saving is between 10-30% yoy.
Some comments point out the network cost. The reality is most companies collect a giant amount of data (ingress) and publish dashboards (egress). It makes cloud pretty viable.
S3 is beating the shit out of HDFS in reliability and cost, even though most Hadoop shops spread the fud that it is slow. Same way these companies used to spread the fud that snappy is best for data compression.
As of 2021 even the latest adopters (banks and insurance companies) use cloud. Maybe extremely few dogmatic companies remain in the onprem crowd. Even those will eventually give up.
It's quite odd that your calling out Uber for using HDFS because it's so "2014", legacy and inefficient and yet your solution involves NFS and an enterprise storage vendor? Do you not see any irony in that? I think that many would argue your solution is far more legacy. It's also far more inefficient from a cost perspective. I have yet to see a storage vendor who could match the price of commodity hardware.
Did S3 really become that much faster over the years? I was optimizing a cache that worked on top of HDFS and cloud FSes at some point few years ago, and I remember making slides to present the improvements. For HDFS I had this complex slide that tried to show cache is actually achieving some fractional improvement at some point. For the unnamed cloud FS, I just had to make a 2-bar chart that looked like the cloud FS is giving you a middle finger sideways... A really tall bar for the time it took to read data off the cloud FS. A really small bar with cache.
I used to work for Amazon. The code quality at places like Google and Amazon tend to be good.
S3 has a really good architecture and a great implementation.
HDFS has a meh architecture with a bad implementation.
There were obvious signs. I remember when Twitter decided to investigate why HDFS was slow and they figured out some details about how Hadoop guys decided to implement their own dictionary for configuration that had a much worse time complexity than the default dictionary in Java. There might be a video about this somewhere.
And there are more things like that. I used to have 5-10 years old HDFS Jira tickets open. I just gave up.
> I am surprised how much legacy inefficient crap is lingering around in companies like Uber.
Maybe there are some JVM specific libraries they need for things like mapping that don't exist on GoLang? From reading their tech blog, it sounds like they're mostly a GoLang shop and they were apparently Python before that. So it seems like they're probably forced into using Java because some of the libraries they use aren't worth rewriting in-house to avoid having to use the JVM.
Don’t get me wrong I have nothing against Java/JVM. I pretty much appreciate that tech. The criticism was specifically against Hadoop. I spent nearly 10 years on it. Luckily it is stack that most companies migrating away from.
Recent GC algorithms slow down the application, either by not scheduling coroutines that allocate too much memory or by forcing them to do the marking or the evacuation (helping the GC) when running (or both).
With that, STW pause time does not depend on the heap size or on the root set size (the stack size). In practice, it means pause < 1ms, at that point the OS becomes the bottleneck, not the GC.
So latency is good but throughput can be reduced by 30%.