Did JVM reinvent VM? Is it a VM or a container? No one (not enough people?) aske...

aprdm · on July 31, 2022

That’s true if your shop didn’t create a bunch of custom controllers or APIs inside the k8s world… or fork it.

koffiezet · on Aug 1, 2022

Any k8s engineer worth their salt should find this pretty easy to navigate, since the way you deal with it is pretty standardised.

I've written custom operators and submitted upstream patches to existing ones; it's not that hard once you know what you're dealing with conceptually.

x86x87 · on Aug 1, 2022

So your point is that once you learn how to use the tool it's easier to use the tool?

Also, what is a k8s engineer? What's the difference between this engineer and a "devops" engineer? This and software engineer?

koffiezet · on Aug 1, 2022

> So your point is that once you learn how to use the tool it's easier to use the tool?

Kubernetes is not just "a tool", it's a platform. So if you know the k8s fundamentals, being dropped into an unfamiliar environment should not be a problem. It's the same as knowing your way around a Linux system: you might not be familiar with the specific distribution you're thrown into, but you should be able to explore the system and figure things out.

> Also, what is a k8s engineer? What's the difference between this engineer and a "devops" engineer? This and software engineer?

Any engineer who claims to know a bit more than the basics of k8s? Whether they're called a software, "devops", SRE, or "platform" engineer doesn't really matter; I've been labeled all of the above.

I'll never claim k8s is perfect, I have plenty of issues with it, but very few of them are architectural problems, and I've yet to encounter a platform that attempts to address all the issues k8s does. But you seem to be very critical of it, so what would you propose as a viable alternative?

x86x87 · on Aug 1, 2022

Are you running the k8s control plane yourself? Are you keeping it up-to-date?

dilyevsky · on Aug 1, 2022

We do (using Cluster API), and yes, we now keep it within 2 minor releases of the latest. Most of the work required to keep it up to date isn't on the k8s side, though…
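A policy like "stay within 2 minor releases of the latest" is easy to check mechanically across a fleet. The sketch below is a hypothetical helper, not part of Cluster API or any k8s tool; it assumes versions look like `v1.24.3`, and the fleet names and version numbers are made up for illustration.

```python
def minor_version(version: str) -> tuple:
    """Parse 'v1.24.3' or '1.24' into (major, minor)."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def releases_behind(cluster_version: str, latest_version: str) -> int:
    """How many minor releases the cluster trails the latest release."""
    c_major, c_minor = minor_version(cluster_version)
    l_major, l_minor = minor_version(latest_version)
    if c_major != l_major:
        raise ValueError("major versions differ; minor-release lag is undefined")
    return l_minor - c_minor


def within_policy(cluster_version: str, latest_version: str, max_lag: int = 2) -> bool:
    """True if the cluster is at most `max_lag` minor releases behind."""
    return releases_behind(cluster_version, latest_version) <= max_lag


if __name__ == "__main__":
    # Hypothetical fleet; in practice versions would be read from each cluster's API server.
    fleet = {"gcp-us": "v1.24.2", "aws-eu": "v1.22.9", "aws-us": "v1.21.14"}
    latest = "v1.24.3"
    for name, version in sorted(fleet.items()):
        status = "ok" if within_policy(version, latest) else "UPGRADE"
        print(f"{name}: {version} is {releases_behind(version, latest)} behind -> {status}")
```

Comparing only the minor component matches how k8s support windows are usually discussed; patch releases are ignored on purpose.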

Edit: I can answer questions on how we do it, or maybe write a blog post if you’re seriously asking.

x86x87 · on Aug 1, 2022

I am asking because I've seen the work that goes into keeping a control plane (for anything other than toy applications) running. It's non-trivial and requires a lot of know-how about many poorly documented k8s features. You may have figured out a better way of doing it, or maybe you have reduced complexity with a more opinionated setup.

How many nodes (ballpark) are in your cluster, how big is the team supporting it and how many developers use the cluster you've stood up? (For how many services) Are you running your workloads in the cloud or on dedicated bare metal machines?

dilyevsky · on Aug 1, 2022

I’d love to hear more details, or chat, about which aspects of running the control plane yourself (not on GKE/EKS/AKS) you found most difficult. I’m looking at how I could solve this more generally for businesses ;)

> How many nodes (ballpark) are in your cluster, how big is the team supporting it and how many developers use the cluster you've stood up? (For how many services)

Up to the upper hundreds of nodes / tens of thousands of cores (beefy nodes) per cluster in the biggest ones (we run several geo- and cloud-distributed clusters). Tbh what gives me more grief is the number and diversity of clusters (different clouds, different setups), not individual cluster sizes. The original team that built and proved the design was only 3-4 devs, but k8s was only part of what we did, so it's more like 1-2 full-time. The number of engineers who deploy there is probably close to 50 now, with at least twice as many services (not all actively developed or in-house).

> Are you running your workloads in the cloud or on dedicated bare metal machines?

GCP and AWS VMs. We've managed to stay away from anything managed except block and object storage, so bare metal is in store for us; it just wasn't the time yet. One of my main goals in developing our current design was to keep that door open, so that when the time comes the setup will be largely the same as in the cloud (with a few additions, like a 3rd-party CSI driver for iSCSI/NVMe-oF storage).

x86x87 · on Aug 1, 2022

Wow. Where to begin?

Problems spanned multiple dimensions: limits on the number of nodes in the cluster (we had to split into 2 clusters; think thousands of nodes), CRDs, weird network setup issues, daemon sets for storage/logs, persistent volumes, rough edges of the tooling for operating the cluster (people tell you that with k8s you write fewer scripts! But in reality, depending on your cluster size, the scripting moves to the admin side).

And this doesn't include actually upgrading k8s, which was a nightmare, or having to keep the underlying nodes patched.

dilyevsky · on Aug 1, 2022

Did you run into IPv4 subnet limitations wrt node count, or something else? 5k has been the max supported node count for a while now, but running that many with a tunnel-based CNI (as is typical in the cloud) would probably be too wasteful, so it's likely only an option when you can configure L3 on your own switches. Or maybe using ENIs in EC2. I found that IPv4 space requires very careful planning, especially if you plan on peering clusters together.
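To make the IPv4 planning point concrete: if each node gets a fixed-size pod range carved out of one cluster-wide CIDR (as with kube-controller-manager's node IPAM, whose `--node-cidr-mask-size` defaults to /24 for IPv4), the cluster CIDR alone caps the node count. The sketch below is a back-of-the-envelope calculator under that assumption; other IPAM modes (e.g. ENI-based ones) allocate differently, and the cluster names and CIDRs are hypothetical.

```python
import ipaddress


def max_nodes(cluster_cidr: str, node_mask: int) -> int:
    """Number of fixed-size per-node pod ranges that fit in the cluster CIDR."""
    net = ipaddress.ip_network(cluster_cidr)
    if node_mask < net.prefixlen:
        raise ValueError("per-node mask must be longer than the cluster prefix")
    return 2 ** (node_mask - net.prefixlen)


def pod_addresses_per_node(node_mask: int) -> int:
    """Addresses in one IPv4 per-node range (e.g. /24 -> 256)."""
    return 2 ** (32 - node_mask)


def overlapping_pairs(cidrs: dict) -> list:
    """Pairs of clusters whose pod ranges overlap, which breaks plain
    routing between them if the clusters are peered."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    names = sorted(nets)
    clashes = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if nets[a].overlaps(nets[b]):
                clashes.append((a, b))
    return clashes


if __name__ == "__main__":
    # A /16 with /24 per node caps the cluster at 256 nodes;
    # reaching ~5k nodes at /24 per node needs at least a /11.
    print(max_nodes("10.0.0.0/16", 24), max_nodes("10.0.0.0/11", 24))
    # Hypothetical pod CIDRs for three clusters that will be peered.
    print(overlapping_pairs({
        "gcp-us": "10.0.0.0/14",
        "aws-eu": "10.2.0.0/16",
        "aws-us": "10.8.0.0/14",
    }))
```

A /24 per node gives 256 addresses, comfortably above the kubelet's default limit of 110 pods per node; the cost of that headroom is that every node burns a whole /24 of the cluster range.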

> weird network setup issues, daemon sets for storage/logs, persistent volumes, rough edges of the tooling for operating the cluster (people tell you that with k8s you write fewer scripts! But in reality, depending on your cluster size, the scripting moves to the admin side).

Networking with Cilium has been mostly “set up once and forget”. But yeah, all the add-ons require constant upkeep in a self-managed scenario. As far as configuration management goes, I agree it's a big hole, which we had to fill with our own DSL-based configuration framework (think Helm 3 but with Starlark). Now there are products like Pulumi that can offer a similar API.
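The appeal of a Starlark-style approach over text templating is that manifests are built as data by ordinary functions and loops, then serialized. Starlark is a Python dialect, so the sketch below uses plain Python; it is not the framework described above, nor Helm's or Pulumi's API. `deployment` and `render` are hypothetical helpers, and the image names are placeholders.

```python
import json


def deployment(name: str, image: str, replicas: int = 2, env: dict = None) -> dict:
    """Build an apps/v1 Deployment as a plain dict; labels are derived from the name
    so the selector and pod template can never drift apart."""
    labels = {"app": name}
    container = {"name": name, "image": image}
    if env:
        # Sorted for deterministic output, which keeps diffs stable.
        container["env"] = [{"name": k, "value": v} for k, v in sorted(env.items())]
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


def render(objects: list) -> str:
    """Wrap objects in a v1 List and serialize to JSON, which kubectl accepts."""
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": objects},
                      indent=2, sort_keys=True)


if __name__ == "__main__":
    services = [
        deployment("api", "registry.example.com/api:1.4.2", replicas=3,
                   env={"LOG_LEVEL": "info"}),
        deployment("worker", "registry.example.com/worker:1.4.2"),
    ]
    print(render(services))
```

The output can be piped to `kubectl apply -f -`. Per-cluster differences (replica counts, regions, storage classes) then become function arguments rather than template conditionals.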