Hacker News
Etcd 2.2 (coreos.com)
115 points by sajal83 on Sept 11, 2015 | hide | past | favorite | 23 comments


Does anyone at this point have a good comparison between etcd and Consul?

The v3 API of etcd seems more powerful because it supports multi-key transactions, but it isn't there yet.

Other than that, I've always had a slight feeling that Consul might be a bit more stable/reliable, but I have no hard facts.


On the stability/reliability of etcd: the team is very confident in how it behaves. Here is how we build that confidence:

- We have a functional testing cluster that is constantly simulating various types of failures. You can find some information on that here: https://coreos.com/blog/new-functional-testing-in-etcd/

- The team works to ensure the raft library that we use (and that we now share with CockroachDB) has unit tests that cover all of the edge cases described in the Raft paper. See these tests as an example: https://github.com/coreos/etcd/blob/master/raft/raft_paper_t...

- We have been running etcd in production for 1.5 years for a bootstrapping service called discovery.etcd.io and have focused on how to handle higher and higher read/write loads there. For example see the improvements we made in etcd 2.1 with async snapshots: https://twitter.com/BrandonPhilips/status/639843455141740544

- We have also focused on clear guides on how to operate etcd. All of this testing and performance work is of no use if people can't operate the system correctly: https://coreos.com/etcd/docs/latest/

Of course if you have particular issues that you have encountered we would love to help: https://github.com/coreos/etcd#contact


Fantastic reply, thanks for this Brandon. I know that Kyle Kingsbury, aka "aphyr, destroyer of distributed systems", broke etcd, and at the time (before you wrote a new raft library) Consul fared quite a bit better than etcd.

Are you using his Jepsen testing tool as part of your testing for failure scenarios? If not, why not?


IIRC, etcd and Consul failed in the same way in aphyr's post. Essentially, neither had quorum reads.

We don't have Jepsen set up as a testing tool because it is really hard to make it a reliable, false-positive-free system. Plus, the languages it is written in make it hard for us to hack on it.

Instead, what we have done is work hard to build a functional testing suite to find non-algorithmic issues (weird resource-exhaustion issues, behavior on real disks, etc.), and then have deep, fast, deterministic tests of the core Raft algorithm.


Oh, the stuff you've (you as in CoreOS) done with the raft Go library (testing and fixing it) is nothing short of incredible. The future of k8s with CoreOS as the base looks really great. Please do keep it up :)


Yeah, I think I got the idea of Consul being more stable from the Jepsen results and from some other project running into problems with the raft library that etcd used back then.

If they rewrote that library and CockroachDB (whose engineers I respect) uses it, then I guess my fears are unwarranted.


Etcd (at least up to 2.1, which I'm running now in prod) doesn't bother with niceties like service discovery and failure detection; it's strictly focused on serving as a distributed k/v store. That's the biggest difference I can think of.

I'd disagree with the stable/reliable bit; or, at least, I'd note that CoreOS ships with etcd, and fleet (the CoreOS job manager, at least until Kubernetes obviates the need for it...) depends on etcd to function correctly.

The upshot of that is, if you're running a stable release of CoreOS, you have a stable etcd, by definition - or you have a broken cluster that you can't submit jobs to, which would be bad.

From the outside looking in, to me Consul seems to provide more features; I don't doubt that it's awesome, but etcd ships with CoreOS, and when I spin up a new cluster via Cloudformation it's up and waiting for me as soon as I log in. This is a huge selling point, as I am lazy.


I use Etcd as a simple 'service registry' for a microservices based system running on FreeBSD (mostly written in Go). I like the simplicity of it.

I've used Zookeeper in the past for a dosgi monstrosity and it was a never-ending source of pain. I haven't looked at Consul, but since I already have a good grasp on how Etcd works, I probably won't bother with it.


etcd focuses just on the key/value store with consensus, to enable other systems. And there are a number of systems now built on top of etcd. Here are a handful that I talked about in a recent talk at ContainerCon[1].

- locksmith: a scheduler for host reboots in a cluster. Designed to ensure a cluster can do OS upgrades unattended.

- skydns: a DNS server built on top of etcd.

- confd: a configuration file templating system designed to watch changes and rewrite configuration on disk.

- vulcand: an HTTP load balancer with rate limiting and dynamic balancing algorithms.

- kubernetes: a system to manage clusters of containers which backs its service discovery, scheduling, and election with etcd.

One interesting thing is that Kubernetes packages load balancing, DNS, and configuration together, which many people enjoy. Other people want just DNS or just configuration, so there are tools that focus on exactly that to tie together existing systems.

[1] https://github.com/philips/hacks/tree/master/etcd-demos


Do you know how these projects address the problem of stale data, or how do you not read stale data in the case of a network partition?


  GET /v2/keys/mykey?quorum=true


Okay, sounds good if that is the solution. But if you have 5 nodes and end up with partitions A and B, where A has 2 nodes and B has 3 nodes, couldn't B have enough for a quorum but still possibly have stale data?


Since A doesn't have quorum, writes on the A side of the partition are impossible, and the data in B cannot go stale (but operations can continue through the new master in B). That's the essence of the CAP theorem's consistency-availability trade off.

This assumes all reads and writes go through Raft.


That isn't possible; it would break the consistency guarantee of using something like Raft. A quorum read or a write is serialized through the raft state machine and thus has to be acknowledged by a majority of the cluster.


Getting stale data does not break consistency.

The kind of network partition I'm talking about is covered in this presentation: http://thesecretlivesofdata.com/raft/


In addition to some of the other replies, Consul also has a writeup on their website comparing itself with Zookeeper, etcd, etc:

https://consul.io/intro/vs/zookeeper.html


I really like the idea of multi-key transactions because they will allow installing configurations as a whole, rather than as a series of piecemeal changes, or some sort of "write a new one piecemeal and then flip the current version" manual hack.


Congratulations on the new release! I actively use etcd in multiple projects, in production, and am overall very happy with it. However, I've run into some issues with the go-etcd package, namely with watches and bad behavior when a node goes away (things like a channel being spammed with updates even though no updates are taking place).

Is this something that the new go-etcd package addresses?


We publish https://github.com/coreos/etcd/tree/master/client as our new client, and it is cleaner and better than go-etcd. You could try it out.

Back to the problem: I haven't noticed this behavior in go-etcd before. We would be more than happy to fix it if it is also a problem in the new client.


If you could reach out (brandon.philips@coreos.com) I would love to hear about your production use.


So excited for the v3 API. The v2 HTTP API, with its traditional filesystem-like structure, makes getting started easy but quickly becomes limiting. Recursive operations on directories allow for some multi-key-transaction-like behavior, but only in an extremely limited sense.

Having native multi-key transactions in v3 will make a lot of use cases easier.


Can someone explain exactly what problem etcd solves and specifically how I would use it? It seems interesting, but I can't figure out how "distributed key value store" translates directly into "service discovery", whatever that is. What exactly are the keys and values that make this work?


So one of the challenges in a containerized environment is that services start up on random ports. If I'm running a Postgres container with the default Docker networking mode, for example, the internal port of 5432 may be bound to the host port of 12345. This allows me to spin up multiple instances of Postgres on the same machine for greater service density.

However, in a distributed environment, services can spin up on different machines. The instance of Postgres my application needs could be on server1:12345 or server2:23456. But in a distributed system, you need a cross-cutting service that's available to all servers so that if my app is running on server1, it can find the right Postgres instance running on server2.

I'm not an expert on etcd, but my understanding is that the most common use case is to run etcd on each host machine. When services start up, their supervisor registers the service's hostname, port, etc with etcd's key-value store. This registration is then propagated to other etcd nodes in a consistent manner, using a consensus algorithm called Raft:

http://thesecretlivesofdata.com/raft/

Consensus actually turns out to be one of the harder problems in a distributed system design. If I have a network partition that prevents etcd instances from seeing each other, you don't want one instance reporting incorrect or stale data. Otherwise, my application could be writing to the wrong service, causing data loss.

Etcd does consensus extremely well, and in a way that scales to support hundreds of nodes. It's one of the two distributed systems I'm aware of that have (mostly) passed Jepsen testing:

https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul

There are also alternatives like Consul and Zookeeper, but in the case of Zookeeper you have to do a lot of the heavy lifting yourself to support service discovery. There are also some well-documented caveats:

https://tech.knewton.com/blog/2014/12/eureka-shouldnt-use-zo...

Consul also has a pretty fair writeup (IMHO) on the tradeoffs of each solution on their website:

https://consul.io/intro/vs/zookeeper.html



