A way to do atomic writes (lwn.net)
140 points by Tomte on May 29, 2019 | hide | past | favorite | 53 comments


An append-only, log-based file system has a lot of appeal. A lot of hard problems, like atomic writes, become trivial. A log-based FS shares similar properties with the transaction log in an RDBMS, making consistency and recovery easy. Write performance is fantastic. With enough cache memory, read performance should be good, too. The only downside is the performance hit when doing garbage collection. More research in this area would be helpful.


Totally agree. A log-based file system makes things much easier, such as atomic writes, data consistency and failure recovery. Along with COW semantics, transaction isolation also becomes easier. By its append-only writing nature, file versioning is trivial as well. Garbage collection is a performance downside indeed, but it can be mitigated by COW switching and delayed bulk collection IMHO. Overall, the cost is worthwhile and that is what I am currently doing in ZboxFS: https://github.com/zboxfs/zbox.


That looks promising. Worth taking a look.



A journaling FS only puts metadata and FS commands in the log; it does not structure the FS itself as an append log.


Haven't log-based file systems been a thing for a while now? https://en.wikipedia.org/wiki/Log-structured_file_system


They have, but they had fallen way out of favor for a couple of reasons - write amplification, cost/complexity of segment cleaning, poor performance for some workloads, etc. They've experienced a bit of a resurgence thanks to flash and SMR, though. Personally, I've always thought log-structured filesystems were elegant and wish they'd been more actively developed during the long "winter" during which their COW/write-anywhere cousins stole all the limelight.


Was hoping there were more research attempts at log-structured FSes. For example,

- Borrow some ideas from generational garbage collection. Young generation in SSD (or mirrored in RAM) with copying GC to get rid of old versions of fast changing data blocks.

- Utilize some deduplication techniques with content based signature.


I think elements of that generationality are the foundations of the Log Structured Merge Trees used by KV stores like LevelDB and RocksDB. At least I think that's the same concept; I'm not well read on filesystems.


Yes, LSMT is a good example of pushing the idea of a hybrid append log and in memory data structure further.

However, LSMT is for relatively smaller data sets, i.e. ordered key-values. It has worse write amplification than a simple append log. The level-0 memtable, backed by the write-ahead log, counts as one write. Writing to the level-1 sorted files counts as a 2nd write. Merging the sorted files counts as a 3rd write. That's 2~3 writes per change.

Also, it doesn't help address the frequently-updated-block problem. All versions of a data change are written to disk; a merge is needed to get rid of the old versions.

But it has a number of good implementation ideas that can be borrowed.
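The 2~3 writes per change can be made concrete with a toy sketch. This is purely illustrative (the `ToyLSM` class and its structure are invented for the sketch, not taken from any real engine): it counts bytes written at each stage (WAL, L0 flush, L1 merge) against bytes of user data.

```python
# Toy LSM sketch: count bytes written per user byte to illustrate
# write amplification (WAL write + L0 flush + L1 merge ~= 3 writes per change).
class ToyLSM:
    def __init__(self):
        self.wal = []          # write-ahead log (write #1)
        self.memtable = {}     # in-memory, not a disk write
        self.l0 = []           # flushed sorted runs (write #2)
        self.l1 = {}           # merged level (write #3)
        self.user_bytes = 0
        self.disk_bytes = 0

    def put(self, key, value):
        self.wal.append((key, value))                # 1st write: WAL
        self.disk_bytes += len(key) + len(value)
        self.user_bytes += len(key) + len(value)
        self.memtable[key] = value

    def flush(self):
        for k, v in sorted(self.memtable.items()):   # 2nd write: L0 sorted run
            self.disk_bytes += len(k) + len(v)
        self.l0.append(dict(self.memtable))
        self.memtable = {}

    def compact(self):
        for run in self.l0:                          # 3rd write: merge into L1
            for k, v in run.items():
                self.l1[k] = v
                self.disk_bytes += len(k) + len(v)
        self.l0 = []

db = ToyLSM()
db.put("a", "1"); db.put("b", "2")
db.flush(); db.compact()
print(db.disk_bytes / db.user_bytes)  # 3.0: each byte hit disk three times
```

A plain append log would score 1.0 on this metric, which is the comparison the comment above is making.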


Note that the SQL Server team at Microsoft had the Windows folks add write-through and cache control to NTFS so that the database could have stability guarantees.

It's mind boggling that this is still an issue, given the number of decades we've had applications (databases, and other things) that need to efficiently guarantee bits are on stable storage.


It shouldn't be mind boggling, given that this guarantee is impossible to achieve. A little bit better or worse is not that big of a deal, still very far from 100%.


Er, no. I'm using a storage system right now that makes this guarantee; when the store responds to a block write, it's going to be stable. The implementation in question uses battery-backed RAM for performance, but that's just an implementation detail.

This is definitely not impossible. It's not even that difficult (and yes, I've written a number of transactional storage systems over the past 30 years).


How have you tested the failure modes?

I worked with a battery backup disk controller once, that had redundant failover to a duplicate unit. It worked great for many months with flawless failover. Except when one of the internal batteries died, the whole system deadlocked and the 24/7 nonstop cluster was down for days.

The more layers, the more unpredictable the failure modes.


But did it break its storage guarantees? Sounds to me like freezing the storage if the redundancy can no longer be guaranteed due to a bad battery may be the only way to keep the promise.


Failover doesn't mean that the system fails when you transfer load over.


> when the store responds to a block write, it's going to be stable.

That's the thing: you are communicating with your storage controller, and these are just promises from your controller, not guarantees. Once you try to read the data back sometime in the future, there is no guarantee that you will succeed. A lot can happen between your controller reporting a successful write and you retrieving the data: software bugs and false promises, hardware problems, operational problems, disasters, etc. There is a limit to how high a probability of data retention a typical single server in a typical server room can achieve.


There are always implied assumptions, like assuming that there are no bugs in the kernel drivers themselves, and assuming that memory is not damaged.

Saying "we cannot achieve anything because there might be firmware bugs" is technically true, but completely counter-productive. Adding "A little bit better or worse is not that big of a deal" is just bad engineering -- can you imagine a doctor saying, "you might get hit by a car at any time, so I decided it is not worth it to heal you"?


We can achieve a lot on top of unreliable components, buggy kernels, and buggy firmware -- make unbelievably reliable systems. But we can't if we assume we can rely on unreliable components. That is what bad engineering is.


In this case, that stability is a promise from a SAN, a commercial product that has a very good reliability track record. We're pretty confident that the data we write is going to be stable, unless there's a physical disaster like a fire . . . which is why we write to multiples of these, which are physically distributed, have staggered software update schedules, etc. etc.

You can still lose everything, you can't control all failure modes. But you can plan for and protect against common disasters.

The original discussion was about the lack of OS support for proper flushing and making data stable. I think the lack of decent support for this is a decades-long travesty; claims that the lack of this functionality doesn't matter because "there might be firmware bugs in the storage system, so why bother?" are specious and unhelpful.


The support for flushing and making data stable has been around for a long time.

The problem is that using that support kills performance, so apps often don't.


It's definitely not an implementation detail. If the RAM fails somehow, it has lied about that data being durably stored.

But yes, it's absolutely possible for all of this to be vertically integrated, and agreed that it's ridiculous that it hasn't happened. Most systems shouldn't care, but the ones that do care really really care.


If the disk fails somehow it has lied as well. There is no such thing as guaranteed anything, you can use parity and striping to better your chances but that's not storage media dependent.


Disks, both spinning and SSD, lie about data durability more often than you would think. This is particularly a problem with consumer disks and write-back caches. Postgres has a good state-of-play writeup on reliably writing data to storage in their "Reliability and the Write-Ahead Log" documentation [1].

My rule of thumb is that if I want serious reliability I make sure my data winds up on three different pieces of hardware and includes CRC codes. So, basically ZFS or some cloud/cluster equivalent like S3.

On the other hand, for day to day work on your dev machine, the drives are so crazy reliable that you just don't worry about it. Which, of course, can bite you if you translate your day to day experience to production at scale. Everything breaks at scale.

[1] https://www.postgresql.org/docs/devel/wal-reliability.html
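The "includes CRC codes" part of the rule of thumb above is cheap to do in user space. A minimal sketch (the `pack`/`unpack` helpers and 4-byte framing are invented for illustration): store a checksum next to the payload and verify it on read, so silent corruption at least becomes detectable.

```python
import zlib

# Frame a payload with a CRC32 so corruption is detectable on read.
def pack(payload: bytes) -> bytes:
    crc = zlib.crc32(payload)
    return crc.to_bytes(4, "big") + payload

def unpack(blob: bytes) -> bytes:
    crc, payload = int.from_bytes(blob[:4], "big"), blob[4:]
    if zlib.crc32(payload) != crc:
        raise IOError("checksum mismatch: data corrupted")
    return payload

blob = pack(b"hello")
assert unpack(blob) == b"hello"
corrupted = blob[:-1] + b"\x00"   # simulate a single flipped byte on disk
```

Detection is the easy half; the "three different pieces of hardware" half is what lets you actually repair the bad copy, which is what ZFS and the cloud stores do for you.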


We need a "falsehoods programmers believe about disk reliability".


> three different pieces of hardware

This. Data is real when it has been acked by a quorum with independent power. Three is because two doesn't give you enough slack to maintain it.


>The implementation in question uses battery-backed RAM for performance, but that's just an implementation detail.

Can you give more details on this? (What is the solution, and what does it cost? What kind of computer you use it on? How large is the memory?) Very curious.


It's a common technique in SAN controllers, where writes go to a battery-backed cache, and are then written to the actual storage devices (using variations of RAID) at the controller's leisure. The limited space in the immortal DRAM provides enough buffer to give the host near immediate responses (sub-sub-millisecond over fiber channel), while the data is safe if the controller crashes or loses power (the data is written as part of the recovery process). This organization also lets the controller gather up related writes and be more efficient about them, rather than writing each block in isolation.

The battery-backed DRAM really is an implementation detail. Things will execute correctly if you don't have it, but it's usually a huge performance win.

In this case we're using Nimble SANs (Nimble is now owned by HP and suffering customer service rot, oh well). The immortal DRAM is fairly small (8GB per controller head?). The storage arrays are petabyte scale, all flash, with many, many 16 and 32 Gbit fiber channel connections. Cost is a few million per SAN instance, of which you need several for real durability (and a replication scheme for remote storage, which I'm not going to discuss here).


> The battery-backed DRAM really is an implementation detail. Things will execute correctly if you don't have it, but it's usually a huge performance win.

For anyone interested, especially the -N variant:

* https://en.wikipedia.org/wiki/NVDIMM

Back when ZFS was still new-ish, and SSDs were still expensive-ish (~2008), people were experimenting with using ZILs (ZFS Intent Log) on these types of devices:

* https://techreport.com/review/16255/acard-ans-9010-serial-at...

SSDs have come down in price since then, so people don't bother with RAM disks as much now.


Thank you so much for this very detailed reply. Very interesting.


One way modern distributed databases handle this is by assuming they can't trust the disks. When you are writing to multiple datacenters it makes recovery from disk errors just another kind of chaos resilience. The Calvin protocol takes this to its logical conclusion by moving the write-ahead-log to a Raft-like distributed consensus algorithm, where commits to the log span the cluster: https://fauna.com/blog/consistency-without-clocks-faunadb-tr...

I will add that the different flavors of fsync, and what they do (and don't do), is always a source of entertainment in database engineering chat rooms.


I agree with some of the comments on the article: there are good reasons to keep durability and atomicity separated, and to allow apps to specify what they need.

Also, a while back there was another paper that had a much better API as it allowed atomicity across files: https://github.com/ut-osa/txfs (better, but obviously not perfect: https://twitter.com/CAFxX/status/1017204557003173888)


When in full control of the file format, it is hard to find a cross platform user-space solution that beats SQLite.
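For anyone who hasn't used it this way: SQLite's transactional commit is exactly the atomic multi-write primitive the article is about, available from user space today. A minimal sketch (table name and path are arbitrary):

```python
import sqlite3, tempfile, os

# SQLite gives atomic, durable multi-statement commits in user space.
path = os.path.join(tempfile.mkdtemp(), "app.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
with con:  # BEGIN ... COMMIT; rolls back automatically on exception
    con.execute("INSERT INTO kv VALUES ('a', '1')")
    con.execute("INSERT INTO kv VALUES ('b', '2')")
# After a crash, either both rows survive or neither does.
print(con.execute("SELECT COUNT(*) FROM kv").fetchone()[0])  # 2
```

SQLite's journal (or WAL) handles the fsync choreography discussed elsewhere in this thread, which is most of why "just use SQLite" is such common advice for application file formats.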


In that system, users can write as much data as they want to a file, but nothing will be visible until they do an explicit commit operation. Once that commit is done, all of the changes become active.

Isn't this Multiversion concurrency control[1]?

1. https://en.wikipedia.org/wiki/Multiversion_concurrency_contr...


Hellwig said that originally he was using an open() flag, but got reminded again that unused flags are not checked by open() so using a flag for data integrity is not really a good idea.

This problem will be solved if, as looks likely, the openat2() syscall is merged for the path resolution restriction patchset. As a new syscall, it can do what open() should have always done and return EINVAL on flags it doesn't understand.


Dumb question: is close() the same as fsync() for getting data written?


No. (and it's only a dumb question if you don't ask)

close() flushes the application buffer into the vfs buffer, but it does not guarantee the vfs buffer is flushed to the disk buffer, nor does it guarantee the disk buffer flushes to disk. A successful fsync() call guarantees these things.\*

\* except if the disk drive controller lies about flushing its buffer to disk, but there's nothing you can do about that besides buying a better disk.
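The three layers the parent comment describes map directly onto three calls (the path name here is arbitrary for the sketch):

```python
import os

path = "/tmp/fsync_demo.txt"
f = open(path, "w")       # buffered file object (application buffer)
f.write("important data")
f.flush()                 # application buffer -> kernel page cache
os.fsync(f.fileno())      # page cache -> device (blocks until the device acks)
f.close()                 # close() alone would NOT have forced the device write
```

Note the flush() before fsync(): fsync() only pushes what the kernel has, so data still sitting in the stdio-style application buffer would be missed without it.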


> close() flushes the application buffer into the vfs buffer

No, close() only closes the file descriptor. fclose() potentially also flushes a userspace buffer.


When I got fed up with the way the OS handled writes 15 years ago in one of my (ordinary) applications, I wrote to a temp file in the same directory, closed the file, did the atomic filename swap, and then verified the integrity of the new file under new name by opening it again and comparing it byte for byte to the content to be stored. Only after that did the operation succeed. On failure, the file names were swapped again and an error was signaled to the user. The temp file remained on disk to avoid further data loss.

I suppose that's not fast enough for a DB.
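The temp-file-and-rename core of the approach described above can be sketched as follows (without the byte-for-byte verification step; the directory fsync is the usual POSIX addition for crash safety, and the function name is invented for the sketch):

```python
import os, tempfile

def atomic_replace(path: str, data: bytes) -> None:
    """Replace path atomically: write a temp file in the same directory,
    fsync it, rename over the target, then fsync the directory entry."""
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        os.write(fd, data)
        os.fsync(fd)               # file contents are on stable storage
        os.close(fd)
        os.replace(tmp, path)      # atomic rename on POSIX filesystems
        dfd = os.open(dirname, os.O_RDONLY)
        try:
            os.fsync(dfd)          # persist the rename itself
        finally:
            os.close(dfd)
    except BaseException:
        os.unlink(tmp)             # leave the old file untouched on failure
        raise

atomic_replace("/tmp/config.json", b'{"ok": true}')
```

The temp file must be in the same directory because rename() is only atomic within a filesystem; and as the next comment points out, a read-back verification pass may be satisfied from the page cache rather than the platter.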


Are you guaranteed that the data you read back didn't come from kernel caches of the file system blocks? How do you know the data actually came back from physical storage?


Yes, but why can't we add an API, all the way down to the lowest levels of the hardware, that says "once I say this write is done, it will be readable again even after power-cycle".

Achieving that guarantee is not impossible, but it needs to be explicit. It can't be inferred from other API calls.


The behavior of fsync is a battle of OS developers and OEMs versus application and database developers. Gaming the implementation of fsync to do less than fully fsync makes benchmarks look amazing, improves user perceived latency, and reduces flash wear. On the other hand, it corrupts data - but that's rare, "the hardware is failing anyway", etc.

That's how you end up with stuff like OS X's "no really, fsync" param [1], or Motorola shipping nobarrier on their phones. [2]

[1] https://github.com/google/leveldb/issues/203#issuecomment-55...

[2] http://taras.glek.net/post/Followup-on-Pixel-vs-Moto:fsync/
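The macOS "no really, fsync" knob from [1] is reachable from Python. A hedged sketch of a portable helper (the function name is made up; on macOS, fsync() only pushes data to the drive, and F_FULLFSYNC is the extra step that asks the drive to drain its own cache):

```python
import fcntl, os

def full_fsync(fd: int) -> None:
    fullfsync = getattr(fcntl, "F_FULLFSYNC", None)
    if fullfsync is not None:      # macOS: force data out of the drive cache
        fcntl.fcntl(fd, fullfsync)
    else:                          # elsewhere, plain fsync is the best we have
        os.fsync(fd)

fd = os.open("/tmp/durable.bin", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
os.write(fd, b"data")
full_fsync(fd)
os.close(fd)
```

SQLite and LevelDB both grew options around exactly this call, which is how the linked issue [1] came about.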


My CPU can't isolate processes due to Intel cutting corners with speculative execution.

My memory is vulnerable to row hammer due to vendors cutting corners while pushing for increased DRAM density.

And my - supposedly - non-volatile storage is broken due to vendors gaming benchmarks with fsync.

Is there any component in a modern consumer computer that isn't fundamentally broken in one way or another?


What you call "broken" others call "performance choices". Except SSD controllers, they are big fat phonies.

Look at how much slower CPUs are without those speculative execution tricks. I can buy gigabytes of RAM with a 20 I found in an old jacket. You can explore entire worlds consisting of gigabytes of high res textures and mesh data in near real-time, while downloading 4 new albums off the internet.

Broken?


> Broken?

Yes, it is breaking the promise of what it is supposed to do. If fsync() was defined as "will ensure your data is on disk, unless that's kinda slow, then who knows", then the behaviour would not be broken, just potentially useless for many applications. But if you promise to ensure something is stored on disk, and then don't, that's the definition of being broken.


This. It's like those counterfeit memory cards you could get on eBay that had a 128 MB chip but reported themselves as 8 gigs or so. "Works perfectly if you stay below 128 MB!"


> Look at how much slower CPUs are without those speculative execution tricks.

But do I care? All my computers, except the really ancient phone I use, are snappy.

Of course there are cases where CPU speed matters, but I don't know if they are any less obscure than the cases where timing attacks are a risk.


That off switch in your PSU is a conceptually sound implementation. Not the one at the front, that's broken too.


No. At most these components are as reliable as random noise allows them to be. That part is just physics.


You're welcome to go back to 486 class machines with 32 megs of RAM and IDE disks. I'll take the performance optimizations.


On Linux, fsync (I think).

On windows FILE_FLAG_WRITE_THROUGH ("Write operations will not go through any intermediate cache, they will go directly to disk").

It's all there.

I agree with the other poster: just because you read back the file and it compared byte-for-byte doesn't mean much -- unless it's large, it's likely to have come from the OS's RAM file cache.


> This feature would work for mmap() regions as well, not just traditional write() calls.

I always wondered about writable disk-backed pages -- back when I last looked it up munmap() manpage made no suggestions about whether it would flush the data or whether you need an explicit msync() to make it happen. Since pages could get written back without an explicit msync(), it's always tricky to speculate how the behavior should be based on observation.
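For the explicit-flush case, msync(MS_SYNC) is surfaced in Python as `mmap.flush()`. A small sketch (path and sizes arbitrary):

```python
import mmap, os

# Writable file-backed mapping: stores land in the page cache and may be
# written back lazily; msync(MS_SYNC) is the explicit "flush now" call.
path = "/tmp/mmap_demo.bin"
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
os.ftruncate(fd, 4096)
m = mmap.mmap(fd, 4096)
m[:5] = b"hello"
m.flush()       # msync(MS_SYNC): write-back has happened when this returns
m.close()       # munmap(): makes no such guarantee, per the discussion above
os.close(fd)
```

POSIX only guarantees durability for the msync() path; whether munmap() alone ever loses dirty pages is exactly the implementation-dependent behavior the comment above is wary of.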


Read the code; it's the only way to know for sure. Then use something, maybe based on clang-query, to check whether the path that leads to flushing based on msync() and/or munmap() has changed -- i.e., whether any code between your syscall and the kernel-internal write-page-to-disk-now-i'm-waiting call changed. You should be able to convince clang-query to spit out a normalized AST of the code, and still be able to easily map it back to the source file location, so you can check the source-file diff for the specifics once you've deemed the AST diff non-trivial (some things should be easier to ignore in an AST diff).

In case you end up creating such a tool, please share it with us.



