More on topic: I'm wondering how the remaining users of buffer_head could possibly migrate? Especially since we're talking about filesystems, I'm guessing we don't want any change in behaviour, as it could result in a loss of data.
LWN.net is one site with such consistently good writing that I subscribed even though I wasn't really part of the Linux development community; it is just that good. I don't have time to read it now, which is the only reason I don't currently subscribe. I highly recommend it to anyone.
I no longer work in a very kernel-adjacent area but I keep my LWN subscription because of the quality of the writing, and its general sense of the wider FLOSS ecosystem. It really is a great resource. https://lwn.net/subscribe/
I played a ton with the Linux filesystem code while I was in college, and it's filled with code that is obviously the result of tech debt. For that reason, it's one of the more complex parts of the kernel.
This is awesome, but I don't see the `buffer_head` getting replaced anytime soon. It's so baked into existing filesystems, I can't see it going away for at least half a decade.
I aspire to one day work for an organization that balances backward compatibility with necessary changes as well as the Linux community can. It's not just the attention paid to maintaining compatibility; it's the sophistication with which they manage competing needs.
Microsoft is hiring. In the past, Microsoft went as far as emulating undocumented/undefined API behavior for misbehaving/buggy applications, to keep them working under new OS releases.
Nobody in GNU/Linux tests old binaries; only whatever they recompiled for their current distro, and screw the rest.
The kernel is pretty good about maintaining syscall compatibility.
The difference between Microsoft and the Linux kernel community, as demonstrated in this case, is what I termed 'balance'.
> Microsoft had gone as far as emulating undocumented/undefined API behavior for misbehaving/buggy applications
Yeah, I believe Chen writes about code for SimCity. This is where Microsoft loses balance. They keep backwards compatibility to a fault. Literally a fault. At some point the kernel will no longer have `struct buffer_head`, at least not in the mainline, though it may be a while. That backwards compatibility will go away. Microsoft has been and continues to be bad about end-of-life for their software. They keep things around that should be gone, and remove or stop supporting things that should stay.
> On the other hand, it is possible to build a buffer-head-free kernel that supports Btrfs and XFS.
So if I build a system that only uses XFS/BTRFS, do I get anything by explicitly disabling it, or are those filesystems already getting the performance benefit just by not using it themselves? Is BUFFER_HEAD just a goal to show kernel devs what's left, or does it have practical value to end users?
The remaining filesystems depend on that option, which means that if CONFIG_BUFFER_HEAD is disabled then all those filesystems get disabled too (or, if you enable a filesystem that still needs struct buffer_head, the CONFIG_BUFFER_HEAD option gets enabled).
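As a rough illustration, that dependency is expressed in Kconfig along these lines (a simplified sketch; the exact entries in any given kernel tree may differ):

```
config BUFFER_HEAD
	bool

config EXT4_FS
	tristate "The Extended 4 (ext4) filesystem"
	select BUFFER_HEAD
```

With `select`, enabling such a filesystem forces BUFFER_HEAD on, while a tree configured with only iomap-based filesystems can leave it off entirely.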
So basically it's only there to show the kernel devs what is left. You already get the performance benefit for XFS etc., as buffer_head just isn't used by those filesystems.
It's not just that compilation of some code is disabled; some runtime checks go away too (e.g. branches like "else if (iomap_use_buffer_heads(srcmap))" become false at compile time and are optimized out). But the difference is probably unmeasurably small.
Can someone Explain Like I’m Five why someone would design a filesystem to be deeply entangled with a specific struct in memory in a specific kernel design?
I’m referring to where the article says ext4 “still makes heavy use of buffer heads”, implies this has been the case since ext2, and says “the buffer cache was deeply wired into both the block subsystem and the filesystem implementations.”
Naively I would think the job of the filesystem is to handle reads and writes to disk and that a well designed one would leave caching to other parts of the OS.
The filesystem doesn't just read from one part of the disk and write to another part of the disk; it reads from the disk and stores the result in memory, or it takes data from memory and writes it to the disk.
The other part of the kernel that deals with data in memory is the cache.
Rather than have two separate representations of "data in memory" and have to translate between them all the time, it makes sense to have one representation of "data in memory" that all the affected systems can use to talk to each other.
Even though the kernel now has multiple representations of "data in memory" due to advances in other parts of the kernel, it's still desirable to have just one representation shared between the cache and the filesystem, which is why people are talking about removing the old ones, and porting old code to use the new ones.
I think it's a question of how you design the OS kernel. Consider for example an application that has opened a file, and now wants to read data from the file. So it calls the read() syscall (https://pubs.opengroup.org/onlinepubs/9699919799/functions/r... ) either directly or via fread(), iostream or whatever. How should the kernel handle that?
- Since a file is something filesystems do, it seems sensible to plumb the read() syscall quite directly into the filesystem, which can then handle reading blocks from the actual disk, or caching previously read blocks. If an OS wants to support multiple filesystem types, it might even provide some common data structures and code for this caching of disk blocks; in Linux, that is `struct buffer_head`.
- The other option is that managing memory is something kernels do, and it makes sense to centralize that management in common code that can make sensible decisions about what to do with memory: giving it to applications that ask for it, using it for caching file contents (the page cache), and, crucially, balancing these different use cases and making good choices about what to evict. If you make this design choice, the read() syscall is not immediately handed down to the filesystem. Instead, the kernel first asks the memory management subsystem whether the data happens to be in the page cache; only if it isn't does it call down to the filesystem to read the data from the physical disk, copy it into the page cache (potentially evicting something else to make room), and then copy it back to the application asking for it.
Now it seems that historically quite a few OSes, Linux included, started off with the first approach. Later it turned out that the second design is better, and so Linux was slowly modified over time to mostly use the page cache, though struct buffer_head is still in use here and there.
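The second design's control flow can be sketched like this; a toy userspace model, not kernel code, with all names and the tiny direct-mapped cache invented for illustration:

```c
#include <string.h>

#define PAGE_SIZE   4096
#define CACHE_SLOTS 4

/* Toy page cache: a few slots keyed by page-aligned file offset. */
struct cached_page {
    int  valid;
    long offset;
    char data[PAGE_SIZE];
};

static struct cached_page cache[CACHE_SLOTS]; /* zero-init: all invalid */
static int disk_reads; /* counts how often we actually "hit the disk" */

/* Stand-in for the filesystem reading a block from the device. */
static void fs_read_from_disk(long offset, char *buf)
{
    (void)offset;
    disk_reads++;
    memset(buf, 'x', PAGE_SIZE); /* pretend this came from the disk */
}

/* The page-cache-first read path: consult the cache, and only call
 * down into the filesystem on a miss. */
static void cached_read(long offset, char *buf)
{
    int slot = (int)((offset / PAGE_SIZE) % CACHE_SLOTS);

    if (!cache[slot].valid || cache[slot].offset != offset) { /* miss */
        fs_read_from_disk(offset, cache[slot].data);
        cache[slot].valid = 1;
        cache[slot].offset = offset; /* implicitly evicts the old entry */
    }
    memcpy(buf, cache[slot].data, PAGE_SIZE); /* copy to the caller */
}
```

Reading the same offset twice touches the "disk" only once; the second read is served entirely from the cache, which is the whole point of putting memory management in front of the filesystem.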
The sibling comment is one reason, but I'd like to add another. ext2 was introduced into Linux in Jan 1993, which is less than 18 months after Linux was first released. ext3 was made by copying the code from ext2 and renaming the identifiers s/ext2/ext3/g, and ext4 the same way from ext3.
So these filesystems are old. Even if it was a good idea to structure caching and filesystem block access separately (which as the sibling comment says, it isn't), the Linux developers were just getting started back then and it was much more important to get something working than to think about good abstractions.
Think of the internal kernel interfaces like a library. The older filesystems were coded against the library as it existed at the time of their writing. Some of the APIs got superseded by newer interfaces which are recommended for new code, but a lot of code written against the old interface is still out there. To make matters worse, the new interface isn't always a trivial port; sometimes it requires rethinking larger parts of the caller's code.
I think it's because ext2 and ext4 both use block allocation instead of paging through bio, and use buffer tricks to write data in a way that reduces fragmentation.
It has nothing to do with fragmentation, but with wasted space. When ext2 was originally developed in the 1990s, the typical block size employed was 1KB as storage was orders of magnitude smaller, so wasting half a block per file on a floppy disk or small HDD was a significant concern. Today most filesystems use 4KB blocks to match the page size of the most common CPUs as wasting a couple of kilobytes per file is noise.
The journaling layer (jbd) in ext3/ext4 was built on top of buffer_heads as buffer_heads were the way writes got tracked. Rewriting ext4 and jbd2 to use a new data structure to track writes to the journal and disk will be a lot of work. ext4 has a number of issues that make it a less desirable filesystem these days, so it's not clear the work will be worth it any time soon when filesystems like bcachefs, btrfs and xfs do so much better.
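A very rough userspace sketch of why the journal is so entangled with the write-tracking structure: a transaction is essentially the list of modified buffers, which must all reach the journal before any of them is written back to its home location. All names here are hypothetical, loosely modeled on jbd's transaction/commit flow, not the real code:

```c
#include <stddef.h>

#define MAX_TX_BUFFERS 16

/* Minimal stand-in for the structure that tracks one dirty block. */
struct journal_buffer {
    long block;      /* final on-disk location */
    int  journaled;  /* persisted to the journal area yet? */
    int  dirty;      /* still needs checkpointing to its home block */
};

/* A transaction is just the set of buffers modified under it. */
struct transaction {
    struct journal_buffer *bufs[MAX_TX_BUFFERS];
    int nbufs;
};

static void tx_add(struct transaction *tx, struct journal_buffer *jb)
{
    jb->dirty = 1;
    tx->bufs[tx->nbufs++] = jb;
}

/* Commit: first persist every buffer to the journal; only then is it
 * safe to write them back to their final locations (checkpointing). */
static void tx_commit(struct transaction *tx)
{
    for (int i = 0; i < tx->nbufs; i++)
        tx->bufs[i]->journaled = 1;   /* journal write */
    for (int i = 0; i < tx->nbufs; i++)
        tx->bufs[i]->dirty = 0;       /* checkpoint to home block */
}
```

Because the per-block tracking structure in real jbd/jbd2 is the buffer_head itself, swapping it out means reworking this ordering machinery, not just changing a type.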
Really makes me consider subscribing.