File and Storage Systems Microconference Notes
Welcome to Linux Plumbers Conference 2015
The structure will be short introductions to an issue or topic followed by a discussion with the audience.
A limit of 3 slides per presentation is enforced to ensure focus and allocate enough time for discussions.
Please use this etherpad to take notes. Microconf leaders will be giving a TWO MINUTE summary of their microconference during the Friday afternoon closing session.
Please remember there is no video this year, so your notes are the only record of your microconference.
Miniconf leaders: Please remember to take note of the approximate number of attendees in your session(s).
Dynamic iSCSI at Scale
======================
Nick Black talked about Google's infrastructure for bringing up hundreds of thousands of iSCSI targets at once, and some adjustments they made to the stack, such as removing error timeouts.
Mike Christie wondered whether it would be desirable to move the userspace iSCSI daemon into the kernel to improve performance.
Using the DLM as a Distributed In Memory Database
=================================================
Bart Van Assche talked about using a distributed locking mechanism itself as a key-value store; the author thought this was a clever scheme.
SMR in the Kernel
=================
Adrian Palmer described the large number of changes, spanning the entire storage stack, that he would need to make to turn ext4 into a host-managed SMR filesystem.
zDM: strong encouragement to submit the driver code upstream, even if it is not perfect; this is likely a much lower-cost way to achieve acceptable performance, and the pain points can always be optimized later.
ext4-smrffs plan: amazement at the sheer volume of work required to retool basically everything in the stack.
btrfs: if btrfs is more or less what you need, why not just use that?
ext4: this drew strong reactions from the crowd. The proposal is to reinvent ext4 as a btree of groups pointing to btrees of free blocks, inodes, extents, etc.; basically a wholesale redesign of ext4 from the ground up. There were questions about how to preserve locality when scattering different types of metadata blocks into different zones. A minor side benefit is that prefetch becomes a lot easier, but at a very high cost.
It was also pointed out that for years we have had a wonderful abstraction of storage devices as a big bathtub of blocks, where one need not know anything about the physical details of the device. The smrffs plan takes us very far away from an abstraction that does a reasonably good job of representing all types of storage (or at least one that most storage vendors have worked hard to ensure their products emulate). Doing 5-7 years of work to specialize for a product strategy that might not exist in 5-7 years seemed a dubious proposition.
Martin Petersen reiterated that it would be much more helpful for storage vendors to listen to the suggestions the community has been making for years about how it would like to talk to SMR storage (key/value stores; these have been discussed at length at previous LSF/MM summits).
*tea break*
Can we still call it ext4? The sheer scope of the work required makes it hard to continue calling it ext4 with a straight face. Forking is doable but more difficult than anyone wants to try. Adding support to e2fsprogs is going to be exceptionally difficult because it is a separate codebase, which means writing all of this twice (in addition to the question of how userland even finds out about ZBC). It was remarked that it is very hard to make fsck work right, and that if you are going to redesign most of the filesystem, you might as well do the whole thing and not have to deal with ext* legacy cruft.
raid: the simpler levels such as raid0/1 are probably easy enough; raid5/6 (or anything with parity) will be much more difficult to figure out. There was some chatter about simply prohibiting SMR drives from participating in RAID arrays, and it is not clear what to do about combining disks with different SMR zone geometries.
In general, the crowd thinks it is better for the vendors to collaborate on a shared dm shim; then, as workload pain points become more obvious, we can fix the problems.
SMR BOF
=======
Jim Malina and Albert Chen talked about the preliminary results they have seen with their SMR shim for device mapper. The shim builds a media cache in RAM and takes over the drive's CMR region and a few of its SMR zones to absorb unaligned or random writes, and it relies on a user-tunable compactor to flush the cache zones periodically. They report acceptable performance with this shim already, and were very strongly encouraged to submit their code upstream as soon as possible rather than fall into the trap of waiting on all the QA only to deliver a product the community cannot accept. They were also encouraged to work with the other storage vendors on an SMR shim that works well with all of their products, and to do the rest of their development in the open.
Reflink API Discussion
======================
Darrick started this by mentioning that he was working on a reflink implementation for XFS using the existing btrfs ioctls for reflink and dedupe, and that they worked quite well. He mentioned that the only change required was a new btree to track reference counts, and that it was mostly working except for copy-on-write support.
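For context, the existing btrfs clone ioctl being reused here can be driven from userspace roughly as in the sketch below; the file names are made up, and the destination filesystem is assumed to implement the ioctl (btrfs today, XFS once this work lands).

    /* Minimal sketch of a whole-file reflink via the existing btrfs clone
     * ioctl.  File names are illustrative; the destination filesystem is
     * assumed to implement BTRFS_IOC_CLONE. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/btrfs.h>        /* BTRFS_IOC_CLONE */

    int main(void)
    {
        int src = open("golden.img", O_RDONLY);
        int dst = open("clone.img", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (src < 0 || dst < 0) {
            perror("open");
            return 1;
        }

        /* Share all of src's extents with dst; writes to either file
         * trigger copy-on-write of the affected blocks. */
        if (ioctl(dst, BTRFS_IOC_CLONE, src) < 0) {
            perror("BTRFS_IOC_CLONE");
            return 1;
        }
        return 0;
    }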
copy_file_range was brought up again; Anna Schumaker said that she was still sitting on zab's reflink syscall patches and that she would send them again. She asked what the process was, and was told that she should send the patches, Al Viro would then argue about them, and eventually a new API would be committed.
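For reference, the interface being proposed looks roughly like the sketch below; the exact prototype was still under discussion at this point, and the libc wrapper is assumed here for brevity.

    /* Rough shape of the proposed copy_file_range() syscall: ask the
     * kernel to copy (or share, where the filesystem supports it) data
     * between two open files without bouncing it through userspace.
     * Treat this as a sketch rather than the final API. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
            return 1;
        }

        int in = open(argv[1], O_RDONLY);
        int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        struct stat st;
        if (in < 0 || out < 0 || fstat(in, &st) < 0) {
            perror("open/fstat");
            return 1;
        }

        /* NULL offsets mean "use and advance the file positions". */
        off_t remaining = st.st_size;
        while (remaining > 0) {
            ssize_t n = copy_file_range(in, NULL, out, NULL, remaining, 0);
            if (n <= 0) {
                perror("copy_file_range");
                return 1;
            }
            remaining -= n;
        }
        return 0;
    }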
Darrick asked about using chattr +C to de-cow files; it was agreed that flipping bits like that is a horrible unbounded-time interface, and that one should preferably just use xfs_fsr to defragment the reflinked file (which also straightens it out).
Mingming talked briefly about coding support for a btree into ext4, which she would eventually use as the basis for a reference count tree. She said that initially she considered only allowing reflinks between two files a la ocfs2, but has more recently thought about making it generic so that arbitrary ranges can be shared, as this is a more powerful approach.
Lightning Talks
===============
Ted mentioned the ext4 crypto work -- some people are interested in strengthening checksumming by using a MAC to guard against malice. This, he said, was difficult to do in a dynamic system, but he remarked that it would be much easier for the static case, such as a package manager writing out files, generating integrity data, and never modifying the files again.
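As an illustration of that static case (this is not the ext4 crypto design, just a hypothetical packaging flow): write the file once, compute a MAC over its contents, and stash the tag somewhere a verifier can find it, here an xattr with a made-up name.

    /* Hypothetical "static case" integrity flow: the file is written once,
     * an HMAC is computed over its contents, and the tag is stored in an
     * xattr for later verification.  The xattr name and key handling are
     * made up for illustration. */
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/xattr.h>
    #include <openssl/hmac.h>

    static int seal_file(const char *path, const unsigned char *key, int keylen)
    {
        unsigned char buf[65536], mac[EVP_MAX_MD_SIZE];
        unsigned int maclen = 0;
        ssize_t n;

        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        HMAC_CTX *ctx = HMAC_CTX_new();
        HMAC_Init_ex(ctx, key, keylen, EVP_sha256(), NULL);
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            HMAC_Update(ctx, buf, n);
        HMAC_Final(ctx, mac, &maclen);
        HMAC_CTX_free(ctx);
        close(fd);

        /* A verifier later recomputes the HMAC and compares it to the tag. */
        return setxattr(path, "user.pkg.hmac", mac, maclen, 0);
    }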
Mike Kravetz said that he was working on adding fallocate support to hugetlbfs so that some of its users could punch out huge pages and return them to the global pool.
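A sketch of what that would allow, assuming a hugetlbfs mount at /dev/hugepages with 2 MiB huge pages (the path, file name, and sizes are illustrative):

    /* Sketch of returning huge pages to the global pool by hole-punching a
     * hugetlbfs file with fallocate().  Offsets and lengths must be
     * huge-page aligned; the mount point and sizes here are assumptions. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>

    #define HPAGE_SIZE (2UL * 1024 * 1024)   /* assumed 2 MiB huge pages */

    int main(void)
    {
        int fd = open("/dev/hugepages/pool-file", O_RDWR | O_CREAT, 0600);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Reserve 16 huge pages up front... */
        if (fallocate(fd, 0, 0, 16 * HPAGE_SIZE) < 0) {
            perror("fallocate");
            return 1;
        }

        /* ...then punch out four of them, handing the pages back to the
         * global hugetlb pool while keeping the file the same size. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      4 * HPAGE_SIZE, 4 * HPAGE_SIZE) < 0) {
            perror("fallocate punch");
            return 1;
        }
        return 0;
    }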
Boaz Harrosh observed very high latency when creating a large number of files in tmpfs on a 2-node NUMA system; it was pointed out that the directory mutex was probably bouncing between the two nodes and that this was likely the cause. Jeff Moyer said that perf has recently gained the ability to report on cacheline bounces; it was also suggested that Boaz create multiple directories (a common trick to reduce directory contention, sketched below).
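The multiple-directories trick simply shards the creates so that each thread works in its own directory, and therefore on its own directory mutex; a minimal sketch with an arbitrary tmpfs mount point and counts:

    /* Sketch of the "create multiple directories" trick: each thread puts
     * its files in a private subdirectory so the per-directory mutex does
     * not bounce between NUMA nodes.  Mount point and counts are arbitrary. */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define NTHREADS          8
    #define FILES_PER_THREAD  100000

    static void *creator(void *arg)
    {
        long id = (long)arg;
        char dir[64], name[128];

        /* One directory per thread instead of one shared directory. */
        snprintf(dir, sizeof(dir), "/mnt/tmpfs/shard-%ld", id);
        mkdir(dir, 0755);

        for (long i = 0; i < FILES_PER_THREAD; i++) {
            snprintf(name, sizeof(name), "%s/f-%ld", dir, i);
            int fd = open(name, O_CREAT | O_WRONLY, 0644);
            if (fd >= 0)
                close(fd);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[NTHREADS];

        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tids[t], NULL, creator, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tids[t], NULL);
        return 0;
    }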
Josh Triplett asked if anyone else was working on or was interested in per-fs dynamic uid/gid remapping, since he was interested in doing so for containers.
Darrick asked about the maximum number of extents a file could realistically have, noting that he hit OOM at around 500 million extents on XFS and 100 million on btrfs. Chris Mason said that seemed reasonable given each filesystem's per-extent overhead.