Persistent Memory Microconference Notes
Welcome to Linux Plumbers Conference 2015
The structure will be short introductions to an issue or topic followed by a discussion with the audience.
A limit of 3 slides per presentation is enforced to ensure focus and allocate enough time for discussions.
Please use this etherpad to take notes. Microconf leaders will be giving a TWO MINUTE summary of their microconference during the Friday afternoon closing session.
Please remember there is no video this year, so your notes are the only record of your microconference.
Miniconf leaders: Please remember to take note of the approximate number of attendees in your session(s).
SCHEDULE
Andy Rudoff/Intel
http://pmem.io/ - lots of libraries, etc.
- up to 1000x faster than NAND (slower than DRAM)
- up to 1000x endurance of NAND
- half the cost of DRAM
6TB per 2-socket system
SSDs will arrive first
DIMMs for next gen platform (NVDIMMs available today)
Byte-addressable persistence - fast enough to load directly
Working with Dan Williams/Intel to handle the Linux kernel changes to support PMEM
THE FUTURE
- Some more basics to be done:
- RAS
- - "we have a story" for dealing with persistent uncorrectables
- - there's a strategy to avoid crash loops
- Q: Do you bus fault if you touch a bad section of memory?
- - could depend on the class of product ... EP platforms vs EX platforms
- - "the more expensive platform (EX) allows you to do memory-controller-based mirroring"
- replication
- - being done in userspace, not the kernel
- - there's nothing for the kernel to touch
- - not going to use small pages so there's no efficient way for the MMU to catch it
- - applications can call msync() to persist?
- - "no one is using msync()!"
- RDMA
- - "how do you know that it's persistent after you store it?"
- - NICs are DMA'ing into last-level cache
- - "have to change everything" to get persistent DMA
- - writing it up as a paper in the SNIA working group as a long-term solution
- Microsoft C compiler has a feature called "based pointers"
- - can declare a pointer as always being an offset from another pointer
- - Andy trying to propose this to the gcc folks
- Many emerging memory types with different characteristics (not necessarily from Intel)
- - can differ in perf characteristics, cost/byte, capacity, RAS characteristics
- - NUMA locality still applies
- - may/may not be non-volatile
- Application transparent
- - OS handles memory tiers transparently to the app
- - the server space would need to overcome its fear of paging
- - "How would we have defined our paging system 50 years ago, when we last had persistent memory?"
- - May not need to be transparent to userspace
- Not application transparent
- - Expose it all for administration & APIs
- - "HPC guys will break their applications for a few extra percentage points of perf" - so will probably be early adopters
- - More help needed for transactions and replication
Dan Williams/Intel
PMEM & BLK
- Data path for PMEM - CPU writes directly to the memory
- Data path for BLK - CPU writes to some aperture registers, then writes to the memory
- - Q: Why? A: uncorrectable errors
- - Choice of path depends on RAS model
Lots of components in the Linux software stack...
- libnvdimm/libndctl: userspace
- DAX, ACPI.NFIT, libnvdimm kernel bus driver, PMEM, BTT, BLK
PMEM: ramdisk driver
BLK: block-aperture I/O driver
BTT layer: "makes the storage look more like a disk"
recommendation: use the library, not the sysfs files, to muck around with it
What to do about struct page?
- 6TB
- RDMA, etc need struct page to work
- kernel v4.3: proposing to use memory hotplug
Q: What if drivers store PFNs?
A: "PFNs will always be relative to a block device" - PFNs are accessed through the PMEM driver. "Offset 17" in the PMEM driver will always map to the same address throughout one boot
Q: Keith: "I can't afford to hotplug my 320TB of memory" ... "I really am planning on using mmap() for these"
A: You have the choice to never hotplug with PMEM ... "the entire block layer can be done without struct page" ... "we can start removing the struct page from all of the subsystems" ... "I initially started out doing pageless block IO and not everybody was cheering my name"
What happens if someone is doing RDMA to memory and someone hot-unplugs the device?
"remove always goes through"
Proposal: make remove sleep ... but what if it sleeps indefinitely?
PMEM and NUMA
Q: do we need to be able to query NUMA locality at a granularity finer than that of a device or file?
A: Yes. Keith mentions that he has a system with a single pmem device that spans NUMA nodes, and that the NUMA node of a particular address can be looked up via a table
Also, Jeff has preliminary code to add .direct_access() support to device mapper targets (though it's only useful for dm-linear and maybe thinp)
Q: How do you specify allocation policy for a file?
A: Keith - we used xattrs, not because they're necessarily the best method, but because it was easy
Jeff - FIEMAP may also be useful for querying locality
Bottom line: we need to be able both to set an allocation policy and to query the NUMA affinity of data at page/filesystem-block granularity