Networking Microconference Notes
Welcome to Linux Plumbers Conference 2015
The structure will be short introductions to an issue or topic followed by a discussion with the audience.
A limit of 3 slides per presentation is enforced to ensure focus and allocate enough time for discussions.
Please use this etherpad to take notes. Microconf leaders will be giving a TWO MINUTE summary of their microconference during the Friday afternoon closing session.
Please remember there is no video this year, so your notes are the only record of your microconference.
Miniconf leaders: Please remember to take note of the approximate number of attendees in your session(s).
Rules
- Tight schedule!
- 3 Slides Max
- 5 minutes presentation, 10-15 minutes discussion
- If you are still talking after 5 minutes, you will be stopped
- Not kidding
SLIDES
SCHEDULE
- VRFs in Linux (Shrijeet, David)
- Awesome VRF + MPLS demo
- Q: Is your performance limited by throughput?
- 2 route lookups (additional lookup in driver), tx overhead should be minimal
- Q: Can I ping on a specific interface inside a particular VRF?
- It would require binding to both the VRF interface and the specific interface in that VRF
- Creation of a debug interface inside the VRF
- Encrypted Overlay (Jesse)
- Why is overlay encryption not enabled by default? Performance hit, complexity
- Hardware offload becoming more common in next gen NICs
- Obvious starting point: IPSec / DTLS on payload
- TLS sounds interesting, Netflix is doing it with FreeBSD
- Q: What about header and payload not being bound in DTLS?
- Doesn't seem impossible to do
- Facebook actually does this
- Some concern around APIs required for TLS such as key exchanges
- Key exchange probably a bad idea but the key subsystem could be leveraged
- Netflix does TLS in userspace
- Key exchange could occur in user space and data exchange in the kernel (davem)
- Circles back to generic checksum offloading
- eBPF in tc classifiers/actions (Daniel)
- tc layer received extensive bpf support including tail calls, map fd handover and documentation
- agent/fuse backend makes ebpf map fds persistent after tc task quits
- possibility to share maps across tc ingress/egress classifiers
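
The fd handover mentioned above can be sketched with plain Unix domain sockets. This is a minimal illustration only, not the tc agent's actual code: a pipe stands in for an eBPF map fd, and the helper names `send_fd` / `recv_fd` and the `b"map"` label are hypothetical.

```python
import array
import socket


def send_fd(sock, fd, label=b"map"):
    """Send one file descriptor as SCM_RIGHTS ancillary data.

    `label` is an arbitrary payload; at least one byte of normal data
    must accompany the ancillary message on a stream socket.
    """
    fds = array.array("i", [fd])
    sock.sendmsg([label], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_fd(sock):
    """Receive one file descriptor; returns (label, new_fd)."""
    fds = array.array("i")
    msg, ancdata, _flags, _addr = sock.recvmsg(16, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing bytes that do not form a whole int.
            usable = len(data) - (len(data) % fds.itemsize)
            fds.frombytes(data[:usable])
    if not fds:
        raise RuntimeError("no file descriptor received")
    return msg, fds[0]
```

The receiver gets its own descriptor for the same underlying object, so the object stays alive for as long as the receiving process keeps that descriptor open, even after the sender exits.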
- Network Virtualization with eBPF (Alexei)
- Distributed bridge with bpf, bpf redirects to dummy devices, vxlan, tap devices etc attached to linux bridge
- packet carries metadata
- maps configure the bridge => distributed bridge
- Issues:
- Dummy devices are useless, wasting memory
- Redirect requires a clone just to drop the packet again after the bpf program exits
- Option: Change arg from skb to skb pointer so we can drop in the middle
- Option: TCA_ACT_REDIRECT return code to instruct qdisc to redirect
- Option: Add as metadata to skb (stacked metadata dst)
- need persistent maps
- Two FUSE implementations to resolve this, but user space fd handover via SCM_RIGHTS is an issue
- Q: Why do you need a dummy device in front of VXLAN?
- To program VXLAN device instead of falling back to default L2 behaviour
- Plans
- Move verifier to userspace, which allows fuzzing with clang
- constant blinding
- redirect to socket for containers which don't require a netdev/ip address
- ILA (Tom)
- Split v6 address into locator and id => Network overlay without encap
- Promise: near zero perf overhead
- Performance:
- ILA 14% slower than IPv6 baseline, most of that due to lack of early demux (-10%)
- Ideas:
- Fix early demux for LWT
- Avoid dst reference until skb is queued
- Avoid skb->dst altogether?
- IPVLAN avoids route lookup (lwt)
- Q: Scale of this? Lookup on /128?
- Not an issue in datacenter, location independence is more important
- Issue with scale of namespaces in ILA world
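
The locator/identifier split can be sketched as below. This is a rough illustration, assuming a 64/64 bit split and ignoring details such as checksum handling; the function names `split_ila` and `set_locator` are hypothetical and the addresses come from the documentation prefix.

```python
import ipaddress

MASK64 = (1 << 64) - 1


def split_ila(addr):
    """Split an IPv6 address into (locator, identifier) as 64-bit integers.

    Assumption: the upper 64 bits are the locator and the lower 64 bits
    are the identifier.
    """
    value = int(ipaddress.IPv6Address(addr))
    return value >> 64, value & MASK64


def set_locator(addr, locator):
    """Rewrite the locator half of `addr`, keeping the identifier untouched.

    `locator` is given as an IPv6 address whose upper 64 bits are used.
    """
    _, identifier = split_ila(addr)
    new_locator = int(ipaddress.IPv6Address(locator)) >> 64
    return str(ipaddress.IPv6Address((new_locator << 64) | identifier))
```

For example, `set_locator("2001:db8:0:1::5", "2001:db8:0:2::")` moves identifier `::5` to a different locator without adding any encapsulation header, which is the property the notes describe as "network overlay without encap".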
- IPv6 Routing Cache Removal (Martin)
- Problem: A FIB6 tree could potentially grow huge, most entries have the default MTU, a small number of entries have a smaller MTU
- => GC kicks in often and burns CPU
- RTM_GETROUTE takes forever
- Userspace may parse /proc/net/ipv6_route
- Solutions
- Decouple rt6_info from inet_peer cache
- Only create a clone after pmtu_update for the smaller MTU entries
- Next Steps
- fib6 tree uses an rwlock, potential to RCUify
- Sharing of IPv4 and IPv6 route tree is difficult due to subtrees in v6
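
The "only create a clone after pmtu_update" idea can be modelled with a toy table. This is a sketch of the concept only and not the kernel's data structures: `ToyFib6`, `Route`, and the 1500-byte default MTU are all assumptions made for illustration.

```python
import ipaddress
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Route:
    prefix: ipaddress.IPv6Network
    mtu: int = 1500  # assumed default MTU


class ToyFib6:
    def __init__(self):
        self.routes = []      # shared entries, one per prefix
        self.exceptions = {}  # per-destination clones with a smaller MTU

    def add_route(self, prefix, mtu=1500):
        self.routes.append(Route(ipaddress.ip_network(prefix), mtu))

    def _match(self, dst):
        # Longest-prefix match over the shared entries.
        candidates = [r for r in self.routes if dst in r.prefix]
        if not candidates:
            raise LookupError(f"no route to {dst}")
        return max(candidates, key=lambda r: r.prefix.prefixlen)

    def lookup(self, dst):
        dst = ipaddress.ip_address(dst)
        # Destinations without an exception share the prefix entry.
        return self.exceptions.get(dst) or self._match(dst)

    def pmtu_update(self, dst, mtu):
        dst = ipaddress.ip_address(dst)
        base = self._match(dst)
        if mtu < base.mtu:
            # Clone only now, and only for this destination.
            self.exceptions[dst] = replace(base, mtu=mtu)
```

In this model the number of clones tracks the number of destinations with a reduced path MTU rather than the number of destinations seen, which is the growth problem the notes describe.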
- IPv6 and 6LoWPAN only networking for sensor devices (Marcel)
- IoT!!!!
- IPv6 only network over bluetooth with minimal stack size
- Hardware accelerated virtio using switchdev (Varun)
- Acceleration of vhost queues through switchdev based on Linux Bridge
- Fast path from vhost to switchdev/driver managed queues with slow path through Linux
- How to report DMA addresses to vhost?
- Since these are not VFs, there is no protection, needs to use VF/SRIOV model
- Resolution: DMA won't copy into an unsafe address
- Who maps the virt queues?
- Needs to be done in a framework with functionality provided by the driver
- What about stack functionality on top of Linux Bridge bypassed by the fast path?
- Unresolved, needs to be addressed
- GENEVE control planes (John)
- Geneve net_device in, yay!
- Does it make sense to extend the VXLAN multicast code to Geneve?
- Flannel, Weave, Docker libnetwork, ...
- LWT?
- Should work as the VXLAN LWT code is IP tunnel generic