I really, really, really wish that the Linux CSPRNG would quit having its flaws papered over. A fellow submitted a patch to implement the Fortuna CSPRNG years ago, and it wasn't accepted because of a misguided belief in entropy estimation.
I'm not saying that Fortuna is the One True CSPRNG—it's not—but any clean design would be preferable to the current Rube Goldberg mechanism. I'm pretty sure that /dev/random as it currently stands is secure enough, but 'pretty sure' isn't very reassuring.
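For context on why Fortuna sidesteps entropy estimation entirely: events are just spread across 32 pools, and pool i only feeds reseeds at exponentially growing intervals, so no estimate of "how random" an event is ever needs to be made. A heavily simplified sketch (SHA-256 in place of the real AES-CTR generator; the class and method names are invented for illustration, this is not the actual spec or the kernel's design):

```python
import hashlib

NUM_POOLS = 32

class FortunaSketch:
    def __init__(self):
        self.pools = [hashlib.sha256() for _ in range(NUM_POOLS)]
        self.key = b"\x00" * 32
        self.counter = 0
        self.event_count = 0
        self.reseed_count = 0

    def add_entropy(self, source_id, data):
        # Events are distributed round-robin across the 32 pools; no
        # entropy *estimation* is needed, which was the design point in
        # dispute.
        self.pools[self.event_count % NUM_POOLS].update(bytes([source_id]) + data)
        self.event_count += 1

    def reseed(self):
        self.reseed_count += 1
        h = hashlib.sha256(self.key)
        for i in range(NUM_POOLS):
            # Pool i contributes only to every 2**i-th reseed, so a pool
            # an attacker can't fully observe eventually restores security.
            if self.reseed_count % (2 ** i) == 0:
                h.update(self.pools[i].digest())
                self.pools[i] = hashlib.sha256()
        self.key = h.digest()

    def random_bytes(self, n):
        # Simplified generator: hash(key || counter) stream instead of AES-CTR.
        out = b""
        while len(out) < n:
            self.counter += 1
            out += hashlib.sha256(self.key + self.counter.to_bytes(16, "big")).digest()
        return out[:n]
```

The schedule in reseed() is the whole trick: it bounds how much good entropy an attacker can "waste" by flooding the accumulator with known events.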
I've been wanting to implement a dynamic ARP filter (DHCP snooping + ARP filtering) for ages now, but arptables/ebtables just didn't cut it. Hopefully this will be easier/viable now, because I still don't think ArpON sniffs DHCP leases for its mapping (it intercepts and replays ARP requests or something), and it doesn't filter rogue DHCP servers. I'm just amazed ARP spoofing/ARP cache poisoning is still a viable attack vector on home networks in 2014.
It can accept both DHCP and ARP protocols, and will decode them into attribute-value pairs. Those can then be referenced in a policy language, and stored to / read from a database.
I'm the author. :) It's no longer just a RADIUS server. I've been looking for a DHCP / ARP checker for a while, and couldn't find anything useful. Rather than writing something from scratch, I decided it was easier to just add ~2K LoC to FreeRADIUS. I could then leverage the policy language and database integration, so I didn't have to re-write all of that, either.
You need to implement DHCP Snooping at Layer 2 for it to really work, and these days I think all major switch vendors support it. Unless you're building a Linux L2 switch I don't see why you would want to implement DHCP Snooping.
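The core of the "DHCP snooping + ARP filtering" idea is just a binding table: learn IP-to-MAC mappings from DHCP ACKs seen on the trusted port, then drop ARP traffic that contradicts them. A minimal sketch with packet capture and parsing stubbed out (class and method names are invented; a real implementation would sit in the switch or bridge data path):

```python
class ArpInspector:
    """Toy model of DHCP snooping + dynamic ARP inspection."""

    def __init__(self):
        self.bindings = {}  # ip -> mac, learned from DHCP ACKs

    def on_dhcp_ack(self, client_mac, assigned_ip):
        # Only ACKs arriving on the port facing the *trusted* DHCP server
        # should update the table -- that trust boundary is also what
        # blocks rogue DHCP servers on untrusted ports.
        self.bindings[assigned_ip] = client_mac

    def check_arp(self, sender_mac, sender_ip):
        """Return True if an ARP packet is consistent with a known lease."""
        expected = self.bindings.get(sender_ip)
        if expected is None:
            return False  # no lease for this IP -> drop (strict policy)
        return expected == sender_mac
```

A poisoning attempt then looks like an ARP reply whose sender MAC disagrees with the lease table, and gets dropped before it can update anyone's cache.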
> It adds a simple virtual machine to the kernel that is able to execute bytecode to inspect a network packet and make decisions on how that packet should be handled.
I wonder how long it's going to take until someone figures out a way to craft a specific sequence of packets that remotely do something nasty at the kernel level :P
It's no different from the rest of the network stack; most of it consists of "inspecting packets" in one way or another, and security bugs can be (and sometimes are) introduced.
But nftables is actually a big win from a security perspective, because it simplifies the current code (lots of duplicated code goes away) and moves other parts to userspace.
Old netfilter system: 70,000 LoC in kernel + 50,000 in userspace
nftables: 7,000 LoC in kernel + 50,000 in userspace
Also note that it's not really a "virtual machine" comparable to Java; this is how the developers actually describe it:
In a nutshell, nftables provides a pseudo-state machine with 4 general purpose registers of 128 bits and 1 specific purpose register to store verdicts. This pseudo-machine comes with an extensible instruction set, a.k.a. "expressions" in the nftables jargon. The expressions included in this patch provide the basic functionality, they are:
* bitwise: to perform bitwise operations.
* byteorder: to change from host/network endianness.
* cmp: to compare data with the content of the registers.
* counter: to enable counters on rules.
* ct: to store conntrack keys into register.
* exthdr: to match IPv6 extension headers.
* immediate: to load data into registers.
* limit: to limit matching based on packet rate.
* log: to log packets.
* meta: to match metainformation that usually comes with the skbuff.
* nat: to perform Network Address Translation.
* payload: to fetch data from the packet payload and store it into registers.
* reject (IPv4 only): to explicitly close connection, eg. TCP RST.
Using this instruction set, the userspace utility 'nft' can transform the rules expressed in human-readable text representation (using a new syntax, inspired by tcpdump) to nftables bytecode.
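The register-machine model described above can be illustrated with a toy interpreter: expressions read and write a few data registers plus one verdict register, and a failed comparison simply ends evaluation of the rule. The instruction encoding here is invented for illustration, not the real nftables bytecode:

```python
ACCEPT, DROP = "accept", "drop"

def run(expressions, packet):
    regs = [b""] * 4        # 4 general-purpose registers
    verdict = ACCEPT        # 1 verdict register (default policy)
    for op, *args in expressions:
        if op == "payload":            # fetch bytes from the packet
            reg, offset, length = args
            regs[reg] = packet[offset:offset + length]
        elif op == "immediate":        # load constant data into a register
            reg, data = args
            regs[reg] = data
        elif op == "cmp":              # compare register contents with data;
            reg, data = args           # a mismatch ends rule evaluation
            if regs[reg] != data:
                return verdict
        elif op == "verdict":          # store an explicit verdict
            verdict = args[0]
            return verdict
    return verdict

# Toy rule: "drop packets whose first byte is 0x45"
rule = [("payload", 0, 0, 1), ("cmp", 0, b"\x45"), ("verdict", DROP)]
```

The point of the structure is that 'nft' compiles a human-readable rule down to a short, fixed repertoire of such expressions, rather than the kernel carrying one C module per match type as iptables does.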
Anyone heard any news on Google's user mode thread[1] kernel syscalls? I was really excited for that when it was announced but haven't heard a peep about it since.
Great to see the NUMA balancing in. My question has always been: what workloads require NUMA balancing in the first place? If I present the kernel with the same number of threads as cores and keep all data local to a thread, wouldn't the existing approach of allocating on the NUMA node the thread is running on have been enough?
The kernel will balance loaded threads across cores, so with a policy of allocating from the local NUMA node you actually end up with balanced allocations in practice if you run shared-nothing, thread-per-core.
In-memory databases are my day job, so I am pretty interested in cases where things go south because memory isn't balanced. To date it appears no special action was necessary; stuff just ends up balanced across nodes.
That's why it would be great if someone could characterize when balancing is necessary outside of obvious cases like allocating an entire buffer pool from one thread.
It applies to more than DBs. Running VoIP software, we found that just by setting CPU affinity, we got a major increase in performance. The software in question, FreeSWITCH, is inanely threaded in a misguided belief that "more threads = more performance" (well that, and it's also just easier to program). When there are thousands of threads going, keeping them and their data local to one NUMA node or less really makes a huge difference.
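Setting affinity of the kind described above can be done from userspace without touching the application's thread model. A minimal Linux-only sketch using `os.sched_setaffinity` (pinning to the cores of one NUMA node keeps new allocations node-local under the default first-touch policy):

```python
import os

def pin_to_cpus(cpus):
    """Pin the calling process to the given CPUs (intersected with what
    we're allowed to use) and return the resulting affinity set."""
    allowed = os.sched_getaffinity(0)   # 0 = the calling process
    wanted = set(cpus) & allowed
    if wanted:
        os.sched_setaffinity(0, wanted)
    return os.sched_getaffinity(0)

# e.g. pin to the first CPU we are currently allowed to run on:
first = min(os.sched_getaffinity(0))
after = pin_to_cpus({first})
```

In practice you'd pass the CPU list for one NUMA node (e.g. as reported by `lscpu` or libnuma) rather than a single core.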
I doubt Broadwell will have any performance improvements over Haswell at the same price points. You're probably better off buying a cheaper Haswell, then, if you want a cheap Steambox. Intel has pretty much given up on improving overall performance of its chips. IVB was only a 10 percent improvement over SNB, and Haswell only 5 percent over IVB, and the difference in price between new-gen and last-gen is probably more like 30 percent.
Intel’s Ben Widawsky, who works on Intel’s Linux graphics driver efforts, says that “Broadwell graphics bring some of the biggest changes we’ve seen on the execution and memory management side of the GPU… [the changes] dwarf any other silicon iteration during my tenure, and certainly can compete with the likes of the gen3->gen4 changes.”
This, combined with the fully open-source Linux driver for Broadwell, means there is a very good chance it will perform significantly better.
Single-core performance in CPU-intensive applications actually did improve between Nehalem/SNB/IVB/Haswell. [1] is a benchmark of the Dolphin Emulator (mostly single-threaded performance) on a variety of recent CPUs that shows this.
I noticed a couple of commits regarding btrfs. Can anybody summarize them for someone who doesn't know anything about kernels and very little about file systems?
The big improvements to btrfs came in 3.12. The 3.13 changes are rather minor:
a mount option to specify the maximum delay before committing writes to storage (default 30 seconds, no maximum, warning for 300+ seconds -- make sure you have battery coverage for whatever this is set to plus a few seconds, and try not to crash...)
a mount option for emergency use that will force the rebuild of the UUID tree
userspace tools that read FIEMAP_EXTENT_SHARED can now use that on btrfs; no functionality change, really, just making the info available in the same way that ocfs2 does it.
For the block layer update, before anyone gets excited like I did, the paper actually suggests it's not useful to most people at all with current era hardware.
In this paper, we have established that the current design of the Linux block layer does not scale beyond one million IOPS per device. This is sufficient for today's SSDs, but not for tomorrow's. We proposed a new design for the Linux block layer. This design is based on two levels of queues in order to reduce contention and promote thread locality. Our experiments have shown the superiority of our design and its scalability on multi-socket systems. Our multiqueue design leverages the new capabilities of NVM-Express or high-end PCI-E devices, while still providing the common interface and convenience features of the block layer.
This statement should be read as: there's no way to scale the old block layer to future devices. For current SSDs it's already useful, in that it decreases latency and CPU usage for the current generation of drives.
It's currently only enabled for the virtio-blk driver, but there's work underway to make the SCSI layer and all the other drivers use it (patches are already out for the mtip and nvme drivers).
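The two-level design from the paper is easy to picture: each CPU submits into its own software queue with no shared lock, and a second stage moves requests into per-device hardware dispatch queues. A rough sketch (names invented for illustration; real blk-mq maps software queues to hardware queues statically and drives real hardware):

```python
from collections import deque

class MultiQueueBlockLayer:
    def __init__(self, num_cpus, num_hw_queues):
        self.sw_queues = [deque() for _ in range(num_cpus)]       # level 1: per-CPU
        self.hw_queues = [deque() for _ in range(num_hw_queues)]  # level 2: per-device

    def submit(self, cpu, request):
        # Submission touches only this CPU's own queue: no cross-socket
        # lock contention, good cache locality.
        self.sw_queues[cpu].append(request)

    def dispatch(self):
        # Drain software queues into their mapped hardware dispatch queues.
        for cpu, swq in enumerate(self.sw_queues):
            hwq = self.hw_queues[cpu % len(self.hw_queues)]
            while swq:
                hwq.append(swq.popleft())
        return [len(q) for q in self.hw_queues]
```

The old block layer is effectively the degenerate case of one shared queue protected by one lock, which is exactly what stops scaling around a million IOPS.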
Thanks for the clarification. IIRC you are a co-author of the paper, so perhaps you can answer a follow-up question.
What kind of latency or CPU usage change should a typical modern SSD on an amd64 class multicore processor observe when using the new block layer?
Also, correct me if I'm wrong, but since Linux already caches aggressively, SSDs are already way faster than older drives for normal (i.e. ~random access) loads, and RAM is cheap and plentiful these days, I'm guessing that very few applications will honestly be IO-bound enough to see that benefit.
One thread issuing IOs: a reduction of 2x in the IO path latency isn't unusual. The overhead of the code path drops from 5us to around 2us. When there are multiple IO threads, the gain is much higher (up to 38x in the 8-socket setup). Thus, the more complex the workload, the bigger the performance gain.
I don't have any up-to-date numbers on CPU usage. When we did the experiments on the mtip drive, it was around 20% less CPU usage when performing roughly the same IOs.
For a typical workstation workload, the SSD's access times are still too high to feel the reduced latency. A typical modern SSD is around 50-100us for an IO access. The win there will be the lower CPU usage, which frees up resources for other things to do.
Applications are still bound by the round-trip time of getting IOs. Even with more memory, we still have to persist data at intervals to prevent data loss, and everything that helps decrease the overhead is a win.
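Plugging in the numbers above shows why the latency win is invisible on a workstation: with a 50-100us device access time, shaving the kernel path from ~5us to ~2us barely moves end-to-end latency, so the visible win is the reclaimed CPU time.

```python
def overhead_fraction(path_us, device_us):
    """Fraction of one IO round-trip spent in the kernel code path."""
    return path_us / (path_us + device_us)

# Using a 50us device access time from the figures quoted above:
before = overhead_fraction(5, 50)  # old block layer path
after = overhead_fraction(2, 50)   # multiqueue path
```

The kernel path goes from roughly 9% to roughly 4% of the round trip; only when the device itself gets much faster does that difference dominate.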
There are certainly plenty of contexts where SSDs in general provide limited if any performance wins because disk I/O is largely not involved. However, in cases where SSDs are being used for performance reasons, particularly for random reads, I would expect this to make a fair bit of difference.
Indeed. There were even some benchmarks on Phoronix a while ago that hinted that these block layer changes had resulted in some performance regressions. Anyone know if these regressions were tracked down and fixed?