I really, really, really wish that the Linux CSPRNG would quit having its flaws papered over. A fellow submitted a patch to implement the Fortuna CSPRNG years ago, and it wasn't accepted because of a misguided belief in entropy estimation.
I'm not saying that Fortuna is the One True CSPRNG—it's not—but any clean design would be preferable to the current Rube Goldberg mechanism. I'm pretty sure that /dev/random as it currently stands is secure enough, but 'pretty sure' isn't very reassuring.
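For context on why Fortuna sidesteps entropy estimation entirely: events are just spread across 32 pools, and pool i only feeds reseeds at exponentially growing intervals, so no estimate of "how random" an event is ever needs to be made. A heavily simplified sketch (SHA-256 in place of the real AES-CTR generator; the class and method names are invented for illustration, this is not the actual spec or the kernel's design):

```python
import hashlib

NUM_POOLS = 32

class FortunaSketch:
    def __init__(self):
        self.pools = [hashlib.sha256() for _ in range(NUM_POOLS)]
        self.key = b"\x00" * 32
        self.counter = 0
        self.event_count = 0
        self.reseed_count = 0

    def add_entropy(self, source_id, data):
        # Events are distributed round-robin across the 32 pools; no
        # entropy *estimation* is needed, which was the design point in
        # dispute.
        self.pools[self.event_count % NUM_POOLS].update(bytes([source_id]) + data)
        self.event_count += 1

    def reseed(self):
        self.reseed_count += 1
        h = hashlib.sha256(self.key)
        for i in range(NUM_POOLS):
            # Pool i contributes only to every 2**i-th reseed, so a pool
            # an attacker can't fully observe eventually restores security.
            if self.reseed_count % (2 ** i) == 0:
                h.update(self.pools[i].digest())
                self.pools[i] = hashlib.sha256()
        self.key = h.digest()

    def random_bytes(self, n):
        # Simplified generator: hash(key || counter) stream instead of AES-CTR.
        out = b""
        while len(out) < n:
            self.counter += 1
            out += hashlib.sha256(self.key + self.counter.to_bytes(16, "big")).digest()
        return out[:n]
```

The schedule in reseed() is the whole trick: it bounds how much good entropy an attacker can "waste" by flooding the accumulator with known events.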
I've been wanting to implement a dynamic ARP filter (DHCP snooping + ARP filtering) for ages now, but arptables/ebtables just didn't cut it. Hopefully this will be easier/viable now, because I still don't think ArpON sniffs DHCP leases for its mapping (it intercepts and replays ARP requests or something), and it doesn't filter rogue DHCP servers. I'm just amazed ARP spoofing/ARP cache poisoning is still a viable attack vector on home networks in 2014.
It can accept both DHCP and ARP protocols, and will decode them into attribute-value pairs. Those can then be referenced in a policy language, and stored to / read from a database.
I'm the author. :) It's no longer just a RADIUS server. I've been looking for a DHCP / ARP checker for a while, and couldn't find anything useful. Rather than writing something from scratch, I decided it was easier to just add ~2K LoC to FreeRADIUS. I could then leverage the policy language and database integration, so I didn't have to re-write all of that, either.
You need to implement DHCP Snooping at Layer 2 for it to really work, and these days I think all major switch vendors support it. Unless you're building a Linux L2 switch I don't see why you would want to implement DHCP Snooping.
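The core of the "DHCP snooping + ARP filtering" idea is just a binding table: learn IP-to-MAC mappings from DHCP ACKs seen on the trusted port, then drop ARP traffic that contradicts them. A minimal sketch with packet capture and parsing stubbed out (class and method names are invented; a real implementation would sit in the switch or bridge data path):

```python
class ArpInspector:
    """Toy model of DHCP snooping + dynamic ARP inspection."""

    def __init__(self):
        self.bindings = {}  # ip -> mac, learned from DHCP ACKs

    def on_dhcp_ack(self, client_mac, assigned_ip):
        # Only ACKs arriving on the port facing the *trusted* DHCP server
        # should update the table -- that trust boundary is also what
        # blocks rogue DHCP servers on untrusted ports.
        self.bindings[assigned_ip] = client_mac

    def check_arp(self, sender_mac, sender_ip):
        """Return True if an ARP packet is consistent with a known lease."""
        expected = self.bindings.get(sender_ip)
        if expected is None:
            return False  # no lease for this IP -> drop (strict policy)
        return expected == sender_mac
```

A poisoning attempt then looks like an ARP reply whose sender MAC disagrees with the lease table, and gets dropped before it can update anyone's cache.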
> It adds a simple virtual machine to the kernel that is able to execute bytecode to inspect a network packet and make decisions on how that packet should be handled.
I wonder how long it's going to take until someone figures out a way to craft a specific sequence of packets that remotely do something nasty at the kernel level :P
It's no different from the rest of the network stack; most of it consists of "inspecting packets" in one way or another, and security bugs can be (and sometimes are) introduced.
But nftables is actually a big win from a security perspective, because it simplifies the current code (lots of duplicated code goes away) and moves other parts to userspace.
Old netfilter system: 70,000 LoC in kernel + 50,000 in userspace
nftables: 7,000 LoC in kernel + 50,000 in userspace
Also note that it's not really a "virtual machine" comparable to Java; this is how the developers actually describe it:
In a nutshell, nftables provides a pseudo-state machine with 4 general purpose registers of 128 bits and 1 specific purpose register to store verdicts. This pseudo-machine comes with an extensible instruction set, a.k.a. "expressions" in the nftables jargon. The expressions included in this patch provide the basic functionality, they are:
* bitwise: to perform bitwise operations.
* byteorder: to change from host/network endianness.
* cmp: to compare data with the content of the registers.
* counter: to enable counters on rules.
* ct: to store conntrack keys into register.
* exthdr: to match IPv6 extension headers.
* immediate: to load data into registers.
* limit: to limit matching based on packet rate.
* log: to log packets.
* meta: to match metainformation that usually comes with the skbuff.
* nat: to perform Network Address Translation.
* payload: to fetch data from the packet payload and store it into registers.
* reject (IPv4 only): to explicitly close connection, eg. TCP RST.
Using this instruction set, the userspace utility 'nft' can transform the rules expressed in human-readable text representation (using a new syntax, inspired by tcpdump) to nftables bytecode.
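The register-machine model described above can be illustrated with a toy interpreter: expressions read and write a few data registers plus one verdict register, and a failed comparison simply ends evaluation of the rule. The instruction encoding here is invented for illustration, not the real nftables bytecode:

```python
ACCEPT, DROP = "accept", "drop"

def run(expressions, packet):
    regs = [b""] * 4        # 4 general-purpose registers
    verdict = ACCEPT        # 1 verdict register (default policy)
    for op, *args in expressions:
        if op == "payload":            # fetch bytes from the packet
            reg, offset, length = args
            regs[reg] = packet[offset:offset + length]
        elif op == "immediate":        # load constant data into a register
            reg, data = args
            regs[reg] = data
        elif op == "cmp":              # compare register contents with data;
            reg, data = args           # a mismatch ends rule evaluation
            if regs[reg] != data:
                return verdict
        elif op == "verdict":          # store an explicit verdict
            verdict = args[0]
            return verdict
    return verdict

# Toy rule: "drop packets whose first byte is 0x45"
rule = [("payload", 0, 0, 1), ("cmp", 0, b"\x45"), ("verdict", DROP)]
```

The point of the structure is that 'nft' compiles a human-readable rule down to a short, fixed repertoire of such expressions, rather than the kernel carrying one C module per match type as iptables does.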
Anyone heard any news on Google's user mode thread[1] kernel syscalls? I was really excited for that when it was announced but haven't heard a peep about it since.
Great to see the NUMA balancing in. My question has always been: what workloads require NUMA balancing in the first place? If I present the kernel with the same number of threads as cores and keep all data local to a thread, wouldn't the existing approach of allocating on the NUMA node the thread is running on have been enough?
The kernel will balance loaded threads across cores, so with a policy of allocating from the local NUMA node you actually end up with balanced allocations in practice if you run shared-nothing, thread-per-core.
In-memory databases are my day job, so I am pretty interested in cases where things go south because memory isn't balanced. To date it appears no special action was necessary; stuff just ends up balanced across nodes.
That's why it would be great if someone could characterize when balancing is necessary outside of obvious cases like allocating an entire buffer pool from one thread.
It applies to more than DBs. Running VoIP software, we found that just by setting CPU affinity, we got a major increase in performance. The software in question, FreeSWITCH, is inanely threaded in a misguided belief that "more threads = more performance" (well that, and it's also just easier to program). When there are thousands of threads going, keeping them and their data local to one NUMA node or less really makes a huge difference.
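Setting affinity of the kind described above can be done from userspace without touching the application's thread model. A minimal Linux-only sketch using `os.sched_setaffinity` (pinning to the cores of one NUMA node keeps new allocations node-local under the default first-touch policy):

```python
import os

def pin_to_cpus(cpus):
    """Pin the calling process to the given CPUs (intersected with what
    we're allowed to use) and return the resulting affinity set."""
    allowed = os.sched_getaffinity(0)   # 0 = the calling process
    wanted = set(cpus) & allowed
    if wanted:
        os.sched_setaffinity(0, wanted)
    return os.sched_getaffinity(0)

# e.g. pin to the first CPU we are currently allowed to run on:
first = min(os.sched_getaffinity(0))
after = pin_to_cpus({first})
```

In practice you'd pass the CPU list for one NUMA node (e.g. as reported by `lscpu` or libnuma) rather than a single core.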
I doubt Broadwell will have any performance improvements over Haswell at the same price points. You're probably better off buying a cheaper Haswell, then, if you want a cheap Steambox. Intel has pretty much given up on improving overall performance of its chips. IVB was only a 10 percent improvement over SNB, and Haswell only 5 percent over IVB, and the difference in price between new-gen and last-gen is probably more like 30 percent.
Intel’s Ben Widawsky, who works on Intel’s Linux graphics driver efforts, says that “Broadwell graphics bring some of the biggest changes we’ve seen on the execution and memory management side of the GPU… [the changes] dwarf any other silicon iteration during my tenure, and certainly can compete with the likes of the gen3->gen4 changes.”
This, combined with the fully open-source Linux driver for Broadwell, means there is a very good chance it will perform significantly better.
Single-core performance in CPU-intensive applications actually did improve between Nehalem/SNB/IVB/Haswell. [1] is a benchmark of the Dolphin Emulator (mostly single-threaded performance) on a variety of recent CPUs that shows this.
I noticed a couple of commits regarding btrfs. Can anybody summarize them for someone who doesn't know anything about kernels and very little about file systems?
The big improvements to btrfs came in 3.12. The 3.13 changes are rather minor:
a mount option to specify the maximum delay before committing writes to storage (default 30 seconds, no maximum, warning for 300+ seconds -- make sure you have battery coverage for whatever this is set to plus a few seconds, and try not to crash...)
a mount option for emergency use that will force the rebuild of the UUID tree
userspace tools that read FIEMAP_EXTENT_SHARED can now use that on btrfs; no functionality change, really, just making the info available in the same way that ocfs2 does it.
For the block layer update, before anyone gets excited like I did, the paper actually suggests it's not useful to most people at all with current era hardware.
In this paper, we have established that the current design of the Linux block layer does not scale beyond one million IOPS per device. This is sufficient for today's SSDs, but not for tomorrow's. We proposed a new design for the Linux block layer. This design is based on two levels of queues in order to reduce contention and promote thread locality. Our experiments have shown the superiority of our design and its scalability on multi-socket systems. Our multiqueue design leverages the new capabilities of NVM-Express or high-end PCI-E devices, while still providing the common interface and convenience features of the block layer.
This statement should be read as: there's no way to scale the old block layer to future devices. For current SSDs it's already useful, in that it decreases latency and CPU usage for the current generation of drives.
It's currently only enabled for the virtio-blk driver, but there's work underway to make the SCSI layer and all the other drivers use it (patches are already out for the mtip and nvme drivers).
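The two-level design from the paper is easy to picture: each CPU submits into its own software queue with no shared lock, and a second stage moves requests into per-device hardware dispatch queues. A rough sketch (names invented for illustration; real blk-mq maps software queues to hardware queues statically and drives real hardware):

```python
from collections import deque

class MultiQueueBlockLayer:
    def __init__(self, num_cpus, num_hw_queues):
        self.sw_queues = [deque() for _ in range(num_cpus)]       # level 1: per-CPU
        self.hw_queues = [deque() for _ in range(num_hw_queues)]  # level 2: per-device

    def submit(self, cpu, request):
        # Submission touches only this CPU's own queue: no cross-socket
        # lock contention, good cache locality.
        self.sw_queues[cpu].append(request)

    def dispatch(self):
        # Drain software queues into their mapped hardware dispatch queues.
        for cpu, swq in enumerate(self.sw_queues):
            hwq = self.hw_queues[cpu % len(self.hw_queues)]
            while swq:
                hwq.append(swq.popleft())
        return [len(q) for q in self.hw_queues]
```

The old block layer is effectively the degenerate case of one shared queue protected by one lock, which is exactly what stops scaling around a million IOPS.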
Thanks for the clarification. IIRC you are a co-author of the paper, so perhaps you can answer a follow-up question.
What kind of latency or CPU usage change should a typical modern SSD on an amd64 class multicore processor observe when using the new block layer?
Also, correct me if I'm wrong, but since Linux already caches aggressively, SSDs are already way faster than older drives for normal (i.e. ~random access) loads, and RAM is cheap and plentiful these days, I'm guessing that very few applications will honestly be IO-bound enough to see that benefit.
One thread issuing IOs: a reduction of 2x in the IO path latency isn't unusual. The overhead of the code path drops from 5us to around 2us. When there are multiple IO threads, the gain is much higher (up to 38x in the 8-socket setup). Thus, the more complex the workload, the bigger the performance gain.
I don't have any up-to-date numbers on CPU usage. When we did the experiments on the mtip drive, it was around 20% less CPU usage when performing roughly the same IOs.
For a typical workstation workload, the SSD's access times are still too high to feel the reduced latency. A typical modern SSD is around 50-100us for an IO access. The win there will be the lower CPU usage, which frees up resources for other things to do.
Applications are still bound by the round-trip time of getting IOs. Even with more memory, we still have to persist data at intervals to prevent data loss, and everything that helps decrease the overhead is a win.
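Plugging in the numbers above shows why the latency win is invisible on a workstation: with a 50-100us device access time, shaving the kernel path from ~5us to ~2us barely moves end-to-end latency, so the visible win is the reclaimed CPU time.

```python
def overhead_fraction(path_us, device_us):
    """Fraction of one IO round-trip spent in the kernel code path."""
    return path_us / (path_us + device_us)

# Using a 50us device access time from the figures quoted above:
before = overhead_fraction(5, 50)  # old block layer path
after = overhead_fraction(2, 50)   # multiqueue path
```

The kernel path goes from roughly 9% to roughly 4% of the round trip; only when the device itself gets much faster does that difference dominate.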
There are certainly plenty of contexts where SSDs in general provide limited if any performance wins because disk I/O is largely not involved. However, in cases where SSDs are being used for performance reasons, particularly for random reads, I would expect this to make a fair bit of difference.
Indeed. There were even some benchmarks on Phoronix a while ago that hinted that these block layer changes had resulted in some performance regressions. Anyone know if these regressions were tracked down and fixed?