Planet Linux Plumbers Conf

April 06, 2009

Darrick Wong

August 27, 2008

Stephen Hemminger

Exploring transactional filesystems

In order to implement router style semantics, Vyatta allows setting many different configuration variables and then applying them all at once with a commit command. Currently, this is implemented by a combination of shell magic and unionfs. The problem is that keeping unionfs up to date and fixing the resulting crashes is major pain.

There must be better alternatives, current options include:
  • Replace unionfs with aufs which has less users yelling at it and more developers.
  • Use a filesystem like btrfs which has snapshots. This changes the model and makes api's like "what changed?" hard to implement.
  • Move to a pure userspace model using git. The problem here is that git as currently written is meant for users not transactions.
  • Use combination of copy, bind mount, and rsync.
  • Use a database for configuration. This is easier for general queries but is the most work. Conversion from existing format would be a pain.
Looks like a fun/hard problem. Don't expect any resolution soon.

by Linux Network Plumber (noreply@blogger.com) at August 27, 2008 10:20 PM

May 19, 2015

Paul E. McKenney

Dagstuhl Seminar: Compositional Verification Methods for Next-Generation Concurrency

Some time ago, I figured out that there are more than a billion instances of the Linux kernel in use, and this in turn led to the realization that a million-year RCU bug is happening about three times a day across the installed base. This realization has caused me to focus more heavily on RCU validation, which has uncovered a number of interesting bugs. I have also dabbled a bit in formal verification, which has not yet found a bug. However, formal verification might be getting there, and might some day be a useful addition to RCU's regression testing. I was therefore quite happy to be invited to this Dagstuhl Seminar. In what follows, I summarize a few of the presentation. See here for the rest of the presentations.

Viktor Vafeiadis presented his analysis of the C11 memory model, including some “interesting” consequences of data races, where a data race is defined as a situation involving multiple concurrent accesses to a non-atomic variable, at least one of which is a write. One such consequence involves a theoretically desirable “strengthening” property. For example, this property would mean that multiplexing two threads onto a single underlying thread would not introduce new behaviors. However, with C11, the undefined-behavior consequences of data races can actually cause new behaviors to appear with fewer threads, for example, see Slide 7. This suggests the option of doing away with the undefined behavior, which is exactly the option that LLVM has taken. However, this approach requires some care, as can be seen on Slide 19. Nevertheless, this approach seems promising. One important takeaway from this talk is that if you are worried about weak ordering, you need to pay careful attention to reining in the compiler's optimizations. If you are unconvinced, take a look at this! Jean Pichon-Pharabod, Kyndylan Nienhuis, and Mike Dodds presented on other aspects of the C11 memory model.

Martin T. Vechev apparently felt that the C11 memory model was too tame, and therefore focused on event-driven applications, specifically javascript running on Android. This presentation included some entertaining concurrency bugs and their effects on the browser's display. Martin also discussed formalizing javascript's memory model.

Hongjin Liang showed that ticket locks can provide starvation freedom given a minimally fair scheduler. This provides a proof point for Björn B. Brandenburg's dissertation, which analyzed the larger question of real-time response from lock-based code. It should also provide a helpful corrective to people who still believe that non-blocking synchronization is required.

Joseph Tassarotti presented a formal proof of the quiescent-state based reclamation (QSBR) variant of userspace RCU. In contrast to previous proofs, this proof did not rely on sequential consistency, but instead leveraged a release-acquire memory model. It is of course good to see researchers focusing their tools on RCU! That said, when a researcher asked me privately whether I felt that the proof incorporated realistic assumptions, I of course could not resist saying that since they didn't find any bugs, the assumptions clearly must have been unrealistic.

My first presentation covered what would be needed for me to be able to use formal verification as part of Linux-kernel RCU's regression testing. As shown on slide 34, these are:


  1. Either automatic translation or no translation required. After all, if I attempt to manually translate Linux-kernel RCU to some special-purpose language every release, human error will make its presence known.
  2. Correctly handle environment, including the memory model, which in turn includes compiler optimizations.
  3. Reasonable CPU and memory overhead. If these overheads are excessive, RCU is better served by simple stress testing.
  4. Map to source code lines containing the bug. After all, I already know that there are bugs—I need to know where they are.
  5. Modest input outside of source code under test. The sad fact is that a full specification of RCU would be at least as large as the implementation, and also at least as buggy.
  6. Find relevant bugs. To see why this is important, imagine that some tool finds 100 different million-year bugs and I fix them all. Because roughly one of six fixes introduces a bug, and because that bug is likely to reproduce in far less than a million years, this process has likely greatly reduced the robustness of the Linux kernel.


I was not surprised to get some “frank and honest” feedback, but I was quite surprised (but not at all displeased) to learn that some of the feedback was of the form “we want to see more C code.” After some discussion, I provided just that.

May 19, 2015 07:30 PM

May 03, 2015

Darrick Wong

Stupid Recipe for Creating LVM2 SSD Cache

Silly recipe for creating a SSD cache with LVM without wasting space on the SSD device:


# pvcreate /dev/slowdisk
# pvcreate /dev/fastdisk
# vgcreate plum_disk_cache /dev/slowdisk /dev/fastdisk
# lvcreate -n origin plum_disk_cache -l $size_of_slowdisk /dev/slowdisk
# lvcreate -n cache plum_disk_cache -l $size_of_fastdisk /dev/fastdisk
# lvconvert --type cache --cachemode writeback --cachepool plum_disk_cache/cache plum_disk_cache/origin --chunksize 512k -v --poolmetadataspare n
# lvreduce plum_disk_cache/cache -l -6 (or however many it claims to need)
# lvconvert --type cache --cachemode writeback --cachepool plum_disk_cache/cache plum_disk_cache/origin --chunksize 512k -v --poolmetadataspare n

May 03, 2015 06:50 AM

April 23, 2015

Paul E. McKenney

Verification Challenge 5: Uses of RCU

This is another self-directed verification challenge, this time to validate uses of RCU instead of validating the RCU implementations as in earlier posts. As you can see from Verification Challenge 4, the logic expression corresponding even to the simplest Linux-kernel RCU implementation is quite large, weighing in at tens of thousands of variables and hundreds of thousands of clauses. It is therefore worthwhile to look into the possibility of a trivial model of RCU that could be used for verification.

Because logic expressions do not care about cache locality, memory contention, energy efficiency, CPU hotplug, and a host of other complications that a Linux-kernel implementation must deal with, we can start with extreme simplicity. For example:

 1 static int rcu_read_nesting_global;
 2 
 3 static void rcu_read_lock(void)
 4 {
 5   (void)__sync_fetch_and_add(&rcu_read_nesting_global, 2);
 6 }
 7 
 8 static void rcu_read_unlock(void)
 9 {
10   (void)__sync_fetch_and_add(&rcu_read_nesting_global, -2);
11 }
12 
13 static inline void assert_no_rcu_read_lock(void)
14 {
15   BUG_ON(rcu_read_nesting_global >= 2);
16 }
17 
18 static void synchronize_rcu(void)
19 {
20   if (__sync_fetch_and_xor(&rcu_read_nesting_global, 1) < 2)
21     return;
22   SET_NOASSERT();
23   return;
24 }


The idea is to reject any execution in which synchronize_rcu() does not wait for all readers to be done. As before, SET_ASSERT() sets a variable that suppresses all future assertions.

Please note that this model of RCU has some shortcomings:


  1. There is no diagnosis of rcu_read_lock()/rcu_read_unlock() misnesting. (A later version of the model provides limited diagnosis, but under #ifdef CBMC_PROVE_RCU.)
  2. The heavyweight operations in rcu_read_lock() and rcu_read_unlock() result in artificial ordering constraints. Even in TSO systems such as x86 or s390, a store in a prior RCU read-side critical section might be reordered with loads in later critical sections, but this model will act as if such reordering was prohibited.
  3. Although synchronize_rcu() is permitted to complete once all pre-existing readers are done, in this model it will instead wait until a point in time at which there are absolutely no readers, whether pre-existing or new. Therefore, this model's idea of an RCU grace period is even heavier weight than in real life.


Nevertheless, this approach will allow us to find at least some RCU-usage bugs, and it fits in well with cbmc's default fully-ordered settings. For example, we can use it to verify a variant of the simple litmus test used previously:

 1 int r_x;
 2 int r_y;
 3 
 4 int x;
 5 int y;
 6 
 7 void *thread_reader(void *arg)
 8 {
 9   rcu_read_lock();
10   r_x = x;
11 #ifdef FORCE_FAILURE_READER
12   rcu_read_unlock();
13   rcu_read_lock();
14 #endif
15   r_y = y;
16   rcu_read_unlock();
17   return NULL;
18 }
19 
20 void *thread_update(void *arg)
21 {
22   x = 1;
23 #ifndef FORCE_FAILURE_GP
24   synchronize_rcu();
25 #endif
26   y = 1;
27   return NULL;
28 }
29 
30 int main(int argc, char *argv[])
31 {
32   pthread_t tr;
33 
34   if (pthread_create(&tr, NULL, thread_reader, NULL))
35     abort();
36   (void)thread_update(NULL);
37   if (pthread_join(tr, NULL))
38     abort();
39 
40   BUG_ON(r_y != 0 && r_x != 1);
41   return 0;
42 }


This model has only 3,032 variables and 8,844 clauses, more than an order of magnitude smaller than for the Tiny RCU verification. Verification takes about half a second, which is almost two orders of magnitude faster than the 30-second verification time for Tiny RCU. In addition, the model successfully flags several injected errors. We have therefore succeeded in producing a simpler and faster model approximating RCU, and that can handle multi-threaded litmus tests.

A natural next step would be to move to litmus tests involving linked lists. Unfortunately, there appear to be problems with cbmc's handling of pointers in multithreaded situations. On the other hand, cbmc's multithreaded support is quite new, so hopefully there will be fixes for these problems in the near future. After fixes appear, I will give the linked-list litmus tests another try.

In the meantime, the full source code for these models may be found here.

April 23, 2015 05:49 PM

January 15, 2014

Greg KH

kdbus details

Now that linux.conf.au is over, there has been a bunch of information running around about the status of kdbus and the integration of it with systemd. So, here’s a short summary of what’s going on at the moment.

Lennart Poettering gave a talk about kdbus at linux.conf.au. The talk can be viewed here, and the slides are here. Go read the slides and watch the talk, odds are, most of your questions will be answered there already.

For those who don’t want to take the time watching the talk, lwn.net wrote up a great summary of the talk, and that article is here. For those of you without a lwn.net subscription, what are you waiting for? You’ll have to wait two weeks before it comes out from behind the paid section of the website before reading it, sorry.

There will be a systemd hack-fest a few days before FOSDEM, where we should hopefully pound out the remaining rough edges on the codebase and get it ready to be merged. Lennart will also be giving his kdbus talk again at FOSDEM if anyone wants to see it in person.

The kdbus code can be found in two places, both on google code, and on github, depending on where you like to browse things. In a few weeks we’ll probably be creating some patches and submitting it for inclusion in the main kernel, but more testing with the latest systemd code needs to be done first.

If you want more information about the kdbus interface, and how it works, please see the kdbus.txt file for details.

Binder vs. kdbus

A lot of people have asked about replacing Android’s binder code with kdbus. I originally thought this could be done, but as time has gone by, I’ve come to the conclusion that this will not happen with the first version of kdbus, and possibly can never happen.

First off, go read that link describing binder that I pointed to above, especially all of the links to different resources from that page. That should give you more than you ever wanted to know about binder.

Short answer

Binder is bound to the CPU, D-Bus (and hence kdbus), is bound to RAM.

Long answer

Binder

Binder is an interface that Android uses to provide synchronous calling (CPU) from one task to a thread of another task. There is no queueing involved in these calls, other than the caller process is suspended until the answering process returns. RAM is not interesting besides the fact that it is used to share the data between the different callers. The fact that the caller process gives up its CPU slice to the answering process is key for how Android works with the binder library.

This is just like a syscall, and it behaves a lot like a mutex. The communicating processes are directly connected to each other. There is an upper limit of how many different processes can be using binder at once, and I think it’s around 16 for most systems.

D-Bus

D-Bus is asynchronous, it queues (RAM) messages, keeps the messages in order, and the receiver dequeues the messages. The CPU does not matter at all other than it is used to do the asynchronous work of passing the RAM around between the different processes.

This is a lot like network communication protocols. It is a very “disconnected” communication method between processes. The upper limit of message sizes and numbers is usually around 8Mb per connection and a normal message is around 200-800 bytes.

Binder

The model of Binder was created for a microkernel-like device (side note, go read this wonderful article about the history of Danger written by one of the engineers at that company for a glimpse into where the Android internals came from, binder included.) The model of binder is very limited, inflexible in its use-cases, but very powerful and extremely low-overhead and fast. Binder ensures that the same CPU timeslice will go from the calling process into the called process’s thread, and then come back into the caller when finished. There is almost no scheduling involved, and is much like a syscall into the kernel that does work for the calling process. This interface is very well suited for cheap devices with almost no RAM and very low CPU resources.

So, for systems like Android, binder makes total sense, especially given the history of it and where it was designed to be used.

D-Bus

D-Bus is a create-store-forward, compose reply and then create-store-forward messaging model which is more complex than binder, but because of that, it is extremely flexible, versatile, network transparent, much easier to manage, and very easy to let fully untrusted peers take part of the communication model (hint, never let this happen with binder, or bad things will happen…) D-Bus can scale up to huge amounts of data, and with the implementation of kdbus it is possible to pass gigabytes of buffers to every connection on the bus if you really wanted to. CPU-wise, it is not as efficient as binder, but is a much better general-purpose solution for general-purpose machines and workloads.

CPU vs. RAM

Yes, it’s an over simplification of a different set of complex IPC methods, but these 3 words should help you explain the differences between binder and D-Bus and why kdbus isn’t going to be able to easily replace binder anytime soon.

Never say never

Ok, before you start to object to the above statements, yes, we could add functionality to kdbus to have some blocking ioctl calls that implement something like: write question -> block for reply and read reply one answer for the request side, and then on the server side do: write answer -> block in read That would get kdbus a tiny bit closer to the binder model, by queueing stuff in RAM instead of relying on a thread pool.

That might work, but would require a lot of work on the binder library side in Android, and as a very limited number of people have write access to that code (they all can be counted on one hand), and it’s a non-trivial amount of work for a core function of Android that is working very well today, I don’t know if it will ever happen.

But anything is possible, it’s just software you know…

Thanks

Many thanks to Kay Sievers who came up with the CPU vs. RAM description of binder and D-Bus and whose email I pretty much just copied into this post. Also thanks to Kay and Lennart for taking the time and energy to put up with my silly statements about how kdbus could replace binder, and totally proving me wrong, sorry for having you spend so much time on this, but I now know you are right.

Also thanks to Daniel Mack and Kay for doing so much work on the kdbus kernel code, that I don’t think any of my original implementation is even present anymore, which is probably a good thing. Also thanks to Tejun Heo for help with the memfd implementation and cgroups help in kdbus.

January 15, 2014 08:57 PM

September 11, 2013

Greg KH

binary blobs to C structures

Sometimes you don’t have access to vim’s wonderful xxd tool, and you need to use it to generate some .c code based on a binary file. This happened to me recently when packaging up the EFI signing tools for Gentoo. Adding a build requirement of vim for a single autogenerated file was not an option for some users, so I created a perl version of the xxd -i command line tool.

This works because everyone has perl in their build systems, whether they like it or not. Instead of burying it in the efitools package, here’s a copy of it for others to use if they want/need it.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
#!/usr/bin/env perl
#
# xxdi.pl - perl implementation of 'xxd -i' mode
#
# Copyright 2013 Greg Kroah-Hartman <gregkh@linuxfoundation.org>
# Copyright 2013 Linux Foundation
#
# Released under the GPLv2.
#
# Implements the "basic" functionality of 'xxd -i' in perl to keep build
# systems from having to build/install/rely on vim-core, which not all
# distros want to do.  But everyone has perl, so use it instead.

use strict;
use warnings;
use File::Slurp qw(slurp);

my $indata = slurp(@ARGV ? $ARGV[0] : \*STDIN);
my $len_data = length($indata);
my $num_digits_per_line = 12;
my $var_name;
my $outdata;

# Use the variable name of the file we read from, converting '/' and '.
# to '_', or, if this is stdin, just use "stdin" as the name.
if (@ARGV) {
        $var_name = $ARGV[0];
        $var_name =~ s/\//_/g;
        $var_name =~ s/\./_/g;
} else {
        $var_name = "stdin";
}

$outdata .= "unsigned char $var_name\[] = {";

# trailing ',' is acceptable, so instead of duplicating the logic for
# just the last character, live with the extra ','.
for (my $key= 0; $key < $len_data; $key++) {
        if ($key % $num_digits_per_line == 0) {
                $outdata .= "\n\t";
        }
        $outdata .= sprintf("0x%.2x, ", ord(substr($indata, $key, 1)));
}

$outdata .= "\n};\nunsigned int $var_name\_len = $len_data;\n";

binmode STDOUT;
print {*STDOUT} $outdata;

Yes, I know I write perl code like a C programmer, that’s not an insult to me.

September 11, 2013 10:22 PM

September 03, 2009

Valerie Aurora

Carbon METRIC BUTTLOAD print

I just read Charlie Stross's rant on reducing his household's carbon footprint. Summary: He and his wife can live a life of monastic discomfort, wearing moldy scratchy 10-year-old bamboo fiber jumpsuits and shivering in their flat - or, they can cut out one transatlantic flight per year and achieve the equivalent carbon footprint reduction.

I did a similar analysis back around 2007 or so and had the same result: I've got a relatively trim carbon footprint compared to your average first-worlder, except for the air travel that turns it into a bloated planet-eating monster too extreme to fall under the delicate term "footprint." Like Charlie, I am too practical, too technophilic, and too hopeful to accept that the only hope of saving the planet is to regress to third world living standards (fucking eco-ascetics!). I decided that I would only make changes that made my life better, not worse - e.g., living in a walkable urban center (downtown Portland, now SF). But the air travel was a stumper. I liked traveling, and flying around the world for conferences is a vital component of saving the world through open source. Isn't it? Isn't it?

Two things happened that made me re-evaluate my air travel philosophy. One, I started a file systems consulting business and didn't have a lot of spare cash to spend on fripperies. Two, I hurt my back and sitting became massively uncomfortable (still recovering from that one). So I cut down on the flying around the world to Linux conferences involuntarily.

You know what I discovered? I LOVE not flying around the world for Linux conferences. I love taking only a few flights a year. I love flying mostly in the same time zone (yay, West coast). I love having the energy to travel for fun because I'm not all dragged out by the conference circuit. I love hanging out with my friends who live in the same city instead of missing out on all the parties because I'm in fucking Venezuela instead.

Save the planet. Burn your frequent flyer card.

September 03, 2009 07:04 AM

March 04, 2013

Twitter

March 01, 2013

Twitter

February 18, 2009

Stephen Hemminger

Parallelizing netfilter

The Linux networking receive performance has been mostly single threaded until the advent of MSI-X and multiqueue receive hardware. Now with many cards, it is possible to be processing packets on multiple CPU's and cores at once. All this is great, and improves performance for the simple case.

But most users don't just use simple networking. They use useful features like netfilter to do firewalling, NAT, connection tracking and all other forms of wierd and wonderful things. The netfilter code has been tuned over the years, but there are still several hot locks in the receive path. Most of these are reader-writer locks which are actually the worst kind, much worse than a simple spin lock. The problem with locks on modern CPU's is that even for the uncontested case, a lock operation means a full-stop cache miss.

With the help of Eric Duzmet, Rick Jones, Martin Josefsson and others, it looks like there is a solution to most of these. I am excited to see how it all pans out but it could mean a big performance increase for any kind of netfilter packet intensive processing. Stay tuned.

by Linux Network Plumber (noreply@blogger.com) at February 18, 2009 05:51 AM

September 25, 2010

Andy Grover

Plumbers Down Under

<p>Since the original <a href="http://www.linuxplumbersconf.org/">Linux Plumbers Conference</a> drew much inspiration from <a href="http://lca2011.linux.org.au/">LCA</a>'s continuing success, it's cool to see some of what Plumbers has done be seen as <a href="http://airlied.livejournal.com/73491.html">worthy of emulating at next year's LCA</a>!</p><p>LCA seems like a great opportunity to specifically try to make progress on cross-project issues. It's quite well-attended so it's likely the people you need in the room to make a decision will be <em>in the room</em>.</p>

by andy.grover at September 25, 2010 01:50 PM

September 10, 2010

Andy Grover

Increasing office presence for remote workers

<p>I work from home. My basement, actually. I recently read an article in the Times about <a href="http://www.nytimes.com/2010/09/05/science/05robots.html?_r=1&amp;pagewanted=1">increasing the office presence of remote employees with robots</a>. Pretty interesting. How much does one of those robo-Beltzners cost? $5k? This is a neat idea but it's still not released so who knows.<br /><br />I've been thinking about other options for establishing a stronger office presence for myself. Recently I bought a webcam. If I used this to broadcast me, sitting at my desk on Ustream or Livestream, that would certainly make it so my coworkers (and the rest of the world) could see what I was up to, every second of the workday. This is actually a lot <i>more</i> exposure than an office worker, even in a cubicle, would expect. If I'm in an office cube, I might have people stop by, but I'll know they're there, and they won't <i>always</i> be there.&nbsp; There is still generally solitude and privacy to concentrate on the code and be productive. I'm currently trying something that I think is closer to the balance of a real office:<br /><ul><li>Take snapshots from webcam every 15 minutes<br /></li><li>Only during normal working hours</li><li>Give 3 second audible warning before capturing</li><li>Upload to an intranet webserver</li></ul>I haven't found this to be too much of an imposition -- in fact, the quarter-hourly beeps are somewhat like a clock chime.<br /><br />In the beginning, it's hard to resist mugging for the camera, but that passes:<br /><img style="max-width: 800px;" src="http://oss.oracle.com/%7Eagrover/pics/blog/whassup.jpg" alt="whassup???" height="240" width="320" /><br />Think about how this is better than irc or IM, both of which <i>do</i> have activity/presence indicators, but which either aren't used, or poorly implemented and often wrong. How much more likely are you, as a colleague of mine, to IM, email, video chat, or call me if you can see I'm at my desk and working? No more "around?" messages needed. You could even see if I'm looking cheerful, or perhaps otherwise indisposed, heh heh:<br /><img style="max-width: 800px;" src="http://oss.oracle.com/%7Eagrover/pics/blog/cat1.jpg" alt="hello kitty" height="240" width="320" /><br />On a technical note, although there were many Debian packages that kind-of did what I wanted, it turned out to be surprisingly easy to roll my own in about <a href="http://github.com/agrover/pysnapper/blob/master/webcam.py">20 lines of Python</a>.<br /><img style="max-width: 800px;" src="http://oss.oracle.com/%7Eagrover/pics/blog/working.jpg" alt="working hard." height="240" width="320" /><br />Anyways, just something I've been playing around with, while I wait for my robo-avatar to be set up down at HQ...</p>

by andy.grover at September 10, 2010 05:20 PM

November 08, 2009

Valerie Aurora

Migrated to WordPress

My LiveJournal blog name - valhenson - was the last major holdover from my old name, Val Henson. I got a new Social Security card, passport, and driver's license with my new name several months ago, but migrating my blog? That's hard! Or something. I finally got around to moving to a brand-spanking-new blog at WordPress:

Valerie Aurora's blog

Update your RSS reader with the above if you still want to read my blog - I won't be republishing my posts to my new blog on this LiveJournal blog.

If you're aware of any other current instances of "Val Henson" or "Valerie Henson," let me know! I obviously can't change my name on historical documents, like research papers or interviews, but if it's vaguely real-time-ish, I'd like to update it.

One web page I'm going to keep as Val Henson for historical reasons is my Val Henson is a Man joke. Several of the pages on my web site were created after the fact as vehicles for amusing pictures or graphics I had lying around. In this case, my friend Dana Sibera created a pretty damn cool picture of me with a full beard and I had to do something with it.



It's doubly wild now that I have such short hair.

November 08, 2009 11:36 PM