Why We Should All Test the New Linux Kernel

I explain how and why we can improve the quality of the Linux operating system kernel by testing it on our own computers.

Note - This was originally posted at Advogato as Why We Should All Test the New Linux Kernel shortly before the release of Linux 2.4.0. After it was posted, several people wrote in to point out errors I had made or to suggest helpful additions, but once an article is posted at Advogato it can't be edited, only replied to. I decided to post an updated version of the article here on the Linux Quality Database with the needed corrections folded in. I wasn't sure whether to update it to reflect the current version of the kernel. To keep with the original spirit of the article I have left the mention of 2.4.0-prerelease in the intro, but as I write this the current stable kernel is 2.4.4 and the test patch from Alan Cox is 2.4.4-ac11.

Introduction

Ladies and Gentlemen, we are approaching an important point in the history of Free Software - the imminent release of a major revision to the Linux kernel. Because the kernel is the foundation of the systems the vast majority of us depend on for our own work, its correctness is vital to the proper functioning of the programs those of us on Advogato develop. Please grab the latest 2.4.0-prerelease sources from the Kernel.org mirror site nearest you and give them a thorough test on your equipment, and with your programs.

The current production kernel is 2.2.18, and for a long time the development kernels were the 2.3 series. The 2.4.0-test1 kernel came out earlier in the year and not only provides many new features to the Linux system but is also a major rearchitecting of the kernel.

I've been working with the 2.4.0-test and 2.4.0-prerelease kernels for most of this year and found that in general they work pretty well. I think I can safely say that they will be fine when used by someone who is a programmer or competent to administer their own Linux system. This is not to say that the kernel is yet trouble-free, but it is good enough to be worth using by anyone likely to be reading this article.

(Note - at this date I've been using 2.4.3 and 2.4.4 on a production development machine and netatalk/samba fileserver and I've had no trouble with them.)

The problem (and the reason that I post this) is that once 2.4.0 is released, it is likely to be rushed into production use on a lot of end-user systems, many with configurations that have not been adequately tested. I'm hoping that more widespread testing will head off such problems.

This is in part because the users will download the sources and build the kernel themselves because it has features or fixes they need, or because many of the distributions will rush to include it so they can be perceived as "competitive", either with each other or with non-free operating systems.

But the people doing the most active work with the 2.4.0 kernel are the kernel developers themselves, or those few like me who are just working to test it. I don't think there's a tremendous number of people taking the trouble to test it, and even those who spend the most time at it (the kernel developers) often have limited resources for trying different configurations. (I have heard of distributions that prematurely shipped systems with prerelease kernels - something I consider irresponsible.)

A lot of people will have their very first experience with Linux by purchasing a $29 CD distribution "just to check it out". For many of them, the brand-new 2.4.0 kernel will be what they get, and it's very important that they have a positive experience with it. Every bug found by a Linux Quality reader is a bug that's not found by a couple of thousand novice Linux users who might not come back for more.

It's very important to have the kernel tested on a wide variety of configurations and under the load of a lot of different applications.

For comparison of what's done in the commercial world, I used to work at Apple, at one time as a QA engineer and at another time as an OS engineer doing system debugging. I tested MacTCP, Apple's older TCP/IP stack, and for that I had about three dozen machines in a lab and worked full-time there about a year doing nothing but testing, writing test plans and writing test tools - all that to QA what was then (1990) considered an unimportant component of the system by most of the company.

At the time I was an OS engineer in the mid-90's, I don't know how many QA staff Apple had, but I would guess it numbered 500 or greater, all working full-time, year round to test a system that had far fewer hardware configurations in question than the Linux kernel is expected to support - and note that Apple maintains extremely tight control over the hardware, where Linux is expected to run on just about anything from Internet Appliances to ancient 386 boxes to mainframes.

The kernel has special needs for testing that require it to be done by a wide variety of people for several reasons:

It is distributed as highly configurable source code, so it needs to be tried out with lots of different options to try to find combinations that stimulate bugs
It supports a number of different instruction set architectures, and different CPU grades for a given architecture, and even mainframe processors (the S/390) - no one owns all those different machines
It supports a very large number of hardware devices, which need to be actually installed to do anything interesting. There may be conflicts between different devices that can only be found out by widespread testing
Being an interface between user applications and the hardware, the kernel needs to be tested by running lots of different user-mode programs on it, so that a lot of combinations of system calls and other loads on the system get tested - that's why you should test your application on the new kernel
Failure of the kernel in a production system usually has a worse impact than failure of a user mode program

If the kernel is flaky, the obvious risks are that your machine can crash, the filesystem can get corrupted, and users can lose data along with the use of their machine, either until they reboot or until the problem is resolved. What could be worse is a buggy kernel that doesn't crash but causes incorrect functioning of an otherwise reliable program - this kind of bug is insidious and can be maddening to track down.

Working with Test Kernels

There are a few things you'll need to know to get working with your new kernel.

Usually you want to report bugs to the linux-kernel mailing list at linux-kernel@vger.kernel.org. Note the new mailserver - vger.rutgers.edu apparently had a meltdown.

You probably don't want to actually subscribe to the linux-kernel list because of the volume of mail. I suggest reading the list off of an archive, of which there are many. I like this archive. You can find other archives at Google.

It is of course good form to read the linux-kernel mailing list FAQ.

Once you're connected to an archive server the files to look for will be in pub/linux/kernel/v2.4

The full list of kernel archive mirrors can be found at http://www.kernel.org/mirrors/

Here are links to the download directories for current kernels and utilities at some of the mirrors:

Download Links for Current Kernel Versions
Country	Stable	Testing	AC Test Patches	modutils	util-linux
USA	Stable	Testing	AC	modutils	util-linux
Canada	Stable	Testing	AC	modutils	util-linux
United Kingdom	Stable	Testing	AC	modutils	util-linux
Germany	Stable	Testing	AC	modutils	util-linux
France	Stable	Testing	AC	modutils	util-linux
Japan	Stable	Testing	AC	modutils	util-linux
China	Stable	Testing	AC	modutils	util-linux
South Korea	Stable	Testing	AC	modutils	util-linux
India	Stable	Testing	AC	modutils	util-linux
Saudi Arabia	Stable	Testing	AC	modutils	util-linux
Russia	Stable	Testing	AC	modutils	util-linux
South Africa	Stable	Testing	AC	modutils	util-linux
Brazil	Stable	Testing	AC	modutils	util-linux
Mexico	Stable	Testing	AC	modutils	util-linux

You'll only need to download the whole kernel source once, then you can download and apply the much smaller patches when they come out. You don't have to try to keep up with all the patches; contribute at a pace that's appropriate for you.

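If you download the .bz2 files (which are smaller) and your version of tar has built-in support for bzip2 format, use tar xvfpy filename.bz2 to extract them (the letter that selects bzip2 varies between tar versions; newer releases of GNU tar use j instead of y). Otherwise use bunzip2 to unpack them, then tar -xvpf to extract them. If you download the .tar.gz files, use tar -xzvpf to extract and uncompress them at the same time. Note that when the options are written with a leading dash, f must come last, because the archive name is taken from whatever follows it.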
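The unpacking step can be sketched as a small shell function. This is only an illustration: the name unpack_kernel is made up, and piping through bunzip2 or gunzip is one way to avoid depending on which compression flags your tar understands.

```shell
# Unpack a kernel source archive, choosing the decompressor from the suffix.
# Usage: unpack_kernel ARCHIVE [DESTDIR]
# (unpack_kernel is a hypothetical helper name, not a standard tool.)
unpack_kernel() {
    archive=$1
    dest=${2:-.}
    mkdir -p "$dest" || return 1
    case "$archive" in
        *.tar.bz2)
            # Decompress to standard output and let tar read it from the pipe.
            bunzip2 -c "$archive" | tar -xpf - -C "$dest" ;;
        *.tar.gz|*.tgz)
            gunzip -c "$archive" | tar -xpf - -C "$dest" ;;
        *)
            echo "unpack_kernel: unrecognized archive type: $archive" >&2
            return 1 ;;
    esac
}
```

Calling it with an archive name and a directory of your choice leaves the "linux" directory inside that directory; with no second argument it unpacks into the current directory.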

When you untar the sources, a directory called "linux" will be created. I won't go into how to configure and build the kernel; for that, the Kernel Newbies website has the best information.

I suggest you not untar the source in /usr/src - if you do, be sure to rename the existing /usr/src/linux or you'll make a mess. /usr/src/linux might be a symlink; if so, and you really want the sources in /usr/src, then remove the symlink, untar the sources, rename the linux directory to linux-version, and make a new symlink. But you'll quickly find that testing lots of kernel versions uses lots of disk space, and it's likely that you don't have that much spare space in /usr; I make a directory called /home/admin and keep most of my kernels down in there. My /usr/src/linux directory is either for the kernel that came with my distro or a later kernel once I've decided to stay with one version for a while.

The one big gotcha I ever found was that to change from running a 2.2 kernel to a 2.4 kernel I needed a new set of modutils, the programs that manage the kernel modules. Without them you'll get a lot of undefined symbols in your modules and your modules won't load right (the new modutils seem to work OK with old kernels). You'll find the new modutils on your local mirror server in pub/linux/utils/kernel/modutils/v2.4

To ensure full compatibility, you should check the required version numbers for different packages in the file linux/Documentation/. You may find, as I did, that some things work OK at first with the older programs that came with your distribution but break later on. Before reporting a bug be sure that you've got the relevant updates installed. Here are the required software versions for programs to be used with kernel 2.4.4:

Minimum Software Versions for Use with Linux 2.4.2
Package	Minimum Version	How to Check Version
Gnu C	2.91.66	gcc --version
Gnu make	3.77	make --version
binutils	2.9.1.0.25	ld -v
util-linux	2.10o	fdformat --version
modutils	2.4.2	insmod -V
e2fsprogs	1.19	tune2fs
reiserfsprogs	3.x.0j	reiserfsck 2>&1|grep reiserfsprogs
pcmcia-cs	3.1.21	cardmgr -V
PPP	2.4.0	pppd --version
isdn4k-utils	3.1pre	isdnctrl 2>&1|grep version

Be sure to check the Changes file with each new kernel that you get to see if you need to update something. Of course, if you don't use some software at all, you don't need to have the update - for example, most desktop machines won't need pcmcia-cs, and you'll only need reiserfsprogs if you are using the ReiserFS journaled filesystem.
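Comparing an installed version against a required minimum by eye gets tedious, so here is a rough sketch of doing it in the shell. The function name version_at_least is made up for illustration, and it assumes your sort supports the -V (version sort) option of GNU sort. It only makes sense for plain dotted numbers; versions with letters in them, such as 2.10o or 3.1pre, are better checked by hand.

```shell
# Succeed (exit status 0) when version HAVE is at least version NEED.
# Usage: version_at_least HAVE NEED
# Assumes GNU sort, whose -V option orders dotted version numbers.
version_at_least() {
    have=$1
    need=$2
    # The lower of the two versions sorts first; if that is NEED
    # (or the two are equal), HAVE is new enough.
    lowest=$(printf '%s\n%s\n' "$need" "$have" | sort -V | head -n 1)
    [ "$lowest" = "$need" ]
}

# Example with hand-typed numbers: is a make 3.76 new enough for 3.77?
if ! version_at_least 3.76 3.77; then
    echo "GNU make 3.76 is older than the required 3.77"
fi
```

The version you pass as HAVE is whatever the command in the "How to Check Version" column reports on your system.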

If you want to contribute the most, try to download and apply the patches that come out. If you have a specific problem and someone posts a patch for it on the mailing list, you can grab the patch out of the email and apply it. Alternatively you can get the compilations that are distributed by Linus or Alan Cox, which contain all of the submitted patches that they've approved of. Note that a submitted patch that fixes something doesn't always get included in the new compilations; when that happens you need to politely remind the kernel developers that the problem remains, and maybe resubmit the patch.

If you've got a patch named patch-2.4-prerelease-ac5 then you apply it to the 2.4.0-prerelease kernel sources by cd'ing into the linux directory (kernel source top level) and executing:

patch -p1 < patch-2.4-prerelease-ac5

Note that patch takes its input from standard input rather than as a command line parameter - don't forget to redirect with <
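If you are nervous about a patch leaving your tree half-modified, one cautious approach is to try it first without changing anything. The sketch below assumes GNU patch, which has a --dry-run option; the function name apply_kernel_patch is made up for illustration.

```shell
# Apply a patch to a kernel tree only if a trial run succeeds.
# Usage: apply_kernel_patch SOURCE_DIR PATCH_FILE
# (apply_kernel_patch is a hypothetical helper name.)
apply_kernel_patch() {
    srcdir=$1
    patchfile=$2
    # --dry-run (GNU patch) reports what would happen without touching files.
    # The redirection is outside the subshell so a relative PATCH_FILE
    # is still found after the cd.
    if (cd "$srcdir" && patch -p1 --dry-run) < "$patchfile" > /dev/null; then
        (cd "$srcdir" && patch -p1) < "$patchfile"
    else
        echo "apply_kernel_patch: $patchfile does not apply cleanly to $srcdir" >&2
        return 1
    fi
}
```

With the example above, the call would be apply_kernel_patch linux patch-2.4-prerelease-ac5, run from the directory that holds both the linux directory and the patch file.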

Linus' patch compilations will be in pub/linux/kernel/testing. Alan Cox's will be in pub/linux/kernel/people/alan/2.4/. Generally Linus' stuff is more official and stable, while Alan's patches are often the first try at something or an experiment with a fix.

A few more helpful tidbits:

After you configure your kernel a file named .config will be created in the linux directory. This holds all the configuration options you just selected. It is helpful to make a directory somewhere and save copies of your .configs with names that reflect the kernel version and the most significant options you've set in that build. You can then use the saved files to recover earlier kernels for testing at a later date, and if the kernel developers need it you can send the .config file for a kernel that had some bug you're reporting.
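Saving the .config under a descriptive name can be automated along these lines. This is a sketch under a few assumptions: the function names kernel_version and save_config are made up, and the version is read from the VERSION, PATCHLEVEL, SUBLEVEL and EXTRAVERSION lines at the top of the kernel's Makefile, so check that your tree's Makefile has them in that form.

```shell
# Print the kernel version recorded in SOURCE_DIR/Makefile, e.g. 2.4.4-ac11.
# (kernel_version and save_config are hypothetical helper names.)
kernel_version() {
    awk -F'[ \t]*=[ \t]*' '
        $1 == "VERSION"      { v = $2 }
        $1 == "PATCHLEVEL"   { p = $2 }
        $1 == "SUBLEVEL"     { s = $2 }
        $1 == "EXTRAVERSION" { e = $2 }
        END { print v "." p "." s e }
    ' "$1/Makefile"
}

# Copy SOURCE_DIR/.config to ARCHIVE_DIR/config-<version>-<label>.
# Usage: save_config SOURCE_DIR ARCHIVE_DIR LABEL
save_config() {
    srcdir=$1
    archive=$2
    label=$3
    mkdir -p "$archive" || return 1
    cp "$srcdir/.config" "$archive/config-$(kernel_version "$srcdir")-$label"
}
```

The label is whatever reminds you of the significant options in that build. To go back to a saved configuration, copy the file to .config in the linux directory before rebuilding.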

It is best not to post a .config file directly to the mailing list; the files are large, many people subscribe to the list and the list has a lot of traffic. I suggest either mailing any needed .configs privately to those who request them, or else posting them on a web page as I did here when I had a problem with my laptop and including the URL with your bug report.

(It is also helpful to post successful configuration files including your .config, XF86Config and lilo.conf (if you use kernel parameters) on a web page to aid those who might have similar hardware to you - especially if you've got hardware that's hard to configure correctly. I have such a page for sharing the configuration of my Compaq Presario 1800T laptop.)

If you patch your kernel sources you can get them configured anew the fastest if you use an old .config file and give the command "make oldconfig". You'll be prompted for new items that weren't mentioned in the old file. It's probably best to run through the whole configuration manually when you first create a 2.4 kernel though, as there is a lot of new stuff.

If you have the X Window System working on your machine, the most pleasant way to configure your kernel (after any needed make oldconfig) is "make xconfig" (saying this probably marks me as not being a true hacker...). This is also the quickest if you want to just change a few options here and there (it's a GUI configuration tool) or for browsing the config options help. Other possibilities are "make config" (which steps through the many options sequentially - it is very laborious), "make menuconfig" (for curses-based editing in a terminal), or just manually editing the .config file (not generally recommended because of dependencies between the options).

Finally, if you're working on an Intel-architecture machine, and are trying out frequent new kernels, it is very convenient to install GNU Grub. It is a much more full-featured bootloader than LILO. Chief among its advantages is that it understands various filesystem formats natively, so unlike LILO, which needs to be reinstalled every time a new kernel binary is put in place, once Grub is installed you only need to edit its menu.lst file if you want to add a totally new kernel name to boot off of in the boot menu - and if you forget, you can boot the kernel by name from the grub command line.

Because Grub boots the kernel by name rather than by physical disk sector, replacing an old kernel with a new one with the same pathname doesn't require you to do anything at all to Grub. LILO uses a sector list, and a new kernel with the same path may be in a different physical location on the disk, so you have to reinstall LILO whenever you put a new kernel in place.
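For a sense of what adding a new kernel name to the boot menu involves, here is a sketch of a shell function that appends an entry to a menu file. The function name add_grub_entry is made up, and the device, partition and path names in the example are assumptions that depend entirely on your own disk layout.

```shell
# Append a boot entry to a Grub menu file.
# Usage: add_grub_entry MENU_FILE TITLE GRUB_ROOT KERNEL_PATH ROOT_DEVICE
# (add_grub_entry is a hypothetical helper name.)
add_grub_entry() {
    menu=$1
    cat >> "$menu" <<EOF

title  $2
root   $3
kernel $4 root=$5 ro
EOF
}

# Hypothetical example; every name here is an assumption about your system:
# add_grub_entry /boot/grub/menu.lst "Linux 2.4.4 (test)" "(hd0,0)" /boot/vmlinuz-2.4.4 /dev/hda1
```

Try it on a copy of your menu file first and read the result before putting it in place.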

Grub will boot operating systems that it doesn't know the format of through a process called "chain-loading", much the same as booting DOS or Windows from LILO.

Note that Grub is not yet at 1.0; it works great for me but I suggest starting by making a grub floppy for testing before you install it in your boot sector.

User Mode Linux - run the kernel as a process

You can run the Linux Kernel as a user process under some other native Linux system using Jeff Dike's User Mode Linux.

This was featured on Slashdot a while back.

I haven't tried it yet but this is a great thing. Besides safe testing of new kernels (you install your distribution in a filesystem built out of a single regular file, with controlled access to hardware), you can also use it to test potentially dangerous new software (you don't ever run software you've got off the net as root do you ;-/ ), and it potentially allows one to instrument the kernel in all kinds of ways that would be difficult running off of real hardware. Also you can install new kernels and test them without rebooting your machine, and there is a script that automatically starts up a system, runs a command, and reboots it.

The other thing I hasten to add, especially if you're going to be testing on a real machine that contains important data, is do a backup first, save your work frequently, and make more backups regularly - but if you use user-mode Linux, you don't have to worry.

Some Further Thoughts

The following was written in response to the feedback readers mailed in after the original article was posted on Advogato.

How Good is the 2.4.x Kernel Right Now? Should I Feel Safe to Test It?

One fellow wrote in to recommend that I should say that the new kernel works "very well" or at least "well". He felt that my statement that it worked "pretty well" would discourage a lot of people who might otherwise usefully test it.

It is my own experience that I have very little problem with the new kernel, and very likely you won't either. But I hesitate to say anything of substance about how well it's working - if it works at all for you, very likely it will work flawlessly and you'll have the added benefit of whizzy new features and performance enhancements.

But observing the traffic on the linux-kernel mailing list, some people have significant trouble. I feel that if you test it, the likely benefit is a nice new toy to play with, but you must accept some risk. At the very least your machine might not boot, and at worst the kernel might scrag your filesystem or lose data you've created in a program. So it really should only be tested by people that are prepared to accept the possibility of having to fix their machine or recover their data.

ReiserFS was accepted into the main kernel distribution in version 2.4.1, but I would suggest not using it for storage of production files yet. This is in contrast to the recommendations of many other people - lots of people are using ReiserFS successfully.

However, I note from the changelogs of the last several kernel releases that some significant bugs have been fixed in ReiserFS, and I understand that some of them could cause filesystem corruption. I'd like to use it on my own machine but don't plan to do so until at least one kernel release has appeared without any ReiserFS fixes - hopefully that will indicate it is stable, rather than that work has been abandoned!

I would encourage everyone to test ReiserFS though; a simple way would be to mount a ReiserFS filesystem that's in a loopback-mounted regular file on ext2, or else get a spare hard drive.

Let me contrast this, however, with the condition of Windows 2000 when it was beta tested. I needed to write some Java meant to run on NT for a consulting job and my client thought it would be fun if I used the Windows 2000 Beta. I would suggest "living hell" is a better way to characterize my experience. I had no end of trouble, and it wreaked lots of havoc with my work - for example, I could not use ethernet and DNS via PPP at the same time (even though I ran Windows 2000 server) and had to disable ethernet and reboot before checking my email.

The Win2K product shipped with lots of bugs and the opinion was widely held among the industry press and IT managers that one should not install it until a few service packs had been released - but Microsoft shipped it anyway. (To be fair, all those bugs were counted across the entire system and not just the kernel, and the figure of 64,000 known bugs I quoted in my original article turned out not to have been accurate.)

I've been running the 2.4.0-test kernels on the machines I use for my daily work since test1 was released. I've had no problems that prevented me from doing useful work. The one serious bug I found was that my Adaptec APA1480 Cardbus SCSI host bus adapter wouldn't function, and that was resolved very early on by working with the mailing list - so now I can burn CD's with a SCSI CD burner off my laptop. The only problem I've got now is that my machine doesn't power itself off when I shut down.

So you be the judge.

Besides Building the Kernel, What Steps Do the Users of a Given Distribution Need to Take to Run the New Kernel?

Depending on your existing distribution, the only thing that is likely to be absolutely required is to install the new modutils package as mentioned above. The modutils are user programs that manage kernel modules, generally device drivers that may be loaded into or removed from the kernel at runtime. The module format has changed in 2.4, so that's why you need the new version.

Your system will probably still boot OK with an older modutils, but you'll get a lot of messages about unresolved symbols when you run depmod, and trying to use insmod or modprobe will fail with an error message that's not likely to be helpful.

I can reasonably say your system will still boot because modules can't be loaded at initial boot, so you have to statically link in any drivers you need to start up, but lots of important things may not work for you; for example it is popular to load PPP as a module so you won't be able to dial in to the Internet.

All of your existing user-mode programs, applications and libraries should continue to work without the need to update their source or even recompile. Binary compatibility with user programs that ran on old kernel versions is a basic requirement for the system.

I have seen reports that some existing app would crash when run under the new kernel. This isn't an error on the user's part, usually, but a bug in the kernel, and should be reported to the mailing list.

There are some new kernel features that require user programs to take advantage of them. You don't need them to run the new kernel on your old system. I don't know what they all are, but they are mentioned in the kernel config help - if you select the help when examining a configuration option, sometimes the help will refer you to other documentation or to a website that will tell you about the new software you need.

An example of such a user mode program is the reiserfsprogs - the kernel only has support for mounting, unmounting, reading and writing an existing ReiserFS filesystem. To create a new filesystem on a partition, or to repair a damaged one, you need some user programs.

I know one feature that is probably too radical for most casual users to want to mess with on an existing distribution install. This is the DevFS filesystem. With DevFS, the /dev directory is initially empty and a driver's special files are created when the driver loads (either at boot time or when its module is loaded) and disappear when its module is unloaded.

This is a vast architectural improvement but you probably don't want to just slap it into an existing distro that expects its /dev files to stay put, and there are some issues about managing these files that need to be dealt with (like how to set the default permissions on one of these dynamic files). Anybody but the Linux From Scratch people will probably want to wait for a distro that supports that as an integrated whole.

Monkeywrenching the Virtual Machine

I'd like to say a few additional words about why it is so important that the quality of the kernel, not just for Linux but for any operating system, be so high. One could argue that it's just as critical that the system libraries be error free because an error in a library could affect any program that uses it, but really the kernel is a special case.

This is because of the non-local effects of having the virtual machine break down.

Reliably functioning computer programs, both kernels and user-mode programs, are virtual machines, of which the parts are the data structures and the algorithms which operate on them. We have stacks, queues, lists, subroutines, interrupts (both hardware interrupts in the kernel and software interrupts in user programs, such as signals), threads, locks and so on.

Our programming languages, libraries and kernels give us a wide array of machine parts and then we assemble these into very elaborate machines that, if rendered as physical mechanisms, would put the finest sportscar to shame - as long as the programs are written correctly.

The problem is that if you've got certain kinds of bugs in your program, such as heap corruption, buffer overflows, race conditions or failure to protect a critical region, then all hell breaks loose. It's as if the Army pulled a Howitzer up to your nice sportscar and put a shell through the engine - but then it kept running. Programs don't explode when they're damaged, they're happy to continue running along, executing each instruction in sequence, but they're likely not doing what you want.

Consider yourself lucky if you get a segmentation violation - at least then you find out right away something is wrong, rather than an hour later after you've saved your work to disk into a file that turns out to be corrupt.

I discussed this in a letter entitled Algorithms have unclear boundaries that I originally wrote to the patent office and also submitted to the Forum on Risks to the Public in Computers and Related Systems. (I recommend that anyone who uses computers read Risks - years of following the Risks Forum is what made me such a freak about software quality).

I once followed a discussion of programming assertions on the Usenet News. Assertions are tests included in debug builds of programs that test that a condition that must be true actually is true. If the condition is found to be false then the program is halted immediately so the programmer can check out what's wrong. Assertions speed software development by catching your mistakes quicker, doing some testing automatically for you every time you run the program.

One common practice is to test that an impossible condition is not true, for example, if a variable is allowed to hold one of three values then you assert that it does not contain a fourth. But one participant in the discussion argued vehemently that if he could prove, through the logical flow of the program code as written, that an impossible condition could never occur, it was a waste of time to include assertions that tested for impossibilities.

I feel that he was wrong though, and it's likely he spends a lot of extra time needlessly debugging his programs that he could save by using more assertions. His argument only holds while the virtual machine is intact. When the virtual machine breaks down, impossible conditions start coming fast and hard, and peppering your code with assertions will warn you right away this is happening. It's impossible to know ahead of time what impossible conditions to test for, so in practice you test for them wherever it's convenient.
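The discussion was about compiled programs, but the idea can be sketched in a few lines of shell, the language of the commands used elsewhere in this article. The names assert and is_valid_state, and the three state values, are made up for illustration.

```shell
# Halt with a message when a condition that "cannot" be false turns out false.
# Usage: assert DESCRIPTION COMMAND [ARGS...]
# (assert and is_valid_state are hypothetical names for this sketch.)
assert() {
    description=$1
    shift
    if ! "$@"; then
        echo "assertion failed: $description" >&2
        exit 1
    fi
}

# The variable is only ever supposed to hold one of three values.
is_valid_state() {
    case "$1" in
        idle|running|stopped) return 0 ;;
        *) return 1 ;;
    esac
}

state=running
# The "impossible" fourth value is exactly what this line watches for.
assert "state is one of idle, running, stopped" is_valid_state "$state"
```

While everything is working the assert line does nothing at all; it only speaks up when something has scribbled a fourth value into the variable, which is the point.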

Now how does this long theoretical discussion apply to the kernel?

Normal user mode programs on modern operating systems like Linux run in protected memory, in which the program has the perception it possesses the entire memory space of the whole machine and it is impossible for one program to use a memory access to affect another. The protected memory is managed by the kernel and enforced by the memory management unit, a component of modern microprocessors.

If the virtual machine of one user mode program breaks down, it may act erratically or be terminated by the system, but it is unlikely that it will harm any other programs.

Besides keeping the system more reliable for users and protecting user data, protected memory makes life easier for programmers because an error in your program will at worst terminate the application. You find out right away something is wrong, if you're using a debugger you get helpful information on what the problem is, and your program doesn't crash the machine so you don't have to wait to reboot to continue your work.

Don't take protected memory for granted - there are lots of systems that still don't have it. The classic Mac OS doesn't, and I've spent much time in my career waiting for a Mac to restart because of some silly pointer bug. The BSD/Mach-based Mac OS X that is currently in beta testing will be Apple's first publicly released, widely used protected memory OS (there was also A/UX, an early Mac Unix, but it wasn't meant for widespread consumption).

User mode programs on Linux can affect each other, but they do it through carefully managed channels of communication that are directed by the kernel. Most familiar are TCP/IP networking and files on the hard drive, but there's also Unix domain sockets, pipes and signals. Programs can expose the guts of their memory to direct access by other programs by using shared memory via such methods as the mmap system call, but they only do this when they want to and typically they do not expose critical data.

These are all well-defined communications pathways. It is possible for one program to crash another through one of these pathways (for example, by writing a corrupt file to disk that is used by another program) but it is much harder in general and even then the problem is localized.

The kernel is a special case, though. In itself, it is a particularly complex virtual machine - both within its own operation, and in the system call and special device file interface it presents to user programs - it presents the hardware to the user programs as an external virtual machine. It sits in the middle of everything, between each user program and the hardware, between different pieces of hardware that communicate with each other via hardware buses and DMA, and between user programs running together on the same machine and even on different machines that are communicating via a network protocol.

The kernel effectively has root privilege on your machine. If a program has lesser privilege, that is because the kernel is enforcing that policy - but in reality, the kernel can do anything it wants if it should get an inclination to.

It all runs in one big virtual machine. The kernel does not have protected memory within itself. The situation is complicated because parts of the kernel run within the virtual memory space of user programs, and the kernel manages the memory spaces itself, and also makes direct access to physical memory, so the memory architecture of the running kernel is a complicated thing. But there's really no protection against some part of the kernel screwing up another part.

And if the kernel's virtual machine breaks down, just a little bit, not so much as to bring your machine crashing down, you can create pathological communications pathways within the kernel.

An extreme case (I haven't seen this actually happen) would be a pointer bug in a device driver that caused the driver to overwrite some critical memory data structure that was used by a journaled filesystem like ReiserFS. Lots of people think journaled filesystems are completely reliable because they arrange to write filesystem metadata only atomically. First the metadata is streamed into the journal, and only after it is complete is it then copied to the filesystem itself, and it is done in such a way that if the process is interrupted at any time (as by a power failure) then the integrity of the filesystem will be preserved.

But what if a buggy driver scrawls some bogus data into the memory used by the journaled filesystem just before it's written to disk? Think about that the next time you install the driver for some oddball piece of hardware into the computer you're using to write your memoirs.

Something I have seen happen many times, when I was a "Debug Meister" at that Big Fruit Company in Cupertino, is for an error in the operating system (the Mac OS System in this case) to screw up data structures used by some other part of the system during some system call. When a user application later makes that system call, something else happens other than was documented by Inside Macintosh - the system behaves incorrectly, or returns bogus results.

The most straightforward and methodical way to test this is by writing test tools that try out all the different system calls, and vary their parameters over the acceptable ranges and ensure that the results returned are also within the documented range. You also try making system calls with illegal parameters to ensure that an appropriate error code is returned.
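As a rough illustration of that approach, here is a minimal sketch in Python that uses the standard os module to reach a few system calls. The function names (check_read_valid_range, expect_errno, run_checks) and the choice of calls and parameter values are my own assumptions for the example, not part of any existing test suite, and a real tool would cover far more calls and ranges.

```python
import errno
import os
import tempfile


def check_read_valid_range(path, sizes):
    """Call read() with several legal sizes and check the result is in range."""
    file_size = os.path.getsize(path)
    for size in sizes:
        fd = os.open(path, os.O_RDONLY)
        try:
            data = os.read(fd, size)
        finally:
            os.close(fd)
        # read() must never return more bytes than requested or than exist.
        assert 0 <= len(data) <= min(size, file_size), (size, len(data))


def expect_errno(expected, func, *args):
    """Return True only if func(*args) fails with the expected error code."""
    try:
        func(*args)
    except OSError as exc:
        return exc.errno == expected
    return False  # the call succeeded when it should have failed


def run_checks():
    # Create a scratch file of known size to exercise the calls against.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 1000)
        path = tmp.name
    try:
        # Legal parameters, varied across and just past the file size.
        check_read_valid_range(path, [0, 1, 999, 1000, 1001, 65536])

        # Illegal parameters: each should produce a specific error code.
        fd = os.open(path, os.O_RDONLY)
        try:
            lseek_negative = expect_errno(
                errno.EINVAL, os.lseek, fd, -1, os.SEEK_SET)
        finally:
            os.close(fd)
        # fd is closed now, so reading from it should fail with EBADF.
        read_bad_fd = expect_errno(errno.EBADF, os.read, fd, 1)
        open_missing = expect_errno(
            errno.ENOENT, os.open, path + ".missing", os.O_RDONLY)
    finally:
        os.unlink(path)
    return {
        "lseek_negative": lseek_negative,
        "read_bad_fd": read_bad_fd,
        "open_missing": open_missing,
    }
```

run_checks() returns a dictionary of pass/fail results for the illegal-parameter cases, and raises an AssertionError if a legal call returns something outside its documented range. Running the same script under an old and a new kernel and comparing the results is the simplest way to use it.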

This is valuable, but the tools are tedious to write and often don't exercise the system all that well. I don't see a lot of these kinds of tools available in the Free Software community but it would be valuable to write some (that's part of what I did as a QA engineer at Apple).

There are some test tools available for Linux - see Using Test Suites to Validate the Linux Kernel. There are some tools to test the kernel directly, and one can also use the test suites for user-level packages to exercise the kernel.

What is also very valuable is to stimulate the kernel with many applications that are otherwise expected to work reliably, because they have worked reliably with previous kernels. There are far more programs meant for some real purpose than there are test tools and so using these you can get much broader coverage than a test tool would typically do. They're usually more interesting to spend your days with too.

You want to try out these applications on lots of different hardware configurations because of the problems of hardware-dependent code creating pathological communications pathways with the programs. And in fact at Apple it was very common that a tester would report that some commercial application would work reliably on one model of Macintosh with a new version of the System, but not another, and often this was because of some bug in a hardware driver that surfaced in the misbehavior of a video game or spreadsheet.

At this point I've probably scared you beyond wanting to test at all. But the situation is not as grim as it might sound. The kernel wouldn't work very well at all if it were not highly reliable to start with, and there are some things about the kernel and the way it is developed that make it much more robust than is likely to be the case with other operating system kernels.

One factor that adds to Linux' reliability is that it is cross-platform. It supports a number of different microprocessors as well as the S/390 mainframe processor. It is used on a very inhomogeneous population.

Another is that it is distributed as configurable source code. There are widely varying options for some ways the kernel will work, and even with one set of features for a given architecture you can choose to optimize for a particular processor.

This is good news because both factors help to bring out latent bugs. Some bugs only cause trouble rarely, or don't show up at all but rear their head after a major modification to the system. But since the kernel is distributed as source code, and built for many different systems, it is likely that the different conditions of one system - often the fact that memory is laid out differently, or that the code is built with different options - will stimulate the bug repeatably on at least one configuration so it can be found and fixed early.

Contrast this with, say, the Windows 2000 kernel, which only works on Intel-architecture microprocessors, all of which run code copied from a single build of the system by Microsoft's release engineers. This is a very homogeneous population and they do not have the benefit that varying so many parameters brings to Linux. (Note also that when Be, Inc. ported the BeOS to the Intel architecture from PowerPC, although they found that there was vastly more market interest in Pentium BeOS than PowerPC BeOS, they still support the PowerPC version because it helps to ensure the quality of their code - I'm sure that Microsoft, or at least Microsoft's engineers, will ultimately regret abandoning PowerPC and Alpha for this reason.)

(By the way, this is one benefit of doing cross-platform development of user applications too. You definitely want to get people who use different processors to work with your code and if possible make it work with other compilers than gcc and on different operating systems entirely - it makes your code very robust).

Also, many of the kernel developers have been using the development kernels on their own personal machines for a long time and often have subjected them to heavy stress testing loads. There's been a lot of time in development for kernel bugs to be found and fixed.

So it's not all that likely that you're going to have really brain-damaged behaviour.

I'm so concerned about it not because I think it will be common, but because if it happens it will be hard for the people it happens to to track it down - it would appear that there was a bug in a program that wasn't at fault, and that program's developers probably wouldn't have the same kernel bug so they wouldn't be able to figure it out.

It would be best if such problems were found in testing rather than in production machines, or on a machine owned by someone who wasn't an expert user.

Source: Michael D. Crawford

License: Creative Commons - Share Alike
