Kernel Traffic #224 For 30 Jul 2003

By Zack Brown

If you like Kernel Traffic and want to send me a little money, click here: https://www.paypal.com/xclick/business=zbrown%40tumblerings.org&no_note=1&tax=0&currency_code=USD

Table Of Contents

* Mailing List Stats For This Week
* Threads Covered

   1. 8 Jul 2003 - 14 Jul 2003 (23 posts) Better Support For Big-RAM Systems
   2. 10 Jul 2003 (1 post) Linux Test Project Update For July
   3. 10 Jul 2003 - 12 Jul 2003 (33 posts) Linux 2.5.75; Approaching 2.6; Andrew Morton Likely 2.6 Maintainer
   4. 11 Jul 2003 - 18 Jul 2003 (95 posts) Expected Changes From 2.4 To 2.6
   5. 11 Jul 2003 - 15 Jul 2003 (53 posts) Merging Software Suspend Patches; Aborting A Suspend-In-Progress
   6. 11 Jul 2003 - 15 Jul 2003 (8 posts) File-Time System Calls; Status Of ReiserFS
   7. 13 Jul 2003 - 18 Jul 2003 (16 posts) Linux 2.6.0-test1 Released
   8. 14 Jul 2003 - 15 Jul 2003 (12 posts) Status Of XBox Support
   9. 14 Jul 2003 - 15 Jul 2003 (5 posts) Linux 2.6 Feature Documentation By Joe Pranevich
  10. 14 Jul 2003 (2 posts) nfs-utils 1.0.4 Released
  11. 14 Jul 2003 - 18 Jul 2003 (15 posts) RadeonFB Maintainership And Development Battles
  12. 15 Jul 2003 (2 posts) Status Of Virtual Memory Documentation
  13. 18 Jul 2003 (3 posts) BitKeeper Snapshots For 2.6.0-test
  14. 18 Jul 2003 (1 post) Adeos M3 Released

Mailing List Stats For This Week

We looked at 3207 posts in 17182K.

There were 738 different contributors. 416 posted more than once. 205 posted last week too.

The top posters of the week were:

* 207 posts in 1424K by Alan Cox
* 67 posts in 198K by Greg KH
* 66 posts in 233K by Davide Libenzi
* 60 posts in 179K by Jeff Garzik
* 59 posts in 309K by William Lee Irwin III

1.
Better Support For Big-RAM Systems

8 Jul 2003 - 14 Jul 2003 (23 posts)
Archive Link: "[announce, patch] 4G/4G split on x86, 64 GB RAM (and more) support"
Topics: Big Memory Support, Virtual Memory
People: Ingo Molnar, Petr Vandrovec

Ingo Molnar said:

i'm pleased to announce the first public release of the "4GB/4GB VM split" patch, for the 2.5.74 Linux kernel:

http://redhat.com/~mingo/4g-patches/4g-2.5.74-F8

The 4G/4G split feature is primarily intended for large-RAM x86 systems, which want to (or have to) get more kernel/user VM, at the expense of per-syscall TLB-flush overhead.

on x86, the total amount of virtual memory - as we all know - is limited to 4GB. Of this total 4GB VM, userspace uses 3GB (0x00000000-0xbfffffff), the kernel uses 1GB (0xc0000000-0xffffffff). This VM scheme is called the 3/1 split. This split works perfectly fine up until 1 GB of RAM - and it works adequately well even after that, due to 'highmem', which moves various larger caches (and objects) into the high memory area.

But as the amount of RAM increases, the 3/1 split becomes a real bottleneck. Despite highmem being utilized by a number of large-size caches, one of the most crucial data structures, the mem_map[], is allocated out of the 1 GB kernel VM. With 32 GB of RAM the remaining 0.5 GB lowmem area is quite limited and only represents 1.5% of all RAM. Various common workloads exhaust the lowmem area and create artificial bottlenecks. With 64 GB RAM, the mem_map[] alone takes up nearly 1 GB of RAM, making the kernel unable to boot. Relocating the mem_map[] to highmem is very impractical, due to the deep integration of this central data structure into the whole kernel - the VM, lowlevel arch code, drivers, filesystems, etc.

with the 4G/4G patch, the kernel can be compiled in 4G/4G mode, in which case there's a full, separate 4GB VM for the kernel, and there are separate full (and per-process) 4GB VMs for user-space.
A typical /proc/PID/maps file of a process running on a 4G/4G kernel shows a full 4GB address-space:

    00e80000-00faf000 r-xp 00000000 03:01 175909 /lib/tls/libc-2.3.2.so
    00faf000-00fb2000 rw-p 0012f000 03:01 175909 /lib/tls/libc-2.3.2.so
    [...]
    feffe000-ff000000 rwxp fffff000 00:00 0

the stack ends at 0xff000000 (4GB minus 16MB). The kernel has a 4GB lowmem area, of which 3.1 GB is still usable even with 64 GB of RAM:

    MemTotal:  66052020 kB
    MemFree:   65958260 kB
    HighTotal: 62914556 kB
    HighFree:  62853140 kB
    LowTotal:   3137464 kB
    LowFree:    3105120 kB

the amount of lowmem is still more than 3 times the amount of lowmem available to a 4GB system. It's more than 6 times the amount of lowmem a 32 GB system gets with the 3/1 split.

Performance impact of the 4G/4G feature:

There's a runtime cost with the 4G/4G patch: to implement separate address spaces for the kernel and userspace VM, the entry/exit code has to switch between the kernel pagetables and the user pagetables. This causes TLB flushes, which are quite expensive, not so much in terms of TLB misses (which are quite fast on Intel CPUs if they come from caches), but in terms of the direct TLB flushing cost (%cr3 manipulation) done on system-entry.

RAM limits:

in theory, the 4G/4G patch could provide a mem_map[] for 200 GB (!) of physical RAM on x86, while still having 1 GB of lowmem left. So it gives quite some legroom. While the right solution for lots of RAM is to use a proper 64-bit system, there's a lot of existing x86 hardware, and x86 servers will still be sold in the next couple of years, so we ought to support them maximally.

The patch is orthogonal to wli's pgcl patch - both patches try to achieve the same, with different methods. I can very well imagine workloads where we want to have the combination of the two patches.
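A quick sanity check of the figures quoted so far, as a Python sketch rather than anything taken from the patch itself. It assumes 4 KB pages. The post does not say how large one mem_map[] entry is, so struct_page_bytes is a free parameter and the 64 bytes used below is purely an illustrative assumption; the function names are hypothetical.

```python
GIB = 1024 ** 3


def page_count(ram_bytes, page_size=4096):
    """Number of physical page frames that mem_map[] has to describe."""
    return ram_bytes // page_size


def mem_map_bytes(ram_bytes, struct_page_bytes, page_size=4096):
    """Size of mem_map[]: one descriptor per physical page frame.

    struct_page_bytes is an assumption supplied by the caller; the
    announcement does not state the descriptor size.
    """
    return page_count(ram_bytes, page_size) * struct_page_bytes


def lowmem_percent(low, total):
    """Lowmem as a percentage of all RAM (both arguments in the same unit)."""
    return 100.0 * low / total


if __name__ == "__main__":
    # 64 GB of RAM with an assumed 64-byte descriptor per page.
    print(mem_map_bytes(64 * GIB, struct_page_bytes=64) / GIB, "GiB of mem_map[]")
    # 0.5 GB of lowmem left on a 32 GB machine under the 3/1 split.
    print(lowmem_percent(0.5 * GIB, 32 * GIB), "% of RAM is lowmem")
    # LowTotal and MemTotal (in kB) from the /proc/meminfo output quoted above.
    print(lowmem_percent(3137464, 66052020), "% of RAM is lowmem with 4G/4G")
```

With 4 KB pages a 64 GB machine has 16,777,216 page frames, so any descriptor a little under 64 bytes lands on the "nearly 1 GB" mentioned in the post, which is the whole of the kernel's 1 GB under the 3/1 split.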
Implementational details:

the patch implements/touches a number of new lowlevel x86 infrastructures:

* it moves the GDT, IDT, TSS, LDT, vsyscall page and kernel stack up into a high virtual memory window (trampoline) at the top 16 MB of the 4GB address space. This 16 MB window is the only area that is shared between user-space and kernel-space pagetables.
* it splits out atomic kmaps from highmem dependencies.
* it makes LDT(s) atomic-kmap-ed.
* (and lots of other smaller details, like increasing the size of the initial mappings and fixing the PAE code to map the full 4GB of kernel VM.)

Whenever we do a syscall (or any other trap) from user-mode, the high-address trampoline code starts to run, with a high-address esp0. This code switches over to the kernel pagetable, then it switches the 'virtual kernel stack' to the regular (real) kernel stack. On syscall-exit it does it the other way around.

there are a few generic kernel changes as well:

* it implements 'indirect uaccess' primitives and implements all the get_user/put_user/copy_to_user/... functions without relying on direct access to user-space. This feature uncovered a number of bugs in the lowlevel x86 code already, there was still code that accessed user-space memory directly.
* it splits up PAGE_OFFSET into PAGE_OFFSET_USER and PAGE_OFFSET (kernel)
* fixes a couple of assumptions about PAGE_OFFSET being PMD_SIZE aligned.

but the generic-kernel impact of the patch is quite low.

the patch optimizes kernel<->kernel context switches and does not flush the TLB, also, IRQ entry only causes a TLB flush if a userspace pagetable is loaded.

the typical cost of 4G/4G on typical x86 servers is +3 usecs of syscall latency (this is in addition to the ~1 usec null syscall latency). Depending on the workload this can cause a typical measurable wall-clock overhead from 0% to 30%, for typical application workloads (DB workload, networking workload, etc.).
Isolated microbenchmarks can show a bigger slowdown as well - due to the syscall latency increase.

i'd guess that the 4G/4G patch is not worth the overhead for systems with less than 16 GB of RAM (although exceptions might exist, for particularly lowmem-intensive/sensitive workloads). 32 GB RAM systems run into lowmem limitations quite frequently so the 4G/4G patch is quite recommended there, and for 64 GB and larger systems it's a must i think.

Status, future plans:

The patch is a work-in-progress snapshot - it still has a few TODOs and FIXMEs, but it compiles & works fine for me. Be careful with it nevertheless - it's an experimental patch which does very intrusive changes to the lowlevel x86 code.

There are a couple of performance enhancements on top of this patch that i'll integrate into this patch in the next couple of days, but i first wanted to release the base patch.

In any case, enjoy the patch - and as usual, comments and suggestions are more than welcome.

A number of people liked the patch, and there was some technical discussion. Petr Vandrovec also remarked, "FYI, VMware's vmmon/vmnet I maintain for 2.5.x kernels at http://platan.vc.cvut.cz/ftp/pub/vmware (currently .../vmware-any-any-update37.tar.gz) were updated to work correctly with 4G/4G kernel configuration."

2. Linux Test Project Update For July

10 Jul 2003 (1 post)
Archive Link: "[ANNOUNCE] Linux Test Project July Release Announcement"
Topics: Bug Tracking, PCI, Power Management: ACPI, USB, Version Control
People: Robert Williamson

Robert Williamson announced:

The Linux Test Project test suite has been released. The latest version of the testsuite contains 2000+ tests for the Linux OS. Our web site also contains other information such as: test results, a Linux test tools matrix, an area for keeping up with fixes for known blocking problems in the 2.5 kernel releases, technical papers and HowTos on Linux testing, and a code coverage analysis tool.
Highlights:

* Inclusion of the OpenHPI (Hardware Platform Interface) Test Suite.
* New tests for PCI, USB, ACPI, and the NLS filesystem
* Fixes and code cleanups for IA64 and PowerPC64
* More script-based tests updated to use the test harness APIs
* A new logo!

We encourage the community to post results, patches or new tests on our mailing list and use the CVS bug tracking facility to report problems that you might encounter with the test suite.

3. Linux 2.5.75; Approaching 2.6; Andrew Morton Likely 2.6 Maintainer

10 Jul 2003 - 12 Jul 2003 (33 posts)
Archive Link: "Linux 2.5.75"
People: Linus Torvalds, Russell King

Linus Torvalds announced 2.5.75 (http://www.kernel.org/pub/linux/kernel/v2.5/ChangeLog-2.5.75) and said:

Ok. This is it. We (Andrew and me) are going to start a "pre-2.6" series, where getting patches in is going to be a lot harder. This is the last 2.5.x kernel, so take note.

The probably most notable thing here is the anticipatory scheduler, which has been in -mm for a long time, and was the major piece that hadn't been merged.

Some architecture updates: cris has been updated for 2.5, ia64 and arm26 updates etc. And various random (smallish) things.

Russell King replied:

Well, only two words from me. Oh Shit.

The 2.5.70 ARM patch currently looks like this:

    343 files changed, 45388 insertions(+), 7341 deletions(-)

and I don't see that this will be reducing in size now that 2.6 is around the corner.

I _know_ ARM stuff doesn't build and hasn't built in Linus' tree for a fair time now - there are some generic changes to support ARM modules needed in vmalloc.c which I just haven't had the time to sort out, and there's still the issue of whether /proc/kcore actually works or not, and now I see that the time stuff needs re-working for multiple ARM platforms yet again. (yes, all the other architectures got updated, except for ARM.)

Maybe I should just forget even attempting to merge upstream, like most of the ARM community doesn't.
Frustrated is such an understatement.

Linus Torvalds replied, "Hey, this is already much later than it should have been, so it's not as if this is a huge surprise." He went on:

We can sort it out later. Obviously, clearly arm-specific patches (ie stuff in arch/arm and include/asm-arm) I wouldn't mind per se, but I'd rather hold back on even those just to make the patches and the changelogs not be mixed up with the "main bugfixes".

We've never had a first stable release that has all architectures up-to-date, and I'm not planning on changing that for 2.6.x. This is _not_ the time to try to make my tree build on arm (or other architectures either), considering that my tree hasn't been the main ARM tree for a long time.

Finally, to Russell's frustration, Linus said:

To be blunt, which part of "we want to release 2.6.x this year" came as a surprise to you? That means that I'm not willing to hold stuff up any more. Stuff that hasn't followed the development tree doesn't magically just "get fixed".

Also, the only real point of a stable release is for distribution makers. That pretty much cuts the list of "needs to be supported" down to x86, ia64, x86-64 and possibly sparc/alpha.

So everything else is a bonus, but can equally well just play catch-up later. Embedded people tend to want to stay back anyway, which is obviously why they don't follow the development tree in the first place.

Russell said, "I can't think of any stock kernel which has been usable, let alone been compilable for ARM. Which, IMO, is a pretty sorry statement to make." To which Linus replied:

You see that as a sorry statement, but I don't think it's a failure.

Why _should_ one tree have to try to make everybody happy? We want to try to make it easier to keep the couplings in place by striving for portable infrastructure etc, but we would only be hampered by a philosophy that says "everything has to work in tree X", since that just means that you can't afford to break things.
I'd much rather keep the freedom to break stuff, and have many separate trees that break _different_ things, and let them all co-exist in a friendly rivalry.

And my tree is just one tree in that forest. So it's not a bug - it's a FEATURE!

4. Expected Changes From 2.4 To 2.6

11 Jul 2003 - 18 Jul 2003 (95 posts)
Archive Link: "2.5 'what to expect'"
Topics: Access Control Lists, BSD: OpenBSD, Backward Compatibility, Big Memory Support, Big O Notation, Compression, Device Mapper, Disk Arrays: LVM, Disk Arrays: RAID, Disks: IDE, Disks: SCSI, Extended Attributes, FS: CIFS, FS: FAT, FS: InterMezzo, FS: JFS, FS: NFS, FS: NTFS, FS: ReiserFS, FS: UMSDOS, FS: VFAT, FS: XFS, FS: devfs, FS: ext2, FS: ext3, FS: sysfs, Forward Port, Framebuffer, Hot-Plugging, Hyperthreading, Ioctls, Kernel Build System, Microsoft, Modems, Networking, PCI, POSIX, Power Management: ACPI, Real-Time, SMP, Samba, Scheduler, Software Suspend, Sound: ALSA, Sound: OSS, USB, User-Mode Linux, Version Control, Virtual Memory, Web Servers
People: Dave Jones, Oleg Drokin, Paul Dickson, Matthew Dharm, Greg KH, Pavel Machek, Larry McVoy, Davide Libenzi, Jens Axboe, Meelis Roos, James Simmons, Ingo Molnar, Rik van Riel, Rusty Russell, Vojtech Pavlik, Matt Domsch, David Mosberger, Adam Belay, Ulrich Drepper, Jeff Garzik, Bert Hubert, Steven Cole, Alan Cox, Peter Chubb, Albert Cahalan, James H. Cloos, Keith Owens, Robert Love, Andrew Morton, Zwane Mwaikambo

Dave Jones explained:

In preparation for the flood of testers as we approach 2.6pre, I thought I'd give this doc another airing to be sure that it isn't missing anything important.. (Plus I've been meaning to post an update for a while, and 42 sounded like a good number).

The post-halloween document. v0.42
(aka, 2.5 - what to expect)

Dave Jones

(Updated as of 2.5.75)

This document explains some of the new functionality to be found in the 2.5 Linux kernel, some pitfalls you may encounter, and also points out some new features which could really use testing.
Note, that "contact foo@bar.com" below also implies that you should also cc: linux-kernel@vger.kernel.org.

Latest version of this document can always be found at http://www.codemonkey.org.uk/post-halloween-2.5.txt

Thanks to many [far too many to list] people for valuable feedback.

Note, that this document is somewhat x86-centric, but most features documented here affect all platforms anyway.

Spanish translation at: http://www.terra.es/personal/diegocg/post-halloween-2.5.es.txt

Applying patches.

* In 2.4 and previous kernels, the recommended way to apply patches was to use a command line such as ...

      gzip -cd patchXX.gz | patch -p0

  In 2.5, Linus started adding an extra path element to the diffs, so using -p1 in the untarred 'to be patched' directory is necessary.

Known gotchas.

Certain known bugs are being reported over and over. Here are the workarounds.

* Blank screen after decompressing kernel? Make sure your .config has CONFIG_INPUT=y, CONFIG_VT=y, CONFIG_VGA_CONSOLE=y and CONFIG_VT_CONSOLE=y. A lot of people have discovered that taking their .config from 2.4 and running make oldconfig to pick up new options leads to problems, notably with CONFIG_VT not being set.
* An additional bug biting some people is that NICs fail to receive packets (usually notable by a NIC not getting a DHCP lease for eg, despite being sent one by the server). Booting with "noapic", "acpi=off" or a combination of both fixes this for most people. Additional breakage reports should go to Jeff Garzik
* (Possibly linked to above bug) VIA APIC routing is currently broken. Boot with 'noapic'.
* Can't load any modules? You need updated tools (See modules section below).

Regressions.

(Things not expected to work just yet)

* The hptraid/promise RAID drivers are currently non functional, and will probably be converted to use device-mapper.
* Some filesystems still need work (Intermezzo, UFS, HFS, HPFS..)
* A number of drivers don't compile currently due to them needing various work to convert them to the new APIs
* UMSDOS fs is currently missing, pending rewrite.
* The format of /proc/stat changed, which could break some applications that still depend on the old layout. Currently the only known application to break is the java 'DOTS' app. (http://bugme.osdl.org/show_bug.cgi?id=277)
* Some people seem to have trouble running rpm, most notably Red Hat 9 users. This is a known bug of rpm. Workaround: run "export LD_ASSUME_KERNEL=2.2.5", before running rpm.

Deprecated features.

* khttpd is gone.
* Older Direct Rendering Manager (DRM) support (For XFree86 4.0) has been removed. Upgrade to XFree86 4.1.0 or higher.
* LVM1 has been removed. See Device-mapper below.
* boot time root= parsing changed. ramdisks are now ram instead of rd and cm206 is cm206cd (instead of cm206).
* The system call table is no longer exported. Any module that relied on this previously will no longer work.
* Soundmodem hamradio support has been removed. Its functionality has been superseded by a userspace replacement. http://www.baycom.org/~tom/ham/soundmodem
* Direct booting from floppy is no longer supported. You should now use a boot loader program instead.
* Callout tty devices (/dev/cua) have been deprecated since 2.1.90pre2. Support is now removed.

Modules.

* The in-kernel module loader got reimplemented.
* You need replacement module utilities from http://www.kernel.org/pub/linux/kernel/people/rusty/modules/
* A backwards compatible set of module utilities is also available from the same URL in RPM format.
* Debian sid users can 'apt-get install module-init-tools'
* Modules now free stuff marked with __init or __initdata.
* For Red Hat users, there's another pitfall in "/etc/rc.sysinit". During startup, the script sets up the binary used to dynamically load modules stored at "/proc/sys/kernel/modprobe".
  The initscript looks for "/proc/ksyms", but since it doesn't exist in 2.5 kernels, the binary used is "/sbin/true" instead. This, eventually, will keep modules from working. Red Hat users will have to patch the "/etc/rc.sysinit" script to set "/proc/sys/kernel/modprobe" to "/sbin/modprobe", even when "/proc/ksyms" doesn't exist.

Kernel build system.

* The build system is much improved compared to 2.4. You should notice quicker builds, and less spontaneous rebuilds of files on subsequent builds from already built trees.
* There are new graphical config tools. "make xconfig" now requires the qt libraries. "make gconfig" uses gtk libraries.
* Make menuconfig/oldconfig has no user-visible changes other than speed, whilst numerous improvements have been made.
* Several new debug targets exist: 'allyesconfig' 'allnoconfig' 'allmodconfig'.
* Note: The new configuration system is not CML2 related.
* Also note: Whilst some ideas were taken from it, Keith Owens' kbuild-2.5 project was not integrated.
* "make" is now the preferred command, without a target; it builds the kernel image and modules.
* "make -jN" is now the preferred parallel-make execution. Do not bother to provide "MAKE=xxx"
* The build is now much less verbose. If you want to see exactly what's going on, try "make V=1" or set KBUILD_VERBOSE=1 in your environment.
* 'make kernel/mm.o' will build the named file, provided a corresponding source exists. This also works for (non-composite) modules. (FIXME: broken for modules right now?)
* 'make kernel/' will compile all files in a subdirectory and below.
* There is no need to run 'make dep' at any stage.
* 'make help' provides a list of typical targets, including debugging targets.

IO subsystem.

* You should notice considerable throughput improvements over 2.4 due to much reworking of the block and the memory management layers.
* Report any regressions in this area to Jens Axboe and Andrew Morton.
* Several different IO elevators are available to match different types of workload.
  You can select which one to use with elvtune.
* Assorted changes throughout the block layer meant various block device drivers had a large scale cleanup whilst being updated to newer APIs.
* The size and alignment of O_DIRECT file IO requests now matches that of the device, not the filesystem. Typically this means that you can perform O_DIRECT IO with 512-byte granularity rather than 4k. But if you rely upon this, your application will not work on 2.4 kernels and will not work on some devices.

Enormous block size support.

* Thanks to work done by Peter Chubb, block devices can now access up to 16TB on 32-bit architectures, and up to 8EB on 64-bit architectures.
* To use the new BLKGETSZ64 ioctls, you'll need updated file-utils. (Currently only jfsutils 1.0.20 has this change, patches for other filesystems are still pending merging)

POSIX ACLs & Extended attributes.

* Userspace tools available at http://acl.bestbits.at

VM Changes.

* Version zero swap partitions are no longer supported (everything is using v1 now anyway - rerun mkswap if in doubt). Linux 2.0.x requires v0 swap space, Linux v2.1.117 and later support v1. mkswap(8) can format swap space in either format.
* The actual 'reverse mappings' part of Rik van Riel's rmap vm was merged. VM behaviour under certain loads should improve.
* VM misbehaviour should be reported to linux-mm@kvack.org
* The VM explicitly checks for sparse swapfiles, and aborts if one is found.
* /proc/sys/vm/swappiness defines the kernel's preference for pagecache over mapped memory. Setting it to 100 (percent) makes it treat both types of memory equally. Setting it to zero makes the kernel very much prefer to reclaim plain pagecache rather than mapped-into-pagetables memory.
* The bdflush() syscall is now officially deprecated. The syscall does nothing, and prints a stern warning to users. The functionality is replaced by the pdflush daemons.
* Due to various changes, swap files should be just as fast as swap partitions.
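For anyone who wants to check the swappiness tunable described in the VM list above, here is a minimal Python sketch. It only reads the value and paraphrases the two end points the document describes; the helper names are hypothetical, and the 0 to 100 range is taken from the description above rather than from any kernel source.

```python
import os

SWAPPINESS_PATH = "/proc/sys/vm/swappiness"


def parse_swappiness(text):
    """Turn the contents of the swappiness file into an integer."""
    return int(text.strip())


def read_swappiness(path=SWAPPINESS_PATH):
    """Read the current setting; raises OSError if the file is missing."""
    with open(path) as handle:
        return parse_swappiness(handle.read())


def describe(value):
    """Summarise a setting using the end points given in the document."""
    if value == 0:
        return "strongly prefers reclaiming plain pagecache"
    if value == 100:
        return "treats pagecache and mapped memory equally"
    if 0 < value < 100:
        return "somewhere between the two extremes"
    return "outside the 0-100 range described here"


if __name__ == "__main__":
    # The file only exists on kernels that provide this tunable.
    if os.path.exists(SWAPPINESS_PATH):
        current = read_swappiness()
        print(current, "-", describe(current))
    else:
        print("no", SWAPPINESS_PATH, "on this system")
```

Changing the setting means writing a new number to the same file as root, in the same way as the other /proc tunables mentioned in this document.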
Kernel preemption.

* The much talked about preemption patches made it into 2.5. With this included you should notice much lower latencies especially in demanding multimedia applications.
* Note, there are still cases where preemption must be temporarily disabled where we do not. These areas occur in places where per-CPU data is used.
* If you get "xxx exited with preempt count=n" messages in syslog, don't panic, these are non fatal, but are somewhat unclean. (Something is taking a lock, and exiting without unlocking)
* If you DO notice high latency with kernel preemption enabled in a specific code path, please report that to Andrew Morton and Robert Love. The report should be something like "the latency in my xyz application hits xxx ms when I do foo but is normally yyy" where foo is an action like "unlink a huge directory tree".

Process scheduler improvements.

* Another much talked about feature. Ingo Molnar reworked the process scheduler to use an O(1) algorithm. In operation, you should notice no changes with low loads, and increased scalability with large numbers of processes, especially on large SMP systems.
* Scheduler is now Hyperthreading SMP aware and will disperse processes over physically different CPUs, instead of just over logical CPUs.
* Robert Love wrote various utilities for changing behaviour of the scheduler (binding processes to CPUs etc). You can find these tools at http://tech9.net/rml/schedutils
* The behavior of sched_yield() changed a lot. A task that uses this system call should now expect to sleep for possibly a very long time. Tasks that do not really desire to give up the processor for a while should probably not make heavy use of this function. Unfortunately, some GUI programs (like Open Office) do make excessive use of this call and under load their performance is poor. It seems this new 2.5 behavior is optimal but some user-space applications may need fixing.
* The above applies to use of yield() in the kernel, too.
* 2.5 adds system calls for manipulating a task's processor affinity: sched_getaffinity() and sched_setaffinity()
* Regressions to mingo@redhat.com and rml@tech9.net
* Debian users who encounter effects such as skips in mp3 playback, jerky mouse movement may want to stop the X server from renicing itself to -10. You can alter this permanently with 'dpkg-reconfigure xserver-common'; if you elect not to have /etc/X11/Xwrapper.config managed by debconf, simply edit it directly.
* Balancing of IRQs between multiple CPUs should be handled using the irqbalance (http://people.redhat.com/arjanv/irqbalance/) program.
* David Mosberger maintains a webpage containing some current 'known gotchas' of the O(1) scheduler at http://www.hpl.hp.com/research/linux/kernel/o1.php

PCI.

* PCI domain support has been added. For most people, this just means that all PCI slot names are extended with "0000:" on the front, but for people with bigger servers it means they're able to access all their PCI devices.
* More hotplug drivers have been added, including a fake PCI hotplug driver so people without specialised hardware can test hotplug features.

Fast userspace mutexes (Futexes).

* Rusty Russell added functionality that allows userspace to have fast mutexes that only use syscalls when there is contention. Used by NPTL.
* Bert Hubert has written some documentation on this functionality at http://ds9a.nl/futex-manpages

epoll

Davide Libenzi wrote an event based poll replacement that got included in 2.5. More info available at
http://www.xmailserver.org/linux-patches/nio-improve.html
http://lwn.net/Articles/13587/

Threading improvements.

* Ingo Molnar put a lot of work into threading improvements during 2.5. Some of the features of this work are:
  + Generic pid allocator (arbitrary number of PIDs with no slowdown, unified pidhash).
  + Thread Local Storage syscalls
  + sys_clone() enhancements (CLONE_SETTLS, CLONE_PARENT_SETTID, CLONE_SETTID, CLONE_CLEARTID, CLONE_DETACHED)
  + POSIX thread signals stuff (atomic signals, shared signals, etc.)
  + Per-CPU GDT
  + Threaded coredumping support
  + sys_exit() speedups (O(1) exit)
  + Generic, improved futexes, vcache
  + New, threading related ptrace features
  + exit/fork task cache
  + /proc updates for threading
  + API changes for threading.
* Users should notice a significant speedup in basic thread operations. This is true to a lesser extent even for old-threading userspace libraries such as LinuxThreads.
* Regressions should go to Ingo Molnar and phil-list@redhat.com. Regressions could happen in the area of signal handling and related threading semantics, plus coredumping.
* Native Posix Threading Library (NPTL). Ulrich Drepper worked closely with Ingo on the threading enhancements, and developed a 1:1 model threading library. You can find out more about NPTL at http://people.redhat.com/drepper/nptl-design.pdf

Enhanced coredumping.

* 2.5 offers you the ability to configure the way core files are named through a /proc/sys/kernel/core_pattern file. You can use various format identifiers in this name to affect how the core dump is named.

      %p - insert pid into filename
      %u - insert current uid into filename
      %g - insert current gid into filename
      %s - insert signal that caused the coredump into the filename
      %t - insert UNIX time that the coredump occurred into filename
      %h - insert hostname where the coredump happened into filename
      %e - insert coredumping executable name into filename

  You should ensure that the string does not exceed 64 bytes.
* Multithreaded processes can now dump core

Input layer.

* Possibly the most visible change to the end user. If misconfigured, you'll find that your keyboard/mouse/other input device will no longer work. 2.5 offers a much more flexible interface to devices such as keyboards.
* The downside is more confusing options.
  In the "Input device support" menu, be sure to enable at least the following.

      --- Input I/O drivers
      < > Serial i/o support
      < > i8042 PC Keyboard controller
      [ ] Keyboards
      [ ] Mice

  (Also choose the relevant keyboard/mouse from the list)
* If you find your keyboard/mouse still don't work, edit the file drivers/input/serio/i8042.c, and replace the #undef DEBUG with a #define DEBUG. When you boot, you should now see a lot more debugging information. Forward this information to Vojtech Pavlik
* If you use a KVM switcher, and experience problems, booting with the boot time argument 'psmouse_noext' should fix your problems.
* Users of multimedia keys without X will see changes in how the kernel handles those keys. People who customize keymaps or keycodes in 2.4 may need to make some changes in 2.5

PnP layer.

* Support for plug and play devices such as early ISAPnP cards has improved a lot in the 2.5 kernel. The new code behaves more closely to the code handling PCI devices (probe, remove etc callbacks), and also merges PnP BIOS access code.
* Report any regressions in plug & play functionality to Adam Belay

ALSA.

* The advanced linux sound architecture got merged into 2.5. This offers considerably improved functionality over the older OSS drivers, but requires new userspace tools.
* Several distros have shipped ALSA for some time, so you may already have the necessary tools. If not, you can find them at http://www.alsa-project.org/
* ALSA can emulate the OSS interface using the snd_pcm_oss/snd_pcm_mixer modules. If your card produces nothing but silence, you may need to run alsamixer to unmute channels which /dev/mixer doesn't see
* Note that the OSS drivers are also still functional, and still present. Many features/fixes that went into 2.4 are still not applied to these drivers, and it's still unclear if they will remain when 2.6/3.0 ships. The long term goal is to get everyone moved over to (the superior) ALSA.

AGP.
* The agpgart driver got a long overdue cleanup which involved splitting it into an agpgart core, and per-chipset drivers. You may need to adjust your modules configuration to autoload the chipset drivers on loading the agpgart module.
* Generic AGP 3.0 support is now included.

Faster system calls.

* Systems that support the SYSENTER extension (Basically Intel PPro and above, and AMD Athlons) now have a faster method of making the transition from userspace to kernelspace when a syscall is performed.
* Without an updated glibc, this will not be noticeable.
* VMWare 4 users may get crashes due to this. Zwane Mwaikambo wrote a patch for a "nosysenter" option which is worth googling for if there isn't a vmware update available.
* Regressions to torvalds@transmeta.com and libc-alpha@redhat.com

procps.

* The 2.5 /proc filesystems changed some statistics, which confuse older versions of procps. Rik van Riel and Robert Love have been maintaining a version of procps during the 2.5 cycle which tracks changes to /proc which you can find at http://tech9.net/rml/procps/
* Alternatively, the procps by Albert Cahalan now supports the altered formats since v3.0.5 -- http://procps.sf.net/
* The /proc/meminfo format changed slightly which also broke gtop in strange ways. Likely this also broke some of the KDE/GNOME panel applets.

Framebuffer layer.

* James Simmons has reworked the framebuffer/console layer considerably during 2.5. Support for some cards is still lagging a little, but it should be functionally no different than previous incarnations.
* boot time arguments may have changed depending on your driver. an example of the change is..

      append = "video=radeon:1024x768-24@100"

  needs to become..

      append = "video=radeonfb:1024x768-24@100"

* Current userspace tools (fbset for eg) are not yet updated, and won't function as expected.
* The VESA framebuffer now enables MTRRs for the framebuffer memory range during initialisation (Note: PCI cards only).
  If you notice screen corruption, please report this, along with an lspci output, so your card can be blacklisted.
* Any problems should go to

IDE.

* The IDE code rewrite was subject to much criticism in early 2.5.x, which put off a lot of people from testing. This work was then subsequently dropped, and reverted back to a 2.4.18 IDE status. Since then additional work has occurred, but not to the extent of the first cleanup attempts.
* Known problems with the current IDE code.
  + Simplex IDE devices (eg Ali15x3) are missing DMA sometimes
  + Serverworks OSB4 may panic on bad blocks or other non fatal errors
  + PCMCIA IDE hangs on eject
  + Most PCMCIA devices have unload races and may oops on eject
  + Modular IDE does not yet work, modular IDE PCI modules sometimes oops on loading
  + ide_scsi is completely broken in 2.5.x. Known problem. If you need it either use 2.4 or fix it 8)
* IDE disk geometry translators like OnTrack, EZ Partition, Disk Manager are no longer supported. The only way forward is to remove the translator from the drive, and start over.

IDE TCQ.

* Tagged command queueing for IDE devices has been included.
* Not all combinations of controllers & devices may like this, so handle with care. READ AS: ** Don't use IDE TCQ on any data you value. It's likely bad combinations will be blacklisted as and when discovered.
* If you didn't choose the "TCQ on by default" option, you can enable it by using the command
      echo "using_tcq:32" > /proc/ide/hdX/settings
  (replacing 32 with 0 disables TCQ again).
* Report success/failure stories to Jens Axboe with inclusion of hdparm -i /dev/hdX, and lspci output.

SCSI.

* Various SCSI drivers still need work, and don't even compile.
* Various drivers currently lack error handling. These drivers will cause warnings during compilation due to missing abort: & reset: functions.
* Note, that some drivers have had these members removed, but still lack error handling.
  Those noticed so far are ncr53c8xxx, sym53c8xx and inia100
* large dev_t support allowing thousands of disks to be supported (was 128 or 256 in the 2.4 series)
* major code cleanup, initially to support the block layer (bio) improvements have led to:
  + better throughput (?) [less double handling of data]
  + per HBA locks (there was a single io_request_lock in the 2.4 series)
  + more flexible interface to HBA drivers
  + better hotplug support, especially for USB mass storage and ieee1394 sbp2 devices [well it's work_in_progress]
* improved error processing and scanning code (support for large, sparse lun spaces)
* lots of scsi driver "innards" available via sysfs

v4l2.

* The video4linux API finally got its long awaited cleanup.
* xawtv, bttv and most other existing v4l tools are also compatible with the new v4l2 layer. You should notice no loss in functionality.
* See http://bytesex.org/v4l/ for more information.

Quota reworking.

The new quota system needs new tools. Supports 32 bit uids. http://www.sf.net/projects/linuxquota/

CD Recording.

* Jens Axboe added the ability to use DMA for writing CDs on ATAPI devices. Writing CDs should be much faster than it was in 2.4, and also less prone to buffer underruns and the like.
* Updated cdrecord in rpm and tar.gz can be found at *.kernel.org/pub/linux/kernel/people/axboe/tools/ (http://ftp.kernel.org/pub/linux/kernel/people/axboe/tools/)
* With the above tools, you also no longer need ide-scsi in order to use an IDE CD writer.
* Ripping audio tracks off of CDs now also uses DMA and should be notably faster. You can also find an updated cdda2wav at the same location.
* Send good/bad reports of audio extraction with cdda2wav and burning with the modified cdrecord to Jens Axboe
* Currently only 'open by device name' works in cdrecord.
      cdrecord -dev=/dev/hdX -inq
* More info at http://lwn.net/Articles/13538/ & http://lwn.net/Articles/13160/

USB:

* Very few user visible changes; the only noticeable 'major' change is that there is now only one UHCI driver. As noted elsewhere, usbdevfs got renamed to usbfs.

Nanosecond stat:

The stat64() syscall got changed to return jiffies granularity. This allows make(1) to make better decisions on whether or not it needs to recompile a file. Not all filesystems may support such precision.

Filesystems:

A number of additional filesystems have made their way into 2.5. Whilst these have had testing out of tree, the level of testing after merging is unparalleled. Be wary of trusting data to immature filesystems. A number of new features and improvements have also been made to the existing filesystems from 2.4. Reports of stress testing with the various tools available would be beneficial.

Generic VFS changes.

* Since Linux 2.5.1 it is possible to atomically move a subtree to another place. The call is...
      mount --move olddir newdir
* Since 2.5.43, dmask=value sets the umask applied to directories only. The default is the umask of the current process. The fmask=value sets the umask applied to regular files only. Again, the default is the umask of the current process.

devfs.

* devfs got somewhat stripped down and a lot of duplicate functionality got removed. You now need to enable CONFIG_DEVPTS_FS=y and mount the devpts filesystem in the same manner you would if you were not using devfs.

EXT2.

* 2.5.49 included an extension to ext2 which will cause it to not attach buffer_head structures to file or directory pagecache at all, ever. This is for the big highmem machines. It is enabled via the `-o nobh' mount option.
* The ext2 filesystem is now using finer-grained locking which yields reduced context switch rates and higher throughput on large SMP machines.

EXT3.
* The ext3 filesystem has gained indexed directory support, which offers considerable performance gains when used on filesystems with directories containing large numbers of files.
* In order to use the htree feature, you need at least version 1.32 of e2fsprogs.
* Existing filesystems can be converted using the command
      tune2fs -O dir_index /dev/hdXXX
* The latest e2fsprogs can be found at http://prdownloads.sourceforge.net/e2fsprogs
* data=journal mode is currently broken.
* The ext2 and ext3 filesystems have new file allocation policies (the "Orlov allocator") which will place subdirectories closer together on-disk. This tends to mean that operations which touch many files in a directory tree are much faster if that tree was created under a 2.5 kernel.

Reiserfs.

* Reiserfs now supports inode attributes such as immutable.

NFS.

* Basic support has been added for NFSv4 (server and client)
* Additionally, kNFSD now supports transport over TCP. This experimental feature is also backported to 2.4.20
* Interoperability reports with other OSes would be useful.
* v1.0.3 of nfs-utils supports the newer 2.5 kernel's change of kdev_t type. You can grab it at http://nfs.sourceforge.net
* Problems to nfs@lists.sourceforge.net

NTFS.

* A new, rewritten NTFS driver got merged during 2.5. It has the following main benefits over the old driver:
  + SMP and reentrant safe
  + support bigger than 4 kB cluster sizes
  + full support for sparse files on W2K/XP/W2K3
  + mmap() support
  + More stable, and much faster than the previous NTFS driver.
  + Still read-only, but with safe file overwrite support without changes to the file size
  + More information is available at http://linux-ntfs.sf.net

sysfs.

In simple terms, the sysfs filesystem is a saner way for drivers to export their innards than /proc. This filesystem is always compiled in, and can be mounted just like another virtual filesystem. No userspace tools beyond cat and echo are needed.
      mount -t sysfs none /sys
See Documentation/filesystems/sysfs.txt for more info.

JFS.

IBM's JFS got merged during 2.5. (And backported to 2.4.20, but it was still a new feature here first.) You can read more about JFS at http://www-124.ibm.com/developerworks/oss/jfs/index.html

XFS.

The SGI XFS filesystem has been merged, and has a number of userspace features. Users are encouraged to read http://oss.sgi.com/projects/xfs for more information. The various utilities for creating and manipulating XFS volumes can be found on SGI's ftp server: ftp://oss.sgi.com/projects/xfs/download/download/cmd_tars/xfsprogs-2.3.9.src.tar.gz

CIFS.

Support utilities and documentation for the common internet file system (CIFS) can be found at http://us1.samba.org/samba/Linux_CIFS_client.html

FAT.

CVF (Compressed VFAT) support has been removed. This means you will no longer be able to access DriveSpace partitions.

HugeTLBfs.

Files in this filesystem are backed by large pages if the CPU supports them. See Documentation/vm/hugetlbpage.txt for more details.

Internal filesystems.

/proc/filesystems will contain several filesystems that are not mountable in userspace, but are used internally by the kernel to keep track of things. Amongst these filesystems are futexfs and eventpollfs

Oprofile.

A system wide performance profiler has been included in 2.5. With this option compiled in, you'll get an oprofilefs filesystem which you can mount, that the userspace utilities talk to. You can find out more at http://oprofile.sourceforge.net/oprofile-2.5.html (http://oprofile.sourceforge.net/)

util-linux.

* You need a fixed readprofile utility for 2.5. Present in util-linux as of 2.11z

Improved BIOS table support.

* Linux now supports various new BIOS extensions.

Simple boot flag support.

The SBF specification is an x86 BIOS extension that allows improved system boot speeds. It does this by marking a CMOS field to say "I booted okay, skip extensive POST next reboot".
Userspace tool is at http://www.codemonkey.org.uk/cruft/sbf.c. More info on SBF is at http://www.microsoft.com/hwdev/resources/specs/simp_bios.asp

EDD Support.

* Support for BIOS Enhanced Disk Drive Services (EDD) was added, which exports information on what the BIOS thinks is the boot drive and other useful info to /sys/firmware/edd
* Matt Domsch is interested in hearing success/fails on this code with some simple tests described at http://domsch.com/linux/edd30/results.html

Intel IPMI support.

* IPMI is a standard for monitoring the hardware in a system.
* Project home page: http://openipmi.sourceforge.net
* Specification: http://www.intel.com/design/servers/ipmi/spec.htm

x86 CPU detection.

* The CPU detection code got a pretty hefty shake up. To be certain your CPU has all relevant workarounds applied, be sure to check that it was detected correctly. cat /proc/cpuinfo will tell what the kernel thinks it is.
* Likewise, the x86 MTRR driver got a considerable makeover. Check that XFree86 sets up MTRRs in the same way it did in 2.4 (Failures will get logged in /var/log/XFree86.log)
* Early PII Xeon processors and possibly other early PII processors require microcode updates either from the BIOS or the microcode driver to work around CPU bugs the O(1) scheduler exposes. You can find the relevant microcode tools at http://www.urbanmyth.org/microcode/
* Any regressions in both should go to mochel@osdl.org Cc: davej@suse.de

Extra tainting.

Running certain AMD processors in SMP boxes is out of spec, and will taint the kernel with the 'S' flag. Running 2 Athlon XPs for example may seem to work fine, but may also introduce difficult to pin down bugs. In time it's likely this tainting will be extended to cover other out of spec cases. Additionally, the new modules interface will taint the kernel if you try to 'force' a module to load with insmod -f.

Power management.

* 2.5 contains a more up to date snapshot of the ACPI driver.
  Should you experience any problems booting, try booting with the argument "acpi=off" to rule out any ACPI interaction. ACPI has a much more involved role in bringing the system up in 2.5 than it did in 2.4
* The old "acpismp=force" boot option is now obsolete, and will be ignored due to the old "mini ACPI" parser being removed.
* software suspend is still in development, and in need of more work. It is unlikely to work as expected currently.

CPU frequency scaling.

Certain processors have the facility to scale their voltage/clockspeed. 2.5 introduces an interface to this feature, see Documentation/cpufreq for more information. This functionality also covers features like Intel's speedstep, and the Powernow! feature present in mobile AMD Athlons. In addition to x86 variants, this framework also supports various ARM CPUs. You can find a userspace daemon that monitors battery life and adjusts accordingly at: http://www.staikos.net/~staikos/cpufreqd/

Background polling of MCE.

The machine check handler has been extended so that it regularly polls for any problems on AMD Athlon, and Intel Pentium 4 systems. This may result in machine check exceptions occurring more frequently than they did in 2.4 on out of spec systems (Overclocking/inadequate cooling/underrated PSU etc..).

LVM2 - DeviceMapper.

The LVM1 code got removed wholesale, and replaced with a much better designed 'device mapper'.
* This is backwards compatible with the LVM1 disk format.
* Device mapper does require new tools to manage volumes however. You can get these from ftp://ftp.sistina.com/pub/LVM2/tools/

Debugging options.

During the stabilising period, it's likely that the debugging options in the kernel hacking menu will trigger quite a few problems. Please report any of these problems to linux-kernel@vger.kernel.org rather than just disabling the relevant CONFIG_ options. Merging of kksymoops means that the kernel will now spit out automatically decoded oopses (no more feeding them to ksymoops).
For this reason, you should always enable the option in the kernel hacking menu labelled "Load all symbols for debugging/kksymoops". Testing with CONFIG_PREEMPT will also increase the amount of debug code that gets enabled in the kernel. Kernel preemption gives us the ability to do a whole slew of debugging checks like sleeping with locks held, scheduling while atomic, exiting with locks held, etc.

Compiler issues.

* The recommended compiler (for x86) is still 2.95.3.
* When compiled with a modern gcc (i.e. gcc 3.x), 2.5 will use additional optimisations that 2.4 didn't. This may shake out compiler bugs that 2.4 didn't expose.
* Do not use gcc 3.0.x on x86 due to a stack pointer handling bug.
* gcc 2.96 is not supported with CONFIG_FRAME_POINTER=y due to a stack pointer handling bug.
* gcc 3.2.2-5 as shipped by Red Hat generates incorrect code in the kmalloc optimisation introduced in 2.5.71. See http://linus.bkbits.net:8080/linux-2.5/cset@1.1410

Security concerns.

Several security issues solved in 2.4 may not yet be forward ported to 2.5. For this reason 2.5.x kernels should not be tested on untrusted systems. Testing known 2.4 exploits and reporting results is useful.

Networking.

* ebtables
  The bridging firewall code got merged. To manage these you'll need the ebtables tool available from http://users.pandora.be/bart.de.schuymer/ebtables/ More on bridge-nf can be found at http://bridge.sourceforge.net
* Bridged packets can now be 'seen' by iptables.
* IPSec
  Linux finally has IPSec support in mainline. Use the KAME tools port on ftp://ftp.inr.ac.ru/ip-routing/iputils-ss021109-try.tar.bz2. For more info see http://www.lib.uaa.alaska.edu/linux-kernel/archive/2002-Week-44/1127.html. Also Bert Hubert has a howto at http://lartc.org/howto/lartc.ipsec.html. Additionally, ipsec-utils is at http://sourceforge.net/projects/ipsec-tools. Herbert Xu also has patches against FreeSWAN 2.00 to allow its userspace to use the 2.5 IPSec functionality.
  They can be downloaded from http://gondor.apana.org.au/~herbert/freeswan/
* Some applications may trigger the kernel to spit out warnings about 'process xxx using obsolete setsockopt SO_BSDCOMPAT'.
  + Bind 9.2.2 checks for #ifdef SO_BSDCOMPAT in correctly, so a recompile is all that is needed.
  + bind9-host from debian testing triggers, though the 'host' package doesn't.
  + process `snmpd' is using obsolete setsockopt SO_BSDCOMPAT
  + process `snmptrapd' is using obsolete setsockopt SO_BSDCOMPAT
  + ntop uses obsolete (PF_INET,SOCK_PACKET)
* Users of boxes with >1 NIC may find that for eg, eth0 and eth1 refer to the opposites of what they did in 2.4. This is a bug that will be fixed before 2.6.0. One option (or management workaround) for this is to use 'nameif' to name Ethernet interfaces. There is a HOWTO for doing this at
* Support for various new RFCs.
  + RFC3173 (IP Payload Compression).
* Linux reaches congestion collapse when subjected to heavy network load. NAPI fixes this amongst other things, thereby improving network performance. More info at http://www.cyberus.ca/~hadi/usenix-paper.tgz and ftp://robur.slu.se/pub/Linux/net-development/NAPI/

Crypto

* A generic crypto API has been merged, offering support for various algorithms (HMAC, MD4, MD5, SHA-1, DES, Triple DES EDE, Blowfish)
* This functionality is used by IPSec and the crypto-loop. It's possible that it will later also be available for use in userspace through a crypto device, possibly compatible with the OpenBSD crypto userspace.
* The in-kernel loopback device can now do crypto using the CryptoAPI. May need new userspace tools.

Deprecated.

* usbdevfs will be going away in 2.7. The same filesystem can be mounted as 'usbfs' in recent 2.4 kernels, and in 2.5.52 and above, which is what the filesystem will furthermore be known as.
* elvtune is deprecated (as are the ioctl's it used). Instead, the io scheduler tunables are exported in sysfs (see below) in the /sys/block/ /iosched directory.
  Jens wrote a document explaining the tunables of the new scheduler at http://www.lib.uaa.alaska.edu/linux-kernel/archive/2002-Week-44/att-deadline-iosched.txt

Ports.

* 2.5 features support for several new architectures.
  + x86-64 (AMD Hammer)
  + ppc64
  + UML (User mode Linux). See http://user-mode-linux.sf.net for more information.
  + uCLinux: m68k(w/o MMU), h8300 and v850. sh also added a uCLinux option.
* The 64 bit s390x port got collapsed into a single port, appearing as a config option in the base s390 arch.
* In the opposite direction, arm26 was split out from arm.

Oleg Drokin pointed out that ReiserFS supported inode attributes in 2.4.17 as well, and so couldn't really be considered a new feature in 2.5; but he added:

On the real new features list we have:
* Relocated/nonstandard size journal support (actually was included in 2.4.22-pre3, too)
* Support for writes larger than 4k in size (get speedup on large file writes, esp. in append mode, should be more SMP friendly, too)
* Variable blocksize support (i.e. you can choose any blocksize in range of 1024 .. PAGE_CACHE_SIZE, must be power of 2)

Paul Dickson also pointed out that the features Dave had marked "deprecated" were in fact already removed. As he put it, "Deprecated means "in the process of being phased out, but still usable"".

Elsewhere, Matthew Dharm also pointed out for the little section on USB, "We may want to mention here that usb-storage has changed behavior. A device which is disconnected and then re-connected is not re-associated with the old /dev/ node. Also some performance enhancements."

Elsewhere, also regarding the USB section, Greg KH said:

The USB host controller drivers got renamed in 2.5. They are now:
    uhci-hcd.o for UHCI USB host controllers
    ohci-hcd.o for OHCI USB host controllers
    ehci-hcd.o for EHCI (USB 2.0) host controllers

Elsewhere, James H. Cloos Jr. pointed out that in the "IO Subsystem" section, Dave had said that I/O elevators could be selected via 'elvtune', while in the deprecated section, Dave had said that 'elvtune' was deprecated, and that the elevators were actually tunable through a SysFS interface. Meelis Roos also made this connection elsewhere, and Dave and others tried to puzzle out whether 'elvtune' would or would not be the preferred method.

Elsewhere regarding software suspend, Pavel Machek pointed out that it was not quite as badly off as Dave had described. Pavel said, "Actually it tends to work these days. No SMP, be careful with PREEMPT, and no unusual hardware, and it should work."

Elsewhere, Alan Cox also had a number of comments about Dave's first post. Among them, he pointed out that the URL Dave had given for one of the BitKeeper changesets was apparently incorrect. Larry McVoy replied, "I know, sorry. The version numbers in BK are not stable, they can't be. You have to use the underlying internal version number. If someone who knows can show me the output of 'bk changes -r' for that changeset I will figure out a way to have a URL that doesn't change and send it to Dave for that doc as well as post it there." Steven Cole said that http://linus.bkbits.net:8080/linux-2.5/cset@1.1215.127.10 looked correct.

5. Merging Software Suspend Patches; Aborting A Suspend-In-Progress

11 Jul 2003 - 15 Jul 2003 (53 posts) Archive Link: "Thoughts wanted on merging Software Suspend enhancements"
Topics: Big Memory Support, Compression, Power Management: ACPI, Software Suspend
People: Nigel Cunningham, Pavel Machek, Dmitry Torokhov, Jamie Lokier, Vojtech Pavlik

Nigel Cunningham asked:

As you may know, there has been a lot of work done on the 2.4 version of software suspend.
This includes:
* async i/o
* back out on errors rather than panicking (where possible)
* enhancements to refrigerator so it successfully freezes processes even under high load
* save a full image rather than freeing just about all the memory first
* highmem support
* image compression support
* swapfile support in progress
* nice display
* user can abort at any time during suspend (oh, I forgot, I wanted to...) by just pressing Escape
* extensive debugging info that doesn't need to be compiled in and can be adjusted during the suspend cycle (very handy for diagnosing issues)

I'm wanting to get your thoughts on how we should go about merging it. I don't think these qualify as bug fixes, but current users (and I'm not excluding myself!) would certainly like to see the patch merged sooner rather than later. Would it be a good idea to seek to get Marcelo and Andrew to take it into 2.4 and 2.6, and then aim for 2.[7|9]?

Pavel Machek objected to the idea of aborting a suspend operation by just pressing the ESC key. He said, "We don't want joe random user that is at the console to prevent suspend by just pressing Escape. Maybe magic key to do that would be acceptable..."

Dmitry Torokhov replied that in most cases, the machine being suspended would be a laptop, and so there would only be one Joe User present. He said, "I myself would rather have an option to press ESC than remember what SysRq really maps to as by the time I would figure that out the laptop would already be suspended. IMHO, an option to use ESC, probably compile time option, is a good thing." But Pavel was against adding more compile options unless they were really necessary. He also said any keyboard solution would be ugly, including the magic sysrq, but that the magic sysrq would be useful enough to be acceptable.

Jamie Lokier replied:

Can't you just use the Suspend button?
:) I'd be inclined to initiate suspend-to-disk when the laptop's lid is closed, or when I press the suspend button if ACPI would be so accommodating.

After closing the lid, if I change my mind there's only two inputs I can do quickly: press the Suspend button, or open the lid. SysRq-Escape would take a couple of seconds longer due to needing to open the lid.

Of those, I'd be worried about the fragile lid switch occasionally bouncing as I moved the laptop, causing it to fail to suspend in my bag. The button is well protected.

Kent Borg said he wanted his laptop to keep running when the lid was closed, and Pavel said, "Of course, this *needs* to be configurable (== handled by userland). If you are using external keyboard/mouse you do not want your notebook open." But elsewhere he said that using the Suspend button would be acceptable in general. He said, "If it is the same button that would wake machine up when it finished suspend... I guess that makes sense."

Vojtech Pavlik suggested just having any keypress stop a suspend, but Nigel said this would make it too easy to accidentally interrupt the suspend.

Elsewhere, Nigel said he still preferred using the Esc key. He said, "Having listened to the arguments, I'll make pressing Escape to cancel the suspend a feature which defaults to being disabled and can be enabled via a proc entry in 2.4. I won't add code to poll for ACPI (or APM) events :>". Pavel suggested, "At least no new proc entry, please. Make it depend on sysrq_enabled and disable it completely if sysrq support is not compiled in." And Vojtech also suggested "making it a mappable function in the keymap, like reboot is for example. Both for initiating and stopping the suspend."

There was a bit more discussion, but Nigel stood firm, and the thread petered out.

6. File-Time System Calls; Status Of ReiserFS

11 Jul 2003 - 15 Jul 2003 (8 posts) Archive Link: "utimes/futimes/lutimes syscalls"
Topics: FS: ReiserFS, FS: XFS
People: Ulrich Drepper, Andrew Morton, Nikita Danilov, Tomas Szepe

Ulrich Drepper said:

With the introduction of the nanosecond fields in struct stat the utime() syscall is kind of obsolete. It's not possible anymore to restore the exact access/modification time of a file. Unix defines the utimes() function for this. It is currently implemented in glibc on top of the utime() syscall which used to be OK but isn't anymore today. In addition some systems provide the futimes() and lutimes() functions with appropriate semantics.

The question: would a patch introducing these syscalls be accepted? If there are filesystems which store the sub-seconds on disk I think this is necessary since otherwise all kinds of programs (including archives) cannot be written correctly. If the sub-seconds only live in memory I still think it would be good to have the syscalls but it would not be that urgent.

Andrew Morton replied, "XFS (at least) stores nanoseconds on disk. So yes, I think we should make this change." And Nikita Danilov pointed out that ReiserFS also stored nanoseconds on disk.

Tomas Szepe asked, as a by-the-way, when ReiserFS would be ready at last, and Nikita replied:

Real soon now. Latest benchmark results are available at http://namesys.com/intbenchmarks/mongo/03.07.11.light/short.html

we still have problems with delete performance, but in three to ten days reiser4 will be released for public testing.

7. Linux 2.6.0-test1 Released

13 Jul 2003 - 18 Jul 2003 (16 posts) Archive Link: "Linux v2.6.0-test1"
People: Linus Torvalds

Linus Torvalds announced Linux v2.6.0-test1 (http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.0-test1), and explained:

the naming should be familiar - it's the same deal as with 2.4.0.
One difference is that while 2.4.0 took about 7 months from the pre1 to the final release, I hope (and believe) that we have fewer issues facing us in the current 2.6.0. But very obviously there are going to be a few test-releases before the real thing.

The point of the test versions is to make more people realize that they need testing and get some straggling developers realizing that it's too late to worry about the next big feature. I'm hoping that Linux vendors will start offering the test kernels as installation alternatives, and do things like make upgrade internal machines, so that when the real 2.6.0 does happen, we're all set.

8. Status Of XBox Support

14 Jul 2003 - 15 Jul 2003 (12 posts) Archive Link: "[PATCH] XBox Gaming System subarchitecture."
Topics: Microsoft
People: Linus Torvalds, Anders Gustafsson

Anders Gustafsson posted a patch adding XBox support to Linux, and Linus Torvalds replied:

Quite frankly, for Xbox support I want it to become a lot more commonly used before I actually put it into the standard kernel.

Why? Simply because for now it's still a fairly specialized thing, and as such I have to weigh the benefits of including it in the standard kernel against the negatives of just being a bit politically "hot potato".

Don't get me wrong: I think doing an Xbox port is fine. It's just that putting it in the standard tree is not likely a good idea. I can well imagine a number of Linux distributors who do not feel like they need the aggravation ;)

Anders replied:

Okey. Now I know, thanks. I assumed that it was either this or that it had to follow the standard procedure of posting the mach-patch(/patchset) a hundred times to lkml before it got accepted.

(And regarding the distros: The distributors could just rip that part out while doing all their patches ;). And I know that at least mandrake has a positive look on xbox-distro. And the mandrake devels were especially helpful in porting their installer to be compatible with the xbox.)
Just to make clear: The patch does nothing that involves anything with the copy-protection. Not even the hdd-unlock. It is aimed at those who replace the bios in the xbox with the clean microsoft-free cromwell-bios, which has the sole purpose of booting linux.

9. Linux 2.6 Feature Documentation By Joe Pranevich

14 Jul 2003 - 15 Jul 2003 (5 posts) Archive Link: "Wonderful World of Linux 2.6 - Linux 2.6 features document (first revision)"
Topics: Version Control
People: Joe Pranevich

Joe Pranevich said:

I've recently put together the first draft of a features document describing the changes in Linux 2.6. (I did similar documents for both Linux 2.2 and Linux 2.4.) It's based almost entirely on BitKeeper changelogs (with clarifying information pulled from the lists and the web), so there is a chance that I misunderstood something or that I missed something else entirely. Please give it a look over and if you see anything that needs a look-over, please let me know.

As it stands now, I feel pretty good about how it turned out so I'm finally comfortable mailing what I have around. (There are still a couple areas that need expanding on, I think...)

As of right now, you can find the latest versions of the document available online.

Text version: http://www.kniggit.net/wwol26.txt
Tersely formatted HTML: http://www.kniggit.net/wwol26.html

Several people praised Joe's work and recommended it, and offered corrections.

10. nfs-utils 1.0.4 Released

14 Jul 2003 (2 posts) Archive Link: "ANNOUNCE: nfs-utils 1.0.4"
Topics: FS: NFS
People: Neil F. Brown, Steven Cole

Neil F. Brown announced:

This release of nfs-utils contains:

1. Fix for a remotely exploitable buffer-overflow bug.
2. assorted minor bug fixes
3. Extensive changes to make use of new functionality in linux-2.6.0 nfsd

nfs-utils 1.0.4 can be downloaded from http://sourceforge.net/project/showfiles.php?group_id=14 or http://www.{countrycode}.kernel.org/pub/linux/utils/nfs/

I consider this release to be a pre-release for 1.1.0 which I hope to release before linux-2.6.0-final. Bug reports are very welcome.

1. A buffer-overflow bug was found by Janusz Niewiadomski, iSEC Security Research, http://isec.pl/
   It is trivially exploitable to effect a remote denial of service. More subtle exploits may be possible. I recommend that all users of nfs-utils either:
     1. upgrade to 1.0.4; or
     2. Get an update from their vendor (most vendors should have an update available).
   I also recommend that all NFS services be protected from the internet-at-large by a firewall where that is possible.

2. See the change log in the source for details on bug fixes.

3. In 2.4 and earlier kernels, the nfs server needed to know about any client that expected to be able to access files via NFS. This information would be given to the kernel by "mountd" when the client mounted the filesystem, or by "exportfs" at system startup. exportfs would take information about active clients from /var/lib/nfs/rmtab.

   This approach is quite fragile as it depends on rmtab being correct which is not always easy, particularly when trying to implement fail-over. Even when the system is working well, rmtab suffers from getting lots of old entries that never get removed.

   With 2.6 we have the option of having the kernel tell mountd when it gets a request from an unknown host, and mountd can give appropriate export information to the kernel. This removes the dependency on rmtab and means that the kernel only needs to know about currently active clients.

   To enable this new functionality, you need to:
       mount -t nfsd nfsd /proc/fs/nfs
   before running exportfs or mountd.
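
The ordering requirement above (mount the nfsd filesystem first, then run exportfs and mountd) could be checked in a startup script along these lines. This is only a minimal sketch and not part of nfs-utils: the function names nfsd_is_mounted and start_nfs_exports are hypothetical, "exportfs -a" and the daemon name "rpc.mountd" are assumptions about how exportfs and mountd are invoked on a given system, and the check assumes the usual /proc/mounts layout with the filesystem type in the third field.

```shell
#!/bin/sh
# Hypothetical helper (not part of nfs-utils): succeeds if a filesystem of
# type "nfsd" is listed in the given mount table (default: /proc/mounts).
nfsd_is_mounted() {
    awk '$3 == "nfsd" { found = 1 } END { exit !found }' "${1:-/proc/mounts}"
}

# Hypothetical wrapper: make sure nfsd is mounted before exportfs and mountd
# are started, in the order the announcement gives. Needs root, so it is
# only defined here and never called automatically.
start_nfs_exports() {
    nfsd_is_mounted || mount -t nfsd nfsd /proc/fs/nfs || return 1
    exportfs -a && rpc.mountd
}
```

Sourcing the file only defines the two functions; start_nfs_exports would be called from whatever init script normally starts the NFS server, and it skips the mount when nfsd is already listed.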
If you are using 2.6.0-testX and exporting files with NFS *please* test this out and let me know of any problems.

Steven Cole posted a patch to Documentation/Changes to reflect this work.

11. RadeonFB Maintainership And Development Battles

14 Jul 2003 - 18 Jul 2003 (15 posts)
Archive Link: "Re: radeonfb patch for 2.4.22..."

People: Ani Joshi, Marcelo Tosatti, Benjamin Herrenschmidt

Ani Joshi asked Marcelo Tosatti in private email, "Is there any particular reason why you decided to merge Ben H.'s radeonfb update instead of the one I sent you?" And Marcelo replied, "I merged his version because he sent me your update (0.1.8) plus his code (which are useful fixes he has been working on). It seems things are broken now due to a missing header, but he also sent me that. Do you have any objections to his fixes?"

Ani replied that the patch he'd sent multiple times had included all of those fixes. He said that, since he was the driver maintainer, patches should be routed through him before going to Marcelo. It made no sense, he said, to allow everyone to patch the driver when they pleased, because that made it impossible to organize the work and make meaningful developments. Ani asked Marcelo to clarify his position on driver maintainership.

Marcelo replied that he hadn't realized Ben's code had been included in Ani's patch. He also said, "I received complaints that you were not accepting patches from Ben. He needs that code in." And he added, "If you had accepted Ben's changes in the first place I wouldn't need to apply his patch." He also said that Benjamin Herrenschmidt was interested in taking over maintainership, and asked what Ani's and Ben's feelings would be about that.
Benjamin pointed out that Ani's patch did not contain all of his changes; and added, "I could take over if Ani wants to give up, though I would prefer a dedicated maintainer with more time to do the necessary rewrite of this driver in 2.6 and later, which I don't have time to do right now, however, I can maintain the existing code base if necessary."

Ani also replied to Marcelo, pointing to evidence that he had in fact accepted patches from Benjamin. He added, "Also, it's hard to "accept patches" from people if you do NOT receive any patches from them! Ben's style is to get the maintainers of drivers to go around and search for his personal tree and do their own diffs from that tree, instead of him sending a patch to the maintainer." He added that Benjamin may have written patches too recent to have been included. Ani also pointed out that anyone, including Marcelo, could be accused of "not accepting patches", and that it wasn't fair to judge solely on the basis of complaints.

Benjamin took umbrage at Ani's characterization of him as never submitting patches. He said he always posted his patches to the mailing list, and CCed Ani on all patch posts.

Marcelo said he'd revert Benjamin's patch and apply Ani's, and he asked Benjamin to send Ani a new patch that included all his additional fixes. He said he hoped this would make everyone happy.

12. Status Of Virtual Memory Documentation

15 Jul 2003 (2 posts)
Archive Link: "VM docs and where they are going"

Topics: Virtual Memory

People: Mel Gorman

Mel Gorman said:

I made a small number of typo corrections and expanded the introduction chapter a small bit on the Linux VM docs on my site. The changes are small enough that if anyone has already printed it out, don't bother printing it again. They are still available from the usual places.
Main document
PDF: http://www.csn.ul.ie/~mel/projects/vm/guide/pdf/understand.pdf
HTML: http://www.csn.ul.ie/~mel/projects/vm/guide/html/understand/
Text: http://www.csn.ul.ie/~mel/projects/vm/guide/text/understand.txt

Code commentary
PDF: http://www.csn.ul.ie/~mel/projects/vm/guide/pdf/code.pdf
HTML: http://www.csn.ul.ie/~mel/projects/vm/guide/html/code/
Text: http://www.csn.ul.ie/~mel/projects/vm/guide/text/code.txt

As for where it's going from here, I'm happy to say I'm now writing a book which will be published under the Bruce Perens' Open Book Series (http://www.perens.com/Books/). Some of the things I'm working on for it include:

* Better integration of the code commentary so it's easier to follow
* Much better introduction sections and updating of the software tools
* Shiny CD that comes with softcopy versions of the docs, browsable version of the tree and hopefully online call graph generation
* Chapter on anonymous shared memory including the virtual filesystem
* Assorted expansions and additions
* And best of all, a fairly detailed introduction to 2.6. The 2.6 sections are at the end of each chapter and give a fairly detailed account (right now, it's totalling about 30 pages) of what is new in 2.6 and how it is implemented

If all goes well, it'll be available before the end of this year or in early 2004 :-)

13. BitKeeper Snapshots For 2.6.0-test

18 Jul 2003 (3 posts)
Archive Link: "2.6.0 BK snapshots"

Topics: Version Control

People: Jeff Garzik, Martin Schlemmer

Martin Schlemmer asked if the 2.6.0-test series would have BitKeeper snapshots the way 2.5 did, and Jeff Garzik said:

I know, I know... ;-) Suffixes ("-testN") break my snapshot process. Should be fixed sometime today...

Martin thanked him, and the thread ended.

14.
Adeos M3 Released

18 Jul 2003 (1 post)
Archive Link: "[ANNOUNCE] Adeos m3"

Topics: Microkernels: Adeos, Real-Time: RTAI, SMP

People: Philippe Gerum

Philippe Gerum said:

Adeos m3 for Linux is available at http://savannah.nongnu.org/download/adeos/releases/adeos-m3.tar.gz.

This third milestone provides support for the following platforms:

* 2.4.{19,20,21} and 2.6.0-test1 on x86 hardware (UP and SMP).
* 2.4.20-uc0 on ARM-nommu.

Quite a lot of work has taken place since m2 was released a year ago, mostly aimed at improving stability and determinism in demanding real-time environments.

People seeking "real world" use of Adeos should have a look at the RTAI project: http://www.aero.polimi.it/~rtai/.

Sharon And Joy

Kernel Traffic is grateful to be developed on a computer donated by Professor Greg Benson and Professor Allan Cruse in the Department of Computer Science at the University of San Francisco. This is the same department that invented FlashMob Computing. Kernel Traffic is hosted by the generous folks at kernel.org. All pages on this site are copyright their original authors, and distributed under the terms of the GNU General Public License version 2.0.