

<kc>

<title>Kernel Traffic</title>

<author contact="mailto:zbrown@tumblerings.org">Zack Brown</author>

<issue num="173" date="30 Jun 2002 23:00:00 -0800" />

<stats posts="1159" size="5612" contrib="375" multiples="182" lastweek="146">

<person posts="28" size="112" who="Linus Torvalds " />
<person posts="28" size="87" who="Dave Jones " />
<person posts="27" size="160" who="Ingo Molnar " />
<person posts="26" size="88" who="William Lee Irwin III " />
<person posts="25" size="129" who="Martin Dalecki " />
<person posts="20" size="51" who="&quot;David S. Miller&quot; " />
<person posts="19" size="76" who="Robert Love " />
<person posts="18" size="86" who="Rusty Russell " />
<person posts="16" size="51" who="Frank Davis " />
<person posts="15" size="52" who="Rudmer van Dijk " />
<person posts="15" size="48" who="Adrian Bunk " />
<person posts="15" size="48" who="David Schwartz " />
<person posts="14" size="57" who="Patrick Mochel " />
<person posts="13" size="102" who="Zwane Mwaikambo " />
<person posts="13" size="52" who="Andrew Morton " />
<person posts="13" size="51" who=" (Eric W. Biederman)" />
<person posts="12" size="39" who="Jens Axboe " />
<person posts="12" size="36" who="&quot;Stephen C. Tweedie&quot; " />
<person posts="11" size="245" who="Bartlomiej Zolnierkiewicz " />
<person posts="11" size="86" who="&quot;Adam J. Richter&quot; " />
<person posts="11" size="48" who="Andrea Arcangeli " />
<person posts="10" size="33" who="Austin Gonyou " />
<person posts="10" size="23" who="Alan Cox " />
<person posts="9" size="88" who="Kurt Garloff " />
<person posts="9" size="59" who="Rob Landley " />
<person posts="9" size="53" who="Dmitry Kasatkin " />
<person posts="9" size="34" who="george anzinger " />
<person posts="9" size="32" who="&quot;J.A. Magallon&quot; " />
<person posts="9" size="32" who="" />
<person posts="9" size="27" who="Peter Chubb " />
<person posts="8" size="56" who="Larry McVoy " />
<person posts="8" size="40" who="Andreas Dilger " />
<person posts="8" size="26" who="David Brownell " />
<person posts="8" size="25" who="Helge Hafting " />
<person posts="8" size="24" who="Keith Owens " />
<person posts="7" size="39" who="Robert Love " />
<person posts="7" size="32" who="Stephen Rothwell " />
<person posts="7" size="23" who="Bill Davidsen " />
<person posts="7" size="23" who="&quot;Robbert Kouprie&quot; " />
<person posts="7" size="19" who="&quot;Maciej W. Rozycki&quot; " />
<person posts="7" size="18" who="Roman Zippel " />
<person posts="6" size="52" who="sean darcy " />
<person posts="6" size="42" who="Kai Germaschewski " />
<person posts="6" size="33" who="Craig Kulesa " />
<person posts="6" size="29" who="&quot;Richard B. Johnson&quot; " />
<person posts="6" size="28" who="&quot;Griffiths, Richard A&quot; " />
<person posts="6" size="22" who="Daniel Phillips " />
<person posts="6" size="19" who="Dave Hansen " />
<person posts="6" size="19" who="James Bottomley " />
<person posts="6" size="19" who="&quot;Grover, Andrew&quot; " />
<person posts="6" size="18" who="Benjamin LaHaise " />
<person posts="6" size="18" who="Pavel Machek " />
<person posts="6" size="18" who="Kai Germaschewski " />
<person posts="6" size="14" who="Greg KH " />
<person posts="5" size="77" who="mgross " />
<person posts="5" size="24" who="Jesse Pollard " />
<person posts="5" size="22" who="Daniel Phillips " />
<person posts="5" size="22" who="Nick Bellinger " />
<person posts="5" size="21" who="Brad Hards " />
<person posts="5" size="19" who="Oliver Xymoron " />
<person posts="5" size="17" who="Ville Herva " />
<person posts="5" size="17" who="&quot;Martin J. Bligh&quot; " />
<person posts="5" size="16" who="Andries Brouwer " />
<person posts="5" size="16" who="Pavel Machek " />
<person posts="5" size="16" who="Rik van Riel " />
<person posts="5" size="16" who="Neil Brown " />
<person posts="5" size="14" who="" />
<person posts="5" size="13" who="" />
<person posts="5" size="12" who="DervishD " />
<person posts="4" size="70" who="Ben Greear " />
<person posts="4" size="69" who="Albert Cranford " />
<person posts="4" size="47" who="Matthew Hall " />
<person posts="4" size="34" who="Muli Ben-Yehuda " />
<person posts="4" size="28" who="Jeff Garzik " />
<person posts="4" size="22" who="Jurriaan on Alpha " />
<person posts="4" size="21" who="Tom Rini " />
<person posts="4" size="17" who="Richard Ems " />
<person posts="4" size="15" who="Cort Dougan " />
<person posts="4" size="15" who="Doug Ledford " />
<person posts="4" size="13" who="&quot;H. Peter Anvin&quot; " />
<person posts="4" size="12" who="Trond Myklebust " />
<person posts="4" size="11" who="Sam Ravnborg " />
<person posts="4" size="11" who="Russell King " />
<person posts="4" size="10" who="Fabio Massimo Di Nitto " />
<person posts="4" size="10" who="Melchior FRANZ " />
<person posts="4" size="8" who="Mikael Pettersson " />
<person posts="3" size="79" who="Andrey Panin " />
<person posts="3" size="46" who="Matthew Dobson " />
<person posts="3" size="33" who="Torrey Hoffman " />
<person posts="3" size="21" who="Denis Vlasenko " />
<person posts="3" size="15" who="Lightweight patch manager " />
<person posts="3" size="15" who="Francois Romieu " />
<person posts="3" size="14" who="Federico Sevilla III " />
<person posts="3" size="13" who="jw schultz " />
<person posts="3" size="11" who="Sandy Harris " />
<person posts="3" size="11" who="Dzmitry Chekmarou " />
<person posts="3" size="11" who="Werner Almesberger " />
<person posts="3" size="11" who="Marek Michalkiewicz " />
<person posts="3" size="11" who="Henning Makholm " />
<person posts="3" size="10" who="James Simmons " />
<person posts="3" size="10" who="" />
<person posts="3" size="9" who="Jakob Oestergaard " />
<person posts="3" size="9" who="Lars Magne Ingebrigtsen " />
<person posts="3" size="9" who="Padraig Brady " />
<person posts="3" size="9" who="&quot;Christopher E. Brown&quot; " />
<person posts="3" size="9" who="&quot;Oliver Pitzeier  Home&quot; &lt;oliver@linux-kernel.at&gt;" />
<person posts="3" size="8" who="Donald Becker " />
<person posts="3" size="8" who="Marcelo Tosatti " />
<person posts="3" size="7" who="Witek Krecicki " />
<person posts="3" size="7" who="&quot;Roy Sigurd Karlsbakk&quot; " />
<person posts="3" size="7" who="" />
<person posts="3" size="6" who="Pradeep Padala " />
<person posts="2" size="73" who="Armin Obersteiner " />
<person posts="2" size="44" who="&quot;Nicholas L. Nigay&quot; " />
<person posts="2" size="27" who="Felipe Alfaro Solana " />
<person posts="2" size="24" who="Ed Sweetman " />
<person posts="2" size="24" who="Martin Diehl " />
<person posts="2" size="22" who="David Howells " />
<person posts="2" size="15" who="" />
<person posts="2" size="14" who="Cory Watson " />
<person posts="2" size="13" who="OGAWA Hirofumi " />
<person posts="2" size="13" who="john stultz " />
<person posts="2" size="11" who="Matthew Wakeling " />
<person posts="2" size="11" who="Nick Papadonis " />
<person posts="2" size="11" who="" />
<person posts="2" size="10" who="Manik Raina " />
<person posts="2" size="10" who="&quot;Christopher A. Baumbauer&quot; " />
<person posts="2" size="10" who="" />
<person posts="2" size="10" who="Michael Hohnbaum " />
<person posts="2" size="9" who="Kevin Corry " />
<person posts="2" size="9" who="&quot;Timothy D. Witham&quot; " />
<person posts="2" size="9" who="Edmund GRIMLEY EVANS " />
<person posts="2" size="9" who="&quot;Gross, Mark&quot; " />
<person posts="2" size="8" who="&quot;jdow&quot; " />
<person posts="2" size="8" who=" (Raphael Manfredi)" />
<person posts="2" size="8" who="Stevie O " />
<person posts="2" size="8" who="Douglas Gilbert " />
<person posts="2" size="8" who="Benjamin Herrenschmidt " />
<person posts="2" size="8" who="Stephen Samuel " />
<person posts="2" size="8" who="Richard Gooch " />
<person posts="2" size="7" who="&quot;Shen, JT&quot; " />
<person posts="2" size="7" who="Chris Friesen " />
<person posts="2" size="7" who="Joseph Pingenot " />
<person posts="2" size="7" who="&quot;Bhavesh P. Davda&quot; " />
<person posts="2" size="7" who="Samuel Flory " />
<person posts="2" size="7" who="Matti Aarnio " />
<person posts="2" size="7" who="&quot;Scott Tillman&quot; " />
<person posts="2" size="7" who="&quot;Jonathan A. Davis&quot; " />
<person posts="2" size="7" who="&quot;Martin Schwenke&quot; " />
<person posts="2" size="6" who="Tigran Aivazian " />
<person posts="2" size="6" who="Andrey Nekrasov " />
<person posts="2" size="6" who="Matthew Wilcox " />
<person posts="2" size="6" who="Arnd Bergmann " />
<person posts="2" size="6" who="Jan-Benedict Glaw " />
<person posts="2" size="6" who="Manfred Spraul " />
<person posts="2" size="6" who="&quot;Mike Black&quot; " />
<person posts="2" size="6" who="Stelian Pop " />
<person posts="2" size="6" who="Richard Zidlicky " />
<person posts="2" size="6" who="Ben Clifford " />
<person posts="2" size="6" who="Emre Tezel " />
<person posts="2" size="6" who="Andre Hedrick " />
<person posts="2" size="6" who="Samuel Sieb " />
<person posts="2" size="6" who="Roy Sigurd Karlsbakk " />
<person posts="2" size="5" who="Nicolas Aspert " />
<person posts="2" size="5" who="Jesse Barnes " />
<person posts="2" size="5" who=" (Måns Rullgård)" />
<person posts="2" size="5" who="&quot;Oliver Pitzeier&quot; " />
<person posts="2" size="5" who="John Alvord " />
<person posts="2" size="5" who="Miles Lane " />
<person posts="2" size="5" who="" />
<person posts="2" size="5" who="Xavier Bestel " />
<person posts="2" size="5" who="Erik Steffl " />
<person posts="2" size="5" who="mbs " />
<person posts="2" size="5" who="Diego Calleja " />
<person posts="2" size="5" who=" (Kai Henningsen)" />
<person posts="2" size="5" who="Toni Viemero " />
<person posts="2" size="5" who="Ricky Beam " />
<person posts="2" size="5" who="Erik Andersen " />
<person posts="2" size="5" who="Chris Wright " />
<person posts="2" size="4" who="&quot;Bloch, Jack&quot; " />
<person posts="2" size="4" who="David Weeks " />
<person posts="2" size="3" who="William Thompson " />
<person posts="1" size="80" who="&quot;Jonathan Thorpe&quot; " />
<person posts="1" size="32" who="" />
<person posts="1" size="25" who="&quot;Paul McKenney&quot; " />
<person posts="1" size="21" who="Martin Schwidefsky " />
<person posts="1" size="18" who="Abraham David Smith " />
<person posts="1" size="16" who="Ivan Kokshaysky " />
<person posts="1" size="15" who="Shawn Starr " />
<person posts="1" size="14" who="blaise " />
<person posts="1" size="14" who="&quot;Kristofer T. Karas&quot; " />
<person posts="1" size="14" who="&quot;Łukasz Góralczyk&quot; " />
<person posts="1" size="14" who="Burkhard Bunk " />
<person posts="1" size="13" who="Michał Cieślakiewicz " />
<person posts="1" size="13" who="Ian Stirling " />
<person posts="1" size="13" who="Dirk Schmidt " />
<person posts="1" size="13" who="&quot;David McIlwraith&quot; " />
<person posts="1" size="12" who="Bernd Schubert " />
<person posts="1" size="12" who="Gerald Champagne " />
<person posts="1" size="11" who="Andre Bonin " />
<person posts="1" size="10" who="Thomas Sailer " />
<person posts="1" size="8" who="&quot;John L. Males&quot; " />
<person posts="1" size="8" who="Jonathan Woithe " />
<person posts="1" size="7" who="Samo Gabrovec " />
<person posts="1" size="7" who="Andreas Jellinghaus " />
<person posts="1" size="7" who="Peter Wächtler " />
<person posts="1" size="7" who="Andrew Theurer " />
<person posts="1" size="6" who="&quot;Alexandre P. Nunes&quot; " />
<person posts="1" size="6" who="&quot;Jeff V. Merkey&quot; " />
<person posts="1" size="6" who="Rasmus Bøg Hansen " />
<person posts="1" size="6" who="&quot;KV FRANCE&quot; " />
<person posts="1" size="6" who="toxic " />
<person posts="1" size="5" who="Matthew Harrell " />
<person posts="1" size="5" who="Vasil Kolev " />
<person posts="1" size="5" who="David Weinehall " />
<person posts="1" size="5" who="Joel Jaeggli " />
<person posts="1" size="5" who="Riley Williams " />
<person posts="1" size="5" who="Zak Shaf " />
<person posts="1" size="5" who="" />
<person posts="1" size="5" who="Frank " />
<person posts="1" size="5" who="Ihno Krumreich " />
<person posts="1" size="4" who="Duc Vianney " />
<person posts="1" size="4" who="Richard Liu " />
<person posts="1" size="4" who="Samo Gabrovec " />
<person posts="1" size="4" who="Anton Altaparmakov " />
<person posts="1" size="4" who=" (Jeronimo Pellegrini)" />
<person posts="1" size="4" who="David Lang " />
<person posts="1" size="4" who="Stephen Satchell " />
<person posts="1" size="4" who="&quot;joseph edward&quot; " />
<person posts="1" size="4" who="&quot;Michael Kerrisk&quot; " />
<person posts="1" size="4" who="sullivan " />
<person posts="1" size="4" who="&quot;seawang&quot; " />
<person posts="1" size="4" who="&quot;Albert D. Cahalan&quot; " />
<person posts="1" size="4" who="&quot;PARTHY MACARTHY&quot; " />
<person posts="1" size="4" who="David Lang " />
<person posts="1" size="4" who="&quot;Mala Anand&quot; " />
<person posts="1" size="4" who="Matthew Dharm " />
<person posts="1" size="4" who="&quot;R. Steve McKown&quot; " />
<person posts="1" size="4" who="Security Coordinator " />
<person posts="1" size="4" who="Dimitris Zilaskos " />
<person posts="1" size="4" who="Marc Lefranc " />
<person posts="1" size="4" who="Knut J Bjuland " />
<person posts="1" size="4" who="Patrick Mansfield " />
<person posts="1" size="4" who="Steven Cole " />
<person posts="1" size="4" who="Teodor Iacob " />
<person posts="1" size="3" who="&quot;Hubbard, David&quot; " />
<person posts="1" size="3" who="Jes Sorensen " />
<person posts="1" size="3" who="Ruth Ivimey-Cook " />
<person posts="1" size="3" who="Michael S. Zick " />
<person posts="1" size="3" who="Juliusz Chroboczek " />
<person posts="1" size="3" who="Kasper Dupont " />
<person posts="1" size="3" who="&quot;Steve Best&quot; " />
<person posts="1" size="3" who="Mikael Abrahamsson " />
<person posts="1" size="3" who="Erik McKee " />
<person posts="1" size="3" who="&quot;Lu, Yan P&quot; " />
<person posts="1" size="3" who="Daniel Jacobowitz " />
<person posts="1" size="3" who="Axel Thimm " />
<person posts="1" size="3" who=" (Rogier Wolff)" />
<person posts="1" size="3" who="Michael Clark " />
<person posts="1" size="3" who="Heinz Diehl " />
<person posts="1" size="3" who="Dan Boals " />
<person posts="1" size="3" who="lezong " />
<person posts="1" size="3" who="Alexander Viro " />
<person posts="1" size="3" who="" />
<person posts="1" size="3" who="Bryan Andersen " />
<person posts="1" size="3" who="John covici " />
<person posts="1" size="3" who="Lincoln Dale " />
<person posts="1" size="3" who=" (Linus Torvalds)" />
<person posts="1" size="3" who="Nathan Straz " />
<person posts="1" size="3" who="Stephan Brauss " />
<person posts="1" size="3" who="Rudmer van Dijk " />
<person posts="1" size="3" who="Chris Mason " />
<person posts="1" size="3" who="Brian Gerst " />
<person posts="1" size="3" who="Jamie Bennett " />
<person posts="1" size="3" who="Nick LeRoy " />
<person posts="1" size="3" who="Thunder from the hill " />
<person posts="1" size="3" who="&quot;Craig I. Hagan&quot; " />
<person posts="1" size="3" who="Horst von Brand " />
<person posts="1" size="3" who="&quot;Thomas Duffy&quot; " />
<person posts="1" size="3" who="Stelian Pop " />
<person posts="1" size="3" who="Justin Wojdacki " />
<person posts="1" size="3" who="Robert Schwebel " />
<person posts="1" size="3" who="Daniel Kobras " />
<person posts="1" size="3" who="Wayne Whitney " />
<person posts="1" size="3" who="" />
<person posts="1" size="3" who="RW Hawkins " />
<person posts="1" size="3" who="Carl Wilhelm Soderstrom " />
<person posts="1" size="3" who="James Bottomley " />
<person posts="1" size="3" who="&quot;Kevin Krieser&quot; " />
<person posts="1" size="3" who="Erlend Aasland " />
<person posts="1" size="3" who="&quot;John Hawkes&quot; " />
<person posts="1" size="3" who="Jes Sorensen " />
<person posts="1" size="3" who="John Weber " />
<person posts="1" size="2" who="John Summerfield " />
<person posts="1" size="2" who="Eric Weigle " />
<person posts="1" size="2" who="" />
<person posts="1" size="2" who="Brian Gerst " />
<person posts="1" size="2" who="Billy O'Connor " />
<person posts="1" size="2" who="Skip Gaede " />
<person posts="1" size="2" who="Andi Kleen " />
<person posts="1" size="2" who="&quot;Taavo Raykoff&quot; " />
<person posts="1" size="2" who="Andreas Schwab " />
<person posts="1" size="2" who="Mike Touloumtzis " />
<person posts="1" size="2" who="Bernd Eckenfels " />
<person posts="1" size="2" who="David Mosberger " />
<person posts="1" size="2" who="Your friend " />
<person posts="1" size="2" who="" />
<person posts="1" size="2" who="Allen Campbell " />
<person posts="1" size="2" who="Anders Gustafsson " />
<person posts="1" size="2" who="Leif Sawyer " />
<person posts="1" size="2" who="Brad Heilbrun " />
<person posts="1" size="2" who="" />
<person posts="1" size="2" who="Allan Sandfeld Jensen " />
<person posts="1" size="2" who="Richard Russon " />
<person posts="1" size="2" who="kheBrooke " />
<person posts="1" size="2" who="Marcus Sundberg " />
<person posts="1" size="2" who="&quot;L-Soft list server at HEAnet (1.8d)&quot; " />
<person posts="1" size="2" who="Ingo Oeser " />
<person posts="1" size="2" who="Andrew D Kirch " />
<person posts="1" size="2" who="Anton Blanchard " />
<person posts="1" size="2" who="Olivier Galibert " />
<person posts="1" size="2" who="Vincent Hanquez " />
<person posts="1" size="2" who="Christopher Li " />
<person posts="1" size="2" who="&quot;Shipman, Jeffrey E&quot; " />
<person posts="1" size="2" who="Nicolas Turro " />
<person posts="1" size="2" who="Gonzalo Augusto Arana Tagle " />
<person posts="1" size="2" who="&quot;Egidijus Antanaitis&quot; " />
<person posts="1" size="2" who="&quot;T.Raykoff&quot; " />
<person posts="1" size="2" who="Ben Collins " />
<person posts="1" size="2" who="Jirka Kosina " />
<person posts="1" size="2" who="&quot;Petr Vandrovec&quot; " />
<person posts="1" size="2" who="Oleg Drokin " />
<person posts="1" size="2" who="Bill Huey " />
<person posts="1" size="2" who="James Morris " />
<person posts="1" size="2" who="Simon Winwood " />
<person posts="1" size="2" who="Phil Oester " />
<person posts="1" size="2" who="&quot;Matthew D. Pitts&quot; " />
<person posts="1" size="2" who="Xinwen - Fu " />
<person posts="1" size="2" who="Luis Pedro de Moura Ribeiro Pinto " />
<person posts="1" size="2" who="Paul Vojta " />
<person posts="1" size="2" who="Alastair Stevens " />
<person posts="1" size="2" who="Jeff Meininger " />
<person posts="1" size="2" who="Nivedita Singhvi " />
<person posts="1" size="2" who="Martin Knoblauch " />
<person posts="1" size="2" who="Robert Love " />
<person posts="1" size="2" who="&quot;Roy Sigurd Karlsbakk&quot; " />
<person posts="1" size="2" who="" />
<person posts="1" size="2" who="Qin Tao " />
<person posts="1" size="2" who="Wakko Warner " />
<person posts="1" size="2" who="&quot;Domcan Sami&quot; " />
<person posts="1" size="2" who="&quot;Gopal Shankar&quot; " />
<person posts="1" size="2" who="Shanti Katta " />
<person posts="1" size="2" who="&quot;Nils O. Selåsdal&quot; " />
<person posts="1" size="2" who="Urban Widmark " />
<person posts="1" size="2" who="Geert Uytterhoeven " />
<person posts="1" size="2" who="Felipe Alfaro Solana " />
<person posts="1" size="2" who="Alan Cox " />
<person posts="1" size="2" who="Petko Manolov " />
<person posts="1" size="2" who="88biz " />
<person posts="1" size="2" who="" />
<person posts="1" size="2" who="&quot;Calin A. Culianu&quot; " />
<person posts="1" size="2" who="" />
<person posts="1" size="2" who="Pete Zaitcev " />
<person posts="1" size="2" who="&quot;Guillaume Boissiere&quot; " />
<person posts="1" size="2" who="devik " />
<person posts="1" size="2" who="Hayden James " />
<person posts="1" size="2" who="Chris Rode " />
<person posts="1" size="2" who="Jeff Dike " />
<person posts="1" size="2" who="Ryan Anderson " />
<person posts="1" size="1" who="Martin Devera " />
<person posts="1" size="1" who="Brad Dameron " />
<person posts="1" size="1" who="Felipe Contreras " />
<person posts="1" size="1" who="&quot;louie miranda&quot; " />
<person posts="1" size="1" who="&quot;www.portalcinta.com&quot; " />

</stats>

<section
  title="Sparc64 Support For O(1) Scheduler; Developer Interaction"
  subject="[PATCH] 2.4-ac: sparc64 support for O(1) scheduler"
  archive="http://www.uwsg.indiana.edu/hypermail/linux/kernel/0206.1/1225.html"
  posts="29"
  startdate="13 Jun 2002 11:21:58 -0800"
  enddate="23 Jun 2002 16:16:34 -0800"
>
<topic>Big O Notation</topic>
<topic>Developer Interaction</topic>
<topic>SMP</topic>
<topic>Scheduler</topic>

<mention>Thomas Duffy</mention>
<mention>Alan Cox</mention>

<p>Robert Love posted a patch and said to Alan Cox:</p>

<quote who="Robert Love">

<p>Attached patch provides SPARC64 support for the O(1) scheduler in 2.4-ac.
This is based off a 2.5 backport for my O(1) scheduler patches by Thomas Duffy
(i.e. give him the credit).</p>

<p>I do not know if any other architectures in 2.4-ac support the new scheduler
yet, but I will work on sending you the diffs as I get them or do them...</p>

<p>Patch is against 2.4.19-pre10-ac2, please apply.</p>

</quote>

<p>David S. Miller objected, <quote who="David S. Miller">Ummm what is
with all of those switch_mm() hacks?  Is this an attempt to work around the
locking problems?  Please don't do that as it is going to kill performance and
having ifdef sparc64 sched.c changes is ugly to say the least.  Ingo posted
the correct fix to the locking problem with the patch he posted the other day,
that is what should go into the -ac patches.</quote> But Robert replied, <quote
who="Robert Love">I am explicitly refraining from sending Alan any code that
is not well-tested in 2.5 and my machines first.  As Ingo's new switch_mm()
bits are not even in 2.5 yet, I plan to wait a bit before sending them... (I
am currently putting together all the scheduler bits we have been working
on for a 2.4-ac patch...)</quote> He added, <quote who="Robert Love">If you
like, Alan can hold off on this and take it when the appropriate patches
are in.</quote> But David rejoined:</p>

<quote who="David S. Miller">

<p>Your sparc64 kernel/sched.c bits have zero testing in any kernel.
What point are you trying to make?  It disables a very important optimization
on SMP sparc64.  It's simply unacceptable.</p>

<p>Ingo's change which deletes the frozen locking bits has to be installed
with the patches which allow sparc64 to continue working without the deadlock
bug, they cannot be added separately.</p>

</quote>

<p>And Robert came back with, <quote who="Robert Love">I don't care about
Sparc64, especially as a short term item. Long term yes you are right but
for the -ac work, it can fall back for a while.</quote></p>

<p>David didn't reply to this, but close by, Ingo Molnar said regarding
his own patches, <quote who="Ingo Molnar">Linus applied them already,
they will be in 2.5.22. They fix real bugs and i've seen no problems on my
testboxes. Those bits are a must for SMP x86 and Sparc64 as well, there is
absolutely no reason to selectively delay their backmerge. Besides the last
task_rq_lock() optimization which got undone in 2.5 already, all the recent
scheduler bits i posted are needed.</quote> Robert replied:</p>

<quote who="Robert Love">

<p>I know they are fine (I looked over them) and I saw Linus took them, but
2.5.22 is not yet out and I did not see any reason to rush to new bits to
Alan for 2.4 when we could wait a bit and make sure 2.5 proves them fine...</p>

<p>My approach thus far with 2.5 -> 2.4 O(1) backports has been one of
caution and it has worked fine thus far.  I figure, what is the rush?</p>

</quote>

</section>

<section
  title="Shrinking ext2 And ext3 Directories"
  subject="Shrinking ext3 directories"
  archive="http://www.uwsg.indiana.edu/hypermail/linux/kernel/0206.2/0463.html"
  posts="39"
  startdate="18 Jun 2002 08:08:28 -0800"
  enddate="23 Jun 2002 23:12:31 -0800"
>
<topic>FS: ext2</topic>
<topic>FS: ext3</topic>

<mention>Andreas Dilger</mention>
<mention>Daniel Phillips</mention>

<p>Someone pointed out the age-old problem that after deleting files and
directories from a given directory in an ext2 or ext3 filesystem, the blocks
allocated to that directory to hold their entries are not freed. The poster
knew that the traditional workaround was to create a new directory, move
any desired items from the old directory to the new, and then delete the old
entirely. But this seemed messy, so the poster asked if there were any way to
'shrink' the directory without going through the rigmarole.</p>

<p>At one point, Andreas Dilger said that there was no way currently, and that
implementing such a feature would probably take a lot of work. Stephen C.
Tweedie also agreed that there was no current implementation, but he added,
<quote who="Stephen C. Tweedie">However, I know that Daniel Phillips has
been thinking about adding that for his HTree extensions which add fast
directory indexing to ext2/3.</quote> But Alexander Viro said with a shrug,
<quote who="Alexander Viro">for ext2 a limited form of "shrinking" is easy
to implement.  ext2_delete_entry() can easily notice that it's about to
create an empty entry spanning entire last block.  In that case it should
just walk back and check beginnings of previous blocks, as long as they
are empty (inode = 0, len = block size).  Then it's vmtruncate() time -
all IO on directories is protected by i_sem, so we are safe.  IOW, making
sure that empty blocks in the end of directory get freed is a matter of
10-20 lines.</quote> He offered to do it himself, and the original poster
said that would be great. But Stephen objected, with:</p>

<quote who="Stephen C. Tweedie">

<p>It's certainly easier at the tail, but with htree we may have genuinely
enormous directories and being able to hole-punch arbitrary coalesced blocks
could be a huge win.  Also, doing the coalescing block by block is likely
to be far easier for ext3 than truncating the directory arbitrarily back in
one go.</p>

<p>Chopping a large directory at once brings back the truncate() nightmare
of having to make an unbounded disk operation seem atomic, even if it has to
get split over multiple transactions.  Incremental coalescing should allow
us to know in advance how many disk blocks we might end up touching for the
operation, so we can guarantee to do it in one transaction.</p>

</quote>

<p>At one point Andrew Morton gave some links to patches and remarked,
<quote who="Andrew Morton">btw, I merged all the ext3 htree stuff into 2.5.23
yesterday. Haven't tested it much at all yet.</quote> At this point folks
delved into the particular implementation details.</p>
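
<p>As a rough illustration of the tail-trimming check Alexander described
for ext2, here is a minimal user-space sketch. It is not kernel code: the
struct and both function names are hypothetical, and it assumes the caller
already has the first entry of every directory block in an array. It only
works out how many blocks would survive; the actual truncation and the i_sem
locking from the quote are left out.</p>

```c
/* Hypothetical stand-in for the head of an on-disk directory entry.
 * Only the two fields the emptiness check needs are modelled. */
struct dirent_head {
    unsigned int   inode;    /* 0 means the entry is unused */
    unsigned short rec_len;  /* number of bytes this entry spans */
};

/* A block counts as empty when its first entry is unused and spans the block. */
static int block_is_empty(const struct dirent_head *first, unsigned int block_size)
{
    if (first->inode != 0)
        return 0;
    return first->rec_len == block_size;
}

/* heads[i] is the first entry of directory block i. Walk back from the last
 * block while blocks are empty and return how many blocks must be kept. */
static unsigned int blocks_to_keep(const struct dirent_head *heads,
                                   unsigned int nblocks,
                                   unsigned int block_size)
{
    while (nblocks > 0) {
        if (!block_is_empty(heads + (nblocks - 1), block_size))
            break;
        nblocks--;
    }
    return nblocks;
}
```

<p>Under these assumptions the new directory size would be the returned
block count times the block size, which is the value the quoted plan would
pass to vmtruncate(). An empty block in the middle of the directory is left
alone, which is the limitation Stephen's hole-punching remark is aimed at.</p>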

</section>

<section
  title="The Future Of Linux Multiprocessor Support"
  subject="latest linus-2.5 BK broken"
  archive="http://www.uwsg.indiana.edu/hypermail/linux/kernel/0206.2/0488.html"
  posts="100"
  startdate="18 Jun 2002 09:18:08 -0800"
  enddate="24 Jun 2002 05:06:12 -0800"
>
<topic>Clustering</topic>
<topic>FS: initramfs</topic>
<topic>FS: ramfs</topic>
<topic>Hyperthreading</topic>
<topic>Microkernels</topic>
<topic>Ottawa Linux Symposium</topic>
<topic>Real-Time</topic>
<topic>SMP</topic>
<topic>Scheduler</topic>
<topic>Version Control</topic>

<p>In the course of discussing something else, Linus Torvalds remarked:</p>

<quote who="Linus Torvalds">

<p>I'm absolutely 100% convinced that you don't want to have a "single kernel"
for a cluster, you want to run independent kernels with good communication
infrastructure between them (ie global filesystem, and try to make the
networking look uniform).  </p>

<p>Trying to have a single kernel for thousands of nodes is just crazy. Even
if the system were ccNuma and _could_ do it in theory.  </p>

<p>The NuMA work can probably take single-kernel to maybe 64+ nodes,
before people just start turning stark raving mad. There's no way you'll
have single-kernel for thousands of CPU's, and still stay sane and claim
any reasonable performance under generic loads.</p>

</quote>

<p>Eric W. Biederman replied:</p>

<quote who="Eric W. Biederman">

<p>Agreed.</p>

<p>The compute cluster problem is an interesting one.  The big items I see
on the todo list are:</p>

<p>

<ul>

<li>Scalable fast distributed file system (Lustre looks like a
possibility)</li>

<li>Sub application level checkpointing.</li>

</ul>

</p>

<p>Services like a schedulers, already exist.</p>

<p>Basically the job of a cluster scheduler gets much easier, and the scheduler
more powerful once it gets the ability to suspend jobs.  Checkpointing buys
three things.  The ability to preempt jobs, the ability to migrate processes,
and the ability to recover from failed nodes, (assuming the failed hardware
didn't corrupt your jobs checkpoint).</p>

<p>Once solutions to the cluster problems become well understood I wouldn't
be surprised if some of the supporting services started to live in the kernel
like nfsd.  Parts of the distributed filesystem certainly will.</p>

<p>I suspect process checkpointing and restoring will evolve something
something like pthread support.  With some code in user space, and some
generic helpers in the kernel as clean pieces of the job can be broken off.
The challenge is only how to save/restore interprocess communications. Things
like moving a tcp connection from one node to another are interesting
problems.</p>

<p>But also I suspect most of the hard problems that we need kernel help
with can have uses independent of checkpointing.  Already we have web server
farms that spread connections to a single ip across nodes.</p>

</quote>

<p>Larry McVoy came in with:</p>

<quote who="Larry McVoy">

<p><a
href="http://www.bitmover.com/cc-pitch">http://www.bitmover.com/cc-pitch</a></p>

<p>I've been trying to get Linus to listen to this for years and he keeps
on flogging the tired SMP horse instead.  DEC did it and Sun has been
passing around these slides for a few weeks, so maybe they'll do it too.
Then Linux can join the party after it has become a fine grained, locked
to hell and back, soft "realtime", numa enabled, bloated piece of crap like
all the other kernels and we'll get to go through the "let's reinvent Unix
for the 3rd time in 40 years" all over again.  What fun.  Not.</p>

<p>Sorry to be grumpy, go read the slides, I'll be at OLS, I'd be happy to
talk it over with anyone who wants to think about it.  Paul McKenney from
IBM came down the San Francisco to talk to me about it, put me through an
8 or 9 hour session which felt like a PhD exam, and after trying to poke
holes in it grudgingly let on that maybe it was a good idea.  He was kind
of enough to write up what he took away from it, here it is.</p>

</quote>

<p>Eric W. Biederman replied:</p>

<quote who="Eric W. Biederman">

<p>Hmm.  My impression is that Linux has been doing SMP but mostly because
it hasn't become a nightmare so far.  Linus just a moment ago noted that
there are scalability limits, to SMP.</p>

<p>As for the cc-SMP stuff.</p>

<p>

<ol>

<li>Except dual cpu systems no-one makes affordable SMPs.</li>

<li>It doesn't solve anything except your problem with locks.</li>

</ol>

</p>

<p>You have presented your idea, and maybe it will be useful.  But at
the moment it is not the place to start. What I need today is process
checkpointing.  The rest comes in easy incremental steps from there.</p>

<p>For me the natural place to start is with clusters, they are cheaper and
more accessible than SMPs.  And then work on the clustering software with
gradual refinements until it can be managed as one machine.  At that point
it should be easy to compare which does a better job for SMPs.</p>

</quote>

<p>At one point, Cort Dougan said:</p>

<quote who="Cort Dougan">

<p>"Beating the SMP horse to death" does make sense for 2 processor SMP
machines.  When 64 processor machines become commodity (Linux is a commodity
hardware OS) something will have to be done.  When research groups put Linux
on 1k processors - it's an experiment.  I don't think they have much right to
complain that Linux doesn't scale up to that level - it's not designed to.</p>

<p>That being said, large clusters are an interesting research area but it
is _not_ a failing of Linux that it doesn't scale to them.</p>

</quote>

<p>Linus replied, regarding the insistence on SMP for 2 processor systems,
saying:</p>

<quote who="Linus Torvalds">

<p>It makes fine sense for any tightly coupled system, where the tight
coupling is cost-efficient.</p>

<p>Today that means 2 CPU's, and maybe 4.</p>

<p>Things like SMT (Intel calls it "HT") increase that to 4/8. It's just
_cheaper_ to do that kind of built-in SMP support than it is to not use it.</p>

<p>The important part of what Cort says is "commodity". Not the "small number
of CPU's". Linux is focusing on SMP, because it is the ONLY INTERESTING
HARDWARE BASE in the commodity space.</p>

<p>ccNuma and clusters just aren't even on the _radar_ from a commodity
standpoint. While commodity 4- and 8-way SMP is just a few years away.</p>

<p>So because SMP hardware is cheap and efficient, all reasonable scalability
work is done on SMP. And the fringe is just that - fringe. The numa/cluster
fringe tends to try to use SMP approaches because they know they are a
minority, and they want to try to leverage off the commodity.</p>

<p>And it will continue to be this way for the foreseeable future. People
should just accept the fact.</p>

<p>The only thing that may change the current state of affairs is that some
cluster/numa issues are slowly percolating down and they may become more
commoditized. For example, I think the AMD approach to SMP on the hammer
series is "local memories" with a fast CPU interconnect. That's a lot more
NUMA than we're used to in the PC space.</p>

<p>On the other hand, another interesting trend seems to be that since
commoditizing NUMA ends up being done with a lot of integration, the actual
_latency_ difference is so small that those potential future commodity NUMA
boxes can be considered largely UMA/SMP.</p>

<p>And I guarantee Linux will scale up fine to 16 CPU's, once that is
commodity. And the rest is just not all that important.</p>

</quote>

<p>Elsewhere, Larry said that any attempt to scale Linux up past a few CPUs
would lead to an unworkable mass of threading and locking that could never be
undone. Jeff Garzik chimed in, with:</p>

<quote who="Jeff Garzik">

<p>One point that is missed, I think, is that Linux secretly wants to be
a microkernel.</p>

<p>Oh, I don't mean the strict definition of microkernel, we are continuing
to push the dogma of "do it in userspace" or "do it in process context"
(IOW userspace in the kernel).</p>

<p>Look at the kernel now -- the current kernel is not simply an event-driven,
monolithic program [the traditional kernel design].  Linux also depends on a
number of kernel threads to perform various asynchronous tasks.  We have had
userspace agents managing bits of hardware for a while now, and that trend
is only going to be reinforced with Al's initramfs.</p>

<p>IMO, the trend of the kernel is towards a collection of asynchronous tasks,
which lends itself to high parallelism.  Hardware itself is trending towards
playing friendly with other hardware in the system (examples: TCQ-driven
bus release and interrupt coalescing), another element of parallelism.</p>

<p>I don't see the future of Linux as a twisted nightmare of spinlocks.</p>

</quote>

<p>Cort replied:</p>

<quote who="Cort Dougan">

<p>That's not a microkernel design philosophy, it's a good OS design
philosophy.  If it doesn't _have_ to be in the kernel, it generally shouldn't
be.</p>

<p>I agree with you that Linux is already a loosely connected yet highly
inter-dependent set of asynchronous tasks.  That makes for a very difficult
to analyze system.</p>

<p>I don't see Linux being in serious jeopardy in the short-term of becoming
solaris.  It only aims at running on 1-4 processors and does a pretty good
job of that.  Most sane people realize, as Larry points out, that the current
design will not scale to 64 processors and beyond.  That's obvious, it's not
an alarmist or deep statement.  The key is to realize that it's not _meant_
to scale that high right now.</p>

<p>I've done a little work with Larry's suggestion for scaling Linux and it's
very smart in that it solves the problem in a very simple and elegant way.
DEC did the same thing with Galaxy some time ago but they layered it with so
much of their cluster software and OpenVMS that it lost all the performance
that it had gained by being clever.  If you want a simple description of
the idea (the way I am working on it), it's a software version of NORMA.</p>

<p>Linux's sweet spot is 2-4 processors and probably shouldn't try to change.
It's a very hard problem going higher.  Many systems have failed in exactly
the same way trying to do that sort of thing.  Just cluster a bunch of those
2-4 processor Linux's (room full of boxes, large 64-way IBM server or some
hybrid) and you have a clean solution.</p>

</quote>

</section>

<section
  title="Thoughts On Bootless Kernel Upgrades"
  subject="kernel upgrade on the fly"
  archive="http://www.uwsg.indiana.edu/hypermail/linux/kernel/0206.2/0560.html"
  posts="11"
  startdate="18 Jun 2002 13:21:49 -0800"
  enddate="22 Jun 2002 00:40:14 -0800"
>
<topic>FS</topic>
<topic>Hot-Plugging</topic>
<topic>Microsoft</topic>
<topic>Software Suspend</topic>

<p>Adi Zaimi asked if anyone had thought about ways of upgrading the kernel of a
running system, without requiring a reboot. Rob Landley replied:</p>

<quote who="Rob Landley">

<p>Thought about, yes.  At length.  That's why it hasn't been done. :)</p>

<p>Closest you'll get at the moment is some variant of two kernel monte,
I.E. a reboot to a new kernel with all processes offed, but at least without
involving the bios.</p>

<p>The new swsusp infrastructure from Pavel Machek theoretically lets you
freeze the state of your system to disk, so we're a heck of a lot farther
ahead than we were.  If you want to re-open this can of worms, the only way
to go is to start with some combination of these two projects:</p>

<p><a
href="http://falcon.sch.bme.hu/~seasons/linux/swsusp.html">http://falcon.sch.bme.hu/~seasons/linux/swsusp.html</a></p>

<p><a
href="http://sourceforge.net/projects/monte/">http://sourceforge.net/projects/monte/</a></p>

<p>That said, the fundamental problem is that when you change kernels,
run-time state structures change.  Parsing your run-time state from oldvers
to feed into newvers can't really be done automatically because your tool
wouldn't know what any of the changes MEAN, so you would probably have to
write a custom frozen process converter, which would be a pain and a half
to debug, to say the least.  (And by the time you've got that even half
debugged you need to do it for the NEXT kernel...)</p>

<p>Of course software suspend theoretically deals with at least some of the
device driver issues, so there's a certain amount of handwaving you can do
on that end.  And migrating hot network connections is something people
have in fact done before, although you'll have to ask around about who.
(Ask the security nuts, they consider it a bad thing. :) </p>

<p>Nothing is impossible for anyone impervious to reason, and you might surprise
us (it'd make a heck of a graduate project).  Hot migration isn't IMPOSSIBLE,
it's just a flipping pain in the ass.  But the issue's a bit threadbare in
these parts (somewhere between "are we there yet mommy?" and "can I buy a
pony?").  Try the swsusp mailing list, they might be willing to humor you...</p>

<p>(And the people most likely to WANT this feature ("this system never goes
down" types) are also the least likely to want to deal with subtle bugs from
a bad conversion that don't manifest until a week after the new system comes
up when cron goes nuts at 3 am.  Of course whether hot migration is more
dangerous to your data than the interaction between Andre's and Martin's
egos in the ATAPI layer is an open question... :)  Ahem.  Right...)</p>

<p>The SANE answer always has been to just schedule some down time for the box.
The insane answer involves giving an awful lot of money to Sun or IBM or
some such for hot-pluggable backplanes.  (How do you swap out THE BACKPLANE?
That's an answer nobody seems to have...)</p>

<p>Clusters.  Migrating tasks in the cluster, potentially similar problem.
Look at mosix and the NUMA stuff as well, if you're actually serious about
this.  You have to reduce a process to its vital data, once all the resources
you can peel away from it have been peeled away, swapped out, freed, etc.
If you can suspend and save an individual running process to a disk image
(just a file in the filesystem), in such a way that it can be individually
re-loaded later (by the same kernel), you're halfway there.  No, it's not
as easy as it sounds. :)</p>

</quote>

<p>John Alvord replied, <quote who="John Alvord">IMO the biggest reason it
hasn't been done is the existence of loadable modules. Most driver-type
development work can be tested without rebooting.</quote> But Rob came
back with:</p>

<quote who="Rob Landley">

<p>That's part of it, sure.  (And I'm sure the software suspend work is
leveraging the ability to unload modules.)</p>

<p>There's a dependency tree: processes need resources like mounted filesystems
and open file handles to the network stack and such, and you can't unmount
filesystems and unload devices while they're in use.  Taking a running system
apart and keeping track of the pieces needed to put it back together again
is a bit of a challenge.</p>

<p>The software suspend work can't freeze processes individually to separate
files (that I know of), but I've heard blue-sky talk about potentially
adding it.  (Dunno what the actual plans are, Pavel Machek probably would).
If processes could be frozen in a somewhat kernel independent way (so that
their run-time state was parsed in again in a known format and flung into
any functioning kernel), then upgrading to a new kernel would just be a
question of suspending all the processes you care about preserving, doing
a two kernel monte, and restoring the processes.  Migrating a process from
one machine to another in a network cluster would be possible too.</p>

<p>I'm sure it's not as easy as it sounds, but looking at the software suspend
work would be a necessary first step.  They are, at least, serializing
processes to disk and bringing them back afterwards.  I'm fairly certain
it's happening the way Microsoft Word saves *.doc files (block write the
run-time structures to disk and block read them back in verbatim later,
and hope all your compiler alignment offsets and such match if there's any
version skew).</p>

<p>Then again, the star office people reverse engineered that and made it
(mostly) work without even having access to the source code... :)</p>

<p>Hmmm, what would be involved in serializing a process to disk?
Obviously you start by sending it a suspend signal.  There's the process
stuff, of course.  (Priority, etc.)  That's not too bad.  You'd need to
record all the memory mappings (not just the contents of the physical and
swapped out memory mappings (which should be saved to the serializing file),
but also the memory protection states and memory mapped file ranges and
such, so you can map it all back in at the appropriate location later).
I'd bug whoever did the recent shared page table work (Daniel Phillips?) for
information about what that really MEANS.</p>

<p>You'd need to record all the open file handles, of course. (For actual
files this includes position in file, corresponding locks, etc.  For the
zillions of things that just LOOK like files, pipes and sockets and character
and block devices, expect special case code).</p>

<p>Pipes bring up a fun point: you can't always serialize just one process.
Sometimes they clump together, and if you kill one more go down with it.
Thread groups are easy to spot, as well as parent/child relationships that
share memory maps and file handles and such, but even just a simple "cat blah
| less" means there are two processes connected by a pipe which pretty much
need to be serialized together.  (A common real-world case is that one of
those processes is going to be the X11 server, this brings up a WORLD of fun.
 For a 1.00 release it's an obvious "Don't Do That Then", and later on might
have special case behavior.)</p>

<p>If an actual file handle is open to an otherwise unlinked file, you need
to either make a link to that file somewhere (not too hard, that info is
already in /proc/###/fd) or maybe cache the contents of the file as part of
the serialized image...</p>

<p>Which brings up the whole question of how portable a serialized program
image should be.  Forget swapping kernels, I mean running the system for a
while before resuming the "frozen" executable.  Rename a couple files and the
resume is going to get confused.  You kind of have to restore to the exact
same system you left off at, because if you have an open file handle to a
file or device driver that isn't there on the resumed system, you basically
have some variant of a "broken pipe" scenario.  (Then again, forced unmount
of filesystems can sort of give you this problem anyway, so infrastructure
to deal with it is going to have to be faced at some point...)</p>

<p>For rebooting a running system with the same mounted partitions and
hopefully the same set of device drivers, this isn't really any worse than
software suspend.  And detecting a missing file and having the resume fail
with an error would be pretty easy.  But also pretty darn easy to trigger,
but that's the user's problem...</p>

<p>What other resources attach to a process?  The process info itself
(user ID, capabilities), memory mappings, file handles...  Bound sockets...
Signal handlers and masks...  I/O port mappings and such if you're running
as root...</p>

<p>It's not an unsolvable problem, but it IS a can of worms.  Just plain
reparenting a process turned out to be complicated enough they made
reparent_to_init (see kernel/sched.c).</p>

</quote>
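
<p>To make Rob's inventory a little more concrete, here is a minimal
user-space sketch (not from the thread, and not how swsusp works) that lists
two of the resources he names for a process: its memory mappings and its open
file handles, both read from <tt>/proc</tt>. The names <tt>Mapping</tt>,
<tt>parse_maps_line</tt> and <tt>snapshot</tt> are hypothetical, and the sketch
only records what a checkpoint would have to save; it makes no attempt to save
memory contents or to restore anything.</p>

<pre><![CDATA[
```python
import os
from typing import NamedTuple


class Mapping(NamedTuple):
    """One memory mapping, as reported by /proc/<pid>/maps (hypothetical type)."""
    start: int
    end: int
    perms: str
    offset: int
    path: str


def parse_maps_line(line: str) -> Mapping:
    """Parse one line of /proc/<pid>/maps into its fields."""
    fields = line.split(None, 5)
    start, end = (int(x, 16) for x in fields[0].split("-"))
    # Anonymous mappings have no path column at all.
    path = fields[5].strip() if len(fields) > 5 else ""
    return Mapping(start, end, fields[1], int(fields[2], 16), path)


def snapshot(pid: str = "self") -> dict:
    """Collect the mappings and open descriptors a checkpoint would need to record."""
    with open(f"/proc/{pid}/maps") as f:
        mappings = [parse_maps_line(line) for line in f]
    fds = {}
    for name in os.listdir(f"/proc/{pid}/fd"):
        try:
            # Each entry is a symlink to the file, pipe or socket behind it.
            fds[int(name)] = os.readlink(f"/proc/{pid}/fd/{name}")
        except OSError:
            pass  # descriptor was closed between listdir() and readlink()
    return {"mappings": mappings, "fds": fds}


if __name__ == "__main__" and os.path.isdir("/proc/self/fd"):
    state = snapshot()
    print(len(state["mappings"]), "mappings,", len(state["fds"]), "open descriptors")
```
]]></pre>

<p>Pipes and sockets show up in the descriptor list as names such as
<tt>pipe:[inode]</tt> rather than paths, which is one place where the "special
case code" Rob expects would have to begin.</p>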

</section>

<section
  title="Cleaning Up The Source Tree"
  subject="2.5.x: arch/i386/kernel/cpu"
  archive="http://www.uwsg.indiana.edu/hypermail/linux/kernel/0206.2/0627.html"
  posts="6"
  startdate="18 Jun 2002 19:45:18 -0800"
  enddate="21 Jun 2002 10:18:30 -0800"
>
<topic>Source Tree</topic>

<p>H. Peter Anvin called out:</p>

<quote who="H. Peter Anvin">

<p>Whoever broke up arch/i386/kernel/setup.c and created the CPU directory
(very good idea) messed up in at least one place:</p>

<p>The *AMD-defined* CPUID flags (0x80000001) are not just used on AMD
processors!  In fact, at least AMD, Transmeta, Cyrix and VIA all use them;
I don't know about Centaur or Rise.  Intel supports the actual level starting
with the P4 although it returns all zero.</p>

<p>It should, in my opinion, be moved into generic_identify().  Anyone who
has a reason why that shouldn't be done speak now or I'll send the patch
to Linus.</p>

</quote>
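
<p>The point of H. Peter's complaint can be sketched as follows. This is an
illustration only, not the kernel code or his patch: the Python function
<tt>generic_identify</tt> below is a hypothetical stand-in that borrows the
kernel function's name, the register values are passed in rather than read
from the CPU, and only a handful of the EDX bits of level 0x80000001 are
listed. What it shows is that the extended flags are decoded whenever the
level is supported, with no check of the vendor string.</p>

<pre><![CDATA[
```python
# A few EDX bits of CPUID level 0x80000001 (selection chosen for illustration).
EXT_FEATURE_BITS = {
    11: "syscall",   # SYSCALL/SYSRET
    22: "mmxext",    # AMD extensions to MMX
    29: "lm",        # long mode
    30: "3dnowext",  # extended 3DNow!
    31: "3dnow",     # 3DNow!
}


def decode_ext_features(edx: int) -> list:
    """Return the names of the listed flags that are set in EDX, lowest bit first."""
    return [name for bit, name in sorted(EXT_FEATURE_BITS.items()) if (edx >> bit) & 1]


def generic_identify(vendor: str, max_ext_level: int, ext_edx: int) -> list:
    """Hypothetical stand-in: decode the extended flags for any vendor.

    vendor is accepted but deliberately ignored; the only gate is whether
    the CPU reports support for level 0x80000001 at all.
    """
    if max_ext_level < 0x80000001:
        return []
    return decode_ext_features(ext_edx)


if __name__ == "__main__":
    # Made-up register values, not measurements from real hardware.
    print(generic_identify("ExampleVendorA", 0x80000005, (1 << 31) | (1 << 22)))
    print(generic_identify("ExampleVendorB", 0x80000004, 0))
```
]]></pre>

<p>A CPU that supports the level but returns all zero, as H. Peter describes
for the P4, simply yields an empty list.</p>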

<p>Dave Jones gave credit for the patch to Patrick Mochel, and agreed that H.
Peter's patch would be a good idea, unless Patrick had a better one. H. Peter
remarked, <quote who="H. Peter Anvin">Note that this is great.  We should
do the same with bugs.h which is, if anything, an even worse mess.</quote>
And Dave replied, <quote who="Dave Jones">Agreed. Patrick also did similar
work on the mtrr driver which isn't merged anywhere yet. That's something
else that's been long overdue this treatment.  (Also on my list for chopping
into bits is agpgart_be.c, but that's another story..)</quote></p>

<p>Patrick also came into the discussion, thanking H. Peter for the catch,
and encouraging him to send in his patch if it was readily available, and
<quote who="Patrick Mochel">If not, I'll add it to my short list and look
at it in the next few days (hopefully).</quote></p>

</section>

<section
  title="ext2/ext3 Scalability"
  subject="ext3 performance bottleneck as the number of spindles gets large"
  archive="http://www.uwsg.indiana.edu/hypermail/linux/kernel/0206.2/0846.html"
  posts="35"
  startdate="19 Jun 2002 13:29:45 -0800"
  enddate="24 Jun 2002 14:51:35 -0800"
>
<topic>Disks: SCSI</topic>
<topic>FS: ext2</topic>
<topic>FS: ext3</topic>
<topic>Locking</topic>
<topic>SMP</topic>

<p>Someone from Intel reported that they'd been doing throughput comparisons
and benchmarks of block I/O throughput for 8K writes, as the number of SCSI
adapters and drives per adapter were increased. On their dual processor
1.2GHz PIII with 2G RAM, running kernel 2.4.16 or 2.4.18, they found that the
Bonnie++ benchmark showed throughput going down as the number of spindles
went up. As far as they could tell, the problem boiled down to ext3 making
too much use of the BKL (Big Kernel Lock). The poster suggested replacing
the BKL usage with per-filesystem locking instead. Andrew Morton replied,
<quote who="Andrew Morton">ext3 scalability is very poor, I'm afraid.  The fs
really wasn't up and running until kernel 2.4.5 and we just didn't have time
to address that issue.</quote> He added, <quote who="Andrew Morton">The vague
plan there is to replace lock_kernel with lock_journal where appropriate.
But ext3 scalability work of this nature will be targeted at the 2.5 kernel,
most probably.</quote></p>

<p>Dave Hansen took a look at the code and agreed that BKL contention was
pretty hairy in ext3. He added, <quote who="Dave Hansen">We used to see
plenty of ext2 BKL contention, but Al Viro did a good job fixing that early
in 2.5 using a per-inode rwlock. I think that this is the required level of
lock granularity, another global lock just won't cut it. <a
href="http://lse.sourceforge.net/lockhier/bkl_rollup.html#getblock">http://lse.sourceforge.net/lockhier/bkl_rollup.html#getblock</a>.</quote>
Andreas Dilger said:</p>

<quote who="Andreas Dilger">

<p>There are a variety of different efforts that could be made towards
removing the BKL from ext2 and ext3.  The first, of course, would be to have a
per-filesystem lock instead of taking the BKL (I don't know if Al has changed
lock_super() in 2.5 to be a real semaphore or not).  As Andrew mentioned,
there would also need to be a per-journal lock to ensure coherency of the
journal data.  Currently the per-filesystem and per-journal lock would be
equivalent, but when a single journal device can be shared among multiple
filesystems they would be different locks.</p>

<p>I will leave it up to Andrew and Stephen to discuss locking scalability
within the journal layer.</p>

<p>Within the filesystem there can be a large number of increasingly fine locks
added - a superblock-only lock with per-group locks, or even per-bitmap and
per-inode-table(-block) locks if needed.  This would allow multi-threaded
inode and block allocations, but a sane lock ranking strategy would have to
be developed.  The bitmap locks would only need to be 2-state locks, because
you only look at the bitmaps when you want to modify them.  The inode table
locks would be read/write locks.</p>

<p>If there is a try-writelock mechanism for the individual inode table
blocks you can avoid write lock contention for creations by simply finding
the first un-write-locked block in the target group's inode table (usually
in the hundreds of blocks per group for default parameters).  For inode
allocation you don't really care which inode you get, as long as you get one
in the preferred group (even that isn't critical for directory creation).
For inode deletions you will get essentially random block locking, which
is actually improved by the find-first-unlocked allocation policy (at the
expense of dirtying more inode table blocks).</p>

<p>Contention for the superblock lock for updates to the superblock free block
and free inode counts could be mitigated by keeping "per-group delta buckets"
in memory, that are written into the superblock only once every few seconds
or at statfs time instead of needing multiple locks for each block/inode
alloc/free.  The groups already keep their own summary counts for free blocks
and inodes.  The coherency of these fields with the superblock on recovery
would be handled at journal recovery time (either in the kernel or e2fsck).
Other than these two fields there are few write updates to the superblock
(on ext3 there is also the orphan list, modified at truncate and when an
open file is unlinked and when such a file is closed).</p>

<p>I have even been thinking about multi-threaded directory-entry creation in
a single directory.  One nice thing about ext2/ext3 directory blocks is that
each one is self-contained and can be modified independently.  For regular
ext2/ext3 directories you would only be able to do multi-threaded deletes
by having a lock for each directory block.  For creations you would need to
lock the entire directory to ensure exclusive access for a create, which is
the same single-threaded behaviour for a single directory we have today with
the directory i_sem.</p>

<p>However, if you are using the htree indexed directory layout (which you
will be, if you care about scalable filesystem performance) then there is
only a single block into which a given filename can be added, so you can have
per-block locks even for file creation.  As the number of directory entries
grows (and hence more directory blocks) the locking becomes increasingly
more fine-grained so you get better scalability with larger directories,
which is what you want.</p>

</quote>
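
<p>The find-first-unlocked allocation policy Andreas describes can be
sketched in a few lines. This is a toy model under stated assumptions, not
ext2/ext3 code: the class <tt>InodeTableGroup</tt> and its fields are
hypothetical, a plain mutex with a non-blocking acquire stands in for the
write side of the read/write lock he proposes, and free inodes are tracked in
Python lists rather than bitmaps.</p>

<pre><![CDATA[
```python
import threading


class InodeTableGroup:
    """Toy model of one block group's inode table (hypothetical structure)."""

    def __init__(self, n_blocks: int, inodes_per_block: int):
        # One lock per inode table block, instead of one lock for the group.
        self.locks = [threading.Lock() for _ in range(n_blocks)]
        self.free = [
            list(range(b * inodes_per_block, (b + 1) * inodes_per_block))
            for b in range(n_blocks)
        ]

    def alloc_inode(self):
        """Take an inode from the first block that is not write-locked.

        Returns None if every block is busy or full; a real allocator would
        then fall back to a blocking lock or to another group.
        """
        for block, lock in enumerate(self.locks):
            if not lock.acquire(blocking=False):
                continue  # another creator holds this block; try the next one
            try:
                if self.free[block]:
                    return self.free[block].pop(0)
            finally:
                lock.release()
        return None


if __name__ == "__main__":
    group = InodeTableGroup(n_blocks=4, inodes_per_block=32)
    group.locks[0].acquire()      # pretend another thread is writing block 0
    print(group.alloc_inode())    # allocation skips block 0 instead of waiting
    group.locks[0].release()
```
]]></pre>

<p>Because the caller does not care which inode it gets within the group,
skipping a busy block costs nothing in correctness; the price, as Andreas
notes, is that allocations spread over more inode table blocks.</p>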

<p>Andrew said:</p>

<quote who="Andrew Morton">

<p>The next steps for ext2 are: stare at Anton's next set of graphs and
then, I expect, removal of the fs-private bitmap LRUs, per-cpu buffer LRUs
to avoid blockdev mapping lock contention,  per-blockgroup locks and removal
of lock_super from the block allocator.</p>

<p>But there's no point in doing that while zone->lock and pagemap_lru_lock
are top of the list.  Fixes for both of those are in progress.</p>

<p>ext2 is bog-simple.  It will scale up the wazoo in 2.6.</p>

</quote>

<p>But he added, <quote who="Andrew Morton">ext3 is about 700x as complex
as ext2.  It will need to be done with some care.</quote></p>

<p>Elsewhere, Stephen C. Tweedie felt that it might not be necessary to wait for
2.6 for ext3 scalability. He said:</p>

<quote who="Stephen C. Tweedie">

<p>I think we can do better than that, with care.  lock_journal could easily
become a read/write lock to protect the transaction state machine, as there's
really only one place --- the commit thread --- where we end up changing
the state of a transaction itself (eg. from running to committing).  For
short-lived buffer transformations, we already have the datalist spinlock.</p>

<p>There are a few intermediate types of operation, such as the
do_get_write_access.  That's a buffer operation, but it relies on us being
able to allocate memory for the old version of the buffer if we happen to be
committing the bh to disk already.  All of those cases are already prepared
to accept BKL being dropped during the memory allocation, so there's no
problem with doing the same for a short-term buffer spinlock; and if the
journal_lock is only taken shared in such places, then there's no urgent
need to drop that over the malloc.</p>

<p>Even the commit thread can probably avoid taking the journal lock in many
cases --- it would need it exclusively while changing a transaction's global
state, but while it's just manipulating blocks on the committing transaction
it can probably get away with much less locking.</p>

</quote>
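
<p>The split Stephen outlines, a journal lock taken shared for ordinary
buffer operations and exclusively only when the commit thread changes a
transaction's state, might look roughly like this. It is a sketch under
stated assumptions rather than anything from the ext3 source: the
<tt>RWLock</tt> and <tt>Journal</tt> classes and their method names are
hypothetical, the read/write lock is a deliberately simple one that does not
guard against writer starvation, and a plain mutex plays the part of the
datalist spinlock.</p>

<pre><![CDATA[
```python
import threading


class RWLock:
    """Minimal read/write lock (hypothetical helper; no writer priority)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_shared(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


class Journal:
    """Toy journal with a single transaction (hypothetical structure)."""

    def __init__(self):
        self.lock = RWLock()                   # protects the transaction state
        self.datalist_lock = threading.Lock()  # short-lived buffer list changes
        self.state = "running"
        self.buffers = []

    def add_buffer(self, bh):
        # Buffer operations only need the state to stay put: shared mode.
        self.lock.acquire_shared()
        try:
            with self.datalist_lock:
                self.buffers.append(bh)
        finally:
            self.lock.release_shared()

    def start_commit(self):
        # Only the commit thread changes the state: exclusive mode.
        self.lock.acquire_exclusive()
        try:
            self.state = "committing"
        finally:
            self.lock.release_exclusive()
```
]]></pre>

<p>Many threads can be inside <tt>add_buffer</tt> at once, contending only
briefly on the list lock, while <tt>start_commit</tt> waits for them to drain
before flipping the state.</p>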

</section>

<section
  title="Status Of CML2 And Kernel Configuration System"
  subject="CML2"
  archive="http://www.uwsg.indiana.edu/hypermail/linux/kernel/0206.2/1115.html"
  posts="5"
  startdate="20 Jun 2002 16:43:58 -0800"
  enddate="22 Jun 2002 14:28:38 -0800"
>
<topic>Configuration</topic>
<topic>Disks: SCSI</topic>
<topic>Kernel Build System</topic>
<notopic>Disks</notopic>

<mention>Eric S. Raymond</mention>



<p>Hayden James asked about the status of CML2, and Eric Weigle replied,
<quote who="Eric Weigle">OOoooh, ouch. You apparently missed the two
(or more) significant flamewars on these topics. The current status
is that kbuild will probably slowly be merged by going through Kai and
being munged into Linus-acceptable patches, while CML2 will probably sit
around and never get merged unless ESR accepts the fact that cool code
solving a problem doesn't automagically get into the kernel.  See the
thread rooted somewhere around here ("Disgusted with Kbuild..."): <a
href="http://www.uwsg.iu.edu/hypermail/linux/kernel/0202.2/0000.html">http://www.uwsg.iu.edu/hypermail/linux/kernel/0202.2/0000.html</a></quote></p>

<p>Roman Zippel also said (referring to Eric S. Raymond, CML2 author), <quote
who="Roman Zippel">Due to his silence, we must assume that he has given
up. CML2 has a few problems, which make it unlikely that it gets included
as is.  Anyway, not all hope is lost, I started my own configuration system
some time ago, which will be less complex than CML2. It's only advancing
a bit slowly currently, as I only have little time to work on it.</quote>
Sam Ravnborg asked:</p>

<quote who="Sam Ravnborg">

<p>Despite the fact that you are advancing slowly could you explain what
your plans are with the configuration system?</p>

<p>As of today we have basically three different ways to read the Config.in
files, where xconfig is the one with the best but also most critical
parser/analyser.  Do you plan to replace all of them or?</p>

</quote>

<p>And Roman replied:</p>

<quote who="Roman Zippel">

<p>My plan is to convert the current configuration into a new format (I
have a tool for that), which is more flexible and will allow all the
information needed to configure/build a driver to be in a single place. It
currently looks like this:</p>

<p><tt>config BLK_DEV_SD<br />
&#160;&#160;depends SCSI<br />
&#160;&#160;tristate "SCSI disk support"<br />
&#160;&#160;help<br />
&#160;&#160;If you want to use a SCSI hard disk ...</tt></p>

<p>More information can be added later to this.</p>

<p>The current parsers will all be replaced with a single parser, actually
it's a library that does all the work and which allows multiple front ends
to behave identically.</p>

</quote>
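
<p>To show how little a front end would need once a single parser library
exists, here is a minimal sketch of a reader for the format in Roman's
example. It is a guess built only from the five lines he posted, not his
library: the function <tt>parse_config</tt> and the dictionary layout are
hypothetical, only the <tt>config</tt>, <tt>depends</tt>, <tt>tristate</tt>
and <tt>help</tt> keywords from his sample are recognised, and help text is
assumed to run until the next <tt>config</tt> line.</p>

<pre><![CDATA[
```python
import shlex

# The example entry from Roman's mail.
SAMPLE = '''config BLK_DEV_SD
  depends SCSI
  tristate "SCSI disk support"
  help
  If you want to use a SCSI hard disk ...
'''


def parse_config(text: str) -> dict:
    """Parse config entries into {symbol: {depends, type, prompt, help}}."""
    entries = {}
    entry = None
    in_help = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        word, _, rest = line.partition(" ")
        if word == "config":
            # A new symbol starts here and ends any help text in progress.
            entry = {"depends": [], "type": None, "prompt": None, "help": []}
            entries[rest.strip()] = entry
            in_help = False
        elif entry is None:
            raise ValueError(f"statement outside a config block: {line!r}")
        elif in_help:
            entry["help"].append(line)
        elif word == "depends":
            entry["depends"].append(rest.strip())
        elif word == "tristate":
            entry["type"] = word
            # shlex handles the quoted prompt string.
            entry["prompt"] = shlex.split(rest)[0] if rest.strip() else None
        elif word == "help":
            in_help = True
        else:
            raise ValueError(f"unknown keyword: {word!r}")
    return entries


if __name__ == "__main__":
    for symbol, info in parse_config(SAMPLE).items():
        print(symbol, info["type"], info["depends"], info["prompt"])
```
]]></pre>

<p>Any number of front ends could then work from the same dictionary, which
is the property Roman is after when he says the library should make them
behave identically.</p>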

</section>

</kc>

