page cache

IRC, freenode, #hurd, 2012-04-26

<braunr> another not-too-long improvement would be changing the page cache
  policy
<youpi> to drop the 4000 objects limit, you mean ?
<braunr> yes
<youpi> do you still have my patch attempt ?
<braunr> no
<youpi> let me grab that
<braunr> oh i won't start it right away you know
<braunr> i'll ask for it when i do
<youpi> k
<braunr> (otherwise i feel i'll just lose it again eh)
<youpi> :)
<braunr> but i imagine it's not too hard to achieve
<youpi> yes
<braunr> i also imagine to set a large threshold of free pages to avoid
  deadlocks
<braunr> which will still be better than the current situation where we
  have either lots of free pages because the max limit is reached, or lots
  of pressure and system freezes :/
<youpi> yes

IRC, freenode, #hurd, 2012-06-17

<braunr> youpi: i don't understand your patch :/
<youpi> arf
<youpi>  which part don't you understand?
<braunr> the global idea :/
<youpi> first, drop the limit on number of objects
<braunr> you added a new collect call at pageout time
<youpi> (i.e. here, hack overflow into 0)
<braunr> yes
<braunr> obviously
<youpi> but then the cache keeps filling up with objects
<youpi> which sooner or later become empty
<youpi> thus the collect, which is supposed to look for empty objects, and
  just drop them
<braunr> but not at the right time
<braunr> objects should be collected as soon as their ref count drops to 0
<braunr> err
<youpi> now, the code of the collect is just a crude attempt without
  knowing much about the vm
<braunr> when their resident page count drops to 0
<youpi> so don't necessarily read it :)
<braunr> ok
<braunr> i've begun playing with the vm recently
<braunr> the limits (arbitrary, and very old obviously) seem far too low
  for current resources
<braunr> (e.g. the threshold on free pages is 50 iirc ...)
<youpi> yes
<braunr> i'll probably use a different approach
<braunr> the one i mentioned (collecting one object at a time - or pushing
  them on a list for bursts - when they become empty)
<braunr> this should relax the kernel allocator more
<braunr> (since there will be less empty vm_objects remaining until the
  next global collection)
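
The approach braunr describes (queue an object as soon as its resident page count reaches zero, then release queued objects in bursts) can be sketched roughly as below. This is only an illustration: `struct cached_object`, `object_page_removed` and `collect_empty_objects` are hypothetical names, not gnumach's `vm_object` API, and all locking is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a cached VM object. */
struct cached_object {
    unsigned int resident_pages;       /* pages currently in physical memory */
    int on_empty_list;                 /* non-zero while queued for release */
    struct cached_object *next_empty;  /* link in the list of empty objects */
};

static struct cached_object *empty_list;  /* objects waiting to be released */
static unsigned int empty_count;          /* length of empty_list */

/* Called whenever a page is removed from an object.  The object is queued
   the moment it becomes empty instead of waiting for a global scan. */
void object_page_removed(struct cached_object *obj)
{
    assert(obj != NULL && obj->resident_pages > 0);
    obj->resident_pages--;

    if (obj->resident_pages == 0 && !obj->on_empty_list) {
        obj->on_empty_list = 1;
        obj->next_empty = empty_list;
        empty_list = obj;
        empty_count++;
    }
}

/* Drain the queue in one burst.  Objects that gained pages again since
   they were queued are skipped.  Returns the number of objects released;
   `release` may be NULL when the caller only wants the count. */
unsigned int collect_empty_objects(void (*release)(struct cached_object *))
{
    unsigned int released = 0;

    while (empty_list != NULL) {
        struct cached_object *obj = empty_list;

        empty_list = obj->next_empty;
        obj->next_empty = NULL;
        obj->on_empty_list = 0;
        empty_count--;

        if (obj->resident_pages == 0) {
            if (release != NULL)
                release(obj);
            released++;
        }
    }
    return released;
}
```

Queueing at the moment an object becomes empty keeps the number of lingering empty objects small, which is the allocator relief mentioned above.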

IRC, freenode, #hurd, 2012-06-30

<braunr> the threshold values of the page cache seem quite enough actually
<youpi> braunr: ah
<braunr> youpi: yes, it seems the problems are in ext2, not in the VM
<youpi> k
<youpi> the page cache limitation still doesn't help :)
<braunr> the problem in the VM is the recycling of vm_objects, which aren't
  freed once empty
<braunr> but it only wastes some of the slab memory, it doesn't prevent
  correct processing
<youpi> braunr: thus the limitation, right?
<braunr> no
<braunr> well
<braunr> that's the policy they chose at the time
<braunr> for what reason .. i can't tell
<youpi> ok, but I mean
<youpi> we can't remove the policy because of the non-free of empty objects
<braunr> we must remove vm_objects at some point
<braunr> but even without it, it makes no sense to disable the limit while
  ext2 is still unstable
<braunr> also, i noticed that the page count in vm_objects never actually
  drops to 0 ...
<youpi> you mean the limit permits to avoid going into the buggy scenarii
  too often?
<braunr> yes
<youpi> k
<braunr> at least, that's my impression
<braunr> my test case is tar xf files.tar.gz, which contains 50000 files of
  12k random data
<braunr> i'll try with other values
<braunr> i get crashes, deadlocks, livelocks, and it's not pretty :)

libpager deadlock.

<braunr> and always in ext2, mach doesn't seem affected by the issue, other
  than the obvious
<braunr> (well i get the usual "deallocating an invalid port", but as
  mentioned, it's "most probably a bug", which is the case here :)
<youpi> braunr: looks coherent with the hangs I get on the buildds
<braunr> youpi: so that's the nasty bug i have to track now
<youpi> though I'm also still getting some out of memory from gnumach
  sometimes
<braunr> the good thing is i can reproduce it very quickly
<youpi> a dump from the allocator to know which zone took all the room
  might help
<braunr> youpi: yes i promised that too
<youpi> although that's probably related with ext2 issues :)
<braunr> youpi: can you send me the panic message so i can point the code
  which must output the allocator state please ?
<youpi> next time I get it, sure :)
<pinotree> braunr: you could implement a /proc/slabinfo :)
<braunr> pinotree: yes but when a panic happens, it's too late
<braunr> http://git.sceen.net/rbraun/slabinfo.git/ btw
<braunr> although it's not part of procfs
<braunr> and the mach_debug interface isn't provided :(

IRC, freenode, #hurd, 2012-07-03

<braunr> it looks like pagers create a thread per memory object ...
<antrik> braunr: oh. so if I open a lot of files, ext2fs will *inevitably*
  have lots of threads?...
<braunr> antrik: i'm not sure
<braunr> it may only be required to flush them
<braunr> but when there are lots of them, the threads could run slowly,
  giving the impression there is one per object
<braunr> in sync mode i don't see many threads
<braunr> and i don't get the bug either for now
<braunr> while i can see physical memory actually being used
<braunr> (and the bug happens before there is any memory pressure in the
  kernel)
<braunr> so it definitely looks like a corruption in ext2fs
<braunr> and i have an idea .... :>
<braunr> hm no, i thought an alloca with a big size parameter could erase
  memory outside the stack, but it's something else
<braunr> (although alloca should really be avoided)
<braunr> arg, the problem seems to be in diskfs_sync_everything ->
  ports_bucket_iterate (pager_bucket, sync_one); :/
<braunr> :(
<braunr> looks like the ext2 problem is triggered by calling pager_sync
  from diskfs_sync_everything
<braunr> and is possibly related to
  http://lists.gnu.org/archive/html/bug-hurd/2010-03/msg00127.html
<braunr> (and for reference, the rest of the discussion
  http://lists.gnu.org/archive/html/bug-hurd/2010-04/msg00012.html)
<braunr> multithreading in libpager is scary :/
<antrik> braunr: s/in libpager/ ;-)
<braunr> antrik: right
<braunr> omg the ugliness :/
<braunr> ok i found a bug
<braunr> a real one :)
<braunr> (but not sure it's the only one since i tried that before)
<braunr> 01:38 < braunr> hm no, i thought an alloca with a big size
  parameter could erase memory outside the stack, but it's something else
<braunr> turns out alloca is sometimes used for 64k+ allocations
<braunr> which explains the stack corruptions
<pinotree> ouch
<braunr> as it's used to duplicate the node table before traversing it, it
  also explains why the cache limit affects the frequency of the bug
<braunr> now the fun part, write the patch following GNU protocol .. :)

id:"1341350006-2499-1-git-send-email-rbraun@sceen.net"
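
The patch itself is not quoted in this log. As a hedged illustration of the general fix (copying a table to the heap rather than to the stack with `alloca`, so that a 64k+ copy cannot run past a small thread stack), a sketch might look like this. `struct node` and `snapshot_node_table` are hypothetical names, not the ext2fs or libdiskfs code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a file system node. */
struct node {
    int id;
};

/* Duplicate a table of node pointers on the heap so that it can be
   traversed without holding the lock that protects the original.
   Unlike alloca, malloc reports failure instead of silently writing
   outside the stack.  On success returns 0 and stores the copy in *out
   (NULL when count is 0); the caller frees it.  Returns -1 on failure. */
int snapshot_node_table(struct node *const *table, size_t count,
                        struct node ***out)
{
    struct node **copy;

    assert(out != NULL);
    *out = NULL;

    if (count == 0)
        return 0;
    if (count > SIZE_MAX / sizeof(*copy))
        return -1;                      /* size computation would overflow */

    copy = malloc(count * sizeof(*copy));
    if (copy == NULL)
        return -1;

    memcpy(copy, table, count * sizeof(*copy));
    *out = copy;
    return 0;
}
```

The cost is one heap allocation per traversal, in exchange for an error path that can actually be handled.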

<braunr> if someone feels like it, there are a bunch of alloca calls in the
  hurd (like around 30 if i'm right)
<braunr> most of them look safe, but some could trigger that same problem
  in other servers
<braunr> ok so far, no problem with the upstream ext2fs code :)
<braunr> 20 loops of tar xf / rm -rf consuming all free memory as cache :)
<braunr> the hurd uses far too much cpu time for no valid reason in many
  places :/
* braunr happy
<braunr> my hurd is completely using its ram :)
<gnu_srs> Meaning, the bug is solved? Congrats if so :)
<braunr> well, ext2fs looks way more stable now
<braunr> i haven't had a single issue since the change, so i guess i messed
  something with my previous test
<braunr> and the Mach VM cache implementation looks good enough
<braunr> now the only thing left is to detect unused objects and release
  them
<braunr> which is actually the core of my work :)
<braunr> but i'm glad i could polish ext2fs
<braunr> with luck, this is the issue that was striking during "thread
  storms" in the past
* pinotree hugs braunr
<braunr> i'm also very happy to see the slab allocator reacting well upon
  memory pressure :>
<mcsim> braunr: Why did alloca corrupt memory in diskfs_node_iterate? Was
  the temporary node too big to keep on the stack?
<braunr> mcsim: yes
<braunr> 17:54 < braunr> turns out alloca is sometimes used for 64k+
  allocations
<braunr> and i wouldn't be surprised if our thread stacks are
  simple contiguous 64k mappings of zero-filled memory
<braunr> (as Mach only provides bottom-up allocation)
<braunr> our thread implementation should leave unmapped areas between
  thread stacks, to easily catch such overflows
<pinotree> braunr: wouldn't also fatfs/inode.c and tmpfs/node.c need the
  same fix?
<braunr> pinotree: possibly
<braunr> i haven't looked
<braunr> more than 300 loops of tar xf / rm -rf on an archive of 20000
  files of 12 KiB each, without any issue, still going on :)
<youpi> braunr: yay

id:"20120703121820.GA30902@mail.sceen.net", 2012-07-03

IRC, freenode, #hurd, 2012-07-04

<braunr> mach is so good it caches objects with *no* page in physical
  memory
<braunr> hm i think i have a working and not too dirty vm cache :>
<kilobug> braunr: congrats :)
<braunr> kilobug: hey :)
<braunr> the dangerous side effect is the increased swappiness
<braunr> we'll have to monitor that on the buildds
<braunr> otherwise the cache is effectively used, and the slab allocator
  reports reasonable amounts of objects, not increasing once the ram is
  full
<braunr> let's see what happens with 1.8 GiB of RAM now
<braunr> damn glibc is really long to build :)
<braunr> and i fear my vm cache patch makes non scalable algorithms negate
  some of its benefits :/
<braunr> 72 tasks, 2090 threads
<braunr> we need the ability to monitor threads somewhere

IRC, freenode, #hurd, 2012-07-05

<braunr> hm i get kernel panics when not using the host cache :/
<braunr> no virtual memory for stack allocations
<braunr> that's scary
<antrik> ?
<braunr> i guess the lack of host cache makes I/O slow enough to create a
  big thread storm
<braunr> that completely exhausts the kernel space
<braunr> my patch challenges scalability :)
<antrik> and not having a zalloc zone anymore, instead of getting a nice
  panic when trying to allocate yet another thread, you get an address
  space exhaustion on an unrelated event instead. I see ;-)
<braunr> thread stacks are not allocated from a zone/cache
<braunr> also, the panic concerned aligned memory, but i don't think that
  matters
<braunr> the kernel panic clearly mentions it's about thread stack
  allocation
<antrik> oh, by "stack allocations" you actually mean allocating a stack
  for a new thread...
<braunr> yes
<antrik> that's not what I normally understand when reading "stack
  allocations" :-)
<braunr> user stacks are simple zero filled memory objects
<braunr> so we usually get a deadlock on them :>
<braunr> i wonder if making ports_manage_port_operations_multithread limit
  the number of threads would be a good thing to do
<antrik> braunr: last time slpz did that, it turned out that it causes
  deadlocks in at least one (very specific) situation
<braunr> ok
<antrik> I think you were actually active at the time slpz proposed the
  patch (and it was added to Debian) -- though probably not at the time
  where youpi tracked it down as the cause of certain lockups, so it was
  dropped again...
<braunr> what seems very weird though is that we're normally using
  continuations

continuation.

<antrik> braunr: you mean in the kernel? how is that relevant to the topic
  at hand?...
<braunr> antrik: continuations have been designed to reduce the number of
  stacks to one per cpu :/
<braunr> but they're not used everywhere
<antrik> they are not used *anywhere* in the Hurd...
<braunr> antrik: continuations are supposed to be used by kernel code
<antrik> braunr: not sure what you are getting at. of course we should use
  some kind of continuations in the Hurd instead of having an active thread
  for every single request in flight -- but that's not something that could
  be done easily...
<braunr> antrik: oh no, i don't want to use continuations at all
<braunr> i just want to use less threads :)
<braunr> my panic definitely looks like a thread storm
<braunr> i guess increasing the kmem_map will help for the time being
<braunr> (it's not the whole kernel space that gets filled up actually)
<braunr> also, stacks are kept on a local cache until there is memory
  pressure oO
<braunr> their slab cache can fill the backing map before there is any
  pressure
<braunr> and it makes a two level cache, i'll have to remove that
<antrik> well, how do you reduce the number of threads? apart from
  optimising scheduling (so requests are more likely to be completed before
  new ones are handled), the only way to reduce the number of threads is to
  avoid having a thread per request
<braunr> exactly
<antrik> so instead the state of each request being handled has to be
  explicitly stored...
<antrik> i.e. continuations
<braunr> hm actually, no
<braunr> you use thread migration :)
<braunr> i don't want to artificially use the number of kernel threads
<braunr> the hurd should be revamped not to use that many threads
<braunr> but it looks like a hard task
<antrik> well, thread migration would reduce the global number of threads
  in the system... it wouldn't prevent a server from having thousands of
  threads
<braunr> threads would already be allocated before getting in the server
<antrik> again, the only way not to use a thread for each outstanding
  request is having some explicit request state management,
  i.e. continuations
<braunr> hm right
<braunr> but we can nonetheless reduce the number of threads
<braunr> i wonder if the sync threads are created on behalf of the pagers
  or the kernel
<braunr> one good thing is that i can already feel better performance
  without using the host cache until the panic happens
<antrik> the tricky bit about that is that I/O can basically happen at any
  point during handling a request, by hitting a page fault. so we need to
  be able to continue with some other request at any point...
<braunr> yes
<antrik> actually, readahead should help a lot in reducing the number of
  request and thus threads... still will be quite a lot though
<braunr> we should have a bunch of pageout threads handling requests
  asynchronously
<braunr> it depends on the implementation
<braunr> consider readahead detects that, in the next 10 pages, 3 are not
  resident, then 1 is, then 3 aren't, then 1 is again, and the last two
  aren't
<braunr> how is this solved ? :)
<braunr> about the stack allocation issue, i actually think it's very
  simple to solve
<braunr> the code is a remnant of the old BSD days, when processes were
  heavily swapped
<braunr> so when a thread is created, its stack isn't allocated
<braunr> the allocation happens when the thread is dispatched, and the
  scheduler finds it's swapped (which is the initial state)
<braunr> the stack is allocated, and the operation is assumed to succeed,
  which is why failure produces a panic
<antrik> well, actually, not just readahead... clustered paging in
  general. the thread storms happen mostly on write not read AIUI
<braunr> changing that to allocate at thread creation time will allow a
  cleaner error handling
<braunr> antrik: yes, at writeback
<braunr> antrik: so i guess even when some physical pages are already
  present, we should aim at larger sizes for fewer I/O requests
<antrik> not sure that would be worthwhile... probably doesn't happen all
  that often. and if some of the pages are dirty, we would have to make
  sure that they are ignored although they were part of the request...
<braunr> yes
<braunr> so one request per missing area ?
<antrik> the opposite might be a good idea though -- if every other page is
  dirty, it *might* indeed be preferable to do a single request rewriting
  even the clean ones in between...
<braunr> yes
<braunr> i personally think one request, then replace only what was
  missing, is simpler and preferable
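
Both options discussed here (one request per missing area, or one large request where only the missing pages are replaced) start from knowing where the non-resident runs are. A minimal sketch of that step, using braunr's 10-page example as input, could be the following. `struct page_run` and `find_missing_runs` are hypothetical names and not part of any Mach or Hurd interface.

```c
#include <assert.h>
#include <stddef.h>

/* A maximal run of consecutive non-resident pages. */
struct page_run {
    size_t start;   /* index of the first missing page */
    size_t length;  /* number of missing pages in the run */
};

/* Scan a residency map (non-zero means resident) and record each maximal
   run of missing pages.  Every run is counted, but at most max_runs are
   stored in `runs`.  Returns the number of runs found. */
size_t find_missing_runs(const unsigned char *resident, size_t npages,
                         struct page_run *runs, size_t max_runs)
{
    size_t nruns = 0;
    size_t i = 0;

    assert(resident != NULL || npages == 0);

    while (i < npages) {
        size_t start;

        if (resident[i]) {
            i++;
            continue;
        }

        start = i;
        while (i < npages && !resident[i])
            i++;

        if (nruns < max_runs) {
            runs[nruns].start = start;
            runs[nruns].length = i - start;
        }
        nruns++;
    }
    return nruns;
}
```

For the pattern above (3 missing, 1 resident, 3 missing, 1 resident, 2 missing) this yields three runs, so the choice is between three small requests and a single 10-page request.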
<antrik> OTOH, rewriting clean pages might considerably increase write time
  (and wear) on SSDs
<braunr> why ?
<antrik> I doubt the controller is smart enough to recognise if a page
  doesn't really need rewriting
<antrik> so it will actually allocate and write a new cluster
<braunr> no but it won't spread writes on different internal sectors, will
  it ?
<braunr> sectors are usually really big
<antrik> "sectors" is not a term used in SSDs :-)
<braunr> they'll be erased completely whatever the amount of data at some
  point if i'm right
<braunr> ah
<braunr> need to learn more about that
<braunr> i thought their internal hardware was much like nand flash
<antrik> admittedly I don't remember the correct terminology either...
<antrik> they *are* NAND flash
<antrik> writing is actually not the problem -- it can happen in small
  chunks. the problem is erasing, which is only possible in large blocks
<braunr> yes
<braunr> so having larger requests doesn't seem like a problem to me
<braunr> because of that
<antrik> thus smart controllers (which pretty much all SSD nowadays have,
  and apparently even SD cards) do not actually overwrite. instead, writes
  always happen to clean portions, and erasing only happens when a block is
  mostly clean
<antrik> (after relocating the remaining used parts to other clean areas)
<antrik> braunr: the problem is not having larger requests. the problem is
  rewriting clusters that don't really need rewriting. it means the disk
  performs unnecessary writing actions.
<antrik> it doesn't hurt for magnetic disks, as the head has to pass over
  the unchanged sectors anyways; and rewriting them unnecessarily doesn't
  increase wear
<antrik> but it's different for SSDs
<antrik> each write has a penalty there
<braunr> i thought only erases were the real penalty
<antrik> well, erase happens in the background with modern controllers; so
  it has no direct penalty. the write has a direct performance penalty when
  saturating the bandwidth, and always has a direct wear penalty
<braunr> can't controllers handle 32k requests ? like everything does ? :/
<antrik> sure they can. but that's beside the point...
<braunr> if they do, they won't mind the clean data inside such large
  blocks
<antrik> apparently we are talking past each other
<braunr> i must be missing something important about SSD
<antrik> braunr: the point is, the controller doesn't *know* it's clean
  data; so it will actually write it just like the really unclean data
<braunr> yes
<braunr> and it will choose an already clean sector for that (previously
  erased), so writing larger blocks shouldn't hurt
<braunr> there will be a slight increase in bandwidth usage, but that's
  pretty much all of it
<braunr> isn't it ?
<antrik> well, writing always happens to clean blocks. but writing more
  blocks obviously needs more time, and causes more wear...
<braunr> aiui, blocks are always far larger than the amount of pages we
  want to writeback in one request
<braunr> the only way to use more than one is crossing a boundary
<antrik> no. again, the blocks that can be *written* are actually quite
  small. IIRC most SSDs use 4k nowadays
<braunr> ok
<antrik> only erasing operates on much larger blocks
<braunr> so writing is a problem too
<braunr> i didn't think it would cause wear leveling to happen
<antrik> well, I'm not sure whether the wear actually happens on write or
  on erase... but that doesn't matter, as the number of blocks that need to
  be erased is equivalent to the number of blocks written...
<braunr> sorry, i'm really not sure
<braunr> if you erase one sector, then write the first and third block,
  it's clearly not equivalent
<braunr> i mean
<braunr> let's consider two kinds of pageout requests
<braunr> 1/ a big one including clean pages
<braunr> 2/ several ones for dirty pages only
<braunr> let's assume they both need an erase when they happen
<braunr> what's the actual difference between them ?
<braunr> wear will increase only if the controller handles it on writes, if
  i'm right
<braunr> but other than that, it's just bandwidth
<antrik> strictly speaking erase is only *necessary* when there are no
  clean blocks anymore. but modern controllers will try to perform erase of
  unused blocks in the background, so it doesn't delay actual writes
<braunr> i agree on that
<antrik> but the point is that for each 16 pages (or so) written, we need
  to erase one block so we get 16 clean pages to write...
<braunr> yes
<braunr> which is about the size of a request for the sequential policy
<braunr> so it fits
<antrik> just to be clear: it doesn't matter at all how the pages
  "fit". the controller will reallocate them anyways
<antrik> what matters is how many pages you write
<braunr> ah
<braunr> i thought it would just put the whole request in a single sector
  (or two)
<antrik> I'm not sure what you mean by "sector". as I said, it's not a term
  used in SSD technology
<braunr> so do you imply that writes can actually get spread over different
  sectors ?
<braunr> the sector is the unit at the nand flash level, its size is the
  erase size
<antrik> actually, I used the right terminology... the erase unit is the
  block; the write unit is the page
<braunr> sector is a synonym of block
<antrik> never seen it. and it's very confusing, as it isn't in any way
  similar to sectors in magnetic disks...
<braunr> http://en.wikipedia.org/wiki/Flash_memory#NAND_flash
<braunr> it's actually in the NOR part right before, paragraph "Erasing"
<braunr> "Modern NOR flash memory chips are divided into erase segments
  (often called blocks or sectors)."
<antrik> ah. I skipped the NOR part :-)
<braunr> i've only heard sector where i worked, but i don't consider french
  computer engineers to be authorities on the matter :)
<antrik> hehe
<braunr> let's call them block
<braunr> so, thread stacks are allocated out of the kernel map
<braunr> this is already a bad thing (which is probably why there is a
  local cache btw)
<antrik> anyways, yes. modern controllers might split a contiguous write
  request onto several blocks, as well as put writes to completely
  different logical pages into one block. the association between addresses
  and actual blocks is completely free
<braunr> now i wonder why the kernel map is so slow, as the panic happens
  at about 3k threads, so about 11M of thread stacks
<braunr> antrik: ok
<braunr> antrik: well then it makes sense to send only dirty pages
<braunr> s/slow/low/
<antrik> it's different for raw flash (using MTD subsystem in Linux) -- but
  I don't think this is something we should consider any time soon :-)
<antrik> (also, raw flash is only really usable with specialised
  filesystems anyways)
<braunr> yes
<antrik> are the thread stacks really only 4k? I would expect them to be
  larger in many cases...
<braunr> youpi reduced them some time ago, yes
<braunr> they're 4k on xen
<braunr> uh, 16k
<braunr> damn, i'm wondering why i created separate submaps for the slab
  allocator :/
<braunr> probably because that's how it was done by the zone allocator
  before
<braunr> but that's stupid :/
<braunr> hm the stack issue is actually more complicated than i thought
  because of interrupt priority levels
<braunr> i increased the kernel map size to avoid the panic instead
<braunr> now libc0.3 seems to build fine
<braunr> and there seems to be a clear decrease of I/O :)

IRC, freenode, #hurd, 2012-07-06

<antrik> braunr: there is a submap for the slab allocator? that's strange
  indeed. I know we talked about this; and I am pretty sure we agreed
  removing the submap would actually be among the major benefits of a new
  allocator...
<braunr> antrik: a submap is a good idea anyway
<braunr> antrik: it avoids fragmenting the kernel space too much
<braunr> it also breaks down locking
<braunr> but we could consider it
<braunr> as a first step, i'll merge the kmem and kalloc submaps (the ones
  used for the slab caches and the malloc-like allocations respectively)
<braunr> then i'll change the allocation of thread stacks to use a slab
  cache
<braunr> and i'll also remove the thread swapping stuff
<braunr> it will take some time, but by the end we should be able to
  allocate tens of thousands of threads, and suffer no panic when the limit
  is reached
<antrik> braunr: I'm not sure "no panic" is really a worthwhile goal in
  such a situation...
<braunr> antrik: uh ?
<braunr> antrik: it only means the system won't allow the creation of
  threads until there is memory available
<braunr> from my pov, the microkernel should never fail up to a point it
  can't continue its job
<antrik> braunr: the system won't be able to recover from such a situation
  anyways. without actual resource management/prioritisation, not having a
  panic is not really helpful. it only makes it harder to guess what
  happened I fear...
<braunr> i don't see why it couldn't recover :/

IRC, freenode, #hurd, 2012-07-07

<braunr> grmbl, there are a lot of issues with making the page cache larger
  :(
<braunr> it actually makes the system slower in half of my tests
<braunr> we have to test that on real hardware
<braunr> unfortunately my current results seem to indicate there is no
  clear benefit from my patch
<braunr> the current limit of 4000 objects creates a good balance between
  I/O and cpu time
<braunr> with the previous limit of 200, I/O is often extreme
<braunr> with my patch, either the working set is less than 4k objects, so
  nothing is gained, or the lack of scalability of various parts of the
  system adds overhead that affects processing speed
<braunr> also, our file systems are cached, but our block layer isn't
<braunr> which means even when accessing data from the cache, accesses
  still cause some I/O for metadata

IRC, freenode, #hurd, 2012-07-08

<braunr> youpi: basically, it works fine, but exposes scalability issues,
  and increases swappiness
<youpi> so it doesn't help with stability?
<braunr> hum, that was never the goal :)
<braunr> the goal was to reduce I/O, and increase performance
<youpi> sure
<youpi> but does it at least not lower stability too much?
<braunr> not too much, no
<youpi> k
<braunr> most of the issues i found could be reproduced without the patch
<youpi> ah
<youpi> then fine :)
<braunr> random deadlocks on heavy loads
<braunr> youpi: but i'm not sure it helps with performance
<braunr> youpi: at least not when emulated, and the host cache is used
<youpi> that's not very surprising
<braunr> it does help a lot when there is no host cache and the working set
  is greater (or far less) than 4k objects
<youpi> ok
<braunr> the amount of vm_object and ipc_port is gracefully adjusted
<youpi> that'd help us with not having to tell people to use the complex
  -drive option :)

(writeback caching.)

<braunr> so you can easily run a hurd with 128 MiB with decent performance
  and no leak in ext2fs
<braunr> yes
<braunr> for example
<youpi> braunr: I'd say we should just try it on buildds
<braunr> (it's not finished yet, i'd like to work more on reducing
  swapping)
<youpi> (though they're really not busy atm, so the stability change can't
  really be measured)
<braunr> when building the hurd, which takes about 10 minutes in my kvm
  instances, there is only a 30 seconds difference between using the host
  cache and not using it
<braunr> this is already the case with the current kernel, since the
  working set is less than 4k objects
<braunr> while with the previous limit of 200 objects, it took 50 minutes
  without host cache, and 15 with it
<braunr> so it's a clear benefit for most uses, except my virtual machines
  :)
<youpi> heh
<braunr> because there, the amount of ram means a lot of objects can be
  cached, and i can measure an increase in cpu usage
<braunr> slight, but present
<braunr> youpi: isn't it a good thing that buildds are resting a bit ? :)
<youpi> on one hand, yes
<youpi> but on the other hand, that doesn't permit to continue
  stress-testing the Hurd :)
<braunr> we're not in a hurry for this patch
<braunr> because using it really means you're tickling the pageout daemon a
  lot :)

metadata caching

IRC, freenode, #hurd, 2012-07-12

<braunr> i'm only adding a cached pages count you know :)
<braunr> (well actually, this is now a vm_stats call that can replace
  vm_statistics, and uses flavors similar to task_info)
<braunr> my goal being to see that yellow bar in htop
<braunr> ... :)
<pinotree> yellow?
<braunr> yes, yellow
<braunr> as in http://www.sceen.net/~rbraun/htop.png
<pinotree> ah

IRC, freenode, #hurd, 2012-07-13

<braunr> i always get a "no more room for vm_map_enter" error when building
  glibc :/
<braunr> but the build continues, probably a failed test
<braunr> ah yes, i can see the yellow bar :>
<antrik> braunr: congrats :-)
<braunr> antrik: thanks
<braunr> but i think my patch can't make it into the git repo until the
  swap deadlock is solved (or at least very infrequent ..)

libpager deadlock.

<braunr> well, the page cache accounting tells me something is wrong there
  too lol
<braunr> during a build 112M of data was created, of which only 28M made it
  into the cache
<braunr> which may imply something is still holding references on the
  other objects (shadow objects hold references to their underlying
  object, which could explain this)
<braunr> ok i'm stupid, i just forgot to subtract the cached pages from the
  used pages .. :>
<braunr> (hm, actually i'm tired, i don't think this should be done)
<braunr> ahh yes much better
<braunr> i simply forgot to convert pages in kilobytes .... :>
<braunr> with the fix, the accounting of cached files is perfect :)
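
The 28M and 112M figures above differ by a factor of four, which is what one would see if a page count were displayed as kilobytes on a system with 4 KiB pages. That page size is an assumption here; the log does not state it. A minimal sketch of the conversion, with the hypothetical helper name `pages_to_kib`:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a page count to KiB.  page_size is given in bytes and is
   assumed to be a multiple of 1024, which holds for common page sizes. */
uint64_t pages_to_kib(uint64_t pages, uint64_t page_size)
{
    assert(page_size % 1024 == 0);
    return pages * (page_size / 1024);
}
```

Under that assumption, `pages_to_kib(7168, 4096)` gives 28672 KiB, i.e. 28 MiB, whereas printing the raw count 7168 as if it were kilobytes would show only 7 MiB.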

IRC, freenode, #hurd, 2012-07-14

<youpi> braunr: btw, if you want to stress big builds, you might want to
  try webkit, ppl, rquantlib, rheolef, yade
<youpi> they don't pass on bach (1.3GiB), but do on ironforge (1.8GiB)
<braunr> youpi: i don't need to, i already know my patch triggers swap
  deadlocks more often, which was expected
<youpi> k
<braunr> there are 3 tasks concerning my work : 1/ page cache accounting
  (i'm sending the patch right now) 2/ removing the fixed limit and 3/
  hunting the swap deadlock and fixing as much as possible
<braunr> 2/ can't get in the repository without 3/ imo
<youpi> btw, the increase of PAGE_FREE_* in your 2/ could go already,
  couldn't it?
<braunr> yes
<braunr> but we should test with higher thresholds
<braunr> well
<braunr> it really depends on the usage pattern :/

ext2fs libports reference counting assertion

IRC, freenode, #hurd, 2012-07-15

<braunr> concerning the page cache patch, i've been using it for quite some
  time now, did lots of builds with it, and i actually wonder if it hurts
  stability as much as i think
<braunr> considering i didn't stress the system as much before
<braunr> and it really improves performance

<braunr> cached memobjs:   138606
<braunr> cache:             1138M
<braunr> i bet ext2fs can have a hard time scanning 138k entries in a
  linked list, using callback functions on each of them :x

IRC, freenode, #hurd, 2012-07-16

<tschwinge> braunr: Sorry that I didn't have better results to present.
  :-/
<braunr> eh, that was expected :)
<braunr> my biggest problem is the hurd itself :/
<braunr> for my patch to be useful (and the rest of the intended work), the
  hurd needs some serious fixing
<braunr> not syncing from the pagers
<braunr> and scalable algorithms everywhere of course

IRC, freenode, #hurd, 2012-07-23

<braunr> youpi: FYI, the branches rbraun/page_cache in the gnupach and hurd
  repos are ready to be merged after review
<braunr> gnumach*
<youpi> so you fixed the hangs & such?
<braunr> they only contain the cache stats, not the "improved" cache
<braunr> no
<braunr> it requires much more work for that :)
<youpi> braunr: my concern is that the tests on buildds show stability
  regression
<braunr> youpi: tschwinge also reported performance degradation
<braunr> and not the minor kind
<youpi> uh
<tschwinge> :-/
<braunr> far less pageins, but twice as many pageouts, and probably high
  cpu overhead
<braunr> building (which is what buildds do) means lots of small files
<braunr> so lots of objects
<braunr> huge lists, long scans, etc..
<braunr> so it definitely requires more work
<braunr> the stability issue comes first to mind, and i don't see a way to
  obtain a usable trace
<braunr> do you ?
<youpi> nope
<braunr> (except making it loop forever instead of calling assert() and
  attach gdb to a qemu instance)
<braunr> youpi: if you think the infinite loop trick is ok, we could
  proceed with that
<youpi> which assert?
<braunr> the port refs one
<youpi> which one?
<braunr> which prevented you from using the page cache patch on buildds
<youpi> ah, the libports one
<youpi> for that one, I'd tend to take the time to perhaps use coccicheck
  actually

code analysis.

<braunr> oh
<youpi> it's one of those which is supposed to be statically ananyzable
<youpi> s/n/l
<braunr> that would be great
<tschwinge> :-)
<tschwinge> And set a precedent.

IRC, freenode, #hurd, 2012-07-26

<braunr> hm i killed darnassus, probably the page cache patch again

IRC, freenode, #hurd, 2012-09-19

<youpi> I was wondering about the page cache information structure
<youpi> I guess the idea is that if we need to add a field, we'll just
  define another RPC?
<youpi> braunr: ↑
<braunr> i've done that already, yes
<braunr> youpi: have a look at the rbraun/page_cache gnumach branch
<youpi> that's what I was referring to
<braunr> ok

IRC, freenode, #hurd, 2013-01-15

<braunr> hm, no wonder the page cache patch reduced performance so much
<braunr> the page cache when building even moderately large packages is
  about a few dozen MiB (around 50)
<braunr> the patch enlarged it to several hundreds :/
<ArneBab> braunr: so the big page cache essentially killed memory locality?
<braunr> ArneBab: no, it made ext2fs crazy (disk translators - used as
  pagers - scan their cached pages every 5 seconds to flush the dirty ones)
<braunr> you can imagine what happens if scanning and flushing a lot of
  pages takes more than 5 seconds
<ArneBab> ouch… that’s heavy, yes
<ArneBab> I already see it pile up in my mind
<braunr> and it's completely linear, using a lock to protect the whole list
<braunr> darnassus is currently showing such a behaviour, because tschwinge
  is linking huge files (one object with lots of pages)
<braunr> 446 MB of swap used, between 200 and 1850 MiB of RAM used, and i
  can still use vim and build stuff without being too disturbed
<braunr> the system does feel laggy, but there has been great stability
  improvements
<braunr> have*
<braunr> and even if laggy, it doesn't feel much more than the usual lag of
  a network (ssh) based session
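
The remark above about a scan that takes longer than the 5 second interval can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a measurement: the 4 KiB page size, the 50 microsecond per-page cost, the 500 MiB figure standing in for "several hundreds", and the rule that a new scan cannot start before the previous one has finished are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # bytes; assumed page size


def scan_seconds(cache_mib, seconds_per_page):
    """Time for one linear pass over every cached page."""
    pages = cache_mib * 1024 * 1024 // PAGE_SIZE
    return pages * seconds_per_page


def busy_share(cache_mib, seconds_per_page, interval=5.0):
    """Fraction of wall-clock time spent scanning (and, in this model,
    holding the list lock), assuming a scan is due every `interval`
    seconds but cannot start before the previous one has finished."""
    scan = scan_seconds(cache_mib, seconds_per_page)
    return scan / max(interval, scan)


# 50 MiB is the usual cache size mentioned above; 500 MiB is an assumed
# value for the enlarged cache.
for cache_mib in (50, 500):
    print(cache_mib, "MiB:",
          scan_seconds(cache_mib, 50e-6), "s per scan,",
          busy_share(cache_mib, 50e-6), "of the time scanning")
```

Under these assumptions the pager is idle most of the time with the small cache. Once a single pass exceeds the interval, it scans back to back and the lock protecting the list is effectively never free.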

IRC, freenode, #hurd, 2013-10-08

<braunr> hmm i have to change what gnumach reports as being cached memory

IRC, freenode, #hurd, 2013-10-09

<braunr> mhmm, i'm able to copy files as big as 256M while building debian
  packages, using a gnumach kernel patched for maximum memory usage in the
  page cache
<braunr> just because i used --sync=30 in ext2fs
<braunr> a bit of swapping (around 40M), no deadlock yet
<braunr> gitweb is a bit slow but that's about it
<braunr> that's quite impressive
<braunr> i suspect thread storms might not even be the cataclysmic events
  that we thought they were
<braunr> the true problem might simply be parallel fs syncs

IRC, freenode, #hurd, 2013-10-10

<braunr> even with the page cache patch, memory filled, swap used, and lots
  of cached objects (over 200k), darnassus is impressively resilient
<braunr> i really wonder whether we fixed ext2fs deadlock

<braunr> youpi: fyi, darnassus is currently running a patched gnumach with
  the vm cache changes, in hope of reproducing the assertion errors we had
  in the past
<braunr> i increased the sync interval of ext2fs to 30s like we discussed a
  few months back
<braunr> and for now, it has been very resilient, failing only because of
  the lack of kernel map entries after several heavy package builds
<gg0> wait the latter wasn't a deadlock it resumed after 1363.06 s
<braunr> gg0: thread storms can sometimes (rarely) fade and let the system
  resume "normally"
<braunr> which is why i increased the sync interval to 30s, this leaves
  time between two intervals for normal operations
<braunr> otherwise writebacks are queued one after the other, and never
  processed fast enough for that queue to become empty again (except
  rarely)
<braunr> youpi: i think we should consider applying at least the sync
  interval to exodar, since many DDs are just unaware of the potential
  problems with large IOs
<youpi> sure
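
The queueing argument above (writebacks queued one after the other and never drained before the next sync) can be sketched with a toy model. The function name, the burst size of 3000 writebacks per sync and the service rate of 400 writebacks per second are hypothetical. The model also assumes that the burst does not grow with the interval, for instance because the same pages are re-dirtied repeatedly between syncs.

```python
def backlog_history(syncs, interval, writes_per_sync, service_rate):
    """Return the writeback backlog left just before each following sync.

    Each sync enqueues `writes_per_sync` requests at once; the disk then
    drains `service_rate` requests per second for `interval` seconds.
    """
    backlog = 0
    history = []
    for _ in range(syncs):
        backlog += writes_per_sync
        backlog = max(0, backlog - service_rate * interval)
        history.append(backlog)
    return history


# Hypothetical load: 3000 writebacks per sync, 400 writebacks/s drained.
print("5s interval: ", backlog_history(4, 5, 3000, 400))
print("30s interval:", backlog_history(4, 30, 3000, 400))
```

In this model the 5 second interval leaves a backlog that grows with every sync, while the 30 second interval lets the queue empty each time, leaving the rest of the interval for normal operations.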

<braunr> 222k cached objects (1G of cached memory) and darnassus is still
  kicking :)
<braunr> youpi: those lock fixing patches your colleague sent last year
  must have helped somewhere
<youpi> :)

IRC, freenode, #hurd, 2013-10-13

<youpi> braunr: how are your tests going with the object cache?
<braunr> youpi: not so good
<braunr> youpi: it failed after 2 days of straight building without a
  single error output :/