There is a FOSS Factory bounty (p266) on this task.

IRC, freenode, #hurd, 2011-04-12

<antrik> braunr: do you think the allocator you wrote for x15 could be used
  for gnumach? and would you be willing to mentor this? :-)
<braunr> antrik: to be willing to isn't my current problem
<braunr> antrik: and yes, I think my allocator can be used
<braunr> it's a slab allocator after all, it only requires reap() and
<braunr> or mmap()/munmap() whatever you want to call it
<braunr> a backend
<braunr> antrik: although i've been having other ideas recently
<braunr> that would have more impact on our usage patterns I think
<antrik> mcsim: have you investigated how the zone allocator works and how
  it's hooked into the system yet?
<braunr> mcsim: now let me give you a link
<braunr> mcsim:;f=mem.c;h=330436e799f322949bfd9e2fedf0475660309946;hb=HEAD
<braunr> mcsim: this is an implementation of the slab allocator i've been
  working on recently
<braunr> mcsim: i haven't made it public because i reworked the per
  processor layer, and this part isn't complete yet
<braunr> mcsim: you could use it as a reference for your project
<mcsim> braunr: ok
<braunr> it used to be close to the 2001 vmem paper
<braunr> but after many tests, fragmentation and accounting issues have
  been found
<braunr> so i rewrote it to be closer to the linux implementation (cache
  filling/draining in bukl transfers)
<braunr> bulk*
<braunr> they actually use the word draining in linux too :)
<mcsim> antrik: not complete yet.
<antrik> braunr: oh, it's unfinished? that's unfortunate...
<braunr> antrik: only the per processor part
<braunr> antrik: so it doesn't matter much for gnumach
<braunr> and it's not difficult to set up
<antrik> mcsim: hm, OK... but do you think you will have a fairly good
  understanding in the next couple of days?...
<antrik> I'm asking because I'd really like to see a proposal a bit more
  specific than "I'll look into things..."
<antrik> i.e. you should have an idea which things you will actually have
  to change to hook up a new allocator etc.
<antrik> braunr: OK. will the interface remain unchanged, so it could be
  easily replaced with an improved implementation later?
<braunr> the zone allocator in gnumach is a badly written bare object
  allocator actually, there aren't many things to understand about it
<braunr> antrik: yes
<antrik> great :-)
<braunr> and the per processor part should be very close to the phys
  allocator sitting next to it
<braunr> (with the slight difference that, as per cpu caches have variable
  sizes, they are allocated on the free path rather than on the allocation
<braunr> this is a nice trick in the vmem paper i've kept in mind
<braunr> and the interface also allows to set a "source" for caches
<antrik> ah, good point... do you think we should replace the physmem
  allocator too? and if so, do it in one step, or one piece at a time?...
<braunr> no
<braunr> too many drivers currently depend on the physical allocator and
  the pmap module as they are
<braunr> remember linux 2.0 drivers need a direct virtual to physical
<braunr> (especially true for dma mappings)
<antrik> OK
<braunr> the nice thing about having a configurable memory source is that
<antrik> what do you mean by "allocated on the free path"?
<braunr> even if most caches will use the standard vm_kmem module as their
<braunr> there is one exception in the vm_map module, allowing us to get
  rid of either a static limit, or specific allocation code
<braunr> antrik: well, when you allocate a page, the allocator will lookup
  one in a per cpu cache
<braunr> if it's empty, it fills the cache
<braunr> (called pools in my implementations)
<braunr> it then retries
<braunr> the problem in the slab allocator is that per cpu caches have
  variable sizes
<braunr> so per cpu pools are allocated from their own pools
<braunr> (remember the magazine_xx caches in the output i showed you, this
  is the same thing)
<braunr> but if you allocate them at allocation time, you could end up in
  an infinite loop
<braunr> so, in the slab allocator, when a per cpu cache is empty, you just
  fall back to the slab layer
<braunr> on the free path, when a per cpu cache doesn't exist, you allocate
  it from its own cache
<braunr> this way you can't have an infinite loop
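
The free-path trick braunr describes above can be sketched in C as follows. This is a minimal single-CPU model under stated assumptions, not the code from his mem.c: the names `cache`, `cpu_pool`, `pool_cache`, `slab_alloc` and `slab_free` are hypothetical, and malloc()/free() stand in for the slab layer. The point it illustrates is that a pool is only ever created in cache_free(), so cache_alloc() can never recurse into itself to obtain one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_CAPACITY 4

/* Hypothetical per-CPU pool ("magazine"): a small stack of free objects. */
struct cpu_pool {
    size_t nr_objs;
    void *objs[POOL_CAPACITY];
};

/* Hypothetical cache; a single CPU is modelled to keep the sketch short. */
struct cache {
    size_t obj_size;
    struct cpu_pool *pool;      /* NULL until the first free */
    unsigned long slab_allocs;  /* how often the slab layer was used */
    unsigned long slab_frees;
};

/* Pools are themselves objects allocated from their own cache. */
struct cache pool_cache = { .obj_size = sizeof(struct cpu_pool) };

/* Stand-ins for the slab layer (assumption: malloc/free as the backend). */
static void *slab_alloc(struct cache *c)
{
    c->slab_allocs++;
    return malloc(c->obj_size);
}

static void slab_free(struct cache *c, void *obj)
{
    c->slab_frees++;
    free(obj);
}

void *cache_alloc(struct cache *c)
{
    struct cpu_pool *pool = c->pool;

    if (pool != NULL && pool->nr_objs > 0)
        return pool->objs[--pool->nr_objs];

    /* Pool missing or empty: never create one here, use the slab layer. */
    return slab_alloc(c);
}

void cache_free(struct cache *c, void *obj)
{
    if (c->pool == NULL) {
        /* Free path: the only place where a pool gets allocated. The call
         * below cannot loop, because cache_alloc() never creates pools. */
        c->pool = cache_alloc(&pool_cache);
        if (c->pool != NULL)
            c->pool->nr_objs = 0;
    }

    if (c->pool != NULL && c->pool->nr_objs < POOL_CAPACITY) {
        c->pool->objs[c->pool->nr_objs++] = obj;
        return;
    }

    /* No pool available or pool full: hand the object back to the slabs. */
    slab_free(c, obj);
}
```

In this model the first allocation from a fresh cache always goes to the slab layer, and the pool only appears once something has been freed.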
<mcsim> antrik: I'll try, but I have exams now.
<mcsim> As I understand amount of elements which could be allocated we
  determine by zone initialization. And at this time memory for zone is
  reserved. I'm going to change this. And make something similar to kmalloc
  and vmalloc (support for pages consecutive physically and virtually). And
  pages in zones consecutive always physically.
<mcsim> Am I right?
<braunr> mcsim: don't try to do that
<mcsim> why?
<braunr> mcsim: we just need a slab allocator with an interface close to
  the zone allocator
<antrik> mcsim: IIRC the size of the complete zalloc map is fixed; but not
  the number of elements per zone
<braunr> we don't need two allocators like kmalloc and vmalloc
<braunr> actually we just need vmalloc
<braunr> IIRC the limits are only present because the original developers
  wanted to track leaks
<braunr> they assumed zones would be large enough, which isn't true any
  more today
<braunr> but i didn't see any true reservation
<braunr> antrik: i'm not sure i was clear enough about the "allocation of
  cpu caches on the free path"
<braunr> antrik: for a better explanation, read the vmem paper ;)
<antrik> braunr: you mean there is no fundamental reason why the zone map
  has a limited maximal size; and it was only put in to catch cases where
  something eats up all memory with kernel object creation?...
<antrik> braunr: I think I got it now :-)
<braunr> antrik: i'm pretty certain of it yes
<antrik> I don't see though how it is related to what we were talking
<braunr> 10:55 < braunr> and the per processor part should be very close to
  the phys allocator sitting next to it
<braunr> the phys allocator doesn't have to use this trick
<braunr> because pages have a fixed size, so per cpu caches all have the
  same size too
<braunr> and the number of "caches", that is, physical segments, is limited
  and known at compile time
<braunr> so having them statically allocated is possible
<antrik> I see
<braunr> it would actually be very difficult to have a phys allocator
  requiring dynamic allocation when the dynamic allocator isn't yet ready
<antrik> hehe :-)
<mcsim> total size of all zone allocations is limited to 12 MB. And is "was
  only put in to catch cases where something eats up all memory with kernel
  object creation?"
<braunr> mcsim: ah right, there could be a kernel submap backing all the
<braunr> but this can be increased too
<braunr> submaps are kind of evil :/
<antrik> mcsim: I think it's actually 32 MiB or something like that in the
  Debian version...
<antrik> braunr: I'm not sure I ever fully understood what the zalloc map
  is... I looked through the code once, and I think I got a rough
  understanding, but I was still pretty uncertain about some bits. and I
  don't remember the details anyways :-)
<braunr> antrik: IIRC, it's a kernel submap
<braunr> it's named kmem_map in x15
<antrik> don't know what a submap is
<braunr> submaps are vm_map objects
<braunr> in a top vm_map, there are vm_map_entries
<braunr> these entries usually point to vm_objects
<braunr> (for the page cache)
<braunr> but they can point to other maps too
<braunr> the goal is to reduce fragmentation by isolating allocations
<braunr> this also helps reducing contention
<braunr> for example, on BSD, there is a submap for mbufs, so that the
  network code doesn't interfere too much with other kernel allocations
<braunr> antrik: they are similar to spans in vmem, but vmem has an elegant
  importing mechanism which eliminates the static limit problem
<antrik> so memory is not directly allocated from the physical allocator,
  but instead from another map which in turn contains physical memory, or
  something like that?...
<braunr> no, this is entirely virtual
<braunr> submaps are almost exclusively used for the kernel_map
<antrik> you are using a lot of identifiers here, but I don't remember (or
  never knew) what most of them mean :-(
<braunr> sorry :)
<braunr> the kernel map is the vm_map used to represent the ~1 GiB of
  virtual memory the kernel has (on i386)
<braunr> vm_map objects are simple virtual space maps
<braunr> they contain what you see in linux when doing /proc/self/maps
<braunr> cat /proc/self/maps
<braunr> (linux uses entirely different names but it's roughly the same
<braunr> each line is a vm_map_entry
<braunr> (well, there aren't submaps in linux though)
<braunr> the pmap tool on netbsd is able to show the kernel map with its
  submaps, but i don't have any image around
<mcsim> braunr: is limit for zones is feature and shouldn't be changed?
<braunr> mcsim: i think we shouldn't have fixed limits for zones
<braunr> mcsim: this should be part of the debugging facilities in the slab
<braunr> is this fixed limit really a major problem ?
<braunr> i mean, don't focus on that too much, there are other issues
  requiring more attention
<antrik> braunr: at 12 MiB, it used to be, causing a lot of zalloc
  panics. after increasing, I don't think it's much of a problem anymore...
<antrik> but as memory sizes grow, it might become one again
<antrik> that's the problem with a fixed size...
<braunr> yes, that's the issue with submaps
<braunr> but gnumach is full of those, so let's fix them by order of
<antrik> well, I'm still trying to digest what you wrote about submaps :-)
<braunr> i'm downloading netbsd, so you can have a good view of all this
<antrik> so, when the kernel allocates virtual address space regions
  (mostly for itself), instead of grabbing chunks of the address space
  directly, it takes parts out of a pre-reserved region?
<braunr> not exactly
<braunr> both statements are true
<mcsim> antrik: only virtual addresses are reserved
<braunr> it grabs chunks of the address space directly, but does so in a
  reserved region of the address space
<braunr> a submap is like a normal map, it has a start address, a size, and
  is empty, then it's populated with vm_map_entries
<braunr> so instead of allocating from 3-4 GiB, you allocate from, say,
  3.1-3.2 GiB
<antrik> yeah, that's more or less what I meant...
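
The submap idea discussed above (a reserved sub-range of the kernel address space that then serves its own allocations) can be illustrated with a toy model. This is only a sketch: `toy_map`, `toy_map_alloc` and `toy_submap_create` are hypothetical names, a bump pointer replaces the real list of vm_map_entries, and nothing is ever freed. It also shows the static limit problem mentioned in the discussion: the submap can run out of space while its parent still has plenty.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy address-space map: purely virtual, no physical memory involved. */
struct toy_map {
    uint64_t start;  /* first address managed by the map */
    uint64_t end;    /* one past the last managed address */
    uint64_t next;   /* bump pointer: next unallocated address */
};

/* Allocate size bytes of address space from the map, if it still fits. */
bool toy_map_alloc(struct toy_map *map, uint64_t size, uint64_t *addr)
{
    assert(map->start <= map->next && map->next <= map->end);

    if (size > map->end - map->next)
        return false;

    *addr = map->next;
    map->next += size;
    return true;
}

/* Reserve a range in the parent and describe it as an empty child map. */
bool toy_submap_create(struct toy_map *parent, uint64_t size,
                       struct toy_map *sub)
{
    uint64_t base;

    if (!toy_map_alloc(parent, size, &base))
        return false;

    sub->start = base;
    sub->end = base + size;
    sub->next = base;
    return true;
}
```

With a parent covering 3-4 GiB and a 32 MiB submap, every address handed out by the submap lies inside the reserved range, and a request larger than what is left in the submap fails even though the parent is mostly empty.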
<mcsim> braunr: I see two problems: limited zones and absence of caching. 
<mcsim> with caching absence of readahead paging will be not so significant
<braunr> please avoid readahead
<mcsim> ok
<braunr> and it's not about paging, it's about kernel memory, which is
<braunr> (well most of it)
<braunr> what about limited zones ?
<braunr> the whole kernel space is limited, there has to be limits
<braunr> the problem is how to handle them
<antrik> braunr: almost all. I looked through all zones once, and IIRC I
  found exactly one that actually allows paging...
<braunr> currently, when you reach the limit, you have an OOM error
<braunr> antrik: yes, there are
<braunr> i don't remember which implementation does that but, when
  processes haven't been active for a minute or so, they are "swapedout"
<braunr> completely
<braunr> even the kernel stack
<braunr> and the page tables
<braunr> (most of the pmap structures are destroyed, some are retained)
<antrik> that might very well be true... at least inactive processes often
  show up with 0 memory use in top on Hurd
<braunr> this is done by having a pageable kernel map, with wired entries
<braunr> when the swapper thread swaps tasks out, it unwires them
<braunr> but i think modern implementations don't do that any more
<antrik> well, I was talking about zalloc only :-)
<braunr> oh
<braunr> so the zalloc_map must be pageable
<braunr> or there are two submaps ?
<antrik> not sure whether "modern implementations" includes Linux ;-)
<braunr> no, i'm talking about the bsd family only
<antrik> but it's certainly true that on Linux even inactive processes
  retain some memory
<braunr> linux doesn't make any difference between processor-bound and
  I/O-bound processes
<antrik> braunr: I have no idea how it works. I just remember that when
  creating zones, one of the optional flags decides whether the zone is
  pagable. but as I said, IIRC there is exactly one that actually is...
<braunr> zone_map = kmem_suballoc(kernel_map, &zone_min, &zone_max,
  zone_map_size, FALSE);
<braunr> kmem_suballoc(parent, min, max, size, pageable)
<braunr> so the zone_map isn't
<antrik> IIRC my conclusion was that pagable zones do not count in the
  fixed zone map limit... but I'm not sure anymore
<braunr> zinit() has a memtype parameter
<braunr> with ZONE_PAGEABLE as a possible flag
<braunr> this is weird :)
<mcsim> There aren't any zones which use the ZONE_PAGEABLE flag
<antrik> mcsim: are you sure? I think I found one...
<braunr> if (zone->type & ZONE_PAGEABLE) {
<antrik> admittedly, it is several years ago that I looked into this, so my
  memory is rather dim...
<braunr> if (kmem_alloc_pageable(zone_map, &addr, ...
<braunr> calling kmem_alloc_pageable() on an unpageable submap seems wrong
<mcsim> I've grepped the gnumach code and there isn't any zinit call with
  the ZONE_PAGEABLE flag
<braunr> good
<antrik> hm... perhaps it was in some code that has been removed
  altogether since ;-)
<antrik> actually I think it would be pretty neat to have pageable kernel
  objects... but I guess it would require considerable effort to implement
  this right
<braunr> mcsim: you also mentioned absence of caching
<braunr> mcsim: the zone allocator actually is a bare caching object
<braunr> antrik: no, it's easy
<braunr> antrik: i already had that in x15 0.1
<braunr> antrik: the problem is being sure the objects you allocate from a
  pageable backing store are never used when resolving a page fault
<braunr> that's all
<antrik> I wouldn't expect that to be easy... but surely you know better
<mcsim> braunr: indeed. I was wrong.
<antrik> braunr: what is a caching object allocator?...
<braunr> antrik: ok, it's not easy
<braunr> antrik: but once you have vm_objects implemented, having pageable
  kernel object is just a matter of using the right options, really
<braunr> antrik: an allocator that caches its buffers
<braunr> some years ago, the term "object" would also apply to
  preconstructed buffers
<antrik> I have no idea what you mean by "caches its buffers" here :-)
<braunr> well, a memory allocator which doesn't immediately free its
  buffers caches them
<mcsim> braunr: but can it return objects to system?
<braunr> mcsim: which one ?
<antrik> yeah, obviously the *implementation* of pageable kernel objects is
  not hard. the tricky part is deciding which objects can be pageable, and
  which need to be wired...
<mcsim> Can zone allocator return cached objects to system as in slab?
<mcsim> I mean reap()
<braunr> well yes, it does so, and it does that too often
<braunr> the caching in the zone allocator is actually limited to the
<braunr> once page is completely free, it is returned to the vm
<mcsim> this is bad caching
<braunr> yes
<mcsim> if an object takes a whole page then there is no caching at all
<braunr> caching by side effect
<braunr> true
<braunr> but the linux slab allocator does the same thing :p
<braunr> hm
<braunr> no, the solaris slab allocator does so
<mcsim> linux's slab returns objects only when system ask
<antrik> without preconstructed objects, is there actually any point in
  caching empty slabs?...
<mcsim> Once I've changed my allocator to slab and it cached more than 1GB
  of my memory)
<braunr> ok wait, need to fix a few mistakes first
<mcsim> s/ask/asks
<braunr> the zone allocator (in gnumach) actually has a garbage collector
<antrik> braunr: well, the Solaris allocator follows the slab/magazine
  paper, right? so there is caching at the magazine layer... in that case
  caching empty slabs too would be rather redundant I'd say...
<braunr> which is called when running low on memory, similar to the slab
<braunr> antrik: yes
<antrik> (or rather the paper follows the Solaris allocator ;-) )
<braunr> mcsim: the zone allocator reap() is zone_gc()
<antrik> braunr: hm, right, there is a "collectable" flag for zones... but
  I never understood what it means
<antrik> braunr: BTW, I heard Linux has yet another allocator now called
  "slob"... do you happen to know what that is?
<braunr> slob is a very simple allocator for embedded devices
<mcsim> AFAIR this is just heap allocator
<braunr> useful when you have a very low amount of memory
<braunr> like 1 MiB
<braunr> yes
<antrik> just googled it :-)
<braunr> zone and slab are very similar
<antrik> sounds like a simple heap allocator
<mcsim> there is another allocator called slub, and it is better than slab
  in many cases
<braunr> the main difference is the data structures used to store slabs
<braunr> mcsim: i disagree
<antrik> mcsim: ah, you already said that :-)
<braunr> mcsim: slub is better for systems with very large amounts of
  memory and processors
<braunr> otherwise, slab is better
<braunr> in addition, there are accounting issues with slub
<braunr> because of cache merging
<mcsim> ok. This strange that slub is default allocator
<braunr> well both are very good
<braunr> iirc, linus stated that he really doesn't care as long as it
  works fine
<braunr> he refused slqb because of that
<braunr> slub is nice because it requires less memory than slab, while
  still being as fast for most cases
<braunr> it gets slower on the free path, when the cpu performing the free
  is different from the one which allocated the object
<braunr> that's a reasonable cost
<mcsim> slub uses heap for large object. Are there any tests that compare
  what is better for large objects?
<antrik> well, if slub requires less memory, why do you think slab is
  better for smaller systems? :-)
<braunr> antrik: smaller is relative
<antrik> mcsim: for large objects slab allocation is rather pointless, as
  you don't have multiple objects in a page anyways...
<braunr> antrik: when lameter wrote slub, it was intended for systems with
  several hundreds processors
<antrik> BTW, was slqb really refused only because the other ones are "good
<braunr> yes
<antrik> wow, that's a strange argument...
<braunr> linus is already unhappy of having "so many" allocators
<antrik> well, if the new one is better, it could replace one of the others
<antrik> or is it useful only in certain cases?
<braunr> that's the problem
<braunr> nobody really knows
<antrik> hm, OK... I guess that should be tested *before* merging ;-)
<antrik> is anyone still working on it, or was it abandoned?
<antrik> mcsim: back to caching...
<antrik> what does caching in the kernel object allocator got to do with
  readahead (i.e. clustered paging)?...
<mcsim> if we cached some physical pages we don't need to find new ones for
  allocating new object. And that's why there will not be a page fault.
<mcsim> antrik: Regarding kam. Hasn't he finished his project?
<antrik> err... what?
<antrik> one of us must be seriously confused
<antrik> I totally fail to see what caching of physical pages (which isn't
  even really a correct description of what slab does) has to do with page
<antrik> right, KAM didn't finish his project
<mcsim> If we free the physical page and return it to system we need
  another one for next allocation. But if we keep it, we don't need to find
  new physical page. 
<mcsim> And physical page is allocated only then when page fault
  occurs. Probably, I'm wrong
<antrik> what does "return to system" mean? we are talking about the
<antrik> zalloc/slab are about allocating kernel objects. this doesn't have
  *anything* to do with paging of userspace processes
<antrik> only thing the have in common is that they need to get pages from
  the physical page allocator. but that's yet another topic
<mcsim> Under "return to system" I mean ability to use this page for other
<braunr> mcsim: consider kernel memory to be wired
<braunr> here, return to system means releasing a page back to the vm
<braunr> the vm_kmem module then unmaps the physical page and frees its
  virtual address in the kernel map
<mcsim> ok
<braunr> antrik: the problem with new allocators like slqb is that it's
  very difficult to really know if they're better, even with extensive
<braunr> antrik: there are papers (like wilson95) about the difficulties in
  making valuable results in this field
<braunr> see
<mcsim> how can be allocated physically continuous object now?
<braunr> mcsim: rephrase please
<mcsim> what is similar to kmalloc in Linux to gnumach?
<braunr> i know memory is reserved for dma in a direct virtual to physical
<braunr> so even if the allocation is done similarly to vmalloc()
<braunr> the selected region of virtual space maps physical memory, so
  memory is physically contiguous too
<braunr> for other allocation types, a block large enough is allocated, so
  it's contiguous too
<mcsim> I don't clearly understand. If we have fragmentation in physical
  ram, so there aren't 2 free pages in a row, but there are some apart, we
  can't allocate these 2 pages together?
<braunr> no
<braunr> but every system has this problem
<mcsim> But since we have only 12 or 32 MB of memory the problem becomes
  more significant
<braunr> you're confusing virtual and physical memory
<braunr> those 32 MiB are virtual
<braunr> the physical pages backing them don't have to be contiguous
<mcsim> Oh, indeed 
<mcsim> So the only problem are limits?
<braunr> and performance
<braunr> and correctness
<braunr> i find the zone allocator badly written
<braunr> antrik: mcsim: here is the content of the kernel pmap on NetBSD
  (which uses a virtual memory system close to the Mach VM)
<braunr> antrik: mcsim:


<braunr> you can see the kmem_map (which is used for most general kernel
  allocations) is 128 MiB large
<braunr> actually it's not the kernel pmap, it's the kernel_map
<antrik> braunr: why is it called pmap.out then? ;-)
<braunr> antrik: because the tool is named pmap
<braunr> for process map
<braunr> it also exists under Linux, although direct access to
  /proc/xx/maps gives more info
<mcsim> braunr: I've said that this is kernel_map. Can I see kernel_map for
<braunr> mcsim: I don't know how to do that
<mcsim> s/I've/You've
<braunr> but Linux doesn't have submaps, and uses a direct virtual to
  physical mapping, so it's used differently
<antrik> how are things (such as zalloc zones) entered into kernel_map?
<braunr> in zone_init() you have
<braunr> zone_map = kmem_suballoc(kernel_map, &zone_min, &zone_max,
  zone_map_size, FALSE);
<braunr> so here, kmem_map is named zone_map
<braunr> then, in zalloc()
<braunr> kmem_alloc_wired(zone_map, &addr, zone->alloc_size)
<antrik> so, kmem_alloc just deals out chunks of memory referenced directly
  by the address, and without knowing anything about the use?
<braunr> kmem_alloc() gives virtual pages
<braunr> zalloc() carves them into buffers, as in the slab allocator
<braunr> the difference is essentially the lack of formal "slab" object
<braunr> which makes the zone code look like a mess
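
As a rough illustration of "carving pages into buffers", here is a minimal sketch of an object allocator that takes one page at a time from a backend and threads a free list through it. It is not the gnumach zalloc() code: `toy_zone`, `toy_zalloc`, `toy_zfree` and `toy_page_alloc` are hypothetical names, malloc() stands in for kmem_alloc_wired(), and pages are never returned to the backend.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 4096

/* A free buffer stores the link to the next free buffer in its own memory. */
struct free_buf {
    struct free_buf *next;
};

struct toy_zone {
    size_t elem_size;            /* size of every buffer in this zone */
    struct free_buf *free_list;  /* singly linked list of free buffers */
    size_t nr_free;
};

/* Stand-in for the page backend (assumption: malloc instead of the VM). */
static void *toy_page_alloc(void)
{
    return malloc(TOY_PAGE_SIZE);
}

/* Get one page and carve it into elem_size buffers. */
static int toy_zone_grow(struct toy_zone *z)
{
    char *page;
    size_t off;

    /* Buffers must be able to hold, and be aligned for, the free link. */
    assert(z->elem_size >= sizeof(struct free_buf));
    assert(z->elem_size % sizeof(void *) == 0);

    page = toy_page_alloc();
    if (page == NULL)
        return -1;

    for (off = 0; off + z->elem_size <= TOY_PAGE_SIZE; off += z->elem_size) {
        struct free_buf *buf = (struct free_buf *)(page + off);

        buf->next = z->free_list;
        z->free_list = buf;
        z->nr_free++;
    }

    return 0;
}

void *toy_zalloc(struct toy_zone *z)
{
    struct free_buf *buf;

    if (z->free_list == NULL && toy_zone_grow(z) != 0)
        return NULL;

    buf = z->free_list;
    z->free_list = buf->next;
    z->nr_free--;
    return buf;
}

void toy_zfree(struct toy_zone *z, void *elem)
{
    struct free_buf *buf = elem;

    buf->next = z->free_list;
    z->free_list = buf;
    z->nr_free++;
}
```

Note that there is no per-page bookkeeping here, which mirrors the lack of a formal "slab" object mentioned above: the zone only knows about individual free buffers, not about which page they came from.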
<antrik> so kmem_suballoc() essentially just takes a bunch of pages from
  the main kernel_map, and uses these to back another map which then in
  turn deals out pages just like the main kernel_map?
<braunr> no
<braunr> kmem_suballoc creates a vm_map_entry object, and sets its start
  and end address
<braunr> and creates a vm_map object, which is then inserted in the new
<braunr> maybe that's what you meant with "essentially just takes a bunch
  of pages from the main kernel_map"
<braunr> but there really is no allocation at this point
<braunr> except the map entry and the new map objects
<antrik> well, I'm trying to understand how kmem_alloc() manages things. so
  it has map_entry structures like the maps of userspace processes? do
  these also reference actual memory objects?
<braunr> kmem_alloc just allocates virtual pages from a vm_map, and backs
  those with physical pages (unless the user requested pageable memory)
<braunr> it's not "like the maps of userspace processes"
<braunr> these are actually the same structures
<braunr> a vm_map_entry can reference a memory object or a kernel submap
<braunr> in netbsd, it can also reference nothing (for pure wired kernel
  memory like the vm_page array)
<braunr> maybe it's the same in mach, i don't remember exactly
<braunr> antrik: this is actually very clear in vm/vm_kern.c
<braunr> kmem_alloc() creates a new kernel object for the allocation
<braunr> allocates a new entry (or uses a previous existing one if it can
  be extended) through vm_map_find_entry()
<braunr> then calls kmem_alloc_pages() to back it with wired memory
<antrik> "creates a new kernel object" -- what kind of kernel object?
<braunr> kmem_alloc_wired() does roughly the same thing, except it doesn't
  need a new kernel object because it knows the new area won't be pageable
<braunr> a simple vm_object
<braunr> used as a container for anonymous memory in case the pages are
  swapped out
<antrik> vm_object is the same as memory object/pager? or yet something
<braunr> antrik: almost
<braunr> antrik: a memory_object is the user view of a vm_object
<braunr> as in the kernel/user interfaces used by external pagers
<braunr> vm_object is a more internal name
<mcsim> Is fragmentation a big problem in slab allocator?
<mcsim> I've tested it on my computer in Linux and for some caches it
  reached 30-40%
<antrik> well, fragmentation is a major problem for any allocator...
<antrik> the original slab allocator was designed specifically with the goal
  of reducing fragmentation
<antrik> the revised version with the addition of magazines takes a step
  back on this though
<antrik> have you compared it to slub? would be pretty interesting...
<mcsim> I have an idea how can it be decreased, but it will hurt by
<mcsim> antrik: no I haven't, but there will be might the same, I think
<mcsim> if each cache will handle two types of object: with sizes that will
  fit cache sizes (or I bit smaller) and with sizes which are much smaller
  than maximal cache size. For first type of object will be used standard
  slab allocator and for latter type will be used (within page) heap
<mcsim> I think that then fragmentation will be decreased
<antrik> not at all. heap allocator has much worse fragmentation. that's
  why slab allocator was invented
<antrik> the problem is that in a long-running program (such as the
  kernel), objects tend to have vastly varying lifespans
<mcsim> but we use heap only for objects of specified sizes
<antrik> so often a few old objects will keep a whole page hostage
<mcsim> for example for 32 byte cache it could be 20-28 byte objects
<antrik> that's particularly visible in programs such as firefox, which
  will grow the heap during use even though actual needs don't change
<antrik> the slab allocator groups objects in a fashion that makes it more
  likely adjacent objects will be freed at similar times
<antrik> well, that's pretty oversimplified, but I hope you get the
  idea... it's about locality
<mcsim> I agree, but I speak not about general heap allocation. We have
  many heaps for objects with different sizes.
<mcsim> Could it be better?
<antrik> note that this has been a topic of considerable research. you
  shouldn't seek to improve the actual algorithms -- you would have to read
  up on the existing research at least before you can contribute anything
  to the field :-)
<antrik> how would that be different from the slab allocator?
<mcsim> slab will allocate 32 byte for both 20 and 32 byte requests
<mcsim> And if there was request for 20 bytes we get 12 unused
<antrik> oh, you mean the implementation of the generic allocator on top of
  slabs? well, that might not be optimal... but it's not an often used case
  anyways. mostly the kernel uses constant-sized objects, which get their
  own caches with custom tailored size
<antrik> I don't think the waste here matters at all
<mcsim> affirmative. So my idea is useless. 
<antrik> does the statistic you refer to show the fragmentation in absolute
  sizes too?
<mcsim> Can you explain what is absolute size?
<mcsim> I've counted what were requested (as parameter of kmalloc) and what
  was really allocated (according to best fit cache size).
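
The kind of accounting mcsim describes (requested size versus the size of the best-fit cache) can be sketched as below. This is not his hook or module: the names are hypothetical, and the size classes are assumed to be plain powers of two starting at 32 bytes, which is a simplification of what a real kernel uses. It reports the waste both as an absolute byte count and as a percentage of what was allocated.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed size classes: 32, 64, 128, ... (powers of two only). */
size_t toy_cache_size(size_t request)
{
    size_t size = 32;

    while (size < request)
        size <<= 1;

    return size;
}

/* Running totals over all recorded requests. */
struct waste_stats {
    size_t requested;  /* bytes the callers asked for */
    size_t allocated;  /* bytes actually taken from the caches */
};

void waste_record(struct waste_stats *s, size_t request)
{
    s->requested += request;
    s->allocated += toy_cache_size(request);
}

/* Absolute waste in bytes. */
size_t waste_bytes(const struct waste_stats *s)
{
    assert(s->allocated >= s->requested);
    return s->allocated - s->requested;
}

/* Waste as a percentage of the allocated bytes, rounded down. */
unsigned waste_percent(const struct waste_stats *s)
{
    if (s->allocated == 0)
        return 0;

    return (unsigned)(100 * waste_bytes(s) / s->allocated);
}
```

With one 20-byte and one 32-byte request, both land in the 32-byte cache: 52 bytes requested, 64 allocated, 12 bytes wasted.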
<antrik> how did you get that information?
<mcsim> I simply wrote a hook
<antrik> I mean total. i.e. how many KiB or MiB are wasted due to
  fragmentation altogether
<antrik> ah, interesting. how does it work?
<antrik> BTW, did you read the slab papers?
<mcsim> Do you mean articles from
<antrik> no 
<antrik> I mean the papers from the Sun hackers who invented the slab
<antrik> Bonwick mostly IIRC
<mcsim> Yes
<antrik> hm... then you really should know the rationale behind it...
<mcsim> There he says about 11% of memory waste
<antrik> you didn't answer my other questions BTW :-)
<mcsim> I've corrupted kernel tree with patch, and tomorrow I'm going to
  read myself up for exam (I have it on Thursday). But then I'll send you a
  module which I've used for testing.
<antrik> OK
<mcsim> I can send you module now, but it will not work without patch.
<mcsim> It would be better to rewrite it using debugfs, but when I was
  writing this test I didn't know about trace_* macros

IRC, freenode, #hurd, 2011-04-15

<mcsim> There is a hack in zone_gc when it allocates and frees two
  vm_map_kentry_zone elements to make sure the gc will be able to allocate
  two in vm_map_delete. Isn't it better to allocate memory for these
  entries statically?
<youpi> mcsim: that's not the point of the hack
<youpi> mcsim: the point of the hack is to make sure vm_map_delete will be
  able to allocate stuff
<youpi> allocating them statically will just work once
<youpi> it may happen several times that vm_map_delete needs to allocate it
  while it's empty (and thus zget_space has to get called, leading to a
<youpi> funnily enough, the bug is also in macos X
<youpi> it's still in my TODO list to manage to find how to submit the
  issue to them
<braunr> really ?
<braunr> eh
<braunr> is that because of map entry splitting ?
<youpi> it's git commit efc3d9c47cd744c316a8521c9a29fa274b507d26
<youpi> braunr: iirc something like this, yes
<braunr> netbsd has this issue too
<youpi> possibly
<braunr> i think it's a fundamental problem with the design
<braunr> people think of munmap() as something similar to free()
<braunr> whereas it's really unmap
<braunr> with a BSD-like VM, unmap can easily end up splitting one entry in
<braunr> but your issue is more about harmful recursion right ?
<youpi> I don't remember actually
<youpi> it's quite some time ago :)
<braunr> ok
<braunr> i think that's why i have "sources" in my slab allocator, the
  default source (vm_kern) and a custom one for kernel map entries

IRC, freenode, #hurd, 2011-04-18

<mcsim> braunr: you've said that once page is completely free, it is
  returned to the vm.
<mcsim> who else, besides zone_gc, can return free pages to the vm?
<braunr> mcsim: i also said i was wrong about that
<braunr> zone_gc is the only one

IRC, freenode, #hurd, 2011-04-19

<braunr> antrik: mcsim: i added back a new per-cpu layer as planned
<braunr> mcsim: btw, in mem_cache_reap(), you can clearly see there are two
  loops, just as in zone_gc, to reduce contention and avoid deadlocks
<braunr> this is really common in memory allocators
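
The two-loop pattern mentioned above can be sketched like this. It is an illustration under assumptions, not the actual mem_cache_reap() or zone_gc() code: all names are hypothetical, a boolean flag stands in for a real lock, and free() stands in for returning memory to the VM. The first loop only unlinks completely free slabs while the lock is held; the second loop releases them after the lock has been dropped, so the backend is never entered with the cache lock held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_slab {
    struct toy_slab *next;
    unsigned nr_used;  /* objects still allocated from this slab */
};

struct toy_cache {
    bool locked;             /* stand-in for a real lock */
    struct toy_slab *slabs;  /* all slabs of the cache */
};

static void toy_lock(struct toy_cache *c)
{
    assert(!c->locked);
    c->locked = true;
}

static void toy_unlock(struct toy_cache *c)
{
    assert(c->locked);
    c->locked = false;
}

/* Stand-in for releasing memory to the VM; checks the lock is not held. */
static void toy_backend_free(struct toy_cache *c, struct toy_slab *slab)
{
    assert(!c->locked);
    free(slab);
}

/* Helper for building a cache: add a slab with nr_used live objects. */
bool toy_cache_add_slab(struct toy_cache *c, unsigned nr_used)
{
    struct toy_slab *slab = malloc(sizeof(*slab));

    if (slab == NULL)
        return false;

    slab->nr_used = nr_used;
    toy_lock(c);
    slab->next = c->slabs;
    c->slabs = slab;
    toy_unlock(c);
    return true;
}

/* Returns the number of slabs released. */
size_t toy_cache_reap(struct toy_cache *c)
{
    struct toy_slab *dead = NULL, **link, *slab;
    size_t nr_reaped = 0;

    /* Loop 1: under the lock, move free slabs to a private list. */
    toy_lock(c);
    link = &c->slabs;
    while ((slab = *link) != NULL) {
        if (slab->nr_used == 0) {
            *link = slab->next;
            slab->next = dead;
            dead = slab;
        } else {
            link = &slab->next;
        }
    }
    toy_unlock(c);

    /* Loop 2: lock dropped, release the collected slabs one by one. */
    while (dead != NULL) {
        slab = dead;
        dead = slab->next;
        toy_backend_free(c, slab);
        nr_reaped++;
    }

    return nr_reaped;
}
```

Keeping the release outside the locked section shortens the time the lock is held, and it means a backend that itself needs to allocate (as vm_map_delete does in the zone_gc discussion above) is not called while the cache is locked.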

IRC, freenode, #hurd, 2011-04-23

<mcsim> I've looked through some allocators and all of them use different
  per cpu cache policy. AFAIK gnuhurd doesn't support multiprocessing, but
  still multiprocessing must be kept in mind. So, what do you think what
  kind of cpu caches is better? As for me I like variant with only per-cpu
  caches (like in slqb).
<antrik> mcsim: well, have you looked at the allocator braunr wrote
  himself? :-)
<antrik> I'm not sure I suggested that explicitly to you; but probably it
  makes most sense to use that in gnumach

IRC, freenode, #hurd, 2011-04-24

<mcsim> antrik: Yes, I have. He uses both global and per cpu caches. But he
  also suggested to look through slqb, where there are only per cpu
<braunr> i don't remember slqb in detail
<braunr> what do you mean by "only per-cpu caches" ?
<braunr> a whole slab system for each cpu ?
<mcsim> I mean that there are no global queues in caches, but there are
  special queues for each cpu.
<mcsim> I've just started investigating slqb's code, but I've read an
  article on lwn about it. And I've read that it is used for zen kernel.
<braunr> zen ?
<mcsim> Here is this article
<mcsim> Yes, this is linux kernel with some patches which haven't been
  approved to torvald's tree
<braunr> i see
<braunr> well it looks nice
<braunr> but as for slub, the problem i can see is cross-CPU freeing
<braunr> and I think nick piggins mentions it
<braunr> piggin*
<braunr> this means that sometimes, objects are "burst-free" from one cpu
  cache to another
<braunr> which has the same bad effects as in most other allocators, mainly
<mcsim> There is a special list for freeing object allocated for another
<mcsim> And garbage collector frees such object on his own
<braunr> so what's your question ?
<mcsim> It is described in the end of article.
<mcsim> What cpu-cache policy do you think is better to implement?
<braunr> at this point, any
<braunr> and even if we had a kernel that perfectly supports
  multiprocessor, I wouldn't care much now
<braunr> it's very hard to evaluate such allocators
<braunr> slqb looks nice, but if you have the same amount of fragmentation
  per slab as other allocators do (which is likely), you have that amount of
  fragmentation multiplied by the number of processors
<braunr> whereas having shared queues limits the problem somehow
<braunr> having shared queues means you have a bit more contention
<braunr> so, as is the case most of the time, it's a tradeoff
<braunr> by the way, does pigging say why he "doesn't like" slub ? :)
<braunr> piggin*
<mcsim> here he describes what slqb is better.
<braunr> well it doesn't describe why slub is worse
<mcsim> but not very particularly 
<braunr> except for order-0 allocations
<braunr> and that's a form of fragmentation like i mentioned above
<braunr> in mach those problems have very different impacts
<braunr> the backend memory isn't physical, it's the kernel virtual space
<braunr> so the kernel allocator can request chunks of higher than order-0
<braunr> physical pages are allocated one at a time, then mapped in the
  kernel space
<mcsim> Doesn't order of page depend on buffer size?
<braunr> it does
<mcsim> And why does gnumach allocates higher than order-0 pages more?
<braunr> why more ?
<braunr> i didn't say more
<mcsim> And why in mach those problems have very different impact?
<braunr> ?
<braunr> i've just explained why :)
<braunr> 09:37 < braunr> physical pages are allocated one at a time, then
  mapped in the kernel space
<braunr> "one at a time" means order-0 pages, even if you allocate higher
  than order-0 chunks
<mcsim> And in Linux they allocated more than one at time because of
  prefetching page reading?
<braunr> do you understand what virtual memory is ?
<braunr> linux allocators allocate "physical memory"
<braunr> mach kernel allocator allocates "virtual memory"
<braunr> so even if you allocate a big chunk of virtual memory, it's backed
  by order-0 physical pages
<mcsim> yes, I understand this
<braunr> you don't seem to :/
<braunr> the problem of higher than order-0 page allocations is
<braunr> do you see why ?
<mcsim> yes
<braunr> so
<braunr> fragmentation in the kernel space is less likely to create issues
  than it does in physical memory
<braunr> keep in mind physical memory is almost always full because of the
  page cache
<braunr> and constantly under some pressure
<braunr> whereas the kernel space is mostly empty
<braunr> so allocating higher than order-0 pages in linux is more dangerous
  than it is in Mach or BSD
<mcsim> ok
<braunr> on the other hand, linux focuses on pure performance, and not having
  to map memory means fewer operations, fewer tlb misses, quicker allocations
<braunr> the Mach VM must map pages "one at a time", which can be expensive
<braunr> it should be adapted to handle multiple page sizes (e.g. 2 MiB) so
  that many allocations can be made with few mappings
<braunr> but that's not easy
<braunr> as always: tradeoffs
<mcsim> There are other benefits of physical allocating. In big DMA
  transfers can be needed few continuous physical pages. How does mach
  handles such cases?
<braunr> gnumach does that awfully
<braunr> it just reserves the whole DMA-able memory and uses special
  allocation functions on it, IIRC
<braunr> but kernels which have a Mach VM-like memory system such as BSDs
  have cleaner methods
<braunr> NetBSD provides a function to allocate contiguous physical memory
<braunr> with many constraints
<braunr> FreeBSD uses a binary buddy system like Linux
<braunr> the fact that the kernel allocator uses virtual memory doesn't
  mean the kernel has no means to allocate contiguous physical memory ...

IRC, freenode, #hurd, 2011-05-02

<braunr> hm nice, my allocator uses less memory than glibc (squeeze
  version) on both 32 and 64 bits systems
<braunr> the new per-cpu layer is proving effective
<neal> braunr: Are you reimplementing malloc?
<braunr> no
<braunr> it's still the slab allocator for mach, but tested in userspace
<braunr> so i wrote malloc wrappers
<neal> Oh.
<braunr> i try to heavily test most of my code in userspace now
<neal> it's easier :-)
<neal> I agree
<braunr> even the physical memory allocator has been implemented this way
<neal> is this your mach version?
<braunr> virtual memory allocation will follow
<neal> or are you working on gnu mach?
<braunr> for now it's my version
<braunr> but i intend to spend the summer working on ipc port names

rework gnumach IPC spaces.

<braunr> and integrate the result in gnu mach
<neal> are you keeping the same user-space API?
<neal> Or are you experimenting with something new?
<antrik> braunr: to be fair, it's not terribly hard to use less memory than
  glibc :-)
<braunr> yes
<braunr> antrik: well ptmalloc3 received some nice improvements
<braunr> neal: the goal is to rework some of the internals only
<braunr> neal: namely, i simply intend to replace the splay tree with a
  radix tree
<antrik> braunr: the glibc allocator is emphasising performance, unlike some
  other allocators that trade some performance for much better memory
<antrik> ptmalloc3?
<braunr> that's the allocator used in glibc
<antrik> OK. haven't seen any recent numbers... the comparison I have in
  mind is many years old...
<braunr> i also made some additions to my avl and red-black trees this
  weekend, which finally make them suitable for almost all generic uses
<braunr> the red-black tree could be used in e.g. gnu mach to augment the
  linked list used in vm maps
<braunr> which is what's done in most modern systems
<braunr> it could also be used to drop the overloaded (and probably over
  imbalanced) page cache hash table

gnumach vm map red-black trees.

IRC, freenode, #hurd, 2011-05-03

<mcsim> antrik: How should I start porting? Have I just include rbraun's
  allocator to gnumach and make it compile?
<antrik> mcsim: well, basically yes I guess... but you will have to look at
  the code in question first before we know anything more specific :-)
<antrik> I guess braunr might know better how to start, but he doesn't
  appear to be here :-(
<braunr> mcsim: you can't just put my code into gnu mach and make it run,
  it really requires a few careful changes
<braunr> mcsim: you will have to analyse how the current zone allocator
  interacts with regard to locking
<braunr> if it is used in interrupt handlers
<braunr> what kind of locks it should use instead of the pthread stuff
  available in userspace
<braunr> you will have to change the reclaiming policy, so that caches are
  reaped on demand
<braunr> (this basically boils down to calling the new reclaiming function
  instead of zone_gc())
<braunr> you must be careful about types too
<braunr> there is work to be done ;)
<braunr> (not to mention the obvious about replacing all the calls to the
  zone allocator, and testing/debugging afterwards)

IRC, freenode, #hurd, 2011-07-14

<braunr> can you make your patch available ?
<mcsim> it is available in gnumach repository at savannah 
<mcsim> tree mplaneta/libbraunr/master
<braunr> mcsim: i'll test your branch
<mcsim> ok. I'll give you a link in a minute
<braunr> hm why balloc ?
<mcsim> Braun's allocator
<braunr> err
<braunr> mcsim: this is the interface i had in mind for a kernel version :)
<braunr> very similar to the original slab allocator interface actually
<braunr> well, you've been working
<mcsim> But I have a problem with this patch. When I apply it to gnumach
  code from debian repository. I have to make a change in file ramdisk.c
  with sed -i 's/kernel_map/\&kernel_map/' device/ramdisk.c
<mcsim> because in git repository there is no such file
<braunr> mcsim: how do you configure the kernel before building ?
<braunr> mcsim: you should keep in touch more often i think, so that you
  get feedback from us and don't spend too much time "off course"
<mcsim> I didn't configure it. I just run dpkg-buildsource -b.
<braunr> oh you build the debian package
<braunr> well my version was by configure --enable-kdb --enable-rtl8139
<braunr> and it seems stuck in an infinite loop during bootstrap
<mcsim> and printf doesn't work. The first function called by c_boot_entry
  is printf(version).
<braunr> mcsim: also, you're invited to get the x15mach version of my
  files, which are gplv2+ licensed
<braunr> be careful of my macros.h file, it can conflict with the
  macros_help.h file from gnumach iirc
<mcsim> There were conflicts with MACRO_BEGIN and MACRO_END. But I solved
<braunr> ok
<braunr> it's tricky
<braunr> mcsim: try to find where the first use of the allocator is made

IRC, freenode, #hurd, 2011-07-22

<mcsim> braunr, hello. Kernel with your allocator already compiles and
  runs. There still some problems, but, certainly, I'm on the final stage
  already. I hope I'll finish in a few days.
<tschwinge> mcsim: Oh, cool!  Have you done some measurements already?
<mcsim> Not yet
<tschwinge> OK.
<tschwinge> But if it able to run a GNU/Hurd system, then that already is
  something, a big milestone!
<braunr> nice
<braunr> although you'll probably need to tweak the garbage collecting
<mcsim> tschwinge: thanks
<mcsim> braunr: As back-end for allocating memory I use
  kmem_alloc_wired. But in zalloc was an opportunity to use as back-end
  kmem_alloc_pageable. Although there was no any zone that used
  kmem_alloc_pageable. Do I need to implement this functionality?
<braunr> mcsim: do *not* use kmem_alloc_pageable()
<mcsim> braunr: Ok. This is even better)
<braunr> mcsim: in x15, i've taken this even further: there is *no* kernel
  vm object, which means all kernel memory is wired and unmanaged
<braunr> making it fast and safe
<braunr> pageable kernel memory was useful back when RAM was really scarce
<braunr> 20 years ago
<braunr> but it's a source of deadlock
<mcsim> Indeed. I'll won't use kmem_alloc_pageable.

IRC, freenode, #hurd, 2011-08-09

< braunr> mcsim: what's the "bug related to MEM_CF_VERIFY" you refer to in
  one of your commits ?
< braunr> mcsim: don't use spin_lock_t as a member of another structure
< mcsim> braunr: I got confused with types in *_verify functions, so they
  didn't work. Then I fixed it in the commit you mentioned.
< braunr> in gnumach, most types are actually structure pointers
< braunr> use simple_lock_data_t
< braunr> mcsim: ok
< mcsim> > use simple_lock_data_t
< mcsim> braunr: ok
< braunr> mcsim: don't make too many changes to the code base, and if
  you're unsure, don't hesitate to ask
< braunr> also, i really insist you rename the allocator, as done in x15
  for example
  (;f=vm/kmem.c), instead of
  a name based on mine :/
< mcsim> braunr: Ok. It was just work name. When I finish I'll rename the
< braunr> other than that, it's nice to see progress
< braunr> although again, it would be better with some reports along
< braunr> i won't be present at the meeting tomorrow unfortunately, but you
  should use those to report the status of your work
< mcsim> braunr: You've said that I have to tweak gc process. Did you mean
  to call mem_gc() when physical memory ends instead of calling it every x
  seconds? Or something else?
< braunr> there are multiple topics, although only one that really matters
< braunr> study how zone_gc was called
< braunr> reclaiming memory should happen when there is pressure on the VM
< braunr> but it shouldn't happen too often, otherwise there is thrashing
< braunr> and your caches become mostly useless
< braunr> the original slab allocator uses a 15-second period after a
  reclaim during which reclaiming has no effect
< braunr> this allows having a somehow stable working set for this duration
< braunr> the linux slab allocator uses 5 seconds, but has a more
  complicated reclaiming mechanism
< braunr> it releases memory gradually, and from reclaimable caches only
  (dentry for example)
< braunr> for x15 i intend to implement the original 15 second interval and
  then perform full reclaims
< mcsim> In zalloc mem_gc is called by vm_pageout_scan, but not more often
  than once a second.
< mcsim> In balloc I've changed interval to once in 15 seconds.
< braunr> don't use the code as it is
< braunr> the version you've based your work on was meant for userspace
< braunr> where there isn't memory pressure
< braunr> so a timer is used to trigger reclaims at regular intervals
< braunr> it's different in a kernel
< braunr> mcsim: where did you see vm_pageout_scan call the zone gc once a
  second ?
< mcsim> vm_pageout_scan calls consider_zone_gc and consider_zone_gc checks
  if a second has passed.
< braunr> where ?
< mcsim> Then zone_gc can be called.
< braunr> ah ok, it's in zaclloc.c then
< braunr> zalloc.c
< braunr> yes this function is fine
< mcsim> so old gc didn't consider vm pressure. Or I missed something.
< braunr> it did
< mcsim> how?
< braunr> well, it's called by the pageout daemon
< braunr> under memory pressure
< braunr> so it's fine
< mcsim> so if mem_gc is called by pageout daemon is it fine?
< braunr> it must be changed to do something similar to what
  consider_zone_gc does
< mcsim> It does. mem_gc does the same work as consider_zone_gc and
< braunr> good
< mcsim> so gc process is fine?
< braunr> should be
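
The reclaim policy discussed above (called by the pageout path under memory pressure, but rate limited) can be sketched as follows. This is a minimal illustration modelled on the description of consider_zone_gc given in the conversation, not a copy of it: `consider_gc()`, `do_gc()`, the tick-based clock and the interval value are assumptions made for the example.

```c
#include <stdbool.h>

/* Hypothetical interval, in clock ticks (15 seconds with a 100 Hz tick). */
#define GC_INTERVAL_TICKS 1500UL

static unsigned long gc_last_tick;  /* tick of the last reclaim */
static unsigned long gc_runs;       /* number of reclaims performed */

/* Stand-in for the real reclaiming function. */
static void do_gc(void)
{
    gc_runs++;
}

/*
 * Meant to be called by the pageout path when memory is under pressure.
 * Reclaiming too often would empty the caches before they are of any use,
 * so requests made within the interval are ignored.
 */
static bool consider_gc(unsigned long now)
{
    if (now - gc_last_tick < GC_INTERVAL_TICKS)
        return false;

    gc_last_tick = now;
    do_gc();
    return true;
}
```

The unsigned subtraction keeps the comparison valid when the tick counter wraps around.
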
< braunr> i see mem.c only includes mem.h, which then includes other
< braunr> don't do that
< braunr> always include all the headers you need where you need them
< braunr> if you need avltree.h in both mem.c and mem.h, include it in both
< braunr> and by the way, i recommend you use the red black tree instead of
  the avl type
< braunr> (it's the same interface so it shouldn't take long)
< mcsim> As to report. If you won't be present at the meeting, I can tell
  you what I have to do now.
< braunr> sure
< braunr> in addition, use GPLv2 as the license, the BSD one is meant for
  the userspace version only
< braunr> GPLv2+ actually
< braunr> hm you don't need list.c
< braunr> it would only add dead code
< braunr> "Zone for dynamical allocator", don't mix terms
< braunr> this comment refers to a vm_map, so call it a map
< mcsim> 1. Change constructor for kentry_alloc_cache.
< mcsim> 2. Make measurements.
< mcsim> +
< mcsim> 3. Use simple_lock_data_t
< mcsim> 4. Replace license
< braunr> kentry_alloc_cache <= what is that ?
< braunr> cache for kernel map entries in vm_map ?
< braunr> the comment for mem_cpu_pool_get doesn't apply in gnumach, as
  there is no kernel preemption


< braunr> "Don't attempt mem GC more frequently than hz/MEM_GC_INTERVAL
  times a second."
< mcsim> sorry. I meant vm_map_kentry_cache
< braunr> hm nothing actually about this comment
< braunr> mcsim: ok
< braunr> yes kernel map entries need special handling
< braunr> i don't know how it's done in gnumach though
< braunr> static preallocation ?
< mcsim> yes
< braunr> that's ugly :p
< mcsim> but it uses dynamic allocation further even for vm_map kernel
< braunr> although such bootstrapping issues are generally difficult to
  solve elegantly
< braunr> ah
< mcsim> now I use only static allocation, but I'll add dynamic allocation
< braunr> when you have time, mind the coding style (convert everything to
  gnumach style, which mostly implies using tabs instead of 4-spaces)
< braunr> when you'll work on dynamic allocation for the kernel map
  entries, you may want to review how it's done in x15
< braunr> the mem_source type was originally intended for that purpose, but
  has slightly changed once the allocator was adapted to work in my kernel
< mcsim> ok
< braunr> vm_map_kentry_zone is the only zone created with ZONE_FIXED
< braunr> and it is zcram()'ed immediately after
< braunr> so you can consider it a statically allocated zone
< braunr> in x15 i use another strategy: there is a special kernel submap
  named kentry_map which contains only one map entry (statically allocated)
< braunr> this map is the backend (mem_source) for the kentry_cache
< braunr> the kentry_cache is created with a special flag that tells it
  memory can't be reclaimed
< braunr> when the cache needs to grow, the single map entry is extended to
  cover the allocated memory
< braunr> it's similar to the way pmap_growkernel() works for kernel page
  table pages
< braunr> (and is actually based on that idea)
< braunr> it's a compromise between full static and dynamic allocation
< braunr> the advantage is that the allocator code can be used (so there is
  no need for a special allocator like in netbsd)
< braunr> the drawback is that some resources can never be returned to
  their source (and under peaks, the amount of unfreeable resources could
  become large, but this is unexpected)
< braunr> mcsim: for now you shouldn't waste your time with this
< braunr> i see the number of kernel map entries is fixed at 256
< braunr> and i've never seen the kernel use more than around 30 entries
< mcsim> Do you think that I have to leave this problem for the end?
< braunr> yes

IRC, freenode, #hurd, 2011-08-11

< mcsim> braunr: Hello. Can you give me an advice how can I make
  measurements better?
< braunr> mcsim: what kind of measurements
< mcsim> braunr: How much is your allocator better than zalloc.
< braunr> slightly :p
< braunr> that's why i never took the time to put it in gnumach
< mcsim> braunr: Just I thought that there are some rules or
  recommendations of such measurements. Or I can do them any way I want?
< braunr> mcsim: i don't know
< braunr> mcsim: benchmarking is an art of its own, and i don't even know
  how to use the bits of profiling code available in gnumach (if it still
< antrik> mcsim: hm... are you saying you already have a running system
  with slab allocator?... :-)
< braunr> mcsim: the main advantage i can see is the removal of many
  arbitrary hard limits
< mcsim> antrik: yes
< antrik> \o/
< antrik> nice work!
< braunr> :)
< braunr> the cpu layer should also help a bit, but it's hard to measure
< braunr> i guess it could be seen on the ipc path for very small buffers
< mcsim> antrik: Thanks. But I still have to 1. Change constructor for
  kentry_alloc_cache. and 2. Make measurements.
< braunr> and polish the whole thing :p
< antrik> mcsim: I'm not sure this can be measured... the performance
  difference in any real life usage is probably just a few percent at most
  -- it's hard to construct a benchmark giving enough precision so it's not
  drowned in noise...
< antrik> perhaps it conserves some memory -- but that too would be hard to
  measure I fear
< braunr> yes
< braunr> there *should* be better allocation times, less fragmentation,
  better accounting ... :)
< braunr> and no arbitrary limits !
< antrik> :-)
< braunr> oh, and the self debugging features can be nice too
< mcsim> But I need to prove that my work wasn't useless
< braunr> well it wasn't, but that's hard to measure
< braunr> it's easy to prove though, since there are additional features
  that weren't present in the zone allocator
< mcsim> Ok. If there are some profiling features in gnumach can you give
  me a link with their description?
< braunr> mcsim: sorry, no
< braunr> mcsim: you could still write the basic loop test, which counts
  the number of allocations performed in a fixed time interval
< braunr> but as it doesn't match many real life patterns, it won't be very
< braunr> and i'm afraid that if you consider real life patterns, you'll
  see how negligible the improvement can be compared to other operations
  such as memory copies or I/O (ouch)
< mcsim> Do network drivers use this allocator?
< mcsim> ok. I'll scrape up some tests and then I'll report results.

IRC, freenode, #hurd, 2011-08-26

< mcsim> hello. Are there any analogs of copy_to_user and copy_from_user in
  linux for gnumach?
< mcsim> Or how can I determine memory map if I know address? I need this
  for vm_map_copyin
< guillem> mcsim: vm_map_lookup_entry?
< mcsim> guillem: but I need to transmit map to this function and it will
  return an entry which contains specified address.
< mcsim> And I don't know what map have I transmit.
< mcsim> I need to transfer static array from kernel to user. What map
  contains static data?
< antrik> mcsim: Mach doesn't have copy_{from,to}_user -- instead, large
  chunks of data are transferred as out-of-line data in IPC messages
  (i.e. using VM magic)
< mcsim> antrik: can you give me an example? I just found using
  vm_map_copyin in host_zone_info.
< antrik> no idea what vm_map_copyin is to be honest...

IRC, freenode, #hurd, 2011-08-27

< braunr> mcsim: the primitives are named copyin/copyout, and they are used
  for messages with inline data
< braunr> or copyinmsg/copyoutmsg
< braunr> vm_map_copyin/out should be used for chunks larger than a page
  (or roughly a page)
< braunr> also, when writing to a task space, see which is better suited:
  vm_map_copyout or vm_map_copy_overwrite
< mcsim> braunr: and what will be src_map for vm_map_copyin/out?
< braunr> the caller map
< braunr> which you can get with current_map() iirc
< mcsim> braunr: thank you
< braunr> be careful not to leak anything in the transferred buffers
< braunr> memset() to 0 if in doubt
< mcsim> braunr:ok
< braunr> antrik: vm_map_copyin() is roughly vm_read()
< antrik> braunr: what is it used for?
< braunr> antrik: 01:11 < antrik> mcsim: Mach doesn't have
  copy_{from,to}_user -- instead, large chunks of data are transferred as
  out-of-line data in IPC messages (i.e. using VM magic)
< braunr> antrik: that "VM magic" is partly implemented using vm_map_copy*
< antrik> braunr: oh, you mean it doesn't actually copy data, but only page
  table entries? if so, that's *not* really comparable to

IRC, freenode, #hurd, 2011-08-28

< braunr> antrik: the equivalent of copy_{from,to}_user are
< braunr> antrik: but when the data size is about a page or more, it's
  better not to copy, of course
< antrik> braunr: it's actually not clear at all that it's really better to
  do VM magic than to copy...

IRC, freenode, #hurd, 2011-08-29

< braunr> antrik: at least, that used to be the general idea, and with a
  simpler VM i suspect it's still true
< braunr> mcsim: did you progress on your host_zone_info replacement ?
< braunr> mcsim: i think you should stick to what the original
  implementation did
< braunr> which is making an inline copy if caller provided enough space,
  using kmem_alloc_pageable otherwise
< braunr> specify ipc_kernel_map if using kmem_alloc_pageable
< mcsim> braunr: yes. And it works. But I use kmem_alloc, not pageable. Is
  it worse?
< mcsim> braunr: host_zone_info replacement is pushed to savannah
< braunr> mcsim: i'll have a look
< mcsim> braunr: I've pushed one more commit just now, which relates to
  host_zone_info.
< braunr> mem_alloc_early_init should be renamed mem_bootstrap
< mcsim> ok
< braunr> mcsim: i don't understand your call to kmem_free
< mcsim> braunr: It shouldn't be there?
< braunr> why should it be there ?
< braunr> you're freeing what the copy object references
< braunr> it's strange that it even works
< braunr> also, you shouldn't pass infop directly as the copy object
< braunr> i guess you get a warning for that
< braunr> do what the original code does: use an intermediate copy object
  and a cast
< mcsim> ok
< braunr> another error (without consequence but still, you should mind it)
< braunr> simple_lock(&mem_cache_list_lock);
< braunr> [...]
< braunr> kr = kmem_alloc(ipc_kernel_map, &info, info_size);
< braunr> you can't hold simple locks while allocating memory
< braunr> read how the original implementation works around this
< mcsim> ok
< braunr> i guess host_zone_info assumes the zone list doesn't change much
  while unlocked
< braunr> or that it's rather unimportant since it's for debugging
< braunr> a strict snapshot isn't required
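
The workaround hinted at here (count under the lock, drop it, allocate, then take it again to fill the buffer) can be sketched as follows. This is an illustration under assumptions, not the host_zone_info code: the list, the atomic_flag spin lock, `snapshot_buf_sizes()` and the use of malloc() in place of kmem_alloc() are all made up for the example.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a simple lock and the cache list. */
struct cache {
    struct cache *next;
    size_t buf_size;
};

static atomic_flag list_lock = ATOMIC_FLAG_INIT;
static struct cache *cache_list;

static void list_lock_acquire(void)
{
    while (atomic_flag_test_and_set_explicit(&list_lock,
                                             memory_order_acquire))
        ;   /* spin */
}

static void list_lock_release(void)
{
    atomic_flag_clear_explicit(&list_lock, memory_order_release);
}

/* Returns a malloc()ed array of buffer sizes and stores its length in *nr. */
static size_t *snapshot_buf_sizes(size_t *nr)
{
    struct cache *cache;
    size_t max = 0, i = 0, *info;

    /* Pass 1: only count the entries while holding the lock. */
    list_lock_acquire();
    for (cache = cache_list; cache != NULL; cache = cache->next)
        max++;
    list_lock_release();

    /* Allocate with the lock dropped: the allocation may block. */
    info = malloc((max ? max : 1) * sizeof(*info));
    if (info == NULL) {
        *nr = 0;
        return NULL;
    }

    /* Pass 2: the list may have changed meanwhile, so never write more
       than max entries. The result is not a strict snapshot. */
    list_lock_acquire();
    for (cache = cache_list; cache != NULL && i < max; cache = cache->next)
        info[i++] = cache->buf_size;
    list_lock_release();

    *nr = i;
    return info;
}
```
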
< braunr> list_for_each_entry(&mem_cache_list, cache, node) max_caches++;
< braunr> you should really use two separate lines for readability
< braunr> also, instead of counting each time, you could just maintain a
  global counter
< braunr> mcsim: use strncpy instead of strcpy for the cache names
< braunr> not to avoid overflow but rather to clear the unused bytes at the
  end of the buffer
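
The point about strncpy can be illustrated with a small sketch. The field size and the `copy_cache_name()` helper are hypothetical; only the behaviour of the standard strncpy() is relied upon.

```c
#include <string.h>

#define CACHE_NAME_SIZE 16   /* hypothetical size of the name field */

/* Copy a cache name into a fixed-size field that is handed to userspace. */
static void copy_cache_name(char dest[CACHE_NAME_SIZE], const char *name)
{
    /* strncpy() pads the remainder of dest with null bytes when name is
       shorter than the field, so no stale data is left after the name. */
    strncpy(dest, name, CACHE_NAME_SIZE);

    /* strncpy() does not terminate the string when it truncates, so the
       last byte is set explicitly. */
    dest[CACHE_NAME_SIZE - 1] = '\0';
}
```

With strcpy(), the bytes following the terminating null byte would keep whatever the buffer contained before, which is how data can leak to the caller.
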
< braunr> mcsim: about kmem_alloc vs kmem_alloc_pageable, it's a minor
< braunr> you're handing off debugging data to a userspace application
< braunr> a rather dull reporting tool in most cases, which doesn't require
  wired down memory
< braunr> so in order to better use available memory, pageable memory
  should be used
< braunr> in the future i guess it could become a not-so-minor issue though
< mcsim> ok. I'll fix it
< braunr> mcsim: have you tried to run the kernel with MC_VERIFY always on
< braunr> MEM_CF_VERIFY actually
< mcsim1> yes.
< braunr> oh
< braunr> nothing wrong 
< braunr> ?
< mcsim1> it is always set
< braunr> ok
< braunr> ah, you set it in macros.h ..
< braunr> don't
< braunr> put it in mem.c if you want, or better, make it a compile-time
< braunr> macros.h is a tiny macro library, it shouldn't define such
  unrelated options
< mcsim1> ok.
< braunr> mcsim1: did you try fault injection to make sure the checking
  code actually works and how it behaves when an error occurs ?
< mcsim1> I think that when I finish I'll merge files cpu.h and macros.h
  with mem.c
< braunr> yes that would simplify things
< mcsim1> Yes. When I confused with types mem_buf_fill worked wrong and
  panic occurred.
< braunr> very good
< braunr> have you progressed concerning the measurements you wanted to do
< mcsim1> not much.
< braunr> ok
< mcsim1> I think they will be ready in a few days.
< antrik> what measurements are these?
< mcsim1> braunr: What maximal size for static data and stack in kernel?
< braunr> what do you mean ?
< braunr> kernel stacks are one page if i'm right
< braunr> static data (rodata+data+bss) are limited by grub bugs only :)
< mcsim1> braunr: probably they are present, because when I created too big
  array I couldn't boot kernel
< braunr> local variable or static ?
< mcsim1> static
< braunr> how large ?
< mcsim1> 4Mb
< braunr> hm
< braunr> it's not a grub bug then
< braunr> i was able to embed as much as 32 MiB in x15 while doing this
  kind of tests
< braunr> I guess it's the gnu mach boot code which only preallocates one
  page for the initial kernel mapping
< braunr> one PTP (page table page) maps 4 MiB
< braunr> (x15 does this completely dynamically, unlike mach or even
  current BSDs)
< mcsim1> antrik: First I want to measure time of each cache
  creation/allocation/deallocation and then compile kernel.
< braunr> cache creation is irrelevant
< braunr> because of the cpu pools in the new allocator, you should test at
  least two different allocation patterns
< braunr> one with quick allocs/frees
< braunr> the other with large numbers of allocs then their matching frees
< braunr> (larger being at least 100)
< braunr> i'd say the cpu pool layer is the real advantage over the
  previous zone allocator
< braunr> (from a performance perspective)
< mcsim1> But there is only one cpu
< braunr> it doesn't matter
< braunr> it's still a very effective cache
< braunr> in addition to reducing contention
< braunr> compare mem_cpu_pool_pop() against mem_cache_alloc_from_slab()
< braunr> mcsim1: work is needed to polish the whole thing, but getting it
  actually working is a nice achievement for someone new on the project
< braunr> i hope it helped you learn about memory allocation, virtual
  memory, gnu mach and the hurd in general :)
< antrik> indeed :-)

IRC, freenode, #hurd, 2011-09-06

[some performance testing]
<braunr> i'm not sure such long tests are relevant but let's assume balloc
  is slower
<braunr> some tuning is needed here
<braunr> first, we can see that slab allocation occurs more often in balloc
  than page allocation does in zalloc
<braunr> so yes, as slab allocation is slower (have you measured which part
  actually is slow ? i guess it's the kmem_alloc call)
<braunr> the whole process gets a bit slower too
<mcsim> I used alloc_size = 4096 for zalloc
<braunr> i don't know what that is exactly
<braunr> but you can't hold 500 16 bytes buffers in a page so zalloc must
  have had free pages around for that
<mcsim> I use kmem_alloc_wired
<braunr> if you have time, measure it, so that we know how much it accounts
<braunr> where are the results for dealloc ?
<mcsim> I can't give you result right now because internet works very
  bad. But for first DEALLOC result are the same, except some cases when it
  takes balloc for more than 1000 ticks
<braunr> must be the transfer from the cpu layer to the slab layer
<mcsim> as to kmem_alloc_wired. I think zalloc uses this function too for
  allocating objects in zone I test.
<braunr> mcsim: yes, but less frequently, which is why it's faster
<braunr> mcsim: another very important aspect that should be measured is
  memory consumption, have you looked into that ?
<mcsim> I think that I made too few iterations in test SMALL
<mcsim> If I increase constant SMALL_TESTS will it be good enough?
<braunr> mcsim: i don't know, try both :)
<braunr> if you increase the number of iterations, balloc average time will
  be lower than zalloc, but this doesn't remove the first long
  initialization step on the allocated slab
<mcsim> SMALL_TESTS to 500, I mean
<braunr> i wonder if maintaining the slabs sorted through insertion sort is
  what makes it slow
<mcsim> braunr: where do you sort slabs? I don't see this.
<braunr> mcsim: mem_cache_alloc_from_slab and its free counterpart
<braunr> mcsim: the mem_source stuff is useless in gnumach, you can remove
  it and directly call the kmem_alloc/free functions
<mcsim> But I have to make special allocator for kernel map entries.
<braunr> ah right
<mcsim> btw. It turned out that 256 entries are not enough.
<braunr> that's weird
<braunr> i'll make a patch so that the mem_source code looks more like what
  i have in x15 then
<braunr> about the results, i don't think the slab layer is that slow
<braunr> it's the cpu_pool_fill/drain functions that take time
<braunr> they preallocate many objects (64 for your objects size if i'm
  right) at once
<braunr> mcsim: look at the first result page: some times, a number around
  8000 is printed
<braunr> the common time (ticks, whatever) for a single object is 120
<braunr> 8132/120 is 67, close enough to the 64 value
<mcsim> I forgot about SMALL tests here are they: (balloc)
<mcsim> braunr: why do you divide 8132 by 120?
<braunr> mcsim: to see if it matches my assumption that the ~8000 number
  matches the cpu_pool_fill call
<mcsim> braunr: I've got it
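
The reasoning above (most allocations are cheap, but roughly one in 64 pays for a bulk transfer) can be sketched as follows. This is a simplified userspace illustration, not the actual mem_cpu_pool_fill() code: the structure, the sizes other than the 64-object transfer mentioned in the discussion, and the use of malloc() as the slab layer are assumptions.

```c
#include <stddef.h>
#include <stdlib.h>

#define POOL_SIZE     128   /* hypothetical capacity of the per-CPU array */
#define TRANSFER_SIZE 64    /* objects moved per fill, as in the discussion */

struct cpu_pool {
    size_t nr_objs;
    void *array[POOL_SIZE];
};

static unsigned long nr_fills;  /* how often the slow path was taken */

/* Slow path: stand-in for taking TRANSFER_SIZE objects from the slab layer. */
static int cpu_pool_fill(struct cpu_pool *pool, size_t obj_size)
{
    size_t i;

    for (i = 0; i < TRANSFER_SIZE; i++) {
        void *obj = malloc(obj_size);

        if (obj == NULL)
            break;

        pool->array[pool->nr_objs++] = obj;
    }

    nr_fills++;
    return (i > 0) ? 0 : -1;
}

/* Fast path: pop from the array; refill in bulk only when it is empty. */
static void *pool_alloc(struct cpu_pool *pool, size_t obj_size)
{
    if (pool->nr_objs == 0 && cpu_pool_fill(pool, obj_size) != 0)
        return NULL;

    return pool->array[--pool->nr_objs];
}
```

With this structure, an allocation that triggers a fill costs about TRANSFER_SIZE times a single slab allocation, which is consistent with the 8132/120 estimate, while the allocations served from the array only touch the pool.
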
<braunr> mcsim: i'd be much interested in the dealloc results if you can
  paste them too
<mcsim> dealloc:
<braunr> mcsim: thanks
<mcsim> second dealloc:
<braunr> mcsim: so the main conclusion i retain from your tests is that the
  transfers between the cpu and the slab layers are what makes the new
  allocator a bit slower
<mcsim> OPERATION_SMALL dealloc:
<braunr> mcsim: what needs to be measured now is global memory usage
<mcsim> braunr: data from /proc/vmstat after kernel compilation will be
<braunr> mcsim: let me check
<braunr> mcsim: no it won't do, you need to measure kernel memory usage
<braunr> the best moment to measure it is right after zone_gc is called
<mcsim> Are there any facilities in gnumach for memory measurement?
<braunr> it's specific to the allocators
<braunr> just count the number of used pages
<braunr> after garbage collection, there should be no free page, so this
  should be rather simple
<mcsim> ok
<mcsim> braunr: When I measure memory usage in balloc, what formula is
  better cache->nr_slabs * cache->bufs_per_slab * cache->buf_size or
  cache->nr_slabs * cache->slab_size?
<braunr> the latter
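
The difference between the two formulas mcsim asks about is the per-slab overhead (descriptor, padding, unused tail). A minimal sketch of both, assuming a simplified cache descriptor: `struct cache_stats` and the two helper functions are hypothetical and only borrow the field names used in the discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified cache descriptor; only the fields named in the
 * discussion are kept. */
struct cache_stats {
    size_t nr_slabs;      /* slabs currently owned by the cache */
    size_t bufs_per_slab; /* objects that fit in one slab */
    size_t buf_size;      /* size of one object, in bytes */
    size_t slab_size;     /* size of one slab, in bytes */
};

/* First formula: bytes occupied by object buffers only. */
static size_t cache_object_bytes(const struct cache_stats *c)
{
    return c->nr_slabs * c->bufs_per_slab * c->buf_size;
}

/* Second formula: bytes actually taken from the VM system. */
static size_t cache_footprint_bytes(const struct cache_stats *c)
{
    /* The buffers of one slab can never exceed the slab itself. */
    assert(c->bufs_per_slab * c->buf_size <= c->slab_size);
    return c->nr_slabs * c->slab_size;
}
```

With, say, 4096-byte slabs each holding 31 buffers of 128 bytes, the first formula leaves out 128 bytes per slab, which is why the second one is the better fit for page-level accounting.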

IRC, freenode, #hurd, 2011-09-07

<mcsim> braunr: I've disabled calling of mem_cpu_pool_fill and allocator
  became faster
<braunr> mcsim: sounds nice
<braunr> mcsim: i suspect the free path might not be as fast though
<mcsim> results for first calling: second: and with many alloc/free:
<braunr> mcsim: thanks
<mcsim> best result are for second call: average time decreased from 159.56
  to 118.756
<mcsim> First call slightly worse, but this is because I've added some
  profiling code
<braunr> i still see some ~8k lines in 128639
<braunr> even some around ~12k
<mcsim> I think this is because of mem_cache_grow I'm investigating it now
<braunr> i guess so too
<mcsim> I've measured time for first call in cache and from about 22000
  mem_cache_grow takes 20000
<braunr> how did you change the code so that it doesn't call
  mem_cpu_pool_fill ?
<braunr> is the cpu layer still used ?
<braunr> don't forget the free path
<braunr> mcsim: anyway, even with the previous slightly slower behaviour we
  could observe, the performance hit is negligible
<mcsim> Is free path a compilation? (I'm sorry for my english)
<braunr> mcsim: mem_cache_free
<braunr> mcsim: the last two measurements i'd advise are with big (>4k)
  object sizes and, really, kernel allocator consumption
<mcsim> (first, second, small)
<braunr> mcsim: these numbers are closer to the zalloc ones, aren't they ?
<mcsim> deallocating slightly faster too
<braunr> it may not be the case with larger objects, because of the use of
  a tree
<mcsim> yes, they are closer
<braunr> but then, i expect some space gains
<braunr> the whole thing is about compromise
<mcsim> ok. I'll try to measure them today. Anyway I'll post result and you
  could read them in the morning
<braunr> at least, it shows that the zone allocator was actually quite good
<braunr> i don't like how the code looks, there are various hacks here and
  there, it lacks self inspection features, but it's quite good
<braunr> and there was little room for true improvement in this area, like
  i told you :)
<braunr> (my allocator, like the current x15 dev branch, focuses on mp
<braunr> mcsim: thanks again for these numbers
<braunr> i wouldn't have had the courage to make the tests myself before
  some time eh
<mcsim> braunr: hello. Look at the small_4096 results (balloc)
<braunr> mcsim: wow, what's that ? :)
<braunr> mcsim: you should really really include your test parameters in
  the report
<braunr> like object size, purpose, and other similar details
<mcsim> for balloc I specified only object_size = 4096
<mcsim> for zalloc object_size = 4096, alloc_size = 4096, memtype = 0;
<braunr> the results are weird
<braunr> apart from the very strange numbers (e.g. 0 or 4429543648), none
  is around 3k, which is the value matching a kmem_alloc call
<braunr> happy to see balloc behaves quite good for this size too
<braunr> s/good/well/
<mcsim> Oh
<mcsim> here is significant only first 101 lines
<mcsim> I'm sorry
<braunr> ok
<braunr> what does the test do again ? 10 loops of 10 allocs/frees ?
<mcsim> yes
<braunr> ok, so the only slowdown is at the beginning, when the slabs are
<braunr> the two big numbers (31844 and 19548) are strange
<mcsim> on the other hand time of compilation is 
<mcsim> balloc      zalloc
<mcsim> 38m28.290s  38m58.400s
<mcsim> 38m38.240s  38m42.140s
<mcsim> 38m30.410s  38m52.920s
<braunr> what are you compiling ?
<mcsim> gnumach kernel
<braunr> in 40 mins ?
<mcsim> yes
<braunr> you lack hvm i guess
<mcsim> is it long?
<mcsim> I use real PC
<braunr> very
<braunr> ok
<braunr> so it's normal
<mcsim> in vm it was about 2 hours)
<braunr> the difference really is negligible
<braunr> ok i can explain the big numbers
<braunr> the slab size depends on the object size, and for 4k, it is 32k
<braunr> you can store 8 4k buffers in a slab (lines 2 to 9)
<mcsim> so we need use kmem_alloc_* 8 times?
<braunr> on line 10, the ninth object is allocated, which adds another slab
  to the cache, hence the big number
<braunr> no, once for a size of 32k
<braunr> and then the free list is initialized, which means accessing those
  pages, which means tlb misses
<braunr> i guess the zone allocator already has free pages available
<mcsim> I see
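
The arithmetic behind the "ninth object" remark can be sketched as follows. This assumes the 32k slab and 4k object sizes quoted above, a slab descriptor kept outside the slab, and no frees between the allocations; the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SLAB_SIZE (32 * 1024) /* slab size quoted for 4k objects */
#define OBJ_SIZE  (4 * 1024)

/* Objects per slab, assuming the slab descriptor is stored off-slab. */
static size_t objs_per_slab(void)
{
    return SLAB_SIZE / OBJ_SIZE;
}

/* Return 1 if the nth allocation (1-based) from an initially empty cache,
 * with no intervening frees, has to add a slab to the cache. */
static int alloc_grows_cache(size_t nth)
{
    assert(nth >= 1);
    return (nth - 1) % objs_per_slab() == 0;
}
```

Under these assumptions the first and ninth allocations each pay for a slab creation (one 32k request to the backend plus free list initialization), while allocations two to eight are served from the existing slab.
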
<braunr> i think you can stop performance measurements, they show the
  allocator is slightly slower, but so slightly we don't care about that
<braunr> we need numbers on memory usage now (at the page level)
<braunr> and this isn't easy
<mcsim> For balloc I can get numbers if I summarize nr_slabs*slab_size for
  each cache, isn't it?
<braunr> yes
<braunr> you can have a look at the original implementation, function
<mcsim> And for zalloc I have to summarize of cur_size and then add
<braunr> i don't know :/
<braunr> i think the best moment to obtain accurate values is after zone_gc
  removes the collected pages
<braunr> for both allocators, you could fill a stats structure at that
  moment, and have an rpc copy that structure when a client tool requests
<braunr> concerning your tests, there is another point to have in mind
<braunr> the very first loop in your code shows a result of 31844
<braunr> although you disabled the call to cpu_pool_fill
<braunr> but the reason why it's so long is that the cpu layer still exists
<braunr> and if you look carefully, the cpu pools are created as needed on
  the free path
<mcsim> I removed cpu_pool_drain
<braunr> but not cpu_pool_push/pop i guess
<braunr> see, you still allocate the cpu pool array on the free path
<mcsim> but I don't fill it
<braunr> that's not the point
<braunr> it uses mem_cache_alloc
<braunr> so in a call to free, you can also have an allocation, that can
  potentially create a new slab
<mcsim> I see, so I have to create cpu_pool at the initialization stage?
<braunr> no, you can't
<braunr> there is a reason why they're allocated on the free path
<braunr> but since you don't have the fill/drain functions, i wonder if you
  should just comment out the whole cpu layer code
<braunr> but hmm
<braunr> no really, it's not worth the effort
<braunr> even with drains/fills, the results are really good enough
<braunr> it makes the allocator smp ready
<braunr> we should just keep it that way
<braunr> mcsim: fyi, the reason why cpu pool arrays are allocated on the
  free path is to avoid recursion
<braunr> because cpu pool arrays are allocated from caches just as almost
  everything else
<mcsim> ok
<mcsim> sum of cur_size and then adding zalloc_wasted_space gives 0x4e1954
<mcsim> but this value isn't even page aligned
<mcsim> For balloc I've got 0x4c6000 0x4aa000 0x48d000
<braunr> hm can you report them in decimal, >> 10 so that values are in KiB
<mcsim> 4888 4776 4660 for balloc
<mcsim> 4998 for zalloc
<braunr> when ?
<braunr> after boot ?
<mcsim> boot, compile, zone_gc
<mcsim> and then measure
<braunr> ?
<mcsim> I call garbage collector before measuring
<mcsim> and I measure after kernel compilation
<braunr> i thought it took you 40 minutes
<mcsim> for balloc I got results at night
<braunr> oh so you already got them
<braunr> i can't believe the kernel only consumes 5 MiB
<mcsim> before gc it takes about 9052 KiB
<braunr> can i see the measurement code ?
<braunr> oh, and how much ram does your machine have ?
<mcsim> 758 mb
<mcsim> 768
<braunr> that's really weird
<braunr> i'd expect the kernel to consume much more space
<mcsim> it's only dynamically allocated data
<braunr> yes
<braunr> ipc ports, rights, vm map entries, vm objects, and lots of other
  hanging buffers
<braunr> about how much is zalloc_wasted_space ?
<braunr> if it's small or constant, i guess you could ignore it
<mcsim> about 492
<mcsim> KiB
<braunr> well it's another good point, mach internal structures don't imply
  much overhead
<braunr> or, the zone allocator is underused
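
The conversion from the raw hexadecimal byte counts to the KiB figures quoted above, and the page alignment remark about the zalloc value, can be checked with a few lines of C. This is only a sketch: the 4 KiB page size is an assumption (it matches i386), and the helper names are not taken from gnumach.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumption: 4 KiB pages, as on i386 */

/* Convert a byte count to KiB, as requested with ">> 10". */
static uint32_t bytes_to_kib(uint32_t bytes)
{
    return bytes >> 10;
}

/* Return 1 if the byte count is a whole number of pages. */
static int is_page_aligned(uint32_t bytes)
{
    return (bytes & (SKETCH_PAGE_SIZE - 1)) == 0;
}

/* Re-derive the figures quoted in the log from the raw values. */
static void check_reported_values(void)
{
    assert(bytes_to_kib(0x4c6000) == 4888); /* balloc, first run */
    assert(bytes_to_kib(0x4aa000) == 4776); /* balloc, second run */
    assert(bytes_to_kib(0x48d000) == 4660); /* balloc, third run */
    assert(bytes_to_kib(0x4e1954) == 4998); /* zalloc */
    assert(is_page_aligned(0x4c6000));
    assert(!is_page_aligned(0x4e1954));     /* low 12 bits are 0x954 */
}
```

The balloc totals are whole numbers of pages because they are sums of slab sizes, whereas the zalloc total mixes cur_size with zalloc_wasted_space and ends in 0x954.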

<tschwinge> mcsim, braunr: The memory allocator project is coming along
  good, as I get from your IRC messages?
<braunr> tschwinge: yes, but as expected, improvements are minor
<tschwinge> But at the very least it's now well-known, maintainable code.
<braunr> yes, it's readable, easier to understand, provides self inspection
  and is smp ready
<braunr> there also are less hacks, but a few less features (there are no
  way to avoid sleeping so it's unusable - and unused - in interrupt
<braunr> is* no way
<braunr> tschwinge: mcsim did a good job porting and measuring it

IRC, freenode, #hurd, 2011-09-08

<antrik> braunr: note that the zalloc map used to be limited to 8 MiB or
  something like that a couple of years ago... so it doesn't seem
  surprising that the kernel uses "only" 5 MiB :-)
<antrik> (yes, we had a *lot* of zalloc panics back then...)

IRC, freenode, #hurd, 2011-09-14

<mcsim> braunr: hello. I've written a constructor for kernel map entries
  and it can return resources to their source. Can you have a look at it?
  If all be OK I'll push it tomorrow.
<braunr> mcsim: send the patch through mail please, i'll apply it on my
<braunr> are you sure the cache is reapable ?
<mcsim> All slabs, except first I allocate with kmem_alloc_wired.
<braunr> how can you be sure ?
<mcsim> First slab I allocate during bootstrap and use pmap_steal_memory
  and further I use only kmem_alloc_wired
<braunr> no, you use kmem_free
<braunr> in kentry_dealloc_cache()
<braunr> which probably creates a recursion
<braunr> using the constructor this way isn't a good idea
<braunr> constructors are good for preconstructed state (set counters to 0,
  init lists and locks, that kind of things, not allocating memory)
<braunr> i don't think you should try to make this special cache reapable
<braunr> mcsim: keep in mind constructors are applied on buffers at *slab*
  creation, not at object allocation
<braunr> so if you allocate a single slab with, say, 50 or 100 objects per
  slab, kmem_alloc_wired would be called that number of times
<mcsim> why kentry_dealloc_cache can create recursion? kentry_dealloc_cache
  is called only by mem_cache_reap.
<braunr> right
<braunr> but are you totally sure mem_cache_reap() can't be called by
  kmem_free() ?
<braunr> i think you're right, it probably can't
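
The point about constructors running at slab creation rather than at object allocation can be illustrated with a toy slab. Everything below (`toy_slab`, `counting_ctor`, the use of `malloc` as backend) is a hypothetical stand-in, not the gnumach or x15 code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void (*ctor_fn)(void *buf);

/* Toy slab: a contiguous array of buffers handed out in order. */
struct toy_slab {
    char *mem;
    size_t nr_bufs;
    size_t buf_size;
    size_t next_free;
};

/* Create a slab and apply the constructor to every buffer right away. */
static int toy_slab_create(struct toy_slab *slab, size_t nr_bufs,
                           size_t buf_size, ctor_fn ctor)
{
    size_t i;

    assert(nr_bufs > 0 && buf_size > 0);
    slab->mem = malloc(nr_bufs * buf_size);
    if (slab->mem == NULL)
        return -1;
    slab->nr_bufs = nr_bufs;
    slab->buf_size = buf_size;
    slab->next_free = 0;
    for (i = 0; i < nr_bufs; i++)
        if (ctor != NULL)
            ctor(slab->mem + i * buf_size);
    return 0;
}

/* Hand out an already constructed buffer; the constructor is not rerun. */
static void *toy_slab_alloc(struct toy_slab *slab)
{
    void *buf;

    if (slab->next_free == slab->nr_bufs)
        return NULL;
    buf = slab->mem + slab->next_free * slab->buf_size;
    slab->next_free++;
    return buf;
}

static void toy_slab_destroy(struct toy_slab *slab)
{
    free(slab->mem);
}

/* Example constructor that only counts how often it is invoked. */
static unsigned long ctor_calls;

static void counting_ctor(void *buf)
{
    (void)buf;
    ctor_calls++;
}
```

Creating one 50-object slab runs the constructor 50 times before any object is requested, so a constructor that itself allocates memory pays that cost per buffer, per slab.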

IRC, freenode, #hurd, 2011-09-25

<mcsim> braunr: hello. I rewrote constructor for kernel entries and seems
  that it works fine. I think that this was last milestone. Only moving of
  memory allocator sources to more appropriate place and merge with main
  branch left.
<braunr> mcsim: it needs renaming and reindenting too
<mcsim> for reindenting C-x h Tab in emacs will be enough?
<braunr> mcsim: make sure which style must be used first
<mcsim> and what should I rename and where better to place allocator? For
  example, there is no lib directory, like in x15. Should I create it and
  move list.* and rbtree.* to lib/ or move these files to util/ or
  something else?
<braunr> mcsim: i told you balloc isn't a good name before, use something
  more meaningful (kmem is already used in gnumach unfortunately if i'm
<braunr> you can put the support files in kern/
<mcsim> what about vm_alloc?
<braunr> you should prefix it with vm_
<braunr> shouldn't
<braunr> it's a top level allocator
<braunr> on top of the vm system
<braunr> maybe mcache
<braunr> hm no
<braunr> maybe just km_
<mcsim> kern/km_alloc.*?
<braunr> no
<braunr> just km
<mcsim> ok.

IRC, freenode, #hurd, 2011-09-27

<mcsim> braunr: hello. When I've tried to speed of new allocator and bad
  I've removed function mem_cpu_pool_fill. But you've said to undo this. I
  don't understand why this function is necessary. Can you explain it,
<mcsim> When I've tried to compare speed of new allocator and old*
<braunr> i'm not sure i said that
<braunr> i said the performance overhead is negligible
<braunr> so it's better to leave the cpu pool layer in place, as it almost
  doesn't hurt
<braunr> you can implement the KMEM_CF_NO_CPU_POOL I added in the x15 mach
<braunr> so that cpu pools aren't used by default, but the code is present
  in case smp is implemented
<mcsim> I didn't remove cpu pool layer. I've just removed filling of cpu
  pool during creation of slab.
<braunr> how do you fill the cpu pools then ?
<mcsim> If object is freed then it is added to cpu pool
<braunr> so you don't fill/drain the pools ?
<braunr> you try to get/put an object and if it fails you directly fall
  back to the slab layer ?
<mcsim> I drain them during garbage collection
<braunr> oh
<mcsim> yes
<braunr> you shouldn't touch the cpu layer during gc
<braunr> the number of objects should be small enough so that we don't care
<mcsim> ok. I can drain cpu pool at any other time if it is prohibited to
  in mem_gc.
<mcsim> But why do we need to fill cpu pool during slab creation?
<mcsim> In this case allocation consist of: get object from slab -> put it
  to cpu pool -> get it from cpu pool
<mcsim> I've just removed the last two stages
<braunr> hm cpu pools aren't filled at slab creation
<braunr> they're filled when they're empty, and drained when they're full
<braunr> so that the number of objects they contain is increased/reduced to
  a value suitable for the next allocations/frees
<braunr> the idea is to fall back as little as possible to the slab layer
  because it requires the acquisition of the cache lock
<mcsim> oh. You're right. I'm really sorry. The point is that if cpu pool
  is empty we don't need to fill it first
<braunr> uh, yes we do :)
<mcsim> Why cache locking is so undesirable? If we have free objects in
  slabs locking will not take a lot of time.
<braunr> mcsim: it's undesirable on a smp system
<mcsim> ok.
<braunr> mcsim: and spin locks are normally noops on a up system
<braunr> which is the case in gnumach, hence the slightly better
  performances without the cpu layer
<braunr> but i designed this allocator for x15, which only supports mp
  systems :)
<braunr> mcsim: sorry i couldn't look at your code, sick first, busy with
  server migration now (new server almost ready for xen hurds :))
<mcsim> ok.
<mcsim> I ended with allocator if didn't miss anything important:)
<braunr> i'll have a look soon i hope :)
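
The fill-when-empty, drain-when-full behaviour braunr describes can be sketched with a toy model in which objects are just counts. The capacity and transfer size are made-up small values (the log mentions 64 for small objects), and `lock_acquisitions` stands in for taking the cache lock; none of these names come from the real allocator.

```c
#include <assert.h>

#define POOL_CAPACITY 8 /* hypothetical per-CPU pool capacity */
#define TRANSFER_SIZE 4 /* hypothetical objects moved per fill or drain */

/* Shared slab layer, reduced to a counter of free objects. */
struct toy_cache {
    int slab_free;
    unsigned long lock_acquisitions; /* times the cache lock was needed */
};

/* Per-CPU pool, reduced to the number of objects it holds. */
struct toy_cpu_pool {
    int nr_objs;
};

/* Take up to `want` objects from the slab layer in one locked operation. */
static int slab_take(struct toy_cache *cache, int want)
{
    int n = want < cache->slab_free ? want : cache->slab_free;

    cache->lock_acquisitions++;
    cache->slab_free -= n;
    return n;
}

/* Return `n` objects to the slab layer in one locked operation. */
static void slab_give(struct toy_cache *cache, int n)
{
    cache->lock_acquisitions++;
    cache->slab_free += n;
}

/* Allocate one object; fill the pool in bulk only when it is empty. */
static int pool_alloc(struct toy_cpu_pool *pool, struct toy_cache *cache)
{
    if (pool->nr_objs == 0)
        pool->nr_objs = slab_take(cache, TRANSFER_SIZE);
    if (pool->nr_objs == 0)
        return -1; /* slab layer exhausted as well */
    pool->nr_objs--;
    return 0;
}

/* Free one object; drain the pool in bulk only when it is full. */
static void pool_free(struct toy_cpu_pool *pool, struct toy_cache *cache)
{
    assert(pool->nr_objs <= POOL_CAPACITY);
    if (pool->nr_objs == POOL_CAPACITY) {
        slab_give(cache, TRANSFER_SIZE);
        pool->nr_objs -= TRANSFER_SIZE;
    }
    pool->nr_objs++;
}
```

In this model, 12 allocations followed by 9 frees reach the shared layer 4 times instead of 21, which is the benefit on an SMP system; on a uniprocessor, where the lock costs nothing, only the extra bookkeeping remains.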

IRC, freenode, #hurd, 2011-09-27

<antrik> braunr: would it be realistic/useful to check during GC whether
  all "used" objects are actually in a CPU pool, and if so, destroy them so
  the slab can be freed?...
<antrik> mcsim: BTW, did you ever do any measurements of memory
<mcsim> antrik: I couldn't do this for zalloc
<antrik> oh... why not?
<antrik> (BTW, I would be interested in a comparison between using the CPU
  layer, and bare slab allocation without CPU layer)
<mcsim> Result I've got were strange. It wasn't even aligned to page size.
<mcsim> Probably is it better to look into /proc/vmstat?
<mcsim> Because I put hooks in the code and probably I missed something
<antrik> mcsim: I doubt vmstat would give enough information to make any
  useful comparison...
<braunr> antrik: isn't this draining cpu pools at gc time ?
<braunr> antrik: the cpu layer was found to add a slight overhead compared
  to always falling back to the slab layer
<antrik> braunr: my idea is only to drop entries from the CPU cache if they
  actually prevent slabs from being freed... if other objects in the slab
  are really in use, there is no point in flushing them from the CPU cache
<antrik> braunr: I meant comparing the fragmentation with/without CPU
  layer. the difference in CPU usage is probably negligible anyways...
<antrik> you might remember that I was (and still am) sceptical about CPU
  layer, as I suspect it worsens the good fragmentation properties of the
  pure slab allocator -- but it would be nice to actually check this :-)
<braunr> antrik: right
<braunr> antrik: the more i think about it, the more i consider slqb to be
  a better solution ...... :>
<braunr> an idea for when there's time
<braunr> eh
<antrik> hehe :-)

IRC, freenode, #hurd, 2011-10-13

<braunr> mcsim: what's the current state of your gnumach branch ?
<mcsim> I've merged it with master in September
<braunr> yes i've seen that, but does it build and run fine ?
<mcsim> I've tested it on gnumach from debian repository, but for building
  I had to make additional change in device/ramdisk.c, as I mentioned.
<braunr> mcsim: why ?
<mcsim> And it runs fine for me.
<braunr> mcsim: why did you need to make other changes ?
<mcsim> because there is a patch which comes with from-debian-repository
  kernel and it adds some code, where I have to make changes. Earlier
  kernel_map was a pointer to structure, but I change that and now
  kernel_map is structure. So handling to it should be by taking the
  address (&kernel_map)
<braunr> why did you do that ?
<braunr> or put it another way: what made you do that type change on
  kernel_map ?
<mcsim> Earlier memory for kernel_map was allocating with zalloc. But now
  salloc can't allocate memory before its initialisation
<braunr> that's not a good reason
<braunr> a simple workaround for your problem is this :
<braunr> static struct vm_map kernel_map_store;
<braunr> vm_map_t kernel_map = &kernel_map_store;
<mcsim> braunr: Ok. I'll correct this.
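
The workaround keeps `kernel_map` a pointer, so existing callers do not need to change to `&kernel_map`. A small sketch of the idea follows; `struct vm_map` is reduced to a hypothetical one-field stand-in and `toy_map_reference` is an invented caller, so only the two declarations in the middle mirror what braunr wrote.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real structure. */
struct vm_map {
    int ref_count;
};
typedef struct vm_map *vm_map_t;

/* Storage is static, so no allocator is needed before it is initialised;
 * the exported symbol keeps its pointer type. */
static struct vm_map kernel_map_store;
vm_map_t kernel_map = &kernel_map_store;

/* Invented caller: takes a vm_map_t exactly as it would have before. */
static void toy_map_reference(vm_map_t map)
{
    assert(map != NULL);
    map->ref_count++;
}
```

Code such as the Debian ramdisk patch mentioned above could then keep passing `kernel_map` unchanged.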

IRC, freenode, #hurd, 2011-11-01

<braunr> etenil: but mcsim's work is, for one, useful because the allocator
  code is much clearer, adds some debugging support, and is smp-ready

IRC, freenode, #hurd, 2011-11-14

<braunr> i've just realized that replacing the zone allocator removes most
  (if not all) static limit on allocated objects
<braunr> as we have nothing similar to rlimits, this means kernel resources
  are actually exhaustible
<braunr> and i'm not sure every allocation is cleanly handled in case of
  memory shortage
<braunr> youpi: antrik: tschwinge: is this acceptable anyway ?
<braunr> (although IMO, it's also a good thing to get rid of those limits
  that made the kernel panic for no valid reason)
<youpi> there are actually not many static limits on allocated objects
<youpi> only a few have one
<braunr> those defined in kern/mach_param.h
<youpi> most of them are not actually enforced
<braunr> ah ?
<braunr> they are used at zinit() time
<braunr> i thought they were
<youpi> yes,  but most zones are actually fine with overcoming the max
<braunr> ok
<youpi> see zone->max_size += (zone->max_size >> 1);
<youpi> you need both !EXHAUSTIBLE and FIXED
<braunr> ok
<pinotree> making having rlimits enforced would be nice...
<pinotree> s/making//
<braunr> pinotree: the kernel wouldn't handle many standard rlimits anyway
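
The line youpi quotes raises a zone's limit by half each time it is hit. A small sketch of what that implies is below; it models only the quoted expression, not the conditions under which zalloc applies it, and the function names are invented.

```c
#include <assert.h>
#include <stddef.h>

/* Raise a limit by half, as in "max_size += (max_size >> 1)". */
static size_t zone_grow_max(size_t max_size)
{
    return max_size + (max_size >> 1);
}

/* Number of such raises needed before `needed` bytes fit under the limit.
 * Overflow is ignored in this sketch. */
static unsigned growth_steps(size_t max_size, size_t needed)
{
    unsigned steps = 0;

    assert(max_size >= 2); /* below 2, max_size >> 1 is 0 and nothing grows */
    while (max_size < needed) {
        max_size = zone_grow_max(max_size);
        steps++;
    }
    return steps;
}
```

Because each raise multiplies the limit by 1.5, a zone that is allowed to grow this way doubles its limit after two raises, so such a limit behaves more like a hint than a cap.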

<braunr> i've just committed my final patch on mcsim's branch, which will
  serve as the starting point for integration
<braunr> which means code in this branch won't change (or only last minute
<braunr> you're invited to test it
<braunr> there shouldn't be any noticeable difference with the master
<braunr> a bit less fragmentation
<braunr> more memory can be reclaimed by the VM system
<braunr> there are debugging features
<braunr> it's SMP ready
<braunr> and overall cleaner than the zone allocator
<braunr> although a bit slower on the free path (because of what's
  performed to reduce fragmentation)
<braunr> but even "slower" here is completely negligible

IRC, freenode, #hurd, 2011-11-15

<mcsim> I enabled cpu_pool layer and kentry cache exhausted at "apt-get
  source gnumach && (cd gnumach-* && dpkg-buildpackage)"
<mcsim> I mean kernel with your last commit
<mcsim> braunr: I'll make patch how I've done it in a few minutes, ok? It
  will be more specific.
<braunr> mcsim: did you just remove the #if NCPUS > 1 directives ?
<mcsim> no. I replaced macro NCPUS > 1 with SLAB_LAYER, which equals NCPUS
  > 1, then I redefined macro SLAB_LAYER
<braunr> ah, you want to make the layer optional, even on UP machines
<braunr> mcsim: can you give me the commands you used to trigger the
  problem ?
<mcsim> apt-get source gnumach && (cd gnumach-* && dpkg-buildpackage)
<braunr> mcsim: how much ram & swap ?
<braunr> let's see if it can handle a quite large aptitude upgrade
<mcsim> how can I check swap size?
<braunr> free
<braunr> cat /proc/meminfo
<braunr> top
<braunr> whatever
<mcsim>              total       used       free     shared    buffers
<mcsim> Mem:        786368     332296     454072          0          0
<mcsim> -/+ buffers/cache:     332296     454072
<mcsim> Swap:      1533948          0    1533948
<braunr> ok, i got the problem too
<mcsim> braunr: do you run hurd in qemu?
<braunr> yes
<braunr> i guess the cpu layer increases fragmentation a bit
<braunr> which means more map entries are needed
<braunr> hm, something's not right
<braunr> there are only 26 kernel map entries when i get the panic
<braunr> i wonder why the cache gets that stressed
<braunr> hm, reproducing the kentry exhaustion problem takes quite some
<mcsim> braunr: what do you mean?
<braunr> sometimes, dpkg-buildpackage finishes without triggering the
<mcsim> the problem is in apt-get source gnumach
<braunr> i guess the problem happens because of drains/fills, which
  allocate/free much more object than actually preallocated at boot time
<braunr> ah ?
<braunr> ok
<braunr> i've never had it at that point, only later
<braunr> i'm unable to trigger it currently, eh
<mcsim> do you use *-dbg kernel?
<braunr> yes
<braunr> well, i use the compiled kernel, with the slab allocator, built
  with the in kernel debugger
<mcsim> when you run apt-get source gnumach, you run it in clean directory?
  Or there are already present downloaded archives?
<braunr> completely empty
<braunr> ah just got it
<braunr> ok the limit is reached, as expected
<braunr> i'll just bump it
<braunr> the cpu layer drains/fills allocate several objects at once (64 if
  the size is small enough)
<braunr> the limit of 256 (actually 252 since the slab descriptor is
  embedded in its slab) is then easily reached
<antrik> mcsim: most direct way to check swap usage is vmstat
<braunr> damn, i can't live without slabtop and the amount of
  active/inactive cache memory any more
<braunr> hm, weird, we have active/inactive memory in procfs, but not
  buffers/cached memory
<braunr> we could set buffers to 0 and everything as cached memory, since
  we're currently unable to communicate the purpose of cached memory
  (whether it's used by disk servers or file system servers)
<braunr> mcsim: looks like there are about 240 kernel map entries (i forgot
  about the ones used in kernel submaps)
<braunr> so yes, adding the cpu layer is what makes the kernel reach the
  limit more easily
<mcsim> braunr: so just increasing limit will solve the problem?
<braunr> mcsim: yes
<braunr> slab reclaiming looks very stable
<braunr> and unfrequent
<braunr> (which is surprising)
<pinotree> braunr: "unfrequent"?
<braunr> pinotree: there isn't much memory pressure
<braunr> slab_collect() gets called once a minute on my hurd
<braunr> or is it infrequent ?
<braunr> :)
<pinotree> i have no idea :)
<braunr> infrequent, yes

IRC, freenode, #hurd, 2011-11-16

<braunr> for those who want to play with the slab branch of gnumach, the
  slabinfo tool is available at
<braunr> for those merely interested in numbers, here is the output of
  slabinfo, for a hurd running in kvm with 512 MiB of RAM, an unused swap,
  and a short usage history (gnumach debian packages built, aptitude
  upgrade for a dozen of packages, a few git commands)
<antrik> braunr: numbers for a long usage history would be much more
  interesting :-)

IRC, freenode, #hurd, 2011-11-17

<braunr> antrik: they'll come :)
<etenil> is something going on on darnassus? it's mighty slow
<braunr> yes
<braunr> i've rebooted it to run a modified kernel (with the slab
  allocator) and i'm building stuff on it to stress it
<braunr> (i don't have any other available machine with that amount of
  available physical memory)
<etenil> ok
<antrik> braunr: probably would be actually more interesting to test under
  memory pressure...
<antrik> guess that doesn't make much of a difference for the kernel object
  allocator though
<braunr> antrik: if ram is larger, there can be more objects stored in
  kernel space, then, by building something large such as eglibc, memory
  pressure is created, causing caches to be reaped
<braunr> our page cache is useless because of vm_object_cached_max
<braunr> it's a stupid arbitrary limit masking the inability of the vm to
  handle pressure correctly 
<braunr> if removing it, the kernel freezes soon after ram is filled
<braunr> antrik: it may help trigger the "double swap" issue you mentioned
<antrik> what may help trigger it?
<braunr> not checking this limit
<antrik> hm... indeed I wonder whether the freezes I see might have the
  same cause

IRC, freenode, #hurd, 2011-11-19

<braunr> <= state of the slab
  allocator after building the debian libc packages and removing all files
  once done
<braunr> it's mostly the same as on any other machine, because of the
  various arbitrary limits in mach (most importantly, the max number of
  objects in the page cache)
<braunr> fragmentation is still quite low
<antrik> braunr: actually fragmentation seems to be lower than on the other
<braunr> antrik: what makes you think that ?
<antrik> the numbers of currently unused objects seem to be in a similar
  range IIRC, but more of them are reclaimable I think
<antrik> maybe I'm misremembering the other numbers
<braunr> there had been more reclaims on the other run

IRC, freenode, #hurd, 2011-11-25

<braunr> mcsim: i've just updated the slab branch, please review my last
  commit when you have time
<mcsim> braunr: Do you mean compilation/tests?
<braunr> no, just a quick glance at the code, see if it matches what you
  intended with your original patch
<mcsim> braunr: everything is ok
<braunr> good
<braunr> i think the branch is ready for integration

IRC, freenode, #hurd, 2011-12-17

<braunr> in the slab branch, there now is no use for the defines in
<braunr> should the file be removed or left empty as a placeholder for
  future arbitrary limits ?
<braunr> (i'd tend to remove it as a way of indicating we don't want
  arbitrary limits but there may be a good reason to keep it around .. :))
<youpi> I'd just drop it
<braunr> ok
<braunr> hmm maybe we do want to keep that one :
<braunr> #define IMAR_MAX        (1 << 10)       /* Max number of
  msg-accepted reqs */
<antrik> whatever that is...
<braunr> it gets returned in ipc_marequest_info
<braunr> but the mach_debug interface has never been used on the hurd
<braunr> there now is a master-slab branch in the gnumach repo, feel free
  to test it

IRC, freenode, #hurd, 2011-12-22

<youpi> braunr: does the new gnumach allocator has profiling features?
<youpi> e.g. to easily know where memory leaks reside
<braunr> youpi: you mean tracking call traces to allocated blocks ?
<youpi> not necessarily traces
<youpi> but at least means to know what kind of objects is filling memory
<braunr> it's very close to the zone allocator
<braunr> but instead of zones, there are caches
<braunr> each named after the type they store
<braunr> see
<youpi> ok, so we can know, per-type, how much memory is used
<braunr> yes
<youpi> good
<braunr> if backtraces can easily be forged, it wouldn't be hard to add
  that feature too
<youpi> does it dump such info when memory goes short?
<braunr> no but it can
<braunr> i've done this during tests
<youpi> it'd be good
<youpi> because I don't know in advance when a buildd will crash due to
  that :)
<braunr> each time slab_collect() is called for example
<youpi> I mean not on collect, but when it's too late
<youpi> and thus always enabled
<braunr> ok
<youpi> (because there's nothing better to do than at least give infos)
<braunr> you just have to define "when it's too late", and i can add that
<youpi> when there is no memory left
<braunr> you mean when the number of free pages strictly reaches 0 ?
<youpi> yes
<braunr> ok
<youpi> i.e. just before crashing the kernel
<braunr> i see

IRC, freenode, #hurdfr, 2012-01-02

<youpi> braunr: the slab allocator code, was it written from scratch ?
<youpi> there's still a carnegie mellon copyright
<youpi> (in slab_info.h at least)
<youpi> ipc_hash_global_size = 256;
<youpi> 256 should be defined as a constant in a header
<youpi> otherwise it's yet another arbitrary value hidden in the code
<youpi> same for ipc_marequest_size etc.
<braunr> youpi: yes, from scratch
<braunr> slab_info.h was originally zone_info.h
<braunr> as for the fixed values, they were already there in that form, i
  thought it was better to leave them that way to make the diffs easier to
  read
<braunr> i'll turn them into macros instead
<braunr> so mach_param.h may have to be put back
<braunr> or else put them in the ipc headers

IRC, freenode, #hurd, 2012-01-18

<braunr> does the slab branch need other reviews/reports before being
  integrated ?

IRC, freenode, #hurd, 2012-01-30

<braunr> youpi: do you have some idea about when you want to get the slab
  branch in master ?
<youpi> I was considering as soon as mcsim gets his paper
<braunr> right

IRC, freenode, #hurd, 2012-02-22

<mcsim> Do I understand correct, that real memory page should be
  necessarily in one of following lists: vm_page_queue_active,
  vm_page_queue_inactive, vm_page_queue_free?
<braunr> cached pages are
<braunr> some special pages used only by the kernel aren't
<braunr> pages can be both wired and cached (i.e. managed by the page
  cache), so that they can be passed to external applications and then
  unwired (as is the case with your host_slab_info() function if you
<braunr> use "physical" instead of "real memory"
<mcsim> braunr: thank you.

IRC, freenode, #hurd, 2012-04-22

<braunr> youpi: tschwinge: when the slab code was added, a few new files
  made it into gnumach that come from my git repo and are used in other
  projects as well
<braunr> they're licensed under BSD upstream and GPL in gnumach, and though
  it initially didn't disturb me, now it does
<braunr> i think i should fix this by leaving the original copyright and
  adding the GPL on top
<youpi> sure, submit a patch
<braunr> hm i have direct commit access if i'm right
<youpi> then fix it :)
<braunr> do you want to review ?
<youpi> I don't think there is any need to
<braunr> ok

IRC, freenode, #hurd, 2012-12-08

<mcsim> braunr: hi. Do I understand correct that merely the same technique
  is used in linux to determine the slab where, the object to be freed,
<braunr> yes but it's faster on linux since it uses a direct mapping of
  physical memory
<braunr> it just has to shift the virtual address to obtain the physical
  one, whereas x15 has to walk the page tables
<braunr> of course it only works for kmalloc, vmalloc is entirely different
<mcsim> btw, is there sense to use some kind of B-tree instead of AVL to
  decrease number of cache misses? AFAIK, in modern processors size of L1
  cache line is at least 64 bytes, so in one node we can put at least 4
  leafs (key + pointer to data) making search faster.
<braunr> that would be a b-tree
<braunr> and yes, red-black trees were actually developed based on
  properties observed on b-trees
<braunr> but increasing the size of the nodes also increases memory
<braunr> and code complexity
<braunr> that's why i have radix trees for cases where there are a large
  number of entries with keys close to each other :)
<braunr> a radix-tree is basically a b-tree using the bits of the key as
  indexes in the various arrays it walks instead of comparing keys to each
<braunr> the original avl tree used in my slab allocator was intended to
  reduce the average height of the tree (avl is better for that)
<braunr> avl trees are more suited for cases where there are more lookups
  than inserts/deletions
<braunr> they make the tree "flatter" but the maximum complexity of
  operations that change the tree is 2log2(n), since rebalancing the tree
  can make the algorithm reach back to the tree root
<braunr> red-black trees have slightly bigger heights but insertions are
  limited to 2 rotations and deletions to 3
<mcsim> there should not be many lookups in slab allocators
<braunr> which explains why they're more generally found in generic
<mcsim> or do I misunderstand something?
<braunr> well, there is a lookup for each free()
<braunr> whereas there are insertions/deletions when a slab becomes
<mcsim> I see
<braunr> so it was very efficient for caches of small objects, where slabs
  have many of them
<braunr> also, i wrote the implementation in userspace, without
  functionality pmap provides (although i could have emulated it
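
braunr's description of a radix tree above (bits of the key used as array
indexes instead of key comparisons) can be sketched as below. This is an
illustrative toy, not the x15 or gnumach implementation: the 4-bit fan-out,
the fixed 32-bit key width, and the names radix_node, radix_index,
radix_insert and radix_lookup are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define RADIX_BITS   4                      /* key bits consumed per level */
#define RADIX_SLOTS  (1u << RADIX_BITS)     /* 16 slots per node */
#define RADIX_MASK   (RADIX_SLOTS - 1)
#define RADIX_LEVELS 8                      /* 8 levels * 4 bits = 32-bit keys */

struct radix_node {
    void *slots[RADIX_SLOTS];               /* children, or values at the last level */
};

/* Slot index for this key at the given level (level 0 = most significant bits). */
static unsigned radix_index(uint32_t key, unsigned level)
{
    unsigned shift = (RADIX_LEVELS - 1 - level) * RADIX_BITS;

    return (key >> shift) & RADIX_MASK;
}

/* Insert value under key, allocating interior nodes on demand. 0 on success. */
static int radix_insert(struct radix_node *root, uint32_t key, void *value)
{
    struct radix_node *node = root;

    assert(root != NULL && value != NULL);
    for (unsigned level = 0; level < RADIX_LEVELS - 1; level++) {
        unsigned i = radix_index(key, level);

        if (node->slots[i] == NULL) {
            node->slots[i] = calloc(1, sizeof(struct radix_node));
            if (node->slots[i] == NULL)
                return -1;
        }
        node = node->slots[i];
    }
    node->slots[radix_index(key, RADIX_LEVELS - 1)] = value;
    return 0;
}

/* Walk one array per level; no key comparison is ever performed. */
static void *radix_lookup(const struct radix_node *root, uint32_t key)
{
    const struct radix_node *node = root;

    for (unsigned level = 0; level < RADIX_LEVELS - 1; level++) {
        node = node->slots[radix_index(key, level)];
        if (node == NULL)
            return NULL;
    }
    return node->slots[radix_index(key, RADIX_LEVELS - 1)];
}
```

Keys that are close to each other share most of their upper bits, so they
share most interior nodes, which is the case braunr mentions radix trees
being suited for.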

IRC, freenode, #hurd, 2013-01-06

<youpi> braunr: panic: vm_map: kentry memory exhausted
<braunr> youpi: ouch
<youpi> that's what I usually get
<braunr> ok
<braunr> the kentry area is a preallocated memory area that is used to back
  the vm_map_kentry cache
<braunr> objects from this cache are used to describe kernel virtual memory
<braunr> so in this case, i simply assume the kentry area must be enlarged
<braunr> (currently, both virtual and physical memory is preallocated, an
  improvement could be what is now done in x15, to preallocate virtual
  memory only
<braunr> )
<youpi> Mmm, why do we actually have this limit?
<braunr> the kentry area must be described by one entry
<youpi> ah, sorry, vm/vm_resident.c:       kentry_data =
<braunr> a statically allocated one
<youpi> I had missed that one
<braunr> previously, the zone allocator would do that
<braunr> the kentry area is required to avoid recursion when allocating
<braunr> another solution would be a custom allocator in vm_map, but i
  wanted to use a common cache for those objects too
<braunr> youpi: you could simply try doubling KENTRY_DATA_SIZE
<youpi> already doing that
<braunr> we might even consider a much larger size until it's reworked
<youpi> well, it's rare enough on buildds already
<youpi> doubling should be enough
<youpi> or else we have leaks
<braunr> right
<braunr> it may not be leaks though
<braunr> it may be poor map entry merging
<braunr> i'd expected the kernel map entries to be easier to merge, but it
  may simply not be the case
<braunr> (i mean, when i made my tests, it looked like there were few
  kernel map entries, but i may have missed corner cases that could cause
  more of them to be needed)

IRC, freenode, #hurd, 2014-02-11

<braunr> youpi: what's the issue with kentry_data_size ?
<youpi> I don't know
<braunr> so back to 64 pages from 256 ?
<youpi> in debian for now yes
<braunr> :/
<braunr> from what i recall with x15, grub is indeed allowed to put modules
  and command lines around as it likes
<braunr> restricted to 4G
<braunr> iirc, command lines were in the first 1M while modules could be
  loaded right after the kernel or at the end of memory, depending on the
<youpi> braunr: possibly VM_KERNEL_MAP_SIZE is then not big enough
<braunr> youpi: what's the size of the ramdisk ?
<braunr> youpi: or kmem_map too big
<braunr> we discussed this earlier with teythoon 

See also: user-space device drivers, Open Issues, System Boot, IRC, freenode, #hurd, 2011-07-27, IRC, freenode, #hurd, 2014-02-10

<braunr> or maybe we want to remove kmem_map altogether and directly use
<youpi> it's 6.2MiB big
<braunr> hm
<youpi> err no
<braunr> looks small
<youpi> 70MiB
<braunr> ok yes
<youpi> (uncompressed)
<braunr> well
<braunr> kernel_map is supposed to have 64M on i386 ...
<braunr> it's 192M large, with kmem_map taking 128M
<braunr> so at most 64M, with possible fragmentation
<teythoon> i believe the compressed initrd is stored in the ramdisk
<youpi> ah, right it's ext2fs which uncompresses it
<braunr> uncompresses it where 
<braunr> ?
<teythoon> libstore does that
<youpi> module --nounzip /boot/${gtk}initrd.gz 
<youpi> braunr: in userland memory
<youpi> it's not grub which uncompresses it for sure
<teythoon> braunr: so my ramdisk isn't 64 megs either
<braunr> which explains why it sometimes works
<teythoon> yes
<teythoon> mine is like 15 megs
<braunr> kentry_data_size calls pmap_steal_memory, an early allocation
  function which changes virtual_space_start, which is later used to create
  the first kernel map entry
<braunr> err, pmap_steal_memory is called with kentry_data_size as its
<braunr> this first kernel map entry is installed inside kernel_map and
  reduces the amount of available virtual memory there
<braunr> so yes, it all points to a layout problem
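
The mechanism described above (an early allocation that advances
virtual_space_start, so that whatever it takes is no longer available to
kernel_map) behaves like a bump allocator. The sketch below is a
simplified model, not gnumach's pmap_steal_memory: the start and end
addresses, the 192M window, and the names early_steal_memory and
kernel_map_room are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE UINT32_C(4096)            /* assumed page size */

/* Hypothetical bounds of the kernel virtual window (192M apart). */
static uint32_t virtual_space_start = UINT32_C(0xC1000000);
static uint32_t virtual_space_end   = UINT32_C(0xCD000000);

/* Round a size up to a whole number of pages. */
static uint32_t round_page(uint32_t size)
{
    return (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
}

/* Early allocation: hand out the current start, then bump it forward. */
static uint32_t early_steal_memory(uint32_t size)
{
    uint32_t addr = virtual_space_start;

    size = round_page(size);
    assert(size <= virtual_space_end - virtual_space_start);
    virtual_space_start += size;
    return addr;
}

/* Virtual space still left for the kernel map after early allocations. */
static uint32_t kernel_map_room(void)
{
    return virtual_space_end - virtual_space_start;
}
```

In this model, enlarging the kentry area (for example from 64 to 256 pages)
directly shrinks the value returned by kernel_map_room(), which matches
the layout problem discussed here.
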
<braunr> i suggest reducing kmem_map down to 64M
<youpi> that's enough to get d-i back to boot
<youpi> what would be the downside?
<youpi> (why did you raise it to 128 actually? :) )
<braunr> i merged the map used by generic kalloc allocations into kmem_map
<braunr> both were 64M
<braunr> i don't see any downside for the moment
<braunr> i rarely see more than 50M used by the slab allocator
<braunr> and with the recent code i added to collect reclaimable memory on
  kernel allocation failures, it's unlikely the slab allocator will be
<youpi> but then we need that patch too
<braunr> no
<braunr> it would be needed if kmem_map gets filled
<braunr> this very rarely happens
<youpi> is "very rarely" enough ? :)
<braunr> actually i've never seen it happen
<braunr> i added it because i had port leaks with fakeroot
<braunr> port rights are a bit special because they're stored in a table in
  kernel space
<braunr> this table is enlarged with kmem_realloc
<braunr> when an ipc space gets very large, fragmentation makes it very
  difficult to successfully resize it
<braunr> that should be the only possible issue
<braunr> actually, there is another submap that steals memory from
  kernel_map: device_io_map is 16M large
<braunr> so kernel_map gets down to 48M
<braunr> if the initial entry (that is, kentry_data_size + the physical
  page table size) gets a bit large, kernel_map may have very little
  available room
<braunr> the physical page table size obviously varies depending on the
  amount of physical memory loaded, which may explain why the installer
  worked on some machines
<youpi> well, it works up to 1855M
<youpi> at 1856 it doesn't work any more :)
<braunr> heh :)
<youpi> and that's about the max gnumach can handle anyway
<braunr> then reducing kmem_map down to 96M should be enough
<youpi> it works indeed
<braunr> could you check the amount of available space in kernel_map ?
<braunr> the value of kernel_map->size should do
<youpi> printing it at the "multiboot modules" print should be fine I guess?

IRC, freenode, #hurd, 2014-02-12

<braunr> probably
<teythoon> ?
<braunr> i expect a bit more than 160M
<braunr> (for the value of kernel_map->size)
<braunr> teythoon: ?
<youpi> well, it's 2110210048
<teythoon> what is multiboot modules printing ?
<youpi> almost last in gnumach bootup
<braunr> humm
<braunr> it must account for directly mapped physical pages
<braunr> considering the kernel has exactly 2G, this means there is 36M
  available in kernel_map
<braunr> youpi: is the ramdisk loaded at that moment ?
<youpi> what do you mean by "loaded" ? :)
<braunr> created
<youpi> where?
<braunr> allocated in kernel memory
<youpi> the script hasn't started yet
<braunr> ok
<braunr> its size was 6M+ right ?
<braunr> so it leaves around 30M
<youpi> something like this yes
<braunr> and changing kmem_map from 128M to 96M gave us 32M
<braunr> so that's it
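
The figures quoted in this discussion can be checked with a few lines of
arithmetic. The sketch below only restates numbers from the log (a 192M
kernel_map, a 16M device_io_map, a 2G kernel address space, and the
printed kernel_map->size of 2110210048); the macro and function names are
made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (UINT64_C(1) << 20)

/* Figures quoted in the discussion. */
#define KERNEL_MAP_TOTAL   (192 * MiB)
#define DEVICE_IO_MAP_SIZE (16 * MiB)
#define KERNEL_SPACE       (UINT64_C(2) << 30)   /* "the kernel has exactly 2G" */
#define REPORTED_MAP_SIZE  UINT64_C(2110210048)  /* printed kernel_map->size */

/* Room left in kernel_map once the kmem_map and device_io_map submaps are carved out. */
static uint64_t room_after_submaps(uint64_t kmem_map_size)
{
    assert(kmem_map_size + DEVICE_IO_MAP_SIZE <= KERNEL_MAP_TOTAL);
    return KERNEL_MAP_TOTAL - kmem_map_size - DEVICE_IO_MAP_SIZE;
}

/* Bytes still free, given that the reported size counts what is already mapped. */
static uint64_t free_from_reported_size(void)
{
    return KERNEL_SPACE - REPORTED_MAP_SIZE;
}
```

With kmem_map at 128M, room_after_submaps() gives 48M, the figure braunr
mentions; at 96M it gives 80M, i.e. the extra 32M. The reported size leaves
37273600 bytes free, about 35.5 MiB, which is the "36M" quoted above.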

IRC, freenode, #hurd, 2013-04-18

<braunr> oh nice, i've found a big scalability issue with my slab allocator
<braunr> it shouldn't affect gnumach much though

IRC, freenode, #hurd, 2013-04-19

<ArneBab> braunr: is it fixable?
<braunr> yes
<braunr> well, i'll do it in x15 for a start
<braunr> again, i don't think gnumach is much affected
<braunr> it's a scalability issue
<braunr> when millions of objects are in use
<braunr> gnumach rarely has more than a few hundred thousand
<braunr> it's also related to heavy multithreading/smp
<braunr> and by multithreading, i also mean preemption
<braunr> gnumach isn't preemptible and is uniprocessor
<braunr> if the resulting diff is clean enough, i'll push it to gnumach
  though :)

IRC, freenode, #hurd, 2013-04-21

<braunr> ArneBab_: i fixed the scalability problems btw

IRC, freenode, #hurd, 2013-04-20

<braunr> well, there is also a locking error in the slab allocator,
  although not a problem for a non preemptible kernel like gnumach
<braunr> non preemptible / uniprocessor