IRC, freenode, #hurd, 2012-02-14

<slpz> Open question: what do you think about dropping the memory object
  model and implementing a simple block-level cache?

memory object.

<kilobug> slpz: AFAIK the memory object has more purpose than just cache,
  it's allow used for passing chunk of data between processes, handling
  swap (which similar to cache, but still slightly different), ...
<slpz> kilobug: user processes usually make their way to data with POSIX
  operations, so memory objects are only needed for mmap'ed files
<slpz> kilobug: and swap can be replaced for an in-kernel system or even
  could still use the memory object
<braunr> slpz: memory objects are used for the page cache
<kilobug> slpz: translators (especially diskfs based) make heavy use of
  memory objects, and if "user processes" use POSIX semantics, Hurd process
  (translators, pagers, ...) shouldn't be bound to POSIX
<slpz> braunr: and page cache could be moved to a lower level, near to the
  devices
<braunr> not likely
<braunr> well, it could, but then you'd still have the file system overhead
<slpz> kilobug: but the use of memory objects it's not compulsory, you can
  easily write a fs translator without implementing memory objects at all
  (except to mmap)
<braunr> a unified buffer/VM cache as all modern systems have is probably
  the most efficient approach
<slpz> braunr: I agree. I want to look at *BSD/Linux vfs systems to seem
  how much cache policy depends on the filesystem
<slpz> braunr: Are you aware of any good papers on this matter?
<braunr> netbsd UVM, the linux virtual memory system
<braunr> both a bit old bit still relevant
<slpz> braunr: Thanks.
<slpz> the problem in our case is that having FS and cache information at
  different contexts (kernel vs. translator), I find hard to coordinate
  them.
<slpz> that's why I though about a block-level cache that GNU Mach could
  manage by itself
<slpz> I wonder how QNX deals with this
<braunr> the point of having a simple page cache is explicitely about not
  caring if those pages are blocks or files or whatever
<braunr> the kernel (at least, mach) normally has all the accounting
  information it needs to implement its cache policy
<braunr> file system translators shouldn't cache much
<braunr> the pager interface could be refined, but it looks ok to me as it
  is
<slpz> Mach has the accounting info, but it's not able to purge the cache
  without coordination with translators
<braunr> which is normal
<slpz> And this is a big problem when memory pressure increases, as it
  doesn't know for sure when memory is going to be freed
<braunr> Mach flushes its cache when it decides to, and sends back dirty
  pages if needed by the pager
<braunr> that's the case with every paging implementation
<braunr> the main difference is security with untrusted pagers
<braunr> but that's another issue
<slpz> but in a monolithic implementation, the kernel is able for force a
  chunk of cache memory to be freed without hoping for other process to do
  the job
<braunr> that's not true
<braunr> they're not process, they're threads, but the timing issue is the
  same
<braunr> see pdflush on linux
<slpz> no, it isn't.
<braunr> when memory is scarce, threads that request memory can either wait
  or immediately fail, and if they wait, they're usually woken by one of
  the vm threads once flushing is done
<slpz> a kernel thread can access all the information in the kernel, and
  synchronization is pretty easy.
<braunr> on mach, synchronization is done with messages, that's even easier
  than shared kernel locks
<slpz> with processes in different spaces, resource coordination becomes
  really difficult
<braunr> and what kind of info would an external pager need when simply
  asked to take back its dirty pages
<braunr> what resources ?
<slpz> just take a look at the thread storm problem when GNU Mach needs to
  clean a bunch of pages
<braunr> Mach is big enough to correctly account memory
<braunr> there can be thread storms on monolithic systems
<braunr> that's a Mach issue, not a microkernel issue
<braunr> that's why linux limits the number of pdflush thread instances
<slpz> Mach can account memory, but can't assure when be freed by any
  means, in a lesser degree than a monolithic system
<braunr> again i disagree
<braunr> no system can guarantee when memory will be freed with paging
<slpz> a block level cache can, for most situations
<braunr> slpz: why ?
<braunr> slpz: or how i mean ?
<slpz> braunr: with a block-level page cache, GNU Mach should be able to
  flush dirty pages directly to the underlaying device without all the
  complexity and resource cost involved in a m_o_data_return message. It
  can also throttle the rate at which pages are being cleaned, and do all
  this while blocking new page allocations to deal with memory exhaustion
  cases.
<slpz> braunr: in the current state, when cleaning dirty pages, GNU Mach
  sends a bunch on m_o_data_return to the corresponding pagers, hoping they
  will do their job as soon and as fast as possible.
<slpz> memory is not really freed, but transformed from page cache to
  anonymous memory pertaining to the corresponding translator
<slpz> and GNU Mach never knows for sure when this memory is released, if
  it ever is.
<slpz> not being able to flush dirty pages synchronously is a big problem
  when you need to throttle memory usage
<slpz> and needing allocating more memory when you're trying to free (which
  is the case for the m_o_data_return mechanism) makes the problem even
  worse
<braunr> your idea of a block level cache means in kernel block drivers
<braunr> that's not the direction we're taking
<braunr> i agree flushing should be a synchronous process, which was one of
  the proposed improvements in the thread migration papers
<braunr> (they didn't achieve it but thought about it for future works, so
  that the thread at the origin of the fault would handle it itself)
<braunr> but it should be possible to have kernel threads similar to
  pdflush and throttle flush requests too
<braunr> again, i really think it's a mach bug, and having a buffer cache
  would be stepping backward
<braunr> the real design issue is allocating memory while trying to free
  it, yes
<slpz> braunr: thread migration doesn't apply to asynchronous IPC, and the
  entire paging mechanism is implemented this way
<slpz> in fact, trying to do a synchronous m_o_data_return will trigger a
  deadlock for sure
<slpz> to achieve synchronous flushing with translators, the entire paging
  model must be redesigned
<slpz> It's true that I'm not very confident of the viability of user space
  drivers
<slpz> at least, not for every device
<slpz> I know this is against the current ideas for most ukernel designs,
  but if we want to achieve real work functionality, I think some
  sacrifices must be done. Or at least a reasonable compromise.
<braunr> slpz: thread migration for paging requests implies synchronous
  RPC, we don't care much about the IPC layer there
<braunr> and it requires large changes of the VM code in addition, yes
<braunr> let's not talk about this, we don't have thread migration anyway
  :p
<braunr> except the allocation-on-free-path issue, i really don't see how
  the current pager interface or the page cache creates problems wrt
  flushing ..
<braunr> monolithic systems also have that problem, with lower impacts
  though, but still
<slpz> braunr: because as it doesn't know when memory is really freed, 1)
  it just blindly sends a bunch of m_o_data_return to the pagers, usually
  overloading them (causing thread storms), and 2) it can't properly
  throttle new page requests to deal with resource exhaustion
<braunr> it does know when memory is really freed
<braunr> and yes, it blindly sends a bunch of requests, they can and should
  be trottled
<slpz> but dirty pages freed become indistinguishable from common anonymous
  chunks released, so it doesn't really know if page flushes are really
  working or not (i.e. doesn't know how fast a device is processing write
  requests)
<braunr> memory is freed when the pager deallocates it
<braunr> the speed of the operation is irrelevant
<braunr> no system can rely on disk speed to guarantee correct page flushes
<braunr> disk or anything else
<slpz> requests can't be throttled if Mach doesn't know when they are being
  processed
<braunr> it can easily know it
<braunr> they are processed as soon as the request is sent from the kernel
<braunr> and processing is done when the pager acknowledges the end of the
  flush
<braunr> memory backing the flushed pages should be released before
  acknowleding that to avoid starting new requests too soon
<slpz> AFAIK pagers doesn't acknowledge the end of the flush
<braunr> well that's where the interface should be refined
<slpz> Mach just sends the m_o_data_return and continues on its own
<braunr> that's why flushing should be synrhconous
<braunr> are you sure about that however ?
<slpz> so the entire paging system needs a new design... :)
<slpz> pretty sure
<braunr> not a new design ..
<braunr> there is m_o_supply_completed, i don't see how difficult it would
  be to add m_o_data_return_completed
<braunr> it's not a small change, but not a difficult one either
<braunr> i'm more worried about the allocation problem
<braunr> the default pager should probably be wired in memory
<braunr> maybe others
<slpz> let's suppose a case in which Mach needs to free memory due to an
  increase in its pressure. vm_pageout_daemon starts running, clean pages
  are freed easily, but for each dirty one a m_o_data_return in sent. 1)
  when should this daemon stop sending m_o_data_return and start waiting
  for m_o_data_return_completed? 2) what happens if the translator needs to
  read new blocks to fulfill a write request (pretty common in ext2fs)?
<braunr> it should stop after an arbitrary limit is reached
<braunr> a reasonable one
<braunr> linux limits the number of pdflush threads for that reason as i
  mentioned (to 8 iirc)
<braunr> the problem of reading blocks while flushing is what i'm worried
  about too, hence the need to wire that code
<braunr> well, i'm nto sure it's needed
<braunr> again, a reasonable about of free memory should be reserved for
  that at all times
<slpz> but the work for pdflush seems to be a lot easier, as it only deals
  directly with block devices (if I understood it correctly, I just started
  looking at it).
<braunr> i don't know how other systems compute that, but this is how they
  seem to do as well
<braunr> no, i don't think so
<slpz> well, I'll try to invest a few days understanding how pdflush work,
  to see if some ideas can be borrowed for Hurd
<braunr> iirc, freebsd has thresholds in percent for each part of its cache
  (active, inactive, free, dirty)
<slpz> but I still think simple solutions work better, and using the memory
  object for page cache is tremendously complex.
<braunr> the amount of free cache pages is generally sufficient to
  guarantee much memory can be released at once if needed, without flushing
  anything
<braunr> yes but that's the whole point of the Mach VM
<braunr> and its greatest advance ..
<slpz> what, memory objects?
<braunr> yes
<braunr> using physical memory as a cache for anything, not just block
  buffers
<slpz> memory objects work great as a way to provide a shared image of
  objects between processes, but as page cache they are an overkill (IMHO).
<slpz> or, at least, in the way we're using them
<braunr> probably
<braunr> http://lwn.net/Articles/326552/
<braunr> this can help udnerstand the problems we may have without better
  knowledge of the underlying devices, yes
<braunr> (e.g. not being able to send multiple requests to pagers that
  don't share the same disk)
<braunr> slpz: actually i'm not sure it's that overkill
<braunr> the linux vm uses struct vm_file to represent memory objects iirc
<braunr> there are many links between that structure and some vfs related
  subsystems
<braunr> when a system very actively uses the page cache, the kernel has to
  maintain a lot of objects to accurately describe the cache content
<braunr> you could consider this overkill at first too
<braunr> the mach way of doing it just implies some ipc messages instead of
  function calls, it's not that overkill for me
<braunr> the main problems are recursion (allocation while freeing,
  handling page faults in order to handle flushes, that sort of things)
<braunr> struct file and struct address_space actually
<braunr> slpz: see struct address_space, it contains a set of function
  pointers that can help understanding the linux pager interface
<braunr> they probably sufferred from similar caveats and worked around
  them, adjusting that interface on the way
<slpz> but their strategy makes them able to treat the relationship between
  the page cache and the block devices in a really simple way, almost as a
  traditional storage cache.
<slpz> meanwhile on Mach+pager scenario, the relationship between a block
  in a file and its underlying storage becomes really blurry
<slpz> this is a huge advantage when flusing out data, specially when
  resources are scarce
<slpz> I think the idea of using abstract objects for page cache, loses a
  bit the point that we just want to avoid accessing constantly to a slow
  device
<slpz> and breaking the tight relationship between the device and its
  cache, makes things a lot harder
<slpz> this also manifest itself when flushing clean pages, as things like
  having an static maximum for cached memory objects
<slpz> we shouldn't care about the number of objects, we just need to
  control the number of pages
<slpz> but as we need the pager to flush pages, we need to keep alive a lot
  of control ports to them
<mcsim> slpz: When mo_data_return is called, once the memory manager no
  longer needs supplied data, it should be deallocated using
  vm_deallocate. So this way pagers acknowledges the end of flush.

IRC, freenode, #hurd, 2013-08-26

< Spyro> Ok, so
< Spyro> idiot question:  in a nutshell, what is a memory object?
< Spyro> and how is swapping/paging handled?
< braunr> Spyro: a memory object is how the virtual memory system views a
  file
< braunr> so it's a sequence of bytes with a length
< braunr> "swapping" is just a special case of paging that applies to
  anonymous objects
< braunr> (which are named so because they're not associated with a file
  and have no name)
< Spyro> Who creates a memory object, and when?
< braunr> pagers create memory objects when needed, e.g. when you open a
  file
< Spyro> and this applies both to mmap opens as well as regular I/O opens
  as in for read() and write()?
< braunr> basically, all file systems capable of handling mmap requests
  and/or caching in physical memory are pagers
< braunr> yes
< braunr> read/write will go through the page cache when possible
< Spyro> and who owns the page cache?
< Spyro> also, who decides what pages ot evict to swap/file if physical
  memory gets tight?
< braunr> the kernel
< braunr> that's one of the things that make mach a hybrid
< Spyro> so the kernel owns the page cage?
< Spyro> ...fml
< Spyro> cache!
< braunr> yes

IRC, freenode, #hurd, 2013-08-27

< Spyro> so braunr:  So, who creates the memory object, and how does it get
  populated?
< Spyro> and how does a process accessing a file get hooked up to the
  memory object?
< braunr> Spyro: i told you, pagers create memory objects
< braunr> memory objects are how the VM system views files, so they're
  populated from the content of files
< braunr> either true files or virtual files such as in /proc
< braunr> Spyro: processes don't directly access memory objects unless
  memory mapping them with vm_map()
< braunr> pagers (basically = file systems) do
<Spyro> ok, so how is a pager/fs involved in handling a fault?

IRC, freenode, #hurd, 2013-08-28

<braunr> Spyro: each object is linked to a pager
<braunr> Spyro: when a fault occurs, the kernel looks up the VM map (kernel
  or a user one), and the address in this map, then the map entry, checks
  access and lots of other details
<Spyro> ok, so it's pager -> object -> vmem
<Spyro> ?
<braunr> Spyro: then finds the object mapped at that address (similar to
  how a file is mapped with mmap)
<braunr> from the object, it finds the pager
<Spyro> ok
<braunr> and asks the pager about the data at the appropriate offset
<Spyro> so how does a user process do normal file I/O?  is faulting just a
  special case of it?
<braunr> it's completely separate
<Spyro> eww
<braunr> normal I/O is done with message passing
<braunr> the hurd io interface
<Spyro> ok
<Spyro> so who talks to who on a file I/O?
<braunr> a client (e.g. cat) talks to a file system server (e.g. ext2fs)
<Spyro> ok so
<Spyro> it's client to the pager for regular file I/O?
<braunr> Spyro: i don't understand the question
<braunr> Spyro: it's client to server, the server might not be a pager
<Spyro> ok
<Spyro> just trying to figure out the difference between paging/faulting
  and regular I/O
<braunr> regular I/O is just message passing
<braunr> page fault handling is dealt with by pagers
<Spyro> and I have a hunch that the fs/pager is involved somehow in both,
  because the server is the source of the data
<Spyro> I'm getting a headache
<braunr> nalaginrut: a server like ext2fs is both a file server and a pager
<Spyro> oh!
<Spyro> oh btw, does a file server make use of memory objects for caching?
<braunr> Spyro: yes
<Spyro> or rather, can it?
<Spyro> does it have to?
<braunr> memory objects are for caching, and thus for page faults
<braunr> Spyro: for caching, it's a requirement
<braunr> for I/O, it's not
<braunr> you could have I/O without memory objects
<Spyro> ok
<Spyro> so how does the pager/fileserver use memory objects for caching?
<Spyro> does it just map and write to them?
<braunr> basically yes but there is a complete protocol with the kernel for
  that
<braunr>
  http://www.gnu.org/software/hurd/gnumach-doc/External-Memory-Management.html#External-Memory-Management
<Spyro> heh, lucky guess
<Spyro> ty
<Spyro> I am in way over my head here btw
<Spyro> zero experience with micro kernels in practice
<braunr> it's not trivial
<braunr> that's not a microkernel thing at all
<braunr> that's how it works in monolithic kernels too
<braunr> i recommend netbsd uvm thesis
<braunr> there are nice pictures describing the vm system
<Spyro> derrr...preacious?
<Spyro> wow
<braunr> just ignore the anonymous memory handling part which is specific
  to uvm
<Spyro> @_@
<braunr> the rest is common to practically all VM systems out there
<Spyro> I know about the linux page cache
<braunr> well it's almost the same
<Spyro> with memory objects being the same thing as files in a page cache?
<braunr> memory objects are linux "address spaces"
<braunr> and address spaces are how the linux mm views a file, yes
<Spyro> derp
<Spyro> ...
<Spyro> um...
<braunr> struvt vm_page == struct page
* Spyro first must learn what an address_space is
<braunr> struct vm_map == struct mm_struct
<braunr> struct vm_map_entry == struct vm_area_struct
* Spyro isn't a linux kernel vm expert either
<braunr> struct vm_object == struct address_space
<braunr> roughly
<braunr> details vary a lot
<Spyro> and what's an address_space ?
<braunr> 11:41 < braunr> and address spaces are how the linux mm views a
  file, yes
<Spyro> ok
<braunr> see include/linux/fs.h
<braunr> struct address_space_operations is the pager interface
* Spyro should look at the linux kernel sources perhaps, unless you have an
    easier reference
<Spyro> embarrassingly, RVR hired me as an editor for the linux-mm wiki
<Spyro> I should know this stuff
<braunr> see
  http://darnassus.sceen.net/~rbraun/design_and_implementation_of_the_uvm_virtual_memory_system.pdf
<braunr> page 42
<braunr> page 66 for another nice view
<braunr> i wouldn't recommend using linux source as refernece
<braunr> it's very complicated, filled with a lot of code dealing with
  details
<Spyro> lmao
<braunr> and linux guys have a habit of choosing crappy names
<Spyro> I was only going to 
<Spyro> stoppit
<braunr> except for "linux" and "git"
<Spyro> ...make me laugh any more and I'll need rib surgery
<braunr> laugh ?
<Spyro> complicated and crappy
<braunr> seriously, "address space" for a file is very very confusing
<Spyro> oh I agree with that
<braunr> yes, names are crappy
<braunr> and the code is very complicated
<braunr> it took me half an hour to find where readahead is done once
<braunr> and i'm still not sure it was the right code
<Spyro> so in linkern, there is an address_space for each cached file?
<braunr> takes me 30 seconds in netbsd ..
<braunr> yes
<Spyro> eww
<Spyro> yeah, BAD name
<Spyro> but thanks for the explanation
<Spyro> now I finally know what an address space is
<braunr> many linux core developers admit they don't care much about names
<Spyro> so, in hurd, a memory object is to hurd, what an address_space is
  to linux?
<braunr> yes
<braunr> notto hurd
<Spyro> ok
<braunr> to mach
<Spyro> you know what I mean
<Spyro> :P
<Spyro> easier than for linux I can tell you that much
<braunr> and the bsd vm system is a stripped version of the mach vm
<Spyro> ok
<braunr> that's why i think it's important to note it
<Spyro> good, I learned something abou tthe linux vm...from the mach guys
<Spyro> this is funny
<braunr> linux did too
<braunr> there is a paper about linux page eviction that directly borrows
  the mach algorithm and improves it
<braunr> mach is the historic motivation behind mmap on posix
<Spyro> oh nice!
<Spyro> but yes, linux picked a shitty name
<braunr> is all that clearer to you ?
<Spyro> I think that address_space connection was a magic bolt of
  understanding
<braunr> and do you see how I/O and paging are mostly unrelated ?
<Spyro> almost
<Spyro> but how does a file I/O take advantage of caching by a memory
  object?
<Spyro> does the file server just nudge the core for a hint?
<braunr> the file system copies from the memory object
* Spyro noddles
<Spyro> I think I understand a bit better now
<braunr> it's message passing
<Spyro> but I havfe too much to digest already
<braunr> memory copying
<braunr> if the memory is already there, good, if not, the kernel will ask
  the file system to bring the data
<braunr> if message passing uses zero copy, data retrieval can be deferred
  until the client actually accesses it
<Spyro> which is a fancy way of saying demand paging? :P
<braunr> it's always demand paging
<braunr> what i mean is that the file system won't fetch data as soon as it
  copies memory
<braunr> but when this data is actually needed by the client
<Spyro> uh...
<Spyro> whta's a precious page?
<braunr> let me check quickly
<braunr> If precious is FALSE, the kernel treats the data as a temporary
  and may throw it away if it hasn't been changed. If the precious value is
  TRUE, the kernel treats its copy as a data repository and promises to
  return it to the manager
<braunr> basically, it's used when you want the kernel to keep cached data
  in memory
<braunr> the cache becomes a lossless container for such pages
<braunr> the kernel may flush them, but not evict them
<Spyro> what's the difference?
<braunr> imagine a ramfs
<Spyro> point made
<braunr> ok
<Spyro> would be pretty hard to flush something that doesn't have a backing
  store
<braunr> that was quick :)
<braunr> well
<braunr> the normal backing store for anonymous memory is the default pager
<braunr> aka swap
<Spyro> eww
<braunr> but if you want your data *either* in swap or in memory and never
  in both
<braunr> it may be useful