large stores




This is -*- mode: outline -*-

* Introduction

Here is a try to describe the ext2fs patch for the Hurd.  This patch
allows using partitions/stores larger that approximately 1.5G by not
memory mapping the whole store to address space.

As a guideline, the changelog of RC1 (Release Candidate 1) is
followed, so I hope nothing is missed.  During writing of this text,
some questions arised and they are marked with XXX.  An effort will be
made to fix all these for RC2.

					Ognyan Kulev <ogi@fmi.uni-sofia.bg>

* The block layer and its purpose

The basic unit of ext2 filesystem is "block".  All filesystem
operation work on blocks which are read, and sometimes modified and
written back.  Possible block sizes are 1K, 2K and 4K, but current
implementation works reliably only on 4K blocks (= page size of i386).

So the two basic operations on blocks are "reading" block and
"writing" block.

* Current implementation

** Reading

Currently, the whole store is memory mapped into address space of
ext2fs process.  The is called "disk image", although "store image"
would be more accurate.  The address of the start of the disk image is
stored in pager.c:disk_image.  So "reading" block is easy: just
calculate byte offset of block and add it to disk_image.  The resulting
address points to the start of the desired block.

The macro ext2fs.h:bptr has exactly this purpose: given block number,
it returns pointer to block.  Sometimes we have pointer somewhere in
the block, and we want the block number.  This is calculated by
ext2fs.h:bptr_block.

There is another set of macros that use byte offsets instead of block
numbers.  These are boffs_ptr (store offset -> memory pointer) and
bptr_offs (memory pointer -> store offset).

Converting between store offset and block number is easy with macros
boffs (block -> offset) and boffs_block (offset -> block).  Other
useful macros are trunc_block and round_block.

** Writing

Modifying block and saving it is not that straight-forward as
reading.  For writing, you need to use "pokel" ("poked elements").
Pokel interface is in ext2fs.h.  Implementation is in pokel.c.

The problem is that generally multiple blocks are modified and we want
all these changes to hit disk at relatively same time.  So we can't
just change block and leave decision when it's going to be written to
the microkernel.

So there is a pokel for each set of changes and each change should be
reported to the pokel by calling pokel_add.  When this set of changes
is completed, pokel_sync of pokel_flush is called.  (The latter is
used to ignore changes.)

In practice, there is one indir_pokel for each ext2fs.h:disknode,
which is used for indirect blocks of ext2fs.  The only other pokel
used is ext2fs.h:global_pokel, where all other changes to metadata are
registered.

* Proposed implementation

First one must realize that the idea of mapping the whole store is to
be thrown away.  So only parts of the store should be mapped.  These
currently mapped parts of store are collectively called "cache".

In the proposed implementation, the cache has fixed size of
ext2fs.h:DISK_CACHE_BLOCKS.  In RC1, it's 100, but this is only to
easily catch bugs.  In practice, it can be, for example, 512M, or
(512*1024/4) blocks of 4K.  pager.c:disk_cache_size and
pager.c:disk_cache_blocks are additional variables about that
information.

The cached blocks are mapped in ext2fs.h:disk_cache and span
disk_cache_size bytes (= disk_cache_blocks blocks).  As in the
original implementation, this part of address space is handled by
custom pager.

** Data structures

Blocks in cache aren't consecutive, so we need data structure to hold
which part of address space represents what block.  This is the
purpose of pager.c:disk_cache_info.  Index in this array is "cached
block index".  But this array doesn't help in finding if specific
block is mapped, and where.  This is the purpose of the
pager.c:disk_cache_bptr ihash which finds cached block index from
given block number.  Both data structures are guarded by
pager.c:disk_cache_lock.

** Public interface

"Public" interface to the cache are functions disk_cache_block_ref,
disk_cache_block_ref_ptr, disk_cache_block_deref,
disk_cache_block_is_ref.  disk_cache_block_ref takes block number and
return pointer to block content.  Reference count of this cached block
is incremented.  After finishing work with block,
disk_cache_block_deref should be called.

In converting original ext2fs code to use this functions, usually call
to bptr is turned into call to disk_cache_block_ref.  In addition,
after pointer to block content is not used anymore,
disk_cache_block_deref is called.  This simple scheme is only for
reading from block.  For modifying block, see about pokels below.

disk_cache_block_ref_ptr just increments reference count of specified
block.  It's used when we give pointer to block content to somebody
else that will dereference it (e.g. pokel) and we want to continue to
use this content.

disk_cache_block_is_ref checks if specified block has reference count
greater than zero.  It's used in assert:s.

*** bptr* and boffs* macros

These macros continue to work as before, but they don't deal with
reference counting and this should be taken into consideration.  In
addition, bptr_index returns cached block index from given pointer to
block content.  (This function is used internally.)

*** Pokels

When pokel_add is called with pointer to block content, this
"consumes" reference of block.  It's not consumed (decremented by 1)
immediately, but when pokel_sync or pokel_flush is called.  (Reference
is consumed immediately if the block is already in the pokel.  The
important thing is that you always lose one reference of the block.)

So we have the following code when we read from block:

	char *bh = disk_cache_block_ref (block);
	...
	disk_cache_block_deref (bh);

And the following code when we modify block:

	char *bh = disk_cache_block_ref (block);
	...
	pokel_add (pokel, bh, block_size);

**** Indirect calls to pokel_add

Some functions indirectly call pokel_add, so this should be taken into
consideration.  These are:

   * record_global_poke
   * record_indir_poke

So these functions should be treated in the same scheme as pokel_add.
For example:

	char *bh = disk_cache_block_ref (block);
	...
	record_indir_poke (node, bh);

**** Modifying SBLOCK in diskfs_set_hypermetadata

SBLOCK is global variable that points to superblock content.  There is
one reference count for superblock, so before we call
record_global_poke (which consumes reference),
disk_cache_block_ref_ptr is called.

**** Modifying GDP

When group descriptor is wanted, usuall group_desc is called and
result is stored in local variable GDP.  After modifying GDP,
record_global_poke is called.  But because record_global_poke is used,
we need call to disk_cache_block_ref_ptr:

	gdp = group_desc (i);
	...
	disk_cache_block_ref_ptr (gdp);
	record_global_poke (gdp);

*** More complex use of pointer to block content

In ext2_new_block and ext2_alloc_inode functions, we have local
pointer variable BH that sometimes points to block content and
sometimes points to nothing.  In order to reduce possible errors, when
BH points to nothing it's always 0.  In some points (goto labels),
there is assertion if BH is what's expected (pointer to nothing or
pointer to something).

*** dino

dino function return pointer to struct ext2_inode for given ino_t.
This uses reference, so corresponding disk_cache_block_deref should be
called after finishing work with ext2_inode.  For convenience, dino is
renamed to dino_ref, and dino_deref just calls disk_cache_block_deref.

	struct ext2_inode *di = dino_ref (np->cache_id);
	...
	dino_deref (di);

Or

	struct ext2_inode *di = dino_ref (np->cache_id);
	...
	sync_global_ptr (di, 1);
	dino_deref (di);

Or

	struct ext2_inode *di = dino_ref (np->cache_id);
	...
	record_global_poke (di);

* Internals of the proposed implementation

As said earlier, instead of mapping the whole store of filesystem to
address space, only part of it is mapped.  This part is called "cache"
or "disk cache" (although "store cache" would be more appropriate).
Currently, the cache is contiguous area in address space that starts
at disk_cache.  Its size is disk_cache_size which is disk_cache_blocks
number of blocks of size block_size.

Mapped blocks in disk cache are not fixed -- each block in the cache
can be replaced at any time with another block.  So we need to know
which blocks are cached currently and where.  Information about each
cached block is stored in disk_cache_info[].  Index is from 0 to
disk_cache_blocks-1.  In this information the block number is stored
(among some other things, discussed later).  The reverse direction,
getting the index of cached block from block number, is achieved by
using disk_cache_bptr ihash.  Both these data structures are guarded
by disk_cache_lock.

** Requesting a block

When ext2 code requests block, it calls disk_cache_block_ref.  First,
this block is search with disk_cache_bptr.  If its there, the
reference count is incremented and pointer to block content is
returned.  In this case, there is a call to disk_cache_wait_remapping,
which is explained a bit later.

It's more interesting when block is not found in disk_cache_bptr.  In
this case, disk_cache_map is called.  Again, disk_cache_bptr is
consulted, because in the meantime another could already have mapped
this block.  If this is the case, the code is essentially the same as
those in disk_cache_block_ref.

When it's assured that block is not in the cache, we have no choice
but throw away an already mapped/cached block and put our block in its
place.  Such block has to meet the following conditions:

- Its reference count being 0
- Not in the core
- Not being remapped (explained later)
- Not being forbidden to be remapped ("fixed", explained later)

The last three conditions are actually flags in disk_cache_info:
DC_INCORE, DC_REMAPPING and DC_FIXED.  DC_DONT_REUSE collectively
gives the condition in which block is not suitable for
reusing/remapping.

Searching suitable place in cache is linear.  As an optimisation, this
search doesn't start from the beginning, but starts from where last
time it has ended.  This last index is stored in disk_cache_hint.  So
new candidate blocks for replacement are searched "circular".

If suitable place is found, the old mapping is removed, and the new
mapping is initialized.  But we are still not ready to return pointer
to block content, because this content is not available yet.  We mark
the block as DC_REMAPPING, which makes disk_cache_block_ref for that
block in other threads to wait until page is completely remapped.

In both cases, when we have found place and when suitable place is not
found, disk_cache_hint is updated so that next disk_cache_map
continues searching from where we ended.

When not suitable place is found, we have to use force.  First all
pages in disk cache are touched.  This is workaround because of some
bug in GNU Mach.  The patch relies on "precious page" features of
Mach.  Marking a page as precious instructs Mach to always inform us
about evicting this page.  If page is modified, it seems that we are
always informed.  But if page is unmodified and page is evicted,
sometimes Mach forgets to tell us.  It's true that with large disk
cache, e.g. 512M, this potentially will re-read the whole cache from
disk.  But if we reach this point, the microkernel is telling us that
all is already read :-)

This is preparation for following calls to pager_return_some.  This
libpager function is called only on cached blocks that has reference
count of 0.  These are the potential candidates for replacement --
there is no sense in calling pager_return_some when reference count is
1 or more.  One final case is when there is no cached block that has
reference count of 0.  This is bad and we can't do anything about it.
In this case, we just wait one second hoping that some other thread
will drop reference count of block to 0.  (XXX Currently (in RC1)
sleep(1) is always executed.  It should be executed only when disk
cache is starving.  There is some rationale behind calling sleep(1) even when
disk cache is not starving.  Although pager_return_some(,,,1)
guarantees that upon return of this function the page is returned, I'm
not sure that it's guaranteed that pager_notify_pageout is called.
This is because pager_return_some and
libpager/data-return.c:_pager_do_write_request are executed in
different threads and pager_return_some is confirmed before calling
pager_notify_pageout.  This issue is open.)

So, after forcibly evicting all pages (blocks) that can potentially be
reused, disk_cache_map is called again.

In the case when suitable place is found and all data structures
(disk_cache_info and disk_cache_bptr) are changed accordingly,
pager_return_some(,,,1) is called and we wait for pager_read_page to
clear DC_REMAPPING.  The purpose of this flag (DC_REMAPPING) is solely
this: to forbid any use of this block until we are absolutely sure
that this page contains exactly the wanted block.  If NDEBUG is not
defined (so we include debug code), flags of the blocks are checked if
DC_REMAPPING is really cleared.

Is DC_REMAPPING really needed?  Is there possibility that between last
"mutex_unlock (&disk_cache_lock)" and "return bptr" something could go
wrong?  Actually, disk cache just follows protocol set by
pager_notify_pageout: that between pager_return_some and changing
internal structures for the remapping no thread may touch the page.
This is achieved by marking the page as DC_REMAPPING.  For
convenience, function disk_cache_wait_remapping is defined which waits
for cached block while it's marked as DC_REMAPPING.

XXX XXX: Actually, the sequence used in RC1 is: remap block and
pager_return_some.  The latter seems redundant, as only blocks that
are evicted are candidates for remapping.  I'll try to fix that for
RC2.

** Modifying blocks and pokels

After block is modified, it should be registered with pokel_add to
some pokel.  Pokel contains list of ranges of cached blocks.  All this
blocks should have reference count at least 1.  In pokel_flush and
pokel_sync, this reference is consumed.

So in pokel_add if added blocks are already in the pokel, their
references are consumed, because only 1 reference is consumed in
pokel_{sync,flush}.  It's checked if pokel is for disk_cache, because
pokels are used in file access too, where disk cache layer is not
used.

pokel_{flush,sync} both use _pokel_exec, so this is the place where
block references are consumed.  (XXX: In RC1, they are consumed
always, but it's better to check if these pages are in disk_cache.
Although calling disk_cache_block_deref on non-disk_cache page does no
harm.)

*** Indirect use of pokel_add

record_global_poke and record_indir_poke use indirectly pokel_add.
These functions are slightly changed to use public interface of
disk_cache.  Only new precondition is added for them: caller should
supply "reference" that will be consumed later by pokel_{flush,sync}.

*** Modifying block without using pokels

sync_global_ptr synchronizes given block immediately.  No reference is
consumed.  (XXX: This should be changed in RC2 to consuming reference.
This will make the function similar in use to
record_{global,indir}_poke and will make the code more nice-looking.)

** Initialization

*** The superblock

To create disk cache, we need the block size of the filesystem.  This
information is in superblock, so we need to read superblock without
using disk cache.  For this purpose get_hypermetadata is changed to
read the superblock with store_read instead of old bptr.  New function
map_hypermetadata is created that sets sblock global variable to point
to the already mapped superblock.  So to get behavior of old
get_hypermetadata, first new get_hypermetadata should be called, and
then map_hypermetadata.

In ext2fs.c:main, instead of calling get_hypermetadata,
map_hypermetadata is called.  The call to get_hypermetadata is in
pager.c:create_disk_pager.

In ext2fs.c:diskfs_reload_global_state, along with get_hypermetada,
map_hypermetadata is called.

*** disk_cache

Disk cache data structures are initialized in
pager.c:create_disk_pager called from ext2fs.c:main.  Disk pager is
still initialized with diskfs_start_disk_pager, but due to block_size
variable we call get_hypermetadata.  Basic parameters of disk cache
like disk_cache_blocks and disk_cache_size are initialized here.  The
rest of the initialization process is delegated to disk_cache_init.

disk_cache_init initializes the rest of disk cache data structures:
disk_cache_lock, disk_cache_remapping, disk_cache_bptr,
disk_cache_info and disk_cache_hint.  After that superblock and group
descriptors are mapped into the cached and are marked as DC_FIXED.
This forbids reusing those blocks, because Hurd's ext2 code relies on
these blocks being mapped into fixed location in address space.

** Pager callbacks

disk_pager_read_page and disk_pager_write_page just use disk cache
data structures to get the right pointers to blocks.
disk_pager_read_page requests notification of page-out and updates
DC_INCORE and DC_REMAPPING too.  DC_INCORE is set and DC_REMAPPING is
cleared (because reading the new block finishes its remapping).

disk_pager_notify_pageout just clears DC_INCORE, making that page
available for remapping.

* libpager changes

Here memory_object_data_ prefix is shorten to m_o_d_.  And when it's
talked about m_o_d_function Mach function, usually its libpager
handler is meant.

** Notification on eviction

The most important change that is wanted from libpager is supporting
notification when page is evicted.  Mach already has partial support
for notification on eviction by argument "kcopy" of m_o_d_return.  If
kcopy is 0, then Mach doesn't have copy of this page anymore, so the
page is "evicted".  The problem is that m_o_d_return is usually called
only when page is modified, and if it's not modified, it's silently
dropped.

The solutions is marking page as "precious".  This has the exact
semantics we need: when page is evicted, m_o_d_return callback is
always called with kcopy=0.

*** Implementation details

New argument is added to user callback pager_read_page:
notify_on_pageout.  If it's non-zero and the page is evicted, user
callback pager_notify_pageout(pager,page) is called.  This change ABI
requires all libpager clients in the Hurd to be changed according to
the new API.

m_o_d_request stores notify_on_pageout as flag PM_NOTIFY_PAGEOUT.

m_o_d_return no longer just skips non-dirty pages.  Local array
notified[] is build and at the end of the function,
pager_notify_pageout is called for all pages that are evicted
(kcopy=0).

** Avoiding libpager optimization

Unfortunately, there is one more problem, this time specific to
libpager, not Mach.  There is an optimization in m_o_d_request when
page is being paged out.  In the beginning of m_o_d_return, all pages
being return are marked as PM_PAGINGOUT.  This mark is cleared after
m_o_d_supply (which supplies page content to Mach) is called.  If
m_o_d_request is called on page that is marked as PM_PAGINGOUT, this
page is marked with PM_PAGEINWAIT, and m_o_d_supply inside
m_o_d_return is not called for this page.  This is possible because
neither of these functions hold pager->interlock during the whole
execution of function.  This lock is temporarily unlocked during call
to user callbacks pager_read_page and pager_write_page.

So what is the implication of this optimization to our page eviction
notification?  When page is paged out, we get notified and we can
decide to reuse it.  After arranging disk_cache_info, etc, page is
touched, but if this happens fast enough, the optimization is
triggered and we get the old content!  Reading the page is "optimized"
and pager_read_page is not called, but instead the content of old
block is used.

This is solved by marking flushed and synced pages (via
pager_{flush,sync}{,_some} with PM_FORCEREAD.  (These functions call
lock-object.c:_pager_lock_object which marks pages with PM_FORCEREAD
if they are already marked with PM_NOTIFY_PAGEOUT.)  In handling
m_o_d_request, pages marked as PM_FORCEREAD are not optimized in this
way.  XXX: Currently, this fine-grained logic is disabled (with #if),
as it needs more testing.  Probably RC2 will use it.  For now, all
pages are considered PM_FORCEREAD and this particular optimization
never happens.

*** Technical details

As said above, we need guarantee that after pager_{sync,flush}*,
pager_read_page callback is called.  The most convenient place to mark
these pages as being forced to re-read is
lock-object.c:_pager_lock_object, because this function is used by all
pager_{sync,flush}* functions.  So there we just mark page as
PM_FORCEREAD if it's already marked as PM_NOTIFY_PAGEOUT.

First, this mark influences behaviour of m_o_d_request.  If page is
marked with PM_FORCEREAD and PM_PAGINGOUT, then we set PM_PAGEINWAIT
and wait until related m_o_d_return finishes (unmarks PM_PAGEINWAIT).
Then we continue with pager_read_page, etc.  If page is not marked
with PM_FORCEREAD and is marked with PM_PAGINGOUT, then old logic is
used and pager_read_page is not called (because m_o_d_return handler
will call m_o_d_supply instead of us).  (XXX: Again, this logic is
inside #if 0.  Currently, all pages are considered as marked with
PM_FORCEREAD.)

The other place where PM_FORCEREAD is taken into consideration is
handler of m_o_d_return.  The original code checks if page is marked
with PM_PAGEINWAIT, and if it is, m_o_d_supply is called for the just
written page.  PM_PAGEINWAIT is used as "delegator" of the
m_o_d_supply call to Mach.

In patched libpager, there is one more condition for when to call
m_o_d_supply.  It's called when page is marked as PM_PAGEINWAIT and
not marked as PM_FORCEREAD.  If it's marked as PM_FORCEREAD, then we
leave m_o_d_supply to m_o_d_request handler which gets notified by
condition pager->wakeup.