<henrikcozza> I do not understand what are the deficiencies of Mach, the content I find on this is vague... <antrik> the major problems are that the IPC architecture offers poor performance; and that resource usage can not be properly accounted to the right parties <braunr> antrik: the more i study it, the more i think ipc isn't the problem when it comes to performance, not directly <braunr> i mean, the implementation is a bit heavy, yes, but it's fine <braunr> the problems are resource accounting/scheduling and still too much stuff inside kernel space <braunr> and with a very good implementation, the performance problem would come from crossing address spaces <braunr> (and even more on SMP, i've been thinking about it lately, since it would require syncing mmu state on each processor currently using an address space being modified) <antrik> braunr: the problem with Mach IPC is that it requires too many indirections to ever be performant AIUI <braunr> antrik: can you mention them ? <antrik> the semantics are generally quite complex, compared to Coyotos for example, or even Viengoos <braunr> antrik: the semantics are related to the message format, which can be simplified <braunr> i think everybody agrees on that <braunr> i'm more interested in the indirections <antrik> but then it's not Mach IPC anymore :-) <braunr> right <braunr> 22:03 < braunr> i mean, the implementation is a bit heavy, yes, but it's fine <antrik> that's not an implementation issue <braunr> that's what i meant by heavy :) <braunr> well, yes and no <braunr> Mach IPC have changed over time <braunr> it would be newer Mach IPC ... :) <antrik> the fact that data types are (supposed to be) transparent to the kernel is a major part of the concept, not just an implementation detail <antrik> but it's not just the message format <braunr> transparent ? <braunr> but they're not :/ <antrik> the option to buffer in the kernel also adds a lot of complexity <braunr> buffer in the kernel ? 
<braunr> ah you mean message queues <braunr> yes <antrik> braunr: eh? the kernel parses all the type headers during transfer <braunr> yes, so it's not transparent at all <antrik> maybe you have a different understanding of "transparent" ;-) <braunr> i guess <antrik> I think most of the other complex semantics are kinda related to the in-kernel buffering... <braunr> i fail to see why :/ <antrik> well, it allows ports rights to be destroyed while a message is in transfer. a lot of semantics revolve around what happens in that case <braunr> yes but it doesn't affect performance a lot <antrik> sure it does. it requires a lot of extra code and indirections <braunr> not a lot of it <antrik> "a lot" is quite a relative term :-) <antrik> compared to L4 for example, it *is* a lot <braunr> and those indirections (i think you refer to more branching here) are taken only when appropriate, and can be isolated, improved through locality, etc.. <braunr> the features they add are also huge <braunr> L4 is clearly insufficient <braunr> all current L4 forks have added capabilities .. <braunr> (that, with the formal verification, make se4L one of the "hottest" recent system projects) <braunr> seL4* <antrik> yes, but with very few extra indirection I think... 
similar to EROS (which claims to have IPC almost as efficient as the original L4) <braunr> possibly <antrik> I still fail to see much real benefit in formal verification :-) <braunr> but compared to other problems, this added code is negligible <braunr> antrik: for a microkernel, me too :/ <braunr> the kernel is already so small you can simply audit it :) <antrik> no, it's not neglible, if you go from say two cache lines touched per IPC (original L4) to dozens (Mach) <antrik> every additional variable that needs to be touched to resolve some indirection, check some condition adds significant overhead <braunr> if you compare the dozens to the huge amount of inter processor interrupt you get each time you change the kernel map, it's next to nothing .. <antrik> change the kernel map? not sure what you mean <braunr> syncing address spaces on hundreds of processors each time you send a message is a real scalability issue here (as an example), where Mach to L4 IPC seem like microoptimization <youpi> braunr: modify, you mean? <braunr> yes <youpi> (not switchp <youpi> ) <braunr> but that's only one example <braunr> yes, modify, not switch <braunr> also, we could easily get rid of the ihash library <braunr> making the message provide the address of the object associated to a receive right <braunr> so the only real indirection is the capability, like in other systems, and yes, buffering adds a bit of complexity <braunr> there are other optimizations that could be made in mach, like merging structures to improve locality <pinotree> "locality"? <braunr> having rights close to their target port when there are only a few <braunr> pinotree: locality of reference <youpi> for cache efficiency <antrik> hundreds of processors? let's stay realistic here :-) <braunr> i am .. <braunr> a microkernel based system is also a very good environment for RCU <braunr> (i yet have to understand how liburcu actually works on linux) <antrik> I'm not interested in systems for supercomputers. 
and I doubt desktop machines will get that many independant cores any time soon. we still lack software that could even romotely exploit that <braunr> hum, the glibc build system ? :> <braunr> lol <youpi> we have done a survey over the nix linux distribution <youpi> quite few packages actually benefit from a lot of cores <youpi> and we already know them :) <braunr> what i'm trying to say is that, whenever i think or even measure system performance, both of the hurd and others, i never actually see the IPC as being the real performance problem <braunr> there are many other sources of overhead to overcome before getting to IPC <youpi> I completely agree <braunr> and with the advent of SMP, it's even more important to focus on contention <antrik> (also, 8 cores aren't exactly a lot...) <youpi> antrik: s/8/7/ , or even 6 ;) <antrik> braunr: it depends a lot on the use case. most of the problems we see in the Hurd are probably not directly related to IPC performance; but I pretty sure some are <antrik> (such as X being hardly usable with UNIX domain sockets) <braunr> antrik: these have more to do with the way mach blocks than IPC itself <braunr> similar to the ext2 "sleep storm" <antrik> a lot of overhead comes from managing ports (for for example), which also mostly comes down to IPC performance <braunr> antrik: yes, that's the main indirection <braunr> antrik: but you need such management, and the related semantics in the kernel interface <braunr> (although i wonder if those should be moved away from the message passing call) <antrik> you mean a different interface for kernel calls than for IPC to other processes? that would break transparency in a major way. not sure we really want that... <braunr> antrik: no <braunr> antrik: i mean calls specific to right management <antrik> admittedly, transparency for port management is only useful in special cases such as rpctrace, and that probably could be served better with dedicated debugging interfaces... 
<braunr> antrik: i.e. not passing rights inside messages <antrik> passing rights inside messages is quite essential for a capability system. the problem with Mach IPC in regard to that is that the message format allows way more flexibility than necessary in that regard... <braunr> antrik: right <braunr> antrik: i don't understand why passing rights inside messages is important though <braunr> antrik: essential even <youpi> braunr: I guess he means you need at least one way to pass rights <antrik> braunr: well, for one, you need to pass a reply port with each RPC request... <braunr> youpi: well, as he put, the message passing call is overpowered, and this leads to many branches in the code <braunr> antrik: the reply port is obvious, and can be optimized <braunr> antrik: but the case i worry about is passing references to objects between tasks <braunr> antrik: rights and identities with the auth server for example <braunr> antrik: well ok forget it, i just recall how it actually works :) <braunr> antrik: don't forget we lack thread migration <braunr> antrik: you may not think it's important, but to me, it's a major improvement for RPC performance <antrik> braunr: how can seL4 be the most interesting microkernel then?... ;-) <braunr> antrik: hm i don't know the details, but if it lacks thread migration, something is wrong :p <braunr> antrik: they should work on viengoos :) <antrik> (BTW, AIUI thread migration is quite related to passive objects -- something Hurd folks never dared seriously consider...) <braunr> i still don't know what passive objects are, or i have forgotten it :/ <antrik> no own control threads <braunr> hm, i'm still missing something <braunr> what do you refer to by control thread ? <braunr> with* <antrik> i.e. 
no main loop etc.; only activated by incoming calls <braunr> ok <braunr> well, if i'm right, thomas bushnel himself wrote (recently) that the ext2 "sleep" performance issue was expected to be solved with thread migration <braunr> so i guess they definitely considered having it <antrik> braunr: don't know what the "sleep peformance issue" is... <braunr> http://lists.gnu.org/archive/html/bug-hurd/2011-12/msg00032.html <braunr> antrik: also, the last message in the thread, http://lists.gnu.org/archive/html/bug-hurd/2011-12/msg00050.html <braunr> antrik: do you consider having a reply port being an avoidable overhead ? <antrik> braunr: not sure. I don't remember hearing of any capability system doing this kind of optimisation though; so I guess there are reasons for that... <braunr> antrik: yes me too, even more since neal talked about it on viengoos <antrik> I wonder whether thread management is also such a large overhead with fully sync IPC, on L4 or EROS for example... <braunr> antrik: it's still a very handy optimization for thread scheduling <braunr> antrik: it makes solving priority inversions a lot easier <antrik> actually, is thread scheduling a problem at all with a thread activation approach like in Viengoos? <braunr> antrik: thread activation is part of thread migration <braunr> antrik: actually, i'd say they both refer to the same thing <antrik> err... scheduler activation was the term I wanted to use <braunr> same <braunr> well <braunr> scheduler activation is too vague to assert that <braunr> antrik: do you refer to scheduler activations as described in http://en.wikipedia.org/wiki/Scheduler_activations ? <antrik> my understanding was that Viengoos still has traditional threads; they just can get scheduled directly on incoming IPC <antrik> braunr: that Wikipedia article is strange. 
it seems to use "scheduler activations" as a synonym for N:M multithreading, which is not at all how I understood it <youpi> antrik: I used to try to keep a look at those pages, to fix such wrong things, but left it <braunr> antrik: that's why i ask <antrik> IIRC Viengoos has a thread associated with each receive buffer. after copying the message, the kernel would activate the processes activation handler, which in turn could decide to directly schedule the thead associated with the buffer <antrik> or something along these lines <braunr> antrik: that's similar to mach handoff <youpi> antrik: generally enough, all the thread-related pages on wikipedia are quite bogus <antrik> nah, handoff just schedules the process; which is not useful, if the right thread isn't activated in turn... <braunr> antrik: but i think it's more than that, even in viengoos <youpi> for instance, the french "thread" page was basically saying that they were invented for GUIs to overlap computation with user interaction <braunr> .. :) <antrik> youpi: good to know... <braunr> antrik: the "misunderstanding" comes from the fact that scheduler activations is the way N:M threading was implemented on netbsd <antrik> youpi: that's a refreshing take on the matter... ;-) <braunr> antrik: i'll read the critique and viengoos doc/source again to be sure about what we're talking :) <braunr> antrik: as threading is a major issue in mach, and one of the things i completely changed (and intend to change) in x15, whenever i get to work on that again ..... :) <braunr> antrik: interestingly, the paper about scheduler activations was written (among others) by brian bershad, in 92, when he was actively working on research around mach <antrik> braunr: BTW, I have little doubt that making RPC first-class would solve a number of problems... I just wonder how many others it would open
<braunr> it was intended as a mach clone, but now that i have better knowledge of both mach and the hurd, i don't want to retain mach compatibility <braunr> and unlike viengoos, it's not really experimental <braunr> it's focused on memory and cpu scalability, and performance, with techniques like thread migration and rcu <braunr> the design i have in mind is closer to what exists today, with strong emphasis on scalability and performance, that's all <braunr> and the reason the hurd can't be modified first is that my design relies on some important design changes <braunr> so there is a strong dependency on these mechanisms that requires the kernel to exist first
<gnu_srs> And you will address the design flaws or implementation faults with x15? <braunr> no <braunr> i'll address the implementation details :p <braunr> and some design issues like cpu and memory resource accounting <braunr> but i won't implement generic resource containers <braunr> assuming it's completed, my work should provide a hurd system on par with modern monolithic systems <braunr> (less performant of course, but performant, scalable, and with about the same kinds of problems) <braunr> for example, thread migration should be mandatory <braunr> which would make client calls behave exactly like a userspace task asking a service from the kernel <braunr> you have to realize that, on a monolithic kernel, applications are clients, and the kernel is a server <braunr> and when performing a system call, the calling thread actually services itself by running kernel code <braunr> which is exactly what thread migration is for a multiserver system <braunr> thread migration also implies sync IPC <braunr> and sync IPC is inherently more performant because it only requires one copy, no in kernel buffering <braunr> sync ipc also avoids message floods, since client threads must run server code <gnu_srs> and this is not achievable with evolved gnumach and/or hurd? 
<braunr> well that's not entirely true, because there is still a form of async ipc, but it's a lot less likely <braunr> it probably is <braunr> but there are so many things to change i prefer starting from scratch <braunr> scalability itself probably requires a revamp of the hurd core libraries <braunr> and these libraries are like more than half of the hurd code <braunr> mach ipc and vm are also very complicated <braunr> it's better to get something new and simpler from the start <gnu_srs> a major task nevertheless:-D <braunr> at least with the vm, netbsd showed it's easier to achieve good results from new code, as other mach vm based systems like freebsd struggled to get as good <braunr> well yes <braunr> but at least it's not experimental <braunr> everything i want to implement already exists, and is tested on production systems <braunr> it's just time to assemble those ideas and components together into something that works <braunr> you could see it as a qnx-like system with thread migration, the global architecture of the hurd, and some improvements from linux like rcu :)
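The copy-count argument above (sync IPC needs one copy, in-kernel buffering needs two) can be sketched in a few lines. This is only a toy model under stated assumptions: `BufferedPort` and `RendezvousPort` are hypothetical names invented for illustration, not Mach, QNX or x15 interfaces, and a real kernel copies between address spaces rather than between Python bytearrays.

```python
from collections import deque


class BufferedPort:
    """Asynchronous port (hypothetical): messages wait in a kernel-side queue."""

    def __init__(self):
        self.queue = deque()
        self.copies = 0

    def send(self, user_buf):
        # Copy 1: sender's buffer -> kernel queue; the sender does not wait.
        self.queue.append(bytearray(user_buf))
        self.copies += 1

    def receive(self, user_buf):
        # Copy 2: kernel queue -> receiver's buffer.
        kernel_buf = self.queue.popleft()
        user_buf[:len(kernel_buf)] = kernel_buf
        self.copies += 1
        return len(kernel_buf)


class RendezvousPort:
    """Synchronous port (hypothetical): transfer happens when both sides are present."""

    def __init__(self):
        self.copies = 0

    def call(self, sender_buf, receiver_buf):
        # Single copy: sender's buffer -> receiver's buffer, nothing is queued.
        receiver_buf[:len(sender_buf)] = sender_buf
        self.copies += 1
        return len(sender_buf)
```

Sending the same message through both models ends with `copies == 2` for the buffered port and `copies == 1` for the rendezvous port. The queue in `BufferedPort` is also where a flood of unhandled messages would pile up, which the rendezvous model cannot do because the sender has to be present for the transfer.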
<antrik> braunr: thread migration is tested on production systems? <antrik> BTW, I don't think that generally increasing the priority of servers is a good idea <antrik> in most cases, IPC should actually be sync. slpz looked at it at some point, and concluded that the implementation actually has a fast-path for that case. I wonder what happens to scheduling in this case -- is the receiver sheduled immediately? if not, that's something to fix... <braunr> antrik: qnx does something very close to thread migration, yes <braunr> antrik: i agree increasing the priority isn't a good thing, but it's the best of the quick and dirty ways to reduce message floods <braunr> the problem isn't sync ipc in mach <braunr> the problem is the notifications (in our cases the dead name notifications) that are by nature async <braunr> and a malicious program could send whatever it wants at the fastest rate it can <antrik> braunr: malicious programs can do any number of DOS attacks on the Hurd; I don't see how increasing priority of system servers is relevant in that context <antrik> (BTW, I don't think dead name notifications are async by nature... just like for most other IPC, the *usual* case is that a server thread is actively waiting for the message when it's generated) <braunr> antrik: it's async with respect to the client <braunr> antrik: and malicious programs shouldn't be able to do that kind of dos <braunr> but this won't be fixed any time soon <braunr> on the other hand, a higher priority helps servers not create too many threads because of notifications, and that's a good thing <braunr> gnu_srs: the "fix" for this will be to rewrite select so that it's synchronous btw <braunr> replacing dead name notifications with something like cancelling a previously installed select request <antrik> no idea what "async with respect to the client" means <braunr> it means the client doesn't wait for anything <antrik> what is the client? what scenario are you talking about? 
how does it affect scheduling? <braunr> for notifications, it's usually the kernel <braunr> it doesn't directly affect scheduling <braunr> it affects the amount of messages a hurd server has to take care of <braunr> and the more messages, the more threads <braunr> i'm talking about event loops <braunr> and non blocking (or very short) selects <antrik> the amount of messages is always the same. the question is whether they can be handled before more come in. which would be the case if be default the receiver gets scheduled as soon as a message is sent... <braunr> no <braunr> scheduling handoff doesn't imply the thread will be ready to service the next message by the time a client sends a new one <braunr> the rate at which a message queue gets filled has nothing to do with scheduling handoff <antrik> I very much doubt rates come into play at all <braunr> well they do <antrik> in my understanding the problem is that a lot of messages are sent before the receive ever has a chance to handle them. so no matter how fast the receiver is, it looses <braunr> a lot of non blocking selects means a lot of reply ports destroyed, a lot of dead name notifications, and what i call message floods at server side <braunr> no <braunr> it used to work fine with cthreads <braunr> it doesn't any more with pthreads because pthreads are slightly slower <antrik> if the receiver gets a chance to do some work each time a message arrives, in most cases it would be free to service the next request with the same thread <braunr> no, because that thread won't have finished soon enough <antrik> no, it *never* worked fine. it might have been slighly less terrible. 
<braunr> ok it didn't work fine, it worked ok <braunr> it's entirely a matter of rate here <braunr> and that's the big problem, because it shouldn't <antrik> I'm pretty sure the thread would finish before the time slice ends in almost all cases <braunr> no <braunr> too much contention <braunr> and in addition locking a contended spin lock depresses priority <braunr> so servers really waste a lot of time because of that <antrik> I doubt contention would be a problem if the server gets a chance to handle each request before 100 others come in <braunr> i don't see how this is related <braunr> handling a request doesn't mean entirely processing it <braunr> there is *no* relation between handoff and the rate of incoming message rate <braunr> unless you assume threads can always complete their task in some fixed and low duration <antrik> sure there is. we are talking about a single-processor system here. <braunr> which is definitely not the case <braunr> i don't see what it changes <antrik> I'm pretty sure notifications can generally be handled in a very short time <braunr> if the server thread is scheduled as soon as it gets a message, it can also get preempted by the kernel before replying <braunr> no, notifications can actually be very long <braunr> hurd_thread_cancel calls condition_broadcast <braunr> so if there are a lot of threads on that .. <braunr> (this is one of the optimizations i have in mind for pthreads, since it's possible to precisely select the target thread with a doubly linked list) <braunr> but even if that's the case, there is no guarantee <braunr> you can't assume it will be "quick enough" <antrik> there is no guarantee. but I'm pretty sure it will be "quick enough" in the vast majority of cases. which is all it needs. 
<braunr> ok <braunr> that's also the idea behind raising server priorities <antrik> braunr: so you are saying the storms are all caused by select(), and once this is fixed, the problem should be mostly gone and the workaround not necessary anymore? <braunr> yes <antrik> let's hope you are right :-) <braunr> :) <antrik> (I still think though that making hand-off scheduling default is the right thing to do, and would improve performance in general...) <braunr> sure <braunr> well <braunr> no it's just a hack ;p <braunr> but it's a right one <braunr> the right thing to do is a lot more complicated <braunr> as roland wrote a long time ago, the hurd doesn't need dead-name notifications, or any notification other than the no-sender (which can be replaced by a synchronous close on fd like operation) <antrik> well, yes... I still think the viengoos approach is promising. I meant the right thing to do in the existing context ;-) <braunr> better than this priority hack <antrik> oh? you happen to have a link? never heard of that... <braunr> i didn't want to do it initially, even resorting to priority depression on trhead creation to work around the problem <braunr> hm maybe it wasn't him, i can't manage to find it <braunr> antrik: http://lists.gnu.org/archive/html/l4-hurd/2003-09/msg00009.html <braunr> "Long ago, in specifying the constraints of <braunr> what the Hurd needs from an underlying IPC system/object model we made it <braunr> very clear that we only need no-senders notifications for object <braunr> implementors (servers)" <braunr> "We don't in general make use of dead-name notifications, <braunr> which are the general kind of object death notification Mach provides and <braunr> what serves as task death notification." <braunr> "In the places we do, it's to serve <braunr> some particular quirky need (and mostly those are side effects of Mach's <braunr> decouplable RPCs) and not a semantic model we insist on having."
<antrik> The notion that seemed appropriate when we thought about these issues for <antrik> Fluke was that the "alert" facility be a feature of the IPC system itself <antrik> rather than another layer like the Hurd's io_interrupt protocol. <antrik> braunr: funny, that's *exactly* what I was thinking when looking at the io_interrupt mess :-) <antrik> (and what ultimately convinced me that the Hurd could be much more elegant with a custom-tailored kernel rather than building around Mach)
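The "it's entirely a matter of rate" point from the message flood discussion above can be illustrated with a small deterministic model. It assumes, as a simplification, that the server starts a new thread for every message that arrives while all existing threads are still busy; `peak_threads` and its parameters are hypothetical names, and contention, preemption and priority depression are deliberately left out.

```python
def peak_threads(n_messages, arrival_interval, service_time):
    """Return the highest number of simultaneously busy server threads.

    Messages arrive every `arrival_interval` ticks and each one keeps a
    thread busy for `service_time` ticks (both are illustrative units).
    """
    busy_until = []  # finish time of every thread that is still working
    peak = 0
    for i in range(n_messages):
        now = i * arrival_interval
        # Threads whose work is finished by now are free again.
        busy_until = [t for t in busy_until if t > now]
        # The new message occupies one thread until now + service_time.
        busy_until.append(now + service_time)
        peak = max(peak, len(busy_until))
    return peak
```

With an arrival interval longer than the service time, one thread is enough; once messages arrive faster than they can be handled, the peak grows with the ratio of service time to arrival interval, whatever the scheduler does at hand-off.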
<braunr> my initial attempt was a mach clone <braunr> but now i want a mach-like kernel, without compatibility <lisporu> which new licence ? <braunr> and some very important changes like sync ipc <braunr> gplv3 <braunr> (or later) <lisporu> cool 8) <braunr> yes it is gplv2+ since i didn't take the time to read gplv3, but now that i have, i can't use anything else for such a project :) <lisporu> what is mach-like ? (how it is different from Pistachio like ?) <braunr> l4 doesn't provide capabilities <lisporu> hmmm.. <braunr> you need a userspace for that <braunr> +server <braunr> and it relies on complete external memory management <lisporu> how much work is done ? <braunr> my kernel will provide capabilities, similar to mach ports, but simpler (less overhead) <braunr> i want the primitives right <braunr> like multiprocessor, synchronization, virtual memory, etc..
<braunr> for those interested, x15 is now a project of its own, with no gnumach compatibility goal, and covered by gplv3+
<braunr> bits of news about x15: it can now create tasks, threads, vm_maps, physical maps (cpu-specific page tables) for user tasks, and stack tracing (in addition to symbol resolution when symbols are available) was recently added
<braunr> Anarchos: as a side note, i'm currently working on a hurd clone with a microkernel that takes a lot from mach but also completely changes the ipc interface (making it not mach at all in the end) <braunr> it's something between mach and qnx neutrino <zacts> braunr: do you have a git repo of your new clone? <braunr> http://git.sceen.net/rbraun/x15.git/ <zacts> neat <braunr> it's far from complete <braunr> and hasn't reached a status where it can be publicly announced <zacts> ok <braunr> but progress has been constant so far, the ideas i'm using have proven applicable on other systems, i don't expect the kind of design issues that blocked HurdNG <braunr> (also, my attempt doesn't aim at the same goals as hurdng did) <braunr> (e.g. denial of service remains completely possible) <zacts> so x15 will use the current hurd translators? you are only replacing gnumach? <braunr> that was the plan some years ago, but now that i know the hurd better, i think the main issues are in the hurd, so there isn't much point rewriting mach <braunr> so, if the hurd needs a revamp, it's better to also make the underlying interface better if possible <braunr> zacts: in other words: it's a completely different beast <zacts> ok <braunr> the main goal is to create a hurd-like system that overcomes the current major deficiencies, most of them being caused by old design decisions <zacts> like async ipc? <braunr> yes <Anarchos> time for a persistent hurd ? 
:) <braunr> no way <braunr> i don't see a point to persistence for a general purpose system <braunr> and it easily kills performance <braunr> on the other hand, it would be nice to have a truly scalable, performant, and free microkernel based system <braunr> (and posix compatible) <braunr> there is currently none <braunr> zacts: the projects focuses mostly on performance and scalability, while also being very easy to understand and maintain (something i think the current hurd has failed at :/) <braunr> project* <zacts> very cool <braunr> i think so, but we'll have to wait for an end result :) <braunr> what's currently blocking me is the IDL <braunr> earlier research has shown that an idl must be optimized the same way compilers are for the best performance <braunr> i'm not sure i can write something good enough :/ <braunr> the first version will probably be very similar to mig, small and unoptimized
<zacts> braunr: so how exactly do the goals of x15 differ from viengoos? <braunr> zacts: viengoos is much more ambitious about the design <braunr> tbh, i still don't clearly see how its half-sync ipc works <braunr> x15 is much more mach-like, e.g. a hybrid microkernel with scheduling and virtual memory in the kernel <braunr> its goals are close to those of mach, adding increased scalability and performance to the list <zacts> that's neat <braunr> that's different <braunr> in a way, you could consider x15 is to mach what linux is to unix, a clone with a "slightly" different interface <zacts> ah, ok. cool! <braunr> viengoos is rather a research project, with very interesting goals, i think they're both neat :p
<braunr> for now, it provides kernel memory allocation and basic threading <braunr> it already supports both i386 and amd64 processors (from i586 onwards), and basic smp <zacts> oh wow <zacts> how easily can it be ported to other archs? <braunr> the current focus is smp load balancing, so that thread migration is enabled during development <braunr> hard to say <braunr> everything that is arch-specific is cleanly separated, the same way it is in mach and netbsd <braunr> but the arch-specific interfaces aren't well defined yet because there is only one (and incomplete) arch
<antrik> BTW, what is your current direction? did you follow through with abandonning Mach resemblance?... <braunr> no <braunr> it's very similar to mach in many ways <braunr> unless mach is defined by its ipc in which case it's not mach at all <braunr> the ipc interface will be similar to the qnx one <antrik> well, Mach is pretty much defined by it's IPC and VM interface... <braunr> the vm interface remains <antrik> its <braunr> although vm maps will be first class objects <braunr> so that it will be possible to move parts of the vm server outside the kernel some day if it feels like a good thing to do <braunr> i.e. vm maps won't be inferred from tasks <braunr> not implicitely <braunr> the kernel will be restricted to scheduling, memory management, and ipc, much as mach is (notwithstanding drivers) <antrik> hm... going with QNX IPC still seems risky to me... it's designed for simple embedded environments, not for general-purpose operating systems in my understanding <braunr> no, the qnx ipc interface is very generic <braunr> they can already call remote services <braunr> the system can scale well on multiprocessor machines <braunr> that's not risky at all, on the contrary <antrik> yeah, I'm sure it's generic... but I don't think anybody tried to build a Hurd-like system on top of it; so it's not at all clear whether it will work out at all... <auronandace> clueless question: does x15 have any inspiration from helenos? 
<braunr> absolutely none <braunr> i'd say x15 is almost an opposite to helenos <braunr> it's meant as a foundation for unix systems, like mach <braunr> some unix interfaces considered insane by helenos people (such as fork and signals) will be implemented (although not completely in the kernel) <braunr> ipc will be mostly synchronous <braunr> they're very different <braunr> well, helenos is very different <auronandace> cool <braunr> x15 and actually propel (the current name i have for the final system), are meant to create a hurd clone <auronandace> another clueless question: any similarities of x15 to minix? <braunr> and since we're few, implementing posix efficiently is a priority goal for me <braunr> again, absolutely none <braunr> for the same reasons <braunr> minix targets resilience in embedded environments <braunr> propel is a hurd clone <braunr> propel aims at being a very scalable and performant hurd clone <braunr> that's all <auronandace> neato <braunr> unfortunately, i couldn't find a name retaining all the cool properties of the hurd <braunr> feel free to suggest ideas :) <auronandace> propel? as in to launch forward? <braunr> push forward, yes <auronandace> that's very likely a better name than anything i could conjure up <braunr> x15 is named after mach (the first aircraft to break mach 4, reaching a bit less than mach 7) <braunr> servers will be engines, and together to push the system forward ..... 
:) <auronandace> nice <auronandace> thrust might be a bit too generic i guess <braunr> oh i'm looking for something like "hurd" <braunr> doubly recursive acronym, related to gnu <braunr> and short, so it can be used as a c namespace <braunr> antrik: i've thought about it a lot, and i'm convinced this kind of interface is fine for a hurd like system <braunr> the various discussions i found about the hurd requirements (remember roland talking about notifications) all went in this direction <braunr> note however the interface isn't completely synchronous <braunr> and that's very important <antrik> well, I'm certainly curious. but if you are serious about this, you'd better start building a prototype as soon as possible, rather than perfecting SMP ;-) <braunr> i'm not perfecting smp <braunr> but i consider it very important to have migrations and preemption actually working before starting the prototype <braunr> so that tricky mistakes about concurrency can be catched early <antrik> my current hunch is that you are trying to do too much at the same time... improving both the implementation details and redoing the system design <braunr> so, for example, there is (or will be soon, actually) thread migratio, but the scheduler doesn't take processor topology into account <braunr> that's why i'm starting from scratch <braunr> i don't delve too deep into the details <braunr> just the ones i consider very important <antrik> what do you mean by thread migration here? didn't you say you don't even have IPC?... <braunr> i mean migration between cpus <antrik> OK <braunr> the other is too confusing <braunr> and far too unused and unknown to be used <braunr> and i won't actually implement it the way it was done in mach <braunr> again, it will be similar to qnx <antrik> oh? now that's news for me :-) <antrik> you seemed pretty hooked on thread migration when we talked about these things last time... 
<braunr> i still am <braunr> i'm just saying it won't be implemented the same way <braunr> instead of upcalls from the kernel into userspace, i'll "simply" make server threads inherit from the caller's scheduling context <braunr> the ideas i had about stack management are impossible to apply in practice <braunr> which make the benefit i imagined unrealistic <braunr> and the whole idea was very confusing when compared and integrated into a unix like view <braunr> so stack usage will be increased <braunr> that's ok <antrik> but thread migration is more or less equivalent with first-class RPCs AIUI. does that work with the QNX IPC model?... <braunr> the very important property that threads don't block and wake a server when sending, and the server again blocks and wake the client on reply, is preserved <antrik> (in fact I find the term "first-class RPC" much clearer...) <braunr> i dont <braunr> there are two benefits in practice: since the scheduling context is inherited, the client is charged for the cpu time consumed <braunr> and since there are no wakeups and blockings, but a direct hand off in the scheduler, the cost of crossing task space is closer to the system call <antrik> which can be problematic too... but still it's the solution chosen by EROS for example AIUI <antrik> (inheriting scheduling contexts I mean) <braunr> by practically all modern microkernel based systems actually, as noted by shapiro <antrik> braunr: well, both benefits can be achieved by other means as well... scheduler activations like in Viengoos should handle the hand-off part AIUI, and scheduling contexts can be inherited explicitly too, like in EROS (and in a way in Viengoos) <braunr> i don't understand viengoos well enough to do it that way
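Editorial sketch of the accounting benefit braunr describes: when a server thread inherits the caller's scheduling context, the CPU time it consumes is charged to the client. All names here (`SchedulingContext`, `Thread`, `call`, `work_units`) are hypothetical and are not taken from x15, QNX or EROS; this is only a minimal model under those assumptions.

```python
from dataclasses import dataclass


@dataclass
class SchedulingContext:
    """Hypothetical accounting record: who pays for CPU time."""
    owner: str
    cpu_time: int = 0


@dataclass
class Thread:
    name: str
    context: SchedulingContext


def call(client: Thread, server: Thread, work_units: int) -> None:
    """Model a synchronous call where the server borrows the client's context."""
    saved = server.context
    server.context = client.context            # inherit on send
    try:
        server.context.cpu_time += work_units  # work is billed to the caller
    finally:
        server.context = saved                 # restore on reply


client = Thread("client", SchedulingContext("client"))
server = Thread("server", SchedulingContext("server"))
call(client, server, work_units=5)
```

After the call, the five units show up in the client's context while the server's own context is untouched, which is the property being argued for.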
<braunr> a microkernel loosely based on mach for a future hurd-like system <JoshuaB> ok. no way! Are you in the process of building a micro-kernel that the hurd may someday run on? <braunr> not the hurd, a hurd-like system <JoshuaB> ok wow. sounds pretty cool, and tricky <braunr> the hurd could, but would require many changes too, and the point of this rewrite is to overcome the most difficult technical performance and scalability problems of the current hurd <braunr> doing that requires deep changes in the low level interfaces <braunr> imo, a rewrite is more appropriate <braunr> sometimes, things done in x15 can be ported to the hurd <braunr> but it still requires a good deal of effort
<bddebian> braunr: Did I see that you are back tinkering with X15? <braunr> well yes i am <braunr> and i'm very satisfied with it currently, i hope i can maintain the same level of quality in the future <braunr> it can already handle hundreds of processors with hundreds of GB of RAM in a very scalable way <braunr> most algorithms are O(1) <braunr> even waking up multiple threads is O(1) :) <braunr> i'd like to implement rcu this summer <bddebian> Nice. When are you gonna replace gnumach? ;-P <braunr> never <braunr> it's x15, not x15mach now <braunr> it's not meant to be compatible <bddebian> Who says it has to be compatible? :) <braunr> i don't know, my head <braunr> the point is, the project is about rewriting the hurd now, not just the kernel <braunr> new kernel, new ipc, new interfaces, new libraries, new everything <bddebian> Yikes, now that is some work. :) <braunr> well yes and no <braunr> ipc shouldn't be that difficult/long, considering how simple i want the interface to be <bddebian> Cool. <braunr> networking and drivers will simply be reused from another code base like dde or netbsd <braunr> so besides the kernel, it's a few libraries (e.g. a libports like library), sysdeps parts in the c library, and a file system <bddebian> For inclusion in glibc or are you not intending on using glibc? <braunr> i intend to use glibc, but not for upstream integration, if that's what you meant <braunr> so a private, local branch i assume <braunr> i expect that part to be the hardest
<zacts> braunr: also, will propel/x15 use netbsd drivers or netdde linux drivers? <zacts> or both? <braunr> probably netbsd drivers <zacts> and if netbsd, will it utilize rump?
See also: user-space device drivers, External Projects, The Anykernel and Rump Kernels.
<braunr> i don't know yet <zacts> ok <braunr> device drivers and networking will arrive late <braunr> the system first has to run in ram, with a truely configurable boot process <braunr> (i.e. a boot process that doesn't use anything static, and can boot from either disk or network) <braunr> rump looks good but it still requires some work since it doesn't take care of messaging as well as we'd want <braunr> e.g. signal relaying isn't that great <zacts> I personally feel like using linux drivers would be cool, just because linux supports more hardware than netbsd iirc.. <mcsim> zacts: But it could be problematic as you should take quite a lot code from linux kernel to add support even for a single driver. <braunr> zacts: netbsd drivers are far more portable <zacts> oh wow, interesting. yeah I did have the idea that netbsd would be more portable. <braunr> mcsim: that doesn't seem to be as big a problem as you might suggest <braunr> the problem is providing the drivers with their requirements <braunr> there are a lot of different execution contexts in linux (hardirq, softirq, bh, threads to name a few) <braunr> being portable (as implied in netbsd) also means being less demanding on the execution context <braunr> which allows reusing code in userspace more easily, as demonstrated by rump <braunr> i don't really care about extensive hardware support, since this is required only for very popular projects such as linux <braunr> and hardware support actually comes with popularity (the driver code base is related with the user base) <zacts> so you think that more users will contribute if the projects takes off? <braunr> i care about clean and maintainable code <braunr> well yes <zacts> I think that's a good attitude <braunr> what i mean is, there is no need for extensive hardware support <mcsim> braunr: TBH, I did not really got idea of rump. Do they try to run the whole kernel or some chosen subsystems as user tasks? 
<braunr> mcsim: some subsystems <braunr> well <braunr> all the subsystems required by the code they actually want to run <braunr> (be it a file system or a network stack) <mcsim> braunr: What's the difference with dde? <braunr> it's not kernel oriented <mcsim> what do you mean? <braunr> it's not only meant to run on top of a microkernel <braunr> as the author named it, it's "anykernel" <braunr> if you remember at fosdem, he run code inside a browser <braunr> ran* <braunr> and also, netbsd drivers wouldn't restrict the license <braunr> although not a priority, having a (would be) gnu system under gplv3+ would be nice <zacts> that would be cool <zacts> x15 is already gplv3+ <zacts> iirc <braunr> yes <zacts> cool <zacts> yeah, I would agree netbsd drivers do look more attractive in that case <braunr> again, that's clearly not the main reason for choosing them <zacts> ok <braunr> it could also cause other problems, such as accepting a bsd license when contributing back <braunr> but the main feature of the hurd isn't drivers, and what we want to protect with the gpl is the main features <zacts> I see <braunr> drivers, as well as networking, would be third party code, the same way you run e.g. firefox on linux <braunr> with just a bit of glue <zacts> braunr: what do you think of the idea of being able to do updates for propel without rebooting the machine? would that be possible down the road? <braunr> simple answer: no <braunr> that would probably require persistence, and i really don't want that <zacts> does persistence add a lot of complexity to the system? <braunr> not with the code, but at execution, yes <zacts> interesting <braunr> we could add per-program serialization that would allow it but that's clearly not a priority for me <braunr> updating with a reboot is already complex enough :)
<braunr> the thing is, i consider the basic building blocks of the hurd too crappy to build anything really worth such effort over them <braunr> mach is crappy, mig is crappy, signal handling is crappy, hurd libraries are ok but incur a lot of contention, which is crappy today <bddebian> Understood but it is all we have currently. <braunr> i know <braunr> and it's good as a prototype <bddebian> We have already had L4, viengoos, etc and nothing has ever come to fruition. :( <braunr> my approach is completely different <braunr> it's not a new design <braunr> a few things like ipc and signals are redesigned, but that's minor compared to what was intended for hurdng <braunr> propel is simply meant to be a fast, scalable implementation of the hurd high level architecture <braunr> bddebian: imagine a mig you don't fear using <braunr> imagine interfaces not constrained to 100 calls ... <braunr> imagine per-thread signalling from the start <bddebian> braunr: I am with you 100% but it's vaporware so far.. ;-) <braunr> bddebian: i'm just explaining why i don't want to work on large scale projects on the hurd <braunr> fixing local bugs is fine <braunr> fixing paging is mandatory <braunr> usb could be implemented with dde, perhaps by sharing the pci handling code <braunr> (i.e. have one big dde server with drivers inside, a bit ugly but straightforward compared to a full fledged pci server) <bddebian> braunr: But this is the problem I see. Those of you that have the skills don't have the time or energy to put into fixing that kind of stuff. <bddebian> braunr: That was my thought. <braunr> bddebian: well i have time, and i'm currently working :p <braunr> but not on that <braunr> bddebian: also, it won't be vaporware for long, i may have ipc working well by the end of the year, and optimized and developer-friendly by next year
<braunr> i'll soon add my radix tree with support for lockless lookups :> <braunr> a tree organized based on the values of the keys themselves, and not on how they compare relative to each other <braunr> also, a tree of arrays, which takes advantage of cache locality without the burden of expensive resizes <arnuld> you seem to be applying good algorithmic techniques <arnuld> that is nice <braunr> that's one goal of the project <braunr> you can't achieve performance and scalability without the appropriate techniques <braunr> see http://git.sceen.net/rbraun/librbraun.git/blob/HEAD:/rdxtree.c for the existing userspace implementation <arnuld> in kern/work.c I see one TODO "allocate numeric IDs to better identify worker threads" <braunr> yes <braunr> and i'm adding my radix tree now exactly for that <braunr> (well not only, since radix tree will also back VM objects and IPC spaces, two major data structures of the kernel)
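Editorial sketch of the "tree of arrays" idea: each node is a fixed-size array, and successive groups of bits of the integer key select the slot at each level, so no key comparisons are needed. This is not the rdxtree.c code; the fan-out (`RADIX_BITS`), the fixed `HEIGHT`, and the class and method names are assumptions made for illustration, and there is no lockless support here.

```python
RADIX_BITS = 6                  # assumed fan-out: 64 slots per node
RADIX_SIZE = 1 << RADIX_BITS
RADIX_MASK = RADIX_SIZE - 1
HEIGHT = 3                      # fixed height: keys must be < 64 ** 3


class RadixTree:
    def __init__(self):
        self.root = [None] * RADIX_SIZE

    @staticmethod
    def _indexes(key):
        """Split the key into one slot index per level, top level first."""
        if not 0 <= key < RADIX_SIZE ** HEIGHT:
            raise KeyError(key)
        return [(key >> (RADIX_BITS * level)) & RADIX_MASK
                for level in reversed(range(HEIGHT))]

    def insert(self, key, value):
        node = self.root
        *inner, last = self._indexes(key)
        for index in inner:
            if node[index] is None:
                node[index] = [None] * RADIX_SIZE   # allocate child array
            node = node[index]
        node[last] = value

    def lookup(self, key):
        node = self.root
        for index in self._indexes(key):
            node = node[index]
            if node is None:
                return None                         # missing child or slot
        return node
```

A lookup touches exactly `HEIGHT` arrays whatever the number of stored keys, which is where the cache locality and the absence of resizes come from. A real implementation would grow the height dynamically instead of fixing it.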
<braunr> and also starting paging anonymous memory in x15 :> <braunr> well, i've merged my radix tree code, made it safe for lockless access (or so i hope), added generic concurrent work queues <braunr> and once the basic support for anonymous memory is done, x15 will be able to load modules passed from grub into userspace :> <braunr> but i've also been thinking about how to solve a major scalability issue with capability based microkernels that no one else seems to have seen or bothered thinking about <braunr> for those interested, the problem is contention at the port level <braunr> unlike on a monolithic kernel, or a microkernel with thread-based ipc such as l4, mach and similar kernels use capabilities (port rights in mach terminology) to communicate <braunr> the kernel then has to "translate" that reference into a thread to process the request <braunr> this is done by using a port set, putting many ports inside, and making worker threads receive messages on the port set <braunr> and in practice, this gets very similar to a traditional thread pool model <braunr> one thread actually waits for a message, while others sit on a list <braunr> when a message arrives, the receiving thread wakes another from that list so it receives the next message <braunr> this is all done with a lock <bddebian> Maybe they thought about it but couldn't or were too lazy to find a better way? :) <mcsim> braunr: what do you mean by "unlike .... a microkernel with thread-based ipc such as l4, mach and similar kernels use capabilities"? L4 also has capabilities. <braunr> mcsim: not directly <braunr> capabilities are implemented by a server on top of l4 <braunr> unless it's OKL4 or another variant with capabilities back in the kernel <braunr> i don't know how fiasco does it <braunr> so the problem with this lock is potentially very heavy contention <braunr> and contention in what is the equivalent of a system call ..
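Editorial sketch of the receive path described above: one thread waits on the port set, the others sit on a list, and every delivery goes through a single lock to promote the next waiter. The `PortSet` class, its fields and the `lock_acquisitions` counter are hypothetical and only model the behaviour; this is not Mach code.

```python
import threading
from collections import deque


class PortSet:
    def __init__(self, thread_names):
        self.lock = threading.Lock()       # the shared lock under discussion
        self.spares = deque(thread_names)  # threads waiting on the list
        self.manager = self.spares.popleft()
        self.lock_acquisitions = 0

    def deliver(self, message):
        """Hand the message to the waiting thread and promote the next one."""
        with self.lock:
            self.lock_acquisitions += 1
            if self.manager is None:
                raise RuntimeError("no receiver; a real kernel would queue the sender")
            receiver = self.manager
            self.manager = self.spares.popleft() if self.spares else None
        return receiver, message

    def done(self, thread_name):
        """A thread finished its request and goes back to waiting."""
        with self.lock:
            self.lock_acquisitions += 1
            if self.manager is None:
                self.manager = thread_name
            else:
                self.spares.append(thread_name)
```

Every message costs at least one trip through `self.lock`, regardless of which client sent it. With many processors sending at once, that single lock is the contention point being described.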
<braunr> it's also hard to make it real-time capable <braunr> for example, in qnx, they temporarily apply priority inheritance to *every* server thread since they don't know which one is going to be receiving next <mcsim> braunr: in fiasco you have capability pool for each thread and this pool is stored in tread control block. When one allocates capability kernel just marks slot in a pool as busy <braunr> mcsim: ok but, there *is* a thread for each capability <braunr> i mean, when doing ipc, there can only be one thread receiving the message <braunr> (iirc, this was one of the big issue for l4-hurd) <mcsim> ok. i see the difference. <braunr> well i'm asking <braunr> i'm not so sure about fiasco <braunr> but that's what i remember from the generic l4 spec <mcsim> sorry, but where is the question? <braunr> 16:04 < braunr> i mean, when doing ipc, there can only be one thread receiving the message <mcsim> yes, you specify capability to thread you want to send message to <braunr> i'll rephrase: <braunr> when you send a message, do you invoke a capability (as in mach), or do you specify the receiving thread ? <mcsim> you specify a thread <braunr> that's my point <mcsim> but you use local name (that is basically capability) <braunr> i see <braunr> from wikipedia: "Furthermore, Fiasco contains mechanisms for controlling communication rights as well as kernel-level resource consumption" <braunr> not certain that's what it refers to, but that's what i understand from it <braunr> more capability features in the kernel <braunr> but you still send to one thread <mcsim> yes <braunr> that's what makes it "easily" real time capable <braunr> a microkernel that would provide mach-like semantics (object-oriented messaging) but without contention at the messsage passing level (and with resource preallocation for real time) would be really great <braunr> bddebian: i'm not sure anyone did <bddebian> braunr: Well you can be the hero!! 
;) <braunr> the various papers i could find that were close to this subject didn't take contention into account <braunr> exception for network-distributed ipc on slow network links <braunr> bddebian: eh <braunr> well i think it's doable acctually <mcsim> braunr: can you elaborate on where contention is, because I do not see this clearly? <braunr> mcsim: let's take a practical example <braunr> a file system such as ext2fs, that you know well enough <braunr> imagine a large machine with e.g. 64 processors <braunr> and an ignorant developer like ourselves issuing make -j64 <braunr> every file access performed by the gcc tools will look up files, and read/write/close them, concurrently <braunr> at the server side, thread creation isn't a problem <braunr> we could have as many threads as clients <braunr> the problem is the port set <braunr> for each port class/bucket (let's assume they map 1:1), a port set is created, and all receive rights for the objects managed by the server (the files) are inserted in this port set <braunr> then, the server uses ports_manage_port_operations_multithread() to service requests on that port set <braunr> with as many threads required to process incoming messages, much the same way a work queue does it <braunr> but you can't have *all* threads receiving at the same time <braunr> there can only be one <braunr> the others are queued <braunr> i did a change about the queue order a few months ago in mach btw <braunr> mcsim: see ipc/ipc_thread.c in gnumach <braunr> this queue is shared and must be modified, which basically means a lock, and contention <braunr> so the 64 concurrent gcc processes will suffer from contenion at the server while they're doing something similar to a system call <braunr> by that, i mean, even before the request is received <braunr> mcsim: if you still don't understand, feel free to ask <mcsim> braunr: I'm thinking on it :) give me some time <braunr> "Fiasco.OC is a third generation microkernel, which evolved from its 
predecessor L4/Fiasco. Fiasco.OC is capability based" <braunr> ok <braunr> so basically, there are no more interesting l4 variants strictly following the l4v2 spec any more <braunr> "The completely redesigned user-land environment running on top of Fiasco.OC is called L4 Runtime Environment (L4Re). It provides the framework to build multi-component systems, including a client/server communication framework" <braunr> so yes, client/server communication is built on top of the kernel <braunr> something i really want to avoid actually <mcsim> So when 1 core wants to pull something out of queue it has to lock it, and the problem arrives when other 63 cpus are waiting in the same lock. Right? <braunr> mcsim: yes <mcsim> could this be solved by implementing per cpu queues? Like in slab allocator <braunr> solved, no <braunr> reduced, yes <braunr> by using multiple port sets, each with their own thread pool <braunr> but this would still leave core problems unsolved <braunr> (those making real-time hard) <mcsim> to make it real-time is not really essential to solve this problem <braunr> that's the other way around <mcsim> we just need to guarantee that locking protocol is fair <braunr> solving this problem is required for quality real-time <braunr> what you refer to is similar to what i described in qnx earlier <braunr> it's ugly <braunr> keep in mind that message passing is the equivalent of system calls on monolithic kernels <braunr> os ideally, we'd want something as close as possible to an actually system call <braunr> so* <braunr> mcsim: do you see why it's ugly ? <mcsim> no i meant exactly opposite, I meant to use some deterministic locking protocol <braunr> please elaborate <braunr> because what qnx does is deterministic <mcsim> We know in what sequences threads will acquire the lock, so we will not have to apply inheritance to all threads <braunr> hwo do you know ? 
<mcsim> there are different approaches, like you use ticket system or MCS lock (http://portal.acm.org/citation.cfm?id=103729) <braunr> that's still locking <braunr> a system call has 0 contention <braunr> 0 potential contention <mcsim> in linux? <braunr> everywhere i assume <mcsim> than why do they need locks? <braunr> they need locks after the system call <braunr> the system call itself is a stupid trap that makes the thread "jump" in the kernel <braunr> and the reason why it's so simple is the same as in fiasco: threads (clients) communicate directly with the "server thread" (themselves in kernel mode) <braunr> so 1/ they don't go through a capability or any other abstraction <braunr> and 2/ they're even faster than on fiasco because they don't need to find the destination, it's implied by the trap mechanism) <braunr> 2/ is only an optimization that we can live without <braunr> but 1/ is a serious bottleneck for microkernels <mcsim> Do you mean that there system call that process without locks or do you mean that there are no system calls that use locks? 
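Editorial sketch of the ticket scheme mcsim mentions: each waiter takes a number and is served strictly in that order, so the acquisition sequence is known in advance. This stand-in uses `threading.Condition` rather than the atomic fetch-and-add and spinning of a real ticket lock, the MCS variant is not shown, and the names `TicketLock` and `run_demo` are hypothetical.

```python
import threading


class TicketLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._next_ticket = 0
        self._now_serving = 0

    def take_ticket(self):
        """Reserve a place in line; returns the ticket number."""
        with self._cond:
            ticket = self._next_ticket
            self._next_ticket += 1
            return ticket

    def wait_turn(self, ticket):
        """Block until this ticket is the one being served."""
        with self._cond:
            self._cond.wait_for(lambda: self._now_serving == ticket)

    def release(self):
        """Pass the lock to the holder of the next ticket."""
        with self._cond:
            self._now_serving += 1
            self._cond.notify_all()


def run_demo(n=4):
    lock = TicketLock()
    order = []

    def worker(ticket):
        lock.wait_turn(ticket)
        order.append(ticket)     # critical section
        lock.release()

    tickets = [lock.take_ticket() for _ in range(n)]
    # Start the threads in reverse to show that start order does not matter.
    threads = [threading.Thread(target=worker, args=(t,)) for t in reversed(tickets)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```

The order is fair and predictable, but every sender still serializes on the same object, which is braunr's objection that this is "still locking".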
<braunr> this is what makes papers such as https://www.kernel.org/doc/ols/2007/ols2007v1-pages-251-262.pdf valid <braunr> i mean the system call (the mechanism used to query system services) doesn't have to grab any lock <braunr> the idea i have is to make the kernel transparently (well, as much as it can be) associate a server thread to a client thread at the port level <braunr> at the server side, it would work practically the same <braunr> the first time a server thread services a request, it's automatically associated to a client, and subsequent request will directly address this thread <braunr> when the client is destroyed, the server gets notified and destroys the associated server trhead <braunr> for real-time tasks, i'm thinking of using a signal that gets sent to all servers, notifying them of the thread creation so that they can preallocate the server thread <braunr> or rather, a signal to all servers wishing to be notified <braunr> or perhaps the client has to reserve the resources itself <braunr> i don't know, but that's the idea <mcsim> and who will send this signal? <braunr> the kernel <braunr> x15 will provide unix like signals <braunr> but i think the client doing explicit reservation is better <braunr> more complicated, but better <braunr> real time developers ought to know what they're doing anyway <braunr> mcsim: the trick is using lockless synchronization (like rcu) at the port so that looking up the matching server thread doesn't grab any lock <braunr> there would still be contention for the very first access, but that looks much better than having it every time <braunr> (potential contention) <braunr> it also simplifies writing servers a lot, because it encourages the use of a single port set for best performance <braunr> instead of burdening the server writer with avoiding contention with e.g. a hierarchical scheme <mcsim> "looking up the matching server" -- looking up where? <braunr> in the port <mcsim> but why can't you just take first? 
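Editorial sketch of the binding idea: the first message from a client takes a lock to associate a server thread with it, and later messages find that thread through a lookup that takes no lock. In Python the plain `dict.get` stands in for the RCU-style lockless lookup braunr has in mind; `Port`, `send`, `client_died` and `slow_path_count` are hypothetical names, not x15 interfaces.

```python
import threading


class Port:
    def __init__(self):
        self._bind_lock = threading.Lock()
        self._bound = {}              # client id -> server thread name
        self._next_server = 0
        self.slow_path_count = 0      # how often the lock was needed to bind

    def _bind(self, client):
        with self._bind_lock:
            if client not in self._bound:          # re-check under the lock
                self.slow_path_count += 1
                self._bound[client] = f"server-{self._next_server}"
                self._next_server += 1
            return self._bound[client]

    def send(self, client, message):
        server = self._bound.get(client)           # fast path: no lock taken
        if server is None:
            server = self._bind(client)            # first message only
        return server, message

    def client_died(self, client):
        """Model the kernel notification that lets the server recycle the thread."""
        with self._bind_lock:
            return self._bound.pop(client, None)
```

Only the first message from each client goes through `_bind`. The remaining cost is the one braunr acknowledges: workloads that create many short-lived clients hit the slow path often.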
<braunr> that's what triggers contention <braunr> you have to look at the first <mcsim> > (16:34:13) braunr: mcsim: do you see why it's ugly ? <mcsim> BTW, not really <braunr> imagine serveral clients send concurrently <braunr> mcsim: well, qnx doesn't do it every time <braunr> qnx boosts server threads only when there are no thread currently receiving, and a sender with a higher priority arrives <braunr> since qnx can't know which server thread is going to be receiving next, it boosts every thread <braunr> boosting priority is expensive, and boosting everythread is linear with the number of threads <braunr> so on a big system, it would be damn slow for a system call :) <mcsim> ok <braunr> and grabbing "the first" can't be properly done without serialization <braunr> if several clients send concurrently, only one of them gets serviced by the "first server thread" <braunr> the second client will be serviced by the "second" (or the first if it came back) <braunr> making the second become the first (i call it the manager) must be atomic <braunr> that's the core of the problem <braunr> i think it's very important because that's currently one of the fundamental differences wih monolithic kernels <mcsim> so looking up for server is done without contention. And just assigning task to server requires lock, right? <braunr> mcsim: basically yes <braunr> i'm not sure it's that easy in practice but that's what i'll aim at <braunr> almost every argument i've read about microkernel vs monolithic is full of crap <mcsim> Do you mean lock on the whole queue or finer grained one? <braunr> the whole port <braunr> (including the queue) <mcsim> why the whole port? <braunr> how can you make it finer ? <mcsim> is queue a linked list? 
<braunr> yes <mcsim> than can we just lock current element in the queue and elements that point to current <braunr> that's two lock <braunr> and every sender will want "current" <braunr> which then becomes coarse grained <mcsim> but they want different current <braunr> let's call them the manager and the spare threads <braunr> yes, that's why there is a lock <braunr> so they don't all get the same <braunr> the manager is the one currently waiting for a message, while spare threads are available but not doing anything <braunr> when the manager finally receives a message, it takes the first spare, which becomes the new manager <braunr> exactly like in a common thread pool <braunr> so what are you calling current ? <mcsim> we have in a port queue of threads that wait for message: t1 -> t2 -> t3 -> t4; kernel decided to assign message to t3, than t3 and t2 are locked. <braunr> why not t1 and t2 ? <mcsim> i was calling t3 in this example as current <mcsim> some heuristics <braunr> yeah well no <braunr> it wouldn't be deterministic then <mcsim> for instance client runs on core 3 and wants server that also runs on core 3 <braunr> i really want the operation as close as a true system call as possible, so O(1) <braunr> what if there are none ? <mcsim> it looks up forward up to the end of queue: t1->t2->t4; takes t4 <mcsim> than it starts from the beginning <braunr> that becomes linear in the worst case <mcsim> no <braunr> so 4095 attempts on a 4096 cpus machine <braunr> ? 
<mcsim> you're right <braunr> unfortunately :/ <braunr> a per-cpu scheme could be good <braunr> and applicable <braunr> with much more thought <braunr> and the problem is that, unlike the kernel, which is naturally a one thread per cpu server, userspace servers may have less or more threads than cpu <braunr> possibly unbalanced too <braunr> so it would result in complicated code <braunr> one good thing with microkernels is that they're small <braunr> they don't pollute the instruction cache much <braunr> keeping the code small is important for performance too <braunr> so forgetting this kind of optimization makes for not too complicated code, and we rely on the scheduler to properly balance threads <braunr> mcsim: also note that, with your idea, the worst cast is twice more expensive than a single lock <braunr> and on a machine with few processors, this worst case would be likely <mcsim> so, you propose every time try to take first server from the queue? <mcsim> braunr: ^ <braunr> no <braunr> that's what is done already <braunr> i propose doing that the first time a client sends a message <braunr> but then, the server thread that replied becomes strongly associated to that client (it cannot service requests from other clients) <braunr> and it can be recycled only when the client dies <braunr> (which generates a signal indicating the server it can now recycle the server thread) <braunr> (a signal similar to the no-sender or dead-name notifications in mach) <braunr> that signal would be sent from the kernel, in the traditional unix way (i.e. 
no dedicated signal thread since it would be another source of contention) <braunr> and the server thread would directly receive it, not interfering with the other threads in the server in any way <braunr> => contention on first message only <braunr> now, for something like make -j64, which starts a different process for each compilation (itself starting subprocesses for preprocessing/compiling/assembling) <braunr> it wouldn't be such a big win <braunr> so even this first access should be optimized <braunr> if you ever get an idea, feel free to share :) <mcsim> May mach block thread when it performs asynchronous call? <mcsim> braunr: ^ <braunr> sure <braunr> but that's unrelated <braunr> in mach, a sender is blocked only when the message queue is full <mcsim> So we can introduce per cpu queues at the sender side <braunr> (and mach_msg wasn't called in non blocking mode obviously) <braunr> no <braunr> they need to be delivered in order <mcsim> In what order? <braunr> messages can't be reorder once queued <braunr> reordered <braunr> so fifo order <braunr> if you break the queue in per cpu queues, you may break that, or need work to rebuild the order <braunr> which negates the gain from using per cpu queues <mcsim> Messages from the same thread will be kept in order <braunr> are you sure ? <braunr> and i'm not sure it's enough <mcsim> thes cpu queues will be put to common queue once context switch occurs <braunr> *all* messages must be received in order <mcsim> these* <braunr> uh ? <braunr> you want each context switch to grab a global lock ? 
<mcsim> if you have parallel threads that send messages that do not have dependencies than they are unordered <mcsim> always <braunr> the problem is they might <braunr> consider auth for example <braunr> you have one client attempting to authenticate itself to a server through the auth server <braunr> if message order is messed up, it just won't work <braunr> but i don't have this problem in x15, since all ipc (except signals) is synchronous <mcsim> but it won't be messed up. You just "send" messages in O(1), but than you put these messages that are not actually sent in queue all at once <braunr> i think i need more details please <mcsim> you have lock on the port as it works now, not the kernel lock <mcsim> the idea is to batch these calls <braunr> i see <braunr> batching can be effective, but it would really require queueing <braunr> x15 only queues clients when there is no receiver <braunr> i don't think batching can be applied there <mcsim> you batch messages only from one client <braunr> that's what i'm saying <mcsim> so client can send several messages during his time slice and than you put them into queue all together <braunr> x15 ipc is synchronous, no more than 1 message per client at any time <braunr> there also are other problems with this strategy <braunr> problems we have on the hurd, such as priority handling <braunr> if you delay the reception of messages, you also delay priority inheritance to the server thread <braunr> well not the reception, the queueing actually <braunr> but since batching is about delaying that, it's the same <mcsim> if you use synchronous ipc than there is no sence in batching, at least as I see it. <braunr> yes <braunr> 18:08 < braunr> i don't think batching can be applied there <braunr> and i think sync ipc is the only way to go for a system intended to provide messaging performance as close as possible to the system call <mcsim> do you have as many server thread as many cores you have? 
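Editorial sketch of the ordering concern with per-CPU send queues: once messages are split by CPU, draining the queues one after another no longer reproduces arrival order, and restoring it needs a shared sequence number plus a merge. The function names and the message strings are hypothetical, chosen only to echo the auth example.

```python
import heapq
from itertools import count


def enqueue_per_cpu(sends, ncpus):
    """sends: list of (cpu, message) pairs in global arrival order."""
    seq = count()                          # stands in for a shared counter
    queues = [[] for _ in range(ncpus)]
    for cpu, message in sends:
        queues[cpu].append((next(seq), message))
    return queues


def drain_naive(queues):
    """Empty each per-CPU queue in turn, ignoring sequence numbers."""
    return [message for queue in queues for _, message in queue]


def drain_ordered(queues):
    """Merge the queues by sequence number to rebuild FIFO order."""
    return [message for _, message in heapq.merge(*queues)]


sends = [(1, "auth-request"), (0, "auth-reply"), (1, "open")]
queues = enqueue_per_cpu(sends, ncpus=2)
```

Here `drain_naive(queues)` yields the reply before the request, while `drain_ordered(queues)` restores arrival order at the price of a shared counter and a merge step, which is the extra work braunr says negates the gain.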
<braunr> no <braunr> as many server threads as clients <braunr> which matches the monolithic model <mcsim> in current implementation? <braunr> no <braunr> currently i don't have userspace :> <mcsim> and what is in hurd atm? <mcsim> in gnumach <braunr> asyn ipc <braunr> async <braunr> with message queues <braunr> no priority inheritance, simple "handoff" on message delivery, that's all <anatoly> I managed to read the conversation :-) <braunr> eh <braunr> anatoly: any opinion on this ? <anatoly> braunr: I have no opinion. I understand it partially :-) But association of threads sounds for me as good idea <anatoly> But who am I to say what is good or what is not in that area :-) <braunr> there still is this "first time" issue which needs at least one atomic instruction <anatoly> I see. Does mach do this "first time" thing every time? <braunr> yes <braunr> but gnumach is uniprocessor so it doesn't matter <mcsim> if we have 1:1 relation for client and server threads we need only per-cpu queues <braunr> mcsim: explain that please <braunr> and the problem here is establishing this relation <braunr> with a lockless lookup, i don't even need per cpu queues <mcsim> you said: (18:11:16) braunr: as many server threads as clients <mcsim> how do you create server threads? <braunr> pthread_create <braunr> :) <mcsim> ok :) <mcsim> why and when do you create a server thread? 
<braunr> there must be at least one unbound thread waiting for a message <braunr> when a message is received, that thread knows it's now bound with a client, and if needed wakes up/spawns another thread to wait for incoming messages <braunr> when it gets a signal indicating the death of the client, it knows it's now unbound, and goes back to waiting for new messages <braunr> becoming either the manager or a spare thread if there already is a manager <braunr> a timer could be used as it's done on the hurd to make unbound threads die after a timeout <braunr> the distinction between the manager and spare threads would only be done at the kernel level <braunr> the server would simply make unbound threads wait on the port set <anatoly> How client sends signal to thread about its death (as I understand signal is not message) (sorry for noob question) <mcsim> in what you described there are no queues at all <braunr> anatoly: the kernel does it <braunr> mcsim: there is, in the kernel <braunr> the queue of spare threads <braunr> anatoly: don't apologize for noob questions eh <anatoly> braunr: is that client is a thread of some user space task? <braunr> i don't think it's a newbie topic at all <braunr> anatoly: a thread <mcsim> make these queue per cpu <braunr> why ? <braunr> there can be a lot less spare threads than processors <braunr> i don't think it's a good idea to spawn one thread per cpu per port set <braunr> on a large machine you'd have tons of useless threads <mcsim> if you have many useless threads, than assign 1 thread to several core, thus you will have twice less threads <mcsim> i mean dynamically <braunr> that becomes a hierarchical model <braunr> it does reduce contention, but it's complicated, and for now i'm not sure it's worth it <braunr> it could be a tunable though <mcsim> if you want something fast you should use something complicated. <braunr> really ? <braunr> a system call is very simple and very fast <braunr> :p <mcsim> why is it fast? 
<mcsim> you still have a lot of threads in kernel <braunr> but they don't interact during the system call <braunr> the system call itself is usually a simple instruction with most of it handled in hardware <mcsim> if you invoke "write" system call, what do you do in kernel? <braunr> you look up the function address in a table <mcsim> you still have queues <braunr> no <braunr> sorry wait <braunr> by system call, i mean "the transition from userspace to kernel space" <braunr> and the return <braunr> not the service itself <braunr> the equivalent on a microkernel system is sending a message from a client, and receiving it in a server, not processing the request <braunr> ideally, that's what l4 does: switching from one thread to another, as simply and quickly as the hardware can <braunr> so just a context and address space switch <mcsim> at some point you put something in queue even in monolithic kernel and make request to some other kernel thread <braunr> the problem here is the indirection that is the capability <braunr> yes but that's the service <braunr> i don't care about the service here <braunr> i care about how the request reaches the server <mcsim> this division exist for microkernels <mcsim> for monolithic it's all mixed <anatoly> What does thread do when it receive a message? <braunr> anatoly: what it wants :p <braunr> the service <braunr> mcsim: ? <braunr> mixed ? <anatoly> braunr: hm, is it a thread of some server? <mcsim> if you have several working threads in monolithic kernel you have to put request in queue <braunr> anatoly: yes <braunr> mcsim: why would you have working threads ? 
<mcsim> and there is no difference either you consider it as service or just "transition from userspace to kernel space" <braunr> i mean, it's a good thing to have, they usually do, but they're not implied <braunr> they're completely irrelevant to the discussion here <braunr> of course there is <braunr> you might very well perform system calls that don't involve anything shared <mcsim> you can also have only one working thread in microkernel <braunr> yes <mcsim> and all clients will wait for it <braunr> you're mixing up work queues in the discussion here <braunr> server threads are very similar to a work queue, yes <mcsim> but you gave me an example with 64 cores and each core runs some server thread <braunr> they're a thread pool handling requests <mcsim> you can have only one thread in a pool <braunr> they have to exist in a microkernel system to provide concurrency <braunr> monolithic kernels can process concurrently without them though <mcsim> why? <braunr> because on a monolithic system, _every client thread is its own server_ <braunr> a thread making a system call is exactly like a client requesting a service <braunr> on a monolithic kernel, the server is the kernel <braunr> and it *already* has as many threads as clients <braunr> and that's pretty much the only thing beautiful about monolithic kernels <mcsim> right <mcsim> have to think about it :) <braunr> that's why they scale so easily compared to microkernel based systems <braunr> and why l4 people chose to have thread-based ipc <braunr> but this just moves the problems to an upper level <braunr> and is probably why they've realized one of the real values of microkernel systems is capabilities <braunr> and if you want to make them fast enough, they should be handled directly by the kernel
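[Editor's sketch, not part of the log] The binding scheme braunr describes above (one unbound thread always waiting on the port set, a thread becoming bound to its client at the first message, and being recycled when the client dies) can be modelled roughly as follows. This is a toy simulation under stated assumptions, not x15 or GNU Mach code: the names `PortSet`, `send`, `client_died` and `lock_acquisitions` are invented for illustration, and a plain dictionary stands in for the lockless client-to-thread lookup mentioned in the discussion.

```python
import threading


class PortSet:
    """Toy model of a port set with bound and unbound (spare) server threads."""

    def __init__(self):
        self.lock = threading.Lock()   # protects the queue of unbound threads only
        self.spawned = 0
        self.spare = [self._spawn()]   # invariant: at least one unbound thread waits
        self.bound = {}                # client -> server thread (stand-in for a lockless lookup)
        self.lock_acquisitions = 0     # how often the potentially contended path ran

    def _spawn(self):
        """Hypothetical stand-in for creating a server thread (e.g. pthread_create)."""
        self.spawned += 1
        return f"st{self.spawned}"

    def send(self, client, message):
        """Deliver a message; only a client's first message touches the shared queue."""
        server = self.bound.get(client)
        if server is None:
            with self.lock:            # the "first time" cost discussed in the log
                self.lock_acquisitions += 1
                server = self.spare.pop()
                if not self.spare:     # keep one thread waiting for new clients
                    self.spare.append(self._spawn())
            self.bound[client] = server
        return server, message

    def client_died(self, client):
        """The kernel notifies the server thread, which becomes unbound again."""
        server = self.bound.pop(client, None)
        if server is not None:
            with self.lock:
                self.lock_acquisitions += 1
                self.spare.append(server)


port_set = PortSet()
first, _ = port_set.send("client-a", "open")
second, _ = port_set.send("client-a", "read")
print(first == second, port_set.lock_acquisitions)
```

In this model a long-lived client pays for the shared queue once, which matches the later remark that persistent applications should see little contention while workloads that start many short-lived processes hit the first-message path more often.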
<bddebian> Heya Richard. Solve the worlds problems yet? :) <kilobug> bddebian: I fear the worlds problems are NP-complete ;) <bddebian> heh <braunr> bddebian: i wish i could solve mine at least :p <bddebian> braunr: I meant the contention thing you were discussing the other day :) <braunr> bddebian: oh <braunr> i have a solution that improves the behaviour yes, but there is still contention the first time a thread performs an ipc <bddebian> Any thread or the first time there is contention? <braunr> there may be contention the first time a thread sends a message to a server <braunr> (assuming a server uses a single port set to receive requests) <bddebian> Oh aye <braunr> i think it's as much as can be done considering there is a translation from capability to thread <braunr> other schemes are just too heavy, and thus don't scale well <braunr> this translation is one of the two important nice properties of microkernel based systems, and translations (or indrections) usually have a cost <braunr> so we want to keep them <braunr> and we have to accept that cost <braunr> the amount of code in the critical section should be so small it should only matter for machines with several hundreds or thousands processors <braunr> so it's not such a bit problem <bddebian> OK <braunr> but it would have been nice to have an additional valid theoretical argument to explain how ipc isn't that slow compared to system calls <braunr> s/bit/big/ <braunr> people keep saying l4 made ipc as fast as system calls without taking that stuff into account <braunr> which makes the community look lame in the eyes of those familiar with it <bddebian> heh <braunr> with my solution, persistent applications like databases should perform as fast as on an l4 like kernel <braunr> but things like parallel builds, which start many different processes for each file, will suffer a bit more from contention <braunr> seems like a fair compromise to me <bddebian> Aye <braunr> as mcsim said, there is a lot of 
contention about everywhere in almost every application <braunr> and lockless stuff is hard to correctly implement <braunr> so it should be all right :) <braunr> ... :) <mcsim> braunr: What if we have at least 1 thread for each core that stay in per-core queue. When we decide to kill a thread and this thread is last in a queue we replace it with load balancer. This is still worse than with monolithic kernel, but it is simpler to implement from kernel perspective. <braunr> mcsim: it doesn't scale well <braunr> you end up with one thread per cpu per port set <mcsim> load balancer is only one thread <mcsim> why would it end up like you said? <braunr> remember the goal is to avoid contention <braunr> your proposition is to set per cpu queues <braunr> the way i understand what you said, it means clients will look up a server thread in these queues <braunr> one of them actually, the one for the cpu they're currently running on <braunr> so 1/ it disables migration <braunr> or 2/ you have one server thread per client per cpu <braunr> i don't see what a "load balancer" would do here <mcsim> client either finds server thread without contention or it sends message to load balancer, that redirects message to thread from global queue. Where global queue is concatenation of local ones. 
<braunr> you can't concatenate local queues in a global one <braunr> if you do that, you end up with a global queue, and a global lock again <mcsim> not global <mcsim> load balancer is just one <braunr> then you serialize all remote messaging through a single thread <mcsim> so contention will be only among local thread and load balancer <braunr> i don't see how it doesn't make the load balancer global <mcsim> it makes <mcsim> but it just makes bootstraping harder <braunr> i'm not following <braunr> and i don't see how it improves on my solution <mcsim> in your example with make -j64 very soon there will be local threads at any core <braunr> yes, hence the lack of scalability <mcsim> but that's your goal: create as many server thread as many clients you have, isn't it? <braunr> your solution may create a lot more <braunr> again, one per port set (or server) per cpu <braunr> imagine this worst case: you have a single client with one thread <braunr> which gets migrated to every cpu on the machine <braunr> it will spawn one thread per cpu at the server side <mcsim> why would it migrate all the time? <braunr> it's a worst case <braunr> if it can migrate, consider it will <braunr> murphy's law, you know <braunr> also keep in mind contention doesn't always occur with a global lock <braunr> i'm talking about potential contention <braunr> and same things apply: if it can happen, consider it will <mcsim> than we can make load balancer that also migrates server threads <braunr> ok so in addition to worker threads, we'll add an additional per server load balancer which may have to lock several queues at once <braunr> doesn't it feel completely overkill to you ? 
<mcsim> load balancer is global, not per-cpu <mcsim> there could be contention for it <braunr> again, keep in mind this problem becomes important for several hundreds processors, not below <braunr> yes but it has to balance <braunr> which means it has to lock cpu queues <braunr> and at least two of them to "migrate" server threads <braunr> and i don't know why it would do that <braunr> i don't see the point of the load balancer <mcsim> so, you start make -j64. First 64 invocations of gcc will suffer from contention for load balancer, but later on it will create enough server threads and contention will disappear <braunr> no <braunr> that's the best case : there is always one server thread per cpu queue <braunr> how do you guarantee your 64 server threads don't end up in the same cpu queue ? <braunr> (without disabling migration) <mcsim> load balancer will try to put some server thread to the core where load balancer was invoked <braunr> so there is no guarantee <mcsim> LB can pin server thread <braunr> unless we invoke it regularly, in a way similar to what is already done in the SMP scheduler :/ <braunr> and this also means one balancer per cpu then <mcsim> why one balance per cpu? <braunr> 15:56 < mcsim> load balancer will try to put some server thread to the core where load balancer was invoked <braunr> why only where it was invoked ? <mcsim> because it assumes that if some one asked for server at core x, it most likely will ask for the same service from the same core <braunr> i'm not following <mcsim> LB just tries to prefetch were next call will be <braunr> what you're describing really looks like per-cpu work queues ... <braunr> i don't see how you make sure there aren't too many threads <braunr> i don't see how a load balancer helps <braunr> this is just an heuristic <mcsim> when server thread is created? <mcsim> who creates it? 
<braunr> and it may be useless, depending on how threads are migrated and when they call the server <braunr> same answer as yesterday <braunr> there must be at least one thread receiving messages on a port set <braunr> when a message arrives, if there aren't any spare threads, it spawns one to receive messages while it processes the request <mcsim> at the moment server threads are killed by timeout, right? <braunr> yes <braunr> well no <braunr> there is a debian patch that disables that <braunr> because there is something wrong with thread destruction <braunr> but that's an implementation bug, not a design issue <mcsim> so it is the mechanism how we insure that there aren't too many threads <mcsim> it helps because yesterday I proposed to hierarchical scheme, were one server thread could wait in cpu queues of several cores <mcsim> but this has to be implemented in kernel <braunr> a hierarchical scheme would help yes <braunr> a bit <mcsim> i propose scheme that could be implemented in userspace <braunr> ? <mcsim> kernel should not distinguish among load balancer and server thread <braunr> sorry this is too confusing <braunr> please start describing what you have in mind from the start <mcsim> ok <mcsim> so my starting point was to use hierarchical management <mcsim> but the drawback was that to implement it you have to do this in kernel <mcsim> right? 
<braunr> no <mcsim> so I thought how can this be implemented in user space <braunr> being in kernel isn't the problem <braunr> contention is <braunr> on the contrary, i want ipc in kernel exactly because that's where you have the most control over how it happens <braunr> and can provide the best performance <braunr> ipc is the main kernel responsibility <mcsim> but if you have few clients you have low contention <braunr> the goal was "0 potential contention" <mcsim> and if you have many clients, you have many servers <braunr> let's say server threads <braunr> for me, a server is a server task or process <mcsim> right <braunr> so i think 0 potential contention is just impossible <braunr> or it requires too many resources that make the solution not scalable <mcsim> 0 contention is impossible, since you have imbalance in numbers of client threads and server threads <braunr> well no <braunr> it *can* be achieved <braunr> imagine servers register themselves to the kernel <braunr> and the kernel signals them when a client thread is spawned <braunr> you'd effectively have one server thread per client <braunr> (there would be other problems like e.g. when a server thread becomes the client of another, etc..) <braunr> so it's actually possible <braunr> but we clearly don't want that, unless perhaps for real time threads <braunr> but please continue <mcsim> what does "and the kernel signals them when a client thread is spawned" mean? <braunr> it means each time a thread not part of a server thread is created, servers receive a signal meaning "hey, there's a new thread out there, you might want to preallocate a server thread for it" <mcsim> and what is the difference with creating thread on demand? <braunr> on demand can occur when receiving a message <braunr> i.e. during syscall <mcsim> I will continue, I just want to be sure that I'm not basing on wrong assumptions. <mcsim> and what is bad in that? 
<braunr> (just to clarify, i use the word "syscall" with the same meaning as "RPC" on a microkernel system, whereas it's a true syscall on a monolithic one) <braunr> contention <braunr> whether you have contention on a list of threads or on map entries when allocating a stack doesn't matter <braunr> the problem is contention <mcsim> and if we create server thread always? <mcsim> and do not keep them in queue? <braunr> always ? <mcsim> yes <braunr> again <braunr> you'd have to allocate a stack for it <braunr> every time <braunr> so two potentially heavy syscalls to allocate/free the stac <braunr> k <braunr> not to mention the thread itself, its associations with its task, ipc space, maintaining reference counts <braunr> (moar contention) <braunr> creating threads was considered cheap at the time the process was the main unit of concurrency <mcsim> ok, than we will have the same contention if we will create a thread when "the kernel signals them when a client thread is spawned" <braunr> now we have work queues / thread pools just to avoid that <braunr> no <braunr> because that contention happens at thread creation <braunr> not during a syscall <braunr> i'll redefine the problem: the problem is contention during a system call / IPC <mcsim> ok <braunr> note that my current solution is very close to signalling every server <braunr> it's the lazy version <braunr> match at first IPC time <mcsim> so I was basing my plan on the case when we create new thread when client makes syscall and there is not enough server threads <braunr> the problem exists even when there is enough server threads <braunr> we shouldn't consider the case where there aren't enough server threads <braunr> real time tasks are the only ones which want that, and can preallocate resources explicitely <mcsim> I think that real time tasks should be really separated <mcsim> For them resource availability as much more important that good resource utilisation. 
<mcsim> So if we talk about real time tasks we should apply one police and for non-real time another <mcsim> So it shouldn't be critical if thread is created during syscall <braunr> agreed <braunr> that's what i was saying : <braunr> :) <braunr> 16:23 < braunr> we shouldn't consider the case where there aren't enough server threads <braunr> in this case, we spawn a thread, and that's ok <braunr> it will live on long enough that we really don't care about the cost of lazily creating it <braunr> so let's concentrate only on the case where there already are enough server threads <mcsim> So if client makes a request to ST (is it ok to use abbreviations?) there are several cases: <mcsim> 1/ There is ST waiting on local queue (trivial case) <mcsim> 2/ There is no ST, only load balancer (LB). LB decides to create a new thread <mcsim> 3/ Like in previous case, but LB decides to perform migration <braunr> migration of what ? <mcsim> migration of ST from other core <braunr> the only case effectively solving the problem is 1 <braunr> others introduce contention, and worse, complex code <braunr> i mean a complex solution <braunr> not only code <braunr> even the addition of a load balancer per port set <braunr> thr data structures involved for proper migration <mcsim> But 2 and 3 in long run will lead to having enough threads on all cores <braunr> then you end up having 1 per client per cpu <mcsim> migration is needed in any case <braunr> no <braunr> why would it be ? <mcsim> to balance load <mcsim> not only for this case <braunr> there already is load balancing in the scheduler <braunr> we don't want to duplicate its function <mcsim> what kind of load balancing? <mcsim> *has scheduler <braunr> thread weight / cpu <mcsim> and does it perform migration? 
<braunr> sure <mcsim> so scheduler can be simplified if policy "when to migrate" will be moved to user space <braunr> this is becoming a completely different problem <braunr> and i don't want to do that <braunr> it's very complicated for no real world benefit <mcsim> but all this will be done in userspace <braunr> ? <braunr> all what ? <mcsim> migration decisions <braunr> in your scheme you mean ? <mcsim> yes <braunr> explain how <mcsim> LB will decide when thread will migrate <mcsim> and LB is user space task <braunr> what does it bring ? <braunr> imagine that, in the mean time, the scheduler then decides the client should migrate to another processor for fairness <braunr> you'd have migrated a server thread once for no actual benefit <braunr> or again, you need to disable migration for long durations, which sucks <braunr> also <braunr> 17:06 < mcsim> But 2 and 3 in long run will lead to having enough threads on all cores <braunr> contradicts the need for a load balancer <braunr> if you have enough threads every where, why do you need to balance ? <mcsim> and how are you going to deal with the case when client will migrate all the time? <braunr> i intend to implement something close to thread migration <mcsim> because some of them can die because of timeout <braunr> something l4 already does iirc <braunr> the thread scheduler manages scheduling contexts <braunr> which can be shared by different threads <braunr> which means the server thread bound to its client will share the scheduling context <braunr> the only thing that gets migrated is the scheduling context <braunr> the same way a thread can be migrated indifferently on a monolithic system, whether it's in user of kernel space (with kernel preemption enabled ofc) <braunr> or* <mcsim> but how server thread can process requests from different clients? 
<braunr> mcsim: load becomes a problem when there are too many threads, not when they're dying <braunr> they can't <braunr> at first message, they're *bound* <braunr> => one server thread per client <braunr> when the client dies, the server thread is ubound and can be recycled <braunr> unbound* <mcsim> and you intend to put recycled threads to global queue, right? <braunr> yes <mcsim> and I propose to put them in local queues in hope that next client will be on the same core <braunr> the thing is, i don't see the benefit <braunr> next client could be on another <braunr> in which case it gets a lot heavier than the extremely small critical section i have in mind <mcsim> but most likely it could be on the same <braunr> uh, no <mcsim> becouse on this load on this core is decreased <mcsim> *because <braunr> well, ok, it would likely remain on the same cpu <braunr> but what happens when it migrates ? <braunr> and what about memory usage ? <braunr> one queue per cpu per port set can get very large <braunr> (i understand the proposition better though, i think) <mcsim> we can ask also "What if random access in memory will be more usual than sequential?", but we still optimise sequential one, making random sometimes even worse. The real question is "How can we maximise benefit of knowledge where free server thread resides?" <mcsim> previous was reply to: "(17:17:08) braunr: but what happens when it migrates ?" <braunr> i understand <braunr> you optimize for the common case <braunr> where a lot more ipc occurs than migrations <braunr> agreed <braunr> now, what happens when the server thread isn't in the local queue ? <mcsim> than client request will be handled to LB <braunr> why not search directly itself ? 
<braunr> (and btw, the right word is "then") <mcsim> LB can decide whom to migrate <mcsim> right, sorry <braunr> i thought you were improving on my scheme <braunr> which implies there is a 1:1 mapping for client and server threads <mcsim> If job of LB is too small than it can be removed and everything will be done in kernel <braunr> it can't be done in userspace anyway <braunr> these queues are in the port / port set structures <braunr> it could be done though <braunr> i mean <braunr> using per cpu queues <braunr> server threads could be both in per cpu queues and in a global queue as long as they exist <mcsim> there should be no global queue, because there again will be contention for it <braunr> mcsim: accessing a load balancer implies contention <braunr> there is contention anyway <braunr> what you're trying to do is reduce it in the first message case if i'm right <mcsim> braunr: yes <braunr> well then we have to revise a few assumptions <braunr> 17:26 < braunr> you optimize for the common case <braunr> 17:26 < braunr> where a lot more ipc occurs than migrations <braunr> that actually becomes wrong <braunr> the first message case occurs for newly created threads <mcsim> for make -j64 this is actually common case <braunr> and those are usually not spawn on the processor their parent runs on <braunr> yes <braunr> if you need all processors, yes <braunr> i don't think taking into account this property changes many things <braunr> per cpu queues still remain the best way to avoid contention <braunr> my problem with this solution is that you may end up with one unbound thread per processor per server <braunr> also, i say "per server", but it's actually per port set <braunr> and even per port depending on how a server is written <braunr> (the system will use one port set for one server in the common case but still) <braunr> so i'll start with a global queue for unbound threads <braunr> and the day we decide it should be optimized with local (or hierarchical) queues, 
we can still do it without changing the interface <braunr> or by simply adding an option at port / port set creation <braunr> which is a non intrusive change <mcsim> ok. your solution should be simpler. And TBH, what I propose is not clearly much more gainful. <braunr> well it is actually for big systems <braunr> it is because instead of grabbing a lock, you disable preemption <braunr> which means writing to a local, uncontended variable <braunr> with 0 risk of cache line bouncing <braunr> this actually looks very good to me now <braunr> using an option to control this behaviour <braunr> and yes, in the end, it gets very similar to the slab allocator, where you can disable the cpu pool layer with a flag :) <braunr> (except the serialized case would be the default one here) <braunr> mcsim: thanks for insisting <braunr> or being persistent <mcsim> braunr: thanks for conversation :) <mcsim> and probably I had to start from statement that I wanted to improve common case
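[Editor's sketch, not part of the log] The trade-off the two settle on above, a global queue of unbound threads by default with per-CPU queues as an option chosen at port set creation, can be illustrated with a small simulation. Everything here is hypothetical: `UnboundQueues`, its `per_cpu` flag and `run_short_lived_clients` are invented names, a counter stands in for taking the global lock, and the per-CPU path only models the "disable preemption and touch local data" idea by not incrementing that counter.

```python
class UnboundQueues:
    """Unbound server threads of one port set: one global queue, or one queue per CPU."""

    def __init__(self, ncpus, per_cpu=False):
        self.per_cpu = per_cpu          # the serialized (global) case is the default
        self.global_queue = []
        self.local = [[] for _ in range(ncpus)]
        self.global_lock_taken = 0      # counts trips through the shared, contended path

    def take(self, cpu):
        """Return an unbound thread for a first message sent from `cpu`, or None."""
        if self.per_cpu:
            # Local, uncontended access: no shared lock involved in this model.
            return self.local[cpu].pop() if self.local[cpu] else None
        self.global_lock_taken += 1
        return self.global_queue.pop() if self.global_queue else None

    def put(self, cpu, thread):
        """Recycle a thread whose client died while running on `cpu`."""
        if self.per_cpu:
            self.local[cpu].append(thread)
        else:
            self.global_lock_taken += 1
            self.global_queue.append(thread)

    def unbound_count(self):
        return len(self.global_queue) + sum(len(q) for q in self.local)


def run_short_lived_clients(queues, cpus):
    """Each client sends one first message from its CPU, then dies. Returns threads spawned."""
    spawned = 0
    for cpu in cpus:
        thread = queues.take(cpu)
        if thread is None:              # no spare thread reachable: spawn one
            spawned += 1
            thread = f"st{spawned}"
        queues.put(cpu, thread)         # client death unbinds the thread again
    return spawned


arrivals = [0, 1, 2, 3, 0, 1, 2, 3]     # clients spread over four CPUs, e.g. a parallel build
for per_cpu in (False, True):
    queues = UnboundQueues(ncpus=4, per_cpu=per_cpu)
    spawned = run_short_lived_clients(queues, arrivals)
    print(per_cpu, spawned, queues.unbound_count(), queues.global_lock_taken)
```

With these arrivals the global queue reuses a single thread but goes through the shared lock on every take and put, while the per-CPU variant never touches it and ends up holding one unbound thread per CPU, which is the memory cost braunr objects to.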
<congzhang> braunr: how about your x15, it is impovement for mach or redesign? I really want to know that:) <braunr> it's both largely based on mach and now quite far from it <braunr> based on mach from a functional point of view <braunr> i.e. the kernel assumes practically the same functions, with a close interface <congzhang> Good point:) <braunr> except for ipc which is entirely rewritten <braunr> why ? :) <congzhang> for from a functional point of view:) I think each design has it intrinsic advantage and disadvantage <braunr> but why is it good ? <congzhang> if redesign , I may need wait more time to a new function hurd <braunr> you'll have to wait a long time anyway :p <congzhang> Improvement was better sometimes, although redesign was more attraction sometimes :) <congzhang> I will wait :) <braunr> i wouldn't put that as a reason for it being good <braunr> this is a departure from what current microkernel projects are doing <braunr> i.e. x15 is a hybrid <congzhang> Sure, it is good from design too:) <braunr> yes but i don't see why you say that <congzhang> Sorry, i did not show my view clear, it is good from design too:) <braunr> you're just saying it's good, you're not saying why you think it's good <congzhang> I would like to talk hybrid, I want to talk that, but I am a litter afraid that you are all enthusiasm microkernel fans <braunr> well no i'm not <braunr> on the contrary, i'm personally opposed to the so called "microkernel dogma" <braunr> but i can give you reasons why, i'd like you to explain why *you* think a hybrid design is better <congzhang> so, when I talk apple or nextstep, I got one soap :) <braunr> that's different <braunr> these are still monolithic kernels <braunr> well, monolithic systems running on a microkernel <congzhang> yes, I view this as one type of hybrid <braunr> no it's not <congzhang> microkernel wan't to divide process ( task ) from design view, It is great <congzhang> as implement view or execute view, we have one cpu and 
some physic memory, as the simplest condition, we can't change that <congzhang> that what resource the system has <braunr> what's your point ? <congzhang> I view this as follow <congzhang> I am cpu and computer <congzhang> application are the things I need to do <congzhang> for running the program and finish the job, which way is the best way for me <congzhang> I need keep all the thing as simple as possible, divide just from application design view, for me no different <congzhang> desgin was microkernel , run just for one cpu and these resource. <braunr> (well there can be many processors actually) <congzhang> I know, I mean hybrid at some level, we can't escape that <congzhang> braunr: I show my point? <braunr> well l4 systems showed we somehow can <braunr> no you didn't <congzhang> x15's api was rpc, right? <braunr> yes <braunr> well a few system calls, and mostly rpcs on top of the ipc one <braunr> jsu tas with mach <congzhang> and you hope the target logic run locally just like in process function call, right? <braunr> no <braunr> it can't run locally <congzhang> you need thread context switch <braunr> and address space context switch <congzhang> but you cut down the cost <braunr> how so ? <congzhang> I mean you do it, right? <congzhang> x15 <braunr> yes but no in this way <braunr> in every other way :p <congzhang> I know, you remeber performance anywhere :p <braunr> i still don't see your point <braunr> i'd like you to tell, in one sentence, why you think hybrids are better <congzhang> balance the design and implement problem :p <braunr> which is ? 
<congzhang> hybird for kernel arc <braunr> you're stating the solution inside the problem <congzhang> you are good at mathmatics <congzhang> sorry, I am not native english speaker <congzhang> braunr: I will find some more suitable sentence to show my point some day, but I can't find one if you think I did not show my point:) <congzhang> for today <braunr> too bad <congzhang> If i am computer I hope the arch was monolithic, If i am programer I hope the arch was microkernel, that's my idea <braunr> ok let's get a bit faster <braunr> monolithic for performance ? <congzhang> braunr: sorry for that, and thank you for the talk:) <braunr> (a computer doesn't "hope") <congzhang> braunr: you need very clear answer, I can't give you that, sorry again <braunr> why do you say "If i am computer I hope the arch was monolithic" ? <congzhang> I know you can slove any single problem <braunr> no i don't, and it's not about me <braunr> i'm just curious <congzhang> I do the work for myself, as my own view, all the resource belong to me, I does not think too much arch related divide was need, if I am the computer :P <braunr> separating address spaces helps avoiding serious errors like corrupting memory of unrelated subsystems <braunr> how does one not want that ? <braunr> (except for performance) <congzhang> braunr: I am computer when I say that words! <braunr> a computer doesn't want anything <braunr> users (including developers) on the other way are the point of view you should have <congzhang> I am engineer other time <congzhang> we create computer, but they are lifeable just my feeling, hope not talk this topic <braunr> what ? 
<congzhang> I mark computer as life things <braunr> please don't <braunr> and even, i'll make a simple example in favor of isolating resources <braunr> if we, humans, were able to control all of our "resources", we could for example shut down our heart by mistake <congzhang> back to the topic, I think monolithic was easy to understand, and cut the combinatorial problem count for the perfect software <braunr> the reason the body have so many involuntary functions is probably because those who survived did so because these functions were involuntary and controlled by separated physiological functions <braunr> now that i've made this absurd point, let's just not consider computers as life forms <braunr> microkernels don't make a system that more complicated <congzhang> they does <braunr> no <congzhang> do <braunr> they create isolation <braunr> and another layer of indirection with capabilities <braunr> that's it <braunr> it's not that more complicated <congzhang> view the kernel function from more nature view, execute some code <braunr> what ? 
<congzhang> I know the benefit of the microkernel and the os <congzhang> it's complicated <braunr> not that much <congzhang> I agree with you <congzhang> microkernel was the idea of organization <braunr> yes <braunr> but always keep in mind your goal when thinking about means to achieve them <congzhang> we do the work at diferent view <kilobug> what's quite complicated is making a microkernel design without too much performances loss, but aside from that performances issue, it's not really much more complicated <congzhang> hurd do the work at os level <kilobug> even a monolithic kernel is made of several subsystems that communicated with each others using an API <core-ix> i'm reading this conversation for some time now <core-ix> and I have to agree with braunr <core-ix> microkernels simplify the design <braunr> yes and no <braunr> i think it depends a lot on the availability of capabilities <core-ix> i have experience mostly with QNX and i can say it is far more easier to write a driver for QNX, compared to Linux/BSD for example ... <braunr> which are the major feature microkernels usually add <braunr> qnx >= 5 do provide capabilities <braunr> (in the form of channels) <core-ix> yeah ... it's the basic communication mechanism <braunr> but my initial and still unanswered question was: why do people think a hybrid kernel is batter than a true microkernel, or not <braunr> better* <congzhang> I does not say what is good or not, I just say hybird was accept <braunr> core-ix: and if i'm right, they're directly implemented by the kernel, and not a userspace system server <core-ix> braunr: evolution is more easily accepted than revolution :) <core-ix> braunr: yes, message passing is in the QNX kernel <braunr> not message passing, capabilities <braunr> l4 does message passing in kernel too, but you need to go through a capability server <braunr> (for the l4 variants i have in mind at least) <congzhang> the operating system evolve for it's application. 
<braunr> congzhang: about evolution, that's one explanation, but other than that ? <braunr> core-ix: ^ <core-ix> braunr: by capability you mean (for the lack of a better word i'll use) access control mechanisms? <braunr> i mean reference-rights <core-ix> the "trusted" functionality available in other OS? <braunr> http://en.wikipedia.org/wiki/Capability-based_security <braunr> i don't know what other systems refer to with "trusted" functionnality <core-ix> yeah, the same thing <congzhang> for now, I am searching one way to make hurd arm edition suitable for Raspberry Pi <congzhang> I hope design or the arch itself cant scale <congzhang> can be scale <core-ix> braunr: i think (!!!) that those are implemented in the Secure Kernel (http://www.qnx.com/products/neutrino-rtos/secure-kernel.html) <core-ix> never used it though ... <congzhang> rpc make intercept easy :) <braunr> core-ix: regular channels are capabilities <core-ix> yes, and by extensions - they are in the kenrel <braunr> that's my understanding too <braunr> and that one thing that, for me, makes qnx an hybrid as well <congzhang> just need intercept in kernel, <core-ix> braunr: i would dive the academic aspects of this ... in my mind a microkernel is system that provides minimal hardware abstraction, communication primitives (usually message passing), virtual memory protection <core-ix> *wouldn't ... <braunr> i think it's very important on the contrary <braunr> what you describe is the "microkernel dogma" <braunr> precisely <braunr> that doesn't include capabilities <braunr> that's why l4 messaging is thread-based <braunr> and that's why l4 based systems are so slow <braunr> (except okl4 which put back capabilities in the kernel) <core-ix> so the compromise here is to include capabilities implementation in the kernel, thus making the final product hybrid? 
<braunr> not only <braunr> because now that you have them in kernel <braunr> the kernel probably has to manage memory for itself <braunr> so you need more features in the virtual memory system <core-ix> true ... <braunr> that's what makes it a hybrid <braunr> other ways being making each client provide memory, but that's when your system becomes very complicated <core-ix> but I believe this is true for pretty much any "general OS" case <braunr> and some resources just can't be provided by a client <braunr> e.g. a client can't provide virtual memory to another process <braunr> okl4 is actually the only pragmatic real-world implementation of l4 <braunr> and they also added unix-like signals <braunr> so that's an interesting model <braunr> as well as qnx <braunr> the good thing about the hurd is that, although it's not kernel agnostic, it doesn't require a lot from the underlying kernel <core-ix> about hurd? <braunr> yes <core-ix> i really need to dig into this code at some point :) <braunr> well you may but you may not see that property from the code itself
<teythoon> so tell me about x15 if you are in the mood to talk about that <braunr> what do you want to know ? <teythoon> well, the high level stuff first <teythoon> like what's the big picture <braunr> the big picture is that x15 is intended to be a "better mach for the hurd <braunr> " <braunr> mach is too general purpose <braunr> its ipc mechanism too powerful <braunr> too complicated, error prone, and slow <braunr> so i intend to build something a lot simpler and faster :p <teythoon> so your big picture includes actually porting hurd? i thought i read somewhere that you have a rewrite in mind <braunr> it's a clone, yes <braunr> x15 will feature mostly sync ipc, and no high level types inside messages <braunr> the ipc system call will look like what qnx does <braunr> send-recv from the client, recv/reply/reply-recv from the server <teythoon> but doesn't sync mean that your context switch will have to be quite fast? <braunr> how does that differ from the async approach ? <braunr> (keep in mind that almost all hurd RPCs are synchronous) <teythoon> yes, I know, and it also affects async mode, but a slow switch is worse for the sync case, isn't it? <teythoon> ok so your ipc will be more agnostic wrt to what it transports? unlike mig I presume? <braunr> no it's the same <braunr> yes <braunr> input will be an array, each entry denoting either memory or port rights <braunr> (or directly one entry for fast ipcs) <teythoon> memory as in pointers? <braunr> (well fast ipc when there is only one entry to avoid hitting a table) <braunr> pointer/size yes <teythoon> hm, surely you want a way to avoid copying that, right? <braunr> the only operation will be copy (i.e. unlike mach which allows sharing) <braunr> why ? <braunr> copy doesn't exclude zero copy <braunr> (zero copy being adjusting page tables with copy on write techniques) <teythoon> right <teythoon> but isn't that too coarse, like in cow a whole page? 
<braunr> depends on the message size <braunr> or options provided by the caller, i don't know yet <teythoon> oh, you are going to pack the memory anyway? <braunr> depends on the caller <braunr> i'm not yet sure about these details <braunr> ideally, i'd like to avoid serialization altogether <teythoon> wouldn't that be like cheating b/c it's the first copy? <braunr> directly pass pointers/sizes from the sender address space, and either really copy or use zero copy <teythoon> right, but then you're back at the page size issue <braunr> yes <braunr> it's not a real issue <braunr> the kernel must support both ways <braunr> the minor issue is determining which way to choose <braunr> it's not a critical issue <braunr> my current plan is to always copy, unless the caller has explicitely set a flag and is passing properly aligned buffers <teythoon> u sure? I mean the caller is free to arange the stuff he intends to send anyway he likes, how are you going to cow that then? <teythoon> ok <teythoon> right <braunr> properly aligned buffers :) <braunr> otherwise the kernel rejects the request <teythoon> that's reasonable, yes <braunr> in addition to being synchronous, ipc will also take a special path in the scheduler to directly use the client scheduling context <braunr> avoiding the sleep/wakeup overhead, and providing priority inheritence by side effect <teythoon> uh, but wouldn't dropping serialization create security and reliability issues? if the receiver isn't doing a proper job sanitizing its stuff <braunr> why would the client not sanitize ? 
<braunr> err <braunr> server <braunr> it has to anyway <teythoon> sure, but a proper parser written once might be more robust, even if it adds overhead <teythoon> the serialization i mean <braunr> it's just a layer <braunr> even with high level types, you still need to sanitize <braunr> the real downside is loosing cross architecture portability <braunr> making the potential implementation of a single system image a lot more restricted or difficult <braunr> but i don't care about that much <braunr> mach was built with this in mind though <teythoon> it's a nice idea, but i don't believe anyone does ssi anymore <braunr> i don't know <teythoon> and certainly not across architectures <braunr> there are few projects <braunr> anyway it's irrelevant currently <braunr> and my interface just restricts it, it doesn't prevent it <braunr> so i consider it an acceptable compromise <teythoon> so, does it run? what does it do? <teythoon> it certainly is, yes <braunr> for now, it manages memory (physical, virtual, kernel, and soon, anonymous) <braunr> support multiple processors with the required posix scheduling policies <braunr> (it uses a cute proportionally fair time sharing algorithm) <braunr> there are locks (spin locks, mutexes, condition variables) and lockless stuff (à la rcu) <braunr> both x86 and x86_64 are supported <braunr> (even pae) <braunr> work queues <teythoon> sounds impressive :) <braunr> :) <braunr> i also added basic debugging <braunr> stack trace (including getting the symbol table) handling <braunr> so yes, it's much much better than what i previously did <braunr> and on the right track <braunr> it already scales a lot better than mach for what it does <braunr> there are generic data structures (linked list, red-black tree, radix tree) <braunr> the radix tree supports lockless lookups, so looking up both the page cache and the ipc spaces is lockless) <teythoon> that's nice :) <braunr> there are a few things using global locks, but there are TODOs about them 
<braunr> even with that, it should be scalable enough for a start <braunr> and improving those parts shouldn't be too difficult
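The message layout braunr describes above is an array of entries, each denoting either memory (pointer/size) or port rights. Memory is always copied unless the caller has explicitly set a flag and passes properly aligned buffers; otherwise the kernel rejects the request. This could be sketched in C roughly as follows. The log states that no IPC code exists yet, so every name here (`ipc_entry`, `IPC_FLAG_ZERO_COPY`, `ipc_entry_check`) and the 4 KiB page size are assumptions made for illustration, not the x15 interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE          4096u  /* assumed page size */
#define IPC_FLAG_ZERO_COPY 0x1u   /* hypothetical flag set by the caller */

enum ipc_entry_type { IPC_ENTRY_MEMORY, IPC_ENTRY_RIGHT };

/* One entry of the (hypothetical) message array. */
struct ipc_entry {
    enum ipc_entry_type type;
    unsigned int flags;
    union {
        struct { uintptr_t addr; size_t size; } mem; /* pointer/size pair */
        unsigned int port_name;                      /* port right */
    } u;
};

enum ipc_action { IPC_REJECT, IPC_COPY, IPC_REMAP, IPC_TRANSFER_RIGHT };

static int page_aligned(uintptr_t value)
{
    return (value & (PAGE_SIZE - 1)) == 0;
}

/*
 * Decide how a single entry would be transferred:
 * - port rights are transferred as rights;
 * - memory is really copied by default;
 * - zero copy (page remapping with copy-on-write) is used only when the
 *   caller asked for it and the buffer is page-aligned in address and size;
 * - a zero-copy request on a misaligned buffer is rejected.
 */
enum ipc_action ipc_entry_check(const struct ipc_entry *entry)
{
    assert(entry != NULL);

    if (entry->type == IPC_ENTRY_RIGHT)
        return IPC_TRANSFER_RIGHT;

    if (!(entry->flags & IPC_FLAG_ZERO_COPY))
        return IPC_COPY;

    if (entry->u.mem.size == 0
        || !page_aligned(entry->u.mem.addr)
        || !page_aligned((uintptr_t)entry->u.mem.size))
        return IPC_REJECT;

    return IPC_REMAP;
}
```

Rejecting misaligned zero-copy requests, rather than silently falling back to a copy, follows braunr's "otherwise the kernel rejects the request"; a real interface might choose differently.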
<nlightnfotis> braunr: From what I have understood you aim for x15 to be a production ready μ-kernel for usage in the Hurd? Or is it unrelated to the Hurd? <braunr> nlightnfotis: it's for a hurd clone <nlightnfotis> braunr: I see. Is it close to any of the existing microkernels as far as its design is concerned (L4, Viengoos) or is it new research? <braunr> it's close to mach <braunr> and qnx
<braunr> making progress on x15 pmap module <braunr> factoring code for mapping creation/removal on current/kernel and remote processes <braunr> also started "swap emulation" by reserving some physical memory to act as swap backing store <braunr> which will allow creating memory pressure very early in the development process
<nlightnfotis> braunr: something a little bit irrelevant: how many things are missing from mach to be considered a solid base for the Hurd? Is it only SMP and x86_64 support? <braunr> define "solid base for the hurd" <nlightnfotis> solid enough to not look for a replacement for it <braunr> then i'd say, from my very personal point of view, that you want x15 <nlightnfotis> I didn't understand this. Are you planning for x15 to be a better mach? <braunr> with a different interface, so not compatible <braunr> and thus, not mach <nlightnfotis> is the source code for it available? Can I read it somewhere? <braunr> the implied answer being: no, mach isn't a solid base for the hurd considering your definition <braunr> http://git.sceen.net/rbraun/x15.git/ <nlightnfotis> thanks. for that. So it's definite that mach won't stay for long as the Hurd's base, right? <braunr> it will, for long <braunr> my opinion is that it needs to be replaced <nlightnfotis> is it possible that it (slowly) gets rearchitected into what's being considered a second generation microkernel, or is it hopeless? <braunr> it would require a new interface <braunr> you can consider x15 to be a modern mach, with that new interface <braunr> from a high level view, it's very similar (it's a hybrid, with both scheduling and virtual memory management in the kernel) <braunr> ipc change a lot
<braunr> for those of us interested in x15 and scalability in general: http://darnassus.sceen.net/~rbraun/radixvm_scalable_address_spaces_for_multithreaded_applications.pdf <braunr> finally an implementation allowing memory mapping to occur concurrently <braunr> (which is another contention issue when using mach-like ipc, which often do need to allocate/release virtual memory)
<rah> braunr: http://git.sceen.net/rbraun/x15.git/blob/HEAD:/README <rah> "X15 is a free microkernel." <rah> braunr: what distinguishes it from existing microkernels?
<braunr> rah: the next part maybe ? <braunr> "Its purpose is to provide a foundation for a Hurd-like operating system." <rah> braunr: there are already microkernels that canbe used as the foundatin for Hurd=like operating systems; why are you creating another one? <rah> braunr: what distinguishes your microkernel from existing microkernels? <tschwinge> rah: http://www.gnu.org/software/hurd/microkernel/mach/deficiencies.html <braunr> rah: it's better :) <braunr> rah: and please, cite one suitable kernel for the hurd <rah> tschwinge: those are deficiencies in Mach; I'm asking about x15 <rah> braunr: in what way is it better exactly? <braunr> rah: more performant, more scalable <rah> braunr: how? <braunr> better algorithms, better interfaces <braunr> for example, it supports smp <rah> ah <rah> it supports SMP <rah> ok <rah> that's one thing <braunr> it implements lockless synchronization à la rcu <rah> are there any others? <rah> ok <rah> lockless sync <rah> anything else? <braunr> it can scale from 4MB of physical memory up to several hundreds GiB <braunr> ipc is completely different, leading to simpler code, less data involved, faster context switches <braunr> (although there is no code for that yet) <rah> how can it support larger memory while other microkernels can't? <rah> how is the ipc "different"? <braunr> others can <braunr> gnumach doesn't <rah> how can it support larger memory while gnumach can't? <azeem_> because it's not the same code base? <braunr> gnumach doesn't support temporary kernel mapping <rah> ok, so x15 supports temporary kernel mapping <braunr> not exactly <braunr> virtual memory is completely different <rah> how so? 
<braunr> gnumach does the same as linux, physical memory is mapped in kernel space <braunr> so you can't have more physical memory than you have kernel space <braunr> which is why gnumach can't handle more than 1.8G right now <braunr> it's a 2/2 split <braunr> in x15, the kernel maps what it needs <braunr> and can map it from anywhere in physical memory <tschwinge> rah: I think basically all this has already been discussed before and captured on that page? <braunr> it already supports i386/pae/amd64 <rah> I see <braunr> the drawback is that it needs to update kernel page tables more often <braunr> on linux, a small part of the kernel space is reserved for temporary mappings, which need page table updates too <braunr> but most allocations don't use that <braunr> it's complicated <braunr> also, i plan to make virtual memory operations completely concurrent on x15, similar to what is described in radixvm <rah> ok <braunr> which means mapping operations on non overlapping regions won't be serialized <braunr> a big advantage for microkernels which base their messaging optimizations on mapping <braunr> so simply put, better performance because of simpler ipc and data structures, and better scalability because of improved data structure algorithms and concurrency <rah> tschwinge: yes but that page is no use to someone who wants a summary of what distinguishes x15 <braunr> x15 is still far from complete, which is why i don't advertise it other than here <rah> "release early, release often"? <braunr> give it a few more years :p <braunr> release what ? <braunr> something that doesn't work ? 
<rah> software <rah> yes <braunr> this release early practice applies to maintenance <rah> release something that doesn't work so that others can help make it work <braunr> not big developments <braunr> i don't want that for now <braunr> i have a specific idea of what i want, and both explaining and defending it would take time, better spent in development itself <braunr> just wait for a first prototype <braunr> and then you'll see if you want to help or not * rah does not count himself as one of the "others" who might help make it work <braunr> one big difference with other microkernels is that x15 is specifically intended to run a unix like system <braunr> a hurd like system providing a posix interface more accurately <braunr> and efficiently <braunr> so for example, while many microkernels provide only sync ipc, x15 provides both sync ipc and signals <braunr> and then, there are a lot of small optimizations, like port names which will transparently identify as file descriptors <braunr> light reference counting <braunr> a restriction on ipc that only allows reliable transfers across network to machines with same arch and endianness <braunr> etc..
Created on 2013-09-29 by wiki user BobHam, rah on IRC.
The x15 microkernel is under development by Richard Braun. Overall, x15 is intended to provide better performance through simpler IPC and data structures, and better scalability through improved data structure algorithms and concurrency.
The following specific features are intended to distinguish x15 from other microkernels. However, it should be noted that the microkernel is under heavy development and so the list may (and almost certainly will) change.
- SMP support
- Lockless synchronisation à la RCU
- Support for large amounts of physical memory. GNU Mach does the same as Linux: physical memory is mapped in kernel space, so you can't have more physical memory than you have kernel space. This is why GNU Mach can't handle more than 1.8G right now (it's a 2/2 split). In x15, the kernel maps what it needs and can map it from anywhere in physical memory; the drawback is that it needs to update kernel page tables more often.
- Virtual memory operations are planned to be completely concurrent on x15, similar to what is described in radixvm
- Intended to efficiently run a Hurd-like system providing a POSIX interface
- Providing both synchronous IPC and signals, as opposed to just synchronous IPC
- Port names which will transparently identify as file descriptors
- Light reference counting
- A restriction on IPC that only allows reliable transfers across network to machines with same arch and endianness
<zacts> braunr: are you still working on x15/propel? * zacts checks the git logs <braunr> zacts: taking a break for now, will be back on it when i have a clearer view of the new vm system
<gnufreex> braunr, few questions about x15. I was reading IRC logs on hurd site, and in the latest part, you say (or I misunderstood) that x15 is now hybrid kernel. So what made you change design... or did you? <braunr> gnufreex: i always intended to go for a hybrid
<zacts> braunr: when do you plan to start on x15/propel again? <braunr> zacts: after i'm done with thread destruction on the hurd
<zacts> and do you plan to actually run hurd on top of x15, or are you still going to reimplement hurd as propel? <braunr> and no, i don't intend to run the hurd on top of x15
<neal> braunr: What is your Mach replacement doing? <braunr> "what" ? :) <braunr> you mean how i guess <neal> Sure. <braunr> well it's not a mach replacement any more <braunr> and for now it's stalled while i'm working on the hurd <neal> that could be positive :) <braunr> it's in good shape <neal> how did it diverge? <braunr> sync ipc, with unix-like signals <braunr> and qnx-like bare data messages <neal> hmm, like okl5? <braunr> (with scatter gather) <neal> okl4 <braunr> yes <braunr> btw, if you can find a document that explains this property of okl4, i'm interested, since i can't find it again on my own :/ <braunr> basically, x15 has a much lighter ipc interface <neal> capabilities? <braunr> mach ports are mostly retained <braunr> but reference counting will be simplified <neal> hmm <neal> I don't like the reference counting part <braunr> port names will be plain integers, to directly be usable as file descriptors and avoid a useless translation layer <braunr> (same as in qnx) <neal> this sounds like future tense <braunr> there is no ipc code yet <neal> so I guess this stuff is not implemented <neal> ok. <braunr> next step is virtual memory <braunr> and i'm taking my time because i want it to be a killer feature <neal> so if you don't IPC and you don't have VM, what do you have? :) <braunr> i have multiprocessor multithreading <neal> I see. 
<braunr> mutexes, condition variables, rcu-like lockless synchronization, work queues <braunr> basic bsd-like virtual memory <braunr> which i want to rework <neal> I ignored all of that in Viengoos :) <braunr> and since ipc will still depend on virtual memory for zero-copy, i want the vm system to be right <braunr> well, i'm more interested in the implementation than the architecture <braunr> for example, i have unpublished code that features a lockless radix tree for vm_object lookups <braunr> that's quite new for a microkernel based system, but the ipc interface itself is very similar to what already exists <braunr> your half-sync ipc are original :) <neal> I'm considering getting back in the OS game. <braunr> oh <neal> But, I'm not going to write a kernel this time. <braunr> did anyone here consider starting a company for such things, like genode did ? <neal> I was considering using genode as a base. <braunr> neal: why genode ? <neal> I want to build a secure system. <neal> I think the best way to do that is using capabilities. <neal> Genode runs on Fiasco.OC, for instance <neal> and it provides a lot of infrastructure <braunr> neal: why not l4re for example ? <braunr> neal: how important is the ability to revoke capabilities ?
In the discussion on object lookups, IRC, freenode, #hurd, 2013-10-24:
<teythoon> and, with some effort, getting rid of the hash table lookup by letting the kernel provide the address of the object (iirc neil knew the proper term for that) <braunr> teythoon: that is a big interface change <teythoon> how so <braunr> optimizing libihash and libpthread should already be a good start <braunr> well how do you intend to add this information ? <braunr> ok, "big" is overstatement, but still, it's a low level interface change that would probably break a lot of things <teythoon> store a pointer in the port structure in gnumach, make that accessible somehow <braunr> yes but how ? <teythoon> interesting question indeed <braunr> my plan for x15 is to make this "label" part of received messages <braunr> which means you need to change the format of messages <braunr> that is what i call a big change
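The "label" idea discussed above lets the kernel deliver, with each received message, a value the server associated with the port, so the server can reach its object directly instead of looking the port name up in a hash table. A minimal sketch, assuming a hypothetical receive header (`msg_header` and its `label` field are invented here and are not the GNU Mach or x15 message format):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Server-side object that a port stands for (illustrative only). */
struct file_object {
    int refs;
    const char *name;
};

/*
 * Hypothetical header of a received message. The kernel would fill in
 * `label` with the value the server registered when it created the port.
 */
struct msg_header {
    unsigned int port_name;
    uintptr_t label;
};

/* Value the server would register for a port: the object's address. */
uintptr_t object_to_label(struct file_object *object)
{
    return (uintptr_t)object;
}

/* Recover the object from a received message without any table lookup. */
struct file_object *msg_to_object(const struct msg_header *header)
{
    assert(header != NULL);
    return (struct file_object *)header->label;
}
```

As braunr notes, the cost is in the interface: the message format has to change to carry the label, which is why he calls it a big change for GNU Mach.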
<antrik> neal: you mentioned you want to use Genode as a base... what exactly would you want to build on top of it, different than what the Genode folks are doing?
<neal> antrik: I want to build a secure operating system. <neal> antrik: One focused on user security. <neal> braunr: You mean revoke individual send rights? <neal> braunr: Or, what do you mean? <neal> Or do you mean the ability to receive anotification on revocation? <braunr> neal: yes, revoking individual send rights <neal> I don't think it is needed in practice. <braunr> neal: ok <neal> But, you need a membrane object <neal> Here's the idea: <braunr> like a peropen ? <neal> you have say a file server <neal> and a proxy <neal> a process only talks to the file server via the proxy <neal> for the proxy to revoke access to the file object it gave out, it needs to either use your revoke <neal> interpose on all ipcs (which is expensive) <neal> or use a proxy object/membrane <neal> which basically forwards messages to the underlying object <braunr> isn't that also interposing ? <neal> of course <neal> but if it is done in the kernel, it is fast <braunr> ah in the kernel <neal> you just walk a linked list <braunr> what's the difference with a peropen object ? <neal> That's another option <neal> you use a peropen and then provide a call to force the per-open to be closed <neal> so the proxy now invokes the server <neal> the issue here is that the proxy has to trust the server <braunr> yes <braunr> how can you not trust servers ? <neal> that is, if the intent is to prevent further communication between the server and the process, the server may ignore the request <neal> in this case, you probably trust the server <braunr> hum <neal> but it could be that you have two processes communicating <braunr> if the intent is to prevent communication, doesn't the client just need to humm not communicate ? :) <neal> the point is that the two processes are colluding <braunr> what are these two processes ? <neal> I'm not sure this case is of practical relevance <braunr> ok <neal> https://www.cs.cornell.edu/courses/cs513/2002sp/L10.html <braunr> thanks
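The membrane (proxy object) neal describes forwards messages to an underlying object and can be revoked by whoever handed it out, without having to trust the server. One possible reading of "you just walk a linked list" is sketched below; the structure and function names are hypothetical and do not come from Viengoos or any other kernel mentioned here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Either a real object (forward == NULL) or a membrane that forwards
 * to the next object in the chain.
 */
struct object {
    struct object *forward; /* next object in the chain, NULL for a real object */
    int revoked;            /* set when a membrane has been revoked */
    int id;                 /* illustrative payload */
};

/*
 * Follow the chain of membranes to the real object.
 * Returns NULL if any membrane on the way has been revoked.
 */
struct object *membrane_resolve(struct object *object)
{
    while (object != NULL && object->forward != NULL) {
        if (object->revoked)
            return NULL;

        object = object->forward;
    }

    return object;
}

/* Revoke a membrane: later messages sent through it no longer get through. */
void membrane_revoke(struct object *membrane)
{
    assert(membrane != NULL && membrane->forward != NULL);
    membrane->revoked = 1;
}
```

In this sketch the proxy would give clients a reference to the membrane rather than to the file object itself; revoking the membrane cuts off those clients while other references to the file object keep working.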
<antrik> neal: hm... I was under the impression that the Genode themselves are also interested in user security... what is missing from their version that you want to add? <antrik> err... the Genode folks <neal> antrik: I'm missing some context <antrik> neal: a while back you said that you want to build a secure system on top of Genode <neal> yes <neal> the fact that they are doing what I want is great <neal> but there is more to a secure system than an operating system <antrik> ah, so it's about applications+ <antrik> ? <neal> yes, that is part of it <neal> it's also about secure messaging <neal> and hiding "meta-data" <braunr> i'm still wondering how you envision the powerbox <neal> when a program wants the user to select a file, it makes an upcall to the power box application <antrik> braunr: you can probably find some paper from Shapiro ;-) <braunr> well, sure, it looks easy <braunr> but is there always a power box application ? <braunr> is there always a guarantee there won't be recursive calls made by that application ? <braunr> how does it integrate with the various interfaces a system can have ? <neal> there is always a power box application <neal> I don't know what you mean by recursive calls <braunr> aer techniques such as remembering for some time like sudo does applicable to a powerbox application ? <neal> if you mean many calls, then it is possible to rate limit it <braunr> well, the powerbox will use messaging itself <braunr> is it always privileged ? <braunr> privileged enough <neal> it is privileged such like the X11 display manager is privileged and can see all of the video content <braunr> what else other than accessing a file would it be used for ? <braunr> one case i think of is accessing the address space of another application, in debuggers <braunr> 14:56 < neal> there is always a power box application <braunr> what would it be when logging on a terminal ? 
<antrik> braunr: when running pure command line tools, you can already pass the authority as part of the command line. however, I'm wondering whether it really makes sense to apply this to traditional shell tools... <braunr> that's one of my concerns <braunr> when does it really make sense ? <antrik> for interactive use (opening new files from within a running program), I don't think it can be accomplished in a pure terminal interaction model... <braunr> and you say "you pass the authority" <antrik> braunr: it makes sense for interactive applications <braunr> i thought the point of the powerbox is precisely not to do that <antrik> no, it's still possible and often reasonable to pass some initial authority on startup. the powerbox is only necessary when further access needs to be provided at runtime <braunr> ok <neal> the power box enable dynamic delegation of authority, as antrik said <braunr> ok <braunr> but how practical is it ? <neal> applications whose required authority is known apriori and max(required authority) is approximately min(required authority) can be handled with static policies <braunr> don't application sometimes need a lot of additional authority ? <braunr> ok <antrik> actally, thinking about it, a powerbox should also be possible on a simple terminal, if we make sure the application doesn't get full control of the terminal, but rather allow the powerbox to temporarily take over input/output without the application being able to interpose... so not quite a traditional UNIX terminal, but close enough I'd say <braunr> the terminal itself maybe ? <antrik> hm... that would avoid having to implement a more generic multiplexing approach -- but it would mix things that are normally quite orthogonal... <antrik> BTW, I personally believe terminals need to get smarter anyways :-) <braunr> ok <antrik> the traditional fully linear dialog has some nice properties; but it is also pretty limited, leading to usability problems soon. 
I have some vague ideas for an approach that still looks mostly like a linear dialog, but is actually more structured
<braunr> yes the learning curve [of the Hurd] is too hard <braunr> that's an entry barrier <braunr> this is why i use well known posix-like (or other well established) apis in x15 <braunr> also why i intend to make port rights blend into file descriptors <teythoon> right <braunr> well <braunr> the real reason is efficiency <braunr> but matching existing practices is very good too
<gnufreex> braunr, how is work on x-15 progressing? Is there some site to check what is new? <braunr> gnufreex: stalled for 2 months <braunr> i'm working on the hurd for now, will get back to it later <braunr> no site <braunr> well <gnufreex> so, you hit some design problem, or what? I mean why stalled <braunr> http://git.sceen.net/rbraun/x15.git/ :p <gnufreex> Thanks <braunr> something like that yes <braunr> i came across http://darnassus.sceen.net/~rbraun/radixvm_scalable_address_spaces_for_multithreaded_applications.pdf <gnufreex> I read that, I think I found it on Hurd site. <braunr> and since x15 aims at being performant and scalable, it seems like a major feature to bring in <braunr> but it's not simple to integrate <gnufreex> So you want to add that? <braunr> gnufreex: yes <gnufreex> branur, but what are the problems? <braunr> ? <braunr> ah <braunr> you really want to know ? :) <gnufreex> Well... yeah <braunr> you need to know both x15 and radixvm for that <braunr> for one, refcache, as described in the radixvm paper, doesn't seem scalable <braunr> it is in practice in their experiments, but only because they didn't push some parameters too high <braunr> so i need to rethink it <gnufreex> I don't know x15... 
but I read radixvm paper <braunr> next, the bsd-like vm used by x15 uses a red-black tree to store memory areas, which doesn't need external storage <braunr> radixvm as implemented in xv6 is only used for user processes, not the kernel <braunr> which means the kernel allocator is a separate implementation, as it's done in linux <braunr> x15 uses the same implementation for both the kernel and user maps <braunr> which results in a recursion problem <braunr> because a radix tree uses external nodes that must be dynamically allocated <gnufreex> so you would pretty much need to rewrite x15 <braunr> no <braunr> just vm/ <braunr> and $arch/pmap <braunr> and yes, pmap needs to handle per-core page tables <braunr> something i wanted to add already but couldn't because of similar recursion problems <gnufreex> Yeah, vm system... but what else did you do with x15... it is at early stage... <braunr> multithreading <gnufreex> That doesn't need to be rewriten? <braunr> no <gnufreex> Ok... good. <braunr> physical memory allocation neither <braunr> only virtual memory <gnufreex> is x15 in runable state? I mean in virtual machine? <braunr> you can start it <braunr> but you won't go far :) <gnufreex> What do you use as development platform? <braunr> it basically detects memory and processors, starts idle, migration and worker threads, and leaves <gnufreex> Is it compilable on fedora 19 <braunr> probably <braunr> i use debian stable <braunr> and unstable on the hurd <gnufreex> ok, I will probably try it in KVM... <braunr> better do it on real hardware too in case you find a bug <gnufreex> I cant make new partition now... it seems my hard drive is dying. When I get a new one I will try on real harware. 
<braunr> you don't need a new partition <braunr> the reason radixvm is important is twofold <braunr> 1/ ipc will probably make use of the core vm operations used by mmap and munmap <braunr> 2/ no other system currently provides scalable mmap/munmap/mprotect <gnufreex> Yes, that would make x15 pretty special... <gnufreex> But I read somewhere that you wanted to implement RCU during summer <gnufreex> Did you do that?
<braunr> neal: about secure operating systems <braunr> i assume you consider clients providing their own memory a strong requirement for that, right ? <neal> no <neal> I'm less interested in availability <neal> or performance guarantees <braunr> ok <braunr> but <braunr> i thought it was a requirement to avoid denial of service <neal> of course <braunr> then why don't you consider it required ? <neal> I want something working in a reasonable amount of time :) <braunr> agreed <neal> more seriously: <neal> my primary requirement is that a program cannot access information that the user has not authorized it to access <braunr> ok <neal> the requirement that you are suggesting is that a program be able to access information that the user has authorized it to access <neal> this is availability <braunr> i'm not following <braunr> what's the difference ? <neal> assume we have two programs: A and B <neal> on Unix, if they run under the same uid, they access access each other files <neal> I want to fix this <braunr> ok, that's not explicit authorization <braunr> but is that what you mean ? <neal> Now, assuming that A cannot access B's data and vice versa <neal> we have an availability problem <neal> A could prevent B from accessing its data <neal> via a DoS attach <neal> I'm not going to try to fix that. <braunr> ok <braunr> and how do you intend to allow A to access B's data ? <braunr> i guess the powerbox mentioned in the critique <braunr> but do you have a more precise description about something practical to use ?
In context of libports, Open Issues, IRC, freenode, #hurd, 2013-11-14.
<braunr> fyi, x15 will not provide port renaming <braunr> teythoon: also, i'm considering enforcing port names to be as close as possible to 0 when being allocated as part of the interface <braunr> what do you think about that ? <teythoon> braunr: that's probably wise, yes <teythoon> you could hand out receive ports close to 0 and send ports close to ~0 <braunr> teythoon: what for ? <teythoon> well, if one stores only one kind in an array, it won't waste as much space <braunr> this also means you need to separate receive from send rights in the interface <braunr> so that you know where to look for them <braunr> i'm not sure it's worth the effort <braunr> using the same code for them both looks more efficient <braunr> the right lookup code is probably one of the hottest path in the system <teythoon> right <neal> one of the nice things about not reusing port names is that it helps catch bugs <neal> you don't want to accidentally send a message to the wrong recipient <braunr> how could you, if the same name at different times denotes different rights ? <neal> you forget to clean up something <braunr> if you don't clean, how could you get the same name for a right you didn't release ? <neal> that's not hard to do :) <neal> ah, you cleaned up the port right but not the name <braunr> ah ok <neal> destroy the port and forget that a thread is still working on a response <neal> the data structure says use the port at index X <neal> X is reallocated in the mean time <teythoon> excuse my ignorance, but gnumach *is* reusing port names, isn't it?
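The bug neal describes (a port is destroyed, a thread still remembers index X, and X is reallocated in the meantime) can be sketched by comparing two name allocation policies. This is a hypothetical model, not gnumach or x15 code: `LowestFirstNames`, `NeverReusedNames` and `stale_lookup` are illustrative names, and rights are represented by plain strings.

```python
import heapq


class LowestFirstNames:
    """Hands out the smallest free name, in the way POSIX hands out descriptors."""

    def __init__(self):
        self.free = []        # min-heap of released names
        self.next_new = 0     # smallest name never handed out so far
        self.rights = {}

    def allocate(self, right):
        if self.free:
            name = heapq.heappop(self.free)   # reuse the lowest released name
        else:
            name = self.next_new
            self.next_new += 1
        self.rights[name] = right
        return name

    def release(self, name):
        del self.rights[name]
        heapq.heappush(self.free, name)

    def lookup(self, name):
        return self.rights[name]


class NeverReusedNames:
    """Hands out a fresh name every time, so a stale name stays invalid."""

    def __init__(self):
        self.next_new = 0
        self.rights = {}

    def allocate(self, right):
        name = self.next_new
        self.next_new += 1
        self.rights[name] = right
        return name

    def release(self, name):
        del self.rights[name]

    def lookup(self, name):
        return self.rights[name]


def stale_lookup(names):
    """Destroy a reply port while a worker still remembers its name, then use it."""
    reply = names.allocate("reply port for client A")
    remembered = reply                  # a thread keeps the name around
    names.release(reply)                # the port is destroyed elsewhere
    names.allocate("port for client B")
    try:
        return names.lookup(remembered)
    except KeyError:
        return "dead name"


reused_result = stale_lookup(LowestFirstNames())
fresh_result = stale_lookup(NeverReusedNames())
```

With the lowest-first policy the stale name silently resolves to client B's port, which is the wrong recipient. Without reuse the lookup fails at the point of the mistake, which is the fail-fast behaviour neal argues for. The price is a name space that keeps growing.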
<braunr> that policy is why i'm not sure i want to enforce allocation policy in the interface :/ <neal> This is not about a security property of the system <neal> this is about failing fast <neal> you want to fail as close to the source of the problem as possible <braunr> we could make the kernel use different allocation policies for names, to catch bugs, yes <neal> make the index X valid again and you've potentially masked the bug <teythoon> braunr: if you were to merge your radix tree implementation into gnumach and replace the splay tree with it, would that make using renamed ports fast enough so we can just rename all receive ports doing away with the extra lookup like mach-defpager does ? <braunr> i don't think so <braunr> the radix tree code is able to compress its size when keys are close to 0 <braunr> using addresses would add 1, 2, maybe 3 levels of internal nodes <braunr> for every right <braunr> we could use a true integer hash table for that though <braunr> hm no, hurd packages crash ... :/ <teythoon> but malloc allocates stuff in a contiguous space, so the pointers should be similar in the most significant bits <braunr> if you use malloc, yes <teythoon> sure <teythoon> but that'd make the radix tree representation compact, no? <braunr> it could <braunr> the current code only compresses near 0 <teythoon> oh <braunr> better compression could be implemented though
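Why keys close to 0 keep a radix tree shallow, and why pointer-sized keys do not, can be shown with a small calculation. This is a rough sketch under stated assumptions: a radix of 6 bits per level and a tree whose height grows only as far as the largest key requires. The function `levels_needed` and the example addresses are hypothetical, and the exact level counts depend on the real radix and pointer width, so they need not match the figures quoted in the discussion.

```python
RADIX_BITS = 6   # assumed: 64 slots per node


def levels_needed(key):
    """Levels required to index `key` when the height follows the largest key."""
    levels = 1
    key >>= RADIX_BITS
    while key:
        levels += 1
        key >>= RADIX_BITS
    return levels


small_name = 5                      # a port name allocated close to 0
pointer_a = 0x08051000              # hypothetical heap addresses used as names
pointer_b = 0x08052040

name_levels = levels_needed(small_name)
pointer_levels = levels_needed(pointer_a)

# teythoon's point: nearby pointers share their most significant bits, so a tree
# that also compressed common prefixes would only need levels for the differing part.
differing_levels = levels_needed(pointer_a ^ pointer_b)
```

With these assumptions a small name fits in a single level, the pointer needs several, and the part in which the two pointers differ needs fewer levels than a full pointer. That matches the remark that the current code only compresses near 0 while better compression could be implemented.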
<teythoon> have you seen liburcu ? <braunr> a bit, yes <teythoon> it might be worth investigating to use it in some servers <braunr> it is <teythoon> the proc server comes to mind <braunr> personally, i think all hurd servers should use rcu <braunr> libports should use rcu <teythoon> yes <braunr> lockless synchronization should be a major feature of x15/propel <braunr> present even during message passing
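The read-copy-update idea behind this exchange can be sketched as follows. This is not the liburcu API and not libports code: `RcuTable` is a hypothetical class, and the sketch relies on Python's garbage collector to reclaim old versions, which stands in for the grace-period machinery a C implementation needs.

```python
import threading


class RcuTable:
    """Lookup table with lock-free readers and copy-then-publish writers."""

    def __init__(self):
        self._snapshot = {}                    # treated as immutable once published
        self._writer_lock = threading.Lock()   # serializes writers only

    def lookup(self, key):
        snapshot = self._snapshot              # one reference read, no lock taken
        return snapshot.get(key)

    def update(self, key, value):
        with self._writer_lock:
            new = dict(self._snapshot)         # read and copy the current version
            new[key] = value                   # update the private copy
            self._snapshot = new               # publish with a single reference store


table = RcuTable()
old_view = table._snapshot                     # what a concurrent reader might hold
table.update(1, "port-info")
new_value = table.lookup(1)
```

A reader that picked up the old snapshot before the update keeps seeing a consistent, unchanged table, while new readers see the published version. Readers never block, which is the property that makes the approach attractive for lookup-heavy servers.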
<braunr> improving our page cache with arc would be great <braunr> it's on the todo list for x15 :> <braunr> not sure you referred to virtual memory management though <braunr> (actually, it's CAR, not ARC that is planned for x15)
<braunr> zacts: http://darnassus.sceen.net/~rbraun/x15/qemu_x15.sh
<braunr> oh, btw, i've started working on x15 again :> <teythoon> saw that :) <braunr> first item on the list: per-cpu page tables <braunr> the magic that will make ipc extremely scalable :) <teythoon> i'm worried about your approach tbh <braunr> too much overhead ? <teythoon> not on any technical level <teythoon> but haven <braunr> ? <teythoon> 't there been enough reimplementation efforts that got nowhere ? <braunr> oh that <teythoon> ^^ <braunr> well, i have personal constraints and frustrations with the existing code, and my goal isn't to actually produce anything serious until it actually gets there <braunr> which, yes, it might not <braunr> really, i'm doing it for fun <teythoon> well sure <teythoon> that's a damn good reason ;) <braunr> and if it ever reaches a state where it can actually be used to run stuff, i would be very happy <braunr> and considering how it's done, i'm pretty sure things could be built a lot faster on such a system <teythoon> but you need to reimplement all the userspace servers as well, and the libc stuff <braunr> yes <teythoon> do you plan to reimplement this from scratch or do you have plans to 'bootstrap' propel from hurd ? <braunr> from scratch <teythoon> well... i'm not sure that this is feasible or even a good idea. that's what i meant in a nutshell i guess. <braunr> i'm familiar with that criticism <braunr> and you may be right <braunr> this is also why i keep working on the hurd at the same time <teythoon> we could also talk about making hurd more easily portable <braunr> portable with regard to what ? <teythoon> evolving hurd and mach to the point where it might be feasible to port hurd to another ukernel <braunr> not so easy <teythoon> i know <braunr> i'm not even sure i would want that <braunr> well, since the hurd isn't optimized at all, why not <teythoon> why would it necessarily hinder optimization ?
<braunr> because in practice, it's rare for a microkernel to provide all the features the hurd would require to run really well <braunr> the most severe issue being that they either provide asynchronous ipc, used for signals, or only synchronous ipc, making signal and other event-driven code hard to emulate (usually requiring separate threads)
<teythoon> i wonder if it would not be best to add a description to mach tasks <braunr> i think it would <teythoon> to aid fixing these kind of issues <braunr> in x15, i actually add descriptions (names) to all kernel objects <teythoon> that's probably a good idea, yes <braunr> well, not all, but many <braunr> i'd like to push x15 this year <braunr> it currently is the only design of a truly scalable microkernel that i know of <azeem_> push how? <braunr> spend time on it <azeem_> k <azeem_> do you think it will make sense to solicit outside contributions at one point? <braunr> yes <braunr> the roadmap is vm system -> ipc system -> userspace (including RPC handling) <braunr> once we can actually do things in userspace, the priority will be getting a shell with glibc <braunr> people will be able to help "easily" at that point <azeem_> just wondering, apart from scalability, did you write it for performance, for hackability, or something else? <braunr> it's basically the hurd architecture, including improvements from the critique, with performance and scalability in mind <azeem_> ok <braunr> the main improvements i think of currently are resource containers, lexical .. resolution, and lists of trusted users with which to communicate <braunr> it's strongly oriented for posix compatibility though <teythoon> sounds nice, i like it already ;) <azeem_> is it compatible with Mach to some degree? <braunr> so things like running without an identity will be forbidden in the default system personality <braunr> no, not compatible with mach at all <azeem_> this sounds like it is doing more than Mach did <azeem_> braunr: ah, ok <braunr> it's not "x15mach" any more :) <azeem_> right, I missed out on that
<braunr> i also don't write anything that would prevent real-time <teythoon> b/c that's a potential market for such an operating system ? <braunr> yes <teythoon> well, i can't say i don't like the sound of that ;) <braunr> the ipc interface should be close to that of qnx
<cluck> braunr: have you looked at genode? <cluck> braunr: i sometimes wonder how hard it'd be to port hurd atop it because i find some similarities with what l4/fiasco/viengoos provided <braunr> cluck: i have, but genode seems a bit too far from posix for our tastes <cluck> (and yes, i realize we'd be getting farther from the hw) <braunr> ah you really mean running the hurd on top of it <braunr> i personally don't like the idea <cluck> braunr: well, true, but their noux implementation proves it's not a dealbreaker <cluck> braunr: at least initially that'd be the best implementation approach, no? as time went on integrating hurd servers more tightly at a lower level makes sense but doing so from the get go would be foolhardy <cluck> braunr: or am i missing something obvious? <braunr> cluck: why would it be ? <cluck> braunr: going by what happened with l4 it's too much code to port and optimize at once <braunr> cluck: i don't think it is <braunr> cluck: problems with l4 didn't have much to do with "too much code" <cluck> braunr: i won't debate that, you have more experience than me with hurd code. anyway that's how i'd go about it, first get it all running then get it running fast. breakage is bad <braunr> and you think moving from something like linux or genode to an implementation closer to hardware won't break things ? <cluck> braunr: yes, i read the paper, obvious unexpected shortcomings but even had them not been there the paradigms are too different and creating proper mappings from one model to the other would at least be time consuming <braunr> ye <braunr> yes <braunr> i'm convinced the simple approach of a small microkernel with the proper interface along with the corresponding sysdeps layer in glibc would be enough to get a small hurd like system quickly <braunr> experience with other systems shows how to directly optimize a lot of things from the start, without much effort <cluck> braunr: sorry.
back to our talk, i mentioned genode because of the nice features it has that'd be useful on hurd <braunr> cluck: which ones do you refer to ? <cluck> braunr: the security model is the biggest one <braunr> how does it differ from the hurd, except for revocation ? <cluck> braunr: then there's the ease of portability <braunr> ? <cluck> braunr: it's more strict <braunr> how would that help us ? <cluck> braunr: if hurd was running atop it we'd get extra platforms supported almost for free whenever they did (since we'd be using the same primitives) <braunr> why not choose the underlying microkernel directly ? <cluck> call me crazy but i believe code reuse is a good thing, i see little point in duplicating existing code just because you can <braunr> what part of genode should be reused then ? <cluck> that's what got me thinking about genode in the first place, ideologically they share a lot (if not most) of hurd's goals and code wise they feel close enough to make a merge of sorts not seem crazy talk, thus my asking if i'm missing something obvious <braunr> i think the design is incompatible with our goals of posix compatibility <cluck> braunr: oh, ok. 
<cluck> braunr: i was assuming that wasn't an issue, as i mentioned before they have noux already and if hurd's servers got ported they'd provide whatever else that was missing <braunr> noux looks like a unix server for binary compatibility <braunr> i'm not sure it is but that's what the description makes me think <braunr> and if it really is, then it's no different than running linux on top of an hypervisor <braunr> ok it's not for binary compatibility but it definitely is a (partial) unix server <braunr> i much prefer the way the hurd is posix compliant without any additional layer for compatibility or virtualization <cluck> braunr: noux is a runtime, as i understand it there's no binary compatibility just source (ie library/api calls) <braunr> yes i corrected that just now <cluck> sorry, i'm having lag issues <braunr> no worries <cluck> braunr: anyway, how's x15 coming along? still far from being a practical replacement? <braunr> yes .. :( <braunr> and it's not a replacement <cluck> (for mach) <braunr> no <cluck> huh? <braunr> it's not a replacement for the hurd <braunr> err, for mach <cluck> braunr: i thought you were writing it to be compatible with mach's interfaces <braunr> no <braunr> it used to be that way <braunr> but no <cluck> braunr: what changed? <braunr> mach ipc is too ccmplicated <braunr> complicated* <braunr> its supposed benefits (of allowing the creation of computer clusters for single system images) are outdated and not very interesting <braunr> it's error prone <braunr> and it incurs more overhead than it should <cluck> no arguing there <cluck> braunr: are you still targeting being able to run hurd atop x15 or is it just your pet project now? <braunr> i don't intend the hurd to run on top of it <braunr> the reason it's a rewrite is to fix a whole bunch of major issues in one go
<nalaginrut> braunr: x15 can be compiled successfully, is it possible to run it on Hurd now? <braunr> nalaginrut: it will never be <braunr> x15 isn't a drop in mach replacement <braunr> for now it's stalled until i have more free time :/ <braunr> i need to rewrite virtual memory, and then implement ipc <nalaginrut> oh, it's planned for standalone one? <braunr> it's planned for a hurd clone <nalaginrut> sounds bigger plan than I thought ;-) <braunr> it is but not that much actually <braunr> it's just i have really little free time currently .. <braunr> and i chose to spend it on the hurd instead <braunr> it's easier to fix bugs when you have little time, than think and write big and difficult features like an smp scalable VM system <nalaginrut> yes, I understand <nalaginrut> it takes a huge amount of time to think and design <nalaginrut> and relatively easier to fix the current one <braunr> well, the whole thing is thought already <braunr> what's hard is taking care of the details <nalaginrut> well, you're so self-confident, in my experience, even if all the designs are in my mind, issues may make me change the original design a lot ;-) <nalaginrut> alright, I saw the key line, everything exists, just assemble it