General

Some comments from tschwinge regarding your proposal, which is very good, if I may say so again! :-)

Of course, everyone is invited to contribute here!

I want to give the following methodology a try instead of having only email/IRC discussions, since those show, again and again, a tendency to be dumped into their respective archives and forgotten there. Of course, email/IRC discussions have their uses too, so we're not going to replace them entirely. For example, for discussions involving a bunch of people (who may not even be following these pages), email (or, as applicable, the even more interactive IRC) will still be the medium of choice. (The executive summary should then be posted here, or incorporated into your proposal.)

Also, if you disagree with this suggested procedure right away, or at some later point begin to feel that this thing doesn't work out, or simply takes too much time (I don't think so: writing emails takes time, too), just say so, and we can reconsider.

Of course, as this wiki is a passive medium, unlike IRC and email, which are active ones, it is fine to send notices like: I have updated the wiki page, please have a look.

One idea is that your proposal evolves alongside the ongoing work, and represents (in more or less detail) what has been done and what will be done. Also, we can hopefully use parts of it for documentation purposes, or as recipes for similar work (enabling other programming languages on the Hurd, for example).

For this, I suggest the following procedure: as applicable, you can either address comments in here (for example, if they're wrong :-), or if they require further discussion; think: email discussion), or address them directly in your proposal and remove the comments from here at the same time (think: bug fix).

Generally, you can assume that I'm fine with anything I didn't comment on (within some reasonable timeframe, or after you asked me again). Otherwise, I might say: I don't like this as is, but I'll need more time to think about it.

There is also a possibility that parts of your proposal will be split off, in cases where we think they're valuable to pursue, but not at this time. (As you know, your proposal is not really a trivial one, so it may just be too much for one person's summer.) Such bits could be moved to open issues pages, either new ones or existing ones, as applicable.

GSoC Site Discussion

Java Native Interface (JNI)

Java Native Access (JNA)

This is a different approach, and while some attention is paid to performance, correctness and ease of use take priority.

As we plan on only having a few native methods (for invoking mach_msg, essentially), JNA is probably the wrong approach: portability and ease of use are not important, but performance is.

Compiled Native Interface (CNI)

Probably faster than JNI, but only usable with GCJ.

Given that we have very few JNI calls, it might be interesting to take a "dual" approach if CNI actually improves performance when compiling to native code. --jkoenig 2011-07-20

IRC, freenode, #hurd, 2011-07-13

<jkoenig> Yes, I guess so. Maybe start investigating mig because it may
  have repercussions on what the best approach would be for some aspects of
  the Mach bindings.
<tschwinge> I still think that making MIG emit Java code is not too
  difficult, once you have the required Java infrastructure (like what
  you're writing at the moment).
<tschwinge> On the other hand, if there's another approach that you'd like
  to use, I'm not trying to force using MIG.
<braunr> i still have a problem understanding your approach
<braunr> at which level are your bindings located ?
<jkoenig> I expect mig will be the easiest route, but of course possibly
  it won't.
<tschwinge> jkoenig: Yeah, maybe give some high-level to low-level overview?
<jkoenig> ok, so
<jkoenig> at the very core, low-level, we have a very thin amount of JNI
  code to access (proper) system calls.
<jkoenig> by "proper" I mean things like mach_task_self, mach_msg and
  mach_reply_port, which are actually system calls rather than RPCs to the
  kernel.
<braunr> right
<jkoenig> at this level, we manipulate port names as integers, and the
  message buffers for mach_msg are raw ByteBuffers (from the java.nio
  package)
<jkoenig> actually, so-called /direct/ ByteBuffers, which are backed by
  memory allocated outside of the Java heap, rather than as a byte[] array
<jkoenig> we can retrieve the pointer from the JNI code and use the buffer
  directly.
<jkoenig> (so, good for performance and it's also portable.)
<braunr> ok
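
As a rough illustration of the low-level layer described above, here is a minimal sketch of allocating a direct ByteBuffer in Java. The class and method names (`DirectBufferSketch`, `newMessageBuffer`) are made up for this page and are not taken from hurd-java. On the native side, JNI code could obtain the address with the standard `GetDirectBufferAddress` function; that part is not shown here.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper, for illustration only. */
class DirectBufferSketch {
    /** Allocates an off-heap buffer that native code could access directly. */
    static ByteBuffer newMessageBuffer(int capacity) {
        // allocateDirect() returns a buffer whose storage lives outside the Java heap.
        ByteBuffer buf = ByteBuffer.allocateDirect(capacity);
        // Java buffers are big-endian by default; the discussion below settles
        // on little-endian 32 bits for now, so switch the byte order.
        buf.order(ByteOrder.LITTLE_ENDIAN);
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer buf = newMessageBuffer(64);
        buf.putInt(0, 0x11223344);
        System.out.println(buf.isDirect());  // prints "true"
        System.out.println(buf.get(0));      // prints "68" (0x44, the low byte comes first)
    }
}
```
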
<braunr> i'm more interested in the higher level bindings :)
<jkoenig> ok so, higher up.
<jkoenig> design goal from my proposal: "the memory safety of Java should
  be maintained and extended to Mach primitives such as port names and
  out-of-line memory regions"
<jkoenig> so integer port names are not "safe" in the sense that they can
  be forged and misused in all kinds of ways
<jkoenig> which is why I have a layer of Java code whose job is to wrap
  this kind of low-level Mach stuff into safe abstractions
<jkoenig> and ideally the user should only use these safe abstractions.
<tschwinge> (Not to restrict the programmer, but to help him write correct
  code.)
<jkoenig> right.
<braunr> so you can't use mach RPCs directly
<jkoenig> tschwinge, also to actually restrict them, in a Joe-E /
  object-capability context, but that's not the primary concern right now
  ;-)
<braunr> or you force your wrappers to have these abstractions as input
<jkoenig> braunr, well, actually at this level you still have Mach RPC
<jkoenig> but for instance, port names are encapsulated into "MachPort"
  objects which ensure they are handled correctly
<tschwinge> As I understand it, you use these abstractions to prepare a
  usual mach_msg message, and then invoke mach_msg.
<braunr> ok
<jkoenig> and message buffers are wrapped into "MachMsg" objects which both
  help you write the messages into the ByteBuffer and prevent you from
  doing funky stuff
<jkoenig> and ensure the ports which you send/receive/pseudo-receive after
  an error/... are deallocated as required, etc.
<braunr> what's the interface to use IPC ?
<tschwinge> Is MIG doing that, too, I think?  (And antrik once found some
  error there, which is still to be reviewed...)
<jkoenig> braunr, so basically as a user you would be free to use either
  one of these layers, or to use MIG-generated classes which would
  construct and exchange messages for you using the second (safe) layer.
<braunr> ok, let's just finish with the low level layer before going
  further please
<jkoenig> tschwinge, MIG does some type checking on the received message
  and saves you the trouble of constructing/parsing them yourself, but I'm
  not sure about how mach_msg errors are handled
<braunr> what are the main methods of MachMsg for example ?
<jkoenig> braunr, you may want to have a look at
  http://jk.fr.eu.org/hurd-java/doc/html/classorg_1_1gnu_1_1mach_1_1MachMsg.html
<braunr> right, sorry
<braunr> grabbed the code at work and forgot here
<jkoenig> and also
  https://github.com/jeremie-koenig/hurd-java/blob/master/HelloMach.java
  which uses it
<jkoenig> but roughly, you'd use setRemotePort, setLocalPort, setId to
  write your message's header
<jkoenig> then use one of the putFoo() methods to add data items to the
  message
<braunr> ok, the mapping with the low level C interface is very clear
<braunr> that's good for me
<jkoenig> the putFoo() methods would write the appropriate type
  descriptors, then the actual data.
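
To make the "header setters, then putFoo()" idea a bit more concrete, here is a simplified sketch. `SketchMsg` is a hypothetical stand-in, not the real MachMsg class linked above, and the header offsets, type codes and one-word descriptor encoding are assumptions chosen for illustration rather than a faithful copy of the Mach message format.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical, simplified stand-in for the MachMsg class discussed above. */
class SketchMsg {
    // Assumed layout: six 32-bit header words (bits, size, remote, local, seqno, id).
    static final int HEADER_SIZE = 24;
    static final int OFF_SIZE = 4, OFF_REMOTE = 8, OFF_LOCAL = 12, OFF_ID = 20;
    // Assumed type-name codes, for illustration only.
    static final int TYPE_INTEGER_32 = 2;
    static final int TYPE_CHAR = 8;

    private final ByteBuffer buf;

    SketchMsg(int capacity) {
        buf = ByteBuffer.allocateDirect(capacity).order(ByteOrder.LITTLE_ENDIAN);
        buf.position(HEADER_SIZE);  // data items start right after the header
    }

    SketchMsg setRemotePort(int name) { buf.putInt(OFF_REMOTE, name); return this; }
    SketchMsg setLocalPort(int name)  { buf.putInt(OFF_LOCAL, name);  return this; }
    SketchMsg setId(int id)           { buf.putInt(OFF_ID, id);       return this; }

    /** Writes a one-word descriptor: type name, bits per element, count, inline flag. */
    private void putDescriptor(int typeName, int bits, int count) {
        if (count < 0 || count >= (1 << 12)) {
            throw new IllegalArgumentException("count does not fit the assumed 12-bit field");
        }
        buf.putInt(typeName | (bits << 8) | (count << 16) | (1 << 28));
    }

    /** Adds one 32-bit integer: descriptor first, then the value. */
    SketchMsg putInt(int value) {
        putDescriptor(TYPE_INTEGER_32, 32, 1);
        buf.putInt(value);
        return this;
    }

    /** Adds a run of bytes: descriptor first, then the data, padded to a word boundary. */
    SketchMsg putChars(byte[] data) {
        putDescriptor(TYPE_CHAR, 8, data.length);
        buf.put(data);
        while (buf.position() % 4 != 0) buf.put((byte) 0);
        return this;
    }

    /** Records the total size in the header and returns a read-only view of the message. */
    ByteBuffer finish() {
        buf.putInt(OFF_SIZE, buf.position());
        ByteBuffer view = buf.asReadOnlyBuffer();
        view.flip();  // limit = bytes written, position = 0
        return view.order(ByteOrder.LITTLE_ENDIAN);
    }
}
```

A caller would chain the calls, for example `new SketchMsg(128).setRemotePort(5).setId(1000).putInt(42).finish()`, and hand the resulting buffer to the native mach_msg wrapper.
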
<braunr> we can go on with the MiG part if you want :)
<jkoenig> right,
<jkoenig> so here you may want to look at the UML class diagram from the
  proposal: http://www.bddebian.com/~hurd-web/user/jkoenig/java/proposal/

<jkoenig> so in the C case, mig generates 3 files
<jkoenig> a header file which has the prototypes of the mig-generated
  stubs,
<jkoenig> a *User.c which has their actual implementation
<jkoenig> and a *Server.c which handles demultiplexing the incoming
  messages and helps with implementing servers.
<jkoenig> so we would do something along these lines, more or less:
<jkoenig> mig would generate the code for a Java interface in lieu of the
  *.h file.
<jkoenig> a generated FooUser class would implement this interface by doing
  RPC
<jkoenig> (so basically you would pass a MachPort object to the
  constructor, and then you could use the resulting object to do RPC with
  whatever is on the other end)
<jkoenig> and the generated FooServer class would do the opposite,
<braunr> ok
<braunr> issues with threads ?
<jkoenig> you would pass an object implementing the Foo interface to the
  constructor,
<braunr> i'm guessing the demux part may have to create threads, right ?
<jkoenig> and the resulting object would handle messages by using the
  object you passed.
<jkoenig> braunr, right, so that would be more a libports kind of code,
<braunr> the libports-like library, i see
<jkoenig> to which you could pass Server objects (for instance the
  FooServer above), and it would handle incoming messages.
<braunr> how is message content mapped to a java interface ?
<jkoenig> this would be determined from the .defs files and MIG would
  generate the appropriate code, hopefully.
<braunr> so the demux part would handle rpc integer identifiers ?
<jkoenig> right.
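
The interface / User / Server split described above could look roughly like the following sketch. Everything here is hypothetical: `Calc`, its message ids, and the `Endpoint` interface (a stand-in for a MachPort plus message passing) are invented for illustration, and real MIG output would build and parse actual messages instead of passing `int[]` arrays.

```java
/** Hypothetical RPC interface, standing in for what MIG would emit in lieu of a *.h file. */
interface Calc {
    int MSG_ADD = 1001;     // assumed message ids, for illustration only
    int MSG_NEGATE = 1002;

    int add(int a, int b);
    int negate(int a);
}

/** Stand-in for a port: delivers a request (id, arguments) and returns the reply. */
interface Endpoint {
    int[] call(int id, int[] args);
}

/** Client stub: implements the interface by doing RPC through the endpoint. */
final class CalcUser implements Calc {
    private final Endpoint port;

    CalcUser(Endpoint port) { this.port = port; }

    public int add(int a, int b) { return port.call(MSG_ADD, new int[] {a, b})[0]; }
    public int negate(int a)     { return port.call(MSG_NEGATE, new int[] {a})[0]; }
}

/** Server stub: demultiplexes on the message id and calls the implementation. */
final class CalcServer implements Endpoint {
    private final Calc impl;

    CalcServer(Calc impl) { this.impl = impl; }

    public int[] call(int id, int[] args) {
        switch (id) {
            case Calc.MSG_ADD:    return new int[] { impl.add(args[0], args[1]) };
            case Calc.MSG_NEGATE: return new int[] { impl.negate(args[0]) };
            default: throw new IllegalArgumentException("unknown message id " + id);
        }
    }
}

/** A trivial local implementation that a server object would wrap. */
final class LocalCalc implements Calc {
    public int add(int a, int b) { return a + b; }
    public int negate(int a)     { return -a; }
}
```

With this, `new CalcUser(new CalcServer(new LocalCalc()))` behaves like the local object while going through the id-based dispatch, which is the shape a libports-like library would drive from incoming messages.
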
<braunr> but hm
<jkoenig> also mapping .defs files to Java interfaces might prove to be
  tricky. data types conversion and all
<antrik> tschwinge: my memory is rather hazy. IIRC the issue was that the
  MIG-generated stubs deallocate out-of-line port arrays after the
  implementation returns, before returning to the dispatcher
<braunr> i'll just overlook this specific implementation detail
<jkoenig> but we could use some annotation-based system if we need to
  provide more information to generate the java code.
<antrik> but the Hurd (or rather glibc) RPC handling also automatically
  deallocates everything if an error occurs
<antrik> so I changed the MIG code to deallocate only when no error occurs
<braunr> jkoenig: ok, we'll talk about that when there is more progress and
  you have a better view of the problem
<antrik> at that time I was pretty sure that this is a correctly working
  solution, but it always seemed questionable conceptually... however, I
  wasn't able to come up with a better one, and nobody else commented on it
<braunr> antrik: shouldn't the hurd be changed not to deallocate something
  it didn't allocate in the first place ?
<antrik> braunr: no, the server has to deallocate stuff before returning to
  the client. the request message is destroyed before returning the reply.
<tschwinge> jkoenig, braunr: That's what I had in mind where MIG might be a
  bit awkward.  Then we can indeed either add annotations to the .defs
  files, or reproduce them in some other format.  That's some work, but
  it's mostly a one-time work.
<tschwinge> After all, the RPC interface is a binary one, and there may be
  more than one API for creating these messages, etc.
<antrik> jkoenig: actually, at least in the Hurd, server-side and
  client-side headers are separate -- so MIG actually creates four files
<jkoenig> tschwinge, wrt to annotations I was more thinking about Java
  ones, such as: @MIGDefsFile("mach/task.defs") @MIGCType("task_t") public
  interface Task { }
<jkoenig> antrik, oh, ok, it makes sense.
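
For reference, the Java annotations jkoenig mentions would need declarations along these lines. This is only a sketch: the annotation names come from the example in the log, but their retention, targets and the `describe` helper are assumptions made here to show how a generator could read them back.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/** Names the .defs file an interface corresponds to (assumed semantics). */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface MIGDefsFile { String value(); }

/** Names the C type the interface maps to (assumed semantics). */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface MIGCType { String value(); }

@MIGDefsFile("mach/task.defs")
@MIGCType("task_t")
interface Task { }

/** Hypothetical helper showing how a tool could look the annotations up. */
class AnnotationSketch {
    static String describe(Class<?> c) {
        MIGDefsFile defs = c.getAnnotation(MIGDefsFile.class);
        MIGCType ctype = c.getAnnotation(MIGCType.class);
        return defs.value() + " -> " + ctype.value();
    }

    public static void main(String[] args) {
        System.out.println(describe(Task.class));  // prints "mach/task.defs -> task_t"
    }
}
```
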
<braunr> jkoenig: anything else ?
<jkoenig> braunr, nothing that I can think of
<braunr> ok
<antrik> tschwinge: I think it would be a *very* bad idea to introduce
  redundancy regarding RPC definitions
<braunr> thanks for the tour :)
<antrik> (the _request.defs/_reply.defs mess is bad enough...)
<jkoenig> did I speak about the "Unsafe" pseudo-exception? that's
  interesting :-)
<tschwinge> jkoenig: Also, virtual memory abstractions?
<braunr> jkoenig: you didn't
<tschwinge> antrik: Well, then we could create some other super-format.
  But that's just a detail IMO.
<jkoenig> ok, so wrt virtual memory, a page we received can be wrapped with
  some JNI help into a (direct) ByteBuffer object.
<jkoenig> deallocating sent pages will be tricky, though.
<tschwinge> antrik: To put it this way: for me the .defs files are just one
  way of expressing the RPC interfaces' contracts.  (At the same time, they
  happen to be the actual reference for these, too.  But the specification
  itself could just as well be a textual one.) 
<jkoenig> one approach I've been thinking about would be to "wrap" the
  ByteBuffer object into an object which has the sole reference to it, so
  that when it's deallocated the reference can be replaced with "null", and
  further attempts to access the buffer would throw exceptions.
<braunr> sounds reasonable
<jkoenig> but that's still in flux in my head, we may end up needing our
  own implementation of ByteBuffer-like objects.
<tschwinge> The problem being that there is no mechanism to ``revoke'' an
  object once a reference to it has been shared.
<jkoenig> right.
<tschwinge> A wrapper is one possibility indeed.
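
A minimal sketch of the wrapper idea discussed here, assuming the wrapper really does hold the only reference to the buffer. `RevocableBuffer` is a made-up name, and the handful of accessors is just enough to show the principle; a real version would need the full ByteBuffer-like interface mentioned above.

```java
import java.nio.ByteBuffer;

/** Hypothetical wrapper holding the sole reference to a buffer that can be revoked. */
final class RevocableBuffer {
    private ByteBuffer buf;  // set to null once revoked

    RevocableBuffer(ByteBuffer buf) { this.buf = buf; }

    /** Returns the buffer, or throws if it has been revoked. */
    private ByteBuffer checked() {
        if (buf == null) throw new IllegalStateException("buffer has been revoked");
        return buf;
    }

    // The underlying ByteBuffer is never handed out, so callers cannot keep
    // a reference that would outlive revocation.
    synchronized byte get(int index)         { return checked().get(index); }
    synchronized void put(int index, byte b) { checked().put(index, b); }
    synchronized int capacity()              { return checked().capacity(); }

    /** Drops the reference, e.g. once the pages have been sent away in a message. */
    synchronized void revoke() { buf = null; }
}
```

After `revoke()`, every accessor throws, which gives the "further attempts to access the buffer would throw exceptions" behaviour at the cost of one extra indirection per access.
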
<antrik> tschwinge: they are called interface *definitions* for a reason
  :-)
<tschwinge> This is a very similar problem as with capabilities when there
  is no revoke operation for these, too.
<tschwinge> antrik: Yes, because they define MIG's input.  :-P
<tschwinge> Isn't that what is called a membrane in the capability world?
<antrik> I do not say that we have to consider the format of the .defs to
  be set in stone; but I do insist on using a canonical machine-parsable
  source for all language bindings
<tschwinge> attenuation
<jkoenig> tschwinge, you mean the revokable proxy construct ? (It's the same
  principle indeed)
<tschwinge>     A common design pattern in object-capability systems: given
  one reference of an object, create another reference for a proxy object
  with certain security restrictions, such as only permitting read-only
  access or allowing revocation. The proxy object performs security checks
  on messages that it receives and passes on any that are allowed. Deep
  attenuation refers to the case where the same attenuation is applied
  transitively to any
<tschwinge>     objects obtained via the original attenuated object,
  typically by use of a "membrane".
<tschwinge> http://en.wikipedia.org/wiki/Object-capability_model
<tschwinge> Yes.
<tschwinge> Good.  I understood something.  ;-)
<tschwinge> antrik: OKAY!  :-P
<tschwinge> jkoenig: And hopefully the JVM will optimize away all the
  additional indirection...  :-D
<tschwinge> jkoenig: Is there anything more to say about the VM layer?
<jkoenig> tschwinge, "hopefully", yes :-)
<tschwinge> Like, the data that I'm sharing -- is it untyped, isn't it?
<jkoenig> tschwinge, you mean that within the received/sent pages ?
<tschwinge> Yes.
<tschwinge> But that's how it is, indeed.
<jkoenig> well actually the type descriptor should indicate what they
  contain.
<tschwinge> I cannot trust anything I receive from externally.
<jkoenig> it's most often used for MACH_MSG_TYPE_CHAR items I guess, and it
  will be type checked when retrieved
<tschwinge> Yeah, and that then just *is* arbitrary data, like a block read
  from a disk file.
<jkoenig> you would have something like: ByteBuffer
  MachMsg.getBuffer(MachMsg.Type expected), and MachMsg would check the
  type descriptor against that which you specified
<tschwinge> Or a packet transmitted over the network.
<tschwinge> OK, yes.
<antrik> jkoenig: in theory ints should be used quite often too. the whole
  purpose of the type descriptors is to allow byte order swapping when
  messages are passed between hosts with different architecture...
<jkoenig> tschwinge, right, except for out-of-line port arrays, which need
  to be handled differently obviously.
<antrik> (which is totally irrelevant for our purposes -- especially since
  the actual network IPC code doesn't exist anymore ;-) )
<jkoenig> antrik, oh, interesting
<tschwinge> Yes, that was one original idea.
<jkoenig> actually my litmus test for what the bindings should be, is you
  should be able to implement such a proxy in Java :-)
<tschwinge> antrik: And hey, you now have processors that can switch
  between different modes during runtime...  :-)
<jkoenig> (although arguably that's a little bit ambitious)
<braunr> tschwinge: there should be bits in page tables to indicate the
  endianness to use on a page .. :)
<tschwinge> Hehe!
<tschwinge> jkoenig: Don't worry -- you're already known for ambitious
  projects.  One more can't hurt.
<jkoenig> Also, actually the word size is not something that I've been able
  to abstract so far, so I'll be hardcoding little-endian 32 bits for now.
<braunr> why is that ?
<antrik> some of the Hurd RPC break the idea anyways BTW
<jkoenig> the org.vmmagic package (from Jikes RVM and JNode) could help
  with that, but GCJ does not support it unfortunately (not sure about
  OpenJDK)
<jkoenig> braunr, Java does not allow us to define new unboxed types
<braunr> jkoenig: does it have its own definition of the word size ?
<jkoenig> braunr, nope.
<jkoenig> (although, maybe, and also we could use JNI to query it)
<braunr> even if virtual, i'd expect a machine to have such a definition
<jkoenig> braunr, maybe it has, but basically in Java nothing depends on
  the word size
<jkoenig> 'int' is 32 bits, 'long' is 64 and that's it.
<braunr> oh right, i remember most types are fixed size, right ?
<jkoenig> right.
<braunr> if not all
<jkoenig> now Jikes RVM's "org.vmmagic" provides an interface to define
  new unboxed types which can depend on the actual word size, but Jikes RVM
  is its own JVM so obviously they can use and provide whatever extensions
  they need :-)
<jkoenig> (but maybe they've implemented them in OpenJDK for bootstrap
  purposes, I'm not sure)
<tschwinge> I'm missing this detail: where does the word size come into
  play here?
<jkoenig> anyway, I _could_ indiscriminately use 'long' for port names, and
  sprinkle the code with word size tests but that would be very clumsy
<braunr> jkoenig: port names are actually ints :/
<jkoenig> tschwinge, the actual format of the message header and type
  descriptors, for instance.
<braunr> jkoenig: ok, got your point
<jkoenig> braunr, by 'long' I mean 64-bits integers (which they are on
  64-bits machines I think?)
<braunr> :)
<braunr> jkoenig: port names are as large as the word size
<braunr> but in C at least, they're int, not long
<braunr> it doesn't change many things, but you get lots of warnings if you
  try with a long :)
<tschwinge> What is the reason that port names are an
  architecture-dependent word size's width, and not simply 32 bit?
<jkoenig> "4 billions of port names should be enough for everyone" :-)
<braunr> tschwinge: an optimization is to use them as pointers in the
  kernel
<antrik> tschwinge: the machine's native word size is what it can process
  most efficiently, and what should be used for most normal
  operations... it makes sense to define stuff as int, except for network
  communication
<tschwinge> jkoenig: Well, yeah, but if you want to communicate with a
  peer, you have to agree on the maximum number anyway (not for port names,
  though, which are local).
<braunr> antrik: int isn't the word size everywhere
<braunr> antrik: the most common type matching the word size is long, at
  least on ILP32/LP64 data models
<antrik> braunr: that's just because some idiots assumed int would always
  be 32 bits, and consequently when 64 architectures came up the compiler
  guys chickened out ;-)
<braunr> without int, you wouldn't have a 32 bits type
<antrik> that's not true for all architectures and/or operating systems
  though AFAIK
<braunr> or a 16 bits one
<braunr> antrik: windows guys got even more scared, so windows 64 is LLP64
<antrik> BTW, I haven't checked, but it's quite possible that 32 bit
  numbers are actually preferable even on AMD64...
<tschwinge> jkoenig: So, back on track.  :-)
<tschwinge> jkoenig: You didn't find anything yet in Mach's VM interfaces
  as well as MemoryObject, etc., that can't be used/implemented in the Java
  world?
<braunr> antrik: they consume less memory, but don't have much effect on
  performance
<jkoenig> tschwinge, once we have the basic system calls and the
  corresponding abstractions in place, I don't think anything else
  fundamentally problematic could possibly show up
<antrik> braunr: if you really *need* a type of a certain bit size, you
  should use stdint types. so not having a 16 or 32 bit type in the
  short/int/long canon is *not* an excuse
<tschwinge> jkoenig: That speaks for the Mach designers!
<braunr> antrik: right
<jkoenig> tschwinge, one trick is that for instance, mach_task_self would
  still be unsafe even if it returned a nicely wrapped Task object, because
  you could still wreck your own address space and threads with it. So we
  would need the "attenuation" pattern mentioned above to provide a safe
  one.
<jkoenig> (which would disallow things such as the port/thread/vm calls)
<braunr> jkoenig: you mentioned the unsafe pseudo exception earlier
<jkoenig> braunr, right, so the issue is with distinguishing safe from
  unsafe methods
<antrik> braunr: BTW, the Windows guys actually broke a lot of stuff by
  fixing long at 32 bits -- this way long doesn't match size_t and pointer
  types anymore, which was an assumption that was true for pretty much any
  system so far...
<tschwinge> jkoenig: Yes.  (And again hope for the JVM to optim...)
<braunr> antrik: that's right :)
<braunr> antrik: that's LLP64
<braunr> antrik: long long and pointers
<jkoenig> braunr, so basically the idea is that unsafe methods are declared
  as "throws Unsafe"
<jkoenig> the effect is that if you use such a method you must either
  "throw Unsafe" yourself,
<jkoenig> or if you're building a safe abstraction on top of Unsafe
  methods, you'll "catch" the "exception" in question to tell the compiler
  that it's okay.
<jkoenig> it's more or less inspired from the "semantic regimes" idea from
  the org.vmmagic paper which is referenced in my original proposal,
<jkoenig> only implemented by hijacking the exception checking machinery,
  which has a behaviour similar to what we want.
<braunr> ok
<braunr> but hmm this seems pretty normal, what's the tricky part ? :)
<tschwinge> braunr: The idea is that the programmer explicitly has to
  acknowledge if he's using an unsafe interface.
<braunr> tschwinge: sounds pretty normal too
<jkoenig> braunr, the trick is that you would not usually declare
  exceptions which are never actually thrown (and actually since the
  compiler does not know it's never thrown, I need to work around it in a
  few places)
<braunr> oh, ok
<braunr> jkoenig: that's interesting indeed
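
The "throws Unsafe" trick can be sketched in a few lines. The class and method names below (`PortNames`, `rawName`, `describe`) are hypothetical and only there to show the mechanism; the marker exception is never instantiated, so the `catch` block exists purely to satisfy the compiler.

```java
/** Marker "exception": never actually thrown, only used for compile-time checking. */
final class Unsafe extends Exception {
    private Unsafe() { }  // cannot be instantiated from outside
}

/** Hypothetical class mixing one unsafe primitive with one safe wrapper. */
final class PortNames {
    /** Unsafe: hands out a raw integer that could be forged or misused. */
    static int rawName(int index) throws Unsafe {
        return 100 + index;  // placeholder value, not a real port name
    }

    /** Safe abstraction: acknowledges the unsafe call by catching the marker. */
    static String describe(int index) {
        try {
            return "port#" + rawName(index);
        } catch (Unsafe e) {
            // Unreachable in practice; the catch tells the compiler it's okay.
            throw new AssertionError("Unsafe is never thrown", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(1));  // prints "port#101"
        // int n = rawName(1);  // would not compile here: unreported exception Unsafe
    }
}
```

Any caller of `rawName` must either declare `throws Unsafe` itself or catch it, so the boundary between safe and unsafe code shows up in method signatures and is enforced by the ordinary checked-exception rules.
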
<jkoenig> braunr, the org.vmmagic paper provides an example which uses some
  annotations called @UncheckedMemoryAccess and @AssertSafe to the same
  effect (which is kind of cleaner), but it would be a headache to
  implement without help from the compiler I think (as far as I can tell
  the annotation processor would have to inspect the bytecode)
<braunr> but hm
<braunr> what's the true problem about this ?
<jkoenig> (the paper advocates "high-level low-level programming" and is a
  very interesting read I think,
  http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.151.5253&rep=rep1&type=pdf,
  for what it's worth)
<braunr> what's wrong if you just declare your methods unsafe and don't
  alter anything else ?
<tschwinge> Yes, I read it and it is interesting.  Unfortunately, it seems
  I forgot most of it again...
<jkoenig> braunr, declare? alter?
<jkoenig> you mean just tag them with an annotation?
<braunr> just stating a method "throws Unsafe"
<jkoenig> braunr, well some compilers will output a warning because they can
  tell there's no way the method is going to throw such an exception.
<jkoenig> and then some other compiler will complain that my
  @SuppressWarnings("unused") does not serve any purpose to them :-)
<jkoenig> also, when initializing final fields, I need to work around the
  fact that the compiler thinks "Unsafe" might be thrown.
<jkoenig> see for instance MachPort.DEAD
<braunr> jkoenig: ok
<jkoenig> braunr, but I'm more than willing to accept this in exchange for
  a clear, compiler-enforced materialization of the border between safe and
  unsafe code.
<jkoenig> actually another question I have is the amount of static typing I
  should add to the safe version, for instance should I subclass MachPort
  into MachSendRight, MachReceiveRight and so on. I don't want to depart
  from the C interface too much but it could be useful.
<braunr> jkoenig: can't answer that :)
<braunr> jkoenig: keep them in mind for later i think
<tschwinge> jkoenig: What's the safety concern w.r.t. having MachPort (not)
  final?
<jkoenig> tschwinge, actually I'm partly wrong in that we only need name()
  and a couple other methods to be final
<tschwinge> jkoenig: That's what I was thinking.  :-)
<tschwinge> I thought I was missing something here.
<jkoenig> tschwinge, the idea is that the user (ie., the adversary :-)
  could extend MachPort and inject their own fake port name into messages
<jkoenig> by overriding name() or clear()
<tschwinge> Yeah, but if these are final, that's not possible.
<jkoenig> right.
<tschwinge> And that *should* be enough, I think.
<tschwinge> Unless I'm missing something.
<jkoenig> I don't think so. Also I hope it is, because as mentioned above
  there might be some value in subclassing MachPort.
<tschwinge> Yep.
<jkoenig> incidentally, declaring the class or the method final will allow
  the JVM to inline them I think.
<tschwinge> It will help the JVM, yes.  It can also figure that out without
  final, though.  (And may have to de-optimize the code again in case there
  are additional classes loaded during run-time.)
<tschwinge> jkoenig: The reference counting in MachPort.  I think I'm
  beginning to understand this.
<jkoenig> oh ok
<jkoenig> tschwinge, yes the javadoc is maybe a bit obscure so far.
<jkoenig> but basically you don't want the port name you acquire to become
  invalid before you're done using it.
<tschwinge> But how is this different from the C world?
<jkoenig> here my goal is to provide some guarantees if you use only safe
  methods
<jkoenig> like, you can't forge a port name and things like that
<jkoenig> so basically it should never be possible to include an invalid
  port name in a message if you use only safe methods.
<tschwinge> Ah, I see!
<tschwinge> Now that does make sense.
<jkoenig> but the mechanism in itself is similar to the Hurd port cells and
  user_link structures
<tschwinge> It's again ``only'' helping the programmer.
<jkoenig> right, no object-capability ulterior motives :-)
<jkoenig> another assumption which the javadoc does not state yet is that
  basically there should be exactly one MachPort object for each mach-level
  port name reference (in the sense of mach_port_mod_refs)
<tschwinge> Yes, I figured out that bit.
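
To tie the last two points together (final accessors and reference counting), here is a small sketch. `SketchPort` and `SketchSendRight` are hypothetical and are not the real MachPort class; in particular, the exact counting scheme is an assumption, and the place where a real implementation would drop the Mach-level reference is only marked by a comment.

```java
/** Hypothetical sketch of a port wrapper; not the real MachPort class. */
class SketchPort {
    private int name;                 // raw port name, or -1 once deallocated
    private int users = 0;            // callers currently holding the raw name
    private boolean dropped = false;  // deallocation has been requested

    SketchPort(int name) { this.name = name; }

    /** final: a subclass cannot override this to inject a forged name. */
    final synchronized int acquireName() {
        if (dropped) throw new IllegalStateException("port has been deallocated");
        users++;
        return name;
    }

    /** Signals that a caller is done with the name obtained from acquireName(). */
    final synchronized void releaseName() {
        if (users == 0) throw new IllegalStateException("unbalanced release");
        users--;
        if (users == 0 && dropped) reallyDeallocate();
    }

    /** Requests deallocation; it is deferred while the name is still in use. */
    final synchronized void deallocate() {
        dropped = true;
        if (users == 0) reallyDeallocate();
    }

    final synchronized boolean isDeallocated() { return name == -1; }

    private void reallyDeallocate() {
        // A real implementation would release the Mach-level reference here.
        name = -1;
    }
}

/** Subclassing for static typing stays possible, since only the methods are final. */
class SketchSendRight extends SketchPort {
    SketchSendRight(int name) { super(name); }
}
```

The point of the deferred deallocation is that a name handed out by `acquireName()` stays valid until the matching `releaseName()`, so a message being built from it never ends up carrying a stale name.
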