Direct communication

DIRECT COMMUNICATION(M7)       MOSIX Description      DIRECT COMMUNICATION(M7)
  
NAME
    DIRECT COMMUNICATION - migratable sockets between MOSIX processes
  
PURPOSE
    Normally, MOSIX processes do all their I/O (and most system-calls) via
    their home-node: this can be slow because operations are limited by the
    network speed and latency.  Direct communication allows processes to pass
    messages directly between them, bypassing their home-nodes.
  
    For example, if process X, whose home-node is A and which runs on node B,
    wishes to send a message over a socket to process Y, whose home-node is C
    and which runs on node D, then the message has to pass over the network
    from B to A to C to D.  Using direct communication, the message passes
    directly from B to D.  Processes send messages to mailboxes of other
    processes anywhere within the multi-cluster Grid (that are willing to
    accept them).
  
    Direct communication makes the location of processes transparent, so the
    senders do not need to know where the receivers run, but only to identify
    them by their home-node and process-ID (PID) in their home-node.
  
    Direct communication guarantees that the order of messages per receiver
    is preserved, even when the sender(s) and receiver migrate - no matter
    where to and how many times they migrate.
    
SENDING MESSAGES
    To start sending messages to another process, use:
          them = open("/proc/mosix/mbox/{a.b.c.d}/{pid}", 1);
    where {a.b.c.d} is the IP address of the receiver's home-node and {pid}
    is the process-ID of the receiver.  To send messages to a process with
    the same home-node, you can use 0.0.0.0 instead of the local IP address
    (this is even preferable, because it allows the communication to proceed
    in the rare event when the home-node is shut-down from its cluster).
    
    The returned value (them) is not a standard (POSIX) file-descriptor: it
    can only be used within the following system calls:
  
          w = write(them, message, length);
          fcntl(them, F_SETFL, O_NONBLOCK);
          fcntl(them, F_SETFL, 0);
          dup2(them, 1);
          dup2(them, 2);
          close(them);
    
    Zero-length messages are allowed.
    
    Each process may at any time have up to 128 open direct communication
    file-descriptors for sending messages to other processes.  These file-
    descriptors are inherited by child processes (after fork(2)).
    
    When dup2 is used as above, the corresponding file-descriptor (1 for
    standard-output; 2 for standard-error) is associated with sending mes-
    sages to the same process as them.  In that case, only the above calls
    (write, fcntl, close, but not dup2) can then be used with that descriptor.
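
    The sending steps above can be put together roughly as follows.  This is
    only a sketch: the helper names mbox_path and mbox_send_once are
    hypothetical (they are not part of MOSIX), and the PID range check is
    taken from the ENOENT description under ERRORS.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: build "/proc/mosix/mbox/{a.b.c.d}/{pid}".
 * A NULL home_ip selects 0.0.0.0 (receiver has the same home-node).
 * Returns 0 on success, -1 if the PID is out of range or out is too small. */
int mbox_path(char *out, size_t outlen, const char *home_ip, int pid)
{
    assert(out != NULL);
    if (pid < 2 || pid > 32767)          /* valid range according to ERRORS */
        return -1;
    if (home_ip == NULL)
        home_ip = "0.0.0.0";
    int n = snprintf(out, outlen, "/proc/mosix/mbox/%s/%d", home_ip, pid);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* Hypothetical helper: open a sender descriptor, write one message, close.
 * Returns the value of write(), or -1 with errno set. */
ssize_t mbox_send_once(const char *home_ip, int pid,
                       const void *msg, size_t len)
{
    char path[64];
    if (mbox_path(path, sizeof path, home_ip, pid) != 0) {
        errno = EINVAL;
        return -1;
    }
    int them = open(path, 1);            /* second parameter must contain bit 1 */
    if (them == -1)
        return -1;
    ssize_t w = write(them, msg, len);   /* one write() sends one message */
    int saved = errno;
    close(them);
    errno = saved;                       /* report the error from write() */
    return w;
}
```

    For example, mbox_send_once(NULL, 1234, "hello", 5) would attempt to send
    a 5-byte message to process 1234 of the caller's own home-node.  A program
    that sends many messages would normally keep the descriptor open instead
    of re-opening it for every message.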
   
RECEIVING MESSAGES
    To start receiving messages, create a mailbox:
          my_mbox = open("/proc/mosix/mybox", O_CREAT, flags);
    where flags is any combination (bitwise OR) of the following:
   
    1    Allow receiving messages from other users of the same group (GID).
    2    Allow receiving messages from all other users.
    4    Allow receiving messages from processes with other home-nodes.
    8    Do not delay: normally when attempting to receive a message and no
         fitting message was received, the call blocks until either a message
         or a signal arrives, but with this flag, the call returns immedi-
         ately a value of -1 (with errno set to EAGAIN).
    16   Receive a SIGIO signal (see signal(7)) when a message is ready to be
         read (for asynchronous operation).
    32   Normally, when attempting to read and the next message does not fit
         in the read buffer (the message length is bigger than the count
         parameter of the read(2) system-call), the next message is trun-
         cated.  When this bit is set, the first message that fits the read-
         buffer will be read (even if out of order): if none of the pending
         messages fits the buffer, the receiving process either waits for a
         new message that fits the buffer to arrive, or if bit 8 ("do not
         delay") is also set, returns -1 with errno set to EAGAIN.
    64   Treat zero-length messages as an end-of-file condition: once a zero-
         length message is read, all further reads will return 0 (pending and
         future messages are not deleted, so they can still be read once this
         flag is cleared).
   
    The returned value (my_mbox) is not a standard (POSIX) file-descriptor:
    it can only be used within the following system calls:
   
          r = read(my_mbox, buf, count);
          r = readv(my_mbox, iov, niov);
          dup2(my_mbox, 0);
          close(my_mbox);
          ioctl(my_mbox, SIOCINTERESTED, addr); (see FILTERING below).
     
    Reading my_mbox always reads a single message at a time, even when count
    allows reading more messages.  A message can have zero-length, but count
    cannot be zero.
  
    A count of -1 is a special request to test for a message without actually
    reading it.  If a message is present for reading, read(my_mbox, buf, -1)
    returns its length - otherwise it returns -1 with errno set to EAGAIN.
   
    Unlike in "SENDING MESSAGES" above, my_mbox is NOT inherited by child
    processes.
   
    When dup2 is used as above, file-descriptor 0 (standard-input) is associ- 
    ated with receiving messages from other processes, but only the read,
    readv and close system-calls can then be used with file-descriptor 0.
  
    Closing my_mbox (or close(0) if dup2(my_mbox, 0) was used - whichever is
    closed last) discards all pending messages.
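
    A minimal receiving sketch is given below.  The MBOX_* names and the
    helpers mbox_create and mbox_recv are hypothetical: MOSIX only documents
    the numeric flag values, so the names are introduced here purely for
    readability.  Restarting the read after EINTR is a design choice of this
    sketch, not a MOSIX requirement.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical names for the documented mailbox flag bits. */
enum {
    MBOX_ALLOW_GROUP  = 1,   /* accept messages from users of the same GID */
    MBOX_ALLOW_ALL    = 2,   /* accept messages from all other users       */
    MBOX_ALLOW_REMOTE = 4,   /* accept senders with other home-nodes       */
    MBOX_NODELAY      = 8,   /* return -1/EAGAIN instead of blocking       */
    MBOX_SIGIO        = 16,  /* deliver SIGIO when a message is ready      */
    MBOX_FIT_ONLY     = 32,  /* read only messages that fit the buffer     */
    MBOX_ZERO_IS_EOF  = 64   /* zero-length message acts as end-of-file    */
};

/* Create (or re-open with new flags) the mailbox of the calling process. */
int mbox_create(int flags)
{
    return open("/proc/mosix/mybox", O_CREAT, flags);
}

/* Read a single message, restarting when interrupted by a signal.
 * Returns the message length (possibly 0), or -1 with errno set, e.g.
 * EAGAIN when MBOX_NODELAY is set and no fitting message is pending. */
ssize_t mbox_recv(int my_mbox, void *buf, size_t count)
{
    assert(buf != NULL && count > 0);    /* count cannot be zero */
    for (;;) {
        ssize_t r = read(my_mbox, buf, count);
        if (r >= 0 || errno != EINTR)
            return r;
    }
}
```

    A receiver that accepts messages from any user on any home-node without
    blocking could call
    mbox_create(MBOX_ALLOW_ALL | MBOX_ALLOW_REMOTE | MBOX_NODELAY) and then
    poll with mbox_recv, treating EAGAIN as "nothing to read yet".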
   
    To change the flags of the mailbox without losing any pending messages,
    open it again (without using close):
   
          my_mbox = open("/proc/mosix/mybox", O_CREAT, new_flags);
   
    Note that when removing permission-flags (1, 2 and 4) from new_flags,
    messages that were already sent earlier will still arrive, even from
    senders that are no longer allowed to send messages to the current pro-
    cess.  Re-opening always returns the same value (my_mbox) as the initial
    open (unless an error occurs and -1 is returned).  Also note that if
    dup2(my_mbox, 0) was used, new_flags will immediately apply to file-
    descriptor 0 as well.
   
    Extra information is available about the latest message that was read
    (including when the count parameter of the last read() was -1 and no
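    reading actually took place).  To get this information, you should first
    define the following helper function (it needs <errno.h> and <fcntl.h>):
          static inline unsigned int GET_IP(char *file_name)
          {
                /* open() returns the address itself, not a file-descriptor */
                int ip = open(file_name, 0);
                /* on -1 with errno above 255, the address is -errno */
                return((unsigned int)((ip==-1 && errno>255) ? -errno: ip));
          }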
  
    To find the IP address of the sender's home, use:
          sender_home = GET_IP("/proc/self/sender_home");
   
    To find the process-ID (PID) of the sender, use:
          sender_pid = open("/proc/self/sender_pid", 0);
 
    To find the IP address of the node where the sender was running when the
    message was sent, use:
          sender_location = GET_IP("/proc/self/sender_location");
    (this can be used, for example, to request a manual migration to bring
    together communicating processes to the same node)
 
    To find the length of the last message, use:
          bytes = open("/proc/self/message_length", 0);
    (this makes it possible to detect truncated messages: if the last message
    was truncated, bytes will contain the original length).

FILTERING
    The following facility allows the receiver to select which types of
    messages it is interested in receiving:
  
    struct interested
    {
       unsigned char conditions;  /* bitmap of conditions */
       unsigned char testlen;     /* length of test-pattern (1-8 bytes) */
       short pid;                 /* Process-ID of sender */
       unsigned int home;         /* home-node of sender (0 = same home) */
       int minlen;                /* minimum message length */
       int maxlen;                /* maximum message length */
       int testoffset;            /* offset of test-pattern within message */
       unsigned char testdata[8]; /* expected test-pattern */
       int msgno;                 /* pick a specific message (starting from 1) */
       int msgoffset;             /* start reading from given offset */
    };
  
    /* conditions: */
    #define INTERESTED_IN_PID        1
    #define INTERESTED_IN_HOME       2
    #define INTERESTED_IN_MINLEN     4
    #define INTERESTED_IN_MAXLEN     8
    #define INTERESTED_IN_PATTERN   16
    #define INTERESTED_IN_MESSAGENO 32
    #define INTERESTED_IN_OFFSET    64
    #define PREVENT_REMOVAL        128
  
    struct interested filter;
  
    #define SIOCINTERESTED   0x8985
      
   A call to:
        ioctl(my_mbox, SIOCINTERESTED, &filter);
   starts applying the given filter, while a call to:
        ioctl(my_mbox, SIOCINTERESTED, NULL);
   cancels the filtering.  Closing my_mbox also cancels the filtering (but
   re-opening with different flags does not cancel the filtering).
   
   Calls to this ioctl return the address of the previous filter.
  
   When filtering is applied, only messages that comply with the filter are
   received: if there are no complying messages, the receiving process
   either waits for a complying message to arrive, or if bit 8 ("do not
   delay") of the flags from open("/proc/mosix/mybox", O_CREAT, flags) is
   set, read(my_mbox,...) and readv(my_mbox,...) return -1 with errno set to
   EAGAIN.  Filtering can also be used to test for particular messages using
   read(my_mbox, buf, -1).
   
   Different types of messages can be received simply by modifying the con-
   tents of the filter between calls to read(my_mbox,...) (or
   readv(my_mbox,...)).
    
   filter.conditions is a bit-map indicating which condition(s) to consider:
      
   When INTERESTED_IN_PID is set, the process-ID of the sender must match
   filter.pid.
     
   When INTERESTED_IN_HOME is set, the home-node of the sender must match
   filter.home (a value of 0 can be used to match senders from the same
   home-node).
   
   When INTERESTED_IN_MINLEN is set, the message length must be at least
   filter.minlen bytes long.
   
   When INTERESTED_IN_MAXLEN is set, the message length must be no longer
   than filter.maxlen bytes.
   
   When INTERESTED_IN_PATTERN is set, the message must contain a given pat-
   tern of data at a given offset.  The offset within the message is given
   by filter.testoffset, the pattern's length (1 to 8 bytes) in
   filter.testlen and its expected contents in filter.testdata.
   
   When INTERESTED_IN_MESSAGENO is set, the message numbered filter.msgno
   (numbering starts from 1) will be read out of the queue of received mes-
   sages.
    
   When INTERESTED_IN_OFFSET is set, reading begins at the offset
   filter.msgoffset of the message's data.
    
   When PREVENT_REMOVAL is set, read messages are not removed from the mes-
   sage-queue, so they can be re-read until this flag is cleared.
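
   As an illustration of how the conditions combine, the sketch below fills
   in two filters.  The function names filter_from_pid_with_tag and
   filter_browse are hypothetical; the structure and the condition bits are
   copied from the definitions above so that the example is self-contained.

```c
#include <assert.h>
#include <string.h>

struct interested
{
   unsigned char conditions;  /* bitmap of conditions */
   unsigned char testlen;     /* length of test-pattern (1-8 bytes) */
   short pid;                 /* Process-ID of sender */
   unsigned int home;         /* home-node of sender (0 = same home) */
   int minlen;                /* minimum message length */
   int maxlen;                /* maximum message length */
   int testoffset;            /* offset of test-pattern within message */
   unsigned char testdata[8]; /* expected test-pattern */
   int msgno;                 /* pick a specific message (starting from 1) */
   int msgoffset;             /* start reading from given offset */
};

#define INTERESTED_IN_PID        1
#define INTERESTED_IN_HOME       2
#define INTERESTED_IN_PATTERN   16
#define INTERESTED_IN_MESSAGENO 32
#define PREVENT_REMOVAL        128

/* Accept only messages sent by process pid of our own home-node whose
 * first len bytes equal tag.  Returns 0, or -1 if len is not 1 to 8. */
int filter_from_pid_with_tag(struct interested *f, short pid,
                             const void *tag, size_t len)
{
    assert(f != NULL && tag != NULL);
    if (len < 1 || len > sizeof f->testdata)
        return -1;
    memset(f, 0, sizeof *f);
    f->conditions = INTERESTED_IN_PID | INTERESTED_IN_HOME |
                    INTERESTED_IN_PATTERN;
    f->pid = pid;
    f->home = 0;                    /* 0 matches senders from the same home */
    f->testoffset = 0;              /* pattern starts the message */
    f->testlen = (unsigned char)len;
    memcpy(f->testdata, tag, len);
    return 0;
}

/* Select queued message number n (counting from 1) and leave it in the
 * queue after it is read. */
void filter_browse(struct interested *f, int n)
{
    assert(f != NULL);
    memset(f, 0, sizeof *f);
    f->conditions = INTERESTED_IN_MESSAGENO | PREVENT_REMOVAL;
    f->msgno = n;
}
```

   Either filter takes effect once it is passed to
   ioctl(my_mbox, SIOCINTERESTED, &filter); subsequent changes to the same
   structure apply to the following reads.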
      
ERRORS
    Sender errors:
 
    ENOENT  Invalid pathname in open: the specified IP address is not part of
            this cluster/Grid, or the process-ID is out of range (must be
            2-32767).
 
    ESRCH   No such process (this error is detected only when attempting to
            send - not when opening the connection).
 
    EACCES  No permission to send to that process.
 
    ENOSPC  Non-blocking (O_NONBLOCK) was requested and the receiver has no
            more space to accept this message - perhaps try again later.
  
    ECONNABORTED
            The home-node of the receiver is no longer in our multi-
            cluster Grid.
 
    EMFILE  The maximum of 128 direct communication file-descriptors is
            already in use.
 
    EINVAL  When opening, the second parameter does not contain the bit "1";
            when writing, the length is negative or more than 32MB.
 
    ETIMEDOUT
            Failed to establish connection with the mail-box managing daemon
            (postald).
 
    ECONNREFUSED
            The mail-box managing daemon (postald) refused to serve the call
            (probably a MOSIX installation error).
 
    EIO     Communication breakdown with the mail-box managing daemon (postald).
 
    Receiver errors:
 
    EAGAIN  No message is currently available for reading and the "Do not
            delay" flag is set (or count is -1).
   
    EXFULL  Messages were possibly lost (usually due to insufficient memory):
            the receiver may still be able to receive new messages.
 
    ENOMSG  The receiver had insufficient memory to store the last message.
            Despite this error, it is still possible to find out who sent the
            last message and its original length.
    
    EINVAL  One or more values in the filtering structure are illegal or
            their combination makes it impossible to receive any message (for
            example, the offset of the data-pattern is beyond the maximum
            message length).
    
    ENODATA The INTERESTED_IN_MESSAGENO filter is used, and either "no trun-
            cating" was requested (32 in the open-flags) while the message
            does not fit the read buffer, or the message does not fulfil the
            other filtering conditions.
    
    Errors that are common to both sender and receiver:
 
    EINTR   Read/write interrupted by a signal.
 
    ENOMEM  Insufficient memory to complete the operation.
 
    EFAULT  Bad read/write buffer address.
 
    ENETUNREACH
            Could not establish a connection with the mail-box managing
            daemon (postald).
 
    ECONNRESET
            Connection lost with the mail-box managing daemon (postald).
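
    One possible way for a non-blocking sender to act on these errors is
    sketched below.  The classification is an assumption of this example, not
    something MOSIX prescribes: ENOSPC ("perhaps try again later") and EINTR
    are treated as worth retrying, every other error as final.  The names
    send_error_is_transient and send_with_retry are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Assumption of this sketch: only ENOSPC and EINTR are worth a retry. */
int send_error_is_transient(int err)
{
    return err == ENOSPC || err == EINTR;
}

/* Try to send one message up to max_tries times, pausing a little longer
 * after each transient failure.  Returns the value of write() on success,
 * or -1 with errno set to the last error. */
ssize_t send_with_retry(int them, const void *msg, size_t len, int max_tries)
{
    int saved = 0;
    assert(max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        ssize_t w = write(them, msg, len);
        if (w >= 0)
            return w;
        saved = errno;
        if (!send_error_is_transient(saved))
            break;
        /* poll() with no descriptors is used as a millisecond sleep:
         * 1, 2, 4, ... up to 256 ms */
        poll(NULL, 0, 1 << (i < 8 ? i : 8));
    }
    errno = saved;
    return -1;
}
```

    Errors such as ESRCH or EACCES end the loop at once, since repeating the
    same write is not expected to change the outcome.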
 
POSSIBLE APPLICATIONS
    The scope of direct communication is very wide: almost any program that
    requires communication between related processes can benefit.  Following
    are a few examples:
 
    1.   Use direct communication within standard communication packages and
         libraries, such as MPI.
 
    2.   Pipe-like applications where one process' output is the other's
         input: write your own code or use the existing mospipe(1) MOSIX
         utility.
 
    3.   Direct communication can be used to implement fast I/O for migrated
         processes (with the cooperation of a local process on the node where
         the migrated process is running).  In particular, it can be used to
         give migrated processes access to data from a common NFS server
         without causing their home-node to become a bottleneck.
 
LIMITATIONS
    Processes that are involved in direct communication (having open file-
    descriptors for either sending or receiving messages) cannot be check-
    pointed and cannot execute mosrun recursively or native (see mosrun(1)).
 
SEE ALSO
    mosrun(1), mospipe(1), mosix(7).
 
MOSIX                              February 2009                              MOSIX