Process migration for OpenMPI

OpenMPI is a popular implementation of the MPI standard that is relatively easy to use and extend. We are developing support for migratable MPI processes using the MOSIX framework. Further details are available in our EuroMPI 2012 paper: http://www.mosix.org/pub/Process_Migration_for_OpenMPI.pdf


We have submitted our code to the OpenMPI community for approval to merge it into the mainline. In the meantime, patches for both the stable 1.6 branch and the development trunk are available below. These patches are released under the OpenMPI community license: http://www.open-mpi.org/community/license.php

Please contact us if you have questions or suggestions.

Installation

1. Download the developer version of the OMPI source code from the OMPI repositories; see the instructions at http://www.open-mpi.org/svn/ . Make sure the file "autogen.sh" is present in the top-level directory of the checkout. For example, to check out the trunk into a directory named "openmpi", run:

svn co http://svn.open-mpi.org/svn/ompi/trunk openmpi

2. Download the patch file according to the OpenMPI version you have chosen:

OpenMPI 1.6.x http://www.MOSIX.org/mpi/patch-OpenMPI-branch
OpenMPI trunk http://www.MOSIX.org/mpi/patch-OpenMPI-trunk-no-libevent

3. Enter the top-level directory of the OMPI source you have just downloaded and apply the patch, substituting the name of the patch file you downloaded in step 2. For example:

cd openmpi
patch -p0 < mosix_ompi_trunk.patch

4. Build OMPI from the patched source. This is the step where you can enable or disable optional features. Do not skip the "autogen" phase; otherwise the patch will not be detected. MOSIX support is included by default.

For example, you can run:

./autogen.sh
./configure
make
make install
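
The four steps above can be strung together in one script. The following is only a sketch, not a supported installer: the checkout directory, the location of the patch file and the install prefix are assumptions chosen for illustration, and the script merely prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of installation steps 1-4. Dry run by default: commands are only
# printed. Set DRY_RUN=0 to actually execute them.
DRY_RUN=${DRY_RUN:-1}
SRC_DIR=${SRC_DIR:-openmpi}                          # assumed checkout directory
PATCH_FILE=${PATCH_FILE:-../mosix_ompi_trunk.patch}  # assumed patch location
PREFIX=${PREFIX:-$HOME/openmpi-mosix}                # assumed install prefix

# Print a command and, unless this is a dry run, execute it.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = 0 ]; then
        "$@" || exit 1
    fi
}

install_steps() {
    # Step 1: check out the development trunk.
    run svn co http://svn.open-mpi.org/svn/ompi/trunk "$SRC_DIR"
    # Step 3: enter the source tree and apply the patch from step 2.
    echo "+ cd $SRC_DIR"
    if [ "$DRY_RUN" = 0 ]; then cd "$SRC_DIR" || exit 1; fi
    run patch -p0 -i "$PATCH_FILE"
    # Step 4: regenerate the build system, then configure, build and install.
    run ./autogen.sh
    run ./configure --prefix="$PREFIX"
    run make
    run make install
}

install_steps
```

Review the printed commands first, then rerun with DRY_RUN=0 once the paths match your setup.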

Running OpenMPI over MOSIX

Once OpenMPI has been built successfully with the patch described above, jobs run with "mpirun" automatically use the MOSIX support to launch migratable processes. To fine-tune the launched application, or to use advanced MOSIX features such as multi-cluster, pass the relevant flags to "mpirun" like any other flags.

The MOSIX support patch for OMPI consists of three components, each with its own flags:

1. BTL (Byte Transfer Layer) - Responsible for send/recv actions.

2. ODLS (ORTE Daemon Local Launch Subsystem) - Responsible for launching the application (typically uses the fork system call).

3. RAS (Resource Allocation Subsystem) - Responsible for the job resource allocation.

To list the available flags of a component, run the following, replacing <component> with btl, odls or ras:

ompi_info --param <component> mosix
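
To see the flags of all three components in one go, the command above can be wrapped in a loop. This is a small sketch; the function name list_mosix_params is hypothetical, and the script only queries ompi_info if it is found in PATH.

```shell
#!/bin/sh
# Print the MOSIX parameters of each of the three components named above.
list_mosix_params() {
    for framework in btl odls ras; do
        echo "== $framework =="
        ompi_info --param "$framework" mosix
    done
}

if command -v ompi_info >/dev/null 2>&1; then
    list_mosix_params
else
    echo "ompi_info not found in PATH; install the patched OpenMPI first" >&2
fi
```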

For example, to launch an OpenMPI job with all processes locked (preventing migration), run:

mpirun -mca odls_mosix_migration_lock 1 executable

Usage

Usage examples

  • Lock processes in place (prevent migration): mpirun -mca odls_mosix_migration_lock 1 app

  • Freeze an entire job: start it with mpirun -mca odls_mosix_job_id 5 app and, while it is running, run mosmigrate -J5 freeze

  • Use a custom MOSIX argument (e.g. -XYZ): mpirun -mca odls_mosix_additional_arg "-XYZ" app
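
Several parameters can be combined on one command line by repeating -mca. The helper below is a hypothetical illustration (mosix_mpirun_cmd is not part of OpenMPI or MOSIX): it assumes the "-mca <name> <value>" form used in the last example above, turns name=value pairs into that form, and prints the resulting command without running it.

```shell
#!/bin/sh
# Hypothetical helper: turn name=value pairs into "-mca name value" options
# and print the resulting mpirun command line (nothing is executed).
mosix_mpirun_cmd() {
    app=$1
    shift
    cmd="mpirun"
    for pair in "$@"; do
        name=${pair%%=*}     # text before the first '='
        value=${pair#*=}     # text after the first '='
        cmd="$cmd -mca $name $value"
    done
    echo "$cmd $app"
}

# Lock all processes in place.
mosix_mpirun_cmd app odls_mosix_migration_lock=1
# Tag the job with MOSIX job ID 5 and start each process on the best node.
mosix_mpirun_cmd app odls_mosix_job_id=5 odls_mosix_start_location=0
```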

List of parameters

The parameters of the MOSIX components for OMPI, as reported by "ompi_info --param [odls|ras|btl] mosix", are:

Verbosity level of the BTL framework:

MCA btl: parameter "btl_base_verbose" (current value: <0>, data source: default value)
   

Default selection set of components for the btl framework (<none> means use all components that can be found):

MCA btl: parameter "btl" (current value: <none>, data source: default value)
   

BTL exclusivity (must be >= 0):

MCA btl: parameter "btl_mosix_exclusivity" (current value: <100>, data source: default value)
   

BTL bit flags (general flags: SEND=1, PUT=2, GET=4, SEND_INPLACE=8, RDMA_MATCHED=64, HETEROGENEOUS_RDMA=256; flags only used by the "dr" PML (ignored by others): ACK=16, CHECKSUM=32, RDMA_COMPLETION=128; flags only used by the "bfo" PML (ignored by others): FAILOVER_SUPPORT=512):

MCA btl: parameter "btl_mosix_flags" (current value: <313>, data source: default value)
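
The default value 313 is a sum of the bit flags listed in the description: 313 = 256 + 32 + 16 + 8 + 1. As a quick check, the sketch below decodes any flags value into names. The flag names and values are copied from the description above; the function name decode_btl_flags is hypothetical.

```shell
#!/bin/sh
# Decode a btl_mosix_flags value into the flag names from the description.
decode_btl_flags() {
    value=$1
    names=""
    for entry in SEND=1 PUT=2 GET=4 SEND_INPLACE=8 ACK=16 CHECKSUM=32 \
                 RDMA_MATCHED=64 RDMA_COMPLETION=128 HETEROGENEOUS_RDMA=256 \
                 FAILOVER_SUPPORT=512; do
        name=${entry%%=*}    # flag name, before the '='
        bit=${entry#*=}      # flag value, after the '='
        if [ $(( value & bit )) -ne 0 ]; then
            names="$names $name"
        fi
    done
    echo "${names# }"        # drop the leading space
}

decode_btl_flags 313
# prints: SEND SEND_INPLACE ACK CHECKSUM HETEROGENEOUS_RDMA
```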

Size (in bytes, including header) of "phase 1" fragment sent for all large messages (must be >= 0 and <= eager_limit):

MCA btl: parameter "btl_mosix_rndv_eager_limit" (current value: <65536>, data source: default value)
  

Maximum size (in bytes, including header) of "short" messages (must be >= 1):

MCA btl: parameter "btl_mosix_eager_limit" (current value: <65536>, data source: default value)
  

Maximum size (in bytes) of a single "phase 2" fragment of a long message when using the pipeline protocol (must be >= 1):

MCA btl: parameter "btl_mosix_max_send_size" (current value: <131072>, data source: default value)
  

Approximate maximum bandwidth of interconnect (0 = auto-detect value at run-time [not supported in all BTL modules], >= 1 = bandwidth in Mbps):

MCA btl: parameter "btl_mosix_bandwidth" (current value: <100>, data source: default value)
  

Approximate latency of interconnect (must be >= 0):

MCA btl: parameter "btl_mosix_latency" (current value: <100>, data source: default value)
  

Listed without a description:

MCA btl: parameter "btl_mosix_free_list_num" (current value: <8>, data source: default value)
MCA btl: parameter "btl_mosix_free_list_max" (current value: <-1>, data source: default value)
MCA btl: parameter "btl_mosix_free_list_inc" (current value: <32>, data source: default value)

Upper limit on message length for UDP:

MCA btl: parameter "btl_mosix_udp_max_size" (current value: <0>, data source: default value)
  

Listed without a description:

MCA btl: parameter "btl_mosix_libevent_support" (current value: <0>, data source: default value)
MCA btl: parameter "btl_mosix_priority" (current value: <0>, data source: default value)

Turn on warning messages when certain NICs are not used:

MCA btl: parameter "btl_base_warn_component_unused" (current value: <1>, data source: default value)
  

Time to wait for a process to die after issuing it a kill signal:

MCA odls: parameter "odls_base_sigkill_timeout" (current value: <1>, data source: default value)
  

Default selection set of components for the odls framework (<none> means use all components that can be found):

MCA odls: parameter "odls" (current value: <none>, data source: default value)
  

Verbosity level for the odls framework (0 = no verbosity):

MCA odls: parameter "odls_base_verbose" (current value: <0>, data source: default value)
  

Path to the mosrun/mosenv executable file:

MCA odls: parameter "odls_mosix_executable" (current value: </usr/bin/mosenv>, data source: default value)
  

Only with MOSIX debug version: Path to write debug output (e.g. filename or `tty`):

MCA odls: parameter "odls_mosix_debug_output" (current value: <none>, data source: default value)
  

Additional argument to add right after the executable:

MCA odls: parameter "odls_mosix_additional_arg" (current value: <none>, data source: default value)
  

Additional library for dynamic loading (LD_PRELOAD):

MCA odls: parameter "odls_mosix_ld_preload" (current value: <none>, data source: default value)
  

Host name to serve as a remote node for the process:

MCA odls: parameter "odls_mosix_remote_host_name" (current value: <none>, data source: default value)
  

Host address to serve as a remote node for the process:

MCA odls: parameter "odls_mosix_remote_host_address" (current value: <none>, data source: default value)
  

List of hostnames to run corresponding processes:

MCA odls: parameter "odls_mosix_remote_host_list" (current value: <none>, data source: default value)
  

Location to start the process (0: Best node, 1: At home, 2: Custom):

MCA odls: parameter "odls_mosix_start_location" (current value: <1>, data source: default value)
  

Class permitted for migration in a MOSIX multi-cluster:

MCA odls: parameter "odls_mosix_grid_class" (current value: <0>, data source: default value)
  

Priority in the MOSIX queue (0 for not queuing, default):

MCA odls: parameter "odls_mosix_queue_priority" (current value: <0>, data source: default value)
  

Whether to lock the process in place, preventing migration (if started remotely, it stays there):

MCA odls: parameter "odls_mosix_migration_lock" (current value: <0>, data source: default value)
  

How to handle unsupported system calls (0: Exit, 1: Ignore (default), 2: Report):

MCA odls: parameter "odls_mosix_unsupported_syscalls" (current value: <1>, data source: default value)
  

MOSIX Job ID for all the processes launched in this MPI job:

MCA odls: parameter "odls_mosix_job_id" (current value: <0>, data source: default value)
  

Listed without a description:

MCA odls: parameter "odls_mosix_priority" (current value: <0>, data source: default value)

Whether to display the allocation after it is determined:

MCA ras: parameter "ras_base_display_alloc" (current value: <0>, data source: default value)
  

Whether to display a developer-detail allocation after it is determined:

MCA ras: parameter "ras_base_display_devel_alloc" (current value: <0>, data source: default value)
  

Default selection set of components for the ras framework (<none> means use all components that can be found):

MCA ras: parameter "ras" (current value: <none>, data source: default value)
  

Verbosity level for the ras framework (0 = no verbosity):

MCA ras: parameter "ras_base_verbose" (current value: <0>, data source: default value)

Priority of the mosix ras component (the description string says "loadleveler"):

MCA ras: parameter "ras_mosix_priority" (current value: <90>, data source: default value)  

Wait for enough free resources to prevent oversubscription:

MCA ras: parameter "ras_mosix_prevent_oversubscription" (current value: <0>, data source: default value)