README: parallel_wrapper
Copyright(C) 2012, Marquette University
Copyright(C) 2012, Johns Hopkins University School of Medicine
For installation instructions, see `INSTALL' in this directory.
1. Using the Parallel Wrapper
2. Wrapper Options
3. Environment Variables
4. Example Script
------------------------------
1. Using the Parallel Wrapper
------------------------------
The parallel wrapper is a convenient executable that helps run
MPICH-2 (hydra) executables across a Condor cluster. The wrapper
sets a number of environment variables that can be used inside a
script to configure and run MPI jobs on Condor. The wrapper is
largely agnostic to the underlying MPI implementation. Most of the
testing occurred while running Condor 7.6 with MVAPICH.
Similar to Condor's java universe, the wrapper is used by
providing it as the Condor executable. The script that the
user wishes to run is then listed in Condor's arguments variable
within the submit file. For example:
## Example Submit File
universe = parallel
executable = parallel_wrapper
args = "my_script.sh"
# Request 4 machines with 4 processors per machine
machine_count = 4
RequestCpus = 4
should_transfer_files = ALWAYS
when_to_transfer_output = ON_EXIT
RequestMemory = 1024
queue
Additional arguments can be provided to the parallel wrapper prior to
supplying the script/executable to run:
executable = parallel_wrapper
args = "--no-timeout my_script.sh --args-to-myscript"
-------------------
2. Wrapper Options
-------------------
Flags:
  --verbose                    verbose mode
  --no-timeout                 disable aborts due to timeouts
Options:
  -h, --help                   this help message
  -r, --rank={value}           set the rank of this host
  -n {value}                   number of processes
  -p, --ports={low:high}       port range to use
  -t, --timeout={value}        set the execute timeouts (sec)
  -k, --ka-interval={value}    interval between subsequent keep-alives
Periodically, the wrapper sends keep-alive signals to the rest of the
hosts to check whether each host is alive. If several keep-alive
signals are missed, the entire job is aborted. The timeout and the
keep-alive interval can be changed with the -t and -k options. If you
do not wish to use keep-alives, use the --no-timeout flag.
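As a sketch of how these options might be combined, the following
shell snippet writes a submit file that sets both a timeout and a
keep-alive interval. The values 120 and 15, the file name
keepalive.submit, and the helper name write_submit_file are
illustrative assumptions, not recommended settings. The README does
not state the unit of -k, and the remaining submit commands simply
mirror the example in section 1.

```shell
#!/bin/bash
# Hypothetical helper: write a submit description that tunes the
# wrapper's timeout (-t) and keep-alive interval (-k).
write_submit_file() {
    out_file="$1"
    timeout="$2"       # passed to --timeout (seconds)
    ka_interval="$3"   # passed to --ka-interval (unit not documented)
    cat > "$out_file" <<EOF
universe = parallel
executable = parallel_wrapper
args = "--timeout=${timeout} --ka-interval=${ka_interval} my_script.sh"
machine_count = 4
RequestCpus = 4
should_transfer_files = ALWAYS
when_to_transfer_output = ON_EXIT
queue
EOF
}

# Illustrative values only.
submit_file="${TMPDIR:-/tmp}/keepalive.submit"
write_submit_file "$submit_file" 120 15
```

The generated file can then be passed to condor_submit in the usual
way.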
-------------------------
3. Environment Variables
-------------------------
The wrapper sets up a number of environment variables which can be
used when constructing a script or running an executable. These
environment variables are in addition to the variables set by
Condor.
Environment Variables:
  [_CONDOR_PROCNO]    set the rank of this host
  [_CONDOR_NUMPROCS]  number of processes
Environment Variables Set By Wrapper:
  [MACHINE_FILE]      the location of the machine file
  [HOST_FILE]         identical to [MACHINE_FILE]
  [SSH_CONFIG]        location of the SSH_CONFIG file
  [SSH_WRAPPER]       wrapper around ssh which uses [SSH_CONFIG]
  [REQUEST_CPUS]      the number of LOCAL cpus allocated (usually 1)
  [NUM_MACHINES]      the number of machines (num_machines)
  [NUM_PROCS]         total CPUs allocated for this task. This is
                      the sum of request cpus across all machines.
  [CPUS]              identical to [NUM_PROCS]
  [RANK]              the rank of this process (MASTER)
  [CLUSTER_ID]        the Condor cluster ID
  [SCRATCH_DIR]       wrapper scratch directory
  [IWD]               the job's initial working directory on the startd
  [SCHEDD_IWD]        the job's initial working directory on the schedd
  [IP_ADDR]           the IP address of this host
  [CMD_PORT]          the command port on this host
  [TRANSFER_FILES]    'TRUE'/'FALSE' depending on if the
                      IWD is shared across all hosts
  [SHARED_FS]         the location of the shared FS for
                      this MPI process. For jobs which have
                      [TRANSFER_FILES]='FALSE', this is a 'fake'
                      shared FS
  [SHARED_DIR]        identical to [SHARED_FS]
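When debugging a job script, it can help to print what the wrapper
actually provided. The following is a minimal sketch, assuming the
variables listed above are exported into the script's environment.
The function name print_wrapper_env and the "(unset)" placeholder are
illustrative choices, not part of the wrapper.

```shell
#!/bin/bash
# Print each wrapper-provided variable, or "(unset)" when the script
# is run outside the wrapper. Aliases (HOST_FILE, CPUS, SHARED_DIR)
# are omitted because they duplicate other entries.
print_wrapper_env() {
    for var in MACHINE_FILE SSH_CONFIG SSH_WRAPPER REQUEST_CPUS \
               NUM_MACHINES NUM_PROCS RANK CLUSTER_ID SCRATCH_DIR \
               IWD SCHEDD_IWD IP_ADDR CMD_PORT TRANSFER_FILES SHARED_FS; do
        # printenv exits non-zero when the variable is not in the environment.
        value=$(printenv "$var") || value="(unset)"
        printf '%-16s %s\n' "$var" "$value"
    done
}

print_wrapper_env
```

Placing this call at the top of a job script puts the values in the
job's output file, which makes a missing or unexpected setting easy
to spot.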
-------------------
4. Example Script
-------------------
The following script is an example of how to run an executable
using MVAPICH-1.5.1-Hydra-1.3b1 (09/09/2010) and the parallel wrapper.
A shared file system is not required for this example. We assume that
the MVAPICH executables (mpiexec.hydra) are available on all machines
in the cluster. The example uses a script called example.sh,
which is transferred to every machine upon job initialization. In
addition, we use an MPI C executable (mpi_executable) to run the test.
---------------------
4a. mpi_executable.c
---------------------
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hello from process %d\n", rank);
    MPI_Finalize();
    return 0;
}
------------------------
4b. Condor Submit Script
------------------------
## Submit Script
universe = parallel
executable = parallel_wrapper
args = "--verbose --no-timeout example.sh"
output = example.out
error = example.err
log = example.log
GetEnv = True
# Request 2 machines with 2 processors per machine
machine_count = 2
RequestCpus = 2
RequestMemory = 100
# Transfer Input Files
should_transfer_files = ALWAYS
when_to_transfer_output = ON_EXIT
transfer_input_files = example.sh, mpi_executable
notification = never
queue
--------------
4c. example.sh
--------------
#!/bin/bash
mpiexec.hydra -f ${MACHINE_FILE} -wdir ${SHARED_FS} -n ${CPUS} \
-bootstrap-exec ${SSH_WRAPPER} mpi_executable
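A more defensive variant of example.sh might check that the wrapper
variables are present before launching, so that a script run outside
the wrapper fails with a clear message. This is a sketch, not part of
the distributed example: the function names require_vars and
run_example and the DRY_RUN switch are hypothetical, and the
mpiexec.hydra command line is the same one used above.

```shell
#!/bin/bash
# Return non-zero if any named variable is missing or empty.
require_vars() {
    for var in "$@"; do
        if [ -z "$(printenv "$var")" ]; then
            echo "error: $var is not set; run through parallel_wrapper" >&2
            return 1
        fi
    done
}

# Launch the MPI executable using the wrapper-provided settings.
# If DRY_RUN (hypothetical) is non-empty, print the command instead.
run_example() {
    require_vars MACHINE_FILE SHARED_FS CPUS SSH_WRAPPER || return 1
    ${DRY_RUN:+echo} mpiexec.hydra -f "${MACHINE_FILE}" -wdir "${SHARED_FS}" \
        -n "${CPUS}" -bootstrap-exec "${SSH_WRAPPER}" mpi_executable
}
```

Calling run_example as the last line of the script gives the same
behaviour as the original example.sh when all variables are set.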
---------------------------------
4d. Expected Output (Approximate)
---------------------------------
Sun Jul 15 10:09:53 INFO: Skipping interface 127.0.0.1 because it is loopback.
Sun Jul 15 10:09:53 INFO: IP Addr: 172.16.0.80
Sun Jul 15 10:09:53 INFO: Bound to command port: 51000
Sun Jul 15 10:09:53 INFO: Using scratch directory at /var/condor/execute/dir_11053/.condor_scratch_nh0dW9
Sun Jul 15 10:09:53 INFO: Waiting for registration from rank 1
Sun Jul 15 10:09:55 INFO: Finished machine registration.
Sun Jul 15 10:09:55 INFO: Using fake file system (/tmp/condor_hydra_1322782_1342364995). IWD's across ranks differ
Hello from process 0
Hello from process 1
Hello from process 2
Hello from process 3
Sun Jul 15 10:09:57 INFO: Successfully removed scratch directory /var/condor/execute/dir_11053/.condor_scratch_nh0dW9