{{{
ssh <userid>@mn1.bsc.es
}}}

=== Running Jobs ===
LSF is the utility used at MareNostrum III for batch processing support, so all jobs must be run through it. This section provides the information needed to get started with job execution on the cluster.
==== Submitting jobs ====
A job is the execution unit for LSF. A job is defined by a text file containing a set of directives describing the job, and the commands to execute. Please bear in mind that there is a limit of 3600 bytes on the size of this text file.
===== LSF commands =====
These are the basic commands to submit and manage jobs:

 * bsub < job_script: submits a "job script" to the queue system (see below for job script directives). Remember to pass it through STDIN with '<'.
 * bjobs [-w][-X][-l job_id]: shows all the submitted jobs.
 * bkill <job_id>: removes the job from the queue system, cancelling the execution of its processes if they were still running.
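
As a quick illustration, a typical submit/monitor/cancel cycle might look like the sketch below (the script name job.cmd and the job id 123456 are placeholders):
{{{
#!sh
# Submit the job script via STDIN; LSF replies with the assigned job id
bsub < job.cmd

# Check the status of your submitted jobs (-w gives wide, untruncated output)
bjobs -w

# Cancel a job that is no longer needed, using its job id
bkill 123456
}}}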
===== Job directives =====
A job must contain a series of directives to inform the batch system about the characteristics of the job. These directives appear as comments in the job script, with the following syntax:
#BSUB -option value
#BSUB -J job_name
The name of the job.
#BSUB -q debug
This queue is only intended for small tests; there is a limit of 1 job per user, using up to 64 CPUs (4 nodes), with a wall clock limit of one hour.
#BSUB -W HH:MM
The wall clock time limit. This is a mandatory field: you must set it to a value greater than the real execution time of your application and smaller than the time limits granted to the user. Note that LSF does not let you specify seconds, and that your job will be killed once this period has elapsed.
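
For example, to request a wall clock limit of one hour and thirty minutes:
{{{
#!sh
#BSUB -W 01:30
}}}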
#BSUB -cwd pathname
The working directory of your job (i.e. where the job will run). If not specified, it is the current working directory at the time the job was submitted.
#BSUB -e/-eo file
The name of the file to collect the stderr output of the job. You can use %J for the job_id. The -e option will APPEND to the file; -eo will REPLACE it.
#BSUB -o/-oo file
The name of the file to collect the standard output (stdout) of the job. The -o option will APPEND to the file; -oo will REPLACE it.
#BSUB -n number
The number of processes to start.
| 215 | #BSUB -R"span[ptile=number]" |
| 216 | The number of processes assigned to a node. |
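
As an illustration of how -n and ptile interact (assuming the 16-core nodes implied by the debug queue limits above), requesting 32 processes at 16 per node spreads the job across 32/16 = 2 nodes:
{{{
#!sh
# 32 processes in total, 16 per node -> 2 nodes
#BSUB -n 32
#BSUB -R"span[ptile=16]"
}}}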
We encourage you to read the manual of the bsub command to find out other options that will help you define your job script:
man bsub
===== Examples =====

Sequential job:
{{{
#!sh
#!/bin/bash
#BSUB -n 1
#BSUB -oo output_%J.out
#BSUB -eo output_%J.err
#BSUB -J sequential
#BSUB -W 00:05

./serial.exe
}}}

The job would be submitted using (assuming the script above was saved as ptest.cmd):
{{{
#!sh
bsub < ptest.cmd
}}}

Sequential job using OpenMP:
{{{
#!sh
#!/bin/bash
#BSUB -n 1
#BSUB -oo output_%J.out
#BSUB -eo output_%J.err
#BSUB -J sequential_OpenMP
#BSUB -W 00:05

# Number of OpenMP threads the application will use
export OMP_NUM_THREADS=16

./serial.exe
}}}
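
Note that this script requests a single slot (-n 1) while spawning 16 threads. On MareNostrum III's 16-core nodes, the threads can only count on a full node if it is not shared with other jobs, so you may want to combine this with exclusive use (-x, shown in the next example); this is worth verifying against the batch system configuration.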

Parallel job:
{{{
#!sh
#!/bin/bash
#BSUB -n 128
#BSUB -o output_%J.out
#BSUB -e output_%J.err
# In order to launch 128 processes with 16 processes per node:
#BSUB -R"span[ptile=16]"
# Request exclusive use of the allocated nodes:
#BSUB -x
#BSUB -J parallel
#BSUB -W 02:00

# You can choose the parallel environment through modules
module load intel openmpi
mpirun ./wrf.exe
}}}
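
With 128 processes at 16 per node, this job will be allocated 128/16 = 8 nodes, and the -x directive requests that no other job share them.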

Parallel job using threads:
{{{
#!sh
#!/bin/bash
# The total number of MPI processes:
#BSUB -n 128
#BSUB -oo output_%J.out
#BSUB -eo output_%J.err
# It will allocate 4 MPI processes per node:
#BSUB -R"span[ptile=4]"
# Request exclusive use of the allocated nodes:
#BSUB -x
#BSUB -J hybrid
#BSUB -W 02:00

# You can choose the parallel environment through modules
module load intel openmpi

# 4 MPI processes per node and 16 CPUs available
# (4 threads per MPI process):
export OMP_NUM_THREADS=4

mpirun ./wrf.exe
}}}
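
In this hybrid layout, the 128 MPI processes at 4 per node occupy 128/4 = 32 nodes; with 16 cores per node and OMP_NUM_THREADS=4, the 4 processes on each node run 4 threads each, so all 16 cores are busy.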