Changes between Version 15 and Version 16 of WRF4G/ExecutionEnvironments
Timestamp: Feb 21, 2013 4:38:47 PM
WRF4G/ExecutionEnvironments
'''mncancel <job_id>''' removes his/her job from the queue system, canceling the execution of the job if it was already running.

=== Job Directives ===

A job must contain a series of directives to inform the batch system about the characteristics of the job. These directives appear as comments in the job script, with the following syntax:

…

The working directory of your job (i.e. where the job will run). If not specified, it is the current working directory at the time the job was submitted.

{{{
#!sh
#@ error = file
}}}

The name of the file to collect the stderr output of the job.

{{{
#!sh
#@ output = file
}}}

The name of the file to collect the standard output (stdout) of the job.

{{{
#!sh
#@ total_tasks = number
}}}

The number of processes to start.

{{{
#!sh
#@ cpus_per_task = number
}}}

The number of cpus allocated for each task. This is useful for hybrid MPI+OpenMP applications, where each process will spawn a number of threads. The number of cpus per task must be between 1 and 16, since each node has 16 cores (one for each thread).

{{{
#!sh
#@ tasks_per_node = number
}}}

The number of tasks allocated in each node. When an application uses more than 3.8 GB of memory per process, it is not possible to fit 16 processes in the same node with its 64 GB of memory. It can be combined with cpus_per_task to allocate the nodes exclusively, i.e. to allocate 2 processes per node, set both directives to 2. The number of tasks per node must be between 1 and 16.

{{{
#!sh
# @ gpus_per_node = number
}}}

The number of GPU cards assigned to the job. This number can be [0,1,2], as there are 2 cards per node.

…

LSF is the utility used at !MareNostrum III for batch processing support, so all jobs must be run through it. This document provides information for getting started with job execution at the Cluster.

=== Submitting Jobs ===

A job is the execution unit for LSF. A job is defined by a text file containing a set of directives describing the job, and the commands to execute. Please bear in mind that there is a limit of 3600 bytes for the size of the text file.

=== LSF Commands ===

These are the basic commands to submit jobs:

…

'''bkill <job_id>''' removes the job from the queue system, canceling the execution of the processes, if they were still running.

=== Job Directives ===

A job must contain a series of directives to inform the batch system about the characteristics of the job. These directives appear as comments in the job script, with the following syntax:

{{{
#!sh
#BSUB -option value
}}}

{{{
#!sh
#BSUB -J job_name
}}}

The name of the job.

{{{
#!sh
#BSUB -q debug
}}}

This queue is only intended for small tests, so there is a limit of 1 job per user, using up to 64 cpus (4 nodes), and one hour of wall clock limit.

{{{
#!sh
#BSUB -W HH:MM
}}}

The limit of wall clock time. This is a mandatory field and you must set it to a value greater than the real execution time for your application and smaller than the time limits granted to the user. Notice that your job will be killed after the elapsed period. NOTE: take into account that you cannot specify seconds in LSF; only hours and minutes (HH:MM) can be given.

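As a quick illustration, the '''-J''', '''-q''' and '''-W''' directives described so far can be combined at the top of a job script. The sketch below is not taken from this page's examples; the job name and the 30-minute limit are placeholders only:

{{{
#!sh
# Minimal illustrative header (placeholder values, not a recommended setup)
#BSUB -J test_job
#BSUB -q debug
#BSUB -W 00:30
}}}

A script with such a header is then handed to LSF with the '''bsub''' command (e.g. '''bsub < job_script''').
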
{{{
#!sh
#BSUB -cwd pathname
}}}

The working directory of your job (i.e. where the job will run). If not specified, it is the current working directory at the time the job was submitted.

{{{
#!sh
#BSUB -e/-eo file
}}}

The name of the file to collect the stderr output of the job. You can use %J for the job_id. The -e option will APPEND to the file, while -eo will REPLACE it.

{{{
#!sh
#BSUB -o/-oo file
}}}

The name of the file to collect the standard output (stdout) of the job. The -o option will APPEND to the file, while -oo will REPLACE it.

{{{
#!sh
#BSUB -n number
}}}

The number of processes to start.

{{{
#!sh
#BSUB -R"span[ptile=number]"
}}}

The number of processes assigned to a node.

We really encourage you to read the manual of the bsub command to find out other specifications that will help you to define the job script.

{{{
#!sh
…
}}}

Sequential job:

{{{
#!sh
…
}}}

Sequential job using OpenMP:

{{{
#!sh
…
}}}

Parallel job:

{{{
#!sh
…
}}}

Parallel job using threads:

{{{
#!sh
…
}}}

=== Running Jobs ===

Slurm+MOAB is the new utility used at Tirant for batch processing support. We moved from !LoadLeveler and all the jobs must be run through this new batch system. We tried to keep it as simple as possible, keeping the syntax of !LoadLeveler to make the transition easier for those who used !LoadLeveler in the past. This document provides information for getting started with job execution at Tirant.

==== Classes ====

The user's limits are assigned automatically to each particular user (depending on the resources granted by the Access Committee) and there is no reason to explicitly set the #@class directive. Anyway, you are allowed to use the special class "debug" in order to perform some fast, short tests. To use the "debug" class you need to include the #@class directive (see the sketch after the table).

Table 1: Classes

|| '''Class''' || '''Max CPUs''' || '''CPU Time''' || '''Wall time limit''' ||
|| debug || 64 || 10 min || 10 min ||

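For instance, a job meant for the "debug" class could include something like the following in its script header. This is only an illustrative sketch using the directives described in the next section, and the values must stay within the limits of Table 1:

{{{
# Illustrative sketch: request the "debug" class for a short test
# (the wall clock limit must not exceed the 10-minute limit of this class)
# @ class = debug
# @ wall_clock_limit = 00:10:00
}}}
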
The specific limits assigned to each user depend on the priority granted by the access committee. Users granted with "high priority hours" will have access to a maximum of 1024 CPUs and a maximum wall clock limit of 72 …

Local users of Tirant have access to a new local queue called "class t". This queue has the same parameters as "class a", but the priority is modified to restrict local time consumption to 20% of the total computation time.

 * debug: This class is reserved for testing the applications before submitting them to the 'production' queues. Only one job per user is allowed to run simultaneously in this queue, and the execution time will be limited to 10 minutes. The maximum number of nodes per application is 32. Only a limited number of jobs may be running at the same time in this queue.

…

 * '''mncancel <jobid>''' removes his/her job from the queue system, canceling the execution of the processes, if they were already running.
 * '''checkjob <jobid>''' obtains detailed information about a specific job, including the assigned nodes and the possible reasons preventing the job from running.
 * '''mnstart''' shows information about the estimated time for the specified job to be executed.

=== Job Directives ===

A job must contain a series of directives to inform the batch system about the characteristics of the job. These directives appear as comments in the job script, with the following syntax:

{{{
# @ directive = value
}}}

Additionally, the job script may contain a set of commands to execute. If not, an external script must be provided with the 'executable' directive. Here you may find the most common directives:

{{{
# @ class = class_name
}}}

The partition where the job is to be submitted. Leave this field empty unless you need to use the "interactive" or "debug" partitions.

{{{
# @ wall_clock_limit = HH:MM:SS
}}}

The limit of wall clock time. This is a mandatory field and you must set it to a value greater than the real execution time for your application and smaller than the time limits granted to the user. Notice that your job will be killed after the elapsed period.

{{{
# @ initialdir = pathname
}}}

The working directory of your job (i.e. where the job will run). If not specified, it is the current working directory at the time the job was submitted.

{{{
# @ error = file
}}}

The name of the file to collect the stderr output of the job.

{{{
# @ output = file
}}}

The name of the file to collect the standard output (stdout) of the job.

{{{
# @ total_tasks = number
}}}

The number of processes to start.

…

Serial job:

{{{
#!sh
…
}}}

Parallel job:

{{{
#!sh
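# Illustrative sketch of a parallel job, built only from the directives
# described above; the values, the binary name "parallel_binary" and the
# use of srun as launcher are placeholders/assumptions, not prescriptions
# from this page.
# @ total_tasks = 32
# @ wall_clock_limit = 01:00:00
# @ initialdir = .
# @ error = parallel.err
# @ output = parallel.out

srun ./parallel_binary
}}}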