Changes between Version 15 and Version 16 of WRF4G/ExecutionEnvironments


Timestamp:
Feb 21, 2013 4:38:47 PM
Author:
carlos
Comment:

--

  • WRF4G/ExecutionEnvironments

    v15 v16  
    2727    '''mncancel <job_id>''' removes his/her job from the queue system, canceling the execution of the job if it was already running.
    2828
    29 === Job directives ===
     29=== Job Directives ===
    3030
    3131A job must contain a series of directives to inform the batch system about the characteristics of the job. These directives appear as comments in the job script, with the following syntax:
     
    5757
    5858The working directory of your job (i.e. where the job will run). If not specified, it is the current working directory at the time the job was submitted.
     59
    5960{{{
    6061#!sh
     
    6364
    6465The name of the file to collect the stderr output of the job.
     66
    6567{{{
    6668#!sh
    6769#@ output = file
    6870}}}
     71
    6972The name of the file to collect the standard output (stdout) of the job.
     73
    7074{{{
    7175#!sh
    7276#@ total_tasks = number
    7377}}}
     78
    7479The number of processes to start.
     80
    7581{{{
    7682#!sh
    7783#@ cpus_per_task = number
    7884}}}
     85
    7986The number of CPUs allocated for each task. This is useful for hybrid MPI+OpenMP applications, where each process will spawn a number of threads. The number of CPUs per task must be between 1 and 16, since each node has 16 cores (one core for each thread).
     87
    8088{{{
    8189#!sh
    8290#@ tasks_per_node = number
    8391}}}
     92
    8493The number of tasks allocated in each node. When an application uses more than 3.8 GB of memory per process, it is not possible to run 16 processes in the same node with its 64 GB of memory. It can be combined with cpus_per_task to allocate the nodes exclusively, e.g. to allocate 2 processes per node, set both directives to 2. The number of tasks per node must be between 1 and 16.
     94
    8595{{{
    8696#!sh
    8797# @ gpus_per_node = number
    8898}}}
     99
    89100The number of GPU cards assigned to the job. This number can be [0,1,2] as there are 2 cards per node.
    90101
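Putting these directives together, a job script for this system might look like the following sketch. The file names, task counts and application binary below are only illustrative placeholders, and the launcher line (srun) is an assumption that should be adapted to how MPI jobs are started on the machine:

{{{
#!sh
# Working directory, output files and resources for the job (example values).
# @ initialdir = .
# @ output = myjob.out
# @ error = myjob.err
# @ total_tasks = 8
# @ cpus_per_task = 2
# @ tasks_per_node = 8
# @ gpus_per_node = 1

# Start the application; replace srun/my_app with the launcher and binary you actually use.
srun ./my_app
}}}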
     
    153164LSF is the utility used at !MareNostrum III for batch processing support, so all jobs must be run through it. This document provides information for getting started with job execution at the Cluster.
    154165
    155 === Submitting jobs ===
     166=== Submitting Jobs ===
    156167A job is the execution unit for LSF. A job is defined by a text file containing a set of directives describing the job and the commands to execute. Please bear in mind that there is a limit of 3600 bytes for the size of the text file.
    157168
    158 === LSF commands ===
     169=== LSF Commands ===
    159170These are the basic commands to submit and manage jobs (a short usage sketch follows the list):
    160171
     
    165176    '''bkill <job_id>''' removes the job from the queue system, canceling the execution of the processes if they were still running.
    166177
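As a quick reference, a typical LSF session could look like the sketch below; the script name and job id are placeholders:

{{{
#!sh
bsub < job_script.cmd    # submit the job described in job_script.cmd
bjobs                    # list your jobs and their current state
bkill 123456             # cancel job 123456 if it is no longer needed
}}}
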
    167 === Job directives ===
     178=== Job Directives ===
    168179A job must contain a series of directives to inform the batch system about the characteristics of the job. These directives appear as comments in the job script, with the following syntax:
     180
    169181{{{
    170182#!sh
    171183#BSUB -option value
    172184}}}
     185
    173186{{{
    174187#!sh
    175188#BSUB -J job_name
    176189}}}
     190
    177191The name of the job.
     192
    178193{{{
    179194#!sh
    180195#BSUB -q debug
    181196}}}
     197
    182198This queue is only intended for small tests, so there is a limit of 1 job per user, using up to 64 CPUs (4 nodes), with a wall clock limit of one hour.
     199
    183200{{{
    184201#!sh
    185202#BSUB -W HH:MM
    186203}}}
     204
    187205The limit of wall clock time. This is a mandatory field and you must set it to a value greater than the real execution time for your application and smaller than the time limits granted to the user. Notice that your job will be killed after the elapsed period. NOTE: you cannot specify seconds in LSF.
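For example, a two-hour wall clock limit (an illustrative value) would be requested as:

{{{
#!sh
#BSUB -W 02:00
}}}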
     206
    188207{{{
    189208#!sh
    190209#BSUB -cwd pathname
    191210}}}
     211
    192212The working directory of your job (i.e. where the job will run). If not specified, it is the current working directory at the time the job was submitted.
     213
    193214{{{
    194215#!sh
    195216#BSUB -e/-eo file
    196217}}}
     218
    197219The name of the file to collect the stderr output of the job. You can use %J for the job_id. The -e option will APPEND to the file, while -eo will REPLACE it.
     220
    198221{{{
    199222#!sh
    200223#BSUB -o/-oo file
    201224}}}
     225
    202226The name of the file to collect the standard output (stdout) of the job. The -o option will APPEND to the file, while -oo will REPLACE it.
     227
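For instance, the following lines (file names are placeholders) send stdout and stderr to per-job files, replacing them on every run:

{{{
#!sh
#BSUB -oo output_%J.out
#BSUB -eo output_%J.err
}}}
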
    203228{{{
    204229#!sh
    205230#BSUB -n number
    206231}}}
     232
    207233The number of processes to start.
     234
    208235{{{
    209236#!sh
    210237#BSUB -R"span[ptile=number]"
    211238}}}
     239
    212240The number of processes assigned to a node.
    213241
    214242We strongly encourage you to read the bsub manual page to find out other options that will help you define the job script.
     243
    215244{{{
    216245#!sh
     
    221250
    222251Sequential job :
     252
    223253{{{
    224254#!sh
     
    234264
    235265Sequential job using OpenMP :
     266
    236267{{{
    237268#!sh
     
    249280
    250281Parallel job :
     282
    251283{{{
    252284#!sh
     
    267299
    268300Parallel job using threads:
     301
    269302{{{
    270303#!sh
     
    296329
    297330=== Running Jobs ===
     331
    298332Slurm+MOAB is the new utility used at Tirant for batch processing support.
    299 We moved from LoadLeveler and all the jobs must be run through this new
    300 batch system. We tried to keep it as simple as posible keeping the syntax from the LoadLeveler to make the transition easier to those who used Loadleveler in the past. This document provides information for getting started with job execution at Tirant.
     333We moved from !LoadLeveler and all the jobs must be run through this new
     334batch system. We tried to keep it as simple as possible, keeping the syntax from !LoadLeveler to make the transition easier for those who used !LoadLeveler in the past. This document provides information for getting started with job execution at Tirant.
    301335
    302336==== Classes ====
    303337
    304 The user’s limits are assigned automatically to each particular user (depending on the resources granted by the Access Committee) and there is no reason to explicitly set the #@class directive. Anyway you are allowed to use the special class: ”debug” in order to performe some fast short tests. To use the ”debug” class you need to include the #@class directive
    305 Table 1: Classes
    306 Class debug Max CPUs 64 CPU Time 10 min Wall time limit 10 min
     338The user’s limits are assigned automatically to each particular user (depending on the resources granted by the Access Committee) and there is no reason to explicitly set the #@class directive. However, you are allowed to use the special class "debug" in order to perform some quick tests. To use the "debug" class you need to include the #@class directive.
     339
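For example, to run a quick test in the debug class, the job script would simply include the class directive described below:

{{{
#!sh
# @ class = debug
}}}
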
     340Table 1: Classes. The debug class allows a maximum of 64 CPUs, 10 min of CPU time and a wall time limit of 10 min.
    307341The specific limits assigned to each user depend on the priority granted
    308342by the access committee. Users granted "high priority hours" will have access to a maximum of 1024 CPUs and a maximum wall clock limit of 72
     
    312346Local users of Tirant have access to a new local queue called "class t".
    313347This queue has the same parameters as "class a", but the priority is modified to restrict local time consumption to 20% of total computation time.
    314  *debug: This class is reserved for testing the applications before submitting them to the 'production' queues. Only one job per user is allowed to run simultaneously in this queue, and the execution time will be limited to 10 minutes. The maximum number of nodes per application
     348
     349 * debug: This class is reserved for testing the applications before submitting them to the 'production' queues. Only one job per user is allowed to run simultaneously in this queue, and the execution time will be limited to 10 minutes. The maximum number of nodes per application
    315350is 32. Only a limited number of jobs may be running at the same time
    316351in this queue.
     
    330365 * '''mncancel <jobid>''' removes his/her job from the queue system, canceling the execution of the processes, if they were already running.
    331366 * '''checkjob <jobid>''' obtains detailed information about a specific job, including the assigned nodes and the possible reasons preventing the job from running.
    332  * '''mnstart''' shows information about the estimated time for the especified job to be executed.
     367 * '''mnstart''' shows information about the estimated time for the specified job to be executed.
    333368
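A typical sequence with these commands might look like the following sketch; the script name and job id are placeholders, and mnsubmit is assumed to be the submission command on this system:

{{{
#!sh
mnsubmit job_script.cmd   # submit the job script to the queue
checkjob 123456           # detailed information about job 123456
mncancel 123456           # cancel job 123456 if it is no longer needed
}}}
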
    334369=== Job Directives ===
     370
    335371A job must contain a series of directives to inform the batch system about
    336372the characteristics of the job. These directives appear as comments in the
     
    340376# @ directive = value
    341377}}}
     378
    342379Additionally, the job script may contain a set of commands to execute.
    343380If not, an external script must be provided with the 'executable' directive.
    344381Here you may find the most common directives:
     382
    345383{{{
    346384# @ class = class_name
    347385}}}
     386
    348387The partition where the job is to be submitted. Leave this field empty
    349388unless you need to use the "interactive" or "debug" partitions.
     389
    350390{{{
    351391# @ wall_clock_limit = HH:MM:SS
    352392}}}
     393
    353394The limit of wall clock time. This is a mandatory field and you must set
    354395it to a value greater than the real execution time for your application and smaller than the time limits granted to the user. Notice that your job will be killed after the elapsed period.
     396
    355397{{{
    356398# @ initialdir = pathname
    357399}}}
     400
    358401The working directory of your job (i.e. where the job will run). If not
    359402specified, it is the current working directory at the time the job was submitted.
     403
    360404{{{
    361405# @ error = file
    362406}}}
     407
    363408The name of the file to collect the stderr output of the job.
     409
    364410{{{
    365411# @ output = file
    366412}}}
     413
    367414The name of the file to collect the standard output (stdout) of the job.
     415
    368416{{{
    369417# @ total_tasks = number
    370418}}}
     419
    371420The number of processes to start.
    372421
     
    374423
    375424Serial job:
     425
    376426{{{
    377427#!sh
     
    388438
    389439Parallel job :
     440
    390441{{{
    391442#!sh