Changes between Version 12 and Version 13 of WRF4G/ExecutionEnvironments


Timestamp:
Feb 21, 2013 3:28:42 PM (9 years ago)
Author:
carlos
Comment:

--

  • WRF4G/ExecutionEnvironments

    v12 v13  
    231231
    232232./serial.exe
    233 }}}
    234 
    235 The job would be submitted using:
    236 {{{
    237 #!sh
    238 bsub < ptest.cmd
    239233}}}
    240234
     
    299293For more information about !MareNostrum III see [http://www.bsc.es/support/MareNostrum3-ug.pdf ]
    300294
    301 
    302 Running Jobs
     295== Tirant ==
     296
     297=== Running Jobs ===
    303298Slurm+MOAB is the new utility used at Tirant for batch processing support.
    304299We moved from LoadLeveler and all the jobs must be run through this new
    305 batch system. We tried to keep it as simple as posible keeping the syntax from
    306 the LoadLeveler to make the transition easier to those who used Loadleveler
    307 in the past. This document provides information for getting started with job
    308 execution at Tirant.
    309 11
    310 5.1 Classes
    311 5.1
    312 Tirant User’s Guide
    313 Classes
    314 The user’s limits are assigned automatically to each particular user (depend-
    315 ing on the resources granted by the Access Committee) and there is no reason
    316 to explicitly set the #@class directive. Anyway you are allowed to use the
    317 special class: ”debug” in order to performe some fast short tests. To use the
    318 ”debug” class you need to include the #@class directive
     300batch system. We tried to keep it as simple as possible, keeping the LoadLeveler syntax to ease the transition for those who used LoadLeveler in the past. This document provides information for getting started with job execution at Tirant.
     301
     302==== Classes ====
     303
     304The user’s limits are assigned automatically to each particular user (depending on the resources granted by the Access Committee) and there is no reason to explicitly set the #@class directive. However, you may use the special class "debug" to run short, fast tests. To use the "debug" class you need to include the #@class directive.
    319305Table 1: Classes
    320 Class
    321 debug
    322 Max CPUs
    323 64
    324 CPU Time
    325 10 min
    326 Wall time limit
    327 10 min
     306Class: debug | Max CPUs: 64 | CPU Time: 10 min | Wall time limit: 10 min
    328307The specific limits assigned to each user depend on the priority granted
    329 by the access committee. Users granted with ”high priority hours” will have
    330 access to a maximum of 1024 CPUs and a maximum wall clock limit of 72
     308by the access committee. Users granted "high priority hours" will have access to a maximum of 1024 CPUs and a maximum wall clock limit of 72
    331309hours. For users with "low priority hours" the limits are 1024 CPUs and 24
    332310hours. If you need to increase these limits please contact the support group.
     311
    333312Local users of Tirant have access to a new local queue called "class t".
    334 This queue has same parameters as ”class a”, but the priority is modified to
    335 restrict local time comsumption to 20% of total computation time.
    336 • debug: This class is reserved for testing the applications before submit-
    337 ting them to the ’production’ queues. Only one job per user is allowed
    338 to run simultaneously in this queue, and the execution time will be
    339 limited to 10 minutes. The maximum number of nodes per application
     313This queue has the same parameters as "class a", but its priority is modified to restrict local time consumption to 20% of total computation time.
     314 * debug: This class is reserved for testing applications before submitting them to the "production" queues. Only one job per user is allowed to run simultaneously in this queue, and the execution time will be limited to 10 minutes. The maximum number of nodes per application
    340315is 32. Only a limited number of jobs may be running at the same time
    341316in this queue.
    342 The specifications for each class may be adjusted in the future to adapt
    343 to changing requirements.
    344 12
    345 5.2 Submitting Jobs
    346 5.2
    347 Tirant User’s Guide
    348 Submitting Jobs
     317The specifications for each class may be adjusted in the future to adapt to changing requirements.
     318
     319=== Submitting Jobs ===
     320
    349321A job is the execution unit for SLURM. We have created wrappers to
    350322ease the adaptation to the new batch system for those users who have
    351323already used Tirant and MareNostrum in the past, so the commands are
    352 quite similar to the former Loadleveler commands. A job is defined by a text
    353 file containing a set of directives describing the job, and the commands to
    354 execute.
    355 5.2.1
     324quite similar to the former LoadLeveler commands. A job is defined by a text file containing a set of directives describing the job and the commands to execute.
     325
    356326==== SLURM wrapper commands ====
     327
    357328These are the basic commands to submit and manage jobs:
    358 mnsubmit <j o b s c r i p t >
    359 submits a ’job script’ to the queue system (see below for job script direc-
    360 tives).
    361 mnq
    362 shows all the jobs submitted.
    363 mncancel <j o b i d >
    364 remove his/her job from the queue system, canceling the execution of the
    365 processes, if they were already running.
    366 c h e c k j o b <j o b i d >
    367 obtains detailed information about a specific job, including the assigned
    368 nodes and the possible reasons preventing the job from running.
    369 mnstart
    370 shows information about the estimated time for the especified job to be
    371 executed.
    372 5.2.2
    373 Job Directives
     329
     330 * '''mnsubmit <jobscript>''' submits a job script to the queue system (see below for job script directives).
     331 * '''mnq''' shows all the submitted jobs.
     332 * '''mncancel <jobid>''' removes the job from the queue system, cancelling the execution of its processes if they were already running.
     333 * '''checkjob <jobid>''' obtains detailed information about a specific job, including the assigned nodes and the possible reasons preventing the job from running.
     334 * '''mnstart''' shows information about the estimated time for the specified job to be executed.
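
As a rough illustration of how the wrapper commands above fit together, the following sketch walks through a submit, inspect and cancel cycle. The file name `job.cmd` and the job id `12345` are placeholders invented for this example, and the `run` helper is hypothetical, not part of the Tirant tooling. By default the script only prints the commands; it executes them only when `DRY_RUN=0` is set on a system where the wrappers are installed.

```shell
#!/bin/sh
# Sketch of a typical session with the wrapper commands listed above.
# Assumptions: job.cmd and 12345 are placeholders, not real names or ids.

# run CMD [ARGS...]: print the command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

jobscript=job.cmd   # placeholder job script name
jobid=12345         # placeholder job id

run mnsubmit "$jobscript"   # submit the job script to the queue system
run mnq                     # show all the submitted jobs
run checkjob "$jobid"       # detailed information about one job
run mncancel "$jobid"       # remove the job from the queue system
```

Keeping dry-run as the default is a deliberate choice here, so that the placeholder job id can never cancel a real job by accident.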
     335
     336==== Job Directives ====
    374337A job must contain a series of directives to inform the batch system about
    375338the characteristics of the job. These directives appear as comments in the
    376339job script, with the following syntax:
    377 13
    378 5.2 Submitting Jobs
    379 Tirant User’s Guide
    380 # @ d i r e c t i v e = value
     340
     341{{{
     342# @ directive = value
     343}}}
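
To make the directive syntax concrete, here is a minimal sketch that writes a small job script and then lists the directive/value pairs found in it. The helper `list_directives` and the file contents are hypothetical and exist only for illustration; the directive names (`initialdir`, `output`, `error`) are taken from this page, while the values are made up.

```shell
#!/bin/sh
# list_directives FILE: print "name=value" for each "# @ name = value" line.
# Hypothetical helper for illustration; not part of the batch system.
list_directives() {
    sed -n 's/^#[[:space:]]*@[[:space:]]*\([A-Za-z_]*\)[[:space:]]*=[[:space:]]*\(.*\)$/\1=\2/p' "$1"
}

# Write an example job script to a temporary file (contents are made up).
job=$(mktemp)
trap 'rm -f "$job"' EXIT
cat > "$job" <<'EOF'
#!/bin/bash
# @ initialdir = .
# @ output = job.out
# @ error = job.err
./serial.exe
EOF

# Directives are ordinary shell comments, so only the batch system reads them.
list_directives "$job"
```

Because directives are plain comments, the same file is still a valid shell script, which is why the commands to execute can follow them directly.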
    381344Additionally, the job script may contain a set of commands to execute.
    382345If not, an external script must be provided with the 'executable' directive.
    383346Here you may find the most common directives:
    384 # @ c l a s s = c l a s s \ name
     347{{{
     348# @ class = class_name
     349}}}
    385350The partition where the job is to be submitted. Leave this field empty
    386351unless you need to use the "interactive" or "debug" partitions.
    387 # @ w a l l c l o c k l i m i t = HH:M SS
    388 M:
     352{{{
     353# @ wall_clock_limit = HH:MM:SS
     354}}}
    389355The limit of wall clock time. This is a mandatory field and you must set
    390 it to a value greater than the real execution time for your application and
    391 smaller than the time limits granted to the user. Notice that your job will
    392 be killed after the elapsed period.
    393 # @ i n i t i a l d i r = pathname
     356it to a value greater than the real execution time for your application and smaller than the time limits granted to the user. Note that your job will be killed once this limit is reached.
     357{{{
     358# @ initialdir = pathname
     359}}}
    394360The working directory of your job (i.e. where the job will run). If not
    395 specified, it is the current working directory at the time the job was submit-
    396 ted.
    397 # @ error = f i l e
     361specified, it is the current working directory at the time the job was submitted.
     362{{{
     363# @ error = file
     364}}}
    398365The name of the file to collect the stderr output of the job.
    399 # @ output = f i l e
     366{{{
     367# @ output = file
     368}}}
    400369The name of the file to collect the standard output (stdout) of the job.
    401370{{{