Changes between Version 8 and Version 9 of WRF4G/ExecutionEnvironments

Feb 18, 2013 10:31:12 AM (9 years ago)



  • WRF4G/ExecutionEnvironments

    v8 v9  
    299299For more information about !MareNostrum III see [ ]
     302Running Jobs
     303Slurm+MOAB is the new utility used at Tirant for batch processing support.
     304We moved from LoadLeveler, and all jobs must be run through this new
     305batch system. We tried to keep it as simple as possible, keeping the syntax of
     306LoadLeveler to make the transition easier for those who used LoadLeveler
     307in the past. This document provides information for getting started with job
     308execution at Tirant.
     3105.1 Classes
     312Tirant User’s Guide
     314The user’s limits are assigned automatically to each particular user (depending
     315on the resources granted by the Access Committee), and there is no reason
     316to explicitly set the #@class directive. However, you are allowed to use the
     317special class "debug" in order to perform some fast, short tests. To use the
     318"debug" class you need to include the #@class directive.
     319Table 1: Classes
     322Max CPUs
     324CPU Time
     32510 min
     326Wall time limit
     32710 min
     328The specific limits assigned to each user depend on the priority granted
     329by the Access Committee. Users granted "high priority hours" will have
     330access to a maximum of 1024 CPUs and a maximum wall clock limit of 72
     331hours. For users with "low priority hours" the limits are 1024 CPUs and 24
     332hours. If you need to increase these limits, please contact the support group.
     333Local users of Tirant have access to a new local queue called "class_t".
     334This queue has the same parameters as "class_a", but the priority is modified to
     335restrict local time consumption to 20% of total computation time.
     336• debug: This class is reserved for testing applications before submitting
     337them to the "production" queues. Only one job per user is allowed
     338to run simultaneously in this queue, and the execution time will be
     339limited to 10 minutes. The maximum number of nodes per application
     340is 32. Only a limited number of jobs may be running at the same time
     341in this queue.
     342The specifications for each class may be adjusted in the future to adapt
     343to changing requirements.
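As a rough illustration of the limits quoted above, the following sketch checks a planned request against them before a job script is written. The helper names `max_hours` and `fits_limits` are hypothetical, and the numbers are simply the ones stated in this section; since the class specifications may change, treat them as placeholders rather than authoritative values.

```shell
#!/bin/bash
# Limits as quoted in the text above; they may change, so treat them as
# illustrative values rather than an authoritative table.
MAX_CPUS=1024

# Hypothetical helper: maximum wall clock hours for a priority level.
max_hours() {
    case $1 in
        high) echo 72 ;;
        low)  echo 24 ;;
        *)    return 1 ;;
    esac
}

# Hypothetical helper: succeed if the request fits within the limits.
# Arguments: priority (high|low), number of CPUs, wall clock hours.
fits_limits() {
    local hours
    hours=$(max_hours "$1") || return 1
    [ "$2" -le "$MAX_CPUS" ] && [ "$3" -le "$hours" ]
}

if fits_limits low 512 48; then
    echo "request fits"
else
    echo "request exceeds the limits"
fi
```

A request that does not fit would be a reason to contact the support group rather than to submit the job unchanged.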
     3455.2 Submitting Jobs
     349A job is the execution unit for SLURM. We have created wrappers to
     350ease the adaptation to the new batch system for those users who have
     351already used Tirant and MareNostrum in the past, so the commands are
     352quite similar to the former LoadLeveler commands. A job is defined by a text
     353file containing a set of directives describing the job, and the commands to execute.
     356SLURM wrapper commands
     357These are the basic commands to submit jobs:
     358mnsubmit <job_script>
     359submits a "job script" to the queue system (see below for job script directives).
     362shows all the jobs submitted.
     363mncancel <job_id>
     364removes a job from the queue system, canceling the execution of its
     365processes if they were already running.
     366checkjob <job_id>
     367obtains detailed information about a specific job, including the assigned
     368nodes and the possible reasons preventing the job from running.
     370shows information about the estimated time for the specified job to be
     373Job Directives
     374A job must contain a series of directives to inform the batch system about
     375the characteristics of the job. These directives appear as comments in the
     376job script, with the following syntax:
     380# @ directive = value
     381Additionally, the job script may contain a set of commands to execute.
     382If not, an external script must be provided with the ’executable’ directive.
     383Here you may find the most common directives:
     384# @ class = class_name
     385The partition where the job is to be submitted. Leave this field empty
     386unless you need to use the "interactive" or "debug" partitions.
     387# @ wall_clock_limit = HH:MM:SS
     389The limit of wall clock time. This is a mandatory field and you must set
     390it to a value greater than the real execution time for your application and
     391smaller than the time limits granted to the user. Notice that your job will
     392be killed after the elapsed period.
     393# @ initialdir = pathname
     394The working directory of your job (i.e. where the job will run). If not
     395specified, it is the current working directory at the time the job was submitted.
     397# @ error = file
     398The name of the file to collect the stderr output of the job.
     399# @ output = file
     400The name of the file to collect the standard output (stdout) of the job.
     401# @ total_tasks = number
     402The number of processes to start.
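As a rough illustration of how the directives above fit together, the following sketch writes a job script to a file and formats the wall clock limit as HH:MM:SS. The helper names `to_hms` and `write_job`, the file name `job.cmd`, and the binary `./my_binary` are hypothetical; only the `# @` directive names are taken from this guide.

```shell
#!/bin/bash
# Hypothetical helper: convert a number of seconds to HH:MM:SS.
to_hms() {
    local s=$1
    printf '%02d:%02d:%02d\n' $((s / 3600)) $((s % 3600 / 60)) $((s % 60))
}

# Hypothetical helper: write a job script using the directives described above.
# Arguments: job name, number of tasks, wall clock limit in seconds, output file.
write_job() {
    local name=$1 tasks=$2 limit
    limit=$(to_hms "$3")
    cat > "$4" <<EOF
#!/bin/bash
# @ job_name = $name
# @ initialdir = .
# @ output = ${name}_%j.out
# @ error = ${name}_%j.err
# @ total_tasks = $tasks
# @ wall_clock_limit = $limit
srun ./my_binary
EOF
}

# Write a 56-task job with a two-minute limit and show the resulting directive.
write_job test_parallel 56 120 job.cmd
grep 'wall_clock_limit' job.cmd
```

The generated file could then be passed to `mnsubmit job.cmd`. Remember that the limit must be larger than the real execution time of the application.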
     403There are also a few SLURM environment variables you can use in your
     404scripts (see Table 2):
     407Example for a sequential job:
     411Table 2: SLURM environment variables
     420Specifies the job ID of the executing job
     421Specifies the total number of processes in the job
     422Specifies the actual number of nodes assigned to run your job
     424Specifies the MPI rank (or relative process ID)
     425for the current process. The range is from 0 to
     426(SLURM_NPROCS-1)
     427Specifies the relative node ID of the current job. The
     428range is from 0 to (SLURM_NNODES-1)
     429Specifies the node-local task ID for the process
     430within a job
     431#!/bin/bash
     432# @ job_name = test_serial
     433# @ initialdir = .
     434# @ output = serial_%j.out
     435# @ error = serial_%j.err
     436# @ total_tasks = 1
     437# @ wall_clock_limit = 00:02:00
     438./serial_binary > serial.out
     439The job would be submitted using:
     440usertest@login1:~/Slurm/TEST> mnsubmit ptest.cmd
     441Example for a parallel job:
     445# @ job_name = test_parallel
     446# @ initialdir = .
     447# @ output = mpi%j.out
     448# @ error = mpi%j.err
     449# @ total_tasks = 56
     450# @ wall_clock_limit = 00:02:00
     451srun ./parallel_binary > parallel.output
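Inside a running job, the variables summarized in Table 2 can be read like any other environment variable. The sketch below assumes the standard SLURM names `SLURM_JOBID`, `SLURM_NPROCS`, `SLURM_PROCID` and `SLURM_NODEID`, and falls back to placeholder values so that it also runs outside the batch system. It is meant as an illustration of the idea, not as part of the guide's own examples.

```shell
#!/bin/bash
# Assumed standard SLURM variable names; the defaults after ":-" are
# placeholders used only when the script runs outside a SLURM job.
job_id=${SLURM_JOBID:-none}
nprocs=${SLURM_NPROCS:-1}
rank=${SLURM_PROCID:-0}
node=${SLURM_NODEID:-0}

# Let only the first process (rank 0) print the job summary.
if [ "$rank" -eq 0 ]; then
    echo "job $job_id: $nprocs process(es)"
fi

# Each process reports its own rank and node; ranks run from 0 to nprocs-1.
echo "rank $rank of $nprocs on node $node"
```

Saved under a hypothetical name such as `report_rank.sh`, it could be launched from a job script with `srun ./report_rank.sh`, so that every process prints its own line.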
    301460= [[|DKRZ]] =