Changes between Version 3 and Version 4 of WRF4G/ExecutionEnvironments


Timestamp: Feb 15, 2013 5:47:45 PM
Author: carlos
Comment:

--

  • WRF4G/ExecutionEnvironments

ssh <userid>@lizard.dkrz.de
}}}

= RES =

== Altamira ==

=== Running Jobs ===

SLURM is the utility used at Altamira for batch processing support, so all jobs must be run through it. This section provides the information needed to get started with job execution at Altamira.

In order to keep the load on the login nodes reasonable, a 10-minute CPU-time limit is set for processes running interactively on these nodes. Any execution taking longer than this limit should be carried out through the queue system.

=== Submitting Jobs ===

A job is the execution unit for SLURM. A job is defined by a text file containing a set of directives describing the job and the commands to execute.

These are the basic directives to submit jobs (see the usage sketch below):

    '''mnsubmit <job_script>''' submits a job script to the queue system (see below for job script directives).

    '''mnq''' shows all the jobs submitted.

    '''mncancel <job_id>''' removes the specified job from the queue system, cancelling its execution if it was already running.

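As a minimal usage sketch (the job script name test.job and the job ID 12345 are placeholders):

{{{
#!sh
mnsubmit test.job      # submit the job script to the queue system
mnq                    # list the submitted jobs and their state
mncancel 12345         # cancel the job with the ID reported by mnq
}}}
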
=== Job Directives ===

A job must contain a series of directives to inform the batch system about the characteristics of the job. These directives appear as comments in the job script, with the following syntax:

{{{
#!sh
#@ directive = value
}}}

Additionally, the job script may contain a set of commands to execute. If not, an external script must be provided with the 'executable' directive. The most common directives are listed below:
{{{
#!sh
#@ class = class_name
}}}

The queue where the job is to be submitted. Leave this field empty unless you need to use the "debug" or other special queues.

{{{
#!sh
#@ wall_clock_limit = HH:MM:SS
}}}

The wall clock time limit. This is a mandatory field; you must set it to a value greater than the real execution time of your application and smaller than the time limits granted to the user. Note that your job will be killed once this limit has elapsed.

{{{
#!sh
#@ initialdir = pathname
}}}

The working directory of your job (i.e. where the job will run). If not specified, it is the current working directory at the time the job was submitted.

{{{
#!sh
#@ error = file
}}}

The name of the file to collect the stderr output of the job.

{{{
#!sh
#@ output = file
}}}

The name of the file to collect the standard output (stdout) of the job.

{{{
#!sh
#@ total_tasks = number
}}}

The number of processes to start.

{{{
#!sh
#@ cpus_per_task = number
}}}

The number of cpus allocated for each task. This is useful for hybrid MPI+OpenMP applications, where each process will spawn a number of threads (see the hybrid job example below). The number of cpus per task must be between 1 and 16, since each node has 16 cores (one core per thread).

{{{
#!sh
#@ tasks_per_node = number
}}}

The number of tasks allocated on each node. When an application uses more than 3.8 GB of memory per process, it is not possible to fit 16 processes in the same node and its 64 GB of memory. This directive can be combined with cpus_per_task to allocate the nodes exclusively, e.g. to allocate 2 processes per node, set both directives to 2. The number of tasks per node must be between 1 and 16.

{{{
#!sh
#@ gpus_per_node = number
}}}

The number of GPU cards assigned to the job. This number can be 0, 1 or 2, as there are 2 GPU cards per node.

=== Job Examples ===

In the examples, the %j part in the job directives will be substituted by the job ID.

Sequential job:

{{{
#!sh
#!/bin/bash
#@ job_name = test_serial
#@ initialdir = .
#@ output = serial_%j.out
#@ error = serial_%j.err
#@ total_tasks = 1
#@ wall_clock_limit = 00:02:00

./serial_binary
}}}

Parallel job:

{{{
#!sh
#!/bin/bash
#@ job_name = test_parallel
#@ initialdir = .
#@ output = mpi_%j.out
#@ error = mpi_%j.err
#@ total_tasks = 32
#@ wall_clock_limit = 00:02:00

srun ./parallel_binary
}}}

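A hybrid MPI+OpenMP job combining the tasks_per_node and cpus_per_task directives might look as in the following sketch (the binary name and the 4 tasks x 4 threads per node layout are illustrative placeholders, not a site recommendation):

{{{
#!sh
#!/bin/bash
#@ job_name = test_hybrid
#@ initialdir = .
#@ output = hybrid_%j.out
#@ error = hybrid_%j.err
#@ total_tasks = 8
#@ tasks_per_node = 4
#@ cpus_per_task = 4
#@ wall_clock_limit = 00:02:00

# one OpenMP thread per cpu allocated to each MPI task
export OMP_NUM_THREADS=4
srun ./hybrid_binary
}}}
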
GPGPU job:

{{{
#!sh
#!/bin/bash
#@ job_name = test_gpu
#@ initialdir = .
#@ output = gpu_%j.out
#@ error = gpu_%j.err
#@ total_tasks = 1
#@ gpus_per_node = 1
#@ wall_clock_limit = 00:02:00

./gpu_binary
}}}

Jobs using GPUs should execute {{{module load CUDA}}} in order to set the library paths before running mnsubmit (see the sketch below).

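A minimal sketch of the submission steps (test_gpu.job is a placeholder for a file containing the GPGPU job script above):

{{{
#!sh
module load CUDA        # set the CUDA library paths
mnsubmit test_gpu.job   # submit the GPU job script
}}}
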
For more information about Altamira see [https://moin.ifca.es/wiki/Supercomputing/Userguide].

== !MareNostrum ==