wiki:WRF4G/ExecutionEnvironments

Version 4 (modified by carlos, 9 years ago)

DKRZ

How to use DKRZ facilities?

Workflows in climate modelling research are complex and generally comprise a number of different tasks: model formulation and development (including debugging, platform porting, and performance optimization), generation of input data, model simulations, postprocessing, visualization and analysis of output data, long-term archiving of the data, and documentation and publication of results. The DKRZ hardware and software infrastructure is adapted to accomplish these tasks efficiently. The graphic below gives a schematic overview of the DKRZ systems.

http://www.dkrz.de/bilder/bilder-nutzerportal/bilder-dokumentation/DKRZsystems.png

For a more detailed description of the different systems shown in the picture and basic software installed on these systems click here.

Blizzard

http://www.dkrz.de/Nutzerportal-en/doku/blizzard

ssh <userid>@blizzard.dkrz.de

Lizard

http://www.dkrz.de/Nutzerportal-en/doku/blizzard/lizard

ssh <userid>@lizard.dkrz.de

RES

Altamira

Running Jobs

SLURM is the utility used at Altamira for batch processing support, so all jobs must be run through it. This document provides information for getting started with job execution at Altamira.

To keep the load on the login nodes reasonable, processes running interactively on these nodes are limited to 10 minutes of CPU time. Any execution that needs more than this should be carried out through the queue system.

Submitting Jobs

A job is the execution unit for SLURM. A job is defined by a text file containing a set of directives describing the job, and the commands to execute.

These are the basic commands for submitting and managing jobs:

mnsubmit <job_script> submits a job script to the queue system (see below for job script directives).

mnq shows all the jobs submitted.

mncancel <job_id> removes your job from the queue system, cancelling its execution if it was already running.
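The commands above can be combined into a small wrapper. The following is a minimal sketch and not part of the Altamira documentation: the function name submit_job and the file name job.cmd are hypothetical, and the output of mnsubmit is not parsed because its format is not described here.

```shell
#!/bin/bash
# Hypothetical helper: submit a job script and then list the queue.
# Only mnsubmit and mnq come from the text above; the rest is illustrative.
submit_job() {
    local script="$1"

    # Refuse to submit a script that does not exist or is unreadable.
    if [ ! -r "$script" ]; then
        echo "submit_job: cannot read job script '$script'" >&2
        return 2
    fi

    # The mn* commands are assumed to exist only on the cluster login nodes.
    if ! command -v mnsubmit >/dev/null 2>&1; then
        echo "submit_job: mnsubmit not found (not on a login node?)" >&2
        return 127
    fi

    # Submit, then show the queue so the new job can be checked.
    mnsubmit "$script" && mnq
}
```

A typical call would be submit_job job.cmd. To remove the job afterwards, pass the job ID reported by the queue system to mncancel <job_id>.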

Job directives

A job must contain a series of directives to inform the batch system about the characteristics of the job. These directives appear as comments in the job script, with the following syntax:

#@ directive = value
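To illustrate this syntax, the sketch below lists the directives found in a job script. It is an illustration only: the function name list_directives and the file demo_job.cmd are made up for this example, and the pattern assumes that each directive starts at the beginning of a line with #@, as in the syntax line above.

```shell
#!/bin/bash
# Print "directive=value" for each "#@ directive = value" line of a job script.
list_directives() {
    # First expression strips trailing whitespace; second prints matching lines only.
    sed -n -e 's/[[:space:]]*$//' \
           -e 's/^#@[[:space:]]*\([A-Za-z_][A-Za-z_0-9]*\)[[:space:]]*=[[:space:]]*\(.*\)$/\1=\2/p' "$1"
}

# Hypothetical job script, used only to exercise the function.
cat > demo_job.cmd <<'EOF'
#!/bin/bash
#@ job_name = demo
#@ wall_clock_limit = 00:02:00
# an ordinary comment, not a directive
./serial_binary
EOF

list_directives demo_job.cmd
```

For the demo script this prints job_name=demo and wall_clock_limit=00:02:00, one per line. The shebang, the ordinary comment, and the command are ignored.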

Additionally, the job script may contain a set of commands to execute. If not, an external script must be provided with the 'executable' directive. The most common directives are:

#@ class = class_name

The queue to which the job is submitted. Leave this field empty unless you need to use "debug" or special queues.

#@ wall_clock_limit = HH:MM:SS

The wall clock time limit. This field is mandatory: set it to a value greater than the real execution time of your application and smaller than the time limits granted to the user. Note that your job will be killed once this period has elapsed.

#@ initialdir = pathname

The working directory of your job (i.e. where the job will run). If not specified, it is the current working directory at the time the job was submitted.

#@ error = file

The name of the file to collect the stderr output of the job.

#@ output = file

The name of the file to collect the standard output (stdout) of the job.

#@ total_tasks = number

The number of processes to start.

#@ cpus_per_task = number

The number of cpus allocated for each task. This is useful for hybrid MPI+OpenMP applications, where each process will spawn a number of threads. The number of cpus per task must be between 1 and 16, since each node has 16 cores (one for each thread).

#@ tasks_per_node = number

The number of tasks allocated on each node. When an application uses more than 3.8 GB of memory per process, 16 processes cannot share a single node and its 64 GB of memory, so fewer tasks per node are needed. This directive can be combined with cpus_per_task to allocate nodes exclusively, e.g. to allocate 2 processes per node, set both directives to 2. The number of tasks per node must be between 1 and 16.

#@ gpus_per_node = number

The number of GPU cards assigned to the job. This number can be 0, 1, or 2, as there are 2 cards per node.
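The memory constraint described for tasks_per_node is simple integer arithmetic. The sketch below estimates the largest tasks_per_node that fits a given memory footprint per process. It assumes that the 3.8 GB per-process threshold quoted above means roughly 16 x 3.8 GB of each node's 64 GB is available to jobs. That interpretation, the function name max_tasks_per_node, and the rounding to megabytes are assumptions, not documented Altamira values.

```shell
#!/bin/bash
# Assumed figures, derived from the text: 16 cores per node and about
# 3.8 GB (3891 MB) usable per process when all 16 cores are in use.
CORES_PER_NODE=16
USABLE_MB_PER_NODE=$((CORES_PER_NODE * 3891))

# Print the largest tasks_per_node that fits the given memory per process (MB).
max_tasks_per_node() {
    local mem_mb="$1"

    # Guard against empty, non-numeric, or non-positive input.
    if ! [ "$mem_mb" -gt 0 ] 2>/dev/null; then
        echo "max_tasks_per_node: expected a positive number of MB" >&2
        return 1
    fi

    local n=$((USABLE_MB_PER_NODE / mem_mb))
    if [ "$n" -lt 1 ]; then
        echo "max_tasks_per_node: ${mem_mb} MB does not fit in one node" >&2
        return 1
    fi
    # Never exceed the number of cores in a node.
    if [ "$n" -gt "$CORES_PER_NODE" ]; then
        n=$CORES_PER_NODE
    fi
    echo "$n"
}

max_tasks_per_node 8000
```

Under these assumptions, a process needing 8000 MB gives 7 tasks per node, while a process needing 2000 MB is capped at the 16 cores of the node.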

Job Examples

In the examples, the %j part in the job directives will be substituted by the job ID.

Sequential job:

#!/bin/bash
#@ job_name = test_serial
#@ initialdir = .
#@ output = serial_%j.out
#@ error = serial_%j.err
#@ total_tasks = 1
#@ wall_clock_limit = 00:02:00
   
./serial_binary

Parallel job:

#!/bin/bash 
#@ job_name = test_parallel
#@ initialdir = .
#@ output = mpi_%j.out
#@ error = mpi_%j.err
#@ total_tasks = 32
#@ wall_clock_limit = 00:02:00
   
srun ./parallel_binary 
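Since wall_clock_limit must be given as HH:MM:SS, it can be convenient to generate a job script from a duration in seconds. The following sketch does this for the parallel example above. The function names to_hms and make_parallel_job and the file name parallel_job.cmd are hypothetical; the directives written are the ones shown in the example.

```shell
#!/bin/bash
# Convert a number of seconds to HH:MM:SS.
to_hms() {
    local s="$1"
    printf '%02d:%02d:%02d\n' $((s / 3600)) $((s % 3600 / 60)) $((s % 60))
}

# Write a parallel job script: make_parallel_job <file> <tasks> <seconds>
make_parallel_job() {
    local file="$1"
    local tasks="$2"
    local seconds="$3"

    # Unquoted here-document: $tasks and $(to_hms ...) are expanded,
    # while %j is left as-is for the queue system to substitute.
    cat > "$file" <<EOF
#!/bin/bash
#@ job_name = test_parallel
#@ initialdir = .
#@ output = mpi_%j.out
#@ error = mpi_%j.err
#@ total_tasks = $tasks
#@ wall_clock_limit = $(to_hms "$seconds")

srun ./parallel_binary
EOF
}

# Reproduce the example above: 32 tasks, 120 seconds (00:02:00).
make_parallel_job parallel_job.cmd 32 120
```

The generated file can then be submitted with mnsubmit parallel_job.cmd.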

GPGPU job:

#!/bin/bash 
#@ job_name = test_gpu
#@ initialdir = .
#@ output = gpu_%j.out
#@ error = gpu_%j.err
#@ total_tasks = 1
#@ gpus_per_node = 1
#@ wall_clock_limit = 00:02:00
 
./gpu_binary

Jobs that use GPUs should run module load CUDA before mnsubmit in order to set the library paths.

For more information about Altamira, see https://moin.ifca.es/wiki/Supercomputing/Userguide

MareNostrum

National Computational Infrastructure (Australia)

http://nf.nci.org.au/facilities/

VAYU

http://nf.nci.org.au/facilities/vayu/system_doc.php

ECMWF

ssh <user>@ecaccess.ecmwf.int

HPCF

http://www.ecmwf.int/services/computing/overview/ibm_cluster.html

Ecgate