
Known Problems

Page Contents

  1. Known Problems
    1. [x] qdel: Server could not connect to MOM 476932.ce01.macc.unican.es
    2. [] Abnormal ending of simulations due to 'Flerchinger' approximation
    3. [] Abnormal ending of simulations
    4. [x] Not writing of WRF restart files
    5. [] Crash of running simulations by openMPI problems
    6. [x] Intel simulations crash with CAM
      1. Latest news
    7. [] STOP of simulations due to library problems
    8. [] STOP of simulations due to net delays
    9. [x] In real STOP: At line 703 of file module_initialize_real.f90
    10. [x] Execution error in WRF
    11. [x] SST missing values in coastal lines
    12. [] p4_error: latest msg from perror: Invalid argument
    13. [x] p4_error: OOPS: semop lock failed: -1
    14. [] *** glibc detected *** malloc(): memory corruption:
    15. [x] Missing required environment variable: MPIRUN_RANK
    16. [] Different wrf.exe from different nodes
    17. [] mpiexec: Error: poll_or_block_event: tm_poll: tm: no event
    18. [x] mvapich 'call system()' failed
    19. [x] mpiexec: Warning: read_ib_one: protocol version 8 not known, but …
    20. [x] ECMWF ERA40 escena missing data
    21. [x] Large waiting in GRID-CSIC
    22. [x] cshell error in wn010
    23. [x] Stale NFS file handle
    24. [x] metgrid.exe Segmentation fault
    25. [] CAM NaN
    26. [x] p4_error: semget failed for setnum: 12
    27. [x] P4_GLOBMEMSIZE
    28. [x] SKINTEMP not found
    29. [x] WOULD GO OFF TOP: KF_ETA_PARA I,J,DPTHMX,DPMIN 81 78 …
    30. [x] Metgrid error: Error in ext_pkg_write_field in metgrid.log
    31. [] forrtl: severe (174): SIGSEGV, segmentation fault occurred
    32. [] wrf.exe: posixio.c:213: px_pgout: Assertion `*posp == ((off_t)(-1)) …
    33. [] No error, wrf just stops (¿¡?!)

[]: Unsolved [x]: Solved

[x] qdel: Server could not connect to MOM 476932.ce01.macc.unican.es

Sometimes the PBS mom daemon goes down on the worker nodes. To recover a node, we have to stop and restart the service:

[root@ce01 ~]# ssh wn025 'service pbs_mom restart'

The status can be checked on all nodes at once with cexec:

[root@ce01 ~]# cexec 'service pbs_mom status'
************************* macc *************************
--------- wn001---------
pbs_mom (pid 2575) is running...
--------- wn002---------
pbs_mom (pid 3061) is running...
--------- wn003---------
pbs_mom (pid 2908) is running...
--------- wn004---------
ssh(1777) Warning: the RSA host key for 'wn004' differs from the key for the IP address '192.168.202.14'
Offending key for IP in /etc/ssh/ssh_known_hosts:8
Matching host key in /root/.ssh/known_hosts:117
ssh(1777) Permission denied, please try again.
ssh(1777) Permission denied, please try again.
ssh(1777) Permission denied (publickey,password).
--------- wn005---------
pbs_mom dead but subsys locked
--------- wn006---------
pbs_mom (pid 3002) is running...
--------- wn007---------
pbs_mom (pid 29926) is running...
--------- wn008---------
pbs_mom dead but subsys locked
--------- wn009---------
ssh(1796) Permission denied, please try again.
ssh(1796) Permission denied, please try again.
ssh(1796) Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
--------- wn010---------
pbs_mom dead but subsys locked
--------- wn011---------
pbs_mom dead but subsys locked
--------- wn012---------
pbs_mom dead but subsys locked
--------- wn013---------
pbs_mom (pid 6605 6604 6422) is running...
--------- wn014---------
pbs_mom (pid 3137) is running...
--------- wn015---------
pbs_mom dead but subsys locked
--------- wn016---------
pbs_mom dead but subsys locked
--------- wn017---------
pbs_mom dead but subsys locked
--------- wn018---------
pbs_mom (pid 3284) is running...
--------- wn019---------
pbs_mom dead but subsys locked
--------- wn020---------
pbs_mom dead but subsys locked
--------- wn021---------
pbs_mom dead but subsys locked
--------- wn022---------
pbs_mom dead but subsys locked
--------- wn023---------
pbs_mom dead but subsys locked
--------- wn024---------
pbs_mom (pid 3157) is running...
--------- wn025---------
pbs_mom (pid 18308) is running...
--------- wn031---------
pbs_mom dead but subsys locked
--------- wn032---------
pbs_mom dead but subsys locked
--------- wn033---------
pbs_mom dead but subsys locked
--------- wn034---------
pbs_mom dead but subsys locked
--------- wn035---------
pbs_mom dead but subsys locked
--------- wn036---------
pbs_mom dead but subsys locked
--------- wn041---------
pbs_mom dead but subsys locked
--------- wn042---------
pbs_mom dead but subsys locked
--------- wn043---------
pbs_mom dead but subsys locked
--------- wn044---------
pbs_mom dead but subsys locked
--------- wn045---------
pbs_mom dead but subsys locked
--------- wn046---------
pbs_mom dead but subsys locked

Restarting the service on all nodes:

[root@ce01 ~]# cexec 'service pbs_mom restart'
************************* macc *************************
--------- wn001---------
Shutting down TORQUE Mom: [  OK  ]
Starting TORQUE Mom: [  OK  ]
--------- wn002---------
Shutting down TORQUE Mom: [  OK  ]
Starting TORQUE Mom: [  OK  ]
--------- wn003---------
Shutting down TORQUE Mom: [  OK  ]
Starting TORQUE Mom: [  OK  ]
--------- wn004---------
ssh(2514) Warning: the RSA host key for 'wn004' differs from the key for the IP address '192.168.202.14'
Offending key for IP in /etc/ssh/ssh_known_hosts:8
Matching host key in /root/.ssh/known_hosts:117
ssh(2514) Permission denied, please try again.
ssh(2514) Permission denied, please try again.
ssh(2514) Permission denied (publickey,password).
--------- wn005---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn006---------
Shutting down TORQUE Mom: [  OK  ]
Starting TORQUE Mom: [  OK  ]
--------- wn007---------
Shutting down TORQUE Mom: [  OK  ]
Starting TORQUE Mom: [  OK  ]
--------- wn008---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn009---------
ssh(2524) Permission denied, please try again.
ssh(2524) Permission denied, please try again.
ssh(2524) Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
--------- wn010---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn011---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn012---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn013---------
Shutting down TORQUE Mom: [  OK  ]
Starting TORQUE Mom: [  OK  ]
--------- wn014---------
Shutting down TORQUE Mom: [  OK  ]
Starting TORQUE Mom: [  OK  ]
--------- wn015---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn016---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn017---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn018---------
Shutting down TORQUE Mom: [  OK  ]
Starting TORQUE Mom: [  OK  ]
--------- wn019---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn020---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn021---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn022---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn023---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn024---------
Shutting down TORQUE Mom: [  OK  ]
Starting TORQUE Mom: [  OK  ]
--------- wn025---------
Shutting down TORQUE Mom: [  OK  ]
Starting TORQUE Mom: [  OK  ]
--------- wn031---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn032---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn033---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn034---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn035---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn036---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn041---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn042---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn043---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn044---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn045---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]
--------- wn046---------
Shutting down TORQUE Mom: [FAILED]
Starting TORQUE Mom: [  OK  ]

Afterwards, the nodes still listed as problematic are:

[root@ce01 ~]# pbsnodes -l
wn002                offline
wn003                offline
wn004                down,offline
wn009                down,offline
wn001                offline
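
A minimal sketch of automating this (assuming passwordless root ssh from ce01 to the worker nodes, as used by cexec above): restart pbs_mom on every node that pbsnodes reports as problematic, then clear its OFFLINE mark so TORQUE can schedule on it again. The loop and the pbsnodes -c call are illustrative only, not part of any installed script.

#!/bin/bash
# Restart pbs_mom on every node listed by 'pbsnodes -l' (down/offline nodes)
# and clear the OFFLINE state afterwards.
for node in $(pbsnodes -l | awk '{print $1}'); do
    echo "Restarting pbs_mom on ${node}"
    ssh "${node}" 'service pbs_mom restart'
    pbsnodes -c "${node}"    # clear the OFFLINE mark (TORQUE)
done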

[] Abnormal ending of simulations due to 'Flerchinger' approximation

A simulation ends with the following message:

(...)
 d01 2000-09-13_00:00:00 Input data processed for wrflowinp_d<domain> for domain    1
 Flerchinger USEd in NEW version. Iterations=          10
 Flerchinger USEd in NEW version. Iterations=          10
 Flerchinger USEd in NEW version. Iterations=          10

Related to the topic:

[] Abnormal ending of simulations

While a simulation was running, the following appeared in 'rsl.error.0000':

(...)
Timing for main: time 2001-03-03_05:00:00 on domain   1:    2.94590 elapsed seconds.
wrf.exe: posixio.c:213: px_pgout: Assertion `*posp == ((off_t)(-1)) || *posp == lseek(nciop->fd, 0, 1)' failed.
Image              PC                Routine            Line        Source
libc.so.6          0000003AC6830265  Unknown               Unknown  Unknown
libc.so.6          0000003AC6831D10  Unknown               Unknown  Unknown
libc.so.6          0000003AC68296E6  Unknown               Unknown  Unknown
wrf.exe            00000000015E846A  Unknown               Unknown  Unknown
wrf.exe            00000000015BD80D  Unknown               Unknown  Unknown
wrf.exe            00000000015CC1FE  Unknown               Unknown  Unknown
wrf.exe            0000000001571B10  Unknown               Unknown  Unknown
wrf.exe            00000000015708BD  Unknown               Unknown  Unknown
wrf.exe            000000000155F149  Unknown               Unknown  Unknown
wrf.exe            000000000155B828  Unknown               Unknown  Unknown
wrf.exe            0000000000BCA9ED  Unknown               Unknown  Unknown
wrf.exe            0000000000BC71D9  Unknown               Unknown  Unknown
wrf.exe            0000000000BC6C58  Unknown               Unknown  Unknown
wrf.exe            0000000000BC6162  Unknown               Unknown  Unknown
wrf.exe            0000000000BC5EFE  Unknown               Unknown  Unknown
wrf.exe            0000000000DEF177  Unknown               Unknown  Unknown
wrf.exe            00000000007413C2  Unknown               Unknown  Unknown
wrf.exe            00000000006BE487  Unknown               Unknown  Unknown
wrf.exe            00000000006552B9  Unknown               Unknown  Unknown
wrf.exe            000000000067A5B4  Unknown               Unknown  Unknown
wrf.exe            0000000000678591  Unknown               Unknown  Unknown
wrf.exe            00000000004CA59F  Unknown               Unknown  Unknown
wrf.exe            000000000047B093  Unknown               Unknown  Unknown
wrf.exe            000000000047B047  Unknown               Unknown  Unknown
wrf.exe            000000000047AFDC  Unknown               Unknown  Unknown
libc.so.6          0000003AC681D994  Unknown               Unknown  Unknown
wrf.exe            000000000047AEE9  Unknown               Unknown  Unknown

Related links:

[x] Not writing of WRF restart files

During the simulation, the following appears in rsl.[error/out].[nnnn]:

(...)
Timing for Writing restart for domain        1:   48.82700 elapsed seconds.
(...)

But the restart file is never actually written. During the simulation a wrfrst_[.....] file is created, but it is only 32 bytes long and only exists for the 48.82700 seconds reported above; after that it disappears. Looking at the execution flow (via strace):

(...)
open("wrfrst_d01_2000-06-01_11:30:00", O_RDWR|O_CREAT|O_TRUNC, 0666) = 13
fstat(13, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
fstat(13, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
lseek(13, 0, SEEK_CUR)                  = 0
lseek(13, 24, SEEK_SET)                 = 24
write(13, "\0\0\0\0\0\0\0\0", 8)        = 8
lseek(13, 0, SEEK_SET)                  = 0
(...)
write(1, "Timing for Writing restart for d"..., 76) = 76
write(2, "Timing for Writing restart for d"..., 76) = 76
lseek(13, 0, SEEK_CUR)                  = 0
write(13, "CDF\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32) = 32
close(13)                               = 0
unlink("wrfrst_d01_2000-06-01_11:00:00") = 0
(...)

It seems to be related to the size of the files.

To allow netCDF output files larger than 2 GB, one should set the following variable (in compile.bash):

export WRFIO_NCD_LARGE_FILE_SUPPORT=1

It works now.
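
As a quick check of the fix (a sketch, assuming the netCDF tools are installed on the node), one can verify that the variable is exported in the build environment and that the resulting files are written in 64-bit offset format, which is what allows them to exceed 2 GB:

# Before compiling WRF (e.g. in compile.bash):
export WRFIO_NCD_LARGE_FILE_SUPPORT=1

# After a run, check the format of a restart file; "64-bit offset" means
# large files are supported, "classic" means the 2 GB limit still applies.
ncdump -k wrfrst_d01_2000-06-01_11:30:00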

[] Crash of running simulations by openMPI problems

In a non-systematic way, some simulations stop with the following message:

[lluis@wn033 ~]$ cat /localtmp/wrf4g.20110213111149664772000/log/wrf_2000101006.out
[wn033.macc.unican.es:19865] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19864] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19869] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19866] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19867] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19876] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19870] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19874] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19877] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19863] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19871] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19875] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19873] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19868] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19862] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19872] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19872] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19868] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19875] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19862] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19871] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19866] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19873] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19863] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19864] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19876] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19867] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19877] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19869] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19870] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19865] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[wn033.macc.unican.es:19874] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
mpiexec: killing job...
--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
    orterun:unclean-exit
But I couldn't open the help file:
    /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/share/openmpi/help-orterun.txt: No such file or directory.  Sorry!
--------------------------------------------------------------------------
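
A small diagnostic sketch (assuming the OpenMPI runtime is reachable from the node): since the 'compiled for a different version of Open MPI' warnings usually point to a mismatch between the runtime used to launch the job and the one wrf.exe was built against, compare them before resubmitting:

# Which OpenMPI runtime is first in the PATH, and its version
which mpirun
mpirun --version | head -n 1
ompi_info | head -n 5

# Which MPI libraries wrf.exe actually resolves at run time
ldd ./wrf.exe | grep -E 'libmpi|libopen-(rte|pal)'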

[x] Intel simulations crash with CAM

Simulations crash right at their beginning:

taskid: 0 hostname: wn025.macc.unican.es
 Quilting with   1 groups of   0 I/O tasks.
 Namelist dfi_control not found in namelist.input. Using registry defaults for v
 ariables in dfi_control
 Namelist tc not found in namelist.input. Using registry defaults for variables
 in tc
 Namelist scm not found in namelist.input. Using registry defaults for variables
  in scm
 Namelist fire not found in namelist.input. Using registry defaults for variable
 s in fire
  Ntasks in X            2, ntasks in Y            4
 WRF V3.1.1 MODEL
   *** CLWRF code enabled
  *************************************
  Parent domain
  ids,ide,jds,jde            1          50           1          50
  ims,ime,jms,jme           -4          32          -4          20
  ips,ipe,jps,jpe            1          25           1          13
  *************************************
 DYNAMICS OPTION: Eulerian Mass Coordinate
    alloc_space_field: domain            1,     26649120 bytes allocated
   med_initialdata_input: calling input_model_input
 INPUT LandUse = "USGS"
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
wrf.exe            00000000013EF3E1  Unknown               Unknown  Unknown
wrf.exe            00000000013F05A7  Unknown               Unknown  Unknown
wrf.exe            00000000013F1CE8  Unknown               Unknown  Unknown
wrf.exe            00000000011BB44B  Unknown               Unknown  Unknown
wrf.exe            0000000000DE008E  Unknown               Unknown  Unknown
wrf.exe            0000000000DDAEAD  Unknown               Unknown  Unknown
wrf.exe            00000000009AF813  Unknown               Unknown  Unknown
wrf.exe            0000000000690D01  Unknown               Unknown  Unknown
wrf.exe            000000000068DB21  Unknown               Unknown  Unknown
wrf.exe            000000000047BC1B  Unknown               Unknown  Unknown
wrf.exe            000000000047B049  Unknown               Unknown  Unknown
wrf.exe            000000000047AFEC  Unknown               Unknown  Unknown
libc.so.6          0000003C6421D994  Unknown               Unknown  Unknown
wrf.exe            000000000047AEE9  Unknown               Unknown  Unknown

where:

[lluis@wn025 run]$ find /lib* -name libc.so.6
/lib/libc.so.6
/lib/i686/nosegneg/libc.so.6
/lib64/libc.so.6

and

[lluis@wn025 run]$ ldd wrf.exe
        libmpi_f90.so.0 => not found
        libmpi_f77.so.0 => not found
        libmpi.so.0 => not found
        libopen-rte.so.0 => not found
        libopen-pal.so.0 => not found
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003c64600000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003c65600000)
        libutil.so.1 => /lib64/libutil.so.1 (0x0000003c65a00000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003c64e00000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003c64a00000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003c64200000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003c66200000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003c63e00000)

This happens with:

NIN_ra_lw_physics      = 3
NIN_ra_sw_physics      = 3

This does not happen with:

NIN_ra_lw_physics      = 4
NIN_ra_sw_physics      = 4

NOTE: This error is also found with a serial compilation. With debug set to 1000, the output is:

Namelist dfi_control not found in namelist.input. Using registry defaults for v
 ariables in dfi_control
 Namelist tc not found in namelist.input. Using registry defaults for variables
 in tc
 Namelist scm not found in namelist.input. Using registry defaults for variables
  in scm
 Namelist fire not found in namelist.input. Using registry defaults for variable
 s in fire
 WRF V3.1.1 MODEL
   wrf: calling alloc_and_configure_domain
  *************************************
  Parent domain
  ids,ide,jds,jde            1          50           1          50
  ims,ime,jms,jme           -4          55          -4          55
  ips,ipe,jps,jpe            1          50           1          50
  *************************************
 DYNAMICS OPTION: Eulerian Mass Coordinate
    alloc_space_field: domain            1,     95259880 bytes allocated
   med_initialdata_input: calling input_model_input
(...)
 INPUT LandUse = "USGS"
 LANDUSE TYPE = "USGS" FOUND          33  CATEGORIES           2  SEASONS
  WATER CATEGORY =           16  SNOW CATEGORY =           24
  *** SATURATION VAPOR PRESSURE TABLE COMPLETED ***
    num_months =           13
 AEROSOLS:  Background aerosol will be limited to bottom            6
  model interfaces.
   reading CAM_AEROPT_DATA
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
wrf.exe            000000000130FB51  Unknown               Unknown  Unknown
wrf.exe            0000000001310D17  Unknown               Unknown  Unknown
wrf.exe            0000000001312458  Unknown               Unknown  Unknown
wrf.exe            00000000010DC9BB  Unknown               Unknown  Unknown
wrf.exe            0000000000D12F9E  Unknown               Unknown  Unknown
wrf.exe            0000000000D0DDBD  Unknown               Unknown  Unknown
wrf.exe            0000000000906523  Unknown               Unknown  Unknown
wrf.exe            0000000000608AC1  Unknown               Unknown  Unknown
wrf.exe            00000000006058E1  Unknown               Unknown  Unknown
wrf.exe            0000000000404DEC  Unknown               Unknown  Unknown
wrf.exe            0000000000404249  Unknown               Unknown  Unknown
wrf.exe            00000000004041EC  Unknown               Unknown  Unknown
libc.so.6          0000003331A1D994  Unknown               Unknown  Unknown
wrf.exe            00000000004040E9  Unknown               Unknown  Unknown

And WRF configuration:

[lluis@mar run]$ ldd wrf.exe
        libm.so.6 => /lib64/libm.so.6 (0x00000039cac00000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00000039cb000000)
        libc.so.6 => /lib64/libc.so.6 (0x00000039ca400000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003ac2200000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00000039ca800000)
        /lib64/ld-linux-x86-64.so.2 (0x00000039ca000000)

The segmentation fault appears after line 3757 of phys/module_ra_cam_support.F (from WRFV3.1.1).

NOTE: after setting

ulimit -s unlimited

it works!
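
A minimal sketch of where to apply it (assumption: a PBS job script similar to the templates used elsewhere on this page), so that the unlimited stack is in effect in the shell that launches wrf.exe:

#!/bin/bash
### Hypothetical job name
#PBS -N wrf_cam_test
cd $PBS_O_WORKDIR

# The CAM radiation scheme allocates large automatic arrays on the stack;
# raise the stack limit before launching WRF to avoid the segmentation fault.
ulimit -s unlimited

mpiexec ./wrf.exe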

On the ESCENA domain, the simulation works with CAM ra_lw/sw when a compilation with run-time checks is used (SERIAL build in: /oceano/gmeteo/WORK/ASNA/WRF/Binaries/3.1.1/A/WRF/icif64/SERIALchk/WRFV3/main/wrf.exe). The following messages are shown in the standard output:

(...)
Timing for main: time 2001-11-10_00:02:30 on domain   1:  358.70981 elapsed seconds.
forrtl: warning (402): fort: (1): In call to PRE_RADIATION_DRIVER, an array temporary was created for argument #49
forrtl: warning (402): fort: (1): In call to PRE_RADIATION_DRIVER, an array temporary was created for argument #51
forrtl: warning (402): fort: (1): In call to RADIATION_DRIVER, an array temporary was created for argument #134
forrtl: warning (402): fort: (1): In call to RADIATION_DRIVER, an array temporary was created for argument #136
forrtl: warning (402): fort: (1): In call to SURFACE_DRIVER, an array temporary was created for argument #181
forrtl: warning (402): fort: (1): In call to SURFACE_DRIVER, an array temporary was created for argument #183
forrtl: warning (402): fort: (1): In call to PBL_DRIVER, an array temporary was created for argument #87
forrtl: warning (402): fort: (1): In call to PBL_DRIVER, an array temporary was created for argument #89
forrtl: warning (402): fort: (1): In call to CUMULUS_DRIVER, an array temporary was created for argument #21
forrtl: warning (402): fort: (1): In call to CUMULUS_DRIVER, an array temporary was created for argument #23
forrtl: warning (402): fort: (1): In call to FDDAGD_DRIVER, an array temporary was created for argument #58
forrtl: warning (402): fort: (1): In call to FDDAGD_DRIVER, an array temporary was created for argument #60
forrtl: warning (402): fort: (1): In call to MICROPHYSICS_DRIVER, an array temporary was created for argument #55
forrtl: warning (402): fort: (1): In call to MICROPHYSICS_DRIVER, an array temporary was created for argument #57
forrtl: warning (402): fort: (1): In call to DIAGNOSTIC_OUTPUT_CALC, an array temporary was created for argument #20
forrtl: warning (402): fort: (1): In call to DIAGNOSTIC_OUTPUT_CALC, an array temporary was created for argument #22
Timing for main: time 2001-11-10_00:05:00 on domain   1:   32.11490 elapsed seconds.
(...)
  • On phys/module_radiation_driver.F, subroutine pre_radiation_driver arguments #49, 51 are: i_end, j_end
  • On phys/module_radiation_driver.F, subroutine radiation_driver arguments #134, 136 are: i_end, j_end
  • On phys/module_surface_driver.F, subroutine surface_driver arguments #181, 183 are: i_end, j_end
  • On phys/module_pbl_driver.F, subroutine pbl_driver arguments #87, 89 are: i_end, j_end
  • On phys/module_cumulus_driver.F, subroutine cumulus_driver arguments #21, 23 are: i_end, j_end
  • On phys/module_fddagd_driver.F, subroutine fddagd_driver arguments #58, 60 are: i_end, j_end
  • On phys/module_microphysics_driver.F, subroutine microphysics_driver arguments #55, 57 are: i_end, j_end
  • On phys/module_diagnostics.F, subroutine diagnostic_output_calc arguments #20, 22 are: i_end, j_end

Subroutine definitions

  INTEGER, DIMENSION(num_tiles), INTENT(IN) ::                   &
  &                                    i_start,i_end,j_start,j_end

Definition in frame/module_domain_type.F, within the WRF derived type for the domain, TYPE(domain):

TYPE domain
 (...)
      INTEGER,POINTER                                     :: i_start(:),i_end(:)
      INTEGER,POINTER                                     :: j_start(:),j_end(:)
 (...)
      INTEGER                                             :: num_tiles        ! taken out of namelist 20000908
 (...)

Some information about WRF tiles

In frame/module_tiles.F, subroutines set_tiles1, set_tiles2 and set_tiles3 contain:

       IF ( ASSOCIATED(grid%i_start) ) THEN ; DEALLOCATE( grid%i_start ) ; NULLIFY( grid%i_start ) ; ENDIF
       IF ( ASSOCIATED(grid%i_end) )   THEN ; DEALLOCATE( grid%i_end   ) ; NULLIFY( grid%i_end   ) ; ENDIF
       IF ( ASSOCIATED(grid%j_start) ) THEN ; DEALLOCATE( grid%j_start ) ; NULLIFY( grid%j_start ) ; ENDIF
       IF ( ASSOCIATED(grid%j_end) )   THEN ; DEALLOCATE( grid%j_end   ) ; NULLIFY( grid%j_end   ) ; ENDIF
       ALLOCATE(grid%i_start(num_tiles))
       ALLOCATE(grid%i_end(num_tiles))
       ALLOCATE(grid%j_start(num_tiles))
       ALLOCATE(grid%j_end(num_tiles))
       grid%max_tiles = num_tiles

The recommended WRF compilation (Intel, distributed memory) is:

(...)
DMPARALLEL      =        1
OMPCPP          =       # -D_OPENMP
OMP             =       # -openmp -fpp -auto
SFC             =       ifort
SCC             =       icc
DM_FC           =       mpif90 -f90=$(SFC)
DM_CC           =       mpicc -cc=$(SCC) -DMPI2_SUPPORT
FC              =        $(DM_FC)
CC              =       $(DM_CC) -DFSEEKO64_OK
LD              =       $(FC)
RWORDSIZE       =       $(NATIVE_RWORDSIZE)
PROMOTION       =       -i4
ARCH_LOCAL      =       -DNONSTANDARD_SYSTEM_FUNC
CFLAGS_LOCAL    =       -w -O3 -ip
LDFLAGS_LOCAL   =       -ip
CPLUSPLUSLIB    =
ESMF_LDFLAG     =       $(CPLUSPLUSLIB)
FCOPTIM         =       -O3
FCREDUCEDOPT    =       $(FCOPTIM)
FCNOOPT         =       -O0 -fno-inline -fno-ip
FCDEBUG         =       # -g $(FCNOOPT) -traceback
FORMAT_FIXED    =       -FI
FORMAT_FREE     =       -FR
FCSUFFIX        =
BYTESWAPIO      =       -convert big_endian
FCBASEOPTS      =       -w -ftz -align all -fno-alias -fp-model precise $(FCDEBUG) $(FORMAT_FREE) $(BYTESWAPIO)
MODULE_SRCH_FLAG =
TRADFLAG        =      -traditional
CPP             =      /lib/cpp -C -P
AR              =      ar
(...)

Simply changing the '-O3' optimization to '-O2' makes it work properly. (It also works with '-O1', but that makes the simulations slower.)

Latest news

It also works when adding the compilation flag '-heap-arrays', which the compiler documentation describes as:

      -heap-arrays [size]
       -no-heap-arrays
              Puts  automatic  arrays  and  arrays created for temporary
              computations on the heap instead of the stack.
              Architectures: IA-32, Intel® 64, IA-64 architectures
              Default:
              -no-heap-arrays   The compiler puts automatic  arrays  and
                                arrays  created  for  temporary computa-
                                tions in temporary storage in the  stack
                                storage area.
              Description:
              This  option  puts automatic arrays and arrays created for
              temporary computations on the heap instead of the stack.
              If heap-arrays is specified and size is omitted, all auto-
              matic  and  temporary arrays are put on the heap. If 10 is
              specified for size, all  automatic  and  temporary  arrays
              larger than 10 KB are put on the heap.

It has been added to configure.wrf simply by appending the flag:

(...)
CFLAGS_LOCAL    =       -w -O3 -heap-arrays -ip
(...)
FCOPTIM         =       -O3 -heap-arrays

A complete discussion thread is available on the Intel forum.
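
As a sketch (assuming a configure.wrf whose FCOPTIM and CFLAGS_LOCAL lines look like the ones quoted above), the flag can be appended without editing the file by hand; the sed expressions below are illustrative only:

# Back up configure.wrf and append -heap-arrays to the optimization flags
cp configure.wrf configure.wrf.orig
sed -i -e 's/^\(FCOPTIM[[:space:]]*=[[:space:]]*-O3\)/\1 -heap-arrays/' \
       -e 's/^\(CFLAGS_LOCAL[[:space:]]*=.*-O3\)/\1 -heap-arrays/' configure.wrf
grep -E '^(FCOPTIM|CFLAGS_LOCAL)' configure.wrf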

[] STOP of simulations due to library problems

Some executions stop with the following error message:

/oceano/gmeteo/WORK/ASNA/WRF/run/scne2a/scne2a/0016/bin/wrf.exe: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory
/oceano/gmeteo/WORK/ASNA/WRF/run/scne2a/scne2a/0016/bin/wrf.exe: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory
/oceano/gmeteo/WORK/ASNA/WRF/run/scne2a/scne2a/0016/bin/wrf.exe: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory
/oceano/gmeteo/WORK/ASNA/WRF/run/scne2a/scne2a/0016/bin/wrf.exe: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory
/oceano/gmeteo/WORK/ASNA/WRF/run/scne2a/scne2a/0016/bin/wrf.exe: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory
/oceano/gmeteo/WORK/ASNA/WRF/run/scne2a/scne2a/0016/bin/wrf.exe: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory
/oceano/gmeteo/WORK/ASNA/WRF/run/scne2a/scne2a/0016/bin/wrf.exe: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory
/oceano/gmeteo/WORK/ASNA/WRF/run/scne2a/scne2a/0016/bin/wrf.exe: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory

No rsl.[error/out].[nnnn] file is written.

Running ldd on 'wrf.exe' gives the same output on both nodes:

[lluis@wn031 ~]$ ldd /oceano/gmeteo/WORK/ASNA/WRF/Binaries/3.1.1/A/CLW//gcgf64/OMPI/WRFV3/main/wrf.exe
        libmpi_f90.so.0 => not found
        libmpi_f77.so.0 => not found
        libmpi.so.0 => not found
        libopen-rte.so.0 => not found
        libopen-pal.so.0 => not found
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003a9d800000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003a9e400000)
        libutil.so.1 => /lib64/libutil.so.1 (0x0000003aa0400000)
        libgfortran.so.3 => not found
        libm.so.6 => /lib64/libm.so.6 (0x0000003a9dc00000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003a9f400000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003a9e000000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003a9d400000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003a9d000000)

gfortran is installed:

[lluis@wn041 ~]$ which gfortran
/usr/bin/gfortran
[lluis@wn041 ~]$ ldd /usr/bin/gfortran
        libc.so.6 => /lib64/libc.so.6 (0x0000003a9d400000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003a9d000000)

On a working node ldd gives:

[lluis@wn010 ~]$ ldd /oceano/gmeteo/WORK/ASNA/WRF/Binaries/3.1.1/A/CLW//gcgf64/OMPI/WRFV3/main/wrf.exe
        libmpi_f90.so.0 => not found
        libmpi_f77.so.0 => not found
        libmpi.so.0 => not found
        libopen-rte.so.0 => not found
        libopen-pal.so.0 => not found
        libdl.so.2 => /lib64/libdl.so.2 (0x0000003cd0c00000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003cd3000000)
        libutil.so.1 => /lib64/libutil.so.1 (0x0000003cd3400000)
        libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x00002b5f55924000)
        libm.so.6 => /lib64/libm.so.6 (0x0000003cd1400000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003cd2800000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003cd1000000)
        libc.so.6 => /lib64/libc.so.6 (0x0000003cd0800000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003cd0400000)

The ldd output for gfortran itself is the same on 'wn010' and on 'wn031/041'.
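
A quick sketch to locate the nodes with the missing runtime library (assuming cexec on ce01, as in the first section); installing the missing library on those nodes, or linking WRF statically against gfortran, would then be the fix:

# Which worker nodes can resolve libgfortran.so.3 and which cannot
cexec 'ls /usr/lib64/libgfortran.so.3 2>/dev/null || echo "libgfortran.so.3 MISSING"'

# The same information taken straight from the binary
cexec 'ldd /oceano/gmeteo/WORK/ASNA/WRF/Binaries/3.1.1/A/CLW//gcgf64/OMPI/WRFV3/main/wrf.exe | grep libgfortran'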

[] STOP of simulations due to net delays

During execution of wrf.exe, simulations stop with the following messages (with OpenMPI):

rsl.error.0004

taskid: 4 hostname: wn017.macc.unican.es[wn017.macc.unican.es][[20060,1],4][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Invalid argument (22)

rsl.error.0005

taskid: 5 hostname: wn017.macc.unican.es[wn017.macc.unican.es][[20060,1],5][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)

rsl.error.0006

taskid: 6 hostname: wn017.macc.unican.es[wn017.macc.unican.es][[20060,1],6][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)

rsl.error.0007

[wn017.macc.unican.es][[20060,1],7][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)

We are experiencing some network problems, with an important drop in the responsiveness of the cluster ('dinamic' queue).

[x] In real STOP: At line 703 of file module_initialize_real.f90

During execution of real.exe, the following appears:

Namelist dfi_control not found in namelist.input. Using registry defaults for variables in dfi_control
 Namelist tc not found in namelist.input. Using registry defaults for variables in tc
 Namelist scm not found in namelist.input. Using registry defaults for variables in scm
 Namelist fire not found in namelist.input. Using registry defaults for variables in fire
 REAL_EM V3.1.1 PREPROCESSOR
  *************************************
  Parent domain
  ids,ide,jds,jde            1         167           1         139
  ims,ime,jms,jme           -4         172          -4         144
  ips,ipe,jps,jpe            1         167           1         139
  *************************************
 DYNAMICS OPTION: Eulerian Mass Coordinate
    alloc_space_field: domain            1 ,    804753800  bytes allocated
Time period #   1 to process = 2025-01-01_00:00:00.
Time period #   2 to process = 2025-01-01_06:00:00.
(...)
Time period #  56 to process = 2025-01-14_18:00:00.
Time period #  57 to process = 2025-01-15_00:00:00.
Total analysis times to input =   57.

 -----------------------------------------------------------------------------

 Domain  1: Current date being processed: 2025-01-01_00:00:00.0000, which is loop #   1 out of   57
 configflags%julyr, %julday, %gmt:        2025           1   0.000000
 d01 2025-01-01_00:00:00 Timing for input          0 s.
 d01 2025-01-01_00:00:00          flag_soil_layers read from met_em file is  1
At line 703 of file module_initialize_real.f90
Fortran runtime error: End of record

The error messages

At line 703 of file module_initialize_real.f90
Fortran runtime error: End of record

are gfortran run-time errors.

This occurs because the input data do not contain PMSL/PSFC! From ungrib.log:

(...)
Inventory for date = 2025-01-01 00:00:00
PRES   HGT      TT       UU       VV       RH       SOILHGT  LANDSEA  PSFC     PMSL     SST      SKINTEMP SNOW     ST000007 ST007028 ST028100 ST100255 SM000007 SM007028 SM028100 SM100255
-------------------------------------------------------------------------------
2001.1  O        O        O        O        O        O        O        O        O        O        O        X        O        O        O        O        O        O        O        O
2001.0  O        X        X        X        X        X        X        O        O        O        X        O        O        O        O        O        O        O        O        O
1000.0  X        X        X        X        X
 925.0  X        X        X        X        X
 850.0  X        X        X        X        X
 700.0  X        X        X        X        X
 500.0  X        X        X        X        X
 300.0  X        X        X        X        X
 200.0  X        X        X        X        X
 100.0  X        X        X        X        X
  50.0  X        X        X        X        X
-------------------------------------------------------------------------------
(...)

Removing PSFC/PMSL from working input data reproduces the error!
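
A simple pre-flight sketch before launching real.exe (assuming the default 'FILE' prefix for the ungrib intermediate files; rd_intermediate.exe is shipped in the WPS util/ directory): list the fields actually written by ungrib and warn if the surface pressure fields are absent, since their absence is what triggers the 'End of record' abort.

# List the fields in one intermediate file and look for the pressure fields
./util/rd_intermediate.exe FILE:2025-01-01_00 | grep -E 'PSFC|PMSL' \
    || echo "WARNING: no PSFC/PMSL found in the ungribbed data" >&2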

[x] Execution error in WRF

On GRIDUI, this error appears in different experiments (scnc1a, scnc1b):

On /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/log/rsl_wrf/rsl.error.0000, simulation started at 19750423000000:

(...)
Timing for main: time 1975-05-13_22:32:30 on domain   1:    2.23900 elapsed seconds.
[gcsic019wn:19507] *** Process received signal ***
[gcsic019wn:19507] Signal: Segmentation fault (11)
[gcsic019wn:19507] Signal code: Address not mapped (1)
[gcsic019wn:19507] Failing at address: 0xfffffffc01fd0668
[gcsic019wn:19507] [ 0] /lib64/libpthread.so.0 [0x3df980e930]
[gcsic019wn:19507] [ 1] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_ra_cam_support__estblf+0x56) [0x1411806]
[gcsic019wn:19507] [ 2] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_ra_cam_support__aqsat+0x189) [0x14119b9]
[gcsic019wn:19507] [ 3] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_ra_cam__radctl+0x151f) [0x118ee8f]
[gcsic019wn:19507] [ 4] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_ra_cam__camrad+0x584d) [0x1196ded]
[gcsic019wn:19507] [ 5] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_radiation_driver__radiation_driver+0x4d8b) [0xcddd3b]
[gcsic019wn:19507] [ 6] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_first_rk_step_part1__first_rk_step_part1+0x3edb) [0xdb734b]
[gcsic019wn:19507] [ 7] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(solve_em_+0x1c89f) [0x95939f]
[gcsic019wn:19507] [ 8] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(solve_interface_+0x6c8) [0x635e30]
[gcsic019wn:19507] [ 9] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_integrate__integrate+0x1b0) [0x494444]
[gcsic019wn:19507] [10] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_wrf_top__wrf_run+0x22) [0x46ef12]
[gcsic019wn:19507] [11] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(MAIN__+0x3a) [0x46e5ca]
[gcsic019wn:19507] [12] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(main+0xe) [0x14e9cae]
[gcsic019wn:19507] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3df901d994]
[gcsic019wn:19507] [14] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe [0x46e4d9]
[gcsic019wn:19507] *** End of error message ***

Same in /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/log/rsl_wrf/rsl.error.0000, simulation started at 19500507000000:

(...)
Timing for main: time 1950-05-10_16:17:30 on domain   1:    4.13800 elapsed seconds.
Timing for [gcsic116wn:30182] *** Process received signal ***
[gcsic116wn:30182] Signal: Segmentation fault (11)
[gcsic116wn:30182] Signal code: Address not mapped (1)
[gcsic116wn:30182] Failing at address: 0xfffffffc01fd0668
[gcsic116wn:30182] [ 0] /lib64/libpthread.so.0 [0x3dc780e930]
[gcsic116wn:30182] [ 1] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_ra_cam_support__estblf+0x56) [0x1411806]
[gcsic116wn:30182] [ 2] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_ra_cam_support__aqsat+0x110) [0x1411940]
[gcsic116wn:30182] [ 3] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_ra_cam__radctl+0x151f) [0x118ee8f]
[gcsic116wn:30182] [ 4] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_ra_cam__camrad+0x584d) [0x1196ded]
[gcsic116wn:30182] [ 5] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_radiation_driver__radiation_driver+0x4d8b) [0xcddd3b]
[gcsic116wn:30182] [ 6] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_first_rk_step_part1__first_rk_step_part1+0x3edb) [0xdb734b]
[gcsic116wn:30182] [ 7] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(solve_em_+0x1c89f) [0x95939f]
[gcsic116wn:30182] [ 8] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(solve_interface_+0x6c8) [0x635e30]
[gcsic116wn:30182] [ 9] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_integrate__integrate+0x1b0) [0x494444]
[gcsic116wn:30182] [10] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_wrf_top__wrf_run+0x22) [0x46ef12]
[gcsic116wn:30182] [11] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(MAIN__+0x3a) [0x46e5ca]
[gcsic116wn:30182] [12] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(main+0xe) [0x14e9cae]
[gcsic116wn:30182] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3dc701d994]
[gcsic116wn:30182] [14] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe [0x46e4d9]
[gcsic116wn:30182] *** End of error message ***

This happens because WPS uses the 'iDirectionIncrementInDegrees' attribute, but cdo (used to transform the input files) does not set it (it gives a 'MISSING' value). To prevent this error and others (not enough decimals in the encoding of the value in the input GRIB files), a new namelist option has been introduced in the 'ungrib' section of namelist.wps:

 is_global                = 1,

With this value one declares that the input files are global, so the grid increment in degrees in the 'i' direction is computed from the number of points in the x dimension of the input files. To allow this, some modifications have to be made to the WPS/ungrib source code.

NOTE: This option is only valid if the input data lie on a regular grid in the x direction.

  • ungrib/src/rd_grib1.F
    62       SUBROUTINE rd_grib1(IUNIT, gribflnm, level, field, hdate,  &
     63            ierr, iuarr, debug_level, is_g)
    (...)
     77       ! L. Fita. UC. August 2010
     78         INTEGER :: is_g
    (...)
    372      ! L. Fita. UC. August 2010
    373           IF (is_g == 1) THEN
    374             PRINT *,"*********** L. Fita. UC . August 2010 ***********"
    375             PRINT *,"***       Assuming global regular grid.       ***"
    376             PRINT *,"*** Computing 'dx' from number of points 'Nx' ***"
    377             PRINT *,"*************************************************"
    378             map%dx = 360.0 / map%nx
    379             PRINT *,'Nx = ',map%nx,' dx:',map%dx
    380           ELSE
    381             map%dx = ginfo(8)
    382           ENDIF
    (...)
    423       ! L. Fita. UC. August 2010
    424            IF (is_g == 1) THEN
    425              PRINT *,"*********** L. Fita. UC . August 2010 ***********"
    426              PRINT *,"***       Assuming global regular grid.       ***"
    427              PRINT *,"*** Computing 'dx' from number of points 'Nx' ***"
    428              PRINT *,"*************************************************"
    429              map%dx = 360.0 / map%nx
    430              PRINT *,'Nx = ',map%nx,' dx:',map%dx
    431            ELSE
    432              map%dx = ginfo(8)
    433            ENDIF
    

ungrib/src/read_namelist.F

1       subroutine read_namelist(hstart, hend, delta_time, ntimes,&
2                  ordered_by_date, debug_level, out_format, prefix, is_global)
3             ! L. Fita. UC. August 2010
4             !   Adding new 'namelsit.wps' value in '&ungrib' section: is_global (0, No; 1,
5             !     Yes [default 0]).
6             ! NOTE: This modification is only useful for global GRIBs with a regular
7             !   longitude distribution
8             !
9             ! EXPLANATION:
10            !   In some global files, grid information, is not correctly extacted and/or
11            !   they could not be exactly fitted in an entire earth. By this modification,
12            !   gris spacing in x direction is computed from the number of grid points in
13            !   this direction
(...)
58            ! L. fita. UC. August 2010
59              INTEGER :: is_global
(...)
72              ordered_by_date, prefix, is_global

ungrib/src/ungrib.F

74            ! L. Fita. UC 2010 August
75              INTEGER :: is_global
(...)
97              call read_namelist(hstart, hend, interval, ntimes, &
98                   ordered_by_date, debug_level, out_format, prefix, is_global)
(...)
207             call rd_grib1(nunit1, gribflnm, level, field, &
208                  hdate, ierr, iuarr, debug_level, is_global)

The following will appear during the ungrib.exe execution:

*** Starting program ungrib.exe ***
Start_date =  1975-07-16_00:00:00 ,      End_date = 1975-07-30_00:00:00
output format is WPS
Path to intermediate files is ./
 ungrib - grib edition num           1
 *********** L. Fita. UC . August 2010 ***********
 ***       Assuming global regular grid.       ***
 *** Computing 'dx' from number of points 'Nx' ***
 *************************************************
 Nx =          128  dx:   2.812500
 *********** L. Fita. UC . August 2010 ***********
 ***       Assuming global regular grid.       ***
 *** Computing 'dx' from number of points 'Nx' ***
 *************************************************
 Nx =          128  dx:   2.812500
(...)

NOTE: With data from CNRM in the period 1950-1970 the error is still there...

[x] SST missing values in coastal lines

Along coastlines, SST is badly interpolated. This is fixed by changing how the SST interpolation is done in METGRID.TBL (thanks to Dr. Priscilla A. Mooney, National University of Ireland, Maynooth, Ireland):

========================================
name=SST
        interp_option=sixteen_pt+four_pt+wt_average_4pt+search
        missing_value=-1e+30
        interp_mask=LANDSEA(1)
        masked=land
        fill_missing=0.
        flag_in_output=FLAG_SST
========================================

[] p4_error: latest msg from perror: Invalid argument

The simulation stops. The message appears at the first time step after opening a 'wrfrst' file.

[x] p4_error: OOPS: semop lock failed: -1

Simulation stopped. Reference in:

Same as in the 'p4_error: semget failed for setnum' problem (see below).

From ce01, run:

cexec /opt/mpich/gnu/sbin/cleanipcs

[] *** glibc detected *** malloc(): memory corruption:

The simulation stopped. In some rsl.error.00[nn] files the following line appears:

  • rsl.error.0006

(...)
*** glibc detected *** malloc(): memory corruption: 0x000000000b215c50 ***

The error appeared during the CLWRF implementation. Some nasty numerical things must have been happening. Once the coding errors were repaired, the error disappeared... (luckily?)

[x] Missing required environment variable: MPIRUN_RANK

WRF real.exe stopped with message:

PMGR_COLLECTIVE ERROR: unitialized MPI task: Missing required environment variable: MPIRUN_RANK
mpiexec: Warning: task 0 exited with status 1.

An incorrect version of mpiexec was used. You must run an adequate mpiexec version; check which one is in the PATH with:

which mpiexec

[] Different wrf.exe from different nodes

From wn001 to wn024:

> ls -la /oceano/gmeteo/DATA/WRF/WRF_bin/3.1/WRF4G/MVAPICH/WRFV3/main/

...
-rw-rw----   1 lluis gmeteo    62797 May 19 13:28 wrf_ESMFMod.F
-rwxr-x--x   1 lluis gmeteo 21147307 May 26 14:58 wrf.exe
-rw-rw----   1 lluis gmeteo      918 May 19 13:28 wrf.F
...

From wn025:

> ls -la /oceano/gmeteo/DATA/WRF/WRF_bin/3.1/WRF4G/MVAPICH/WRFV3/main/

...
-rw-rw----   1 lluis gmeteo    62797 May 19 13:28 wrf_ESMFMod.F
-rwxr-x--x   0 lluis gmeteo 21147057 May 25 17:39 wrf.exe
-rw-rw----   1 lluis gmeteo      918 May 19 13:28 wrf.F
...

Differences in the hard-link count (see ls and Hard_Link), in the date and in the size!? During a simulation, each node is running a different wrf.exe!

Problem 'solved' by rebooting wn025.
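
A sketch to detect this situation (stale NFS client caches serving different contents of the same file on different nodes), again assuming cexec on ce01: compare checksums of the binary across the nodes; any node reporting a different sum needs its cache flushed or a reboot, as above.

# All nodes should report exactly the same checksum for wrf.exe
cexec 'md5sum /oceano/gmeteo/DATA/WRF/WRF_bin/3.1/WRF4G/MVAPICH/WRFV3/main/wrf.exe' \
    | grep wrf.exe | sort | uniq -c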

[] mpiexec: Error: poll_or_block_event: tm_poll: tm: no event

A second attempt to run does not give this error?! There was no memory/space left on the nodes (bad ending of a previous simulation).

[x] mvapich 'call system()' failed

When WRF4G is used, the simulation stops when the second output file starts to be written (probably due to $WRFGEL_SCRIPT?).

See comments:

And the user guide: http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html#x1-350007.1.2

The Linux kernel version is old. It is recommended that the kernel be at version 2.6.16 or newer:

Linux wn010.macc.unican.es 2.6.9-78.0.13.EL.cernsmp #1 SMP Mon Jan 19 14:00:58 CET 2009 x86_64 x86_64 x86_64 GNU/Linux

and the MVAPICH version:

>mpichversion
MPICH Version:          1.2.7
MPICH Release date:     $Date: 2005/06/22 16:33:49$
MPICH Patches applied:  none
MPICH configure:        --with-device=ch_gen2 --with-arch=LINUX -prefix=/software/ScientificLinux/4.6/mvapich/1.1/pgi_7.1-6_gcc --with-romio --without-mpe -lib=-L/usr/lib64 -Wl,-rpath=/usr/lib64 -libverbs -libumad -lpthread
MPICH Device:           ch_gen2

The problem is solved for the moment by declaring a new environment variable:

export IBV_FORK_SAFE=1

[x] mpiexec: Warning: read_ib_one: protocol version 8 not known, but might still work.

Error message when executing mpiexec:

mpiexec: Warning: read_ib_one: protocol version 8 not known, but might still work.
mpiexec: Error: read_ib_one: mixed version executables (6 and 8), no hope.

This error message appears when a wrong version of mpiexec is used. One must indicate the correct one in: /software/ScientificLinux/4.6/mpiexec/mpiexec

[x] ECMWF ERA40 escena missing data

Incomplete ERA40 data were downloaded for the escena domain in /oceano/gmeteo/DATA/ECMWF/ERA40/escena. Affected years: 1968, 1969, 1971 and 1979.

[x] Large waiting in GRID-CSIC

Jobs wait for more than one day in IFCA GRID-CSIC when requesting a selection of nodes=[N]:ppn=[M] (N=2, M=8).

  • In EGEEUI01, nodes can be occupied by single-core jobs, which makes it difficult for node-exclusive jobs to start running. It is more adequate to submit jobs with the total number of cores, without requesting exclusivity of one physical machine (the EGEEUI01 cluster has 8-core nodes).
  • Changes in wrf_AUTOlauncher_iteration.bash now make the core assignment as [N]*[M], without the mpiexec -npernode [M] line in [template].job. A new template has been created, MPI_job-EGEEUI01.pbs:
    #!/bin/bash
    ### Job name
    #PBS -N @JOBnameSIM@
    ### Queue name
    #PBS -q lmeteo
    ### Dependency
    #PBS -W depend=afterany:@IDpbs@
    ###  Total number of processes
    #PBS -l nodes=@Nnodes@
    # This job's working directory
    echo Working directory is $PBS_O_WORKDIR
    cd $PBS_O_WORKDIR
    echo Running on host `hostname`
    echo Time is `date`
    echo Directory is `pwd`
    echo This jobs runs on the following processors:
    echo `cat $PBS_NODEFILE`
    ##
    #Running WRF
    ##
    export OMP_NUM_THREADS=@Ntrh@
    echo "Number of threads: $OMP_NUM_THREADS"
    echo "Number of MPI jobs: $Nprocess"
    mpiexec ./wrf.exe
    

It can only work if 'nodes' is not interpreted as an entire physical machine; it must be counted as a CPU (or core). More information in:
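As an illustration of the difference (a sketch only; the exact meaning of a 'node' depends on how the site's Torque/Maui is configured), these are the two ways of asking for 16 cores:

# Node-exclusive request: two complete 8-core machines (long waits on EGEEUI01)
#PBS -l nodes=2:ppn=8

# Core-count request: 16 cores wherever the scheduler finds them (what MPI_job-EGEEUI01.pbs now does)
#PBS -l nodes=16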

[x] cshell error in wn010

On wn010 a systematic csh error appears as soon as a csh terminal is opened:

setenv: Too many arguments

A problem in a csh.profile has been repaired.

[x] Stale NFS file handle

In IFCA GRID-CSIC, a stale NFS file handle error appears while running wrf.exe (for the BIGescena domain):

/var/spool/pbs/mom_priv/jobs/1070626.tor.SC: line 400: 22711 Bus error               mpiexec -npernode 8 ./wrf.exe
rm: cannot remove `wrf.exe': Stale NFS file handle
rm: cannot remove `*.TBL': Stale NFS file handle
rm: cannot remove `*_DATA*': Stale NFS file handle
rm: cannot remove `met_em*': Stale NFS file handle
rm: cannot remove `wrfbdy*': Stale NFS file handle
rm: cannot remove `wrfinput*': Stale NFS file handle
/var/spool/pbs/mom_priv/jobs/1070626.tor.SC: line 345: /gpfs/ifca.es/meteo/forest//bats/change_in_file.bash: Stale NFS file handle
/var/spool/pbs/mom_priv/jobs/1070626.tor.SC: line 356: cd: /gpfs/ifca.es/meteo/SCRATCH/BIGescena/1970_1975Restart28d/simulations/1970010100_1970012900: Stale NFS file handle
(...)

Some errors occurred on the NFS server.

[x] metgrid.exe Segmentation fault

While metgrid.exe is running (in IFCA GRID-CSIC, for the Africa_25km domain), a segmentation fault appears. From [job].e[nnnnn]:

/var/spool/pbs/mom_priv/jobs/1073948.tor.SC: line 195: 19831 Segmentation fault

The global analyses used were defined only for a European region.

[] CAM NaN

module_ra_cam_support.F generates NaN values at a given time step (around the 350th Julian day of 1996 and 2001, i.e. 1996/XII/15 and 2001/XII/16). The rsl.out.[nnnn] files then grow until they fill the hard disk (because of the output written to them). The following has been done:

 vert_interpolate: mmr < 0, m, col, lev, mmr            2            2             1                       NaN
 vert_interpolate: aerosol(k),(k+1)   1.0000000116860974E-007     0.000000000000000
 vert_interpolate: pint(k+1),(k)                       NaN                        NaN
 n,c            1            1
  • FATAL_ERROR signal: a call wrf_error_fatal ('Error of computation') line has been introduced in the WRFV3/phys/module_ra_cam_support.F file
  • isnand(): this PGI intrinsic has been added in several places of module_ra_cam_support.F and module_ra_cam.F, making it possible to find where the first 'NaN' values appear

This is possibly a WRFv3.0.1.1 bug related to the temporal interpolation of the CO2 concentrations on 15/XII of any year (change of the monthly value).
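Since the rsl files grow very quickly once the NaNs appear, a simple check like the following (a minimal sketch, assuming the standard rsl.out.[nnnn]/rsl.error.[nnnn] naming) helps to find which processes are affected and where the first 'NaN' shows up:

# Which rsl files already contain NaN values?
grep -l "NaN" rsl.out.* rsl.error.*

# First NaN occurrence (with a little context) in one of them
grep -n -m 1 -B 2 "NaN" rsl.out.0000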

[x] p4_error: semget failed for setnum: 12

Information sources:

This error means that there is not enough shared memory available to allocate a new memory segment for interprocess communication. Often what happens is that some extra memory segments are left over from a crash or programming error of a previous job and need to be cleaned up. There is a script called cleanipcs that will remove all of your leftover IPCs. Users are responsible for cleaning up extra shared memory segments after a crash or when their job is complete.

You can use /usr/bin/ipcs to check the shared memory state on one node (example given for ssh wn013 ipcs):

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      644        72         2
0x00000000 32769      root      644        16384      2
0x00000000 65538      root      644        280        2
0x00000000 2654211    lluis     600        33554432   0
------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x000000a7 0          root      666        1
0x00000000 11337729   lluis     600        10
0x00000000 11370498   lluis     600        10
0x00000000 11403267   lluis     600        10
0x00000000 11436036   lluis     600        10
0x00000000 11468805   lluis     600        10
0x00000000 11501574   lluis     600        10
0x00000000 11534343   lluis     600        10
0x00000000 11567112   lluis     600        10
0x00000000 11599881   lluis     600        10
0x00000000 11632650   lluis     600        10
0x00000000 11665419   lluis     600        10
0x00000000 11698188   lluis     600        10
0x00000000 11730957   lluis     600        10
0x00000000 11763726   lluis     600        10
0x00000000 11796495   lluis     600        10
0x00000000 11829264   lluis     600        10
0x00000000 11862033   lluis     600        10
0x00000000 11894802   lluis     600        10
0x00000000 11927571   lluis     600        10
0x00000000 11960340   lluis     600        10
0x00000000 11993109   lluis     600        10
0x00000000 12025878   lluis     600        10
0x00000000 12058647   lluis     600        10
0x00000000 14352408   lluis     600        10
0x00000000 14385177   lluis     600        10
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

Use the following command to clean up all memory segments owned by your user id on a cluster:

cexec /opt/mpich/gnu/sbin/cleanipcs

Or for each working node (be careful not to run the script on any node with a correctly running simulation!!):

ssh wn[NNN] /software/ScientificLinux/4.6/mpich/1.2.7p1/pgi_7.1-6_gcc/sbin/cleanipcs

After that (on wn013):

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root      644        72         2
0x00000000 32769      root      644        16384      2
0x00000000 65538      root      644        280        2
------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x000000a7 0          root      666        1
------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
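If the cleanipcs script is not available on a node, the same cleanup can be sketched with ipcs/ipcrm directly (it only removes segments and semaphores owned by your own user, but again: do not run it on a node with a correctly running simulation):

# Remove leftover shared memory segments owned by $USER
ipcs -m | grep $USER | awk '{print $2}' | xargs -r -n1 ipcrm -m

# Remove leftover semaphore arrays owned by $USER
ipcs -s | grep $USER | awk '{print $2}' | xargs -r -n1 ipcrm -s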

[x] P4_GLOBMEMSIZE

There is not enough memory for the MPICH processes of the simulation. The error message looks like:

p3_15324: (1.777344) xx_shmalloc: returning NULL; requested 262192 bytes
p3_15324: (1.777344) p4_shmalloc returning NULL; request = 262192 bytes
You can increase the amount of memory by setting the environment variable
P4_GLOBMEMSIZE (in bytes); the current size is 4194304
p3_15324:  p4_error: alloc_p4_msg failed: 0

This is a typical error for simulations with domains as big as the Europe_10 and BIGescena domains. The default value is 4 MB (4194304 bytes).

Increase the value, for example to one of the following (see the sketch after this list for where to set it):

  • 32 MB: export P4_GLOBMEMSIZE=33554432
  • 64 MB: export P4_GLOBMEMSIZE=67108864
  • 128 MB: export P4_GLOBMEMSIZE=134217728
  • 256 MB: export P4_GLOBMEMSIZE=268435456
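The variable must be set in the environment of the MPICH processes before the simulation is launched, e.g. in the job script (a minimal sketch; 128 MB is an arbitrary choice here, and depending on the launcher it may also have to be set in the shell startup files of the nodes):

# In the job script, before starting WRF
export P4_GLOBMEMSIZE=134217728    # 128 MB
mpiexec ./wrf.exe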

[x] SKINTEMP not found

ERA40 ECMWF files use a different coding of the variables. A modification of Vtable.ECMWF is carried out:

Original lines

 34 |  1   |   0  |      | SST      | K        | Sea-Surface Temperature                  |
139 | 112  |   0  |   7  | ST000007 | K        | T of 0-7 cm ground layer                 |

Modification

139 |  1   |   0  |      | SST      | K        | Sea-Surface Temperature                  |
139 | 112  |   0  |   7  | SKINTEMP | K        | T of 0-7 cm ground layer                 |

[x] WOULD GO OFF TOP: KF_ETA_PARA I,J,DPTHMX,DPMIN 81 78 NaN 5000.000

See http://forum.wrfforum.com/viewtopic.php?f=6&t=263

Many causes are possible: CFL violations, problems with the initial or boundary conditions, etc. Lowering the time step or switching off the feedback between nests are possible solutions.

[x] Metgrid error: Error in ext_pkg_write_field in metgrid.log

Also in log/metgrid_1995030912.out:

ERROR: Error in ext_pkg_write_field
 WRF_DEBUG: Warning DIM            4 , NAME num_metgrid_levels REDIFINED  by var GHT           17          18  in wrf_io.F90 line        2424

This error means that one or more surface variables are probably missing in the model input (for example, in the NCEP reanalyses). The input GRIB files must be checked and fixed.
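A quick way to inspect what each input GRIB file actually contains is wgrib (a sketch for GRIB1 files; the file names below are only illustrative):

# Short inventory of one input file: check that the expected surface fields are present
wgrib -s NCEP_1995030912.grb | less

# Record count per file: files with missing fields usually stand out
for f in *.grb; do echo "$f: $(wgrib -s "$f" | wc -l) records"; done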

[] forrtl: severe (174): SIGSEGV, segmentation fault occurred

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
wrf.exe 00000000013EF561 Unknown Unknown Unknown
wrf.exe 00000000013F0727 Unknown Unknown Unknown
wrf.exe 00000000013F1E68 Unknown Unknown Unknown
wrf.exe 00000000011BB5CB Unknown Unknown Unknown
wrf.exe 0000000000DE0913 Unknown Unknown Unknown
wrf.exe 0000000000DDAEBD Unknown Unknown Unknown
wrf.exe 00000000009AF823 Unknown Unknown Unknown
wrf.exe 0000000000690D01 Unknown Unknown Unknown
wrf.exe 000000000068DB21 Unknown Unknown Unknown
wrf.exe 000000000047BC1B Unknown Unknown Unknown
wrf.exe 000000000047B049 Unknown Unknown Unknown
wrf.exe 000000000047AFEC Unknown Unknown Unknown
libc.so.6 0000003AD001D994 Unknown Unknown Unknown
wrf.exe 000000000047AEE9 Unknown Unknown Unknown

The causes are unknown, but it worked just by submitting the simulation again, without any change.

[] wrf.exe: posixio.c:213: px_pgout: Assertion `*posp == ((off_t)(-1)) || *posp == lseek(nciop->fd, 0, 1)' failed.

It appeared in a continuous simulation with spectral nudging, using WRF 3.1.1. rsl.error.0000 shows:

wrf.exe: posixio.c:213: px_pgout: Assertion `*posp == ((off_t)(-1)) || *posp == lseek(nciop->fd, 0, 1)' failed.
forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source
libc.so.6          0000003AD0030265  Unknown               Unknown  Unknown
libc.so.6          0000003AD0031D10  Unknown               Unknown  Unknown
libc.so.6          0000003AD00296E6  Unknown               Unknown  Unknown
wrf.exe            000000000154368A  Unknown               Unknown  Unknown
wrf.exe            0000000001518A2D  Unknown               Unknown  Unknown
wrf.exe            000000000152741E  Unknown               Unknown  Unknown
wrf.exe            00000000014CCD30  Unknown               Unknown  Unknown
wrf.exe            00000000014CBADD  Unknown               Unknown  Unknown
wrf.exe            00000000014BAD59  Unknown               Unknown  Unknown
wrf.exe            00000000014B76A3  Unknown               Unknown  Unknown
wrf.exe            0000000000BB258D  Unknown               Unknown  Unknown
wrf.exe            0000000000BAED79  Unknown               Unknown  Unknown
wrf.exe            0000000000BAE7F8  Unknown               Unknown  Unknown
wrf.exe            0000000000BADD02  Unknown               Unknown  Unknown
wrf.exe            0000000000BADA9E  Unknown               Unknown  Unknown
wrf.exe            0000000000DD5E47  Unknown               Unknown  Unknown
wrf.exe            00000000007A81D6  Unknown               Unknown  Unknown
wrf.exe            00000000006B8424  Unknown               Unknown  Unknown
wrf.exe            0000000000653E19  Unknown               Unknown  Unknown
wrf.exe            0000000000677927  Unknown               Unknown  Unknown
wrf.exe            0000000000674047  Unknown               Unknown  Unknown
wrf.exe            00000000004C9DF7  Unknown               Unknown  Unknown
wrf.exe            000000000047B0A3  Unknown               Unknown  Unknown
wrf.exe            000000000047B057  Unknown               Unknown  Unknown
wrf.exe            000000000047AFEC  Unknown               Unknown  Unknown
libc.so.6          0000003AD001D994  Unknown               Unknown  Unknown
wrf.exe            000000000047AEE9  Unknown               Unknown  Unknown

wrf_2001112400.out shows:

/oceano/gmeteo/WORK/ASNA/WRF/run/SeaWind_N1540_SN/SeaWind_N1540_SN/0029/bin/wrf_wrapper.exe: line 9:  4500 Aborted                 ${0/_wrapper/} $*

Causes are unknown.

[] No error, wrf just stops (¿¡?!)

Increase the debug_level (up to 300) in the &time_control section of namelist.input.
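Assuming a debug_level line already exists in &time_control (otherwise it has to be added by hand), it can be raised quickly with something like:

# Increase WRF verbosity in the run directory before resubmitting
sed -i 's/^\( *debug_level *=\).*/\1 300,/' namelist.input
grep debug_level namelist.input    # verify the change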

If no error shows up yet, run wrf using the debugging version (OMPIchk).
