Known Problems
Page Contents
- [x] qdel: Server could not connect to MOM 476932.ce01.macc.unican.es
- [] Abnormal ending of simulations due to the 'Flerchinger' approximation
- [] Abnormal ending of simulations
- [x] WRF restart files not written
- [] Crash of running simulations due to openMPI problems
- [x] Intel simulations crash with CAM
- [x] SST missing values in coastal lines
- [] p4_error: latest msg from perror: Invalid argument
- [x] p4_error: OOPS: semop lock failed: -1
- [] *** glibc detected *** malloc(): memory corruption
- [x] Missing required environment variable: MPIRUN_RANK
- [] Different wrf.exe from different nodes
- [] mpiexec: Error: poll_or_block_event: tm_poll: tm: no event
- [x] mvapich 'call system()' failed
- [x] mpiexec: Warning: read_ib_one: protocol version 8 not known, but …
- [x] ECMWF ERA40 escena missing data
- [x] Large waiting in GRID-CSIC
- [x] cshell error in wn010
- [x] Stale NFS file handle
- [x] metgrid.exe Segmentation fault
- [] CAM NaN
- [x] p4_error: semget failed for setnum: 12
- [x] P4_GLOBMEMSIZE
- [x] SKINTEMP not found
- [x] WOULD GO OFF TOP: KF_ETA_PARA I,J,DPTHMX,DPMIN 81 78 …
- [x] Metgrid error: Error in ext_pkg_write_field in metgrid.log
- [] forrtl: severe (174): SIGSEGV, segmentation fault occurred
- [] wrf.exe: posixio.c:213: px_pgout: Assertion `*posp == ((off_t)(-1)) …
- [] No error, wrf just stops (¿¡?!)
[]: Unsolved [x]: Solved
[x] qdel: Server could not connect to MOM 476932.ce01.macc.unican.es
Sometimes the PBS daemon (pbs_mom) goes down on a node. To recover, the service has to be stopped and restarted:
[root@ce01 ~]# ssh wn025 'service pbs_mom restart'
This can be done for all nodes at once:
[root@ce01 ~]# cexec 'service pbs_mom status' ************************* macc ************************* --------- wn001--------- pbs_mom (pid 2575) is running... --------- wn002--------- pbs_mom (pid 3061) is running... --------- wn003--------- pbs_mom (pid 2908) is running... --------- wn004--------- ssh(1777) Warning: the RSA host key for 'wn004' differs from the key for the IP address '192.168.202.14' Offending key for IP in /etc/ssh/ssh_known_hosts:8 Matching host key in /root/.ssh/known_hosts:117 ssh(1777) Permission denied, please try again. ssh(1777) Permission denied, please try again. ssh(1777) Permission denied (publickey,password). --------- wn005--------- pbs_mom dead but subsys locked --------- wn006--------- pbs_mom (pid 3002) is running... --------- wn007--------- pbs_mom (pid 29926) is running... --------- wn008--------- pbs_mom dead but subsys locked --------- wn009--------- ssh(1796) Permission denied, please try again. ssh(1796) Permission denied, please try again. ssh(1796) Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password). --------- wn010--------- pbs_mom dead but subsys locked --------- wn011--------- pbs_mom dead but subsys locked --------- wn012--------- pbs_mom dead but subsys locked --------- wn013--------- pbs_mom (pid 6605 6604 6422) is running... --------- wn014--------- pbs_mom (pid 3137) is running... --------- wn015--------- pbs_mom dead but subsys locked --------- wn016--------- pbs_mom dead but subsys locked --------- wn017--------- pbs_mom dead but subsys locked --------- wn018--------- pbs_mom (pid 3284) is running... --------- wn019--------- pbs_mom dead but subsys locked --------- wn020--------- pbs_mom dead but subsys locked --------- wn021--------- pbs_mom dead but subsys locked --------- wn022--------- pbs_mom dead but subsys locked --------- wn023--------- pbs_mom dead but subsys locked --------- wn024--------- pbs_mom (pid 3157) is running... --------- wn025--------- pbs_mom (pid 18308) is running... --------- wn031--------- pbs_mom dead but subsys locked --------- wn032--------- pbs_mom dead but subsys locked --------- wn033--------- pbs_mom dead but subsys locked --------- wn034--------- pbs_mom dead but subsys locked --------- wn035--------- pbs_mom dead but subsys locked --------- wn036--------- pbs_mom dead but subsys locked --------- wn041--------- pbs_mom dead but subsys locked --------- wn042--------- pbs_mom dead but subsys locked --------- wn043--------- pbs_mom dead but subsys locked --------- wn044--------- pbs_mom dead but subsys locked --------- wn045--------- pbs_mom dead but subsys locked --------- wn046--------- pbs_mom dead but subsys locked
Restarting all of them:
[root@ce01 ~]# cexec 'service pbs_mom restart' ************************* macc ************************* --------- wn001--------- Shutting down TORQUE Mom: [ OK ] Starting TORQUE Mom: [ OK ] --------- wn002--------- Shutting down TORQUE Mom: [ OK ] Starting TORQUE Mom: [ OK ] --------- wn003--------- Shutting down TORQUE Mom: [ OK ] Starting TORQUE Mom: [ OK ] --------- wn004--------- ssh(2514) Warning: the RSA host key for 'wn004' differs from the key for the IP address '192.168.202.14' Offending key for IP in /etc/ssh/ssh_known_hosts:8 Matching host key in /root/.ssh/known_hosts:117 ssh(2514) Permission denied, please try again. ssh(2514) Permission denied, please try again. ssh(2514) Permission denied (publickey,password). --------- wn005--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn006--------- Shutting down TORQUE Mom: [ OK ] Starting TORQUE Mom: [ OK ] --------- wn007--------- Shutting down TORQUE Mom: [ OK ] Starting TORQUE Mom: [ OK ] --------- wn008--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn009--------- ssh(2524) Permission denied, please try again. ssh(2524) Permission denied, please try again. ssh(2524) Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password). --------- wn010--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn011--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn012--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn013--------- Shutting down TORQUE Mom: [ OK ] Starting TORQUE Mom: [ OK ] --------- wn014--------- Shutting down TORQUE Mom: [ OK ] Starting TORQUE Mom: [ OK ] --------- wn015--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn016--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn017--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn018--------- Shutting down TORQUE Mom: [ OK ] Starting TORQUE Mom: [ OK ] --------- wn019--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn020--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn021--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn022--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn023--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn024--------- Shutting down TORQUE Mom: [ OK ] Starting TORQUE Mom: [ OK ] --------- wn025--------- Shutting down TORQUE Mom: [ OK ] Starting TORQUE Mom: [ OK ] --------- wn031--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn032--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn033--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn034--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn035--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn036--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn041--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn042--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn043--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn044--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ 
OK ] --------- wn045--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ] --------- wn046--------- Shutting down TORQUE Mom: [FAILED] Starting TORQUE Mom: [ OK ]
Afterwards, the node states are:
[root@ce01 ~]# pbsnodes -l wn002 offline wn003 offline wn004 down,offline wn009 down,offline wn001 offline
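Once the MOMs are running again, nodes that stay marked offline may need their state cleared by hand. A minimal sketch, assuming the TORQUE client tools are available on ce01 (node names are only examples):
[root@ce01 ~]# pbsnodes -c wn001 wn002 wn003   # clear the OFFLINE mark on recovered nodes
[root@ce01 ~]# pbsnodes -l                     # should now list only nodes that are really down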
[] Abnormal ending of simulations due to the 'Flerchinger' approximation
A simulation ends with the following message:
(...) d01 2000-09-13_00:00:00 Input data processed for wrflowinp_d<domain> for domain 1 Flerchinger USEd in NEW version. Iterations= 10 Flerchinger USEd in NEW version. Iterations= 10 Flerchinger USEd in NEW version. Iterations= 10
Related to the topic:
- WRF forum
- phys/module_sf_noahdrv.F code
[] Abnormal ending of simulations
While a simulation was running, the following appeared in 'rsl.error.0000':
(...) Timing for main: time 2001-03-03_05:00:00 on domain 1: 2.94590 elapsed seconds. wrf.exe: posixio.c:213: px_pgout: Assertion `*posp == ((off_t)(-1)) || *posp == lseek(nciop->fd, 0, 1)' failed. Image PC Routine Line Source libc.so.6 0000003AC6830265 Unknown Unknown Unknown libc.so.6 0000003AC6831D10 Unknown Unknown Unknown libc.so.6 0000003AC68296E6 Unknown Unknown Unknown wrf.exe 00000000015E846A Unknown Unknown Unknown wrf.exe 00000000015BD80D Unknown Unknown Unknown wrf.exe 00000000015CC1FE Unknown Unknown Unknown wrf.exe 0000000001571B10 Unknown Unknown Unknown wrf.exe 00000000015708BD Unknown Unknown Unknown wrf.exe 000000000155F149 Unknown Unknown Unknown wrf.exe 000000000155B828 Unknown Unknown Unknown wrf.exe 0000000000BCA9ED Unknown Unknown Unknown wrf.exe 0000000000BC71D9 Unknown Unknown Unknown wrf.exe 0000000000BC6C58 Unknown Unknown Unknown wrf.exe 0000000000BC6162 Unknown Unknown Unknown wrf.exe 0000000000BC5EFE Unknown Unknown Unknown wrf.exe 0000000000DEF177 Unknown Unknown Unknown wrf.exe 00000000007413C2 Unknown Unknown Unknown wrf.exe 00000000006BE487 Unknown Unknown Unknown wrf.exe 00000000006552B9 Unknown Unknown Unknown wrf.exe 000000000067A5B4 Unknown Unknown Unknown wrf.exe 0000000000678591 Unknown Unknown Unknown wrf.exe 00000000004CA59F Unknown Unknown Unknown wrf.exe 000000000047B093 Unknown Unknown Unknown wrf.exe 000000000047B047 Unknown Unknown Unknown wrf.exe 000000000047AFDC Unknown Unknown Unknown libc.so.6 0000003AC681D994 Unknown Unknown Unknown wrf.exe 000000000047AEE9 Unknown Unknown Unknown
Related links:
[x] WRF restart files not written
During the simulation, the rsl.[error/out].[nnnn] files show:
(...) Timing for Writing restart for domain 1: 48.82700 elapsed seconds. (...)
But the restart was never actually written. During the simulation a wrfrst_[.....] file is created, but it is only 32 bytes long and it exists only for those 48.82700 seconds; after that it disappears. Looking at the execution flow (via strace):
(...) open("wrfrst_d01_2000-06-01_11:30:00", O_RDWR|O_CREAT|O_TRUNC, 0666) = 13 fstat(13, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0 fstat(13, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0 lseek(13, 0, SEEK_CUR) = 0 lseek(13, 24, SEEK_SET) = 24 write(13, "\0\0\0\0\0\0\0\0", 8) = 8 lseek(13, 0, SEEK_SET) = 0 (...) write(1, "Timing for Writing restart for d"..., 76) = 76 write(2, "Timing for Writing restart for d"..., 76) = 76 lseek(13, 0, SEEK_CUR) = 0 write(13, "CDF\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32) = 32 close(13) = 0 unlink("wrfrst_d01_2000-06-01_11:00:00") = 0 (...)
It seems to be related to file size limits.
To allow netCDF output bigger than 2 GB, the following variable has to be activated (in compile.bash):
export WRFIO_NCD_LARGE_FILE_SUPPORT=1
With this set, it now works.
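A quick way to check that the large-file (64-bit offset) netCDF format is actually being produced, assuming the netCDF tools are installed (the file name is the one from the strace excerpt above):
ncdump -k wrfrst_d01_2000-06-01_11:30:00    # prints "64-bit offset" instead of "classic"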
[] Crash of running simulations due to openMPI problems
In a non-systematic way, some simulations stopped with the following message:
[lluis@wn033 ~]$ cat /localtmp/wrf4g.20110213111149664772000/log/wrf_2000101006.out [wn033.macc.unican.es:19865] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19864] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19869] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19866] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19867] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19876] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19870] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19874] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19877] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19863] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19871] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19875] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19873] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? 
(ignored) [wn033.macc.unican.es:19868] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19862] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19872] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19872] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19868] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19875] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19862] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19871] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19866] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19873] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19863] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19864] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19876] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? 
(ignored) [wn033.macc.unican.es:19867] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19877] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19869] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19870] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19865] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) [wn033.macc.unican.es:19874] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored) mpiexec: killing job... -------------------------------------------------------------------------- Sorry! You were supposed to get help about: orterun:unclean-exit But I couldn't open the help file: /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/share/openmpi/help-orterun.txt: No such file or directory. Sorry! --------------------------------------------------------------------------
[x] Intel simulations crash with CAM
Simulations crash right at their beginning:
taskid: 0 hostname: wn025.macc.unican.es Quilting with 1 groups of 0 I/O tasks. Namelist dfi_control not found in namelist.input. Using registry defaults for v ariables in dfi_control Namelist tc not found in namelist.input. Using registry defaults for variables in tc Namelist scm not found in namelist.input. Using registry defaults for variables in scm Namelist fire not found in namelist.input. Using registry defaults for variable s in fire Ntasks in X 2, ntasks in Y 4 WRF V3.1.1 MODEL *** CLWRF code enabled ************************************* Parent domain ids,ide,jds,jde 1 50 1 50 ims,ime,jms,jme -4 32 -4 20 ips,ipe,jps,jpe 1 25 1 13 ************************************* DYNAMICS OPTION: Eulerian Mass Coordinate alloc_space_field: domain 1, 26649120 bytes allocated med_initialdata_input: calling input_model_input INPUT LandUse = "USGS" forrtl: severe (174): SIGSEGV, segmentation fault occurred Image PC Routine Line Source wrf.exe 00000000013EF3E1 Unknown Unknown Unknown wrf.exe 00000000013F05A7 Unknown Unknown Unknown wrf.exe 00000000013F1CE8 Unknown Unknown Unknown wrf.exe 00000000011BB44B Unknown Unknown Unknown wrf.exe 0000000000DE008E Unknown Unknown Unknown wrf.exe 0000000000DDAEAD Unknown Unknown Unknown wrf.exe 00000000009AF813 Unknown Unknown Unknown wrf.exe 0000000000690D01 Unknown Unknown Unknown wrf.exe 000000000068DB21 Unknown Unknown Unknown wrf.exe 000000000047BC1B Unknown Unknown Unknown wrf.exe 000000000047B049 Unknown Unknown Unknown wrf.exe 000000000047AFEC Unknown Unknown Unknown libc.so.6 0000003C6421D994 Unknown Unknown Unknown wrf.exe 000000000047AEE9 Unknown Unknown Unknown
where:
[lluis@wn025 run]$ find /lib* -name libc.so.6 /lib/libc.so.6 /lib/i686/nosegneg/libc.so.6 /lib64/libc.so.6
and
[lluis@wn025 run]$ ldd wrf.exe libmpi_f90.so.0 => not found libmpi_f77.so.0 => not found libmpi.so.0 => not found libopen-rte.so.0 => not found libopen-pal.so.0 => not found libdl.so.2 => /lib64/libdl.so.2 (0x0000003c64600000) libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003c65600000) libutil.so.1 => /lib64/libutil.so.1 (0x0000003c65a00000) libm.so.6 => /lib64/libm.so.6 (0x0000003c64e00000) libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003c64a00000) libc.so.6 => /lib64/libc.so.6 (0x0000003c64200000) libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003c66200000) /lib64/ld-linux-x86-64.so.2 (0x0000003c63e00000)
This happens with:
NIN_ra_lw_physics = 3
NIN_ra_sw_physics = 3
This does not happen with:
NIN_ra_lw_physics = 4
NIN_ra_sw_physics = 4
NOTE: This error is also found with a serial compilation. With debug = 1000, the output is:
Namelist dfi_control not found in namelist.input. Using registry defaults for variables in dfi_control Namelist tc not found in namelist.input. Using registry defaults for variables in tc Namelist scm not found in namelist.input. Using registry defaults for variables in scm Namelist fire not found in namelist.input. Using registry defaults for variables in fire WRF V3.1.1 MODEL wrf: calling alloc_and_configure_domain ************************************* Parent domain ids,ide,jds,jde 1 50 1 50 ims,ime,jms,jme -4 55 -4 55 ips,ipe,jps,jpe 1 50 1 50 ************************************* DYNAMICS OPTION: Eulerian Mass Coordinate alloc_space_field: domain 1, 95259880 bytes allocated med_initialdata_input: calling input_model_input (...) INPUT LandUse = "USGS" LANDUSE TYPE = "USGS" FOUND 33 CATEGORIES 2 SEASONS WATER CATEGORY = 16 SNOW CATEGORY = 24 *** SATURATION VAPOR PRESSURE TABLE COMPLETED *** num_months = 13 AEROSOLS: Background aerosol will be limited to bottom 6 model interfaces. reading CAM_AEROPT_DATA forrtl: severe (174): SIGSEGV, segmentation fault occurred Image PC Routine Line Source wrf.exe 000000000130FB51 Unknown Unknown Unknown wrf.exe 0000000001310D17 Unknown Unknown Unknown wrf.exe 0000000001312458 Unknown Unknown Unknown wrf.exe 00000000010DC9BB Unknown Unknown Unknown wrf.exe 0000000000D12F9E Unknown Unknown Unknown wrf.exe 0000000000D0DDBD Unknown Unknown Unknown wrf.exe 0000000000906523 Unknown Unknown Unknown wrf.exe 0000000000608AC1 Unknown Unknown Unknown wrf.exe 00000000006058E1 Unknown Unknown Unknown wrf.exe 0000000000404DEC Unknown Unknown Unknown wrf.exe 0000000000404249 Unknown Unknown Unknown wrf.exe 00000000004041EC Unknown Unknown Unknown libc.so.6 0000003331A1D994 Unknown Unknown Unknown wrf.exe 00000000004040E9 Unknown Unknown Unknown
And the WRF configuration:
[lluis@mar run]$ ldd wrf.exe
libm.so.6 => /lib64/libm.so.6 (0x00000039cac00000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00000039cb000000) libc.so.6 => /lib64/libc.so.6 (0x00000039ca400000) libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003ac2200000) libdl.so.2 => /lib64/libdl.so.2 (0x00000039ca800000) /lib64/ld-linux-x86-64.so.2 (0x00000039ca000000)
The segmentation fault appears after line # 3757 of phys/module_ra_cam_support.F (from WRFV3.1.1).
NOTE: after activating
ulimit -s unlimited
it works!
On the ESCENA domain, the simulation works with CAM ra_lw/sw when a checked compilation is used (SERIAL, in /oceano/gmeteo/WORK/ASNA/WRF/Binaries/3.1.1/A/WRF/icif64/SERIALchk/WRFV3/main/wrf.exe). The following messages are shown in the standard output:
(...) Timing for main: time 2001-11-10_00:02:30 on domain 1: 358.70981 elapsed seconds. forrtl: warning (402): fort: (1): In call to PRE_RADIATION_DRIVER, an array temporary was created for argument #49 forrtl: warning (402): fort: (1): In call to PRE_RADIATION_DRIVER, an array temporary was created for argument #51 forrtl: warning (402): fort: (1): In call to RADIATION_DRIVER, an array temporary was created for argument #134 forrtl: warning (402): fort: (1): In call to RADIATION_DRIVER, an array temporary was created for argument #136 forrtl: warning (402): fort: (1): In call to SURFACE_DRIVER, an array temporary was created for argument #181 forrtl: warning (402): fort: (1): In call to SURFACE_DRIVER, an array temporary was created for argument #183 forrtl: warning (402): fort: (1): In call to PBL_DRIVER, an array temporary was created for argument #87 forrtl: warning (402): fort: (1): In call to PBL_DRIVER, an array temporary was created for argument #89 forrtl: warning (402): fort: (1): In call to CUMULUS_DRIVER, an array temporary was created for argument #21 forrtl: warning (402): fort: (1): In call to CUMULUS_DRIVER, an array temporary was created for argument #23 forrtl: warning (402): fort: (1): In call to FDDAGD_DRIVER, an array temporary was created for argument #58 forrtl: warning (402): fort: (1): In call to FDDAGD_DRIVER, an array temporary was created for argument #60 forrtl: warning (402): fort: (1): In call to MICROPHYSICS_DRIVER, an array temporary was created for argument #55 forrtl: warning (402): fort: (1): In call to MICROPHYSICS_DRIVER, an array temporary was created for argument #57 forrtl: warning (402): fort: (1): In call to DIAGNOSTIC_OUTPUT_CALC, an array temporary was created for argument #20 forrtl: warning (402): fort: (1): In call to DIAGNOSTIC_OUTPUT_CALC, an array temporary was created for argument #22 Timing for main: time 2001-11-10_00:05:00 on domain 1: 32.11490 elapsed seconds. (...)
- On phys/module_radiation_driver.F, subroutine pre_radiation_driver, arguments #49, 51 are: i_end, j_end
- On phys/module_radiation_driver.F, subroutine radiation_driver, arguments #134, 136 are: i_end, j_end
- On phys/module_surface_driver.F, subroutine surface_driver, arguments #181, 183 are: i_end, j_end
- On phys/module_pbl_driver.F, subroutine pbl_driver, arguments #87, 89 are: i_end, j_end
- On phys/module_cumulus_driver.F, subroutine cumulus_driver, arguments #21, 23 are: i_end, j_end
- On phys/module_fddagd_driver.F, subroutine fddagd_driver, arguments #58, 60 are: i_end, j_end
- On phys/module_microphysics_driver.F, subroutine microphysics_driver, arguments #55, 57 are: i_end, j_end
- On phys/module_diagnostics.F, subroutine diagnostic_output_calc, arguments #20, 22 are: i_end, j_end
Subroutine definitions:
INTEGER, DIMENSION(num_tiles), INTENT(IN) :: &
     &  i_start,i_end,j_start,j_end
Definition in frame/module_domain_type.F, the WRF derived type for the domain TYPE(domain):
TYPE domain
(...)
INTEGER,POINTER :: i_start(:),i_end(:)
INTEGER,POINTER :: j_start(:),j_end(:)
(...)
INTEGER :: num_tiles ! taken out of namelist 20000908
(...)
Some information about WRF tiles: http://www.mmm.ucar.edu/wrf/WG2/topics/settiles/
In frame/module_tiles.F, the subroutines set_tiles1, set_tiles2 and set_tiles3 contain:
IF ( ASSOCIATED(grid%i_start) ) THEN ; DEALLOCATE( grid%i_start ) ; NULLIFY( grid%i_start ) ; ENDIF
IF ( ASSOCIATED(grid%i_end) ) THEN ; DEALLOCATE( grid%i_end ) ; NULLIFY( grid%i_end ) ; ENDIF
IF ( ASSOCIATED(grid%j_start) ) THEN ; DEALLOCATE( grid%j_start ) ; NULLIFY( grid%j_start ) ; ENDIF
IF ( ASSOCIATED(grid%j_end) ) THEN ; DEALLOCATE( grid%j_end ) ; NULLIFY( grid%j_end ) ; ENDIF
ALLOCATE(grid%i_start(num_tiles))
ALLOCATE(grid%i_end(num_tiles))
ALLOCATE(grid%j_start(num_tiles))
ALLOCATE(grid%j_end(num_tiles))
grid%max_tiles = num_tiles
The recommended WRF compilation (Intel, distributed memory) is:
(...)
DMPARALLEL = 1
OMPCPP = # -D_OPENMP
OMP = # -openmp -fpp -auto
SFC = ifort
SCC = icc
DM_FC = mpif90 -f90=$(SFC)
DM_CC = mpicc -cc=$(SCC) -DMPI2_SUPPORT
FC = $(DM_FC)
CC = $(DM_CC) -DFSEEKO64_OK
LD = $(FC)
RWORDSIZE = $(NATIVE_RWORDSIZE)
PROMOTION = -i4
ARCH_LOCAL = -DNONSTANDARD_SYSTEM_FUNC
CFLAGS_LOCAL = -w -O3 -ip
LDFLAGS_LOCAL = -ip
CPLUSPLUSLIB =
ESMF_LDFLAG = $(CPLUSPLUSLIB)
FCOPTIM = -O3
FCREDUCEDOPT = $(FCOPTIM)
FCNOOPT = -O0 -fno-inline -fno-ip
FCDEBUG = # -g $(FCNOOPT) -traceback
FORMAT_FIXED = -FI
FORMAT_FREE = -FR
FCSUFFIX =
BYTESWAPIO = -convert big_endian
FCBASEOPTS = -w -ftz -align all -fno-alias -fp-model precise $(FCDEBUG) $(FORMAT_FREE) $(BYTESWAPIO)
MODULE_SRCH_FLAG =
TRADFLAG = -traditional
CPP = /lib/cpp -C -P
AR = ar
(...)
Simply changing the '-O3' optimization to '-O2' makes it work properly. (It also works with '-O1', but that makes the simulations slower.)
LAST NEWS: it works just by adding a new compilation flag, '-heap-arrays', which means:
-heap-arrays [size]
-no-heap-arrays
Puts automatic arrays and arrays created for temporary computations on the heap instead of the stack.
Architectures: IA-32, Intel® 64, IA-64 architectures
Default: -no-heap-arrays. The compiler puts automatic arrays and arrays created for temporary computations in temporary storage in the stack storage area.
Description: This option puts automatic arrays and arrays created for temporary computations on the heap instead of the stack. If heap-arrays is specified and size is omitted, all automatic and temporary arrays are put on the heap. If 10 is specified for size, all automatic and temporary arrays larger than 10 KB are put on the heap.
It has been added in configure.wrf just by adding the flag:
(...)
CFLAGS_LOCAL = -w -O3 -heap-arrays -ip
(...)
FCOPTIM = -O3 -heap-arrays
A complete discussion thread on Intel's forum is available at http://software.intel.com/en-us/forums/showthread.php?t=72109&p=1#146890
[] STOP of simulations due to library problems
Some executions stop giving this error message (repeated once per process):
/oceano/gmeteo/WORK/ASNA/WRF/run/scne2a/scne2a/0016/bin/wrf.exe: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory
No rsl.[error/out].[nnnn] file is written.
Running ldd on 'wrf.exe' gives the same result on both failing nodes (wn031 and wn041):
[lluis@wn031 ~]$ ldd /oceano/gmeteo/WORK/ASNA/WRF/Binaries/3.1.1/A/CLW//gcgf64/OMPI/WRFV3/main/wrf.exe
libmpi_f90.so.0 => not found libmpi_f77.so.0 => not found libmpi.so.0 => not found libopen-rte.so.0 => not found libopen-pal.so.0 => not found libdl.so.2 => /lib64/libdl.so.2 (0x0000003a9d800000) libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003a9e400000) libutil.so.1 => /lib64/libutil.so.1 (0x0000003aa0400000) libgfortran.so.3 => not found libm.so.6 => /lib64/libm.so.6 (0x0000003a9dc00000) libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003a9f400000) libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003a9e000000) libc.so.6 => /lib64/libc.so.6 (0x0000003a9d400000) /lib64/ld-linux-x86-64.so.2 (0x0000003a9d000000)
gfortran is installed:
[lluis@wn041 ~]$ which gfortran
/usr/bin/gfortran
[lluis@wn041 ~]$ ldd /usr/bin/gfortran
libc.so.6 => /lib64/libc.so.6 (0x0000003a9d400000) /lib64/ld-linux-x86-64.so.2 (0x0000003a9d000000)
On a working node, ldd gives:
[lluis@wn010 ~]$ ldd /oceano/gmeteo/WORK/ASNA/WRF/Binaries/3.1.1/A/CLW//gcgf64/OMPI/WRFV3/main/wrf.exe
libmpi_f90.so.0 => not found libmpi_f77.so.0 => not found libmpi.so.0 => not found libopen-rte.so.0 => not found libopen-pal.so.0 => not found libdl.so.2 => /lib64/libdl.so.2 (0x0000003cd0c00000) libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003cd3000000) libutil.so.1 => /lib64/libutil.so.1 (0x0000003cd3400000) libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x00002b5f55924000) libm.so.6 => /lib64/libm.so.6 (0x0000003cd1400000) libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003cd2800000) libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003cd1000000) libc.so.6 => /lib64/libc.so.6 (0x0000003cd0800000) /lib64/ld-linux-x86-64.so.2 (0x0000003cd0400000)
The gfortran ldd output is the same on 'wn010' and on 'wn031/041'.
[] STOP of simulations due to net delays
On execution of wrf.exe, simulations stop with the following messages (with openMPI):
rsl.error.0004:
taskid: 4 hostname: wn017.macc.unican.es[wn017.macc.unican.es][[20060,1],4][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Invalid argument (22)
rsl.error.0005:
taskid: 5 hostname: wn017.macc.unican.es[wn017.macc.unican.es][[20060,1],5][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
rsl.error.0006:
taskid: 6 hostname: wn017.macc.unican.es[wn017.macc.unican.es][[20060,1],6][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
rsl.error.0007:
[wn017.macc.unican.es][[20060,1],7][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
We are experiencing some network problems, with an important drop in the system response of the cluster machine (the 'dinamic' queue).
[x] In real STOP: At line 703 of file module_initialize_real.f90
During the execution of real, the following appears:
Namelist dfi_control not found in namelist.input. Using registry defaults for variables in dfi_control Namelist tc not found in namelist.input. Using registry defaults for variables in tc Namelist scm not found in namelist.input. Using registry defaults for variables in scm Namelist fire not found in namelist.input. Using registry defaults for variables in fire REAL_EM V3.1.1 PREPROCESSOR ************************************* Parent domain ids,ide,jds,jde 1 167 1 139 ims,ime,jms,jme -4 172 -4 144 ips,ipe,jps,jpe 1 167 1 139 ************************************* DYNAMICS OPTION: Eulerian Mass Coordinate alloc_space_field: domain 1 , 804753800 bytes allocated Time period # 1 to process = 2025-01-01_00:00:00. Time period # 2 to process = 2025-01-01_06:00:00. (...) Time period # 56 to process = 2025-01-14_18:00:00. Time period # 57 to process = 2025-01-15_00:00:00. Total analysis times to input = 57. ----------------------------------------------------------------------------- Domain 1: Current date being processed: 2025-01-01_00:00:00.0000, which is loop # 1 out of 57 configflags%julyr, %julday, %gmt: 2025 1 0.000000 d01 2025-01-01_00:00:00 Timing for input 0 s. d01 2025-01-01_00:00:00 flag_soil_layers read from met_em file is 1 At line 703 of file module_initialize_real.f90 Fortran runtime error: End of record
The error messages
At line 703 of file module_initialize_real.f90
Fortran runtime error: End of record
are gfortran run-time errors (http://gcc.gnu.org/ml/fortran/2005-02/msg00394.html). This occurs because the input data does not have PMSL/PSFC! From ungrib.log:
(...) Inventory for date = 2025-01-01 00:00:00 PRES HGT TT UU VV RH SOILHGT LANDSEA PSFC PMSL SST SKINTEMP SNOW ST000007 ST007028 ST028100 ST100255 SM000007 SM007028 SM028100 SM100255 ------------------------------------------------------------------------------- 2001.1 O O O O O O O O O O O X O O O O O O O O 2001.0 O X X X X X X O O O X O O O O O O O O O 1000.0 X X X X X 925.0 X X X X X 850.0 X X X X X 700.0 X X X X X 500.0 X X X X X 300.0 X X X X X 200.0 X X X X X 100.0 X X X X X 50.0 X X X X X ------------------------------------------------------------------------------- (...)
Removing PSFC/MSLP from working input data, the error is reproduced!
[x] Execution error in WRF
On GRIDUI this error appears in different experiments, scnc1a and scnc1b.
In /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/log/rsl_wrf/rsl.error.0000, simulation started at 19750423000000:
(...) Timing for main: time 1975-05-13_22:32:30 on domain 1: 2.23900 elapsed seconds. [gcsic019wn:19507] *** Process received signal *** [gcsic019wn:19507] Signal: Segmentation fault (11) [gcsic019wn:19507] Signal code: Address not mapped (1) [gcsic019wn:19507] Failing at address: 0xfffffffc01fd0668 [gcsic019wn:19507] [ 0] /lib64/libpthread.so.0 [0x3df980e930] [gcsic019wn:19507] [ 1] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_ra_cam_support__estblf+0x56) [0x1411806] [gcsic019wn:19507] [ 2] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_ra_cam_support__aqsat+0x189) [0x14119b9] [gcsic019wn:19507] [ 3] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_ra_cam__radctl+0x151f) [0x118ee8f] [gcsic019wn:19507] [ 4] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_ra_cam__camrad+0x584d) [0x1196ded] [gcsic019wn:19507] [ 5] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_radiation_driver__radiation_driver+0x4d8b) [0xcddd3b] [gcsic019wn:19507] [ 6] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_first_rk_step_part1__first_rk_step_part1+0x3edb) [0xdb734b] [gcsic019wn:19507] [ 7] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(solve_em_+0x1c89f) [0x95939f] [gcsic019wn:19507] [ 8] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(solve_interface_+0x6c8) [0x635e30] [gcsic019wn:19507] [ 9] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_integrate__integrate+0x1b0) [0x494444] [gcsic019wn:19507] [10] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_wrf_top__wrf_run+0x22) [0x46ef12] [gcsic019wn:19507] [11] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(MAIN__+0x3a) [0x46e5ca] [gcsic019wn:19507] [12] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(main+0xe) [0x14e9cae] [gcsic019wn:19507] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3df901d994] [gcsic019wn:19507] [14] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe [0x46e4d9] [gcsic019wn:19507] *** End of error message ***
The same appears in /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/log/rsl_wrf/rsl.error.0000, simulation started at 19500507000000:
(...) Timing for main: time 1950-05-10_16:17:30 on domain 1: 4.13800 elapsed seconds. Timing for [gcsic116wn:30182] *** Process received signal *** [gcsic116wn:30182] Signal: Segmentation fault (11) [gcsic116wn:30182] Signal code: Address not mapped (1) [gcsic116wn:30182] Failing at address: 0xfffffffc01fd0668 [gcsic116wn:30182] [ 0] /lib64/libpthread.so.0 [0x3dc780e930] [gcsic116wn:30182] [ 1] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_ra_cam_support__estblf+0x56) [0x1411806] [gcsic116wn:30182] [ 2] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_ra_cam_support__aqsat+0x110) [0x1411940] [gcsic116wn:30182] [ 3] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_ra_cam__radctl+0x151f) [0x118ee8f] [gcsic116wn:30182] [ 4] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_ra_cam__camrad+0x584d) [0x1196ded] [gcsic116wn:30182] [ 5] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_radiation_driver__radiation_driver+0x4d8b) [0xcddd3b] [gcsic116wn:30182] [ 6] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_first_rk_step_part1__first_rk_step_part1+0x3edb) [0xdb734b] [gcsic116wn:30182] [ 7] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(solve_em_+0x1c89f) [0x95939f] [gcsic116wn:30182] [ 8] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(solve_interface_+0x6c8) [0x635e30] [gcsic116wn:30182] [ 9] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_integrate__integrate+0x1b0) [0x494444] [gcsic116wn:30182] [10] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_wrf_top__wrf_run+0x22) [0x46ef12] [gcsic116wn:30182] [11] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(MAIN__+0x3a) [0x46e5ca] [gcsic116wn:30182] [12] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(main+0xe) [0x14e9cae] [gcsic116wn:30182] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3dc701d994] [gcsic116wn:30182] [14] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe [0x46e4d9] [gcsic116wn:30182] *** End of error message ***
This happens because WPS uses the 'iDirectionIncrementInDegrees' attribute, but cdo (used to transform the input files) does not (it gives a 'MISSING' value). In order to prevent this error and others (not enough decimals in the coding of the value in the input grib files), a new namelist option has been introduced in the 'ungrib' section of namelist.wps:
is_global = 1,
With this value it is stated that the input files are global, so the increment in degrees in the 'i-direction' is computed from the range of the x dimension of the input files. To allow this, some modifications have to be made in several modules of the WPS/ungrib source code.
NOTE: This option is only valid if the input data lie on a regular grid in the x direction.
- ungrib/src/rd_grib1.F
 62       SUBROUTINE rd_grib1(IUNIT, gribflnm, level, field, hdate, &
 63            ierr, iuarr, debug_level, is_g)
(...)
 77 ! L. Fita. UC. August 2010
 78       INTEGER :: is_g
(...)
372 ! L. Fita. UC. August 2010
373       IF (is_g == 1) THEN
374         PRINT *,"*********** L. Fita. UC . August 2010 ***********"
375         PRINT *,"*** Assuming global regular grid.             ***"
376         PRINT *,"*** Computing 'dx' from number of points 'Nx' ***"
377         PRINT *,"*************************************************"
378         map%dx = 360.0 / map%nx
379         PRINT *,'Nx = ',map%nx,' dx:',map%dx
380       ELSE
381         map%dx = ginfo(8)
382       ENDIF
(...)
423 ! L. Fita. UC. August 2010
424       IF (is_g == 1) THEN
425         PRINT *,"*********** L. Fita. UC . August 2010 ***********"
426         PRINT *,"*** Assuming global regular grid.             ***"
427         PRINT *,"*** Computing 'dx' from number of points 'Nx' ***"
428         PRINT *,"*************************************************"
429         map%dx = 360.0 / map%nx
430         PRINT *,'Nx = ',map%nx,' dx:',map%dx
431       ELSE
432         map%dx = ginfo(8)
433       ENDIF
- ungrib/src/read_namelist.F
  1       subroutine read_namelist(hstart, hend, delta_time, ntimes,&
  2            ordered_by_date, debug_level, out_format, prefix, is_global)
  3 ! L. Fita. UC. August 2010
  4 ! Adding new 'namelist.wps' value in '&ungrib' section: is_global (0, No; 1,
  5 ! Yes [default 0]).
  6 ! NOTE: This modification is only useful for global GRIBs with a regular
  7 ! longitude distribution
  8 !
  9 ! EXPLANATION:
 10 ! In some global files, grid information is not correctly extracted and/or
 11 ! they could not be exactly fitted to an entire earth. By this modification,
 12 ! grid spacing in the x direction is computed from the number of grid points in
 13 ! this direction
(...)
 58 ! L. Fita. UC. August 2010
 59       INTEGER :: is_global
(...)
 72            ordered_by_date, prefix, is_global
- ungrib/src/ungrib.F
 74 ! L. Fita. UC 2010 August
 75       INTEGER :: is_global
(...)
 97       call read_namelist(hstart, hend, interval, ntimes, &
 98            ordered_by_date, debug_level, out_format, prefix, is_global)
(...)
207       call rd_grib1(nunit1, gribflnm, level, field, &
208            hdate, ierr, iuarr, debug_level, is_global)
The following will appear during the ungrib.exe execution:
*** Starting program ungrib.exe *** Start_date = 1975-07-16_00:00:00 , End_date = 1975-07-30_00:00:00 output format is WPS Path to intermediate files is ./ ungrib - grib edition num 1 *********** L. Fita. UC . August 2010 *********** *** Assuming global regular grid. *** *** Computing 'dx' from number of points 'Nx' *** ************************************************* Nx = 128 dx: 2.812500 *********** L. Fita. UC . August 2010 *********** *** Assuming global regular grid. *** *** Computing 'dx' from number of points 'Nx' *** ************************************************* Nx = 128 dx: 2.812500 (...)
NOTE: With data from CNRM in the period 1950-1970 the error is still there...
[x] SST missing values in coastal lines
Along coastlines, SST is badly interpolated. This is fixed by changing how the SST interpolation is done in METGRID.TBL (thanks to Dr. Priscilla A. Mooney, National University of Ireland, Maynooth, Ireland):
========================================
name=SST
  interp_option=sixteen_pt+four_pt+wt_average_4pt+search
  missing_value=-1e+30
  interp_mask=LANDSEA(1)
  masked=land
  fill_missing=0.
  flag_in_output=FLAG_SST
========================================
[] p4_error: latest msg from perror: Invalid argument
The simulation stops. The message appears at the first time step after opening a 'wrfrst' file.
[x] p4_error: OOPS: semop lock failed: -1
The simulation stopped. This is the same issue as 'p4_error: semget failed for setnum' below.
From ce01, run:
cexec /opt/mpich/gnu/sbin/cleanipcs
[] *** glibc detected *** malloc(): memory corruption
The simulation stopped. In some rsl.error.00[nn] files the following line appears:
- rsl.error.0006:
(...) *** glibc detected *** malloc(): memory corruption: 0x000000000b215c50 ***
- rsl.error.0013:
(...) *** glibc detected *** malloc(): memory corruption: 0x000000000af50bb0 ***
- C-language related posts:
- WRF related post:
The error appeared during the CLWRF implementation. Some nasty numerical issues must have been happening. Once the coding errors were repaired, the error disappeared... (luckily?)
[x] Missing required environment variable: MPIRUN_RANK
WRF real.exe stopped with message:
PMGR_COLLECTIVE ERROR: unitialized MPI task: Missing required environment variable: MPIRUN_RANK mpiexec: Warning: task 0 exited with status 1.
An incorrect version of mpiexec is being used. You must run an adequate mpiexec version; look at the path of the mpiexec that is picked up:
which mpiexec
[] Different wrf.exe from different nodes
From wn001 to wn024:
> ls -la /oceano/gmeteo/DATA/WRF/WRF_bin/3.1/WRF4G/MVAPICH/WRFV3/main/
... -rw-rw---- 1 lluis gmeteo 62797 May 19 13:28 wrf_ESMFMod.F -rwxr-x--x 1 lluis gmeteo 21147307 May 26 14:58 wrf.exe -rw-rw---- 1 lluis gmeteo 918 May 19 13:28 wrf.F ...
From wn025:
> ls -la /oceano/gmeteo/DATA/WRF/WRF_bin/3.1/WRF4G/MVAPICH/WRFV3/main/
... -rw-rw---- 1 lluis gmeteo 62797 May 19 13:28 wrf_ESMFMod.F -rwxr-x--x 0 lluis gmeteo 21147057 May 25 17:39 wrf.exe -rw-rw---- 1 lluis gmeteo 918 May 19 13:28 wrf.F ...
Differences in hard-link count (see ls and Hard_Link), in date and in size!? During a simulation, each node was running a different wrf.exe!
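A quick way to confirm that nodes really see different binaries (a minimal sketch; the path is the one listed above, node names are examples):
for n in wn001 wn024 wn025; do
  echo "== $n =="
  ssh $n md5sum /oceano/gmeteo/DATA/WRF/WRF_bin/3.1/WRF4G/MVAPICH/WRFV3/main/wrf.exe
done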
The problem was 'solved' by rebooting wn025.
[] mpiexec: Error: poll_or_block_event: tm_poll: tm: no event
A second attempt to run does not give this error ?¿!¡ Probably there was no memory/space left on the nodes (bad ending of a previous simulation).
[x] mvapich 'call system()' failed
When WRF4G is used, the simulation stopped when the second output file started to be written. (Probably due to $WRFGEL_SCRIPT?)
See comments:
- http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2006-October/000394.html
- http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2008-November/002041.html
And user guide http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html#x1-350007.1.2
The Linux kernel version is old. It is recommended that the kernel be at version 2.6.16 or newer:
Linux wn010.macc.unican.es 2.6.9-78.0.13.EL.cernsmp #1 SMP Mon Jan 19 14:00:58 CET 2009 x86_64 x86_64 x86_64 GNU/Linux
and the MVAPICH/OFED version:
>mpichversion MPICH Version: 1.2.7 MPICH Release date: $Date: 2005/06/22 16:33:49$ MPICH Patches applied: none MPICH configure: --with-device=ch_gen2 --with-arch=LINUX -prefix=/software/ScientificLinux/4.6/mvapich/1.1/pgi_7.1-6_gcc --with-romio --without-mpe -lib=-L/usr/lib64 -Wl,-rpath=/usr/lib64 -libverbs -libumad -lpthread MPICH Device: ch_gen2
The problem is solved for the moment by declaring a new environment variable:
export IBV_FORK_SAFE=1
[x] mpiexec: Warning: read_ib_one: protocol version 8 not known, but might still work.
Error message when executing mpiexec:
mpiexec: Warning: read_ib_one: protocol version 8 not known, but might still work. mpiexec: Error: read_ib_one: mixed version executables (6 and 8), no hope.
This error message appears when a wrong version of mpiexec is used. The correct one must be indicated: /software/ScientificLinux/4.6/mpiexec/mpiexec
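A minimal check, using the path mentioned above:
which mpiexec                                               # shows which build is first in PATH
/software/ScientificLinux/4.6/mpiexec/mpiexec ./wrf.exe     # or call the correct build explicitly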
[x] ECMWF ERA40 escena missing data
The ERA40 data downloaded for the escena domain in /oceano/gmeteo/DATA/ECMWF/ERA40/escena is incomplete. Affected years: 1968, 1969, 1971 and 1979.
[x] Large waiting in GRID-CSIC
Jobs wait for more than one day in IFCA GRID-CSIC when requesting nodes=[N]:ppn=[M] (N=2, M=8).
- In EGEEUI01, nodes can be occupied by single-core jobs, which makes it difficult for node-exclusive jobs to start running. It is more adequate to submit jobs requesting the total number of cores, without demanding exclusivity of a physical machine (the EGEEUI01 cluster has 8-core nodes).
- Changes in wrf_AUTOlauncher_iteration.bash now request cores as [N]*[M], without the mpiexec -npernode [M] line in [template].job. A new template has been created: MPI_job-EGEEUI01.pbs
#!/bin/bash (-)
### Job name
#PBS -N @JOBnameSIM@
### Queue name
#PBS -q lmeteo
### Dependency
#PBS -W depend=afterany:@IDpbs@
### Total number of processes
#PBS -l nodes=@Nnodes@
# This job's working directory
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo This jobs runs on the following processors:
echo `cat $PBS_NODEFILE`
##
# Running WRF
##
export OMP_NUM_THREADS=@Ntrh@
echo "Numero de Threads: $OMP_NUM_THREADS"
echo "Numero de Jobs MPI: $Nprocess"
mpiexec ./wrf.exe
This can only work if 'nodes' does not refer to an entire physical machine; it must refer to a CPU (or core); see the sketch after the links below. More information in:
- http://www.clusterresources.com/torquedocs21/2.1jobsubmission.shtml#resources
- http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml#j
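A sketch of the two request styles discussed above (PBS directives only; 2 nodes of 8 cores versus 16 cores anywhere):
### waits until two whole 8-core machines are free (long queue times)
#PBS -l nodes=2:ppn=8
### asks for 16 cores wherever they are free (usually scheduled much sooner)
#PBS -l nodes=16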
[x] cshell error in wn010
On wn010 a systematic csh error appears just by opening a csh terminal:
setenv: Too many arguments
A problem in a csh profile has been repaired.
[x] Stale NFS file handle
In IFCA GRID-CSIC, wrf.exe runs into stale NFS file handles (for the BIGescena domain):
/var/spool/pbs/mom_priv/jobs/1070626.tor.SC: line 400: 22711 Bus error mpiexec -npernode 8 ./wrf.exe rm: cannot remove `wrf.exe': Stale NFS file handle rm: cannot remove `*.TBL': Stale NFS file handle rm: cannot remove `*_DATA*': Stale NFS file handle rm: cannot remove `met_em*': Stale NFS file handle rm: cannot remove `wrfbdy*': Stale NFS file handle rm: cannot remove `wrfinput*': Stale NFS file handle /var/spool/pbs/mom_priv/jobs/1070626.tor.SC: line 345: /gpfs/ifca.es/meteo/forest//bats/change_in_file.bash: Stale NFS file handle /var/spool/pbs/mom_priv/jobs/1070626.tor.SC: line 356: cd: /gpfs/ifca.es/meteo/SCRATCH/BIGescena/1970_1975Restart28d/simulations/1970010100_1970012900: Stale NFS file handle (...)
Some errors occurred on the NFS server.
[x] metgrid.exe Segmentation fault
When metgrid.exe is running, a segmentation fault appears (in IFCA GRID-CSIC, for the Africa_25km domain). From [job].e[nnnnn]:
/var/spool/pbs/mom_priv/jobs/1073948.tor.SC: line 195: 19831 Segmentation fault
The global analyses used were defined only for a European region.
[] CAM NaN
module_ra_cam_support.F generates NaN outputs at a given time step (around the 350th julian day of 1996 and 2001: 1996/XII/15 and 2001/XII/16). The rsl.out.[nnnn] files then grow as large as the hard disk (because of the output written to them):
vert_interpolate: mmr < 0, m, col, lev, mmr 2 2 1 NaN vert_interpolate: aerosol(k),(k+1) 1.0000000116860974E-007 0.000000000000000 vert_interpolate: pint(k+1),(k) NaN NaN n,c 1 1
What has been done:
- FATAL_ERROR signal: a call wrf_error_fatal ('Error of computation') line has been introduced in the WRFV3/phys/module_ra_cam_support.F file
- isnand(): this internal PGI function has been added in several places of module_ra_cam_support.F and module_ra_cam.F, making it possible to locate where the first 'NaN' values appear
This is possibly a WRFv3.0.1.1 bug related to the temporal interpolation of CO2 concentrations at 15/XII of any year (the change of the monthly value).
[x] p4_error: semget failed for setnum: 12
Information sources:
- http://www.mcs.anl.gov/research/projects/mpi/mpich1/docs/mpichman-chp4/node133.htm
- https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2008-May/030470.html
This error means that there is not enough shared memory available to allocate a new memory segment for interprocess communication. Often some extra memory segments are left over from a crash or programming error of a previous job and need to be cleaned up. There is a script called cleanipcs that will remove all of your left-over IPCs. Users are responsible for cleaning up extra shared memory segments after a crash or when their job is complete.
You can use /usr/bin/ipcs to check the shared memory state on one node (example output for ssh wn013 ipcs):
------ Shared Memory Segments -------- key shmid owner perms bytes nattch status 0x00000000 0 root 644 72 2 0x00000000 32769 root 644 16384 2 0x00000000 65538 root 644 280 2 0x00000000 2654211 lluis 600 33554432 0 ------ Semaphore Arrays -------- key semid owner perms nsems 0x000000a7 0 root 666 1 0x00000000 11337729 lluis 600 10 0x00000000 11370498 lluis 600 10 0x00000000 11403267 lluis 600 10 0x00000000 11436036 lluis 600 10 0x00000000 11468805 lluis 600 10 0x00000000 11501574 lluis 600 10 0x00000000 11534343 lluis 600 10 0x00000000 11567112 lluis 600 10 0x00000000 11599881 lluis 600 10 0x00000000 11632650 lluis 600 10 0x00000000 11665419 lluis 600 10 0x00000000 11698188 lluis 600 10 0x00000000 11730957 lluis 600 10 0x00000000 11763726 lluis 600 10 0x00000000 11796495 lluis 600 10 0x00000000 11829264 lluis 600 10 0x00000000 11862033 lluis 600 10 0x00000000 11894802 lluis 600 10 0x00000000 11927571 lluis 600 10 0x00000000 11960340 lluis 600 10 0x00000000 11993109 lluis 600 10 0x00000000 12025878 lluis 600 10 0x00000000 12058647 lluis 600 10 0x00000000 14352408 lluis 600 10 0x00000000 14385177 lluis 600 10 ------ Message Queues -------- key msqid owner perms used-bytes messages [lluis@wn010 WRFV3]$ ssh wn013 ipcs ------ Shared Memory Segments -------- key shmid owner perms bytes nattch status 0x00000000 0 root 644 72 2 0x00000000 32769 root 644 16384 2 0x00000000 65538 root 644 280 2 0x00000000 2654211 lluis 600 33554432 0 ------ Semaphore Arrays -------- key semid owner perms nsems 0x000000a7 0 root 666 1 0x00000000 11337729 lluis 600 10 0x00000000 11370498 lluis 600 10 0x00000000 11403267 lluis 600 10 0x00000000 11436036 lluis 600 10 0x00000000 11468805 lluis 600 10 0x00000000 11501574 lluis 600 10 0x00000000 11534343 lluis 600 10 0x00000000 11567112 lluis 600 10 0x00000000 11599881 lluis 600 10 0x00000000 11632650 lluis 600 10 0x00000000 11665419 lluis 600 10 0x00000000 11698188 lluis 600 10 0x00000000 11730957 lluis 600 10 0x00000000 11763726 lluis 600 10 0x00000000 11796495 lluis 600 10 0x00000000 11829264 lluis 600 10 0x00000000 11862033 lluis 600 10 0x00000000 11894802 lluis 600 10 0x00000000 11927571 lluis 600 10 0x00000000 11960340 lluis 600 10 0x00000000 11993109 lluis 600 10 0x00000000 12025878 lluis 600 10 0x00000000 12058647 lluis 600 10 0x00000000 14352408 lluis 600 10 0x00000000 14385177 lluis 600 10 ------ Message Queues -------- key msqid owner perms used-bytes messages
Use the following command to clean up all memory segments owned by your user id on a cluster:
cexec /opt/mpich/gnu/sbin/cleanipcs
Or node by node (be careful not to run the script on any node where one of your simulations is still running correctly!):
ssh wn[NNN] /software/ScientificLinux/4.6/mpich/1.2.7p1/pgi_7.1-6_gcc/sbin/cleanipcs
After that (again on wn013):
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root       644        72         2
0x00000000 32769      root       644        16384      2
0x00000000 65538      root       644        280        2

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x000000a7 0          root       666        1

------ Message Queues --------
key        msqid      owner      perms      used-bytes  messages
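If cleanipcs is not available on a node, a generic alternative is to remove only the IPC resources owned by your own user with the standard ipcs/ipcrm tools. A minimal sketch, not specific to this cluster (the same warning about running simulations applies):

#!/bin/bash
# Remove all shared memory segments and semaphore arrays owned by the current user.
# Run it only on nodes where none of your jobs are still running!
for id in $(ipcs -m | awk -v u="$USER" '$3 == u {print $2}'); do
    ipcrm -m "$id"
done
for id in $(ipcs -s | awk -v u="$USER" '$3 == u {print $2}'); do
    ipcrm -s "$id"
done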
[x] P4_GLOBMEMSIZE
There is not enough shared memory for the MPICH processes of the simulation. The error message looks like:
p3_15324: (1.777344) xx_shmalloc: returning NULL; requested 262192 bytes
p3_15324: (1.777344) p4_shmalloc returning NULL; request = 262192 bytes
You can increase the amount of memory by setting the environment variable P4_GLOBMEMSIZE (in bytes); the current size is 4194304
p3_15324: p4_error: alloc_p4_msg failed: 0
This is a typical error for simulations with domains as large as the Europe_10 and BIGescena domains. The default value is 4 MB (4194304 bytes).
Increase the value, for example, to (see also the job-script sketch after this list):
- 32 MB export P4_GLOBMEMSIZE=33554432
- 64 MB export P4_GLOBMEMSIZE=67108864
- 128 MB export P4_GLOBMEMSIZE=134217728
- 256 MB export P4_GLOBMEMSIZE=268435456
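As an illustration, the variable is usually exported in the job script before launching wrf.exe. A minimal PBS sketch, in which the job name, resources and mpirun invocation are only examples and must be adapted to the actual installation:

#!/bin/bash
#PBS -N wrf_bigescena            # illustrative job name
#PBS -l nodes=4:ppn=2            # illustrative resources

# Give the MPICH ch_p4 device 128 MB of shared memory instead of the 4 MB default
export P4_GLOBMEMSIZE=134217728

cd $PBS_O_WORKDIR
mpirun -np 8 -machinefile $PBS_NODEFILE ./wrf.exe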
[x] SKINTEMP not found
ERA40 ECMWF files use a different coding of the variables, so the following modification of Vtable.ECMWF is needed:
Original line
34 | 1 | 0 | | SST | K | Sea-Surface Temperature | 139 | 112 | 0 | 7 | ST000007 | K | T of 0-7 cm ground layer |
Modified line
139 | 1 | 0 | | SST | K | Sea-Surface Temperature | 139 | 112 | 0 | 7 | SKINTEMP | K | T of 0-7 cm ground layer |
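If in doubt, the records of the input GRIB file can be listed first to check which parameter codes are actually present. A possible check, assuming the wgrib utility is available (the file path is only an example):

# List the records of an ERA40 surface file before running ungrib.exe,
# to verify that parameter 139 is present.
wgrib -s /path/to/ERA40_sfc.grb | less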
[x] WOULD GO OFF TOP: KF_ETA_PARA I,J,DPTHMX,DPMIN 81 78 NaN 5000.000
See http://forum.wrfforum.com/viewtopic.php?f=6&t=263
Many causes are possible: CFL violations, problems with the initial or boundary conditions, etc. Lowering the time step or switching off the feedback between nests are possible solutions (see the namelist sketch below).
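As an illustration, these two mitigations correspond to the following namelist.input entries in &domains: a lower time_step (in seconds; the value below is only an example and must be adapted to the domain) and feedback = 0 to switch off the feedback from the nests to the parent domain.

&domains
 time_step = 60,
 feedback  = 0,
/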
[x] Metgrid error: Error in ext_pkg_write_field in metgrid.log
Also in log/metgrid_1995030912.out:
ERROR: Error in ext_pkg_write_field
WRF_DEBUG: Warning DIM 4 , NAME num_metgrid_levels REDIFINED by var GHT 17 18 in wrf_io.F90 line 2424
This error usually means that one or more surface variables are missing from the model input data (for example, from the NCEP reanalysis). The input GRIB files must be checked and fixed.
[] forrtl: severe (174): SIGSEGV, segmentation fault occurred
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
wrf.exe            00000000013EF561  Unknown            Unknown     Unknown
wrf.exe            00000000013F0727  Unknown            Unknown     Unknown
wrf.exe            00000000013F1E68  Unknown            Unknown     Unknown
wrf.exe            00000000011BB5CB  Unknown            Unknown     Unknown
wrf.exe            0000000000DE0913  Unknown            Unknown     Unknown
wrf.exe            0000000000DDAEBD  Unknown            Unknown     Unknown
wrf.exe            00000000009AF823  Unknown            Unknown     Unknown
wrf.exe            0000000000690D01  Unknown            Unknown     Unknown
wrf.exe            000000000068DB21  Unknown            Unknown     Unknown
wrf.exe            000000000047BC1B  Unknown            Unknown     Unknown
wrf.exe            000000000047B049  Unknown            Unknown     Unknown
wrf.exe            000000000047AFEC  Unknown            Unknown     Unknown
libc.so.6          0000003AD001D994  Unknown            Unknown     Unknown
wrf.exe            000000000047AEE9  Unknown            Unknown     Unknown
The causes are unknown, but the simulation worked when it was simply submitted again, without any change.
[] wrf.exe: posixio.c:213: px_pgout: Assertion `*posp == ((off_t)(-1)) || *posp == lseek(nciop->fd, 0, 1)' failed.
It appeared in a continuous simulation with spectral nudging, using WRF 3.1.1. rsl.error.0000 shows:
wrf.exe: posixio.c:213: px_pgout: Assertion `*posp == ((off_t)(-1)) || *posp == lseek(nciop->fd, 0, 1)' failed.
forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source
libc.so.6          0000003AD0030265  Unknown            Unknown     Unknown
libc.so.6          0000003AD0031D10  Unknown            Unknown     Unknown
libc.so.6          0000003AD00296E6  Unknown            Unknown     Unknown
wrf.exe            000000000154368A  Unknown            Unknown     Unknown
wrf.exe            0000000001518A2D  Unknown            Unknown     Unknown
wrf.exe            000000000152741E  Unknown            Unknown     Unknown
wrf.exe            00000000014CCD30  Unknown            Unknown     Unknown
wrf.exe            00000000014CBADD  Unknown            Unknown     Unknown
wrf.exe            00000000014BAD59  Unknown            Unknown     Unknown
wrf.exe            00000000014B76A3  Unknown            Unknown     Unknown
wrf.exe            0000000000BB258D  Unknown            Unknown     Unknown
wrf.exe            0000000000BAED79  Unknown            Unknown     Unknown
wrf.exe            0000000000BAE7F8  Unknown            Unknown     Unknown
wrf.exe            0000000000BADD02  Unknown            Unknown     Unknown
wrf.exe            0000000000BADA9E  Unknown            Unknown     Unknown
wrf.exe            0000000000DD5E47  Unknown            Unknown     Unknown
wrf.exe            00000000007A81D6  Unknown            Unknown     Unknown
wrf.exe            00000000006B8424  Unknown            Unknown     Unknown
wrf.exe            0000000000653E19  Unknown            Unknown     Unknown
wrf.exe            0000000000677927  Unknown            Unknown     Unknown
wrf.exe            0000000000674047  Unknown            Unknown     Unknown
wrf.exe            00000000004C9DF7  Unknown            Unknown     Unknown
wrf.exe            000000000047B0A3  Unknown            Unknown     Unknown
wrf.exe            000000000047B057  Unknown            Unknown     Unknown
wrf.exe            000000000047AFEC  Unknown            Unknown     Unknown
libc.so.6          0000003AD001D994  Unknown            Unknown     Unknown
wrf.exe            000000000047AEE9  Unknown            Unknown     Unknown
wrf_2001112400.out shows:
/oceano/gmeteo/WORK/ASNA/WRF/run/SeaWind_N1540_SN/SeaWind_N1540_SN/0029/bin/wrf_wrapper.exe: line 9: 4500 Aborted ${0/_wrapper/} $*
Causes are unknown.
[] No error, wrf just stops (¿¡?!)
Change the debug_level (up to 300) in the &time_control section of namelist.input (see the example below).
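A minimal sketch of the corresponding namelist.input change (only the relevant line of &time_control is shown):

&time_control
 debug_level = 300,
/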
If this still does not show any error, run wrf using the debugging version (OMPIchk).