Changes between Initial Version and Version 1 of WRFKnownProblems


Timestamp: May 2, 2013 4:50:51 PM
Author: MarkelGarcia
     1= Known Problems =
     2
     3[[PageOutline(1-10,Page Contents)]]
     4
     5[]: Unsolved   [x]: Solved
     6
     7== [x] qdel: Server could not connect to MOM 476932.ce01.macc.unican.es ==
      8Sometimes pbs_mom goes down on some nodes. In order to recover, we have to restart the service on the affected node:
     9
     10 {{{
     11[root@ce01 ~]$ssh wn025 'service pbs_mom restart'
     12}}}
      13This can be done for all nodes at once. First, check the status:
     14
     15 {{{
     16[root@ce01 ~]# cexec 'service pbs_mom status'
     17************************* macc *************************
     18--------- wn001---------
     19pbs_mom (pid 2575) is running...
     20--------- wn002---------
     21pbs_mom (pid 3061) is running...
     22--------- wn003---------
     23pbs_mom (pid 2908) is running...
     24--------- wn004---------
     25ssh(1777) Warning: the RSA host key for 'wn004' differs from the key for the IP address '192.168.202.14'
     26Offending key for IP in /etc/ssh/ssh_known_hosts:8
     27Matching host key in /root/.ssh/known_hosts:117
     28ssh(1777) Permission denied, please try again.
     29ssh(1777) Permission denied, please try again.
     30ssh(1777) Permission denied (publickey,password).
     31--------- wn005---------
     32pbs_mom dead but subsys locked
     33--------- wn006---------
     34pbs_mom (pid 3002) is running...
     35--------- wn007---------
     36pbs_mom (pid 29926) is running...
     37--------- wn008---------
     38pbs_mom dead but subsys locked
     39--------- wn009---------
     40ssh(1796) Permission denied, please try again.
     41ssh(1796) Permission denied, please try again.
     42ssh(1796) Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
     43--------- wn010---------
     44pbs_mom dead but subsys locked
     45--------- wn011---------
     46pbs_mom dead but subsys locked
     47--------- wn012---------
     48pbs_mom dead but subsys locked
     49--------- wn013---------
     50pbs_mom (pid 6605 6604 6422) is running...
     51--------- wn014---------
     52pbs_mom (pid 3137) is running...
     53--------- wn015---------
     54pbs_mom dead but subsys locked
     55--------- wn016---------
     56pbs_mom dead but subsys locked
     57--------- wn017---------
     58pbs_mom dead but subsys locked
     59--------- wn018---------
     60pbs_mom (pid 3284) is running...
     61--------- wn019---------
     62pbs_mom dead but subsys locked
     63--------- wn020---------
     64pbs_mom dead but subsys locked
     65--------- wn021---------
     66pbs_mom dead but subsys locked
     67--------- wn022---------
     68pbs_mom dead but subsys locked
     69--------- wn023---------
     70pbs_mom dead but subsys locked
     71--------- wn024---------
     72pbs_mom (pid 3157) is running...
     73--------- wn025---------
     74pbs_mom (pid 18308) is running...
     75--------- wn031---------
     76pbs_mom dead but subsys locked
     77--------- wn032---------
     78pbs_mom dead but subsys locked
     79--------- wn033---------
     80pbs_mom dead but subsys locked
     81--------- wn034---------
     82pbs_mom dead but subsys locked
     83--------- wn035---------
     84pbs_mom dead but subsys locked
     85--------- wn036---------
     86pbs_mom dead but subsys locked
     87--------- wn041---------
     88pbs_mom dead but subsys locked
     89--------- wn042---------
     90pbs_mom dead but subsys locked
     91--------- wn043---------
     92pbs_mom dead but subsys locked
     93--------- wn044---------
     94pbs_mom dead but subsys locked
     95--------- wn045---------
     96pbs_mom dead but subsys locked
     97--------- wn046---------
     98pbs_mom dead but subsys locked
     99}}}
      100Then restart the service on all nodes:
     101
     102{{{
     103[root@ce01 ~]# cexec 'service pbs_mom restart'
     104************************* macc *************************
     105--------- wn001---------
     106Shutting down TORQUE Mom: [  OK  ]
     107Starting TORQUE Mom: [  OK  ]
     108--------- wn002---------
     109Shutting down TORQUE Mom: [  OK  ]
     110Starting TORQUE Mom: [  OK  ]
     111--------- wn003---------
     112Shutting down TORQUE Mom: [  OK  ]
     113Starting TORQUE Mom: [  OK  ]
     114--------- wn004---------
     115ssh(2514) Warning: the RSA host key for 'wn004' differs from the key for the IP address '192.168.202.14'
     116Offending key for IP in /etc/ssh/ssh_known_hosts:8
     117Matching host key in /root/.ssh/known_hosts:117
     118ssh(2514) Permission denied, please try again.
     119ssh(2514) Permission denied, please try again.
     120ssh(2514) Permission denied (publickey,password).
     121--------- wn005---------
     122Shutting down TORQUE Mom: [FAILED]
     123Starting TORQUE Mom: [  OK  ]
     124--------- wn006---------
     125Shutting down TORQUE Mom: [  OK  ]
     126Starting TORQUE Mom: [  OK  ]
     127--------- wn007---------
     128Shutting down TORQUE Mom: [  OK  ]
     129Starting TORQUE Mom: [  OK  ]
     130--------- wn008---------
     131Shutting down TORQUE Mom: [FAILED]
     132Starting TORQUE Mom: [  OK  ]
     133--------- wn009---------
     134ssh(2524) Permission denied, please try again.
     135ssh(2524) Permission denied, please try again.
     136ssh(2524) Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
     137--------- wn010---------
     138Shutting down TORQUE Mom: [FAILED]
     139Starting TORQUE Mom: [  OK  ]
     140--------- wn011---------
     141Shutting down TORQUE Mom: [FAILED]
     142Starting TORQUE Mom: [  OK  ]
     143--------- wn012---------
     144Shutting down TORQUE Mom: [FAILED]
     145Starting TORQUE Mom: [  OK  ]
     146--------- wn013---------
     147Shutting down TORQUE Mom: [  OK  ]
     148Starting TORQUE Mom: [  OK  ]
     149--------- wn014---------
     150Shutting down TORQUE Mom: [  OK  ]
     151Starting TORQUE Mom: [  OK  ]
     152--------- wn015---------
     153Shutting down TORQUE Mom: [FAILED]
     154Starting TORQUE Mom: [  OK  ]
     155--------- wn016---------
     156Shutting down TORQUE Mom: [FAILED]
     157Starting TORQUE Mom: [  OK  ]
     158--------- wn017---------
     159Shutting down TORQUE Mom: [FAILED]
     160Starting TORQUE Mom: [  OK  ]
     161--------- wn018---------
     162Shutting down TORQUE Mom: [  OK  ]
     163Starting TORQUE Mom: [  OK  ]
     164--------- wn019---------
     165Shutting down TORQUE Mom: [FAILED]
     166Starting TORQUE Mom: [  OK  ]
     167--------- wn020---------
     168Shutting down TORQUE Mom: [FAILED]
     169Starting TORQUE Mom: [  OK  ]
     170--------- wn021---------
     171Shutting down TORQUE Mom: [FAILED]
     172Starting TORQUE Mom: [  OK  ]
     173--------- wn022---------
     174Shutting down TORQUE Mom: [FAILED]
     175Starting TORQUE Mom: [  OK  ]
     176--------- wn023---------
     177Shutting down TORQUE Mom: [FAILED]
     178Starting TORQUE Mom: [  OK  ]
     179--------- wn024---------
     180Shutting down TORQUE Mom: [  OK  ]
     181Starting TORQUE Mom: [  OK  ]
     182--------- wn025---------
     183Shutting down TORQUE Mom: [  OK  ]
     184Starting TORQUE Mom: [  OK  ]
     185--------- wn031---------
     186Shutting down TORQUE Mom: [FAILED]
     187Starting TORQUE Mom: [  OK  ]
     188--------- wn032---------
     189Shutting down TORQUE Mom: [FAILED]
     190Starting TORQUE Mom: [  OK  ]
     191--------- wn033---------
     192Shutting down TORQUE Mom: [FAILED]
     193Starting TORQUE Mom: [  OK  ]
     194--------- wn034---------
     195Shutting down TORQUE Mom: [FAILED]
     196Starting TORQUE Mom: [  OK  ]
     197--------- wn035---------
     198Shutting down TORQUE Mom: [FAILED]
     199Starting TORQUE Mom: [  OK  ]
     200--------- wn036---------
     201Shutting down TORQUE Mom: [FAILED]
     202Starting TORQUE Mom: [  OK  ]
     203--------- wn041---------
     204Shutting down TORQUE Mom: [FAILED]
     205Starting TORQUE Mom: [  OK  ]
     206--------- wn042---------
     207Shutting down TORQUE Mom: [FAILED]
     208Starting TORQUE Mom: [  OK  ]
     209--------- wn043---------
     210Shutting down TORQUE Mom: [FAILED]
     211Starting TORQUE Mom: [  OK  ]
     212--------- wn044---------
     213Shutting down TORQUE Mom: [FAILED]
     214Starting TORQUE Mom: [  OK  ]
     215--------- wn045---------
     216Shutting down TORQUE Mom: [FAILED]
     217Starting TORQUE Mom: [  OK  ]
     218--------- wn046---------
     219Shutting down TORQUE Mom: [FAILED]
     220Starting TORQUE Mom: [  OK  ]
     221}}}
      222After the restart, {{{pbsnodes -l}}} reports:
     223
     224 {{{
     225[root@ce01 ~]# pbsnodes -l
     226wn002                offline
     227wn003                offline
     228wn004                down,offline
     229wn009                down,offline
     230wn001                offline
     231}}}
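Several nodes still appear as offline. If they were only marked offline (and pbs_mom is running again on them), they can be cleared from the server side; a minimal sketch, assuming the offline flag was set administratively:

{{{
[root@ce01 ~]# pbsnodes -c wn001 wn002 wn003
}}}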
      232== [] Abnormal ending of simulations due to 'Flerchinger' approximation ==
     233A simulation ends with the following message:
     234
     235{{{
     236(...)
     237 d01 2000-09-13_00:00:00 Input data processed for wrflowinp_d<domain> for domain    1
     238 Flerchinger USEd in NEW version. Iterations=          10
     239 Flerchinger USEd in NEW version. Iterations=          10
     240 Flerchinger USEd in NEW version. Iterations=          10
     241}}}
      242Related references:
     243
     244 * [http://forum.wrfforum.com/viewtopic.php?f=6&t=2531 WRF forum]
     245 * {{{phys/module_sf_noahdrv.F}}} code
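To locate where this message is printed, a quick search of the WRF source tree can be used (a sketch, assuming the standard WRFV3 directory layout):

{{{
grep -n "Flerchinger" phys/module_sf_noahdrv.F
}}}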
     246== [] Abnormal ending of simulations ==
      247While a simulation was running, the following appeared in '{{{rsl.error.0000}}}':
     248
     249{{{
     250(...)
     251Timing for main: time 2001-03-03_05:00:00 on domain   1:    2.94590 elapsed seconds.
     252wrf.exe: posixio.c:213: px_pgout: Assertion `*posp == ((off_t)(-1)) || *posp == lseek(nciop->fd, 0, 1)' failed.
     253Image              PC                Routine            Line        Source
     254libc.so.6          0000003AC6830265  Unknown               Unknown  Unknown
     255libc.so.6          0000003AC6831D10  Unknown               Unknown  Unknown
     256libc.so.6          0000003AC68296E6  Unknown               Unknown  Unknown
     257wrf.exe            00000000015E846A  Unknown               Unknown  Unknown
     258wrf.exe            00000000015BD80D  Unknown               Unknown  Unknown
     259wrf.exe            00000000015CC1FE  Unknown               Unknown  Unknown
     260wrf.exe            0000000001571B10  Unknown               Unknown  Unknown
     261wrf.exe            00000000015708BD  Unknown               Unknown  Unknown
     262wrf.exe            000000000155F149  Unknown               Unknown  Unknown
     263wrf.exe            000000000155B828  Unknown               Unknown  Unknown
     264wrf.exe            0000000000BCA9ED  Unknown               Unknown  Unknown
     265wrf.exe            0000000000BC71D9  Unknown               Unknown  Unknown
     266wrf.exe            0000000000BC6C58  Unknown               Unknown  Unknown
     267wrf.exe            0000000000BC6162  Unknown               Unknown  Unknown
     268wrf.exe            0000000000BC5EFE  Unknown               Unknown  Unknown
     269wrf.exe            0000000000DEF177  Unknown               Unknown  Unknown
     270wrf.exe            00000000007413C2  Unknown               Unknown  Unknown
     271wrf.exe            00000000006BE487  Unknown               Unknown  Unknown
     272wrf.exe            00000000006552B9  Unknown               Unknown  Unknown
     273wrf.exe            000000000067A5B4  Unknown               Unknown  Unknown
     274wrf.exe            0000000000678591  Unknown               Unknown  Unknown
     275wrf.exe            00000000004CA59F  Unknown               Unknown  Unknown
     276wrf.exe            000000000047B093  Unknown               Unknown  Unknown
     277wrf.exe            000000000047B047  Unknown               Unknown  Unknown
     278wrf.exe            000000000047AFDC  Unknown               Unknown  Unknown
     279libc.so.6          0000003AC681D994  Unknown               Unknown  Unknown
     280wrf.exe            000000000047AEE9  Unknown               Unknown  Unknown
     281}}}
     282Related links:
     283
      284 * [http://blog.gmane.org/gmane.comp.lib.netcdf/month=20091201 netCDF users mailing-list discussion (gmane)]
     285 * [http://forum.wrfforum.com/viewtopic.php?f=6&t=2053&start=0 WRF forum 1]
     286 * [http://forum.wrfforum.com/viewtopic.php?f=6&t=2496&start=0 WRF forum 2]
      287== [x] WRF restart files are not written ==
      288During the simulation, {{{rsl.[error/out].[nnnn]}}} shows:
     289
     290{{{
     291(...)
     292Timing for Writing restart for domain        1:   48.82700 elapsed seconds.
     293(...)
     294}}}
      295But the restart file was never actually kept. During the simulation a {{{wrfrst_[.....]}}} file is created, but it is only 32 bytes long and it exists only for those 48.82700 seconds; after that it disappears. Looking at the execution flow (via strace):
     296
     297{{{
     298(...)
     299open("wrfrst_d01_2000-06-01_11:30:00", O_RDWR|O_CREAT|O_TRUNC, 0666) = 13
     300fstat(13, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
     301fstat(13, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
     302lseek(13, 0, SEEK_CUR)                  = 0
     303lseek(13, 24, SEEK_SET)                 = 24
     304write(13, "\0\0\0\0\0\0\0\0", 8)        = 8
     305lseek(13, 0, SEEK_SET)                  = 0
     306(...)
     307write(1, "Timing for Writing restart for d"..., 76) = 76
     308write(2, "Timing for Writing restart for d"..., 76) = 76
     309lseek(13, 0, SEEK_CUR)                  = 0
     310write(13, "CDF\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32) = 32
     311close(13)                               = 0
     312unlink("wrfrst_d01_2000-06-01_11:00:00") = 0
     313(...)
     314}}}
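For reference, a trace like the one above can be captured with something along these lines (a sketch; it assumes a serial run of {{{wrf.exe}}}, for an MPI run one would attach to a single rank with {{{strace -p <pid>}}}):

{{{
# Trace only the file-handling system calls and write them to wrf_trace.out
strace -f -e trace=open,lseek,write,close,unlink -o wrf_trace.out ./wrf.exe
}}}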
      315It seems to be related to the size of the files.
     316
      317In order to allow netCDF output bigger than 2 GB, one should activate the following variable (in compile.bash):
     318
     319{{{
     320export WRFIO_NCD_LARGE_FILE_SUPPORT=1
     321}}}
      322Now it works.
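A quick way to verify that the restart files are really being written in the large-file (64-bit offset) format is to check the file kind (a sketch; the file name is taken from the strace output above):

{{{
ncdump -k wrfrst_d01_2000-06-01_11:30:00
# expected output: 64-bit offset
}}}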
     323
      324== [] Crash of running simulations due to Open MPI problems ==
      325In a non-systematic way, some simulations stop with the following message:
     326
     327{{{
     328[lluis@wn033 ~]$ cat /localtmp/wrf4g.20110213111149664772000/log/wrf_2000101006.out
     329[wn033.macc.unican.es:19865] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     330[wn033.macc.unican.es:19864] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     331[wn033.macc.unican.es:19869] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     332[wn033.macc.unican.es:19866] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     333[wn033.macc.unican.es:19867] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     334[wn033.macc.unican.es:19876] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     335[wn033.macc.unican.es:19870] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     336[wn033.macc.unican.es:19874] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     337[wn033.macc.unican.es:19877] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     338[wn033.macc.unican.es:19863] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     339[wn033.macc.unican.es:19871] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     340[wn033.macc.unican.es:19875] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     341[wn033.macc.unican.es:19873] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     342[wn033.macc.unican.es:19868] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     343[wn033.macc.unican.es:19862] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     344[wn033.macc.unican.es:19872] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_ofud: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     345[wn033.macc.unican.es:19872] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     346[wn033.macc.unican.es:19868] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     347[wn033.macc.unican.es:19875] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     348[wn033.macc.unican.es:19862] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     349[wn033.macc.unican.es:19871] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     350[wn033.macc.unican.es:19866] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     351[wn033.macc.unican.es:19873] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     352[wn033.macc.unican.es:19863] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     353[wn033.macc.unican.es:19864] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     354[wn033.macc.unican.es:19876] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     355[wn033.macc.unican.es:19867] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     356[wn033.macc.unican.es:19877] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     357[wn033.macc.unican.es:19869] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     358[wn033.macc.unican.es:19870] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     359[wn033.macc.unican.es:19865] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     360[wn033.macc.unican.es:19874] mca: base: component_find: unable to open /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/lib/openmpi/mca_btl_openib: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
     361mpiexec: killing job...
     362--------------------------------------------------------------------------
     363Sorry!  You were supposed to get help about:
     364    orterun:unclean-exit
     365But I couldn't open the help file:
     366    /oceano/gmeteo/WORK/ASNA/WRF/run/hvt125/hvt125__2000101006_2000101200/0001/openmpi/share/openmpi/help-orterun.txt: No such file or directory.  Sorry!
     367--------------------------------------------------------------------------
     368}}}
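The messages suggest that the MCA components staged under the run directory do not match the Open MPI version actually used to launch the job. A minimal consistency check (a sketch; {{{$RUNDIR}}} is a placeholder for the {{{.../0001}}} run directory shown above):

{{{
# Which mpirun is being picked up, and which Open MPI version is it?
which mpirun
mpirun --version
# Components staged with the experiment
ls $RUNDIR/openmpi/lib/openmpi/ | head
}}}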
     369== [x] Intel simulations crash with CAM ==
      370Simulations crash right at their beginning:
     371
     372{{{
     373taskid: 0 hostname: wn025.macc.unican.es
     374 Quilting with   1 groups of   0 I/O tasks.
     375 Namelist dfi_control not found in namelist.input. Using registry defaults for v
     376 ariables in dfi_control
     377 Namelist tc not found in namelist.input. Using registry defaults for variables
     378 in tc
     379 Namelist scm not found in namelist.input. Using registry defaults for variables
     380  in scm
     381 Namelist fire not found in namelist.input. Using registry defaults for variable
     382 s in fire
     383  Ntasks in X            2, ntasks in Y            4
     384 WRF V3.1.1 MODEL
     385   *** CLWRF code enabled
     386  *************************************
     387  Parent domain
     388  ids,ide,jds,jde            1          50           1          50
     389  ims,ime,jms,jme           -4          32          -4          20
     390  ips,ipe,jps,jpe            1          25           1          13
     391  *************************************
     392 DYNAMICS OPTION: Eulerian Mass Coordinate
     393    alloc_space_field: domain            1,     26649120 bytes allocated
     394   med_initialdata_input: calling input_model_input
     395 INPUT LandUse = "USGS"
     396forrtl: severe (174): SIGSEGV, segmentation fault occurred
     397Image              PC                Routine            Line        Source
     398wrf.exe            00000000013EF3E1  Unknown               Unknown  Unknown
     399wrf.exe            00000000013F05A7  Unknown               Unknown  Unknown
     400wrf.exe            00000000013F1CE8  Unknown               Unknown  Unknown
     401wrf.exe            00000000011BB44B  Unknown               Unknown  Unknown
     402wrf.exe            0000000000DE008E  Unknown               Unknown  Unknown
     403wrf.exe            0000000000DDAEAD  Unknown               Unknown  Unknown
     404wrf.exe            00000000009AF813  Unknown               Unknown  Unknown
     405wrf.exe            0000000000690D01  Unknown               Unknown  Unknown
     406wrf.exe            000000000068DB21  Unknown               Unknown  Unknown
     407wrf.exe            000000000047BC1B  Unknown               Unknown  Unknown
     408wrf.exe            000000000047B049  Unknown               Unknown  Unknown
     409wrf.exe            000000000047AFEC  Unknown               Unknown  Unknown
     410libc.so.6          0000003C6421D994  Unknown               Unknown  Unknown
     411wrf.exe            000000000047AEE9  Unknown               Unknown  Unknown
     412}}}
     413where:
     414
     415{{{
     416[lluis@wn025 run]$ find /lib* -name libc.so.6
     417/lib/libc.so.6
     418/lib/i686/nosegneg/libc.so.6
     419/lib64/libc.so.6
     420}}}
     421and
     422
     423{{{
     424[lluis@wn025 run]$ ldd wrf.exe
     425        libmpi_f90.so.0 => not found
     426        libmpi_f77.so.0 => not found
     427        libmpi.so.0 => not found
     428        libopen-rte.so.0 => not found
     429        libopen-pal.so.0 => not found
     430        libdl.so.2 => /lib64/libdl.so.2 (0x0000003c64600000)
     431        libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003c65600000)
     432        libutil.so.1 => /lib64/libutil.so.1 (0x0000003c65a00000)
     433        libm.so.6 => /lib64/libm.so.6 (0x0000003c64e00000)
     434        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003c64a00000)
     435        libc.so.6 => /lib64/libc.so.6 (0x0000003c64200000)
     436        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003c66200000)
     437        /lib64/ld-linux-x86-64.so.2 (0x0000003c63e00000)
     438}}}
     439This happens with:
     440
     441{{{
     442NIN_ra_lw_physics      = 3
     443NIN_ra_sw_physics      = 3
     444}}}
     445This does not happen with:
     446
     447{{{
     448NIN_ra_lw_physics      = 4
     449NIN_ra_sw_physics      = 4
     450}}}
      451'''NOTE:''' This error also appears with the serial compilation. With {{{debug          = 1000,}}} the output is:
     452
     453{{{
     454Namelist dfi_control not found in namelist.input. Using registry defaults for v
     455 ariables in dfi_control
     456 Namelist tc not found in namelist.input. Using registry defaults for variables
     457 in tc
     458 Namelist scm not found in namelist.input. Using registry defaults for variables
     459  in scm
     460 Namelist fire not found in namelist.input. Using registry defaults for variable
     461 s in fire
     462 WRF V3.1.1 MODEL
     463   wrf: calling alloc_and_configure_domain
     464  *************************************
     465  Parent domain
     466  ids,ide,jds,jde            1          50           1          50
     467  ims,ime,jms,jme           -4          55          -4          55
     468  ips,ipe,jps,jpe            1          50           1          50
     469  *************************************
     470 DYNAMICS OPTION: Eulerian Mass Coordinate
     471    alloc_space_field: domain            1,     95259880 bytes allocated
     472   med_initialdata_input: calling input_model_input
     473(...)
     474 INPUT LandUse = "USGS"
     475 LANDUSE TYPE = "USGS" FOUND          33  CATEGORIES           2  SEASONS
     476  WATER CATEGORY =           16  SNOW CATEGORY =           24
     477  *** SATURATION VAPOR PRESSURE TABLE COMPLETED ***
     478    num_months =           13
     479 AEROSOLS:  Background aerosol will be limited to bottom            6
     480  model interfaces.
     481   reading CAM_AEROPT_DATA
     482forrtl: severe (174): SIGSEGV, segmentation fault occurred
     483Image              PC                Routine            Line        Source
     484wrf.exe            000000000130FB51  Unknown               Unknown  Unknown
     485wrf.exe            0000000001310D17  Unknown               Unknown  Unknown
     486wrf.exe            0000000001312458  Unknown               Unknown  Unknown
     487wrf.exe            00000000010DC9BB  Unknown               Unknown  Unknown
     488wrf.exe            0000000000D12F9E  Unknown               Unknown  Unknown
     489wrf.exe            0000000000D0DDBD  Unknown               Unknown  Unknown
     490wrf.exe            0000000000906523  Unknown               Unknown  Unknown
     491wrf.exe            0000000000608AC1  Unknown               Unknown  Unknown
     492wrf.exe            00000000006058E1  Unknown               Unknown  Unknown
     493wrf.exe            0000000000404DEC  Unknown               Unknown  Unknown
     494wrf.exe            0000000000404249  Unknown               Unknown  Unknown
     495wrf.exe            00000000004041EC  Unknown               Unknown  Unknown
     496libc.so.6          0000003331A1D994  Unknown               Unknown  Unknown
      497wrf.exe            00000000004040E9  Unknown               Unknown  Unknown
}}}
     498And WRF configuration:
     499
     500{{{
     501[lluis@mar run]$ ldd wrf.exe
     502        libm.so.6 => /lib64/libm.so.6 (0x00000039cac00000)
     503        libpthread.so.0 => /lib64/libpthread.so.0 (0x00000039cb000000)
     504        libc.so.6 => /lib64/libc.so.6 (0x00000039ca400000)
     505        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003ac2200000)
     506        libdl.so.2 => /lib64/libdl.so.2 (0x00000039ca800000)
     507        /lib64/ld-linux-x86-64.so.2 (0x00000039ca000000)
     508}}}
     509Segmentation fault appears after line # 3757 of {{{phys/module_ra_cam_support.F}}} (from WRFV3.1.1)
     510
      511'''NOTE:''' after activating
     512
     513{{{
     514ulimit -s unlimited
     515}}}
      516it works!
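In practice this means raising the stack limit in the script that launches the model; a minimal sketch, assuming {{{wrf.exe}}} is started from a wrapper script on the node:

{{{
# Remove the stack size limit before starting WRF, so that large automatic
# arrays (e.g. in the CAM radiation code) fit on the stack
ulimit -s unlimited
./wrf.exe
}}}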
     517
      518On the ESCENA domain the simulation works with CAM ra_lw/sw when a checked (debug) compilation is used (SERIAL build in: {{{/oceano/gmeteo/WORK/ASNA/WRF/Binaries/3.1.1/A/WRF/icif64/SERIALchk/WRFV3/main/wrf.exe}}}). The following messages are shown in the standard output:
     519
     520{{{
     521(...)
     522Timing for main: time 2001-11-10_00:02:30 on domain   1:  358.70981 elapsed seconds.
     523forrtl: warning (402): fort: (1): In call to PRE_RADIATION_DRIVER, an array temporary was created for argument #49
     524forrtl: warning (402): fort: (1): In call to PRE_RADIATION_DRIVER, an array temporary was created for argument #51
     525forrtl: warning (402): fort: (1): In call to RADIATION_DRIVER, an array temporary was created for argument #134
     526forrtl: warning (402): fort: (1): In call to RADIATION_DRIVER, an array temporary was created for argument #136
     527forrtl: warning (402): fort: (1): In call to SURFACE_DRIVER, an array temporary was created for argument #181
     528forrtl: warning (402): fort: (1): In call to SURFACE_DRIVER, an array temporary was created for argument #183
     529forrtl: warning (402): fort: (1): In call to PBL_DRIVER, an array temporary was created for argument #87
     530forrtl: warning (402): fort: (1): In call to PBL_DRIVER, an array temporary was created for argument #89
     531forrtl: warning (402): fort: (1): In call to CUMULUS_DRIVER, an array temporary was created for argument #21
     532forrtl: warning (402): fort: (1): In call to CUMULUS_DRIVER, an array temporary was created for argument #23
     533forrtl: warning (402): fort: (1): In call to FDDAGD_DRIVER, an array temporary was created for argument #58
     534forrtl: warning (402): fort: (1): In call to FDDAGD_DRIVER, an array temporary was created for argument #60
     535forrtl: warning (402): fort: (1): In call to MICROPHYSICS_DRIVER, an array temporary was created for argument #55
     536forrtl: warning (402): fort: (1): In call to MICROPHYSICS_DRIVER, an array temporary was created for argument #57
     537forrtl: warning (402): fort: (1): In call to DIAGNOSTIC_OUTPUT_CALC, an array temporary was created for argument #20
     538forrtl: warning (402): fort: (1): In call to DIAGNOSTIC_OUTPUT_CALC, an array temporary was created for argument #22
     539Timing for main: time 2001-11-10_00:05:00 on domain   1:   32.11490 elapsed seconds.
     540(...)
     541}}}
     542 * On {{{phys/module_radiation_driver.F}}}, subroutine {{{pre_radiation_driver}}} arguments #49, 51 are: i_end, j_end
     543 * On {{{phys/module_radiation_driver.F}}}, subroutine {{{radiation_driver}}} arguments #134, 136 are: i_end, j_end
     544 * On {{{phys/module_surface_driver.F}}}, subroutine {{{surface_driver}}} arguments #181, 183 are: i_end, j_end
     545 * On {{{phys/module_pbl_driver.F}}}, subroutine {{{pbl_driver}}} arguments #87, 89 are: i_end, j_end
     546 * On {{{phys/module_cumulus_driver.F}}}, subroutine {{{cumulus_driver}}} arguments #21, 23 are: i_end, j_end
     547 * On {{{phys/module_fddagd_driver.F}}}, subroutine {{{fddagd_driver}}} arguments #58, 60 are: i_end, j_end
     548 * On {{{phys/module_microphysics_driver.F}}}, subroutine {{{microphysics_driver}}} arguments #55, 57 are: i_end, j_end
     549 * On {{{phys/module_diagnostics.F}}}, subroutine {{{diagnostic_output_calc}}} arguments #20, 22 are: i_end, j_end
      550In these subroutines the arguments are declared as:
     551
     552{{{
     553  INTEGER, DIMENSION(num_tiles), INTENT(IN) ::                   &
     554  &                                    i_start,i_end,j_start,j_end
     555}}}
     556Definition in {{{frame/module_domain_type.F}}}, WRF derived type for the domain {{{TYPE(domain)}}}:
     557
     558{{{
     559TYPE domain
     560 (...)
     561      INTEGER,POINTER                                     :: i_start(:),i_end(:)
     562      INTEGER,POINTER                                     :: j_start(:),j_end(:)
     563 (...)
     564      INTEGER                                             :: num_tiles        ! taken out of namelist 20000908
     565 (...)
     566}}}
     567Some information about [http://www.mmm.ucar.edu/wrf/WG2/topics/settiles/ WRF tiles]
     568
      569In {{{frame/module_tiles.F}}}, the following can be seen in subroutines {{{set_tiles1, set_tiles2, set_tiles3}}}:
     570
     571{{{
     572   IF ( ASSOCIATED(grid%i_start) ) THEN ; DEALLOCATE( grid%i_start ) ; NULLIFY( grid%i_start ) ; ENDIF
     573       IF ( ASSOCIATED(grid%i_end) )   THEN ; DEALLOCATE( grid%i_end   ) ; NULLIFY( grid%i_end   ) ; ENDIF
     574       IF ( ASSOCIATED(grid%j_start) ) THEN ; DEALLOCATE( grid%j_start ) ; NULLIFY( grid%j_start ) ; ENDIF
     575       IF ( ASSOCIATED(grid%j_end) )   THEN ; DEALLOCATE( grid%j_end   ) ; NULLIFY( grid%j_end   ) ; ENDIF
     576       ALLOCATE(grid%i_start(num_tiles))
     577       ALLOCATE(grid%i_end(num_tiles))
     578       ALLOCATE(grid%j_start(num_tiles))
     579       ALLOCATE(grid%j_end(num_tiles))
     580       grid%max_tiles = num_tiles
     581}}}
      582The recommended WRF compilation configuration (Intel, distributed memory) is:
     583
     584{{{
     585(...)
     586DMPARALLEL      =        1
     587OMPCPP          =       # -D_OPENMP
     588OMP             =       # -openmp -fpp -auto
     589SFC             =       ifort
     590SCC             =       icc
     591DM_FC           =       mpif90 -f90=$(SFC)
     592DM_CC           =       mpicc -cc=$(SCC) -DMPI2_SUPPORT
     593FC              =        $(DM_FC)
     594CC              =       $(DM_CC) -DFSEEKO64_OK
     595LD              =       $(FC)
     596RWORDSIZE       =       $(NATIVE_RWORDSIZE)
     597PROMOTION       =       -i4
     598ARCH_LOCAL      =       -DNONSTANDARD_SYSTEM_FUNC
     599CFLAGS_LOCAL    =       -w -O3 -ip
     600LDFLAGS_LOCAL   =       -ip
     601CPLUSPLUSLIB    =
     602ESMF_LDFLAG     =       $(CPLUSPLUSLIB)
     603FCOPTIM         =       -O3
     604FCREDUCEDOPT    =       $(FCOPTIM)
     605FCNOOPT         =       -O0 -fno-inline -fno-ip
     606FCDEBUG         =       # -g $(FCNOOPT) -traceback
     607FORMAT_FIXED    =       -FI
     608FORMAT_FREE     =       -FR
     609FCSUFFIX        =
     610BYTESWAPIO      =       -convert big_endian
      611FCBASEOPTS      =       -w -ftz -align all -fno-alias -fp-model precise $(FCDEBUG) $(FORMAT_FREE) $(BYTESWAPIO)
     613MODULE_SRCH_FLAG =
     614TRADFLAG        =      -traditional
     615CPP             =      /lib/cpp -C -P
     616AR              =      ar
     617(...)
     618}}}
      619Simply changing the '-O3' optimization to '-O2' makes it work properly. (It also works with '-O1', but that makes the simulations slower.)
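For reference, the change amounts to lowering the Fortran optimization level in {{{configure.wrf}}} (a sketch; only this line changes with respect to the configuration shown above):

{{{
FCOPTIM         =       -O2
}}}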
     620
      621=== Latest news ===
      622It also works when just adding the compilation flag '-heap-arrays', which means:
     623
     624{{{
     625      -heap-arrays [size]
     626       -no-heap-arrays
     627              Puts  automatic  arrays  and  arrays created for temporary
     628              computations on the heap instead of the stack.
     629              Architectures: IA-32, Intel® 64, IA-64 architectures
     630              Default:
     631              -no-heap-arrays   The compiler puts automatic  arrays  and
     632                                arrays  created  for  temporary computa-
     633                                tions in temporary storage in the  stack
     634                                storage area.
     635              Description:
     636              This  option  puts automatic arrays and arrays created for
     637              temporary computations on the heap instead of the stack.
     638              If heap-arrays is specified and size is omitted, all auto-
     639              matic  and  temporary arrays are put on the heap. If 10 is
     640              specified for size, all  automatic  and  temporary  arrays
     641              larger than 10 KB are put on the heap.
     642}}}
      643It has been added to {{{configure.wrf}}} by simply adding the flag:
     644
     645{{{
     646(...)
     647CFLAGS_LOCAL    =       -w -O3 -heap-arrays -ip
     648(...)
     649FCOPTIM         =       -O3 -heap-arrays
     650}}}
      651A complete thread about this issue is available on Intel's forum: [http://software.intel.com/en-us/forums/showthread.php?t=72109&p=1#146890 Intel forum]
     652
     653== [] STOP of simulations due to library problems ==
      654Some executions stop with this error message:
     655
     656{{{
     657/oceano/gmeteo/WORK/ASNA/WRF/run/scne2a/scne2a/0016/bin/wrf.exe: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory
     658/oceano/gmeteo/WORK/ASNA/WRF/run/scne2a/scne2a/0016/bin/wrf.exe: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory
     659/oceano/gmeteo/WORK/ASNA/WRF/run/scne2a/scne2a/0016/bin/wrf.exe: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory
     660/oceano/gmeteo/WORK/ASNA/WRF/run/scne2a/scne2a/0016/bin/wrf.exe: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory
     661/oceano/gmeteo/WORK/ASNA/WRF/run/scne2a/scne2a/0016/bin/wrf.exe: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory
     662/oceano/gmeteo/WORK/ASNA/WRF/run/scne2a/scne2a/0016/bin/wrf.exe: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory
     663/oceano/gmeteo/WORK/ASNA/WRF/run/scne2a/scne2a/0016/bin/wrf.exe: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory
     664/oceano/gmeteo/WORK/ASNA/WRF/run/scne2a/scne2a/0016/bin/wrf.exe: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory
     665}}}
      666No {{{rsl.[error/out].[nnnn]}}} file is written.
     667
      668Running ldd on 'wrf.exe' gives the same result on both failing nodes:
     669
     670{{{
     671[lluis@wn031 ~]$ ldd /oceano/gmeteo/WORK/ASNA/WRF/Binaries/3.1.1/A/CLW//gcgf64/OMPI/WRFV3/main/wrf.exe
     672        libmpi_f90.so.0 => not found
     673        libmpi_f77.so.0 => not found
     674        libmpi.so.0 => not found
     675        libopen-rte.so.0 => not found
     676        libopen-pal.so.0 => not found
     677        libdl.so.2 => /lib64/libdl.so.2 (0x0000003a9d800000)
     678        libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003a9e400000)
     679        libutil.so.1 => /lib64/libutil.so.1 (0x0000003aa0400000)
     680        libgfortran.so.3 => not found
     681        libm.so.6 => /lib64/libm.so.6 (0x0000003a9dc00000)
     682        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003a9f400000)
     683        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003a9e000000)
     684        libc.so.6 => /lib64/libc.so.6 (0x0000003a9d400000)
     685        /lib64/ld-linux-x86-64.so.2 (0x0000003a9d000000)
     686}}}
      687gfortran is installed:
     688
     689{{{
     690[lluis@wn041 ~]$ which gfortran
     691/usr/bin/gfortran
     692[lluis@wn041 ~]$ ldd /usr/bin/gfortran
     693        libc.so.6 => /lib64/libc.so.6 (0x0000003a9d400000)
     694        /lib64/ld-linux-x86-64.so.2 (0x0000003a9d000000)
     695}}}
     696On a working node ldd gives:
     697
     698{{{
     699[lluis@wn010 ~]$ ldd /oceano/gmeteo/WORK/ASNA/WRF/Binaries/3.1.1/A/CLW//gcgf64/OMPI/WRFV3/main/wrf.exe
     700        libmpi_f90.so.0 => not found
     701        libmpi_f77.so.0 => not found
     702        libmpi.so.0 => not found
     703        libopen-rte.so.0 => not found
     704        libopen-pal.so.0 => not found
     705        libdl.so.2 => /lib64/libdl.so.2 (0x0000003cd0c00000)
     706        libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003cd3000000)
     707        libutil.so.1 => /lib64/libutil.so.1 (0x0000003cd3400000)
     708        libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x00002b5f55924000)
     709        libm.so.6 => /lib64/libm.so.6 (0x0000003cd1400000)
     710        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003cd2800000)
     711        libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003cd1000000)
     712        libc.so.6 => /lib64/libc.so.6 (0x0000003cd0800000)
     713        /lib64/ld-linux-x86-64.so.2 (0x0000003cd0400000)
     714}}}
      715The gfortran ldd output is the same on 'wn010' and on 'wn031/041'.
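A quick way to see on which nodes the runtime library is actually registered (a sketch, assuming {{{cexec}}} is available from the head node as in the pbs_mom example above):

{{{
cexec 'ldconfig -p | grep libgfortran'
}}}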
     716
      717== [] STOP of simulations due to network delays ==
      718On execution of wrf.exe, simulations stop with the following messages (with Open MPI):
     719
     720rsl.error.0004
     721
     722{{{
     723taskid: 4 hostname: wn017.macc.unican.es[wn017.macc.unican.es][[20060,1],4][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Invalid argument (22)
     724}}}
     725rsl.error.0005
     726
     727{{{
     728taskid: 5 hostname: wn017.macc.unican.es[wn017.macc.unican.es][[20060,1],5][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
     729}}}
     730rsl.error.0006
     731
     732{{{
     733taskid: 6 hostname: wn017.macc.unican.es[wn017.macc.unican.es][[20060,1],6][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
     734}}}
     735rsl.error.0007
     736
     737{{{
     738[wn017.macc.unican.es][[20060,1],7][btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
     739}}}
      740We are experiencing some network problems, with an important drop in the system response of the cluster machine ({{{'dinamic'}}} queue).
     741
      742== [x] real.exe STOP: At line 703 of file module_initialize_real.f90 ==
      743On execution of real.exe the following appears:
     744
     745{{{
     746Namelist dfi_control not found in namelist.input. Using registry defaults for variables in dfi_control
     747 Namelist tc not found in namelist.input. Using registry defaults for variables in tc
     748 Namelist scm not found in namelist.input. Using registry defaults for variables in scm
     749 Namelist fire not found in namelist.input. Using registry defaults for variables in fire
     750 REAL_EM V3.1.1 PREPROCESSOR
     751  *************************************
     752  Parent domain
     753  ids,ide,jds,jde            1         167           1         139
     754  ims,ime,jms,jme           -4         172          -4         144
     755  ips,ipe,jps,jpe            1         167           1         139
     756  *************************************
     757 DYNAMICS OPTION: Eulerian Mass Coordinate
     758    alloc_space_field: domain            1 ,    804753800  bytes allocated
     759Time period #   1 to process = 2025-01-01_00:00:00.
     760Time period #   2 to process = 2025-01-01_06:00:00.
     761(...)
     762Time period #  56 to process = 2025-01-14_18:00:00.
     763Time period #  57 to process = 2025-01-15_00:00:00.
     764Total analysis times to input =   57.
     765
     766 -----------------------------------------------------------------------------
     767
     768 Domain  1: Current date being processed: 2025-01-01_00:00:00.0000, which is loop #   1 out of   57
     769 configflags%julyr, %julday, %gmt:        2025           1   0.000000
     770 d01 2025-01-01_00:00:00 Timing for input          0 s.
     771 d01 2025-01-01_00:00:00          flag_soil_layers read from met_em file is  1
     772At line 703 of file module_initialize_real.f90
     773Fortran runtime error: End of record
     774}}}
     775The error messages
     776
     777{{{
     778At line 703 of file module_initialize_real.f90
     779Fortran runtime error: End of record
     780}}}
      781are [http://gcc.gnu.org/ml/fortran/2005-02/msg00394.html gfortran run-time errors].
     782
      783This occurs because the input data does not have PMSL/PSFC! From ungrib.log:
     784
     785{{{
     786(...)
     787Inventory for date = 2025-01-01 00:00:00
     788PRES   HGT      TT       UU       VV       RH       SOILHGT  LANDSEA  PSFC     PMSL     SST      SKINTEMP SNOW     ST000007 ST007028 ST028100 ST100255 SM000007 SM007028 SM028100 SM100255
     789-------------------------------------------------------------------------------
     7902001.1  O        O        O        O        O        O        O        O        O        O        O        X        O        O        O        O        O        O        O        O
     7912001.0  O        X        X        X        X        X        X        O        O        O        X        O        O        O        O        O        O        O        O        O
     7921000.0  X        X        X        X        X
     793 925.0  X        X        X        X        X
     794 850.0  X        X        X        X        X
     795 700.0  X        X        X        X        X
     796 500.0  X        X        X        X        X
     797 300.0  X        X        X        X        X
     798 200.0  X        X        X        X        X
     799 100.0  X        X        X        X        X
     800  50.0  X        X        X        X        X
     801-------------------------------------------------------------------------------
     802(...)
     803}}}
      804Removing PSFC/PMSL from working input data, the error is reproduced!
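A quick check before running real.exe (a sketch; it assumes the ungrib output is kept in {{{ungrib.log}}}): if neither field appears in the inventory, the input data lacks the surface pressure fields.

{{{
grep -E 'PSFC|PMSL' ungrib.log
}}}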
     805
     806== [x] Execution error in WRF ==
      807On GRIDUI this error appears in different experiments, {{{scnc1a}}} and {{{scnc1b}}}:
     808
     809On {{{/gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/log/rsl_wrf/rsl.error.0000}}}, simulation started at 19750423000000:
     810
     811{{{
     812(...)
     813Timing for main: time 1975-05-13_22:32:30 on domain   1:    2.23900 elapsed seconds.
     814[gcsic019wn:19507] *** Process received signal ***
     815[gcsic019wn:19507] Signal: Segmentation fault (11)
     816[gcsic019wn:19507] Signal code: Address not mapped (1)
     817[gcsic019wn:19507] Failing at address: 0xfffffffc01fd0668
     818[gcsic019wn:19507] [ 0] /lib64/libpthread.so.0 [0x3df980e930]
     819[gcsic019wn:19507] [ 1] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_ra_cam_support__estblf+0x56) [0x1411806]
     820[gcsic019wn:19507] [ 2] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_ra_cam_support__aqsat+0x189) [0x14119b9]
     821[gcsic019wn:19507] [ 3] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_ra_cam__radctl+0x151f) [0x118ee8f]
     822[gcsic019wn:19507] [ 4] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_ra_cam__camrad+0x584d) [0x1196ded]
     823[gcsic019wn:19507] [ 5] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_radiation_driver__radiation_driver+0x4d8b) [0xcddd3b]
     824[gcsic019wn:19507] [ 6] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_first_rk_step_part1__first_rk_step_part1+0x3edb) [0xdb734b]
     825[gcsic019wn:19507] [ 7] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(solve_em_+0x1c89f) [0x95939f]
     826[gcsic019wn:19507] [ 8] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(solve_interface_+0x6c8) [0x635e30]
     827[gcsic019wn:19507] [ 9] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_integrate__integrate+0x1b0) [0x494444]
     828[gcsic019wn:19507] [10] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(__module_wrf_top__wrf_run+0x22) [0x46ef12]
     829[gcsic019wn:19507] [11] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(MAIN__+0x3a) [0x46e5ca]
     830[gcsic019wn:19507] [12] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe(main+0xe) [0x14e9cae]
     831[gcsic019wn:19507] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3df901d994]
     832[gcsic019wn:19507] [14] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1b/scnc1b/0003/bin/wrf.exe [0x46e4d9]
     833[gcsic019wn:19507] *** End of error message ***
     834}}}
     835Same in {{{/gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/log/rsl_wrf/rsl.error.0000}}}, simulation started at 19500507000000:
     836
     837{{{
     838(...)
     839Timing for main: time 1950-05-10_16:17:30 on domain   1:    4.13800 elapsed seconds.
     840Timing for [gcsic116wn:30182] *** Process received signal ***
     841[gcsic116wn:30182] Signal: Segmentation fault (11)
     842[gcsic116wn:30182] Signal code: Address not mapped (1)
     843[gcsic116wn:30182] Failing at address: 0xfffffffc01fd0668
     844[gcsic116wn:30182] [ 0] /lib64/libpthread.so.0 [0x3dc780e930]
     845[gcsic116wn:30182] [ 1] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_ra_cam_support__estblf+0x56) [0x1411806]
     846[gcsic116wn:30182] [ 2] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_ra_cam_support__aqsat+0x110) [0x1411940]
     847[gcsic116wn:30182] [ 3] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_ra_cam__radctl+0x151f) [0x118ee8f]
     848[gcsic116wn:30182] [ 4] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_ra_cam__camrad+0x584d) [0x1196ded]
     849[gcsic116wn:30182] [ 5] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_radiation_driver__radiation_driver+0x4d8b) [0xcddd3b]
     850[gcsic116wn:30182] [ 6] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_first_rk_step_part1__first_rk_step_part1+0x3edb) [0xdb734b]
     851[gcsic116wn:30182] [ 7] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(solve_em_+0x1c89f) [0x95939f]
     852[gcsic116wn:30182] [ 8] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(solve_interface_+0x6c8) [0x635e30]
     853[gcsic116wn:30182] [ 9] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_integrate__integrate+0x1b0) [0x494444]
     854[gcsic116wn:30182] [10] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(__module_wrf_top__wrf_run+0x22) [0x46ef12]
     855[gcsic116wn:30182] [11] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(MAIN__+0x3a) [0x46e5ca]
     856[gcsic116wn:30182] [12] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe(main+0xe) [0x14e9cae]
     857[gcsic116wn:30182] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3dc701d994]
     858[gcsic116wn:30182] [14] /gpfs/csic_projects/meteo/WORK/GRIDUI/run/scnc1a/scnc1a/0004/bin/wrf.exe [0x46e4d9]
     859[gcsic116wn:30182] *** End of error message ***
     860}}}
      861This happens because WPS uses the 'iDirectionIncrementInDegrees' attribute, but cdo (used to transform the input files) does not set it (it gives a 'MISSING' value). In order to prevent this error and others (not enough decimals in the encoding of the value in the input GRIB files), a new namelist option has been introduced in the {{{namelist.wps}}} 'ungrib' section:
     862
     863{{{
     864 is_global                = 1,
     865}}}
      866With this value set, the input files are declared to be global; the increment in degrees in the {{{'i-direction'}}} is then computed from the number of grid points in the x direction of the input files. To allow this, some modifications have to be made in some modules of the {{{WPS/ungrib}}} source code.
     867
      868'''NOTE:''' This option is only applicable when the input data is on a regular grid in the x direction.
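For reference, a minimal {{{&ungrib}}} section using the new option could look like this ({{{out_format}}} and {{{prefix}}} are the usual values, shown only as an example):

{{{
&ungrib
 out_format = 'WPS',
 prefix     = 'FILE',
 is_global  = 1,
/
}}}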
     869
     870 * {{{ungrib/src/rd_grib1.F}}}
     871{{{
     87262       SUBROUTINE rd_grib1(IUNIT, gribflnm, level, field, hdate,  &
     873 63            ierr, iuarr, debug_level, is_g)
     874(...)
     875 77       ! L. Fita. UC. August 2010
     876 78         INTEGER :: is_g
     877(...)
     878372      ! L. Fita. UC. August 2010
     879373           IF (is_g == 1) THEN
     880374             PRINT *,"*********** L. Fita. UC . August 2010 ***********"
     881375             PRINT *,"***       Assuming global regular grid.       ***"
     882376             PRINT *,"*** Computing 'dx' from number of points 'Nx' ***"
     883377             PRINT *,"*************************************************"
     884378             map%dx = 360.0 / map%nx
     885379             PRINT *,'Nx = ',map%nx,' dx:',map%dx
     886380           ELSE
     887381             map%dx = ginfo(8)
     888382           ENDIF
     889(...)
     890423       ! L. Fita. UC. August 2010
     891424            IF (is_g == 1) THEN
     892425              PRINT *,"*********** L. Fita. UC . August 2010 ***********"
     893426              PRINT *,"***       Assuming global regular grid.       ***"
     894427              PRINT *,"*** Computing 'dx' from number of points 'Nx' ***"
     895428              PRINT *,"*************************************************"
     896429              map%dx = 360.0 / map%nx
     897430              PRINT *,'Nx = ',map%nx,' dx:',map%dx
     898431            ELSE
     899432              map%dx = ginfo(8)
     900433            ENDIF
     901}}}
      902 * {{{ungrib/src/read_namelist.F}}}
     903 {{{
     9041       subroutine read_namelist(hstart, hend, delta_time, ntimes,&
     9052                  ordered_by_date, debug_level, out_format, prefix, is_global)
     9063             ! L. Fita. UC. August 2010
      9074            !   Adding new 'namelist.wps' value in '&ungrib' section: is_global (0, No; 1,
     9085             !     Yes [default 0]).
     9096             ! NOTE: This modification is only useful for global GRIBs with a regular
     9107             !   longitude distribution
     9118             !
     9129             ! EXPLANATION:
      91310            !   In some global files, grid information is not correctly extracted and/or
      91411            !   they could not be exactly fitted in an entire earth. By this modification,
      91512            !   grid spacing in x direction is computed from the number of grid points in
     91613            !   this direction
     917(...)
     91858            ! L. fita. UC. August 2010
     91959              INTEGER :: is_global
     920(...)
     92172              ordered_by_date, prefix, is_global
     922}}}
      923 * {{{ungrib/src/ungrib.F}}}
     924  . {{{
     92574       ! L. Fita. UC 2010 August
     92675          INTEGER :: is_global
     927(...)
     92897          call read_namelist(hstart, hend, interval, ntimes, &
     92998               ordered_by_date, debug_level, out_format, prefix, is_global)
     930(...)
     931207                       call rd_grib1(nunit1, gribflnm, level, field, &
     932208                            hdate, ierr, iuarr, debug_level, is_global)
     933}}}
      934The following will appear during {{{ungrib.exe}}} execution:
     935
     936{{{
     937*** Starting program ungrib.exe ***
     938Start_date =  1975-07-16_00:00:00 ,      End_date = 1975-07-30_00:00:00
     939output format is WPS
     940Path to intermediate files is ./
     941 ungrib - grib edition num           1
     942 *********** L. Fita. UC . August 2010 ***********
     943 ***       Assuming global regular grid.       ***
     944 *** Computing 'dx' from number of points 'Nx' ***
     945 *************************************************
     946 Nx =          128  dx:   2.812500
     947 *********** L. Fita. UC . August 2010 ***********
     948 ***       Assuming global regular grid.       ***
     949 *** Computing 'dx' from number of points 'Nx' ***
     950 *************************************************
     951 Nx =          128  dx:   2.812500
     952(...)
     953}}}
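
The value printed for ''dx'' can be checked by hand; a minimal shell sketch of the same arithmetic (360 degrees divided by the number of points in the x direction):

{{{
# Quick arithmetic check of the dx printed above (illustrative; any shell with bc).
nx=128
dx=$(echo "scale=6; 360.0 / $nx" | bc)
echo "Nx = $nx  dx = $dx"   # prints 2.812500, matching the ungrib.exe output
}}}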
     954'''NOTE:''' With data from CNRM in the period 1950-1970 the error is still there...
     955
     956== [x] SST missing values in coastal lines ==
      957Along coastlines, SST is badly interpolated. This is fixed by changing in {{{METGRID.TBL}}} how the SST interpolation is done (thanks to Dr. Priscilla A. Mooney, National University of Ireland, Maynooth, Ireland):
     958
     959{{{
     960= = = = = = = = = = = = = = = = = = = = = = = = = = = =
     961 name=SST
     962        interp_option=sixteen_pt+four_pt+wt_average_4pt+search
     963        missing_value=-1e+30
     964        interp_mask=LANDSEA(1)
     965        masked=land
     966        fill_missing=0.
     967        flag_in_output=FLAG_SST
     968 = = = = = = = = = = = = = = = = = = = = = = = = = = = =
     969}}}
     970== [] p4_error: latest msg from perror: Invalid argument ==
      971The simulation stops. The message appears at the first time step after opening the 'wrfrst' file.
     972
     973== [x] p4_error: OOPS: semop lock failed: -1 ==
      974The simulation stopped. Reference:
     975
     976 * http://www.mcs.anl.gov/research/projects/mpi/mpich1/docs/mpichman-chp4/node133.htm
      977Same as for ''p4_error: semget failed for setnum'' (see below).
      978
      979From ce01 run:
     980
     981{{{
     982cexec /opt/mpich/gnu/sbin/cleanipcs
     983}}}
     984== [] *** glibc detected *** malloc(): memory corruption: ==
      985The simulation stopped. In some rsl.error.00[nn] files the following line appears:
     986
     987 * rsl.error.0006
     988  . {{{
     989(...)
     990*** glibc detected *** malloc(): memory corruption: 0x000000000b215c50 ***
     991}}}
     992 * rsl.error.0013
     993{{{
     994(...)
     995*** glibc detected *** malloc(): memory corruption: 0x000000000af50bb0 ***
     996}}}
     997 * C-language related posts:
     998  * http://bytes.com/groups/c/223310-glibc-detected-malloc-memory-corruption-fast-0x0804c008
     999  * http://www.linuxquestions.org/questions/programming-9/glibc-detected-malloc-free-double-349135/
     1000 * WRF related post:
     1001  * http://forum.wrfforum.com/viewtopic.php?f=6&t=104
      1002 The error appeared during the CLWRF implementation. Some nasty numerical problem must be happening. Once the coding errors were repaired the error disappeared... (luckily?)
     1003== [x]  Missing required environment variable: MPIRUN_RANK ==
     1004WRF real.exe stopped with message:
     1005
     1006{{{
     1007PMGR_COLLECTIVE ERROR: unitialized MPI task: Missing required environment variable: MPIRUN_RANK
     1008mpiexec: Warning: task 0 exited with status 1.
     1009}}}
      1010Incorrect version of mpiexec. You must run an adequate mpiexec version; check which one is in the path with {{{which mpiexec}}}.
     1011
     1012== [] Different wrf.exe from different nodes ==
     1013From wn001 to wn024 {{{>ls -la /oceano/gmeteo/DATA/WRF/WRF_bin/3.1/WRF4G/MVAPICH/WRFV3/main/}}}
     1014
     1015{{{
     1016...
     1017-rw-rw----   1 lluis gmeteo    62797 May 19 13:28 wrf_ESMFMod.F
     1018-rwxr-x--x   1 lluis gmeteo 21147307 May 26 14:58 wrf.exe
     1019-rw-rw----   1 lluis gmeteo      918 May 19 13:28 wrf.F
     1020...
     1021}}}
     1022From wn025 {{{>ls -la /oceano/gmeteo/DATA/WRF/WRF_bin/3.1/WRF4G/MVAPICH/WRFV3/main/}}}
     1023
     1024{{{
     1025...
     1026-rw-rw----   1 lluis gmeteo    62797 May 19 13:28 wrf_ESMFMod.F
     1027-rwxr-x--x   0 lluis gmeteo 21147057 May 25 17:39 wrf.exe
     1028-rw-rw----   1 lluis gmeteo      918 May 19 13:28 wrf.F
     1029...
     1030}}}
      1031Differences in hard-link count (see [http://en.wikipedia.org/wiki/Ls ls] and [http://en.wikipedia.org/wiki/Hard_link Hard_Link]), date and size!? During a simulation each node is running a different wrf.exe!
     1032
      1033Problem 'solved' by rebooting wn025.
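
A quick way to confirm afterwards that every node sees the same binary (a sketch using the same {{{cexec}}} tool as in other entries of this page):

{{{
# Compare checksums of wrf.exe across the worker nodes; a node reporting a
# different hash (or an error) is the one that needs attention.
cexec 'md5sum /oceano/gmeteo/DATA/WRF/WRF_bin/3.1/WRF4G/MVAPICH/WRFV3/main/wrf.exe'
}}}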
     1034
     1035== [] mpiexec: Error: poll_or_block_event: tm_poll: tm: no event ==
      1036A second run attempt does not give this error ?¿!¡ No memory/space was left on the nodes (bad ending of a previous simulation).
     1037
     1038== [x] mvapich 'call system()' failed ==
      1039When WRF4G is used, the simulation stops when the 2nd output file starts to be written (probably due to $WRFGEL_SCRIPT?).
     1040
     1041See comments:
     1042
     1043 * http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2006-October/000394.html
     1044 * http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2008-November/002041.html
     1045And user guide http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html#x1-350007.1.2
     1046
      1047Old version of the Linux kernel. It is recommended that the kernel be at version 2.6.16 or newer:
     1048
     1049{{{
     1050Linux wn010.macc.unican.es 2.6.9-78.0.13.EL.cernsmp #1 SMP Mon Jan 19 14:00:58 CET 2009 x86_64 x86_64 x86_64 GNU/Linux
     1051}}}
      1052and the OFED version:
     1053
     1054{{{
     1055>mpichversion
     1056MPICH Version:          1.2.7
     1057MPICH Release date:     $Date: 2005/06/22 16:33:49$
     1058MPICH Patches applied:  none
     1059MPICH configure:        --with-device=ch_gen2 --with-arch=LINUX -prefix=/software/ScientificLinux/4.6/mvapich/1.1/pgi_7.1-6_gcc --with-romio --without-mpe -lib=-L/usr/lib64 -Wl,-rpath=/usr/lib64 -libverbs -libumad -lpthread
     1060MPICH Device:           ch_gen2
     1061}}}
      1062Problem solved for the moment by declaring a new environment variable:
     1063
     1064{{{
     1065export IBV_FORK_SAFE=1
     1066}}}
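
A minimal job-script fragment showing where the variable goes (a sketch; depending on the MPI launcher, the variable may also need to be forwarded explicitly to the remote ranks):

{{{
# Export the workaround before launching wrf.exe in the job script.
export IBV_FORK_SAFE=1
mpiexec ./wrf.exe
}}}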
     1067== [x] mpiexec: Warning: read_ib_one: protocol version 8 not known, but might still work. ==
     1068Error message when execute {{{mpiexec}}}:
     1069
     1070{{{
     1071mpiexec: Warning: read_ib_one: protocol version 8 not known, but might still work.
     1072mpiexec: Error: read_ib_one: mixed version executables (6 and 8), no hope.
     1073}}}
      1074This error message appears when a wrong version of ''mpiexec'' is used. One must indicate the correct one: {{{/software/ScientificLinux/4.6/mpiexec/mpiexec}}}
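
A sketch of how to check and correct which ''mpiexec'' is picked up (the directory below is the one given above; adjust it to the local installation):

{{{
which mpiexec                                              # version currently picked up
export PATH=/software/ScientificLinux/4.6/mpiexec:$PATH    # put the intended one first
which mpiexec                                              # should now show the correct binary
}}}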
     1075
     1076== [x] ECMWF ERA40 escena missing data ==
      1077Incomplete ERA40 data downloaded for the escena domain in /oceano/gmeteo/DATA/ECMWF/ERA40/escena. Years: 1968, 1969, 1971 and 1979.
     1078
     1079== [x] Large waiting in GRID-CSIC ==
      1080Jobs wait for more than 1 day in IFCA GRID-CSIC when selecting nodes=[N]:ppn=[M] (N=2, M=8).
     1081
      1082 * In EGEEUI01, nodes can be occupied by jobs using only one core, which makes it difficult for node-exclusive jobs to run. It is more adequate to submit jobs with the total number of cores, without requesting exclusivity of a physical machine (the EGEEUI01 cluster has 8-core nodes).
      1083 * Changes in ''wrf_AUTOlauncher_iteration.bash'' now make the core assignment as [N]*[M] without the {{{mpiexec -npernode [M]}}} line in {{{[template].job}}}. A new template has been created: {{{MPI_job-EGEEUI01.pbs}}}
     1084{{{
     1085#!/bin/bash (-)
     1086### Job name
     1087#PBS -N @JOBnameSIM@
     1088### Queue name
     1089#PBS -q lmeteo
     1090### Dependency
     1091#PBS -W depend=afterany:@IDpbs@
     1092###  Total number of processes
     1093#PBS -l nodes=@Nnodes@
     1094# This job's working directory
     1095echo Working directory is $PBS_O_WORKDIR
     1096cd $PBS_O_WORKDIR
     1097echo Running on host `hostname`
     1098echo Time is `date`
     1099echo Directory is `pwd`
     1100echo This jobs runs on the following processors:
     1101echo `cat $PBS_NODEFILE`
     1102##
     1103#Running WRF
     1104##
     1105export OMP_NUM_THREADS=@Ntrh@
     1106echo "Numero de Threads: $OMP_NUM_THREADS"
     1107echo "Numero de Jobs MPI: $Nprocess"
     1108mpiexec ./wrf.exe
     1109}}}
      1110This can only work if ''nodes'' is not interpreted as an entire physical machine; it must refer to a CPU (or core), as illustrated in the sketch after the links below. More information in:
     1111
     1112 * http://www.clusterresources.com/torquedocs21/2.1jobsubmission.shtml#resources
     1113 * http://www.clusterresources.com/products/mwm/docs/a.fparameters.shtml#j
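
For comparison, the two ways of requesting 16 cores look like this in the job header (a sketch; the numbers are placeholders):

{{{
#PBS -l nodes=2:ppn=8   # two complete 8-core machines: node-exclusive, long waits when busy
#PBS -l nodes=16        # 16 CPUs/cores wherever they are free, as recommended above
}}}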
     1114== [x] cshell error in wn010 ==
      1115On wn010 a systematic csh error appears just by opening a csh terminal:
     1116
     1117{{{setenv: Too many arguments}}}
     1118
      1119A problem in the ''csh.profile'' has been repaired.
     1120
     1121== [x] Stale NFS file handle ==
      1122In IFCA GRID-CSIC, a stale NFS file handle error appears with ''wrf.exe'' (for the BIGescena domain):
     1123
     1124{{{
     1125/var/spool/pbs/mom_priv/jobs/1070626.tor.SC: line 400: 22711 Bus error               mpiexec -npernode 8 ./wrf.exe
     1126rm: cannot remove `wrf.exe': Stale NFS file handle
     1127rm: cannot remove `*.TBL': Stale NFS file handle
     1128rm: cannot remove `*_DATA*': Stale NFS file handle
     1129rm: cannot remove `met_em*': Stale NFS file handle
     1130rm: cannot remove `wrfbdy*': Stale NFS file handle
     1131rm: cannot remove `wrfinput*': Stale NFS file handle
     1132/var/spool/pbs/mom_priv/jobs/1070626.tor.SC: line 345: /gpfs/ifca.es/meteo/forest//bats/change_in_file.bash: Stale NFS file handle
     1133/var/spool/pbs/mom_priv/jobs/1070626.tor.SC: line 356: cd: /gpfs/ifca.es/meteo/SCRATCH/BIGescena/1970_1975Restart28d/simulations/1970010100_1970012900: Stale NFS file handle
     1134(...)
     1135}}}
      1136Some errors occurred on the NFS server.
     1137
     1138== [x] metgrid.exe Segmentation fault ==
      1139While ''metgrid.exe'' is running, a segmentation fault appears (in IFCA GRID-CSIC, for the Africa_25km domain). From ''[job].e[nnnnn]'':
     1140
     1141{{{
     1142/var/spool/pbs/mom_priv/jobs/1073948.tor.SC: line 195: 19831 Segmentation fault
     1143}}}
      1144The global analyses used were defined only for a European region.
     1145
     1146== [] CAM NaN ==
      1147''module_ra_cam_support.F'' generates NaN outputs at a given time step (about the 350th Julian day of 1996 and 2001, i.e. 1996/XII/15 and 2001/XII/16). ''rsl.out.[nnnn]'' files grow until they fill the hard disk (because of the output written to them). The following has been done:
     1148
     1149{{{
     1150 vert_interpolate: mmr < 0, m, col, lev, mmr            2            2             1                       NaN
     1151 vert_interpolate: aerosol(k),(k+1)   1.0000000116860974E-007     0.000000000000000
     1152 vert_interpolate: pint(k+1),(k)                       NaN                        NaN
     1153 n,c            1            1
     1154}}}
      1155 * '''FATAL_ERROR signal:''' a ''call wrf_error_fatal ('Error of computation')'' line has been introduced in the ''WRFV3/phys/module_ra_cam_support.F'' file
      1156 * '''isnand():''' internal PGI intrinsic added in several places of ''module_ra_cam_support.F'' and ''module_ra_cam.F'' to locate where the first 'NaN' values appear
     1157Possible WRFv3.0.1.1 bug related to temporal interpolation of CO,,2,, concentrations at 15/XII of any year (change of monthly value)
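
A quick check (a sketch; GNU grep) to locate the first NaN report before the rsl files fill the disk:

{{{
# -m1 stops at the first match per file, -n prints the line number.
grep -n -m1 "NaN" rsl.out.0* rsl.error.0* 2>/dev/null | head
}}}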
     1158
     1159== [x] p4_error: semget failed for setnum: 12 ==
     1160Information sources:
     1161
     1162 * http://www.mcs.anl.gov/research/projects/mpi/mpich1/docs/mpichman-chp4/node133.htm
     1163 * https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2008-May/030470.html
     1164>> This error means that there is not enough shared memory available to allocate a new memory segment for interprocess communication. Often what happens is there are some extra memory segments left over from a crash or programming error of a previous job that needs to be cleaned up. There is a script called cleanipcs that will remove all of your left over ipcs. Users are responsible for cleaning up extra shared memory segments after a crash or when their job is complete.
     1165
      1166You can use {{{/usr/bin/ipcs}}} to check the shared memory state on one node (example output of {{{ssh wn013 ipcs}}}):
     1167
     1168{{{
     1169------ Shared Memory Segments --------
     1170key        shmid      owner      perms      bytes      nattch     status
     11710x00000000 0          root      644        72         2
     11720x00000000 32769      root      644        16384      2
     11730x00000000 65538      root      644        280        2
     11740x00000000 2654211    lluis     600        33554432   0
     1175------ Semaphore Arrays --------
     1176key        semid      owner      perms      nsems
     11770x000000a7 0          root      666        1
     11780x00000000 11337729   lluis     600        10
     11790x00000000 11370498   lluis     600        10
     11800x00000000 11403267   lluis     600        10
     11810x00000000 11436036   lluis     600        10
     11820x00000000 11468805   lluis     600        10
     11830x00000000 11501574   lluis     600        10
     11840x00000000 11534343   lluis     600        10
     11850x00000000 11567112   lluis     600        10
     11860x00000000 11599881   lluis     600        10
     11870x00000000 11632650   lluis     600        10
     11880x00000000 11665419   lluis     600        10
     11890x00000000 11698188   lluis     600        10
     11900x00000000 11730957   lluis     600        10
     11910x00000000 11763726   lluis     600        10
     11920x00000000 11796495   lluis     600        10
     11930x00000000 11829264   lluis     600        10
     11940x00000000 11862033   lluis     600        10
     11950x00000000 11894802   lluis     600        10
     11960x00000000 11927571   lluis     600        10
     11970x00000000 11960340   lluis     600        10
     11980x00000000 11993109   lluis     600        10
     11990x00000000 12025878   lluis     600        10
     12000x00000000 12058647   lluis     600        10
     12010x00000000 14352408   lluis     600        10
     12020x00000000 14385177   lluis     600        10
     1203------ Message Queues --------
     1204key        msqid      owner      perms      used-bytes   messages
     1242}}}
     1243Use the following command to clean up all memory segments owned by your user id on a cluster:
     1244
     1245{{{cexec /opt/mpich/gnu/sbin/cleanipcs}}}
     1246
      1247Or for each working node (be careful not to run the script on any node with a correctly running simulation!!):
     1248
     1249{{{ssh wn[NNN] /software/ScientificLinux/4.6/mpich/1.2.7p1/pgi_7.1-6_gcc/sbin/cleanipcs}}}
     1250
      1251After that (on wn013):
     1252
     1253{{{
     1254------ Shared Memory Segments --------
     1255key        shmid      owner      perms      bytes      nattch     status
     12560x00000000 0          root      644        72         2
     12570x00000000 32769      root      644        16384      2
     12580x00000000 65538      root      644        280        2
     1259------ Semaphore Arrays --------
     1260key        semid      owner      perms      nsems
     12610x000000a7 0          root      666        1
     1262------ Message Queues --------
     1263key        msqid      owner      perms      used-bytes   messages
     1264}}}
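
If ''cleanipcs'' is not available on a node, a rough manual equivalent is the following sketch (it assumes the default ipcs column order key/id/owner/... shown above; run it only on nodes without an active simulation):

{{{
# Remove semaphore arrays and shared memory segments owned by the current user.
for id in $(ipcs -s | awk -v u="$USER" '$3 == u {print $2}'); do ipcrm -s "$id"; done
for id in $(ipcs -m | awk -v u="$USER" '$3 == u {print $2}'); do ipcrm -m "$id"; done
}}}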
     1265== [x] P4_GLOBMEMSIZE ==
      1266Not enough memory for the mpich processes of the simulation. The error message looks like:
     1267
     1268{{{
     1269p3_15324: (1.777344) xx_shmalloc: returning NULL; requested 262192 bytes
     1270p3_15324: (1.777344) p4_shmalloc returning NULL; request = 262192 bytes
     1271You can increase the amount of memory by setting the environment variable
     1272P4_GLOBMEMSIZE (in bytes); the current size is 4194304
     1273p3_15324:  p4_error: alloc_p4_msg failed: 0
     1274}}}
      1275Typical error for simulations with domains as big as ''Europe_10'' and ''BIGescena''. The default value is 4 MB (4194304).
     1276
      1277Increase the value to one of the following (a sketch for computing the byte count follows the list):
     1278
     1279 * '''32 MB''' {{{ export P4_GLOBMEMSIZE=33554432}}}
     1280 * '''64 MB''' {{{ export P4_GLOBMEMSIZE=67108864}}}
     1281 * '''128 MB''' {{{ export P4_GLOBMEMSIZE=134217728}}}
     1282 * '''256 MB''' {{{ export P4_GLOBMEMSIZE=268435456}}}
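
The byte count can also be computed inline instead of being hard-coded (a minimal sketch):

{{{
export P4_GLOBMEMSIZE=$((128 * 1024 * 1024))   # 128 MB = 134217728 bytes
echo $P4_GLOBMEMSIZE
}}}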
     1283== [x] SKINTEMP not found ==
      1284ERA40 ECMWF files have a different coding of the variables. A modification of Vtable.ECMWF is carried out:
     1285
      1286Original lines:
     1287
     1288{{{
     1289 34 |  1   |   0  |      | SST      | K        | Sea-Surface Temperature                  |
     1290139 | 112  |   0  |   7  | ST000007 | K        | T of 0-7 cm ground layer                 |
     1291}}}
      1292Modified lines:
     1293
     1294{{{
     1295139 |  1   |   0  |      | SST      | K        | Sea-Surface Temperature                  |
     1296139 | 112  |   0  |   7  | SKINTEMP | K        | T of 0-7 cm ground layer                 |
     1297}}}
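
After editing, a quick way to confirm that only the intended lines changed (a sketch; it assumes a backup copy ''Vtable.ECMWF.orig'' was kept and that the table lives under {{{ungrib/Variable_Tables}}} in the WPS tree):

{{{
cd ungrib/Variable_Tables
diff Vtable.ECMWF.orig Vtable.ECMWF   # should show only the SST / SKINTEMP lines
}}}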
     1298== [x] WOULD GO OFF TOP: KF_ETA_PARA I,J,DPTHMX,DPMIN      81    78     NaN    5000.000 ==
     1299See http://forum.wrfforum.com/viewtopic.php?f=6&t=263
     1300
      1301Many causes are possible: CFL violations, problems with the initial or boundary conditions, etc. Lowering the time step or switching off the feedback between nests are possible solutions (see the sketch below).
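
A sketch of both workarounds applied to {{{namelist.input}}} ({{{time_step}}} and {{{feedback}}} are standard WRF ''&domains'' options; the values are only illustrative):

{{{
# Lower the time step and switch off feedback between nests, then check the result.
sed -i -e 's/^\( *time_step *=\).*/\1 90,/' \
       -e 's/^\( *feedback *=\).*/\1 0,/' namelist.input
grep -E 'time_step|feedback' namelist.input
}}}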
     1302
     1303== [x] Metgrid error:  Error in ext_pkg_write_field in metgrid.log ==
     1304Also in log/metgrid_1995030912.out:
     1305
     1306{{{
     1307ERROR: Error in ext_pkg_write_field
     1308 WRF_DEBUG: Warning DIM            4 , NAME num_metgrid_levels REDIFINED  by var GHT           17          18  in wrf_io.F90 line        2424
     1309}}}
      1310This error means that probably one or more surface variables are missing in the model input (for example, NCEP reanalyses). The input grib files must be checked and fixed; a quick inventory check is sketched below.
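
A sketch of such a check (it assumes the ''wgrib'' inventory tool is available and that the input files are the {{{GRIBFILE.*}}} links created by ''link_grib.csh''; ''grib_ls'' from grib_api can be used in the same way):

{{{
# List the first records of every input GRIB file to verify that the expected
# surface fields are present.
for f in GRIBFILE.*; do
  echo "== $f"
  wgrib "$f" | head
done
}}}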
     1311
     1312== [] forrtl: severe (174): SIGSEGV, segmentation fault occurred ==
     1313{{{
     1314forrtl: severe (174): SIGSEGV, segmentation fault occurred
     1315Image PC Routine Line Source wrf.exe 00000000013EF561 Unknown Unknown
     1316Unknown
     1317wrf.exe 00000000013F0727 Unknown Unknown Unknown
     1318wrf.exe 00000000013F1E68 Unknown Unknown Unknown
     1319wrf.exe 00000000011BB5CB Unknown Unknown Unknown
     1320wrf.exe 0000000000DE0913 Unknown Unknown Unknown
     1321wrf.exe 0000000000DDAEBD Unknown Unknown Unknown
     1322wrf.exe 00000000009AF823 Unknown Unknown Unknown
     1323wrf.exe 0000000000690D01 Unknown Unknown Unknown
     1324wrf.exe 000000000068DB21 Unknown Unknown Unknown
     1325wrf.exe 000000000047BC1B Unknown Unknown Unknown
     1326wrf.exe 000000000047B049 Unknown Unknown Unknown
     1327wrf.exe 000000000047AFEC Unknown Unknown Unknown
     1328libc.so.6 0000003AD001D994 Unknown Unknown Unknown
     1329wrf.exe 000000000047AEE9 Unknown Unknown Unknown
     1330}}}
      1331Causes are unknown, but it worked just by submitting the simulation again, without any change.
     1332
     1333== [] wrf.exe: posixio.c:213: px_pgout: Assertion `*posp == ((off_t)(-1)) || *posp == lseek(nciop->fd, 0, 1)' failed. ==
      1334It appeared in a continuous simulation with spectral nudging, using WRF 3.1.1. rsl.error.0000 shows:
     1335
     1336{{{
     1337wrf.exe: posixio.c:213: px_pgout: Assertion `*posp == ((off_t)(-1)) || *posp == lseek(nciop->fd, 0, 1)' failed.
     1338forrtl: error (76): Abort trap signal
     1339Image              PC                Routine            Line        Source
     1340libc.so.6          0000003AD0030265  Unknown               Unknown  Unknown
     1341libc.so.6          0000003AD0031D10  Unknown               Unknown  Unknown
     1342libc.so.6          0000003AD00296E6  Unknown               Unknown  Unknown
     1343wrf.exe            000000000154368A  Unknown               Unknown  Unknown
     1344wrf.exe            0000000001518A2D  Unknown               Unknown  Unknown
     1345wrf.exe            000000000152741E  Unknown               Unknown  Unknown
     1346wrf.exe            00000000014CCD30  Unknown               Unknown  Unknown
     1347wrf.exe            00000000014CBADD  Unknown               Unknown  Unknown
     1348wrf.exe            00000000014BAD59  Unknown               Unknown  Unknown
     1349wrf.exe            00000000014B76A3  Unknown               Unknown  Unknown
     1350wrf.exe            0000000000BB258D  Unknown               Unknown  Unknown
     1351wrf.exe            0000000000BAED79  Unknown               Unknown  Unknown
     1352wrf.exe            0000000000BAE7F8  Unknown               Unknown  Unknown
     1353wrf.exe            0000000000BADD02  Unknown               Unknown  Unknown
     1354wrf.exe            0000000000BADA9E  Unknown               Unknown  Unknown
     1355wrf.exe            0000000000DD5E47  Unknown               Unknown  Unknown
     1356wrf.exe            00000000007A81D6  Unknown               Unknown  Unknown
     1357wrf.exe            00000000006B8424  Unknown               Unknown  Unknown
     1358wrf.exe            0000000000653E19  Unknown               Unknown  Unknown
     1359wrf.exe            0000000000677927  Unknown               Unknown  Unknown
     1360wrf.exe            0000000000674047  Unknown               Unknown  Unknown
     1361wrf.exe            00000000004C9DF7  Unknown               Unknown  Unknown
     1362wrf.exe            000000000047B0A3  Unknown               Unknown  Unknown
     1363wrf.exe            000000000047B057  Unknown               Unknown  Unknown
     1364wrf.exe            000000000047AFEC  Unknown               Unknown  Unknown
     1365libc.so.6          0000003AD001D994  Unknown               Unknown  Unknown
     1366wrf.exe            000000000047AEE9  Unknown               Unknown  Unknown
     1367}}}
     1368wrf_2001112400.out shows:
     1369
     1370{{{
     1371/oceano/gmeteo/WORK/ASNA/WRF/run/SeaWind_N1540_SN/SeaWind_N1540_SN/0029/bin/wrf_wrapper.exe: line 9:  4500 Aborted                 ${0/_wrapper/} $*
     1372}}}
     1373Causes are unknown.
     1374
     1375== [] No error, wrf just stops (¿¡?!) ==
     1376
      1377Change the debug_level (up to 300) in the &time_control section of namelist.input, as in the sketch below.
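
A sketch of the change ({{{debug_level}}} belongs to the ''&time_control'' section of {{{namelist.input}}}):

{{{
sed -i 's/^\( *debug_level *=\).*/\1 300,/' namelist.input
grep debug_level namelist.input
}}}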
     1378
      1379If there still isn't any informative error message, run wrf using the debugging version (OMPIchk).