Posts for the month of July 2012

How the Solaris mpt_sas driver names SATA devices

With the LSI SAS2 driver for Solaris (mpt_sas), the device names for SATA drives attached to SAS ports are a bit confusing.

Apparently the main reason for this confusing naming is that the mpt_sas driver supports device multipathing (MPxIO). Therefore it needs an identification mechanism that is resistant to changes in the SAS topology. For a SAS device this is done using the SAS WWN, but for SATA devices it can be done using the GUID of the SATA device.
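
If MPxIO renaming is in effect, the stmsboot utility can map the old cXtYdZ controller-based names to the new WWN-based ones (a sketch, assuming a Solaris release whose stmsboot supports the mpt_sas driver; the output depends on the local configuration):

root@seal.macc.unican.es:~# stmsboot -D mpt_sas -L    # list non-STMS to STMS device name mappings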

From the source code of OpenSolaris's mpt_sas driver I have discovered that the driver is inquiring something called 'page 83' (the SCSI VPD device identification page):

uint64_t mptsas_get_sata_guid(mptsas_t *mpt, mptsas_target_t *ptgt, int lun)
{
        uint64_t        sata_guid = 0, *pwwn = NULL;
        int             target = ptgt->m_devhdl;
        uchar_t         *inq83 = NULL;
        int             inq83_len = 0xFF;
        uchar_t         *dblk = NULL;
        int             inq83_retry = 3;
        int             rval = DDI_FAILURE;

        inq83 = kmem_zalloc(inq83_len, KM_SLEEP);

inq83_retry:
        rval = mptsas_inquiry(mpt, ptgt, lun, 0x83, inq83,
            inq83_len, NULL, 1);
        if (rval != DDI_SUCCESS) {
                mptsas_log(mpt, CE_WARN, "!mptsas request inquiry page "
                    "0x83 for target:%x, lun:%x failed!", target, lun);
                goto out;
        }
        /* According to SAT2, the first descriptor is logic unit name */
        dblk = &inq83[4];
        if ((dblk[1] & 0x30) != 0) {
                mptsas_log(mpt, CE_WARN, "!Descriptor is not lun associated.");
                goto out;
        }
        pwwn = (uint64_t *)(void *)(&dblk[4]);
        if ((dblk[4] & 0xf0) == 0x50) {
                sata_guid = BE_64(*pwwn);
                goto out;
        } else if (dblk[4] == 'A') {
                NDBG20(("SATA drive has no NAA format GUID."));
                goto out;
        } else {
                /* The data is not ready, wait and retry */
                inq83_retry--;
                if (inq83_retry <= 0) {
                        goto out;
                }
                NDBG20(("The GUID is not ready, retry..."));
                delay(1 * drv_usectohz(1000000));
                goto inq83_retry;
        }
out:
        kmem_free(inq83, inq83_len);
        return (sata_guid);
}

As an exercise, I will try to discover this information for the following device:

root@seal.macc.unican.es:~# prtconf -v /dev/dsk/c10t5000CCA221C25B1Ed0
disk, instance #217
    Driver properties:
        name='inquiry-serial-no' type=string items=1 dev=none
            value='JK1133YAG55NLU'
        name='pm-components' type=string items=3 dev=none
            value='NAME=spindle-motor' + '0=off' + '1=on'
        name='pm-hardware-state' type=string items=1 dev=none
            value='needs-suspend-resume'
        name='ddi-failfast-supported' type=boolean dev=none
        name='ddi-kernel-ioctl' type=boolean dev=none
        name='fm-ereport-capable' type=boolean dev=none
        name='device-nblocks' type=int64 items=1 dev=none
            value=00000000e8e088b0
        name='device-blksize' type=int items=1 dev=none
            value=00000200
    Hardware properties:
        name='devid' type=string items=1
            value='id1,sd@n5000cca221c25b1e'
        name='inquiry-device-type' type=int items=1
            value=00000000
        name='inquiry-revision-id' type=string items=1
            value='JKAOA20N'
        name='inquiry-product-id' type=string items=1
            value='HDS722020ALA330'
        name='inquiry-vendor-id' type=string items=1
            value='Hitachi'
        name='class' type=string items=1
            value='scsi'
        name='obp-path' type=string items=1
            value='/pci@7a,0/pci8086,340e@7/pci1000,3080@0/disk@w5000cca221c25b1e,0'
        name='pm-capable' type=int items=1
            value=00000001
        name='guid' type=string items=1
            value='5000cca221c25b1e'
        name='sas-mpt' type=boolean
        name='port-wwn' type=byte items=8
            value=50.00.cc.a2.21.c2.5b.1e
        name='target-port' type=string items=1
            value='5000cca221c25b1e'
        name='compatible' type=string items=4
            value='scsiclass,00.vATA.pHitachi_HDS72202.rA20N' + 'scsiclass,00.vATA.pHitachi_HDS72202' + 'scsiclass,00' + 'scsiclass'
        name='lun' type=int items=1
            value=00000000

We can query the '83h' page (section 10.3.4 in the attached document) using the LSIUTIL tool (version 1.63):

root@seal.macc.unican.es:~/LSIUTIL# ./lsiutil

LSI Logic MPT Configuration Utility, Version 1.63, June 4, 2009

5 MPT Ports found

     Port Name         Chip Vendor/Type/Rev    MPT Rev  Firmware Rev  IOC
 1.  mpt1              LSI Logic SAS1068E B3     105      011e0000     0
 2.  mpt2              LSI Logic SAS1068E B3     105      011e0000     0
 3.  mpt_sas0          LSI Logic SAS2008 03      200      0b000000     0
 4.  mpt_sas1          LSI Logic SAS2008 03      200      0b000000     0
 5.  mpt_sas6          LSI Logic SAS2008 02      200      0a000200     0

Select a device:  [1-5 or 0 to quit] 3

 1.  Identify firmware, BIOS, and/or FCode
 2.  Download firmware (update the FLASH)
 4.  Download/erase BIOS and/or FCode (update the FLASH)
 8.  Scan for devices
10.  Change IOC settings (interrupt coalescing)
13.  Change SAS IO Unit settings
16.  Display attached devices
20.  Diagnostics
21.  RAID actions
23.  Reset target
42.  Display operating system names for devices
43.  Diagnostic Buffer actions
45.  Concatenate SAS firmware and NVDATA files
59.  Dump PCI config space
60.  Show non-default settings
61.  Restore default settings
66.  Show SAS discovery errors
69.  Show board manufacturing information
97.  Reset SAS link, HARD RESET
98.  Reset SAS link
99.  Reset port
 e   Enable expert mode in menus
 p   Enable paged mode
 w   Enable logging

Main menu, select an option:  [1-99 or e/p/w or 0 to quit] e

Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 20

 1.  Inquiry Test
 2.  WriteBuffer/ReadBuffer/Compare Test
 3.  Read Test
 4.  Write/Read/Compare Test
 5.  Write Test
 6.  Read/Compare Test
 7.  Log Sense Test
 8.  Read Capacity / Read Block Limits Test
 9.  Mode Page Test
10.  SATA Identify Device Test
11.  SATA Clear Affiliation Test
12.  Display phy counters
13.  Clear phy counters
14.  SATA SMART Read Test
15.  SEP (SCSI Enclosure Processor) Test
16.  Issue product-specific SAS IO Unit Control
17.  Diag data upload
18.  Report LUNs Test
19.  Drive firmware download
20.  Expander firmware download
21.  Read Logical Blocks
22.  Write Logical Blocks
23.  Verify Logical Blocks
24.  Read Buffer (for firmware upload)
25.  Display Expander Log entries
26.  Clear (erase) Expander Log entries
29.  Diagnostic Page Test
30.  Inject media error
31.  Repair media error
32.  Set software write protect
33.  Clear software write protect
34.  Enable read cache
35.  Disable read cache
36.  Enable write cache
37.  Disable write cache
98.  Reset expander
99.  Reset port
 e   Disable expert mode in menus
 p   Enable paged mode
 w   Enable logging

Diagnostics menu, select an option:  [1-99 or e/p/w or 0 to quit] 1

Bus:  [0-2 or RETURN to quit] 0
Target:  [0-255 or RETURN to quit] 31
LUN:  [0-255 or RETURN to quit] 0
VPD Page:  [00-FF or RETURN for normal Inquiry] 83

 B___T___L  Page
 0  31   0   83

36 bytes of Inquiry Data returned

0000 : 00 83 00 20 01 03 00 08 50 00 cc a2 21 c2 5b 1e            P   ! [
0010 : 61 93 00 08 50 03 04 80 01 15 5a 61 01 14 00 04    a   P     Za
0020 : 00 00 00 00

Bytes 9-16 of the returned data (50 00 cc a2 21 c2 5b 1e) correspond to the device name '/dev/dsk/c10t5000CCA221C25B1Ed0' given by the mpt_sas driver. We can check that the serial number corresponds to this drive by querying the '80h' page (section 10.3.3 in the attached document):

Bus:  [0-2 or RETURN to quit] 0
Target:  [0-255 or RETURN to quit] 31
LUN:  [0-255 or RETURN to quit] 0
VPD Page:  [00-FF or RETURN for normal Inquiry] 80

 B___T___L  Page
 0  31   0   80

24 bytes of Inquiry Data returned

0000 : 00 80 00 14 20 20 20 20 20 20 4a 4b 31 31 33 33              JK1133
0010 : 59 41 47 35 35 4e 4c 55                            YAG55NLU
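
As a quick sanity check, the ASCII bytes of the page 80h payload can be decoded by hand, for instance with xxd (a helper pipeline run on any machine that has xxd installed; it is not part of lsiutil):

$ echo '4a 4b 31 31 33 33 59 41 47 35 35 4e 4c 55' | xxd -r -p; echo
JK1133YAG55NLU

which matches the inquiry-serial-no property reported by prtconf above.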

  • Posted: 2012-07-29 19:04 (Updated: 2012-07-29 19:15)
  • Author: antonio
  • Categories: (none)

Grid Certificates

PROBLEM

After migrating the SE to another machine, you may have problems related to grid certificates. If you have copied the whole content of /etc/grid-security, it probably includes files starting with % in the /etc/grid-security/certificates directory (these files are encoded).

SOLUTION

If you remove these files, you should get your service running fine again.
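
A minimal way to do this, assuming the stale files are exactly the ones whose names start with a literal % (review the list before deleting; the hostname in the prompt is illustrative):

[root@se01 ~]# ls /etc/grid-security/certificates/%*     # review what will be removed
[root@se01 ~]# rm /etc/grid-security/certificates/%*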

How to deploy a CREAM CE

The following deployment models are possible for a CREAM-CE:

  • CREAM-CE can be configured without worrying about the glite-CLUSTER node. This can be useful for small sites that don't want to worry about cluster/subcluster configurations because they have a very simple setup. In this case the CREAM-CE will publish a single cluster/subcluster. This is called no cluster mode. It is enabled, as described below, by defining the yaim setting CREAMCE_CLUSTER_MODE=no (or by not defining that variable at all).
  • CREAM-CE can work in cluster mode using the glite-CLUSTER node type. This is done, as described below, by defining the yaim setting CREAMCE_CLUSTER_MODE=yes. The CREAM-CE can be on the same host as, or on a different host from, the glite-CLUSTER node.

Installation of a CREAM CE node in no cluster mode

We select this mode because it is the easiest way to deploy a CREAM CE site. This is a configuration of a CREAM CE in no cluster mode using Torque as the batch system, with the CREAM CE not also being the Torque server. In addition, the CREAM CE will act as APEL publisher and site BDII.

  • Repositories
    • the EPEL repository
    • the EMI middleware repository
    • the CA repository
      [root@ce01 ~]# cat /etc/yum.repos.d/epel.repo 
      [epel]
      name=Extra Packages for Enterprise Linux 5 - $basearch
      #baseurl=http://download.fedoraproject.org/pub/epel/5/$basearch
      mirrorlist=http://mirrors.fedoraproject.org/mirrorlist?repo=epel-5&arch=$basearch
      failovermethod=priority
      enabled=1
      gpgcheck=1
      gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL
      
      [epel-debuginfo]
      name=Extra Packages for Enterprise Linux 5 - $basearch - Debug
      #baseurl=http://download.fedoraproject.org/pub/epel/5/$basearch/debug
      mirrorlist=http://mirrors.fedoraproject.org/mirrorlist?repo=epel-debug-5&arch=$basearch
      failovermethod=priority
      enabled=0
      gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL
      gpgcheck=1
      
      [epel-source]
      name=Extra Packages for Enterprise Linux 5 - $basearch - Source
      #baseurl=http://download.fedoraproject.org/pub/epel/5/SRPMS
      mirrorlist=http://mirrors.fedoraproject.org/mirrorlist?repo=epel-source-5&arch=$basearch
      failovermethod=priority
      enabled=0
      gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL
      gpgcheck=1
      
      [root@ce01 ~]# cat /etc/yum.repos.d/UMD-1-base.repo 
      [UMD-1-base]
      name=UMD 1 base (SL5)
      baseurl=http://repository.egi.eu/sw/production/umd/1/sl5/$basearch/base
      protect=1
      enabled=1
      # To use priorities you must have yum-priorities installed
      priority=45
      gpgcheck=1
      gpgkey=http://emisoft.web.cern.ch/emisoft/dist/EMI/1/RPM-GPG-KEY-emi http://repo-rpm.ige-project.eu/RPM-GPG-KEY-IGE
      [root@ce01 ~]# cat /etc/yum.repos.d/UMD-1-updates.repo 
      [UMD-1-updates]
      name=UMD 1 updates (SL5)
      baseurl=http://repository.egi.eu/sw/production/umd/1/sl5/$basearch/updates
      protect=1
      enabled=1
      # To use priorities you must have yum-priorities installed
      priority=40
      gpgcheck=1
      gpgkey=http://emisoft.web.cern.ch/emisoft/dist/EMI/1/RPM-GPG-KEY-emi http://repo-rpm.ige-project.eu/RPM-GPG-KEY-IGE
      
      [root@ce01 ~]# cat /etc/yum.repos.d/egi-trustanchors.repo 
      [EGI-trustanchors]
      name=EGI-trustanchors
      baseurl=http://repository.egi.eu/sw/production/cas/1/current/
      gpgkey=http://repository.egi.eu/sw/production/cas/1/GPG-KEY-EUGridPMA-RPM-3
      gpgcheck=1
      enabled=1
      
  • yum install
    [root@ce01 ~]# yum clean all
    [root@ce01 ~]# yum install yum-protectbase
    [root@ce01 ~]# yum install ca-policy-egi-core 
    [root@ce01 ~]# yum install xml-commons-apis 
    [root@ce01 ~]# yum install emi-cream-ce
    [root@ce01 ~]# yum install emi-torque-utils
    [root@ce01 ~]# yum install openldap2.4 openldap2.4-servers
    [root@ce01 ~]# yum install nfs-utils.x86_64
    
  • CREAM CE update
    [root@ce01 ~]# yum update
    
  • Torque

If you want to install a different version of Torque for some reason (to see the currently installed version, run rpm -qa | grep torque-client), you can execute these commands:

[root@ce01 ~]# tar xzvf torque-2.3.9.tar.gz
[root@ce01 ~]# cd torque-2.3.9
[root@ce01 ~]# ./configure --prefix=/usr
[root@ce01 ~]# make install
  • Install host certificate

Once you have obtained a valid certificate, you have to create hostcert.pem and hostkey.pem and place them in the /etc/grid-security directory, then set the proper modes and ownership:

[root@ce01 ~]# cd /etc/grid-security/
[root@ce01 ~]# openssl pkcs12 -nocerts -nodes -in ce02.p12 -out hostkey.pem
[root@ce01 ~]# openssl pkcs12 -clcerts -nokeys -in ce02.p12 -out hostcert.pem
[root@ce01 ~]# chown root.root hostcert.pem
[root@ce01 ~]# chown root.root hostkey.pem
[root@ce01 ~]# chmod 644 hostcert.pem
[root@ce01 ~]# chmod 400 hostkey.pem
  • Create site-info.def file for YAIM
    [root@ce01 ~]# cat siteinfo/site_ce.def 
    MY_DOMAIN=macc.unican.es
    INSTALL_ROOT=/opt
    
    #SERVICIOS CENTRALES 
    RB_HOST=rb-eela.ceta-ciemat.es 
    LB_HOST=rb-eela.ceta-ciemat.es 
    WMS_HOST=wms-eela.ceta-ciemat.es 
    MON_HOST=ce01.$MY_DOMAIN
    LFC_HOST=lfc01.lip.pt
    BDII_HOST=bdii.pic.es
    GSSKLOG=no
    
    PX_HOST=grid001.ct.infn.it  
    REG_HOST=nosirve
    
    WN_LIST=/root/siteinfo/wn-list.conf
    USERS_CONF=/root/siteinfo/users.conf
    GROUPS_CONF=/root/siteinfo/groups.conf
    SLAPD=/usr/sbin/slapd2.4
    YAIM_VERSION=4.0.3
    
    INSTALL_ROOT=/opt
    OUTPUT_STORAGE=/tmp/jobOutput
    JAVA_LOCATION=/usr
    CRON_DIR=/etc/cron.d
    GLOBUS_TCP_PORT_RANGE="20000,25000"
    
    #########################
    #CREAM CE
    ########################
    CEMON_HOST=ce01.$MY_DOMAIN
    CREAM_DB_USER=creamdb
    CREAM_DB_PASSWORD=cream1730
    CREAM_CE_STATE="Production"
    CREAMCE_CLUSTER_MODE=no
    CE_CAPABILITY="none"
    CE_OTHERDESCR="Cores=2"
    SE_MOUNT_INFO_LIST="$DPM_FILESYSTEMS"
    CONFIG_MAUI=no
    
    ########################
    
    APEL_MYSQL_HOST=$MON_HOST
    MYSQL_PASSWORD=mysql1730
    APEL_DB_PASSWORD="apel1730"
    APEL_PUBLISH_USER_DN=yes
    
    GRIDICE_SERVER_HOST=$MON_HOST
    GRIDICE_MON_WN=no
    GRIDICE_HIDE_USER_DN=no
    
    
    #########################################
    # Torque server configuration variables #
    #########################################
    BATCH_SERVER=encina
    JOB_MANAGER=lcgpbs
    CE_BATCH_SYS=torque
    BATCH_BIN_DIR=/usr/bin/
    BATCH_VERSION=torque-2.1.9
    BATCH_LOG_DIR=/var/spool/torque/
    CE_PHYSCPU=14 
    CE_LOGCPU=14 
    CE_OS_ARCH=x86_64
    
    
    ##############################
    # CE configuration variables #
    ##############################
    
    CE_HOST=ce01.$MY_DOMAIN
    CE_CPU_MODEL=PD   #PENTIUM D 930, 3.0GHZ/2X2
    CE_CPU_VENDOR=intel
    CE_CPU_SPEED=3000
    CE_OS="ScientificSL"
    CE_OS_RELEASE=5.5
    CE_OS_VERSION="SLC"
    CE_MINPHYSMEM=2048
    CE_MINVIRTMEM=4096 #/proc/meminfo
    CE_SMPSIZE=2
    CE_SI00=381
    CE_SF00=0
    CE_OUTBOUNDIP=TRUE
    CE_INBOUNDIP=TRUE
    CE_RUNTIMEENV="
        LCG-2
        LCG-2_1_0
        LCG-2_1_1
        LCG-2_2_0
        LCG-2_3_0
        LCG-2_3_1
        LCG-2_4_0
        LCG-2_5_0
        LCG-2_6_0
        LCG-2_7_0
        GLITE-3_0_0
        GLITE-3_1_0
        R-GMA
    "
    
    ###############################
    # DPM configuration variables #
    ###############################
    
    DPM_HOST="se01.$MY_DOMAIN"   # my-dpm.$MY_DOMAIN. DPM head node hostname
    DPMPOOL=permanent #the_dpm_pool_name
    DPM_FILESYSTEMS="$DPM_HOST:/storage"
    DPM_DB_USER=dpm-db-user
    DPM_DB_PASSWORD=dpm1730
    DPM_DB_HOST=$DPM_HOST
    DPM_INFO_USER=dpm-info-user
    DPM_INFO_PASS=dpm1730
    DPMFSIZE=200M
    
    ###########
    # SE_LIST #
    ###########
    SE_LIST="$DPM_HOST"
    SE_ARCH="multidisk" # "disk, tape, multidisk, other"
    
    
    ################################
    # BDII configuration variables #
    ################################
    SITE_BDII_HOST=ce01.$MY_DOMAIN
    SITE_DESC="University of Cantabria site"
    SITE_SECURITY_EMAIL="iglesiaa@gestion.unican.es"
    SITE_EMAIL=grid-prod@unican.es
    SITE_NAME=UNICAN
    SITE_LOC="Santander,SPAIN"
    SITE_LAT=43.5
    SITE_LONG=-3.8
    SITE_COUNTRY=Spain
    SITE_WEB="http://www.unican.es"
    SITE_SUPPORT_EMAIL=grid-prod@unican.es
    SITE_OTHER_GRID="EGEE|EELA"
    EGEE_ROC="SWE"
    BDII_REGIONS="BDII CE SE"    # list of the services provided by the site
    
    # If you the node type is using BDII instead (all 3.1 nodes)
    # change the port to 2170 and mds-vo-name=resource
    BDII_BDII_URL="ldap://$SITE_BDII_HOST:2170/mds-vo-name=resource,o=grid"
    BDII_CE_URL="ldap://$CE_HOST:2170/mds-vo-name=resource,o=grid"
    BDII_SE_URL="ldap://$DPM_HOST:2170/mds-vo-name=resource,o=grid"
    BDII_DPM_URL="ldap://$DPM_HOST:2170/mds-vo-name=resource,o=grid"
    
    BDII_RESOURCE_TIMEOUT=30
    GIP_RESPONSE=30
    GIP_FRESHNESS=60
    GIP_CACHE_TTL=300
    GIP_TIMEOUT=150
    
    ##############################
    # VO configuration variables #
    ##############################
    
    VOS="esr ops dteam prod.vo.eu-eela.eu oper.vo.eu-eela.eu chem.vo.ibergrid.eu eng.vo.ibergrid.eu ict.vo.ibergrid.eu ops.vo.ibergrid.eu social.vo.ibergrid.eu earth.vo.ibergrid.eu iber.vo.ibergrid.eu life.vo.ibergrid.eu phys.vo.ibergrid.eu"
    
    QUEUES="grid"
    
    VO_SW_DIR=/opt/exp_soft
    EDG_WL_SCRATCH=""
    
    GRID_GROUP_ENABLE=$VOS
    
    #####
    #esr#
    #####
    VO_ESR_SW_DIR=$VO_SW_DIR/esr
    VO_ESR_DEFAULT_SE=$DPM_HOST
    VO_ESR_STORAGE_DIR=$CLASSIC_STORAGE_DIR/esr
    VO_ESR_VOMS_SERVERS="'vomss://voms.grid.sara.nl:8443/voms/esr?/esr/'"
    VO_ESR_VOMSES="'esr voms.grid.sara.nl 30001 /O=dutchgrid/O=hosts/OU=sara.nl/CN=voms.grid.sara.nl esr'"
    VO_ESR_VOMS_CA_DN="'/C=NL/O=NIKHEF/CN=NIKHEF medium-security certification auth'"
    
    #########
    # dteam #
    #########
    VO_DTEAM_SW_DIR=$VO_SW_DIR/dteam
    VO_DTEAM_DEFAULT_SE=$DPM_HOST
    VO_DTEAM_STORAGE_DIR=$CLASSIC_STORAGE_DIR/dteam
    VO_DTEAM_VOMS_SERVERS="'vomss://voms.cern.ch:8443/voms/dteam?/dteam/'"
    VO_DTEAM_VOMSES="'dteam lcg-voms.cern.ch 15004 /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch dteam 24' 'dteam voms.cern.ch 15004 /DC=ch/DC=cern/OU=computers/CN=voms.cern.ch dteam 24'"
    VO_DTEAM_VOMS_CA_DN="'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"
    
    #######
    # ops #
    #######
    VO_OPS_SW_DIR=$VO_SW_DIR/ops
    VO_OPS_DEFAULT_SE=$DPM_HOST
    VO_OPS_STORAGE_DIR=$CLASSIC_STORAGE_DIR/ops
    VO_OPS_VOMS_SERVERS="vomss://voms.cern.ch:8443/voms/ops?/ops/"
    VO_OPS_VOMSES="'ops lcg-voms.cern.ch 15009 /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch ops 24' 'ops voms.cern.ch 15009 /DC=ch/DC=cern/OU=computers/CN=voms.cern.ch ops 24'"
    VO_OPS_VOMS_CA_DN="'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"
    
    ######
    #EELA#
    ######
    #oper
    VO_OPER_VO_EU_EELA_EU_SW_DIR=$VO_SW_DIR/eelaoper
    VO_OPER_VO_EU_EELA_EU_DEFAULT_SE=$DPM_HOST
    VO_OPER_VO_EU_EELA_EU_STORAGE_DIR=$CLASSIC_STORAGE_DIR/eelaoper
    VO_OPER_VO_EU_EELA_EU_VOMS_SERVERS="'vomss://voms.eela.ufrj.br:8443/voms/oper.vo.eu-eela.eu?/oper.vo.eu-eela.eu'"
    VO_OPER_VO_EU_EELA_EU_VOMSES="'oper.vo.eu-eela.eu voms.eela.ufrj.br 15004 /C=BR/O=ICPEDU/O=UFF BrGrid CA/O=UFRJ/OU=IF/CN=host/voms.eela.ufrj.br oper.vo.eu-eela.eu' 'oper.vo.eu-eela.eu voms-eela.ceta-ciemat.es 15004 /DC=es/DC=irisgrid/O=ceta-ciemat/CN=host/voms-eela.ceta-ciemat.es oper.vo.eu-eela.eu'"
    VO_OPER_VO_EU_EELA_EU_VOMS_CA_DN="'/C=BR/O=ICPEDU/O=UFF BrGrid CA/CN=UFF Brazilian Grid Certification Authority' '/DC=es/DC=irisgrid/CN=IRISGridCA'"
    
    #prod
    VO_PROD_VO_EU_EELA_EU_SW_DIR=$VO_SW_DIR/eelaprod
    VO_PROD_VO_EU_EELA_EU_DEFAULT_SE=$DPM_HOST
    VO_PROD_VO_EU_EELA_EU_STORAGE_DIR=$CLASSIC_STORAGE_DIR/eelaprod
    VO_PROD_VO_EU_EELA_EU_VOMS_SERVERS="'vomss://voms.eela.ufrj.br:8443/voms/prod.vo.eu-eela.eu?/prod.vo.eu-eela.eu'"
    VO_PROD_VO_EU_EELA_EU_VOMSES="'prod.vo.eu-eela.eu voms.eela.ufrj.br 15003 /C=BR/O=ICPEDU/O=UFF BrGrid CA/O=UFRJ/OU=IF/CN=host/voms.eela.ufrj.br prod.vo.eu-eela.eu' 'prod.vo.eu-eela.eu voms-eela.ceta-ciemat.es 15003 /DC=es/DC=irisgrid/O=ceta-ciemat/CN=host/voms-eela.ceta-ciemat.es prod.vo.eu-eela.eu'"
    VO_PROD_VO_EU_EELA_EU_VOMS_CA_DN="'/C=BR/O=ICPEDU/O=UFF BrGrid CA/CN=UFF Brazilian Grid Certification Authority' '/DC=es/DC=irisgrid/CN=IRISGridCA'"
    
    ################
    # IBERGRID VOS #
    ################
    # ops.vo.ibergrid.eu
    VO_OPS_VO_IBERGRID_EU_SW_DIR=$VO_SW_DIR/test
    VO_OPS_VO_IBERGRID_EU_DEFAULT_SE=$DPM_HOST
    VO_OPS_VO_IBERGRID_EU_STORAGE_DIR=$CLASSIC_STORAGE_DIR/test
    VO_OPS_VO_IBERGRID_EU_VOMS_SERVERS="'vomss://voms01.ncg.ingrid.pt:8443/voms/ops.vo.ibergrid.eu?/ops.vo.ibergrid.eu'"
    VO_OPS_VO_IBERGRID_EU_VOMSES="'ops.vo.ibergrid.eu voms01.ncg.ingrid.pt 40001 /C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=voms01.ncg.ingrid.pt ops.vo.ibergrid.eu' 'ops.vo.ibergrid.eu ibergrid-voms.ifca.es 40001 /DC=es/DC=irisgrid/O=ifca/CN=host/ibergrid-voms.ifca.es ops.vo.ibergrid.eu'"
    VO_OPS_VO_IBERGRID_EU_VOMS_CA_DN="'/C=PT/O=LIPCA/CN=LIP Certification Authority' '/DC=es/DC=irisgrid/CN=IRISGridCA'"
    
    # iber.vo.ibergrid.eu
    VO_IBER_VO_IBERGRID_EU_SW_DIR=$VO_SW_DIR/test
    VO_IBER_VO_IBERGRID_EU_DEFAULT_SE=$DPM_HOST
    VO_IBER_VO_IBERGRID_EU_STORAGE_DIR=$CLASSIC_STORAGE_DIR/test
    VO_IBER_VO_IBERGRID_EU_VOMS_SERVERS="'vomss://voms01.ncg.ingrid.pt:8443/voms/iber.vo.ibergrid.eu?/iber.vo.ibergrid.eu'"
    VO_IBER_VO_IBERGRID_EU_VOMSES="'iber.vo.ibergrid.eu voms01.ncg.ingrid.pt 40003 /C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=voms01.ncg.ingrid.pt iber.vo.ibergrid.eu' 'iber.vo.ibergrid.eu ibergrid-voms.ifca.es 40003 /DC=es/DC=irisgrid/O=ifca/CN=host/ibergrid-voms.ifca.es iber.vo.ibergrid.eu'"
    VO_IBER_VO_IBERGRID_EU_VOMS_CA_DN="'/C=PT/O=LIPCA/CN=LIP Certification Authority' '/DC=es/DC=irisgrid/CN=IRISGridCA'"
    
    # eng.vo.ibergrid.eu
    VO_ENG_VO_IBERGRID_EU_SW_DIR=$VO_SW_DIR/test
    VO_ENG_VO_IBERGRID_EU_DEFAULT_SE=$DPM_HOST
    VO_ENG_VO_IBERGRID_EU_STORAGE_DIR=$CLASSIC_STORAGE_DIR/test
    VO_ENG_VO_IBERGRID_EU_VOMS_SERVERS="'vomss://voms01.ncg.ingrid.pt:8443/voms/eng.vo.ibergrid.eu?/eng.vo.ibergrid.eu'"
    VO_ENG_VO_IBERGRID_EU_VOMSES="'eng.vo.ibergrid.eu voms01.ncg.ingrid.pt 40013 /C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=voms01.ncg.ingrid.pt eng.vo.ibergrid.eu' 'eng.vo.ibergrid.eu ibergrid-voms.ifca.es 40013 /DC=es/DC=irisgrid/O=ifca/CN=host/ibergrid-voms.ifca.es eng.vo.ibergrid.eu'"
    VO_ENG_VO_IBERGRID_EU_VOMS_CA_DN="'/C=PT/O=LIPCA/CN=LIP Certification Authority' '/DC=es/DC=irisgrid/CN=IRISGridCA'"
    
    # ict.vo.ibergrid.eu
    VO_ICT_VO_IBERGRID_EU_SW_DIR=$VO_SW_DIR/test
    VO_ICT_VO_IBERGRID_EU_DEFAULT_SE=$DPM_HOST
    VO_ICT_VO_IBERGRID_EU_STORAGE_DIR=$CLASSIC_STORAGE_DIR/test
    VO_ICT_VO_IBERGRID_EU_VOMS_SERVERS="'vomss://voms01.ncg.ingrid.pt:8443/voms/ict.vo.ibergrid.eu?/ict.vo.ibergrid.eu'"
    VO_ICT_VO_IBERGRID_EU_VOMSES="'ict.vo.ibergrid.eu voms01.ncg.ingrid.pt 40008 /C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=voms01.ncg.ingrid.pt ict.vo.ibergrid.eu' 'ict.vo.ibergrid.eu ibergrid-voms.ifca.es 40008 /DC=es/DC=irisgrid/O=ifca/CN=host/ibergrid-voms.ifca.es ict.vo.ibergrid.eu'"
    VO_ICT_VO_IBERGRID_EU_VOMS_CA_DN="'/C=PT/O=LIPCA/CN=LIP Certification Authority' '/DC=es/DC=irisgrid/CN=IRISGridCA'"
    
    # life.vo.ibergrid.eu
    VO_LIFE_VO_IBERGRID_EU_SW_DIR=$VO_SW_DIR/test 
    VO_LIFE_VO_IBERGRID_EU_DEFAULT_SE=$DPM_HOST
    VO_LIFE_VO_IBERGRID_EU_STORAGE_DIR=$CLASSIC_STORAGE_DIR/test
    VO_LIFE_VO_IBERGRID_EU_VOMS_SERVERS="'vomss://voms01.ncg.ingrid.pt:8443/voms/life.vo.ibergrid.eu?/life.vo.ibergrid.eu'"
    VO_LIFE_VO_IBERGRID_EU_VOMSES="'life.vo.ibergrid.eu voms01.ncg.ingrid.pt 40010 /C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=voms01.ncg.ingrid.pt life.vo.ibergrid.eu' 'life.vo.ibergrid.eu ibergrid-voms.ifca.es 40010 /DC=es/DC=irisgrid/O=ifca/CN=host/ibergrid-voms.ifca.es life.vo.ibergrid.eu'"
    VO_LIFE_VO_IBERGRID_EU_VOMS_CA_DN="'/C=PT/O=LIPCA/CN=LIP Certification Authority' '/DC=es/DC=irisgrid/CN=IRISGridCA'"
    
    # earth.vo.ibergrid.eu
    VO_EARTH_VO_IBERGRID_EU_SW_DIR=$VO_SW_DIR/test
    VO_EARTH_VO_IBERGRID_EU_DEFAULT_SE=$DPM_HOST
    VO_EARTH_VO_IBERGRID_EU_STORAGE_DIR=$CLASSIC_STORAGE_DIR/test
    VO_EARTH_VO_IBERGRID_EU_VOMS_SERVERS="'vomss://voms01.ncg.ingrid.pt:8443/voms/earth.vo.ibergrid.eu?/earth.vo.ibergrid.eu'"
    VO_EARTH_VO_IBERGRID_EU_VOMSES="'earth.vo.ibergrid.eu voms01.ncg.ingrid.pt 40011 /C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=voms01.ncg.ingrid.pt earth.vo.ibergrid.eu' 'earth.vo.ibergrid.eu ibergrid-voms.ifca.es 40011 /DC=es/DC=irisgrid/O=ifca/CN=host/ibergrid-voms.ifca.es earth.vo.ibergrid.eu'"
    VO_EARTH_VO_IBERGRID_EU_VOMS_CA_DN="'/C=PT/O=LIPCA/CN=LIP Certification Authority' '/DC=es/DC=irisgrid/CN=IRISGridCA'"
    
    # phys.vo.ibergrid.eu
    VO_PHYS_VO_IBERGRID_EU_SW_DIR=$VO_SW_DIR/test
    VO_PHYS_VO_IBERGRID_EU_DEFAULT_SE=$DPM_HOST
    VO_PHYS_VO_IBERGRID_EU_STORAGE_DIR=$CLASSIC_STORAGE_DIR/test
    VO_PHYS_VO_IBERGRID_EU_VOMS_SERVERS="'vomss://voms01.ncg.ingrid.pt:8443/voms/phys.vo.ibergrid.eu?/phys.vo.ibergrid.eu'"
    VO_PHYS_VO_IBERGRID_EU_VOMSES="'phys.vo.ibergrid.eu voms01.ncg.ingrid.pt 40007 /C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=voms01.ncg.ingrid.pt phys.vo.ibergrid.eu' 'phys.vo.ibergrid.eu ibergrid-voms.ifca.es 40007 /DC=es/DC=irisgrid/O=ifca/CN=host/ibergrid-voms.ifca.es phys.vo.ibergrid.eu'"
    VO_PHYS_VO_IBERGRID_EU_VOMS_CA_DN="'/C=PT/O=LIPCA/CN=LIP Certification Authority' '/DC=es/DC=irisgrid/CN=IRISGridCA'"
    
    # social.vo.ibergrid.eu
    VO_SOCIAL_VO_IBERGRID_EU_SW_DIR=$VO_SW_DIR/test
    VO_SOCIAL_VO_IBERGRID_EU_DEFAULT_SE=$DPM_HOST
    VO_SOCIAL_VO_IBERGRID_EU_STORAGE_DIR=$CLASSIC_STORAGE_DIR/test
    VO_SOCIAL_VO_IBERGRID_EU_VOMS_SERVERS="'vomss://voms01.ncg.ingrid.pt:8443/voms/social.vo.ibergrid.eu?/social.vo.ibergrid.eu'"
    VO_SOCIAL_VO_IBERGRID_EU_VOMSES="'social.vo.ibergrid.eu voms01.ncg.ingrid.pt 40012 /C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=voms01.ncg.ingrid.pt social.vo.ibergrid.eu' 'social.vo.ibergrid.eu ibergrid-voms.ifca.es 40012 /DC=es/DC=irisgrid/O=ifca/CN=host/ibergrid-voms.ifca.es social.vo.ibergrid.eu'"
    VO_SOCIAL_VO_IBERGRID_EU_VOMS_CA_DN="'/C=PT/O=LIPCA/CN=LIP Certification Authority' '/DC=es/DC=irisgrid/CN=IRISGridCA'"
    
    # chem.vo.ibergrid.eu
    VO_CHEM_VO_IBERGRID_EU_SW_DIR=$VO_SW_DIR/test
    VO_CHEM_VO_IBERGRID_EU_DEFAULT_SE=$DPM_HOST
    VO_CHEM_VO_IBERGRID_EU_STORAGE_DIR=$CLASSIC_STORAGE_DIR/test
    VO_CHEM_VO_IBERGRID_EU_VOMS_SERVERS="'vomss://voms01.ncg.ingrid.pt:8443/voms/chem.vo.ibergrid.eu?/chem.vo.ibergrid.eu'"
    VO_CHEM_VO_IBERGRID_EU_VOMSES="'chem.vo.ibergrid.eu voms01.ncg.ingrid.pt 40009 /C=PT/O=LIPCA/O=LIP/OU=Lisboa/CN=voms01.ncg.ingrid.pt chem.vo.ibergrid.eu' 'chem.vo.ibergrid.eu ibergrid-voms.ifca.es 40009 /DC=es/DC=irisgrid/O=ifca/CN=host/ibergrid-voms.ifca.es chem.vo.ibergrid.eu'"
    VO_CHEM_VO_IBERGRID_EU_VOMS_CA_DN="'/C=PT/O=LIPCA/CN=LIP Certification Authority' '/DC=es/DC=irisgrid/CN=IRISGridCA'"
    
    
    #YAIM_LOGGING_LEVEL=WARNING
    YAIM_LOGGING_LEVEL=DEBUG
    
  • Users and groups configuration

Define pool accounts (users.conf) and groups (groups.conf) for the supported VOs.
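
A minimal sketch of the expected YAIM formats: each users.conf line is UID:LOGIN:GID[,GID...]:GROUP[,GROUP...]:VO:FLAG: and each groups.conf line is "FQAN":group:gid:flag:[VO]. The UIDs, GIDs and account names below are made-up examples:

users.conf:
40001:dteam001:4000:dteam:dteam::
40002:dteam002:4000:dteam:dteam::
41001:ops001:4100:ops:ops::

groups.conf:
"/dteam"::::
"/ops"::::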

  • WN list configuration

Set the list of WNs in this file (wn-list.conf).
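
The WN list is simply one worker-node FQDN per line; a sketch with made-up hostnames in this site's domain:

wn001.macc.unican.es
wn002.macc.unican.es
wn003.macc.unican.es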

  • Run yaim

After having filled in the site-info.def file, run yaim:

[root@ce01 ~]# /opt/glite/yaim/bin/yaim -c -s site-info.def -n creamCE -n TORQUE_utils -n glite-APEL -n site-BDII
  • Sharing of the CREAM sandbox area between the CREAM CE and the WN for Torque

When Torque is used as the batch system, to share the CREAM sandbox area between the CREAM CE node and the WNs:

Mount the cream_sandbox directory also on the WNs. Let's assume that on the CE node the CREAM sandbox directory is called /var/cream_sandbox and on the WNs it is mounted as /cream_sandbox (a sketch of the export and mount follows the Torque setting below). On the WNs, add the following to the Torque client config file:

$usecp <CE node>:/var/cream_sandbox /cream_sandbox
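
A sketch of the corresponding NFS export and mount (the WN subnet and the mount options are assumptions; adapt them to the site):

# On the CE (ce01), /etc/exports:
/var/cream_sandbox  192.168.202.0/255.255.255.0(rw,sync,no_root_squash)

# On each WN, /etc/fstab:
ce01.macc.unican.es:/var/cream_sandbox  /cream_sandbox  nfs  defaults  0 0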

  • Sharing of the job accounting

The accounting service running on the CREAM CE will periodically check for new data in the directory /var/spool/torque/server_priv/accounting. If this directory does not exist on the CREAM CE, you need to export this directory from the batch system server to the compute element.
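
As a sketch, the directory can be exported read-only from the Torque server (encina in this site's configuration) and mounted on the CE; the export options are assumptions:

# On the Torque server (encina), /etc/exports:
/var/spool/torque/server_priv/accounting  ce01.macc.unican.es(ro,sync)

# On the CREAM CE:
[root@ce01 ~]# mount encina:/var/spool/torque/server_priv/accounting /var/spool/torque/server_priv/accounting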

How to republish APEL information

If you have problems with Nagios tests such as:

  • org.apel.APEL-Pub
  • org.apel.APEL-Sync

It could be that APEL has not published information for XX days.

To update the information, you can follow the steps below:

  • Change <Logs searchSubDirs="yes" reprocess="no"> into <Logs searchSubDirs="yes" reprocess="yes"> in the /etc/glite-apel-pbs/parser-config-yaim.xml file
  • Change <Republish>missing</Republish> into <Republish>all</Republish> in the /etc/glite-apel-publisher/publisher-config-yaim.xml file (both file edits can also be done with sed; see the sketch after this list)
  • Run these scripts:
    $ env APEL_HOME=/ /usr/bin/apel-pbs-log-parser -f /etc/glite-apel-pbs/parser-config-yaim.xml >> /var/log/apel.log 2>&1
    $ env APEL_HOME=/ JAVA_HOME=/usr /usr/bin/apel-publisher -f /etc/glite-apel-publisher/publisher-config-yaim.xml >> /var/log/apel.log 2>&1
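
The two configuration edits listed above can also be done with sed, as a sketch (back up the files first; the attribute and element values are exactly the ones quoted in the steps):

    $ sed -i 's|reprocess="no"|reprocess="yes"|' /etc/glite-apel-pbs/parser-config-yaim.xml
    $ sed -i 's|<Republish>missing</Republish>|<Republish>all</Republish>|' /etc/glite-apel-publisher/publisher-config-yaim.xml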
    

Managing the nodes using IPMItool

To manage the nodes from the command line, it is enough to connect to NAT and from there run the following command to check the state of a node (wn014):

[root@nat ~]# ipmitool -H 192.168.200.24 -U ADMIN power status
Password:
Chassis Power is on

and to reset it we use the following command:

[root@nat ~]# ipmitool -H 192.168.200.24 -U ADMIN chassis power reset
Password:
Chassis Power Control: Reset
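
Other useful subcommands follow the same pattern, for example to power-cycle the node or to dump the system event log (same BMC address and user as above; both are standard ipmitool subcommands):

[root@nat ~]# ipmitool -H 192.168.200.24 -U ADMIN chassis power cycle
[root@nat ~]# ipmitool -H 192.168.200.24 -U ADMIN sel list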

Checking rpciod saturation from SEAL

Sometimes you may have noticed that deleting directories directly on seal is slower than doing it over NFS.

This happens when seal is being "attacked" over NFS, causing a DoS, because NFS operations, being handled by a kernel module, have higher priority than operations started by a user.

After a lot of reading and searching for the best way to monitor what is going on, I have been able to diagnose the problem, or at least I think I have.

For this I have used the monitoring tool that Solaris ships natively, DTrace (also available on Mac OS X and FreeBSD), and it really is a very powerful tool with no equal on Linux. There is a book with plenty of examples attached to this ticket.

The NFS DoS seems to be caused by the rpciod threads of the NFS clients, and it appears to be a bug in the Linux NFS client implementation (reference missing). When seal goes down, the clients keep waiting for the server to come back, but sometimes the clients get stuck and their rpciod threads start consuming CPU. The processes seem to be stuck in a loop in which they keep sending requests to the NFS server, consuming bandwidth and server resources without actually performing any read/write operation.

To detect whether some client has a runaway rpciod, we can use DTrace on seal. To do so we count the number of NFSv4 operations being performed:

root@seal.macc.unican.es:~# time dtrace -n 'nfsv4::: { @[probename] = count(); }'
dtrace: description 'nfsv4::: ' matched 81 probes
^C

  op-read-done                                                      1
  op-read-start                                                     1
  op-setattr-done                                                   3
  op-setattr-start                                                  3
  op-access-done                                                    7
  op-access-start                                                   7
  op-commit-done                                                   13
  op-commit-start                                                  13
  op-close-done                                                    19
  op-close-start                                                   19
  op-open-done                                                     31
  op-open-start                                                    31
  op-restorefh-done                                                31
  op-restorefh-start                                               31
  op-savefh-done                                                   31
  op-savefh-start                                                  31
  op-lookup-done                                                 1018
  op-lookup-start                                                1018
  op-getfh-done                                                  1049
  op-getfh-start                                                 1049
  op-getattr-start                                               2367
  op-getattr-done                                                2369
  op-renew-done                                                  3233
  op-renew-start                                                 3233
  op-write-done                                                403884
  op-write-start                                               403884
  op-putfh-done                                                406206
  op-putfh-start                                               406207
  compound-start                                               409440
  compound-done                                                409441

real    0m15.303s
user    0m0.421s
sys     0m0.394s

The dtrace command is preceded by time because dtrace only prints the statistics once we press CTRL+C; with time we know how long dtrace has been collecting data.

In this case, during the 15 seconds the capture lasted, there were more than 400k ops against seal's NFS server, which is excessive.

The next step is to find out which NFS client is generating all those operations. For that we use another dtrace command:

root@seal.macc.unican.es:~# time dtrace -n 'nfsv4:::compound-start { @[args[0]->ci_remote] = count(); }'
dtrace: description 'nfsv4:::compound-start ' matched 1 probe
^C

  192.168.202.43                                                    1
  192.168.202.44                                                   33
  192.168.202.131                                                  48
  192.168.202.45                                                   71
  193.144.184.29                                                 1914
  192.168.202.133                                              332377

real    0m13.027s
user    0m0.432s
sys     0m0.372s

and it is clear that the client 192.168.202.133 (ce01) is performing far too many operations (mostly on home_grid).
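
To see which operations that client is issuing, the same DTrace provider can be aggregated by probe name with a predicate on the remote address (a sketch; the IP string in the predicate has to match the client exactly):

root@seal.macc.unican.es:~# dtrace -n 'nfsv4:::op-*-start /args[0]->ci_remote == "192.168.202.133"/ { @[probename] = count(); }'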

If we check the CPU usage on ce01, we see that rpciod is consuming a significant amount of CPU:

[antonio@ce01 ~]$ top -b
top - 21:09:10 up 1 day, 20:12,  1 user,  load average: 0.60, 0.46, 0.84
Tasks: 250 total,   2 running, 248 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.3% us,  1.2% sy,  0.0% ni, 95.6% id,  0.7% wa,  0.0% hi,  0.1% si
Mem:   1536000k total,  1314904k used,   221096k free,   138524k buffers
Swap:  7335664k total,        0k used,  7335664k free,   544060k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4798 root      10  -5     0    0    0 S 11.3  0.0   1:22.41 rpciod/0
    1 root      15   0  1644  544  468 S  0.0  0.0   0:00.64 init
    2 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 migration/0
    3 root      34  19     0    0    0 S  0.0  0.0   0:03.40 ksoftirqd/0
    4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
    5 root      10  -5     0    0    0 S  0.0  0.0   0:00.01 events/0
    6 root      14  -5     0    0    0 S  0.0  0.0   0:00.00 khelper
    7 root      20  -5     0    0    0 S  0.0  0.0   0:11.07 kthread
    9 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 xenwatch
   10 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 xenbus
   17 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 kblockd/0
   18 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 cqueue/0
   22 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 khubd
   24 root      10  -5     0    0    0 S  0.0  0.0   0:00.00 kseriod
   82 root      25   0     0    0    0 S  0.0  0.0   0:00.00 pdflush
   83 root      15   0     0    0    0 S  0.0  0.0   0:01.43 pdflush
   84 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 kswapd0
   85 root      20  -5     0    0    0 S  0.0  0.0   0:00.00 aio/0
  215 root      11  -5     0    0    0 S  0.0  0.0   0:00.00 kpsmoused

So the NFS client on ce01 got stuck and, because of the 'idiosyncrasies' of NFS, it has to be rebooted so that it does not choke the NFS service on SEAL. As usual, NFS blocks the reboot, making it necessary to power the machine off with a hard reset.

Let's check what has happened to the operations on the NFS server:

root@seal.macc.unican.es:~# time dtrace -n 'nfsv4::: { @[probename] = count(); }'
dtrace: description 'nfsv4::: ' matched 81 probes
^C

  op-open-downgrade-done                                            1
  op-open-downgrade-start                                           1
  op-setclientid-confirm-done                                       1
  op-setclientid-confirm-start                                      1
  op-setclientid-done                                               1
  op-setclientid-start                                              1
  null-done                                                         2
  null-start                                                        2
  op-putrootfh-done                                                 2
  op-putrootfh-start                                                2
  op-renew-done                                                     3
  op-renew-start                                                    3
  op-open-confirm-done                                             12
  op-open-confirm-start                                            12
  op-rename-done                                                   20
  op-rename-start                                                  20
  op-commit-done                                                   30
  op-commit-start                                                  30
  op-create-done                                                   32
  op-create-start                                                  32
  op-readdir-done                                                  40
  op-readdir-start                                                 40
  op-link-done                                                     58
  op-link-start                                                    58
  op-remove-done                                                   86
  op-remove-start                                                  86
  op-setattr-done                                                  90
  op-setattr-start                                                 90
  op-write-done                                                    93
  op-write-start                                                   93
  op-access-done                                                  184
  op-access-start                                                 184
  op-open-done                                                    284
  op-open-start                                                   284
  op-close-done                                                   303
  op-close-start                                                  303
  op-restorefh-done                                               384
  op-restorefh-start                                              384
  op-savefh-done                                                  394
  op-savefh-start                                                 394
  op-lookup-done                                                 1833
  op-lookup-start                                                1833
  op-getfh-done                                                  2008
  op-getfh-start                                                 2008
  op-getattr-done                                                8778
  op-getattr-start                                               8779
  op-read-done                                                  14310
  op-read-start                                                 14310
  compound-done                                                 23015
  compound-start                                                23016
  op-putfh-done                                                 23088
  op-putfh-start                                                23088

real    0m22.732s
user    0m0.492s
sys     0m0.345s

I think they are still high, so let's check which client is generating them:

root@seal.macc.unican.es:~# time dtrace -n 'nfsv4:::compound-start { @[args[0]->ci_remote] = count(); }'
dtrace: description 'nfsv4:::compound-start ' matched 1 probe
^C

  192.168.202.43                                                   71
  192.168.202.44                                                   72
  192.168.202.131                                                  76
  192.168.202.133                                                 109
  192.168.202.15                                                 2517
  193.144.184.29                                                 9280

real    0m24.262s
user    0m0.420s
sys     0m0.371s

It seems to be oceano, so let's see what is going on there:

[antonio@oceano ~]$ top
top - 21:41:24 up 10 days,  3:25,  4 users,  load average: 9.82, 6.91, 4.41
Tasks: 306 total,   1 running, 305 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.8%us,  1.0%sy,  0.0%ni, 85.0%id, 12.1%wa,  0.1%hi,  1.1%si,  0.0%st
Mem:  24557908k total, 24434096k used,   123812k free,    75000k buffers
Swap: 37552112k total,      160k used, 37551952k free, 18610848k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
21069 daniel    15   0 61292 3108 2172 S  5.0  0.0   0:51.87 sftp-server
 4018 antonio   22   0 6652m 394m  10m S  3.0  1.6  24:01.74 java
22398 daniel    15   0 61280 3072 2176 S  1.3  0.0   0:10.23 sftp-server
 3406 root      10  -5     0    0    0 S  0.7  0.0   9:00.22 rpciod/7
21019 daniel    15   0 61288 3108 2172 S  0.7  0.0   0:41.71 sftp-server
 4071 root      10  -5     0    0    0 S  0.3  0.0   2:30.25 nfsiod
21016 daniel    15   0 62052 3076 2168 D  0.3  0.0   0:26.26 sftp-server
21111 daniel    15   0 61284 3100 2168 D  0.3  0.0   0:23.25 sftp-server
21144 daniel    15   0 61284 3100 2168 D  0.3  0.0   0:23.47 sftp-server
21230 daniel    15   0 61288 3104 2168 S  0.3  0.0   0:23.18 sftp-server
21297 daniel    15   0 61284 3100 2168 D  0.3  0.0   0:23.04 sftp-server
21363 daniel    16   0 61280 3096 2168 D  0.3  0.0   0:23.26 sftp-server
21450 daniel    15   0 61280 3096 2168 D  0.3  0.0   0:23.15 sftp-server
21457 daniel    15   0 61288 3112 2172 D  0.3  0.0   0:22.32 sftp-server
21521 daniel    16   0 61292 3112 2172 D  0.3  0.0   0:22.61 sftp-server
21559 daniel    15   0 62052 3076 2168 D  0.3  0.0   0:23.47 sftp-server
22504 antonio   15   0 30996 2352 1552 R  0.3  0.0   0:00.07 top

It seems it was all the copy processes, which had gotten stuck. We will wait to see what happens, check whether they finish without problems, and make sure that rpciod does not get stuck again.

[to be continued ...]