Comprobación de la saturación del RPCIOd desde SEAL
A veces os ha pasado que el las operaciones de borrado de directorios directamente sobre seal son más lentas que si las haceis sobre NFS.
Esto sucede cuando seal está siendo "atacado" por NFS originando un DoS, ya que las operaciones NFS al ser un modulo del kernel tienen mayor prioridad que las operaciones iniciadas por el usuario.
Después de mucho leer y buscar la mejor forma de monitorear que es lo que está sucendiendo he podido diagnisticar el problema, o al menos creo haberlo hecho.
Para ello he usado la herramienta de monitorización que trae Solaris de forma nativa (también Mac OS X y FreeBSD), y la verdad es que es una herramienta muy potente sin parangón en Linux. Hay un libro que tienen un montón de ejmplos y que está adjunto a esta incidencia
El DoS del NFS parace venir originado por los RPCIOD relacionados con el NFS, y parace que es un bug en la implementación NFS de los clientes Linux (falta una referencia). Cuando seal se apaga, los clientes se quedan esperando a que vuelva el servidor, pero a veces los clientes se quedan colgados y los RPCIOD empienzan a consumir CPU. Parace que los procesos se han quedado en un bucle, en el cual hacen peticiones al servidor NFS, consumiendo ancho de banda y recursos del servidor sin hacer en realidad ninguna operacion de lectura/escritura.
Para detectar que hay algún cliente con el RPCIOD descontrolado en seal, podemos usar el DTrace desde seal. Para ello contamos la cantidad de operaciones de NFSV4 que se están realizando.
root@seal.macc.unican.es:~# time dtrace -n 'nfsv4::: { @[probename] = count(); }' dtrace: description 'nfsv4::: ' matched 81 probes ^C op-read-done 1 op-read-start 1 op-setattr-done 3 op-setattr-start 3 op-access-done 7 op-access-start 7 op-commit-done 13 op-commit-start 13 op-close-done 19 op-close-start 19 op-open-done 31 op-open-start 31 op-restorefh-done 31 op-restorefh-start 31 op-savefh-done 31 op-savefh-start 31 op-lookup-done 1018 op-lookup-start 1018 op-getfh-done 1049 op-getfh-start 1049 op-getattr-start 2367 op-getattr-done 2369 op-renew-done 3233 op-renew-start 3233 op-write-done 403884 op-write-start 403884 op-putfh-done 406206 op-putfh-start 406207 compound-start 409440 compound-done 409441 real 0m15.303s user 0m0.421s sys 0m0.394s
el comando dtrace va precedido de un time, ya que el comando solo nos devuelve las estadisticas una vez que pulsamos CTRL+C. Con el time saberemos cuanto tiempo ha estado el dtrace recolectando datos.
En este caso en los 15 segundos que ha durado la captura ha habido más de 400k ops sobre el servidor NFS de seal, que son excesivas.
Lo siguiente es averiguar que cliente NFS está generando todas esas operaciones, para ello usamos otro comando dtrace
root@seal.macc.unican.es:~# time dtrace -n 'nfsv4:::compound-start { @[args[0]->ci_remote] = count(); }' dtrace: description 'nfsv4:::compound-start ' matched 1 probe ^C 192.168.202.43 1 192.168.202.44 33 192.168.202.131 48 192.168.202.45 71 193.144.184.29 1914 192.168.202.133 332377 real 0m13.027s user 0m0.432s sys 0m0.372s
y está claro que el cliente 192.168.202.133 (ce01) está realizando demasiadas operaciones (sobre todo son sobre home_grid)
si comprobamos el consumo de CPU en CE01, observamos que este está consumiendo una gran cantidad de CPU
[antonio@ce01 ~]$ top -b top - 21:09:10 up 1 day, 20:12, 1 user, load average: 0.60, 0.46, 0.84 Tasks: 250 total, 2 running, 248 sleeping, 0 stopped, 0 zombie Cpu(s): 2.3% us, 1.2% sy, 0.0% ni, 95.6% id, 0.7% wa, 0.0% hi, 0.1% si Mem: 1536000k total, 1314904k used, 221096k free, 138524k buffers Swap: 7335664k total, 0k used, 7335664k free, 544060k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 4798 root 10 -5 0 0 0 S 11.3 0.0 1:22.41 rpciod/0 1 root 15 0 1644 544 468 S 0.0 0.0 0:00.64 init 2 root RT -5 0 0 0 S 0.0 0.0 0:00.00 migration/0 3 root 34 19 0 0 0 S 0.0 0.0 0:03.40 ksoftirqd/0 4 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/0 5 root 10 -5 0 0 0 S 0.0 0.0 0:00.01 events/0 6 root 14 -5 0 0 0 S 0.0 0.0 0:00.00 khelper 7 root 20 -5 0 0 0 S 0.0 0.0 0:11.07 kthread 9 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 xenwatch 10 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 xenbus 17 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kblockd/0 18 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 cqueue/0 22 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 khubd 24 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kseriod 82 root 25 0 0 0 0 S 0.0 0.0 0:00.00 pdflush 83 root 15 0 0 0 0 S 0.0 0.0 0:01.43 pdflush 84 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 kswapd0 85 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 aio/0 215 root 11 -5 0 0 0 S 0.0 0.0 0:00.00 kpsmoused
así que el cliente NFS de CE01 se ha quedado atorado y debido a la 'idiosincrasia' de NFS hace necesario reiniciarlo para que no atore el NFS de SEAL. Como es norma, el NFS bloque el reinicio haciendo necesario apagar la máquina mediente un reseteo.
Comprobamos que ha pasado con las ops sobre el servidor NFS
root@seal.macc.unican.es:~# time dtrace -n 'nfsv4::: { @[probename] = count(); }' dtrace: description 'nfsv4::: ' matched 81 probes ^C op-open-downgrade-done 1 op-open-downgrade-start 1 op-setclientid-confirm-done 1 op-setclientid-confirm-start 1 op-setclientid-done 1 op-setclientid-start 1 null-done 2 null-start 2 op-putrootfh-done 2 op-putrootfh-start 2 op-renew-done 3 op-renew-start 3 op-open-confirm-done 12 op-open-confirm-start 12 op-rename-done 20 op-rename-start 20 op-commit-done 30 op-commit-start 30 op-create-done 32 op-create-start 32 op-readdir-done 40 op-readdir-start 40 op-link-done 58 op-link-start 58 op-remove-done 86 op-remove-start 86 op-setattr-done 90 op-setattr-start 90 op-write-done 93 op-write-start 93 op-access-done 184 op-access-start 184 op-open-done 284 op-open-start 284 op-close-done 303 op-close-start 303 op-restorefh-done 384 op-restorefh-start 384 op-savefh-done 394 op-savefh-start 394 op-lookup-done 1833 op-lookup-start 1833 op-getfh-done 2008 op-getfh-start 2008 op-getattr-done 8778 op-getattr-start 8779 op-read-done 14310 op-read-start 14310 compound-done 23015 compound-start 23016 op-putfh-done 23088 op-putfh-start 23088 real 0m22.732s user 0m0.492s sys 0m0.345s
Creo que siguen siendo altas, así que comprobemos que cliente las origina
root@seal.macc.unican.es:~# time dtrace -n 'nfsv4:::compound-start { @[args[0]->ci_remote] = count(); }' dtrace: description 'nfsv4:::compound-start ' matched 1 probe ^C 192.168.202.43 71 192.168.202.44 72 192.168.202.131 76 192.168.202.133 109 192.168.202.15 2517 193.144.184.29 9280 real 0m24.262s user 0m0.420s sys 0m0.371s
Parece que es oceano, así que veamos que es lo que está sucediendo
[antonio@oceano ~]$ top top - 21:41:24 up 10 days, 3:25, 4 users, load average: 9.82, 6.91, 4.41 Tasks: 306 total, 1 running, 305 sleeping, 0 stopped, 0 zombie Cpu(s): 0.8%us, 1.0%sy, 0.0%ni, 85.0%id, 12.1%wa, 0.1%hi, 1.1%si, 0.0%st Mem: 24557908k total, 24434096k used, 123812k free, 75000k buffers Swap: 37552112k total, 160k used, 37551952k free, 18610848k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 21069 daniel 15 0 61292 3108 2172 S 5.0 0.0 0:51.87 sftp-server 4018 antonio 22 0 6652m 394m 10m S 3.0 1.6 24:01.74 java 22398 daniel 15 0 61280 3072 2176 S 1.3 0.0 0:10.23 sftp-server 3406 root 10 -5 0 0 0 S 0.7 0.0 9:00.22 rpciod/7 21019 daniel 15 0 61288 3108 2172 S 0.7 0.0 0:41.71 sftp-server 4071 root 10 -5 0 0 0 S 0.3 0.0 2:30.25 nfsiod 21016 daniel 15 0 62052 3076 2168 D 0.3 0.0 0:26.26 sftp-server 21111 daniel 15 0 61284 3100 2168 D 0.3 0.0 0:23.25 sftp-server 21144 daniel 15 0 61284 3100 2168 D 0.3 0.0 0:23.47 sftp-server 21230 daniel 15 0 61288 3104 2168 S 0.3 0.0 0:23.18 sftp-server 21297 daniel 15 0 61284 3100 2168 D 0.3 0.0 0:23.04 sftp-server 21363 daniel 16 0 61280 3096 2168 D 0.3 0.0 0:23.26 sftp-server 21450 daniel 15 0 61280 3096 2168 D 0.3 0.0 0:23.15 sftp-server 21457 daniel 15 0 61288 3112 2172 D 0.3 0.0 0:22.32 sftp-server 21521 daniel 16 0 61292 3112 2172 D 0.3 0.0 0:22.61 sftp-server 21559 daniel 15 0 62052 3076 2168 D 0.3 0.0 0:23.47 sftp-server 22504 antonio 15 0 30996 2352 1552 R 0.3 0.0 0:00.07 top
parece que han sido todos los procesos de copia, que se han quedado colgados, esperaremos a ver que sucede y vemos sin terminan sin problema y nos aseguramos que el rpciod no se queda colgado.
[continuara ...]
Comments
I think I have never seen such blogs ever before that has complete things with all details which I want. So kindly update this ever for us. https://www.seoheromedia.com
Your content shows the power, I’m about to add this to my bookmarks. 100 dollar monster scam
Quality stuff may be the key to invite the users to visit begin to see the blog site, that’s what this site provides. deutscheonlinecasino.de
I further than doubtfulness revalue your articles and blogs. claims adjusters
I truly get pleasure from while I read your blogs and its content. Unlimited wireless internet
I am definitely coming back again for additional contents of yours. www.comfortsolutionstc.com
Great details here, better yet to discover out your blog which is fantastic. Nicely done!!! zero to multimillionaire training
I’m glad to locate so much of informative data in your blog. Ryan Cataldo
This is actually a fantastic blogs! More of these details are superb -it is nice to see one that current. Ryan Cataldo Criminal Record
This is an excellent blog along with the great knowledge. RC New Jersey
I will prefer this blog because it has much more informative stuff. RC
Cool blog site friend I'm about to suggest this to all my listing contacts. Ryan Cataldo Criminal Record
Awesome! Immense information there. Ryan Cataldo NJ Criminal Record
Your blogs and each of its stuff is so pleasurable and valuable it is making me come back soon. Ryan Cataldo CPG Financial
Well, it’s a nice one, I have been looking for. Thanks for sharing such informative stuff. RC Insurance
This article is actually remarkable one it helps many new users that desire to read always the best stuff. Ryan Cataldo Tequesta
I wanna thanks to a great extent for providing such informative and qualitative material therefore often. RC Jupiter
I’m glad to locate so much of informative data in your blog. Insurance RC
This is actually a fantastic blogs! More of these details are superb -it is nice to see one that current. New Jersey RC
Thanks for your beyond belief blogs stuff http://dahsyatqq.com/app/Default0.aspx?lang=id
You guys present there are performing an excellent job. Call UK Contact
You have shared the best blog. This is what I really need immediately. how to talk to girls
It’s really very informative that I wanted ever, thanks for this. get back with ex
I certainly appreciate your stuff provided in the blogs. free credit card machines for small business
Amiable articles and the blogs really helped me a lot, thanks for the valuable information. free credit card machines
This is really quite informative blog such qualitative info I have never seen anywhere. top merchant cash advance companies
This is surely a very good blog, thanks a lot for sharing such nice information here. vt bharadwaj sequoia
I sent your articles links to all my contacts and they all adore it including me. tax preparer insurance
Your writers are extremely fantastic that have made easy to understand everything for us. tax preparer insurance
guest test post <a href="http://googlee.te/">bbcode</a> <a href="http://googlee.te/">html</a> http://googlee.te/ simple
http://100mg-viagracanada.com/ - 100mg-viagracanada.com.ankor <a href="http://sertralinezoloftonline.com/">sertralinezoloftonline.com.ankor</a> http://20mg-tadalafil-lowest-price.com/
http://100mg-viagracanada.com/ - 100mg-viagracanada.com.ankor <a href="http://sertralinezoloftonline.com/">sertralinezoloftonline.com.ankor</a> http://20mg-tadalafil-lowest-price.com/
http://100mg-viagracanada.com/ - 100mg-viagracanada.com.ankor <a href="http://sertralinezoloftonline.com/">sertralinezoloftonline.com.ankor</a> http://20mg-tadalafil-lowest-price.com/
http://100mg-viagracanada.com/ - 100mg-viagracanada.com.ankor <a href="http://sertralinezoloftonline.com/">sertralinezoloftonline.com.ankor</a> http://20mg-tadalafil-lowest-price.com/
http://100mg-viagracanada.com/ - 100mg-viagracanada.com.ankor <a href="http://sertralinezoloftonline.com/">sertralinezoloftonline.com.ankor</a> http://20mg-tadalafil-lowest-price.com/
http://100mg-viagracanada.com/ - 100mg-viagracanada.com.ankor <a href="http://sertralinezoloftonline.com/">sertralinezoloftonline.com.ankor</a> http://20mg-tadalafil-lowest-price.com/
http://100mg-viagracanada.com/ - 100mg-viagracanada.com.ankor <a href="http://sertralinezoloftonline.com/">sertralinezoloftonline.com.ankor</a> http://20mg-tadalafil-lowest-price.com/
http://20mg-cheapesttadalafil.com/ - 20mg-cheapesttadalafil.com.ankor <a href="http://20mg-cheapesttadalafil.com/">20mg-cheapesttadalafil.com.ankor</a> http://20mg-cheapesttadalafil.com/
I absolutely feel delighted once I realize articles appropriate to my work and my subject. wireless internet for rural areas
Your articles are very well written and unique. unlimited high speed internet anywhere
Mind-boggling job mates, I take pleasure in longing your articles. seo services packages
I have no words to appreciate this post ..... I'm really impressed with this post .... the person who created this post was a big thank you man .. for sharing with us.
http://olpcaustria.org