A pseudo-Linux fan's blog

troubleshooting-high-io-wait-nfs

June 20, 2020

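1. Problem description

The customer's monitoring showed abnormally high CPU iowait on one node. Their initial investigation found no significant read/write waits, so they asked me to help. It turned out that a hung NFS service had left processes in uninterruptible sleep and produced zombie processes, which inflated iowait. After unmounting the corresponding mounts listed in /etc/mtab, the node returned to normal.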

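2. Root cause analysis

2.1. Inspection

Checking with the usual tools (top, iostat, iotop) showed some read/write activity and very high CPU iowait, but the actual I/O load was light.

top output: iowait is high, with some st (steal) time.

iotop output: very little read/write activity.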
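To cross-check what top reports, the iowait and steal shares can be computed directly from the aggregate cpu line of /proc/stat. This is a minimal sketch and was not part of the original investigation; the function names and the one-second sampling interval are my own assumptions.

```python
import os
import time

# Positions in the aggregate "cpu" line, per proc(5):
# user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice
IOWAIT, STEAL = 4, 7


def read_cpu_times(path="/proc/stat"):
    """Return the counters from the aggregate "cpu" line as a list of ints."""
    with open(path) as f:
        for line in f:
            if line.startswith("cpu "):
                return [int(v) for v in line.split()[1:]]
    raise RuntimeError("no aggregate cpu line in %s" % path)


def share_percent(before, after, index):
    """Percentage of the elapsed CPU time spent in the counter at `index`."""
    deltas = [b - a for a, b in zip(before, after)]
    # Guest time is already counted inside user/nice, so sum only the first 8.
    total = sum(deltas[:8])
    return 100.0 * deltas[index] / total if total else 0.0


if __name__ == "__main__":
    if os.path.exists("/proc/stat"):
        first = read_cpu_times()
        time.sleep(1)  # sampling interval: an arbitrary choice
        second = read_cpu_times()
        print("iowait %.1f%%  steal %.1f%%" % (
            share_percent(first, second, IOWAIT),
            share_percent(first, second, STEAL)))
```

A high iowait share combined with low throughput in iotop is the mismatch described above: tasks are waiting, but not on busy local disks.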

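ps auxf showed a few suspect processes, such as PID 103166 (boxed in red in the screenshot). Running lsof -p <pid> on it then hung, which deepened the suspicion. df -h also hung with no response, raising even more questions.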
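Because lsof and df block when they touch a dead mount, it can help to probe a suspect path from a child process with a timeout, so the shell you are working in stays usable. A rough sketch, assuming Python 3 is available on the node; probe() and its three result strings are hypothetical names, and the 3-second timeout is arbitrary.

```python
import subprocess
import sys

# Program run in the child: stat the path given as its first argument.
_STAT_SNIPPET = "import os, sys; os.stat(sys.argv[1])"


def probe(path, timeout=3.0):
    """Return "ok", "error" or "hung" for a stat() of `path` done in a child."""
    child = subprocess.Popen(
        [sys.executable, "-c", _STAT_SNIPPET, path],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    try:
        return "ok" if child.wait(timeout=timeout) == 0 else "error"
    except subprocess.TimeoutExpired:
        # Best effort only: a task in uninterruptible sleep may not react to
        # the signal, so we deliberately do not wait for the child afterwards.
        child.kill()
        return "hung"


if __name__ == "__main__":
    # Paths to check come from the command line; "/" is just a default.
    for target in sys.argv[1:] or ["/"]:
        print("%-6s %s" % (probe(target), target))
```

Only the child can get stuck here; the parent reports "hung" and moves on to the next path.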

Finding processes in uninterruptible state D: the customer's method

for x in $(seq 1 1 10); do ps -eo state,pid,cmd | grep "^D"; echo "-----"; sleep 5; done

 

Finding processes in uninterruptible state D: my method, taken from this post: http://rebootcat.com/2017/12/14/instability-of-cpu/ . The code is archived here:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os
import time


def run():
    """Print every process that `ps auxf` reports in state D, with a timestamp."""
    result = []
    timestamp = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
    cmd = 'ps auxf'
    ret = os.popen(cmd).readlines()
    for i in ret:
        item = i.split()
        # STAT is the 8th column of `ps auxf`; it can carry modifiers
        # (e.g. "D+", "Ds"), so match on the leading letter only.
        if len(item) > 7 and item[7].startswith('D'):
            result.append(i)

    for r in result:
        # r[:-1] drops the trailing newline kept by readlines()
        print('%s %s' % (timestamp, r[:-1]))

    if result:
        print('\n')


if __name__ == "__main__":
    # Poll twice a second until interrupted with Ctrl+C.
    while True:
        run()
        time.sleep(0.5)

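It points to the same processes: 103166, 12584, 60477, and others.

2.2. Pinpointing the cause

Taking process 103166 as the way in, we found it stuck on a pipe.

Digging further into its stack finally turned up nfs_wait_killable. The customer confirmed that they persist logs to a mounted NFS share and that the NFS service had gone down once around that time. That all but confirmed the NFS outage as the culprit.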
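The stack check can be scripted. The sketch below assumes the kernel stack is readable from /proc/<pid>/stack (normally root only) and that frames look like name+0xoffset/0xsize. The prefix list is my own guess at what counts as NFS-related, not an exhaustive rule.

```python
import re
import sys

# Function-name prefixes treated as NFS/RPC related (an illustrative choice).
NFS_PREFIXES = ("nfs", "rpc_", "xprt_")


def stack_functions(stack_text):
    """Extract function names from /proc/<pid>/stack style lines."""
    return re.findall(r"([A-Za-z_][\w.]*)\+0x", stack_text)


def nfs_frames(stack_text):
    """Return the frames whose name starts with one of NFS_PREFIXES."""
    return [fn for fn in stack_functions(stack_text)
            if fn.startswith(NFS_PREFIXES)]


def read_stack(pid):
    """Read the kernel stack of `pid`; this normally requires root."""
    with open("/proc/%d/stack" % pid) as f:
        return f.read()


if __name__ == "__main__":
    # PIDs to inspect are passed on the command line.
    for pid in [int(a) for a in sys.argv[1:] if a.isdigit()]:
        frames = nfs_frames(read_stack(pid))
        label = "NFS frames:" if frames else "no NFS frames"
        print(pid, label, ", ".join(frames))
```

Run against the suspect PIDs, a frame such as nfs_wait_killable would show up in the "NFS frames" list.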

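3. Solution

3.1. Pitfalls others have hit (supporting evidence)

  • An article found online describes exactly this kind of NFS hang: https://www.cnblogs.com/embedded-linux/p/7043569.html

The D state usually arises when a process waits for an I/O resource that cannot be satisfied. In the kernel source file fs/proc/array.c its text label is "D (disk sleep)", /* 2 */ (which shows that D originally stood for Disk), corresponding to #define TASK_UNINTERRUPTIBLE 2 in include/linux/sched.h.

For example, if an NFS server is shut down without first unmounting the relevant directories on the client, running df on the NFS client hangs the entire login session, and neither Ctrl+C nor Ctrl+Z helps. After disconnecting and logging in again, ps axf shows that the earlier df process is now in state D, and kill -9 cannot kill it.

The correct fix is to restore the NFS server right away so that it serves requests again. The hung df process then gets the resource it was waiting for, finishes its work, and exits on its own. If the NFS server cannot be restored, remove the relevant NFS mount entries from /etc/mtab before rebooting, so that the routine netfs stop call during reboot does not wait on the resource again and hang the reboot.

  • The customer also found a write-up: https://segmentfault.com/a/1190000022829696

When the PVC is mounted again, things seem fine if the pod is scheduled to a different node. The problem seen in production was that the NFS host suddenly went down, and the pod kept waiting, hung, and reported errors. Restarting it manually triggered rescheduling. If it landed on the same node, the original PVC did not seem to be recognized even though it still existed, so the PVC had to be deleted manually and mounted again, and that is when zombie processes appeared. Scheduling to a different node caused no problems.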
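The "D (disk sleep)" label quoted from the first article is the same text that /proc/<pid>/status shows on its State line, so D-state processes can also be listed without calling ps at all. A small sketch under that assumption; the function names are hypothetical.

```python
import os


def parse_state(status_text):
    """Return the one-letter state from the "State:" line, or None."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            fields = line.split()
            return fields[1] if len(fields) > 1 else None
    return None


def uninterruptible_pids(proc_root="/proc"):
    """Yield the PID of every process whose state letter is D."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                if parse_state(f.read()) == "D":
                    yield int(entry)
        except OSError:
            # The process exited between listdir() and open().
            continue


if __name__ == "__main__":
    if os.path.isdir("/proc"):
        print(sorted(uninterruptible_pids()))
```

Reading procfs directly avoids depending on the column layout of ps, at the cost of only seeing a snapshot.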

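3.2. What was actually done

  • Following the two write-ups above, the customer first restored the NFS service and then, on the host, unmounted the corresponding mounts listed in /etc/mtab. The hang cleared, the zombie processes disappeared, and iowait came back down.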
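For the unmount step, here is a sketch that lists the NFS mounts found in /proc/mounts and prints a candidate umount command for each one, without running anything. The -f (force) and -l (lazy) flags are my assumption; the post does not record which options the customer actually used.

```python
import shlex


def nfs_mount_points(mounts_text):
    """Return mount points whose filesystem type is nfs or nfs4."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Line format: device mountpoint fstype options dump pass
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            # Spaces in paths are stored as the octal escape \040.
            points.append(fields[1].replace("\\040", " "))
    return points


def umount_commands(points):
    """Build (but do not run) one umount command line per mount point."""
    return ["umount -f -l %s" % shlex.quote(p) for p in points]


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as f:
            text = f.read()
    except OSError:
        text = ""
    for command in umount_commands(nfs_mount_points(text)):
        print(command)
```

Printing the commands instead of executing them leaves the decision with the operator, which seems safer on a node that is already misbehaving.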

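4. Follow-up improvements

  • Be cautious about using NFS mounts to persist container logs. If the NFS service hangs, restore it first, then check monitoring to see whether dependent services are affected.
  • Add the intr option to hard mounts: "The -intr option is set by default for all mounts. If a program hangs with a server not responding message, you can terminate the program with the keyboard interrupt Control-C." (This remains to be verified, because Red Hat no longer recommends the option: https://access.redhat.com/solutions/157873 . The nfs(5) man page also documents intr/nointr as ignored after kernel 2.6.25.)
  • I will still paste the explanation referred to above, although these may be outdated settings. NFS mount options: when mounting NFS on a client, you choose between a soft mount and a hard mount. What is the difference? When the network or the NFS server fails, a hard mount makes programs on the NFS client hang, whereas a soft mount does not. soft-mount: when an NFS operation from the client does not succeed, it is retried the number of times set by retrans. If all retrans attempts fail, the operation is abandoned and the error "Connect time out" is returned.

    hard-mount: when an NFS operation from the client does not succeed, it is retried indefinitely until the NFS server responds. hard-mount is the system default. When choosing hard-mount, it is best to also choose intr, which allows system calls to be interrupted and avoids hanging the system. When the NFS server cannot respond to a hard-mounted client's requests, the NFS client shows

    "NFS server hostname not responding, still trying"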

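To see which recovery behaviour the existing mounts on a node have, the options column of /proc/mounts can be inspected. The sketch below is one way to do that; the function names and the shape of the returned dictionary are my own choices.

```python
def nfs_recovery_settings(options):
    """Summarize the recovery-related options of one NFS option string."""
    opts = {}
    for item in options.split(","):
        key, _, value = item.partition("=")
        opts[key] = value
    return {
        # Per nfs(5), hard behaviour applies when neither soft nor hard is given.
        "mode": "soft" if "soft" in opts else "hard",
        "timeo": opts.get("timeo"),      # tenths of a second, as a string
        "retrans": opts.get("retrans"),  # retry count, as a string
    }


def report(mounts_text):
    """Yield (mount point, settings) for each nfs/nfs4 line of mounts text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] in ("nfs", "nfs4"):
            yield fields[1], nfs_recovery_settings(fields[3])


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as f:
            text = f.read()
    except OSError:
        text = ""
    for point, settings in report(text):
        print(point, settings)
```

A mount reported as hard will keep retrying while the server is down, which is the behaviour that produced the stuck processes in this incident.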
Tags: iowait nfs
Last updated: June 20, 2020

wanjie

COPYRIGHT © 2008-2025 wanjie.info. ALL RIGHTS RESERVED.
