How to detect the processes to consume I/O of CPU on Linux server

This is the story that I investigated the reason when certain web server had been dead very frequently.

Symptoms

  • One of web server had been dead frequently.
  • "Dead" means we can't open website on the browsers, can't ssh, can't ping. They all occurred timeout.
  • All we did for recovery was rebooting the server.


What I researched after reboot

Check access / error logs of web server

In order to confirm this server was attacked or not, I checked access / error logs. This server uses Apache so I checked apache logs.
In fact, I was able to find lots of error logs but I found they can't be the reason (at least root cause) because I found the same logs in the other time than dead time.

Server resources

In order to confirm the server resources status, the most basic command is top command but it's just a snapshot. We can't find the past data basically.
Sar command is quite convenient. With using it, we can find several types of server performance including the past data.

If your server doesn't include sysstat package, you can install it with yum command.
# yum -y install sysstat

Memory

# sar -r
07時40分01秒 kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
07時50分01秒   1016128   2841240     73.66    148620   1052292   2046452     16.98
08時00分01秒    925436   2931932     76.01    150236   1053164   2131820     17.69

10時20分01秒    786540   3070828     79.61    175720   1092320   2202388     18.28
10時30分01秒    846600   3010768     78.05    176892   1093136   2140968     17.77
10時52分15秒     88640   3768728     97.70      2680     34836  13792364    114.47
11時05分32秒     88232   3769136     97.71      2676     36556  14498608    120.33

15時33分34秒     88184   3769184     97.71       780     15944  15577636    129.28
16時07分20秒     88108   3769260     97.72      1168     26596  15596068    129.43
19時19分32秒     88724   3768644     97.70      1780     21972  16252548    134.88
19時31分13秒     89272   3768096     97.69      1496     22752  15153816    125.76
平均値:     930214   2927154     75.88     96410    795749   4463894     37.05

19時48分07秒       LINUX RESTART

19時50分02秒 kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
20時00分01秒   1594480   2262888     58.66     33980   1050588   1741996     14.46
20時10分01秒   1538900   2318468     60.10     37868   1072128   1739820     14.44
20時20分01秒   1240676   2616692     67.84     43060   1082476   2025780     16.81
平均値:    1458019   2399349     62.20     38303   1068397   1835865     15.24
Can't find the strange figures.

Traffic

# sar -n DEV
14時04分39秒     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
14時11分39秒        lo      0.10      0.10      0.01      0.01      0.00      0.00      0.00

10時30分01秒      eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
10時52分15秒        lo      0.42      0.42      0.03      0.03      0.00      0.00      0.00
10時52分15秒      eth0      7.09      4.55      0.91      3.84      0.00      0.00      1.42
10時52分15秒      eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11時05分32秒        lo      0.11      0.11      0.01      0.01      0.00      0.00      0.00
Nothing special.

CPU

# sar -u
07時40分01秒     CPU     %user     %nice   %system   %iowait    %steal     %idle
07時50分01秒     all      4.84      0.00      0.40      0.99      0.00     93.77
08時00分01秒     all      4.48      0.00      0.31      0.83      0.00     94.39

10時20分01秒     all      6.37      0.00      0.35      1.09      0.00     92.19
10時30分01秒     all      6.11      0.00      0.35      1.03      0.00     92.51
10時52分15秒     all      3.19      0.00      0.52     65.46      0.00     30.84
11時05分32秒     all      0.63      0.00      0.29     97.48      0.00      1.60
11時55分49秒     all      0.25      0.00      0.16     99.30      0.00      0.29
Oh...! iowait is obviously strange and almost 100%.

Load average

# sar -q
07時40分01秒   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
07時50分01秒         2       346      0.05      0.13      0.16

10時20分01秒         2       346      0.13      0.31      0.25
10時30分01秒         3       347      0.20      0.27      0.24
10時52分15秒         1       678    186.74    170.94    108.57
11時05分32秒         0       674    180.50    181.70    151.83
Load average is over 180!
# nproc
4
This has 4 cores so it should be under 4. I can understand 180 is very dangerous value.

Swap

# sar -W
07時40分01秒  pswpin/s pswpout/s
10時20分01秒      0.00      0.00
10時30分01秒      0.00      0.00
10時52分15秒    565.99   1460.19
11時05分32秒    538.52    432.01
At the same timing, this server started to consume swap too.

Summary with these investigation

The root cause is iowait of CPU, and consume all swap too.
Assumed reasons of this symptom are:
  • Application itself access to the disk very slowly.
  • Due to lack of memory, it caused very frequent  access to swap and it got slow.

What I needed to detect

What process occurred this high consumption of I/O?


Further investigation

The way to detect processes that consumed I/O

In general, we can detect the processes to consume I/O with this way for example.
#Processess with high I/O
find /proc -name io | xargs egrep "write|read" | sort -n -k 2

#Detect the programs with process ID
ls -l /proc/1 | awk 'length($11)!=0 {print $11}'

#To which files does this program read / write?
#An example with the process ID = 1536
strace -p 1536 -f -e pwrite,pread

Get the history of top command

Unfortunately, in this case, we restarted the server so the server didn't remain the process ID. So I tried to capture the history of the result of top command.

We can use batch mode of top command for that like this.
# top -b -n 1 > /var/log/top/top.log
Then I got the result like this for example.
top - 09:10:25 up 13:22,  0 users,  load average: 71.44, 23.13, 8.19
Tasks: 343 total,   1 running, 342 sleeping,   0 stopped,   0 zombie
Cpu(s):  4.7%us,  0.4%sy,  0.0%ni, 93.6%id,  1.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3857368k total,  3769264k used,    88104k free,     3664k buffers
Swap:  8191996k total,  1948252k used,  6243744k free,    32212k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1600 apache    20   0  253m  49m 3584 D  0.9  1.3   0:00.16 httpd
 1609 apache    20   0  257m  52m 3548 D  0.9  1.4   0:00.16 httpd
 1504 apache    20   0  261m  45m 3572 D  0.5  1.2   0:00.19 httpd
 1536 apache    20   0  263m  55m 3556 D  0.5  1.5   0:00.20 httpd
 1538 apache    20   0  234m  30m 3576 S  0.5  0.8   0:00.09 httpd
 1540 apache    20   0  257m  48m 3548 D  0.5  1.3   0:00.18 httpd
 1541 apache    20   0  262m  53m 3568 D  0.5  1.4   0:00.20 httpd
If you liked this article

Let's subscribe the updates of Scuti!
Share on Google Plus

About Tomohide Kakeya

This is a short description in the author block about the author. You edit it by entering text in the "Biographical Info" field in the user admin panel.
    Blogger Comment
    Facebook Comment

0 Comments:

Post a Comment