Symptoms
- One of web server had been dead frequently.
- "Dead" means we can't open website on the browsers, can't ssh, can't ping. They all occurred timeout.
- All we did for recovery was rebooting the server.
What I researched after reboot
Check access / error logs of web server
In order to confirm this server was attacked or not, I checked access / error logs. This server uses Apache so I checked apache logs.
In fact, I was able to find lots of error logs but I found they can't be the reason (at least root cause) because I found the same logs in the other time than dead time.
Server resources
In order to confirm the server resources status, the most basic command is top command but it's just a snapshot. We can't find the past data basically.
Sar command is quite convenient. With using it, we can find several types of server performance including the past data.
If your server doesn't include sysstat package, you can install it with yum command.
# yum -y install sysstat
Memory
# sar -r 07時40分01秒 kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit 07時50分01秒 1016128 2841240 73.66 148620 1052292 2046452 16.98 08時00分01秒 925436 2931932 76.01 150236 1053164 2131820 17.69 10時20分01秒 786540 3070828 79.61 175720 1092320 2202388 18.28 10時30分01秒 846600 3010768 78.05 176892 1093136 2140968 17.77 10時52分15秒 88640 3768728 97.70 2680 34836 13792364 114.47 11時05分32秒 88232 3769136 97.71 2676 36556 14498608 120.33 15時33分34秒 88184 3769184 97.71 780 15944 15577636 129.28 16時07分20秒 88108 3769260 97.72 1168 26596 15596068 129.43 19時19分32秒 88724 3768644 97.70 1780 21972 16252548 134.88 19時31分13秒 89272 3768096 97.69 1496 22752 15153816 125.76 平均値: 930214 2927154 75.88 96410 795749 4463894 37.05 19時48分07秒 LINUX RESTART 19時50分02秒 kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit 20時00分01秒 1594480 2262888 58.66 33980 1050588 1741996 14.46 20時10分01秒 1538900 2318468 60.10 37868 1072128 1739820 14.44 20時20分01秒 1240676 2616692 67.84 43060 1082476 2025780 16.81 平均値: 1458019 2399349 62.20 38303 1068397 1835865 15.24
Can't find the strange figures.
Traffic
# sar -n DEV 14時04分39秒 IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s 14時11分39秒 lo 0.10 0.10 0.01 0.01 0.00 0.00 0.00 10時30分01秒 eth1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 10時52分15秒 lo 0.42 0.42 0.03 0.03 0.00 0.00 0.00 10時52分15秒 eth0 7.09 4.55 0.91 3.84 0.00 0.00 1.42 10時52分15秒 eth1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 11時05分32秒 lo 0.11 0.11 0.01 0.01 0.00 0.00 0.00Nothing special.
CPU
# sar -u 07時40分01秒 CPU %user %nice %system %iowait %steal %idle 07時50分01秒 all 4.84 0.00 0.40 0.99 0.00 93.77 08時00分01秒 all 4.48 0.00 0.31 0.83 0.00 94.39 10時20分01秒 all 6.37 0.00 0.35 1.09 0.00 92.19 10時30分01秒 all 6.11 0.00 0.35 1.03 0.00 92.51 10時52分15秒 all 3.19 0.00 0.52 65.46 0.00 30.84 11時05分32秒 all 0.63 0.00 0.29 97.48 0.00 1.60 11時55分49秒 all 0.25 0.00 0.16 99.30 0.00 0.29Oh...! iowait is obviously strange and almost 100%.
Load average
# sar -q 07時40分01秒 runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15 07時50分01秒 2 346 0.05 0.13 0.16 10時20分01秒 2 346 0.13 0.31 0.25 10時30分01秒 3 347 0.20 0.27 0.24 10時52分15秒 1 678 186.74 170.94 108.57 11時05分32秒 0 674 180.50 181.70 151.83
# nproc 4This has 4 cores so it should be under 4. I can understand 180 is very dangerous value.
Swap
# sar -W 07時40分01秒 pswpin/s pswpout/s 10時20分01秒 0.00 0.00 10時30分01秒 0.00 0.00 10時52分15秒 565.99 1460.19 11時05分32秒 538.52 432.01At the same timing, this server started to consume swap too.
Summary with these investigation
The root cause is iowait of CPU, and consume all swap too.Assumed reasons of this symptom are:
- Application itself access to the disk very slowly.
- Due to lack of memory, it caused very frequent access to swap and it got slow.
What I needed to detect
What process occurred this high consumption of I/O?Further investigation
The way to detect processes that consumed I/O
In general, we can detect the processes to consume I/O with this way for example.#Processess with high I/O find /proc -name io | xargs egrep "write|read" | sort -n -k 2 #Detect the programs with process ID ls -l /proc/1 | awk 'length($11)!=0 {print $11}' #To which files does this program read / write? #An example with the process ID = 1536 strace -p 1536 -f -e pwrite,pread
Get the history of top command
Unfortunately, in this case, we restarted the server so the server didn't remain the process ID. So I tried to capture the history of the result of top command.We can use batch mode of top command for that like this.
# top -b -n 1 > /var/log/top/top.logThen I got the result like this for example.
top - 09:10:25 up 13:22, 0 users, load average: 71.44, 23.13, 8.19 Tasks: 343 total, 1 running, 342 sleeping, 0 stopped, 0 zombie Cpu(s): 4.7%us, 0.4%sy, 0.0%ni, 93.6%id, 1.3%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 3857368k total, 3769264k used, 88104k free, 3664k buffers Swap: 8191996k total, 1948252k used, 6243744k free, 32212k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 1600 apache 20 0 253m 49m 3584 D 0.9 1.3 0:00.16 httpd 1609 apache 20 0 257m 52m 3548 D 0.9 1.4 0:00.16 httpd 1504 apache 20 0 261m 45m 3572 D 0.5 1.2 0:00.19 httpd 1536 apache 20 0 263m 55m 3556 D 0.5 1.5 0:00.20 httpd 1538 apache 20 0 234m 30m 3576 S 0.5 0.8 0:00.09 httpd 1540 apache 20 0 257m 48m 3548 D 0.5 1.3 0:00.18 httpd 1541 apache 20 0 262m 53m 3568 D 0.5 1.4 0:00.20 httpd
0 Comments:
Post a Comment