昔我往矣

利用strace做PHP故障排查

2015年11月24日

tools.jpg
收到报警线上运行PHP程序的8核服务器load average达到200多,这个太吓人了。赶紧登录系统运行top,看到好几个php-fpm进程CPU利用率达到100%,看到大多数CPU都消耗在内核态,即sys模式,运行状态良好的程序大多时候应该运行在用户态才对呀。

用strace看看进程内部的系统调用,为了方便重复查看,把输出重定向到临时文件/tmp/php.debug中:

# strace -p <PID> 2> /tmp/php.debug

用vim查看调试数据,发现有较多的lstat,但是具体是多少呢?运行命令查看下:

#  
awk '{if($0~/lstat/){a++}}END{print a,NR,a/NR}' /tmp/php.debug
97183 106370 0.913632

10万多个系统调用,其中97万多是lstat,占了全部的91%还多,为什么会有产生这个问题?经过一番Google,找到这篇文章**[
Running PHP on NFS: huge performance problems and one simple solution.
](http://php.webtutor.pl/en/2011/06/02/running-php-on-nfs-huge-performance-problems-and-one-simple-solution/)**

文章中解释的很清楚,PHP有个bug,在php.ini中启用open_basedir之后,会导致realpath_cache_size和realpath_cache_ttl失效。realpath_cache参数的作用是缓存文件绝对路径,如果该参数失效,php会在每次处理请求的时候使用lstat系统调用去获取文件信息,这就是大量lstat的由来,一般服务器上都是尽量调大这两个参数。但是,前几天,安全部门让加上open_basedir限制php访问的系统路径。这就解释了为什么每到访问高峰,系统的load就能一路狂飚到200多。

为了确认这个问题,在测试环境做了一次对照实验,用ab对php做压力测试,发起1w个请求,300个并发,在使用open_basedir的时候。

$ ./ab -n 10000 -c 300 http://www.xnow.com/index.php
...
Time taken for tests:   189.941 seconds
Complete requests:      10000
Failed requests:        38
   (Connect: 0, Receive: 0, Length: 38, Exceptions: 0)
Write errors:           0
Total transferred:      774878208 bytes
HTML transferred:       763258208 bytes
Requests per second:    52.65 [#/sec] (mean)
Time per request:       5698.245 [ms] (mean)
Time per request:       18.994 [ms] (mean, across all concurrent requests)
Transfer rate:          3983.95 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   2.3      0      23
Processing:  2984 5667 708.6   5458   30421
Waiting:     2952 5391 701.3   5195   30413
Total:       2984 5667 708.7   5459   30422

Percentage of the requests served within a certain time (ms)
  50%   5459
  66%   5700
  75%   6043
  80%   6305
  90%   6724
  95%   6928
  98%   7121
  99%   7288
 100%  30422 (longest request)

去掉open_basedir之后的压力测试结果

$ ./ab -n 10000 -c 300 http://www.xnow.com/index.php
...
Concurrency Level:      300
Time taken for tests:   154.962 seconds
Complete requests:      10000
Failed requests:        30
   (Connect: 0, Receive: 0, Length: 30, Exceptions: 0)
Write errors:           0
Total transferred:      774933832 bytes
HTML transferred:       763312670 bytes
Requests per second:    64.53 [#/sec] (mean)
Time per request:       4648.870 [ms] (mean)
Time per request:       15.496 [ms] (mean, across all concurrent requests)
Transfer rate:          4883.58 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   2.3      0      23
Processing:  1548 4623 679.4   4409    7268
Waiting:     1357 4310 650.1   4101    6517
Total:       1551 4624 679.6   4410    7268

Percentage of the requests served within a certain time (ms)
  50%   4410
  66%   4615
  75%   4964
  80%   5245
  90%   5724
  95%   5943
  98%   6125
  99%   6256
 100%   7268 (longest request)

两次对比起来看,在响应速度方面,去掉open_basedir确实有性能提升,同样用strace统计系统调用的情况,结果如下,一共发生了不到2w次的系统请求,其中lstat被调用了199次,占总调用次数的1%

awk '{if($0~/lstat/){a++}}END{print a,NR,a/NR}'  /tmp/php.strace 
199 19423 0.0102456

而且top中看到89.0%us,表明大多数的CPU都消耗在用户态了,但压力还是挺大!

参考文档:
Running PHP on NFS: huge performance problems and one simple solution.
手把手教你用Strace诊断问题

当前暂无评论 »

添加新评论 »