利用strace做PHP故障排查
收到报警线上运行PHP程序的8核服务器load average达到200多,这个太吓人了。赶紧登录系统运行top,看到好几个php-fpm进程CPU利用率达到100%,看到大多数CPU都消耗在内核态,即sys模式,运行状态良好的程序大多时候应该运行在用户态才对呀。
用strace看看进程内部的系统调用,为了方便重复查看,把输出重定向到临时文件/tmp/php.debug中:
# strace -p <PID> 2> /tmp/php.debug
用vim查看调试数据,发现有较多的lstat,但是具体是多少呢?运行命令查看下:
#
awk '{if($0~/lstat/){a++}}END{print a,NR,a/NR}' /tmp/php.debug
97183 106370 0.913632
10万多个系统调用,其中97万多是lstat,占了全部的91%还多,为什么会有产生这个问题?经过一番Google,找到这篇文章**[
Running PHP on NFS: huge performance problems and one simple solution.
](http://php.webtutor.pl/en/2011/06/02/running-php-on-nfs-huge-performance-problems-and-one-simple-solution/)**
文章中解释的很清楚,PHP有个bug,在php.ini中启用open_basedir之后,会导致realpath_cache_size和realpath_cache_ttl失效。realpath_cache参数的作用是缓存文件绝对路径,如果该参数失效,php会在每次处理请求的时候使用lstat系统调用去获取文件信息,这就是大量lstat的由来,一般服务器上都是尽量调大这两个参数。但是,前几天,安全部门让加上open_basedir限制php访问的系统路径。这就解释了为什么每到访问高峰,系统的load就能一路狂飚到200多。
为了确认这个问题,在测试环境做了一次对照实验,用ab对php做压力测试,发起1w个请求,300个并发,在使用open_basedir的时候。
$ ./ab -n 10000 -c 300 http://www.xnow.com/index.php
...
Time taken for tests: 189.941 seconds
Complete requests: 10000
Failed requests: 38
(Connect: 0, Receive: 0, Length: 38, Exceptions: 0)
Write errors: 0
Total transferred: 774878208 bytes
HTML transferred: 763258208 bytes
Requests per second: 52.65 [#/sec] (mean)
Time per request: 5698.245 [ms] (mean)
Time per request: 18.994 [ms] (mean, across all concurrent requests)
Transfer rate: 3983.95 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 2.3 0 23
Processing: 2984 5667 708.6 5458 30421
Waiting: 2952 5391 701.3 5195 30413
Total: 2984 5667 708.7 5459 30422
Percentage of the requests served within a certain time (ms)
50% 5459
66% 5700
75% 6043
80% 6305
90% 6724
95% 6928
98% 7121
99% 7288
100% 30422 (longest request)
去掉open_basedir之后的压力测试结果
$ ./ab -n 10000 -c 300 http://www.xnow.com/index.php
...
Concurrency Level: 300
Time taken for tests: 154.962 seconds
Complete requests: 10000
Failed requests: 30
(Connect: 0, Receive: 0, Length: 30, Exceptions: 0)
Write errors: 0
Total transferred: 774933832 bytes
HTML transferred: 763312670 bytes
Requests per second: 64.53 [#/sec] (mean)
Time per request: 4648.870 [ms] (mean)
Time per request: 15.496 [ms] (mean, across all concurrent requests)
Transfer rate: 4883.58 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 2.3 0 23
Processing: 1548 4623 679.4 4409 7268
Waiting: 1357 4310 650.1 4101 6517
Total: 1551 4624 679.6 4410 7268
Percentage of the requests served within a certain time (ms)
50% 4410
66% 4615
75% 4964
80% 5245
90% 5724
95% 5943
98% 6125
99% 6256
100% 7268 (longest request)
两次对比起来看,在响应速度方面,去掉open_basedir确实有性能提升,同样用strace统计系统调用的情况,结果如下,一共发生了不到2w次的系统请求,其中lstat被调用了199次,占总调用次数的1%
awk '{if($0~/lstat/){a++}}END{print a,NR,a/NR}' /tmp/php.strace
199 19423 0.0102456
而且top中看到89.0%us,表明大多数的CPU都消耗在用户态了,但压力还是挺大!
参考文档:
Running PHP on NFS: huge performance problems and one simple solution.
手把手教你用Strace诊断问题
当前暂无评论 »