SLAC CPE Software
Engineering Group |
||||||||||
|
|
|
This is Luciano's debugging session to find a process that was making too many requests to his IOC
|
1.9.5. OTRDMP DebuggingSymptom: Purple PVs and Matlab Profile Monitor GUI can't get images. Logged to ioc-dmp1-pm01 and verified current CPU load with the 'rtems_cpu_usage_report()' command. This reports the usage since reboot or since the last time 'rtems_cpu_usage_reset()' was invoked: Cexp@ioc-dmp1-pm01>rtems_cpu_usage_report() CPU Usage by thread ID NAME SECONDS PERCENT 0x09010001 IDLE 29.299800 0.901 .. 0x0A010002 ntwk 81.899736 2.519 0x0A010003 MVEd 172.744188 5.314 ... 0x0A010012 OTRD 1070.112497 32.923 ... 0x0A010014 evrR 51.466433 1.583 ... 0x0A01001A scan 116.708529 3.590 ... <2 scan processes with little CPU usage> 0x0A01001D scan 11.813567 0.363 ... <52 CAS- processes with little CPU usage> ... <2 CAC- processes with little CPU usage> 0x0A010024 save 1.477819 0.045 ... 0x0A01002D CAS- 391.056027 12.031 ... 0x0A010049 CAS- 501.076111 15.416 ... 0x0A01004B CAS- 499.279729 15.360 ... Time since last CPU Usage reset 3250.346453 seconds 0x00000000 (0) Cexp@ioc-dmp1-pm01> Did a reset and another report: Cexp@ioc-dmp1-pm01>rtems_cpu_usage_reset() 0x00000000 (0) Cexp@ioc-dmp1-pm01>rtems_cpu_usage_report() CPU Usage by thread ID NAME SECONDS PERCENT 0x09010001 IDLE 0.000000 0.000 0x0A010001 UI1 0.001182 0.085 0x0A010002 ntwk 0.032901 2.372 0x0A010003 MVEd 0.068603 4.947 0x0A010004 RPCd 0.000000 0.000 0x0A010005 PMAP 0.000000 0.000 0x0A010006 TNTD 0.000000 0.000 0x0A010007 CLKd 0.005735 0.413 0x0A010008 NTPd 0.000000 0.000 0x0A010009 Imsg 0.000000 0.000 0x0A01000A Cloc 0.000000 0.000 0x0A01000B task 0.000000 0.000 0x0A01000C errl 0.000000 0.000 0x0A01000D time 0.000021 0.001 0x0A01000E cbLo 0.000000 0.000 0x0A01000F cbMe 0.000000 0.000 0x0A010010 cbHi 0.000504 0.036 0x0A010011 dbCa 0.000000 0.000 0x0A010012 OTRD 0.454533 32.777 0x0A010013 evrT 0.008014 0.577 0x0A010014 evrR 0.022075 1.591 0x0A010015 time 0.000000 0.000 0x0A010016 scan 0.000000 0.000 0x0A010017 scan 0.000000 0.000 0x0A010018 scan 0.000000 0.000 0x0A010019 scan 0.000073 0.005 0x0A01001A scan 0.074280 5.356 0x0A01001B scan 0.000054 0.003 0x0A01001C scan 0.000119 0.008 0x0A01001D scan 0.005115 0.368 0x0A01001E CAS- 0.000000 0.000 0x0A01001F CAS- 0.000000 0.000 0x0A010020 CAS- 0.000000 0.000 0x0A010021 CAC- 0.000000 0.000 0x0A010024 save 0.000428 0.030 0x0A010025 CAC- 0.000000 0.000 0x0A010028 CAS- 0.000000 0.000 0x0A010029 CAS- 0.000000 0.000 0x0A01002C logR 0.000000 0.000 0x0A01002D CAS- 0.280594 20.234 0x0A01002E CAS- 0.000000 0.000 0x0A01002F CAS- 0.000000 0.000 0x0A010030 CAS- 0.000000 0.000 0x0A010031 CAS- 0.000000 0.000 0x0A010032 CAS- 0.000000 0.000 0x0A010033 CAS- 0.000000 0.000 0x0A010034 CAS- 0.001025 0.073 0x0A010035 CAS- 0.000000 0.000 0x0A010036 CAS- 0.000000 0.000 0x0A01003D CAS- 0.000000 0.000 0x0A01003E CAS- 0.000000 0.000 0x0A01003F CAS- 0.000000 0.000 0x0A010040 CAS- 0.000147 0.010 0x0A010041 CAS- 0.000000 0.000 0x0A010042 CAS- 0.000000 0.000 0x0A010043 CAS- 0.000000 0.000 0x0A010044 CAS- 0.000000 0.000 0x0A010045 CAS- 0.000000 0.000 0x0A010046 CAS- 0.000000 0.000 0x0A010047 CAS- 0.000000 0.000 0x0A010048 CAS- 0.000000 0.000 0x0A010049 CAS- 0.091357 6.587 0x0A01004A CAS- 0.000000 0.000 0x0A01004B CAS- 0.340195 24.532 0x0A01004C CAS- 0.000000 0.000 0x0A01004D CAS- 0.000000 0.000 0x0A01004E CAS- 0.000000 0.000 0x0A010051 CAS- 0.000000 0.000 0x0A010052 CAS- 0.000000 0.000 0x0A010053 CAS- 0.000000 0.000 0x0A010054 CAS- 0.000000 0.000 0x0A010055 CAS- 0.000126 0.009 0x0A010056 CAS- 0.000000 0.000 0x0A01005B CAS- 0.000000 0.000 0x0A01005C CAS- 0.000000 0.000 0x0A01005F CAS- 0.000000 0.000 0x0A010060 CAS- 0.000000 0.000 0x0A010065 CAS- 0.000000 0.000 0x0A010066 CAS- 0.000000 0.000 0x0A010067 CAS- 0.000000 0.000 0x0A010068 CAS- 0.000000 0.000 0x0A010069 CAS- 0.000000 0.000 0x0A01006A CAS- 0.000000 0.000 0x0A01006B CAS- 0.000000 0.000 0x0A01006C CAS- 0.000000 0.000 0x0A01006D CAS- 0.000146 0.010 0x0A01006E CAS- 0.000000 0.000 0x0A01006F CAS- 0.000000 0.000 0x0A010070 CAS- 0.000000 0.000 Time since last CPU Usage reset 1.386725 seconds 0x00000000 (0) Cexp@ioc-dmp1-pm01> Next had to figure out which command prints information on the open sockets, to find who is requesting data from the IOC: Cexp@ioc-dmp1-pm01>lkup("bsdnet_s") ===== In module 'SYSTEM-BUILTIN' (0x01962020) =====: 0x125a68[ 48]: long (*)() rtems_bsdnet_schednetisr 0x125664[ 56]: long (*)() rtems_bsdnet_seconds_since_boot 0x125ba4[ 512]: long (*)() rtems_bsdnet_semaphore_obtain 0x125b34[ 112]: long (*)() rtems_bsdnet_semaphore_release 0x127b98[ 760]: long (*)() rtems_bsdnet_show_icmp_stats 0x127f00[ 1024]: long (*)() rtems_bsdnet_show_if_stats 0x128954[ 744]: long (*)() rtems_bsdnet_show_inet_routes 0x128300[ 1060]: long (*)() rtems_bsdnet_show_ip_stats 0x128724[ 560]: long (*)() rtems_bsdnet_show_mbuf_stats 0x128e30[ 2460]: long (*)() rtems_bsdnet_show_tcp_stats 0x1297cc[ 460]: long (*)() rtems_bsdnet_show_udp_stats 0xf195c[ 116]: long (*)() rtems_bsdnet_synchronize_ntp ===== In module 'miscUtils.obj' (0x0198ccf8) =====: 0x22f424[ 1428]: long (*)() rtems_bsdnet_show_socket_stats USER VARIABLES: 0x00000000 (0) Cexp@ioc-dmp1-pm01>rtems_bsdnet_show_socket_stats() 3: 0.0.0.0: 1024 <-> 0.0.0.0: 0 4: 0.0.0.0: 1023 <-> 0.0.0.0: 0 5: 172.27.10.55: 5068 <-> 172.27.8.26:39234 6: 0.0.0.0: 111 <-> 0.0.0.0: 0 7: 0.0.0.0: 111 <-> 0.0.0.0: 0 8: 0.0.0.0: 23 <-> 0.0.0.0: 0 9: 0.0.0.0: 1027 <-> 0.0.0.0: 0 11: 0.0.0.0: 1028 <-> 0.0.0.0: 0 12: 172.27.10.55: 1029 <-> 134.79.151.21: 5069 13: 0.0.0.0: 5068 <-> 0.0.0.0: 0 14: 0.0.0.0: 5068 <-> 0.0.0.0: 0 15: 172.27.10.55: 5068 <-> 172.27.8.46:36288 17: 0.0.0.0: 1030 <-> 0.0.0.0: 0 18: 172.27.10.55: 5068 <-> 172.27.8.30:54139 19: 172.27.10.55: 5068 <-> 172.27.8.26:42723 20: 172.27.10.55: 1024 <-> 172.27.8.31: 7004 21: 172.27.10.55: 5068 <-> 172.27.8.27:34129 22: 172.27.10.55: 5068 <-> 172.27.8.27:34130 23: 172.27.10.55: 5068 <-> 172.27.8.27:34131 24: 172.27.10.55: 5068 <-> 172.27.8.27:34132 25: 172.27.10.55: 5068 <-> 172.27.8.91:34616 26: 172.27.10.55: 5068 <-> 172.27.8.30:37176 29: 172.27.10.55: 5068 <-> 172.27.8.26:39082 30: 172.27.10.55: 5068 <-> 172.27.8.91:34618 31: 172.27.10.55: 5068 <-> 172.27.8.30:54140 32: 172.27.10.55: 5068 <-> 172.27.8.31:60872 33: 172.27.10.55: 5068 <-> 172.27.8.31:60873 34: 172.27.10.55: 5068 <-> 172.27.8.26:39120 35: 172.27.10.55: 5068 <-> 172.27.8.25:47399 36: 172.27.10.55: 5068 <-> 172.27.8.28:48928 37: 172.27.10.55: 5068 <-> 172.27.8.30:54144 38: 172.27.10.55: 5068 <-> 172.27.8.24:51806 40: 172.27.10.55: 5068 <-> 172.27.224.220:40535 41: 172.27.10.55: 5068 <-> 172.27.8.27:34134 42: 172.27.10.55: 5068 <-> 172.27.8.93:54566 43: 172.27.10.55: 5068 <-> 172.27.8.91:37027 44: 172.27.10.55: 5068 <-> 172.27.8.30:54177 45: 172.27.10.55: 5068 <-> 172.27.8.26:35409 0x00000026 (38) Passed an argument to the 'rtems_bsdnet_show_socket_stats()', usually these commands take a verbosity level as argument. Cexp@ioc-dmp1-pm01>rtems_bsdnet_show_socket_stats(1) ... 24: 172.27.10.55: 5068 <-> 172.27.8.27:34132 Type: SOCK_STREAM; Options: 0x000c; Protocol: TCP Receiving buffers: Number of mbufs in chain : 1 Chars of mbufs used : 2176 Number of cluster refs : 1 single references : 1 multiple references : 0 total use of clusters : 1 Transmitting buffers: Number of mbufs in chain : 0 Chars of mbufs used : 0 Number of cluster refs : 0 single references : 0 multiple references : 0 total use of clusters : 0 ... Information for all sockets are displayed, only the entry above shows non-zero values. The suspect is from host 172.27.8.27 port 34132. The host name is lcls-srv03: [softegr@lcls-builder lcls]$ nslookup 172.27.8.27 Server: 134.79.151.12 Address: 134.79.151.12#53 27.8.27.172.in-addr.arpa name = lcls-srv03.slac.stanford.edu. Logged in as physics to lcls-srv03 and used the lsof command. Below is an example, since the actual command was not logged: [physics@lcls-srv03 ~ ]$ /usr/sbin/lsof -i tcp:36263 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME MATLAB 8131 physics 122u IPv4 28712554 0t0 TCP lcls-srv03.slac.stanford.edu:36263->ioc-dmp1-pm01.slac.stanford.edu:5068 (ESTABLISHED) This shows the PID of the process talking to the IOC. My next command was 'kill <PID>', but I should have gathered more information about the process before killing it, e.g.: [physics@lcls-srv03 ~ ]$ ps 8131 PID TTY STAT TIME COMMAND 8131 pts/40 Ssl+ 15:02 /usr/local/matlab/matlab75/bin/glnx86/MATLAB -nosplash -r guardian -logfile /u1/lcls/physics/log/matlab/guardian.log -nodesktop This would have told me which Matlab script was used. After killing the Matlab process the IDLE process on the IOC was getting some CPU time, meaning that the load was not as high as before: Cexp@ioc-dmp1-pm01>rtems_cpu_usage_reset() 0x00000000 (0) Cexp@ioc-dmp1-pm01>rtems_cpu_usage_report() CPU Usage by thread ID NAME SECONDS PERCENT 0x09010001 IDLE 0.372780 19.279 0x0A010001 UI1 0.001484 0.076 0x0A010002 ntwk 0.033682 1.741 0x0A010003 MVEd 0.073222 3.786 0x0A010004 RPCd 0.000339 0.017 0x0A010005 PMAP 0.000000 0.000 0x0A010006 TNTD 0.000000 0.000 0x0A010007 CLKd 0.006973 0.360 0x0A010008 NTPd 0.000000 0.000 0x0A010009 Imsg 0.000000 0.000 0x0A01000A Cloc 0.000000 0.000 0x0A01000B task 0.000000 0.000 0x0A01000C errl 0.000000 0.000 0x0A01000D time 0.000046 0.002 0x0A01000E cbLo 0.005307 0.274 0x0A01000F cbMe 0.000722 0.037 0x0A010010 cbHi 0.000847 0.043 0x0A010011 dbCa 0.000000 0.000 0x0A010012 OTRD 0.636909 32.939 0x0A010013 evrT 0.010186 0.526 0x0A010014 evrR 0.028624 1.480 0x0A010015 time 0.000052 0.002 0x0A010016 scan 0.000000 0.000 0x0A010017 scan 0.000017 0.000 0x0A010018 scan 0.000000 0.000 0x0A010019 scan 0.000085 0.004 0x0A01001A scan 0.073178 3.784 0x0A01001B scan 0.000068 0.003 0x0A01001C scan 0.000173 0.008 0x0A01001D scan 0.006759 0.349 0x0A01001E CAS- 0.000000 0.000 0x0A01001F CAS- 0.000000 0.000 0x0A010020 CAS- 0.001286 0.066 0x0A010021 CAC- 0.000000 0.000 0x0A010024 save 0.001406 0.072 0x0A010025 CAC- 0.000000 0.000 0x0A010028 CAS- 0.000000 0.000 0x0A010029 CAS- 0.000000 0.000 0x0A01002C logR 0.000000 0.000 0x0A01002D CAS- 0.330435 17.089 0x0A01002E CAS- 0.000000 0.000 0x0A01002F CAS- 0.000000 0.000 0x0A010030 CAS- 0.000000 0.000 0x0A010031 CAS- 0.000000 0.000 0x0A010032 CAS- 0.000000 0.000 0x0A010035 CAS- 0.000000 0.000 0x0A010036 CAS- 0.000129 0.006 0x0A01003D CAS- 0.000000 0.000 0x0A01003E CAS- 0.000000 0.000 0x0A01003F CAS- 0.000000 0.000 0x0A010040 CAS- 0.000287 0.014 0x0A010041 CAS- 0.000000 0.000 0x0A010042 CAS- 0.000000 0.000 0x0A010043 CAS- 0.000000 0.000 0x0A010044 CAS- 0.000000 0.000 0x0A010045 CAS- 0.000000 0.000 0x0A010046 CAS- 0.000000 0.000 0x0A010047 CAS- 0.000000 0.000 0x0A010048 CAS- 0.000000 0.000 0x0A010049 CAS- 0.348217 18.009 0x0A01004A CAS- 0.000000 0.000 0x0A010051 CAS- 0.000000 0.000 0x0A010052 CAS- 0.000000 0.000 0x0A010053 CAS- 0.000000 0.000 0x0A010054 CAS- 0.000000 0.000 0x0A010055 CAS- 0.000387 0.020 0x0A010056 CAS- 0.000000 0.000 0x0A01005B CAS- 0.000000 0.000 0x0A01005C CAS- 0.000000 0.000 0x0A01005F CAS- 0.000000 0.000 0x0A010060 CAS- 0.000000 0.000 0x0A010067 CAS- 0.000000 0.000 0x0A010068 CAS- 0.000000 0.000 0x0A010069 CAS- 0.000000 0.000 0x0A01006A CAS- 0.000000 0.000 0x0A01006B CAS- 0.000000 0.000 0x0A01006C CAS- 0.000000 0.000 0x0A01006D CAS- 0.000181 0.009 0x0A01006E CAS- 0.000000 0.000 0x0A01006F CAS- 0.000000 0.000 0x0A010070 CAS- 0.000000 0.000 0x0A010071 CAS- 0.000198 0.010 0x0A010072 CAS- 0.000000 0.000 0x0A010073 CAS- 0.000000 0.000 0x0A010074 CAS- 0.000000 0.000 0x0A010075 CAS- 0.000094 0.004 0x0A010076 CAS- 0.000000 0.000
|
Programmers' Guides, Users' Guides, Requirements, Design, Papers, Administration, How-To, Hardware, IOC, Database
Modified: July 24, 2013