SLAC CPE Software Engineering Group
Stanford Linear Accelerator Center
System Admin

IOC Debugging

 

SLAC Detailed
SLAC Computing
Software Home
Software Detailed
 

 

 

 


This is Luciano's debugging session to find a process that was making too many requests to his IOC

 

 

1.9.5. OTRDMP Debugging

Symptom: Purple PVs and Matlab Profile Monitor GUI can't get images.

Logged to ioc-dmp1-pm01 and verified current CPU load with the 'rtems_cpu_usage_report()' command.

This reports the usage since reboot or since the last time 'rtems_cpu_usage_reset()' was invoked:

Cexp@ioc-dmp1-pm01>rtems_cpu_usage_report()
CPU Usage by thread
   ID            NAME         SECONDS   PERCENT
0x09010001   IDLE            29.299800    0.901
..
0x0A010002   ntwk            81.899736    2.519
0x0A010003   MVEd           172.744188    5.314
...
0x0A010012   OTRD           1070.112497   32.923
...
0x0A010014   evrR            51.466433    1.583
...
0x0A01001A   scan           116.708529    3.590
... <2 scan processes with little CPU usage>
0x0A01001D   scan            11.813567    0.363
... <52 CAS- processes with little CPU usage>
... <2 CAC- processes with little CPU usage>
0x0A010024   save             1.477819    0.045
...
0x0A01002D   CAS-           391.056027   12.031
...
0x0A010049   CAS-           501.076111   15.416
...
0x0A01004B   CAS-           499.279729   15.360
...
Time since last CPU Usage reset 3250.346453 seconds
0x00000000 (0)
Cexp@ioc-dmp1-pm01>

Did a reset and another report:

Cexp@ioc-dmp1-pm01>rtems_cpu_usage_reset()
0x00000000 (0)
Cexp@ioc-dmp1-pm01>rtems_cpu_usage_report()
CPU Usage by thread
   ID            NAME         SECONDS   PERCENT
0x09010001   IDLE             0.000000    0.000
0x0A010001   UI1              0.001182    0.085
0x0A010002   ntwk             0.032901    2.372
0x0A010003   MVEd             0.068603    4.947
0x0A010004   RPCd             0.000000    0.000
0x0A010005   PMAP             0.000000    0.000
0x0A010006   TNTD             0.000000    0.000
0x0A010007   CLKd             0.005735    0.413
0x0A010008   NTPd             0.000000    0.000
0x0A010009   Imsg             0.000000    0.000
0x0A01000A   Cloc             0.000000    0.000
0x0A01000B   task             0.000000    0.000
0x0A01000C   errl             0.000000    0.000
0x0A01000D   time             0.000021    0.001
0x0A01000E   cbLo             0.000000    0.000
0x0A01000F   cbMe             0.000000    0.000
0x0A010010   cbHi             0.000504    0.036
0x0A010011   dbCa             0.000000    0.000
0x0A010012   OTRD             0.454533   32.777
0x0A010013   evrT             0.008014    0.577
0x0A010014   evrR             0.022075    1.591
0x0A010015   time             0.000000    0.000
0x0A010016   scan             0.000000    0.000
0x0A010017   scan             0.000000    0.000
0x0A010018   scan             0.000000    0.000
0x0A010019   scan             0.000073    0.005
0x0A01001A   scan             0.074280    5.356
0x0A01001B   scan             0.000054    0.003
0x0A01001C   scan             0.000119    0.008
0x0A01001D   scan             0.005115    0.368
0x0A01001E   CAS-             0.000000    0.000
0x0A01001F   CAS-             0.000000    0.000
0x0A010020   CAS-             0.000000    0.000
0x0A010021   CAC-             0.000000    0.000
0x0A010024   save             0.000428    0.030
0x0A010025   CAC-             0.000000    0.000
0x0A010028   CAS-             0.000000    0.000
0x0A010029   CAS-             0.000000    0.000
0x0A01002C   logR             0.000000    0.000
0x0A01002D   CAS-             0.280594   20.234
0x0A01002E   CAS-             0.000000    0.000
0x0A01002F   CAS-             0.000000    0.000
0x0A010030   CAS-             0.000000    0.000
0x0A010031   CAS-             0.000000    0.000
0x0A010032   CAS-             0.000000    0.000
0x0A010033   CAS-             0.000000    0.000
0x0A010034   CAS-             0.001025    0.073
0x0A010035   CAS-             0.000000    0.000
0x0A010036   CAS-             0.000000    0.000
0x0A01003D   CAS-             0.000000    0.000
0x0A01003E   CAS-             0.000000    0.000
0x0A01003F   CAS-             0.000000    0.000
0x0A010040   CAS-             0.000147    0.010
0x0A010041   CAS-             0.000000    0.000
0x0A010042   CAS-             0.000000    0.000
0x0A010043   CAS-             0.000000    0.000
0x0A010044   CAS-             0.000000    0.000
0x0A010045   CAS-             0.000000    0.000
0x0A010046   CAS-             0.000000    0.000
0x0A010047   CAS-             0.000000    0.000
0x0A010048   CAS-             0.000000    0.000
0x0A010049   CAS-             0.091357    6.587
0x0A01004A   CAS-             0.000000    0.000
0x0A01004B   CAS-             0.340195   24.532
0x0A01004C   CAS-             0.000000    0.000
0x0A01004D   CAS-             0.000000    0.000
0x0A01004E   CAS-             0.000000    0.000
0x0A010051   CAS-             0.000000    0.000
0x0A010052   CAS-             0.000000    0.000
0x0A010053   CAS-             0.000000    0.000
0x0A010054   CAS-             0.000000    0.000
0x0A010055   CAS-             0.000126    0.009
0x0A010056   CAS-             0.000000    0.000
0x0A01005B   CAS-             0.000000    0.000
0x0A01005C   CAS-             0.000000    0.000
0x0A01005F   CAS-             0.000000    0.000
0x0A010060   CAS-             0.000000    0.000
0x0A010065   CAS-             0.000000    0.000
0x0A010066   CAS-             0.000000    0.000
0x0A010067   CAS-             0.000000    0.000
0x0A010068   CAS-             0.000000    0.000
0x0A010069   CAS-             0.000000    0.000
0x0A01006A   CAS-             0.000000    0.000
0x0A01006B   CAS-             0.000000    0.000
0x0A01006C   CAS-             0.000000    0.000
0x0A01006D   CAS-             0.000146    0.010
0x0A01006E   CAS-             0.000000    0.000
0x0A01006F   CAS-             0.000000    0.000
0x0A010070   CAS-             0.000000    0.000
Time since last CPU Usage reset 1.386725 seconds
0x00000000 (0)
Cexp@ioc-dmp1-pm01>

Next had to figure out which command prints information on the open sockets, to find who is requesting data from the IOC:

Cexp@ioc-dmp1-pm01>lkup("bsdnet_s")
=====  In module 'SYSTEM-BUILTIN' (0x01962020) =====:
  0x125a68[    48]:                               long (*)()  rtems_bsdnet_schednetisr
  0x125664[    56]:                               long (*)()  rtems_bsdnet_seconds_since_boot
  0x125ba4[   512]:                               long (*)()  rtems_bsdnet_semaphore_obtain
  0x125b34[   112]:                               long (*)()  rtems_bsdnet_semaphore_release
  0x127b98[   760]:                               long (*)()  rtems_bsdnet_show_icmp_stats
  0x127f00[  1024]:                               long (*)()  rtems_bsdnet_show_if_stats
  0x128954[   744]:                               long (*)()  rtems_bsdnet_show_inet_routes
  0x128300[  1060]:                               long (*)()  rtems_bsdnet_show_ip_stats
  0x128724[   560]:                               long (*)()  rtems_bsdnet_show_mbuf_stats
  0x128e30[  2460]:                               long (*)()  rtems_bsdnet_show_tcp_stats
  0x1297cc[   460]:                               long (*)()  rtems_bsdnet_show_udp_stats
   0xf195c[   116]:                               long (*)()  rtems_bsdnet_synchronize_ntp
=====  In module 'miscUtils.obj' (0x0198ccf8) =====:
  0x22f424[  1428]:                               long (*)()  rtems_bsdnet_show_socket_stats

USER VARIABLES:
0x00000000 (0)
Cexp@ioc-dmp1-pm01>rtems_bsdnet_show_socket_stats()
3:          0.0.0.0: 1024 <->          0.0.0.0:    0

4:          0.0.0.0: 1023 <->          0.0.0.0:    0

5:     172.27.10.55: 5068 <->      172.27.8.26:39234

6:          0.0.0.0:  111 <->          0.0.0.0:    0

7:          0.0.0.0:  111 <->          0.0.0.0:    0

8:          0.0.0.0:   23 <->          0.0.0.0:    0

9:          0.0.0.0: 1027 <->          0.0.0.0:    0

11:          0.0.0.0: 1028 <->          0.0.0.0:    0

12:     172.27.10.55: 1029 <->    134.79.151.21: 5069

13:          0.0.0.0: 5068 <->          0.0.0.0:    0

14:          0.0.0.0: 5068 <->          0.0.0.0:    0

15:     172.27.10.55: 5068 <->      172.27.8.46:36288

17:          0.0.0.0: 1030 <->          0.0.0.0:    0

18:     172.27.10.55: 5068 <->      172.27.8.30:54139

19:     172.27.10.55: 5068 <->      172.27.8.26:42723

20:     172.27.10.55: 1024 <->      172.27.8.31: 7004

21:     172.27.10.55: 5068 <->      172.27.8.27:34129

22:     172.27.10.55: 5068 <->      172.27.8.27:34130

23:     172.27.10.55: 5068 <->      172.27.8.27:34131

24:     172.27.10.55: 5068 <->      172.27.8.27:34132

25:     172.27.10.55: 5068 <->      172.27.8.91:34616

26:     172.27.10.55: 5068 <->      172.27.8.30:37176

29:     172.27.10.55: 5068 <->      172.27.8.26:39082

30:     172.27.10.55: 5068 <->      172.27.8.91:34618

31:     172.27.10.55: 5068 <->      172.27.8.30:54140

32:     172.27.10.55: 5068 <->      172.27.8.31:60872

33:     172.27.10.55: 5068 <->      172.27.8.31:60873

34:     172.27.10.55: 5068 <->      172.27.8.26:39120

35:     172.27.10.55: 5068 <->      172.27.8.25:47399

36:     172.27.10.55: 5068 <->      172.27.8.28:48928

37:     172.27.10.55: 5068 <->      172.27.8.30:54144

38:     172.27.10.55: 5068 <->      172.27.8.24:51806

40:     172.27.10.55: 5068 <->   172.27.224.220:40535

41:     172.27.10.55: 5068 <->      172.27.8.27:34134

42:     172.27.10.55: 5068 <->      172.27.8.93:54566

43:     172.27.10.55: 5068 <->      172.27.8.91:37027

44:     172.27.10.55: 5068 <->      172.27.8.30:54177

45:     172.27.10.55: 5068 <->      172.27.8.26:35409

0x00000026 (38)

Passed an argument to the 'rtems_bsdnet_show_socket_stats()', usually these commands take a verbosity level as argument.

Cexp@ioc-dmp1-pm01>rtems_bsdnet_show_socket_stats(1)
...
24:     172.27.10.55: 5068 <->      172.27.8.27:34132
   Type: SOCK_STREAM; Options: 0x000c;   Protocol:  TCP
   Receiving buffers:
     Number of mbufs in chain    : 1
     Chars of mbufs used         : 2176
     Number of cluster refs      : 1
        single references        : 1
        multiple references      : 0
        total use of clusters    : 1
   Transmitting buffers:
     Number of mbufs in chain    : 0
     Chars of mbufs used         : 0
     Number of cluster refs      : 0
        single references        : 0
        multiple references      : 0
        total use of clusters    : 0
...

Information for all sockets are displayed, only the entry above shows non-zero values. The suspect is from host 172.27.8.27 port 34132.

The host name is lcls-srv03:

[softegr@lcls-builder lcls]$ nslookup 172.27.8.27
Server:         134.79.151.12
Address:        134.79.151.12#53

27.8.27.172.in-addr.arpa        name = lcls-srv03.slac.stanford.edu.

Logged in as physics to lcls-srv03 and used the lsof command. Below is an example, since the actual command was not logged:

[physics@lcls-srv03 ~ ]$ /usr/sbin/lsof -i tcp:36263
COMMAND  PID    USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
MATLAB  8131 physics  122u  IPv4 28712554      0t0  TCP lcls-srv03.slac.stanford.edu:36263->ioc-dmp1-pm01.slac.stanford.edu:5068 (ESTABLISHED)

This shows the PID of the process talking to the IOC. My next command was 'kill <PID>', but I should have gathered more information about the process before killing it, e.g.:

[physics@lcls-srv03 ~ ]$ ps 8131
  PID TTY      STAT   TIME COMMAND
 8131 pts/40   Ssl+  15:02 /usr/local/matlab/matlab75/bin/glnx86/MATLAB -nosplash -r guardian -logfile /u1/lcls/physics/log/matlab/guardian.log -nodesktop

This would have told me which Matlab script was used.

After killing the Matlab process the IDLE process on the IOC was getting some CPU time, meaning that the load was not as high as before:

Cexp@ioc-dmp1-pm01>rtems_cpu_usage_reset()
0x00000000 (0)
Cexp@ioc-dmp1-pm01>rtems_cpu_usage_report()
CPU Usage by thread
   ID            NAME         SECONDS   PERCENT
0x09010001   IDLE             0.372780   19.279
0x0A010001   UI1              0.001484    0.076
0x0A010002   ntwk             0.033682    1.741
0x0A010003   MVEd             0.073222    3.786
0x0A010004   RPCd             0.000339    0.017
0x0A010005   PMAP             0.000000    0.000
0x0A010006   TNTD             0.000000    0.000
0x0A010007   CLKd             0.006973    0.360
0x0A010008   NTPd             0.000000    0.000
0x0A010009   Imsg             0.000000    0.000
0x0A01000A   Cloc             0.000000    0.000
0x0A01000B   task             0.000000    0.000
0x0A01000C   errl             0.000000    0.000
0x0A01000D   time             0.000046    0.002
0x0A01000E   cbLo             0.005307    0.274
0x0A01000F   cbMe             0.000722    0.037
0x0A010010   cbHi             0.000847    0.043
0x0A010011   dbCa             0.000000    0.000
0x0A010012   OTRD             0.636909   32.939
0x0A010013   evrT             0.010186    0.526
0x0A010014   evrR             0.028624    1.480
0x0A010015   time             0.000052    0.002
0x0A010016   scan             0.000000    0.000
0x0A010017   scan             0.000017    0.000
0x0A010018   scan             0.000000    0.000
0x0A010019   scan             0.000085    0.004
0x0A01001A   scan             0.073178    3.784
0x0A01001B   scan             0.000068    0.003
0x0A01001C   scan             0.000173    0.008
0x0A01001D   scan             0.006759    0.349
0x0A01001E   CAS-             0.000000    0.000
0x0A01001F   CAS-             0.000000    0.000
0x0A010020   CAS-             0.001286    0.066
0x0A010021   CAC-             0.000000    0.000
0x0A010024   save             0.001406    0.072
0x0A010025   CAC-             0.000000    0.000
0x0A010028   CAS-             0.000000    0.000
0x0A010029   CAS-             0.000000    0.000
0x0A01002C   logR             0.000000    0.000
0x0A01002D   CAS-             0.330435   17.089
0x0A01002E   CAS-             0.000000    0.000
0x0A01002F   CAS-             0.000000    0.000
0x0A010030   CAS-             0.000000    0.000
0x0A010031   CAS-             0.000000    0.000
0x0A010032   CAS-             0.000000    0.000
0x0A010035   CAS-             0.000000    0.000
0x0A010036   CAS-             0.000129    0.006
0x0A01003D   CAS-             0.000000    0.000
0x0A01003E   CAS-             0.000000    0.000
0x0A01003F   CAS-             0.000000    0.000
0x0A010040   CAS-             0.000287    0.014
0x0A010041   CAS-             0.000000    0.000
0x0A010042   CAS-             0.000000    0.000
0x0A010043   CAS-             0.000000    0.000
0x0A010044   CAS-             0.000000    0.000
0x0A010045   CAS-             0.000000    0.000
0x0A010046   CAS-             0.000000    0.000
0x0A010047   CAS-             0.000000    0.000
0x0A010048   CAS-             0.000000    0.000
0x0A010049   CAS-             0.348217   18.009
0x0A01004A   CAS-             0.000000    0.000
0x0A010051   CAS-             0.000000    0.000
0x0A010052   CAS-             0.000000    0.000
0x0A010053   CAS-             0.000000    0.000
0x0A010054   CAS-             0.000000    0.000
0x0A010055   CAS-             0.000387    0.020
0x0A010056   CAS-             0.000000    0.000
0x0A01005B   CAS-             0.000000    0.000
0x0A01005C   CAS-             0.000000    0.000
0x0A01005F   CAS-             0.000000    0.000
0x0A010060   CAS-             0.000000    0.000
0x0A010067   CAS-             0.000000    0.000
0x0A010068   CAS-             0.000000    0.000
0x0A010069   CAS-             0.000000    0.000
0x0A01006A   CAS-             0.000000    0.000
0x0A01006B   CAS-             0.000000    0.000
0x0A01006C   CAS-             0.000000    0.000
0x0A01006D   CAS-             0.000181    0.009
0x0A01006E   CAS-             0.000000    0.000
0x0A01006F   CAS-             0.000000    0.000
0x0A010070   CAS-             0.000000    0.000
0x0A010071   CAS-             0.000198    0.010
0x0A010072   CAS-             0.000000    0.000
0x0A010073   CAS-             0.000000    0.000
0x0A010074   CAS-             0.000000    0.000
0x0A010075   CAS-             0.000094    0.004
0x0A010076   CAS-             0.000000    0.000

 

 


 
 
 



 

Programmers' Guides, Users' Guides, Requirements, Design, Papers, Administration, How-To, Hardware, IOC, Database

 

[SLAC CPE Software Engineering Group][ SLAC Home Page]

 

Modified: July 24, 2013