Out-of-sync DNS prevents access to IEPM-BW host at NIIT

Jared Greeno, Les Cottrell, and Booker Bense, August 2007

Reported Problem

After the IP address of the iepm-maggie.niit.edu.pk host at NIIT was changed, nodes at SLAC could no longer reliably access content on it. Requests for web content from the server returned blank pages or 404 errors.

First look

Initial investigation showed that one of SLAC's DNS servers still resolved iepm-maggie to its old address, 202.125.157.212.

pinger@pinger $ dig @134.79.18.40 iepm-maggie.niit.edu.pk.

; <<>> DiG 9.4.1-P1 <<>> @134.79.18.40 iepm-maggie.niit.edu.pk.
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 18449
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 2

;; QUESTION SECTION:
;iepm-maggie.niit.edu.pk. IN A

;; ANSWER SECTION:
iepm-maggie.niit.edu.pk. 85951 IN A 202.125.157.212

;; AUTHORITY SECTION:
niit.edu.pk. 85951 IN NS ns.niit.edu.pk.
niit.edu.pk. 85951 IN NS ns2.niit.edu.pk.

;; ADDITIONAL SECTION:
ns.niit.edu.pk. 81253 IN A 202.125.157.194
ns2.niit.edu.pk. 77112 IN A 203.99.50.202

;; Query time: 1 msec
;; SERVER: 134.79.18.40#53(134.79.18.40)
;; WHEN: Thu Aug 23 10:42:45 2007
;; MSG SIZE rcvd: 124

Another SLAC DNS server returned the current A record, 202.125.157.206, which NIIT confirmed was correct.

pinger@pinger $ dig @134.79.18.45 iepm-maggie.niit.edu.pk.

; <<>> DiG 9.4.1-P1 <<>> @134.79.18.45 iepm-maggie.niit.edu.pk.
; (1 server found)
;; global options: printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 64814
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 2

;; QUESTION SECTION:
;iepm-maggie.niit.edu.pk. IN A

;; ANSWER SECTION:
iepm-maggie.niit.edu.pk. 81218 IN A 202.125.157.206

;; AUTHORITY SECTION:
niit.edu.pk. 81218 IN NS ns2.niit.edu.pk.
niit.edu.pk. 81218 IN NS mail.niit.edu.pk.

;; ADDITIONAL SECTION:
ns2.niit.edu.pk. 63458 IN A 203.99.50.202
mail.niit.edu.pk. 63458 IN A 202.125.157.195

;; Query time: 1 msec
;; SERVER: 134.79.18.45#53(134.79.18.45)
;; WHEN: Thu Aug 23 10:43:26 2007
;; MSG SIZE rcvd: 126

Machines at SLAC that used the name server with the stale cached answer were directed to a different web server at NIIT that did not host the IEPM-BW content.
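
The comparison between the two resolvers can be automated. The following is a minimal Python sketch, not the tooling used at the time: it parses saved dig output rather than querying DNS itself, and the helper names a_records and disagreeing are hypothetical. The two sample lines are the answer lines from the dig runs above.

```python
import re

# One A record line as printed by dig: name, TTL, class IN, type A, IPv4 address.
A_RECORD = re.compile(r"^(\S+)\s+(\d+)\s+IN\s+A\s+(\d{1,3}(?:\.\d{1,3}){3})$")


def a_records(dig_output, name):
    """Return the set of IPv4 addresses that dig reported for `name`."""
    wanted = name.rstrip(".").lower()
    found = set()
    for line in dig_output.splitlines():
        match = A_RECORD.match(line.strip())
        # Comment lines (starting with ';') and records for other names are skipped.
        if match and match.group(1).rstrip(".").lower() == wanted:
            found.add(match.group(3))
    return found


def disagreeing(answers):
    """`answers` maps resolver address -> set of A records; True if any two differ."""
    return len({frozenset(addresses) for addresses in answers.values()}) > 1


# Answer lines copied from the two dig runs above.
from_18_40 = "iepm-maggie.niit.edu.pk. 85951 IN A 202.125.157.212"
from_18_45 = "iepm-maggie.niit.edu.pk. 81218 IN A 202.125.157.206"

answers = {
    "134.79.18.40": a_records(from_18_40, "iepm-maggie.niit.edu.pk"),
    "134.79.18.45": a_records(from_18_45, "iepm-maggie.niit.edu.pk"),
}
print(disagreeing(answers))
```

Comparing sets rather than single addresses keeps the check usable for names that legitimately have several A records.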

Additional Troubleshooting

We contacted SLAC unix-admin and asked them to flush the cache on the errant name server, but this did not help: the bad answer was still cached upstream. We then waited for the cached entries to expire.
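
How long such a wait should take can be estimated from the dig output itself, because the TTL that a caching resolver reports is the time remaining, not the original value. The sketch below is illustrative only: the function name cache_window is hypothetical, and the original TTL of 86400 seconds is an assumption taken from the authoritative answer that ns2.niit.edu.pk returned later in this report.

```python
from datetime import datetime, timedelta


def cache_window(when, remaining_ttl, original_ttl):
    """Return (fetched_at, expires_at) for a cached record.

    `when` is dig's ';; WHEN:' timestamp, `remaining_ttl` is the TTL shown in
    the answer section, and `original_ttl` is the TTL the authoritative server
    hands out (an assumption; a cache does not report it).
    """
    queried = datetime.strptime(when, "%a %b %d %H:%M:%S %Y")
    expires_at = queried + timedelta(seconds=remaining_ttl)
    fetched_at = expires_at - timedelta(seconds=original_ttl)
    return fetched_at, expires_at


# Values from the first dig run against 134.79.18.40.
fetched, expires = cache_window("Thu Aug 23 10:42:45 2007", 85951, 86400)
print("fetched:", fetched)
print("expires:", expires)
```

Under that assumption, the stale entry on 134.79.18.40 had been fetched only about seven and a half minutes before the first query and would not expire on its own until the following morning.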

A day later, we were still seeing inconsistent answers. We then noticed that the authoritative name servers for niit.edu.pk were not responding. We contacted NIIT, and their admins brought the servers back up.

The second listed name server, ns2.niit.edu.pk, still returned the old address for iepm-maggie, though:

dig @ns2.niit.edu.pk iepm-maggie.niit.edu.pk.

; <<>> DiG 9.4.1-P1 <<>> @ns2.niit.edu.pk iepm-maggie.niit.edu.pk.
; (1 server found)
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 18386
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 2

;; QUESTION SECTION:
;iepm-maggie.niit.edu.pk.       IN      A

;; ANSWER SECTION:
iepm-maggie.niit.edu.pk. 86400  IN      A       202.125.157.212

;; AUTHORITY SECTION:
niit.edu.pk.            86400   IN      NS      ns.niit.edu.pk.
niit.edu.pk.            86400   IN      NS      ns2.niit.edu.pk.

;; ADDITIONAL SECTION:
ns.niit.edu.pk.         86400   IN      A       202.125.157.194
ns2.niit.edu.pk.        86400   IN      A       203.99.50.202

;; Query time: 1141 msec
;; SERVER: 203.99.50.202#53(203.99.50.202)
;; WHEN: Mon Aug 27 11:58:55 2007
;; MSG SIZE  rcvd: 124

At this point, ns.niit.edu.pk was returning the correct A record for iepm-maggie.
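
The ns2 response differs from the earlier SLAC responses in one important way: its header carries the aa (authoritative answer) flag, so the stale address came from the zone data on ns2 itself rather than from a cache. A small Python sketch of that check follows. It reads the flags line from saved dig output, and the function names header_flags and is_authoritative are hypothetical.

```python
import re


def header_flags(dig_output):
    """Return the set of flags from dig's ';; flags:' header line."""
    match = re.search(r"^;; flags:\s*([^;]*);", dig_output, re.MULTILINE)
    return set(match.group(1).split()) if match else set()


def is_authoritative(dig_output):
    """True if the response carried the 'aa' (authoritative answer) flag."""
    return "aa" in header_flags(dig_output)


# Header lines copied from the SLAC resolver and ns2.niit.edu.pk responses.
from_slac_cache = ";; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 2"
from_ns2 = ";; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 2"

print(is_authoritative(from_slac_cache))
print(is_authoritative(from_ns2))
```

A wrong answer with the aa flag set cannot be fixed by flushing caches or waiting for a TTL to expire; the zone data on that server has to be corrected.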

Resolution

NIIT updated both servers, and once the TTL on the bad cached entries had expired, all the DNS servers at SLAC returned the correct value.

We suggested that NIIT lower the TTL on their DNS entries from 86400 seconds (24 hours) to something between two and six hours, to shorten the delay between when an update is made and when it reaches caching DNS servers.
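
The effect of that suggestion is simple arithmetic, sketched below in Python. The function name is hypothetical, and the bound assumes that every authoritative server already holds the new record and that resolvers honor the TTL; neither assumption held for the whole of this incident.

```python
def worst_case_staleness_hours(ttl_seconds):
    """Upper bound, in hours, on how long a cache may keep serving the old
    record after a change, assuming it fetched the record just before the change."""
    return ttl_seconds / 3600


# 86400 is the TTL in the ns2 answer above; the other two are the suggested range.
for label, ttl in [("current", 86400), ("two hours", 2 * 3600), ("six hours", 6 * 3600)]:
    print(f"{label}: TTL {ttl} s, stale for up to {worst_case_staleness_hours(ttl):.0f} h")
```

The trade-off is that a shorter TTL makes caching resolvers query NIIT's name servers more often.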