NTP Debugging Techniques

from Pogo, Walt Kelly

We make house calls and bring our own bugs.


Once the NTP software distribution has been compiled and installed and the configuration file constructed, the next step is to verify correct operation and fix any bugs that may result. Usually, the command line that starts the daemon is included in the system startup file, so it is executed only at system boot time; however, the daemon can be stopped and restarted from root at any time. Usually, no command-line arguments are required, unless special actions described in the ntpd page are required. Once started, the daemon will begin sending and receiving messages, as specified in the configuration file.

Initial Startup

The best way to verify correct operation is using the ntpq and ntpdc utility programs, either on the server itself or from another machine elsewhere in the network. The ntpq program implements the management functions specified in the NTP specification RFC-1305, Appendix A. The ntpdc program implements additional functions not provided in the standard. Both programs can be used to inspect the state variables defined in the specification and, in the case of ntpdc, additional ones of interest. In addition, the ntpdc program can be used to selectively reconfigure and enable or disable some functions while the daemon is running.

In extreme cases with elusive bugs, the daemon can operate in two modes, depending on the presence of the -d command-line debug switch. If not present, the daemon detaches from the controlling terminal and proceeds autonomously. If one or more -d switches are present, the daemon does not detach and generates special output useful for debugging. In general, interpretation of this output requires reference to the sources. However, a single -d does produce only mildly cryptic output and can be very useful in finding problems with configuration and network troubles. With a little experience, the volume of output can be reduced by piping the output to grep and specifying the keyword of the trace you want to see.

Some problems are immediately apparent when the daemon first starts running. The most common of these are the lack of a UDP port for NTP (123) in the Unix /etc/services file. Note that NTP does not use TCP in any form. Other problems are apparent in the system log file. The log file should show the startup banner, some cryptic initialization data and the computed precision value. The next most common problem is incorrect DNS names. Check that each DNS name used in the configuration file responds to the Unix ping command.

When first started, the daemon normally polls the servers listed in the configuration file at 64-s intervals. In order to allow a sufficient number of samples for the NTP algorithms to reliably discriminate between correctly operating servers and possible intruders, at least four valid messages from the majority of servers and peers listed in the configuration file is required before the daemon can set the local clock. However, if the current system time is greater than 1000 s in error from the server times, the daemon will not set the local clock; instead, it will plant a message in the system log and shut down. It is necessary to set the local clock to within 1000 s first, either by a time-of-year hardware clock, by first using the ntpdate program or manually be eyeball and wristwatch.

If the initial time error is less than 1000 s but greater than 125 ms, the daemon will perform a step adjustment; otherwise, it will gradually slew the clock to the nominal time. However, if a step adjustment, the clock discipline algorithm will start all over again, requiring another round of at least four messages as before. This is necessary so that all servers and peers operate on the same set of time values.

If, as discussed later in this page, for some reason the hardware clock oscillator frequency error is relatively large, the time errors upon first startup of the daemon may increase over time until exceeding 125 ms, which requires another step correction. However, due to provisions that reduce vulnerability to noise spikes, the second correction will not be done until 900 s have elapsed since the last adjustment. When the frequency error is very large, it may take a number of cycles like this until converging on the nominal frequency correction. After this, the correction is written to the ntp.drift file, which is read upon subsequent restarts, so the herky-jerky cycles should not recur.

Verifying Correct Operation

After starting the daemon, run the ntpq program using the -n switch, which will avoid possible distractions due to name resolution problems. Use the pe command to display a billboard showing the status of configured peers and possibly other clients poking the daemon. After operating for a few minutes, the display should be something like:

ntpq> pe
     remote      refid       st t when poll reach delay offset jitter
=====================================================================
-isipc6.cairn.ne .GPS1.        1 u  18  64  377  65.592 -5.891  0.044
+saicpc-isiepc2. pogo.udel.edu 2 u 241 128  370  10.477 -0.117  0.067
+uclpc.cairn.net pogo.udel.edu 2 u  37  64  177 212.111 -0.551  0.187
*pogo.udel.edu   .GPS1.        1 u  95 128  377   0.607  0.123  0.027

The host names or addresses shown in the remote column correspond to the server and peer entries listed in the configuration file; however, the DNS names might not agree if the names listed are not the canonical DNS names. The refid column shows the current source of synchronization, while the st column reveals the stratum, t the type (u = unicast, m = multicast, l = local, - = don't know), and poll the poll interval in seconds. The when column shows the time since the peer was last heard in seconds, while the reach column shows the status of the reachability register (see RFC-1305) in octal. The remaining entries show the latest delay, offset and jitter in milliseconds. Note that in NTP Version 4 what used to be the dispersion column has been replaced by the jitter column.

The tattletale symbol at the left margin displays the synchronization status of each peer. The currently selected peer is marked *, while additional peers designated acceptable for synchronization, but not currently selected, are marked +. Peers marked * and + are included in the weighted average computation to set the local clock; the data produced by peers marked with other symbols are discarded. See the ntpq page for the meaning of these symbols.

Additional details for each peer separately can be determined by the following procedure. First, use the as command to display an index of association identifiers, such as

ntpq> as
ind assID status  conf reach auth condition  last_event cnt
===========================================================
  1 50252  f314   yes   yes   ok    outlyer   reachable  1
  2 50253  f414   yes   yes   ok   candidat   reachable  1
  3 50254  f414   yes   yes   ok   candidat   reachable  1
  4 50255  f614   yes   yes   ok   sys.peer   reachable  1

Each line in this billboard is associated with the corresponding line in the pe billboard above. The assID shows the unique identifier for each mobilized association, while the status column shows the peer status word in hex, as defined in the NTP specification. Next, use the rv command and the respective assID identifier to display a detailed synopsis for the selected peer, such as

ntpq> rv 50253
status=f414 reach, conf, auth, sel_candidat, 1 event, event_reach,
srcadr=saicpc-isiepc2.cairn.net, srcport=123, dstadr=140.173.1.46,
dstport=123, keyid=3816249004, stratum=2, precision=-27,
rootdelay=10.925, rootdispersion=12.848, refid=pogo.udel.edu,
reftime=bd11b225.133e1437  Sat, Jul  8 2000 13:59:01.075, delay=10.550,
offset=-1.357, jitter=0.074, dispersion=1.444, reach=377, valid=7,
hmode=1, pmode=1, hpoll=6, ppoll=7, leap=00, flash=00 ok,
org=bd11b23c.01385836  Sat, Jul  8 2000 13:59:24.004,
rec=bd11b23c.02dc8fb8  Sat, Jul  8 2000 13:59:24.011,
xmt=bd11b21a.ac34c1a8  Sat, Jul  8 2000 13:58:50.672,
filtdelay=   10.45  10.50  10.63  10.40  10.48  10.43  10.49  11.26,
filtoffset=  -1.18  -1.26  -1.26  -1.35  -1.35  -1.42  -1.54  -1.81,
filtdisp=     0.51   1.47   2.46   3.45   4.40   5.34   6.33   7.28,
hostname="miro.time.saic.com", publickey=3171359012, pcookie=0x6629adb2,
hcookie=0x61f99cdb, initsequence=61, initkey=0x287b649c,
timestamp=3172053041

A detailed explanation of the fields in this billboard are beyond the scope of this discussion; however, most variables defined in the NTP Version 3 specification RFC-1305 are available along with others defined for NTP Version 4. This particular example was chosen to illustrate probably the most complex configuration involving symmetric modes and public-key cryptography. As the result of debugging experience, the names and values of these variables may change from time to time. An explanation of the current set is on the ntpq page.

A useful indicator of miscellaneous problems is the flash value, which reveals the state of the various sanity tests on incoming packets. There are currently eleven bits, one for each test, numbered from the right, which is for test 1. If the test fails, the corresponding bit is set to one and zero otherwise. If any bit is set following each processing step, the packet is discarded. The meaning of each test is described on the ntpq page.

The three lines identified as filtdelay, filtoffset and filtdisp reveal the roundtrip delay, clock offset and dispersion for each of the last eight measurement rounds, all in milliseconds. Note that the dispersion, which is an estimate of the error, increases as the age of the sample increases. From these data, it is usually possible to determine the incidence of severe packet loss, network congestion, and unstable local clock oscillators. There are no hard and fast rules here, since every case is unique; however, if one or more of the rounds show large values or change radically from one round to another, the network is probably congested or lossy.

Once the daemon has set the local clock, it will continuously track the discrepancy between local time and NTP time and adjust the local clock accordingly. There are two components of this adjustment, time and frequency. These adjustments are automatically determined by the clock discipline algorithm, which functions as a hybrid phase/frequency feedback loop. The behavior of this algorithm is carefully controlled to minimize residual errors due to network jitter and frequency variations of the local clock hardware oscillator that normally occur in practice. However, when started for the first time, the algorithm may take some time to converge on the intrinsic frequency error of the host machine.

The frequency tolerance of computer clock oscillators can vary widely, which can put a strain on the daemon's ability to compensate for the intrinsic frequency error. While the daemon can handle frequency errors up to 500 parts-per-million (PPM), or 43 seconds per day, values much above 100 PPM reduce the headroom and increase the time to learn the particular value and record it in the ntp.drift file. In extreme cases before the particular oscillator frequency error has been determined, the residual system time offsets can sweep from one extreme to the other of the 128-ms tracking window only for the behavior to repeat at 900-s intervals until the measurements have converged.

In order to determine if excessive frequency error is a problem, observe the nominal filtoffset values for a number of rounds and divide by the poll interval. If the result is something approaching 500 PPM, there is a good chance that NTP will not work properly until the frequency error is reduced by some means. A common cause is the hardware time-of-year (TOY) clock chip, which must be disabled when NTP disciplines the software clock. For some systems this can be done using the tickadj utility and the -s command line argument. For other systems this can be done using a command in the system startup file.

If the TOY chip is not the cause, the problem may be that the hardware clock frequency may simply be too slow or two fast. In some systems this might require tweaking a trimmer capacitor on the motherboard. For other systems the clock frequency can be adjusted in increments of 100 PPM using the tickadj utility and the -t command line argument. Note that the tickadj alters certain kernel variables and, while the utility attempts to figure out an acceptable way to do this, there are many cases where tickadj is incompatible with a running kernel.

The state of the local clock itself can be determined using the rv command (without the argument), such as

ntpq> rv
status=0644 leap_none, sync_ntp, 4 events, event_peer/strat_chg,
version="ntpd 4.0.99j4-r Fri Jul  7 23:38:17 GMT 2000 (1)",
processor="i386", system="FreeBSD3.4-RELEASE", leap=00, stratum=2,
precision=-27, rootdelay=0.552, rootdispersion=12.532, peer=50255,
refid=pogo.udel.edu,
reftime=bd11b220.ac89f40a  Sat, Jul  8 2000 13:58:56.673, poll=6,
clock=bd11b225.ee201472  Sat, Jul  8 2000 13:59:01.930, state=4,
phase=0.179, frequency=44.298, jitter=0.022, stability=0.001,
hostname="barnstable.udel.edu", publickey=3171372095, params=3171372095,
refresh=3172016539

An explanation about most of these variables is in the RFC-1305 specification. The most useful ones include clock, which shows when the clock was last adjusted, and reftime, which shows when the server clock of refid was last adjusted. The version, processor and system values are very helpful when included in bug reports. The mean millisecond time offset (phase) and deviation (jitter) monitor the clock quality, while the mean PPM frequency offset (frequency) and deviation (stability) monitor the clock stability and serve as a useful diagnostic tool. It has been the experience of NTP operators over the years that these data represent useful environment and hardware alarms. If the motherboard fan freezes up or some hardware bit sticks, the system clock is usually the first to notice it.

Among the new variables added for NTP Version 4 are the hostname, publickey, params and refresh, which are used for the Autokey public-key cryptography described on the Authentication Options page. The values show the filestamps, in NTP seconds, that the associated values were created. These are useful in diagnosing problems with cryptographic key consistency and ordering principles.

When nothing seems to happen in the pe billboard after some minutes, there may be a network problem. One common network problem is an access controlled router on the path to the selected peer or an access controlled server using methods described on the Access Control Options page. Another common problem is that the server is down or running in unsynchronized mode due to a local problem. Use the ntpq program to spy on the server variables in the same way you can spy on your own.

Normally, the daemon will adjust the local clock in small steps in such a way that system and user programs are unaware of its operation. The adjustment process operates continuously as long as the apparent clock error exceeds 128 milliseconds, which for most Internet paths is a quite rare event. If the event is simply an outlyer due to an occasional network delay spike, the correction is simply discarded; however, if the apparent time error persists for an interval of about 20 minutes, the local clock is stepped to the new value (as an option, the daemon can be compiled to slew at an accelerated rate to the new value, rather than be stepped). This behavior is designed to resist errors due to severely congested network paths, as well as errors due to confused radio clocks upon the epoch of a leap second.

Debugging Checklist

If the ntpq or ntpdc programs do not show that messages are being received by the daemon or that received messages do not result in correct synchronization, verify the following:

  1. Verify the /etc/services file host machine is configured to accept UDP packets on the NTP port 123. NTP is specifically designed to use UDP and does not respond to TCP.
  2. Check the system log for ntpd messages about configuration errors, name-lookup failures or initialization problems.
  3. Verify using ping or other utility that packets actually do make the round trip between the client and server. Verify using nslookup or other utility that the DNS server names do exist and resolve to valid Internet addresses.

  4. Using the ntpdc program, verify that the packets received and packets sent counters are incrementing. If the sent counter does not increment and the configuration file includes configured servers, something may be wrong in the host network or interface configuration. If this counter does increment, but the received counter does not increment, something may be wrong in the network or the server NTP daemon may not be running or the server itself may be down or not responding.
  5. If both the sent and received counters do increment, but the reach values in the pe billboard with ntpq continues to show zero, received packets are probably being discarded for some reason. If this is the case, the cause should be evident from the flash variable as discussed above and on the ntpq page.
  6. If the reach values in the pe billboard show the servers are alive and responding, note the tattletale symbols at the left margin, which indicate the status of each server resulting from the various grooming and mitigation algorithms. The interpretation of these symbols is discussed on the ntpq page. After a few minutes of operation, one or another of the reachable server candidates should show a * tattletale symbol. If this doesn't happen, the intersection algorithm, which classifies the servers as truechimers or falsetickers, may be unable to find a majority of truechimers among the server population.
  7. If all else fails, see the FAQ and/or the discussion and briefings at Network Time Synchronization Project.

David L. Mills <mills@udel.edu>