Problem

While the authoritative side of DNS is fairly heavily tested, the recursive side is largely untested.

The reason is that testing an authoritative server is relatively simple: the server is essentially doing a lookup in a database and returning a result, so you can check various database contents against various mixes of query type, concurrency, and rate. Testing the recursive side, however, involves not just a single authoritative server (or a cluster of computers acting as one), but the recursive server AND the set of servers that are authoritative for the domains being queried. The presence of a cache complicates testing further.

However, in order to both verify that resolution is working correctly and also to optimize the entire recursive resolution process, we should create a system that does functional testing of the recursive side of DNS.


Solution Space

In order to systematically test the system, we create a test platform that works the same as all other test systems, with the following characteristics:

  • System base state
  • Test inputs
  • Expected test outputs

A successful test is one where the ACTUAL test outputs match the expected test outputs. A failed test is one where these do not match.

There are three components involved in a resolution:

  1. The DNS computers. This includes the clients, recursive servers, and authoritative servers.
  2. The network connecting the DNS computers.
  3. The DNS zone contents. This includes the root zone definition, and all other domains under that.

Test DNS Computers

We do not need separate computers for the DNS computers, but we do need to simulate them.

For the clients, this is something sending queries. We can possibly use queryperf (which makes DNS queries based on various input parameters) for this, although something like tcpreplay (which plays network traffic from pcap files) may be necessary for proper control.

For the recursive server, we should use the server software that we are testing. For us, this will probably be BIND 10 and BIND 9, although we may also wish to test other products, like Unbound, PowerDNS Recursor, or dnsmasq.

For the authoritative servers, the minimum for this is something answering DNS queries (port 53) at unique IP addresses. Probably the best thing for this is a simple Python program using our DNS library, running on IP addresses configured for the test. These can be started quickly, and can also implement any special handling necessary (for example sending duplicate replies, or sending incoherent responses).
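As a rough illustration of how small such a simulated authoritative server can be, here is a minimal sketch. It does not use our DNS library; it builds packets by hand with the Python standard library so that it is self-contained. The function names, the fixed A-record answer, and the IPv4/UDP-only handling are all assumptions made for the example, not a proposed design.

```python
import socket
import struct

def encode_name(name):
    """Encode a dotted name into DNS wire format (no compression)."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:
            out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"

def build_query(name, qid=0x1234):
    """Build a one-question A/IN query (only used to exercise the responder)."""
    header = struct.pack("!HHHHHH", qid, 0x0000, 1, 0, 0, 0)
    return header + encode_name(name) + struct.pack("!HH", 1, 1)

def question_end(packet):
    """Return the offset just past the first question section entry."""
    pos = 12                       # the DNS header is 12 bytes
    while packet[pos] != 0:        # walk the length-prefixed labels
        pos += 1 + packet[pos]
    return pos + 1 + 4             # root label, then QTYPE and QCLASS

def build_response(query, address, ttl=300):
    """Answer any query with a single A record pointing at `address`."""
    qid = struct.unpack("!H", query[:2])[0]
    end = question_end(query)
    flags = 0x8400                 # QR=1 (response), AA=1, RCODE=0
    header = struct.pack("!HHHHHH", qid, flags, 1, 1, 0, 0)
    # 0xC00C is a compression pointer back to the question name at offset 12
    answer = struct.pack("!HHHIH", 0xC00C, 1, 1, ttl, 4)
    answer += socket.inet_aton(address)
    return header + query[12:end] + answer

def serve(bind_addr, address, port=53):
    """Answer UDP queries on bind_addr forever (hypothetical entry point)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    while True:
        query, peer = sock.recvfrom(512)
        sock.sendto(build_response(query, address), peer)
```

Special handling such as duplicate replies would go in the loop in serve(), for example by calling sendto() twice for selected queries.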

Test Network

We do not need to actually build a network, but can simulate the effects of the network on the DNS resolution process. For example, if we want to test the RTT algorithm, we can insert artificial network latency by configuring specific simulated authoritative servers to add a delay before answering.
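One possible way to add the delay without blocking the simulated server on each reply is to queue replies with a due time and send them once that time has passed. The sketch below assumes a hypothetical DelayQueue class and takes the current time as an argument, so that a test can drive it with a fake clock; none of these names come from an existing tool.

```python
import heapq
import itertools

class DelayQueue:
    """Hold replies until their artificial network delay has elapsed."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps equal due times FIFO

    def add(self, now, delay, reply):
        """Queue `reply` to be sent `delay` seconds after `now`."""
        heapq.heappush(self._heap, (now + delay, next(self._seq), reply))

    def due(self, now):
        """Pop and return every reply whose send time has arrived."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready
```

A server loop would call add() when a query arrives and due() on each pass through its event loop, sending whatever comes back.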

All types of network issues only need to be simulated on the authoritative server side. The reason for this is that we are testing the recursive server, and the recursive server does not maintain state when communicating with clients, so its behavior is not affected by networking between itself and its clients, beyond the arrival pattern of queries.

A full list of tweakable parameters will appear below.

Test DNS Zone Contents

We need to test all manner of DNS zone contents. This includes properly configured zones that include things like out-of-zone name servers, as well as broken configurations like lame delegations or zones with a CNAME and other record types at the same owner name.

These can be expressed as normal text zone files.

It may be necessary for a zone to change contents during the delegation process, although it may also be that this produces no conditions that a carefully configured set of zones could not produce as well.


Proposed Solution

What we need is both a test framework and the actual tests.

The test framework should be a program that works by reading a set of desired tests and then for each one of them:

  1. Configures the IP addresses needed for the test.
  2. Starts up the authoritative servers with the correct zones.
  3. Starts up the recursive server.
  4. Executes a set of DNS queries to get the recursive server in the correct state. (For example to populate the cache.)
  5. Begins recording network traffic from the recursive server (both to the authoritative servers and to the clients).
  6. Executes a set of DNS queries (the actual test).
  7. Compares the recorded network traffic to the expected network traffic.
  8. Shuts the recursive server down.
  9. Stops the authoritative servers.
  10. Deconfigures the IP addresses.

Note that the test must be run as root, because we need to configure and deconfigure IP addresses, and also to bind to port 53.
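The ordering of the steps above matters most on failure: whatever was set up has to be torn down in reverse order even if a test blows up halfway. A minimal sketch of that structure, assuming Python's contextlib.ExitStack and a hypothetical step() helper standing in for the real setup and teardown work, might look like this:

```python
from contextlib import ExitStack, contextmanager

@contextmanager
def step(log, name):
    """Hypothetical stand-in for one setup/teardown pair."""
    log.append("start " + name)
    try:
        yield
    finally:
        log.append("stop " + name)   # runs even if the test raises

def run_test(log, prime, record_and_query, compare):
    """Run one test; the three callables are placeholders for real work."""
    with ExitStack() as stack:
        stack.enter_context(step(log, "addresses"))      # steps 1 and 10
        stack.enter_context(step(log, "authoritative"))  # steps 2 and 9
        stack.enter_context(step(log, "recursive"))      # steps 3 and 8
        prime()                                          # step 4
        actual = record_and_query()                      # steps 5 and 6
        return compare(actual)                           # step 7
```

ExitStack unwinds the entered contexts in reverse order, which gives the 8, 9, 10 teardown sequence without any explicit cleanup code in run_test().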

The tests need to define a number of things:

  • A list of IPv4 and IPv6 addresses for authoritative servers
  • Whether UDP works for a given IP address + domain
  • Whether TCP works for a given IP address + domain
  • How EDNS works for a given IP address + domain (fully, only for packets of X bytes or less, or not at all)
  • Response delay for a given IP address
  • Drops for a given IP address (should be a pattern, not a percentage, for reproducibility - so perhaps "10111" if we want to drop the 2nd packet of a 5-packet sequence)
  • Zone contents for each IP address (this allows us to test for things like SOA & other zone mismatches, some lame delegations, and so on)
  • DNSSEC parameters (mostly T.B.D., but for example using the NSEC3 RFC as a starting point is a good idea)
  • Queries, including which queries occur at the same time (needed to check behavior of simultaneous queries)
  • The important data from packets sent (more below)
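To make a few of these parameters concrete, here is a sketch of how the per-address settings might be held in memory. The ServerBehavior class and its field names are hypothetical, it covers only a subset of the list (no EDNS or DNSSEC, and UDP/TCP per address rather than per address + domain), and it assumes that a drop pattern repeats once it is exhausted, which the list above does not specify.

```python
from dataclasses import dataclass

@dataclass
class ServerBehavior:
    """Hypothetical per-address settings for one simulated authoritative server."""
    address: str            # IPv4 or IPv6 address the server answers on
    zone_file: str          # zone contents served at this address
    delay_ms: int = 0       # response delay in milliseconds
    drop_pattern: str = ""  # '1' = deliver, '0' = drop, e.g. "10111"
    udp: bool = True        # whether UDP works at this address
    tcp: bool = True        # whether TCP works at this address

    def drops(self, index):
        """Return True if the packet with 0-based `index` should be dropped.

        Assumption: the pattern is applied cyclically. An empty pattern
        means nothing is dropped.
        """
        if not self.drop_pattern:
            return False
        return self.drop_pattern[index % len(self.drop_pattern)] == "0"
```

With the pattern "10111", only the second packet of every five is dropped, so a given test run always sees the same losses.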

Given the amount of data per test, it probably makes sense to use a directory to store a set of files that define everything. (A database might make sense, but that is probably more trouble than it is worth.)

As far as the "important data" from each packet, this depends on the particular test. For example:

  • Malformed query from client should reply with an error to that client.
  • Bogus TLD query should send a packet to one of the root servers, and then a reply to the client.
  • "Normal" WWW.DOMAIN-1.TEST query should send a packet to one of the root servers, then to one of the TEST TLD servers, then to one of the authoritative servers for DOMAIN-1.TEST, and finally send a reply to the client.
  • Simple cache test should immediately send a reply to the client from cache. (Here the setup for the test involves sending a query to populate the cache on the resolver, but that is not the test itself.)
  • Lame delegation test for DOMAIN-2.TEST should send a query to the root, then to one of the TEST TLD servers, then to one of the authoritative servers for DOMAIN-2.TEST, then try again at a different authoritative server for DOMAIN-2.TEST, then finally send an answer to the client. (Note that here the authoritative servers should work together to ensure that the first reply is always an error!)
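Since most of these expectations say "one of" a set of servers, the comparison cannot be a plain equality check on the recorded traffic. A minimal sketch of one way to compare, assuming the recorded traffic has already been reduced to an ordered list of destination names (the function name and the "$client" label are illustrative only):

```python
def destinations_match(expected_steps, actual_destinations):
    """Check an ordered packet trace against per-step sets of allowed targets.

    expected_steps: list of sets; each set holds the acceptable destinations
                    for the packet at that position.
    actual_destinations: list of destination names, in the order sent.
    """
    if len(expected_steps) != len(actual_destinations):
        return False
    return all(dest in allowed
               for allowed, dest in zip(expected_steps, actual_destinations))

# The "normal" WWW.DOMAIN-1.TEST case: root, then TEST TLD, then the
# DOMAIN-1.TEST servers, then the reply to the client. Server names here
# are made up for the example.
normal_lookup = [
    {"a.root", "b.root"},
    {"ns1.test", "ns2.test"},
    {"ns1.domain-1.test", "ns2.domain-1.test"},
    {"$client"},
]
```

This only checks where packets went and in what order; timing and packet contents would need further checks.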

Probably a simple language to define how these packets look is needed, so these can be defined via data files and not require programming for each test. A file defining packets in this language may look something like this:

# define the packets in a simple A lookup of www.domain-1.test
# target(s)         time   query/answer  contents
a.root,b.root       *      q             www.domain-1.test a
ns1.test,ns2.test   *      q             www.domain-1.test a
$client             *      a             www.domain-1.test a 10.0.0.1

# define the packets with a non-responding name server
# target(s)                time   query/answer  contents
a.root,b.root              *      q             domain-2.test a
ns1.test,ns2.test          *      q             domain-2.test a
$last                      100    q             domain-2.test a
(ns1.test,ns2.test)-$last  100    q             domain-2.test a
$client                    *      a             domain-2.test a 10.0.0.2

I'm not sure about the exact retry behavior, but this last one means we retry a server once after 0.1 sec (100 msec) and then try a different server from the set. Note this language is probably not the best, just something to illustrate the basic idea.
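To show that a language along these lines would be cheap to read, here is a sketch of a parser for the four-column format used in the examples above. It is only an illustration of the idea: the Expected record and its field names are made up, the time column is assumed to be milliseconds with "*" meaning "any time", and target expressions such as $last or (ns1.test,ns2.test)-$last are kept as raw strings rather than resolved.

```python
from collections import namedtuple

# One expected packet: raw target expression, time in msec (None for "*"),
# "q" or "a", and the contents split into words.
Expected = namedtuple("Expected", "targets time kind contents")

def parse_packets(text):
    """Parse packet definitions, skipping blank lines and # comments."""
    packets = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The first three columns contain no whitespace; the rest is contents.
        targets, time, kind, contents = line.split(None, 3)
        if kind not in ("q", "a"):
            raise ValueError("bad query/answer field: " + kind)
        packets.append(Expected(
            targets=targets,
            time=None if time == "*" else int(time),
            kind=kind,
            contents=contents.split(),
        ))
    return packets
```

Each parsed record could then be matched against one recorded packet, with the target expression evaluated against the packets seen so far.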

Last modified on Nov 2, 2010, 1:28:39 PM