Initial Experiments and Analysis for Resolver Performance

1. Introduction

I've analyzed an old (about 5 years ago) but real query sample taken on a busy recursive server to get deeper understanding on how cached data would be used and exactly what the recursive server would do on cache miss. It's one hour log of queries, containing about 2.49M queries that consist of 290K unique ones.

In the rest of this memo, I'll briefly introduce the tools and methodology I used, and then summarize major findings of the analysis, and in some cases give possible optimization approaches related to the specific result.

2. The test tool

I've written a few Python scripts for this analysis, including:

  • A mostly full-feature recursive resolver, which resolves a given set of queries starting only with root hints, keeps all cached information used throughout the entire resolution process, and dumps the whole cache to a file.
  • A query replay tool. It loads the cached data generated by the resolver script into memory, and emulates the behavior of an actual recursive resolver for a given sequence of queries (with timing information). It refers to the cached data, but initially considers all entries except the root hints to be "inactive" (i.e. as if they were not cached). It emulates the resolution process using both "active" and "inactive" data, activating inactive parts of the cache as it encounters the data in the emulated resolution process. It keeps detailed record of how each query is resolved (whether it simply hits a cache, whether it needs explicit resolution, in which case how many external queries are needed, etc), and dumps various statistic information at the end of the session. This tool doesn't require any network access, and its behavior is completely deterministic and reproducible for the same set of cached data and query sequence.

The tools are available in the exp/res-research branch of the public git repository, under the exp/res-research/analysis directory.

While the resolver and replay tool should behave quite accurate as a recursive resolver, there are some cases they ignore or too simplify for the purpose of experiments. For example, the current version doesn't consider EDNS0 or DNSSEC at all; it assumes all (non lame) authoritative servers of a zone return consistent responses; queries are not resent on timeout, and the replay tool assumes that query would always be timed out; if an authoritative server is considered lame for specific query, it's always considered to be lame. Despite these limitations I believe the overall behavior should be reasonably accurate for the purpose of this analysis.

3. Cache Usage

The following graph summarizes overall cache usage focusing on query popularity (how many times a specific query is asked).

The blue line should read as the most "popular" query occupies 4.71% of the total of the 2.49M queries, the top 10 consume 10.62%, and so on. The red line shows the cache hit rate for the corresponding set of queries. For example, cached data for the top 10 popular queries hit 99.97%; it's 98.93% for the top 100, and 97.17% for the top 10,000.

These results suggest it's promising to introduce optimization approaches like multi-layer caches. Since the distribution is pretty heavy tailed (the top 10,000 (about top 1/30) occupies about 80% of the entire unique queries), we could even let separate threads or processes running on different CPU cores have their own "layer 1" cache in the most efficient form for faster response (like pre-rendered response). Note that the red line shows caching answers for popular queries is pretty effective even taking into account TTL expiration.

I've measured the expected size of the response to each query assuming the "minimal-answer" configuration (omitting authority and additional sections). The average size was about 62 bytes. So if a big server running on 16 cores has an L1 cache for each core up to 100,000 responses, the total memory footprint would be about (ignoring other overhead) 960MB. This is huge, but I believe it should be acceptable for this size of server (which would be dedicated to this service and would have 16GB or eve more of RAM).

The yellow line shows the ratio of used (hit) cache that happens in the "same second", i.e., the corresponding query times are equal at the granularity of seconds (for example, 97.2% of the most popular query (whose cache hit rate is very close to 100%) were asked at the same second as that for the previous one). This means in these cases if we had completely rendered response image we wouldn't even have to adjust their TTLs.

The yellow line suggests this optimization would be very effective at least for some very popular queries. It's not so obvious, however, for the entire data or even for the top 10,000. The necessary check should basically be a simple if statement comparing two integers, so I guess it's still useful even if it cannot fail for 60% of the tests, considering the amount of work this optimization allows to skip. But this will be subject to more specific benchmark tests.

4. TTL Distributions of Cached Data

The following graph shows distribution of TTLs of cached data that were made from an answer section of an authoritative reply (so these TTLs determine a query hits a cache or requires recursion).

The two major spikes are TTLs 3600 and 86400 (1 day), two commonly used ones. The four small spikes under 3600 are 300, 600, 900, and 1800 (1800 are probably "TTL"s of SERVFAIL results; in this experimental implementation these results are cached and reused for 1800 seconds, the value derived from BIND 9's lame-ttl).

There's not much interesting in this result (to me); it generally matches to my understanding of generally used conventions. One point that may be worth noting is that very small TTLs such as under 1 minute is quite few. While they might be used for relatively popular queries, this result probably shows we don't have to worry too much about them in terms of cache efficiency (and the result of the previous graph also supports this observation).

I gues the next one is probably more interesting:

The blue line shows TTL distribution for "glue" records (where glue means NS RRs in non authoritative responses and so called in-bailiwick AAAA/A RRs in the additional section of these responses). The red line is TTL distribution for "answer" records and NS RRs stored in the authority section of authoritative responses (note: not all "glue" records have corresponding "answer" or authority NS in the cached data).

Most (91.7%) of "glue" records have TTLs of 86400 and 172800 (2 days), while corresponding "answer" or authority NS have mainly TTLs of 3600 and 86400. Since the latter types of data are generally considered more trustworthy (per RFC2181, and existing implementations follow that), the "glue" records could be discarded once the answer or authority NS data are cached, even if the glue records could generally be cached longer than the more trustworthy ones. At least BIND 9 behaves that way, and so do unbound and PowerDNS recursor from their observable behavior.

That makes sense on one hand (at least in terms of cache memory efficiency); on the other hand, this could be suboptimal in terms of response performance (both in terms of throughput and latency) because delegation information could expire sooner, and since the "glue" information has been lost at that point the resolver needs to start from at least one level higher in the delegation chain (e.g., it could start from "com" for "" if the server discards glue info for "" and the more trust worthy data expire).

This leads to one optimization idea: we might keep both glue and more trustworthy data, and (if the TTL of the glue is larger than that of the other) use the glue data until it expires once the TTL of the more trustworthy data expires. I'll discuss the effectiveness of this idea in the next section. Regarding the cache memory consideration, I only have quick estimates by envelope-level calculation, but according to the number of cached data entries (1 entry generally corresponds to a tuple of name, class, type, and trust level), if we discard any glues due to inserting a more trustworthy entry for the same name, class, type, we'd have 485147 entries; if we keep "redundant" glues, it would be 573470 entries (+18.2%). I guess we could exploit the fact that many of these records have common data to reduce the memory footprint, so this could be a bit or more smaller. Even if it really requires about 20% more memory, I think it's still worth considering, depending on its effect on response performance.

5. Number of External Queries on Cache Expiration

The following graph summarizes the number of external queries need to complete resolution on cache miss (either due to expiration of existing one or complete cache miss). This includes all recursive queries following the delegation chain, and, when it's subject to CNAME substitution, queries for completing the CNAME chain. The graph shows the result for two different modes of the resolver behavior (see the previous section): one that keeps any glue records (NS and additional AAAA or A) even after caching more trustworthy records (blue); the other purges such glue records when it caches more trustworthy ones (red). Existing implementations generally (and seemingly) adopt the latter approach.

In either case, the vast majority of the cases only need 1 or 2 external queries, and the average is also under 2 (Note: this experimental setup is a bit optimistic, and doesn't count additional queries due to such as timeouts, EDNS0 probe failure, etc. So the real world number will be worse than these cases).

But if we keep all glues, it reduces the number of necessary external queries about 13%. It's not so obvious how substantial this difference is in terms of overall throughput and latency, but roughly speaking this improvement and the necessary 20%-ish more memory footprint is the tradeoff of this optimization.

6. Categorization of Response Messages from Authoritative Servers

I've noticed most of the responses from authoritative servers belong to a small set of common patterns, so I collected counted the number of responses based on some categorization. This is a summary of it (those that would only be needed in the replay session - the "resolver" tries to do some unnecessary resolution to make the cache as complete as possible, so just counting the number of responses in that session could be a misleading result).

  • Total number of incoming responses (including timeout): 534028
  • Of which 97.9% fall into one of the following types:
    1. AA bit on, Rcode==NOERROR, has RR answer section, the owner name of the first answer == query name, the type is either the query type or CNAME (266456, 49.9%)
    2. AA bit on, Rcode==NOERROR or NXDOMAIN, has empty answer section, either has empty authority section or the first authority RR is SOA whose owner name is equal to or superdomain of qname (131757, 24.7%).
    3. AA bit off, Rcode==NOERROR, has empty answer section, has authority section whose first RR is NS whose owner name is a pure subdomain of the current zone in the resolution context (124507, 23.3%).

In the first case, 233653 responses (87.7% of the case) are the exact type match, and the rest are CNAME. Further, for the 263851 responses (99.0% of the case) the first two bytes of the first answer RR are (0xc0, 0x0c), which means its owner name is identical to the query name, fully compressed (note: BIND 9 always compresses names this way in this case, but NSD doesn't. If and when NSD is more widely used for leaf zones, this ratio can change drastically).

In the third case, 47007 responses (37.8% of the case) have so-called "in-bailiwick" glue, so, while possibly suboptimal, delegation can be chased without referring to cache or triggering an additional query to resolve address records of the NS.

These seem to suggest we can consider some form of "response type prediction". Checking the condition of these are generally cheap or could possibly be more efficient, trading safety (like directly handling wire-format data). There are many other corner cases in the real world, but this experimental result shows they are probably very marginal. So, we can probably consider optimizing these dominant cases while providing slower and generic case handler to cover minority cases.

7 Query Types and Response Codes

This is a summary of per query type and per response code statistics:

  • NOERROR: 1109534
  • NXDOMAIN: 1037963
  • NXRRSET: 266915
  • SERVFAIL: 72415
  • A: 1248997
  • MX: 813549
  • PTR: 320106
  • AAAA: 47873
  • SRV: 33306
  • SOA: 16752
  • A6: 2655
  • ANY: 2360
  • NS: 531
  • TXT: 425
  • CNAME: 172
  • Other: 101

I'm not sure how we should interpret this, and since the data set is pretty old, especially the query type would look pretty different today (I believe AAAA will be more frequently queried, for example). But one thing it might be noteworthy is that direct NS query is very rare. This might suggest it makes sense to keep delegation information separately from the main cache. There will be some redundancy, but according to this result the intersection will be marginal.

Likewise, I'm not sure how we should interpret the per Rcode result. But at least according to this result most of the case (86%) result in NOERROR (with answer) and NXDOMAIN. If it's common, we may be able to exploit this bias in some way.

Last modified 5 years ago Last modified on Jul 17, 2012, 6:27:03 AM

Attachments (4)

Download all attachments as: .zip