Nameserver Address Store - Outline Design

Introduction

The Nameserver Address Store (NSAS) is the component of the BIND-10 recursive resolver that stores addresses of nameservers. The component handles address lookups, as well as the monitoring of round-trip times (RTTs) and the selection of a nameserver address when the resolver needs to make a query.

This document presents an outline design of the component.

RR Class

One requirement is for names in different classes to be distinct, e.g. example.com class IN is distinct from example.com class CH.

One design considered was to have class-specific NSAS objects and a dispatcher that chose the object based on class. Recognising that the vast majority of lookups are for class IN, the idea was to streamline IN lookups by checking for a class number of 1 (= IN) and dispatching to a pre-constructed NSAS object if the check succeeded. If not, the NSAS object for the class would be looked up in a mutex-protected STL map and created if not present. Although fast for IN, this adds an additional layer of locking and contention for non-IN classes. Furthermore, it could be vulnerable to abuse if an attacker could force the creation of an NSAS object for every possible class value.

Instead, all names and classes will be in the same store. The only additional overhead will be when searching for the name; as well as checking that the name is correct, the code will have to check that the class is correct as well.
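
As a sketch of the idea, both tables key on the (name, class) pair, so a lookup that matches on the hash must still compare both fields. The key type shown here is an illustrative assumption, not the actual BIND-10 declaration:

    // Illustrative sketch only: both tables key on (name, class).
    #include <string>
    #include <stdint.h>

    struct EntryKey {
        std::string name;     // zone or nameserver name
        uint16_t    rrclass;  // RR class code (1 = IN)

        bool operator==(const EntryKey& other) const {
            // A hash match is not enough; name and class must both agree.
            return name == other.name && rrclass == other.rrclass;
        }
    };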

Main Data Structures

Zone Entry

Zones are represented by the zone entry (ZE) object. This holds the list of nameservers for the zone in question. Its main attributes are:

  • Zone name (this is the key in the zone hash table)
  • Class
  • Zone expiry time (= time of entry + TTL of NS records)
  • Nameserver(n)

The Nameserver element is a vector of boost::shared_ptr objects pointing to the appropriate Nameserver Entry (NE) objects.
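
A minimal sketch of the zone entry follows; the member names are illustrative assumptions, not the actual BIND-10 declarations:

    // Illustrative sketch only; member names are assumptions.
    #include <stdint.h>
    #include <string>
    #include <vector>
    #include <ctime>
    #include <boost/shared_ptr.hpp>

    class NameserverEntry;   // sketched in the next section

    struct ZoneEntry {
        std::string name;      // zone name (key in the zone hash table)
        uint16_t    rrclass;   // RR class (1 = IN)
        time_t      expiry;    // time of entry + TTL of the NS records
        std::vector<boost::shared_ptr<NameserverEntry> > nameservers;
    };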

Nameserver Entry

In some cases, particularly in the case of ISPs, a given nameserver may host thousands of zones. By splitting the nameserver information away from the zone information, the NSAS can have a single entry for a nameserver that is referenced by multiple zones. This reduces the memory footprint, reduces the number of queries that the NSAS may need to make for address information, and improves the RTT estimates.

This describes a particular nameserver and has as its main elements:

  • Nameserver name (this is the key in the nameserver hash table)
  • Class
  • Address expiry time (= time of entry + TTL of A/AAAA record)
  • Address(n) (separate vector for each address family)

The Address element is a vector of AddressEntry objects. Given that a nameserver may have multiple addresses - and that it may have both A and AAAA addresses - the expiry time is the time of entry + min(A record TTL, AAAA record TTL). Although the A and AAAA records may have differing TTLs, a single expiry time is used because (a) it simplifies the logic, (b) as both record sets come from the same zone, differing TTLs are unlikely, and (c) if we time out too soon, refreshing from the cache, where the data still lives, is cheap.

Address Entry

This object just holds:

  • Structure describing address
  • Round-trip time for the address
  • Time when this address stops being considered dead

The structure describing the address is an IOAddress object, used in the asynchronous I/O code. The round-trip time is a 32-bit number holding the time for the round trip (in any units - the NSAS does not care, but probably microseconds); it is initialized to a small number and adjusted after each query. The value UINT32_MAX is reserved to indicate that the address is unreachable.
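
Putting the last two sections together, a minimal sketch of the address entry and its containing nameserver entry might look as follows. asiolink::IOAddress is the address class from the asynchronous I/O code (its header path here is an assumption); all other names are illustrative:

    // Illustrative sketch only; member names are assumptions.
    #include <stdint.h>
    #include <string>
    #include <vector>
    #include <ctime>
    #include <asiolink/io_address.h>   // assumed header location

    class AddressEntry {
    public:
        // Reserved RTT value meaning "unreachable".
        static const uint32_t UNREACHABLE = 0xffffffffU;

        AddressEntry(const asiolink::IOAddress& address) :
            address_(address), rtt_(1), dead_until_(0) {}

        bool isUnreachable(time_t now) const {
            // Unreachable if marked so, or still within its "dead" period.
            return (rtt_ == UNREACHABLE) || (now < dead_until_);
        }

    private:
        asiolink::IOAddress address_;  // structure describing the address
        uint32_t rtt_;                 // round-trip time (units unspecified)
        time_t   dead_until_;          // time the address stops being considered dead
    };

    struct NameserverEntry {
        std::string name;      // nameserver name (key in the nameserver hash table)
        uint16_t    rrclass;   // RR class
        time_t      expiry;    // time of entry + min(A TTL, AAAA TTL)
        std::vector<AddressEntry> v4_addresses;  // one vector per address family
        std::vector<AddressEntry> v6_addresses;
    };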

Hash Tables

The zone entries and nameserver entries are held in separate hash tables, each indexed by a hash of the zone/nameserver name and class. Each slot of a table comprises:

  • Mutex protecting this slot of the hash table. This is a read/write mutex as readers are expected to be more common than writers, so use of this type of mutex will reduce contention.
  • Head of the list of entries with this hash value

A hash table is used to limit contention when looking up information. The size of the hash table is a tuneable parameter: BIND-9 uses a value of 1009, with the remark that the size of the table should be prime (1009 is the smallest prime greater than 1000). Unless there is a problem, it is intended to use the same hashing function as that used by BIND-9 (with an appropriate adjustment for table size). The size of the nameserver hash table is set to approximately three times that of the zone hash table (to take account of multiple nameservers per zone).
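
As a sketch, each slot might look like the following, assuming Boost threads for the read/write mutex (the actual types used are an implementation choice):

    // Illustrative sketch only.
    #include <list>
    #include <boost/shared_ptr.hpp>
    #include <boost/thread/shared_mutex.hpp>

    template <typename Entry>
    struct HashSlot {
        boost::shared_mutex mutex;                     // read/write mutex for this slot
        std::list<boost::shared_ptr<Entry> > entries;  // entries hashing to this value
    };

    // Lookups take a shared (read) lock; insertions and deletions take an
    // exclusive (write) lock:
    //     boost::shared_lock<boost::shared_mutex> read_lock(slot.mutex);
    //     boost::unique_lock<boost::shared_mutex> write_lock(slot.mutex);
    // The tables themselves would be arrays of slots, e.g.:
    //     static HashSlot<ZoneEntry> zone_table[1009];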

The lists themselves will be a list of boost::shared_ptr pointing to the actual zone entry or nameserver entry object. This simplifies the deletion of these objects (erasing an object from the list will cause the object to be deleted if nothing else points to it). It also handles the possibility that a nameserver entry could be deleted while there are still outstanding address fetch operations on it (e.g. a zone could have just been added and address fetch operations started when an operation to clear the NSAS is executed). Use of shared pointers means that the fetch can retain a pointer to the object and complete successfully, even if the object is deleted immediately afterwards; there is no need to worry about the fetch having a dangling pointer. (Of course, an alternative is for the fetch to retain the name of the zone and look it up in the hash table when it completes. However, in the normal course of events this will incur more overhead than retaining a pointer to the object.)

LRU Lists

An LRU list is used to prevent the size of the store exceeding a pre-set limit of zones. Protected by a mutex, the list holds shared pointers to zone entries. When a zone entry is created, it is added at the end of the list. Every time it is accessed, it is removed from the list and re-inserted at the end. When the list reaches its limit, adding a new element at the end will result in one being removed from the head. (A running count of the elements in the list is kept, protected by the same mutex that guards the list; the std::list size() method can be inefficient as it may require traversing the list to obtain a result.) The zone entry so removed is then removed from the zone hash table.

In future, there will be multiple LRU lists, each one responsible only for a subset of the hash table slots. That way there will be less chance of lock contention on the LRU list.

A similar list is currently used for nameserver entries. This has some drawbacks: there is the need to update the LRU list, which brings overhead, and there may be active NameserverEntry objects that are no longer reachable by name, which can lead to duplicate entries. The plan is therefore to hold weak pointers in the nameserver hash table. That way, the nameservers will be kept alive by the zones holding pointers to them, and the weak pointers will be removed from the table when dead ones are encountered.

Deletion of a nameserver entry will trigger its removal from the nameserver hash table as well. If, for some reason, a nameserver entry is infrequently accessed and is removed from the hash table before the zones that reference it are deleted, there is no problem; the shared_ptr in the zones will ensure that the nameserver entry remains in existence. At worst, the creation of a new zone referencing the nameserver will cause a new nameserver entry to be created in the nameserver hash table. Old zones will continue to reference the old nameserver entry until they are deleted, at which point it too will be deleted.

For efficient access, the iterator pointing to the zone's entry in the LRU zone list is stored in the zone entry itself. This is set and reset after the entry has been inserted and moved. (Iterators to list elements do not change as other list elements are added and removed.) The same applies to the LRU list holding nameserver entries.
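
Because the iterator is stored in the entry, the "touch" operation is O(1). A sketch using std::list::splice follows; the class and member names are illustrative:

    // Illustrative sketch only.
    #include <list>
    #include <boost/shared_ptr.hpp>
    #include <boost/thread/mutex.hpp>

    template <typename Entry>
    class LruList {
    public:
        typedef typename std::list<boost::shared_ptr<Entry> >::iterator Iterator;

        LruList() : count_(0) {}

        // Move the entry to the end of the list (most recently used).  The
        // iterator stored in the entry remains valid: splice() moves the
        // element without invalidating iterators that point to it.
        void touch(Iterator it) {
            boost::mutex::scoped_lock lock(mutex_);
            list_.splice(list_.end(), list_, it);
        }

    private:
        boost::mutex mutex_;                         // guards the list and count
        std::list<boost::shared_ptr<Entry> > list_;  // LRU order: head = oldest
        size_t count_;                               // running element count
    };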

Note that this only allows an approximate memory limit to be set - the number of nameserver elements and addresses per zone is variable, so a limit of a given number of zones may have differing memory requirements depending on the zones concerned.

Relationship Between Objects

The relationship between the various objects described above is:

[Diagram: V2 of the data structures diagram]

Miscellaneous Considerations

Choosing the address

When a zone is asked for an IP address to contact, it takes all addresses of all its nameservers and picks from them. The selection is random, but weighted according to the RTT: the probability of selecting an address is proportional to 1/(rtt * rtt), normalized so that the probabilities total 1. This algorithm is supported by a paper by Somegawa, Cho, Sekiya and Yamaguchi.
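
In formula form: if rtt_i is the measured round-trip time of address i, the probability of selecting it is

    p_i = \frac{1/{rtt_i}^2}{\sum_j 1/{rtt_j}^2}

so that the probabilities over all candidate addresses sum to 1.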

Initially the nameservers have their RTTs set to 1. This is lower than any realistic RTT; it makes the first lookup effectively random and helps ensure that every address is tried at least once. Thereafter, the addresses with the lowest RTTs are the most likely to be selected. Every time an address is used, the resolver updates the RTT for the address. To avoid transient effects, the stored RTT is a weighted combination of the old RTT and the newly measured one.
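
A minimal sketch of such a weighted update; the 0.7/0.3 weights are assumptions for illustration, not values taken from the implementation:

    // Illustrative sketch only; the weighting factor is a tuning parameter.
    #include <stdint.h>

    uint32_t updateRtt(uint32_t old_rtt, uint32_t new_rtt) {
        // A weighted combination damps transient spikes in the measurements.
        const double OLD_WEIGHT = 0.7;   // assumed value
        const double NEW_WEIGHT = 0.3;   // assumed value (weights sum to 1)
        return static_cast<uint32_t>(old_rtt * OLD_WEIGHT + new_rtt * NEW_WEIGHT);
    }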

Fetching Information

Whenever the NSAS needs information (either the NS record set for a given zone name or an A/AAAA record set), it uses the resolver. That way CNAME/DNAME redirection is automatically taken care of, and the resolver can use its cache to get newer or authoritative information in preference to glue.

The fetch callback will contain a shared_ptr to the appropriate object (either the ZoneEntry or the NameserverEntry), so that the object is not deleted before the fetch completes.

It is expected that the cache will keep glue and will retain TTL=0 information at least until we ask for it. This will probably be solved by cache cookies (i.e. pieces of data guaranteeing that the information is available). Looping misconfigurations (e.g. missing glue) will be handled by a timeout in the call to the resolver.

Updating Information

When a query made by the resolver completes, the resolver may elect to update information in the NSAS; in particular, the RTT for the address queried can be passed back. For that reason, the NSAS provides a NameserverAddress object, which holds enough information to be able to update the NSAS (and, of course, provides the asiolink::IOAddress).
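
A sketch of what such an object might hold; only the use of asiolink::IOAddress is stated in this document, so the member and method names here are assumptions:

    // Illustrative sketch only.
    #include <stdint.h>
    #include <boost/shared_ptr.hpp>

    class NameserverAddress {
    public:
        // The address the resolver should send its query to.
        const asiolink::IOAddress& getAddress() const;

        // Called by the resolver when the query completes, so that the
        // measured RTT can be routed back to the right AddressEntry.
        void updateRTT(uint32_t rtt);

    private:
        boost::shared_ptr<NameserverEntry> ns_;  // keeps the entry alive
        size_t index_;                           // which address within the entry
    };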

However, it was decided that, as the cache updates its own TTLs, there is no need to update the TTL in the NSAS: if an entry expires, the NSAS will simply ask for the data again and the cache will have it. Updating the TTL would mean additional work and code, and it is dubious whether it would speed things up or slow them down, since it would require another lock across multiple entries, whereas the RTT update affects only one.

When the cache detects a change (new data arrives, or the original data was a referral that turned out to be wrong), there should be a mechanism to update the NSAS rather than waiting for the data there to expire.

Data Expiration

A zone is only considered expired when it is accessed and the expiration time is found to have passed. In that case, the NS RRset is fetched again and the list of pointers to NameserverEntry objects is rebuilt. The old pointers are reused where possible, so the original object is kept rather than a new one created, even if it fell out of the hash table because of the LRU limit.

A similar thing is done with a NameserverEntry. When it expires, the information is fetched again, recycling the RTT information for addresses from the previous set.

Processing Algorithm

Introduction

Although the NSAS is a data store, the logic surrounding it handles the issuing of requests to remote nameservers through the resolver.

Since the NSAS is capable of processing multiple queries at once, it must be prepared to handle cases where a query is received for information that is in the process of being obtained. For example, a lookup may be in progress for the nameservers of a zone when another request is received for the address of a nameserver in the zone. The second request must wait until the nameserver lookup is finished, and must then further wait until the subsequent A/AAAA records have been retrieved, but not trigger additional queries.

The general pattern to handle this is that of supplying a callback object. A call is made to request information and the call quickly returns. At some later time, the callback is called with the requested information (or with an indication saying why the request failed). So in the above example, the first lookup initiates a fetch of the NS records for the zone and returns. The second lookup sees that a fetch is in progress, so does nothing and returns. At some later time, the fetch of the NS records completes, an operation which initiates a fetch of the associated A/AAAA records. When that completes, the callbacks of the two lookups are executed, passing to them the requested information.
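
The pattern might be expressed as an interface like the following. This is a sketch; the class and method names are not specified in this document:

    // Illustrative sketch only.
    #include <list>
    #include <boost/shared_ptr.hpp>

    class AddressRequestCallback {
    public:
        virtual ~AddressRequestCallback() {}

        // Called when an address for the zone becomes available.
        virtual void success(const NameserverAddress& address) = 0;

        // Called when the request failed (e.g. the zone is unreachable).
        virtual void unreachable() = 0;
    };

    // Each zone entry keeps a list of the callbacks waiting on it; all of
    // them are run (or failed) when the outstanding fetches complete.
    typedef std::list<boost::shared_ptr<AddressRequestCallback> > CallbackList;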

Processing Steps

In the following description, a "fetch" is the operation of requesting a set of RRs from a remote server, and a "query" is a call to the NSAS to retrieve address information about a zone. The scenario is that a query has been received. As this is the result of a referral, the referral information is in the cache (and we expect to obtain a cache cookie in some future version).

There are five possible outcomes to the lookup of address information (an outline sketch of the dispatch is given after the list):

1a. The zone is not present.

  • create the zone entry.
  • add the callback object passed with the query to the list associated with that zone. This callback gets called when one or more addresses become available.
  • initiate the fetch of the NS records and mark the zone as "fetch in progress".

1a'. The zone is present but has expired. This is almost the same as 1a, except that when the NS record set arrives, we recycle the pointers we already have.

1b. The zone is present, but is marked as "fetch in progress"

  • add the callback object to the list associated with that zone. This callback gets called when one or more addresses become available.

1c. The zone is present but is marked as unreachable. (This occurs if a fetch of the NS records for the zone failed.)

  • execute the callback object with an indication of failure.

1d. The zone is present and is marked as ready.

  • add the callback object to the list associated with that zone. This callback gets called when one or more addresses become available.
  • iterate through the nameservers to get the addresses associated with each nameserver.
    • if there are still nameservers that have not started fetching their addresses, instruct them to do so (passing them a callback) and retry 1d.
    • if any addresses are retrieved, execute all callback objects associated with the zone, passing the address selected as described above.
    • if all nameservers report that all fetches for nameserver addresses failed, call all callbacks associated with the zone with an indication of failure.
    • if no addresses are retrieved, but some nameservers have a fetch in progress, return. When these fetches complete, they will complete the query.
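
In outline, the handling of the five outcomes might be sketched as follows. The state names, members and helper functions here are illustrative assumptions, not the actual implementation:

    // Illustrative outline only; names are assumptions.
    enum ZoneState { NOT_FETCHED, IN_PROGRESS, READY, UNREACHABLE, EXPIRED };

    void lookupAddress(boost::shared_ptr<ZoneEntry> zone,
                       boost::shared_ptr<AddressRequestCallback> callback) {
        switch (zone->state) {
        case NOT_FETCHED:                 // 1a: newly created entry
        case EXPIRED:                     // 1a': existing pointers recycled later
            zone->callbacks.push_back(callback);
            zone->state = IN_PROGRESS;
            startNsFetch(zone);           // ask the resolver for the NS RRset
            break;
        case IN_PROGRESS:                 // 1b: just queue the callback
            zone->callbacks.push_back(callback);
            break;
        case UNREACHABLE:                 // 1c: fail immediately
            callback->unreachable();
            break;
        case READY:                       // 1d: try to produce an address
            zone->callbacks.push_back(callback);
            processReady(zone);           // iterate nameservers, start address
                                          // fetches, run callbacks on success
            break;
        }
    }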

When a NameserverEntry is asked to provide addresses, it may:

2a. Be empty. In that case it returns nothing and signals that it has not yet started fetching.

2a'. Be expired. This acts as in 2a.

2b. Be marked as "fetch in progress". It returns nothing and signals that a fetch is in progress.

2c. Be marked as unreachable. It returns nothing and indicates so.

2d. Be marked as ready. It simply returns the addresses.

Internally, the code is slightly more complicated, because a caller may ask for any IP address or for v4 or v6 addresses only, and the state is tracked for each of these cases separately. A nameserver might, for example, be unreachable over v6 but reachable over v4. When any address is acceptable, there is also the possibility of having some addresses while a fetch is still in progress; in that case the addresses already obtained are returned as well.

Action on Fetch Returns

When an NS RRset arrives (because it was asked for in 1a), we initialize the list of nameservers, looking them up in the nameserver hash table and creating them if they are not there. Then we mark the zone as ready and proceed as if the query had just arrived, entering 1d. (We need to be careful when we have looked up a nameserver in the hash table and it is marked as "fetch in progress"; in that case we need to register a callback with it.) When we fail to obtain the RRset, we do almost the same, but enter 1c.

Similarly, when we receive A or AAAA records, we store them and mark the entry as ready (at least for that address family). Then we execute any callbacks for that family (or for any family) and let the zone entries try case 1d again, extracting addresses from us. When we fail to obtain the records, we do a similar thing, but mark ourselves as unreachable.
