wiki:InMemoryZoneDesign

1. Introduction

This is an initial proposal of revising the data structure of the in-memory data source so it will be more memory efficient. The overall specific goals are:

  • It should be as efficient as possible in terms of memory footprint so that it can store very large zones (with multiple millions or more of records) in memory.
  • It should be incrementally changeable: we need to be able to make partial updates to the zone as a result of IXFR-in or DDNS.
  • It should be "shared memory friendly" for future extension. In terms of C++, this means structures and data fields should be basically "plain old data". In particular, we cannot use classes with virtual methods. We should also avoid using complicated data structures that may internally involve memory allocation/release, such as STL containers.
  • It shouldn't increase run time cost (primarily for lookup; at a higher level, for overall query processing including the search for additional records or rendering). Preferably, it should even become faster than the current version.

2. Design Overview

The proposed design is a hybrid of the internal zone structure of BIND 9 and NSD. They key points are:

  • It's a BIND 9-style tree of red-black trees. Each tree nodes corresponds to some sequence of domain name labels.
  • The label data are encoded in the form of wire-format name data with a sequence of offsets.
  • RRsets are stored as "tree node data". The owner name is omitted as it can be derived from the corresponding node; RDATA are generally encoded in its wire-format, but fields of domain names are represented as a pointer to a tree node of that name (which will make additional record processing and possibly name compression considerably faster).
  • RRSIGs are encoded with the RRset for which the signatures are. Only RDATA is stored because others can be known from the corresponding RRset or the context.
  • The tree may contain names that don't belong to the zone (called "invisible"). They are necessary for names only seen in the RDATA (and necessary for the shortcut pointers described above).
  • Dedicated RRset class is introduced, exploiting these internal details. Essentially, it consists of pointers to the tree node and the RRset structure, and provides public class features examining the pointed data. This means some of the public methods such as getName() or getRdataIterator() can be quite inefficient. It's okay, though - the plan is to bypass these and use customized internal logic for performance sensitive operations.

3. Name (Label) Encoding in Tree Nodes

The label sequence for each tree node is encoded as a binary sequence, placed in memory immediately following the corresponding RB node structure. So it will be accessed as follows:

    RBNode* node;
    uint8_t* labelseq_data = reinterpret_cast<uint8_t*>(node + 1);

Note: This is one of the rare cases where the use of reinterpret_cast would be justified (and is necessary). We wouldn't like to have a separate pointer field within RBNode: it would require additional 4 or 8 (or some other, depending on the size of pointer types) bytes per node, and the separate allocation of node and data will make cache miss more likely to happen. Since this is a very performance sensitive area both in terms of memory footprint and access speed, the advantages would outweigh the risk of using reinterpret_cast.

This is a conceptual definition of the tree node class:

template<typename DataType>
class RBNode {
    offset_ptr<RBNode> parent_;
    offset_ptr<RBNode> left_;
    offset_ptr<RBNode> right_;
    offset_ptr<RBNode> down_;
    offset_ptr<DataType> data_;
    // Some other attributes: the "color" of node, flags, ...
};
// Encoded name (label) data follow.  See below.

Note that we use offset_ptr for pointer types. This is based on the plan of making it shared-memory friendly from the beginning as we discussed before: https://lists.isc.org/pipermail/bind10-dev/2012-June/003551.html The size of offset_ptr is the same as that of normal pointers, so there's nothing special with it for the purpose of this design discussion (except for the additional costs in pointer manipulation).

The following diagram explains how the name (labels) would be encoded:

This shows the case where we add "example.org." then "example.com." to the tree. Since the label sequence at the node of "example.org" does not represent an absolute name, the encoded data do not contain the part for the trailing dot and the corresponding offset value. But the data are still stored in a larger region of memory that could store the entire absolute name. This is because there was originally only this node (before adding "example.com."), at which point a memory region that could store the entire label + offsets (+ the base RBNode) was allocated. It was subsequently split off into the node for just the trailing dot (root) and the remaining part. But we cannot discard the old memory region and re-create a new one for the shortened data (even if we ignore the time overhead), because this node may have been pointed to from somewhere else (see below) so we cannot simply change the address of the node.

For this reason, the node (+ data) needs to maintain the information of the currently effective length of data (for lookup/modification) and the originally allocated length (for eventual deallocation). In the above diagram the "old" lengths (for the label part and offset part) are encoded as part of the data; the current lengths would be somewhere else. In the BIND 9 implementation it's included in the base node structure. What makes most sense depends on the total size of the base structure, considering other necessary attributes.

4. RRset Encoding and Zone Structure

4.1. RDATA field specification

For encoding RRset as part of the in-memory data source, we introduce a notion of "RDATA field specification". It's similar to http://bind10.isc.org/wiki/RdataAPIFramework but even more optimized for efficient encoding. We categorize RDATA fields into the following 3 types:

  • Fixed length data
  • Variable length data
  • Domain name

And "domain name" type of field can have the following attributes:

  • Whether it should be compressed when rendered (and if possible)
  • Whether it requires additional section processing

This is a straightforward representation of this information:

struct RdataFieldSpec {
    unsigned int type;
    unsigned int attributes; // in practice, for name-type only
    uint16_t len; // applicable for fixed-len-data type only
};

We can then define how each type of RR should be encoded in the form of a sequence of RdataFieldSpec:

struct RdataEncodeSpec {
    uint16_t n_fields; // number of fields
    uint16_t n_names;  // number of domain name fields
    RdataFieldSpec field_spec[];
};

IN_A_ENCODE_SPEC = {1, 0, {{FIXED_DATA, 0, 4}}};
GENERIC_NS_ENCODE_SPEC = {1, 1, {{NAME, COMPRESSIBLE | ADDITIONAL, 0}}};
GENERIC_SOA_ENCODE_SPEC = {3, 2, {{NAME, COMPRESSIBLE | ADDITIONAL, 0},
                                  {NAME, COMPRESSIBLE | ADDITIONAL, 0},
                                  {FIXED_DATA, 0, 20}}};
GENERIC_TXT_ENCODE_SPEC = {1, 0, {VARIABLE_DATA, 0, 0}};
...
GENERIC_RRSIG_ENCODE_SPEC = {1, {VARIABLE_DATA, 0, 0}};
...

These will be either private definition for the in-memory data source implementation or part of general libdns++ framework. Maybe we begin with the former approach then consider the latter. Note that these are not part of the zone data. They are part of the library.

Note that the last field of SOA RDATA is defined as a 20-byte fixed length field. While protocol wise this "field" consists of five 4-byte fields, such details aren't necessary for encoding and rendering the data. Likewise, the TXT RDATA is considered to be a variable-length single field, even if in terms of the DNS protocol even the number of "real" fields is not fixed.

The key observation is that it would be most space and time efficient for performance sensitive operations (storing data, looking up records, rendering these in a DNS response message). We'll still provide general interface so the user can examine the actual field of the RDATA, which will be expensive, but it's assumed to happen only in a rare and performance insensitive situations.

I'd also propose handling RRSIGs as single-fielded variable length data. Although it's contained a domain name ("signer") field, it's not to be compressed, and, in practice, it won't be a target of name compression because it's owner name should be a subdomain of or equal to the signer. Even if we happen to load an RRSIG that doesn't meet this requirement, it only results in a bit suboptimal compression (we could even argue that's broken anyway). For memory foot print it's not a big deal or might even be a bit suboptimal (depending on the length of the name), but for rendering that should be surely faster, and it could be non negligible for a response containing multiple RRSIGs (which is normally the case for a DNSSEC-enabled response from a signed zone).

4.2. RdataSet structure

To encode an RRset excluding the owner name, we introduce a structure RdataSet (name derived from BIND 9), indicating that it's slightly different from RRset (i.e., missing the name). Like the tree node above, it effectively consists of a fixed length base structure immediately followed by other encoded data.

The actual encoding of the base structure and the following data fields is to be discussed, but one initial idea is as follows:

struct RdataSet {
    offset_ptr<RdataSet> next; // for linked list
    uint16_t type; // RR type
    uint16_t is_signed : 1; // If this set contains RRSIGs
    uint16_t many_rrs : 1;  // If the number RRs >= 2^14 = 16384
    uint16_t num_rrs  : 14; // # of RRs, up to 16383
    uint32_t ttl; // TTL of the RR, maybe in the net byte order
};
// Data that follows:
// [uint16_t // # of RRs, if many_rrs == 1]
// [possibly padding if many_rrs == 1]
// offset_ptr<RBNode<RdataSet> > // pointer to the tree node for the
//                                  1st name field (if any) of 1st RDATA
// ...repeat for all name fields of all RDATAs.  There should be
// n_names * num_rrs name fields.
// offset_ptr<RBNode<RdataSet> > // pointer to the tree node for the
//                                  last name field (if any) of last RDATA
// uint16_t n1_1 // size of 1st variable len field (if any) of 1st RDATA
// uint16_t n1_2 // size of 2nd variable len field of 1st RDATA
// ...
// uint16_t nN_M // size of last (Mth) variable len field of last (Nth) RDATA
// [uint16_t num_sigs // # of RRSIGs, if is_signed == 1]
// [uint16_t // size of first RRSIG]
// ...
// [uint16_t // size of last RRSIG]
// uint8_t[] // value of 1st data field of 1st RDATA
// uint8_t[] // value of 2nd data field of 1st RDATA
// ...
// uint8_t[] // value of last data field of last RDATA
// [uint8_t[] // value of 1st RRSIG, if any]
// ...
// [uint8_t[] // value of last RRSIG, if any]

The definition and comments should describe the meaning pretty clearly. Some key points are:

  • With the observation that # of RRs (RDATAs) should be reasonably small, stealing 2 bits of a 16-bit field that encodes this information, for indicating whether RRSIGs are included and for covering the case where there are actually very many RRs.
  • The total size of the base RdataSet structure (excluding the "next" field) is 8 bytes, so that there'll (normally) be no padding field for the pointer fields that follow.
  • As noted in the overview section, names are represented in the form of pointers to tree nodes.
  • RR class is not encoded here. It's assumed to be retrievable per zone or data source level data structure.
  • Actually, unless there are very many RRs there shouldn't be any need for padding. (The case with very many RRs should be extremely rare in the first place, and in such a case the additional padding overhead should be relatively minor).
  • Note that we only encode RDATA for RRSIGs. The type information is simply unnecessary, and we use the same TTL for both the signed RRset and RRSIGs.
  • As commented, it may make sense to store the TTL in the network byte order. Then we don't have to do byte-order conversion at the rendering time (and this information should normally be unnecessary in the lookup process).
  • RR type could also be encoded in the network byte order, but this field will need to be referenced in the main lookup processing, so the benefit wouldn't be worth the complexity (and its own overhead).
  • The RdataSets of the same name will form a single linked list through their "next" members. This means we need to perform linear search to identify a particular type of RRset of a given name. It should be okay in practice, as there wouldn't be many types of RRs for a name. But if we concern about that we could consider some other forms such as a sorted array, at the increased cost of initial building and updating.

The following diagram shows entire zone data including the name tree and RRsets for the "example.org" zone with the following RRs:

example.org. SOA 3600 ns.exapmle.org. root.example.org. <other fields>
example.org. NS 3600 ns.exapmle.org.
example.org. NS 3600 ns1.exapmle.com.
example.org. NS 3600 ns2.exapmle.com.
example.org. RRSIG 3600 NS <other fields 1>
example.org. RRSIG 3600 NS <other fields 2>

Most part of this should be straightforward representation of the concept and definition described above. Non trivial (and possibly controversial) points would be:

  • There's a node that actually doesn't exist in the zone, such as root.example.org. This is necessary because we need a pointer target for all name fields in the RRs. We'll need to distinguish these nodes from other ordinary nodes so that we exclude them in normal search. In this memo, we call such tree nodes "invisible".
  • Invisible names can include completely out-of-zone names, i.e., names under example.com. They will never have RRs, but will still be necessary for rendering (especially depending on how we compress names)
  • Invisible nodes require other special processing. For example, by adding and deleting names from the zone, a node could change "visible" to "invisible" or vice versa.
  • Also, tree nodes that are pointed to by RRset fields cannot simply be removed when the name is removed from the zone. In this example we assume to use a reference counter field in each node to manage that.

4.3. Adding and Deleting

Adding and deleting RRs to the zone are not particularly tricky (though maybe not very simple). The original BIND 9's implementation took into account these operations, and our memory-inefficient version already supports adding to some extent, so we basically only have to port the deletion logic from BIND 9.

Managing RdataSet needs some extra care, however. We need to be able to support merging some RRs to an existing RRset and removing some subset of an existing RRsets. Also, as noted, the implication of adding and deleting on "invisible' nodes needs to be taken care of.

Since RRSIGs and the RRset they sign are maintained in the same structure, we also need to be careful about tricky cases like there's no RRset to be signed. That could happen, for example, if only RRSIGs are added to a zone or all RRsets (originally signed) are removed. This could also happen when normal loading of a valid zone but RRSIGs are added first before the RRset they sign (which wouldn't happen in textual zone file dnssec-signzone would create, though). I guess in that case we'll keep the Rdataset with an empty set of actual data. So the name still exists in the zone, and lookup would result in NXRRSET.

5. Relationship between RRset Object and In-memory Zone Data

We'll need to define a customized RRset (derived) class so the RRsets encoded in-memory can be used in other generic interface. This customized class would look like this:

class RBNodeRRset : public AbstractRRset {
private:
    bool signed_; // whether the RRSIGs (if included) should be visible
    RBNode<RdataSet>& node_; // ref to the node for the owner name
    RdataSet& rdataset_; // ref to the rdataset of the RRset
    uint8_t* wild_data_; // optional placeholder for wildcard (see below)

    // placeholder for base class interfaces, see below.
    mutable scoped_ptr<Name> name_;
    mutable scoped_ptr<RRType> type_;
    mutable scoped_ptr<RRTTL> ttl_;
};

The following diagram show two examples of the relationship between RBNodeRRset objects and the encoded in-memory data:

The upper-right one is for RRset example.org/NS. Its toWire() method will be customized so it directly renders the data stored in RdataSet (handling of owner name and name fields of RDATA can be tricky. See below). It should be actually much faster than the generic version.

To provide public interfaces defined in the base class, we'll create necessary libdns++ objects on-demand from the encoded data and use them to return the result. For example, the getType() method would be implemented as follows:

const RRType&
RBNodeRRset::getType() const {
    if (!type_) {
        type_.reset(new RRType(rdataset_.type));
    }
    return (*type_);
}

This is less efficient, but it should be okay - we won't use these methods in performance sensitive paths (for example, the customized toWire() won't use them). Also, we might consider changing the return type of some of these methods instead of returning a reference they could return real objects. If we won't use them in performance sensitive contexts, that should be okay. And, in any case, copying RRType or RRTTL objects should be pretty cheap.

When it involves wildcard substitution, things will be more complicated because the owner name is not predictable and needs to be generated run-time somehow. The lower-right RBNodeRRset shows an example of this case. In this case we use the wild_data_ field. When an RBNodeRRset object needs to be created for wildcard matched name, the underlying node and rdataset pointers should point to the wildcard tree node and its RdataSet. But we'll also allocate a temporary buffer to store the real owner name (i.e., query name) that matches the wildcard and remember the real name there in some way (maybe in the same way as we encode names in the tree). When rendering it in toWire(), this information should be used instead of the tree node.

Another tricky case is a synthesized CNAME due to DNAME matching. The lefthand side example shows this case. For this we probably introduce a separate (derived) RRset class, which is named SyncCNAMERRset here. Both the owner name and the resulting cname can only be determined run-time, we store that on-demand like we did for the wildcard case.

These RRsets objects should probably be pooled in a InMemoryClient object (e.g., using boost::object_pool) and recycled to avoid allocation and release every time. Maybe same for the temporary buffer space in the wildcard and DNAME cases, although these cases could be considered less common and can be slower.

5.1. Comparing RRsets

We need to be able to determine whether given two RRsets are of the "same kind", i.e., whether they have the same owner name, RR type and RR class so that at a higher level query processing we can eliminate duplicate RRs. And we'd like to do it efficiently for RBNodeRRset and SyncCNAMERRset. It's probably not the best if we need to construct the owner name in the form of Name objects and compare them.

What we'd do is as follows:

  • Between two non wildcard RBNodeRRsets, we compare the (address of) corresponding tree nodes; if they are the same, we compare the (address of) the RdataSet to see whether they of the same type.
  • Between two wildcard RBNodeRRsets, we first do the same check as the previous case. If they can still be of the same kind, we compare the "wild_data_" parts of them (in some way) to see they are the same.
  • It should be safe to conclude that any wildcard RBNodeRRset and non wildcard RBNodeRRset` are of "different" kind. Likewise, it should be safe to conclude that any RBNodeRRset and SyncCNAMERRset are of different kind.
  • Between two wildcard SyncCNAMERRset we should probably compare the dynamically created owner names. But, in practice, there won't be a situation where we need to compare two of these RRsets.

6. Additional Record Processing

Handling additional records (e.g. finding and adding AAAA and/or A for NS name) can be efficient due to the pointer-based architecture for names stored in RDATA.

The following diagram shows how we create additional RRsets with the following zone content:

example.org. MX <pref> mx.example.org.
example.org. MX <pref> mx.example.com.
mx.example.org. AAAA <some IPv6 address>
mx.example.org. A <some IPv4 address>

In the getAdditionalImpl() implementation, it examines the RdataFieldSpec sequence for type MX. It contains one name field, which indicates it requires additional processing. So for each name field (pointer to nodes) stored in RdataSet, it follows the pointer and identifies the corresponding tree nodes. It then checks whether the node has data, and if so, whether the data contain AAAA or A RRs. In this example, the example.org. node meets this condition, so the corresponding RRsets will be created.

The mx.exmaple.com node doesn't have data (in fact, it's not even in the zone), so the processing is stopped here.

When wildcard is involved, however, it can be trickier. The next diagram shows that case with the following zone content:

example.org. NS ns.wild.example.org.
*.wild.example.org. AAAA <some IPv6 address>

In this case, the NS name (ns.wild.example.org.) doesn't exist in the zone, and only matches the wildcard. So the tree node pointed from the NS RdataSet should be "invisible". Unlike the previous case for mx.example.com, however, getAdditionalImpl() needs to go up the tree one level to see if the node has "wild" flag. In this case it does, so it concludes it's a wildcard match, identify the wildcard node, and constructs an RBNodeRRset with wildcard substitution. This is a bit expensive operation, but should be a very rare case and so acceptable.

A more critical problem is that there's no way if an invisible node is subject to wildcard substitution. So, even in the case of "mx.example.com." (forgetting it's out-of-zone name), we'd need to go up the tree to confirm there's no matching wildcard name. This cost is probably non negligible and would better be avoided. One possible idea is to keep record of whether the zone has any wildcard name, and if not, just skip the check.

7. Name Rendering and Compression

One open question is how we render (possibly with compression) owner names and name fields of RDATA of RBNodeRRset without increasing overhead too much.

One straightforward way would be to construct the name data with offsets in a temporary contiguous space, build LabelSequence from it (we need to extend the sequence to allow that), and render it using the common MessageRenderer interface (but we'll also need to extend the render class so its writeName method can take LabelSequence in addition to Name).

Another approach is to exploit the tree structure like NSD internally does. One implementation of this would be:

  • prepare a hash table to map node address to offset value
  • For each name, locate the corresponding tree node, and identify the offset value for the address of the node from the hash table. If found, the name can be compressed at that point; otherwise, render that part of the name (sequence of labels), create an entry and record the offset. Continue to the next step.
  • Going up the tree, level by level, and repeat the same thing until an offset is found or the end of the name (label sequence containing the trailing period) is reached.

This will be very fast, but to make it workable we need to support other tricky cases such as wildcard substitution and DNAME-synthesized CNAME. Names related to these cases are not in the original zone data, so we need to create temporary tree nodes for them so we can apply the same steps transparently.

Right now I'm not sure which way (or something else) we should pursue.

8. NSEC3 Space

Due to its special nature, the NSEC3 name space could be encoded differently for higher efficiency both in terms of memory footprint and lookup performance. But at least for the first (revised) version it's probably better to use the same architecture as the generic tree, just maintained separately.

Last modified 5 years ago Last modified on Jun 21, 2012, 9:44:39 PM

Attachments (6)

Download all attachments as: .zip