
BIND 10 Shared Memory Data Source

Problem Statement

One of the fundamental problems with BIND 9 performance has been locking data structures to allow concurrent access by multiple threads. This requirement for locking slows all lookups, and causes contention that limits the scalability of the program.

For BIND 10, we use an in-memory data source to answer queries, and we want to limit or eliminate locking while still allowing multiple cores to access the data.

More Details about the Problem

An authoritative DNS server needs to search for answers to the DNS queries it receives, looking in what we call a "data source" in BIND 10. Typically there are multiple flows of execution on the machine, usually about one per core, since DNS servers tend to be CPU-bound. These flows may be threads or processes. Each of these flows needs concurrent access to the data source.

The contents of the data source may be updated by one of several methods. Some result in an entirely new version of a zone:

  • A zone may be re-loaded by the administrator
  • A zone may be received via AXFR

Some result in a changed version of a zone, with added, removed, or modified RRsets:

  • Zone contents may be changed by the administrator
  • A zone may be updated via IXFR
  • A zone may be updated via DDNS

Note that a server may serve thousands or potentially millions of zones at the same time.

When a zone is updated, we want hosts querying the server to get a consistent view of the zone.

When we have a data source that is based on some underlying technology that we did not implement, we punt the problem to that technology. So, for SQL-based data sources, we assume that either database transactions or locking will handle concurrency. This applies to all such data sources - and may actually mean that a particular data source does not support any concurrency at all.

However for BIND 10's in-memory data sources we did implement the underlying technology, so we need to solve the concurrency issue.

For most DNS servers, the update rate is several orders of magnitude less than the query rate. For example, a busy authoritative server may get tens of thousands of queries per second, and several hundred updates per minute.

Solution Space

We believe that it is necessary to provide lock-free, concurrent access to zone data.

We believe that the only approach that makes sense is to have a version of the zone that is being updated that is not used by querying flows. Then each of the querying flows can access the data source without any locks. When an updated version is ready, the querying flows can quickly (ideally atomically) switch to the new version.

The simplest way to perform this update is to have a single flow updating the zone. There may be some benefit to allowing concurrent updates to the zone data if fine-grained locks are used (perhaps using the lock-bucket technique from BIND 9), but this needs research - since updates are quick, the overhead of locking may overwhelm any performance benefit.

The flows would end up looking something like this:

   +----------+  +----------+        +----------+
   |  Query   |  |  Query   |        |  Query   |
   |   Flow   |  |   Flow   | . . .  |   Flow   |
   +----+-----+  +----+-----+        +----+-----+
        |    +--------+                   |
        |    |                            |
        |    |   +------------------------+
        v    v   v
       /+----+---+-\    /-----------\
       |   Query   |    |  Update   |
       | Zone Data |    | Zone Data |
       \-----------/    \-----+-----/
                              ^
                              |
                          +---+----+
                          | Update |
                          |  Flow  |
                          +--------+
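
To illustrate the intended switch in isolation, here is a minimal sketch using C++11 shared_ptr atomics; the ZoneData type and function names are hypothetical, and in the shared-memory design described below the "pointer" would refer to a memory block rather than a heap object:

    #include <memory>   // std::shared_ptr and its atomic free functions

    // Hypothetical stand-in for the in-memory zone data structures.
    struct ZoneData {
        // RRsets, lookup trees, ...
    };

    // The version of the zone currently visible to the query flows.
    std::shared_ptr<ZoneData> g_current_zone = std::make_shared<ZoneData>();

    // Query flow: take a consistent snapshot without holding a lock.
    std::shared_ptr<ZoneData> acquire_zone_for_query() {
        return std::atomic_load(&g_current_zone);
    }

    // Update flow: build a complete new version off to the side, then publish
    // it with a single atomic swap. An old version is reclaimed automatically
    // once the last query flow releases its snapshot.
    void publish_new_version(std::shared_ptr<ZoneData> updated) {
        std::atomic_store(&g_current_zone, std::move(updated));
    }

    int main() {
        publish_new_version(std::make_shared<ZoneData>());
        std::shared_ptr<ZoneData> snapshot = acquire_zone_for_query();
        return snapshot ? 0 : 1;
    }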

What Kind of Shared Memory?

All query flows need to see a copy of the zone data, as does the update flow.

Full Copies

One approach is to keep separate copies of the zone data, so each query flow has its own copy. This is not actually shared memory, but does give each flow an identical image. This is the approach BIND 10 currently takes. However, it consumes a lot of extra memory and is not a reasonable long-term solution.

Forking

Another approach is to have the update flow use fork() to create copies of the data after updates have completed. This is the approach taken by NSD, and works something like this (a sketch of the pattern follows the list):

  1. Parent process loads copy of zone
  2. Parent process forks a number of processes to handle queries
  3. When the zone changes, the parent process updates its copy
  4. After updates are complete, the parent process asks each child to terminate, and then forks a new copy of itself to replace each child
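
A skeletal illustration of this flow (hypothetical function names, error handling omitted; this is not NSD's actual code):

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <signal.h>
    #include <unistd.h>
    #include <cstddef>
    #include <vector>

    // Hypothetical placeholders for the real loading and serving code.
    void load_zones()    { /* build or refresh the zone image in this process */ }
    void serve_queries() { for (;;) pause(); /* answer queries using the inherited image */ }

    static pid_t spawn_child() {
        pid_t pid = fork();          // child inherits the zone image copy-on-write
        if (pid == 0) {
            serve_queries();         // never returns in this sketch
            _exit(0);
        }
        return pid;
    }

    int main() {
        const int num_children = 4;  // roughly one per core
        load_zones();                // step 1: parent loads its copy of the zones

        std::vector<pid_t> children; // step 2: fork the query-handling children
        for (int i = 0; i < num_children; ++i) {
            children.push_back(spawn_child());
        }

        // Steps 3 and 4, performed whenever a zone changes: update the parent's
        // copy, then replace each child so it picks up the new image.  In a
        // real server the parent would keep running and repeat this on demand.
        load_zones();
        for (std::size_t i = 0; i < children.size(); ++i) {
            kill(children[i], SIGTERM);
            waitpid(children[i], 0, 0);
            children[i] = spawn_child();
        }
        return 0;
    }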

This method is portable across all Unix variants, and requires a minimum of work to implement. It has a few disadvantages:

  • It depends on the OS to do efficient copy-on-write memory operations to avoid duplication
  • It is non-obvious to an administrator that the memory of the processes is shared
  • There is no equivalent of fork in Windows

Threads

In a multi-threaded model each thread has implicit access to the entire memory space. This is the model used by BIND 9. It has the advantage that no work at all is required for the various threads to gain access to the data. If done carefully, access can be both lock-free and require only a minimal amount of extra memory when the zone is updated.

The main problem with this approach is that it creates "fate sharing", meaning that a bug in any one thread will necessarily crash the server. While a bug that corrupts the zone data will cause problems for all models, with a threaded application ANY bug affects the entire service.

Explicitly Shared Memory

By "explicitly shared memory" we mean either System V shared memory or mmap() in Unix, or named shared memory in Windows. In this model, we create a block of memory that all of the flows attach to. Updates are done to a new block of memory, and when it is ready all of the query flows use that.

This model allows additional query-handling flows to be created at any time, makes memory use obvious to administrators, and works on Windows.

It has a few disadvantages:

  • Memory use will be increased when modifying zones, and may double if the OS does not support copy-on-write for shared memory (although with heroic effort this issue can probably also be overcome)
  • It requires a custom memory manager to handle working in the block
  • It requires relative pointers within the data structures, rather than the usual absolute pointers (see the sketch below)

This is the model we propose for BIND 10: while it is more technically challenging, in the end it results in an optimal implementation.
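
To make the last two disadvantages concrete, here is a minimal, hypothetical sketch of offset-based pointers with a trivial in-block allocator; an ordinary heap buffer stands in for the shared mapping, since each process may map the real block at a different address:

    #include <cstddef>
    #include <cstring>
    #include <iostream>
    #include <vector>

    // In a shared mapping, each process may see the block at a different base
    // address, so the data structures store offsets from the base instead of
    // ordinary absolute pointers.
    struct Node {
        std::ptrdiff_t next;   // offset of the next node from the base; 0 = none
        char name[16];
    };

    // Resolve an offset to a usable pointer for this process's mapping.
    Node* at(char* base, std::ptrdiff_t offset) {
        return offset == 0 ? 0 : reinterpret_cast<Node*>(base + offset);
    }

    int main() {
        std::vector<char> block(1024);   // stands in for the shared memory block
        char* base = &block[0];

        // A trivial bump "memory manager" inside the block; offset 0 is
        // reserved so that it can serve as the null value.
        std::size_t used = sizeof(Node);
        std::ptrdiff_t first = static_cast<std::ptrdiff_t>(used);
        used += sizeof(Node);
        std::ptrdiff_t second = static_cast<std::ptrdiff_t>(used);
        used += sizeof(Node);

        // Build a two-element list using offsets only.
        std::strcpy(at(base, first)->name, "example.");
        at(base, first)->next = second;
        std::strcpy(at(base, second)->name, "org.");
        at(base, second)->next = 0;

        // Walking the list works regardless of where the block is mapped.
        for (Node* n = at(base, first); n != 0; n = at(base, n->next)) {
            std::cout << n->name << "\n";
        }
        return 0;
    }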

Possible Implementation

Updater Task

A single program is started that handles the creation and updating of the shared memory. On start, it reads which zones are served from the data source, creates a memory space big enough for them, and builds the data structures there. It may read the zone data either from a zone file (for static zones) or from another data source (for zones that change, either because the server is acting as a secondary or because the zone accepts DDNS updates).

The memory created is set to be cleaned up by the operating system when the last process detaches from it. The updater itself points to the block of memory for the current version of the data source.
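
As one possible mechanism for this (a hedged sketch using POSIX shared memory; the segment name and fixed size are made up for illustration), the updater might create the block roughly like this:

    #include <fcntl.h>      // O_* constants
    #include <sys/mman.h>   // shm_open, mmap, munmap
    #include <sys/stat.h>   // mode constants
    #include <unistd.h>     // ftruncate, close
    #include <cstddef>
    #include <cstdio>

    int main() {
        const char* name = "/b10-zone-data";        // hypothetical segment name
        const std::size_t size = 64 * 1024 * 1024;  // sized to fit the zones

        // Create the shared memory object and set its size.
        int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
        if (fd == -1) { perror("shm_open"); return 1; }
        if (ftruncate(fd, size) == -1) { perror("ftruncate"); return 1; }

        // Map it read-write; the updater builds the zone data structures here.
        void* base = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }
        close(fd);   // the mapping stays valid after closing the descriptor

        // ... build zone data structures inside [base, base + size) ...

        // One way to approximate "clean up when the last process detaches":
        // once every consumer has attached, shm_unlink() removes the name and
        // the kernel reclaims the memory when the last mapping goes away.
        // When to unlink is part of the coordination protocol.
        // shm_unlink(name);

        munmap(base, size);
        return 0;
    }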

Query Tasks

The servers that answer authoritative queries run as they do today, but one of their data sources may be this one. They connect to the block of memory created by the updater, in a read-only fashion.
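
Using the same hypothetical segment name as the sketch above, the read-only attach might look roughly like:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        const char* name = "/b10-zone-data";  // must match the updater's name

        // Open the block read-only; a query process cannot corrupt the data.
        int fd = shm_open(name, O_RDONLY, 0);
        if (fd == -1) { perror("shm_open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) == -1) { perror("fstat"); return 1; }

        void* base = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }
        close(fd);

        // ... answer queries using the data structures found at 'base' ...

        munmap(base, st.st_size);
        return 0;
    }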

Updates to Zones

Changes to zones occur on another data source first. Initially this will be the SQLite data source, although we may implement a PostgreSQL or MySQL data source due to concurrency limitations in the SQLite data source. We do this to simplify the updater - it does not need to handle transactions or concurrency in any way.

When a change is made to a zone, the updater creates a new block of memory, ideally using a copy-on-write version of the old block. It then reads the differences that have happened to the zone and applies them to the new block. When they are all applied, it notifies the query tasks that a new block is available. After the first query task replies, the updater detaches from the old block.

The query tasks receive the update and process it after they have finished answering their current query (if any). They switch to the new block and detach from the old block. When the last one detaches, the OS cleans up the memory.
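
A hedged sketch of how a query task might perform the switch once it has been notified (the notification transport itself is not shown, and the names are hypothetical):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstddef>

    // Hypothetical record of the block a query task is currently attached to.
    struct AttachedBlock {
        void* base;
        std::size_t size;
    };

    // Called between queries once a "new block is ready" notification has been
    // received; 'name' identifies the new segment. The old block is unmapped
    // afterwards, and the OS reclaims it when the last query task has detached.
    bool switch_to_new_block(AttachedBlock& current, const char* name) {
        int fd = shm_open(name, O_RDONLY, 0);
        if (fd == -1) {
            return false;
        }
        struct stat st;
        if (fstat(fd, &st) == -1) {
            close(fd);
            return false;
        }
        void* base = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);
        if (base == MAP_FAILED) {
            return false;
        }

        AttachedBlock old = current;
        current.base = base;
        current.size = static_cast<std::size_t>(st.st_size);
        if (old.base != 0) {
            munmap(old.base, old.size);   // detach from the old block
        }
        return true;
    }

    int main() {
        // In the real server this would be driven by the notification message;
        // the segment name here is purely illustrative.
        AttachedBlock block = { 0, 0 };
        switch_to_new_block(block, "/b10-zone-data-v2");
        return 0;
    }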

Implementation Bits

There are several things that need to be done to build this data source. We need to figure out how to most naturally use shared memory with C++, although Boost may largely solve this problem for us:

http://www.boost.org/doc/libs/1_49_0/doc/html/interprocess/quick_guide.html
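
For instance, Boost.Interprocess provides managed shared memory segments with their own in-segment allocator and offset-pointer-based containers, which would cover much of the custom memory manager and relative pointer work described above. A small, hedged sketch (segment name and contents are illustrative only, not the actual BIND 10 design):

    #include <boost/interprocess/managed_shared_memory.hpp>
    #include <boost/interprocess/containers/string.hpp>
    #include <boost/interprocess/allocators/allocator.hpp>
    #include <iostream>
    #include <string>

    namespace bip = boost::interprocess;

    int main() {
        const char* segment_name = "B10ZoneSegment";
        bip::shared_memory_object::remove(segment_name);

        // A managed segment has a built-in memory manager and uses offset
        // pointers internally, so it can be mapped at different addresses in
        // different processes.
        bip::managed_shared_memory segment(bip::create_only, segment_name, 65536);

        // A string type whose storage lives inside the shared segment.
        typedef bip::allocator<char, bip::managed_shared_memory::segment_manager>
            CharAllocator;
        typedef bip::basic_string<char, std::char_traits<char>, CharAllocator>
            ShmString;

        // The updater constructs named objects inside the segment...
        CharAllocator char_alloc(segment.get_segment_manager());
        segment.construct<ShmString>("zone-origin")("example.org.", char_alloc);

        // ...and a query process would open the same segment (for example with
        // bip::open_read_only) and look the objects up by name:
        ShmString* origin = segment.find<ShmString>("zone-origin").first;
        if (origin != 0) {
            std::cout << *origin << std::endl;
        }

        bip::shared_memory_object::remove(segment_name);
        return 0;
    }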

Some things to be built:

  • An update program that loads zones from zone files or from another data source into the memory block
  • A data source library that can attach to the memory block and answer queries using it
  • A notification message to tell processes to ask the data source library to look for a new memory block
  • A way to notify the update program when a zone has been updated
  • Addition to the update program to read differences in zones
  • Addition to the update program to apply differences to a new data source
  • Addition to the update program to send notification when a new block is ready