Inter-process Protocol for Shared Memory Data Source

0. About This Document

This is a proposed system design for shared-memory based in-memory data sources, mainly focusing on the protocol used between relevant BIND 10 modules.

1. Relevant Modules

We'll focus on the following BIND 10 modules in this document:

  • The memory manager (tentatively named, sometimes called "memmgr", just "manager", etc): this is a newly introduced module for the shared memory support. It manages shared memory segments corresponding to specific data sources. There must not be more than one memmgr process in a BIND 10 system (there can be zero). Only the memmgr can have write access to shared memory segments, and when the memmgr modifies a shared memory segment, no instance of other modules can refer to that segment.
  • Shared memory reader modules: BIND 10 modules other than memmgr that have read-only access to shared memory segments managed by memmgr. The auth server module is specifically intended as a reader, but some others such as xfrout may also be one. There can be multiple processes of the same reader module.

2. Assumptions

  • we assume we can distinguish different data source instances by name (like "sqlite3", "MasterFiles", or other user-defined name especially if there are multiple instances of the same type of data source).
  • use file-based shared memory
  • use a single shared memory region per data source instance: all zones of that instance are stored in that region, and data of different data source instances are mapped on to different regions.
  • we assume there's an API to map a memory region to a specific file for a given data source instance (identified by the name). That API will allow the caller to specify the access mode, either read-only or read-write.
  • we assume the message bus guarantees reliability; messages won't be lost in the bus, and the order of messages sent by a module to the same destination will be preserved.
  • we assume there's a mechanism in the message bus that can reliably tell a module the latest list of modules listening on a given group, and can report any update (a new instance joining or an existing instance leaving) in a timely fashion.

3. Conceptual Data Structures

The memmgr maintains the following structure for each data source instance configured for the BIND 10 system:

// Note: For our familiarity a C++-like structure is used, but this is not
// necessarily a specific proposal for the actual implementation.
#include <set>
#include <string>

struct SegmentInfo {
    std::string datasrc_name;   // something like "sqlite3", "MasterFiles", or user-defined name
    std::string map_file_base;  // e.g., "zone-sqlite3.mapped"
    unsigned int current_file_id;  // 0 or 1 (set to 0 initially)
    // Reader IDs are shown as strings for illustration; use any ID given from msg-bus.
    std::set<std::string> readers;      // set of current readers
    std::set<std::string> old_readers;  // ditto; only used while updating the image
};

Reader module instances maintain the following structure for each data source instance configured for the BIND 10 system:

#include <string>

enum MemorySegmentState { UNUSED, INUSE, WAITING };

struct DataSourceStatus {
    std::string name;  // consistent with manager's SegmentInfo::datasrc_name
    MemorySegmentState sgmt_state;  // UNUSED, INUSE, or WAITING
};

4. Messages

The memory manager and the segment readers exchange messages to synchronize the configuration and status of shared memory segments. We use BIND 10's inter-module command framework to implement these messages.

Each message is defined as a single command with possible command arguments. The following messages (commands) are defined:

  • "segment_info_update": This will be sent by the memmgr to each of its reader processes (maintained in SegmentInfo::readers) at system startup or when a new version of a memory segment is ready and the readers need to be switched to the new one. Its argument is a list of maps, each of which contains:
    • data_source_name: string, the corresponding data source name, i.e., the value of memmgr's SegmentInfo::datasrc_name.
    • params: map, detailed parameters specific to the segment type (generalized so we can extend it for other types of segments). The only parameter currently defined is:
      • mapped_file: string, a path to a file to be mapped to memory.
  • "segment_info_update_ack": This will be sent in response to segment_info_update from reader processes to the memmgr. It confirms that the sender reader process has migrated to the new version(s) of the segment(s) and no longer uses the old version of the segment. Arguments: see below.

segment_info_update is a separate BIND 10 command. segment_info_update_ack may or may not have to be a command. If we can complete asynchronous command exchange first, segment_info_update_ack can just be an answer to segment_info_update. If we cannot wait for that, segment_info_update_ack will be a separate command, and will need to contain some information about the corresponding update command (if there are multiple data sources using shared memory segments, there can be multiple outstanding updates).
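
To make the argument layout of segment_info_update more concrete, here is a minimal sketch that models one element of the argument list as a plain C++ structure and renders the list as JSON-like text. The type and function names (SegmentParams, SegmentInfoUpdateEntry, toText) and the text rendering are assumptions made for illustration only; the real encoding would be whatever the BIND 10 command framework uses.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory form of the "params" map. Only mapped_file is
// defined by this document.
struct SegmentParams {
    std::string mapped_file;  // path to the file to be mapped to memory
};

// Hypothetical form of one element of the segment_info_update argument list.
struct SegmentInfoUpdateEntry {
    std::string data_source_name;  // memmgr's SegmentInfo::datasrc_name
    SegmentParams params;
};

// Render the argument list as JSON-like text (no escaping; illustration only).
std::string toText(const std::vector<SegmentInfoUpdateEntry>& entries) {
    std::string out = "[";
    for (std::size_t i = 0; i < entries.size(); ++i) {
        // Every entry must name the data source it refers to.
        assert(!entries[i].data_source_name.empty());
        if (i > 0) {
            out += ", ";
        }
        out += "{\"data_source_name\": \"" + entries[i].data_source_name +
               "\", \"params\": {\"mapped_file\": \"" +
               entries[i].params.mapped_file + "\"}}";
    }
    return out + "]";
}
```

If segment_info_update_ack ends up being a separate command, it would need a similar structure carrying enough information to identify the update being acknowledged.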

5. Manager and Readers Behavior

In this section we describe how the memmgr and reader processes interact based on the proposed protocol along with the conceptual data structures and messages through specific operational examples.

For simplicity, we'll basically focus on one data source (whose name is "sqlite3"), but in the actual implementation both memmgr and reader applications can use multiple data sources based on the configuration. Extending the description to the general case shouldn't be difficult.

We also assume file-based shared memory, and that two identical copies of the corresponding mapped file are available at the time of startup.

5.1. Initial Startup

In this example, there's one initial reader named "auth-1". It retrieves the data source configuration and finds the memory segment for a data source named "sqlite3" is in the WAITING state. So it subscribes to the MemorySegmentReaders group and waits for updates.

On the other hand, the memmgr first identifies the mapped file for the "sqlite3" data source by joining SegmentInfo::map_file_base (assumed to be "zone-mapped" in this example) and SegmentInfo::current_file_id (initially 0) with a period, which gives "zone-mapped.0". It then maps the file's content into memory, this time mainly just to check its integrity (or it may load the zones for the first time). The readers set is initially empty.
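
The file naming rule can be sketched as below. The function names (mappedFileName, otherFileId) are hypothetical; only the "base.id" pattern and the 0/1 file IDs come from this document.

```cpp
#include <cassert>
#include <string>

// Build the mapped file name for one version of a segment, following the
// "zone-mapped.0" example in the text.
std::string mappedFileName(const std::string& map_file_base,
                           unsigned int file_id) {
    assert(file_id <= 1);  // only two versions (0 and 1) exist per segment
    return map_file_base + "." + std::to_string(file_id);
}

// The ID of the version that is not the current one: 0 <-> 1.
unsigned int otherFileId(unsigned int current_file_id) {
    assert(current_file_id <= 1);
    return 1 - current_file_id;
}
```

With map_file_base set to "zone-mapped" and current_file_id set to 0, the current file is "zone-mapped.0" and the other version is "zone-mapped.1".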

When the memmgr is ready, it sends a notify request for the MemorySegmentReaders group to get the list of readers at this point. Right now there's only one reader, auth-1. The memmgr adds it to the old_readers set (note that auth-1 is not added to the (current) readers set because the manager cannot determine the current status of these readers: they may have just started up too, or, if this is a restart of memmgr, they may already be using the mapped region).

The memmgr then sends each of the old_readers a segment_info_update message containing the segment information of the "sqlite3" data source (and possibly others, but in this example we focus on the single data source).

The reader process (auth-1) receives the update message, and maps the content of the named file into memory in read-only mode. It also sets DataSourceStatus::sgmt_state to INUSE.

The reader then sends a segment_info_update_ack message to the manager. The manager moves the sender from the old_readers set to the readers set. Once the old_readers set becomes empty, the manager can be sure there is no remaining reader of the other version of the file (even if it's a restart of the manager), so it maps the other version into memory in read-write mode for future updates.
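
The manager's bookkeeping on receiving an ack might look like the following sketch. It uses a reduced SegmentInfo with string reader IDs, and the function name handleUpdateAck is hypothetical; the actual mapping of the other version in read-write mode is left to the caller.

```cpp
#include <cassert>
#include <set>
#include <string>

// Reduced version of the conceptual SegmentInfo (reader IDs as strings).
struct SegmentInfo {
    std::set<std::string> readers;
    std::set<std::string> old_readers;
};

// Process a segment_info_update_ack from reader_id.
// Returns true only when this ack emptied old_readers, i.e. when the memmgr
// may now map the other version of the file in read-write mode.
bool handleUpdateAck(SegmentInfo& info, const std::string& reader_id) {
    // A reader is never in both sets at the same time.
    assert(info.readers.count(reader_id) == 0 ||
           info.old_readers.count(reader_id) == 0);
    if (info.old_readers.erase(reader_id) == 0) {
        return false;  // not waiting for this reader; ignore the ack
    }
    info.readers.insert(reader_id);
    return info.old_readers.empty();
}
```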

If, at this point, another reader starts up (auth-2 of the diagram below), it subscribes to the MemorySegmentReaders group, too, and the memmgr is notified that a new reader has joined. In this case, the manager knows it's a new reader that cannot be using an existing map (otherwise it would have been included in the initial notify response), so it adds the reader to the (current) readers set.

Like the case with auth-1, the memmgr sends the segment_info_update message to the new reader, auth-2. auth-2 maps the specified file into memory, and updates its internal state. It doesn't have to respond to the update message, because the manager is already assuming auth-2 is using the segment (when it actually starts using it or even whether it's really using it doesn't matter for the manager, as long as it's clear that auth-2 isn't using the other version).
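
The difference between readers found at memmgr startup and readers that join later can be summarized in a short sketch. The function names (registerInitialReaders, registerNewReader) and the reduced SegmentInfo are assumptions for illustration.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Reduced version of the conceptual SegmentInfo (reader IDs as strings).
struct SegmentInfo {
    std::set<std::string> readers;
    std::set<std::string> old_readers;
};

// Readers listed in the initial notify response: their status is unknown
// (they may be using either version), so an ack is required from each.
void registerInitialReaders(SegmentInfo& info,
                            const std::vector<std::string>& reader_ids) {
    assert(info.readers.empty());  // called once, right after memmgr startup
    info.old_readers.insert(reader_ids.begin(), reader_ids.end());
}

// A reader that joined after startup cannot be using the other version,
// so it goes straight to the current set and no ack is awaited.
void registerNewReader(SegmentInfo& info, const std::string& reader_id) {
    if (info.old_readers.count(reader_id) == 0) {
        info.readers.insert(reader_id);
    }
}
```

In both cases the memmgr still sends segment_info_update to the reader; only the set it is placed in, and therefore whether an ack is awaited, differs.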

5.2. Update a Segment

Now, suppose the memmgr receives an external command for updating a zone in the shared memory segment. The memmgr makes the updates to the segment it has opened exclusively for writing, that is, zone-mapped.1 (note: this probably has to be done in the "background" so memmgr can still accept other messages, but such details are out of scope of this document). Reader processes are still using the "current" version of the memory segment.

5.3. Switch Segments

When the memmgr completes updating the zone, it unmaps that version, moves the current readers to the "old_readers" set, and clears the "readers" set. It then sends each process in old_readers a segment_info_update message with the file name of the new version.
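
A minimal sketch of the switch on the memmgr side follows. The function name switchSegments is hypothetical, and flipping current_file_id at this exact point is an assumption; the text only says that the new file name is announced.

```cpp
#include <cassert>
#include <set>
#include <string>

// Reduced version of the conceptual SegmentInfo (reader IDs as strings).
struct SegmentInfo {
    std::string map_file_base;
    unsigned int current_file_id;  // 0 or 1
    std::set<std::string> readers;
    std::set<std::string> old_readers;
};

// Make the freshly updated version the current one and return the file
// name to put in the segment_info_update sent to every old reader.
std::string switchSegments(SegmentInfo& info) {
    // The previous switch must have completed before starting a new one.
    assert(info.old_readers.empty());
    info.current_file_id = 1 - info.current_file_id;  // assumed flip point
    // Move all current readers to old_readers; readers becomes empty.
    info.old_readers.swap(info.readers);
    return info.map_file_base + "." + std::to_string(info.current_file_id);
}
```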

When a reader process receives a segment_info_update message, it unmaps the current (now-old) version of the segment, and maps the newly specified version. It then sends a segment_info_update_ack message to the memmgr.
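
The reader side of this exchange could look roughly like the following. The mapped_file member and the function name handleSegmentInfoUpdate are additions made for this sketch, and the actual unmap/map calls are only indicated by a comment.

```cpp
#include <cassert>
#include <string>

enum MemorySegmentState { UNUSED, INUSE, WAITING };

struct DataSourceStatus {
    std::string name;
    MemorySegmentState sgmt_state;
    std::string mapped_file;  // hypothetical: file currently mapped, if any
};

// Apply one entry of a segment_info_update message.
// Returns true if a segment_info_update_ack should be sent to the memmgr.
bool handleSegmentInfoUpdate(DataSourceStatus& status,
                             const std::string& data_source_name,
                             const std::string& mapped_file) {
    assert(!mapped_file.empty());
    if (data_source_name != status.name) {
        return false;  // entry is for another data source
    }
    // A real reader would unmap the old file (if any) here and map the new
    // one in read-only mode before changing its state.
    status.mapped_file = mapped_file;
    status.sgmt_state = INUSE;
    return true;
}
```

The ack must be sent only after the old version has been unmapped, because the memmgr takes the ack as proof that the reader no longer uses it.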

When the memmgr receives the ack, it moves the corresponding reader ID from old_readers to the (now-current) readers set.

The same exchange and update process takes place for the other reader process.

The memmgr waits until the old_readers set becomes empty (i.e., until it receives segment_info_update_ack from all readers of the previous version). At this point the memmgr knows there's no other reader for the previous version of the segment (mapped from zone-mapped.0) and it can now make updates to it. So it maps that version of the segment in read-write mode.

Finally, the memmgr completes updating the other version of the segment (zone-mapped.0). At this point the system is in the same state as the last drawing of Section 5.1 except which processes map which files. So further updates can be done in the same way.

5.4. When a Reader Dies

If a reader process currently in the memmgr's (current or old) set dies, the memmgr should know it via a notification update from the message bus. Basically, the memmgr can simply remove it from the corresponding set, and if the manager is waiting for an update in the set (like until it becomes empty) it takes an appropriate action just like when it gets an expected response from the process. If any last update from that process is ever delivered to the manager after its removal from the list (it's not really clear if the message bus can guarantee that cannot happen), the manager can simply ignore it.
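
Handling the departure of a reader can be sketched as follows. The function name handleReaderGone and the reduced SegmentInfo are assumptions for illustration.

```cpp
#include <cassert>
#include <set>
#include <string>

// Reduced version of the conceptual SegmentInfo (reader IDs as strings).
struct SegmentInfo {
    std::set<std::string> readers;
    std::set<std::string> old_readers;
};

// Called when the message bus reports that reader_id has left the group.
// Returns true only if the memmgr was waiting for this reader's ack and
// its removal emptied old_readers, i.e. the pending switch is now complete.
bool handleReaderGone(SegmentInfo& info, const std::string& reader_id) {
    // A reader is never in both sets at the same time.
    assert(info.readers.count(reader_id) == 0 ||
           info.old_readers.count(reader_id) == 0);
    info.readers.erase(reader_id);
    const bool was_pending = info.old_readers.erase(reader_id) > 0;
    return was_pending && info.old_readers.empty();
}
```

A late ack from the removed reader would then find the ID in neither set and can be ignored.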

5.5. When the Memmgr Dies

If the memmgr dies, the readers basically don't have to do anything. Until and unless a new manager starts up, they can keep using the current segments. Once a new instance of memmgr starts, the existing readers will receive a segment_info_update message just like the 3rd diagram of Section 5.1. Whether the specified file(s) are the same as or different from the ones they are using, they can simply remap the specified files and send a segment_info_update_ack message to the manager. When the manager receives ack messages from all existing readers, the entire system recovers consistency.

5.6. Rebuild Initial Map File On Demand

The scenario described so far implicitly assumes that the created/existing mapped files always reflect the latest state of the underlying data source. This can probably be ensured operationally, but ideally memmgr should detect stale mapped files and rebuild them automatically when necessary (e.g., by comparing the timestamps of the mapped files and the sqlite3 DB or the original master file(s)).

Details of this are quite open.

6. Full Data Source Reconfiguration

Updating the entire configuration of data sources will be a bit more tricky. This is a rough sketch of initial ideas on how it would work. While it looks workable, more details will probably have to be determined.

6.1. Configuration generation IDs

First, we'll need the concept of a configuration generation ID, which will be managed by the cfgmgr on a per-module basis. Processes referring to the configuration of the same module (like "data_sources") use the generation ID of that module's configuration so they can determine which version of the config is referred to in case of migration.

Generation IDs monotonically increase and the latest one will be saved in the configuration DB file.

6.2. Memory Manager Behavior

The memmgr now manages sets of data source clients and mapped files per generation ID of the "data_sources" configuration. It keeps the set for a particular generation ID as long as some reader still uses a memory segment of that generation.

segment_info_update and its ack message now contain the corresponding generation ID.

Mapped file names would now include the generation ID, too, e.g. zone-mapped-in-sqlite3-42.0 where "in" is the RR class, "sqlite3" is the data source name, and 42 is the generation ID.
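
One way to build such a name is sketched below. The function name and parameter list are hypothetical; only the resulting pattern, as in the zone-mapped-in-sqlite3-42.0 example, is taken from the text.

```cpp
#include <cassert>
#include <string>

// Build "<base>-<rrclass>-<datasrc>-<generation>.<file_id>".
std::string mappedFileName(const std::string& map_file_base,
                           const std::string& rrclass,
                           const std::string& datasrc_name,
                           unsigned long generation_id,
                           unsigned int file_id) {
    assert(file_id <= 1);  // still two versions (0 and 1) per segment
    return map_file_base + "-" + rrclass + "-" + datasrc_name + "-" +
           std::to_string(generation_id) + "." + std::to_string(file_id);
}
```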

When the memmgr receives a full data source configuration update, it will create a whole new set of data source clients and mapped files (so, during the migration we'll maintain at least 4 mapped files per data source), and then send a segment_info_update message to all readers for all mapped segments. It waits until it gets acks for all outstanding updates. On completion, the memmgr can be sure all readers now use the new generation of segments, so it can discard any information, including mapped segments, of older generations.

6.3. Reader Behavior

Likewise, readers maintain sets of data source client lists per generation ID.

When a reader receives a full data source configuration update, it tries to complete all WAITING segments for the corresponding client list, just like the initial setup. When all segments are set, it swaps the new and old sets, waits until the old one is cleaned up, and only *then* responds to the latest segment_info_update message (the ordering is important because the memmgr will assume the reader no longer uses the old generation of segments on receiving the ack).

A reader may receive a segment_info_update for a generation ID older (smaller) than the generation of its first data source configuration. This happens if a reader joins the system later than others, immediately followed by a configuration update. The reader would only get the latest generation ID from the config manager, while memmgr has not yet received that generation of the config. In such a case, the reader can simply respond to the update.

A reader may also receive a segment_info_update for a generation ID newer (larger) than its latest generation. This means a configuration update has been delivered to memmgr and is on its way to the reader but has not yet arrived. In this case, the reader should keep the update message (pending ack) until it gets the corresponding configuration update. As long as the system is working correctly this should eventually happen. At that point the reader can resume handling the saved update message and respond to it.
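
The three cases a reader has to distinguish when an update arrives can be written down as a small decision function. The enum and function names are hypothetical, and generation IDs are assumed to be plain unsigned integers.

```cpp
#include <cassert>

enum UpdateAction {
    ACK_IMMEDIATELY,   // generation older than anything this reader has seen
    APPLY_AND_ACK,     // generation known to this reader: remap, then ack
    HOLD_UNTIL_CONFIG  // generation not yet received from the config manager
};

// Decide what to do with a segment_info_update for update_gen, given the
// generation of the reader's first configuration and its latest one.
UpdateAction classifyUpdate(unsigned long update_gen,
                            unsigned long first_gen,
                            unsigned long latest_gen) {
    assert(first_gen <= latest_gen);  // generation IDs only increase
    if (update_gen < first_gen) {
        return ACK_IMMEDIATELY;
    }
    if (update_gen > latest_gen) {
        return HOLD_UNTIL_CONFIG;
    }
    return APPLY_AND_ACK;
}
```

A held update would be re-examined each time a new configuration generation arrives, at which point it falls into the APPLY_AND_ACK case.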

Last modified on Apr 12, 2013, 7:14:00 AM