wiki:msgqng

Message queue the next generation

The message queue system is less than optimal, as it is cubersome to use and buggy at places. There are serious problems in relation with multiple instances of the same component and with modularity which need to be addressed. There are concerns about the system's behaviour in higher loads. Also, the configuration system is not perfect and will need some changes, the message queue system might make the changes a lot easier.

This is the place where some requirements with rationales of the message queue subsystem are listed and discussed (not all of them, the ones which are obvious or already working are often omitted, it would be useless to list things like „It must be possible to connect to it“). Also, some suggested ways to accomplish them are at the end of the document.

The goal is to discuss what we need and if the gain-cost ratio is better than with other solutions, like incorporating some library.

Requirements

Fault resistance

As the death or a problem with the message queue is fatal to the whole bind10, it is needed the message queue daemon doesn't crash, get confused or dead-locked by any out of order situation. It must cope with faults of clients or other unusual situations even at the cost of dropping individual clients.

Such unusual situations would include:

  • A client disconnects at any time, gracefully or not. This includes situations when we are in the middle of transmission of a message (either way) or if there's a backlog of messages.
  • Client stops reading or writing and never resumes. In case it stops reading, the send buffer size of the message daemon must not grow indefinitely.
  • There's a protocol violation.

Less need of timeouts

Currently, we rely on timeouts too much. This is visible, for example, if the user sends a command to component and the component crashes. The bindctl never gets a result (or error) and times out after some long time. In some cases, these timeouts even cascade (configuration changes), causing confusion.

But we even have false timeouts, if some action (loading a huge zone to in-memory after a configuration update) takes too much time, the system times out.

Therefore it would be desirable the system reports usual failures, like:

  • Message could not be delivered, because no such recipient exists.
  • The recipient got the message, but crashed before answering. Don't expect the answer any longer.

This would still not solve the problem if a recipient gets a message which requires an answer and blackholes it. We'd still need a timeout for this, but such timeout could be higher (because it would always be a serious bug) and log an ERROR. Also, the timeout should be configurable.

Logging and configuration of the message queue daemon

There are many events in the message queue daemon which need to be logged. These include DEBUG messages of various levels, from connections and graceful disconnections, to specific messages being sent, but also more serious things like dropping of connections due to protocol violations or other problems. Currently the message queue is either silent or produces backtraces to error output, which might end up in a completely different place than the rest of the logs. The logs would contain only that the daemon died.

Also, logging and other behaviour (timeout values) need to be configured. This means the message queue must be able to load configuration from the configuration manager.

It should be possible to send commands to the message queue. This is a tentative list of things that might be useful:

  • Shutdown
  • Drop a client
  • Provide some info about connected client(s)

Providing dumps of communication

Such thing could help debugging various communication problem. Currently, if there's something odd happening, we need to turn the full debugging of both communicating endpoints to the highest level and try to guess what is happening from messages printed to logs. These might be in the wrong order (since multiple processes report them) and contain a lot of other cruft to get through.

Therefore it would be nice if there was a possibility to dump the messages somewhere with some reasonable filters (type of messages, recipient, sender).

Notifications

Sometimes one component needs to take action if something in other component happens. There are many examples of this ‒ the configuration changes, the boss needs to know when a component that holds some open sockets exited, we need to update the in-memory data source if the underlying data are changed by XFR-IN, we need to transfer a zone in when the zone times out. Currently we create ad-hoc solutions (the boss detects clients disconnected from unix domain socket, the zone manager calls a direct command of xfr-in, we have a configuration command). But these are suboptimal, since we need to invent such solution every time and it prevents new modules to integrate seamlessly into the system (A component that needs to access data of zone after it was transfered in would have to pretend it's the authoritative server to get the update command).

A notification system would be provided, where a component declares which types of events it generates and then it can send notifications of these events without knowing who, if anyone, is interested in them. It should be possible to subscribe into getting such notifications.

File descriptor transfers

We have many places where we need to transfer file descriptors:

  • XFR-OUT
  • Sockets to listen on
  • DDNS

Each of them needs a separate unix-domain socket in the file system. This socket needs to be correctly found and a protocol implemented for the transferring (currently, the sockets to listen on are requested in two phases with ad-hoc protocol, while the XFR-OUT needs a push principle). This is inconvenient and there's a lot of similar and error prone code in this.

It would be much more convenient if it would be possible to simply send a file descriptor in-bound through the message queue system.

Binary blobs

If we had the file descriptors from above, we would like to switch the XFR-OUT to use this. The DDNS module might want to use it as well. But then we need to bundle the packet with the file descriptor and we may want to have a way to pass binary data effectively (not escaping it through JSON).

However, it is a question if we want to do the fd a packet transfers at all or if we adopt some receptionist model that'll work differently.

Asynchronous operation

Most of our RPC calls are blocking. This might not be a problem always, but it is unfortunate if, for example, the authoritative server calls a remote function in some other process and the other process does not answer in timely fashion. In this case, the authoritative server is waiting a long time and not answering the queries. Currently, the use of asynchronous RPC (if we are interested in the answer) is somewhere in between of very inconvenient and impossible. This is not a fundamental design problem, but the libraries are hard to use there.

Therefore it is needed to have a way to call a remote command, pass it parameters and a callback. The callback would be called when either error or result comes back.

Also, a mass-RPC would be nice. It would allow calling bunch of remote functions in any order and invoke the callback after all of them were handled. This would allow for requesting all the listening sockets in parallel conveniently, for example.

RPC convenience

This is similar to the above in the form it is only a convenience of how the libraries are used. It is hard to construct and send commands, it should be easy. It could look something like:

def callback(ok, result):
        if ok:
                print("There are " + str(result) + "zones")
        else:
                print("Bad luck happened: " + result)

cc.rpc('ZoneMgr', 'get_zone_count', callback, class='IN')

or

try:
        print("There are " + str(cc.rpc_sync('ZoneMgr', 'get_zone_count', class='IN')) + " zones")
except CCError as e:
        print("Bad luck happened: " + e.reason)

Richer error responses

Currently, an error must be of form [<error code>, "String with description"]. This is hard to use in many cases, because:

  • There need to be generally-agreed error codes. There are not, therefore we use error code of „1“ for anything.
  • The string is usually not machine-parseable.
  • It is not possible to send additional (structured) data with the error.

Multi-component support

We introduced multiple instances of a single component support (currently b10-auth is used that way, but more might come). But the msgq system and addressing doesn't consider that.

Now, if we send a message to Auth, all the instances get the message. This is unsuitable for RPC calls. If we had multiple xfr-in instances, we want to call the „Please, transfer zone example.com in“ on exactly one instance, not on all of them.

We need a way to:

  • Address a single instance. This is possible by the l.<number> address which is automatically assigned to each connection. But such address is not advertised anyway. There need to be a way query such address and get notifications about the newly connected or terminated connections. Also, some way of more convenient addressing of a specific instance would be nice, like sending a message to Auth[0] would be translated by msgq to be the Auth connection with the lowest number.
  • An anycast support for RPCs. For example, sending to address Auth[?] would make the msgq to randomly pick one of the connected auths and send it only there.

High-load support

The message system must meet the needs for a „big server“. We may want to send configuration snippets with millions of elements (many zones, long ACLs), and many messages (many notifications about updated zones by DDNS, for example).

Therefore, it must work with:

  • High number of messages transferred.
  • Large (multi-megabyte) messages.

This means the server must be able to effectively keep switching between requests (not go and server one client, starving the others) and must not wait for the whole large message to be transferred if others would want the service too.

This might need the client library to be able to send asynchronously as well as receive (at least in the C++ part) so the process is not blocked while it waits to send the message.

Use of the bus from python-C++ wrappers

If we have a C++ class that communicates over the msgq and takes an open connection C++ object in its constructor, it is hard to wrap it for use in python. The reason is, python has an independent implementation of the connection and they are not interchangeable, we can't pass the python connection to the C++ class.

If we had only the C++ connection and a wrapper around it for use in python, it would be possible to unpack the connection and use.

Tests

It is a core system of the whole bind10, but it has very little tests. Both the libraries and the daemon have some unit tests, but thing like this needs extensive behaviour tests of various stress and failure situations. We know it is buggy and we know it is because there are too little tests.

Proposed actions

I believe it is better explained in listing what should be done. We can rearange that, but we need to understand how it would hold togerther.

Addressing enhancement

We add more address schemes, as suggested above:

  • Auth[0] means the first auth, Auth[1] the second, etc. This is translated inside msgq as alias for some l.<number> alias.
  • Auth[?] means any random instance of auth. Again, this is translated inside msgq.

Reserve an address space for notifications

We need the events not to clash with module names. Therefore, let's say that there never will be a module called Events and all the events we need will be of the form: Event/name-of-the-event.

Then decide on the form of the event. Maybe just sending a message to the address with a dict containing all the parameters is enough (no command element, just the data) or empty dict if no parameters are needed. There's no answer to a notification.

Note that anybody can subscribe to it and anybody can notify of the event, if they know the correct format.

Do we want these events listed in some spec file?

While there's probably no code for this task, we need to document how the events are handled.

Wrap the C++ version of the connection

So we can drop the python one. Also, this needs to convert the python modules and possibly msgq itself?

Support notifications in the library

The libraries should be able to conveniently:

  • Send a notification.
  • Subscribe to a notification and register a callback that is called whenever the event happens.

Make msgq daemon listen to the bus

So we can support the next three tasks.

Add notifications for connection/disconnection

The msgq should notify about newly connected clients, disconnected ones and possibly about subscribes/unsubscribes (as these set the names of components).

Add commands to msgq

To shut it down and list active connections

Add configuration to msgq

This is slightly tricky. It needs to wait for the cfgmgr to connect first, then it can ask it for its own configuration.

It should have configuration for logging and for timeouts.

Add logging to msgq

So we can see what is happening.

Make the msgq track request/answers

Application should never need to care about timeouts on the communication. Therefore, these things should be done:

  • Add option to the header, signalling if the message expects an answer.
  • Let msgq respond with error if the recipient does not exist if it expects answer.
  • Let msgq keep track of messages in progress. If the client that should respond disconnects, report error. Also, track timeouts on them and drop them later, reporting an error.

Make the libraries more convenient

Provide methods to call RPCs with callbacks, etc. Simply make the usage convenient.

Providing dumps of communication

I have no concrete idea for that yet. Maybe just log at the highest debug for now, with various loggers?

Extend the message format to include attachments

Now, the message is like: <message size><header size>{header}{data}

Let's add a count of attachments which would look like: <message size (without attachments)><header size><number of attachments>{header}{data}{attachment 1}{attachment 2}…

The attachment would have a type, either a file descriptor or a binary blob of given size.

The libraries would get another class, encapsulating the whole message. The message would hold the header, the data and the attachments.

Others

The rest probably just needs some tests written and all problems found fixed. This includes system-level tests as well. It would go for the following features at least:

  • Fault resistance (Including timeouts and stuff)
  • Richer error responses
  • High-load support
Last modified 6 years ago Last modified on Apr 9, 2012, 11:28:38 AM