
Statistics design

See the requirements page for the requirements and use cases that this design is expected to cover.

This is work in progress

Constraints

While this design covers statistics implementation in general, it focuses on what can be implemented in the limited development time (roughly 2 months of engineering time) planned for the 0.9.2 release. The design should be extensible, so that extra features can be added at a later date without rewriting any significant parts of the code.

The goal for 0.9.2 is to provide a statistics implementation and an API to extract the collected data. No "client" side is planned in 0.9.2. This is similar to how our first hook implementations were handled: ISC provided an API and documentation that explained how to use it. We will probably provide a very simple client, but because statistics are usually integrated into existing network monitoring solutions, the client will never be feature rich.

Concepts

Data types

The basic concept of statistical analysis is an observation, implemented as the Observation class. There are four basic types of observations: integers (represented as uint64_t), floats (represented as double), time intervals (represented by boost::posix_time::time_duration) and strings (represented by std::string). By default, each Observation collects only a single value. There are two ways new data can be added to an Observation. The first, addValue(), is additive, i.e. it adds the new value to the existing one. This is useful when the statistic is used as a counter, e.g. the number of packets received. The second, setValue(), sets an absolute value. This is useful for properties that are observed as absolute values, e.g. the size of an incoming buffer. When an observation is recorded, its timestamp is stored and is available for retrieval. For basic observations, only the timestamp of the last change is preserved.
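
The difference between the two update methods can be illustrated with a minimal sketch. The IntObservation class below is hypothetical and covers only the integer type; it uses std::chrono instead of boost::posix_time to stay self-contained, so it should not be read as the planned Observation interface.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical, simplified stand-in for the Observation class described above.
// Only the integer type is shown; std::chrono replaces boost::posix_time here.
class IntObservation {
public:
    using Clock = std::chrono::system_clock;

    // Additive update: the new value is added to the stored one (counter use).
    void addValue(std::uint64_t delta) {
        value_ += delta;
        last_update_ = Clock::now();
    }

    // Absolute update: the stored value is replaced (e.g. a buffer size).
    void setValue(std::uint64_t value) {
        value_ = value;
        last_update_ = Clock::now();
    }

    std::uint64_t value() const { return value_; }

    // Timestamp of the most recent change; earlier timestamps are not kept.
    Clock::time_point lastUpdate() const { return last_update_; }

private:
    std::uint64_t value_ = 0;
    Clock::time_point last_update_ = Clock::now();
};
```

Calling addValue(1) and then addValue(2) on a fresh object leaves the value at 3, while a subsequent setValue(10) replaces it with 10.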

Each observation is by default able to store a single value. However, it is sometimes useful to keep multiple values of the same property, typically to observe how a given property changed over time. To avoid unbounded memory growth, such a collection should be limited. There are two ways to define the limits: time based (e.g. keep samples from the last 5 minutes) and size based (e.g. keep at most 100 samples).

// Keep the packet-received statistics for the last 5 minutes
StatsMgr::instance().setMaxSampleAge("packet-received", time_duration(0,5,0,0));

// Keep at most 100 samples of the statistic
StatsMgr::instance().setMaxSampleCount("packet-size", 100);

Storing more than one observation per statistic may not be available in 0.9.2.
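
One possible way to enforce both kinds of limits is to keep the samples in a queue and drop the oldest entries whenever a new sample is added. The SampleStore class below is a hypothetical sketch under that assumption; its names and the use of std::chrono are illustrative and not part of the planned API.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>

// Hypothetical sample store; names are illustrative, not the planned Kea API.
class SampleStore {
public:
    using Clock = std::chrono::steady_clock;
    using Sample = std::pair<std::uint64_t, Clock::time_point>;

    // Size-based limit: keep at most max_count samples.
    void setMaxSampleCount(std::size_t max_count) { max_count_ = max_count; }

    // Time-based limit: keep samples no older than max_age.
    void setMaxSampleAge(Clock::duration max_age) { max_age_ = max_age; }

    // Records a sample taken at `now`, then drops samples exceeding the limits.
    void addSample(std::uint64_t value, Clock::time_point now = Clock::now()) {
        samples_.emplace_back(value, now);
        while (samples_.size() > max_count_) {
            samples_.pop_front();
        }
        while (!samples_.empty() && now - samples_.front().second > max_age_) {
            samples_.pop_front();
        }
    }

    std::size_t size() const { return samples_.size(); }

    // Value of the oldest retained sample; the store must not be empty.
    std::uint64_t oldest() const { return samples_.front().first; }

private:
    std::deque<Sample> samples_;
    std::size_t max_count_ = 1;                         // default: single value
    Clock::duration max_age_ = Clock::duration::max();  // default: no age limit
};
```

Both limits are applied on every insertion, so the stricter one wins. With the defaults the store behaves like a basic observation that holds a single value.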

Data collection

One of the requirements is to keep the usage as simple as possible. Therefore the lazy initialization design pattern was chosen. When the server starts, no statistics are gathered at all. Each statistic is initialized when it is recorded for the first time. The most important benefit of this approach is that there is no need to define any schema of available statistics, which makes statistics simpler to use because nothing has to be initialized prior to the first use. This is also very important for the long-term evolution of the statistics. The current implementation planned for 0.9.2 assumes that the statistics code will run within each daemon process, but it is possible that over time the design will evolve and the statistics will be stored in a single process shared among all daemons. In such a case, it would be cumbersome to keep the definitions synchronized and up to date.

Data collection is expected to take place in numerous places throughout the server code and possibly in hook libraries. Therefore it is essential for it to be as simple to use as possible. For example, to record that the number of packets received has increased by 1, the following call can be used:

StatsMgr::instance().addValue("packets-received", 1);
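
Lazy initialization as described here could be sketched as follows. MiniStatsMgr is a hypothetical, much-reduced manager that stores plain integers; the get() and count() helpers exist only for illustration. The point is that no statistic has to be declared before addValue() is first called.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical, much-reduced manager illustrating lazy initialization only.
class MiniStatsMgr {
public:
    // Singleton accessor, mirroring the StatsMgr::instance() usage above.
    static MiniStatsMgr& instance() {
        static MiniStatsMgr mgr;
        return mgr;
    }

    // std::map::operator[] value-initializes a missing entry to 0, so the
    // statistic comes into existence the first time it is recorded.
    void addValue(const std::string& name, std::uint64_t delta) {
        stats_[name] += delta;
    }

    // Returns true and fills `value` if the statistic has been recorded.
    bool get(const std::string& name, std::uint64_t& value) const {
        auto it = stats_.find(name);
        if (it == stats_.end()) {
            return false;
        }
        value = it->second;
        return true;
    }

    // Number of statistics created so far.
    std::size_t count() const { return stats_.size(); }

private:
    MiniStatsMgr() = default;
    std::map<std::string, std::uint64_t> stats_;
};
```

Right after startup count() is 0. The first addValue("packets-received", 1) call creates the entry and every later call only updates it.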

Contexts

Each statistic has a name. The name is case-sensitive. It is recommended to limit the name to lowercase letters (a-z), digits (0-9), square brackets ([ and ]), and dashes (-). The dot (.) has a special meaning: it is a context separator. For example, subnet[0].packets-received will be interpreted as "packets-received" in the "subnet[0]" context. For the 0.9.2 release, only one level of contexts is expected to be implemented, but this approach is expected to be extended in upcoming releases.

From the user's perspective, contexts are almost transparent and serve as a performance optimization. The only case where they are visible is statistics retrieval: contexts make it possible to get all statistics for a given context, e.g. all statistics related to subnet[0].

Support for contexts may not be available in 0.9.2.
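
With a single level of contexts, interpreting a name comes down to splitting it at the dot. The helper below is a hypothetical sketch; in particular, representing the global context as an empty string is an assumption made here for illustration.

```cpp
#include <string>
#include <utility>

// Splits "context.statistic" at the first dot and returns {context, name}.
// A name without a dot is treated as belonging to the global context,
// represented here by an empty string (an assumption of this sketch).
std::pair<std::string, std::string> splitStatName(const std::string& full_name) {
    const std::string::size_type dot = full_name.find('.');
    if (dot == std::string::npos) {
        return {std::string(), full_name};
    }
    return {full_name.substr(0, dot), full_name.substr(dot + 1)};
}
```

For subnet[0].packets-received this yields the context "subnet[0]" and the statistic name "packets-received".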

Performance Optimization

Observations are kept in a map indexed by a string that contains the statistic name. This should provide sufficiently fast access time in most cases. However, in some operations that are expected to be conducted many thousands of times per second (one example could be a hypothetical counter for the number of parsed options), this map access time may become non-negligible. For those cases, it will be possible to obtain a pointer to the Observation object and use it to update the statistic, skipping the search phase completely.

ObservationPtr my_stat = StatsMgr::instance().getObservation("options-parsed");
if (my_stat) {
    my_stat->addValue(1);
}
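
The sketch below contrasts the two access paths. The Observation and MiniStatsMgr types are hypothetical reductions written for this example, and std::shared_ptr is assumed as the pointer type; the actual ObservationPtr definition may differ.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>

// Hypothetical minimal types; the real Observation and StatsMgr will differ.
struct Observation {
    std::uint64_t value = 0;
    void addValue(std::uint64_t delta) { value += delta; }
};
using ObservationPtr = std::shared_ptr<Observation>;

class MiniStatsMgr {
public:
    // Slow path: one string-keyed map lookup per call.
    void addValue(const std::string& name, std::uint64_t delta) {
        ObservationPtr& obs = stats_[name];
        if (!obs) {
            obs = std::make_shared<Observation>();
        }
        obs->addValue(delta);
    }

    // Returns the observation, or a null pointer if it was never recorded.
    ObservationPtr getObservation(const std::string& name) const {
        auto it = stats_.find(name);
        return it == stats_.end() ? ObservationPtr() : it->second;
    }

private:
    std::map<std::string, ObservationPtr> stats_;
};

// Fast path: look the statistic up once, then update through the pointer.
void countOptions(MiniStatsMgr& mgr, int options) {
    mgr.addValue("options-parsed", 0);  // make sure the statistic exists
    ObservationPtr stat = mgr.getObservation("options-parsed");
    for (int i = 0; i < options; ++i) {
        stat->addValue(1);  // no map search inside the loop
    }
}
```

Because statistics are created lazily, a lookup for a statistic that was never recorded returns a null pointer, which is why the check before dereferencing matters.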

Data extraction

The requirements document mandates that at least the following access patterns must be supported: extract one statistic, reset one statistic, extract all statistics, reset all statistics. The first two require at least one parameter (the statistic name), therefore a communication channel is needed that allows parameters to be specified. In the general case, a number of channels are possible: a file with a signal (the user or an external script writes input parameters to a file, then sends a signal to trigger processing of that file, with the output being written to another file), a unix socket (the user or an external script writes input parameters to a unix socket, then expects the results to be written to that unix socket), a UDP or TCP socket (open a socket, process incoming requests and send back statistics values as responses) or even more complex ones, like SSL.

A UDP socket would be the easiest to implement, but it has two major disadvantages. First, it offers no security. Even when opened on a loopback interface, any local user could send a query. The second is related to packet sizes. On a typical Ethernet link with a 1500-byte MTU, a single unfragmented UDP over IPv4 packet carries at most 1472 bytes of payload (1500 minus 20 bytes of IPv4 header and 8 bytes of UDP header). That may not be sufficient for retrieving statistics that cover multiple observations. It is possible to use fragmentation, but that would require additional implementation effort.

A TCP socket does not have packet size constraints, but the current Kea code does not support TCP sockets. In particular, there is no code for accepting incoming TCP connections, closing existing ones, or cleaning up inactive (dropped) connections. This would add work to the 0.9.2 milestone and would not solve the problem of access control.

Unix socket was determined to be the simplest initial approach. It is envisaged that in the upcoming releases other communication channels for retrieving statistics will be implemented. Therefore a parameter will be added that would govern what type of communication channel the server should open:

"control-socket": {
  "socket-type": "unix",
  "socket-param": "/var/kea/statistics-socket"
}

For the 0.9.2 timeframe, the only supported values for "socket-type" will be "none" (no control channel, statistics are disabled) and "unix". If "socket-type" is set to "unix", a non-empty "socket-param" is mandatory. It specifies the unix domain socket location.

Both parameter specification and the server responses will be in JSON format. This is the format we're using for configuration and communication between the DHCP and DDNS modules, so it makes sense to keep using it for statistics as well.

The socket will be created in StatsMgr and will be registered using the IfaceMgr::addExternalSocket() method. The external socket mechanism is already implemented.
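
Creating the socket itself needs only standard POSIX calls. The sketch below is not Kea code: the function name is hypothetical, and the choice of a stream socket (rather than a datagram one) is an assumption, since this design does not specify it. The returned descriptor is what would then be registered as an external socket.

```cpp
#include <cstring>
#include <string>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

// Opens a listening unix domain stream socket at `path`.
// Returns the file descriptor, or -1 on failure.
int openUnixControlSocket(const std::string& path) {
    sockaddr_un addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    if (path.empty() || path.size() >= sizeof(addr.sun_path)) {
        return -1;  // empty, or too long to fit into sockaddr_un
    }
    std::memcpy(addr.sun_path, path.c_str(), path.size() + 1);

    int fd = ::socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    ::unlink(path.c_str());  // remove a stale socket file from a previous run
    if (::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0 ||
        ::listen(fd, 1) < 0) {
        ::close(fd);
        return -1;
    }
    return fd;
}
```

Access to a unix socket can be restricted with ordinary file permissions on the socket path, which is one way to address the access control concern raised for UDP and TCP.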

Control commands

Commands planned for 0.9.2:

  • statistic-get(name) - will report all recorded values of statistic name
  • statistic-reset(name) - will set statistic to 0 (uint64_t) or 0.0 (double)
  • statistic-get-all(reset) - will report the current value of all statistics. The optional parameter reset governs whether the statistics should be reset after they're reported. It is faster to retrieve and reset at the same time than to retrieve first and reset later.
  • statistic-reset-all - will reset all statistics.
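
The reason combined retrieval and reset is cheaper can be seen in a small sketch: both happen in a single pass over the map. The StatMap alias and getAll() function are hypothetical and use plain integers instead of Observation objects.

```cpp
#include <cstdint>
#include <map>
#include <string>

using StatMap = std::map<std::string, std::uint64_t>;

// Returns a snapshot of all statistics; when `reset` is true, each value is
// zeroed in the same pass over the map, avoiding a second traversal.
StatMap getAll(StatMap& stats, bool reset) {
    StatMap snapshot;
    for (auto& entry : stats) {
        snapshot.insert(entry);
        if (reset) {
            entry.second = 0;
        }
    }
    return snapshot;
}
```

Resetting keeps the entries in place and only zeroes their values, matching the statistic-reset description above.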

Possible commands that may be implemented later:

  • statistic-set-storage-size(max_samples) - instructs the server to store up to max_samples.
  • statistic-set-storage-time(max_duration) - instructs the server to store observations that are not older than max_duration

The input syntax attempts to salvage what is still there in the code since BIND10 days (see src/lib/config/ccsession.cc: createCommand, parseCommand, createAnswer, parseAnswer). This BIND10 code seems to support several formats:

{ "command": [ "my_command" ] }
{ "command": [ "my_command", 1 ] }
{ "command": [ "my_cmd", [ "a", "b" ] ] }
{ "command": [ "foo", { "a": "map" } ] }

All of them are unnecessarily complex. The last syntax is almost good, but can be simplified. The proposed command format is as follows:

{
    "command": "foo", 
    "arguments": {
        "param_foo": "value1",
        "param_bar": "value2",
        ...
    }
}

Only the command parameter is mandatory. There may be additional parameters that are command specific. For example, statistic-reset-all takes no parameters, so to issue it, the following JSON structure may be used:

{ "command": "statistic-reset-all" }

Many commands require parameters. For example, to get the number of packets received, the following JSON structure could be used:

{
    "command": "statistic-get",
    "arguments": {
        "name": "received-packets"
    }
}
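
A client could assemble such commands with a few lines of string handling. The helpers below are a hypothetical sketch that supports only string-valued arguments and escapes only quotes and backslashes; a real client would more likely use a JSON library.

```cpp
#include <map>
#include <string>

// Escapes double quotes and backslashes and wraps the text in quotes.
// Control characters are not handled in this sketch.
std::string jsonQuote(const std::string& text) {
    std::string out = "\"";
    for (char c : text) {
        if (c == '"' || c == '\\') {
            out += '\\';
        }
        out += c;
    }
    out += '"';
    return out;
}

// Builds { "command": <name>, "arguments": { ... } } with string-valued
// arguments; the "arguments" element is omitted when there are none.
std::string buildCommand(const std::string& name,
                         const std::map<std::string, std::string>& args = {}) {
    std::string out = "{ \"command\": " + jsonQuote(name);
    if (!args.empty()) {
        out += ", \"arguments\": { ";
        bool first = true;
        for (const auto& arg : args) {
            if (!first) {
                out += ", ";
            }
            out += jsonQuote(arg.first) + ": " + jsonQuote(arg.second);
            first = false;
        }
        out += " }";
    }
    out += " }";
    return out;
}
```

Calling buildCommand("statistic-reset-all") produces the parameterless form, while passing a map with a "name" entry produces the statistic-get form.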

Control responses

BIND10 signalled results using the following formats:

{ "result": [ 0 ] }
{ "result": [ 1, "error" ] }

Zero meant success and any non-zero codes represented an error. Again, this approach seems to be too complex. Proposed generic syntax:

{
    "result": X,
    "error": "textual-error-representation",
    "response_param1": "value1",
    "response_param2": 42,
    "response_param3": [ "eth0", "eth1", "eth5" ],
    "response_param4": { "enabled": "yes" }
}

The only mandatory element is result. Its integer value represents the general status of the operation. 0 means success, any other value indicates an error. Error codes may be command specific. If the result is non-zero, the error field is present and contains a text description of the actual problem. Depending on the command, there may be additional parameters. They are command specific and can take essentially any JSON form: a single string value (e.g. response_param1 in the example above), a single integer value (e.g. response_param2 above), a list of values (e.g. response_param3 above) or a map of values (e.g. response_param4 above).

In particular, the statistic-get-X family of queries will return "observations", which is a map. It may contain zero or more entries, one per statistic. For example, the statistic-get("received-packets") query may produce the following result:

{
    "result": 0,
    "observations": {
        "received-packets": [ [ 1234, "2015-04-15 12:34:45.123" ] ]
    }
}

Note that depending on the query and the statistic configuration, more than one observation may be returned. For each observation, there's a timestamp. For example, if there were 3 packets received and the received-packets statistic was configured to retain all of them, the response could look like this:

{
    "result": 0,
    "observations": {
        "received-packets": [ [ 1, "2015-04-15 12:34:12.100" ],
                              [ 2, "2015-04-15 12:34:44.463" ],
                              [ 3, "2015-04-15 12:34:59.532" ] ]
    }
}

Some queries may also return more than one statistic. For example statistic-get-all may return the following:

{
    "result": 0,
    "observations": {
        "received-packets":   [ [ 4, "2015-04-15 12:34:12.100" ] ],
        "sent-packets":       [ [ 4, "2015-04-15 12:34:12.500" ] ],
        "received-discovers": [ [ 2, "2015-04-15 12:34:12.050" ] ],
        "sent-offers":        [ [ 2, "2015-04-15 12:34:12.100" ] ],
        "received-request":   [ [ 2, "2015-04-15 12:34:12.100" ] ],
        "sent-acks":          [ [ 1, "2015-04-15 12:34:12.100" ] ],
        "sent-nacks":         [ [ 1, "2015-04-15 12:34:12.100" ] ]
    }
}

Each timestamp denotes when the specific observation was last updated.

Note that for performance and bandwidth reasons, the actual JSON format will likely not be indented and will use as compact a format as possible (no indentation, no line breaks).
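
A compact rendering of a successful single-statistic response could look like the sketch below. The function and the Sample alias are hypothetical. Timestamps are passed in already formatted, and the statistic name and timestamps are assumed to contain no characters that need JSON escaping.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// One recorded sample: the value and an already formatted timestamp.
using Sample = std::pair<std::uint64_t, std::string>;

// Renders {"result":0,"observations":{"<name>":[[value,"timestamp"],...]}}
// without any whitespace between JSON tokens.
std::string formatObservations(const std::string& name,
                               const std::vector<Sample>& samples) {
    std::string out = "{\"result\":0,\"observations\":{\"" + name + "\":[";
    for (std::size_t i = 0; i < samples.size(); ++i) {
        if (i > 0) {
            out += ',';
        }
        out += '[' + std::to_string(samples[i].first) + ",\"" +
               samples[i].second + "\"]";
    }
    out += "]}}";
    return out;
}
```

The output carries the same information as the indented examples above, just without the whitespace between tokens.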

Future support for multiple processes

The described approach assumes that the statistics code runs within the main thread of the server. That decision was made for simplicity of implementation. Concerns were raised that it may cause periodic slowdowns when the server is requested to generate responses to statistics queries. The API described here is agnostic of how the statistics are gathered. Should we decide to implement some sort of separate process that gathers statistics, it would require updating the StatsMgr::addValue() method, but not the many places throughout the code where it is used. Delivery of the input parameters would have to be modified slightly, but that's outside the scope of this design.

Class diagram

Please excuse simplicity of this diagram. I haven't used UML in many years.

The primary interface is the StatsMgr class. Any piece of code (server code, libraries or hook libraries) is expected to include only stats_mgr.h, use StatsMgr::instance() and call methods on the returned instance.

The initial implementation will have only a global context, and all statistics will be kept in it. Support for additional contexts will be implemented if time permits. Each context has a std::map that maps a statistic name to an Observation object. Each Observation object may contain one or more actual observations for a given measured property.

Last modified on May 8, 2015, 3:32:18 PM
