
Configuration Rollback

Currently (Feb. 2017) some Kea components are able to retrieve their configuration while already running. This page reviews each component's implementation and proposes the changes needed to achieve better configuration rollback. Configuration checking (the ability to verify whether a new configuration looks sane) is also discussed, because it relies on the same idea: parse the configuration, report the result, and then revert to the old configuration.

Current state

DHCPv4 and DHCPv6

Both DHCP components can update their configuration in two ways: by receiving a signal to re-read their configuration file, or through the set-config command, which provides a new configuration. After the refactoring done with the bison parsers (see SimpleParser), the code is well prepared for configuration rollback. One special, rollback-like case is testing a configuration. This mechanism checks whether a new configuration is valid: it goes through most of the configuration steps, but does not apply the actual configuration.

The configuration is done in the configureDhcp4Server and configureDhcp6Server methods. Both of them have provisions for configuration rollback. The rollback is currently used when a configuration is tested or when an exception is raised during reconfiguration.

The important methods here are processConfig (see src/bin/dhcpX/ctrl_dhcpX_srv.cc) and configureDhcpXServer (see src/bin/dhcpX/json_config_parser.cc).

There are several aspects here:

  • database connections: createManagers() is called from processConfig(), after the configureDhcpXServer method has finished.
  • (re)loading hook libraries: hooks_parser.loadLibraries() is called from within configureDhcpXServer, in the if (!rollback) section.
  • data sockets are opened by calling getStagingCfg()->getCfgIface()->openSockets() in processConfig, after the configureDhcpXServer method has finished.
  • the D2 connection is opened by calling D2->startD2() in processConfig, after the configureDhcpXServer method has finished.
  • the DHCPv4-over-DHCPv6 socket is opened by calling Dhcp4to6Ipc::instance().open(), after the configureDhcpXServer method has finished.

If any of the operations above fails, there is no recovery mechanism: an error is reported back and the server is left in an undefined state. Depending on the nature of the failure, it may or may not have a working database connection, sockets and other essential components.
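
One way to avoid the undefined state described above is to pair every post-parse step with an undo action and unwind the completed steps when a later one fails. The following is a minimal sketch of that idea, not existing Kea code: Step and runAll are hypothetical names introduced for illustration, and the real steps (createManagers(), openSockets(), etc.) are represented by plain callbacks.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical description of one post-parse operation (e.g. opening DB
// connections or data sockets) together with the action that undoes it.
struct Step {
    std::string name;
    std::function<void()> run;
    std::function<void()> undo;
};

// Runs the steps in order. If a step throws, the steps that already
// completed are undone in reverse order and the name of the failed step
// is returned. Returns an empty string when every step succeeded.
std::string runAll(const std::vector<Step>& steps) {
    std::vector<const Step*> done;
    for (const Step& step : steps) {
        assert(step.run && step.undo); // both callbacks must be set
        try {
            step.run();
            done.push_back(&step);
        } catch (const std::exception&) {
            // Unwind: last completed step is undone first.
            for (auto it = done.rbegin(); it != done.rend(); ++it) {
                (*it)->undo();
            }
            return step.name;
        }
    }
    return "";
}
```

With such a helper, a failure while opening sockets would close the database connection opened one step earlier instead of leaving it in place. The undo actions themselves can still fail, so this narrows the problem rather than eliminating it.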

DHCPv4 and DHCPv6 components have the capability to test configuration.

D2 (DHCP-DDNS)

D2 can receive a signal and reload its configuration. There is a plan to implement a control channel for D2.

When a new configuration is received (currently the only way to do that is to use a command line parameter or to send a signal to re-read the existing configuration), D2Process::configure() is called (see src/bin/d2/d2_process.cc). Much of the configuration parsing is done in a completely generic way in DCfgMgrBase. This is definitely a plus, and it also has the mechanisms to revert to the original configuration if the new configuration does not parse correctly. On the downside, there is no recovery from a "runtime" error such as a port that is already in use or an IP address that does not exist. If an exception is thrown, the code reports failure, but does not try to revert to the old configuration (e.g. it doesn't reopen the old sockets).

D2 doesn't seem to have any way to test its configuration.

CA (Control Agent)

The CA component is dedicated to the management of other components, and its design does not specify how to update CA's own configuration. However, we can speculate that once its implementation is more mature, the topic of CA reconfiguration will come up.

One thing that is currently implemented in CA is a provision for configuration checking. In a sense, a configuration check is a configuration that is reverted on purpose. Most of the provisions are done, but the capability to test a configuration is not fully operational yet. In particular, there is no command line switch that enables it and there are no unit tests. However, the groundwork for this in libprocess is there.

CA, being a very recent addition, is lacking in its configuration capabilities. It does not handle configuration rollback and has no provisions for it.

Recommendation

The CPL architecture, implemented in libprocess, seems to offer a better overall interface than the DHCP components, so it should be used as the base for implementing configuration rollback across all components (DHCPv4, DHCPv6, D2, CA). In particular, the comparison between components seems to indicate that:

  1. signal handling in DControllerBase (see DControllerBase::initSignalHandling()) seems more convenient. As such, it is recommended to be used in both DHCPv4 and DHCPv6.
  2. context handling in CPL seems to be similar to what is implemented in dhcp::CfgMgr and dhcp::SrvConfig, in the sense of supporting a staging configuration that can be either committed or rejected. However, CPL does a better job of treating it uniformly (component-specific config is derived from base DCfgContextBasePtr) and provides an overall more convenient interface.
  3. dhcp::CfgMgr seems to do a better job of clearly naming the contexts: getCurrentCfg() and getStagingCfg(). This distinction has one other major benefit. Currently we apply the new configuration and discard the old one as soon as parsing finishes. However, Cisco has a very nice feature where you can apply a new configuration (change something in the staging config, in our terms) and run it for a while to see whether the change is good. Then you can either commit (keep the configuration) or revert (discard it and go back to the earlier configuration). If we ever decide to implement something similar in Kea, the clear distinction between staging and running config in dhcp::CfgMgr would be useful. This capability should be ported to CPL before we retire dhcp::CfgMgr.
  4. dhcp::CfgMgr contains quite a few parameters (echo client-id, data-dir, ddns-enabled etc). Those should be migrated to SrvConfig.
  5. The DCfgMgrBase::parseConfig method splits configuration parameters into params_map and objects_map. This is not necessary and introduces extra intermediate storage - that's something the refactoring in 1.2 eliminated in DHCP components. This part of the D2 configuration should be refactored away. CA seems to be a good example of how to parse configuration without any additional intermediate storage.
  6. The naming convention in libprocess and D2 does not follow the Kea coding guidelines. What does the initial D stand for? (At the time I wrote the CPL I hadn't given it a clever acronym, nor was it in a separate library, and in a misguided effort to group the components together I latched onto prefixing their class names with a "D". It could just as easily have been Q or Z, but there you have it. - Thomas)
  7. The configuration handling in libprocess should be updated to follow this logic:
    configure(ElementPtr config, bool check_only) {
    
     // This creates a new "empty" config.
     staging = getStagingCfg();
    
     try {
         // This parses the config and stores parsed data in staging
         parse(staging, config, check_only);
     } catch {
         revert(); // discard anything that was already parsed in staging.
         // report failure
         return;
     }
    
     if (check_only) {
         revert(); // discard anything parsed, it was only for testing
         // report check success
         return;
     }
    
     try {
        cfgmgr->apply(staging); // apply configuration stored in staging
        // this includes operations, like socket opening/closing, hook
        // libraries loading, DB connections etc.
     } catch {
        // something went wrong, we need to revert back.
        // Re-apply the currently running config.
        cfgmgr->apply(getRunningConfig());
        revert(); // this discards staging config
        // report failure
        return;
     }
    
     commit(); // this sets staging config as current config.
    }
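
The logic in item 7 can also be written as compilable C++. The sketch below is only an illustration under stated assumptions: Config, ConfigManager and Result are hypothetical stand-ins and not libprocess classes, the configuration is a plain string instead of an ElementPtr, and parsing and applying are injected as callbacks so the control flow can be exercised in isolation.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a parsed configuration (not a Kea class).
struct Config {
    std::string text;
};

enum class Result { Committed, CheckOk, ParseFailed, ApplyFailed };

// Hypothetical manager that keeps a running and a staging configuration.
class ConfigManager {
public:
    using Parser = std::function<void(Config&, const std::string&)>;
    using Applier = std::function<void(const Config&)>;

    ConfigManager(Parser parse, Applier apply)
        : parse_(std::move(parse)), apply_(std::move(apply)),
          running_(std::make_shared<Config>()) {}

    Result configure(const std::string& input, bool check_only) {
        staging_ = std::make_shared<Config>(); // new "empty" config
        try {
            parse_(*staging_, input);          // fill staging from input
        } catch (const std::exception&) {
            revert();                          // drop partially parsed data
            return Result::ParseFailed;
        }
        if (check_only) {
            revert();                          // it was only a test
            return Result::CheckOk;
        }
        try {
            apply_(*staging_);                 // sockets, hooks, DB, ...
        } catch (const std::exception&) {
            try {
                apply_(*running_);             // best effort: restore old state
            } catch (const std::exception&) {
                // The old configuration may no longer be applicable either.
            }
            revert();
            return Result::ApplyFailed;
        }
        commit();                              // staging becomes running
        return Result::Committed;
    }

    const Config& running() const { return *running_; }

private:
    void revert() { staging_.reset(); }
    void commit() {
        assert(staging_ != nullptr);
        running_ = staging_;
        staging_.reset();
    }

    Parser parse_;
    Applier apply_;
    std::shared_ptr<Config> running_;
    std::shared_ptr<Config> staging_;
};
```

Each failure path returns before commit() is reached, so the running configuration object is replaced only after a successful apply. Re-applying the running configuration is wrapped in its own try block because that step can fail as well.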
    

The approach above offers better recovery than we have now, but it is not bulletproof. For example, we could close the DB connection, try to open a new one as the new configuration dictates, and detect an error. Then, when we try to reopen the old connection, it might no longer work. There's no way to truly prevent such failures. DHCP data sockets could become unopenable (because the underlying interface went away), DB connections could fail because the DB is currently down, hook libraries could no longer be loadable because their binaries have changed or were deleted, etc. These are all external dependencies that used to work, but are not available any more. One could speculate that it's feasible to maintain both old and new DB connections or sockets and close the old ones only when the new ones succeed. However, that's not fully feasible: loading the same libraries twice would be risky, and external databases could limit the number of allowed connections from a given IP to 1, thus preventing extra connections.
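
For resources where holding two instances at once is possible, the "open the new one before closing the old one" idea can be sketched as below. This is an assumption-laden illustration only: Connection and ConnectionHolder are hypothetical names, and an empty target string stands in for any reason the new connection cannot be opened.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical resource such as a DB connection or a socket (not a Kea
// class). Opening fails, by assumption, when the target is empty.
class Connection {
public:
    explicit Connection(const std::string& target) : target_(target) {
        if (target.empty()) {
            throw std::runtime_error("cannot open connection");
        }
    }
    const std::string& target() const { return target_; }

private:
    std::string target_;
};

// Keeps the current connection open until its replacement has been
// opened successfully.
class ConnectionHolder {
public:
    // Returns true if the switch happened. On failure the old connection
    // is left untouched.
    bool switchTo(const std::string& target) {
        try {
            auto fresh = std::make_unique<Connection>(target); // old still open
            current_ = std::move(fresh);                       // old closed here
            return true;
        } catch (const std::exception&) {
            return false;
        }
    }

    const Connection* current() const { return current_.get(); }

private:
    std::unique_ptr<Connection> current_;
};
```

This pattern briefly holds two connections, so it does not help in the cases named above: a database that allows a single connection per IP would reject the new one, and hook libraries cannot safely be loaded twice.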

Last modified on Feb 21, 2017, 1:06:00 PM