Opened 9 months ago

Closed 6 months ago

#5470 closed enhancement (complete)

HA: failover state machine (v4)

Reported by: tomek Owned by: marcin
Priority: medium Milestone: Kea1.4
Component: high-availability Version: git
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DHCP Feature Depending on Ticket:
Estimated Difficulty: 0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description (last modified by tomek)

This is a tricky one. It covers implementing failover state machine. For details, see HADesign, in particular HADesign.

For v6, see #5471.

Subtickets

Change History (12)

comment:1 Changed 9 months ago by tomek

  • Description modified (diff)

comment:2 Changed 6 months ago by marcin

  • Owner set to marcin
  • Status changed from new to accepted

comment:3 Changed 6 months ago by marcin

  • Owner changed from marcin to UnAssigned
  • Status changed from accepted to reviewing

The DHCPv4 state machine has been implemented for both load balancing and hot standby mode. It also facilitates the backup server case.

Proposed ChangeLog entry for the Kea main repo:

13XX.	[func, bug]		marcin
	The network_state argument is provided to the callouts in
	the dhcp4_srv_configured hook point. Also, fixed a couple
	of minor bugs in the HTTP client classes.
	(Trac #5470, git cafe)

and for the premium repo:

XX.	[func]		marcin
	Implemented high availability state machine for the DHCPv4
	server.
	(Trac #5470, git cafe)

comment:4 Changed 6 months ago by tmark

  • Owner changed from UnAssigned to tmark

comment:5 follow-up: Changed 6 months ago by tmark

Main Repo:


.gitmodules

Do not check in .gitmodules etc, it has the premium repo added


src/lib/http/client.cc

Connection::doSend()
Connection::doReceive() - You added try-catches, out of curiosity, how did not catching these exceptions manifest themselves?

Main repo changes are fine, you may merge them.

========================================================

Premium Repo:


Fixed some typos pull first


src/hooks/dhcp/high_availability/communication_state.*

isCommunicationsInterrupted should be areCommunicationsInterrupted

"Communications" is plural

or you could call it isCommunicationInterrupted() making it
singular.


src/hooks/dhcp/high_availability/communication_state.cc

CommunicationState4::failureDetected()

Why does max unacked clients being 0 mean a failure was detected?


src/hooks/dhcp/high_availability/ha_service.cc

I'm assuming you created HA_AWAITING_IO_EVT for clarity rather
than using NOP_EVT?


src/hooks/dhcp/high_availability/ha_service.cc

HAService::amTransitioned()

I'm trying to puzzle this one out. It is rather odd for a state model
to define an event equating to "I changed states", which is what
HA_TRANSITIONED_EVT seems to be. Typically changing states is not an
event in itself but rather the result of an event.

If that's true you would not need a HA_TRANSITIONED_EVT event.

Now I understand wanting to do execute things either upon the first call
into a new state's handler or prior to exiting it for another state, hence,
StateModel::doOnEntry() and StateModel::doOnExit().

These are one-shots specifically for this purpose. Any call to
StateModel::setState() that changes the current state to a new state,
sets the entry and exit flags to true, else they are false.

So in your expression:

    return ((getNextEvent() == HA_TRANSITIONED_EVT) ||
            (getNextEvent() == HA_SERVICE_STARTING_EVT) ||
            ((getLastEvent() == HA_TRANSITIONED_EVT) &&
             (getNextEvent() != HA_AWAITING_IO_EVT)));

The first two expressions would be covered by StateModel::on_entry_flag_
being set.

I'm not sure I follow the logic behind the third expression:

((getLastEvent() == HA_TRANSITIONED_EVT) &&

(getNextEvent() != HA_AWAITING_IO_EVT)));

"The prior event was an initial state change, and we've since received any event other than HA_AWAITING_IO_EVT"?

If we've already transitioned into a given state and set the query_filter and adjusted the NetworkState?,
I'm not sure I see how calling these again would give you different results then you already had. Unless
one of these other events might have toggled on/off the network state?

Long way around, it seems like this in the handlers:

if (amTransitioned()) {

query_filter_.....();
adjustNetworkState();

}

could be replaced with this:

if (doOnEntry()) {

query_filter_.....();
adjustNetworkState();

}


src/hooks/dhcp/high_availability/ha_service.h

Shouldn't you provide some commentary about each of the state handlers do?

Describing "<sometext>StateHandler? as the "Handler for the "<sometext>" state"
isn't real helfpul. ;)


The test code is brilliant and looks pretty comprehensive. It also looks to have been
more work then the rest of actual code. It all passes under Ubuntu 16.04. Legal
log blows up but that's not related.

Hopefully Wlodek is looking at this and planning how to test it.

comment:6 Changed 6 months ago by tmark

  • Owner changed from tmark to marcin

comment:7 in reply to: ↑ 5 ; follow-up: Changed 6 months ago by marcin

  • Owner changed from marcin to tmark

Replying to tmark:

Main Repo:


.gitmodules

Do not check in .gitmodules etc, it has the premium repo added


src/lib/http/client.cc

Connection::doSend()
Connection::doReceive() - You added try-catches, out of curiosity, how did not catching these exceptions manifest themselves?

I've seen issues in unit tests which sporadically caused uncaught exceptions.

Main repo changes are fine, you may merge them.

I merged the changes. But, now when I try to compile the premium against master I get the compilation error:

host_cmds_unittest.cc:352:29: error: no member named 'setTestHostDataSource' in 'isc::dhcp::HostMgr'
        HostMgr::instance().setTestHostDataSource(HostDataSourcePtr());

which is an unrelated problem which I have to chase Francis for.

========================================================

Premium Repo:


Fixed some typos pull first

Thanks.


src/hooks/dhcp/high_availability/communication_state.*

isCommunicationsInterrupted should be areCommunicationsInterrupted

"Communications" is plural

or you could call it isCommunicationInterrupted() making it
singular.

I made it singular.


src/hooks/dhcp/high_availability/communication_state.cc

CommunicationState4::failureDetected()

Why does max unacked clients being 0 mean a failure was detected?

This is a special case when the failure detection (based on "secs" field) is disabled. In such case, the fact that the servers can't communicate over the control channel is sufficient to consider "partner-down" situation. I added a commentary in the header file regarding this.


src/hooks/dhcp/high_availability/ha_service.cc

I'm assuming you created HA_AWAITING_IO_EVT for clarity rather
than using NOP_EVT?

Yeah. I initially anticipated more states but it all got simplified significantly and I think that there is no harm (or confusion) in use of NOP_EVT so I changed the HA code to use it.


src/hooks/dhcp/high_availability/ha_service.cc

HAService::amTransitioned()

I'm trying to puzzle this one out. It is rather odd for a state model
to define an event equating to "I changed states", which is what
HA_TRANSITIONED_EVT seems to be. Typically changing states is not an
event in itself but rather the result of an event.

If that's true you would not need a HA_TRANSITIONED_EVT event.

Now I understand wanting to do execute things either upon the first call
into a new state's handler or prior to exiting it for another state, hence,
StateModel::doOnEntry() and StateModel::doOnExit().

These are one-shots specifically for this purpose. Any call to
StateModel::setState() that changes the current state to a new state,
sets the entry and exit flags to true, else they are false.

So in your expression:

    return ((getNextEvent() == HA_TRANSITIONED_EVT) ||
            (getNextEvent() == HA_SERVICE_STARTING_EVT) ||
            ((getLastEvent() == HA_TRANSITIONED_EVT) &&
             (getNextEvent() != HA_AWAITING_IO_EVT)));

The first two expressions would be covered by StateModel::on_entry_flag_
being set.

I'm not sure I follow the logic behind the third expression:

((getLastEvent() == HA_TRANSITIONED_EVT) &&

(getNextEvent() != HA_AWAITING_IO_EVT)));

"The prior event was an initial state change, and we've since received any event other than HA_AWAITING_IO_EVT"?

If we've already transitioned into a given state and set the query_filter and adjusted the NetworkState?,
I'm not sure I see how calling these again would give you different results then you already had. Unless
one of these other events might have toggled on/off the network state?

Long way around, it seems like this in the handlers:

if (amTransitioned()) {

query_filter_.....();
adjustNetworkState();

}

could be replaced with this:

if (doOnEntry()) {

query_filter_.....();
adjustNetworkState();

}

All good thoughts. I changed it as you're saying and it is now much simpler and cleaner. Functionally, it remains the same.


src/hooks/dhcp/high_availability/ha_service.h

Shouldn't you provide some commentary about each of the state handlers do?

Describing "<sometext>StateHandler? as the "Handler for the "<sometext>" state"
isn't real helfpul. ;)

Ok. Point taken. Added comments to each state handler.


The test code is brilliant and looks pretty comprehensive. It also looks to have been
more work then the rest of actual code. It all passes under Ubuntu 16.04. Legal
log blows up but that's not related.

Thanks. Most of the time in this ticket I spent on unit testing.

Hopefully Wlodek is looking at this and planning how to test it.

I am going to raise this on the call tomorrow.

comment:8 in reply to: ↑ 7 Changed 6 months ago by tmark

Replying to marcin:

Replying to tmark:

Main Repo:


.gitmodules

Do not check in .gitmodules etc, it has the premium repo added


src/lib/http/client.cc

Connection::doSend()
Connection::doReceive() - You added try-catches, out of curiosity, how did not catching these exceptions manifest themselves?

I've seen issues in unit tests which sporadically caused uncaught exceptions.

Main repo changes are fine, you may merge them.

I merged the changes. But, now when I try to compile the premium against master I get the compilation error:

host_cmds_unittest.cc:352:29: error: no member named 'setTestHostDataSource' in 'isc::dhcp::HostMgr'
        HostMgr::instance().setTestHostDataSource(HostDataSourcePtr());

which is an unrelated problem which I have to chase Francis for.


I guess I'm going to have to trust you.

========================================================

Premium Repo:


Fixed some typos pull first

Thanks.


src/hooks/dhcp/high_availability/communication_state.*

isCommunicationsInterrupted should be areCommunicationsInterrupted

"Communications" is plural

or you could call it isCommunicationInterrupted() making it
singular.

I made it singular.

I had to make you change at least one name.


src/hooks/dhcp/high_availability/communication_state.cc

CommunicationState4::failureDetected()

Why does max unacked clients being 0 mean a failure was detected?

This is a special case when the failure detection (based on "secs" field) is disabled. In such case, the fact that the servers can't communicate over the control channel is sufficient to consider "partner-down" situation. I added a commentary in the header file regarding this.

Ok. Better to have it explained that not.


src/hooks/dhcp/high_availability/ha_service.cc

I'm assuming you created HA_AWAITING_IO_EVT for clarity rather
than using NOP_EVT?

Yeah. I initially anticipated more states but it all got simplified significantly and I think that there is no harm (or confusion) in use of NOP_EVT so I changed the HA code to use it.


src/hooks/dhcp/high_availability/ha_service.cc

HAService::amTransitioned()

I'm trying to puzzle this one out. It is rather odd for a state model
to define an event equating to "I changed states", which is what
HA_TRANSITIONED_EVT seems to be. Typically changing states is not an
event in itself but rather the result of an event.

If that's true you would not need a HA_TRANSITIONED_EVT event.

Now I understand wanting to do execute things either upon the first call
into a new state's handler or prior to exiting it for another state, hence,
StateModel::doOnEntry() and StateModel::doOnExit().

These are one-shots specifically for this purpose. Any call to
StateModel::setState() that changes the current state to a new state,
sets the entry and exit flags to true, else they are false.

So in your expression:

    return ((getNextEvent() == HA_TRANSITIONED_EVT) ||
            (getNextEvent() == HA_SERVICE_STARTING_EVT) ||
            ((getLastEvent() == HA_TRANSITIONED_EVT) &&
             (getNextEvent() != HA_AWAITING_IO_EVT)));

The first two expressions would be covered by StateModel::on_entry_flag_
being set.

I'm not sure I follow the logic behind the third expression:

((getLastEvent() == HA_TRANSITIONED_EVT) &&

(getNextEvent() != HA_AWAITING_IO_EVT)));

"The prior event was an initial state change, and we've since received any event other than HA_AWAITING_IO_EVT"?

If we've already transitioned into a given state and set the query_filter and adjusted the NetworkState?,
I'm not sure I see how calling these again would give you different results then you already had. Unless
one of these other events might have toggled on/off the network state?

Long way around, it seems like this in the handlers:

if (amTransitioned()) {

query_filter_.....();
adjustNetworkState();

}

could be replaced with this:

if (doOnEntry()) {

query_filter_.....();
adjustNetworkState();

}

All good thoughts. I changed it as you're saying and it is now much simpler and cleaner. Functionally, it remains the same.

So, now that HAServic::poll() is just runModel(NOP_EVT), why not replace the two places
you call poll() with runModel(<event you want to post>):

index 8fdebe5..c8ea44f 100644
--- a/src/hooks/dhcp/high_availability/ha_service.cc
+++ b/src/hooks/dhcp/high_availability/ha_service.cc
@@ -584,8 +584,7 @@ HAService::asyncSendLeaseUpdate(const QueryPtrType& query,
                 // are required, such as next heartbeat or a lease update. The poll()
                 // may transition to another state, schedule asynchronous tasks etc.
                 // Then it returns control to the DHCP server.
-                postNextEvent(HA_LEASE_UPDATES_COMPLETE_EVT);
-                poll();
+                runModel(HA_LEASE_UPDATES_COMPLETE_EVT);
             }
         });
 
@@ -692,9 +691,8 @@ HAService::asyncSendHeartbeat() {
             // finds that some new events are required, i.e. next heartbeat or
             // lease update.  The poll() may transition to another state, schedule
             // asynchronous tasks etc. Then it returns control to the DHCP server.
-            postNextEvent(HA_HEARTBEAT_COMPLETE_EVT);
             startHeartbeat();
-            poll();
+            runModel(HA_HEARTBEAT_COMPLETE_EVT);
       });
 }

After you do that, if then you then modify HAService::HAService() to call startModel:

@@ -53,16 +53,8 @@ HAService::HAService(const IOServicePtr& io_service, const NetworkStatePtr& netw
         /// communication_state_.reset(new CommunicationState6(io_service_, config));
     }
 
-    // Initialize dictionaries of events and states.
-    initDictionaries();
-
-    // Set the current state to waiting to let the server look around and
-    // see what the state of other servers is and whether it should wait
-    // for the other servers to wake up etc.
-    setState(HA_WAITING_ST);
-
-    // Indicate that we're starting up.
-    postNextEvent(HA_SERVICE_STARTING_EVT);
+    // Initialize the state model and crank it up.
+    startModel(HA_WAITING_ST):
 
     LOG_INFO(ha_logger, HA_SERVICE_STARTED)
         .arg(HAConfig::HAModeToString(config->getHAMode()))

then you won't need the call to poll() in HAImpl::startService() and then you can delete HAService::poll() altogether. Oh, you also wouldn't need HA_SERVICE_STARTING_EVT anymore.

jes sayin...


src/hooks/dhcp/high_availability/ha_service.h

Shouldn't you provide some commentary about each of the state handlers do?

Describing "<sometext>StateHandler? as the "Handler for the "<sometext>" state"
isn't real helfpul. ;)

Ok. Point taken. Added comments to each state handler.

THANK YOU! This is very good.


The test code is brilliant and looks pretty comprehensive. It also looks to have been
more work then the rest of actual code. It all passes under Ubuntu 16.04. Legal
log blows up but that's not related.

Thanks. Most of the time in this ticket I spent on unit testing.

Hopefully Wlodek is looking at this and planning how to test it.

I am going to raise this on the call tomorrow.

Yep.

comment:9 Changed 6 months ago by tmark

  • Owner changed from tmark to marcin

comment:10 follow-up: Changed 6 months ago by marcin

  • Owner changed from marcin to tmark

I further modified the HA state model to rely on the standard methods and events. I also ran some manual system tests to make sure that it still works. Ready for another review.

comment:11 in reply to: ↑ 10 Changed 6 months ago by tmark

  • Owner changed from tmark to marcin

Replying to marcin:

I further modified the HA state model to rely on the standard methods and events. I also ran some manual system tests to make sure that it still works. Ready for another review.

EXCELLENT! This looks really good, well done.

Proceed to merge.

comment:12 Changed 6 months ago by marcin

  • Resolution set to complete
  • Status changed from reviewing to closed

All changes merged to the main repo and premium repo.

Note: See TracTickets for help on using tickets.