Opened 9 months ago

Closed 4 months ago

Last modified 4 months ago

#5478 closed enhancement (complete)

HA: User's guide

Reported by: tomek Owned by: marcin
Priority: medium Milestone: Kea1.4
Component: documentation Version: git
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DHCP Feature Depending on Ticket:
Estimated Difficulty: 0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

High Availability is a major feature in the upcoming Kea 1.4. It's also a very complex feature with many aspects. It has to be properly documented in the User's Guide and must be accompanied by several configuration examples.

Change History (6)

comment:1 Changed 6 months ago by marcin

  • Owner set to marcin
  • Status changed from new to accepted

comment:2 Changed 5 months ago by marcin

  • Owner changed from marcin to UnAssigned
  • Status changed from accepted to reviewing

I described the usage of the HA hook library. I didn't provide any example files at this time, but that will be done later.

Proposed ChangeLog entry:

13XX.	[doc]		marcin
	Documented High Availability hook library in the Kea
	Administrator Reference Manual.
	(Trac #5478, git cafe)

comment:3 Changed 5 months ago by tomek

  • Owner changed from UnAssigned to tomek

comment:4 Changed 5 months ago by tomek

  • Owner changed from tomek to marcin

With the growing number and complexity of hooks, the hooks.xml file will soon become
one unmanageable mess. Please move the description to hooks-ha.xml. See on master
how that was done for hooks-radius.xml and hooks-cache.xml. I would do it, but
then git history would incorrectly credit me as the original author of this text.

Section 14.3 has a list of hooks. I think the HA description should be
updated slightly. We should not say multiple instances, but rather say something
like "a pair of servers" with an optional backup server. The
brief description should also mention both load-balancing (also known as active-active)
and hot-standby (sometimes known as active-passive). The more keywords we put there,
the more likely it is we'll draw people's attention.

I've edited the text a bit. Please pull and review.

On a related note, there's no lease-cmds entry in that table. Added
missing text. Please review this one as well.


There are no example configs. Please add some in doc/examples/kea{4,6}. They
don't have to be super documented. Just throw in whatever you used during tests.
We can pimp them up later.


Every hook description starts with a brief paragraph explaining to an uninformed
user what the library does and why they might want it. Your text
immediately jumps into the details. Please add a short intro.

Section title: please remove libdhcp_ from it. All other hooks use just the core
name (e.g. subnet_cmds). It should be "ha: High Availability".

14.3.7 "crashed" => "becomes unavailable". I've added a short list of reasons
why a server may stop functioning, and a crash is only one of them. Honestly, Kea
is reasonably stable software, so I would not think of a software crash as the
first reason for a server to become unavailable.

14.3.7.2 Server states - very good description of the states.

"the server may ... use additional measures to verify if the partner is still
operating". What are those measures? Are they currently implemented or is this
a theoretical future extension?

"statet" => "state"

"The DHCP service scopes require some explanation." Before you go into the
details, you should describe what a DHCP scope is.

14.3.7.3
"The former provides the implemenation of the HA feature. The latter" =>
the two are swapped: "former" should be "latter" and "latter" should be "former".

The whole paragraph about loading both libs is general to both HA scenarios,
so it should be moved up to the common sections (or perhaps a new common
section should be added just for it).

Not strictly doc related, but do you think heartbeat-delay, max-response-delay
and max-ack-delay really should be expressed in seconds? Wouldn't milliseconds
be better? What if you want to set up an HA pair that is super-fast in detecting
issues? During UKNOF there was a presentation about a data center that looked at
service availability numbers. They mentioned five nines (99.999% of the time, which
means less than 5.26 minutes of downtime per year). With goals like that,
seconds start to matter. If you think this is reasonable, we can change the
parsers for now to accept milliseconds (values would be divided by 1000). The
doc could have a note that in 1.4 the values are rounded to the nearest
thousand, but that this is likely to change in the future.
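
To make the arithmetic concrete, here is a minimal Python sketch (not Kea code). The function names are hypothetical, and the rounding rule (halves round up) is an assumption about what "rounded to the nearest thousand" could mean.

```python
# Illustrative arithmetic only; names and rounding rule are assumptions.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days


def allowed_downtime_minutes(availability: float) -> float:
    """Yearly downtime budget, in minutes, for a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def ms_to_whole_seconds(value_ms: int) -> int:
    """Convert milliseconds to the nearest whole second (halves round up)."""
    if value_ms < 0:
        raise ValueError("timer values must be non-negative")
    return (value_ms + 500) // 1000


# Five nines leaves a budget of roughly 5.26 minutes per year.
print(f"{allowed_downtime_minutes(0.99999):.2f} minutes per year")

# A timer configured in milliseconds but applied with one-second granularity.
for configured_ms in (400, 1499, 1500, 10000):
    print(configured_ms, "ms ->", ms_to_whole_seconds(configured_ms), "s")
```

Note that under this rounding rule any value below 500 ms collapses to zero seconds, which is one more argument for handling milliseconds natively.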

Is a heartbeat sent even if some other transmission is being received?

max-ack-delay: the units are not clear. Is it in seconds, or in the units used
in the packet? secs is expressed in seconds, but elapsed is in 1/100th of a second.
This should be clarified.
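
The unit mismatch can be sketched as follows. This is a hypothetical Python illustration, not Kea code: the helper names are made up, and it assumes the threshold is normalized to milliseconds and that the comparison is strict. Either assumption may differ from the actual implementation.

```python
# Hypothetical helpers; the millisecond threshold and the strict
# comparison are assumptions made only for illustration.


def secs_to_ms(secs: int) -> int:
    """DHCPv4 'secs' field: whole seconds -> milliseconds."""
    return secs * 1000


def elapsed_to_ms(elapsed_hundredths: int) -> int:
    """DHCPv6 elapsed time: hundredths of a second -> milliseconds."""
    return elapsed_hundredths * 10


def exceeds_max_ack_delay(client_delay_ms: int, max_ack_delay_ms: int) -> bool:
    """True if the client has been waiting longer than the threshold."""
    return client_delay_ms > max_ack_delay_ms


# The same raw number means very different delays in the two protocols.
print(secs_to_ms(7))       # 7 in 'secs' is 7000 ms
print(elapsed_to_ms(7))    # 7 in 'elapsed' is 70 ms
print(exceeds_max_ack_delay(secs_to_ms(7), 5000))
print(exceeds_max_ack_delay(elapsed_to_ms(7), 5000))
```

Once both values are in a common unit, a single threshold can be compared against either protocol's field, which is why the doc needs to state the unit explicitly.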

"should automatically start serving its clients." => "should automatically start
serving the partner's clients."

There should be a section explaining how one server can take over leases
already assigned by the other one (going through the rebinding phase). The section
doesn't have to be extensive. A single paragraph will do for now. Make sure
it mentions that each server has its own DUID and that those DUIDs are
expected to be different between servers. (On a related note, wouldn't it make
the transition smoother if both servers had the same DUID? In v6 they could
take over renewing clients without needing to wait for rebind).

14.3.7.7.1 ha-sync command

The text should clearly say whether the sync means: send all updates from
this server to "server-name", retrieve all updates from "server-name" to this
server or both.

What happens if max-period is reached and the DB is still syncing? This
should be explained in the command description.

Also, it should mention when the command response is generated: immediately
after the sync starts or when it is complete?
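
For reference, the request under discussion could be built roughly like this. This is a Python sketch under assumptions: "server-name" and "max-period" are the arguments named above, the surrounding envelope ("command", "service", "arguments") follows the general Kea control-channel JSON layout as I understand it, and "server2" and 60 are placeholder values. It does not answer any of the questions above about direction, timeout or response timing.

```python
import json


def build_ha_sync_command(server_name: str, max_period: int,
                          service: str = "dhcp4") -> dict:
    """Build an ha-sync request as a plain dict (placeholder values only)."""
    return {
        "command": "ha-sync",
        "service": [service],          # assumed: target daemon list
        "arguments": {
            "server-name": server_name,  # partner named in the command
            "max-period": max_period,    # limit on sync time; units assumed
        },
    }


command = build_ha_sync_command("server2", 60)
print(json.dumps(command, indent=4))
```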

Do you think it would be useful to add a link to the HADesign page? It definitely would imho.


I've made some small changes. Please pull and review.

Feel free to merge this branch after you address my comments. I don't need to see this again unless you really think another round of review is crucial.

comment:5 Changed 4 months ago by marcin

  • Resolution set to complete
  • Status changed from reviewing to closed

I addressed the comments. From the major changes:

  • HA doc moved to a separate XML file.
  • The units of HA timers changed to milliseconds, and the changes were applied to the premium repo as well.
  • Added a section about DHCP clients transitioning to the other server via the Rebind mechanism.

Rebased and merged with commit 3db34400d0331e3d4fc208529eeb18f6abfb6562

comment:6 Changed 4 months ago by marcin

Also merged changes to premium with commit d779730a8d2663244a44369d47c69a663c0d16d1
