Opened 4 months ago

Closed 2 months ago

#5556 closed defect (fixed)

Implement database reconnect logic for MySQL

Reported by: L Owned by: tmark
Priority: medium Milestone: Kea1.4
Component: database-mysql Version: git
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DHCP Feature Depending on Ticket: 5477
Estimated Difficulty: 0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

Under #5477 support was added to kea-dhcp4/6 to be configured to automatically attempt to recover Postgresql backends if connectivity is lost.

This ticket calls for that behavior, once merged, to be implemented for MySQL backends.

Subtickets

Change History (20)

comment:1 Changed 3 months ago by tmark

  • Owner set to tmark
  • Status changed from new to assigned

comment:2 Changed 3 months ago by tmark

  • Owner changed from tmark to UnAssigned
  • Status changed from assigned to reviewing

MySQL lease and host backends now support configurable auto-reconnect.
I also update the Admin guide sections for lease/host backends. I took the liberty of reformatting this a bit to what I believe is a more readable arrangement.

Ready for review.

comment:3 Changed 3 months ago by fdupont

I looked at it:

  • not in a good shape to review the doc changes (need to sleep first).
  • the C++ code seems fine.
  • succesfully -> successfully
  • about the impossible unit test IMHO it is not so impossible: I suggest to scan open file descriptors (int iterator) to find the database connection (should be not so hard to recognize) and simply to junk (close?) it (i.e. take benefit that file descriptors are short integers and the tester code runs in the same space than the under test code).

comment:4 Changed 3 months ago by tmark

  • Owner changed from UnAssigned to tmark

comment:5 Changed 3 months ago by tmark

I'm taking this back to fix compilation errors masked by --with-dhcp-mysql not failing as invalid.

comment:6 Changed 3 months ago by tmark

  • Status changed from reviewing to accepted

comment:7 Changed 3 months ago by tmark

  • Owner changed from tmark to UnAssigned
  • Status changed from accepted to reviewing

Issues corrected, I updated log messages and added unit tests which simulate DB connectivity loss by closing the SQL client's socket fd. I altered some minor, internal Postgresql behavior exposed by
the new tests.

Ticket is once again ready for review.

comment:8 follow-up: Changed 3 months ago by fdupont

BTW NULL does not exist in C++, please use 0 instead.
Two comments: we decided a long time ago to not use nullptr (which is a different beast) and some compilers raise an error on NULL.

comment:9 in reply to: ↑ 8 Changed 3 months ago by tmark

Replying to fdupont:

BTW NULL does not exist in C++, please use 0 instead.
Two comments: we decided a long time ago to not use nullptr (which is a different beast) and some compilers raise an error on NULL.

A habit I'm trying to break. This would have also have been caught on merge, I fixed these in master for 5477 but forgot to update them here. If you plan to actually review the ticket I recommend you re-pull.

Oh and BTW, there are over 2130 occurrences of NULL in our code base.

Last edited 3 months ago by tmark (previous) (diff)

comment:10 Changed 3 months ago by fdupont

Note you have not ASCII (so illegal) characters in guides.

comment:11 Changed 3 months ago by tmark

  • Owner changed from UnAssigned to tmark
  • Status changed from reviewing to accepted

comment:12 Changed 3 months ago by tmark

I am reclaiming this ticket so it can be reworked post 5574 which changed the DatabaseConnection? callback to a static class member. This ripples through most of changes done for this ticket.

comment:13 Changed 2 months ago by tmark

  • Owner changed from tmark to UnAssigned
  • Status changed from accepted to reviewing

I have reimplemented this work under branch trac5556a as attempting to rebase it after 5574 would have been a mess.

READY FOR REVIEW.

comment:14 Changed 2 months ago by fdupont

  • Owner changed from UnAssigned to fdupont

comment:15 follow-up: Changed 2 months ago by fdupont

  • Owner changed from fdupont to tmark

I have two concerns about testDbLostCallback (the two methods):

  • first there are some systems where the runtime opens a socket so the ASSERT_EQ(-1, sql_socket); can fail. Jenkins should say if we have one in supported systems.
  • second why use errno and not simply check the return code of close: even nobody checks it (mainly because there is no useful action to recover) the close system call returns a value!

It is not critical but perhaps a TearDown method should reset the db lost callback to 0?

Please don't abuse in the comments of the fact the socket descriptor is named fd as file descriptor. It is the usage by highly incorrect and at the limit does not make sense (and very particular to Unix)... (src/lib/dhcpsrv/tests/test_utils.h).

Reading the code it seems good. I have a make check running with MySQL and PostgreSQL enabled. I don't expect it will fail but I don't expect it will return something soon too...

BTW I pushed 2 minor commits so please pull.

comment:16 Changed 2 months ago by fdupont

5 failures in my macOS (got 5 for sql_socket in place of expected -1):
PostgreSQLHostMgrDbLostCallbackTest.testDbLostCallback -> host_mgr_unittest.cc:602
MySQLLeaseMgrDbLostCallbackTest.testNoCallbackOnOpenFailure -> generic_lease_mgr_unittest.cc:2833
MySQLLeaseMgrDbLostCallbackTest.testDbLostCallback -> same
PgSqlLeaseMgrDbLostCallbackTest.testNoCallbackOnOpenFailure -> same
PgSqlLeaseMgrDbLostCallbackTest.testDbLostCallback -> same

Code are:

void
HostMgrDbLostCallbackTest::testDbLostCallback() {

    // We should not find a socket open                                         
    int sql_socket = test::findLastSocketFd();
    ASSERT_EQ(-1, sql_socket); <--- line 602.
...

and

void
LeaseMgrDbLostCallbackTest::testDbLostCallback() {
    // We should not find a socket open                                         
    int sql_socket = test::findLastSocketFd();
    ASSERT_EQ(-1, sql_socket); <--- line 2833.
...

lsof shows the opened socket is a Unix one (not IP).

comment:17 in reply to: ↑ 15 Changed 2 months ago by tmark

  • Owner changed from tmark to fdupont

Replying to fdupont:

I have two concerns about testDbLostCallback (the two methods):

  • first there are some systems where the runtime opens a socket so the ASSERT_EQ(-1, sql_socket); can fail. Jenkins should say if we have one in supported systems.

It seems that socket state/closure varies in timing between OSs and whether or there are any still "open" or not from previous tests is not reliably predictable. So rather than attempt to verify a
pre-condition, the tests have been altered to only fetch the most recently opened socket descriptor
immediately after connecting to the back end. This value, if > -1 is assumed to the SQL client
socket and is used in the close. Similarly, tests to verify that the SQL socket is no longer returned by findLastSocketFd() after it has been closed were also removed as these too seem to be
dependent on both the OS and back end implementation. This was what was causing the PostgreSQL tests to fail.

  • second why use errno and not simply check the return code of close: even nobody checks it (mainly because there is no useful action to recover) the close system call returns a value!

Not sure why I did this. I think once upon a time was planning to spit out the strerror() for it.
I have changed it to look at the return value.

It is not critical but perhaps a TearDown method should reset the db lost callback to 0?

Done.

Please don't abuse in the comments of the fact the socket descriptor is named fd as file descriptor. It is the usage by highly incorrect and at the limit does not make sense (and very particular to Unix)... (src/lib/dhcpsrv/tests/test_utils.h).

OK, a little on the hyper-technical side with this one, but I have replaced "fd" with descriptor in the commentary.

Reading the code it seems good. I have a make check running with MySQL and PostgreSQL enabled. I don't expect it will fail but I don't expect it will return something soon too...

BTW I pushed 2 minor commits so please pull.

The tests all run happily now on MacOS and Ubuntu.

Ready for review

comment:18 Changed 2 months ago by fdupont

  • Owner changed from fdupont to tmark

Read the changes: they look fine. Rebuild and run tests with "DbLostCallback?" in their names with success.
I believe it is good now. I am rerunning a full make check but I don't expect a problem.

comment:19 Changed 2 months ago by fdupont

Ready to be merged!

comment:20 Changed 2 months ago by tmark

  • Resolution set to fixed
  • Status changed from reviewing to closed

Merged with git b31da6f9a3545a2cac228eb17c59d72b6b4823f2
Added ChangeLog entry 1383.

Ticket is complete.

Note: See TracTickets for help on using tickets.