Opened 10 months ago

Last modified 4 months ago

#5435 accepted defect

[Kea 1.2.0] Kea LFC not executed anymore after some time

Reported by: nicolas.chaigneau Owned by: marcin
Priority: high Milestone: Kea1.2
Component: dhcp Version: git
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: Very High
Sub-Project: DHCP Feature Depending on Ticket:
Estimated Difficulty: 0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

As previously discussed on the Mailing List:

I'm running Kea 1.2.0 on RHEL 7.1.

I have LFC configured to run every 10 minutes.
I've noticed several times (4 occurences now, on two distinct environments) that after some time (several days) LFC is no longer executed:

  • there are no DHCPSRV_MEMFILE_LFC_START messages in Kea log file anymore.
  • the lease files on disk are no longer modified, and their last modification date coincides with the last "DHCPSRV_MEMFILE_LFC_START" log.

With Kea 0.9.1 (running on RHEL 6.4) I never experienced this issue.

Subtickets

Attachments (2)

0001-4068-No-longer-ignore-SIGCHLD.patch (2.9 KB) - added by marcin 9 months ago.
0001-Timer-manager-deadlock-fix.patch (1.3 KB) - added by marcin 9 months ago.

Download all attachments as: .zip

Change History (32)

comment:1 Changed 10 months ago by nicolas.chaigneau

For reference, the discussion from the Mailing List up to this point.
.

Observed a 4th occurrence of the issue.

Server started on 2017, November 15th at 17:20:47
Last time LFC was executed: on November 19th at 04:50:47

I'll open a ticket.

Regards,
Nicolas.

So this is confirmed: LFC is really not executed anymore at some point, for a reason yet to be determined.

I'll have a look at the changes between Kea 0.9.1 and 1.2.0.

If you think of anything that could help, I'm all ears :)

Regards,
Nicolas.

Hello Marcin,

Thanks for your answer.

I was too hasty to suspect the log rotation, this has nothing to do with it. Rotation occurs at night between 3 and 4am, and the last log I have was in the middle of the day.

Furthermore, I think the issue manifesting after 24H is just a coincidence. Before that I've had Kea run for several days before seeing the issue. So it seems random...

I don't think there's an issue with logging either, because when I stopped the server after the LFC issue appeared, then I got the normal shutdown logs.

I think the LFC is not executed because of the two observations:

  • No log from Kea saying that LFC is launched
  • Lease files are not modified anymore (and their last modification date coincides with the last "DHCPSRV_MEMFILE_LFC_START" log from Kea)

But you're right, I'm not 100% sure LFC is not executed...

So I've launched a little script that does loops over a "ps" to try and catch the LFC process running.
It's not fully reliable because if LFC has nothing to do, it may start and end so fast that the ps won't see it.
I tried that with LFC scheduled every 10 seconds. The script catches about 2 LFC executions out of 3.

So on the live system on which the LFC appears to not run anymore, I'll let the script run for an hour. In this time frame LFC is normally executed 6 times.
If I don't see anything, I'll conclude that LFC is indeed not executed.

Regards,
Nicolas.


De : Marcin Siodelski marcin@…

Nicolas,

I think we should start from checking if the issue doesn't stem from the
log rotation. If I understand correctly, you're rotating the log file
daily. Per your message below, things go weird after around 24 hours
since server startup. Is it possible to verify that rotating every 12h
would cause the issue to begin after 12h rather than 24h?

I can't see anything in the Kea code that could prevent periodic
execution of LFC, other than server reconfiguration. You're saying that
LFC is not executed because the lease files aren't modified. But, is
there any other log output present there? Perhaps the LFC is not
executed because of some trouble with logging being a result of log file
rotation? We should see if logging generally works fine after log file
rotation in the first place. Could you enable more verbose logging to
see if any log is produced when the LFC starting message is gone?

Marcin Siodelski
ISC

On 17.11.2017 10:41, Chaigneau, Nicolas wrote:

Hello,

I've observed the issue again.

With a Kea server that does not receive any DHCP packet:
At some point, Kea is still running but LFC is no longer executed.

In Kea log file we don't have "DHCPSRV_MEMFILE_LFC_START" messages anymore.
And the lease files are not modified on disk (they would be if LFC was executed, so this is not just a logging issue).

Kea was started on 2017, November 15th at 16:20:23.
LFC was first executed 10 minutes later, at 16:30:23.
And the last execution of LFC: November 16th, at 16:40:23.
Nothing after that...

Regards,
Nicolas.

Wait... when I restarted the server, I got logs again... before the shutdown!
So logging actually still worked. I just didn't have anything being logged.

2017-11-15 17:20:45.551 INFO [kea-dhcp4.dhcp4/29754] DHCP4_SHUTDOWN server shutdown
2017-11-15 17:20:47.603 INFO [kea-dhcp4.dhcp4/28141] DHCP4_STARTING Kea DHCPv4 server version 1.2.0 starting

The only logs I normally have in Kea logs is LFC execution every 10 minutes:

2017-11-16 02:10:47.617 INFO [kea-dhcp4.dhcpsrv/28141] DHCPSRV_MEMFILE_LFC_START starting Lease File Cleanup
2017-11-16 02:10:47.617 INFO [kea-dhcp4.dhcpsrv/28141] DHCPSRV_MEMFILE_LFC_EXECUTE executing Lease File Cleanup (...)

I don't have these logs anymore after some time.
So... either LFC is no longer executed after a while, or Kea stops writing these log messages ?

A few more detail:

This worked fine before (with the same logrotate configuration), with:

  • Kea 0.9.1
  • RHEL 6.4

The issue now appears with:

  • Kea 1.2.0
  • RHEL 7.1

comment:2 Changed 10 months ago by nicolas.chaigneau

Maybe also interesting:

On this 4th occurence, LFC is not running anymore.

But I have a host that keeps renewing its lease, so the lease file is being updated.
(but never cleaned up)

address,hwaddr,client_id,valid_lifetime,expire,subnet_id,fqdn_fwd,fqdn_rev,hostname,state
10.226.0.3,00:0c:29:eb:ef:18,,1200,1511065462,1003,1,1,admplat-virtual-machine,0
10.226.0.3,00:0c:29:eb:ef:18,,1200,1511066268,1003,1,1,admplat-virtual-machine,0
10.226.0.3,00:0c:29:eb:ef:18,,1200,1511067105,1003,1,1,admplat-virtual-machine,0
10.226.0.3,00:0c:29:eb:ef:18,,1200,1511067940,1003,1,1,admplat-virtual-machine,0
10.226.0.3,00:0c:29:eb:ef:18,,1200,1511068802,1003,1,1,admplat-virtual-machine,0
(...)

10.226.0.3,00:0c:29:eb:ef:18,,1200,1511168471,1003,1,1,admplat-virtual-machine,0
10.226.0.3,00:0c:29:eb:ef:18,,1200,1511169181,1003,1,1,admplat-virtual-machine,0

comment:3 Changed 10 months ago by nicolas.chaigneau

  • Defect Severity changed from N/A to Very High

Issue seems to affect all timers. I've had server running with debug logs.

I noticed that, not only did the LFC stopped being executed, but also the lease reclaiming mechanism (both at the same time).

I can't see anything wrong in the logs...

Below, all log messages I got up to the last reclaim-expired-leases execution (executed every second):
(and 7 minutes after that a packet is received)

(...)
2017-12-01 09:13:39.255 DEBUG [kea-dhcp4.packets/32013] DHCP4_BUFFER_WAIT waiting for next DHCPv4 packet with timeout 1000 ms
2017-12-01 09:13:40.255 DEBUG [kea-dhcp4.dhcpsrv/32013] DHCPSRV_TIMERMGR_RUN_TIMER_OPERATION running operation for timer: reclaim-expired-leases
2017-12-01 09:13:40.255 DEBUG [kea-dhcp4.alloc-engine/32013] ALLOC_ENGINE_V4_LEASES_RECLAMATION_START starting reclamation of expired leases (limit = 0 leases or 100 milliseconds)
2017-12-01 09:13:40.255 DEBUG [kea-dhcp4.dhcpsrv/32013] DHCPSRV_MEMFILE_GET_EXPIRED4 obtaining maximum 0 of expired IPv4 leases
2017-12-01 09:13:40.255 DEBUG [kea-dhcp4.alloc-engine/32013] ALLOC_ENGINE_V4_LEASES_RECLAMATION_COMPLETE reclaimed 0 leases in 0.011 ms
2017-12-01 09:13:40.255 DEBUG [kea-dhcp4.alloc-engine/32013] ALLOC_ENGINE_V4_NO_MORE_EXPIRED_LEASES all expired leases have been reclaimed
2017-12-01 09:13:40.255 DEBUG [kea-dhcp4.dhcpsrv/32013] DHCPSRV_TIMERMGR_START_TIMER starting timer: reclaim-expired-leases
2017-12-01 09:13:40.255 DEBUG [kea-dhcp4.packets/32013] DHCP4_BUFFER_WAIT_INTERRUPTED interrupted wait for the next packet due to timeout, signal or external socket callback (timeout value is 1000)
2017-12-01 09:13:40.255 DEBUG [kea-dhcp4.packets/32013] DHCP4_BUFFER_WAIT waiting for next DHCPv4 packet with timeout 1000 ms
2017-12-01 09:13:41.256 DEBUG [kea-dhcp4.dhcpsrv/32013] DHCPSRV_TIMERMGR_RUN_TIMER_OPERATION running operation for timer: reclaim-expired-leases
2017-12-01 09:13:41.256 DEBUG [kea-dhcp4.alloc-engine/32013] ALLOC_ENGINE_V4_LEASES_RECLAMATION_START starting reclamation of expired leases (limit = 0 leases or 100 milliseconds)
2017-12-01 09:13:41.256 DEBUG [kea-dhcp4.dhcpsrv/32013] DHCPSRV_MEMFILE_GET_EXPIRED4 obtaining maximum 0 of expired IPv4 leases
2017-12-01 09:13:41.256 DEBUG [kea-dhcp4.alloc-engine/32013] ALLOC_ENGINE_V4_LEASES_RECLAMATION_COMPLETE reclaimed 0 leases in 0.013 ms
2017-12-01 09:13:41.256 DEBUG [kea-dhcp4.alloc-engine/32013] ALLOC_ENGINE_V4_NO_MORE_EXPIRED_LEASES all expired leases have been reclaimed
2017-12-01 09:13:41.256 DEBUG [kea-dhcp4.dhcpsrv/32013] DHCPSRV_TIMERMGR_START_TIMER starting timer: reclaim-expired-leases
2017-12-01 09:13:41.256 DEBUG [kea-dhcp4.packets/32013] DHCP4_BUFFER_WAIT_INTERRUPTED interrupted wait for the next packet due to timeout, signal or external socket callback (timeout value is 1000)
2017-12-01 09:13:41.256 DEBUG [kea-dhcp4.packets/32013] DHCP4_BUFFER_WAIT waiting for next DHCPv4 packet with timeout 1000 ms
2017-12-01 09:20:12.634 DEBUG [kea-dhcp4.packets/32013] DHCP4_BUFFER_RECEIVED received buffer from 10.226.0.1:67 to 10.31.19.28:67 over interface eth3

comment:4 Changed 10 months ago by nicolas.chaigneau

I've raised the severity.

We've deployed this version on Production.

It doesn't handle traffic yet (migration is scheduled for early January 2018).
But migration will be a "no go" if this issue is not resolved at the time.

Last edited 10 months ago by nicolas.chaigneau (previous) (diff)

comment:5 follow-up: Changed 9 months ago by nicolas.chaigneau

Can I get an update on this ?
The deadline is getting closer for us, so this is becoming quite urgent... :'(

Thanks.

comment:6 in reply to: ↑ 5 ; follow-up: Changed 9 months ago by marcin

Replying to nicolas.chaigneau:

Can I get an update on this ?
The deadline is getting closer for us, so this is becoming quite urgent... :'(

So far we haven't been able to reproduce this problem, which makes it hard to diagnose why it occurs. There is one idea what might be causing this behavior. If I create a patch and send it out to you, can you verify if it helps?

Thanks,
Marcin

comment:7 in reply to: ↑ 6 Changed 9 months ago by nicolas.chaigneau

Replying to marcin:

Replying to nicolas.chaigneau:

Can I get an update on this ?
The deadline is getting closer for us, so this is becoming quite urgent... :'(

So far we haven't been able to reproduce this problem, which makes it hard to diagnose why it occurs. There is one idea what might be causing this behavior. If I create a patch and send it out to you, can you verify if it helps?

Thanks,
Marcin

Yes, of course I'll test patches. :)

But as it can take time to reproduce, the sooner the better.
Even if you're not sure of what you send, since I can reproduce the issue on our lab, it's really not a problem.

Thanks!

comment:8 Changed 9 months ago by tomek

  • Component changed from Unclassified to dhcp
  • Milestone changed from Kea-proposed to Kea1.4
  • Priority changed from medium to high

Accepting in 1.4 as high priority.

Changed 9 months ago by marcin

comment:9 Changed 9 months ago by marcin

I attached the patch to the ticket. You need to do:

patch -p0 < 0001-4068-No-longer-ignore-SIGCHLD.patch

in the Kea 1.2 source tree.

comment:10 Changed 9 months ago by nicolas.chaigneau

Thanks, now rebuilding. I'll keep you updated.

comment:11 Changed 9 months ago by nicolas.chaigneau

We've started 4 servers with the patch. Now we just have to wait.

comment:12 Changed 9 months ago by nicolas.chaigneau

The issue is not fixed. Occured again this morning around 8:50 :/

The last 2 logs of Kea starting LFC:

2017-12-20 08:41:05.210 INFO  [kea-dhcp4.dhcpsrv/32144] DHCPSRV_MEMFILE_LFC_START starting Lease File Cleanup
2017-12-20 08:41:05.210 INFO  [kea-dhcp4.dhcpsrv/32144] DHCPSRV_MEMFILE_LFC_EXECUTE executing Lease File Cleanup using: /opt/application/pawdhcp/products/kea/1.2.0-c10/sbin/kea-lfc -4 -x /var/opt/data/flat/pawdhcp/ramdisk1/kea1-open-leases.csv.2 -i /var/opt/data/flat/pawdhcp/ramdisk1/kea1-open-leases.csv.1 -o /var/opt/data/flat/pawdhcp/ramdisk1/kea1-open-leases.csv.output -f /var/opt/data/flat/pawdhcp/ramdisk1/kea1-open-leases.csv.completed -p /var/opt/data/flat/pawdhcp/ramdisk1/kea1-open-leases.csv.pid -c ignored-path
2017-12-20 08:51:05.210 INFO  [kea-dhcp4.dhcpsrv/32144] DHCPSRV_MEMFILE_LFC_START starting Lease File Cleanup
2017-12-20 08:51:05.210 INFO  [kea-dhcp4.dhcpsrv/32144] DHCPSRV_MEMFILE_LFC_EXECUTE executing Lease File Cleanup using: /opt/application/pawdhcp/products/kea/1.2.0-c10/sbin/kea-lfc -4 -x /var/opt/data/flat/pawdhcp/ramdisk1/kea1-open-leases.csv.2 -i /var/opt/data/flat/pawdhcp/ramdisk1/kea1-open-leases.csv.1 -o /var/opt/data/flat/pawdhcp/ramdisk1/kea1-open-leases.csv.output -f /var/opt/data/flat/pawdhcp/ramdisk1/kea1-open-leases.csv.completed -p /var/opt/data/flat/pawdhcp/ramdisk1/kea1-open-leases.csv.pid -c ignored-path

But, the old lease file last modification date is 08:41 (not 08:51).
So it looks like at 08:51, Kea tries to execute LFC, maybe fails (or at least LFC does not perform its job), then the timer is not scheduled again ?

comment:13 follow-up: Changed 9 months ago by nicolas.chaigneau

I'll test with Kea 1.3.

comment:14 in reply to: ↑ 13 ; follow-up: Changed 9 months ago by marcin

Replying to nicolas.chaigneau:

I'll test with Kea 1.3.

Testing Kea 1.3 is a good idea. We have made significant modifications in the timer manager after 1.2.0. In case of 1.2.0, I can provide a patch that adds additional log messages into timer manager (on the info or higher level, so as you don't need to increase the global logging level to debug). These log messages might be helpful to detect where the process is stuck (I suppose it might be stuck in the timer manager).

comment:15 in reply to: ↑ 14 ; follow-up: Changed 9 months ago by nicolas.chaigneau

Replying to marcin:

Replying to nicolas.chaigneau:

I'll test with Kea 1.3.

Testing Kea 1.3 is a good idea. We have made significant modifications in the timer manager after 1.2.0. In case of 1.2.0, I can provide a patch that adds additional log messages into timer manager (on the info or higher level, so as you don't need to increase the global logging level to debug). These log messages might be helpful to detect where the process is stuck (I suppose it might be stuck in the timer manager).

Kea 1.3.0 is rebuilt and running, now waiting for the issue to occur (if it does).

In the meantime could you provide me with this new patch you mentioned for 1.2.0, with the additional log messages ? I'd like to run another couple of servers on 1.2.0 to try to figure out where the issue is.

Thanks!

comment:16 in reply to: ↑ 15 Changed 9 months ago by marcin

Replying to nicolas.chaigneau:

Replying to marcin:

Replying to nicolas.chaigneau:

I'll test with Kea 1.3.

Testing Kea 1.3 is a good idea. We have made significant modifications in the timer manager after 1.2.0. In case of 1.2.0, I can provide a patch that adds additional log messages into timer manager (on the info or higher level, so as you don't need to increase the global logging level to debug). These log messages might be helpful to detect where the process is stuck (I suppose it might be stuck in the timer manager).

Kea 1.3.0 is rebuilt and running, now waiting for the issue to occur (if it does).

In the meantime could you provide me with this new patch you mentioned for 1.2.0, with the additional log messages ? I'd like to run another couple of servers on 1.2.0 to try to figure out where the issue is.

Thanks!

Is it ok if I output these additional logs to "stdout" rather than to a log file?

comment:17 follow-up: Changed 9 months ago by marcin

May I know what is the gcc version you're using to compile Kea 1.2.0?

comment:18 in reply to: ↑ 17 ; follow-up: Changed 9 months ago by nicolas.chaigneau

Replying to marcin:

May I know what is the gcc version you're using to compile Kea 1.2.0?

Compiler is:
g++ (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9)

I'm starting Kea through a cluster agent so stdout is not available, but I guess I can change the agent to redirect stdout to a file. So if it's easier for you, that's ok.

I have some questions about Kea 1.3.0: I've noticed that Kea now automatically links with the boost libraries installed on the build machine, even though I use "--with-boost-include" option, and not "--with-boost-lib-dir / --with-boost-libs". Is this on purpose ?

What happens if I build Kea with "--with-boost-include" using a different version of the boost headers (1.57) than that of the boost libraries (1.53) ?

(it seems to work but I'm not sure it's good form)

comment:19 in reply to: ↑ 18 ; follow-up: Changed 9 months ago by fdupont

Replying to nicolas.chaigneau:

I have some questions about Kea 1.3.0: I've noticed that Kea now automatically links with the boost libraries installed on the build machine, even though I use "--with-boost-include" option, and not "--with-boost-lib-dir / --with-boost-libs". Is this on purpose ?

=> yes: there were too many systems where the include only option for boost raised problems so we gave up and changed the default. Note you can overwrite this using a --without (or --with=no).

What happens if I build Kea with "--with-boost-include" using a different version of the boost headers (1.57) than that of the boost libraries (1.53) ?

=> the usual thing when includes and libraries mismatched: if you are lucky nothing bad happen, if you are not you get random failures from the hell... BTW I already had this issue with OpenSSL (and its clones) and I can say I was not lucky (OpenSSL API is not very stable too: this was the argument used by Apple to no longer support it...).

comment:20 in reply to: ↑ 19 Changed 9 months ago by marcin

Replying to fdupont:

Replying to nicolas.chaigneau:

I have some questions about Kea 1.3.0: I've noticed that Kea now automatically links with the boost libraries installed on the build machine, even though I use "--with-boost-include" option, and not "--with-boost-lib-dir / --with-boost-libs". Is this on purpose ?

=> yes: there were too many systems where the include only option for boost raised problems so we gave up and changed the default. Note you can overwrite this using a --without (or --with=no).

AFAIK, the actual option is --enable-boost-headers-only .

What happens if I build Kea with "--with-boost-include" using a different version of the boost headers (1.57) than that of the boost libraries (1.53) ?

=> the usual thing when includes and libraries mismatched: if you are lucky nothing bad happen, if you are not you get random failures from the hell... BTW I already had this issue with OpenSSL (and its clones) and I can say I was not lucky (OpenSSL API is not very stable too: this was the argument used by Apple to no longer support it...).

comment:21 Changed 9 months ago by marcin

As for the issue itself, I am currently looking into the timer manager code and I have some ideas what might be wrong. So, I will wait with the patch introducing additional logging a bit.

comment:22 Changed 9 months ago by nicolas.chaigneau

Thanks Francis and Marcin.
So I'll try to build with "--enable-boost-headers-only", since so far headers only worked fine for me.

comment:23 Changed 9 months ago by marcin

  • Owner set to marcin
  • Status changed from new to accepted

Changed 9 months ago by marcin

comment:24 follow-up: Changed 9 months ago by marcin

Nicolas,

Can you please apply the following patch http://kea.isc.org/attachment/ticket/5435/0001-Timer-manager-deadlock-fix.patch against your Kea 1.2.0 instance and see if you continue to have the problem with fading of the LFC:

patch -p0 < 0001-Timer-manager-deadlock-fix.patch​

Thanks,
Marcin

comment:25 in reply to: ↑ 24 Changed 9 months ago by nicolas.chaigneau

Replying to marcin:

Nicolas,

Can you please apply the following patch http://kea.isc.org/attachment/ticket/5435/0001-Timer-manager-deadlock-fix.patch against your Kea 1.2.0 instance and see if you continue to have the problem with fading of the LFC:

patch -p0 < 0001-Timer-manager-deadlock-fix.patch​

Thanks for the patch, I'll rebuild 1.2.0 with it.

comment:26 Changed 9 months ago by nicolas.chaigneau

So I have now:

  • 4 servers running with Kea 1.2.0 + the two patches
  • 2 servers running with Kea 1.3.0

... wait and see :)

comment:27 follow-up: Changed 9 months ago by nicolas.chaigneau

It's a little early to conclude anything, but so far everything is ok. (both 1.2.0 with the latest patch, and 1.3.0).
I hope you're having a happy winter solstice season. See you next year! :)

comment:28 in reply to: ↑ 27 ; follow-up: Changed 9 months ago by nicolas.chaigneau

Replying to nicolas.chaigneau:

It's a little early to conclude anything, but so far everything is ok. (both 1.2.0 with the latest patch, and 1.3.0).
I hope you're having a happy winter solstice season. See you next year! :)

All Kea servers are still ok after two weeks. The issue is fixed.
We'll repackage version 1.2.0 with the patch.

Thanks again! :)

comment:29 in reply to: ↑ 28 Changed 9 months ago by marcin

Replying to nicolas.chaigneau:

Replying to nicolas.chaigneau:

It's a little early to conclude anything, but so far everything is ok. (both 1.2.0 with the latest patch, and 1.3.0).
I hope you're having a happy winter solstice season. See you next year! :)

All Kea servers are still ok after two weeks. The issue is fixed.
We'll repackage version 1.2.0 with the patch.

I am really glad to hear it! Thanks!

Thanks again! :)

comment:30 Changed 4 months ago by tomek

  • Milestone changed from Kea1.4 to Kea1.2

As discussed on 2018-05-10 call, moving this ticket to 1.2 milestone.

In principle, this ticket would be a candidate for 1.2.1 release, but since we do not plan to release it, we will keep it open there.

Note: See TracTickets for help on using tickets.