Opened 7 years ago

Closed 5 years ago

Last modified 5 years ago

#988 closed defect (fixed)

Infinite loop on xfrout

Reported by: shane              Owned by: jinmei
Priority: very high             Milestone: Sprint-20120619
Component: xfrout               Version:
Keywords:                       Cc:
CVSS Scoring:                   Parent Tickets:
Sensitive: no                   Defect Severity: High
Sub-Project: DNS                Feature Depending on Ticket:
Estimated Difficulty: 5         Add Hours to Ticket: 0
Total Hours: 2.5                Internal?: no

Description (last modified by shane)

There is an infinite loop in xfrout on my server.

The process list:

root@h:/opt/bind10/var/bind10-devel/log# ps -eLf | grep xfrout
root       625   621   625  0    5 May25 pts/1    00:00:01 /usr/bin/python3 /opt/bind10/libexec/bind10-devel/b10-xfrout
root       625   621   632  0    5 May25 pts/1    00:00:00 /usr/bin/python3 /opt/bind10/libexec/bind10-devel/b10-xfrout
root       625   621   633  0    5 May25 pts/1    00:00:52 /usr/bin/python3 /opt/bind10/libexec/bind10-devel/b10-xfrout
root       625   621  3771 96    5 May28 pts/1    3-04:54:56 /usr/bin/python3 /opt/bind10/libexec/bind10-devel/b10-xfrout
root       625   621  3772  0    5 May28 pts/1    00:00:00 /usr/bin/python3 /opt/bind10/libexec/bind10-devel/b10-xfrout
root     12322 12117 12322  0    1 07:42 pts/0    00:00:00 grep xfrout

It appears to be getting a read event on a file descriptor, then reading 0 bytes, and then taking no action:

root@h:/opt/bind10/var/bind10-devel/log# strace -p 3771 2>&1 | head -10
Process 3771 attached - interrupt to quit
recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"\0", 1}], msg_controllen=0, msg_flags=0}, 0) = 0
select(14, [9 13], [], [], NULL)        = 1 (in [13])
recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"\0", 1}], msg_controllen=0, msg_flags=0}, 0) = 0
select(14, [9 13], [], [], NULL)        = 1 (in [13])
recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"\0", 1}], msg_controllen=0, msg_flags=0}, 0) = 0
select(14, [9 13], [], [], NULL)        = 1 (in [13])
recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"\0", 1}], msg_controllen=0, msg_flags=0}, 0) = 0
select(14, [9 13], [], [], NULL)        = 1 (in [13])
recvmsg(13, {msg_name(0)=NULL, msg_iov(1)=[{"\0", 1}], msg_controllen=0, msg_flags=0}, 0) = 0

This file descriptor is for the Unix-domain socket used to transport file descriptors:

root@h:/opt/bind10/var/bind10-devel/log# lsof -p 625 | grep 13u
b10-xfrou 625 root   13u  unix 0xffff88001e642900       0t0   35734 /opt/bind10/var/auth_xfrout_conn
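
For illustration only (this is not the actual b10-xfrout source; the function and names below are hypothetical), a minimal Python sketch of the kind of loop that would produce the strace pattern above - select() keeps reporting the socket readable after the peer closes it, the 1-byte read returns nothing, and the loop never reacts:

import select
import socket

def wait_for_forwarded_requests(sock: socket.socket) -> None:
    """Wait for xfrout requests forwarded over a Unix-domain stream socket."""
    while True:
        readable, _, _ = select.select([sock], [], [])
        if sock not in readable:
            continue
        data = sock.recv(1)   # returns b'' (0 bytes) once the peer has closed
        if not data:
            # BUG: an empty read means EOF, but nothing is done about it here,
            # so select() immediately reports the socket readable again and
            # this thread spins at 100% CPU.
            continue
        # ... otherwise the forwarded request would be read and processed ...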

Subtickets

Change History (22)

comment:1 Changed 7 years ago by shane

Whoops... looking more carefully I see that auth_xfrout_conn is used in xfrout.py, not notify_out.py!

So it's not a notify problem, but for some reason a loop on our Unix-domain socket.

comment:2 Changed 7 years ago by shane

  • Description modified (diff)

comment:3 Changed 7 years ago by jinmei

Note: This may be the same problem as #1001.

comment:4 Changed 6 years ago by shane

  • Milestone changed from New Tasks to Next-Sprint-Proposed

comment:5 Changed 6 years ago by stephen

  • Milestone changed from Next-Sprint-Proposed to Sprint-20110712

comment:6 Changed 6 years ago by jinmei

As commented, this problem may have been fixed with #1001 (now merged
to master), although if it was really a loop in the main thread
waiting for XFR requests, this would be a different problem.

Shane, can you confirm that on your server?

comment:7 Changed 6 years ago by jinmei

  • Owner set to shane
  • Status changed from new to reviewing

comment:8 follow-up: Changed 6 years ago by shane

  • Component changed from Unclassified to xfrout
  • Status changed from reviewing to accepted

I am pretty sure that this is not the same problem. #1001 describes a busy loop that waits for 0.5 seconds. The loop I have seen is a true busy loop in which a single thread consumes 100% of the CPU, and it appears to hit a problem when it tries to read from the Unix-domain socket used to pass file descriptors.

I'm not sure why. At the very least, a bit of debugging could be added.

comment:9 Changed 6 years ago by shane

  • Owner changed from shane to UnAssigned
  • Status changed from accepted to assigned

comment:10 in reply to: ↑ 8 Changed 6 years ago by jinmei

Replying to shane:

I am pretty sure that this is not the same problem. #1001 describes a busy loop that waits for 0.5 seconds. The loop I have seen is a true busy loop in which a single thread consumes 100% of the CPU, and it appears to hit a problem when it tries to read from the Unix-domain socket used to pass file descriptors.

I'm not sure why. At the very least, a bit of debugging could be added.

Ah, in that case, it may not be appropriate to have it in the current
sprint unless we know how to reproduce it.

comment:11 Changed 6 years ago by shane

  • Defect Severity changed from Medium to High
  • Milestone changed from Sprint-20110712 to Year 3 Task Backlog

Agreed. I'm moving this to the year 3 backlog, although I am raising the severity, since it is pretty horrible when it happens!

comment:12 Changed 6 years ago by jinmei

I think I found a way to reproduce it: first complete a successful
xfr, then kill auth (the bind10 process will restart it, but that
doesn't matter). The fix is probably also easy - I guess we should
simply exit from the while loop in UnixSockServer.handle_request() if
process_request() fails (raising an exception from the latter would do
the trick). A sketch of that minimal fix follows below.

I would actually like to fix the broken architecture more fundamentally
rather than applying a band-aid, or at least to use the more
sophisticated SocketSession class, but the problem is probably bad
enough to justify the smallest possible fix.

And, the fix will also solve #1537.
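
A minimal sketch of the shape of that fix, assuming the loop looks roughly as described above (the exception class and exact signatures are illustrative, not the real UnixSockServer code):

class XfroutSessionError(Exception):
    """Raised when a forwarded request can no longer be read from b10-auth."""

def process_request(sock) -> None:
    data = sock.recv(1)
    if not data:
        # The peer (b10-auth) has closed the connection; raise instead of
        # returning silently so the caller's loop can stop.
        raise XfroutSessionError("connection closed by b10-auth")
    # ... normal handling of the forwarded request would continue here ...

def handle_request(sock) -> None:
    while True:
        try:
            process_request(sock)
        except XfroutSessionError:
            break   # leave the loop instead of spinning on a dead socket

With that, a killed b10-auth results in one raised exception and a clean exit from the loop rather than a busy loop.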

Last edited 6 years ago by jinmei (previous) (diff)

comment:13 Changed 6 years ago by jinmei

  • Milestone set to Next-Sprint-Proposed
  • Priority changed from medium to high

comment:14 Changed 6 years ago by jelte

  • Milestone changed from Next-Sprint-Proposed to Sprint-20120619
  • Priority changed from high to very high

comment:15 Changed 5 years ago by jinmei

  • Owner changed from UnAssigned to jinmei
  • Status changed from assigned to accepted

comment:16 Changed 5 years ago by jinmei

trac988 is ready for review.

I've also fixed another similar case, although it's much less likely
to happen in practice.

Proposed changelog entry:

446.?	[bug]		jinmei
	Fixed a bug in b10-xfrout where a helper thread could fall into
	an infinite loop if b10-auth stops while the thread is waiting for
	forwarded requests from b10-auth.
	(Trac #988, git TBD)

comment:17 Changed 5 years ago by jinmei

  • Owner changed from jinmei to UnAssigned
  • Status changed from accepted to reviewing

comment:18 Changed 5 years ago by jinmei

It should be possible to close #1833 with this as well.

comment:19 follow-up: Changed 5 years ago by vorner

  • Owner changed from UnAssigned to jinmei

Hello

It seems OK to merge, and it seems this bug could indeed cause both this issue and #1833.

comment:20 in reply to: ↑ 19 Changed 5 years ago by jinmei

Replying to vorner:

Hello

It seems OK to merge, and it seems this bug could indeed cause both this issue and #1833.

Thanks for the review. Merge done, closing.

comment:21 Changed 5 years ago by jinmei

  • Resolution set to fixed
  • Status changed from reviewing to closed

comment:22 Changed 5 years ago by jinmei

  • Total Hours changed from 0 to 2.5