Opened 2 years ago

Closed 20 months ago

Last modified 8 months ago

#4243 closed defect (fixed)

NameChangeUDPTest.roundTripTest fails on Ubuntu 15.10 (gcc-5 ?)

Reported by: tomek Owned by: tmark
Priority: medium Milestone: Kea1.1
Component: ddns Version: git
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DHCP Feature Depending on Ticket:
Estimated Difficulty: 32 Add Hours to Ticket: 1
Total Hours: 29 Internal?: no

Description

The latest master, as well as all 1.0.0 betas, are building ok, but failing one unit-test on Ubuntu 15.10 when compiled with the default compiler (g++ (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010).

Failing unit-test output:

$KEA_LOGGER_DESTINATION=stdout tests/libdhcp_ddns_unittests --gtest_filter=NameChangeUDPListenerTest.basicReceivetest
[ RUN      ] NameChangeUDPListenerTest.basicReceivetest
2015-12-14 19:02:58.792 ERROR [kea.libdhcp-ddns/32203] DHCP_DDNS_NCR_UDP_RECV_ERROR UDP socket receive error while listening for DNS Update requests: Resource temporarily unavailable
ncr_udp_unittests.cc:259: Failure
Value of: result_
  Actual: 3
Expected: NameChangeListener::SUCCESS
Which is: 0
ncr_udp_unittests.cc:262: Failure
Value of: checkSendVsReceived(sent_ncr_, received_ncr_)
  Actual: false
Expected: true
ncr_udp_unittests.cc:259: Failure
Value of: result_
  Actual: 3
Expected: NameChangeListener::SUCCESS
Which is: 0
ncr_udp_unittests.cc:262: Failure
Value of: checkSendVsReceived(sent_ncr_, received_ncr_)
  Actual: false
Expected: true
ncr_udp_unittests.cc:262: Failure
Value of: checkSendVsReceived(sent_ncr_, received_ncr_)
  Actual: false
Expected: true
ncr_udp_unittests.cc:271: Failure
Value of: listener_->isIoPending()
  Actual: true
Expected: false
[  FAILED  ] NameChangeUDPListenerTest.basicReceivetest (1 ms)
[----------] 1 test from NameChangeUDPListenerTest (1 ms total)

Oddly enough, this particular unit-test passes when compiled with -O0 or with -Og, but fails with the default -O2. This makes regular debugging problematic, as many of the parameters passed between methods are optimized out.

With the attached verbosity patch (ncr-verbose-sanity.patch attached), the parameters passed to UDPCallback::operator() were printed and NCR passed to NameChangeUDPTest::operator () were added only if the status was success. The output is as follows (with -O2):

$ KEA_LOGGER_DESTINATION=stdout tests/libdhcp_ddns_unittests --gtest_filter=NameChangeUDPTest.roundTripTest 
[ RUN      ] NameChangeUDPTest.roundTripTest
#### error_code=system:11 bytes_transferred=0
2015-12-14 19:31:40.125 ERROR [kea.libdhcp-ddns/18636] DHCP_DDNS_NCR_UDP_RECV_ERROR UDP socket receive error while listening for DNS Update requests: Resource temporarily unavailable
#### skipped ncr=0
#### error_code=system:0 bytes_transferred=200
#### error_code=system:0 bytes_transferred=200
#### added ncr=0x15db390
#### error_code=system:0 bytes_transferred=200
#### error_code=system:0 bytes_transferred=200
#### added ncr=0x15db510
#### error_code=system:0 bytes_transferred=214
#### error_code=system:0 bytes_transferred=214
#### added ncr=0x15db180
#### error_code=system:11 bytes_transferred=0
2015-12-14 19:31:40.125 ERROR [kea.libdhcp-ddns/18636] DHCP_DDNS_NCR_UDP_RECV_ERROR UDP socket receive error while listening for DNS Update requests: Resource temporarily unavailable
#### skipped ncr=0
[       OK ] NameChangeUDPTest.roundTripTest (1 ms)

The same code compiled with -O0:

[ RUN      ] NameChangeUDPTest.roundTripTest
#### error_code=system:0 bytes_transferred=200
#### error_code=system:0 bytes_transferred=200
#### added ncr=0x25b0170
#### error_code=system:0 bytes_transferred=200
#### error_code=system:0 bytes_transferred=200
#### added ncr=0x25b02f0
#### error_code=system:0 bytes_transferred=214
#### error_code=system:0 bytes_transferred=214
#### added ncr=0x25af2d0
#### error_code=system:125 bytes_transferred=0
2015-12-14 19:32:42.973 DEBUG [kea.libdhcp-ddns/19364] DHCP_DDNS_NCR_UDP_RECV_CANCELED UDP socket receive was canceled while listening for DNS Update requests
#### skipped ncr=0

Note lack of two logs about resources being temporarily unavailable.

I recall experiencing the issue since upgrade to Ubuntu 15.10 from 14.10. One of the changes between those versions is migration to gcc 5.

Subtickets

Change History (20)

comment:1 Changed 2 years ago by fdupont

Can't reproduce with my 15.10 VM (32 bit desktop)... I strongly suggest to try --with-boost-libs configure flag and/or boost 1.59.

comment:2 Changed 2 years ago by fdupont

IMHO the issue is with boost 1.58, please note:

  • the bug doesn't appear in 32 bits
  • the bug is fixed when boost libraries are configured (--with-boost-lib=-lboost_system
  • the bug is not fixed with boost 1.59.0 so it is not clear it is a gcc or a boost bug.

comment:3 Changed 2 years ago by tomek

  • Milestone changed from Kea-proposed to Kea1.0
  • Priority changed from medium to low

Moving to 1.0 low as agreed on Kea call 2015-12-16.

comment:4 Changed 23 months ago by hschempf

Per team meeting Jan 7, move from 1.0 to 1.1

comment:5 Changed 23 months ago by hschempf

  • Milestone changed from Kea1.0 to Kea1.1

comment:6 Changed 22 months ago by tomek

  • Priority changed from low to medium

comment:7 Changed 21 months ago by tmark

After painstaking analysis the root cause of the failure is that the compiler is incorrectly optimizing out some critical logic the following boost::asio function:

(from the source file: /usr/include/boost/asio/detail/impl/socket_ops.ipp)

bool non_blocking_recvfrom(socket_type s,
    buf* bufs, size_t count, int flags,
    socket_addr_type* addr, std::size_t* addrlen,
    boost::system::error_code& ec, size_t& bytes_transferred)
{
  for (;;)
  {
    // Read some data.
    signed_size_type bytes = socket_ops::recvfrom(
        s, bufs, count, flags, addr, addrlen, ec);

    // Retry operation if interrupted by signal.
    if (ec == boost::asio::error::interrupted)
      continue;

    // Check if we need to run the operation again.
    if (ec == boost::asio::error::would_block
        || ec == boost::asio::error::try_again)
      return false;

    // Operation is complete.
    if (bytes >= 0)
    {
      ec = boost::system::error_code();
      bytes_transferred = bytes;
    }
    else
      bytes_transferred = 0;

    return true;
  }
}

The checks against the value of ec after the call to recvfrom() are optimized out causing the function to always return true, even if the read was interrupted (should continue) or the call would block (should return false). In the either case, the true return causes the ready handler registered to the socket to be invoked when it should not be. The DHCP_DDNS library code interprets this as a failed read.

This manifests itself in D2 as an infinite loop of failed reads. This issue was reported by in kea-users.

The short term solution is likely to be using pragmas around our inclusion boost/asio.hpp to turn off optimization:

:
#pragma GCC push_options
#pragma GCC optimize ("O0")
#include <boost/asio.hpp>
#pragma GCC pop_options
:

Last edited 21 months ago by tmark (previous) (diff)

comment:8 Changed 21 months ago by fdupont

IMHO the bug should be reported to boost and gcc people too.

comment:9 Changed 21 months ago by tmark

I already opened a ticket with GCC on 2/12:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69789

According to their most recent update, GCC 6.0 compiles the code correctly. We need to determine if this issue is isolated to 5.3. I believe that it may be as we have at least one Fedora VM on the build farm using 5.1 and it does not exhibit this issue. Bascially we need to test with GCC 5.2.

After which we might consider a configuration switch to turn off optmization entirely, for Kea for GCC 5.3. It is not unlikely that there are other places where this issue has mangled code and simply await discovery.

I just opened the following ticket with BOOST:

https://svn.boost.org/trac/boost/ticket/11989

comment:10 Changed 20 months ago by tmark

  • Owner set to tmark
  • Status changed from new to assigned

comment:11 Changed 20 months ago by tomek

Tom asked me to run trac4243 (commit-id ac5d93b8f3521d8e4cc95797b29766a29defe9b4) and I confirm that it indeed working. The code was built with all default options and all unit-tests now pass.

Without this change, unit-tests failed. My system hasn't changed, it's still is Ubuntu 15.10 x64 with g++ version reported as: g++ (Ubuntu 5.2.1-22ubuntu2) 5.2.1 20151010.

Here's how I built it:

time (autoreconf -i && ./configure --enable-debug --prefix=/opt3 --with-gtest-source=/home/thomson/devel/googletest-release-1.7.0 --enable-generate-docs --enable-logger-checks && make -j9 && make -j9 check)

comment:12 Changed 20 months ago by tmark

  • Estimated Difficulty changed from 0 to 32
  • Owner changed from tmark to Unassigned
  • Status changed from assigned to reviewing
  • Total Hours changed from 0 to 24

Ticket is ready for review:

After a great deal of analysis the crux of the issue is as follows:

The function boost:system:system_category() returns a reference to a local static variable, system_category_const. When built as header only(-DBOOST_ERROR_CODE_HEADER_ONLY) the function is declared as inlined and the optimizer fails to ensure a single value is returned by this function within a translation unit. It is conceptually the same as the issue discussed here:

http://processors.wiki.ti.com/index.php/C++_Inlining_Issues

under the section "Static Variables in Inline Functions".

This manifests itself in the following expression from boost:asio:detail:impl:socket_ops.ipp:non_blocking_recv():

    if (ec == boost::asio::error::would_block
       || ec == boost::asio::error::try_again)
      return false;

that evaluates to false when it should evaluate to true. Ultimately this causes the io_service to invoke the socket's ready handler rather than continuing to poll the socket waiting for data.

The inequality stems from the implicit conversion of the enumerates to error_code
instances, during which calls to system_category() used to set error_code:m_cat, return differing values.

The problem disappears if either the error handling code is not built header only as this results in a single definition of system_category() supplied by libboost_system; or the error handling code is not optimized.

From testing it appears that the issue began in GCC 5.2.0 and exists through 5.3.1, the latest released version. GCC 6.0 is nearing release and will have to be tested.

A great deal of time was spent trying to isolate the combination of includes and asio calls or references that cause the issue but the results were frustratingly indeterminate. Sufficiently so that no definitive pattern could be established.

We should give serious consideration to whether we want to utterly wrap the asio code and then forbid any direct use of boost:asio code in anything but the wrapper library or remove the wrapper layer entirely. We also may wish to consider whether we truly want header only building at all.

Short of the above measures which pose their own issues, the mitigation for this issue is either to not build the error code with optimization or allow optimization so long as the error code is not built header only. The former avoids the bug in the optimizer, the latter guarantees a single definition of system_category() which is supplied by libboost_system.

I have automated this solution by creating a new file asiolink/asio_wrapper.h, which encloses the inclusion of boost/asio/asio.hpp in conditional compilation logic that turns off optimization if the compiler is GNU 5.2.0 through 5.3.1 and BOOST_ERROR_CODE_HEADER_ONLY is defined.

I then replaced or removed includes of asio/asio.hpp from all of our source files other than the new wrapper.

10xx.   [bug]       tmark
    Wrapped asio/asio.hpp with logic to suppress optimization when building
    under GCC 5.2.0 through 5.3.1 and BOOST_ERROR_CODE_HEADER only is defined.
    This avoids an issue in the asio socket layer that was incorrectly
    reporting socket read errors causing unit tests to fail and kea-dhcp-ddns
    to loop logger calls in the error handler (aka double errors).
    (Trac #4243, git TBD)
Last edited 20 months ago by tmark (previous) (diff)

comment:13 Changed 20 months ago by fdupont

IMHO it is simpler to just ask to link with the boost system library. Unfortunately it is hard to detect the issue in configure (but we have the compiler version) but with the proposed solution we have to live with it forever... BTW was the problem signaled to boost and gcc guys?

comment:14 Changed 20 months ago by tmark

FYI - I just tested against Boost 1.61.rc1 (first beta release). The problem is still present in Boost 1.61 which is not surprising. The submitted solution corrects the issue. Also not surprising.

comment:15 Changed 20 months ago by marcin

  • Owner changed from Unassigned to marcin

comment:16 follow-up: Changed 20 months ago by marcin

  • Add Hours to Ticket changed from 0 to 3
  • Owner changed from marcin to tmark
  • Total Hours changed from 24 to 27

I reviewed the commit d0115185965ebe827a12729f0db6977dc7419a4b.

This looks like one of those issues when you think you've had enough of being a software engineer. But, in any case kudos that you were able to isolate the problem.

src/lib/asiolink/asio_wrapper.h
The commentary refers to ticket #4309, which is "Unused value remote_tmp in AllocEngine::reclaimExpiredLease()". I think you wanted #4243.

In the following sentence: "The issue arises from in-lining the above function, which returns a
reference to a local static variable, system_category_const," there should be a period, rather than comma at the end.

It would be useful to have a reference to gcc and boost bugs/tickets in the text in this header file. It very well describes the problem but there is no mention that we did submit bugs against gcc so as it is easy to check what the status of this issue is.

It could be useful to put a sentence saying that this file must be included in all file requiring asio.hpp and that asio.hpp should not be included directly. I also wonder if we should mention it in the Kea coding guidelines? Or asiolink README? I am just thinking how we prevent people from including asio.hpp directly. But, there is probably no reliable method.

This file doesn't appear in Makefile.am and thus it is likely to cause distcheck issues.

comment:17 in reply to: ↑ 16 Changed 20 months ago by tmark

  • Add Hours to Ticket changed from 3 to 1
  • Owner changed from tmark to marcin
  • Total Hours changed from 27 to 28

Replying to marcin:

I reviewed the commit d0115185965ebe827a12729f0db6977dc7419a4b.

This looks like one of those issues when you think you've had enough of being a software engineer. But, in any case kudos that you were able to isolate the problem.

Thank you. I'll be happier when we have better solution.

src/lib/asiolink/asio_wrapper.h
The commentary refers to ticket #4309, which is "Unused value remote_tmp in AllocEngine::reclaimExpiredLease()". I think you wanted #4243.

Oops. I suck at multi-tasking.

In the following sentence: "The issue arises from in-lining the above function, which returns a
reference to a local static variable, system_category_const," there should be a period, rather than comma at the end.

Got it.

It would be useful to have a reference to gcc and boost bugs/tickets in the text in this header file. It very well describes the problem but there is no mention that we did submit bugs against gcc so as it is easy to check what the status of this issue is.

Added comments noting the bug reports.

It could be useful to put a sentence saying that this file must be included in all file requiring asio.hpp and that asio.hpp should not be included directly. I also wonder if we should mention it in the Kea coding guidelines? Or asiolink README? I am just thinking how we prevent people from including asio.hpp directly. But, there is probably no reliable method.

I put an prominent comment at the top of the file. Hopefully this will suffice.

This file doesn't appear in Makefile.am and thus it is likely to cause distcheck issues.

Good catch thanks. I have added it. Along the way I stumbled onto a failure in distcheck,

:
ERROR: files left in build directory after distclean:
./src/bin/dhcp4/tests/test_leases.csv.2
make[1]: *** [distcleancheck] Error 1

I'm pretty sure I've seen this failure before and I think it is timing issue where LFC doesn't complete in time or something like that. I modified both dhcp4/tests/Makefile.am
and dhcp6/tests/Makefile.am by added test_leases.csv.* to CLEANFILES.

This seems to be working.

comment:18 Changed 20 months ago by marcin

  • Owner changed from marcin to tmark

I reviewed commit b71852dc1fbfc61a87a88e77aa1b1fc5a621f4a5.

Your changes look good. Distcheck passed for me on OS-X. Please merge.

comment:19 Changed 20 months ago by tmark

  • Resolution set to fixed
  • Status changed from reviewing to closed
  • Total Hours changed from 28 to 29

Changes merged with git 082f846f37cb32964c876b2bff5fcac82d1eaaf0
Added ChangeLog entry 1100.

comment:20 Changed 8 months ago by tmark

Please note the following discussion on Boost users list:

http://lists.boost.org/boost-users/2017/03/87322.php

Boost authors are aware of the issue.

Note: See TracTickets for help on using tickets.