
Quality Assurance for BIND 10 DNS

This document describes the processes, tools, and methods used to help ensure quality and confidence in BIND 10. It also discusses shortcomings and needs. Note that this document does not yet discuss BIND 10 DHCP QA (though there is some overlap). The following topics are mostly an unordered list of QA concepts, each with discussion.

Quality assurance is based on a variety of criteria, including coding best practices, code review, testing of interfaces for defined behaviors with both correct and incorrect inputs, performance, and resource usage. We do not have an overall defined QA plan, including one covering its management and maintenance.

Requirements definition

For the protocol side, we do have sources for conformance guidelines and specifications: RFCs, BIND 9 documentation, and the source code of BIND 9 (the "reference implementation") and other third-party open source servers. Nevertheless, this material is disjointed, and the number of specification statements, counted as single sentences, would be well over 750 items. We have done a few evaluations and writeups for specific protocols (such as IXFR), but this has been ad hoc, inconsistent, and overall very incomplete.

Also, we have not had detailed definitions or designs prior to much of the development. (Some code has been rewritten multiple times due to the lack of an initial design, or was developed quickly as a proof of concept and then used in production.)

NOTE: The guidelines are in the process of being published by the Engineering Best Practices board.

Plan:

  • document the Customer's requirements.
  • document the detailed designs for functional requirements and administrator interfaces. (This is beyond the DNS specifications.)
  • document the QA requirements.
  • develop a detailed DNS specifications document which is reused as descriptions for individual DNS protocol tests and for generating DNS documentation. This structured data contains references to the official standards or best practices, or to source code if the behavior/protocol is not officially defined. (Jeremy did initial work on the document, focusing on DNSSEC. He will provide a proposal toward completing this work. A sketch of what one entry might look like follows this list.)
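
As an illustration only, one entry of such a structured specifications document might look like the Python data below. The field names, the paraphrased IXFR statement, and the empty test list are assumptions for the sketch, not the actual format of the in-progress document.

    # Hypothetical entry of a structured DNS specifications document;
    # field names are illustrative only.
    SPEC_ENTRIES = [
        {
            "id": "ixfr-response-fallback",      # hypothetical identifier
            "statement": "If incremental zone transfer is not available, "
                         "the entire zone is returned (an AXFR-style response).",
            "reference": "RFC 1995, section 4",  # official standard, when one exists
            "source": None,                      # or a pointer into source code if not officially defined
            "tests": [],                         # e.g. lettuce scenarios covering this statement
        },
    ]

    def untested(entries):
        """List specification statements that no test claims to cover."""
        return [entry["id"] for entry in entries if not entry["tests"]]

    if __name__ == "__main__":
        print("statements without tests:", untested(SPEC_ENTRIES))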

Specification based testing

We do not have a single document listing all of our required DNS specifications and pointing to our tests for each item. We do have various unit tests and system-level tests, but most are not identified as testing the "specification". (This is also discussed in the previous Requirements definition section.)

There are DNS conformance test suites, but we do not currently use them. The TAHI DNS test suite (and IPv6 test suite) was evaluated, but we never set it up for automated use. It is several years old and not maintained. http://www.tahi.org/dns/

Another test suite is PROTOS. It probably has not been evaluated yet. It uses Java and appears to be very old and unmaintained. https://www.ee.oulu.fi/research/ouspg/PROTOS_Test-Suite_c09-dns

There is also a commercial conformance product. TODO: find the URL

Plans:

  • As discussed in the Requirements definition plans, use the detailed DNS specifications document (in progress) as the descriptions for individual DNS protocol tests. An evaluator could then easily review a generated report to see how complete (or lacking) the DNS protocol support is. (There was a discussion and example of reusing it for lettuce tests at a face-to-face meeting. Jeremy will provide a proposal toward completing this work. A sketch of a lettuce step definition follows this list.) This test suite could be developed to be server agnostic and reused for BIND 9 and other DNS implementations; we should work with third-party contributors toward this effort. It could be developed using the lettuce framework.
  • automate use of the TAHI suite. It may be better than nothing: even if we pass all of its tests now, it could be useful for detecting later regressions.
  • automate use of PROTOS. The same reasoning applies: even if we pass all of its tests now, it could be useful for detecting later regressions.
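
For illustration, a lettuce step definition backing such a specification-driven scenario could look roughly like the sketch below. The step wording and the use of the dnspython library are assumptions; this is not the project's existing step library.

    # Rough sketch of lettuce step definitions for a specification-driven
    # DNS scenario; step wording and the use of dnspython are assumptions.
    from lettuce import step, world
    import dns.message
    import dns.query
    import dns.rcode

    @step(r'a query for "(.*)" type (\S+) is sent to (\S+) port (\d+)')
    def send_query(step, name, rrtype, address, port):
        query = dns.message.make_query(name, rrtype)
        world.last_response = dns.query.udp(query, address, port=int(port), timeout=5)

    @step(r'the response rcode should be (\S+)')
    def check_rcode(step, expected):
        rcode = dns.rcode.to_text(world.last_response.rcode())
        assert rcode == expected, "got rcode %s, expected %s" % (rcode, expected)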

Ad hoc manual testing

Before BIND 10 snapshot releases, there have been various unofficial uses of the latest code tarballs for simple testing, such as a few developers upgrading personal machines and doing quick tests of serving or transfers. We also have a few production deployments where the systems may be upgraded with some testing. These informal tests have identified problems multiple times, but they are not automated, nobody is assigned to them, the environments and testing tools are undefined, and no checklist is provided for what to do.

Plan:

  • create a mechanism to automatically upgrade some of the production systems. This will include checking results and logs and, on failure, rolling back to the previous working release. (A sketch of a simple post-upgrade check follows this list.)
  • create a checklist of tasks, and of things to report, for users doing informal manual testing.
  • create detailed lettuce tests for the common tasks that we have been doing manually. (Add these to the existing lettuce-based setup so they will be automated.)
  • provide a simple questionnaire for ad hoc manual testers to gather operational ideas that we can possibly turn into new system tests.
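
A post-upgrade check of the kind mentioned above might, as a minimal sketch, query the upgraded server and fail loudly if anything looks wrong. The server address, zone name, and use of dnspython are placeholders and assumptions, not part of any existing tooling.

    # Minimal post-upgrade smoke check; SERVER and ZONE are placeholders.
    import sys
    import dns.exception
    import dns.message
    import dns.query
    import dns.rcode

    SERVER = "127.0.0.1"    # placeholder: address of the upgraded b10-auth
    ZONE = "example.com."   # placeholder: a zone the server is expected to serve

    def check_soa(server, zone):
        """Return None if the server answers a SOA query sanely, else a description."""
        query = dns.message.make_query(zone, "SOA")
        try:
            response = dns.query.udp(query, server, timeout=5)
        except dns.exception.Timeout:
            return "no response (timeout)"
        if response.rcode() != dns.rcode.NOERROR:
            return "unexpected rcode %s" % dns.rcode.to_text(response.rcode())
        if not response.answer:
            return "empty answer section"
        return None

    if __name__ == "__main__":
        problem = check_soa(SERVER, ZONE)
        if problem:
            print("smoke check FAILED: %s" % problem)
            sys.exit(1)    # a wrapper could trigger a rollback at this point
        print("smoke check passed")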

Compatibility / Interoperability Testing

In addition to Specification based testing, we can test our servers' behavior against other software implementations.

  • systest -- We automate use of "systest" as part of the build farm. The systest target uses a small subset of the BIND 9 test framework ported to BIND 10. It is included in the BIND 10 source tree and may contain new tests not included in BIND 9. Currently the ixfr/in-2 system test uses BIND 9 named for some interoperability tests of xfrin and xfrout. (It appears the other tests aren't enabled.)
  • query_two_server.py -- this tool, located in the BIND 10 source, may be used to compare two DNS servers' responses to a query. It can also use DNSSEC, EDNS0 (and set the buffer size), and TCP or UDP. Sample zone files and databases of queries to use are included. The code is not tested automatically, and the tests are also not automated. It requires setting up the two servers. (A sketch of the comparison idea follows this list.)
  • Jeremy also compared responses from different name servers, using nmsgtool to create the data. This was not automated and is not fully documented.
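
Reduced to a minimal sketch, the comparison idea looks roughly like this (this is not query_two_server.py itself; the server addresses, query list, and use of dnspython are assumptions):

    # Compare the answers of two servers for the same queries; addresses and
    # the query list are placeholders.
    import dns.message
    import dns.query

    SERVER_A = "192.0.2.1"   # placeholder: e.g. the BIND 10 server under test
    SERVER_B = "192.0.2.2"   # placeholder: e.g. a BIND 9 reference server

    def normalized_answer(server, name, rrtype, use_tcp=False):
        query = dns.message.make_query(name, rrtype, use_edns=0, payload=4096)
        send = dns.query.tcp if use_tcp else dns.query.udp
        response = send(query, server, timeout=5)
        # Sort the answer RRsets so ordering differences are not reported.
        return sorted(str(rrset) for rrset in response.answer)

    def compare(name, rrtype):
        a = normalized_answer(SERVER_A, name, rrtype)
        b = normalized_answer(SERVER_B, name, rrtype)
        if a != b:
            print("MISMATCH for %s %s:\n  A: %s\n  B: %s" % (name, rrtype, a, b))

    if __name__ == "__main__":
        for name, rrtype in [("example.com.", "SOA"), ("www.example.com.", "A")]:
            compare(name, rrtype)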

Plans:

  • evaluate the other systest tests that aren't enabled, and enable them if ready
  • evaluate porting the rest of the BIND 9 system tests to BIND 10
  • or port the existing BIND 10 systest tests to the lettuce framework
  • and consider porting all of the BIND 9 system tests to the lettuce framework.
  • automate use of query_two_server.py for the various scenarios it provides.
  • document the methods for capturing and comparing BIND 10 data with other nameserver implementations using nmsgtool.
  • automate use of nmsgtool to capture DNS data for BIND 10 and then compare with previously captured data (against itself or other server).

Fuzz testing

Various unit tests send invalid, random, or unexpected data to some functions; this includes testing configuration and DNS interfaces, for example. This is automated as part of the "make check" framework. The scope of this is not yet documented.

Tools used to provide various invalid, random, or unexpected data as input to a DNS server include:

  • tests/tools/badpacket -- badpacket is used to set values (or ranges of values) for DNS message flags, section counts, message size, and name. Its output shows the message created and the message returned. It is not built by default, and its use is not automated.
  • In the BIND 9 source tree, a tool called packet.pl reads packets encoded in hexadecimal from a text file and sends them as arbitrary packets (ignoring replies). It has been used to check for "packet of death" scenarios. Its use is not automated for BIND 10. (A Python sketch of the same idea follows this list.)
  • The BIND 9 source tree also has a few versions of ans.pl, which acts like a name server but misbehaves in controllable ways. This has not been evaluated or automated for BIND 10.
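
A minimal Python sketch of the packet.pl idea (read hex-encoded DNS packets from a text file and fire them at a server, ignoring replies) might look like this. The input format (one hex string per line, '#' comments) and the server address are assumptions, not the actual packet.pl format.

    # Send hex-encoded packets from a text file and ignore replies.
    import binascii
    import socket
    import sys

    SERVER = ("127.0.0.1", 53)   # placeholder: server under test

    def send_hex_packets(path):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        with open(path) as f:
            for line in f:
                line = line.split('#')[0].strip()   # assumed '#' comment syntax
                if not line:
                    continue
                packet = binascii.unhexlify(line.replace(' ', ''))
                sock.sendto(packet, SERVER)         # replies are deliberately ignored
        sock.close()

    if __name__ == "__main__":
        send_hex_packets(sys.argv[1])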

In addition, Jeremy replayed traffic collected from popular (busy) DNS servers against our BIND 10. This did identify some missing support, but that was years ago. The replayed traffic is mostly not broken messages, but it may contain unknown details. He did not automate this; his steps are not documented in the wiki, but he does have notes.

Plans:

  • automate use of badpacket tool. See ticket #703 for examples.
  • automate use of other fuzz tools.
  • automate replaying traffic using nmsg api.
  • automate use of the BIND 10 message parser to query/respond to the SIE ch202 live data (in 2011, about 3 billion transactions per day). An SIE feed is available via a switch to the bind10 testing1 system; jreed is setting it up.
  • evaluate Fujiwara's dns_replay2 and see if he has a data set we can automate against our BIND 10.
  • automate packet.pl (from BIND 9) in the BIND 10 build farm to send known "packet of death" tests.
  • evaluate the ans.pl name servers for use in BIND 10.

Acceptance Criteria

The current release engineering steps for the development snapshots do include items for various checks, but some have been ignored for development snapshots, and we do not keep a dedicated history of these decisions. In addition, the current BIND 10 release plan for production releases does not yet document these steps.

The developers have a (possibly undocumented) policy that failures reported by our build and test farm be handled quickly (such as within 24 hours). In some cases, code has been reverted -- by the committer or another developer -- because it was known that the failure would not be resolved (fixed) quickly. But in some cases, failures on specific machines (such as Sun Studio on SPARC only) may be ignored for a long time (even weeks). It is expected that large merges are not done late on a Friday if the developer knows he/she will be unavailable over the weekend (or in similar scheduling situations).

In addition, code review procedures may skip the full set of tests, so code may be approved with failures that later seem obvious. This is accepted because the build farm has been used to detect and quickly resolve such problems; there has been no shame in using it this way, and it is a common practice. The build farm may also be used to run the builds and tests for specific branches.

Plans:

  • document and follow defined criteria for accepting code or a tarball for a release or merging to a release branch.
  • document and follow defined criteria for committing/pushing code to master branch.
  • document procedures for reporting problems.
  • document procedures for corrective action which may mean simply reverting commits and re-opening tickets.

Documentation

API documentation is written as part of the code, so in most cases the generated output is syntactically correct. Warnings during documentation generation point out mistakes in the developer documentation, such as missing descriptions or mismatched information. (This uses doxygen.)

Plans:

  • have a builder system generate the developer documentation and fail on warnings. Also convert the Python code documentation to doxygen syntax. (A sketch of doxygen markup for Python follows.)
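
For illustration, doxygen can pick up Python documentation written in its "##" special comment form, roughly as below; the function shown is a hypothetical example, not existing BIND 10 code.

    ## @brief Convert a textual rcode name into its numeric value.
    #
    #  Hypothetical example used only to show the doxygen "##" comment
    #  style for Python code.
    #
    #  @param name Textual rcode such as "NOERROR" or "SERVFAIL".
    #  @return The numeric rcode value.
    #  @exception ValueError The name is not a known rcode.
    def rcode_from_text(name):
        known = {"NOERROR": 0, "FORMERR": 1, "SERVFAIL": 2, "NXDOMAIN": 3}
        if name not in known:
            raise ValueError("unknown rcode name: %s" % name)
        return known[name]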

Unit Testing

Since almost the beginning, the BIND 10 project has followed a Test-Driven Development (TDD) model. In some cases, the tests may not be written before the code, but they are provided before the code review (by a peer developer) and certainly before the code is imported (merged) into the master repository.

The unit tests are done at different levels. They may be low-level, testing the behavior of individual functions, or higher-level, covering the documented protocol rules. Some tests cover performance at the micro-benchmark level. Some tests check that command-line arguments are handled correctly. These unit tests may also be considered low-level regression tests.

The tests for the C++ code are written using the Google Test (gtest) framework. The ./configure --with-gtest switch is used to enable them, and the "make check" target is used to run them. Our policy for over a year has been to build the test code with the regular "make" (the all target) instead of building it as part of the "check" target. There are currently around 2130 tests using the gtest framework.

The tests for the Python code are written using the standard Python test framework. No configure switch is needed to enable them; "make check" also runs these tests. (So if --without-gtest is used, the Python tests are still available.) There are currently around 67 Python scripts that are run to provide various tests.
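
As a minimal sketch, a Python test written with the standard unittest module (assuming that is the framework referred to above) looks roughly like this; the tested function is a made-up stand-in, not actual BIND 10 code.

    # Minimal unittest sketch; parse_port() is a stand-in for real code.
    import unittest

    def parse_port(text):
        """Convert a port number string, rejecting out-of-range values."""
        port = int(text)
        if not 0 < port < 65536:
            raise ValueError("port out of range: %d" % port)
        return port

    class ParsePortTest(unittest.TestCase):
        def test_valid(self):
            self.assertEqual(parse_port("53"), 53)

        def test_out_of_range(self):
            self.assertRaises(ValueError, parse_port, "70000")

        def test_not_a_number(self):
            self.assertRaises(ValueError, parse_port, "dns")

    if __name__ == "__main__":
        unittest.main()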

There are also some other tests for zone loading. These may not be considered "unit tests", but they are run at the same "make check" level.

The "make check" is ran on all the build systems for every merge to master. An email is sent to the developers for new failures.

Code coverage reporting is available for the C++ and Python unit tests. These reports indicate how many lines are covered by the tests; they also provide other measurements and point to code that is not tested. The Python coverage check is enabled using the ./configure --with-pycoverage option; the Python coverage software needs to be installed. The C++ code coverage is enabled using ./configure --with-lcov; the LCOV software needs to be installed, and it uses gcov, which is provided with the GCC suite. After "make check" has run, the HTML reports are generated with the "make report-cpp-coverage" and "make report-python-coverage" targets.
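
Put together from the options and targets described above, a C++ coverage run looks roughly like this:

    ./configure --with-gtest --with-lcov    # LCOV (and gcov from GCC) must be installed
    make                                    # test code is built with the regular "make"
    make check                              # run the unit tests and collect coverage data
    make report-cpp-coverage                # generate the HTML coverage report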

Note that some code coverage figures may be misleading: tests may cover unrelated code, while a component's own tests may not check all of its own code.

Plans:

  • run code coverage per component/module and then generate report(s) and clear counters (to make sure the local tests cover the local code).
  • distinguish between different levels of "unit" testing
  • correlate the tests with the documented specifications (be able to look at the specs and know what is or isn't supported).
  • get to near 100% test coverage (all lines of code are tested).
  • document what the coverage threshold is; this may be used as a QA measurement criterion. Report failures when the test coverage is lower than our requirement.

Load and Resource Usage Testing

Jeremy manually runs a variety of performance tests against the b10-auth server releases and generates graphs. This is not automated at this time, simply because the configurations have frequently changed between releases, so new custom configurations often need to be created. (The old configurations are still used to re-run tests against older versions.) Some details are at http://bind10.isc.org/wiki/DnsBenchmarks and more graphs are at http://git.bind10.isc.org/~jreed/bench/releases/. The collected data also includes resource usage (not yet graphed, other than binary data sizes).

The scenarios exercise bind10 with high traffic and large memory usage in short bursts (such as a few million queries and data sources containing millions of records).
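
Purely as an illustration of the idea (not the queryperf-style tooling actually used for the published benchmarks), a crude single-client rate measurement could look like the sketch below; the server address and query set are placeholders, and dnspython is assumed.

    # Very rough single-client query rate measurement; real benchmarks use
    # dedicated tools that drive the server much harder than this.
    import time
    import dns.message
    import dns.query

    SERVER = "127.0.0.1"    # placeholder: address of the b10-auth under test
    QUERIES = [("example.com.", "SOA"), ("www.example.com.", "A")]  # placeholder query set
    ROUNDS = 1000

    def measure_qps():
        sent = 0
        start = time.time()
        for _ in range(ROUNDS):
            for name, rrtype in QUERIES:
                query = dns.message.make_query(name, rrtype)
                dns.query.udp(query, SERVER, timeout=5)
                sent += 1
        elapsed = time.time() - start
        return sent / elapsed

    if __name__ == "__main__":
        print("approx. %.0f queries/second (single synchronous client)" % measure_qps())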

Plans:

  • add more graphs; we have a lot of data collected that can be represented.
  • fix setup and graphs for -n (hot spot cache changes)
  • add more scenarios (ddns, xfr, dnssec-ok, edns, etc)
  • test the scripts more on different operating systems
  • test running performance tool on different system than the server
  • test using jinmei's queryperfpp (and compare)
  • report on significant one-time or accumulated regressions (performance, resource usage, etc)

Static Analysis

For static analysis (analyzing code without running it), we automate use of cppcheck (http://sourceforge.net/apps/mediawiki/cppcheck/) and the Clang Static Analyzer "scan-build" (see wiki:ClangStaticAnalyzer) (http://clang-analyzer.llvm.org/scan-build.html), and we build most code with the compiler's -Werror flag (so compiles fail on warnings). Cppcheck has been useful; we upgrade cppcheck periodically and also discuss issues with it upstream. Our cppcheck suppressions list is at src/cppcheck-suppress.lst, and the "make cppcheck" target may be used. scan-build produces some false positives, but we generally try to adjust for those as well.

Plans:

  • automate pychecker, pylint, and/or pyflakes. (Originally these did not support Python 3, but now some do.)
  • try RATS and flawfinder again. (We already tried them a few times, but they produce much noise; maybe they can be tuned to ignore a lot more.)
  • try g++ -Weffc++ to warn about violations of various style guidelines from Scott Meyers' Effective C++ book. NOTE: Boost numeric_cast doesn't compile with it.
  • automate use of Coverity

To write about:

TODO: code coverage with system tests

TODO: micro benchmarks

TODO: profiling, gprof

TODO: run-time analysis, like valgrind

Originally written around: October 16, 2012
