wiki:January2012MeetingMinutes

Monday 9 January

Overview of the BIND 10 Plan

  • How Shane Plans

Basic problem: at the beginning of year 3, we needed a detailed plan of what we would deliver and when, for sponsors. This is the opposite of what Scrum dictates. We have determined that Scrum has some limitations, because we need to do this.

Basic software planning process: list all the work that needs to be done, check if anything's missing, try to find dependencies, set priorities, and lay out work based on resources and schedule. Then there is estimation of the size of the various tasks. Basically, you can't get away from decomposition and synthesis of tasks, based on scope, schedule, and resources.

Approach: two things we need:

  • List of features we need
  • Trustworthiness and reliability

Features: every set of features comes into a release. Failure point was good estimates.

-Remaining year 3 plan

-Features

-DDNS
-Name server identifiers
-TSIG usage
-XFR limitations

  • DNSSEC authority server
  • NSEC3 authority server
  • Views
  • In-memory
  • Operational support tools
  • Data Source extensibility
  • Hooks
  • API documentation, HOWTOs
  • Reliability
  • Year 4 plan
  • Year 5 and beyond

Hook Design Discussion

  • review of previous hook discussions
  • review of DHCP hooks

We don't really use tools to do the planning. Note from Jinmei that some long-term planning is OK in Scrum. Shane notes we will discuss this more tomorrow. Text works for planning with shallow dependencies but would fail if we had more complex ones.

We had one big dependency this year, the datasource refactoring. This was somewhat of a surprise, and almost everything depended on it. Possibly this could have been detected with more analysis, which we didn't have room for in this year's planning, but we hope to remedy that for year four. We try to order things in a way that makes sense, keep them publicly visible, and prioritize them based on perceived and determined user needs as well as basic functional requirements.

To date, on the planning side, Shane set Making The Plan as a single task. This is something we will change. For future planning work, the goal is to break that down into actual planning tasks, and plan the planning, so to speak. We are also considering adding some planning and administrative tasks to the trac site. This would make this work more visible, which is probably helpful to all, and keep the tasks from getting lost.

We have also been doing some priority-setting exercises with the Steering Committee. We had a mixed response on whether the steering committee ought to do that, but we did get a complete response from them. We have also summarized the responses from last year's BIND 10 Operational Requirements Survey into a set of priorities and will be repeating that exercise soon. More details in the year 4 discussion tomorrow.
High level year 3 issues: (wiki down, so we're struggling)

We had some issues with communication with the steering committee and with... an aggressive/wishful schedule. Not unexpectedly, reality didn't meet the actual plan. The sponsors got concerned. Joao and the team have done a lot of communication with the sponsors and things have improved greatly. What this means for the remainder of year 3 is that we need to be able to deliver everything we say we can deliver. To this end, we have vastly reduced the scope to something we feel is more realistic. If we do end up with extra cycles in March, we can work on more cool things.

Focus areas:

-DDNS work - we're very close, possibly able to finish in one sprint based on the estimations we currently have. If we finish in two sprints, it goes in the next release and all is well.
-TSIG usage - we can use it everywhere we have ACLs but the config is "bumpy" - to discuss more in the user experience area (ACL usage may also come into this)
-XFR limitations - we need requirements for this and how much of it we can do this year, and a plan
-NSEC and NSEC3 for in-memory and sqlite (NSEC nearly done but not NSEC3)
-Data source "extensibility" - finish current refactoring, document how to add more types of data sources later (so we can then invite outside developers to try to do so)
-Performance has to become a priority. (AS112 failure and sponsor needs)
-Reliability - we had a plan for this and we've had a lot of problems with it. We need to revise the plan. bind10.isc.org is successfully self hosting, which was the first goal. It was not easy. AS112 server took a long time and failed. We have been trying to run an open resolver too, but it isn't working well yet because we've been focusing on the authoritative side. Dogfood experiments are, therefore, a real mixture. We're still working on it.

Less key but maybe some will happen this year:

-Operational Support tools
-Name server identifiers
-API documentation, HOWTOs

Not in this BIND 10 year (very important, but for year four):

-Name server identifiers - from the command tool, etc
-Views (not getting there)
-Hooks (again)
-More reliability work.

Statistics: not enough insight. Two things left to do: gather statistics, and provide an SNMP interface.

Also a concern that we need to ask for input/reviews on an ongoing basis and not have the code merge and review stack up to the end of the BIND 10 year.

Move to Tuesday:

  • Year 4 plan
  • Year 5 and beyond

Hook Design Discussion

  • review of previous hook discussions
  • review of DHCP hooks

Quick and dirty way to do multiple cores

-restricted capability for updates
-for short term: start multiple auth servers, each loading its own copy of in-memory
-usable for most people
-memory consumption is high; we may not be able to load even one huge zone in in-memory; much memory needed; vorner tested this before. Maybe 50 KB per RR? (need to look at vorner's old research on this; how much memory per RR?)
-Jinmei said it is clear what is needed to fix the memory usage problem. We knew about it and it was intentional. Not difficult work design-wise. Recoding will also be needed for the shared-memory work.
-Trick with forking is probably useless at this time.
-Do we really want to support multi-core for year 3? So the initial goal is just to be able to get multiple auth servers to run at the same time. We would like to see how scaling will work. It should provide some performance. Hopefully straightforward.
-any ways to improve performance with minimal effort?
-maybe 6-8 core processors could handle the AS112 situation?
-limit how much time we will expend during these sprints on this.
-maybe f-root would be a goal for this
-get two auth servers using the open socket
-try different auth servers with their own configurations.


Hooks -- custom code inserted at different points

  • custom datasource
  • new module
  • examples: registrar, change zone, custom interface
  • RPZ or AAAA filtering or other special processing

On the DHCP side maybe more interesting: special network topologies, special permissions for certain users, change networks, etc. Hooks are an early requirement for DHCP. See wiki: DhcpHooks

  • multiple callbacks; callback chain; dynamically loaded libraries (multiple libraries, including defining the order of what is called)
  • callback may modify data and return a status code
  • need to decide what to do on failure (maybe abandon the processing of the packet)
  • have a defined order within a single library or use multiple plugins
  • maybe no need for configuration per single plugin
  • customer didn't have any specific requirements; another in the past was a single call-out. Difficult to imagine possibilities without seeing it in operation.

For DNS:

  • call chain, pre-action, post-action
  • dlopen (used for datasource currently)
  • could be used to implement views
  • hook plugins may be available for xfrout, xfrin, auth, resolver, all separately
  • can't easily hook into the datasource itself
  • on recursive side: cache hooks, before looking in the cache, etc.

https://lists.isc.org/pipermail/bind10-dev/2011-March/002086.html
https://lists.isc.org/pipermail/bind10-dev/2011-March/002087.html

  • main difference is that instead of a backbone list, we build a chain with decisions (hook points) for branching; vorner's idea could be used to build the entire server. Greater flexibility without much more complication, since we need to solve all this anyway.
  • configuration file used to define the use of hooks
  • not enough time to implement hooks before end of March release.
  • don't know yet if simple call flow versus complex is better. Need to see what customers would use.
  • not sure yet if hooks will cause a performance issue, but if not used, they shouldn't cause any problem (just a pointer lookup). 20 or so quick tests. Benchmark to make sure.
  • inline versus virtual function
  • define in our processing where we should do callouts
  • generalized context / a query state
  • in case of python hooks, generic function call; python hook would need a parameter saying which function to call.
  • context can be shared between hooks
  • two contexts: one per library and one per query.
  • single function used for multiple hooks, in the case of hooks written in python, for example: add another parameter or change functions to function objects. Or maybe this could be done at the python / C++ layer -- needs to remember where each hook is at
  • need to research details here on multiple copies of same library
  • version our hooks, variable or hook library function
  • research what people really would use
  • unbound has some type of hooks and maybe nominum too
  • evildns, customized responses
  • work item: documenting where DNS hooks need to go, inputs and outputs (similar to DHCP proposal)
  • structure with one parameter at a time
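A rough sketch of the callback-chain idea from the notes above, written in Python purely for illustration. Nothing here is an agreed BIND 10 interface: the `HookPoint` class, the `CONTINUE`/`ABANDON` status codes, the `pre_query` hook point and the two example callbacks are all hypothetical names. It only shows callbacks called in a defined order, a status code that can abandon processing, and the two contexts (one per library, one per query).

```python
# Hypothetical sketch only; names and status codes are assumptions,
# not part of any BIND 10 API.
CONTINUE = 0  # keep walking the callback chain
ABANDON = 1   # stop processing this packet/query


class HookPoint:
    """A named point in processing where registered callbacks are run."""

    def __init__(self, name):
        self.name = name
        self._callbacks = []  # (callback, library_context), in call order

    def register(self, callback, library_context=None):
        # Registration order defines the order of the chain.
        self._callbacks.append((callback, library_context))

    def run(self, data, query_context):
        # With no callbacks registered this is just a loop over an empty list.
        for callback, library_context in self._callbacks:
            status = callback(data, library_context, query_context)
            if status != CONTINUE:
                return status  # first non-CONTINUE status ends the chain
        return CONTINUE


def count_queries(data, library_context, query_context):
    # Per-library context survives across queries.
    library_context["seen"] += 1
    # Per-query context is shared between hooks for one query only.
    query_context["counted"] = True
    return CONTINUE


def filter_aaaa(data, library_context, query_context):
    # Example of "special processing": refuse to handle AAAA queries.
    return ABANDON if data["qtype"] == "AAAA" else CONTINUE


pre_query = HookPoint("pre_query")
stats = {"seen": 0}
pre_query.register(count_queries, stats)
pre_query.register(filter_aaaa)

a_status = pre_query.run({"qname": "example.org", "qtype": "A"}, {})
aaaa_status = pre_query.run({"qname": "example.org", "qtype": "AAAA"}, {})
```

Whether a failure should abandon the packet, and whether the chain should be flat or a branching one as in vorner's idea, are exactly the open questions in the notes; the sketch picks the simplest option for each.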


socket creator related issues

-message bus: have restrictions that this process is only allowed to use these commands
-conclusion: setuid, boss drops privileges regardless; if admins want other components to run as other users then suggest setuid (and wrappers if needed).
-if can't shutdown, then can't kill other processes, so just warn about that; sockcreator shuts down if told or if the socket with boss closes.
-chroot would need datasource plugins; it would need all libraries if starting a new component or restarting one
-chroot is complex, worth pursuing?
-ops currently doesn't use heavy UDP users within FreeBSD jails (historical issue already fixed)
-ops does use chroot (defined in named.conf); these are dedicated boxes for DNS only with customer (raw) zone files only


-for socket listening, preserve previous status
-if another auth is listening, sockcreator can provide the same file descriptor
-where to maintain the socket states? check all auth processes or keep in sockcreator?
-if something fails (or there is a race condition of asking/responding) something may fail.
-one possible conclusion is to get rid of sockcreator
-if we merge the current prototype, we need to allocate some work time for clean up before the end of this project year.
-some synchronization problems may go away if we switch to dbus. dbus tells who the message comes from or who disconnects, and can possibly pass a file descriptor over dbus.
-use new sockets and then drop unused ones.
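For reference, handing "the same file descriptor" to a second process relies on descriptor passing over a Unix-domain socket. The sketch below shows the mechanism with Python's standard `socket.send_fds` / `socket.recv_fds` (Python 3.9 or later, Unix only). It is not the sockcreator protocol: both ends live in one process here, and the `b"udp"` tag and loopback address are made up for the demonstration.

```python
import socket

# A connected pair of Unix sockets stands in for the channel between
# the socket creator and an auth process (assumption for the demo).
creator_end, auth_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# The "creator" side opens the socket. Port 0 asks the OS for any free
# port, so the demo does not need root.
listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.bind(("127.0.0.1", 0))

# Pass the descriptor as ancillary data (SCM_RIGHTS) with a small tag.
socket.send_fds(creator_end, [b"udp"], [listener.fileno()])

# The "auth" side gets a new descriptor referring to the same open socket.
msg, fds, flags, addr = socket.recv_fds(auth_end, 1024, 1)
received = socket.socket(fileno=fds[0])

same_address = received.getsockname() == listener.getsockname()

for s in (received, listener, creator_end, auth_end):
    s.close()
```

Because both descriptors refer to one underlying socket, two auth processes holding them would share the same listening port, which is why the question of where the socket state is tracked matters.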


DDNS ~ 1.5 sprints

maybe need to skip this due to time available

NSEC3 refactor: 2 sprints
NSEC -> in-memory: 1 sprint
NSEC3 -> in-memory: 2+ sprints
in-memory additional features:

  • IXFR support (?)
  • load from zone file (DNSSEC too)
  • load from other data sources

performance

  • cores: need to support multiple cores
  • profiling: need to know bottlenecks

(Larissa also taking notes.) Next sprint, try profiling to identify some ways to improve these. Do this measurement so we can make a decision later.

Also write down the steps to do this, and make an automated way soon (make profile).

Tuesday 10 January

BIND 10 from a user point of view

Original high-level goals for each year of project.

Y1 - authoritative server
Y2 - resolver
Y3 - turn them into production server
Y4 - drop-in replacement for BIND 9
Y5 - experimental

What happened:

Y2 - delivered resolver but not as extensive as we want. Adopted Scrum. Additional work meant that we delivered resolver without validation or polish.
Y3 - trying to finish NSEC/NSEC3, DDNS and performance. (NSEC3 must be a goal for this year.)

Realistically, at end of Y3 we don't have a production-ready server for most environments. Not enough polish for replacement of bind 9. Need to think about year 4.

Goals for Year 4

1. Making bind10 easy to use.

  • Need to update command tool (already have a proposal for that)
  • Transparency - need to be able to look into the system, e.g. list refresh timers, trace queries etc.
  • Build migration tools from bind 9 (e.g. import named.conf). Suggestion: may be able to update BIND 9 to help migration to BIND 10.
  • need to be able to support traditional configuration file administration.
    • suggestion that current config.db is a config file
    • Need to be able to diff files.
    • Suggestions for comments in the config file?
  • Need to implement config history - save version of config and revert back to it. (Save previous version of config.db on a commit comment.)

2. Recursion

  • Do not have anti-Kaminsky features (e.g. random ports). (Port allocation by operating system can be slow.) Need to make sure that the features in BIND 9 are in BIND 10.
  • Need to do NSEC3.
  • Need to support RFC 5011 (rolling of KSKs)
  • Need to check EDNS0 support and what BIND 9 does.
  • Better server capability tracking
  • Query tracing (know why a server is doing a specific thing.)
  • viewing cache as first order object - ability to control cache (add/remove entry,modify TTL, etc.). Cache persistence. Migration of cache between servers?
  • Tool to allow us to check that server is operating as it should.

3. DNSSEC Management

  • HSM support (important to sponsors).
  • Server-side signing support (sign and re-sign zones)
  • Support inline signing
  • Automated key rollover (ZSK rollover/KSK rollover)
  • Need command-line tools for signing zones
  • Are we going to have monitoring, making it safe to serve?
  • There is a BIND 9 project for zone consistency checker. Should do it first as a check that subsequent code didn't break it.

4. Other Things

  • Views
  • Hooks
  • Handling network interfaces elegantly (what happens if interface appears or disappears).
  • Share port between authoritative server and resolver.
  • Add new data sources. (Critical - Sqlite3 has many limitations.)
  • Packages/installers for popular operating systems.
  • Performance - focus on recursive performance.

Development Method

Optimistic to say that we can reproduce all BIND 9 work in one year. Way that we are working means that progress is slower than expected. However, need to look at how we have been managing software development.

We don't use waterfall model (strict staging). It does have drawbacks, but does have some things we need. Better management for one thing.

Weakness of having a new user-visible feature in every release is a focus on features, not value. Doing that will interfere with usability. Drawback we have is that no-one is using the software and giving us feedback, because they can't use it in a production environment. Difference with BIND 9 is that it is incremental; BIND 10 is new software.

We need to get more users. Idea of feature for every release is to pull people in to use it.

Waterfall: positive points:

  • requirements/deliverables
  • time for design/etc
  • Timelines
  • Customer sign-off

Scrum

  • Timeboxed releases
  • Feature demos
  • Short sprint
  • Daily calls
  • Team estimation of tasks
  • Direct user involvement.

Would like to mix these features and methodologies. (Discussion of pros and cons of each approach.) Discussion about the (well-known) problems of estimation.

Proposal for Year 4

Focus on one area at a time:

  1. Usability
  2. Resolver
  3. DNSSEC management

Suggested timetable:

Feature            April           July(-ish)      December(-ish)
Usability          Implementation  Polishing       Maintenance
Resolver           Design          Implementation  Polishing
DNSSEC management  Requirements    Design          Implementation

Need to take into account things that are left over from the previous year.

Means that there will be more time writing documents, less time writing code. Suggest Joao takes on programme management (interface with customer, speaking etc.) to give Shane more time to do the project management stuff.

Can begin pre-work for April deadline now (e.g. deciding what usability features we want.)

Year 5 Planning:

Year 5 was intended to be the year where we started to collect the benefit of the work we've done

Interlude about additional data sources… need one for year four

Review of steering committee priorities

We took the top 4 items for the rest of this year

Discussion of why hooks and views were lower prioritization

Note that we did not include DNSSEC auto-signing as an option on the list.

Note that comments on "in-memory" in the steering committee priority list really refer to performance.

Back to year 5…

We had discussed things like support for clustering environments and multi-master. We probably won't target embedded in year 5, but we may get there eventually. There is also the OpenCPE project idea at ISC and a study about it, which we can read later. What does embedded mean, versus what we need to do for distribution into, say, the BSD world or various Linux worlds?

When do we do Windows support? Do we do Windows support? We need to do an exercise to determine if we will do it and in what way.

User Testing:

For usability stuff, command line tool testing, lettuce might be useful?

Year 4, Flexibility, and Expectation Management:

Back to the year 4 plan, Jinmei wonders how we can introduce flexibility into the plan. Perhaps we can review at the "trimester" signposts.

Shane: my intention was to do the detailed planning only within the specific 4 month period we're at. Maybe what could happen is we could schedule a steering committee meeting immediately before or after the face to face where we are updating the plan. We need to introduce the concept of this in a way that also helps them trust us.

Over the next two months, we will discuss the year four plan individually with the sponsors and negotiate an appropriate feedback mechanism.

There is a goal of increasing some parallelism in year four… Shane says it is a minor goal, Jinmei points out this is very difficult based on his experience.

End of year "how much is left" discussion

DDNS: we are mostly done and we have estimates for the remainder: about 30 points worth for absolute minimum - about 1.5 sprints worth of effort.

  • checks
  • using the prerequisite checks
  • diff normalizer
  • ACL checks
  • XFRout to DDNS
  • Sending packet from auth
  • documentation

Refactoring for NSEC3: 36 points worth of work not including getting rid of the old implementation - another 2 sprints worth of effort.
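The arithmetic behind these estimates can be written out. The velocity of about 20 points per sprint is not stated in the minutes; it is inferred here from "30 points is about 1.5 sprints", and the function name is only illustrative.

```python
import math


def sprints_needed(points, velocity):
    """Sprints required for a number of estimate points at a given velocity."""
    return points / velocity


# Inferred, not stated in the minutes: 30 points ~ 1.5 sprints.
velocity = 30 / 1.5                    # points per sprint

ddns = sprints_needed(30, velocity)    # DDNS minimum
nsec3 = sprints_needed(36, velocity)   # NSEC3 refactoring
nsec3_whole = math.ceil(nsec3)         # rounded up to whole sprints
```

At that velocity the 36 points for NSEC3 come to a little under two sprints, consistent with the "another 2 sprints" figure once rounded up.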

NSEC --> in-memory: hasn't been broken into tasks. Much of the code will be repeatable from sqlite. The test code is quite generic and can be applied to both. The code logic for in-memory is much like BIND9, so once we identify how BIND 9 does this for NSEC it should be fairly straight forward. So this will not be a very difficult task. - this may be one sprint worth of effort.

NSEC3 --> in-memory: more complex than NSEC. at least two sprints.

  • iterator
  • updater
  • push logic into the find function
  • closest encloser (find and query logic can both call it)
  • some other stuff.

In-Memory additional features: IXFR support, load from file or other data sources (DNSSEC too), eliminate the load zone tool.

Performance :

  • cores - experiment from Monday indicates gain with relatively little work, but we need to do single core first. There are memory issues with multiple cores. (the multiple core thing is about a one day one developer task)
  • demonstration of considerable progress may help
  • name compression?
  • profiling (does it make sense to do this before we add NSEC and NSEC3? - somewhat but not for all things such as name compression)
  • need to get near bind 9 performance if possible - but substantial improvement is the goal, whatever it is
  • if we can do multi-core and some profiling work early, we can work from a better informed position. There may be low hanging fruit.
  • there are experiments which have not been merged (Michal had one) and thoughts about shared pointers, which may help.
  • split into at least two tasks, initial look for low hanging fruit, and then a longer/more complex look (maybe)
  • plan is to spend time in the next sprint profiling and then figure out what to do next.
  • need make targets for reproducing

Based on this, we can't do DDNS before the end of the year. We may defer the NSEC work too. Focus has to be NSEC3 and performance, and all plans are subject to change based on the profiling results.

Software development methodology (scrum navel gazing)

Developing requirements documents and post development reviews.

  • meeting to establish goals and run-through the plan for the feature set (eng manager, product owner, author, one other person, present?)
  • review of document by same team to ensure correct requirements.
    -does this make sense for design too? Maybe for design we want all the developers present: a meeting to discuss initial design and brainstorm, then a follow-up to review the final design document.
  • design meetings do not need to have the product owner or other user-land oriented people as they are "how" not "what" discussions.
    -requirements meetings will begin week after next; design meetings will begin when we need them, starting with an NSEC meeting next week, or possibly this week during the meeting.
    -behavior feature level tests at the same time? - during design
    -user stories form some of the requirements discussion?
  • Daily Meeting:

-Current situation is good, except that we never "mix the streams" between the hemispheres (and between development vs non-development work efforts)
-Jinmei notices a trend that the Asian developers miss more meetings than before.
-Shane will investigate now that the holidays are concluded.
-Status report jabber room - only some developers use it - is this useful? The problem is that the people who can't attend a call are the very people who need to update the jabber room, but they are the people who forget or can't update. Shane will send mail announcing a policy that we're using this.

  • some members could sometimes attend the "other" call, but this may not be necessary for now. Jelte will become "optional" attendee in the "pacific" call.
  • non development tasks will go into trac site and non development team members will report status on these in the jabber room and meeting.
  • Feature Backlog - using requirements and design docs in combination with the year/semester planning exercises instead. More waterfallish than scrummish. Task backlog will remain as it is.
  • Feature Demos - we have had two and they have been cool (BOSS configuration, Michal, and Lettuce, Jelte). We would like to be able to record them for posterity. It is hard to find a time for people to attend. In the long run it would be good to have them open to external viewers (sponsors, users, whomever) and then recorded for future viewing. Shane knows how to record it for Linux and Joao has a possible method for Mac (IShowU). Next one will be to do with the socket creator. We will open them up to external viewers after one or two more.
  • Estimations, hours… we have been reporting hours spent on tickets for the last couple of sprints. Shane and Jelte need to sit down and compare estimations with hours to see what the correlation is. Jinmei points out that in some cases not everyone has recorded their time. We may need to add a check in Trac to make the hours entry non optional.
  • we also need to know how many FTE we had for the sprint. Shane can proxy this. If we can get to knowing velocity average per person we're really doing well.
  • then there is hours vs estimate points.
    -do we want to engage in longer term estimation? We do some of this now. It is very SWAG in nature. Does it make sense to do high level estimates the way we do task estimates? Comparing them? Relationship between task level estimates and higher level estimates? Could we improve reliability of deliverability of specific features through improved estimation?

We will be doing more planning in general and more check-ins between the team and the planning process (Shane to lead this)

  • "good" estimation takes time. Lots of time.
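As a starting point for the estimate-versus-hours comparison, something like the following could be used once the data is exported from Trac. The ticket numbers below are invented sample values, not real sprint data, and the function names are illustrative. Tickets with no recorded hours are skipped, which is the gap Jinmei pointed out.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def velocity_per_person(points_done, fte):
    """Average points completed per full-time equivalent in a sprint."""
    return points_done / fte


# Invented sample: (estimate points, hours recorded or None if missing).
tickets = [(2, 5.0), (3, 9.0), (5, 12.5), (8, 30.0), (5, None)]

# Only tickets with recorded hours can be compared.
recorded = [(p, h) for p, h in tickets if h is not None]
points = [p for p, _ in recorded]
hours = [h for _, h in recorded]

r = pearson(points, hours)
missing = len(tickets) - len(recorded)
```

A value of `r` near 1 would suggest points track hours well; a large `missing` count would argue for making the hours field mandatory before drawing conclusions.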

Review of BIND10 technologies:

Usability Discussion

Goal: go through the stuff that BIND 10 does and compare how it works and how we want it to work, from both developer and administrator points of view.

Step one: downloading everything.

-google bind10. isc.org/bind10 comes in.
-no specific bind10 downloads button.
-issues finding downloads.
-finally found download.
-"geez this is a lot of clicking"
-the naming is weird too.
-we have a tar ball.
-untarred it, found some files
-looking at README
-differences of opinion in what user would do next.

  • type ./configure and see what happens.
    (with help)
    -no C compiler.
    -get C compiler.
    -confusing python error. "Missing Python.h"
    -but above it there is another message telling us we need python3-dev
    -installing python3-dev…
    -configure errors… but Jeremy has handled these
    -then user looks at the install.
    -should use .txt for the guide referenced here
    -oh, the guide is a thousand lines long
    -note guide has been significantly updated from here.
    -could the required apps list be *in* the install doc?
    -there are just way too many dependencies and it's difficult to know what you need
    -on the other hand we are getting there.
    -getting botan was a pain in the rear
    -need pkg-config and it's not obvious
    -had to go to SourceForge to get log4cplus
    • log4cplus builds OK
    • need to install boost
    • need to install sqlite3
    • now configure succeeds

-list of flags is a bit confusing, what do blank spaces mean
-building bind10 takes forever….
-switching to existing built BIND10 on Shane's server.
-"why does it run that Python script?" (Makefile issue)
-execute bind10 and see what happens?
-unable to start, why? Because user is running as shane not as root… "fail with unknown exit status"
-why did msgq fail? time-out. didn't record its pid. BUG.
-first error has a wrong message, is "unable to connect to c channel" the cause? it's the symptom.
-there is no logging for msgq itself.
-installed as root but not running as root? permission issue?
-can't create a socket but does not tell us this.
-msgq needs to issue a standard error if it fails immediately.
-look up error messages in documentation?
-look up error message in google.
-brings up random ticket discussion from the trac site.
-"this is not my problem"
-opened the messages manual. Found message, but it doesn't really tell us anything useful, only things we already know. ("BIND10 STARTUP ERROR")
-there really isn't a good note in the error output that BOSS is shutting down. Should say "exiting".
-output was too verbose, esp. "INFO". Things get repeated twice. INFO is supposed to be marking significant events.
-time to try as root.
-that worked fine, but yes, we have way too much output. Need to reduce verbosity.
-when it tells us it started up, it should tell us the version number.
-in this version the buffering bug isn't fixed yet.
-"nothing runs as a daemon so I'll just use my shell to do it"
-dig @localhost -t txt -c ch version.bind gives a weird answer because of the local resolver

  • netstat
  • named isn't listening on TCP, why?
    -ok, we're running.
    -now I need documentation.
    -oh look, the build finished.
    -"I don't care about these details!"
    -We discuss configuration before the command line tool. Being given commands without knowing where to put them.
    -hard to find the right place to learn basic command line syntax
    -finally found the "control and configure user interface" section.
    -tried looking in quick start and found "load my zone"
    -grabbed a zone of Jelte's via axfr (a signed zone)
    -ran b10-loadzone
    -sudo failure
    -cannot parse RRtype 655 (BUG)
    -removing offending RR type
    -ran it again and it worked. cool.
    -ran into known bug "DATASRC_QUERY_NO_ZONE"
    -went to TCP when Joao queried, because the answer size is soooo big
    -setting up a secondary….
    -looking in the guide…
  • section "Secondary Manager: Configuration for Incoming Zone Transfers"
    -right now you have to configure the zone twice
    -then it says log in as root but we don't want to do that and maybe the terminology is wrong (bug already filed)
    -that worked
  • trying it with ipv6…. the configuration seems to have been correct, but, now what happens
    -"this is a weird thing, you also need to configure zone manager"
    -ok the instructions do say that
    -"this is a lot of work"
    -have to cut and paste one line at a time from the bind 10 guide to get the right config commands.
    -shane was trying different ways to exit
  • when we tried exit, we got: "Error! command name is missed"
  • it should say something like you should use quit instead of exit.
  • now a dig for time-travellers.org brings up the zone correctly
    -secondary setup is working
    -checking the log….
    -"whaaat?" "Failed to create IXFR query due to no SOA for time-travellers.org"
    -it thinks time-travellers.org has no SOA, but it's fine, it just will make the admin worried for no reason
    -it should have said that the zone is empty at this time. we were just following the documentation.
  • BUG is that we should not give an error for a normal condition.
    -need to add multiple zone masters support
    -other people would want TSIG for that
    -we don't have global TSIG key everywhere, and it may not be documented; it is documented in a manpage, but
    -"are ACLs described anywhere?"
    -not finding the right TSIG information in the guide, moving on for now
    -it's not putting anything in syslog, I want to figure out how to do that
    -looking in the bind 10 guide for "syslog", which is there. Logging Configuration section.
    -"this is too long"
    -"why do we need to add loggers to our logging?"
    -adding a new item to the list.
    -"maybe we ought to ship a config file with some of these basic logging things already created"
    -setting destination for syslog
    -good error message when the destination was not correctly set
    -"i don't know, did that work?"
    -now i'm kind of stuck
    -hard to tell what is going on with logging at this point. we need a default logger so you can see it.
    -Nice feature to have: saving command history to a file. currently there is no persistent history.
    -trying to enable xfrin
    -error correctly logged.
    -reading manpage for the go command
    -config go gave us a nice list of logs
    -Boss shutdown help brought up some help options but not what we needed
    -Boss shutdown didn't really work right, the shell does admit that it's done but it dumped us out
    -It didn't say "exiting"
    -therefore maybe should not exit when it can't connect; instead of just quitting it could go to a "disconnected mode"
    -it could also just return to the prompt
    -the command tool will handle this better.
    -should not have to go to the guide to configure things
    -long term we could have default profiles for more than just logging, and a wizard

Packaging issues

-debian - need to work with package maintainer

Benchmarking Discussion (Stephen to insert notes)

Jinmei shared a command line tool benchmarking experiment.

List of benchmarking tasks we need:

b10-resolver:

-+ ns -

b10-xfrout

-+ A l.root-servers.net

dig -+ axfr for bigzone.example

queryperf (?) for IXFR

Startup and shutdown time benchmarking:
Startup Speed:

in-memory

lots of zones (master)

" " " (slave)

graph time from startup to ready to serve.

Need tickets for:

Benchmark tools into git
"" publication

html for descriptions

auth tests we have are ok for now.

What is the threshold for a significant change? Need a baseline we can always compare against as well as the "warning" level

Right now this is ad-hoc.

Going forward, treat a major performance drop like a build failure. Immediate high priority ticket.
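
A minimal sketch of how a baseline comparison with a "warning" level and a build-failure level could look. The threshold values (5% and 15%) and the names `Thresholds` and `classify` are assumptions for illustration; the meeting did not settle on numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    warn_drop: float = 0.05   # assumed: warn when qps drops 5% below baseline
    fail_drop: float = 0.15   # assumed: treat a 15% drop like a build failure


def classify(baseline_qps, measured_qps, thresholds=Thresholds()):
    """Return 'ok', 'warning' or 'failure' for one benchmark run."""
    if baseline_qps <= 0:
        raise ValueError("baseline must be positive")
    # Relative drop against the baseline; negative means we got faster.
    drop = (baseline_qps - measured_qps) / baseline_qps
    if drop >= thresholds.fail_drop:
        return "failure"
    if drop >= thresholds.warn_drop:
        return "warning"
    return "ok"
```

A scheduled benchmark job could call `classify` for each benchmark and open a high priority ticket on "failure".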

Need to measure both TCP and UDP performance. (Need to patch queryperf to support TCP.) Digression about TCP behavior and RFCs....

Wednesday 11 January

DNS Benchmarking

Need to take care about performance - some patches can actually slow the code down.

Have a box for benchmarking. Run benchmarks twice a month (everything is scripted). Run queryperf - collect speed and resource usage. Use various zones:

  • small zones
  • 10k zones with 15 records
  • 100k zones

… with varying parameters.

Results are published to the ~jreed directory on the BIND 10 server. The plan is to include the procedures in git to allow the developers to run them. The plan is also to extend this to DHCP, and to send a notification if the performance drops below a particular threshold.

Should publish to the main site.

Metrics

  • Startup speed more important than performance in some environments (e.g. lots of parked zones).
  • If we can transfer zones from SQLite to in-memory, that would be a winner.
    • Not on the immediate plans.
    • Some operators may not want faster startup that serves with degraded performance during the startup.
  • Shutdown is fast unless components are misbehaving. May want to add a debug mode later, perhaps to check for memory leaks.
    • Not really an issue.
  • Authoritative query performance is a big concern for most people.

Hotspot cache can skew benchmarks if entire zone fits in cache. Need to choose query mix carefully.

Discussion of benchmarks

Query Performance
Graphs are at http://git.bind10.isc.org/~jreed/bench

For NXDOMAINS, queries for built-ins have dropped to zero since about May 2011 - this may be due to excessive logging.

Transfer Efficiency
Lower compared with BIND 9. Could be due to many reasons, e.g. use of Python.

In Python code, could logging be a problem? It could be that we evaluate an object and pass it to the logging code even if it is not logged. This needs to be checked - a ticket has been created for it.
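
To illustrate the concern, here is a small sketch using Python's standard `logging` module as a stand-in (an assumption: BIND 10 has its own logging wrapper, which may behave differently). It counts how often an object's string form is built when the message is suppressed. The `Expensive` class and logger name are made up.

```python
import logging


class Expensive:
    """Stand-in for an object whose string form is costly to build."""

    def __init__(self):
        self.str_calls = 0

    def __str__(self):
        self.str_calls += 1
        return "expensive-description"


logger = logging.getLogger("xfr_logging_sketch")
logger.setLevel(logging.INFO)            # DEBUG messages are suppressed
logger.addHandler(logging.NullHandler())
logger.propagate = False

obj = Expensive()

# Eager: the string is built before the logger decides to drop the message.
logger.debug("state: " + str(obj))
eager_calls = obj.str_calls

# Lazy: formatting is deferred, so __str__ is not called for a dropped message.
logger.debug("state: %s", obj)
lazy_calls = obj.str_calls - eager_calls

# Guarded: skip any argument preparation unless DEBUG is enabled.
before = obj.str_calls
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("state: %s", str(obj))
guarded_calls = obj.str_calls - before
```

Only the eager form pays for building the string; the lazy and guarded forms do not.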

Idea: create a null data source to check efficiency of the code. (Without a null data source, an in-memory data source may be helpful.) A null data source could also serve as an example if people want to build their own.
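
A rough sketch of what a null data source might look like. The `DataSource` interface here is hypothetical and much smaller than the real BIND 10 data source API; it only shows the idea of a backend that does no work, so that any time measured is spent in the surrounding code.

```python
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Hypothetical minimal data source interface (not the BIND 10 API)."""

    @abstractmethod
    def find(self, name, rrtype):
        """Return a list of rdata strings for the given name and type."""

    @abstractmethod
    def iterate(self, zone):
        """Yield every (name, rrtype, rdata) tuple in the zone."""


class NullDataSource(DataSource):
    """Holds no data and answers immediately."""

    def find(self, name, rrtype):
        return []

    def iterate(self, zone):
        return iter(())
```

Someone building their own data source would replace the two method bodies with real lookups.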

Recursive Resolution
Need benchmark. More variability in behaviour, e.g. concerning the caching of NS information. So sending more packets may result in better performance. Having a standard series of tests would be useful.

Need to spend more time in next F2F meeting when we are closer to implementing recursion. Points:

  • Stability of memory consumption is important.
  • Need also to measure performance over a longer period to discover any issues concerning fragmentation of the cache.

Runtime provisioning
(Anything that is changed on the server, e.g. how long does it take to add a zone.) Could include programs like loadzone.

DDNS Performance
Do have nsupdate so we could wrap it in a benchmarking script.

What tests do we have to do? Most primitive level is to know the maximum possible update rate.

Also need to know how updates affect query processing
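
A possible starting point for wrapping nsupdate in a benchmarking script. The function names, host names and addresses are made up for illustration. The batch uses only the standard nsupdate commands `server`, `zone`, `update add` and `send`. Note that it sends all updates in one transaction; a real benchmark would probably also want one update per `send`.

```python
import subprocess
import time


def build_update_batch(zone, server, count, ttl=300):
    """Build nsupdate input that adds `count` A records in one transaction."""
    lines = [f"server {server}", f"zone {zone}"]
    for i in range(count):
        # 192.0.2.0/24 is a documentation prefix; the host names are made up.
        lines.append(f"update add host{i}.{zone} {ttl} A 192.0.2.{i % 254 + 1}")
    lines.append("send")
    return "\n".join(lines) + "\n"


def run_nsupdate(batch):
    """Feed the batch to nsupdate on stdin (needs nsupdate on PATH)."""
    subprocess.run(["nsupdate"], input=batch, text=True, check=True)


def measure_update_rate(batch, count, run=run_nsupdate):
    """Return updates per second for one batch, timing only the run itself."""
    start = time.perf_counter()
    run(batch)
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")
```

Running queryperf against the same server while this is in progress would give a first look at how updates affect query processing.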

DNSSEC Performance
How long does it take to sign a zone, how long to re-sign a zone. Bit early until we have DNSSEC management in the code.

Performance

Include memory usage in performance measurements.

DNSSEC

Discussion on DNSSEC management tools. Performance of these tools can be an issue.

Profiling discussion

We would like a make target to do some basic profiling, but some profiling techniques need kernel-level switches to be enabled (oprofile for example).

So first a discussion on various profiling techniques.

oprofile
kcachegrind
gprof

possible approaches: manual (with description), make target to do full profiling session, make target or configure option to just set up any system for easy profile runs.

For some things, we can use specific tailored benchmarking tools (like src/bin/auth/benchmarks/query_bench).

Proposal: Initially make a --with-gprof configure option, and a make profile target which runs query_bench with some set of data.

For fine details, we will probably want more sophisticated tools.

Can we use commercial software? It is tricky, since we like to instrument as much as possible. We are not opposed to it, but it is not the first thing we would look at. Haikuo recommends Jumpshot as something to consider. Jumpshot takes a specific log output format and provides visual tools for inspection. We would have to add such output, but it is certainly worth considering.

The proposal is to look at these more advanced inspection methods, and initially do the gprof approach with query_bench.

We would at least need a few shell scripts and documentation on how to reproduce.

Also document some other tools, like oprofile.

Things To Do:

  1. Add --with-gprof configure option and make profile target (running query_bench)
  2. document how to run with gprof (in general)
  3. document how to run with oprofile
  4. document how to run with valgrind/callgrind/kcachegrind
  5. (longer-term?) Research what needs to be done to use Jumpshot

Meeting adjourned; we will be running some initial profiling tools and base next discussion on any results we find, in about an hour.

(people showing initial results; most time spent in renderer, some discussion on tolower, need of optimizing name compression was already known)

Some discussion of good optional optimizations:

  • name compression
  • case comparison optimization
  • buffer array check (quick test: 34000->42000 when removed, 40000 when made an assert)
  • hash-based pre-lookup (this may have some serious memory drawbacks, low prio)
  • additional section cache (low prio)
  • (re)optimize asio wrappers (some efficiency got lost when making resolver and auth asio wrapper one shared one) (low prio)

Given the time in the rest of the year, it is not deemed worth it to instrument all methods of profiling for repeated use. So we should probably instrument the one (gprof?) and document the others.

Sprint Planning

We had a sprint planning session, minutes are at the SprintPlanning20120112 page.

Thursday 12 January

Use cases discussion

The idea for this session is not to design the use-cases, it's not to go through things step-by-step and say what the user expects and how it works. My idea was rather to come up with a list of use-cases, but not go into the use-cases themselves.

We can do 'this is what you can do' or 'this is what bind users might want'.

Are we looking for statements or descriptions of things that administrators want?

A lot of us could share real-world examples of things people would want.

For example (Jeremy): "if i change an address of one of my servers, i want it to be automatically signed and distributed to my secondaries."

Some discussion; what would happen if someone modifies a record in sql? (in a signed zone...)

Michal: If I have database, I want to use that as the source data.

PowerDNS does that pretty well.

Michal: I might want to transfer a zone from one datasource to another

Jelte: I want a better nsupdate tool (for instance one that can query too)

Michal: Nice GUI

Jelte: actually, perhaps just better building blocks to make guis (other people may be better at that)

Haikuo: Key management tools

Michal: on/off switch for dnssec for a zone 'just sign this, i do not care how'

Stephen: sign multiple zones with one key, multiple keys?

Jelte: Supermaster idea too

Stephen: zone groups

Shane: Zone templating

Stephen: not copy in memory for each 100k similar zone

Jelte: different zone file formats. Jeremy: is there an xml schema for zone files? Stephen: Jay Daley and I came up with one at some point, but never submitted a draft.

Stephen: dig and similar tools (possibly as a separate package)

Stephen: i want to upgrade bind10 without losing service

Michal: run bind10 across multiple machines

Jinmei: might be the same thing, some way of auto updating the version

Jeremy: some tool to report how many queries it can serve on the current system. 'currently serving 50000qps out of a possible 70000 qps'.

Shane: new module: tftp server

Jinmei: details of status of various components of bind 10. For instance: status of secondary or primary, sometimes i saw that transfer didn't happen, so i would like to know what the last time was when it worked, and when it intends to try again.

Shane: in that regard, i'd like to see what a server is master for, and when the last transfer was. serials of those, etc. Joao: I would want this exposed in an API manner Shane: it would most likely be a set of message channel commands

Shane: also currently running transfers, re-signing status etc.

Shane: Role-based permissions

Jeremy: load zones but not serve them (yet)

Jeremy: recursive statistics. Last 10 queries would be nice

Michal: ttl limits, send all outgoing lookups to specific servers

Shane: something like top(1) for our dns operations

Jeremy: maybe also something like that just for resource usage by processes Michal: isn't OS tools enough for this? Jeremy: possibly, but maybe nice to have it specifically for the things bind10 manages

Stephen: cache editing

Shane: also, cache rules, and disabling of dnssec checking for certain domains

Stephen: and 'do DNSSEC, but only report errors, and not fail'

Haikuo: alternative behaviour based on network hardware (InfiniBand for example)

Jeremy: resolver *must* validate option

Jinmei: receiving icmp error messages and using that information for some purpose, even if just logging

Jinmei: and keep data on failures per msg size

Michal: keep tcp connections open to busy servers

We can probably go on forever, so at this point this topic was closed.

Zone loader

Currently we have two ways to load zones:

  • b10-loadzone: started as a short-term hack, no real parser, writes directly into the database. It was never meant to do serious stuff.
  • In-memory loader: integrated into libdns and datasource, with a small function that connects the rbtree to the loader.

The latter is very strict about the types of data it can load, essentially input has to be flattened first, and there are a lot of things it fails on (cname+rrsig for instance).

What we need is a generic way, kind of a front-end to the compiler, that can read zonefiles.

Perhaps a generic dns zone reader, that can read data in some format and put it into another

  • zone file
  • sqlite
  • ...

Joao mentions that parsing zone file and loading data are separate things (for instance, we will also want some zone checker, that parses zone files, but not loads them).

Michal: Perhaps we can implement this as a datasource, the datasources we have have certain capabilities, querying, iteration, etc. this one could only support iteration (at which point it reads rrs from a zone file)

Stephen: when you say zone loading i see a unix pipeline; first stage replaces directives and flattens rrs, second one validates rrs, third one interprets full zone data
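
The pipeline Stephen describes could be sketched with Python generators as below. This is an illustration only: the function names are made up, and the parser handles a heavily simplified zone format (owner name always present, no parentheses, no quoted semicolons, names inside rdata left as written).

```python
def absolute(name, origin):
    """Make an owner name absolute relative to the current origin."""
    if name == "@":
        return origin
    if name.endswith("."):
        return name
    return name + "." if origin == "." else name + "." + origin


def expand_directives(lines):
    """Stage 1: apply $ORIGIN/$TTL and yield flat (name, ttl, class, type, rdata)."""
    origin, default_ttl = ".", None
    for raw in lines:
        fields = raw.split(";", 1)[0].split()   # drop comments (simplified)
        if not fields:
            continue
        if fields[0] == "$ORIGIN":
            origin = fields[1]
        elif fields[0] == "$TTL":
            default_ttl = int(fields[1])
        else:
            if len(fields) < 3:
                raise ValueError("short record line: %r" % raw)
            name = fields.pop(0)
            ttl = int(fields.pop(0)) if fields[0].isdigit() else default_ttl
            rrclass, rrtype, *rdata = fields
            yield (absolute(name, origin), ttl, rrclass, rrtype, " ".join(rdata))


def validate(rrs, known_types=("SOA", "NS", "A", "AAAA", "MX", "CNAME", "TXT")):
    """Stage 2: per-record checks, independent of any zone-level knowledge."""
    for rr in rrs:
        name, ttl, rrclass, rrtype, rdata = rr
        if ttl is None or rrclass != "IN" or rrtype not in known_types or not rdata:
            raise ValueError("invalid record: %r" % (rr,))
        yield rr


def interpret(rrs):
    """Stage 3: collect the full zone and apply zone-level checks."""
    zone = {}
    for name, ttl, _rrclass, rrtype, rdata in rrs:
        zone.setdefault((name, rrtype), []).append((ttl, rdata))
    soa_count = sum(len(v) for (_n, t), v in zone.items() if t == "SOA")
    if soa_count != 1:
        raise ValueError("expected exactly one SOA, found %d" % soa_count)
    return zone


def load_zone(lines):
    return interpret(validate(expand_directives(lines)))
```

Because each stage only consumes the previous one, a zone checker could reuse the first two stages without loading anything into a datasource.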

Jinmei gives a bit of description of what we have now; libdns++, masterLoader takes zone format, parses it and outputs a stream of RRset objects. Accepted format is currently very limited. We would also need an actual Loader, which takes those RRsets and puts them into the in-memory datasource. We have that as part of the authoritative server (NOT the inmem datasource). This loader should also support other backends (database datasources etc).

Do we do zone-level checks? The parser does not, but the loader does some (currently in the inmem datasource).

Discussion on where to do semantic checks; you want to do these checks outside of the actual loader (since you may want to do something like zonecheck without loading it into some datasource), but need to be careful about the pipeline model.

We want the checking to be sufficiently generic that it can be done either when data is being put into the datasource or independently (without using a datasource at all).

What about writing zonefiles?

It can be really easy; just iterate over your rrs and print them. A slightly better one gives multiline output.
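
As a sketch of the "really easy" case, assuming records are available as (name, ttl, class, type, rdata) tuples (an illustrative representation, not the libdns++ one), flat output is one line per record:

```python
def dump_flat(rrs):
    """Yield one zone-file line per record, with no directives or comments."""
    for name, ttl, rrclass, rrtype, rdata in rrs:
        yield f"{name} {ttl} {rrclass} {rrtype} {rdata}"
```

Multiline output would only change how long rdata such as SOA is wrapped.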

But if you use text input files, you might want stuff like preserved comments. Men & Mice apparently do this. We don't know how they do this and in what use-cases it works and does not work.

One approach would be to have some (complicated?) preservation logic for order, whitespace and comments. Another could be to never touch the input file at all (and dump to another one, and have a way to see the differences between the previous input file and the current one when it is 'reloaded').

Stuff like this is probably best declared a research project (i.e. year 5 work). But we need zone output soon. We will probably get away with simple multiline output, or probably even flat output.

Like comments, we would also lose directives. Perhaps we can have something like a hook that could do this.

Also possible in that regard is something that takes flat one-line output and makes it into one that has directives again, etc.

Back to loading.

Some discussion on where to actually do the loading, especially for different datasources. What should the database datasources do on startup? (inmem loads the zones then.) What should inmem do when a zone is changed (b10-loadzone trigger?)

We could always require a loadzone into sqlite, and inmem would load it from sqlite (so that external behaviour is the same for different datasources).

We should probably first define what we expect the process to be (for all datasources), and what b10-loadzone's place is in that, and what its behaviour should be.

A high-level question is, do we want to keep bind9 behaviour? (load zones on rndc reload and on startup), or have one consistent way (only update actually served zone data when loadzone is called, NOT on restart).

Joao: seems to me that all use-cases you are talking about are the same; there is an internal store, and external data, and the system needs to decide which one takes precedence.

Jinmei: right now, i only have 1035-format zonefiles, and always edit that when i change zone data. I use ZKT for dnssec, so it periodically reads that source file, and when i change it i call it manually. It calls b10-loadzone which updates the sqlite3 database. One problem is that if i edit but forget to update, my changes are not reflected, and we should probably solve that.

Shane: I think we are talking about two things, one is to improve the loader itself, and another is to probably monitor for changes.

Michal: I think we are making this much more complicated than it should be. In essence I thought we were designing the inmem to be kind of a cache that just pulls its data from some source (either zone files or some database or anything). By hiding this from the user you are only making it confusing and harder for ourselves.

So how to proceed?

In the end, we need the following steps:

  • make libdns++ understand full zone file format (including directives) (Y3)
  • fix in-memory checks (Y3)
  • move the checks out of in-mem and make them generic for any datasource (Y4)
  • load it into datasources (Y3) (i.e. get rid of b10-loadzone (Y4))
  • load inmem zones from other datasources (Y4)
  • xfr zone load hack (Y3?)

Thursday afternoon

Testing

Coverage Testing

Overall, coverage is not too bad. We are mostly going by lines - lcov seems to report low coverage of functions, but high coverage of lines.

Should we monitor coverage and warn if the coverage goes down? Set up to do this is not difficult, so why not try it? Why not just warn if the coverage goes down? (Problem if the warning is ignored, but it will be OK for now.)

Action: Set up a job to monitor coverage and warn if coverage goes down (on a per-entry basis).
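
The check such a job performs could be as simple as the sketch below. It assumes coverage figures are available as a mapping from entry (for example a directory) to line coverage percentage; how they are extracted from the coverage report is left out, and the function name and `tolerance` parameter are made up.

```python
def coverage_regressions(previous, current, tolerance=0.0):
    """Return {entry: (old, new)} for entries whose line coverage dropped.

    Entries that only exist on one side (added or removed code) are ignored.
    """
    dropped = {}
    for entry, old in previous.items():
        new = current.get(entry)
        if new is not None and new < old - tolerance:
            dropped[entry] = (old, new)
    return dropped
```

The job would warn whenever the returned mapping is not empty.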

Note that some unit tests check correct behaviour rather than low-level stuff. However, this is no problem - all tests need to pass.

What about long tests (those that take more than a few seconds to run)? We also still see occasional timing errors (although these tend to be relatively rare); better to live with the pain than avoid the test.

Unit Tests

Plea for more comments in unit tests. In practice, test code is written carelessly. Normal review process should cover this, but many don't.

Reviewers

Should we have multiple reviewers? Problem is that multiple reviewers slows down the work - takes more time (more expensive).

We have had some pieces of code that need more review. Perhaps we should do more reviews on certain pieces of code? But at the moment, there is no way to know when to have another reviewer - perhaps when coder and reviewer disagree?

Should probably do a review of bugs - how many are incorrect implementation, how many due to misinterpreted requirements?

Lettuce

Looking at the Lettuce output:

Action: Check if Lettuce can produce HTML output.

Must have requirements so that we know what we are testing. Would be useful to have some comments in the output. Perhaps a null comment step for including comments in the output?

Agree that we need to document requirements - decompose RFCs into the basic requirements. It would be a good idea to keep requirements and tests together. Jeremy showed some XML that could keep the requirements and the tests in one place.

Need to do requirement coverage within Lettuce before committing to a major effort of system testing using the software.

One problem is that the test output is so verbose - need a summary output (i.e. tests: pass/fail/skipped). If we are doing behaviour-driven development, we write tests before implementation so that they fail. (If we are doing an XML-based system, we could use the XML to give the status of the tests.)
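
A summary line of the kind asked for could be produced as in this sketch. It assumes the per-scenario results have already been collected as (name, status) pairs; it does not use any Lettuce API, and the status strings are assumptions.

```python
from collections import Counter


def summarize(results):
    """Collapse (scenario, status) pairs into a one-line summary."""
    counts = Counter(status for _name, status in results)
    return "tests: {} passed, {} failed, {} skipped".format(
        counts["passed"], counts["failed"], counts["skipped"])
```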

NSEC3

Lettuce tests for NSEC3? Ideally want one (Lettuce-experienced) person to write the tests - but do we have the time?

Must write unit tests - nice to write Lettuce system tests, but depends on how much time we have. Do this at the end (so no behaviour-driven development). However, we can't do this in the time remaining for year 3.

Other Testing

  • Will set up a clang server for testing.
  • cppcheck reports are automated.

Fuzz testing can find bugs at the basic level. However, the experience of NSD is that it found a number of problems.

Stress testing - how can we find the 95% line - how does the behaviour of the server change as the load approaches the limit?

msgq to D-bus

D-bus turns out to have a problem; the python bindings for D-bus use glib, which is not only yet another dependency, it is also licensed under LGPL. Additionally, the python3 bindings have only just been submitted for merge, so these won't show up on real systems for some time yet.

But the license is a bigger issue; for a mandatory subsystem, we can't use LGPL libraries. These bindings are quite large and we can't simply write our own.

So, if we still intend to use D-bus, we should find out how exactly we would use it, and see if we can have something similar to our asiolink and cryptolink libraries; which abstract out the functionality we need from the underlying libraries.

If this is possible, we can wrap that and we don't need actual d-bus wrappers for our python modules.

If this is not possible, it looks like we might need to start looking for another alternative to replace msgq.

Views Design Discussion

When we ask people about views, it's too generic.

Different people want different things - we need to survey them more specifically than just "how important are views"

Enterprises mean split DNS.

The appliance vendors mean an internal vs an external network.

Could we have the receptionist and multiple auth servers doing different places? Yes, maybe.

Views can also be used in recursion to override the auth server, though.

BIND 9 views allow multiple configurations of everything.

Depending on how we choose to implement views, we might not even need the receptionist. Views may not be based on IPs; they may be based on TSIG keys. If your system turns into completely separate configs for each view, you might as well just run two instances.
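
To make the IP versus TSIG distinction concrete, here is a small sketch of view selection. Everything in it (the `View` class, `select_view`, the key and view names) is hypothetical. It assumes views are checked in order and the first match wins.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class View:
    name: str
    networks: list = field(default_factory=list)   # ipaddress network objects
    tsig_keys: set = field(default_factory=set)    # names of accepted TSIG keys


def select_view(views, client_ip, tsig_key=None):
    """Return the name of the first view matching the key or the source address."""
    addr = ipaddress.ip_address(client_ip)
    for view in views:
        if tsig_key is not None and tsig_key in view.tsig_keys:
            return view.name
        if any(addr in net for net in view.networks):
            return view.name
    return None
```

In a receptionist model, the result would decide which auth instance the query is handed to.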

If we're going for a receptionist model anyway, this is probably the logical place to put views.

We also need to figure out how zone transfers would work. We could have multiple IXFR in and out. Receptionist would take over transfer from auth. We have user stories already.

Review of John Dickinson's name server control protocol. In this, views are not really views and maybe should be named something else. Maybe views here are really a "virtual server" or "zone groups"

We might ask the vendors how they implement views - we can do that at the Open Day!

We can and will document some implementations (simple ones) we could recommend for achieving similar goals to views. (socket creator plus multiple auth instances, etc)

Saturday 14 January

Open day review

Barry believes that we need to know what we are doing so that we know when we've done it.

For the first open day we didn't have any specific goals, we just wanted to see what happened. For this one, that wasn't true, but we didn't really write down what the goals were.

Retro-goal: get enough attendees to overflow our own facilities. Reached.

The attendees were interested in the slide decks; when they are on the wiki, Larissa will inform them.

For getting to meet users, it was very useful.

Shane: Learned that recursive performance is very important. Joao: Not entirely, what I got was that threaded performance for auth was important, and recursive performance in general. Shane: afaict, auth perf has an upper bound, and recursive does not.

Some discussion on what upper bound means; for auth this would be echo. For recursive it is harder.

Shane: in my mind, demand on performance is higher on the recursive side. Graff: we are fast approaching the limit of what can be reached with bind9. So there is an opportunity for 10.

Joao: What I'm saying is that focus should be on both.

What we need to do on the recursive side is define a way to measure it. Larissa: I talked to someone yesterday who claimed they had a method (CloudShield?), I can see if I can get more information from them. Maybe they can lend us a box or something.

Jinmei: there are many topics related to performance; just improving qps is one thing, but faster load and update and memory usage are also things to consider.

Stephen: regarding that, there is often a trade off between memory usage and raw performance, we could consider some tuning parameters here. Shane: yeah... but that could also make us lazy 'just leave it to the user'

Tomek: Do we have a list of the people that attended the DHCP session? Larissa: We have a plan to have a list, there is no list at this moment, but we do know who were interested in DHCP.

Jeremy: we should think of questions we want to ask attendees ahead of time. Jinmei had a few good ones, but we could show them on screen so people can think about them, etc.

Jeremy: Also, we didn't know what the others were going to talk about, so there was a bit of repetition.

Graff: ease of use of DNSSEC. This was brought up by a lot of people. Larissa: even from people you wouldn't expect it from.

Graff: universities seem to get into it now; some have it, and if one has it, the other ones must follow (pride thing)

Michal: I talked to a bank and they weren't really interested in DNSSEC (but they did know about it) Jelte: most banks don't really see the added value, since they generally have put in a lot of effort getting secure mechanisms on the current net Shane: is that still true with the crumbling state of SSL? Graff: I don't think they really care; they just need someone to blame if something goes wrong, and dnssec does not really provide that.

Stephen: Do we need a dhcp roadmap? (for instance for extra funding) Larissa: we should be looking for more DHCP funders, and there were a few there that may be interested.

Joao: not really need a three year plan Shawn: there were people there wanting hard numbers: when can i start running this? Also, basic transition strategy will also be necessary.

Joao: what we should tell them, we are not like closed source vendors, who just tell people 'this is what we have', we ask people what they want.

Stephen: I was wondering what we should do longer-term, maybe some form of newsletter Larissa: interaction is most important, yesterday was very useful for planning.

Stephen: doing a roadmap allows us to define individual work items, with which we can go around to people. Joao: sure, but we don't need to feel guilty about not knowing what will happen in two years.

Larissa: I think they were happy about what we told them yesterday.

Shane: is it somebody's job at ISC to look for new sponsors? Joao: for bind 10, it was Norm's task, but he got busy with registry services.

Situation in bind9: we are sometimes implementing features just because some people think they need them, and not necessarily for the good of the software long-term.

(sorry, missed a part here by larissa and shane about account managers etc)

joao: one thing we should take into account; these companies that have some money for sponsoring, cannot sponsor something that will appear in three years. it must be for something that already exists or that they can see done soon. Otherwise, yes, we will have this splendid platform, but people will see it and go 'but this doesn't do anything'.

Shane: the dhcp work is supposed to be just that; we have something that can actually be run and used.

Joao: for instance, we have a nice logging platform now, but really we have a logging client that does not really do DNS. I'd rather have people complain about logging than about it not doing dns.

Shane: so what now? Joao: go through the list of features and requirements, call it a roadmap if you wish, and talk to these people. e.g. performance stuff: there was a good presentation about <who?> about how they set up their recursive farms and what their issues were, and it had some real numbers there.

Joao: for future open days, i think people got what they were looking for, there was at least one person asking when the next one would be, so that is a good sign. We should probably not have one at the next f2f yet (since that will be year planning), but I suggest we do one at the next one (around august?). I think we can do better about telling people what our ideas are, and have people respond. People don't tend to take initiative for this.

Shane: my concern with letting the business folk handle getting sponsors, is like, hollywood will make a movie that is similar to the last one, since that worked, and similarly sales people would be most comfortable selling support for existing software.

Joao: yes, but we are the engineers, so let's first get the technicalities in order, and then talk to the business people.

(going through agenda now)

Shane: oh I talked to someone at infoblox, he pointed out that if we make DNSSEC really easy, we are taking business away from them. Graff: they told us the same thing for multimaster, kind of. They really just want us to be their engine. Shane: well we're not competing with customers if they're not our customers. Shawn: it's almost always the case that you compete with your customers on some points, and complement on others

Shane: anything else?

Jinmei: for future events like this, it would be nice if we had something appealing. For instance, a number of people are interested in performance, and we don't have that now. I'm quite sure we can improve it, but for people attending such a message can be quite disappointing. So ideally it would be nice if we see these things as a deadline for Nice Things, so we can talk about those.

Jeremy: we have 6 to 7 hours there, maybe we should allocate one hour as a tutorial, and not just 'marketing' Michal: one thing i thought of, if we had a working resolver, we could run that and say 'you are using one on your wireless right now' Shawn: we might want to do those as breakout, dhcp wouldn't be much interested in dns tutorial Tomek: people may be too technical for something simple as a 'tutorial' Shane: yeah a tutorial sounds like something telling me where to click Jelte: call it a "Hands-on session"? Graff: yeah, you should do 'demo' or 'tutorial' (screencasts for instance) separately

Jelte: regarding announcements/deadlines, do we know how many of the attendees also follow our announcements? Larissa: no, but that is one of the followup survey goals to find out.

Shane: we should probably summarize the day and send it to staff Jelte: yeah, but you'll want to type up something nicer than my notes :)

Shawn: other thing, I had some customers talk to me about dhcp stuff, if we don't do one for half a year, we won't get such feedback Joao: there is half a plan to have a similar sort of meeting on bind9 or other meetings Graff: Laura wants us to do things like this every time we get engineers together.

(some random chatter about reachability and openness of engineers)

Status reports

DHCP

Focus of DHCP effort was work that was defined by a statement of work for comcast, who sponsored, which was a skeleton server for dhcp and dhcp6; basically they just hand out one lease. Included in that was the development of some of the basic functionality: packet parsing, low-level i/o, some interface detection. Currently very basic, multiplatform stub detection (actually uses a file), and better detection on linux. Also created a performance tool. That statement of work was for delivery at the end of the year, and we sent out the mail with instructions on december 30. (a day early!)

(Larissa: some of comcast will be here next month, if we haven't heard anything by then we can ask, but we would really want to have feedback from john though.)

Next we will be adding a lease database, using the bind10 paradigm, we define an abstract lease database and make an implementation with sqlite

Jeremy: when will it do msgq and config things to really fit into the system?

Stephen: yes, that is part of the work planned now: refactor perfdhcp, message logging, use of bind10 mechanism. These were things that were not essential for running a skeleton, so they were sacrificed for the deadline.

Shawn: the amount of processes or threads we are having, people were not really against it, but aren't really comfortable. We should be aware of that, and we may need to explain how it works etc. Michal: just don't tell them right at the start, they might not even notice (see postfix), or at least won't consider this as much of a problem. Jelte: that reminds me of the documentation (later on the agenda), where we first tell about every single process. we probably should not do that.

DNS

Goal was production ready. We are not going to reach that. So focus now is looking at the functionality and provide something that the steering committee can start labtesting. That ends up being NSEC3 and qps performance (specifically for inmem).

We spent some time on wednesday doing a bit of profiling, and found a few (4 or 5) immediate areas that would be relatively easy to address, and we will be tackling those.

So hopefully in a few months, you will be able to run a server, load some nsec3 zones, and do some benchmarking.

For next year, we will first focus on usability. Beyond that we will need to turn our attention to the resolver: we need to get validation working, test it, and add functionality, and stuff like EDNS0 backoff. That may be a joint effort with BIND 9.

Graff: it's not just edns backoff, it is more general how we talk to auths and keep state on them.

Shane: i'm thinking we end y4 by looking at dnssec management.

Jeremy: some more status: next week new release. There's two releases scheduled in march.

Shane: our real objective for next year is to get people to start using it, and not because we pay them, but because they want to.

Documentation

BIND 10 guide is growing and will end up big - also written in a lot of different styles.

First chapter is too detailed - information should be there, but be in an appendix.

Need different documents for different use-cases. Don't know how to structure them. Suggestion: let's sort out the documentation we would like.

Do we need installation guide? People who are installing it, know what they are doing. (Suggestion that many people install BIND 9 from source; however, some people use distributions. Consensus: some people install from source, some from distribution.)

Suggestion: we should organise the one document so that people can find what they want. OTOH, everything in a single document has too much in it and becomes confusing. Docbook can generate both formats, so we can leave it for now.

Possible structure:

  • BIND 10 overview
  • Documentation overview
  • Installation guide
  • Quick start guide
    • Authoritative server
    • Resolver
    • Examples
    • Template files
  • Bindctl guide
    • tutorial
    • reference manual (part of BIND 10 guide?)
  • Troubleshooting guide (like Samba tree)
  • Developer's guides
    • Architecture overview
    • Hacker's guide
      • How to write a hook
      • How to write a data source
      • How to do logging
    • Tools overview
  • Contact information
    • include requests for translators
  • Standards supported (RFCs, BIND 9, …)
  • Migration guide
    • BIND 9, NSD
  • Doxygen stuff
    • Overviews
    • APIs
    • Tutorials
  • Performance/resource requirements

If we put stuff on the web site, we should allow comments. However, we should prune the comments and incorporate useful stuff into the documentation.

Would the Knowledge Base (KB) handle it? But if we do, we need to keep the documentation and KB synchronised. And how do we sort out the formatting problems? Summary: BIND 10 documentation is the master - KB may have some copy of the information.

Library Documentation: for Dibbler, two types of documentation:

  • Users' guide. (~90 pages, lots of examples.)
  • Developers' guide. (Explains architecture, compilation guide, optimisations.)

Most common usage of Doxygen is for documenting code, but can do overview and the like.

We will turn Doxygen warnings into errors (which will break the build). However, we will not do this for all the code at once; over time we will gradually increase the part of the code on which a Doxygen warning causes a failure.
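
One way the gradual roll-out could be sketched is a small build-time check that reads Doxygen's warning log and fails only for directories already brought under enforcement. This is a sketch under assumptions: the function name is made up, the directory names are illustrative, and the warning line shape `<file>:<line>: warning: <text>` is assumed to match the configured warning format.

```python
import re

# Assumed warning shape: "<file>:<line>: warning: <text>".
WARNING_RE = re.compile(
    r"^(?P<path>[^:]+):(?P<line>\d+): warning: (?P<text>.*)$")


def failing_warnings(log_lines, enforced_dirs):
    """Return the warnings that fall under a directory we already enforce."""
    failures = []
    for raw in log_lines:
        line = raw.strip()
        match = WARNING_RE.match(line)
        if match is None:
            continue  # not a warning line (progress output etc.)
        path = match.group("path")
        if any(path.startswith(d.rstrip("/") + "/") for d in enforced_dirs):
            failures.append(line)
    return failures
```

The build would fail when the returned list is non-empty, and the list of enforced directories would grow as each part of the code is cleaned up.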

Action: Shane/Jeremy: turn goals for documentation into a plan to get there.

Python documentation generated by Pydoc is less complete than Doxygen. The Pydoc stylesheet is not very good. Couple of points:

  • Many of the Python interfaces are provided by the C++ wrapper and they are not covered here.
  • Other Python projects use different documentation software.

(Review of some of the Python documentation.)

Doxygen also does documentation for Python. Will look at doing Python documentation using Doxygen.

Action: Jeremy - experiment using Doxygen for Python.

Suggestion: could use other DNS documentation as a check that we have not missed anything.

Packaging

Debian - a Debian user wanted to get all the dependencies packaged with us, but there was a sticking point with log4cplus. There is now an "intent to package" request filed for log4cplus; Eric Kom put a package together for us and sent it to the Debian mentors folks, and nothing has happened with that yet. Do we need to scream on the maintainers list?

We will try to talk to Andrew Pollock to see if he can help us. Then all the dependencies for all our builds would be there. We probably want to make a package but not actually submit it yet. We want to make sure it's not put into mainline too early. But getting into Debian is a good way to get into about a third of the Linux ecosystem. Maybe we can put it in very clearly labeled as experimental. Clearly state in there that this is not yet a replacement for named. We need to include specifications, build rules, and a subdirectory, for Debian.

This is similar to how RedHat and other projects do it. We should include the specifications in our source tree. Jeremy knows a release engineer at RedHat.

SuSE Build Service - you can submit the definition of the package and it will produce packages for all the types they support. Keith and Michal both recommended this.

Gentoo - Michal could make an ebuild for Gentoo.

How do we track these tasks - tickets? - put these in the non-developer ticket system.

Need to include contact information in the packaging so that people know where to send experiences and issues and whatever.

FreeBSD - the guy who already submitted a port seems to not be in communication now, so Jeremy will submit another one.

Jeremy has been invited to maintain Botan and other libraries for FreeBSD but hasn't said yes; we need to figure out if he will do this. He has also done work for PackageSource for NetBSD (also used by BlastWave for Solaris and three Linux distributions and Minix).

MacOS - Brew (like PackageSource) - source based, also installs as a non-root user by default. There was also MacPorts but it's not well maintained. There are more ports available but there used to be DarwinPorts too.

$100/year for putting it on the Apple Store to download. We could even charge some money.

Do we want to try to support Brew? Going the Apple Store route is a business decision; Joao and Larissa need to chat about it.

Brew as a free method is fine for now.

Jeremy will do NetBSD. He'll call it bind10-devel (also for FreeBSD).

In Debian etc. there will probably eventually be bind10-dns and bind10-core.

Solaris - we can suggest using BlastWave for now? Or SunFreeWare or OpenPackage. Maybe? Larissa will talk to Stacey Marshall about it.

OpenPackage does all sorts of packages actually, we probably want to talk to them.

OpenBSD - asked to support - Jeremy will work on packages. Their base system is another story.

If other distros appear and people submit patches or build files we can agree to put them in contrib.

Testing

  • custom build request overview - there is a wiki page.... but.... we can't find it. It was an email! Jeremy will make a wiki page.

Random side note: we need to update authors.bind on our new developers.

Test Lab

What's going on with virt3? Is it waiting on us? No, Jeremy gave some advice to Ops about it. At that point the KVM was no longer plugged in, so we need to make sure those things work. Michael will install it while he's here this week and just get it done.

We have a CentOS box we use for performance testing. The DHCP guys will need RedHat installed on it too for customer testing requirements. That system is also part of the Build Farm. When we need to do performance testing there's just a file touch to be removed.

RFC List

We touched on this a bit earlier in the week. We will have good documentation on this going forward, and we won't just list the RFCs; we will list the functionality within them.

Michael: but this caused a problem in BIND 9. Because well, there is no conformance standard. Except BIND. When they did the registry compliance stuff, they stripped out all the "complication". We want to keep the detailed list, but provide a simplified one for specific purposes.

A magical code pixie put code on Jelte's laptop that generates some lettuce conformance tests that relate to this.

We need another task to go back through old stuff and document RFC compliance. Someday. Jelte would rather we approach that with a new requirements test in the lettuce system form and the test contains the compliance test so we can prove we comply.

We need an intern....

Localization

Jeremy: maybe it's too soon to worry about this.

Need to figure out what localization hooks we need and then don't worry about adding other languages later. Figure out how to make it suitable for localization first. We decided we're not localizing exception text.

Jerry Scharf's document does include some localization guidance.

How do we keep the new localizations up to date when they change? Need to study what other folks do.

Handling User Support Issues

  • List Procedures

We had an actual user asking for some help on the user list and two people (correctly) answered the guy at the same time. So the current plan is to mention it in the jabber room when you plan to answer a user list comment.

We don't need any procedures on the dev list.

On the bind10-users list, we do need some procedures. We need to make sure things don't get dropped. This is the only current place to get help except the jabber room. Digression on putting a jabber chat client onto the website.

In bind-users, the official rule is if you're going to answer on bind-users you need to write a knowledge base article and point them to it.

Jelte: right now, I think we need to cherish the three users we have.

Shane: let's do this, but use the wiki. For now.

How do we make sure we don't drop issues? Shane will make sure, for now. If anyone sees an issue not answered please speak up.

For now we will also be prioritizing user-submitted bugs. They go into the next sprint unless we have a problem. We do send the user a fix first to ensure it works and then it goes into the next release. Most of the time, anyway.

Revisit these policies at the next face to face to make sure this level of engineer support is still scaling.

  • Security

How do we detect when we think there is a security issue?

email the user and security officer

Action: Larissa to write up the bind10 specific procedures as they relate to ISC procedures irt the lists.

  • Donated Code Procedures

We had a discussion at the ISC all hands about this:

Why do we accept code?

   Prevent forks of our code as we become more modular.
   We cannot possibly implement every feature.
   Build a vibrant community of developers
   Expected for OSS
   Transparency.
   Submitter joy and love of ISC.

How does the code arrive?

Capture submissions from multiple sources. 

Suggestion: ticket, tracked by someone at ISC, etc. When things are found, or mentioned, put it in this system.

Point people at the "How to Submit Code to ISC" article.

They have to sign off, copyright, license, etc. 

Inform contact that we have opened this ticket on their behalf. If they want, make them a reporter so they can be directly involved or monitor.

Tracker or whoever they contact examines it lightly, gives opinion on quality and usefulness. This is a QUICK review and generally assumes knowledge in that area or with this specific submission/submitter.

The outcome of this review may be outright rejection (already fixed, insane), security issue (defer to sec process, probably in parallel to this flowchart), recommend reject or accept, or that it is too big to review quickly. This status should be communicated to the submitter.

The PO or their delegate chooses the importance of this task and informs the scrummaster to place it appropriately. This then loops back into the previous types of decisions: reject, security, etc. This status should be communicated to the submitter.

Once the work is done, the acceptance feedback loop begins by notifying the PO and tracker.

The acceptance feedback loop includes the PO (acceptance criteria may include submitter's or tracker's OK and other acceptance requirements) and the PO either accepts or rejects the work. Normal SCRUM process is used here.

If accepted, or during other communication to validate the change, we provide the patch or a web reference to it, information about the changes we made during our work, a query about whether they want credit/mention and what that should be, and which versions we intend to apply this change to and, if known, when those will be public.

Note that ISC is working on a policy for this due by summer 2012.

The goal has to be to make it as easy as possible.

Long discussion about author status on commits. We will ask people how they want to be credited (name, email, neither, both?) and then only go further (like author status) if they specifically request it themselves.

Coding Issues

Code resilience and safety - assert() versus exception

Jinmei: this comes from a discussion in ticket #1198.

Discussion of whether to use an assertion or not in a situation where the only possible bug would be very internal. In this case the assert seems redundant for now but if we added code later we might need it.

Asserts are compiled in by default

The basic guideline is that if you've detected a potential memory corruption then using an assert makes sense, because at that point all bets are off anyway.

This is for a situation where you really cannot recover.

We do not want to say we always want to end the program in the case of a programmer error, however.

If you use exceptions at the beginning of a function without changing anything internally then sometimes dropping just one process is safe and we can continue, but this may not really work well.

In this case, if you don't assert here you might end up running code on the stack, so it's definitely a problem if there isn't an assert.

It is difficult to recover from the errors in the right way, and BIND 9 decided to always move to an assert, but do we want to do that again?

In this case you cannot trigger the assert externally.

We have exceptions we don't test right now.

Most of the exceptions we do have are memory allocation exceptions.

A good example of a place you can trigger asserts is when you're using some other library as your underlying code and it acts in unexpected ways. Such as sqlite.

C++ does not actually enforce that you catch exceptions.

Conclusion: If we used only assertions for detecting coding bugs, we would be no worse off than BIND 9 in safety and resilience. It would be equivalent. If we can then identify cases where we can detect specific coding errors which do not invalidate memory but where something else has been violated, and some people might want to continue from this error, would we be willing to do that?

Stephen: like in the last BIND 9 problem, it was triggering an assertion, what would happen here now?

There wasn't anything fundamentally wrong that wouldn't allow the server to keep running in this case. So maybe an exception would have solved it. And this issue may have been caused by lack of testing. So running with degraded service in this case would have been better than falling over. As long as it is this kind of bug, an exception is preferable.

Jinmei: so we know that after the evaluation but without that how do we know?

Joao: so the conclusion is that the programmer should think hard about assert/insist/exception when it comes up.

Shane: so in any CONST function you should probably be able to recover. Because while there is a problem, in principle nothing has been altered.

If the check verified you had corrupted memory, then yes, use an ASSERT.
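
The guideline above can be sketched roughly as follows. This is an illustration, not BIND 10 code: all names are hypothetical, and Python's `assert` (which raises `AssertionError`) only stands in for a C++ `assert()` that would abort the process.

```python
class RecoverableError(Exception):
    """A check failed, but no internal state has been modified."""


def lookup(zone_data, name):
    # Read-only ("const") path: nothing has been altered, so raise an
    # exception and let the caller drop just this one query.
    if not isinstance(name, str) or not name:
        raise RecoverableError("malformed query name")
    return zone_data.get(name)


def check_invariants(zone_data, record_count):
    # Internal state that can no longer be trusted: stop, do not continue.
    assert len(zone_data) == record_count, "zone table corrupted"


def handle_query(zone_data, record_count, name):
    check_invariants(zone_data, record_count)
    try:
        answer = lookup(zone_data, name)
    except RecoverableError:
        return "SERVFAIL"  # degraded service for this query only
    return answer if answer is not None else "NXDOMAIN"
```

Only `RecoverableError` is caught, so a failed invariant check still propagates and ends processing, while a problem in the read-only lookup costs one answer rather than the whole server.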

There are some compiler specific issues to be sorted out.

There are 19 ASSERTs in BIND 10 and two are comments.

Logging Issues

  • Verbosity Review: if you look at the readme file there are guidelines. We don't need to review the guidelines but the actual values. We currently have a few log messages at too high of a level today. We don't know if things are missing there which we would really want there. We can probably tackle this during our usability testing when it comes to the excess messages, but missing ones might be harder. Right now we sometimes have three messages for the startup of one process which is too many for sure. There can be debug levels set for more information. Our current standard level is info.
  • Formatting:
    • address/port we chose a format (RFC compliant) but we haven't gone through and changed the code to reflect this. We need a ticket to update these.
    • IPv6 socket binding - we need a format for logging these and including the interface name. Generally %name of interface. There is an RFC which Jinmei wrote, so we use that. There are portability issues here though once we consider platforms other than Linux/BSD/MacOS.
    • other standard things we need to agree on a format of? hostnames? zone/name/class/type separated by slashes is agreed upon. This is in the coding guidelines.
      • Action: need to add the format for how we handle interfaces and RRSETs to the UI guidelines document. We should have formatting functions for these mentioned here.
  • Handling old messages - delay topics
  • Tool(s)
    • we have a ticket about the tool for this. 1055. The ticket specifies an action for utility to manage logging messages. But people are so fried, we are going to discuss it later. There are requirements listed out in the ticket. Do we use our new design process for development of this tool? No, we will just create it by these guidelines as it is a completely internal developer utility. This tool's development is probably part of the upcoming usability work.
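
The formatting functions mentioned in the action item might look something like the sketch below. The slash-separated name/class/type form is from the discussion above; the bracketed IPv6 form with port and the placement of the `%interface` suffix inside the brackets are assumptions for illustration, as are the function names.

```python
import ipaddress


def format_rrset(name, rrclass, rrtype):
    """Name, class and type separated by slashes, as agreed above."""
    return "/".join((name, rrclass, rrtype))


def format_endpoint(address, port, interface=None):
    """Assumed style: 192.0.2.1:53 and [fe80::1%eth0]:53."""
    ip = ipaddress.ip_address(address)  # also normalises the spelling
    if ip.version == 4:
        return "%s:%d" % (ip, port)
    zone = "%" + interface if interface else ""
    return "[%s%s]:%d" % (ip, zone, port)
```

Routing every log message through helpers like these would keep the output consistent without each call site having to know the agreed format.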

Monday 16 January - ISC DHCP

This day was devoted to a face to face meeting on the current version of DHCP, which is not part of the BIND 10 project.

Tuesday 17 January - BIND 10 DHCP (Kea)

Review of Kea work to date

2 servers:

  • DHCPv4 (200 lines): handles giving 1 lease to a client, along with necessary extra options (netmask, default router, DNS). All hard-coded. Works with relay (easier than direct traffic).
  • DHCPv6 (600 lines): handles giving 1 lease to a client, along with necessary extra options (DNS). Works with direct traffic (easier than relay).

Also library:

  • libdhcp++ (9000 lines): packet parsing/assembly for DHCPv4/DHCPv6, options parsing/assembly, interface detection (Linux), socket management.

And performance testing program:

  • perfdhcp: written in C.

Shortcomings:

  • Not integrated into BIND 10 environments
    • Start-up/shutdown
    • Configuration
    • Logging
  • No direct traffic in DHCPv4
  • No relay traffic in DHCPv6

Tomek: Logging would be a good introduction task.
Stephen: It will probably be you, or maybe me.
Stephen: Shawn needs to learn C++.
Shawn: But we have to get other things out the door in an appropriate time frame....
Stephen: I'll speak to Joao about possible training courses. The alternative is to reserve a portion of time for C++ training on a weekly basis.

Stephen: I'd like to have Shane have a look at the framework and breakdown of the code.
Tomek: I suggest this be done after rework of the packet.

Stephen: May be good to talk to Jinmei about the abstract data source in DNS and how we want to do that for DHCP.

Tomek: Jinmei is concerned about us using IPv4 and IPv6 addresses instead of the abstract asio_link address.
Stephen: Sometimes you need a different level of abstraction. Maybe we should use a different address object.
Shane: There were other problems right?
Tomek: Yes instead of using union it uses 2 addresses. This is part of Boost.
Stephen: So we may have to revisit the addresses.

Action: Stephen - add ticket for object review.
(Note added after the meeting - this is ticket #1609.)

Stephen: there is a refactoring ticket with perfdhcp (#1268).
Tomek: Will Francis continue to contribute to DHCP?
Stephen: We cannot count on it.

Action: Stephen - add ticket to figure out why perfdhcp does not work with Kea.
(Note added after the meeting - this is ticket #1610.)

Kea - how will it "hang together"

Conditional Processing

Stephen: Now conditional processing in configuration file. Do we want this in Kea? I assumed we would.
Tomek: We have hooks, and we also provide reference implementation that does most common features.
Tomek: There is a cost for hooks, but it can simplify our internal design.
Shane: Maybe we can make a "DHCP 4.x hook" that invokes our current configuration language at one or two points.
Shawn: We could go with the 80% rule, that's what I've been talking with customers about.

Stephen: In DHCP 4.x we've got classes, pools, hosts. Will we take this across?
Shawn: We probably want pools of addresses.
Stephen: Our current hook design with two pools...

  if cond_a:
      allocate from pool_a
  else:
      allocate from pool_b

Seems easier to have the hook code return which pool to use than require that hooks go down to the lower levels.
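
As a rough sketch of that idea, the server could ask each hook for a pool name and do the actual allocation itself. Everything here is hypothetical: the packet is modelled as a plain dictionary and the `vendor_class` condition merely stands in for cond_a.

```python
def choose_pool(packet, hooks, default_pool):
    """Ask each registered hook for a pool name; the first answer wins."""
    for hook in hooks:
        pool = hook(packet)
        if pool is not None:
            return pool
    return default_pool


def voip_phones_hook(packet):
    # Hypothetical condition standing in for cond_a.
    if packet.get("vendor_class") == "voip-phone":
        return "pool_a"
    return None  # no opinion: let another hook or the default decide
```

The hook never touches leases or addresses; it only returns a name, which keeps the lower levels inside the server.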

Shawn: How will hooks get added?
Stephen: We'll maintain a data structure of pointers to functions.
Tomek: One extra hook for figuring out pool.
Shane: What do people do with our configuration language?
Tomek: Maybe there are only a couple of use cases.
Shawn: People do a lot of different things.
Stephen: Need to revisit hook design, so we can support our current configuration file.
Tomek: Yes, also for FQDN and everything related to DDNS updates.
Stephen: Want to get feedback early on.
Shane: I don't think we'll get that until we have an example hook.
Shawn: The idea of simplifying hooks is probably something to ask customers. Like someone saying "that's code, my company does not let me do code".
Stephen: We may need to provide a set of hooks that do common things.
Stephen: I'm a bit concerned about going from a configuration file to a programming language.

Tomek: How do we configure stuff for hooks?
Stephen: We have a configuration item.
Shane: We should probably add a list of strings to each hook on initialization. Could be filename, or database login, or whatever.
Shawn: I think we should provide a set of library routines where the hooks can do everything. Move the complication of DHCP configuration language into an actual programming language.
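
A small sketch of how a table of hook functions and the per-hook list of strings could fit together. The registry class, the hook point name and the example hook are all invented for illustration; a C++ implementation would hold function pointers rather than Python callables.

```python
class HookRegistry:
    """Table of callables per hook point (the analogue of function pointers)."""

    def __init__(self):
        self._hooks = {}

    def register(self, point, factory, params):
        # params is the list of strings from the configuration item,
        # e.g. a filename or a database login.
        self._hooks.setdefault(point, []).append(factory(params))

    def call(self, point, *args):
        return [hook(*args) for hook in self._hooks.get(point, [])]


def make_prefix_tagger(params):
    """Hypothetical hook: tag clients whose ID starts with params[0]."""
    prefix, tag = params

    def hook(client_id):
        return tag if client_id.startswith(prefix) else None

    return hook
```

The initialization strings are opaque to the server; only the hook interprets them.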

Summary
Current thinking is:

  • Kea will not have a procedural configuration language. Tasks that now require that facility will be done using hooks.
  • Functions will be supplied to allow the hooks to access pools, leases etc.
  • We will supply a number of example hooks with the code.

However, we need feedback on this issue:
Action: Stephen - discuss with Larissa getting feedback on the issue of hooks.

Multiple cores/CPUs

The philosophy of BIND 10 is to run multiple processes. The problem is interaction with each other. If we split by pool they are independent.

What we have today

  • Host - unique entity
  • Range - set of contiguous addresses
  • Pool - range + other info
  • Subnet - list of pools on a specific link (CIDR)
  • Shared subnet - list of Subnets on a link
  • Class - server assigned label of client
  • Subclass - Sub-Label auto assigned (?) [ implementation hack ]
  • Billing class - ?
  • Lease - assignment of address to client
    • fixed - associated with single client
    • reserved - similar to fixed, but goes through server-side processing
  • Options

For IPv4, all leases are constructed when the configuration is read in. Since for IPv6 we need to construct leases on the fly, we should do this for IPv4 too to keep the code simple.
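
To illustrate the difference, a pool can hold only its range and create lease records when an address is actually handed out. This is a sketch with made-up names, and the linear scan is for clarity only; it would not be suitable for a large IPv6 range.

```python
import ipaddress


class Pool:
    """A range of addresses; lease records exist only once handed out."""

    def __init__(self, first, last):
        self.first = ipaddress.ip_address(first)
        self.last = ipaddress.ip_address(last)
        self.leases = {}  # address -> client_id, filled on demand

    def allocate(self, client_id):
        addr = self.first
        while addr <= self.last:
            if addr not in self.leases:
                self.leases[addr] = client_id  # lease built on the fly
                return str(addr)
            addr += 1
        return None  # pool exhausted
```

The same class works for both address families, which is the code-simplicity argument made above.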

Requirements

  • Single database
    • avoid contention
    • Database for leases (r/w)
    • Database for configuration (r)
  • Packet from client on any interface
  • Packet routing occurs after packet reception. (The point here is that if we need to route a packet to a particular process, we need to have received and parsed the packet first. This seems to indicate that BIND 10 DNS model of multiple processes all reading the same socket may not be appropriate.)
  • Need to decide based on any information in packet and interface
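
The routing requirement can be sketched as a function that runs after a packet has been received and parsed, and only then picks the worker process. The dictionary fields (`relay_address`, `address`) and the subnet-to-worker mapping are assumptions made for this example.

```python
import ipaddress


def route_packet(packet, interface, subnets, worker_count):
    """Pick a worker index after the packet has been received and parsed.

    subnets maps a CIDR string to a subnet index; packets for the same
    subnet always go to the same worker.
    """
    # Relayed packets carry the relay address; direct ones fall back to
    # the address of the receiving interface.
    source = packet.get("relay_address") or interface["address"]
    addr = ipaddress.ip_address(source)
    for cidr, index in subnets.items():
        if addr in ipaddress.ip_network(cidr):
            return index % worker_count
    return None  # no configured subnet matches
```

Because the decision needs fields from inside the packet, a single receiver has to read the socket first, which is why the multiple-readers model may not fit.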

Actual Needed

Class / Subclass / Billing class -> all are "tags" - additional classification information. Host *might* be "tag".

We can do all of the configuration computation in a read-mostly setup. When we need to pick a lease, we need a read and a write to the lease

Do we need to put things in memory? Right now everything is in memory.

Maybe we should check contention of:

  get_lease_from_pool(pool0, pool1, ...)
  renew_lease_in_pool(pool0, pool1, ...)

Action: Run with 1, 2, 50 concurrent users, in each of SQLite, MySQL, PostgreSQL, and BDB.
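
A possible starting point for the SQLite part of that measurement, using threads as the concurrent users. This is a sketch only: the table layout is invented, and `get_lease_from_pool` here takes a connection and a client name rather than the pool list shown above.

```python
import os
import sqlite3
import tempfile
import threading
import time


def get_lease_from_pool(conn, client):
    """Claim one free address in a single write transaction."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
    row = conn.execute(
        "SELECT address FROM lease WHERE client IS NULL LIMIT 1").fetchone()
    if row is not None:
        conn.execute("UPDATE lease SET client = ? WHERE address = ?",
                     (client, row[0]))
    conn.execute("COMMIT")
    return row[0] if row else None


def run_clients(path, clients, per_client):
    """Run `clients` threads, each claiming `per_client` leases."""
    results = []
    lock = threading.Lock()

    def worker(n):
        # One connection per thread; wait up to 30 s for the write lock.
        conn = sqlite3.connect(path, timeout=30, isolation_level=None)
        got = [get_lease_from_pool(conn, "c%d" % n)
               for _ in range(per_client)]
        conn.close()
        with lock:
            results.extend(got)

    threads = [threading.Thread(target=worker, args=(n,))
               for n in range(clients)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - start


def make_database(size):
    """Create a temporary database file with `size` free addresses."""
    fd, path = tempfile.mkstemp(suffix=".sqlite")
    os.close(fd)
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE lease (address INTEGER PRIMARY KEY, client TEXT)")
    conn.executemany("INSERT INTO lease (address) VALUES (?)",
                     [(i,) for i in range(size)])
    conn.commit()
    conn.close()
    return path
```

Calling `run_clients` with 1, 2 and 50 clients on the same amount of total work and comparing the elapsed times would give a first indication of contention; the other databases would need their own connection code.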

Review of Plan

  1. Production Server (see the Grand Plan for year 3.)

Tomek: Also prefix delegation?
Shawn: People want failover, not HA
Stephen: I want people to use the backend database for failover.
Shawn: This won't work. Many people want a star topology - home office server acts as backup to satellite offices.
Stephen: Many servers offer multi-master: http://en.wikipedia.org/wiki/Multi-master_replication
Shawn: Fails requirement to be multi-vendor.
Stephen: True, but multi-master offers a form of failover sooner than we implement one.

Stephen: Lots of people download client, not so many servers. Should we concentrate on the client?
Shawn: Will that make as much of an impact?
Shawn: Client will need to be stand-alone.

Stephen: We'd want to avoid a bunch of standard libraries.
Shane: Would we use libdhcp++? Would we use C++ at all? Do we want to continue the model of using shell scripts or do we want to configure the interfaces natively?

Shawn: We need ranges and pools and subnets first, then later we can look at implementing tagging (classes).

We need to define the object model, and need to have some sort of inheritance for characteristics (collections of options).
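
One way the inheritance of option collections could be sketched is with chained lookups, where an object's own options shadow those of its parent. The `Scope` class and the option names are hypothetical; this is not the agreed object model.

```python
from collections import ChainMap


class Scope:
    """A configuration object whose options fall back to its parent's."""

    def __init__(self, name, parent=None, **options):
        self.name = name
        parent_options = parent.options if parent else ChainMap()
        # Own options are consulted first, then the parent's, and so on.
        self.options = parent_options.new_child(options)


subnet = Scope("subnet-192.0.2.0/24",
               dns_servers=["192.0.2.53"], lease_time=3600)
pool = Scope("pool-a", parent=subnet, lease_time=600)
```

Here `pool` overrides `lease_time` but still sees the subnet's `dns_servers`; tags (classes) could later be added as further scopes in the chain.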

Last modified on Jan 20, 2012, 4:11:07 PM