wiki:March2011MeetingAgenda

The March 2011 meeting falls at the end of the project's 2nd year. We should have a pretty good idea of where the project is at - the focus of this meeting will be organizing the work for the project's 3rd year.

CZ.NIC is hosting the meeting, and we will be joined by 3 people from the CZ.NIC DNS server team. This should be good for the BIND 10 developers and the DNS operators!

Logistics

We will be at the CZ.NIC office, and try to keep to 09:30 to 17:00 each day.


= Agenda =

We have certain topics scheduled at particular times, and a number of "free-floating" topics that we can slot in on an "as needed" basis.

Monday, 2011-03-21

We have no new people attending this meeting, so we will not have an introduction to BIND 10 development.

Opening Remarks

When everyone arrives, we'll officially open the meeting, with our standard opening.

  • Greetings & Salutations
  • Introductions
  • Meeting roles & etiquette
  • Meeting goals
  • Meeting plan overview

Introductions

Attendees:

  • Shane
  • Larissa
  • Stephen
  • Jeremy
  • Michal
  • Jelte
  • Likun
  • Jerry
  • Lubos (guest from CZ.NIC)

Via phone starting Tuesday:

  • Aharen
  • Kambe

Joining us later in the week:

  • Fujiwara
  • Michael
  • Other CZ.NIC DNS developer folks

Scrum planning is later in the week, after some overall topic discussion. The agenda is vast but flexible.

Goals: Like all of our annual kickoff meetings we have two goals

  1. project status - not the major goal; we hope to review, wrap up, and evaluate today.
  2. the rest of the week will be a discussion of year three stuff: what and how we will deliver in year three.

BIND 10 Year 2 Wrap-Up

This will be a discussion of how Year 2 of the project went. We should look at the technical and other aspects of the project, and make sure we have consensus about where the project is at right now. (viewing https://bind10.isc.org/wiki/Year2Deliverable)

After our January meeting we knew what we would actually deliver in Year 2. Shane then sent mail to the BIND 10 Steering Committee detailing what we could and could not send, and some of that is the list represented here.

Our main goal was to focus on the resolver, and our secondary goals were an authoritative performance increase, additional backends, and to begin looking at the command tool.

  • We do have a working resolver, but we do not have a DNSSEC-enabled resolver.
  • The hot spot cache is actually slower than the in-memory, for now.
  • We didn't quite get "BIND 9 query performance" but we are within a binary order of magnitude (that is, within a factor of two).
  • We don't have a command language prototype but we may get a hack going next week.
  • We will definitely not have XML statistics reporting done - it's in a branch but it needs thorough review. It is not in the release tarball.
  • We have done a lot of work (maybe not quite half) toward DDNS and IXFR - though the recent in-memory work may have impacted this.

Shane's thinking about why we got where we got and did not make all y2 objectives:

  1. At the year one deliverable, we had a lot of extra work we delayed - a lot of technical debt from the y1 release had to be paid off after year one, which took about two months.
  2. We had one less ISC engineer working on the project than we expected - it took a long time to hire Scott and then he moved to the BIND 9 project.
  3. We also took a long time to come to what it meant for us to build a resolver. We're not sure how we could have sped this up. It seemed important to analyse processes but maybe it wasn't necessary at this point in the project. We didn't know how to break the big problem into little problems. We didn't really know the problem space.
  4. We adopted scrum in a somewhat piecemeal fashion, and the adoption of new process temporarily slowed us down. It has probably now sped us up though.

Jeremy's thinking:

  1. Not always knowing what other developers are doing - we could have reused code that we didn't, and new designs were implemented without using existing code

Jinmei:

I don't think that is such an issue - using jabber and the scrum organisation

Shane:

Early on in the project we adopted a default ISC development model where you assign a piece of work to a developer and they "submarine" with it. One of the really good things about Scrum is it takes us away from that. You're given "automatic buoyancy"

Jeremy: we didn't implement the features that the SQLite implementation has in the in-memory implementation.

Stephen: the last few months have been very much about the y2 deliverables, and as ScrumMaster I was pushing just to get year two out, focusing on what is essential for y2 deliverables only. It was necessity over function. Shane is aware of the politics, but it is important for us to keep our obligations so sponsors see progress. So when we're coming up to planning for year three, I would like for us to say by xx dates we will get certain deliverables done. That way we can give the sponsors periodic progress updates.

Larissa: I think that is the plan and should be the plan. We can break it down even more.

Jelte: do the sponsors think we achieved our goal?

Shane: we have 11 or 12 sponsors. Of them, the ones who have contributed developers have taken the strongest interest. We have a mix of involvement. A lot of them are non-profit TLDs who have money, and they don't worry so much about it.

Stephen: but some of the sponsors *have* given feedback, and we should take that seriously.

Shane: but not all are in this situation, many have specific hopes for BIND 10. A lot of the sponsors want a higher performance server. If you've got 20 sites around the world with a stack of servers in each one, it would be a considerable savings to have fewer servers in each. They're all adopting DNSSEC, etc. From this point of view, most sponsors are just happy for us to make good progress. As far as being visible and giving them incremental progress, one of the things we don't have that Scrum insists upon is customers working with the team. We don't have that and we have to guess. However, we're not just building the software for the sponsors. None of our current sponsors run big resolver farms. As much as I send updates to the sponsors, they are very busy and distracted, and we often don't get feedback from them. We sent a draft of the year 3 plans to the sponsors and we heard nothing. Silence doesn't mean they're happy, just that they're busy. So Shane approached each of them directly and then we got a lot of suggestions. We're going to try to get more feedback and more testing in year three.

Jelte: I think they don't care too much about the resolver, for which the essential parts of the resolver only came out last week.

Larissa: Perhaps we have less technical debt at the end of year two?

Shane: I don't think so, but I think we know what it is and the work is planned. Also we now are always putting out user visible features.

Jinmei: Regarding the sponsors, I've been feeling that they are too generous. If they're serious, I would think we're going to get more pressure from them.

Shane: the one sponsor who has said exactly what they want is JPRS. But I think they are reluctant to do that. I think it cost us a little bit, that we didn't know sooner what they wanted. So for example, we had a discussion at the 2010 fall face to face, about performance, where we said well, from our perspective we've met our year two goals. JPRS wasn't happy about that and eventually came to us and said that it wasn't okay, so then we changed our goals. But changing goals has a cost. We hope now we can get feedback earlier and with scrum we can change course more quickly when we need to respond to changing sponsor requirements.

Stephen: of course, we have a long range plan, in that we need to accommodate all the requirements laid out in the RFCs.

Jinmei: I think there have been some inevitable overhead cycles with Scrum. If we simply want to port BIND 9 behavior such as the red-black tree to BIND 10, and we only care about the speed/easiest way to develop this, that is one thing, but instead we tried to break it down and share it in a scrum model, and this caused overhead. Of course this is also a good thing, because many more people are familiar with the development. Hopefully it is a long term investment, which we can benefit from in year three.

Shane: I hope so, I've been very happy with the recent cycles since the last face to face meeting.

Jinmei: I hope so too, we just need to carefully watch how things improve, or not.

Shane: on paper, the overhead cost of Scrum is very high.

Stephen: but it's actually standard for other projects.

Jelte: of course in waterfall the initial phase is more than 10%

Shane: and quite a few of our sponsors also use Scrum. Fredrico, our Brazilian sponsor, recommended it, as do RIPE NCC and CIRA.

Jinmei: The overhead of the project, with the testing and review, is also considerable but important. Just for example, comparing us to BIND 9 development two or three years ago, all of our code must have tests and careful review, so we tended to underestimate the work time, and I guess that's another reason why we couldn't make the progress we expected. I guess providing the tests is also a kind of investment. It will help with refactoring later. For the long term it may result in shorter development.

Shane: In engineering courses we were taught that 30% of the time is coding, 30% is testing and review, and 40% is design, requirements, and other overhead. I actually think it's pretty accurate. I probably spend twice as much time testing when I work on BIND 10 code as I do coding. I don't think testing is overhead; I think it's part of the process.

Jinmei: Maybe I should have said it differently, I am actually positive about providing tests. I just think it introduces underestimation.

Shane: I think we have gotten better about not having review backlogs. I think this is a natural output of getting away from the submarine model.

Larissa: I think the scrum model has helped us a lot with estimates and understanding status and we can do even better this year.

Shane: I'm really happy with the status of the project now. We had a dip this year but if we can do even 75% of what we've done the last three months, I think we can make our goals this year.

Jinmei: I hear that Google is going to sponsor. What do they want?

Larissa: I believe they want geolocation. We can meet with Warren Kumari next week.

Shane: they did reject us for Google Summer of Code though. Perhaps we can get some of our own interns though. (general agreement that we will try to do this)

BIND 10 Scope of Work (Shane)

Shane will present the SoW document that is being used to get grant agreements from sponsors for the year 3 work, and go through it in some detail.

Shane: we have to get renewed commitments from the sponsors annually. We show them what we did and what we're going to do, and we don't know for sure who will sponsor year to year. Sponsors may change year to year; Norm Ritchie is getting us some new ones for year three (Google, .ru, .nz, possibly Chilean and/or Australian registries).

The SoW is the document we use to tell people what we're doing. The year by year plan should be no real surprise: year 1 auth server, year 2 recursive server, year 3 goal, build on previous work and come up with a "production ready" server. Of course "production ready" is highly subjective. The goal of year three is the "80/20 rule" - 20% of the work can cover 80% of the users - this is the goal for year 3.

Years four and five - our two most important things to do in BIND 10 are not to make the mistakes of BIND 9 - we must be faster, and more compatible, not less. That is the work of year four. In year five, we have reserved time for "all the other stuff". It will be a wrap up, loose ends, and some "cool" things. Right now BIND 10 is about 10% of ISC's operational budget. We have to transition the project financially at that time. We move from being a special project to a mainstream ISC software product. The "ideas page" on the wiki is a place where potential year 5 projects can go. We hope by then the thriving user community will give us guidance as well.

Referencing the Y3 Wiki Page: (http://bind10.isc.org/wiki/Year3Goals)

JPRS suggested we divide this work into three categories: things originally planned for year two, things originally planned for year three, and newly introduced things, and that makes sense.

(discussion of the list on the wiki)

Jinmei: regarding views, in some sense we already have them in the modular system, separating authoritative and recursive. Some people will use them that way.

Shane: I think a lot of people use views to provide two faces on the authoritative side.

Jinmei: so I want to know what people want: separate auth and recursive, or views within one of these?

Larissa: I will look into the survey data more thoroughly.

Michal: It's harder if it's in one side.

Shane: yes we may have to do extra refactoring work

Shane: I always planned operational tools support, this is an umbrella topic. One thing we need here is "phone home" technology. Recursive resolution tracing will be a stand out feature for operators using BIND 10. Full system information aka "BIND 10 showtech" is important for support, debugging, etc. Fedora has a really cool crash report feature where you submit a report and then it looks up your issues in the database to see if others have had it. Quite sexy. We have to decide how many of the BIND 9 tools we duplicate this year, as well.

Features we added since project inception:

Command tool: Jerry Scharf has specified this and we need the initial implementation this year.

Authoritative data sources: we have multiple options here. We had to choose the lowest risk path in BIND 9, but we have other options: radix trees, Jinmei's initial work, and something Vixie suggested - the "first no compromise DNS in memory data structure".

Jinmei: these are not necessarily the only alternatives, nor are they exclusive. The other thing is that I have actually looked at the code Vixie provided and my opinion is that the application is quite limited. It's mainly focused on a recursive IPv4 reverse lookup use case. For such purposes it might be quite fast, but it's not a general purpose use case. (Shane notes maybe this was for blackhole use cases)

Michal: maybe we can get the ideas out there and then decide if we can apply it.

Shane: my personal feeling is that performance optimization on the auth side is mostly a distraction. Most authoritative operators don't need more performance than we currently have. However, like graphics benchmarking, it's what people look at. And our sponsors care about it.

Another goal we want for this year now is hooks for plugins.

Jeremy: what about recursive performance?

Shane: it's not on the official year 3 goals for now. (it is on the overall project goals of course.)

Jeremy: I have benchmarked it.

Shane: I do want to see a comparison across resolver products on this. It's hard to benchmark this but it would be very interesting.

Jelte: actually this is an area where the current sponsors may care about recursive performance - when we start resolution lower down, we can save traffic for root operators and TLDs.

Shane: we have other topics for this year which are not exactly development but they are in increasing reliability and people's trust in the reliability of the software. Testing, test platforms, security audit, system testing, and operational experience and documentation of that.

Jeremy: we have done some of the interoperability work already using some tools by Robert Edmonds.

Shane and Jelte: and building on the NSD work, some of which Jelte worked on for NSD 3.

Shane: we need a security audit, but Barry Greene has pointed out that this could mean a lot of things. We need to figure out what this will mean. We need to help people feel confident. Security is always a trade off between functionality, ease of use, and cost.

Jeremy: regarding system tests, we need to review the BIND 9 model this week and decide if we want to move forward with that model or use another.

Shane: operational experience. We're going to be running this on some ISC servers in operation. We're starting with bind10.isc.org and then moving to our AS112 server as our next operational step, the ISC's internal resolvers, and then the big scary things, SNS-PB (which is a best effort service) hopefully in September. And then hopefully after a few months we could test it on root nameservers. By our year four kickoff, maybe we can discuss the operational problems of root nameservers in BIND 10 :)

Stephen: I am wondering about discussion of other refactoring in year 3, in logging, and TCP, probably a lot of other areas.

Shane: we had to make some estimates, so I made some SWAGs. I don't think they're too far off. (see SoW for SWAG estimates) We will revise the estimates based on the output of this meeting. If history is any indicator, we will end up with more work not less.

If we get more input from users, we could get more direction based on actual user experience. That is our goal.

Jeremy: one thing that isn't on the list is a DNS specifications document. Basically going through all the RFCs, BIND 9, and other implementations and creating a specifications document now.

Jelte: people have started to do this inside the IETF but it's crazy to do it there.

The existing Wiki page on resolver design is more design than requirements and does not reference RFCs specifically.

Jeremy: this would be a thousand pages if it doesn't just reference the RFCs.

Jeremy: having a design document like this might really help us know what we missed.

Stephen: a design document says how you do it. Functional testing requires a list of what you do, which is what this would be.

Shane: when you do requirements documents, the scope of what you're describing is the requirements document itself. If it's meant to be for other implementations besides just BIND 10....

Stephen: we're also talking about answering questions like "what does BIND 9 compatibility mean?"

Jelte: and sometimes the thing that the RFC says is not the thing any implementation is actually doing, because the RFC doesn't make sense.

Larissa: are there other year three feature level items the team thinks are missing?

Shane: Jinmei sent one about standalone library packaging.

Jelte: we also need to discuss an API freeze at some point.

Shane: or at least versioned.

Michal: or you can do it the way Linux kernel does it.

Shane: we may have certain points where we can break API compatibility, that's how BIND 9 does it.

Jelte: you can go the Linux kernel way, or the Firefox (and BIND 9) way. Firefox only breaks compatibility on .0 releases.
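
A minimal sketch of what the ".0 releases only" policy implies for code built against a versioned API. This is an illustration, not a BIND 10 interface: the `major.minor` version format, the function names, and the assumption that minor releases only ever add to the API are all hypothetical.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Split a hypothetical 'major.minor' version string into integers."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def is_api_compatible(built_against: str, running: str) -> bool:
    """Check whether code built against one release can use another.

    Assumed policy: the API may only break on a ".0" release, so any
    later release with the same major number stays compatible.
    """
    built_major, built_minor = parse_version(built_against)
    run_major, run_minor = parse_version(running)
    return run_major == built_major and run_minor >= built_minor
```

Under these assumptions a plugin built against 3.2 would keep working on 3.5 but would need checking against 4.0, which is the predictability Jelte and Shane are asking for.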

The missing and incomplete items seem to be:

  • DNS requirements document
  • refactoring of: ASIO, and....??
  • finishing incomplete features such as logging and TCP connection handling
  • standalone packaged libraries

(list to be continued later)

Shane: we need a yardstick for deciding what we include. I think "will this help us become production ready" should be it.

Larissa: yes

Michal: the problem is we keep putting things off for later.

Shane: everything we add means bumping something off.

Shane: people seem to be uncomfortable with my proposed yardstick, so maybe we can discuss that.

Stephen: we have already given the current list to the sponsors, so must we stick with it?

Larissa and Shane: not necessarily.

Larissa: so we can add other features to the "if we can" list

Discussion here about how to handle requests we cannot currently accommodate, such as HSM support and IXFR-from-differences, etc. Current plan is that we let people know we cannot add more features without more human resources - this is tricky: the sponsors' views are critical, but they don't always reflect the needs of 80% of DNS users.

Michal: when I ask around, what people want is support for MySQL. That seems to be what would motivate DNS users I know to try it out.

This is a topic we had discussed assigning to a GSoC intern, we'll see what happens now.

Likun: some Chinese DNS users prefer PowerDNS because the backend is Oracle. They use TimesTen.

Shane: I looked at doing zone transfers with PostgreSQL but it's not recommended.

Lunch

Scrum Setup

The A-Team / R-Team split made sense when we were working on two separate goals at the same time. It was based on needing to finish 2 separate pieces of work, as well as the expected size of the team becoming quite large. For Y3, we have many more work items, and the team size does not look like it will expand that much (I hope I am wrong - in which case we will revisit the setup).

We need to discuss how we want to organize our team.

Shane: for the past several months we've been split into two teams. The prime motivation was that the team is too large for the ideal scrum size. Ideal size is 5-10 people. We thought it would grow to 20 people by the end of the project year. The other motivation was that we had two distinct pieces of work to do. It has worked pretty well, each of the teams has focused, but people have expressed some concern that they don't know what the other team is doing.

What has changed now? The team has not gotten smaller, but we don't all work full time on the project. We haven't added people as much as we might have. We also all work remotely, which makes communication more formal. We have delivered the initial server implementations. There does not seem to be a logical team split for year three.

I think we should reunify into one scrum team. What do people think?

Jelte: I think the way the teams are split now they are too small.

Shane: sometimes people seem lonely

Jinmei: I think this makes sense

Michal and Jinmei: I have been the only person on my team working in my timezone (Jinmei: on the CONTINENT!)

Jeremy: I think this would save us a lot of time, for all of the people who then don't have to attend two sprint plannings.

Larissa: sprint planning might run slightly longer.

Jinmei: I do think a slightly larger team is preferred but we will have a slightly larger team than the ideal scrum size.

Jelte: also it sometimes happens that people who are co-located in a timezone or company work together more, so we end up with specialized teams of two.

Stephen: the thing with the timezones is important. Speaking in real time to discuss a problem is really helpful.

Michal: when there is a problem with some code, and the person who wrote it is on the other team from you, it is difficult to figure out what to do.

Stephen: I think we can do it in one team. If we have one team, there will be related groups of tasks. It would then be logical for people in the same areas of the world to do related groups of task, so they can talk more easily.

Jinmei: if we have one big team, sprint planning session could become uncomfortably long.

Shane: we're trying to do most of our sprint planning in the face to face meetings. however we may need a marathon planning session or two before the next face to face.

Shane: so we have consensus we will try one team, though we know there may be a few problems.

BIND 10 Year 3 Release Schedule

Based on the SoW we need to discuss the release schedule for Y3.

Returning to the discussion topic from the morning, about additional features and issues we must handle in Year 3 beyond the statement of work items.

The things we know we need:

  • BIND 9 style IP address based ACLs (TSIG, IP, extensions/hooks)
  • TSIG
  • IXFR in and out - protocol level (and data source level)
  • DDNS (server side only - same issues as IXFR)
  • DNSSEC validation for resolver
  • DNSSEC support for in memory data source
  • Views
  • Operational Support Tools:
    • Version Check / Phone Home
    • Recursive Resolution Tracing
    • DNS ShowTech
    • Cache Management (deleting, injecting, viewing, loading, dumping)
  • Command Tool
    • Demonstration Version
    • Framework
    • Specific functions:
      • Replicate functionality in bindctl
      • Load, Delete, List, Modify(?) Zone ("rndc addzone")
      • Per Feature Configuration
  • High Performance Back end (faster than BIND 9 for in memory and *maybe* hot spot cache)
  • Requirements, Design, and Implementation for Hooks (for Plugins)
  • Test Platform for Recursive Resolution
  • Interoperability Testing
  • Security Audit (and followup)
  • System Level Testing
  • Operational Experience (and followup)

Additional Possible Requirements:

  • Completion of Logging - multiple files, destinations, filters. (logging API)
  • Configuration of the BOSS (using cmdctl), command line configuration, and config manager configuration
  • Save and load config (export and import)
  • Scattered TODO items
  • Refactoring
    • ASIO
    • Auth and Recursive server callbacks
    • General utility library
    • Generic BIND 10 process (make modules into libraries)
    • Stand Alone Mode (like b10-auth only)
    • datasource refactoring
  • Finish Socket Creator
  • DNS Specifications Document (Referencing RFCs, etc)
  • Complete support for RR types (everything on the IANA list)
  • Link to Crypto Libraries
  • Replacing msgq
  • Supporting multicore systems (multiple process model) at least auth
  • Complete zone file parser
  • Offline Configuration
  • Additional datasources: MySQL, PostgreSQL, BDB
  • Reduce the bug backlog! (resume inclusion of bug fixes per sprint)
  • Status query (zones being transferred, timeouts, qps, acls loaded, etc)
  • Demuxer (handling multiple queries on the same port) - suppressing duplicate queries
  • Randomization of Ports

We have two major milestones listed, and 41 feature level tasks. (minor issues with the dependencies working correctly). There are a lot more tasks listed for auth than for recursive. (note today we have remote participation by JPRS members)

List of tasks is on a separate url to be added to these notes.

List of features with dependencies is complete but we will be breaking out into tasks for sprint planning.

Replacing msgq is an item that we may leave out if time does not allow - but Michal notes that it needs enough work, that perhaps replacing it would be faster than refactoring it.

We wonder how many people use TSIG in their recursive implementations today.

We note GSS-TSIG is something to do when we do our Windows implementation. Later.

Discussion of how early in the year to start DNSSEC validation work. It is one of the most complex tasks, but it is also not linked to the first major deliverable of the year (production Auth only server). We think we need to hold off starting in on DNSSEC validation for a few months, though there is risk in not specifying this work for too long. So a second quarter start for DNSSEC validation.

Added a dependency between the DNS specifications document, TSIG, and DNSSEC validation.

Much of the work is split into the authoritative and the recursive implementations (documentation, multicore support, etc)

Refactoring tasks are generally put ahead of new code, though not always.

Added a feature to the task list for datasource refactoring (and API standardization)

Operational support tools actually can be done at any time on an as needed (as resource is available) basis.

Command tool has quite a few tasks inside of it, but it does not depend on any other tasks
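
The dependency bookkeeping described in these notes can be sketched as a topological ordering. This is only an illustration: the notes say a dependency was added "between" the DNS specifications document, TSIG, and DNSSEC validation without recording its direction, so the map below assumes that the two features wait on the document. The dictionary layout and the `schedule` helper are hypothetical, not part of the project's planning tools.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: task -> set of tasks it must wait for.
# The direction of the spec-document dependency is an assumption.
DEPENDENCIES = {
    "DNS specifications document": set(),
    "TSIG": {"DNS specifications document"},
    "DNSSEC validation": {"DNS specifications document"},
    "Command tool": set(),  # noted above as depending on nothing else
}


def schedule(dependencies):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = schedule(DEPENDENCIES)
```

`TopologicalSorter` raises `graphlib.CycleError` if two tasks end up depending on each other, which is a cheap way to catch mistakes as more features are added to the list.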

When do we do support for multi-core systems? It's not that it's necessary for administrators in auth-only, but people may not want to install if the system does not benchmark fast.

Once we refactor the SQL backend perhaps it will be non-major to implement more SQL backends.

Status query can be pushed to later in the year if necessary, though administrators will start asking for it, and it "looks cool".

A lot of things depend on the security audit. It may take a few weeks to do. We need to decide what the terms of reference are and who will do it.

Views - scheduled for the recursive timeframe, even though they are useful to auth only systems.

High performance data source is moved into the recursive section. This is something we can drop off if we need to.

Hooks: it can be held to the recursive part of the year, but the team expressed concern that the longer we wait the more refactoring we would need to do. Stephen felt that once we write hooks we need to freeze the relevant APIs. Michal said he would like to be able to tell the world we have an early implementation of hooks, to get them to play with it. Stephen feels we should potentially hold off because this is not essential and we have so much to do. General agreement that among things we will "do with enough time", this would be very high priority.

Interoperability testing and system level testing: when? we may want to start approaching it as we write our unit tests.

Which items should depend on the DNS specifications document being written first? Stephen: I see the auth and recursive as being written in parallel, not dependent. We need these to be written in bite size chunks - it's boring and tedious work, and we need to be able to consume it as we develop as well. If we do them in parallel we can move through it one RFC at a time.

Export/import configuration is not recursive specific but we can delay it to later in the year so it falls into that part of the year for now.

Offline configuration - we may want to do before the auth server release because bootstrapping may be cumbersome otherwise.

What is a "Forwarder" Anyway?

Apparently nobody has ever defined what a DNS forwarder is. At least not to our satisfaction. We need a list of what a forwarder does and does not do.

Jelte: We've been adding and removing features to/from the current forwarding feature as we've developed the resolver. Let's make a list.

Shane: is the concept of a forwarder defined in the RFCs?

Michal: A proxy is mentioned, but not much. Only to the level that it exists.

Jelte: it may be mentioned in an informational doc on DNS setups

Jeremy: 5625?

Michal: 1033?

Shane: if it's just casually mentioned and not carefully defined, does anyone else implement a forwarder? I guess DNSmasq has a forwarder?

Michal: it has a cache and a dumb proxy

Shane: is this a BIND-ism or not?

Jelte: I guess Unbound has this but I am not sure how it implements it

Michal: you can forward, but it's still a resolver

Stephen: RFC 2308 section 1, defines a forwarder.

Jelte: this definition would suggest it is on the other side of the resolver - between the resolver and the internet, not between the stub and the resolver.

Jeremy: RFC 2136 section 6 also discusses the forwarder.

Michal: Why do we actually need one? We can create whatever server we like as long as it speaks the protocol correctly, so we can have the feature there, but what is the point?

Shane: my use case: my ISP runs a resolver and its fine, but I'd like to have a local cache also, to save time.

Michal: I use the DNSmasq for this.

Jelte: I can see that use case in this scenario, but I don't see that it has a lot of benefit.

Michal: I use unbound for another thing, I want validation that my provider doesn't do, but I was unable to configure BIND to do it because the provider blocks all other DNS traffic than to its own server. I use a validating forwarder.

Jelte: that's a good thing to come out of this discussion, you would implement this differently than what we had in mind.

Michal: maybe if we had plugins, we would do this this way, by replacing the part that sends queries.

Jelte: Maybe we directly call the query which sends to the upstream address and then when it returns instead of going back into the resolver query you just pass the answer to the original client. That does mean you would be going through all the logic even though you don't need to.

Jelte: if you run a straight forwarder you want to copy all the flags, but if it's a validating forwarder you do not.
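
A rough sketch of the flag handling Jelte describes, using the AD (Authenticated Data) bit as the example. The bit position comes from the DNS header layout; the `response_flags` function and its parameters are hypothetical, and treating AD as the only flag a validating forwarder must not copy is a simplifying assumption.

```python
AD_BIT = 0x0020  # Authenticated Data bit in the 16-bit header flags field


def response_flags(upstream_flags: int, validating: bool,
                   locally_validated: bool = False) -> int:
    """Compute the flags to send to the client.

    A straight forwarder copies the upstream flags unchanged. A
    validating forwarder (assumed behaviour) ignores the upstream AD
    bit and sets AD only if its own validation succeeded.
    """
    if not validating:
        return upstream_flags
    flags = upstream_flags & ~AD_BIT
    if locally_validated:
        flags |= AD_BIT
    return flags
```

For example, upstream flags of 0x81A0 (QR, RD, RA and AD set) pass through untouched in straight mode, but become 0x8180 in validating mode when local validation did not succeed.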

Stephen: three modes? one, pass through, no interpretation, second way adds a cache, third way is a validating forwarder

Use cases:

  • First is for firewalls, or a computer not connected to the internet
  • Second is for local cache
  • Third is for getting additional or more trustworthy validation than is provided upstream
  • Fourth is selective forwarding - to get specific information from a particular server

(note that BIND 9 has a default to fallback to iteration when forwarder fails. We may or may not want to do this. Useful to know why people would want this and what the behavior is)

  • There is also DDNS forwarding - some clients try to send requests to non primary master (other auth servers) - causing problems.

Michal: we may want to try some scenarios in the forwarder before putting it in the resolver

Jelte suggests that a forwarder does minimal work. No retry or fallback is done by a forwarder.

Shane considers a forwarder as one which acts as a proxy and does fallback (and maybe retries).

Behavior:

See RFC 5625:

http://www.faqs.org/rfcs/rfc5625.html

  1. Very, VERY Simple Forwarder: Pass everything through without interpretation, except:
    • QID
    • port number
    • ACL considerations?

  2. Very Simple Forwarder: Pass everything through without interpretation, except:
    • QID
    • port number
    • ACL considerations?
    • EDNS0 (adjusted?)
    • VERSION.BIND

  3. Proxy Forwarder: Read query. Do everything (interpret/strip EDNS, ...) except follow delegation, TCP fallback (?). Note: BIND 9 may originate other queries, for example follow CNAME chains.
  4. Very Simple Forwarder + Cache
  5. Proxy Forwarder + Cache: Maybe setting DO bit is helpful so we can cache that information. That may bloat cache though.
  6. Validating Forwarder: Full resolver that only goes to specific address(es) (except with RD bit on)
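
As a rough illustration of the simplest forwarder type in the list above, the following Python sketch passes a message through without interpretation except for the QID. The class and method names are hypothetical and are not BIND 10 code; the only protocol fact it relies on is that the first two octets of a DNS message carry the ID. Port handling and ACLs are left out.

```python
import secrets
import struct


class SimpleForwarder:
    """Hypothetical pass-through forwarder: rewrites only the query ID."""

    def __init__(self):
        # upstream QID -> (client address, original client QID)
        self._pending = {}

    def to_upstream(self, query: bytes, client):
        """Return the query with a fresh QID; everything else is untouched."""
        (client_qid,) = struct.unpack("!H", query[:2])
        upstream_qid = secrets.randbelow(0x10000)
        while upstream_qid in self._pending:  # avoid clashing with in-flight IDs
            upstream_qid = secrets.randbelow(0x10000)
        self._pending[upstream_qid] = (client, client_qid)
        return struct.pack("!H", upstream_qid) + query[2:]

    def to_client(self, response: bytes):
        """Restore the client's QID; return (client, response) or None."""
        (upstream_qid,) = struct.unpack("!H", response[:2])
        entry = self._pending.pop(upstream_qid, None)
        if entry is None:
            return None  # unsolicited response: drop it
        client, client_qid = entry
        return client, struct.pack("!H", client_qid) + response[2:]
```

The other forwarder types would add interpretation on top of this (EDNS0 adjustment, a cache, validation); none of that is shown here.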

RFC 3490 mentions forwarders for IDN transformations.

RFC 3901 mentions using forwarders for IPv6 to IPv4.

RFC 2845 is about forwarders and TSIG.

Forwarder is not a goal for Y1/2/3 so maybe we should remove the current support.

Google draft about geolocation EDNS0 option.

RFC 2671 mentions what *not* to forward.

Jelte notes that if we modify the current ticket to pass the DO bit (#598) to lower the EDNS buffer size if the client's is greater than ours, then we have forwarder type 2 (simple forwarder). Also we probably don't copy all the correct response flags yet.

Tuesday, 2011-03-22

BIND 10 Year 2 Release

We'll actually make our official Year 2 release. Everything will be prepared in advance, so it should just be a matter of sending some e-mails and updating some Trac pages.

hurrah! champagne and sparkling cider were had.

Y3 deliverables: approach to discussion

Make sure we all understand how we're going to go through the list. Shane did his homework and made a list with dependencies, and we'll go through those together. Shane decided to organize this using Task Juggler. A copy of the gantt chart will be linked to the developer wiki. We did leave a few things out. We have a lot of work to do, and a few of the tasks were not needed or requested by sponsors in year three. We may remove additional items as we discuss.

Y3 deliverable: ACLs

Y3 deliverable: TSIG

Stephen: what does BIND 9 do?

Jinmei: named key-gen

Stephen: do we want to replicate this? or not?

Jelte: tsig key generation is basically just writing random data

Larissa: what would be easier?

Stephen: is it a separate program?

Jeremy: it is but it uses the libraries

Stephen: so we write our own

Jelte: but it shouldn't be hard

Jeremy: we can also provide workarounds to do it with OpenSSL etc
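
Jelte's remark that TSIG key generation is basically writing random data can be sketched in a few lines of Python. This is only an illustration: the function name and the default size are assumptions, and the base64 output reflects the usual way shared secrets are written in configuration files, not a decided BIND 10 format.

```python
import base64
import secrets


def generate_tsig_secret(bits: int = 256) -> str:
    """Return a base64-encoded random secret of the requested size."""
    if bits <= 0 or bits % 8 != 0:
        raise ValueError("bits must be a positive multiple of 8")
    # secrets.token_bytes draws from the operating system's CSPRNG
    return base64.b64encode(secrets.token_bytes(bits // 8)).decode("ascii")
```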

Stephen: how about the relevant crypto?

Jinmei: we don't have it

Stephen: okay so the first question is what crypto library

Jinmei: well, we do have SHA1 code. And so we have some minimal crypto of our own, but it is still a question whether we want to have an outside crypto library or use our own minimum version.

Stephen: this is our first assay into crypto, really. So what are the options? (refer also to the Beijing meeting notes at: http://bind10.isc.org/wiki/f2F1_Y2_Tue)

  • Soft HSM (is this where we add our HSM transparency layer?)
  • Botan
  • OpenSSL
  • Crypto++

Discussion: how much more work would it be to add the HSM transparency layer when we're already adding crypto?

Jelte: so if we define an abstract crypto interface that takes keys as arbitrary identifiers, it doesn't matter what that uses internally.

Stephen: we probably don't want to rely on SoftHSM. So what underlying library?

Larissa: OpenSSL has been problematic in BIND 9. What about Botan? SoftHSM uses it...

Jelte: the reason I wanted the SoftHSM in OpenDNSSEC was that I didn't want a different code path whether you used an HSM or not.

Stephen: we need to do our own implementation of libHSM?

Jeremy: why don't we have someone try replacing our current SHA code with Botan and see how it goes?

Larissa: what about GOST?

Jeremy: maybe we get Botan to support GOST.
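
Jelte's idea of an abstract crypto interface that takes keys as arbitrary identifiers might look roughly like the Python sketch below. All class and method names are hypothetical. The in-memory HMAC backend is only a stand-in built on Python's standard library; it is not a proposal for the underlying library, which was left open in the discussion.

```python
import hashlib
import hmac
from abc import ABC, abstractmethod


class CryptoBackend(ABC):
    """Hypothetical interface: callers name keys, never touch key material."""

    @abstractmethod
    def sign(self, key_id: str, data: bytes) -> bytes: ...

    @abstractmethod
    def verify(self, key_id: str, data: bytes, signature: bytes) -> bool: ...


class InMemoryHmacBackend(CryptoBackend):
    """Stand-in backend; an HSM-backed or library-backed class could
    implement the same two methods without callers noticing."""

    def __init__(self, keys: dict):
        self._keys = dict(keys)  # key_id -> secret bytes

    def sign(self, key_id, data):
        return hmac.new(self._keys[key_id], data, hashlib.sha256).digest()

    def verify(self, key_id, data, signature):
        # constant-time comparison of the expected and supplied signatures
        return hmac.compare_digest(self.sign(key_id, data), signature)
```

Because callers only ever pass a key identifier, the same code path is used whether the key lives in memory or in an HSM, which is the property Jelte wanted from SoftHSM in OpenDNSSEC.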

(Continued after lunch...)

Shane: Current BIND9 use of TSIG is broken - can't have two keys with the same wire information. Need to decouple DNS name from identifier in configuration.

Shane: TSIG from resolver side not a priority this year.

(Discussion on bootstrapping problem.)

Shane: Clients/stub resolver out of scope. Main use of TSIG is to secure connection between servers. What are issues with Crypto library?

Jinmei: none that are insuperable.

Jelte: One issue - if query signed with TSIG, answer must be so signed. However, must be aware of keeping copy of wire data.

Jinmei: TCP is tricky - need to provide signature every 100 messages or so. Current impression is that it will be part of libdns++.

Shane: need way to configure TSIG certificate as "global" data.

Jelte: Need way to configure data. Question is where to put it? How about "System" meta-module?

Shane: Create TSIG configuration module in which TSIG data is put.

Shane: Issue about NOTIFYs. BIND9 does not support this (NSD does).

Vorner: Not critical to sign them - can't do it now.

Jeremy: Can configure BIND9 to do this.

Shane: Motivator: NSD does it now. Also, do we want to avoid a remote user being able to get the server to do something?

Michael: Q: how will it be configured? (A: via bindctl.)

Y3 deliverable: Views

Jeremy: in BIND 9, Views are basically matching a client, matching a destination, match TSIG, or match if the recursive RD bit is set. The goal is to provide a different data source back end based on the match.

Stephen: If you have a look at the NSCP draft, we discuss views in there.

Shane: Views are a BIND-ism, right?

All: Pretty much.

Shane: what do you do based on the match?

Jeremy: provide different data.

Stephen read out more on zones from the NSCP draft (http://tools.ietf.org/html/draft-dickinson-dnsop-nameserver-control-02)

Jelte and Michal: this gets tricky if you mix auth and recursive

Jeremy: the match takes you to separate data sources. This is why you need to figure out ACLs and TSIGs first. I thought at one time we had talked about being able to provide different data sources.

Jinmei: is this recursive, auth, or both?

Stephen: there are two parts, the access part, and then the selection of the data source part.

Shane: and there is first match or best match.
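
A first-match selection along the lines Jeremy describes could be sketched as follows. This is an illustrative Python fragment, not BIND 9 or BIND 10 behavior: the `View` fields and `select_view` are hypothetical names, and only two of the match criteria (client address and the RD bit) are modeled.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass
class View:
    name: str
    clients: list                  # source networks, e.g. ["10.0.0.0/8"]
    recursion_only: bool = False   # match only queries with RD set


def select_view(views, client_ip: str, rd: bool) -> Optional[View]:
    """Return the first view whose criteria all match, or None."""
    addr = ipaddress.ip_address(client_ip)
    for view in views:
        in_net = any(addr in ipaddress.ip_network(n) for n in view.clients)
        if in_net and (rd or not view.recursion_only):
            return view
    return None
```

With first match, the order of the list matters: an "internal" view must come before a catch-all "external" view. A best-match policy would need a different selection rule.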

Stephen: what happens when you have 10,000 zones

Shane: is there a performance penalty with zones?

Jinmei: yes, with the matching part.

Stephen: is that the way we want to do that? For a given zone you probably have relatively few views, but if you have 10,000 zones with different views, and you match by view, you have potentially many thousands of views...

Jinmei: views have zones. not the other way.

Michal: that is why it is so powerful. you can have one server pretending to be many servers.

Shane: the difference between the way we use datasources and views: different views can contain the same name of a zone, but in a datasource they would have to physically copy the data.

Jeremy: I would like to see our nameserver, regardless of views, be able to use multiple datasources at the same time.

Jinmei: we can't do that now

Jelte: but we will refactor to be able to

Jeremy: in BIND 9 you're always loading everything into memory. This could make it easier.

Larissa: and faster?

Michal: you can have a mix, too. some things in both, and some in other things. how would this look?

Jelte: I've never used views, but I think each view has its own full configuration.

Jeremy: yes. There are 60 or more toggles you can put inside a view

Stephen: and it's like a virtual server

Jelte: Michal suggested you could change your pipeline by views

Michal: yes, then only the critical part would care about zones the rest could ignore

Shane: as far as working with hooks goes, the view you're in should be passed around as part of the context. That should be straightforward.

Michal: we would need hooks per view. if we have different configuration per view, we could have a hook in one view and not in another one.

Stephen: do you pass the view to every hook and the hook decides?

Michal: the first thought is you need to take care of views everywhere. It's a lot of code.

Stephen: we're in danger of getting very very complex for corner cases. the main use of views as i understand it is to separate internal vs external networks in a company. That is the use case we should optimize for.

Jeremy: one easy solution we have now for a destination based view is to make sure bind 10 can run multiple resolver processes listening on different IPs. They would have different configuration and different caches. Same with multiple b10-auths.

Shane: we talked about config being different but there are different caches per view on the recursive side?

Michal: so you can redirect a zone in one view but not another

Shane: some people will not be able to set up two processes listening on different IPs

Stephen and Larissa: let's work on the common case. 80/20. Corporate situation, intranet/extranet.

Michal: maybe we can simply solve both the common case and many corner cases.

Jinmei: maybe there isn't much difference between common case and corner cases.

Michal: we can restrict configuration somewhat

Jinmei: there will be an exception

Shane: thinking from an administrator point of view. I've got three zones, one each in two views and a third zone in both views. Would i have to put them each in their own database?

Jeremy: our database needs another level

Shane: we need a layer of indirection

Jinmei: we should separate the notion of type of datasource and the database files

Shane: I'm thinking of the abstract concept of a datasource. Right now when I query a datasource I ask for a name. When we add views, I have to ask for a name, and a view.

Michal: yes, and the datasource can either look specifically for data based on the view, or...

Shane: I don't care right now. What I'm realizing is what we need to do is expand our data source API to include views.
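
The data source API change Shane describes (ask for a name and a view), together with the layer of indirection needed to share one zone between views, might be sketched like this. Everything here is hypothetical Python for illustration: zones are stored once and views only hold references to them.

```python
class ViewAwareDataSource:
    """Hypothetical data source: lookups take a name *and* a view."""

    def __init__(self):
        self._zones = {}   # zone id -> {owner name: [rdata, ...]}
        self._views = {}   # view name -> set of zone ids (the indirection)

    def add_zone(self, zone_id, records, views):
        self._zones[zone_id] = records        # stored once...
        for view in views:                    # ...referenced from many views
            self._views.setdefault(view, set()).add(zone_id)

    def find(self, name, view):
        """Return the rdata for name as seen from view, or None."""
        for zone_id in sorted(self._views.get(view, ())):
            records = self._zones[zone_id]
            if name in records:
                return records[name]
        return None
```

Shane's example (three zones, one each in two views and a third in both) then needs three stored zones rather than four copies.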

Jeremy and Jinmei: how will we share a single zone file in multiple views?

Jinmei: I see the desire but that will be very tricky and error prone

Michal: the price of passing it to the API is nearly zero. I think we can handle this better on the datasource level than the higher level

Jeremy: if you're changing configuration all the time, do you need to replicate that in your data source?

Shane: not if it is done in an abstract way, or in SQL, in a reference table. Depending on how we implement it, SQL could look to see if it has views and do different queries depending on whether it does, for performance.

Jeremy: I don't know how you do this in BDB.

Michal: every piece of configuration can be different. We dont want to go through the whole server and add conditions.

Shane: we can say views are not able to configure *everything* just a specific set of commonly used things.

Michal: it depends on the plugin system I suppose but the plugin system could provide a piece of logic that could copy views itself

Shane: not a bad design but i hesitate to implement that without a use case

Stephen: what about the receptionist model?

Michal: this is similar to my idea

Stephen: the plugins could be determined by the configuration of the server

Michal: the plugin means that its in some hook, and would be in the hook for one view and not for another. But you could also have common places for all views. You don't configure everything differently. You just can.

Shane: I worry about using receptionist for this. I don't think it would be that much simpler and it might cost performance. Maybe for BIND 9 compatibility, where everything is configurable per view.

(Michal draws on the notepad)

Jeremy: there is no memory sharing between caches in the BIND 9 way. So important information doesn't leak, but it uses 10 times the memory.

Stephen: only if you have 10x the queries.

Jelte: well.....

Stephen: so where do the definitions of the views live. In the configuration database?

Michal: I don't know how. If we're allowed to configure everything, you need a configuration overlay.

Shane: that won't be the initial implementation. We will just configure zones and recursive behavior.

Michal: the configuration manager can be handled somehow.

Jinmei: we should ensure that views implementation are consistent across all the modules.

Shane: we need a work item for non module related configuration.

Stephen: we could have a pseudo module called system, and put it all in there

Jeremy: in BIND 9, statistics can be separated by view

Stephen: you need statistics per zone too

Shane: we will need to capture that and report it, reporting should not be a problem with this, reporting is quite flexible now.

Jeremy: BIND 9 by default has three views: _bind view, _default view, _meta view.

Stephen: even if you don't define views, everything goes through BIND 9 views. It simplifies the data model.

(interlude about NSCP and nominet and whatnot)

Lunch

Y3 deliverable: DDNS

Shane: tell us about the current status to changes to the backend to make them writeable?

Jelte: yes, for the first SQLite data source I added functions that could add and remove RRs and also parse a dynamic update and perform nearly every action in there. It does not do data consistency. But that was for the SQLite data source. Jinmei had one look and kind of disagreed with the general design, since I added everything on the abstract data source level. He thought we might want to add a separate class. We might want to make every datasource writeable.

Shane: no, surely you want read only data sources.

Michal: could we make a write only datasource?

Jelte: I don't see a use case for that

Shane: it might make sense if you had programmatic data sources. You could say, use DDNS to do logging.

Michal: I am just thinking it might make sense to have readable and inherit read writeable, or to have all three.

Shane: you can do this with aggregation instead. I don't know. I see why it would be nice to do that, then if you are implementing a datasource and you don't want it to be writeable you don't have to implement it at all.

Jelte: I think you can do that today. It was written over 6 months ago now, though, so...

Michal: I believe we want to merge first and see what a writeable datasource might look like before we start refactoring.

Shane: questions: How do you handle concurrent access in the current code?

Jelte: for DDNS it is the datasource itself that handles the packet, so right now it doesn't worry about it. In the case of IXFR it is a separate process and it will send a fail.

Shane: with IXFR we should not have to worry about that since we are the ones doing the updates. That's probably appropriate, though we may want to define a default where we lock everything, for naive implementations.

Jelte: if you make a very simple implementation it just sets a lock.

Shane: for ease of use for implementors, we may want to put an in memory mutex there by default.
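
One way to picture the readable/writeable split and the default in-memory mutex is the Python sketch below. The class names are hypothetical and do not reflect the existing data source code. Note that the lock only serialises writers within a single process, which is the limitation raised in the discussion that follows.

```python
import threading
from abc import ABC, abstractmethod


class ReadableDataSource(ABC):
    """Read-only interface; a data source may implement only this."""

    @abstractmethod
    def find(self, name): ...


class WritableDataSource(ReadableDataSource):
    """Adds writes. The default lock serialises updates within one process;
    it does nothing for multiple processes."""

    def __init__(self):
        self._lock = threading.Lock()

    def update(self, name, rdata):
        with self._lock:
            self._do_update(name, rdata)

    @abstractmethod
    def _do_update(self, name, rdata): ...


class DictDataSource(WritableDataSource):
    """Naive implementation that simply inherits the default locking."""

    def __init__(self):
        super().__init__()
        self._data = {}

    def find(self, name):
        return self._data.get(name)

    def _do_update(self, name, rdata):
        self._data[name] = rdata
```

An implementor who does not want a writeable data source implements only `find`; one who has a smarter concurrency story overrides `update`.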

Shane: can we not use an in memory lock if we have multiple processes?

Michal: we could but it wouldn't be easy. If you provide it in the abstract class the simple version may use it, but...

Shane: okay we can refactor this later if we need to.

Shane: Multiple processes? I guess with SQLite we don't care too much. I guess Jinmei or Michal or someone thought about this for multiple processes.

Jinmei: in the in memory data source?

Michal: if you have the memory shared you can share a semaphore. But you need another daemon that handles it that holds the data. It would be another process. It seems quite heavyweight.

Shift to some discussion of multicore model as it relates.

Shane: my thinking is we would scale across multiple cores by using multiple processes.

Michal: you could use the writeable as SQLite and inmemory as the secondary store.

Shane: we could encode deltas as well, useful for a big zone.

Michal: we could start loading from the datasource in parallel with handling the current data as well.

Shane: if you're using a system that requires the performance of an in memory data store, you will then start dropping queries.

Jinmei: can we get back to dynamic updates?

Shane: the proposal is that we don't allow dynamic updates to the inmemory source at all - that when you need to change it you do a partial or full zone reload.

Jelte: either DDNS or IXFR says you have to store it there before you start serving it anyway.

Stephen: one thing about an auth server is the updates won't be that frequent.

Shane: I really think this might be the right way, where in memory gets its data from another source, and if you want to update it, we have an upload method that can be done with a delta, and have an API for the upload method.

Stephen: you can load it into memory as soon as possible, but if you get multiple updates to a single record (Shane notes this happens in DHCP) it is complex.

Shane: if you presume the set of changes will be small and infrequent, you can lock the whole dataset to make the changes.
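
The "upload with a delta" idea, combined with locking the whole dataset for small and infrequent changes, could be sketched as below. This is a hypothetical Python illustration: `InMemoryZone` and `apply_delta` are made-up names, and records are simplified to owner name and rdata strings.

```python
import threading


class InMemoryZone:
    """Hypothetical in-memory copy fed from a stable data source."""

    def __init__(self, records):
        self._lock = threading.Lock()
        self._records = {name: set(rdatas) for name, rdatas in records.items()}

    def apply_delta(self, deletions, additions):
        """Apply a small change set while holding one coarse lock."""
        with self._lock:
            for name, rdata in deletions:
                rrset = self._records.get(name)
                if rrset is not None:
                    rrset.discard(rdata)
                    if not rrset:
                        del self._records[name]
            for name, rdata in additions:
                self._records.setdefault(name, set()).add(rdata)

    def find(self, name):
        with self._lock:
            return set(self._records.get(name, ()))
```

Deletions followed by additions is the same shape as one IXFR difference sequence, which matches Jelte's observation below.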

Jelte: this sounds remarkably similar to constructing an IXFR out packet.

Stephen: how often are things read in the DHCP case?

Shane: it depends on the environment. In a reverse tree, probably pretty soon. For some reason many machines want to do reverse lookup.

Stephen: if it's going to be updated 5x before it's uploaded again there is no point. Just mark it as dirty. If you've stored it, and you update it on disk before you bring it to memory, then you do essentially have a hot spot cache.

Jelte: as a general design thing.

Shane: it listens for the query, applies all the prerequisites, and then pushes it down to the datasource

Shane: do views apply to DDNS?

Jinmei: yes

Jelte: the reason I applied this to all the layer of abstraction is that if you have a datasource that can handle more efficiently you can rewrite it

Shane: if I have prerequisites across multiple data sources will that be a problem for us?

Shane: Then the datasource layer needs to do prerequisite checking and then the actual updates. Then for in memory, we need an abstract class for stable storage.
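
The order Shane describes (prerequisite checking first, then the actual updates) can be illustrated with a deliberately simplified Python sketch. RFC 2136 defines more prerequisite types than the two modeled here, and the function name and tuple formats are assumptions made for the example.

```python
def process_update(zone, prerequisites, updates):
    """Check every prerequisite first; apply updates only if all hold.

    zone: dict mapping owner name -> set of rdata strings.
    prerequisites: list of ("name-in-use" | "name-not-in-use", name).
    updates: list of ("add" | "delete", name, rdata).
    Returns True if the update section was applied.
    """
    for kind, name in prerequisites:
        if kind not in ("name-in-use", "name-not-in-use"):
            raise ValueError(f"unsupported prerequisite: {kind}")
        if (kind == "name-in-use") != (name in zone):
            return False  # a failed prerequisite rejects the whole update
    for action, name, rdata in updates:
        if action == "add":
            zone.setdefault(name, set()).add(rdata)
        elif action == "delete" and name in zone:
            zone[name].discard(rdata)
            if not zone[name]:
                del zone[name]
    return True
```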

Stephen: I think this is a problem for a very large zone

Shane: so we need a signaling mechanism for them to get updated, which would end up a lot like IXFR out.

Stephen: unless we say well, when we load a zone from zone file, we load it into a database, full stop. If you want a zone file out, we just write a zone file back out.

Shane: yes that is the right model

Michal: yes and you can use the zone file as a source for the inmemory datasource

Stephen: only going through an intermediate database

Michal: you don't need that.

Shane: we probably need a special case.

Shane: it probably needs to be a synchronous notify, so it can also send data back to the stable database.

Michal: but there is no guarantee.

Jinmei: so how do we ensure consistency between original and in memory?

Shane: that is why I propose the synchronous model, so when there is an update to the "disk space" datasource, it sends an update to the in memory, which is a process, waits for the reply, indicating the update, and only then is the process complete. This will also allow us to do other things in the future.

Jeremy: I don't think the current msgq can keep up with this. Which is why we may replace it.

Shane draws the current design plan on the notepad (see photo)

Shane: updates are *really* slow in BIND 9, I think using a real SQL database in the back can buy us a lot.

Jinmei: this might be faster than writing to disk directly?

Shane: I think maybe. SQL people have worked very hard to get their writes fast.

Michal: they tune the performance toward parallel updates

Shane: the trick here is the SOA update which has to occur with every update

Jeremy: no, it can be trained to every 300 seconds.

Shane: so that would be a lot faster, yes

Jelte: I was thinking of a shorter time, but yeah, we would do that

Shane: in DHCP they queue up the answers and synch them periodically.
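
One reading of the exchange above is that updates are queued and the SOA serial is bumped once per batch rather than once per update. The Python sketch below shows that reading only; the class name, the `start` parameter and the use of 300 seconds as the interval are assumptions for illustration.

```python
class BatchedUpdater:
    """Hypothetical batcher: queue updates, bump the SOA serial once per flush."""

    def __init__(self, serial, interval=300.0, start=0.0):
        self.serial = serial
        self.interval = interval   # seconds between flushes (assumed value)
        self._queue = []
        self._last_flush = start

    def submit(self, update, now):
        """Queue an update; flush if the interval has elapsed."""
        self._queue.append(update)
        if now - self._last_flush >= self.interval:
            return self.flush(now)
        return []

    def flush(self, now):
        """Apply all queued updates under a single serial increment."""
        applied, self._queue = self._queue, []
        if applied:
            self.serial = (self.serial + 1) % 2**32  # SOA serial is 32 bits
        self._last_flush = now
        return applied
```

The trade-off is the one raised later in the discussion: batching is cheaper, but an update may wait up to one interval before it becomes visible.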

Jinmei: does this architecture have its own bottlenecks in it, and in the worst place, does the request from DDNS block the auth server from responding to further queries?

Michal: this is where we need the good msgq.

Shane: there are potential bottlenecks. I think with this model though, it's a bit like microkernel architecture, you can throw it away if it's a problem.

Jinmei: in the case of IXFR with NSD, I thought that it does periodic updates, like every 30 seconds or something. Not update immediately upon receipt.

Shane: in the update I worked with it in we did updates every minute.

Jinmei: so it can combine incoming updates. In that case I don't know if it also makes sense for dynamic updates. Especially if the update rate is quite high.

Shane: could be.

Jinmei: I simply don't know.

Shane: batch processing can be a lot more efficient but with DDNS it may be difficult to ensure fairness.

Likun: we need to think about the lightest uses, like a user who just needs to start an auth server, and we don't want the model coupling too much.

Shane: I think we can easily hide all this from the user. You just load a zone file in and start. That should be the default. If you're not configured as a secondary we should not start xfrin or zone manager modules. Automatically.

Y3 deliverable: logging

Stephen: log4cxx did not work, so now logging just goes to stdout and that's it. We have to decide what we want to do with the logging. Do we go to another existing package or do we write our own? The log4cxx has the advantage that you can create independent loggers with individual characteristics. So for one module you could have detailed logging and very primitive logging for another. It also provides multiple levels and destinations. My principal reason for choosing it was that it is already there. If however, we decide we want to implement our own, we need to do everything log4cxx does now plus... ?? So.... what do we do?

Jinmei: Is it true that FreeBSD doesn't have sufficiently new version of Log4cxx?

Jeremy: it was not in the packages collection.

Jinmei: which version?

Stephen: I downloaded the Ubuntu version and that was 0.9.8.

Jinmei: on my laptop is 0.10.0

Stephen: the version we had problems with was a 0.9.x and the issue was they changed an underlying strings thing due to a windows issue.

Stephen: we could leave it logging to stdout for the OSes it doesn't work on and hope that upcoming versions fix this? Log4cxx comes from Apache, but there are others.

Jelte: SyslogNG has its own API?

Stephen: so what is a simple logging system? Log4cxx is really complex, and realistically you don't need this.

Jeremy: BIND 9 logging is hard to configure, but it does have a lot of features

Stephen: this is part of why I wanted Log4cxx, because it seems to have the features people want

Jeremy: what about log4cplus? It is just run by one guy (http://log4cplus.sourceforge.net/)

Stephen: yeah that makes it a non starter

Jeremy: or we embed and maintain.

Stephen: going back, would it be right to go along the same lines as log4cxx, but not as flexible, in our own implementation.

Jelte: I would be fine with that

Jeremy: I don't think I want us handling log rotations or anything like that

Stephen: so we need to either base on an existing package or...

Stephen: principle of least surprise, do we make it do what bind 9 does?

Jinmei: maybe it's sufficient to use the features in the operating system support - but I also think it makes sense to have a minimal version that does *not* rely on something like Log4cxx, as Jelte said.

Stephen: ok, compromise. we write a minimal implementation, no log rotations, goes to a few specific locations, and we have the option for plugging in log4cxx later for people for whom it works with their OS.
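
The property Stephen liked in log4cxx (independent loggers per module, each with its own level and destination) can be illustrated with Python's standard `logging` module. This is only a sketch of the behavior, not a proposed BIND 10 API; `make_logger` and the module names passed to it are illustrative.

```python
import io
import logging


def make_logger(module: str, level: int, stream) -> logging.Logger:
    """Return a named logger with its own level and a single destination."""
    logger = logging.getLogger(module)
    logger.setLevel(level)
    logger.propagate = False                 # keep modules independent
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)  # no rotation: left to the OS
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


out = io.StringIO()
resolver_log = make_logger("b10-resolver", logging.DEBUG, out)  # detailed
auth_log = make_logger("b10-auth", logging.ERROR, out)          # terse

resolver_log.debug("cache miss for example.org")
auth_log.info("zone loaded")         # suppressed: below ERROR
auth_log.error("zone load failed")
```

A minimal home-grown implementation would need roughly this surface: named loggers, a severity threshold per logger, and a small fixed set of destinations.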

Others: we also looked at glog, logging for C++ by Google. It didn't have documentation with it.

Y3 deliverable: IXFR-out

Y3 deliverable: XFR-in

Y3 deliverable: DNSSEC validation

Jeremy: really need a specifications document.

Stephen: Q: are we really trying to do requirements or design? (A: neither at the moment.)

Michael: can approach this by supporting one algorithm initially.

Shane: can decompose. e.g. know about trust anchor management, could document that. However, really do need to understand this before we start writing code.

Michael: corner cases make life very complicated. Also, validation is a combination of top-down and bottom-up validation. Odd cases where you can almost reach it one day, then have to go back and read data another day. If I plead for one requirement, it's to make it easily validatable.

Michael: many recent bugs in BIND9 due to different trust levels of data. Suggest having two caches, and copy between untrusted and trusted data.
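
Michael's two-cache suggestion might be pictured as in the Python sketch below. The class and method names are hypothetical, and the `validator` callable stands in for the whole DNSSEC validation process, which is not modeled here.

```python
class TwoLevelCache:
    """Hypothetical split cache: answers are served only from the trusted side."""

    def __init__(self):
        self._untrusted = {}
        self._trusted = {}

    def store_unvalidated(self, key, rrset):
        self._untrusted[key] = rrset

    def promote(self, key, validator):
        """Copy an entry to the trusted cache if validator(rrset) accepts it."""
        rrset = self._untrusted.get(key)
        if rrset is not None and validator(rrset):
            self._trusted[key] = rrset
            return True
        return False

    def lookup_trusted(self, key):
        return self._trusted.get(key)
```

Keeping the two stores separate means data of different trust levels can never be mixed by accident; the only path into the trusted store is through `promote`.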

Vorner: Suggest we walk chain from root each time and check - don't need to do crypto every time.

(Discussion on validation procedure: Problems when elements of the validation chain have different TTLs. Hardest cases come when something is wrong - remember "roll over and die". Stress the need to have every corner case as a test case.)

Michael: Really do need a specification/design document - need to document Mark Andrews' experience. Can see it becoming a best practice document.

Jinmei: can't document corner case here.

Michael: how easy is it to issue queries?

Jelte: not too difficult.

Michael: need to do fetches in parallel.

(Discussion on when to issue queries for DS records.)

Michael: biggest problem in BIND-9 is retry time and retries.

(Discussion on what to do with insecure responses.)

Shane: Will task Jeremy to produce document describing validation process. Will need to get periodic updates on the document - say every two weeks.

Jeremy: Will work with BIND-9 developers to document existing code.

Jinmei: Q: 5011 support?

Michael: A: Yes - is critical.

Jinmei: Q: DLV support?

Michael: whether or BizOps? says we need it.

Shane: Need to support it for next year.

Michael: Why do we need it? (If parent does not support DNSSEC)

Jelte: Nice but not essential (Shane: agree. Michael: recommend we don't implement it - nasty hack needed before root was signed; expect fewer zones will be signed with key here.)

Conclusion - does not make sense to implement DLV now.

Refactoring ASIO/Event Driven or Threaded Model?

We need to talk about how we're going to refactor the ASIO code, or at least the coroutine style. It's hard to work with.
There has been a suggestion to use non-preemptive threads for processing. We need to decide if this is worth pursuing, and what it would mean if we did.

Event Driven or Threaded Model

External Assertion: event driven not good for high-performance server.

With threads, have problem about concurrent access, and scaling gives problems? (Assertion - no way to make program thread-safe?) Proposed that real problem with threading is concurrency. Proposed that threads operate one at a time. Way to do this is co-operative multi-tasking(?)

(Discussion on ASIO and coroutines. Recommendation to remove coroutines)

Jinmei: if we use event-driven model, don't see reason to drop ASIO.

Michal: thread model can be used, but will hit problems with it.

Conclusion: can get rid of coroutines with relatively little effort.

Q: What version of ASIO do we use and are we updating it? A: Jeremy will check.

Q: do templates give code-bloat? A: Stephen will investigate.

Shane: Threaded code may be simpler to read, but interface provided by pthreads is not easier to read. However, have problems with things like cancel.

Michael: multiple threads but only one thread running at a time.

Shane: proposal from the comments was state threads (Apache project). (Discussion of the state threads model)

Vorner: potentially only single core, but multiple processes. This does not appear to support multiple (real) threads.

Vorner: Believe that we can run multiple processes for authoritative server. But will need multiple threads to run resolver.

Shane: What about current code? Proposal is that we won't pursue this now - event driven code is easy enough to read.

(Discussion about general multi-threading issues.)

Michael: authoritative server can be done multi-process (although there is a lot of interaction in the data base). Recursive server has too much interaction.

Wednesday, 2011-03-23

Unit Testing: How to Do It (Medium)

We should talk about our unit tests, and where and how we draw the line on testability. Some things are hard.

Shane: our general rule is we test everything. There are cases where that is really hard. I have to say, though, some places I thought it would not be possible, it was, with refactoring. Do we have examples of places we don't have tests now because they're too hard? Assuming we don't test the libraries we rely upon.

Jelte: I have one test that doesn't actually do statistical test on the QID but it does test that it doesn't get the same QID a few times in a row.

Michael: a random number doesn't mean you never get a repeat.

Jelte: which is why it does a few checks in a row.
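
The kind of check Jelte describes (not a statistical test, just making sure the same QID does not come back several times in a row) could look like the Python sketch below. `next_qid` is a hypothetical stand-in for the real generator, and the number of rounds is an arbitrary choice.

```python
import secrets


def next_qid() -> int:
    """Stand-in for the real QID generator (hypothetical)."""
    return secrets.randbelow(0x10000)


def qids_look_random(generate, rounds: int = 5) -> bool:
    """Return False only if `rounds` consecutive draws all repeat the first.

    A single repeat is legitimate for a random 16-bit value, so it is
    tolerated; several identical draws in a row point at a broken generator.
    """
    previous = generate()
    for _ in range(rounds):
        current = generate()
        if current != previous:
            return True
        previous = current
    return False
```

This addresses Michael's objection: a correct generator can still fail the check, but only with negligible probability.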

Michal: the part of the code without many tests is the TCP and UDP servers.

Jelte: msgq is also insufficiently tested.

Shane: that is one area that is quite difficult - when you interact with the external environment

Michal: I don't think that's why they don't have tests - they were written at the beginning, before the strict policy.

Shane: For things like that, we could create our own descendent of the listening classes themselves, and use that for testing somehow.

Michael: the Samba folks have a full virtual networking layer that lets you inject any format you want without using a networking stack to do it.

Michal: you could use the loopback interface

Shane: how do you cause bad behavior then?

Stephen: the problem is testing how it fails

Shane: if our code is structured so anything that doesn't succeed goes to the same code paths, this matters less.

Michael: if you remove the network part of the unit testing, it's more reliable.

Jinmei: what is the goal of the topic?

Shane: to discuss where we are failing to make unit tests, how to fix it, what we can do about it?

(looking at an example of tcp server code)

Shane: it's easier to instrument Python code for testing than C++.

Stephen: if you're writing your C++ code and you want to point to something different for testing, build it into the object and put in a flag, so the production code includes code for testing, and I think that's valid. It's like an automobile with diagnostics for maintenance.

Michal: you could inject the tests with templates if you don't want the test code compiled in?

Shane: possible.

Jinmei: we can also use some higher level abstractions, by introducing class hierarchy just for the purpose of tests. There are techniques, but it is true it will be more difficult.

Shane: it's early binding which makes it more difficult.

Jinmei: I don't think that is the essential difficulty.

Michal: the places we don't test are sometimes main functions.

Jinmei: One possible good thing is to have a wrapper layer - then we can separate the dependency - so we can test the code using the network related things.

Shane: so add an indirection layer?

Jinmei: right. Then we can use a fake certificate, fake network communication, etc. Then we can test all of these other things with the ASIO wrapper.
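As an illustration of the indirection layer discussed here, a minimal sketch in Python (chosen for brevity). The names `FakeSender` and `Forwarder` are hypothetical and are not BIND 10 or ASIO classes. The idea is only that code written against a small interface can be tested with a fake that records packets and can be told to fail, which also gives one answer to the question of how to cause bad behavior.

```python
class FakeSender:
    """Test double standing in for a real network sender (hypothetical)."""

    def __init__(self, fail=False):
        self.fail = fail
        self.sent = []          # (data, address) pairs recorded for assertions

    def send(self, data, address):
        if self.fail:
            raise OSError("simulated network failure")
        self.sent.append((data, address))


class Forwarder:
    """Code under test: depends only on an object that has send()."""

    def __init__(self, sender, upstream):
        self.sender = sender
        self.upstream = upstream

    def forward(self, query):
        """Return True if the query was handed to the network, else False."""
        try:
            self.sender.send(query, self.upstream)
        except OSError:
            return False
        return True


ok = Forwarder(FakeSender(), ("192.0.2.1", 53))
broken = Forwarder(FakeSender(fail=True), ("192.0.2.1", 53))
print(ok.forward(b"query"), broken.forward(b"query"))
```

In production the same `Forwarder` would be given a sender backed by the real network wrapper, so no separate code path is executed for testing.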

Jelte: so we already have the layer, but if you replaced it you'd be rewriting much of ASIO. If we have that layer and we don't use ASIO directly, we use ASIO link. But if we replace it for testing, we'd have to replace all the functionality.

Michal: we only have to replace some specific network parts.

Stephen: you can inject packets, but if you have a fake where you replace a routine to write packets to the network, the routine has to do a callback, and it replicates a lot of effort. I think it's really only the servers we haven't really tested.

Michael: have you tested the client query stuff?

Shane: no, we don't check for it.

Jelte: we do test the resolver behavior.

Likun: can we look at ASIO's test code?

Jeremy: I was just looking, ASIO and Boost have unit tests. Maybe we can work with them.

(Shane brings up boost.org and a google search for ASIO and Boost tests)

Shane: we should research this

Jinmei: at least in theory we should be able to test all parts but the wrapper itself, but some things heavily rely on the core ASIO. Another thing is that if the wrapper itself is very trivial, we can maybe skip that - it will simply mean testing with the external library itself. If the wrapper is difficult, then it needs tests

Shane: infinite regression!

Michael: if they test the ASIO stuff, they've got to have a way to do this

Jeremy: check out http://www.boost.org/development/tests/trunk/developer/asio.html for example: http://svn.boost.org/svn/boost/trunk/libs/asio/test/basic_datagram_socket.cpp

Michael: I think client behavior is trickiest.

Shane: do you mean the resolver?

Michael: yes.

Shane: we will test packet drops, packet delays, incorrect answers, etc, but we wont test UDP checksum errors etc.

Michael: of course not.

Michal: we will have the demultiplex thing, so we will test on that level. Right now the client in the resolver is... temporary, right?

Shane: part of it is. the demuxer is a layer in front of that.

Jelte: yes. Right now the resolver issues its own queries and it would ask the demuxer to do the sending of the actual packets.

Michael: do you use the system resolver to send notifies?

Shane: yes

Michael: thats the right way. don't change that.

Shane: for unit testing, for new code, there should be *no* new code that you can't write a unit test for. If you can't figure out how to write the test, speak to the team. I was trying to do some BOSS work and I couldn't figure it out and tried functional tests and then Michal asked if I needed to and I realized I didn't. So that works.

Larissa: and people can mention it in a daily scrum call if they're stuck on a test

Shane: yes

Larissa: arent we also doing TDD?

Shane: yes.

Larissa: and arent those unit test?

Shane: yes, but people get stuck, so they dont write the test, they just write the code.

Michal: what about refactoring?

Shane: if you refactor the code you refactor the test. If you're writing a sort function and then you refactor, even though its internal, or private, you refactor the test.

Stephen: you test *all* the code you write.

Michael: if you dont test functions, you have to write more tests from the outside. If you have internal tests, your tests are less fragile.

Shane: you have to test the function somehow.

Michael: right if you know you tested that then at the higher level you can trust that its tested, its opaque, and thats okay.

Jinmei: I got lost. This is about testing private things? I am afraid there is no single universal solution to this problem. I think we need to use our discretion.

Stephen: the simple way is to make it protected.

Michael: we had this discussion in another BIND 10 meeting, that we will allow other people to shoot themselves in the foot, if they want to mess with this stuff. Why make it private?

Shane: private is an *advisory notice*, not something we use to prevent.

Jelte: I thought the decision back then was to not change our interfaces for testing.

Stephen: as the code becomes more complex, why not put in code that is just for testing?

Shane: I think we were saying we didn't want different code executed for testing.

Michael: the plan for BIND 9 is to be able to compile a test version that's static. We also have to rename functions in BIND 9, but you're protected from that with c++.

Stephen: if you access something protected for test use, but have it set to private for regular use, and there is a macro, it just wont compile if you try to compile it for real environment not test environment. It will compile for testing only.

Michael: it's worth trying this and seeing how it goes, but it may end up you just need a comment or something.

Michal: people don't read docs/comments.

Shane: its good to use the standard way the language is normally used.

Larissa: project goal of understandable hackable code....

Jeremy: should we focus time on getting better coverage? We have some specific areas with poor coverage.

Shane: I think this has been getting better. Except msgq. And BOSS. But these have a refactoring scheduled.

Jeremy: bindctl and xfrout and xfr library need more tests. The datasource master. We knew, but it needs all testing done.

Shane: we will also be refactoring datasource soon. I hope.

Jeremy: there are a few things.

Shane: all of these places will be touched within the next 3-6 months, so the question is should we expand the scope of the changes to also add tests.

Jeremy: I would guess yes, because otherwise people will only test what they are writing

Michael: and its always better to have tests first.

Jinmei: I think in general we should care about test coverage but should we introduce specific action to address this concern?

Shane: we have two pieces of work scheduled that will affect xfrout daemon. So we can schedule another task before those that is for writing tests and the relevant refactoring.

Jinmei: there are some other cases that are normally considered difficult to test. Database related things. That would be moreso when we add more backend databases. I anticipate some excuses and reasons we cant test in this instance.

Shane: I think the tests we have for SQLite now are a little broken.

Michael: you have to run the relevant server to test the specific backend, which can eventually not scale.

Jelte: it would be nice to have a generalized datasource functional test suite.

Michal: isn't there some kind of general database library where you send SQL but it doesn't matter which server is there?

Shane: I looked at this 7 years ago and the answer was no because once you do anything non trivial, things vary really a lot.

Michael: databases are becoming more standard now.

Shane: it's the details of how things work within databases that are really different. Jelte and I looked at the SQLite schema, and normally where you would expect a between command to work, it doesn't work there. That's a really simple thing. Some systems don't support nested selects, etc and so forth.

Jelte: we need to have some high level tests, functional tests, that run on any datasource.

Michael: unit test what you can, don't unit test what you can't, in this instance.

Jinmei: we can't solve this today, but this way we are prepared for the case.

Michal: some people do not want the SQL backend to be compiled at all, and some will, and they will have SQL running anyway, and will want to test it, and we want to test it.

Jinmei: another point: time related tests?

Michal: for some of the time related stuff we could provide our own function that gives the time. And then the time moves. We could put it in a common library.
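A minimal sketch of the "our own function that gives the time" idea, in Python. `FakeClock` and `ExpiringCache` are hypothetical names used only for illustration, not existing BIND 10 code. The code under test takes its time source as a parameter, so a test can move time forward without sleeping.

```python
import time


class FakeClock:
    """Clock whose time only moves when the test says so (hypothetical)."""

    def __init__(self, start=0.0):
        self.now = start

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class ExpiringCache:
    """Toy cache that takes its time source as a parameter."""

    def __init__(self, ttl, clock=time.time):
        self.ttl = ttl
        self.clock = clock
        self.entries = {}

    def put(self, key, value):
        self.entries[key] = (value, self.clock() + self.ttl)

    def get(self, key):
        value, expires = self.entries.get(key, (None, 0.0))
        if self.clock() >= expires:
            self.entries.pop(key, None)   # expired or never present
            return None
        return value


clock = FakeClock()
cache = ExpiringCache(ttl=30, clock=clock)
cache.put("example.org", "192.0.2.1")
clock.advance(29)
print(cache.get("example.org"))   # still cached
clock.advance(1)
print(cache.get("example.org"))   # expired, with no real waiting
```

Because the default argument is the real clock, production callers would not need to change; only tests pass in the fake.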

Jeremy: I am just wondering how important these things are. I don't know what all the tests are but 5 time tests have been failing. I don't know how important they are. Would bind10-auth or resolver fail on a virtual machine?

Jinmei: possibly. Even forgetting about VMs, time related tests are tricky.

Larissa: at BayLISA multiple operators asked me if we are optimizing for VMs.

Jinmei: I think even if its ugly, its much better to test it than not test it - but its not so sophisticated.

Michal: one of the tests that failed, is a test where somehow I created a msgq core and a client, and tried to see if the traffic will arrive, and I put a timeout there. There is no timeout in real life, but if its stuck forever... I put a timeout there I thought was large enough but it turned out it was not.

Michael: we also have to start considering timing involving DNSSEC validation stuff. Then you have to plan time tests involving months.

Larissa: Francis wrote some sort of time machine meant to help with that.

Michal: we don't want to ask for the time once we are computing, but we ask so many times, and the time only differs by milliseconds.

Michael: but how do you know?

Michael: BIND 9 has two useful things - one, once a test starts, gettimeofday locks down. Second, Francis wrote this time library with an exponential curve that crushes 30 days into 15 seconds. There are some tests you can do that are helpful that way. Particularly for functional tests. Its a library that you can use. Compiled in for some things.

Jeremy: to finish my point, once we know the test is what we want, and it still fails on virtual machines, maybe its the code that needs tuning not the test.

Shane: sometimes its really not the code causing the test to fail.

Shane: also about timing, every time we add time to a timing test, it adds waiting when I type make check. Sometimes you need a small wait, but they add up over time.

Stephen: then get a biscuit with your coffee.

Shane: but in a year or two, will it take three hours? Lets think about this as we write the tests.

Michael: eventually maybe we can get tests running in the background. make test running continuously on the laptop.

Shane: I run make check across the whole system when I do a review.

Michael: it takes 8 hours to run the tests on BIND 9. Don't ever get there on BIND 10.

Jinmei: Can we make a rule for this? Timing tests? We may want a generic framework for faking time.

Stephen: can we pull across Francis's work?

Shane: for functional testing. For unit tests we need arbitrary time values.

Jinmei: regarding tests taking time, there are several issues. In general taking time for tests is a bad thing because it makes people skip running tests. So one question is whether we want to avoid that. I personally think it's better to run the tests.

Shane: could we flag time related tests?

Jinmei: there is not a general flag but we could include time in the name and separate them that way.

Shane: is it possible in google test to run tests in parallel?

Jinmei: maybe

Michal: I don't think so. But we have many test programs, they could run in parallel, but I worry that they use ports.

Michael: we can't run all our variants in BIND 9 in parallel. We have to stop unit tests to run specific tests and then remember to turn them back on, and it sucks. This is why I recommend looking at what Samba does.

Jelte: I think this is also what Unbound does.

Michael: if you don't use ports, you can run in parallel.

Michal: it would work if we didn't use auto tools.

Jinmei: so we could introduce a filter for longer duration tests. The other thing is that I would suggest using smaller timeouts as much as possible. That also means we may want to change the API so that it will take a millisecond granularity.

Shane: which API?

Jinmei: an example would be the cache timeout for Hot Spot Cache. It is set to seconds which makes sense functionally but not for tests.

Michael: google test does not run tests in parallel and has no built-in ability to, but it does support naming pattern sets. So if you named things "slow" or "fast" you could break down some tests.
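The naming-pattern idea can be sketched for the Python side of the code with the standard `unittest` module. The class names and the convention of putting "Slow" in the name are assumptions made for this example; for the C++ tests the equivalent would be a name pattern given to google test's `--gtest_filter` option.

```python
import unittest


class CacheFastTest(unittest.TestCase):
    def test_lookup(self):
        self.assertEqual(1 + 1, 2)


class CacheSlowTest(unittest.TestCase):
    def test_expiry(self):
        self.assertTrue(True)   # imagine a real wait here


def select(cases, include_slow):
    """Build a suite, skipping classes with 'Slow' in the name unless asked."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for case in cases:
        if include_slow or "Slow" not in case.__name__:
            suite.addTests(loader.loadTestsFromTestCase(case))
    return suite


quick = select([CacheFastTest, CacheSlowTest], include_slow=False)
full = select([CacheFastTest, CacheSlowTest], include_slow=True)
print(quick.countTestCases(), full.countTestCases())
```

The quick suite is what a developer would run on every build, with the full suite reserved for reviews and automated builds.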

Jelte: lots of projects do "make test" or "make all tests"

Shane: then people never run "make all tests" - I want there to be pressure against avoiding tests

Jelte: except that if the tests take sooooo long people stop running them at all. Just run the tests you are interested in. You can specify which tests run with which features too.

Jinmei: in any case my approach would be to have high level techniques to shorten the time we need for tests, and to have that concept in the review test, so if the reviewer can check the time of the test and bring it up if its long...

Shane: someone add that to the review process now!

Functional Testing: How to Do It (Medium) This is testing at a higher level. We have had some brainstorming about this at the end of Y2 during our mad testing phase, but we need to formalize our work here.

Shane: testing is one of those things where getting the terminology right is tricky. In our project we understand unit testing but we have no or nearly no functional testing. In our case we mean running the software as a system and seeing what happens.

Jeremy: I have a few ad-hoc scripts for server start, loadzone, xfrin, dig, etc

Shane: unlike unit testing we want to do this at the system level, right? Do we want to define it by module?

Jinmei: what?

Shane: do we want to define tests for cmd-ctl, or just for configuration, etc

Stephen: if you list requirements, there might be functional tests that correspond.

Shane: a note for jeremy, we need to at least identify which tests cover which areas of the functional dns specification.

Michael: how will you write specifications? Is it a user story format?

Stephen: I think we're talking about the same thing. Every requirement should be testable.

Michael: the reason I like user stories is because it focuses you on the user focused outcome.

Stephen: except we write from RFCs

Michael: BIND 9 was written in RFCs... and the user interface...

(discussion about what user stories are)

Michael: the idea that a user story translates to a functional test is very useful.

Shane: let me pull up an example.

http://bind10.isc.org/wiki/MasterOfBindRequirements

This has functional and programmatic requirements.

Shane: assuming we have a framework to execute tests on a functional level, who writes the tests, when do they get written, and do we have a document to track them?

Stephen: whether we use user stories or requirements statements or a combination, how do we test it?

Michael: you can do a "work in progress test" where a test you're going to add goes.

Stephen: the reason why this business about the requirements came up is that DNS is specified by many RFCs plus we have BIND 9 compatibility.

Michael: can the requirements be generated from the test suite, or are the requirements their own document?

Michal: I would rather have them in the same file, from the developer point of view.

Michael: this is what I would recommend. But there is one catch - you end up with one functional spec, but 40 tests for one functional spec. Numbering can get weird.

Jeremy: let's say I write 700 statements. They are a few sentences each, and I attribute them to source code or RFCs. I can put it in XML, parse it out, generate HTML, whatever, and point to URL in the test cases?

Shane: in XML it will generate directories, it could even generate test stubs.

Michal: then someone has to write the test, and they can put a comment that links to the specification. But when the test fails the error message should indicate what the test tried to do.

Jeremy: I have this document and then changes go along and we change a requirement, then we change tests?

Michael: but we're talking about having the descriptions in the tests. So the master file is that XML document. How do we structure this and is there a tool that will do it?

Shane: there are probably 700 test frameworks that academics have written.

Jeremy: I think we should try one of the three python cucumber clones.

Michael: you can use either one, you don't end up writing much code in those. It's very verbose, English language type testing. It's really driven for user stories.

(looking at http://cukes.info)

Michael: I experimented with this and I liked it, but I don't think it would be easy to get BIND 9 people to do it. You would be more able to do this because you're just starting to implement functional tests. Also this is a very good format for developing tests progressively.

Shane: I'm trying to think of corner cases. How would this work for say a key rollover in DNSSEC. There are a lot of ways to *do* a key rollover. Do you document them all?

Stephen: there are a sequence of tests. "Given I have put a DNS key in the zone and I have waited xyz I should see xxx"
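To make the "Given ... I should see ..." style concrete, here is a toy step runner in Python. It is not the API of cucumber or of any of its Python clones; the `step` decorator, the `run` function, and the key-age scenario are all hypothetical, and the scenario does not model a real DNSSEC rollover.

```python
import re

STEPS = []   # (compiled pattern, function) pairs for this toy framework


def step(pattern):
    """Register a function as the implementation of a plain-text step."""
    def register(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return register


@step(r"Given a zone with a new key (\w+)")
def add_key(world, key):
    world["keys"] = {key: 0}          # key name -> age in days


@step(r"When (\d+) days have passed")
def wait(world, days):
    for key in world["keys"]:
        world["keys"][key] += int(days)


@step(r"Then key (\w+) should be at least (\d+) days old")
def check_age(world, key, days):
    assert world["keys"][key] >= int(days), f"{key} is too young"


def run(scenario):
    """Run each line against the first matching step; fail on unknown lines."""
    world = {}
    for line in scenario.strip().splitlines():
        line = line.strip()
        for pattern, func in STEPS:
            match = pattern.fullmatch(line)
            if match:
                func(world, *match.groups())
                break
        else:
            raise LookupError(f"no step matches: {line}")
    return world


world = run("""
    Given a zone with a new key K1
    When 30 days have passed
    Then key K1 should be at least 30 days old
""")
print(world)
```

The scenario text stays readable by non-developers, while each step is a small function that can be reused across scenarios.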

Shane: and I guess we choose how we implement this.

Jeremy: in some situations we start one server and run many tests. In other situations we run multiple servers to run one test, and stop between, etc. How does that work?

Michael: its just. slow. You can set it up to specifically track and kill processes, etc. I also have things I call "meta sets". It knows what having a dnssec implementation with 3 masters means.

Jeremy: the good thing for us as we create these rules, if it doesnt work right, we can fix BIND 10.

Michael: I would love to be able to run the same test suite against BIND 9 for things that make sense.

Shane: like tests where we change config engines would be different.

Shane: so getting to implementation, I think finding a Python cucumber clone would make sense. In the past I would have asked Jeremy to look for that, but will you have time?

Jeremy: I would like to but I would only have a couple of days to look. I would also like Jinmei and Michael to explain the systest that is in BIND 9 now.

Jinmei: its basically lifted from BIND 9's system tests.

Shane: is this an executable program?

Jinmei: for now its a shell wrapper thing. You can look at the source code.

(team looks at test.sh in bindctl)

Michael: yes this is very much like what BIND 9 does, it's disgusting, but it works.

Jinmei: yes this was a quick hack to get some testing done before a release. We can throw it away or enhance and integrate it. Or I don't know.

Michael: the one problem with BIND 9s system tests is that you really want to start the server, issue a query, do a specific thing, shut it down, do the next one. BIND 9 starts, does a lot of tests, and then shuts down. Its not as clean of a test. Its expedient in some cases but its not good test methodology.

Shane: this may depend on the kind of test.

Michael: one improvement I want is, the way you make a test is, you find one that does something like you did and you copy it. Refactoring to a library for common use cases would be better. This could be shared between BIND 9 and 10.

Shane: so.... yeah. I don't even know if we would port these, maybe we would, but they should reflect a requirement. We will have requirements that arent in the DNS spec. Like statistics, etc.

Stephen: we need to make an assessment, as to how much is automated, a couple of things may not be worth it.

Shane: we may need at least two documents. One is a DNS specification but the other is other related things.

Michael: in cucumber you can tag them, so we can have a set of RFC compliance specific tests, statistics specific tests, etc.

Michal: can you have a test that has no requirement?

Shane: no, actually, there needs to be a requirement or why is it there? You need to say what happens if you start a server when its already running? etc.

Michael: remember how we're doing unit testing. Once something runs cleanly you can rely on the unit test.

Shane: this also applies backward pressure on developers to avoid adding cool features that no one asked for.

Shane: we may have to have developers do some of the research on test frameworks and set it up.

Michael: maybe 3 people each research one and bring it to the engineering forum for 15 minutes.

Shane: hmm...

All: maybe we do this in a bind 10 staff meeting and then present the decision.

Jeremy: there are 3 python based cucumber clones, and maybe we can just look at those.

Michael: ATF is an option too. It spits out XML.

Jeremy: and I know the ATF developer.

All: hmmmmm.

Shane: okay. Jeremy, if you have time over the next two weeks to figure this out, then cool. If not, we'll flag it, and we'll get other resources onto the solution.

Jinmei: what do we do with the existing test framework?

Shane: will we need to add tests in the next two weeks? We don't know.

Michael: did the tests you ported over from BIND 9 find problems?

Jinmei: yes I did

Michael: then I would continue with this and prioritize for importance and ease

discussion of existing tests written against dns-python and what to do with them? Should we rewrite to use our own library or not?

Jeremy: can we set goals for the year?

  • Jeremy will research test frameworks and not spend more than 3 days
  • set our functional test framework by end of May
  • develop xx number functional tests or % of tests by end of y3 - for example (100% P1, 50% P2, 0% P3)
  • Jeremy will share his list of requirement/stories with Larissa sprint by sprint and she will set priority with guidance from the team (we will see if this works, resource wise) - developers write test implementations and they are reviewed with code.

Testing Suites (Medium) In addition to functional testing, we may want to include several other types of testing suites such as Tahi (for example, performance).

Shane: Jeremy looked at Tahi, which is an IPv6 thing with close ties to the WIDE project.

Michal: I looked at it, its for testing IPv6 infrastructure.

Jeremy: it seems like the scripts and requirements are not generated automatically, but I've never set up the platform.

Michal: It seems like you need a complete laptop setup and you need to change your environment to run it. They provide their own DNS server and client. If I understand correctly, they are checking to see that the network runs DNS, not that the DNS server runs.

Jeremy: it might be useful, but the setup time might be high. 2-3 days at least to set up virtual servers.

Shane: the main use of it is probably to tell people we run it.

Jinmei: I can talk to the developers of it, I know them.

Shane: the coolest thing would be if there is an existing lab we could use it in. CNNIC is using it.

Jeremy: we could ask Cathy that.

Shane: if Jinmei could talk to the developer, that might be best.

Jinmei: if we are very lucky they may be interested in testing BIND 10, but I don't know. I will ask for general advice.

Jeremy: there is another test suite called Protos that is a java based conformance suite.

Michael: there is a huge set of people writing test suites. It's a service model.

Shane: maybe OARC could ask people... lets ask dnssec-deployment what suites they are using for dnssec conformance? Shane will ask.

Fujiwara-San: I made a specifications document I will share with the team.

Larissa: that document was excellent and may be useful to Jeremy's requirements doc as well.

Shane: there is also non functional testing, you can convert a lot of it to functional testing. But for performance benchmarking you really want a chart or a list.

Jeremy: our current tests are not automated because there were always failures.

Shane: it would be really nice if we could include that testing in our test suites so the team can run the tests.

Jeremy: some of it will be duplicated by what the functional tests do. So I am wondering if I should move it into the functional test layout.

Shane: maybe see if any of the functional test framework supports performance benchmarking. Or we could also have timing reported for all our tests and tag things for performance specific tests.

Jeremy: we also have Jinmei's microbenchmark testing that is a bit like unit tests. I dont think people use them outside development.

Jinmei: they are not for regular use, they are for when you want to introduce an optimization to see if you actually improve performance.

Jeremy: my concern is maybe people don't know about it.

Shane: what about Stephen's fuzz testing?

Stephen: yes I am planning to expand it actually.

Shane: what we want to do at some point is leave fuzz testing running for a weekend prior to release. We will want to include that.

Stephen: it's in the experimental branch for now.

Shane: there is a test directory off main. It can go there.

Jeremy: Fujiwara-San also has a fuzz tool that fakes traffic.

Modularity & Hooks (Medium)

Michal notes:

I proposed it some time ago on the mailing list, some people looked at it, I got a few comments from a few people, but we should talk more widely if we want something like this. If so, we should start using it ASAP, because it could ease some development or at least lower the need to refactor later.

The ideas are here: http://bind10.isc.org/wiki/modularity

Michal: I would like the user to be able to not just add behavior but also remove the default behavior to replace it with theirs. We would build a whole system for the hooks, and it would have advantages for us as well, where we can generalize a library that does listening on the network.

Larissa: so are you saying make all the existing process modules act like hooks?

Michal: yes.

Stephen: One of the things about hooks and putting data out and pulling it in is the data is basically self contained. As soon as you start doing processing, you're accessing internal data structures, and that complicates things. If you want to change data in the cache, do you put hooks into the cache?

Michal: I would make the cache itself a hook.

Stephen: I see hooks more as a set of well defined points where you can change specific simple things in the code.

Shane: explain more please.

Michael: is this a hook or more like a filter?

Michal: I don't know exactly what to call it, its a bit like Apache.

Shane: ok, so..... I can see how this could be fairly straightforward in our event processing today.

Michal: so then you build the server at runtime from the parts.

Shane: so basically when we get an event we do things and at the end we register a callback to another thing. We could change the callback to be what the user wanted, which would fit with this model.

Jelte: we kindof discussed this before, but currently we have two callbacks, dns-lookup and dns-answer, and if we made that a configurable list of dynamically available callbacks, maybe that would work?

Michal: I want the callback to be able to modify the data. You could say "this is bogus, drop it" or "Stop processing, servfail" or...

Michael: in asterisk call forwarding of all things, you do something and then you call what the next hook would be. Then you dont have a pre-defined list but you do have a library of options.

Shane: if you're too flexible, if you don't want to write an entire telephone system, it is hard to set up asterisk.

Jeremy: I think we need to write down 20 things we would want here. Some of them were discussed before we started the BIND 10 project. Two examples: have code points that point out to places where people would write scripts with an if-then statement. Another way is using firewall rules, like if _ matches , accept/reject. Those would be a lot easier to do than configuring named.conf is today.

Michal: if we could configure them like this, we could make them very powerful for power users.

Shane: to me this seems like... how would this be different, for the user, than writing code? Easier I mean?

Michal: because you can replace the library at run time. I want them to be able to both put in and take code out.

Stephen: at some point, you can reconfigure everything at run time, and providing we've got our encapsulation right, you could replace the cache, you've got the object interface, replace it, and it works.

Jelte: I would not do that with the cache.

Michal: I would make the cache replaceable because the cache would be a source of data.

Shane: if you want to change cache data, you can inherit from the existing cache and write your own, or you can also use the API for how the cache works today, and in the hook world, when you do xyz with the cache, a series of hooks are called. Administrators can make changes at each point.

Jelte: I don't like that I think its wrong way round. I don't think people should modify cache behavior.

Michal: if you want to change what to throw out, what do you do?

Shane: An administrator wants to never cache data related to a specific website. So there is a specific hook point he can edit.

Stephen: what is the business case? 80/20 rule

Michael: if you cant make a case for why its useful, then why do it?

Shane: there are blacklists in BIND 9, right? It would be nice if you didnt have to have special code to do that.

Michael: that's a specific example.

Jelte: I think everything people will want to do can be done with a fairly simple API. And we have several places (currently in TCP or UDP server now) and we point to specific callouts, we can do everything people would want.

Larissa: I just want to make sure that this is still something sysadmins can deal with.

Shane: what is the difference between this and writing a new ASIO block? On a web server it used to be you had a callout point and you added a function.

Jelte: if you write a module for apache or lighty, you write a function thats called, you configure when it will be called, and the context. It can modify anything, and it can send back some defined options.

Shane: and there are defined steps. In this way there are no defined steps.

Jelte proposes a model with a specific plugin module and specific limited list of points where it plugs in.
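A minimal sketch of a plugin model with a fixed, limited list of hook points, in Python. The hook point names, the `Verdict` values, and the `HookRegistry` class are all hypothetical; they only illustrate callbacks that can say "drop it" or "stop processing, servfail" at defined steps, not an agreed BIND 10 design.

```python
from enum import Enum


class Verdict(Enum):
    CONTINUE = "continue"   # carry on with normal processing
    DROP = "drop"           # silently discard the query
    SERVFAIL = "servfail"   # stop and answer SERVFAIL


# Hypothetical fixed list of hook points; plugins cannot invent new ones.
HOOK_POINTS = ("query_received", "before_lookup", "before_answer")


class HookRegistry:
    def __init__(self):
        self.callbacks = {point: [] for point in HOOK_POINTS}

    def register(self, point, callback):
        if point not in self.callbacks:
            raise ValueError(f"unknown hook point: {point}")
        self.callbacks[point].append(callback)

    def run(self, point, message):
        """Call each callback in order; the first non-CONTINUE verdict wins."""
        for callback in self.callbacks[point]:
            verdict = callback(message)
            if verdict is not Verdict.CONTINUE:
                return verdict
        return Verdict.CONTINUE


def blacklist(message):
    """Example plugin: drop queries for a blacklisted name."""
    if message["qname"].endswith("blocked.example."):
        return Verdict.DROP
    return Verdict.CONTINUE


hooks = HookRegistry()
hooks.register("query_received", blacklist)
print(hooks.run("query_received", {"qname": "www.blocked.example."}))
print(hooks.run("query_received", {"qname": "www.example.org."}))
```

Because the set of hook points and verdicts is closed, the server's processing steps stay defined, which is the main difference from building the whole server at runtime from replaceable parts.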

Shane: why does this scare me less?

Michael: I am worried we will write a language here. That is a big mistake. Think of the blacklist option? You're actually shortcircuiting certain options.

Stephen: I think we need to keep it simple.

Shane: maybe we do something simpler and then consider Michal's option later if we need to

Larissa: I suggest a very simple prototype and then some user discussion.

Jinmei: we need it to be testable by itself, we dont want to be able to replace everything. I generally think its a good idea to have a small potentially replacable module. I kindof think its a good idea to have a framework that makes this whole idea possible.

Shane: one possible concern is that whenever you design something new thats complex you will get it wrong the first time.

Michal: I really didn't completely design it, I just was inspired by Miranda and Apache.

Shane: I am worried about an elaborate design that won't get used.

Jelte: SIDN very much wants exactly the thing I described.

Shane: we need a defined set of calls.

Jinmei: decomposing the feature into separate pieces, or making everything decomposable, seems to be different.

Larissa: I need to understand what people want to do. To figure out whether this is more complex than we need.

Shane: Jelte and Michal's position is that it wont be any harder to do what they want than to do a smaller thing. So I suggest whoever wants to proposes a design. Define an API and some configuration examples, maybe some pseudo code, and then we evaluate it.

Potential use cases:

  • DNSSEC signing w/ on the fly answers
  • validating forwarding resolver
  • blacklists
  • NXDOMAIN redirection
  • NSEC masking
  • non DNS operational data management?
  • script run upon AXFR
  • query introspection (need to know why)
  • alternate method to configure ACLs - to use an LDAP database to authenticate updates
  • dynamically generated content of zone data - be able to write a script to send answers
  • experiments with new data sources
  • debugging - log various steps
  • AS112?
  • possibly use this to combine auth and recurse
  • evlDNS stuff
  • network discovery from behind a NAT
  • change timing behaviors on the XFR side - have zones refresh more or less often
  • pick or prefer specific masters
  • change query behavior - resolver gets a timeout then it tries all the servers in the NS set
  • non expiring cache for better performance
  • reduction of configuration knobs
  • Filter-AAAA or other IPv6
  • stub zones?
  • SCTP
  • Shim6?
  • alternate classes (think MIT people like Hesiod users)

Thoughts: could we use the hooks system for BIND 9 compatibility?

We don't want to avoid coding in things that we really want, though

What kind of programming languages will we support hooks in? C++ and Python, but... do we extend to other languages? We probably need Perl. Could other people write layers to support other languages?
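As one way to picture the "defined set of calls" Shane asks for, here is a minimal Python sketch of a hook registry in which a callout can short-circuit normal processing, as in the blacklist case Michael raises. Nothing here is an agreed BIND 10 design: `HookRegistry`, the `pre_lookup` hook point, `Query`, and the response strings are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Query:
    """Hypothetical stand-in for an incoming DNS query."""
    name: str
    qtype: str = "A"


# A callout returns a response string to short-circuit processing,
# or None to let processing continue.
Callout = Callable[[Query], Optional[str]]


class HookRegistry:
    """Hypothetical registry mapping hook point names to callouts."""

    def __init__(self) -> None:
        self._callouts: Dict[str, List[Callout]] = {}

    def register(self, hook_point: str, callout: Callout) -> None:
        self._callouts.setdefault(hook_point, []).append(callout)

    def run(self, hook_point: str, query: Query) -> Optional[str]:
        # Call each callout in registration order; the first non-None
        # result wins and the remaining callouts are skipped.
        for callout in self._callouts.get(hook_point, []):
            result = callout(query)
            if result is not None:
                return result
        return None


BLACKLIST = {"bad.example."}  # illustrative data only


def blacklist_callout(query: Query) -> Optional[str]:
    """Refuse blacklisted names; otherwise do not interfere."""
    return "REFUSED" if query.name in BLACKLIST else None


def answer(registry: HookRegistry, query: Query) -> str:
    short_circuit = registry.run("pre_lookup", query)
    if short_circuit is not None:
        return short_circuit
    return "NOERROR"  # placeholder for the normal lookup path


registry = HookRegistry()
registry.register("pre_lookup", blacklist_callout)
```

With this sketch, `answer(registry, Query("bad.example."))` is refused by the callout, while any other name falls through to the placeholder lookup. Keeping the callout signature this small is one way to address the worry about accidentally writing a language.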

Lunch

Task Breakdown Part 1

We begin our Epic Quest to break down the tasks for the first 6 months of Y3.

Thursday, 2011-03-24

Task Breakdown Part 2

Lunch

Scrum Estimation Part 1

We need to do some planning poker for the tasks that we have identified for the start of Y3, so we can estimate how much we can deliver in each sprint, and so we can track our performance on an ongoing basis.

Friday, 2011-03-25

Scrum Estimation Part 2

We should be able to finish our Scrum estimations here.

Lunch

Working with BIND 9 (Michael Graff)

The main goal for Y3 is not BIND 9 compatibility, but we are going to be living in a world where BIND 9 and BIND 10 are both running in the wild. We would also like to avoid duplicate work and divergent code paths as much as possible.

Michael Graff, the BIND 9 programme manager, will be joining us and we will discuss this topic.

Shane: Michael has been running BIND 9 for about a year, as its first dedicated engineering manager.

Michael: So we've been trying to do TDD, Scrum, and some other concepts used in BIND 10, with varying success.

Jeremy: how long will BIND 9 last?

Michael/Larissa/Shane?: well, 7-10 more years... some current OS versions can't upgrade, people need motivation to upgrade, but there is a plan to deprecate unused features in BIND 9 so they need not be ported to BIND 10.

Larissa: and can we talk about how code can be shared?

Michael: yes, we are going to be using Python

<discussion of Python 2 and 3>

Michael: we will be writing key management tools in BIND 9 in Python that maybe we can use for both. (Discussion)

Shane: one challenge I have in BIND 9 is the tight coupling.

Michael: The biggest problem I think is that it was written by engineers without object oriented experience to separate the data parts. That was a decision by some original BIND 9 developers and it was questioned then and it's not consistent in the code.

Shane: you're trying to figure out what behavior is going on but it has pseudo object orientation and you can't figure it out. This was to the database.

Shane: we would like to lift/share code from BIND 9 when possible. If we do that, how do we keep changes in sync?

Michael: It seems silly to reproduce things. There are a couple of things. In BIND 9 we need to write code that's easier to test and compatible with modern design techniques used in BIND 10. We have a unit test framework now. And we use it! We're working on writing testable functions and reasonably sized functions. (discussion of code copying and problems therein) No more 5,000 line functions.

Shane: BIND 9 also has a lot of functions with 15 parameters

Michael: actually I think it's about 8. The problem is you pass them in almost every context and that makes it bigger

Shane: I don't understand the directory structure.

Michael: libdns is a supporting library for named. There are a lot of things in libdns specific to named and vice versa

Shane: let's talk about the logger in particular

Stephen: we're talking about how to share code. That's a goal, to make an independent library both projects can use.

Jelte: the "real" libbind. If we have tools that work with either project, it should be a separately distributable thing.

Michael: not distribute but treat separately.

Jelte: I mean package.

Stephen: say you want to release BIND 9, there is a formal internal release of the library, and it's separate.

Michael: we kind of have this issue with DHCP already.

Larissa: maybe DHCP could use this library instead of libDNS which makes a mess.

Shane: and we can optimize things in one place.

Michael: someone has to change, but I don't care who. Maybe easier for BIND 10 because it has tests and because most C programs are valid in C++ but not the other way.

Jeremy: BIND 9 has code that's compiled and built, but no paths ever use it. Like logging from source. Bob Halley told me nothing uses it. I found that easily.

Michael: I've considered writing a script that changes the names of the functions and then if it compiles, nothing is using it, and we can clean it out. We add functionality but we don't remove it.

Discussion of issues with shared libraries.

Shane: Michael, tell us the release schedule plans.

Michael: we're releasing a feature version about every 6 months, and maintenance releases between quarterly and monthly, depending on what's going on.

Shane: all of our bug tickets are currently private, in BIND 9, right?

Michael: yes. Working on this.

Shane: and it's all in RT?

Michael and Larissa: yes, there are two instances, so support manages a case a customer logs, and the customer can see it, but then if it becomes a bug, it goes to the bugs instance, which is closed to ISC people only. RT is almost too powerful. An example is our review process. It is in the bug queue, moves to the review queue, then the notes queue, then the resolved queue. But it looks like the guys didn't finish work because things never just close. Also Dan has gone to RT training now, and he has ideas about how to fix it.

Shane: we discussed this at all hands, and Barry mentioned that you do want to decouple ticket handling from bugs.

Michael: I'm not worried about Trac.

We can put a Trac ticket item link into the support instance.

Shane: you also have an alpha, beta, release candidate model for major releases.

Michael: we have an obligation to the forum, for advance code release at each point. This impacts our schedule. Alpha is something we have for .0s; betas and RCs are for everything.

Shane: are there fixed times?

Larissa: no, but there should be. It's an attempt to build community testing but it fails.

Michael: people ignore everything until the .0 and then they send bugs.

Stephen: and you don't change after beta

Michael: our rules: alpha establishes syntax. Beta is bugfixes but the feature set is locked down. RC1 is critical bug fixes and docs, and the final only has docs changes.

Michael: some things that didn't work well. I wanted to start putting features in point releases. Lots of projects do it. But it was a disaster for us. We can't do it, it confused everyone. The other thing that didn't work well was setting a fixed release date. What they really wanted was release on this day, except if there are bugs, and well, don't take my features out.

Discussion of the forum model and its issues and open source etc.

Stephen: we do have to be careful about the copyright for patches etc.

Michael: let's not go into legal issues

Stephen: if something is a release candidate, make sure it's a real release candidate, and not a buggy version you put out because your release schedule said so.

Michael: agile has helped with this, we know sooner when a feature will be too buggy and not ready in time. We let release dates slip, but if they slip because of poor planning, we need to fix planning; if they slip because of late bugs, we need to fix the schedule.

Larissa: we're going to have beta programs across the board too

Jeremy: and we claim ops tests our software but it's not that effective

Michael: they compile it (which is a good test) and they run it for a bit, at least a weekend. Is that real testing? It does show that someone could install this.

Larissa: Jim has indicated he would like to improve this.

Michael: we need to give them a specific checklist.

Jeremy: BIND 10 has the same problem. You'll probably notice my bursts of bug submission; it's because suddenly I'm using stuff or new stuff.

Larissa: we need to treat ops a bit like a beta test person. Specific instructions.

Jeremy: BIND 9 sometimes gets bad press on security issues. You know, there was a long period of no security bugs. Do we know what happened?

Michael: DNSSEC. In 1994 we wrote DNSSEC. In 1995 we rewrote it because the spec changed. In 2004 we rewrote DNSSEC because the spec changed. We didn't introduce it per se. Now, in 2011, people are using DNSSEC. All of a sudden, here are the bugs. It was written poorly, it had no tests. Rob warned us about this.

Jelte: DNSSEC is so new, it's logical that it's in this state.

Michael: and we don't get yelled at for this. People understand. But also one person's little bug is actually a giant security hole.

Jeremy: so in hindsight unit tests and functionality tests might help.

Michael: the projects compete for resources. BIND 10 had money but BIND 9 didn't, and we had to shuffle people around because we didn't want to lay off, and we are still suffering from this. In any case, I'm looking for BIND 9 developers, if anyone is looking! Especially someone who can do Windows *and* UNIX

Shane: let's talk about how to organize shared efforts.

Stephen: maybe logging is a good first option

Jelte: ideally you could have a shared scrum thing for the shared project

Michael: or a "prisoner exchange" where developers trade for a sprint or a few sprints or something.

Larissa: I would want to have people on sprints, and probably more than one in a row, for coherency. Mike Cohn advised on this.

Michael: maybe pair programming is the solution here.

Shane: hmmm so we put one BIND 10 person on logging paired with one BIND 9 person, on logging, together.

Larissa: I want to also figure out how we share the whole culture not just the code, so we need to figure that out.

Michael: one last word: when you go to develop things, please consider the BIND 9 code, and why you did things. Please.

Shane: we are thinking about crypto libraries. What does BIND 9 do? OpenSSL?

Michael: also, please, tell us, when you find a BIND 9 bug?

Jinmei: I often refer to BIND 9, to import logic, and I do report bugs I find.

Jeremy: I think this should be a blog article

Confidential Work for Security (Jeremy Reed) (Medium)

We need a procedure for privately using git and our discussions for security issues (such as #80).

Jeremy: Aaaright. We need a way for the customer to contact us if they have an issue. And they might need a private way. Phone or an alias.

  • We need a secure email method (Securityofficer@)
  • We need an obvious way to mark a ticket confidential (Jeremy needs to note it still works)
  • We need a wiki page on how to do this (and report problems)
  • Form that goes to the securityofficer list?
  • Maybe we should default to the secured method of submission
  • Michael: maybe we can do a three-state toggle
  • we should always *ask* the customer before we mark something insecure after they mark it secure.
  • Decision: the best solution is a pulldown box that defaults to blank, with yes, no, or not set. (Do not display "not set" tickets until we review them.)
  • We need a human to respond when someone submits a security issue (bug triage)
  • if the issue comes to securityofficer@, that person creates a ticket and then comments to the submitter.
  • Quickly evaluate the issue - run a CVSS check - determine approximate severity and work estimate
  • move discussion to a private email list
  • determine if the issue is in the wild or not - type 1 vs type 2
  • if the issue came in over an open list, assume it is in the wild
  • contact reporter, inform them that we think it's a security issue, ask them to refrain from discussion, and offer them a credit in the CVE if desired
  • determine schedule for phased security notification
  • we need a private git repository for security specific branches.
  • need filter and git commands to keep repository secure
  • need to ensure the bind10-team@ list only has core developers who are (or whose organizations are) under NDA
  • we need to use a password or invitation only jabber room for security issues
  • beyond these things process sticks as closely to existing security process as possible
  • by the end of the next sprint (April 15th) policy and git changes are established and in the second half of April a test security event will be rehearsed.
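The three-state pulldown decided on above can be sketched as follows. This is only an illustration of the rule "default to not set, and hide not-set tickets until reviewed"; the `Sensitive` enum, the `Ticket` class, and `publicly_visible` are hypothetical names and do not correspond to any Trac or RT API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Sensitive(Enum):
    """Hypothetical three-state security flag for a ticket."""
    NOT_SET = ""   # the pulldown's blank default
    YES = "yes"
    NO = "no"


@dataclass
class Ticket:
    number: int
    summary: str
    sensitive: Sensitive = Sensitive.NOT_SET


def publicly_visible(tickets: List[Ticket]) -> List[Ticket]:
    # Only tickets explicitly marked "no" are shown. Both "yes" and
    # "not set" stay hidden, so an unreviewed ticket is treated as
    # confidential until a human has looked at it.
    return [t for t in tickets if t.sensitive is Sensitive.NO]


tickets = [
    Ticket(1, "crash on malformed packet", Sensitive.YES),
    Ticket(2, "typo in man page", Sensitive.NO),
    Ticket(3, "newly submitted, not yet triaged"),
]
```

Here only ticket 2 would be listed publicly. Defaulting to the hidden state is the design choice that matters: a submitter who skips the field cannot accidentally disclose a security issue.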

Note: we also need to redesign our front page to make it clear how to report issues and security issues (and in general, redesign)

Writing Down What a DNS Server Is (Medium)

Several team members feel that it is important to document what a DNS server is so that we can be sure we have built it. We need to discuss what exactly the goal of this activity would be and how we can achieve it. This is to create a plan for how we will document, not to actually document it.

Scheduling Team Calls & Suchlike (Short)

Once we have decided how our team(s) will be organized, we should probably take a moment to review our regular meetings/calls.

Shane: we decided earlier this week that we're abandoning the A and R team split at least for now. We have three regularly scheduled calls now:

  • daily call
  • team call every two weeks
  • R-team planning
  • A-team planning

We still need the daily call. It is currently at 08:00 UTC. This is a good time for Europe (9 and 10 am), Beijing (4pm), and Tokyo (5pm), but a poor time for North Americans. Jinmei will call in on a best effort basis. Larissa and Jeremy are not expected to call. Larissa, Shane, and Jeremy plan to meet a few days a week at 6:30am Pacific (8:30 US Central and 15:30 Central European).
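The local times quoted for the daily call can be checked with a few lines of Python. This is a sketch under stated assumptions: the UTC offsets below are fixed summer-time values (BST, CEST, PDT) chosen by hand rather than looked up in a time zone database, and the date is an arbitrary Tuesday after the European clock change.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed UTC offsets in hours (summer time where applicable).
OFFSETS = {
    "London": 1,         # BST
    "Prague": 2,         # CEST
    "Beijing": 8,
    "Tokyo": 9,
    "San Francisco": -7, # PDT
}


def local_times(utc_dt: datetime) -> dict:
    """Return the local wall-clock time (HH:MM) in each location."""
    return {
        place: utc_dt.astimezone(timezone(timedelta(hours=hours))).strftime("%H:%M")
        for place, hours in OFFSETS.items()
    }


# The daily call at 08:00 UTC on an arbitrary date in April 2011.
daily_call = datetime(2011, 4, 5, 8, 0, tzinfo=timezone.utc)
for place, local in local_times(daily_call).items():
    print(f"{place:14s} {local}")
```

Under these assumptions the call lands at 09:00 and 10:00 in Europe, 16:00 in Beijing, 17:00 in Tokyo, and 01:00 on the US west coast, which matches the "poor time for North Americans" remark.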

We need to set up the sprint planning call and the staff call. We will continue the idea of one week sprint planning one week team call. We are also looking into using the team meeting time for scrum style demos and retrospectives/reviews.

We need to keep the meeting at the same or a similar time to what we have now. We acknowledge that this is a rough time of night for Asian colleagues. We also need to mind the date line factor. We will leave the time as it is for now. Which day of the week is good?

Michal: Tuesday remains good.

Shane: Tuesday remains good.

Stephen: Tuesday is good

Larissa: Tuesday is good

Jinmei: I am worried the combined sprint planning will be a very long call.

Shane: maybe we reserve the same time on Wednesdays in case we need it.

Stephen: also after two hours people tail off

Michael: In BIND 9 we now do breakdown Tuesday and estimation Thursday

Stephen: also, more than 90 minutes is really going to be hard on our Asian colleagues

Shane: developers, how do you feel about sometimes having a second call in the same week?

Jinmei: I don't think I mind.

Shane and Larissa: and our current round of advanced planning will fall apart around June/July? this time

Jelte: and we had a lot of clarity on tasks in the last meeting

Stephen: how many releases per year? Would it be worthwhile breaking up into 18 weeks so every three sprints we have distinct goals?

Larissa: so quarterly deadlines for feature sets?

Stephen: every four months.

Shane: to get back to the planning issue, my proposal remains that we have an optional meeting on Wednesday or Thursday.

Stephen: a lot of time is taken up with estimating. You can actually start a task without an estimate when necessary. How do people feel the email estimating went? People sent their estimates via email, and I took a consensus value and we accepted it without further discussion, and we only discuss when opinions diverge wildly.

Shane: Likun, how do you feel about that?

Likun: it's okay, sometimes if I'm not clear on the task I can then find out more independently

Jinmei: I'm basically negative on email estimation. People forget, and it tends to introduce delay that is significant in the timeframe of a two week sprint. If we are going for a compromise I'd rather go more aggressive, like someone who is picking up the task just does the estimate.

Stephen: what I get in email is usually relatively close in size. It's only when I get a large disparity that we need the discussion. The difference between a 1 and a 2 comes out in the noise.
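The email estimation scheme Stephen describes (take a consensus value, discuss only when estimates diverge wildly) could be expressed roughly as below. This is a sketch, not the team's actual procedure: the function name, the use of the upper median as the "consensus value", and the `max_ratio` threshold of 2 are all assumptions made for illustration.

```python
from statistics import median_high
from typing import Dict, Optional, Tuple


def consensus_estimate(
    estimates: Dict[str, int], max_ratio: float = 2.0
) -> Tuple[Optional[int], bool]:
    """Return (consensus value, needs_discussion).

    estimates maps a developer name to a positive point estimate.
    If the largest estimate is more than max_ratio times the smallest,
    no value is chosen and the task is flagged for discussion.
    """
    values = sorted(estimates.values())
    if not values:
        raise ValueError("no estimates submitted")
    if values[0] <= 0:
        raise ValueError("estimates must be positive")
    if values[-1] > max_ratio * values[0]:
        return None, True
    # median_high always returns one of the submitted values.
    return median_high(values), False


close = {"dev_a": 1, "dev_b": 2, "dev_c": 2}
far_apart = {"dev_a": 1, "dev_b": 8, "dev_c": 2}
print(consensus_estimate(close))      # accepted without discussion
print(consensus_estimate(far_apart))  # flagged for the daily call
```

With the default threshold, a 1 versus a 2 is accepted as noise, while a 1 versus an 8 is held back for a conversation. It does not address Jinmei's objection that people forget to reply; a missing estimate simply does not appear in the input.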

Jelte: doing it in email does eliminate discussion that's not necessary, but I agree that it introduces delay, and that people forget.

Discussion of estimation and sprint practices and whether the email thing would work.

Shane: how do JPRS and CNNIC feel about an overflow meeting on Wednesday or Thursday if we need to?

Jinmei: I am not sure it's an "if"; I suspect we always will need it.

Fujiwara: It is okay.

Likun: if there is no other solution we will survive it.

Larissa: if we start half an hour earlier and allow two hours, that might help?

Michal: yes?

Larissa: How would that be for you Jinmei?

Jinmei: I guess it is okay. Maybe not in standard time, but that is a long way away.

Shane: what if the second sprint planning call every other week was at night for Europe, afternoon for California, morning in Asia?

Jelte: if it's Wednesday, that's fine.

Stephen: I'm fine with that.

Shane: okay. One proposal is the Tuesday call is always 15:00 UTC once a week. When we need a second sprint planning call, it would be at 23:00 UTC Wednesdays, which means 8 am Thursdays for China and 9am for Japan.

Shane: if we do an a.m. call for Asia and factor in the time that Kambe and Aharen-san are traveling to work in the mornings, the meeting would be 3am for Europeans. Maybe what we should do is steal time from the standup calls.

Larissa: personally I am okay with missing the estimating.

Shane: okay so task estimation could happen in slightly extended daily scrum calls.

Michael: so proposal: task breakdown at scrum planning call, then emails for estimation, then discrepancies discussed on the daily sprints.

Larissa: yes

Shane: and start sprint planning at 14:30 UTC.

Things for End of Each Sprint (Short)

We are missing a couple of things from the end of each Scrum now. We don't do a true retrospective, and we do not do demos. We have been doing Scrum long enough that it may be time to adopt these practices.

Demos: at the end of the sprint, Shane can ask one or two developers to come up with a demo for their new stuff. We would do that at the next team meeting. Demos would last 15 minutes. After a few rounds of this, we will start including customers and users in the demonstration. In general we might allow specific customers/users to attend the "internal" demos, but it may depend. We would probably invite close outside colleagues we know well.

Reviews: at the 6 week release point we will review all features against the definition of done.

Retrospectives: Stephen will call for a stop-keep-start style retrospective at the beginning of each sprint planning session. Shane will send a reminder email about the retrospective the day before.

Unification of in-memory and SQLite Back-ends (Medium)

Michal notes:

Some unification of in-memory & sqlite3. Or should this rather be handled on the ML? Because this would probably include a little homework to look through both the APIs to be able to talk about it.

Michal: we have a base class for the datasource, and we have SQLite based on that, and we have another base class, and in-memory based on that, and this misses the point of having the abstract base class. So I think we should look at them and unify it.

Larissa: will this help us to have a shared API for datasources?

Others: yes

Shane: how did we come to be in such a place?

Michal: well, the base class was created with the SQL backend in mind, but it's a little bit specialized.

All: it was all because we needed to do the in-memory structure quickly.

Michal: I don't think either is what we want. We need to modify both a little bit. I think we could then get to the point that we find what we want in the end by merging them.

Stephen: it could be one task, to merge them?

Michal and Jinmei: three or four.

Decision: While our in-memory datasource will support DNSSEC, our API for datasources needs to allow databases that do not support DNSSEC to integrate with BIND 10.
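To make the decision concrete, here is a minimal Python sketch of a single abstract base class shared by an in-memory and an SQLite backend, where DNSSEC support is an optional capability. It is not the BIND 10 datasource API (which is C++): the class names, the `supports_dnssec` flag, the `find_signatures` method, and the one-table schema are all assumptions made for illustration.

```python
import sqlite3
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class DataSource(ABC):
    """Hypothetical common base class for all backends."""

    supports_dnssec = False  # backends opt in explicitly

    @abstractmethod
    def find(self, name: str, rrtype: str) -> List[str]:
        """Return the rdata strings stored for (name, rrtype)."""

    def find_signatures(self, name: str) -> List[str]:
        # Default for backends without DNSSEC: no signatures.
        return []


class InMemoryDataSource(DataSource):
    supports_dnssec = True

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], List[str]] = {}

    def add(self, name: str, rrtype: str, rdata: str) -> None:
        self._records.setdefault((name, rrtype), []).append(rdata)

    def find(self, name: str, rrtype: str) -> List[str]:
        return list(self._records.get((name, rrtype), []))

    def find_signatures(self, name: str) -> List[str]:
        return self.find(name, "RRSIG")


class SQLiteDataSource(DataSource):
    """Toy backend with no DNSSEC support."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS records "
            "(name TEXT, rrtype TEXT, rdata TEXT)"
        )

    def add(self, name: str, rrtype: str, rdata: str) -> None:
        self._db.execute(
            "INSERT INTO records VALUES (?, ?, ?)", (name, rrtype, rdata)
        )

    def find(self, name: str, rrtype: str) -> List[str]:
        rows = self._db.execute(
            "SELECT rdata FROM records WHERE name = ? AND rrtype = ?",
            (name, rrtype),
        )
        return [row[0] for row in rows]


def lookup(source: DataSource, name: str, rrtype: str, want_dnssec: bool):
    """The caller is written once against the base class."""
    answers = source.find(name, rrtype)
    sigs = source.find_signatures(name) if want_dnssec and source.supports_dnssec else []
    return answers, sigs


mem = InMemoryDataSource()
mem.add("www.example.", "A", "192.0.2.1")
mem.add("www.example.", "RRSIG", "placeholder-signature")
sql = SQLiteDataSource()
sql.add("www.example.", "A", "192.0.2.1")
```

Calling `lookup` with `want_dnssec=True` returns the placeholder signature from the in-memory source and an empty signature list from the SQLite one, without the caller knowing which backend it has.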

(see task list)

Lack of Users (Short)

Michal notes:

I also worry a little bit about the fact that, even though the software is generally buggy, we get really few bug reports, emails, or complaints. We should have a situation where, when we release a tarball, we get ten people hammering on the door of the jabber room demanding it's fixed. Also, it's two years already, but we still don't have anything that could really be used, though it's already planned I guess. But I'm not sure there's anything to talk about here.

Shane: we actually only got tarballs 12 months ago and have been actively recommending against production, so that is probably part of the problem. I think now though we should be telling people they should run it, in a specific limited capacity.

Jinmei: I want to encourage people to play with it, but probably there is currently no reason for people to play with it, because it's slower and missing many features.

Jeremy: we don't want to give a poor impression?

Shane: the analogy I've been recently giving is to Mozilla .6, when it was slower, and crashed all the time, and didn't do what you wanted, but the potential was there. Eventually, it got to the point around .9, where it would finally render some sites better than Netscape.

Jinmei: if, for example, there is a website that can be better with Mozilla, that can be a reason. My point is we don't have that.

Stephen: so, is there anything we can add that BIND 9 doesn't have? That will get people to try?

Shane: we have the SQL.

Stephen: should we make a bigger play?

Larissa: we do have some plans to start telling people to try it, the webinar, demos, beta program, and blogs, are all oriented toward that.

Shane: we can show the simple demo, because right now, (once we fix the unindexed query bug) we could say look, we start in two seconds for a zone with a million records. That would be sexy.

Larissa: and we need a cool thing to do with a user story for every release.

Shane: so next after this one is TSIG, then configuring the BOSS for the release after that.

Jeremy: we need to make sure Ops really runs it and that we point people to it. We could also possibly run a public resolver that could take a beating.

Shane: is there a problem with that?

Michael: I want to do that in BIND 9.9. If you're confident that your resolver will hold up to high load with DoS attacks, go for it

Jelte: not yet!

Shane: if we put it on the BIND 10 dev list and wiki for a while before putting it to, say, isc.org and bind-users, that could work.

Jeremy: the bind10 box has been running the iterator without crashing since March 17th.

Shane: do we have statistics for the resolver yet?

Jeremy: no but I can find some information with verbose logging (which we have)

Jelte: sometimes it gives up too fast.

Larissa and Jinmei: so the three things to get this going:

  • make sure there is sexy "geeky dns catnip" in each release (i.e. speedy load of large zones, TSIG, BOSS configuration)
  • communicate increasingly about BIND 10 with webinars, demos, blogs, events
  • demonstrate the stability etc of the server by getting it into limited use with ISC ops, beta programs, etc.

Of course, we want to be cautious. We don't want to increase users faster than we can keep up with new features, bug fixes, etc. It's a delicate balance.

Blogging

We agreed that BIND 10 will be doing a blog per month - we will schedule one as a sprint task every other sprint. Larissa will enforce this.

Topics we didn't quite get to:

  • API/ABI Versioning
  • How to benefit from Multi-core/processor(TBD)

There was a discussion on the dev list before: https://lists.isc.org/pipermail/bind10-dev/2010-December/001738.html but there didn't seem to be a clear conclusion.

  • msgq Replacement (Medium)

It may be time to consider using something other than our own, hand-crafted message bus. We need to worry about portability, increased dependencies, ease of use, reliability, feature set, and so on. Plus at least sketch out a plan for selecting and adopting such technology.

  • External Tester Program (Medium)

Larissa, Jeremy, and Shane have worked to outline how we may work with external testers. This may be interesting for everyone on the project.

Last modified on Mar 25, 2011, 3:58:53 PM
