April 19, 2013

JPRS zone transfer performance discussion

@1001 hrs

Michal: I believe the performance is OK; the remaining problems are due to SQLite. MySQL and PostgreSQL should solve them.
Jinmei: Mostly related to AXFR. We may still hit sqlite3 limitations if we use IXFR, but the problem is most notable with AXFR of large zones.
Shane: We could work around the limitations of sqlite.
Shane: If we really want to improve performance, we need MySQL and PostgreSQL.
Jinmei: I played with MySQL with large zones; it seems to cover most of our scenarios, like multiple zone transfers at the same time, or AXFR while updating zones at the same time.
Jinmei: The PostgreSQL API seems to require some more work. Non-trivial to work around. MySQL will be a relatively trivial port of sqlite3.

Mark: Do we do compressed IXFRs?
Jinmei: We don't do that.
Shane: BIND9 doesn't do it, right?
Mark: No.. pain and suffering.

Shane: What I'd like to do when we schedule reworking zone transfer stuff is to add MySQL as a data source.
Michal: There are ways to work around it using sqlite3 itself.
Michal: If you put each zone in a different file, you can separate the databases [and don't hit sqlite3 bottleneck].

ACTION: Add ticket for lettuce test with multiple sqlite datasources.

Shane: We should add some zone transfer benchmarks.

Shane: So overall zone transfer work is well-understood. We can start this work when we are done with the shared memory work instead of doing the hooks next.
Shane: Depends on when DNSCo wants to get hooks first.

Shane: I want to improve the transparency of zone transfers in progress: what is happening, how long they'll take, etc.
Jinmei: We should add one more data source as a kind of bugfix for scalability.
Jinmei: For AXFRout and loadzone into memory, we have to iterate over . Right now we iterate in a sorted way. It is difficult to do this quickly using MySQL. If we use ORDER BY, it doesn't work very well.
Jinmei: It works very slowly, and sometimes causes a fatal timeout.

Breakdown of tasks:

- meta
 + performance check of using multiple processes to do zone transfers

- xfrin-ng (part of zonemgr-ng)
 + inspection of packet (success/fail)

- zonemgr-ng
 + grand design
 + scheduler
 + soa checks
 + NOTIFY handling (maybe unified)
 + mostly completely reimplement it
 + admin info (xfrs in progress)
 + admin commands (kill xfr)

- xfrout-ng
 + revisit design of current architecture (notify out, xfr request, etc.)
 + benchmarks
 + limits on how many there are
 + support back to back zone transfers
 + ixfr + udp
 + persistent notify info
 + out of zone notify
 + control of notify targets

- mysql
 + research performance (RRs, sorted/grouped)
 + conf design
 + port of sqlite
 + document how to add data source
 + test setup (how to set up a mysql server to run the test suite?)
 + configure help
 + fix sqlite3 hardcoded bits
 + loadzone
 + dbutil (dhcp team may want to update this soon)
 + mysql embedded version, licensing

What is missing from BIND 10?

@ 1129 hrs

Shane: Missing feature for sponsors - we don't have RRL. Whether it is
necessary or not is tricky. Didn't exist a year ago. Most TLDs didn't
have it in place 6 months ago.

Jinmei: Ported BIND 9's implementation. Wrote detailed unit tests and
things like that. Confirmed it worked. Not surprising because it's a
straightforward port of original implementation. Still missing some
features that exist in the original implementation, but it can be
easily completed. Right now, it's just my personal experiment and we
could just use it as a reference and do it separately, or merge it
later after any necessary cleanup/review/etc. I could clean it up and
create a few tickets mainly for reviewing. I guess it could be 5-6

Jeremy: Ops said they could run B10 auth in production if it had RRL.

Jinmei: The part we talked about for getting the number of levels from
the DomainTree to optimize using the origin name for statistics work
can also apply here.

Shane: Another thing is to improve our user interface.  We may
outsource it or do something else about it. Will know in the next few

Shane: signing topic
Michal: We may want something like on-demand signing, for things like
hook-generated answers.

Discussion about on-demand signing. Does PowerDNS sign per-query? Does
anyone use it? Yes and Yes.

Shane: I know some TLDs use BIND 9 managed zones to sign their zones
and use DDNS.

Shane: We'll have to implement command-line tools and such.

Michal: We'll have to add a data source example as well.

Mukund: We should have examples of our REST api as well.

Fujiwara: How about daemonizing BIND 10?

Shane: We probably have to change it to run BIND 10 as a daemon by

Jeremy: We don't have output go to a log file by default.

Jinmei: One more thing to do is making sure the source address of the response is equal to the destination address of the query.

Shane: We should be a bit smarter about starting auth servers when we
have multiple cores.

Jinmei: We should have more fine-grained ACLs.

Mark: Do you have an EDNS0 client-subnet yet?

Hooks discussion with DHCP team

@1308 hrs

Shane: We agree on the basic idea on hooks.
Shane: We need an option to skip processing

Shane: We may have a requirement about the order in which hooks are
called. The order in the config file sets it, but we may want it to be
configured on a per-hook basis.

Discussion on a packet context.

Michal presents "our" design for the hooks API.

  dns::Message* m;
  ev->get("message", m);

We discussed:
* Performance issues if marshalling is involved
* Explicit init() function
* Dynamic registering hooks
* Whether this design would make things more complex
* Examples in python, etc. were discussed
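
A minimal sketch of how the ev->get("message", m) style above could work, assuming arguments are passed as named, non-owning pointers with no marshalling. Event, HookRegistry, setSkip() and the hook name used in the usage note are hypothetical names for illustration, not the agreed API; Message is a stand-in for dns::Message.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <utility>
#include <vector>

// Hypothetical stand-in for dns::Message; not the real BIND 10 class.
struct Message {
    std::string qname;
};

// Hypothetical per-packet context handed to every callout.
class Event {
public:
    // Stores a non-owning pointer under a name and remembers its type.
    template <typename T>
    void set(const std::string& name, T* value) {
        args_.erase(name);
        args_.insert(std::make_pair(
            name, Slot{std::type_index(typeid(T)), value}));
    }

    // Same call shape as ev->get("message", m) in the notes.
    // Throws std::out_of_range for an unknown name and std::bad_cast
    // when the stored type differs from the requested one.
    template <typename T>
    void get(const std::string& name, T*& value) const {
        const Slot& slot = args_.at(name);
        if (slot.type != std::type_index(typeid(T))) {
            throw std::bad_cast();
        }
        value = static_cast<T*>(slot.ptr);
    }

    // The "option to skip processing": a callout sets this to ask the
    // server to skip its default handling (assumed semantics).
    void setSkip(bool skip) { skip_ = skip; }
    bool skip() const { return skip_; }

private:
    struct Slot {
        std::type_index type;
        void* ptr;
    };
    std::map<std::string, Slot> args_;
    bool skip_ = false;
};

typedef std::function<void(Event&)> Callout;

// Hypothetical registry; callouts can be registered at run time and are
// called in registration order (standing in for config-file order).
class HookRegistry {
public:
    void registerCallout(const std::string& hook, Callout callout) {
        callouts_[hook].push_back(std::move(callout));
    }

    // Runs every callout for the hook. Returns true when the server
    // should continue with its default processing.
    bool call(const std::string& hook, Event& ev) const {
        std::map<std::string, std::vector<Callout> >::const_iterator it =
            callouts_.find(hook);
        if (it != callouts_.end()) {
            for (const Callout& callout : it->second) {
                callout(ev);
            }
        }
        return !ev.skip();
    }

private:
    std::map<std::string, std::vector<Callout> > callouts_;
};
```

A server would fill an Event with set("message", &msg), invoke call("query_received", ev) at the hook point, and skip its default processing when call() returns false. Passing raw pointers avoids the marshalling cost raised above, at the price of only working for in-process C++ callouts.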

Shane: Is this a high priority for the DHCP team?
Stephen: Yes. Comcast's requirements will need it.

Tomek: We'd only require the C++ bits for now. We have no plans to write Python wrappers.
Would Comcast need python wrappers?

There is initial agreement that we could use this new design.

ACTION: Document this as a design. Mukund and Stephen will work on it.

Scaling the resolver across multiple cores


The original suggestion (years ago) was not to use threads for scaling, because two developers earlier on didn't want to, but that is not a concern now.

Reviewing bind10-dev email proposals for resolver. "Scaling the resolver across multiple cores"

The question is whether we should use threads or multiple processes for the resolver.

Threads should be much simpler than bind9's use of threads.

Won't create a thread for each upstream request.

Will look at boost coroutine.

Naive threads: start a thread for each query and wait for the answer. In the coroutine case, each process has many user-level threads and has no locks. Either method can be event-based. Event-based may be single- or multi-threaded.

If a worker is idle, it picks up a task from a queue and starts working. If the task needs a shared resource, the worker puts it into another queue (for example, the one for the cache). A single thread serves that queue, so the cache does not need a lock. That thread looks the task up in the cache and puts it into the large queue when it is done, where another worker picks it up. There is another such queue for upstream queries. All CPU work is done by workers. All communication between queues happens in batches: a worker collects, say, 100 tasks and dumps them all at once, so it only locks once. This minimizes lock contention. The landlord is a separate thread; it serializes task access.
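
The batching idea can be sketched as follows. This is only an illustration under assumptions: BatchQueue and BatchingProducer are hypothetical names, the batch size is a parameter rather than a fixed 100, and the worker and landlord threads themselves are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical queue between workers and a landlord thread.
template <typename Task>
class BatchQueue {
public:
    // Producer side: hand over a whole batch under a single lock.
    // The caller's vector is emptied so it can be reused.
    void pushBatch(std::vector<Task>& batch) {
        std::lock_guard<std::mutex> guard(mutex_);
        for (Task& task : batch) {
            pending_.push_back(std::move(task));
        }
        batch.clear();
    }

    // Consumer side: take everything queued so far under a single lock.
    std::vector<Task> popAll() {
        std::vector<Task> out;
        std::lock_guard<std::mutex> guard(mutex_);
        out.swap(pending_);
        return out;
    }

private:
    std::mutex mutex_;
    std::vector<Task> pending_;
};

// Thread-local buffer: tasks accumulate without locking and are flushed
// to the shared queue once batch_size of them have been collected.
template <typename Task>
class BatchingProducer {
public:
    BatchingProducer(BatchQueue<Task>& queue, std::size_t batch_size)
        : queue_(queue), batch_size_(batch_size) {}

    void submit(Task task) {
        local_.push_back(std::move(task));
        if (local_.size() >= batch_size_) {
            flush();
        }
    }

    // Pushes a partial batch, for example when the worker goes idle.
    void flush() {
        if (!local_.empty()) {
            queue_.pushBatch(local_);
        }
    }

private:
    BatchQueue<Task>& queue_;
    std::size_t batch_size_;
    std::vector<Task> local_;
};
```

With a batch size of 100, the shared mutex is taken once per 100 submitted tasks on the producer side and once per drain on the consumer side, which is the amortization the discussion is after.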

Maybe use fine-grained locking and allow many threads to access the cache.

Separation of the threads is similar to bind9 model.

Whole architecture is quite complicated.

Context switches are limited by amortizing them over batches of 100 tasks at a time.

Atomic swap. May have portability issue, but probably doesn't matter with modern architectures.

Probably use RCU from someplace else instead of implementing it ourselves. Take something already done for supporting multiple architectures too.

Consider having lookup landlord and cache landlord, so cache landlord doesn't pay attention to queries coming in. Response handler landlord. Answer/receiver landlord.

Maybe the co-routines model would be simpler.

Some tasks are made to measure performance to compare, like #2871 (in review now) and #2873 and #2874 and #2875.

Multiple processes may not help other than for rapid recovery if one crashes, or as a hot-swap resolver. Without a shared cache, the second (unused) resolver may not help much, since it has no cache.

Multiple servers, like virtual machine, each maybe doing duplicate queries, but provide redundancy.

Several resolvers each have their own small cache. If the answer is not in that cache, the resolver asks a bigger cache, which is another process. If the bigger cache doesn't know, it asks the next layer of cache, and so on. The last one does the outside lookup. Answers are put into each cache. The smaller cache uses LRU, so the most recently used answers are cached closest to the resolver. On a NUMA system, if the small cache sits in L2 cache, there is no need to go to main memory to access the answer (hardware optimization). Some like this model more. Not sure how much the multiple levels cost.
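
A rough sketch of the layered lookup, with two levels only. In the proposal the bigger cache is another process; here both levels live in one object purely to show the lookup and fill order. LruCache, LayeredResolver and the fake upstream() answer are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Small least-recently-used cache: list front is the most recent entry.
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    bool get(const std::string& key, std::string& value) {
        Index::iterator it = index_.find(key);
        if (it == index_.end()) {
            return false;
        }
        // Move the entry to the front; list iterators stay valid.
        entries_.splice(entries_.begin(), entries_, it->second);
        value = it->second->second;
        return true;
    }

    void put(const std::string& key, const std::string& value) {
        Index::iterator it = index_.find(key);
        if (it != index_.end()) {
            it->second->second = value;
            entries_.splice(entries_.begin(), entries_, it->second);
            return;
        }
        if (capacity_ == 0) {
            return;
        }
        if (entries_.size() == capacity_) {
            // Evict the least recently used entry.
            index_.erase(entries_.back().first);
            entries_.pop_back();
        }
        entries_.emplace_front(key, value);
        index_[key] = entries_.begin();
    }

private:
    typedef std::list<std::pair<std::string, std::string> > EntryList;
    typedef std::unordered_map<std::string, EntryList::iterator> Index;
    std::size_t capacity_;
    EntryList entries_;
    Index index_;
};

// Checks the small cache, then the bigger one, then goes "outside".
// Answers are stored in every layer on the way back.
class LayeredResolver {
public:
    LayeredResolver(std::size_t small_capacity, std::size_t large_capacity)
        : small_(small_capacity), large_(large_capacity),
          upstream_lookups_(0) {}

    std::string resolve(const std::string& name) {
        std::string answer;
        if (small_.get(name, answer)) {
            return answer;
        }
        if (!large_.get(name, answer)) {
            answer = upstream(name);
            large_.put(name, answer);
        }
        small_.put(name, answer);
        return answer;
    }

    int upstreamLookups() const { return upstream_lookups_; }

private:
    // Placeholder for the real outside lookup.
    std::string upstream(const std::string& name) {
        ++upstream_lookups_;
        return "answer-for-" + name;
    }

    LruCache small_;
    LruCache large_;
    int upstream_lookups_;
};
```

An answer evicted from the small cache is still served from the bigger one without another outside lookup, which is the benefit the model hopes for. The open question of how much each extra level costs is not modelled here.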

Maybe implement co-routines (multiple threads in the same process) and RCU.

We need newer, better captures of queries from stubs to a busy resolver. (Maybe the SIE feed doesn't have that side.)

Shared resources, to be minimized:

  • cache
  • runtime states, timeouts, runtime objects may have to be shared across multiple cores
  • network resources (sockets), randomized multiple source ports

Consider proposals in that higher-level context.

Would it make sense to use the landlord queues idea (since it is generic) for other parts of the system?

If multi-layer caches are a bottleneck, we could have multiple caches and use a hash to know which one to look at. Could put all the network communication in a single place (single thread or process).
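
Picking a cache by hash could look like the sketch below. cacheShardFor is a hypothetical helper and std::hash is used only for illustration. Since DNS names compare case-insensitively, real code would have to normalize the name before hashing.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Returns the index of the cache (0 .. shard_count - 1) that should
// hold `name`. The same name always maps to the same cache.
std::size_t cacheShardFor(const std::string& name, std::size_t shard_count) {
    assert(shard_count > 0);
    return std::hash<std::string>()(name) % shard_count;
}
```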

Wait for detailed analysis after research is done.



review of doc/design/reso .... (from jinmei on shane's system)

unbound speed compared with echo server.

Like to see qps as function of cache-miss rate.

Consider a separate cache for negative answers. BIND 9 uses the same cache but a different format, which caused some difficulties. Be careful if using one cache. Inclined to use the same data structure, with entries identified as positive or negative.


Pending data that has not been validated yet versus real data: BIND 9 keeps both in the same structure, but this has interesting side effects that need to be fixed in BIND 9.

One question is whether we want to separate delegation information. Jinmei's experiments show this is quite independent. It may improve performance, even with duplicate information. A separate cache for the glue.

Maybe keep the NS right in the main cache too. Maybe pointers to keep in sync.

The current experimental resolver does have two caches, for servers and records.

[16:01] four attendees need to leave

Last modified on Apr 26, 2013, 6:40:16 AM