The high-performance data-source tree

The idea about high performance data source is all data is in memory in a tree. There are several optimisations to minimise the number of times the tree is traversed. This page describes an idea how the tree could be made faster than usual tree (AVL, rb…).

Background and assumtions

  • There are a lot more lookups than modifications. Therefore we can take more time to update and make the tree nicer (more balanced, etc…).
  • Computers today have multi-level memory caches. While CPU can perform multiple instructions per single tick, it needs data. If it is in register, it is ready right away, access from L1 cache usually takes a tick, L2 takes tens of ticks, access of main memory takes hundreds of ticks.
  • Cache is divided into blocks/lines. Transfer from the slower memory happens by whole lines.
  • Memory is divided into pages. There is TLB (Translation Lookup Buffer) which speeds up lookups of real addresses. If we get a TLB miss, full lookup must be made, which contains several memory accesses, each being possible cache miss. Accessing the same page increases the chance of TLB hit.

Basic layout

The tree is dual-level kind-of (a,b)-tree (it does node joining/splitting, when it is too empty/full). The tree is composed of „big“ nodes with large fan-out (blue ones in the picture). Each big node takes a page of memory (and is aligned to it).

Each big node is a tree itself, composed of „small“ nodes (red ones). Each small node is aligned to cache line (so when we access it, we get it for single memory access).

The image shows only part of the tree and has one big node „zoomed“. Both big and red ones would have bigger fan-out in reality and would be more of them inside one big.


Since there are only few (256 if we have 16bytes long cachelines and 4096 bytes large pages), we can use single-byte indexes instead of pointers between small nodes inside the same big one. That would save space inside the small nodes, increasing their fan-out. Pointers outside the big nodes is to another big nodes, it's root is at position 0. It might be useful to use another flag to mark short offset pointers (for example 2byte ones). They should be majority there (because they point to whole pages, not memory location) and save more space.

Internal structure of big nodes

It is really simple. After some kind of header (which could contain a RW-lock, maybe a bitmap of free space, …), there are small nodes, one after another, like in an array.

Internal structure of small nodes

The string data to compare must be inside the nodes. But, as they are short, only parts of it should be there (since when we are somewhere down the tree, we already checked part of it). There are two possibilities with the keys. Either they mark edges (eg. trie) or they mark separators in the tree. For simplicity, it makes no sense to mix these two inside the same node (so this should be a node-level flag). The trie ones clearly mark what is already checked, the separator ones just split the possibilities to lower nodes, without really checking anything in their name (eg. with trie, the prefix must match, with separator, it must be in between them). Maybe better showed with an image. The gray labels are labes that live there, black ones are the ones actually stored (again, the tree is not complete, we are not sure we really found what we were looking with this information). The left side is trie approach and when we get to the left node (after following the „a“ edge), we no longer need to check the first character.

The use for the separator ones is at places where it is „dense“ (we will not fit the whole alphabet into single node).

Now, what needs to be stored inside the node? Some flags („I'm a trie vertex“ or „I have a data pointer, to some RR sets“), some strings (of varying lengths, but no length should be longer than the node itself, so 4 or 5 bits per length should be enough) and some pointers (of various types ‒ local indices, offset 2byte ones, full-length ones). It makes sense to squash all the bit arrays together so space is not wasted. Then the string data would go. Since pointers must be alligned, they would fit from the end ‒ full-length ones at the rear, 2byte ones before them, single byte indices before. So it would look like this (bits are blue, strings yellow and pointers red):


Let us look at adding, removing would be similar.

On the small-node level, it acts as a b-tree. If the new data can not fit it, we split it and insert another pointer at parent node (which we remember from the path how we got here). And it repeats until we fit somewhere (or create new empty root). The only difference here is we are not limited by number of keys/pointers, but by space they take (we can have a node with many pointers if they are local and single-character strings or just one if it has really long string to check). When we allocate a new node, we put it into the same big node.

If we fail to allocate space inside the same big node, we need to split the big node (in a b-tree-like way). So we split its root into two small nodes, put each one into a separate big node and move the descendants to according big node.

Last modified 8 years ago Last modified on Oct 20, 2010, 1:53:07 PM

Attachments (8)

Download all attachments as: .zip