Work changes and a determination to actually finish my PhD mean I really should make a bit of an effort here and so I will.

Here is a puzzle about binary trees that has been bothering me – it comes out of my PhD research…

In that research I am investigating how to implement virtual memory on a many-core Network-on-Chip (NoC) system. Essentially I have been building and running emulators.

The core of the emulation is the “memory tree” – the connection between the cores of the NoC and the global memory controller. The “Bluetree” memory tree is a binary tree arrangement, with the root node (level 0) connecting to the memory controller and the leaves at the NoC tiles (with each tile having a single processor).

At every non-leaf level of the tree there are multiplexers (muxes), each with a buffer that can hold a single memory packet – so if two requests for global memory arrive at once and the buffer is free, there needs to be an arbitration process to decide which packet gets to occupy the buffer and which has to wait.
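A minimal sketch of one such mux in Python (the class and method names are my own, not taken from the Bluetree implementation): a single-packet buffer plus an arbitration rule applied when both inputs present a packet and the buffer is free.

```python
def arbitrate_left_first(left_pkt, right_pkt):
    """Naive policy: the left input always wins a tie."""
    return "left"

class Mux:
    def __init__(self, policy=arbitrate_left_first):
        self.buffer = None          # holds at most one packet
        self.policy = policy

    def offer(self, left_pkt=None, right_pkt=None):
        """Try to accept a packet; return the list of packets left waiting."""
        if self.buffer is not None:
            # Buffer occupied: every arriving packet has to wait.
            return [p for p in (left_pkt, right_pkt) if p is not None]
        if left_pkt is not None and right_pkt is not None:
            # Contention for a free buffer: the policy picks the winner.
            if self.policy(left_pkt, right_pkt) == "left":
                self.buffer = left_pkt
                return [right_pkt]
            self.buffer = right_pkt
            return [left_pkt]
        # No contention: take whichever packet arrived.
        self.buffer = left_pkt if left_pkt is not None else right_pkt
        return []

m = Mux()
waiting = m.offer(left_pkt="A", right_pkt="B")
print(m.buffer, waiting)   # "A" occupies the buffer, "B" waits
```

With the buffer occupied, a subsequent `m.offer(left_pkt="C")` would return `["C"]` – the packet waits regardless of which side it arrived on.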

We have 128 leaf nodes – as in this tree…

With this many leaf nodes, the question of the arbitration policy of the muxes is important. A naïve approach would be to say: if two packets contend for the buffer, let the packet on the left win the first time and then pass the packet on the right – after all, this guarantees no starvation.

But it is actually a hopeless policy … the leftmost leaf node (we shall number from zero) faces no opposition, while the rightmost (number 127) loses out to every other node.

But the number of a leaf node alone is not a sufficient way of describing its relative likelihood of being blocked – for instance, leaf node 16 appears to be in a better position than leaf node 15: going up the tree, leaf node 15 can be blocked at level 6 (the root node is at level 0), at level 5, level 4 and level 3, while leaf node 16 can only be blocked at level 2.
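This can be read straight off the binary representation of the leaf number. A short Python sketch (my own illustration, assuming the root mux is level 0 and decides the most significant bit): a leaf's packets lose a tie exactly at the muxes where the leaf sits in the right-hand subtree, i.e. at the 1-bits of its number.

```python
DEPTH = 7   # 2**7 = 128 leaves; root mux at level 0

def blocked_levels(leaf, depth=DEPTH):
    # Write the leaf number as a fixed-width binary string, MSB first.
    # The mux at level L routes according to bit L of that string;
    # a 1-bit puts the leaf on the right-hand (losing) side.
    bits = format(leaf, f"0{depth}b")
    return [level for level, bit in enumerate(bits) if bit == "1"]

print(blocked_levels(15))   # [3, 4, 5, 6] – loses at four muxes
print(blocked_levels(16))   # [2] – loses at only one
print(blocked_levels(0))    # [] – the leftmost leaf is never opposed
```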

Practical testing of this naïvely fair approach in the emulator gives results like the following… as can be seen, the right-hand part of the tree performs appallingly, but even there a lot of variation can be seen:

My question is this: what is the correct (i.e., correct mathematically) way of describing the order of the leaf nodes?

4 thoughts on “Reviving this blog … with a question about binary trees”

This model seems designed to fail. Too little bandwidth at the root for so many processors. Unless you don’t need much bandwidth. But I could well be missing something.

Are packets only going one way? Towards the root? That seems odd, but it seems to be what you describe.

Is the bandwidth of each interior node identical?

You seem to mention buffering of only the delayed packets but I would think something has to hold even the packets passed through, at least temporarily. Does the sending node or the receiving node or perhaps both buffer?

Since the interior nodes are not simple, adding a little complexity would not be relatively expensive. How about a single bit so that each could implement round-robin priority? Clearly fair.

Thanks for the comment.

I know this approach (to the muxes) doesn’t work and I need to use a round-robin approach: my question really is how to better describe the problem mathematically.

With Bluetree (see https://www-users.cs.york.ac.uk/~robdavis/wmc2013/paper17.pdf) returning packets are not buffered and I’ve not mentioned them here: my research is concentrating on looking at ways of overcoming the memory tree bottleneck – you can see an earlier paper here.

Think of the node number in binary. Then you can read it as a map from the root: 0 means turn left, 1 means turn right at each step. So the number of on bits is the number of places in the tree where a root-bound packet does not have right of way. So the number of delays (in a saturated system) is the “population count” of the number of the node sending the message. But I think you knew that.
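That mapping takes only a couple of lines to check (a sketch; `delays` here counts losing merge points under the left-always-wins policy, not actual cycles):

```python
def delays(leaf):
    # Population count of the leaf number: one potential delay for
    # every mux the packet approaches from the non-priority side.
    return bin(leaf).count("1")

print(delays(0))     # leftmost leaf: never opposed
print(delays(127))   # rightmost leaf: loses at every level
print(delays(15), delays(16))
```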

It is easy to reorder the numbering but that will not change the number of nodes with each possible number of delays. Those proportions follow the binomial coefficients: with 7 levels, C(7, k) of the 128 leaves see k delays – 1 leaf with 0 delays, 7 with 1, 21 with 2, 35 with 3, and symmetrically back down to 1 leaf with 7.
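Tabulating the population counts over all 128 leaves makes the distribution explicit (a quick sketch):

```python
from collections import Counter
from math import comb

# How many leaves suffer each possible number of delays
# (= population count of the leaf number)?
dist = Counter(bin(leaf).count("1") for leaf in range(128))
print(sorted(dist.items()))
# -> [(0, 1), (1, 7), (2, 21), (3, 35), (4, 35), (5, 21), (6, 7), (7, 1)]

# These counts are exactly the binomial coefficients C(7, k).
assert all(dist[k] == comb(7, k) for k in range(8))
```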

It is interesting to ponder which numbers not to use if the total number of nodes isn’t of the form 2^n – 1.

I’m not sure what you mean by a round-robin approach, so it may be the same as what I have in mind. To balance delays a bit more, and assuming each non-leaf node can spare one bit for bookkeeping, you could do something akin to the “possession arrow” in basketball. The right-of-way bit starts at 0 and flips any time two packets arrive simultaneously at that node. The winning side is determined by the bit’s value before it toggled. So if left won the first tie, right wins the next tie, etc. A variation on this would be to flip the bit in favor of each delayed packet. For instance, if the buffer is occupied by a packet and the right edge gets a blocked packet, the bit flips (if necessary) to favor the right edge in the event of the next tie, regardless of where the packet in the buffer arose.
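The “possession arrow” rule can be sketched in a few lines (names are my own; the first version, where the bit flips only on a tie):

```python
class PossessionArrowMux:
    """One-bit round-robin arbiter: on a tie, the side named by the
    right-of-way bit wins, and the bit then flips to favour the loser."""

    def __init__(self):
        self.right_of_way = "left"   # the single bookkeeping bit

    def tie_break(self):
        winner = self.right_of_way
        # Flip the bit so the losing side wins the next tie.
        self.right_of_way = "right" if winner == "left" else "left"
        return winner

mux = PossessionArrowMux()
print([mux.tie_break() for _ in range(4)])   # alternates left/right
```

The variation described above would additionally set the bit in favour of whichever side most recently had a packet delayed, rather than only toggling it on ties.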
