Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Assignment to the earliest group

If there are multiple groups that we’re allowed to add an operation to, which one do we choose? In general, it seems wise to choose the earliest-scheduled group, that is the leftmost one in the graph. To see why, consider the following example. Imagine we’ve built up the following graph, where:

  • The operation w1 targets shard B.
  • w2 targets shard A and depends on w1.
  • w3 targets B and has no dependencies.
  • w4 targets C and depends on w3.
  • w5 targets B and depends on w4.

This gives us the following graph according to the above rules:

Shard │
      │                 ┌────┐
    A │          .------- w2 │
      │         /       └────┘
      │        /
      │   ┌───/────────┐    ┌────┐
    B │   │ w1      w3 │    │ w5 │
      │   └───────────\┘    └/───┘
      │                \    /
      │                ┌\──/┐
    C │                │ w4 │
      │                └────┘

w1 and w3 have been merged into a group because they do not depend on each other, so it is safe to execute them together as long as the other writes start after the write to B is complete. The indirect dependency of w5 on w3 means it cannot be merged with w3 and must start a new group, otherwise it would be committed before w4.

Now say we have another operation targeting B, w6. It has no dependencies and we decide to merge it into the group with w5.

Shard │
      │                 ┌────┐
    A │          .------- w2 │
      │         /       └────┘
      │        /
      │   ┌───/────────┐    ┌────────────┐
    B │   │ w1      w3 │    │ w5      w6 │
      │   └───────────\┘    └/───────────┘
      │                \    /
      │                ┌\──/┐
    C │                │ w4 │
      │                └────┘

Next we add w7, targeting A and depending on w6. We can merge this with w2 safely, with the effect that this group is now scheduled after all the others.

Shard │
      │                          ┌────────────┐
    A │          .---------------- w2      w7 │
      │         /                └────────/───┘
      │        /                         /
      │   ┌───/────────┐    ┌───────────/┐
    B │   │ w1      w3 │    │ w5      w6 │
      │   └───────────\┘    └/───────────┘
      │                \    /
      │                ┌\──/┐
    C │                │ w4 │
      │                └────┘

Finally we add w8 which targets B and depends on w4 and w7. Its indirect dependencies on w3 and w6 mean it cannot be merged into any existing group without violating causality, and must be put in a group by itself.

Shard │
      │                          ┌────────────┐
    A │          .---------------- w2      w7 │
      │         /                └────────/──\┘
      │        /                         /    \
      │   ┌───/────────┐    ┌───────────/┐    ┌\───┐
    B │   │ w1      w3 │    │ w5      w6 │    │ w8 │
      │   └───────────\┘    └/───────────┘    └/───┘
      │                \    /                 /
      │                ┌\──/┐                /
    C │                │ w4 ----------------'
      │                └────┘

This gives us the following execution for this set of writes, which preserves the required ordering so that we do not start writing an operation until after all its dependencies are successfully committed.

Shard │
      │                                                   ┌─────────────┐
    A │                                                   │ WRITE ░░░░░░│
      │                                                   ├─────────────┘
      │                                                   └─ w2, w7
      │   ┌─────────────┐                 ┌─────────────┐                 ┌─────────────┐
    B │   │ WRITE ░░░░░░│                 │ WRITE ░░░░░░│                 │ WRITE ░░░░░░│
      │   ├─────────────┘                 ├─────────────┘                 ├─────────────┘
      │   └─ w1, w3                       └─ w5, w6                       └─ w8
      │                   ┌─────────────┐
    C │                   │ WRITE ░░░░░░│
      │                   ├─────────────┘
                          └─ w4

However, this execution is not optimal; we perform a total of five writes, all sequentially, and w2 is scheduled much later than necessary, considering it only depends on w3 being committed. What if instead we put w6 in the first group, with w1 and w3?

Shard │
      │                 ┌────┐
    A │          .------- w2 │
      │         /       └────┘
      │        /
      │   ┌───/────────────────┐    ┌────┐
    B │   │ w1      w3      w6 │    │ w5 │
      │   └───────────\────────┘    └/───┘
      │                \            /
      │                ┌\───┐      /
    C │                │ w4 ------'
      │                └────┘

Now, w7 can again be merged with w2, but now we do not delay w2 until after w5 is committed.

Shard │
      │                 ┌───────────┐
    A │          .------- w2     w7 │
      │         /       └───────/───┘
      │        /               /
      │   ┌───/───────────────/┐    ┌────┐
    B │   │ w1      w3      w6 │    │ w5 │
      │   └───────────\────────┘    └/───┘
      │                \            /
      │                ┌\───┐      /
    C │                │ w4 ------'
      │                └────┘

When we plan w8, we can now put it in the group with w5, on which it does not depend. We cannot place w8 any earlier because of its indirect dependencies on w3 and w6.

Shard │
      │                 ┌───────────┐
    A │          .------- w2     w7 ------.
      │         /       └───────/───┘      \
      │        /               /            \
      │   ┌───/───────────────/┐    ┌────────\───┐
    B │   │ w1      w3      w6 │    │ w5      w8 │
      │   └───────────\────────┘    └/───────/───┘
      │                \            /       /
      │                ┌\───┐      /       /
    C │                │ w4 ------'-------'
      │                └────┘

We now have four writes rather than five, and two of them can be performed concurrently whereas our previous plan required all writes to happen sequentially, so the depth has been reduced from five to three.

Shard │
      │                     ┌───────────────┐
    A │                     │ WRITE ░░░░░░░░│
      │                     ├───────────────┘
      │                     └─ w2, w7
      │   ┌───────────────┐                   ┌───────────────┐
    B │   │ WRITE ░░░░░░░░│                   │ WRITE ░░░░░░░░│
      │   ├───────────────┘                   ├───────────────┘
      │   └─ w1, w3, w6                       └─ w5, w8
      │                     ┌───────────────┐
    C │                     │ WRITE ░░░░░░░░│
      │                     ├───────────────┘
                            └─ w4

This suggests that when trying to merge an operation into an existing group, the earliest or “leftmost” eligible group in the graph should be chosen, to reduce the likely total depth of the graph. For “leaf” operations with no dependencies, this means picking the first group at all times, but non-leaf operations must be placed in groups to the right of any operations they depend on. In general, operations should be scheduled as early as possible, but no earlier.

There is another important property at work here. Look again at the graph state before we add w6:

Shard │
      │                 ┌────┐
    A │          .------- w2 │
      │         /       └────┘
      │        /
      │   ┌───/────────┐    ┌────┐
    B │   │ w1      w3 │    │ w5 │
      │   └───────────\┘    └/───┘
      │                \    /
      │                ┌\──/┐
    C │                │ w4 │
      │                └────┘

The [w1, w3] group is already depended upon by the [w2] group, and so adding w6 and w7 to these groups respectively does not create any new inter-group dependencies that would block admission of new operations into the first B group. It was already impossible for anything depending on the [w2] group to enter the [w1, w3] group, and the addition of w6 and w7 has not changed that.

Shard │
      │                 ┌───────────┐
    A │          .------- w2     w7 │
      │         /       └───────/───┘
      │        /               /
      │   ┌───/───────────────/┐    ┌────┐
    B │   │ w1      w3      w6 │    │ w5 │
      │   └───────────\────────┘    └/───┘
      │                \            /
      │                ┌\───┐      /
    C │                │ w4 ------'
      │                └────┘

So it is tempting to try to plan execution by avoiding the creation of new inter-group dependencies, if possible. However, if we’re considering each operation one at a time while building the graph, when w6 is added we do not yet know what other operations might come to depend on it.

Shard │
      │                 ┌────┐
    A │          .------- w2 │
      │         /       └────┘
      │        /
      │   ┌───/────────────────┐    ┌────┐
    B │   │ w1      w3      w6 │    │ w5 │
      │   └───────────\────────┘    └/───┘
      │                \            /
      │                ┌\───┐      /
    C │                │ w4 ------'
      │                └────┘

Doing this sort of analysis requires comparing whole dependency sets and their possible intersections, and every possible combination of groups they could form, rather than building the graph one operation at a time, and is likely to be more computationally expensive and harder to define an algorithm for it. So for now we will take the simpler option of preferring to place each operation in the earliest possible group.