Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

remove (DirPath)

The remove() operation is in some ways the inverse of update(). We’ve seen how an update() is composed of two lower-level operations, put() and link(), and that put() must be performed only after all the link() operations have succeeded. A typical update() looks like this, including the required get() and list() to get the current state of the document and its parent directory:

Shard │
      │   ┌─────────────┐                       ┌─────────────────────┐
    A │   │ get('/doc') ------------------------- put('/doc', newDoc) │
      │   └─────────────┘                       └/────────────────────┘
      │                                         /
      │                                        /
      │   ┌───────────┐     ┌─────────────────/┐
    B │   │ list('/') ------- link('/', 'doc') │
      │   └───────────┘     └──────────────────┘

This graph permits the four operations to execute in one of three orders; the list(), link() and put() operations must happen in that order, while the get() can happen at any time before the put().

Order 1                 Order 2                 Order 3
------------------------------------------------------------------------
list('/')               list('/')               get('/doc')
link('/', 'doc')        get('/doc')             list('/')
get('/doc')             link('/', 'doc')        link('/', 'doc')
put('/doc', newDoc)     put('/doc', newDoc)     put('/doc', newDoc)

remove() similarly breaks down into two lower-level operations: rm() removes the document itself, and unlink() removes an item from a directory. If removing the document from its parent directory leaves the directory empty, then that directly should itself be removed, and so on up the tree. This can be implemented by getting the list() of all the ancestor directories and performing unlink() for those that contain a single name forming a chain upward from the removed document.

In order to obey the safety condition of a document always being reachable via links in all its ancestor directories, remove() must not perform any unlink() calls before the rm() has completed. For now we’ll assume that multiple unlink() calls may happen in parallel, but they must not precede the rm(). This constraint gives us the following graph:

Shard │
      │   ┌─────────────┐     ┌────────────┐
    A │   │ get('/doc') ------- rm('/doc') │
      │   └─────────────┘     └───────────\┘
      │                                    \
      │                                     \
      │   ┌───────────┐                     ┌\───────────────────┐
    B │   │ list('/') ----------------------- unlink('/', 'doc') │
      │   └───────────┘                     └────────────────────┘

(The get() here represents loading the shard containing /doc; we don’t need the current state of /doc itself in order to delete it, but we do need to load the shard it resides in, in order to modify it.)

These operations can also be performed in one of three orders, which differ only in the position of the list() call relative to other operations:

Order 1                 Order 2                 Order 3
------------------------------------------------------------------------
get('/doc')             get('/doc')             list('/')
rm('/doc')              list('/')               get('/doc')
list('/')               rm('/doc')              rm('/doc')
unlink('/', 'doc')      unlink('/', 'doc')      unlink('/', 'doc')

If the document has more than one ancestor directory, for example /path/to/note.txt, that doesn’t fundamentally change the structure of these graphs, it just adds more link() and unlink() calls that can happen in parallel.

The key risk in implementing remove() is that it can be executed concurrently with an update() that affects the same directories, causing a race condition. For example, if we have a document named /path/to/note.txt, then an update() call will call link('/path/', 'to/') while a remove() may attempt unlink('/path/', 'to/') if it believes the /path/to/ directory will be left empty. If remove('/path/to/note.txt') is called concurrently with update('/path/to/readme.md'), then remove() might try to delete the /path/to/, /path/ and / directories entirely, believing it’s deleting the only document inside them, not knowing that another document is in the process of being written.

We must therefore pay careful attention to how the operations in update() and remove() are executed. In a CAS system we can rely on the underlying adapter to raise an update conflict if two processes concurrently modify the same directory item; for example this protects us from removing directories we believe will become empty if some other process modified them since we called list() to get their current state. The question for us here is how we handle such conflicts to make sure that databases always remain in a safe state, given any possible interleaving of the operations inside an update() and a remove().

In general a CAS adapter will raise a conflict in the following cases:

  • Between a get() call to read a document, and a put() or rm() call to update that document, another process updates the shard that document resides in.

  • Between a list() call to read a directory, and a link() or unlink() call to update that directory, another process updates the shard that directory resides in.

This means a CAS adapter will definitely raise a conflict if two processes concurrently modify the same item, but it will also raise a conflict if any other item in the same shard is updated at the same time. To detect whether a conflict is actually due to our item being updated, rather than some other item in the same shard, it may be desirable to give individual items version IDs.

We’ll say that a document write that comes between a get() and a put()/rm() for the same document invalidates the get(). That is, it renders the version ID returned by get() stale, and the following put()/rm() will fail with a conflict error. Similarly, a write to a directory item can invalidate a list(), causing a future link()/unlink() to fail. We need to arrange things so that any sequence of operations executed by a set of EscoDB clients does not leave the data in an unsafe state. Any operation that would lead to such a state should cause an invalidation somewhere, leading to unsafe operations being rejected with a conflict.

We can see that Order 1 for the update() and remove() functions given above is fundamentally unsafe, as follows. If an update() and remove() for /doc happen one after the other, it gives the following sequence of operations:

updateremove
list('/')
link('/', 'doc')
get('/doc')
put('/doc', newDoc)
get('/doc')
rm('/doc')
list('/')
unlink('/', 'doc')

If update and remove are executed independently by different clients, then the operations within each one remain in the same order but the two sets of operations can become interleaved. For example, this order of operations is also possible:

updateremove
list('/')
link('/', 'doc')
get('/doc')
rm('/doc')
list('/')
unlink('/', 'doc')
get('/doc')
put('/doc', newDoc)

None of the writes in this sequence fail, because none of the get() or list() calls are invalidated by some other write coming between them and the corresponding put()/rm() or link()/unlink(). And so we end up executing put() for the document after we called unlink() to remove it from its parent directory, violating the safety condition.

Thinking about how we’d implement Order 1 using a Locking adapter helps illustrate why it’s not safe. To implement update in this order it would be sufficient to do the following:

  • Acquire a lock on the shard holding the / directory (shard B)
  • Read shard B
  • Write shard B to update the / directory
  • Release the lock on shard B
  • Acquire a lock on the shard holding the /doc document (shard A)
  • Read shard A
  • Write shard A to update the /doc document
  • Release the lock on shard A

At no point are we holding locks on both shards at the same time, and this is what allows another process to come in and make changes between our updates to / and /doc leading to a safety violation. To fix this we need to force the extent of these locks to overlap. In a CAS system, we are implicitly holding a lock from the time we read a shard, to the next time we write it – our write succeeds as long as nobody else writes to the shard while we were looking at it.

Order 1 being unsafe means the get() in update must come before the link(); using the locking analogy, this would mean acquiring the lock on A while we’re still holding the lock on B. Similarly, the list() in remove must come before the rm(). This gives us the following dependency graph for update:

Shard │
      │      ┌─────────────┐                    ┌─────────────────────┐
    A │      │ get('/doc') │                    │ put('/doc', newDoc) │
      │      └────────────\┘                    └/────────────────────┘
      │                    \                    /
      │                     \                  /
      │   ┌───────────┐     ┌\────────────────/┐
    B │   │ list('/') ------- link('/', 'doc') │
      │   └───────────┘     └──────────────────┘

And the following graph for remove; these graphs allow Orders 2 and 3 but not Order 1:

Shard │
      │   ┌─────────────┐     ┌────────────┐
    A │   │ get('/doc') ------- rm('/doc') │
      │   └─────────────┘     └/──────────\┘
      │                       /            \
      │                      /              \
      │          ┌──────────/┐              ┌\───────────────────┐
    B │          │ list('/') │              │ unlink('/', 'doc') │
      │          └───────────┘              └────────────────────┘

Now let’s examine Order 2 and see what further constraints it reveals. Without interleaving, the Order 2 operation sequence for update and remove looks like this:

updateremove
list('/')
get('/doc')
link('/', 'doc')
put('/doc', newDoc)
get('/doc')
list('/')
rm('/doc')
unlink('/', 'doc')

To violate safety, we need to find a way to perform a write that produces an unsafe state, but does not cause a CAS conflict, and so executes successfully. In our previous example, this happened when put() was executed after unlink():

updateremove
list('/')
get('/doc')
link('/', 'doc')
get('/doc')
list('/')
rm('/doc')
unlink('/', 'doc')
put('/doc', newDoc) ⚠️

In this order of operations, put() produces a conflict, indicated by ⚠️, because an rm() occurs between the get() and put() in update, invalidating the get(). To avoid this, the get() in update would have to happen after the rm() operation, for example:

updateremove
list('/')
get('/doc')
list('/')
rm('/doc')
get('/doc')
link('/', 'doc')
unlink('/', 'doc') ⚠️
put('/doc', newDoc)

We’re not changing the operation order within each function, we’re just changing how the two functions interleave. Placing the get() in update after rm() forces the link() call to happen after the list() in remove. It thus invalidates that list() call and causes unlink() to fail with a conflict. At no point in this execution is the database in an unsafe state, because the following three writes are successful:

  • rm() removes the document but leaves it linked in its parent directory
  • link() ensures the document is linked
  • put() writes a new version of the document

The database is never in a state where the document exists, but is not linked, so this execution is safe. While keeping get() after rm(), what other executions are possible? Only two other orderings of the link(), put() and unlink() calls are possible; either unlink() happens first:

updateremove
list('/')
get('/doc')
list('/')
rm('/doc')
get('/doc')
unlink('/', 'doc')
link('/', 'doc') ⚠️
put('/doc', newDoc) ⛔️

This invalidates the list() in update and makes link() fail with a conflict, so we don’t actually execute put() at all (indicated by ⛔️). This is equivalent to the remove fully completing and the update not happening at all, which is safe. The other possible ordering is that unlink() happens last:

updateremove
list('/')
get('/doc')
list('/')
rm('/doc')
get('/doc')
link('/', 'doc')
put('/doc', newDoc)
unlink('/', 'doc') ⚠️

This is equivalent to the earlier case where link() causes unlink() to fail. update completes successfully, and remove partially executes but does not introduce a safety violation.

The above sequence of operations demonstrates two important constraints. First, it’s important that unlink() fails here, because if it succeeded it would unlink the existing /doc document, leaving things in an unsafe state. This can only happen if we execute the link() call, even if the / directory already contains doc, purely to update the shard’s version ID and cause the unlink() call to fail. We cannot skip link() calls that do not change a directory’s contents as we’d earlier hoped we might. (If the CAS adapter’s versioning is just a content hash, then writing the same content to it may not change its version. We can force a version change by re-encrypting the relevant directory item with new random keys, thereby changing the shard’s contents, or by incrementing a counter in the shard’s metadata whenever we write it. This also prevents the ABA problem from happening.)

Second, it shows that if an operation fails with a conflict, we cannot retry just that operation – we must retry the entire function it was part of. For example, imagine we handle the above conflicted unlink() call by just re-reading the / directory and trying the unlink again:

updateremove
list('/')
get('/doc')
list('/')
rm('/doc')
get('/doc')
link('/', 'doc')
put('/doc', newDoc)
unlink('/', 'doc') ⚠️
list('/')
unlink('/', 'doc')

The second unlink() succeeds, but now we’ve caused a safety violation by unlinking /doc when it still exists. Instead we need to redo any reads affected by the conflict to (i.e. the link() call) to get their new state and version IDs, and then redo all the writes for the remove function. We don’t need to redo reads for shards we’ve not seen a conflict on, for example, we don’t need to redo the get() call here, we can continue to use the version ID we got from the rm() call, as long as there’s no evidence that version is now stale.

updateremove
list('/')
get('/doc')
list('/')
rm('/doc')
get('/doc')
link('/', 'doc')
put('/doc', newDoc)
unlink('/', 'doc') ⚠️
list('/')
rm('/doc')
unlink('/', 'doc')

This leaves the database in a safe state because /doc is removed before being unlinked. If any other client updates / or /doc while this remove is retrying, then we will get further conflicts and we can start over once again.

Now, recall that we’re trying to find executions that lead to unsafe states but don’t trigger CAS conflicts. We just did this by looking at traces where get() in update is not invalidated because it occurs after rm(). The other way to avoid a conflict on the /doc shard would be for rm() to happen after put(), that is:

updateremove
get('/doc')
list('/')
list('/')
get('/doc')
link('/', 'doc')
put('/doc', newDoc)
rm('/doc') ⚠️
unlink('/', 'doc') ⛔️

In this execution, the get() in remove is invalidated by the put(), causing rm() to fail with a conflict. To avoid this, the get() in remove would have to happen after put() as well:

updateremove
list('/')
get('/doc')
link('/', 'doc')
put('/doc', newDoc)
get('/doc')
list('/')
rm('/doc')
unlink('/', 'doc')

But this is just the two functions executing sequentially, so no race condition is possible. Each function executes to completion, keeping the database in a safe state at all times by the way their internal operations are ordered.

Other attempts to construct a safety violation without a conflict end in the same way. For example, in order for the unlink() call not to cause a conflict in the update function, it must happen either before the list() call in update (which would be a sequential, non-interleaved execution), or it must happen after the link() call, as in:

updateremove
get('/doc')
list('/')
rm('/doc')
list('/')
get('/doc')
link('/', 'doc')
unlink('/', 'doc') ⚠️
put('/doc', newDoc)

Calling unlink() here would be unsafe, as it would unlink a doc that is then created via put(). Thankfully, the unlink() fails with a conflict because link() invalidates the list() in remove. To avoid this, the list() in remove would have to happen after link():

updateremove
get('/doc')
list('/')
get('/doc')
link('/', 'doc')
list('/')
rm('/doc')
unlink('/', 'doc')
put('/doc', newDoc) ⚠️

This forces rm() to happen later, which causes put() to fail, or put() happens first and causes rm() to fail. Any attempt to escape a conflict either moves the conflict to somewhere else, or produces something equivalent to safe sequential execution. None of these cases lead to a safety violation without a conflict arising somewhere.

Similar arguments apply for Order 3 executions. The key thing is that by reading all the affected shards before performing any writes, and thus getting their version IDs before we try to change anything, we’re essentially acquiring overlapping locks that will alert us if anything changes from the initial state while we’re updating things. The update() and remove() operations are each individually constructed so that we end in a safe state if they are partially executed, and this additional ordering constraint means any concurrent execution of one of these functions will receive a conflict if they attempt to create an unsafe state.

All remove() calls will involve at least one unlink() operation, to remove the document from its parent directory. The above examples consider a document that’s a direct child of the root directory /. Documents nested further down the tree will potentially require more than one unlink() call, if deleting them would leave their parent directory empty. For example, consider a database containing this tree of documents:

/
└─┬ path/
  ├── a.txt
  └─┬ to/
    └── b.txt

This tree contains five items:

pathtypevalue
/directory['path/']
/path/directory['a.txt', 'to/']
/path/a.txtdocument<blob>
/path/to/directory['b.txt']
/path/to/b.txtdocument<blob>

If we remove /path/to/b.txt, this should implicitly remove the /path/to/ directory, because b.txt is the only document in that directory. It’s not a safety violation for empty directories to exist, but it reduces the efficiency of the find() and prune() operations so in general we’d like to avoid them.

When performing a remove('/path/to/b.txt') operation, we will first read all the relevant items – the shards containing the /, /path/ and /path/to/ directories and the /path/to/b.txt document itself. We’ll observe that /path/to/ only contains a single item, and removing that item would leave it empty, and so we can unlink it from its parent. In general we can continue doing this all the way up the directory tree until we find a directory with more than one item. So, this remove() call would perform the following operations:

  • rm('/path/to/b.txt')
  • unlink('/path/', 'to/')
  • unlink('/path/to/', 'b.txt')

These would leave the database in the final state…

/
└─┬ path/
  └── a.txt

… containing just three items:

pathtypevalue
/directory['path/']
/path/directory['a.txt']
/path/a.txtdocument<blob>

We must make sure that we don’t delete ancestor directories that are not in fact empty – if we remove the /path/to/ directory but it actually contains items other than b.txt, then we may leave some documents unlinked. Imagine that a call to update('/path/to/c.txt') occurs concurrently with this remove() call, consisting of the following operations:

  • link('/', 'path/')
  • link('/path/', 'to/')
  • link('/path/to/', 'c.txt')
  • put('/path/to/c.txt', doc)

From the starting state containing b.txt, this would produce the following state…

/
└─┬ path/
  ├── a.txt
  └─┬ to/
    ├── b.txt
    └── c.txt

… consisting of six items:

pathtypevalue
/directory['path/']
/path/directory['a.txt', 'to/']
/path/a.txtdocument<blob>
/path/to/directory['b.txt', 'c.txt']
/path/to/b.txtdocument<blob>
/path/to/c.txtdocument<blob>

Is it possible to execute these remove() and update() calls concurrently in a way that leads to an unsafe state, by incorrectly deleting the /path/to/ directory? If these are executed sequentially, this is the complete set of operations involved, where we have labelled each operation with a letter to make them easier to refer to:

updateremove
Alist('/')
Blist('/path/')
Clist('/path/to/')
Dget('/path/to/b.txt')
----
Erm('/path/to/b.txt')
----
Funlink('/path/', 'to/')
Gunlink('/path/to/', 'b.txt')
Hlist('/')
Jlist('/path/')
Klist('/path/to/')
Lget('/path/to/c.txt')
----
Mlink('/', 'path/')
Nlink('/path/', 'to/')
Plink('/path/to/', 'c.txt')
----
Qput('/path/to/c.txt', doc)

The lines reading ---- denote barriers: within the update and remove calls, operations may be executed in parallel and so run in unpredictable orders, but the implementation makes sure they’re not re-ordered across these barriers. First each call performs all its reads – the list() and get() calls. update then performs a set of link() operations, and then a put() if all those succeed. remove performs rm() and then a set of unlink() calls if that succeeds. list() and get() calls may be re-ordered relative to each other, link() calls may happen in any order, as may unlink() calls. But link() cannot happen after put(), nor can rm() happen after unlink(), in order to preserve our safety condition.

Given this set of operations, is it possible for them to execute in an order such that we mistakenly call F: unlink('/path/', 'to/') and end up leaving c.txt unlinked? In a sequential execution of remove followed by update, this trivially does not happen, because update does all the necessary link() calls before writing c.txt. If the calls overlap in time, then for unlink('/path/', 'to/') to be written incorrectly, a couple of conditions must hold:

  • The call C: list('/path/to/') must return a list containing a single item, so that we decide to perform call F. This means C must happen before call P adds c.txt to this directory.

  • The call F must succeed without conflict, i.e. no conflicting link() call happens between the relevant list() and unlink() calls for /path/ in remove. That is, call N must not come between calls B and F.

It turns out there is an ordering of these operations allowed by these rules that produces a problem:

updateremove
Hlist('/')
Jlist('/path/')
Klist('/path/to/')
Lget('/path/to/c.txt')
----
Mlink('/', 'path/')
Nlink('/path/', 'to/')
Alist('/')
Blist('/path/')
Clist('/path/to/')
Plink('/path/to/', 'c.txt')
----
Qput('/path/to/c.txt', doc)
Dget('/path/to/b.txt')
----
Erm('/path/to/b.txt')
----
Funlink('/path/', 'to/')
Gunlink('/path/to/', 'b.txt') ⚠️

As required, P comes after C, and N is not between B and F. However, P happens between C and G and causes G to fail with a conflict, as C has been invalidated. The update function runs to completion: calls M, N and P happen before remove performs any writes, and Q affects an item that remove does not touch. remove fails with a conflict, but only after F has run and unlinked the /path/to/ directory. This means that after F completes, the document /path/to/c.txt exists but is not fully linked since the /path/ directory does not contain to/.

We can address this by forcing unlink() calls to execute sequentially, starting with the deepest directory. That is, F must happen after G, which we’ll denote by adding a barrier between them.

updateremove
Hlist('/')
Jlist('/path/')
Klist('/path/to/')
Lget('/path/to/c.txt')
----
Mlink('/', 'path/')
Nlink('/path/', 'to/')
Alist('/')
Blist('/path/')
Clist('/path/to/')
Plink('/path/to/', 'c.txt')
----
Qput('/path/to/c.txt', doc)
Dget('/path/to/b.txt')
----
Erm('/path/to/b.txt')
----
Gunlink('/path/to/', 'b.txt') ⚠️
----
Funlink('/path/', 'to/') ⛔️

This modification to the previous execution means F will no longer execute; the barrier means that F will only run if G completes successfully. At no point in this execution do we violate the requirement that all documents are fully linked to the root directory.

To prevent the conflict on G, we would need P not to happen between C and G. Remember we only perform F if P happens after C, so the only situation worth investigating is where P happens after G:

updateremove
Hlist('/')
Jlist('/path/')
Klist('/path/to/')
Lget('/path/to/c.txt')
----
Mlink('/', 'path/')
Nlink('/path/', 'to/')
Alist('/')
Blist('/path/')
Clist('/path/to/')
Dget('/path/to/b.txt')
----
Erm('/path/to/b.txt')
----
Gunlink('/path/to/', 'b.txt')
Plink('/path/to/', 'c.txt') ⚠️
----
Qput('/path/to/c.txt', doc) ⛔️
----
Funlink('/path/', 'to/')

Now G falls between K and P and causes P to fail with a conflict, and thus Q does not happen at all. remove runs to completion, and update fails to create the document that would have triggered a safety violation.

To avoid this conflict, K would need to happen after G. We can move K down below L, but the barrier below that necessitates moving all the other update operations after this after G as well.

updateremove
Hlist('/')
Jlist('/path/')
Lget('/path/to/c.txt')
Alist('/')
Blist('/path/')
Clist('/path/to/')
Dget('/path/to/b.txt')
----
Erm('/path/to/b.txt')
----
Gunlink('/path/to/', 'b.txt')
Klist('/path/to/')
----
Mlink('/', 'path/')
Nlink('/path/', 'to/')
Plink('/path/to/', 'c.txt')
----
Qput('/path/to/c.txt', doc)
----
Funlink('/path/', 'to/') ⚠️

Call N now falls between B and F and causes F to fail with a conflict, and so we don’t perform the problematic unlink() call and Q is safe. We can’t move N up before B without also moving K before G, due to the barriers, so let’s instead try moving N to happen after F.

updateremove
Hlist('/')
Jlist('/path/')
Lget('/path/to/c.txt')
Alist('/')
Blist('/path/')
Clist('/path/to/')
Dget('/path/to/b.txt')
----
Erm('/path/to/b.txt')
----
Gunlink('/path/to/', 'b.txt')
Klist('/path/to/')
----
Mlink('/', 'path/')
Plink('/path/to/', 'c.txt')
----
Funlink('/path/', 'to/')
Nlink('/path/', 'to/') ⚠️
----
Qput('/path/to/c.txt', doc) ⛔️

F now happens between J and N, causing N to fail with a conflict and therefore Q does not happen at all and we once again have a safe execution. Avoiding this conflict would either require moving F to happen after N, which is just the preceding example, or moving it before J. Even moving J as late as possible and keeping K after G, this gives us:

updateremove
Hlist('/')
Lget('/path/to/c.txt')
Alist('/')
Blist('/path/')
Clist('/path/to/')
Dget('/path/to/b.txt')
----
Erm('/path/to/b.txt')
----
Gunlink('/path/to/', 'b.txt')
Klist('/path/to/')
----
Funlink('/path/', 'to/')
Jlist('/path/')
----
Mlink('/', 'path/')
Plink('/path/to/', 'c.txt')
Nlink('/path/', 'to/')
----
Qput('/path/to/c.txt', doc)

This is basically the same as sequential execution. All the writes in remove succeed because update hasn’t performed any writes yet. update then performs all the required link() calls to make the final write Q safe. By trying to construct an execution in which we end in an unsafe state, and then trying to avoid the resulting conflicts, we’ve ended up with a sequential execution.

Let’s take an even more extreme example where the entire update operation happens between G and F, which themselves cannot be re-ordered. Is it possible for the call to G to succeed, and still have F happen erroneously?

updateremove
Alist('/')
Blist('/path/')
Clist('/path/to/')
Dget('/path/to/b.txt')
----
Erm('/path/to/b.txt')
----
Gunlink('/path/to/', 'b.txt')
Hlist('/')
Jlist('/path/')
Klist('/path/to/')
Lget('/path/to/c.txt')
----
Mlink('/', 'path/')
Nlink('/path/', 'to/')
Plink('/path/to/', 'c.txt')
----
Qput('/path/to/c.txt', doc)
----
Funlink('/path/', 'to/') ⚠️

Here F fails with a conflict because N happens between B and F. We can move N before B, but that leaves a conflict on P because G happens first. If we move P before G then G fails instead, and prevents F from happening.

updateremove
Alist('/')
Clist('/path/to/')
Dget('/path/to/b.txt')
Hlist('/')
Jlist('/path/')
Klist('/path/to/')
Lget('/path/to/c.txt')
----
Nlink('/path/', 'to/')
Blist('/path/')
----
Erm('/path/to/b.txt')
----
Gunlink('/path/to/', 'b.txt')
Mlink('/', 'path/')
Plink('/path/to/', 'c.txt') ⚠️
----
Qput('/path/to/c.txt', doc) ⛔️
----
Funlink('/path/', 'to/')

Or, we can move N after F, but that causes a conflict on N that prevents Q from happening.

updateremove
Alist('/')
Blist('/path/')
Clist('/path/to/')
Dget('/path/to/b.txt')
----
Erm('/path/to/b.txt')
----
Gunlink('/path/to/', 'b.txt')
Hlist('/')
Jlist('/path/')
Klist('/path/to/')
Lget('/path/to/c.txt')
----
Mlink('/', 'path/')
Plink('/path/to/', 'c.txt')
----
Funlink('/path/', 'to/')
Nlink('/path/', 'to/') ⚠️
----
Qput('/path/to/c.txt', doc) ⛔️

We see that attempts to order events such that both F and Q execute and leave the database in an unsafe state ends up producing a conflict somewhere that prevents this state from arising. The only constraint we’ve had to add to the design to achieve this is that unlink() calls happen sequentially, in bottom-up order. Intuitively, this makes sense because unlink() is a conditional operation: we remove a directory from its parent if it becomes empty as a result of removing its last child. By performing G before F, we confirm that the state of /path/to/ has not changed since we read it at C, and therefore the decision to perform F is still valid.

Contrast this with link() which is unconditional: we always make sure a document is linked in its parent directories when creating or updating it. Therefore it makes sense that unlink() needs a causal/ordering constraint placed upon it while link() is allowed to execute in parallel. It’s unfortunate that we have to execute unlink() calls sequentially because it will make remove() take longer to run. However, we expect remove() calls to happen much less frequently than update(), and most remove() calls will require only one unlink(), so this is a reasonable trade-off.

Now we know the general logical dependency structure for remove(). Whereas update() is structured as a put() following a set of concurrent link() calls, remove() is an rm() call followed by a sequence of one or more unlink() calls for as many parent directories would be left empty by the removal of the document.

┌────────────────┐
│ rm('/a/b/doc') │
└────────────────┘
                  \
                   ┌────────────────────────┐
                   │ unlink('/a/b/', 'doc') │
                   └────────────────────────┘
                                             \
                                              ┌─────────────────────┐
                                              │ unlink('/a/', 'b/') │
                                              └─────────────────────┘

We will briefly consider how these operations should be scheduled for execution by the Shard Manager. We always need to perform at least one unlink() call for the document’s immediate parent directory, so it makes sense to read the shards for the document itself and its parent directory up front, before executing the writes.

Shard │
      │   ┌─────────────┐       ┌─────────────┐
    A │   │ READ ░░░░░░░│       │ WRITE ░░░░░░│
      │   ├─────────────┘       ├─────────────┘
      │   │                     │              \
      │   │ ┌─────────────┐     │               ┌─────────────┐
    B │   │ │ READ ░░░░░░░│     │               │ WRITE ░░░░░░│
      │   │ ├─────────────┘     │               ├─────────────┘
          │ │                   │               │
          │ └─ list('/a/b/')    │               └─ unlink('/a/b/', 'doc')
          │                     │
          └─ get('/a/b/doc')    └─ rm('/a/b/doc')

If further unlink() calls are required, we could wait until the first one succeeds and becomes empty before deciding to perform the second. This would mean the shard for the grandparent directory is not read until after the first unlink() write is complete.

Shard │
      │   ┌─────────────┐       ┌─────────────┐
    A │   │ READ ░░░░░░░│       │ WRITE ░░░░░░│
      │   ├─────────────┘       ├─────────────┘
      │   │                     │              \
      │   │ ┌─────────────┐     │               ┌─────────────┐
    B │   │ │ READ ░░░░░░░│     │               │ WRITE ░░░░░░│
      │   │ ├─────────────┘     │               ├─────────────┘
      │   │ │                   │               │              \
      │   │ │                   │               │               ┌─────────────┐   ┌─────────────┐
    C │   │ │                   │               │               │ READ ░░░░░░░│───│ WRITE ░░░░░░│
      │   │ │                   │               │               ├─────────────┘   ├─────────────┘
          │ │                   │               │               │                 │
          │ │                   │               │               └─ list('/a/')    └─ unlink('/a/', 'b/')
          │ │                   │               │
          │ └─ list('/a/b/')    │               └─ unlink('/a/b/', 'doc')
          │                     │
          └─ get('/a/b/doc')    └─ rm('/a/b/doc')

But we don’t need to wait for the first unlink() call to succeed before making this decision. We know that if the /a/b/ directory contains a single item, then the /a/b/ directory itself should be removed, assuming its last remaining child is successfully unlinked. This suggests we should read the shards for all the ancestor directories up front, and plan all the required writes before the rm() is executed. The dependency chain will make sure that each write is only executed if the ones before it succeed, so we can make this plan ahead of time and still avoid performing the second unlink() if the first one fails.

Shard │
      │   ┌─────────────┐       ┌─────────────┐
    A │   │ READ ░░░░░░░│       │ WRITE ░░░░░░│
      │   ├─────────────┘       ├─────────────┘
      │   │                     │              \
      │   │ ┌─────────────┐     │               ┌─────────────┐
    B │   │ │ READ ░░░░░░░│     │               │ WRITE ░░░░░░│
      │   │ ├─────────────┘     │               ├─────────────┘
      │   │ │                   │               │              \
      │   │ │ ┌─────────────┐   │               │               ┌─────────────┐
    C │   │ │ │ READ ░░░░░░░│   │               │               │ WRITE ░░░░░░│
      │   │ │ ├─────────────┘   │               │               ├─────────────┘
          │ │ │                 │               │               │
          │ │ └─ list('/a/')    │               │               └─ unlink('/a/', 'b/')
          │ │                   │               │
          │ └─ list('/a/b/')    │               └─ unlink('/a/b/', 'doc')
          │                     │
          └─ get('/a/b/doc')    └─ rm('/a/b/doc')

This is even more beneficial if the ancestor directories reside in the same shard. If we wait for the first unlink() write to succeed before deciding to perform the second, then we end up performing two sequential writes to the same shard.

Shard │
      │   ┌─────────────┐       ┌─────────────┐
    A │   │ READ ░░░░░░░│       │ WRITE ░░░░░░│
      │   ├─────────────┘       ├─────────────┘
      │   │                     │              \
      │   │ ┌─────────────┐     │               ┌─────────────┐   ┌─────────────┐
    B │   │ │ READ ░░░░░░░│     │               │ WRITE ░░░░░░│───│ WRITE ░░░░░░│
      │   │ ├─────────────┘     │               ├─────────────┘   ├─────────────┘
          │ │                   │               │                 │
          │ │                   │               │                 └─ unlink('/a/', 'b/')
          │ │                   │               │
          │ └─ list('/a/b/')    │               └─ unlink('/a/b/', 'doc')
          │                     │
          └─ get('/a/b/doc')    └─ rm('/a/b/doc')

This is sub-optimal: it’s perfectly safe to merge both unlink() calls into a single write. Either they both succeed or they both fail, but importantly the unlink() on the /a/ directory is only performed if the /a/b/ one succeeds, so the causal dependency is maintained and we don’t violate our safety requirement.

Shard │
      │   ┌─────────────┐       ┌─────────────┐
    A │   │ READ ░░░░░░░│       │ WRITE ░░░░░░│
      │   ├─────────────┘       ├─────────────┘
      │   │                     │              \
      │   │ ┌─────────────┐     │               ┌─────────────┐
    B │   │ │ READ ░░░░░░░│     │               │ WRITE ░░░░░░│
      │   │ ├─────────────┘     │               ├─────────────┘
          │ │                   │               │
          │ ├─ list('/a/')      │               ├─ unlink('/a/', 'b/')
          │ └─ list('/a/b/')    │               └─ unlink('/a/b/', 'doc')
          │                     │
          └─ get('/a/b/doc')    └─ rm('/a/b/doc')

Likewise, the first unlink() can be merged into the same write with the put() if both of them target the same shard. Operations that directly depend on each other can be merged into the same write without violating causality.

The final question to address here is what do to in the case of a conflict being reported by the backing store. To handle this conflict, we could either:

  • Bail and mark the update() or remove() as failed, or
  • Resume the function by restarting it from the beginning, based on the shards’ new versions.

The first option is safe, because these operations are individually crash-safe and so halting them part-way through is fine. However the second option isn’t unsafe; restarting a remove() operation doesn’t violate the safety requirement to make sure all existing documents are linked in their parent directories.

Rather than a safety violation, this raises a UI question: what should remove() tell the application if it detects a conflict? CAS is so far assumed to be an internal implementation detail – we don’t expose version information in the Query API, so a call to remove() can be done with no prior read, meaning the application wants the item deleted no matter what its current state is. The target use cases we have in mind include command-line credential managers that expose a “remove item” operation that doesn’t ask the user for the item’s current version before proceeding.

While this can be implemented on top of an API that requires an item to be read before it’s deleted, just to get its current version, this isn’t how we intend EscoDB to be used. update() hides the internal concurrency control by taking an update function rather than a new state, and it retries until the update succeeds. For symmetry it makes sense to treat remove() like a special kind of update and retry it until it succeeds. If the application isn’t co-ordinating the update() and remove(), it can’t have any expectation of them being applied in any particular order, and once it learns that an operation succeeded, it can’t assume the item continues in that state as some other process might modify it immediately afterward. So, it doesn’t violate any basic assumptions to have all operations retry until success, possibly with a retry or time limit, and it improves the usability for the use cases we have in mind.

If we don’t expose version information in the API, then there’s nothing an application could sensibly do with a rejected remove() other than make the exact decision we just did about whether to bail or retry based on a blanket policy, rather than on the state of the document in question. It would only make sense for EscoDB’s Query API to expose rejection as a mode of failure if we also expose versioning and don’t retry operations ourselves, which would fundamentally change the design.