Data model
Documents in EscoDB are arranged in a filesystem-like hierarchy rather than a flat namespace. This provides a means to group related documents together so that they can be found efficiently by a common prefix of their path. This enables features like tab-completion and removal of whole sets of related documents with a single function call. Let’s define a few terms:
-
Document: a blob of data, of arbitrary type. For the environments EscoDB is intended to operate in, this blob will typically be a string, either JSON or a base-64 string. However, our design doesn’t assume any particular data serialisation.
-
Directory: an object that groups one or more documents or directories together. Its value is a sorted list of the names of its children.
-
Path: a complete ID that points to a document, or a prefix of such an ID. All paths begin with a slash (
/). Document paths must not end with a slash. Directory paths, which are prefixes of document paths, must end with a slash. For example,/alice/notes.txtis a document path, and/and/alice/are directory paths. -
Name: the final segment of a path; the relative path of an item from its parent directory. Names do not begin with slashes, but must end with slashes if they identify a directory.
notes.txtandalice/are names. -
Item: a complete logical database entry consisting of a path and either a document or directory; a key-value pair.
For example, a set of documents stored in EscoDB might logically resemble this tree of files:
/
├─┬ alice/
│ └── notes.txt
├─┬ bob/
│ └─┬ pictures/
│ ├── avatar.jpg
│ └── header.png
└─┬ carol/
└── profile.json
This tree contains four documents, whose paths are /alice/notes.txt,
/bob/pictures/avatar.jpg, /bob/pictures/header.png, and
/carol/profile.json. The item with path / is a directory and its value is
the names of its direct children: ['alice/', 'bob/', 'carol/']. In total this
tree consists of nine items:
| path | type | value |
|---|---|---|
/ | directory | ['alice/', 'bob/', 'carol/'] |
/alice/ | directory | ['notes.txt'] |
/alice/notes.txt | document | <blob> |
/bob/ | directory | ['pictures/'] |
/bob/pictures/ | directory | ['avatar.jpg', 'header.png'] |
/bob/pictures/avatar.jpg | document | <blob> |
/bob/pictures/header.png | document | <blob> |
/carol/ | directory | ['profile.json'] |
/carol/profile.json | document | <blob> |
Strictly speaking, directories are redundant and their existence and values are derived from the documents that exist. However, we want to store them explicitly so that listing the documents with a common prefix is a single read and does not require scanning the entire database.
We can think of directories as indexes of their children, and they should be kept consistent – the names stored in a directory item should exactly match the set of documents that have that directory as a prefix. However, it is possible that this consistency is broken, if a write task partially fails. In any case, in the absence of transactions, we cannot guarantee consistency across multiple reads. It is possible that listing a directory returns a name that does not exist when we try to read it, for example. Since directories are derived data, it should be possible to perform a consistency check via a full database scan as a periodic maintenance task.