Update docs
This commit is contained in:
parent
1ab51646b9
commit
fbf8a0fcaf
113
background.md
@ -1,113 +0,0 @@
|
||||
# WIP
|
||||
# Datacubes, Trees and Compressed trees
|
||||
|
||||
This first part is essentially an abridged version of the [datacube spec](https://github.com/ecmwf/datacube-spec); see that document for more detail and the canonical source of truth on the matter.
|
||||
|
||||
Qubed is primarily geared towards dealing with datafiles uniquely labeled by sets of key value pairs. We'll call a set of key value pairs that uniquely labels some data an `identifier`. Here's an example:
|
||||
|
||||
```python
{'class': 'd1',
 'dataset': 'climate-dt',
 'generation': '1',
 'date': '20241102',
 'resolution': 'high',
 'time': '0000',
}
```
|
||||
|
||||
Unfortunately, we have more than one data file. If we are lucky, the set of identifiers that currently exist might form a dense datacube that we could represent like this:
|
||||
|
||||
```python
{'class': ['d1', 'd2'],
 'dataset': 'climate-dt',
 'generation': ['1','2','3'],
 'model': 'icon',
 'date': ['20241102','20241103'],
 'resolution': ['high','low'],
 'time': ['0000', '0600', '1200', '1800'],
}
```
|
||||
|
||||
with the property that any particular choice of a value for each key will correspond to a datafile that exists.
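
To make the "dense" property concrete, here is a minimal sketch that expands such a datacube into its individual identifiers. The helper name `expand` is made up for this illustration and is not part of Qubed.

```python
from itertools import product

datacube = {
    'class': ['d1', 'd2'],
    'dataset': 'climate-dt',
    'generation': ['1', '2', '3'],
    'model': 'icon',
    'date': ['20241102', '20241103'],
    'resolution': ['high', 'low'],
    'time': ['0000', '0600', '1200', '1800'],
}

def expand(datacube):
    """Yield every identifier (one value per key) in a dense datacube."""
    keys = list(datacube)
    # Treat a bare string as a single-element list of values.
    value_lists = [v if isinstance(v, list) else [v] for v in datacube.values()]
    for combo in product(*value_lists):
        yield dict(zip(keys, combo))

identifiers = list(expand(datacube))
print(len(identifiers))  # 2 * 3 * 2 * 2 * 4 = 96
print(identifiers[0])
```

Every one of the 96 combinations is expected to label a datafile that exists; that is what makes the datacube dense.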
|
||||
|
||||
To save space I will also represent this same thing like this:
|
||||
```
- class=d1/d2, dataset=climate-dt, generation=1/2/3, model=icon, date=20241102/20241103, resolution=high/low, time=0000/0600/1200/1800
```
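
As a rough sketch, the one-line form can be produced from the dictionary form with a few lines of Python. The function name `format_datacube` is an assumption made for this example, not Qubed's API.

```python
def format_datacube(datacube):
    """Render a dense datacube as 'key=v1/v2, key=v1, ...'."""
    parts = []
    for key, values in datacube.items():
        # A bare string stands for a single value.
        if not isinstance(values, list):
            values = [values]
        parts.append(f"{key}={'/'.join(values)}")
    return ", ".join(parts)

line = format_datacube({
    'class': ['d1', 'd2'],
    'dataset': 'climate-dt',
    'generation': ['1', '2', '3'],
    'model': 'icon',
    'date': ['20241102', '20241103'],
    'resolution': ['high', 'low'],
    'time': ['0000', '0600', '1200', '1800'],
})
print(line)
```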
|
||||
|
||||
Unfortunately, we are not lucky and our datacubes are not always dense. In this case we might instead represent which data exists using a tree:
|
||||
```
root
├── class=od
│   ├── expver=0001
│   │   ├── param=1
│   │   └── param=2
│   └── expver=0002
│       ├── param=1
│       └── param=2
└── class=rd
    ├── expver=0001
    │   ├── param=1
    │   ├── param=2
    │   └── param=3
    └── expver=0002
        ├── param=1
        └── param=2
```
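
One simple way to hold such a tree in memory, shown here only as an illustrative sketch, is a nested dictionary whose keys are `key=value` labels. The `leaves` helper is hypothetical; it walks every root-to-leaf path and yields the identifier that path spells out.

```python
tree = {
    'class=od': {
        'expver=0001': {'param=1': {}, 'param=2': {}},
        'expver=0002': {'param=1': {}, 'param=2': {}},
    },
    'class=rd': {
        'expver=0001': {'param=1': {}, 'param=2': {}, 'param=3': {}},
        'expver=0002': {'param=1': {}, 'param=2': {}},
    },
}

def leaves(tree, prefix=()):
    """Yield one identifier (as a dict) per root-to-leaf path."""
    for label, children in tree.items():
        key, value = label.split('=', 1)
        path = prefix + ((key, value),)
        if children:
            yield from leaves(children, path)
        else:
            yield dict(path)

identifiers = list(leaves(tree))
print(len(identifiers))  # 2 + 2 + 3 + 2 = 9
```

The sparse part is visible here: `param=3` exists only under `class=rd, expver=0001`, so the nine identifiers cannot be written as a single dense datacube.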
|
||||
|
||||
But it's clear that the above tree contains a lot of redundant information. Many of the subtrees are identical for example. Indeed in practice a lot of our data turns out to be 'nearly dense' in that it contains many dense datacubes within it.
|
||||
|
||||
There are many valid ways one could compress this tree. If we add the restriction that no identical key=value pairs can be adjacent then here is the compressed tree we might get:
|
||||
|
||||
```
root
├── class=rd
│   ├── expver=0001, param=1/2/3
│   └── expver=0002, param=1/2
└── class=od, expver=0001/0002, param=1/2
```
|
||||
|
||||
Without the above restriction we could instead have:
|
||||
|
||||
```
root
├── class=rd
│   ├── expver=0001, param=3
│   └── expver=0001/0002, param=1/2
└── class=od, expver=0001/0002, param=1/2
```
|
||||
|
||||
but we do not allow this because it would mean we would have to take multiple branches in order to find data with `expver=0001`.
|
||||
|
||||
What we have now is a tree of dense datacubes which represents a single larger sparse datacube in a more compact manner. For want of a better word we'll call it a Qube.
|
||||
|
||||
## API
|
||||
|
||||
Qubed will provide a core compressed tree data structure called a Qube with:
|
||||
|
||||
Methods to convert to and from:
|
||||
- [x] A human readable representation like those seen above.
|
||||
- [x] An HTML version where subtrees can be collapsed [see here](https://confluence.ecmwf.int/display/~math/Qubed+Test+Page).
|
||||
- [ ] A compact protobuf-based binary format
|
||||
- [x] Nested python dictionaries or JSON
|
||||
- [ ] The output of [fdb list](https://confluence.ecmwf.int/display/FDB/fdb-list)
|
||||
- [ ] [mars list][mars list]
|
||||
- [ ] [constraints.json][constraints]
|
||||
|
||||
[constraints]: (https://object-store.os-api.cci2.ecmwf.int/cci2-prod-catalogue/resources/reanalysis-era5-land/constraints_a0ae5b42d67869674e13fba9fd055640bcffc37c24578be1f465d7d5ab2c7ee5.json
|
||||
[mars list]: https://git.ecmwf.int/projects/CDS/repos/cads-forms-reanalysis/browse/reanalysis-era5-single-levels/gecko-config/mars.list?at=refs%2Fheads%2Fprod
|
||||
|
||||
Useful algorithms:
|
||||
- [x] Compression
|
||||
- [ ] Union/Intersection/Difference
|
||||
|
||||
Performant Membership Queries
|
||||
- [x] Identifier membership
|
||||
- [x] Datacube queries (selection)
|
||||
|
||||
Metadata Storage
|
||||
- [ ] The ability to store metadata on the tree that is unique per leaf node.
|
||||
|
||||
|
||||
|
||||
|
77
docs/algorithms.md
Normal file
@ -0,0 +1,77 @@
|
||||
# Qube Algorithms
|
||||
|
||||
## Set Operations
|
||||
|
||||
Qubes represent sets of objects, so the familiar set operations:
|
||||
* Union `A | B` or `Qube.union(A, B)`
|
||||
* Intersection `A & B` or `Qube.intersection(A, B)`
|
||||
* Difference `A - B` (or `B - A`) or `Qube.difference(A, B)`
|
||||
* Symmetric difference `A ^ B` or `Qube.symmetric_difference(A, B)`
|
||||
|
||||
are all defined.
|
||||
|
||||
We can implement these operations by breaking the problem down into a recursive function:
|
||||
|
||||
```python
def operation(A: Qube, B: Qube) -> Qube:
    ...
```
|
||||
|
||||
Consider the intersection of A and B:
|
||||
```
A
├─── a=1, b=1/2/3, c=1
└─── a=2, b=1/2/3, c=1

B
├─── a=1, b=3/4/5, c=2
└─── a=2, b=3/4/5, c=2
```
|
||||
|
||||
We pair the two trees and traverse them in tandem. At each level we group the nodes by node key and, for every pair of nodes in a group, compute the values only in A, the values only in B and the values in both (the intersection):

```python
for node_a in level_A:
    for node_b in level_B:
        just_A, intersection, just_B = Qube.fused_set_operations(
            node_a.values,
            node_b.values
        )
```
|
||||
|
||||
Based on the particular operation we're computing we keep or discard these three objects:
|
||||
* Union: keep just_A, intersection, just_B
|
||||
* Intersection: keep intersection
|
||||
* Difference: for A - B keep just_A, for B - A keep just_B
|
||||
* Symmetric difference: keep just_A and just_B but not intersection
|
||||
|
||||
The reason we have to keep just_A, intersection and just_B separate is that each will produce a node with different children:
* just_B: the children of node_b
* just_A: the children of node_a
* intersection: the result of calling `operation(A, B)` recursively on two new nodes formed from A and B but with just the intersecting values.
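
For plain Python sets, the three-way split and the keep/discard rules above can be sketched as follows. This is an illustration under simplifying assumptions, not Qubed's implementation: the `KEEP` table and `combine_values` are names invented here, and the sketch flattens the kept pieces into one set of values, ignoring the different children each piece would carry in a real tree.

```python
def fused_set_operations(a: set, b: set):
    """Return (just_a, intersection, just_b) for two plain Python sets."""
    return a - b, a & b, b - a

# Which of the three pieces (just_a, intersection, just_b) each operation keeps.
KEEP = {
    "union": (True, True, True),
    "intersection": (False, True, False),
    "difference": (True, False, False),  # A - B
    "symmetric_difference": (True, False, True),
}

def combine_values(op: str, a: set, b: set) -> set:
    """Apply a set operation to the values of two nodes (children ignored)."""
    pieces = fused_set_operations(a, b)
    kept = [piece for piece, keep in zip(pieces, KEEP[op]) if keep]
    return set().union(*kept)

# The `b` values from the example trees A and B above.
a_values, b_values = {1, 2, 3}, {3, 4, 5}
just_a, both, just_b = fused_set_operations(a_values, b_values)
print(just_a, both, just_b)
print(combine_values("symmetric_difference", a_values, b_values))
```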
|
||||
|
||||
This structure means that node.values can take different types, the two most useful being:
|
||||
* an enum, just a set of values
|
||||
* a range with start, stop and step
|
||||
|
||||
Qube.fused_set_operations can dispatch on the two types given in order to efficiently compute set/set, set/range and range/range intersection operations.
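
A minimal sketch of that kind of dispatch, covering intersection only, might look like this. It assumes values are either a Python `set` or a built-in `range` with step 1 (a contiguous range); `intersect_values` is a hypothetical name, and ranges with other steps are deliberately left unhandled.

```python
def intersect_values(a, b):
    """Intersect two value containers, each a set or a contiguous range."""
    if isinstance(a, range) and isinstance(b, range):
        if a.step != 1 or b.step != 1:
            raise NotImplementedError("only contiguous (step 1) ranges are handled here")
        # range/range: constant time, no need to enumerate the values.
        return range(max(a.start, b.start), min(a.stop, b.stop))
    if isinstance(a, range):
        a, b = b, a  # put the set first so the next branch handles both orders
    if isinstance(b, range):
        # set/range: membership in a range is a constant-time check for ints.
        return {v for v in a if v in b}
    # set/set
    return a & b

print(intersect_values(range(0, 10), range(5, 15)))
print(intersect_values({1, 7, 20}, range(5, 15)))
```

The point of dispatching is that the range/range case never touches individual values, which matters when a range stands for a very large number of them.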
|
||||
|
||||
### Performance considerations
|
||||
|
||||
This algorithm is quadratic in the number of nodes with matching keys at each level. If a level has a huge number of nodes with key 'date' and range types (which can happen because range types are currently restricted to being contiguous), we could end up with a quadratic slowdown.
|
||||
|
||||
There are some ways this can be sped up:
|
||||
* Once we know any of just_A, intersection or just_B are empty we can discard them. Only for quite pathological inputs (many sparse enums with a lot of overlap) would you actually get quadratically many non-empty terms.

* For ranges intersected with ranges, we could speed the algorithm up significantly by sorting the ranges and walking the two lists in tandem, which reduces it to linear in the number of ranges.

* If we have N_A and N_B nodes to compare between the two trees we have N_A*N_B comparisons to do. However, at the end of the day we're just trying to determine for each value whether it's in A, B or both. If N_A*N_B >> M, where M is the number of values, we might be able to switch to an alternative algorithm.
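
The sorted-ranges idea from the second bullet can be sketched with a standard two-pointer walk. This is an illustration, not code from Qubed; it assumes each side is a list of half-open `(start, stop)` pairs that is already sorted by start and has no overlaps within a list.

```python
def intersect_sorted_ranges(ranges_a, ranges_b):
    """Intersect two sorted lists of non-overlapping half-open (start, stop) ranges."""
    result = []
    i = j = 0
    while i < len(ranges_a) and j < len(ranges_b):
        a_start, a_stop = ranges_a[i]
        b_start, b_stop = ranges_b[j]
        start, stop = max(a_start, b_start), min(a_stop, b_stop)
        if start < stop:
            result.append((start, stop))
        # Advance whichever range ends first: it cannot overlap anything later.
        if a_stop <= b_stop:
            i += 1
        else:
            j += 1
    return result

a = [(0, 5), (10, 15), (20, 25)]
b = [(3, 12), (14, 22)]
print(intersect_sorted_ranges(a, b))
```

Each loop iteration advances one of the two indices, so the walk takes at most `len(ranges_a) + len(ranges_b)` steps instead of comparing every pair.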
|
||||
|
||||
|
||||
## Compression
|
||||
|
||||
In order to keep the tree compressed as operations are performed on it we define the "structural hash" of a node to be the hash of:
|
||||
* The node's key
|
||||
* Not the node's values.
|
||||
* The keys, values and children of the node's children, recursively.
|
||||
|
||||
This structural hash lets us identify when two sibling nodes can potentially be merged into one node, thus keeping the tree compressed.
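
Here is a small sketch of the idea, using a toy `Node` class invented for this example rather than Qubed's real data structure. `structural_hash` ignores the node's own values, and `merge_siblings` merges siblings whose structural hashes agree by taking the union of their values. A real implementation would probably also want to confirm equality of the subtrees rather than trust the hash alone.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    key: str
    values: frozenset
    children: list = field(default_factory=list)

def full_hash(node: Node) -> int:
    """Hash of a node including its own values and its whole subtree."""
    return hash((node.key, node.values, frozenset(full_hash(c) for c in node.children)))

def structural_hash(node: Node) -> int:
    """Hash of the node's key and subtree, ignoring the node's own values."""
    return hash((node.key, frozenset(full_hash(c) for c in node.children)))

def merge_siblings(siblings: list) -> list:
    """Merge siblings that share a structural hash by taking the union of their values."""
    groups = {}
    for node in siblings:
        groups.setdefault(structural_hash(node), []).append(node)
    merged = []
    for group in groups.values():
        values = frozenset().union(*(n.values for n in group))
        merged.append(Node(group[0].key, values, group[0].children))
    return merged

siblings = [
    Node("expver", frozenset({"0001"}), [Node("param", frozenset({"1", "2"}))]),
    Node("expver", frozenset({"0002"}), [Node("param", frozenset({"1", "2"}))]),
    Node("expver", frozenset({"0003"}), [Node("param", frozenset({"1", "2", "3"}))]),
]
merged = merge_siblings(siblings)
print([sorted(n.values) for n in merged])
```

The first two `expver` nodes have identical subtrees, so they collapse into a single `expver=0001/0002` node, while `expver=0003` stays separate.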
|
1
docs/autobuild.sh
Executable file
@ -0,0 +1 @@
|
||||
sphinx-autobuild . _build
|
@ -1,113 +0,0 @@
|
||||
# WIP
|
||||
# Datacubes, Trees and Compressed trees
|
||||
|
||||
This first part is essentially an abridged version of the [datacube spec](https://github.com/ecmwf/datacube-spec); see that document for more detail and the canonical source of truth on the matter.
|
||||
|
||||
Qubed is primarily geared towards dealing with datafiles uniquely labeled by sets of key value pairs. We'll call a set of key value pairs that uniquely labels some data an `identifier`. Here's an example:
|
||||
|
||||
```python
{'class': 'd1',
 'dataset': 'climate-dt',
 'generation': '1',
 'date': '20241102',
 'resolution': 'high',
 'time': '0000',
}
```
|
||||
|
||||
Unfortunately, we have more than one data file. If we are lucky, the set of identifiers that currently exist might form a dense datacube that we could represent like this:
|
||||
|
||||
```python
{'class': ['d1', 'd2'],
 'dataset': 'climate-dt',
 'generation': ['1','2','3'],
 'model': 'icon',
 'date': ['20241102','20241103'],
 'resolution': ['high','low'],
 'time': ['0000', '0600', '1200', '1800'],
}
```
|
||||
|
||||
with the property that any particular choice of a value for each key will correspond to a datafile that exists.
|
||||
|
||||
To save space I will also represent this same thing like this:
|
||||
```
- class=d1/d2, dataset=climate-dt, generation=1/2/3, model=icon, date=20241102/20241103, resolution=high/low, time=0000/0600/1200/1800
```
|
||||
|
||||
Unfortunately, we are not lucky and our datacubes are not always dense. In this case we might instead represent which data exists using a tree:
|
||||
```
root
├── class=od
│   ├── expver=0001
│   │   ├── param=1
│   │   └── param=2
│   └── expver=0002
│       ├── param=1
│       └── param=2
└── class=rd
    ├── expver=0001
    │   ├── param=1
    │   ├── param=2
    │   └── param=3
    └── expver=0002
        ├── param=1
        └── param=2
```
|
||||
|
||||
But it's clear that the above tree contains a lot of redundant information. Many of the subtrees are identical for example. Indeed in practice a lot of our data turns out to be 'nearly dense' in that it contains many dense datacubes within it.
|
||||
|
||||
There are many valid ways one could compress this tree. If we add the restriction that no identical key=value pairs can be adjacent then here is the compressed tree we might get:
|
||||
|
||||
```
root
├── class=rd
│   ├── expver=0001, param=1/2/3
│   └── expver=0002, param=1/2
└── class=od, expver=0001/0002, param=1/2
```
|
||||
|
||||
Without the above restriction we could instead have:
|
||||
|
||||
```
root
├── class=rd
│   ├── expver=0001, param=3
│   └── expver=0001/0002, param=1/2
└── class=od, expver=0001/0002, param=1/2
```
|
||||
|
||||
but we do not allow this because it would mean we would have to take multiple branches in order to find data with `expver=0001`.
|
||||
|
||||
What we have now is a tree of dense datacubes which represents a single larger sparse datacube in a more compact manner. For want of a better word we'll call it a Qube.
|
||||
|
||||
## API
|
||||
|
||||
Qubed will provide a core compressed tree data structure called a Qube with:
|
||||
|
||||
Methods to convert to and from:
|
||||
- [x] A human readable representation like those seen above.
|
||||
- [x] An HTML version where subtrees can be collapsed.
|
||||
- [ ] A compact protobuf-based binary format
|
||||
- [x] Nested python dictionaries or JSON
|
||||
- [/] The output of [fdb list](https://confluence.ecmwf.int/display/FDB/fdb-list)
|
||||
- [ ] [mars list][mars list]
|
||||
- [ ] [constraints.json][constraints]
|
||||
|
||||
[constraints]: (https://object-store.os-api.cci2.ecmwf.int/cci2-prod-catalogue/resources/reanalysis-era5-land/constraints_a0ae5b42d67869674e13fba9fd055640bcffc37c24578be1f465d7d5ab2c7ee5.json
|
||||
[mars list]: https://git.ecmwf.int/projects/CDS/repos/cads-forms-reanalysis/browse/reanalysis-era5-single-levels/gecko-config/mars.list?at=refs%2Fheads%2Fprod
|
||||
|
||||
Useful algorithms:
|
||||
- [x] Compression
|
||||
- [/] Union/Intersection/Difference
|
||||
|
||||
Performant Membership Queries
|
||||
- Identifier membership
|
||||
- Datacube query (selection)
|
||||
|
||||
Metadata Storage
|
||||
|
||||
|
||||
|
||||
|
||||
|
@ -5,12 +5,13 @@ jupytext:
|
||||
format_name: myst
|
||||
format_version: 0.13
|
||||
jupytext_version: 1.16.4
|
||||
kernelspec:
|
||||
display_name: Python 3
|
||||
language: python
|
||||
name: python3
|
||||
---
|
||||
## Qubed
|
||||
|
||||
# Qubed
|
||||
|
||||
```{toctree}
|
||||
algorithms.md
|
||||
```
|
||||
|
||||
# Datacubes, Trees and Compressed trees
|
||||
|
||||
@ -92,7 +93,7 @@ but we do not allow this because it would mean we would have to take multiple br
|
||||
|
||||
What we have now is a tree of dense datacubes which represents a single larger sparse datacube in a more compact manner. For want of a better word we'll call it a Qube.
|
||||
|
||||
### HTML Output
|
||||
## HTML Output
|
||||
|
||||
```{code-cell} python3
|
||||
q.compress().html()
|
||||
@ -111,7 +112,7 @@ Methods to convert to and from:
|
||||
- [ ] [mars list][mars list]
|
||||
- [ ] [constraints.json][constraints]
|
||||
|
||||
[constraints]: (https://object-store.os-api.cci2.ecmwf.int/cci2-prod-catalogue/resources/reanalysis-era5-land/constraints_a0ae5b42d67869674e13fba9fd055640bcffc37c24578be1f465d7d5ab2c7ee5.json
|
||||
[constraints]: https://object-store.os-api.cci2.ecmwf.int/cci2-prod-catalogue/resources/reanalysis-era5-land/constraints_a0ae5b42d67869674e13fba9fd055640bcffc37c24578be1f465d7d5ab2c7ee5.json
|
||||
[mars list]: https://git.ecmwf.int/projects/CDS/repos/cads-forms-reanalysis/browse/reanalysis-era5-single-levels/gecko-config/mars.list?at=refs%2Fheads%2Fprod
|
||||
|
||||
Useful algorithms:
|
||||
|