---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.16.4
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---
# Qubed

## Datacubes, Trees and Compressed trees

This first part is essentially an abridged version of the [datacube spec](https://github.com/ecmwf/datacube-spec); see that document for more detail and for the canonical source of truth on the matter.

Qubed is primarily geared towards dealing with datafiles uniquely labeled by sets of key-value pairs. We'll call a set of key-value pairs that uniquely labels some data an `identifier`. Here's an example:

```python
{
 'class': 'd1',
 'dataset': 'climate-dt',
 'generation': '1',
 'date': '20241102',
 'resolution': 'high',
 'time': '0000',
}
```

Unfortunately, we have more than one data file. If we are lucky, the set of identifiers that currently exist might form a dense datacube that we could represent like this:

```python
{
 'class': ['d1', 'd2'],
 'dataset': 'climate-dt',
 'generation': ['1','2','3'],
 'model': 'icon',
 'date': ['20241102','20241103'],
 'resolution': ['high','low'],
 'time': ['0000', '0600', '1200', '1800'],
}
```

with the property that any choice of one value for each key corresponds to a datafile that exists. So this object represents `2x1x3x1x2x2x4 = 96` different datafiles.
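As a quick check of that count, here is a minimal sketch in plain Python that expands a dense datacube into its individual identifiers. It does not use the qubed API; the helper names `as_list`, `count_identifiers` and `expand` are made up for this illustration.

```python
from itertools import product
from math import prod

datacube = {
    'class': ['d1', 'd2'],
    'dataset': 'climate-dt',
    'generation': ['1', '2', '3'],
    'model': 'icon',
    'date': ['20241102', '20241103'],
    'resolution': ['high', 'low'],
    'time': ['0000', '0600', '1200', '1800'],
}

def as_list(values):
    """Treat a bare string as a one-element list of values."""
    return values if isinstance(values, list) else [values]

def count_identifiers(cube):
    """Number of identifiers: the product of the number of values per key."""
    return prod(len(as_list(values)) for values in cube.values())

def expand(cube):
    """Yield every identifier (one value per key) in the dense datacube."""
    keys = list(cube)
    for combo in product(*(as_list(cube[key]) for key in keys)):
        yield dict(zip(keys, combo))

identifiers = list(expand(datacube))
print(count_identifiers(datacube))  # 96
print(identifiers[0])
```

The count grows multiplicatively with every key, which is why storing the per-key value lists is so much cheaper than storing each identifier separately.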

To save space, we will also write the same thing like this:

```
- class=d1/d2, dataset=climate-dt, generation=1/2/3, ..., time=0000/0600/1200/1800
```
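This shorthand is just each key followed by its values joined with `/`. A minimal sketch of producing it, assuming a hypothetical helper called `compact` that is not part of qubed:

```python
def compact(cube):
    """Render a dense datacube as 'key=v1/v2, key=v3, ...'."""
    parts = []
    for key, values in cube.items():
        if not isinstance(values, list):
            values = [values]  # a bare string is a single value
        parts.append(f"{key}=" + "/".join(values))
    return ", ".join(parts)

example = {
    'class': ['d1', 'd2'],
    'dataset': 'climate-dt',
    'generation': ['1', '2', '3'],
}
print(compact(example))  # class=d1/d2, dataset=climate-dt, generation=1/2/3
```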

Unfortunately, we are not lucky and our datacubes are not always dense. In this case we might instead represent which data exists using a tree:

```{code-cell} python3
from qubed import Qube

q = Qube.from_dict({
    "class=od" : {
        "expver=0001": {"param=1":{}, "param=2":{}},
        "expver=0002": {"param=1":{}, "param=2":{}},
    },
    "class=rd" : {
        "expver=0001": {"param=1":{}, "param=2":{}, "param=3":{}},
        "expver=0002": {"param=1":{}, "param=2":{}},
    },
})

q.print()
```
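To make the meaning of the tree concrete: every path from the root to a leaf is one identifier. The sketch below walks the same nested dictionary in plain Python, without using qubed. The function name `leaves` is an assumption made for this example.

```python
tree = {
    "class=od": {
        "expver=0001": {"param=1": {}, "param=2": {}},
        "expver=0002": {"param=1": {}, "param=2": {}},
    },
    "class=rd": {
        "expver=0001": {"param=1": {}, "param=2": {}, "param=3": {}},
        "expver=0002": {"param=1": {}, "param=2": {}},
    },
}

def leaves(node, prefix=()):
    """Yield one identifier per root-to-leaf path of a nested 'key=value' dict."""
    if not node:  # an empty dict marks a leaf
        yield dict(prefix)
        return
    for label, child in node.items():
        key, value = label.split("=", 1)
        yield from leaves(child, prefix + ((key, value),))

identifiers = list(leaves(tree))
print(len(identifiers))  # 9
print(identifiers[0])    # {'class': 'od', 'expver': '0001', 'param': '1'}
```

A dense datacube over the same values would hold `2x2x3 = 12` identifiers, so this tree is sparse: `param=3` only exists for `class=rd, expver=0001`.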

But it's clear that the above tree contains a lot of redundant information. Many of the subtrees are identical, for example. Indeed, in practice a lot of our data turns out to be 'nearly dense', in that it contains many dense datacubes within it.

There are many valid ways one could compress this tree. If we add the restriction that no identical `key=value` pairs can be adjacent, meaning that sibling nodes never share a value for the same key, then here is the compressed tree we might get:

```{code-cell} python3
q.compress().print()
```
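One way to arrive at such a tree is to work bottom-up and merge sibling nodes that have the same key and identical subtrees. The sketch below does this on plain nested dictionaries. It is an illustration only, not qubed's implementation; `freeze` and `compress` are hypothetical helpers, and it assumes values never contain `/`.

```python
def freeze(node):
    """Turn a nested dict into a hashable, order-independent value."""
    return tuple(sorted((label, freeze(child)) for label, child in node.items()))

def compress(node):
    """Merge sibling nodes that share a key and have identical subtrees."""
    groups = {}  # (key, frozen subtree) -> (merged values, compressed subtree)
    for label, child in node.items():
        key, values = label.split("=", 1)
        subtree = compress(child)  # compress the children first
        merged, _ = groups.setdefault((key, freeze(subtree)), ([], subtree))
        merged.extend(values.split("/"))
    return {
        f"{key}={'/'.join(sorted(merged))}": subtree
        for (key, _), (merged, subtree) in groups.items()
    }

tree = {
    "class=od": {
        "expver=0001": {"param=1": {}, "param=2": {}},
        "expver=0002": {"param=1": {}, "param=2": {}},
    },
    "class=rd": {
        "expver=0001": {"param=1": {}, "param=2": {}, "param=3": {}},
        "expver=0002": {"param=1": {}, "param=2": {}},
    },
}

compressed = compress(tree)
print(compressed)
```

Because each sibling value ends up in exactly one merged node, this sketch never produces two siblings that share a `key=value` pair, so it respects the restriction stated above.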

Without the above restriction we could, for example, have:

```
root
├── class=od, expver=0001/0002, param=1/2
└── class=rd
    ├── expver=0001, param=3
    └── expver=0001/0002, param=1/2
```

but we do not allow this because finding data with `expver=0001` would then require following more than one branch.
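The benefit of the restriction shows up in lookups: at each level at most one sibling can match, so a membership check follows a single path and never backtracks. Here is a minimal sketch on a nested-dictionary version of a compressed tree. The function `contains` and the dictionary layout are assumptions for illustration, not the qubed API.

```python
compressed = {
    "class=od": {"expver=0001/0002": {"param=1/2": {}}},
    "class=rd": {
        "expver=0001": {"param=1/2/3": {}},
        "expver=0002": {"param=1/2": {}},
    },
}

def contains(node, identifier):
    """Check whether an identifier is in the tree by following one branch per level."""
    while node:  # an empty dict means we reached a leaf
        for label, child in node.items():
            key, values = label.split("=", 1)
            if identifier.get(key) in values.split("/"):
                node = child  # siblings are disjoint, so this is the only match
                break
        else:
            return False  # no sibling matched at this level
    return True

print(contains(compressed, {"class": "rd", "expver": "0001", "param": "3"}))  # True
print(contains(compressed, {"class": "od", "expver": "0001", "param": "3"}))  # False
```

Note that this sketch ignores any extra keys in the identifier that do not appear in the tree.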

What we have now is a tree of dense datacubes which represents a single larger sparse datacube in a more compact manner. For want of a better word we'll call it a Qube.

## API

Qubed will provide a core compressed tree data structure called a Qube, with:

Methods to convert to and from:

- [x] A human-readable representation like those seen above.
- [x] An HTML version where subtrees can be collapsed.
- [ ] A compact protobuf-based binary format
- [x] Nested Python dictionaries or JSON
- [/] The output of [fdb list](https://confluence.ecmwf.int/display/FDB/fdb-list)
- [ ] [mars list][mars list]
- [ ] [constraints.json][constraints]

[constraints]: https://object-store.os-api.cci2.ecmwf.int/cci2-prod-catalogue/resources/reanalysis-era5-land/constraints_a0ae5b42d67869674e13fba9fd055640bcffc37c24578be1f465d7d5ab2c7ee5.json
[mars list]: https://git.ecmwf.int/projects/CDS/repos/cads-forms-reanalysis/browse/reanalysis-era5-single-levels/gecko-config/mars.list?at=refs%2Fheads%2Fprod

Useful algorithms:

- [x] Compression
- [/] Union/Intersection/Difference

Performant Membership Queries:

- Identifier membership
- Datacube query (selection)
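To illustrate what a datacube query (selection) could look like, here is a hedged sketch in plain Python over a nested-dictionary compressed tree. It is not the qubed API: the function `select` and the query format (a dict mapping each constrained key to a set of allowed values) are assumptions made for this example.

```python
compressed = {
    "class=od": {"expver=0001/0002": {"param=1/2": {}}},
    "class=rd": {
        "expver=0001": {"param=1/2/3": {}},
        "expver=0002": {"param=1/2": {}},
    },
}

def select(node, query):
    """Keep only the parts of the tree compatible with the query."""
    out = {}
    for label, child in node.items():
        key, values = label.split("=", 1)
        values = values.split("/")
        if key in query:
            values = [v for v in values if v in query[key]]
            if not values:
                continue  # none of this node's values were requested
        subtree = select(child, query)
        if child and not subtree:
            continue  # everything below this node was pruned
        out[f"{key}={'/'.join(values)}"] = subtree
    return out

print(select(compressed, {"param": {"3"}}))
# {'class=rd': {'expver=0001': {'param=3': {}}}}
```

Keys that are absent from the query are left unconstrained, so the result is itself a smaller compressed tree.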

Metadata Storage