Rejig quickstart a bit

2025-04-16 15:13:06 +01:00 · 2025-04-16 15:13:06 +01:00 · a502cb6ab2
commit a502cb6ab2
parent ca944521f9
5 changed files with 182 additions and 96 deletions
--- a/docs/index.md
+++ b/docs/index.md
@ -12,44 +12,42 @@ jupytext:
 ```{toctree}
 :maxdepth: 1
 quickstart.md
+installation.md
 background.md
 algorithms.md
 fiab.md
 cmd.md
 ```

-Qubed provides a datastructure called a Qube which represents sets of data identified by multiple key value pairs as a tree of datacubes. To understand what that means go to [Background](background.md), to just start using the library skip straight to the [Quickstart](quickstart.md).
-
-Here's a real world dataset from the [Climate DT](https://destine.ecmwf.int/climate-change-adaptation-digital-twin-climate-dt/):
-
-```{code-cell} python3
-import requests
-from qubed import Qube
-climate_dt = Qube.from_json(requests.get("https://github.com/ecmwf/qubed/raw/refs/heads/main/tests/example_qubes/climate_dt.json").json())
-climate_dt.html(depth=1)
+Qubed provides a datastructure primitive for working with trees of DataCubes. If a normal tree looks like this:
+```
+root
+├── class=od
+│   ├── expver=0001
+│   │   ├── param=1
+│   │   └── param=2
+│   └── expver=0002
+│       ├── param=1
+│       └── param=2
+└── class=rd
+    ├── expver=0001
+    │   ├── param=1
+    │   ├── param=2
+    │   └── param=3
+    └── expver=0002
+        ├── param=1
+        └── param=2
 ```

-Click the arrows to expand and drill down deeper into the data. Any particular dataset is uniquely identified by a set of key value pairs:
-
-```{code-cell} python3
-import json
-for i, identifier in enumerate(climate_dt.leaves()):
-    print(identifier)
-    break
+A compressed view of the same set would be:
+```
+root
+├── class=od, expver=0001/0002, param=1/2
+└── class=rd
+    ├── expver=0001, param=1/2/3
+    └── expver=0002, param=1/2
 ```

-Here's an idea of the set of values each key can take:
-```{code-cell} python3
-axes = climate_dt.axes()
-for key, values in axes.items():
-    print(f"{key} : {list(sorted(values))[:10]}")
-```
+Qubed provides a datastructure that represents this compressed cube we call a Qube. It defines all the algorithms you would expect such as intersection/union/difference, compression, search, transformation and filtering.

-This dataset isn't dense, you can't choose any combination of the above key values pairs, but it does contain many dense datacubes. Hence it makes sense to store and process the set as a tree of dense datacubes, what we call a Qube. For a sense of scale, this dataset contains about 200 million distinct datasets but only contains a few thousand unique nodes.
-
-```{code-cell} python3
-print(f"""
-Distinct datasets: {climate_dt.n_leaves},
-Number of nodes in the tree: {climate_dt.n_nodes}
-""")
-```
+To get a little more background on the motivation and structure of a Qube go to [Background](background.md), for a more hands on intro, go to [Quickstart](quickstart.md).
--- a/docs/installation.md
+++ b/docs/installation.md
@ -0,0 +1,23 @@
+# Installation
+
+To install the latest stable release from PyPI (recommended):
+
+```bash
+pip install qubed
+```
+
+Or to build and install the latest version from github (requires rust):
+
+```bash
+pip install qubed@git+https://github.com/ecmwf/qubed.git@main
+```
+
+## Development
+
+To build the develop branch from source install a rust toolchain and pip install maturin then run:
+
+```
+git clone -b develop git@github.com:ecmwf/qubed.git
+cd qubed
+maturin develop
+```
--- a/docs/quickstart.md
+++ b/docs/quickstart.md
@ -8,32 +8,50 @@ jupytext:
 ---
 # Quickstart

-## Installation
-To install the latest stable release from PyPI (recommended):
-```bash
-pip install qubed
-```
-Or to build and install the latest version from github (requires cargo):
-```bash
-pip install qubed@git+https://github.com/ecmwf/qubed.git@main
+First install qubed with `pip install qubed`. Now, let's dive in with a real world dataset from the [Climate DT](https://destine.ecmwf.int/climate-change-adaptation-digital-twin-climate-dt/). We'll pull a prebuild qube from github and render it in it's default HTML representation.
+
+```{code-cell} python3
+import requests
+from qubed import Qube
+climate_dt = Qube.from_json(requests.get("https://github.com/ecmwf/qubed/raw/refs/heads/main/tests/example_qubes/climate_dt.json").json())
+climate_dt.html(depth=1)
 ```

-## Development
+Click the arrows to expand and drill down deeper into the data.

-To build the develop branch from source install a rust toolchain and pip install maturin then run:
-```
-git clone -b develop git@github.com:ecmwf/qubed.git
-cd qubed
-maturin develop
+Fundamentally a Qube represents a set identifiers which are a set of key value pairs, here's the one leaf in the Climate DT dataset:
+
+```{code-cell} python3
+next(climate_dt.leaves())
 ```

-## Usage
-Make an uncompressed qube:
+We can look at the set of values each key can take:
+```{code-cell} python3
+axes = climate_dt.axes()
+for key, values in axes.items():
+    print(f"{key} : {list(sorted(values))[:10]}")
+```
+
+This dataset isn't dense, you can't choose any combination of the above key values pairs, but it does contain many dense datacubes. Hence it makes sense to store and process the set as a tree of dense datacubes, which is what a Qube. For a sense of scale, this dataset contains about 8 million distinct datasets but only contains a few hundred unique nodes.
+
+```{code-cell} python3
+import objsize
+print(f"""
+Distinct datasets: {climate_dt.n_leaves}
+Number of nodes in the tree: {climate_dt.n_nodes}
+Number of dense datacubes within this qube: {len(list(climate_dt.datacubes()))}
+In memory size according to objsize: {objsize.get_deep_size(climate_dt) / 2**20:.0f} MB
+""")
+```
+
+## Building your own Qubes
+
+You can do it from nested dictionaries with keys in the form "{key=value}":

 ```{code-cell} python3
 from qubed import Qube

-q = Qube.from_dict({
+q1 = Qube.from_dict({
    "class=od" : {
        "expver=0001": {"param=1":{}, "param=2":{}},
        "expver=0002": {"param=1":{}, "param=2":{}},
@ -43,20 +61,63 @@ q = Qube.from_dict({
        "expver=0002": {"param=1":{}, "param=2":{}},
    },
 })
-print(f"{q.n_leaves = }, {q.n_nodes = }")
-q
+print(f"{q1.n_leaves = }, {q1.n_nodes = }")
+q1
 ```

-Compress it:
+If someone sends you a printed qube you can convert that back to a Qube too:

 ```{code-cell} python3
-cq = q.compress()
-assert cq.n_leaves == q.n_leaves
+q2 = Qube.from_tree("""
+    root, frequency=6:00:00
+    ├── levtype=pl, param=t, levelist=850, threshold=-2/-4/-8/2/4/8
+    └── levtype=sfc
+        ├── param=10u/10v, threshold=10/15
+        ├── param=2t, threshold=273.15
+        └── param=tp, threshold=0.1/1/10/100/20/25/5/50
+""")
+q2
+```
+We would not recommend trying to write this representation by hand though.
+
+Finally, quite a flexible approach is to take the union of a series of dense datacubes:
+
+```{code-cell} python3
+q3 = Qube.from_datacube(
+    dict(
+        param="10u/10v/2d/2t/cp/msl/skt/sp/tcw/tp".split("/"),
+        threshold="*",
+        levtype="sfc",
+        frequency="6:00:00",
+    )
+) | Qube.from_datacube(
+    dict(
+        param="q/t/u/v/w/z".split("/"),
+        threshold="*",
+        levtype="pl",
+        level="50/100/150/200/250/300/400/500/600/700/850".split("/"),
+        frequency="6:00:00",
+    )
+)
+q3
+```
+
+## Operations on Qubes
+
+Going back to that first qube:
+```{code-cell} python3
+q1
+```
+
+We can compress it:
+
+```{code-cell} python3
+cq = q1.compress()
+assert cq.n_leaves == q1.n_leaves
 print(f"{cq.n_leaves = }, {cq.n_nodes = }")
 cq
 ```

-
 With the HTML representation you can click on the leaves to expand them. You can copy a path representation of a node to the clipboard by alt/option/⌥ clicking on it. You can then extract that node in code using `[]`:

 ```{code-cell} python3
@ -82,30 +143,7 @@ dq = Qube.from_datacube({
 ```


-
-### Tree Construction
-
-One of the quickest ways to construct non-trivial trees is to use the `Qube.from_datacube` method to construct dense trees and then use the set operations to combine or intersect them:
-
-
-```{code-cell} python3
-q = Qube.from_datacube({
-    "class": "d1",
-    "dataset": ["climate-dt", "another-value"],
-    'generation': ['1', "2", "3"],
-})
-
-r  = Qube.from_datacube({
-    "class": "d1",
-    "dataset": ["weather-dt", "climate-dt"],
-    'generation': ['1', "2", "3", "4"],
-})
-
-q | r
-```
-
-
-### Iteration / Flattening
+## Iteration

 Iterate over the leaves:

@ -117,26 +155,13 @@ for i, identifier in enumerate(cq.leaves()):
        break
 ```

-Iterate over the datacubes:
+Or if you can it's more efficient to iterate over the datacubes:

 ```{code-cell} python3
-cq.datacubes()
-```
-
-### A Real World Example
-
-Load a larger example qube:
-
-```{code-cell} python3
-import requests
-qube_json = requests.get("https://github.com/ecmwf/qubed/raw/refs/heads/main/tests/example_qubes/climate_dt.json").json()
-climate_dt = Qube.from_json(qube_json)
-
-# Using the html or print methods is optional but lets you specify things like the depth of the tree to display.
-print(f"{climate_dt.n_leaves = }, {climate_dt.n_nodes = }")
-climate_dt.html(depth=1) # Limit how much is open initially, click leave to see more.
+list(cq.datacubes())
 ```

+## Selection
 Select a subset of the tree:

 ```{code-cell} python3
@ -160,7 +185,7 @@ for key, values in axes.items():
 ```


-### Set Operations
+## Set Operations

 The union/intersection/difference of two dense datacubes is not itself dense.

@ -195,7 +220,7 @@ Symmetric Difference:
 (A ^ B).print();
 ```

-### Transformations
+## Transformations

 `q.transform` takes a python function from one node to one or more nodes and uses this to build a new tree. This can be used for simple operations on the key or values but also to split or remove nodes. Note that you can't use it to merge nodes beause it's only allowed to see one node at a time.

@ -203,3 +228,42 @@ Symmetric Difference:
 def capitalize(node): return node.replace(key = node.key.capitalize())
 climate_dt.transform(capitalize).html(depth=1)
 ```
+
+## Save to disk
+
+There is currently a very simple JSON serialisation format. More compact binary serialisations are planned.
+```{code-cell} python3
+json = climate_dt.to_json()
+Qube.from_json(json) == climate_dt
+```
+
+## Advanced Selection
+
+There is currently partial support for different datatypes in addition to strings. Here we can convert datatypes by key to ints and timedeltas and then use functions as filters in select.
+
+```{code-cell} python3
+from datetime import timedelta, datetime
+def to_timedelta(t):
+    dt = datetime.strptime(t, "%H:%M:%S")
+    return timedelta(hours=dt.hour, minutes=dt.minute, seconds=dt.second)
+
+q = Qube.from_tree("""
+root, frequency=6:00:00
+├── levtype=pl, levelist=850, threshold=-2/-4/-8/2/4/8
+└── levtype=sfc
+    ├── param=10u/10v, threshold=10/15
+    ├── param=2t, threshold=273.15
+    └── param=tp, threshold=0.1/1/10/100/20/25/5/50
+""").convert_dtypes({
+    "threshold": float,
+    "levelist": int,
+    "frequency": to_timedelta,
+})
+
+r = q.select({
+        "threshold": lambda t: t > 5,
+        "frequency": lambda dt: dt > timedelta(hours = 2),
+})
+
+r
+```
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@ -1,2 +1,3 @@
 numpy
 scipy
+objsize