fix set operations

parent 9d4fcbe624
commit ea07545dc0

ROADMAP.md (new file, 49 lines)
@@ -0,0 +1,49 @@
## Initial Python Implementation

- [x] Basic Qube datastructure
- [x] Compression
- [x] Set Operations (Union, Difference, Intersection...)
- [x] Query with request
- [x] Iteration over leaves
- [x] Iteration over datacubes
- [ ] Set up periodic updates to climate-dt/extremes-dt again
- [ ] Maybe also do production db?
- [ ] Do mars list to constraints conversion
- [ ] protobuf serialization

## Rust port

- [ ] Initial object
- [ ] Sort out ownership issues (one arena owned by the Python object)
- [ ] Compression
- [ ] Set Operations
- [ ] Query with request
- [ ] Iteration over leaves
- [ ] Iteration over datacubes
- [ ] Set up periodic updates to climate-dt/extremes-dt again

## API

Qubed will provide a core compressed tree data structure called a Qube with:

Methods to convert to and from:
- [x] A human readable representation like those seen above.
- [x] An HTML version where subtrees can be collapsed.
- [ ] A compact protobuf-based binary format
- [x] Nested python dictionaries or JSON
- [/] The output of [fdb list](https://confluence.ecmwf.int/display/FDB/fdb-list)
- [ ] [mars list][mars list]
- [ ] [constraints.json][constraints]

[constraints]: https://object-store.os-api.cci2.ecmwf.int/cci2-prod-catalogue/resources/reanalysis-era5-land/constraints_a0ae5b42d67869674e13fba9fd055640bcffc37c24578be1f465d7d5ab2c7ee5.json
[mars list]: https://git.ecmwf.int/projects/CDS/repos/cads-forms-reanalysis/browse/reanalysis-era5-single-levels/gecko-config/mars.list?at=refs%2Fheads%2Fprod

Useful algorithms:
- [x] Compression
- [/] Union/Intersection/Difference

Performant Membership Queries
- Identifier membership
- Datacube query (selection)

Metadata Storage

@@ -1 +1,5 @@
# cd to current directory of script
parent_path=$( cd "$(dirname "${BASH_SOURCE[0]}")" ; pwd -P )
cd "$parent_path"

sphinx-autobuild . _build

docs/background.md (new file, 87 lines)
@@ -0,0 +1,87 @@
---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.16.4
---

# Datacubes, Trees and Compressed trees

This section contains a bit more of an introduction to the datastructure; feel free to skip to the [Quickstart](quickstart.md). See the [datacube spec](https://github.com/ecmwf/datacube-spec) for even more detail and the canonical source of truth on the matter.

Qubed is primarily geared towards dealing with datafiles uniquely labeled by sets of key value pairs. We'll call a set of key value pairs that uniquely labels some data an `identifier`. Here's an example:

```python
{
    'class': 'd1',
    'dataset': 'climate-dt',
    'generation': '1',
    'date': '20241102',
    'resolution': 'high',
    'time': '0000',
}
```

Unfortunately, we have more than one data file. If we are lucky, the set of identifiers that currently exist might form a dense datacube that we could represent like this:

```python
{
    'class': ['d1', 'd2'],
    'dataset': 'climate-dt',
    'generation': ['1', '2', '3'],
    'model': 'icon',
    'date': ['20241102', '20241103'],
    'resolution': ['high', 'low'],
    'time': ['0000', '0600', '1200', '1800'],
}
```

with the property that any particular choice of value for each key corresponds to a datafile that exists. So this object represents `2x1x3x1x2x2x4 = 96` different datafiles.
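
As a quick check on that arithmetic, here is a minimal sketch (plain Python, not part of the qubed library) that counts the identifiers described by such a dense dictionary, treating a bare string as a single-element axis:

```python
from math import prod

dense = {
    'class': ['d1', 'd2'],
    'dataset': 'climate-dt',
    'generation': ['1', '2', '3'],
    'model': 'icon',
    'date': ['20241102', '20241103'],
    'resolution': ['high', 'low'],
    'time': ['0000', '0600', '1200', '1800'],
}

# Multiply the number of options along each axis: 2 * 1 * 3 * 1 * 2 * 2 * 4 = 96.
n_datafiles = prod(len(v) if isinstance(v, list) else 1 for v in dense.values())
assert n_datafiles == 96
```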

To save space I will also represent this same thing like this:
```
- class=d1/d2, dataset=climate-dt, generation=1/2/3, ..., time=0000/0600/1200/1800
```

Unfortunately, we are not lucky and our datacubes are not always dense. In this case we might instead represent which data exists using a tree:

```{code-cell} python3
from qubed import Qube

q = Qube.from_dict({
    "class=od" : {
        "expver=0001": {"param=1":{}, "param=2":{}},
        "expver=0002": {"param=1":{}, "param=2":{}},
    },
    "class=rd" : {
        "expver=0001": {"param=1":{}, "param=2":{}, "param=3":{}},
        "expver=0002": {"param=1":{}, "param=2":{}},
    },
})

# depth controls how much of the tree is open when rendered as html.
q.html(depth=100)
```

But it's clear that the above tree contains a lot of redundant information; many of the subtrees are identical, for example. Indeed, in practice a lot of our data turns out to be 'nearly dense' in that it contains many dense datacubes within it.

There are many valid ways one could compress this tree. If we add the restriction that no identical key=value pairs can be adjacent, then here is the compressed tree we might get:

```{code-cell} python3
q.compress()
```

```{warning}
Without the above restriction we could, for example, have:

root
├── class=od, expver=0001/0002, param=1/2
└── class=rd
    ├── expver=0001, param=3
    └── expver=0001/0002, param=1/2

but we do not allow this because it would mean we would have to take multiple branches in order to find data with `expver=0001`.
```

What we have now is a tree of dense datacubes which represents a single larger sparse datacube in a more compact manner. For want of a better word we'll call it a Qube.
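
As a small sanity check, compression should change only the representation, never the set of identifiers a Qube describes. A sketch (assuming, as in the quickstart, that `leaves()` yields one plain dict per identifier):

```{code-cell} python3
def identifiers(qube):
    return {tuple(sorted(leaf.items())) for leaf in qube.leaves()}

assert identifiers(q) == identifiers(q.compress())
```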

docs/cmd.md (new file, 21 lines)
@@ -0,0 +1,21 @@
### Command Line Usage

```bash
fdb list class=rd,expver=0001,... | qubed --from=fdblist --to=text
```

`--from` options include:
* `fdblist`
* `json`
* `protobuf`
* `marslist`
* `constraints`

`--to` options include:
* `text`
* `html`
* `json`
* `datacubes`
* `constraints`

Use `--input` and `--output` to specify input and output files respectively.
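
For example, the following hypothetical invocation (the file names are placeholders) reads a saved `fdb list` dump and writes the rendered HTML tree to a file instead of using the standard streams:

```bash
qubed --from=fdblist --to=html --input=fdb_dump.txt --output=tree.html
```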

docs/index.md (124 lines changed)
@@ -11,118 +11,54 @@ jupytext:

```{toctree}
:maxdepth: 1
background.md
quickstart.md
api.md
development.md
algorithms.md
```

# Datacubes, Trees and Compressed trees
Qubed provides a datastructure called a Qube which represents sets of data identified by multiple key value pairs as a tree of datacubes. To understand what that means, go to [Background](background.md); to just start using the library, skip straight to the [Quickstart](quickstart.md).

This first part is essentially an abridged version of the [datacube spec](https://github.com/ecmwf/datacube-spec); see that document for more detail and the canonical source of truth on the matter.

Qubed is primarily geared towards dealing with datafiles uniquely labeled by sets of key value pairs. We'll call a set of key value pairs that uniquely labels some data an `identifier`. Here's an example:

```python
{
    'class': 'd1',
    'dataset': 'climate-dt',
    'generation': '1',
    'date': '20241102',
    'resolution': 'high',
    'time': '0000',
}
```

Unfortunately, we have more than one data file. If we are lucky, the set of identifiers that currently exist might form a dense datacube that we could represent like this:

```python
{
    'class': ['d1', 'd2'],
    'dataset': 'climate-dt',
    'generation': ['1', '2', '3'],
    'model': 'icon',
    'date': ['20241102', '20241103'],
    'resolution': ['high', 'low'],
    'time': ['0000', '0600', '1200', '1800'],
}
```

with the property that any particular choice of value for each key corresponds to a datafile that exists. So this object represents `2x1x3x1x2x2x4 = 96` different datafiles.

To save space I will also represent this same thing like this:
```
- class=d1/d2, dataset=climate-dt, generation=1/2/3, ..., time=0000/0600/1200/1800
```

Unfortunately, we are not lucky and our datacubes are not always dense. In this case we might instead represent which data exists using a tree:
Here's a real world dataset from the [Climate DT](https://destine.ecmwf.int/climate-change-adaptation-digital-twin-climate-dt/):

```{code-cell} python3
import requests
from qubed import Qube

q = Qube.from_dict({
    "class=od" : {
        "expver=0001": {"param=1":{}, "param=2":{}},
        "expver=0002": {"param=1":{}, "param=2":{}},
    },
    "class=rd" : {
        "expver=0001": {"param=1":{}, "param=2":{}, "param=3":{}},
        "expver=0002": {"param=1":{}, "param=2":{}},
    },
})

# depth controls how much of the tree is open when rendered as html.
q.html(depth=100)
climate_dt = Qube.from_json(requests.get("https://github.com/ecmwf/qubed/raw/refs/heads/main/tests/example_qubes/climate_dt.json").json())
climate_dt.html(depth=1)
```

But it's clear that the above tree contains a lot of redundant information; many of the subtrees are identical, for example. Indeed, in practice a lot of our data turns out to be 'nearly dense' in that it contains many dense datacubes within it.

There are many valid ways one could compress this tree. If we add the restriction that no identical key=value pairs can be adjacent, then here is the compressed tree we might get:
Click the arrows to expand and drill down deeper into the data. Any particular dataset is uniquely identified by a set of key value pairs:

```{code-cell} python3
q.compress()
```

```{warning}
Without the above restriction we could, for example, have:

root
├── class=od, expver=0001/0002, param=1/2
└── class=rd
    ├── expver=0001, param=3
    └── expver=0001/0002, param=1/2

but we do not allow this because it would mean we would have to take multiple branches in order to find data with `expver=0001`.
import json
for i, identifier in enumerate(climate_dt.leaves()):
    print(identifier)
    break
```

What we have now is a tree of dense datacubes which represents a single larger sparse datacube in a more compact manner. For want of a better word we'll call it a Qube.
Here's an idea of the set of values each key can take:
```{code-cell} python3
axes = climate_dt.axes()
for key, values in axes.items():
    print(f"{key} : {list(sorted(values))[:10]}")
```

This dataset isn't dense: you can't choose just any combination of the above key value pairs, but it does contain many dense datacubes. Hence it makes sense to store and process the set as a tree of dense datacubes, which is what we call a Qube. For a sense of scale, this dataset contains about 200 million distinct datasets but only a few thousand unique nodes.

```{code-cell} python3
print(f"""
Distinct datasets: {climate_dt.n_leaves},
Number of nodes in the tree: {climate_dt.n_nodes}
""")
```


## API

Qubed will provide a core compressed tree data structure called a Qube with:

Methods to convert to and from:
- [x] A human readable representation like those seen above.
- [x] An HTML version where subtrees can be collapsed.
- [ ] A compact protobuf-based binary format
- [x] Nested python dictionaries or JSON
- [/] The output of [fdb list](https://confluence.ecmwf.int/display/FDB/fdb-list)
- [ ] [mars list][mars list]
- [ ] [constraints.json][constraints]

[constraints]: https://object-store.os-api.cci2.ecmwf.int/cci2-prod-catalogue/resources/reanalysis-era5-land/constraints_a0ae5b42d67869674e13fba9fd055640bcffc37c24578be1f465d7d5ab2c7ee5.json
[mars list]: https://git.ecmwf.int/projects/CDS/repos/cads-forms-reanalysis/browse/reanalysis-era5-single-levels/gecko-config/mars.list?at=refs%2Fheads%2Fprod

Useful algorithms:
- [x] Compression
- [/] Union/Intersection/Difference

Performant Membership Queries
- Identifier membership
- Datacube query (selection)

Metadata Storage

@@ -42,7 +42,53 @@ print(f"{cq.n_leaves = }, {cq.n_nodes = }")
cq
```

Load a larger example qube (requires source checkout):
### Quick Tree Construction

One of the quickest ways to construct non-trivial trees is to use the `Qube.from_datacube` method to construct dense trees and then use the set operations to combine or intersect them:


```{code-cell} python3
q = Qube.from_datacube({
    "class": "d1",
    "dataset": ["climate-dt", "another-value"],
    'generation': ['1', "2", "3"],
})

r = Qube.from_datacube({
    "class": "d1",
    "dataset": ["weather-dt", "climate-dt"],
    'generation': ['1', "2", "3", "4"],
})

q | r
```
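
The other set operators work the same way. As an illustrative sketch (derived from the two datacubes above, not an output reproduced from the library): since `q` and `r` share only `class=d1`, `dataset=climate-dt` and generations 1 to 3, their intersection is that single dense datacube:

```{code-cell} python3
q & r
```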


### Iteration / Flattening

Iterate over the leaves:

```{code-cell} python3
for i, identifier in enumerate(cq.leaves()):
    print(identifier)
    if i > 10:
        print("...")
        break
```

Iterate over the datacubes:

```{code-cell} python3
for i, datacube in enumerate(cq.datacubes()):
    print(datacube)
    if i > 10:
        print("...")
        break
```

### A Real World Example

Load a larger example qube:

```{code-cell} python3
import requests
@@ -77,43 +123,38 @@ for key, values in axes.items():
```

<!-- ### Set Operations
### Set Operations

The union/intersection/difference of two dense datacubes is not itself dense.

```{code-cell} python3
A = Qube.from_dict({
    "a=1/2/3" : {"b=1/2/3" : {"c=1/2/3" : {}}},
    "a=5" : { "b=4" : { "c=4" : {}}}
})
A = Qube.from_dict({"a=1/2/3" : {"b=i/j/k" : {}},})
B = Qube.from_dict({"a=2/3/4" : {"b=j/k/l" : {}},})

B = Qube.from_dict({
    "a=1/2/3" : {"b=1/2/3" : {"c=1/2/3" : {}}},
    "a=5" : { "b=4" : { "c=4" : {}}}
})

A.print(name="A"), B.print(name="B");

A | B
``` -->

<!-- ### Command Line Usage

```bash
fdb list class=rd,expver=0001,... | qubed --from=fdblist --to=text
A.print(), B.print();
```

`--from` options include:
* `fdblist`
* `json`
* `protobuf`
* `marslist`
* `constraints`
Union:

`--to` options include:
* `text`
* `html`
* `json`
* `datacubes`
* `constraints`
```{code-cell} python3
(A | B).print();
```

use `--input` and `--output` to specify input and output files respectively. -->
Intersection:

```{code-cell} python3
(A & B).print();
```

Difference:

```{code-cell} python3
(A - B).print();
```

Symmetric Difference:

```{code-cell} python3
(A ^ B).print();
```

notebooks/test.ipynb (57521 lines changed)
File diff suppressed because it is too large.

@@ -2,7 +2,7 @@ import dataclasses
from collections import defaultdict
from dataclasses import dataclass
from functools import cached_property
from typing import Any, Callable, Literal
from typing import Any, Callable, Iterable, Literal, Sequence

from frozendict import frozendict

@@ -43,6 +43,18 @@ class Qube:
        )),
    )

    @classmethod
    def from_datacube(cls, datacube: dict[str, str | Sequence[str]]) -> 'Qube':
        key_vals = list(datacube.items())[::-1]

        children: list["Qube"] = []
        for key, values in key_vals:
            if not isinstance(values, list):
                values = [values]
            children = [cls.make(key, QEnum(values), children)]

        return cls.make("root", QEnum(("root",)), children)


    @classmethod
    def from_json(cls, json: dict) -> 'Qube':
@@ -88,17 +100,33 @@ class Qube:
        return node_tree_to_html(self, depth = 2, collapse = True)

    def __or__(self, other: "Qube") -> "Qube":
        return set_operations.operation(self, other, set_operations.SetOperation.UNION)
        return set_operations.operation(self, other, set_operations.SetOperation.UNION, type(self))

    def __and__(self, other: "Qube") -> "Qube":
        return set_operations.operation(self, other, set_operations.SetOperation.INTERSECTION)
        return set_operations.operation(self, other, set_operations.SetOperation.INTERSECTION, type(self))

    def __sub__(self, other: "Qube") -> "Qube":
        return set_operations.operation(self, other, set_operations.SetOperation.DIFFERENCE)
        return set_operations.operation(self, other, set_operations.SetOperation.DIFFERENCE, type(self))

    def __xor__(self, other: "Qube") -> "Qube":
        return set_operations.operation(self, other, set_operations.SetOperation.SYMMETRIC_DIFFERENCE)
        return set_operations.operation(self, other, set_operations.SetOperation.SYMMETRIC_DIFFERENCE, type(self))

    def leaves(self) -> Iterable[dict[str, str]]:
        for value in self.values:
            if not self.children:
                yield {self.key : value}
            for child in self.children:
                for leaf in child.leaves():
                    if self.key != "root":
                        yield {self.key : value, **leaf}
                    else:
                        yield leaf

    def datacubes(self):
        def to_list_of_cubes(node: Qube) -> list[list[Qube]]:
            return [[node] + sub_cube for c in node.children for sub_cube in to_list_of_cubes(c)]

        return to_list_of_cubes(self)

    def __getitem__(self, args) -> 'Qube':
        key, value = args
@@ -110,6 +138,8 @@ class Qube:

    @cached_property
    def n_leaves(self) -> int:
        # This line makes the equation q.n_leaves + r.n_leaves == (q | r).n_leaves true if q and r have no overlap
        if self.key == "root" and not self.children: return 0
        return len(self.values) * (sum(c.n_leaves for c in self.children) if self.children else 1)
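
        # Worked example: a node with two values whose children contribute 3 leaves each
        # represents 2 * (3 + 3) = 12 identifiers.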

    @cached_property
@@ -174,7 +204,8 @@ class Qube:
        for c in self.children:
            for k, v in c.axes().items():
                axes[k].update(v)
        axes[self.key].update(self.values)
        if self.key != "root":
            axes[self.key].update(self.values)
        return dict(axes)

    @staticmethod
@@ -254,12 +285,6 @@ class Qube:
        insertion = [(k, v) for k, v in identifier.items()]
        return Qube._insert(self, insertion)

    def to_list_of_cubes(self):
        def to_list_of_cubes(node: Qube) -> list[list[Qube]]:
            return [[node] + sub_cube for c in node.children for sub_cube in to_list_of_cubes(c)]

        return to_list_of_cubes(self)

    def info(self):
        cubes = self.to_list_of_cubes()
        print(f"Number of distinct paths: {len(cubes)}")

@@ -1,42 +1,74 @@
import argparse

# A simple command line app that reads from standard input and writes to standard output
# Arguments:
# --input_format=fdb/mars
# --output_format=text/html
import sys

from rich.console import Console

from qubed import Qube
from qubed.convert import parse_fdb_list

console = Console(stderr=True)


def main():
    parser = argparse.ArgumentParser(description="Generate a compressed tree from various inputs.")

    parser.add_argument(
    subparsers = parser.add_subparsers(title="subcommands", required=True)
    parser_convert = subparsers.add_parser('convert', help='Convert trees from one format to another.')
    parser_another = subparsers.add_parser('another_subcommand', help='Does something else')

    parser_convert.add_argument(
        "--input",
        type=argparse.FileType("r"),
        default=sys.stdin,
        help="Specify the input file (default: standard input)."
    )
    parser_convert.add_argument(
        "--output",
        type=argparse.FileType("w"),
        default=sys.stdout,
        help="Specify the output file (default: standard output)."
    )

    parser_convert.add_argument(
        "--input_format",
        choices=["fdb", "mars"],
        default="fdb",
        help="Specify the input format (fdb list or mars)."
        help="""Specify the input format:
            fdb: the output of fdb list --porcelain
            mars: the output of mars list
        """
    )

    parser.add_argument(
    parser_convert.add_argument(
        "--output_format",
        choices=["text", "html"],
        default="text",
        help="Specify the output format (text or html)."
    )
    parser_convert.set_defaults(func=convert)

    args = parser.parse_args()
    args.func(args)

    # Read from standard input
    l = 0
    for line in sys.stdin.readlines():
        l += 1
def convert(args):
    q = Qube.empty()
    for datacube in parse_fdb_list(args.input):
        new_branch = Qube.from_datacube(datacube)
        q = (q | Qube.from_datacube(datacube))

    match args.output_format:
        case "text":
            output = str(q)
        case "html":
            output = q.html()

    # Process data (For now, just echoing the input)
    output_data = f"[Input Format: {args.input_format}] [Output Format: {args.output_format}]\n{l} lines read from standard input\n"
    with open(args.output, "w") as f:
        f.write(output)

    # Write to standard output
    sys.stdout.write(output_data)
    console.print([1, 2, 3])
    console.print("[blue underline]Looks like a link")
    console.print(locals())
    console.print("FOO", style="white on blue")

if __name__ == "__main__":
    main()

src/python/qubed/convert.py (new file, 23 lines)
@@ -0,0 +1,23 @@
def parse_key_value_pairs(text: str):
    result = {}
    text = text.replace("}{", ",")  # Replace segment separators
    text = text.replace("{", "").replace("}", "").strip()  # Remove leading/trailing braces

    for segment in text.split(","):
        if "=" not in segment: print(segment)
        key, values = segment.split("=", 1)  # Ensure split only happens at first "="
        values = values.split("/")
        result[key] = values

    return result

def parse_fdb_list(f):
    for line in f.readlines():
        # Handle fdb list normal
        if line.startswith("{"):
            yield parse_key_value_pairs(line)

        # handle fdb list --compact
        if line.startswith("retrieve,") and not line.startswith("retrieve,\n"):
            line = line[9:]
            yield parse_key_value_pairs(line)
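
# Illustrative example (the input line below is hypothetical, not taken from a real listing):
# a compact line such as
#   retrieve,class=od,expver=0001,param=1/2
# would be parsed into {"class": ["od"], "expver": ["0001"], "param": ["1", "2"]}.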

@@ -31,7 +31,18 @@ def fused_set_operations(A: "Values", B: "Values") -> tuple[list[Values], list[V

    raise NotImplementedError("Fused set operations on values types other than QEnum are not yet implemented")

def operation(A: "Qube", B : "Qube", operation_type: SetOperation) -> "Qube":
def node_intersection(A: "Values", B: "Values") -> tuple[Values, Values, Values]:
    if isinstance(A, QEnum) and isinstance(B, QEnum):
        set_A, set_B = set(A), set(B)
        intersection = set_A & set_B
        just_A = set_A - intersection
        just_B = set_B - intersection
        return QEnum(just_A), QEnum(intersection), QEnum(just_B)


    raise NotImplementedError("Fused set operations on values types other than QEnum are not yet implemented")
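
# Worked example (a sketch, not a call that appears in this module): for plain QEnum values,
#   node_intersection(QEnum([1, 2, 3]), QEnum([2, 3, 4]))
# returns QEnum({1}), QEnum({2, 3}), QEnum({4}), i.e. (only in A, in both, only in B).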

def operation(A: "Qube", B : "Qube", operation_type: SetOperation, node_type) -> "Qube":
    assert A.key == B.key, "The two Qube root nodes must have the same key to perform set operations," \
        f"would usually be two root nodes. They have {A.key} and {B.key} respectively"

@@ -48,7 +59,7 @@ def operation(A: "Qube", B : "Qube", operation_type: SetOperation) -> "Qube":

    # For every node group, perform the set operation
    for key, (A_nodes, B_nodes) in nodes_by_key.items():
        new_children.extend(_operation(key, A_nodes, B_nodes, operation_type))
        new_children.extend(_operation(key, A_nodes, B_nodes, operation_type, node_type))

    # Whenever we modify children we should recompress them
    # But since `operation` is already recursive, we only need to compress this level not all levels
@@ -60,36 +71,46 @@ def operation(A: "Qube", B : "Qube", operation_type: SetOperation) -> "Qube":


# The root node is special so we need a helper method that we can recurse on
def _operation(key: str, A: list["Qube"], B : list["Qube"], operation_type: SetOperation) -> Iterable["Qube"]:
def _operation(key: str, A: list["Qube"], B : list["Qube"], operation_type: SetOperation, node_type) -> Iterable["Qube"]:
    # We need to deal with the case where only one of the trees has this key.
    # To do so we can insert a dummy node with no children and no values into both A and B
    keep_just_A, keep_intersection, keep_just_B = operation_type.value

    # Iterate over all pairs (node_A, node_B)
    values = {}
    for node in A + B:
        values[node] = node.values

    for node_a in A:
        for node_b in B:

            # Compute A - B, A & B, B - A
            just_A, intersection, just_B = fused_set_operations(
                node_a.values,
                node_b.values
            # Update the values for the two source nodes to remove the intersection
            just_a, intersection, just_b = node_intersection(
                values[node_a],
                values[node_b],
            )
            keep_just_A, keep_intersection, keep_just_B = operation_type.value

            # Values in just_A and just_B are simple because
            # we can just make new nodes that copy the children of node_A or node_B
            if keep_just_A:
                for group in just_A:
                    data = NodeData(key, group, {})
                    yield type(node_a)(data, node_a.children)

            if keep_just_B:
                for group in just_B:
                    data = NodeData(key, group, {})
                    yield type(node_a)(data, node_b.children)
            # Remove the intersection from the source nodes
            values[node_a] = just_a
            values[node_b] = just_b

            if keep_intersection:
                for group in intersection:
                    if group:
                        new_node_a = replace(node_a, data = replace(node_a.data, values = group))
                        new_node_b = replace(node_b, data= replace(node_b.data, values = group))
                        yield operation(new_node_a, new_node_b, operation_type)
                if intersection:
                    new_node_a = replace(node_a, data = replace(node_a.data, values = intersection))
                    new_node_b = replace(node_b, data= replace(node_b.data, values = intersection))
                    yield operation(new_node_a, new_node_b, operation_type, node_type)


    # Now that we've removed all the intersections, we can yield the just_A and just_B parts if needed
    if keep_just_A:
        for node in A:
            if values[node]:
                yield node_type.make(key, values[node], node.children)
    if keep_just_B:
        for node in B:
            if values[node]:
                yield node_type.make(key, values[node], node.children)

def compress_children(children: Iterable["Qube"]) -> tuple["Qube"]:
    """

@@ -18,6 +18,10 @@ class Values(ABC):
    def __contains__(self, value: Any) -> bool:
        pass

    @abstractmethod
    def __iter__(self) -> Iterable[Any]:
        pass

    @abstractmethod
    def from_strings(self, values: Iterable[str]) -> list['Values']:
        pass
@@ -48,6 +52,7 @@ class QEnum(Values):

    def __len__(self) -> int:
        return len(self.values)

    def summary(self) -> str:
        return '/'.join(map(str, sorted(self.values)))
    def __contains__(self, value: Any) -> bool:
@@ -68,6 +73,15 @@ class DateRange(Range):
    step: timedelta
    dtype: Literal["date"] = dataclasses.field(kw_only=True, default="date")

    def __len__(self) -> int:
        return (self.end - self.start) // self.step

    def __iter__(self) -> Iterable[date]:
        current = self.start
        while current <= self.end if self.step.days > 0 else current >= self.end:
            yield current
            current += self.step
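
    # Illustrative sketch (dates chosen arbitrarily, not taken from the tests): with
    # start=2024-11-01, end=2024-11-03 and step=timedelta(days=1), iteration yields the
    # 1st, 2nd and 3rd of November, i.e. the end date is included when it lies on the step grid.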

    @classmethod
    def from_strings(self, values: Iterable[str]) -> list['DateRange']:
        dates = sorted([datetime.strptime(v, "%Y%m%d") for v in values])
@@ -105,10 +119,6 @@ class DateRange(Range):
        v = datetime.strptime(value, "%Y%m%d").date()
        return self.start <= v <= self.end and (v - self.start) % self.step == 0


    def __len__(self) -> int:
        return (self.end - self.start) // self.step

    def summary(self) -> str:
        def fmt(d): return d.strftime("%Y%m%d")
        if self.step == timedelta(days=0):

@@ -40,6 +40,36 @@ def test_union():

    assert q | r == u

def test_union_with_empty():
    q = Qube.from_dict({"a=1/2/3" : {"b=1" : {}},})
    assert q | Qube.empty() == q

def test_union_2():
    q = Qube.from_datacube({
        "class": "d1",
        "dataset": ["climate-dt", "another-value"],
        'generation': ['1', "2", "3"],
    })

    r = Qube.from_datacube({
        "class": "d1",
        "dataset": ["weather-dt", "climate-dt"],
        'generation': ['1', "2", "3", "4"],
    })

    u = Qube.from_dict({
        "class=d1" : {
            "dataset=climate-dt/weather-dt" : {
                "generation=1/2/3/4" : {},
            },
            "dataset=another-value" : {
                "generation=1/2/3" : {},
            },
        }
    })

    assert q | r == u

def test_difference():
    q = Qube.from_dict({"a=1/2/3/5" : {"b=1" : {}},})
    r = Qube.from_dict({"a=2/3/4" : {"b=1" : {}},})

tests/test_iteration.py (new file, 35 lines)
@@ -0,0 +1,35 @@
from frozendict import frozendict
from qubed import Qube


def test_iter_leaves_simple():
    def make_hashable(l):
        for d in l:
            yield frozendict(d)
    q = Qube.from_dict({
        "a=1/2" : {"b=1/2" : {}}
    })
    entries = [
        {"a" : '1', "b" : '1'},
        {"a" : '1', "b" : '2'},
        {"a" : '2', "b" : '1'},
        {"a" : '2', "b" : '2'},
    ]

    assert set(make_hashable(q.leaves())) == set(make_hashable(entries))

# def test_iter_leaves():
#     d = {
#         "class=od" : {
#             "expver=0001": {"param=1":{}, "param=2":{}},
#             "expver=0002": {"param=1":{}, "param=2":{}},
#         },
#         "class=rd" : {
#             "expver=0001": {"param=1":{}, "param=2":{}, "param=3":{}},
#             "expver=0002": {"param=1":{}, "param=2":{}},
#         },
#     }
#     q = Qube.from_dict(d)
#     r = Qube.from_dict(d)

#     assert q == r

tests/test_set_operations.py (new file, 19 lines)
@@ -0,0 +1,19 @@
from qubed import Qube


def test_leaf_conservation():
    q = Qube.from_dict({
        "class=d1": {"dataset=climate-dt" : {
            "time=0000": {"param=130/134/137/146/147/151/165/166/167/168/169" : {}},
            "time=0001": {"param=130": {}},
        }}})

    r = Qube.from_datacube({
        "class": "d1",
        "dataset": "climate-dt",
        "time": "0001",
        "param": "134"
    })

    assert q.n_leaves + r.n_leaves == (q | r).n_leaves