Add filters chapter

2026-04-21 12:08:55 -07:00
parent c8f6bc1652
commit a8a57a3d77
6 changed files with 1876 additions and 36 deletions
@@ -0,0 +1,207 @@
+# Plan: Filters
+
+## Topic Summary
+
+Filters are the NIP-01 data structure for matching events. They are a predicate that can be
+evaluated against any event, independent of the client/relay context: REQ messages are one use,
+but the same structure serves as a general-purpose matching and querying abstraction. The chapter
+covers the filter structure, matching semantics (AND within a filter, OR across filters), tag
+filters, timestamp constraints, limits, construction, serialization, hashing, grouping, and
+cardinality estimation.
+
+## Chapter Outline
+
+1. **Introduction** — Filters as a general-purpose event matching primitive. Not tied to relays;
+   they're a predicate you can evaluate against any event. Analogy to database WHERE clauses.
+
+2. **The Filter Struct** — Walk through the fields:
+   - `ids: Option<BTreeSet<[u8; 32]>>` — match event IDs
+   - `authors: Option<BTreeSet<PublicKey>>` — match event authors
+   - `kinds: Option<BTreeSet<u16>>` — match event kinds
+   - `tags: BTreeMap<String, BTreeSet<String>>` — match tag values by tag name
+   - `since: Option<u64>` — lower bound on `created_at` (inclusive)
+   - `until: Option<u64>` — upper bound on `created_at` (inclusive)
+   - `limit: Option<usize>` — result count constraint (not a matching criterion)
+
+   Explain `Option` semantics: `None` = no constraint, `Some(empty set)` = matches nothing.
+   For `tags`, which is not wrapped in `Option`, an absent key means no constraint.
+   Note that `limit` is metadata for consumers, not part of matching logic.
+
+3. **Matching** — Implement `matches(&self, event: &Event) -> bool`:
+   - AND semantics: all present fields must match
+   - Early exit on scalar checks (ids, kinds, authors) before tag matching
+   - Tag matching: for each tag filter, event must have at least one tag with that name
+     whose value is in the filter's set (OR within a tag filter, AND across tag filters)
+   - Timestamp: `since <= created_at <= until`
+   - `limit` is ignored
+   - Implement `matches_any(filters: &[Filter], event: &Event) -> bool` as a free function
+     for OR-across-filters semantics
+
+4. **Construction** — Builder pattern with fluent API:
+   - `Filter::new()` — empty filter (matches everything)
+   - `.id(id)` / `.ids(iter)` — add event IDs
+   - `.author(pk)` / `.authors(iter)` — add authors
+   - `.kind(k)` / `.kinds(iter)` — add kinds
+   - `.tag(name, value)` / `.tags(name, iter)` — add arbitrary tag filters
+   - `.since(ts)` / `.until(ts)` — set timestamp bounds
+   - `.limit(n)` — set result limit
+   - `.address(addr)` — convenience: sets kind, author, and `#d` tag from an Address
+
+5. **Serialization** — Custom serde implementation:
+   - Standard fields serialize normally, skip `None` fields
+   - `tags` BTreeMap flattened: key `"foo"` becomes JSON key `"#foo"` with array value
+   - Handle `limit: 0` vs omitted limit (`Some(0)` serializes as `"limit": 0`; `None` omits the key)
+   - Deserialization: any key starting with `#` collected into `tags` map
+   - Show round-trip example
+
+6. **Identity and Grouping** — Utilities for deduplication and merging:
+   - `filter_id(filter) -> String` — deterministic hash of filter contents for dedup
+   - `filter_group(filter) -> String` — hash of structural fields only (ids, kinds, authors,
+     tag keys) excluding values and temporal fields. Two filters in the same group can be
+     merged by unioning their value sets.
+
+7. **Cardinality** — `cardinality(&self) -> Option<usize>`:
+   - Returns `Some(n)` when the maximum number of matching events can be determined
+   - `ids` present → `ids.len()`
+   - All kinds are replaceable + `authors` present → `authors.len() * kinds.len()`
+   - All kinds are addressable + `authors` present + `#d` present →
+     `authors.len() * kinds.len() * d_values.len()`
+   - Otherwise → `None` (unbounded)
+   - If explicit `limit` is set, return `min(limit, computed)` when computed is Some,
+     or `Some(limit)` when computed is None
+   - Empty set in any field → `Some(0)`; check this before the other rules, since such a
+     filter matches nothing
+
+8. **Recap** — Summarize filter as a composable primitive. Tease usage in relay connections
+   chapter.
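
Outline items 2 and 3 can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the chapter's final code: the `Event` struct is a hypothetical stand-in for the type from the events chapter, and `[u8; 32]` stands in for `PublicKey`.

```rust
use std::collections::{BTreeMap, BTreeSet};

// Hypothetical stand-in for the event type from the events chapter.
pub struct Event {
    pub id: [u8; 32],
    pub pubkey: [u8; 32],
    pub kind: u16,
    pub created_at: u64,
    pub tags: Vec<Vec<String>>,
}

#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct Filter {
    pub ids: Option<BTreeSet<[u8; 32]>>,
    pub authors: Option<BTreeSet<[u8; 32]>>, // assumption: PublicKey in the plan
    pub kinds: Option<BTreeSet<u16>>,
    pub since: Option<u64>,
    pub until: Option<u64>,
    pub limit: Option<usize>, // never consulted by matches()
    pub tags: BTreeMap<String, BTreeSet<String>>,
}

impl Filter {
    /// AND across fields. `None` is no constraint; `Some(set)` requires
    /// membership, so an empty set can never match.
    pub fn matches(&self, event: &Event) -> bool {
        if let Some(ids) = &self.ids {
            if !ids.contains(&event.id) {
                return false;
            }
        }
        if let Some(kinds) = &self.kinds {
            if !kinds.contains(&event.kind) {
                return false;
            }
        }
        if let Some(authors) = &self.authors {
            if !authors.contains(&event.pubkey) {
                return false;
            }
        }
        // Both timestamp bounds are inclusive.
        if let Some(since) = self.since {
            if event.created_at < since {
                return false;
            }
        }
        if let Some(until) = self.until {
            if event.created_at > until {
                return false;
            }
        }
        // AND across tag filters, OR within one tag filter.
        self.tags.iter().all(|(name, values)| {
            event
                .tags
                .iter()
                .any(|tag| tag.len() >= 2 && tag[0] == *name && values.contains(&tag[1]))
        })
    }
}

/// OR across filters.
pub fn matches_any(filters: &[Filter], event: &Event) -> bool {
    filters.iter().any(|filter| filter.matches(event))
}
```

With these definitions, `Filter::default()` matches every event, while a filter whose `kinds` is `Some(BTreeSet::new())` matches none.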
+
+## API Design
+
+```rust
+// --- Filter struct ---
+
+// Serialize/Deserialize are implemented by hand for the `#tag` flattening, so they are not derived.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct Filter {
+    pub ids: Option<BTreeSet<[u8; 32]>>,
+    pub authors: Option<BTreeSet<PublicKey>>,
+    pub kinds: Option<BTreeSet<u16>>,
+    pub since: Option<u64>,
+    pub until: Option<u64>,
+    pub limit: Option<usize>,
+    // Flattened in serde as #key -> [values]
+    pub tags: BTreeMap<String, BTreeSet<String>>,
+}
+
+// --- Construction (builder, consuming self) ---
+
+impl Filter {
+    pub fn new() -> Self
+    pub fn id(self, id: [u8; 32]) -> Self
+    pub fn ids(self, ids: impl IntoIterator<Item = [u8; 32]>) -> Self
+    pub fn author(self, author: PublicKey) -> Self
+    pub fn authors(self, authors: impl IntoIterator<Item = PublicKey>) -> Self
+    pub fn kind(self, kind: u16) -> Self
+    pub fn kinds(self, kinds: impl IntoIterator<Item = u16>) -> Self
+    pub fn tag(self, name: impl Into<String>, value: impl Into<String>) -> Self
+    pub fn tags(self, name: impl Into<String>, values: impl IntoIterator<Item = impl Into<String>>) -> Self
+    pub fn since(self, since: u64) -> Self
+    pub fn until(self, until: u64) -> Self
+    pub fn limit(self, limit: usize) -> Self
+    pub fn address(self, addr: &Address) -> Self
+}
+
+// --- Matching and cardinality ---
+
+impl Filter {
+    pub fn matches(&self, event: &Event) -> bool
+    pub fn cardinality(&self) -> Option<usize>
+}
+
+pub fn matches_any(filters: &[Filter], event: &Event) -> bool
+
+// --- Identity and grouping ---
+
+pub fn filter_id(filter: &Filter) -> String
+pub fn filter_group(filter: &Filter) -> String
+```
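
The consuming-`self` builder signatures above could be implemented along these lines. This is a sketch covering only a subset of the methods; the struct is trimmed to the fields it needs, and deriving `Default` to back `Filter::new()` is an assumption rather than something the plan specifies.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct Filter {
    pub ids: Option<BTreeSet<[u8; 32]>>,
    pub kinds: Option<BTreeSet<u16>>,
    pub since: Option<u64>,
    pub limit: Option<usize>,
    pub tags: BTreeMap<String, BTreeSet<String>>,
}

impl Filter {
    /// Empty filter: every field unconstrained, so it matches everything.
    pub fn new() -> Self {
        Self::default()
    }

    /// Singular form delegates to the plural one.
    pub fn kind(self, kind: u16) -> Self {
        self.kinds([kind])
    }

    /// Adds to the existing set instead of replacing it, so calls accumulate.
    pub fn kinds(mut self, kinds: impl IntoIterator<Item = u16>) -> Self {
        self.kinds.get_or_insert_with(BTreeSet::new).extend(kinds);
        self
    }

    pub fn tag(self, name: impl Into<String>, value: impl Into<String>) -> Self {
        self.tags(name, [value])
    }

    pub fn tags(
        mut self,
        name: impl Into<String>,
        values: impl IntoIterator<Item = impl Into<String>>,
    ) -> Self {
        let set = self.tags.entry(name.into()).or_default();
        for value in values {
            set.insert(value.into());
        }
        self
    }

    pub fn since(mut self, since: u64) -> Self {
        self.since = Some(since);
        self
    }

    pub fn limit(mut self, limit: usize) -> Self {
        self.limit = Some(limit);
        self
    }
}
```

A chained call such as `Filter::new().kind(1).tag("e", "abc").limit(5)` then reads much like the JSON it will eventually serialize to. Rust keeps fields and methods in separate namespaces, which is why `kinds` and `tags` can name both.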
+
+## Code Organization
+
+All code in `coracle-lib/src/filters.rs`. Single file, single module. Add `pub mod filters;`
+to `coracle-lib/src/lib.rs`.
+
+## Dependencies
+
+- `serde` / `serde_json` — already used in the events chapter for serialization
+- `std::collections::BTreeSet` / `BTreeMap` — stdlib, no external crate
+- `sha2` — already used in events chapter for hashing; reuse for filter_id
+
+No new external dependencies needed.
+
+## Narrative Notes
+
+- Open by framing filters as a standalone primitive. They're a predicate, not a protocol
+  message. The fact that relays use them in REQ is one application, but they're equally
+  useful for client-side filtering, local storage queries, and event routing decisions.
+
+- The `Option` semantics deserve careful explanation. Show the difference:
+  `None` = "I don't care about this field" vs `Some(empty)` = "this field must match
+  one of these zero values (i.e., nothing matches)". This is the key insight that makes
+  filters composable.
+
+- When explaining matching, walk through a concrete example: construct a filter, show an
+  event, trace through the matching logic field by field.
+
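+- For tag filters, emphasize that the struct accepts arbitrary strings as tag keys, not only
+  single letters. Event tag names are unrestricted, but NIP-01 defines filter keys as
+  `#<single-letter>` and expects relays to index only single-letter tags, so multi-letter keys
+  are mainly useful for local matching and relays are not required to honor them.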
+
+- `limit` gets a brief note: it's not part of matching. It tells a consumer (relay, storage
+  engine) how many results to return. Include it in the struct because it's part of the
+  NIP-01 filter object, but `matches()` ignores it.
+
+- For serialization, the interesting part is the tag flattening. Show the JSON representation
+  and explain how `tags: {"e": {"abc"}, "p": {"def"}}` becomes `{"#e": ["abc"], "#p": ["def"]}`.
+
+- `filter_id` and `filter_group` are utility functions, not methods, because they serve
+  infrastructure concerns (dedup, subscription management) rather than core filter semantics.
+
+- `cardinality` leverages kind classification from the kinds chapter. Connect the dots:
+  replaceable events have at most one per author per kind, addressable events have at most
+  one per author per kind per identifier.
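
The cardinality rules from the outline can be sketched as follows. The kind ranges are the NIP-01 ones (replaceable: 0, 3, and 10000 to 19999; addressable: 30000 to 39999). The helper names `is_replaceable` and `is_addressable` are hypothetical placeholders for whatever the kinds chapter exposes, and `[u8; 32]` again stands in for `PublicKey`.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Default)]
pub struct Filter {
    pub ids: Option<BTreeSet<[u8; 32]>>,
    pub authors: Option<BTreeSet<[u8; 32]>>,
    pub kinds: Option<BTreeSet<u16>>,
    pub limit: Option<usize>,
    pub tags: BTreeMap<String, BTreeSet<String>>,
}

// Hypothetical helpers; the kinds chapter may name these differently.
fn is_replaceable(kind: u16) -> bool {
    kind == 0 || kind == 3 || (10_000..20_000).contains(&kind)
}

fn is_addressable(kind: u16) -> bool {
    (30_000..40_000).contains(&kind)
}

fn is_empty_set<T>(field: &Option<BTreeSet<T>>) -> bool {
    matches!(field, Some(set) if set.is_empty())
}

impl Filter {
    /// Upper bound on the number of matching events, or `None` if unbounded.
    pub fn cardinality(&self) -> Option<usize> {
        // An empty set anywhere makes the filter unsatisfiable.
        if is_empty_set(&self.ids)
            || is_empty_set(&self.authors)
            || is_empty_set(&self.kinds)
            || self.tags.values().any(|values| values.is_empty())
        {
            return Some(0);
        }

        let computed = if let Some(ids) = &self.ids {
            Some(ids.len())
        } else if let (Some(authors), Some(kinds)) = (&self.authors, &self.kinds) {
            if kinds.iter().all(|k| is_replaceable(*k)) {
                // At most one event per author per kind.
                Some(authors.len() * kinds.len())
            } else if kinds.iter().all(|k| is_addressable(*k)) {
                // At most one event per author per kind per `d` value.
                self.tags
                    .get("d")
                    .map(|d| authors.len() * kinds.len() * d.len())
            } else {
                None
            }
        } else {
            None
        };

        // An explicit limit caps the computed bound, or stands alone.
        match (computed, self.limit) {
            (Some(n), Some(limit)) => Some(n.min(limit)),
            (n, limit) => n.or(limit),
        }
    }
}
```

Note that a filter mixing replaceable and regular kinds falls through to `None`, because a single unbounded kind makes the whole filter unbounded.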
+
+## Design Decisions
+
+1. **`Option<BTreeSet<T>>` for set fields** — Preserves the None-vs-empty distinction that
+   NIP-01 requires. BTreeSet gives O(log n) membership checks and deterministic iteration
+   order for serialization/hashing. (Research: rust-nostr uses this approach.)
+
+2. **Arbitrary string tag keys** — Not restricted to single letters. The protocol allows any
+   tag name on an event, although NIP-01 filters only define `#<single-letter>` keys and relays
+   are only expected to index single-letter tags. Consumers can enforce restrictions.
+
+3. **Minimal builder API** — `.id()`, `.author()`, `.kind()`, `.tag()`, `.address()` plus
+   plural variants. No convenience methods for every common tag (#e, #p, #t, etc.) — the
+   generic `.tag("e", value)` is clear enough. Keeps the chapter focused.
+
+4. **`limit` in struct but not in matching** — NIP-01 defines it as part of the filter object,
+   so it belongs in the struct. But it's a result constraint, not a predicate, so `matches()`
+   ignores it. (Research: NDK and nostr-tools both treat `limit` this way.)
+
+5. **Free functions for identity/grouping** — `filter_id` and `filter_group` are not methods
+   because they serve infrastructure concerns. Keeps the Filter impl block focused on
+   construction and matching.
+
+6. **`cardinality` returns `Option<usize>`** — `None` means unbounded. Leverages kind
+   classification (replaceable, addressable) to compute tight upper bounds when possible.
+   (Research: nostr-tools' `getFilterLimit`, nostrlib's `GetTheoreticalLimit`.)
+
+7. **Custom serde for tag flattening** — Tags serialize as `#name` keys at the top level of
+   the JSON object, matching the NIP-01 wire format. This requires custom Serialize/Deserialize
+   implementations rather than derive macros.
+
+8. **`.address()` convenience** — Translates an Address into the correct combination of kind,
+   author, and #d tag filter. This is the one domain-aware convenience method because
+   address-based filtering is extremely common and error-prone to construct manually.
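
The key mapping behind decision 7 can be illustrated without serde. The sketch below shows only how tag names become `#name` keys and back; the function names `flatten_tags` and `collect_tags` are hypothetical, and a real implementation would do this inside hand-written `Serialize`/`Deserialize` impls.

```rust
use std::collections::{BTreeMap, BTreeSet};

type Tags = BTreeMap<String, BTreeSet<String>>;

/// Serialization direction: tag name `e` becomes the top-level key `#e`.
/// BTreeMap and BTreeSet iterate in sorted order, so output is deterministic.
pub fn flatten_tags(tags: &Tags) -> Vec<(String, Vec<String>)> {
    tags.iter()
        .map(|(name, values)| (format!("#{name}"), values.iter().cloned().collect()))
        .collect()
}

/// Deserialization direction: keep only keys that start with `#`, strip the
/// prefix, and ignore everything else (those belong to the standard fields).
pub fn collect_tags<'a>(entries: impl IntoIterator<Item = (&'a str, Vec<String>)>) -> Tags {
    let mut tags = Tags::new();
    for (key, values) in entries {
        if let Some(name) = key.strip_prefix('#') {
            tags.entry(name.to_string()).or_default().extend(values);
        }
    }
    tags
}
```

Running `flatten_tags` and then `collect_tags` returns the original map, which is the property the round-trip example in the serialization section relies on.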
+
+## Open Questions
+
+- Should `filter_group` include tag *names* (keys) in the group hash, or only the set of
+  field names that are present? Including tag names means `{#e: [...]}` and `{#p: [...]}`
+  are in different groups (correct for merging). Leaning toward including tag names.
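
The option this question leans toward can be sketched as below. It builds a canonical group key from the names of the present fields plus the tag names, and leaves out values and temporal fields. The function name `filter_group_key` is hypothetical, and the sketch returns the key itself; the planned `filter_group` would presumably hash this string with `sha2`.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Default)]
pub struct Filter {
    pub ids: Option<BTreeSet<[u8; 32]>>,
    pub authors: Option<BTreeSet<[u8; 32]>>,
    pub kinds: Option<BTreeSet<u16>>,
    pub tags: BTreeMap<String, BTreeSet<String>>,
}

/// Canonical description of a filter's shape: which fields are present and
/// which tag names are used, ignoring every value.
pub fn filter_group_key(filter: &Filter) -> String {
    let mut parts: Vec<String> = Vec::new();
    if filter.ids.is_some() {
        parts.push("ids".to_string());
    }
    if filter.authors.is_some() {
        parts.push("authors".to_string());
    }
    if filter.kinds.is_some() {
        parts.push("kinds".to_string());
    }
    // BTreeMap iterates in key order, so the key is deterministic.
    parts.extend(filter.tags.keys().map(|name| format!("#{name}")));
    parts.join(",")
}
```

Under this scheme two `#e` filters with different values share a group, while an `#e` filter and a `#p` filter do not.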