Add search chapter

Jon Staab
2026-05-20 16:07:58 -07:00
parent d0709e1811
commit 75381b653e
5 changed files with 1014 additions and 9 deletions
@@ -0,0 +1,189 @@
# Plan: Search
## Topic Summary
NIP-50 adds an optional full-text `search` field to the subscription filter from
chapter 11. A relay that supports the capability interprets the query against
event content (and, for some kinds, other fields), returning results ordered by
relevance rather than `created_at`, with `limit` applied after ranking. The
query may carry `key:value` extensions — `domain:`, `language:`, `sentiment:`,
`nsfw:`, `include:spam` — which relays may support or ignore.
This chapter extends `Filter` with a `search` field, threads it through
serialization / grouping / set algebra, introduces a typed `SearchQuery` that
splits free-text terms from `key:value` extensions, and implements a best-effort
local relevance **score in [0, 1]** used to both include and rank events —
mirroring the NIP's "descending order by quality of result, limit last."
## Chapter Outline
1. **Intro / framing** — Search as a relay-defined, optional capability; content
discovery is client-initiated routing, not a global index; results are
partial and ranked by the relay. The local matcher is an honest best-effort
fallback, not a reimplementation of relay search.
2. **The `search` field** — Add `search: Option<String>` to `Filter`; builder
methods `add_search` / `clear_search`; note it joins the derived `Hash` (so
`id()` covers it for free).
3. **Serialization** — Emit/parse a plain `"search"` key in the hand-written
serde impl, present only when `Some`.
4. **The `SearchQuery` model** — A new `search` module: terms + ordered
`key:value` extensions, `parse`, `Display`, builders, and the `Filter` bridge.
5. **Scoring & matching** — `search_score` (fraction-of-terms + diminishing
frequency bonus, capped at 1.0); `matches` includes an event when score > 0;
`rank_search_results` sorts by score then `created_at` and applies `limit`.
6. **Grouping and set algebra** — `search` enters `group()` (distinct searches
never merge); `union_filters` carries it through unchanged; `intersect_filters`
keeps a conflicting-search pair separate instead of fabricating a combined query.
7. **What's next** — Brief pointer to the Domain section (relay selection, including
discovering NIP-50-capable relays via relay metadata, is a later concern).
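A minimal sketch of outline item 2, assuming a cut-down `Filter` with only `kinds` and `search` (the real struct has more fields). The `id()` shown here is hypothetical and built on `DefaultHasher`; the chapter's actual `id()` may hash differently. The point is only that a derived `Hash` picks up the new field without extra code.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Cut-down stand-in for the chapter-11 `Filter` (assumption: only two fields).
#[derive(Debug, Clone, Default, PartialEq, Eq, Hash)]
pub struct Filter {
    pub kinds: Vec<u16>,
    /// NIP-50 full-text search query.
    pub search: Option<String>,
}

impl Filter {
    /// Sets the search query, replacing any previous one.
    pub fn add_search(mut self, search: impl Into<String>) -> Self {
        self.search = Some(search.into());
        self
    }

    /// Removes the search query.
    pub fn clear_search(mut self) -> Self {
        self.search = None;
        self
    }

    /// Hypothetical id: hashes every field through the derived `Hash`,
    /// so `search` is covered without touching this method.
    pub fn id(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.hash(&mut hasher);
        hasher.finish()
    }
}
```

Under these assumptions, `Filter::default().add_search("cats").clear_search()` compares equal to `Filter::default()`, and two filters that differ only in `search` hash to different ids.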
## API Design
### `coracle-lib/src/filters.rs` (extends existing `Filter`)
```rust
pub struct Filter {
// ... existing fields ...
/// NIP-50 full-text search query. Relay-interpreted; see `SearchQuery`.
pub search: Option<String>,
}
impl Filter {
pub fn add_search(self, search: impl Into<String>) -> Self; // sets Some
pub fn clear_search(self) -> Self; // sets None
/// Bridge to the typed model.
pub fn add_search_query(self, query: &SearchQuery) -> Self; // = add_search(query.to_string())
pub fn search_query(&self) -> Option<SearchQuery>; // parse the field back
/// Best-effort local relevance score in [0.0, 1.0].
/// Returns 1.0 when there is no search, or a search with no free-text
/// terms (only extensions, which are unenforceable locally).
pub fn search_score(&self, event: &Event) -> f64;
}
/// Filter `events` to those matching `filter`, sort by relevance
/// (search_score desc, then created_at desc), and apply `filter.limit`.
pub fn rank_search_results<'a>(filter: &Filter, events: &'a [Event]) -> Vec<&'a Event>;
```
`matches` gains a final check: `if self.search_score(event) == 0.0 { return false }`.
Because `search_score` returns 1.0 when there is no search (or no terms), this
only rejects when a search *with terms* matched none of them — i.e. "any term
present ⇒ included."
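The inclusion rule and the rank-then-limit order can be sketched as follows. `Event` and `Filter` are cut-down stand-ins, and `search_score` is simplified to the fraction of terms present (no frequency bonus, no deduplication of repeated terms, no extension parsing). Those simplifications are assumptions for illustration, not the chapter's implementation.

```rust
/// Cut-down stand-in for an event (assumption: only the fields used here).
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub content: String,
    pub created_at: u64,
}

/// Cut-down stand-in for `Filter`.
pub struct Filter {
    pub search: Option<String>,
    pub limit: Option<usize>,
}

impl Filter {
    /// Simplified score: fraction of whitespace-separated terms found in the
    /// content. Returns 1.0 when there is no search or no terms.
    pub fn search_score(&self, event: &Event) -> f64 {
        let Some(search) = &self.search else { return 1.0 };
        let terms: Vec<String> = search.split_whitespace().map(str::to_lowercase).collect();
        if terms.is_empty() {
            return 1.0;
        }
        let content = event.content.to_lowercase();
        let matched = terms.iter().filter(|t| content.contains(t.as_str())).count();
        matched as f64 / terms.len() as f64
    }

    /// Other filter checks (ids, kinds, since, ...) would run before this one.
    pub fn matches(&self, event: &Event) -> bool {
        self.search_score(event) > 0.0
    }
}

/// Keep matching events, sort by score desc then `created_at` desc,
/// and only then apply `limit`.
pub fn rank_search_results<'a>(filter: &Filter, events: &'a [Event]) -> Vec<&'a Event> {
    let mut scored: Vec<(f64, &Event)> = events
        .iter()
        .filter(|e| filter.matches(e))
        .map(|e| (filter.search_score(e), e))
        .collect();
    scored.sort_by(|a, b| {
        b.0.total_cmp(&a.0)
            .then_with(|| b.1.created_at.cmp(&a.1.created_at))
    });
    scored
        .into_iter()
        .take(filter.limit.unwrap_or(usize::MAX))
        .map(|(_, e)| e)
        .collect()
}
```

Applying `limit` after the sort matters: truncating first would keep whichever events happened to come first in the slice rather than the best-scoring ones.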
### `coracle-lib/src/search.rs` (new module)
```rust
/// A parsed NIP-50 search query: free-text terms plus `key:value` extensions.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct SearchQuery {
pub terms: Vec<String>,
pub extensions: Vec<(String, String)>, // ordered; repeats allowed
}
impl SearchQuery {
pub fn new() -> Self;
/// Total parse: split on whitespace; a token is an extension iff it is
/// `key:value` with key in [A-Za-z0-9_-]+, non-empty value not starting
/// with '/'. Everything else is a term. Never fails.
pub fn parse(input: &str) -> Self;
pub fn add_term(self, term: impl Into<String>) -> Self;
pub fn add_extension(self, key: impl Into<String>, value: impl Into<String>) -> Self;
pub fn is_empty(&self) -> bool;
}
impl fmt::Display for SearchQuery { /* terms first, then "key:value" exts, space-joined */ }
```
`Filter::matches` / `search_score` tokenize via `SearchQuery::parse`, using only
`terms` (extensions are ignored by the local matcher).
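The parse rule and the `Display` order described above could look roughly like this. The `split_extension` helper is a hypothetical name introduced for the sketch; the builders and `is_empty` are omitted for brevity.

```rust
use std::fmt;

#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct SearchQuery {
    pub terms: Vec<String>,
    pub extensions: Vec<(String, String)>,
}

impl SearchQuery {
    /// Total parse: anything that is not a well-formed `key:value` token
    /// is kept as a free-text term, so this never fails.
    pub fn parse(input: &str) -> Self {
        let mut query = SearchQuery::default();
        for token in input.split_whitespace() {
            match split_extension(token) {
                Some((key, value)) => query
                    .extensions
                    .push((key.to_string(), value.to_string())),
                None => query.terms.push(token.to_string()),
            }
        }
        query
    }
}

/// Hypothetical helper: splits at the first ':' and accepts the token as an
/// extension only if the key is [A-Za-z0-9_-]+ and the value is non-empty
/// and does not start with '/'.
fn split_extension(token: &str) -> Option<(&str, &str)> {
    let (key, value) = token.split_once(':')?;
    let key_ok = !key.is_empty()
        && key
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-');
    let value_ok = !value.is_empty() && !value.starts_with('/');
    (key_ok && value_ok).then_some((key, value))
}

impl fmt::Display for SearchQuery {
    /// Terms first, then `key:value` extensions, joined by single spaces.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let mut parts: Vec<String> = self.terms.clone();
        parts.extend(self.extensions.iter().map(|(k, v)| format!("{k}:{v}")));
        write!(f, "{}", parts.join(" "))
    }
}
```

In this sketch the `'/'` rule keeps `https://example.com` as a term, while a token such as `mailto:bob` is still read as an extension. Because `Display` always writes terms first, `parse("language:en cats").to_string()` yields `cats language:en`: the round trip preserves content but not the original token order.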
### Scoring formula (`search_score`)
For the parsed query's distinct `terms` (case-insensitive), against
`event.content` lowercased:
- `total` = number of distinct terms; if 0 → return 1.0.
- For each term, `count` = non-overlapping occurrences in content.
- `matched` = terms with `count ≥ 1`; `extra` = (Σ count) − matched (repeats
beyond the first hit of each matched term).
- `base = matched / total` (fraction of terms present, in [0, 1]).
- `bonus = (1 − 1/(1 + extra)) / total` (diminishing, strictly `< 1/total`, so a
partial match never reaches the next term's bucket).
- `score = (base + bonus).min(1.0)`.
Properties (asserted in tests): score in [0, 1]; all terms once ⇒ 1.0; missing a
term ⇒ `< 1.0`; more occurrences ⇒ score never decreases (monotonic, never
exceeds 1.0); no terms matched ⇒ exactly 0.0.
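A standalone sketch of the formula, assuming the terms have already been extracted from the query. The free-function signature is illustrative only; the planned method takes `&Event` and parses `self.search` itself.

```rust
use std::collections::BTreeSet;

/// Best-effort relevance score in [0.0, 1.0] for `terms` against `content`.
pub fn search_score(terms: &[&str], content: &str) -> f64 {
    // Distinct, case-insensitive terms.
    let distinct: BTreeSet<String> = terms
        .iter()
        .map(|t| t.to_lowercase())
        .filter(|t| !t.is_empty())
        .collect();
    if distinct.is_empty() {
        return 1.0; // no free-text terms: nothing to reject on
    }

    let haystack = content.to_lowercase();
    let mut matched = 0usize; // terms with at least one hit
    let mut occurrences = 0usize; // total hits across matched terms
    for term in &distinct {
        // `str::matches` yields non-overlapping occurrences.
        let count = haystack.matches(term.as_str()).count();
        if count >= 1 {
            matched += 1;
            occurrences += count;
        }
    }

    let total = distinct.len() as f64;
    let extra = (occurrences - matched) as f64; // repeats beyond first hits
    let base = matched as f64 / total;
    let bonus = (1.0 - 1.0 / (1.0 + extra)) / total;
    (base + bonus).min(1.0)
}
```

Worked example: for terms `nostr` and `zap` against the content `nostr nostr`, `total = 2`, `matched = 1`, `extra = 1`, so `base = 1/2`, `bonus = (1 − 1/2) / 2 = 1/4`, and the score is 0.75. Adding a third `nostr` raises `extra` to 2 and the score to `1/2 + (1 − 1/3) / 2 ≈ 0.833`, still below the 1.0 that requires both terms.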
## Code Organization
- **`coracle-lib/src/filters.rs`** — add the `search` field, builders, the
serde changes, `search_score`, the `matches` check, `rank_search_results`,
and the `group()` / `intersect_filters` updates. `use crate::search::SearchQuery;`.
- **`coracle-lib/src/search.rs`** — the `SearchQuery` type. New `pub mod search;`
in `lib.rs`, placed before `filters` (filters depends on it).
- **`coracle-lib/src/prelude.rs`** — add `pub use crate::search::SearchQuery;`
(the prelude already re-exports commonly used items).
- **`coracle-lib/tests/search.rs`** — hand-written integration tests (not tangled).
## Dependencies
None new. Parsing and matching use `std` only. No FTS engine — out of scope and
against the minimal-dependency rule.
## Narrative Notes
- Open with the philosophy: search is opt-in and relay-defined; no global index;
results partial and relay-ranked. Frame the local scorer as a fallback for
in-memory/offline querying, and warn (per rust-nostr's SDK) that re-filtering a
relay's returned results client-side can wrongly drop legitimate hits — relays
rank with richer, extension-aware logic.
- Explain *why* extensions are parsed but **ignored locally**: `sentiment:`,
`domain:`, etc. require data the client doesn't have, so honoring them locally
is impossible; we keep them in the typed model for *building/inspecting*
queries, not for local evaluation.
- Justify the score model concretely: NIP-50 mandates relevance ordering, so a
boolean match is the wrong shape — a [0,1] score lets us both include
(score > 0) and rank. Walk through the fraction + diminishing-bonus formula
with a small worked example.
- For grouping: reuse the chapter-11 reasoning — two filters with different
searches can't be unioned without changing semantics, so `search` joins the
group key. Show that `union_filters` then keeps them separate automatically.
- For `intersect_filters`: explain the one structural change — `combine_pair`
returns `Option<Filter>`; a pair whose two searches differ returns `None`, and
the caller emits both filters separately rather than concatenating queries.
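The grouping and intersection notes above can be sketched together. Everything here is an assumption made for illustration: a three-field `Filter`, a group key of `(since, search)`, `kinds` intersected as plain lists, and `since` combined by taking the later bound. The sketch also treats a pair where only one side has a search as compatible, which the plan does not specify. `intersect_pair` is a hypothetical name for the caller.

```rust
use std::collections::BTreeMap;

/// Cut-down stand-in for `Filter` (assumption: three fields only).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Filter {
    pub kinds: Vec<u16>,
    pub since: Option<u64>,
    pub search: Option<String>,
}

/// Filters sharing a group key merge their `kinds`. Because `search` is part
/// of the key, filters with different searches land in different groups.
pub fn union_filters(filters: &[Filter]) -> Vec<Filter> {
    let mut groups: BTreeMap<(Option<u64>, Option<String>), Vec<u16>> = BTreeMap::new();
    for filter in filters {
        let kinds = groups
            .entry((filter.since, filter.search.clone()))
            .or_default();
        for kind in &filter.kinds {
            if !kinds.contains(kind) {
                kinds.push(*kind);
            }
        }
    }
    groups
        .into_iter()
        .map(|((since, search), kinds)| Filter { kinds, since, search })
        .collect()
}

/// Returns `None` when both filters carry a search and the two differ.
fn combine_pair(a: &Filter, b: &Filter) -> Option<Filter> {
    let search = match (&a.search, &b.search) {
        (Some(x), Some(y)) if x != y => return None,
        (Some(x), _) => Some(x.clone()),
        (None, y) => y.clone(),
    };
    Some(Filter {
        // Assumes both `kinds` lists are non-empty.
        kinds: a.kinds.iter().copied().filter(|k| b.kinds.contains(k)).collect(),
        since: a.since.max(b.since), // later bound is the stricter one
        search,
    })
}

/// Hypothetical caller: a conflicting pair is emitted as two filters.
pub fn intersect_pair(a: &Filter, b: &Filter) -> Vec<Filter> {
    match combine_pair(a, b) {
        Some(combined) => vec![combined],
        None => vec![a.clone(), b.clone()],
    }
}
```

Returning `Option` keeps the decision local: `combine_pair` only reports that no honest single filter exists, and the caller decides what to emit instead.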
## Design Decisions
1. **Typed `SearchQuery`, lean/generic.** Terms + a generic ordered list of
`key:value` extensions, with `add_term`/`add_extension`. No per-extension
helpers or typed enums — keeps the surface small and forward-compatible with
relay-specific extensions. (Every reference treats search as opaque; the typed
model is our value-add.)
2. **Local relevance score in [0, 1]**, fraction-of-terms + diminishing frequency
bonus, capped at 1.0. Chosen over a boolean to model NIP-50's relevance
ordering. Extensions excluded from scoring.
3. **`matches` includes on score > 0** ("any term present"); ranking via
`rank_search_results` handles relevance + `limit`-after-sort.
4. **`search` participates in `group()`**, so `union_filters` never merges
distinct searches.
5. **`intersect_filters` keeps a conflicting-search pair separate** (combine
returns `Option`, `None` ⇒ emit both) rather than concatenating, per the
user's choice.
6. **Builder naming `add_search`/`clear_search`** to match the existing
`add_since`/`clear_since` vocabulary (not rust-nostr's `search`/`remove_search`).
7. **Unicode-aware lowercasing** (`to_lowercase`) for the local matcher rather
than ASCII-only, given multilingual nostr content; note the allocation
trade-off. Substring counting via `str::matches`.
8. **Extension parse heuristic** documented: a colon-bearing token like a URL may
be read as an extension; applications needing exact control build
`SearchQuery` field-by-field instead of parsing.
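Decision 7 can be illustrated by comparing the two lowercasing strategies directly. Both function names are hypothetical; they isolate only the counting step, not the full scorer.

```rust
/// Unicode-aware count of non-overlapping occurrences of `term` in `content`.
/// Allocates a lowercased copy of both strings on every call.
pub fn count_occurrences(content: &str, term: &str) -> usize {
    if term.is_empty() {
        return 0; // an empty pattern would match at every position
    }
    content
        .to_lowercase()
        .matches(term.to_lowercase().as_str())
        .count()
}

/// ASCII-only variant, shown for comparison: non-ASCII letters keep their case.
pub fn count_occurrences_ascii(content: &str, term: &str) -> usize {
    if term.is_empty() {
        return 0;
    }
    content
        .to_ascii_lowercase()
        .matches(term.to_ascii_lowercase().as_str())
        .count()
}
```

For the content `ÉCOLE et école` and the term `école`, the Unicode-aware version finds both occurrences, while the ASCII-only version lowercases `ÉCOLE` to `École` and finds only one. The cost is the per-call allocation noted in the decision; a caller scoring many events could lowercase the term once up front.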
## Open Questions
- Exact wording of the frequency-bonus explanation — keep the formula in prose
light; lean on a worked example. (Resolved during writing.)
- Whether `rank_search_results` belongs as a free function (consistent with
`matches_any`/`union_filters`) — yes, free function.