
# Plan: Search
## Topic Summary
NIP-50 adds an optional full-text `search` field to the subscription filter from
chapter 11. A relay that supports the capability interprets the query against
event content (and, for some kinds, other fields), returning results ordered by
relevance rather than `created_at`, with `limit` applied after ranking. The
query may carry `key:value` extensions — `domain:`, `language:`, `sentiment:`,
`nsfw:`, `include:spam` — which relays may support or ignore.
This chapter extends `Filter` with a `search` field, threads it through
serialization / grouping / set algebra, introduces a typed `SearchQuery` that
splits free-text terms from `key:value` extensions, and implements a best-effort
local relevance **score in [0, 1]** used to both include and rank events —
mirroring the NIP's "descending order by quality of result, limit last."
## Chapter Outline
1. **Intro / framing** — Search as a relay-defined, optional capability; content
discovery is client-initiated routing, not a global index; results are
partial and ranked by the relay. The local matcher is an honest best-effort
fallback, not a reimplementation of relay search.
2. **The `search` field** — Add `search: Option<String>` to `Filter`; builder
methods `add_search` / `clear_search`; note it joins the derived `Hash` (so
`id()` covers it for free).
3. **Serialization** — Emit/parse a plain `"search"` key in the hand-written
serde impl, present only when `Some`.
4. **The `SearchQuery` model** — A new `search` module: terms + ordered
`key:value` extensions, `parse`, `Display`, builders, and the `Filter` bridge.
5. **Scoring & matching** — `search_score` (fraction-of-terms + diminishing
frequency bonus, capped at 1.0); `matches` includes an event when score > 0;
`rank_search_results` sorts by score then `created_at` and applies `limit`.
6. **Grouping and set algebra** — `search` enters `group()` (distinct searches
never merge); `union_filters` carries it through unchanged; `intersect_filters`
keeps a conflicting-search pair separate instead of fabricating a combined query.
7. **What's next** — Brief pointer to the Domain section (relay selection,
discovering NIP-50-capable relays via relay metadata, is a later concern).
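
To illustrate outline item 3 (the `"search"` key is emitted only when the field is `Some`), here is a minimal, `std`-only sketch. It is not the chapter's hand-written serde impl: `filter_json` and `escape_json` are hypothetical helpers, and the filter is reduced to `kinds` plus `search`.

```rust
/// Hypothetical stand-in for the serializer: builds the JSON object by hand.
/// A key is emitted only when its field carries a value.
pub fn filter_json(kinds: &[u16], search: Option<&str>) -> String {
    let mut fields = Vec::new();
    if !kinds.is_empty() {
        let kinds: Vec<String> = kinds.iter().map(|k| k.to_string()).collect();
        fields.push(format!("\"kinds\":[{}]", kinds.join(",")));
    }
    // `search` is a plain string key, present only when `Some`.
    if let Some(query) = search {
        fields.push(format!("\"search\":\"{}\"", escape_json(query)));
    }
    format!("{{{}}}", fields.join(","))
}

/// Hypothetical helper: escapes quotes, backslashes, and control characters.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}
```

With `search` set to `None` the output contains no `"search"` key at all, which keeps filters without a query byte-for-byte identical to what they were before the field existed.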
## API Design
### `coracle-lib/src/filters.rs` (extends existing `Filter`)
```rust
pub struct Filter {
    // ... existing fields ...
    /// NIP-50 full-text search query. Relay-interpreted; see `SearchQuery`.
    pub search: Option<String>,
}

impl Filter {
    pub fn add_search(self, search: impl Into<String>) -> Self; // sets Some
    pub fn clear_search(self) -> Self; // sets None

    /// Bridge to the typed model.
    pub fn add_search_query(self, query: &SearchQuery) -> Self; // = add_search(query.to_string())
    pub fn search_query(&self) -> Option<SearchQuery>; // parse the field back

    /// Best-effort local relevance score in [0.0, 1.0].
    /// Returns 1.0 when there is no search, or a search with no free-text
    /// terms (only extensions, which are unenforceable locally).
    pub fn search_score(&self, event: &Event) -> f64;
}

/// Filter `events` to those matching `filter`, sort by relevance
/// (search_score desc, then created_at desc), and apply `filter.limit`.
pub fn rank_search_results<'a>(filter: &Filter, events: &'a [Event]) -> Vec<&'a Event>;
```
`matches` gains a final check: `if self.search_score(event) == 0.0 { return false }`.
Because `search_score` returns 1.0 when there is no search (or no terms), this
only rejects when a search *with terms* matched none of them — i.e. "any term
present ⇒ included."
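
The include-then-rank behavior can be sketched as below. This is an illustration under stated assumptions, not the planned implementation: `Event` is a two-field stand-in, `rank_by_score` is a hypothetical name, and the score is passed in as a closure so the example does not depend on `Filter::search_score`.

```rust
/// Minimal stand-in for the real event type; only the fields ranking needs.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub content: String,
    pub created_at: u64,
}

/// Keep events with a non-zero score, sort by score (descending) then
/// `created_at` (descending), and only then apply `limit`.
pub fn rank_by_score<'a>(
    events: &'a [Event],
    limit: Option<usize>,
    score: impl Fn(&Event) -> f64,
) -> Vec<&'a Event> {
    // Score each event once so the comparator stays cheap.
    let mut scored: Vec<(f64, &Event)> = events
        .iter()
        .map(|event| (score(event), event))
        .filter(|(s, _)| *s > 0.0) // "any term present => included"
        .collect();

    // Highest score first; ties broken by newest first.
    scored.sort_by(|(sa, a), (sb, b)| {
        sb.total_cmp(sa).then_with(|| b.created_at.cmp(&a.created_at))
    });

    // `limit` is applied last, after ranking.
    if let Some(limit) = limit {
        scored.truncate(limit);
    }
    scored.into_iter().map(|(_, event)| event).collect()
}
```

Truncating after the sort is what keeps a recent but weakly matching event from displacing an older, stronger match when `limit` is small.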
### `coracle-lib/src/search.rs` (new module)
```rust
/// A parsed NIP-50 search query: free-text terms plus `key:value` extensions.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct SearchQuery {
    pub terms: Vec<String>,
    pub extensions: Vec<(String, String)>, // ordered; repeats allowed
}

impl SearchQuery {
    pub fn new() -> Self;

    /// Total parse: split on whitespace; a token is an extension iff it is
    /// `key:value` with key in [A-Za-z0-9_-]+, non-empty value not starting
    /// with '/'. Everything else is a term. Never fails.
    pub fn parse(input: &str) -> Self;

    pub fn add_term(self, term: impl Into<String>) -> Self;
    pub fn add_extension(self, key: impl Into<String>, value: impl Into<String>) -> Self;
    pub fn is_empty(&self) -> bool;
}

impl fmt::Display for SearchQuery { /* terms first, then "key:value" exts, space-joined */ }
```
`Filter::matches` / `search_score` tokenize via `SearchQuery::parse`, using only
`terms` (extensions are ignored by the local matcher).
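
The parse rule and `Display` ordering described in the doc comments could look roughly like this. It is a sketch only: `split_extension` is a hypothetical helper name, the token is split at its first colon (an assumption, since the plan does not say which colon is used), and the builders are omitted.

```rust
use std::fmt;

/// Parsed query: free-text terms plus ordered `key:value` extensions.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct SearchQuery {
    pub terms: Vec<String>,
    pub extensions: Vec<(String, String)>,
}

impl SearchQuery {
    /// Total parse: never fails; anything that is not an extension is a term.
    pub fn parse(input: &str) -> Self {
        let mut query = SearchQuery::default();
        for token in input.split_whitespace() {
            match split_extension(token) {
                Some((key, value)) => {
                    query.extensions.push((key.to_string(), value.to_string()))
                }
                None => query.terms.push(token.to_string()),
            }
        }
        query
    }
}

/// Hypothetical helper: returns `(key, value)` when `token` is an extension,
/// splitting at the first colon (assumed).
fn split_extension(token: &str) -> Option<(&str, &str)> {
    let (key, value) = token.split_once(':')?;
    let key_ok = !key.is_empty()
        && key
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-');
    // A value starting with '/' screens out `scheme://` URLs.
    let value_ok = !value.is_empty() && !value.starts_with('/');
    (key_ok && value_ok).then_some((key, value))
}

impl fmt::Display for SearchQuery {
    /// Terms first, then `key:value` extensions, joined by single spaces.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let exts = self.extensions.iter().map(|(k, v)| format!("{k}:{v}"));
        let parts: Vec<String> = self.terms.iter().cloned().chain(exts).collect();
        write!(f, "{}", parts.join(" "))
    }
}
```

Because `Display` writes terms before extensions, `to_string()` may reorder the original input, but parsing the rendered string yields the same `SearchQuery` again.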
### Scoring formula (`search_score`)
For the parsed query's distinct `terms` (case-insensitive), against
`event.content` lowercased:
- `total` = number of distinct terms; if 0 → return 1.0.
- For each term, `count` = non-overlapping occurrences in content.
- `matched` = terms with `count ≥ 1`; `extra` = (Σ count) - matched (repeats
beyond the first hit of each matched term).
- `base = matched / total` (fraction of terms present, in [0, 1]).
- `bonus = (1 - 1/(1 + extra)) / total` (diminishing, strictly `< 1/total`, so a
partial match never reaches the next term's bucket).
- `score = (base + bonus).min(1.0)`.
Properties (asserted in tests): in [0, 1]; all terms once ⇒ 1.0; missing a term
⇒ `< 1.0`; more occurrences ⇒ ≥ score (monotonic, never exceeds 1.0); no terms
matched ⇒ exactly 0.0.
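
The formula above can be sketched as a free function. This is a minimal illustration, not the planned method: the real `search_score` lives on `Filter` and takes an `Event`, whereas this version takes the terms and content directly (an assumption made to keep the example self-contained).

```rust
use std::collections::BTreeSet;

/// Fraction-of-terms base plus a diminishing frequency bonus, capped at 1.0.
pub fn search_score(terms: &[&str], content: &str) -> f64 {
    let content = content.to_lowercase();

    // Distinct, case-insensitive terms; empty strings are dropped because an
    // empty pattern would match at every position.
    let distinct: BTreeSet<String> = terms
        .iter()
        .map(|t| t.to_lowercase())
        .filter(|t| !t.is_empty())
        .collect();

    let total = distinct.len();
    if total == 0 {
        return 1.0; // no free-text terms: nothing to enforce locally
    }

    let mut matched = 0usize;
    let mut occurrences = 0usize;
    for term in &distinct {
        // `str::matches` yields non-overlapping occurrences.
        let count = content.matches(term.as_str()).count();
        if count > 0 {
            matched += 1;
            occurrences += count;
        }
    }

    // Repeats beyond the first hit of each matched term.
    let extra = (occurrences - matched) as f64;
    let base = matched as f64 / total as f64;
    let bonus = (1.0 - 1.0 / (1.0 + extra)) / total as f64;
    (base + bonus).min(1.0)
}
```

As a worked example, take the terms `nostr`, `relay`, `zap` against the content `relay relay relay`: `total = 3`, `matched = 1`, `extra = 2`, so `base = 1/3`, `bonus = (1 - 1/3) / 3 = 2/9`, and the score is `5/9 ≈ 0.556`, which stays below the `2/3` that matching a second term would earn.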
## Code Organization
- **`coracle-lib/src/filters.rs`** — add the `search` field, builders, the
serde changes, `search_score`, the `matches` check, `rank_search_results`,
and the `group()` / `intersect_filters` updates. `use crate::search::SearchQuery;`.
- **`coracle-lib/src/search.rs`** — the `SearchQuery` type. New `pub mod search;`
in `lib.rs`, placed before `filters` (filters depends on it).
- **`coracle-lib/src/prelude.rs`** — add `pub use crate::search::SearchQuery;`
(the prelude already re-exports commonly used items).
- **`coracle-lib/tests/search.rs`** — hand-written integration tests (not tangled).
## Dependencies
None new. Parsing and matching use `std` only. No FTS engine — out of scope and
against the minimal-dependency rule.
## Narrative Notes
- Open with the philosophy: search is opt-in and relay-defined; no global index;
results partial and relay-ranked. Frame the local scorer as a fallback for
in-memory/offline querying, and warn (per rust-nostr's SDK) that re-filtering a
relay's returned results client-side can wrongly drop legitimate hits — relays
rank with richer, extension-aware logic.
- Explain *why* extensions are parsed but **ignored locally**: `sentiment:`,
`domain:`, etc. require data the client doesn't have, so honoring them locally
is impossible; we keep them in the typed model for *building/inspecting*
queries, not for local evaluation.
- Justify the score model concretely: NIP-50 mandates relevance ordering, so a
boolean match is the wrong shape — a [0,1] score lets us both include
(score > 0) and rank. Walk through the fraction + diminishing-bonus formula
with a small worked example.
- For grouping: reuse the chapter-11 reasoning — two filters with different
searches can't be unioned without changing semantics, so `search` joins the
group key. Show that `union_filters` then keeps them separate automatically.
- For `intersect_filters`: explain the one structural change — `combine_pair`
returns `Option<Filter>`; a pair whose two searches differ returns `None`, and
the caller emits both filters separately rather than concatenating queries.
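
The `Option`-returning shape of `combine_pair` might look like the sketch below. Everything except the name `combine_pair` is assumed for illustration: `Filter` is reduced to `since` and `search`, `intersect_pair` is a hypothetical caller, and letting a one-sided search carry through is a guess at behavior the plan does not spell out.

```rust
/// Minimal stand-in for `Filter`; only the fields this sketch needs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Filter {
    pub since: Option<u64>,
    pub search: Option<String>,
}

/// Returns `None` when the two searches conflict, so the caller can keep the
/// pair separate instead of fabricating a combined query.
pub fn combine_pair(a: &Filter, b: &Filter) -> Option<Filter> {
    let search = match (&a.search, &b.search) {
        (Some(x), Some(y)) if x != y => return None,
        // Assumption: a search on only one side (or equal on both) carries through.
        (Some(x), _) | (_, Some(x)) => Some(x.clone()),
        (None, None) => None,
    };
    // Intersecting two lower bounds keeps the later one.
    let since = match (a.since, b.since) {
        (Some(x), Some(y)) => Some(x.max(y)),
        (x, y) => x.or(y),
    };
    Some(Filter { since, search })
}

/// Hypothetical caller: emits the combined filter, or both inputs unchanged.
pub fn intersect_pair(a: &Filter, b: &Filter) -> Vec<Filter> {
    match combine_pair(a, b) {
        Some(combined) => vec![combined],
        None => vec![a.clone(), b.clone()],
    }
}
```

Returning `None` rather than a concatenated string avoids inventing a query whose meaning depends on how a particular relay tokenizes it.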
## Design Decisions
1. **Typed `SearchQuery`, lean/generic.** Terms + a generic ordered list of
`key:value` extensions, with `add_term`/`add_extension`. No per-extension
helpers or typed enums — keeps the surface small and forward-compatible with
relay-specific extensions. (Every reference treats search as opaque; the typed
model is our value-add.)
2. **Local relevance score in [0, 1]**, fraction-of-terms + diminishing frequency
bonus, capped at 1.0. Chosen over a boolean to model NIP-50's relevance
ordering. Extensions excluded from scoring.
3. **`matches` includes on score > 0** ("any term present"); ranking via
`rank_search_results` handles relevance + `limit`-after-sort.
4. **`search` participates in `group()`**, so `union_filters` never merges
distinct searches.
5. **`intersect_filters` keeps a conflicting-search pair separate** (combine
returns `Option`, `None` ⇒ emit both) rather than concatenating, per the
user's choice.
6. **Builder naming `add_search`/`clear_search`** to match the existing
`add_since`/`clear_since` vocabulary (not rust-nostr's `search`/`remove_search`).
7. **Unicode-aware lowercasing** (`to_lowercase`) for the local matcher rather
than ASCII-only, given multilingual nostr content; note the allocation
trade-off. Substring counting via `str::matches`.
8. **Extension parse heuristic** documented: a colon-bearing token such as
`mailto:alice@example.com` or `localhost:8080` is read as an extension (the
leading-`/` rule only screens out `scheme://` URLs); applications needing exact
control build `SearchQuery` field-by-field instead of parsing.
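
Decision 4 can be illustrated with a toy union that uses `search` as the whole group key. This is a sketch under assumptions: the real `group()` key covers other fields too, `Filter` is reduced to `authors` and `search`, and `union_by_search` is a hypothetical name.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for `Filter`; only the fields this sketch needs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Filter {
    pub authors: Vec<String>,
    pub search: Option<String>,
}

/// Merge filters that share the same `search`; distinct searches never merge.
pub fn union_by_search(filters: &[Filter]) -> Vec<Filter> {
    // Keyed by the search string, so each distinct query gets its own group.
    let mut groups: BTreeMap<Option<String>, Vec<String>> = BTreeMap::new();
    for filter in filters {
        let authors = groups.entry(filter.search.clone()).or_default();
        for author in &filter.authors {
            if !authors.contains(author) {
                authors.push(author.clone());
            }
        }
    }
    // BTreeMap iteration is ordered: `None` first, then searches alphabetically.
    groups
        .into_iter()
        .map(|(search, authors)| Filter { authors, search })
        .collect()
}
```

Because the search string is part of the key, the union step needs no search-specific logic: filters with different queries simply land in different groups.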
## Open Questions
- Exact wording of the frequency-bonus explanation — keep the formula in prose
light; lean on a worked example. (Resolved during writing.)
- Whether `rank_search_results` belongs as a free function (consistent with
`matches_any`/`union_filters`) — yes, free function.