# Plan: Search

## Topic Summary

NIP-50 adds an optional full-text `search` field to the subscription filter from
chapter 11. A relay that supports the capability interprets the query against
event content (and, for some kinds, other fields), returning results ordered by
relevance rather than `created_at`, with `limit` applied after ranking. The
query may carry `key:value` extensions (`domain:`, `language:`, `sentiment:`,
`nsfw:`, `include:spam`), which relays may support or ignore.

This chapter extends `Filter` with a `search` field, threads it through
serialization / grouping / set algebra, introduces a typed `SearchQuery` that
splits free-text terms from `key:value` extensions, and implements a best-effort
local relevance **score in [0, 1]** used to both include and rank events,
mirroring the NIP's "descending order by quality of result, limit last."

## Chapter Outline

1. **Intro / framing**: Search as a relay-defined, optional capability; content
   discovery is client-initiated routing, not a global index; results are
   partial and ranked by the relay. The local matcher is an honest best-effort
   fallback, not a reimplementation of relay search.
2. **The `search` field**: Add `search: Option<String>` to `Filter`; builder
   methods `add_search` / `clear_search`; note it joins the derived `Hash` (so
   `id()` covers it for free).
3. **Serialization**: Emit/parse a plain `"search"` key in the hand-written
   serde impl, present only when `Some`.
4. **The `SearchQuery` model**: A new `search` module providing terms + ordered
   `key:value` extensions, `parse`, `Display`, builders, and the `Filter` bridge.
5. **Scoring & matching**: `search_score` (fraction-of-terms + diminishing
   frequency bonus, capped at 1.0); `matches` includes an event when score > 0;
   `rank_search_results` sorts by score then `created_at` and applies `limit`.
6. **Grouping and set algebra**: `search` enters `group()` (distinct searches
   never merge); `union_filters` carries it through unchanged; `intersect_filters`
   keeps a conflicting-search pair separate instead of fabricating a combined query.
7. **What's next**: Brief pointer to the Domain section; relay selection,
   including discovering NIP-50-capable relays via relay metadata, is a later
   concern.
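
Item 2 of the outline can be sketched roughly as follows. This is a minimal illustration, assuming a trimmed-down `Filter` with only `limit` and `search` (the real struct has more fields) and assuming `id()` is a digest of the derived `Hash`; the plan only says that `id()` covers the new field for free, so the `DefaultHasher` choice here is hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical, trimmed-down stand-in for the crate's `Filter`.
#[derive(Debug, Clone, Default, PartialEq, Eq, Hash)]
pub struct Filter {
    pub limit: Option<usize>,
    /// NIP-50 full-text search query.
    pub search: Option<String>,
}

impl Filter {
    /// Sets the search query (`Some`).
    pub fn add_search(mut self, search: impl Into<String>) -> Self {
        self.search = Some(search.into());
        self
    }

    /// Removes the search query (`None`).
    pub fn clear_search(mut self) -> Self {
        self.search = None;
        self
    }

    /// Assumed shape of `id()`: a digest of the derived `Hash`, so the new
    /// field is covered without any extra code.
    pub fn id(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.hash(&mut hasher);
        hasher.finish()
    }
}
```

Because `search` is part of the derived `Hash`, adding a query changes the id and clearing it restores the original one.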

## API Design

### `coracle-lib/src/filters.rs` (extends existing `Filter`)

```rust
pub struct Filter {
    // ... existing fields ...
    /// NIP-50 full-text search query. Relay-interpreted; see `SearchQuery`.
    pub search: Option<String>,
}

impl Filter {
    pub fn add_search(self, search: impl Into<String>) -> Self; // sets Some
    pub fn clear_search(self) -> Self; // sets None

    /// Bridge to the typed model.
    pub fn add_search_query(self, query: &SearchQuery) -> Self; // = add_search(query.to_string())
    pub fn search_query(&self) -> Option<SearchQuery>; // parse the field back

    /// Best-effort local relevance score in [0.0, 1.0].
    /// Returns 1.0 when there is no search, or a search with no free-text
    /// terms (only extensions, which are unenforceable locally).
    pub fn search_score(&self, event: &Event) -> f64;
}

/// Filter `events` to those matching `filter`, sort by relevance
/// (search_score desc, then created_at desc), and apply `filter.limit`.
pub fn rank_search_results<'a>(filter: &Filter, events: &'a [Event]) -> Vec<&'a Event>;
```

`matches` gains a final check: `if self.search_score(event) == 0.0 { return false }`.
Because `search_score` returns 1.0 when there is no search (or no terms), this
only rejects when a search *with terms* matched none of them, i.e. "any term
present ⇒ included."
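
The include-then-rank flow described above might look like the sketch below. It assumes hypothetical, trimmed-down `Event` and `Filter` types, and it uses a placeholder scorer (plain fraction of terms found, every token treated as a term) rather than the full formula, so only the ordering and `limit` handling should be read as illustrating the plan.

```rust
/// Hypothetical stand-ins for the crate's `Event` and `Filter`.
pub struct Event {
    pub content: String,
    pub created_at: u64,
}

pub struct Filter {
    pub search: Option<String>,
    pub limit: Option<usize>,
}

impl Filter {
    /// Placeholder scorer: fraction of whitespace-separated terms found in
    /// the content. Returns 1.0 when there is no search or no terms.
    pub fn search_score(&self, event: &Event) -> f64 {
        let Some(search) = &self.search else { return 1.0 };
        let terms: Vec<String> = search.split_whitespace().map(|t| t.to_lowercase()).collect();
        if terms.is_empty() {
            return 1.0;
        }
        let content = event.content.to_lowercase();
        let matched = terms.iter().filter(|t| content.contains(t.as_str())).count();
        matched as f64 / terms.len() as f64
    }

    /// The existing field checks would run first; the search check is last.
    pub fn matches(&self, event: &Event) -> bool {
        self.search_score(event) > 0.0
    }
}

/// Keep matching events, sort by score desc then `created_at` desc, and
/// apply `limit` only after sorting.
pub fn rank_search_results<'a>(filter: &Filter, events: &'a [Event]) -> Vec<&'a Event> {
    let mut scored: Vec<(f64, &'a Event)> = events
        .iter()
        .filter(|e| filter.matches(e))
        .map(|e| (filter.search_score(e), e))
        .collect();
    scored.sort_by(|a, b| {
        b.0.total_cmp(&a.0)
            .then_with(|| b.1.created_at.cmp(&a.1.created_at))
    });
    let limit = filter.limit.unwrap_or(scored.len());
    scored.into_iter().take(limit).map(|(_, e)| e).collect()
}
```

Applying `limit` after the sort is what keeps a newer but weaker match from displacing an older full match.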

### `coracle-lib/src/search.rs` (new module)

```rust
use std::fmt;

/// A parsed NIP-50 search query: free-text terms plus `key:value` extensions.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct SearchQuery {
    pub terms: Vec<String>,
    pub extensions: Vec<(String, String)>, // ordered; repeats allowed
}

impl SearchQuery {
    pub fn new() -> Self;
    /// Total parse: split on whitespace; a token is an extension iff it is
    /// `key:value` with key in [A-Za-z0-9_-]+, non-empty value not starting
    /// with '/'. Everything else is a term. Never fails.
    pub fn parse(input: &str) -> Self;
    pub fn add_term(self, term: impl Into<String>) -> Self;
    pub fn add_extension(self, key: impl Into<String>, value: impl Into<String>) -> Self;
    pub fn is_empty(&self) -> bool;
}

impl fmt::Display for SearchQuery { /* terms first, then "key:value" exts, space-joined */ }
```

`Filter::matches` / `search_score` tokenize via `SearchQuery::parse`, using only
`terms` (extensions are ignored by the local matcher).
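
One possible reading of the parse and `Display` rules documented above is sketched here. The helper `is_extension` is a hypothetical name introduced for illustration, and splitting at the first colon is an assumption, since the plan does not say how tokens with several colons are handled.

```rust
use std::fmt;

/// Parsed NIP-50 query: free-text terms plus ordered `key:value` extensions.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct SearchQuery {
    pub terms: Vec<String>,
    pub extensions: Vec<(String, String)>,
}

/// Hypothetical helper: key must be `[A-Za-z0-9_-]+`, value must be
/// non-empty and must not start with '/'.
fn is_extension(key: &str, value: &str) -> bool {
    !key.is_empty()
        && key.chars().all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-')
        && !value.is_empty()
        && !value.starts_with('/')
}

impl SearchQuery {
    /// Total parse: every whitespace-separated token becomes either an
    /// extension or a term, so this never fails.
    pub fn parse(input: &str) -> Self {
        let mut query = SearchQuery::default();
        for token in input.split_whitespace() {
            // Assumption: split at the first colon only.
            match token.split_once(':') {
                Some((key, value)) if is_extension(key, value) => {
                    query.extensions.push((key.to_string(), value.to_string()));
                }
                _ => query.terms.push(token.to_string()),
            }
        }
        query
    }
}

impl fmt::Display for SearchQuery {
    /// Terms first, then `key:value` extensions, joined by single spaces.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let exts = self.extensions.iter().map(|(k, v)| format!("{k}:{v}"));
        let parts: Vec<String> = self.terms.iter().cloned().chain(exts).collect();
        write!(f, "{}", parts.join(" "))
    }
}
```

Under these rules `https://example.com` stays a term (its value starts with `/`), while `language:en` becomes an extension. Formatting moves extensions after the terms, so a round trip preserves the query's parts but not necessarily their original interleaving.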

### Scoring formula (`search_score`)

For the parsed query's distinct `terms` (case-insensitive), against
`event.content` lowercased:

- `total` = number of distinct terms; if 0 → return 1.0.
- For each term, `count` = non-overlapping occurrences in content.
- `matched` = number of terms with `count ≥ 1`; `extra` = (Σ count) − matched
  (repeats beyond the first hit of each matched term).
- `base = matched / total` (fraction of terms present, in [0, 1]).
- `bonus = (1 − 1/(1 + extra)) / total` (diminishing, strictly `< 1/total`, so a
  partial match never reaches the next term's bucket).
- `score = (base + bonus).min(1.0)`.

Properties (asserted in tests): score is in [0, 1]; all terms once ⇒ 1.0;
missing a term ⇒ `< 1.0`; more occurrences never lower the score (monotonic,
never exceeds 1.0); no terms matched ⇒ exactly 0.0.
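
The formula above can be transcribed almost line for line. The sketch below is a hypothetical free-function version that takes the terms and the content directly, whereas the planned `Filter::search_score` takes an `Event`. It assumes terms are non-empty, as they are when produced by whitespace splitting.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for `Filter::search_score`, operating on the
/// already-extracted free-text terms and the event content.
pub fn search_score(terms: &[&str], content: &str) -> f64 {
    // Distinct terms, compared case-insensitively.
    let distinct: BTreeSet<String> = terms.iter().map(|t| t.to_lowercase()).collect();
    let total = distinct.len();
    if total == 0 {
        return 1.0; // nothing to enforce locally
    }

    let haystack = content.to_lowercase();
    let mut matched = 0usize; // terms seen at least once
    let mut hits = 0usize; // sum of all occurrence counts
    for term in &distinct {
        // `str::matches` yields non-overlapping occurrences.
        let count = haystack.matches(term.as_str()).count();
        if count > 0 {
            matched += 1;
            hits += count;
        }
    }

    let extra = hits - matched; // repeats beyond each term's first hit
    let base = matched as f64 / total as f64;
    let bonus = (1.0 - 1.0 / (1.0 + extra as f64)) / total as f64;
    (base + bonus).min(1.0)
}
```

As a worked example, take the query `nostr zap` against the content `nostr nostr nostr`: `total = 2`, `matched = 1`, `extra = 2`, so `base = 0.5`, `bonus = (1 − 1/3) / 2 ≈ 0.333`, and the score is about 0.833. It stays below 1.0 because `zap` never appears, however often `nostr` repeats.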

## Code Organization

- **`coracle-lib/src/filters.rs`**: add the `search` field, builders, the
  serde changes, `search_score`, the `matches` check, `rank_search_results`,
  and the `group()` / `intersect_filters` updates. `use crate::search::SearchQuery;`.
- **`coracle-lib/src/search.rs`**: the `SearchQuery` type. New `pub mod search;`
  in `lib.rs`, placed before `filters` for readability (`filters` depends on
  it, although Rust does not require modules to be declared in dependency order).
- **`coracle-lib/src/prelude.rs`**: add `pub use crate::search::SearchQuery;`
  (the prelude already re-exports commonly used items).
- **`coracle-lib/tests/search.rs`**: hand-written integration tests (not tangled).

## Dependencies

None new. Parsing and matching use `std` only. No full-text search (FTS)
engine: it is out of scope and against the minimal-dependency rule.

## Narrative Notes

- Open with the philosophy: search is opt-in and relay-defined; no global index;
  results partial and relay-ranked. Frame the local scorer as a fallback for
  in-memory/offline querying, and warn (per rust-nostr's SDK) that re-filtering a
  relay's returned results client-side can wrongly drop legitimate hits, because
  relays rank with richer, extension-aware logic.
- Explain *why* extensions are parsed but **ignored locally**: `sentiment:`,
  `domain:`, etc. require data the client doesn't have, so honoring them locally
  is impossible; we keep them in the typed model for *building/inspecting*
  queries, not for local evaluation.
- Justify the score model concretely: NIP-50 says results SHOULD be returned in
  relevance order, so a boolean match is the wrong shape; a [0, 1] score lets us
  both include (score > 0) and rank. Walk through the fraction +
  diminishing-bonus formula with a small worked example.
- For grouping: reuse the chapter-11 reasoning. Two filters with different
  searches can't be unioned without changing semantics, so `search` joins the
  group key. Show that `union_filters` then keeps them separate automatically.
- For `intersect_filters`: explain the one structural change. `combine_pair`
  returns `Option<Filter>`; a pair whose two searches differ returns `None`, and
  the caller emits both filters separately rather than concatenating queries.
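
The `intersect_filters` change in the last note might be sketched as below. Everything beyond the `Option<Filter>` return and the "emit both" fallback is an assumption: the two-field `Filter`, the slice-based `intersect_filters` signature, and the way `kinds` are intersected are hypothetical simplifications of the chapter-11 code.

```rust
/// Hypothetical, trimmed-down `Filter`; `None` in a field means "no constraint".
#[derive(Debug, Clone, PartialEq)]
pub struct Filter {
    pub kinds: Option<Vec<u16>>,
    pub search: Option<String>,
}

/// Combine two filters, or return `None` when their searches conflict.
fn combine_pair(a: &Filter, b: &Filter) -> Option<Filter> {
    let search = match (&a.search, &b.search) {
        // Two different queries cannot be merged without inventing semantics.
        (Some(x), Some(y)) if x != y => return None,
        (Some(x), _) | (_, Some(x)) => Some(x.clone()),
        (None, None) => None,
    };
    let kinds = match (&a.kinds, &b.kinds) {
        (Some(x), Some(y)) => Some(
            x.iter()
                .copied()
                .filter(|k| y.contains(k))
                .collect::<Vec<u16>>(),
        ),
        (Some(x), None) | (None, Some(x)) => Some(x.clone()),
        (None, None) => None,
    };
    Some(Filter { kinds, search })
}

/// Pairwise intersection; a conflicting pair is emitted as two separate
/// filters instead of a fabricated combined query.
pub fn intersect_filters(left: &[Filter], right: &[Filter]) -> Vec<Filter> {
    let mut out = Vec::new();
    for a in left {
        for b in right {
            match combine_pair(a, b) {
                Some(combined) => out.push(combined),
                None => {
                    out.push(a.clone());
                    out.push(b.clone());
                }
            }
        }
    }
    out
}
```

Returning `None` keeps the conflict visible to the caller, which is what lets it fall back to emitting both filters rather than concatenating two unrelated queries into one string.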

## Design Decisions

1. **Typed `SearchQuery`, lean/generic.** Terms + a generic ordered list of
   `key:value` extensions, with `add_term`/`add_extension`. No per-extension
   helpers or typed enums, which keeps the surface small and forward-compatible
   with relay-specific extensions. (Every reference treats search as opaque; the
   typed model is our value-add.)
2. **Local relevance score in [0, 1]**, fraction-of-terms + diminishing frequency
   bonus, capped at 1.0. Chosen over a boolean to model NIP-50's relevance
   ordering. Extensions excluded from scoring.
3. **`matches` includes on score > 0** ("any term present"); ranking via
   `rank_search_results` handles relevance + `limit`-after-sort.
4. **`search` participates in `group()`**, so `union_filters` never merges
   distinct searches.
5. **`intersect_filters` keeps a conflicting-search pair separate** (combine
   returns `Option`, `None` ⇒ emit both) rather than concatenating, per the
   user's choice.
6. **Builder naming `add_search`/`clear_search`** to match the existing
   `add_since`/`clear_since` vocabulary (not rust-nostr's `search`/`remove_search`).
7. **Unicode-aware lowercasing** (`to_lowercase`) for the local matcher rather
   than ASCII-only, given multilingual nostr content; note the allocation
   trade-off. Substring counting via `str::matches`.
8. **Extension parse heuristic** documented: a colon-bearing token may be read
   as an extension (the leading-`/` check spares `scheme://` URLs, but not
   tokens such as `mailto:alice@example.com` or `12:30`); applications needing
   exact control build `SearchQuery` field-by-field instead of parsing.
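
Decision 7 can be illustrated with a small sketch. `count_occurrences` is a hypothetical helper name; it shows Unicode-aware lowercasing followed by `str::matches`, and the two `String` allocations per call are the trade-off the decision mentions.

```rust
/// Count non-overlapping, case-insensitive occurrences of `term` in `content`.
/// Both sides are lowercased with the Unicode-aware `to_lowercase`, which
/// allocates a new `String` each time.
pub fn count_occurrences(content: &str, term: &str) -> usize {
    if term.is_empty() {
        return 0; // an empty needle would otherwise match at every boundary
    }
    let haystack = content.to_lowercase();
    let needle = term.to_lowercase();
    haystack.matches(needle.as_str()).count()
}

/// ASCII-only variant, shown for contrast: non-ASCII letters keep their case.
pub fn count_occurrences_ascii(content: &str, term: &str) -> usize {
    if term.is_empty() {
        return 0;
    }
    let haystack = content.to_ascii_lowercase();
    let needle = term.to_ascii_lowercase();
    haystack.matches(needle.as_str()).count()
}
```

With Cyrillic content such as `ПРИВЕТ Привет привет`, the Unicode-aware version finds the term `привет` three times, while the ASCII-only version finds only the one already-lowercase occurrence.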

## Open Questions

- Exact wording of the frequency-bonus explanation: keep the formula in prose
  light; lean on a worked example. (Resolved during writing.)
- Whether `rank_search_results` belongs as a free function (consistent with
  `matches_any`/`union_filters`): yes, free function.