coracle-rust/book/research/search.md
2026-05-20 16:07:58 -07:00

# Research: Search
## Topic Summary
NIP-50 adds an optional full-text `search` field to the subscription filter
introduced in chapter 11. A relay that supports the capability interprets the
query string against event content (and, for some kinds, other fields),
returning results ordered by relevance rather than `created_at`. The query may
carry structured extensions in the form of `key:value` pairs — `domain:`,
`language:`, `sentiment:`, `nsfw:`, `include:spam` — which relays may support or
ignore.
The chapter will:
1. Add a `search` field to the existing `Filter` type, wiring it through
construction, serialization, hashing, grouping, and the union/intersect
utilities.
2. Introduce a typed `SearchQuery` model that splits free-text terms from
`key:value` extensions, so applications can build and inspect queries safely
instead of stringly-typed concatenation. (This is a deliberate departure
from the references, all of which treat the query as an opaque string.)
3. Implement a best-effort, case-insensitive local matcher over event content,
while documenting that real ranking and extension semantics are
relay-defined.
The code lives in `coracle-lib`: the `search` field extends `filters.rs`, and
the query model gets a dedicated `search.rs` module.
## Philosophy
From `ref/building-nostr`, the framing relevant to search is that **content
discovery on nostr is client-initiated routing through relay selection**, not a
query against a global index. Searching is "knowing where to send queries." A
relay that supports NIP-50 is exercising an *optional, relay-authored
capability* — like content curation or access control — and defines its own
matching semantics, including which extensions it honors. This mirrors the NIP's
own "relays SHOULD ignore extensions they don't support."
Three principles bear directly on the chapter's voice:
- **No guaranteed completeness.** "No implementation will have a complete view
of every heuristic that is applicable" — so search results are neither global
nor exhaustive. A client queries the relays it knows support search and
accepts a partial view that depends on which relays it asked. The chapter
should state this plainly rather than hide it.
- **Indexing is the curator's responsibility, not the user's.** Authors publish
signed events; relays (or indexing services) that *want* content discoverable
maintain the index. Clients do nothing special beyond sending a `search`
filter to a search-capable relay.
- **Publicity, not privacy.** Full-text indexing makes content patterns
discoverable and gives relay operators visibility into queries. The honest
framing: search is a publicity feature.
The takeaway for our library: model `search` as a first-class but optional
filter field, keep the query structured enough that applications can reason
about it, and be candid that local matching is a best-effort approximation of a
relay-defined operation.
## Reference Implementation Analysis
### applesauce
`search` is an optional string on an extended `Filter` type
(`packages/core/src/helpers/filter.ts`): `Filter = CoreFilter & { search?: string }`,
extending nostr-tools' base type. **Opaque** — no extension parsing.
Dual-mode: relay subscriptions pass the string through verbatim; a local SQLite
backend (`packages/sqlite`) indexes content into an FTS5 table and runs
`events_search MATCH ?` with the raw string double-quote-escaped. Local
client-side `matchFilter()` **ignores** the search field entirely. Pluggable
"search content formatters" decide what gets indexed (default: `content`;
enhanced: kind-0 profile fields plus `t`/`subject`/`title`/`summary`/`d` tags).
Supports `order: "created_at" | "rank"` for FTS5 ranking. Low coupling; SQLite
is optional. No query-extension awareness anywhere.
### ndk
`search?: string` on `NDKFilter` (`core/src/subscription/index.ts:30`).
**Opaque, relay-only.** No parsing, no validation (filter-validation pipeline
skips it), no client-side matching (delegates to nostr-tools' `matchFilters`,
which ignores search). No helper functions for building search filters; callers
construct `{ search: "..." }` by hand. The field is serialized and sent to
relays as-is. No NIP-11 capability negotiation or fallback. Minimal by design.
### nostr-gadgets
Re-uses `@nostr/tools`' `Filter` type (`search?: string`). **Opaque,
relay-only.** Notably its local stores *reject* search: the in-memory store
returns an empty set if `filter.search` is present, and the RedEventStore docs
state "any filters supported (except 'search')." Provides a hardcoded
`SEARCH_RELAYS` constant (`defaults.ts`): `relay.nostr.band`, `nostr.wine`,
`relay.noswhere.com`, `relay.nos.today`. No query builders, no dynamic relay
capability detection.
### nostrlib (Go)
`Search string` on the `Filter` struct (`filter.go`), (de)serialized as a plain
`"search"` JSON key. The core `Filter.Matches` / `MatchesIgnoringTimestampConstraints`
**ignores** search — matching is delegated to eventstore backends. Key-value
backends (BoltDB, LMDB, MMM) return nothing for search queries; only the **Bleve**
backend implements real full-text search: per-document language auto-detection
(lingua-go, 22 languages), per-language analyzers, boolean query syntax
(`AND/OR/NOT`, parens, quoted phrases), NIP-27 reference extraction with 2× boost,
and case-insensitive substring validation of quoted phrases. Kind-0 profiles index
name/display_name/about; reposts unpack inner events. Khatru relay policies
`NoSearchQueries`/`RemoveSearchQueries` let operators disable search. SDK
`SearchUsers()` just sends a `Search` filter to designated user-search relays. No
NIP-50 *extension* parsing (treats `domain:x` as a regular word); a 2-char minimum
query length is enforced by Bleve.
### nostr-tools
`search?: string` on the base `Filter` (`filter.ts`). **The canonical
"defined-but-unused" implementation.** `matchFilter()`/`matchFilters()` do not
check search at all; `mergeFilters()` drops it entirely. No parsing, no
validation, no helpers, no tests for the field. Strictly a transport-layer
placeholder so applications can send search filters to relays. Minimal-deps
philosophy: search is purely a relay concern.
### rust-nostr
The most directly relevant reference (also Rust). In
`crates/nostr/src/filter.rs`:
```rust
/// A string describing a query in a human-readable form, i.e. "best nostr apps"
/// <https://github.com/nostr-protocol/nips/blob/master/50.md>
#[serde(skip_serializing_if = "Option::is_none")]
#[serde(default)]
pub search: Option<String>,
```
Builder API: `search<S: Into<String>>(self, value: S) -> Self` and
`remove_search(self) -> Self` — symmetric, generic, `#[inline]`. **Opaque** (no
extension parsing).
Local matching (`search_match`):
```rust
fn search_match(&self, event: &Event) -> bool {
    match &self.search {
        Some(query) => event.content.as_bytes()
            .windows(query.len())
            .any(|window| window.eq_ignore_ascii_case(query.as_bytes())),
        None => true,
    }
}
```
Case-insensitive **ASCII** substring via sliding window; `None` matches
everything. Gated by a `MatchEventOptions { nip50: bool, .. }` flag (default
true). Notably, the SDK relay sets `.nip50(false)` with the comment "Skip NIP-50
matches since they may create issues and ban non-malicious relays" — i.e.
client-side re-matching of a relay's search results can wrongly drop valid hits.
DB backends (LMDB, SQLite) extend matching to a fixed set of searchable tags —
`title`, `description`, `subject`, `name` — lowercasing the query once up front;
empty search → no results. A `Features { full_text_search: bool }` flag declares
backend capability.
Patterns worth emulating: `Into<String>` builder, `skip_serializing_if` for a
clean wire format, an explicit opt-out for search matching, and ASCII case
folding where speed matters more than non-ASCII correctness.
### welshman
The TypeScript toolkit our library descends from. `search?: string` on `Filter`
(`packages/util/src/Filters.ts`). It is the **only reference that matches search
locally and threads it through filter utilities**:
```typescript
export const matchFilter = (filter, event) => {
  if (!nostrToolsMatchFilter(filter, event)) return false
  if (filter.search) {
    const content = event.content.toLowerCase()
    const terms = filter.search.toLowerCase().split(/\s+/g)
    for (const term of terms) {
      if (content.includes(term)) return true
      return false // <-- bug: returns after first term
    }
  }
  return true
}
```
The intent is term-splitting + case-insensitive substring, but the early
`return false` means only the first term is ever checked. **A correct version
should decide AND vs OR across terms explicitly** — this is the one place we can
clearly improve on the reference.
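
One way to make the AND/OR decision explicit is sketched below in Rust. It is a minimal illustration, not welshman's code and not the chapter's final matcher: `TermMode` and `matches_terms` are hypothetical names, and treating a query with no terms as a match is an assumption.

```rust
/// How multiple whitespace-separated terms combine.
/// Hypothetical type, not taken from welshman or any other reference.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TermMode {
    /// Every term must appear in the content (AND).
    All,
    /// At least one term must appear in the content (OR).
    Any,
}

/// Case-insensitive substring match of each search term against `content`.
/// Assumption: an empty or whitespace-only query matches everything.
pub fn matches_terms(content: &str, search: &str, mode: TermMode) -> bool {
    let content = content.to_lowercase();
    let query = search.to_lowercase();
    let mut terms = query.split_whitespace().peekable();

    // No terms at all: nothing to enforce.
    if terms.peek().is_none() {
        return true;
    }

    match mode {
        // Unlike the early `return false` above, every term is inspected.
        TermMode::All => terms.all(|term| content.contains(term)),
        TermMode::Any => terms.any(|term| content.contains(term)),
    }
}
```

With `TermMode::All`, the query `nostr relays` no longer matches content that mentions only `nostr`, which the welshman snippet above would accept because it returns on the first term.
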
Filter utilities (directly parallel to our `group`/`union_filters`/`intersect_filters`):
- `calculateFilterGroup` pushes `search:${search}` into the group key — **a
filter with a search is only mergeable with an identical search.**
- `unionFilters` treats `search` (like `since`/`until`/`limit`) as a scalar
preserved from the first filter in the group, **not merged**.
- `intersectFilters` concatenates differing searches with a space
(`[a, b].join(" ")`) — modeling "must match both" as a compound query — and
takes whichever is present otherwise.
- `getFilterId` includes search in the deterministic hash, so different searches
never dedupe.
Search-relay selection lives in the router: `getSearchRelays()` returns relays
whose NIP-11 `supported_nips` includes `"50"`. No extension parsing.
## Common Patterns
- **`search` is universally an optional plain string.** Every reference models
it as `Option<String>` / `search?: string`. None parse the `key:value`
extensions — they treat the whole query as opaque and let the relay interpret
it. Our typed `SearchQuery` is therefore a value-add, not a port.
- **Local matching is the exception, not the rule.** nostr-tools, ndk,
applesauce (in `matchFilter`), and nostrlib's core `Filter` all *ignore*
search locally; matching happens relay-side (or in a dedicated index like
Bleve/FTS5). Only rust-nostr and welshman attempt local matching, both with
case-insensitive substring over `content`.
- **Where matching exists, it's case-insensitive substring** — rust-nostr does
ASCII-only `eq_ignore_ascii_case` over byte windows (whole query as one
needle); welshman lowercases and splits on whitespace into terms (intending
multi-term, buggily). DB backends additionally search a small fixed set of
metadata tags (`title`, `description`, `subject`, `name`).
- **Search makes filters un-mergeable.** welshman's group key encodes the
general intuition: two filters with different search strings can't be
unioned without changing semantics. rust-nostr sidesteps merging at this layer
entirely.
- **Client-side re-matching is risky.** rust-nostr's SDK disables NIP-50
matching when filtering relay results, because a relay's notion of a match
(ranked, fuzzy, multi-field, extension-aware) is richer than a client's
substring check — re-filtering can drop legitimate hits.
- **Relay selection by NIP-11.** Search-capable relays are discovered via
`supported_nips` containing `50` (welshman) or a hardcoded allowlist
(nostr-gadgets). This is an application/networking concern, out of scope for
`coracle-lib`.
## Considerations for Our Implementation
**Filter field.** Add `pub search: Option<String>` to `Filter`. Follow
rust-nostr: `add_search<S: Into<String>>(self, S)` and `clear_search(self)` to
match the existing `add_*`/`clear_*` builder vocabulary (our methods are named
`add_since`/`clear_since`, etc., so `add_search`/`clear_search` fits better than
rust-nostr's `search`/`remove_search`). Once added, the field participates in
the derived `Hash` automatically (so `id()` covers it for free), but
serialization, `group()`, `union_filters`, `intersect_filters`, and `matches()`
all need explicit updates.
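
The builder pair and the "hash for free" point can be sketched as follows. This is an illustration under assumptions, not the real `coracle-lib` code: the `Filter` here is a pared-down stand-in with only two fields, and `id()` is approximated with std's `DefaultHasher` rather than whatever the library actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Pared-down stand-in for the real `Filter`; only the fields needed here.
#[derive(Clone, Debug, Default, PartialEq, Eq, Hash)]
pub struct Filter {
    pub since: Option<u64>,
    pub search: Option<String>,
}

impl Filter {
    /// Set the search query; accepts `&str`, `String`, or anything `Into<String>`.
    pub fn add_search<S: Into<String>>(mut self, query: S) -> Self {
        self.search = Some(query.into());
        self
    }

    /// Remove the search query, leaving other fields untouched.
    pub fn clear_search(mut self) -> Self {
        self.search = None;
        self
    }

    /// Stand-in for the real `id()`: a hash over the derived `Hash` impl,
    /// so a new field is covered without further code.
    pub fn id(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.hash(&mut hasher);
        hasher.finish()
    }
}
```

Because `search` is part of the derived `Hash`, two filters that differ only in their search string produce different ids in this sketch, which is the property the dedupe logic relies on.
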
**Serialization.** Our `Filter` has hand-written serde (to flatten `#tag` keys).
Add `search` as a plain `"search"` key — emit only when `Some` (mirroring
`since`/`until`/`limit`), and read it in the visitor's match arm. A round-trip
test must cover it.
**Grouping / union / intersect.** Per welshman: include `search` in the
`group()` hash so filters with different searches land in different groups (never
merged). In `union_filters`, since group members share an identical search by
construction, the search carries over via the `or_insert_with(|| filter.clone())`
seed — no special merge needed, but worth a comment. In `combine_pair`
(intersect), decide how to combine two searches: welshman concatenates with a
space. Concatenation is defensible ("must match both") but lossy and surprising;
a cleaner rule for a typed model is to **merge two `SearchQuery` values** (union
their terms and extensions) or, if we keep the field as a string at this layer,
to concatenate with a space and document it. Recommend: concatenate with a space
when both present and differ, matching welshman, and note the limitation.
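
The recommended intersect rule is small enough to sketch directly. `combine_search` is a hypothetical helper name; how it plugs into `combine_pair` is left open here.

```rust
/// Combine the `search` fields of two filters being intersected.
/// Hypothetical helper illustrating the "concatenate with a space" rule.
pub fn combine_search(a: Option<&str>, b: Option<&str>) -> Option<String> {
    match (a, b) {
        // Identical searches stay as they are rather than being doubled.
        (Some(x), Some(y)) if x == y => Some(x.to_string()),
        // Differing searches become one compound query.
        (Some(x), Some(y)) => Some(format!("{} {}", x, y)),
        // Only one side has a search: take whichever is present.
        (Some(x), None) | (None, Some(x)) => Some(x.to_string()),
        (None, None) => None,
    }
}
```

The compound string only means "must match both" if the relay (or a local matcher) treats space-separated terms as AND, which is the limitation worth documenting alongside the function.
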
**Local matching.** Extend `Filter::matches` to test `search` *after* the cheap
scalar checks. Best-effort, case-insensitive. Two design choices to settle in
planning:
1. Whole-query substring (rust-nostr) vs. term-split AND/OR (welshman, fixed).
A typed `SearchQuery` makes term-split natural: match the free-text terms
(AND across terms reads as the intuitive "all words present"; document it),
and treat `key:value` extensions as *unenforceable locally* — i.e. ignored by
the local matcher, since we can't evaluate `sentiment:` or `domain:` without
external data. This honesty matches the NIP.
2. ASCII (`eq_ignore_ascii_case`) vs. Unicode lowercasing. ASCII is what
rust-nostr ships and is allocation-free; Unicode `to_lowercase` is more
correct for non-ASCII content (accented Latin as well as non-Latin scripts)
but allocates. Given nostr's multilingual content, prefer Unicode
`to_lowercase` for the local matcher (correctness over micro-optimization,
consistent with our "clarity over cleverness" rule) and note the trade-off.
Also document, per rust-nostr's SDK, that local matching is a *fallback*:
relay results should generally be trusted as-is rather than re-filtered.
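
The ASCII versus Unicode trade-off is easy to see side by side. The two helpers below are hypothetical names written for this comparison. The ASCII one follows the style of the rust-nostr snippet, with an added guard for the empty needle because `slice::windows` panics on a window size of zero.

```rust
/// ASCII-only, allocation-free check in the style of the rust-nostr snippet.
pub fn contains_ascii_ci(haystack: &str, needle: &str) -> bool {
    // `windows(0)` panics, so handle the empty needle up front.
    // Assumption: an empty needle matches everything.
    if needle.is_empty() {
        return true;
    }
    haystack
        .as_bytes()
        .windows(needle.len())
        .any(|window| window.eq_ignore_ascii_case(needle.as_bytes()))
}

/// Unicode-aware check: allocates two lowercased copies per call.
pub fn contains_unicode_ci(haystack: &str, needle: &str) -> bool {
    haystack.to_lowercase().contains(&needle.to_lowercase())
}
```

Both helpers agree on plain ASCII input, but only the Unicode one finds `café` in `CAFÉ culture` or `привет` in `ПРИВЕТ, мир`. Note that `to_lowercase` is still simple lowercasing rather than full Unicode case folding, so a few cross-script or locale-specific pairs will not match either way.
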
**`SearchQuery` model (new `search.rs`).** A struct splitting a query into
free-text `terms: Vec<String>` and `extensions: Vec<(String, String)>` (ordered;
NIP-50 doesn't forbid repeats, and order can matter to relays). Parsing: split on
whitespace, treat a token containing `:` (with a non-empty key before it) as an
extension, everything else as a term. Provide:
- `SearchQuery::parse(&str) -> SearchQuery` (total, never fails — unknown shapes
fall back to terms).
- `Display` / `to_string()` that re-renders to the wire string (terms first or
preserve order; planning to decide).
- Builder helpers: `term`, `extension`, plus typed convenience for the
spec-defined extensions (`domain`, `language`, `sentiment`, `nsfw`,
`include_spam`) — optional, decide scope in planning.
- A bridge to `Filter`: `Filter::add_search` can accept `impl Into<String>` so
both a raw string and `query.to_string()` work; optionally
`Filter::search_query()` to parse the field back out.
Keep `sentiment`/`nsfw` values as strings (or small enums) — leaning toward
strings to stay forward-compatible with relay-specific values, with named
constructors for the common cases.
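
A minimal sketch of the model described above, under stated assumptions: rendering puts terms first (the text leaves ordering to planning), extension values stay plain strings, and `matches_content` (a hypothetical method name) applies AND across terms while ignoring extensions.

```rust
use std::fmt;

#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct SearchQuery {
    pub terms: Vec<String>,
    /// Ordered and repeatable, as `(key, value)` pairs.
    pub extensions: Vec<(String, String)>,
}

impl SearchQuery {
    /// Total parse: never fails; anything that is not `key:value` is a term.
    pub fn parse(input: &str) -> Self {
        let mut query = SearchQuery::default();
        for token in input.split_whitespace() {
            match token.split_once(':') {
                // A non-empty key before the first `:` marks an extension.
                Some((key, value)) if !key.is_empty() => {
                    query.extensions.push((key.to_string(), value.to_string()));
                }
                _ => query.terms.push(token.to_string()),
            }
        }
        query
    }

    pub fn term(mut self, term: impl Into<String>) -> Self {
        self.terms.push(term.into());
        self
    }

    pub fn extension(mut self, key: impl Into<String>, value: impl Into<String>) -> Self {
        self.extensions.push((key.into(), value.into()));
        self
    }

    /// Best-effort local match: every term must appear in `content`
    /// (case-insensitive); extensions are not evaluated.
    pub fn matches_content(&self, content: &str) -> bool {
        let content = content.to_lowercase();
        self.terms.iter().all(|t| content.contains(&t.to_lowercase()))
    }
}

impl fmt::Display for SearchQuery {
    /// Render the wire string. Assumption: terms first, then extensions.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let terms = self.terms.iter().cloned();
        let extensions = self.extensions.iter().map(|(k, v)| format!("{}:{}", k, v));
        let parts: Vec<String> = terms.chain(extensions).collect();
        write!(f, "{}", parts.join(" "))
    }
}
```

Since `Display` yields a `String` via `to_string()`, the result can be passed straight to an `add_search` that takes `impl Into<String>`. One consequence of the token rule as written is that a URL such as `https://example.com` parses as an extension with key `https`; whether that needs special handling is a planning decision.
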
**Dependencies.** None new. Parsing is plain string handling; matching uses std.
Avoid pulling in a real FTS engine — out of scope and against the
minimal-dependency rule.
**Out of scope (defer / mention only).** Real relevance ranking; relay-side
indexing; NIP-11 search-relay discovery (a networking concern); the `order`
hint from applesauce; multi-field/tag matching beyond `content` (could mention
`title`/`subject` as a possible extension but keep the matcher content-only for
clarity).