# Research: Search

## Topic Summary

NIP-50 adds an optional full-text `search` field to the subscription filter
introduced in chapter 11. A relay that supports the capability interprets the
query string against event content (and, for some kinds, other fields),
returning results ordered by relevance rather than `created_at`. The query may
carry structured extensions in the form of `key:value` pairs (`domain:`,
`language:`, `sentiment:`, `nsfw:`, `include:spam`), which relays may support or
ignore.

The chapter will:

1. Add a `search` field to the existing `Filter` type, wiring it through
   construction, serialization, hashing, grouping, and the union/intersect
   utilities.
2. Introduce a typed `SearchQuery` model that splits free-text terms from
   `key:value` extensions, so applications can build and inspect queries safely
   instead of stringly-typed concatenation. (This is a deliberate departure
   from every reference, which treats the query as an opaque string.)
3. Implement a best-effort, case-insensitive local matcher over event content,
   while documenting that real ranking and extension semantics are
   relay-defined.

The code lives in `coracle-lib`: the `search` field extends `filters.rs`, and
the query model gets a dedicated `search.rs` module.

## Philosophy

From `ref/building-nostr`, the framing relevant to search is that **content
discovery on nostr is client-initiated routing through relay selection**, not a
query against a global index. Searching is "knowing where to send queries." A
relay that supports NIP-50 is exercising an *optional, relay-authored
capability* (like content curation or access control) and defines its own
matching semantics, including which extensions it honors. This mirrors the NIP's
own "relays SHOULD ignore extensions they don't support."

Three principles bear directly on the chapter's voice:

- **No guaranteed completeness.** "No implementation will have a complete view
  of every heuristic that is applicable", so search results are neither global
  nor exhaustive. A client queries the relays it knows support search and
  accepts a partial, spontaneous view. This should be stated honestly, not hidden.
- **Indexing is the curator's responsibility, not the user's.** Authors publish
  signed events; relays (or indexing services) that *want* content discoverable
  maintain the index. Clients do nothing special beyond sending a `search`
  filter to a search-capable relay.
- **Publicity, not privacy.** Full-text indexing makes content patterns
  discoverable and gives relay operators visibility into queries. The honest
  framing: search is a publicity feature.

The takeaway for our library: model `search` as a first-class but optional
filter field, keep the query structured enough that applications can reason
about it, and be candid that local matching is a best-effort approximation of a
relay-defined operation.

## Reference Implementation Analysis

### applesauce

`search` is an optional string on an extended `Filter` type
(`packages/core/src/helpers/filter.ts`): `Filter = CoreFilter & { search?: string }`,
extending nostr-tools' base type. **Opaque**: no extension parsing.

Dual-mode: relay subscriptions pass the string through verbatim; a local SQLite
backend (`packages/sqlite`) indexes content into an FTS5 table and runs
`events_search MATCH ?` with the raw string double-quote-escaped. Local
client-side `matchFilter()` **ignores** the search field entirely. Pluggable
"search content formatters" decide what gets indexed (default: `content`;
enhanced: kind-0 profile fields plus `t`/`subject`/`title`/`summary`/`d` tags).
Supports `order: "created_at" | "rank"` for FTS5 ranking. Low coupling; SQLite
is optional. No query-extension awareness anywhere.

### ndk

`search?: string` on `NDKFilter` (`core/src/subscription/index.ts:30`).
**Opaque, relay-only.** No parsing, no validation (filter-validation pipeline
skips it), no client-side matching (delegates to nostr-tools' `matchFilters`,
which ignores search). No helper functions for building search filters; callers
construct `{ search: "..." }` by hand. The field is serialized and sent to
relays as-is. No NIP-11 capability negotiation or fallback. Minimal by design.

### nostr-gadgets

Re-uses `@nostr/tools`' `Filter` type (`search?: string`). **Opaque,
relay-only.** Notably its local stores *reject* search: the in-memory store
returns an empty set if `filter.search` is present, and the RedEventStore docs
state "any filters supported (except 'search')." Provides a hardcoded
`SEARCH_RELAYS` constant (`defaults.ts`): `relay.nostr.band`, `nostr.wine`,
`relay.noswhere.com`, `relay.nos.today`. No query builders, no dynamic relay
capability detection.

### nostrlib (Go)

`Search string` on the `Filter` struct (`filter.go`), (de)serialized as a plain
`"search"` JSON key. The core `Filter.Matches` / `MatchesIgnoringTimestampConstraints`
methods **ignore** search; matching is delegated to eventstore backends. Key-value
backends (BoltDB, LMDB, MMM) return nothing for search queries; only the **Bleve**
backend implements real full-text search: per-document language auto-detection
(lingua-go, 22 languages), per-language analyzers, boolean query syntax
(`AND/OR/NOT`, parens, quoted phrases), NIP-27 reference extraction with 2× boost,
and case-insensitive substring validation of quoted phrases. Kind-0 profiles index
name/display_name/about; reposts unpack inner events. Khatru relay policies
`NoSearchQueries`/`RemoveSearchQueries` let operators disable search. SDK
`SearchUsers()` just sends a `Search` filter to designated user-search relays. No
NIP-50 *extension* parsing (treats `domain:x` as a regular word); a 2-char minimum
query length is enforced by Bleve.

### nostr-tools

`search?: string` on the base `Filter` (`filter.ts`). **The canonical
"defined-but-unused" implementation.** `matchFilter()`/`matchFilters()` do not
check search at all; `mergeFilters()` drops it entirely. No parsing, no
validation, no helpers, no tests for the field. Strictly a transport-layer
placeholder so applications can send search filters to relays. Minimal-deps
philosophy: search is purely a relay concern.

### rust-nostr

The most directly relevant reference (also Rust). In
`crates/nostr/src/filter.rs`:

```rust
/// A string describing a query in a human-readable form, i.e. "best nostr apps"
/// <https://github.com/nostr-protocol/nips/blob/master/50.md>
#[serde(skip_serializing_if = "Option::is_none")]
#[serde(default)]
pub search: Option<String>,
```

Builder API: `search<S: Into<String>>(self, value: S) -> Self` and
`remove_search(self) -> Self`: symmetric, generic, `#[inline]`. **Opaque** (no
extension parsing).

Local matching (`search_match`):

```rust
fn search_match(&self, event: &Event) -> bool {
    match &self.search {
        Some(query) => event.content.as_bytes()
            .windows(query.len())
            .any(|window| window.eq_ignore_ascii_case(query.as_bytes())),
        None => true,
    }
}
```

Case-insensitive **ASCII** substring via sliding window; `None` matches
everything. Gated by a `MatchEventOptions { nip50: bool, .. }` flag (default
true). Notably, the SDK relay sets `.nip50(false)` with the comment "Skip NIP-50
matches since they may create issues and ban non-malicious relays"; that is,
client-side re-matching of a relay's search results can wrongly drop valid hits.
DB backends (LMDB, SQLite) extend matching to a fixed set of searchable tags
(`title`, `description`, `subject`, `name`), lowercasing the query once up front;
an empty search yields no results. A `Features { full_text_search: bool }` flag
declares backend capability.

Patterns worth emulating: `Into<String>` builder, `skip_serializing_if` for a
clean wire format, an explicit opt-out for search matching, ASCII case folding
for speed.

### welshman

The TypeScript toolkit our library descends from. `search?: string` on `Filter`
(`packages/util/src/Filters.ts`). It is the **only reference that matches search
locally and threads it through filter utilities**:

```typescript
export const matchFilter = (filter, event) => {
  if (!nostrToolsMatchFilter(filter, event)) return false
  if (filter.search) {
    const content = event.content.toLowerCase()
    const terms = filter.search.toLowerCase().split(/\s+/g)
    for (const term of terms) {
      if (content.includes(term)) return true
      return false // <-- bug: returns after first term
    }
  }
  return true
}
```

The intent is term-splitting + case-insensitive substring, but the early
`return false` means only the first term is ever checked. **A correct version
should decide AND vs OR across terms explicitly**; this is the one place we can
clearly improve on the reference.

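The AND-versus-OR decision can be made explicit in the matcher's signature. The
following is a minimal Rust sketch, not welshman's or `coracle-lib`'s actual
code: `TermMode` and `matches_terms` are hypothetical names, and treating an
empty query as a match is an assumption (rust-nostr's DB backends, by contrast,
return nothing for an empty search).

```rust
/// How whitespace-separated terms combine (hypothetical name).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TermMode {
    /// Every term must appear in the content.
    All,
    /// At least one term must appear in the content.
    Any,
}

/// Case-insensitive substring match of each term against `content`.
/// Assumption: an empty or whitespace-only query matches everything.
pub fn matches_terms(content: &str, query: &str, mode: TermMode) -> bool {
    let haystack = content.to_lowercase();
    let query = query.to_lowercase();
    let mut terms = query.split_whitespace().peekable();

    if terms.peek().is_none() {
        return true;
    }

    match mode {
        // Unlike the loop above, every term is inspected before deciding.
        TermMode::All => terms.all(|term| haystack.contains(term)),
        TermMode::Any => terms.any(|term| haystack.contains(term)),
    }
}
```

With `TermMode::All`, the query `nostr relays` does not match content that
mentions only nostr; with `TermMode::Any` it does.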
Filter utilities (directly parallel to our `group`/`union_filters`/`intersect_filters`):

- `calculateFilterGroup` pushes `search:${search}` into the group key, so **a
  filter with a search is only mergeable with an identical search.**
- `unionFilters` treats `search` (like `since`/`until`/`limit`) as a scalar
  preserved from the first filter in the group, **not merged**.
- `intersectFilters` concatenates differing searches with a space
  (`[a, b].join(" ")`), modeling "must match both" as a compound query, and
  takes whichever is present otherwise.
- `getFilterId` includes search in the deterministic hash, so different searches
  never dedupe.

Search-relay selection lives in the router: `getSearchRelays()` returns relays
whose NIP-11 `supported_nips` includes `50`. No extension parsing.

## Common Patterns

- **`search` is universally an optional plain string.** Every reference models
  it as `Option<String>` / `search?: string`. None parse the `key:value`
  extensions; they treat the whole query as opaque and let the relay interpret
  it. Our typed `SearchQuery` is therefore a value-add, not a port.
- **Local matching is the exception, not the rule.** nostr-tools, ndk,
  applesauce (in `matchFilter`), and nostrlib's core `Filter` all *ignore*
  search locally; matching happens relay-side (or in a dedicated index like
  Bleve/FTS5). Only rust-nostr and welshman attempt local matching, both with
  case-insensitive substring over `content`.
- **Where matching exists, it's case-insensitive substring.** rust-nostr does
  ASCII-only `eq_ignore_ascii_case` over byte windows (whole query as one
  needle); welshman lowercases and splits on whitespace into terms (intending
  multi-term, buggily). DB backends additionally search a small fixed set of
  metadata tags (`title`, `description`, `subject`, `name`).
- **Search makes filters un-mergeable.** Both welshman (group key) and the
  general intuition agree: two filters with different search strings can't be
  unioned without changing semantics. rust-nostr sidesteps merging at this layer
  entirely.
- **Client-side re-matching is risky.** rust-nostr's SDK disables NIP-50
  matching when filtering relay results, because a relay's notion of a match
  (ranked, fuzzy, multi-field, extension-aware) is richer than a client's
  substring check; re-filtering can drop legitimate hits.
- **Relay selection by NIP-11.** Search-capable relays are discovered via
  `supported_nips` containing `50` (welshman) or a hardcoded allowlist
  (nostr-gadgets). This is an application/networking concern, out of scope for
  `coracle-lib`.

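For completeness, the NIP-11 discovery pattern amounts to a filter over relay
metadata. This sketch sits on the application side, outside `coracle-lib`;
`RelayInfo` and `search_relays` are hypothetical names, and the struct carries
only the two fields the check needs.

```rust
/// Hypothetical, trimmed-down view of a relay's NIP-11 information document.
#[derive(Clone, Debug)]
pub struct RelayInfo {
    pub url: String,
    pub supported_nips: Vec<u16>,
}

/// Return the URLs of relays that advertise NIP-50 support.
pub fn search_relays(relays: &[RelayInfo]) -> Vec<&str> {
    relays
        .iter()
        .filter(|relay| relay.supported_nips.contains(&50))
        .map(|relay| relay.url.as_str())
        .collect()
}
```

Advertising NIP-50 says nothing about which extensions a relay honors, so the
result is a candidate list rather than a guarantee of behavior.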
## Considerations for Our Implementation

**Filter field.** Add `pub search: Option<String>` to `Filter`. Follow
rust-nostr's builder shape, but name the methods
`add_search<S: Into<String>>(self, S)` and `clear_search(self)` to
match the existing `add_*`/`clear_*` builder vocabulary (our methods are named
`add_since`/`clear_since`, etc., so `add_search`/`clear_search` fits better than
rust-nostr's `search`/`remove_search`). Once added, the field participates in the
derived `Hash` automatically (so `id()` covers it for free), but serialization,
`group()`, `union_filters`, `intersect_filters`, and `matches()` all need
explicit updates.

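A minimal sketch of the builder pair follows. The `Filter` here is a trimmed-down
stand-in rather than the real type: the `kinds` and `since` fields are
placeholders, and `id()` is a hypothetical reduction of the real method to a
hash over the derived `Hash` impl.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for the real `Filter`; only a few illustrative fields are shown.
#[derive(Clone, Debug, Default, PartialEq, Eq, Hash)]
pub struct Filter {
    pub kinds: Vec<u16>,
    pub since: Option<u64>,
    pub search: Option<String>,
}

impl Filter {
    /// Set the search query, accepting `&str`, `String`, or anything `Into<String>`.
    pub fn add_search<S: Into<String>>(mut self, query: S) -> Self {
        self.search = Some(query.into());
        self
    }

    /// Remove the search query.
    pub fn clear_search(mut self) -> Self {
        self.search = None;
        self
    }

    /// Hypothetical `id()`: hashes every field through the derived `Hash`,
    /// so `search` is covered without extra code.
    pub fn id(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.hash(&mut hasher);
        hasher.finish()
    }
}
```

Because `Hash` is derived, two filters that differ only in `search` produce
different ids, and clearing the search restores the original id.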
**Serialization.** Our `Filter` has hand-written serde (to flatten `#tag` keys).
Add `search` as a plain `"search"` key, emitted only when `Some` (mirroring
`since`/`until`/`limit`), and read it in the visitor's match arm. A round-trip
test must cover it.

**Grouping / union / intersect.** Per welshman: include `search` in the
`group()` hash so filters with different searches land in different groups (never
merged). In `union_filters`, since group members share an identical search by
construction, the search carries over via the `or_insert_with(|| filter.clone())`
seed; no special merge is needed, but it is worth a comment. In `combine_pair`
(intersect), decide how to combine two searches: welshman concatenates with a
space. Concatenation is defensible ("must match both") but lossy and surprising;
a cleaner rule for a typed model is to **merge two `SearchQuery` values** (union
their terms and extensions) or, if we keep the field as a string at this layer,
to concatenate with a space and document it. Recommend: concatenate with a space
when both present and differ, matching welshman, and note the limitation.

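The recommended string-level rules can be sketched as two small helpers. Both
names (`can_union`, `intersect_search`) are hypothetical and operate on the bare
`search` values rather than on whole filters; the real logic would live inside
`group()` and `combine_pair`.

```rust
/// Two filters may share a union group only when their searches are identical
/// (including both being absent).
pub fn can_union(a: Option<&str>, b: Option<&str>) -> bool {
    a == b
}

/// Combine the `search` values of two filters being intersected.
/// Differing searches are joined with a space, following welshman; this is
/// lossy, because the relay sees one compound query instead of two.
pub fn intersect_search(a: Option<&str>, b: Option<&str>) -> Option<String> {
    match (a, b) {
        (Some(a), Some(b)) if a == b => Some(a.to_string()),
        (Some(a), Some(b)) => Some(format!("{} {}", a, b)),
        (Some(s), None) | (None, Some(s)) => Some(s.to_string()),
        (None, None) => None,
    }
}
```

Intersecting `nostr` with `relay` yields the compound query `nostr relay`, while
intersecting a search with an absent one keeps the search unchanged.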
**Local matching.** Extend `Filter::matches` to test `search` *after* the cheap
scalar checks. Best-effort, case-insensitive. Two design choices to settle in
planning:
1. Whole-query substring (rust-nostr) vs. term-split AND/OR (welshman, fixed).
   A typed `SearchQuery` makes term-split natural: match the free-text terms
   (AND across terms reads as the intuitive "all words present"; document it),
   and treat `key:value` extensions as *unenforceable locally*, i.e. ignored by
   the local matcher, since we can't evaluate `sentiment:` or `domain:` without
   external data. This honesty matches the NIP.
2. ASCII (`eq_ignore_ascii_case`) vs. Unicode lowercasing. ASCII is what
   rust-nostr ships and is allocation-free; Unicode `to_lowercase` is more
   correct for non-Latin content but allocates. Given nostr's multilingual
   content, prefer Unicode `to_lowercase` for the local matcher (correctness
   over micro-optimization, consistent with our "clarity over cleverness" rule)
   and note the trade-off.

Also document, per rust-nostr's SDK, that local matching is a *fallback*:
relay results should generally be trusted as-is rather than re-filtered.

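The two choices can be compared side by side. This is a sketch under the
assumptions stated above, not the final matcher: `contains_ascii_ci` mirrors the
rust-nostr snippet (with an added guard for the empty needle), while
`matches_search` and `is_extension` are hypothetical names for the Unicode,
AND-across-terms variant that skips extensions.

```rust
/// ASCII-only whole-query match, in the style of the rust-nostr snippet.
pub fn contains_ascii_ci(content: &str, query: &str) -> bool {
    if query.is_empty() {
        return true; // `windows(0)` panics, so handle the empty needle first
    }
    content
        .as_bytes()
        .windows(query.len())
        .any(|window| window.eq_ignore_ascii_case(query.as_bytes()))
}

/// A token counts as an extension when a non-empty key precedes a `:`.
fn is_extension(token: &str) -> bool {
    matches!(token.find(':'), Some(index) if index > 0)
}

/// Unicode-aware match: every free-text term must appear in the content.
/// Extensions are skipped because they cannot be evaluated locally, so a
/// query made only of extensions matches everything.
pub fn matches_search(content: &str, query: &str) -> bool {
    let haystack = content.to_lowercase(); // allocates once per event
    query
        .split_whitespace()
        .filter(|token| !is_extension(token))
        .all(|term| haystack.contains(&term.to_lowercase()))
}
```

The ASCII variant misses `école` inside `ÉCOLE de musique` because only ASCII
letters are folded; the Unicode variant finds it at the cost of allocating a
lowercased copy of the content.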
**`SearchQuery` model (new `search.rs`).** A struct splitting a query into
free-text `terms: Vec<String>` and `extensions: Vec<(String, String)>` (ordered;
NIP-50 doesn't forbid repeats, and order can matter to relays). Parsing: split on
whitespace, treat a token containing `:` (with a non-empty key before it) as an
extension, everything else as a term. Provide:
- `SearchQuery::parse(&str) -> SearchQuery` (total, never fails; unknown shapes
  fall back to terms).
- `Display` / `to_string()` that re-renders to the wire string (terms first or
  original order preserved; to be decided in planning).
- Builder helpers: `term`, `extension`, plus typed convenience for the
  spec-defined extensions (`domain`, `language`, `sentiment`, `nsfw`,
  `include_spam`); optional, scope to be decided in planning.
- A bridge to `Filter`: `Filter::add_search` can accept `impl Into<String>` so
  both a raw string and `query.to_string()` work; optionally
  `Filter::search_query()` to parse the field back out.

Keep `sentiment`/`nsfw` values as strings (or small enums); we lean toward
strings to stay forward-compatible with relay-specific values, with named
constructors for the common cases.

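The parsing rule and the round trip to the wire string can be sketched as
follows. Two points are assumptions rather than decisions: `Display` renders
terms first (the ordering question is still open), and the parser applies the
`key:value` rule literally, so a token such as a URL with a scheme would be
classified as an extension.

```rust
use std::fmt;

/// Structured view of a NIP-50 query string.
#[derive(Clone, Debug, Default, PartialEq, Eq)]
pub struct SearchQuery {
    pub terms: Vec<String>,
    pub extensions: Vec<(String, String)>,
}

impl SearchQuery {
    /// Total parser: never fails; anything that is not `key:value` with a
    /// non-empty key is kept as a free-text term.
    pub fn parse(input: &str) -> Self {
        let mut query = SearchQuery::default();
        for token in input.split_whitespace() {
            match token.split_once(':') {
                Some((key, value)) if !key.is_empty() => {
                    query.extensions.push((key.to_string(), value.to_string()));
                }
                _ => query.terms.push(token.to_string()),
            }
        }
        query
    }

    /// Append a free-text term.
    pub fn term(mut self, term: impl Into<String>) -> Self {
        self.terms.push(term.into());
        self
    }

    /// Append a `key:value` extension, keeping insertion order and repeats.
    pub fn extension(mut self, key: impl Into<String>, value: impl Into<String>) -> Self {
        self.extensions.push((key.into(), value.into()));
        self
    }
}

impl fmt::Display for SearchQuery {
    /// Assumption: terms are rendered first, then extensions.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let terms = self.terms.iter().cloned();
        let extensions = self.extensions.iter().map(|(k, v)| format!("{}:{}", k, v));
        let parts: Vec<String> = terms.chain(extensions).collect();
        write!(f, "{}", parts.join(" "))
    }
}
```

A query built with `SearchQuery::default().term("orly").extension("domain", "example.com")`
renders as `orly domain:example.com`, and parsing that string yields an equal value.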
**Dependencies.** None new. Parsing is plain string handling; matching uses std.
Avoid pulling in a real FTS engine: it is out of scope and against the
minimal-dependency rule.

**Out of scope (defer / mention only).** Real relevance ranking; relay-side
indexing; NIP-11 search-relay discovery (a networking concern); the `order`
hint from applesauce; multi-field/tag matching beyond `content` (could mention
`title`/`subject` as a possible extension but keep the matcher content-only for
clarity).