# Search

NIP-50 adds one field to the filter from the previous chapter: a `search`
string. A relay that advertises the capability reads the string as a
human-readable query such as `best nostr apps`, matches it against event
content, and returns results ordered by relevance rather than by `created_at`,
with `limit` applied after ranking.

Search is opt-in and implementation-defined. Relays decide whether they index
events at all, what matches, and how ranking works. The query may also carry
`key:value` extensions (`domain:`, `language:`, `sentiment:`, `nsfw:`,
`include:spam`), and a relay honors only the ones it understands, ignoring the
rest. There is no global index and no guarantee of completeness: a client
queries the relays it believes support search and accepts a partial view.

Search can run on the relay or, in some situations, on the client, over events
it already holds. This chapter provides a parser for search strings and a
deliberately basic local scoring model. The scorer is decoupled from filter
matching and entirely opt-in.
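
The "honor what you understand, ignore the rest" rule for extensions can be sketched from the relay's side. This is a hypothetical illustration, not part of `coracle-lib`: `RelayQuery`, `split_for_relay`, and `KNOWN_KEYS` are invented names, and treating only the five keys listed above as extensions is an assumption made here so that tokens such as URLs stay ordinary terms.

```rust
use std::collections::HashSet;

/// Extension keys named in the text above (assumed to be the full set here).
const KNOWN_KEYS: [&str; 5] = ["include", "domain", "language", "sentiment", "nsfw"];

/// Hypothetical relay-side view of a query: plain terms plus the
/// extensions this particular relay chose to honor.
#[derive(Debug, Default, PartialEq)]
pub struct RelayQuery {
    pub terms: Vec<String>,
    pub honored: Vec<(String, String)>,
}

/// Split `input` on whitespace and sort each token into one of three
/// buckets: honored extension, ignored extension, or plain term.
pub fn split_for_relay(input: &str, supported: &HashSet<&str>) -> RelayQuery {
    let mut query = RelayQuery::default();
    for token in input.split_whitespace() {
        match token.split_once(':') {
            // A known key with a non-empty value is an extension.
            Some((key, value)) if KNOWN_KEYS.contains(&key) && !value.is_empty() => {
                if supported.contains(key) {
                    query.honored.push((key.to_string(), value.to_string()));
                }
                // Known but unsupported: silently ignored.
            }
            // Everything else, including tokens like `https://...`, is a term.
            _ => query.terms.push(token.to_string()),
        }
    }
    query
}
```

With `supported` containing `language` and `nsfw`, the input `best nostr apps language:en sentiment:positive` yields the terms `best`, `nostr`, `apps` and the single honored extension `language` = `en`; `sentiment:positive` is dropped.
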

## The module

```rust {file=coracle-lib/src/lib.rs}
pub mod search;
```

```rust {file=coracle-lib/src/search.rs}
//! NIP-50 full-text search queries.
//!
//! A [`SearchQuery`] holds the terms of a search string and computes a
//! best-effort relevance score against event content, for the case where
//! search runs on the client, over events already in hand, rather than on a
//! relay.

use std::fmt;
```

## The query model

A `SearchQuery` is just the query's terms: the words split out of the search
string. NIP-50 also defines `key:value` extensions, but their meaning is
relay-defined, and the local scorer has no way to evaluate `sentiment:negative`
or `domain:example.com` without data it doesn't have. Rather than model
extensions we can't honor, we treat every token as a term. A relay that
understands an extension still sees it verbatim in the query string; the local
scorer simply matches it as text like any other word, so an extension that does
not literally appear in the content counts as a missing term.

```rust {file=coracle-lib/src/search.rs}
/// A parsed NIP-50 search query: the terms of the query string.
///
/// NIP-50 `key:value` extensions are not modeled separately: their semantics
/// are relay-defined and cannot be evaluated locally, so each is kept as an
/// ordinary term.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct SearchQuery {
    /// The query's terms, in order.
    pub terms: Vec<String>,
}
```

### Parsing

Parsing splits the query on whitespace. Every token becomes a term, including
anything that looks like an extension. There is nothing to reject, so parsing is
total: it never errors. An empty or all-whitespace string parses to an empty
query.

```rust {file=coracle-lib/src/search.rs}
impl SearchQuery {
    /// Create an empty query.
    pub fn new() -> Self {
        SearchQuery::default()
    }

    /// Parse a raw query string by splitting it on whitespace. Every token,
    /// extension-like or not, becomes a term. Parsing never fails.
    pub fn parse(input: &str) -> Self {
        SearchQuery {
            terms: input.split_whitespace().map(str::to_string).collect(),
        }
    }

    /// True when the query has no terms.
    pub fn is_empty(&self) -> bool {
        self.terms.is_empty()
    }
}
```
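
As a quick check of these rules, here is a minimal standalone sketch. It repeats a trimmed-down copy of `SearchQuery` so that it compiles outside the crate, and `parse_examples` is a hypothetical helper that exists only for illustration. The comments give the terms each input should produce.

```rust
/// Trimmed-down copy of the chapter's type, repeated so the sketch stands alone.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct SearchQuery {
    pub terms: Vec<String>,
}

impl SearchQuery {
    pub fn parse(input: &str) -> Self {
        SearchQuery {
            terms: input.split_whitespace().map(str::to_string).collect(),
        }
    }

    pub fn is_empty(&self) -> bool {
        self.terms.is_empty()
    }
}

/// Hypothetical helper: the terms parsed from a few sample inputs.
pub fn parse_examples() -> Vec<Vec<String>> {
    [
        // Plain words: ["best", "nostr", "apps"]
        "best nostr apps",
        // Mixed whitespace is skipped: the same three terms
        "  best \t nostr\napps  ",
        // An extension is kept as an ordinary term: ["bitcoin", "language:en"]
        "bitcoin language:en",
        // Whitespace only: no terms, so `is_empty()` is true
        "   ",
    ]
    .iter()
    .map(|input| SearchQuery::parse(input).terms)
    .collect()
}
```
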

Rendering joins the terms back into a query string with single spaces. It is
the inverse of parsing in both directions: parsing a rendered query yields an
equal `SearchQuery`, provided no term itself contains whitespace, and rendering
a parsed string reproduces the input with leading and trailing whitespace
dropped and inner runs of whitespace collapsed to single spaces.

```rust {file=coracle-lib/src/search.rs}
impl fmt::Display for SearchQuery {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.terms.join(" "))
    }
}
```
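
The round trip can be sketched in a few lines. As before, the type is a trimmed-down copy so the block stands alone, and `normalize` is a hypothetical helper introduced only for this illustration.

```rust
use std::fmt;

/// Trimmed-down copy of the chapter's type, repeated so the sketch stands alone.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct SearchQuery {
    pub terms: Vec<String>,
}

impl SearchQuery {
    pub fn parse(input: &str) -> Self {
        SearchQuery {
            terms: input.split_whitespace().map(str::to_string).collect(),
        }
    }
}

impl fmt::Display for SearchQuery {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.terms.join(" "))
    }
}

/// Hypothetical helper: parse, then render, which normalizes whitespace.
pub fn normalize(input: &str) -> String {
    SearchQuery::parse(input).to_string()
}
```

Here `normalize("  best   nostr\tapps ")` gives `"best nostr apps"`, and parsing that string again yields the same three terms. The caveat about whitespace matters only for hand-built queries: a `SearchQuery` whose single term is `"best nostr"` renders to the same text but parses back as two terms.
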

## Scoring

NIP-50 returns results in descending order of relevance, so a boolean "matches
or not" is the wrong shape for a local implementation. The scorer instead
returns a number in `0.0..=1.0`, which can drive both inclusion (anything above
zero is a hit) and ordering.

The score has two parts. The base is the fraction of the query's terms that
appear in the content as case-insensitive substrings: three terms, two present,
gives `2/3`. On top of that, repeated occurrences add a small, diminishing
bonus, so that among events matching the same number of terms, the ones that
mention them more often rank higher. The bonus is always strictly less than
`1/total`, so it can reorder events *within* a fraction but can never lift a
partial match to the next fraction, let alone to a full match: a missing term
always costs more than any number of repetitions can recover. Because the
result is capped at `1.0`, the bonus separates only partial matches; every full
match scores exactly `1.0`. An empty query (no terms) also scores `1.0`, since
there is no text to constrain.

```rust {file=coracle-lib/src/search.rs}
impl SearchQuery {
    /// Score `content` against this query's terms, in `0.0..=1.0`.
    ///
    /// The base score is the fraction of the query's terms found in the content
    /// (case-insensitive substring). Repeated occurrences add a diminishing
    /// bonus, strictly less than one term's worth, so a partial match never
    /// reaches `1.0`. An empty query scores `1.0`: there is no text to match.
    pub fn score(&self, content: &str) -> f64 {
        let total = self.terms.len();
        if total == 0 {
            return 1.0;
        }

        let haystack = content.to_lowercase();

        let mut matched = 0usize;
        let mut extra = 0usize;
        for term in &self.terms {
            let needle = term.to_lowercase();
            if needle.is_empty() {
                // An empty term imposes no constraint; treat it as present.
                matched += 1;
                continue;
            }
            // `matches` counts non-overlapping occurrences of the needle.
            let count = haystack.matches(needle.as_str()).count();
            if count > 0 {
                matched += 1;
                // Occurrences beyond the first feed the repetition bonus.
                extra += count - 1;
            }
        }

        let base = matched as f64 / total as f64;
        let bonus = (1.0 - 1.0 / (1.0 + extra as f64)) / total as f64;
        (base + bonus).min(1.0)
    }
}
```
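
To make the arithmetic concrete, the sketch below scores a few sample strings against the query `best nostr apps`. It repeats a condensed copy of the scorer so it compiles on its own (the empty-term branch is omitted because `parse` never produces one), and `worked_examples` is a hypothetical helper used only here. The comments walk through `base + bonus` for each case.

```rust
/// Trimmed-down copy of the chapter's type, repeated so the sketch stands alone.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct SearchQuery {
    pub terms: Vec<String>,
}

impl SearchQuery {
    pub fn parse(input: &str) -> Self {
        SearchQuery {
            terms: input.split_whitespace().map(str::to_string).collect(),
        }
    }

    /// Condensed copy of the scoring arithmetic shown above.
    pub fn score(&self, content: &str) -> f64 {
        let total = self.terms.len();
        if total == 0 {
            return 1.0;
        }
        let haystack = content.to_lowercase();
        let (mut matched, mut extra) = (0usize, 0usize);
        for term in &self.terms {
            let count = haystack.matches(term.to_lowercase().as_str()).count();
            if count > 0 {
                matched += 1;
                extra += count - 1;
            }
        }
        let bonus = (1.0 - 1.0 / (1.0 + extra as f64)) / total as f64;
        (matched as f64 / total as f64 + bonus).min(1.0)
    }
}

/// Hypothetical helper: scores for five sample contents, with total = 3.
pub fn worked_examples() -> [f64; 5] {
    let query = SearchQuery::parse("best nostr apps");
    [
        // No term present: 0/3 + 0 = 0.0
        query.score("nothing relevant"),
        // Only "nostr", once: 1/3 + 0 = 0.333...
        query.score("Nostr is great"),
        // "nostr" three times, "apps" once: matched = 2, extra = 2,
        // so 2/3 + (1 - 1/3)/3 = 8/9 = 0.888..., still below 1.0
        query.score("nostr apps, nostr relays, nostr clients"),
        // All three terms, no repeats: 3/3 + 0 = 1.0
        query.score("the best nostr apps"),
        // All three terms with a repeat: 1 + 1/6, capped to 1.0
        query.score("best of the best: nostr apps"),
    ]
}
```
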

Lowercasing uses `to_lowercase`, which applies Unicode lowercase mappings
rather than only ASCII ones. That allocates, but nostr content is multilingual,
and correctness on non-Latin text is worth more than avoiding a copy in a
best-effort matcher.

## Connecting queries to filters

The previous chapter gave `Filter` a `search` field but no way to set it. The
setters follow the established `add_*` / `clear_*` vocabulary. A filter holds at
most one search string, so `add_search` replaces any existing query rather than
appending to it.

```rust {file=coracle-lib/src/filters.rs}
use crate::search::SearchQuery;

impl Filter {
    /// Set the NIP-50 search query, replacing any existing one.
    pub fn add_search(mut self, search: impl Into<String>) -> Self {
        self.search = Some(search.into());
        self
    }

    /// Remove the search query, leaving no search constraint.
    pub fn clear_search(mut self) -> Self {
        self.search = None;
        self
    }
}
```
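
A short sketch of the chaining style follows. The `Filter` here is a stand-in reduced to the one field these setters touch, and its `Default` derive is an assumption of the sketch rather than something the previous chapter is known to provide; `build_examples` is a hypothetical helper.

```rust
/// Stand-in for the chapter's `Filter`, reduced to the `search` field.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Filter {
    pub search: Option<String>,
}

impl Filter {
    pub fn add_search(mut self, search: impl Into<String>) -> Self {
        self.search = Some(search.into());
        self
    }

    pub fn clear_search(mut self) -> Self {
        self.search = None;
        self
    }
}

/// Hypothetical helper: build one searched and one cleared filter.
pub fn build_examples() -> (Filter, Filter) {
    // The second `add_search` overwrites the first; nothing accumulates.
    let searched = Filter::default()
        .add_search("bitcoin")
        .add_search(String::from("best nostr apps"));
    // Clearing returns the filter to "no search constraint".
    let cleared = searched.clone().clear_search();
    (searched, cleared)
}
```
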

Scoring an event against a filter is then a matter of parsing the field and
delegating to `SearchQuery::score`. With no search set the method returns `1.0`,
so an unsearched filter never penalizes an event. This is purely the search
dimension: it is independent of the structural `matches` check from the
previous chapter, and the two are meant to be composed by the caller, not folded
together. A consumer that wants search-ranked results filters with `matches`,
scores with `search_score`, drops zero scores if it wants only hits, and sorts
as it sees fit.

```rust {file=coracle-lib/src/filters.rs}
impl Filter {
    /// Best-effort local relevance score for `event`, in `0.0..=1.0`.
    ///
    /// Parses the `search` field and scores it against the event's content,
    /// returning `1.0` when there is no search. This considers *only* the
    /// `search` field; it is independent of [`matches`](Filter::matches).
    pub fn search_score(&self, event: &Event) -> f64 {
        match &self.search {
            Some(query) => SearchQuery::parse(query).score(&event.content),
            None => 1.0,
        }
    }
}
```
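
One way a caller might compose the two checks is sketched below. Everything here is a stand-in so the block compiles on its own: `Event` and `Filter` are cut down to two fields each, `matches` looks only at `kinds`, the free function `score` condenses the scorer from this chapter, and `rank` is a hypothetical helper, not an API of the library.

```rust
/// Stand-in event: only the fields this sketch needs.
#[derive(Debug, Clone)]
pub struct Event {
    pub kind: u16,
    pub content: String,
}

/// Stand-in filter: one structural field plus the search string.
#[derive(Debug, Clone, Default)]
pub struct Filter {
    pub kinds: Option<Vec<u16>>,
    pub search: Option<String>,
}

/// Condensed copy of the chapter's scoring arithmetic.
fn score(query: &str, content: &str) -> f64 {
    let terms: Vec<String> = query.split_whitespace().map(str::to_lowercase).collect();
    if terms.is_empty() {
        return 1.0;
    }
    let haystack = content.to_lowercase();
    let (mut matched, mut extra) = (0usize, 0usize);
    for term in &terms {
        let count = haystack.matches(term.as_str()).count();
        if count > 0 {
            matched += 1;
            extra += count - 1;
        }
    }
    let total = terms.len() as f64;
    let bonus = (1.0 - 1.0 / (1.0 + extra as f64)) / total;
    (matched as f64 / total + bonus).min(1.0)
}

impl Filter {
    /// Stand-in structural check: only `kinds` is considered here.
    pub fn matches(&self, event: &Event) -> bool {
        self.kinds.as_ref().map_or(true, |kinds| kinds.contains(&event.kind))
    }

    pub fn search_score(&self, event: &Event) -> f64 {
        match &self.search {
            Some(query) => score(query, &event.content),
            None => 1.0,
        }
    }
}

/// Hypothetical helper: structural filter, then score, then sort.
pub fn rank<'a>(filter: &Filter, events: &'a [Event]) -> Vec<(f64, &'a Event)> {
    let mut hits: Vec<(f64, &Event)> = events
        .iter()
        .filter(|event| filter.matches(event))
        .map(|event| (filter.search_score(event), event))
        // Anything above zero is a hit; zero scores are dropped.
        .filter(|(score, _)| *score > 0.0)
        .collect();
    // Highest score first. `total_cmp` gives `f64` a total order, and the
    // sort is stable, so ties keep their input order.
    hits.sort_by(|a, b| b.0.total_cmp(&a.0));
    hits
}
```

Keeping `matches` and `search_score` separate leaves the policy with the caller: this sketch drops zero scores and breaks ties by input order, but a consumer could just as well keep every structural match and use the score only as a sort key.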

## What's next

Search depends on routing the query to a relay that actually supports it.
Discovering which relays advertise NIP-50, and choosing among them, is a
networking and relay-metadata concern. It is the subject of the Domain and
Networking sections, where relay selection is built on top of the filter types
assembled here.