Add search chapter
@@ -0,0 +1,214 @@
# Search

NIP-50 adds one field to the filter from the previous chapter: a `search`
string. A relay that advertises the capability reads the string as a
human-readable query such as `best nostr apps`, matches it against event
content, and returns results ordered by relevance rather than by `created_at`,
with `limit` applied after ranking.
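
On the wire, the search string travels inside an ordinary `REQ` filter. The
following is a minimal sketch of building such a message with only the standard
library. `json_escape` and `search_req` are hypothetical helpers introduced for
illustration, not part of this crate, and a real client would more likely
serialize the filter with a JSON library.

```rust
/// Escape a string so it can be embedded in a JSON string literal.
/// Hypothetical helper for this sketch only.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters must be written as \u escapes.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build a `REQ` message whose single filter carries a `search` string and a
/// `limit`. The subscription id and limit are illustrative parameters.
fn search_req(sub_id: &str, query: &str, limit: usize) -> String {
    format!(
        r#"["REQ","{}",{{"search":"{}","limit":{}}}]"#,
        json_escape(sub_id),
        json_escape(query),
        limit
    )
}
```

Calling `search_req("s1", "best nostr apps", 20)` yields
`["REQ","s1",{"search":"best nostr apps","limit":20}]`. The relay, not the
client, decides how those three words are matched and ranked.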

Search is opt-in and implementation-defined. Relays decide whether they index
events at all, what matches, and how ranking works. The query may also carry
`key:value` extensions (`domain:`, `language:`, `sentiment:`, `nsfw:`,
`include:spam`), and a relay honors only the ones it understands, ignoring the
rest. There is no global index and no guarantee of completeness: a client
queries the relays it believes support search and accepts a partial view.

Search may run on a relay or, in some situations, on the client over events it
already has in hand. This chapter provides utilities for parsing search terms,
along with a very basic scoring model for local search that is decoupled from
filter matching and entirely opt-in.

## The module

```rust {file=coracle-lib/src/lib.rs}
pub mod search;
```

```rust {file=coracle-lib/src/search.rs}
//! NIP-50 full-text search queries.
//!
//! A [`SearchQuery`] holds the terms of a search string and computes a
//! best-effort relevance score against event content, for the case where
//! search runs on the client, over events already in hand, rather than on a
//! relay.

use std::fmt;
```

## The query model

A `SearchQuery` is just the query's terms: the words split out of the search
string. NIP-50 also defines `key:value` extensions, but their meaning is
relay-defined, and the local scorer has no way to evaluate `sentiment:negative`
or `domain:example.com` without data it doesn't have. Rather than model
extensions we can't honor, we treat every token as a term. A relay that
understands an extension still sees it verbatim in the query string; the local
scorer simply matches it as text like any other word.

```rust {file=coracle-lib/src/search.rs}
/// A parsed NIP-50 search query: the terms of the query string.
///
/// NIP-50 `key:value` extensions are not modeled separately: their semantics
/// are relay-defined and cannot be evaluated locally, so each is kept as an
/// ordinary term.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct SearchQuery {
    /// The query's terms, in order.
    pub terms: Vec<String>,
}
```

### Parsing

Parsing splits the query on whitespace. Every token becomes a term, including
anything that looks like an extension. There is nothing to reject, so parsing is
total: it never errors.

```rust {file=coracle-lib/src/search.rs}
impl SearchQuery {
    /// Create an empty query.
    pub fn new() -> Self {
        SearchQuery::default()
    }

    /// Parse a raw query string by splitting it on whitespace. Every token,
    /// extension-like or not, becomes a term. Parsing never fails.
    pub fn parse(input: &str) -> Self {
        SearchQuery {
            terms: input.split_whitespace().map(str::to_string).collect(),
        }
    }

    /// True when the query has no terms.
    pub fn is_empty(&self) -> bool {
        self.terms.is_empty()
    }
}
```
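
As a quick illustration of those rules, the sketch below repeats the type and
`parse` in compact form so that it stands alone, then wraps them in a
hypothetical `parsed_terms` helper that is not part of the crate.

```rust
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct SearchQuery {
    pub terms: Vec<String>,
}

impl SearchQuery {
    /// Split on any Unicode whitespace; empty tokens are never produced.
    pub fn parse(input: &str) -> Self {
        SearchQuery {
            terms: input.split_whitespace().map(str::to_string).collect(),
        }
    }

    pub fn is_empty(&self) -> bool {
        self.terms.is_empty()
    }
}

/// Hypothetical convenience for the examples: parse and return the terms.
pub fn parsed_terms(input: &str) -> Vec<String> {
    SearchQuery::parse(input).terms
}
```

With this in place, `parsed_terms("  best   nostr\tapps ")` gives the three
terms `best`, `nostr`, `apps`, and `parsed_terms("cats language:en")` keeps
`language:en` as a single ordinary term. A string of only whitespace parses to
an empty query.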

Rendering joins the terms back into a query string with single spaces. For any
query produced by `parse`, rendering and parsing again gives an equal query.
Starting from a string instead, the round trip returns the same string except
that leading and trailing whitespace is dropped and runs of whitespace collapse
to single spaces. Because `terms` is public, a hand-built query whose terms are
empty or contain whitespace does not round-trip.

```rust {file=coracle-lib/src/search.rs}
impl fmt::Display for SearchQuery {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.terms.join(" "))
    }
}
```
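
The round-trip behavior can be checked with a small standalone sketch. It
repeats the type, `parse`, and `Display` so that it compiles on its own;
`round_trip` is a hypothetical helper used only for this illustration.

```rust
use std::fmt;

#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct SearchQuery {
    pub terms: Vec<String>,
}

impl SearchQuery {
    pub fn parse(input: &str) -> Self {
        SearchQuery {
            terms: input.split_whitespace().map(str::to_string).collect(),
        }
    }
}

impl fmt::Display for SearchQuery {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.terms.join(" "))
    }
}

/// Hypothetical helper: render a query, parse the rendering, and report the
/// rendered string together with whether the reparsed query equals the input.
pub fn round_trip(query: &SearchQuery) -> (String, bool) {
    let rendered = query.to_string();
    let reparsed = SearchQuery::parse(&rendered);
    (rendered, &reparsed == query)
}
```

For `SearchQuery::parse("  best   nostr apps ")` the helper returns
`("best nostr apps", true)`. A hand-built query with the single term
`"best nostr"` renders to the same kind of string but reparses as two terms, so
the second component is `false`.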

## Scoring

NIP-50 returns results in descending order of relevance, so a boolean "matches
or not" is the wrong shape for a local implementation. The scorer instead
returns a number in `0.0..=1.0`, which can drive both inclusion (anything above
zero is a hit) and ordering.

The score has two parts. The base is the fraction of the query's terms that
appear in the content as case-insensitive substrings: three terms, two present,
gives `2/3`. On top of that, repeated occurrences add a small, diminishing
bonus, so that among events matching the same number of terms the ones that
mention them more often rank higher. The bonus is strictly less than `1/total`,
which means it can reorder events *within* a fraction but can never push a
partial match up to the next fraction, let alone to a full one: a missing term
always costs more than any number of repetitions can recover. A full match
already scores `1.0` and the result is clamped there, so repetitions do not
separate events that contain every term. An empty query (no terms) scores
`1.0`, since there is no text to constrain.

```rust {file=coracle-lib/src/search.rs}
impl SearchQuery {
    /// Score `content` against this query's terms, in `0.0..=1.0`.
    ///
    /// The base score is the fraction of the query's terms found in the content
    /// (case-insensitive substring). Repeated occurrences add a diminishing
    /// bonus, strictly less than one term's worth, so a partial match never
    /// reaches `1.0`. An empty query scores `1.0`: there is no text to match.
    pub fn score(&self, content: &str) -> f64 {
        let total = self.terms.len();
        if total == 0 {
            return 1.0;
        }

        let haystack = content.to_lowercase();

        let mut matched = 0usize;
        let mut extra = 0usize;
        for term in &self.terms {
            let needle = term.to_lowercase();
            if needle.is_empty() {
                // An empty term imposes no constraint; treat it as present.
                matched += 1;
                continue;
            }
            let count = haystack.matches(needle.as_str()).count();
            if count > 0 {
                matched += 1;
                extra += count - 1;
            }
        }

        let base = matched as f64 / total as f64;
        let bonus = (1.0 - 1.0 / (1.0 + extra as f64)) / total as f64;
        (base + bonus).min(1.0)
    }
}
```
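
To make the arithmetic concrete, here is a standalone sketch of the same
formula as a free function. `score_terms` is a hypothetical name, and it takes
the terms as a slice instead of a `SearchQuery` so the block compiles on its
own.

```rust
/// Standalone copy of the scoring formula, for experimentation.
/// `score_terms` is a hypothetical name, not part of the crate.
pub fn score_terms(terms: &[&str], content: &str) -> f64 {
    if terms.is_empty() {
        return 1.0; // no terms, no constraint
    }
    let haystack = content.to_lowercase();
    let mut matched = 0usize; // terms that occur at least once
    let mut extra = 0usize; // occurrences beyond the first, summed over terms
    for term in terms {
        let needle = term.to_lowercase();
        if needle.is_empty() {
            matched += 1;
            continue;
        }
        let count = haystack.matches(needle.as_str()).count();
        if count > 0 {
            matched += 1;
            extra += count - 1;
        }
    }
    let total = terms.len() as f64;
    let base = matched as f64 / total;
    // 0 for no repeats, approaching (but never reaching) 1/total.
    let bonus = (1.0 - 1.0 / (1.0 + extra as f64)) / total;
    (base + bonus).min(1.0)
}
```

For the query `best nostr apps` and the content `Nostr apps, more nostr`, two of
three terms match and `nostr` occurs one extra time, so the base is `2/3`, the
bonus is `(1 - 1/2) / 3 = 1/6`, and the score is `5/6`. The content
`nostr apps` scores exactly `2/3`, and any content containing all three terms
scores `1.0` however often they repeat. Matching is by substring, so the term
`best` also counts as present in `bestow`.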

Lowercasing uses `to_lowercase`, which applies Unicode lowercase mappings rather
than only ASCII ones. That allocates, but nostr content is multilingual, and
correctness on non-Latin text is worth more than avoiding a copy in a
best-effort matcher. It is lowercasing rather than full Unicode case folding, so
a few pairs, such as `ß` and `SS`, still do not match.
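
The difference between the two lowercasing strategies shows up as soon as the
text leaves ASCII. The sketch below compares them using two hypothetical
helpers, `contains_unicode` and `contains_ascii`, which exist only for this
comparison.

```rust
/// Case-insensitive substring test using Unicode lowercasing, the strategy the
/// scorer uses. Hypothetical helper for this comparison.
pub fn contains_unicode(content: &str, term: &str) -> bool {
    content.to_lowercase().contains(&term.to_lowercase())
}

/// The same test with ASCII-only lowercasing, which leaves every non-ASCII
/// character untouched. Hypothetical helper for this comparison.
pub fn contains_ascii(content: &str, term: &str) -> bool {
    content
        .to_ascii_lowercase()
        .contains(&term.to_ascii_lowercase())
}
```

Both helpers agree on `contains_*("Best NOSTR Apps", "nostr")`. For Cyrillic
text, `contains_unicode("ПРИВЕТ НОСТР", "привет")` is `true` while the ASCII
variant is `false`, because the uppercase letters are never lowered. Neither
variant matches `straße` against `STRASSE`.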

## Connecting queries to filters

The previous chapter gave `Filter` a `search` field but no way to set it. The
setters follow the established `add_*` / `clear_*` vocabulary. A filter holds at
most one search string, so `add_search` replaces any query already set.

```rust {file=coracle-lib/src/filters.rs}
use crate::search::SearchQuery;

impl Filter {
    /// Set the NIP-50 search query.
    pub fn add_search(mut self, search: impl Into<String>) -> Self {
        self.search = Some(search.into());
        self
    }

    /// Remove the search query, leaving no search constraint.
    pub fn clear_search(mut self) -> Self {
        self.search = None;
        self
    }
}
```
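
A short usage sketch of the two setters follows. The `Filter` here is a
stand-in reduced to the `search` field, an assumption made so that the block
compiles on its own; the real type from the previous chapter has more fields.

```rust
/// Stand-in for the crate's `Filter`, reduced to the one field used here.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Filter {
    pub search: Option<String>,
}

impl Filter {
    /// Set the search query, replacing any previous one.
    pub fn add_search(mut self, search: impl Into<String>) -> Self {
        self.search = Some(search.into());
        self
    }

    /// Remove the search query.
    pub fn clear_search(mut self) -> Self {
        self.search = None;
        self
    }
}

/// Hypothetical example: chain the setters as a builder.
pub fn example_filter() -> Filter {
    Filter::default()
        .add_search("first draft")
        .add_search("best nostr apps") // replaces "first draft"
}
```

`example_filter().search` is `Some("best nostr apps")`, and calling
`clear_search()` on the result returns the filter to `Filter::default()`.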

Scoring an event against a filter is then a matter of parsing the field and
delegating to `SearchQuery::score`. With no search set the method returns `1.0`,
so an unsearched filter never penalizes an event. This is purely the search
dimension: it is independent of the structural `matches` check from the
previous chapter, and the two are meant to be composed by the caller, not folded
together. A consumer that wants search-ranked results filters with `matches`,
scores with `search_score`, and sorts as it sees fit.

```rust {file=coracle-lib/src/filters.rs}
impl Filter {
    /// Best-effort local relevance score for `event`, in `0.0..=1.0`.
    ///
    /// Parses the `search` field and scores it against the event's content,
    /// returning `1.0` when there is no search. This considers *only* the
    /// `search` field; it is independent of [`matches`](Filter::matches).
    pub fn search_score(&self, event: &Event) -> f64 {
        match &self.search {
            Some(query) => SearchQuery::parse(query).score(&event.content),
            None => 1.0,
        }
    }
}
```
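
One way a consumer might compose the two checks is sketched below. Everything in
it is an assumption for illustration: `Event` and `Filter` are reduced
stand-ins, `matches` is a placeholder that accepts every event, `search_score`
is simplified to the base fraction without the repetition bonus, and `rank` is a
hypothetical function that is not part of the crate.

```rust
/// Stand-in for the crate's `Event`; only `content` matters here.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub content: String,
}

/// Stand-in for the crate's `Filter`, reduced to `search` and `limit`.
#[derive(Debug, Clone, Default)]
pub struct Filter {
    pub search: Option<String>,
    pub limit: Option<usize>,
}

impl Filter {
    /// Placeholder for the structural check; accepts everything in this sketch.
    pub fn matches(&self, _event: &Event) -> bool {
        true
    }

    /// Simplified score: fraction of terms present, no repetition bonus.
    pub fn search_score(&self, event: &Event) -> f64 {
        let query = match &self.search {
            Some(query) => query,
            None => return 1.0,
        };
        let terms: Vec<String> = query.split_whitespace().map(str::to_lowercase).collect();
        if terms.is_empty() {
            return 1.0;
        }
        let haystack = event.content.to_lowercase();
        let matched = terms.iter().filter(|t| haystack.contains(t.as_str())).count();
        matched as f64 / terms.len() as f64
    }
}

/// Hypothetical ranking: structural filter, score, drop misses, sort by
/// descending score, then apply `limit` after ranking.
pub fn rank<'a>(filter: &Filter, events: &'a [Event]) -> Vec<&'a Event> {
    let mut scored: Vec<(f64, &Event)> = events
        .iter()
        .filter(|e| filter.matches(e))
        .map(|e| (filter.search_score(e), e))
        .filter(|(score, _)| *score > 0.0)
        .collect();
    // `sort_by` is stable, so equal scores keep their input order.
    scored.sort_by(|a, b| b.0.total_cmp(&a.0));
    if let Some(limit) = filter.limit {
        scored.truncate(limit);
    }
    scored.into_iter().map(|(_, e)| e).collect()
}
```

Applying `limit` only after sorting mirrors what NIP-50 asks of relays. Ties
keep their input order here; a caller that prefers recency among equal scores
could sort by `created_at` first.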

## What's next

Search depends on routing the query to a relay that actually supports it.
Discovering which relays advertise NIP-50, and choosing among them, is a
networking and relay-metadata concern, the subject of the Domain and Networking
sections, where relay selection is built on top of the filter types assembled
here.