Flotilla gets stuck authenticating #111
I've noticed that when I leave app.flotilla.social running in my web browser all day, it tends to get stuck on "Authenticating" and no new events load until I hit refresh. I've seen this at least four times in the past week or two, in Firefox, possibly most often after waking from sleep.
Here are some log excerpts that may be a starting point to diagnose:
dozens of these:
hundreds of these:
one dozen of these (maybe after waking from sleep? no timestamps in the console 😞)
Hi @mplorentz and @hodlbod — I've traced this to two related problems:
1. In syncSpaces(), derived([userGroupList, page], identity) fires on every page navigation. Each time roomsKey changes or the guard doesn't fire, pullAndListen() creates new request() subscriptions without cleaning up the old ones, eventually accumulating 1000+ receive listeners on the same socket (hence the MaxListenersExceededWarning).
2. After sleep-wake, all WebSocket connections drop simultaneously. The auth policy tries to re-authenticate on every reconnected socket, but the stale listeners accumulated in (1) interfere with the auth handshake, leaving sockets stuck in "Authenticating".
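To make the accumulation concrete, here's a minimal, hypothetical sketch (the names mirror the real ones, but none of this is the actual Flotilla code):

```typescript
import { EventEmitter } from "node:events";

// Hypothetical stand-in for the shared relay socket.
const socket = new EventEmitter();

// Simplified stand-in for pullAndListen(): each call attaches a
// "receive" listener but never detaches the previous one.
function pullAndListen(): void {
  socket.on("receive", () => {
    /* handle an incoming event */
  });
}

// Each page navigation re-fires the derived store and re-runs sync.
// Node prints MaxListenersExceededWarning once the count passes the
// default cap of 10; in the real app it reached 1000+.
for (let i = 0; i < 11; i++) {
  pullAndListen();
}

console.log(socket.listenerCount("receive")); // 11
```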
My proposed fix:
(a) decouple page navigation from space sync restarts in sync.ts, and
(b) add a visibilitychange handler that cleanly tears down and restarts all sync on sleep-wake.

A secondary fix in policies.ts adds a per-socket auth timeout to prevent a permanently stuck state.
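Roughly what I have in mind for that timeout (a sketch with illustrative names, not the actual policies.ts API):

```typescript
// All names here are illustrative, not Flotilla's actual API.
type AuthSocket = {
  authState: "pending" | "ok" | "failed";
  close(): void;
};

// If the auth handshake hasn't settled within `ms`, mark the socket
// failed and close it so it reconnects instead of sitting in
// "Authenticating" forever. Returns a cancel function to call once
// auth succeeds.
function withAuthTimeout(socket: AuthSocket, ms: number): () => void {
  const timer = setTimeout(() => {
    if (socket.authState === "pending") {
      socket.authState = "failed";
      socket.close();
    }
  }, ms);
  return () => clearTimeout(timer);
}
```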
This is more complex than a typical bug fix since it touches sync lifecycle. Would appreciate your thoughts on the approach before I open a PR — particularly on whether tearing down/restarting sync on visibilitychange is the right granularity or if you'd prefer a lighter-touch reconnect.
I think those are good suggestions, but instead of decoupling sync restarts from page navigation I would suggest turning the visibilitychange handler into a store which gets subscribed to in sync to tear down/set up everything using the same logic. This means we can keep the current idempotent approach, which keeps things easier to understand.

I'm more interested in the memory leak you've found; I see that syncSpace never gets called without the corresponding unsubscriber being called, so I'm not sure how you would end up with duplicate subscriptions on a url based on the logic in sync. Can you clarify where the leak actually is?
Hi @hodlbod, first of all, apologies for the late response: I wasn't aware of your comment.
Nonetheless, that makes sense — using a visibilitychange store and subscribing to it in sync sounds much cleaner and keeps the idempotent flow intact. I’ll switch to that approach.
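Roughly what I have in mind for that store (a hypothetical sketch that follows the Svelte store contract by hand so it's self-contained; the real version would just use readable from svelte/store and pass document directly):

```typescript
// `target` abstracts `document` so the sketch runs outside a browser;
// in the app you would pass `document` itself.
type VisibilityTarget = {
  visibilityState: string;
  addEventListener(type: "visibilitychange", cb: () => void): void;
  removeEventListener(type: "visibilitychange", cb: () => void): void;
};

function createVisibilityStore(target: VisibilityTarget) {
  const subscribers = new Set<(visible: boolean) => void>();
  const notify = () => {
    const visible = target.visibilityState === "visible";
    subscribers.forEach((fn) => fn(visible));
  };

  return {
    subscribe(fn: (visible: boolean) => void): () => void {
      // Attach the DOM listener only while someone is subscribed.
      if (subscribers.size === 0) {
        target.addEventListener("visibilitychange", notify);
      }
      subscribers.add(fn);
      fn(target.visibilityState === "visible"); // emit the current value
      return () => {
        subscribers.delete(fn);
        if (subscribers.size === 0) {
          target.removeEventListener("visibilitychange", notify);
        }
      };
    },
  };
}
```

sync would then subscribe once: tear everything down when the value goes false, and re-run the same idempotent setup when it goes true again.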
Regarding the leak — I initially thought along similar lines (that syncSpace unsubscribers are being called), but from the stack traces (adapter.js / Tools.js accumulating receive listeners), it seems the issue is happening at a lower level.
My current hypothesis is that even though the high-level cleanup runs, the underlying request() / adapter layer isn’t fully removing the EventEmitter listeners from the shared socket. This could be due to:
- async timing (a listener attached after cleanup fires), or
- incomplete teardown in the adapter (removeListener not being called consistently)
So effectively, each re-run adds new listeners on the same socket, leading to the accumulation.
I’ll trace this more precisely in sync.ts and the adapter layer to pinpoint where the cleanup is being missed and include that in the fix. Does this direction make sense?
Yes, I think you're headed in the right direction, thanks for looking into it!
Hi @hodlbod,
Actually you were right!
I traced the listener accumulation down to the @welshman/net layer. It looks like if options.onClose throws, the cleanup path (including removing listeners) never executes, which leaves listeners attached to the shared socket.
I’m planning a two-part fix:
1. ensure our onClose callbacks don't throw (I can implement this to get the immediate fix into Flotilla), and
2. open a PR upstream to guard the cleanup with a try/finally.
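The upstream guard would be the standard pattern, sketched here with illustrative names (not welshman's actual code):

```typescript
import { EventEmitter } from "node:events";

// Illustrative stand-in for the shared socket.
const socket = new EventEmitter();

// Without the try/finally, a throwing onClose would skip
// removeListener and leak the handler on the shared socket.
function closeSubscription(
  onClose: (() => void) | undefined,
  listener: () => void,
): void {
  try {
    onClose?.();
  } finally {
    // Runs even if onClose throws, so the listener always comes off.
    socket.removeListener("receive", listener);
  }
}

const listener = () => {};
socket.on("receive", listener);
try {
  closeSubscription(() => {
    throw new Error("user callback blew up");
  }, listener);
} catch {
  // The error still propagates, but cleanup already ran.
}
console.log(socket.listenerCount("receive")); // 0
```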
Let me know if this approach sounds good to you!
That all sounds fine, have you validated that an onClose callback is throwing?
I haven’t conclusively seen it throw yet. The hypothesis comes from tracing the control flow where a thrown onClose could skip the adapter cleanup.
I’ll add some instrumentation around that path locally (especially during sleep/wake) to confirm if it’s actually happening, and will share logs once I can reproduce it.
@hodlbod I just ran it locally with some trace logs and onClose isn't throwing at all, so that theory is busted 😅. I'm digging into the adapter now to see whether the abort signal is dropping early or removeListener is just missing the reference. Either way, I'll let you know what I find!
@hodlbod I instrumented the onClose path and confirmed it's not throwing; welshman cleanup completes correctly every time.
The leak is in pullWithFallback in sync.ts. It's called without await inside pullAndListen, so the async chain — including the Difference creation and request() calls in pullOneWithFallback — continues running after controller.abort() has already fired. The signal.aborted check only runs after await loadRelay() resolves, by which point new listeners are already being attached to the socket.
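A stripped-down reproduction of that race (all names illustrative, not the actual sync.ts code):

```typescript
import { EventEmitter } from "node:events";

// Illustrative stand-in for the shared socket.
const socket = new EventEmitter();

// Fire-and-forget pull, as in pullAndListen(): the await yields, so
// abort() can fire before the listener is ever attached.
async function pullUnguarded(_signal: AbortSignal): Promise<void> {
  await Promise.resolve(); // stands in for `await loadRelay()`
  socket.on("receive", () => {}); // attached even though abort already fired
}

// The fix: re-check the signal after every await point.
async function pullGuarded(signal: AbortSignal): Promise<void> {
  await Promise.resolve();
  if (signal.aborted) return;
  socket.on("receive", () => {});
}

async function demo(): Promise<void> {
  const leaky = new AbortController();
  const p1 = pullUnguarded(leaky.signal);
  leaky.abort(); // fires while pullUnguarded is still awaiting

  const fixed = new AbortController();
  const p2 = pullGuarded(fixed.signal);
  fixed.abort();

  await Promise.all([p1, p2]);
  // One listener leaked from the unguarded call; none from the guarded one.
  console.log(socket.listenerCount("receive")); // 1
}
demo();
```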
My plan: add signal.aborted guards before the Difference construction and before request() in pullOneWithFallback, alongside the documentVisibility store. Shall I open a PR with this?
Yes, thanks!
@hodlbod Thanks for the clarification on visibility - removed the documentVisibility integration entirely.
The PR now only contains the abort guards in the fallback pull path. Also addressed all other review comments. Please take another look.
I've closed this issue, but I'm not entirely satisfied it's fixed. Feel free to re-open if you're still experiencing it.