DECOMP — Appendix

Setup

Notation

Per session, after binning at $\Delta t = 20$ ms and aligning every behavioural stream onto the spike clock:

$Y^{(R)} \in \mathbb{R}^{n_R \times T}$ is the spike-count matrix for region $R \in \{\text{V1, CB, M1, CA1}\}$, with $n_R$ good-QC units and $T \approx 2.6 \times 10^5$ bins.
$X \in \mathbb{R}^{T \times p}$ is the design matrix of behaviour and task regressors, raised-cosine-expanded.
$Z(t) \in \mathbb{R}^{T \times d_Z}$ is the confound matrix we partial out. Minimal $Z$ has $d_Z = 2$ (wheel velocity, pupil diameter). Richer $Z$ has $d_Z = 5\text{–}6$ (+ wheel acceleration, whisker ME, body ME, lick rate).
$S_R(t) \in \mathbb{R}^{T \times K}$ is the SVCA-projected score time series of region $R$, with $K = 8$.

The research question is about representational geometry, not decoding. Two regions can carry identical scalar information through completely different population codes — a fact a linear classifier is happy to overlook. The methods below are the ones that aren't.

A.1

Per-neuron Ridge GLM

A single neuron's spike train at 20-ms resolution is the sum of many overlapping echoes. A stimulus appears and ~80 ms later a small bump arrives in the firing rate. The mouse starts moving and another bump arrives. The wheel speeds up and the rate ramps with it. Every bin sees the sum of every recent event's lingering influence, plus instantaneous coupling to ongoing covariates, plus noise. The GLM is the formal version of untangling those echoes — give it a menu of behaviours and task variables, expand each into a small bank of raised-cosine basis shapes that look like plausible neural responses, and ask which combination best predicts the count in each bin.

The number we actually care about for each neuron is not the full-model $R^2$ but the unique contribution of each kernel group after the others have already competed for the same variance. We get that by fitting the full model on a training fold, refitting without one group of regressors at a time, and reading the held-out $R^2$ gap. That gap, averaged across 5 KFold splits, is $\Delta R^2_g$ — the leave-one-group-out cross-validated unique contribution. Done for every neuron in every region, $\Delta R^2_\text{movement}$ gives a per-region distribution we can rank against the literature.

$$ y_i(t) = \beta_{i,0} + \sum_{g \in \mathcal{G}} X_g(t)\,\boldsymbol{\beta}_{i,g} + \varepsilon_i(t),\qquad \hat{\boldsymbol{\beta}}_i = \arg\min_{\boldsymbol{\beta}} \;\|y_i - X\boldsymbol{\beta}\|_2^2 + \alpha\|\boldsymbol{\beta}\|_2^2 . $$

The kernel groups $\mathcal{G}$ are stimulus, movement, feedback, choice, prior, wheel, motion-energy, and pupil. Each event-aligned regressor is convolved with $n_b = 5$ raised-cosine basis kernels of the form

$$ b_k(\tau) = \tfrac{1}{2}\bigl(\cos\bigl(\tfrac{\pi(\log(\tau + \tau_0) - c_k)}{2 w}\bigr) + 1\bigr), $$

and continuous regressors contribute their convolution with the same bank. The design matrix $X$ ends up with $p \approx 60$ columns. The unique contribution of group $g$ to neuron $i$ is then

$$ \begin{aligned} \Delta R^2_{i,g} = R^2\!\bigl(y_i, \hat y_i^{\text{full}}\bigr)\Big|_{\text{test}} - R^2\!\bigl(y_i, \hat y_i^{\setminus g}\bigr)\Big|_{\text{test}}, \\[2pt] \text{averaged over 5 KFold splits.} \end{aligned} $$

The literature ranks $\Delta R^2_\text{movement}$ largest in motor cortex, then cerebellum, smaller in visual cortex, near zero in CA1 — encoding is structured across the brain, stronger toward the motor periphery. If our pipeline does not recover that ordering on the expanded 4,886-neuron sample, our design matrix is wrong and nothing downstream is trustworthy. That is the only thing the GLM has to do here.

It is also the only thing it can do. Two regions can land at the same $\Delta R^2$ medians while encoding movement in completely different population codes — say, CB carrying running speed along a low-d running-speed axis and V1 carrying it as a high-dimensional gain modulation. The geometry-versus-decoding distinction is invisible to a per-neuron scalar. The next four stages exist to see it.

A.2

SVCA — within-region reliable subspace

The cross-region CCA in the next section needs two clean low-dimensional summaries — one per region — that genuinely represent each region's population structure rather than the noise floor it happens to sit on. A naive PCA of a 50-unit × 263,000-bin spike-count matrix will hand us "leading components" no matter what: with this much time and this few cells, the top eigenvectors look impressively coherent even on a matrix of pure noise. Downstream methods would be doing high-precision arithmetic on overfitting.

SVCA certifies its own components. Split the cells of region $R$ randomly into halves $\mathcal{F}_\text{tr}$ and $\mathcal{F}_\text{te}$ and split time into training and held-out blocks $\mathcal{T}_\text{tr}$ and $\mathcal{T}_\text{te}$. On training time, find the directions of maximal cross-covariance between the two cell-halves — these are the axes along which the two halves' activities are coupled. Now project both halves into the held-out time block and ask whether the same axis is still coupled there. If yes, the axis is a real population mode that reproduces across both a resampling of cells and a resampling of time. If no, it was overfit on training time.

Concretely, the cell-by-cell cross-covariance on the training time block is

$$ C = Y^{(R)}_{\mathcal{F}_\text{tr},\,\mathcal{T}_\text{tr}}\, Y^{(R)\top}_{\mathcal{F}_\text{te},\,\mathcal{T}_\text{tr}} \;\in\; \mathbb{R}^{|\mathcal{F}_\text{tr}|\times|\mathcal{F}_\text{te}|}. $$

An SVD $C \approx U \Sigma V^\top$ gives loadings $u_k$ and $v_k$ for the two halves. Project them into the held-out time block,

$$ s_\text{tr}^{(k)}(t) = u_k^\top\, Y^{(R)}_{\mathcal{F}_\text{tr},\,\mathcal{T}_\text{te}},\qquad s_\text{te}^{(k)}(t) = v_k^\top\, Y^{(R)}_{\mathcal{F}_\text{te},\,\mathcal{T}_\text{te}}, $$

and the cross-half product over total energy in the held-out block is Stringer 2019's reliability,

$$ \rho^\text{SVCA}_k = \frac{\langle s_\text{tr}^{(k)}\,s_\text{te}^{(k)}\rangle_t} {\tfrac{1}{2}\bigl(\langle (s_\text{tr}^{(k)})^2\rangle_t + \langle (s_\text{te}^{(k)})^2\rangle_t\bigr)}. $$

$\rho^\text{SVCA}_k$ near 1 means a reproducible population mode; near 0 means noise. For CCA in A.3 we take the top $K = 8$ PCs of the whitened $Y^{(R)}$ as the region's denoised state $S_R(t) \in \mathbb{R}^{T \times K}$, and report the reliability spectrum as a diagnostic of how many of those modes are themselves trustworthy. On the IBL pair-sessions, that number is typically 1–3 for V1 and M1 and considerably higher for CB.

SVCA is purely intra-regional. It guarantees that the $K$ directions handed to CCA are signal but says nothing about whether one region's $K$ dimensions and another region's $K$ dimensions point in compatible directions. That is the cross-regional alignment question — A.3.

A.3

CCA — cross-region alignment

A linear classifier reading from V1 spike counts predicts whether the mouse is running at 95% accuracy. The same classifier on CB also gets 95%. A naïve reading is "both regions know about movement." But the classifier is happy with completely different population codes underneath: V1 might carry running speed along a single arousal-driven gain axis while CB carries it as a high-dimensional forward-model state. Both are decodable from; the two regions share information but not necessarily geometry.

The geometric question is strictly stronger: is there a direction in V1's K-space and a direction in CB's K-space whose projections rise and fall together on held-out time? CCA finds those canonical pairs $(a_k, b_k)$ by whitening each region's covariance and SVD-ing the cross-covariance:

$$ \begin{aligned} C_{xx} &= \tfrac{1}{T} S_\text{V1}^\top S_\text{V1} + \lambda I, \\ C_{yy} &= \tfrac{1}{T} S_\text{CB}^\top S_\text{CB} + \lambda I, \\ C_{xy} &= \tfrac{1}{T} S_\text{V1}^\top S_\text{CB}. \end{aligned} $$

$$ \widetilde{C} = C_{xx}^{-1/2}\, C_{xy}\, C_{yy}^{-1/2} = U\,\Sigma\,V^\top, \qquad \rho_k = \sigma_k. $$

The diagonal of $\Sigma$ holds the canonical correlations $\rho_1 \geq \rho_2 \geq \cdots$. $\rho_1 = 1$ means perfectly aligned leading axes; $\rho_1 = 0$ means the two regions share information without sharing geometry. We evaluate $\rho_k$ on time folds held out from the fit, so an overfit alignment cannot masquerade as shared structure.

Even on held-out time, two completely unrelated regions with slow drift produce non-zero $\rho_k$ purely from autocorrelated noise — at $T \approx 263k$ bins this is not small. We therefore phase-randomise each channel's Fourier coefficients independently, preserving its power spectrum but breaking its temporal alignment with the other region, redo the CCA, and repeat 200 times. The 99th percentile of those nulls is the floor anything significant must clear.

What CCA still cannot decide is the most important thing. A leading $\rho_k$ above null can mean two qualitatively different things. The two regions might both be reading off a single global locomotion + arousal state — itself approximately a function of wheel speed and pupil — in which case the canonical pair just recovers the shared copy of one drive. Or the two regions might share representational structure that goes beyond what wheel and pupil can explain. CCA returns the same answer in both cases. The contrast that separates them is A.4.

A.4

Partial CCA — the experimental contrast

Two students who agree on 80% of test questions look like they're cheating, but they're also taking the same lectures and doing the same homework — of course they agree on a lot. The real test of cheating is whether they still agree on what is left after we subtract out everything they could have learned from the shared lectures.

The same logic applies to V1 and CB. Both are bathed in a global locomotion + arousal drive that is, to a first approximation, a linear function of wheel velocity and pupil diameter. Of course they look correlated. The question we actually want is: do they still share canonical correlations after we remove the part of each region's scores that is linearly predictable from $Z = (\text{wheel}, \text{pupil})$?

Mechanically: regress each region's K-D score vector on $Z$, take residuals, and run CCA on the residuals. The regression weights are fit on training folds only and applied to held-out folds — anything else lets the residualisation peek at the test set and spuriously deflates apparent shared variance. This is Frisch–Waugh–Lovell with strict cross-validation.

$$ \begin{aligned} S_R^{\,r}(t) &= S_R(t) - Z(t)\,\hat{B}_R, \\ \hat{B}_R &= (Z^\top Z)^{-1} Z^\top S_R. \end{aligned} $$

$$ \begin{aligned} \widetilde{C}^{\,\text{partial}} &= (C^{r}_{xx})^{-1/2}\, C^{r}_{xy}\, (C^{r}_{yy})^{-1/2} \\ &= U' \Sigma' V'^\top, \qquad \rho_k^{\text{pCCA}} = \sigma'_k. \end{aligned} $$

The headline diagnostic is the survival ratio — the partial canonical correlations summed and divided by the raw canonical correlations summed:

$$ \text{survival} = \frac{\sum_k \rho_k^{\text{pCCA}}}{\sum_k \rho_k^{\text{CCA}}}. $$

Under H₀ — the two regions are both reading the same $Z$-driven global state — the ratio collapses to zero. Under H₁ — the two regions share structure beyond what $Z$ can account for — the ratio approaches one. Anything in between is graded evidence for partial overlap.

What is in $Z$ matters. Wheel velocity and pupil are the canonical scalars for "locomotion × arousal" and survive every IBL session, which makes them the conservative, minimal probe — exactly what people mean when they say "global state". A failure to collapse with this $Z$ is stronger evidence against H₀ than a failure with a 6-D kitchen-sink $Z$ would have been. On this dataset survival stays at ≈ 0.97 under minimal $Z$ and walks back to only 0.69–0.85 under a richer 5–6-D probe; in no pair does it collapse to zero. Whatever the surviving structure is, A.5 asks dynamic questions of it.

A.5

Dynamics — lag and peri-event averages

A survival ratio of ≈ 0.97 across all three pairs is striking on its own, but it is one scalar per session. Collapsed to a median, it forces three pairs whose biology is plausibly different to look the same. Two structural properties of the leading residual canonical variate $U(t) = U_A(t) \approx U_B(t)$ — the time series of the surviving shared mode after partialling — can tell them apart, and neither is captured by the survival ratio.

The first is whether the shared mode has temporal flow. If region A leads region B, the cross-correlation $\mathrm{corr}(U_A(t), U_B(t + \tau))$ peaks at $\tau > 0$; if B leads A, it peaks at $\tau < 0$; if neither leads, it peaks at $\tau = 0$. A non-zero lag at the timescale of a known synaptic pathway (≈ 30–60 ms for disynaptic cerebellothalamocortical transmission) is direct evidence of coupling along that pathway. Zero lag is consistent with either a shared drive into both regions or transmission faster than our 20-ms grid can resolve.

$$ \mathrm{xcorr}(\tau) = \mathrm{corr}\bigl(U_A(t),\, U_B(t + \tau)\bigr),\qquad \tau \in [-500,\,+500]\,\text{ms}, $$

with a phase-shuffle null on both channels giving the 99% band.

The second property is whether the shared mode tracks task events. The IBL trial table gives us first-movement and feedback timestamps; we align $U(t)$ to those events and average across trials:

$$ \overline{U}(\tau) = \frac{1}{|\mathcal{E}|}\sum_{t_e \in \mathcal{E}} U(t_e + \tau), \qquad \mathcal{E} \in \{\text{first-movement, feedback}\}. $$

A flat $\overline{U}(\tau)$ means the residual carries no task-locked structure — the signature of an inherited slow global state. Large peri-event transients mean the residual tracks the task in some way: either as a shared cognitive state both regions reflect simultaneously (zero lag, large transient) or as one region transmitting to the other after a task event (non-zero lag, large transient).

Three pairs, three signatures. V1↔CB shows zero-lag coupling with peri-event-flat averages — an inherited slow global state we did not measure. V1↔M1 shows zero-lag coupling with large peri-event transients — shared task-related cognitive state without transmission. CB↔M1 shows cerebellum leading motor cortex by +40 ms with large motor-event-locked transients — cerebellothalamocortical transmission. The survival ratio is the same in all three; three different things are happening.

A.6

Why these five, in this order

Each stage answers a question the previous one cannot, and supplies a clean input to the next. Remove any of them and the argument weakens: skip the GLM and we cannot verify the dataset behaves as the literature says; skip SVCA and CCA inherits per-neuron noise; skip CCA and partial CCA has nothing to subtract from; skip partial CCA and we cannot decide between H₀ and H₁; skip dynamics and we cannot tell three different biological mechanisms apart even when the survival ratio agrees.

Stage	Question answered	Question opened
GLM	Does each region carry movement variance in the proportions the literature reports?	What population structure does that variance live in?
SVCA	What is the reliable low-d population state per region?	Do the reliable states align across regions?
CCA	Is there a shared subspace between two regions at all?	Is the shared subspace just an arousal echo?
pCCA	Does the shared subspace survive partialling out $Z$?	What is the residual doing in time?
Dynamics	Does $U(t)$ have lag and task locking?	What mechanism produced the residual?

The $\Delta R^2$ figure is the sanity check that the dataset behaves the way the literature predicts. The survival-ratio figure is the partialling answer: shared structure does not collapse under wheel + pupil, and only partly walks back under a richer probe. The lag-and-task-locking figure is the mechanistic answer: three pairs reach similar survival through three different biological mechanisms.