Ablating latent variables in LinearSCVI

Hello, and thank you for these amazing and very useful packages.

I have trained a series of LDVAE models and suspect that a particular latent variable is driving performance in a subset of evaluation cells.

To test this hypothesis, I would like to set that latent variable equal to zero and attempt reconstruction, to see how reconstruction error changes.

However, when I naively tried to write my own reconstruction script, aiming to match get_normalized_expression, with something like:

latent_rep = np.array(model.get_latent_representation(adata))
loadings = np.array(model.get_loadings())
reconstruction = latent_rep @ loadings.T # as in Svensson 2020

I found that the resulting reconstruction values, which in my understanding correspond to \rho, were often negative; these are not valid parameters for a negative binomial distribution, whose mean must be non-negative. This makes me suspect that there is some hidden transformation I am not appreciating, for example a bias added at the end.
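For concreteness, here is a toy sketch (with random stand-in values, not a trained model) of why a purely linear decode can leave the valid parameter range:

```python
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(5, 3))     # stand-in latent representation
loadings = rng.normal(size=(10, 3))  # stand-in gene loadings

# A purely linear decode is unconstrained in sign,
# so some entries will generally be negative
recon = latent @ loadings.T
assert (recon < 0).any()  # negative values cannot be NB means
```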

What is the right way to implement getting a reconstruction of some data? My eventual goal is to intercept the latent variables in this process, set one of them to zero, and decode that modified latent representation.

Hi, in the LDVAE model itself, a softmax activation function is applied at the output. This makes the model not strictly linear, but keeps it consistent with scVI. You need to apply this activation as well. You might want to check out DrVI if you require a strictly linear model (it has a more complex decoder, though).

Thank you for the reply! That makes sense, and I have implemented the softmax on the decoded expression. The output is correlated with the output of get_normalized_expression, but the two are still not exactly the same.

Below is a minimal example with some simulated expression data that shows the issue:

import numpy as np
import scanpy as sc
import scvi
import anndata as ad
import matplotlib.pyplot as plt
import seaborn as sns
import os
import pandas as pd

# Set random seed for reproducibility
np.random.seed(42)
scvi.settings.seed = 42

# Generate synthetic data
n_cells = 1000
n_genes = 200
n_latent = 10

# Create random count data from negative binomial
raw_counts = np.random.negative_binomial(5, 0.3, size=(n_cells, n_genes))

# Create AnnData object with raw counts
adata = ad.AnnData(raw_counts)

# Setup and train the model with raw counts in X
scvi.model.LinearSCVI.setup_anndata(adata)  # Use raw counts from X
model = scvi.model.LinearSCVI(
    adata,
    n_latent=n_latent,
)

# Train the model
model.train(max_epochs=100, train_size=1.0)

loadings = model.get_loadings()
latents = model.get_latent_representation()

reconstruction = np.array(model.get_normalized_expression())

# A brief NumPy-based softmax, subtracting the row max for numerical stability
def stable_softmax(x):
    """Compute the softmax across the second dimension (axis=1) of matrix x."""
    x_max = np.max(x, axis=1, keepdims=True)  # Subtract max for numerical stability
    exp_x = np.exp(x - x_max)
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)


manual_reconstruction = stable_softmax(np.array(latents @ loadings.T))

# Correlation between the built-in and manual reconstructions for the first five cells
for i in range(5):
    print(np.corrcoef(reconstruction[i,:], manual_reconstruction[i,:])[0, 1])

Which gives these correlations between the built-in and “manual” reconstructions:
0.4817705853184663
0.8996616372856779
0.6568411356280006
0.7262337615103117
0.9316294369319855

Am I missing some other transformation or input argument needed to make these match up?

Thank you very much!

By default there is also a batch norm layer in the decoder, applied before the softmax; you can disable it by setting the batch norm option to "encoder" so that it is only used in the encoder.
Also see: scvi-tools/src/scvi/nn/_base_components.py at 2f1611c5d14392fc5b0e7e9706e2b9d80b20e1d5 · scverse/scvi-tools · GitHub
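To mimic that path outside the model, the inference-time batch norm formula can be inserted between the linear decode and the softmax. Below is a minimal NumPy sketch with stand-in values; in practice the running statistics and affine parameters would have to be read off the trained decoder's BatchNorm1d layer (the variable names here are illustrative, not actual scvi-tools attributes):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, n_latent = 4, 6, 3

# Stand-ins for model.get_latent_representation() / model.get_loadings()
latents = rng.normal(size=(n_cells, n_latent))
loadings = rng.normal(size=(n_genes, n_latent))

# Stand-ins for the decoder's BatchNorm1d parameters
running_mean = rng.normal(size=n_genes)
running_var = rng.uniform(0.5, 1.5, size=n_genes)
gamma = np.ones(n_genes)  # scale (weight)
beta = np.zeros(n_genes)  # shift (bias)
eps = 1e-5

def stable_softmax(x):
    x_max = np.max(x, axis=1, keepdims=True)
    exp_x = np.exp(x - x_max)
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

# Linear decode -> batch norm (eval mode) -> softmax
linear_out = latents @ loadings.T
bn_out = gamma * (linear_out - running_mean) / np.sqrt(running_var + eps) + beta
rho = stable_softmax(bn_out)

# Each row is now a valid proportion vector (non-negative, sums to 1)
assert np.allclose(rho.sum(axis=1), 1.0)
```

For the latent-ablation experiment, zeroing a column of `latents` before the decode step (e.g. `latents[:, k] = 0.0`) and comparing the resulting reconstructions should then be consistent with how the model itself decodes.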