Issue with multiVI get_normalized_expression and get_normalized_accessibility

Hi developers and the community,

I’m getting this issue running the MultiVI as instructed in the tutorial:

imputed_expression = model.get_normalized_expression()
>>> imputed_expression
                                   chr1:181000-181500  chr1:191000-191500  chr1:191500-192000  ...  chr17:40459500-40460000  chr17:40460000-40460500  chr17:40460500-40461000
A24-110:TATATCCTCATCCACC-1_1_1           5.421411e-06            0.000016        1.114787e-05  ...             1.362917e-06                 0.000001             6.866701e-07
A24-236:CAAGAACCATAATCGT-1_1_1           5.591661e-06            0.000017        1.120139e-05  ...             1.344341e-06                 0.000001             6.684292e-07

The returned dimension is 73236 x 61817 which matches the input “sc” layer, however, the label seems to be pulled from the input “atac” layer.

Also, I tried to pull the normalized_accessibility and have the following error:

imputed_accessibility = model.get_normalized_accessibility()

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/andy/anaconda3/envs/marisol_zoo/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/andy/anaconda3/envs/marisol_zoo/lib/python3.12/site-packages/scvi/model/_multivi.py", line 601, in get_normalized_accessibility
    return pd.DataFrame(
           ^^^^^^^^^^^^^
  File "/home/andy/anaconda3/envs/marisol_zoo/lib/python3.12/site-packages/pandas/core/frame.py", line 831, in __init__
    mgr = ndarray_to_mgr(
          ^^^^^^^^^^^^^^^
  File "/home/andy/anaconda3/envs/marisol_zoo/lib/python3.12/site-packages/pandas/core/internals/construction.py", line 336, in ndarray_to_mgr
    _check_values_indices_shape_match(values, index, columns)
  File "/home/andy/anaconda3/envs/marisol_zoo/lib/python3.12/site-packages/pandas/core/internals/construction.py", line 420, in _check_values_indices_shape_match
    raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (73236, 150000), indices imply (73236, 61817)

Here is the model setup:

model.view_anndata_setup()
Anndata setup with scvi-tools version 1.4.0.post1.

Setup via `MULTIVI.setup_anndata` with arguments:
{
│   'rna_layer': None,
│   'atac_layer': None,
│   'protein_layer': None,
│   'batch_key': None,
│   'size_factor_key': None,
│   'categorical_covariate_keys': None,
│   'continuous_covariate_keys': None,
│   'idx_layer': None,
│   'modalities': {'rna_layer': 'rna', 'atac_layer': 'atac'}
}

         Summary Statistics          
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃     Summary Stat Key     ┃ Value  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│          n_atac          │ 150000 │
│         n_batch          │   1    │
│         n_cells          │ 73236  │
│ n_extra_categorical_covs │   0    │
│ n_extra_continuous_covs  │   0    │
│         n_labels         │   1    │
│      n_size_factor       │   0    │
│          n_vars          │ 61817  │
└──────────────────────────┴────────┘
               Data Registry                
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Registry Key ┃    scvi-tools Location    ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│      X       │    adata.mod['rna'].X     │
│     atac     │    adata.mod['atac'].X    │
│    batch     │ adata.obs['_scvi_batch']  │
│    ind_x     │   adata.obs['_indices']   │
│    labels    │ adata.obs['_scvi_labels'] │
└──────────────┴───────────────────────────┘
                     batch State Registry                      
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃     Source Location      ┃ Categories ┃ scvi-tools Encoding ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ adata.obs['_scvi_batch'] │     0      │          0          │
└──────────────────────────┴────────────┴─────────────────────┘
                     labels State Registry                      
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃      Source Location      ┃ Categories ┃ scvi-tools Encoding ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ adata.obs['_scvi_labels'] │     0      │          0          │
└───────────────────────────┴────────────┴─────────────────────┘

And the label for mod rna and atac:

>>>md_mvi.mod['rna'].var
              gene_ids feature_types
AL627309.5  AL627309.5          gene
LINC01409    LINC01409          gene
LINC01128    LINC01128          gene
LINC00115    LINC00115          gene
FAM41C          FAM41C          gene
...                ...           ...
AC008878.4  AC008878.4          gene
AC022098.2  AC022098.2          gene
AC008753.2  AC008753.2          gene
AP006621.2  AP006621.2          gene
Z97653.1      Z97653.1          gene

[61817 rows x 2 columns]
>>> md_mvi.mod['atac'].var
                                          gene_ids feature_types
index                                                           
chr1:181000-181500              chr1:181000-181500      fragment
chr1:191000-191500              chr1:191000-191500      fragment
chr1:191500-192000              chr1:191500-192000      fragment
chr1:817000-817500              chr1:817000-817500      fragment
chr1:819000-819500              chr1:819000-819500      fragment
...                                            ...           ...
chrX:155264000-155264500  chrX:155264000-155264500      fragment
chrX:155612500-155613000  chrX:155612500-155613000      fragment
chrX:155767500-155768000  chrX:155767500-155768000      fragment
chrX:155820000-155820500  chrX:155820000-155820500      fragment
chrX:156030500-156031000  chrX:156030500-156031000      fragment

[150000 rows x 2 columns]

So it seems somehow the model is pulling atac as rna although it is setup through scvi.model.MULTIVI.setup_mudata already?

I think I found a solution… First the get_normalized_expression() is only grabbing the gene_names from the FIRST mod in mudata. Thus if you reorder the rna to the first it solves the first problem. But the issue with get_normalized_accesssibility() persist. Here is a quick edit that solve the problem:

        else:
            if return_numpy:
                return imputed
            else:
                # Determine the actual number of columns in the imputed data
                actual_n_cols = imputed.shape[1]
                
                # Get region names from adata
                if isinstance(adata, MuData):
                    # Try to get region names from atac modality first, then rna
                    if "atac" in adata.mod:
                        region_names = adata["atac"].var_names
                    else:
                        region_names = adata["rna"].var_names
                else:
                    region_names = adata.var_names
                
                # Apply region_mask if it's not a slice
                if region_mask is not None and not isinstance(region_mask, slice):
                    region_names = region_names[region_mask]
                elif isinstance(region_mask, slice) and hasattr(self, 'n_regions'):
                    # When region_mask is slice(None), limit to n_regions if available
                    region_names = region_names[:self.n_regions]
                
                # Handle shape mismatch: use actual data shape
                if len(region_names) != actual_n_cols:
                    # If we have more columns in data than names, extend with numeric names
                    if len(region_names) < actual_n_cols:
                        region_names = list(region_names) + [
                            f"region_{i}" for i in range(len(region_names), actual_n_cols)
                        ]
                    # If we have more names than columns, truncate
                    else:
                        region_names = region_names[:actual_n_cols]
                
                if threshold:
                    return pd.DataFrame.sparse.from_spmatrix(
                        imputed,
                        index=adata.obs_names[indices],
                        columns=region_names[:actual_n_cols],
                    )
                else:
                    return pd.DataFrame(
                        imputed,
                        index=adata.obs_names[indices],
                        columns=region_names[:actual_n_cols],
                    )

@andywangzhou thank you for this. It surfaced several issues.

The order of modalities in which you setup the mudata is important, and we assume rna first. This also applies to Totalvi model as well.

The issue with get_normalized_accesssibility size that was based on the rna mod is real and it needed a fix. I opened a PR. It was not used during the tutorial.

Note that besides what you wrote, we can not hard-code “rna” or “atac” as a modality selection (although that is the default) as the user can call those modalities by any name.

See the fix under this PR: fix: use the modality name and not hard coded "rna" or "atac" in multivi by ori-kron-wis · Pull Request #3622 · scverse/scvi-tools · GitHub

Thank you so much!

Yes I was doing a quick fix on the code so I can get my project going. But it is better to allow specify the modality names in the future.

About the order of modalities in the mudata, it happened because I was trying to use concat, instead of just mudata. And I found that the order I put in md.concat doesn’t matter - it is going to sort the modalities alphabetically. So I had to flip it as the following.

# merge mdata_atac and mdata_sc
mdata = md.concat([mdata_sc, mdata_atac], label="dataset")

md_atac = mdata.mod["atac"]
md_rna = mdata.mod["rna"]

md_mvi = md.MuData({"rna": md_rna, "atac": md_atac})

It’ll be helpful to be able to specify the main in the future too! Thank you so much for the well maintaining of the scverse!

Absolutely, the ordering will be taken care of as well during the setup_mudata of multivi.