Managing anndata with multiple features from each gene

I am working on an analysis where each variable (gene) has N properties. If I want any downstream analysis (pca, neighbors, leiden…) to factor in all values from each gene, should I:

  1. Create multiple layers for each property.
  2. Create a separate anndata object for each property.
  3. Expand the number of vars into N*vars to include the N number of properties for each gene.

I am aware that neighbors and umap have no options for which layer to operate on, so what is the best way to deal with this?

What do you mean by “factor in”? Do you want each sub-gene feature to be considered a variable?

Yes, that is correct. I’m just wondering whether it will mess anything up if I artificially increase the number of variables.

I think the:

Expand the number of vars into N*vars to include the N number of properties for each gene.

approach is fine, and the way to go if you want to treat each gene/property pair as a separate variable.

I don’t think this will mess anything up. But how you structure the data really depends on what you’re doing with it downstream.

Create multiple layers for each property

This is the approach taken for scvelo-like models. But it doesn’t get you the “each sub-gene feature to be considered a variable”.

I do understand that the expanded layout would definitely not work for cell annotation, or at least I would have to create a separate anndata with just the gene counts for that.

For algorithms like neighbors, UMAP and leiden, I want to be sure that they will be able to pick up the differences in the gene properties and properly cluster the cells based on them.