How to make a UMAP for single cell data and color cells by average expression of a list of genes in scanpy?

Hello,

I would like to make a UMAP in which the cells are colored by the average expression of a set of bulk signature genes, but I am not confident that I did it correctly. I would like to use scanpy for this.

I did the following:

bulk_de_genes_up_list = bulk_de_genes['Gene'].tolist()
#Subset the data based on the list of genes
adata2 = adata[:, adata.var_names.isin(bulk_de_genes_up_list)]
average_expression = adata2.X.mean(axis=1)
adata2.var['bulk_de_gene_average'] = average_expression
sc.pl.umap(adata2, color='bulk_de_gene_average', cmap='viridis')

I do get a UMAP as output, but I am not sure it was done correctly. I am mainly worried about this line: average_expression = adata2.X.mean(axis=1)

Is that the correct way of calculating the mean of the gene expression per cell?

Thank you

Hey there,
a simple test would be to select one gene and check that you get the right numbers for it. If that works, try two genes and check that you get their average.
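The one-gene and two-gene check can be sketched on a small toy matrix. This is only an illustration in plain NumPy: the matrix, the gene names, and the helper signature_mean are made up for the example, with rows standing in for cells and columns for genes, as in adata.X.

```python
import numpy as np

# Hypothetical toy data: 3 cells (rows) x 4 genes (columns)
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [0.0, 5.0, 1.0, 2.0],
              [2.0, 2.0, 2.0, 6.0]])
var_names = np.array(["GeneA", "GeneB", "GeneC", "GeneD"])


def signature_mean(X, var_names, genes):
    """Per-cell mean over the columns whose names are in `genes` (illustrative helper)."""
    mask = np.isin(var_names, genes)
    return X[:, mask].mean(axis=1)


# One gene: the "average" must equal that gene's own column
one_gene = signature_mean(X, var_names, ["GeneB"])

# Two genes: the result must equal the mean of the two columns
two_genes = signature_mean(X, var_names, ["GeneB", "GeneD"])

print(one_gene)
print(two_genes)
```

On real data the same comparison can be made against the expression of a single gene, for example adata[:, 'SomeGene'].X, where SomeGene is a placeholder for one gene from your list.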

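Your averaging seems OK: mean(axis=1) averages across the columns (the genes), so you get one value per row (cell). Two things I would change, though. A per-cell value belongs in adata.obs rather than adata.var, because .var holds one row per gene and sc.pl.umap looks up color keys among the .obs columns and the gene names. There also doesn’t seem to be a need for creating the new adata:

import numpy as np

bulk_de_genes_up_list = bulk_de_genes['Gene'].tolist()

# Mean over the signature genes (axis=1) gives one value per cell;
# np.asarray(...).ravel() makes sure the result is a flat 1-D array
signature_mask = adata.var_names.isin(bulk_de_genes_up_list)
average_expression = np.asarray(adata[:, signature_mask].X.mean(axis=1)).ravel()

# Per-cell values are stored in .obs, not .var
adata.obs['bulk_de_gene_average'] = average_expression

sc.pl.umap(adata, color='bulk_de_gene_average', cmap='viridis')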
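A side note on shapes, assuming adata.X is stored as a SciPy sparse matrix (which is common but not guaranteed): the row-wise mean of a csr_matrix comes back as a two-dimensional column, so it is worth flattening it before storing it as a per-cell column. The toy matrix below is made up for the illustration.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy matrix: 2 cells (rows) x 3 genes (columns), stored sparse
X_sparse = sparse.csr_matrix(np.array([[1.0, 0.0, 3.0],
                                       [0.0, 2.0, 0.0]]))

# For a csr_matrix the mean over axis=1 is 2-D, with shape (n_cells, 1)
row_means = X_sparse.mean(axis=1)

# Flatten to a 1-D array with one entry per cell
flat_means = np.asarray(row_means).ravel()

print(row_means.shape, flat_means.shape)
```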

One issue that might arise from the way you are calculating the mean is a transformation applied to the data beforehand. Total-count (CPM) normalization is not a problem here, but averaging log1p-transformed data is not the right way, because it gives the mean of the log values rather than the mean expression on the original scale. In that case you might use the .raw counts if you’ve saved them, or try exponentiating the data with np.expm1 before taking the average:
np.expm1(adata[:, adata.var_names.isin(bulk_de_genes_up_list)].X).mean(axis=1)
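To see why the order of averaging and log transformation matters, here is a small numeric sketch in plain NumPy. The values are invented for the illustration: two cells whose three signature genes have the same average on the original scale, but a different spread.

```python
import numpy as np

# Hypothetical normalized (not log-transformed) expression:
# 2 cells (rows) x 3 genes (columns); both rows have the same mean
normalized = np.array([[0.0, 10.0, 100.0],
                       [30.0, 40.0, 40.0]])
logged = np.log1p(normalized)

# Averaging the log1p values directly
mean_of_logs = logged.mean(axis=1)

# Undoing the log first, then averaging: recovers the original-scale mean
mean_after_expm1 = np.expm1(logged).mean(axis=1)

print(mean_of_logs)
print(mean_after_expm1)
```

The two cells get the same score when the log is undone first, while the mean of the log values ranks the second cell higher, so the choice can change how the UMAP is colored.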