Hi all,
I was trying to understand how the algorithm for sc.pp.highly_variable_genes
works when operating it in a batch-aware manner. I think that I’ve figured it out so I’m writing it down in case anyone else was confused like myself.
My (possibly naive) assumption was that when a batch_key
was set the function would first output the most variable genes within all the batches, then those with highest variance in all batches except one, etc… The n_top_genes
variable would only control the number of genes being returned, and if this was lower than the number of genes that were most variable across all batches, then only those genes would be returned.
However this isn’t quite what happens, n_top_genes
also influences how the most highly variable genes are calculated in some way. Here’s an example:
Preparing the data:
import scanpy as sc
import functools
adata = sc.datasets.pbmc3k()
# Create bacthes
adata.obs['batches'] = functools.reduce(lambda a, b: a+b, list([ ["batch" + i]*int(adata.n_obs/3) for i in ["1","2","3"] ]))
# Logcounts
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
2000 HVGs
sc.pp.highly_variable_genes(
adata, n_top_genes=2000, flavor="cell_ranger", batch_key='batches'
)
n_batches = adata.var["highly_variable_nbatches"].value_counts()
n_batches
0 28084
1 3623
2 716
3 315
There are 315 HVGs that have high variance in all three batches. By my initial assumption, if you only ask for 300 high variance genes, then all the returned genes will be highly-variable in all three batches. However, this isn’t what happens:
sc.pp.highly_variable_genes(
adata, n_top_genes = 300, flavor="cell_ranger", batch_key='batches'
)
n_batches = adata.var["highly_variable_nbatches"].value_counts()
n_batches
0 32045
1 560
3 74
2 59
This had me stumped, but from looking at the source code I think the reason is because n_top_genes
defines the gene threshold for each batch, not just the dataset as a whole. So when I asked for 2000 HVGs, it obtained 2000 HVGs within each batch, combined these, and ordered by the number of batches each gene had high variance in, taking the top 2000 of these. 315 of the top 2000 HVGs from each batch were in common with each of the three. When asking for 300 HVGs, the top 300 HVGs from each batch were calculated, so the overlap between all three batches will be smaller.
Initially this approach didn’t make sense to me, but I do now understand. The n_top_genes
variable defines the threshold of the HVGs variance/dispersion to take from each batch. The number of HVGs in common between the batches only makes sense in the context of where the limit for high variance of a gene is.
Maybe this was elementary to some people but I thought it worth writing down in case anyone was confused like myself. Also, please comment if I am mistaken in my understanding!