# Understanding the behaviour of sc.pp.highly_variable_genes with a batch_key and different values of n_top_genes

Hi all,

I was trying to understand how the algorithm for `sc.pp.highly_variable_genes` works when operating it in a batch-aware manner. I think that I’ve figured it out so I’m writing it down in case anyone else was confused like myself.

My (possibly naive) assumption was that when a `batch_key` was set the function would first output the most variable genes within all the batches, then those with highest variance in all batches except one, etc… The `n_top_genes` variable would only control the number of genes being returned, and if this was lower than the number of genes that were most variable across all batches, then only those genes would be returned.

However this isn’t quite what happens, `n_top_genes` also influences how the most highly variable genes are calculated in some way. Here’s an example:

Preparing the data:

``````import scanpy as sc
import functools

# Create bacthes
adata.obs['batches'] = functools.reduce(lambda a, b: a+b, list([ ["batch" + i]*int(adata.n_obs/3) for i in ["1","2","3"] ]))

# Logcounts
``````

2000 HVGs

``````sc.pp.highly_variable_genes(
adata, n_top_genes=2000, flavor="cell_ranger", batch_key='batches'
)
n_batches
``````
``````0    28084
1     3623
2      716
3      315
``````

There are 315 HVGs that have high variance in all three batches. By my initial assumption, if you only ask for 300 high variance genes, then all the returned genes will be highly-variable in all three batches. However, this isn’t what happens:

``````sc.pp.highly_variable_genes(
adata, n_top_genes = 300, flavor="cell_ranger", batch_key='batches'
)
``````0    32045
This had me stumped, but from looking at the source code I think the reason is because `n_top_genes` defines the gene threshold for each batch, not just the dataset as a whole. So when I asked for 2000 HVGs, it obtained 2000 HVGs within each batch, combined these, and ordered by the number of batches each gene had high variance in, taking the top 2000 of these. 315 of the top 2000 HVGs from each batch were in common with each of the three. When asking for 300 HVGs, the top 300 HVGs from each batch were calculated, so the overlap between all three batches will be smaller.
Initially this approach didn’t make sense to me, but I do now understand. The `n_top_genes` variable defines the threshold of the HVGs variance/dispersion to take from each batch. The number of HVGs in common between the batches only makes sense in the context of where the limit for high variance of a gene is.