Filter_by_expr parameters Decoupler

Hello,

I am struggling to understand the filtering done in decoupler.pp.filter_by_expr.
I first used dc.pl.filter_by_expr to visualize how genes are expressed in my pseudobulk samples:
dc.pl.filter_by_expr(
    adata=pdata_muscle,
    group="anatomical_cluster",
    min_count=10,        # threshold: "minimum number of counts in a given number of samples"
    min_total_count=40,  # threshold: "minimum total number of reads across all samples", i.e. the x-axis value
    large_n=20,          # number of samples in a group to be considered large
    min_prop=0.6,        # proportion of samples in the smallest group that should express a gene
)

which gives the following plot:

Could someone explain the plot? All I know is that only what's in the upper-right quadrant will be kept.
Isn't it weird to have the number of samples at zero while the log total sum of counts is non-null?

Also, when I then ran the filtering with dc.pp.filter_by_expr, even with low thresholds I was left with only 40 genes, so it filtered a lot.

Thank you very much for your precious help!

Hi @npont! Good questions, the plot is working as expected.

Each point in the plot is a gene. The x-axis is log10(total counts across all samples). The y-axis is the number of samples where that gene exceeds a CPM (counts per million) cutoff, which is computed as min_count / median(library_size) * 1e6. The dashed lines are your thresholds, and genes in the upper-right quadrant are kept.
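To make the axes concrete, here is a minimal numpy sketch of that computation on a toy count matrix (invented numbers, just for illustration; the rule mirrors how the CPM cutoff is described above):

```python
import numpy as np

# Toy pseudobulk counts: 4 samples (rows) x 3 genes (columns).
counts = np.array([
    [50, 1, 0],
    [80, 0, 2],
    [60, 1, 0],
    [70, 2, 1],
])

min_count = 10
lib_sizes = counts.sum(axis=1)                       # per-sample library size
cpm_cutoff = min_count / np.median(lib_sizes) * 1e6  # CPM threshold

cpm = counts / lib_sizes[:, None] * 1e6              # counts per million, per sample
total_counts = counts.sum(axis=0)                    # x-axis: total counts per gene
n_samples_passing = (cpm >= cpm_cutoff).sum(axis=0)  # y-axis: samples above cutoff
```

Note that genes 2 and 3 end up with n_samples_passing = 0 even though their total counts are non-zero, which is exactly the "zero on the y-axis, non-null total" pattern you asked about.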

Regarding zero samples with non-null total counts: this is normal and expected. A gene can accumulate some total counts spread thinly across samples without ever reaching the CPM threshold in any individual sample. For example, if a gene has 1 or 2 counts in several samples but the library sizes are large, the CPM in each sample will fall below the cutoff, so the number of samples passing the CPM threshold is 0 even though total counts > 0. This is exactly what the filter is designed to catch: diffuse, noisy genes.

Regarding the aggressive filtering: a few things to check:

  1. How many samples does your smallest anatomical_cluster group have? The min_sample_size threshold (the horizontal dashed line) depends on the size of the smallest group in anatomical_cluster. With large_n=20, any group of ≤20 samples sets min_sample_size to its own size directly; min_prop only softens the requirement for groups larger than large_n. So if your smallest group has, say, 5 samples, a gene must pass the CPM cutoff in at least 5 samples to be kept. You can check the group sizes with pdata.obs["anatomical_cluster"].value_counts(); if any group is very small, that directly sets how strict the filter is.
  2. What do your library sizes look like? Very uneven library sizes can push the CPM cutoff up, since it is based on the median library size. Could you share the plot produced beforehand by dc.pl.filter_samples?
  3. You could try relaxing min_count (e.g., 5) or min_prop (e.g., 0.5) to see how the number of retained genes changes.
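The min_sample_size rule from point 1 can be sketched as follows (hypothetical group labels; the branching mirrors edgeR's filterByExpr logic, which decoupler follows):

```python
import pandas as pd

# Hypothetical group labels for 10 pseudobulk samples.
groups = pd.Series(["A"] * 6 + ["B"] * 4)

large_n, min_prop = 20, 0.6

smallest = groups.value_counts().min()  # size of the smallest group
if smallest > large_n:
    # min_prop only relaxes the requirement beyond large_n samples.
    min_sample_size = large_n + (smallest - large_n) * min_prop
else:
    # Small groups set the threshold directly; min_prop has no effect.
    min_sample_size = smallest
```

Here the smallest group has 4 samples, so min_sample_size is 4 regardless of min_prop, which is why lowering min_prop changes nothing when the smallest group is below large_n.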

Thank you very much @PauBadiaM for this very detailed answer!
To answer point by point:

  1. My smallest group contains 6 samples, while the other anatomical_cluster groups contain between 20 and 70 samples. From what you explained, I guess this is why there is a dashed line at 6 on the y-axis. So only genes exceeding the CPM cutoff (10 / median(lib size) * 1e6) in at least 6 samples will be kept? Nothing changes when I reduce min_prop, even down to values as low as 20%.
  2. My library sizes look even, don't they? Here is the plot you asked for:

  3. I tried varying the parameters, and the number of genes retained decreases drastically:

    • Using dc.pp.filter_by_expr with min_count=10, min_total_count=10, large_n=10, min_prop=0.7 → only 43 genes remain.
    • With min_count=5 (other params unchanged) → 145 genes remain.
    • Filtering with only filter_by_prop, requiring genes to be expressed in at least 10% of a minimum of 2 samples → 7639 genes remain.
    • Using only filter_by_prop, requiring genes to be expressed in at least 30% of a minimum of 5 samples (recall our smallest group has 6 samples) → 224 genes remain.

    Thanks again!

Hi @npont ,

Happy to help! :grinning_face_with_smiling_eyes:

Looking at your QC plot, I think the core issue is that your pseudobulk profiles are quite thin: each sample has only around 15 cells and ~5k total counts. If we assume ~10k genes are expressed in your dataset, then on average each gene has fewer than 1 count per sample, so most genes will never reach a meaningful CPM in any individual sample. This explains both why the filter_by_expr plot shows most genes clustered at zero on the y-axis and why changing min_prop doesn't help: the bottleneck isn't the proportion threshold but the CPM cutoff itself, which is hard to pass with such low library sizes.
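The back-of-envelope arithmetic behind this (using the rough numbers from your plot, so approximations, not your actual data):

```python
# Rough numbers from the thread: ~5k total counts per pseudobulk sample,
# ~10k expressed genes, default min_count=10.
median_lib_size = 5_000
n_expressed_genes = 10_000
min_count = 10

# CPM cutoff implied by min_count and the median library size.
cpm_cutoff = min_count / median_lib_size * 1e6  # 2000 CPM

# Average counts available per expressed gene per sample.
avg_counts_per_gene = median_lib_size / n_expressed_genes  # 0.5
```

A 2000 CPM cutoff on a 5k-count library means a gene needs ~10 counts in a single sample, while the average gene only gets ~0.5, which is why almost everything fails the filter.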

Things I would try:

  1. If possible, consider whether you can aggregate at a coarser level (fewer groups, more cells per pseudobulk sample) to get deeper profiles.

  2. You could try using only filter_by_prop for gene filtering, but I would still be careful with downstream results from this dataset.
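For intuition, the proportion-based filter boils down to something like this numpy sketch (toy proportions, not your data; decoupler stores the per-sample fraction of cells expressing each gene in the "psbulk_props" layer, and filter_by_prop keeps genes reaching min_prop in at least min_smpls samples):

```python
import numpy as np

# Hypothetical fraction of cells expressing each gene, per pseudobulk
# sample: 4 samples (rows) x 3 genes (columns).
props = np.array([
    [0.50, 0.05, 0.00],
    [0.40, 0.12, 0.08],
    [0.35, 0.00, 0.02],
    [0.60, 0.11, 0.00],
])

min_prop, min_smpls = 0.1, 2
# Keep genes whose expressing-cell proportion reaches min_prop
# in at least min_smpls samples.
keep = (props >= min_prop).sum(axis=0) >= min_smpls
```

Unlike filter_by_expr, this criterion depends on how many cells express a gene rather than on library-size-normalized counts, so it is less punishing for shallow pseudobulk profiles.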

Hope this helps!

Hello,

I dug a bit more, following what you said about the average number of counts per gene per sample. Instead of assuming 10k expressed genes, I used n_genes_by_counts (the number of genes with at least one count) and looked at the distribution of total_counts / n_genes_by_counts. I looked at the single-cell level, and at the pseudobulk level for pseudobulks of 15 cells and of 20 cells (trying to get more counts, though I can't increase my pseudobulk sizes too much because of data availability).
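For reference, that per-sample metric can be computed with plain numpy on a toy matrix (simulated counts; in a real workflow total_counts and n_genes_by_counts would come from QC metrics such as scanpy's sc.pp.calculate_qc_metrics):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy pseudobulk count matrix: 5 samples x 100 genes, mostly low counts.
counts = rng.poisson(0.5, size=(5, 100))

total_counts = counts.sum(axis=1)             # total counts per sample
n_genes_by_counts = (counts > 0).sum(axis=1)  # genes with at least one count
mean_counts_per_gene = total_counts / n_genes_by_counts
```

Because only genes with at least one count enter the denominator, this ratio is always ≥ 1, so it measures depth over *detected* genes rather than over the whole gene panel.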

Here are the results:

We clearly see an improvement in signal (in terms of counts per gene) when moving to pseudobulks and when increasing the pseudobulk size. But the average counts per gene are still quite low. However, shouldn't we account for sparsity here?

I mean, only a minority of genes are expressed (around 5k, out of a count matrix of 18k genes). We expect some genes to have a strong signal, namely high total counts, while others have very low counts. Averaging over all expressed genes hides this level of information, which is rather crucial here.
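To illustrate the point with simulated numbers (not the real data): a toy gene-count vector with a background majority and a handful of high-signal genes shows how the mean hides the split, while quantiles expose it.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy per-gene totals for one sample: mostly background, a few strong genes.
low = rng.poisson(0.3, size=950)   # background genes, near zero
high = rng.poisson(50, size=50)    # a handful of high-signal genes
gene_counts = np.concatenate([low, high])

mean = gene_counts.mean()                            # pulled up by the tail
q50, q90, q99 = np.quantile(gene_counts, [0.5, 0.9, 0.99])  # reveal the split
```

The median sits near zero while the 99th percentile sits near the high-signal genes, so looking at the distribution (rather than the mean) is indeed the more informative view for a sparse matrix like this.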
I hope this thinking makes sense :slight_smile:

Thanks again!