scVI integration with all genes

Hello!

I would like to integrate all my samples using all the genes instead of only the top 2000 most variable genes.

I would do this to better characterise the variability and differences between my batches, and to subsequently run differential gene expression (DGE) analysis on my cells using all the genes.

However, not all the samples have the same number of genes, so I would do the following:


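import os
import anndata as ad

anndata_dir = 'C:/Users/Martina/Desktop/AnnData'
# Keep only .h5ad files, in a reproducible order
list_files = sorted(f for f in os.listdir(anndata_dir) if f.endswith('.h5ad'))
anndata_list = []

for filename in list_files:
    file_path = os.path.join(anndata_dir, filename)
    anndata_obj = ad.read_h5ad(file_path)
    anndata_list.append(anndata_obj)

# Outer join keeps the union of genes across all samples
concatenated_anndata = ad.concat(anndata_list, axis=0, join='outer')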

In this way, cells from batches that lack certain genes get an added column of zero counts for each of those genes.

Would you recommend doing this (integrating samples using all the genes)? Does the model then effectively remove batch effects and correct the counts? Would I get good DGE results? Or do you recommend a different approach? If so, what would you recommend?

The only thing that makes me doubt this approach is the DGE results. Please correct me if I’m wrong, but by concatenating all the samples with ad.concat and join='outer', I add zero counts for genes in cells for which I have no information about those genes. I would then expect the DGE results to be biased, because I assigned zero expression to those genes in those cells: missing information for a gene does not mean that the cells do not express it.
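One option I am considering, as a rough sketch only, is to keep track of which genes were actually measured in each batch, so that DGE could later be restricted to the genes shared by all batches. The batch names and gene lists below are hypothetical; in practice I assume they would come from the var_names of each AnnData object.

```python
# Hypothetical gene lists per batch; with real data these would be
# something like list(anndata_obj.var_names) for each sample.
genes_per_batch = {
    "batch_A": ["g1", "g2", "g3"],
    "batch_B": ["g2", "g3", "g4"],
    "batch_C": ["g2", "g3"],
}

# Union of genes: what an outer join would keep.
all_genes = sorted(set().union(*genes_per_batch.values()))

# Intersection of genes: measured in every batch, so no padding is needed.
shared_genes = sorted(
    set.intersection(*(set(genes) for genes in genes_per_batch.values()))
)

# For each batch, the genes that an outer join would pad with zeros.
padded_genes = {
    batch: sorted(set(all_genes) - set(genes))
    for batch, genes in genes_per_batch.items()
}

print(all_genes)
print(shared_genes)
print(padded_genes)
```

Restricting DGE to shared_genes would avoid comparing measured counts against padded zeros, at the cost of dropping the batch-specific genes.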

What is your opinion on this?

Thank you very much for your help!