Workflow to integrate dataset from two different species

Slack90 · January 17, 2025, 12:56am

Hi,

I’d like to use scvi to integrate data from the same brain region in two different species. By using clustering and manual annotation I could identify that both datasets contain the same clusters (each cluster corresponding to a cell type). Now I would like to use scvi to integrate both dataset.
So first I find the orthologue genes to both datasets, and then I subsample only the genes are orthologue between the two species. Then I do some filtering and normalization as follows:

    adata = filtered_anndata
    adata.var_names_make_unique()
    sc.pp.filter_cells(adata, min_genes=2) #get rid of cells with fewer than 200 genes
    sc.pp.filter_genes(adata, min_cells=1) #get rid of genes that are found in fewer than 3 cells
    adata.var['mt'] = adata.var_names.str.startswith('mt-')  # annotate the group of mitochondrial genes as 'mt'
    sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
    upper_lim = np.quantile(adata.obs.n_genes_by_counts.values, .999)
    lower_lim = np.quantile(adata.obs.n_genes_by_counts.values, .001)
    adata = adata[(adata.obs.n_genes_by_counts < upper_lim) & (adata.obs.n_genes_by_counts > lower_lim)]
    adata = adata[adata.obs.pct_counts_mt < 20]
    sc.pp.normalize_total(adata, target_sum=1e4) #normalize every cell to 10,000 UMI
    sc.pp.log1p(adata) #change to log counts
    sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5) #these are default values
    adata.raw = adata #save raw data before processing values and further filtering
    adata = adata[:, adata.var.highly_variable] #filter highly variable
    sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt']) #Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed
    sc.pp.scale(adata, max_value=10) #scale each gene to unit variance
    sc.tl.pca(adata, svd_solver='arpack')
    sc.pp.neighbors(adata, n_neighbors=10, n_pcs=30)
    sc.tl.leiden(adata, resolution = 0.25)
    sc.tl.umap(adata)

Then I concatenate the two datasets and prep for scvi:

concatenated_anndata = concatenated_anndata

concatenated_anndata.obs.groupby('dataset').count()
# sc.pp.filter_genes(concatenated_anndata, min_counts=3)
concatenated_anndata.layers["counts"] = concatenated_anndata.X.copy()  # preserve counts
sc.pp.normalize_total(concatenated_anndata, target_sum=1e4)
sc.pp.log1p(concatenated_anndata)
concatenated_anndata.raw = concatenated_anndata  # freeze the state in `.raw`


scvi.model.SCVI.setup_anndata(
    concatenated_anndata,
    layer="counts",
    categorical_covariate_keys=["dataset"])

model = scvi.model.SCVI(concatenated_anndata)

However the results I get from the integration are pretty terrible:

I thought one possible reason might be that the highly variable genes selected (using the parameters I have above) lead to a very low number of common genes between the two dataset (from 10.000 of orthologues, down to ~300). So I tried to skip altogether the filtering of highly variable genes, but results look still pretty bad. So I am wondering if there is something wrong with the order of the steps I am following for this pipeline (whether I should do the selection of highly variable genes before finding the orthologues?) or the pipeline in general. Does anyone have any suggestions?

ori-kron-wis · January 21, 2025, 11:18am

Hi,
I dont see the full pipeline of your analysis here,
From what I see you should do the preprocessing above without scaling the data after datasets were concatenated with the same genes

See our documentation for similar tutorials

Topic		Replies	Views
Shared cell types not mixing when integrating datasets from different species scvi-tools integration , scvi	4	57	June 19, 2025
Dataset integration and analysis scvi-tools integration , cellassign , scvi , multivi , totalvi	3	873	May 3, 2023
Trouble integrating my dataset into a reference dataset scvi-tools integration	16	308	October 11, 2024
Understanding scVI integration inside R with Seurat v5 & SCTransform scvi-tools integration	1	167	April 6, 2025
Integrating different tissues with scVI scvi-tools integration , scvi	3	155	November 15, 2024

Workflow to integrate dataset from two different species

Related topics