Workflow to integrate dataset from two different species

Hi,

I’d like to use scvi to integrate data from the same brain region in two different species. By using clustering and manual annotation I could identify that both datasets contain the same clusters (each cluster corresponding to a cell type). Now I would like to use scvi to integrate both dataset.
So first I find the orthologue genes to both datasets, and then I subsample only the genes are orthologue between the two species. Then I do some filtering and normalization as follows:

    adata = filtered_anndata
    adata.var_names_make_unique()
    sc.pp.filter_cells(adata, min_genes=2) #get rid of cells with fewer than 200 genes
    sc.pp.filter_genes(adata, min_cells=1) #get rid of genes that are found in fewer than 3 cells
    adata.var['mt'] = adata.var_names.str.startswith('mt-')  # annotate the group of mitochondrial genes as 'mt'
    sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
    upper_lim = np.quantile(adata.obs.n_genes_by_counts.values, .999)
    lower_lim = np.quantile(adata.obs.n_genes_by_counts.values, .001)
    adata = adata[(adata.obs.n_genes_by_counts < upper_lim) & (adata.obs.n_genes_by_counts > lower_lim)]
    adata = adata[adata.obs.pct_counts_mt < 20]
    sc.pp.normalize_total(adata, target_sum=1e4) #normalize every cell to 10,000 UMI
    sc.pp.log1p(adata) #change to log counts
    sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5) #these are default values
    adata.raw = adata #save raw data before processing values and further filtering
    adata = adata[:, adata.var.highly_variable] #filter highly variable
    sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt']) #Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed
    sc.pp.scale(adata, max_value=10) #scale each gene to unit variance
    sc.tl.pca(adata, svd_solver='arpack')
    sc.pp.neighbors(adata, n_neighbors=10, n_pcs=30)
    sc.tl.leiden(adata, resolution = 0.25)
    sc.tl.umap(adata)

Then I concatenate the two datasets and prep for scvi:

concatenated_anndata = concatenated_anndata

concatenated_anndata.obs.groupby('dataset').count()
# sc.pp.filter_genes(concatenated_anndata, min_counts=3)
concatenated_anndata.layers["counts"] = concatenated_anndata.X.copy()  # preserve counts
sc.pp.normalize_total(concatenated_anndata, target_sum=1e4)
sc.pp.log1p(concatenated_anndata)
concatenated_anndata.raw = concatenated_anndata  # freeze the state in `.raw`


scvi.model.SCVI.setup_anndata(
    concatenated_anndata,
    layer="counts",
    categorical_covariate_keys=["dataset"])

model = scvi.model.SCVI(concatenated_anndata)


However the results I get from the integration are pretty terrible:

I thought one possible reason might be that the highly variable genes selected (using the parameters I have above) lead to a very low number of common genes between the two dataset (from 10.000 of orthologues, down to ~300). So I tried to skip altogether the filtering of highly variable genes, but results look still pretty bad. So I am wondering if there is something wrong with the order of the steps I am following for this pipeline (whether I should do the selection of highly variable genes before finding the orthologues?) or the pipeline in general. Does anyone have any suggestions?

Hi,
I dont see the full pipeline of your analysis here,
From what I see you should do the preprocessing above without scaling the data after datasets were concatenated with the same genes

See our documentation for similar tutorials