Hi,
I’d like to use scvi to integrate data from the same brain region in two different species. By using clustering and manual annotation I could identify that both datasets contain the same clusters (each cluster corresponding to a cell type). Now I would like to use scvi to integrate both dataset.
So first I find the orthologue genes to both datasets, and then I subsample only the genes are orthologue between the two species. Then I do some filtering and normalization as follows:
adata = filtered_anndata
adata.var_names_make_unique()
sc.pp.filter_cells(adata, min_genes=2) #get rid of cells with fewer than 200 genes
sc.pp.filter_genes(adata, min_cells=1) #get rid of genes that are found in fewer than 3 cells
adata.var['mt'] = adata.var_names.str.startswith('mt-') # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
upper_lim = np.quantile(adata.obs.n_genes_by_counts.values, .999)
lower_lim = np.quantile(adata.obs.n_genes_by_counts.values, .001)
adata = adata[(adata.obs.n_genes_by_counts < upper_lim) & (adata.obs.n_genes_by_counts > lower_lim)]
adata = adata[adata.obs.pct_counts_mt < 20]
sc.pp.normalize_total(adata, target_sum=1e4) #normalize every cell to 10,000 UMI
sc.pp.log1p(adata) #change to log counts
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5) #these are default values
adata.raw = adata #save raw data before processing values and further filtering
adata = adata[:, adata.var.highly_variable] #filter highly variable
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt']) #Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed
sc.pp.scale(adata, max_value=10) #scale each gene to unit variance
sc.tl.pca(adata, svd_solver='arpack')
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=30)
sc.tl.leiden(adata, resolution = 0.25)
sc.tl.umap(adata)
Then I concatenate the two datasets and prep for scvi:
concatenated_anndata = concatenated_anndata
concatenated_anndata.obs.groupby('dataset').count()
# sc.pp.filter_genes(concatenated_anndata, min_counts=3)
concatenated_anndata.layers["counts"] = concatenated_anndata.X.copy() # preserve counts
sc.pp.normalize_total(concatenated_anndata, target_sum=1e4)
sc.pp.log1p(concatenated_anndata)
concatenated_anndata.raw = concatenated_anndata # freeze the state in `.raw`
scvi.model.SCVI.setup_anndata(
concatenated_anndata,
layer="counts",
categorical_covariate_keys=["dataset"])
model = scvi.model.SCVI(concatenated_anndata)
However the results I get from the integration are pretty terrible:
I thought one possible reason might be that the highly variable genes selected (using the parameters I have above) lead to a very low number of common genes between the two dataset (from 10.000 of orthologues, down to ~300). So I tried to skip altogether the filtering of highly variable genes, but results look still pretty bad. So I am wondering if there is something wrong with the order of the steps I am following for this pipeline (whether I should do the selection of highly variable genes before finding the orthologues?) or the pipeline in general. Does anyone have any suggestions?