[Usage clarification] Should the .obs in query and reference be exactly the same?

I have a query (iPSC-derived precursors, snRNA-seq, merged from 1 healthy and 2 patient samples) and a reference (a human brain scRNA-seq dataset). The query contains ‘sample’ and ‘leiden’ columns, and the reference contains ‘sample’ and ‘cell_type’. I renamed the ‘leiden’ column in the query to ‘cell_type’ to match the reference. After gene intersection and concatenation, I ran the scVI tutorial with default parameters. However, my query never overlaps with the reference. Is this because the ‘cell_type’ labels in my query (which are essentially Leiden cluster numbers) differ from those in the reference?

In scVI, the content of these columns doesn’t matter (scANVI, by contrast, will separate cells based on the cell-type column). However, you are combining two different cell types (iPSC-derived and brain) and two different technologies (scRNA-seq and snRNA-seq). The second is already hard to integrate, and the first also sounds like it comes with strong differences in gene expression. If you really want to integrate both datasets, something like Seurat rPCA/CCA integration might be more effective.
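To make the point about the columns concrete, here is a minimal sketch of the workflow described in the question (gene intersection, concatenation, scVI with default parameters). It assumes both objects are AnnData with raw counts in `.X`, unique gene names, and a ‘sample’ column in `.obs`; the function names `shared_genes` and `integrate_with_scvi` and the ‘dataset’ column are made up for illustration. Note that only the batch column is registered, so the ‘cell_type’ / ‘leiden’ values never reach the model.

```python
def shared_genes(query_genes, reference_genes):
    """Return the genes present in both lists, keeping the query order."""
    reference_set = set(reference_genes)
    return [g for g in query_genes if g in reference_set]


def integrate_with_scvi(query, reference, batch_key="sample"):
    """Sketch: concatenate query and reference, then train scVI with defaults.

    Assumes `query` and `reference` are AnnData objects holding raw counts.
    """
    # Imported here so shared_genes() can be used without scvi-tools installed.
    import anndata as ad
    import scvi

    # Restrict both datasets to the genes they have in common.
    genes = shared_genes(query.var_names, reference.var_names)

    # Stack the cells; the hypothetical 'dataset' column records the origin.
    adata = ad.concat(
        {"query": query[:, genes], "reference": reference[:, genes]},
        label="dataset",
        index_unique="-",
    )

    # Only the batch column is registered. scVI does not look at 'cell_type';
    # scANVI would additionally take labels_key and unlabeled_category.
    scvi.model.SCVI.setup_anndata(adata, batch_key=batch_key)
    model = scvi.model.SCVI(adata)
    model.train()

    # Store the latent space for neighbors/UMAP on the combined object.
    adata.obsm["X_scVI"] = model.get_latent_representation()
    return adata
```

If sample names could collide between the two datasets, a combined batch column (for example, dataset plus sample) would be a safer choice for `batch_key`.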

In the manuscript from the Treutlein and Theis labs, I came across the following in the methods section:

We compared the data integration performance across the following latent representations of the data: unintegrated PCA, RSS (default parameters except for using 2 layers, latent space of size 30 and negative binomial likelihood) integration, scANVI (default parameters) integrations using either snapseed level 1, 2 or 3 annotation as cell type label input, scPoli (parameters shown above) integrations using either snapseed level 1, 2 or 3 annotation or all three annotation levels at once as cell type label input, scPoli36 integrations of meta-cells aggregated with the aggrecell algorithm (first employed as “pseudocell”) using either snapseed level 1 or 3 annotation as cell type label input to scPoli. We used the following scores for determining integration.

But I’m also interested to know why scRNA-seq and snRNA-seq integration would be difficult. If it is, would a human brain reference generated with snRNA-seq be more appropriate?