Questions about reference mapping

Hi, I want to ask several naive questions. I hope someone can take the time to answer.

I downloaded a reference from cellxgene in h5ad format and I want to annotate my samples using that reference.

First question: should I perform any pre-processing on that reference object (normalize, log-transform, scale, etc.) or not? And what about the samples?

I ran the following on the reference:

AnnData object with n_obs × n_vars = 2480956 × 59357
    obs: 'ROIGroup', 'ROIGroupCoarse', 'ROIGroupFine', 'roi', 'organism_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'sex_ontology_term_id', 'development_stage_ontology_term_id', 'donor_id', 'suspension_type', 'dissection', 'fraction_mitochondrial', 'fraction_unspliced', 'cell_cycle_score', 'total_genes', 'total_UMIs', 'sample_id', 'supercluster_term', 'cluster_id', 'subcluster_id', 'cell_type_ontology_term_id', 'tissue_ontology_term_id', 'is_primary_data', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'Biotype', 'Chromosome', 'End', 'Gene', 'Start', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
    uns: 'batch_condition', 'citation', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_UMAP', 'X_tSNE'
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
… adata, n_top_genes=2000)
/home/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/scanpy/preprocessing/ FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
disp_grouped = df.groupby("mean_bin")["dispersions"]
adata.raw = adata
sc.pp.scale(adata, max_value=10)

I then got this warning during training:

/home/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/scvi/data/fields/ UserWarning: adata.X does not contain unnormalized count data. Are you sure this is what you want?

Last question: I need highly variable genes. Is subset=True necessary? Does it remove all other genes and keep only the 2000 highly variable genes?

pancreas_ref, n_top_genes=2000, subset=True

Hi, you need to pass raw count data to scVI. Those counts are usually stored in either adata.X or adata.raw.X. You apply a lot of normalization here, so it’s difficult to follow. Just make sure that the layer you pass into scVI contains integer count values. You also need to subset the genes before passing the data to scvi-tools models. You can, however, store the gene subset in a second object and continue working with the full adata afterwards.
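To make the two points above concrete, here is a minimal sketch using plain NumPy arrays as a stand-in for adata.X. The helper name `looks_like_raw_counts`, the toy matrix, and the mask `hvg_mask` are all hypothetical and only for illustration; this is a heuristic check, not what scvi-tools runs internally.

```python
import numpy as np


def looks_like_raw_counts(X, n_check=1000):
    """Heuristic: True if the first n_check rows are non-negative integers."""
    block = X[:n_check]
    # scipy sparse matrices expose toarray(); dense arrays pass through
    if hasattr(block, "toarray"):
        block = block.toarray()
    block = np.asarray(block)
    return bool(np.all(block >= 0) and np.all(block == np.round(block)))


# Toy stand-in for adata.X: 3 cells x 4 genes of UMI counts
counts = np.array([[0, 3, 1, 0],
                   [2, 0, 5, 1],
                   [0, 0, 4, 7]], dtype=np.float32)
log_normalized = np.log1p(counts)  # stand-in for a transformed matrix

print(looks_like_raw_counts(counts))          # True
print(looks_like_raw_counts(log_normalized))  # False

# Hypothetical highly-variable-gene mask (in scanpy this would come from
# adata.var["highly_variable"] after sc.pp.highly_variable_genes)
hvg_mask = np.array([False, True, True, False])

# Keep the gene subset in a second object; the full matrix is left untouched
counts_hvg = counts[:, hvg_mask].copy()
print(counts.shape, counts_hvg.shape)  # (3, 4) (3, 2)
```

With a real AnnData object, the same check could be run on adata.X or adata.raw.X, and adata[:, adata.var["highly_variable"]].copy() would give the second, subset object. In scanpy, subset=True subsets the object in place to the highly variable genes, while the default only flags them in adata.var["highly_variable"]. If the counts live in adata.layers, setup_anndata in scvi-tools accepts a layer argument to point at them.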