Questions about reference mapping

Sirin24 · March 14, 2024, 10:35pm

Hi, I have a few basic questions and hope someone can take the time to answer.

I downloaded a reference from cellxgene in h5ad format, and I want to annotate my samples using that reference.

First question: should I perform any preprocessing on the reference object (normalize, log-transform, scale, etc.) or not? And what about the samples?

I ran these steps on the reference:

adata
AnnData object with n_obs × n_vars = 2480956 × 59357
    obs: 'ROIGroup', 'ROIGroupCoarse', 'ROIGroupFine', 'roi', 'organism_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'sex_ontology_term_id', 'development_stage_ontology_term_id', 'donor_id', 'suspension_type', 'dissection', 'fraction_mitochondrial', 'fraction_unspliced', 'cell_cycle_score', 'total_genes', 'total_UMIs', 'sample_id', 'supercluster_term', 'cluster_id', 'subcluster_id', 'cell_type_ontology_term_id', 'tissue_ontology_term_id', 'is_primary_data', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'Biotype', 'Chromosome', 'End', 'Gene', 'Start', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
    uns: 'batch_condition', 'citation', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_UMAP', 'X_tSNE'
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
/home/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/scanpy/preprocessing/_highly_variable_genes.py:226: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  disp_grouped = df.groupby("mean_bin")["dispersions"]
adata.raw = adata
sc.pp.scale(adata, max_value=10)

and got this warning when setting up the data for training:

scvi.model.SCVI.setup_anndata(adata)
/home/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/scvi/data/fields/_base_field.py:64: UserWarning: adata.X does not contain unnormalized count data. Are you sure this is what you want?
self.validate_field(adata)

Last question: I need highly variable genes. Is subset=True necessary? Does it remove all other genes and keep only the 2000 highly variable genes?

sc.pp.highly_variable_genes(
pancreas_ref, n_top_genes=2000, subset=True
)

cane11 · March 23, 2024, 9:20pm

Hi, you need to input raw count data to scVI. Those are usually stored either in adata.X or adata.raw.X. You apply a lot of normalization here, so it’s difficult to follow. Just make sure that the layer you pass into scVI contains integer count values. You need to subset the genes before inputting to scvi-tools models. You can, however, store a second object with the subset genes and afterwards continue working with adata.
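
A minimal sketch of that advice, assuming the freshly loaded adata.X still holds raw counts (for some cellxgene downloads they may sit in adata.raw.X instead). The helper names looks_like_raw_counts and prepare_reference and the layer name "counts" are made up for illustration and are not part of scanpy or scvi-tools. The idea is to save the counts in a layer before any normalization, select highly variable genes on the log-normalized data, and return a gene-subset copy so the full object stays available:

```python
import numpy as np


def looks_like_raw_counts(X, n_check=1000):
    """Heuristic: True if the first rows of X hold only non-negative integers."""
    block = X[:n_check]
    # Sparse matrices (common in h5ad files) keep their stored values in .data
    values = block.data if hasattr(block, "tocsr") else block
    values = np.asarray(values)
    if values.size == 0:
        return True
    return bool(np.all(values >= 0) and np.all(np.mod(values, 1) == 0))


def prepare_reference(adata, n_top_genes=2000):
    """Keep counts in a layer, flag HVGs on log-normalized X, return a subset copy."""
    import scanpy as sc  # lazy import so the check above works without scanpy

    if not looks_like_raw_counts(adata.X):
        raise ValueError("adata.X is not raw counts; reload the file or check adata.raw")
    adata.layers["counts"] = adata.X.copy()       # untouched integer counts for scVI
    sc.pp.normalize_total(adata, target_sum=1e4)  # changes adata.X, not the layer
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes)  # only flags genes
    # Subsetting a copy is the manual equivalent of subset=True, but keeps adata whole
    return adata[:, adata.var["highly_variable"]].copy()
```

With the returned object (called ref_hvg here), scvi.model.SCVI.setup_anndata(ref_hvg, layer="counts") points scVI at the integer counts, which should avoid the warning above. Passing subset=True to sc.pp.highly_variable_genes would instead drop the non-selected genes from adata in place. Scaling with sc.pp.scale is not needed for scVI input.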
