Hi, I want to ask several naive questions. I hope someone will take the time to answer.
I downloaded a reference from cellxgene in h5ad format, and I want to annotate my samples using that reference.
First question: should I perform any pre-processing on the reference object (normalize, log-transform, scale, etc.) or not? And what about my samples?
I ran the following on the reference:
adata
AnnData object with n_obs × n_vars = 2480956 × 59357
obs: 'ROIGroup', 'ROIGroupCoarse', 'ROIGroupFine', 'roi', 'organism_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'sex_ontology_term_id', 'development_stage_ontology_term_id', 'donor_id', 'suspension_type', 'dissection', 'fraction_mitochondrial', 'fraction_unspliced', 'cell_cycle_score', 'total_genes', 'total_UMIs', 'sample_id', 'supercluster_term', 'cluster_id', 'subcluster_id', 'cell_type_ontology_term_id', 'tissue_ontology_term_id', 'is_primary_data', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
var: 'Biotype', 'Chromosome', 'End', 'Gene', 'Start', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
uns: 'batch_condition', 'citation', 'schema_reference', 'schema_version', 'title'
obsm: 'X_UMAP', 'X_tSNE'
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pp.highly_variable_genes(
    adata, n_top_genes=2000)
/home/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/scanpy/preprocessing/_highly_variable_genes.py:226: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
disp_grouped = df.groupby("mean_bin")["dispersions"]
adata.raw = adata
sc.pp.scale(adata, max_value=10)
Then I got this warning when setting up the AnnData object for scVI, before training:
scvi.model.SCVI.setup_anndata(adata)
/home/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/scvi/data/fields/_base_field.py:64: UserWarning: adata.X does not contain unnormalized count data. Are you sure this is what you want?
self.validate_field(adata)
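To make the warning concrete, here is a minimal NumPy-only sketch of the kind of check I assume is behind it: whether the matrix still holds non-negative integers. The helper `looks_like_raw_counts` and the toy matrix are hypothetical, and the normalization is a hand-written stand-in for `normalize_total(target_sum=1e4)` followed by `log1p`, not scvi-tools' or scanpy's actual code.

```python
import numpy as np


def looks_like_raw_counts(X):
    """Hypothetical helper: True if every entry is a non-negative integer."""
    X = np.asarray(X)
    return bool(np.all(X >= 0) and np.all(np.mod(X, 1) == 0))


# Toy count matrix: 3 cells x 4 genes (made-up values)
counts = np.array(
    [
        [3.0, 0.0, 1.0, 6.0],
        [0.0, 2.0, 2.0, 1.0],
        [5.0, 1.0, 0.0, 0.0],
    ]
)

# Hand-written stand-in for normalize_total(target_sum=1e4) + log1p
totals = counts.sum(axis=1, keepdims=True)
normalized = np.log1p(counts / totals * 1e4)

raw_ok = looks_like_raw_counts(counts)       # counts are still integers
norm_ok = looks_like_raw_counts(normalized)  # log-normalized values are not
print(raw_ok, norm_ok)
```

If the raw counts are copied into a layer before normalizing, `setup_anndata` accepts a `layer` argument (for example `layer="counts"`, where the layer name is my own choice), so the model can read counts from there instead of from `adata.X`.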
Last question: I need highly variable genes. Is subset=True necessary? Does it remove all other genes and keep only the 2000 highly variable genes?
sc.pp.highly_variable_genes(
pancreas_ref, n_top_genes=2000, subset=True
)
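As far as I understand the documentation, with `subset=False` (the default) scanpy only adds a boolean `highly_variable` column to `adata.var` and keeps every gene, while `subset=True` subsets the object in place to the flagged genes. The NumPy-only sketch below mimics that difference on a toy matrix. It ranks genes by plain variance as a stand-in, which is not scanpy's dispersion-based method, and all names and values are hypothetical.

```python
import numpy as np

n_top_genes = 3

# Toy matrix: 4 cells x 5 genes (made-up values)
X = np.array(
    [
        [0.0, 5.0, 1.0, 2.0, 0.0],
        [0.0, 0.0, 1.0, 9.0, 1.0],
        [0.0, 7.0, 1.0, 0.0, 0.0],
        [0.0, 1.0, 1.0, 4.0, 2.0],
    ]
)
gene_names = np.array(["g0", "g1", "g2", "g3", "g4"])

# Stand-in ranking: plain per-gene variance (NOT scanpy's method)
variances = X.var(axis=0)
top_idx = np.argsort(variances)[::-1][:n_top_genes]
highly_variable = np.zeros(X.shape[1], dtype=bool)
highly_variable[top_idx] = True

# subset=False analogue: all genes stay, only the flag is recorded
flagged_shape = X.shape

# subset=True analogue: every gene that is not flagged is dropped
X_subset = X[:, highly_variable]
kept_genes = gene_names[highly_variable]
print(flagged_shape, X_subset.shape, kept_genes)
```

In this analogy, keeping the full matrix plus the flag means the other genes remain available later, whereas the subset version cannot recover them without reloading the data.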