TCR and BCR genes and HVGs selection

Hello world!
I’ve read in many papers that when performing a re-clustering of some populations, like T cells or B cells, prior to the step of integration and so on, they re-calculate the HVGs but excluding the TCR- or BCR-related genes, because they are donor-specific, especially when talking about BCR.

Can you help me how to remove the TCR- or BCR-related genes before computing the HVGs selection, but without removing them from the .var of the anndata, since I want to evaluate their expression during the step of cell annotation?

The code that I use to calculate the HVGs is the following:
sc.pp.highly_variable_genes(adata,
n_top_genes = 4000, flavor = “seurat_v3”,
layer = “raw”, batch_key = ‘sample_id’,
subset = False)

Thanks a lot!
Paolo

I think you could do something like:

sub = adata[:, ~tcr_bcr_genes]
hvg_df = sc.pp.highly_variable_genes(sub, inplace=False)
hvg_df.index = sub.var_names  # only necessary in scanpy <1.10

Then grab the info out of the returned dataframe.

adata.var["highly_variable"] = hvg_df["highly_variable"].reindex(adata.var_names, fill_value=False)

Thank you so much @ivirshup!!!
Do you know if there’s a repository or a document or anything else where the BCR/TCR related genes are listed (like for the ones related to the cell cycle in the guide)?

Thank you!!

No problem.

For gene lists I don’t, but maybe @grst does?

The Ensembl gene annotation provides this information in the “Transcript type” column which you can retrieve from Biomart. Other genome annotations should have similar annotations.

More speficially, BCR genes are those with IG_[VDJDC]_(gene|pseudogene) and TCR genes those with TR_[VDJDC]_(gene|pseudogene) transcript type.

As I used it yesterday:

dataset = pybiomart.Dataset(name='hsapiens_gene_ensembl', host='http://www.ensembl.org')
table = dataset.query(attributes=['ensembl_gene_id', 'external_gene_name', 'chromosome_name', 'transcript_biotype'], use_attr_names=True)
merged = pd.merge(
    adata.var,
    table,
    left_on='gene_ids',
    right_on='ensembl_gene_id'
)


merged['transcript_biotype'].value_counts()
adata.var['gene_name'] = adata.var_names
gene_subset = pd.merge(
    adata.var,
    table[table['transcript_biotype'].isin(['protein_coding', 'lncRNA', 'IG_C_gene', 'TR_C_gene'])], # replace here with your filter.
    left_on='gene_ids',
    right_on='ensembl_gene_id',
)
gene_subset.index = gene_subset['gene_name']
gene_subset.index.value_counts()