Amortized LDA Topic Modeling - picking the right number of topics

Hi, thank you for incorporating the LDA topic modeling module into scvi. I was curious if folks had suggestions on what would be a good approach to picking the number of topics for topic modeling?
My first thought was to run LDA topic modeling on a range of n_topics from 3-100, and then use an unbiased approach to get a ballpark number of topics, followed by a more supervised pass of looking at the genes in each topic and assessing whether they fit what I know biologically.
For the unbiased approach, based on the documentation, I thought to use the ELBO and perplexity scores, but I’m unsure how to interpret these values. I also have quite a few cases where they’re NA or Inf, and I’m unsure what to make of that.
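For what it’s worth, here is a minimal sketch of how one might screen perplexity values across a scan of n_topics, dropping the NA/Inf runs before comparing. The scvi-tools calls in the comments reflect my understanding of the API, and the numbers are purely illustrative:

```python
import numpy as np

# In practice each value would come from something like:
#   model = scvi.model.AmortizedLDA(adata, n_topics=k)
#   model.train()
#   perplexity[k] = model.get_perplexity(adata)
# The toy values below are purely illustrative.
perplexity = {5: 820.4, 10: 640.1, 15: float("inf"), 20: 655.9, 25: float("nan")}

# Keep only runs with finite scores; NA/Inf usually indicates a failed or
# numerically unstable fit and should not be compared against the others.
finite = {k: v for k, v in perplexity.items() if np.isfinite(v)}

# Lower perplexity is better, so the coarse pick is the finite minimum.
best_k = min(finite, key=finite.get)
print(best_k)  # → 10
```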

I was wondering if anyone else had experience with this, and if I’m going about this analysis the correct way?

Thank you!
Kartik

I think you are going about this the correct way.

I would also suggest using scib-metrics to compare the integration in the “topic space”.

It might take a while to run. Generally, to save time, I think you can scan 3-100 in steps of 3-5 and then fine-tune within the most promising range.
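Concretely, a coarse-then-fine schedule could look like the sketch below (the step sizes and the picked coarse value are just an example):

```python
# Coarse pass: scan the full 3-100 range in steps of 5.
coarse_ks = list(range(3, 101, 5))

# Suppose the coarse pass points at k ≈ 18 as the most promising region;
# refine with step 1 in a window around it.
best_coarse = 18
fine_ks = list(range(max(3, best_coarse - 4), best_coarse + 5))

print(coarse_ks[:4])  # → [3, 8, 13, 18]
print(fine_ks)        # → [14, 15, 16, 17, 18, 19, 20, 21, 22]
```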

Hi @ori-kron-wis , thank you for your quick response, and for the suggestion to run in steps of 3-5: I have quite a few cells, so it does indeed take quite a while to run!

My dataset is all tumor cells from a group of samples.
To get a sense of scib-metrics, I ran topic modeling on a subset of my cells for K values of 5, 10, 15, and 20, and then used the scib-metrics Benchmarker. I crudely assigned each cell to a topic based on its maximum topic value and used that dominant-topic assignment as the label_key. For the Benchmarker inputs I used my sample identifier as the batch_key and the X_LDA embedding as the embedding_obsm_key.
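For reference, the dominant-topic assignment is essentially an argmax over the cells × topics matrix. A toy example (in the real run the matrix would be the LDA topic proportions stored in adata.obsm["X_LDA"]):

```python
import numpy as np

# Toy cells x topics proportion matrix (3 cells, 4 topics); in the real
# analysis this comes from the fitted LDA model.
topic_props = np.array([
    [0.10, 0.70, 0.15, 0.05],
    [0.55, 0.20, 0.20, 0.05],
    [0.05, 0.10, 0.25, 0.60],
])

# Each cell's label is the index of its largest topic proportion.
dominant_topic = topic_props.argmax(axis=1)
print(dominant_topic)  # → [1 0 3]

# In the real run, something like:
#   adata.obs["dominant_topic_id"] = dominant_topic.astype(str)
```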

Is it appropriate to assign a topic to a cell and use that for the label_key in this use-case?

from scib_metrics.benchmark import Benchmarker, BatchCorrection, BioConservation

biocons = BioConservation()  # defaults shown; I configured this earlier

bm = Benchmarker(
    adata_subsample,
    batch_key="sample_id",          # sample identifier
    label_key="dominant_topic_id",  # per-cell dominant topic
    embedding_obsm_keys=["Pre_topic_modeling", "X_LDA"],
    pre_integrated_embedding_obsm_key="X_pca",
    bio_conservation_metrics=biocons,
    batch_correction_metrics=BatchCorrection(),
    n_jobs=-1,
)
bm.benchmark()

Plotting the output values for those K values, this is the line plot I get:

For this use case of Benchmarker in the context of topic modeling, are there particular metrics that would be more reliable than others? (The PC regression score, for example, does not seem to add information here.)

Also, I can tell that something is happening at 10-15 topics, and the documentation for the kBET score suggests that a higher value points towards better batch mixing, so 15 would potentially be the right ballpark for this toy example. But I lack an understanding of what these metrics mean. Are there any resources you’d suggest I look through to understand them better?

Hi,

My idea was to use the topic per cell as the label_key, yes.

However, perhaps there is a way to use the embedding formed in the topic space (see for example the tutorial: Topic Modeling with Amortized LDA — scvi-tools).

You used several embedding keys in the Benchmarker, but we see only one trend line. It may be better to overlay them all on top of each other.
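Something like the sketch below, assuming the per-K scores have been collected into one dict per embedding (the numbers here are toy values; in practice they would come from the Benchmarker results at each K):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import matplotlib.pyplot as plt

# Toy per-K aggregate scores for each embedding (illustrative only).
scores = {
    "Pre_topic_modeling": {5: 0.52, 10: 0.55, 15: 0.54, 20: 0.53},
    "X_LDA": {5: 0.48, 10: 0.61, 15: 0.66, 20: 0.63},
}

# Overlay one line per embedding on a shared axis.
fig, ax = plt.subplots()
for name, per_k in scores.items():
    ks = sorted(per_k)
    ax.plot(ks, [per_k[k] for k in ks], marker="o", label=name)

ax.set_xlabel("number of topics (K)")
ax.set_ylabel("aggregate scib-metrics score")
ax.legend()
fig.savefig("topic_scan_overlay.png")
```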

I hoped to see a clearer case of an optimal topic number that dominates across all the metrics, but we don’t see this (perhaps the optimum is between 1-5, or another embedding should be used, as stated above).

Regarding the metrics used in scib-metrics, there are several sources online, mainly GitHub - theislab/scib: Benchmarking analysis of data integration tools, which is what scib-metrics is based on.