Hi, thank you for incorporating the LDA topic modeling module into scvi. I was curious if folks had suggestions on what would be a good approach to picking the number of topics for topic modeling?
My first thought was to run LDA topic modeling over a range of n_topics from 3-100, then use an unbiased approach to get a ballpark for the number of topics, followed by a more supervised pass where I look at the genes in each topic and assess whether they fit what I know biologically.
For the unbiased approach, based on the documentation, I thought to use the ELBO and perplexity scores, but I'm unsure how to interpret these values. I also have quite a few cases where they're NA or Inf, and I'm unsure what to make of that.
I was wondering if anyone else had experience with this, and if I’m going about this analysis the correct way?
Thank you!
Kartik
I think you are going about this the correct way.
I would also suggest using scib-metrics to compare the integration in the "topic space".
It might take a while to run. Generally, to save time, I think you can scan 3-100 in steps of 3-5 and then fine-tune inside the most promising region.
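Something along these lines, just as a sketch of the search strategy (the numbers are placeholders):

# Coarse pass: wide range with a larger step, to save time
coarse_grid = list(range(3, 101, 5))

# After inspecting the coarse results, refine around the most promising region,
# e.g. if the coarse pass points to roughly 15 topics:
best_k = 15
fine_grid = list(range(max(2, best_k - 4), best_k + 5))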
Hi @ori-kron-wis , thank you for your quick response, and for the suggestion to run in steps of 3-5; I have quite a few cells, so it does indeed take a while to run!
My dataset is all tumor cells from a group of samples.
To get a sense of scib-metrics, I ran topic modeling on a subset of my cells for K values of 5, 10, 15, and 20, and then used the scib-metrics Benchmarker function. I crudely assigned each cell to a topic based on its maximum topic value, and used that dominant topic assignment as the label_key. For the Benchmarker inputs, I used my sample identifier as the batch_key and the X_LDA embedding as the embedding_obsm_key. Is it appropriate to assign a topic to each cell and use that as the label_key in this use case? The sketch below shows what I mean by the assignment.
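This is roughly how I did it (a simplified sketch; the obsm/obs key names match what I pass to the Benchmarker below):

import numpy as np
import pandas as pd

# Per-cell topic proportions, as stored in .obsm (from the LDA model output)
topic_props = np.asarray(adata_subsample.obsm["X_LDA"])

# Assign each cell to its highest-weight topic and store it as a categorical label
dominant = topic_props.argmax(axis=1)
adata_subsample.obs["dominant_topic_id"] = pd.Categorical([f"topic_{i}" for i in dominant])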
from scib_metrics.benchmark import Benchmarker, BioConservation, BatchCorrection

# Default bio-conservation metric settings
biocons = BioConservation()

bm = Benchmarker(
    adata_subsample,
    batch_key="sample_id",
    label_key="dominant_topic_id",
    embedding_obsm_keys=["Pre_topic_modeling", "X_LDA"],
    pre_integrated_embedding_obsm_key="X_pca",
    bio_conservation_metrics=biocons,
    batch_correction_metrics=BatchCorrection(),
    n_jobs=-1,
)
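I then run the benchmark and pull the results into a dataframe, roughly like this (the per-K bookkeeping at the end is my own, and those variable names are placeholders):

# Run all metrics and collect the results table (one row per embedding key)
bm.benchmark()
results_df = bm.get_results(min_max_scale=False)

# Placeholder bookkeeping: tag the rows with the current number of topics so the
# per-K results can be concatenated and plotted as a lineplot afterwards
results_df["n_topics"] = n_topics
all_results.append(results_df)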
Plotting the output metric values across those K values, this is the lineplot I get:
For this use case of Benchmarker in the context of topic modeling, are there particular metrics that would be more reliable than others? (PC regression score for example does not seem to add information here)
Also, I can tell that something is happening at 10-15 topics, and the documentation for the kBET score suggests that a higher value points towards better batch mixing, so 15 is potentially the right ballpark for this toy example. But I lack an understanding of what these metrics mean. Are there any resources you'd suggest I look through to understand them better?
Hi,
My idea was to use the topic per cell as label_key, yes.
However, perhaps there is a way to use the embedding formed on the topic space (see for example: Topic Modeling with Amortized LDA — scvi-tools).
You used several embedding keys in the Benchmarker, but we see only one trend line. It might be better to overlay them all on top of each other.
I hoped to see a clearer case of an optimal topic number that dominates across all the metrics, but we don't see that here (perhaps the optimum is between 1-5, or we need to use another embedding as stated above).
Re the metrics used in scib-metrics: there are several sources online, mainly GitHub - theislab/scib: Benchmarking analysis of data integration tools, which is what scib-metrics is based on.
Hi,
Thank you for the suggestions! I modified my subsetting strategy to subset by samples instead of cells, in the hope that it might make it a little easier to find topics shared across samples. I also ran with a larger range of topics, starting from 2.
I have overlaid the following 4 embeddings:
X_pca - the original sc.tl.pca() embedding on the scanpy object
X_umap - the original sc.tl.umap() embedding on the scanpy object
X_LDA - the LDA topic modeling weights per cell per topic
X_umap_LDA - the UMAP embedding calculated in the topic space, using:
import scanpy as sc

# Compute the neighbor graph and UMAP in topic space
sc.pp.neighbors(adata_topic, use_rep="X_LDA", n_neighbors=20, metric="hellinger")
sc.tl.umap(adata_topic)
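Since sc.tl.umap() writes its result to adata_topic.obsm["X_umap"], I stash a copy under a separate key so it does not clash with the original expression-space UMAP (the key name is just my own convention):

# Keep the topic-space UMAP under its own key so the original X_umap is preserved
adata_topic.obsm["X_umap_LDA"] = adata_topic.obsm["X_umap"].copy()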
I am a little surprised that the X_pca and X_umap metrics change with the number of topics, because I don't recalculate those embeddings for each K. Do we expect this part of the code to modify the X_pca and X_umap values?
import scvi

# Register the counts layer and train the amortized LDA model for the current n_topics
scvi.model.AmortizedLDA.setup_anndata(adata_topic, layer="counts")
model = scvi.model.AmortizedLDA(adata_topic, n_topics=n_topics)
model.train(max_epochs=100)
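After training each model, I pull out the per-cell topic proportions and the scores roughly like this (simplified; the per-K dictionaries are my own bookkeeping):

import numpy as np

# Per-cell topic proportions (cells x topics), stored for the benchmarker
adata_topic.obsm["X_LDA"] = np.asarray(model.get_latent_representation())

# ELBO and perplexity for this value of n_topics
elbo_by_k[n_topics] = model.get_elbo()
perplexity_by_k[n_topics] = model.get_perplexity()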
I also had a couple of K values (K=4 and K=6) that ended up with those 'outlier' ELBO and perplexity scores, which makes the plots tricky to interpret.
Nonetheless, I think this is representative of the kind of output I can expect when I run this on my entire dataset. Would these curves match our expectations for identifying an appropriate ballpark for the number of topics to use?
Thanks!
X_pca and X_umap should not change, given you ran the arpack solver; even so, it is all deterministic. They would only change if you changed n_neighbors, which I don't think you did. Are you sure there isn't a code issue? Yes, there is some randomness from the session and so on, but I wouldn't expect it to be that large.
X_LDA and X_umap_LDA will of course change (that is what we are checking).
From what I see you ran it from 2, with steps of 2.
To me it seems 6 is the optimal number of topics, because it maximizes your scib Total metric (both bio conservation and batch correction). It is not an outlier (although I can't explain the perplexity; can it be that the others are just Inf and this is the only one with a value? It surely can't be 0). The same goes for get_elbo. A value other than 0 must be there.
I think that 6 topics can make sense (it is a property of the data). Given the data is not too diverse (single source, platform, tissue, etc.), yes, 5-15 topics is what I would expect.
Got it, 6 topics for these plots makes sense, thank you for all your help!
Closing the loop on the X_pca/X_umap changing and the perplexity and ELBO values:
For the X_pca and X_umap, I'm not quite sure: I looped through all the values of K, made a copy of my original anndata object before running topic modeling on it, and then stored the various metrics for all 4 embeddings in a separate dataframe. To be safe, I also stashed the original X_umap embedding to avoid it getting overwritten.
For the ELBO and perplexity values:
All the ELBO values except for K=4 and K=6 were between 249,154,710 and 273,089,479; the value for K=4 was 5.78961E+11 and for K=6 was 36,848,271,738.
All the perplexity values except for K=4 and K=6 were between ~188 and ~313; the value for K=4 was 4.32E+196 and for K=6 was Inf.
The ELBO, although super high, makes sense as optimal at K=6, but the perplexity should have gone the other way (smaller is better). I think total_counts (see the code) got really small for K=4 or 6; in any case, it still points to K=6 being optimal.
Perhaps you can draw the figures on a log scale to see this better.
Actually, on another check, I think the LDA perplexity calculation is missing a (-) before the ELBO.
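For reference, the textbook definition is something like this (a sketch only; whether the minus sign is needed in the code depends on the sign convention of the value get_elbo returns):

import numpy as np

def lda_perplexity(elbo: float, total_counts: float) -> float:
    # perplexity = exp(-log_likelihood / n_tokens); substituting the ELBO (a lower
    # bound on the log-likelihood) gives an upper bound on the true perplexity.
    # With the wrong sign, the exponent becomes large and positive for big datasets,
    # which would produce the huge or Inf perplexity values seen above.
    return float(np.exp(-elbo / total_counts))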