Hi,
I fetched the Replogle 2022 K562 essential dataset with pt.data.replogle_2022_k562_essential() and I’m quite confused by the gene and perturbation columns in .obs:
# Unique perturbations from PertPy dataset, sorted by frequency
print(adata_pertpy.obs['perturbation'].value_counts().head(20))
perturbation
nan 1527
chr1.10050_top_two_chr11.1801_top_two_chr12.1832_top_two_chr12.732_top_two_chr1.3789_top_two_chr16.3244_top_two_chr20.537_top_two_chr21.240_top_two_chr3.666_top_two_chr5.1476_top_two_chr5.1603_top_two_chr7.3567_top_two_chr8.2697_top_two 6
chr1.10201_top_two_chr11.1778_second_two_chr11.9_top_two_chr12.3850_top_two_chr1.3349_top_two_chr16.4939_top_two_chr16.5184_top_two_chr2.181_top_two_chr2.2107_top_two_chr2.2686_top_two_chr2.3453_top_two_chr3.3203_top_two_chr7.2795_top_two_chrX.333_top_two_GNPDA1 6
chr1.11646_top_two_chr1.7199_top_two_chr19.32_top_two_chr19.4654_top_two_chr6.1231_top_two 6
chr10.3492_top_two_chr14.1108_top_two_chr19.2350_top_two_chr19.3043_top_two_chr2.342_top_two_chr4.2449_top_two_chr5.3584_top_two_chr6.4480_top_two_chr7.2333_top_two_chrX.230_top_two_FKBP2_GUK1 6
chr10.1312_top_two_chr1.5053_top_two_chr17.2443_top_two_chr17.2868_second_two_chr18.1099_top_two_chr18.1234_top_two_chr19.3095_top_two_chr1.9538_second_two_chr1.9914_top_two_chr20.2100_top_two_chr2.6077_top_two_chr2.6219_top_two_chr3.3206_top_two_chr3.5562_top_two_chr3.5794_top_two_chr4.2715_top_two_chr6.1266_top_two_chr6.145_top_two_chr6.2696_top_two_chr8.3568_top_two 5
chr11.1970_top_two_chr11.2132_top_two_chr1.12583_top_two_chr17.4512_top_two_chr3.3031_top_two_chr3.4516_top_two_chr4.517_top_two_chr5.5202_top_two_chr6.2474_top_two_chr6.3646_top_two_chr6.5344_top_two_chr6.6121_top_two_chr7.4462_top_two_chrX.451_top_two_MAD2L1BP_RPL18A 5
chr11.3338_top_two_chr11.653_top_two_chr14.164_top_two_chr14.623_top_two_chr16.1031_top_two_chr17.4171_top_two_chr17.5683_top_two_chr19.415_top_two_chr20.312_top_two_chr22.1059_top_two_chr2.2486_top_two_chr6.1300_top_two_chr9.1385_top_two_chr9.1625_top_two_chrX.2023_top_two_H3F3B_HNRNPF_RPL18A 5
chr15.2083_top_two_chr17.3460_top_two_chr19.5328_top_two_chr4.2820_top_two_chr6.947_top_two_NUDC 5
CAP1_chr10.1724_top_two_chr1.10348_top_two_chr11.2360_top_two_chr11.5215_top_two_chr11.5229_top_two_chr12.2512_top_two_chr1.5905_top_two_chr16.4939_top_two_chr17.1301_top_two_chr17.2157_top_two_chr19.5729_top_two_chr20.2607_top_two_chr20.659_second_two_chr3.2825_top_two_chr3.5843_top_two_chr4.2526_top_two_chr6.477_top_two_CUTA 4
CD47_chr10.205_top_two_chr10.2558_top_two_chr1.4268_top_two_chr15.3009_top_two_chr15.3423_top_two_chr2.1353_top_two_chr2.6473_top_two_chr3.4516_top_two_chr3.5556_top_two_chr5.2017_top_two_chr6.4099_top_two_chr8.150_top_two_chrX.2326_top_two_chrX.769_top_two_chrX.952_second_two 4
DDX3X 4
MRPS36 4
chr1.3819_top_two_chr17.6206_top_two_chr1.8792_top_two_chr19.1559_second_two_chr19.2897_top_two_chr2.1468_top_two_chr5.1621_top_two_chr5.5155_top_two_chr6.1227_top_two_chr6.2485_top_two_chr6.3562_top_two_chr6.5151_top_two_chr8.1665_second_two_chr9.3712_top_two_ZRANB2 4
chr7.6429_top_two 4
chr10.930_top_two_chr1.12093_top_two_chr17.628_top_two_chr1.8052_top_two_chr2.2452_second_two_chr2.6306_top_two_chr2.6797_top_two 4
chr10.1125_top_two_chr10.781_top_two_chr12.2229_top_two_chr12.3972_top_two_chr14.1646_top_two_chr14.2492_top_two_chr1.4515_top_two_chr15.149_top_two_chr15.2334_top_two_chr15.420_top_two_chr19.2680_top_two_chr21.1529_top_two_chr21.1541_top_two_chr2.6700_top_two_chr5.1313_top_two_chr6.1186_top_two_chr6.2309_top_two_chr7.3110_top_two_chr7.3217_top_two_chr7.5867_top_two_chr8.1769_top_two_chr9.3630_top_two_chr9.4042_top_two_chrX.1677_top_two_chrX.239_top_two_KCNH2_NDUFA2 4
chr10.1159_top_two_chr10.144_top_two_chr11.2389_top_two_chr11.5216_top_two_chr11.882_top_two_chr1.4082_top_two_chr14.2068_top_two_chr14.2069_top_two_chr1.5038_top_two_chr15.422_top_two_chr1.6368_top_two_chr17.1202_top_two_chr17.4048_top_two_chr17.5964_top_two_chr1.9584_top_two_chr20.2593_top_two_chr20.738_top_two_chr2.1079_top_two_chr2.5826_top_two_chr3.5443_top_two_chr3.799_top_two_chr4.1579_top_two_chr4.2002_top_two_chr4.3453_top_two_chr5.2650_top_two_chr5.3240_top_two_chr6.4593_top_two_chr6.6121_top_two_chr7.4218_top_two_chr8.103_top_two_chr8.1767_top_two_chr8.2583_second_two_chr8.3346_top_two_chr9.876_top_two_CHRAC1_ZNF593 4
chr10.1312_top_two_chr10.449_top_two_chr11.2761_top_two_chr11.772_top_two_chr15.1881_top_two_chr1.6799_top_two_chr18.579_top_two_chr1.9428_top_two_chr20.659_second_two_chr5.1488_top_two_chr6.3152_top_two_ODC1_TMEM98 4
chr10.1428_top_two_chr10.3635_top_two_chr10.779_top_two_chr10.834_second_two_chr11.1733_top_two_chr11.3243_second_two_chr11.3473_top_two_chr11.5907_second_two_chr13.1177_top_two_chr15.1532_top_two_chr15.3151_top_two_chr17.1759_top_two_chr17.2901_top_two_chr17.6006_top_two_chr1.8051_top_two_chr19.488_top_two_chr20.2622_top_two_chr21.116_top_two_chr2.2328_top_two_chr2.2488_top_two_chr2.6746_top_two_chr3.1798_top_two_chr3.5641_top_two_chr4.738_top_two_chr5.1443_top_two_chr5.705_second_two_chr6.1385_top_two_chr6.3721_top_two_chr7.1855_top_two_chr7.4901_top_two_chr8.1451_top_two_chr9.108_top_two_chrX.1785_top_two_chrX.2030_top_two
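To quantify how many labels look like this, a small sketch along the following lines could split each label into tokens of the form `chrN.M_top_two` / `chrN.M_second_two` and whatever remains (apparently gene symbols). The regular expression is inferred only from the strings printed above, and the helper name `split_label` is hypothetical; I don't know what the `chr...` tokens actually represent.

```python
import re

# Pattern inferred from the printed labels (an assumption, not a documented format):
# "chr" + chromosome + "." + number + "_top_two" or "_second_two".
CHR_TOKEN = re.compile(r"chr(?:\d+|X|Y)\.\d+_(?:top|second)_two")

def split_label(label):
    """Return (number of chr-style tokens, list of remaining tokens) for one label."""
    chr_tokens = CHR_TOKEN.findall(label)
    # Remove the chr-style tokens, then split what is left on underscores.
    remainder = CHR_TOKEN.sub("", label)
    other_tokens = [tok for tok in remainder.split("_") if tok]
    return len(chr_tokens), other_tokens

# Toy labels copied from the output above; with real data, iterate over
# adata_pertpy.obs['perturbation'].dropna().astype(str).unique() instead.
labels = [
    "DDX3X",
    "chr7.6429_top_two",
    "chr15.2083_top_two_chr17.3460_top_two_chr19.5328_top_two_chr4.2820_top_two_chr6.947_top_two_NUDC",
]

for label in labels:
    n_chr, others = split_label(label)
    print(n_chr, others)
```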
Additionally, the original dataset from the authors has 310,385 cells × 8,563 genes, while the Pertpy dataset has 207,324 cells × 13,135 genes. I understand the Pertpy version probably has fewer cells because of QC filtering, but where did the extra genes come from?
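In case it helps narrow this down, here is a minimal sketch of how the two gene lists could be compared. The names `compare_gene_sets`, `genes_original` and `genes_pertpy` are hypothetical, and the toy lists stand in for the `var_names` of each AnnData object. It also assumes both versions use the same kind of identifier (for example gene symbols in both); if one uses Ensembl IDs, the comparison would need a mapping step first.

```python
def compare_gene_sets(genes_original, genes_pertpy):
    """Return genes shared by both versions and genes unique to each one."""
    original = set(genes_original)
    pertpy = set(genes_pertpy)
    return {
        "shared": sorted(original & pertpy),
        "only_original": sorted(original - pertpy),
        "only_pertpy": sorted(pertpy - original),
    }

# Toy stand-ins; with real data, pass adata.var_names from each object.
genes_original = ["DDX3X", "MRPS36", "NUDC"]
genes_pertpy = ["DDX3X", "MRPS36", "NUDC", "CAP1", "CD47"]

result = compare_gene_sets(genes_original, genes_pertpy)
for key, genes in result.items():
    # Print the category, how many genes fall into it, and the genes themselves.
    print(key, len(genes), genes)
```

Inspecting the `only_pertpy` entry on the real data should show which genes are present only in the Pertpy version.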