How to load GEO datasets for analysis using Scanpy / Scvi tools?

mostafa-ti · October 6, 2022, 7:00pm

I have a question about getting data from GEO into a Scanpy (AnnData) object.

I want to turn the GSE81608 dataset into a Scanpy object. (GEO Accession viewer)

Usually I build the object from three files (barcodes, features, matrix), but GSE81608 provides only a single txt file.

So my question is: how can I convert the GSE81608 dataset into a Scanpy object?

Thank you.

ivirshup · October 12, 2022, 10:18pm

There’s no standardization of files uploaded to GEO, so you’re going to have to figure it out on a dataset-by-dataset basis.

mostafa-ti · October 17, 2022, 6:06pm

Thanks for the reply.
There is a lot of publicly available RNA-seq data, so I’m wondering whether there is a general way to get it into the right format for Scanpy.
Any ideas?

PauBadiaM · November 7, 2022, 8:50am

Hi @mostafa-ti

As @ivirshup mentioned, the problem is that each GEO entry has a different format: some store just a csv, others h5, others zip files, etc. So, depending on the dataset, you will need to tailor the processing that turns it into an AnnData object.

The Bioconductor community has previously tackled this problem with recount (recount2: analysis-ready RNA-seq gene and exon counts datasets), a resource consisting of many RNA-seq datasets available in the SummarizedExperiment format. You could retrieve these objects and then convert them to AnnData using zellkonverter (Conversion Between scRNA-seq Objects • zellkonverter). Unfortunately, I don’t think there is an alternative that is 100% native to Python. The closest thing is the scanpy function sc.datasets.ebi_expression_atlas(), which lets you download scRNA-seq datasets stored in the EBI Single Cell Expression Atlas (https://www.ebi.ac.uk/gxa/sc/experiments).

Alternatively, if you are interested in only one dataset from GEO, you can always download it and manually process it into an AnnData object. For an example, check the beginning of this vignette in decoupler: Bulk functional analysis — decoupler 1.2.1 documentation

Hope this is helpful!

mostafa-ti · November 9, 2022, 4:28pm

Hi @PauBadiaM
Thanks a lot for the comprehensive reply. The resources you pointed to are very useful.

Best,
Mostafa

adeslatt · October 25, 2024, 10:09am

Just curious whether there is an update here on pulling in GEO data the way EMBL-EBI ArrayExpress data is pulled in.

zEpoch · November 6, 2024, 5:24am

I can give an example. Data source: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6231125

import anndata as ad
import pandas as pd
from scipy.io import mmread
from scipy.sparse import csr_matrix

# Gene annotations: this file has three columns (gene ID, gene name, feature type)
genes = pd.read_csv('../raw/GSM6231125_PclafWT_features.tsv.gz', sep='\t', header=None)
genes.columns = ['id', 'name', 'feature_type']
genes.index = genes['name'].tolist()

# The .mtx file is genes x cells; AnnData expects cells x genes, so transpose
mtx = mmread('../raw/GSM6231125_PclafWT_matrix.mtx.gz')
mtx = csr_matrix(mtx.T)

# Cell barcodes become the obs index (no extra obs columns)
barcodes = pd.read_csv('../raw/GSM6231125_PclafWT_barcodes.tsv.gz', header=None)
obs = pd.DataFrame(index=barcodes[0].astype(str).tolist())

adata = ad.AnnData(X=mtx, obs=obs, var=genes)
adata.var_names_make_unique()  # gene names can be duplicated
adata.write('../data/GSM6231125_PclafWT.h5ad')

Hope this helps.
