How to load GEO datasets for analysis using Scanpy / Scvi tools?

I have a question about loading data from GEO into a scanpy object.

I want to turn the GSE81608 dataset into a scanpy object. (GEO Accession viewer)

Usually I build a scanpy object from 3 files (barcodes, features, matrix), but GSE81608 has only one txt file.

So I want to ask: how can I convert the GSE81608 dataset into a scanpy object?

Thank you.

There’s no standardization of the files uploaded to GEO, so you’re going to have to figure it out on a dataset-by-dataset basis.

Thanks for the reply.
There is a lot of publicly available RNA-seq data. I’m wondering how to get it into the proper format for scanpy.
Any ideas?

Hi @mostafa-ti

As @ivirshup mentioned, the problem is that each GEO entry has a different format: some store just a csv, others h5, others zip files, etc. So you will need to tailor the processing to each dataset to get an AnnData object.

The Bioconductor community has previously tackled this problem with recount (recount2: analysis-ready RNA-seq gene and exon counts datasets), a resource that provides many RNA-seq datasets in the SummarizedExperiment format. You could retrieve these objects and then convert them to AnnData using zellkonverter (Conversion Between scRNA-seq Objects • zellkonverter). Unfortunately, I don’t think there is an alternative that is 100% native to Python. The closest thing is the scanpy function sc.datasets.ebi_expression_atlas(), which lets you download scRNA-seq datasets stored in the EBI Single Cell Expression Atlas (https://www.ebi.ac.uk/gxa/sc/experiments).

Alternatively, if you are interested in only one dataset from GEO, you can always download it and manually process it into an AnnData object. For an example, check the beginning of this vignette in decoupler: Bulk functional analysis — decoupler 1.2.1 documentation

Hope this is helpful!

Hi @PauBadiaM
Thanks a lot for the comprehensive reply. The resources you pointed to are very helpful.

Best,
Mostafa

Just curious if there is an update here regarding pulling in GEO data like the EMBL-EBI ArrayExpress data is pulled in.

I can give an example. Data source: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6231125

import anndata as ad
import scanpy as sc
import pandas as pd
from scipy.io import mmread
from scipy.sparse import csr_matrix

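# Gene annotations: this file has three tab-separated columns and no header.
genes = pd.read_csv('../raw/GSM6231125_PclafWT_features.tsv.gz', sep='\t', header=None)
genes.columns = ['id', 'name', 'o']
genes.index = genes['name'].tolist()  # use gene names as var_names

# The matrix file is genes x cells; AnnData wants cells x genes, so transpose.
mtx = mmread('../raw/GSM6231125_PclafWT_matrix.mtx.gz')
mtx = mtx.T
mtx = csr_matrix(mtx)

# Cell barcodes: one per line, used only as the obs index.
barcodes = pd.read_csv('../raw/GSM6231125_PclafWT_barcodes.tsv.gz', header=None)
barcodes.columns = ['0']
barcodes.index = barcodes['0'].tolist()
del barcodes['0']

adata = ad.AnnData(X=mtx, obs=barcodes, var=genes)
adata.var_names_make_unique()  # gene names can repeat, so de-duplicate them
adata.write('../data/GSM6231125_PclafWT.h5ad')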

Hope this helps.