Not able to combine two samples for TCR analysis

mgelm · June 28, 2023, 3:23pm

Hello!
I’m having issues running two of my samples (Control & KO) following the example in your tutorials ‘combining multiple samples’. When I run the loop for the two files I get an error at the sc.read_10x_vdj stage (I have also tried using sc.read_h5ad). Any idea why this is?

Below is the code with the error:

samples = {
    "Control": {"group": "Control"},
    "KO": {"group": "KO"},
}
adatas_tcr = {}
adatas_gex = {}
for sample, sample_meta in samples.items():
    gex_file = glob(f"/GEX_outs/h5 files/*{sample}*.h5")
    tcr_file = glob(f"/VDJ_outs/csv files/*{sample}*.csv")
    adata_gex = sc.read_h5ad(gex_file)
    adata_tcr = ir.io.read_10x_vdj(tcr_file)


TypeError                                 Traceback (most recent call last)
Input In [12], in <cell line: 4>()
      5 gex_file = glob(f"/GEX_outs/h5 files/*{sample}*.h5")
      6 tcr_file = glob(f"/VDJ_outs/csv files/*{sample}*.csv")
----> 7 adata_gex = sc.read_h5ad(gex_file)
      8 adata_tcr = ir.io.read_10x_vdj(tcr_file)

File ~/miniconda3/envs/VDJ/lib/python3.8/site-packages/anndata/_io/h5ad.py:219, in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size)
    211         raise NotImplementedError(
    212             "Currently only `X` and `raw/X` can be read as sparse."
    213         )
    215 rdasp = partial(
    216     read_dense_as_sparse, sparse_format=as_sparse_fmt, axis_chunk=chunk_size
    217 )
--> 219 with h5py.File(filename, "r") as f:
    221     def callback(func, elem_name: str, elem, iospec):
    222         if iospec.encoding_type == "anndata" or elem_name.endswith("/"):

File ~/miniconda3/envs/VDJ/lib/python3.8/site-packages/h5py/_hl/files.py:542, in File.__init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, fs_page_size, page_buf_size, min_meta_keep, min_raw_keep, locking, alignment_threshold, alignment_interval, meta_block_size, **kwds)
    540     name = repr(name).encode('ASCII', 'replace')
    541 else:
--> 542     name = filename_encode(name)
    544 if track_order is None:
    545     track_order = h5.get_config().track_order

File ~/miniconda3/envs/VDJ/lib/python3.8/site-packages/h5py/_hl/compat.py:19, in filename_encode(filename)
     11 def filename_encode(filename):
     12     """
     13     Encode filename for use in the HDF5 library.
     14 
   (...)
     17     filenames in h5py for more information.
     18     """
---> 19     filename = fspath(filename)
     20     if sys.platform == "win32":
     21         if isinstance(filename, str):
TypeError: expected str, bytes or os.PathLike object, not list

grst · June 28, 2023, 8:17pm

Hi @mgelm,

the problem here is that glob will always give you a list, even if it finds just one file.
So you’ll need to do

gex_file = glob(f"/GEX_outs/h5 files/*{sample}*.h5")[0]

As an additional sanity check (to make sure that there’s really only one file found), I would recommend to actually do

gex_file = glob(f"/GEX_outs/h5 files/*{sample}*.h5")
assert len(gex_file) == 1
gex_file = gex_file[0]

Further down, I spotted another issue: sc.read_h5ad is only for AnnData h5ad files. To read in the 10x h5 files, you’ll need sc.read_10x_h5 instead.

Cheers,
Gregor

mgelm · June 29, 2023, 9:55am

Hi @grst,
thanks so much for replying! I tried to run the code as you suggested but still get an error:

adatas_tcr = {}
adatas_gex = {}
for sample, sample_meta in samples.items():
    gex_file = glob(f"/GEX_outs/h5 files/*{sample}*.h5")[0]
    assert len(gex_file) == 1
    gex_file = gex_file[0]
    tcr_file = glob(f"/VDJ_outs/csv files/*{sample}*.csv")[0]
    adata_gex = sc.read_10x_h5(gex_file)
    adata_tcr = ir.io.read_10x_vdj(tcr_file)

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Input In [9], in <cell line: 3>()
      2 adatas_gex = {}
      3 for sample, sample_meta in samples.items():
----> 4     gex_file = glob(f"/GEX_outs/h5 files/*{sample}*.h5")[0]
      5     assert len(gex_file) == 1
      6     gex_file = gex_file[0]

IndexError: list index out of range

I’ve attached how the files look in the /h5 files and /csv files folders. Any ideas why this keeps happening?

grst · June 29, 2023, 2:16pm

Well, this means that glob can’t find the file.

In your case I suspect this is due to the fact that your path starts with / which refers to an absolute path. Assuming that GEX_outs is in your current directory, you need to specify a relative path that starts with ./, i.e.

gex_file = glob(f"./GEX_outs/h5 files/*{sample}*.h5")[0]

mgelm · June 29, 2023, 3:14pm

Hi Gregor,
Thank you for the suggestions! I tried to run with the code you suggested but I now get an assertion error.

adatas_tcr = {}
adatas_gex = {}
for sample, sample_meta in samples.items():
    gex_file = glob(f"./GEX_outs/h5 files/*{sample}*.h5")[0]
    assert len(gex_file) == 1
    gex_file = gex_file[0]
    tcr_file = glob(f"./VDJ_outs/csv files/*{sample}*.csv")[0]
    adata_gex = sc.read_10x_h5(gex_file)
    adata_tcr = ir.io.read_10x_vdj(tcr_file)

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [15], in <cell line: 3>()
      3 for sample, sample_meta in samples.items():
      4     gex_file = glob(f"/mnt/c/Users/ha20203/OneDrive - Queen Mary, University of London (1)/Krt76 Thymus Project/Results/Bioinformatics analysis/scRNAseq Data_Genewiz/GEX&TCR Krt76 KO_Het (40-708596444)/GEX_outs/h5 files/*{sample}*.h5")[0]
----> 5     assert len(gex_file) == 1
      6     gex_file = gex_file[0]
      7     tcr_file = glob(f"/mnt/c/Users/ha20203/OneDrive - Queen Mary, University of London (1)/Krt76 Thymus Project/Results/Bioinformatics analysis/scRNAseq Data_Genewiz/GEX&TCR Krt76 KO_Het (40-708596444)/VDJ_outs/csv files/*{sample}*.csv")[0]

AssertionError:

When I remove the assert len code, it reads the files but again only remembers the KO sample without maintaining both control and KO sample data. This is the recurring error that I keep having, later in the merged IR + GEX I end up just with my KO barcodes. Any ideas why this is happening? I have tried inserting the whole file path too and the same error occurs.

print(samples.items())
print(sample_meta)
print(sample)

dict_items([('Control', {'group': 'Control'}), ('KO', {'group': 'KO'})])
{'group': 'KO'}
KO

Many thanks!
Mara

grst · June 30, 2023, 6:12am

you have [0] twice, so len actually checks the length of the filename, not the length of the list… In such a case it helps to run the code line by line and use print or the jupyter variable inspector to see what’s happening.

it reads the files but again only remembers the KO sample without maintaining both control and KO sample data.

not sure if that’s what you mean, but you are not saving adata_gex and adata_tcr in your adatas_xxx dictionary, and the variables get overwritten in each interation of the loop.

Topic		Replies	Views
Read_10x_h5 Error scanpy	1	812	May 23, 2023
Trouble Reading .h5 files scanpy	1	3264	April 13, 2023
Issue reading h5ad file scanpy	1	2814	January 12, 2023
Not able to load TCR data scirpy	9	610	September 26, 2022
Problem with airr data loading scirpy	11	700	October 3, 2023

Not able to combine two samples for TCR analysis

Related topics