Not able to combine two samples for TCR analysis

Hello!
I’m having issues running two of my samples (Control & KO) following the example in your tutorials ‘combining multiple samples’. When I run the loop for the two files I get an error at the sc.read_10x_vdj stage (I have also tried using sc.read_h5ad). Any idea why this is?

Below is the code with the error:

samples = {
    "Control": {"group": "Control"},
    "KO": {"group": "KO"},
}
adatas_tcr = {}
adatas_gex = {}
for sample, sample_meta in samples.items():
    gex_file = glob(f"/GEX_outs/h5 files/*{sample}*.h5")
    tcr_file = glob(f"/VDJ_outs/csv files/*{sample}*.csv")
    adata_gex = sc.read_h5ad(gex_file)
    adata_tcr = ir.io.read_10x_vdj(tcr_file)


TypeError                                 Traceback (most recent call last)
Input In [12], in <cell line: 4>()
      5 gex_file = glob(f"/GEX_outs/h5 files/*{sample}*.h5")
      6 tcr_file = glob(f"/VDJ_outs/csv files/*{sample}*.csv")
----> 7 adata_gex = sc.read_h5ad(gex_file)
      8 adata_tcr = ir.io.read_10x_vdj(tcr_file)

File ~/miniconda3/envs/VDJ/lib/python3.8/site-packages/anndata/_io/h5ad.py:219, in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size)
    211         raise NotImplementedError(
    212             "Currently only `X` and `raw/X` can be read as sparse."
    213         )
    215 rdasp = partial(
    216     read_dense_as_sparse, sparse_format=as_sparse_fmt, axis_chunk=chunk_size
    217 )
--> 219 with h5py.File(filename, "r") as f:
    221     def callback(func, elem_name: str, elem, iospec):
    222         if iospec.encoding_type == "anndata" or elem_name.endswith("/"):

File ~/miniconda3/envs/VDJ/lib/python3.8/site-packages/h5py/_hl/files.py:542, in File.__init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, fs_page_size, page_buf_size, min_meta_keep, min_raw_keep, locking, alignment_threshold, alignment_interval, meta_block_size, **kwds)
    540     name = repr(name).encode('ASCII', 'replace')
    541 else:
--> 542     name = filename_encode(name)
    544 if track_order is None:
    545     track_order = h5.get_config().track_order

File ~/miniconda3/envs/VDJ/lib/python3.8/site-packages/h5py/_hl/compat.py:19, in filename_encode(filename)
     11 def filename_encode(filename):
     12     """
     13     Encode filename for use in the HDF5 library.
     14 
   (...)
     17     filenames in h5py for more information.
     18     """
---> 19     filename = fspath(filename)
     20     if sys.platform == "win32":
     21         if isinstance(filename, str):
TypeError: expected str, bytes or os.PathLike object, not list

Hi @mgelm,

the problem here is that glob will always give you a list, even if it finds just one file.
So you’ll need to do

gex_file = glob(f"/GEX_outs/h5 files/*{sample}*.h5")[0]

As an additional sanity check (to make sure that there’s really only one file found), I would recommend to actually do

gex_file = glob(f"/GEX_outs/h5 files/*{sample}*.h5")
assert len(gex_file) == 1
gex_file = gex_file[0]

Further down, I spotted another issue: sc.read_h5ad is only for AnnData h5ad files. To read in the 10x h5 files, you’ll need sc.read_10x_h5 instead.

Cheers,
Gregor

Hi @grst,
thanks so much for replying! I tried to run the code as you suggested but still get an error:

adatas_tcr = {}
adatas_gex = {}
for sample, sample_meta in samples.items():
    gex_file = glob(f"/GEX_outs/h5 files/*{sample}*.h5")[0]
    assert len(gex_file) == 1
    gex_file = gex_file[0]
    tcr_file = glob(f"/VDJ_outs/csv files/*{sample}*.csv")[0]
    adata_gex = sc.read_10x_h5(gex_file)
    adata_tcr = ir.io.read_10x_vdj(tcr_file)

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Input In [9], in <cell line: 3>()
      2 adatas_gex = {}
      3 for sample, sample_meta in samples.items():
----> 4     gex_file = glob(f"/GEX_outs/h5 files/*{sample}*.h5")[0]
      5     assert len(gex_file) == 1
      6     gex_file = gex_file[0]

IndexError: list index out of range

I’ve attached how the files look in the /h5 files and /csv files folders. Any ideas why this keeps happening?

Well, this means that glob can’t find the file.

In your case I suspect this is due to the fact that your path starts with / which refers to an absolute path. Assuming that GEX_outs is in your current directory, you need to specify a relative path that starts with ./, i.e.

gex_file = glob(f"./GEX_outs/h5 files/*{sample}*.h5")[0]

Hi Gregor,
Thank you for the suggestions! I tried to run with the code you suggested but I now get an assertion error.

adatas_tcr = {}
adatas_gex = {}
for sample, sample_meta in samples.items():
    gex_file = glob(f"./GEX_outs/h5 files/*{sample}*.h5")[0]
    assert len(gex_file) == 1
    gex_file = gex_file[0]
    tcr_file = glob(f"./VDJ_outs/csv files/*{sample}*.csv")[0]
    adata_gex = sc.read_10x_h5(gex_file)
    adata_tcr = ir.io.read_10x_vdj(tcr_file)

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [15], in <cell line: 3>()
      3 for sample, sample_meta in samples.items():
      4     gex_file = glob(f"/mnt/c/Users/ha20203/OneDrive - Queen Mary, University of London (1)/Krt76 Thymus Project/Results/Bioinformatics analysis/scRNAseq Data_Genewiz/GEX&TCR Krt76 KO_Het (40-708596444)/GEX_outs/h5 files/*{sample}*.h5")[0]
----> 5     assert len(gex_file) == 1
      6     gex_file = gex_file[0]
      7     tcr_file = glob(f"/mnt/c/Users/ha20203/OneDrive - Queen Mary, University of London (1)/Krt76 Thymus Project/Results/Bioinformatics analysis/scRNAseq Data_Genewiz/GEX&TCR Krt76 KO_Het (40-708596444)/VDJ_outs/csv files/*{sample}*.csv")[0]

AssertionError:

When I remove the assert len code, it reads the files but again only remembers the KO sample without maintaining both control and KO sample data. This is the recurring error that I keep having, later in the merged IR + GEX I end up just with my KO barcodes. Any ideas why this is happening? I have tried inserting the whole file path too and the same error occurs.

print(samples.items())
print(sample_meta)
print(sample)

dict_items([('Control', {'group': 'Control'}), ('KO', {'group': 'KO'})])
{'group': 'KO'}
KO

Many thanks!
Mara

you have [0] twice, so len actually checks the length of the filename, not the length of the list… In such a case it helps to run the code line by line and use print or the jupyter variable inspector to see what’s happening.

it reads the files but again only remembers the KO sample without maintaining both control and KO sample data.

not sure if that’s what you mean, but you are not saving adata_gex and adata_tcr in your adatas_xxx dictionary, and the variables get overwritten in each interation of the loop.