PyTorch & Torchtext DataLoader Incompatibility
It took me a while to realize that pytorch and torchtext have some incompatibilities in their data abstraction layers, so I figured I'd write up a short post about it.
PyTorch has two main classes for handling data: Dataset and DataLoader. They both live under torch.utils.data. The docs are at https://pytorch.org/docs/stable/data.html#dataset-types, but the gist is that Dataset is a wrapper class around the physical files or sockets, while DataLoader is the piece responsible for iterating over a Dataset: batching, shuffling, and sampling.
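To make that division of labor concrete, here is a minimal sketch of a map-style Dataset handed to a DataLoader. SquaresDataset and its contents are made up for illustration; the only real API used is Dataset, DataLoader, and the batch_size and shuffle arguments from torch.utils.data.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SquaresDataset(Dataset):
    """Hypothetical map-style dataset: item i is the pair (i, i squared)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # DataLoader's default sampler uses this to know how many indices exist.
        return self.n

    def __getitem__(self, idx):
        # The Dataset only knows how to fetch ONE example by index.
        x = torch.tensor(float(idx))
        return x, x ** 2


dataset = SquaresDataset(10)

# The DataLoader decides how examples are grouped and ordered.
loader = DataLoader(dataset, batch_size=4, shuffle=False)

batches = [(xs, ys) for xs, ys in loader]
batch_sizes = [xs.shape[0] for xs, _ in batches]
print(batch_sizes)
```

The Dataset never sees a batch and the DataLoader never touches storage, which is why the two can be swapped independently as long as you stay inside torch.utils.data.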
Torchtext has similar but incompatible types: one is also called Dataset, and the other is Iterator. Notably, they live in torchtext.data.dataset and torchtext.data.iterator. Don't be fooled like I was, though: torchtext.data.dataset and torch.utils.data.dataset are not interchangeable.
It's best to think of these as two completely different tracks. If you want to use anything under torch.utils.data, such as Sampler, Subset, etc., it won't be available if your code already depends on anything from torchtext.
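For reference, this is roughly what the torch.utils.data track looks like when you stay entirely inside it. It's only a sketch: the toy tensors, the 8/2 split, and the seed are my own assumptions, while TensorDataset, Subset, SubsetRandomSampler, and DataLoader are the real classes from torch.utils.data.

```python
import torch
from torch.utils.data import (
    DataLoader,
    Subset,
    SubsetRandomSampler,
    TensorDataset,
)

# Toy data (assumed for illustration): 10 examples, one feature each.
features = torch.arange(10, dtype=torch.float32).unsqueeze(1)  # shape (10, 1)
labels = torch.arange(10) % 2
dataset = TensorDataset(features, labels)

# Subset: a view over a fixed list of indices of the underlying dataset.
val_set = Subset(dataset, [8, 9])
val_loader = DataLoader(val_set, batch_size=2)

# Sampler: tells the DataLoader which indices to draw, and in what order.
generator = torch.Generator().manual_seed(0)
train_sampler = SubsetRandomSampler(list(range(8)), generator=generator)
train_loader = DataLoader(dataset, batch_size=4, sampler=train_sampler)

# Every training index is visited exactly once per epoch, in random order.
seen = sorted(int(v) for xs, _ in train_loader for v in xs.squeeze(1))
val_labels = [ys.tolist() for _, ys in val_loader]
print(seen, val_labels)
```

None of this touches torchtext. The trouble starts when torchtext objects get mixed into a pipeline like this one.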
If you try to mix the two tracks, you'll run into errors like the following:
- https://discuss.pytorch.org/t/out-of-vocabulary-keyerror-on-vocab-stoi-in-torchtext/65910
  arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
  KeyError: None
- https://github.com/pytorch/text/issues/618
  TypeError: 'DataLoader' object is not callable
- https://discuss.pytorch.org/t/typeerror-dataloader-object-is-not-callable/74979
It looks like the torchtext people are working on it (https://github.com/pytorch/text/issues/664), although given that the issue has been open since December 2019 it's unclear when these changes will get upstreamed into master.