PyTorch & Torchtext DataLoader Incompatibility

It took me a while to realize that PyTorch and torchtext have some incompatibilities in their data abstraction layers, so I figured I'd write up a short post about it.

PyTorch has two main classes for handling data: Dataset and DataLoader. Both live under torch.utils.data (https://pytorch.org/docs/stable/data.html#dataset-types). The gist is that a Dataset wraps the underlying data source, such as files or sockets, and hands out individual samples, while a DataLoader is the part responsible for batching and shuffling those samples.
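To make that division of labour concrete, here's a minimal sketch. SquaresDataset is a made-up toy class for illustration (it isn't part of PyTorch); only Dataset and DataLoader come from torch.utils.data.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SquaresDataset(Dataset):
    """Hypothetical map-style dataset: item i is the pair (i, i squared)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        # The DataLoader uses this to know how many samples exist.
        return self.n

    def __getitem__(self, idx):
        # Return one sample; the DataLoader stacks samples into batches.
        x = torch.tensor(float(idx))
        return x, x ** 2


dataset = SquaresDataset(10)
loader = DataLoader(dataset, batch_size=4, shuffle=False)

# Ten samples at batch_size=4 gives two full batches and one partial batch.
batch_shapes = [tuple(xs.shape) for xs, ys in loader]
```

The Dataset only knows how to produce a single sample by index. Everything about grouping samples together lives in the DataLoader.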

Torchtext has similar but incompatible types, called Dataset and Iterator, which live at torchtext.data.Dataset and torchtext.data.Iterator. Don't be fooled like I was, though: torchtext.data.Dataset and torch.utils.data.Dataset are not interchangeable.

It's best to think of these as two completely separate tracks. If your code already depends on torchtext's data classes, the utilities under torch.utils.data, such as Sampler and Subset, won't be available to you.
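For reference, here's a rough sketch of what staying entirely on the torch.utils.data track looks like. The toy data (the integers 0 through 9) is made up for illustration. Subset, SequentialSampler, TensorDataset, and DataLoader are all real torch.utils.data classes.

```python
import torch
from torch.utils.data import DataLoader, SequentialSampler, Subset, TensorDataset

# Toy data for illustration: ten samples holding the integers 0..9.
data = TensorDataset(torch.arange(10))

# Subset restricts a Dataset to the given indices (here, the first five).
first_half = Subset(data, indices=list(range(5)))

# A Sampler decides the order in which the DataLoader visits the indices.
loader = DataLoader(
    first_half,
    batch_size=2,
    sampler=SequentialSampler(first_half),
)

# TensorDataset yields 1-tuples, so each batch is a one-element sequence.
batches = [batch[0].tolist() for batch in loader]
```

These pieces compose because they all agree on the torch.utils.data.Dataset interface. Mixing them with torchtext's Dataset or Iterator is where things break.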

You'll run into any of the following errors if you try to mix the two:

It looks like the torchtext people are working on it (https://github.com/pytorch/text/issues/664), although given that the issue has been open since December 2019, it's unclear when these changes will land in master.

Posted: 2020-12-14
Filed Under: computer