# Interfacing data
{:.no_toc}
<nav markdown="1" class="toc-class">
* TOC
{:toc}
</nav>
## The goal

We need to handle our data and make it accessible for PyTorch.

Questions to [David Rotermund](mailto:davrot@uni-bremen.de)

There are several options for interfacing your data.
## [torch.utils.data.TensorDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset)
```python
CLASS torch.utils.data.TensorDataset(*tensors)
```
> Dataset wrapping tensors.
>
> Each sample will be retrieved by indexing tensors along the first dimension.
>
> **\*tensors** : (Tensor) tensors that have the same size of the first dimension.
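
A minimal sketch of how **TensorDataset** can be used; the tensor shapes and variable names here are just an illustrative assumption:

```python
import torch
from torch.utils.data import TensorDataset

# Hypothetical data: 100 patterns of size 28x28 and one integer label per pattern.
patterns = torch.rand((100, 28, 28))
labels = torch.randint(low=0, high=10, size=(100,))

dataset = TensorDataset(patterns, labels)

print(len(dataset))  # 100
pattern, label = dataset[0]  # indexing along the first dimension
print(pattern.shape)  # torch.Size([28, 28])
```
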
## [torch.utils.data.Dataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset)
In case we are not able to load the full dataset into memory, **torch.utils.data.Dataset** is very helpful.
```python
CLASS torch.utils.data.Dataset(*args, **kwds)
```
> An abstract class representing a Dataset.
>
> All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite **\_\_getitem\_\_()**, supporting fetching a data sample for a given key. Subclasses could also optionally overwrite **\_\_len\_\_()**, which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement **\_\_getitems\_\_()**, for speedup batched samples loading. This method accepts list of indices of samples of batch and returns list of samples.

We need to create a new class which is derived from **torch.utils.data.Dataset**. We can do whatever we want in this class as long as we provide the functions

* **\_\_len\_\_()** : gives us the number of patterns in the dataset
* **\_\_getitem\_\_(index)** : gives us the information about ONE pattern at position index in the dataset. In the following example, I return the image as a 3d torch.Tensor and the corresponding class for that pattern (for which I use int).

We have a lot of freedom for our own design, e.g.:

* The argument **train: bool** of the constructor was introduced by me.
* **\_\_getitem\_\_(index)** doesn't need to return the data for that pattern in exactly this way (i.e. the order, types, and number of variables are up to us).

We assume that the data is stored in the following four files:

* train_pattern_storage.npy
* train_label_storage.npy
* test_pattern_storage.npy
* test_label_storage.npy
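
If you just want to try out the example below, you could generate small dummy files like this (a sketch with made-up shapes and pattern counts; your real data will look different):

```python
import numpy as np

# Hypothetical dummy data: 28x28 patterns with labels 0..9,
# just to make the example below runnable.
rng = np.random.default_rng(seed=42)
for prefix, number_of_pattern in (("train", 600), ("test", 100)):
    pattern = rng.random((number_of_pattern, 28, 28), dtype=np.float32)
    label = rng.integers(low=0, high=10, size=(number_of_pattern,))
    np.save(f"{prefix}_pattern_storage.npy", pattern)
    np.save(f"{prefix}_label_storage.npy", label)
```
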
```python
import numpy as np
import torch


class MyDataset(torch.utils.data.Dataset):
    # Initialize
    def __init__(self, train: bool = False) -> None:
        super(MyDataset, self).__init__()

        if train is True:
            self.pattern_storage: np.ndarray = np.load("train_pattern_storage.npy")
            self.label_storage: np.ndarray = np.load("train_label_storage.npy")
        else:
            self.pattern_storage = np.load("test_pattern_storage.npy")
            self.label_storage = np.load("test_label_storage.npy")

        self.pattern_storage = self.pattern_storage.astype(np.float32)
        self.pattern_storage /= np.max(self.pattern_storage)

        # How many patterns are there?
        self.number_of_pattern: int = self.label_storage.shape[0]

    def __len__(self) -> int:
        return self.number_of_pattern

    # Get one pattern at position index
    def __getitem__(self, index: int) -> tuple[torch.Tensor, int]:
        image = torch.tensor(self.pattern_storage[index, np.newaxis, :, :])
        target = int(self.label_storage[index])

        return image, target


if __name__ == "__main__":
    pass
```
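
With the class above in place, the dataset can be handed to a **torch.utils.data.DataLoader** to iterate over mini-batches; the batch size here is just an assumed example value:

```python
import torch

# Hypothetical usage: iterate over mini-batches of the training data.
train_data = MyDataset(train=True)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True)

for image, target in train_loader:
    # image: float tensor [batch, 1, height, width], target: int tensor [batch]
    print(image.shape, target.shape)
    break
```
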