2023-12-20 16:59:11 +01:00
# Class
{:.no_toc}
< nav markdown = "1" class = "toc-class" >
* TOC
{:toc}
< / nav >
## The goal
Class has a very important job as a core container type in Python. It is really hard to find a good overview how to use them in a good practice manner.
Questions to [David Rotermund ](mailto:davrot@uni-bremen.de )
**I will use Linux. You will need a replacement for gzip under Windows.**
## Download the files
We need to download the MNIST database files
2024-01-02 21:12:43 +01:00
* t10k-images.idx3-ubyte.gz
* t10k-labels.idx1-ubyte.gz
* train-images.idx3-ubyte.gz
* train-labels.idx1-ubyte.gz
2023-12-20 19:21:50 +01:00
2024-01-02 21:12:43 +01:00
A source for that is for example [https://www.kaggle.com/datasets/hojjatk/mnist-dataset?resource=download ](https://www.kaggle.com/datasets/hojjatk/mnist-dataset?resource=download )
2023-12-20 16:59:11 +01:00
## Unpack the gz files
In a terminal:
```shell
gzip -d *.gz
```
## Convert the data into numpy files
2023-12-20 19:21:12 +01:00
### [numpy.dtype.newbyteorder](https://numpy.org/doc/stable/reference/generated/numpy.dtype.newbyteorder.html)
```python
type.newbyteorder(new_order='S', /)
```
> Return a new dtype with a different byte order.
>
> Changes are also made in all fields and sub-arrays of the data type.
>
> **new_order** : string, optional
>
> Byte order to force; a value from the byte order specifications below. The default value (‘ S’ ) results in swapping the current byte order. new_order codes can be any of:
>
> ‘ S’ - swap dtype from current to opposite endian
>
> {‘ <’ , ‘ little’ } - little endian
>
> {‘ >’ , ‘ big’ } - big endian
>
> {‘ =’ , ‘ native’ } - native order
>
> {‘ \|’ , ‘ I’ } - ignore (no change to byte order)
>
2023-12-20 16:59:11 +01:00
### Label file structure
2023-12-20 17:56:36 +01:00
> [offset] [type] [value] [description]
2023-12-20 19:21:12 +01:00
>
2023-12-20 17:56:36 +01:00
> 0000 32 bit integer 0x00000801(2049) magic number (MSB first)
2023-12-20 19:21:12 +01:00
>
2023-12-20 17:56:36 +01:00
> 0004 32 bit integer 60000 number of items
2023-12-20 19:21:12 +01:00
>
2023-12-20 17:56:36 +01:00
> 0008 unsigned byte ?? label
2023-12-20 19:21:12 +01:00
>
2023-12-20 17:56:36 +01:00
> 0009 unsigned byte ?? label
2023-12-20 19:21:12 +01:00
>
2023-12-20 17:56:36 +01:00
> ........
2023-12-20 19:21:12 +01:00
>
2023-12-20 17:56:36 +01:00
> xxxx unsigned byte ?? label
The labels values are 0 to 9.
2023-12-20 16:59:11 +01:00
### Pattern file structure
2023-12-20 17:56:36 +01:00
> [offset] [type] [value] [description]
2023-12-20 19:21:12 +01:00
>
2023-12-20 17:56:36 +01:00
> 0000 32 bit integer 0x00000803(2051) magic number
2023-12-20 19:21:12 +01:00
>
2023-12-20 17:56:36 +01:00
> 0004 32 bit integer 60000 number of images
2023-12-20 19:21:12 +01:00
>
2023-12-20 17:56:36 +01:00
> 0008 32 bit integer 28 number of rows
2023-12-20 19:21:12 +01:00
>
2023-12-20 17:56:36 +01:00
> 0012 32 bit integer 28 number of columns
2023-12-20 19:21:12 +01:00
>
2023-12-20 17:56:36 +01:00
> 0016 unsigned byte ?? pixel
2023-12-20 19:21:12 +01:00
>
2023-12-20 17:56:36 +01:00
> 0017 unsigned byte ?? pixel
2023-12-20 19:21:12 +01:00
>
2023-12-20 17:56:36 +01:00
> ........
2023-12-20 19:21:12 +01:00
>
2023-12-20 17:56:36 +01:00
> xxxx unsigned byte ?? pixel
Pixels are organized row-wise.
Pixel values are 0 to 255. 0 means background (white),
255 means foreground (black).
## Converting the dataset to numpy
2023-12-20 16:59:11 +01:00
My source code for that task: convert.py
2023-12-20 17:56:36 +01:00
```python
2023-12-20 16:59:11 +01:00
import numpy as np
# [offset] [type] [value] [description]
# 0000 32 bit integer 0x00000801(2049) magic number (MSB first)
# 0004 32 bit integer 60000 number of items
# 0008 unsigned byte ?? label
# 0009 unsigned byte ?? label
# ........
# xxxx unsigned byte ?? label
# The labels values are 0 to 9.
class ReadLabel:
"""Class for reading the labels from an MNIST label file"""
def __init__ (self, filename: str) -> None:
self.filename: str = filename
self.data = self.read_from_file(filename)
def read_from_file(self, filename: str) -> np.ndarray:
int_32bit_data = np.dtype(np.uint32)
int_32bit_data = int_32bit_data.newbyteorder(">")
with open(filename, "rb") as file:
magic_flag: np.uint32 = np.frombuffer(file.read(4), int_32bit_data)[0]
if magic_flag != 2049:
data: np.ndarray = np.zeros(0)
number_of_elements: int = 0
else:
number_of_elements = np.frombuffer(file.read(4), int_32bit_data)[0]
if number_of_elements < 1:
data = np.zeros(0)
else:
data = np.frombuffer(file.read(number_of_elements), dtype=np.uint8)
return data
# [offset] [type] [value] [description]
# 0000 32 bit integer 0x00000803(2051) magic number
# 0004 32 bit integer 60000 number of images
# 0008 32 bit integer 28 number of rows
# 0012 32 bit integer 28 number of columns
# 0016 unsigned byte ?? pixel
# 0017 unsigned byte ?? pixel
# ........
# xxxx unsigned byte ?? pixel
# Pixels are organized row-wise.
# Pixel values are 0 to 255. 0 means background (white),
# 255 means foreground (black).
class ReadPicture:
"""Class for reading the images from an MNIST image file"""
def __init__ (self, filename: str) -> None:
self.filename: str = filename
self.Data = self.read_from_file(filename)
def read_from_file(self, filename: str) -> np.ndarray:
int_32bit_data = np.dtype(np.uint32)
int_32bit_data = int_32bit_data.newbyteorder(">")
with open(filename, "rb") as file:
magic_flag = np.frombuffer(file.read(4), int_32bit_data)[0]
if magic_flag != 2051:
data = np.zeros(0)
number_of_elements: int = 0
else:
number_of_elements = np.frombuffer(file.read(4), int_32bit_data)[0]
if number_of_elements < 1:
data = np.zeros(0)
number_of_rows: int = 0
else:
number_of_rows = np.frombuffer(file.read(4), int_32bit_data)[0]
if number_of_rows != 28:
data = np.zeros(0)
number_of_columns: int = 0
else:
number_of_columns = np.frombuffer(file.read(4), int_32bit_data)[0]
if number_of_columns != 28:
data = np.zeros(0)
else:
data = np.frombuffer(
file.read(number_of_elements * number_of_rows * number_of_columns),
dtype=np.uint8,
)
data = data.reshape(
number_of_elements, number_of_columns, number_of_rows
)
return data
def proprocess_dataset(testdata_mode: bool) -> None:
if testdata_mode is True:
filename_out_pattern: str = "test_pattern_storage.npy"
filename_out_label: str = "test_label_storage.npy"
2024-01-02 21:12:43 +01:00
filename_in_image: str = "t10k-images.idx3-ubyte"
filename_in_label: str = "t10k-labels.idx1-ubyte"
2023-12-20 16:59:11 +01:00
else:
filename_out_pattern = "train_pattern_storage.npy"
filename_out_label = "train_label_storage.npy"
2024-01-02 21:12:43 +01:00
filename_in_image = "train-images.idx3-ubyte"
filename_in_label = "train-labels.idx1-ubyte"
2023-12-20 16:59:11 +01:00
pictures = ReadPicture(filename_in_image)
labels = ReadLabel(filename_in_label)
# Down to 0 ... 1.0
max_value = np.max(pictures.Data.astype(np.float32))
pattern_storage = np.float32(pictures.Data.astype(np.float32) / max_value).astype(
np.float32
)
label_storage = np.uint64(labels.data)
np.save(filename_out_pattern, pattern_storage)
np.save(filename_out_label, label_storage)
proprocess_dataset(testdata_mode=True)
proprocess_dataset(testdata_mode=False)
2023-12-20 17:56:36 +01:00
```
2023-12-20 16:59:11 +01:00
Now we have the files:
2023-12-20 17:56:36 +01:00
* test_label_storage.npy
* test_pattern_storage.npy
* train_label_storage.npy
* train_pattern_storage.npy