Last updated by David Rotermund on 2023-12-16.

Pandas

The goal

Questions to David Rotermund

pip install pandas

Pandas

The two most important data types of Pandas are:

  • Series
  • DataFrame

“Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.”

It is the basis for, or is commonly combined with, packages such as:

  • scipy.stats: “This module contains a large number of probability distributions, summary and frequency statistics, correlation functions and statistical tests, masked statistics, kernel density estimation, quasi-Monte Carlo functionality, and more.”
  • Pingouin: “Pingouin is an open-source statistical package written in Python 3 and based mostly on Pandas and NumPy.”
  • rpy2: “rpy2 is an interface to R running embedded in a Python process.”

Pandas.Series

class pandas.Series(data=None, index=None, dtype=None, name=None, copy=None, fastpath=False)

One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).
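A minimal sketch of the two indexing styles and of the automatic NaN handling described above. The data and the names `temperatures`, `by_label`, `by_position` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three measurements, one of them missing.
temperatures = pd.Series([20.5, np.nan, 23.5], index=["mon", "tue", "wed"])

by_label = temperatures.loc["wed"]    # label-based access
by_position = temperatures.iloc[0]    # integer (position) based access

# Statistical methods skip NaN by default (skipna=True).
mean_without_nan = temperatures.mean()            # (20.5 + 23.5) / 2
mean_with_nan = temperatures.mean(skipna=False)   # NaN propagates

print(by_label, by_position, mean_without_nan, mean_with_nan)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity when the index labels are themselves integers.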

Operations between Series (+, -, /, *, **) align values based on their associated index values; the Series need not be the same length. The result index will be the sorted union of the two indexes.
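The alignment rule can be sketched with two Series of different length. The values and labels here are invented for illustration; only the label "z" occurs in both.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["z", "w"])

# Values are matched by label, not by position.
# Labels that exist in only one Series produce NaN.
total = a + b

# Series.add with fill_value treats a missing partner as 0 instead.
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```

The result covers the union of both indexes, so it has four entries even though neither input does.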

Example 1:

import pandas as pd

example = pd.Series(["Bambu", "Tree", "Sleep"])
print(example)

Output:

0    Bambu
1     Tree
2    Sleep
dtype: object

Example 2:

import pandas as pd

example = pd.Series([99, 88, 32])
print(example)

Output:

0    99
1    88
2    32
dtype: int64

Example 3:

import numpy as np
import pandas as pd

rng = np.random.default_rng()  # unseeded random number generator
a = rng.random(5)  # five uniform samples from [0, 1)

example = pd.Series(a)
print(example)

Output (the values differ on every run because the generator is unseeded):

0    0.305920
1    0.633360
2    0.219094
3    0.005722
4    0.006673
dtype: float64

Example 4:

import pandas as pd

example = pd.Series(["Bambu", 3, "Sleep"])  # mixed types are stored with dtype object
print(example)

Output:

0    Bambu
1        3
2    Sleep
dtype: object

index and values

import pandas as pd

example = pd.Series(["Bambu", "Tree", "Sleep"])
print(example.index)
print()
print(example.values)

Output:

RangeIndex(start=0, stop=3, step=1)

['Bambu' 'Tree' 'Sleep']
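The default index is a RangeIndex, but custom labels can be passed through the `index` parameter from the class signature. The following sketch uses the made-up labels "a", "b", "c".

```python
import pandas as pd

example = pd.Series(["Bambu", "Tree", "Sleep"], index=["a", "b", "c"])

labels = list(example.index)   # the custom labels
as_array = example.to_numpy()  # NumPy array holding the same values as .values
second = example["b"]          # look up a value by its label

print(labels)
print(as_array)
print(second)
```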

DataFrame

class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

Two-dimensional, size-mutable, potentially heterogeneous tabular data.

The data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. A DataFrame can be thought of as a dict-like container for Series objects, and it is the primary pandas data structure.
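A short sketch of these properties, assuming a small invented table (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical table with heterogeneous columns (strings, integers, floats).
df = pd.DataFrame(
    {
        "animal": ["panda", "koala", "sloth"],
        "age": [7, 4, 12],
        "weight_kg": [95.0, 9.5, 5.5],
    },
    index=["p1", "k1", "s1"],
)

ages = df["age"]  # selecting one column yields a Series
is_series = isinstance(ages, pd.Series)

# Size-mutable: add a new column derived from an existing one.
df["weight_g"] = df["weight_kg"] * 1000

# Arithmetic aligns on row and column labels; row "s1" has no partner and becomes NaN.
other = pd.DataFrame({"age": [1, 1]}, index=["p1", "k1"])
older = df[["age"]] + other

print(df)
print(older)
```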

pandas.concat

pandas.concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=None)

Concatenate pandas objects along a particular axis.

Allows optional set logic along the other axes.

Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.
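The `join` and `keys` parameters can be illustrated with two small invented DataFrames that share only the column "b":

```python
import pandas as pd

first = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
second = pd.DataFrame({"b": [5, 6], "c": [7, 8]})

# axis=0 (default) stacks rows; join="outer" keeps the union of the columns,
# filling the missing entries with NaN.
outer = pd.concat([first, second], ignore_index=True)

# join="inner" keeps only the columns present in every object.
inner = pd.concat([first, second], join="inner", ignore_index=True)

# keys adds an outer index level that records where each row came from.
keyed = pd.concat([first, second], keys=["first", "second"])

print(outer)
print(inner)
print(keyed)
```

With `keys`, the rows of one source can be recovered later via `keyed.loc["second"]`.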

I/O operations

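Pandas provides input/output functions for the following formats:

  • Pickling
  • Flat file
  • Clipboard
  • Excel
  • JSON
  • HTML
  • XML
  • LaTeX
  • HDFStore: PyTables (HDF5)
  • Feather
  • Parquet
  • ORC
  • SAS
  • SPSS
  • SQL
  • Google BigQuery
  • Stata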
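As a sketch of how two of these formats are used, the following round trip writes a small invented DataFrame to CSV (flat file) and JSON text and reads it back. It uses in-memory buffers via `io.StringIO` so that no file on disk is needed; a file path could be passed instead.

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["Bambu", "Tree"], "count": [99, 88]})

# Flat file (CSV): without a path, to_csv returns the text as a string.
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: the same round trip, one record per row.
json_text = df.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text), orient="records")

print(from_csv)
print(from_json)
```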