pandas

Python library for manipulating dataframes

Created: Apr 13, 2023 by Pradeep Gowda Updated: May 22, 2023 Tagged: python

Top level notes

Pandas Documentation Homepage
Installing (2.0) – pip install pandas==2.0
- installing pandas will also install numpy
Pandas helps in working with tabular data (databases, spreadsheets)
Helps = explore, clean and process
Most things you can do with SQL you can also do with Pandas.
- Group by operations
Summary statistics, e.g. mean, median, std.
The data table is called a DataFrame
pandas can read (read_* functions, e.g. read_csv) from various file formats like csv, xls, parquet, hdf5, json, sql (db) and write output (to_* methods, e.g. to_csv) in the same formats.
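A minimal sketch of the read_/to_ round trip. The CSV text and column names are made up for illustration, and an in-memory buffer stands in for a file on disk:

```python
import io

import pandas as pd

# Made-up CSV content; io.StringIO stands in for a real file path.
csv_text = "name,score\nalice,90\nbob,85\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# to_json with orient="records" emits one JSON object per row.
json_text = df.to_json(orient="records")

# to_csv with no path returns the CSV as a string;
# index=False leaves out the row labels.
round_trip = df.to_csv(index=False)
```

The other formats follow the same naming pattern (read_parquet / to_parquet and so on), though some need an extra package installed.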
matplotlib is used to generate plots from data tables (via df.plot()).
groupby follows the split-apply-combine approach. Hmm.. similar to map-reduce?
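A small sketch of split-apply-combine with groupby, also covering the group-by and summary-statistics notes above. The table and its column names (team, points) are invented for the example:

```python
import pandas as pd

# Invented example data.
df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "points": [10, 20, 5, 15, 25],
})

# split: one group per team
# apply: compute mean, median and std of points within each group
# combine: the results come back as one table indexed by team
summary = df.groupby("team")["points"].agg(["mean", "median", "std"])

# Rough SQL counterpart of the mean column:
#   SELECT team, AVG(points) FROM df GROUP BY team
```

For a quick look at a whole table without grouping, df.describe() reports similar statistics per numeric column.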
melt() to convert from wide -> long/tidy form and pivot() to convert from long to wide format.
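A sketch of the wide -> long -> wide round trip. The city/year data is made up, and the names var_name="year" and value_name="value" are arbitrary choices for this example:

```python
import pandas as pd

# Wide form: one column per year (made-up numbers).
wide = pd.DataFrame({
    "city": ["Pune", "Oslo"],
    "2021": [10, 3],
    "2022": [12, 4],
})

# melt: every (city, year) pair becomes its own row (long/tidy form).
long = wide.melt(id_vars=["city"], var_name="year", value_name="value")

# pivot: spread the year values back out into columns (wide form).
back = long.pivot(index="city", columns="year", values="value")
```

Note that pivot() requires each (index, columns) pair to be unique; pivot_table() is the variant that aggregates duplicates.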
combine multiple tables using the concat() function for row-wise (axis=0, the default) or column-wise (axis=1) joining
for database-like joining/merging, use the merge function, e.g. pd.merge(tbone, tbtwo, how="left", left_on="col1", right_on="col2"). See also merge_asof, which matches on the nearest key instead of an exact one.
pandas supports inner, outer, right and left joins.
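A sketch contrasting concat() with merge(), reusing the tbone/tbtwo/col1/col2 names from the note above. The table contents and the left_val/right_val columns are invented:

```python
import pandas as pd

# Invented tables; only keys 2 and 3 appear in both.
tbone = pd.DataFrame({"col1": [1, 2, 3], "left_val": ["a", "b", "c"]})
tbtwo = pd.DataFrame({"col2": [2, 3, 4], "right_val": ["x", "y", "z"]})

# concat, row-wise: stack tables on top of each other.
stacked = pd.concat([tbone, tbone], ignore_index=True)

# concat, column-wise: place tables side by side, aligned on the index.
side_by_side = pd.concat([tbone, tbtwo], axis=1)

# merge: match rows on key columns, like a SQL join.
left = pd.merge(tbone, tbtwo, how="left", left_on="col1", right_on="col2")
inner = pd.merge(tbone, tbtwo, how="inner", left_on="col1", right_on="col2")
outer = pd.merge(tbone, tbtwo, how="outer", left_on="col1", right_on="col2")
```

The left join keeps all three rows of tbone and fills the unmatched one with NaN, the inner join keeps only the two matching keys, and the outer join keeps every key from both sides.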
use the shape attribute (df.shape, not a method call) to get the (rows, columns) tuple.
TODO: how vectorization is done in pandas
TODO: where numpy is faster than pandas
TODO: Pyspark with pandas

Quick intro

import numpy as np
import pandas as pd

# create new Series
s = pd.Series([1, 2, 3, np.nan, 8, 121])
# np.nan = not a number (pandas uses it to mark missing values)

# pd.date_range("YYYYMMDD", periods=n) produces a date range
# of n periods (daily by default) starting at YYYYMMDD

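# dataframes are created from a 2-d numpy array of m x n dimensions,
# plus m row labels (index) and n column names (example values below)
nparray = np.random.randn(6, 4)
indexvec = pd.date_range("20230101", periods=6)
names = "ABCD"
df = pd.DataFrame(nparray, index=indexvec, columns=list(names))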

# the df can also be created by passing a dict with keys as the
# column names and the value as the series.
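
The dict form described in the last comment might look like the sketch below. The column names "a" and "b", the values, and the start date are all made up:

```python
import numpy as np
import pandas as pd

# Three daily timestamps to use as the shared index.
dates = pd.date_range("20230101", periods=3)

# Keys become column names; each value is a Series aligned on the index.
df = pd.DataFrame({
    "a": pd.Series([1.0, np.nan, 3.0], index=dates),
    "b": pd.Series([10, 20, 30], index=dates),
})

# shape is an attribute, not a method: (rows, columns)
rows, cols = df.shape
```

Each column keeps its own dtype here (float for "a" because of the NaN, integer for "b"), which is one difference from building the frame out of a single numpy array.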

Articles

Practical SQL for Data Analysis | Haki Benita (SQL in comparison to pandas).