pandas

python library for manipulating dataframes

Created: by Pradeep Gowda Updated: May 22, 2023 Tagged: python

Top level notes

  • Pandas Documentation Homepage
  • Installing (2.0) – pip install pandas==2.0
    • installing pandas will also install numpy
  • Pandas helps in working with tabular data (databases, spreadsheets)
  • Helps = explore, clean and process
  • Most things you can do with SQL you can also do with pandas.
    • Group by operations
  • Summary statistics, e.g. mean, median, std.
  • The data table is called a DataFrame
  • pandas can read (read_ methods) from various file formats like csv, xls, parquet, hdf5, json, sql (db) and generate output (to_ methods) in the above formats.
  • matplotlib is used to generate plots from data tables.
  • split-apply-combine approach. Hmm.. map-reduce?
  • melt() to convert from wide -> long/tidy form and pivot() to convert from long to wide format.
  • combine multiple tables using concat() function for column-wise or row-wise joining
  • for database like joining/merging, use the merge function. pd.merge(tbone, tbtwo, how="left", left_on="col1", right_on="col2"). also merge_asof
  • pandas supports inner, outer, right and left joins.
  • use the shape attribute (df.shape, no parentheses) to get the (rows, columns) tuple.
  • TODO: how vectorization is done in pandas
  • TODO: where numpy is faster than pandas
  • TODO: Pyspark with pandas
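
A minimal sketch of several notes above (shape, split-apply-combine, melt/pivot, merge, concat) on a tiny made-up dataset. The table and column names (sales, regions, store, shop, q1, q2) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data; names and values are made up for illustration.
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "q1": [10, 20, 30, 40],
    "q2": [1, 2, 3, 4],
})
regions = pd.DataFrame({"shop": ["A", "B"], "region": ["east", "west"]})

# shape is an attribute, not a method: (rows, columns)
n_rows, n_cols = sales.shape

# split-apply-combine: split rows by store, apply sum, combine into one table
totals = sales.groupby("store")[["q1", "q2"]].sum()

# wide -> long/tidy with melt(): one row per (store, quarter) pair
long = totals.reset_index().melt(
    id_vars="store", var_name="quarter", value_name="amount"
)

# long -> wide with pivot(): one column per quarter again
wide = long.pivot(index="store", columns="quarter", values="amount")

# database-style left join on differently named key columns
merged = pd.merge(sales, regions, how="left", left_on="store", right_on="shop")

# concat: row-wise stacking (default axis=0) and column-wise (axis=1)
stacked = pd.concat([sales, sales], ignore_index=True)
side_by_side = pd.concat([sales, regions], axis=1)
```

On the map-reduce question: groupby does look loosely similar (group by key, then aggregate each group), but this sketch only shows the pandas side and does not settle the comparison.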

Quick intro

import numpy as np
import pandas as pd

# create new Series
s = pd.Series([1, 2, 3, np.nan, 8, 121])
# np.nan = not a number (there is no pd.nan; pd.NA is a separate missing-value marker)

# pd.date_range("YYYYMMDD", periods=n) produces a DatetimeIndex of
# n dates starting at YYYYMMDD (daily frequency by default)

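# dataframes are created from a 2-d numpy array of m x n dimensions,
# with index holding m row labels and columns holding n column names

nparray = np.arange(12).reshape(3, 4)  # example 3x4 array
df = pd.DataFrame(nparray, index=["r1", "r2", "r3"], columns=list("ABCD"))

# the df can also be created by passing a dict with the column names
# as keys and a Series (or list) as each value.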
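
The dict form and the date index mentioned in the comments above might look roughly like this. The start date, column names, and values are arbitrary examples, not taken from the original notes.

```python
import numpy as np
import pandas as pd

# Six consecutive days starting 2023-01-01 (an arbitrary example date)
dates = pd.date_range("20230101", periods=6)

# DataFrame from a 2-d array: 6 rows x 4 columns, indexed by the dates
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# DataFrame from a dict: keys become column names, values become the columns
df2 = pd.DataFrame({
    "n": pd.Series([1, 2, 3]),
    "label": ["a", "b", "c"],
})

# Summary statistics skip NaN by default
s = pd.Series([1, 2, 3, np.nan, 8, 121])
mean_value = s.mean()      # average of the five non-missing values
median_value = s.median()  # middle of the five non-missing values
```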

Articles