Python

Pandas

Pandas is a Python library for data manipulation and analysis built on NumPy.

Import Convention

import pandas as pd

Pandas Data Structures

Series

A one-dimensional labeled array capable of holding any data type:

s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])

DataFrame

A two-dimensional labeled data structure with columns of potentially different types:

data = {
    'Country': ['Belgium', 'India', 'Brazil'],
    'Capital': ['Brussels', 'New Delhi', 'Brasília'],
    'Population': [11190846, 1303171035, 207847528]
}
df = pd.DataFrame(data, columns=['Country', 'Capital', 'Population'])

I/O Operations

Read and Write CSV

# Read CSV
pd.read_csv('file.csv', header=None, nrows=5)
 
# Write CSV
df.to_csv('myDataFrame.csv')

Read and Write Excel

# Read Excel
pd.read_excel('file.xlsx')
 
# Write Excel
df.to_excel('dir/myDataFrame.xlsx', sheet_name='Sheet1')
 
# Read multiple sheets
xlsx = pd.ExcelFile('file.xls')
df = pd.read_excel(xlsx, 'Sheet1')

SQL Operations

from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
 
# Read SQL
pd.read_sql("SELECT * FROM my_table;", engine)
pd.read_sql_table('my_table', engine)
pd.read_sql_query("SELECT * FROM my_table;", engine)
 
# Write SQL
df.to_sql('myDf', engine)

Selection

By Position

# Get one element
s['b']  # For Series
 
# Get subset of a DataFrame
df[1:]
 
# Select single value by row & column
df.iloc[[0],[0]]
df.iat([0],[0])

By Label

# Select single value by row & column labels
df.loc[[0], ['Country']]
df.at([0], ['Country'])

By Label/Position

# Select single row or subset of rows
df.ix[2]
 
# Select a single column or subset of columns
df.ix[:, 'Capital']
 
# Select rows and columns
df.ix[1, 'Capital']

Boolean Indexing

# Series where value is not >1
s[~(s > 1)]
 
# Series where value is <-1 or >2
s[(s < -1) | (s > 2)]
 
# Use filter to adjust DataFrame
df[df['Population'] > 1200000000]

Dropping

# Drop values from rows (axis=0)
s.drop(['a', 'c'])
 
# Drop values from columns (axis=1)
df.drop('Country', axis=1)

Retrieving Series/DataFrame Information

df.shape      # (rows, columns)
df.index      # Describe index
df.columns    # Describe DataFrame columns
df.info()     # Info on DataFrame
df.count()    # Number of non-NA values

Summary Statistics

df.sum()      # Sum of values
df.cumsum()   # Cumulative sum of values
df.min()      # Minimum values
df.max()      # Maximum values
df.idxmin()   # Minimum index value
df.idxmax()   # Maximum index value
df.describe() # Summary statistics
df.mean()     # Mean of values
df.median()   # Median of values

Applying Functions

# Apply function
f = lambda x: x*2
df.apply(f)
 
# Apply function element-wise
df.applymap(f)

Data Alignment

Internal Data Alignment

NA values are introduced in the indices that don't overlap:

s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])
s + s3

Arithmetic Operations with Fill Methods

s.add(s3, fill_value=0)    # Addition
s.sub(s3, fill_value=2)    # Subtraction
s.div(s3, fill_value=4)    # Division
s.mul(s3, fill_value=3)    # Multiplication

Sort & Rank

df.sort_index()                # Sort by labels along an axis
df.sort_values(by='Country')   # Sort by the values along an axis
df.rank()                      # Assign ranks to entries

Getting Help

help(pd.Series.loc)

Last updated on