Unit-VI: Data Analysis with Python

Lecture 3: Pandas - Series and DataFrames

Estimated time: 45-60 minutes

1. Pandas Overview

Pandas is a powerful data manipulation and analysis library for Python, built on top of NumPy.

Key Features:
  • Data structures: Series (1D) and DataFrames (2D)
  • Data alignment and handling missing data
  • Reshaping and pivoting datasets
  • Time series functionality
  • Reading/writing data from various formats (CSV, Excel, SQL, etc.)

# Standard import convention
import pandas as pd
import numpy as np  # Often used together with pandas

# Check pandas version
print("Pandas version:", pd.__version__)
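
To illustrate the reading/writing feature listed above, here is a minimal sketch of a CSV round trip. The column names and values are invented for this example, and an in-memory buffer stands in for a real file so nothing is written to disk.

```python
import io
import pandas as pd

# Hypothetical sample data, invented for this illustration
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [90, 85]})

# Write the DataFrame as CSV text into an in-memory buffer
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False omits the row labels

# Rewind the buffer and read the CSV text back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

With a real file, a path string such as 'scores.csv' (a hypothetical name) would take the place of the buffer in both calls.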

2. Pandas Series

A Series is a one-dimensional labeled array capable of holding any data type.

Creating Series

# From a list
s1 = pd.Series([1, 3, 5, 7, 9])
print("From list:\n", s1)

# With custom index
s2 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print("\nWith custom index:\n", s2)

# From dictionary
data = {'a': 1, 'b': 2, 'c': 3}
s3 = pd.Series(data)
print("\nFrom dictionary:\n", s3)

# With specified index (missing values become NaN)
s4 = pd.Series(data, index=['a', 'b', 'c', 'd'])
print("\nWith specified index (missing values):\n", s4)

Operations

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'e', 'f'])

# Element-wise operations (aligned by index)
print("Addition (aligned by index):\n", s1 + s2)

# Operations with scalars
print("\nMultiply by 2:\n", s1 * 2)

# Boolean operations
print("\nGreater than 2:\n", s1 > 2)

# Mathematical functions
print("\nExponential:\n", np.exp(s1))
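
Because addition aligns on index labels, any label that appears in only one of the two Series produces NaN in the result. The sketch below reuses the two Series from the example above and shows one way to avoid that, using the `add` method with `fill_value`; whether treating a missing label as 0 is appropriate depends on the data.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([10, 20, 30, 40], index=['a', 'b', 'e', 'f'])

# Plain addition: labels found in only one Series give NaN
plain = s1 + s2
print("Plain addition:\n", plain)

# add() with fill_value=0 treats a label missing from one side as 0
filled = s1.add(s2, fill_value=0)
print("\nAddition with fill_value=0:\n", filled)
```

Only labels 'a' and 'b' exist in both Series, so they are the only non-NaN entries in the plain sum.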

Methods

s = pd.Series([1, 2, 3, 4, 5, None, 7])

# Basic statistics
print("Sum:", s.sum())
print("Mean:", s.mean())
print("Standard deviation:", s.std())
print("\nValue counts:\n", s.value_counts())

# Handling missing data
print("\nDrop NA:\n", s.dropna())
print("Fill NA with 0:\n", s.fillna(0))

# String operations (for string data)
s_str = pd.Series(['apple', 'banana', 'cherry'])
print("\nString operations:\n", s_str.str.upper())

3. Pandas DataFrames

A DataFrame is a two-dimensional labeled data structure whose columns can hold different data types.

Creating

# From dictionary of lists/arrays
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
print("From dictionary:\n", df)

# From list of dictionaries
data_list = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]
df2 = pd.DataFrame(data_list)
print("\nFrom list of dictionaries:\n", df2)

# With custom index
df3 = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])
print("\nWith custom index:\n", df3)

Indexing

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}, index=['a', 'b', 'c', 'd'])

# Column access
print("Single column (Series):\n", df['Name'])
print("\nMultiple columns (DataFrame):\n", df[['Name', 'Age']])

# Row access by label
print("\nRow by label (loc):\n", df.loc['a'])

# Row access by position
print("\nFirst row (iloc):\n", df.iloc[0])

# Boolean indexing
print("\nPeople older than 30:\n", df[df['Age'] > 30])
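
The access patterns above can also select rows and columns in one step. The following sketch, using the same DataFrame, shows label slices with `loc`, position slices with `iloc`, and a boolean mask that combines two conditions. The specific conditions are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}, index=['a', 'b', 'c', 'd'])

# loc is label-based: a label slice includes both endpoints
by_label = df.loc['a':'c', ['Name', 'Age']]
print("Rows a to c, two columns:\n", by_label)

# iloc is position-based: the stop position is excluded
by_position = df.iloc[0:2, 0:2]
print("\nFirst two rows and columns:\n", by_position)

# Combine conditions with & (and) or | (or); wrap each in parentheses
mask = (df['Age'] > 25) & (df['City'] != 'Chicago')
names = df.loc[mask, 'Name']
print("\nOlder than 25 and not in Chicago:\n", names)
```

Note the difference in slice behavior: `df.loc['a':'c']` returns three rows, while `df.iloc[0:2]` returns two.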

Operations

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40],
    'C': [100, 200, 300, 400]
})

# Element-wise operations
df['D'] = df['A'] + df['B']
print("Add columns A and B:\n", df)

# Operations with scalars
df['E'] = df['C'] * 2
print("\nMultiply column C by 2:\n", df)

# Apply function to each element
df['F'] = df['A'].apply(lambda x: x ** 2)
print("\nSquare of column A:\n", df)

# Operations with other DataFrames
df2 = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [1, 2, 3, 4],
    'C': [1, 2, 3, 4]
})
print("\nElement-wise addition of two DataFrames:\n", df[['A', 'B', 'C']] + df2)

4. Basic Data Exploration

Essential methods for exploring and understanding your data.

# Sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Age': [25, 30, 35, 40, 45, 50],
    'Salary': [70000, 80000, 90000, 100000, 110000, 120000],
    'Department': ['HR', 'IT', 'IT', 'Finance', 'HR', 'Finance']
}
df = pd.DataFrame(data)

# First few rows
print("First 3 rows:")
print(df.head(3))

# Basic information (info() prints its report directly and returns None,
# so it is not wrapped in print())
print("\nDataFrame info:")
df.info()

# Descriptive statistics
print("\nDescriptive statistics:")
print(df.describe())

# Count of unique values
print("\nValue counts for Department:")
print(df['Department'].value_counts())

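# Correlation matrix (numeric columns only; in pandas 2.0+ calling
# df.corr() on a frame with text columns raises an error)
print("\nCorrelation matrix:")
print(df.select_dtypes(include='number').corr())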
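
A few more checks are often useful when first looking at a dataset. The sketch below rebuilds the same sample dataset and inspects its shape, column types, missing-value counts, and the highest-paid rows. Which checks matter depends on the data, so treat this as one possible starting point.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Age': [25, 30, 35, 40, 45, 50],
    'Salary': [70000, 80000, 90000, 100000, 110000, 120000],
    'Department': ['HR', 'IT', 'IT', 'Finance', 'HR', 'Finance']
})

# Number of rows and columns as a tuple
print("Shape (rows, columns):", df.shape)

# Data type of each column
print("\nColumn dtypes:\n", df.dtypes)

# Count of missing values per column
missing = df.isna().sum()
print("\nMissing values per column:\n", missing)

# Two highest salaries, largest first
top_paid = df.sort_values('Salary', ascending=False).head(2)
print("\nTop 2 by salary:\n", top_paid)
```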

5. Practice Exercise

Given the following dataset of student grades:

data = {
    'Student': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Math': [85, 90, 78, 92, 88, 95, 82],
    'Science': [90, 85, 88, 75, 92, 80, 85],
    'English': [78, 82, 90, 88, 85, 92, 78],
    'History': [85, 78, 82, 90, 88, 85, 92]
}
grades = pd.DataFrame(data)

Perform the following tasks:

  1. Calculate the average grade for each student
  2. Add a column 'Passed' that is True if the student passed all subjects (grade ≥ 80)
  3. Find the student with the highest average grade
  4. Calculate the average grade for each subject
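
For checking your own work, here is one possible approach to the four tasks. The names `subjects`, `top_student`, `subject_means`, and the 'Average' column are choices made for this sketch and are not part of the exercise statement; other solutions are equally valid.

```python
import pandas as pd

data = {
    'Student': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Math': [85, 90, 78, 92, 88, 95, 82],
    'Science': [90, 85, 88, 75, 92, 80, 85],
    'English': [78, 82, 90, 88, 85, 92, 78],
    'History': [85, 78, 82, 90, 88, 85, 92]
}
grades = pd.DataFrame(data)

# Columns that hold grades (everything except 'Student')
subjects = ['Math', 'Science', 'English', 'History']

# Task 1: average per student (axis=1 computes across columns, row by row)
grades['Average'] = grades[subjects].mean(axis=1)

# Task 2: True only if every subject grade is at least 80
grades['Passed'] = (grades[subjects] >= 80).all(axis=1)

# Task 3: idxmax() gives the row label of the highest average
top_student = grades.loc[grades['Average'].idxmax(), 'Student']

# Task 4: average per subject (default axis=0 computes down each column)
subject_means = grades[subjects].mean()

print(grades)
print("\nTop student:", top_student)
print("\nSubject averages:\n", subject_means)
```

Selecting `grades[subjects]` before each calculation keeps the 'Student' names and the newly added columns out of the arithmetic.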