Lecture 4: Modules, Packages & Regular Expressions

1. Introduction to Modules

A module is a file containing Python definitions and statements. The file name is the module name with the suffix .py.

Creating a Simple Module

my_math.py

"""A simple math module with basic operations."""


def add(a, b):
    """Return the sum of a and b."""
    return a + b


def subtract(a, b):
    """Return the difference between a and b."""
    return a - b


def multiply(a, b):
    """Return the product of a and b."""
    return a * b


def divide(a, b):
    """Return the quotient of a divided by b."""
    if b == 0:
        raise ValueError("Cannot divide by zero!")
    return a / b


# This code runs only when the module is executed directly
if __name__ == "__main__":
    print("Running my_math module directly")
    print(f"2 + 3 = {add(2, 3)}")
    print(f"5 - 2 = {subtract(5, 2)}")

Using the Module

# Import the entire module
import my_math

# Use functions with module name prefix
result = my_math.add(5, 3)
print(f"5 + 3 = {result}")

# Import specific functions
from my_math import multiply, divide

# Use directly without module name
product = multiply(4, 6)
print(f"4 * 6 = {product}")

# Import with alias
import my_math as mm
print(f"10 - 4 = {mm.subtract(10, 4)}")

# Import all names (not recommended: it can silently overwrite existing names)
from my_math import *
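
The effect of the `if __name__ == "__main__":` guard can be observed directly. The following is a minimal sketch, not part of the lecture's my_math.py: it writes a small stand-in module (the name guard_demo and its contents are hypothetical) to a temporary directory, then loads it once as an import and once as a script.

```python
import importlib.util
import runpy
import tempfile
from pathlib import Path

# Hypothetical stand-in module, used only for this illustration.
SOURCE = '''
def add(a, b):
    return a + b

ran_as_script = (__name__ == "__main__")
'''

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "guard_demo.py"
    path.write_text(SOURCE)

    # Imported: __name__ is the module's own name.
    spec = importlib.util.spec_from_file_location("guard_demo", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    imported_flag = module.ran_as_script

    # Executed as a script: __name__ is "__main__".
    script_globals = runpy.run_path(str(path), run_name="__main__")
    script_flag = script_globals["ran_as_script"]

print(f"imported: ran_as_script = {imported_flag}")
print(f"executed: ran_as_script = {script_flag}")
```

The same mechanism is what keeps the demo prints in my_math.py from running when another script imports it.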

Module Search Path

When a module is imported, Python searches for it in the following order:

  1. The directory containing the input script (or the current directory when no script is given, as in the interactive interpreter)
  2. The list of directories contained in the PYTHONPATH environment variable
  3. An installation-dependent list of directories configured at Python installation time

You can view the search path with:

import sys
print(sys.path)
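
To see the search order in action, the sketch below lists the first few entries of sys.path and asks the import system where a module would be loaded from, without importing it. The module name no_such_module_xyz is made up for illustration and is assumed not to exist on your system.

```python
import importlib.util
import sys

# Entries are searched in order, so earlier directories win.
for index, entry in enumerate(sys.path[:5]):
    print(index, repr(entry))

# find_spec reports where a module would be loaded from.
json_spec = importlib.util.find_spec("json")
print(json_spec.origin)

# A name that is not found anywhere on the path yields None.
missing_spec = importlib.util.find_spec("no_such_module_xyz")
print(missing_spec)
```

Because sys.path is an ordinary list, a script can also append a directory to it at run time to extend the search.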

2. Creating and Using Packages

A package is a way of organizing related modules into a single directory hierarchy.

Package Structure

my_package/
    __init__.py          # Makes the directory a Python package
    module1.py           # Module 1
    module2.py           # Module 2
    subpackage/          # Subpackage
        __init__.py      # Makes the subdirectory a package
        module3.py       # Module in subpackage

Package Initialization

The __init__.py file can be empty or can contain initialization code for the package.

# my_package/__init__.py
"""
This is the my_package package.
It provides useful utilities for various tasks.
"""

# You can define what gets imported with 'from my_package import *'
__all__ = ['module1', 'module2']

# Package-level variables
VERSION = '1.0.0'

# You can also import functions to make them available at the package level
from .module1 import some_function

print("Initializing my_package...")

Using the Package

# Import from the package
import my_package.module1
from my_package import module2
from my_package.subpackage import module3

# Using the imported modules
my_package.module1.some_function()
module2.another_function()

# Import specific functions
from my_package.module1 import specific_function

# Using package-level imports (if defined in __init__.py)
from my_package import some_function
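
The package mechanics can be tried without creating files by hand. This is a minimal sketch under stated assumptions: it builds a throwaway package in a temporary directory (demo_pkg is a hypothetical name chosen to avoid clashing with real packages), makes that directory importable, and checks that names defined or re-exported in __init__.py are available at the package level.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical package layout: demo_pkg/__init__.py and demo_pkg/module1.py
    pkg = Path(tmp) / "demo_pkg"
    pkg.mkdir()
    (pkg / "__init__.py").write_text(
        "VERSION = '1.0.0'\n"
        "from .module1 import some_function\n"
    )
    (pkg / "module1.py").write_text(
        "def some_function():\n"
        "    return 'hello from module1'\n"
    )

    sys.path.insert(0, tmp)        # make the temporary directory importable
    importlib.invalidate_caches()  # ensure the new files are noticed
    try:
        demo_pkg = importlib.import_module("demo_pkg")
        version = demo_pkg.VERSION
        greeting = demo_pkg.some_function()  # re-exported by __init__.py
        submodule_name = demo_pkg.module1.__name__
    finally:
        sys.path.remove(tmp)

print(version)
print(greeting)
print(submodule_name)
```

Note that the relative import in __init__.py also makes the submodule reachable as an attribute of the package, which is why demo_pkg.module1 works here.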

3. Standard Library Modules

Commonly Used Standard Modules

Module       Description                               Common Uses
sys          System-specific parameters and functions  Command-line arguments, interpreter settings
os           Operating system interfaces               File/directory operations, process management
math         Mathematical functions                    Trigonometry, logarithms, constants
datetime     Date and time handling                    Date arithmetic, formatting, timezones
json         JSON data handling                        Reading/writing JSON files, parsing JSON data
re           Regular expressions                       Pattern matching, string searching
collections  Specialized container datatypes           Counters, named tuples, default dictionaries
itertools    Functions creating iterators              Efficient looping, combinations, permutations

Example: Using Standard Modules

import os
import sys
import math
from datetime import datetime
import json

# Using os module
current_dir = os.getcwd()
print(f"Current directory: {current_dir}")

# Using sys module
print(f"Python version: {sys.version}")
print(f"Command line arguments: {sys.argv}")

# Using math module
print(f"Square root of 16: {math.sqrt(16)}")
print(f"Value of pi: {math.pi}")

# Using datetime module
now = datetime.now()
print(f"Current date and time: {now}")
print(f"Formatted date: {now.strftime('%Y-%m-%d %H:%M:%S')}")

# Using json module
data = {
    'name': 'John Doe',
    'age': 30,
    'courses': ['Math', 'Physics', 'Chemistry']
}

# Convert to JSON string
json_str = json.dumps(data, indent=2)
print("JSON data:")
print(json_str)
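
The table also lists collections and itertools, which the example above does not exercise. A short sketch of their most common uses follows; the word list and the Point type are made up for illustration.

```python
from collections import Counter, defaultdict, namedtuple
from itertools import combinations, permutations

words = ["apple", "banana", "apple", "cherry", "banana", "apple"]

# Counter tallies hashable items.
counts = Counter(words)
top_two = counts.most_common(2)
print(top_two)

# defaultdict creates a missing value on first access.
by_letter = defaultdict(list)
for word in sorted(set(words)):
    by_letter[word[0]].append(word)
print(dict(by_letter))

# namedtuple gives tuple fields readable names.
Point = namedtuple("Point", ["x", "y"])
p = Point(3, 4)
print(p.x + p.y)

# combinations ignores order; permutations does not.
pairs = list(combinations("ABC", 2))
orders = list(permutations("ABC", 2))
print(pairs)
print(len(orders))
```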

4. Introduction to Regular Expressions

Regular expressions (regex) are a powerful tool for pattern matching and manipulation of strings.

Basic Regex Patterns

Pattern  Description                            Example   Matches
.        Matches any character except newline   a.c       abc, aac, a1c, etc.
^        Start of string                        ^Hello    Hello world (but not Say Hello)
$        End of string                          world$    Hello world (but not world peace)
*        0 or more repetitions                  ab*c      ac, abc, abbc, abbbc, etc.
+        1 or more repetitions                  ab+c      abc, abbc, abbbc, etc. (not ac)
?        0 or 1 repetition                      ab?c      ac, abc (not abbc)
{m,n}    m to n repetitions                     a{2,4}b   aab, aaab, aaaab
\d       Digit (0-9)                            \d{3}     123, 456, 000, etc.
\w       Word character (a-z, A-Z, 0-9, _)      \w+       hello, Python3, user_name
\s       Whitespace (space, tab, newline)       \s+       runs of spaces, tabs, or newlines
[abc]    Any of a, b, or c                      [aeiou]   a, e, i, o, u
[^abc]   Not a, b, or c                         [^0-9]    a, b, !, @, etc. (not digits)
|        OR operator                            cat|dog   cat or dog
()       Grouping                               (ab)+     ab, abab, ababab, etc.

Note: for str patterns in Python 3, \d, \w, and \s also match Unicode digits, word characters, and whitespace unless the re.ASCII flag is set.
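
The quantifier and grouping rows of the table can be checked mechanically. The sketch below uses re.fullmatch, which succeeds only when the entire string matches the pattern; the candidate strings are taken from the table plus a few deliberate non-matches.

```python
import re

# (pattern, candidate strings) pairs based on the table above
cases = [
    (r"ab*c", ["ac", "abc", "abbbc"]),
    (r"ab+c", ["ac", "abc", "abbc"]),
    (r"ab?c", ["ac", "abc", "abbc"]),
    (r"a{2,4}b", ["ab", "aab", "aaaab", "aaaaab"]),
    (r"(ab)+", ["ab", "abab", "aba"]),
    (r"cat|dog", ["cat", "dog", "cow"]),
]

results = {}
for pattern, candidates in cases:
    # Keep only the candidates that the pattern matches in full.
    results[pattern] = [s for s in candidates if re.fullmatch(pattern, s)]
    print(f"{pattern!r:12} matches {results[pattern]}")
```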

5. Using the re Module

Common re Functions

Function       Description                                    Example
re.search()    Search for a pattern anywhere in the string    re.search(r'\d+', 'abc123')
re.match()     Match pattern at the beginning of the string   re.match(r'\d+', '123abc')
re.findall()   Find all non-overlapping matches               re.findall(r'\d+', 'a1b22c333')
re.finditer()  Return an iterator yielding match objects      for m in re.finditer(r'\d+', 'a1b22c333'):
re.sub()       Replace occurrences of a pattern               re.sub(r'\d', '#', 'a1b2c3')
re.split()     Split string by pattern                        re.split(r'\s+', 'split on whitespace')
re.compile()   Compile a regex pattern for reuse              pattern = re.compile(r'\d{3}-\d{2}-\d{4}')
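
The calls in the table can be run side by side to compare what each one returns. This is a small sketch; the variable names are illustrative only.

```python
import re

text = "a1b22c333"

first = re.search(r"\d+", text).group()          # first match anywhere
at_start = re.match(r"\d+", text)                # None: text starts with a letter
prefix = re.match(r"\d+", "123abc").group()      # matches only at the beginning
all_numbers = re.findall(r"\d+", text)           # list of matched strings
positions = [(m.group(), m.start()) for m in re.finditer(r"\d+", text)]
masked = re.sub(r"\d", "#", "a1b2c3")            # replace each digit
parts = re.split(r"\s+", "split on   whitespace")

# A compiled pattern exposes the same operations as methods.
compiled = re.compile(r"\d{3}-\d{2}-\d{4}")
is_full = bool(compiled.fullmatch("123-45-6789"))

print(first, at_start, prefix)
print(all_numbers)
print(positions)
print(masked)
print(parts)
print(is_full)
```

The difference between re.search() and re.match() is the most common source of confusion: re.match() only succeeds if the pattern matches at position 0.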

Regex Examples

import re

# Example 1: Simple search
text = "The quick brown fox jumps over the lazy dog."
match = re.search(r'fox', text)
if match:
    print(f"Found 'fox' at position {match.start()}-{match.end()}")

# Example 2: Find all email addresses
# [A-Za-z]{2,} matches the top-level domain (a | inside [...] would be a literal pipe)
text = "Contact us at support@example.com or sales@company.org"
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
print(f"Found emails: {emails}")

# Example 3: Extract dates in format YYYY-MM-DD
text = "Events on 2023-10-15 and 2023-11-20"
dates = re.findall(r'\b\d{4}-\d{2}-\d{2}\b', text)
print(f"Found dates: {dates}")

# Example 4: Replace phone numbers
text = "Call 555-123-4567 or 800-555-0000"
masked = re.sub(r'\b\d{3}-\d{3}-\d{4}\b', 'XXX-XXX-XXXX', text)
print(f"Masked text: {masked}")

# Example 5: Using groups for extraction
text = "Name: John Doe, Age: 30, City: New York"
pattern = r'Name: (\w+ \w+), Age: (\d+), City: (\w+ \w+)'
match = re.search(pattern, text)
if match:
    name, age, city = match.groups()
    print(f"Name: {name}, Age: {age}, City: {city}")

# Example 6: Case-insensitive search
text = "Python is awesome! PYTHON is great! python is fun!"
matches = re.findall(r'python', text, re.IGNORECASE)
print(f"Found 'python' {len(matches)} times (case-insensitive)")
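
Example 5 relies on the position of each group. As an alternative, groups can be given names with the (?P<name>...) syntax, so the result can be read by field name. The sketch below is a variation on Example 5, not a replacement for it; the group names and the looser city pattern are choices made for this illustration.

```python
import re

record = re.compile(
    r"Name: (?P<name>\w+ \w+), Age: (?P<age>\d+), City: (?P<city>[\w ]+)"
)

match = record.search("Name: John Doe, Age: 30, City: New York")
fields = match.groupdict()   # maps each group name to its matched text

print(fields["name"])
print(int(fields["age"]) + 1)
print(fields["city"])
```

Named groups tend to survive pattern edits better than numbered ones, because inserting a new group does not shift the names.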

6. Practical Example: Log File Analysis

Let's create a script that analyzes a web server log file using regular expressions.

import re
from collections import defaultdict
from datetime import datetime


def analyze_log_file(log_file):
    """Analyze a web server log file and extract useful information."""
    # Common log format:
    # 127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 612
    log_pattern = r'''
        (?P<remote_addr>\d+\.\d+\.\d+\.\d+)\s+    # IP address
        [^\[\]]+\s+                               # Remote user (ignored)
        \[([^\]]+)\]\s+                           # Timestamp
        "(?:\w+\s+)?(?P<url>[^\s"]+)\s+[^"]*"\s+  # Request URL
        (?P<status>\d{3})\s+                      # Status code
        (?P<size>\d+|-)\s*                        # Response size
    '''

    # Compile the regex with the verbose flag so whitespace and comments are ignored
    log_regex = re.compile(log_pattern, re.VERBOSE)

    # Store statistics
    stats = {
        'total_requests': 0,
        'status_codes': defaultdict(int),
        'popular_pages': defaultdict(int),
        'requests_by_hour': defaultdict(int),
        'unique_ips': set(),
        'total_bytes': 0
    }

    # Process each line in the log file
    with open(log_file, 'r') as f:
        for line in f:
            match = log_regex.search(line)
            if not match:
                continue  # Skip lines that do not follow the expected format

            stats['total_requests'] += 1

            # Extract data from the match
            ip = match.group('remote_addr')
            timestamp_str = match.group(2)  # Group 2 is the (unnamed) timestamp group
            url = match.group('url')
            status = match.group('status')
            size = match.group('size')

            # Update statistics
            stats['status_codes'][status] += 1
            stats['popular_pages'][url] += 1
            stats['unique_ips'].add(ip)

            # Parse timestamp to get hour
            try:
                # Handle different timestamp formats if needed
                timestamp = datetime.strptime(timestamp_str, '%d/%b/%Y:%H:%M:%S %z')
                hour = timestamp.strftime('%Y-%m-%d %H:00')
                stats['requests_by_hour'][hour] += 1
            except ValueError:
                pass  # Ignore timestamp parsing errors

            # Add response size to total ('-' means no size was logged)
            if size != '-':
                stats['total_bytes'] += int(size)

    # Generate report
    print("Log Analysis Report")
    print("=" * 80)
    print(f"Total requests: {stats['total_requests']}")
    print(f"Total data transferred: {stats['total_bytes'] / (1024*1024):.2f} MB")
    print(f"Unique IP addresses: {len(stats['unique_ips'])}")

    print("\nStatus Code Distribution:")
    for code, count in sorted(stats['status_codes'].items()):
        print(f"  {code}: {count} requests")

    print("\nTop 5 Most Popular Pages:")
    for url, count in sorted(stats['popular_pages'].items(),
                             key=lambda x: x[1], reverse=True)[:5]:
        print(f"  {url}: {count} requests")

    print("\nRequests by Hour:")
    for hour, count in sorted(stats['requests_by_hour'].items()):
        print(f"  {hour}: {count} requests")


# Example usage
if __name__ == "__main__":
    log_file = "access.log"  # Replace with your log file path
    try:
        analyze_log_file(log_file)
    except FileNotFoundError:
        print(f"Error: Log file '{log_file}' not found.")
    except Exception as e:
        print(f"An error occurred: {e}")
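
Before running a script like this on a real file, it helps to check the pattern against a single known line. The sketch below applies a pattern of the same shape to the sample line from the comment in the script. One deliberate difference, made for this illustration only: the timestamp group is given a name (timestamp) so that every field can be read from groupdict().

```python
import re
from datetime import datetime

sample = '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 612'

log_regex = re.compile(r'''
    (?P<remote_addr>\d+\.\d+\.\d+\.\d+)\s+    # IP address
    [^\[\]]+\s+                               # identity and user fields (ignored)
    \[(?P<timestamp>[^\]]+)\]\s+              # timestamp (named here for illustration)
    "(?:\w+\s+)?(?P<url>[^\s"]+)\s+[^"]*"\s+  # request line, URL captured
    (?P<status>\d{3})\s+                      # status code
    (?P<size>\d+|-)                           # response size, or - if absent
''', re.VERBOSE)

match = log_regex.search(sample)
fields = match.groupdict()
timestamp = datetime.strptime(fields['timestamp'], '%d/%b/%Y:%H:%M:%S %z')

for key, value in fields.items():
    print(f"{key}: {value}")
print(f"hour bucket: {timestamp.strftime('%Y-%m-%d %H:00')}")
```

If a line from your own log does not match, search() returns None; printing such lines instead of silently skipping them is a quick way to find out how your server's format differs.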