Module: Mini Projects

CSV Data Analysis

Okay, here's a breakdown of a mini-project focused on CSV data analysis in Python, presented in Markdown format. I'll cover the project idea, steps, code examples, and potential extensions. This is designed to be a good starting point for someone learning Python and data analysis.

Project: Simple CSV Data Analyzer

Description:

This project involves reading a CSV file, performing some basic data analysis (calculating statistics, finding maximum/minimum values, filtering data), and presenting the results. We'll focus on a simple dataset to keep things manageable.

Dataset:

For this example, let's use a CSV file named sales_data.csv with the following structure:

Date,Product,Sales
2023-10-26,Widget A,100
2023-10-26,Widget B,150
2023-10-27,Widget A,120
2023-10-27,Widget B,180
2023-10-28,Widget A,90
2023-10-28,Widget B,200
2023-10-29,Widget A,110
2023-10-29,Widget B,170

Steps:

  1. Import the csv module: This module provides functionality for working with CSV files.
  2. Read the CSV file: Open the file and read its contents into a suitable data structure (e.g., a list of dictionaries).
  3. Data Cleaning (Optional): Handle missing values or incorrect data types if necessary. For this simple example, we'll assume the data is clean.
  4. Data Analysis:
    • Calculate the total sales.
    • Find the product with the highest sales.
    • Calculate the average sales per product.
    • Filter the data to show sales for a specific date.
  5. Present the Results: Print the analysis results in a user-friendly format.

Code Example:

import csv

def analyze_sales_data(filename):
    """
    Analyzes sales data from a CSV file.

    Args:
        filename (str): The name of the CSV file.

    Returns:
        None. Prints the analysis results.
    """

    sales_data = []
    with open(filename, 'r') as csvfile:
        reader = csv.DictReader(csvfile)  # Read as dictionaries
        for row in reader:
            # Convert 'Sales' to integer
            row['Sales'] = int(row['Sales'])
            sales_data.append(row)

    # Calculate total sales
    total_sales = sum(row['Sales'] for row in sales_data)

    # Find product with highest sales
    product_sales = {}
    for row in sales_data:
        product = row['Product']
        sales = row['Sales']
        if product in product_sales:
            product_sales[product] += sales
        else:
            product_sales[product] = sales

    best_selling_product = max(product_sales, key=product_sales.get)

    # Calculate average sales per product
    average_sales = {}
    for product, total in product_sales.items():
        average_sales[product] = total / len([row for row in sales_data if row['Product'] == product])

    # Filter sales for a specific date
    def filter_sales_by_date(data, date):
        return [row for row in data if row['Date'] == date]

    date_to_filter = "2023-10-27"
    filtered_sales = filter_sales_by_date(sales_data, date_to_filter)

    # Present the results
    print("Sales Data Analysis:")
    print("-" * 20)
    print(f"Total Sales: ${total_sales}")
    print(f"Best Selling Product: {best_selling_product} (Sales: ${product_sales[best_selling_product]})")
    print("\nAverage Sales per Product:")
    for product, avg_sales in average_sales.items():
        print(f"- {product}: ${avg_sales:.2f}")  # Format to 2 decimal places

    print(f"\nSales on {date_to_filter}:")
    for row in filtered_sales:
        print(f"- {row['Product']}: {row['Sales']}")


# Example usage:
analyze_sales_data('sales_data.csv')

Explanation:

  • import csv: Imports the necessary module.
  • analyze_sales_data(filename): This function encapsulates the entire analysis process.
  • with open(filename, 'r') as csvfile:: Opens the CSV file in read mode ('r'). The with statement ensures the file is automatically closed, even if errors occur.
  • reader = csv.DictReader(csvfile): Creates a DictReader object. This reads each row of the CSV file as a dictionary, where the keys are the column headers.
  • sales_data.append(row): Appends each row (dictionary) to the sales_data list.
  • Data Analysis: The code then performs the calculations as described in the "Steps" section. It uses list comprehensions and dictionaries to efficiently process the data.
  • filter_sales_by_date(data, date): A helper function to filter the data based on a given date.
  • print(...): Prints the results in a formatted way.

How to Run:

  1. Save the code: Save the Python code as a .py file (e.g., sales_analyzer.py).
  2. Create the CSV file: Create a file named sales_data.csv with the data shown above (or your own sales data). Make sure it's in the same directory as your Python script.
  3. Run the script: Open a terminal or command prompt, navigate to the directory where you saved the files, and run the script using python sales_analyzer.py.

Potential Extensions:

  • Error Handling: Add error handling to gracefully handle cases where the CSV file is missing, has an invalid format, or contains invalid data. Use try...except blocks.
  • Data Visualization: Use libraries like matplotlib or seaborn to create charts and graphs to visualize the sales data (e.g., a bar chart showing sales per product).
  • More Complex Analysis:
    • Calculate sales trends over time.
    • Identify the most profitable products.
    • Calculate moving averages.
  • User Input: Allow the user to specify the filename, date to filter, or other parameters.
  • Data Persistence: Save the analysis results to a file (e.g., a text file or another CSV file).
  • Pandas Integration: Use the pandas library, which is specifically designed for data analysis. It provides more powerful and convenient data structures and functions. (This is a good next step after mastering the csv module.)

Example using Pandas:

import pandas as pd

def analyze_sales_data_pandas(filename):
    """
    Analyzes sales data from a CSV file using pandas.

    Args:
        filename (str): The name of the CSV file.

    Returns:
        None. Prints the analysis results.
    """

    df = pd.read_csv(filename)

    # Calculate total sales
    total_sales = df['Sales'].sum()

    # Find product with highest sales
    best_selling_product = df.groupby('Product')['Sales'].sum().idxmax()
    best_selling_sales = df.groupby('Product')['Sales'].sum().max()

    # Calculate average sales per product
    average_sales = df.groupby('Product')['Sales'].mean()

    # Filter sales for a specific date
    date_to_filter = "2023-10-27"
    filtered_sales = df[df['Date'] == date_to_filter]

    # Present the results
    print("Sales Data Analysis (Pandas):")
    print("-" * 20)
    print(f"Total Sales: ${total_sales}")
    print(f"Best Selling Product: {best_selling_product} (Sales: ${best_selling_sales})")
    print("\nAverage Sales per Product:")
    print(average_sales)

    print(f"\nSales on {date_to_filter}:")
    print(filtered_sales)

# Example usage:
analyze_sales_data_pandas('sales_data.csv')

The Pandas version is significantly more concise and readable due to the library's built-in functions for data manipulation and analysis. It's