power-outages-analysis

Portfolio Homework for EECS 398: Practical Data Science


Predicting Power Outage Duration: An Exploratory Analysis

Step 1: Introduction

We chose the power outages dataset, which describes major power outage events in the continental U.S. We retrieved it from this ScienceDirect article and this Purdue link to the dataset.

Reasoning Behind Selection:

We are not familiar with League of Legends, and power outages interested us more than recipes. Because food and recipes are familiar, the example questions for the recipes dataset seemed easy to answer by intuition alone. We are less familiar with power outages, so we are genuinely curious about what the data might show. This dataset also has high real-world impact, since power outages affect many people.

Research Question:

How do cause of outage, state, number of customers affected, and anomaly level affect the duration of an outage?

Data Information:

Name                    Description
'CAUSE.CATEGORY'        Type of cause of event
'U.S._STATE'            State event occurred in
'POSTAL.CODE'           Abbreviation of state
'CUSTOMERS.AFFECTED'    Number of customers affected by outage event
'ANOMALY.LEVEL'         Oceanic El Niño/La Niña (ONI) index – 3 month running mean
'OUTAGE.DURATION'       Duration of outage in minutes

Step 2: Data Cleaning and Exploratory Analysis

Data Cleaning:

  1. We first checked every column we intended to use in the analysis for NaN values. We found that OUTAGE.DURATION and CUSTOMERS.AFFECTED both contained NaN values.
  2. We dropped every row that had a NaN value in either of those two columns.
  3. We renamed the columns to names that are easier to read and use.
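
The three cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the exact code we ran: the small `raw` DataFrame is made-up toy data standing in for the real file, and the `RENAMES` mapping is an assumption inferred from the original column names and the cleaned column names shown in the head below.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw outage data (illustrative values only).
raw = pd.DataFrame({
    "POSTAL.CODE": ["MN", "MN", "TX", "CA"],
    "U.S._STATE": ["Minnesota", "Minnesota", "Texas", "California"],
    "OUTAGE.DURATION": [3060.0, np.nan, 120.0, 500.0],
    "CAUSE.CATEGORY": ["severe weather", "severe weather",
                       "intentional attack", "equipment failure"],
    "ANOMALY.LEVEL": [-0.3, -1.5, 0.2, 0.5],
    "CUSTOMERS.AFFECTED": [70000.0, 70000.0, np.nan, 12000.0],
})

# Step 1: count NaN values in every column we plan to use.
nan_counts = raw.isna().sum()

# Step 2: drop rows that are missing either of the two affected columns.
cleaned = raw.dropna(subset=["OUTAGE.DURATION", "CUSTOMERS.AFFECTED"])

# Step 3: rename to shorter names (assumed mapping, see lead-in).
RENAMES = {
    "POSTAL.CODE": "abbr",
    "U.S._STATE": "state",
    "OUTAGE.DURATION": "duration",
    "CAUSE.CATEGORY": "cause",
    "ANOMALY.LEVEL": "level",
    "CUSTOMERS.AFFECTED": "customers",
}
cleaned = cleaned.rename(columns=RENAMES).reset_index(drop=True)

print(nan_counts[nan_counts > 0])  # columns that contain NaN values
print(cleaned)
```

Passing `subset=` to `dropna` keeps rows that are missing values only in columns we do not use.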

Below is the head of the cleaned data:

abbr   state       duration   cause            level   customers
MN     Minnesota   3060       severe weather   -0.3    70000
MN     Minnesota   3000       severe weather   -1.5    70000
MN     Minnesota   2550       severe weather   -0.1    68200
MN     Minnesota   1740       severe weather   1.2     250000
MN     Minnesota   1860       severe weather   -1.4    60000

Univariate Analysis:

Pie Chart – Distribution of Outage Causes in Dataset

Histogram – Distribution of Duration

Bivariate Analysis:

Choropleth – Mean Outage Duration and Most Common Cause of Outages by State

Box Plot – Distribution of Duration by Cause

Interesting Aggregates:

This grouped table allows us to look at the most common cause of outage for each state as well as the mean duration in minutes for each state.

state        abbr   cause               duration
Alabama      AL     severe weather      1421.75
Arizona      AZ     equipment failure   726.875
Arkansas     AR     severe weather      2210.85
California   CA     severe weather      2289.69
Colorado     CO     severe weather      1178.6
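
A grouped table of this kind can be built with a single `groupby`. The sketch below is one possible approach, not necessarily the code we used: the `outages` DataFrame is made-up toy data using the cleaned column names, and `most_common` is a hypothetical helper that breaks ties by taking the first mode.

```python
import pandas as pd

# Toy stand-in for the cleaned data (illustrative values only).
outages = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Alabama", "Arizona", "Arizona"],
    "abbr": ["AL", "AL", "AL", "AZ", "AZ"],
    "cause": ["severe weather", "severe weather", "equipment failure",
              "equipment failure", "equipment failure"],
    "duration": [1000.0, 2000.0, 300.0, 600.0, 800.0],
})


def most_common(values: pd.Series) -> str:
    """Return the most frequent value; ties resolve to the first mode."""
    return values.mode().iloc[0]


# One row per state: most common cause and mean duration in minutes.
summary = (
    outages.groupby(["state", "abbr"])
    .agg(cause=("cause", most_common), duration=("duration", "mean"))
    .reset_index()
)

print(summary)
```

Grouping on both `state` and `abbr` keeps the abbreviation in the result without a separate merge, since each state maps to exactly one abbreviation.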

Imputation:

We decided not to impute duration. Our original plan was to fill in missing duration values using the outage start time and end time. However, we found that whenever duration was NaN, at least one of the start time or end time was also NaN, so the duration could not be recomputed. We did not feel comfortable filling in any other value: every column we use varies widely, so a single summary value such as the mean would not accurately reflect the data.
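
The check described above can be sketched as follows. This is an illustration under assumptions: the column names `start` and `end` are hypothetical stand-ins for the combined start and restoration timestamps, and the rows are made-up toy data rather than values from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: two rows are missing a duration, and each of those rows
# is also missing one of its timestamps (illustrative values only).
outages = pd.DataFrame({
    "start": pd.to_datetime(
        ["2011-07-01 17:00", None, "2012-06-19 04:30", "2015-07-18 02:00"]
    ),
    "end": pd.to_datetime(
        ["2011-07-03 20:00", "2014-05-11 18:38", None, "2015-07-19 07:00"]
    ),
    "duration": [3060.0, np.nan, np.nan, 1740.0],
})

missing_duration = outages["duration"].isna()
missing_endpoint = outages["start"].isna() | outages["end"].isna()

# Rows whose duration is missing but could be recomputed from timestamps.
recoverable = missing_duration & ~missing_endpoint
print("Recoverable rows:", int(recoverable.sum()))

# The fill we had hoped to use: end minus start, in minutes.
computed = (outages["end"] - outages["start"]).dt.total_seconds() / 60
print(computed)
```

If `recoverable.sum()` is zero, as it is in this toy data, the timestamps offer nothing to fill with, and dropping the rows is the remaining option.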

Step 3: Framing a Prediction Problem

Problem Identification:

Step 4: Baseline Model

Step 5: Final Model