Predicting US County Level Opioid Overdose Mortality Rate

Table of Contents

  1. Presentation Slides
  2. Abstract
  3. What is the opioid epidemic?
    1. Annual Cause of Death by Opioid Overdose
    2. Number of Deaths by Demographic Factors 1999 - 2016
  4. Reading Between the Numbers - Why Should We Care?
  5. Living with an Opioid Addiction
  6. USA by Opioid Overdose Mortality Rate, 2016
    1. Top Ten Counties with Highest Mortality Rate Caused by Opioid Overdose in 2016
  7. Why Predict Opioid Overdose Mortality Rate?
  8. Datasets
  9. Variables
  10. Exploratory Data Analysis
    1. Distribution of County Level Opioid Overdose Mortality Rate
    2. Distribution of County Level Opioid Overdose Mortality Rate (Opioid Overdose Mortality Rate Greater Than Zero)
    3. Violin Plots of All Variables Except Opioid Overdose Mortality Rate
    4. Correlation Heat Map of All the Variables
    5. Top 12 Correlation Magnitude Variable Pairs
    6. Pearson's Correlation Between Crude Mortality Rate and All Other Variables
    7. Conclusion From Exploratory Data Analysis
  11. Predictions
    1. Hurdle Model
    2. Prediction Procedure
    3. Bootstrap RMSE Distributions for All Methods
    4. Boxplot of Bootstrap RMSE Distributions for All Methods
    5. Table of Two-Sample T-Test P-Values Between Models
    6. Predicted vs Actual Opioid Overdose Mortality Rate
    7. Feature Importances
  12. A Closer Look at the County Level Population Distribution
    1. Boxplot of Bootstrap RMSE Distributions for All Methods (Population >= 18,646)
    2. Predicted vs Actual Opioid Overdose Mortality Rate (Population >= 18,646)
    3. Feature Importances (Population >= 18,646)
  13. Conclusion from Prediction Analysis
  14. Further Research
  15. References
  16. Links
  17. Special Thanks

Presentation Slides

Full Screen Presentation Slides

Abstract

America is in the midst of an opioid epidemic, which makes it urgent to investigate the causes of the problem. To help with this, I looked into predicting opioid overdose mortality rates for US counties in 2016. If we can make reasonable predictions of county level opioid overdose mortality rates, we can focus on the counties that are predicted to have high rates. This will help us make better use of our resources and guide further study into the causes of the problem. For 2,962 US counties, I performed exploratory data analysis and predicted the 2016 county opioid overdose mortality rate from 35 predictor variables. To make the predictions, I used 18 different models (linear models as well as tree-based models) and 1,000 iterations of RandomizedSearchCV to tune hyperparameters. I estimated the sampling distribution of the root mean squared error (RMSE) for each of the 18 models by performing 10,000 repetitions of bootstrap sampling. Based on these bootstrap distributions, I performed two-sample t-tests to check whether the mean RMSE of the best model differs significantly from the mean RMSE of each of the other models. I used Amazon AWS EC2 to access multiple cores for faster processing. In terms of RMSE, the best model is random forest regression with tuned hyperparameters, which had a mean bootstrap RMSE of 9.25 deaths per 100,000 people. This mean was statistically significantly different from that of the second best performing model, with a p-value of 9.3e-5.
However, judging by the scatter plots of actual versus predicted values, the best fit came from XGBoost's stochastic gradient boosting with tuned hyperparameters, which had a mean bootstrap RMSE of 9.37 deaths per 100,000 people. Hence, my final model of choice is XGBoost's stochastic gradient boosting with tuned hyperparameters. With this tool, we can predict any given county's opioid overdose mortality rate. Further investigation should look into why certain demographic information, such as the percentage of a county's population who are American Indian or Alaska Native and the percentage who are between 45 and 49 years old, plays an important role in the predictions.

What is the opioid epidemic?

Wikipedia defines the opioid epidemic as the following:
The opioid epidemic or opioid crisis is the rapid increase in the use of prescription and non-prescription opioid drugs in the United States and Canada beginning in the late 1990s and continuing throughout the next two decades. Opioids are a diverse class of moderately strong painkillers, including oxycodone (commonly sold under the trade names OxyContin and Percocet), hydrocodone (Vicodin), and a very strong painkiller, fentanyl, which is synthesized to resemble other opiates such as opium-derived morphine and heroin. [1]

From 1999 to 2016 in America, deaths due to opioid overdose increased by 425%, from 8,050 deaths in 1999 to 42,249 deaths in 2016. That means more than five times as many deaths in 2016 as in 1999, a span of 17 years.
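As a quick check of the arithmetic, here is a minimal sketch that uses only the two death counts quoted above. The helper name `percent_increase` is illustrative and not part of the original analysis.

```python
# Death counts quoted in the text above.
deaths_1999 = 8_050
deaths_2016 = 42_249


def percent_increase(old, new):
    """Relative change from old to new, expressed in percent."""
    return (new - old) / old * 100


increase = percent_increase(deaths_1999, deaths_2016)
ratio = deaths_2016 / deaths_1999

print(f"Increase since 1999: {increase:.0f}%")
print(f"Deaths in 2016 relative to 1999: {ratio:.1f} times")
```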

Annual Cause of Death by Opioid Overdose

Number of Deaths by Demographic Factors 1999 - 2016

From the bar charts above, we see some surprising and disturbing numbers:
  • Almost 90% of the people who died of opioid overdose were white
  • 66% of the people who died were male
  • 48% of deaths are from ages 25-54

Reading Between the Numbers - Why Should We Care?

Many are dying because of opioid overdose, but it's not just death. Many people who are not represented in these death counts are struggling with opioid addiction. Opioid addiction not only destroys the health of the addicted individual, it also harms that person's loved ones and family and, ultimately, the community at large. Addiction to opioids, like any other addiction, leads to financial problems, health problems, loss of self-control, and relational difficulties, and it can ultimately end in death. Watch the video below of the story of Amy and her struggle with opioid addiction to see a glimpse of the struggles that come with opioid addiction.

Living with an Opioid Addiction

USA by Opioid Overdose Mortality Rate, 2016

Below, we can see the opioid mortality rates by county across the United States. The Midwest and the East Coast appear to have high concentrations of opioid overdose mortality. 2,261 counties have zero deaths due to opioid overdose per 100,000 people; they are colored white in the map below. The other 701 counties have opioid mortality rates greater than zero.

Top Ten Counties with Highest Mortality Rate Caused by Opioid Overdose in 2016

Based on the map and the bar plot, Harrison County, KY had the highest opioid overdose mortality rate in 2016, with 118 deaths per 100,000 people. That is 0.118% of all the people in Harrison County dying because of opioid overdose. By state, West Virginia has five of the ten counties with the highest opioid overdose mortality rates.

Why Predict Opioid Overdose Mortality Rate?

To help solve the opioid addiction crisis, it is beneficial to predict which counties in the US will have high opioid mortality rates based on variables such as county's median household income, unemployment rate, poverty rate estimate, and opioid prescription rate by health care providers. In total, there are 35 variables used to make the predictions (explained in detail later). Once we predict opioid mortality rates for the counties, we can focus on the counties that are predicted to have high mortality rates to better channel our energy and resources in fighting this epidemic. We can also use these predictions to explore potential causes of opioid overdose. In short, predicting counties' opioid mortality rate will allow us to fight this epidemic more effectively.

Datasets

I gathered the following nine datasets and combined them into one dataframe:
  1. US County Opioid Prescription Rates 2016
  2. County Level Education Data 2016
  3. County Level Population Estimates Data 2016
  4. County Level Poverty Estimates Data 2016
  5. County Level Unemployment Rate Data 2016
  6. Opioid Overdose Mortality Rate 2016
    • Original data was obtained from this query from the CDC WONDER website. The dataset was saved on GitHub for easy access.
    • Note that statistics representing zero to nine deaths are suppressed at the region, county, and state level.
    • Counties with suppressed statistics are assumed to have a mortality rate of zero, since the number of deaths tends to be negligible compared to the county population size.
  7. Overall Death Rate (All Causes) 2016
    • Original data was obtained from this query from the CDC WONDER website. The dataset was saved on GitHub for easy access.
  8. County Level Race Data 2016
    • Original data was obtained from this query from the CDC WONDER website. The dataset was saved on GitHub for easy access.
  9. County Level Age Data 2016
    • Original data was obtained from this query from the CDC WONDER website. The dataset was saved on GitHub for easy access.
The following dataset contains mortality rates and death counts due to opioid overdose from 1999 to 2016. This data is not divided by county, but it includes columns for age group, ethnicity, and gender, and it is used primarily for visualization. Starting from 3,147 counties, I removed all counties with missing data, which left 2,962 counties and 40 columns. More details regarding data wrangling can be found here.
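The suppression rule and the removal of counties with missing data can be sketched roughly as follows. This is a minimal illustration in plain Python, not the original wrangling code: the FIPS codes, field names, and values are made up, and the assumption that suppressed cells contain the text "Suppressed" should be checked against the actual CDC WONDER export.

```python
def parse_death_count(raw):
    """Convert a death-count cell to int, treating suppressed cells as zero.

    Assumption: suppressed cells contain the text "Suppressed".
    """
    return 0 if raw.strip().lower() == "suppressed" else int(raw)


def combine_tables(*tables):
    """Merge per-county tables keyed by FIPS code.

    A county is kept only if it appears in every table and none of its
    fields is missing (None), mirroring the removal of missing data.
    """
    shared_fips = set(tables[0])
    for table in tables[1:]:
        shared_fips &= set(table)

    combined = {}
    for fips in sorted(shared_fips):
        row = {}
        for table in tables:
            row.update(table[fips])
        if all(value is not None for value in row.values()):
            combined[fips] = row
    return combined


# Hypothetical inputs: three tiny tables keyed by made-up FIPS codes.
population = {
    "00001": {"population": 50_000},
    "00002": {"population": 8_000},
    "00003": {"population": 120_000},
}
unemployment = {
    "00001": {"unemployment_rate": 4.8},
    "00002": {"unemployment_rate": None},  # missing value
    "00003": {"unemployment_rate": 6.1},
}
raw_deaths = {"00001": "Suppressed", "00002": "Suppressed", "00003": "24"}

# Turn death counts into rates per 100,000 people.
mortality = {}
for fips, raw in raw_deaths.items():
    deaths = parse_death_count(raw)
    rate = deaths / population[fips]["population"] * 100_000
    mortality[fips] = {"opioid_mortality_rate": rate}

counties = combine_tables(population, unemployment, mortality)
for fips, row in counties.items():
    print(fips, row)
```

In this toy input, county 00002 is dropped because its unemployment rate is missing, while county 00001 is kept with a rate of zero because its death count was suppressed.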

Variables

The following are the variables included in the main dataset that is used for analysis. In bold are the variable names.
  • 10_14_years_% - Estimated percent of population in county that are between 10 and 14 years of age in 2016
  • 15_19_years_% - Estimated percent of population in county that are between 15 and 19 years of age in 2016
  • 1_4_years_% - Estimated percent of population in county that are between 1 and 4 years of age in 2016
  • 20_24_years_% - Estimated percent of population in county that are between 20 and 24 years of age in 2016
  • 25_29_years_% - Estimated percent of population in county that are between 25 and 29 years of age in 2016
  • 30_34_years_% - Estimated percent of population in county that are between 30 and 34 years of age in 2016
  • 35_39_years_% - Estimated percent of population in county that are between 35 and 39 years of age in 2016
  • 40_44_years_% - Estimated percent of population in county that are between 40 and 44 years of age in 2016
  • 45_49_years_% - Estimated percent of population in county that are between 45 and 49 years of age in 2016
  • 50_54_years_% - Estimated percent of population in county that are between 50 and 54 years of age in 2016
  • 55_59_years_% - Estimated percent of population in county that are between 55 and 59 years of age in 2016
  • 5_9_years_% - Estimated percent of population in county that are between 5 and 9 years of age in 2016
  • 60_64_years_% - Estimated percent of population in county that are between 60 and 64 years of age in 2016
  • 65_69_years_% - Estimated percent of population in county that are between 65 and 69 years of age in 2016
  • 70_74_years_% - Estimated percent of population in county that are between 70 and 74 years of age in 2016
  • 75_79_years_% - Estimated percent of population in county that are between 75 and 79 years of age in 2016
  • 80_84_years_% - Estimated percent of population in county that are between 80 and 84 years of age in 2016
  • 85+_years_% - Estimated percent of population in county that are 85 years old and older in 2016
  • All Other Death Rate 2016 (per 100,000) - 2016 county death rate from all causes excluding opioid overdose
  • american_indian_or_alaska_native_% - Estimated percent of population in county that are American Indian or Alaska Native in 2016
  • asian_or_pacific_islander_% - Estimated percent of population in county that are Asian or Pacific Islander in 2016
  • Birth Rate 2016 (per 100) - Birth rate in period 7/1/2015 to 6/30/2016 (per 1000 people)
  • black_or_african_american_% - Estimated percent of population in county that are Black or African American in 2016
  • county - County's name
  • county_state - County's name and state the county is located in
  • Crude Opioid Mortality Rate (per 100,000) - Estimated rate for deaths caused by opioid overdose in the county for 2016 (per 100,000 people)
    • Specifically, the types of drug-related deaths include the following:
      • Drug poisonings (overdose)
      • Unintentional
      • Suicide
      • Homicide
      • Undetermined
    • The specific drugs included in the death rates are the following:
      • Opium
      • Heroin
      • Other opioids
      • Methadone
      • Other synthetic narcotics
      • Other and unspecified narcotics
  • Domestic Migration Rate 2016 (per 100) - Net domestic migration rate in period 7/1/2015 to 6/30/2016 (per 1000 people)
  • fips_code - County's unique code
  • GQ Estimates 2016 (count) - 7/1/2016 Group Quarters total population estimate (number of people)
    • all people not living in housing units (house, apartment, mobile home, rented rooms)
    • e.g. people living in correctional facilities, nursing homes, mental hospitals, dormitories, military barracks, or shelters
  • International Migration Rate 2016 (per 100) - Net international migration rate in period 7/1/2015 to 6/30/2016 (per 1000 people)
  • less_than_1_year_% - Estimated percent of population in county that are less than one year old in 2016
  • Median Household Income 2016 (dollars) - Estimate of median household Income, 2016 (in dollars)
  • Opioid Prescription Rate (per 100) - Prescription rate for opioids for the county (per 100 people)
  • Estimated percent with High School Diploma Only (%) - Estimated percent of people with high school diploma only (%)
  • Population Change 2016 (count) - Net change in resident total population 7/1/2015 to 7/1/2016 (number of people)
  • Population Estimate 2016 (count) - Estimated population for the county in 2016
  • Poverty Estimated percentage 2016 (%) - Estimated percent of people of all ages in poverty 2016 (%)
  • state - State the county is in
  • Unemployment Rate 2016 (per 100) - Unemployment rate, 2016 (per 100 people)
  • white_% - Estimated percent of population in county that are White in 2016

Exploratory Data Analysis

Let's do some exploratory data analysis before we make predictions. First, let's examine the distribution of the county level opioid overdose mortality rate.

Distribution of County Level Opioid Overdose Mortality Rate

Here, we see that the distribution is skewed right, with a large number of counties at zero. In fact, 2,261 of the 2,962 counties (76.3%) have zero opioid overdose mortality rate. Let's now remove all the counties with zero opioid overdose mortality rate and reexamine the distribution of the county level opioid overdose mortality rate.

Distribution of County Level Opioid Overdose Mortality Rate (Opioid Overdose Mortality Rate Greater Than Zero)

When we remove all the counties with zero opioid overdose mortality rate, we see that the distribution is still right-skewed. The median opioid overdose mortality rate is 16.5 deaths per 100,000 people, and the 90th percentile is 35.9 deaths per 100,000 people. Let's now look at the distributions of all other variables except opioid mortality rate.
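The summary figures in this section come down to a zero share and two quantiles. Here is a minimal sketch of how they could be computed. The two county counts are taken from the text, but the list of rates is made up purely for illustration, so the printed median and percentile will not match the real values.

```python
from statistics import median, quantiles

# County counts quoted in the text.
zero_counties = 2_261
total_counties = 2_962
zero_share = zero_counties / total_counties * 100
print(f"Counties with a zero rate: {zero_share:.1f}%")

# Hypothetical rates (deaths per 100,000 people); the real data has 2,962 values.
rates = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 9.8, 12.4, 16.5, 21.0, 35.9, 60.2]

# Drop the zero-rate counties before summarizing the remaining distribution.
positive_rates = [rate for rate in rates if rate > 0]

median_rate = median(positive_rates)
# quantiles(..., n=10) returns the nine decile cut points; the last is the 90th percentile.
p90_rate = quantiles(positive_rates, n=10, method="inclusive")[-1]

print(f"Median of positive rates: {median_rate}")
print(f"90th percentile of positive rates: {p90_rate:.2f}")
```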

Violin Plots of All Variables Except Opioid Overdose Mortality Rate

Correlation Heat Map of All the Variables

Let's examine the top twelve variable pairs with the largest correlation magnitude (absolute correlation value). That is, take the absolute value of all the correlation values and select the twelve that are closest to one.
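The ranking described above can be sketched as follows. This is a plain-Python illustration under assumed inputs: the column names and values are hypothetical, and the original analysis most likely used a dataframe correlation matrix instead (pandas offers `DataFrame.corr` for that).

```python
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def top_correlated_pairs(columns, k):
    """Return the k column pairs with the largest absolute correlation."""
    pairs = [
        (name_a, name_b, pearson(columns[name_a], columns[name_b]))
        for name_a, name_b in combinations(columns, 2)
    ]
    # Sort by magnitude so strong negative correlations rank alongside positive ones.
    return sorted(pairs, key=lambda pair: abs(pair[2]), reverse=True)[:k]


# Hypothetical columns for five counties; the values are made up.
columns = {
    "population": [10, 20, 30, 40, 50],
    "gq_estimate": [1, 2, 3, 4, 5],
    "poverty_pct": [22.0, 20.0, 17.0, 15.0, 14.0],
    "birth_rate": [11.0, 13.0, 12.0, 15.0, 14.0],
}

top_pairs = top_correlated_pairs(columns, k=3)
for name_a, name_b, r in top_pairs:
    print(f"{name_a} vs {name_b}: r = {r:+.3f}")
```

Sorting on the absolute value matters here: a pair with a correlation near -1 is as strongly related as a pair near +1, and both belong in the list.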

Top 12 Correlation Magnitude Variable Pairs

From the above correlation heat map and the scatter plot, we can see the variables that have high correlation values. Also, it's important to note that not all the relationships are linear. Next, let's focus on correlation between crude mortality rate and all other variables.

Pearson's Correlation Between Crude Mortality Rate and All Other Variables

From the plot above, we see that `GQ Estimates 2016 (count)` has the highest correlation with crude opioid mortality rate. Seeing that population estimate is also one of the variables with relatively high correlation with crude opioid mortality rate, perhaps counties with higher population sizes have higher mortality rates due to opioid overdose.

Conclusion From Exploratory Data Analysis

It seems as if the counties with higher opioid overdose mortality rates have the following:
  • Higher GQ estimates
  • Higher population percentage of 45 - 49 year olds
  • Higher population estimates
  • Higher median household income
Further research could investigate why counties with high GQ estimates have high opioid overdose mortality rates. Note that GQ estimates represent the number of people living in places such as group homes, mental hospitals, correctional facilities, and dormitories. Counties with a higher percentage of 45 to 49 year olds also have higher opioid overdose mortality rates, which is consistent with the fact that a high percentage of the people who died of opioid overdose fall in this age group. It is surprising to see that counties with higher income tend to have higher mortality rates caused by opioid overdose.

Predictions

To predict opioid overdose mortality rate, I removed the county population size estimate and converted the `GQ Estimates 2016 (count)` variable to a percentage of the population estimate. Both variables were absolute counts, and as such they caused the smaller counties to have predicted opioid overdose mortality rates of zero. This was a problem because a small population does not necessarily mean a zero opioid overdose mortality rate. Removing county population size and converting `GQ Estimates 2016 (count)` to a percentage of county population size helped solve this problem. The following is a list of all the variables used to predict county level opioid overdose mortality rate. The description of these variables can be found above.
  1. %_pop_chg
  2. 10_14_years_%
  3. 15_19_years_%
  4. 1_4_years_%
  5. 20_24_years_%
  6. 25_29_years_%
  7. 30_34_years_%
  8. 35_39_years_%
  9. 40_44_years_%
  10. 45_49_years_%
  11. 50_54_years_%
  12. 55_59_years_%
  13. 5_9_years_%
  14. 60_64_years_%
  15. 65_69_years_%
  16. 70_74_years_%
  17. 75_79_years_%
  18. 80_84_years_%
  19. 85+_years_%
  20. All Other Death Rate 2016 (per 100,000)
  21. american_indian_or_alaska_native_%
  22. asian_or_pacific_islander_%
  23. Birth Rate 2016 (per 100)
  24. black_or_african_american_%
  25. Crude Opioid Mortality Rate (per 100,000)
  26. Domestic Migration Rate 2016 (per 100)
  27. gq_%_of_total_pop
  28. International Migration Rate 2016 (per 100)
  29. less_than_1_year_%
  30. Median Household Income 2016 (dollars)
  31. Opioid Prescription Rate (per 100)
  32. Percent with High School Diploma Only (%)
  33. Poverty Percentage 2016 (%)
  34. Unemployment Rate 2016 (per 100)
  35. white_%
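The conversion of the group quarters count described before this list can be sketched as below. The column name `gq_%_of_total_pop` comes from the list, while the helper name and the example rows are hypothetical and not taken from the original code.

```python
def add_gq_share(county):
    """Return a copy of a county row with the GQ count replaced by a percentage.

    The absolute population column is dropped as well, since it was removed
    from the predictors.
    """
    converted = dict(county)
    population = converted.pop("Population Estimate 2016 (count)")
    gq_count = converted.pop("GQ Estimates 2016 (count)")
    converted["gq_%_of_total_pop"] = gq_count / population * 100
    return converted


# Hypothetical rows: a small and a large county with the same GQ share.
small = {"Population Estimate 2016 (count)": 5_000, "GQ Estimates 2016 (count)": 100}
large = {"Population Estimate 2016 (count)": 500_000, "GQ Estimates 2016 (count)": 10_000}

small_converted = add_gq_share(small)
large_converted = add_gq_share(large)
print(small_converted)
print(large_converted)
```

After the conversion the two counties look identical on this feature, so a model can no longer use raw size as a shortcut for predicting a zero rate.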
The following eighteen algorithms were used to predict the county level opioid overdose mortality rates:
  1. Multiple Linear Regression
  2. Ridge Regression (Default Hyperparameters)
  3. Ridge Regression (Tuned Hyperparameters)
  4. LASSO Regression (Default Hyperparameters)
  5. LASSO Regression (Tuned Hyperparameters)
  6. Elastic Net Regression (Default Hyperparameters)
  7. Elastic Net Regression (Tuned Hyperparameters)
  8. CART (Default Hyperparameters)
  9. CART (Tuned Hyperparameters)
  10. Random Forest Regression (Default Hyperparameters)
  11. Random Forest Regression (Tuned Hyperparameters)
  12. Gradient Boosting Regression (Default Hyperparameters)
  13. Gradient Boosting Regression (Tuned Hyperparameters)
  14. Stochastic Gradient Boosting (Default Hyperparameters, sklearn)
  15. Stochastic Gradient Boosting (Tuned Hyperparameters, sklearn)
  16. Stochastic Gradient Boosting (Default Hyperparameters, xgboost)
  17. Stochastic Gradient Boosting (Tuned Hyperparameters, xgboost)
  18. Hurdle Model (xgboost classification and regression)

Hurdle Model

What is the hurdle model? An article from the University of Virginia states the following:
The hurdle model is a two-part model that specifies one process for zero counts and another process for positive counts. The idea is that positive counts occur once a threshold is crossed, or put another way, a hurdle is cleared. If the hurdle is not cleared, then we have a count of 0. [2]
The procedure I used for the hurdle model is as follows:
  1. Use XGBoost's classifier to predict whether a county has zero opioid overdose mortality rate
  2. For the counties that are predicted to have greater than zero opioid overdose mortality rate, use XGBoost's regressor to predict the county's opioid overdose mortality rate
This mimics section 2.6 of "Machine learning and hurdle models for improving regional predictions of stream water acid neutralizing capacity". [3]
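The two-step procedure can be sketched as a small wrapper around any classifier and regressor that expose `fit` and `predict`. This is a minimal illustration, not the code used for this analysis: the analysis used XGBoost for both stages, while the `ThresholdClassifier` and `MeanRegressor` below are deliberately trivial stand-ins so the example runs without extra packages, and the data is made up.

```python
class HurdleModel:
    """Two-part model: a classifier decides zero vs. positive, a regressor sizes the positives."""

    def __init__(self, classifier, regressor):
        self.classifier = classifier
        self.regressor = regressor

    def fit(self, X, y):
        # Step 1: the classifier learns whether the rate is greater than zero.
        self.classifier.fit(X, [1 if target > 0 else 0 for target in y])
        # Step 2: the regressor is trained only on rows with a positive rate.
        positive = [(row, target) for row, target in zip(X, y) if target > 0]
        self.regressor.fit([row for row, _ in positive], [t for _, t in positive])
        return self

    def predict(self, X):
        cleared = self.classifier.predict(X)
        predictions = [0.0] * len(X)
        # Only rows that clear the hurdle are passed to the regressor.
        idx = [i for i, flag in enumerate(cleared) if flag == 1]
        if idx:
            amounts = self.regressor.predict([X[i] for i in idx])
            for i, amount in zip(idx, amounts):
                predictions[i] = amount
        return predictions


class ThresholdClassifier:
    """Stand-in classifier: splits the first feature at the midpoint of the class means."""

    def fit(self, X, labels):
        zeros = [row[0] for row, label in zip(X, labels) if label == 0]
        ones = [row[0] for row, label in zip(X, labels) if label == 1]
        self.threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
        return self

    def predict(self, X):
        return [1 if row[0] >= self.threshold else 0 for row in X]


class MeanRegressor:
    """Stand-in regressor: always predicts the mean of its training targets."""

    def fit(self, X, y):
        self.mean = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean] * len(X)


# Hypothetical data: one feature per county, rates in deaths per 100,000 people.
X_train = [[10.0], [12.0], [30.0], [35.0]]
y_train = [0.0, 0.0, 14.0, 18.0]

model = HurdleModel(ThresholdClassifier(), MeanRegressor()).fit(X_train, y_train)
hurdle_predictions = model.predict([[11.0], [40.0]])
print(hurdle_predictions)
```

Keeping the two stages as separate objects makes it possible to swap in stronger estimators later without changing the hurdle logic itself.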

Prediction Procedure

For all of the algorithms, I used the same training set and test set. 80% of the data was randomly selected for the training set. The remaining 20% of the data was used for the test set. When using default hyperparameters, the following procedure was performed:
  1. Use default hyperparameter values to fit data (no cross-validation was used)
  2. Calculate RMSE
When tuning for hyperparameters, the following procedure was performed:
  1. Tune hyperparameter values using RandomizedSearchCV (1000 iterations) and 5-fold cross-validation to fit the training set
  2. Select model that has the lowest RMSE
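To make the tuning procedure concrete, here is a minimal sketch of a randomized search with 5-fold cross-validation. It is a stand-in under simplifying assumptions, not the code used here: it tunes a single penalty `alpha` of a one-feature ridge regression on synthetic data, whereas the analysis used scikit-learn's `RandomizedSearchCV`, which runs the same kind of search over full hyperparameter spaces. It also uses 100 iterations instead of 1,000 to keep the example fast.

```python
import random


def fit_ridge_slope(xs, ys, alpha):
    """Closed-form slope of a no-intercept, one-feature ridge regression."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


def rmse(actual, predicted):
    """Root mean squared error."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5


def cv_rmse(xs, ys, alpha, k=5):
    """Mean validation RMSE over k folds (fold i holds out every k-th point)."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        valid = [i for i in range(len(xs)) if i % k == fold]
        slope = fit_ridge_slope([xs[i] for i in train], [ys[i] for i in train], alpha)
        scores.append(rmse([ys[i] for i in valid], [slope * xs[i] for i in valid]))
    return sum(scores) / k


def randomized_search(xs, ys, n_iter, rng):
    """Sample alpha log-uniformly n_iter times and keep the lowest CV RMSE."""
    best_alpha, best_score = None, float("inf")
    for _ in range(n_iter):
        alpha = 10 ** rng.uniform(-3, 3)
        score = cv_rmse(xs, ys, alpha)
        if score < best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score


# Synthetic data (an assumption for illustration): y = 2x plus unit-variance noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(50)]
ys = [2.0 * x + rng.gauss(0, 1) for x in xs]

best_alpha, best_score = randomized_search(xs, ys, n_iter=100, rng=rng)
print(f"Best alpha: {best_alpha:.4f}, cross-validated RMSE: {best_score:.3f}")
```

Random sampling covers a wide range of values with a fixed budget of iterations, which is why it scales better than an exhaustive grid when several hyperparameters are tuned at once.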
Once the best models were selected, I created a bootstrap sampling distribution of the RMSE for each model, both to estimate the distribution of the RMSE and to perform two-sample t-tests to see whether the differences between the models were statistically significant. The procedure for the bootstrapping is as follows:
  1. From the dataset, sample counties with replacement 2369 times (80% of the data) to obtain the training set
  2. The rest of the data is the test set
  3. Fit the training set using the best models
  4. Make predictions for the test set
  5. Calculate RMSE
  6. Repeat steps one to five B = 10,000 times
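The six steps can be sketched as a short loop. This is an illustration under stated assumptions rather than the original code: a one-feature least-squares line stands in for the tuned models, the data is synthetic, "the rest of the data" is read as the counties that were never drawn in step one, and only 200 repetitions are run instead of 10,000.

```python
import random


def rmse(actual, predicted):
    """Root mean squared error."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5


def fit_line(xs, ys):
    """Least-squares line for one feature; returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def bootstrap_rmse(xs, ys, n_boot, train_frac=0.8, seed=0):
    """Return one test RMSE per bootstrap repetition."""
    rng = random.Random(seed)
    n = len(xs)
    n_train = int(train_frac * n)  # int(0.8 * 2962) gives the 2369 draws in step one
    scores = []
    for _ in range(n_boot):
        # Step 1: draw the training set with replacement.
        train_idx = [rng.randrange(n) for _ in range(n_train)]
        # Step 2: the test set is every observation that was never drawn.
        drawn = set(train_idx)
        test_idx = [i for i in range(n) if i not in drawn]
        # Steps 3 to 5: fit on the training set, predict the test set, score.
        slope, intercept = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        predictions = [slope * xs[i] + intercept for i in test_idx]
        scores.append(rmse([ys[i] for i in test_idx], predictions))
    return scores


# Synthetic data (an assumption for illustration): a noisy straight line.
data_rng = random.Random(1)
xs = [data_rng.uniform(0, 10) for _ in range(100)]
ys = [3.0 * x + 5.0 + data_rng.gauss(0, 1) for x in xs]

scores = bootstrap_rmse(xs, ys, n_boot=200)
mean_score = sum(scores) / len(scores)
print(f"Mean bootstrap RMSE over {len(scores)} repetitions: {mean_score:.2f}")
```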
You can see the bootstrap RMSE distributions for all methods below:

Bootstrap RMSE Distributions for All Methods

You can see that the bootstrap RMSE distributions are approximately normal. Hence, I used two-sample t-tests to check whether the mean RMSE of the best model (the one with the lowest mean RMSE) differs significantly from the mean RMSE of each of the other models. You can see the results of the hypothesis testing below.
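A two-sample comparison of this kind can be sketched as follows. Several things here are assumptions: the unequal-variance (Welch) form of the statistic is used because the text does not say which variant was applied, the p-value uses a normal approximation that is reasonable only because each sample has 10,000 values, and the two samples are simulated around the mean RMSEs reported in this write-up (9.25 and 9.37) with a made-up spread of 0.5. In practice, `scipy.stats.ttest_ind` is the usual tool for this test.

```python
import random
from statistics import NormalDist, mean, variance


def two_sample_test(sample_a, sample_b):
    """Welch-style test statistic with a normal-approximation, two-sided p-value."""
    standard_error = (
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    ) ** 0.5
    statistic = (mean(sample_a) - mean(sample_b)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(statistic)))
    return statistic, p_value


rng = random.Random(0)
# Simulated bootstrap RMSE samples; the spread of 0.5 is hypothetical.
model_a = [rng.gauss(9.25, 0.5) for _ in range(10_000)]
model_b = [rng.gauss(9.37, 0.5) for _ in range(10_000)]

statistic, p_value = two_sample_test(model_a, model_b)
print(f"statistic = {statistic:.2f}, p-value = {p_value:.3g}")
```

With 10,000 values per sample the standard error is tiny, so even a small gap between means produces a very small p-value. That is why practical significance has to be judged separately from statistical significance.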

Boxplot of Bootstrap RMSE Distributions for All Methods

Table of Two-Sample T-Test P-Values Between Models

| # | Model 1 | Model 2 | p-value |
|---|---------|---------|---------|
| 1 | Random Forest Regression (Tuned Hyperparameters) | Gradient Boosting Regression (Tuned Hyperparam... | 9.307487e-05 |
| 2 | Random Forest Regression (Tuned Hyperparameters) | Stochastic Gradient Boosting (Tuned Hyperparam... | 4.117637e-13 |
| 3 | Random Forest Regression (Tuned Hyperparameters) | Stochastic Gradient Boosting (Tuned Hyperparam... | 2.063396e-89 |
| 4 | Random Forest Regression (Tuned Hyperparameters) | Stochastic Gradient Boosting (Default Hyperpar... | 3.593946e-221 |
| 5 | Random Forest Regression (Tuned Hyperparameters) | Stochastic Gradient Boosting (Default Hyperpar... | 0.000000e+00 |
| 6 | Random Forest Regression (Tuned Hyperparameters) | Gradient Boosting Regression (Default Hyperpar... | 0.000000e+00 |
| 7 | Random Forest Regression (Tuned Hyperparameters) | LASSO Regression (Tuned Hyperparameters) | 0.000000e+00 |
| 8 | Random Forest Regression (Tuned Hyperparameters) | Elastic Net Regression (Tuned Hyperparameters) | 0.000000e+00 |
| 9 | Random Forest Regression (Tuned Hyperparameters) | Ridge Regression (Default Hyperparameters) | 0.000000e+00 |
| 10 | Random Forest Regression (Tuned Hyperparameters) | Ridge Regression (Tuned Hyperparameters) | 0.000000e+00 |
| 11 | Random Forest Regression (Tuned Hyperparameters) | Multiple Linear Regression | 0.000000e+00 |
| 12 | Random Forest Regression (Tuned Hyperparameters) | CART (Tuned Hyperparameters) | 0.000000e+00 |
| 13 | Random Forest Regression (Tuned Hyperparameters) | Random Forest Regression (Default Hyperparamet... | 0.000000e+00 |
| 14 | Random Forest Regression (Tuned Hyperparameters) | Elastic Net Regression (Default Hyperparameters) | 0.000000e+00 |
| 15 | Random Forest Regression (Tuned Hyperparameters) | Hurdle Model (xgboost classification and regre... | 0.000000e+00 |
| 16 | Random Forest Regression (Tuned Hyperparameters) | LASSO Regression (Default Hyperparameters) | 0.000000e+00 |
| 17 | Random Forest Regression (Tuned Hyperparameters) | CART (Default Hyperparameters) | 0.000000e+00 |
You can see from the boxplot of all the models and the table that random forest regression with tuned hyperparameters is the best performing model. The differences between its mean RMSE and the mean RMSE of each of the other models are statistically significant. CART with default hyperparameters seems to perform the worst. In general, the tree-based methods tend to have lower RMSE than the linear models, and tuning hyperparameters lowers RMSE.

Predicted vs Actual Opioid Overdose Mortality Rate

Based on the scatterplots above, we can see that the best performing model, random forest regression with tuned hyperparameters, made conservative predictions to minimize RMSE. However, in terms of using the model, I would actually prefer stochastic gradient boosting from the XGBoost package with tuned hyperparameters. Although the difference in mean RMSE between these two models is statistically significant, it does not seem to be practically significant. The difference in the bootstrap RMSE means is 0.13, which is quite small (9.25 vs 9.37), and in practice it would not matter much. Meanwhile, the predictions from stochastic gradient boosting from the XGBoost package with tuned hyperparameters seem to fit the actual opioid overdose mortality rate better.

Feature Importances

Based on the feature importances, we can see that some of the important features from the top-performing models are asian_or_pacific_islander_%, american_indian_or_alaska_native_%, Opioid Prescription Rate (per 100), and 45_49_years_%. Can we improve our predictions? Perhaps we can by taking a closer look at the county level population distribution.

A Closer Look at the County Level Population Distribution

After examining the county level population estimates, we see the following:
  • Minimum population size for counties with greater than zero mortality rate is 18,646
  • 1,078 counties have zero mortality rate and have population less than 18,646
  • 1,183 counties have zero mortality rate and have population greater than or equal to 18,646
  • 701 counties have greater than zero mortality rate and have population greater than or equal to 18,646
Based on this information, let's remove the 1,078 counties that have a population of less than 18,646 and redo the prediction analysis using the 1,183 counties with zero opioid overdose mortality rate and the 701 counties with opioid overdose mortality rate greater than zero. In total, let's redo the analysis with 1,884 counties instead of 2,962.
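The filtering step and the county counts can be checked with a short sketch. The counts and the 18,646 threshold come from the text, while the example rows and the helper name are hypothetical.

```python
MIN_POPULATION = 18_646  # smallest population among counties with a nonzero rate


def keep_large_counties(counties, min_population=MIN_POPULATION):
    """Keep only counties whose population is at least min_population."""
    return [county for county in counties if county["population"] >= min_population]


# Counts quoted in the text.
small_zero = 1_078      # zero rate, population below the threshold
large_zero = 1_183      # zero rate, population at or above the threshold
large_positive = 701    # nonzero rate, population at or above the threshold

total = small_zero + large_zero + large_positive
remaining = large_zero + large_positive
print(f"All counties: {total}, counties kept: {remaining}")

# Hypothetical rows to show the filter itself.
counties = [
    {"name": "A", "population": 4_200, "rate": 0.0},
    {"name": "B", "population": 18_646, "rate": 11.3},
    {"name": "C", "population": 250_000, "rate": 0.0},
]
kept = keep_large_counties(counties)
print([county["name"] for county in kept])
```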

Boxplot of Bootstrap RMSE Distributions for All Methods (Population >= 18,646)

From the boxplots, we see that the best performing model is stochastic gradient boosting from the sklearn package with tuned hyperparameters. The difference in means between the best performing model and each of the other models is statistically significant. However, in terms of practical significance, there does not seem to be a very big difference among the top performing models.

Predicted vs Actual Opioid Overdose Mortality Rate (Population >= 18,646)

Based on the scatterplots, I would conclude that the fourth best performing model in terms of RMSE, stochastic gradient boosting from the XGBoost package with tuned hyperparameters, seems to make the best predictions.

Feature Importances (Population >= 18,646)

Based on the feature importances, we can see that some of the important features from the top-performing models are american_indian_or_alaska_native_%, asian_or_pacific_islander_%, 45_49_years_%, and Opioid Prescription Rate (per 100). Overall, removing small counties with zero opioid overdose mortality rate does not seem to significantly improve predictions. This may be because we do not have enough data.

Conclusion from Prediction Analysis

After considering the prediction analysis, I recommend XGBoost's stochastic gradient boosting model with tuned hyperparameters as the model to use to make predictions. Even though it was the fourth best performing model, the difference in mean RMSE between this model and the best performing model is not large in practical terms, and it seems to make the best fitting predictions based on the scatterplots above. We have now obtained a tool that we can use to predict a county's opioid overdose mortality rate given 36 variables about the county. This tool is useful for the following reasons:
  1. Once we predict a county's opioid overdose mortality rate, we can focus on the counties that are predicted to have high opioid overdose mortality rates. This will help us allocate resources more efficiently to combat this epidemic.
  2. By focusing on the counties that are predicted to have high opioid overdose mortality rates, we can better investigate the causes of opioid addiction. Once we find a cause, we can combat this epidemic more effectively.
  3. We can predict the change in opioid overdose mortality rate when some of the predictor variable values change.
There is also one finding that is quite surprising. No single predictor variable contributes overwhelmingly to opioid overdose mortality rate. Poverty rate, median household income, and unemployment rate are not the most important predictors of opioid overdose mortality rate. Instead, what we find are variables like the following:
  1. american_indian_or_alaska_native_%
  2. asian_or_pacific_islander_%
  3. 45_49_years_%
  4. Opioid Prescription Rate (per 100)
Opioid Prescription Rate (per 100) seems to make sense in that higher Opioid Prescription Rate (per 100) would lead to higher opioid overdose mortality rate. However, it is surprising to see the other three predictor variables as being the top features in predicting opioid overdose mortality rate.

Further Research

Findings from this data analysis can be used to investigate some causes of opioid addiction and opioid overdose mortality rate. Further investigation should also be done to see why american_indian_or_alaska_native_%, asian_or_pacific_islander_%, and 45_49_years_% play an important role in predicting opioid overdose mortality rate.

References

1. https://en.wikipedia.org/wiki/Opioid_epidemic

2. https://data.library.virginia.edu/getting-started-with-hurdle-models/

3. https://agupubs.onlinelibrary.wiley.com/doi/full/10.1002/wrcr.20308

Links to Code

You can find code and more details regarding this data analysis from the following links:

Special Thanks

  • Special thanks to Tommy Blanchard for mentoring me throughout this project
  • Special thanks to Peter Lee for helping me with the presentation slides
  • Special thanks to Abraham Choe for helping me with the blog layout
