This project aims to classify whether or not a given patient has Parkinson's disease based on variables derived from their speech sounds. We used a KNN classification algorithm with six predictors to create a model with an accuracy of roughly 80%. We concluded that its accuracy is satisfactory for the scope of DSCI 100, given the tools we had, but not accurate enough for commercial use. We discussed how this study may be extended to other degenerative diseases. Finally, we elaborated on how one could make such a model usable by the public by providing a website frontend for collecting new data points to test.
Parkinson’s disease (PD) is a degenerative disorder of the central nervous system (CNS) which severely impacts motor control. Parkinson’s has neither a cure nor a known cause. Speech, however, is widely known to play a significant role in the detection of PD. According to Sakar [1], vocal disorders exist in 90% of patients at an early stage.
[1] Here's the link to the original data set and corresponding paper. Note that the paper used the data set with a similar goal but with very different predictor variables than the ones we selected here.
To what extent can we use elements of speech sounds to predict whether or not a patient has Parkinson’s disease?
Our data set is from UCI’s Machine Learning repository. This data set contains numerous variables derived from speech sounds gathered by Sakar using the Praat acoustic analysis software. Each participant in the study was asked to repeat their pronunciations of the /a/ phoneme three times; each entry is thus associated with an ID parameter ranging from 0 to 251. We chose to focus on the following six predictor parameters:
Speech Sound Component | Variable Name in Data Set | Description |
---|---|---|
Pitch Period Entropy (PPE) | PPE | measures the impaired control of someone’s pitch |
Detrended Fluctuation Analysis (DFA) | DFA | a measure of the ‘noisiness’ of speech waves |
Recurrence Period Density Entropy (RPDE) | RPDE | repetitiveness of speech waves |
Jitter | locPctJitter | a measure of frequency instability |
Shimmer | locShimmer | a measure of amplitude instability |
Fundamental Frequency | meanPeriodPulses | someone’s pitch (calculated from the inverse of the mean period) |
There is a lot of complicated mathematics associated with each of these parameters (Fourier transforms, dense equations, etc.), so the interested reader is encouraged to consult Sakar directly.
Many researchers have attempted to use speech signal processing for Parkinson’s disease classification. Based on Sakar and other widely supported literature, the predictors identified above are the most popular speech features used in PD studies and should prove effective. They are also all related to motor control, which the early stages of Parkinson’s directly impact.
Each participant’s data (i.e., their pronunciation of /a/) was collected three times; to analyze this data set, the three observations were merged by taking the mean of each of our predictor variables.
The data was not completely tidy, but the columns that we used were, so the untidy columns were ignored. Next, to be able to conduct classification, our predictor variables were centered and standardized. Each of our variables had a unique distribution; for example, PPE, DFA, and RPDE were on a scale of 0 to 1, while the pitch variable (mean period) was on the order of 10^-3, so the entire data set was scaled so that we did not have to worry about each variable’s individual distribution.
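To make what this means concrete, the following is a minimal sketch of the transformation itself; in the analysis below we rely on caret's preProcess() rather than standardizing by hand.
# Illustrative only: centering and scaling puts a variable on a mean-0, standard-deviation-1 scale
standardize <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
# Applying standardize() to each predictor column mimics what preProcess(method = c("center", "scale")) does later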
Since our question is a binary classification problem, we used the K-nearest neighbours classification algorithm (KNN). We optimized our choice of k using cross-validation on our training set, and then used the optimal choice of k to create our final model. We then determined its accuracy using the test set.
Because we worked with more than two dimensions, we could not visualize the effectiveness of our KNN model with a scatter plot directly. Instead, we used a line plot showing how the cross-validated accuracy of our model changed with increasing values of k across the folds, which assisted in determining which k to use for our final model. Then, we created a bar plot showing the number of correct predictions, false positives, and false negatives.
# Reinstall packages (if needed)
install.packages("tidyverse")
install.packages("caret")
install.packages("repr")
install.packages("GGally")
install.packages("formula.tools")
# Fix for strange error message
install.packages("e1071")
# Load in the necessary libraries
library(caret)
library(repr)
library(GGally)
library(tidyverse)
library(formula.tools)
# Make all of the following results reproducible, use this value across the analysis
set.seed(28)
To begin, we read in the data from the UCI repository. We did not read it directly from UCI's URL because the file is distributed as a zip archive, which is tedious to handle in R. Instead, we uploaded the file to Google Drive and accessed it from there.
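For completeness, the archive could also have been fetched and unpacked directly in R along the lines sketched below; note that the exact UCI zip URL and the CSV file name inside the archive are assumptions here, not something verified in this analysis.
# Sketch only: download the UCI zip to a temporary file and read the CSV it contains
# (the URL and internal file name below are assumed; adjust to the actual archive contents)
temp_zip <- tempfile(fileext = ".zip")
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/00470/pd_speech_features.zip", temp_zip)
unzip(temp_zip, exdir = tempdir())
pd_data_alt <- read_csv(file.path(tempdir(), "pd_speech_features.csv"), skip = 1)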
# Reads the dataset from the web into R; the Drive URL is self-made
pd_data <- read_csv("https://drive.google.com/uc?export=download&id=1p9JuoTRM_-t56x7gptZU2TRNue8IHFHc", skip = 1)
#narrows down the dataset to the variables that we will use
baseline_data <- pd_data %>%
select(id:meanHarmToNoiseHarmonicity, class)
head(baseline_data)
As was mentioned in the methods section, each participant was represented three times (e.g., see the three rows with id == 0 above). We merged these by taking the mean of each of the predictor columns, after grouping by id.
# Averages the values of each subject's three trials so that each subject is represented by one row
project_data <- baseline_data %>%
group_by(id) %>%
summarize(PPE = mean(PPE),
DFA = mean(DFA),
RPDE = mean(RPDE),
meanPeriodPulses = mean(meanPeriodPulses),
locPctJitter = mean(locPctJitter),
locShimmer = mean(locShimmer),
# meanAutoCorrHarmonicity = mean(meanAutoCorrHarmonicity),--legacy from project proposal
class = mean(class)) %>%
mutate(class = as.factor(class)) %>%
mutate(has_pd = (class == 1))
head(project_data)
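As an aside, the same merge can be written more compactly with dplyr's across(); the sketch below assumes dplyr 1.0 or later and is equivalent to the chunk above.
# Equivalent, more concise version of the merge above (sketch; requires dplyr >= 1.0)
project_data_alt <- baseline_data %>%
  group_by(id) %>%
  summarize(across(c(PPE, DFA, RPDE, meanPeriodPulses, locPctJitter, locShimmer, class), mean)) %>%
  mutate(class = as.factor(class),
         has_pd = (class == 1))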
Below we created the training and test sets using createDataPartition() from the caret package.
# Determines which percentage of rows will be used in the training set and testing set (75%/25% split)
set.seed(28)
training_rows <- project_data %>%
select(has_pd) %>%
unlist() %>%
createDataPartition(p = 0.75, list = FALSE)
# Splits the dataset into a training set and testing set
training_set <- project_data %>% slice(training_rows)
testing_set <- project_data %>% slice(-training_rows)
head(training_set)
As mentioned in the data wrangling section of "Methods," we eventually scaled our data. Scaling and other pre-processing were done in the analysis section.
Here we looked at the training set in more detail, exploring the balance and spread of our selected columns.
# Reports the number of counts per class
class_counts <- training_set %>%
group_by(has_pd) %>%
summarize(n = n())
class_counts
options(repr.plot.width=4,repr.plot.height=4)
class_counts_plot <- ggplot(class_counts, aes(x = has_pd, y = n, fill = has_pd)) +
geom_bar(stat="identity") +
labs(x = 'Has PD?', y = 'Number', fill = "Has PD") +
ggtitle("Balance in PD Data Set") +
theme(text = element_text(size = 18), legend.position = "none")
class_counts_plot
We had many more patients with PD than without in this data set, almost three times as many. Therefore, we could conclude our training set was somewhat imbalanced (in fact, it had the same imbalance as the original data set, since createDataPartition() handles stratification for us); however, the imbalance was not severe enough to warrant upsampling with caret's upSample(). This limitation is further discussed at the end of our analysis.
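Had the imbalance been more severe, one remedy would have been caret's upSample(), which resamples the minority class until the two classes have equal counts; the following is a rough sketch of how it could be applied to our training set, not something we did in this analysis.
# Sketch only: rebalance the training set by upsampling the minority (healthy) class
# upSample() returns a data frame whose class column is named "Class" by default
balanced_training <- upSample(x = training_set %>% select(-has_pd),
                              y = as.factor(training_set$has_pd))
# table(balanced_training$Class) would then show equal counts for TRUE and FALSE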
# Reports the means, maxes, and mins of each predictor variable used
predictor_max <- training_set %>%
select(PPE:locShimmer) %>%
map_df(~ max(., na.rm = TRUE))
predictor_min <- training_set %>%
select(PPE:locShimmer) %>%
map_df(~ min(., na.rm = TRUE))
predictor_mean <- training_set %>%
select(PPE:locShimmer) %>%
map_df(~ mean(., na.rm = TRUE))
stats_merged <- rbind(predictor_max, predictor_min, predictor_mean)
stat <- c('max', 'min', 'mean')
stats_w_names <- data.frame(stat, stats_merged)
predictor_stats <- gather(stats_w_names,
key = variable,
value = value,
PPE:locShimmer)
predictor_stats
# Visualizes and compares the distributions of each of the predictor variables
plot_pairs <- training_set %>%
select(PPE:locShimmer) %>%
ggpairs(title = "PD speech predictor variable correlations")
# plot_pairs
plot_pairs_by_class <- training_set %>%
ggpairs(.,
legend = 9,
columns = 2:8,
mapping = ggplot2::aes(colour=has_pd),
lower = list(continuous = wrap("smooth", alpha = 0.3, size=0.1)),
title = "PD speech predictor variable correlations by class") +
theme(legend.position = "bottom")
# plot_pairs_by_class
The following two plots were created using the ggpairs() function from the GGally package. The first, without colour, strictly provides detail about the distribution of and correlation between each pair of our six predictor variables. Three of our predictors (DFA, RPDE, and meanPeriodPulses) take on a much wider range of values than PPE, jitter, and shimmer. Many of our variables exhibit somewhat positive correlations in the scatterplots, though some have an entirely fuzzy distribution; for example, compare the plots in the PPE column to those in the RPDE column. This likely comes as a result of the spread of the predictors.
options(repr.plot.width=10,repr.plot.height=10, repr.plot.pointsize=20)
plot_pairs
With this understanding, we used a second plot, grouped and coloured by has_pd, to assist in anticipating what the impact of these predictors would be. We noted that for every individual distribution there is a marked difference between the red and blue groupings, which boded well for our analysis. On average, the healthy patients (i.e., has_pd == FALSE) fell on the lower end of the spectrum for our predictors, apart from PPE, where healthy patients exhibited higher values on average. Though we were not able to visualize our final model directly (as it lives in six dimensions), we predicted from these plots that new patients who fell on the lower end for most of these variables would be healthy. This also made intuitive sense; Parkinson’s is a degenerative disease that impairs motor control, so unhealthy patients would likely experience more rapid change in various speech variables due to tremors.
options(repr.plot.width=10,repr.plot.height=10, repr.plot.pointsize=20)
plot_pairs_by_class
From our analysis, we expected to find that the six variables of speech we identified could form an effective model for determining whether or not a patient has Parkinson’s. We anticipated our findings would allow us to make reasonable predictions of whether or not a new patient has PD given their speech data.
# Scale the data set (pre-processing)
scale_transformer <- preProcess(training_set, method = c("center", "scale"))
training_set <- predict(scale_transformer, training_set)
testing_set <- predict(scale_transformer, testing_set)
# head(training_set)
# head(testing_set)
# Separates the predictors (X) from the class labels (Y) for the training and testing sets
X_train <- training_set %>%
select(-class, -has_pd) %>%
data.frame()
X_test <- testing_set %>%
select(-class, -has_pd) %>%
data.frame()
Y_train <- training_set %>%
select(class) %>%
unlist()
Y_test <- testing_set %>%
select(class) %>%
unlist()
# head(X_train)
# head(Y_train)
# Tunes k with 10-fold cross-validation over odd values from 1 to 51
train_control <- trainControl(method = "cv", number = 10)
k <- data.frame(k = seq(from = 1, to = 51, by = 2))
knn_model_cv_10fold <- train(x = X_train, y = Y_train, method = "knn", tuneGrid = k, trControl = train_control)
# knn_model_cv_10fold
accuracies <- knn_model_cv_10fold$results
head(accuracies)
options(repr.plot.height = 5, repr.plot.width = 5)
accuracy_vs_k <- ggplot(accuracies, aes(x = k, y = Accuracy)) +
geom_point() +
geom_line() +
ggtitle("Graphing Accuracy Against K") +
labs(subtitle = "Initial attempt—all predictors used")
accuracy_vs_k
# Selects the value of k with the highest cross-validated accuracy
k <- accuracies %>%
arrange(desc(Accuracy)) %>%
head(n = 1) %>%
select(k)
k
It looks like a choice of k = 5 here yields the highest accuracy value. After a k value of approximately 40, our accuracy plateaus at just above 0.75. We will now retrain our model using this choice of k.
# Retrains the final model using the best k found above
k <- data.frame(k = 5)
model_knn <- train(x = X_train, y = Y_train, method = "knn", tuneGrid = k)
# model_knn
# Predicts on the held-out test set and assesses model quality with a confusion matrix
Y_test_predicted <- predict(object = model_knn, X_test)
# head(Y_test_predicted)
model_quality <- confusionMatrix(data = Y_test_predicted, reference = Y_test)
model_quality
Our final model had an accuracy of 79.4%, which is pretty good! Given that our model is in six dimensions, there is no simple visualization for it. However, we can visualize whether our model had more false positives or false negatives, and which group it was better at predicting: sick or healthy. The confusion matrix gives us all of these values in a 2x2 grid.
# Create a dataset
matrix_table <- as.data.frame(model_quality$table) %>%
mutate(isCorrect = (Prediction == Reference))
# matrix_table
matrix_plot <- ggplot(matrix_table, aes(fill = isCorrect, x = Reference, y = Freq)) +
geom_bar(position="stack", stat="identity", width = 0.7) +
labs(title = "Confusion Matrix Bar Chart", y = "Frequency", x = "Has PD?") +
scale_fill_discrete(name = "Correct Prediction?")
matrix_plot
We noted from the above bar chart that our model was fairly accurate at predicting that a patient did have Parkinson's, but was quite inaccurate when working with healthy patients. This was likely a result of our imbalanced training set, though we still ended up with roughly 80% accuracy overall.
Being able to predict Parkinson’s accurately has significant impact alone; doctors often struggle to make accurate, timely predictions as Parkinson’s is a long-term degenerative disease. Additionally, there are no specific tests for determining whether or not someone has Parkinson’s. Currently a neurologist must use a combination of other factors like medical history, signs and symptoms, and a physical examination to make a prediction. Speech sounds could be another cheap, fast tool for a neurologist to employ to help them make a more accurate prediction.
From this analysis, we could see ourselves moving to three questions for further study:
Beyond these further questions, it would also be interesting to see this model used in a practical setting. For example, it could be used by doctors in combination with a patient's other symptoms to make a more accurate prediction. In fact, Google has a similar system for breast cancer screening currently under international review. Given that countless electronic devices we interact with on a day-to-day basis possess a microphone, this could even be done without a doctor's help. The only catch is that the consumer-facing product would need to give ample warnings about potential false negatives and false positives, and would additionally need to be robust enough to extract the relevant variables from a speech sound, as Praat did in producing the data set used here.
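To illustrate how such a tool would apply our pipeline, the sketch below classifies a single hypothetical new patient; the measurement values are invented for illustration and would, in practice, come from acoustic analysis software such as Praat.
# Sketch only: classify one hypothetical new patient with the trained model
# (measurement values are illustrative; class and has_pd are placeholders so that the
# columns match training_set and the same scale_transformer can be applied)
new_patient <- data.frame(id = 999,
                          PPE = 0.75, DFA = 0.70, RPDE = 0.45,
                          meanPeriodPulses = 0.007,
                          locPctJitter = 0.003, locShimmer = 0.08,
                          class = factor(0), has_pd = FALSE)
new_patient_scaled <- predict(scale_transformer, new_patient)
predict(model_knn, new_patient_scaled %>% select(-class, -has_pd))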
Sakar, C. O., Serbes, G., Gunduz, A., Tunc, H. C., Nizam, H., Sakar, B. E., & Apaydin, H. (2019). A comparative analysis of speech signal processing algorithms for Parkinson's disease classification and the use of the tunable Q-factor wavelet transform. Applied Soft Computing Journal, 74, 255. doi:10.1016/j.asoc.2018.10.022