Classification of Neurodegenerative Diseases using Machine Learning Methods

Abstract: In this study, neurodegenerative diseases (Amyotrophic Lateral Sclerosis, Huntington’s disease, and Parkinson’s disease) were diagnosed and classified using gait force signals. Five machine learning algorithms (Averaged 2-Dependence Estimators (A2DE), K star (K*), Multilayer Perceptron (MLP), Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples (DECORATE), and Random Forest) were compared using 10-fold cross validation. The K* classifier gave the best results among these algorithms. In the four-class (quad) classification, the K* classifier reached a best accuracy of 99.17% using 19 selected features. Using the first three and the first five principal components derived from these 19 features, the best classification accuracies of the K* classifier were 95.44% and 96.68%, respectively.


Introduction
Neurodegeneration is a general term for the loss of structure and function of neurons, including their death. Many neurodegenerative diseases, such as Amyotrophic Lateral Sclerosis (ALS), Parkinson's Disease (PD) and Huntington's Disease (HD), occur as a consequence of neurodegenerative processes. Some symptoms of ALS, HD and PD are shared, and some of these shared symptoms are physical [1]. One of the activities in which physical disorders appear is walking [2]. Early detection of a neurodegenerative disease is crucial to prevent the progression of the disease. Furthermore, in certain neurodegenerative diseases, early diagnosis changes the course of the treatment [3][4][5][6]. Many published articles address the diagnosis and classification of neurodegenerative diseases; they are summarized briefly here. Banaie et al. (2011) developed a new automatic approach to classify patients who are able to walk, using features extracted from gait signals. Four groups were studied: people suffering from ALS, HD and PD, and Control Objects (CO). The classification algorithm was a quadratic Bayesian classifier [7]. Daliri (2012) proposed an approach for diagnosing neurodegenerative diseases (ALS, HD and PD) based on gait dynamics. A Support Vector Machine (SVM) was used for diagnosis with different kernels, and the best performance was obtained with the radial basis function kernel [8]. Lee and Lim (2012) performed wavelet-based feature extraction on gait characteristics in order to classify Parkinson's Disease. Classification was performed with the extracted features and a neural network with weighted fuzzy membership functions (NEWFM) [9]. Drotár et al.
(2014) argue that dysgraphia, one of the first motor symptoms observed in PD, can help distinguish such patients from healthy control objects, using kinematic changes in handwriting to diagnose PD. They first extract handwriting features and form datasets, and then use SVM to distinguish the classes [10]. Akdemir et al. (2014) rely on brain 18-Fluorodeoxyglucose Positron Emission Tomography (18F-FDG PET) for the differential diagnosis of several parkinsonisms (PD, Multiple System Atrophy (MSA), Progressive Supranuclear Palsy (PSP), corticobasal degeneration (CBD) and dementia with Lewy bodies (DLB)) and control objects. The images were analyzed visually and with NeuroQ software [11]. Lee et al. (2014) analyze videos showing gait and posture characteristics of patients with PD to differentiate them from patients with PSP and MSA [12]. Pilleri et al. (2014) use the heart rate and circadian rhythm of patients with PD and MSA for differential diagnosis, and then apply the t-test and ANCOVA to compare the two groups [13]. Navarro-Otano et al. (2014) study 123I-metaiodobenzylguanidine (123I-MIBG), heart rate, odor recognition, and [123I]FP-CIT (DaTscan) SPECT to differentiate PD from Vascular Parkinsonism (VP). The resulting data are then evaluated statistically [14]. Salvatore et al. (2014) rely on T1-weighted MRI to differentiate PD from PSP, using Principal Component Analysis (PCA) for feature extraction and SVM for classification of the diseases [15]. Feng et al. (2014) use 3.0T MRI data for a differential diagnosis of PD and MSA, and statistical tests for comparison [16]. Baudrexel et al. (2014) compare 18F-FDG PET and Diffusion Weighted Imaging (DWI) for a differential diagnosis of PD, MSA and PSP, and then use ANOVA and ROC analysis for classification of the diseases [17]. Huertas-Fernández et al. (2015) use the [123I]FP-CIT (DaTscan) SPECT method for differential diagnosis between PD and VP, and then use Logistic Regression, Linear Discriminant Analysis and SVM for classification of the diseases [18]. 
Zanigni et al. (2015) rely on MR Proton Spectroscopy (1H MRS) for differential diagnosis between PD and other parkinsonian syndromes (MSA-C, MSA-P and Richardson syndrome (PSP-RS)). Unlike other studies, they focus on the cerebellum, and they analyze the resulting data statistically [19]. Bradvica et al. (2015) rely on Transcranial Sonography and an odor test for differential diagnosis between PD and Essential Tremor (ET); they also compare the results of Transcranial Sonography and the Dopamine Transporter Scan (DaTSCAN) and find a high degree of agreement between them [20]. Xia et al. (2015) developed a machine learning based system to classify neurodegenerative diseases (ALS, HD and PD) and healthy subjects. The algorithms used in the classification were SVM, Random Forest, Multilayer Perceptron (MLP) and k-Nearest Neighbors (kNN) [21]. Vranová et al. (2016) rely on the Clusterin protein level in cerebrospinal fluid for differential diagnosis between neurodegenerative diseases (PD, DLB, AD, MSA, PSP), and use statistical analysis to compare the data [22]. Drotár et al. (2016) rely on kinematic and pressure characteristics of handwriting for differential diagnosis between patients with PD and control objects, and then use kNN, AdaBoost and SVM to classify the two groups [23]. Existing studies on the diagnosis of neurodegenerative diseases either focus on particular diseases (especially PD versus the CO group) or analyze frequent diseases such as ALS, HD and PD one by one against the control group. In this study, by contrast, a quad classification was established by using the data of all disease groups and the control group together. Moreover, the accuracy obtained in this study is higher than in the other studies.

Material and Methods
In this study, statistical features were derived from gait force signals, and the dataset was formed from them. The most informative of these statistical features were then chosen using several feature selection methods: InfoGainAttributeEval, ChiSquaredAttributeEval, ConsistencySubsetEval, GainRatioAttributeEval, ReliefFAttributeEval, SVMAttributeEval and SymmetricalUncertAttributeEval. Feature selection was performed by averaging the results of these methods. The chosen features were examined experimentally to determine whether they increase the performance of the machine learning algorithms. In this way, the feature vector that gives the most information was formed. Since the dimension of this vector was large, it was reduced using Principal Component Analysis, which completed the feature extraction. Finally, the best classifier was chosen using the obtained features. The flow chart of this process is shown in Figure 1.
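The averaging step can be illustrated with a minimal sketch. It assumes that each method yields one score per feature and that the per-method ranks (rather than the raw scores) are averaged; the text does not state which was done. The function name `average_ranks` and all score values are hypothetical and serve only as an illustration.

```python
from statistics import mean


def average_ranks(scores_by_method):
    """Average the per-method ranks of each feature (rank 1 = most informative)."""
    features = sorted(next(iter(scores_by_method.values())))
    ranks = {feature: [] for feature in features}
    for scores in scores_by_method.values():
        # Higher score means more informative, so sort in descending order.
        ordered = sorted(features, key=lambda f: scores[f], reverse=True)
        for position, feature in enumerate(ordered, start=1):
            ranks[feature].append(position)
    return {feature: mean(values) for feature, values in ranks.items()}


# Hypothetical scores for three features from three of the listed methods.
scores_by_method = {
    "InfoGainAttributeEval": {"F1": 0.40, "F5": 0.05, "F11": 0.90},
    "GainRatioAttributeEval": {"F1": 0.35, "F5": 0.10, "F11": 0.80},
    "ReliefFAttributeEval": {"F1": 0.20, "F5": 0.02, "F11": 0.15},
}

avg = average_ranks(scores_by_method)
ranking = sorted(avg, key=avg.get)  # best (lowest average rank) first
print(ranking)
```

Averaging ranks instead of raw scores keeps methods with different score scales comparable, which is why this variant was chosen for the sketch.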

Gait Dynamics Signals
The gait analysis (GA) used in this study is a kinetic analysis, which is one of the measures of 3D GA. The only quantity that can be measured in kinetic analysis is the ground reaction force vector (GRFV). The GRFV is measured with plates that gauge the total impulse of the foot on the ground. The data used in the study were collected from the right and left feet of Hausdorff's 64 subjects (15 PD patients, 20 HD patients, 13 ALS patients and 16 healthy people) and were taken from the PhysioNet database [24]. The gait signals were sampled at 300 Hz, and there are five minutes of data for each subject. The mechanism used to obtain the data was developed by Hausdorff et al. [25]. Hausdorff examined ALS and its effects on gait rhythm, based on measurements of the duration and length of strides in normal individuals and ALS patients [26]. In another study by Hausdorff et al., the duration of the double step was examined in relation to age and Huntington's disease [27]. In the present study, the 5-minute force signals taken from each subject (right and left foot) were split into 1-minute segments. Thus, 640 new signals (320 for the right foot and 320 for the left foot) were created. The force signals of the left and right foot are shown in Figure 2 and Figure 3, respectively.

Figure 2.
The left foot force-sensitive resistor data for one ALS patient (age: 68, sex: male), one PD patient (age: 77, sex: male), one HD patient (age: 42, sex: male) and one CO subject (age: 57, sex: female). The x-axis label marks the sampling timestamp of the gait signal, and the y-axis label is its amplitude.

Figure 3.
The right foot force-sensitive resistor data for one ALS patient (age: 68, sex: male), one PD patient (age: 77, sex: male), one HD patient (age: 42, sex: male) and one CO subject (age: 57, sex: female). The x-axis label marks the sampling timestamp of the gait signal, and the y-axis label is its amplitude.
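The segmentation arithmetic described in this section can be checked with a short sketch. The sampling rate (300 Hz), recording length (5 minutes), segment length (1 minute) and subject count (64) come from the text; the function name and the all-zero placeholder recording are assumptions for illustration only.

```python
SAMPLING_RATE_HZ = 300   # from the text
SEGMENT_SECONDS = 60     # 1-minute segments
RECORDING_SECONDS = 300  # 5 minutes per subject
N_SUBJECTS = 64


def split_into_segments(signal, rate_hz=SAMPLING_RATE_HZ, seconds=SEGMENT_SECONDS):
    """Split a signal into consecutive, non-overlapping fixed-length segments."""
    size = rate_hz * seconds
    n_full = len(signal) // size  # an incomplete trailing segment is dropped
    return [signal[i * size:(i + 1) * size] for i in range(n_full)]


# Placeholder recording for one foot of one subject (real data: PhysioNet).
recording = [0.0] * (SAMPLING_RATE_HZ * RECORDING_SECONDS)
segments = split_into_segments(recording)

segments_per_foot = N_SUBJECTS * len(segments)
print(len(segments), len(segments[0]), segments_per_foot)
```

Each 1-minute segment contains 18,000 samples, and 64 subjects with 5 segments each give the 320 signals per foot stated above.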

Statistical Features and Preparation of the Datasets
Statistical parameters were computed from the 320 gait signals (force signals), and each parameter was matched to one feature so that the causes of the classification results could be interpreted better.
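As an illustration of how such statistical parameters can be obtained from one signal segment, the following sketch computes a few common descriptors. Only the range and the skewness are named elsewhere in this paper; the remaining descriptors, the function name `describe` and the use of the population (biased) skewness formula are assumptions, not the authors' exact feature list.

```python
from statistics import fmean, pstdev


def describe(segment):
    """Return a few descriptive statistics of a one-dimensional signal segment."""
    n = len(segment)
    mu = fmean(segment)
    sigma = pstdev(segment, mu)  # population standard deviation
    third_moment = sum((x - mu) ** 3 for x in segment) / n
    skewness = third_moment / sigma ** 3 if sigma > 0 else 0.0
    return {
        "mean": mu,
        "std": sigma,
        "min": min(segment),
        "max": max(segment),
        "range": max(segment) - min(segment),
        "skewness": skewness,
    }


# Toy segment with one large value, which produces positive skewness.
features = describe([1.0, 2.0, 3.0, 4.0, 10.0])
print(features)
```

Keeping one named statistic per feature, as in the dictionary above, is what makes it possible to trace a classification result back to a specific property of the signal.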

Kstar (K*) Classifier
The algorithm that gave the best results in the classification problems of this study is the K* algorithm. K* is a simple instance-based classifier that resembles the kNN algorithm [28]. It uses an entropic measure based on the probability of transforming one instance into another by randomly choosing among all possible transformations. In effect, the complexity of transforming one instance into another is the distance between the instances. Using an entropic distance has several benefits, including the consistent handling of missing values and of real-valued attributes. Let I be a possibly infinite set of instances and T a finite set of transformations on I. Each t ∈ T maps instances to instances: t: I → I. The K* function is calculated as shown in (Eq. 3).
$$K^*(y \mid x) = -\log_2 P^*(y \mid x) \qquad (3)$$
In this statement, P* is the probability of all transformational paths from x to y. It is shown in (Eq. 4).
$$P^*(y \mid x) = \sum_{\bar{t} \,:\, \bar{t}(x) = y} p(\bar{t}) \qquad (4)$$
Here the sum runs over all terminated sequences of transformations \bar{t} that map x to y, and p(\bar{t}) is the probability of the sequence.
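The idea behind the entropic distance can be sketched for real-valued attributes. This is a simplified illustration, not the K* implementation used in the experiments: it assumes the exponential transformation density for real attributes with one fixed, hand-chosen scale per attribute, whereas the actual algorithm derives the scale from the blend parameter. All names and data points below are hypothetical.

```python
import math


def p_star(a, b, scale):
    """Simplified transformation density from instance a to instance b."""
    density = 1.0
    for a_j, b_j, x0 in zip(a, b, scale):
        # Exponential density in the attribute difference, with scale x0.
        density *= math.exp(-abs(a_j - b_j) / x0) / (2.0 * x0)
    return density


def k_star_distance(a, b, scale):
    """Entropic distance: negative base-2 logarithm of the density."""
    return -math.log2(p_star(a, b, scale))


def classify(query, training, scale):
    """Sum the densities per class and return the class with the largest sum."""
    totals = {}
    for instance, label in training:
        totals[label] = totals.get(label, 0.0) + p_star(query, instance, scale)
    return max(totals, key=totals.get)


# Hypothetical two-feature training set with two of the four classes.
training = [
    ((0.0, 0.0), "CO"), ((0.2, 0.1), "CO"),
    ((3.0, 3.0), "PD"), ((3.2, 2.9), "PD"),
]
scale = (1.0, 1.0)  # assumed fixed scale per attribute
print(classify((0.1, 0.1), training, scale))
print(classify((3.1, 3.0), training, scale))
```

Because every training instance contributes to the class sums, weighted by how easily the query can be transformed into it, the scale controls how local the resulting model is.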

Methodology Statistical Criteria used for the Analysis of Experiments
Some criteria were required in order to decide how successful a classifier is at the end of the learning process. The criteria are as follows. The Kappa statistic was used to measure the agreement between the predicted and the observed classifications in a dataset [29]. The Kappa value is given in (Eq. 5).
$$\kappa = \frac{P(a) - P(e)}{1 - P(e)} \qquad (5)$$
P(a) is the accuracy of the classifier and P(e) is the expected accuracy of a classifier that makes random predictions on the same dataset. In other words, P(e) is the probability that the prediction agrees with the observation by chance.
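A minimal sketch of the Kappa computation from a confusion matrix follows, using the standard form kappa = (P(a) - P(e)) / (1 - P(e)), where P(e) is obtained from the row and column totals. The confusion matrix is hypothetical and is not taken from the experiments.

```python
def cohen_kappa(confusion):
    """Kappa from a square confusion matrix (rows: observed, columns: predicted)."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    # P(a): observed agreement, i.e. the accuracy.
    p_a = sum(confusion[i][i] for i in range(n_classes)) / total
    # P(e): agreement expected by chance from the marginal totals.
    p_e = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(n_classes)
    ) / total ** 2
    return (p_a - p_e) / (1 - p_e)


# Hypothetical two-class confusion matrix with 100 instances.
confusion = [
    [45, 5],
    [10, 40],
]
kappa = cohen_kappa(confusion)
print(round(kappa, 4))
```

In this example the accuracy is 0.85 and the chance agreement is 0.50, so Kappa is 0.70, which shows how Kappa discounts the part of the accuracy that random prediction would also achieve.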
F-Measure is a particularly important criterion when preparing the training data in order to increase the performance of the classifier. Accordingly, this study aims to obtain a happy graph [30] from the learning curve drawn between F-Measure and data size. The bias-variance tradeoff plays a key role in understanding machine learning algorithms, and its use in experimental studies has gradually increased in recent years [29]. The terms bias and variance help explain how simple predictors can be superior to complicated ones and how groups of models can be superior to single models [31]. A validation methodology was applied in the experiments in order to measure the performance of the learning system correctly, in other words, to make the experimental results reliable. This method is known as k-fold cross validation (k-fold CV). K-fold CV was used to determine whether the training data are sufficient for a classifier to learn. It divides the data into k parts of approximately equal size in order to estimate a classifier's accuracy. The value k = 10 was used in this study because extensive tests on various training and test sets with different techniques have shown that about 10 folds give the most accurate estimate of the error rate. Theoretical results also support this choice [29]. Leave-one-out cross validation (LOOCV), which is a special case of k-fold CV, was also used at times while testing the performance of the machine learning classifiers. This method was used particularly to reveal how different from one another the models created by the classifiers were.
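The partitioning performed by k-fold CV can be sketched as follows. This is a plain, non-stratified split written for illustration; the toolkit used in the experiments may stratify the folds by class, and the function name and seed are assumptions. The sample count of 241 matches the training set size reported later in the paper.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k folds of nearly equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # sizes differ by at most one
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(241, k=10))
fold_sizes = sorted(len(test) for _, test in splits)
print(fold_sizes)

# LOOCV is the special case in which k equals the number of samples.
loocv = list(k_fold_indices(5, k=5))
```

Every instance appears in exactly one test fold, so each of the 241 instances is used once for testing and nine times for training.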

Results and Discussion
The experiments began by choosing the best of four datasets built from the raw data. Once this dataset was chosen, all subsequent experiments used it. The classifier results for the dataset selection process are shown in Table 1. In order to evaluate the efficiency of the classifiers used in our experiments, we calculated two baselines: a majority-based predictor and a class-distribution-based random predictor. Majority-based classification (ZeroR classifier) assigns the most frequent class to each instance, which was the HD class in our dataset. Class-distribution-based random classification assigns classes to the instances at random according to the class distribution. The baseline classifiers served as a reference for the other classifiers, which were expected to perform better than the baselines. All five chosen classifiers did so. According to the results in Table 1, apart from the baseline classifiers, the classifiers performed better on Dataset 4 than on the other datasets. After Dataset 4 was chosen as the training data, the next step was to select the classifier with the highest performance. The parameters of the best-performing classifiers, K*, DECORATE and Random Forest, were varied in order to inspect the accuracy, precision, recall, F-measure and kappa values. In light of these results, classifier K* gave better results than the other classifiers. The results are shown in Table 2. With Dataset 4 chosen as the training data and K* chosen as the classifier, the following step was feature selection. Features that were not needed were eliminated to increase the accuracy of the classifier and to decrease the required time. 
The results were calculated by averaging the seven methods used in feature selection. These results express the effect of each feature on classification accuracy. According to the data, the F11 feature (the range between the maximum and minimum of the left foot signal) provided the highest amount of information. The F5 feature (the skewness value) provided the lowest.
After the ranks of the features were acquired, new feature datasets were created by starting with the most informative feature and adding the second, the third and then successively less informative features to it. In total, 26 feature datasets were created.
Figure 4.
According to all these results, the most suitable GlobalBlend value was 19. The bias-variance graph is shown in Figure 5. This graph shows the origin of a classifier's error, that is, whether it is bias based or variance based. If the error is bias based, the convergence of the RMSECV and RMSEtraining values stops at a certain level as the amount of data increases. In other words, the error cannot be corrected because of bias even with the largest amount of data, and a significant gap remains between the RMSECV and RMSEtraining values. If the error is variance based, the situation is the opposite: as the amount of data increases, the RMSECV and RMSEtraining values move closer and the gap between the two lines closes. However, increasing the amount of data does not always decrease the errors caused by variance. Figure 5 shows that the gap between the two values progressively decreases as the amount of data increases. In addition, regardless of the amount of data, RMSEtraining was always zero, whereas RMSECV progressively decreased and finally reached 0.07. In conclusion, the complexity of the local model created by classifier K* matches the complexity of the data. The happy graph for classifier K* is shown in Figure 6. Figure 6 shows that the F-measure is 0.992 when the amount of data is 241, 242 and 243, and 0.991 when the amount of data is 320. 
According to this information, the amount of data needed for classifier training can be limited to 241. With 241 instances, classifier K* achieved Accuracy = 99.17% and RMSE = 0.05. The classification results of classifier K* are shown in Table 3. According to the results, 99.1701% of the samples were classified correctly: 239 samples were classified correctly and only 2 incorrectly. The Kappa value of 0.9888 indicates that the classifier learned almost perfectly. In addition, the four error measurement statistics all showed very low error values. The detailed accuracy table of classifier K* is shown in Table 4. Values higher than 0.5 for precision, recall, F-measure and Receiver Operating Characteristic (ROC) Area are the desired results, and all of these values are either 1 or very close to 1 in Table 4. According to this table, classifier K* therefore created a local model that learns all the classes (ALS, CO, HD, PD) excellently, and it is possible to claim that the model can generalize across all classes.
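The reported percentage follows directly from the counts stated above (239 correct and 2 incorrect out of 241 samples), as the following worked arithmetic shows:

```latex
\[
\text{Accuracy} = \frac{239}{239 + 2} = \frac{239}{241} \approx 0.991701 = 99.1701\%,
\qquad
\text{Error rate} = \frac{2}{241} \approx 0.0083 = 0.83\%.
\]
```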
The percent variability explained by each principal component is shown in Figure 7. The total number of principal components was 19. The first three components alone explain 95.967% of the variability; hence, the others are not shown in Figure 7. The first three principal components, the feature vectors and the data are shown in Figure 8. Additionally, the contribution of each of the 19 features to the three principal components is shown in Figure 8 as a vector with a length and a direction.
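How the number of retained components relates to the explained variability can be sketched as follows. The explained share of a component is its eigenvalue divided by the sum of all eigenvalues of the covariance matrix. The eigenvalues and the 90% threshold below are hypothetical and are not the values behind Figure 7; the function name is likewise an assumption.

```python
def components_needed(eigenvalues, threshold):
    """Return the number of leading components whose cumulative explained
    variance reaches the threshold, together with that cumulative share."""
    total = sum(eigenvalues)
    cumulative = 0.0
    ordered = sorted(eigenvalues, reverse=True)
    for count, value in enumerate(ordered, start=1):
        cumulative += value / total
        if cumulative >= threshold:
            return count, cumulative
    return len(ordered), cumulative


# Hypothetical eigenvalues of a covariance matrix with five components.
eigenvalues = [6.0, 2.5, 1.0, 0.3, 0.2]
count, explained = components_needed(eigenvalues, threshold=0.90)
print(count, round(explained * 100, 3))
```

With these toy values, the first three components explain 95% of the variability, which mirrors the situation reported above in which a few leading components carry almost all of the information.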

Conclusion
In this study, gait signals were used to classify neurodegenerative diseases and control objects. Thirteen statistical features were created from these signals for each of the left and right feet. Four datasets were built from the constructed features. The dataset that combined left and right foot data was learned better than the others by all of the classifiers. After the dataset was chosen, K* was identified as the most successful classifier: it achieved better performance than A2DE, MLP, DECORATE and Random Forest. Next, the features with the highest informative value were selected. At this step, 19 out of 26 features were selected (F1, F2, F3, F4, F7, F8, F9, F10, F11, F13, F14, F15, F16, F17, F20, F21, F23, F24, F26). After feature selection, the parameters that gave the best performance for classifier K* were chosen, and the GlobalBlend parameter was set to 19. The classification results of K* were then compared while the value of k was changed during the CV process, and its performance did not change. It is therefore possible to state that classifier K* is stable. After that step, the size of the training set for classifier K*, which initially contained 320 instances, was adjusted. A happy graph was created and the number of instances required for training was found to be 241. Consequently, during the 10-fold CV process, classifier K* achieved Accuracy = 99.1701% and RMSE = 0.0507 when the GlobalBlend parameter was 19 and a training dataset of 241 instances with 19 features was used. The 19 features obtained by the feature selection process were used to create new features with the PCA method. 
These features were created from the first three and the first five principal components. During the 10-fold CV process with the first three principal components, classifier K* achieved Accuracy = 95.4357% and RMSE = 0.1555 when the GlobalBlend parameter was 14 and a training dataset of 241 instances was used. With the first five principal components, classifier K* achieved Accuracy = 96.6805% and RMSE = 0.1229 when the GlobalBlend parameter was 20 and a training dataset of 241 instances was used.