Sleep stage classification via ensemble and conventional machine learning methods using single channel EEG signals

Sleep stages play an important role in the diagnosis of sleep disorders and sleep-related illnesses. Accurate identification of the sleep stages is therefore a necessity for more robust and efficient diagnosis systems. Several traditional machine-learning and pattern-recognition algorithms are deployed in modern computer-aided diagnosis systems. However, current results are not as satisfactory as expected. In the last two decades, a new concept known as ‘ensemble learning’ has emerged and has attracted the attention of many researchers from various disciplines. In this study, several ensemble-learning methods are applied to EEG signals and examined for sleep-stage classification. Conventional machine-learning methods are evaluated in the same testing phase to report comparative results. Additionally, the methods are evaluated in two different scenarios: subject specific and subject independent. The study shows that the bagging-based combinations of DTs and of SVMs surpass all of the conventional methods used in the experiments. Moreover, the test trials reveal that both conventional and ensemble models need to be improved for the subject independent scenario, which is the more essential case in the development of patient-independent computer-based diagnosis systems.


Introduction
Nowadays, innovative research is being carried out to develop new methods for the identification and treatment of sleep disorders such as narcolepsy, idiopathic hypersomnia and sleep apnea. Sleep-related problems adversely affect the physical and social quality of life of a person [1]. Besides sleep disorders, sleep-related illnesses, including diabetes, cardiovascular diseases, and obesity, are another focus of research [2, 3, 4]. For this reason, the accurate identification of sleep stages is an important subject in computer-aided diagnosis that may lead to more precise diagnoses. A handbook on the determination and scoring of human sleep stages has been published by twelve researchers, under the editorship of Rechtschaffen and Kales [5, 6]. According to this manual, the sleep of a healthy person can be divided into two main stages: rapid eye movement (REM) and non-rapid eye movement (NREM). The NREM stage consists of four sub-periods (NREM I, NREM II, NREM III, and NREM IV) that have discriminative amplitudes at certain frequencies. All stages are defined according to the polysomnography (PSG) results of the patients. According to the sleep staging method developed by the American Academy of Sleep Medicine (AASM), the NREM III and IV stages are merged into a single stage, known as slow wave sleep (SWS) or deep sleep [7]. PSG is the "gold standard" method in clinical diagnosis, the sleep medicine industry, and sleep-stage classification studies. PSG contains crucial physiological signals, including electroencephalography (EEG), electrooculography (EOG), electromyography (EMG), pulse oximetry (SpO2), and electrocardiography (ECG). Analysis of PSG requires the participation of an expert in a specialized sleep centre during recording, which is a relatively expensive and time-consuming procedure for both the patients and the sleep experts.
Hence, automatic sleep-staging has become an important challenge for researchers in different disciplines [8, 9, 10]. In the literature, several methods have been studied for the classification of sleep stages. Frequency-domain analysis methods [11, 12, 13], the wavelet transform [10, 14] and fuzzy logic [15] are examples, with agreement rates ranging from 60% to 80%. Virkkala et al. classified the sleep stages using only EOG signals with an agreement of 72% [16]. Mendez et al. utilized a Hidden Markov Model (HMM) with spectral features of heart rate variability to classify NREM and REM, and measured a classification accuracy of around 80% on both training and test sets [17]. Liang et al. presented a rule-based sleep-stage classification method using features from temporal and spectral analyses of the EEG, EOG, and EMG signals with an agreement rate of 86.68% [18]. Different types of feature selection methods, including multiple iterative, suitable linear and nonlinear methods, were proposed for the classification of sleep stages by Zoubek et al. [19], and accuracies for wakefulness (W), NREM I, II, SWS, and REM were obtained as 84.57%, 64.56%, 85.55%, 92.90% and 72.81%, respectively. In [20], the energy features of single-channel EEG signals were utilized for the classification of sleep stages using neural classifiers; EEG epochs were classified as wakefulness, NREM I, II, SWS or REM, and the overall accuracy was 81.8%. In another study, Koley and Dey [21] applied a Support Vector Machine (SVM) based ensemble method to their data set to classify it into five stages, similarly to [20] but with different feature extraction methods. Furthermore, an SVM-based recursive feature elimination algorithm was applied to 39 extracted features in order to enhance their result. The study reports that 85% and 87% agreement were obtained with the training and independent testing data sets, respectively.
The goal of this study is to evaluate and compare the latest and the conventional learning methods for sleep-stage classification. The best classifier for an automatic sleep-staging system based on EEG signals may be assessed according to the results of this study. For this purpose, several well-known conventional methods (i.e., Support Vector Machines (SVMs), Naive Bayes (NB), Linear Discriminant Classifier (LDC), K-Nearest Neighbour (KNN) and Decision Tree (DT)) and some of their combinations in ensemble learning (Bagging and Adaboost) are selected as classifiers. The aim is to demonstrate the effect of ensemble combinations on the results with comparative tables. This paper consists of six sections. The details of the Sleep-EDF database are presented in Section 2. Signal pre-processing, feature extraction, and classification methods are described in Section 3. Performance metrics and the results are presented in Section 4. The results are discussed in Section 5, and the study ends with the conclusion and future work in Section 6.

Materials
PSG is a multi-parametric test that is used to identify illnesses caused by sleep disorders. It is also effectively used to derive the characteristic schema of sleep. Several PSG records obtained from the PhysioNet open database have been used in this study [22]. PhysioNet [23] is a well-known biomedical data source, which is frequently used in many studies [18, 19, 20]. The PSG data sets contain several signals from various sensors. They were obtained from the records of eight Caucasian male and female volunteers aged from 21 to 35 years. The records are separated into two groups according to the recording procedure. The first four patients, whose records carry the initial letters "sc", are combined into Group I. The other group contains the remaining four patients, whose records are designated with the initial letters "st". Group I records were acquired over a 24-hour period from healthy patients in normal daily life with a modified cassette tape recorder. Group II records were obtained in a hospital setting over a 12-hour night period from patients who had mild difficulty falling asleep but were otherwise healthy. None of the patients in either group had been given medication for any illness or disorder. Group II records are more challenging because of the reported sleep disorders of the patients. Group I has clearer signals because of the modified analog cassette recorder. Furthermore, Group I recordings have more samples than Group II, which provides more efficient classification performance. Despite the differences, all PSG records contain two EEG channels (Fpz-Cz and Pz-Oz) and a single EOG channel with a sampling frequency of 100 Hz. Additionally, both groups have sub-mental EMG signals with different sampling frequencies. Moreover, Group I recordings include additional signals, oral-nasal airflow and rectal body temperature, sampled at 1 Hz. The records are scored using the Rechtschaffen and Kales (R&K) rules over 30-second intervals, which are called epochs.
Each epoch has one sleep-stage label stored in a hypnogram. According to the R&K standards, sleep is divided into six stages, namely W (Wakefulness), REM (Rapid Eye Movement), and NREM I, II, III, and IV (Non-Rapid Eye Movement). In some studies the NREM III and IV stages are combined into a single stage (SWS, Slow-Wave Sleep) to increase the classification rate [19, 20], but in our study all stages are taken into account separately in order to show the efficiency of the ensemble methods. In the proposed study, sleep-stage classification is performed based on EEG signals only. The representation of the EEG channels is referred to as a montage, and different montages are available in practical sessions at hospitals [24]. The Sleep-EDF data set includes two channels recorded under the sequential montage. Because the Fpz-Cz channel is reported to be more distinctive than the others in recent research [20], we also focus on the Fpz-Cz channel in our study.

Methodology
The Fpz-Cz channel of the EEG recordings is selected as input for the evaluation of the classification methods. Fig. 1 shows the flow diagram and the steps of this study. The methodology is presented in four sub-sections: (a) signal pre-processing, (b) feature inference and extraction, (c) formation of training and testing sets, and (d) classification. Performance metrics are explained as a final step.

Signal Pre-processing
Biomedical signals can easily be affected by artificial or natural uncontrollable factors; hence, signal pre-processing is a necessary step for isolating the signal of interest from the raw recordings. A small portion of the samples are labeled as 'undefined' in the records. These samples are treated as noise and removed. Additionally, Butterworth band-pass filters with cut-off frequencies of 0.2 Hz and 40 Hz are applied to the records as a noise-removal process. Components above 40 Hz and below 0.2 Hz are mostly irrelevant to the EEG. Sample distributions over the corresponding stages after the de-noising processes are presented in Table 1.
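The pre-processing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the filter order (4), the zero-phase filtering via `sosfiltfilt`, and the helper names `bandpass_eeg` and `drop_undefined_epochs` are assumptions, since the text only specifies a Butterworth band-pass with 0.2 Hz and 40 Hz cut-offs and the removal of 'undefined' samples.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 100.0  # sampling frequency of the EEG channels in Hz (from the text)

def bandpass_eeg(signal, low_hz=0.2, high_hz=40.0, fs=FS, order=4):
    """Butterworth band-pass filter; order and zero-phase filtering are assumptions."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

def drop_undefined_epochs(epochs, labels, undefined_label="undefined"):
    """Remove epochs whose hypnogram label is 'undefined' (treated as noise)."""
    keep = [i for i, lab in enumerate(labels) if lab != undefined_label]
    return [epochs[i] for i in keep], [labels[i] for i in keep]

# Usage on one synthetic 30-second epoch: a 10 Hz wave riding on a DC offset.
t = np.arange(3000) / FS
raw = np.sin(2 * np.pi * 10 * t) + 5.0
filtered = bandpass_eeg(raw)  # offset is suppressed, 10 Hz wave is kept
```

Second-order sections are used here because they tend to be numerically more stable than transfer-function coefficients for narrow low cut-offs such as 0.2 Hz.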

Feature Inference and Extraction
Feature extraction directly determines the success of the classification algorithm used in the next step; hence, it is a crucial step for any signal classification task, including those in the biomedical field. Features should express the original signal as much as possible. Moreover, they must be discriminative and informative in order to improve the classification results. In the literature, feature extraction methods are generally categorized into three groups: time, frequency, and spatial domain based techniques. Additionally, combined time-frequency techniques are also available, such as short-time Fourier transforms (STFTs) and wavelet transforms [25]. In this study, frequency-domain feature extraction methods are selected. The EEG signals can be represented in the frequency domain with seven characteristic waves, namely alpha (α), beta (β), theta (θ), delta (δ), spindle, saw-tooth, and K-complex. 10th-order infinite impulse response (IIR) Butterworth filters are designed with the relevant cut-off frequencies and applied to the signals after the pre-processing step in order to obtain these waves. The names of the waves and the corresponding spectral-band frequencies are presented in Table 2. The energy of a characteristic wave within one epoch is computed as

$E = \sum_{i=1}^{N} x_i^{2}$   (1)

where $N$ denotes the total number of samples in one epoch, which is 3000 for 30-second intervals at the sampling frequency of 100 Hz. Here, $x_i$ represents the $i$-th sample in the corresponding characteristic wave. The sums of the energy values of the relevant waves are assigned as discriminative features of the signal. However, the distributions of the features vary over different ranges; hence, these features need to be normalized in order to use them together. Additionally, the normalization procedure provides a more accurate assessment for the subject independent scenario. All features are therefore normalized into the [0-1] range before the classification step as follows:

$\hat{E}_i^{k} = \dfrac{E_i^{k} - E_{\min}^{k}}{E_{\max}^{k} - E_{\min}^{k}}$   (2)

where $\hat{E}_i^{k}$ is the normalized $i$-th energy value for the corresponding $k$-th characteristic wave, and $E_{\min}^{k}$ and $E_{\max}^{k}$ are the minimum and maximum values within the $k$-th characteristic wave. In summary, the normalized energy values of the characteristic waves extracted from the Fpz-Cz EEG channel are selected as the feature set, which is divided into training and testing sets in the next section.
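The energy features and the [0-1] normalization described above can be illustrated with a short sketch. It assumes that the energy of an epoch is the sum of its squared samples and that normalization is the usual min-max scaling per characteristic wave; the function names are hypothetical and the input is taken to be already band-filtered.

```python
def band_energy(samples):
    """Energy of one filtered epoch, taken here as the sum of squared samples."""
    return sum(x * x for x in samples)

def min_max_normalize(values):
    """Scale a list of energy values into the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def normalize_features(feature_rows):
    """Normalize each column (one column per characteristic wave) independently."""
    columns = list(zip(*feature_rows))
    normalized_columns = [min_max_normalize(col) for col in columns]
    return [list(row) for row in zip(*normalized_columns)]

# Usage: three epochs, two characteristic waves (toy energy values).
features = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]
print(normalize_features(features))
```

Normalizing each wave separately keeps a high-energy band (such as delta) from dominating distance-based classifiers like KNN.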

Training and Testing set formations
Two testing approaches are mainly presented in the literature: subject independent and subject specific [26]. The classification methods are tested under both scenarios in this study. The scenarios differ in the sample selection step. In the subject specific strategy, training samples are selected from one patient's records with a split ratio, and testing is performed on the rest of the data belonging to the same patient. In the subject independent strategy, on the other hand, the training set is formed from the entire records of all patients except one, and the isolated patient is reserved for testing within the same group. The first strategy gives more theoretical information about the success of the model in terms of the machine learning concept, whereas the second is related to more practical experiments in which the methods are applied to unseen samples. An automatic sleep-staging system deals with the more practical problem, which is more likely to be encountered in hospitals, clinics and institutes. The well-known k-fold cross-validation method is implemented for the subject specific strategy. The value of k represents the number of partitions. The samples of every class are divided into k partitions of equal size. k-1 partitions form the training set and the remaining partition is assigned to the testing set. The minimum value of k is two, in which case the training and testing sets contain an equal number of samples. k also represents the number of rotations, i.e., the number of repetitions with different samples but the same training set size. k is chosen as 2 in this study. Additionally, the testing process is repeated 10 times to strengthen the results. The final decision is made by a majority voting technique, which is based on the average score of all results. Another cross-validation method, leave-one-out cross-validation, is used to arrange the sample proportions for the subject independent strategy. All the records of three patients are selected as the training set and the remaining patient's records are used as the testing set within the same group.
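The two data-splitting schemes above can be sketched as follows. This is an illustrative sketch only: `stratified_k_fold` and `leave_one_subject_out` are hypothetical helper names, the subject identifiers in the usage are placeholders, and the per-class shuffling with a fixed seed is an assumption about how the partitions might be drawn.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k=2, seed=0):
    """Yield (train_idx, test_idx) pairs; every class is split into k parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j in range(k) if j != i for idx in folds[j])
        yield train, test

def leave_one_subject_out(subject_ids):
    """Yield (training_subjects, held_out_subject) for each subject in a group."""
    for held_out in subject_ids:
        yield [s for s in subject_ids if s != held_out], held_out

# Usage: 2-fold split of ten labelled epochs, then subject-wise splits.
labels = ["W"] * 4 + ["REM"] * 6
for train, test in stratified_k_fold(labels, k=2):
    print(len(train), len(test))
for train_subjects, test_subject in leave_one_subject_out(["s1", "s2", "s3", "s4"]):
    print(train_subjects, test_subject)
```

Repeating the 2-fold split with different seeds would correspond to the 10 repetitions mentioned in the text.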

Classification Methods and Parameter Settings
Several well-known classification methods are utilized in this study. The selected methods are separated into two groups: (a) conventional machine-learning methods and (b) ensemble-learning methods. Brief descriptions of the utilized methods and their parameter settings are given in the following sub-sections. As preliminary work, each method is tested with different parameter settings in order to find the model's most accurate results and the corresponding settings. The parameter setting tests are performed on the same data set (50% of the data set assigned for training, the other half for testing) in a single run. Afterwards, all methods with the chosen parameters are evaluated in the experimental section with the abovementioned data set formation. As in the parameter setting tests, the comparative tests are performed on the same testing data set in a single run for all methods.

Conventional Machine Learning Methods
Many algorithms have been developed with the rapid advance of machine learning. The majority of these algorithms are widely applied to biomedical data sets to derive more meaningful information and to classify with better accuracy. This study includes several familiar machine-learning algorithms (Support Vector Machines (SVMs), Decision Tree (DT), K-Nearest Neighbor (KNN), Naive Bayes (NB), Artificial Neural Network (ANN) and Linear Discriminative Classifier (LDC)) and evaluates them on the abovementioned sleep-stage data sets.

Support Vector Machine (SVMs)
SVM is one of the prominent classification algorithms; it can be used on large-scale data sets and provides more efficient results than statistical and neural classifiers. In SVM, high classification accuracies can be achieved even with small training sets with the help of a well-fitted cost function in kernel space [27]. In this section, SVM terminology and its usage in sleep stage classification are briefly explained. SVM uses the core idea of kernel-based learning, which aims to separate data in a high-dimensional feature space by mapping data points with a kernel function. SVM creates a decision surface between the samples of different classes by finding the optimal hyperplane, which is determined by the training samples closest to it (the support vectors). In this way, an optimal classification can be achieved for linearly separable classes. For linearly inseparable situations, kernel versions of SVM are defined. The main purpose of the kernel approach in SVM is to transform the data to a higher-dimensional space ($\Theta: \mathbb{R}^{d} \rightarrow \mathbb{R}^{h}$, $h > d$) where binary classification can again be achieved linearly [28]. Kernel functions are mainly used to define the cost function, and the response of the cost function defines the weight and bias values in the learning model. Fig. 2 graphically demonstrates an example of a binary classification problem. Data samples are classified into two classes, -1 and +1, by a hyperplane formed by the SVM model. The cost function is set to define the margin distances between the support vectors. The minimum value of the cost function provides the best position of the hyperplane. The penalty parameter is another variable used in the calculation of the cost function. The flexibility of the model can be adjusted with the penalty parameter given by the user. Large values of the penalty parameter make the model stricter toward training errors, which can end with more misclassification errors on unseen samples. On the contrary, with small values the model becomes looser and therefore tolerates some outliers [29]. SVM performs binary classification on two-class data sets. In order to use SVM in multiclass structures, "one against one" and "one against all" are the most popular strategies in the literature. Each strategy has its own advantages and disadvantages, as discussed in [30]. In order to define well-fitted SVM settings for the sleep stage classification problem, different penalty values (1, 10, 100, 250, 500), kernels (radial-basis, polynomial, quadratic, linear) and kernel parameters are tested in the initial part of the study and reported in Table 3. According to the accuracies of the parameter tests, two different SVM models, with polynomial and RBF kernels, are included in the study. The polynomial kernel is of third degree and the RBF kernel is fixed with sigma equal to 1. The penalty parameters are set to 25 and 100, respectively. Additionally, the one against one strategy is used for evaluation between classes owing to the presence of 6 classes in the sleep stage data sets.
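To make the kernel choices and the one against one strategy above more concrete, a minimal sketch is given below. The exact kernel forms are assumptions (a polynomial kernel with an additive constant of 1, and an RBF kernel parameterized as exp(-||x-y||^2 / (2 sigma^2))); the text only states the degree and sigma. The binary classifiers passed to `one_vs_one_predict` are hypothetical callables standing in for trained SVMs.

```python
import math
from itertools import combinations

def polynomial_kernel(x, y, degree=3, coef0=1.0):
    """Polynomial kernel (x . y + coef0)^degree; coef0 = 1 is an assumption."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel with width sigma."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

STAGES = ["W", "REM", "NREM I", "NREM II", "NREM III", "NREM IV"]
ovo_pairs = list(combinations(STAGES, 2))  # one binary SVM per pair of stages

def one_vs_one_predict(sample, binary_classifiers):
    """binary_classifiers maps (a, b) to a callable returning either a or b."""
    votes = {}
    for clf in binary_classifiers.values():
        winner = clf(sample)
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)

print(len(ovo_pairs))  # number of pairwise classifiers for six stages
```

With six stages, one against one trains 6 * 5 / 2 = 15 binary models, each on the samples of only two stages, whereas one against all would train six models on the full training set.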

Naïve Bayes Classifier (NBC)
Naive Bayes is a probabilistic approach in machine learning based on a modified form of Bayes' theorem. In probabilistic classification, decisions are generally based on the sample distributions, and samples are not strictly assigned to classes. Models give the probabilities of a sample over the set of classes instead of a single class. Bayes' theorem uses the joint probabilities of all features, whereas Naive Bayes assumes that the features are independent of each other. In this sense, the algorithm can be performed with less computational cost than regular probabilistic methods. The Naive Bayes results allow us to observe the success of a probabilistic classifier on the sleep stage classification problem within this study.

Linear Discriminative Classifier (LDC)
Linear classification is a major topic in the machine learning literature. In this study, the linear discriminative classifier performs a simple classification using only covariance matrices. The model fits a multivariate normal density to each group derived from the training set and estimates the labels of the testing samples from the calculated covariance estimates [31]. Basic linear classification is tested, besides the more complex methods, in order to demonstrate the effect of a simple linear model on sleep stage classification.

K-Nearest Neighbour (KNN)
KNN is a benchmark method for many classification problems in the literature because of its high accuracy and ease of implementation. In short, samples are classified based on the labels of a predefined number K of nearest neighbors. In the testing stage of the algorithm, each new sample from the test set is assigned to a class according to the class labels of its K closest samples in the training set, using a majority voting technique. The value of K is the key parameter in determining the class labels. In this study, K is set to 5 according to preliminary studies on the selection of K. The "Euclidean" distance is chosen among the tested distance metrics (Euclidean, Cityblock, Chebychev, Minkowski, Mahalanobis, and Cosine). Table 4 shows the accuracies of other K values with different distance metrics.
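The KNN decision rule described above (K = 5, Euclidean distance, majority voting) can be sketched in a few lines. The function names and the toy one-dimensional features are illustrative assumptions; ties are broken by whichever label is encountered first among the nearest neighbors.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, sample, k=5):
    """Label a sample by majority vote among its k nearest training samples."""
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: euclidean(pair[0], sample))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Usage with toy normalized energy features (one feature per epoch).
train_X = [[0.0], [0.1], [0.2], [0.9], [1.0], [1.1]]
train_y = ["W", "W", "W", "REM", "REM", "REM"]
print(knn_predict(train_X, train_y, [1.0], k=5))
```

Sorting the whole training set is adequate for a sketch; for the tens of thousands of epochs in a full recording, a spatial index or a partial sort would be a more practical choice.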

Decision Tree (DT)
The Decision Tree is known as a rule-based machine-learning method. Basically, it works on a tree structure. The path from the root to a leaf represents a classification rule. The roots represent the most informative features and the leaves indicate the labels. Information gain (IG) is the criterion that defines the rules. The most widely used measures for calculating the IG are entropy, twoing, and Gini. Like KNN, the Decision Tree is easy to implement. Additionally, the classification is much easier to interpret than with other methods, and the method can also be useful for some regression problems. However, DT performs poorly compared with SVM on large-scale data sets with few training samples [32]. Furthermore, pruning, which is needed to avoid overfitting, is another obstacle. According to the results of the preliminary studies on parameter settings, the DT model is configured with pruning and with Gini's Diversity Index for the IG.
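As an illustration of the splitting criterion mentioned above, the following sketch computes Gini's diversity index of a set of labels and the impurity reduction obtained by a candidate split. The helper names are hypothetical, and the weighted-impurity form of the gain is the standard textbook definition rather than a detail taken from this study.

```python
from collections import Counter

def gini_index(labels):
    """Gini diversity index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gain(parent, left, right):
    """Impurity reduction when the parent node is split into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * gini_index(left) + (len(right) / n) * gini_index(right)
    return gini_index(parent) - weighted

# Usage: a split that separates the two stages perfectly.
parent = ["W", "W", "REM", "REM"]
print(split_gain(parent, ["W", "W"], ["REM", "REM"]))
```

A tree builder would evaluate this gain for every candidate threshold on every feature and keep the split with the largest value.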

Ensemble Learning Methods
Ensemble learning methods have evolved from the principles of conventional machine learning. The key point of ensemble learning is the proper combination of several machine learning algorithms. Instead of a single method, as in the conventional approach, many learners contribute to the decision step of the classification, which is intended to provide higher success. Machine learning classifiers such as decision trees, Bayes classifiers, and KNN are called base learners or weak classifiers in ensemble models. Three ensemble models, which have different base learner combinations and/or sample selection strategies, are implemented in this study. Majority voting is used to define the final decision of the base learners.

Random Forest (RF) -(DT + Bagging)
Random Forest is a combination of multiple decision trees with the bagging sample selection strategy. Bagging is short for bootstrap aggregation, which is a way of improving the quality of estimates with the aid of well-formed training samples. It is also referred to as re-sampling. The main strategy underlying bagging is to distort the data set by re-sampling, and to train weak learners using the re-sampled training sets. The re-sampling is governed by the weights of the samples. In bagging, all samples have equal weights; therefore, the training sets are generated by random selection. As a result, different samples are selected for the training set in each iteration. This process helps to enhance the diversity of the sample distributions. The average of the decisions of the base learners determines the final decision.
More information about RF can be found in [33]. RF is commonly used in the literature because of its fast computation time, high accuracy, and robustness to noise and overfitting. Combinations of 10 to 1000 decision trees are tested on the sleep stage data set in order to define the best parameter setting. According to the results, a combination of 200 decision trees is used in the experimental tests.
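The bagging procedure described above can be sketched generically as follows. This is a hedged illustration: `train_base_learner` is a hypothetical callable that takes a re-sampled training set and returns a prediction function, and the trivial majority-class learner in the usage only stands in for a real decision tree. A full Random Forest additionally restricts each split to a random subset of features, which is not shown here.

```python
import random
from collections import Counter

def bootstrap_sample(X, y, rng):
    """Draw len(X) samples with replacement, each with equal probability."""
    n = len(X)
    idx = [rng.randrange(n) for _ in range(n)]
    return [X[i] for i in idx], [y[i] for i in idx]

def bagging_fit(X, y, train_base_learner, n_learners=200, seed=0):
    """Train n_learners base learners, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train_base_learner(*bootstrap_sample(X, y, rng))
            for _ in range(n_learners)]

def bagging_predict(learners, sample):
    """Combine the base learners' decisions by majority voting."""
    votes = Counter(learner(sample) for learner in learners)
    return votes.most_common(1)[0][0]

# Usage with a placeholder base learner that always predicts the majority class.
def majority_class_learner(X, y):
    label = Counter(y).most_common(1)[0][0]
    return lambda sample: label

X = [[0.1 * i] for i in range(10)]
y = ["W"] * 9 + ["REM"]
ensemble = bagging_fit(X, y, majority_class_learner, n_learners=25)
print(bagging_predict(ensemble, [0.3]))
```

The same `bagging_fit` wrapper applies unchanged to other base learners, which is the sense in which bagging is independent of the underlying classifier.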

Adaptive Boosting -(DT + Adaboost)
Boosting is another technique similar to bootstrap. The difference between boosting and bootstrap lies in the re-sampling step. Bootstrap ignores the weights of the samples and re-samples randomly, whereas boosting assigns a different weight to each sample after the first iteration. At the end of the first step, the probabilities of the misclassified samples are boosted for the second step, and the subsequent classifiers are trained. The remaining steps continue likewise, with different weights defined by the technique. Readers are referred to an essential guide [34] for the theory of boosting.
Adaboost is an abbreviation of adaptive boosting, which mainly outperforms other regular boosting techniques and is more robust to overfitting. However, it is still easily affected by noise and outliers in the data. In this study, the same ensemble model structure as in the RF strategy (a combination of 200 DTs) is used to assess the effect of Adaboost re-sampling on sleep stage classification.
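The re-weighting step that distinguishes boosting from bagging can be sketched as below. The update shown is the classical two-class (discrete) AdaBoost rule and is given only as an illustration; the multiclass variant actually needed for six sleep stages (for example AdaBoost.M1 or SAMME) is not specified in the text, and the function name is hypothetical.

```python
import math

def adaboost_round(weights, correct):
    """One boosting round: return the learner weight alpha and the new sample weights.

    weights: current sample weights summing to 1.
    correct: booleans, True where the current base learner classified the sample correctly.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    error = min(max(error, 1e-10), 1.0 - 1e-10)  # keep the logarithm finite
    alpha = 0.5 * math.log((1.0 - error) / error)
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Usage: four equally weighted samples, the last one misclassified.
alpha, new_weights = adaboost_round([0.25] * 4, [True, True, True, False])
print(alpha, new_weights)
```

After the update, the misclassified sample carries half of the total weight, so the next base learner is pushed to concentrate on it. This also illustrates why noisy or mislabeled epochs can dominate later rounds.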

Random Subspace (RSS) -(KNN + Bagging)
RSS is a generalized form of the RF algorithm. RF is composed of decision tree ensembles, whereas RSS can be built from any other classifier. In this study, KNN classifiers are used as base learners in RSS. The same number of base learners as in the other ensemble models is used in order to demonstrate the effect of re-sampling on regular KNN methods within the ensemble concept.

Ensemble SVM -(SVM + Bagging)
SVM has already been explained in a previous sub-section; regular SVM uses random sample selection within the concept of binary classification. However, this study aims to present comparative results; hence, regular SVM is combined with the bagging process to indicate the effect of ensemble theory. Only the polynomial kernel SVM is adapted with bagging re-sampling and combination. The same parameters are used for the base learners in the ensemble SVM model (a penalty parameter of 25 and a third-degree polynomial kernel). More details can be found in [35].

Evaluation Metrics and Testing Results
Kappa, Accuracy, F-measure, sensitivity (recall) and precision values are considered as performance measurements in this study. Brief information about evaluators is provided in the following sub-sections.

Evaluation Metrics
Generally, performance metrics are derived from the confusion matrix. Accuracy is the key benchmark metric for any classification. It signifies the percentage of correctly classified samples within the whole testing set and is calculated using (3):

$ACC = \dfrac{TP + TN}{TP + TN + FP + FN}$   (3)

where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. An accuracy of 100% shows that all of the samples in the test set are correctly classified. However, a high accuracy does not by itself establish the success of the model. The accuracy paradox [36] states that the whole distribution of the confusion matrix is important for evaluating the success of the model, whereas accuracy only reflects the correctly classified samples. Other metrics are therefore also reported in order to assess the model more completely.
Sensitivity (SE) and precision (PR) are accepted as other performance metrics in order to evaluate the model in terms of Type I and Type II errors. Sensitivity and precision can be calculated by (4) and (5), respectively.

$SE = \dfrac{TP}{TP + FN}$   (4)

$PR = \dfrac{TP}{TP + FP}$   (5)

F-Measure (F1-Score) can be derived from the sensitivity and precision measures as in (6):

$F = \dfrac{2 \cdot SE \cdot PR}{SE + PR}$   (6)

It reaches its best value at 1 and its worst at 0. F-Measure values are more reliable than accuracy rates because FP and FN are included in the result.
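The metrics above can be computed per sleep stage directly from a confusion matrix. The sketch below assumes the convention that rows hold the actual classes and columns the predicted classes; the function names and the toy two-class matrix are illustrative only.

```python
def accuracy(confusion):
    """Share of correctly classified samples (the diagonal of the matrix)."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total

def per_class_metrics(confusion, cls):
    """Sensitivity, precision and F-measure for one class.

    confusion[i][j] counts samples of actual class i predicted as class j.
    """
    n = len(confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp
    fp = sum(confusion[r][cls] for r in range(n)) - tp
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = sensitivity + precision
    f_measure = 2 * sensitivity * precision / denom if denom else 0.0
    return sensitivity, precision, f_measure

# Usage with a toy two-class confusion matrix.
cm = [[8, 2],
      [1, 9]]
print(accuracy(cm), per_class_metrics(cm, 0))
```

In a six-stage problem the same function is called once per stage, treating that stage as the positive class and all other stages as negative.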
Cohen's Kappa (κ) is another performance metric, commonly used in many statistical problems [37]. It mainly assesses inter-rater agreement, that is, the similarity of the raters to each other.
The κ statistic is more informative than the other metrics because it takes the prior probabilities into account. In cases with similar accuracy values but different confusion matrices, κ also gives more reliable information about the success of the learning model. It also evaluates the raters. The κ score is calculated using (7):

$\kappa = \dfrac{p_o - p_e}{1 - p_e}$   (7)

where $p_o$ represents the proportion of observed agreement and $p_e$ the proportion of agreement expected by chance, both derived from the confusion matrices. $p_o$ and $p_e$ are generated from (8) and (9) in multiclass models with two raters. $j$ is the total number of classes, and $P_i$ and $A_i$ denote the predicted and actual values for the $i$-th class, respectively. In the sleep stage classification case, six stages are defined under the R&K rules. The two raters are the actual hypnogram and the predicted results. The Kappa schema for sleep-stage classification can be seen in Table 5. The Kappa score for each class is calculated based on this schema with the referred formulas (7, 8, and 9). The general κ scores of the models are derived using (10), which averages the kappa scores of the individual classes:

$\bar{\kappa} = \dfrac{1}{j} \sum_{i=1}^{j} \kappa_i$   (10)

The $\bar{\kappa}$ scores are presented in the comparison tables.
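For reference, a minimal sketch of Cohen's kappa computed from a confusion matrix is given below. It uses the standard multiclass two-rater formulation (observed agreement from the diagonal, chance agreement from the row and column totals), which is an assumption: it is not necessarily identical to the per-class scheme of Table 5, and the function name is hypothetical.

```python
def cohen_kappa(confusion):
    """Cohen's kappa; confusion[i][j] counts actual class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    p_o = sum(confusion[i][i] for i in range(n)) / total  # observed agreement
    actual = [sum(confusion[i]) for i in range(n)]         # row totals
    predicted = [sum(confusion[r][c] for r in range(n)) for c in range(n)]
    p_e = sum(a * p for a, p in zip(actual, predicted)) / total ** 2  # chance agreement
    if p_e == 1.0:  # both raters always use the same single class
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)

# Usage: hypnogram (rows) versus predicted stages (columns), toy two-class case.
cm = [[8, 2],
      [1, 9]]
print(cohen_kappa(cm))
```

In this toy case the accuracy is 0.85 while half of the agreement is expected by chance, which is why κ is noticeably lower than the accuracy.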

The Subject Specific Scenario
The subject specific scenario covers the analysis of the methods on a single record. Both training and testing samples are selected from that record. Models are trained with a predefined number of samples from the selected record using the cross-validation technique. The rest of the samples in the same record are allocated to the testing set. Individual and averaged accuracies of the methods for the relevant groups are presented in Table 6. Tests are repeated 10 times to strengthen and generalize the results. Standard deviations (Std) occur between the repeated tests because of the sample rotation in the testing set under cross-validation. Std values are also reported in Table 6 to show the consistency of the corresponding methods. A minimum deviation between all tests is preferable; hence, the methods are evaluated in this respect as well. Accuracy scores give a general idea about the performance of the methods, but κ scores are a stronger and more trustworthy criterion because they also include an inter-rater comparison. In this sense, the averaged κ scores of the groups are given in Table 7. The best numerical results for each learning concept are marked separately with bold and italic numbers. Bold is used for the best ensemble method and italic for the best conventional method. Additionally, the F-measure scores of each method are shown as bar charts in Fig. 5 and 6 for Group I and II, respectively. In this way, more individual and meaningful inferences about each method can be derived from the visual demonstrations. The tables and figures are interpreted in the Discussion section.

The Subject Independent Scenario
In the subject independent scenario, models are trained with three records whereas the remaining record in the same group is reserved for testing. This selection method is referred to as 'leave-one-out cross-validation' in the literature. The goal of this scenario is to evaluate the success of the model in classifying unseen samples, which is the case likely to be encountered in clinics and hospitals. Results are presented in a similar form to the subject specific tests. The only difference is that no standard deviation occurs in this scenario, because there is no sample rotation in the testing set. Individual and averaged accuracies with κ scores are reported in Table 8. It can be seen from the table that the subject independent scenario is more challenging than the subject specific scenario, given the relatively low accuracy and κ scores obtained with the same configuration of the methods.
F-measure scores are presented in Fig. 7 and Fig. 8 for Group I and Group II, respectively. The figures show the individual performance of each classifier on each sleep stage, so the methods can be analyzed in more detail per stage. The main challenge in this scenario is the individual differences between patients. Additionally, artifacts and several kinds of noise can occur while the signals are recorded. All features are normalized in the pre-processing step to bring the signals to a standard scale and to mitigate outlier problems; otherwise, some records can be incompatible or inconsistent with each other. However, the remaining outliers significantly induce under-training in model learning, and as a result the evaluation metrics are lower than in the subject specific scenario.
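A per-stage F-measure of the kind plotted in the figures can be computed as in the following sketch. It assumes the balanced F-measure (F1, the harmonic mean of precision and recall), since the weighting is not stated in this section. The function name and the toy labels are hypothetical.

```python
def f_measure_per_class(y_true, y_pred):
    """Return {stage: F1 score}, treating each stage as one-vs-rest."""
    scores = {}
    for stage in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == stage and p == stage for t, p in pairs)
        fp = sum(t != stage and p == stage for t, p in pairs)
        fn = sum(t == stage and p != stage for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[stage] = 2 * precision * recall / (precision + recall)
        else:
            scores[stage] = 0.0  # stage never predicted correctly
    return scores


# Toy labels (invented for illustration).
y_true = ["Wake", "Wake", "NREM I", "REM", "REM", "NREM I"]
y_pred = ["Wake", "NREM I", "NREM I", "REM", "Wake", "NREM I"]
stage_scores = f_measure_per_class(y_true, y_pred)
print(stage_scores)
```

Reporting one score per stage exposes stages that a classifier rarely detects, which an overall accuracy value can hide.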
On the other hand, the subject independent tests are more important than the subject specific tests, because the key idea behind the subject independent scenario is to provide results in a patient-independent system. Such a system can be trained on previously collected healthy and unhealthy records to build a model, and the diagnosis of an unknown case can then be made based on predefined criteria. In this sense, the subject independent tests are more beneficial for current computer aided diagnosis systems, which mainly aim to give a diagnosis directly. The subject specific tests, in contrast, depend on the long term diagnoses of specific patients: variations in a patient's condition during treatment can give clues about the diagnosis in the specific scenario.

Discussions
According to Table 6, ensemble SVM with bagging resampling surpasses all other methods in overall accuracy. Another ensemble method, Random Forest (RF, a combination of DTs with bagging), is the second most successful method. However, the same DT combination with the Adaboost resampling strategy is not as successful as the bagging version. Similarly, KNN with bagging is not a successful or meaningful combination for subject specific sleep stage classification; as Table 6 shows, it has the worst accuracy of all methods. KNN is the most accurate of the conventional machine learning methods, with 82% accuracy, and ranks third in overall accuracy. However, it is not as consistent as RF and ensemble SVM: several subjects are classified with low accuracy, and KNN ranks fourth or lower in other cases. In contrast, RF and ensemble SVM remain steady across all subjects.
Ensemble SVM classified all individual subjects with the best accuracies, and RF comes second. Besides these best results, another ensemble method, Random Subspace with a KNN classifier and bagging resampling, is also steady, in this case in misclassifying all subjects. In both cases, success and failure, the results show that ensemble methods behave more stably, which makes them more reliable for implementation in a real system design. In contrast, the conventional methods rank differently on each dataset. For example, NB gives better results on some subjects within Group II, whereas KNN or LDC come out ahead on other subjects within the same group. This is a questionable aspect of conventional methods for sleep stage classification. Another consistency criterion, the standard deviation (Std), also emphasizes the importance of ensemble SVM or any other ensemble method within this study. Classifiers with minimum standard deviation are preferable in practical usage. In this respect, ensemble SVM, with a deviation of 0.80, stands out among all tested methods. As with the accuracy ranking, the three best methods are the same in terms of consistency based on standard deviation: ensemble SVM (1st), RF (2nd) and KNN (3rd) in the subject specific scenario. As a kernel based method, regular SVM (using a random sample selection strategy) yields lower classification accuracies than its ensemble version. Similarly, the DT classification results indicate the importance of having several learners instead of only one: compared with the ensemble combination of DTs, the accuracies of a single DT on each subject stay far behind. It can be inferred that a rule based or even a kernel based method alone is not useful for sleep stage classification based on a single EEG channel. When several of them are combined with a bagging resampling strategy, results improve noticeably in the subject specific scenario.
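The bagging idea discussed above (several learners trained on bootstrap resamples and combined by majority vote) can be sketched as follows. This is not the configuration used in the experiments: a one-level decision tree (decision stump) stands in for the DT and SVM base learners, and the function names, the number of learners and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def fit_stump(samples):
    """Fit a one-level decision tree: pick the (feature, threshold) split
    with the fewest training errors. Stand-in base learner (assumption)."""
    best = None
    for j in range(len(samples[0][0])):
        for threshold in sorted({x[j] for x, _ in samples}):
            left = [y for x, y in samples if x[j] <= threshold]
            right = [y for x, y in samples if x[j] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, j, threshold, left_label, right_label)
    _, j, threshold, left_label, right_label = best
    return lambda x: left_label if x[j] <= threshold else right_label


def fit_bagging(samples, fit_base, n_learners=25, seed=0):
    """Train n_learners base models on bootstrap resamples; predict by vote."""
    rng = random.Random(seed)
    learners = []
    for _ in range(n_learners):
        # Bootstrap: draw len(samples) samples with replacement.
        bootstrap = [rng.choice(samples) for _ in samples]
        learners.append(fit_base(bootstrap))

    def predict(x):
        votes = Counter(learner(x) for learner in learners)
        return votes.most_common(1)[0][0]

    return predict


# Toy, well separated data (invented): feature 0 carries the class signal.
data = [((0.0, 5.0), "Wake"), ((0.1, 3.0), "Wake"), ((0.2, 4.0), "Wake"),
        ((0.3, 5.0), "Wake"), ((0.4, 3.0), "Wake"),
        ((1.0, 4.0), "REM"), ((1.1, 5.0), "REM"), ((1.2, 3.0), "REM"),
        ((1.3, 4.0), "REM"), ((1.4, 5.0), "REM")]
single = fit_stump(data)
ensemble = fit_bagging(data, fit_stump)
print(ensemble((0.0, 4.0)), ensemble((1.4, 4.0)))
```

Each base learner sees a slightly different resample, so individual mistakes tend to be outvoted. This is one plausible reading of why the ensembles behave more steadily across subjects than single learners.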
In some cases, accuracy alone is not a sufficient criterion for evaluation. The principal problem with the accuracy formula is that it ignores Type I and Type II errors in the confusion matrices. To resolve this problem, summarized κ scores of each method are presented alongside the accuracy table to verify the results. In theory, the κ formula grades the results as 'accidental (by chance)' or 'not accidental'. Results with a score below 0.5 are regarded as 'accidental', and it is advised that such a method be avoided in applications. This study shows that classification by KNN combined with bagging resampling is an accidental outcome in the subject specific scenario. This is an important criterion for medical science, because accidental results should not be relied upon where human life is concerned. In that respect, ensemble KNN is entirely unsuitable for sleep stage identification. The other methods have scores that are all above 0.5, but it is better to use the one nearest to 1. Accordingly, as with the accuracy results, ensemble SVM or DT should be used for sleep stage classification in the subject specific scenario. Fig. 5 and Figure 6  In these studies, classification is made into 4 classes. In this study, however, all stages are observed individually to emphasize the ensemble methods. Obviously, a combined approach would increase the success rates.
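The chance correction behind the κ score can be made concrete with a short sketch of Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement (the accuracy) and p_e is the agreement expected by chance from the label frequencies. The function name and the toy labels are hypothetical, and the handling of the degenerate case is a convention chosen here.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa between expert labels and classifier output."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: product of marginal frequencies, summed over stages.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        # Degenerate case (both label every sample with one stage); kappa is
        # formally undefined, 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example (invented): 8 of 10 epochs are Wake, and the classifier
# labels every epoch as Wake.
y_true = ["Wake"] * 8 + ["REM"] * 2
y_pred = ["Wake"] * 10
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
kappa = cohen_kappa(y_true, y_pred)
print(acc, kappa)
```

In the toy example the accuracy is 0.80 while κ is 0, because a classifier that always outputs the majority stage agrees with the expert no more often than chance. This is the situation the 0.5 criterion is meant to flag.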
For the overall classification results without group division in the subject specific scenario, ensemble SVM with bagging resampling is the most successful method, with an accuracy of 86.60%. The second most promising method is another ensemble method, DTs with bagging. The third, which is also the most accurate of the conventional machine learning algorithms, is KNN with 82.52% accuracy. The other tested scenario, subject independent, is considered a special case of more practical relevance, because tests are performed on unseen records, as in hospitals. Subject specific tests generally result in relatively high accuracies, but accuracy is not at a satisfactory level in the subject independent case. The diversity of the records and the artifacts during recording are the main reasons for this inefficiency. Normalization and noise removal procedures were applied to reduce artifacts and individual differences, but some samples still remained as outliers for the models.
Tests are performed with the leave-one-out cross-validation method, which uses one subject's entire set of samples in each testing step. The remaining three records form the training set. As presented in Table 8, and similar to the subject specific scenario, Group II classification accuracies are lower than those of Group I, and the best accuracies obtained are unstable. No single algorithm achieves the best accuracy on every record; different algorithms do so on different records, so it is better to analyze the results independently per record instead of relying on overall success. For this reason, the maximum accuracy for each record is written in bold in Table 8. Group II records are classified more consistently and stably than Group I records. NN is the most accurate method for Group I with an overall score of 76.57%. It achieved its maximum accuracy on the last record in Group I, where the others failed. NN is also successful on the first record. Ensemble DT with bagging resampling and regular KNN perform better on the second and third records, respectively. Group II classification results are more clear-cut. SVM is graded as the best model with 53.77%, and DT with bagging follows closely behind. As an overall conclusion for Group II, all methods give unsuccessful accuracy in the subject independent scenario, and the κ scores support that outcome. The κ scores are provided in detailed form rather than in a summarized table as in the subject specific scenario, because each score of a method matters for the tested subject record. A κ score above 0.5 indicates a result that is not accidental. This gives more meaning to the corresponding accuracy rate; otherwise, the calculated accuracy is not meaningful, because the method classified the samples by chance.
Only two records have acceptable accuracies in both groups according to the κ scores; the rest are classified accidentally. Even if the accuracy rate is high, a κ score below 0.5 marks the classification as unsuccessful, as for subject 2 with 75.98% accuracy but a κ score of 0.36 with the ensemble DT method. The worst classification results of all methods are obtained on subject 2 in Group II. Overall, the highest score is obtained by ensemble SVM, but it is not acceptable and needs to be improved because it is lower than 0.5. Similar to the subject specific scenario, Fig. 7 and Fig. 8 are prepared for the subject independent case. As both figures show, the success of the methods is also low in terms of individual stage classification. Only KNN with bagging resampling has an impact on Wake stages; in the other stages, as in the subject specific scenario, KNN with bagging is still useless. The methods mostly classified NREM I and REM stages in the subject independent scenario. It can be inferred from all results of the subject independent scenario that the tested methods are insufficient and their results accidental, because of the aforementioned complexities and differences.

Conclusions
In this study, sleep stages are classified based on single-channel EEG signals. Several prominent ensemble and conventional machine learning methods are tested on a well-known dataset.
The comparative results help to identify the best method for use in a future automatic sleep staging system. Furthermore, detailed figures and explanations shed light on the challenging aspects of the individual stages. The dataset is divided into two parts, Group I and Group II, according to patient status. Group I records were taken during daily life with a modified analog cassette recorder; the records within Group I therefore contain more samples and are cleaner than those of Group II. Group II records, on the other hand, were obtained in hospital during the night with digital recorders, which can be affected by other devices as external factors. These effects appear as artifacts and noise. Additionally, Group II subjects have mild difficulties in sleeping, which adds complexity to the EEG signals. In the first step, preprocessing is performed on the EEG signals to remove outliers and noise. In the feature extraction phase, frequency based characteristic waves are obtained from the signals and, subsequently, a set of energy features is derived from these waves as representative features. Normalization is applied to the energy features to bring the ranges of the different subject records to a common scale.
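The last two steps of this pipeline, energy features and their normalization, can be illustrated with a minimal sketch. Two assumptions are made that this section does not specify: energy is taken as the sum of squared amplitudes of a band-limited wave, and normalization is a per-record z-score of each feature column. The function names and numbers are hypothetical.

```python
from statistics import mean, pstdev


def band_energy(wave):
    """Energy of one band-limited wave segment: sum of squared amplitudes
    (assumed definition)."""
    return sum(v * v for v in wave)


def zscore_columns(rows):
    """Normalize each feature column of one record to zero mean and unit
    standard deviation (assumed scheme); constant columns map to 0.0."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    return [[(v - m) / s if s > 0 else 0.0
             for v, m, s in zip(row, means, stds)]
            for row in rows]


# Toy record (invented): two epochs, two energy features per epoch.
energy = band_energy([1.0, -2.0, 2.0])
features = [[1.0, 10.0], [3.0, 10.0]]
normalized = zscore_columns(features)
print(energy, normalized)
```

Normalizing each record separately removes record-specific offsets and scale differences, which is one way to reduce the incompatibility between records from different subjects and recorders.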
In the subject specific scenario, the results are at an adequate level, with averaged accuracy rates of 92.81% and 80.39% for Group I and Group II, respectively. The highest individual accuracies of the records vary between 76% and 97%. The highest accuracy rates are obtained by the ensemble learning combination of multiple SVMs with bagging resampling. Another ensemble method, the DT combination with bagging, is the second most prominent classifier. In addition, KNN is the best of the conventional methods and takes 3rd place in the overall ranking. In contrast to the subject specific scenario, the subject independent tests result in lower success, because data from different patients are used to predict another patient's sleep stages. Differences in each subject's metabolism and in device settings affect the prediction results. As a consequence, the averaged results fall behind those of the subject specific scenario, with accuracies of 75.89% and 53.77% for Group I and Group II. Ensemble SVM is still the most robust classifier for Group II, whereas the algorithms yield varying rates for Group I. The highest individual accuracies range between 32% and 86%. κ scores are a more reliable criterion than accuracy rates. They show that additional processing still needs to be applied to eliminate outliers and noise in order to improve the classification, even where accuracy rates are remarkable. In conclusion, the subject independent scenario is not an easy task for computer aided diagnosis. Ensemble SVM mainly, and some other methods in part, provide better results, but practical problems need a more generalized solution that can be applied under any condition. Ensemble SVM surpasses the other methods in classification accuracy for Group II, but it is not robust enough according to the κ scores. Furthermore, results are inconsistent in Group I.
In Group I, several methods achieve the best accuracies on different recordings, instead of a single method as in Group II. To build an automatic sleep staging system, a combination of those methods should be optimized, or an artificial decision system should be used to select the classification method according to the characteristics of each recording; for example, KNN should be used for the 3rd recording, whereas NN should be used especially for the 4th recording. This study aims to be a resource for sleep stage classification based on both ensemble and conventional machine learning algorithms using single channel EEG. As future work, additional preprocessing and feature extraction techniques will be investigated, especially for the Group II data sets and the subject independent scenario. Additionally, different machine learning ensembles will be tested to further improve sleep stage classification and to arrive at a more acceptable solution for a computer aided diagnosis system.