Comparison of Multi-Label Classification Methods for Prediagnosis of Cervical Cancer

Cervical cancer is one of the most common causes of cancer death among women. Prediagnosis of cervical cancer at an early stage is critical for reducing mortality rates. In addition, early prediction of cervical cancer can help both patients and physicians, because the disease is easier to treat at an early stage. Cervical cancer is associated with various risk factors such as family history, education level, multiple full-term pregnancies, smoking, and sexually transmitted diseases. Recently, different types of advanced methods based on machine learning techniques have been developed for risk prediction analysis. The purpose of this study is to investigate the efficacy of multi-label classification techniques for diagnosing cervical cancer at an early stage. Four common learning algorithms (Naïve Bayes, J48 Decision Tree, Sequential Minimal Optimization, and Random Forest) were compared in terms of accuracy, Hamming loss, exact match (subset accuracy), and ranking loss. This study may thus help physicians, academics, and cancer researchers make fast and accurate diagnoses.


Introduction
A risk factor strongly affects the chance of developing a disease, especially cancer. Different cancer types have different risk factors. For example, cervical cancer arises from the interaction between a genetic predisposition and behavioral and environmental risk factors. Five or more full-term pregnancies, smoking cigarettes or breathing in secondhand smoke, use of hormonal contraceptives for five or more years, and previous exposure to sexually transmitted diseases (STDs) such as AIDS, some herpes viruses, and Hepatitis B are some of the factors associated with an increased risk of cervical cancer [1,2]. It is crucial to build predictive models based on risk factors to support interventions in the progression of cervical cancer, because early prediction of cancer plays a pivotal role in the treatment process and in an effective preventive strategy. Correct and timely diagnosis of cervical cancer remains one of the biggest problems in health care. The World Health Organization (WHO) suggests the development of basic strategies to identify those at risk of cervical cancer and to provide them with early lifestyle interventions [3]. In this study, different combinations of multi-label classification methods and learning algorithms were used for diagnosing cervical cancer based on common risk factors. Four widely used multi-label classification methods were applied: Binary Relevance (BR), Conditional Dependency Networks (CDN), Classifier Chains (CC), and Label Combination (LC). As base classifiers, four learning algorithms (Naïve Bayes, J48 Decision Tree, Sequential Minimal Optimization, and Random Forest) were analyzed and compared using the MEKA software (Multi-label Extension to the WEKA data mining tool). This software provides an open-source implementation of multi-label classification methods, including classifier chains [4].

Multi-label classification
The methods used in classification problems are divided into two groups according to the number of labels: single-label and multi-label. As shown in Table 1, in single-label classification the labels (categories) are mutually exclusive and each instance is assigned to only one category. In multi-label classification, on the other hand, the labels may be interrelated and each instance can correspond to multiple class labels (Table 2). Multi-label classification has recently attracted the attention of many researchers and has been used in several applications such as scene classification [5,6], text classification [7,8], bioinformatics [9], and music and movie categorization [10,11]. In recent years, many well-established methods have been developed to analyze multi-label classification problems. According to Tsoumakas and Katakis [12], multi-label classification methods fall into two main categories: (a) Problem Transformation (PT) methods and (b) Algorithm Adaptation (AA) methods. Algorithm Adaptation methods (fitting the algorithm to the data) modify traditional single-label classification algorithms to handle multi-label data directly. Several base algorithms have multi-label variants, such as lazy learning [13], support vector machines [14], neural networks [15], and decision trees [16]. Problem Transformation methods (fitting the data to the algorithm) transform a multi-label classification problem into one or more single-label classification problems. In this study, four of the most popular PT methods for multi-label classification, namely Binary Relevance (BR), Classifier Chains (CC), Conditional Dependency Networks (CDN), and Label Combination (LC), were applied to the cervical cancer dataset. A brief comparison of these methods is given in Table 3.

Binary Relevance
Using an independent classifier for each of several decisions is the natural extension of the single-label problem. In the multi-label literature, this approach is often called binary relevance when the labels are binary. Binary Relevance (BR) is a well-known and the most popular transformation method; it learns q binary classifiers, one for each label in L. As illustrated in Figure 1 (a), BR converts a multi-label classification problem into several separate single-label binary classification problems according to the one-vs.-all strategy. Each binary classifier is responsible for predicting the association of a single label [17].
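The transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not the MEKA implementation used in the study: the `NearestNeighbour` class is a hypothetical toy base learner standing in for NB, J48, SMO, or RF, and the function names and data are invented for the example.

```python
from typing import List, Sequence


class NearestNeighbour:
    """Toy single-label base learner (1-NN); an assumed stand-in for NB, J48, SMO or RF."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> "NearestNeighbour":
        self.X, self.y = [list(row) for row in X], list(y)
        return self

    def predict_one(self, x: Sequence[float]) -> int:
        # Squared Euclidean distance to every stored training row.
        dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in self.X]
        return self.y[dists.index(min(dists))]


def fit_binary_relevance(X, Y) -> List[NearestNeighbour]:
    """Train q independent binary classifiers, one per label column of Y."""
    q = len(Y[0])
    return [NearestNeighbour().fit(X, [row[j] for row in Y]) for j in range(q)]


def predict_binary_relevance(models, x) -> List[int]:
    """Each classifier decides on its own label, ignoring the other labels."""
    return [model.predict_one(x) for model in models]


# Invented toy data: two features, two binary labels.
X_train = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y_train = [[0, 0], [0, 1], [1, 0], [1, 1]]

models = fit_binary_relevance(X_train, Y_train)
prediction = predict_binary_relevance(models, [0.9, 0.1])
print(prediction)
```

Because each of the q models is trained on its own label column only, any dependency between labels is ignored by construction.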

Classifier Chains
J. Read et al. [19] proposed Classifier Chains (CC), which, like BR, contain q binary classifiers but include previous predictions as feature attributes. The classifiers are connected along a chain in which each classifier deals with the binary relevance problem associated with one label in L; see Figure 1 (b). The attribute space of each link in the chain is extended with the 0/1 label relevances of all previous links, thereby building a classifier chain. Because label information is passed along the chain, this method can capture label dependencies that BR ignores, which can improve prediction performance, and it can be applied with any type of base classifier.
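A minimal sketch of the chaining idea follows. It assumes a fixed label order, uses the true labels of earlier links during training and the predicted labels at prediction time, and again relies on a hypothetical toy `NearestNeighbour` learner and invented data rather than the MEKA implementation.

```python
from typing import List, Sequence


class NearestNeighbour:
    """Toy single-label base learner (1-NN); an assumed stand-in for any base classifier."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> "NearestNeighbour":
        self.X, self.y = [list(row) for row in X], list(y)
        return self

    def predict_one(self, x: Sequence[float]) -> int:
        dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in self.X]
        return self.y[dists.index(min(dists))]


def fit_classifier_chain(X, Y) -> List[NearestNeighbour]:
    """Link j is trained on the features plus the true values of labels 0..j-1."""
    q = len(Y[0])
    chain = []
    for j in range(q):
        X_ext = [list(x) + list(y[:j]) for x, y in zip(X, Y)]
        chain.append(NearestNeighbour().fit(X_ext, [y[j] for y in Y]))
    return chain


def predict_classifier_chain(chain, x) -> List[int]:
    """Each link sees the features plus the predictions of all earlier links."""
    labels: List[int] = []
    for model in chain:
        labels.append(model.predict_one(list(x) + labels))
    return labels


# Invented toy data: one feature, two binary labels.
X_train = [[0.0], [1.0], [2.0], [3.0]]
Y_train = [[0, 0], [0, 1], [1, 1], [1, 0]]

chain = fit_classifier_chain(X_train, Y_train)
prediction = predict_classifier_chain(chain, [2.2])
print(prediction)
```

The order of the labels in the chain is a design choice; different orders can give different predictions.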

Label Combination
Label Combination (LC), also known as Label Powerset [19], is an alternative paradigm to the Binary Relevance (BR) method. LC treats each label set as a single label, i.e. each label set becomes a single class label within a single-label problem. The set of single labels therefore represents all the distinct label subsets in the multi-label training data. A graphical model for this approach is illustrated in Figure 1 (c).
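The label-set-to-class mapping can be sketched as follows. This is an illustrative sketch under assumptions: the `NearestNeighbour` class is a hypothetical toy multi-class learner, and the helper names and data are invented.

```python
from typing import List, Sequence, Tuple


class NearestNeighbour:
    """Toy single-label (multi-class) base learner; an assumed stand-in for any base classifier."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> "NearestNeighbour":
        self.X, self.y = [list(row) for row in X], list(y)
        return self

    def predict_one(self, x: Sequence[float]) -> int:
        dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in self.X]
        return self.y[dists.index(min(dists))]


def fit_label_combination(X, Y) -> Tuple[NearestNeighbour, List[Tuple[int, ...]]]:
    """Map every distinct label set seen in training to one class id, then train one model."""
    label_sets = sorted({tuple(y) for y in Y})
    class_of = {label_set: k for k, label_set in enumerate(label_sets)}
    model = NearestNeighbour().fit(X, [class_of[tuple(y)] for y in Y])
    return model, label_sets


def predict_label_combination(model, label_sets, x) -> List[int]:
    """Predict a class id and decode it back into its label set."""
    return list(label_sets[model.predict_one(x)])


# Invented toy data: the label set (0, 0) occurs twice, so there are four classes.
X_train = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.1, 0.1]]
Y_train = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 0]]

model, label_sets = fit_label_combination(X_train, Y_train)
prediction = predict_label_combination(model, label_sets, [0.9, 0.8])
print(len(label_sets), prediction)
```

One consequence of this design is that the model can only predict label sets that occurred in the training data.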

Conditional Dependency Networks
A Conditional Dependency Network (CDN) is a fully connected bidirectional graphical model that provides an intuitive representation of the dependencies between multiple label variables, together with a well-integrated structure for efficient model training using binary classifiers and for label prediction using Gibbs sampling inference [20]. CDN can effectively exploit label dependency to improve multi-label classification performance. Moreover, it allows a very simple training procedure, while its representation naturally facilitates simple Gibbs sampling inference on the test instances. It can also incorporate a wide range of simple classification algorithms, including both probabilistic and non-probabilistic classifiers. The graphical model of CDN is shown in Figure 1 (d).
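The training and inference scheme described above can be sketched as follows. This is a rough illustration under several assumptions that are not from the source: the conditional probability of each label given the features and all other labels is estimated with a smoothed k-nearest-neighbour vote (the hypothetical `knn_prob` helper), the sampler uses a fixed number of iterations with a burn-in period, and the final labels are obtained by thresholding the sampled marginals at 0.5.

```python
import random
from typing import List, Sequence, Tuple


def knn_prob(X_ext: List[List[float]], y: List[int], query: Sequence[float], k: int = 3) -> float:
    """Laplace-smoothed estimate of P(label = 1 | query) from the k nearest training rows."""
    order = sorted(range(len(X_ext)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(query, X_ext[i])))
    positives = sum(y[i] for i in order[:k])
    return (positives + 1) / (k + 2)


def fit_cdn(X, Y) -> List[Tuple[List[List[float]], List[int]]]:
    """For each label j, store training rows extended with all OTHER labels, plus targets y_j."""
    q = len(Y[0])
    models = []
    for j in range(q):
        X_ext = [list(x) + [v for i, v in enumerate(y) if i != j] for x, y in zip(X, Y)]
        models.append((X_ext, [y[j] for y in Y]))
    return models


def predict_cdn(models, x, n_iter: int = 200, burn_in: int = 50, seed: int = 0) -> List[int]:
    """Gibbs sampling: resample each label from its conditional, then threshold the marginals."""
    rng = random.Random(seed)
    q = len(models)
    state = [rng.randint(0, 1) for _ in range(q)]  # random initial label vector
    counts = [0] * q
    for t in range(n_iter):
        for j, (X_ext, y) in enumerate(models):
            others = [v for i, v in enumerate(state) if i != j]
            p = knn_prob(X_ext, y, list(x) + others)
            state[j] = 1 if rng.random() < p else 0
        if t >= burn_in:  # only keep samples drawn after the burn-in period
            for j in range(q):
                counts[j] += state[j]
    kept = n_iter - burn_in
    return [1 if c / kept >= 0.5 else 0 for c in counts]


# Invented toy data: two well-separated clusters with perfectly correlated labels.
X_train = [[0.0], [0.5], [1.0], [9.0], [9.5], [10.0]]
Y_train = [[0, 0], [0, 0], [0, 0], [1, 1], [1, 1], [1, 1]]

models = fit_cdn(X_train, Y_train)
prediction = predict_cdn(models, [9.5])
print(prediction)
```

Unlike a classifier chain, every conditional model here depends on all other labels, so no label order has to be chosen; the cost is an iterative sampling step at prediction time.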

Dataset
The dataset used in this study was obtained from the UCI Machine Learning Repository [23]. The dataset consists of demographic information, habits, and historical medical records of 858 patients, with 36 attributes. The dataset is multi-class and multi-label, i.e. each patient can belong to multiple classes (categories) at the same time. Several types of tests can be used to diagnose cervical pre-cancers and cancers. In these experiments, the results of four different medical tests (Hinselmann, Schiller, Cytology, and Biopsy) were used as the target variables to classify. Table 4 summarizes the dataset attributes. Four common learning algorithms (NB, SMO, J48, and RF) were analyzed and compared using the MEKA data mining tool. MEKA is a multi-label extension of the WEKA software and supports predicting multiple output variables for each input instance. The dataset was randomly divided into a training set and a test set containing approximately 70% (566) and 30% (292) of the patients, respectively. To obtain an unbiased estimate of the performance of the four prediction models, 10-fold cross-validation was used.
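The partitioning described above can be sketched with index lists. This is a minimal sketch, not the MEKA procedure used in the study: the function names and the random seed are assumptions, and only the counts (858 patients, 566 for training, 292 for testing, 10 folds) come from the text.

```python
import random
from typing import List, Tuple


def train_test_split_indices(n: int, n_train: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle the instance indices and cut them into a training and a test part."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx[:n_train], idx[n_train:]


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (training indices, validation indices) pairs covering every instance once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]  # every k-th shuffled index goes to fold f
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]


N_PATIENTS = 858  # number of patients reported for the dataset
train_idx, test_idx = train_test_split_indices(N_PATIENTS, n_train=566)
folds = k_fold_indices(N_PATIENTS, k=10)

print(len(train_idx), len(test_idx))
print([len(validation) for _, validation in folds])
```

Since 858 is not divisible by 10, the validation folds cannot all have the same size; in this sketch eight folds hold 86 instances and two hold 85.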

Performance Evaluation Metrics
Multi-label classification methods need different performance measures from those used in traditional single-label classification.
In multi-label classification, the performance on all labels should be taken into account [24][25][26][27]. A number of evaluation measures are available for multi-label classification; of these, accuracy, Hamming loss, exact match ratio, and rank loss were used in this study, as described in Table 5. In this table, $(x_i, Y_i)$ are the instances of the multi-label dataset for $i = 1, 2, 3, \dots, n$, $Y_i \subseteq L$ is the set of true labels, $L = \{l_j : j = 1, 2, 3, \dots, q\}$ is the set of all labels, $Z_i$ is the set of labels predicted by an algorithm, $r_i(l)$ is the rank predicted by the label ranking (LR) method for the label $l$, $I$ is the indicator function defined as $I(\text{true}) = 1$ and $I(\text{false}) = 0$, $\Delta$ stands for the symmetric difference of two sets, and $\overline{Y_i}$ is the complementary set of $Y_i$ with respect to $L$.
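For reference, the four measures are commonly defined in the multi-label literature as shown below. This is a sketch of the standard forms, not necessarily the exact variants reported by MEKA. The symbols are assumptions chosen for this sketch: $n$ is the number of instances, $Y_i$ and $Z_i$ are the true and predicted label sets of instance $i$, $L$ is the set of all labels, $\overline{Y_i}$ is the complement of $Y_i$ in $L$, $r_i(l)$ is the predicted rank of label $l$ (rank 1 being the most relevant), and $I$ is the indicator function.

```latex
\begin{align*}
\text{Accuracy} &= \frac{1}{n}\sum_{i=1}^{n}\frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|} \\
\text{Hamming Loss} &= \frac{1}{n}\sum_{i=1}^{n}\frac{|Y_i \,\Delta\, Z_i|}{|L|} \\
\text{Exact Match} &= \frac{1}{n}\sum_{i=1}^{n} I(Y_i = Z_i) \\
\text{Rank Loss} &= \frac{1}{n}\sum_{i=1}^{n}\frac{1}{|Y_i|\,|\overline{Y_i}|}
  \left|\left\{(l_a, l_b) \in Y_i \times \overline{Y_i} : r_i(l_a) > r_i(l_b)\right\}\right|
\end{align*}
```

Accuracy and exact match are better when higher, with a best value of 1; Hamming loss and rank loss are better when lower, with a best value of 0.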

Results and Discussion
Multi-label classification was performed on the cervical cancer dataset using four classification methods and four learning algorithms. Their performance was evaluated based on accuracy (AC), exact match (EM), Hamming loss (HL), and rank loss (RL). Figure 2 shows the evaluation of the classifiers based on these performance measures. As seen in Figure 2, the accuracy of the examined algorithms was above approximately 80%, except for J48-BR and J48-CDN. Similar behavior was observed for exact match, where J48-BR and J48-CDN showed values lower than 80%. All algorithms combined with the CC and LC methods yielded similar accuracy, exact match, Hamming loss, and rank loss results.

Measure: Description
Accuracy: The percentage of correctly classified labels relative to the total number of labels for each example.
Hamming Loss: The opposite of the Hamming score; it evaluates how many times a sample-label pair is incorrectly classified. Its best value is 0.
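The measures described above, together with exact match and rank loss, can be computed directly from label sets. The sketch below follows the usual set-based definitions and makes several assumptions of its own: label sets are Python sets, the rank loss takes per-label scores in which a higher score means a more relevant label, instances with an empty true or irrelevant set contribute zero to the rank loss, and the example labels and scores are invented.

```python
from typing import Dict, List, Sequence, Set


def accuracy(Y: List[Set[str]], Z: List[Set[str]]) -> float:
    """Mean ratio of correctly predicted labels to all labels in the true or predicted set."""
    total = 0.0
    for y, z in zip(Y, Z):
        union = y | z
        total += len(y & z) / len(union) if union else 1.0
    return total / len(Y)


def hamming_loss(Y: List[Set[str]], Z: List[Set[str]], n_labels: int) -> float:
    """Fraction of instance-label pairs that are misclassified (symmetric difference)."""
    return sum(len(y ^ z) for y, z in zip(Y, Z)) / (len(Y) * n_labels)


def exact_match(Y: List[Set[str]], Z: List[Set[str]]) -> float:
    """Fraction of instances whose predicted label set equals the true label set."""
    return sum(1 for y, z in zip(Y, Z) if y == z) / len(Y)


def rank_loss(Y: List[Set[str]], scores: List[Dict[str, float]], labels: Sequence[str]) -> float:
    """Mean fraction of (relevant, irrelevant) label pairs ordered the wrong way round."""
    total = 0.0
    for y, s in zip(Y, scores):
        irrelevant = [l for l in labels if l not in y]
        if not y or not irrelevant:
            continue  # assumption: no pairs to compare, so the instance contributes 0
        wrong = sum(1 for a in y for b in irrelevant if s[a] < s[b])
        total += wrong / (len(y) * len(irrelevant))
    return total / len(Y)


# The four target variables of the study; the label sets and scores below are invented.
LABELS = ["Hinselmann", "Schiller", "Cytology", "Biopsy"]
Y_true = [{"Schiller", "Biopsy"}, {"Cytology"}]
Z_pred = [{"Schiller"}, {"Cytology"}]
SCORES = [
    {"Hinselmann": 0.1, "Schiller": 0.9, "Cytology": 0.2, "Biopsy": 0.15},
    {"Hinselmann": 0.1, "Schiller": 0.2, "Cytology": 0.8, "Biopsy": 0.3},
]

acc = accuracy(Y_true, Z_pred)
hl = hamming_loss(Y_true, Z_pred, len(LABELS))
em = exact_match(Y_true, Z_pred)
rl = rank_loss(Y_true, SCORES, LABELS)
print(acc, hl, em, rl)
```

In this toy example the first prediction misses one of two true labels and the second is exact, so exact match is stricter than accuracy on the same predictions.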

Conclusion
Multi-label classification (MLC) has recently received significant attention in the machine learning literature. In this study, a comparison of some popular multi-label classification methods was presented. The methods were assessed using a cervical cancer dataset of 858 patients with 36 risk factors. Different combinations of multi-label classification methods with four learning algorithms were compared. The performance of the methods was measured and evaluated with four measures: Hamming loss, accuracy, rank loss, and exact match. The results of this study may help researchers and physicians in diagnosing cervical cancer at an early stage.