The Impact of Feature Selection on Urban Land Cover Classification

Many land cover classification studies in the literature focus on feature extraction and classification rather than feature selection. In this paper, the impact of feature selection on urban land cover classification is extensively analyzed. Three types of features, namely spectral, texture, and size/shape features, are used for this analysis. The analysis is carried out using three variations of a filter-based feature selection method and three widely known classification algorithms. The feature selection method used for the comparison is a multivariate filter method, namely correlation-based feature subset selection, in which a feature subset evaluator is integrated with a search method. Best first search, genetic search, and greedy stepwise search are the three search methods used for this integration. The classification algorithms employed are Bayesian network, random forest, and support vector machine. The experimental results indicate that feature selection improves classification accuracy in all cases. In addition, according to the experimental results, random forest is the most successful of the three classifiers both with and without feature selection. The largest improvement in classification performance is obtained when the greedy stepwise search based feature selection method and the support vector machine classifier are applied together. Also, spectral features contribute more to classification performance than size/shape and texture features.


Introduction
The dynamic development of urban areas has created an increasing need for automatic identification of urban land cover types [1]. This makes land cover information a vital input to various processes such as developmental, environmental, and resource planning applications [2]. Hence, extracting urban land cover information from high resolution images and automatically classifying this information into various land cover types is a common research topic in remote sensing [3][4][5]. There are two main approaches for extracting land cover information from high resolution data: pixel-based image classification [6] and geospatial object-based image analysis [7]. Pixel-based approaches analyze the spectral properties of every pixel within the area of interest. In this analysis, spatial or contextual information related to these pixels is not taken into account [8]. In geospatial object-based approaches, the image is segmented into relatively homogeneous regions generally referred to as segments or image objects [6][9][10][11].
In the literature, there exist numerous studies dealing with classification of urban land cover, and various approaches are utilized in these studies. As an example, QuickBird image data over a central region in the city of Phoenix, Arizona is used to examine the performance of object-based classifiers for identifying urban classes [10]. According to the experimental results, the object-based classifier achieves a high overall accuracy, while one of the most commonly used pixel-based classification methods, namely the maximum likelihood classifier, produces a lower overall accuracy. It is stated that the object-based classifier is significantly better than the classical per-pixel classifiers.
In addition, the performances of pixel-based and object-based image analysis approaches for classifying broad land cover classes over agricultural landscapes are evaluated using three separate supervised machine learning algorithms, namely decision tree, random forest, and support vector machine classifiers [12]. It is reported that the difference between the overall classification accuracies of pixel-based and object-based classifications is not statistically significant when the same machine learning algorithms are applied. Landsat ETM+ images of a mountainous area in Mexico are used to analyze the performance of combined object-based and pixel-based land cover classification [13]. It is stated that the combination method produces the best results among all in terms of overall accuracy. The performance of different classification methods for land cover mapping in the vicinity of the Alto Ribeira Tourist State Park, a Brazilian Atlantic rainforest area, is investigated in a more recent study [14]. The classification results show that the object-based classification clearly outperforms the pixel-based classification in terms of accuracy, where the accuracies are 89.7% and 57.8% for object-based and pixel-based classification, respectively. A multi-scale approach is used for classifying land cover in a high resolution image of an urban area [15]. In this work, super-object information is used as additional input data for image classification. The accuracies of classifications including super-object variables are compared with the classification accuracies of image segmentations not including super-object information. The experimental results show that the accuracy of the system that uses super-object information is higher than that of the one not including super-object information.
Furthermore, three types of classification methods, namely the pixel-based and object-based versions of support vector machine and the pixel-based version of decision tree, are used to classify a SPOT 5 satellite image into land cover types [16]. According to the experimental results, the object-based version of the support vector machine classifier is the best performer among all in terms of accuracy.
In the literature, feature selection techniques are commonly applied to various types of image classification tasks. Since the land cover information extracted from images can be high-dimensional, selecting smaller subsets consisting of informative features may be useful for classification, as is the case in various pattern recognition applications. However, only a limited number of studies deal with feature selection for tasks related to land cover classification. As an example, the impact of feature selection is analyzed for classification of urban structure types with the maximum likelihood classifier [17]. In this study, 50 spatial feature types are calculated initially and the sequential forward feature selection method is subsequently employed. As another example, the impact of feature selection on support vector machine classification of two hyperspectral sensor data sets is analyzed with four separate feature selection methods [18]. According to the experimental results, the accuracy of classification may decline significantly with the addition of features when a small training sample is used. However, it is reported that feature selection may be useful when a large training sample is available, which makes it a valuable process for support vector machine classification. In addition, a class-based feature selection method based on a fast constrained search algorithm is employed for classification of hyperspectral data [19]. In this study, a new scheme for feature selection employing a Bayesian classifier is proposed and the experimental results show that the proposed method increases the effectiveness of the classifier in terms of accuracy. Furthermore, the classification tree analysis feature selection method is employed for object-based land cover classification and it is evaluated with support vector machine, maximum likelihood, and neural network classification algorithms [20].
The total number of object features extracted is 47 and the number of selected features varies between 1 and 22. The experimental results show that both the support vector machine and neural network classifiers produce stable results as the feature dimension increases towards 22. On the other hand, the accuracy of the maximum likelihood classifier decreases considerably when the feature dimension increases towards 22. It is reported that the eigen-space projection-based parameter selection method provides better classification accuracy than other feature selection methods. Furthermore, land cover classification frameworks including a feature selection stage and a support vector machine classifier are analyzed [21]. In this study, a third-order class-dependent mutual information based feature selection method is proposed and compared with three separate feature selection methods, namely maximum mutual information, maximum-relevance minimum-redundancy, and conditional mutual information maximization. It is stated that the proposed method gives a comparatively better ranking than the others.
Many of the studies in the literature focus on the feature extraction and classification parts of the land cover classification task. Feature selection, however, receives less attention in these studies, and the analysis of feature selection for urban land cover classification is listed as future work in one recent study [15]. For this purpose, in this study, an extensive analysis is carried out to measure the impact of feature selection on object-based urban land cover classification. For this analysis, a recently published public dataset including features extracted from a high resolution image of an urban area is used. This dataset includes the spectral, texture, and size/shape features obtained at the end of the feature extraction process for object-based land cover classification.
In order to evaluate the feature selection process, the performances of three variations of a multivariate filter-based feature selection method and three widely used classifiers are compared. This comparison is made with the well-known F-Measure, and 10-fold cross-validation is used in the experiments for fair evaluation. In addition, the profiles of the feature sets obtaining higher scores are analyzed to detect common informative features and their corresponding categories.
The rest of the paper is organized as follows: Section 2 describes the features used in the experiments. The feature selection methods employed in the experiments are explained in Section 3. Section 4 explains the classifiers utilized in the experiments. The experimental work is presented in Section 5. Finally, some concluding remarks are given in Section 6.

Feature Extraction
The features used in this study can be categorized as spectral, texture, and size/shape features. Detailed information about these features is given below.

Spectral Features
Mean values for each band (green, red, near infrared), brightness and normalized difference vegetation index (NDVI) can be listed as spectral features.
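
As a rough illustration of how two of these spectral features can be derived for a single image object, the sketch below assumes the standard NDVI definition, (NIR - Red) / (NIR + Red), and takes brightness as the average of the three band means. The text does not state the exact formulas used to build the dataset, and the band values are made up.

```python
def ndvi(nir: float, red: float) -> float:
    """Standard NDVI definition (an assumption; the text does not give the formula)."""
    denom = nir + red
    return 0.0 if denom == 0 else (nir - red) / denom


def brightness(green: float, red: float, nir: float) -> float:
    """Brightness taken as the mean of the band means (an assumption)."""
    return (green + red + nir) / 3.0


# Hypothetical per-object mean band values, for illustration only.
obj = {"green": 90.0, "red": 60.0, "nir": 180.0}

obj_ndvi = ndvi(obj["nir"], obj["red"])
obj_brightness = brightness(obj["green"], obj["red"], obj["nir"])
print(obj_ndvi, obj_brightness)
```

Vegetated objects such as trees and grass would be expected to have a high NDVI, which is one plausible reason spectral features are informative for this task.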

Texture Features
Texture features consist of three types of grey-level co-occurrence matrix (GLCM) measures and the standard deviation values of the three spectral bands (green, red, near infrared). The GLCM textures are calculated from the NIR band of the source image.

Size/Shape Features
The size/shape features used in this study can be listed as border index, shape index, area, round, compactness, length/width, rectangularity, density, asymmetry, and border length.

Feature Selection
In order to determine the best performing feature subsets, three variations of a multivariate filter-based feature selection method are utilized in this study. The feature selection method used for the comparison is the widely known correlation-based feature subset selection method [22], which consists of a feature subset evaluator and a search method. The three variations of the correlation-based feature subset selection method use best first search, genetic search, and greedy stepwise search, respectively. WEKA [23] software is used to carry out the feature selection processes. These three methods are explained in the next subsections.

Best first search based feature selection (BFSFS)
BFSFS involves best first search (BFS) as a part of the correlation-based feature subset selection method. BFS searches the space of attribute subsets by greedy hill climbing with a backtracking facility [24,25]. BFS can search either forward or backward, depending on whether it starts with the empty set of attributes or the full set of attributes. In the experiments, BFS is executed as a forward search, in which the input to the algorithm is an empty attribute subset. Moreover, the search termination criterion is set to 5, i.e. BFS terminates the search after 5 consecutive backtracks.

Genetic search based feature selection (GSFS)
GSFS includes genetic search (GS) as a part of the correlation-based feature subset selection method. GS is a suboptimal search method inspired by the process of biological evolution [26]. The main idea behind GS is the survival of the fittest solutions among the potential solutions for a specific problem. In its initialization phase, GS generally begins with a random sample of candidate attribute subsets, which is also known as a population. New generations are obtained by applying the genetic operators, namely crossover and mutation, to these attribute subsets, which are also referred to as chromosomes. The chromosomes are encoded with a binary (0, 1) alphabet. The positions in a chromosome set to "1" indicate the selected attributes, while the ones set to "0" indicate attributes which are not selected. As an example, the chromosome {0 1 0 1 0 0 0 1} specifies that the 2nd, 4th, and 8th attributes are used while the others are disregarded. In the evaluation phase of GS, each chromosome in the population is evaluated using a fitness function, which is a kind of success measure. In each step of GS, a proportion of the existing population is selected to create the new generation. This process is repeated until a termination condition is reached [27]. In the experiments, the default parameters of the WEKA [23] software are used. Thus, the population size, crossover probability, mutation probability, and maximum number of generations (the termination condition) were set to 20, 0.6, 0.033, and 20, respectively.
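
The binary chromosome encoding and the two genetic operators described above can be sketched as follows. This is only an illustration: single-point crossover and bit-flip mutation are assumed here as common choices, not as a description of what WEKA does internally, and all function names are hypothetical.

```python
import random


def decode(chromosome):
    """Return the 1-based positions of the selected attributes."""
    return [i + 1 for i, bit in enumerate(chromosome) if bit == 1]


def crossover(a, b, point):
    """Single-point crossover (an assumed operator variant)."""
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chromosome, p_mut, rng):
    """Flip each bit independently with probability p_mut."""
    return [1 - bit if rng.random() < p_mut else bit for bit in chromosome]


# The chromosome from the text: attributes 2, 4, and 8 are selected.
example = [0, 1, 0, 1, 0, 0, 0, 1]
selected = decode(example)
print(selected)

# A second, made-up parent to demonstrate the operators.
parent_b = [1, 0, 0, 0, 1, 1, 0, 0]
child_1, child_2 = crossover(example, parent_b, point=4)

rng = random.Random(0)
mutated = mutate(child_1, p_mut=0.033, rng=rng)  # mutation probability from the text
```

In a full search, each chromosome would be scored by the subset evaluator, and fitter chromosomes would be more likely to be chosen as parents for the next generation.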

Greedy stepwise search based feature selection (GSSFS)
GSSFS involves greedy stepwise search (GSS) as a part of the correlation-based feature subset selection method. GSS performs a greedy search either forward or backward through the space of attribute subsets [28]. The search can be initialized with either the empty set of attributes or the full set of attributes. Unlike BFS, GSS does not perform backtracking on the search space of attribute subsets. GSS stops when the addition or deletion of any remaining attribute results in a decrease in evaluation.
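
A correlation-based subset evaluator rewards subsets whose features correlate with the class and penalizes redundancy among the features themselves. The sketch below pairs such an evaluator with a forward greedy search. It assumes the merit heuristic from the CFS literature cited as [22], merit = k * r_cf / sqrt(k + k(k-1) * r_ff), and uses precomputed absolute correlations as a stand-in for whatever association measure WEKA computes internally. The function names and the toy correlation values are hypothetical.

```python
import numpy as np


def cfs_merit(subset, r_cf, r_ff):
    """CFS merit of a subset: high class correlation, low inter-feature correlation."""
    k = len(subset)
    if k == 0:
        return 0.0
    mean_cf = np.mean([r_cf[i] for i in subset])
    if k == 1:
        mean_ff = 0.0
    else:
        pairs = [r_ff[i, j] for a, i in enumerate(subset) for j in subset[a + 1:]]
        mean_ff = np.mean(pairs)
    return float(k * mean_cf / np.sqrt(k + k * (k - 1) * mean_ff))


def greedy_forward(r_cf, r_ff):
    """Forward greedy search without backtracking.

    Stops as soon as no remaining feature strictly improves the merit.
    """
    selected, best = [], 0.0
    remaining = set(range(len(r_cf)))
    while remaining:
        scored = [(cfs_merit(selected + [f], r_cf, r_ff), f) for f in sorted(remaining)]
        merit, feat = max(scored)
        if merit <= best:
            break
        selected.append(feat)
        remaining.remove(feat)
        best = merit
    return selected, best


# Toy example: features 0 and 1 are nearly redundant, feature 2 is complementary.
r_cf = np.array([0.80, 0.75, 0.60])          # |feature-class correlation|
r_ff = np.array([[1.00, 0.95, 0.10],
                 [0.95, 1.00, 0.10],
                 [0.10, 0.10, 1.00]])         # |feature-feature correlation|

selected, best = greedy_forward(r_cf, r_ff)
print(selected, round(best, 4))
```

In this toy case the search keeps features 0 and 2 and leaves out feature 1, even though feature 1 has the second-highest class correlation, because it adds little beyond feature 0.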

Classifiers
In order to investigate the contribution of the selected features to classification performance, three different classification algorithms were employed. The first classifier is Random Forest [29], which is a non-linear classifier. The second one is a linear support vector machine classifier [30]. The third and last classifier is a Bayesian network classifier [31].

Random Forest (RF)
Ensemble classification algorithms consist of multiple classifiers and have attracted increasing interest because they are often more accurate than an individual classifier. RF is an ensemble classification method involving a combination of decision tree classifiers, where each classifier is constructed using a random vector sampled independently from the training set [32]. For classification, each decision tree in an RF contributes a single vote for determining the class label of an input vector. Then, the output of the RF classifier is determined by a kind of majority voting. RF can handle high-dimensional data.
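
The voting step can be illustrated with a few lines of code. This is a minimal sketch of plain majority voting over the class labels predicted by individual trees; the votes are made up, and the tie-breaking behavior (first label encountered wins) is an arbitrary choice of this sketch.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    counts = Counter(tree_predictions)
    return counts.most_common(1)[0][0]


# Hypothetical votes from five trees for one image object.
votes = ["grass", "trees", "grass", "asphalt", "grass"]
label = majority_vote(votes)
print(label)
```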

Support Vector Machines (SVMs)
SVM aims to find a hyperplane that successfully separates the samples into two classes. In order to make an effective separation, it is necessary to find a decision boundary that minimizes misclassifications [33]. The essential concept of SVM is the margin [29], which is the distance of the closest samples from the decision boundary. The main objective of SVM is to find the hyperplane that maximizes the margin. For this purpose, it is necessary to detect the support vectors, which are the data points that lie at the border between the two classes. An SVM can be either a linear or a nonlinear classifier depending on its kernel type, and the widely known kernel functions are the linear, polynomial, radial basis function, and sigmoid kernels [34]. In this study, an SVM with a linear kernel is employed in the experiments.

Bayesian network (BN)
A BN is a directed acyclic graph with a conditional probability distribution for each node [35]. Each node in this graph represents an attribute, and each arc between these nodes represents a probabilistic dependency. A BN can be used to compute the conditional probability of an attribute using the values assigned to the other attributes. Once a directed acyclic graph has been constructed, the joint probability of any particular instantiation of all n variables in a BN can be calculated as follows:
$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \pi_i)$$
where x_i represents the instantiation of the variable X_i and π_i represents the instantiation of the parents of X_i [36].
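
As a small worked example of this factorization, in which the joint probability is the product of each node's probability given the values of its parents, the sketch below uses a made-up three-node network (A is the parent of both B and C). The structure, the probabilities, and the names are all hypothetical.

```python
from itertools import product

# Hypothetical network structure: A -> B and A -> C.
parents = {"A": (), "B": ("A",), "C": ("A",)}

# Conditional probability tables, indexed by the tuple of parent values.
cpt = {
    "A": {(): {True: 0.3, False: 0.7}},
    "B": {(True,): {True: 0.9, False: 0.1}, (False,): {True: 0.2, False: 0.8}},
    "C": {(True,): {True: 0.6, False: 0.4}, (False,): {True: 0.5, False: 0.5}},
}


def joint_probability(assignment):
    """Multiply P(x_i | parents of X_i) over all nodes."""
    p = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p *= cpt[var][pa_values][assignment[var]]
    return p


# P(A=T, B=T, C=F) = P(A=T) * P(B=T | A=T) * P(C=F | A=T) = 0.3 * 0.9 * 0.4
p = joint_probability({"A": True, "B": True, "C": False})

# Sanity check: the joint probabilities over all instantiations sum to 1.
total = sum(
    joint_probability(dict(zip("ABC", values)))
    for values in product([True, False], repeat=3)
)
print(p, total)
```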

Experimental Work
In the experimental work, an in-depth investigation is carried out to analyze the impact of three different variations of the correlation-based feature selection method on urban land cover classification. The experimental settings, including the utilized dataset and classification algorithms, are first briefly described. Then, the profiles of the features selected by the three feature selection methods and their corresponding accuracy scores are provided.

Settings
The dataset used in the experiments is obtained from a recent remote sensing study whose aim is to classify a high resolution aerial image into 9 types of urban land cover [15]. This high resolution image data is collected from an area around the city of Deerfield Beach, Florida. The urban land cover types in the dataset are concrete, shadows, trees, asphalt, buildings, grass, pools, cars, and soil. The dataset originally contains separate train and test splits. However, in this study, these splits are combined in order to make a fair evaluation by applying the 10-fold cross-validation technique. The combined dataset consists of 675 samples and 147 features. The class distribution of the dataset is indicated in Table 1. It is also necessary to note that the features are calculated for seven different segmentation scale parameters, ranging from 20 to 140 at an interval of 20. Consequently, the number of features used in the study is 147 although the number of feature types is 21. WEKA software is used in the experiments to evaluate the feature selection methods and classification algorithms. The experiments consist of two parts. In the first part, RF, SVM, and BN classifiers are built using this dataset. These classifiers consider all features in the dataset. The first part is performed in order to see the performance of the classifiers without feature selection. In the second part, feature selection is performed on the urban land cover dataset using the three methods described in Section 3. The resulting feature subsets are used to build BN, RF, and SVM classifiers. The traditional F-Score metric is used for evaluation in all of the experiments. Then, all cases are compared in order to identify the best performing setting and the impact of the feature selection methods on urban land cover classification.
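
The feature count and the cross-validation setup can be checked with a short sketch. The per-group feature type counts follow the lists given in Section 2 (5 spectral, 6 texture, 10 size/shape). The fold construction is a plain shuffled split written for illustration; it is not stratified by class, and it is not claimed to reproduce how WEKA builds its folds.

```python
import random

# Feature types per group, counted from the lists in Section 2.
feature_types = {"spectral": 5, "texture": 6, "size/shape": 10}
scales = list(range(20, 141, 20))  # segmentation scales 20, 40, ..., 140

n_feature_types = sum(feature_types.values())
n_features = n_feature_types * len(scales)
print(n_feature_types, len(scales), n_features)


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


folds = k_fold_indices(675, 10)
fold_sizes = sorted(len(f) for f in folds)

# Each fold serves once as the test set; the other nine form the training set.
for test_fold in folds:
    test_set = set(test_fold)
    train = [i for f in folds for i in f if i not in test_set]
    assert len(train) + len(test_fold) == 675
```

With 675 samples, five folds hold 68 samples and five hold 67, so every sample is used for testing exactly once.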

Profiles of reduced feature sets
In this subsection, the profiles of the reduced feature sets are analyzed for the three feature selection methods. Table 2 shows the ratio of the selected features among all features. In this table, the column FS refers to the feature selection method employed, SRF refers to the size of the reduced feature subset, and TNF refers to the total number of features. The bold ids in the table indicate the common features that are selected by all three methods. As shown in Section 2, the features in the dataset can be grouped into size/shape, texture, and spectral features. In this subsection, the features listed in Table 2 are also analyzed according to their membership in these groups. For BFSFS, the ratios of size/shape, texture, and spectral features are 24%, 17%, and 59%, respectively. For GSFS, the ratios of size/shape, texture, and spectral features are 30%, 18%, and 52%, respectively. Similarly, for GSSFS, the ratios of those features are 25%, 18%, and 57%, respectively. In the light of this information, it can be said that the contribution of spectral features is greater than that of size/shape and texture features. Also, the contribution of size/shape features is greater than that of texture features. These findings are valid for all three feature selection methods. Apart from these, the ratios of common features for BFSFS, GSFS, and GSSFS are 79%, 40%, and 82%, respectively.

Accuracy Analysis
In this subsection, the individual performances of the three classifiers and the contribution of the feature selection methods to classification performance are analyzed. Tables 3-5 show the corresponding F-Scores for the RF, SVM, and BN classifiers, respectively. In the tables, the first column gives the performance scores when feature selection is not applied. The other columns show the performance scores when the BFSFS, GSFS, and GSSFS methods are used for feature selection, respectively. As mentioned before, 10-fold cross-validation is used in the experiments for fair evaluation. According to Tables 3-5, feature selection improves classification performance, and the best F-Scores are obtained when the GSSFS feature selection method is employed, for all three classifiers. The best performances obtained for the RF, SVM, and BN classifiers are 0.876, 0.803, and 0.866, respectively. According to the tables, the best performing setting is the combination of the RF classifier and the GSSFS feature selection method. The second-best feature selection method is BFSFS, except in the case where the SVM classifier is employed. The increases in F-Score are 0.037, 0.08, and 0.048 for RF, SVM, and BN, respectively. Thus, the impact of feature selection on the SVM classifier is greater than on the RF and BN classifiers. In addition, according to the class-based F-Scores, the best performing class changes for different settings. However, the recognition ratio of the soil class is the worst among all classes.
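
For reference, the F-Score of a class is the harmonic mean of its precision and recall. The sketch below computes it from true positive, false positive, and false negative counts, and then forms a support-weighted average over classes. The counts are invented for illustration, and the weighted averaging is an assumption about how a single overall score could be formed, since the text does not specify the averaging scheme.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (tp, fp, fn) counts for two classes.
counts = {"grass": (80, 20, 10), "soil": (10, 10, 10)}

per_class = {c: f_measure(*v) for c, v in counts.items()}
support = {c: tp + fn for c, (tp, fp, fn) in counts.items()}  # true instances per class

# Support-weighted average across classes (an assumed averaging scheme).
weighted_f = sum(per_class[c] * support[c] for c in counts) / sum(support.values())
print(per_class, weighted_f)
```

A class with few samples and many confusions, like the made-up soil counts here, pulls its own F-Score down sharply while affecting a support-weighted average much less.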

Conclusion
In this study, the impact of feature selection was extensively analyzed using three different variations of a multivariate filter-based feature selection method and three widely known classifiers. For this analysis, a recently published dataset including 147 features was employed. Experiments were carried out using 10-fold cross-validation, and the success measure used in the experiments was the widely known F-Score. According to the experimental results, feature selection improves classification performance in all cases, and the RF classifier is more successful than the BN and SVM classifiers. Also, the profiles of the reduced feature sets show that spectral features contribute more to classification performance than size/shape and texture features. As future work, the impact of feature selection may also be analyzed for other remote sensing tasks for which, as with urban land cover classification, only a limited number of studies exist.