Comparison of Classification Techniques on Energy Efficiency Dataset

Data mining can be defined as the extraction of information or knowledge from large volumes of data. Statistical and machine learning techniques are used to determine the models on which data mining predictions are based. Today, data mining is used in many different areas such as science and engineering, health, commerce, shopping, banking and finance, education and the internet. This study makes use of WEKA (Waikato Environment for Knowledge Analysis) to compare different classification techniques on the energy efficiency dataset. Ten different data mining methods, namely Bagging, Decorate, Rotation Forest, J48, NNge, K-Star, Naïve Bayes, Dagging, Bayes Net and JRip, were applied to the energy efficiency dataset taken from the UCI Machine Learning Repository. A comparison of the algorithms' performance shows that Rotation Forest had the highest accuracy, whereas Dagging had the lowest.


Introduction
With developments in information technology and database software, immense amounts of data are being collected. This volume of data has itself become one of the obstacles to meaningful knowledge extraction. Although the collected data contains hidden patterns, as the amount of data increases it can no longer be converted into useful information by traditional methods. Consequently, fairly new methods known as data mining methods have become widespread in practice for analyzing such data [1]. Data mining is used as an information source to find associations and to perform classification, clustering and estimation by using information discovery systems, which combine data warehouses, artificial intelligence techniques and statistical methods [2] [3]. Classification is a method frequently used in data mining to uncover hidden patterns in a database. It assigns each data object to one of several predefined classes. Well-defined characteristics play a key role in the performance of the classifier. Classification is based on a learning algorithm. Training is not carried out on all of the data; it is performed on a sample drawn from the data collection. The purpose of learning is to create a classification model. In other words, classification is the process of determining the class of an unknown record [4][5] [6].
The energy consumption of buildings has received increasing interest in today's economies. Buildings are substantial consumers of energy worldwide, and this trend has grown over the past few decades due to rising living standards, so the issue has drawn considerable attention. The largest part of the energy consumption in residential buildings is due to heating, ventilation and air-conditioning systems. The high energy consumption of buildings and the increase in building energy demand require the design of energy-efficient buildings and an improvement of their energy performance. One way to reduce the increased energy demand is to have more energy-efficient building designs. Another significant issue is the effect of this continuous increase in energy consumption on the environment. According to the United Nations Environment Program (UNEP), buildings use about 40% of global energy, 25% of global water and 40% of global resources. There is a major danger that, as a consequence of global warming and climate change, energy demand and CO2 emissions will increase even further in many countries. In particular, a building's design has a major impact on its energy footprint. In order to reduce the impact of building energy consumption on the environment, the European Union has adopted a directive requiring European countries to conform to proper minimum requirements regarding energy efficiency [7] [8]. When designing energy-efficient buildings, it is important for architects, engineers and designers to identify which parameters will significantly influence future energy demand. Once these parameters have been identified, architects and building designers usually need simple and reliable methods for rapidly estimating building energy performance, so that they can optimize their design plans. In recent years, several methods have been proposed for modeling building energy demand. Analytic computer codes are often used to estimate the flow of energy and the performance of energy systems in buildings [7] [9].

Literature Survey
Tsanas and Xifara (2012) developed a statistical machine learning framework to study the effect of eight input variables (relative compactness, surface area, wall area, roof area, overall height, orientation, glazing area, glazing area distribution) on two output variables, namely heating load (HL) and cooling load (CL), of residential buildings. Extensive simulations on 768 diverse residential buildings show that HL and CL can be predicted from these variables [10]. Castelli et al. (2015) proposed a genetic programming-based framework for estimating the energy performance (the heating load and the cooling load) of residential buildings. The proposed framework blends a recently developed version of genetic programming with a local search method and linear scaling. The resulting system makes it possible to build a model that produces an accurate estimate of both parameters. Extensive simulations on 768 diverse residential buildings confirm the suitability of the proposed method for predicting heating load and cooling load [7].

Dataset
The energy efficiency dataset used in this study was taken from the UCI Machine Learning Repository. The energy analysis was performed using 12 different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. Various settings were simulated as functions of these characteristics to obtain 768 building shapes. The dataset comprises 768 samples and 8 features, with the aim of predicting two real-valued responses. It can also be treated as a multi-class classification problem if the response is rounded to the nearest integer [11]. In this study, we investigate the effect of eight input variables, relative compactness (X1), surface area (X2), wall area (X3), roof area (X4), overall height (X5), orientation (X6), glazing area (X7), and glazing area distribution (X8), on the output variables heating load (Y1) and cooling load (Y2) of residential buildings. In other words, the dataset contains eight attributes (or features, denoted by X1...X8) and two responses (or outcomes, denoted by Y1 and Y2), and the aim is to use the eight features to predict each of the two responses [10] [11].
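The rounding step that turns each real-valued response into a class label can be sketched as follows. This is only an illustration: the load values are hypothetical, and the tie-breaking rule (halves round up) is an assumption, since the text does not specify one. Python's built-in round() rounds halves to the nearest even integer, so the sketch uses floor(x + 0.5) instead.

```python
import math


def to_class_label(load: float) -> int:
    """Round a real-valued load to the nearest integer class (halves round up)."""
    return math.floor(load + 0.5)


# Hypothetical heating-load values, not taken from the dataset.
example_loads = [15.55, 21.46, 32.5, 10.49]
labels = [to_class_label(value) for value in example_loads]
print(labels)
```

Each distinct integer then acts as one class, so the number of classes depends on the range of the response.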

Software-WEKA
Weka (Waikato Environment for Knowledge Analysis) is written in Java and was developed at the University of Waikato, New Zealand [11]. Weka supports several standard data mining tasks, more specifically data preprocessing, clustering, classification, regression, visualization, and feature selection. All of Weka's techniques are predicated on the assumption that the data is available as a single flat file or relation, where each data point is described by a fixed number of attributes (normally numeric or nominal attributes, but some other attribute types are also supported) [12][13].

Methods
In this study, 10 different classification techniques were applied to the energy efficiency dataset. Brief information about each of these techniques, namely Bagging, Decorate, Rotation Forest, J48, NNge, K-Star, Naïve Bayes, Dagging, Bayes Net and JRip, is given in the following paragraphs.
J48 is an extension of ID3. Its additional features include handling of missing values, decision tree pruning, continuous attribute value ranges, and derivation of rules. In the WEKA data mining tool, J48 is an open source Java implementation of the C4.5 algorithm. WEKA provides a number of options associated with tree pruning, which can be used to counter potential overfitting. In other algorithms, the classification is performed recursively until every single leaf is pure, that is, until the classification of the data is as perfect as possible. J48 generates the rules from which the class of a data item is determined. The objective is to generalize the decision tree progressively until it reaches an equilibrium between flexibility and accuracy [22].
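C4.5, which J48 implements, ranks candidate splits by gain ratio (information gain divided by the split information). The sketch below illustrates that criterion for a single nominal attribute only. It is not WEKA's implementation, it leaves out continuous attributes, missing values and pruning, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio of splitting `labels` by the nominal attribute `values`."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the class labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes attributes with many distinct values.
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data: the first attribute separates the classes, the second does not.
classes = [0, 0, 1, 1]
print(gain_ratio(["low", "low", "high", "high"], classes))
print(gain_ratio(["a", "b", "a", "b"], classes))
```

An attribute that separates the classes perfectly scores highest, while an attribute that is unrelated to the class scores zero.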
NNge learns incrementally by first classifying and then generalizing each new example. It uses a modified Euclidean distance function that handles hyperrectangles, symbolic features, and exemplar and feature weights. Numeric feature values are normalized by dividing each value by the range of values observed. The class predicted is that of the single nearest neighbor. NNge uses dynamic feedback to adjust exemplar and feature weights after each new example is classified. When classifying an example, one or more hyperrectangles may be found that contain the new example but belong to the wrong class. NNge prunes these so that the new example is no longer a member. Once classified, the new example is generalized by merging it with the nearest exemplar of the same class, which may be either a single example or a hyperrectangle. In the former case, NNge creates a new hyperrectangle, whereas in the latter it grows the nearest neighbor to encompass the new example. Overgeneralization, caused by nesting or overlapping hyperrectangles, is not permitted. Before NNge generalizes a new example, it checks whether there are any examples in the affected area of feature space that conflict with the proposed new hyperrectangle. If so, the generalization is aborted, and the example is stored verbatim [23].
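The distance from an example to a hyperrectangle can be sketched roughly as below, assuming numeric features only and an axis-aligned rectangle given by its lower and upper bounds. Exemplar weights, feature weights and symbolic features are deliberately left out, and the function name and sample values are hypothetical, so this is an illustration of the idea rather than NNge's actual distance function.

```python
import math


def rect_distance(point, lower, upper, ranges):
    """Range-normalized Euclidean distance from a point to a hyperrectangle.

    A feature contributes zero when the point lies within [lower, upper]
    on that feature; otherwise it contributes the gap to the nearest bound,
    divided by the observed range of the feature.
    """
    total = 0.0
    for x, lo, hi, feature_range in zip(point, lower, upper, ranges):
        if x < lo:
            gap = lo - x
        elif x > hi:
            gap = x - hi
        else:
            gap = 0.0
        total += (gap / feature_range) ** 2
    return math.sqrt(total)


# Hypothetical two-feature example.
inside = rect_distance((1.5, 3.0), (1.0, 2.0), (2.0, 4.0), (6.0, 8.0))
outside = rect_distance((5.0, 8.0), (1.0, 2.0), (2.0, 4.0), (6.0, 8.0))
print(inside, outside)
```

A point inside the rectangle has distance zero, which is why an example covered by a hyperrectangle of the wrong class forces that rectangle to be pruned.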
K-Star is an instance-based classifier. The class of a test instance is based on the training instances similar to it, as determined by some similarity function. It differs from other instance-based learners in that it uses an entropy-based distance function. Instance-based learners classify an instance by comparing it to a database of pre-classified examples. The fundamental assumption is that similar instances will have similar classifications. The question lies in how to define "similar instance" and "similar classification". The corresponding components of an instance-based learner are the distance function, which determines how similar two instances are, and the classification function, which specifies how instance similarities yield a final classification for the new instance. The K-Star algorithm uses an entropic measure based on the probability of transforming one instance into another by randomly choosing between all possible transformations [20] [24].
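The two components of an instance-based learner, a distance function and a classification function, can be sketched as below. Note that this sketch substitutes plain Euclidean distance and a k-nearest majority vote for K-Star's entropic measure, which is not reproduced here; all names and data are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Distance function: how dissimilar two instances are."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(query, examples, distance=euclidean, k=3):
    """Classification function: majority class among the k nearest examples."""
    nearest = sorted(examples, key=lambda ex: distance(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical pre-classified examples: (features, class label).
examples = [
    ((1.0, 1.0), "low"), ((1.2, 0.9), "low"), ((0.8, 1.1), "low"),
    ((5.0, 5.1), "high"), ((5.2, 4.8), "high"),
]
print(classify((1.1, 1.0), examples))
print(classify((5.0, 5.0), examples))
```

Because the distance function is a parameter, a different similarity measure can be swapped in without touching the classification step.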
The Naïve Bayes algorithm is an intuitive method that uses the conditional probabilities of each attribute value given each class to make a prediction. It uses Bayes' Theorem, a formula that calculates a probability by counting the frequency of values and combinations of values in the historical data. Parameter estimation for Naïve Bayes models uses the method of maximum likelihood. Despite its over-simplified assumptions, it often performs well in many complex real-world situations. One of the major advantages of the Naïve Bayes classifier is that it requires only a small amount of training data to estimate its parameters [25][26].
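The frequency counting described above can be sketched for nominal attributes as follows. The probabilities are plain maximum likelihood estimates (relative frequencies) with no smoothing, so an attribute value never seen with a class gives that class a score of zero. The function names and the toy records are hypothetical.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> counts
    for row, label in zip(rows, labels):
        for index, value in enumerate(row):
            value_counts[(label, index)][value] += 1
    return class_counts, value_counts


def predict(model, row):
    """Return the class maximizing P(class) * product of P(value | class)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total
        for index, value in enumerate(row):
            score *= value_counts[(label, index)][value] / count
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical nominal records: (height, orientation) -> class.
rows = [("high", "north"), ("high", "south"),
        ("low", "north"), ("low", "south"), ("low", "north")]
labels = ["A", "A", "B", "B", "B"]
model = train_naive_bayes(rows, labels)
print(predict(model, ("high", "north")), predict(model, ("low", "north")))
```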
Dagging is a meta classifier that creates a number of disjoint, stratified folds out of the data and feeds each chunk of data to a copy of the supplied base classifier. Predictions are made via majority vote [18][27].
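A minimal sketch of this scheme is given below, assuming a single numeric feature and a hypothetical nearest-class-mean base learner chosen only to keep the example short. The fold construction deals the instances of each class out in turn so that the folds are disjoint and keep roughly the original class proportions; it is not WEKA's code.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_folds):
    """Split instance indices into disjoint folds with similar class mix."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for indices in by_class.values():
        for index in indices:
            folds[position % n_folds].append(index)
            position += 1
    return folds


def train_nearest_mean(xs, ys):
    """Hypothetical base learner: predict the class with the closest mean."""
    sums, counts = defaultdict(float), Counter()
    for x, y in zip(xs, ys):
        sums[y] += x
        counts[y] += 1
    means = {y: sums[y] / counts[y] for y in counts}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))


def dagging_fit(xs, ys, n_folds, train=train_nearest_mean):
    """Train one copy of the base learner on each disjoint fold."""
    return [train([xs[i] for i in fold], [ys[i] for i in fold])
            for fold in stratified_folds(ys, n_folds)]


def dagging_predict(models, x):
    """Combine the base models by majority vote."""
    return Counter(model(x) for model in models).most_common(1)[0][0]


# Hypothetical one-feature data with two classes.
xs = [1.0, 1.2, 0.9, 1.1, 0.8, 1.3, 5.0, 5.2, 4.9, 5.1, 4.8, 5.3]
ys = ["low"] * 6 + ["high"] * 6
models = dagging_fit(xs, ys, n_folds=3)
print(dagging_predict(models, 1.0), dagging_predict(models, 5.0))
```

Since every base model sees only its own fold, each one is trained on a fraction of the data, which is the main design difference from Bagging's overlapping bootstrap samples.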
Bayes Net is a probabilistic graphical model (a type of statistical model) that represents a set of random variables and their conditional dependencies via a directed acyclic graph (DAG) [11] [18].
JRip (Java Repeated Incremental Pruning) is a propositional rule learner that implements Repeated Incremental Pruning to Produce Error Reduction (RIPPER). The initial rule set for each class is generated using IREP [18] [28].

Experimental Study
The Bagging classifier was applied to the energy efficiency dataset and the results shown in Table 1 were obtained. The correct classification ratio is 65.8854% (Y1) and 55.3385% (Y2). Another future direction is testing with datasets from domains other than the standard UCI repository, such as real-life data or data obtained from surveys in different domains.
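The correct classification ratio is the number of correctly classified instances divided by the total number of instances. As a worked check, the sketch below recovers the instance counts that would produce the reported Bagging ratios. It assumes every ratio is computed over all 768 instances, which the text does not state explicitly, so the counts are inferred rather than quoted; the function names are hypothetical.

```python
TOTAL_INSTANCES = 768  # dataset size stated in the Dataset section


def accuracy_percent(correct: int, total: int = TOTAL_INSTANCES) -> float:
    """Correct classification ratio as a percentage."""
    return 100.0 * correct / total


def implied_correct(percent: float, total: int = TOTAL_INSTANCES) -> int:
    """Number of correct predictions implied by a reported percentage."""
    return round(percent / 100.0 * total)


# Reported Bagging ratios for Y1 and Y2.
y1_correct = implied_correct(65.8854)
y2_correct = implied_correct(55.3385)
print(y1_correct, round(accuracy_percent(y1_correct), 4))
print(y2_correct, round(accuracy_percent(y2_correct), 4))
```

Under that assumption the reported figures correspond to 506 and 425 correctly classified instances out of 768.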
The Bagging (Bootstrap Aggregating) algorithm uses bootstrapping (equiprobable selection with replacement) on the training set to create many varied but overlapping new sets. The base algorithm is used to create a different base model instance for each bootstrap sample, and the ensemble output is the average of all base model outputs for a given input [14][15][16][17][18].
Decorate (Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples) iteratively generates an ensemble by learning a new classifier at each iteration. In the first iteration the base classifier is built from the given training dataset, and each successive classifier is built from an artificially generated training dataset, which is the union of the original training data and artificial training examples known as diversity data. The classifier built from the new training dataset is added to the ensemble only if it reduces the ensemble training error; otherwise it is rejected and the algorithm continues iterating. Artificial training examples are generated from the data distribution by probabilistically estimating the value of each attribute. The labels for the new examples are selected with a probability that is inversely proportional to the prediction of the current ensemble. Decorate tries to maximize the diversity of the base classifiers by adding new artificial examples and re-weighting the training data [14][19].
Rotation Forest is a classifier that transforms the dataset to generate an ensemble of classifiers. Each base classifier is trained on features extracted from a different set of attributes. The main goal is to embed feature extraction and to reconstruct an attribute set for each classifier in the ensemble [20][21].
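The bootstrapping step described for Bagging can be sketched as follows. This is an illustration under stated assumptions: the base learner is a hypothetical one-feature nearest-neighbor rule, the data is made up, and because the outputs here are class labels the base models are combined by majority vote rather than by a numeric average.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n indices uniformly at random with replacement."""
    return [rng.randrange(n) for _ in range(n)]


def train_one_nn(xs, ys):
    """Hypothetical base learner: class of the single nearest training value."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda pair: abs(pair[0] - x))[1]


def bagging_fit(xs, ys, n_models, train=train_one_nn, seed=0):
    """Train one base model per bootstrap sample of the training set."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        indices = bootstrap_indices(len(xs), rng)
        models.append(train([xs[i] for i in indices],
                            [ys[i] for i in indices]))
    return models


def bagging_predict(models, x):
    """Majority vote over the base models (the class-label analogue of averaging)."""
    return Counter(model(x) for model in models).most_common(1)[0][0]


# Hypothetical one-feature data with two well-separated classes.
xs = [1.0, 1.2, 0.9, 1.1, 0.8, 1.3, 5.0, 5.2, 4.9, 5.1, 4.8, 5.3]
ys = ["low"] * 6 + ["high"] * 6
models = bagging_fit(xs, ys, n_models=11)
print(bagging_predict(models, 1.0), bagging_predict(models, 5.0))
```

Each bootstrap sample has the same size as the training set but typically omits some instances and repeats others, which is what makes the base models differ.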

Table 1. Accuracy Ratio of Bagging Application

Table 2. Accuracy Ratio of Decorate Application

Table 3. Accuracy Ratio of Rotation Forest Application

Table 4. Accuracy Ratio of J48 Application

Table 5. Accuracy Ratio of NNge Application

Table 6. Accuracy Ratio of KStar Application

Table 7. Accuracy Ratio of Naïve Bayes Application

Table 8. Accuracy Ratio of Dagging Application

Table 9. Accuracy Ratio of Bayes Net Application

The JRip classifier was applied to the energy efficiency dataset and the results shown in Table 10 were obtained. The correct classification ratio is 58.2031% (Y1) and 50.3906% (Y2).

Table 10. Accuracy Ratio of JRip Application

The following WEKA classifiers were applied to the energy efficiency datasets: Rotation Forest, Decorate, J48, Bagging, NNge, KStar, JRip, Bayes Net, Naïve Bayes and Dagging. The results obtained from these classification techniques are presented in Table 11 for each dataset. A comparison of the algorithms' performance shows that Rotation Forest had the highest accuracy, whereas Dagging had the lowest.

Table 11. Accuracy Ratio of Each Classification Technique on Each Energy Efficiency Dataset