A Novel Feature Selection Method for the Dynamic Security Assessment of Power Systems Based on Multi-Layer Perceptrons

In this study, the effect of feature selection methods on the performance of multi-layer perceptrons used for the dynamic security assessment of electric power systems is investigated. The existence of many measurable parameters (features) characterizing the power system security status complicates the use of multi-layer perceptrons in terms of both prediction accuracy and training time. In this paper, the dynamic security of a power system subject to a number of critical contingencies is assessed by predicting the critical clearing time of any credible fault with a multi-layer perceptron. In addition to the study of two existing feature selection methods, Minimum Redundancy Maximum Relevance (mRMR) and Regressional ReliefF (RReliefF), a novel multi-layer perceptron based feature selection method is proposed for the prediction of security indices. The performance of the feature selection methods on the dynamic security assessment is investigated on a 16-generator, 68-bus test system.


Introduction
Power systems are being operated under increasingly stressed conditions, with reduced security margins and unplanned power flow patterns within the network, as power consumption, the pressure of economics and energy markets, and the uncertainties in generation due to the utilization of renewable sources increase. In such cases, transient instabilities caused by credible contingencies could become an important factor in determining the security level of the system and in taking the necessary control actions when they are needed. Therefore, fast and accurate dynamic security assessment (DSA) methods involving the study of transient stability have always been important for the safe and reliable operation of power systems [1]. The proposed methods for DSA can be divided into a number of classes [2]. Numerical integration methods [3] are mostly used in off-line applications, since they are considered computationally intensive for on-line applications. Direct methods [4] utilize Lyapunov functions to assess stability instead of numerical integration, but their efficiency depends on the simplicity of the models that are used. Probabilistic methods, which determine the probability distributions for the stability of the system, are suitable for system planning due to their computational requirements [5]. On the other hand, pattern recognition methods such as artificial neural networks (ANNs) assess the security of an operating point (OP) by developing a mapping between the features representing the states of the system and its security through the use of a knowledge base (KB) generated off-line [6][7][8][9][10][11][12][13][14]. As a specific implementation of ANNs, multi-layer perceptrons (MLPs) can be used for the prediction of security indices during the DSA of a large power system.
However, in these applications, the large number of measurements (features) that characterize the power system and its security complicates the training process of MLPs. Using a large number of features has two potential adverse effects. First, the time complexity of training a neural network increases dramatically with the number of input features [15]. Second, the existence of irrelevant features increases the number of inputs without providing new information. Thus, an exponentially increasing number of training samples is required to train the ANN effectively [17]. To mitigate these adverse effects on the training of the ANN and to increase its prediction performance, which directly affects the success of DSA, an effective feature selection must be applied. Various feature selection methods adopting measures of feature quality such as the sensitivity index [15], divergence [16] and Fisher discrimination [17] have been used for the DSA of power systems based on MLPs. In this study, the effect of using filter type feature selection methods, such as Minimum Redundancy Maximum Relevance (mRMR) [18] and Regressional ReliefF (RReliefF) [19], on the final performance of an MLP for predicting the power system security indices is investigated. The security indices selected in this work are the critical clearing time (CCT) of the faults and the minimum damping ratio of the electromechanical oscillations occurring in the system. In addition to the use of mRMR and RReliefF, this paper proposes a new methodology for feature selection based on the MLP weights. The proposed MLP Weight Based Method (MLPW) uses the training weights of the MLP to draw conclusions on the correlations of the input features and to identify an optimal subset of features. Neural network based feature selection has been reported in [20]; however, only the output of the ANN is used there for sensitivity analysis and feature selection, in contrast to MLPW, which uses training information for feature selection.
The performance of DSA using the MLPW method is compared with that of the mRMR and RReliefF methods on the dynamic security assessment of a 16-generator, 68-bus test system.

Feature Selection for Neural Networks
Finding the optimal subset of features plays a central role in predicting the security index quickly and accurately for a real power system whose dynamic performance can be characterized in various ways involving a large number of variables. Consider a set of N instances of operating points and the corresponding CCT values constituting the KB $\{x_i, y_i\}$, $i = 1, \dots, N$. Each operating point is an instance itself, consisting of D features $f_j$ ($j = 1, \dots, D$). The task of feature selection involves identifying an optimal subset of features $S = \{f_s : s \in \{1, \dots, D\}\}$ for the DSA. The mRMR and RReliefF feature selection methods, which have been applied to many different types of problems, are also used in this study. In addition to these methods, an MLP based method is introduced and used. In all feature selection methods investigated in this study, m is a design parameter which represents the size of the optimal feature subset S. None of these methods can compute the optimal m by itself. For computing the optimal m, a parameter selection step is used, and the details of this step are provided in Section 3.

mRMR
mRMR [18] is a filter type feature selection method. The candidate subset S provided by the mRMR algorithm is independent of the prediction step, and the MLP is used only for predicting the security index of the system. The main criterion implemented in the mRMR algorithm maximizes the relevance of the features to the target output while minimizing the redundancy among the selected features. By maximizing the relevance between the features and the output, features that contain useful information about the target output are kept in S. In addition, minimizing the redundancy prevents the selection of features that depend on each other and hence do not add any new information about the target. This two-step algorithm provides a candidate optimal subset of features S [18].
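The relevance and redundancy trade-off described above can be illustrated with a minimal greedy sketch. It is not the implementation used in this study: the function name `mrmr_rank` is hypothetical, the difference form of the criterion (relevance minus mean redundancy) is assumed, and the absolute Pearson correlation stands in for the mutual information used in [18] purely to keep the example short.

```python
import numpy as np

def mrmr_rank(X, y, m):
    """Greedy max-relevance / min-redundancy selection (illustrative only).

    ASSUMPTION: absolute Pearson correlation replaces mutual information.
    X is (N, D), y is (N,); returns the indices of m selected features.
    Constant columns are not handled (their correlation is undefined).
    """
    D = X.shape[1]
    # Relevance of each feature to the target.
    relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(D)])
    # Pairwise redundancy between features (columns are the variables).
    redundancy = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(relevance))]
    while len(selected) < m:
        remaining = [j for j in range(D) if j not in selected]
        # Score = relevance minus mean redundancy with the features already chosen.
        scores = [relevance[j] - redundancy[j, selected].mean() for j in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```

In this sketch a near-duplicate of an already selected feature is penalized by its high redundancy, so a less relevant but independent feature can be chosen ahead of it.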

RReliefF
RReliefF [19] is another filter type feature selection method. As in the mRMR approach, the MLP is used only as a predictor. However, the main criterion for selecting candidate feature subsets is different from the one in mRMR. The main idea is to rank features by their quality in discriminating the target values of instances that are close to each other, hence emphasizing the local information of the instances. What makes RReliefF a viable option for feature selection is its ability to account for the dependency between features, while most other methods, including mRMR, assume independence between features. In a domain in which there is a strong dependency between features, RReliefF usually performs better. RReliefF generates a ranking of features based on their importance in predicting the target parameter [19].
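A compact sketch of the RReliefF weight update may help to make the idea of local discrimination concrete. It follows the general structure of the algorithm in [19], with two simplifying assumptions that are not taken from this paper: all k nearest neighbours are weighted equally (the original uses distance-rank based weights), and the neighbourhood is defined by the Manhattan distance on range-normalised features. The function name and default values are hypothetical.

```python
import numpy as np

def rrelieff(X, y, k=10, n_samples=None, seed=0):
    """Simplified RReliefF feature weights (illustrative sketch).

    ASSUMPTIONS: uniform neighbour weights, Manhattan distance, numeric
    features only, and a non-constant target y.
    """
    rng = np.random.default_rng(seed)
    n, D = X.shape
    x_range = X.max(axis=0) - X.min(axis=0)
    x_range[x_range == 0] = 1.0            # avoid division by zero
    y_range = float(y.max() - y.min())
    m = n if n_samples is None else n_samples
    idx = rng.choice(n, size=m, replace=False)
    n_dc = 0.0                             # accumulates "different target"
    n_da = np.zeros(D)                     # accumulates "different feature value"
    n_dcda = np.zeros(D)                   # accumulates both at once
    for i in idx:
        diff_a = np.abs(X - X[i]) / x_range          # (n, D), values in [0, 1]
        dist = diff_a.sum(axis=1)
        dist[i] = np.inf                             # exclude the instance itself
        nn = np.argsort(dist)[:k]
        diff_y = np.abs(y[nn] - y[i]) / y_range
        n_dc += diff_y.mean()
        n_da += diff_a[nn].mean(axis=0)
        n_dcda += (diff_y[:, None] * diff_a[nn]).mean(axis=0)
    # Higher weight: the feature differs when the target differs among neighbours.
    return n_dcda / n_dc - (n_da - n_dcda) / (m - n_dc)
```

Sorting the returned weights in descending order gives a feature ranking of the kind described above.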

MLP Weight Based Method
As filter type methods that do not require classifier or predictor training, mRMR and RReliefF are considered fast feature selection methods. However, their performance depends on the predictor that is later applied to the candidate optimal feature set. Although there are other feature selection methods that use the predictor in the feature selection steps, such as backward and forward feature selection, they are computationally intensive and unfit for online DSA. In the MLP weight based (MLPW) method, the behavior of the MLP predictor during the training of a sample network is used. An MLP is a general function estimator of the form $f(W_{in}, W_h, x) \in \mathbb{R}$, where x denotes the vector of input variables, and $W_{in}$ and $W_h$ are the matrices of weights of the input layer and the second layer of the MLP, respectively. The training process consists of repeated estimation and fine-tuning of the network weights $W_{in}$ and $W_h$ to minimize the discrepancy between the actual target output y and the prediction of the MLP, $f(W_{in}, W_h, x)$. In order to determine the correlated features, the behavior of the input layer weights is observed during training. It is expected that the correlation between input features is embodied in the correlation of the weights of the network during the training phase. If the weights connected to two features behave similarly during training, then it is assumed that these features have a higher correlation. The general framework of the proposed methodology consists of training a sample MLP with all features once, recording the training weights, performing the correlation analysis, and ranking the features. The training weights are recorded in vectors. For each feature $i$, $i = 1, \dots, D$, and for each epoch $e$, $e = 1, \dots, E$, of batch training, the MLP has h weights, where h is the number of hidden units, corresponding to the connections between feature i and each of the h hidden units. These weights from the different training epochs are then concatenated, which yields D vectors of length $E \times h$.
In other words, the result is a $D$-by-$(E \times h)$ matrix of training weights, $NetW \in \mathbb{R}^{D \times (E \times h)}$, with one row per feature. The next step is to calculate the correlations between the weight vectors of different features in the NetW matrix. The absolute value of the Pearson correlation is used to form a matrix of pairwise correlations between the network weights of different features. The general steps of the proposed feature selection method are represented in the MLPW algorithm. The algorithm takes the matrix NetW of training weights and produces a list of features, featureRank, which records the features in order, so that irrelevant features can be eliminated to reach an optimal subset of the feature set. In this sense, the proposed algorithm gives a feature ranking similar to the RReliefF algorithm. Line 5 of the MLPW algorithm is an important step, in which the feature that minimizes a correlation-based criterion is obtained. Thus, the feature that is highly correlated with the referenceFeature, from the point of view of the MLP, and that at the same time has the lowest correlation with all the features, including the referenceFeature, is found.
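The construction of NetW and of the pairwise correlation matrix can be sketched as follows. This covers only the two steps that are fully specified in the text; the ranking criterion of the MLPW algorithm is not reproduced. The function names, the symbol E for the number of recorded epochs, and the assumption that the input-layer weights are available as one (D, h) array per epoch are all illustrative.

```python
import numpy as np

def build_netw(weight_snapshots):
    """Concatenate per-epoch input-layer weights into NetW.

    weight_snapshots: list of E arrays, each of shape (D, h), where entry
    [i, j] is the weight between feature i and hidden unit j (ASSUMED layout).
    Returns an array of shape (D, E * h), one row per feature.
    """
    return np.concatenate(weight_snapshots, axis=1)

def abs_weight_correlation(netw):
    """Absolute Pearson correlation between the weight histories of features.

    np.corrcoef treats each row as one variable, so the result is (D, D).
    """
    return np.abs(np.corrcoef(netw))
```

A pair of features whose entry in the returned matrix is close to 1 has input weights that evolved almost identically during training, which is the situation the method interprets as a high correlation between the two features.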

Neural Networks for DSA
DSA can be formulated as a classification or a regression problem depending on the security information required for the application.
In the classification task, classifiers can classify the security status of the system at a particular operating point (OP) as secure or insecure, whereas in the regression task, the regression tools can predict a security index that measures how far the system is from becoming unstable if the contingencies do occur. In this study, the MLPs developed for DSA are trained through a supervised learning algorithm which requires a KB composed of various examples of OPs and the security status of the system operating at them. The method starts with a contingency scan over a wide range of OPs to identify the critical contingencies. The critical contingencies are selected from the ones leading to angular instabilities after the occurrence of three-phase faults. A crucial step in the method is to generate training data consisting of properly chosen representative instances. For a number of different topologies and loading levels, a large set of OPs is created using the power flow solutions. Then, at each OP, the security status of the system is determined from the post-contingency dynamics through time-domain simulations. When the problem is formulated as a classification problem, the security of the system at a particular OP can be represented as secure or insecure by a straightforward calculation of the angle stability index, which is based on δmax, the maximum angle separation in degrees of any two generators at the same time in the post-contingency response; the OP is labeled as secure if the index is positive. In the regression problem, the security is quantified by a security index, the CCT, which is the maximum time allowed for clearing the fault without causing any angular instability. In addition to the assessment of transient stability using the aforementioned indices, the dynamic security assessment of the system is extended with the assessment of small-signal angle stability.
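For reference, an angle stability index of this kind is commonly written in the following form in commercial transient stability tools. Both the symbol $\eta$ and the assumption that this is the definition intended here are not taken from the text above.

```latex
\eta = \frac{360 - \delta_{\max}}{360 + \delta_{\max}} \times 100,
\qquad -100 < \eta < 100
```

With $\delta_{\max}$ expressed in degrees, $\eta > 0$ is equivalent to $\delta_{\max} < 360^{\circ}$, which matches the rule that an OP is labeled as secure when the index is positive.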
For all operating conditions, the system must also be secure in the sense that both inter-area and intra-area oscillations are well damped in a power system when it is subjected to disturbances. This can be assessed by computing the eigenvalues associated with the electromechanical modes of oscillations and distinguishing the ones with the minimum damping ratio. In this study, a multilayer perceptron (MLP) is trained and developed to determine the security status of the system operating at a given OP for each critical contingency. The KB contains a large number of OPs from a wide range of operating conditions, each of which is defined by a particular schedule of generation and distribution of the load demand, a particular network topology and a loading level. Having such a large KB, a feature selection process becomes crucial to attain better performances of the MLPs predicting the security as well as to reduce the size and the complexity of the MLPs without losing too much information. In the next section, the framework for feature selection and the selection of the best MLP structure are given in more detail.

Feature Selection
The feature selection step consists of determining two parameters: the size of the optimal feature subset, m, and the size of the hidden layer of the MLP, h. First, the size of the optimal subset of features m is calculated. For this purpose, a set of initial guesses for m is required; then, for each of them, the best subset of features is calculated by using mRMR and RReliefF. This optimal subset of features is fed to the MLP and the accuracy of prediction is evaluated for different numbers of hidden units, h. Finally, the best combination of m and h is chosen based on the prediction error. The steps of the proposed feature selection and training approach are shown in Fig. 1. Increasing the number of units in the hidden layer of the MLP generally increases the training accuracy. However, the training performance represents only the ability of the MLP to predict instances that exist in the training set, while the actual performance of the MLP is tested on instances that have not been seen yet. Increasing the value of h beyond a certain point will therefore have a negative effect on the prediction performance on unknown instances, as it degrades the generalization performance. To overcome this effect, a portion of the data, named the validation set, is set aside for evaluation only. The value of h is determined based on the performance of the MLP on the validation set. As h increases, the prediction error on the validation set is expected to decrease at first and then eventually to increase. Additionally, to eliminate any adverse effect of choosing a fixed validation set on the final prediction performance, 10-fold cross-validation is applied. In this method, the KB is divided into 10 folds; each fold in turn serves as the validation set, consisting of 10% of the knowledge base, and the remaining 90% is designated as the corresponding training set.
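The joint selection of m and h by 10-fold cross-validation can be sketched as below. This is a minimal outline under stated assumptions, not the procedure of Fig. 1 itself: `ranking` is a feature ranking produced by any of the selection methods, and `make_fit_predict` is a hypothetical factory that, for a given h, returns a function training a predictor and returning its predictions for the validation inputs.

```python
import numpy as np

def kfold_indices(n, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs; each fold holds about n/k instances."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

def cv_mse(fit_predict, X, y, k=10, seed=0):
    """Average validation MSE of fit_predict(X_tr, y_tr, X_val) over k folds."""
    errors = []
    for tr, va in kfold_indices(len(y), k, seed):
        pred = fit_predict(X[tr], y[tr], X[va])
        errors.append(np.mean((y[va] - pred) ** 2))
    return float(np.mean(errors))

def select_m_and_h(ranking, m_grid, h_grid, make_fit_predict, X, y, k=10):
    """Return (best_error, m, h) over all candidate subset and layer sizes."""
    best = None
    for m in m_grid:
        cols = list(ranking[:m])             # top-m features of the ranking
        for h in h_grid:
            err = cv_mse(make_fit_predict(h), X[:, cols], y, k)
            if best is None or err < best[0]:
                best = (err, m, h)
    return best
```

Only the validation error enters the comparison, so a larger h that merely fits the training folds better is not preferred.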
By combining the feature selection step and model selection step, it is possible to identify the best subset of features that in combination with the MLP yields the best prediction performance. This is important since the choice of MLP as the predictor will affect the size of the optimal feature subset, m.

Test System
The proposed methodology is applied to a 16-generator, 68-bus test system [21] shown in Fig. 2. Each generating unit is modeled with the 6th order two-axis synchronous generator model, a static exciter model of order 1, a power system stabilizer model and a speed governor turbine unit model of order 3. Through a contingency scan over a wide range of operating conditions, 12 critical contingencies (Fig. 2), which lead to instabilities, are found. Each critical contingency is a three-phase fault at one end of a transmission line cleared by the removal of the faulted line well after its CCT. For each OP, the power flow computations and DSA via time-domain simulations and eigenvalue analysis are performed. The KB, including instances (OPs) at various loading levels (75%-125%) with 45 different topologies, is generated by DSATools™ [22]. Instances are represented by 136 features, which are the magnitudes and phase angles of the bus voltages, together with the stability index values (the CCTs related to the critical contingencies and the minimum damping ratios at the pre-fault conditions).

Prediction of CCT
In this study, a KB of 3692 instances of operating points and their corresponding CCT values is generated. 20% of the KB has been randomly selected and allocated to evaluate the final performance of the MLP-based CCT predictor and is not used in any step of the feature selection or the MLP model selection. The remaining 80% is used for the validation and training sets. Thus, the reported prediction error is as close to that of a test experiment as possible. For the training of the MLP, the scaled conjugate gradient back propagation method is used. In addition, to prevent the MLP from over-fitting, early stopping is applied. The proposed procedure for all feature selection methods is independent of the initial selection of critical contingencies. Therefore, it is applicable to other contingencies. The measure for evaluating the performance of the MLP is chosen to be the mean squared error (MSE), a widely used performance criterion for prediction problems with MLPs. For a set of instances and target values in the form $\{x_k, y_k\}$, $k = 1, \dots, K$, and a set of K corresponding predictions $\hat{y}_k$, the MSE is defined as $\mathrm{MSE} = \frac{1}{K} \sum_{k=1}^{K} (y_k - \hat{y}_k)^2$, where $y_k$ and $\hat{y}_k$ are the actual and the predicted values of the CCT, respectively. The evaluation step has been repeated 10 times to ensure an accurate representation of the actual prediction error, and the average MSE is reported. At each repetition of the evaluation, the MLP is initialized with a different set of weights so that the small variation in the final performance of the MLP is taken into account. In order to choose the best number of hidden units for a fixed number of input features m, after the feature selection step, the MLP is trained and evaluated by using cross-validation for different numbers of hidden units. For each size of the hidden layer, cross-validation provides two performance measures, one for training and one for validation. The validation error is used to choose h. Thus, it is ensured that the MLP does not over-fit the training set.
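The hold-out protocol and the repeated evaluation described above can be outlined as follows. The helper names are hypothetical, and `train_and_predict` stands for any routine that trains a predictor with a given random seed and returns its predictions for the test inputs; the MLP training itself is outside the scope of this sketch.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def holdout_split(n, test_fraction=0.2, seed=0):
    """Randomly reserve a fraction of the n instances for final testing.

    Returns (dev_idx, test_idx); dev_idx is used for training and validation.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_test = int(round(test_fraction * n))
    return perm[n_test:], perm[:n_test]

def repeated_test_mse(train_and_predict, X, y, dev_idx, test_idx, repeats=10):
    """Mean and standard deviation of the test MSE over differently seeded runs."""
    errors = []
    for r in range(repeats):
        pred = train_and_predict(X[dev_idx], y[dev_idx], X[test_idx], seed=r)
        errors.append(mse(y[test_idx], pred))
    return float(np.mean(errors)), float(np.std(errors))
```

Reporting the standard deviation alongside the mean makes the sensitivity of the predictor to its weight initialization visible.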
After choosing h, the MLP with h hidden units is trained for different values of m and its performance on the test data is evaluated. The number of hidden units is varied from 5 to 60, while the number of input features is varied from 1 to 136. Fig. 3 shows the performance analysis graph of the proposed framework for contingency 2. For this contingency, it is observed that the proposed feature selection method, MLPW, results in a small MSE using fewer features than the other methods. The CCT prediction using MLPW has a lower average MSE. In addition to this, the standard deviation (SD) of the error should also be considered. A sizable SD is an indication of a low quality feature subset with unstable regression performance. In other words, it signifies that the regression performance could vary significantly using the same set of features. It should be noted that, for different contingencies, different feature selection methods may yield the best performance results. For instance, in Fig. 4, MLPW outperforms mRMR and RReliefF and the best result is achieved using 40 features selected by MLPW. As seen in Fig. 3 and Fig. 4, using a proper feature selection method improves the prediction performance. In addition to CCTs, the proposed methodology is applied to predict the minimum damping ratio of electromechanical oscillations. Fig. 5 shows the results of the analysis. In this case, RReliefF surpasses all the other methods in terms of the feature selection performance. However, the results in different parts of the graph may differ. For example, if for any reason only the top 20 features are used, mRMR provides a better performing subset of features, with both a smaller error and a smaller SD of the error. However, for feature subsets of size larger than 20, MLPW and RReliefF consistently outperform mRMR. To validate the observations, the dependency of the features and the outputs on each other is also investigated.
For this purpose, the mutual information measure is used. By definition, mutual information measures the amount of information obtained about one variable by observing another variable. Fig. 6 shows the plot of the mutual information score for different features, measured against the CCT of contingency 2. The observation is that most of the features carry the same amount of information about the CCT value of contingency 2. This supports the view that the features can be almost equally informative about the output variable.
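A simple way to estimate such a mutual information score for continuous variables is to discretize both variables with a two-dimensional histogram. The sketch below is one possible estimator, given for illustration only; the estimator and the bin count used to produce Fig. 6 are not stated in the text, and the function name and the default of 16 bins are assumptions.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram-based estimate of the mutual information I(x; y) in nats.

    ASSUMPTION: equal-width binning; the estimate is biased upwards for
    small samples and depends on the number of bins.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y, shape (1, bins)
    mask = pxy > 0                            # skip empty cells (0 * log 0 = 0)
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))
```

Applying the function to each feature column and the CCT vector gives a per-feature score of the kind plotted in Fig. 6, while applying it to pairs of feature columns gives a pairwise dependency matrix.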

Conclusion
In this study, the effect of feature selection algorithms on DSA based on MLPs is investigated. The results show that a comparable prediction performance is achievable using only a subset of the power system parameters, which has an important effect on simplifying the process of security assessment. The other side of the argument is the dependency of the features on each other, that is, how much information one feature can provide about a different feature. In this case, a high mutual information score indicates that one of the two features can be excluded without much loss of the overall information provided to the predictor. Fig. 7 provides the pairwise mutual information of the features. The results show that some of the features are relatively informative of each other. This observation is in line with the error rate graphs of Fig. 3. For contingency 2, at around 60 features, the rate of decrease of the prediction error drops drastically. In other words, the useful information added for prediction by selecting more than 60 features is minuscule. Due to space limitations, only the top 5 most important features obtained by each algorithm for the contingencies ctg. 2, ctg. 4, ctg. 8, and ctg. 10 are provided in Table 1, where the voltage magnitude and its phase angle at bus i of the test system are denoted by V_i and θ_i, respectively. As expected, the features closer to the location of each contingency are found to be important. Incorporating a smaller set of network parameters reduces the time complexity and model complexity of MLP based DSA and proves to be critical in large power systems, in which fast security assessment and control are essential to deliver high quality service. In addition to complexity reduction, another effect of feature selection is that fewer variables need to be measured in order to predict the security status of the system. This can be a critical factor in practical situations when measuring or estimating some of the features is costly or time consuming.
Therefore, by means of a proper feature selection method and identifying the important features, the effort for prediction and online DSA of the system can be optimized.