An Empirical Study of the Extreme Learning Machine for Twitter Sentiment Analysis

The Extreme Learning Machine (ELM) is a learning method proposed for single hidden layer feed-forward networks (SLFNs). The ELM employs a feed-forward neural network architecture and works with randomly determined input weights. In this respect, the ELM rests on the principle that the weights and biases between the input and hidden layers can be assigned randomly instead of being learned. In the first phase of the ELM, which can be called feature mapping, the use of random values distinguishes the ELM from methods such as Support Vector Machines (SVM) and deep neural networks, which rely on kernel functions or learned feature mappings. After the feature mapping, the main goal of the ELM is to learn the weights between the hidden and output layers by minimizing the error. The ELM has recently gained considerable popularity and can be used for classification, regression, and dimensionality reduction. In the literature, Twitter sentiment analysis is generally treated as a classification task. Therefore, in this study, the basic ELM is applied to Twitter sentiment analysis and compared with the SVM, which is one of the most successful machine learning algorithms used for sentiment analysis. Experiments are conducted on two different Turkish datasets. Experimental results show that the two methods differ only slightly in performance, with the SVM outperforming the basic ELM.


Introduction
Today, social media is not just a communication tool; it has become an integral part of everyday life. Social media users can express their thoughts, feelings, ideas, and experiences about any subject by using tools like Google+, Twitter, and Facebook [1]. Social media usage has been increasing day by day, and users also employ it for sharing and organizing news about popular entities [2]. In this respect, social media provides a rich data source that can be used in many research fields such as commerce, politics, economy, and opinion mining. Sentiment analysis deals with the textual data generated in social media in order to analyze entities such as products, movies, companies, people, and brands [3]. Among the social media platforms, Twitter is one of the most popular, with 313 million monthly active users. Twitter has been reported to be the most used tool for sharing breaking news, with 40 million tweets posted on the day of the 2016 U.S. Presidential election [4]. Because it provides an easy way to access and download published tweets, Twitter is considered one of the largest data sources of user-generated content [5]. Therefore, in this study, Twitter is selected as the data source, and sentiment analysis is performed on Turkish Twitter feeds using the Support Vector Machine (SVM) and the Extreme Learning Machine (ELM).
The ELM is a kind of single hidden layer feed-forward neural network (SLFN) and can be employed for regression and classification. It determines the weights and biases between the input and hidden neurons randomly and never updates these values [6]. The ELM learns only the weights between the hidden and output layers, by solving a linear model. Compared to traditional feed-forward neural networks, it provides better generalization performance, and its training is remarkably fast. It has also become quite popular recently due to its successful results in many other fields such as image segmentation, medical diagnosis applications, and forest cover type prediction [7]. In this study, we utilize the basic ELM for sentiment analysis of Turkish tweets. Our aim is to investigate the applicability of the ELM in this field. Therefore, we compare the results of the ELM and the SVM in terms of accuracy. We choose the SVM for the comparison because it is generally the most successful traditional machine learning algorithm in sentiment analysis [8].
The rest of this paper is organized as follows. The related works are summarized in Section 2. The datasets used in the experimental evaluation are briefly described in Section 3. The methods applied in the pre-processing and classification stages are described in Section 4. Our experimental results are presented and discussed in Section 5, and finally, the summarized results and conclusions are given in the last section.

Related Works
Twitter sentiment analysis is generally performed by using traditional machine learning algorithms. Sentiment analysis is usually considered a text classification problem, which is a supervised machine learning task. The ELM was first proposed for supervised learning tasks such as classification and regression. Moreover, researchers have developed variants of the ELM by extending it for supervised [9] and unsupervised [10] learning, and for feature selection [11]. It has ...
1,2,3 Computer Eng., Cukurova University, Adana 01330, Turkey

Datasets Used in the Experiments
In this section, we describe the datasets used for the evaluation of the ELM. Sentiment analysis on Twitter is generally performed on two different types of datasets: subject-dependent and subject-independent [16]. Subject-dependent means that the dataset is composed of tweets related to a specific topic, whereas subject-independent means that the dataset contains tweets that are not tied to any particular topic. In this study, we use two different Turkish Twitter datasets, of which Dataset1 (DS1) is subject-dependent and the second one (DS2) is subject-independent. DS1 [17] consists of manually labeled tweets related to a private company in the telecom sector. It contains 3000 tweets in total in three categories: positive, negative, and neutral. The second dataset (DS2) is a subset of another dataset [18], which was automatically collected with the help of the Twitter API and labeled by using the emoticons contained in the tweets (i.e., distant supervision). The original dataset contains 32000 tweets, equally distributed between the positive and negative categories. However, we use only a subset of it to form DS2, because the size of the original dataset is beyond the scope of this paper. DS2 is composed of 10000 tweets in total, equally distributed between the positive and negative classes. The distributions of samples in DS1 and DS2 are given in Table 1.

Methods
We apply five main steps to perform sentiment analysis: (i) pre-processing, (ii) feature extraction, (iii) term weighting, (iv) classification, and (v) evaluation.
In the pre-processing step, we clean the Twitter feeds of incomprehensible words and characters. Then, we extract features in two different ways: Bag of Words (BoW) and n-gram. After the feature extraction phase, weights are assigned to the features and the document vectors are computed. Then we build a learning system by training a classifier. In the final phase, we employ this learning system to classify the polarity of the tweets. In the learning system, we employ only the ELM and SVM algorithms as classifiers in order to compare their performances with each other.

Pre-processing
As the pre-processing task, we apply the following steps to extract meaningful information from the tweets:
- Cleaning noise from the tweets by removing incomprehensible words and, only in the BoW model, meaningless characters such as punctuation marks and digits.
- Lowercase conversion to remove the case sensitivity of features.
- Text normalization, which removes repeating letters by reducing them to one letter as in [19], to avoid a high-dimensional feature space.
- Removing the tweets that have fewer than one feature (i.e., word or n-gram) by applying a minimum term count filter.
- Stripping features whose length is less than two characters from the tweets.
Because sentiment analysis is treated as a text classification task, it has a sparse and high-dimensional feature space. The high dimensionality stems in particular from the traditional feature extraction models. Therefore, to reduce the dimensionality of the feature space, we also apply the following steps only in the BoW model:
- Stop-word removal, and
- Stemming
We remove the stop-words from the tweets by using the default list in the Lucene API [21]. We also perform word stemming by utilizing Zemberek, an open source Turkish NLP tool [22]. After completing the pre-processing steps, which are illustrated in Table 2, we tokenize the tweets and extract features to classify the datasets.
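The cleaning steps described in this section can be sketched in a few lines of Python. This is only an illustrative approximation: the stop-word set below is a hypothetical three-word list rather than the Lucene default list, stemming with Zemberek is left out, and the function names `preprocess` and `clean_corpus` are introduced here for illustration. Note also that plain `str.lower()` is not aware of the Turkish dotted and dotless i, which a real pipeline would need to handle.

```python
import re

# Hypothetical stop-word list for illustration only (the paper uses Lucene's default list).
STOP_WORDS = {"ve", "bir", "bu"}

def preprocess(tweet, stop_words=STOP_WORDS, min_len=2):
    """Clean one tweet and return its tokens (BoW variant, without stemming)."""
    text = tweet.lower()                       # lowercase conversion
    text = re.sub(r"[^\w\s]|\d|_", " ", text)  # drop punctuation marks, digits, underscores
    text = re.sub(r"(.)\1+", r"\1", text)      # reduce repeated characters to a single one
    tokens = text.split()
    # strip features shorter than min_len characters and remove stop-words
    return [t for t in tokens if len(t) >= min_len and t not in stop_words]

def clean_corpus(tweets):
    """Pre-process all tweets and drop those left without any feature."""
    token_lists = [preprocess(t) for t in tweets]
    return [tokens for tokens in token_lists if tokens]  # minimum term count filter
```

The repeated-letter rule is deliberately naive: it also shortens words that legitimately contain double letters, which is one reason a normalization scheme such as the one in [19] may be more selective.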

Feature Extraction
In the Vector Space Model (VSM), the data instances are converted to numerical vectors by using different methods, which have a high impact on the accuracy of the classification system applied. Data instances (i.e., tweets) are represented as numerical vectors by using the extracted features and assigning weights to these features. In this study, we use two different methods to extract features: BoW and n-gram (i.e., trigram). We apply both methods to the two datasets, so we have two different representations for each dataset. We name them according to the feature extraction method used as DS1BoW, DS1Trigram, DS2BoW, and DS2Trigram, respectively. In the BoW model, the numerical vector representation of a document is often built by associating each word with a numerical weight, which is generally proportional to its frequency in the document [23]. Therefore, each word is taken as a feature in the BoW model.
The n-gram model, on the other hand, is an alternative feature extraction technique. It can be applied in two different ways: i) character level, and ii) word level n-grams. In the character level n-gram model, the features are formed by taking n consecutive characters in the text content. For example, the character level n-grams of the string "opinion mining" are obtained by sliding a window of n characters over the string one character at a time. Character level n-grams are language independent and they can cope with misspellings and abbreviations [24,25]. Therefore, in this study, we use character level trigrams.
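As a small illustration of character level n-grams, the sketch below slides a window of n characters over a string. The function name `char_ngrams` is introduced here for illustration, and keeping the space character inside the n-grams is an assumption; the paper does not state how whitespace is handled.

```python
def char_ngrams(text, n=3):
    """Return all character-level n-grams of text, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Character trigrams of the example string used in the text.
trigrams = char_ngrams("opinion mining")
```

A string of length 14 yields 14 - 3 + 1 = 12 trigrams, starting with "opi" and ending with "ing"; repeated trigrams such as "ini" appear once per occurrence.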

Term Weighting
After all features are extracted from the whole document collection by using either the BoW or the n-gram method, the weight of each feature for each document is computed to form the document vector. Term weighting is a process that determines the importance of a term for a document. For this reason, it has an important role in the correct and effective representation of textual data. In this study, we use Term Frequency-Inverse Document Frequency (TF*IDF), which is the most widely used unsupervised and traditional weighting method [27], to assign weights to terms. The TF*IDF can be formulated as follows:
$$ tfidf(t, d, D) = tf(t, d) \times \log \frac{N}{|\{d' \in D : t \in d'\}|} \quad (1) $$
where $t$, $d$, $N$, and $D$ represent any term, any document, the number of documents (i.e., tweets) in the collection, and the document collection, respectively. $tf(t, d)$ corresponds to the observed raw frequency of term $t$ in document $d$, and the denominator counts the documents in $D$ that contain $t$.
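In the VSM, a document vector is formed for a document $i$ as shown in equation 2 [26]:
$$ d_i = (w_{i1}, w_{i2}, \dots, w_{im}) \quad (2) $$
where $d_i$ is the document vector for document $i$, $w_{ij}$ represents the weight of term $j$ in document $i$, and $m$ is the total number of features extracted from the whole dataset. Therefore, a collection of $N$ documents can be represented as a matrix $A$ whose rows are the document vectors, as in equation 3:
$$ A = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1m} \\ w_{21} & w_{22} & \cdots & w_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ w_{N1} & w_{N2} & \cdots & w_{Nm} \end{bmatrix} \quad (3) $$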
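The term weighting and the document matrix can be sketched as follows. This is a minimal illustration under stated assumptions: raw term counts are used for the term frequency, the natural logarithm is used for the inverse document frequency (the paper does not specify the base), and the function name `tfidf_matrix` is introduced here for illustration.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """docs: list of token lists. Returns (vocabulary, matrix) with one row per document."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    matrix = []
    for doc in docs:
        tf = Counter(doc)  # raw frequency of each term in this document
        matrix.append([tf[term] * idf[term] for term in vocab])
    return vocab, matrix
```

Each row of the returned matrix is one document vector and each column corresponds to one feature of the vocabulary, so the matrix has as many rows as documents and as many columns as features. A term that occurs in every document receives weight zero under this variant.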

Extreme Learning Machine
The ELM was proposed by Huang et al. [28] and its architecture is shown in Figure 1. As can be seen from Figure 1, the ELM is a type of SLFN. The main idea of the ELM is to initialize the neural network weights and biases (between the input and hidden neurons) randomly. Then, the weights between the hidden and output layers are obtained analytically [29].
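For $N$ arbitrary distinct samples $(x_j, t_j)$, where $x_j = [x_{j1}, x_{j2}, \dots, x_{jn}]^T$ is an input vector and $t_j = [t_{j1}, t_{j2}, \dots, t_{jm}]^T$ is its target vector, an SLFN with $L$ hidden neurons and activation function $g(x)$ is modeled as:
$$ \sum_{i=1}^{L} \beta_i \, g(w_i \cdot x_j + b_i) = o_j, \quad j = 1, \dots, N $$
where $g(x)$ is a nonlinear function that is infinitely differentiable in any interval, $w_i = [w_{i1}, w_{i2}, w_{i3}, \dots, w_{in}]^T$ is the weight vector that represents the connections between the input neurons and the $i$th hidden neuron, $\beta_i = [\beta_{i1}, \beta_{i2}, \beta_{i3}, \dots, \beta_{im}]^T$ is the weight vector that represents the connections between the $i$th hidden neuron and the output neurons, and $b_i$ corresponds to the bias of the $i$th hidden neuron [30]. $w_i \cdot x_j$ indicates the inner product of $w_i$ and $x_j$. That such an SLFN can approximate the $N$ samples with zero error means that $\sum_{j=1}^{N} \| o_j - t_j \| = 0$, that is, there exist $(w_i, b_i)$ and $\beta_i$ [9] such that:
$$ \sum_{i=1}^{L} \beta_i \, g(w_i \cdot x_j + b_i) = t_j, \quad j = 1, \dots, N $$
For $N$ samples, the above equation can be written in a more compact format as:
$$ H \beta = T \quad (6) $$
where
$$ H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_L \cdot x_N + b_L) \end{bmatrix}_{N \times L}, \quad \beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_L^T \end{bmatrix}_{L \times m}, \quad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times m} $$
In the equations above, $H$, $\beta$, and $T$ are the output matrix of the hidden layer, the output weight matrix, and the target matrix, respectively. The $i$th column of $H$ is the output vector of the $i$th hidden neuron for all $N$ samples, and the $j$th row is the output vector of all hidden neurons for the $j$th sample. Different nonlinear functions, such as Sigmoid, Sinusoid, and Gaussian, can be used as activation functions. Training an ELM is simply equivalent to finding the least-squares solution $\hat{\beta}$ of the linear system given in equation (6). The smallest norm least-squares solution of equation (6) is obtained using the Moore-Penrose generalized inverse [31]:
$$ \hat{\beta} = H^{\dagger} T $$
where $H^{\dagger}$ is the Moore-Penrose generalized inverse of $H$.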
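The training procedure described above can be sketched with NumPy as follows. This is a minimal illustration rather than the implementation used in the paper: the sigmoid activation, the uniform range [-1, 1] for the random weights and biases, the function names `elm_train` and `elm_predict`, and the toy data are all assumptions made here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, T, n_hidden, rng):
    """X: (N, n) inputs, T: (N, m) targets. Returns (W, b, beta)."""
    n_features = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n_hidden, n_features))  # random input weights, never updated
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # random hidden biases, never updated
    H = sigmoid(X @ W.T + b)       # hidden-layer output matrix, shape (N, n_hidden)
    beta = np.linalg.pinv(H) @ T   # minimum-norm least-squares solution of H beta = T
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Return the raw network outputs, one row per sample."""
    return sigmoid(X @ W.T + b) @ beta

# Hypothetical toy data: 8 samples, 6 features, 3 classes encoded as one-hot targets.
rng = np.random.default_rng(0)
X = rng.random((8, 6))
T = np.eye(3)[[0, 1, 2, 0, 1, 2, 0, 1]]
W, b, beta = elm_train(X, T, n_hidden=4, rng=rng)
predicted = elm_predict(X, W, b, beta).argmax(axis=1)  # index of the largest output per sample
```

Only `beta` is learned; `W` and `b` stay at their random initial values, which is why training reduces to a single pseudoinverse computation.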

Activation Function
Traditional gradient-based learning algorithms only work with differentiable activation functions. However, the ELM can work with all bounded, nonconstant, piecewise continuous activation functions [32]. In the ELM, $H$, the output matrix of the hidden layer, is calculated by using an activation function. In this study, we use five different activation functions, namely Sigmoid, Sine (Sin), Hard-limit (Hardlim), Radial-basis (Radbas), and Linear transfer (Purelin), to analyse their effect on the performance of the ELM. These functions are formulated as follows:
$$ \text{sigmoid}(x) = \frac{1}{1 + e^{-x}} $$
$$ \text{sin}(x) = \sin(x) $$
$$ \text{hardlim}(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases} $$
$$ \text{radbas}(x) = e^{-x^2} $$
$$ \text{purelin}(x) = x $$
In the above equations, $x$ is the net input of each neuron in the hidden layer, which is represented as $w_i \cdot x_j + b_i$.
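For illustration, the five activation functions named above can be written as plain Python functions of a scalar net input. The definitions follow the MATLAB transfer functions of the same names, which is an assumption about what the paper intends; the dictionary name `ACTIVATIONS` is introduced here.

```python
import math

# Scalar activation functions, keyed by the names used in the text.
ACTIVATIONS = {
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),  # smooth, output in (0, 1)
    "sin":     math.sin,                              # periodic, output in [-1, 1]
    "hardlim": lambda x: 1.0 if x >= 0 else 0.0,      # step function
    "radbas":  lambda x: math.exp(-x * x),            # Gaussian bump centred at 0
    "purelin": lambda x: x,                           # identity (linear transfer)
}
```

With `purelin`, the hidden layer applies no nonlinearity at all, so the network output is a linear function of the input features. The scalar `sigmoid` shown here can overflow for very large negative inputs; a vectorized, numerically guarded version would be preferable in practice.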

Example
To apply the ELM for classification, the features extracted in the pre-processing step and their computed weight values for each tweet are first arranged in a matrix $x$ together with a label vector $t$. In this example, we have 4 tweets and 6 features: matrix $x$ holds the feature weights of the training dataset, with each row being the document vector of one tweet, and $t$ denotes the class label of each tweet in $x$. In the training phase, the matrices for the input weights ($w$) and the biases of the hidden neurons ($b$) are randomly initialized. In our example, the number of neurons in the hidden layer is taken as $L = 4$, which equals the number of tweets in the training set. Therefore, the input weight matrix has 4 rows (one per hidden neuron, here equal to the number of tweets) and 6 columns (one per feature). After the random initialization of the input weights and biases, the output matrix of the hidden layer, $H$, is calculated with the following equation:
$$ H = g(x w^T + b) $$
In the above equation, $g$ represents the activation function, and $b$ is added to every row. In this example, we use the sigmoid activation function. Following the calculation of $H$, the output weights can then be calculated by using equation (4). Once the output weights are computed, the training phase is complete. Then, we can classify previously unseen test data (with the same number of hidden neurons and the same feature space) by using the computed output weights. In the testing phase, suppose that the following vector $x$ for a tweet with class label $t = 3$ is given to the trained ELM system:
$$ x = [1 \; 0 \; 0 \; 0 \; 0 \; 0], \quad t = [3] $$
Then, the $H$ matrix for the test tweet, $H_{test}$, is calculated by applying the same steps as in the training phase. After that, the class label of the test tweet is predicted by using the equation $T_{test} = H_{test} \beta$. The resulting $T_{test}$ shows the prediction of the ELM: its last value is the maximum of all values, so the predicted class label is the last class, which is 3 (see the label vector $t$ in the training phase for all possible classes). This prediction is correct for the given test instance.
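The step from class labels to a target matrix, and from the network output back to a class label, can be sketched as follows. The one-hot (0/1) target encoding is an assumption, since the paper does not show how the labels are encoded, and the output values in the usage line are hypothetical numbers chosen only to illustrate the argmax rule; the function names are introduced here.

```python
def one_hot(labels, classes):
    """Encode class labels as rows of a target matrix (1 for the true class, 0 elsewhere)."""
    return [[1.0 if label == c else 0.0 for c in classes] for label in labels]

def decode(output_row, classes):
    """Return the class whose output value is the largest."""
    best = max(range(len(classes)), key=lambda k: output_row[k])
    return classes[best]

classes = [1, 2, 3]
targets = one_hot([1, 3, 2, 3], classes)       # hypothetical training labels
predicted = decode([0.1, -0.2, 0.9], classes)  # hypothetical output row for one test tweet
```

Here the last output value is the largest, so the decoded label is the last class, mirroring the decision rule described in the example.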

Support Vector Machine
In this study, we also use the SVM with a linear kernel, which is a kernel-based machine learning algorithm that is robust on sparse data, for comparison with the ELM. In practice, there are two types of SVM: linear and non-linear [33]. The linear SVM aims to find a hyperplane that has the maximum margin between the classes without increasing the dimension of the feature space. The non-linear SVM transforms the data to a higher dimension by means of a kernel function (e.g., Radial basis, Sigmoid) and performs the classification in this space. The SVM is mainly concerned with solving binary class problems; however, it can be employed successfully in multiclass classification by different methods such as one-against-one and one-against-all [34].
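For reference, the maximum-margin idea behind the linear SVM is usually written as the soft-margin optimization problem below. This is the standard textbook formulation and is not taken from this paper; the symbols $w$, $b$, $\xi_i$, $C$, and the labels $y_i \in \{-1, +1\}$ are introduced here for illustration.

```latex
\min_{w,\, b,\, \xi} \quad \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i
\qquad \text{subject to} \qquad
y_i \, (w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \dots, N
```

Minimizing $\lVert w \rVert$ maximizes the margin $2 / \lVert w \rVert$ between the two classes, while $C$ controls how strongly margin violations are penalized. For a problem with $K$ classes, one-against-all trains $K$ such binary classifiers and one-against-one trains $K(K-1)/2$ of them.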

Performance Evaluation Measures
In this section, we describe the validation approach and the performance evaluation measures that we use to compare the classification methods. We employ the $k$-fold cross validation (CV) approach [36] to validate the predictive model. We prefer this method particularly with the ELM in mind. Since the ELM uses random values in its computations, the input weights and the hidden layer biases may change from run to run even if we use the same activation function and the same number of hidden neurons. Therefore, the ELM should be run multiple times. In $k$-fold CV, the dataset is randomly divided into $k$ disjoint subsets; then one of these subsets is taken as the test set whereas the rest are used as the training set. This process is repeated for each subset. The overall success of the classification model is calculated by taking the average over these $k$ iterations. In this study, we perform 10-fold CV.
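The splitting scheme can be sketched with the Python standard library as follows. The function name `k_fold_indices` and the fixed shuffling seed are assumptions introduced for illustration; the paper does not describe how its folds were generated.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # random assignment of samples to folds
    folds = [indices[i::k] for i in range(k)]  # k disjoint subsets of near-equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

A classifier would be trained and scored once per yielded pair, and the k scores averaged. Because the ELM re-draws its random weights on every training run, this averaging also smooths out part of the run-to-run variation.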
To compute the success rate of the classification models, we use the precision (P), recall (R), F-measure (F1), and accuracy (Acc) values, which are computed as in equations 14, 15, 16, and 17, respectively.
$$ P = \frac{TP}{TP + FP} \quad (14) $$
$$ R = \frac{TP}{TP + FN} \quad (15) $$
$$ F1 = \frac{2 \times P \times R}{P + R} \quad (16) $$
$$ Acc = \frac{TP + TN}{N} \quad (17) $$
where $N$ represents the number of samples in the test dataset, and the TP, TN, FP, and FN values are defined [37] as shown in Figure 2.
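The four measures can be computed directly from the confusion counts, as in the following sketch. The function name `binary_metrics` and the convention of returning 0 when a denominator is zero are assumptions made here; the counts in the usage line are hypothetical.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return (precision, recall, f1, accuracy) from the four confusion counts."""
    n = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / n if n else 0.0
    return precision, recall, f1, accuracy

# Hypothetical counts for a test set of 100 samples.
p, r, f1, acc = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```

For a multiclass dataset such as DS1, these values would be computed per class and then combined, for example as a weighted average over the classes.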

Experimental Results
To perform sentiment analysis, we first apply pre-processing to the datasets by removing the emoticons, short terms with a length of less than two characters, and other Twitter-specific terms. In this way, we use only the textual content to train the classifier. As a result, the size of the feature space is reduced by 3.95% and 14.15% for DS1 and DS2, respectively, as shown in Table 3. We also remove from the datasets the samples that do not have a sufficient number of terms. The total number of samples in each dataset after the pre-processing phase is given in Table 4.
Then we extract the features by using the BoW and character level trigram methods. As can be seen from Table 4, the number of BoW features is smaller than the number of trigram features for each dataset. The main reason for this is that we apply stemming and stop-word removal only in the BoW model. In the last step of the pre-processing, we compute the term weights. Finally, we perform the classification to obtain the experimental results.
Our experiments consist of two phases. According to the results of the first phase, we found that the performance of the ELM is not very sensitive to the number of neurons in the hidden layer. However, the most successful result is generally obtained when the number of neurons in the hidden layer is set to 500. We also observed that the activation function has a greater effect on the performance of the ELM than the number of neurons in the hidden layer. As can be seen from Figure 3, the most successful activation function is purelin, with one exception. Therefore, we decided to use the purelin function and 500 neurons in the hidden layer for the ELM in the subsequent experiments.
In the second phase of our experiments, our aim is to compare the basic ELM with the SVM. For this purpose, we perform classification using both the ELM (with the best parameter configuration) and the SVM on the DS1 and DS2 datasets under different feature extraction models. We report our results in Table 5 and Table 6, respectively. In Table 6, each letter (A, B, C, etc.) corresponds to the classifier and dataset combination marked with the same letter in Table 5. For example, letter A in Table 5 refers to the accuracy of the classification done on DS1BoW by using the ELM, and the row starting with letter A in Table 6 presents the precision, recall, and F-measure of that same classification. According to our results, the performances of the two classifiers are quite close to each other. However, the SVM is generally more successful than the basic ELM. We also observe that both classifiers produce more successful results on the DS2 dataset. In addition, the ELM generally produces better results on BoW features for each dataset, whereas the SVM generally produces better results on trigram features.

Conclusion and Future Works
In this study, we employ the basic ELM in Twitter sentiment analysis. Our main goal is to investigate the applicability of the ELM for sentiment analysis by comparing it with the SVM, which is generally the most successful traditional machine learning algorithm. For this purpose, we use two different Twitter datasets that consist of Turkish tweets. According to the experimental results, the SVM has slightly better classification performance than the basic ELM on the datasets that we used. We think that the reason for this is the robustness of the SVM to data sparsity. The data sparsity stems from Twitter-specific circumstances such as the length restriction and the use of informal language in tweets. The success of the ELM is not very sensitive to the number of neurons in the hidden layer, although we observed that the performance of the ELM generally rises when the number of neurons in the hidden layer increases. In contrast, the activation function has a remarkable effect on the success of the ELM. Among the activation functions, purelin is the most successful since it does not change the TF*IDF weights of the features, whereas the other activation functions remove the effect of the term weighting process. Consequently, we conclude that there is only a slight difference between the accuracies of the two classifiers. The basic ELM is quite successful in Twitter sentiment analysis, and it has high generalization performance despite its random components. It is also efficient in terms of training time when compared to the SVM.
In previous works, researchers have made different extensions to the architecture of the basic ELM to improve its performance. With such extensions, they obtain better results than with the basic ELM. Therefore, in our future work, we plan to implement a kernel-based ELM to improve the generalization performance of the basic ELM. We hope that this will increase the accuracy of the ELM in Twitter sentiment analysis.

Figure 2. The four outcomes of a classification [35]
First, we conduct the experiments on DS1BoW to find the best combination of the number of neurons in the hidden layer and the activation function for the ELM. We use five different activation functions and nine different values for the number of neurons in the hidden layer, resulting in a total of 45 combinations. The five activation functions are sigmoid, sine, radbas, hardlim, and purelin. The nine values for the number of neurons in the hidden layer are 10, 20, 30, 50, 80, 100, 200, 500, and 1000.

Table 3. The feature reduction in the pre-processing phase

Table 4. Total number of samples and unique features for the DS1 and DS2 datasets after pre-processing

Table 5. Comparison of the two classifiers when the ELM has 500 neurons and employs the purelin activation function in its hidden layer

Table 6. Weighted average values of the evaluation measures for the classification accuracies given in Table 5