Predicting the Severity of Motor Vehicle Accident Injuries in Adana-Turkey Using Machine Learning Methods and Detailed Meteorological Data

Abstract: Traffic accidents are among the most important issues facing every nation in the world, as they cause many deaths and injuries as well as economic losses every year. In this study, the traffic accidents that took place in Adana are classified according to injury severity (i.e. fatal or non-fatal), and the factors affecting the accident outcome are investigated. The study is based on the traffic accident reports kept by the Regional Traffic Division and the weather data provided by the Regional Directorate of Meteorology during 2005-2015. Five major machine learning methods (k-Nearest Neighbor, Naive Bayes, Multilayer Perceptron, Decision Tree and Support Vector Machine) and one statistical method, Logistic Regression, were employed to build prediction models, and the performances of the models as well as the effective parameters were compared. The main objective of the study is to determine how important weather and other phenomena are for the outcomes of traffic accidents. Decision Tree, k-Nearest Neighbor and Multilayer Perceptron based models yielded higher accuracy in classifying accidents than the other models. Furthermore, an Area Under Curve based analysis of factor importance showed that Mean Cloudiness, Existence of Traffic Control and Ground Surface Temperature had stronger positive effects, while the Maximum Temperature and Weather (as recorded by traffic officers) parameters decreased the accuracy of the models.


Introduction
Traffic accidents kill around 1.3 million people every year in the world. In addition, 20 to 50 million people suffer non-fatal injuries, with many incurring a disability as a result of their injury [1]. Traffic injuries also cause considerable economic losses to victims, their families, and nations as a whole. These losses arise from the cost of treatment (including rehabilitation and incident investigation) as well as reduced or lost productivity (e.g. in wages) for those killed or disabled by their injuries, and for family members who need to take time off work or school to care for the injured. There are few global estimates of the costs of injury, but research carried out in 2010 suggests that road traffic crashes cost countries approximately 3% of their gross national product, a figure that rises to 5% in some low- and middle-income countries [2]. In Turkey's case, the number of motor vehicles increases every year, having nearly doubled between 2005 and 2015 [3]. However, the number of traffic accidents increased faster than the number of vehicles over the same period, which resulted in a higher ratio of killed and injured persons in the total population. Nearly 1% of the total population dies and 4% suffers injuries due to traffic accidents, which have become an important risk factor of life in Turkey. According to the records kept by officers after fatal traffic accidents in 2015, driver faults accounted for 89.30% of total faults, pedestrian faults for 8.80%, road defects for 0.91%, vehicle defects for 0.55% and passenger faults for 0.43% [3]. The records were kept only for fatal injuries and lacked many possible additional factors that could contribute to the occurrence of accidents, one of which is weather. The relationship between traffic accidents and weather conditions is well known, and a number of studies have attempted to develop injury severity models using weather data.
The previous studies in this area can be divided into two main categories from a methodological perspective: statistically based models and machine learning based models. The statistically based models explore the characteristics of crashes using techniques such as Logistic Regression (LR). Reference [4] used a Negative Binomial modeling technique to model the frequency of accident occurrences and involvements over 1,606 accidents on a principal highway in Florida, USA. They used road and driver characteristics as explanatory variables. Their results showed that heavy traffic volume, speeding, narrow lane width, a larger number of lanes, urban roadway sections, narrow shoulder width and reduced median width increase the likelihood of accident occurrence. They also reported that female drivers experience more accidents than male drivers in heavy traffic and that younger drivers have a greater tendency to be involved in accidents. Reference [5] analyzed the pattern of traffic accidents based on several severity types. They included a total of 11,564 accidents reported in Seoul, Korea and 22 factors such as vehicle and road characteristics. They employed Multilayer Perceptron (MLP), LR and Decision Tree Classifier (DTC) models to classify accidents into three main subgroups: (1) death or major injury, (2) minor injury and (3) property damage only. They observed no significant difference in the classification accuracy of the models. Reference [6] analyzed driver injury severities for single-vehicle crashes occurring in rural and urban areas using data collected in New Mexico from 2010 to 2011. They used nested logit models and mixed logit models to identify contributing factors for driver injury severities. The data used in the study include weather information such as Clear, Fog, Rain, etc.
They identified five factors significant only for the rural model: animal-involved crashes, rainy conditions, icy conditions, crashes in no-passing zones and pickup-involved crashes. On the other hand, they determined six factors significantly influencing driver injury severity in urban crashes: crashes during peak hours, curved roadways, roadways with multiple lanes, tractor-involved crashes, drug-impaired drivers, and drivers between 16 and 20 years old. Machine learning based models have also been widely used in predicting the severity of road traffic crashes. Reference [7] analyzed 971 traffic accidents that occurred in Abu Dhabi in 2014, consisting of 121 fatal and 135 severe injury cases. They employed DTC, Rule Induction, Naive Bayes Classifier (NBC) and MLP methods. The results indicated that the key factors associated with fatal severity were age, gender, nationality, year of the accident, casualty status and collision type, with the 18-30 years old group being the most vulnerable. Reference [8] investigated the effects of certain traffic and weather parameters on the likelihood of a secondary accident following the occurrence of a traffic accident. They employed MLP and logit models. They identified traffic speed, duration of the primary accident, hourly volume, rainfall intensity and the number of vehicles involved in the primary accident as the top five factors associated with secondary accident likelihood. In addition, changes in traffic speed and volume, the number of vehicles involved, blocked lanes, the percentage of trucks and upstream geometry also influence the probability of a secondary incident. Reference [9] analyzed a total of 1,536 accidents on rural highways in Spain using Bayesian Network models. They aimed to determine the effects of several factors, including driver characteristics, highway features, vehicle characteristics, accident and weather parameters, on accident severity.
Consequently, they identified accident type, driver age, lighting and number of injuries as the factors most associated with accident severity. Reference [10] investigated the application of MLP, DTC and a hybrid combination of DTC and MLP to build models that could predict injury severity. Their dataset contained traffic accident records from 1995 to 2000, a total of 417,670 cases. The full set included labels (year, month, region, primary sampling unit, the number describing the police jurisdiction, case number, person number, vehicle number, and vehicle make and model); inputs (driver's age, gender, alcohol usage, restraint system, ejection, vehicle body type, vehicle age, vehicle role, initial point of impact, manner of collision, rollover, roadway surface condition, light condition, travel speed and speed limit); and the output, injury severity. The injury severity had five classes: no injury, possible injury, non-incapacitating injury, incapacitating injury, and fatal injury. Their results revealed that, for the non-incapacitating injury, incapacitating injury and fatal injury classes, the hybrid approach performed better than MLP, DTC and Support Vector Machines (SVM). For the no injury and possible injury classes, the hybrid approach performed better than MLP, but these two classes could be best modeled directly by DTC. Reference [11] presents several classification models to predict the severity of injuries incurred during traffic accidents. For this purpose, they used a dataset of 34,575 accident cases from 2008 produced by the transport department of the government of Hong Kong. They employed Naive Bayes, J48, AdaBoostM1, PART and Random Forest classifiers and compared their classification accuracy. They used a Genetic Algorithm for feature selection to reduce the dimensionality of the dataset.
They investigated three different cases, Accident, Casualty and Vehicle, to find the causes of accidents and their severity. Their final results showed that Random Forest outperformed the other four algorithms. Reference [12] investigated common features between accidents. They studied road accident data from major national highways passing through the Krishna district for the year 2013 by applying machine learning techniques. They formed clusters using K-medoids, applied expectation maximization algorithms, and used the Apriori algorithm to discover hidden patterns. Their aim was to generate association rules that could reveal the root causes of accidents among different combinations of attributes in a larger dataset. They used density histograms for region-wise visualizations such as fatal versus weather, fatal versus time, time versus day, fatal versus month, fatal versus traffic, and fatal versus age. Their results showed that the selected machine learning techniques are able to extract hidden patterns from the data. The motivation of this study is to examine the role of detailed meteorological weather reports in determining the outcome (fatal or non-fatal) of motor vehicle accidents. It is known that weather affects every single dimension of our daily life, even our moods. However, the weather condition information in the traffic accident datasets of previous studies is kept very simple. The injury severity of accidents could be estimated more accurately if detailed meteorological weather reports were combined with accident records. In this work, machine learning based prediction models are developed to estimate the outcomes of the accidents that occurred in Adana (a southern city of Turkey); in addition, the LR method is used to give a statistical comparison basis for the machine learning methods.

Accident Data and Meteorological Data
This study is based on ten-year crash data consisting of fatal and non-fatal traffic accidents and meteorological records collected in Adana from 2005 to 2015, provided by the General Directorate of Security - Traffic Services Department and the Turkish State Meteorological Service. The dataset is composed of two major sub-datasets. The first one includes Day of Week, Crash Time Period, Location, Division of Road, Roadway Surface, Weather Information, Traffic Control, Pavement Marking, Shoulder, Slope, and Crossing. This dataset consists of 25,015 accident records, of which only 246 are fatal and the rest non-fatal. Due to this unbalanced distribution of the original accident records, it would be impossible to develop accurate prediction models, because any method could simply classify all cases as non-fatal and still achieve over 90% accuracy. Therefore, we kept all the fatal accident records and randomly reduced the number of non-fatal accidents to three times (3:1) and to an equal number (1:1) of the fatal accidents. This process yielded two different datasets: the first consisted of 246 fatal (25%) and 738 non-fatal (75%) accidents, while the second included 246 fatal (50%) and 246 non-fatal (50%) accidents. In this way, all fatal accidents were included in both datasets, and the non-fatal accidents were randomly selected. Then, 10-fold cross-validation was used for both datasets before the application of each method to eliminate the chance factor. The detailed meteorological data were obtained from the Turkish State Meteorological Service. The parameters used are Mean Wind Speed (m/sec), Mean Pressure (hPa), Maximum Temperature (°C), Minimum Temperature (°C), Mean Cloudiness, Mean Relative Humidity (%), Total Global Solar Radiation (cal/cm²), Total Precipitation (mm) and Ground Surface Temperature (°C). All meteorological data are daily measurements. Table 1 shows descriptive statistics of the meteorological dataset.
Considering the period of the study, 2005-2015, only these parameters fulfilled the requirement of continuity. Some of the meteorological observation series had so many missing values that it would have been impossible to complete them with statistical methods.
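The balancing step described above, keeping all fatal records and randomly undersampling the non-fatal ones at a chosen ratio, can be sketched as follows. This is an illustrative Python sketch (the study itself used MATLAB and SPSS); the `undersample` helper and the (features, label) record layout are assumptions of the example, not the authors' code:

```python
import random

def undersample(records, ratio, seed=42):
    """Keep all fatal records plus a random sample of non-fatal ones.

    `records` is a list of (features, label) pairs with label 'fatal'
    or 'nonfatal'; `ratio` is the number of non-fatal records kept per
    fatal record (3 for the 25%/75% dataset, 1 for the 50%/50% one).
    """
    fatal = [r for r in records if r[1] == "fatal"]
    nonfatal = [r for r in records if r[1] == "nonfatal"]
    rng = random.Random(seed)                      # fixed seed for reproducibility
    sample = rng.sample(nonfatal, ratio * len(fatal))
    return fatal + sample
```

With 246 fatal records, `ratio=3` yields the 984-record dataset and `ratio=1` the 492-record dataset used in the study.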

Naive Bayes Classifier (NBC)
Bayes' theorem provides a way of calculating the posterior probability P(y|x) from P(y), P(x) and P(x|y):

P(y|x) = P(x|y) P(y) / P(x)

where:
 P(y|x) is the posterior probability of the class (target) given the predictor (attribute);
 P(y) is the prior probability of the class;
 P(x|y) is the likelihood, i.e. the probability of the predictor given the class;
 P(x) is the prior probability of the predictor.

NBC assumes that the effect of the value of a predictor on a given class is independent of the values of the other predictors. This assumption is called class conditional independence [13][14]. Estimating P(x|y) directly, however, is not easy when x has many dimensions. Under the Naive Bayes assumption the likelihood factorizes as P(x|y) = ∏α P(xα|y), so that each log P(xα|y) is easy to estimate because only one dimension needs to be considered; the estimate of P(y) is not affected by the assumption.
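The factorized posterior above can be sketched in a few lines. This is an illustrative Python sketch with Gaussian per-feature likelihoods (the distribution the study specifies for its MATLAB 'fitcnb' model); the function names and list-of-tuples data layout are assumptions of the example:

```python
import math
from collections import defaultdict

def train_gaussian_nb(X, y):
    """Estimate class priors P(y) and per-feature Gaussian parameters."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    model, n = {}, len(X)
    for c, rows in by_class.items():
        stats = []
        for col in zip(*rows):                     # one column per feature
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-9
            stats.append((mu, var))
        model[c] = (len(rows) / n, stats)          # (prior, feature stats)
    return model

def predict_nb(model, x):
    """Pick the class maximizing log P(y) + sum_a log P(x_a | y)."""
    best, best_lp = None, float("-inf")
    for c, (prior, stats) in model.items():
        lp = math.log(prior)
        for v, (mu, var) in zip(x, stats):         # one term per dimension
            lp += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

Working in log space avoids multiplying many small probabilities, which is why the factorized form is convenient in practice.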

k-Nearest Neighbor (kNN) Method
The kNN methodology relies on a simple distance learning approach whereby an unknown instance is classified according to the majority of its k nearest neighbors in the training set. The nearness is measured by an appropriate distance metric [15]. It is used for classifying objects based on the closest training examples in the feature space. The kNN algorithm is among the simplest of all machine learning algorithms. In the classification process, the unlabeled query point is simply assigned the label of its k nearest neighbors; typically, the object is classified by majority vote over those labels [13], [16]. If k equals 1, the object is simply assigned the class of the nearest object. When there are only two classes, k must be an odd integer to avoid ties; however, there can still be a tie with an odd k in multiclass classification. In this study, the Euclidean distance is used as the distance function for kNN:

d(x, y) = sqrt( Σᵢ₌₁..m (xᵢ − yᵢ)² ),

where x, y ∈ X = Rᵐ.
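The majority-vote rule with Euclidean distance can be sketched as follows. This is an illustrative Python sketch, not the study's MATLAB 'fitcknn' call; the `knn_predict` name and (point, label) training format are assumptions of the example:

```python
import math
from collections import Counter

def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(train, query, k=1):
    """Label the query by majority vote among its k nearest neighbours.

    `train` is a list of (point, label) pairs; with k=1 (as in the
    study) the query simply takes the label of the closest point.
    """
    neighbours = sorted(train, key=lambda p: euclidean(p[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

Note that every prediction scans the whole training set, which is the run-time weakness of kNN on large datasets discussed later in the paper.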

Decision Tree Classifier (DTC)
DTCs are decision trees used for classification. Like any other classifier, a DTC uses the values of the attributes/features of the data to make a (discrete) class label prediction. Structurally, DTCs are organized as decision trees in which simple conditions on (usually single) attributes label the edges between an intermediate node and its children [13], [14], [16], [17]. In this study, the CART implementation of MATLAB, following Breiman et al. 1984, was used. In this model, the Gini Index (GI) was used as the splitting measure. GI is an impurity-based criterion that measures the divergence between the probability distributions of the target attribute's values:

GI(S) = 1 − Σc p(c|S)²,

where p(c|S) is the proportion of instances in the set S belonging to class c. The evaluation criterion for selecting the attribute aᵢ is the reduction in impurity achieved by splitting on it:

ΔGI(aᵢ, S) = GI(S) − Σv (|Sv| / |S|) GI(Sv),

where Sv denotes the subset of S on which aᵢ takes its v-th value. Error-based pruning is employed in the model. The error rate is estimated using the upper bound of the statistical confidence interval for proportions:

ε_UB(T, S) = ε(T, S) + Z_α · sqrt( ε(T, S)(1 − ε(T, S)) / |S| ),

where ε(T, S) denotes the misclassification rate of the tree T on the training set S, Z_α is the inverse of the standard normal cumulative distribution and α is the desired significance level. The growing phase continues until a stopping criterion is triggered [19].
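The Gini impurity and the split-selection criterion above can be sketched directly. This is an illustrative Python sketch of the two formulas, not the MATLAB CART implementation the study used; the function names are assumptions of the example:

```python
def gini_index(labels):
    """GI(S) = 1 - sum_c p(c|S)^2 over the class proportions in S."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_gain(parent_labels, splits):
    """Impurity reduction from splitting the parent into the given subsets:
    GI(S) - sum_v |S_v|/|S| * GI(S_v)."""
    n = len(parent_labels)
    weighted = sum(len(s) / n * gini_index(s) for s in splits)
    return gini_index(parent_labels) - weighted
```

A pure node (all one class) has GI = 0, and a 50/50 binary node has GI = 0.5; CART picks the split with the largest gain.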

Support Vector Machine (SVM)
SVM is a kernel-based learning algorithm in which only a fraction of the training examples is used in the solution (these are called the support vectors), and where the objective of learning is to maximize a margin around the decision surface. The basic idea of applying SVM to classification can be stated briefly as follows: first map the input vectors into a feature space (possibly of higher dimension), either linearly or nonlinearly, according to the selected kernel function; then, within the feature space, seek an optimized linear division, i.e. construct a hyperplane that separates the two classes [18]. Considering classification of two classes with training vectors xᵢ ∈ Rⁿ, i = 1…l, and labels yᵢ ∈ {1, −1}, SVC solves the following problem [20]:

min over w, b, ξ of (1/2) wᵀw + C Σᵢ ξᵢ, subject to yᵢ(wᵀφ(xᵢ) + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0.

The dual of this formulation is

min over α of (1/2) αᵀQα − eᵀα, subject to yᵀα = 0 and 0 ≤ αᵢ ≤ C,

where e is the vector of all ones, C > 0 is the upper bound, and Q is an l × l positive semidefinite matrix with Qᵢⱼ = yᵢyⱼK(xᵢ, xⱼ), K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ). Here the training vectors are implicitly mapped into a higher dimensional space by the function φ. The decision function is

f(x) = sgn( Σᵢ yᵢαᵢK(xᵢ, x) + b ).
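The RBF kernel (the one chosen for the study's models) and the resulting decision function can be sketched as follows. This is an illustrative Python sketch that assumes the multipliers αᵢ and bias b have already been produced by a trained solver (e.g. SMO); the function names and `gamma` parameter are assumptions of the example, not the study's MATLAB 'fitcsvm' interface:

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq)

def svm_decision(support_vectors, alphas, labels, b, x, gamma=1.0):
    """f(x) = sign(sum_i alpha_i * y_i * K(x_i, x) + b).

    Only the support vectors (alpha_i > 0) contribute to the sum,
    which is why SVM uses just a fraction of the training examples.
    """
    s = sum(a * y * rbf_kernel(sv, x, gamma)
            for sv, a, y in zip(support_vectors, alphas, labels))
    return 1 if s + b >= 0 else -1
```

The sign of the kernel-weighted sum places the query on one side of the separating hyperplane in feature space.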

Multilayer Perceptron (MLP)
MLP is a supervised neural network based on the original simple perceptron model, trained with back propagation. It commonly consists of an input layer of source nodes, an output layer and one or more hidden layers of computation nodes (neurons) that increase the learning power of the MLP model. The number of hidden neurons determines the learning capacity of the MLP network; it is recommended to select the network that performs best with the least possible number of hidden neurons [19]. Consider an MLP consisting of a single input, hidden and output layer, with the n-dimensional feature (input) vector denoted by X = (x₁, …, xₙ) and the weight vector by W = (w₁, …, wₙ). The weighted input of each neuron in the hidden layer is

v = Σᵢ wᵢxᵢ.

The calculated value is then passed through an activation function to yield an output value. Taking the activation function at layer j as φ⁽ʲ⁾(·), the output is

y = φ⁽ʲ⁾(v).

The logistic sigmoid is commonly used as the standard activation function in MLPs, and its values range between 0 and 1:

φ(v) = 1 / (1 + e⁻ᵛ),

where e is Euler's number. Thereafter, the error for the computation is calculated as

err = T − y,

where err denotes the difference between the real target (T) and the obtained output of the MLP (y). The system can then be optimized by minimizing

E = (1/2) Σ err².

In order to update the weights, the amount of change is calculated for each weight by partial differentiation and the chain rule:

Δwᵢ = −η ∂E/∂wᵢ,

where η denotes the learning rate. In the final step, each weight is updated as wᵢ ← wᵢ + Δwᵢ. The procedure given above comprises only one "epoch", and the same calculations are repeated until the stopping criteria are reached. MLP is capable of modeling complex functions, is good at ignoring irrelevant inputs and noise, can adapt its weights, and is easy to use. MLPs have been used as the main method, or for comparison, in many studies in the field of traffic accidents.
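One forward pass and one back-propagation update for a tiny one-hidden-layer network can be sketched as follows. This is an illustrative Python sketch of the chain-rule update above using the squared error E = (1/2)·err² (the study's MATLAB 'patternnet' model used cross-entropy and scaled conjugate gradient instead); all names and the network shape are assumptions of the example:

```python
import math

def sigmoid(z):
    """Logistic sigmoid; output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Input -> hidden layer (sigmoid) -> single sigmoid output."""
    h = [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hidden]
    return sigmoid(sum(wo * hi for wo, hi in zip(w_out, h))), h

def sgd_step(x, target, w_hidden, w_out, lr=0.5):
    """One gradient-descent weight update via the chain rule.

    Returns err = target - output (computed before the update).
    """
    out, h = forward(x, w_hidden, w_out)
    err = target - out
    delta_out = err * out * (1 - out)              # -dE/dnet at the output
    for j, hj in enumerate(h):
        delta_h = delta_out * w_out[j] * hj * (1 - hj)  # backpropagated term
        w_out[j] += lr * delta_out * hj            # update output weight
        for i, xi in enumerate(x):
            w_hidden[j][i] += lr * delta_h * xi    # update hidden weights
    return err
```

Repeating `sgd_step` over the whole training set once corresponds to one "epoch" in the description above.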

Logistic Regression (LR)
LR is a predictive analysis used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables. It is a frequently used and well-known statistical analysis [13], [21]. Suppose we have a binary output variable Y and we want to model the conditional probability p(x) = Pr(Y = 1 | X = x) as a function of x, with any unknown parameters estimated by maximum likelihood. One could try letting log p(x) be a linear function of x, so that changing an input variable multiplies the probability by a fixed amount; however, logarithms of probabilities are bounded above while linear functions are unbounded. Through the logistic (or logit) transformation we instead model the log odds linearly:

log( p(x) / (1 − p(x)) ) = β₀ + β·x, i.e. p(x) = 1 / (1 + e^−(β₀ + β·x)).

When p(x) ≥ 0.5 we predict ŷ = 1, and when p(x) < 0.5 we predict ŷ = 0. This means guessing 1 whenever β₀ + β·x is non-negative, and 0 otherwise; so LR gives a linear classifier. The decision boundary separating the two predicted classes is the solution of β₀ + β·x = 0, which is a point if x is one-dimensional and a line if it is two-dimensional [22].
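The decision rule above can be sketched in a few lines. This is an illustrative Python sketch that assumes the coefficients β₀ and β have already been fitted (e.g. by maximum likelihood, as the study did in SPSS); the function names are assumptions of the example:

```python
import math

def logistic(z):
    """p = 1 / (1 + e^{-z}); maps the linear score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lr_predict(beta0, beta, x):
    """Predict 1 iff p(x) >= 0.5, i.e. iff beta0 + beta . x >= 0."""
    p = logistic(beta0 + sum(b * xi for b, xi in zip(beta, x)))
    return 1 if p >= 0.5 else 0
```

Because the 0.5 threshold on p corresponds exactly to a zero threshold on the linear score, the classifier's boundary is the hyperplane β₀ + β·x = 0.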

Performance Metrics
The performance of a classifier is summarized by a matrix known as the confusion matrix, which shows the correctly and incorrectly classified instances for each class. Its entries are the numbers of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) [23]. The measures used to evaluate the performance of a classifier are computed from the generated confusion matrix. Sensitivity and specificity are the most widely used statistics in diagnostic testing. Sensitivity, or True Positive Rate (TPR), is the proportion of positive instances correctly labeled positive (1); Specificity, or True Negative Rate (TNR), is the proportion of negative instances correctly labeled negative (2); the Positive Predictive Value (PPV) is the proportion of positive predictions that are true positives (3); the Negative Predictive Value (NPV) is the proportion of negative predictions that are true negatives (4); and Accuracy (ACC) is the proportion of all instances that are correctly predicted (5):

TPR = TP / (TP + FN)    (1)
TNR = TN / (TN + FP)    (2)
PPV = TP / (TP + FP)    (3)
NPV = TN / (TN + FN)    (4)
ACC = (TP + TN) / (TP + TN + FP + FN)    (5)
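Equations (1)-(5) translate directly into code. This is an illustrative Python sketch; the `confusion_metrics` name is an assumption of the example:

```python
def confusion_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, PPV, NPV and accuracy from the four
    confusion-matrix counts, per equations (1)-(5)."""
    return {
        "TPR": tp / (tp + fn),                     # (1) sensitivity
        "TNR": tn / (tn + fp),                     # (2) specificity
        "PPV": tp / (tp + fp),                     # (3) positive predictive value
        "NPV": tn / (tn + fn),                     # (4) negative predictive value
        "ACC": (tp + tn) / (tp + tn + fp + fn),    # (5) accuracy
    }
```

On an unbalanced dataset like the original 246/25,015 split, ACC alone is misleading, which is why the study also reports sensitivity, specificity and AUC.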

Performance of the Prediction Models
In this study, the NBC, kNN, DTC, SVM, MLP and LR methods were chosen as classifiers in order to provide a deep understanding of the nature of the classification task with a wide range of machine learning methods. The machine learning methods were all applied using MATLAB software, and the LR method was performed with IBM SPSS Statistics software. For each method, the twenty predictor variables obtained from the two abovementioned datasets were provided as input variables, and the severity of the accident (fatal or non-fatal) was set as the output. Specificity, Sensitivity, Accuracy, ROC and AUC results were measured to compare the performances of the classifiers. As mentioned in Section 2.1, the original accident dataset has an unbalanced structure; therefore, two different datasets were created (the first with a 25%/75% fatal/non-fatal ratio, the second with a 50%/50% ratio). Then, 10-fold cross-validation was carried out for these two datasets. For this purpose, the cvpartition method of MATLAB was used within a for-loop: in each fold, the data are randomly separated into 90% training and 10% test data, and all analysis methods are applied in turn. kNN classification was performed with the 'fitcknn' method of MATLAB, with the distance set to Euclidean and the number of neighbors set to 1 (an odd k avoids ties between the two output classes, fatal and non-fatal). NBC was performed with the 'fitcnb' method of MATLAB, with a Gaussian distribution specified to model the data. DTC was performed with the 'fitctree' method of MATLAB, which automatically selects the optimal subset of split algorithms using the known number of classes and levels of a categorical predictor. The parameters were chosen as follows: 'prune=on', 'minparentsize=10', 'AlgorithmForCategorical=PCA', 'qetoler=1E-6' and 'mergeleaves=on'. The SVM classification model was designed by running the 'fitcsvm' function in MATLAB.
Several combinations were tried, and the Radial Basis Function was chosen as the kernel for the performance comparison, while Sequential Minimal Optimization was chosen as the solver for the gradient difference between upper and lower violators. The MLP model, a supervised neural network model, was designed with the 'patternnet' function of MATLAB. It takes three parameters, which were set as 'hiddenSizes=10', 'trainFcn=trainscg' and 'performFcn=crossentropy'. The mean results for the first dataset (25/75 fatal/non-fatal) are given in Table 2; the R-square is 0.467 for LR. Based on the results given in Table 2, the following remarks can be made. In conclusion, only the DTC and LR methods provided a fair classification of fatal instances. As is known, highly non-linear relationships between variables cause models to fail and thus become invalid; DTC, however, does not require any assumption of linearity in the data, which could explain its success in the analysis. LR's success, on the other hand, mainly derived from its high accuracy on the non-fatal instances. This could be attributed to the fact that LR performs well on simple classification problems, and the first dataset contains a large number of non-fatal instances, which could have contributed to LR's success. As a result, DTC and LR can be seen as good classifiers due to their high AUC values, despite their relatively low PPVs.
For the next analysis, the methods were applied with the same parameter settings to the second dataset, consisting of an equal number of non-fatal and fatal (50/50) instances. The obtained results are given in Table 3; the R-square is 0.401 for LR. Based on the analysis results given in Table 3, the following remarks can be made. In conclusion, the results of the second analysis indicated that the accuracy of the complex classification methods significantly increased, and all methods except LR achieved similar rates of overall accuracy, with slightly better results for kNN and DTC. kNN's performance depends on the close neighborhood of similar targets and can yield good predictive accuracy in low dimensions. Likewise, DTC proved to be more accurate in low dimensions. Both kNN and DTC, however, have poor run-time performance when the dataset becomes large: each new query requires kNN to compute the distance to every point in the model, and the more decisions there are in a tree, the less accurate any expected outcome is likely to be. MLP and SVM, on the other hand, are known to produce even more accurate results with high dimensions and large datasets.

Predictor Importance Analysis
The AUC-based method was carried out on the second dataset, consisting of an equal number of fatal and non-fatal accidents, using 10-fold cross-validation. The obtained results are given for each classification method in Table 4. As a whole, no single parameter made a big difference on its own in any of the classification methods; they all made similarly small contributions, whether negative or positive, to the model outputs. The input parameters are listed in descending order by AUC level.
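One common way to compute the AUC values used for such a ranking is the Mann-Whitney formulation: the AUC equals the probability that a randomly chosen positive (fatal) case receives a higher score than a randomly chosen negative one. The sketch below is an illustrative Python implementation of that statistic, not the study's MATLAB procedure; the function name and 0/1 label encoding are assumptions of the example:

```python
def auc(scores, labels):
    """AUC via the Mann-Whitney statistic.

    Counts, over all positive/negative pairs, how often the positive
    case scores higher (ties count as one half).
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))
```

An AUC of 0.5 means a predictor is no better than chance, which is the baseline against which the small per-parameter contributions in Table 4 are judged.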
The following inferences can be made from the results in Table 4.

Conclusions
In this paper, the NBC, kNN, DTC, SVM and MLP methods and the LR statistical method are used to analyze motor vehicle accident data according to the accident outcome (fatal or non-fatal), and the significant factors associated with detailed meteorological reports in traffic accidents are identified. Property damage-only accidents are not included in this study. A total of 20 parameters were used as input to classify accidents into two classes, fatal or non-fatal. The most difficult part of the study was classifying fatal instances accurately due to their low share of the total. The first dataset consisted of 246 fatal and 738 non-fatal cases, while the second included 246 fatal and 246 non-fatal cases. Overall, the DTC and kNN algorithms yielded slightly more accurate results in classifying fatal instances in both datasets. On the other hand, MLP yielded the highest accuracy over non-fatal and fatal instances combined, as well as the highest AUC rate. Although LR performed well on the first dataset, its accuracy decreased significantly on the second dataset. The success of kNN and DTC could be attributed to the low dimensionality of the datasets.
To analyze the predictor importance of the prediction models, an AUC-based input ranking method was used. Based on this method, the Mean Cloudiness, Traffic Control and Ground Surface Temperature variables were found to have a higher weight on the classification results; in addition, the Maximum Temperature and Weather parameters negatively affected the classification performance of all models. The dataset lacks information on driver and vehicle characteristics, which is the main disadvantage of the study. Current traffic accident reports should include information about driver characteristics such as age, gender and education, as well as vehicle characteristics such as model, age and type. With this additional information, a more detailed analysis could be carried out in the future.