Abstract: |
Road traffic safety is a crucial issue that affects millions of people worldwide. Traffic accidents can cause severe injuries, disabilities, and fatalities, making them one of the leading causes of death globally. In Saudi Arabia, road traffic accidents (RTAs) account for a significant share of deaths: 9.19% of deaths, or 12,317 fatalities, in 2020. The Eastern Province is one of the most densely populated regions in the country, covering an area of around 672,522 square kilometers with a population of over five million residents. Increased motorization since the 1970s oil boom has led to a surge in road accidents in the area.
This study examined different machine learning algorithms for predicting crash severity in the Eastern Province of Saudi Arabia, using accident data from 2009 to 2021. The data were prepared for modeling by cleaning, imputing missing values, and grouping. The geographic and temporal distribution of the data was examined to ensure its reliability. Correlation analysis was used to assess the strength of the relationships between variables, and new features were constructed from existing ones through a feature construction process. To address the class imbalance in the dataset, the Synthetic Minority Oversampling Technique (SMOTE) was applied. Data transformation was also employed, applying transformation functions to existing features to create new ones. The most informative features were selected using recursive feature elimination with cross-validation (RFECV), a wrapper technique. The final dataset contained 31,728 valid traffic accidents, with 7,293 fatalities and 24,435 injuries.
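The oversampling idea behind SMOTE can be illustrated with a minimal standard-library sketch. It is not the study's implementation: the function name `smote_sketch`, its parameters, and the toy feature vectors are all hypothetical. Each synthetic point is placed at a random position on the segment between a minority-class sample and one of its k nearest minority-class neighbours.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration only; `minority` is a list of
    equal-length numeric tuples belonging to the minority class.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy, made-up feature vectors; not values from the study's dataset
fatal = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(fatal, n_new=6, k=2)
print(len(new_points))
```

Because every synthetic point is a convex combination of two real minority samples, it always stays inside the region the minority class already occupies. Oversampling of this kind is normally applied to the training folds only, so that validation data contain no synthetic records.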
Four common classification algorithms were trained and evaluated: Logistic Regression (LR), Decision Tree Classifier (DTC), Random Forest Classifier (RFC), and eXtreme Gradient Boosting (XGBoost). The models' performance was optimized through several preprocessing techniques, hyperparameter tuning with RandomizedSearchCV, and a custom refit strategy. Preprocessing converted categorical data into numerical data using one of three methods (ordinal encoding, one-hot encoding, or dummy encoding) and scaled numerical data using one of four strategies (no scaling, standardization, normalization, or standardization followed by normalization). Each configuration was run either with the selected features or with all features. Cross-validation was used to ensure generalization, and the custom refit strategy was implemented to avoid overfitting.
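The abstract does not specify the custom refit strategy, so the following is only one plausible sketch. scikit-learn's `RandomizedSearchCV` accepts a callable for `refit` that receives `cv_results_` and returns the index of the chosen candidate. The rule used here, its name `refit_least_overfit`, and the `max_gap` tolerance are assumptions: among candidates whose train-validation gap is within the tolerance, pick the one with the highest mean validation score.

```python
def refit_least_overfit(cv_results, max_gap=0.05):
    """Return the index of the best candidate whose train/validation gap is small.

    Hypothetical selection rule, not the study's. Expects the single-metric
    keys "mean_train_score" and "mean_test_score" of a cv_results_-like dict.
    """
    train = cv_results["mean_train_score"]
    test = cv_results["mean_test_score"]
    indices = range(len(test))
    # Candidates whose train-validation gap stays within the tolerance
    eligible = [i for i in indices if train[i] - test[i] <= max_gap]
    if not eligible:
        # No candidate meets the tolerance: fall back to the smallest gap
        return min(indices, key=lambda i: train[i] - test[i])
    return max(eligible, key=lambda i: test[i])


# Made-up scores for three candidates; not results from the study
fake_results = {
    "mean_train_score": [0.99, 0.90, 0.86],
    "mean_test_score": [0.88, 0.87, 0.85],
}
best_index = refit_least_overfit(fake_results)
print(best_index)
```

In a real search, the function would be passed as `refit=refit_least_overfit` together with `return_train_score=True`, since the training scores are otherwise absent from `cv_results_`. In the toy data, the first candidate has the highest validation score but is rejected because its gap of 0.11 exceeds the tolerance.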
Each model, fitted with its best parameters, was evaluated using seven different metrics. The DTC and XGBoost models exhibited the best performance, and the choice between them depends on the desired balance between performance and execution time. For projects that require high performance and can afford greater computational resources, the XGBoost model may be the optimal choice. If computational resources are limited and a quick solution is needed, the LR or DTC model may be more appropriate. Different preprocessing strategies are worth exploring, since they can significantly affect a model's performance. Applying these algorithms to predict crash severity in Saudi Arabia's Eastern Province can help improve road safety, inform targeted safety strategies, and reduce the societal and economic costs of traffic accidents. |