Accident Severity Analysis using Machine Learning
Authors: Paras Mehan, Isha Gupta, Raghav Gupta
Road accidents have a huge economic and societal impact, costing hundreds of billions of dollars every year. Reducing accidents, especially serious ones, is an important challenge. If we can better understand the critical factors influencing an accident, we might be able to implement well-informed actions and better allocate financial and human resources. In this project, we use machine learning techniques to model the severity of accidents using factors that are readily available without much investigation of the accident site. Then, we identify the key factors responsible for determining the severity of an accident.
Introduction
The objective of our project is to identify the key factors affecting accident severity. We train a model that predicts Accident Severity, a number from 1 to 4, where 1 indicates the least impact on traffic and 4 indicates a significant impact on traffic.
To be specific, for a given accident, without any detailed information such as driver information or vehicle type, the model predicts how likely the accident is to be a severe one. Such a model could take in traffic accident reports as they arrive and flag severe accidents in real time.
After training a model to predict accident severity, we use methods such as Gini importance to find out which factors are more important than others for predicting the severity of an accident.
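As a rough illustration of Gini importance, the sketch below fits a scikit-learn random forest on synthetic data and ranks the features by `feature_importances_`. The data and the feature names are made up for the example and are not the project's actual features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 600
# Synthetic stand-in data: only the first column carries any signal.
X = rng.normal(size=(n, 3))
y = (X[:, 0] > 0).astype(int)
feature_names = ["informative", "noise_a", "noise_b"]  # hypothetical names

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ is the mean decrease in Gini impurity per feature,
# normalised so that the values sum to 1.
importances = dict(zip(feature_names, forest.feature_importances_))
ranked = sorted(importances, key=importances.get, reverse=True)
print(ranked)
```

Features that the trees split on often, and with large impurity reductions, end up at the front of `ranked`.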
Literature Survey
Moosavi et al. [3] propose a “Deep Accident Prediction” (DAP) model for real-time accident prediction over a geographical region of reasonable size. They use a variety of features such as weather, points of interest, and time to predict whether an accident is going to happen or not. They also published their dataset, the US Accidents dataset, which at the time contained 2.25 million traffic accidents in the US and which we use for our study.
Assi et al. [1] predict crash severity using vehicle attributes and road condition attributes. The authors also propose a fuzzy c-means based support vector machine (SVM-FCM) to predict severity. They cluster the data points using fuzzy c-means clustering and then train two models for each cluster (one FNN and one SVM) to predict severity within that cluster. At test time, they assign each test data point to a cluster and use the corresponding model to predict severity.
Underwood [4] presents a range of factors that influence an individual’s approach to accident analysis and can prevent the adoption and usage of analysis techniques.
Dataset
We used the US Accidents dataset by Moosavi et al. [2]. It consists of 3.5 million traffic accidents that took place in the United States from February 2016 to June 2020.
Description
The data consists of 49 attributes:
- The severity of the accident
- Location of the Accident with start and end coordinates, state, country, zip code, airport code, etc.
- Weather conditions like wind speed, temperature, humidity, pressure, visibility, weather condition(text), etc.
- Time of the accident, i.e. start time, end time, and sunrise/sunset information indicating whether it was night or day
- Road descriptions during the accident like a bump, crossing, junction, amenity, etc.
Dropping Columns
Several columns either had many null values or were not related to the severity of accidents. We removed the following columns:
- TMC: One-third of the values in this column were null.
- End Latitude and End Longitude: Two-thirds of the values were missing.
- Source, Timezone, and Weather Timestamp: These columns are unrelated to the severity of the accident.
- Description of the Accident: Advanced NLP techniques would be required to process this information.
- Street, Street Number, City, State, Country, County, Zip Code, Airport Code: Redundant columns, since the exact GPS coordinates of the accident are given. These are easier to read, but mathematically only the GPS coordinates are helpful. These columns were, however, used for EDA to see the frequency distribution across states.
- Amenity, Bump, Give Way, No Exit, Railway, Round-about, Station, Stop, Traffic Calming, Turning Loop: These are boolean columns that are heavily imbalanced; only about 10% of the values differ from the majority value.
- Civil Twilight, Nautical Twilight, Astronomical Twilight: The Sunrise Sunset field already indicates the time of day and these fields reported nearly identical results, so they were removed.
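The column-dropping step above can be sketched in pandas as follows. This is only an illustration on a tiny hand-made frame: the column names and the 30% null threshold are assumptions, not values taken from our code.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame; column names are assumed for illustration.
df = pd.DataFrame({
    "Severity": [2, 3, 2, 4, 2, 3],
    "TMC": [201.0, np.nan, 201.0, np.nan, 241.0, 201.0],
    "End_Lat": [np.nan, np.nan, 40.1, np.nan, np.nan, 39.9],
    "Source": ["A", "B", "A", "A", "B", "A"],
    "Temperature(F)": [60.0, 55.0, np.nan, 70.0, 65.0, 58.0],
})

# Fraction of missing values in each column.
null_fraction = df.isna().mean()

# Drop columns that are too sparse (illustrative 30% threshold)
# together with columns judged unrelated to severity.
sparse_cols = null_fraction[null_fraction > 0.3].index.tolist()
unrelated_cols = ["Source"]
cleaned = df.drop(columns=sparse_cols + unrelated_cols)
print(cleaned.columns.tolist())
```

Inspecting `null_fraction` first makes it easy to see which columns are worth imputing and which are too sparse to keep.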
Feature Engineering
- Binning: We split Start Time into Year, Month, Day, and Time in Seconds. We further binned Time in Seconds into Time of Day, i.e. morning, afternoon, evening, and night. We binned Month into Seasons, i.e. ’Spring’, ’Summer’, ’Autumn’, and ’Winter’, and Day into Day Type, i.e. Weekday and Weekend.
- Feature Addition: We added a new feature, Duration = End Time - Start Time. It signifies the time taken for the traffic to clear after the accident occurred. The End Time column was dropped after introducing the Duration feature.
- Handling Categorical Values: There were two columns with categorical values, Wind Direction and Weather Condition. We noticed that these columns had some noise which needed to be cleaned. We simplified Wind Direction into 10 unique classes by merging similar values. Weather Condition contained 100+ unique values, many of which were closely related to each other. We grouped these values into seven categories: ’Clear’, ’Cloud’, ’Rain’, ’Heavy Rain’, ’Snow’, ’Heavy Snow’, and ’Fog’. We performed One-Hot Encoding on these two columns to be able to use them in our models.
- Handling NA values: We filled NA values with the column mean in Temperature, Wind Chill, Humidity, Pressure, Visibility, Wind Speed, and Precipitation. Precipitation had a large number of missing values, but we still chose to keep it: if precipitation was a factor in the accident, it would most likely be mentioned in the report. Precipitation values in our data, which covers the USA, are generally low, so if a value was not recorded, chances are that precipitation was not a factor in the accident.
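The feature engineering steps above might look roughly like this in pandas. It is a minimal sketch on three made-up rows: the column names, the use of minutes for Duration, the season boundaries, and the keyword rules in `simplify_weather` are all assumptions for illustration.

```python
import pandas as pd

# Tiny stand-in frame; the column names are assumptions for illustration.
df = pd.DataFrame({
    "Start_Time": pd.to_datetime(
        ["2019-01-05 08:15:00", "2019-07-10 18:40:00", "2019-10-14 23:05:00"]),
    "End_Time": pd.to_datetime(
        ["2019-01-05 09:00:00", "2019-07-10 19:10:00", "2019-10-15 01:05:00"]),
    "Weather_Condition": ["Light Rain", "Mostly Cloudy", "Fair"],
    "Precipitation(in)": [0.1, None, None],
})

# Feature addition: duration in minutes (the unit is an assumption).
df["Duration"] = (df["End_Time"] - df["Start_Time"]).dt.total_seconds() / 60

# Binning: weekday/weekend and season derived from the start time.
df["Day_Type"] = df["Start_Time"].dt.dayofweek.map(
    lambda d: "Weekend" if d >= 5 else "Weekday")
season_of_month = {12: "Winter", 1: "Winter", 2: "Winter",
                   3: "Spring", 4: "Spring", 5: "Spring",
                   6: "Summer", 7: "Summer", 8: "Summer",
                   9: "Autumn", 10: "Autumn", 11: "Autumn"}
df["Season"] = df["Start_Time"].dt.month.map(season_of_month)

def simplify_weather(text):
    """Merge raw weather strings into coarse categories (illustrative rules)."""
    text = text.lower()
    if "rain" in text:
        return "Rain"
    if "cloud" in text:
        return "Cloud"
    return "Clear"

df["Weather"] = df["Weather_Condition"].map(simplify_weather)

# Missing values: fill with the column mean.
df["Precipitation(in)"] = df["Precipitation(in)"].fillna(
    df["Precipitation(in)"].mean())

# One-hot encode the simplified categorical column.
encoded = pd.get_dummies(
    df.drop(columns=["Weather_Condition"]), columns=["Weather"])
print(encoded.filter(like="Weather_").columns.tolist())
```

The same pattern (derive, simplify, impute, encode) extends to the other columns, such as Wind Direction and Time of Day.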
Preparation of Training and Testing Data
First, we split the dataset in a 7:3 ratio using stratified splitting. Then we normalized the data using MinMaxScaler so that features with large absolute values do not dominate the model.
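A minimal sketch of this preparation step with scikit-learn is shown below, using random stand-in data. The class proportions and the `random_state` are arbitrary choices for the example; fitting the scaler on the training split only is our assumption about a sensible way to avoid leaking test statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Stand-in features and imbalanced severity labels (1 to 4).
X = rng.normal(loc=50, scale=20, size=(1000, 4))
y = rng.choice([1, 2, 3, 4], size=1000, p=[0.05, 0.65, 0.25, 0.05])

# 7:3 split that preserves the class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Fit the scaler on the training data only, then apply it to both splits.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(), X_train_scaled.max())
```

After scaling, every training feature lies in [0, 1]; test values can fall slightly outside that range because the scaler never saw them.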
Methodology
Initially, we trained our model using the start longitude and latitude of the accidents, but we later decided not to include them because the model was overfitting on the longitude and latitude features and ignoring the others.
Logistic Regression Classifier
We used grid search over the Solver, Loss Function, and C (inverse of regularization strength), training multiple models to find the optimum hyperparameters for the Logistic Regression Classifier.
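The grid search could be set up roughly as follows with scikit-learn's `GridSearchCV`. This is a sketch on synthetic data: the grid values, the 3-fold cross-validation, and `max_iter` are assumptions, and only the solver and C are searched here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class problem standing in for the severity labels.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0)

# Illustrative grid; C is the inverse of the regularization strength.
param_grid = {
    "solver": ["lbfgs", "saga"],
    "C": [0.01, 0.1, 1.0, 10.0],
}

search = GridSearchCV(
    LogisticRegression(max_iter=2000), param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`search.best_estimator_` is refit on all the data passed to `fit` and can then be evaluated on the held-out test split.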
Decision Tree Classifier
Decision Tree Classifier is a tree-based model that has hyperparameters including max depth, max features used for splitting, and the type of splitter used.
Random Forest Classifier
Building on top of decision trees, we used grid search to determine the optimal number of trees in the random forest classifier. The number of trees is the most dominant hyperparameter in random forests, as they classically use unpruned trees as the base classifiers of the ensemble.
ADA Boost Classifier
ADA Boost is also a type of ensemble learning, in which the trees are weighted and grown sequentially: each tree is trained on top of the previous one, i.e. the weak classifiers are not independent of each other. We used grid search to estimate the optimal number of trees in the ensemble.
Gradient Boost Classifier
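Gradient Boost is also an ensemble learning algorithm. It can be viewed as a generalization of AdaBoost: AdaBoost corresponds to one specific loss function (the exponential loss), whereas Gradient Boost works with any differentiable loss function.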
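To make the comparison of the three ensemble methods concrete, the sketch below trains scikit-learn's versions of each on the same synthetic split. The data, the 100 estimators per model, and the default settings are assumptions for illustration and do not reproduce our tuned models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import train_test_split

# Synthetic 4-class data standing in for the accident features.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    # Independent, fully grown trees whose votes are averaged (bagging).
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Sequential weak learners, each reweighting the previous errors.
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    # Sequential trees, each fitted to the gradient of the loss.
    "Gradient Boost": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.3f}")
```

Keeping the split fixed across models means the accuracies are directly comparable.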
Support Vector Machine
We also tried SVC to classify severity. Since the training cost of SVC grows very quickly with the number of samples and our dataset has millions of data points, we took a random sample of 2 lakh (200,000) data points to train the SVC model.
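Subsampling before fitting an SVC can be sketched as below. The data is synthetic and much smaller than ours; `sample_size` stands in for the 200,000 points, and the RBF kernel is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data standing in for the full training set.
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0)

# Draw a random subset without replacement to keep training tractable.
rng = np.random.default_rng(0)
sample_size = 500  # stand-in for the 200,000 points used on the real data
idx = rng.choice(len(X), size=sample_size, replace=False)

svc = SVC(kernel="rbf").fit(X[idx], y[idx])

# Evaluate on the points that were not sampled.
mask = np.ones(len(X), dtype=bool)
mask[idx] = False
holdout_accuracy = svc.score(X[mask], y[mask])
print(round(holdout_accuracy, 3))
```

A stratified sample would be a natural refinement when the severity classes are imbalanced.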
Neural Network
We tried many neural networks with one to six fully connected hidden layers and different activation functions such as Leaky-ReLU, ReLU, Tanh, Linear, and Sigmoid. The best validation accuracy was obtained with the parameters shown in Fig. 4.
Results
Exploring Models
In our experiments, we extensively explored the machine learning models (in Sec. 4) to predict the accident severity class; the results are given in Table 2. Logistic Regression gave low training and testing accuracy, which suggests that the data is not linearly separable. The decision tree performed much better, which suggested that tree-based ensemble methods would likely do better still. Indeed, the AdaBoost, Gradient Boost, and Random Forest models perform best, with 69% to 74% accuracy. SVM performed poorly because the dataset was too large to train it on in full; we used only a sample of the training data, so it generalized poorly on the test set. Neural networks also performed decently, with an accuracy of 71%. Overall, Random Forest gives the best accuracy, 0.74 on the test data.
As shown in Fig. 7 and Fig. 8, the duration is an important feature for classifying the severity of an accident. We also observed that the weather conditions are among the most important features affecting the severity of the accident.
Exploring Features
Next, we wanted to test the importance of different features for classifying severity. To do so, we first grouped similar features into feature classes, as shown below:
- Distance(mi)
- Side
- Temperature Condition: Temperature(F), Humidity(%), Pressure(in), Visibility(mi), Precipitation(in)
- Wind Conditions: Wind Chill(F), Wind Speed(mph), Wind Direction E, Wind Direction N, Wind Direction NW, Wind Direction S, Wind Direction SE, Wind Direction SW, Wind Direction VAR, Wind Direction W
- Rain Condition: Clear, Cloud, Rain, Heavy Rain, Snow, Heavy Snow, Fog
- Junction, Crossing, Traffic Signal
- Sunrise/Sunset
- Duration
- Time of Day: TimeofDay Early Morning, TimeofDay Evening, TimeofDay Morning
- Seasons: Season Spring, Season Summer, Season Winter
- Day Type
Then, for each of the classes, we trained and tested the Random Forest model (since it gave the best accuracy among all the models we tried), excluding only the features of that class and including all the other features. The results of this experiment are shown in Table 3. Based on the results, we observe that:
- Distance is the most important feature: removing it causes a significant drop in prediction accuracy.
- Distance is closely followed by Temperature Conditions and Sunrise/Sunset as the next most important features.
- The least important feature class is Rain Conditions.
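The leave-one-group-out experiment can be sketched as follows. The data here is synthetic, with a label that depends on distance only, so the outcome is built in; the group definitions are a shortened, assumed version of the feature classes listed above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 800
# Synthetic features; only distance drives the label in this toy example.
df = pd.DataFrame({
    "Distance(mi)": rng.exponential(1.0, n),
    "Temperature(F)": rng.normal(60, 15, n),
    "Humidity(%)": rng.uniform(10, 100, n),
    "Duration": rng.exponential(60, n),
})
y = (df["Distance(mi)"] > 0.7).astype(int)

# Hypothetical feature classes, each mapping to one or more columns.
feature_groups = {
    "Distance": ["Distance(mi)"],
    "Temperature Condition": ["Temperature(F)", "Humidity(%)"],
    "Duration": ["Duration"],
}

def holdout_accuracy(columns):
    """Train a random forest on the given columns and score it on a 30% holdout."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[columns], y, test_size=0.3, stratify=y, random_state=0)
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    return model.fit(X_train, y_train).score(X_test, y_test)

baseline = holdout_accuracy(list(df.columns))

# Accuracy drop when one feature class is excluded and all others are kept.
drops = {}
for group, cols in feature_groups.items():
    kept = [c for c in df.columns if c not in cols]
    drops[group] = baseline - holdout_accuracy(kept)

print(max(drops, key=drops.get))
```

A larger drop means the model relied more heavily on that feature class; drops near zero suggest the class is redundant given the remaining features.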
Conclusion
We started off by exploring the problem of predicting the severity of accidents and understood its importance.
We studied our dataset, which initially consisted of 49 features and 3.5 million records. After careful data analysis, we removed less useful and redundant columns, dropped rows with incomplete data, and added more useful and conclusive features. We also removed features that could add to overfitting and bias.
After creating our initial baseline models, we moved towards training more advanced models like Decision Trees, Random Forest, ADA Boost, Gradient Boost, and SVM. We used Grid Search to get the optimal hyperparameters for these models and compared the results.
We also tried neural networks with various architectures. From our results, we found that Random Forest performed best. This could be because it is an ensemble method that combines multiple unpruned trees. Decision Tree with optimal parameters also performed quite well, followed by Gradient and ADA Boost. Due to the high complexity of SVC, only limited data could be passed for training, and hence it performed somewhat worse than the other models.
Finally, to explore features, we grouped similar features into different classes and then trained the Random Forest model while removing the features of one class at a time. From this experiment, we found that distance is the most important feature, while rain has less importance, which is a very counterintuitive result.
Link to the GitHub Repository
This project is done for the CSE-343 Machine Learning Course at IIIT Delhi.
Professor: Dr. Jainendra Shukla (IIITD Faculty Profile, LinkedIn)
Team Members
Paras Mehan:
Isha Gupta:
Raghav Gupta
References
[1] Khaled Assi, Syed Masiur Rahman, Umer Mansoor, and Nedal Ratrout. Predicting crash injury severity with machine learning algorithm synergized with clustering technique: A promising protocol. International journal of environmental research and public health, 17(15):5497, 2020.
[2] Sobhan Moosavi, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. A countrywide traffic accident dataset. CoRR, abs/1906.05409, 2019.
[3] Sobhan Moosavi, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. Accident risk prediction based on heterogeneous sparse data: New dataset and insights. In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 33–42, 2019.
[4] Peter Underwood and Patrick Waterson. Accident analysis models and methods: guidance for safety professionals. Loughborough University, 2013.