An Exploratory Analysis of San Jose Traffic Accidents
Catherine Rauch
Abstract
Traffic data analysis can be used to investigate accident hotspot locations and potentially predict future accidents. This report examines traffic
accident data from 2020 reported in San Jose, California to understand how external factors interact with traffic accidents and how traffic
accidents are distributed by day of the week and hour of the day. The data was manipulated and graphed; findings include that traffic accidents
are more likely to occur near road junctions, during afternoon rush hour, and on Fridays.
Introduction
Background
220 million Americans average an hour and a half each day in their cars, and according to the Bureau of Transportation Statistics, about 3.3 million
people travel at least 50 miles one way to work. That's a lot of people spending a lot of time in a car.
A traffic study is an investigation and analysis of the transportation system of a specific area, undertaken to examine a problem. Examining traffic
accidents can reveal accident hotspot locations, common times of traffic accidents, and the impact of external factors such as weather, visibility,
and wind speed on accidents.
This data is part of A Countrywide Traffic Accident Dataset, which includes information from 49 states collected from February 2016 to December 2020.
The dataset currently contains about 3 million accident records, but this report examines only accidents from 2020 that took place
in San Jose, California.
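The subsetting step described above can be sketched as follows. This is a minimal illustration, not the code used for the report: the sample rows are made up, and the field names "Start_Time", "City", and "State" are assumptions that should be checked against the actual file.

```python
from datetime import datetime

# Made-up sample rows for illustration only. The field names "Start_Time",
# "City" and "State" are assumed, not taken from this report.
records = [
    {"Start_Time": "2020-03-06 17:15:00", "City": "San Jose", "State": "CA"},
    {"Start_Time": "2019-11-02 08:30:00", "City": "San Jose", "State": "CA"},
    {"Start_Time": "2020-07-21 12:05:00", "City": "Oakland", "State": "CA"},
]


def is_san_jose_2020(row):
    """Return True for accidents that started in 2020 in San Jose, CA."""
    started = datetime.strptime(row["Start_Time"], "%Y-%m-%d %H:%M:%S")
    return (
        started.year == 2020
        and row["City"] == "San Jose"
        and row["State"] == "CA"
    )


# Keep only the rows that fall inside the study's scope.
san_jose_2020 = [row for row in records if is_san_jose_2020(row)]
print(len(san_jose_2020))
```

Checking the state as well as the city guards against other cities named San Jose appearing in a countrywide file.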
Aims
The first aim of this analysis was to examine which external variables, if any, are common in traffic accidents. This was approached by sorting, grouping,
and graphing the data to examine the relationships between variables. No conclusive information was found from the data.
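One simple form of the grouping described above is to compute the share of accidents for which a true/false indicator is set. The sketch below is illustrative only: the rows are made up, the field names "Junction" and "Traffic_Signal" are assumptions, and share_true is a hypothetical helper rather than the method used in this report.

```python
# Made-up rows for illustration only; the field names are assumed.
rows = [
    {"Junction": True, "Traffic_Signal": False},
    {"Junction": True, "Traffic_Signal": True},
    {"Junction": False, "Traffic_Signal": False},
    {"Junction": False, "Traffic_Signal": True},
]


def share_true(rows, field):
    """Return the fraction of rows in which the indicator `field` is True."""
    if not rows:
        return 0.0  # avoid dividing by zero on an empty subset
    return sum(1 for row in rows if row[field]) / len(rows)


junction_share = share_true(rows, "Junction")
signal_share = share_true(rows, "Traffic_Signal")
print(junction_share, signal_share)
```

A share computed this way describes only the accidents in the sample; it says nothing about how common junctions or signals are on the road network overall.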
The second aim of this report was to identify a pattern or hotspot within the daily and weekly distribution of traffic accidents. This was approached by
manipulating the time variable, grouping by day of the week or hour of the day, and graphing the results. The main finding was that accidents occur
most often on Fridays and during peak rush hour in the afternoon.
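The time manipulation described above amounts to parsing each start time and counting accidents per weekday and per hour. A minimal sketch, assuming timestamps formatted as "YYYY-MM-DD HH:MM:SS" (the three timestamps are made up for illustration):

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Made-up start times for illustration only.
start_times = [
    "2020-03-06 17:15:00",  # a Friday
    "2020-03-06 08:40:00",  # the same Friday
    "2020-03-09 17:50:00",  # a Monday
]

parsed = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in start_times]

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
by_weekday = Counter(DAY_NAMES[t.weekday()] for t in parsed)
by_hour = Counter(t.hour for t in parsed)

print(by_weekday.most_common(1))
print(by_hour.most_common(1))
```

The resulting counts can be passed directly to a bar chart, with weekdays or hours on the horizontal axis.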
Materials and methods
Data description
The data describes individual traffic accidents in the United States, collected since 2016 from APIs that stream traffic event data. It is publicly
available:
Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. "A Countrywide Traffic Accident
Dataset." 2019.
It was collected by Sobhan Moosavi and colleagues for their research:
Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk
Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In Proceedings of the 27th ACM SIGSPATIAL International
Conference on Advances in Geographic Information Systems, ACM, 2019.
Sample and measurement information
The samples were collected using two real-time data providers, “MapQuest Traffic” and “Microsoft Bing Map Traffic”. These APIs broadcast traffic
events captured by the state department of transportation, law enforcement agencies, traffic cameras, and traffic sensors. Data was collected every 90
seconds from 6am-11pm, and every 150 seconds from 11pm-6am. The relevant population comprises traffic accidents reported in 2020 that occurred
in San Jose, Santa Clara County, California. Because the sample was collected nonrandomly, there is no scope of inference; however, the information
collected in this study can be relevant without needing to generalize or infer about all traffic accidents in the population.
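The collection schedule above implies a fixed number of polls per day, which can be worked out with a short sketch. The function name is hypothetical, and two assumptions are made that the source does not state: the 6am boundary is treated as inclusive and the 11pm boundary as exclusive, and polling is assumed to run without interruption.

```python
def polling_interval_seconds(hour):
    """Return the assumed collection interval for an hour of day (0-23)."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    # Daytime window 6am-11pm: every 90 seconds; otherwise every 150 seconds.
    return 90 if 6 <= hour < 23 else 150


# 17 daytime hours at 40 polls per hour plus 7 night hours at 24 polls per hour.
polls_per_day = sum(3600 // polling_interval_seconds(h) for h in range(24))
print(polls_per_day)
```

Under these assumptions each provider would be polled 848 times per day (680 daytime polls plus 168 overnight polls).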
Data structure
For this report, the observational units are individual traffic accidents. The variables are severity (impact on traffic), start/end time, location
(street/city/county), weather conditions (temperature/visibility/wind speed), and several true or false indicator variables regarding points of interest
around the accident (junction/traffic signal/stop sign). Specifics of these variables can be found in Table 1.
Table 1: variable descriptions and units for each variable in the dataset.
Name Variable description Units of measurement