6/11/2021
project-final-report
file:///C:/Users/Catherine/Downloads/project-final-report (4).html
1/5
An Exploratory Analysis of San Jose Traffic Accidents
Catherine Rauch
Abstract
Traffic data analysis can be used to investigate accident hotspot locations and potentially predict future accidents. This report examines traffic
accident data from 2020 reported in San Jose, California to understand how external factors interact with traffic accidents as well as how traffic
accidents are distributed on a weekly and hourly basis. The data was manipulated and graphed. Findings include that traffic accidents are more
likely to occur near road junctions than near other recorded points of interest, during afternoon rush hour, and on Fridays.
Introduction
Background
220 million Americans average an hour and a half each day in their cars, and according to the Bureau of Transportation Statistics, about 3.3 million
people travel at least 50 miles to work one way. That's a lot of people spending a lot of time in a car.
A traffic study is an investigation and analysis of the transportation system of a specific area to examine a problem. By examining traffic accidents, you
can learn about accident hotspot locations, common times of traffic accidents, and the impact of external conditions such as weather, visibility, and
wind speed on accidents.
This data is part of A Countrywide Traffic Accident Dataset, which includes information from 49 states collected from February 2016 to December 2020.
Currently, there are about 3 million accident records in this dataset, but this report will only be examining recent data from 2020 that took place only
in San Jose, California.
Aims
The first aim of this analysis was to examine what external variables, if any, are common in traffic accidents. This was approached by sorting, grouping,
and graphing the data to examine the relationships between variables. No conclusive relationship was found in the data.
The second aim of this report was to identify a pattern or hotspot in the daily and weekly distribution of traffic accidents. This was approached by
manipulating the time variable, grouping by day of the week or hour, and graphing. The main finding was that most accidents occur on Fridays and
during peak rush hour in the afternoon.
Materials and methods
Data description
This data is information about individual traffic accidents in the United States collected since 2016 from APIs streaming traffic event data. It is publicly
available:
Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. "A Countrywide Traffic Accident
Dataset." 2019.
It was collected by Sobhan Moosavi and team for their research.
Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk
Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In proceedings of the 27th ACM SIGSPATIAL International
Conference on Advances in Geographic Information Systems, ACM, 2019.
Sample and measurement information
The samples were collected using two real-time data providers, “MapQuest Traffic” and “Microsoft Bing Map Traffic”. These APIs broadcast traffic
events captured by state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors. Data was collected every 90
seconds from 6am to 11pm, and every 150 seconds from 11pm to 6am. The relevant population comprises traffic accidents reported in 2020 that occurred
in San Jose, Santa Clara County, California. Because the sample was collected nonrandomly, there is no scope of inference. However, the information
collected from this study can be relevant without needing to generalize or infer about all traffic accidents in the population.
Data structure
For this report, the observational units are individual traffic accidents. The variables are severity (impact on traffic), start/end time, location
(street/city/county), weather conditions (temperature/visibility/wind speed), and several true or false indicator variables regarding points of interest
around the accident (junction/traffic signal/stop sign). Specifics of these variables can be found in Table 1.
Table 1: variable descriptions and units for each variable in the dataset.
Name | Variable description | Units of measurement
ID | Unique identifier for the accident |
Severity | Severity of the accident | Number between 1 and 4
Start_Time | Start time of the accident in local time zone | Date & 24-hour Clock
End_Time | End time of the accident in local time zone; refers to when the impact of the accident on traffic flow was dismissed | Date & 24-hour Clock
Street | Street name in address record |
Temperature(F) | Temperature at time of accident | Fahrenheit (F)
Distance | Length of the road extent affected by the accident | Miles (mi)
Visibility(mi) | Visibility at time of accident | Miles (mi)
Wind_Speed(mph) | Wind speed at time of accident | Miles per hour (mph)
Weather_Condition | Weather condition at time of accident | Rain, Snow, Thunderstorm, Fog, etc.
Crossing | T/F indicates presence of crossing nearby |
Junction | T/F indicates presence of junction nearby |
Railway | T/F indicates presence of railway nearby |
Station | T/F indicates presence of station nearby |
Stop | T/F indicates presence of stop nearby |
Traffic_Signal | T/F indicates presence of traffic_signal nearby |
Sunrise_Sunset | Period of day based on sunrise/sunset | Day/Night
The first few rows of the data are shown in Table 2.
Table 2: example rows of traffic accident data.
ID | Severity | Start_Time | End_Time | Street | Temperature(F) | Visibility(mi) | Distance(mi) | Wind_Speed(mph) | Weather_Condition | Crossing | Junction
A-532 | 2 | 2020-12-19 01:08:00 | 2020-12-19 03:20:48 | I-280 S | 39.0 | 10.0 | 0.153 | 5.0 | Fair | False | False
A-670 | 3 | 2020-06-20 05:36:19 | 2020-06-20 08:20:25 | I-280 N | 56.0 | 9.0 | 0.000 | 6.0 | Partly Cloudy | False | True
A-815 | 2 | 2020-01-16 18:57:00 | 2020-01-16 19:32:23 | I-880 N | 43.0 | 10.0 | 0.000 | 9.0 | Partly Cloudy | False | False
A-1667 | 2 | 2020-02-08 18:50:00 | 2020-02-08 19:44:01 | I-280 S | 54.0 | 10.0 | 0.000 | 9.0 | Fair | False | True
A-1777 | 2 | 2020-12-14 00:45:00 | 2020-12-14 03:06:02 | Bayshore Fwy N | 45.0 | 9.0 | 0.253 | 3.0 | Fair | False | False
Methods
Exploratory analysis was conducted on the dataset, aimed at understanding the relationship between external factors and traffic accidents as well as
how time contributed to traffic accidents. Distance-based quantitative variables were converted to feet. Then, relationships between variables were
explored with bar charts and scatterplots. Categorical and indicator variables were explored using Pandas functions, alongside arithmetic
operations to calculate percentages, in order to understand what factors were most present in the data. Lastly, accidents were grouped by hour and
day of the week, then graphed as point/line charts to visualize distributions of accidents over time.
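The steps described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the report: it uses only the five example rows from Table 2, and the derived column names (Distance(ft), Hour, Day_of_Week) are assumptions chosen for readability.

```python
import pandas as pd

# Five example rows copied from Table 2 (not the full 2020 San Jose data).
accidents = pd.DataFrame({
    "ID": ["A-532", "A-670", "A-815", "A-1667", "A-1777"],
    "Start_Time": [
        "2020-12-19 01:08:00",
        "2020-06-20 05:36:19",
        "2020-01-16 18:57:00",
        "2020-02-08 18:50:00",
        "2020-12-14 00:45:00",
    ],
    "Distance(mi)": [0.153, 0.000, 0.000, 0.000, 0.253],
})

FEET_PER_MILE = 5280

# Convert the distance-based variable from miles to feet.
# "Distance(ft)" is a hypothetical column name.
accidents["Distance(ft)"] = accidents["Distance(mi)"] * FEET_PER_MILE

# Parse the timestamps, then derive hour of day and day of week.
# pandas numbers days 0 (Monday) through 6 (Sunday).
accidents["Start_Time"] = pd.to_datetime(accidents["Start_Time"])
accidents["Hour"] = accidents["Start_Time"].dt.hour
accidents["Day_of_Week"] = accidents["Start_Time"].dt.dayofweek

# Count accidents in each hour bucket and on each day of the week.
accidents_per_hour = accidents.groupby("Hour")["ID"].count()
accidents_per_day = accidents.groupby("Day_of_Week")["ID"].count()
```

The two grouped series are what would then be passed to a plotting library to produce point/line charts like Figures 5 and 6.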
Results
Exploratory analysis focused on what factors were present during a traffic accident.
Figure 1 shows the count totals of accidents that occurred during the day and night.
Figure 1: Count total for Daytime accidents and Nighttime accidents.
[Figure 1 image: bar chart of Count of Records (y-axis, 0 to 2,500) for Day and Night (x-axis).]
55% of accidents occurred during the day (2598 out of a total of 4733). This would indicate the data is split fairly evenly, with neither daytime
nor nighttime accidents being markedly more common.
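The percentage above follows directly from the two counts given in the text. As a quick check, using only those two figures (the nighttime count is derived by subtraction and is not stated in the report):

```python
day_count = 2598     # daytime accidents, from the text
total_count = 4733   # all accidents, from the text

# Nighttime count is inferred as the remainder.
night_count = total_count - day_count

day_share = 100 * day_count / total_count
night_share = 100 * night_count / total_count

print(f"Day: {day_share:.1f}%  Night: {night_share:.1f}%")
# prints "Day: 54.9%  Night: 45.1%"
```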
Figure 2: Percentage breakdown of severity of accidents.
[Figure 2 image: bar chart of Percentage (y-axis, 0 to 90) for severity levels 1, 2, 3, and 4 (x-axis).]
82% of accidents were rated level 2 severity, meaning there was a somewhat short delay as a result of the accident, with fewer than 1% rated a 4
(a substantially long delay).
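A percentage breakdown like the one in Figure 2 can be computed with the Pandas value_counts function. The sketch below is an assumption about how this might be done, and it uses only the Severity values of the five example rows in Table 2, so its output does not match the full-data percentages quoted above.

```python
import pandas as pd

# Severity values of the five example rows in Table 2 (not the full dataset).
severity = pd.Series([2, 3, 2, 2, 2], name="Severity")

# Share of records at each severity level, expressed as a percentage
# and ordered by severity level rather than by frequency.
severity_pct = severity.value_counts(normalize=True).mul(100).sort_index()

print(severity_pct)
```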
Figure 3 shows the breakdown of accidents that occurred under each reported weather condition.
Figure 3: Count total for each category of documented weather condition.
[Figure 3 image: horizontal bar chart; x-axis: ID (0 to 2,000); y-axis: Weather_Condition, with categories Fair, Mostly Cloudy, Partly Cloudy, Cloudy, Light Rain, Rain, Smoke, Haze, Fog, Fair / Windy, Mist, Light Rain with Thunder, Heavy Rain, Mostly Cloudy / Windy, Partly Cloudy / Windy, Light Rain / Windy, T-Storm.]
It can be seen from Figure 3 that "hazardous" weather conditions, e.g. heavy rain, are not common among the documented traffic accidents.
Continuing the investigation into external factors, Point-Of-Interest (POI) attributes were grouped and totaled. Figure 4 shows the number of
accidents that occurred near each POI.
Figure 4: Count total for each category of recorded POI.
42% of accidents occurred near one of the POI locations. In Figure 4, it is clear that a large percentage of accidents occur at road junctions while very
few occur at stop signs or stations.
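Because the POI variables are true/false indicators, both the per-POI totals and the share of accidents near at least one POI can be obtained by summing booleans. The following is a minimal sketch under that assumption, using only the Crossing and Junction flags of the five example rows in Table 2 (the full dataset has more POI columns, so these numbers differ from the 42% reported above).

```python
import pandas as pd

# Crossing and Junction flags for the five example rows in Table 2.
poi = pd.DataFrame({
    "Crossing": [False, False, False, False, False],
    "Junction": [False, True, False, True, False],
})

# Number of accidents recorded near each POI type (True counts as 1).
poi_counts = poi.sum()

# Percentage of accidents with at least one POI flag set.
near_any_poi_pct = 100 * poi.any(axis=1).mean()

print(poi_counts)
print(f"Near at least one POI: {near_any_poi_pct:.0f}%")
```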
Further analysis looked into hour of the day and day of the week for peak times of accidents.
Figure 5: Count totals for accidents on each day of the week. 0 is Monday, 6 is Sunday.
[Figure 5 image: chart of Count of Records (y-axis, 580 to 780) by Day of the Week (x-axis, 0 to 6).]
There appears to be a large spike in accidents on Wednesday and Friday with a large drop on the weekends.
Figure 6: Count totals for accidents at each hour of the day.
[Figure 6 image: chart of Count of Records (y-axis, 0 to 400) by Hour of the Day (x-axis, 0 to 23).]
Most accidents seem to occur during peak rush hours, 7am to 9am and 3pm to 6pm.
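To quantify this, each accident's start hour could be flagged as inside or outside the rush-hour windows. The helper below is hypothetical: the function name and the choice to treat each window's end hour as exclusive (so hours 7 to 8 and 15 to 17 count as rush hour) are assumptions, not something the report specifies.

```python
def is_rush_hour(hour: int) -> bool:
    """Return True if the hour bucket (0-23) falls in a rush-hour window.

    Assumed windows: 7am up to (not including) 9am, and 3pm up to
    (not including) 6pm.
    """
    morning = 7 <= hour < 9
    afternoon = 15 <= hour < 18
    return morning or afternoon


# Start hours of the five example rows in Table 2.
sample_hours = [1, 5, 18, 18, 0]
rush_flags = [is_rush_hour(h) for h in sample_hours]

# Share of the sample that falls inside the assumed windows.
rush_share = 100 * sum(rush_flags) / len(rush_flags)
```

None of the five example rows falls inside these windows (two start just before 7pm), which shows how sensitive such a share is to where the window boundaries are drawn.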
Discussion
This project analyzed relationships between traffic accidents and variables from 2020 data recorded in San Jose, CA. It focused on whether external
factors, i.e. weather (Figure 3) and POI areas (Figure 4), had an apparent effect on traffic accidents, and examined how traffic accidents were
distributed within a day (Figure 6) and over a week (Figure 5).
The analysis suggests that there is no clear relationship between weather, temperature, or visibility and traffic accidents, although this could be due
to the data collected. San Jose, or more specifically Northern California, does not have extreme (hazardous) weather conditions that would be
expected to disrupt traffic visibility and flow. It also appeared to be no more likely that a traffic accident would occur at night than during the day. This
could be due to less total traffic flow at night resulting in fewer traffic accidents. Further analysis with data about total traffic flow would be needed to
examine that.
While certain POIs seemed to have more traffic accidents than others (accidents at junctions were most common), POIs overall did not seem to be a
major factor in accounting for traffic accidents, as roughly 58% of accidents occurred away from these areas. Further analysis into the time of the
accident appeared to yield more concrete results. Accidents are more likely to occur on Wednesdays or Fridays, with a large drop on the weekend.
This could be because more people are out commuting to and from work on weekdays, while weekends may be reserved for time spent at
home. There is some evidence of this when examining total accidents by the hour. Most occur during peak rush hour before and after work, with lulls
from 10am to noon and in the early hours of the morning. Further analysis could look more deeply into the relationship between accidents on specific
streets during certain times of day.