Forecast of Southern Resident Killer
Whale Sightings in the Salish Sea
Catherine Rauch
Summary
The goal of this report was to compile sighting data for the Southern Resident Killer Whales in the Salish
Sea, and to accurately predict future months of sightings from past data.
The analysis started with a sample of 16,722 observations, which were grouped by month
from 2008 to 2015. The raw data had non-constant variance and a seasonal trend.
An appropriate transformation and differencing were applied to achieve a stationary time series. The plots
of the sample autocorrelation (ACF) and sample partial autocorrelation (PACF) were examined to determine
candidate models. Models were compared by AICc (corrected Akaike information criterion) and
checked for non-stationarity and non-invertibility. The selected model was then used to forecast the next 18
months of sightings data.
The model predicts a decrease in total whale sightings from the previous year, part of a
larger downward trend of fewer sightings, along with a continuing seasonal pattern as the whales
migrate between their summer and winter ranges. Whale watching companies will have to prepare for fewer whale
sightings, and potentially for a decrease in revenue and tourism as a result.
Introduction
The Southern Resident Killer Whales are a unique sub-species of killer whale. Their name, "resident,"
reflects their summer behavior, when they often spend weeks in the same area, though they travel
seasonally between their summer and winter ranges.
The whale watching industry in Washington State is estimated to be worth $65 to $70 million annually,
and the Whale Museum estimates that more than 500,000 people go whale watching each year on
commercial or private vessels in the waters of Washington State and British Columbia (the Salish Sea).
A NOAA Technical Memorandum analyzing the whale watching industry found that
many tourism-based businesses would not survive without the Southern Resident Killer
Whales; wildlife watching activities are estimated to support more than 21,000 jobs in Washington.
However, these whales are critically endangered: their population is small, they are often
elusive and hard to see, and in recent years their sightings have decreased altogether.
The main goal is to identify a sightings pattern in order to forecast the expected arrival of the whales.
Furthermore, although it is not in the scope of this time series analysis, such a projection could allow
whale watching businesses to predict estimated revenue.
To summarize the techniques used on the raw time series: a transformation was applied to stabilize the
non-constant variance. The data was split into two sets, one to train the fitted model
and another to test it. The transformed data was differenced to remove the seasonal
trend. Through examination of the ACF and PACF, parameters were identified to fit multiple SARIMA(p, d,
q) × (P, D, Q)s models. These were compared using AICc, and the model coefficients were used to assess
whether the model was invertible and stationary. Diagnostic checking was performed on the residuals to
assess whether the model needed adjusting. The model was fitted to the training set, compared against
the test set, and then used to forecast. This forecast predicted an overall decrease in total whale
sightings in the Salish Sea compared to the previous year. It also predicted a continuation of the downward
trend of fewer total whale sightings year to year.
This data set is a compilation of monthly Southern Resident Killer Whale sighting counts from 2008 to
2015. The information comes from The Whale Museum's online sightings hotline, where public records
are available on their website. The data was parsed and compiled using pandas in
JupyterLab, and then analyzed and forecasted using R in RStudio.
Findings
The initial overview of the data indicated non-constant variance and a seasonal trend, meaning the
time series is non-stationary. The data is plotted below.
To assess the quality of the model, the raw data was split into a training set and a test set. The training
set, used to build the model, consisted of the monthly data from 2008 to 2014.
The training data was log-transformed to stabilize the variance. The resulting
series is graphed below.
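As an illustration of this step, here is a minimal Python sketch with numpy (the actual analysis used R, as shown in the appendix; the helper name is ours):

```python
import numpy as np

def transform_and_split(counts, n_train=84):
    """Log-transform monthly counts to stabilize variance, then split
    into a training set (here 2008-2014) and a test set (2015).
    Note: zero-count months would need handling (e.g. log(x + 1))."""
    x = np.log(np.asarray(counts, dtype=float))
    return x[:n_train], x[n_train:]

# toy example: 96 months of synthetic positive counts
train, test = transform_and_split(np.arange(1, 97))
```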
This graph still shows a yearly seasonal trend, so the data was differenced at lag 12 to eliminate it.
The newly differenced and transformed data shows no apparent trend or seasonality.
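Seasonal differencing at lag 12 simply subtracts the value from the same month one year earlier. A minimal sketch (Python with numpy, for illustration; the R code in the appendix uses diff(x, 12)):

```python
import numpy as np

def seasonal_diff(x, lag=12):
    """Seasonal difference: y[t] = x[t] - x[t - lag].
    Removes a repeating yearly pattern from monthly data."""
    x = np.asarray(x, dtype=float)
    return x[lag:] - x[:-lag]

# a pure 12-month cycle differences away to (numerically) zero
t = np.arange(48)
d = seasonal_diff(np.sin(2 * np.pi * t / 12))
```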
The PACF and ACF below of the series were used to preliminarily estimate a SARIMA(p, d, q) × (P, D, Q)s
model.
One seasonal differencing was applied at lag s = 12 (the data is monthly), so D = 1.
The ACF shows a peak at h = 1s, so Q can be estimated as 1. The PACF shows strong peaks through
h = 2s, so P is estimated as 2.
The series was differenced again to estimate p and q. Below are the graphs.
From the within-year ACF and PACF, q can be estimated as 1 and p as 2 or 1. Because the data was
differenced only to remove seasonality, d is 0.
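The sample ACF that drives these choices is the lag-h autocovariance scaled by the lag-0 autocovariance. A hand-rolled sketch (Python with numpy; the actual analysis used R's acf()):

```python
import numpy as np

def sample_acf(x, max_lag=20):
    """Sample autocorrelation: autocovariance at each lag h,
    divided by the lag-0 autocovariance (the sample variance)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    acov = np.array([np.sum(x[:n - h] * x[h:]) / n
                     for h in range(max_lag + 1)])
    return acov / acov[0]

# for white noise, all lags beyond 0 should be near zero
r = sample_acf(np.random.default_rng(0).standard_normal(500))
```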
Three models were then fitted:
Model 1: SARIMA(1, 0, 1) × (2, 1, 1)12
Model 2: SARIMA(1, 0, 1) × (2, 1, 2)12
Model 3: SARIMA(2, 0, 1) × (1, 1, 1)12
AICc was used to compare model fit. Since model 3 has the lowest value, the next step was to examine
its roots to check invertibility and stationarity. The roots are plotted below.
Model roots are shown in red and should lie outside the unit circle; inverse roots are shown in blue and
should lie inside it. No roots should fall on the circle. Since all conditions are satisfied in these graphs,
the model is stationary and invertible.
Coefficients were estimated with the sarima function from the R package {astsa}:
Autoregressive (AR) coefficient 1 = -0.2097, AR coefficient 2 = 0.2146
Moving Average (MA) coefficient = 0.2475
Seasonal AR coefficient = 0.1763
Seasonal MA coefficient = -0.7657
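With these estimates, the unit-circle condition can also be checked numerically. A minimal sketch (Python with numpy; the report used plot.roots in R), assuming the usual R arima sign convention, i.e. AR polynomial 1 − ar1·B − ar2·B² and MA polynomial 1 + ma1·B:

```python
import numpy as np

def roots_outside_unit_circle(coefs):
    """Roots of 1 + c1*z + c2*z^2 + ...: True if all lie strictly
    outside the unit circle (|root| > 1)."""
    roots = np.roots(list(reversed([1.0] + list(coefs))))
    return bool(np.all(np.abs(roots) > 1.0))

# polynomial coefficients of z, z^2 built from the fitted values
ar  = [0.2097, -0.2146]   # 1 - (-0.2097)B - (0.2146)B^2
ma  = [0.2475]            # 1 + 0.2475B
sar = [-0.1763]           # 1 - 0.1763B^12 (in z = B^12)
sma = [-0.7657]           # 1 + (-0.7657)B^12
ok = all(roots_outside_unit_circle(c) for c in (ar, ma, sar, sma))
```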
SARIMA(p, d, q) × (P, D, Q)s equation:

φ_p(B) Φ_P(B^s) (1 − B)^d (1 − B^s)^D X_t = θ_q(B) Θ_Q(B^s) Z_t

SARIMA(2, 0, 1) × (1, 1, 1)12:

(1 − φ1 B − φ2 B^2)(1 − Φ1 B^12)(1 − B^12) X_t = (1 + θ1 B)(1 + Θ1 B^12) Z_t
Fitted equation:

(1 + 0.2097 B − 0.2146 B^2)(1 − 0.1763 B^12)(1 − B^12) X_t = (1 + 0.2475 B)(1 − 0.7657 B^12) Z_t
Diagnostic checking was conducted on the residuals of model 3. The portmanteau tests, Box-Pierce and
Ljung-Box, from R's {stats} package were computed to examine the independence of the residuals. The
p-values were 0.5796 and 0.5002, respectively, so it can be concluded that the residuals are
independent. The residuals of model 3 are graphed below.
After examining the graphs and taking into consideration the p-values reported by the tests, it can be
concluded that the residuals are approximately normal and approximately independent. The graphs also
indicate that the residuals resemble white noise, i.e., they have no trend, no seasonality, and no change
in variance. From these conclusions it was decided that no adjustment of the fitted model was needed.
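The Ljung-Box statistic used here can be written out directly. A minimal numpy sketch (the report used R's Box.test; a production Python workflow would typically use statsmodels instead):

```python
import numpy as np

def ljung_box(res, lags=10, fitdf=2):
    """Ljung-Box statistic Q = n(n+2) * sum_h r_h^2 / (n - h),
    compared to a chi-square with (lags - fitdf) degrees of freedom."""
    res = np.asarray(res, dtype=float)
    res = res - res.mean()
    n = len(res)
    denom = np.sum(res ** 2)
    r = np.array([np.sum(res[:n - h] * res[h:]) / denom
                  for h in range(1, lags + 1)])
    q = n * (n + 2) * np.sum(r ** 2 / (n - np.arange(1, lags + 1)))
    return q, lags - fitdf

# white-noise residuals should give a modest Q relative to chi-sq(8)
q, df = ljung_box(np.random.default_rng(1).standard_normal(84))
```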
From this model, using the sarima.for function in the R package {astsa}, a forecast was produced from
the training data, with the test data overlaid for comparison. The shaded blue region is the confidence
interval calculated by sarima.for.
The graph above shows the transformed data; below, the model was fitted to the original data.
From the prediction, it can be concluded that whale sightings will decrease from the previous year,
continuing the recent trend of declining total sightings. Additionally, the strong seasonal pattern will
continue.
Conclusion
The goal of this analysis was to compile sighting data for the Southern Resident Killer Whales in the Salish
Sea, and to identify a sightings pattern in order to predict future months of sightings. This goal was
achieved.
The final fitted model, SARIMA(2, 0, 1) × (1, 1, 1)12:

(1 + 0.2097 B − 0.2146 B^2)(1 − 0.1763 B^12)(1 − B^12) X_t = (1 + 0.2475 B)(1 − 0.7657 B^12) Z_t
From the obtained prediction, it can be concluded that whale sightings will decrease from the previous
year, continuing the recent trend of fewer total sightings. These whales are critically endangered, and
their population has trended downward in past years. The sighting data appears to mirror this: fewer
whales means fewer sightings. Whale watching companies will have to prepare for an overall decrease
in whale sightings, and potentially for a decrease in revenue and tourism as a result.
References
The Whale Hotline. The Whale Museum. hotline.whalemuseum.org/
Southern Resident Killer Whale Chinook Salmon Initiative. srkwcsi.org/
Olson, J. K., Wood, J., Osborne, R., Barrett-Lennard, L., & Larson, S. Sightings of Southern Resident Killer Whales in the Salish Sea 1976−2014: the Importance of a Long-Term Opportunistic Dataset. Coastal Ocean Research Institute, Vancouver, BC, Canada. www.eopugetsound.org/sites/default/files/features/resources/Olson%20et%20al.%202018%20SRKW%20Sightings.pdf
The U.S. Whale Watching Industry of Greater Puget Sound: A Description and Baseline Analysis. NOAA Technical Memorandum. https://www.nwfsc.noaa.gov/assets/25/7442_04172014_143023_WhaleWatchingTM126WebFinal.pdf
Appendix
# read data into Rstudio
whale <- read.csv("C:/Users/Catherine/Downloads/new_whale (1).csv")
whale <- whale[,2]
# variance changes over time, so apply transformation
whale <- log(whale)
# split data into a training set and a test set
train <- whale[1:84]
test <- whale[85:96]
test <- ts(test, frequency = 12)
whale.ts <- ts(train, frequency = 12)
plot.ts(whale.ts) # used above in report
# difference to remove seasonality
whale.diff <- diff(whale.ts, 12)
plot.ts(whale.diff)
# plot acfs and pacfs to estimate parameters
#(plots used above in report)
acf(whale.diff, lag.max = 40)
pacf(whale.diff, lag.max = 40)
acf(diff(whale.diff, 12), lag.max = 40)
pacf(diff(whale.diff, 12), lag.max = 40)
# fit models with predicted parameter values estimated from acf and pacf
# model 1
fit.i <- sarima((train), p = 1, d = 0, q = 1, P = 2 , D = 1, Q = 1, S = 12)
## initial value -0.077803
## iter 2 value -0.301345
## iter 3 value -0.320685
## iter 4 value -0.331440
## iter 5 value -0.336954
## iter 6 value -0.337901
## iter 7 value -0.338157
## iter 8 value -0.338249
## iter 9 value -0.338293
## iter 10 value -0.338361
## iter 11 value -0.338515
## iter 12 value -0.338566
## iter 13 value -0.338578
## iter 14 value -0.338579
## iter 14 value -0.338579
## final value -0.338579
## converged
## initial value -0.042883
## iter 2 value -0.079369
## iter 3 value -0.094209
## iter 4 value -0.094667
## iter 5 value -0.094997
## iter 6 value -0.095120
## iter 7 value -0.095296
## iter 8 value -0.095854
## iter 9 value -0.096457
## iter 10 value -0.096825
## iter 11 value -0.097023
## iter 12 value -0.097040
## iter 13 value -0.097076
## iter 14 value -0.097109
## iter 15 value -0.097119
## iter 16 value -0.097121
## iter 17 value -0.097125
## iter 18 value -0.097127
## iter 19 value -0.097130
## iter 20 value -0.097134
## iter 21 value -0.097135
## iter 22 value -0.097135
## iter 23 value -0.097135
## iter 24 value -0.097135
## iter 25 value -0.097135
## iter 26 value -0.097135
## iter 27 value -0.097135
## iter 27 value -0.097135
## final value -0.097135
## converged
# model 2
fit.ii <- sarima((train), p = 1, d = 0, q = 1, P = 2 , D = 1, Q = 2, S = 12)
## initial value -0.077803
## iter 2 value -0.294933
## iter 3 value -0.315719
## iter 4 value -0.329583
## iter 5 value -0.341670
## iter 6 value -0.343374
## iter 7 value -0.347261
## iter 8 value -0.347934
## iter 9 value -0.348118
## iter 10 value -0.348128
## iter 11 value -0.348143
## iter 12 value -0.348150
## iter 13 value -0.348158
## iter 14 value -0.348165
## iter 15 value -0.348168
## iter 16 value -0.348169
## iter 17 value -0.348169
## iter 17 value -0.348169
## iter 17 value -0.348169
## final value -0.348169
## converged
## initial value -0.008415
## iter 2 value -0.067954
## iter 3 value -0.093028
## iter 4 value -0.093418
## iter 5 value -0.093900
## iter 6 value -0.094110
## iter 7 value -0.094959
## iter 8 value -0.095439
## iter 9 value -0.095728
## iter 10 value -0.095935
## iter 11 value -0.096327
## iter 12 value -0.096756
## iter 13 value -0.096984
## iter 14 value -0.097065
## iter 15 value -0.097085
## iter 16 value -0.097088
## iter 17 value -0.097090
## iter 18 value -0.097095
## iter 19 value -0.097114
## iter 20 value -0.097151
## iter 21 value -0.097158
## iter 22 value -0.097160
## iter 23 value -0.097160
## iter 24 value -0.097163
## iter 25 value -0.097163
## iter 26 value -0.097163
## iter 27 value -0.097164
## iter 28 value -0.097164
## iter 29 value -0.097165
## iter 30 value -0.097165
## iter 31 value -0.097165
## iter 31 value -0.097165
## final value -0.097165
## converged
# model 3
fit.iii <- sarima((train), p = 2, d = 0, q = 1, P = 1 , D = 1, Q = 1, S = 12)
## initial value -0.041606
## iter 2 value -0.047253
## iter 3 value -0.142442
## iter 4 value -0.153759
## iter 5 value -0.160822
## iter 6 value -0.163548
## iter 7 value -0.164328
## iter 8 value -0.164558
## iter 9 value -0.164609
## iter 10 value -0.164680
## iter 11 value -0.164712
## iter 12 value -0.164712
## iter 12 value -0.164712
## iter 12 value -0.164712
## final value -0.164712
## converged
## initial value -0.036011
## iter 2 value -0.041058
## iter 3 value -0.084230
## iter 4 value -0.091951
## iter 5 value -0.095326
## iter 6 value -0.096226
## iter 7 value -0.097004
## iter 8 value -0.098002
## iter 9 value -0.098407
## iter 10 value -0.098612
## iter 11 value -0.098744
## iter 12 value -0.098875
## iter 13 value -0.099023
## iter 14 value -0.099202
## iter 15 value -0.099220
## iter 16 value -0.099225
## iter 17 value -0.099228
## iter 18 value -0.099230
## iter 19 value -0.099233
## iter 20 value -0.099235
## iter 21 value -0.099235
## iter 22 value -0.099235
## iter 23 value -0.099235
## iter 24 value -0.099235
## iter 25 value -0.099236
## iter 26 value -0.099236
## iter 27 value -0.099236
## iter 27 value -0.099236
## final value -0.099236
## converged
#code to obtain AICc values
fit.i$AICc
## [1] 2.475241
fit.ii$AICc
## [1] 2.503961
fit.iii$AICc # lowest AICc
## [1] 2.471596
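For reference, AICc adds a small-sample correction to AIC. A minimal sketch of the formula (note: astsa reports a per-observation variant, which is why the values above are on a different scale than a raw 2k − 2·logLik; the choice k = 6 below, five coefficients plus the innovation variance, is our assumption):

```python
def aicc(log_lik, k, n):
    """Corrected AIC: AIC = 2k - 2*logLik plus the small-sample
    correction term 2k(k+1)/(n - k - 1)."""
    aic = 2 * k - 2 * log_lik
    return aic + 2 * k * (k + 1) / (n - k - 1)

# illustration with the reported log likelihood and n = 84 training months
val = aicc(log_lik=-95.44, k=6, n=84)
```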
# call fit to read coefficients
fit.iii
## $fit
## Call:
## arima(x = data, order = c(p, d, q), seasonal = list(order = c(P, D, Q),
##     period = S), include.mean = F)
##
## Coefficients:
##           ar1     ar2     ma1    sar1     sma1
##       -0.2097  0.2146  0.2475  0.1763  -0.7657
## s.e.   0.3438  0.1178  0.3418  0.2455   0.2983
##
## sigma^2 estimated as 0.7491: log likelihood = -95.44, aic = 204.88
# check invertibility and stationarity
# All roots (in red) should be outside of the unit circle; thus, the
# model is causal and invertible.
#graphs used above in report
plot.roots(NULL,polyroot(c(1, -0.2097, 0.2146)), main="roots of ar part")
plot.roots(NULL,polyroot(c(1, 0.2475)), main="roots of ma part")
plot.roots(NULL,polyroot(c(1, 0.1763)), main="roots of sar part")
plot.roots(NULL,polyroot(c(1,-0.7657)), main="roots of sma part")
# check assumption that residuals are white noise
res = residuals(fit.iii$fit)
mean(res)
## [1] 0.0007956052
var(res)
## [1] 0.6314296
# code for residual graphs used above
pacf(res,main = "Partial Autocorrelation")
acf(res,main = "Autocorrelation")
# Histogram
hist(res,main = "Histogram")
# q-q plot
qqnorm(res)
qqline(res,col ="blue")
## Test for independence of residuals
Box.test(res, lag = 10, type = c("Box-Pierce"), fitdf = 2)
##
## Box-Pierce test
##
## data: res
## X-squared = 6.6067, df = 8, p-value = 0.5796
Box.test(res, lag = 10, type = c("Ljung-Box"), fitdf = 2)
##
## Box-Ljung test
##
## data: res
## X-squared = 7.3426, df = 8, p-value = 0.5002
# Test for normality of residuals
shapiro.test(res)
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.93646, p-value = 0.06787
# Use model to forecast, resulting graph used above in report
pred.tr <- sarima.for(train, 18 , 2, 0, 1, 1, 1, 1,
S = 12, no.constant = FALSE, plot.all = F)
points(85:96,test,col="blue")
legend("topleft", pch = 1,col=c("red","blue"),
legend=c("Forecast value", "True Value"))
# Use model to forecast on original data, resulting graph used above in report
# (whale.og holds the original, untransformed monthly data)
whale2 <- whale.og[-c(61:72),2]
pred.tr <- sarima.for(whale2 , 18 , 2, 0, 1, 1, 1, 1,
S = 12, no.constant = FALSE, plot.all = F)
## python code used in Jupyter to extract/compile data
import csv
import requests

base_url = "https://hotline.whalemuseum.org/api.json?species=orca&orca_type=southern%20resident"
csv_name = 'whale_data.csv'
page_num = 0
with open(csv_name, 'w') as csv_file:
    while True:
        response = requests.get(f"{base_url}&page={page_num}&limit=1000")
        json_data = response.json()
        if not json_data:
            break
        page_num += 1
        if page_num == 1:
            # build the writer (and header) from the first page's keys
            writer = csv.DictWriter(csv_file, fieldnames=list(json_data[0]))
            writer.writeheader()
        writer.writerows(json_data)
import pandas as pd
df = pd.read_csv(csv_name)
for time_col in ('sighted_at', 'created_at', 'updated_at'):
    df[time_col] = pd.to_datetime(df[time_col])
print(df.head())
id species quantity description \
0 5a84ba92686f74490f4e1700 orca NaN Imported by The Whale Museum.
1 5a84ba91686f74490f481700 orca NaN Imported by The Whale Museum.
2 5a84bbb7686f74490f821d00 orca NaN Imported by The Whale Museum.
3 5a84ba90686f74490f441700 orca NaN Imported by The Whale Museum.
4 5a84ba90686f74490f461700 orca NaN Imported by The Whale Museum.
url latitude longitude \
0 http://hotline.whalemuseum.org/sightings/5a84b... 47.78 -122.44
1 http://hotline.whalemuseum.org/sightings/5a84b... 47.62 -122.45
2 http://hotline.whalemuseum.org/sightings/5a84b... 49.12 -123.47
3 http://hotline.whalemuseum.org/sightings/5a84b... 47.50 -122.43
4 http://hotline.whalemuseum.org/sightings/5a84b... 47.61 -122.44
location sighted_at \
0 Kingston, WA, US 2016-11-14 20:28:00+00:00
1 Bainbridge Island, WA, US 2016-11-14 18:09:00+00:00
2 Galiano Island, BC, CA 2016-10-26 22:08:00+00:00
3 NaN 2016-10-24 22:40:00+00:00
4 Seattle, WA, US 2016-10-24 19:43:00+00:00
created_at updated_at orca_type \
0 2018-02-14 22:39:14+00:00 2018-02-14 22:39:14+00:00 southern resident
1 2018-02-14 22:39:13+00:00 2018-02-14 22:39:13+00:00 southern resident
2 2018-02-14 22:44:07+00:00 2018-02-14 22:44:07+00:00 southern resident
3 2018-02-14 22:39:12+00:00 2018-02-14 22:39:12+00:00 southern resident
4 2018-02-14 22:39:13+00:00 2018-02-14 22:39:13+00:00 southern resident
whale = pd.read_csv("whale_data.csv") # read file into pandas
whale.shape # check number of entries
Out[28]:
(16722, 13)
whale['sighted_at'] = pd.to_datetime(whale['sighted_at'])
new_whale = whale.set_index('sighted_at').groupby(pd.Grouper(freq='M')).agg('count').reset_index()
new_whale # collection of sightings recorded for the whale museum
Out[78]:
                    sighted_at  count
0    2008-01-31 00:00:00+00:00      3
1    2008-02-29 00:00:00+00:00      0
2    2008-03-31 00:00:00+00:00      0
3    2008-04-30 00:00:00+00:00      0
4    2008-05-31 00:00:00+00:00    317
..                         ...    ...
102  2016-07-31 00:00:00+00:00    532
103  2016-08-31 00:00:00+00:00    300
104  2016-09-30 00:00:00+00:00    432
105  2016-10-31 00:00:00+00:00     50
106  2016-11-30 00:00:00+00:00      4
107 rows × 14 columns
(the 13 per-column counts are nearly identical, so they are collapsed to a single count column here)
new_whale['month'] = new_whale.sighted_at.dt.strftime('%Y-%m')
newnew_whale = new_whale[['species', 'month']]
newnew_whale.to_csv('new_whale.csv')
newnew_whale.head()
Out[86]:
   species    month
0        3  2008-01
1        0  2008-02
2        0  2008-03
3        0  2008-04
4      317  2008-05