Forecast of Southern Resident Killer
Whale Sightings in the Salish Sea
Catherine Rauch
Summary
The goal of this report was to compile sighting data for the Southern Resident Killer Whales in the Salish
Sea, and to accurately predict future months of sightings from past data.
The analysis started with a sample of 16,722 observations, which were grouped by month
from 2008 to 2015. The raw data had non-constant variance and a seasonal trend.
An appropriate transformation and differencing were applied to achieve a stationary time series. The plots
of the sample autocorrelation (ACF) and sample partial autocorrelation (PACF) were examined to determine
candidate models. Models were compared by AICc (corrected Akaike information criterion) and
checked for non-stationarity and non-invertibility. The selected model was then used to forecast the next 18
months of sightings data.
The model predicts a decrease in total whale sightings from the previous year, part of a
larger downward trend of fewer sightings, along with a continuing seasonal pattern as the whales
migrate between their summer and winter ranges. Whale watching companies will have to prepare for fewer whale
sightings, and potentially for a decrease in revenue and tourism as a result.
Introduction
The Southern Resident Killer Whales are a unique sub-species of killer whale. Their name, "resident,"
reflects their summer behavior, when they often spend weeks in the same area, though they travel
seasonally between their summer and winter ranges.
The whale watching industry in Washington State is estimated to be worth $65 to $70 million annually,
and the Whale Museum estimates that more than 500,000 people go whale watching each year on
commercial or private vessels in the waters of Washington State and British Columbia (the Salish Sea).
A NOAA Technical Memorandum analyzing the whale watching industry found that
many tourism-based businesses would not survive without the Southern Resident Killer
Whales; wildlife watching activities are estimated to support more than 21,000 jobs in Washington.
However, these whales are critically endangered: their population is small, they are often
elusive and hard to see, and in recent years their sightings have decreased altogether.
The main goal is to identify a sightings pattern in order to forecast the expected arrival of the whales.
Furthermore, although it is not in the scope of this time series analysis, such a projection could allow
whale watching businesses to predict estimated revenue.
To summarize the techniques used on the raw time series: a transformation was applied to stabilize the
non-constant variance. The data was split into two sets, one to train the fitted model
and another to test it. The transformed data was differenced to remove the seasonal
trend. Through examination of the ACF and PACF, parameters were identified to fit multiple SARIMA(p, d,
q) × (P, D, Q)s models. These were compared using AICc, and the model coefficients were used to assess
whether the model was invertible and stationary. Diagnostic checking was performed on the residuals to
assess whether the model needed adjusting. The model was fitted to the training set, compared against
the test set, and then used to forecast. This forecast predicted an overall decrease in total whale
sightings in the Salish Sea compared to the previous year. It also predicted a continuation of the downward
trend of fewer total whale sightings year to year.
This data set is a compilation of monthly Southern Resident Killer Whale sighting counts from 2008 to
2015. The information comes from The Whale Museum's online sightings hotline, where public records
are available on their website. The data was parsed and compiled using pandas in
JupyterLab, and then analyzed and forecasted using R in RStudio.
Findings
The initial overview of the data indicated non-constant variance and a seasonal trend, meaning the
time series is non-stationary. The data is plotted below.
To assess the quality of the model, the raw data was split into a training set and a test set. The training
set, used to build the model, consisted of the monthly data from 2008 to 2014.
The training data was log-transformed to stabilize the variance. The resulting
series is graphed below.
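As an illustration of this step, here is a minimal Python sketch with numpy (the actual analysis used R, as shown in the appendix; the helper name is ours):

```python
import numpy as np

def transform_and_split(counts, n_train=84):
    """Log-transform monthly counts to stabilize variance, then split
    into a training set (here 2008-2014) and a test set (2015).
    Note: zero-count months would need handling (e.g. log(x + 1))."""
    x = np.log(np.asarray(counts, dtype=float))
    return x[:n_train], x[n_train:]

# toy example: 96 months of synthetic positive counts
train, test = transform_and_split(np.arange(1, 97))
```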
This graph still shows a yearly seasonal trend, so the data was differenced at lag 12 to eliminate it.
The newly differenced and transformed data shows no apparent trend or seasonality.
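Seasonal differencing at lag 12 simply subtracts the value from the same month one year earlier. A minimal sketch (Python with numpy, for illustration; the R code in the appendix uses diff(x, 12)):

```python
import numpy as np

def seasonal_diff(x, lag=12):
    """Seasonal difference: y[t] = x[t] - x[t - lag].
    Removes a repeating yearly pattern from monthly data."""
    x = np.asarray(x, dtype=float)
    return x[lag:] - x[:-lag]

# a pure 12-month cycle differences away to (numerically) zero
t = np.arange(48)
d = seasonal_diff(np.sin(2 * np.pi * t / 12))
```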
The PACF and ACF below of the series were used to preliminarily estimate a SARIMA(p, d, q) × (P, D, Q)s
model.
One seasonal differencing was applied at lag s = 12 (the data is monthly), so D = 1.
The ACF shows a peak at h = 1s, so Q can be estimated as 1. The PACF shows strong peaks through
h = 2s, so P is estimated as 2.
The series was differenced again to estimate p and q. Below are the graphs.
From the within-year ACF and PACF, q can be estimated as 1 and p as 2 or 1. Because the data was
differenced only to remove seasonality, d is 0.
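The sample ACF that drives these choices is the lag-h autocovariance scaled by the lag-0 autocovariance. A hand-rolled sketch (Python with numpy; the actual analysis used R's acf()):

```python
import numpy as np

def sample_acf(x, max_lag=20):
    """Sample autocorrelation: autocovariance at each lag h,
    divided by the lag-0 autocovariance (the sample variance)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    acov = np.array([np.sum(x[:n - h] * x[h:]) / n
                     for h in range(max_lag + 1)])
    return acov / acov[0]

# for white noise, all lags beyond 0 should be near zero
r = sample_acf(np.random.default_rng(0).standard_normal(500))
```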
Three models were then fitted:
Model 1: SARIMA(1, 0, 1) × (2, 1, 1)12
Model 2: SARIMA(1, 0, 1) × (2, 1, 2)12
Model 3: SARIMA(2, 0, 1) × (1, 1, 1)12
AICc was used to compare model fit. Since model 3 has the lowest value, the next step was to examine
its roots to check invertibility and stationarity. The roots are plotted below.
Model roots are shown in red and should lie outside the unit circle; inverse roots are shown in blue and
should lie inside it. No roots should fall on the circle. Since all conditions are satisfied in these graphs,
the model is stationary and invertible.
Coefficients were estimated with the sarima function from the R package {astsa}:
Autoregressive (AR) coefficient 1 = -0.2097, AR coefficient 2 = 0.2146
Moving Average (MA) coefficient = 0.2475
Seasonal AR coefficient = 0.1763
Seasonal MA coefficient = -0.7657
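With these estimates, the unit-circle condition can also be checked numerically. A minimal sketch (Python with numpy; the report used plot.roots in R), assuming the usual R arima sign convention, i.e. AR polynomial 1 − ar1·B − ar2·B² and MA polynomial 1 + ma1·B:

```python
import numpy as np

def roots_outside_unit_circle(coefs):
    """Roots of 1 + c1*z + c2*z^2 + ...: True if all lie strictly
    outside the unit circle (|root| > 1)."""
    roots = np.roots(list(reversed([1.0] + list(coefs))))
    return bool(np.all(np.abs(roots) > 1.0))

# polynomial coefficients of z, z^2 built from the fitted values
ar  = [0.2097, -0.2146]   # 1 - (-0.2097)B - (0.2146)B^2
ma  = [0.2475]            # 1 + 0.2475B
sar = [-0.1763]           # 1 - 0.1763B^12 (in z = B^12)
sma = [-0.7657]           # 1 + (-0.7657)B^12
ok = all(roots_outside_unit_circle(c) for c in (ar, ma, sar, sma))
```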
SARIMA(p, d, q) × (P, D, Q)s equation:

φ_p(B) Φ_P(B^s) (1 − B)^d (1 − B^s)^D X_t = θ_q(B) Θ_Q(B^s) Z_t

SARIMA(2, 0, 1) × (1, 1, 1)12:

(1 − φ1 B − φ2 B^2)(1 − Φ1 B^12)(1 − B^12) X_t = (1 + θ1 B)(1 + Θ1 B^12) Z_t
Fitted equation:

(1 + 0.2097 B − 0.2146 B^2)(1 − 0.1763 B^12)(1 − B^12) X_t = (1 + 0.2475 B)(1 − 0.7657 B^12) Z_t
Diagnostic checking was conducted on the residuals of model 3. The portmanteau tests, Box-Pierce and
Ljung-Box, from R's {stats} package were computed to examine the independence of the residuals. The
p-values were 0.5796 and 0.5002, respectively, so it can be concluded that the residuals are
independent. The residuals of model 3 are graphed below.
After examining the graphs and taking into consideration the p-values reported by the tests, it can be
concluded that the residuals are approximately normal and approximately independent. The graphs also
indicate that the residuals resemble white noise, i.e., they have no trend, no seasonality, and no change
in variance. From these conclusions it was decided that no adjustment of the fitted model was needed.
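The Ljung-Box statistic used here can be written out directly. A minimal numpy sketch (the report used R's Box.test; a production Python workflow would typically use statsmodels instead):

```python
import numpy as np

def ljung_box(res, lags=10, fitdf=2):
    """Ljung-Box statistic Q = n(n+2) * sum_h r_h^2 / (n - h),
    compared to a chi-square with (lags - fitdf) degrees of freedom."""
    res = np.asarray(res, dtype=float)
    res = res - res.mean()
    n = len(res)
    denom = np.sum(res ** 2)
    r = np.array([np.sum(res[:n - h] * res[h:]) / denom
                  for h in range(1, lags + 1)])
    q = n * (n + 2) * np.sum(r ** 2 / (n - np.arange(1, lags + 1)))
    return q, lags - fitdf

# white-noise residuals should give a modest Q relative to chi-sq(8)
q, df = ljung_box(np.random.default_rng(1).standard_normal(84))
```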
From this model, using the sarima.for function in the R package {astsa}, a forecast was produced from
the training data, with the test data overlaid for comparison. The shaded blue region is the confidence
interval calculated by sarima.for.
The graph above shows the transformed data; below, the model was fitted to the original data.
From the prediction, it can be concluded that whale sightings will decrease from the previous year,
continuing the recent trend of declining total sightings. Additionally, the strong seasonal pattern will
continue.
Conclusion
The goal of this analysis was to compile sighting data for the Southern Resident Killer Whales in the Salish
Sea, and to identify a sightings pattern in order to predict future months of sightings. This goal was
achieved.
The final fitted model, SARIMA(2, 0, 1) × (1, 1, 1)12:

(1 + 0.2097 B − 0.2146 B^2)(1 − 0.1763 B^12)(1 − B^12) X_t = (1 + 0.2475 B)(1 − 0.7657 B^12) Z_t
From the obtained prediction, it can be concluded that whale sightings will decrease from the previous
year, continuing the recent trend of fewer total sightings. These whales are critically endangered, and
their population has trended downward in past years. The sighting data appears to mirror this: fewer
whales means fewer sightings. Whale watching companies will have to prepare for an overall decrease
in whale sightings, and potentially for a decrease in revenue and tourism as a result.
References
The Whale Hotline. The Whale Museum. hotline.whalemuseum.org/
Southern Resident Killer Whale Chinook Salmon Initiative. srkwcsi.org/
Olson, J. K., Wood, J., Osborne, R., Barrett-Lennard, L., & Larson, S. Sightings of Southern Resident Killer Whales in the Salish Sea 1976−2014: the Importance of a Long-Term Opportunistic Dataset. Coastal Ocean Research Institute, Vancouver, BC, Canada. www.eopugetsound.org/sites/default/files/features/resources/Olson%20et%20al.%202018%20SRKW%20Sightings.pdf
The U.S. Whale Watching Industry of Greater Puget Sound: A Description and Baseline Analysis. NOAA Technical Memorandum. https://www.nwfsc.noaa.gov/assets/25/7442_04172014_143023_WhaleWatchingTM126WebFinal.pdf
Appendix
# read data into Rstudio
whale <- read.csv("C:/Users/Catherine/Downloads/new_whale (1).csv")
whale <- whale[,2]
# variance changes over time, so apply transformation
whale <- log(whale)
# split data into a training set and a test set
train <- whale[1:84]
test <- whale[85:96]
test <- ts(test, frequency = 12)
whale.ts <- ts(train, frequency = 12)
plot.ts(whale.ts) # used above in report
# difference to remove seasonality
whale.diff <- diff(whale.ts, 12)
plot.ts(whale.diff)
# plot acfs and pacfs to estimate parameters
#(plots used above in report)
acf(whale.diff, lag.max = 40)
pacf(whale.diff, lag.max = 40)
acf(diff(whale.diff, 12), lag.max = 40)
pacf(diff(whale.diff, 12), lag.max = 40)
# fit models with predicted parameter values estimated from acf and pacf
# model 1
fit.i <- sarima((train), p = 1, d = 0, q = 1, P = 2 , D = 1, Q = 1, S = 12)
## initial value -0.077803
## iter 2 value -0.301345
## iter 3 value -0.320685
## iter 4 value -0.331440
## iter 5 value -0.336954
## iter 6 value -0.337901
## iter 7 value -0.338157
## iter 8 value -0.338249
## iter 9 value -0.338293
## iter 10 value -0.338361
## iter 11 value -0.338515
## iter 12 value -0.338566
## iter 13 value -0.338578
## iter 14 value -0.338579
## iter 14 value -0.338579
## final value -0.338579
## converged
## initial value -0.042883
## iter 2 value -0.079369
## iter 3 value -0.094209
## iter 4 value -0.094667
## iter 5 value -0.094997
## iter 6 value -0.095120
## iter 7 value -0.095296
## iter 8 value -0.095854
## iter 9 value -0.096457
## iter 10 value -0.096825
## iter 11 value -0.097023
## iter 12 value -0.097040
## iter 13 value -0.097076
## iter 14 value -0.097109
## iter 15 value -0.097119
## iter 16 value -0.097121
## iter 17 value -0.097125
## iter 18 value -0.097127
## iter 19 value -0.097130
## iter 20 value -0.097134
## iter 21 value -0.097135
## iter 22 value -0.097135
## iter 23 value -0.097135
## iter 24 value -0.097135
## iter 25 value -0.097135
## iter 26 value -0.097135
## iter 27 value -0.097135
## iter 27 value -0.097135
## final value -0.097135
## converged
# model 2
fit.ii <- sarima((train), p = 1, d = 0, q = 1, P = 2 , D = 1, Q = 2, S = 12)
## initial value -0.077803
## iter 2 value -0.294933
## iter 3 value -0.315719
## iter 4 value -0.329583
## iter 5 value -0.341670
## iter 6 value -0.343374
## iter 7 value -0.347261
## iter 8 value -0.347934
## iter 9 value -0.348118
## iter 10 value -0.348128
## iter 11 value -0.348143
## iter 12 value -0.348150
## iter 13 value -0.348158
## iter 14 value -0.348165
## iter 15 value -0.348168
## iter 16 value -0.348169
## iter 17 value -0.348169
## iter 17 value -0.348169
## iter 17 value -0.348169
## final value -0.348169
## converged
## initial value -0.008415
## iter 2 value -0.067954
## iter 3 value -0.093028
## iter 4 value -0.093418
## iter 5 value -0.093900
## iter 6 value -0.094110
## iter 7 value -0.094959
## iter 8 value -0.095439
## iter 9 value -0.095728
## iter 10 value -0.095935
## iter 11 value -0.096327
## iter 12 value -0.096756
## iter 13 value -0.096984
## iter 14 value -0.097065
## iter 15 value -0.097085
## iter 16 value -0.097088
## iter 17 value -0.097090
## iter 18 value -0.097095
## iter 19 value -0.097114
## iter 20 value -0.097151
## iter 21 value -0.097158
## iter 22 value -0.097160
## iter 23 value -0.097160
## iter 24 value -0.097163
## iter 25 value -0.097163
## iter 26 value -0.097163
## iter 27 value -0.097164
## iter 28 value -0.097164
## iter 29 value -0.097165
## iter 30 value -0.097165
## iter 31 value -0.097165
## iter 31 value -0.097165
## final value -0.097165
## converged
# model 3
fit.iii <- sarima((train), p = 2, d = 0, q = 1, P = 1 , D = 1, Q = 1, S = 12)
## initial value -0.041606
## iter 2 value -0.047253
## iter 3 value -0.142442
## iter 4 value -0.153759
## iter 5 value -0.160822
## iter 6 value -0.163548
## iter 7 value -0.164328
## iter 8 value -0.164558
## iter 9 value -0.164609
## iter 10 value -0.164680
## iter 11 value -0.164712
## iter 12 value -0.164712
## iter 12 value -0.164712
## iter 12 value -0.164712
## final value -0.164712
## converged
## initial value -0.036011
## iter 2 value -0.041058
## iter 3 value -0.084230
## iter 4 value -0.091951
## iter 5 value -0.095326
## iter 6 value -0.096226
## iter 7 value -0.097004
## iter 8 value -0.098002
## iter 9 value -0.098407
## iter 10 value -0.098612
## iter 11 value -0.098744
## iter 12 value -0.098875
## iter 13 value -0.099023
## iter 14 value -0.099202
## iter 15 value -0.099220
## iter 16 value -0.099225
## iter 17 value -0.099228
## iter 18 value -0.099230
## iter 19 value -0.099233
## iter 20 value -0.099235
## iter 21 value -0.099235
## iter 22 value -0.099235
## iter 23 value -0.099235
## iter 24 value -0.099235
## iter 25 value -0.099236
## iter 26 value -0.099236
## iter 27 value -0.099236
## iter 27 value -0.099236
## final value -0.099236
## converged
#code to obtain AICc values
fit.i$AICc
## [1] 2.475241
fit.ii$AICc
## [1] 2.503961
fit.iii$AICc # lowest AICc
## [1] 2.471596
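For reference, AICc adds a small-sample correction to AIC. A minimal sketch of the formula (note: astsa reports a per-observation variant, which is why the values above are on a different scale than a raw 2k − 2·logLik; the choice k = 6 below, five coefficients plus the innovation variance, is our assumption):

```python
def aicc(log_lik, k, n):
    """Corrected AIC: AIC = 2k - 2*logLik plus the small-sample
    correction term 2k(k+1)/(n - k - 1)."""
    aic = 2 * k - 2 * log_lik
    return aic + 2 * k * (k + 1) / (n - k - 1)

# illustration with the reported log likelihood and n = 84 training months
val = aicc(log_lik=-95.44, k=6, n=84)
```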
# call fit to read coefficients
fit.iii
## $fit
## Call:
## arima(x = data, order = c(p, d, q), seasonal = list(order = c(P, D, Q),
##     period = S), include.mean = F)
##
## Coefficients:
##           ar1     ar2     ma1    sar1     sma1
##       -0.2097  0.2146  0.2475  0.1763  -0.7657
## s.e.   0.3438  0.1178  0.3418  0.2455   0.2983
##
## sigma^2 estimated as 0.7491: log likelihood = -95.44, aic = 204.88
# check invertibility and stationarity
# All roots (in red) should be outside of the unit circle; thus, the
# model is causal and invertible.
#graphs used above in report
plot.roots(NULL,polyroot(c(1, -0.2097, 0.2146)), main="roots of ar part")
plot.roots(NULL,polyroot(c(1, 0.2475)), main="roots of ma part")
plot.roots(NULL,polyroot(c(1, 0.1763)), main="roots of sar part")
plot.roots(NULL,polyroot(c(1,-0.7657)), main="roots of sma part")
# check assumption that residuals are white noise
res = residuals(fit.iii$fit)
mean(res)
## [1] 0.0007956052
var(res)
## [1] 0.6314296
# code for residual graphs used above
pacf(res,main = "Partial Autocorrelation")
acf(res,main = "Autocorrelation")
# Histogram
hist(res,main = "Histogram")
# q-q plot
qqnorm(res)
qqline(res,col ="blue")
## Test for independence of residuals
Box.test(res, lag = 10, type = c("Box-Pierce"), fitdf = 2)
##
## Box-Pierce test
##
## data: res
## X-squared = 6.6067, df = 8, p-value = 0.5796
Box.test(res, lag = 10, type = c("Ljung-Box"), fitdf = 2)
##
## Box-Ljung test
##
## data: res
## X-squared = 7.3426, df = 8, p-value = 0.5002
# Test for normality of residuals
shapiro.test(res)
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.93646, p-value = 0.06787
# Use model to forecast, resulting graph used above in report
pred.tr <- sarima.for(train, 18 , 2, 0, 1, 1, 1, 1,
S = 12, no.constant = FALSE, plot.all = F)
points(85:96,test,col="blue")
legend("topleft", pch = 1,col=c("red","blue"),
legend=c("Forecast value", "True Value"))
# Use model to forecast on original data, resulting graph used above in report
# (whale.og holds the original, untransformed monthly data)
whale2 <- whale.og[-c(61:72),2]
pred.tr <- sarima.for(whale2 , 18 , 2, 0, 1, 1, 1, 1,
S = 12, no.constant = FALSE, plot.all = F)
## python code used in Jupyter to extract/compile data
import csv
import requests

base_url = "https://hotline.whalemuseum.org/api.json?species=orca&orca_type=southern%20resident"
csv_name = 'whale_data.csv'
page_num = 0
with open(csv_name, 'w') as csv_file:
    while True:
        response = requests.get(f"{base_url}&page={page_num}&limit=1000")
        json_data = response.json()
        if not json_data:
            break
        page_num += 1
        if page_num == 1:
            # build the writer (and header) from the first page's keys
            writer = csv.DictWriter(csv_file, fieldnames=list(json_data[0]))
            writer.writeheader()
        writer.writerows(json_data)
import pandas as pd
df = pd.read_csv(csv_name)
for time_col in ('sighted_at', 'created_at', 'updated_at'):
    df[time_col] = pd.to_datetime(df[time_col])
print(df.head())
id species quantity description \
0 5a84ba92686f74490f4e1700 orca NaN Imported by The Whale Museum.
1 5a84ba91686f74490f481700 orca NaN Imported by The Whale Museum.
2 5a84bbb7686f74490f821d00 orca NaN Imported by The Whale Museum.
3 5a84ba90686f74490f441700 orca NaN Imported by The Whale Museum.
4 5a84ba90686f74490f461700 orca NaN Imported by The Whale Museum.
url latitude longitude \
0 http://hotline.whalemuseum.org/sightings/5a84b... 47.78 -122.44
1 http://hotline.whalemuseum.org/sightings/5a84b... 47.62 -122.45
2 http://hotline.whalemuseum.org/sightings/5a84b... 49.12 -123.47
3 http://hotline.whalemuseum.org/sightings/5a84b... 47.50 -122.43
4 http://hotline.whalemuseum.org/sightings/5a84b... 47.61 -122.44
location sighted_at \
0 Kingston, WA, US 2016-11-14 20:28:00+00:00
1 Bainbridge Island, WA, US 2016-11-14 18:09:00+00:00
2 Galiano Island, BC, CA 2016-10-26 22:08:00+00:00
3 NaN 2016-10-24 22:40:00+00:00
4 Seattle, WA, US 2016-10-24 19:43:00+00:00
created_at updated_at orca_type \
0 2018-02-14 22:39:14+00:00 2018-02-14 22:39:14+00:00 southern resident
1 2018-02-14 22:39:13+00:00 2018-02-14 22:39:13+00:00 southern resident
2 2018-02-14 22:44:07+00:00 2018-02-14 22:44:07+00:00 southern resident
3 2018-02-14 22:39:12+00:00 2018-02-14 22:39:12+00:00 southern resident
4 2018-02-14 22:39:13+00:00 2018-02-14 22:39:13+00:00 southern resident
whale = pd.read_csv("whale_data.csv") # read file into pandas
whale.shape # check number of entries
Out[28]:
(16722, 13)
whale['sighted_at'] = pd.to_datetime(whale['sighted_at'])
new_whale = whale.set_index('sighted_at').groupby(pd.Grouper(freq='M')).agg('count').reset_index()
new_whale # collection of sightings recorded for the whale museum
Out[78]:
                    sighted_at  count
0    2008-01-31 00:00:00+00:00      3
1    2008-02-29 00:00:00+00:00      0
2    2008-03-31 00:00:00+00:00      0
3    2008-04-30 00:00:00+00:00      0
4    2008-05-31 00:00:00+00:00    317
..                         ...    ...
102  2016-07-31 00:00:00+00:00    532
103  2016-08-31 00:00:00+00:00    300
104  2016-09-30 00:00:00+00:00    432
105  2016-10-31 00:00:00+00:00     50
106  2016-11-30 00:00:00+00:00      4
107 rows × 14 columns
(the 13 per-column counts are nearly identical, so they are collapsed to a single count column here)
new_whale['month'] = new_whale.sighted_at.dt.strftime('%Y-%m')
newnew_whale = new_whale[['species', 'month']]
newnew_whale.to_csv('new_whale.csv')
newnew_whale.head()
Out[86]:
   species    month
0        3  2008-01
1        0  2008-02
2        0  2008-03
3        0  2008-04
4      317  2008-05