Examining the Impact of COVID-19 on the Proportion of Theft-Related Crimes in Vancouver

Authors: Group 4, Project Proposal (Acky Xu, Adam Mitha, Icy Xu, Michael DeMarco)

Abstract: This project examines the impact of the COVID-19 pandemic on theft-related crime in Vancouver neighborhoods. Inferential techniques are applied to estimate the difference between the proportion of theft crime relative to all crime in 2020 and the mean proportion of the three previous years, 2017 to 2019; we test whether there has been a statistically significant change in the proportion of theft-related crime across individual Vancouver neighborhoods. We anticipated a significant increase in theft-related crime overall, presumably driven by the economic hardship induced by the COVID-19 pandemic in 2020. Instead, we found that roughly half of the examined neighborhoods saw no statistically significant change in theft-related crime. Among the neighborhoods that did see a statistically significant change, the change did not appear to correlate with each neighborhood's average income, as we had anticipated.

A Vancouver Police Department (VPD) car.

Introduction

Background

Since the beginning of the COVID-19 pandemic in March of 2020, many have discussed the issue of allegedly increasing crime rates both globally and within Metro Vancouver. Indeed, an unprecedented economic contraction [Statistics Canada, 1] coupled with fewer resources, such as shelters or food banks, has driven an increase in many types of crime, including fraud and counterfeiting [Interpol, 2], hate crimes [ABC News, 3]—with those of East Asian descent being impacted most significantly—and domestic violence [UN News, 4].

These spikes have led to tension between municipal governments and police departments with respect to budgeting amidst the economic hardship of the COVID-19 era. The City of Vancouver, specifically, enacted a freeze on the police budget in 2021 [CBC, 5]. However, there is little publicly available Vancouver-focused analysis on both theft-related crime specifically and trends in crime generally in the years leading up to that decision.

Existing literature on crime and the COVID-19 pandemic has primarily focused on "global" crime (e.g., organized crime, terrorist attacks), and has largely concluded that, on the whole, these types of crime increased through the pandemic year. In our analysis, we look closely at "petty" crime, specifically theft, and examine how, if at all, the rate of theft crime has changed amidst the pandemic. We analyze all Vancouver neighborhoods we have data for—twenty-four neighborhoods in total. In the appendix, we visualize the socioeconomic spectrum within Vancouver, grouped by neighborhood, which is discussed as part of our analysis. To do so, we used the City of Vancouver's 2016 census data on "local area profiles," exploring the "average total incomes among residents" [City of Vancouver, 9]. Here's a snapshot of what the data looks like:

| Neighborhood | Average Income (2016, CAD) |
| --- | --- |
| Arbutus Ridge | \$62,675 |
| Downtown | \$63,251 |
| Dunbar/Southlands | \$78,117 |
| Fairview | \$61,627 |
| Grandview/Woodland | \$42,896 |
| Hastings/Sunrise | \$38,258 |

Figure 1: A sample of wealth in Vancouver neighborhoods.

(While the VPD has no direct data for neighborhood wealth, we can use the City of Vancouver's local area profiles, as they match the VPD's neighborhood classifications. Other studies of mean incomes among residents of Vancouver neighborhoods aren't guaranteed to share the same neighborhood boundaries as the City of Vancouver and VPD data sets.) It should be noted that income levels in Vancouver neighborhoods do not split neatly into categories such as "upper class," "middle class," and "lower class"; there are also notable outliers such as Shaughnessy, with an average 2016 income of \$118,668. For more information, the reader is encouraged to look at the appendix at the end of this report.

Question

We are interested in exploring how much, if at all, the proportion of petty crime has changed in Vancouver over the last year. We will compare crime rates from 2017 to 2020 in twenty-four neighborhoods spanning various strata of the economic spectrum. Our analysis focuses specifically on theft, so we can best understand the relationship between the economic hardships of this last year and the trends in crime that would seemingly be induced by, for example, job loss.

Therefore, our research question is as follows: "How has the proportion of theft-related crime changed, if at all, in Vancouver during 2020, the year of the COVID-19 pandemic, when compared to the average of the previous three years (2017 to 2019) across the various strata of Vancouver's economic spectrum, studied via twenty-four Vancouver neighborhoods?"

Data set

We're considering our sample to be what is available in the Vancouver Police Department's database. Within the data set description, the VPD mentions that "[t]he information provided [...] does not reflect the total number of calls or complaints made to the VPD" and that "the crime classification [...] may change at any time based on the dynamic nature of police investigations." This alone means we cannot consider the data set to be entirely representative of all crime in Vancouver. Additionally, we have to consider that many crimes go unreported for various reasons, such as a fear of not being believed, insecurity, or a fear of getting into trouble; therefore, we have further reason to presume this data is not completely representative of all crimes, or of all thefts, within Vancouver. While there is no specific study of unreported crime statistics in Vancouver, Statistics Canada's "Police-reported crime statistics in Canada, 2019" survey indicated that just "one-third (31\%) of crimes" are reported to the police [Statistics Canada, 13].

Our population is all actual crime that occurred in Vancouver, not just reported crime. Of course, it is impossible to know how much crime actually occurred, but through our data set sample we'll be able to produce an estimate. Unfortunately, as we are working with the proportion of theft-related crime to overall crime, attitudes towards policing over time become a rather significant confounding variable, as they could dramatically impact an individual's willingness to report a crime. There is significant cause to consider this confounder as well: 2020 saw both the killing of George Floyd by a Minneapolis police officer and little government action to address the missing and murdered Indigenous women (MMIW) crisis here in Canada, leading to further distrust between citizens and Canada's metropolitan police forces. This, among other limitations, is discussed further in our analysis section.

We will be using the Vancouver Police Department's Open Crime Data [VPD, 6] to answer our research question. In particular, we'll need three columns from the data set:

| Column | Description | Notes |
| --- | --- | --- |
| Year | "A four-digit field that indicates the year when the reported crime activity" | We'll use data from January 1st, 2017 to December 31st, 2020. |
| Type | "The type of crime activities" | We're considering theft crime to be all crime types with "theft" explicitly in the name. |
| Neighborhood | "Neighborhoods within the City of Vancouver are based on the census tract (CT) concept within census metropolitan area (CMA)." | We'll use all twenty-four neighborhoods in the data set, representing the full economic spectrum. |

Figure 2: A description of the VPD Open Crime data set.

Our random variable of interest is the difference in the proportion of theft-related crime with respect to all crime between the mean of 2017 to 2019 and 2020 (i.e., before and during the COVID-19 pandemic). Given that our population is all Vancouver crime, our parameter of interest is the corresponding difference in the proportion of theft-related crime between those two periods across all Vancouver crime, reported or not.

We will be estimating trends for all reported and unreported crime in Vancouver by making estimates with our data set of reported crime across twenty-four different Vancouver neighborhoods, each with its own level of wealth.

Preliminary Results

We'll start by importing the data set from its original source [VPD, 6], loading it into a data frame, and inspecting the results.

Our data largely comes in a "tidy" format already. By tidy, we mean (per the DSCI 100 course textbook [Data Science: A First Introduction, 12], which in turn follows Wickham and others [R for Data Science, 11]):

  • each row is a single observation,
  • each column is a single variable, and
  • each value is a single cell

However, there are four steps we could take to improve the data set for our use:

  1. clean up the column names,
  2. remove N/As,
  3. select only data relevant to our analysis, and
  4. convert crime types and neighborhoods to a factor, since we're working with proportions (i.e., categorical data).

We're primarily interested in the year, type and neighborhood of crimes, so we'll extract those columns and discard the rest. In addition, we would like to focus on the years 2017 to 2020.

Next, let's see how many different types of crimes there are present in our data set.

As was mentioned in the introduction, we're interested in theft related crime. Let's create a list of all theft related crime we see in this data set.

We can also check the neighborhoods present in this data set.

We're hoping to compare crime rates across years in neighborhoods that represent a distribution of average income levels. For our proposal, we're going to be using all neighborhoods available in the VPD's data set; a distribution of the levels of wealth these neighborhoods demonstrate is available in the appendix, and the relationship between the economy and crime is discussed further in our analysis.

Now, to get a better sense of our data set, we'll produce three plots to explore, at a high level, what crime looks like in Vancouver both over time and by neighborhood. Our three plots will be:

  1. A line plot of total police-reported crime over time, grouped by neighborhood; this will demonstrate the change in total crime over time
  2. A stacked bar chart showing a breakdown of crime type, grouped by neighborhood; this will demonstrate the relative frequencies of each crime type, by neighborhood
  3. A waffle plot showing crime type proportions, grouped by year; this will demonstrate the relative frequencies of each crime type, by year

Let's begin with the line plot.

It's clear that crime is highest in Strathcona, followed by Marpole, and then Shaughnessy. Crime also generally decreased in 2020, especially in Strathcona.

Alongside checking the number of crimes total in each neighborhood, it's also worth looking at how much of each type of crime is represented in our data set. Let's now look at the stacked bar chart.

This plot shows that theft from vehicle, in the Strathcona region, makes up a large proportion of all crimes committed. We can also see, thankfully, that vehicle collisions resulting in fatality make up such a small fraction of our data they're hard to distinguish.

Finally, we can look at a visualization of the relative frequencies of crime over the years through a waffle chart. N.B.: Each individual square represents 30 police-reported crimes.

No trends are particularly obvious from the waffle plot, though we can see a clear reduction in crime overall in 2020, and it seems as though it is in large part due to a steep decline in "theft from vehicle" crimes.

Now, we can compute estimates of our parameter of interest across each of our different groups. In our case, this means computing the difference between the mean of the proportion of theft crimes committed across 2017 to 2019 and the proportion of theft crimes committed in 2020. Note the proportions are computed as the total number of theft-related crimes within a given neighborhood relative to the total number of crimes that occurred within that neighborhood; hence the group_by(...) ahead of the computation of the proportion in the following code cell.
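To make this computation concrete, here is a minimal sketch of the proportion arithmetic. The report's actual code cells are in R and use the VPD data; this sketch is in Python with hypothetical counts, not the real figures.

```python
# Hypothetical counts for one neighborhood, NOT the actual VPD figures.
past_theft, past_total = 4200, 9000    # 2017-2019 combined
covid_theft, covid_total = 1100, 2800  # 2020

past_prop = past_theft / past_total
current_prop = covid_theft / covid_total

# A negative value indicates proportionally less theft-related crime in 2020.
diff_in_props = current_prop - past_prop
print(round(past_prop, 3), round(current_prop, 3), round(diff_in_props, 3))
```

In the R workflow, the same arithmetic happens once per neighborhood after the group_by(...).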

Note that a negative value indicates a decrease in crime during 2020. Our initial data suggests that the proportion of theft-related crime has actually mostly decreased since the pandemic began. While this is a good estimate to begin our analysis with, it's by no means sufficient to draw conclusions from; more on why that is the case is discussed below.

Methods

Strengths

Our report uses data from the Vancouver Police Department to study trends in crime across select Vancouver neighborhoods. It is "trustworthy" in the sense that we are careful to provide ranges for our final answer, rather than solely point estimates. However, that alone isn't enough to guarantee an effective analysis.

We're also using a kind of blocking (or simply "grouping") in our analysis that aids our interpretation of the results. Crime has been linked to wealth (and more specifically, socioeconomic status, or SES) [World Bank, 8], so we'll treat neighborhood as a blocking variable and group our data by neighborhood, each representing a different level of wealth. However, we never randomly sample within those groups, as we consider our data to already be a sample, so this is not blocking in the strict sense. Regardless, this "grouping" step will help us draw more meaningful conclusions within our analysis.

Finally, we're being careful to not simply compare the trend of crime between 2019 and 2020. Since our focus is COVID-19, we're not especially interested in the 1-year trend, but rather, how 2020 has deviated from the "norm." It is plausible that 2019 was an outlier, so "merging" the 3-year span of 2017 to 2019 should provide us a better sense of the general, recent crime levels in Vancouver.

Limitations

Some could argue that many of the "petty" crimes we're interested in, like theft, largely go unreported. And while our analysis does indeed aim to provide an estimate from this sample of strictly reported crimes, arguing what makes a "fair" range is a hard question to answer. Countless factors affect whether or not a crime gets reported, and trust in police generally was found to be at a "record low" in August of 2020 [New York Times, 10], and according to Statistics Canada, "less than half of Canadians thought their local police were doing a good job of being approachable and easy to talk to" [Statistics Canada, 14]. Accounting for this variability, statistically, is arduous.

Furthermore, while our neighborhood selections do represent a fair amount of spread in the wealth of various Vancouver neighborhoods, they're by no means perfectly representative. Shaughnessy, for one, being at a rather high-end extreme of wealth, might be less useful in our analysis, as it doesn't really generalize to any kind of useful population outside of that specific neighborhood.

Analysis

Of course, the plots and estimates provided above are not sufficient for any stakeholder, such as the municipal government or the VPD themselves. Since we're working with a sample, we must report a range of plausible values, rather than a single point estimate. Thankfully, there are a few steps we can take to do this in different ways:

Firstly, we can produce a hypothesis test, where our null is that theft-related crime rates have not changed:

$$H_0: \Delta{p} = 0$$

where $\Delta{p} = p_1 - p_2$ and

$p_1$ is the proportion of theft-related crime in 2020

$p_2$ is the proportion of theft-related crimes, on average, between 2017 and 2019,

and thus $\Delta{p}$ is the difference in the proportion of theft-related crimes relative to all committed crimes (within a given neighborhood) between those two time periods.

Our alternative hypothesis is "that there has been some sort of change"; in this case, it's that there is a statistically significant difference in the amount of crime between these two time periods:

$$H_A: \Delta{p} \ne 0$$

We'll have to do a two-tailed, two-sample z-test and check if our sample difference in proportions falls within our significance level. We'll conduct our analysis with an alpha value of 5%. This is a relatively standard value to use and will allow us to gauge if our findings are significant, or if differences in proportions can instead be merely attributed to sampling variation instead.

Secondly, we can produce a confidence interval to report a range of plausible values alongside our difference in proportions statistic. We'll use two techniques, both bootstrapping (i.e., generating a bootstrapped sampling distribution) and asymptotics, to yield a range of values at a confidence level ("CL") of 95%.

(Note that two-tailed hypothesis tests and confidence intervals are practically equivalent at ${CL} = 1 - \alpha$, but doing both approaches is good for rigor and completeness, and allows us to use both bootstrapping and asymptotics.)

Along the way, we will be sure to visualize and interpret our results within the context of our problem. These techniques will also allow us to report something actually sufficient for a stakeholder: both a statistic and a range. This is at the core of inferential statistics.

Results

As described in the analysis section, we will conduct four different analyses.

  1. Hypothesis test with bootstrapping
  2. Confidence interval with bootstrapping
  3. Hypothesis test with asymptotics
  4. Confidence interval with asymptotics

We begin with our bootstrapped hypothesis test.

Bootstrapping

We'll begin with a hypothesis test done via bootstrapping. To do so, we'll use the infer package. Conducting a hypothesis test using bootstrapping and infer requires the following steps:

  1. Specify your response variable
  2. Create your "null hypothesis"; in our case, we'll specify "independence," which is a short way of telling infer that the null hypothesis is that $\Delta{p} = 0$
  3. We'll generate our null model by taking a bootstrap sample of our data; we'll use 1,000 repetitions, which balances a reasonable code-cell runtime against an accurate P-value
  4. We specify the value we want to calculate; again, in our case, it's a difference in proportions, shortened to "diff in props"

Before we're able to conduct the test, however, we'll have to format our data to capture both whether a crime is theft-related, is_theft, and whether it occurred before 2020 or during 2020, period. Let's begin by doing the necessary wrangling.
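As a rough illustration of this wrangling step (in Python, with a few made-up records standing in for the R data frame; the is_theft and period names follow the report's own naming):

```python
# Made-up records standing in for rows of the VPD data frame.
records = [
    {"year": 2018, "type": "Theft from Vehicle", "neighbourhood": "Marpole"},
    {"year": 2020, "type": "Mischief", "neighbourhood": "Marpole"},
    {"year": 2020, "type": "Theft of Bicycle", "neighbourhood": "Marpole"},
]

# Flag theft-related crimes and label each row's time period.
wrangled = [
    {
        **r,
        "is_theft": "theft" in r["type"].lower(),
        "period": "covid" if r["year"] == 2020 else "pre_covid",
    }
    for r in records
]
print(wrangled)
```

The R equivalent would be a mutate(...) creating the same two columns.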

We now have our desired data frame. Before we conduct the hypothesis test, we'll also need to gather a list of all twenty-four neighborhoods in the data set, as we'll have to iterate the hypothesis testing logic for each of these neighborhoods and append the results to our data frame. (We did this earlier in our exploration, but the reminder should be helpful.)

Now we have everything we need to conduct our hypothesis test using bootstrapping. There are many valid approaches here; we could use some combination of group_by and summarize to work through each neighborhood. Instead, we opted to initially construct an empty data frame, and successively bind our results to that data frame. Then, we loop through each of the neighborhoods in the previous vector.

Within the loop, the steps we must take are as follows:

  1. Filter our wrangled crime data down to just containing reported crimes for our given neighborhood.
  2. Create two data frames:
    • pre_covid, a data frame containing all reported crime for a given neighborhood that occurred from 2017 to 2019
    • covid, a data frame containing all reported crime for a given neighborhood that occurred during 2020
  3. With each of these data frames, compute each proportion of theft-related crime relative to all crime; these are equal to "mean" of the is_theft column (treated as a numeric value instead of a boolean). This will effectively yield a proportion, as it will compute the sum of 1 + 1 + 0 + 1... divided by the size of the data frame.
  4. Compute our observed difference in proportions, obs_diff_in_props, which is our test statistic.
  5. Conduct the infer workflow, generating our "null model" using 1,000 bootstrapped sample repetitions
  6. Compute the probability of seeing our test statistic under the null model, the P-value
  7. Format the resulting data into a data frame, and bind it to our results data frame
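The loop above can be sketched for a single neighbourhood as follows. The report uses R's infer package; this stdlib-Python sketch simulates the independence null by shuffling the period labels, and the theft flags below are simulated, not VPD records.

```python
import random

random.seed(42)

# Simulated theft flags for one neighbourhood: 1 = theft-related, 0 = other.
pre_covid = [1] * 450 + [0] * 550   # 2017-2019
covid = [1] * 120 + [0] * 180       # 2020

obs_diff = sum(covid) / len(covid) - sum(pre_covid) / len(pre_covid)

# Under H0 the period labels are exchangeable, so shuffle them and
# recompute the statistic many times to build a null distribution.
pooled = pre_covid + covid
null_stats = []
for _ in range(1000):
    random.shuffle(pooled)
    sim_covid = pooled[: len(covid)]
    sim_pre = pooled[len(covid):]
    null_stats.append(sum(sim_covid) / len(sim_covid) - sum(sim_pre) / len(sim_pre))

# Two-tailed P-value: how often is a simulated statistic as extreme as observed?
p_value = sum(abs(s) >= abs(obs_diff) for s in null_stats) / len(null_stats)
print(obs_diff, p_value)
```

In the actual analysis, this entire computation runs once per neighbourhood, and the resulting statistic and P-value are bound to the results data frame.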

Interpretations of the P-value are saved for a subsequent code cell, where we'll compare our P-values to our significance level, 5%.

Our final data frame will take on the following shape:

| Column | Description |
| --- | --- |
| neighbourhood | The neighbourhood the crimes took place in. |
| past_prop | The proportion of reported, theft-related crimes to all reported crimes from 2017 to 2019. |
| current_prop | The proportion of reported, theft-related crimes to all reported crimes in 2020. |
| diff_in_props | Our test statistic, equal to the difference of current_prop and past_prop. |
| p_value | The likelihood of observing our test statistic under the null model, where there is no difference in the proportion of theft-related crimes. |

Figure 19: Showing the form of our final data for the bootstrapping workflow.

To further illustrate this workflow, we'll do one neighborhood on its own and visualize the null model, just to get a better sense of what's happening within the loop. We'll use Marpole as an example.

We now need to interpret these P-values within the context of the problem. Let's first add a column that indicates whether or not we should reject each P-value at a 5% significance level.

We'll now do a brief interpretation of the results. Since we're doing our analysis across all twenty-four available neighborhoods, we have twenty-four separate hypothesis tests to interpret; to make this more digestible, we'll write a "general" interpretation here and discuss our interpretation of the results in the discussion section.

In general, if reject5 is TRUE, indicating we should reject the null hypothesis at a 5% significance level, this means there was indeed a statistically significant difference in the proportion of theft-related crime between 2020 and the average of the years 2017 to 2019. For example, in the Kitsilano neighborhood, our P-value was just 0.02, or 2%. This indicates that under the null model, we had a 2% chance of observing a value as extreme as or more extreme than our test statistic (in either the left or right tail, since the test was two-tailed); since this is smaller than our threshold of 5%, we do consider it significant. We conclude there is a statistically significant difference in the proportion of theft-related crime in 2020 compared to the mean of the past three years.

Likewise, in general, if reject5 is FALSE, indicating we should not reject the null hypothesis at a 5% significance level, this means there was not a statistically significant difference in proportions. For example, in the Renfrew-Collingwood neighborhood, our P-value was 0.06, or 6%. This indicates that under the null model, we had a 6% chance of observing a value as extreme as or more extreme than our test statistic (in either the left or right tail, since the test was two-tailed); since this is larger than our threshold of 5%, we don't consider it significant. We conclude there was no statistically significant difference in theft-related crime in 2020.

We can quickly compute the total numbers of rejected and non-rejected neighborhoods.

In the discussion, we'll consider all twenty-four tests together and see what trends, if any, we notice, especially considering the differing average income levels of each neighborhood.

Confidence Intervals via Bootstrapping

Similarly, we could also construct confidence intervals via bootstrapping; this, in a way, "inverts" our analysis. Instead of beginning from an assumption of the true value, we'll use our sample to compute a range of plausible values.

Just like we did for hypothesis testing, we'll first construct a basic confidence interval for one neighborhood; then, we'll do it for all twenty-four. Additionally, we'll visualize which confidence intervals capture our null value of zero; this will show how many neighborhoods do indeed demonstrate a statistically significant difference in the proportion of theft-related crime. (Note that a two-tailed hypothesis test, as we did above, is effectively the same as a confidence interval; the interpretation lines up exactly, we just consider whether the interval captures zero instead.)

Our process is much more straightforward than before as well, which is a nice bonus:

  1. Begin by filtering down the crime_data_processed data frame to just the neighborhood we're interested in
  2. Generate a bootstrap sample for 1,000 repetitions
  3. Construct the confidence interval by calling get_ci(...)
  4. Append our findings to the results data frame

We'll handle the logic of whether or not an interval captured zero later in the code.
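As a sketch of this workflow for one neighbourhood, here is a simple percentile bootstrap. The report calls infer's get_ci(...) in R; this Python version uses stdlib resampling, and the theft flags are simulated rather than VPD data.

```python
import random

random.seed(7)

# Simulated theft flags for one neighbourhood: 1 = theft-related, 0 = other.
pre_covid = [1] * 450 + [0] * 550  # 2017-2019
covid = [1] * 120 + [0] * 180      # 2020

boot_stats = []
for _ in range(1000):
    # Resample each period with replacement and recompute the statistic.
    b_pre = random.choices(pre_covid, k=len(pre_covid))
    b_cov = random.choices(covid, k=len(covid))
    boot_stats.append(sum(b_cov) / len(b_cov) - sum(b_pre) / len(b_pre))

# 95% percentile interval: take the middle 95% of the bootstrap statistics.
boot_stats.sort()
lower_ci = boot_stats[int(0.025 * len(boot_stats))]
upper_ci = boot_stats[int(0.975 * len(boot_stats)) - 1]
captures_zero = lower_ci <= 0 <= upper_ci
print(lower_ci, upper_ci, captures_zero)
```

If captures_zero is True, that neighbourhood would show no statistically significant change at the 95% confidence level.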

This interval does capture the value of 0, as lower_ci is less than 0 and upper_ci is greater than 0. This means that Dunbar-Southlands would be considered to show no statistically significant change.

Let's try this process now on each neighborhood separately.

We now have examples of intervals that both do and do not capture the value of 0; again, when the interval does not capture the value of 0, we would in turn reject a "supposed" null hypothesis that there is no difference in the proportions at a significance level of $\alpha = 1 - \text{CL}$, where "CL" stands for our confidence level.

Let's now visualize our results for all neighborhoods.

It appears that while many neighborhoods' intervals capture the value of zero, many do not; the Musqueam neighborhood is notable here as well, as its extremely small sample size, relative to other neighborhoods, led to a very wide confidence interval. We also observe that the Central Business District has had a sharp change in the proportion of theft-related crime, as its interval is one of the smallest observed, yet it is very far from containing zero.

Asymptotics

We'll now conduct hypothesis testing based on results from the Central Limit Theorem (CLT). To conduct a two-sample z-test of proportions, we'll rely on the following formula:

$$ Z = \frac{\hat{p_1} - \hat{p_2} - 0}{\sqrt{\hat{p_{\text{pooled}}} \cdot (1 - \hat{p_{\text{pooled}}}) \cdot \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} $$

where

$Z$ is the standardized z-score, to be used as our normalized test statistic

$\hat{p_1}$ is the proportion of theft-related crime relative to all crime in 2020

$\hat{p_2}$ is the proportion of theft-related crime relative to all crime on average from 2017 to 2019

$\hat{p_{\text{pooled}}}$ is the pooled proportion of $\hat{p_1}$ and $\hat{p_2}$

${n_1}$ is the total number of crimes in 2020 (for a given neighborhood)

${n_2}$ is the number of crimes on average from 2017 to 2019 (for a given neighborhood)
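A worked sketch of this z-test, using the formula above (Python, stdlib only; the counts are hypothetical, and math.erf stands in for R's pnorm):

```python
import math

def normal_cdf(z):
    # Standard normal CDF via the error function (the role pnorm plays in R).
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical theft counts and crime totals for one neighbourhood.
s1, n1 = 120, 300   # thefts and total crimes, 2020
s2, n2 = 450, 1000  # thefts and total crimes, 2017-2019

p1_hat, p2_hat = s1 / n1, s2 / n2
p_pooled = (s1 + s2) / (n1 + n2)

# Pooled standard error, per the formula in the text.
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se

# Two-tailed P-value.
p_value = 2 * (1 - normal_cdf(abs(z)))
print(round(z, 3), round(p_value, 3))
```

We would then compare p_value against the 5% significance level to decide whether to reject the null hypothesis for that neighbourhood.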

It looks like we'll have to filter out the Musqueam neighborhood for this portion of our analysis.

Let's first define a helper function that will, from our processed crime data, compute the needed summary statistics for a given neighbourhood during a needed time period. The summary statistics we need to compute a difference in proportions are the proportion of theft-related crime to all crime, prop, the total number of theft related crimes, s (for "success"), and the total number of crimes, n. These values will be used in our calculation of our Z-score.

Note that in lieu of the following helper function, we could've alternatively used the group_by and summarize pattern to achieve a similar workflow.

We can now compute our "before pandemic" summary statistics.

We can again re-use our helper to do this kind of analysis for our 2020, or "during pandemic," data.

Finally, let's combine the data.

Note that for our application of the CLT to be valid, we'll also need to check a few key assumptions: observations should be independent, and each group should be large enough for the normal approximation to hold (the standard success-failure condition, i.e., at least 10 expected theft and 10 expected non-theft crimes in each group).

Checking these conditions, it looks like the Musqueam neighborhood will not be valid for the rest of our analysis.

From a quick glance at the data, there seems to be a difference with the proportion of thefts between 2020 and the previous years. In fact, quite a few seem to have a decrease in crime rate, as we saw in our initial exploration as well.

We can now compute our test statistic as before, and determine our P-value. The full workflow for conducting our hypothesis test will be as follows:

  1. Compute the pooled proportion for each neighborhood
  2. Compute the Z-score for each neighborhood
  3. Compute the P-value using pnorm(...), an R built-in
  4. Determine whether or not we should reject the null hypothesis, based on a significance level of 5%

Again, we can quickly summarize this data and see what proportion of neighborhoods were rejected or not.

One thing to note is that a large majority of the neighbourhoods, whether the null hypothesis was rejected or not, seem to show a decrease in reported thefts in 2020 compared to past years.

Confidence Intervals via Asymptotics

Finally, this brings us to the last stage of our analysis. Just as we did with bootstrapping, we'll also construct a 95% confidence interval using classical techniques.

This time, we'll need to compute the following:

$$\Delta{\hat{p}} = \hat{p_1} - \hat{p_2}$$

as we've seen before, the difference in proportions, and

$$SE_{\hat{p_1} - \hat{p_2}} = \sqrt{\frac{\hat{p_1} \cdot (1 - \hat{p_1})}{n_1} + \frac{\hat{p_2} \cdot (1 - \hat{p_2})}{n_2}} $$

where $SE_{\hat{p_1} - \hat{p_2}}$ is the standard error in the difference of proportions.

Note the definitions for $\hat{p_i}$ and $n_i$ carry forward from before.
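A sketch of the interval computation using these formulas (Python, stdlib only; the counts are hypothetical, not the VPD figures):

```python
import math

# Hypothetical theft counts and crime totals for one neighbourhood.
s1, n1 = 120, 300   # 2020
s2, n2 = 450, 1000  # 2017-2019

p1_hat, p2_hat = s1 / n1, s2 / n2
diff = p1_hat - p2_hat

# Unpooled standard error, per the SE formula in the text.
se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
z_star = 1.96  # critical value for a 95% confidence level

lower_ci = diff - z_star * se
upper_ci = diff + z_star * se
captures_zero = lower_ci <= 0 <= upper_ci
print(round(lower_ci, 3), round(upper_ci, 3), captures_zero)
```

Note the hypothesis test uses the pooled proportion in its standard error, while the confidence interval uses each sample proportion separately, as the formulas above show.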

We'll also need to check our assumptions to apply the CLT. Note these are the same assumptions as before!

We'll use our data set from before, crime_data_merged_clt, to again filter out the Musqueam neighborhood.

Now that those conditions are met, we can apply the CLT. Let's compute the confidence interval based on Central Limit Theorem.

Again, we can visualize our results.

Again, if the confidence interval captures zero, it means there's no significant difference between crime rates of 2020 and before 2020. If the confidence interval doesn't capture 0, it means there is indeed a significant difference between crime rates of 2020 and before 2020.

We find once again that, based on 95% confidence intervals across twenty-four neighbourhoods, twelve neighbourhoods capture zero and twelve do not.

Bootstrapping or Asymptotics?

The question of which method is preferred has a relatively clear answer within our analysis. While asymptotics is a completely valid technique, the sample sizes of the neighborhoods varied greatly, which limited the technique as a whole; we even had to filter out the Musqueam neighborhood. In a case such as this, there is truly nothing you can do to take a larger sample. Using the CLT means our sample sizes must be "large enough," and when working with data where many different sample sizes are studied at once, that can be a severe limitation.

Bootstrapping is a clear winner here. We could apply the technique to all of the neighborhoods available in the data set without issue. This means that our preferred final results are as follows.

Discussion

In our analysis, we conducted a hypothesis test for each Vancouver neighborhood to see whether or not there was a statistically significant change in the proportion of theft-related crime relative to all crime between 2020 and the average of 2017 to 2019 at a 5% significance level, and found that roughly half of the neighborhoods did not show a significant change. In tandem, we conducted an analysis using asymptotics, but ultimately found that bootstrapping would lead to more accurate results and would allow us to use all of the neighborhoods in our original data set.

Notably, looking at the neighborhoods where we do observe a difference, that difference is largely a reduction in crime, as the difference in proportions was in fact negative. This would be a good candidate for further analysis, as we could use left-tailed tests, instead of two-tailed tests, to confirm whether this may indeed be a more general trend. This goes against what we anticipated to find; while COVID-19 has indeed led to an unprecedented economic decline for many countries and likewise many major cities across the world, it seems as though in Vancouver, this has not led to a distinct increase in theft-related crime across the board. Among the neighborhoods that we observed a statistically significant difference for, there is no clear economic grouping; there are so-called "middle class" neighborhoods like Fairview and West End, more "upper class" neighborhoods like Kitsilano, and "lower class" neighborhoods like Strathcona. At least in our analysis, there was no clear link between average income level in a neighborhood and whether or not a significant difference in the proportion of theft-related crime was observed.
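The left-tailed test suggested above can be sketched as a two-proportion z-test (an illustration in Python with invented counts, not the report's R code or the actual VPD figures):

```python
import math

def ztest_prop_diff(x1, n1, x2, n2):
    """Two-proportion z-test for p1 - p2.
    Returns the z statistic, the two-sided p-value, and the
    left-tailed p-value (H1: p1 < p2)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # pooled proportion under H0: p1 = p2
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF at z
    return z, 2 * min(phi, 1 - phi), phi

# Hypothetical counts with fewer thefts in 2020 than the 2017-2019 average.
z, p_two, p_left = ztest_prop_diff(250, 1000, 330, 1100)
```

When the observed difference is negative, the left-tailed p-value is half the two-sided one, so a one-sided test would detect a decrease in more neighbourhoods at the same 5% level.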

There are many reasons why there might not be a difference in theft crime, and why the difference, where notable, is a decrease. "Petty" crimes such as theft often require some sort of human interaction; however, due to the pandemic, there has been an understandable reduction in the amount of social interaction for everyone, thieves included. Additionally, locations where petty crimes may occur, such as restaurants, bars, or stores, have had to close their doors for sustained periods throughout 2020, and unfortunately into 2021, so there is clearly less opportunity for petty crimes to take place.

Additionally, it's worth returning to our initial point about "reported" crimes. Treating "reported" crimes as a sample of all "actual" crime may simply be an inappropriate way to handle the data. Estimating just how many crimes go unreported is a daunting task, and while we were able to somewhat account for unreported crimes by applying inferential statistical techniques, the vast uncertainty and high sensitivity in these figures might render our results useless in the end. In 2020, it is likely that, on the whole, many people thought twice before involving the police in various situations; there has especially been a push, at least in academic circles, towards de-escalation (without involving the police) for minor crimes such as the ones we look at in this report [NYT Magazine, 15]. Whether or not this has been observed in Vancouver is a question for future studies, but it is again worth noting when interpreting our results.

Reflection

In our final report, we expected to find that the rate of theft-related crime had changed significantly from past years, supporting the hypothesis that the pandemic is correlated with changing crime rates, including petty crimes at the local level. In fact, for roughly half of the neighborhoods in Vancouver, there has not been a statistically significant difference in the proportion of theft-related crime. While we did observe some noticeable differences in select neighborhoods, there was no evidence of any kind of systematic change in the year 2020 relative to previous years. Additionally, there was no correlation between a neighborhood's economic status and whether it saw a change.

Still, we anticipate the impact of our findings will be the production of actionable insights for both city officials and the VPD. Getting budgets right is hard, but data-driven methods make the process significantly easier. Through hypothesis testing, we can keep the safety and security of Vancouverites a top priority while weighing police funding against additional social security measures.

Three examples of further questions that could extend this analysis are:

  1. Have other (perhaps smaller-scale) pandemics affected the rate of theft-related crime? A study could be done of SARS in Toronto in 2003-04.
  2. How has COVID-19 affected the rate of theft-related crime at different cities across Canada? Cities could include: Calgary, Edmonton, Winnipeg, Regina, etc.
  3. In what ways has the rate of violent crime changed since the pandemic began? We could look at the rate of more intense crimes, still on a city basis.

References

  1. Statistics Canada, The Social and Economic Impacts of COVID-19: A Six-Month Update. Published October 20, 2020. Source.

  2. Interpol, Operation Pangea – shining a light on pharmaceutical crime. Published November 21, 2019. Source.

  3. ABC News, FBI warns of potential surge in hate crimes against Asian Americans amid coronavirus. Published March 27, 2020. Source.

  4. UN News, UN chief calls for domestic violence ‘ceasefire’ amid ‘horrifying global surge’. Published April 6, 2020. Source.

  5. CBC, City of Vancouver freezes police department funding as part of 2021 budget. Published December 8, 2020. Source.

  6. Vancouver Police Department, Crime Data. Accessed March 5, 2021. Source.

  7. Piazza, Group Project clarification. Published March 4, 2021. Source.*

  8. Wright, Bradley R. Entner, Avshalom Caspi et al in Criminology, Reconsidering the Relationship Between SES and Delinquency: Causation but Not Correlation. Published March 7, 2006. Source.

  9. City of Vancouver, Census local area profiles 2016. Published April 10, 2018. Source.

  10. New York Times, Confidence in Police Is at Record Low, Gallup Survey Finds. Published August 12, 2020. Source.

  11. Wickham, Hadley and Garrett Grolemund, R for Data Science. Published December 2016. Source.

  12. Timbers, Tiffany-Anne, Trevor Campbell and Melissa Lee, Data Science: A First Introduction. Last updated January 12, 2021. Source.

  13. Statistics Canada, Police-reported crime statistics in Canada, 2019. Published October 29, 2020. Source.

  14. Statistics Canada, Public perceptions of the police in Canada’s provinces, 2019. Published November 25, 2020. Source.

  15. Bazelon, Emily, New York Times Magazine. Published June 13, 2020. Source.

* Note the Piazza reference is used to ensure we're handling a proportions analysis correctly, since it was a bit unclear in the instructions.

Appendix

In our analysis, we often reference the socioeconomic status of various neighborhoods in Vancouver. The following code cells display the full economic distribution we reference, from the 2016 City of Vancouver Census, available here.

First, we'll need to tidy the data. To do so, we'll clean up the column names using clean_names(), grab the survey question we're interested in from the data set, and use gather(...) to switch the neighborhoods from being column names to an actual column value, neighbourhood, themselves.
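The actual tidying is done in R with clean_names() and gather(); as a dependency-free illustration of the same wide-to-long reshape, with hypothetical neighbourhood names and invented income values:

```python
# Hypothetical wide-format slice of the census data: one survey question,
# with neighbourhoods as column names (values invented for illustration).
wide = {
    "Average total income": {
        "Fairview": 65000,
        "Strathcona": 32000,
        "Kitsilano": 78000,
    },
}

# "Gather" the neighbourhood columns into rows: each row now holds the
# question, the neighbourhood, and its value, mirroring tidyr's gather().
tidy = [
    {"variable": question, "neighbourhood": name, "income": value}
    for question, row in wide.items()
    for name, value in row.items()
]
```

After this step each neighbourhood is a value in its own column rather than a column name, which is what the plotting code below expects.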

Now that we've tidied the data, we can visualize it in a histogram, to see the overall distribution, and a bar chart, to view the income by neighborhood.

The distribution mostly centers around the $50,000 mark, though it has no clear peak. The histogram makes clear that there is one notable outlier in the data.

The bar chart, as a follow-up, shows that it is indeed the Shaughnessy neighborhood that is a clear outlier in terms of average income of residents in 2016. The bar chart shows there are some natural groups that can be made, such as Strathcona to Victoria/Fraserview as "very low", Hastings/Sunrise to Marpole as "low", etc., that do not necessarily fit into a prescribed dichotomy like "lower class," "middle class," and "upper class."

Both the histogram and the bar plot will be referenced throughout the analysis to aid the interpretation of the results.