This study uses ZG City as a case to estimate flow-based offender counts and to examine their relationship with crime. ZG City is one of the most important metropolises in China, and its relatively large numbers of crimes and arrestees provide the data needed for this study. Under a confidentiality agreement, the city is referred to as ZG City rather than by its real name. Other studies have followed the same practice, as crime data in China are not publicly available [31,32,33,34]. The unit of analysis is the community, or Shequ in Chinese. ZG City is divided into 2643 communities, with an average area of 2.74 km².
2.1. Estimation of Offender Counts (Independent Variables)
The mobility data come from the DAAS platform of Unicom, one of the three largest mobile phone service providers. The platform records users who stay at a base station for more than 30 min. A total of 468 million travel records were generated in ZG City in October 2020. After desensitization and aggregation of the data, residents’ trajectory data were extracted, as shown below.
As shown in Table 1, each community has a unique 12-digit code. If a user had a trajectory from community A to B to C to D, it would be summarized as three trips: A to B, B to C, and C to D. A, B, and C are starting communities (O_Code), while B, C, and D are ending communities (D_Code). O_Ptype and D_Ptype give the location classification of the starting community and the ending community, respectively. A value of “0” means a visited community, “1” means the home community, and “2” means the work community of the user. “Count” is the number of trips from the starting community to the ending community. In the first line of Table 1, community “440113105202” is the user’s home community, and community “440106011029” is the user’s work community. The number of trips made by the user from home to work is 2. This study only analyzed the flows that originated from home (O_Ptype = 1).
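The trip summarization described above can be illustrated with a minimal Python sketch. The function names and the toy trajectories are hypothetical and are not part of the DAAS platform; keeping only trips whose origin is the user's home community is intended to mirror the O_Ptype = 1 filter.

```python
from collections import Counter


def trajectory_to_trips(trajectory):
    """Split an ordered list of visited communities into consecutive
    (origin, destination) trips, e.g. A-B-C-D -> A-B, B-C, C-D."""
    return list(zip(trajectory[:-1], trajectory[1:]))


def count_home_trips(trajectories, home_of):
    """Count trips per (origin, destination) pair, keeping only trips
    that start at the user's home community (analogous to O_Ptype = 1).

    trajectories: {user_id: [community, community, ...]}  (hypothetical input)
    home_of:      {user_id: home community}               (hypothetical input)
    """
    counts = Counter()
    for user, trajectory in trajectories.items():
        for origin, destination in trajectory_to_trips(trajectory):
            if origin == home_of[user]:
                counts[(origin, destination)] += 1
    return counts


trips = trajectory_to_trips(["A", "B", "C", "D"])
home_trips = count_home_trips(
    {"u1": ["A", "B", "C", "D"], "u2": ["A", "C", "A", "C"]},
    {"u1": "A", "u2": "A"},
)
```

In this toy input, user u2 contributes two home-origin trips from A to C, while the return trip from C to A is dropped because it does not originate from home.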
The data above contain the mobility flows of both residents and offenders, and offenders cannot be separated from other residents in these data. Offenders’ home locations were obtained from an arrestee database. The flow pattern of the offenders is assumed to be similar to that of the residents living in the same community. Thus, by combining residents’ mobility with offenders’ residences, the flow-based offender count of a community can be estimated as follows (see Figure 1):
Assume that there are four communities labeled A, B, C, and D. Taking community A as an example, if 1000 trips originated from community A, with 600 destined for community B and 400 for community C, then the relative mobility ratio is 60% from A to B and 40% from A to C. If 20 offenders are known to live in community A and are assumed to follow these mobility ratios, the number of visiting offenders would be 12 (20 × 60%) from A to B and 8 (20 × 40%) from A to C. The process is repeated for each starting community to calculate the number of visiting offenders to each ending community. Finally, the visiting offenders to each community are summed over all starting communities (Figure 1).
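The worked example above (20 offenders in community A, 600 trips to B and 400 to C) can be reproduced with a short Python sketch. The function name and the nested-dictionary layout are assumptions chosen for illustration and do not describe the original implementation.

```python
from collections import defaultdict


def allocate_offenders(trips, resident_offenders):
    """Distribute each origin's resident offenders over destinations in
    proportion to the home-origin trip counts.

    trips:              {origin: {destination: trip_count}}
    resident_offenders: {origin: number of offenders living there}
    Returns {destination: estimated number of visiting offenders}.
    """
    visiting = defaultdict(float)
    for origin, destinations in trips.items():
        total_trips = sum(destinations.values())
        if total_trips == 0:
            continue  # no outgoing trips, nothing to allocate
        offenders = resident_offenders.get(origin, 0)
        for destination, trip_count in destinations.items():
            # mobility ratio = trip_count / total_trips
            visiting[destination] += offenders * trip_count / total_trips
    return dict(visiting)


visiting = allocate_offenders({"A": {"B": 600, "C": 400}}, {"A": 20})
```

Because the contributions are accumulated per destination, adding further origin communities to `trips` sums their visiting offenders into the same totals.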
To test the effectiveness of the flow-based offender count, we compared it with two conventional counts of offenders. One is the home-based count of arrested offenders in each community. Among the 2643 communities, 1117 had arrestees. The other is the spatial-lagged count, which considers that offenders may also come from the immediately neighboring communities. Among the 3921 arrested offenders in ZG City, 1191 offenders, or roughly 30% of all offenders, travelled less than 1.66 km to commit crime. Since the average size of the communities is 2.74 km², the average distance between neighboring communities is roughly the square root of 2.74, or about 1.66 km. Because these two distances coincide, it is reasonable to assume that 30% of the offenders may commit crime in the neighboring communities. Therefore, the spatial-lagged count is the sum of the home-based count and 30% of the home-based counts of the neighboring communities.
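The arithmetic behind the spatial-lagged count can be sketched in Python as follows. The neighbor lists and home-based counts are toy values, and the function name is hypothetical; only the figures 2.74, 1191, 3921, and the 30% weight come from the text.

```python
import math


def spatial_lagged_counts(home_counts, neighbors, weight=0.30):
    """Home-based count plus `weight` times the home-based counts of the
    immediately neighboring communities."""
    return {
        community: home_counts.get(community, 0)
        + weight * sum(home_counts.get(n, 0) for n in neighbors.get(community, []))
        for community in home_counts
    }


# Figures reported in the text
approx_spacing_km = math.sqrt(2.74)   # side length of a square of 2.74 km^2
share_short_trips = 1191 / 3921       # offenders travelling less than 1.66 km

# Toy example with three communities in a row: A - B - C
home = {"A": 20, "B": 10, "C": 0}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
lagged = spatial_lagged_counts(home, neighbors)
```

Treating the square root of the mean area as the typical spacing between neighbors implicitly assumes roughly square communities of similar size.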
To make the flow-based count comparable to the spatial-lagged count, the summation of flow-based counts contributed from the neighboring communities is multiplied by a coefficient of 22.95 such that it is the same as the summation of spatial-lagged counts. The value of 22.95 is empirically calculated, and it may vary from one study area to another. The number of flow-based offenders for each focal community is calculated as follows:
$$F_j = H_j + 22.95 \times \sum_{i \neq j} V_{ij} \tag{1}$$
where $F_j$ is the flow-based offender count in community $j$, $H_j$ is the number of offenders living in community $j$, and $V_{ij}$ is the number of offenders visiting community $j$ from community $i$ where they live. The multiplier of 22.95 makes the flow-based count comparable to the spatial-lagged count. The calculation process is shown in Figure 1.
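One possible reading of how such a multiplier could be obtained is sketched below in Python: the coefficient is taken as the ratio between the total neighbor contribution in the spatial-lagged counts and the total number of visiting offenders, so that the two sums match after scaling. This derivation is an assumption for illustration; the text reports only the resulting value of 22.95 for ZG City, and all names and numbers in the sketch are hypothetical.

```python
def scaling_coefficient(lagged_neighbor_parts, visiting_counts):
    """Ratio that makes the summed visiting offenders equal to the summed
    neighbor contributions of the spatial-lagged counts (assumed reading)."""
    total_visiting = sum(visiting_counts.values())
    if total_visiting == 0:
        raise ValueError("no visiting offenders to scale")
    return sum(lagged_neighbor_parts.values()) / total_visiting


def flow_based_counts(home_counts, visiting_counts, coefficient):
    """Resident offenders plus scaled visiting offenders per community."""
    communities = set(home_counts) | set(visiting_counts)
    return {
        c: home_counts.get(c, 0) + coefficient * visiting_counts.get(c, 0)
        for c in communities
    }


# Toy values: 20 offenders live in A; B and C are both neighbors of A.
home = {"A": 20, "B": 0, "C": 0}
visiting = {"B": 12.0, "C": 8.0}              # from the mobility ratios
lagged_neighbor_parts = {"B": 6.0, "C": 6.0}  # 30% of A's home-based count each

k = scaling_coefficient(lagged_neighbor_parts, visiting)
flow = flow_based_counts(home, visiting, k)
```

With real data the same two functions would be applied to all 2643 communities, and the coefficient would differ from one study area to another.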
2.2. Dependent Variables and Covariates
The dependent variable in this study is the number of thefts in each community in 2018, provided by the Municipal Public Security Bureau of ZG City. The number of thefts in each community was counted by spatially intersecting the theft locations with the community boundaries. The calculations were performed using the “sf” package in R.
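The counting step can be illustrated with a self-contained Python sketch. It does not reproduce the “sf” workflow used in the study; instead it substitutes a plain ray-casting point-in-polygon test, and the polygons, point coordinates, and function names are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: cast a ray to the right of (x, y) and count how
    many polygon edges it crosses. An odd count means the point is inside.
    Points lying exactly on an edge are not handled specially."""
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_points_per_community(points, communities):
    """communities: {code: [(x, y), ...]}; returns {code: number of points}.
    Points outside every community are ignored."""
    counts = {code: 0 for code in communities}
    for x, y in points:
        for code, polygon in communities.items():
            if point_in_polygon(x, y, polygon):
                counts[code] += 1
                break  # assign each point to at most one community
    return counts


communities = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
thefts = [(0.5, 0.5), (0.2, 0.8), (1.5, 0.5), (3.0, 3.0)]
theft_counts = count_points_per_community(thefts, communities)
```

A production workflow would normally rely on a GIS library with a spatial index rather than this brute-force loop, which checks every point against every polygon.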
Covariates include points of interest and the proportion of the migrant population. Points of interest: According to routine activity theory, important activity places include the facilities that attract potential victims or potential offenders. Following prior studies [10,23,35,36,37], we chose bus stops, subway stations, Internet bars, KTVs, and cinemas as the important activity nodes in this study.
Proportion of the migrant population: The household registration system in China, known as the Hukou system, divides the population of a city into locals, who hold a local Hukou, and the migrant population (also known as nonlocals), whose Hukou is registered in other cities. In Chinese metropolises, local Hukou groups typically include those who were born in the city or who have transferred their Hukou to the city through higher education, permanent employment, or housing property ownership. In contrast, migrants typically do not have permanent employment and therefore cannot transfer their Hukou to the city. Although the Hukou system is not inherently based on socio-economic status, migrants typically have lower educational attainment and lower income than the local Hukou group. Most migrants live in urban–rural transitional areas, villages in the city, and factory dormitories [27]. Previous studies have found that in Chinese cities, the proportion of the migrant population is an important factor for explaining crime [24,38]. Areas with a higher proportion of migrants commonly have more crimes [24,38]. Thus, following prior studies, the proportion of the migrant population was included in the analysis.
2.3. Regression Models
Three models were implemented using the home-based count, spatial-lagged count, and flow-based count. All models used the same dependent variable and control variables (or covariates).
Poisson regression models and negative binomial regression models are commonly used to model counts, such as the number of thefts in this study. Poisson regression models assume that the mean and variance of the dependent variable are equal [39]. The probability distribution function of a Poisson regression model is as follows:
$$P(Y_i = y_i \mid X_i) = \frac{e^{-\lambda_i} \lambda_i^{y_i}}{y_i!}, \quad y_i = 0, 1, 2, \ldots \tag{2}$$
where $P(Y_i = y_i \mid X_i)$ is the probability that the number of thefts $Y_i$ equals $y_i$ when the explanatory variables $X_i$ of community $i$ are known. $\lambda_i$ is the expected number of thefts in community $i$ and depends on a series of explanatory variables $X_i$. Furthermore, the conditional expectation and conditional variance of $Y_i$ are commonly assumed as follows:
$$E(Y_i \mid X_i) = \lambda_i = \exp(X_i \beta) \tag{3}$$
$$\mathrm{Var}(Y_i \mid X_i) = \lambda_i + \alpha \lambda_i^2 \tag{4}$$
where $X_i \beta$ is the linear predictor, including the constant term, and $\alpha$ is the overdispersion parameter. If $\alpha$ equals 0, the Poisson regression model should be used. When $\alpha$ is significantly greater than 0, the negative binomial regression model should be applied.
After logarithmic transformation of Equation (3), we obtain Equation (5) as follows:
$$\ln \lambda_i = X_i \beta \tag{5}$$
However, crime counts are often overdispersed, in which case Poisson regression models underestimate the standard errors of the coefficients [40]. The negative binomial regression model handles overdispersed data by adding a residual term to the log-transformed conditional expectation function of the Poisson regression model shown in Equation (5). The newly constructed function is as follows:
$$\ln \lambda_i = X_i \beta + \varepsilon_i \tag{6}$$
where $X_i \beta$ is the linear predictor from Equation (5), and the random variable $\varepsilon_i$ represents the unobservable part or the heterogeneity of individuals in the conditional expectation function. The exponentiated residual, $\exp(\varepsilon_i)$, is assumed to follow a gamma distribution, which reduces the misestimation of the coefficients of the explanatory variables and greatly improves the fit to overdispersed data [14,41,42].
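After considering the relationship between the explanatory variables and the dependent variable, the following three models were constructed:
$$\text{Model 1:} \quad \ln \lambda_i = \beta_0 + \beta_1 H_i + \beta_4 Z_i + \varepsilon_i$$
$$\text{Model 2:} \quad \ln \lambda_i = \beta_0 + \beta_2 S_i + \beta_4 Z_i + \varepsilon_i$$
$$\text{Model 3:} \quad \ln \lambda_i = \beta_0 + \beta_3 F_i + \beta_4 Z_i + \varepsilon_i$$
In Models 1 to 3, the dependent variable is represented by $\lambda_i$, the expected number of thefts in community $i$. The independent variables $H_i$, $S_i$, and $F_i$ are the home-based offender count, spatial-lagged count, and flow-based offender count, respectively. $Z_i$ includes bus stops, subway stations, Internet bars, KTVs, cinemas, and the proportion of the migrant population. $\beta_0$ is the constant term, and $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$ are the coefficients for the corresponding independent variables. The coefficients were estimated with the maximum likelihood method.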
Both unstandardized and standardized models can be useful. Unstandardized models use the original independent variables and directly show the quantitative relationship between the independent variables and the dependent variable through the unstandardized coefficients. Standardized models, by contrast, use independent variables that have each been standardized, which places the different independent variables on a common scale within the same model. Within each standardized model, the relative importance of the independent variables can therefore be compared using the standardized coefficients.
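A minimal Python sketch of one common standardization, the z-score, is given below. The text does not specify which standardization was applied, so the z-score, the function name, and the toy values are assumptions.

```python
import statistics


def standardize(values):
    """Rescale a variable to mean 0 and sample standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant variable")
    return [(v - mean) / sd for v in values]


bus_stops = [2, 4, 4, 6]           # hypothetical counts per community
bus_stops_z = standardize(bus_stops)
```

After this transformation, a coefficient describes the change in the log of expected thefts per one standard deviation of the variable, which is what makes coefficients comparable within a model.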
To determine which method of estimating the number of offenders better explains crime, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) were used in this study [43,44]. Models with lower AIC and BIC values are generally considered to fit better.
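The two criteria follow directly from a model's maximized log-likelihood, as the following Python sketch shows. The log-likelihood values are invented purely to demonstrate the comparison and are not results of this study; the parameter count of 9 assumes a constant term, one offender count, six covariates, and the overdispersion parameter.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical log-likelihoods for the three models (NOT study results)
log_likelihoods = {"home": -5200.0, "lagged": -5150.0, "flow": -5100.0}
n_params = 9    # assumed: constant + offender count + 6 covariates + alpha
n_obs = 2643    # number of communities

scores = {
    name: (aic(ll, n_params), bic(ll, n_params, n_obs))
    for name, ll in log_likelihoods.items()
}
best_by_aic = min(scores, key=lambda name: scores[name][0])
```

Because the three models have the same number of parameters and observations, ranking them by AIC or BIC is equivalent to ranking them by log-likelihood; the criteria matter more when the models differ in complexity.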