1. Introduction
Urban agglomerations are the most advanced spatial form of urban organization, emerging in the mature phase of urban development; they denote regions within specific territorial boundaries that generally comprise multiple large cities [1, 2]. These urban clusters emerge in areas characterized by tight spatial organization and close economic linkages, which are facilitated by an advanced transport and communication infrastructure network, ultimately achieving a high degree of urban coalescence and integration [3]. As urban agglomerations develop rapidly, their spatial influence extends far beyond administrative borders and built-up areas [
4]. While the existing administrative boundaries and built-up areas align more closely with the actual form and developmental status of urban agglomerations, they are also limited in their ability to reflect the flow of factors and connections within the agglomerations [
5]. Consequently, they may fail to comprehensively capture the dynamics and complexities of urban agglomerations [
6]. Therefore, for the rational allocation of resources and factors within urban agglomerations, it is imperative to accurately identify their spatial ranges. Such identification can facilitate a more effective resolution of the developmental challenges within urban agglomerations, including optimizing resource allocation and promoting rational interaction and coordinated development among cities within the region [
7,
8]. To address these challenges, it is necessary to go beyond the conventional methods of identifying urban agglomerations and further analyze the inherent spatial influences in the contemporary era. Subsequently, a more precise method for identifying urban agglomerations can be proposed.
In past studies on identifying the spatial ranges of urban agglomerations, commonly used data types generally include statistical survey data and remote sensing data [
9]. Nighttime light (NTL) data, an important branch of remote sensing data, capture the distribution of nighttime light sources on the ground, providing intuitive indicators of urbanization levels and economic activities [
10]. NTL data not only vividly reflect the differences between urban agglomeration areas and surrounding regions but also excel in unveiling areas of rapid urbanization [
11]. Furthermore, NTL data can be used to monitor the expansion trends of urban spaces [
12]. This leads to the widespread application of NTL data in research related to urban interior spaces, including the delineation of urban boundaries, identification of urban centers, and analysis of factors influencing the spatial structure of urban spaces [
13,
14]. Currently, commonly used NTL data include NPP/VIIRS, DMSP/OLS, and Luojia-01. The DMSP/OLS NTL data have a spatial resolution of 1 km and cover the time period from 1992 to 2012. The NPP/VIIRS NTL data have a spatial resolution of 500 m and span from 2013 to 2024. In contrast, the Luojia-01 NTL data boast a higher spatial resolution of 130 m but are mainly concentrated in the years 2018–2019. Additionally, in the process of identifying the spatial range of an urban area, NTL data face limitations due to their inherent spillover effect and the inability to fully reflect the internal factor flows within an urban interior space, resulting in certain inaccuracies in the spatial range identification [
15]. Therefore, when using NTL data to identify the spatial range of an urban area, it is necessary to fuse other types of data and methods to mitigate the impacts of the nighttime light spillover effect and comprehensively reflect the internal factor flows, aiming to achieve more comprehensive and precise results in identifying the spatial range of an urban interior space.
In the contemporary information society, the development of big data technology provides new avenues for identifying and understanding the spatial range of an urban interior space [
16]. Particularly, big data from social media, originating from residents’ daily activities, offers a unique perspective for revealing the interactions and connections within urban interior spaces [
17]. Weibo sign-in data, as a form of social media data, can directly map the social and economic connections between cities through online interactions among crowds [
18]. For urban spaces, the flow of internal factors is a crucial prerequisite for identifying their spatial range, and analyzing the spatial interactions of different users within Weibo sign-in data can easily determine the range of the internal population element flow, thereby defining the spatial range of an urban space [
19,
20]. A considerable amount of research utilizes the characteristics of Weibo sign-in data to analyze content related to urban agglomerations, including the identification of urban agglomeration boundaries and the spatial connections between different cities within urban agglomerations [
21]. These studies reflect the promising prospects of applying Weibo sign-in data to this field. However, there are also some limitations to the use of social media data in urban spatial analysis [
22]. Firstly, social media users do not represent the entire population, which may lead to sampling bias. Secondly, social media data may be subject to data bias, such as the uneven distribution of user activity and the presence of false information [
23]. Therefore, like other big data, social media data simulate activity in a geo-virtual space and need to be used in conjunction with data describing the physical space.
In identifying the spatial range of an urban area, both NTL data and social media data serve as significant sources of information. However, the two data types differ in what they capture. The advantage of NTL data lies in their provision of intuitive, continuous geospatial information, which can clearly depict the physical form of cities and urban agglomerations [
24]. Nevertheless, this method also has its limitations, as it cannot offer insights into the socio-economic activities and population movements within urban agglomerations [
25]. On the other hand, the advantage of Weibo sign-in data is its reflection of the dynamism and immediacy of human interactions, offering a more authentic representation of the urban agglomeration’s influence [
26]. Moreover, social media data can provide in-depth insights into the lifestyles, consumption habits, and cultural characteristics of urban agglomeration residents, thereby offering more comprehensive information support for urban planning and regional development [
27]. However, the inherent characteristics of the data can lead to sampling biases and data skewness issues, making the use of Weibo sign-in data in urban agglomeration-related research somewhat inadequate [
28]. Consequently, an increasing number of studies are engaging in research from the perspective of fusing NTL data with urban big data, considering the combination of both data types to identify and understand the spatial ranges of urban agglomerations more comprehensively and accurately [
29].
In recent years, data fusion has been widely applied in research related to urban interior spaces. By fusing and analyzing information from diverse data sources, the precision of observations can be enhanced. Currently, the more commonly used data fusion techniques include algebraic methods, Intensity–Hue–Saturation (IHS) transformation, wavelet transformation, Principal Component Transformation (PCT), and K-T transformation, among others [
17,
30]. In terms of data fusion, the variability in the outcomes is largely dependent on the method of fusion employed, with different techniques yielding varying effects on the fused data. Specifically, wavelet transformation is noted for its ability to retain the information characteristics of the original data images as much as possible during the image fusion process, and it is increasingly applied in urban spatial studies [
31,
32]. Existing research, through the fusion of various data types, has substantiated this point, for instance, by integrating NTL data with Point-of-Interest (POI) data to extract urban built-up areas [
33], delineate urban boundaries [
34], etc. The findings from these studies indicate that data fusion generally provides stronger results in urban spatial applications than the use of single data sources [
35]. Recent studies have also explored the fusion of NTL data with Weibo data to analyze intra-urban spaces. For example, researchers have found that NTL data can reflect the economic development level of a city, while Weibo sign-in data capture residents’ emotional needs regarding urban spaces. By fusing NTL data with Weibo sign-in data, a comprehensive assessment of urban life satisfaction in China has been conducted [
36]. Therefore, building upon this foundation, we aim to effectively fuse social media data with NTL data through data fusion techniques, thereby more accurately identifying the spatial range of urban agglomerations.
This study is conducted from several perspectives. Firstly, we identify the spatial range of urban agglomerations based on NTL data. Secondly, this study fuses NTL data with social media data to identify the spatial range of urban agglomerations. Thirdly, this study compares and validates the results obtained from these two identification methods. The primary contribution of this study lies in proposing a novel method for identifying urban spatial boundaries by fusing NTL data with Weibo sign-in big data through neural networks. This approach offers a new perspective for researching intra-urban spaces. Additionally, identifying more accurate urban spatial boundaries aids in optimizing urban spatial structures and formulating development policies, thereby promoting high-quality and sustainable urban development.
3. Results
3.1. Urban Agglomeration Spatial Identification Based on NTL Data
Typically, the distribution of NTL data within the spatial range of urban agglomerations exhibits certain characteristics. Firstly, high-value areas of NTL data are primarily concentrated in the central areas of urban agglomerations, such as core urban areas, economically developed zones, commercial districts, and transportation hubs. Secondly, as the distance from the center of the urban agglomeration increases, the distribution of NTL data becomes relatively dispersed and diffused. Therefore, NTL data can be used to identify the spatial range of urban agglomerations based on the variations in their luminance characteristics. In automated urban spatial extraction, U-net performs supervised learning by training on annotated urban spatial images. The model learns features related to urban agglomeration spaces, enabling it to segment image pixels into regions with urban agglomeration attributes or non-urban agglomeration attributes. Thus, the U-net-based automated method efficiently and accurately extracts urban agglomeration spaces, aiding researchers in analyzing and understanding the distribution and characteristics of these spaces.
We utilize the NTL data in conjunction with a U-net neural network to delineate the spatial range of the PRD urban agglomeration. The process begins with the establishment of training samples, where the sample labels are derived from the officially announced spatial boundaries of the urban agglomeration in previous years, ensuring a certain level of accuracy in the sample labels. Following this, the training samples are annotated and divided into test, training, and validation sets, culminating in the identification of the urban spatial range of the PRD for the year 2022, as shown in
Figure 4. By calculating the average NTL brightness values in different regions, it is found that the core area’s average brightness is 75.4, while the peripheral area’s is 42.8, with a brightness gradient decreasing by 2.3 units per kilometer, indicating a significant brightness attenuation. Moran’s I index is 0.72, indicating significant spatial autocorrelation in the NTL data, and the Getis–Ord Gi* statistic identifies Guangzhou and Shenzhen as high-brightness hotspot areas. The identified spatial extent of the urban agglomeration covers an area of 8491.26 square kilometers. The average connectivity index between cities within the urban agglomeration is 0.78, and the average shortest path length is 45.6 km, demonstrating an efficient transportation network. The boundary clarity index is 0.45, and the diffusion coefficient in the central region of Huizhou is 1.2, indicating a clear outward diffusion trend in that area.
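As a reference for how a global Moran’s I such as the values reported above can be computed, the following minimal Python sketch evaluates the statistic on a small brightness grid. The rook (four-neighbour) binary weights, the function name `morans_i`, and the toy grids are illustrative assumptions; the spatial weighting scheme used in this study is not specified here.

```python
def morans_i(grid):
    """Global Moran's I on a 2-D grid with rook (4-neighbour) binary weights.

    Assumption: binary contiguity weights; other weighting schemes give other values.
    """
    rows, cols = len(grid), len(grid[0])
    n = rows * cols
    mean = sum(v for row in grid for v in row) / n
    dev = [[v - mean for v in row] for row in grid]
    denom = sum(d * d for row in dev for d in row)  # total squared deviation

    num = 0.0   # sum of w_ij * dev_i * dev_j over neighbouring pairs
    w_sum = 0   # sum of all weights (each directed neighbour pair counts once)
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    num += dev[r][c] * dev[rr][cc]
                    w_sum += 1
    return (n / w_sum) * (num / denom)


# Hypothetical toy grids: bright cells clustered on one side vs. a checkerboard.
clustered = [[10, 10, 0, 0] for _ in range(4)]
checker = [[(r + c) % 2 for c in range(4)] for r in range(4)]
print(morans_i(clustered), morans_i(checker))
```

A clustered pattern yields a positive value and an alternating pattern a negative one, which is the sense in which values such as 0.72 indicate that bright pixels tend to neighbour other bright pixels.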
3.2. Urban Agglomeration Spatial Identification Based on Weibo Sign-In Data
While NTL data reflect the disparities in economic development levels within urban agglomerations, focusing on macro-level spatial identification, Weibo sign-in data offer insights into urban agglomeration spaces from the perspectives of individual behavior and social interaction. Weibo sign-in data can reveal the patterns of population movement within urban agglomerations, aiding in the identification of highly active commercial districts, tourist attractions, transportation hubs, and their distribution within the urban agglomeration. Moreover, Weibo sign-in data can also uncover the interaction relationships between cities within the urban agglomeration. Drawing upon existing research on the fusion of NTL data with big data on urban agglomerations, we employ wavelet transform to fuse NTL data with social media data. The fused data exhibit a smaller spatial coverage compared to NTL data alone and display significant variations across different regions.
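The wavelet-based fusion step can be illustrated with a minimal sketch, assuming a single-level 2-D Haar wavelet and a simple fusion rule (average the approximation band, keep the larger-magnitude detail coefficient). The wavelet basis, decomposition level, fusion rule, and all function names are assumptions for illustration and are not taken from this study; both rasters are assumed to be co-registered, normalized to a common range, and of even dimensions.

```python
def haar2d(img):
    """Single-level 2-D Haar transform of a raster with even height and width."""
    h, w = len(img), len(img[0])
    approx, d_col, d_row, d_diag = (
        [[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4)
    )
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            approx[i // 2][j // 2] = (a + b + c + d) / 2  # local average
            d_col[i // 2][j // 2] = (a - b + c - d) / 2   # difference across columns
            d_row[i // 2][j // 2] = (a + b - c - d) / 2   # difference across rows
            d_diag[i // 2][j // 2] = (a - b - c + d) / 2  # diagonal difference
    return approx, d_col, d_row, d_diag


def ihaar2d(approx, d_col, d_row, d_diag):
    """Inverse of haar2d."""
    h, w = len(approx) * 2, len(approx[0]) * 2
    img = [[0.0] * w for _ in range(h)]
    for i in range(h // 2):
        for j in range(w // 2):
            s, x, y, z = approx[i][j], d_col[i][j], d_row[i][j], d_diag[i][j]
            img[2 * i][2 * j] = (s + x + y + z) / 2
            img[2 * i][2 * j + 1] = (s - x + y - z) / 2
            img[2 * i + 1][2 * j] = (s + x - y - z) / 2
            img[2 * i + 1][2 * j + 1] = (s - x - y + z) / 2
    return img


def fuse(ntl, weibo):
    """Fuse two rasters in the wavelet domain (assumed rule, see lead-in)."""
    a, b = haar2d(ntl), haar2d(weibo)
    # Approximation band: average the two sources.
    approx = [[(p + q) / 2 for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    # Detail bands: keep whichever coefficient has the larger magnitude.
    details = [
        [[p if abs(p) >= abs(q) else q for p, q in zip(ra, rb)]
         for ra, rb in zip(a[k], b[k])]
        for k in (1, 2, 3)
    ]
    return ihaar2d(approx, *details)


# Hypothetical 2x2 rasters: a uniformly bright NTL patch and an empty sign-in patch.
print(fuse([[4.0, 4.0], [4.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]))
```

Under this rule, a location that is bright at night but has no sign-in activity is attenuated in the fused raster, which is consistent with the smaller spatial coverage of the fused data noted above.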
In the study of identifying the spatial extent of the PRD urban agglomeration using U-net neural networks combined with NTL data and Weibo data, training samples are labeled and divided into test, training, and validation sets. The final identification results for the urban agglomeration in 2022 are shown in
Figure 5. The spatial extent identified through data fusion covers an area of 7993.08 square kilometers, 498.18 square kilometers less than the 8491.26 square kilometers identified using only NTL data, a reduction of 5.9%. The average brightness of the core area is 78.6, while that of the peripheral area is 40.3, with a brightness gradient decreasing by 2.6 units per kilometer. The Moran’s I index after data fusion is 0.75, higher than the 0.72 obtained with only NTL data, indicating stronger spatial autocorrelation. The Getis–Ord Gi* statistics show that high-brightness hotspots identified through data fusion are concentrated in Guangzhou, Shenzhen, and Dongguan, while low-brightness cold spots are located in Zhaoqing and Jiangmen. The spatial extent accuracy identified through data fusion is 99.1%, slightly lower than the 99.9% obtained using only NTL data, but the Kappa coefficient is 0.87, slightly higher than the 0.85 of the NTL data alone, indicating a higher consistency. The average connectivity index between cities within the urban agglomeration identified through data fusion is 0.82, higher than the 0.78 from NTL data alone, and the average shortest path length is 43.2 km, shorter than the 45.6 km identified using only NTL data, reflecting a higher connectivity efficiency. The boundary clarity index is 0.42, lower than the 0.45 from NTL data alone, indicating clearer boundaries. The diffusion coefficient for Huizhou and other areas is 1.3, higher than the 1.2 from NTL data alone, indicating more fine-grained brightness points and human activity features. Overall, the spatial extent identified through data fusion is more concentrated in the core area, reflecting a higher population density and more advanced urban infrastructure and services, making the characteristics of city group synergy and integration within the urban agglomeration more evident. 
Taken together, these quantitative and spatial analyses indicate the advantages of the data fusion method for identifying the spatial extent of urban agglomerations.
From the perspective of the spatial extent identified using the two data sources, there are significant differences between the urban agglomeration spaces identified through data fusion and those identified using only NTL data. Firstly, NTL data, due to their unique attribute of nighttime light brightness, determine the influence range of regions solely based on brightness values. In highly developed regions with closely spaced cities like the PRD, this method results in NTL data exhibiting a relatively concentrated spatial pattern, overlooking some finer internal differences within the urban agglomeration. Secondly, Weibo sign-in data reflect population data in different areas within the urban agglomeration but lack deeper socio-economic information, such as the purpose of user activities, satisfaction, or interactions with others, limiting a comprehensive understanding of the socio-economic dynamics of the urban agglomeration. In the spatial extent of the urban agglomeration identified through the fusion of Weibo sign-in data, Weibo data highlight the peripheral areas of cities. In contrast, the NTL–Weibo data fusion retains the characteristics of Weibo data while incorporating NTL data features, resulting in a diminished urban spatial extent near the main built-up areas of major cities within the urban agglomeration and a strengthened urban development cluster between cities. This leads to a more fragmented urban agglomeration space identified through data fusion, reflecting the actual spatial situation of the urban agglomeration.
These differences indicate that the urban agglomeration space identified through data fusion is more detailed and fragmented, reflecting spatial heterogeneity. The data fusion method provides a richer and more detailed spatial extent of urban agglomerations, aiding researchers and decision makers in better understanding the distribution of human activities and the micro-patterns of urbanization within these areas, thereby effectively improving the accuracy of spatial identification. These differences demonstrate the heterogeneity in development levels among different cities within an urban agglomeration, reflecting the complex socio-economic dynamics and patterns of human activities. This enables the data fusion method to more comprehensively depict the spatial structure of urban agglomerations.
3.3. Accuracy Verification and Comparative Analysis
To verify and compare the urban agglomeration spatial results identified before and after data fusion, and to examine the differences between the results of this study and those of previous research, this study selects 5000 random pixel verification points. Visual inspection of high-resolution Google Earth imagery confirms that 1551 of these verification points are located within the urban agglomeration spatial range, while 3449 are outside. The confusion matrix determined using the random pixel verification points is shown in Table 1. The accuracy is the percentage of all verification points that are classified correctly, while the Kappa coefficient measures the agreement between the identification results and the reference labels after correcting for chance agreement. A Kappa coefficient closer to 1 indicates better agreement.
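The two metrics can be sketched in a few lines of Python for a binary confusion matrix. The function name and the cell counts below are hypothetical and chosen only to make the arithmetic easy to follow; they are not the values in Table 1.

```python
def accuracy_and_kappa(tp, fn, fp, tn):
    """Overall accuracy and Cohen's Kappa from a 2x2 confusion matrix.

    tp/fn: reference-inside points classified inside/outside;
    fp/tn: reference-outside points classified inside/outside.
    """
    n = tp + fn + fp + tn
    observed = (tp + tn) / n  # share of points classified correctly
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical counts for illustration only (not taken from Table 1).
acc, kappa = accuracy_and_kappa(tp=40, fn=10, fp=5, tn=45)
print(f"accuracy = {acc:.2f}, kappa = {kappa:.2f}")
```

Because Kappa discounts the agreement expected by chance, it is lower than the raw accuracy whenever the two classes are imbalanced or chance agreement is high, which is why both metrics are reported.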
As indicated in Table 1, the accuracy of urban agglomeration space identification using NTL data is 85.38%, with a Kappa coefficient of 0.6468. After data fusion, the accuracy improves to 92.38%, with a Kappa coefficient of 0.8234. The validation results demonstrate that identifying the spatial range of urban agglomerations using a fusion of NTL data and social media data is 7 percentage points more accurate than using NTL data alone. Furthermore, the Kappa coefficient increases by 0.1766, indicating that data fusion yields more accurate results in identifying the spatial range of urban agglomerations.
A comparison of the spatial results of urban agglomerations identified using the different datasets (Figure 6) shows a noticeable difference between the spatial ranges identified using NTL data and NTL_WB data. Consequently, we select four locations with significant differences for closer comparison and analysis. A detailed comparison reveals that the urban agglomeration space identified using NTL data is larger due to the spillover effect, and areas such as transportation routes, airports, and ports are included within the urban agglomeration space because of their higher light-intensity values. After fusing Weibo sign-in data, however, the identified spatial range is smaller in these areas because of less population interaction; Weibo sign-ins are also less frequent in suburban and urban edge areas, leading to a more fragmented identification of urban agglomeration spaces. Overall, the comparison shows that data fusion identifies a smaller spatial range of urban agglomerations, with this trend being more evident in areas with fewer urban clusters. This indicates that Weibo data characterize urban agglomeration spaces from the perspectives of individual behavior and social interaction rather than through clear boundaries, as economic and land data do. Thus, the results identified through data fusion are better able to distinguish the spatial ranges of urban agglomerations.
4. Discussion
This study, premised on the spatial differences within urban agglomerations, employs a fusion of Weibo data and NTL data, utilizing the U-net neural network to identify the spatial range and distribution characteristics of the PRD urban agglomeration. Furthermore, a comparative analysis is conducted between the urban agglomeration ranges identified after data fusion and those identified using solely NTL data.
The current identification of urban agglomeration spaces primarily utilizes data on land, population, and economy [
46,
47]. Such data, in earlier research, indeed facilitated the delineation of urban agglomeration boundaries and their specific spatial ranges, aiding in understanding the scale, population density, and level of economic activities of different urban agglomerations [
48]. The results of these studies show that the spatial ranges of urban agglomerations are expanding. However, these datasets are typically collected by governmental agencies at fixed time points, so they are not available in real time. Given that urban agglomerations are dynamically changing, traditional data collection methods struggle to monitor these changes as they occur. This makes it increasingly difficult to clearly define the spatial ranges of urban agglomerations with traditional research data and methods [
49]. Thus, identifying urban agglomeration spaces through appropriate methods and approaches is evidently crucial for coordinated development within urban agglomerations and for achieving their sustainable development [
50]. Research on urban spatial analysis using NTL data, including NPP/VIIRS, DMSP/OLS, and Luojia-01 data, is becoming increasingly abundant. DMSP/OLS data, with their longer time series, are suitable for historical studies and trend analysis, although they have a lower spatial resolution. In contrast, Luojia-01 data offer a higher spatial resolution, providing more detailed urban brightness information. This study, employing a fusion of social media data and NTL data for identifying urban agglomeration spaces, achieves an identification accuracy of 92.38%, an improvement of 7 percentage points over the accuracy achieved with NTL data alone. The Kappa coefficient also increases by 0.1766, indicating a substantial gain in the precision of identifying the spatial ranges of urban agglomerations. By fusing other data sources, the U-net neural network model can be trained separately using DMSP and Luojia-01 data to analyze and validate the correctness of this study’s results. A comprehensive comparison of the identification results from different data sources shows that the characteristics identified using DMSP and Luojia-01 data, such as urban agglomeration spatial area, spatial range distribution, brightness distribution, and spatial autocorrelation, differ in detail capture and spatial heterogeneity. Compared with current research results, the data fusion method can more comprehensively depict the spatial structures of urban agglomerations, providing richer and more detailed spatial extents. This helps researchers and decision makers better understand the distribution of human activities and the micro-patterns of urbanization within urban agglomerations, thereby effectively improving the accuracy of spatial identification.
From our analysis of features following the fusion of NTL and WB data, several key observations emerge. Firstly, social media data provide a more comprehensive array of information, including population distribution, activity hotspots, and community interactions, which can enrich the understanding of the spatial differences and characteristics of urban agglomerations [
51]. Secondly, while NTL data reflect urban construction and economic activity levels, social media data unveil details about population distribution and activities. Fusing the two datasets combines these complementary views, thereby yielding more accurate and comprehensive results in the identification of urban agglomeration spaces [
52]. By leveraging these diverse types of data, we can more effectively identify the spatial range of urban agglomerations, circumventing some of the difficulties and errors inherent in traditional methods [
53]. In summary, the method of fusing social media data enables the acquisition of more comprehensive and precise results in identifying urban agglomeration spaces. By combining the unique features and strengths of different datasets, we can better capture the dynamics and detailed spatial information on urban agglomerations. This approach yields the accurate identification of urban agglomeration spaces, providing important references and guidance for urban planning, decision making, and development.
Identifying the spatial range of urban agglomerations is not a new topic, and many studies have analyzed the identification and delineation of the spatial range of different urban agglomerations in China [54]. Building on this earlier work, this study proposes a new way of fusing NTL data with social media data to accurately identify the spatial range of urban agglomerations, demonstrated through a detailed analysis of the PRD urban agglomeration. This introduces a fresh perspective and solution to the study of urban agglomeration spaces, offering a straightforward and widely applicable method that holds significant practical value and prospects for application.