1. Introduction
Airborne LiDAR point clouds, collected by light detection and ranging equipment mounted on aerial platforms, are sets of points that preserve the original geometry of the scene. With the rapid development of computer vision and remote sensing technology, the application of airborne LiDAR point cloud data to urban scenes has attracted growing attention, especially in navigation and positioning, autonomous driving, smart cities, and 3D vision [1]. Point clouds in urban scenes are important information carriers consisting of complex surface features. To understand 3D urban scenes accurately at the point level, the concept of point cloud semantic segmentation was proposed. Semantic segmentation, an important technique for LiDAR point cloud processing, aims to subdivide point clouds into specific point sets with independent attributes, recognize the target type of each set, and assign semantic labels [2]. Semantic segmentation of airborne LiDAR point clouds in urban scenes can quickly extract typical surface information and support the understanding of complex urban scenes, effectively reflecting the spatial layout, development scale, and greening level of a city, which is crucial for urban development planning, smart cities, and geo-databases [3]. Nevertheless, semantic segmentation of point clouds remains a great challenge, since airborne LiDAR point clouds are highly redundant, incomplete, and complex [4,5].
To extract surface features from 3D point clouds, traditional methods usually construct a segmentation model from manually chosen geometric attributes and statistical features, using classifiers such as the support vector machine (SVM) [6], random forest (RF) [7], conditional random field (CRF) [8], and Markov random field (MRF) [9]. However, the selection of statistical features relies mainly on the prior knowledge of operators, which introduces considerable arbitrariness, limits the ability to extract point cloud features, and generalizes poorly. With the growth of computing power and the continuous emergence of 3D scene datasets, deep learning is taking a dominant role in point cloud semantic segmentation.
Deep learning [10] was first applied to semantic segmentation of point clouds through rasterization. Su et al. [11] proposed the Multi-View Convolutional Neural Network (MVCNN), which obtains segmentation results by convolving and aggregating 2D images of point clouds rendered from different viewpoints. Boulch et al. [12] generated pairs of snapshots containing RGB views and depth maps of geometric features, labeled the corresponding pixels of each pair, and then mapped the labeled pixels back onto the original data. Wu et al. [13] extracted features from projected 2D images with a CNN, output a pixel-by-pixel label map, refined it with a conditional random field (CRF) model, and finally obtained instance-level labels through a traditional clustering algorithm. Voxelization of irregular 3D point clouds is another common way to process the original data. Maturana et al. [14] proposed VoxNet, which classifies voxelized point clouds with a supervised 3D convolutional neural network (CNN). Tchapmi et al. [15] generated coarse voxel labels through a 3D fully convolutional neural network on voxelized point clouds and then refined the predictions by combining trilinear interpolation with a fully connected CRF to learn fine-grained structure. Wang et al. [16] performed multi-scale voxelization of point clouds, extracted features, adaptively learned local geometric features, and globally optimized the predicted class probabilities with a CRF that fully accounts for the spatial consistency of point clouds. These multi-view and voxel-based methods solve the structural problems of point clouds and have practical value. However, multi-view methods inevitably lose 3D spatial information during rasterization, while voxel-based methods increase spatial complexity and incur large storage and computation costs.
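The voxel-based methods above all begin by discretizing the cloud onto a regular grid. A minimal sketch of that step, assuming nothing beyond axis-aligned cubic voxels, shows where the storage cost comes from: every occupied cell must be represented, however sparse the cloud is:

```python
import numpy as np

def voxelize(points, voxel_size):
    """Map each 3D point to an integer voxel index; return the set of
    occupied voxels and the per-point voxel assignment."""
    idx = np.floor(points / voxel_size).astype(np.int64)
    occupied, inverse = np.unique(idx, axis=0, return_inverse=True)
    return occupied, inverse

pts = np.array([[0.10, 0.20, 0.00],
                [0.15, 0.22, 0.05],   # falls in the same voxel as the first
                [1.20, 0.10, 0.30]])
vox, assign = voxelize(pts, voxel_size=0.5)
```

Halving `voxel_size` multiplies the potential grid volume by eight, which is exactly the cubic memory growth criticized in the text.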
Therefore, effective frameworks for directly processing point cloud data have been proposed. Qi et al. [17] designed PointNet, which encodes each point with a multilayer perceptron (MLP) and obtains global features through an aggregation function. Nevertheless, it ignores the concept of local space and lacks extraction and utilization of local features. Qi et al. [18] proposed an improved version of PointNet, denoted PointNet++. It introduces density-adaptive layers, learns point-set features at different scales through hierarchical sampling and grouping, and captures local detail. However, PointNet++ still processes each point independently, without considering connections among neighboring points, and its K-nearest-neighbor search suffers from directional bias. Jiang et al. [19] designed a scale-aware descriptor for ordered encoding of information from different directions and effective capture of local point cloud information. Based on a KNN local neighborhood graph, Wang et al. [20] used the EdgeConv module to capture local geometric features of point clouds and learn features by making full use of point neighborhood information. Building on the local neighborhood processing of PointNet++, Zhao et al. [21] added an adaptive feature adjustment module to transform and aggregate contextual information, then fused information across channels through an MLP and max pooling, strengthening the ability of features to describe the local neighborhood. The feature extraction layer consists of Sampling and Grouping (SG) and a CNN Block. The SG layer first samples the input point cloud uniformly and uses the sampled points as centroids. The input point cloud is then divided into point sets of different scales according to the number of points searched within different radii. The numbers of sampling centroids at the three layers are
N/4, N/16, and N/64, respectively. The numbers of points searched at the different scales are denoted S1 and S2. Finally, a multilayer perceptron (MLP) is used in the CNN Block to extract features of the point set in each local neighborhood. The output channel parameters of the MLP and the output features of each block are shown in Figure 1a. Unlike the complicated structure of the PointNet++ feature extraction layer, the proposed SMAnet model applies three-layer feature extraction, balancing the computational efficiency and segmentation accuracy of the model.
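A minimal sketch can clarify the sampling-and-grouping step. Here farthest point sampling stands in for the sampling strategy, and the radius and point count are illustrative values; the exact sampling scheme, radii, and the S1/S2 counts are model details not reproduced here:

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Iteratively pick the point farthest from the centroids chosen so far,
    giving centroids that cover the cloud evenly."""
    chosen = [0]
    d = np.linalg.norm(points - points[0], axis=1)
    for _ in range(n_samples - 1):
        nxt = int(np.argmax(d))           # farthest from all chosen centroids
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

def ball_query(points, centroid, radius, max_points):
    """Indices of up to max_points neighbors within `radius` of a centroid,
    forming one local point set for the CNN Block."""
    dist = np.linalg.norm(points - centroid, axis=1)
    return np.nonzero(dist < radius)[0][:max_points]

rng = np.random.default_rng(1)
pts = rng.random((64, 3))
centroids = farthest_point_sampling(pts, 64 // 4)     # N/4 centroids
group = ball_query(pts, pts[centroids[0]], radius=0.3, max_points=16)
```

Running the same two steps on the centroids themselves, with N/16 and then N/64 samples, yields the three-layer hierarchy described above.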
(3) To address the insufficient interaction information among points in PointNet++, a fusion attention layer was designed after the feature extraction layer. High-dimensional feature information is strengthened by integrating SAM and MAM; the basic principle is shown in Figure 1b,c. The color intensity of the segment between two points represents the strength of their relation, and associations of multiple aspects are expressed by combinations of different colors. Given a set of points with one designated as the central point, the SAM module adds a connection between each point and the central point through global features. In other words, a thrust is applied in the point cloud feature space to push surrounding points with deviating features toward the central point, establishing relationships between the surrounding points and the central point. Based on the diversity principle of point clouds, the MAM module explores deep associations among point cloud features by exploiting correlations among different subspace features. Essentially, it applies several different forces to the central point to establish multiple aspects of relations with the surrounding points, simulating point cloud associations from different perspectives. The fusion attention layer thus establishes associations among points from two aspects, improving the semantic segmentation accuracy of point clouds.
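The precise formulations of SAM and MAM are not given in this section. As a generic illustration of the two underlying mechanisms, the sketch below implements plain scaled dot-product self-attention over per-point features and a multi-head variant that attends in separate channel subspaces; it is a simplification under those assumptions, not the proposed modules:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(feats):
    """Scaled dot-product self-attention: every point attends to every
    other point, pulling related points together in feature space
    (the kind of global interaction SAM provides)."""
    d = feats.shape[-1]
    scores = feats @ feats.T / np.sqrt(d)    # pairwise affinities
    return softmax(scores, axis=-1) @ feats  # convex mixture of features

def multi_head_attention(feats, n_heads):
    """Split channels into subspaces, attend in each independently, and
    concatenate the results (the multi-aspect association MAM models)."""
    heads = np.split(feats, n_heads, axis=-1)
    return np.concatenate([self_attention(h) for h in heads], axis=-1)

rng = np.random.default_rng(2)
f = rng.random((8, 4))                  # 8 points, 4 feature channels
out = multi_head_attention(f, n_heads=2)
```

Because each output row is a softmax-weighted mixture of input rows, every attended feature stays within the range of the original features while absorbing information from its neighbors.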
(4) For high-dimensional attentive features, max pooling loses many important features and cannot extract global information effectively. Hence, a new aggregation function, SSP, was designed as the pooling layer: point cloud features carrying complicated information are aggregated selectively according to probabilities smoothed by the SoftMax function, extracting global features while filtering redundant information.
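The exact form of SSP is defined elsewhere in the paper; one plausible reading of "aggregated selectively according to probability after SoftMax smoothing", sketched under that assumption, is a SoftMax-weighted pooling in which near-maximal features still contribute instead of being discarded as in hard max pooling:

```python
import numpy as np

def softmax_pool(feats, axis=0):
    """SoftMax-smoothed selective pooling: aggregate all points with
    per-channel softmax weights, so features close to the maximum still
    contribute, unlike hard max pooling which keeps a single point."""
    w = np.exp(feats - feats.max(axis=axis, keepdims=True))
    w = w / w.sum(axis=axis, keepdims=True)   # per-channel probabilities
    return (w * feats).sum(axis=axis)

f = np.array([[1.0, 0.0],
              [0.9, 5.0],     # near-maximal in channel 0, maximal in 1
              [0.1, 4.9]])
g_soft = softmax_pool(f)      # below the channel maxima, but informed by all rows
g_max = f.max(axis=0)         # hard max pooling baseline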
(5) The feature propagation upsampling layer, like the feature extraction layer, contains three layers. Features of all input points are recovered through skip connections between the learned features and the features from the corresponding feature extraction layer. Finally, pointwise classification is carried out on these features, yielding the semantic segmentation results.
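The upsampling step can be sketched as PointNet++-style feature propagation, i.e. inverse-distance interpolation from the centroids back to the dense cloud, followed by the skip connection; the concrete interpolation scheme used by the model is an assumption here:

```python
import numpy as np

def propagate_features(sparse_xyz, sparse_feat, dense_xyz, k=3, eps=1e-8):
    """Interpolate features from a sparse (downsampled) point set back to a
    dense one using inverse-distance weights over the k nearest sparse
    points, as in PointNet++ feature propagation."""
    up = np.empty((len(dense_xyz), sparse_feat.shape[1]))
    for i, p in enumerate(dense_xyz):
        d = np.linalg.norm(sparse_xyz - p, axis=1)
        nn = np.argsort(d)[:k]                    # k nearest sparse points
        w = 1.0 / (d[nn] + eps)                   # inverse-distance weights
        up[i] = (w[:, None] * sparse_feat[nn]).sum(0) / w.sum()
    return up

rng = np.random.default_rng(3)
dense = rng.random((16, 3))
sparse = dense[:4]                    # pretend these were the sampled centroids
sfeat = rng.random((4, 8))            # features learned at the sparse level
upsampled = propagate_features(sparse, sfeat, dense)

# Skip connection: concatenate with the features saved from the matching
# feature extraction layer before pointwise classification.
skip = rng.random((16, 8))
fused = np.concatenate([upsampled, skip], axis=1)
```

A point that coincides with a centroid recovers (almost exactly) that centroid's features, since its inverse-distance weight dominates.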