This section introduces a solution for the indoor localization of mobile robots based on depth cameras, mainly covering ORB feature extraction and homogenization, feature matching and the removal of mismatched feature points, and pose estimation.
2.2. ORB Feature Extraction
Feature points are distinctive local features in an image, such as corners, blocks, and edges. Among them, corners and edges are more distinguishable between different images than ordinary pixel blocks and remain more stable under small-scale changes in image perspective, but on their own they do not meet the needs of image feature points in visual SLAM. In order to obtain more stable local image features, scholars in the field of computer vision have done considerable work and designed many image features that remain stable under significant changes in perspective. Classic examples include the Smallest Univalue Segment Assimilating Nucleus (SUSAN), Harris, Features from Accelerated Segment Test (FAST), Shi-Tomasi, Scale-Invariant Feature Transform (SIFT) [21], Speeded-Up Robust Features (SURF) [22], and Oriented FAST and Rotated BRIEF (ORB) [23] detectors. Among them, the Harris and SIFT algorithms, as well as variants of these two methods, are widely used in traditional visual SLAM algorithms. The Harris algorithm adopts differential operations that are insensitive to changes in image brightness, together with the second-order moment matrix of the grayscale. The principle of the SUSAN algorithm is similar to that of Harris, and it performs well in corner and edge detection. During the operation of a visual SLAM algorithm, the camera's viewing distance often changes, and SUSAN and Harris offer weak support for scale invariance. SIFT, the scale-invariant feature transform, uses difference-of-Gaussian functions and feature descriptors to ensure that images perform well under changes in illumination, rotation, and scale. However, the drawback of SIFT, namely its high computational complexity, is also evident, and SIFT is generally used on high-performance computing platforms and in applications where real-time performance is not a concern. SURF is an improved version of SIFT, but without GPU acceleration it still has difficulty meeting the real-time requirements of visual SLAM application scenarios.
In this regard, the ORB feature points used in the visual SLAM system constructed in this paper are a compromise between accuracy and real-time performance: by appropriately reducing the accuracy of image feature extraction, they achieve a significant increase in computational speed and meet the real-time requirements of the visual SLAM algorithm. An ORB feature point consists of an improved FAST corner and a Binary Robust Independent Elementary Features (BRIEF) descriptor. The principle of FAST is shown in Figure 2.
Assuming that there is a pixel $p$ in the image with a brightness of $I_p$, based on engineering experience, we set a brightness threshold $T$ and select the 16 pixels on a circle of radius 3 pixels with $p$ as the center. If there are $N$ consecutive pixels on this circle with grayscale values greater than $I_p + T$ or less than $I_p - T$, pixel $p$ is considered a corner. The overall FAST corner extraction process is shown in Figure 3.
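As an illustration of the segment test, the following Python sketch checks a single candidate pixel on an 8-bit grayscale image stored as a NumPy array. The threshold $T = 20$ and run length $N = 12$ are illustrative values, not parameters from this paper, and a practical system would use an optimized detector such as OpenCV's.

```python
import numpy as np

# Offsets of the 16 pixels on the Bresenham circle of radius 3 around a pixel.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def is_fast_corner(img, x, y, t=20, n=12):
    """Segment test: (x, y) is a corner if n consecutive circle pixels are all
    brighter than I_p + t or all darker than I_p - t."""
    ip = int(img[y, x])
    ring = [int(img[y + dy, x + dx]) for dx, dy in CIRCLE]
    # Duplicate the ring so that runs wrapping around the circle are counted.
    for flags in ([v > ip + t for v in ring] * 2, [v < ip - t for v in ring] * 2):
        run = 0
        for f in flags:
            run = run + 1 if f else 0
            if run >= n:
                return True
    return False
```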
The ORB algorithm uses an image pyramid to downsample the image and the gray-centroid method to handle image rotation. The centroid here refers to the center of mass of an image patch when its gray values are taken as weights. Firstly, the moments of an image patch $B$ are defined as follows:

$$m_{pq} = \sum_{x,y \in B} x^{p} y^{q}\, I(x,y), \quad p, q \in \{0, 1\}. \tag{1}$$
The centroid of this image patch can be found through the following equation:

$$C = \left( \frac{m_{10}}{m_{00}},\ \frac{m_{01}}{m_{00}} \right). \tag{2}$$
We connect the geometric center $O$ of the patch with its centroid $C$ to form the vector $\overrightarrow{OC}$. The direction of a feature point can therefore be defined as:

$$\theta = \arctan\left( \frac{m_{01}}{m_{10}} \right). \tag{3}$$
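The gray-centroid computation of Formulas (1)-(3) can be sketched as follows for a square patch around a keypoint; patch extraction and boundary handling are omitted for brevity.

```python
import numpy as np

def orb_orientation(patch):
    """Gray-centroid orientation of a square patch centered on a keypoint
    (Formulas (1)-(3)), with coordinates relative to the patch center O."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xs = xs - (w - 1) / 2.0          # x relative to the geometric center
    ys = ys - (h - 1) / 2.0          # y relative to the geometric center
    m10 = float(np.sum(xs * patch))  # first-order moment in x
    m01 = float(np.sum(ys * patch))  # first-order moment in y
    return np.arctan2(m01, m10)      # direction of the vector from O to C
```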
After extracting ORB feature points from the image, the next step is feature matching. When the number of feature points is small, brute-force matching can be used directly. However, this method requires considerable time when the number of feature points is large, so a fast approximate nearest-neighbor algorithm (such as FLANN) can be used to improve the matching speed.
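A minimal example of this pipeline using OpenCV's ORB implementation is given below; the image file names are placeholders, and the brute-force Hamming matcher shown here would be swapped for a FLANN-based matcher when the number of feature points is large.

```python
import cv2

# Detect and match ORB features between two frames.
img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)  # placeholder file names
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force Hamming matcher; crossCheck keeps only mutually best matches.
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)
```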
2.3. Feature Point Homogenization
A dense distribution of image feature points in local areas can reduce the accuracy of visual odometry, as using data from only a local area to represent the global area often introduces errors. For this purpose, we adopt an improved ORB feature uniform-distribution algorithm based on a Quadtree. This paper mainly improves the feature point extraction and feature description stages. Firstly, the extraction threshold is calculated from the grayscale values of each pixel's neighborhood rather than being set manually, which improves the algorithm's ability to extract feature points in homogeneous image regions. Secondly, an improved Quadtree structure is adopted to manage and optimize the feature points, improving the distribution uniformity of the detected feature points; to prevent excessive node splitting, a threshold is set in each layer of the pyramid image. In the feature description stage, the image information contained in the grayscale differences is used to obtain more information and enhance the discriminative power of the feature description. The overall process of the algorithm is shown in Figure 4.
The first step is to construct an eight-layer image pyramid. Assuming that the required total number of ORB feature points is $m$, the scale factor is $s$, and the number of feature points in the first (finest) layer is $a$, the per-layer counts form a geometric series, so that

$$a = \frac{m\left(1 - \frac{1}{s}\right)}{1 - \left(\frac{1}{s}\right)^{8}} \tag{4}$$

and the $i$-th layer is allocated $a / s^{\,i-1}$ feature points.
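A small helper illustrating this allocation, under the reconstructed Formula (4) and the illustrative values $m = 1000$ and $s = 1.2$:

```python
def features_per_level(m=1000, s=1.2, levels=8):
    """Allocate m features over pyramid levels as a geometric series
    (Formula (4) as reconstructed above; parameter values are illustrative)."""
    inv = 1.0 / s
    a = m * (1 - inv) / (1 - inv ** levels)   # count at the finest level
    return [round(a * inv ** i) for i in range(levels)]

# e.g. features_per_level() -> roughly [217, 181, 151, 126, 105, 87, 73, 61]
```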
Based on the resolution of the image, each layer of the pyramid is divided into grids in order to make the FAST corner distribution more uniform. The next step is to calculate a feature-point extraction threshold for each grid. Because the ORB algorithm uses a fixed threshold, it does not consider the specific situation of each pixel's local neighborhood; as a result, the feature points it detects are concentrated in areas of the image with significant changes, with a large number of redundant feature points, while in uniform areas of the image, where the grayscale values are similar, the FAST algorithm can extract only a small number of feature points.
In this regard, this paper adopts a dynamic local threshold method. We set a separate threshold $t$ within each grid, calculated as follows:

$$t = \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \left| I_i - \bar{I} \right|. \tag{5}$$

Here, $\alpha$ is a scaling factor, usually set to 1.2 based on engineering experience; $m$ is the number of pixels in the grid; $I_i$ is the grayscale value of the $i$-th pixel in the grid; and $\bar{I}$ is the average grayscale value of all pixels in the grid.
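A direct transcription of this threshold, assuming the grid is an 8-bit grayscale NumPy array and the mean-absolute-deviation form given above:

```python
import numpy as np

def dynamic_threshold(grid, alpha=1.2):
    """Per-grid FAST threshold: scaled mean absolute deviation of the grid's
    grayscale values (a sketch of the reconstructed Formula (5))."""
    grid = grid.astype(np.float32)
    mean = grid.mean()                      # average grayscale value in the grid
    return alpha * np.abs(grid - mean).mean()
```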
After using dynamic thresholding to extract FAST corners, a relatively uniform distribution of feature points can be obtained, but many redundant feature points remain. Therefore, this paper uses a Quadtree structure to further screen the feature points. A Quadtree, also known as a Q-tree, has at most four child nodes per node. Using a Quadtree, a two-dimensional space can be divided into four quadrants, and the information in each quadrant is stored in a child node. The Quadtree structure is shown in Figure 5. If the number of feature points in a subregion is greater than 1, that subregion continues to split, while the child nodes of regions without feature points are deleted, until each node contains a single feature point or the number of nodes reaches the number of feature points required by the system.
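The following simplified Python sketch illustrates this Quadtree screening; keypoints are modeled as $(x, y, \text{response})$ tuples, and the per-layer splitting limits mentioned above are omitted.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    x0: float
    y0: float
    x1: float
    y1: float
    pts: list = field(default_factory=list)   # (x, y, response) tuples

def split(node):
    """Split a node into four quadrants; empty children are discarded."""
    mx, my = (node.x0 + node.x1) / 2, (node.y0 + node.y1) / 2
    kids = [Node(node.x0, node.y0, mx, my), Node(mx, node.y0, node.x1, my),
            Node(node.x0, my, mx, node.y1), Node(mx, my, node.x1, node.y1)]
    for x, y, r in node.pts:
        kids[(x >= mx) + 2 * (y >= my)].pts.append((x, y, r))
    return [k for k in kids if k.pts]

def quadtree_filter(points, w, h, target):
    """Split nodes holding more than one point until every node holds a single
    point or the node count reaches the target, then keep the strongest
    (highest-response) point in each node."""
    nodes = [Node(0, 0, w, h, list(points))]
    while len(nodes) < target and any(len(n.pts) > 1 for n in nodes):
        nodes = [k for n in nodes
                 for k in (split(n) if len(n.pts) > 1 else [n])]
    return [max(n.pts, key=lambda p: p[2]) for n in nodes]
```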
The original ORB feature descriptor only uses the relative magnitudes of grayscale values, without using the difference information between them, resulting in the loss of some image information. In this regard, this paper adopts a feature description method that integrates grayscale difference information, describing feature points with both the grayscale comparison results and the grayscale differences. Suppose there is a feature point $Q$; according to the rules of the BRIEF algorithm, we select $T$ pairs of pixel blocks around $Q$ and compare their gray values to obtain a binary code string $K_1$; we then record the gray difference of each pixel-block pair to generate another binary code string. Assuming $D = \{d_1, d_2, \ldots, d_T\}$ is the dataset recording the grayscale differences,

$$d_i = f(p_i) - f(q_i), \quad i = 1, 2, \ldots, T. \tag{6}$$

Here, $f(\cdot)$ is the average grayscale value of a pixel block, and $(p_i, q_i)$ is the $i$-th pixel-block pair. Then, we calculate the average value $\bar{d}$ of these $T$ differences as the threshold for binary encoding the grayscale differences:

$$\bar{d} = \frac{1}{T} \sum_{i=1}^{T} d_i. \tag{7}$$

We compare each of the $T$ grayscale differences with $\bar{d}$, as shown in Formula (8),

$$\tau(d_i) = \begin{cases} 1, & d_i > \bar{d} \\ 0, & d_i \le \bar{d}, \end{cases} \tag{8}$$

and then obtain a binary string $K_2$, as shown in Formula (9):

$$K_2 = \tau(d_1)\,\tau(d_2)\cdots\tau(d_T). \tag{9}$$

Finally, we combine $K_1$ and $K_2$ to form a new feature descriptor $K$, as shown in Formula (10):

$$K = K_1 \parallel K_2, \tag{10}$$

where $\parallel$ denotes the concatenation of the two binary strings.
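The grayscale-difference encoding of Formulas (7)-(10) can be sketched as follows; `k1` is the ordinary BRIEF bit string and `diffs` holds the $T$ recorded differences.

```python
import numpy as np

def fused_descriptor(k1, diffs):
    """Fuse the standard BRIEF bits K1 with bits derived from the T pixel-block
    grayscale differences: threshold each difference at the mean difference
    (Formulas (7)-(9)) and concatenate (Formula (10))."""
    diffs = np.asarray(diffs, dtype=np.float32)
    d_bar = diffs.mean()                        # Formula (7): mean difference
    k2 = (diffs > d_bar).astype(np.uint8)       # Formulas (8)-(9): string K2
    return np.concatenate([np.asarray(k1, dtype=np.uint8), k2])  # Formula (10)
```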
2.4. Feature Matching
In traditional ORB feature-matching methods, feature vectors are matched by comparing binary strings, without considering the correlation between the components of the feature vectors. Therefore, when there are similar regions in the image, such as two document images containing many identical characters, there will be more feature mismatches. The Hamming distance represents the difference in numerical features between two vectors and can effectively determine whether two vectors are similar, but it cannot describe their directional features. We therefore adopt an ORB feature-matching algorithm that combines the Hamming distance and cosine similarity. Firstly, we use the Hamming distance for coarse matching; then, we use cosine similarity for further filtering; and finally, we use the Random Sample Consensus (RANSAC) algorithm to eliminate the remaining mismatches. The overall framework of the algorithm is shown in Figure 6.
The algorithm first performs a preliminary screening of the ORB feature points waiting to be matched based on the Hamming distance and then calculates the cosine similarity on this basis. Assuming there are two ORB feature description vectors $a$ and $b$, the cosine similarity between them is

$$\cos\theta = \frac{a \cdot b}{\left\| a \right\| \left\| b \right\|}. \tag{11}$$
Here, the larger the cosine similarity value, the closer the directions of the two vectors are.
The next step is to calculate the comparison threshold for the cosine similarity. The specific method is to compute the cosine similarity of all matched feature point pairs and take the value with the highest frequency of occurrence as the cosine-similarity comparison threshold for this feature matching. Considering the existence of errors, a floating margin around this threshold is used, usually set to 0.3 based on engineering experience. The ORB feature points can then be re-filtered based on the cosine-similarity comparison threshold, removing the feature point pairs that fall outside the threshold range.
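A sketch of this second filtering stage is shown below; it treats the binary descriptors as numeric vectors, approximates the most frequent similarity value with a coarse histogram, and applies the 0.3 margin from the text. The `matches` list is assumed to come from the Hamming-distance coarse matching stage above.

```python
import numpy as np

def cosine_filter(des1, des2, matches, margin=0.3):
    """Refine Hamming-based matches by cosine similarity (Formula (11));
    the mode-based threshold is approximated with a histogram."""
    sims = []
    for m in matches:
        a = des1[m.queryIdx].astype(np.float32)
        b = des2[m.trainIdx].astype(np.float32)
        sims.append(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    sims = np.asarray(sims)
    # Mode of the similarity distribution, estimated from a coarse histogram.
    hist, edges = np.histogram(sims, bins=20)
    mode = edges[np.argmax(hist)] + (edges[1] - edges[0]) / 2
    keep = np.abs(sims - mode) <= margin
    return [m for m, k in zip(matches, keep) if k]

# e.g. good_matches = cosine_filter(des1, des2, matches)
```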
Finally, we use the RANSAC algorithm to remove mismatched feature pairs. Although the previous steps ensure that most of the matches are correct, some mismatches may remain. Because the two previous methods are based on local characteristics of the image, mismatched feature pairs must also be removed from the perspective of the entire image. The RANSAC algorithm fits the best model to a dataset, eliminating the values that do not conform to the model, also known as outer points, and retaining the normal values that conform to it, also known as inner points.
The algorithm first randomly selects four pairs of feature points from the dataset and calculates the homography matrix $H$ from them, which serves as the initialization of the optimal model. We then use the homography matrix $H$ to test all of the data: if the error between a point transformed by the homography matrix and its actual correspondence is small enough, the point is considered an inner point. If there are not enough inner points, we select new random points and recalculate the homography matrix model until the inner points are sufficient. The minimum number of iterations $k$ of the RANSAC algorithm should satisfy the following:

$$k \ge \frac{\ln(1 - z)}{\ln\!\left(1 - (1 - \varepsilon)^{m}\right)}. \tag{12}$$

Here, $m$ is the minimum amount of data required to calculate the homography matrix $H$; $z$ is the confidence level, representing the probability that at least one sample of $m$ ORB feature point pairs consists entirely of inner points; and $\varepsilon$ represents the probability of a selected ORB feature point pair being an outer point.
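Using OpenCV's RANSAC-based homography estimation for this final stage might look as follows; `kp1`, `kp2`, and `good_matches` are assumed to come from the earlier stages, and the 3-pixel reprojection threshold is a common default rather than a value from this paper.

```python
import cv2
import numpy as np

# Coordinates of the matches that survived the Hamming and cosine stages.
pts1 = np.float32([kp1[m.queryIdx].pt for m in good_matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good_matches])

# Four point pairs define a homography; RANSAC marks inner points in `mask`.
H, mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 3.0)
inlier_matches = [m for m, ok in zip(good_matches, mask.ravel()) if ok]
```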
2.5. Camera Pose Estimation
After the ORB feature extraction and matching steps described in the previous sections, correctly matched and evenly distributed feature point pairs are obtained, and the camera motion can then be estimated from them. The Iterative Closest Point (ICP) algorithm calculates the camera's pose change from two RGB-D images. Assume that, after the feature-matching step, there is a set of paired 3D points:

$$P = \left\{ p_1, \ldots, p_n \right\}, \quad P' = \left\{ p'_1, \ldots, p'_n \right\}. \tag{13}$$
According to camera rigid-body motion theory, the camera motion can be represented by a rotation matrix $R$ and a translation vector $t$ such that

$$\forall i, \quad p_i = R p'_i + t. \tag{14}$$
The ICP algorithm casts the solution of the camera motion as a least-squares problem. Assume the error between a pair of matched ORB feature points is

$$e_i = p_i - \left( R p'_i + t \right). \tag{15}$$
This error term is assembled into a least-squares problem, and the camera pose is obtained by minimizing the sum of squared errors:

$$\min_{R,\,t} J = \frac{1}{2} \sum_{i=1}^{n} \left\| p_i - \left( R p'_i + t \right) \right\|_2^2. \tag{16}$$
Before solving Formula (16), we define the centroids of the two sets of matched ORB feature points,

$$\bar{p} = \frac{1}{n} \sum_{i=1}^{n} p_i, \quad \bar{p}' = \frac{1}{n} \sum_{i=1}^{n} p'_i, \tag{17}$$

and then simplify the objective through the centroids:

$$\frac{1}{2} \sum_{i=1}^{n} \left( \left\| p_i - \bar{p} - R\left( p'_i - \bar{p}' \right) \right\|^2 + \left\| \bar{p} - R \bar{p}' - t \right\|^2 \right), \tag{18}$$

where the left term inside the sum depends only on the rotation matrix $R$. Once the rotation matrix $R$ is known, setting the right term equal to zero yields the translation vector $t$. ICP can therefore be solved in three steps. The first step is to calculate the de-centroid coordinates of the two sets of matched ORB feature points:

$$q_i = p_i - \bar{p}, \quad q'_i = p'_i - \bar{p}'. \tag{19}$$
The second step is to calculate the rotation matrix $R$ according to Formula (20), and the last step is to obtain the translation vector $t$ according to Formula (21):

$$R^{*} = \arg\min_{R} \frac{1}{2} \sum_{i=1}^{n} \left\| q_i - R q'_i \right\|^2, \tag{20}$$

$$t^{*} = \bar{p} - R \bar{p}'. \tag{21}$$
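Formula (20) has a well-known closed-form solution via the singular value decomposition of $W = \sum_{i=1}^{n} q_i q_i'^{\mathsf T}$: with $W = U \Sigma V^{\mathsf T}$, the optimal rotation is $R = U V^{\mathsf T}$. The following sketch implements the three ICP steps for matched $n \times 3$ point arrays:

```python
import numpy as np

def icp_svd(P1, P2):
    """Closed-form ICP step for matched 3D point sets (n x 3 arrays):
    solve for R, t minimizing sum ||p_i - (R p'_i + t)||^2 via SVD."""
    c1, c2 = P1.mean(axis=0), P2.mean(axis=0)   # centroids, Formula (17)
    Q1, Q2 = P1 - c1, P2 - c2                   # de-centroid coords, Formula (19)
    W = Q1.T @ Q2                               # 3x3 correlation matrix
    U, _, Vt = np.linalg.svd(W)
    R = U @ Vt                                  # rotation, Formula (20)
    if np.linalg.det(R) < 0:                    # guard against reflections
        U[:, -1] *= -1
        R = U @ Vt
    t = c1 - R @ c2                             # translation, Formula (21)
    return R, t
```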