4.1. Supervised Stereo Models
Owing to recent advances in Convolutional Neural Networks (CNNs), stereo depth estimation has been formulated as a supervised learning problem, and such deep learning approaches have been shown to outperform traditional techniques. However, CNNs have long struggled with one major shortcoming: extracting context features and other information from ill-posed regions. The generation of fine-quality disparity maps for such ill-posed regions is the central problem addressed in “Cascade Residual Learning: A Two-stage Convolutional Neural Network for Stereo Matching” [38]. As its contribution, the end-to-end trained CRL (Cascade Residual Learning) scheme joins the pipeline from matching-cost evaluation to disparity refinement using non-linear layers, demonstrating that residual learning provides better refinement than direct disparity learning at a given stage. The CRL network is a two-stage cascaded CNN architecture: in the first stage, DispFulNet, a variant of DispNet [33], is equipped with additional up-convolution modules to produce highly detailed disparity images. The second stage, DispResNet, corrects this disparity by working with the first stage to produce residual signals across numerous scales. The sum of the two-stage outputs yields the final disparity.
Figure 4 depicts the architecture of the CRL network. The first stage, DispFulNet, takes the left and right stereo images ($I_L$ and $I_R$, respectively) as inputs and generates the initial disparity $d_1$ with respect to the left image. The right image is then warped according to this disparity to obtain a synthesized left image $\tilde{I}_L$, with $\tilde{I}_L(x, y) = I_R\left(x - d_1(x, y), \, y\right)$.
The second stage, DispResNet, thus takes as input $I_L$, $I_R$, $d_1$, $\tilde{I}_L$, and the error $e_L$. The error is depicted as follows:
$$e_L = \left| I_L - \tilde{I}_L \right|.$$
With the first stage providing the initial disparity, the second stage produces the corresponding residual signal $r$; the new disparity is given by $d_2 = d_1 + r$. The residual signals are produced across multiple scales $s$, ranging from $0$ to $S$ ($s = 0$ denotes the full-resolution scale). The final disparity at scale $s$, after down-sampling the initial disparity, is given by
$$d_2^{(s)} = d_1^{(s)} + r^{(s)},$$
where $d_1^{(s)}$ and $r^{(s)}$ denote the initial disparity and the residual signal corresponding to it, respectively, at scale $s$.
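A minimal PyTorch sketch of this two-stage residual scheme follows; `disp_ful_net` and `disp_res_net` are hypothetical stand-ins for the paper's networks, and the warping helper is a generic differentiable warp rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(right, disp):
    """Warp the right image towards the left view using a left-view disparity map.

    right: (B, C, H, W) right image; disp: (B, 1, H, W) positive left disparity.
    """
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.to(right).expand(b, -1, -1) - disp.squeeze(1)  # shift by disparity
    ys = ys.to(right).expand(b, -1, -1)
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack((2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1), dim=-1)
    return F.grid_sample(right, grid, align_corners=True)

def crl_forward(disp_ful_net, disp_res_net, left, right):
    d1 = disp_ful_net(left, right)              # stage 1: initial disparity
    left_synth = warp_right_to_left(right, d1)  # synthesized left image
    err = torch.abs(left - left_synth)          # reconstruction error e_L
    residual = disp_res_net(left, right, d1, left_synth, err)  # stage 2
    return d1 + residual                        # final disparity d_2 = d_1 + r
```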
The full-resolution disparity images output by DispFulNet, together with the intermediate ones, are supervised against the ground truth using an L1 loss. For the supervised training of DispResNet, an L1 loss is again evaluated between the estimated and ground-truth disparities at every scale. The CRL model is implemented in the Caffe framework [39] and trained mainly on FlyingThings3D and KITTI 2015; the parameters of [33] are used when training the first and second stages on the FlyingThings3D dataset. However, some model architectures depend upon patch-based Siamese networks, which share similar shortcomings in finding correspondences in certain regions. To address these issues, “Pyramid Stereo Matching Network” [40] targets the extraction and use of context information to find correspondences in ill-posed regions. The end-to-end PSMNet (Pyramid Stereo Matching Network) approaches the problem differently and without any post-processing. The network contains two modules: spatial pyramid pooling and a 3D CNN. The former gathers global context information at various scales and positions, incorporates it into the image features, and forms a cost volume. The stacked-hourglass 3D CNN then learns to regularize the cost volume using multiple stacked hourglass networks with supervision at intermediate stages.
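A minimal sketch of a spatial-pyramid-pooling block in this spirit is shown below; the 64/32/16/8 average-pooling scales follow PSMNet, while the channel counts are assumptions, and the input feature map is assumed to be at least 64×64.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPPBlock(nn.Module):
    def __init__(self, in_ch=128, branch_ch=32, pool_sizes=(64, 32, 16, 8)):
        super().__init__()
        # One pooling branch per scale: pool, 1x1 conv, BN, ReLU.
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AvgPool2d(s, stride=s),
                          nn.Conv2d(in_ch, branch_ch, 1, bias=False),
                          nn.BatchNorm2d(branch_ch),
                          nn.ReLU(inplace=True))
            for s in pool_sizes)

    def forward(self, x):
        h, w = x.shape[2:]
        # Pool at several scales, upsample back, and concatenate with the input.
        feats = [x] + [F.interpolate(b(x), size=(h, w), mode="bilinear",
                                     align_corners=False)
                       for b in self.branches]
        return torch.cat(feats, dim=1)
```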
Because disparity regression is used for continuous disparity map estimation, the suggested network is trained with the smooth L1 loss; this loss is widely used in bounding-box regression for object detection owing to its robustness and low sensitivity to outliers. The model is trained on the Scene Flow dataset with dense ground-truth disparity maps. The architecture is implemented in PyTorch, the Adam optimizer is used to train the network end-to-end, and four NVIDIA Titan-Xp GPUs were used for the training phase. The depth maps produced by the various methods were satisfactory in the sense that they could serve their purpose of estimating the distance of objects from a reference point, generally the camera, with appreciable accuracy. However, to broaden the scope of application, edge detection has been incorporated into the depth estimation model of “StereoNet: Guided Hierarchical Refinement for Real-Time Edge-Aware Depth Prediction” [41]. This is an end-to-end deep network that produces fine-quality, quantization-free, edge-aware disparity maps. The network obtains sub-pixel matching accuracy that is relatively higher than that of traditional approaches, which allows it to attain the precision of traditional stereopsis with a low-resolution cost volume. The work also shows that previous approaches were over-parameterized and, accordingly, reduces the run-time of the system appreciably. It further contributes a hierarchical depth-refinement layer that performs fine up-sampling with edge preservation. A Siamese network is utilized to extract features from both the left and right images.
A hierarchical depth-refinement layer is implemented, which performs high-quality, edge-preserving up-sampling. The network is trained in a fully supervised manner with ground-truth-labeled stereo data by minimizing a hierarchical loss function that takes the form of a smoothed L1 loss. The network is initially trained on the Scene Flow dataset, is implemented and trained in TensorFlow, and the experiments are optimized using RMSProp (Root Mean Square Propagation) [42]. It runs at 60 fps on a high-end NVIDIA Titan X GPU. Another edge-preserving end-to-end deep network, EdgeStereo from “EdgeStereo: A Context Integrated Residual Pyramid Network for Stereo Matching” [43], is capable of multi-tasking and comprises a basic disparity network and a sub-network that preserves edges. Edge cues and an edge-aware loss guide the disparity learning. As contributions, the model establishes a context pyramid and a residual pyramid, which help deal with ill-posed areas and replace cascaded refinement structures. The model predicts both a disparity map and an edge map from a stereo image pair.
To include multi-scale context information in the disparity branch, a context pyramid is established first, after which a compact residual pyramid is developed for cascaded refinement. To preserve subtle edge details, the model embeds boundary features and regularizes with an edge-aware smoothness loss; edge detection thus helps enhance stereopsis. Training is split into three phases to allow multi-task learning. In the initial phase, the edge-detection sub-network is trained on a dataset guided by a class-balanced cross-entropy loss, as demonstrated in [44] (a sketch of this loss is given after this paragraph). In the second phase, deep supervision is applied to the regressed disparities across various scales on a stereo-vision dataset. The total loss is the sum of the losses at the concerned scales; along with the smoothness loss for the disparity, the regression loss for supervised learning is also included. In the final phase, every layer of the model is optimized on the single dataset used in the previous phase, again with deep supervision across scales. The only constraint in this phase is that the edge-aware smoothness loss is not used, since the edge contours produced in the previous phase are more stable than those available in the final phase. The model is trained on the Scene Flow dataset with dense ground-truth depth maps. To pretrain the edge-detection sub-network, the BSDS500 [45] dataset is considered; following [44,46], the BSDS500 training data are combined with the PASCAL VOC Context dataset [47]. The model is implemented in Caffe [39] and the Adam optimizer [48] is used to optimize it.
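A sketch of the class-balanced binary cross-entropy of [44] (HED-style), in which the rare edge pixels are re-weighted by the class frequencies; the exact weighting details in EdgeStereo may differ.

```python
import torch

def class_balanced_bce(logits, labels, eps=1e-6):
    """logits: (B, 1, H, W) raw edge scores; labels: (B, 1, H, W) in {0, 1}."""
    pos = labels.sum()
    neg = labels.numel() - pos
    beta = neg / (pos + neg + eps)          # weight for the (rare) edge class
    weights = torch.where(labels > 0.5, beta, 1.0 - beta)
    return torch.nn.functional.binary_cross_entropy_with_logits(
        logits, labels.float(), weight=weights)
```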
Such approaches to generating a final depth map are well suited to autonomous driving. However, some applications require these maps to be generated quickly, even at the expense of precision; drones that must fly at high speed at times are one example. For such domains, researchers have explored producing depth maps quickly at low accuracy and then gradually enhancing their resolution in stages. The coarse maps are generally used for obstacle avoidance, while the final high-resolution maps can be used for trajectory generation for flying platforms. With this concept in mind, the network AnyNet was proposed in “Anytime Stereo Image Depth Estimation on Mobile Devices” [49] to predict disparity in the anytime setting. The method trades computation time against precision whenever needed during inference: depth estimation takes place in stages, and the model returns its current best output whenever queried (see the sketch after this paragraph). Far fewer parameters are needed than in many recent approaches to produce disparity maps with appreciable precision, which makes such models usable on embedded devices with constrained resources.
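Illustratively (this is not the authors' code), the anytime behaviour reduces to a staged refinement loop with a deadline; `stages`, `budget_s`, and the stage callables are hypothetical.

```python
import time

def anytime_disparity(stages, left, right, budget_s):
    """stages: list of callables; each returns a refined disparity map."""
    start, disp = time.monotonic(), None
    for stage in stages:
        disp = stage(left, right, disp)          # coarse-to-fine refinement
        if time.monotonic() - start > budget_s:  # deadline reached
            break
    return disp                                  # current best estimate
```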
Finally, a spatial propagation model (SPNet) further enhances the quality of the disparity map. The full network, termed the Anytime Stereo Network (AnyNet), is trained end-to-end with a joint loss over each scale; it is implemented in PyTorch and trained with the Adam optimizer [48]. The model is pretrained on the Scene Flow dataset before being fine-tuned for KITTI, and one GTX 1080Ti GPU was used during training. Another line of work in stereo depth estimation deals with stereopsis on high-resolution images, which are not easy to come by. This is elucidated in “Hierarchical Deep Stereo Matching on High-resolution Images” [50]. To address real-time stereo matching on high-resolution images under limited processing speed and memory constraints, an end-to-end framework has been developed that searches for correspondences progressively along a coarse-to-fine hierarchy. Since datasets of high-resolution stereo images are difficult to obtain, two fine-resolution datasets were collected, and a dataset containing such high-resolution stereo pairs was introduced for training and evaluation. The design permits disparity to be generated on demand: an anytime setting is possible using the intermediate disparity estimates for close-range objects, with a latency as low as about 30 ms. In addition, a set of augmentation strategies was developed to enhance the robustness of the model to various factors.
During training, the network learns to predict at various scales, which enables the anytime setting at any pyramid level and regularizes the overall network; at each level, the losses are scaled for the progressive disparity resolution. The HSM network is implemented in PyTorch, and training is performed with the Adam optimizer on four Titan X Pascal GPUs. HR-VS, a synthetic dataset, was developed and used to train high-resolution stereo models. During training, Middlebury, ETH3D [30,33,51], KITTI 2015, and HR-VS are augmented to the size of Scene Flow, yielding about 170k training samples. For stereo-matching techniques based on CNN-style deep learning, cost volumes help obtain significant precision in correspondence matching. MCV-MFC (Multi-Level Cost Volume and Multi-Scale Feature Constancy) in “Stereo Matching Using Multi-level Cost Volume and Multi-scale Feature Constancy” [52] is such an end-to-end trainable CNN; it focuses on exploiting cost volumes to their fullest extent for precise stereo depth estimation. It consists of three sub-modules: shared feature extraction, initial disparity estimation, and, finally, disparity refinement. Both accuracy and computational efficiency are enhanced by fusing the disparity estimation and refinement tasks into a single network. A multi-level cost volume is introduced, for which feature discriminability is improved by considering information from various factors, and model robustness is improved by a two-stage fine-tuning scheme.
Multi-scale feature constancy is used to measure the correctness of the initial disparity in the feature domain, improving the efficiency of disparity refinement. The tight coupling of the sub-modules makes them compact and easy to train, and the model generalizes with considerable performance across various datasets; for this, a two-stage fine-tuning strategy is demonstrated for transferring the model to the desired datasets. The L1 loss is used, evaluating the average absolute difference between the predicted and ground-truth disparities; this average absolute difference is itself termed the end-point error (EPE). The stereo images and the respective ground-truth disparity maps are used to train the network on the Scene Flow dataset, although training is also carried out on other datasets. The model was implemented in Caffe [39] and optimized with the Adam solver [48].
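As defined here, the EPE reduces to a masked mean absolute difference; a minimal sketch (the validity mask accounts for pixels without ground-truth labels, as in sparsely annotated datasets):

```python
import torch

def end_point_error(pred, gt, valid_mask):
    """pred, gt: (B, 1, H, W) disparities; valid_mask: (B, 1, H, W) boolean."""
    abs_diff = (pred - gt).abs()[valid_mask]  # keep only labeled pixels
    return abs_diff.mean()
```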
Regarding multi-view stereo, the learning-based methods have proven promising. They nevertheless fail to consider the difference in visibility among the various views, which leads to indiscriminate constructions for multi-view similarity prediction and limits their ability to work on datasets with strong viewpoint variations. PVSNet in “PVSNet: Pixelwise Visibility-Aware Multi-View Stereo Network” [53] has been suggested for dense and robust 3D reconstruction. It is a pixelwise visibility network that learns visibility information for the various neighboring images before evaluating multi-view similarity: two-dimensional visibility maps are regressed from two-view cost volumes. Because the visibility maps capture undesirable image characteristics, the regression allows the accepted views to carry more weight in the eventual cost-volume representation. In addition, an anti-noise training strategy has been demonstrated, which introduces disturbing views during the training phase to increase the generalizability of the network to various unrelated views.
The L1 loss function is used for training the network, with two training losses included: one for low-resolution prediction and one for high-resolution prediction. As suggested by [54,55], the DTU dataset [56] is split into training, validation, and evaluation sets. The network is trained on the training set, and Poisson surface reconstruction [57] is used to produce ground-truth depth maps at certain resolutions. The network is implemented in PyTorch [58] and trained with the RMSprop optimizer on a couple of NVIDIA GTX 1080Ti GPUs. Unlike the neural stereo depth estimation techniques mentioned above that deal with a full cost volume and depend upon 3D convolutions, the HITNet model in “HITNet: Hierarchical Iterative Tile Refinement Network for Real-time Stereo Matching” [59] relies on a fast multi-resolution initialization step, a differentiable 2D geometric propagation, and warping mechanisms; hence, this model does not build an explicit cost volume. As its contributions, it comprises a fast multi-resolution initialization step that uses learned features to obtain high-resolution matches, and a 2D disparity propagation stage that efficiently exploits the concept of slanted support windows with learned descriptors. It thus delivers with relatively little computation. The network is trained end-to-end with ground-truth disparities using several losses, namely an initialization loss, a propagation loss, a surface slant loss, and a loss supervising the confidence.
The architecture of the model is represented in Figure 5. The feature extraction module is based on a small U-Net [60], whose features capture image details at multiple scales. After feature extraction, disparity maps are initialized at various resolutions as fronto-parallel tiles: a matching component evaluates several hypotheses and selects the one with the lowest feature distance between the left and right views. This output is then passed to the propagation stage, which refines the predicted disparity in a hierarchically iterative fashion using the concept of slanted support windows.
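Conceptually, this initialization is a winner-take-all search over candidate disparities. Below is a minimal PyTorch sketch under simplifying assumptions (dense per-pixel features rather than HITNet's tile features, and no learned tile descriptors); the formal notation follows.

```python
import torch

def init_disparity(feat_left, feat_right, max_disp):
    """feat_left, feat_right: (B, C, H, W) feature maps at one resolution."""
    b, c, h, w = feat_left.shape
    costs = []
    for d in range(max_disp):
        shifted = torch.zeros_like(feat_right)
        shifted[..., d:] = feat_right[..., : w - d]           # shift right by d
        costs.append((feat_left - shifted).abs().sum(dim=1))  # L1 distance
    cost_volume = torch.stack(costs, dim=1)                   # (B, D, H, W)
    return cost_volume.argmin(dim=1)                          # d_init per pixel
```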
The target of the initialization stage is to produce an initial disparity estimate $d^{\mathrm{init}}$ and a learnable tile-wise feature vector $\mathbf{p}$ at the various resolutions; the output of this stage is a set of disparity hypotheses of the form $h = (d, \mathbf{p})$. Basically, the feature maps are converted into tile-wise features $e^l$. The matching cost $c$ at location $(x, y)$ and resolution $l$ with disparity $d$ is evaluated as
$$c(x, y, d) = \left\lVert e^l_L(x, y) - e^l_R(x - d, y) \right\rVert_1 .$$
Subsequently, the initial disparity is evaluated; thus,
$$d^{\mathrm{init}}(x, y) = \arg\min_{d \in [0, D)} c(x, y, d)$$
for each location, where $D$ is the maximum considered disparity. Another approach, RAFT-Stereo in “RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching” [61], similarly avoids the high computational memory consumption of 3D convolutions. It can be applied directly to megapixel images with no resizing or patch-wise processing. The main contribution of this work is its different take on stereopsis, fusing stereo and optical-flow techniques; it exhibits relatively better cross-dataset generalization, which leads to its high precision. It utilizes solely 2D convolutions and a lightweight cost volume. It is a deep learning architecture for rectified stereopsis inspired by the optical-flow network RAFT [62]. Multi-level convolutional GRUs are implemented, which help information propagate efficiently across the image. A merit of the network is that it up-samples the predicted disparity only at the final stage, which makes the network highly memory-efficient and, in turn, enables full-resolution stereo prediction on megapixel images.
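As an illustration of how such a lightweight, 2D-convolution-only cost volume can be formed, the sketch below computes an all-pairs correlation along each rectified image row with a single batched matrix product. This mirrors the spirit of RAFT-style correlation volumes; the exact normalization and pyramid construction in [61] may differ.

```python
import torch

def allpairs_row_correlation(f_left, f_right):
    """f_left, f_right: (B, C, H, W); returns (B, H, W, W) correlations, where
    entry (h, x_l, x_r) correlates left pixel x_l with right pixel x_r in row h."""
    b, c, h, w = f_left.shape
    fl = f_left.permute(0, 2, 3, 1).reshape(b * h, w, c)   # (B*H, W, C)
    fr = f_right.permute(0, 2, 1, 3).reshape(b * h, c, w)  # (B*H, C, W)
    corr = torch.bmm(fl, fr) / c ** 0.5                    # scaled dot products
    return corr.view(b, h, w, w)
```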
The network is pretrained on the Scene Flow dataset [33]. It is implemented in PyTorch [63] and trained on a couple of RTX 6000 GPUs using the AdamW [64] optimizer. All experiments except the ablations are trained with data augmentation. To enhance zero-shot generalization, additional versions of the network are trained with extra synthetic data from datasets such as Tartan Air, Falling Things, and Sintel-Stereo. Among the demerits of CNNs, yet another disadvantage, the appreciable computation time that slows down inference, has been addressed in “CRAR: Accelerating Stereo Matching with Cascaded Residual Regression and Adaptive Refinement” [65]. Although some previous works addressed this issue by downsampling the input images to reduce their spatial size, doing so increases the error rate. CRAR accelerates stereo-matching algorithms by improving the structure of the network, decreasing the number of features, which is directly proportional to the computational cost. Cost aggregation, owing to the extra dimension introduced by 4D feature manipulation, is the most time-consuming part of some stereo-matching methods; based on the concept of network compression, decomposition and sparsification are therefore carried out to squeeze the network, since it is computationally expensive. The various refinement methods used earlier are combined into a consolidated algorithm that exploits parallelism on the target devices to accelerate inference even further. The smooth L1 loss function is used for training. Experiments are carried out in PyTorch, and the network is optimized with the Adam SGD optimization procedure. Following the works of [40,66,67], the network is pretrained on the Scene Flow dataset.
Some research works have focused on developing the cost volume itself alongside a well-performing depth estimation model, on the premise that a compact but informative cost volume is necessary for precise and efficient stereo matching. In “Attention Concatenation Volume for Accurate and Efficient Stereo Matching” [68], a cost-volume construction method is demonstrated that uses correlation cues to obtain attention weights that emphasize matching-relevant information. Adaptive, multi-level patch matching is used to make the weights highly reliable. These operations act on the concatenation volume and help accentuate the distinctiveness of the cost at different disparities, including in ill-posed regions. The suggested cost volume is termed the Attention Concatenation Volume (ACV) and can be incorporated into many stereo methods. The core contribution of the paper is an informative and efficient cost-volume representation that uses the similarity information stored in the correlation component to regularize the concatenation volume, so that only a lightweight aggregation network is needed to attain significant efficiency and precision.
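A heavily simplified sketch of the idea (not the authors' implementation): correlation-derived attention weights re-weight a concatenation volume. The use of a sigmoid here is an assumption for illustration; in practice the weights would be produced by learned layers over the correlation volume.

```python
import torch

def attention_concat_volume(corr_volume, concat_volume):
    """corr_volume: (B, 1, D, H, W) correlation cues;
    concat_volume: (B, C, D, H, W) concatenated left/right features."""
    attention = torch.sigmoid(corr_volume)  # matching-related attention weights
    return attention * concat_volume        # broadcast over the channel axis
```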
The smoothness loss is used as the final loss for the network. The model is implemented in PyTorch, training is conducted on NVIDIA RTX 3090 GPUs, and the Adam optimizer is used for every experiment. Another project worked on improving the robustness of an existing stereopsis method; robustness is unavoidable, since depth estimation methods are eventually applied to real-time cases. Unlike the usual lengthy procedures, the following paper asserts that gathering numerous datasets for training is an easy way of enhancing the generalization ability of such algorithms. The paper “An Improved RaftStereo Trained with A Mixed Dataset for the Robust Vision Challenge 2022” [69] provides an improved version of RaftStereo [61], termed iRaftStereo_RVC, trained with an eclectic mixture of seven datasets for the Robust Vision Challenge: Sceneflow, CreStereo, Tartan Air, Falling Things, Sintel-Stereo, HR-VS, and InStereo2K. The amalgamation of these datasets is used to pre-train RaftStereo before fine-tuning is carried out. The experiments are carried out with the open-source RaftStereo code implemented in PyTorch, and a couple of RTX 2080Ti GPUs are used for training. However, when the cross-domain generalization capability of such an algorithm is enhanced, its stereo-matching performance is likely to degrade. To deal with this trade-off between stereo-matching performance and cross-domain generalization, “PCW-Net: Pyramid Combination and Warping Cost Volume for Stereo Matching” [70] suggests PCW-Net (Pyramid Combination and Warping Network), built on combination and warping cost volumes, to achieve significant results on both fronts across varied benchmarks.
The paper contributes an effective framework that attains significant generalizability from synthetic to real datasets and, after fine-tuning, performs appreciably on target datasets. A multi-scale cost-volume fusion module is developed to act over multi-scale receptive fields and extract domain-independent structural cues, leading to better stereopsis across varied image resolutions. An efficient, volume-based warping disparity refinement module is introduced, which eventually helps find the appropriate residue in an uncontrolled environment. The architecture of the model is depicted in Figure 6. It consists of three components: multi-scale feature extraction, cost aggregation based on the multi-scale combination volume, and disparity refinement based on the warping volume. The extracted features are used to build a pyramid volume; combination volumes are then constructed on the top pyramid levels, and a cost-volume fusion module fuses them to initialize the disparity estimation. Finally, at the last level, the warping volume is established to refine the estimated disparity.
In the multi-scale feature extraction module, given a pair of images as input, three convolutional layers are used to obtain the unary feature map at the initial level, and three residual blocks are then applied to obtain the feature maps at the next three levels. Using the extracted features, pyramid cost volumes are established at various stages. In the cost-aggregation component, the combination volume is established at four levels. For every level $l$, the combination volume $V^l_{\mathrm{comb}}$ includes the concatenation volume $V^l_{\mathrm{concat}}$ and the group-wise correlation volume $V^l_{\mathrm{gwc}}$. With $f^l$ denoting the feature extracted at level $l$, the combination volume is evaluated as
$$V^l_{\mathrm{concat}}(d, x, y) = \mathrm{Concat}\left( f^l_L(x, y), \; f^l_R(x - d, y) \right),$$
$$V^l_{\mathrm{gwc}}(d, x, y, g) = \frac{N_g}{N^l_c} \left\langle f^{l,g}_L(x, y), \; f^{l,g}_R(x - d, y) \right\rangle,$$
$$V^l_{\mathrm{comb}} = \mathrm{Concat}\left( V^l_{\mathrm{concat}}, \; V^l_{\mathrm{gwc}} \right),$$
where $\mathrm{Concat}$ represents the operation of concatenation along the feature axis, and $f^l_L$ and $f^l_R$ denote the features extracted from the left and right images, respectively. $f^{l,g}$ denotes the $g$-th group of features, divided symmetrically from the extracted feature $f^l$ with the number of groups taken as $N_g$; $d$ stands for the disparity levels, $N^l_c$ denotes the number of channels of $f^l$, and $\langle \cdot , \cdot \rangle$ denotes the inner product. During the construction of the combination volume, an extra convolution layer without an activation function or batch normalization (BN layer) is included so that $V^l_{\mathrm{concat}}$ and $V^l_{\mathrm{gwc}}$ have an identical distribution of the data. The multi-scale combination volumes are fused together to estimate the disparity map at the initial stage.
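The group-wise correlation term above can be sketched as follows: dividing the channels into groups and averaging the per-group products reproduces the $\frac{N_g}{N^l_c}\langle\cdot,\cdot\rangle$ scaling. This is a minimal sketch, not the paper's exact implementation.

```python
import torch

def groupwise_correlation_volume(f_left, f_right, max_disp, num_groups):
    """f_left, f_right: (B, C, H, W) with C divisible by num_groups."""
    b, c, h, w = f_left.shape
    fl = f_left.view(b, num_groups, c // num_groups, h, w)
    fr = f_right.view(b, num_groups, c // num_groups, h, w)
    volume = f_left.new_zeros(b, num_groups, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, :, d] = (fl * fr).mean(dim=2)
        else:
            # Inner product between left features and right features shifted by d;
            # the mean over group channels gives the (N_g / N_c) scaling.
            volume[:, :, d, :, d:] = (fl[..., d:] * fr[..., : w - d]).mean(dim=2)
    return volume  # (B, N_g, D, H, W)
```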
In the cost-volume fusion, the combination volumes, the encoder blocks, and the subsequent fusion and decoder blocks are represented by $V^l$, $E^l$, $F^l$, and $D^l$, respectively, where $l$ indexes the levels, and the final fused cost volume is produced by the last fusion block. After that, a few 3D hourglass networks are stacked to finally generate the initial disparity map $d_0$. The suggested fusion blocks have a couple of main inputs: the encoder blocks, which handle the cost volume at higher resolutions, and the combination volume, which helps evaluate the similarity between the left and the corresponding right features. The process is formulated as a convolutional fusion of these inputs, where $\mathrm{Conv}$ denotes the convolution layer; the encoder and decoder processes are then developed. In this model, the disparity refinement component uses a multi-modal input to help the network learn the residue; this input consists of the 3D warping volume, the initial disparity map, the left features, and the reconstruction error. For the warping volume, the left features and the warped right features are used to build the volume at the final level, with the predicted initial disparity $d_0$ used to warp the right features. The residual disparity is assumed to be small; hence, a small search range $[-s, s]$ is used for the residue. The warping volume is evaluated as
$$V_w(d', x, y) = \mathrm{Concat}\left( f_L(x, y), \; f_R\left(x - d_0(x, y) - d', \, y\right) \right), \quad d' \in [-s, s],$$
where $f_L$ and $f_R$ are upsampled from the initial feature level to the actual size of the image. Subsequently, a reconstruction error helps identify flawed regions of the initial disparity prediction and is evaluated as
$$e(x, y) = \left| f_L(x, y) - f_R\left(x - d_0(x, y), \, y\right) \right|.$$
With this, the refinement network better recognizes the pixels that require further optimization. The left features and the initial disparity map are among the inputs to the refinement network: the initial disparity gives the network a base on which to carry out further optimization, and the left-image features provide context for learning the residues. For weight balancing of the input, the initial disparity is regularized by a convolution layer.
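A sketch of how these refinement inputs could be assembled, assuming a hypothetical `warp_fn` helper that warps the right features toward the left view by a given disparity map (for example, with `grid_sample`):

```python
import torch

def warping_volume_and_error(f_left, warp_fn, d0, s):
    """f_left: (B, C, H, W); warp_fn(d) warps the right features by disparity
    map d; d0: (B, 1, H, W) initial disparity; s: residual search range."""
    slices = []
    for r in range(-s, s + 1):
        f_warp = warp_fn(d0 + r)                      # candidate residual r
        slices.append(torch.cat((f_left, f_warp), dim=1))
    warping_volume = torch.stack(slices, dim=2)       # (B, 2C, 2s+1, H, W)
    recon_error = (f_left - warp_fn(d0)).abs()        # reconstruction error
    return warping_volume, recon_error
```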
Drawing inspiration from the previous works of [40,71], the smoothness loss function is used to train the network end-to-end. The suggested framework is implemented in PyTorch and trained using the Adam optimizer; the experiments were carried out on a couple of NVIDIA V100 GPUs. Another model that simultaneously achieves appreciable real-time performance, high precision, and high generalizability is suggested in “CGI-Stereo: Accurate and Real-Time Stereo Matching via Context and Geometry Interaction” [72]. The crux of the model is a fusion block that fuses context and geometry information for accurate and efficient cost aggregation; it also feeds back into feature learning to help extract more effective contextual features. The suggested block can be incorporated into various stereo-matching methods. A different cost-volume design is introduced to capture information related to both matching and content: a compact and informative cost volume that leverages a correlation volume to filter a feature volume. The model is developed on the basis of this fusion block and compact cost volume. The network is trained in a supervised, end-to-end fashion using the smoothness loss; PyTorch is used for the implementation, the experiments are performed on NVIDIA RTX 3090 GPUs, and the Adam optimizer is used throughout.
4.2. Unsupervised Stereo Models
Obtaining ground-truth depth data can be troublesome; as a result, unsupervised depth estimation techniques have gained attention over the past few years. Lacking ground truth, these techniques generally use geometric and photometric consistency constraints as the main supervisory signal. Recent deep MVS (Multi-View Stereo) methods have presented significant performance improvements; however, they rely on ground-truth depth maps for supervision, which are not easy to acquire. Hence, in “Learning Unsupervised Multi-View Stereopsis via Robust Photometric Consistency” [73], novel-view images are used as the sole supervisory signal for the designed framework, enabling unsupervised multi-view stereopsis. The photometrically consistent reprojections provided by the geometry are used to minimize the corresponding reprojection error, which is used to train the model's CNN. Essentially, a robust, multi-view photometric consistency loss is used to learn unsupervised depth prediction, which makes it possible to cope with occlusions and lighting changes across the various training views.
The Adam [48] optimizer is used during training, and TensorFlow [74] is used to implement the learning pipeline. As stated by [55], the appreciable efficiency of the model is maintained by using a small image resolution and coarse depth steps during training, owing to the elevated GPU memory requirements; during evaluation, the settings can be increased. Similarly, supervision with ground-truth data can impede the generalization of learned models to unexpected scenarios. In “MVS²: Deep Unsupervised Multi-view Stereo with Multi-View Symmetry” [75], an end-to-end unsupervised deep MVS network is suggested that can be learned without using ground-truth depth as a supervisory signal. Photometric consistency across the various views, in the form of multi-view image-warping errors, suffices to train the network to converge to a desirable state with enhanced performance. The work introduces cross-view consistency in depth prediction and proposes a loss function for measuring this consistency, which is then used to train the deep network. The network also learns multi-view occlusion maps, which enhances its robustness to real-world occlusions.
The MVS² network is implemented in TensorFlow with an NVIDIA V100 GPU, and the DTU training dataset is used to train the model. “Progressive Fusion for Unsupervised Binocular Depth Estimation using Cycled Networks” [76] is yet another trainable, end-to-end unsupervised deep network for depth estimation, based on adversarial learning. It introduces a novel Progressive Fusion Network (PFN) that combines the information from both images of a stereo pair and builds on a multi-scale refinement technique; the process is customized according to the two disparity maps predicted by the PFN. To form a cycle, the sub-network is stacked twice. This arrangement provides strong constraints and, hence, supervisory signals for each image view, allowing for network optimization, and it also acts as a form of data augmentation: during and after training, the disparity maps are predicted by the network from the training images in the forward pass of the cycle and from the synthesized images in the backward pass. Moreover, in the forward pass, the sub-network is discouraged from predicting deformed or blurred images, since these would have ramifications in the backward pass. The whole cycle is learned jointly, and the initial network provides the final disparity map. For training, the reconstruction loss, the consistency loss, and a least-squares GAN loss [77] are used. The network is optimized with the Adam optimizer, trained on a couple of Titan Xp GPUs, and implemented in TensorFlow. Similarly, in “M³VSNet: Unsupervised Multi-metric Multi-view Stereo Network” [78], an unsupervised multi-metric MVS network, M³VSNet, is introduced that works even in real or non-ideal environments and infers depth maps for the reconstruction of dense point clouds. An innovative multi-metric loss, combining pixel-wise and feature-wise loss functions, incorporates the varied perspectives on correspondence matching that exist beyond raw pixel values. With this loss function, both geometric and photometric matching consistency are assured to be high and robust, in contrast to MVSNet's purely photometric constraints. In addition, to enhance the continuity and precision of the generated depth maps, normal-depth consistency is incorporated in the 3D point-cloud format.
PyTorch is used to implement the network. For training, only the DTU training set is utilized, without its ground-truth depth maps. Four NVIDIA RTX 2080Ti GPUs and the Adam optimizer are used to train the model. Despite the development of unsupervised methods, the generated depth maps are of low resolution due to the memory consumption of the processes involved. In “Unsupervised multi-view stereo network based on multi-stage depth estimation” [79], an innovative unsupervised multi-view network is demonstrated that can enhance the resolution of the depth map and help generate a detailed, dense 3D model. To raise depth-map resolution while reducing the memory consumed by the cost volume, multiple progressive coarse-to-fine stages are used. A group-based multi-view correlation is established to suppress irrelevant information along the channel dimension, and a multi-view correlation prior is also demonstrated. Pixel-wise photometric consistency over selected views is implemented to reduce the repercussions of occlusion and reflection. To augment the robustness of the model, a structure loss including structural similarity has been implemented, and a depth-gradient smoothness loss is considered to maintain the smoothness of the estimated map; the overall loss function combines these stage losses with corresponding weights. Several pseudo-label-based techniques have been brought up to strengthen the loss-function constraint, owing to the weakness of the supervision signal of unsupervised MVS in intricate areas. Such techniques involve numerous training stages and extra supervision signals, such as optical flow or the depth map rendered from the reconstructed mesh model. To produce this optical flow as extra supervision, PWC-Net [80] is applied in the first stage of U-MVSNet [81].
The method is implemented using the PyTorch framework on four TITAN RTX GPUs. The Adam optimizer is used to optimize the network, and regularization is performed using SGDR [82] to prevent the parameters from converging to a poor local minimum. Another deep-learning-based unsupervised network, “H-Net: Unsupervised Attention-based Stereo Depth Estimation Leveraging Epipolar Geometry” [83], is suggested to remove the dependency on supervised learning schemes. The framework is end-to-end and makes meritorious use of epipolar geometry in the stereo-matching process. A Siamese encoder–decoder architecture in the self-supervised mode is introduced, which facilitates communication between the left and right images mainly by combining their complementary information. A mutual epipolar attention mechanism is developed to impose the epipolar constraint on feature matching: the mechanism emphasizes feature correspondences that lie along the same epipolar line as it learns the mutual information between the input stereo pair. Semantic information is also included in the attention mechanism to further strengthen the correspondences in stereo matching. Furthermore, the attention is moderated and outlier errors are removed in the regions invisible to both cameras by deploying the optimal transport algorithm. The consistency between the input images and their reconstructions provides the supervisory signal. A photometric error function is used that combines the L1 norm with the structural similarity index, and an edge-based smoothness term is implemented to improve predictions at boundaries. Finally, the photometric and pixel-wise smoothness loss terms are balanced with a smoothness weight to form the overall loss. The network was trained with the PyTorch library, using the Adam optimizer, on a single NVIDIA 2080Ti GPU.
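Losses of this form recur throughout the unsupervised methods above; a sketch of a common formulation (the `ssim_fn` callable and the `alpha` weighting are assumptions, not any one paper's exact recipe):

```python
import torch

def photometric_loss(img, img_recon, ssim_fn, alpha=0.85):
    """ssim_fn returns a per-pixel SSIM map in [0, 1]; alpha balances the terms."""
    l1 = (img - img_recon).abs().mean(dim=1, keepdim=True)
    ssim = ((1.0 - ssim_fn(img, img_recon)) / 2.0).mean(dim=1, keepdim=True)
    return (alpha * ssim + (1.0 - alpha) * l1).mean()

def edge_aware_smoothness(disp, img):
    """Disparity gradients are down-weighted where the image has strong edges."""
    dx_d = (disp[..., :, 1:] - disp[..., :, :-1]).abs()
    dy_d = (disp[..., 1:, :] - disp[..., :-1, :]).abs()
    dx_i = (img[..., :, 1:] - img[..., :, :-1]).abs().mean(dim=1, keepdim=True)
    dy_i = (img[..., 1:, :] - img[..., :-1, :]).abs().mean(dim=1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```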
4.3. Self-Supervised Stereo Models
The lack of labeled ground-truth disparity also drove the evolution of self-supervised learning, in which the model is responsible for labeling the data itself, creating pseudo-labels from the data. Such models are arguably more prevalent because they are generally easier to evaluate than unsupervised models. In “Light-weight network for real-time adaptive stereo depth estimation” [84], a lightweight adaptive network (LWANet) is presented for real-time, online-adaptive stereopsis under both low GPU memory and low computation cost, using self-supervised learning. A pseudo 3D convolution appreciably reduces the computational cost, with a slight compromise in accuracy, and is therefore implemented in the cost-volume aggregation stage. In addition, to improve the final disparity estimation, a highly efficient U-Net architecture is fused with the CSPN [85] refinement module. The work also contributes a generalized depth estimation strategy for industrial applications. The method combines the structural similarity index (SSIM) and a distance loss into what is termed the photometric image-reconstruction cost; a gradient-distance loss between the predicted disparity and the left image is then exploited to enhance edge consistency. PyTorch is used for the network implementation, an NVIDIA 1080 Ti GPU is deployed for training, and the Adam optimizer is used to update the parameters.
Supervised multi-view stereo depth prediction techniques have achieved promising results when trained with ground-truth depth data; that said, gathering multi-view depth data in large quantities is not easy. For this reason, a self-supervised strategy, “Self-supervised Learning of Depth Inference for Multi-view Stereo” [86], is implemented for multi-view stereopsis, which mines pseudo-labels from the input data. Initially, relying on an image reconstruction loss as supervision, the model learns to generate initial pseudo-labels under an unsupervised learning framework. To leverage depth information inferred from high-resolution images and neighboring views, these initial labels are refined through a carefully designed pipeline. The resulting high-quality pseudo-labels serve as supervisory signals for training the network, whose performance improves iteratively by self-training. The base network of CVP-MVSNet (Cost Volume Pyramid-based depth inference from Multi-View Stereo) [87] is adopted because of its compactness and flexibility in handling high-resolution images. The view-synthesis loss of [75] and the perceptual loss of [88] are implemented to establish a strong relationship between the synthesized and the corresponding reference images. Finally, a weighted fusion of four loss functions, the image gradient loss, the structural similarity loss, the perceptual loss, and the depth smoothness loss, is used. For the same reasons, another self-supervised depth prediction network, SMAR-Net in “Self-Supervised Multiscale Adversarial Regression Network for Stereo Disparity Estimation (SMAR-Net)” [89], has been proposed to avoid using ground-truth depth data for training. The network consists of two stages. The first stage performs disparity regression, where a regression network uses stacked stereo image pairs to predict disparity values, and a discriminator assists in training the regressor. In the final stage, under the left–right consistency assumption, a synthetic left image is produced, and an image-warping operation is exploited to deal with real-world data.
An image-stacking segment is placed before feature extraction; it captures the spatial appearance of the stereo images and implies their matching correspondences under varied disparity values. Furthermore, a multiscale pooling layer is used to reconcile object sizes with receptive fields. The model predicts stereo disparity with multi-scale features in both the regression and discrimination segments, a combination that helps predict disparity in ill-posed regions. For training, a hybrid loss function consisting of a content loss and an adversarial loss is minimized. The joint use of multi-scale feature extraction in both losses further enhances the generalizability of the network in ill-posed regions. The network is implemented in PyTorch; the parameters are initialized from a uniform distribution and optimized with Adam, and training is carried out on a single NVIDIA Quadro P5000 GPU. In the realm of computer-assisted surgery, some of the intricate steps involve dense depth prediction and 3D reconstruction of the surgical scene; however, ground-truth data are not readily available for laparoscopic imaging or, in general, for supervised training of a stereo depth estimation model. Hence, a self-supervised method, SADepth (Self-supervised Adversarial Depth Estimation) in “Self-Supervised Generative Adversarial Network for Depth Estimation in Laparoscopic Images” [90], has been suggested to predict dense depth using Generative Adversarial Networks. To account for geometry constraints during training, the network contains an encoder–decoder generator along with a discriminator. The adopted generative U-Net architecture benefits from the complementary information of the input stereo images. The local minima caused by the photometric reprojection loss are addressed with the disparity smoothness loss, and the network is formed in multi-scale mode; adversarial learning enhances the quality of the framework's outputs. Regarding training, the structural similarity between the actual images and their reconstructions is the supervisory signal for the generator. The generator loss is formed from the appearance-matching loss and the disparity smoothness loss; the discriminator loss and multi-scale losses are also implemented, and the joint optimization loss is a combination of the generator loss and the adversarial loss. PyTorch is used for the implementation of the model, which is trained with the Adam optimizer on a single NVIDIA 2080 Ti GPU. Similarly, another CNN-based architecture, in “Self-Supervised Learning for Stereo Matching with Self-Improving Ability” [91], has been designed that learns to estimate dense disparity maps directly from the stereo inputs. The image-warping error is used as the loss function to guide the learning process. The network generalizes well to various unseen scenarios and camera settings.
Given rectified left and right stereo images $I_L$ and $I_R$, the task of the network is to learn a function $f(I_L, I_R) \rightarrow (d_L, d_R)$ that predicts pixel-wise dense disparity maps, where $d_L$ and $d_R$ are the disparity maps for the left and right images, respectively. The function can be learned without any dense ground-truth disparity maps, for which stereo matching is formulated as an image-warping problem. Specifically, if the left image $I_L$ and the disparity map $d_R$ corresponding to the right image are available, the right image can be obtained by warping the left image according to the dense disparity map, as follows:
$$\tilde{I}_R(x, y) = I_L\left(x + d_R(x, y), \, y\right),$$
where $\tilde{I}_R$ is the warped right image. The inconsistency between the warped images and the observed images acts as the supervisory signal for learning the function above; this is performed with a deep CNN in an end-to-end, self-supervised manner.
Figure 7 depicts the network architecture, which has five modules: a feature extraction module, a cross-feature-volume generation module, a 3D feature matching module, a soft-argmin module, and an image-warping module. The feature extraction module contains a set of residually connected 2D convolutions to bring out local features. The learned features are gathered into a pair of cross-feature volumes. Subsequently, the feature matching module maps the 2D features to a higher dimension for discrimination, and the soft argmin is used to project the 3D volume to a 2D volume. In the final module, image warping is performed to calculate the photometric error, which serves as the supervisory signal for network training.
Basically, the learned features are used to construct a feature volume for evaluating the stereo-matching cost. The volume is constructed by exhausting the disparity levels within a pre-defined range. If the feature maps extracted from the left and right images by the feature extraction component are denoted $F_L$ and $F_R$, the left-to-right feature volume at pixel position $(x, y)$ with disparity $d$ is given as follows:
$$C_{L}(x, y, d) = \mathrm{Concat}\left( F_L(x, y), \; F_R(x - d, y) \right).$$
Similarly, the right-to-left feature volume is given as follows:
$$C_{R}(x, y, d) = \mathrm{Concat}\left( F_R(x, y), \; F_L(x + d, y) \right).$$
The matching cost is learned at each disparity with a photometric-based unary loss term and local regularization. A top-down module is presented for better feature extraction; its output is a 3D volume of regularized features. The feature matching step is embedded in the 3D-to-2D conversion stage, since warping requires a 2D disparity map. During this step, the disparity dimension of the feature volumes is reduced by choosing the disparity with the smallest distance between the corresponding left and right features. The soft argmin operation is performed over the disparity dimension to project the 3D volume to its 2D version. The operation is given as follows:
$$\hat{d}(x, y) = \sum_{d=0}^{D_{\max}} d \times \sigma\left( -c_d(x, y) \right),$$
where $c_d$ represents the estimated cost (at disparity $d$), $D_{\max}$ is the pre-defined disparity range, and $\sigma$ represents the softmax operation.
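As a short sketch, the soft-argmin projection defined above can be written directly in PyTorch:

```python
import torch
import torch.nn.functional as F

def soft_argmin(cost_volume):
    """cost_volume: (B, D, H, W); returns the expected disparity (B, 1, H, W)."""
    prob = F.softmax(-cost_volume, dim=1)  # low cost -> high probability
    disps = torch.arange(cost_volume.size(1), device=cost_volume.device).float()
    return (prob * disps.view(1, -1, 1, 1)).sum(dim=1, keepdim=True)
```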
Under the self-supervised mode of learning for stereo matching, the image reconstruction error is used to judge the quality of the estimated disparity map. The loss function for learning to estimate the disparity map consists of a photometric-based unary term, a regularization term for the disparity field, a consistency constraint between the stereo image pair and the corresponding disparity maps, and, finally, a maximum-depth heuristic (MDH) term. The network is implemented in TensorFlow, and the models are optimized end-to-end using RMSProp. Another self-supervised deep-learning model for stereopsis, PVStereo (Pyramid Voting Stereo Network), has been proposed in “PVStereo: Pyramid Voting Module for End-to-End Self-Supervised Stereo Matching” [92] to avoid using ground-truth depth maps and to overcome a limitation of deep CNNs, namely their lack of generalizability when adapting to unseen scenarios. It is a robust technique consisting of a Pyramid Voting Module (PVM) and an innovative deep CNN architecture termed OptStereo. OptStereo first builds multi-scale cost volumes and then adopts a recurrent unit to update the disparity predictions iteratively at high resolution; meanwhile, reliable, semi-dense disparity images are generated by the PVM and used to supervise the training of OptStereo. OptStereo achieves a good trade-off between accuracy and efficiency for stereopsis. In addition, a large-scale synthetic stereo dataset, HKUST-Drive, has been published, gathered under various weather and illumination conditions.
The architecture of the framework is depicted in Figure 8. The PVM generates a semi-dense disparity image, denoted $D^{SD}$, through its multi-scale voting structure; this disparity image supervises the training of the deep CNNs used to learn dense disparity prediction. Given a left and right stereo image pair, $I_L$ and $I_R$, the PVM produces a left and a right image pyramid, respectively. The left and right pyramid groups generate corresponding left and right semi-dense disparity images, denoted $D^{SD}_L$ and $D^{SD}_R$. Every group consists of $K$ stereo image pairs at various scales, and every stereo image pair can produce a left and a right disparity image, denoted $D_k$, where $k \in \{1, \dots, K\}$, using a TSM (Traditional Stereo Matching) algorithm. Based on this concept, a representation $C_k$ can be formulated for each pixel $\mathbf{p}$, where $\mathbf{p}$ represents an individual pixel of an image and $C_k(\mathbf{p})$ represents the normalized inverse stereo-matching cost. A voting map $V$ is then attained by thresholding the agreement of the multi-scale disparities and their matching confidences, where $\tau_1$ and $\tau_2$ are the thresholds on disparity agreement and on the normalized cost, respectively: $V(\mathbf{p}) = 1$ when both threshold conditions hold, or $0$ otherwise. Eventually, $D^{SD}_L$ and $D^{SD}_R$ are processed by an operator that checks for disparity consistency, which gives the final semi-dense disparity image $D^{SD}$.
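The final consistency check can be sketched as follows; this is a simplified version that uses nearest-pixel lookup rather than subpixel interpolation, with a hypothetical agreement threshold `max_diff`.

```python
import torch

def lr_consistency(disp_left, disp_right, max_diff=1.0):
    """disp_left, disp_right: (B, 1, H, W); returns disp_left with inconsistent
    pixels set to NaN (treated as missing labels downstream)."""
    b, _, h, w = disp_left.shape
    xs = torch.arange(w, device=disp_left.device).view(1, 1, 1, w).float()
    # Sample the right disparity at the position each left pixel maps to.
    x_right = (xs - disp_left).round().clamp(0, w - 1).long()
    disp_right_at_left = torch.gather(disp_right, 3, x_right)
    ok = (disp_left - disp_right_at_left).abs() <= max_diff
    return torch.where(ok, disp_left, torch.full_like(disp_left, float("nan")))
```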
OptStereo, taking the stereo image pair as input, generates a dense disparity image denoted $D^{DD}$. It contains three stages: feature extraction, cost-volume computation, and iteration-based refinement. A pair of residual networks is used to extract visual features, $F_L$ and $F_R$, from $I_L$ and $I_R$, respectively. Next, the visual feature consistency between $F_L$ and $F_R$ is evaluated for as many matching pairs as possible: a cost volume is computed by performing the dot product between the candidate matching feature pairs, as demonstrated below:
$$C(\mathbf{p}, \mathbf{q}) = F_L(\mathbf{p}) \cdot F_R(\mathbf{q}).$$
Multi-scale cost volumes are then established as $\{C^1, \dots, C^m\}$, where $m$ is the number of scales. A dense disparity prediction can map a point $\mathbf{p}$ in $I_L$ to its analogous point $\mathbf{p}'$ in $I_R$. A neighbouring area around $\mathbf{p}'$ is formulated as follows:
$$\mathcal{N}(\mathbf{p}') = \left\{ \mathbf{p}' + \delta \; : \; |\delta| \le r \right\},$$
where $r$ is the lookup distance and is a constant. Values are then read out of the multi-scale cost volumes and concatenated into a local cost volume, which provides useful information about visually consistent features, guided by the current dense disparity prediction, for the subsequent refinement. In the final refinement stage, a chain of dense disparity predictions is updated iteratively; afterwards, an up-sampling module, consisting of an up-sampling layer followed by a couple of convolution layers, produces the dense disparity predictions at full resolution. During training, the parameters of the framework are optimized by minimizing a loss function with three terms: the PVM guiding loss, the reconstruction loss, and, finally, the smoothing loss. The Adam optimizer and a couple of NVIDIA GeForce RTX 2080 Ti graphics cards are used in this phase.
4.4. Experimental Comparison—Stereo Depth Estimation Models
Some supervised methods have been compared in Table 2, considering all pixels, and Table 3, considering only the non-occluded pixels, for the Middlebury 2014 dataset. In Table 2, the models are arranged according to the years of their acceptance. The evaluation metrics used for this dataset are bad-4.0, bad-2.0, and bad-1.0 (the percentage of bad pixels whose error exceeds 4, 2, and 1 pixels, respectively); avgerr (the average absolute error in pixels); rms (the root-mean-square disparity error in pixels), which accounts for subpixel accuracy; and A99, A95, and A90 (the 99%, 95%, and 90% error quantiles, respectively, in pixels), which neglect large deviations when estimating accuracy. As can be observed, almost all the error types show a considerable reduction in value, apart from some abrupt increases, as in the case of CRAR [65], which traded accuracy for relatively lower computation time. Meanwhile, RAFT-Stereo [61], with its very acceptable error rates, has a higher computational time than other stereo methods. Overall, however, the advances made in stereopsis show their efficiency in optimizing the parameters to develop a fine model for stereo matching. In Table 3, the same evaluation metrics are used, but for the non-occluded pixels, and the observed trend is quite similar to that in Table 2.
Table 4 depicts the year-wise comparison between the supervised and some of the self-supervised models explored in the realm of stereo depth estimation for the KITTI 2015 dataset. Both cases are considered: all pixels and non-occluded pixels (pixels visible in both the left and right images of a stereo pair). The evaluation metrics used for this dataset include D1-bg, D1-fg, and D1-all, which evaluate the percentage of outliers among background pixels, foreground pixels, and all pixels, respectively. These are among the main evaluation metrics and are elucidated below:
D1-bg (background error) measures the percentage of bad pixels in the background area of an image, where the background is the area excluding the dynamic objects (cars, pedestrians) in the scene. It is measured as the proportion of background pixels, relative to the total number of background pixels, whose disparity error exceeds a specified threshold.
D1-fg (foreground error) concerns the foreground objects (e.g., cars and pedestrians), which are important for understanding scene dynamics and for navigation-related tasks. It likewise measures the percentage of bad pixels in the foreground area of an image under the same specified threshold.
D1-all (overall error) combines the background and foreground evaluations, assessing the overall performance of the model across the entire image. It measures the percentage of bad pixels in both the foreground and background areas under the same specified threshold.
The table shows that the supervised models have improved over the last few years in terms of all the error metrics for both cases. EdgeStereo [43], for instance, has slightly larger values for some of the metrics than its predecessor, PSMNet [40]; however, it adds an edge sub-network and provides an edge map in addition to a disparity map, and the computational time of EdgeStereo [43] is lower than that of PSMNet [40]. Stereo models have seen some self-supervised methods in the past, and these are increasing in prevalence; three of them are included in the table. As can be observed, they are yet to reach parity with the supervised methods; relative to each other, however, they have shown steady improvement in all the metrics and are closing the performance gap to the supervised models. As can be seen from Figure 9, the supervised stereo depth estimation models have lower D1-all values than their self-supervised counterparts and show a steadier decline in evaluation-metric values with their newer models. The self-supervised models have higher D1-all values; they also show a decreasing tendency, but with a significant rise in the middle.