2.1. Pansharpening Methods
Pansharpening methods can be roughly divided into four categories [10]: component substitution (CS) methods, multiresolution analysis (MRA) methods, variational optimization (VO) methods, and machine learning (ML)-based methods. The first two classes are the conventional approaches and play a fundamental role. As the technology has evolved, deep learning (DL)- and variational optimization-based methods have flourished in recent years, bringing significant improvements.
CS-based methods, also called spectral methods, assume that the spatial details of the image can be separated and replaced. The LRMS image is projected into a suitable transformed domain, such as the intensity–hue–saturation (IHS) space, and the separated spatial component is then partially or fully replaced with that of the PAN image. Owing to its simplicity and speed, IHS [11] is well known for the fusion of PAN and LRMS images with three channels; for data with four or more channels, the generalized IHS (GIHS) [12] extends the scheme to pansharpening. Other common methods in this category include the principal component analysis (PCA) transform [13], the Brovey transform [14], and smoothing filter-based intensity modulation (SFIM) [15]. Much effort has also been devoted to improving the injection rules, which exploit the relationship between the pixel values of the PAN and MS images, as in the partial replacement adaptive CS (PRACS) [16] and adaptive GS (GSA) [17] implementations. CS methods rest on the widely accepted hypothesis that a linear combination of MS bands approximates the PAN image, which neglects the sensors' inherent spatial and spectral properties; hence, an inappropriate choice of weights can lead to serious spectral distortion.
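The substitution step at the core of CS methods can be sketched in a few lines. The following is a minimal GIHS-style example in NumPy; the equal band weights and the function name are illustrative choices, not part of any cited implementation.

```python
import numpy as np

def gihs_pansharpen(ms, pan, weights=None):
    """GIHS-style component substitution (illustrative sketch).

    ms:  (H, W, B) multispectral image, already upsampled to the PAN grid.
    pan: (H, W) panchromatic image.
    """
    ms = ms.astype(np.float64)
    pan = pan.astype(np.float64)
    bands = ms.shape[-1]
    if weights is None:
        # Equal weights: the simplest intensity definition (an assumption).
        weights = np.full(bands, 1.0 / bands)
    # Intensity component: linear combination of the MS bands.
    intensity = ms @ weights
    # Substitute the intensity by injecting (PAN - I) into every band.
    return ms + (pan - intensity)[..., None]
```

With equal weights, the band average of the fused image equals the PAN image exactly, which is precisely the substitution property these methods exploit.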
MRA-based methods, referred to as spatial methods, generally decompose the PAN and MS images into multiple scales; the extracted spatial information is then injected into the MS image at each scale. Typical decomposition algorithms include wavelet transforms [18,19], Laplacian pyramids [20], and the curvelet transform [21]. To account for the specificity of the acquisition sensor, refs. [22,23] introduced sensor information into the decomposition scheme, and performance was further improved by introducing nonlinear methods and optimizing the injection coefficients. The advantage of MRA-based methods is that they produce less spectral distortion; however, they are sensitive to the quality of the extracted spatial information, and the injection of high-frequency information may cause aliasing effects and the blurring of contours and textures. Synthesizing these two classical fusion strategies, hybrid techniques combining CS and MRA have emerged, including CS followed by MRA (CS+MRA) and MRA followed by CS (MRA+CS). The more prominent of the two is CS+MRA, which performs the multiresolution decomposition in the transformed (CS) domain before projecting back.
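The decompose-and-inject idea can be illustrated with a single-level scheme: low-pass filter the PAN image, treat the residual as the detail layer, and add it, scaled by an injection gain, to each MS band. The box filter, default gain, and function names below are simplifying assumptions rather than the filters used by the cited methods.

```python
import numpy as np

def box_blur(img, radius=2):
    """Simple box low-pass filter (a stand-in for an MTF-matched kernel)."""
    k = 2 * radius + 1
    p = np.pad(img, radius, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def mra_pansharpen(ms, pan, gain=1.0, radius=2):
    """Single-level detail injection: fused = MS + g * (PAN - lowpass(PAN))."""
    detail = pan.astype(np.float64) - box_blur(pan, radius)
    return ms.astype(np.float64) + gain * detail[..., None]
```

Because only the high-frequency residual of the PAN image is injected, a spatially flat PAN image leaves the MS bands untouched, which is why these methods tend to preserve spectral content better than substitution.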
Pansharpening is regarded as an optimization problem in VO-based methods. Specifically, the relationship between the HRMS image and the observed images is established according to a sensor model estimated from the PAN and LRMS images. As reconstruction from low- to high-resolution images is ill-conditioned and can amplify noise, several types of regularization have been introduced to mitigate this ill-conditioning. The estimation problem amounts to building a cost function comprising a fidelity term, which describes the relationship between the HRMS image and the observed images, and a regularization term, which incorporates prior beliefs about the HRMS image into the optimization. Ballester et al. [24] first exploited this idea in the groundbreaking P+XS model, built on three assumptions. Both sparse regularization [25] and Bayesian [26] methods fall into the VO family. Fasbender et al. [27] hypothesized a joint Gaussian model for the unknown MS image and the PAN image. The earliest work on sparse representation was proposed by Li and Yang [28], whose idea was to represent the unknown HRMS image as a sparse linear combination of dictionary elements. Sparse representation theory was also applied in SR-D [29], which developed a signal reconstruction procedure using a reduced number of measurements. However, most VO methods rely on one or more regularization parameters that must be selected by the user; moreover, evaluating the energy function and prior knowledge is computationally expensive and time-consuming, especially for large-scale images.
Deep learning marks a new milestone in pansharpening research. The promising capability of deep learning models to capture complex nonlinear relationships and extract features through multi-layer neural networks has led to their widespread use in various fields of computer vision [30,31], such as image classification, image super-resolution, and image colorization. The modified sparse denoising autoencoder (MSDA) algorithm [32] was the first attempt to conduct pansharpening with deep neural networks. Subsequently, DL-based pansharpening methods have continued to emerge.
Most existing DL-based methods follow the supervised learning paradigm built on Wald's protocol [33]. First, once the fused image is degraded back to its original resolution, it should be as identical as possible to the original image. Second, each band of the fused image should be as identical as possible to the corresponding band the sensor would observe at the highest resolution. Third, the multispectral set of fused images should be as identical as possible to the multispectral set that the sensor would observe at the highest resolution. Simulated data sets are obtained by degrading the original high-resolution images. Relying on these reference images, the network updates its parameters by minimizing the loss between the fused results and the pseudo-ground-truth MS images; the full-resolution data are then used to test the pre-trained network. In 2016, a pansharpening method (PNN) consisting of a three-layer convolutional neural network (CNN) received widespread attention [34]. Scarpa et al. [35] introduced residual connections into the PNN structure and adopted target-adaptive fine-tuning to enhance its generalization across several data sets. PanNet [36] combines domain-specific knowledge with neural networks, training the network parameters in the high-frequency domain. However, the methods mentioned above apply a single branch for feature extraction, ignoring the distinct spatial and spectral features of the source images. Liu et al. [37] investigated a two-branch network that performs fusion in the feature domain, first encoding the input images into high-level feature representations and then reconstructing the high-resolution image. A unified two-stage spatial and spectral network, UTSN [38], contains a spatial enhancement network trained and shared across hybrid data sets and a spectral adjustment network that captures the spectral characteristics of a specific satellite. It should be noted, however, that supervised models produce simulated results with limited real-world applicability; furthermore, the training process fails to make full use of the original high-resolution information, potentially resulting in scale mismatches.
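The reduced-resolution training setup implied by Wald's protocol can be sketched as follows. Block averaging stands in for a proper MTF-matched low-pass filter, and all function names and the resolution ratio are illustrative assumptions.

```python
import numpy as np

def degrade(img, ratio=4):
    """Blur-and-decimate by averaging ratio x ratio blocks (a crude
    stand-in for MTF-matched filtering followed by decimation)."""
    h, w = img.shape[0] // ratio, img.shape[1] // ratio
    img = img[:h * ratio, :w * ratio]
    if img.ndim == 2:
        return img.reshape(h, ratio, w, ratio).mean(axis=(1, 3))
    return img.reshape(h, ratio, w, ratio, -1).mean(axis=(1, 3))

def reduced_resolution_sample(ms, pan, ratio=4):
    """One supervised training triplet: degraded MS and PAN as inputs,
    the original MS image as the pseudo ground truth."""
    return degrade(ms, ratio), degrade(pan, ratio), ms
```

Training on such triplets gives the network a reference to regress toward; the scale mismatch noted above arises because the network only ever sees the degraded scale during training.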
Unsupervised learning frameworks, in contrast, are built on the consistency property of Wald's protocol: the unavailability of reference images is addressed by designing appropriate loss functions and backbones. Luo et al. [39] designed an unsupervised network based on PAN-guided feature fusion, in which the PAN images guide the construction of spatial information to recover details at high spatial resolution. To handle the complex spectral characteristics of MS images, an unsupervised pansharpening method with a self-attention mechanism [40] was proposed; its stacked self-attention network contains an attention representation layer that naturally identifies the spectral characteristics of mixed pixels with sub-pixel accuracy. Re-blurring and graying blocks are applied in LDP-Net [41], allowing it to learn the degradation processes at different resolutions. Inference speed can be improved through target-adaptive inference schemes, which have therefore been introduced into many methods, such as Lambda-PNN [42] and Fast Z-PNN [43]. To address the challenge of limited training data, a zero-shot semi-supervised pansharpening method (ZS-Pan) [44] was developed that serves as a plug-and-play module. A loss function based on Stein's unbiased risk estimate (SURE) [45] was introduced into an unsupervised network to avoid overfitting. MetaPan [46] removes the need to set key hyperparameters manually: its meta-learning stage optimizes an internal representation of the network parameters that adapts to specific image pairs.
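The loss-design idea behind these unsupervised frameworks can be sketched with two Wald-consistency terms: a spectral term comparing a degraded fused image to the LRMS input, and a spatial term comparing a band average of the fused image to the PAN input. The block-average degradation, the equal-weight band average, and all names here are illustrative assumptions, not the losses of any cited method.

```python
import numpy as np

def block_mean(img, ratio=4):
    """Toy degradation: average ratio x ratio blocks of each band."""
    h, w = img.shape[0] // ratio, img.shape[1] // ratio
    img = img[:h * ratio, :w * ratio]
    return img.reshape(h, ratio, w, ratio, -1).mean(axis=(1, 3))

def spectral_loss(fused, ms_lr, ratio=4):
    """Consistency: degrading the fused image should recover the LRMS input."""
    return float(np.mean((block_mean(fused, ratio) - ms_lr) ** 2))

def spatial_loss(fused, pan):
    """The band average of the fused image should resemble the PAN input."""
    return float(np.mean((fused.mean(axis=-1) - pan) ** 2))
```

A weighted sum of these two terms replaces the supervised reference loss, which is what allows such networks to train directly on full-resolution data.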
In addition, a large number of unsupervised networks are based on generative adversarial networks (GANs) [8]. As a pioneering work, Ma et al. [47] proposed an unsupervised pansharpening method termed PanGAN. MDSSC-GAN SAM [48] focuses on high-frequency information and utilizes dual discriminators: a geometric discriminator, which optimizes image texture and geometry, and a chromaticity discriminator, which preserves the spectral resolution. Motivated by the cycle-consistent adversarial network (CycleGAN) [49], Li et al. [50] proposed a self-supervised framework in which the fused images are successively passed through two generators for improved performance. Zhou et al. [51] likewise proposed a cycle-consistent generative adversarial network (UCGAN) to bridge the gap between reduced and full resolution. ZeRGAN [52] is a zero-reference generative adversarial network whose structure consists of a set of multi-scale generators and discriminators; its training involves only a single pair of images, yet accurate fused results are generated. Ozcelik et al. [53] adopted a new perspective, regarding pansharpening as a colorization task; this self-supervised formulation overcomes the spatial detail loss and ambiguity of CNN-based models.