1. Introduction
A hyperspectral image (HSI) forms a three-dimensional data cube in which each pixel is represented by an almost continuous spectral curve. This representation captures both spatial arrangements and spectral characteristics, allowing objects to be differentiated clearly. High-resolution (HR) remote sensing images, crucial in applications such as military rescue [1] and environmental monitoring [2], provide detailed observations of ground objects and facilitate various tasks, including image classification [3], object detection [4], and tracking [5]. However, limitations imposed by sensors and transmission bandwidth necessitate a trade-off between spatial and spectral resolution: spatial resolution is often sacrificed to preserve accurate spectral features, which makes enhancing HSI spatial resolution a significant challenge.
In recent years, multiple strategies have been proposed to address the low spatial resolution of HSIs, broadly divided into two categories. The first employs fusion methods [6], which integrate image data from diverse sources to extract and combine complementary information within a unified framework, producing HR HSIs enriched with both spatial detail and spectral data. The second is single HSI super-resolution (SR), which directly produces HSIs with enhanced spatial resolution by learning the mapping relations between low-resolution (LR) HSIs and their HR counterparts.
For fusion-based methods, traditional approaches typically rely on preset algorithms to integrate images of different resolutions. These techniques are effective in specific cases but usually require precise image alignment and complex preprocessing steps, conditions that are difficult to meet in dynamic or complex environments and that limit their application scope. In contrast, deep-learning-based fusion methods automatically extract and integrate features by learning from large amounts of data, improving fusion accuracy, scaling to larger datasets, and adapting better to changing environmental conditions. Nevertheless, although fusion-based methods can offer richer information, they face major challenges: accurate alignment of different source images is critical but difficult in dynamic environments, risking data loss, and the heterogeneity of the sources complicates fusion and requires extensive preprocessing [7]. Conversely, single HSI SR is more straightforward, eliminating the need for auxiliary data or complex preprocessing, and it dominates current research thanks to its simplicity when implemented with deep learning techniques.
Traditional single HSI SR methods develop a mapping function from LR to HR HSIs, often relying on handcrafted prior knowledge (e.g., low-rank approximations [8] and sparse coding [9]) to address the inherent uncertainty of HR-HSI reconstruction. In these methods, the prior knowledge acts as regularization, and image degradation is simulated by a forward mathematical model that captures the spectral properties and spatial structure of the input. However, the resulting optimization problem is often ill-conditioned, making the optimal HR-HSI solution difficult to obtain. Moreover, although various priors [10,11], such as spectral mixing models, total variation, sparse representation, low rank, and self-similarity, have been explored in signal processing and computer vision and have demonstrated superiority over unconstrained optimization techniques, the diversity of HSI scenarios and the intricate nature of spectral and spatial structures make efficient prior design challenging.
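To make the regularization idea concrete, the following minimal NumPy sketch computes an anisotropic total-variation term, one of the priors listed above. It is an illustrative example, not an implementation from the cited works; traditional methods add a term like λ·TV(x) to the reconstruction objective to favor piecewise-smooth solutions.

```python
import numpy as np

def total_variation(x):
    """Anisotropic total-variation prior of a single-band image.

    Sums the absolute finite differences along both spatial axes;
    smooth images score low, noisy or textured images score high.
    """
    dy = np.abs(np.diff(x, axis=0)).sum()   # vertical differences
    dx = np.abs(np.diff(x, axis=1)).sum()   # horizontal differences
    return dx + dy

img = np.zeros((8, 8))
img[:, 4:] = 1.0                 # one clean vertical edge
print(total_variation(img))      # 8.0 -- one unit jump per row
```

A piecewise-constant image with a single edge gets a small penalty, while random noise would be penalized heavily, which is exactly why such priors bias the ill-conditioned inverse problem toward plausible reconstructions.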
Benefiting from end-to-end learning of the mapping function with deep learning techniques, the spatial–spectral features of HSIs can be captured adeptly without handcrafted priors. With advances in computing hardware and the growth of available datasets, deep learning has set new benchmarks in HSI SR. Among the leading architectures in this domain are convolutional neural networks (CNNs) and Transformers.
CNNs, known for their deep structures and convolutional operations, excel at extracting depth-wise features from images and learning the mapping relations between LR-HSI and HR-HSI, effectively representing spatial–spectral relationships. The emergence of SRCNN [12] has inspired many CNN-based methods, which incorporate advanced techniques such as residual learning [13], attention mechanisms [14,15], and multiscale processing [16,17,18] to boost performance. Some researchers have also explored 3D convolutions to address spectral-wise representations [19,20] and minimize spectral distortions. Nonetheless, CNNs focus primarily on local feature extraction and may perform suboptimally at capturing long-range information in HSIs, resulting in limited representational capacity and artifacts in HSI SR results.
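The 3D-convolution idea mentioned above can be sketched in a few lines: a 3D kernel slides jointly over the spectral axis and the two spatial axes, so each output voxel mixes neighboring bands as well as neighboring pixels, which is the property 3D-CNN SR methods exploit to limit spectral distortion. This is a naive illustrative loop, not any cited method's implementation.

```python
import numpy as np

def conv3d_valid(cube, kernel):
    """Naive 'valid' 3D convolution over an HSI cube (bands, H, W)."""
    kb, kh, kw = kernel.shape
    B, H, W = cube.shape
    out = np.zeros((B - kb + 1, H - kh + 1, W - kw + 1))
    for b in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                # each output voxel aggregates a spectral-spatial neighborhood
                out[b, i, j] = np.sum(cube[b:b + kb, i:i + kh, j:j + kw] * kernel)
    return out

cube = np.random.default_rng(0).standard_normal((31, 8, 8))
kernel = np.ones((3, 3, 3)) / 27.0        # spectral-spatial box filter
print(conv3d_valid(cube, kernel).shape)   # (29, 6, 6)
```

In contrast, a 2D convolution applied band-by-band would never exchange information across the spectral axis, which is one source of the spectral distortions that 3D variants aim to reduce.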
Recently, Transformers have been applied to single HSI SR, leveraging self-attention mechanisms that grasp long-range dependencies and integrate information globally, enhancing the quality of HR-HSI reconstruction. Despite their scalability and flexibility, the applicability of Transformers to HSI SR is hampered by the limited size of HSI datasets compared to the vast collections of RGB images. Moreover, the computational complexity of self-attention, which scales quadratically with the sequence length, imposes significant computational demands.
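The quadratic cost is visible directly in a plain self-attention computation: the n × n score matrix grows with the square of the token count. The NumPy sketch below uses identity Q/K/V projections for brevity and is only an illustration of the scaling argument, not a Transformer implementation.

```python
import numpy as np

def self_attention(x):
    """Plain (global) self-attention over a sequence of n tokens.

    The (n, n) score matrix is what makes memory and compute
    scale quadratically with the sequence length n.
    """
    n, d = x.shape
    q, k, v = x, x, x                         # identity projections for brevity
    scores = q @ k.T / np.sqrt(d)             # (n, n) -- the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                        # (n, d)

# Flattening even a modest 64x64 HSI patch gives n = 4096 tokens,
# so the score matrix alone holds 4096^2 ≈ 16.7M entries per head.
x = np.random.default_rng(0).standard_normal((1024, 32))
out = self_attention(x)
print(out.shape)  # (1024, 32)
```

Window-based attention schemes, such as the one adopted in this work, cap n at the window size and thereby recover linear complexity in the image size.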
To address the aforementioned challenges, namely the inadequacy of CNNs in capturing long-range dependencies, the high computational cost of Transformers on large-scale HSI data, and the remaining room for improvement in extracting and fusing spatial and spectral information in HSI SR tasks, we introduce the Spatial–Spectral Aggregation Transformer (SSAformer), a hybrid model that combines the strengths of Transformer and CNN architectures for efficient feature extraction and fusion across spatial and spectral channels, achieving superior restoration results. Specifically, SSAformer incorporates spatial and spectral attention modules. For spatial features, it introduces a window attention mechanism for enhanced extraction and employs a cross-fusion attention mechanism to strengthen long-range dependencies. This approach not only maintains linear computational complexity but also broadens the receptive field, effectively reducing spatial artifacts. In the spectral domain, SSAformer applies channel attention operations via deformable convolutions (DCs), adaptively processing information from each channel to overcome the redundancy inherent in HSIs. Consequently, SSAformer tackles channel redundancy and significantly improves global attention to spectral features. Comprehensive experiments on three widely used benchmark datasets show that SSAformer surpasses existing state-of-the-art (SOTA) methods. Our contributions can be summarized as follows:
We propose the novel Spatial–Spectral Aggregation Transformer for HSI SR, designed to capture and integrate long-range dependencies across spatial and spectral dimensions. It features spatial and spectral attention modules that effectively extract and integrate spatial and spectral information in HSI SR tasks, significantly enhancing SR performance while maintaining linear computational complexity.
To achieve long-range spatial dependencies, we construct spatial attention modules, utilizing cross-range spatial self-attention mechanisms within cross-fusion windows to aggregate local and global features, effectively enhancing the model’s perception of spatial details while ensuring the integrity and continuity of spatial information.
To address the redundancy problem in the high-dimensional spectral data of HSIs and to effectively capture long-range spectral dependencies, we construct spectral attention modules that combine DCs with channel attention operations, reducing channel redundancy while enhancing the model’s global attention to spectral characteristics.