1. Introduction
Multispectral images are useful in a wide range of applications: facial recognition [1], remote sensing [2], medical imaging [3], and precision agriculture [4], among others. Multispectral image acquisition systems offer great diversity, particularly with scanning-mode acquisition systems that acquire the multispectral image over multiple frames. They are divided into three categories: tunable filter cameras, tunable illumination cameras, and multi-camera systems. Tunable filters, such as the LCTF (Liquid Crystal Tunable Filter) [5] and the AOTF (Acousto-Optical Tunable Filter) [6], use electronic techniques to capture each spectral band. Although these systems produce fully defined multispectral images, their acquisition time is beyond the scope of a real-time acquisition system.
On the other hand, instantaneous acquisition systems, or snapshots, capture the MS image in a single shot. They include single-sensor or multi-sensor multispectral systems, which are divided into several classes: multispectral filter array (MSFA), interferometers, tunable sensors, and filtered lens arrays [7].
The acquisition system based on a single-sensor one-shot camera coupled with an MSFA provides a compact, low-cost, real-time solution for multispectral image acquisition. The camera can capture all the necessary spectral bands in a single snapshot [8]. To achieve this, an MSFA is positioned in front of the sensor to capture mosaic images where each pixel location contains information from a single spectral band. An interpolation method is applied to the mosaic image to obtain the fully defined multispectral image [9].
The MSFA plays a crucial role in multispectral imaging by filtering the light entering the sensor. For a given MSFA size, increasing the number of bands reduces the number of pixels assigned to each band. A greater number of spectral bands in the MSFA allows for a more precise spectral analysis of the observed scene, but this results in a decrease in the spatial resolution of the image. Indeed, with more spectral bands, the distance between spectrally similar pixels increases [10]. The main weakness of single-sensor one-shot cameras is their limited ability to efficiently reconstruct a complete multispectral image from a mosaic image, especially when the mosaic contains non-homogeneous areas, abrupt transitions, and textured regions [11].
Previous works [4,12] have detailed the design process of our single-shot multispectral camera, specifically designed to operate in the visible range. This camera has a 4 × 4 MSFA moxel with eight spectral bands selected by a genetic algorithm. Each spectral band receives two pixels per moxel, where the mosaic arrangement is a moxel assembly over a monochrome sensor [13]. After a snapshot, the camera provides a mosaic that is decomposed into eight sparse images, each containing pixels with the same spectral properties and null pixels. Thus, the sparse images have a very high number of null pixels. For our camera, in a 16-pixel moxel window, 14 pixels are null. This deficit can cause problems during image demosaicing, degrading image quality and visual fidelity and causing a loss of spatial resolution.
To address these issues, we propose a method for reducing the number of null pixels in sparse images. Our approach aims to reduce the number of null pixels by combining sparse images from multiple acquisitions. To achieve this, we combine camera displacements along both the vertical and horizontal axes. At each displacement, the camera captures an image of the observed scene, generating a mosaic of the scene with a spatial redistribution of pixels with similar spectral properties. Next, the set of sparse images from each post-displacement acquisition is summed with those obtained without displacement to obtain new composite sparse images. The new sparse images are finally demosaiced.
In this study, we present the following contributions:
Setting up a dataset for our experiments, which consists of projecting 31-band images from a database into 8 bands to simulate our 8-band MSFA moxel; these images are then mosaicked with our MSFA filter to simulate a snapshot from our camera;
Developing a new composition method with a multi-shot approach to reduce the number of null pixels in sparse images while maintaining the same number of spectral bands;
Performing visual and analytical comparisons using validation metrics to evaluate our experiments, demonstrating the improvement in the spatial resolution of the final image obtained after demosaicing.
The remainder of this article is organized as follows: Section 2 presents the state of the art in improving the spatial resolution of MSFA images. Section 3 details the materials and methods used in our approach. Section 4 presents the experiments carried out and the results obtained. Section 5 discusses the results. Finally, Section 6 presents our conclusion.
2. Related Works on Improving the Spatial Resolution of MSFA Images
Much research has demonstrated the value of improving the spatial resolution of multispectral images acquired with an MSFA sensor.
Monno et al. [11,14] proposed a multispectral demosaicing method using a guided filter. This method is used in multispectral imaging to improve color reproduction and computer vision applications. The proposed method uses a guided filter to interpolate spectral components in a multispectral color filter array. The technique addresses the challenge of undersampling in multispectral imaging and shows promising results for practical applications. Its effectiveness is based on the establishment of an MSFA pattern with a dominant green band.
Wang et al. [15] proposed a method to improve the quality of images reconstructed from multispectral filter arrays while minimizing the computational cost. It addresses the challenge of estimating missing data in images acquired by these arrays using adaptive frequency domain filtering (AFDF). This technique combines the design of a frequency domain filter to eliminate artifacts with spatial averaging filtering to preserve spatial structure. By incorporating adaptive weighting, AFDF improves the quality of reconstructed multispectral images while maintaining high computational efficiency.
Rathi and Goyal [16] proposed a weighted directional interpolation method for estimating missing pixel values. They exploit both the spectral and spatial correlations present in the image to intelligently select interpolation schemes based on the properties of binary-tree-based MSFA patterns. By computing directional estimates and using edge amplitude information, the method progressively estimates missing pixel values and updates pixel arrangements according to the band's point of arrival (PoA) in the binary tree structure.
Zhang et al. [17] proposed a method that integrates a deep convolutional neural network with a channel attention mechanism to improve the demosaicing process. In this method, a mean square error (MSE) loss function is used to improve the accuracy of estimated pixel values in image processing. In addition, a contour loss is introduced to improve the sharpness and richness of textured images using high-frequency subband analysis in the wavelet domain. The method uses the TT-59 database [18] for training and evaluation. Multispectral images are processed to synthesize radiance data to demonstrate the effectiveness of the demosaicing technique.
Mihoubi et al. [19] proposed a demosaicing method called PPID based on the generation of a pseudo-panchromatic image (PPI). To ensure robustness to different lighting conditions, an adjustment of the value scale in the raw image is proposed before estimating the PPI, with the aim of mitigating biases caused by differences in the spectral illumination distribution between channels. The remaining steps include calculating the spectral differences [20] between the original raw image and the PPI, using local directional weights for interpolation [21], and, finally, combining the PPI with the differences to estimate each channel of the final image.
Jeong et al. [22] proposed a method to improve image quality by estimating a pseudo-panchromatic image using an iterative linear regression model. It then performs directional demosaicing, a technique that combines the pseudo-panchromatic image with spectral differences to produce a final interpolated image. The process includes steps such as directional interpolation using the BTES method [23] and the calculation of weights to improve the accuracy of the final multispectral image.
Rathi and Goyal [9] proposed a method that uses the concept of the pseudo-panchromatic image and the spectral correlation between spectral bands to efficiently generate a complete multispectral image. It involves estimating a pseudo-panchromatic image from a mosaic image using convolution filters based on the probability of appearance of each spectral band [24] and binary masks. This pseudo-panchromatic image is then used to interpolate each spectral band to produce a multispectral image. The process iteratively improves the quality of the multispectral image by updating the pseudo-panchromatic image and estimating the spectral bands multiple times.
Liu et al. [25] proposed a new deep learning framework for multispectral demosaicing using pseudo-panchromatic images. The framework consists of two networks, the Deep PPI Generation Network (DPG-Net) and the Deep Demosaic Network (DDM-Net), which are used to generate and refine the PPI to improve image quality and recover high-frequency information in the demosaicing process. DPG-Net specifically focuses on improving the sharpness of the preliminary PPI to improve image resolution by learning the differences between the actual PPI and Mihoubi's blurred version [19], which ultimately leads to the production of the final refined PPI. DDM-Net uses bilinear interpolation to estimate missing pixel values in fragmented bands, followed by a neural network architecture that extracts color and texture features to improve image quality. By combining convolutional layers and loss functions, DDM-Net aims to minimize reconstruction errors and produce high-quality demosaiced images.
Zhao et al. [26] proposed a neural network model with two branches: adaptive features (DDMF) and edge infusion (PPIG). The proposed architecture combines weighted bilinear interpolation [21] to generate initial demosaiced images with adaptive adjustments of pixel values in the reconstructed multispectral images. It uses the DDMF module to generate convolution kernel weights that adapt to spatial and spectral changes, thus improving the accuracy of the demosaicing process. In addition, the PPIG edge infusion sub-branch integrates edge information to improve demosaicing accuracy in terms of spatial precision and spectral fidelity.
Most of the methods proposed to improve the spatial resolution of a multispectral image are based on complex steps during the demosaicing process. Our paper proposes a new approach based on a multi-shot method that happens before the demosaicing process.
3. Materials and Methods
3.1. The MSFA Moxel
The MSFA moxel is a grid of optical filters placed in front of the sensor of a multispectral camera to filter the incoming light into different spectral bands. Each pixel in the captured image is associated with a specific filter in the MSFA moxel, allowing light intensity to be measured in different parts of the electromagnetic spectrum. The MSFA allows the simultaneous acquisition of multispectral information during image acquisition by distributing the pixels on the image sensor according to their spectral sensitivity. The choice of MSFA size and the number of bands is essential for the acquisition and reconstruction of multispectral images. The MSFAs commonly used in the literature generally have the following two main characteristics:
Redundancy [27]: a band can have a probability of appearance greater than 1/n², where n represents the linear size of the MSFA moxel;
Non-redundancy [21]: each band has a probability of appearance of exactly 1/n².
In the case of bands with redundancy, the following two types of behavior can be observed:
These characteristics of the MSFA moxel directly affect the quality and resolution of the multispectral images obtained after the acquisition and reconstruction process. The selection of the appropriate MSFA moxel depends on the specific application requirements, such as the desired spectral resolution, sensitivity to different wavelengths, and camera hardware constraints.
Our camera uses a 4 × 4 filter with equal probability of band appearance to acquire mosaic images, where each band is sampled by two pixels. This moxel was chosen to balance the spatial distribution of pixels in sparse images [28]. This design is based on the color shade approach [12], which optimizes the spectral response of the filters and improves the quality of images acquired during a shot.
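The band-count bookkeeping above can be checked with a short sketch. The 4 × 4 layout below is a hypothetical stand-in for our actual moxel arrangement (shown in Figure 1a); only the counts matter here: eight bands, each sampled by two of the sixteen pixels, so every band has a probability of appearance of 2/16.

```python
import numpy as np

# Hypothetical 4x4 moxel layout with band indices 0-7, each appearing twice.
# The actual arrangement of our MSFA is the one shown in Figure 1a; this
# layout is only an illustrative assumption.
MOXEL = np.array([
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [2, 3, 0, 1],
    [6, 7, 4, 5],
])

def probability_of_appearance(moxel):
    """Fraction of moxel pixels assigned to each band."""
    bands, counts = np.unique(moxel, return_counts=True)
    return {int(b): c / moxel.size for b, c in zip(bands, counts)}

poa = probability_of_appearance(MOXEL)
# Each band is sampled by 2 of the 16 pixels: PoA = 2/16 = 1/8 for all bands.
```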
Figure 1a illustrates the spectral band arrangement of our MSFA moxel. This moxel is used throughout our study to construct mosaic images and in demosaicing multispectral images.
Figure 1b shows the filters' spectral responses in our MSFA moxel. The spectral response refers to how well the sensor detects and measures light in different spectral bands. This spectral response is given over the visible spectral interval [400 nm, 790 nm].
3.2. Dataset
In our simulation, we project the 31 image bands of the TokyoTech database (TT-31) [11] into the 8 bands corresponding to our MSFA. This projection is performed on the response of the MSFA filters of our camera. The use of this projection is important because it allows us to work with accurate data that reflect the conditions we encounter in the real world when making acquisitions with our camera. This allows us to reduce the dimensionality of the images while preserving the most relevant spectral information. Here are the steps in the projection process:
Determination of the desired number of bands for the resulting multispectral image, in our study, eight bands.
Definition of Gaussian filter full width at half maximum (FWHM) in nanometers; in our study, this width is 30 nm.
Calculating the standard deviation of the Gaussian filter corresponding to the defined FWHM is necessary because the shape of the Gaussian is determined by its standard deviation.
Calculation of the central wavelength of each Gaussian filter. We start at a distance of 3 times the standard deviation from the start wavelength, then move at a calculated interval between filters, and end at a distance of 3 times the standard deviation from the end wavelength. Subsequently, we round the values to the nearest integer and sample at the desired spectral interval.
Creation of Gaussian filters using a Gaussian function. Each filter is calculated based on the similarity between the spectral wavelength and the central wavelength of the filter. The greater the similarity, the higher the filter weight. Filters are normalized to ensure that their sum equals 1.
Recovery of the original 31-band image data, followed by filtering with the created Gaussian filters.
Multiplication of the Gaussian filters with the image data to perform the 8-band multispectral transformation, selecting the appropriate spectral bands.
The 430, 464, 498, 529, 571, 605, 645, and 680 nm bands used for projection result from optimization work with the genetic algorithm.
In this approach, it is assumed that there is no change in the illumination.
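The projection steps above can be sketched as follows. This is a minimal sketch, not our exact pipeline: the 400–700 nm, 10 nm-step sampling assumed for TT-31 and the random stand-in image are illustrative assumptions; the FWHM of 30 nm and the eight center wavelengths are those given in the text.

```python
import numpy as np

def gaussian_filters(wavelengths, centers, fwhm_nm=30.0):
    """Normalized Gaussian band filters; each row sums to 1 (step 5)."""
    # FWHM = 2*sqrt(2*ln 2) * sigma  (steps 2-3)
    sigma = fwhm_nm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    diff = wavelengths[None, :] - np.asarray(centers, float)[:, None]
    filters = np.exp(-0.5 * (diff / sigma) ** 2)
    return filters / filters.sum(axis=1, keepdims=True)

# Assumed TT-31 sampling: 31 bands from 400 to 700 nm in 10 nm steps.
wavelengths = np.arange(400.0, 701.0, 10.0)              # (31,)
centers = [430, 464, 498, 529, 571, 605, 645, 680]       # bands from the genetic algorithm

F = gaussian_filters(wavelengths, centers)               # (8, 31) filter bank
cube31 = np.random.rand(16, 16, 31)                      # stand-in for a TT-31 image
cube8 = np.tensordot(cube31, F, axes=([2], [1]))         # (16, 16, 8) projected image (step 6)
```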
3.3. Mosaicking Process to Obtain Sparse Images
A mosaic image captured by our camera produces 8 sparse images after grouping pixels with similar spectral properties. Since we will be working with fully defined images, we use our MSFA moxel to generate mosaics from them.
Figure 2 illustrates the mosaicking process with our MSFA moxel and the grouping of pixels with similar spectral properties into sparse images.
Figure 3 shows the spatial distribution of pixels in the sparse images of spectral band B1. The gray areas represent the available pixels, while the white areas represent the null pixels.
Our approach is to reduce the number of null pixels in these sparse images. We expect that reducing the number of null pixels will reduce reconstruction errors during the demosaicing process.
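The mosaicking and grouping steps can be sketched in a few lines. The moxel layout below is a hypothetical stand-in for the arrangement of Figure 1a; the mechanics (one band kept per pixel, then one sparse image per band with zeros elsewhere) are what the sketch illustrates.

```python
import numpy as np

# Hypothetical 4x4 moxel (the real arrangement is given in Figure 1a).
MOXEL = np.array([[0, 1, 2, 3],
                  [4, 5, 6, 7],
                  [2, 3, 0, 1],
                  [6, 7, 4, 5]])

def mosaic(cube):
    """Keep one band per pixel according to the tiled moxel pattern."""
    h, w, _ = cube.shape                     # h and w assumed multiples of 4
    pattern = np.tile(MOXEL, (h // 4, w // 4))
    rows, cols = np.indices((h, w))
    return cube[rows, cols, pattern], pattern

def sparse_images(mosaic_img, pattern, n_bands=8):
    """Group pixels of each band; every other position is a null pixel."""
    return [np.where(pattern == b, mosaic_img, 0.0) for b in range(n_bands)]

cube = 0.1 + 0.9 * np.random.rand(16, 16, 8)   # fully defined 8-band image
m, pattern = mosaic(cube)
sparse = sparse_images(m, pattern)
# Each sparse image keeps 2 of the 16 pixels of every moxel window (14 are null),
# and the 8 sparse images sum back to the mosaic.
```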
3.4. Conceptualization of the Method
Let us define B_i, i ∈ {1, …, 8}, as the original spectral bands that contain the fully defined information (pixels) of the eight bands obtained after projection.
Let us define M^0 as the mosaic obtained after the first snapshot, without sensor displacement. Synthetically, it is obtained by applying our MSFA moxel to the B_i bands.
Let us define M^(kj,Dj) as the mosaic obtained with a camera displacement of kj pixels, where kj ∈ {1, …, 3}, along the axis Dj, which can be either horizontal (H) or vertical (V). Synthetically, these mosaics are obtained by shifting the B_i bands by kj pixels along the Dj axis. This produces shifted bands B_i^(kj,Dj), which are mosaicked with our MSFA moxel.
Figure 4 illustrates the different mosaics obtained with a one-pixel camera displacement on the vertical axis (k1 = 1 and D1 = V) and a one-pixel camera displacement on the horizontal axis (k2 = 1 and D2 = H).
These mosaic matrices have the following shapes for a kj displacement:
The process of grouping pixels with similar spectral properties involves separating a mosaic image into its different spectral bands using a binary mask, which can be formulated as follows: m_i(x, y) = 1 if the MSFA moxel assigns band i to position (x, y), and m_i(x, y) = 0 otherwise.
For each mosaic M, we obtain a set of sparse images SI_i by applying the following formula: SI_i(x, y) = M(x, y) · m_i(x, y), where m_i is the binary mask of band i (Formula (2)). For any camera displacement, we obtain the sparse images SI_i^(kj,Dj), where i represents the index of a band of the MSFA moxel and kj represents the displacement scalar along the horizontal (H) or vertical (V) axis.
Figure 5 shows the density of pixels that are spectrally similar in band B1 of the mosaics M^(k1,V) and M^(k2,H). The gray areas represent the pixels available in the sparse image of band B1 from the first snapshot; the yellow areas represent those available in the sparse image of band B1 due to the camera's displacement of k1 pixels along the vertical axis; and the blue areas represent the pixels available in the sparse image of band B1 due to the camera's displacement of k2 pixels along the horizontal axis.
The positions of the non-null pixels vary in each sparse image, and these pixels have the same spectral properties. Therefore, the sparse images can be combined (composition method), i.e., added together, to increase the number of non-null pixels and reduce the number of null pixels. The pixels are redistributed according to the camera displacement combinations.
3.5. Sparse Image Composition
The sparse image composition method is performed in 3 steps, as shown in Figure 6. The first step is to take an initial snapshot of a scene. This snapshot provides a mosaic image M^0, which is decomposed into sparse images SI_i^0 using Formula (2). Then, we set the number N of compositions we want to make by specifying the displacement scalars kj and the axes Dj. Finally, we obtain composite sparse images, which contain more available pixels. The symbol “?” in the composite sparse images indicates the areas where new pixels can appear depending on the displacement combination. This composition method reduces the distance between two non-null pixels and is limited to three compositions; beyond three compositions, implementing such a method becomes very time consuming.
3.5.1. Case of the Composition of Two Sparse Images
For two bands, we obtain six possible compositions for the different values of the displacement scalar on the two axes H and V. The following algorithm shows how the composition of two bands is achieved:
The camera takes a first snapshot, from which we obtain a mosaic M^0;
The camera moves k pixel(s) along the D axis and takes a second snapshot, from which a second mosaic M^(k,D) is obtained;
The separation into sparse images is performed on the mosaics M^0 and M^(k,D) with Formula (2), resulting in sparse images SI_i^0 and SI_i^(k,D);
The addition of the two sparse images is performed, such that the composite sparse image is SI_i^0 + SI_i^(k,D).
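The steps above can be sketched as follows. This is a sketch under two stated assumptions: the 4 × 4 moxel layout is a hypothetical stand-in for Figure 1a, and the camera displacement is simulated with a circular `np.roll` on a synthetic scene cube, which is only an approximation at the image borders.

```python
import numpy as np

MOXEL = np.array([[0, 1, 2, 3],
                  [4, 5, 6, 7],
                  [2, 3, 0, 1],
                  [6, 7, 4, 5]])          # hypothetical layout (see Figure 1a)

def sparse_band(cube, pattern, b):
    """Sparse image of band b: pixels where the MSFA samples b, zeros elsewhere."""
    return np.where(pattern == b, cube[..., b], 0.0)

def compose_two(cube, b, k=1, axis='V'):
    """Sum the no-shift sparse image with the realigned one from a k-pixel shift."""
    ax = 0 if axis == 'V' else 1
    h, w, _ = cube.shape
    pattern = np.tile(MOXEL, (h // 4, w // 4))
    s0 = sparse_band(cube, pattern, b)
    shifted = np.roll(cube, k, axis=ax)            # scene as seen after displacement
    s1 = sparse_band(shifted, pattern, b)
    return s0 + np.roll(s1, -k, axis=ax)           # realign to scene coordinates

cube = 0.1 + 0.9 * np.random.rand(16, 16, 8)
comp = compose_two(cube, b=0, k=1, axis='V')
# Per 4x4 moxel window, the composite now holds 4 non-null pixels instead of 2,
# and every non-null value matches the true band value at that position.
```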
Figure 7 shows a composition of bands from the M^0 and M^(k,D) mosaics. The eight sparse images have globally the same pixel distributions, which vary according to the parameters kj and Dj. Thus, for a given camera displacement, the pixel distribution in the composite sparse images is the same, which justifies commenting only on spectral band B1 of each composition.
Figure 8 shows the six possible compositions of band B1 with different values of the displacement scalar kj on the V and H axes. The new composition provides more pixels and a better redistribution that minimizes the distance between non-null pixels in the composite sparse images.
3.5.2. Case of More Than Two Sparse Images
For more than two bands, we obtain more than 30 possible compositions for the different values of the displacement scalars kj on the axes Dj. The following algorithm shows how the composition of N bands (N > 2) is achieved:
The camera takes a first snapshot, from which a mosaic M^0 is obtained.
The separation into sparse images is performed on the mosaic using Formula (2), resulting in the sparse images SI_i^0.
The initialization step sets the composite sparse images to SI_i^0 and j to 1.
As long as j ≤ N:
The camera moves along the Dj axis by kj pixels from its position (0, 0) and takes a snapshot, and a new mosaic M^(kj,Dj) is obtained.
The new mosaic is decomposed using Formula (2), resulting in sparse images SI_i^(kj,Dj).
These sparse images are added to the previous composite sparse images, and the value of j is incremented.
In the end, we get the composite sparse images.
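The N-displacement loop above can be sketched as a small accumulation, under the same assumptions as before (hypothetical moxel layout, displacement simulated with a circular `np.roll`, each displacement taken from position (0, 0)):

```python
import numpy as np

MOXEL = np.array([[0, 1, 2, 3],
                  [4, 5, 6, 7],
                  [2, 3, 0, 1],
                  [6, 7, 4, 5]])          # hypothetical layout

def compose(cube, b, moves):
    """Accumulate the realigned sparse images of band b over N displacements.

    moves: list of (kj, Dj) pairs, each displacement taken from position (0, 0),
    e.g. [(1, 'V'), (1, 'H')]. Combinations that map a band onto its own
    positions would cause the pixel overlap mentioned in the text.
    """
    h, w, _ = cube.shape
    pattern = np.tile(MOXEL, (h // 4, w // 4))
    comp = np.where(pattern == b, cube[..., b], 0.0)    # initialization: SI^0
    for k, axis in moves:                               # loop while j <= N
        ax = 0 if axis == 'V' else 1
        shifted = np.roll(cube, k, axis=ax)             # snapshot after the shift
        s = np.where(pattern == b, shifted[..., b], 0.0)
        comp = comp + np.roll(s, -k, axis=ax)           # realign and accumulate
    return comp

cube = 0.1 + 0.9 * np.random.rand(16, 16, 8)
comp = compose(cube, b=0, moves=[(1, 'V'), (1, 'H')])   # three-band composition
# 6 of the 16 pixels of each moxel window are now non-null (2 + 2 + 2).
```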
Figure 9 illustrates the spatial distribution of pixels of certain three- and four-band compositions. The blue areas represent the H-axis displacement and the yellow areas represent the V-axis displacement.
The composition method redistributes pixels to provide more information and reduce the number of pixels to interpolate. It is important to note that with our MSFA moxel it is not possible to achieve a three-band composition with a displacement of two pixels on both the horizontal and vertical axes (k1 = k2 = 2 on the H and V axes). This would cause a problem of overlapping pixels at certain positions of the composite sparse image.
3.6. Bilinear Interpolation
To generate a fully defined image, we use the bilinear interpolation on the sparse images to deduce the null pixels according to the following Algorithm 1:
Algorithm 1: Bilinear interpolation.
Input: sparse_image, method
Output: InterpIMG
BEGIN
  Width = sparse_image.width
  Height = sparse_image.height
  XI = value grid going from 1 to height + 1
  YI = value grid going from 1 to width + 1
  Ind = coordinates of the data to interpolate
  Z = values at the non-null indices
  InterpIMG = grid_interpolation(Ind, Z, (XI, YI), fill_value = 2.2 × 10^−16)
END
We set the fill value to 2.2 × 10^−16 to avoid the zero value. This avoids having NaN values in our interpolated matrix. The grid_interpolation function is given in the following Algorithm 2.
Algorithm 2: grid_interpolation.
Input: points, values, grid, method, fill_value
// points: the coordinates of the data to interpolate
// values: the corresponding values at the data points
// grid: the grid on which to interpolate the data
// fill_value: the value to use for points outside the input grid
Output: InterpIMG
BEGIN
  For each point (x, y) in grid:
    If (x, y) is outside of the input points:
      Assign fill_value to InterpIMG(x, y)
    Else:
      Find the k (2 ≤ k ≤ 4) nearest data points within a rectangular grid, with 2 along each axis
      Calculate the weights for interpolation based on distance
      Interpolate the value at (x, y) using the input values and interpolation weights
      Assign the new value to InterpIMG(x, y)
    End If
  End For
END
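Algorithms 1 and 2 can be approximated with SciPy's `griddata`, used here as a stand-in for grid_interpolation: `method='linear'` performs piecewise-linear (Delaunay-based) interpolation, which is close in spirit to, but not identical with, bilinear interpolation on a rectangular grid, and the fill value matches the 2.2 × 10^−16 of the text.

```python
import numpy as np
from scipy.interpolate import griddata

def demosaic_band(sparse, fill_value=2.2e-16):
    """Interpolate the null pixels of one (possibly composite) sparse image.

    Mirrors Algorithm 1: the non-null entries are the known data points; grid
    points outside their convex hull get a tiny fill value instead of NaN.
    """
    h, w = sparse.shape
    rows, cols = np.nonzero(sparse)               # Ind: coordinates of known data
    values = sparse[rows, cols]                   # Z: values at non-null indices
    grid_x, grid_y = np.mgrid[0:h, 0:w]           # XI, YI
    return griddata((rows, cols), values, (grid_x, grid_y),
                    method='linear', fill_value=fill_value)

# A constant scene should be reconstructed exactly inside the known-pixel hull.
sparse = np.zeros((8, 8))
for r, c in [(0, 0), (0, 7), (7, 0), (7, 7), (3, 4)]:
    sparse[r, c] = 5.0
full = demosaic_band(sparse)
```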
3.7. The General Architecture of Our Method
The architecture in Figure 10 shows the general flow of our method. We start by projecting the images of TT-31 into 8 bands. Then, we create a mosaic from these bands, which represents the first snapshot of the sensor. The mosaic is decomposed into 8 sparse images that go through a composition method that depends on the displacement of the sensor horizontally or vertically. Each displacement provides a mosaic that is decomposed into 8 sparse images, which are added to the previous 8 sparse images to form the new composite sparse images. The composition process is repeated until the stop condition is reached. The final composite sparse images are demosaiced using a bilinear method to obtain the fully reconstructed images.
5. Discussion
The study’s results show a direct correlation between the number of compositions and the spatial resolution of the reconstructed image, especially when reconstructing abrupt transitions, non-homogeneous areas, and textured regions. The more compositions performed, the better the reduction of the distance between the non-null pixels of the sparse images, leading to a better spatial resolution after demosaicing. For each level of composition, there are differences in the qualitative and quantitative results depending on the values of the displacement scalars.
Several observations can be made about compositions involving two bands, where only one camera displacement is required. For certain images in Figure 11 and Figure 12, there is a preference for horizontal shifts, while for others in Figure 13, there is a preference for vertical shifts. Depending on the type of image, there is a clear improvement in abrupt transitions, non-homogeneous areas, and textured regions. Two-pixel displacements significantly improve local structures such as edges, textures, and patterns compared to the image obtained without band composition. This improvement is manifested in higher SSIM values (Table 3), a lower spectral similarity angle according to SAM (Table 2), and a lower reconstruction error according to RMSE (Table 4). However, reconstruction with less noise is observed with 1-pixel or 3-pixel displacements, as shown by PSNR (Table 1). The study highlights a significant correlation between the spatial distribution of the pixels in the sparse images and the quantitative and qualitative results after reconstruction. Indeed, a displacement of 2 pixels better reduces the distance between two non-null pixels of the sparse images, leading to less overlapping in abrupt transitions and improved visual restitution, as shown by the displacement (ad, ae) in Figure 11, Figure 12 and Figure 13. In conclusion, vertical shifts, especially those of 2 pixels, offer a good compromise between improving local structures and reducing noise in the reconstructed images. The study highlights the importance of considering the spatial distribution of pixels when planning camera shifts for optimal reconstruction.
In compositions with three bands and two camera displacements, there are 14 possible combinations of displacements on the horizontal and vertical axes. According to PSNR, 1-pixel or 3-pixel displacements on both axes result in a less noisy reconstruction. SSIM shows that the structural reconstruction is almost equivalent in most cases. Moving along the same axis results in higher spectral similarity and fewer reconstruction errors, as indicated by SAM and RMSE. Visual results show increased sharpness for displacements on the same axis, but decreased sharpness for displacements of 1 pixel on both axes and 3 pixels on both axes. In conclusion, displacements on the same axis provide an optimal compromise between the structural and spectral quality of the reconstruction. At the same time, other configurations offer specific advantages and disadvantages in terms of noise reduction and visual sharpness.
The visual results obtained are very close to the reference image for four-band compositions with three camera displacements, with 10 possible combinations. This suggests a satisfactory ability to reconstruct images with a high level of visual fidelity, although the metrics show weaker results than in the case of the three-band composition. However, this type of shift is not directly feasible in a real-time acquisition system due to the increased complexity of the camera displacement, so this type of composition is not necessary in real-time acquisition systems. Nevertheless, for this type of composition, displacements along the same axis show excellent visual results. This observation suggests that a limited camera displacement may be sufficient to significantly improve the quality of the reconstructed images without the excessive complexity of a bi-axial composition. In conclusion, four-band compositions can produce satisfactory visual results, but their practical implementation in a real-time acquisition system is limited by their displacement complexity. Simpler strategies, such as moving along a single axis, can provide significant improvements while reducing the difficulty of operational feasibility.
In practice, the implementation of our method is possible, in particular by using a tri-CCD system to capture and restore a scene with moving objects [33]. This acquisition system has a beam splitter that divides the incoming light toward two additional axes. The prism redirects the light to three sensors, each of which captures a mosaic of the same scene from a slightly different observation, providing three mosaics of the same scene with different spatial information distributions. For static objects, a micron-precision camera translation system would be required to capture and restore the fully defined image.
Figure 14 illustrates the operation of a tri-CCD system where each sensor is equipped with an MSFA.
The first MSFA filter is mounted on top of sensor 1 to obtain a mosaic with no information shift. The second MSFA filter is mounted on top of sensor 2 to obtain a mosaic with information shifted by 1 pixel on the horizontal axis. Finally, a third MSFA filter is mounted on top of sensor 3 to obtain a mosaic with information shifted by 1 pixel on the vertical axis.