1. Introduction
The paper addresses acoustic steganography and, more precisely, the hidden synchronization of acoustic steganographic channels. The scientific literature shows that this is an active topic and that acoustic steganography algorithms are constantly being improved. However, many authors ignore the significant problem of synchronizing the signal in which the data is embedded and often assume perfect synchronization. For practical implementations of steganographic systems this is an oversimplification, because achieving synchronization is a necessary condition for the effective extraction of the payload [1, 2].
Data transmission in a steganographic system is inextricably linked with synchronization. Without a synchronization mechanism, the moment at which the extraction procedure should start cannot be determined unambiguously. The extracted data is then effectively random, which means that the hidden transmission is inefficient.
The bit error rate (BER) was adopted as the measure of the efficiency of steganographic data transmission [3]. The use of hidden synchronization methods should therefore yield low BER values which, in combination with error detection and correction codes, will enable error-free transmission of the payload.
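As a brief illustration of the measure, the BER can be computed as the fraction of payload bits that differ between the embedded and the extracted sequence. The function name and the sample bit sequences below are hypothetical and serve only as an example.

```python
def bit_error_rate(sent, received):
    """Return the fraction of positions at which two bit sequences differ."""
    if len(sent) != len(received):
        raise ValueError("sequences must have equal length")
    if not sent:
        raise ValueError("sequences must not be empty")
    errors = sum(s != r for s, r in zip(sent, received))
    return errors / len(sent)


# Hypothetical payload and its extracted copy with two corrupted bits.
sent = [1, 0, 1, 1, 0, 0, 1, 0]
received = [1, 0, 0, 1, 0, 1, 1, 0]
print(bit_error_rate(sent, received))  # 2 errors in 8 bits
```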
Hidden synchronization may or may not degrade the quality of the cover work. It is therefore reasonable to look for synchronization methods that do not significantly degrade the quality of the signal carrying the payload. A cover work (original signal) with an embedded payload is often called a Stego Object or Stego Work [1, 2].
The paper presents four unique mechanisms that allow synchronization to be achieved on the receiving side. Three of the developed methods operate directly on the acoustic signal, while the fourth works in a higher layer and analyzes the structure of the decoded steganographic data stream. All of the new synchronization methods have been tested against the steganography paradigms: transparency, robustness, and data rate.
The remainder of the paper is organized as follows. Section 2 presents a short description of the state of the art in speech steganography. Section 3 describes one of the speech steganography methods. Technique development and implementation are presented in Section 4, and the study results in Section 5. Finally, we summarize the paper.
2. Related Work
Many published papers deal with acoustic steganography, and many solutions were analyzed for the purposes of this paper. The most popular and characteristic methods are presented below; they are sufficient to show how complex the problem of hiding information in an acoustic signal is.
Depending on where in the telecommunications chain the payload is embedded in and extracted from the speech signal, acoustic steganography algorithms can be divided into three groups [4].
The first variant modifies the operation of the selected speech codec. Such a mechanism was used, for example, in [5], where the G.729 codec codebook was modified, which allowed for a hidden data rate of 2 kbit/s. In [6], it is proposed to hide information by changing the values of the linear prediction coefficients. The authors presented the results of experiments involving coding of the original signal with various codecs (G.721, GSM, G.728, G.729). In [7], the iSAC codec (internet Speech Audio Codec) was analyzed, hiding 12 bits per frame, which corresponds to 400 bit/s.
In the second variant, information is hidden by modifying the data stream obtained at the output of the speech codec. Embedding is usually done by modifying an appropriate parameter or individual bits in the data stream. Solutions of this type are relatively easy to implement and provide a high payload transmission rate. They include a whole range of Least Significant Bit (LSB) methods, from the simplest implementations [8, 9, 10, 11] to more complex ones [12, 13, 14].
The third and last variant embeds the information in the speech signal just downstream of the analogue-to-digital converter, operating only on samples of the signal. The payload is embedded using various digital signal processing methods.
One way is to hide information by coding or manipulating the phase of the original signal [15, 16, 17, 18, 19]. Information hiding algorithms based on phase modification are characterized by high resistance to signal degrading factors and, depending on the carrier signal and the size of the analyzed data block, by a hidden transmission rate from single bits to even kilobits per second [19, 20]. A slightly different approach was proposed in [21]. Instead of embedding the information in the phase of the original signal, the authors proposed to embed an OFDM signal in the original signal and encode the payload by changing the phase angle of these additional harmonics. This method is characterized by a data rate of about 40 bit/s and is resistant to degradation factors occurring in real VHF radio links or during speech signal transmission in GSM cellular lines.
In [17, 22], a signal synchronization procedure was presented that shifts the receiving window, with a certain step, relative to the received signal and attempts to extract the embedded bits. If periodically repeated maxima indicative of bit detection were obtained, synchronization was considered to be achieved. In [20, 21, 23], a synchronization mechanism is described that is based on the phase analysis of selected harmonics of the signal in which the steganographic data is embedded.
Apart from the methods that modify the signal phase, there are methods that hide information by changing the amplitude spectrum of the signal [23, 24, 25, 26, 27, 28]. These methods are characterized by a hidden data rate ranging from several bits per second to several hundred bits per second. They are resistant to lossy compression, filtering, changes of the sampling frequency, and analogue-to-digital conversion.
Apart from the Fourier transform, which is often used in acoustic steganography, a number of publications are devoted to other transforms. In [29, 30, 31], a data hiding mechanism based on the quantization of wavelet transform coefficients was presented. In [29], a data rate of almost 300 kbit/s was achieved.
The use of the cosine transform to hide data is described in [32]. A data rate of 150 bit/s was achieved, as well as resistance to lossy compression and analogue-to-digital conversion.
An innovative approach to hiding information in acoustic signals has been proposed in [33, 34]. The authors propose to transform the acoustic signal into an image using the wavelet transform (A2IWT, Audio to Image Wavelet Transform). The information is then embedded using one of the known steganography methods for digital images. The features of steganographic algorithms based on signal to image transformation depend on the properties of the image steganography algorithm used in a given case.
The paper [35] proposes a mechanism of steganographic data embedding in acoustic signals using the Hermite transform. The presented method is characterized by high perceptual transparency, but the authors do not specify the data rate achieved. The method is resistant to signal noise and filtering.
An algorithm based on the statistical properties of the signal was presented in [36]. In [37], the probability density function for the speech signal was proposed. Assuming that the speech signal at the receiver input is the sum of the steganographic signal and the noise signal, in [36] the dependencies on the random variable of this signal were determined. Additional information is embedded in a single signal frame by appropriately scaling the amplitude. The data rate depends on the nature of the original signal and ranges from 172 bit/s for music signals to 40 bit/s for speech signals.
The papers [38, 39, 40] present results confirming that the echo signal can be used to create covert channels. These methods provide effective data extraction in the presence of many signal distorting factors, such as added noise, changes of the sampling frequency, filtering, lossy compression, and transmission over a VoIP link.
In [41, 42, 43], methods using the spread spectrum were presented. Algorithms of this type are relatively easy to implement and show good resistance to a wide variety of signal transformations. They meet the requirements for perceptual transparency and provide a hidden data rate of several dozen bits per second.
There is, therefore, a relatively small number of articles and technical descriptions of audio systems used to create hidden communication channels. There is a clear gap in this area of knowledge. This is due in part to the realization that the strength of the secret channel access key lies not in the number of key combinations but in its stealth, i.e., the method of embedding and extracting the steganographic sequence. The space of possible secret channel access key combinations is limited because it correlates strongly with the values of the cover signal. The article [44] presents a mathematical description and explanation of this vulnerability.
The works on steganographic algorithms presented in this section do not exhaust this extensive issue. It is enough to bear in mind that in electronic publication databases, after entering the keyword audio steganography, only for the years 2018–2019 we get 92 results in the IEEE database, 121 results in the Web of Science database, and as many as 299 results in the Scopus database. On the other hand, the Google Scholar search engine finds 4270 items.
3. Embedding and Extraction Algorithm
To study synchronization methods, a mechanism for embedding and extracting steganographic data is needed. Among the many methods presented in Section 2, one algorithm was selected for further analysis. The algorithm is presented in [26, 28] and described in detail in [45]. This method uses a narrowband speech signal as a carrier of steganographic data. Additionally, the authors showed that the method is resistant to a number of factors that degrade the steganographic signal during its transmission over a VoIP link.
For the purposes of this paper, a few minor changes were introduced to the algorithm. The signal frame size was set to 192 samples (24 ms). Moreover, the masking curve was determined using the psychoacoustic model of the MPEG-1 standard [46]. Hiding a single bit of information in the steganographic encoder consists of a bipolar modification of the amplitude spectrum of the original signal in two adjacent signal frames. The resulting steganographic data transmission rate was 20.83 bit/s.
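The quoted data rate follows directly from the frame size. The short check below is only a sketch; the 8 kHz sampling rate is an assumption inferred from 192 samples lasting 24 ms, and the constant names are illustrative.

```python
SAMPLE_RATE_HZ = 8000   # assumed: 192 samples / 24 ms
FRAME_SAMPLES = 192     # frame size given in the text
FRAMES_PER_BIT = 2      # one bit is hidden in two adjacent frames

frame_duration_s = FRAME_SAMPLES / SAMPLE_RATE_HZ
bit_duration_s = FRAMES_PER_BIT * frame_duration_s
data_rate_bps = 1.0 / bit_duration_s

print(f"frame: {frame_duration_s * 1000:.0f} ms, rate: {data_rate_bps:.2f} bit/s")
```

One bit therefore occupies 48 ms of signal, which gives the 20.83 bit/s stated above.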
The last modification of the original algorithm consisted in adding a feedback loop and a local decoder to the transmitting part. Their task is to check continuously whether the information bit embedded in the steganographic signal can be extracted correctly. In the steganography literature, such solutions are referred to as informed sender algorithms or “dirty paper codes” [47]. If an error is detected, a coefficient Ci is determined at the output of the local decoder for the curve SMRi(k) (Equation (1)). In the feedback loop, the value of Ci is chosen such that the instantaneous signal value at the output of the local decoder, which is also the average value R of the previous instantaneous values, exceeds the specified threshold Kmin.
It should be additionally emphasized here that the greater the value of the signal at the output of the local decoder, the greater the energy of the watermark signal. Therefore, it will be more resistant to possible disturbances. At the same time, the higher energy value of the watermark signal makes it “audible” to the user of the system.
4. Technique Development and Implementation
Combined with the data embedding and extraction procedure described in Section 3, the signal synchronization methods should allow for hidden data transmission in the selected telecommunications channel. The first method, Monotonic Phase Correction, and the second, Direct Spread Spectrum, consist of constructing a synchronizing signal and adding it to the steganographic signal. The third method, Pattern Insertion Detection, consists of inserting a synchronizing marker into the speech signal preceding the steganographic transmission. The fourth and last method, Minimal Error Synchronization, consists of the appropriate preparation of the steganographic information.
4.1. Monotonic Phase Correction
The synchronizing signal synthesis system is shown in Figure 1. The input signal here is a Stego Object (a signal in which steganographic information is embedded).
The SMR(k) masking curve was determined based on the psychoacoustic model of the MPEG-1 standard [46] according to the procedure described in [48]. For each frame and each harmonic component, the value of the correction factor was determined in accordance with the relationship in Equation (1),
where:
i—signal frame number,
k—harmonic number,
—sound pressure level for the i-th original signal frame,
—the minimum masking threshold for the i-th original signal frame,
—additional optional correction factor.
In an OFDM (Orthogonal Frequency Division Multiplexing) block, a signal is formed which is the sum of 14 harmonic components. The OFDM signal is contained in the band from 375 Hz to 500 Hz and from 3041.7 Hz to 3166.7 Hz.
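The count of 14 components is consistent with harmonics spaced at the reciprocal of the 48 ms frame duration. The sketch below enumerates the harmonics that fall into the two bands; the helper name and the small tolerance, which absorbs the rounding of the band edges quoted in the text, are assumptions made for illustration.

```python
import math


def pilot_harmonics(frame_duration_s, bands_hz, tolerance=0.01):
    """List (harmonic index, frequency in Hz) pairs inside the given bands.

    Harmonics are assumed to sit at integer multiples of 1 / frame duration.
    The tolerance (in harmonic-index units) absorbs rounded band edges.
    """
    spacing_hz = 1.0 / frame_duration_s
    harmonics = []
    for low_hz, high_hz in bands_hz:
        first = math.ceil(low_hz / spacing_hz - tolerance)
        last = math.floor(high_hz / spacing_hz + tolerance)
        harmonics.extend((k, k * spacing_hz) for k in range(first, last + 1))
    return harmonics


bands = [(375.0, 500.0), (3041.7, 3166.7)]
harmonics = pilot_harmonics(0.048, bands)
print(len(harmonics), [k for k, _ in harmonics])
```

With a 48 ms frame the spacing is about 20.83 Hz, so each band holds seven harmonics.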
Figure 2 shows a single OFDM signal frame and the corresponding amplitude spectrum. The OFDM signal frame duration is 48 ms. The OFDM signal phase is set as follows:
Figure 3 shows the synchronization system. The input signal is first fed to the phase angle scanner, whose task is to determine the value of the phase angle jitter [49, 50]. This jitter may arise from differences in accuracy between the clocks that drive the sampling circuits in the steganographic signal transmitter and receiver.
The next stage of the synchronizing system operation is the detection of pilot spectral lines. This procedure consists of checking whether a given pilot spectral line, after correcting its phase angle by the value of the determined jitter correction, has a phase angle value of zero. If the number of pilot spectral lines thus detected is greater than or equal to 4, then the input is assumed to be a steganographic signal and the algorithm moves to the timing step.
The time synchronization mechanism is based on the analysis of the cumulative phase of the signal. The cumulative phase is determined based on the recursive equation:
where:
i—number of the analyzed signal frame,
k—harmonic number of the synchronization spectral line, the constant component has the index k = 0,
—value of the phase angle of the k-th harmonic in the i-th frame.
Figure 4 and Figure 5 show the cumulative phase waveform for an exemplary steganographic signal in which a synchronizing signal was additionally embedded. The continuous line marks the course of the cumulative phase of the signal on the transmitting side (in the synthesis circuit), and the dashed lines mark the courses of the cumulative phases recorded on the receiving side (in the synchronizer circuit). Figure 4 shows the cumulative phase waveforms in the absence of synchronization, and Figure 5 shows them when synchronization is achieved. For the presented characteristics, the average segmental ratio of the steganographic signal energy to the synchronizing signal energy, determined over 5 ms fragments, was 21.62 dB.
The time synchronization procedure consists of an iterative search for the signal shift for which the distance between the expected and the measured value of the cumulative signal phase is the smallest. Because the OFDM signal is periodic, this minimum is sought in the set of distances determined for offsets ranging from 0 to 383 samples. There are many different methods of determining the distance between data sets [51]. This work is limited to determining the synchronization using the Euclidean distance, the Mahalanobis distance [52, 53], and the Fréchet distance [54, 55].
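The search over offsets can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes an 8 kHz sampling rate (so that a 48 ms frame holds 384 samples), uses a hypothetical subset of four zero-phase pilot harmonics on a noise-free synthetic signal, takes the cumulative phase to be the running sum of the per-frame pilot phases, and uses only the Euclidean distance.

```python
import cmath
import math

FRAME = 384                # samples per 48 ms frame at an assumed 8 kHz
PILOTS = [18, 19, 20, 21]  # hypothetical subset of the pilot harmonics
N_FRAMES = 2               # frames used to build the cumulative phase

# Precomputed DFT kernels, one per pilot harmonic.
KERNELS = {k: [cmath.exp(-2j * math.pi * k * n / FRAME) for n in range(FRAME)]
           for k in PILOTS}


def make_signal(delay, length):
    """Zero-phase pilot tones whose frame boundary lies `delay` samples in."""
    return [sum(math.cos(2 * math.pi * k * (n - delay) / FRAME) for k in PILOTS)
            for n in range(length)]


def pilot_phase(frame, k):
    """Phase angle of harmonic k from a single-bin DFT of one frame."""
    return cmath.phase(sum(x * w for x, w in zip(frame, KERNELS[k])))


def cumulative_phase(signal, offset):
    """Running sum over frames of each pilot phase (assumed definition)."""
    totals = [0.0] * len(PILOTS)
    trajectory = []
    for i in range(N_FRAMES):
        start = offset + i * FRAME
        frame = signal[start:start + FRAME]
        for j, k in enumerate(PILOTS):
            totals[j] += pilot_phase(frame, k)
        trajectory.extend(totals)
    return trajectory


def find_offset(signal, expected):
    """Return the shift in 0..FRAME-1 with the smallest Euclidean distance."""
    def distance(offset):
        measured = cumulative_phase(signal, offset)
        return math.sqrt(sum((m - e) ** 2 for m, e in zip(measured, expected)))
    return min(range(FRAME), key=distance)


signal = make_signal(delay=184, length=3 * FRAME)
expected = [0.0] * (N_FRAMES * len(PILOTS))  # zero-phase pilots on the sender side
print(find_offset(signal, expected))  # 184 for this synthetic signal
```

The Mahalanobis or Fréchet distance could be substituted inside `find_offset` without changing the structure of the search.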
Figure 6, Figure 7 and Figure 8 show the total distance between the expected cumulative phase of the signal and the measured phase for the above-mentioned metrics. The figures also mark the minimum value of the determined distance. In the analyzed case, every metric reached its minimum for a signal shift of 184 samples.
4.2. Direct Spread Spectrum
The synchronizing signal generation circuit is shown in Figure 9. The input signal here is a Stego Object.
The SMR(k) masking curve was determined as described in the Monotonic Phase Correction method.
In an OFDM block, a signal is formed which is the sum of 6 harmonic components. The OFDM signal is contained in the band from 416.7 Hz to 500 Hz and from 3083.3 Hz to 3166.7 Hz. The OFDM signal frame duration is 24 ms. All harmonics of the OFDM signal act as pilot spectral lines. In the implementation, the value of the phase angle was assumed to be equal to 0.
The second component of the synchronization signal, next to the OFDM signal, is the DSS (Direct Spread Spectrum) signal. The block diagram of the DSS signal generation system is shown in Figure 10.
The first stage of DSS signal synthesis is the generation of a pseudo-random sequence with appropriate properties [56]. Gold sequences and primitive polynomials were used:
with initial condition
The length of the Gold sequence used to generate the DSS signal has been limited to 6096 symbols. The duration of a single symbol has been set to Tc = 1 ms. The duration of the entire sequence is therefore T = 6.096 s.
In the next stage, the generated pseudo-random sequence is fed to the input of the filter block. This block consists of an interpolation filter with a root raised cosine (RRC) characteristic followed by a low-pass FIR (Finite Impulse Response) filter. These filters shape the pulses of the pseudo-random sequence and narrow the signal band. Figure 11 shows a fragment of the signal at the output of the low-pass filter. The corresponding fragment of the pseudo-random sequence is also marked (top picture, red dotted line).
The final step in generating the DSS signal is to transfer the signal from the low-pass filter output to a higher range of audio frequencies. This is due to the fact that frequencies below 300 Hz can be strongly suppressed during signal transmission in telecommunications links. The frequency of the carrier wave used is fc = 2000 Hz.
Spread spectrum systems are characterized by two important parameters: the processing gain G and the interference margin M [57]. The processing gain determines the degree to which the spectrum of the information signal is spread:
$$G = \frac{T_b}{T_c}$$
where:
Tb—duration of the data bit—synchronization bit,
Tc—the duration of the spreading sequence chip.
Assuming that the duration of the sync bit equals the duration of the spreading sequence, Tb = T = 6.096 s, the processing gain of the considered system is G = 37.85 dB. The obtained value of the processing gain meets the condition related to perceptual transparency.
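The quoted value can be checked by taking the processing gain as the ratio of the bit duration to the chip duration and expressing it in decibels. The variable names below are illustrative.

```python
import math

T_CHIP_S = 1e-3   # chip duration Tc = 1 ms
N_CHIPS = 6096    # length of the spreading sequence in symbols

t_bit_s = N_CHIPS * T_CHIP_S           # Tb = T = 6.096 s
gain = t_bit_s / T_CHIP_S              # G = Tb / Tc
gain_db = 10 * math.log10(gain)

print(f"G = {gain:.0f} ({gain_db:.2f} dB)")
```

The ratio equals the number of chips per synchronization bit, so lengthening the sequence or shortening the chip raises the gain.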
The interference margin M is a measure of the receiver’s immunity to interference. It determines the maximum ratio of the noise power to the signal power at the receiver input at which the minimum ratio of bit energy to noise power spectral density Eb/N0 that ensures an acceptable error probability is still obtained [56].
After the DSS and OFDM signals are generated, they are fed to the input of the amplitude correction circuit and then summed with the steganographic signal. Both the OFDM signal and the DSS signal are corrected based on the SMR(k) masking curve. An additional condition related to the correction of the signal energy is introduced for the DSS signal. The correction factor Ci (Equation (1)) is chosen such that for each signal frame the following condition is satisfied:
where:
—i-th frame of steganographic signal, noise signal for DSS signal,
—i-th frame of DSS signal.
In the receiving part, the synchronization procedure is based on a system similar to that shown in Figure 3. The difference lies in the principle of operation of the time synchronization block.
The time synchronization procedure consists of determining the value of the cross-correlation function between the received signal (corrected by the determined phase angle correction) and the reference signal generated on the receiving side. The received signal is shifted relative to the reference signal. The signals are considered synchronized when the cross-correlation function reaches its maximum for τ = 0 s.
Figure 12 shows an example of the course of the cross-correlation function value determined in the time synchronization block.
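The principle of correlation-based time synchronization can be sketched as below. This is a simplified illustration under stated assumptions: a seeded random ±1 sequence stands in for the filtered and up-converted Gold sequence, the received signal is the delayed reference plus Gaussian noise, and all names and parameter values are hypothetical.

```python
import random


def cross_correlation(received, reference, lag):
    """Correlate the reference with the received signal shifted by `lag`."""
    return sum(received[lag + n] * chip for n, chip in enumerate(reference))


def estimate_delay(received, reference, max_lag):
    """Return the lag in 0..max_lag with the largest cross-correlation."""
    return max(range(max_lag + 1),
               key=lambda lag: cross_correlation(received, reference, lag))


rng = random.Random(1)
reference = [rng.choice((-1.0, 1.0)) for _ in range(1000)]  # stand-in sequence
true_delay = 137
max_lag = 200

# Received signal: unit-variance noise with the delayed reference added to it.
received = [rng.gauss(0.0, 1.0) for _ in range(len(reference) + max_lag)]
for n, chip in enumerate(reference):
    received[true_delay + n] += chip

estimated = estimate_delay(received, reference, max_lag)
print(estimated)
```

Shifting the received signal by the estimated delay places the correlation maximum at τ = 0, which is the synchronization condition described above.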
4.3. Pattern Insertion Detection
The synchronization method consists of inserting a synchronizing marker into the speech signal preceding the steganographic transmission. This method was developed for the needs of a steganographic system, which was designed to embed and extract data in a VoIP stream.
The principle of operation is based on the analysis of the signal that will be the information carrier. The start of steganographic transmission is determined by the detection of the speech signal. The presence of speech is detected by analyzing the values of two parameters [58]:
ZCR (Zero Crossing Rate)
where:
—i-th frame of signal,
N—number of samples in the signal frame.
The duration of a single frame was set to 24 ms (192 samples). The implementation assumes that a speech signal is present when the power value exceeds −50 dB and the zero crossing rate is less than 0.5. If three consecutive signal frames meet these conditions, the frames are corrected according to the attenuation pattern whose characteristics are shown in Figure 13. The characteristics of the attenuation pattern were established empirically in preliminary research on the method. Document [59] states that the permissible IP (Internet Protocol) packet loss during a conversation should not exceed 3%. This value additionally depends on the speech codec used during communication and assumes the use of a Packet Loss Concealment (PLC) mechanism. The adopted attenuation pattern reduces the power of the speech signal in an 11 ms window, which is similar to the typical frame lengths of speech codecs used in VoIP. In three consecutive signal frames (72 ms), this reduction of the signal power occurs twice, see Figure 14.
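The speech detection rule can be sketched as follows. The exact formulas for the two parameters are not reproduced here, so the sketch assumes that power is the mean square of the samples in dB relative to a full scale of 1.0 and that the zero crossing rate is the number of sign changes divided by the number of samples N. The function names are illustrative.

```python
import math

POWER_THRESHOLD_DB = -50.0  # threshold given in the text
ZCR_THRESHOLD = 0.5         # threshold given in the text
FRAME_SAMPLES = 192         # 24 ms at an assumed 8 kHz sampling rate


def frame_power_db(frame):
    """Mean-square power in dB (assumed relative to a full scale of 1.0)."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10 * math.log10(mean_square) if mean_square > 0 else float("-inf")


def zero_crossing_rate(frame):
    """Number of sign changes divided by the number of samples (assumed form)."""
    signs = [1 if x >= 0 else -1 for x in frame]
    crossings = sum(a != b for a, b in zip(signs, signs[1:]))
    return crossings / len(frame)


def is_speech_frame(frame):
    return (frame_power_db(frame) > POWER_THRESHOLD_DB
            and zero_crossing_rate(frame) < ZCR_THRESHOLD)


def marker_should_be_inserted(frames):
    """True when three consecutive frames are all classified as speech."""
    return len(frames) == 3 and all(is_speech_frame(f) for f in frames)


# Synthetic test frames: a 200 Hz tone, silence, and a sign-alternating signal.
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000 + 0.1)
        for n in range(FRAME_SAMPLES)]
silence = [0.0] * FRAME_SAMPLES
alternating = [0.5 * (-1) ** n for n in range(FRAME_SAMPLES)]

print(is_speech_frame(tone), is_speech_frame(silence), is_speech_frame(alternating))
```

The receiver can reuse the same two measurements when it looks for the attenuated marker in the incoming signal.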
The next stage, after the signal correction procedure (inserting a synchronizing marker into the signal), consists of embedding a portion of steganographic data in the speech signal according to the algorithm described in Section 3. The data portion size was set to 16 bits. The algorithm then restarts and again waits for the detection of a speech signal.
In the receiving part, the synchronization procedure consists of continuously checking whether the currently analyzed signal fragment includes a synchronizing marker in its structure. This process is based on the analysis of the signal energy and the number of zero crossings according to Equations (8) and (9).
4.4. Minimal Error Synchronization
This method was developed for the needs of a steganographic system designed to embed and extract data in a VoIP stream. The method was inspired by the cell delineation mechanism used in ATM (Asynchronous Transfer Mode) networks [60, 61]. The purpose of the MES method is to recognize the steganographic transmission solely on the basis of the decoded bitstream, without the use of additional tags or unique sequences. The method of embedding and extracting steganographic data remains unchanged, as described in Section 3. It was assumed that the steganographic data extraction procedure would not know whether steganographic information was being transmitted at a given moment and that the extraction would always return a certain bitstream. Moreover, it was assumed that the steganographic data would be formed into a frame constructed in such a way that it would be possible to recognize it unambiguously in the bitstream after extraction, and that it would be resistant to 5% RTP packet loss.
The problem of recognizing a data structure in a bitstream is often solved by using a unique preamble or flag. However, in conditions of significant losses, and thus also distortions, such a mechanism cannot be used because it would generate incorrect frame recognition too often. Moreover, it is desirable that the data organization used should provide redundancy to repair bits corrupted due to RTP packet loss.
Transmission errors caused by the loss of RTP packets can be detected and corrected using error detection and correction codes (Error Correction Codes). There are many codes that can detect and correct errors. For the purposes of the paper, it was decided to use BCH (Bose–Chaudhuri–Hocquenghem) codes. The choice of the BCH code was driven, on the one hand, by the need to correct the errors caused by the assumed percentage of lost RTP packets and, on the other hand, by the need to keep the information overhead as low as possible. Ease of implementation of the target steganographic system was also of great importance, because BCH encoding and decoding procedures are included in the Linux kernel.
BCH codes have strictly defined parameter values (n, k, t),
where:
n specifies the length (in bits) of the code vector, n = 2^m − 1,
m is an integer, m ≥ 3,
k specifies the length (in bits) of the information vector,
t is the error-correcting capability of the code (the number of bit errors that can be corrected).
To determine the appropriate variant of the BCH code, which will enable the protection of steganographic transmission in the VoIP channel with RTP packet loss at the level of 5%, simulation tests were carried out. Two VoIP channel models were designed in the Matlab/Simulink environment:
Each time, the input signal was a speech signal with a duration of about 2 min, containing more than 2000 bits of payload. Packet loss was varied from 0 to 5% in steps of 1%. The payload was extracted on the receiving side. In the next step, the maximum number of errors recorded in a given observation window was determined. The observation window was shifted through the received vector one bit at a time.
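The sliding-window error count described above can be sketched as follows. The function name and the sample bit vectors are hypothetical; the window is moved one bit at a time and the largest number of errors seen in any window position is returned.

```python
def max_errors_in_window(sent, received, window):
    """Largest number of bit errors found in any window of `window` bits."""
    errors = [int(s != r) for s, r in zip(sent, received)]
    if window < 1 or window > len(errors):
        raise ValueError("window must be between 1 and the vector length")
    current = sum(errors[:window])
    worst = current
    for i in range(window, len(errors)):
        # Slide by one bit: add the entering bit, drop the leaving bit.
        current += errors[i] - errors[i - window]
        worst = max(worst, current)
    return worst


# Hypothetical payload (all zeros) and a received vector with six bit errors.
sent = [0] * 10
received = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]
print(max_errors_in_window(sent, received, window=3))
```

For a window equal to the code vector length n, the returned value indicates the correcting capability t that a code would need under the tested loss conditions.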
Table 1 shows the maximum number of errors found in a received vector of size d bits.
Due to the specific values of the BCH code parameters, the codes listed in Table 2 were selected for further analysis. The table also shows the steganographic data rate R after taking the code rate into account and the minimum signal duration T required to embed the n bits of the code vector in the signal.
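The two quantities can be derived from the raw embedding rate of Section 3, where one bit occupies two 24 ms frames. The sketch below is illustrative only; it uses the two n = 127 variants named later in this section as examples and does not reproduce the values of Table 2.

```python
RAW_RATE_BPS = 1 / 0.048  # one embedded bit per 48 ms (two 24 ms frames)


def effective_rate_bps(n, k):
    """Payload rate R after accounting for the code rate k / n."""
    return RAW_RATE_BPS * k / n


def min_signal_duration_s(n):
    """Minimum signal duration T needed to embed the n bits of a code vector."""
    return n / RAW_RATE_BPS


for n, k in [(127, 50), (127, 15)]:
    print(f"BCH({n}, {k}): R = {effective_rate_bps(n, k):.2f} bit/s, "
          f"T = {min_signal_duration_s(n):.3f} s")
```

A lower code rate buys more correcting capability at the cost of a proportionally lower payload rate, while T depends only on the code vector length.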
The next stage of work on the method was to estimate the probability of type I errors, that is, of random data being accepted as a valid frame. To this end, 10^7 random bit sequences of length n were generated for each variant of the BCH code, and it was checked whether the BCH algorithm would qualify such a sequence as a BCH code vector. The results are presented in Table 3. For codes with a code vector length of n = 63, the probability of type I errors was considered too high. Two variants of the code with a length of n = 127 were selected for further analysis:
n = 127, k = 50, t = 13;
n = 127, k = 15, t = 27.
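An analytical estimate can complement the simulation. Assuming a bounded-distance decoder that accepts any word within Hamming distance t of a codeword, the decoding spheres around the 2^k codewords are disjoint, so the fraction of random n-bit words that is accepted can be counted directly. This is a sketch under that assumption and is not the Monte Carlo procedure used in the paper; the function name is hypothetical.

```python
from math import comb


def false_lock_probability(n, k, t):
    """Fraction of all n-bit words lying within distance t of some codeword.

    Assumes a bounded-distance decoder and disjoint decoding spheres, which
    holds when the minimum distance of the code is at least 2t + 1.
    """
    sphere_size = sum(comb(n, i) for i in range(t + 1))
    return (2 ** k) * sphere_size / (2 ** n)


for n, k, t in [(127, 50, 13), (127, 15, 27)]:
    print(f"BCH({n}, {k}, {t}): {false_lock_probability(n, k, t):.2e}")
```

As a sanity check, the perfect Hamming code with parameters (7, 4, 1) gives a value of exactly 1, because every 7-bit word lies within distance 1 of a codeword.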
For the purposes of transmission, the code vector was interleaved.
In the receiving part, the synchronization procedure consists of continuously checking whether the BCH decoder can recognize the data frame in the extracted bitstream. If the BCH decoder determines that there are no errors, or detects and corrects the errors, then synchronization is assumed to be achieved. Otherwise, if the BCH decoder reports a decoding failure, the speech signal is shifted by a certain number of samples and the steganographic data extraction and BCH decoding procedures are repeated.
It should be emphasized that the main disadvantage of the presented method is the high computational complexity related to the continuous operation of the steganographic decoder and the BCH decoder. On the other hand, it should also be noted that this is a method that does not interfere with the steganographic signal in any way. Therefore, there will be no deterioration in signal quality.
6. Conclusions
The paper describes four new mechanisms that allow synchronization in acoustic steganography systems. All of these methods have been tested against transparency, robustness, and data rate.
The presented research results regarding the objective and subjective assessment of the quality of signals in relation to the developed methods of synchronization confirm the initial assumption that the use of hidden synchronization of acoustic signals will not significantly deteriorate the quality of the signal being the information carrier.
The presented research results on steganographic transmission in real telecommunications channels allow us to conclude that the use of hidden synchronization of acoustic signals increases the efficiency of steganographic data transmission in a telecommunications channel with signal degrading factors.
Machine learning algorithms can help increase the effectiveness of acoustic synchronization mechanisms. These algorithms build a mathematical model from sample data, called the training set. Machine learning methods that may prove helpful in the synchronization recovery process include Decision Tree Learning, for acquiring knowledge from examples with numerous variants; Bayesian Learning, as a form of probabilistic inference; and Instance-based Learning, for modelling the synchronization procedure based on previous sample solutions. Machine learning has already been used for synchronization recovery in Forward Error Correction enabled channels [74] and for network time synchronization [75]. Further work on synchronization in acoustic steganographic channels should also cover the implementation of machine learning algorithms.