In this section, we introduce our devised method to accomplish two objectives: (i) generating images associated with Android malware applications and (ii) distinguishing these synthetic images from images obtained from real-world Android malware applications.
3.3. Dynamic Analysis Generation
To generate the second dataset, we exploited dynamic analysis; in particular, we executed each application with the aim of extracting its system call trace, and from the obtained trace, we built the related image.
Figure 5 depicts this process.
To generate images by exploiting dynamic analysis, we recorded and stored system call traces generated by running Android applications in a textual format. To achieve this, we considered the Android Package (APK) file, which represents the installation file of an Android application (referred to as the “Mobile Application” in Figure 5). Subsequently, we generated a series of 25 distinct operating system events at regular 10-second intervals (referred to as “Event Injection” in Figure 5). These events were then dispatched to the emulator to stimulate the behavior of the malicious payload within the application. As a result, we obtained the corresponding sequence of system calls (referred to as “System Call Extraction” in Figure 5).
These 25 operating system events were selected based on prior research studies, including those by the authors in [32,33], which demonstrate that malicious actors employ this set of events to activate payloads within the Android environment. Specifically, we considered the operating system events used to trigger Android malware as exploited by the authors in [34].
The retrieval of system calls from the Android application under analysis was carried out using a script developed by the authors (referred to as the “Shell Script” in Figure 5). This script performs a sequence of actions as outlined below:
Initialization of the target Android device emulator.
Installation of the .apk file of the application under analysis on the Android emulator.
Waiting until the device reaches a stable state, typically when it is in an “epoll_wait” state and the application under analysis is awaiting user input or a system event.
Commencement of the retrieval of system call traces.
Dispatching one of the 25 selected operating system events to the application under analysis.
Capturing system calls generated by the application until a stable state is reached.
Selection of a new operating system event (i.e., the next one in the sequence) and repeating the above steps to capture system call traces for this new event.
Iterating through the previous step until all 25 operating system events have been used to stimulate the Android application.
Halting the capture of system calls and saving the acquired system call trace.
Terminating the process of the Android application under analysis.
Stopping the Android emulator.
Reverting the emulator’s disk to a clean snapshot, restoring it to its state before the analyzed Android application was installed.
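The steps above can be sketched as the ordered sequence of emulator/ADB commands one analysis run would issue. This is an illustration only, not the authors' actual shell script: the AVD name, snapshot name, trace path, broadcast action, and PID are hypothetical placeholders (a real script would discover the PID after launching the app).

```python
# Illustrative dry-run sketch of the analysis loop; nothing is executed here,
# the function only builds the command strings in the order the steps describe.

EVENTS = ["android.intent.action.BOOT_COMPLETED"]  # stand-in for the 25 events

def build_analysis_commands(apk_path, package, pid, snapshot="clean"):
    """Return the ordered command list for one analysis run (dry run)."""
    cmds = [
        f"emulator -avd test_device -snapshot {snapshot}",  # start the emulator
        f"adb install {apk_path}",                          # install the APK
        f"adb shell strace -p {pid} -o /sdcard/trace.txt",  # capture system calls
    ]
    for event in EVENTS:                                    # stimulate the payload
        cmds.append(f"adb shell am broadcast -a {event}")
    cmds += [
        f"adb shell am force-stop {package}",               # terminate the app
        "adb emu kill",                                     # stop the emulator
    ]
    return cmds
```

In a real run, each dispatched event would be followed by a wait until the process returns to a stable (e.g., "epoll_wait") state before the next event is sent.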
Moreover, to simulate user interaction with the Android operating system, we utilized the “monkey” tool from the Android Debug Bridge (ADB) version 1.0.32. This tool generates pseudo-random user events, including clicks, touches, and gestures. To collect the system call traces, we employed “strace”, a tool available on Linux operating systems. Specifically, we used the command “strace -p PID” to attach to the running Android application process and intercept only the system calls generated by that specific application. Once we had obtained a log of system calls, we extracted each individual call one by one, adhering to the order provided by the log, and used this information to construct an image (referred to as “Image Generation” in Figure 5). Each system call corresponds to a specific RGB pixel, allowing us to create the images pixel by pixel.
In Figure 6 and Figure 7, just as an example, we show the images obtained from both the static and dynamic analysis, starting from the same Android application, i.e., the GPS Fields Area Measure app (https://play.google.com/store/apps/details?id=lt.noframe.fieldsareameasure&hl=en_US, accessed on 17 June 2024), an app freely available on Google Play, the official Android market (https://play.google.com/, accessed on 17 June 2024), identified by the following package name: lt.noframe.fieldsareameasure. This app is typically used to measure an area, providing the related details about the distance and perimeter.
In both Figure 6 and Figure 7, a pixel of a given color represents a specific byte in the case of the static analysis and a specific system call in the case of the dynamic one.
3.5. GAN
DCGAN introduced a GAN architecture that employs CNNs to define both the discriminator and generator.
DCGAN provides several architectural guidelines aimed at enhancing training stability [27]:
Substituting pooling layers with strided convolutions in the discriminator and fractionally strided convolutions in the generator;
Incorporating batch normalization (i.e., batchnorm) in both the generator and the discriminator;
Eliminating fully connected hidden layers in deeper architectures;
Using ReLU activation for all generator layers with the exception of the output, which employs tanh;
Employing LeakyReLU activation in all discriminator layers.
Strided convolutions, characterized by a stride of 2, are convolutional layers employed for downsampling within the discriminator. Conversely, fractionally strided convolutions, also known as Conv2DTranspose layers, employ a stride of 2 for upsampling in the generator.
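The shape arithmetic behind these two layer types can be made explicit: with "same" padding, a stride-2 convolution halves each spatial dimension, while a stride-2 transposed convolution doubles it. A small sketch:

```python
# Spatial-size arithmetic for 'same'-padded stride-2 layers:
# strided convolution (discriminator) downsamples, transposed
# convolution (generator) upsamples.

def strided_conv_size(n, stride=2):
    # ceil(n / stride), the 'same'-padding output size
    return -(-n // stride)

def transposed_conv_size(n, stride=2):
    return n * stride

# Discriminator path: 28 -> 14 -> 7; the generator mirrors it: 7 -> 14 -> 28.
```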
In the domain of DCGANs, batch normalization (batchnorm) is employed in both the generator and the discriminator to improve the stability of GAN training. Batchnorm operates by normalizing its input, ensuring that it maintains a mean of zero and a variance of one. Typically, batchnorm is integrated after the hidden layer and before the activation layer.
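The normalization batchnorm performs can be sketched per feature as follows; this inference-style illustration omits the learned scale and shift parameters (gamma and beta) that a full batchnorm layer also applies.

```python
import math

def batchnorm(values, eps=1e-5):
    # Normalize a batch of values for one feature to zero mean and
    # (approximately) unit variance; eps guards against division by zero.
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

out = batchnorm([1.0, 2.0, 3.0])  # mean ~0, variance ~1 afterwards
```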
In both the generator and discriminator of DCGANs, four frequently employed activation functions include sigmoid, tanh, ReLU, and LeakyReLU.
The sigmoid function compresses numbers into the range (0, 1), where values close to 0 indicate fake and values close to 1 indicate real. Since the DCGAN discriminator performs binary classification, we employed the sigmoid activation function in its final layer.
Tanh (Hyperbolic Tangent) is an S-shaped function similar to sigmoid, but it is scaled and centered at 0, mapping the input values to the range of [−1, 1]. We applied tanh in the final layer of the generator. Consequently, our training images must be preprocessed to fall within the range of [−1, 1] to match the input requirements of the generator.
The Rectified Linear Activation (ReLU) function produces a zero output for negative input values and preserves the input value for non-negative inputs. In the generator, ReLU activation is utilized for all layers, except the output layer, where tanh activation is employed.
LeakyReLU behaves similarly to ReLU, but introduces a slight slope (determined by a constant alpha) for negative input values. We set the slope (alpha) to 0.2, as shown in [27]. Within the discriminator, LeakyReLU activation is used in all layers, except for the final layer.
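The four activation functions discussed above can be written out directly, with the DCGAN placement choices noted in the comments:

```python
import math

def sigmoid(x):
    # Discriminator output layer: squashes into (0, 1) for real/fake.
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Generator output layer: squashes into [-1, 1].
    return math.tanh(x)

def relu(x):
    # Generator hidden layers: zero for negatives, identity otherwise.
    return max(0.0, x)

def leaky_relu(x, alpha=0.2):
    # Discriminator hidden layers: slight slope alpha for negatives.
    return x if x >= 0 else alpha * x
```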
The generator and discriminator model training occurs concurrently.
The first step involves data preparation for training. In training a DCGAN, there is no necessity to split the dataset into training, validation, and test sets because we are not using the generator model for classification tasks. A set of images obtained from real-world Android malware was acquired using the procedure illustrated in Figure 8.
The training pipeline expects input images in the format (60,000, 28, 28), representing 60,000 grayscale training images with dimensions of 28 × 28. The loaded data retain a shape of (60,000, 28, 28) as they are in grayscale format.
To ensure compatibility with the tanh activation function used in the generator’s final layer, the input images are normalized to fall within the range of [−1, 1].
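This preprocessing step is a simple affine rescaling; a sketch of the forward mapping from 8-bit pixel values into [−1, 1], together with the inverse used when displaying generated images, follows.

```python
def normalize(pixel):
    # Map an 8-bit pixel value in [0, 255] to [-1, 1] (tanh range).
    return pixel / 127.5 - 1.0

def denormalize(value):
    # Inverse mapping, back to an integer pixel value in [0, 255].
    return round((value + 1.0) * 127.5)
```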
The main objective of the generator is to generate lifelike images that can trick the discriminator into perceiving them as genuine.
The generator receives random noise as the input and produces an image that closely resembles the training images. To ensure compatibility with the grayscale images of dimensions 28 × 28 being generated, the model architecture must ensure that the output of the generator is shaped as 28 × 28 × 1.
To achieve this, the generator undergoes the following steps:
It transforms the 1D random noise (latent vector) into a 3D shape using the Reshape layer.
The generator successively upsamples the noise using the Keras Conv2DTranspose layer (the fractionally strided convolution described in [27]) to attain the desired output image size, which, in our case, is a grayscale image with dimensions of 28 × 28 × 1.
The generator incorporates several crucial layers as its fundamental building blocks:
Fully connected layers, also known as dense layers, are primarily utilized for reshaping and flattening the noise vector.
Conv2DTranspose is utilized to upscale the image in the generation process.
BatchNormalization: utilized to enhance training stability, positioned after the convolutional layer and before the activation function.
In the generator, ReLU activation is employed in all layers except the output layer, where tanh activation is utilized.
To construct the generator model, we introduced a dense layer to facilitate the reshaping of the input into a 3D format. It is essential to specify the input shape within this initial layer of the model architecture.
Subsequently, the BatchNormalization and ReLU layers were integrated into the generator model. Next, the preceding layer was reshaped from 1D to 3D, followed by two upsampling operations using Conv2DTranspose layers with a stride of 2. This sequential process facilitated the transition from a 7 × 7 size to 14 × 14 and, ultimately, to 28 × 28, achieving the desired image dimensions.
Following each Conv2DTranspose layer, a BatchNormalization layer was added, succeeded by a ReLU layer.
Finally, a Conv2D layer with a tanh activation function was employed in the generator model. The generator model encompasses a total of 2,343,681 parameters, with 2,318,209 being trainable and the remaining 25,472 being non-trainable parameters.
Moving on, let us delve into the design of the discriminator model.
The discriminator functions as a binary classifier tasked with determining whether an image is real or fake. Its main objective is to precisely classify the given images.
However, there are a few distinctions between a discriminator and a typical classifier:
We utilized the LeakyReLU activation function in the discriminator.
The discriminator deals with two categories of input images: real images sourced from the training dataset labeled as 1 and fake images generated by the generator labeled as 0.
It is noteworthy that the discriminator network is typically smaller or simpler compared to the generator. This is because the discriminator has a relatively simpler task than the generator. In fact, if the discriminator becomes too powerful, it may impede the progress and improvement of the generator.
In formulating the discriminator model, we once more define a function. The discriminator takes as input either real images from the training dataset or fake images generated by the generator. These images have dimensions of 28 × 28 × 1, and the width, height, and depth are passed as arguments to the function.
In constructing the discriminator model, we incorporated the Conv2D, BatchNormalization, and LeakyReLU layers twice for downsampling. Following this, we introduced the Flatten layer and applied dropout. Finally, in the last layer, we employed the sigmoid activation function to yield a single value for binary classification.
The discriminator model encompasses 213,633 parameters, comprising 213,249 trainable parameters and 384 non-trainable parameters.
Within the framework of the considered DCGAN, we adopted the modified minimax loss, involving the utilization of the binary cross-entropy (BCE) loss function, as illustrated in [27].
It is necessary to calculate two distinct losses: one for the discriminator and another for the generator.
Regarding the discriminator loss, since the discriminator receives two sets of images (real and fake), we computed the loss for each group independently and then merged them to derive the overall discriminator loss.
Concerning the generator loss, our approach diverges from training G to minimize log(1 − D(G(z))), which would merely penalize the generator when the discriminator D correctly classifies fake images as fake. Instead, we concentrated on training the generator G to maximize log(D(G(z))), representing the probability that D incorrectly classifies the fake images as real. This encapsulates the modified (non-saturating) minimax loss strategy we employed: rather than letting the generator's gradients vanish when D confidently rejects its samples, G is directly rewarded for making D accept them as real.
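The two losses described above can be sketched with binary cross-entropy as follows. Here `d_real` and `d_fake` stand for the discriminator's sigmoid outputs on real and generated images (hypothetical variable names); the non-saturating generator loss maximizes log(D(G(z))) by minimizing −log(D(G(z))), i.e., BCE against the "real" label.

```python
import math

def bce(label, prediction):
    # Binary cross-entropy for a single prediction in (0, 1).
    return -(label * math.log(prediction) + (1 - label) * math.log(1 - prediction))

def discriminator_loss(d_real, d_fake):
    # Real images are labeled 1, fake images 0; the two parts are
    # computed independently and then merged, as described above.
    real = sum(bce(1.0, p) for p in d_real) / len(d_real)
    fake = sum(bce(0.0, p) for p in d_fake) / len(d_fake)
    return real + fake

def generator_loss(d_fake):
    # Non-saturating loss: train G so that D scores its samples as real.
    return sum(bce(1.0, p) for p in d_fake) / len(d_fake)
```

Note that the generator loss decreases as D's scores on fake images approach 1, which is exactly the "fool the discriminator" objective.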