1. Introduction
In recent years, cybercrime, which accounts for a 67% increase in security-breach incidents, has been one of the most challenging problems facing national security systems worldwide [1]. Deepfakes (i.e., realistic-looking fake media generated by deep-learning algorithms) are widely used to swap faces or objects in video and digital content. This artificially synthesized content can significantly complicate the assessment of legitimacy because of the wide variety of applications and formats in which deepfakes appear online (i.e., audio, image and video).
Considering the speed, ease of use, and reach of social media, persuasive deepfakes can rapidly influence millions of people, destroy the lives of their victims and harm society in general [1]. The generation of deepfake media spans a wide range of intentions and motivations, from revenge porn to political fake news. Rana Ayyub, an investigative journalist in India, became a target of this practice when a deepfake sex video showing her face on another woman’s body was circulated on the Internet in April 2018 [2]. Deepfakes have also been used to falsify satellite images with non-existent landscape features for malicious purposes [3].
There are numerous captivating applications of deepfakery in video compositing and portrait transfiguration, especially in identity protection, since it can replace faces in photographs with ones drawn from a collection of stock images. Cyber-attackers, using various strategies beyond deepfakery, constantly aim to penetrate identification or authentication systems to gain illegitimate access. Identifying deepfake media with forensic methods therefore remains an immense challenge, since attackers leverage newly published detection methods and immediately incorporate countermeasures into the next generation of deepfake generators. With the massive usage of the Internet and social media, and billions of images available online, trust among social media users has eroded substantially. Deepfakes are a significant threat to our society and to digital evidence in courts, so it is highly important to develop state-of-the-art techniques that can identify deepfake media under criminal investigation.
As demonstrated in Table 1 (inspired by the figure presented in [1]), tampering with evidence, scams and frauds (i.e., fake news), digital kidnapping associated with ransomware blackmailing, revenge porn and political sabotage are among the types of deepfake activity with the highest intent to mislead [1].
The first deepfake content published on the Internet was a celebrity pornographic video created by a Reddit user (named deepfake) in 2017. The Generative Adversarial Network (GAN) was first introduced in 2014 and initially used for image-enhancement purposes only [4]. Since that first published deepfake, however, the malicious use of GAN technology has proven unavoidable: in 2017, GANs were used for the first time to generate new facial images for malicious purposes [5]. Other deepfake-based applications such as FakeApp and FaceSwap followed, and in 2019 Deepnude was developed, providing undressed videos of the input data [6]. The widespread strategies used to manipulate multimedia files can be broadly grouped into four major categories: copy–move, splicing, deepfake, and resampling [7]. Copy–move, splicing and resampling involve, respectively, repositioning the contents of a photo, overlapping regions of multiple photos into a new one, and manipulating the scale and position of components of a photo. The final goal is to deceive the viewer into perceiving more components in the photograph than were initially present. Deepfake media, however, leverage powerful machine-learning (ML) techniques that have significantly improved content manipulation. A deepfake can be considered a type of splicing in which a person’s face, voice, or actions in a piece of media are swapped with a fake target [8]. A wide set of cybercrime activities is associated with this type of manipulation, and while spreading deepfakes is easy, correcting the record and avoiding them is harder [
9]. Consequently, it is becoming harder for machine-learning techniques to identify the convolutional traces of deepfake generation algorithms, as doing so requires frequency-specific anomaly analysis. The basic algorithms long used to train models for deepfake detection, such as the Support Vector Machine (SVM), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN), are now being coupled with multi-attentional [10] or ensemble [11] methods to increase performance and address the weaknesses of individual methods. As proposed by [12], implementing an ensemble of standard and attention-based data-augmented detection networks can avoid the generalization issue of previous approaches. It is therefore highly important to identify the most suitable algorithms for the backbone layers of multi-attentional and ensembled architectures. Because the generation of deepfake media only started in 2017, academic writing on the problem is meager [13]. Most of the developed and published methods focus on deepfake videos. The main difference between deepfake video- and image-detection methods is that video-detection methods can leverage spatial features [14], spatio-temporal anomalies [15] and supervised domain [16] to draw a conclusion on the whole video by aggregating the inferred output both over time and across multiple faces. Deepfake image-detection techniques, however, have access to a single face image and mostly leverage pixel- [17] and noise-level [18] analysis to identify the traces of the manipulation method.
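The ensemble methods mentioned above can be illustrated, at their simplest, as soft voting over the per-image probabilities produced by several independently trained detectors. The sketch below is a minimal illustration with made-up scores and a hypothetical `ensemble_predict` helper, not the implementation proposed in the cited works:

```python
import numpy as np

# Hypothetical fake-probabilities for 4 face images, as produced by
# three independently trained detectors (values are illustrative).
p_cnn = np.array([0.92, 0.10, 0.55, 0.81])
p_svm = np.array([0.88, 0.20, 0.40, 0.76])
p_rnn = np.array([0.95, 0.05, 0.62, 0.70])

def ensemble_predict(member_probs, threshold=0.5):
    """Soft voting: average the member probabilities, then threshold."""
    avg = np.mean(member_probs, axis=0)
    return avg, (avg >= threshold).astype(int)

avg, labels = ensemble_predict([p_cnn, p_svm, p_rnn])
print(labels)  # one 0/1 'fake' label per image: [1 0 1 1]
```

Averaging tends to smooth out the idiosyncratic errors of any single backbone, which is one intuition behind the generalization gains reported for ensembled detectors.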
Therefore, identifying the most reliable methods for face-image forgery detection that rely on a convolutional neural network (CNN) as the backbone of a binary classification task could provide valuable insight for the future direction of deepfake-detection techniques. The overall approach taken in this work is illustrated in Figure 1.
DenseNet has shown significant promise in the field of facial recognition. As an extension of the residual CNN (ResNet) architecture, DenseNet addresses the low-supervision problem of its counterparts by introducing direct between-layer connections within dense blocks. Inside each dense block, every layer receives the concatenated feature maps of all preceding layers, while transition layers (essentially convolution, average pooling, and batch normalization between consecutive dense blocks) compress and downsample the concatenated maps. As a result, gradients from the initial input and the loss function are shared by all layers. This design reduces the number of required parameters and feature maps, and consequently yields a less computationally expensive model. We therefore decided to test DenseNet’s capabilities and compare it with other neural network architectures.
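The dense connectivity described above can be sketched with a toy NumPy model: each "layer" consumes the concatenation of all preceding feature maps and contributes a fixed number of new maps (the growth rate). The random 1x1 mixing below stands in for a learned convolution; it is an illustration of the connectivity pattern only, not DenseNet itself:

```python
import numpy as np

def dense_block(x, num_layers, growth_rate, rng):
    """Toy dense block: each layer maps the concatenation of all previous
    feature maps to `growth_rate` new maps, which are concatenated back on.
    Feature maps are (channels, height, width); the 'convolution' here is a
    random 1x1 channel mixing for illustration only."""
    for _ in range(num_layers):
        c = x.shape[0]
        w = rng.standard_normal((growth_rate, c))       # 1x1 conv weights
        new_maps = np.tensordot(w, x, axes=([1], [0]))  # mix all channels
        x = np.concatenate([x, new_maps], axis=0)       # dense connectivity
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 8))  # 16 input channels
out = dense_block(x, num_layers=4, growth_rate=12, rng=rng)
print(out.shape[0])  # 16 + 4*12 = 64 channels
```

Because channels grow only linearly with depth (input channels plus depth times growth rate), each layer can stay narrow, which is why DenseNet needs comparatively few parameters.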
VGG-19, an algorithm widely used to extract features from detected face frames [19], was chosen for comparison with the DenseNet architecture. VGG-19 eases the face-annotation process by forming a large training dataset from online knowledge sources, which is then used to train deep CNNs for face recognition. The resulting model is evaluated on face-recognition benchmarks to analyze its efficiency in generating facial features. During this process, VGG-19 is trained as a classifier with a sigmoid activation function in the output layer, which produces a vector representation of facial features (a face embedding) used to fine-tune the model. Fine-tuning separates classes by Euclidean distance through a triplet loss function, which compares the Euclidean distances between embeddings of similar and different faces. The CNN architecture of VGG-19 stacks convolutional kernels with ReLU activations followed by max-pooling layers, topped by fully connected classifier layers.
Finally, we implemented a custom CNN architecture to serve as a baseline for the previously described algorithms and to analyze the effect of dropout, padding, augmentation and grayscale analysis on model performance.
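Two of the input-side factors studied here, grayscale conversion and data augmentation, are simple enough to sketch directly. The snippet below is a minimal illustration (BT.601 luminance weights and a random horizontal flip), not the exact preprocessing pipeline used in our experiments:

```python
import numpy as np

def to_grayscale(img):
    """Luminance conversion for an (H, W, 3) RGB array (ITU-R BT.601 weights)."""
    return img @ np.array([0.299, 0.587, 0.114])

def augment(img, rng):
    """Random horizontal flip: a minimal augmentation that doubles the
    effective pose variety of a face dataset."""
    return img[:, ::-1] if rng.random() < 0.5 else img

rng = np.random.default_rng(1)
face = rng.integers(0, 256, size=(64, 64, 3)).astype(float)  # dummy face image
gray = to_grayscale(face)
print(gray.shape)  # (64, 64): single-channel input for the grayscale variant
```

Grayscale input reduces the first-layer channel count from three to one, so any accuracy it retains comes at lower computational cost, one of the trade-offs examined in the comparison.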
This study aims to provide an in-depth analysis of the described algorithms, structures and mechanisms that could be leveraged to implement an ensembled multi-attentional network for identifying deepfake media. This work contributes to the nascent literature on deepfakery by providing a comparative study of effective algorithms for deepfake detection on facial images, with a view to their possible use in digital forensics for criminal investigations.
The rest of this paper is organized as follows.
Section 2 provides a literature review of the algorithms and datasets that are widely used for deepfake detection.
Section 3 provides details on the analysis methods and configurations of the compared algorithms, as well as details of the evaluated dataset.
Section 4 provides the results of the comparative analysis. Finally,
Section 5 concludes with implications, limitations, and suggestions for future research.