Review

Deepfakes Generation and Detection: A Short Survey

Department of Network and Computer Security, State University of New York (SUNY) Polytechnic Institute, Utica, NY 13502, USA
J. Imaging 2023, 9(1), 18; https://doi.org/10.3390/jimaging9010018
Submission received: 20 November 2022 / Revised: 5 January 2023 / Accepted: 9 January 2023 / Published: 13 January 2023

Abstract

Advancements in deep learning techniques and the availability of free, large databases have made it possible, even for non-technical people, to either manipulate or generate realistic facial samples for both benign and malicious purposes. DeepFakes refer to face multimedia content that has been digitally altered or synthetically created using deep neural networks. This paper first outlines the readily available face editing apps and the vulnerability (or performance degradation) of face recognition systems under various face manipulations. Next, the survey presents an overview of the techniques and works carried out in recent years on deepfakes and face manipulations. In particular, four kinds of deepfake or face manipulation are reviewed: identity swap, face reenactment, attribute manipulation, and entire face synthesis. For each category, both the generation methods and the corresponding detection methods are detailed. Despite significant progress based on traditional and advanced computer vision, artificial intelligence, and physics, an arms race continues between attackers (i.e., deepfake generation methods) and defenders (i.e., deepfake detection methods). Thus, open challenges and potential research directions are also discussed. This paper is expected to aid readers in comprehending deepfake generation and detection mechanisms, together with open issues and future directions.

1. Introduction

It is estimated that 1.8 billion images and videos per day are uploaded to online services, including social and professional networking sites [1]. However, approximately 40% to 50% of these images and videos appear to be manipulated [2] for benign reasons (e.g., images retouched for magazine covers) or adversarial purposes (e.g., propaganda or misinformation campaigns). In particular, human face image/video manipulation is a serious issue menacing the integrity of information on the Internet and face recognition systems since faces play a central role in human interactions and biometrics-based person identification. Therefore, plausible manipulations in face samples can critically subvert trust in digital communications and security applications (e.g., law enforcement).
DeepFakes refer to multimedia content that has been digitally altered or synthetically created using deep learning models [3]. Deepfakes are the result of face swapping, enactment/animation of facial expressions, and/or digitally generated audio or non-existing human faces. In contrast, face manipulation involves modifying facial attributes such as age, gender, ethnicity, morphing, attractiveness, skin color or texture, hair color, style or length, eyeglasses, makeup, mustache, emotion, beard, pose, gaze, mouth open or closed, eye color, injury and effects of drug use [4,5], and adding imperceptible perturbations (i.e., adversarial examples), as shown in Figure 1. Readily available face editing apps (e.g., FaceApp [6], ZAO [7], Face Swap Live [8], Deepfake web [9], AgingBooth [10], PortraitPro Studio [11], Reface [12], Audacity [13], Soundforge [14], Adobe Photoshop [15]) and deep neural network (DNN) source codes [16,17] have enabled even non-experts and non-technical people to create sophisticated deepfakes and altered face samples, which are difficult to detect for human examiners and current image/video analysis forensic tools.
Deepfakes are expected to advance present disinformation and misinformation sources to the next level, which could be exploited by trolls, bots, conspiracy theorists, hyperpartisan media, and foreign governments; thus, deepfakes could be fake news 2.0. Deepfakes can be used for productive applications such as realistic dubbing of foreign video films [18] or historical figure reanimation for education [19]. Deepfakes can also be used for destructive applications such as fake pornographic videos to damage a person’s reputation or to blackmail them [20], manipulating elections [21], creating warmongering situations [22], generating political or religious unrest via fake speeches [23], causing chaos in financial markets [24], or identity theft [25]. It is easy to notice that the malevolent exploitations of deepfakes far outnumber the benevolent ones. In fact, recent advances have not only made it possible to create a deepfake from just a single still image [26], but deepfakes are also being successfully misused by cybercriminals in the real world. For instance, an audio deepfake was utilized to scam a CEO out of $243,000 [27]. The issue of deepfakes and face manipulations is compounded by their negative effect on automated face recognition systems (AFRS). For instance, studies have shown that AFRS error rates can reach up to 95% under deepfakes [28], 50–99% under morphing [29], 17.08% under makeup manipulation [30], 17.05–99.77% under partial face tampering [31], 40–74% under digital beautification [32], 93.82% under adversarial examples [33], and 67% under GAN-generated synthetic samples [34]. Similarly, automated speaker verification accuracy drops from 98% to 40% under adversarial examples [35].
There exist many deepfake and face manipulation detection methods. However, a systematic analysis shows that the majority of them have low generalization capability, i.e., their performance drops drastically when they encounter a novel deepfake/manipulation type that was not used during the training stage, as also demonstrated in [36,37,38,39,40]. Moreover, prior studies considered deepfake detection a reactive defense mechanism rather than a battle between attackers (i.e., deepfake generation methods) and defenders (i.e., deepfake detection methods) [41,42,43]. Therefore, there is a crucial gap between academic deepfake solutions and real-world scenarios or requirements. For instance, the foregoing works usually lag in robustness against adversarial attacks [44], decision explainability [45], and real-time mobile deepfake detection [46].
The study of deepfake generation and detection has gathered considerable momentum in the computer vision and machine learning community in recent years. Some review papers exist on this topic (e.g., [5,24,47,48]), but they focus mainly on deepfake or synthetic samples generated using generative adversarial networks. Moreover, most survey articles (e.g., [4,49,50]) were written mainly from an academic point of view rather than a practical development point of view, and they did not cover very recent face manipulation methods or new deepfake generation and detection techniques. Thus, this paper provides a concise but comprehensive overview from both theoretical and practical points of view, to furnish the reader with an intellectual grasp of the field as well as to facilitate the development of novel and more resilient techniques. For example, publicly available apps, codes, and software information can be easily accessed or downloaded for further development and use. All in all, this paper presents an overview of current deepfake and face manipulation techniques by covering four kinds of deepfake or face manipulation: identity swap, face reenactment, attribute manipulation, and entire face synthesis, where, for every category, both generation and detection methods are summarized. Furthermore, open challenges and potential future directions (e.g., robust deepfake detection systems against adversarial attacks using multistream and filtering schemes) that need to be addressed in this evolving field are highlighted. The main objectives of this article are to complement earlier survey papers with recent advancements, to impart to the reader a deeper understanding of the deepfake creation and detection domain, and to serve as a ground truth for developing novel algorithms for deepfake and face manipulation generation and detection systems.
The rest of the article is organized as follows. Section 2 presents deepfake and face manipulation generation as well as detection techniques. In Section 3, the open issues and potential future directions of deepfake generation and detection are discussed. The conclusions are described in Section 4.

2. Deepfake Generation and Detection

We can broadly define a deepfake as “believable audio, visual, or multimedia content generated by deep neural networks”. Deepfake/face manipulation can be categorized into four main groups: identity swap, face reenactment, attribute manipulation, and entire face synthesis [47], as shown in Figure 2. Several works have been conducted on different types of deepfake/face manipulation generation and detection. In the following subsections, we have included representative studies based on their novelty, foundational idea, and/or performance. Studies have also been incorporated to represent the most up-to-date, state-of-the-art research in deepfake generation and detection.

2.1. Identity Swap

Here, an overview of existing identity swap or face swap (i.e., replacing a person’s face with another person’s face) generation and detection methods is presented.

2.1.1. Identity Swap Generation

This consists of replacing the face of a person in the target image/video with the face of another person in the source image/video [51]. For example, Korshunova et al. [52] developed a face-swapping method using convolutional neural networks (CNNs), while Nirkin et al. [53] proposed a technique using a standard fully convolutional network in unconstrained settings. Mahajan et al. [54] presented a face swap procedure for privacy protection. Wang et al. [55] presented a real-time face-swapping method. Natsume et al. [56] proposed a region-separative generative adversarial network (RSGAN) for face swapping and editing. Other interesting face swapping methods can be found in [28,57,58,59,60,61].
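As a concrete illustration of the final compositing step shared by many classical swap pipelines (face detection, alignment, and warping are assumed to have already produced a source face registered to the target frame), the following NumPy sketch blends the warped face into the target with a feathered mask. The function name and the simple feathering scheme are our own illustration, not taken from any of the cited methods:

```python
import numpy as np

def feathered_paste(target, source_face, mask, feather=5):
    """Blend an aligned source face into the target frame.

    target, source_face: HxWx3 float arrays in [0, 1] (the source face is
    assumed already warped/aligned to the target frame).
    mask: HxW array in {0, 1} marking the face region.
    feather: half-width of the averaging window used to soften the mask
    edge, which hides the paste boundary (a cheap stand-in for the
    seamless cloning used in practice).
    """
    alpha = mask.astype(float)
    # Naive feathering: average the mask over a (2*feather+1)^2 window so
    # it ramps from 1 inside the face to 0 outside.
    k = 2 * feather + 1
    pad = np.pad(alpha, feather, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(pad, (k, k))
    alpha = windows.mean(axis=(-1, -2))
    alpha = alpha[..., None]  # HxWx1, broadcasts over the RGB channels
    return alpha * source_face + (1 - alpha) * target
```

Real systems replace the feathering with Poisson/seamless cloning or a learned blending network, but the compositing structure is the same.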

2.1.2. Identity Swap Detection

Ample studies have been conducted on identity swap deepfake detection. For instance, Koopman et al. [62] analyzed photo response non-uniformity (PRNU) for detection. Others have used warping artifacts [63], eye blinking [64], optical flow with CNNs [65], heart rate [66], image quality [28], local image textures [37], long short-term memory (LSTM) and recurrent neural network (RNN) features [67], multi-LSTM and blockchain [68], clustering [69], context [70], compression artifacts [71], metric learning [72], CNN ensembles [73], identity-aware features [74], transformers [75], audio-visual dissonance [76], and multi-attentional features [77]. Very few works have focused on the explainability (e.g., [78]) or generalization capability of deepfake detection methods (e.g., the work of Bekci et al. [38] and that of Aneja et al. [79] using zero-shot learning). Recently, S. Liu et al. [80] proposed a block shuffling learning method to detect deepfakes, in which the image is divided into blocks and intra-block and inter-block features are extracted under random shuffling.
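To give a flavor of the PRNU idea analyzed by Koopman et al. [62] (a genuine camera leaves a stable sensor-noise fingerprint that a spliced or synthesized face region lacks), here is a heavily simplified NumPy sketch. The box-blur residual stands in for the wavelet denoiser used in real PRNU pipelines, and all function names are illustrative:

```python
import numpy as np

def noise_residual(img, k=3):
    """High-pass residual: image minus a local box blur (a crude stand-in
    for the wavelet denoiser of real PRNU pipelines)."""
    pad = np.pad(img, k // 2, mode="edge")
    win = np.lib.stride_tricks.sliding_window_view(pad, (k, k))
    return img - win.mean(axis=(-1, -2))

def prnu_fingerprint(images):
    """Estimate a sensor fingerprint as the mean residual over reference
    shots taken by the same camera."""
    return np.mean([noise_residual(i) for i in images], axis=0)

def prnu_score(img, fingerprint):
    """Normalized correlation between the image residual and the fingerprint.

    Genuine content from the same camera should score noticeably higher
    than a region that was swapped in or synthesized.
    """
    r = noise_residual(img).ravel()
    f = fingerprint.ravel()
    r = r - r.mean()
    f = f - f.mean()
    return float(r @ f / (np.linalg.norm(r) * np.linalg.norm(f) + 1e-12))
```

A detector would threshold the score (or compute it per region) to flag areas where the expected fingerprint is missing.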

2.2. Face Reenactment

Here, an overview of prior face reenactment (i.e., changing the facial expression of the individual) generation and detection techniques is provided.

2.2.1. Face Reenactment Generation

This consists of replacing the facial expression of a person in the target image/video with the facial expression of another person in the source image/video [47]. It is also known as expression swap or puppet master. For instance, Thies et al. [82] developed real-time face reenactment for RGB video streams, whereas encoder-decoder-, RNN-, unified landmark converter with geometry-aware generator-, GAN-, and task-agnostic GAN-based schemes were designed by Kim et al. [83], Nirkin et al. [84], Zhang et al. [85], Doukas et al. [86], and Cao et al. [87], respectively.
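Many landmark-driven reenactment systems share a common geometric core: the driver's landmark displacements are mapped into the source face's coordinate frame before conditioning a generator. Below is a minimal NumPy sketch of that step using a Procrustes/Umeyama similarity fit; it is an illustration of the general idea, not the exact formulation of any of the cited works:

```python
import numpy as np

def similarity_align(src, dst):
    """Least-squares similarity transform (scale s, rotation R, translation t)
    mapping point set src -> dst (both Nx2), via the Umeyama method."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    sc, dc = src - mu_s, dst - mu_d
    cov = dc.T @ sc / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(2)
    if np.linalg.det(U @ Vt) < 0:  # avoid reflections
        S[1, 1] = -1
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / sc.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def transfer_expression(src_neutral, drv_neutral, drv_expr):
    """Map the driver's landmark displacements into the source frame:
    reenacted = src_neutral + s * R @ (drv_expr - drv_neutral)."""
    s, R, _ = similarity_align(drv_neutral, src_neutral)
    disp = drv_expr - drv_neutral
    return src_neutral + (s * (R @ disp.T)).T
```

The reenacted landmarks (or a rendering of them) then condition the image generator that produces the final frame.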

2.2.2. Face Reenactment Detection

Face reenactment detection methods were designed by Cozzolino et al. [88] using CNNs; Matern et al. [89] using visual features with logistic regression and MLP; Rossler et al. [90] using mesoscopic, steganalysis, and CNN features; Sabir et al. [91] using RNNs; Amerini et al. [65] using optical flow with CNNs; Kumar et al. [92] using multistream CNNs; and Wang et al. [93] using 3DCNNs. In contrast, Zhao et al. [94] designed a spatiotemporal network that utilizes complementary global and local information: a spatial module captures the global information, while a local information module extracts features from patches selected by attention layers.
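Spatiotemporal detectors exploit the fact that frame-by-frame manipulation tends to break temporal smoothness. The following toy NumPy function illustrates one such temporal-consistency cue; it is a deliberately crude stand-in for the learned temporal features in the works above:

```python
import numpy as np

def flicker_score(frames):
    """Temporal-inconsistency cue based on the second-order temporal residual.

    frames: TxHxW grayscale array. Smooth motion is approximately linear
    over three consecutive frames, so |f[t-1] - 2*f[t] + f[t+1]| stays
    small; frame-by-frame generated fakes tend to flicker and score higher.
    """
    resid = frames[:-2] - 2 * frames[1:-1] + frames[2:]
    return float(np.abs(resid).mean())
```

A real detector would feed such temporal residuals (alongside spatial features) into a classifier rather than thresholding a single scalar.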

2.3. Attribute Manipulation

Here, an overview of existing attribute manipulation (also known as face retouching or face editing, i.e., altering certain face attributes such as skin tone, age, and gender) generation and detection techniques is presented.

2.3.1. Attribute Manipulation Generation

This consists of modifying some facial attributes, e.g., hair or skin color, gender, age, and the addition of glasses [95,96,97]. It is also known as face editing or face retouching.

3.5. Mobile Deepfake Detector

Neural network-based deepfake detection methods, which can attain remarkable accuracy, are mostly unsuited to mobile platforms/applications owing to their huge number of parameters and computational cost. Compressed, yet effective, deep learning-based detection systems that could run on mobile and wearable devices would greatly help counteract deepfakes and fake news.
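As a sketch of the kind of model compression such mobile detectors rely on, the snippet below applies magnitude pruning followed by symmetric int8 quantization to a single weight matrix. This is illustrative NumPy only; real deployments would use a framework's quantization toolchain, and the function name is our own:

```python
import numpy as np

def prune_and_quantize(w, sparsity=0.5):
    """Compress a weight matrix for on-device inference (illustrative sketch).

    Step 1 -- magnitude pruning: zero out the `sparsity` fraction of
    entries with the smallest absolute values.
    Step 2 -- symmetric int8 quantization: store int8 weights plus one
    float scale, so dequantized weights are approximately q * scale.
    """
    thresh = np.quantile(np.abs(w), sparsity)
    pruned = np.where(np.abs(w) >= thresh, w, 0.0)
    scale = float(np.abs(pruned).max()) / 127.0
    if scale == 0.0:  # all-zero matrix; any scale works
        scale = 1.0
    q = np.clip(np.round(pruned / scale), -127, 127).astype(np.int8)
    return q, scale
```

Together, the sparsity (skippable zero weights) and the 4x smaller int8 storage are what make on-device inference practical.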

3.6. Lack of Large-Scale ML-Generated Databases

Most studies on AI-synthesized face sample detection compiled their own databases with various GANs. As a result, different published studies report different performances on GAN samples, because the quality of GAN-generated samples varies and is mostly unknown. Several public GAN-generated fake face sample databases should be produced to help advance this demanding research field.

3.7. Reproducible Research

In the machine learning and deepfake research community, reproducible research should be encouraged by furnishing the public with large datasets (together with human scores/reasons), experimental setups, and open-source tools/code. This will help outline the true progress in the field and avoid overestimation of the performance of the developed methods.

4. Conclusions

AI-synthesized or digitally manipulated face samples, commonly known as DeepFakes, are a significant challenge threatening the dependability of face recognition systems and the integrity of information on the Internet. This paper provides a survey of recent advances in deepfake and facial manipulation generation and detection. Despite noticeable progress, several issues remain to be resolved to attain highly effective and generalized generation and defense techniques. Thus, this article discussed some of the open challenges and research opportunities. The field of deepfakes still has a long way to go toward dependable deepfake and face manipulation detection frameworks, which will require interdisciplinary research efforts in domains such as machine learning, computer vision, human vision, and psychophysiology. All in all, this survey may be utilized as a ground truth for developing novel AI-based algorithms for deepfake generation and detection. Also, it is hoped that this survey will motivate budding scientists, practitioners, researchers, and engineers to consider deepfakes as their domain of study.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Available online: https://theconversation.com/3-2-billion-images-and-720-000-hours-of-video-are-shared-online-daily-can-you-sort-real-from-fake-148630 (accessed on 4 January 2023).
  2. Available online: https://www.nbcnews.com/business/consumer/so-it-s-fine-if-you-edit-your-selfies-not-n766186 (accessed on 4 January 2023).
  3. Dolhansky, B.; Bitton, J.; Pflaum, B.; Lu, J.; Howes, R.; Wang, M.; Ferrer, C. The deepfake detection challenge dataset. arXiv. [Google Scholar]
  Nirkin, Y.; Masi, I.; Tuan, A.T.; Hassner, T.; Medioni, G. On Face Segmentation, Face Swapping, and Face Perception. In Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition, 2018; pp. 98–105. [Google Scholar]
  4. Mahajan, S.; Chen, L.; Tsai, T. SwapItUp: A Face Swap Application for Privacy Protection. In Proceedings of the IEEE 31st International Conference on Advanced Information Networking and Applications (AINA), Taipei, Taiwan, 27–29 March 2017; pp. 46–50. [Google Scholar]
  5. Wang, H.; Dongliang, X.; Wei, L. Robust and Real-Time Face Swapping Based on Face Segmentation and CANDIDE-3. In Proceedings of the PRICAI 2018: Trends in Artificial Intelligence, Nanjing, China, 28–31 August 2018; pp. 335–342. [Google Scholar]
  6. Natsume, R.; Yatagawa, T.; Morishima, S. RSGAN: Face Swapping and Editing Using Face and Hair Representation in Latent Spaces. arXiv 2018. [Google Scholar]
  Li, L.; Bao, J.; Yang, H.; Chen, D.; Wen, F. Advancing High Fidelity Identity Swapping for Forgery Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5073–5082. [Google Scholar]
  7. Chen, R.; Chen, X.; Ni, B.; Ge, Y. SimSwap: An Efficient Framework for High Fidelity Face Swapping. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2003–2011. [Google Scholar]
  8. Koopman, M.; Rodriguez, A.; Geradts, Z. Detection of deepfake video manipulation. In Proceedings of the 20th Irish Machine Vision and Image Processing Conference (IMVIP), Coleraine, UK, 29–31 August 2018; pp. 133–136. [Google Scholar]
  9. Li, Y.; Lyu, S. Exposing DeepFake Videos by Detecting Face Warping Artifacts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; pp. 1–7. [Google Scholar]
  10. Li, Y.; Chang, M.; Lyu, S. In Ictu Oculi: Exposing AI Generated Fake Face Videos by Detecting Eye Blinking. arXiv. [Google Scholar]
  Nirkin, Y.; Keller, Y.; Hassner, T. FSGAN: Subject Agnostic Face Swapping and Reenactment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 7184–7193. [Google Scholar]
  11. Zhang, J.; Zeng, X.; Wang, M.; Pan, Y.; Liu, L.; Liu, Y.; Ding, Y.; Fan, C. Freenet: Multi-identity face reenactment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5326–5335. [Google Scholar]
  12. Doukas, M.; Koujan, M.; Sharmanska, V.; Roussos, A.; Zafeiriou, S. Head2Head++: Deep Facial Attributes Re-Targeting. IEEE Trans. Biom. Behav. Identity Sci. 2021, 3, 31–43. [Google Scholar] [CrossRef]
  13. Cao, M.; Huang, H.; Wang, H.; Wang, X.; Shen, L.; Wang, S.; Bao, L.; Li, L.; Luo, J. Task-agnostic Temporally Consistent Facial Video Editing. arXiv. [Google Scholar]
  Xu, Z.; Hong, Z.; Ding, C.; Zhu, Z.; Han, J.; Liu, J.; Ding, E. MobileFaceSwap: A Lightweight Framework for Video Face Swapping. arXiv 2022. [Google Scholar]
  Shu, C.; et al. Few-Shot Head Swapping in the Wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10789–10798. [Google Scholar]
  14. Chung, J.S.; Nagrani, A.; Zisserman, A. VoxCeleb2: Deep Speaker Recognition. In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH), Hyderabad, India, 2–6 September 2018; pp. 1–6. [Google Scholar]
  15. Afchar, D.; Nozick, V.; Yamagishi, J.; Echizen, I. MesoNet: A compact facial video forgery detection network. In Proceedings of the IEEE International Workshop on Information Forensics and Security (WIFS), Hong Kong, China, 11–13 December 2018; pp. 1–7. [Google Scholar]
  16. Miao, C.; Chu, Q.; Li, W.; Gong, T.; Zhuang, W.; Yu, N. Towards Generalizable and Robust Face Manipulation Detection via Bag-of-local-feature. arXiv. [Google Scholar]
  Face Swapping and Reenactment. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020; pp. 1–17. [Google Scholar]
  17. Shen, J.; Zafeiriou, S.; Chrysos, G.G.; Kossaifi, J.; Tzimiropoulos, G.; Pantic, M. The first facial landmark tracking in-the-wild challenge: Benchmark and results. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Santiago, Chile, 7–13 December 2015; pp. 50–58. [Google Scholar]
  18. Tripathy, S.; Kannala, J.; Rahtu, E. FACEGAN: Facial Attribute Controllable rEenactment GAN. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 1329–1338. [Google Scholar]
  19. Bounareli, S.; Argyriou, V.; Tzimiropoulos, G. Finding Directions in GAN’s Latent Space for Neural Face Reenactment. arXiv 2022, arXiv:2202.00046. [Google Scholar]
  20. Nagrani, A.; Chung, J.S.; Zisserman, A. Voxceleb: A large-scale speaker identification dataset. In Proceedings of the INTERSPEECH, Stockholm, Sweden, 20–24 August 2017; pp. 1–6. [Google Scholar]
  21. Agarwal, M.; Mukhopadhyay, R.; Namboodiri, V.; Jawahar, C. Audio-visual face reenactment. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 5178–5187. [Google Scholar]
  22. Nguyen, H.; Fang, F.; Yamagishi, J.; Echizen, I. Multi-task Learning for Detecting and Segmenting Manipulated Facial Images and Videos. arXiv 2019, arXiv:1906.06876. [Google Scholar]
  23. Dang, H.; Liu, F.; Stehouwer, J.; Liu, X.; Jain, A. On the Detection of Digital Face Manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1–10. [Google Scholar]
  24. Kim, M.; Tariq, S.; Woo, S. FReTAL: Generalizing Deepfake Detection using Knowledge Distillation and Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1001–1012. [Google Scholar]
  25. Yu, P.; Fei, J.; Xia, Z.; Zhou, Z.; Weng, J. Improving Generalization by Commonality Learning in Face Forgery Detection. IEEE Trans. Inf. Forensics Secur. 2022, 17, 547–558. [Google Scholar] [CrossRef]
  26. Wu, H.; Wang, P.; Wang, X.; Xiang, J.; Gong, R. GGViT: Multistream Vision Transformer Network in Face2Face Facial Reenactment Detection. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 2335–2341. [Google Scholar]
  27. Lample, G.; Zeghidour, N.; Usunier, N.; Bordes, A.; Denoyer, L.; Ranzato, M. Fader networks: Manipulating images by sliding attributes. arXiv 2017, arXiv:1706.00409. [Google Scholar]
  28. Liu, M.; Ding, Y.; Xia, M.; Liu, X.; Ding, E.; Zuo, W.; Wen, S. STGAN: A unified selective transfer network for arbitrary image attribute editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3673–3682. [Google Scholar]
  29. Kim, H.; Choi, Y.; Kim, J.; Yoo, S.; Uh, Y. Exploiting Spatial Dimensions of Latent in GAN for Real-Time Image Editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 852–861. [Google Scholar]
  30. Choi, Y.; Uh, Y.; Yoo, J.; Ha, J. StarGAN v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8188–8197. [Google Scholar]
  31. Huang, W.; Tu, S.; Xu, L. IA-FaceS: A Bidirectional Method for Semantic Face Editing. Neural Netw. 2023, 158, 272–292. [Google Scholar] [CrossRef] [PubMed]
  32. Sun, J.; Wang, X.; Zhang, Y.; Li, X.; Zhang, Q.; Liu, Y.; Wang, J. Fenerf: Face editing in neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7672–7682. [Google Scholar]
  33. Wang, S.; Wang, O.; Owens, A.; Zhang, R.; Efros, A. Detecting photoshopped faces by scripting photoshop. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 10072–10081. [Google Scholar]
  34. Du, C.X.T.; Trung, H.T.; Tam, P.M.; Hung, N.Q.V.; Jo, J. Efficient-Frequency: A hybrid visual forensic framework for facial forgery detection. In Proceedings of the IEEE Symposium Series on Computational Intelligence (SSCI), Canberra, Australia, 1–4 December 2020; pp. 707–712. [Google Scholar]
  35. Deepfake in the Wild Dataset. Available online: https://github.com/deepfakeinthewild/deepfake-in-the-wild (accessed on 4 January 2023).
  36. Rathgeb, C.; Nichols, R.; Ibsen, M.; Drozdowski, P.; Busch, C. Crowd-powered Face Manipulation Detection: Fusing Human Examiner Decisions. arXiv 2022, arXiv:2201.13084. [Google Scholar]
  37. Phillips, P.; Wechsler, H.; Huang, J.; Rauss, P.J. The FERET database and evaluation procedure for face-recognition algorithms. Image Vis. Comput. 1998, 16, 295–306. [Google Scholar] [CrossRef]
  38. Guo, Z.; Yang, G.; Zhang, D.; Xia, M. Rethinking gradient operator for exposing AI-enabled face forgeries. Expert Syst. Appl. 2023, 215. [Google Scholar] [CrossRef]
  39. Li, Y.; Chen, X.; Wu, F.; Zha, Z.J. Linestofacephoto: Face photo generation from lines with conditional self-attention generative adversarial networks. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 2323–2331. [Google Scholar]
  40. Xia, W.; Yang, Y.; Xue, J.H.; Wu, B. TediGAN: Text-Guided Diverse Face Image Generation and Manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2256–2265. [Google Scholar]
  41. Song, H.; Woo, S.; Lee, J.; Yang, S.; Cho, H.; Lee, Y.; Choi, D.; Kim, K. Talking Face Generation with Multilingual TTS. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 21425–21430. [Google Scholar]
  42. Zen, H.; Dang, V.; Clark, R.; Zhang, Y.; Weiss, R.J.; Jia, Y.; Chen, Z.; Wu, Y. LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech. Interspeech 2019. [Google Scholar] [CrossRef] [Green Version]
  43. Shi, Y.; Bu, H.; Xu, X.; Zhang, S.; Li, M. AISHELL-3: A Multi-Speaker Mandarin TTS Corpus. Interspeech 2021. [Google Scholar] [CrossRef]
  44. Li, Z.; Min, M.; Li, K.; Xu, C. StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18197–18207. [Google Scholar]
  45. Wang, S.; Wang, O.; Zhang, R.; Owens, A.; Efros, A. CNN-generated images are surprisingly easy to spot… for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8695–8704. [Google Scholar]
  46. Pu, J.; Mangaokar, N.; Wang, B.; Reddy, C.; Viswanath, B. Noisescope: Detecting deepfake images in a blind setting. In Proceedings of the Annual Computer Security Applications Conference, Austin, TX, USA, 6–10 December 2020; pp. 913–927. [Google Scholar]
  47. Yousaf, B.; Usama, M.; Sultani, W.; Mahmood, A.; Qadir, J. Fake visual content detection using two-stream convolutional neural networks. Neural Comput. Appl. 2022, 34, 7991–8004. [Google Scholar] [CrossRef]
  48. Nowroozi, E.; Conti, M.; Mekdad, Y. Detecting high-quality GAN-generated face images using neural networks. arXiv 2022, arXiv:2203.01716. [Google Scholar]
  49. Ferreira, A.; Nowroozi, E.; Barni, M. VIPPrint: Validating Synthetic Image Detection and Source Linking Methods on a Large Scale Dataset of Printed Documents. J. Imaging 2021, 7, 50. [Google Scholar] [CrossRef]
  50. Boyd, A.; Tinsley, P.; Bowyer, K.; Czajka, A. CYBORG: Blending Human Saliency Into the Loss Improves Deep Learning-Based Synthetic Face Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, Hawaii, 3–7 January 2023; pp. 6108–6117. [Google Scholar]
  51. Karras, T.; Aittala, M.; Hellsten, J.; Laine, S.; Lehtinen, J.; Aila, T. Training generative adversarial networks with limited data. Adv. Neural Inf. Process. Syst. 2020, 33, 12104–12114. [Google Scholar]
  52. Karras, T.; Aittala, M.; Laine, S.; Härkönen, E.; Hellsten, J.; Lehtinen, J.; Aila, T. Alias-free generative adversarial networks. Adv. Neural Inf. Process. Syst. 2021, 34, 852–863. [Google Scholar]
  53. Banerjee, S.; Bernhard, J.S.; Scheirer, W.J.; Bowyer, K.W.; Flynn, P.J. SREFI: Synthesis of realistic example face images. In Proceedings of the IEEE International Joint Conference on Biometrics, Denver, CO, USA, 1–4 October 2017; pp. 37–45. [Google Scholar] [CrossRef] [Green Version]
  54. Mishra, S.; Shukla, A.K.; Muhuri, P.K. Explainable Fuzzy AI Challenge 2022: Winner’s Approach to a Computationally Efficient and Explainable Solution. Axioms 2022, 11, 489. [Google Scholar] [CrossRef]
  55. Adadi, A.; Berrada, M. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access 2018, 6, 52138–52160. [Google Scholar] [CrossRef]
  56. Das, A.; Rad, P. Opportunities and challenges in explainable artificial intelligence (XAI): A survey. arXiv 2020, arXiv:2006.11371. [Google Scholar]
Figure 1. Examples of different face manipulations: original samples (first row) and manipulated samples (second row).
Figure 2. Real and fake examples of each deepfake/face manipulation group. The fake sample in “Entire face synthesis” group is obtained from the method in [81].
Table 1. Representative works on deepfake and face manipulation generation and detection techniques. SWR = successful swap rate; MS-SSIM = multi-scale structural similarity; Acc = accuracy; LL = Logloss; AUC = area under the curve; CL = contextual loss; RMSE = root mean square error; AU = facial action unit; CSIM = cosine similarity between image embeddings; EER = equal error rate; FID = Fréchet inception distance; AP = average precision; KID = kernel inception distance; PSNR = peak signal-to-noise ratio; CCR = correct classification rate; MSE = mean squared error; FPS = frames per second. × indicates that no public source code is available.
| Study | Approach | Dataset | Performance | Source Code | Year |
|---|---|---|---|---|---|
| *Deepfake Generation* | | | | | |
| Wang et al. [55] | Real-time face swapping using CANDIDE-3 | COFW [132], 300W [133], LFW [134] | SWR = 87.9% | × | 2018 |
| Natsume et al. [56] | Face swapping and editing using RSGAN | CelebA [135] | MS-SSIM = 0.087 | × | 2018 |
| Chen et al. [61] | High-fidelity encoder–decoder | VGGFace2 [136] | Qualitative analysis | https://github.com/neuralchen/SimSwap (accessed on 4 January 2023) | 2021 |
| Xu et al. [137] | Lightweight identity-aware dynamic network | VGGFace2 [136], FaceForensics++ [90] | FID = 6.79 | https://github.com/Seanseattle/MobileFaceSwap (accessed on 4 January 2023) | 2022 |
| Shu et al. [138] | Portrait, identity, and pose encoders with generator and feature pyramid network | VoxCeleb2 [139] | PSNR = 33.26 | https://github.com/jmliu88/heser (accessed on 4 January 2023) | 2022 |
| *Deepfake Detection* | | | | | |
| Afchar et al. [140] | CNNs | FaceForensics++ [90] | Acc = 98.40% | https://github.com/DariusAf/MesoNet (accessed on 4 January 2023) | 2018 |
| Zhao et al. [77] | Multi-attentional network | FaceForensics++ [90], DFDC [3] | Acc = 97.60%; LL = 0.1679 | https://github.com/yoctta/multiple-attention (accessed on 4 January 2023) | 2021 |
| Miao et al. [141] | Transformers via bag-of-features for generalization | FaceForensics++ [90], Celeb-DF [142], DeeperForensics-1.0 [143] | Acc = 87.86%; AUC = 82.52%; Acc = 97.01% | × | 2021 |
| Prajapati et al. [144] | Perceptual image assessment + GANs | DFDC [3] | AUC = 95%; Acc = 91% | https://github.com/pratikpv/mri_gan_deepfake (accessed on 4 January 2023) | 2022 |
| Wang et al. [75] | Multi-modal Multi-scale Transformer (M2TR) | FaceForensics++ [90] | Acc = 97.93% | https://github.com/wangjk666/M2TR-Multi-modal-Multi-scale-Transformers-for-Deepfake-Detection (accessed on 4 January 2023) | 2022 |
| *Reenactment Generation* | | | | | |
| Zhang et al. [145] | Decoder + warping | CelebA-HQ [146], FFHQ [147], RAF-DB [148] | AU = 75.1%; AU = 70.9%; AU = 71.1% | https://github.com/bj80heyue/One_Shot_Face_Reenactment (accessed on 4 January 2023) | 2019 |
| Ngo et al. [149] | Encoder–decoder | 300VW [150] | CL = 1.46 | × | 2020 |
| Tripathy et al. [151] | Facial-attribute-controllable GANs | FaceForensics++ [90] | CSIM = 0.747 | × | 2021 |
| Bounareli et al. [152] | 3D shape model | VoxCeleb [153] | FID = 0.66 | × | 2022 |
| Agarwal et al. [154] | Audio-visual face reenactment GAN | VoxCeleb [153] | FID = 9.05 | https://github.com/mdv3101/AVFR-Gan/ (accessed on 4 January 2023) | 2023 |
| *Reenactment Detection* | | | | | |
| Nguyen et al. [155] | Autoencoder | FaceForensics++ [90] | EER = 7.07% | https://github.com/nii-yamagishilab/ClassNSeg (accessed on 4 January 2023) | 2019 |
| Dang et al. [156] | CNNs + attention mechanism | FaceForensics++ [90] | AUC = 99.4%; EER = 3.4% | https://github.com/Jstehouwer/FFD_CVPR2020 (accessed on 4 January 2023) | 2020 |
| Kim et al. [157] | Knowledge distillation | FaceForensics++ [90] | Acc = 86.97% | × | 2021 |
| Yu et al. [158] | U-Net structure | FaceForensics++ [90] | Acc = 97.26% | × | 2022 |
| Wu et al. [159] | Multistream vision transformer network | FaceForensics++ [90] | Acc = 94.46% | × | 2022 |
| *Attribute Manipulation Generation* | | | | | |
| Lample et al. [160] | Encoder–decoder | CelebA [135] | RMSE = 0.0009 | https://github.com/facebookresearch/FaderNetworks (accessed on 4 January 2023) | 2018 |
| Liu et al. [161] | Selective transfer GANs | CelebA [135] | Acc = 70.80% | https://github.com/csmliu/STGAN (accessed on 4 January 2023) | 2019 |
| Kim et al. [162] | Real-time style map GANs | CelebA-HQ [146], AFHQ [163] | FID = 4.03; FID = 6.71 | https://github.com/naver-ai/StyleMapGAN (accessed on 4 January 2023) | 2021 |
| Huang et al. [164] | Multi-head encoder and decoder | CelebA-HQ [146], StyleMapGAN [162] | MSE = 0.023; FID = 7.550 | × | 2022 |
| Sun et al. [165] | 3D-aware generator with two decoupled latent codes | FFHQ [147] | FID = 28.2 | https://github.com/MrTornado24/FENeRF (accessed on 4 January 2023) | 2022 |
| *Attribute Manipulation Detection* | | | | | |
| Wang et al. [166] | CNNs | Own dataset | Acc = 90.0% | https://github.com/peterwang512/FALdetector (accessed on 4 January 2023) | 2019 |
| Du et al. [167] | DFT + CNNs | Deepfake-in-the-wild [168], Celeb-DF [142], DFDC [3] | Acc = 78.00%; Acc = 96.00%; Acc = 81.00% | × | 2020 |
| Akhtar et al. [36] | DNNs | Own dataset | Acc = 99.31% | × | 2021 |
| Rathgeb et al. [169] | Human majority voting | FERET [170] | CCR = 62.8% | × | 2022 |
| Guo et al. [171] | Gradient operator convolutional network with tensor pre-processing and manipulation trace attention module | FaceForensics++ [90] | Acc = 94.86% | https://github.com/EricGzq/GocNet-pytorch (accessed on 4 January 2023) | 2023 |
| *Entire Face Synthesis Generation* | | | | | |
| Li et al. [172] | Conditional self-attention GANs | CelebA-HQ [146] | KID = 0.62 | https://github.com/LiYuhangUSTC/Lines2Face (accessed on 4 January 2023) | 2019 |
| Karras et al. [81] | StyleGAN | FFHQ [147] | FID = 3.31 | https://github.com/NVlabs/stylegan2 (accessed on 4 January 2023) | 2020 |
| Xia et al. [173] | Textual-description GANs | CelebA-HQ [146] | FID = 106.37 | https://github.com/IIGROUP/TediGAN (accessed on 4 January 2023) | 2021 |
| Song et al. [174] | Text-to-speech system | LibriTTS [175], AISHELL-3 [176] | FPS = 30.3 | × | 2022 |
| Li et al. [177] | StyleT2I: high-fidelity text-to-image synthesis | CelebA-HQ [146] | FID = 18.02 | https://github.com/zhihengli-UR/StyleT2I (accessed on 4 January 2023) | 2022 |
| *Entire Face Synthesis Detection* | | | | | |
| Wang et al. [178] | CNNs | StyleGAN2 [81], ProGAN [146] | AP = 99.10%; AP = 100% | https://github.com/peterwang512/CNNDetection (accessed on 4 January 2023) | 2020 |
| Pu et al. [179] | Incremental clustering | PGGAN [146] | F1 score = 99.09% | https://github.com/jmpu/NoiseScope (accessed on 4 January 2023) | 2020 |
| Yousaf et al. [180] | Two-stream CNNs | StarGAN [101] | Acc = 96.32% | × | 2021 |
| Nowroozi et al. [181] | Cross-band and spatial co-occurrence matrices + CNNs | StyleGAN2 [81], VIPPrint [182] | Acc = 93.80%; Acc = 92.56% | × | 2022 |
| Boyd et al. [183] | Human-annotated saliency maps in a deep learning loss function | StyleGAN2 [81], ProGAN [146], StyleGAN [147], StyleGAN2-ADA [184], StyleGAN3 [185], StarGANv2 [163], SREFI [186] | AUC = 0.633 | https://github.com/BoydAidan/CYBORG-Loss (accessed on 4 January 2023) | 2023 |
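Most detection entries in Table 1 report threshold-based metrics such as accuracy, AUC, and EER. To make those definitions concrete, the following self-contained NumPy sketch (illustrative only, not code from any surveyed work; the scores and labels are invented toy values) builds an ROC curve from raw detector scores and derives AUC and EER from it:

```python
import numpy as np

def roc_curve(scores, labels):
    """Sweep a decision threshold over sorted scores to get (FPR, TPR) pairs."""
    order = np.argsort(-scores)          # descending score order
    labels = labels[order]
    tps = np.cumsum(labels)              # true positives accepted at each threshold
    fps = np.cumsum(1 - labels)          # false positives accepted at each threshold
    tpr = tps / labels.sum()
    fpr = fps / (len(labels) - labels.sum())
    # Prepend the (0, 0) point for the strictest threshold.
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])

def auc(fpr, tpr):
    """Area under the ROC curve via the trapezoid rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

def eer(fpr, tpr):
    """Equal error rate: the operating point where FPR equals FNR (= 1 - TPR)."""
    fnr = 1 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[i] + fnr[i]) / 2)

# Toy example: higher score means "more likely fake"; 1 = fake, 0 = real.
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])
labels = np.array([1, 1, 1, 0, 1, 0])
fpr, tpr = roc_curve(scores, labels)
print(f"AUC = {auc(fpr, tpr):.3f}, EER = {eer(fpr, tpr):.3f}")
```

Libraries such as scikit-learn provide equivalent `roc_curve` and `roc_auc_score` utilities; the manual version above is only meant to show how the AUC and EER figures quoted in the table are obtained from a detector's raw scores.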
Share and Cite

MDPI and ACS Style

Akhtar, Z. Deepfakes Generation and Detection: A Short Survey. J. Imaging 2023, 9, 18. https://doi.org/10.3390/jimaging9010018
