Efficient Online Engagement Analytics Algorithm Toolkit That Can Run on Edge

Thiha, Saw; Rajasekera, Jay

doi:10.3390/a16020086

Open AccessArticle

Efficient Online Engagement Analytics Algorithm Toolkit That Can Run on Edge

by

Saw Thiha

and

Jay Rajasekera

^*

Digital Business and Innovations, Tokyo International University, Saitama 350-1197, Japan

^*

Author to whom correspondence should be addressed.

Algorithms 2023, 16(2), 86; https://doi.org/10.3390/a16020086

Submission received: 10 December 2022 / Revised: 27 January 2023 / Accepted: 31 January 2023 / Published: 6 February 2023

(This article belongs to the Special Issue Advances in Cloud and Edge Computing)

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

The rapid expansion of video conferencing and remote works due to the COVID-19 pandemic has resulted in a massive volume of video data to be analyzed in order to understand the audience engagement. However, analyzing this data efficiently, particularly in real-time, poses a scalability challenge as online events can involve hundreds of people and last for hours. Existing solutions, especially open-sourced contributions, usually require dedicated and expensive hardware, and are designed as centralized cloud systems. Additionally, they may also require users to stream their video to remote servers, which raises privacy concerns. This paper introduces scalable and efficient computer vision algorithms for analyzing face orientation and eye blink in real-time on edge devices, including Android, iOS, and Raspberry Pi. An example solution is presented for proctoring online meetings, workplaces, and exams. It analyzes audiences on their own devices, thus addressing scalability and privacy issues, and runs at up to 30 fps on a Raspberry Pi. The proposed face orientation detection algorithm is extremely simple, efficient, and able to estimate the head pose in two degrees of freedom, horizontal and vertical. The proposed Eye Aspect Ratio (EAR) with simple adaptive threshold demonstrated a significant improvement in terms of false positives and overall accuracy compared to the existing constant threshold method. Additionally, the algorithms are implemented and open sourced as a toolkit with modular, cross-platform MediaPipe Calculators and Graphs so that users can easily create custom solutions for a variety of purposes and devices.

Keywords:

android; blink; cross-platform; edge computing; eye aspect ratio; face orientation; head pose; iOS; MediaPipe; open source; proctoring; Raspberry Pi; real-time; toolkit; video conference

1. Introduction

Scholars have been predicting the rise of video conferences and virtual events for many decades. In 2020, COVID-19 accelerated video conferencing, online education, and online conducts to a whole new level and many solutions such as Zoom, Microsoft Teams and other video meeting software have emerged. Zoom, for instance, increased from 10 million daily users in December, 2019 to 300 million in April, 2020 [1], especially due to COVID-19-driven travel constraints in countries around the world. Video conferencing today has become the norm and it helps bring the remote partners much closer than personal meetings.

1.1. The Future of Workplaces and Meetings

Additionally, the COVID-19 pandemic has resulted in a rapid shift towards remote work for many organizations and employees. According to a report by Organization for Economic Co-operation and Development (OECD), industries that are highly digitalized, such as information and communication services, professional, scientific and technical services, and financial services, saw the greatest increase in teleworking, with over 50% of employees working remotely on average [2]. This trend has been driven by the need to reduce the spread of the virus by limiting in-person contact, as well as by the increasing availability of technology that enables remote work.

Although the trend was largely driven by the pandemic, it is likely to shape the way we work in the future. One reason for this is that online meetings and remote work offer a number of benefits that are likely to continue to be valued even after the pandemic subsides [3]. They can also enable more flexible and agile ways of working, which can increase productivity and enable organizations to respond more quickly to changing needs and conditions [4]. In addition, online meetings, classes and remote work can reduce the need for travel, which can save time and money and reduce the environmental impact of work [5,6].

Another reason is that the shift towards online meetings and remote work has required organizations to invest in new technologies and infrastructure as well as cybersecurity measures that are likely to continue to be used even after the pandemic subsides and will continue to pay dividends in terms of increased productivity and efficiency [7].

1.2. The Problems

A qualitative study has shown that remote work can lead to feelings of isolation and disconnection from colleagues and the broader organization [8]. This can be particularly challenging for those who are not used to working remotely or who lack strong social connections outside of work [9]. Additionally, a study [10] found that students also have difficulties with the shift from hands-on laboratory experiences to online classes, which resulted in loss of communication among peers, which is important for understanding, engagement and persistence.

Given the nature of video conferences, the audience, especially the meeting hosts, suffers from the stress which is coined as “Zoom Fatigue”. According to the studies [11], it can be accounted for the fact that the audience have to turn on their cameras and have close engagement with others for long hours and as for the meeting hosts, online engagement creates social isolation in [12]. Additionally, having the audience unnecessarily turn on their video feed all the time can have an adverse effect on the engagement itself as well as on privacy concerns.

Bailenson [11] made theoretical arguments on four causes for Zoom fatigue: close eye contact, cognitive load due to harder non-verbal cues, self-consciousness from looking at one’s self image and reduced mobility.

Additionally, he and other researchers [13] conducted an empirical study on the scale (called Zoom Exhaustion and Fatigue Scale, ZEF) where they outlined general facts regarding Zoom Fatigue. Also, the same researchers [14] conducted an empirical test on five non-verbal causes, similar to Bailenson’s [11] arguments. From [11,13,14], one can argue that video-first online conference settings significantly contributes to Zoom fatigue and the fatigue leads to negative attitudes towards online meeting, making matters worse and disrupting the engagement as a whole.

A study in Thailand [15], for instance, revealed that teachers have difficulties with gauging student engagement. One potential solution to the stress associated with video sharing is to instead share engagement analytics such as face orientation and eye blink, which can address the causes of Zoom Fatigue by reducing the need for video sharing. For meeting hosts too, this can provide assurance of audience attentiveness, even without visual access to the attendees’ faces.

1.3. Understanding Audience

Having said that, video plays a major role in evaluating the progress of the online meeting as it reveals how an audience engages with the content, giving the meeting hosts insight and confidence. For the employers, examiners, teachers, and online meeting hosts, the need to understand the behavior and attentiveness of the online audience naturally arises.

First, understanding the behavior and attentiveness of the audience can help to improve the effectiveness of the meeting or exam. One study [16] observed that individuals who pay attention to the information are more likely to retain it and perform better on tasks compared to those who are likely to be distracted (also [17,18]). By using video analytics to track attentiveness, employers, examiners, and teachers can identify any issues that may be impacting the ability of the audience to engage with the material, and take steps to address them.

Second, understanding the behavior and attentiveness of the audience can help to ensure the integrity of the meeting or exam. For example, in an online exam setting, video analytics can be used to detect and prevent cheating by identifying behaviors that are inconsistent with an individual working alone [19].

1.4. Concerns

As online meetings and exams become increasingly reliant on video analytics to understand the behavior and attentiveness of the audience, it is important to consider the concerns that the audience might have about these analytics being applied to their video feed.

One concern that the audience might have is about privacy [20]. With online meetings and remote work, it is often necessary to share personal information and data, such as one’s location, webcam feed, and computer screen. This can raise concerns about data security and the risk of data breaches and the policies to protect privacy. Many people are understandably concerned about their data being collected, shared, or used in ways that they did not consent to. A qualitative study conducted in Japan found that students who participated in online classes were comfortable with showing their faces on camera, but they were hesitant to be recorded due to concerns that the recordings might be shared publicly [21]. To address this concern, it is important for online meeting hosts and employers to provide clear information about how the analytics will be used, and to give the audience control over their data. This might include options for opt-in or opt-out, as well as clear information about what data will be collected and how it will be used.

There is also a risk that the analytics could be biased and used for nefarious purposes, such as surveillance or control. To address this concern, it is important for online meeting hosts and employers to be transparent about how the analytics will be used and to ensure that they are not being used for purposes that are not in the best interests of the audience.

1.5. Existing Proctoring Solutions

The authors of [22] conducted an extensive review of 29 premium, commercial exam proctoring solutions, evaluating the features and platforms of each solution. These solutions often utilize multi-modal systems, which may involve multiple cameras or specialized hardware, and are usually centralized web platforms. Most of these solutions offer advanced performance and are not open source. They typically take control of the examinee’s computer system while executing analytics. However, these solutions may raise concerns about privacy and ethics due to their complexity and lack of transparency [20]. A systematic review on similar software is provided in [6,23].

Likewise, many transparent solutions (with research published) utilize a cloud-based multi-modal approach, incorporating data from multiple sources. One example of such a system [24] uses a wearable camera and microphone, as well as a webcam, to evaluate various factors such as gaze, speech, text, and phone screen presence. Other research [25] has focused on analyzing behavior such as facial expressions, eye and mouth movements to detect cheating. In conclusion, proctoring automation requires a variety of factors and imposes certain constraints on examinees in order to operate accurately.

Additionally, the authors researched open-source tools for facial analytics in proctoring. The open-source projects, in general, can be categorized into general and proctoring-specific. General tools such as OpenCV and Tensorflow can be used as a foundation to build proctoring solutions but may lack specificity, requiring extra effort for proctoring purposes. Proctoring-specific tools, on the other hand, come as a complete package or solution with dashboard, etc. They usually are highly specialized and do not offer much customization. They may not be portable as they are implemented in Python [26], and for web-based deployment [27,28]. Some proctoring features normally require dedicated hardware such as Graphical Processing Unit (GPU) to perform optimally in real-time, adding further constraints on scalability and portability. Being open-source project, they may not provide all of the features and infrastructure that premium proctoring services offer.

1.6. A Toolkit for Proctoring

The aim of this paper is to propose an open-source initiative to make proctoring-related algorithms accessible, portable, scalable and customizable, as a toolkit. Instead of creating a complete proctoring solution, the authors aim to promote customizablility, transparency, and alleviate concerns about proctoring tools, as discussed in [20]. Additionally, the proposed algorithms are designed to run directly on consumer devices, which can be called edge, in real-time, eliminating the need to send video streams or recorded videos to a remote server. This helps protect the privacy of end users. For organizations and facilitators, it can reduce the need for dedicated hardware and infrastructure, as the algorithms can extract features at the edge. This addresses scalability issues.

As of the writing, the proposed toolkit is able to detect face orientation and eye blink. Those features are important for a variety of purposes, including proctoring, as they can provide valuable insights into the behavior and attentiveness of individuals. For example, detecting face orientation can help to determine whether someone is looking directly at the screen or camera, which can be useful for detecting cheating during online exams or for ensuring that employees are paying attention during online meetings. Eye blink detection can also be useful for assessing attentiveness, as prolonged periods of inactivity or low blink rate can be indicators of fatigue or lack of engagement.

Similar to the proposed toolkit, the OpenFace Toolkit [29] is an open-source project that offers libraries and tools for facial behavior analysis. It is written in C++ and Python and is designed to be cross-platform, utilizing the CMake build system to run on Windows, Linux, and macOS. However, certain features based on dlib and TBB may not be compatible with mobile devices such as iOS and Android, and some prior knowledge of CMake is required to build for these platforms. In contrast, the authors’ toolkit uses the out-of-the-box build configuration provided by the MediaPipe Framework to easily support a wider range of devices.

1.7. MediaPipe

MediaPipe is an opensource framework developed by Google for building cross-platform multimodal machine learning pipelines [30]. It allows developers to build and deploy solutions for a variety of applications, including augmented reality, video analytics, and gesture recognition.

One of the key features of MediaPipe is its modular design, which allows developers to easily incorporate a wide range of machine learning models and algorithms into their pipelines. MediaPipe is built on the concept of “Calculators”, which are modular components that can be easily combined to create custom pipelines for a variety of applications. This makes it easy to customize and extend existing solutions, or to build new ones from scratch. The proposed algorithms have been implemented as separate Calculator components, allowing them to be easily integrated into the MediaPipe ecosystem.

MediaPipe also includes support for a variety of input and output modalities, including video, audio, and sensor data. This makes it well-suited for applications that require the integration of multiple data streams, such as those found in augmented reality, video analytics, multi-modal proctoring or the online engagement analytics.

In addition to its flexibility and modularity, MediaPipe is also highly efficient and scalable. It is designed to run on a variety of platforms, including mobile devices and edge devices, and is optimized for performance and real-time processing. This allows developers to bring machine learning to the edge, meaning that models can be run directly on device hardware rather than being required to be run on a remote server or cloud infrastructure. This makes it possible to build and deploy models that are highly efficient and responsive, making it an attractive choice for a wide range of custom applications. The proposed algorithms by the authors, for instance, use the landmarks detected by FaceMesh solution [31] of MediaPipe framework.

1.8. Contributions

The main contribution of the paper is the initiative for an open-source proctoring toolkit [32], which supports face orientation, eye blink detection and other trivial analytics. The authors proposed an extremely simple yet efficient face orientation detection approach. Additionally, the authors improved the F1 score and accuracy of the existing Eye Aspect Ratio (EAR) thresholding approach on eye blink detection with a simple non-linear regression-based adaptive threshold. Last but not least, they statistically proved that face orientation indeed affects the perceived eye closeness; given the adaptive threshold based on face orientation, a single eyelid distance can outperform EAR thresholding method.

2. Related Works

The proposed toolkit is comprised of landmark standardization, face orientation, and eye blink detection; the following sub-sections explain the justifications.

2.1. Landmark Standardization

The most straight-forward standardization is the scaling normalization where the coordinates are normalized into a value range of 0 and 1, based on the max and min values (Equation (1)). However, this method can correlate to the face size compared to z-score standardization.

x_{n o r m} = \frac{x - x_{m i n}}{x_{m a x}}

(1)

Z-score standardization is a statistical method that transforms a variable to have a mean of 0 and a standard deviation of 1. This can be useful for comparing values of the variable across different samples or populations, or for comparing the variable to a normal distribution.

In the context of facial analysis, z-score standardization can be applied to facial landmarks, which are points on the face that correspond to specific features such as the eyes, nose, and mouth. By standardizing the coordinates of these landmarks, it is possible to compare the relative positions of the landmarks across different faces or to detect subtle changes.

Research in [33] evaluated the effect of standardizing imaging features on the accuracy of a radiomics analysis for predicting the histology of non-small cell lung cancer (NSCLC). A total of 476 imaging features were extracted from each database and standardized using min-max normalization, z-score normalization, and whitening from principal component analysis. The results showed that all of the standardization strategies improved the accuracy of the histology prediction, with the highest accuracy achieved using z-score normalization.

Another case of using z-score standardization as explained in [34] occurs in measuring telomere lengths across different studies or runs, because raw values of rTL (Relative Telomere Length) can be influenced by technical and contextual factors, and are not directly comparable. Z-score standardization helped to mitigate this problem by expressing the data in standard deviations, which are directly comparable between studies.

Likewise, un-normalized landmarks can be influenced by many factors such as input image dimension, size of the face, and detected location. This can affect the robustness of algorithms that are designed to work with various devices and people. In MediaPipe Facemesh [31], the landmarks are normalized using the width and height of the image, which again may not be ideal for comparative purposes.

In the context of machine learning, comparing K-means clustering results on infectious diseases datasets showed that the z-score standardization method is more effective and efficient than the min-max and decimal scaling standardization methods [35]. A study in [36], on the other hand, pointed out that conventional z-score standardization may not be optimal for certain clustering analyses, along with the performances of modified z-score standardizations.

Overall, [33,34,35,36], strongly suggest that z-score is the optimal, robust standardization for meta analysis of facial landmarks.

2.2. Head Pose Estimation

Proctoring involves identifying whether an audience is looking straight at the computer screen or not, in order to assess their behavior, attentiveness, and integrity during online conduct. This is often achieved by analyzing the head pose, or the Euler angles (yaw, pitch, and roll) of the head in 3D space. Many studies have focused on estimating head pose, especially from visual data, for a variety of applications, including driver fatigue detection [37,38,39]. Head pose has also been studied for proctoring purposes in previous research [40,41,42].

Deep learning-based approaches for estimating head pose can be found in previous research [38,40,43]. The authors of [44], for instance, used a CNN [45] to detect the head pose. However, they took a different approach; they z-score standardize the yaw and pitch to accommodate different settings based on an assumption that they follow normal distribution, as in the case of face landmarks in this paper. Other studies have used non-deep learning approaches, such as localized gradient orientation (LGO) histograms with support vector regressors (SVRs), to detect not only head pose but also position in 3D space [39]. In another proctoring solution [42], the head pose was estimated using an extended version of the Constrained Local Model.

The survey in [46] documents several methods for estimating head pose, including geometric methods that use facial landmarks. For example, ref. [41] described a method for calculating head pose from cylindrical and ellipsoidal face models (proposed in [47]).

An example of a method for estimating head pose using facial landmarks can be found in the work of [48]. They used Euler angles to estimate head pose by solving the Point-n-Perspective (PnP) problem with methods such as POSIT (Pose from Orthography and Scaling with ITerations). This solution requires a reference 3D face model and camera parameters, and is most optimal when the real camera parameters and the input face closely resemble the references [45,49]. Another study [37] used Efficient PnP (EPnP) for head pose estimation in driver fatigue detection.

This paper, however, presents an algorithm that estimates face orientation using two degrees of freedom (horizontal and vertical) for proctoring purposes. A similar approach for detecting face orientation can be found in a previous study [50], which used five separate AdaBoost classifiers to classify five discrete face orientations (left, right, up, down, straight) directly from raw pixels. In contrast, the authors of this paper used only a single nose landmark to evaluate continuous horizontal and vertical orientations. The algorithm is reliable as long as the face landmark detection is accurate, which is ensured by the underlying MediaPipe FaceMesh solution [31], which has been demonstrated to be accurate and is widely used in production.

2.3. Eye Blink Detection

Eye blink detection is important in various fields of research, including facial anti-spoofing, physical and mental health assessment, and human-computer interaction. By accurately detecting and analyzing eye blinks, researchers can gain insights into a person’s cognitive and emotional states, detect potential security threats, and design more intuitive and user-friendly interfaces.

A vision-based human–computer interface that uses eye blinks as control commands on a consumer computer is described in [51]. The detection of Deepfakes using eye blink patterns, as demonstrated in [52,53], highlights the utility of eye blink detection for proctoring purposes. Research on drowsiness detection [54] and eye fatigue detection [55] show that it is also useful for detecting physical and mental health problems and analyzing attentiveness and behavior.

In visual systems, traditional image processing techniques are used to detect eye blink. The proposed interface in [51] only uses image processing techniques, including Haar-like features for face detection and template matching for eye tracking and eye blink detection, to detect and interpret eye blinks. The system requires template images of the user’s eyes for initialization unlike deep learning models. The author of [56] utilized deep Convolutional Neural Network (CNN) to detect eye blinks. Apart from a deep learning approach, the authors of [57] used Active Shape Model (ASM) to extract the facial features for eye blink detection. The usage of facial landmarks for eye blink detection can be found in [58,59,60].

Among sensor-based approaches, a study [61] developed a system called BlinkListener for detecting eye blinks using acoustic signals (chirps signals from a speaker) in a contact-free manner. The authors of [62,63] discuss the impact of eye blink artifacts on EEG data and the development of algorithms for detecting and correcting these artifacts. Eye blinking EOG signals were incorporated to improve existing EEG-based biometric authentication [64]. The authors of [65,66,67], used EOG sensors to track eye movement and detect eye blinks. These methods sometimes require dedicated hardware in order to detect eye blink and require EOG goggles, which may not suitable for the general public.

In this paper, the authors chose the facial landmarks approach for detecting eye blink as it saves computation time since face orientation detection already relies on them. Regarding the eye blink detection using facial landmarks, the authors of [58] proposed a well-known metric, eye aspect ratio (EAR, Equation (2)) computed from six salient landmarks around the eye (as shown in Figure 1) to detect whether a person’s eyes are open or closed.

E A R_{l e f t} = \frac{| | L^{160} - L^{144} {| |}_{2}^{2} + | | L^{158} - L^{153} {| |}_{2}^{2}}{2 | | L^{33} - L^{133} {| |}_{2}^{2}}

(2)

E A R_{r i g h t} = \frac{| | L^{385} - L^{380} {| |}_{2}^{2} + | | L^{387} - L^{373} {| |}_{2}^{2}}{2 | | L^{362} - L^{263} {| |}_{2}^{2}}

(3)

An alternative approach, as demonstrated in [57], is that eye blinks can be detected by simply calculating the average of three vertical eyelid distances. The authors of [59] improved the robustness of a single eyelid distance in eye blink detection by smoothing the distance signal with a Savitzky-Golay (SG) filter and taking the duration of the eye blink into consideration. Additionally, according to the experimental results presented in this paper, a single vertical eyelid distance can perform comparably and better in terms of F1 score to the Eye Aspect Ratio (EAR) method (see the results in Section 4.2) in detecting eye blinks if the adaptive threshold of eye blink is calculated from face orientation. As suggested in [68], the head orientation indeed correlates to perceived eye openness and the correlation is statistically proven and explained by a non-linear regression (see Section 3.3).

Nonetheless, EAR is a commonly used method when it comes to detecting eye blinks using facial landmarks. A study [60], for instance, used eye aspect ratio to detect eye blinks with high accuracy in real-time on Raspberry Pi. The authors of [69] explored a mental fatigue assessment model for drivers using the eye aspect ratio and mouth aspect ratio.

3. Materials and Methods

The proposed algorithms are based on Facemesh solution [31] from Mediapipe framework and supports head orientation, eye blink and additionally, facial activity and movement detection. The Figure 2 depicts the algorithm (for the example application and source code, refer to Appendix C). First, facial landmarks from Facemesh are standardized before performing face orientation and eye blink detection. The facial activity is detected by calculating the difference between the standardized landmarks of the current image frame and the previous one. Likewise, face movement is calculated as the difference of the unstandardized nose landmark.

3.1. Landmark Standardization

A one-way ANOVA test (see Appendix A and the results in [32]) demonstrates that the distribution of facial landmarks is normal across different people given the same context, looking straight with neutral faces. This normality and the clinical results of [34] strongly suggest that z-score standardization can be more optimal than its counterpart, scaling normalization (Equation (1)). Additionally, standardized nose landmarks can differentiate face orientation in practice.

In conclusion, the authors utilized z-score normalization for robust head orientation and eye blink detection. Mediapipe’s facemesh solution offers 468 3D landmarks,

L : = {x, y, z}

with

{x, y, z}

where z is the distance from the camera to the face. In this case, the standardization becomes

L_{s t d} : = {x_{s t d}, y_{s t d}, z_{s t d}}

(4)

where

L_{s t d}

is standardized landmark and is composed of

{x_{s t d}, y_{s t d}, z_{s t d}}

being standardized principal axes. And

x_{s t d}

, for instance, is defined as

x_{s t d} = \frac{X - μ_{x}}{σ_{x}}

(5)

where X is the x axis component of a landmark,

μ_{x}

the mean of all the x components and

σ_{x}

the standard deviation.

3.2. Face Orientation

Since the authors are not modeling in a 3D context or analyzing behaviour, yaw and pitch can be sufficient for proctoring purposes. This is also pointed out by [44]. Therefore, the author proposes a method to estimate the face orientation with just two degrees of freedom (horizontal and vertical) directly from the nose; the significance of the nose in face orientation is also mentioned in [50,68].

The algorithm takes advantage of the simple pattern that the nose always points in the direction the head is facing. For example, if the head is facing left, the nose will be on the left side of the face. To detect this, the nose landmark must be compared to other facial landmarks, and standardization enables this comparison. The standardized nose landmark coordinate can be used directly as an indicator of face orientation without the need for further calculations and the value of zero means the nose at the center of the face, looking straight (see Figure 3 for reference). In that case, x of the standardized nose landmark represents the horizontal face orientation and y, vertical face orientation. Equations (6) and (7) presents a simple ad hoc threshold on standardized nose landmark to detect the head orientation.

A_{h o r i z o n t a l} = \{\begin{matrix} l e f t, & if x_{s t d} \leq - 0.3 \\ r i g h t, & if x_{s t d} \geq 0.3 \\ s t r a i g h t, & otherwise \end{matrix}

(6)

A_{v e r t i c a l} = \{\begin{matrix} u p, & if y_{s t d} \leq 0.05 \\ d o w n, & if y_{s t d} \geq 0.6 \\ s t r a i g h t, & otherwise \end{matrix}

(7)

3.3. Blink Detection—Adaptive Threshold

The authors compare three methods for detecting eye blinks, EAR with constant threshold, eyelid distance with adaptive threshold and EAR with adaptive threshold. Both adaptive thresholds incorporate face orientation, which is the standardized landmark of the nose as mentioned in the previous section.

3.3.1. EAR with Constant Threshold

For constant EAR threshold, the author used 0.2 as in [58]. Given that

E A R

is calculated using Equation (2) with unstandardized landmarks, an eye is considered blinking if

E A R < 0.2 .

(8)

3.3.2. Eyelid Distance with Adaptive Threshold

The vertical face orientation plays a statistically significant role in eye blink detection from eyelid distance as it alone accounted for 87.5% of the variation in the perceived eye openness threshold (see regression analysis in Figure 4). The author presents an adaptive threshold, Equation (9) that is derived from the regression in Figure 4.

T_{y} = - 0.0228 x + 0.0162 y + 0.0792 e^{y^{2}}

(9)

where

T_{y}

denotes threshold function and y, the vertical face orientation. Then, the authors calculate the eye lid distances as in Equation (10) and (11) (for reference of the left eye landmarks, check Figure 5.

E L D_{l e f t} = | | L_{s t d}^{159} - L_{s t d}^{145} {| |}_{2}^{2}

(10)

E L D_{r i g h t} = | | L_{s t d}^{386} - L_{s t d}^{374} {| |}_{2}^{2}

(11)

Given the corresponding

E L D

and

T_{y}

, an eye is considered blinking if

E L D < T_{y} .

(12)

3.3.3. EAR with Adaptive Threshold

Unlike eyelid distance, both the horizontal and vertical face orientation play a statistically significant role in eye blink detection from EAR as 85.4% of the variation in the perceived eye openness threshold is explained by them (see regression analysis in Figure 6).

In this case, the adaptive threshold function is defined as

T_{x, y}^{E A R} = - 0.0135 y + 0.1202 y^{2} - 0.0487 x + 0.0821 e^{x^{2}}

(13)

where

T_{y}

denotes threshold function. x and y denote the horizontal and vertical face orientation.

Given that

E A R

is calculated using Equation (2) with standardized landmarks, an eye is considered blinking if

E A R < T_{x, y}^{E A R} .

(14)

3.3.4. Datasets

The authors run the algorithms against three different datasets, RT-BENE, Eyeblink8 and Talking Face, to test the performance and the robustness.

The RT-BENE (Real-Time Blink Extraction in Natural Environments) dataset [70] is a collection of videos captured in natural environments, aimed at providing a challenging and realistic testbed for blink detection algorithms. It includes individuals performing various activities, a wide range of lighting conditions and head poses, captured at 30 frames per second with a resolution of

720 \times 576

pixels, annotated with blink events, and providing information about head pose and lighting conditions. The dataset has a total of approximately 20,000 blink events and is widely used as benchmark dataset in blink detection research.

The Eyeblink8 dataset is considered to be more difficult to work with due to the presence of facial expressions, head movements, and instances of individuals looking down at a keyboard. It consists of 408 instances of both eyes blinking (2571 when counting each eye individually as detected by MediaPipe Facemesh) captured across 70,992 video frames, and the data was annotated by [71]. The videos have a resolution of

640 \times 480

and were captured at 30 frames per second, with an average length ranging from 5000 to 11,000 frames.

The Talking Face dataset features a single video recording of a subject speaking and displaying various facial expressions in front of the camera. The video was captured at 25 frames per second and has a resolution of

720 \times 576

. The dataset includes 61 instances of both eyes blinking (valid 312 individual blinks) that have been annotated by [71,72].

4. Experimental Results

In this study, the performance of the face orientation estimation method is evaluated using the CMU dataset [73], which includes faces in four orientations (left, right, up and straight). For the eye blink detection, the adaptive threshold models were evaluated using three different datasets, RT-BENE [70], Eyeblink8 [71] and Talking Face [71,72]. The results of the experiments are reported using several commonly used metrics for classification models, such as F1 score, Precision, Recall, Accuracy and AUC-ROC (Area Under the Receiver Operating Characteristic Curve). Additionally, the runtime performance of the algorithms on Raspberry Pi 4 Model B is presented.

4.1. Face Orientation

The confusion matrix in Figure 7 shows the evaluation of face orientations on CMU Face dataset using the threshold function described in Equations (6) and (7) of Section 3.2. This extremely simple algorithm using a nose landmark achieved 94.78% on both Accuracy and F1-score on the CMU dataset. This is comparable to the algorithm in [50], which requires more complex feature extraction and classification.

4.2. Blink Detection

The authors evaluated eye blink detection algorithms, CEAR (EAR with Constant Threshold), AELD (Eyelid Distance with Adaptive Threshold) and AEAR (EAR with Adaptive Threshold), using three datasets: RT-BENE, Eyeblink8 and TalkingFace. Regression-based adaptive thresholds are fitted on the training part of RT-BENE dataset. The evaluations were conducted on the test part of the RT-BENE, and all videos were from Eyeblink8 and TalkingFace. The average performance of each algorithm or model can be found in Table 1.

Based on the F1 score, the best-performing algorithm was AELD, with an average score of 53.14%. CEAR achieved an average score of 30.49%, while AEAR achieved an average score of 51.65%.

In terms of accuracy, AELD also achieved the highest average score of 97.59%, which was not significantly different from AEAR, which achieved an average score of 97.53%. CEAR achieved the lowest average score of 89.59%. However, CEAR achieved the highest average AUC-ROC score with 81.59% and the highest average Recall up to 73.27%.

Overall, the proposed adaptive threshold with face orientation saw a noticeable and consistent improvement in F1 score and accuracy. It is worth noting that single eyelid distance can perform on par with EAR, given the use of adaptive threshold and landmark standardization. All three models perform below average in terms of F1 score on Eyeblink8 dataset.

4.2.1. EAR with Constant Threshold

Using the constant threshold of 0.2 as in [58], on the RT-BENE dataset, CEAR has an accuracy of 89.23%, an F1 of 24.64%, and an AUC-ROC of 80.84%. On the Eyeblink8 dataset, the model has an accuracy of 84.80%, an F1 of 10.58%, and an AUC-ROC of 66.65%. On the TalkingFace dataset, the model has an accuracy of 94.74%, an F1 of 54.26%, and an AUC-ROC of 97.28%.

The model detected the most eye blinks out of all three datasets and has the highest recall and AUC-ROC scores. However, it performed poorly and was below average in terms of precision and F1 score. This indicates that the model has an imbalance between False Positives and False Negatives. The constant threshold of 0.2 has the highest trade-off between true positive rate and false positive rate, greedily detecting eye blinks. The detailed results on each dataset can be found in Table 2.

4.2.2. Eyelid Distance with Adaptive Threshold

Eyelid distance with adaptive threshold (Equation (9)) achieved better results in general according to the results in Table 3. This simple eyelid distance threshold model performs the best in terms of accuracy and particulary, F1 score. Thus, it offers better balance between False Negatives and False Positives. Additionally, it performed well in terms of AUC-ROC score with an average of 80.44%, offering a good trade-off.

4.2.3. EAR with Adaptive Threshold

EAR with adaptive threshold (Equation (13)) also shows improvement in accuracy and F1 score, compared to CEAR. It achieved relatively similar accuracy and F1 score compared to the best, AELD. The experimental results can be found in Table 4.

4.3. Runtime Performance

The authors evaluated the performance of the example solution shown in Figure 2 using a sample video from [74] at two different resolutions,

1280 \times 720

(HD) and

1920 \times 1080

(FHD). The experiments were conducted on a Raspberry Pi 4 Model B with 1 GB RAM. The solution uses Mediapipe’s “FlowLimiterCalculator” component to achieve real-time low latency performance in practice, but the author disabled it during experimentation. The results, which can be found in Figure 8, show that the solution runs at an average of 30 frames per second on HD video and approximately 13 frames per second on FHD video.

5. Conclusions

To conclude, the authors have

Proposed an initiative open-source proctoring toolkit [32] for online engagement analytics with face orientation and eye blink detection algorithms that can be used on consumer devices or edge devices
Demonstrated the effectiveness of z-score standardization for facial landmarks
Statistically proven the impact of face orientation on eye blink
Improved F1 score and Accuracy for Eye Aspect Ratio (EAR) using adaptive threshold
Proposed a simple and efficient face orientation detection.

Given the z-score standardization and adaptive threshold based on face orientation, eye blink detection with one vertical eyelid distance can achieve a better F1 score and accuracy than Eye Aspect Ratio (EAR) thresholding methods. The authors detect the individual eye blink based solely on facial landmarks. As demonstrated in [59], other factors such as timing and smoothing of eyelid distance signal can drastically improve the F1 score and the accuracy.

In the paper, the MediaPipe Facemesh [31] is utilized, which generates 468 facial landmarks. However, only the landmarks related to the eyes and nose were used in this research. Future work should include examination of z-score standardization’s impact when applied to fewer facial landmarks as in dlib [75].

It is important to note that in real-world proctoring and online engagement analytics, authentication and security measures are also crucial to ensure the integrity. Future work should include the integration of authentication mechanisms in the proposed toolkit to make it more secure and practical for real-world scenarios.

Author Contributions

Conceptualization, S.T. and J.R.; methodology, S.T. and J.R.; software, S.T.; validation, S.T. and J.R.; formal analysis, S.T. and J.R.; investigation, S.T. and J.R.; resources, S.T. and J.R.; data curation, S.T.; writing—original draft preparation, S.T. and J.R.; writing—review and editing, S.T. and J.R.; visualization, S.T.; supervision, J.R.; project administration, S.T. and J.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

CMU dataset [73] can be found on UCI Machine Learning Repository [76] and may use the material free of charge for any educational purpose, provided attribution is given in any lectures or publications that make use of the material. RT-BENE dataset [70] is licensed under “Creative Common License” and cannot be used commercially. However, it can be used in publications given the proper citation and can be found on Open Science Platform, Zenedo [70]. The authors provided implementation of the proposed algorithms on GitHub and Zenedo [32] along with configuration script to download the datasets and the notebooks to reproduce the experimental results.

Conflicts of Interest

The authors declare no conflict of interest.

Sample Availability

Sample web application for the proposed algorithm can be found on [77]. Although the algorithms were tested on MacOS and iOS by the time of writing, the authors have not supported any convenient sample application for these platforms. With the source code in [32], the algorithms can be integrated into Mediapipe and any custom solution.

Abbreviations

The following abbreviations are used in this manuscript:

CMU	Carnegie Mellon University (referred to a face dataset)
EAR	Eye Aspect Ratio
RT-BENE	A Dataset and Baselines for Real-Time Blink Estimation in Natural Environments
ANOVA	Analysis of Variance
OLS	Ordinary Least Square
EAR	Eye Aspect Ratio
CEAR	EAR with Constant Threshold
AELD	Eyelid Distance with Adaptive Threshold
AEAR	EAR with Adaptive Threshold
PnP	Perspective-n-Point
POSIT	Pose from Orthography and Scaling with ITerations

Appendix A. Statistics on Standardized Landmarks

An omnibus one-way ANOVA test was performed on 20 faces from the CMU dataset to compare the distributions of standardized landmarks and determine if there is a statistically significant difference in the mean and variance among the faces with the same facial expression and orientation. The null hypothesis that the mean of each face was the same was not rejected with a p-value of

1.0

, indicating that there was no statistically significant difference among the faces. This further reinforces the conclusion that z-score landmark standardization can be optimal for robust facial feature classification and meta analysis.

Appendix B. Statistics on Face Orientation

Z-tests with a p-value of

0.0

showed that when a face turns left, its nose’s standardized x coordinate tends towards negative values and when it turns right, it tends towards positive values. Thus, horizontal face orientation can be concluded as in Equations (A1) and (A2).

\begin{matrix} x_{n o s e}^{-} \Rightarrow & l e f t \end{matrix}

(A1)

\begin{matrix} x_{n o s e}^{+} \Rightarrow & r i g h t . \end{matrix}

(A2)

In the case of vertical orientation, the results of the z-test showed that when a face is facing upwards, its nose’s standardized y coordinate tends towards a negative value (Equation (A3)). However, as the CMU dataset did not include faces facing downwards, the authors assumed that the nose’s y coordinate would be positive in such an orientation.

\begin{matrix} y_{n o s e}^{-} \Rightarrow & u p . \end{matrix}

(A3)

Appendix C. Example Application and Source Code

The native implementation of the proposed toolkit and its application in C++ can be found on GitHub [32]. Detailed instructions on how to compile the application using Bazel can be found in the repository documentation. The authors also implemented a convenient Python package, which you can install from pip, package installer and use as follows.

Listing A1. Example application in Python.

References

Iqbal, M. Zoom Revenue and Usage Statistics. 2022. Available online: https://www.businessofapps.com/data/Zoom-statistics/ (accessed on 3 January 2023).
Organisation for Economic Co-operation and Development. Teleworking in the COVID-19 Pandemic: Trends and Prospects; OECD Publishing: Paris, France, 2021. [Google Scholar] [CrossRef]
KPMG. The Future of Work: A Playbook for the People, Technology and Legal Considerations for a Successful Hybrid Workforce. 2021. Available online: https://assets.kpmg/content/dam/kpmg/ie/pdf/2021/10/ie-kpmg-hybrid-working-playbook.pdf (accessed on 3 January 2023).
Hilberath, C.; Kilmann, J.; Lovich, D.; Tzanetti, T.; Bailey, A.; Beck, S.; Kaufman, E.; Khandelwal, B.; Schuler, F.; Woolsey, K. Hybrid work is the new remote work. Boston Consulting Group. 22 September 2020. Available online: https://www.bcg.com/publications/2020/managing-remote-work-and-optimizing-hybrid-working-models (accessed on 3 January 2023).
Barrero, J.M.; Bloom, N.; Davis, S.J. 60 Million Fewer Commuting Hours Per Day: How AMERICANS Use Time Saved by Working from Home; University of Chicago, Becker Friedman Institute for Economics: Chicago, IL, USA, 2020. [Google Scholar] [CrossRef]
Hussein, M.J.; Yusuf, J.; Deb, A.S.; Fong, L.; Naidu, S. An evaluation of online proctoring tools. Open Prax. 2020, 12, 509–525. [Google Scholar] [CrossRef]
Barrero, J.M.; Bloom, N.; Davis, S.J. Why Working from Home Will Stick; Technical Report No. 28731; National Bureau of Economic Research: Cambridge, MA, USA, 2021. [Google Scholar] [CrossRef]
Fana, M.; Milasi, S.; Napierala, J.; Fernández-Macías, E.; Vázquez, I.G. Telework, Work Organisation and Job Quality During the COVID-19 Crisis: A Qualitative Study; Technical Report No 2020/11; JRC Working Papers Series on Labour, Education and Technology; European Commission, Joint Research Centre (JRC): Seville, Spain, 2020; Available online: http://hdl.handle.net/10419/231343 (accessed on 3 January 2023).
Russo, D.; Hanel, P.H.; Altnickel, S.; van Berkel, N. Predictors of well-being and productivity among software professionals during the COVID-19 pandemic—A longitudinal study. Empir. Softw. Eng. 2021, 26, 1–63. [Google Scholar] [CrossRef]
Jeffery, K.A.; Bauer, C.F. Students’ Responses to Emergency Remote Online Teaching Reveal Critical Factors for All Teaching. J. Chem. Educ. 2020, 97, 2472–2485. [Google Scholar] [CrossRef]
Bailenson, J.N. Nonverbal Overload: A Theoretical Argument for the Causes of Zoom Fatigue. Technol. Mind Behav. 2021, 2. Available online: https://tmb.apaopen.org/pub/nonverbal-overload (accessed on 3 January 2023). [CrossRef]
Elbogen, E.B.; Lanier, M.; Griffin, S.C.; Blakey, S.M.; Gluff, J.A.; Wagner, H.R.; Tsai, J. A National Study of Zoom Fatigue and Mental Health During the COVID-19 Pandemic: Implications for Future Remote Work. Cyberpsychol. Behav. Soc. Netw. 2022, 25, 409–415. [Google Scholar] [CrossRef]
Fauville, G.; Luo, M.; Queiroz, A.C.M.; Bailenson, J.N.; Hancock, J. Zoom Exhaustion & Fatigue Scale 2021. Available online: https://ssrn.com/abstract=3786329 (accessed on 3 January 2023).
Fauville, G.; Luo, M.; Queiroz, A.C.; Bailenson, J.; Hancock, J. Nonverbal Mechanisms Predict Zoom Fatigue and Explain Why Women Experience Higher Levels than Men. SSRN. 14 April 2021. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3820035 (accessed on 9 December 2022).
Todd, R.W. Teachers’ perceptions of the shift from the classroom to online teaching. Int. J. Tesol Stud. 2020, 2, 4–16. [Google Scholar] [CrossRef]
Perez-Mata, M.N.; Read, J.D.; Diges, M. Effects of divided attention and word concreteness on correct recall and false memory reports. Memory 2002, 10, 161–177. [Google Scholar] [CrossRef] [PubMed]
Lin, C.H.; Wu, W.H.; Lee, T.N. Using an online learning platform to show students’ achievements and attention in the video lecture and online practice learning environments. Educ. Technol. Soc. 2022, 25, 155–165. Available online: https://www.jstor.org/stable/48647037 (accessed on 3 January 2023).
Lodge, J.M.; Harrison, W.J. Focus: Attention science: The role of attention in learning in the digital age. Yale J. Biol. Med. 2019, 92, 21. Available online: https://pubmed.ncbi.nlm.nih.gov/30923470 (accessed on 3 January 2023).
Tiong, L.C.O.; Lee, H.J. E-cheating Prevention Measures: Detection of Cheating at Online Examinations Using Deep Learning Approach–A Case Study. ar** social presence in online classes: A Japanese higher education context. J. Foreign Lang. Educ. Res. 2021, 2, 174–183. [Google Scholar] [CrossRef]
Arnò, S.; Galassi, A.; Tommasi, M.; Saggino, A.; Vittorini, P. State-of-the-art of commercial proctoring systems and their use in academic online exams. Int. J. Distance Educ. Technol. 2021, 19, 55–76. [Google Scholar] [CrossRef]
Nigam, A.; Pasricha, R.; Singh, T.; Churi, P. A systematic review on AI-based proctoring systems: Past, present and future. Educ. Inf. Technol. 2021, 26, 6421–6445. [Google Scholar] [CrossRef] [PubMed]
Atoum, Y.; Chen, L.; Liu, A.X.; Hsu, S.D.; Liu, X. Automated online exam proctoring. IEEE Trans. Multimed. 2017, 19, 1609–1624. [Google Scholar] [CrossRef]
Jia, J.; He, Y. The design, implementation and pilot application of an intelligent online proctoring system for online exams. Interact. Technol. Smart Educ. 2021, 19, 112–120. [Google Scholar] [CrossRef]
Agarwal, V. Proctoring-AI. 2020. Available online: https://github.com/vardanagarwal/Proctoring-AI.git (accessed on 3 January 2023).
Namaye, V.; Kanade, A.; Nankani, T. Aankh. 2022. Available online: https://github.com/tusharnankani/Aankh.git (accessed on 3 January 2023).
Fernandes, A.; Fernandes, A.; D’silva, C.; D’cunha, S. GodsEye: Smart Virtual Exam System. 2022. Available online: https://github.com/AgnellusX1/GodsEye.git (accessed on 3 January 2023).
Baltrušaitis, T.; Robinson, P.; Morency, L.P. OpenFace: An open source facial behavior analysis toolkit. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–10. [Google Scholar] [CrossRef]
Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja, E.; Hays, M.; Zhang, F.; Chang, C.L.; Yong, M.G.; Lee, J.; et al. MediaPipe: A Framework for Building Perception Pipelines. ar**v 2019, ar**v:1906.08172. [Google Scholar] [CrossRef]
Kartynnik, Y.; Ablavatski, A.; Grishchenko, I.; Grundmann, M. Real-time Facial Surface Geometry from Monocular Video on Mobile GPUs. ar**v 2019, ar**v:1907.06724. [Google Scholar] [CrossRef]
Saw, T. MediaPipe Proctoring Toolkit; Zenodo: Genève, Switzerland, 2023. [Google Scholar] [CrossRef]
Haga, A.; Takahashi, W.; Aoki, S.; Nawa, K.; Yamashita, H.; Abe, O.; Nakagawa, K. Standardization of imaging features for radiomics analysis. J. Med. Investig. 2019, 66, 35–37. [Google Scholar] [CrossRef]
Verhulst, S. Improving comparability between qPCR-based telomere studies. Mol. Ecol. Resour. 2020, 20, 11–13. [Google Scholar] [CrossRef]
Mohamad, I.B.; Usman, D. Standardization and its effects on K-means clustering algorithm. Res. J. Appl. Sci. Eng. Technol. 2013, 6, 3299–3303. [Google Scholar] [CrossRef]
Milligan, G.W.; Cooper, M.C. A study of standardization of variables in cluster analysis. J. Classif. 1988, 5, 181–204. [Google Scholar] [CrossRef]
Ye, M.; Zhang, W.; Cao, P.; Liu, K. Driver fatigue detection based on residual channel attention network and head pose estimation. Appl. Sci. 2021, 11, 9195. [Google Scholar] [CrossRef]
Venturelli, M.; Borghi, G.; Vezzani, R.; Cucchiara, R. Deep head pose estimation from depth data for in-car automotive applications. In Understanding Human Activities Through 3D Sensors—Second International Workshop (UHA3DS 2016), Held in Conjunction with the 23rd International Conference on Pattern Recognition (ICPR 2016), Cancun, Mexico, 4 December 2016; Springer: Cham, Switzerland, 2016; pp. 74–85. [Google Scholar] [CrossRef]
Murphy-Chutorian, E.; Trivedi, M.M. Head pose estimation and augmented reality tracking: An integrated system and evaluation for monitoring driver awareness. IEEE Trans. Intell. Transp. Syst. 2010, 11, 300–311. [Google Scholar] [CrossRef]
Indi, C.S.; Pritham, K.; Acharya, V.; Prakasha, K. Detection of Malpractice in E-exams by Head Pose and Gaze Estimation. Int. J. Emerg. Technol. Learn. 2021, 16, 47–60. [Google Scholar] [CrossRef]
Prathish, S.; Narayanan, A.S.; Bijlani, K. An intelligent system for online exam monitoring. In Proceedings of the 2016 International Conference on Information Science (ICIS), Kochi, India, 12–13 August 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 138–143. [Google Scholar] [CrossRef]
Chuang, C.Y.; Craig, S.D.; Femiani, J. Detecting probable cheating during online assessments based on time delay and head pose. High. Educ. Res. Dev. 2017, 36, 1123–1137. [Google Scholar] [CrossRef]
Yang, T.Y.; Chen, Y.T.; Lin, Y.Y.; Chuang, Y.Y. FSA-Net: Learning Fine-Grained Structure Aggregation for Head Pose Estimation From a Single Image. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1087–1096. [Google Scholar] [CrossRef]
Li, H.; Xu, M.; Wang, Y.; Wei, H.; Qu, H. A Visual Analytics Approach to Facilitate the Proctoring of Online Exams. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI ’21), Yokohama, Japan, 8–13 May 2021; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
Ruiz, N.; Chong, E.; Rehg, J.M. Fine-grained head pose estimation without keypoints. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 2074–2083. [Google Scholar] [CrossRef]
Murphy-Chutorian, E.; Trivedi, M.M. Head Pose Estimation in Computer Vision: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 607–626. [Google Scholar] [CrossRef]
Narayanan, A.; Kaimal, R.M.; Bijlani, K. Yaw estimation using cylindrical and ellipsoidal face models. IEEE Trans. Intell. Transp. Syst. 2014, 15, 2308–2320. [Google Scholar] [CrossRef]
Ju, K.; Shin, B.S.; Klette, R. Novel Backprojection Method for Monocular Head Pose Estimation. Int. J. Fuzzy Log. Intell. Syst. 2013, 13, 50–58. [Google Scholar] [CrossRef]
Shao, M.; Sun, Z.; Ozay, M.; Okatani, T. Improving head pose estimation with a combined loss and bounding box margin adjustment. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–5. [Google Scholar] [CrossRef]
Baluja, S.; Sahami, M.; Rowley, H.A. Efficient face orientation discrimination. In Proceedings of the 2004 International Conference on Image Processing (ICIP’04), Singapore, 24–27 October 2004; IEEE: Piscataway, NJ, USA, 2004; Volume 1, pp. 589–592. [Google Scholar] [CrossRef]
Królak, A.; Strumiłło, P. Eye-blink detection system for human–Computer interaction. Univers. Access Inf. Soc. 2012, 11, 409–419. [Google Scholar] [CrossRef] [Green Version]
Jung, T.; Kim, S.; Kim, K. DeepVision: Deepfakes Detection Using Human Eye Blinking Pattern. IEEE Access 2020, 8, 83144–83154. [Google Scholar] [CrossRef]
Li, Y.; Chang, M.C.; Lyu, S. In ictu oculi: Exposing ai created fake videos by detecting eye blinking. In Proceedings of the 2018 IEEE International workshop on information forensics and security (WIFS), Hong Kong, China, 11–13 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–7. [Google Scholar] [CrossRef]
Danisman, T.; Bilasco, I.M.; Djeraba, C.; Ihaddadene, N. Drowsy driver detection system using eye blink patterns. In Proceedings of the 2010 International Conference on Machine and Web Intelligence, Algiers, Algeria, 3–5 October 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 230–233. [Google Scholar] [CrossRef]
Divjak, M.; Bischof, H. Eye Blink Based Fatigue Detection for Prevention of Computer Vision Syndrome. In Proceedings of the MVA, Yokohama, Japan, 20–22 May 2009; pp. 350–353. [Google Scholar]
Kim, K.W.; Hong, H.G.; Nam, G.P.; Park, K.R. A study of deep CNN-based classification of open and closed eyes using a visible light camera sensor. Sensors 2017, 17, 1534. [Google Scholar] [CrossRef]
Sukno, F.M.; Pavani, S.K.; Butakoff, C.; Frangi, A.F. Automatic assessment of eye blinking patterns through statistical shape models. In Proceedings of the 7th International Conference on Computer Vision Systems (ICVS 2009), Liège, Belgium, 13–15 October 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 33–42. [Google Scholar] [CrossRef]
Soukupová, T.; Cech, J. Real-Time Eye Blink Detection using Facial Landmarks. In Proceedings of the 21st Computer Vision Winter Workshop, Rimske Toplice, Slovenia, 3–5 February 2016. [Google Scholar]
Al-gawwam, S.; Benaissa, M. Robust Eye Blink Detection Based on Eye Landmarks and Savitzky–Golay Filtering. Information 2018, 9, 93. [Google Scholar] [CrossRef]
Ibrahim, B.R.; Khalifa, F.M.; Zeebaree, S.R.M.; Othman, N.A.; Alkhayyat, A.; Zebari, R.R.; Sadeeq, M.A.M. Embedded System for Eye Blink Detection Using Machine Learning Technique. In Proceedings of the 2021 1st Babylon International Conference on Information Technology and Science (BICITS), Babil, Iraq, 28–29 April 2021; pp. 58–62. [Google Scholar] [CrossRef]
Liu, J.; Li, D.; Wang, L.; **ong, J. BlinkListener: “Listen” to Your Eye Blink Using Your Smartphone. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2021, 5, 1–27. [Google Scholar] [CrossRef]
Wang, J.; Cao, J.; Hu, D.; Jiang, T.; Gao, F. Eye blink artifact detection with novel optimized multi-dimensional electroencephalogram features. IEEE Trans. Neural Syst. Rehabil. Eng. 2021, 29, 1494–1503. [Google Scholar] [CrossRef]
Hoffmann, S.; Falkenstein, M. The correction of eye blink artefacts in the EEG: A comparison of two prominent methods. PLoS ONE 2008, 3, e3004. [Google Scholar] [CrossRef] [PubMed]
Abo-Zahhad, M.; Ahmed, S.M.; Abbas, S.N. A new multi-level approach to EEG based human authentication using eye blinking. Pattern Recognit. Lett. 2016, 82, 216–225. [Google Scholar] [CrossRef]
Bulling, A.; Ward, J.A.; Gellersen, H.; Tröster, G. Eye movement analysis for activity recognition using electrooculography. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 741–753. [Google Scholar] [CrossRef]
Ishimaru, S.; Kunze, K.; Uema, Y.; Kise, K.; Inami, M.; Tanaka, K. Smarter eyewear: Using commercial EOG glasses for activity recognition. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication, Seattle, WA, USA, 13–17 September 2014; pp. 239–242. [Google Scholar] [CrossRef]
Kosmyna, N.; Morris, C.; Nguyen, T.; Zepf, S.; Hernandez, J.; Maes, P. AttentivU: Designing EEG and EOG compatible glasses for physiological sensing and feedback in the car. In Proceedings of the 11th International Conference on Automotive User Interfaces and Interactive Vehicular Applications, Utrecht, The Netherlands, 21–25 September 2019; pp. 355–368. [Google Scholar] [CrossRef]
Wilson, H.R.; Wilkinson, F.; Lin, L.M.; Castillo, M. Perception of head orientation. Vis. Res. 2000, 40, 459–472. [Google Scholar] [CrossRef]
Cheng, Q.; Wang, W.; Jiang, X.; Hou, S.; Qin, Y. Assessment of Driver Mental Fatigue Using Facial Landmarks. IEEE Access 2019, 7, 150423–150434. [Google Scholar] [CrossRef]
Cortacero, K.; Fischer, T.; Demiris, Y. RT-BENE: A Dataset and Baselines for Real-Time Blink Estimation in Natural Environments (Source). In Proceedings of the IEEE International Conference on Computer Vision Workshops, Seoul, Korea, 27–28 October 2019; Available online: https://zenodo.org/record/3685316#.Y2nO9C0RpQI (accessed on 3 January 2023).
Fogelton, A.; Benesova, W. Eye blink detection based on motion vectors analysis. Comput. Vis. Image Underst. 2016, 148, 23–33. [Google Scholar] [CrossRef]
Drutarovsky, T.; Fogelton, A. Eye Blink Detection Using Variance of Motion Vectors. In Proceedings of the Computer Vision—ECCV 2014 Workshops, Zurich, Switzerland, 6–7, 12 September 2014; Agapito, L., Bronstein, M.M., Rother, C., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 436–448. [Google Scholar]
Mitchell, T. CMU Face Images Data Set. Donated to UCI Machine Learning Repository. 1999. Available online: https://archive.ics.uci.edu/ml/datasets/cmu+face+images (accessed on 3 January 2023).
ROMAN, K. Selfies and Video Dataset (4000 People). 2022. Available online: https://www.kaggle.com/datasets/tapakah68/selfies-and-video-dataset-4-000-people (accessed on 3 January 2023).
King, D.E. Dlib-ml: A machine learning toolkit. J. Mach. Learn. Res. 2009, 10, 1755–1758. [Google Scholar]
Dua, D.; Graff, C. UCI Machine Learning Repository. 2017. Available online: http://archive.ics.uci.edu/ml (accessed on 3 January 2023).
Saw, T. Sample Application for MediaPipe Proctoring Solution. Available online: https://hakkaix.com/aiml/proctoring/sample (accessed on 3 January 2023).

Figure 1. MediaPipe Facemesh Left Eye Landmarks for calculating Eye Aspect Ratio

E A R

.

Figure 1. MediaPipe Facemesh Left Eye Landmarks for calculating Eye Aspect Ratio

E A R

.

Figure 2. Overview of the algorithm.

Figure 3. Face orientation and the nose landmark.

Figure 4. OLS regression on eyelid distance w.r.t. face orientation (Equation (9)).

Figure 5. MediaPipe Facemesh Left Eye Landmarks for calculating Eyelid Distance.

Figure 6. OLS Regression on EAR w.r.t. face orientation (Equation (13)).

Figure 7. CMU Face Orientation Confusion Matrix.

Figure 8. Runtime performance.

Table 1. Average Performance Metrics of Eye Blink Detection Approaches.

	Accuracy	F1	Precision	Recall	AUC-ROC
CEAR	89.59	30.49	19.34	73.27	81.59
AELD	97.59	53.14	46.43	62.53	80.44
AEAR	97.53	51.65	45.18	60.94	79.63

Table 2. Results for EAR with constant threshold.

	Accuracy	F1	Precision	Recall	AUC-ROC
RT-BENE	89.23	24.64	14.86	72.02	80.84
Eyeblink8	84.80	10.58	5.94	47.8	66.65
TalkingFace	94.74	54.26	37.23	100	97.28

Table 3. Results for eyelid distance with adaptive threshold.

	Accuracy	F1	Precision	Recall	AUC-ROC
RT-BENE	97.63	56.78	51.26	63.63	81.06
Eyeblink8	97	26.34	24.5	28.47	63.39
TalkingFace	98.15	76.31	63.53	95.51	96.87

Table 4. Results for EAR with adaptive threshold.

	Accuracy	F1	Precision	Recall	AUC-ROC
RT-BENE	97.11	51.85	43.75	63.63	80.79
Eyeblink8	97.22	26.03	26.13	25.94	62.26
TalkingFace	98.27	77.08	65.68	93.26	95.85

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Thiha, S.; Rajasekera, J. Efficient Online Engagement Analytics Algorithm Toolkit That Can Run on Edge. Algorithms 2023, 16, 86. https://doi.org/10.3390/a16020086

AMA Style

Thiha S, Rajasekera J. Efficient Online Engagement Analytics Algorithm Toolkit That Can Run on Edge. Algorithms. 2023; 16(2):86. https://doi.org/10.3390/a16020086

Chicago/Turabian Style

Thiha, Saw, and Jay Rajasekera. 2023. "Efficient Online Engagement Analytics Algorithm Toolkit That Can Run on Edge" Algorithms 16, no. 2: 86. https://doi.org/10.3390/a16020086

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Efficient Online Engagement Analytics Algorithm Toolkit That Can Run on Edge

Abstract

1. Introduction

1.1. The Future of Workplaces and Meetings

1.2. The Problems

1.3. Understanding Audience

1.4. Concerns

1.5. Existing Proctoring Solutions

1.6. A Toolkit for Proctoring

1.7. MediaPipe

1.8. Contributions

2. Related Works

2.1. Landmark Standardization

2.2. Head Pose Estimation

2.3. Eye Blink Detection

3. Materials and Methods

3.1. Landmark Standardization

3.2. Face Orientation

3.3. Blink Detection—Adaptive Threshold

3.3.1. EAR with Constant Threshold

3.3.2. Eyelid Distance with Adaptive Threshold

3.3.3. EAR with Adaptive Threshold

3.3.4. Datasets

4. Experimental Results

4.1. Face Orientation

4.2. Blink Detection

4.2.1. EAR with Constant Threshold

4.2.2. Eyelid Distance with Adaptive Threshold

4.2.3. EAR with Adaptive Threshold

4.3. Runtime Performance

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Sample Availability

Abbreviations

Appendix A. Statistics on Standardized Landmarks

Appendix B. Statistics on Face Orientation

Appendix C. Example Application and Source Code

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI