Technical Note

Towards Unlocking the Hidden Potentials of the Data-Centric AI Paradigm in the Modern Era

Department of Computer Engineering, Gachon University, Seongnam 13120, Republic of Korea
* Authors to whom correspondence should be addressed.
Appl. Syst. Innov. 2024, 7(4), 54; https://doi.org/10.3390/asi7040054
Submission received: 16 April 2024 / Revised: 15 June 2024 / Accepted: 20 June 2024 / Published: 24 June 2024
(This article belongs to the Section Artificial Intelligence)

Abstract:
Data-centric artificial intelligence (DC-AI) is a modern paradigm that prioritizes the systematic enhancement of data quality, rather than only optimizing the complex code of AI models. The DC-AI paradigm is expected to substantially advance the status of AI research and development, which has been based solely on model-centric AI (MC-AI) over the past 30 years. To date, very little is known about DC-AI, and its significance in solving real-world problems remains unexplored in the recent literature. In this technical note, we present the core aspects of DC-AI and MC-AI and discuss their interplay when used to solve real-world problems. We discuss the potential scenarios/situations that require the integration of DC-AI with MC-AI to solve challenging problems in AI. We performed a case study on a real-world dataset to corroborate the potential of DC-AI in realistic scenarios and to prove its significance over MC-AI when either data are limited or their quality is poor. Afterward, we comprehensively discuss the challenges that currently hinder the realization of DC-AI, and we list promising avenues for future research and development concerning DC-AI. Lastly, we discuss next-generation computing for DC-AI, which can foster DC-AI-related developments and help transition DC-AI from theory to practice. Our detailed analysis can guide AI practitioners toward exploring the undisclosed potential of DC-AI in the current AI-driven era.

1. Introduction

Artificial intelligence (AI) is a mainstream technology with a wide range of promising applications in different sectors, such as healthcare, smart cities, and chatbots. Representative AI applications in the medical area are disease classification using convolutional neural networks (CNNs) by leveraging image data [1], fetal brain MRI segmentation to identify brain abnormalities [2], accurate and effective segmentation of medical images for the clinical assessment of different diseases [3], personalized healthcare and medical content generation for personalized medication and surgery planning [4], medical question–answer systems [5], and situational awareness for people who are visually impaired or blind in indoor environments [6], to name just a few. Representative AI applications in industry are product quality and design optimization [7], fault detection and failure mode prediction [8], industrial predictive modeling by extracting salient and far features with CNNs [9], predictive maintenance [10], detecting defects in products [11], video surveillance [12], and robust continuous-flow manufacturing processes by imputing missing time series data [13], to name just a few. Due to compelling interest from both academia and industry, AI has vastly improved its architectural aspects, including the code of AI models, network architectures, and internal optimizations. Recent research is paying ample attention to further advancing AI technology via hyperparameter optimization, model size reduction (quantization), training AI models with as few data as possible, and pruning redundant weights [14]. Generally, AI contains a model (or algorithm) that determines a solution to a problem by finding key features (or observing patterns) in the given data. AI is highly dependent on both code and data (i.e., AI = C + D, where C denotes the models/algorithms and D denotes the data) to solve any real-world problem. In some cases, AI models can perform poorly due to incomplete or low-quality data, poorly tuned learning rates, or drastic changes in the conditions of the deployment environments. At present, there are two practical ways to increase AI models' performance: data-centric AI (DC-AI) and model-centric AI (MC-AI).
Andrew Ng introduced the idea of DC-AI on 24 March 2021 [15]. According to Ng, the main idea of DC-AI is to systematically improve data quality, rather than only the AI model's code, while developing AI-based systems. Since the introduction of DC-AI, many breakthrough developments have been made and many are underway. For instance, Hedge [16] applied the DC-AI approach to time series data and achieved 100% accuracy in anomaly detection. Motamedi et al. [17] applied the DC-AI concept to train a deep neural network with fewer, but good-quality, data. The authors offered a principled approach to improve accuracy while significantly reducing the model size. Some data-centric solutions have been developed for preparing high-quality data to make the training process fair and to reduce the computing overheads of complex neural network models. There are many real-world scenarios, like autonomous cars, where data can be noisy, messy, and heterogeneous, owing to the use of different sensors. Fusing data that stem from multiple sources often requires reliability, consistency, synergism, and alignment. This demands very careful attention to the cleaning/pre-processing, harmonization, formatting, transformation, and fusion of data. By leveraging a data-centric approach, we can pinpoint and resolve data-quality-related issues (i.e., missing/wrong data) and guarantee that the final data are dependable, robust, accurate, and suitable.
Despite these developments, there are various challenges and misconceptions about DC-AI, which can diminish further developments in it. It is worth noting that some studies have discussed the importance of the DC-AI paradigm for relevant communities [18], but many important questions/concerns have remained unexplored. For instance, many AI practitioners think that DC-AI is simply a form of pre-processing; however, this is not the case, because DC-AI encompasses more sophisticated techniques that are not part of pre-processing. Similarly, the AI research community lacks a clear understanding of when to properly amalgamate DC-AI with MC-AI to solve real-world problems. On the other hand, some DC-AI techniques are difficult to implement, as measuring the quality of entire datasets is a complex task, especially when the data volume and the number of modalities are high. Recently, some libraries like Influenciæ have been introduced, but it is still very challenging to identify bad parts of data [19]. Similarly, Python libraries like deepchecks have also recently been developed to evaluate ML models and datasets [20], as sketched below. Lastly, there is an inadequate understanding of the current status of this paradigm and what lies ahead. Motivated by these research gaps, we present a detailed analysis of this fledgling paradigm for the AI and computer science community.
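To make the deepchecks example concrete, the following minimal sketch shows how such a dataset-auditing library can be pointed at a tabular dataset to flag quality issues (duplicates, conflicting labels, mixed data types) before any model is trained. The file name and column names are hypothetical placeholders, and the exact API may vary across library versions.

```python
import pandas as pd
from deepchecks.tabular import Dataset
from deepchecks.tabular.suites import data_integrity

# Hypothetical tabular dataset with a binary label column.
df = pd.read_csv("records.csv")  # placeholder path
ds = Dataset(df, label="label", cat_features=["gender", "work_type"])

# Run the built-in data-integrity suite and save a human-readable report.
result = data_integrity().run(ds)
result.save_as_html("data_integrity_report.html")
```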
Although MC-AI gives comparable results for many real problems, e.g., natural language processing (NLP) [21], it is usually applicable to very limited scenarios where a large volume of data is available. In most real-life scenarios, large datasets are either unavailable or may not be easily curated due to limited budgets or a lack of expertise in handling big datasets. Therefore, achieving comparable accuracy with AI models under limited data is a complex task. To that end, DC-AI can be a feasible alternative that encompasses a large number of data-specific tools/techniques and can improve AI models' results [22]. DC-AI was introduced to counter the dominant MC-AI trend/mindset and to enhance AI technology advancement and development in the coming years [23]. Furthermore, DC-AI can likely enhance AI adoption in industrial and commercial settings [24].
This paper will have scholarly impact at the intersection of data quality enhancement, data optimization, machine/deep learning, and the use of AI in real situations. It is likely that this paper and its extensions will lead to further technical advancement in the AI field and will democratize AI developments that are currently in the hands of high-resourced organizations [25]. Consequently, many existing limitations of AI models can be rectified, and the benefits of AI can likely be extended to marginalized groups of our society. Next, we clarify the major goals of the study in terms of the conceptual, theoretical, design, and application perspectives. Conceptually, this paper explores an emerging paradigm that remains relatively less explored than MC-AI and holds significant potential to solve many real-world problems. Also, this topic can be a potential avenue for AI and database researchers who intend to explore new frontiers in AI systems. Theoretically, this paper can open up a new research dimension that can create opportunities for formal methods to prove the efficacy of DC-AI techniques, as well as their limitations. From a design perspective, it provides a reference/generic architecture and proofs of concept, which can be fine-tuned or customized depending on the problem/domain, leading to technical advancement in the AI field. From the application point of view, this paper provides insight into extending AI applications to domains where data are limited or the quality of the data is below par. Overall, this paper aims to systematically introduce an emerging paradigm, named DC-AI, along with the associated concepts/examples to increase AI developments and possibly reduce the harms of AI. Our major contributions are summarized as follows:
  • Systematic introduction of DC-AI: We systematically present what DC-AI entails and how it can be systematically applied in practice. We also discuss the role of DC-AI in advancing AI technology, which has not been comprehensively discussed in the literature [26].
  • Generic architecture of DC-AI: We devise the general architecture of the DC-AI paradigm and highlight its key requirements that need to be ensured in each phase, as well as in the entire lifecycle of AI-based projects, whereas existing research only discusses some of the DC-AI techniques [27].
  • Insight into when to amalgamate MC-AI with DC-AI: We pinpoint and describe five scenarios/situations when amalgamating DC-AI with MC-AI is necessary to solve longstanding technical, social, and industrial problems of conventional AI approaches, which have remained unexplored in the current literature [28].
  • Case study along with the empirical results: We report a case study (or specific example) to highlight the application of DC-AI in some real-world scenarios and report empirical results by comparing DC-AI with MC-AI, whereas most of the existing papers theoretically discuss the co-design of these approaches [26,28].
  • Holistic overview of DC-AI challenges and future prospects: We pinpoint the key challenges that are currently hindering DC-AI adoption worldwide and recommend promising avenues for future research that can assist in transforming AI from academic labs to the market.
  • Next-generation computing for the DC-AI paradigm: We discuss the next-generation computing for the DC-AI paradigm along with the relevant technologies that can contribute to transitioning DC-AI from theory to practice, which has remained unexplored in the recent literature [29].
  • A call for action to harness the potential of DC-AI: Through this paper, we aim to foster technical advancements in AI by utilizing DC-AI, and we hope to open up avenues for future development/research in this line of work. This is the first work that provides a broader understanding of DC-AI while keeping MC-AI in the loop.
The rest of this technical note is organized as follows. Section 2 discusses the background and related work concerning the subject matter discussed in this paper. Section 3 concisely discusses the workflows of DC-AI and common practices of MC-AI. The generic architecture of DC-AI is discussed in Section 4. Section 5 discusses the need to amalgamate DC-AI with MC-AI to solve longstanding real-world problems. Section 6 reports a case study (or specific example) to demonstrate the effectiveness of DC-AI over MC-AI along with the empirical results. Section 7 comprehensively discusses the challenges and prospects of DC-AI. Section 8 discusses the next-generation computing paradigm of DC-AI. We conclude this paper in Section 9.

2. Background and Related Work

The idea of DC-AI was coined in March 2021 by Andrew Ng [28], and since then, it has been rigorously investigated from diverse perspectives. The key innovation of the DC-AI idea is to improve data so that they yield consistent performance across ML models. In essence, the idea is to prepare model-agnostic data to solve any real-world problem. The quality of the data can be enhanced by applying sophisticated data engineering techniques that are more robust and extensive than the pre-processing used in conventional ML [27]. It is worth noting that data wrangling and quality enhancement have been used for a long time in the ML community, but DC-AI further enhanced their adoption status and made them continuously practiced throughout the entire lifecycle of an ML project, not just once [30]. Recently, DC-AI has been widely investigated from the perspectives of enhancing the accuracy of AI models with fewer data, optimizing AI models' structures and making them lightweight, augmenting the generalizability and unification of AI models, generating synthetic data of diverse modalities to balance data, developing tools to either spot problems in data or figure out vulnerabilities of diverse types, resolving bias or unfairness issues of AI models by improving data, and making AI models more explainable.
DC-AI is in the early stage of development, and therefore, most of the previous studies are theoretical/perspective works. Wang et al. [31] discussed the challenges of data collection and quality enhancement. The authors developed a decision-tree-like model to show the connection between different DC-AI techniques in one workflow. Aldoseri et al. [32] examined diverse challenges of data when used in AI models and provided solutions on how companies can address them. Clemente et al. [33] developed an open-source practical tool, named ydata-profiling, for exploratory data analysis. The tool enables data engineers to create data profiles in a fast, simple, and efficient manner. Luley et al. [34] discussed the obstacles faced by small and medium-sized enterprises (SMEs) in adopting MC-AI, and how those obstacles can be addressed by using DC-AI. Holstein [35] developed a system, named VIS4ML, to enable better data understanding and collaboration between data curators. Song et al. [36] developed a practical method, named BalanceMix, to solve the multi-label classification problem. The key step is to improve data by balancing the distributions and removing noisy labels from the training data.
Zhu et al. [37] recently developed a practical data augmentation strategy in the context of regression tasks, which is a relatively less explored topic. The authors demonstrated the effectiveness of their idea in mitigating the adverse effects of an imbalanced data distribution and verified their approach's adaptability to diverse regression tasks. Mitchell et al. [38] proposed a method for measuring data quality and developed a variety of data-measurement approaches that can be used in research and practice. Bertucci et al. [39] developed DendroMap, powered by Treemaps, to explore image data. It enables users to gain insight into data properties and distributions in ML pipelines. Johnson et al. [40] developed a slice-discovery method to identify underperforming populations/groups and high-error subsets in training data. The proposed method can be helpful for identifying and repairing the bad parts of data. Hansen et al. [41] explored ways to integrate data-centric insights while generating high-quality synthetic data. Anik et al. [42] explored ways to explain training data characteristics to allow users to judge the fairness and trustworthiness of ML models. Pi et al. [43] devised a method to enhance the performance of active learning, which is widely used to label data. Some frameworks have also been developed to accelerate the training of AI models via data optimization and transformations [44].
Based on an in-depth analysis of the SOTA studies published thus far, it can be concluded that research on DC-AI is still in its infancy. Researchers are advancing this new paradigm from multiple aspects, such as proposing ways to increase data quality, developing prototypes to gain data-related insights into ML models, proposing ways to quantify the effect of samples on AI models, figuring out underperforming populations, and addressing the technical and social issues of AI with DC-AI. However, more efforts are needed to adopt DC-AI in diverse applications and datasets to fully harness the potential of this new paradigm. Our article provides an insightful and comprehensive discussion of DC-AI and its associated concepts, and the presented analysis (basics, architecture, case study, challenges, future work) can spark further developments in this line of work.

3. Introduction of Model-Centric AI and Data-Centric AI

In this section, we describe the MC-AI and DC-AI workflows and highlight the main differences between them. A typical MC-AI approach can be applied to any real-world problem with the help of six steps: (1) problem definition, (2) data collection, (3) pre-processing of the collected data, (4) AI model training, (5) AI model deployment, and (6) performance monitoring of the deployed model [45]. Specifically, AI models are trained on data collected from relevant users/environments after basic pre-processing; performance is analyzed, and then, the models are deployed in real environments. Subsequently, performance is gauged, and if the performance is poor, only architectural aspects of the AI model are improved (e.g., modifying the network structure, altering the model size, tuning the hyperparameters, and varying α) [46]. In the MC-AI approach, AI practitioners mainly focus on five things [27]: (i) advancing the network architecture of AI models, (ii) optimizing the architecture of AI models by removing unused layers or pruning redundant weights, (iii) developing new AI models, (iv) devising new training methods and advancing data modalities, and (v) improving performance via hyperparameter tuning.
In MC-AI, three common practices are adopted to resolve performance-related issues:
  • Rigorously fine-tune the code/algorithm of the AI model.
  • Obtain additional data for everything and take the average, or optimize the learning rate (α).
  • Switch to an alternate AI model (CNN → LSTM or LSTM → RNN).
All of the above practices have potential drawbacks when it comes to some specific areas. For example, the first practice may yield only minimal enhancement in the AI model's results; the second practice might increase the time complexity (or start a never-ending data curation/collection cycle); the third may significantly enlarge the commercialization time (e.g., development → deployment) of AI products. To solve these problems, AI experts need to go back and scrutinize the entire process, even when the data are not being curated. However, this practice (tracing each phase) can be costly and can stall the AI implementation process for a long period. Hence, an alternate approach is needed to address these problems and to advance the status of AI developments.
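For concreteness, the sketch below illustrates the first and third MC-AI practices with scikit-learn: exhaustively tuning one model's hyperparameters and then switching to an alternate model family when results stall. The synthetic data and parameter grids are illustrative placeholders only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic placeholder data standing in for a real training set.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Practice 1: rigorously fine-tune the model's hyperparameters.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=5, scoring="accuracy",
)
grid.fit(X, y)
print("tuned RF cross-validated accuracy:", grid.best_score_)

# Practice 3: switch to an alternate model family (e.g., RF -> SVM).
svm = SVC(kernel="rbf", C=1.0).fit(X, y)
```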
Figure 1 shows the workflow of the DC-AI paradigm when applied in any realistic setting. In contrast to MC-AI, DC-AI suggests rigorous inspection and debugging of data before building (or training) AI models and can be vastly successful in some specific areas involving limited or poor-quality data. Also, when an AI model gives poor results, developers are required to inspect the data, as well as the code/algorithm. In-depth and rigorous screening of the data is performed (Step 3), and if the results are not satisfactory (Step 7), the code and data are both improved. By paying ample attention to the data in all phases of AI system development, misleading samples are not inadvertently propagated to the training process, and therefore, highly reliable AI systems can be developed [47,48]. Fundamental differences exist between the two approaches: the prime focus of MC-AI is to optimize code and code-related aspects; in contrast, DC-AI explores ways to improve data quality and consistency.

Quantitative Analysis/Comparison between MC-AI and DC-AI

The essence of DC-AI is to improve data in order to accomplish fair learning and better performance. The key evaluation aspects for comparing DC-AI and MC-AI are accuracy, model complexity (in terms of parameters), the least number of samples without degrading accuracy, the false positive/negative rate, precision, recall, sensitivity, specificity, the F1 measure, etc. To better highlight the promise of DC-AI and to compare it with MC-AI, we report quantitative results on the steel sheet inspection scenario [22]. In this scenario, the goal is to accurately predict defects of different kinds on steel sheets. An overview of the steel sheets under investigation and examples of the defects are shown in Figure 2.
The quantitative results of both approaches on the steel sheet inspection scenario are given in Table 1. In this example, the baseline approach has an accuracy of 76.20%, and the expected accuracy is ≥90%. To this end, the MC-AI approach and related techniques (discussed in earlier parts of this section) yielded no enhancement in accuracy, as shown in the left part of Table 1. Therefore, the target accuracy was not achieved even after trying five different MC-AI-related techniques. In contrast, DC-AI yielded a substantial enhancement in accuracy, as shown in the right part of Table 1, and the accuracy is higher than expected. Based on these results, it is fair to say that DC-AI has better quantitative results than MC-AI, particularly when some desired accuracy targets exist.
In another study, Andrew Ng envisioned that, with DC-AI's sole focus on data quality enhancement, fifty well-crafted examples/images are sufficient to train a neural network (NN) model in some cases [15]. In contrast, MC-AI mostly ignores such aspects and builds NN models with entire datasets encompassing millions of images/examples. These results and experimental findings indicate better quantitative results/performance from DC-AI than MC-AI.

4. General Architecture of DC-AI

Figure 3 presents a general architecture for DC-AI, along with the relevant components, that can be applied to solve any real-life problem using AI.
The analysis in Figure 3 helps in understanding the main focus and workflow of DC-AI when solving problems. In the DC-AI paradigm, there are six key steps. In the first step, experts from industry, academia, or both define the problem that needs to be solved via AI. In the second step, data are collected from the relevant domains/people. In the third step, data engineering practices are rigorously applied to carefully improve the data quality. In the fourth step, a suitable model is trained using the high-quality data produced in the former step. In the fifth step, the hyperparameters are tuned to obtain better results with AI models. In the last step, the trained model is deployed in realistic settings for inference. In DC-AI, various approaches are employed in distinct stages of the AI project lifecycle to enhance the quality of the data, leading to reliable and transformative AI system development for social good. In addition, some general requirements are ensured across the project's lifecycle to develop reliable AI solutions. DC-AI can resolve many drawbacks of MC-AI, discussed in the next section, and can extend the applicability of AI to many real-world settings that cannot be addressed with the MC-AI approach alone.
It is worth noting that DC-AI is a new paradigm, whereas data quality, cleaning, and wrangling are well-studied fields in the AI domain. The quality of data has a strong effect on AI models, as well as on downstream tasks, and therefore, data quality is enhanced in most AI applications. To inspect or improve data quality, many sophisticated operations such as consistency, timeliness, relevance, effectiveness, accuracy, and meta-data management are performed. All these operations can assist in addressing multiple problems, such as noise reduction, filtering of improper labels, tossing out ambiguous samples, addressing data imbalance, and curating more samples/labels to repair problematic data [49]. Data cleaning is another promising technique that is applied to remove outliers from data and to handle missing values. Outliers are removed using min–max analysis for numerical data and visual plots (histograms) for non-numerical data. Samples with missing values are either removed or imputed by determining new values. In addition, duplicate instances are removed to lower the complexity of AI models. Data cleaning is imperative to prevent the poor performance of AI models, and it also contributes to correctly capturing the statistical relationships between features [50]; a minimal sketch of these cleaning operations is given below. Many advanced techniques such as confident learning, curriculum learning, active learning, core set selection, model-aware cleaning, data integration, etc., are vastly contributing to realizing DC-AI.
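The following pandas sketch illustrates the basic cleaning operations described above (deduplication, median imputation for missing values, and a simple quantile-based variant of min–max outlier filtering). The 1st/99th-percentile bounds and the example column names are illustrative assumptions rather than a prescription.

```python
import pandas as pd

def clean_numeric(df: pd.DataFrame, numeric_cols: list) -> pd.DataFrame:
    """Deduplicate, impute, and filter outliers in the given numeric columns."""
    df = df.drop_duplicates().copy()                # remove duplicate instances
    for col in numeric_cols:
        df[col] = df[col].fillna(df[col].median())  # impute missing values
        lo, hi = df[col].quantile([0.01, 0.99])     # illustrative min-max bounds
        df = df[df[col].between(lo, hi)].copy()     # toss out extreme outliers
    return df.reset_index(drop=True)

# Usage with hypothetical column names:
# df = clean_numeric(df, numeric_cols=["age", "avg_glucose_level", "bmi"])
```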
In some cases, data can stem from multiple domains/sources, and therefore, consistency, alignment, and reliability are imperative while integrating multiple sources of data. To this end, data wrangling techniques are necessary to improve data quality while developing any AI/ML model [46,51]. For example, the structure analysis (also known as discovering) operation of data wrangling can help in analyzing the data, which can be helpful to identify the corresponding operations to be performed on the data. Similarly, data validation contributes to checking and improving the consistency and distribution of data before use. Data enrichment explores ways to identify and resolve multiple issues concerning data quality. For example, if the data size is small, data enrichment can guide the addition of more samples to the data. It can also suggest relevant sampling or augmentation techniques depending on the data/problem. In some cases, domain knowledge is leveraged to curate the best features or to transform the features to improve data quality, leading to the performance enhancement of AI models. DC-AI encompasses all of the above-cited techniques and some more advanced techniques to ensure that the final data are sound, robust, accurate, and highly relevant to the problem being solved via AI.
It is worth noting that the architecture of the DC-AI paradigm presented in Figure 3 is very general and requires modifications when applied to different real-world problems/applications. The number and type of techniques are subject to change depending on the quality of the data, as well as the problem under investigation. Also, some additional methods might be needed to quantify the impact of each data point (or image) on the model performance. Similarly, some visualization tools like t-SNE, UMAP, Random Projection, etc., might be required to perform exploratory data analysis of large datasets encompassing many views with diverse data types [52]. Furthermore, some design choices about data collection and the size of the data also require in-depth investigation and collaboration with multiple stakeholders. The utilization of privacy mechanisms might be challenging in some applications, particularly when high utility/accuracy is desirable. Additionally, the identification of out-of-distribution (OOD) data/samples can be challenging for massive datasets and requires techniques like an isolation forest to perform this job (a minimal sketch is given below). Removing bias completely from the data might be challenging when the sources of the data are unknown or the data need to be collected from some specific community/individuals. Ensuring the responsible use of data and preventing misuse throughout the entire lifecycle of an AI project is a complex task, and more technical solutions are required to ensure the right use of the data. Lastly, the adoption of standard practices like 'datasheets for datasets' needs to be integrated with the proposed architecture to properly document the datasets and associated details [53]. However, the architecture presented in this paper, along with the techniques, can serve as a reference architecture, which can be enriched/customized depending on real situations/problems.
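As a concrete example of the isolation forest technique mentioned above, the sketch below flags likely OOD rows in a feature matrix. The contamination rate (the assumed fraction of anomalous rows) and the random data are illustrative placeholders.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.normal(size=(10_000, 8))          # placeholder feature matrix

# contamination=0.01 assumes ~1% of rows are out-of-distribution.
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = iso.predict(X)                    # -1 marks suspected OOD samples
suspect_rows = np.where(flags == -1)[0]
print(f"{len(suspect_rows)} rows flagged for manual review")
```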
The architecture given in Figure 3 assists in accomplishing the conceptual goals of this paper, as most of the reported mechanisms are data-tailored, with the necessary focus on model-related aspects. Specifically, it provides phase-specific requirements for both data and models that can be useful for implementing the DC-AI approach. We report a detailed taxonomy of techniques for data engineering, which can contribute to data quality enrichment and the realization of the DC-AI concept in real-world scenarios. The theoretical goals can be met by determining the optimal order or types of techniques to be used in each scenario (or domain). Also, some critical aspects of the system can be modeled mathematically or proven formally with the help of statistical or mathematical measures. This architecture depicts the traditional workflow along with the useful techniques to be implemented in each phase, which can contribute to the fulfillment of the design goals of any AI project. From the application point of view, the architecture might not be suitable for every application; however, we believe it can still serve as a reference/baseline to be further customized for each application. Lastly, guaranteeing the feasibility and appropriateness of each technique may not be straightforward, and requires more in-depth investigation/analysis from the relevant entities.

5. Insight into When to Amalgamate DC-AI with MC-AI

Developers and users of AI technology face various problems with far-reaching implications, and rectifying AI technology has become urgent amid its wide-scale adoption [54,55]. To this end, the amalgamation of DC-AI and MC-AI might be necessary to help AI technology contribute more to the social good [56]. It is worth noting that there exist interactive relationships between DC-AI and MC-AI because the changes in data result in changes in the code of AI models. Figure 4 presents the interactive relationship between DC-AI and MC-AI based on the data characteristics/properties and coding aspects.
For example, a change in data size requires parameter adjustment in the relevant AI models, indicating the interactive relationship between code and data. The parameters of AI models differ for different data sizes to prevent under-fitting or over-fitting. Similarly, AI models (learning algorithms) are needed to evaluate data. For instance, to find hard examples, the decision boundary of an SVM can be helpful: points that lie close to the boundary are hard, and vice versa (a minimal sketch is given below). Similarly, AI models are subject to change based on data modality: SVM/RF are most suitable for tabular data, whereas neural networks are preferred for image data. Likewise, when the data modality changes, the type and value of the parameters are also subject to change. If some bad parts of the data are dropped, then the model also needs to be sparser (e.g., some layers can be dropped). When the data are aligned, they should be model-agnostic, meaning they can work with most models of a similar type. If the data are diverse (or if data diversity analysis needs to be performed), then they should yield consistent performance across AI models. Feature maps of the data properties can assist in checking the stability of AI model training, indicating the relationship of DC-AI with MC-AI. Lastly, if wrong samples are used, then model robustness decreases, prompting the need for a better model choice. All these examples demonstrate the interactive relationships between these two approaches. Also, some studies have already pinpointed the complementary relationship between these two approaches [28]. In some cases, both approaches have been jointly used to prepare high-quality and consistent data for training complex CNN models [57]. Since data and code are two key elements of AI technology development, there are very close relationships between them when it comes to solving real-life problems by leveraging AI.
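The SVM-based hard-example idea can be sketched as follows: train a linear SVM and rank the training points by their absolute distance to the decision boundary, so that the smallest margins identify the hardest examples. The synthetic data and the 50-example cutoff are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Placeholder tabular data standing in for a real dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

svm = SVC(kernel="linear").fit(X, y)
margin = np.abs(svm.decision_function(X))  # distance to the decision boundary
hard_idx = np.argsort(margin)[:50]         # smallest margin = hardest examples
print("indices of the 50 hardest examples:", hard_idx)
```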
The conceptual basis behind the integration of DC-AI with MC-AI is the intricate and fundamental relationship between data and the respective AI models/algorithms. Data are the cornerstone of, and the food for, AI models, and therefore, one cannot simply solve real-world problems by tweaking the AI models'/algorithms' code. Also, AI model performance is subject to change based on the quality and quantity of data; therefore, the data and the model are intrinsically linked with each other, justifying the conceptual basis for integrating these two approaches. Furthermore, a deficiency in the data cannot be compensated for through modifications in code alone. The theoretical basis behind the integration of MC-AI and DC-AI is the interplay between data and code when it comes to solving real-world problems. The equation of conventional AI is code + data; MC-AI prioritizes the code term of this equation, whereas DC-AI prioritizes the data term. As these equations show, only the priorities differ in each approach, and the underlying formal relationship is the same (data meet AI algorithms/models); therefore, the theoretical basis is justified as well. In some real-life cases, there are exact performance targets (e.g., accuracy over 95% or zero false positive/negative rates), and these targets cannot be accomplished with DC-AI/MC-AI alone; therefore, these approaches need to be integrated on such an empirical basis. In some cases, both code and data tuning might be simultaneously required to accomplish certain target performance objectives, justifying the empirical basis for this integration. Next, we explain some noteworthy scenarios/problems/situations that require the integration of DC-AI with MC-AI.

5.1. When Outcomes of MC-AI Alone Are Not Reliable

There have been growing concerns regarding automated decisions made by AI technology and their explanations. For example, a decision made by AI can be highly unfair when under-representative data are used in the training process [54]. Similarly, explanations of how a particular decision was made by a neural network model are often missing. Unfortunately, these outcome-related problems have not yet been resolved by the MC-AI approach. To this end, amalgamation with DC-AI is required to address these concerns. For example, decisions can be made fair after identifying faulty parts of the data and substituting them by curating additional data with generative tools. Explanations of the results can be provided by offering higher visibility in the training data, and by offering feature-level information. Careful integration of DC-AI with MC-AI can increase the reliability of AI results, and AI can widely contribute to the benefit of our society. Figure 5 presents real-world situations that require the simultaneous use of both paradigms.
In most situations listed in Figure 5, data can emerge from multiple sources, and therefore, the careful fusion of data and appropriate AI/ML model selection are imperative to yield better results. Furthermore, if static data are used, then data quality enrichment and cleaning are still required before feeding these data to AI/ML models to lower the computing overheads. In addition, the customization/upgrading of AI models is needed considering the frequent changes in data modalities, as well as data sources. In some cases, the amalgamation of both of these paradigms is necessary considering the problem being solved (e.g., equity and inclusion) to prevent the pitfalls of AI technology for nations.

5.2. To Meet Certain Performance Targets

In some cases, there are set performance targets (e.g., accuracy ≥ 90%, 0 FPR/FNR, etc.) that need to be met using AI. However, MC-AI alone may be unsuccessful in meeting those targets because it often overlooks the data, which constitutes a vital component of AI quality. Andrew Ng [22] highlighted a scenario (e.g., defect detection in steel) in which the target accuracy was set to 90%. The author sequentially applied MC-AI and DC-AI approaches and analyzed the impact on the resulting accuracy. Despite tweaking the code (i.e., utilizing the MC-AI approach), there was no change in accuracy. In three different tests, MC-AI minimally enhanced the accuracy (e.g., 0.00%, 0.04%, and 0.00%), and the required accuracy targets were not accomplished by MC-AI. In contrast, DC-AI was successful in enhancing the accuracy significantly, and the required accuracy targets were met. In three different tests, DC-AI enhanced the accuracy (e.g., 0.40%, 3.06%, and 16.9%). Through this scenario, we can conclude that DC-AI is needed to meet target objectives/scores while applying AI models/algorithms to solve real-world problems.

5.3. When Computing Overhead Is Unaffordable beyond a Certain Limit

Generally, deep learning models are trained with millions or billions of images or videos, leading to higher computing overhead. In some cases (e.g., the retail industry), overhead can be significantly large due to model-centric efforts (obtaining more data on everything, as cited earlier), and AI technology becomes unaffordable in some sectors. To address this, DC-AI is of pivotal importance because it requires collecting only the needed data, rather than doubling the existing data [15]. For example, with the MC-AI mindset, a facial recognition system trained on 350 million images would require 350 million more (700 million images in total) in the case of poor performance, whereas DC-AI may only require 50 more images because a small segment of images can be noisy. Considering these promises, the integration of DC-AI with MC-AI is vital to lowering computing overhead. Consequently, AI technology can become affordable, leading to humankind’s well-being around the world.

5.4. Augmenting Lifetimes of AI Systems

In most cases, the results of AI systems in academic labs and real-life environments are quite different and are subject to many external conditions that are often overlooked while developing AI systems [58]. Apart from other reasons, ignoring data completeness and timeliness is the leading cause of data drift in deployed AI systems, eventually degrading their performance. These problems can be fixed by retraining and/or redeveloping the AI system. However, such solutions are costly and hinder AI's application to solving real-world problems. In this regard, DC-AI's integration with MC-AI is necessary to ensure data quality and consistency while developing AI systems, leading to a longer lifetime.

5.5. Limited Availability of Representative Data

In many real-world cases, high-quality data may not be available, and the acquisition of new data can be difficult owing to constrained budgets. Therefore, the doubling rule for training data employed in MC-AI may not be possible. In this context, DC-AI is invaluable for improving the existing data, and compensating for a deficiency in giant datasets.
Besides the above-cited aspects/problems, the integration of DC-AI with MC-AI is necessary and can advance AI technology from multiple perspectives. MC-AI has already been well-investigated in the recent past. Hence, now is the right time to put more effort into DC-AI to improve the overall status of AI technology.

6. Case Study (Proof of Concept Example) to Evaluate the Effects of the Key Parameters of DC-AI on Performance

The DC-AI paradigm has already achieved breakthroughs in many AI applications such as reducing the time from the implementation to the deployment of AI models, training AI models with fewer data, accelerating the convergence of AI models, reducing the overall complexity of AI models, reducing the cost of AI technology by exploiting the code of already-developed models, and extending AI applications [59,60,61,62,63]. The umbrella of DC-AI applications and techniques is constantly expanding, and therefore, the adoption of AI technology is increasing worldwide. In some cases, the existing AI applications are being modified to encompass the concepts of DC-AI to increase their reliability and efficiency in practical scenarios. Below, we demonstrate an empirical case study (proof of concept example) concerning DC-AI, which can address the technical shortcomings of traditional AI models by improving data quality.

Enhancing the Learning Ability and Generalization Power of ML Algorithms in Safety-Critical Applications (e.g., Medical Scenario)

In this case study, we examine the impact of DC-AI on the learning ability and generalization power of ML algorithms in safety-critical applications, specifically medical diagnosis. To analyze the effects of the key parameters of the DC-AI paradigm on performance, we chose the stroke prediction dataset [64]. This dataset is publicly available and has been widely used in the AI community for predicting the possibility of stroke [65]. It encompasses a variety of features and has stroke as the target class. Denoting categorical and numerical features by C and N, respectively, the predictors in this dataset are as follows: Gender (C), Hypertension (N), Ever-married (C), Work type (C), Residence type (C), Smoking status (C), Age (N), Heart disease (N), Glucose level (N), BMI (N), and Patient id (N). The target class is stroke (N), a binary indicator taking values 0 and 1. In this dataset, the cardinality of the target class (i.e., stroke) is two, and the frequency of each value is different. Figure 6 highlights the frequency of each category. Referring to Figure 6, it can be observed that there is a very high imbalance between stroke (value = 1) and no stroke (value = 0). In a realistic scenario, the frequencies of the two stroke categories should be close enough to ensure the balanced learning/training of AI models.
In the initial phase, we applied the MC-AI approach and computed the accuracy (A) using the equation below.
A = (T_p + T_n) / (T_p + F_p + T_n + F_n)
where T_p, T_n, F_p, and F_n denote the true positives, true negatives, false positives, and false negatives, respectively.
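For clarity, this accuracy value can be computed directly from a confusion matrix, as in the short sketch below (the toy labels are placeholders):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 0, 1]   # toy ground-truth labels
y_pred = [0, 1, 1, 0, 0, 1]   # toy model predictions

# For binary labels, ravel() returns counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
A = (tp + tn) / (tp + fp + tn + fn)
print(f"A = {A:.3f}")
```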
We built three ML models from the stroke prediction dataset without improving its quality, just like in traditional AI system development. We applied three different settings:
  • In setting I, we fixed the parameters of the ML models but varied the amount of test and training data. For the sake of simplicity, we name this setting α.
  • In setting II, we fixed the amount of training and test data but varied the parameters of the ML models. For the sake of simplicity, we name this setting β.
  • In setting III, we simultaneously changed both (the parameters and the amount of data) while computing the A value. For the sake of simplicity, we name this setting γ. A sketch of settings α and β is given after this list.
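As a rough illustration of how settings α and β can be realized in scikit-learn (setting γ simply combines both loops), consider the sketch below; the synthetic, imbalanced X, y stand in for the stroke prediction dataset, and the split sizes and parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Imbalanced placeholder data (~95% class 0) mimicking the stroke dataset.
X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)

# Setting alpha: fixed (default) hyperparameters, varying train/test split.
for test_size in (0.33, 0.25, 0.20):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size,
                                              random_state=0)
    rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print(f"alpha, test={test_size}: "
          f"A={accuracy_score(y_te, rf.predict(X_te)):.3f}")

# Setting beta: fixed split, varying hyperparameters.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
for n_trees in (50, 200, 500):
    rf = RandomForestClassifier(n_estimators=n_trees,
                                random_state=0).fit(X_tr, y_tr)
    print(f"beta, trees={n_trees}: "
          f"A={accuracy_score(y_te, rf.predict(X_te)):.3f}")
```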
After building three different ML models, we computed A for the analysis. Table 2 presents the results of the experiments.
From the results, we can see that the three settings yielded only marginal improvements. In setting α, we changed the training data size from 66% to 80% and computed the A results using the test data (whose size varied accordingly from 33% to 20%). In this setting, we used fixed hyperparameter values (e.g., an unpruned tree and default values for both the SVM and random forest parameters). The A value in this setting changed only slightly after varying the data sizes. In setting β, we fixed the training and test data sizes but changed the parameter values of all three models. We used the best-pruned tree, optimized values of the SVM hyperparameters (the polynomial kernel function's degree, gamma, shrinking, tol, etc.), and hyperparameter-optimized values of the random forest (number of trees, number of variables required to split a tree node, sampling size, node complexity, tree depth, etc.). In this case, A also changed only slightly. In setting γ, we simultaneously varied both (hyperparameters and data size) and computed A. In this setting, the results were slightly better compared to the other settings. However, we observed that most ML models learned the information of one class only, and the inference for the other class was zero. The best model (e.g., random forest) was in some cases unable to predict even a single instance correctly, owing to poor learning. Therefore, this model cannot be used in safety-critical applications (e.g., the prediction and classification of cancer or COVID-19). The use of such a model can lead to negative consequences in realistic scenarios. The structure of the confusion matrices obtained in the three test cases is shown in Figure 7.
As shown in Figure 7, category 1 (possibility of stroke) remained largely unlearned in most cases. If an ML model trained with such data is used in real-world cases, the possibility of accurately predicting category 1 is close to zero. In these circumstances, the sampling used in most ML models is highly biased, meaning the samples drawn from the data contain far more instances of category 0, and therefore, the learning and generalization of the ML models are very poor. In one case (where the sample size was 0.8 (80%)), we observed a difference between classes 0 and 1 of 4002, which is very high considering the size of this dataset. Since the representation of one class is extremely low, the classifier tends to ignore the samples of the minority class, leading to poor learning and generalization. How to identify and fix these types of problems using AI approaches is one of the hot topics of modern times.
In the next set of experiments, we applied DC-AI to improve the learning ability of the classifiers and to fix the other technical problems (low accuracy, poor sampling) that cannot be resolved in a fine-grained manner using MC-AI alone. As the main purpose of DC-AI is to improve data, we first developed a pipeline to inspect the data. Figure 8 presents the DC-AI pipeline that was employed to improve the data and thereby address MC-AI's problems. In this pipeline, there are six steps that were sequentially applied to curate better data for enhancing the performance of the AI/ML models.
In this dataset, the number of records is very low and the distribution skew is also very high. Therefore, after identifying these two vulnerabilities in the dataset, we curated some more data with the help of a conditional generative adversarial network (CGAN [66]), as sketched below. Afterward, the generated data were fused into the minority class to enhance the performance of the ML classifiers. Although many data augmentation techniques have been proposed to address the class imbalance problem, our approach is different, as it adds fewer, but good-quality, records, leading to a significant enhancement in the results. Specifically, we added only 3042 records to the stroke prediction dataset by rigorously following the DC-AI criteria. In this case, existing techniques that do not follow DC-AI may add 5366 records [67]. The addition of more records without following DC-AI can lead to only very small improvements in the results, whereas the computing overheads can be large, especially when the data size is large. In contrast, the careful application of DC-AI can significantly improve the results without increasing the computing cost.
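A minimal sketch of this augmentation step is given below. The CGAN of [66] is substituted here with the open-source ctgan package as an illustrative stand-in; the file path and epoch count are assumptions, and the column names follow the public stroke prediction dataset and may need adjusting.

```python
import pandas as pd
from ctgan import CTGAN  # open-source stand-in for the CGAN in [66]

df = pd.read_csv("stroke.csv")                     # placeholder path
minority = df[df["stroke"] == 1].drop(columns=["stroke"])

# Fit a tabular GAN on the minority class only.
gan = CTGAN(epochs=300)                            # illustrative epoch count
gan.fit(minority, discrete_columns=["gender", "ever_married", "work_type",
                                    "Residence_type", "smoking_status"])

# Generate 3042 good-quality minority-class records, as in the case study.
synthetic = gan.sample(3042)
synthetic["stroke"] = 1
augmented = pd.concat([df, synthetic], ignore_index=True)
```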
To evaluate the significance of DC-AI, we applied three settings similar to those of the MC-AI approach. In setting I, we fixed the parameters of the ML models but varied the number of records in the testing and training data. In this case, we added more records to the original data step by step and built the model. In the second setting, we froze the training and testing data but varied the hyperparameters of the ML algorithms. In the third setting, we reduced the hyperparameters to prove the benefits of DC-AI; that is, we built the model with fewer parameters while using good-quality data. For example, in the random forest, we reduced the number of trees to be constructed. Similarly, in the SVM, we reduced the parameter size using the grid-search method. In the experiments, we found that if the data quality is good, the hyperparameters can be reduced (or small values can be used), leading to improved results. For the sake of simplicity, we again name these settings α, β, and γ. Table 3 presents the results of the experiments obtained from the improved data. From the results, it can be seen that DC-AI yielded a much higher performance compared to MC-AI, and the A value is close to 100% in the random forest case. Furthermore, it is worth noting that this value was attained with significantly reduced hyperparameters compared to MC-AI. These results demonstrate the superiority of DC-AI over MC-AI.
The DC-AI approach generated much more balanced confusion matrices compared to the MC-AI approach. Specifically, we observed that, in most experiments, T_p > F_p and T_n > F_n. Figure 9 presents the structure of the confusion matrices obtained via the DC-AI approach (i.e., with the data quality significantly improved before building the ML models). From the results, it can be noticed that F_p and F_n were much lower compared to the MC-AI approach, which did not improve the data quality. In addition, the minority class did not suffer higher misclassification. These results reinforce the significance of the DC-AI approach in practical scenarios from the perspective of higher accuracy, as well as the balance of the confusion matrices, and prove that DC-AI can yield better results, leading to better generalization and learning ability of the ML classifiers. These results validate that DC-AI is the better choice over MC-AI in safety-critical applications.
Next, we demonstrate the effects of DC-AI on sampling quality. We chose ten samples for analysis and comparison purposes. When the sample size was 0.8 (80%), the maximum difference between classes 0 and 1 in DC-AI was 1212, which is very low compared to MC-AI. From these results, it can be observed that the gaps between the numbers of records of the two stroke categories were small compared to MC-AI, and the sampling quality was good. These results verify the superiority of DC-AI over MC-AI in terms of sampling quality. The analysis presented above underscores that the DC-AI approach is imperative in most cases, especially when the underlying data are of poor quality. DC-AI yielded better results in three different aspects. The experimental analysis and comparison demonstrate the value of the DC-AI approach in practical scenarios, particularly when the data are imbalanced or limited for training AI models.

7. Challenges and Future Prospects

DC-AI is a very new paradigm and is prone to many technical challenges when it comes to actual use in real-world scenarios. For instance, in the absence of clear standards for gauging the quality of data, how can one use DC-AI? Similarly, there are no practical tools (or software) that can spot data vulnerabilities to alert developers. We identified multiple technical challenges of DC-AI in real-life scenarios from the perspective of data, models, or both, and present various important directions for future research/development. Joint efforts to solve these challenges can spark technical developments in this line of work. It is worth noting that DC-AI is a nascent paradigm, and there is significant room for research from any aspect (the development of new techniques, models, frameworks, prototypes, etc.). The key challenges and important future research directions are given below:
  • There is a significant lack of technologies that can assist in realizing DC-AI. At present, there are only a few enabling technologies (synthetic data, transfer learning (TL), etc.) used in DC-AI. However, the potential synergies between DC-AI and other AI technologies, such as transfer learning and generative models, can expand the application horizon for DC-AI. For instance, the TL approach is handy in addressing data scarcity and availability issues, and it can foster data re-usability across identical domains/problems [68]. TL can assist in realizing the DC-AI concept in low-resource healthcare settings [68]. TL-like approaches are required to prepare data for diverse domains with minimal effort by transferring knowledge/information from one domain to another. On the other hand, generative models like GAN/TVAE are handy for generating new data (also known as synthetic data), which can be used to enlarge the data size, as well as the diversity [66]. Generative models can assist in solving imbalanced learning problems in machine learning (ML) and can increase the prediction accuracy for minor classes [69,70]. In some cases, they contribute to reducing the bias in ML/AI models [71]. Furthermore, generative models assist in performing complex tasks (e.g., human analysis) with AI, which are not possible in conventional settings, owing to limited high-quality data [72]. However, there is still a serious lack of technologies that can pinpoint faults in the data and raise alerts when some parts of the data are misleading or labeled incorrectly. Therefore, developing new DC-AI-enabling technologies or modifying the existing technologies to meet the DC-AI criteria is very challenging. Thus, the development of enabling technologies that can assist in achieving multiple aspects of DC-AI is an exciting area of future research.
  • In some cases, the DC-AI approach is no longer required when the data are sound, complete, up-to-date, and highly diverse. However, there is a lack of methodologies/procedures through which one can identify when DC-AI is inherently fulfilled. Furthermore, the decision about when to use DC-AI also needs rigorous criteria/methods, because MC-AI yields better results in some areas. For example, if there are fewer vulnerabilities in the data, the corresponding AI model can overcome them without a DC-AI approach by utilizing averaging, or by employing an optimized α (learning rate). To this end, one may not need the DC-AI approach. Hence, developing techniques that can distinguish whether DC-AI is imperative or not can be a feasible topic for future research.
  • It is very hard to decide how much quality enhancement is needed when it comes to different areas (e.g., predictive maintenance vs. human activity analysis). For example, an application for text analysis may need less pre-processing because the application needs to handle both noisy and well-written text. In contrast, an automated diagnosis system employed for cancer may need rigorous data quality enhancement to ensure accurate diagnoses. Hence, it is very tricky to assess what amount of data quality is reasonable in a DC-AI approach when it comes to diverse applications. Therefore, developing unified tools/techniques that can provide a reasonable analysis of the data is a pending issue, and requires further investigation from the AI community.
  • There is a misconception between the DC-AI approach and traditional pre-processing. It is important to note that pre-processing is a common practice while building AI systems and has its own significance. In contrast, DC-AI is a complete discipline that includes pre-processing as one of its aspects. Pre-processing is applied to data that have already been collected, whereas DC-AI is all about the data-first strategy: what data to collect, from whom to collect them, how to collect them, how to improve them, what aspects to improve, etc. Also, DC-AI includes data versioning, completeness, timeliness, risk/error analysis, etc., which are not part of pre-processing at all. Therefore, convincing AI developers to accept this new paradigm may be challenging. Hence, clarifying how pre-processing fits within the broader DC-AI paradigm is imperative to highlight the need for DC-AI in future endeavors.
  • Another challenge is to find potential/attractive use cases, like the detection of defects in steel, that pinpoint the efficacy/benefits of DC-AI (https://venturebeat.com/ai/why-data-remains-the-greatest-challenge-for-machine-learning-projects/, accessed on 5 March 2024). Making the DC-AI approach a de facto standard for AI applications poses various challenges, because many AI systems have already been developed and deployed in real-world settings. In some cases, obtaining more data or tuning AI models is beneficial, as in applications like voice-activated assistants. In contrast, obtaining more data or improving models when all the data come from one source (e.g., defect detection in a machine based on its operating sounds) may not be beneficial. Hence, it is challenging to differentiate applications that require DC-AI from those that require MC-AI, and identifying the domains where DC-AI can do more good than harm is an exciting research area.
  • DC-AI is about debugging and compiling data, but in the absence of sophisticated tools, it is hard to identify the ambiguous parts of the data and fine-tune them accordingly. In addition, separating faulty and non-faulty parts of data stored in diverse formats (e.g., tables, graphs, trajectories, and images) is also very challenging. Therefore, the development of data debugging and compiling tools, analogous to the compilers of programming languages (i.e., Java, C/C++), is imperative in the coming years. Such compilers could give valuable hints about the type/nature of problems/vulnerabilities in the underlying data so that they can be corrected as early as possible (a minimal label-debugging sketch is given after this list).
  • Recently, the umbrella of DC-AI techniques has been expanding (and becoming amalgamated with traditional pre-processing methods), which poses the challenge of deciding which technique is likely to yield promising results for a given dataset/application. For example, simple visualization may help to identify missing labels among 100 tuples, but it is of little use for image data of different species or for a billion images in one application. Hence, it is difficult to select suitable DC-AI techniques for each AI application, and the selection of optimal DC-AI techniques that yield reliable results regardless of the domain is an attractive area of future research.
  • In the absence of successful implementations, it is challenging to determine the order in which DC-AI techniques should be applied to yield robust AI systems. For example, deciding when to apply data labeling versus data completeness analysis in a gaze estimation application is difficult, as is deciding when not to use a certain technique (e.g., data availability analysis). Moreover, the set of DC-AI techniques that works for tabular data may not yield feasible results on image data, so a unified set of DC-AI techniques is required for each data type. In our recent work, we devised a general system that discusses the order and types of DC-AI techniques to employ in a stroke prediction scenario, which can be used as a reference [45] (a sketch of such an explicitly ordered pipeline is given after this list), and some DC-AI techniques that can be applied generically across the ML pipeline are given in Seedat et al. [49]. Nevertheless, determining the optimal order and types of DC-AI techniques for different datasets and applications remains a very complex problem that requires further investigation from the relevant communities, and ablation studies to determine appropriate combinations of DC-AI techniques for diverse domains are an exciting prospect in the DC-AI context.
  • Just as in conventional AI practice, where some pre-processing techniques (e.g., sampling) do not work well with certain AI models, the same problem can occur with DC-AI techniques. Analyzing the suitability of DC-AI techniques with respect to the data and the AI model therefore requires in-depth investigation. Generalizing DC-AI techniques to work effectively in related applications with slight modifications is also very challenging. Hence, conducting tests and identifying suitable DC-AI techniques for each AI model requires further investigation from the AI community.
  • In conventional MC-AI, the data are evaluated just once (e.g., before being fed into the AI model). By contrast, DC-AI requires evaluating the data at multiple stages of the AI system lifecycle, and the analyses required before model building (e.g., during pre-processing) differ from those required after training. In the pre-processing stage, data completeness/freshness is mandatory, whereas after training, balanced data utilization (whether the AI model utilizes all parts of the data equally or not) becomes imperative. Performing systematic analysis and quality assurance may require sophisticated expertise at each stage, yet there is a substantial lack of domain experts for each stage of the AI system lifecycle, which can hinder the applicability of DC-AI in practical scenarios. Identifying and documenting the list of data-related requirements to be fulfilled in each stage is very challenging, and requires more investigation.
  • There is a genuine lack of procedures and standards concerning data quality. For example, in some applications it may be sufficient for 10% of the data to be of high quality, whereas other applications may require 90% to yield consistent results. Considering the huge diversity in data styles and applications, developing appropriate standards and procedures for gauging data quality is challenging. In the future, it is vital to develop formal procedures and well-defined standards for gauging data quality across AI applications.
  • The key focus of DC-AI is to improve data quality. However, this is a very time-consuming, challenging, and laborious task, especially when the data are very large, and reducing the overall complexity of the data-quality-enhancement process is an interesting and urgent topic. At present, there is a lack of software or customized libraries that can perform most of these operations automatically, so developing supportive libraries for the DC-AI paradigm is a vibrant avenue for research.
  • Recently, there has been an increasing trend toward training complex AI models with as little data as possible to overcome computing overhead [17,29,73]. Similarly, data optimization techniques are needed for training complex AI models with the smallest (but complete) data. In this regard, developing data optimization and reduction techniques that exploit DC-AI concepts without compromising accuracy (or other objectives) is a vibrant research area (a minimal subset-selection sketch is given after this list). Data quality is a key performance index for DC-AI, and therefore, low-cost tools and techniques for improving data quality and detecting out-of-distribution samples have become more urgent than ever. Hence, practical simulation/implementation tools that can enhance the quality of data through various operations (consistent labeling, outlier removal, etc.) at the least cost are needed.
As discussed above, a substantial number of challenges currently impact the realization of DC-AI, and significant efforts are therefore required to develop the tools, techniques, and models needed to truly benefit from it. Lastly, advancing/improving software originally developed for the MC-AI approach is another sustainable way to harness the full potential of DC-AI. The sketches below illustrate, in code, a few of the points raised in this list.
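As a concrete illustration of the synthetic-data route mentioned above, the following minimal Python sketch enlarges an under-represented class with CTGAN-generated rows. The open-source ctgan package is one possible implementation; the file name, column names, epoch count, and sample size are illustrative assumptions rather than details of our case study.

```python
# Hedged sketch: augmenting a minority class with CTGAN-generated synthetic rows.
# "stroke.csv", the column names, and all numbers below are assumptions.
import pandas as pd
from ctgan import CTGAN

df = pd.read_csv("stroke.csv")                     # hypothetical dataset
minority = df[df["stroke"] == 1]                   # under-represented class

model = CTGAN(epochs=100)                          # epoch count is illustrative
model.fit(minority, discrete_columns=["gender", "smoking_status", "stroke"])

synthetic = model.sample(2000)                     # new minority-class rows
augmented = pd.concat([df, synthetic], ignore_index=True)
```

Any augmented dataset produced this way should still be validated (e.g., via distributional checks) before training, since generative models can amplify artifacts present in the minority sample.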
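The next sketch corresponds to the question of when DC-AI is imperative: a lightweight audit that computes a few quality signals and recommends (or not) a DC-AI pass. The thresholds are assumed rules of thumb, not established standards.

```python
# Hedged sketch: a minimal data audit that flags whether a DC-AI pass is
# likely warranted. All thresholds are illustrative assumptions.
import pandas as pd

def needs_dc_ai(df: pd.DataFrame, target: str) -> dict:
    missing_ratio = df.isna().mean().max()        # worst-affected column
    duplicate_ratio = df.duplicated().mean()      # share of exact duplicates
    shares = df[target].value_counts(normalize=True)
    imbalance = shares.max() / shares.min()       # majority/minority ratio
    return {
        "max_missing_ratio": float(missing_ratio),
        "duplicate_ratio": float(duplicate_ratio),
        "class_imbalance": float(imbalance),
        # Assumed rule of thumb: intervene if any signal crosses a threshold.
        "dc_ai_recommended": bool(
            missing_ratio > 0.05 or duplicate_ratio > 0.01 or imbalance > 3.0
        ),
    }
```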
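The following sketch illustrates one possible "data debugger" of the kind called for above, using the open-source cleanlab library to rank rows whose labels look suspicious given out-of-fold predicted probabilities. The synthetic dataset and the classifier choice are assumptions made only to keep the example self-contained.

```python
# Hedged sketch: flagging likely mislabeled rows with cleanlab.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

# Stand-in data; a real audit would use the actual features and labels.
X, y = make_classification(n_samples=1000, n_informative=8, random_state=0)

# Out-of-fold probabilities avoid scoring rows with a model that saw them.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
pred_probs = cross_val_predict(clf, X, y, cv=5, method="predict_proba")

suspects = find_label_issues(
    labels=y,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"{len(suspects)} rows flagged for manual label review")
```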
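The next sketch shows how an explicit, auditable order of data-side steps can be encoded, loosely in the spirit of the stroke-prediction pipeline discussed in [45]; the three step functions are illustrative placeholders, not the actual steps used in that work.

```python
# Hedged sketch: an explicitly ordered DC-AI pipeline whose steps can be
# reordered or ablated one at a time. Step choices are illustrative.
import pandas as pd

def drop_exact_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop_duplicates()

def impute_numeric_medians(df: pd.DataFrame) -> pd.DataFrame:
    return df.fillna(df.median(numeric_only=True))

def clip_numeric_outliers(df: pd.DataFrame, z: float = 4.0) -> pd.DataFrame:
    num = df.select_dtypes("number")
    mu, sd = num.mean(), num.std()
    df = df.copy()
    df[num.columns] = num.clip(mu - z * sd, mu + z * sd, axis=1)
    return df

# The order is documented in one place, making ablation studies trivial.
DC_AI_STEPS = [drop_exact_duplicates, impute_numeric_medians, clip_numeric_outliers]

def run_dc_ai(df: pd.DataFrame) -> pd.DataFrame:
    for step in DC_AI_STEPS:
        df = step(df)
    return df
```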
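Finally, the following sketch illustrates uncertainty-guided data reduction: a cheap proxy model scores rows by decision margin, and only the most informative ones are kept. The 40% budget and the logistic-regression proxy are assumptions; whether accuracy is actually preserved must be verified empirically for each application.

```python
# Hedged sketch: shrinking a training set by keeping the rows a proxy model
# finds hardest (smallest decision margin). Budget and proxy are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=5000, n_informative=10, random_state=0)

proxy = LogisticRegression(max_iter=1000)
probs = cross_val_predict(proxy, X, y, cv=5, method="predict_proba")
margin = np.abs(probs[:, 1] - probs[:, 0])   # small margin = informative row

budget = int(0.4 * len(X))                   # assumed budget: keep 40% of rows
keep = np.argsort(margin)[:budget]
X_small, y_small = X[keep], y[keep]
```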

8. Next-Generation Computing for the DC-AI Paradigm

The success of the DC-AI paradigm lies in devising systematic techniques for inspecting and enhancing data quality, which requires ample attention to how data are curated, prepared, annotated/labeled, and integrated into AI systems. Therefore, it is crucial to identify research challenges and future directions in the DC-AI paradigm while amalgamating it with next-generation computing technologies such as cloud/fog/edge/serverless architecture, AI/ML/DL, and quantum computing. In this section, we discuss five research opportunities in next-generation computing by leveraging DC-AI:
  • In recent years, cloud-/fog-/edge-/serverless-based computing architectures have enabled the timely collection, processing, and analysis of big data using AI models. In the future, integrating DC-AI with these computing architectures will be imperative for cost savings, improved data management, automation, robustness, and addressing verifiability-related issues [74]. DC-AI can enhance the efficacy of cloud-/fog-/edge-/serverless-based applications by ensuring that the data used in them are high-quality, robust, dependable, and representative of the problem being solved. However, integrating the DC-AI paradigm with these architectures is not easy and can induce many research challenges; for example, the fusion, alignment, and consistency checking of data stemming from multiple sources can be tricky and slow. Furthermore, data reliability, data representation, and the privacy and security of data can be major barriers to integrating DC-AI with next-generation computing architectures, and preparing high-quality data for each architecture according to the application requirements is also very challenging. In the serverless architecture, most operations are managed by third parties, so DC-AI integration can lead to transparency, bias, and sustainability issues [74]. In next-generation computing, the involvement of domain experts in AI-based systems will be mandatory regardless of the architecture, and preparing domain experts for the relevant architectures is very challenging. Finally, upgrading current AI-integrated systems, which are mostly based on MC-AI, is another big challenge and requires much effort from the AI community.
  • Thus far, there has been limited synergy from merging the DC-AI approach with other AI/ML/DL technologies such as pre-trained models, TL, and generative models. In the future, more such synergies are needed to improve critical aspects of DC-AI (i.e., performance overhead, domain adaptation, data curation, and integration) [31]. Hence, exploring the possible synergies between DC-AI and more of the latest technologies to expand the horizon for DC-AI is a vibrant area of future research.
  • With the advent of quantum computing (QC), a powerful computing paradigm, a drastic change in the performance of AI models is expected, and most AI models will be able to work with zettabytes (ZB) of data [31]. QC exploits innovative phenomena such as quantum entanglement and quantum superposition and can process huge volumes of data far faster than classical systems [74]. However, the QC paradigm can also face obstacles in effective data utilization, because most real-world datasets are poisoned, biased, noisy, skewed, and/or incomplete. To this end, amalgamating DC-AI with QC is imperative for solving longstanding problems in the healthcare sector (e.g., genome analysis), drug development, secure cryptosystems, and sustainability. The joint use of these paradigms can contribute to developing reliable and robust AI systems, although applying DC-AI and QC to some complex problems, such as climate change, remains challenging owing to high uncertainties and constraints. The development of DC-AI-integrated QC-based libraries with a higher level of abstraction and flexibility is an important avenue for future research, and developing QC-based architectures for data distribution and governance in AI systems is also a promising research area.
  • TinyML is an emerging paradigm that brings ML algorithms close to ultra-low-powered devices, such as microcontroller units (MCUs), to enhance service quality [75]. Because data availability and accessibility are imperative in DC-AI, and TinyML places models where the data originate, this synergy is promising. Exploring the role of DC-AI in enhancing the technical persuasiveness and robustness of TinyML and its operations pipeline (TinyMLOps) is a fascinating area of research amid the rapid rise of wearable devices around the globe.
  • Recently, there has been an increasing focus on developing domain-specific, dedicated hardware accelerators to meet the growing demand for fast processing with the least energy consumption [76]. To this end, the analog in-memory computing (IMC) architecture has shown promising results for next-generation AI and computer vision tasks. Similarly, developing hardware accelerators to improve the computing efficiency of DC-AI is a promising area for future development.

9. Concluding Remarks

In this paper, we highlighted the significance of a modern paradigm named DC-AI that can be invaluable for AI development, advancement, and adoption in the coming years. Specifically, we highlighted what DC-AI entails and the key distinctions between DC-AI and MC-AI. We identified and discussed the situations/scenarios that require the integration of DC-AI with MC-AI to further advance AI technology. We devised a generic architecture, along with the supportive techniques/requirements for the implementation of DC-AI, which can be used as a reference/baseline for implementing DC-AI in realistic scenarios. We reported a case study, along with empirical results, to corroborate the potential of DC-AI and to highlight its superiority over MC-AI when data quality is poor. We also discussed next-generation computing for the DC-AI paradigm, which can expand the horizon for DC-AI research and development. To accelerate research in DC-AI, we described various challenges and avenues for future research that can contribute toward bringing DC-AI into practice in an AI-driven era. However, to unlock the full potential of DC-AI, substantial efforts are required from academia, industry, and all those concerned with leveraging AI for the well-being of humans while avoiding its pitfalls. Finally, it is important to note that DC-AI alone may not be sufficient to address all AI-related problems, and proper integration between DC-AI and MC-AI is mandatory so that AI can contribute more to the prosperity of humans. In future work, we intend to further investigate the optimal order and types of DC-AI techniques for different datasets, applications, and data modalities in order to foster further developments in AI technology.

Author Contributions

A.M. and S.O.H. contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) (RS-2024-00340882).

Data Availability Statement

The data are contained within this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ker, J.; Wang, L.; Rao, J.; Lim, T. Deep learning applications in medical image analysis. IEEE Access 2017, 6, 9375–9389. [Google Scholar] [CrossRef]
  2. Fidon, L.; Aertsen, M.; Kofler, F.; Bink, A.; David, A.L.; Deprest, T.; Emam, D.; Guffens, F.; Jakab, A.; Kasprian, G.; et al. A Dempster-Shafer approach to trustworthy AI with application to fetal brain MRI segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 3784–3795. [Google Scholar] [CrossRef] [PubMed]
  3. Shaker, A.M.; Maaz, M.; Rasheed, H.; Khan, S.; Yang, M.H.; Khan, F.S. UNETR++: Delving into efficient and accurate 3D medical image segmentation. IEEE Trans. Med Imaging 2024. [Google Scholar] [CrossRef] [PubMed]
  4. Chen, J.; Yi, C.; Du, H.; Niyato, D.; Kang, J.; Cai, J.; Shen, X. A revolution of personalized healthcare: Enabling human digital twin with mobile AIGC. IEEE Netw. 2024. [Google Scholar] [CrossRef]
  5. Liu, Y.; Chen, B.; Wang, S.; Lu, G.; Zhang, Z. Deep Fuzzy Multi-Teacher Distillation Network for Medical Visual Question Answering. IEEE Trans. Fuzzy Syst. 2024, 1–15. [Google Scholar] [CrossRef]
  6. Li, G.; Xu, J.; Li, Z.; Chen, C.; Kan, Z. Sensing and navigation of wearable assistance cognitive systems for the visually impaired. IEEE Trans. Cogn. Dev. Syst. 2022, 15, 122–133. [Google Scholar] [CrossRef]
  7. Kanthimathi, T.; Rathika, N.; Fathima, A.J.; Rajesh, K.; Srinivasan, S.; Thamizhamuthu, R. Robotic 3D Printing for Customized Industrial Components: IoT and AI-Enabled Innovation. In Proceedings of the 2024 14th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 18–19 January 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 509–513. [Google Scholar]
  8. Chang, M.; Chen, K.H.; Chen, Y.S.; Hsu, C.C.; Chu, C.C. Developments of AI-Assisted Fault Detection and Failure Mode Diagnosis for Operation and Maintenance of Photovoltaic Power Stations in Taiwan. IEEE Trans. Ind. Appl. 2024. [Google Scholar] [CrossRef]
  9. Yuan, X.; Wang, Y.; Wang, C.; Ye, L.; Wang, K.; Wang, Y.; Yang, C.; Gui, W.; Shen, F. Variable Correlation Analysis-Based Convolutional Neural Network for Far Topological Feature Extraction and Industrial Predictive Modeling. IEEE Trans. Instrum. Meas. 2024, 73, 1–10. [Google Scholar] [CrossRef]
  10. Justus, V.; Kanagachidambaresan, G. Machine learning based fault-oriented predictive maintenance in industry 4.0. Int. J. Syst. Assur. Eng. Manag. 2024, 15, 462–474. [Google Scholar] [CrossRef]
  11. Li, L.; Ota, K.; Dong, M. Deep learning for smart industry: Efficient manufacture inspection system with fog computing. IEEE Trans. Ind. Inform. 2018, 14, 4665–4673. [Google Scholar] [CrossRef]
  12. Li, D.; Zhang, Z.; Yu, K.; Huang, K.; Tan, T. ISEE: An intelligent scene exploration and evaluation platform for large-scale visual surveillance. IEEE Trans. Parallel Distrib. Syst. 2019, 30, 2743–2758. [Google Scholar] [CrossRef]
  13. Yuan, X.; Xu, N.; Ye, L.; Wang, K.; Shen, F.; Wang, Y.; Yang, C.; Gui, W. Attention-Based Interval Aided Networks for Data Modeling of Heterogeneous Sampling Sequences with Missing Values in Process Industry. IEEE Trans. Ind. Inform. 2024, 20, 5253–5262. [Google Scholar] [CrossRef]
  14. Fan, Y.; Pang, W.; Lu, S. HFPQ: Deep neural network compression by hardware-friendly pruning-quantization. Appl. Intell. 2021, 51, 7016–7028. [Google Scholar] [CrossRef]
  15. Strickland, E. Andrew Ng, AI Minimalist: The Machine-Learning Pioneer Says Small is the New Big. IEEE Spectr. 2022, 59, 22–50. [Google Scholar] [CrossRef]
  16. Hegde, C. Anomaly Detection in Time Series Data using Data-Centric AI. In Proceedings of the 2022 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), Bangalore, India, 8–10 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6. [Google Scholar]
  17. Motamedi, M.; Sakharnykh, N.; Kaldewey, T. A data-centric approach for training deep neural networks with less data. arXiv 2021, arXiv:2110.03613. [Google Scholar]
  18. Baek, D.; Dasari, M.; Das, S.R.; Ryoo, J. DcSR: Practical Video Quality Enhancement Using Data-Centric Super Resolution; Association for Computing Machinery: New York, NY, USA, 2021; pp. 336–343. [Google Scholar]
  19. Stroke Prediction Dataset. Kaggle. Available online: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset (accessed on 7 June 2024).
  20. Sailasya, G.; Kumari, G.L.A. Analyzing the Performance of Stroke Prediction using ML Classification Algorithms. Int. J. Adv. Comput. Sci. Appl. 2021, 12. [Google Scholar] [CrossRef]
  21. Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling tabular data using conditional gan. Adv. Neural Inf. Process. Syst. 2019, 32. [Google Scholar]
  22. Dina, A.S.; Siddique, A.; Manivannan, D. Effect of balancing data using synthetic data on the performance of machine learning classifiers for intrusion detection in computer networks. IEEE Access 2022, 10, 96731–96747. [Google Scholar] [CrossRef]
  23. Seedat, N.; Imrie, F.; van der Schaar, M. Navigating Data-Centric Artificial Intelligence with DC-Check: Advances, Challenges, and Opportunities. IEEE Trans. Artif. Intell. 2023, 1–15. [Google Scholar] [CrossRef]
  24. Hussein, H.I.; Anwar, S.A. Synthetic data and reduction method to enhancing prediction in SVM to imbalanced data classification problem. In Proceedings of the AIP Conference Proceedings; AIP Publishing: Long Island, NY, USA, 2024; Volume 2750. [Google Scholar]
  25. Yun, J.; Lee, J.S. Learning from class-imbalanced data using misclassification-focusing generative adversarial networks. Expert Syst. Appl. 2024, 240, 122288. [Google Scholar] [CrossRef]
  26. Juwara, L.; El-Hussuna, A.; El Emam, K. An evaluation of synthetic data augmentation for mitigating covariate bias in health data. Patterns 2024. [Google Scholar] [CrossRef]
  27. Joshi, I.; Grimmer, M.; Rathgeb, C.; Busch, C.; Bremond, F.; Dantcheva, A. Synthetic data in human analysis: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 4957–4976. [Google Scholar] [CrossRef] [PubMed]
  28. Touvron, H.; Bojanowski, P.; Caron, M.; Cord, M.; El-Nouby, A.; Grave, E.; Izacard, G.; Joulin, A.; Synnaeve, G.; Verbeek, J.; et al. Resmlp: Feedforward networks for image classification with data-efficient training. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5314–5321. [Google Scholar] [CrossRef]
  29. Gill, S.S.; Xu, M.; Ottaviani, C.; Patros, P.; Bahsoon, R.; Shaghaghi, A.; Golec, M.; Stankovski, V.; Wu, H.; Abraham, A.; et al. AI for next generation computing: Emerging trends and future directions. Internet Things 2022, 19, 100514. [Google Scholar] [CrossRef]
  30. Moin, A.; Challenger, M.; Badii, A.; Günnemann, S. Supporting AI Engineering on the IoT Edge through Model-Driven TinyML. In Proceedings of the 2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC), Los Alamitos, CA, USA, 27 June–1 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 884–893. [Google Scholar]
  31. Örnhag, M.V.; Güler, P.; Knyaginin, D.; Borg, M. Accelerating AI Using Next-Generation Hardware: Possibilities and Challenges With Analog In-Memory Computing. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 488–496. [Google Scholar]
Figure 1. Workflow of the DC-AI paradigm as applied to real-world problems. Notable steps are #3 and #7, where data screening and tuning build more robust AI models (adapted from [45]).
Figure 2. Test scenario of MC-AI and DC-AI: (a) overview of steel inspection sheet and (b) examples of different defects (adapted from [22]). The red boxes in (b) show different kinds of defects in the steel inspection sheet.
Figure 3. The general architecture of the DC-AI paradigm (partially adapted from [27]).
Figure 4. Overview of interactive relationships between DC-AI and MC-AI.
Figure 5. Overview of ten real-world situations that need both MC-AI and DC-AI paradigms together, and the key role of each paradigm.
Figure 6. Frequency analysis of the target class. This schematic shows the frequency information of different category values of the target class.
Figure 7. Structure of the confusion matrix obtained from experiments via MC-AI.
Figure 8. DC-AI pipeline applied to stroke prediction data. This schematic shows the process used to significantly enhance the data quality.
Figure 9. Structure of the confusion matrix obtained from experiments via DC-AI.
Table 1. Quantitative results and comparisons between MC-AI and DC-AI (adapted from [22]).

Approach          Accuracy (%)      Approach          Accuracy (%)
Baseline          76.20%            Baseline          76.20%
MC-AI approach    +0.00%            DC-AI approach    +16.9%
New values        76.20%            New values        93.1%
Table 2. A comparison of results using the MC-AI approach (α = data size change, β = hyperparameter tuning, γ = data size and hyperparameter tuning). Results are under three distinct settings (mostly MC-AI).

ML Algorithm      α                 β                 γ (%)
Decision tree     0.7886∼0.8291     0.7986∼0.8123     ∼83.31
SVM               0.8186∼0.8491     0.8281∼0.8501     ∼86.79
Random forest     0.8286∼0.8798     0.8481∼0.8929     ∼92.72
Table 3. A comparison of results using the DC-AI approach (α = data augmentation, β = hyperparameter tuning, γ = augmented data and reduced hyperparameters). Results are under three distinct settings (mostly DC-AI).

ML Algorithm      α                 β                 γ (%)
Decision tree     0.8127∼0.8411     0.8103∼0.8513     ∼89.11
SVM               0.8386∼0.8701     0.9161∼0.9271     ∼94.65
Random forest     0.8802∼0.9198     0.8909∼0.9209     ∼99.83 (≈100%)
