Article

Timing and Performance Metrics for TWR-K70F120M Device

George K. Adam
CSLab Computer Systems Laboratory, Department of Digital Systems, University of Thessaly, 41500 Larisa, Greece
Computers 2023, 12(8), 163; https://doi.org/10.3390/computers12080163
Submission received: 21 July 2023 / Revised: 10 August 2023 / Accepted: 11 August 2023 / Published: 14 August 2023

Abstract

Currently, single-board computers (SBCs) are sufficiently powerful to run real-time operating systems (RTOSs) and applications. The purpose of this research was to investigate the timing performance of an NXP TWR-K70F120M device running the μClinux OS on concurrently executing tasks with real-time features and constraints, and to provide new and distinct technical data not yet available in the literature. Towards this goal, a custom-built multithreaded application with specific compute-intensive sorting and matrix operations was developed and used to obtain measurements of specific timing metrics, including task execution time, thread waiting time, and response time. In this way, this research extends the literature by documenting performance results on specific timing metrics. The performance of this device was additionally benchmarked and validated against commonly used platforms, the Raspberry Pi4 and BeagleBone AI SBCs. The experimental results showed that this device performs well in terms of both timing and efficiency metrics. Execution times were lower than on the other platforms, by approximately 56% in the case of two threads and by 29% in the case of 32-thread configurations. The outcomes could be of practical value to companies that intend to use such low-cost embedded devices in the development of reliable real-time industrial applications.

1. Introduction

Single-board computers (SBCs) facilitate the Internet of Things (IoT) and can enable Fog and Edge compute applications to run efficiently on IoT computing and data generating nodes (e.g., sensors). In IoT architectures, there is a tendency for the computational power to be pushed out closer to the edge, on smart devices such as SBCs. The volume of data generated from IoT devices, sensors, and other equipment is forcing enterprises and manufacturers to design and develop SBCs with sufficient computing power for real-time analytics, and for applications with real-time attributes and requirements [1,2]. Single-board computers are usually cost-effective and versatile commercial off-the-shelf (COTS) computer platforms, which offer significantly reduced time to implementation and efficient solutions in many sectors, including Industrial IoT (IIoT) applications (e.g., Texas Instruments BeagleBone Black, NXP i.MX 8 series-based boards, Raspberry Pi Compute Module 4, etc.). ARM-based microcontrollers are commonly deployed in many SBCs, mobile phones, and different types of embedded devices and industrial applications. This is because they offer high performance and power efficiency at a reasonable price/performance ratio, and have low-power requirements [3].
An important feature of such devices in industrial solutions is the requirement for real-time capabilities, i.e., the capability to support real-time timing constraints (e.g., a deterministic response, minimized latencies, and bounded execution times). A real-time application must complete real-time tasks within a deterministic deadline. For example, the time elapsed from the appearance of an event to the actual system response often must be within a strict time range. Such timing requirements are very important in industrial production and IIoT [4,5].
Single-board computers deploy operating systems (OSs) which make minimal use of the resources available on the embedded microcontroller and usually have a small memory footprint. Real-time operating systems (RTOSs) are quite common in such devices and embedded systems applications [6,7]. Real-time kernels can be used for developing applications that perform multiple tasks (threads) simultaneously in a deterministic way. Real-time applications can be created without an RTOS; however, timing issues can be solved more efficiently with an RTOS.
The low cost and power consumption of SBCs, their adaptability, and their stable availability from key market manufacturers such as Texas Instruments, Inc., Advantech Co., Digi International, Inc., NXP Semiconductors, and the Raspberry Pi Foundation have enabled such products to be deployed in many cases where processing requirements can be met by small, smart embedded devices [8,9]. The continuous increase in SBC applications has also led to more powerful peripherals being included on such boards.
The need for SBCs that support real-time applications running multiple functions at guaranteed times, particularly in industry, will continue to rise. A market research report by KBV Research foresees that the global single-board computer market will reach 3.8 billion USD by 2026, rising at a 7.2% compound annual growth rate (CAGR) during the forecast period [10]. In industrial automation, this market growth is expected to be much higher than in other sectors, e.g., aerospace and defense, healthcare, and consumer electronics.
Therefore, their overall performance will continue to be assessed against each specific application and its requirements. In this direction, this research assessed the real-time computing capabilities and timing performance of a specific SBC, the NXP TWR-K70F120M board [11]. This was motivated primarily by the interest of a machinery manufacturing company which has already procured some of these devices and is looking to deploy them further in its machine production [12]. One of the company's intentions in the near future is to deploy this board in a variety of control applications in the construction and production of automated machines for concrete products. This research investigated the capability of the NXP TWR-K70F120M board, running a µClinux distribution [13], to support applications with real-time constraints, and its overall performance in terms of timing metrics such as task execution time, waiting time, and response time. One of the main objectives was to provide new technical data on the board's timing performance, not available in the literature, which could benefit practical real-time applications based on this device.
Among the contributions of this research are the performance measurements, carried out with a custom-built multithreaded application with real-time attributes, developed for this purpose. In addition, a benchmark was applied that ran additional performance tests based on cryptography algorithms. As indicated in Section 2, to my knowledge, such information on timing performance metrics for this board is not available. Furthermore, beyond the assessment of the TWR-K70F120M platform with a benchmark and a specific multithreaded application, the same methodology and assessment software was implemented and tested on two other popular single-board computers, a BeagleBone AI and a Raspberry Pi4 Model B, with Linux OS support. The results were compared to those obtained with the NXP TWR-K70F120M board, and conclusions were drawn on its performance.
A summary of the provided contributions follows:
  • A custom-built multithreaded software application with specific compute-intensive sorting and matrix operations and real-time attributes, and its use in performance measurements. The software integrates core functionality for the targeted timing metrics and measurements that is not available in broad benchmark applications, tailored to the purposes of this research and the particular NXP TWR-K70F120M platform.
  • New and distinct performance measurements for this single-board computer which are not available in the literature. The research conducted on the performance of this board is very limited. Therefore, this research extends the literature by documenting performance results on specific timing metrics and addresses the need for such performance characteristics for this board.
  • Outcomes of practical value to companies which intend to use such low-cost embedded devices in the development of reliable real-time industrial applications. By considering the results of this applied research, interested manufacturers could minimize the time, effort, and cost spent on research and design with this platform when manufacturing their products, and could identify potential failures and thereby avoid unexpected performance issues.
This paper is structured as follows: Section 2 describes previous related work. Section 3 presents the methodology applied, the system’s hardware and software infrastructure, and the performance metrics acquired in measurements. Section 4 presents the experimental framework used as the testbed, the results obtained on performance measurements of the NXP TWR-K70F120M board, and the comparative results with other SBC devices. Section 5 provides a brief discussion about the ideas in the paper and the research outcomes, and Section 6 provides concluding remarks.

2. Related Work

In recent years, in several application sectors, there has been a tendency towards the use of more versatile, flexible, and low-cost devices based on ARM processors, provided by many vendors, such as the Nvidia Jetson Nano, the NXP SBC-S32V234, the BeagleBone AI-64, and the Raspberry Pi Compute Module 4 single-board computers [14,15,16].
Real-time operating systems have also been employed to support system applications running multiple tasks (threads) with real-time constraints, particularly in embedded systems [17,18,19]. Examples include open-source real-time operating systems such as µClinux and μC/OS-III [20], which target microcontrollers without memory-management unit (MMU) support. A number of works make use of µClinux to create applications on microcontrollers [21,22].
The performance of real-time systems and multithreaded applications has been analyzed and benchmarked with many different approaches. The techniques and tools used depend on the aspects of performance that are targeted for measurement and evaluation, most commonly schedulability and timing issues in real-time systems [23]. In general, there is considerable work on performance assessment of single-board computers; however, there has not been much work on the NXP TWR-K70F120M platform.
The NXP TWR-K70F120M board can be found in several applications [24,25,26]. However, studies that assess the performance of this board, either with standard benchmarks or with software developed specifically for such purposes, are very limited. An interesting study is the work on performance analysis of an embedded system by Vicari [27]. This thesis presented debugging and tracing, with the Lauterbach μTrace device in the Trace32 environment, of a μClinux signal processing application running on the TWR-K70F120M development board. However, its focus is on the performance analysis of the execution of two finite impulse response (FIR) filter implementations (of different complexities) and a fast Fourier transform (FFT) algorithm (with different numbers of points for which the transform is computed).
An interesting benchmark is provided by wolfSSL Inc. [28], which examined the performance of the TWR-K70F120M platform using its in-house wolfSSL package. This package includes the wolfCrypt benchmark application. Because the underlying cryptography is a very performance-critical aspect of SSL/TLS, this benchmark application runs performance tests on wolfCrypt's algorithms. The performance of the TWR-K70F120M platform was determined under the MQX RTOS, using the fastmath library and the CodeWarrior 10.2 IDE. In the present research, beyond the performance measurements and assessment of the TWR-K70F120M platform running multithreaded applications with real-time constraints, this wolfCrypt benchmark was additionally applied for comparison purposes. In this case, the benchmark was run under μClinux (kernel v2.6.33) with the fastmath library, using the CodeWarrior 11.1 IDE for the application build [29].

3. Materials and Methods

3.1. Tower System Architecture

The tower system under investigation is based on the NXP (formerly Freescale) TWR-K70F120M board, which includes a Kinetis K70 family (MK70FN1M0VMJ12) microcontroller (MCU) with a 32-bit Arm Cortex-M4 core, running an embedded version of Linux (μClinux).
The tower system platform consists of the main controller module (TWR-K70F120M), primary and secondary side elevator boards, and a serial module (TWR-SER) (see Figure 1). Beyond the Kinetis MCU, the TWR-K70F120M controller module includes on-board 1 GB DDR2 SDRAM, 2 GB NAND Flash memory, and various user-controllable LEDs, push buttons, switches, touch pads, a potentiometer, etc. The Kinetis MK70FN1M0VMJ12 microcontroller includes a 32-bit Arm Cortex-M4 core with DSP instructions running at 120 MHz, 1 MB of program Flash, 128 KB of SRAM, a 16-bit ADC, a 12-bit DAC, and various peripheral communication circuits, and it operates at a low input voltage range (1.7–3.6 V). The ARM Cortex-M4 does not support simultaneous multithreading (SMT).
The TWR-K70F120M board is powered by connecting it to a host PC (running either Windows or Linux) through its mini-USB connector. For programming purposes, a serial interface to the host PC is established by plugging an RS-232 cable into the serial connector on the TWR-SER board. On the host PC side, this serial link provides a serial console to the TWR-K70F120M. The software installed on the board is configured for a terminal (e.g., PuTTY) with the following COM-port settings: 115200 8N1. On a Linux host, the serial console is available through a /dev/ttySn device. Network connectivity to the board is provided by plugging a standard Ethernet cable into the TWR-SER 10/100 Ethernet connector. The board is pre-configured with the IP address 192.168.0.2.
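As a concrete illustration of the host-side serial settings above, the sketch below opens the console from a Linux host and configures it for 115200 8N1 raw mode. It is a minimal example, assuming the console enumerates as /dev/ttyS0; the device path, function name, and error handling are illustrative and not taken from the project source.

/* Minimal sketch: open the board's serial console from a Linux host PC
 * and configure it for 115200 8N1. The device path /dev/ttyS0 is an
 * assumption; substitute the /dev/ttySn node actually enumerated. */
#include <fcntl.h>
#include <stdio.h>
#include <termios.h>
#include <unistd.h>

int open_console(const char *dev)
{
    int fd = open(dev, O_RDWR | O_NOCTTY);
    if (fd < 0) { perror("open"); return -1; }

    struct termios tio;
    if (tcgetattr(fd, &tio) != 0) { perror("tcgetattr"); close(fd); return -1; }

    cfmakeraw(&tio);                 /* raw mode: no echo, no line editing */
    cfsetispeed(&tio, B115200);      /* 115200 baud in both directions */
    cfsetospeed(&tio, B115200);
    tio.c_cflag &= ~(PARENB | CSTOPB | CSIZE);
    tio.c_cflag |= CS8 | CLOCAL | CREAD;   /* 8 data bits, no parity, 1 stop bit (8N1) */

    if (tcsetattr(fd, TCSANOW, &tio) != 0) { perror("tcsetattr"); close(fd); return -1; }
    return fd;                       /* ready for read()/write() to the TWR console */
}

int main(void)
{
    int fd = open_console("/dev/ttyS0");
    if (fd >= 0) close(fd);
    return 0;
}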

3.2. Tower System Software Infrastructure

Linux is an open-source OS that can be installed on a variety of architectures. A version of μClinux (kernel v2.6.33), intended for microcontrollers without MMUs, is loaded into the external Flash (2 GB) on the TWR-K70F120M board to support the execution of the measurement software. This version supports POSIX Threads (pthreads), a library that supports real-time and preemptive scheduling. Pthread functions are used to set the threads' real-time execution features, including scheduling policy, CPU affinity, and timing.
On power-on, the default configuration copies the Linux image from the external Flash to RAM and jumps to the Linux kernel entry point in RAM to boot Linux. The Tower System provides a complete platform for development and testing purposes. The CodeWarrior software was used for the development of the measurement multithreaded application. The application was loaded (with the Kinetis Flash Programmer) into the internal Flash (1 MB program Flash) of the Kinetis MK70FN1M0VMJ12 32-bit Arm Cortex-M4 microcontroller. The application uses the embedded SRAM (128 KB) of the microcontroller as storage for the data used in the computations (sorting and matrix operations). The system application development flow is shown in Figure 2.

3.3. Methodology

A primary goal of this research was to investigate some of the important performance metrics regarding timing constraints, such as task execution time, waiting time, and response time (latency), of the NXP TWR-K70F120M SBC running a version of μClinux for embedded microcontrollers without MMUs.
For this purpose, a specific measurement application was developed and implemented as a multithreaded software module. The design methodology was based on a dynamic and iterative multiple-thread-generation approach using POSIX Threads (an API defined by the IEEE POSIX.1c standard). Pthreads were preferred over an API such as OpenMP for the development of the multithreaded measurement software in order to retain better low-level control. This measurement software enabled the estimation of certain performance metrics under concurrent execution of real-time tasks (threads) on a μClinux kernel.

3.3.1. Multithreaded Application

The multithreaded design was based upon an algorithm which iteratively generates multiple processes (jobs) and threads (tasks) that run concurrently on the same core. Thus, the Cortex-M4 core is shared by at least two software threads supported by a single hardware thread. The generated processes perform two types of computations: sorting and matrix operations. Each process generates multiple pairs of threads assigned to run concurrently on a single core. Therefore, on every execution run, initially two software threads are instantiated, and later multiples of two.
In the “sorting” process, each thread executes a compute-intensive workload (function) that is either the selection sort or the quicksort algorithm. Selection sort is a simple sorting algorithm that divides the data array into a subarray of already sorted elements and a subarray of remaining elements to be sorted. The sorted array is populated iteratively: at each step, the smallest element of the unsorted subarray is placed at the end of the sorted subarray. Quicksort is a relatively more complex algorithm. It uses a divide-and-conquer strategy to divide the array into two subarrays. The sorted array is produced by a procedure that reorders the array, moving all the elements less than a chosen value (the pivot) to its left and all the elements greater than it to its right.
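For illustration, the two sorting workloads can be sketched in C as below. This is a minimal sketch under the assumption that each thread sorts a plain array of 32-bit integers; the function and variable names are illustrative and not the project's own code, which is available in the repository cited later [30].

/* Illustrative C sketch of the two sorting workloads; array handling
 * and names are assumptions, not the published project source. */
#include <stddef.h>

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Selection sort: grow the sorted prefix by repeatedly moving the
 * smallest remaining element to its end (O(n^2) comparisons). */
void selection_sort(int *v, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        size_t min = i;
        for (size_t j = i + 1; j < n; j++)
            if (v[j] < v[min]) min = j;
        swap_int(&v[i], &v[min]);
    }
}

/* Quicksort: partition around a pivot, then sort both halves
 * (divide and conquer, O(n log n) on average). */
void quick_sort(int *v, long lo, long hi)
{
    if (lo >= hi) return;
    int pivot = v[hi];
    long i = lo - 1;
    for (long j = lo; j < hi; j++)
        if (v[j] < pivot) swap_int(&v[++i], &v[j]);
    swap_int(&v[++i], &v[hi]);
    quick_sort(v, lo, i - 1);
    quick_sort(v, i + 1, hi);
}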
In the “matrix” process, each thread performs either matrix addition or matrix multiplication operations. Matrix multiplication is one of the basic computations used as a benchmark because of its very compute-intensive nature. For testing purposes, 2, 4, 8, 16, and 32 software threads were generated.
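The matrix workloads can be sketched similarly. The example below assumes square, row-major matrices of short integers; the dimension parameter and names are illustrative assumptions rather than the project's actual data layout.

/* Illustrative matrix workloads over row-major short-integer matrices;
 * the n-by-n layout and names are assumptions made for this sketch. */
#include <stddef.h>

void matrix_add(const short *a, const short *b, short *c, size_t n)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = (short)(a[i] + b[i]);            /* element-wise addition */
}

void matrix_mul(const short *a, const short *b, short *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            long acc = 0;                       /* accumulate in a wider type */
            for (size_t k = 0; k < n; k++)
                acc += (long)a[i * n + k] * b[k * n + j];
            c[i * n + j] = (short)acc;          /* truncation is tolerable for a benchmark load */
        }
}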
Threads belonging to the same process share all of the process's memory and resources. Therefore, the threads within each process share common data for their respective computations, held in shared SRAM memory. Different data blocks are allocated to each process in the on-chip SRAM to support the execution of multiple threads performing different operations. For the sorting operations, data arrays of 32-bit integers with sizes from 10^2 to 10^6 elements were allocated to each sorting thread in the on-chip shared memory to support concurrent execution. Sorted data output was written back to the same memory space. For the matrix operations, data sets of short integers with sizes of 512, 1024, 2048, 4096, and 8192 elements were used. A schematic view of the application's structure is shown in Figure 3.

3.3.2. Generation of Multiple Processes and Threads

Multiple processes and threads were created repetitively. The repetitive functions (create_processes(), create_threads()) build a tree structure of multiple processes and concurrent threads, up to given limits (np processes and nth threads). This tree structure is an aggregation of nodes consisting of processes and threads (see Figure 4). Each process contains a group of threads assigned to run concurrently and joined in synchrony (pthread_join()). In the experiments, each process ran a variable number of threads, from a minimum of two up to 32. These multiples of threads resulted in considerable simultaneous contention for core resources, an increase in thread switches, and thread delays while waiting for execution.
The algorithm for the iterative generation of each process distributes the workload equally across the threads for optimal computation. A synopsis of the generation of multiple threads and processes is shown as pseudocode in Algorithm 1 and as a flow chart in Figure 5.
Algorithm 1 Iterative multiple processes and threads creation
Inputs: number of processes p [2:np], number of threads th [2:32]
create_processes(p) {
  for each process up to np {
    set_process_attributes();
    set_process_policy(RoundRobin, 99);
    create_threads(th); } }
create_threads(th) {
  for each thread nth in [1:th] {
    pthread_create(&threads[nth], NULL, threads_function, NULL); }
  for each thread nth in [1:th] {
    pthread_join(threads[nth], NULL); } }
threads_function(nth) {
  set_thread_attributes();
  if (nth % 2 != 0)
    sort_ops();      // sorting operations
  else
    matrix_ops();    // matrix operations
  perform_measurements(); }
Outputs: measurements [execution time, threads waiting time, response latency]
The threads were synchronized with mutexes. Data allocated in the on-chip memory to support the execution of multiple threads were therefore protected by software mutexes (pthread_mutex_lock(), pthread_mutex_unlock()). Mutexes guarantee exclusive access to the resources and facilitate thread synchronization. The thread attributes were configured for real-time execution, including the scheduling policy (SCHED_RR), CPU affinity (core 0), priority (99), and timing. Precise timing and reliable performance metrics require an accurate timing source; the application software therefore performed its time measurements using the system call clock_gettime(), with the highest possible resolution and the clock id set to CLOCK_MONOTONIC. All threads shared the same high priority, which was set to 99. The SCHED_RR scheduling policy is preferred for real-time applications, so a Round Robin scheduler (SCHED_RR) was applied for real-time scheduling of the threads. The multithreaded application defined and implemented the scheduling policy and attributes of the threads. The C source code of the experimental software module is available as an open-source project on GitHub [30]. The same application module and performance measurement methodology could be applied to other Linux-based systems and platforms.
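The following condensed sketch illustrates how the scheduling, affinity, and timing settings described above can be combined with pthreads. It assumes a GCC/glibc-style toolchain that provides the non-portable affinity extension (pthread_attr_setaffinity_np); the worker() body, the data_lock name, and the omitted error handling are illustrative, not the published source [30].

/* Condensed sketch of the thread set-up described above: SCHED_RR,
 * priority 99, affinity to core 0, a mutex around the shared data, and
 * CLOCK_MONOTONIC timestamps. Names and structure are illustrative. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <time.h>

static pthread_mutex_t data_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    struct timespec t_begin, t_end;
    clock_gettime(CLOCK_MONOTONIC, &t_begin);   /* begin timestamp */

    pthread_mutex_lock(&data_lock);             /* exclusive access to shared SRAM data */
    /* ... sorting or matrix operations ... */
    pthread_mutex_unlock(&data_lock);

    clock_gettime(CLOCK_MONOTONIC, &t_end);     /* end timestamp */
    (void)arg;
    return NULL;
}

int spawn_rt_thread(pthread_t *tid)
{
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = 99 };
    cpu_set_t cpus;

    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_RR);   /* Round Robin real-time policy */
    pthread_attr_setschedparam(&attr, &sp);         /* priority 99 */

    CPU_ZERO(&cpus);
    CPU_SET(0, &cpus);                              /* pin to core 0 */
    pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);

    int rc = pthread_create(tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);
    return rc;                                      /* caller joins with pthread_join() */
}

Note that running threads with SCHED_RR at priority 99 normally requires root privileges (or the CAP_SYS_NICE capability) on a Linux system.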

3.4. Performance Metrics

In Linux operating systems, a task is synonymous with a thread. In real-time applications, each task is usually scheduled as a thread with a real-time SCHED_FIFO or SCHED_RR policy and a high priority. Time measurements are based on the system call clock_gettime(), defined in the POSIX timers API. This system call was invoked with the highest possible resolution and with the clock id set to CLOCK_MONOTONIC. During its execution, each working thread recorded data for the performance metrics to be evaluated (including execution time, waiting time, and response latency).

3.4.1. Execution Time

For a real-time system, a primary performance criterion under consideration is the amount of time it takes to complete a task, and consequently the system throughput. Therefore, the functions' execution time to perform the sorting and matrix operations must be measured, in order to examine the tower (TWR) unit and the μClinux OS in terms of real-time computing capability. An average value of the total execution time (t_{exec}) for a given number of execution runs (n iterations) was estimated by Equation (1):

t_{exec} = (t_{end} − t_{begin}) / n,   (1)

where t_{end} is the time at which the task's execution finishes, and t_{begin} is the time at which the execution starts. The above is illustrated in Figure 6.
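A minimal sketch of this measurement is given below, assuming a run_task() callback stands in for the sorting or matrix function being timed (the names are illustrative):

/* Time n back-to-back runs of a task with CLOCK_MONOTONIC and return
 * the average per-run execution time in milliseconds, per Equation (1). */
#include <time.h>

double avg_exec_ms(void (*run_task)(void), int n)
{
    struct timespec begin, end;

    clock_gettime(CLOCK_MONOTONIC, &begin);     /* t_begin */
    for (int i = 0; i < n; i++)
        run_task();
    clock_gettime(CLOCK_MONOTONIC, &end);       /* t_end */

    double elapsed_ms = (end.tv_sec - begin.tv_sec) * 1e3 +
                        (end.tv_nsec - begin.tv_nsec) / 1e6;
    return elapsed_ms / n;                      /* (t_end - t_begin) / n */
}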

3.4.2. Thread Waiting Time

Each thread was assigned to run an independent task, having its own stack while sharing the process's address space and time. The time it takes for a thread to execute a single task is essential in performance measurements. The threads ran concurrently on a single core; therefore, there were frequent periods in which a thread within a process was simply waiting (idle) for execution resources (see Figure 6). Consequently, the time a thread spends waiting for a resource or an event is an important metric. Usually, the time a thread spends waiting exceeds the time it spends using the resource, and reducing wait times is a common strategy for improving the overall performance of a system. Therefore, in order to evaluate the dynamic behavior of the threads in multithreaded applications, it is essential to measure their waiting time. A simple but effective way of identifying when a thread is active or idle (waiting) is to use a thread-specific state variable, which is set when the thread exits the execution cycle (e.g., waiting on a mutex for a resource) and is unset when it resumes execution. In this way, a global time counter was used to accumulate the total of each thread's waiting time (t_{wait}) on the basis of Equation (2):

t_{wait} = \sum_{i=1}^{m} t_{w_i},   (2)

where t_{w_i} is the i-th of the task's m individual waiting time intervals.
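A sketch of this bookkeeping is shown below, assuming the shared-data mutex of Section 3.3.2; the per-thread accumulator wait_ns and the GCC __thread storage class are illustrative choices, not the project's exact mechanism.

/* Timestamp the moment a thread starts waiting for the shared-data
 * mutex and add the interval to a per-thread accumulator once the lock
 * is acquired; the accumulator realizes t_wait as the sum of t_wi. */
#include <pthread.h>
#include <time.h>

extern pthread_mutex_t data_lock;              /* protects the shared data block */
static __thread long long wait_ns;             /* running sum of waiting intervals (ns) */

void lock_shared_data(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);       /* thread becomes idle (waiting) */
    pthread_mutex_lock(&data_lock);            /* blocks until the resource is free */
    clock_gettime(CLOCK_MONOTONIC, &t1);       /* thread is active again */

    wait_ns += (t1.tv_sec - t0.tv_sec) * 1000000000LL +
               (t1.tv_nsec - t0.tv_nsec);
}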

3.4.3. Response Time

Applications with timing constraints should respond as soon as possible and meet any specified deadlines; this is an important aspect of a real-time system. For hard real-time systems, the deadlines must always be met and the response times guaranteed. Therefore, it is important to measure the response time (or latency) of a task's running thread. A real-time task is characterized by its execution time, usually relative to a deadline, and a maximum (worst-case) response latency as the upper bound. The worst-case response latency is a typical metric of the determinism of a real-time task. A task's response time, or latency, is defined as the overall time elapsed from the arrival of the task to the moment the task is switched to the running state and produces its first response. The overall response latency includes the interrupt latency, i.e., the time from the arrival of the task to the start of its service, and the task's scheduling latency, i.e., the time it takes for the scheduler to run the task (see Figure 6). An average value of the response latency (r_l) over a number of runs (n iterations) was calculated using Equation (3):

r_l = (1/n) \sum_{i=1}^{n} (t_{fresp,i} − t_{arr,i}),   (3)

where t_{arr} is the task's arrival time, t_{fresp} is the task's first response time, and n is the number of performed iterations.
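The sketch below illustrates one way to obtain the per-run terms of Equation (3): stamp the arrival time just before the thread is created and stamp the first response as the first statement of the thread body. The struct and field names are illustrative assumptions; averaging over n runs is then a loop over this helper.

/* Record arrival and first-response timestamps for one task and return
 * the latency t_fresp - t_arr in microseconds. */
#include <pthread.h>
#include <time.h>

struct task {
    struct timespec t_arr;     /* arrival: stamped just before pthread_create() */
    struct timespec t_fresp;   /* first response: stamped on entry to the thread body */
};

static void *task_body(void *arg)
{
    struct task *t = arg;
    clock_gettime(CLOCK_MONOTONIC, &t->t_fresp);    /* first response */
    /* ... the task's sorting or matrix work ... */
    return NULL;
}

double response_latency_us(struct task *t, pthread_t *tid)
{
    clock_gettime(CLOCK_MONOTONIC, &t->t_arr);      /* task arrival */
    pthread_create(tid, NULL, task_body, t);
    pthread_join(*tid, NULL);                       /* ensure t_fresp has been written */
    return (t->t_fresp.tv_sec - t->t_arr.tv_sec) * 1e6 +
           (t->t_fresp.tv_nsec - t->t_arr.tv_nsec) / 1e3;
}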

4. Results

4.1. The Experimental Framework

The reliability of the TWR-K70F120M platform (TWR) in efficiently running the multithreaded application was examined with experimental measurements. As mentioned earlier, this platform integrates a Kinetis MK70FN1M0VMJ12 microcontroller, which includes a 32-bit Arm Cortex-M4 core with DSP instructions running at 120 MHz. In addition, for comparison purposes, the same experimental measurements were executed on two other popular single-board computers, a Raspberry Pi4 (RPi4) and a BeagleBone AI (BBAI) (see Figure 7). These low-cost, low-power, stand-alone single-board computers are extensively used in embedded applications. The Raspberry Pi4 board integrates a Broadcom BCM2711 SoC with a 4-core ARM Cortex-A72 processor running at 1.5 GHz. The BeagleBone AI is built upon a Texas Instruments Sitara AM5729 SoC with a dual-core ARM Cortex-A15 processor running at 1 GHz. Neither of these processors supports simultaneous multithreading (SMT). One of the goals of SMT is to keep the shared pipelines full where they would otherwise stall or have bubbles. The 32-bit ARM Cortex-M4 architecture is based on a small and efficient 3-stage pipeline that does not have considerable load-store latencies; its interrupt latency is 12 cycles. The ARM Cortex-A15 and Cortex-A72 core architectures have a 15-stage pipeline. All of the above platforms run Linux versions: the TWR runs μClinux (kernel v2.6.33), and the RPi4 and BBAI platforms run Ubuntu (kernel 4.14.74-v7+).

4.2. Performance Measurements on TWR-K70F120M Platform

A number of experiments were carried out to evaluate the performance of the tower architecture and verify its feasibility for efficiently running multithreaded applications with real-time features. The application module developed for the experimental runs initially scheduled one pair of computational tasks (two software threads) to run concurrently on a single core, and later multiples of this pair (2, 4, 8, and 16 pairs). Thus, the core resources are shared by at least two threads running concurrently. This core architecture by default allows a one-to-one thread-to-core mapping. This running scheme, in which multiple threads are assigned to run on the core with the same scheduling policy and priorities, added substantially to the workload.
Each process scheduled two tasks as software threads which ran concurrently and performed different computations. The computational tasks consisted of sorting and matrix operations. The sorting algorithms (selection sort and quicksort) were executed on a single core with multiple thread configurations (2 to 32) and datasets of 32-bit integers with sizes from 10^2 to 10^6 (100 to 1 M) elements. Similarly, the matrix operations (addition and multiplication) were executed with data sets of short integers with sizes of 512, 1024, 2048, 4096, and 8192 elements.
The experiments were executed for several thousand iterations in order to obtain a sufficient number of values for estimating averages. Table 1 presents the execution times obtained for the selection sort and quicksort algorithms for a variety of datasets and thread configurations.
The results reflect the differences in structure and complexity of the two sorting algorithms. Quicksort is a relatively more complex algorithm; however, it was faster than selection sort. For example, for an array of 1000 elements, selection sort requires O(n^2), or about 1,000,000 operations, whereas quicksort requires only about 10,000. In all cases, the quicksort algorithm achieved lower execution times than selection sort.
As depicted in Figure 8, the results indicate that data sorting takes less time as the number of threads increases. The increase in the number of threads indeed resulted in lower execution times, but execution did not become substantially more efficient: the execution time on a single core with 32 threads was lower than that with two threads by approximately 22%, in all cases of sorting operations.
Table 2 presents the results for the execution times for matrix additions and multiplications for a variety of datasets and threads.
The multiplication of matrices, even with a few thousand elements, is a heavily loaded function. Such a workload can impact the execution of the threads running concurrently on the same core. The results for the core's execution performance with multiple threads are illustrated in Figure 9. As is evident, there is a tendency toward increased execution times for all matrix operations as the number of threads increases. This is quite evident in the case of the 32-thread configurations: the execution time on a single core with 32 threads was higher than that with two threads by approximately 41%, in all cases of matrix operations. Although the computational workload was distributed equally across all threads, the threads' execution suffered from heavy contention for the same shared core resources.
Table 3 presents the averages of execution times, thread waiting times, and response times for both sorting and matrix operations. These are illustrated in Figure 10.
As the number of threads increases up to 32, the average execution time decreases by up to 16%, the average thread waiting time decreases by up to 7%, and the average first response time increases by up to 22%, across all operations. The overall results show the minimum response time to be on average 239 μs (in the case of two threads) and the maximum worst-case response latency to be on average 291 μs (in the case of 32 threads). Therefore, a value of about 300 μs, as an upper bound, could be an acceptable safety margin for such applications in most real-time systems. These results are illustrated in Figure 11. It is evident that multiple pairs of threads, particularly in the case of 32 threads, led to further contention for execution resources, resulting in greater delays in the first response.

4.3. Comparative Performance Measurements

An interesting question is how the Raspberry Pi4 and BeagleBone AI SBCs behave when executing the same software module. For comparison purposes, the application module was ported to and executed on these two single-board computers. Although all the devices, including the tower system (TWR-K70F120M platform), have embedded microcontrollers based on ARM processors and run Linux distributions, they are quite different in architecture and other specifications. Therefore, it is hard to provide generic performance conclusions. The intention was to confirm that the results obtained with the TWR platform are acceptable and comparable, rather than to provide a strict comparison, since the hardware architectures and OS kernel versions are quite different.
The RPi4 and BBAI are 4-core and 2-core devices, respectively. Since none of these cores supports simultaneous multithreading, each core can only run one hardware thread at a time; however, each core can switch between several (software) threads of the running application. For this reason, in order to make the comparisons fairer and more reliable, both the RPi4 and the BBAI were forced to use only a single core, as the TWR is based upon a single-core processor. Thus, the affinity was set to a single core (core 0).
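One way to enforce this restriction is sketched below: set the calling process's affinity mask to core 0 before any threads are created, so every software thread inherits it. The article sets the affinity per thread; this process-wide variant is a simplification, and the function name is illustrative.

/* Restrict the whole process (and the threads it spawns) to core 0,
 * so the RPi4 and BBAI runs use a single core, like the TWR. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_process_to_core0(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);                                   /* core 0 only */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) { /* pid 0 = calling process */
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}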
Table 4 provides the results obtained running the application for sorting and matrix operations on those platforms, and Table 5 provides a summary of the average values for both operations.
In general, the results show that the TWR performs relatively well compared to the other platforms. The performance of the multithreaded application running on the TWR platform was comparable to that on the RPi4 and BBAI, particularly in terms of execution times. The execution times on the TWR platform were lower than those on the other platforms, by approximately 56% in the case of two threads and by 29% in the case of 32 threads. As illustrated in Figure 12, the TWR performed better while the number of threads remained small. This was expected to some extent due to the single-core architecture, which runs more efficiently with fewer threads. The thread waiting times and response times were also comparable. The outcome is that, overall, the TWR platform is indeed reliable in efficiently handling the execution of the multithreaded application.

4.4. Benchmarking

wolfSSL Inc. has investigated the performance of the TWR-K70F120M platform using its in-house wolfSSL package. This package includes the wolfCrypt benchmark application. Because the underlying cryptography is a very performance-critical aspect of SSL/TLS, this benchmark application runs performance tests on wolfCrypt's algorithms. The performance of the TWR-K70F120M platform was determined under the MQX RTOS, using the fastmath library and the CodeWarrior 10.2 IDE.
In this research, I applied the same benchmark, but under μClinux, using the fastmath library and the CodeWarrior 11.1 IDE for the build. The typical output shows the time it took to run each benchmark (duration in seconds) and the throughput in MB/s. The results are summarized in Table 6. As illustrated in Figure 13, the results obtained under μClinux are very close to those under the MQX RTOS in terms of duration and throughput, with average deviations of 0.4 s and 0.3 MB/s, respectively.

5. Discussion

The experiments ran efficiently for several thousand iterations on the NXP TWR-K70F120M board, for a variety of datasets and thread configurations with an equally distributed computational workload. The resources of the single Cortex-M4 core were shared concurrently by multiple threads having the same real-time scheduling policy, attributes, and priorities. The computations, consisting of sorting and matrix operations, provided a sufficient workload. In addition, the differences in structure and complexity of the sorting algorithms, as well as the compute-intensive matrix operations, contributed towards a comprehensive view of the execution behavior of the device under investigation.
The running scheme applied gradually increased the number of threads involved in the execution of the tasks. This added substantially to the threads' contention for the core's shared execution resources. This impact became quite obvious in the matrix operations, resulting in an increase in average execution times: the execution time on a single core with 32 threads was higher than that with two threads by approximately 41%, in all cases of matrix operations. On the other hand, in the sorting operations, the increase in the number of threads resulted in lower execution times, although execution did not become substantially more efficient: the execution time with 32 threads was lower than that with two threads by approximately 22%, in all cases of sorting operations.
As a general outcome for both operation types, the average execution time decreased as the number of threads increased. In the case of 32 threads, the average execution time decreased by up to 16%, the average thread waiting time decreased by up to 7%, and the average first response time increased by up to 22%, across all operations. Taking into consideration that the average execution time decreased by up to 16%, the 22% increase in the first response time shows that the threads gradually performed better despite the contention for shared execution resources, particularly in the first rounds of executions.
Another important metric is the maximum worst-case response latency. This was shown to be on average 291 μs in the case of 32 threads, compared with 239 μs in the case of two threads. Therefore, a value of about 300 μs, as an upper bound, could be an acceptable safety margin for such applications in most real-time systems.
As stated in the related work section, such information on thread timing performance of this board running μClinux OS is not available in the literature. Research on performance analysis of this board by Vicari [27], although focused on debugging and tracing procedures, provided some limited information on average and worst-case execution times. However, this timing information was not based on running a multithreaded application with real-time attributes, as was the focus of the present research.
The comparative results with the RPi4 and BBAI showed that the TWR-K70F120M performed relatively well compared to these platforms, particularly in terms of execution times. The execution times on the TWR were lower than on the other devices, by approximately 56% in the case of two threads and by 29% in the case of 32 threads. In general, the TWR performed better while the number of threads remained small. This was expected to some extent due to the TWR's single-core architecture, which runs more efficiently with fewer threads. The thread waiting times and response times were also comparable. To my knowledge, such comparative data on timing performance metrics are not available in the literature.
The application of the wolfCrypt benchmark on the TWR-K70F120M platform under μClinux also reconfirmed the tower's appropriate performance in terms of duration and throughput. The results obtained under μClinux are very close to those under the MQX RTOS, with average deviations of 0.4 s and 0.3 MB/s, respectively. Overall, the TWR platform is indeed reliable in efficiently handling the execution of applications with multiple threads and real-time features.

6. Conclusions

This paper showed that the TWR-K70F120M platform is reliable in efficiently handling the execution of multithreaded applications with real-time attributes. This was documented primarily through experimental runs of a custom-built software application which generated, and scheduled for execution, multiple pairs of threads with equal workload and the same real-time policy. This application took measurements of timing metrics, including task execution time, thread waiting time, and first response time. Such information on timing performance metrics for this board is not available in the literature.
Comparative results on two other commonly used platforms, the Raspberry Pi4 and BeagleBone AI, also reconfirmed its efficient operation. The execution times on the TWR-K70F120M platform were substantially lower than on the others, by approximately 56% in the case of two threads and by 29% in the case of 32 threads. In addition, the wolfCrypt benchmark under μClinux, in terms of duration and throughput, also reconfirmed its appropriate performance.
In industry, the need for SBCs that support real-time applications running multithreaded processes at guaranteed times is of growing importance. Therefore, the outcomes of this research could be of practical value to companies which intend to use such a low-cost embedded device to develop reliable real-time industrial applications. Since almost all applications use multithreading techniques, the results provided by this research on thread timing metrics for this device could be taken into account during design and development, which in turn can save time, effort, and cost.
It is also important to state that the same application module and performance measurement methodology could be applied to other Linux-based systems and platforms. Research is already underway into the implementation of this device in the automation and control of machines for concrete products. The outcome of such real case studies will provide valuable feedback for future research.
The performance metrics investigated in this research include task execution time, worst-case execution time, thread waiting time, and response latency. Although these are some of the most interesting metrics for real-time applications, they could be extended to include additional metrics. A limitation of the multithreaded application used in this research is that the number of executable threads in a process is limited to 32. This number could be increased; however, the experiments show that a further increase in the number of threads that ran concurrently on a single core led to further contention for execution resources, resulting in longer delays in response times.

Funding

This research received no external funding.

Data Availability Statement

The experimental software module is available as an open-source project at GitHub https://github.com/gadam2018/NXP-TWR (accessed on 22 July 2023).

Acknowledgments

The author would like to thank the Computer Systems Laboratory (CSLab, https://cslab.ds.uth.gr/) (accessed on 12 June 2023) in the Department of Digital Systems, University of Thessaly, Greece, for the technical support and the resources provided for this experimental research.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Fernández-Cerero, D.; Fernández-Rodríguez, J.Y.; Álvarez-García, J.A.; Soria-Morillo, L.M.; Fernández-Montes, A. Single-Board-Computer Clusters for Cloudlet Computing in Internet of Things. Sensors 2019, 19, 3026.
  2. Adam, G.K. Real-time performance analysis of distributed multithreaded applications in a cluster of ARM-based embedded devices. Int. J. High Perform. Syst. Archit. 2022, 11, 105–116.
  3. Coelho, P.; Bessa, C.; Landeck, J.; Silva, C. The Potential of Low-Power, Cost-Effective Single Board Computers for Manufacturing Scheduling. Procedia Comput. Sci. 2023, 217, 904–911.
  4. Galkin, P.; Golovkina, L.; Klyuchnyk, I. Analysis of Single-Board Computers for IoT and IIoT Solutions in Embedded Control Systems. In Proceedings of the International Scientific-Practical Conference Problems of Infocommunications, Kharkiv, Ukraine, 9–12 October 2018; pp. 297–302.
  5. Prashanth, K.V.; Akram, P.S.; Reddy, T.A. Real-time issues in embedded system design. In Proceedings of the International Conference on Signal Processing and Communication Engineering Systems, Guntur, India, 2–3 January 2015; pp. 167–171.
  6. Hee, Y.H.; Ishak, M.K.; Asaari, M.S.M.; Seman, M.T.A. Embedded operating system and industrial applications: A review. Bull. Electr. Eng. Inform. 2021, 10, 1687–1700.
  7. Ungurean, I. Timing Comparison of the Real-Time Operating Systems for Small Microcontrollers. Symmetry 2020, 12, 592.
  8. Costa, D.G.; Duran-Faundez, C. Open-Source Electronics Platforms as Enabling Technologies for Smart Cities: Recent Developments and Perspectives. Electronics 2018, 7, 404.
  9. Rodrigues, J.M.F.; Cardoso, P.J.S.; Monteiro, J. Smart Systems Design, Applications, and Challenges; IGI Global: Hershey, PA, USA, 2020.
  10. KBV Research: Single Board Computer Market Size. Available online: www.kbvresearch.com/single-board-computer-market/ (accessed on 5 July 2023).
  11. NXP Semiconductors: Kinetis K70 120 MHz Tower System Module. Available online: www.nxp.com/design/development-boards/tower-development-boards/mcu-and-processor-modules/kinetis-modules/kinetis-k70-120-mhz-tower-system-module:TWR-K70F120M (accessed on 4 February 2023).
  12. Adam Co.: Automatic Press Machines. Available online: https://www.adam.com.gr/ (accessed on 15 March 2023).
  13. µClinux. Available online: https://en.wikipedia.org/wiki/%CE%9CClinux (accessed on 11 April 2023).
  14. ModBerry: Industrial Raspberry Pi with Compute Module 4. Available online: https://modberry.techbase.eu/tag/compute-module-4/ (accessed on 25 April 2023).
  15. NXP Semiconductors: S32V2 Vision and Sensor Fusion Low-Cost Evaluation Board. Available online: www.nxp.com/design/development-boards/automotive-development-platforms/s32v-mpu-platforms/s32v2-vision-and-sensor-fusion-low-cost-evaluation-board:SBC-S32V234 (accessed on 16 January 2023).
  16. RevPi: Open Source IPC based on Raspberry Pi. Available online: https://revolutionpi.com/ (accessed on 5 December 2022).
  17. Wang, K.C. Embedded and Real-Time Operating Systems; Springer: Cham, Switzerland, 2017.
  18. Seo, S.; Kim, J.; Kim, S.M. An Analysis of Embedded Operating Systems: Windows CE, Linux, VxWorks, uC/OS-II, and OSEK/VDX. Int. J. Appl. Eng. Res. 2017, 12, 7976–7981.
  19. Adam, G.K. Co-Design of Multicore Hardware and Multithreaded Software for Thread Performance Assessment on an FPGA. Computers 2022, 11, 76.
  20. MicroC/OS: Micro-Controller Operating Systems. Available online: https://en.wikipedia.org/wiki/Micro-Controller_Operating_Systems (accessed on 7 March 2023).
  21. Aysu, A.; Gaddam, S.; Mandadi, H.; Pinto, C.; Wegryn, L.; Schaumont, P. A design method for remote integrity checking of complex PCBs. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, Dresden, Germany, 14–18 March 2016; pp. 1517–1522.
  22. Musaddiq, A.; Zikria, Y.B.; Hahm, O.; Yu, H.; Bashir, A.K.; Kim, S.W. A Survey on Resource Management in IoT Operating Systems. IEEE Access 2018, 6, 8459–8482.
  23. Belleza, R.R.; Freitas, E.P. Performance study of real-time operating systems for internet of things devices. IET Softw. 2018, 12, 176–182.
  24. Petrellis, N. A Scalar Interpolator for the Improvement of an ADC Output Linearity. Int. J. Eng. Sci. Innov. Technol. 2014, 3, 591–600.
  25. Arrobo, G.E.; Perumalla, C.A.; Hanke, S.B.; Ketterl, T.P.; Fabn, P.J.; Gitin, R.D. An innovative wireless Cardiac Rhythm Management (iCRM) system. In Proceedings of the Wireless Telecommunications Symposium, Washington, DC, USA, 9–11 April 2014; pp. 1–5.
  26. Deschambault, O.; Gherbi, A.; Legare, C. Efficient Implementation of the MQTT Protocol for Embedded Systems. J. Inf. Process. Syst. 2017, 13, 26–39.
  27. Vicari, L. Performance Analysis of an Embedded System. Master's Thesis, Politecnico di Torino, Torino, Italy, 2018.
  28. wolfSSL Inc. Benchmarking wolfSSL and wolfCrypt. Available online: www.wolfssl.com/docs/benchmarks/ (accessed on 5 June 2023).
  29. NXP Semiconductors. CodeWarrior Embedded Software Development Tools. Available online: www.nxp.com/design/software/development-software/codewarrior-development-tools:CW_HOME (accessed on 8 December 2022).
  30. NXP-TWR. Available online: https://github.com/gadam2018/NXP-TWR (accessed on 20 July 2023).
Figure 1. The Tower System.
Figure 2. System application development flow.
Figure 3. Multithreaded application structure.
Figure 4. Processes and threads tree structure.
Figure 5. Iterative multiple processes and threads creation.
Figure 6. Task scheduling and running.
Figure 7. Experimental setup.
Figure 8. Sorting operation execution times.
Figure 9. Matrix operation execution times.
Figure 10. Sorting and matrix operation timing metrics.
Figure 11. Ratio of execution, threads waiting and response times.
Figure 12. Comparison of average values of timing metrics on different platforms.
Figure 13. WolfCrypt Benchmarking results for the TWR-K70F120M under MQX and μClinux.
Table 1. Selection sort and quicksort execution times.
Execution time (ms) for datasets of 32-bit integers of the indicated size.
Threads | Sort Alg | 10^2 | 10^3 | 10^4 | 10^5 | 10^6 | Avg | Mean | Ratio
2 | Selection | 0.25 | 0.45 | 1.52 | 2.56 | 5.90 | 2.136 | 1.83 | -
2 | Quick | 0.18 | 0.30 | 1.09 | 1.79 | 4.30 | 1.532 | |
4 | Selection | 0.22 | 0.37 | 1.27 | 2.12 | 5.25 | 1.846 | 1.67 | −9%
4 | Quick | 0.17 | 0.31 | 1.10 | 1.67 | 4.20 | 1.490 | |
8 | Selection | 0.20 | 0.29 | 1.10 | 1.98 | 4.97 | 1.708 | 1.58 | −14%
8 | Quick | 0.18 | 0.32 | 1.12 | 1.56 | 4.11 | 1.458 | |
16 | Selection | 0.17 | 0.25 | 1.02 | 1.88 | 4.56 | 1.576 | 1.48 | −19%
16 | Quick | 0.17 | 0.29 | 1.01 | 1.45 | 3.97 | 1.378 | |
32 | Selection | 0.16 | 0.23 | 0.97 | 1.82 | 4.48 | 1.532 | 1.44 | −22%
32 | Quick | 0.15 | 0.25 | 1.00 | 1.43 | 3.87 | 1.340 | |
Table 2. Matrix addition and multiplication execution times.
Execution time (ms) for matrices of short integers of the indicated size.
Threads | Matrix | 512 | 1024 | 2048 | 4096 | 8192 | Avg | Mean | Ratio
2 | Addition | 0.02 | 0.04 | 0.09 | 0.20 | 0.32 | 0.134 | 0.20 | -
2 | Multiplication | 0.05 | 0.09 | 0.21 | 0.29 | 0.67 | 0.262 | |
4 | Addition | 0.02 | 0.03 | 0.07 | 0.20 | 0.31 | 0.126 | 0.19 | −2%
4 | Multiplication | 0.04 | 0.11 | 0.22 | 0.29 | 0.65 | 0.262 | |
8 | Addition | 0.04 | 0.05 | 0.06 | 0.21 | 0.34 | 0.140 | 0.21 | 7%
8 | Multiplication | 0.06 | 0.13 | 0.24 | 0.32 | 0.66 | 0.282 | |
16 | Addition | 0.05 | 0.07 | 0.09 | 0.25 | 0.38 | 0.168 | 0.24 | 20%
16 | Multiplication | 0.08 | 0.15 | 0.26 | 0.35 | 0.69 | 0.306 | |
32 | Addition | 0.07 | 0.10 | 0.11 | 0.29 | 0.44 | 0.202 | 0.28 | 41%
32 | Multiplication | 0.12 | 0.20 | 0.31 | 0.40 | 0.75 | 0.356 | |
Table 3. Average execution times, thread waiting times, and response times.
All times in ms; Exec = average execution time, Wait = average thread waiting time, Resp = average response time.
Threads | Exec Sorting Ops | Exec Matrix Ops | Exec Mean | Exec Ratio | Wait Sorting Ops | Wait Matrix Ops | Wait Mean | Wait Ratio | Resp Sorting Ops | Resp Matrix Ops | Resp Mean | Resp Ratio
2 | 1.834 | 0.198 | 1.016 | - | 0.877 | 0.041 | 0.459 | - | 0.454 | 0.024 | 0.239 | -
4 | 1.668 | 0.194 | 0.931 | −8% | 0.686 | 0.052 | 0.369 | −20% | 0.470 | 0.015 | 0.243 | 1%
8 | 1.583 | 0.211 | 0.897 | −12% | 0.824 | 0.046 | 0.435 | −5% | 0.512 | 0.023 | 0.268 | 12%
16 | 1.477 | 0.237 | 0.857 | −16% | 0.765 | 0.060 | 0.413 | −10% | 0.531 | 0.010 | 0.271 | 13%
32 | 1.436 | 0.279 | 0.858 | −16% | 0.785 | 0.072 | 0.429 | −7% | 0.560 | 0.022 | 0.291 | 22%
Table 4. Results of timing metrics for sorting and matrix operations on different platforms.
All times in ms; Exec = average execution time, Wait = average thread waiting time, Resp = average response time.
Platform | Linux Kernel | Threads | Exec Sorting Ops | Exec Matrix Ops | Wait Sorting Ops | Wait Matrix Ops | Resp Sorting Ops | Resp Matrix Ops
TWR | 2.6 | 2 | 1.834 | 0.198 | 0.877 | 0.041 | 0.454 | 0.024
RPi4 | 4.14 | 2 | 3.12 | 0.23 | 1.3 | 0.8 | 0.9 | 0.008
BBAI | 4.14 | 2 | 2.79 | 0.2 | 0.9 | 0.7 | 0.5 | 0.007
TWR | 2.6 | 4 | 1.668 | 0.194 | 0.686 | 0.052 | 0.470 | 0.015
RPi4 | 4.14 | 4 | 2.89 | 0.17 | 1.2 | 0.6 | 0.8 | 0.006
BBAI | 4.14 | 4 | 2.36 | 0.19 | 0.6 | 0.5 | 0.5 | 0.007
TWR | 2.6 | 8 | 1.583 | 0.237 | 0.824 | 0.046 | 0.512 | 0.023
RPi4 | 4.14 | 8 | 2.5 | 0.14 | 1.2 | 0.7 | 0.7 | 0.005
BBAI | 4.14 | 8 | 2.08 | 0.18 | 0.5 | 0.5 | 0.4 | 0.006
TWR | 2.6 | 16 | 1.477 | 0.237 | 0.765 | 0.060 | 0.531 | 0.010
RPi4 | 4.14 | 16 | 2.39 | 0.12 | 1.1 | 0.6 | 0.6 | 0.004
BBAI | 4.14 | 16 | 1.98 | 0.16 | 0.3 | 0.4 | 0.5 | 0.006
TWR | 2.6 | 32 | 1.436 | 0.279 | 0.785 | 0.072 | 0.560 | 0.022
RPi4 | 4.14 | 32 | 2.3 | 0.11 | 1.1 | 0.7 | 0.7 | 0.005
BBAI | 4.14 | 32 | 1.87 | 0.15 | 0.3 | 0.5 | 0.6 | 0.006
Table 5. Summary of average values of timing metrics on different platforms.
Platform | Threads | Average Execution Time (ms) | Average Thread Waiting Time (ms) | Average Response Time (ms)
TWR | 2 | 1.016 | 0.459 | 0.239
RPi4 | 2 | 1.675 | 1.050 | 0.454
BBAI | 2 | 1.495 | 0.800 | 0.254
TWR | 4 | 0.931 | 0.369 | 0.243
RPi4 | 4 | 1.530 | 0.900 | 0.403
BBAI | 4 | 1.275 | 0.550 | 0.254
TWR | 8 | 0.897 | 0.435 | 0.268
RPi4 | 8 | 1.320 | 0.950 | 0.353
BBAI | 8 | 1.130 | 0.500 | 0.203
TWR | 16 | 0.857 | 0.413 | 0.271
RPi4 | 16 | 1.255 | 0.850 | 0.302
BBAI | 16 | 1.070 | 0.350 | 0.253
TWR | 32 | 0.858 | 0.429 | 0.291
RPi4 | 32 | 1.205 | 0.900 | 0.353
BBAI | 32 | 1.010 | 0.400 | 0.303
Table 6. WolfCrypt Benchmarking summary of results for the TWR-K70F120M.
wolfCrypt Benchmark | OS | Duration (s) | Throughput (MB/s)
AES 5120 kB | MQX | 9.059 | 0.55
AES 5120 kB | μClinux | 10.120 | 0.49
ARC4 5120 kB | MQX | 2.190 | 2.28
ARC4 5120 kB | μClinux | 2.201 | 2.27
DES 5120 kB | MQX | 18.453 | 0.27
DES 5120 kB | μClinux | 19.897 | 0.25
MD5 5120 kB | MQX | 1.396 | 3.58
MD5 5120 kB | μClinux | 1.411 | 3.54
SHA 5120 kB | MQX | 3.635 | 1.38
SHA 5120 kB | μClinux | 3.712 | 1.35
SHA-256 5120 kB | MQX | 9.145 | 0.55
SHA-256 5120 kB | μClinux | 9.356 | 0.53