1. Introduction
This paper focuses on central processing units for real-time embedded systems (RTESs). The majority of microprocessors available on the market are not designed for hard RTESs [1]. Advanced performance improvement techniques (pipelining, branch prediction units (BPUs), floating point units (FPUs), caching, memory management units (MMUs), frequency scaling, shared buses, etc.) sacrifice determinism and introduce timing anomalies [1,2,3], which increase the complexity of static timing analysis (STA) [4,5].
A good example of the increase in the complexity of STA is a pipeline stall, where execution of an instruction must stall (e.g., due to a register data dependency) for a number of extra cycles that depends on the pipeline depth. Another example is an incorrect prediction from the BPU, which forces the processor to discard speculatively fetched instructions, thus incurring a delay (equal to the number of stages between the fetch and execute stages [6]).
FPU performance depends on the implementation and the input operands; for example, a subnormal input can increase the execution time by two orders of magnitude [7]. A cache miss requires the upper memory layers to be accessed, which imposes a much longer delay. Accessing a memory page that is not mapped into the virtual address space causes a page fault in the MMU, forcing a page to be loaded from disk, which again incurs a delay. Frequency scaling and shared buses exhibit similar non-deterministic delays. All these performance-improving techniques introduce timing anomalies and increase the complexity of STA.
There is a misconception that fast computing equals real-time computing. Rather than being fast, the most important property of RTESs is predictability [8]. All of the techniques mentioned above are sources of indeterminism. They add complexity to static analysis tools and have a negative impact on worst-case execution time (WCET) analysis, which determines the bounded response time of an RTES. Although acceptable WCET analysis is still possible in the presence of those advanced techniques (through end-to-end testing, static analysis, and measurement-based analysis [9]), achieving better WCET analysis when some features are present (e.g., caches [1]) is still an open problem. Therefore, for hard real-time systems, designers tend to use simpler microprocessors that have adopted the reduced instruction set computer (RISC) architecture with fewer of those performance-improving features. The RISC architecture has a major advantage in real-time systems, as the average instruction execution time is shorter than in the complex instruction set computer (CISC) architecture; this leads to shorter interrupt latency and shorter response times [10]. One of the major neglected sources of performance inconsistency is an indeterministic instruction set architecture (ISA). Branch instructions require more clock cycles if taken than if not taken; for example, ARM11 branch instructions require three clock cycles if taken, but one cycle if not taken [11]. In the PowerPC 755, a simple addition may take anywhere from 3 up to 321 cycles [12] due to its non-compositional architecture [13], which produces a domino effect.
For most 4-bit, 8-bit, 16-bit, and non-pipelined microarchitectures without caches, one can simply sum up the execution times of the individual instructions to obtain the exact execution cycle count of an instruction sequence [14,15]. This is only valid if the ISA of a microarchitecture is deterministic. In this context, determinism means that the exact number of clock cycles is known for every instruction, and that the number of clock cycles per instruction is fixed and does not vary based on previous states of the processor. This property is very important in hard real-time embedded systems that need to respond to external events (e.g., execution completion of machine instructions in a procedure) with precise timing. In those systems, WCET estimation cannot be used, as even a single clock cycle of deviation from the expected timing makes the system non-functional. A good example of such a system is the controller of a multi-core architecture, where a complex finite state machine performs the role of an operating system, delegates independent tasks to cores, and retrieves the results.
Consequently, RISC-V, ARM, Intel, MIPS, and all processors that have a pipeline, cache system, or other sources of indeterminism cannot be used in systems where cycle-accurate prediction is a hard requirement. PicoBlaze is a good choice, as it is already a deterministic core (uniform CPI = 2) with relatively low performance. It can be used as a controller for a complex finite state machine that governs multiple cores.
In this paper, a technique for deterministic branch prediction is proposed. Using the proposed design, the processor always has the correct program counter value regardless of whether the branch is taken or not, which eliminates ISA indeterminism. The Xilinx PicoBlaze firm core has a clock-cycles-per-instruction (CPI) value of two for all of its instructions [16]; it is already a deterministic core by virtue of this fixed CPI. In this paper, it is modified to incorporate the proposed architecture: a lookahead circuit, in conjunction with a dual-fetch mechanism, is employed to reduce the CPI from two to one while retaining ISA determinism (identical CPI for all instructions).
The uniform CPI = 1 for all instructions is achieved by removing the register data dependency and the flags/conditional branch interlocks. That is why the "branch delay" and "load delay" definitions are given; how other architectures have dealt with them will also be discussed. Note that CPI provides a sufficient way of comparing two different implementations of the same ISA (in our case, the PicoBlaze ISA) [17]; therefore, no benchmarking program is required, because both cores execute the same instruction sequence.
The objective and contribution of our work is to improve processor performance without sacrificing ISA determinism. In the case of the Xilinx PicoBlaze, this objective translates to improving the performance of the core from CPI = 2 to CPI = 1. A dual-fetch technique alongside a branch prediction circuit is proposed that fetches two instructions in one clock cycle and uses the second fetch for the sole purpose of removing branch and load delays, with the goal of achieving a uniform CPI = 1. The dual-issue technique (related work) requires a pipeline; it refers to fetching two instructions at each clock cycle and then issuing them to the next stage of a pipeline to achieve CPI = 0.5, without a guarantee of CPI uniformity. In our ongoing project, a complex finite state machine has been implemented using a PicoBlaze core that controls 1024 other PicoBlaze cores. Because of the deterministic ISA, the state machine can react to external triggers, such as the completion of a procedure execution, and can retrieve and then pass the result to other cores at precise clock cycles (precise timing).
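As a rough illustration of the relationship the objective relies on, MIPS throughput is the clock rate divided by CPI (and by 10^6). The sketch below uses a hypothetical 100 MHz clock, not a figure reported in this work:

```python
# Illustrative only: relate CPI to MIPS (million instructions per second).
# The 100 MHz clock is a hypothetical figure, not a number from the paper.
def mips(clock_hz: float, cpi: float) -> float:
    """MIPS = clock rate / (CPI * 10^6)."""
    return clock_hz / (cpi * 1e6)

clock_hz = 100e6                   # hypothetical oscillator
mips_cpi2 = mips(clock_hz, 2.0)    # original PicoBlaze/Zipi8, CPI = 2
mips_cpi1 = mips(clock_hz, 1.0)    # DAP-Zipi8, CPI = 1
```

At an equal clock rate, halving CPI doubles MIPS; the reported 18.28–19.49% gain is smaller than 2x because the added prediction logic lengthens the critical path and lowers the achievable clock frequency.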
The contributions of this paper are:
A microprocessor architecture that eliminates branch and load delays to achieve uniform CPI = 1 values.
The utilization of unused ports of FPGA memory primitives to boost overall processor performance while retaining ISA determinism.
An 18.28–19.49% performance improvement of the Xilinx PicoBlaze in terms of MIPS.
Preliminary definitions are provided in the next section, and related work is presented in Section 3. A brief overview of the PicoBlaze architecture is then provided in Section 4. In Section 5, a technique (proposed in [18]) is employed to transform the PicoBlaze into a modifiable soft core named Zipi8; the source code of the new core is written at the RTL level, which makes architectural customization possible. Section 7 discusses the Zipi8 modifications used to achieve CPI = 1; the modified core is named DAP-Zipi8. The work presented in that section contains the two main contributions of the paper. Finally, the comparison of resource and power utilization for DAP-Zipi8 versus PicoBlaze is presented in Section 8. The verification process is covered in Section 9.
2. Definitions
Real-time systems (RTSs) are computing systems that must react within precise time constraints to events in the environment [
19]. We can categorize RTSs into three groups [
18]:
Hard RTSs: impose strict timing requirements with fatal consequences if temporal demands are not met.
Soft RTSs: set coarse temporal requirements, without catastrophic consequences if several deadlines are missed.
Firm RTSs: set fine-grained temporal requirements, without fatal consequences in the case of infrequent deadline misses.
Embedded systems are computing systems with tightly coupled hardware and software integration that are designed to perform a dedicated function [
20]. The reactive nature of embedded systems is shown in
Figure 1. A reactive system must respond to events in the environment within defined time constraints. External events being aperiodic and unpredictable makes it more difficult to respond within a bounded time frame [
21].
Hard real-time embedded systems (RTESs) refer to those embedded systems which require real-time behavior, with fatal consequences for a missed deadline [22]. The software part of an RTS is an application that runs either in stand-alone mode (bare metal) or is scheduled as a task on a real-time operating system (RTOS). The hardware part includes one or more central processing units (CPUs), memory elements, and input/output (I/O) devices with interrupt mechanisms to provide deterministic bounded responses to external events.
The term
timing anomaly refers to a situation where a local worst case does not entail the global worst case. For instance, a cache miss (the local worst case) may result in a shorter execution time than a cache hit due to scheduling effects [
3]. The
domino effect is a severe special case of timing anomalies that causes the difference in execution time of the same program starting in two different hardware states to become arbitrarily high [
13].
One of the metrics of microprocessor performance is the average number of clock cycles per instruction (CPI); the lower the value, the better the performance. Given a sample program with n instruction types, the instruction count IC_i for each instruction type i, and the number of clock cycles CC_i needed to execute an instruction of type i, CPI can be defined as shown in Equation (1).

CPI = (Σ_{i=1}^{n} IC_i × CC_i) / (Σ_{i=1}^{n} IC_i)   (1)
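By way of illustration, Equation (1) can be evaluated with a short script; the instruction mixes below are invented for the example and are not benchmarks from this work:

```python
# A minimal sketch of Equation (1): CPI is the weighted average of the
# clock cycles per instruction type over a given instruction mix.
def cpi(mix):
    """mix: list of (instruction_count, cycles_per_instruction) pairs."""
    total_cycles = sum(ic * cc for ic, cc in mix)
    total_insts = sum(ic for ic, _ in mix)
    return total_cycles / total_insts

# For a deterministic ISA such as PicoBlaze's, CC_i = 2 for every
# instruction type, so CPI is exactly 2 regardless of the mix:
assert cpi([(10, 2), (5, 2), (85, 2)]) == 2.0
# With varying CC_i, CPI depends on the particular instruction mix:
assert cpi([(75, 1), (25, 3)]) == 1.5
```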
CPI in conjunction with the processor clock rate can be used to determine the time needed to execute a program [17]. The classic 8051 CPU requires at least 12 cycles per instruction (CPI ≥ 12) [23], and PIC16 takes 4 cycles or more (CPI ≥ 4) [24], but ** clocks to eliminate load and branch delays. The drawbacks of this approach are:
Incompatibility with optimization algorithms embedded in electronic design automation (EDA) tools.
No FPGA primitive support to implement the design.
Accessing memory after a MUL instruction needs two cycles instead of one, and interrupts and events incur a delay in some cases.
Difficulty reaching high clock speeds (e.g., 60 MIPS needs a 120 MHz oscillator).
5. Zipi8: A PicoBlaze Compatible Soft Core
In this section, the methodology behind transforming a PicoBlaze firm core to a soft core using vendor-independent primitive definitions (in VHDL) is detailed.
5.1. Primitive Conversion to Vendor-Independent VHDL
One of the primitives listed in the previous section is picked as an example: LUT6. The Xilinx Library Guide reads: “LUT6 is a six-input look-up table (LUT), it can either act as asynchronous 64-bit ROM (with 6-bit addressing) or implement any six-input logic function” [73]. A VHDL implementation must be written according to the extracted definition of the primitive.
Listing 1 shows one of the LUT6 instances used in the PicoBlaze core as an example. ‘pc_mode2_lut’ is the instance name, and 0xFFFF_FFFF_0004_0000 is a 64-bit hexadecimal constant used as the initial value of the LUT6 primitive. I0, I1, I2, I3, I4, and I5 are the input signals, and O is the output signal.
First, Boolean minimization of the six-input logic function is performed using the given 64-bit LUT value. The minimization can be done either manually or automatically, using algorithms such as the Espresso logic minimizer [74]. Equation (2) shows the result of minimizing the six-input logic function LUT6(I5, I4, I3, I2, I1, I0) shown in Listing 1.
Listing 1: An example of LUT6 primitive instantiation used in the PicoBlaze core.
pc_mode2_lut : LUT6
generic map (INIT=>X"FFFFFFFF00040000")
port map (
I0 => instruction (12),
I1 => instruction (14),
I2 => instruction (15),
I3 => instruction (16),
I4 => instruction (17),
I5 => active_interrupt,
O => pc_mode (2)
);
After replacing the I0, I1, I2, I3, I4, I5, and O variables in Equation (2) with the name of signals connected to them, the exact equivalent vendor-independent VHDL implementation of LUT6 can be derived, as shown in Listing 2.
Listing 2: An example of vendor-independent VHDL implementation of LUT6.
pc_mode (2) <=
  active_interrupt or
  (instruction (17) and
   (not instruction (16)) and
   (not instruction (15)) and
   instruction (14) and
   (not instruction (12)));
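Since the LUT6 truth table is fully specified by its INIT constant, the minimization can be checked exhaustively. The sketch below (Python, for illustration only) verifies that the expression in Listing 2 reproduces every bit of 0xFFFF_FFFF_0004_0000:

```python
# Exhaustively verify that the minimized function of Listing 2 matches the
# 64-bit INIT constant of the LUT6 in Listing 1. The LUT output for input
# vector (I5..I0) is bit number (I5 I4 I3 I2 I1 I0) of INIT, I0 = LSB.
INIT = 0xFFFFFFFF00040000

def lut6(addr):                            # truth-table lookup
    return (INIT >> addr) & 1

def minimized(i5, i4, i3, i2, i1, i0):     # Equation (2) / Listing 2
    return i5 | (i4 & (i3 ^ 1) & (i2 ^ 1) & i1 & (i0 ^ 1))

for addr in range(64):
    bits = [(addr >> k) & 1 for k in range(6)]   # bits[k] = Ik
    assert lut6(addr) == minimized(bits[5], bits[4], bits[3],
                                   bits[2], bits[1], bits[0])
```

If any of the 64 input combinations disagreed, the loop would raise an AssertionError; the same check can be applied to every converted primitive.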
The same procedure applies to the other primitives. The vendor-independent VHDL implementations of the remaining primitives, including LUT6_2, FD, FDR, FDRE, XORCY, MUXCY, RAM32M, and RAM256X1S, can be found in Supplementary S1, which includes the VHDL source code of all primitives in a Xilinx Vivado project.
5.2. Modular Conversion of PicoBlaze to Zipi8
The PicoBlaze VHDL source code has no modular structure; it is a single module in one VHDL file with a long list of primitive instantiations connected via signals. To convert the design from a firm core (PicoBlaze) to a soft core (named Zipi8 by the authors), it is sufficient to directly replace all the instances with the vendor-independent VHDL equivalent code, as described in the previous section. If, along the way, related primitives are grouped into VHDL modules (based on the characteristic equations of the flip-flops) before the transformation is performed, then complexity can be managed, human errors are minimized, and a modular design emerges. Additionally, the process provides a better understanding of the internal architecture of the design.
The PicoBlaze core is transformed into 16 modules which use source code comments and original primitive names. The module names are listed below, and their source code can be found in
Supplementary S1:
arith_and_logic_operations;
decode4alu;
decode4_pc_statck;
decode4_strobes_enables;
flags;
mux_outputs_from_alu_spm_input_ports;
program_counter;
register_bank_control;
sel_of_2nd_op_to_alu_and_port_id;
sel_of_out_port_value;
shift_and_rotate_operations;
spm_with_output_reg;
stack;
state_machine;
two_banks_of_16_gp_reg;
x12_bit_program_address_generator.
The modules listed above and the important signals between them are shown in Figure 4. It is a simplified version of a fully detailed schematic that is available (in Supplementary S2) in Encapsulated PostScript (EPS) format. To simplify the diagram, two or three related modules are occasionally combined; this is indicated by listing the module numbers in parentheses. For example, the ‘Decoders’ module consists of three submodules: (2), (3), and (4). Both the program memory and the processor share the same clock signal. Those modules which are synchronized with the clock are marked with a triangular symbol; the absence of a clock symbol indicates pure combinatorial logic (CL) (e.g., the ‘Operand Selection’ module).
5.3. Zipi8 Architecture
The important paths, such as the ‘data path’ and ‘instruction path’, are explicitly marked in
Figure 4. The allocation of two separate buses connected to two different memory blocks indicates a Harvard architecture [
56]. To explain the instruction execution mechanism of PicoBlaze, a sample program (Listing 3) with a branch instruction is manually traced.
Listing 3: A sample PicoBlaze program.
Start_at_0x000:
LOAD s0, 05 ;Loads value 05 into register s0 – Mem. Location: 0x000
LOAD s1, 04 ;Loads value 04 into register s1 – Mem. Location: 0x001
JUMP subprogram_at_01c ; – Mem. Location: 0x002
; ...
subprogram_at_01c:
ADD s1, s0 ; s1 <= s1 + s0 ; – Mem. Location: 0x01c
As shown in Figure 5, the de-assertion of the reset signal puts the processor into the run state. In this state, the processor waits for the first rising edge of the clock, which triggers an instruction fetch from memory location 0x000. The fetch results in the ‘Instruction Path’ bus (see Figure 4) holding valid data (the first instruction, ‘LOAD s0, 05’, in Listing 3).
The instruction bus is connected to flip-flops in the ‘Decoders’, ‘State Machine & Control’, ‘Flags’, and ‘Program Counter’ modules. When the second clock edge arrives, the instruction is decoded (sx_addr is set to 0 to select register s0, and the constant value 05 is placed on the instruction[7:0] bus, the kk instruction bitfield), the next state of the machine is calculated, the flags are set, and, finally, the program counter (PC) is incremented by one.
In the third clock cycle, the instruction at location 0x001 (which is ‘LOAD s1, 04’) is fetched and, in parallel, the result of the ALU is written back into the register bank; this leaves the s0 register holding the constant value 05.
As with the previous instruction, the decode and execute stages happen in the next clock cycle, which sets the sx_addr signal (see Figure 4) to 1 and prompts the second ALU operand (the kk bitfield) to hold the constant value 04. In the next clock cycle, the processor writes the result back into the register bank, resulting in the constant value 04 being stored in the s1 register and, at the same time, the next instruction (‘JUMP subprogram_at_01c’) being fetched.
In the next cycle, the JUMP instruction is decoded and, instead of ‘pc = pc + 1’ and the next consecutive instruction being fetched, pc is set to 0x01C, the jump target location. In the next cycle, the instruction at location 0x01C of program memory (‘ADD s1, s0’) is fetched. The ADD instruction is then decoded, and the ALU needs some time (the ALU propagation delay) to perform the add operation. The result is ready before the rising edge of the next clock cycle arrives, at which point it is written back into the s1 register, and so on. This manual execution trace clearly shows the behavior of the PicoBlaze when it executes a branch instruction in two clock cycles.
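The two-cycle trace above can be condensed into a toy behavioral model (an illustrative Python sketch; flags, the stack, and I/O are omitted, and opcodes and addresses follow Listing 3):

```python
# A toy behavioral model of the two-cycle PicoBlaze execution traced above.
# Register names, opcodes, and addresses follow Listing 3; everything else
# is simplified for illustration (no flags, interrupts, or stack).
program = {
    0x000: ("LOAD", "s0", 0x05),
    0x001: ("LOAD", "s1", 0x04),
    0x002: ("JUMP", 0x01C),
    0x01C: ("ADD", "s1", "s0"),
}

regs, pc, cycles = {"s0": 0, "s1": 0}, 0x000, 0
for _ in range(4):                        # the four instructions in the trace
    inst = program[pc]
    cycles += 2                           # every instruction takes 2 cycles
    if inst[0] == "LOAD":
        regs[inst[1]] = inst[2]; pc += 1
    elif inst[0] == "ADD":
        regs[inst[1]] += regs[inst[2]]; pc += 1
    elif inst[0] == "JUMP":
        pc = inst[1]                      # taken or not, still 2 cycles

assert regs["s1"] == 0x09 and cycles == 8  # deterministic: 4 * 2 cycles
```

Because every instruction costs exactly two cycles, the total of eight cycles for four instructions is known in advance, which is precisely the determinism property the paper relies on.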
Each original PicoBlaze instruction takes exactly two clock cycles (CPI = 2), making its ISA performance deterministic. This turns PicoBlaze into a suitable candidate for safety-critical real-time embedded systems [
38] if its performance can be improved without adding a pipeline or caches. In the next section, a new design is proposed that achieves CPI = 1 with PicoBlaze, resulting in significant performance improvement.
5.4. Zipi8 Verification
We use the
comparison method to verify the integrity of the Zipi8 core against the PicoBlaze. Flip-flop output signals have a one-to-one relationship in both cores. Therefore, the transformation process can be validated by probing signals at all output junctures of flip-flops in both cores and by using VHDL
assert statements to catch any discrepancies between them. Verification details and extra information on PicoBlaze to Zipi8 conversion can be found in [
18].
7. Zipi8 (CPI = 2) to DAP-Zipi8 (CPI = 1)
The first step is to set the program memory BRAM to dual-port mode. Apart from the address and instruction buses, two more buses, named address2 and instruction2, are added to fetch an extra instruction on every rising edge of the clock. The original design updates the PC signal every two cycles based on the control signal t_state(1), which toggles every cycle.
By removing the t_state(1) signal, the PC value is forced to update every clock cycle. The next step is to remove all of the D flip-flops (FDs) which take part in the construction of the two-stage pipeline. The modifications applied to all 16 modules of the Zipi8 core are listed in Supplementary S4.
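The effect of the t_state(1) interlock can be pictured with a toy counter model (an illustrative sketch; it assumes the PC update is enabled in the cycle when t_state(1) is high):

```python
# Toy model of the t_state(1) interlock: in the original two-cycle design
# the PC is updated only when t_state(1) is high (it toggles every cycle);
# removing the gate makes the PC advance on every rising edge. Which phase
# enables the update is an assumption made for illustration.
def run(cycles, gated):
    pc, t_state1 = 0, False
    for _ in range(cycles):
        t_state1 = not t_state1          # toggles every clock
        if (not gated) or t_state1:
            pc += 1                       # sequential flow: pc = pc + 1
    return pc

assert run(8, gated=True) == 4    # CPI = 2: one instruction per 2 cycles
assert run(8, gated=False) == 8   # CPI = 1: one instruction per cycle
```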
After applying the changes, a single-cycle processor is nearly achieved: it performs the fetch, decode, and execute stages all in one cycle. However, the new design fails to calculate the correct next pc value if the processor state machine deviates from the normal flow (pc_value = pc + 1). Figure 9 elaborates on this failure when the normal flow is disrupted by a branch instruction. Let us assume that the instruction at memory location 0x002 is a conditional jump to an arbitrary target memory address ‘x’. The processor fetches inst_0 and inst_1 from memory locations 0x000 and 0x001 as normal. The PC value is then set to 0x002 and, in the next clock cycle, the jump@x instruction is fetched. As the design still needs two clock cycles to calculate the right pc_value, the jump target address propagates to pc one clock cycle late, and the instruction after the conditional jump (inst_3, which should not be reached by the processor, since inst_2 is a jump and is taken) is wrongly fetched. This is the inherent problem of branch instructions that leads to pipeline stalls, as discussed earlier.
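The failure mode of Figure 9 can be mimicked with a toy fetch-stream model (illustrative Python; addresses follow the example above, and `lookahead` stands in for the prediction mechanism introduced in the next subsection):

```python
# Toy fetch-stream model of the failure in Figure 9. A taken jump sits at
# address 0x002 with target X; without lookahead, the redirect reaches the
# PC one cycle late and the wrong instruction (at 0x003) is fetched first.
TARGET = 0x01C  # the arbitrary target 'x' in the text

def fetch_stream(lookahead, n=5):
    pc, fetched, redirect = 0x000, [], None
    for _ in range(n):
        fetched.append(pc)
        if redirect is not None:          # late redirect applies only now
            pc, redirect = redirect, None
        elif pc == 0x002:                 # the taken jump
            if lookahead:
                pc = TARGET               # target predicted in time
            else:
                redirect = TARGET         # known one cycle too late
                pc += 1                   # ...so inst_3 is fetched wrongly
        else:
            pc += 1
    return fetched

assert fetch_stream(lookahead=False) == [0x000, 0x001, 0x002, 0x003, TARGET]
assert fetch_stream(lookahead=True) == [0x000, 0x001, 0x002, TARGET, TARGET + 1]
```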
7.1. Adding the Dual Address Bus and Branch Prediction Circuit
The main idea behind the dual address bus branch prediction (DAP) circuit is to fetch two instructions per clock cycle by using a dual-port program memory block. This allows the circuit to predict the next PC value correctly by decoding the first fetched instruction in one clock cycle and then using the decoded signals in the execution step of the next clock cycle.
The schematic provided in Figure 10 shows the Zipi8 modules which must be modified to accommodate the DAP circuit (added signals are in blue). Note that the ALU, decoder, and SPM modules are not shown in this figure, as these modules remain intact. The most important added signals are instruction2 and address2, which are connected to the second port of the external program memory BRAM. The address2 signal (driven by pc2) holds the address of the second instruction, which is always fetched in parallel with the current instruction (addressed by pc). Both pc and pc2 are generated by the ‘Program Counter’ module (modules no. 7 and 16).
The second most important modification is the conversion of the RAM32M primitives in modules no. 7, 13, and 16 to dual-port instances. This enables two locations of the stack memory and of the registers (which are also memory) to be accessed instead of one, through simultaneous access to PORTA and PORTB of the block RAMs on every clock cycle. This is mandatory for predicting the target address of instructions such as ‘RETURN’, which uses the stack memory (through PORTB of the stack memory), or ‘CALL@(sX, sY)’, which uses the register banks (through PORTB of the register bank memory). The sx_addrB and sy_addrB signals are connected to PORTB of the register memory BRAM, which makes simultaneous access to 2 of the 32 registers possible through PORTA and PORTB.
Referring to Figure 10, the sxB and syB signals are the outputs of the BRAM’s PORTB in the general-purpose register banks. The push_stack2 and pop_stack2 signals, alongside the pc2 signal, are added to PORTB of the BRAM used in the ‘Stack’ memory module. These signals assist the prediction of the correct stack pointer value; consequently, the ‘Stack’ module can then set the correct ‘stack_memory’ value, as well as the other necessary outputs.
Next is the addition of the internal_reset_delayed signal to the ‘State Machine’ module. As can be seen in Figure 9, this signal goes low one clock cycle earlier than the internal_reset signal, which provides one extra clock cycle to the ‘Program Counter’ module for predicting the pc2 value. The carry_flag_value and zero_flag_value signals (see Figure 10) are simply the next values of the carry_flag and zero_flag signals, calculated based on the execution of the current instruction. A few internal signals of the ‘Flags’ module are required to be routed out of the module for use as inputs to the ‘Program Counter’ module for prediction. In the original design, the ‘Register Bank Control’ module is responsible for producing sx_addr and sy_addr and depends on the sx_addr4_value produced by the ‘State Machine’ module. The purpose of reusing the ‘Register Bank Control’ logic and moving it into the ‘Program Counter’ module is to generate the sx_addr[4] signal. The core of the prediction mechanism is inside the ‘Program Counter’ module, which is discussed in the next section.
7.2. Program Counter Module Modification
In the original PicoBlaze, the ‘Program Counter’ module is responsible for determining the next PC value in each clock cycle.
Figure 11 depicts the internal structure of the modified ‘Program Counter’ module. Analysis of PicoBlaze shows that the ‘Program Counter’ module receives the following signals as input:
The module then calculates the pc_value signal as output, which is clocked into the PC register. This constructs a simple Mealy state machine, where the output depends on the inputs and the current state of the machine. The analysis identifies the four necessary signals that must be present to calculate the next PC register value (the pc_value signal).
At the heart of Figure 11 is ‘combinational logic (A)’, which receives the pc, register_vector, pc_vector, and pc_mode values and generates pc_value, the next value of the PC register. This block is purely combinatorial and, in the original design, is constructed using LUT6, MUX, and XOR primitives. An exact duplicate of this block, named ‘combinational logic (B)’, is used to generate the pc2_value signal. The inputs to combinational logic (B) are derived from exact duplicates of the ‘(16) 12-bit Address generation’ and ‘(3) Decoding for program counter and stack’ modules. Instead of the instruction, sx, sy, carry_flag, and zero_flag signals, the instruction2, sxB, syB, carry_flag_value, and zero_flag_value signals drive their inputs. This produces the pc2_value signal, which is the guessed candidate for the next PC register value.
Three modes are defined based on the two fetched instructions A and B, and the details of how the final value of the PC register is calculated and set are then discussed. The modes are:
‘Normal’ mode: instructions A and B do not modify the PC register; neither is a JUMP, CALL, or RETURN instruction.
‘Guessed value is used’ mode: instruction B modifies the PC register but instruction A does not.
‘Illegal’ mode: instructions A and B both modify the PC register.
In the original design, the pc_value signal (the next value of the PC register) is directly connected to the ‘pc_fd’ flip-flop. The design is modified by adding the ‘pc_mux’ multiplexer before the ‘pc_fd’ flip-flop, which selects the correct predicted pc_value based on three signals ordered from high to low priority:
internal_reset;
guessed_value_used;
pc2_mode.
If the internal_reset signal is high, regardless of other multiplexor selectors, the pc will be set to zero (processor reset). If the internal_reset signal is low, then the processor is in running mode and the guessed_value_used signal will be checked. When guessed_value_used is high, it means the processor is in ‘Guessed value is used’ mode, which indicates that current instruction A has modified the PC register and, consequently, the guessed value has been used already; therefore, the next valid instruction will be in the ‘pc + 1’ memory location. Note that addition of this multiplexor increases the critical path of the processor.
It should be noted that it is illegal to have two consecutive instructions which both modify the PC register. Therefore, if the current instruction has modified the PC, the assumption is that the next one will not; therefore, incrementing the PC by one is always the correct way to advance the processor state machine. When guessed_value_used is low, it means the processor is in ‘Normal’ mode, which indicates that the current instruction does not modify the PC register.
The next step is to investigate the next instruction, which has already been fetched and decoded. The binary value ‘0b001’ for the pc2_mode signal indicates that the next instruction will not modify the PC register and, therefore, that pc_value is the next value of pc. The binary value ‘0b011’ for the pc2_mode signal indicates that the next instruction is a RETURN instruction; therefore, pc_value must be discarded and, instead, the return address fetched from the stack in advance (pc2_value) must be used as the next pc value. The binary value ‘0b110’ for pc2_mode indicates that the next instruction is a ‘CALL@(sX, sY)’ instruction and that the next value of pc must be the concatenation of the contents of the [sX, sY] registers, both of which are fetched from the register bank in advance and placed on the pc2_value signal.
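The priority selection described above can be summarized in a small behavioral sketch (signal names follow the text; encodings and widths are simplified, and this is not the actual RTL):

```python
# A behavioral sketch of the 'pc_mux' priority selection described above.
# Signal names follow the text; widths and encodings are simplified.
def next_pc(internal_reset, guessed_value_used, pc2_mode,
            pc, pc_value, pc2_value):
    if internal_reset:                 # highest priority: processor reset
        return 0
    if guessed_value_used:             # instruction A modified the PC, so
        return pc + 1                  # the next valid fetch is pc + 1
    if pc2_mode == 0b001:              # next instruction leaves PC alone
        return pc_value
    if pc2_mode == 0b011:              # RETURN: address pre-fetched from stack
        return pc2_value
    if pc2_mode == 0b110:              # CALL@(sX, sY): target pre-fetched
        return pc2_value               # from the register bank
    raise ValueError("unexpected pc2_mode")

assert next_pc(True, False, 0b001, 7, 8, 99) == 0     # reset wins
assert next_pc(False, True, 0b001, 7, 8, 99) == 8     # guessed value used
assert next_pc(False, False, 0b011, 7, 8, 0x1C) == 0x1C  # RETURN target
```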
In Figure 11, the pc2_mux and pc2_rst_mux multiplexors select the next value for the pc2 signal. If the internal_reset_delayed signal is high (processor reset), then pc2 will be set to zero; otherwise, the next value of pc2 will be either pc2_value (Normal mode) or pc2_value + 1 (not Normal mode).
The last module that needs to be discussed here is the ‘Register Bank Control’ module, which outputs the sx_addr[4] signal. As shown in Figure 11, the input signal sx_addr[4], alongside the shadow_bank and instruction2 signals, sets the output signals sx_addrB and sy_addrB. These two output signals hold the values destined for PORTB of the register bank’s memory.
To summarize the modification technique presented above, all memory blocks are converted to dual-port mode to fetch two instructions in parallel. The minimum logic in the original decoder is then replicated (‘combinational logic (A) and (B)’) to produce signals for prediction circuitry that result in removal of branch and load delays.
7.3. Stack Module Modification
In the original PicoBlaze design, the ‘Stack’ module is responsible for producing the zero_flag, carry_flag, bank, and special_bit signals alongside the stack_memory signal, as detailed in Supplementary S2. They depend on the push_stack and pop_stack input signals set by the decoding circuitry. The current value of the internal stack_pointer signal drives the ADDRA port of the BRAMs used as stack memory. The memory content that the stack_pointer points to holds the return address value.
For example, if the processor executes a RETURN instruction, the pop_stack signal is asserted, which prompts the ‘Stack’ module to decrement stack_pointer by one. This puts the memory content of “stack_pointer − 1” on the stack memory output data bus, which in turn recovers the flags, bank, and pc register values. When a CALL instruction is executed, the push_stack signal is asserted, which prompts stack_pointer to be incremented by one (“stack_pointer + 1”); the WE signal is then set high so that the current flags and pc value can be saved into stack memory.
The modification of the original design starts by enabling the dual-port option for BRAMs used as stack memory. The push_stack and pop_stack inputs must be removed as they are calculated one clock cycle late (PicoBlaze uses two clock cycles, and these two signals are used in the second clock cycle). These two input signals are replaced by push_stack2 and pop_stack2 signals, which are generated by prediction circuitry in advance. They detect whether the current instruction is a RETURN or a CALL, which prompts a pop from stack memory or a push into stack memory, respectively.
The next step is the removal of all LUT, MUX, XOR, and FD primitives and the redesign of the ‘Stack’ module to accommodate the prediction circuitry, as shown in Figure 12. The stack_pointer signal is connected to the ADDRA port, and a series of multiplexors decide whether the pointer must be incremented or decremented based on the values of push_stack2 and pop_stack2. The ADDRB port always points to the location “stack_pointer − 1”.
This makes the contents of the memory locations at stack_pointer and “stack_pointer − 1” available on every clock cycle through the stack_memory1 (memory content on ADDRA) and stack_memory2 (memory content on ADDRB) signals. Two multiplexers, with pop_stack2 as their selector, decide the final value of the flag, bank, and stack_memory signals. Note that the WE pins of both BRAMs are permanently pulled up (connected to Vcc), which forces a write on every clock cycle to the memory location addressed by ADDRA.
With the addition of the circuitry described above, the processor constantly writes the status of the flags and the current PC register value into stack memory on every clock cycle (a constant push). At the same time, it constantly reads the two locations addressed by stack_pointer and “stack_pointer − 1”. The prediction circuit either drives the processor to pop the stack (decrement the stack pointer by one and use the output of PORTB to recover the PC register value and flags, triggered by pop_stack2) or lets it continue normal operation (the stack pointer is left intact and the output of PORTA is used). In the case of a push, the processor only needs to increment the pointer by one (triggered by push_stack2), as the write to stack memory already happens on every clock cycle regardless of whether push_stack2 is asserted.
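The constant-push scheme can be summarized in a cycle-level software model. The sketch below is a hypothetical Python rendering (not the actual HDL) with an assumed stack depth; WE is modeled as permanently high, and push_stack2/pop_stack2 only steer the pointer and select the output port.

```python
# Hypothetical cycle-level sketch of the dual-port scheme: pc/flags are
# written at ADDRA on every cycle (WE tied high); ADDRB always reads
# "stack_pointer - 1". DEPTH is an assumed value for this sketch.

DEPTH = 32

class DualPortStack:
    def __init__(self):
        self.mem = [(0, 0)] * DEPTH
        self.stack_pointer = 0

    def cycle(self, pc, flags, push_stack2=False, pop_stack2=False):
        # Continuous write: current pc/flags stored at ADDRA every clock cycle
        self.mem[self.stack_pointer] = (pc, flags)
        porta = self.mem[self.stack_pointer]                # stack_memory1
        portb = self.mem[(self.stack_pointer - 1) % DEPTH]  # stack_memory2
        if pop_stack2:
            # RETURN predicted: decrement pointer, recover pc/flags from PORTB
            self.stack_pointer = (self.stack_pointer - 1) % DEPTH
            return portb
        if push_stack2:
            # CALL predicted: only increment the pointer; the write is free
            self.stack_pointer = (self.stack_pointer + 1) % DEPTH
        return porta
```

Note that a push costs nothing beyond the pointer increment, since the write at ADDRA happens unconditionally on every cycle.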
8. Resource and Power Utilization
Table 3 compares the resource utilization of our proposed DAP-Zipi8 with CPI = 1 with that of Zipi8 with CPI = 2 and the original PicoBlaze. Referring to
Table 3, the maximum clock frequency obtained on the Xilinx Zynq UltraScale+ MPSoC ZCU104 Evaluation Kit was 369.041 MHz, which is attributed to the original Xilinx PicoBlaze.
The conversion of the firm-core PicoBlaze to a soft core is essential if we want to modify the design. The converted soft core, named Zipi8, achieves a maximum frequency of 357.509 MHz (a 2.86% decrease) with an increase in LUT count from 122 to 157 (a 28.69% increase). This is the cost (an increase in logic area) that must be paid to make the register transfer level (RTL) HDL source code of PicoBlaze available for modification.
The dual-fetch technique, explained previously in conjunction with dual-port memory, together with the addition of the dual-address-bus prediction circuitry, yields a new processor named DAP-Zipi8, which has a LUT count of 305 (a 94.27% increase compared to Zipi8) and a maximum frequency of 224.022 MHz on the Xilinx ZCU104 development board. Note that the removal of the flip-flops between the decoder and execution stages reduces the total register count from 74 to 49 (see
Table 3).
Although the DAP-Zipi8 critical path has lengthened (which lowers the achievable maximum clock frequency), CPI is reduced from two to one (a 50% decrease). Considering processor performance in terms of million instructions per second (MIPS), the calculation of MIPS for Zipi8 (CPI = 2) is shown in Equation (3).
For DAP-Zipi8, the calculation is shown in Equation (4).
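The figures behind both equations follow the standard definition of MIPS; expressed in our own notation (which may differ cosmetically from Equations (3) and (4)), with $f_{\max}$ the maximum clock frequency:

```latex
\mathrm{MIPS} = \frac{f_{\max}}{\mathrm{CPI} \times 10^{6}},
\qquad
\mathrm{MIPS}_{\mathrm{Zipi8}} = \frac{357.509 \times 10^{6}}{2 \times 10^{6}} \approx 178.75,
\qquad
\mathrm{MIPS}_{\mathrm{DAP\text{-}Zipi8}} = \frac{224.022 \times 10^{6}}{1 \times 10^{6}} \approx 224.02
```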
MIPS is not an accurate metric for comparing the performance of processors with different ISAs. For example, comparing Intel versus ARM versus PicoBlaze using the MIPS metric is incorrect. In those cases, benchmarks from Dhrystone, the Embedded Microprocessor Benchmark Consortium (EEMBC), or the Standard Performance Evaluation Corporation (SPEC) should be used, as instructions in different ISAs might perform different amounts of work. For example, one ISA might have an instruction that performs a simple addition, while another might have special digital signal processing (DSP) instructions that perform addition and multiplication as one combined instruction. In this paper, the modified PicoBlaze (DAP-Zipi8) and PicoBlaze itself share the same ISA and are compared against each other. This justifies the use of the MIPS metric as a meaningful performance indicator [
17].
As shown in
Figure 13, a 25.31% performance improvement for the DAP-Zipi8 processor (an increase from 178.75 to 224 MIPS) is achieved without considering the impact of the mandatory NOP instructions (inserted to avoid the invalid case of the design).
Figure 14 shows four programs that were executed on both PicoBlaze and DAP-Zipi8 cores and their measured execution times (left side
y-axis). On the
x-axis, the ‘state_machine’ label refers to a complex finite state machine (FSM) that is used to control 1024 PicoBlaze cores on a single FPGA chip (XCZU7EV-2FFVC1156). The FSM is written by the authors and follows the work presented in [
75]. It is an RTES application that uses a myriad of cores for adaptive routing and serving of swarms of incoming network traffic (miniature web servers running on PicoBlaze instances). Utilizing a PicoBlaze core as an FSM controller is common, and the details of this implementation, though outside the scope of this article, will be published in another paper. Note that this state machine is the main application of our proposed architecture, and it is not executable on non-deterministic processors (processors with pipelines or caches), as it expects programs to execute in an exact, predetermined number of clock cycles. It is noteworthy that the worst-case response time to external interrupts in this state machine is just three clock cycles with DAP-Zipi8 versus five cycles with the original PicoBlaze.
To avoid the invalid case (‘Illegal’ mode as defined in
Section 7.2) where two consecutive conditional jump instructions are present, a NOP instruction is inserted between them. In the final pass of compilation, a program scans the opcodes and inserts a NOP instruction whenever two consecutive branch instructions are discovered.
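This compiler pass can be sketched in a few lines. The mnemonics and the branch set below are illustrative assumptions; the real pass operates on PicoBlaze opcodes in the final compilation stage.

```python
# Minimal sketch of the NOP-insertion pass: walk the program in order and
# insert a NOP between any two consecutive branch instructions. The BRANCHES
# set is an assumed, illustrative subset of branching mnemonics.

BRANCHES = {"JUMP", "CALL", "RETURN"}

def insert_nops(program):
    """Insert a NOP between any two consecutive branch instructions."""
    out = []
    for instr in program:
        mnemonic = instr.split()[0]
        if out and out[-1].split()[0] in BRANCHES and mnemonic in BRANCHES:
            out.append("NOP")  # break up the illegal back-to-back branch pair
        out.append(instr)
    return out
```

The pass only grows the program where branch pairs occur, which is why the measured instruction-count increase stays small (around 1–2%).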
In
Figure 14, the negative impact of added NOP instructions on performance for all algorithms is shown. The algorithms used in the benchmark are listed below:
PicoTETRIS [
76]: a Tetris game written in PicoBlaze assembly language.
An IEEE-754 64-bit floating point arithmetic library [
77].
A matrix multiplication algorithm written in PicoBlaze assembly language.
A state machine controller implementing the complex FSM used to control 1024 PicoBlaze cores on an FPGA in real time.
The added NOPs increase the total instruction count per algorithm (right-side y-axis in Figure 14, labelled “Instruction Count Increase after Added NOPs”). The percentage increase is shown with the black bars along the right-side y-axis. Although the NOP instructions increased the instruction count by only around 1% (2.42% in the case of PicoTETRIS), their impact was significant: they reduced the performance gain from the initially obtained 25.31% down to 18.28~19.49%, as reported in Figure 14.
Power consumption was measured using the Xilinx Vivado v2021.1 ‘Power Report’ facility for all three cores at both a 100 MHz clock frequency and the maximum achievable clock frequency. The total FPGA on-chip power consumption for all cores was 722~737 mW, divided into static and dynamic power. Static power was fixed at 615 mW and is FPGA device dependent. Total dynamic power consumption was 107~122 mW. A large portion of the dynamic power was consumed by the block RAM, clock generation circuitry, and other support modules. The portion of dynamic power used by the cores themselves is reported in Table 4, which shows a 42.86% power increase (against a 25.31% performance gain) for DAP-Zipi8 when running the cores at their maximum frequency.