Performance-aware energy-efficient data cache accesses

With the rapid growth of Internet of Things (IoT) devices, the need for energy-efficient computing systems is more important than ever. Many of these systems are battery operated and are often deployed in places where recharging or replacing the batteries is difficult. This is why researchers and the semiconductor industry are investing significant resources in energy efficiency, developing embedded systems that consume as little power as possible while still increasing performance.

In the early 1990s, the growing gap between memory speed and processor speed, known as the memory gap, began to be felt. This led to the introduction of cache memories: small amounts of fast but expensive memory used to increase the performance of load operations by keeping frequently needed data closer to the processor than the main memory. However, because cache memories, and the level-one data cache (L1 DC) in particular, are optimized for performance rather than energy consumption, the energy consumed by cache memories can account for a significant share of the total energy consumed by microprocessor-based architectures. Techniques such as Speculative Halt-Tag Access (SHA) and Early Load Data Dependence Detection (ELD3), based on way-halting and sequential loads, respectively, can be used to reduce the energy dissipation without sacrificing the performance of the L1 DC.

Cache Memories

To provide the Central Processing Unit (CPU) with the necessary data as quickly as possible, the most frequently used data is stored in caches placed close to the CPU. Figure 1a shows a typical memory hierarchy in today's computers. The L1 cache is usually placed on-chip, making it possible to exploit locality by keeping data likely to be reused as close as possible to the CPU. If there is a cache miss in the L1 DC, the request proceeds to the L2 cache, which is often larger than the L1 DC and thus has a higher latency. With each cache miss, the search proceeds to the next memory level until the requested data is found.

To reduce the search time for data requests, caches often have a restricted placement policy, known as cache associativity. Cache hits are then detected through an associative search of all tags in a set, instead of searching through the entire cache. Conventional L1 DCs are often set-associative caches with low associativity, where the latency of load operations is optimized by accessing all ways with the same index in parallel, as shown in Figure 1b. However, this results in a significant amount of wasted energy, as only the data from one way is used. To reduce the energy consumption, numerous cache architectures, such as way-prediction, way-shutdown, and highly-associative caches, have been proposed. However, these optimization techniques lead to increased latency and complexity, which makes them unattractive for L1 DCs.
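To make the energy cost concrete, the following minimal C sketch models a conventional parallel lookup in a set-associative cache. All sizes, field widths, and names (NUM_WAYS, parallel_lookup, and so on) are illustrative assumptions, not taken from the original work; the loop stands in for hardware that activates every tag and data array of the indexed set at once, even though at most one way can hit.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_WAYS    4         /* associativity (illustrative)  */
    #define NUM_SETS    128       /* sets per way (illustrative)   */
    #define OFFSET_BITS 5         /* 32-byte cache lines           */
    #define INDEX_BITS  7         /* log2(NUM_SETS)                */

    typedef struct {
        bool     valid;
        uint32_t tag;
        uint8_t  data[1 << OFFSET_BITS];
    } cache_line_t;

    static cache_line_t way[NUM_WAYS][NUM_SETS];

    /* Conventional parallel lookup: every way of the indexed set is
     * read, and all tags are compared at once. In hardware this means
     * NUM_WAYS tag arrays and NUM_WAYS data arrays are activated per
     * load, even though only one way's data is ever used. */
    const uint8_t *parallel_lookup(uint32_t addr)
    {
        uint32_t index = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);

        for (int w = 0; w < NUM_WAYS; w++)      /* parallel in hardware */
            if (way[w][index].valid && way[w][index].tag == tag)
                return way[w][index].data;      /* hit in way w */
        return NULL;                            /* miss: go to L2 */
    }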

Speculative Halt-Tag Access (SHA)

Practical way-halting by speculatively accessing halt tags is a cache architecture that can reduce the energy dissipation without increasing latency or complexity. This is accomplished by halting cache ways that cannot possibly contain the requested data, thus avoiding unnecessary accesses to all ways with the same index. The technique is based on the observation that the displacement address is often small and usually only changes the offset of the relative memory address, see Figure 2. This makes it possible to read the halt tags, the low-order bits of the tag, using the base address, in parallel with the memory address calculation in the address generation stage. Since both the base address and the displacement address are available in the address generation stage, the tag and index bits of the base address and the effective address can be compared to determine whether the displacement is small, before the data access stage.

 

Figure 2: Displacement check
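The displacement check itself can be captured in a few lines of C. The sketch below is illustrative (the function name sha_applicable and the offset width are assumptions): it simply verifies that adding the displacement leaves every bit above the line offset, i.e. the tag and index bits, unchanged.

    #include <stdint.h>
    #include <stdbool.h>

    #define OFFSET_BITS 5   /* illustrative line size of 32 bytes */

    /* Displacement check (sketch): SHA may only be applied when adding
     * the displacement leaves the tag and index bits of the base
     * address unchanged, i.e. the sum only alters the line offset. */
    bool sha_applicable(uint32_t base, int32_t disp)
    {
        uint32_t effective = base + (uint32_t)disp;
        /* If everything above the offset field matches, the base
         * address selects the same set (and halt tags) as the
         * effective address, so the halt tags can be read early. */
        return (base >> OFFSET_BITS) == (effective >> OFFSET_BITS);
    }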

When the displacement check succeeds, the halt tags can be accessed from the halt-tag cache, such that a halt-tag check is performed before each L1 DC load operation. The halt-tag bits of the base address are compared with the halt-tag bits of each cache way accessed. If the halt tag from the base address matches the halt tag of a way, the bit corresponding to that way in the response vector is set to indicate a halt-tag match, see Figure 3. Only the ways whose bits are set are then accessed in the next pipeline stage.

 

Figure 3: SHA implemented in a three-stage pipeline
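The halt-tag comparison amounts to building a one-bit-per-way enable vector. The C sketch below is a minimal illustration under assumed names and widths (the actual halt-tag width is a design parameter of the cache):

    #include <stdint.h>

    #define NUM_WAYS      4   /* illustrative associativity          */
    #define HALT_TAG_BITS 4   /* low-order tag bits used as halt tag */

    /* One halt tag per way of the indexed set, read from the small
     * halt-tag cache in the address generation stage (sketch). */
    uint8_t halt_tag_check(const uint8_t halt_tags[NUM_WAYS],
                           uint8_t base_halt_tag)
    {
        uint8_t mask = (1u << HALT_TAG_BITS) - 1;
        uint8_t response = 0;   /* one enable bit per way */
        for (int w = 0; w < NUM_WAYS; w++)
            if (halt_tags[w] == (base_halt_tag & mask))
                response |= 1u << w;   /* way w may hold the data */
        return response;   /* ways with a cleared bit are halted */
    }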

The technique has no performance penalty and adds very little complexity to a conventional processor core design. Although the displacement address is often small, the SHA technique cannot be used for displacement addresses that change the tag or index bits during the address calculation. When the displacement address is too large for the SHA technique, Early Load Data Dependence Detection (ELD3) can be used to reduce the energy dissipation.

Early Load Data Dependence Detection (ELD3)

Early Load Data Dependence Detection (ELD3) is an approach that detects whether a load operation has a data dependency with a following instruction that would cause a pipeline stall. If there is no data dependency between the load instruction and the following instructions, the load operation is performed sequentially: all tag ways are accessed first, and only the single data way in which the data resides is accessed in the next clock cycle. However, if there is a data dependency, the data access is performed in parallel, where both the tag and data ways are accessed in the tag/data access stage, as shown in Figure 4.

 

Figure 4: ELD3 implemented in a four-stage pipeline.
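The choice between the two access modes can be sketched in C as follows; the types and function name are illustrative assumptions, and the loop models a tag comparison that happens in parallel in hardware:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_WAYS 4   /* illustrative associativity */

    typedef struct {
        bool     valid;
        uint32_t tag;
    } tag_entry_t;

    /* ELD3 access-mode selection (sketch). A set dependency bit forces
     * parallel mode: all tag AND data ways are read in one cycle so the
     * loaded value can be forwarded without a stall. A cleared bit
     * allows sequential mode: only the tag ways are read first, and in
     * the next cycle just the single data way that hit is activated,
     * saving the energy of the other NUM_WAYS - 1 data arrays. */
    int eld3_select_data_way(const tag_entry_t set_tags[NUM_WAYS],
                             uint32_t tag, bool dependency_bit,
                             bool *parallel_access)
    {
        *parallel_access = dependency_bit;
        for (int w = 0; w < NUM_WAYS; w++)      /* parallel in hardware */
            if (set_tags[w].valid && set_tags[w].tag == tag)
                return w;  /* in sequential mode, only this data way is read */
        return -1;         /* miss: no data way needs to be accessed */
    }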

To decide whether the data ways should be accessed sequentially or in parallel, information about the data dependency between the load instruction and the following instructions must be available at the time of the load operation. In a conventional in-order pipeline processor, this information must be available before the end of the address generation stage. It is straightforward to check for a data dependency by comparing the destination register of the load instruction with the source registers of the instruction that immediately follows it. However, it is not directly possible to check for a data dependency between the load instruction in the address generation stage and the second and third upcoming instructions, which is required by the ELD3 technique.

Therefore, a Data Dependency Bit (DDB) memory is implemented in the address generation stage that holds the dependency status for each instruction in the level-one instruction cache (L1 IC). When a load instruction is detected after instruction fetch, the data dependency bit for the corresponding instruction is read from the DDB memory. Figure 5 illustrates the DDB memory for a two-way L1 IC. The dependency bit remains correct as long as the cache line is not evicted from the L1 IC. Should a cache line be evicted from the L1 IC, the load operation will still execute correctly, at the expense of an additional stall cycle. Moreover, the dependency bit for the load instruction is updated during writeback, so that the bit is correct the next time the load instruction is executed.

 

Figure 5: DDB memory for a two-way L1 IC
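A minimal C sketch of such a DDB memory follows, assuming an illustrative geometry (IC_SETS, INSNS_PER_LINE) and hypothetical helper names; the real structure mirrors the L1 IC organization exactly:

    #include <stdbool.h>

    #define IC_WAYS        2     /* two-way L1 IC, as in Figure 5 */
    #define IC_SETS        256   /* illustrative                  */
    #define INSNS_PER_LINE 8     /* illustrative line size        */

    /* One dependency bit per instruction slot in the L1 IC (sketch).
     * Indexed exactly like the instruction cache itself, so a fetched
     * load can look up its bit without extra address computation. */
    static bool ddb[IC_WAYS][IC_SETS][INSNS_PER_LINE];

    bool ddb_lookup(int way, int set, int slot)
    {
        return ddb[way][set][slot];
    }

    /* A stale bit after an eviction only costs one extra stall cycle,
     * never correctness; the bit is rewritten during writeback once
     * the true dependency of the load is known. */
    void ddb_update(int way, int set, int slot, bool dependent)
    {
        ddb[way][set][slot] = dependent;
    }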

SHA+ELD3

By combining SHA and ELD3, the ELD3 technique can be used when the displacement address is too large for the SHA technique. A load operation is then performed as follows (a sketch of the selection logic follows the list):

1: When the displacement is small, the halt tags are accessed but the DDB memory is not. The tag and data ways are accessed in parallel, but SHA halts both tag and data ways using the response vector from the halt-tag access.

2: When the displacement is too large for SHA, the halt tags are not accessed, but the DDB memory is. The outcome of the DDB lookup decides the next step:

2a: If the DDB memory returns a cleared dependency bit, all tag ways are accessed in parallel, but the data way is accessed sequentially.

2b: If the DDB memory returns a set dependency bit, the tag and data ways are accessed in parallel, such that the data can be forwarded to the following instruction, avoiding a stall cycle.
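The combined policy selection can be summarized in a short C sketch. The enum and function names are illustrative assumptions, and the displacement check reuses the comparison from the SHA section:

    #include <stdint.h>
    #include <stdbool.h>

    #define OFFSET_BITS 5   /* illustrative line offset width */

    typedef enum {
        SHA_HALT_WAYS,      /* case 1: small displacement, use halt tags */
        ELD3_SEQUENTIAL,    /* case 2a: dependency bit cleared           */
        ELD3_PARALLEL       /* case 2b: dependency bit set               */
    } load_policy_t;

    /* SHA+ELD3 policy selection (sketch following the steps above):
     * SHA handles small displacements; otherwise the DDB bit chooses
     * between sequential and parallel access. */
    load_policy_t select_policy(uint32_t base, int32_t disp, bool ddb_bit)
    {
        uint32_t effective = base + (uint32_t)disp;
        bool small_disp =
            (base >> OFFSET_BITS) == (effective >> OFFSET_BITS);

        if (small_disp)
            return SHA_HALT_WAYS;
        return ddb_bit ? ELD3_PARALLEL : ELD3_SEQUENTIAL;
    }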

Results

The effectiveness of the SHA and ELD3 implementations was evaluated by running MiBench benchmark applications on a four-stage RISC-V RV32I 32-bit processor, implemented in the Single-ISA Heterogeneous MAny-core Computer (SHMAC) framework. SHMAC is a research project initiated by the Energy Efficient Computing Systems (EECS) department at the Norwegian University of Science and Technology (NTNU) that uses a tile-based architecture. The MiBench applications were compiled using the RISC-V GCC toolchain and analyzed using SHMACsim, a cycle-accurate simulator for the SHMAC framework.

Figure 6: Way distribution between SHA and ELD3

Figure 6 shows the average number of way accesses when the SHA and ELD3 techniques are used in combination, relative to a conventional cache implementation. When the displacement is small and the SHA technique is used, only one tag and one data way are accessed for most load instructions, shown as S:1. In addition, when there is a cache miss, zero tag and data ways are accessed. As S:0 shows, load instructions that result in cache misses occur quite frequently. When the displacement is too large for SHA and there is no data dependency, the number of data ways accessed is reduced significantly by accessing the data ways sequentially using ELD3, shown as E:S.

Conclusion

Improving the energy efficiency of computing is an important area of research, and there is significant potential for reducing the energy dissipation of caches. This article shows that, using practical way-halting and data dependency detection, it is possible to reduce the energy dissipation of the L1 DC without reducing performance.