Introduction: Functional Description

Overview:

This valiant crew from WRC decided to investigate a hardware-based instruction scheduler. We began by researching a method to optimize code written for the ARS extended instruction set (ARSe ©). This instruction set includes loads and stores, unops (unary operations, which use two registers), binops (binary operations, which use up to three registers), and jump sources and destinations. Loads and stores have a latency of more than one clock cycle, which causes a stall when the instruction that follows depends on the completion of the load or store. Independent instructions can be moved into these latency slots to prevent stalls, which makes more efficient use of clock cycles. The intended purpose of our processor is to reorder instructions in this manner before a CPU executes them.
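To make the idea of filling latency slots concrete, here is a minimal software sketch in Python. It is not the hardware algorithm itself. Several details are assumptions made only for illustration: the two-cycle load latency, the register names, the Instr tuple encoding, and the greedy "pick whatever can start soonest" heuristic. As a further simplification, only loads produce a register result that can stall a later instruction; stores are kept in order relative to other memory operations.

```python
from collections import namedtuple

# Hypothetical encoding of an instruction: opcode, destination register
# (None for stores), and a tuple of source registers.
Instr = namedtuple("Instr", "op dest srcs")

# Assumed latencies in cycles; anything not listed takes one cycle.
LATENCY = {"load": 2}
MEMORY_OPS = ("load", "store")


def latency(ins):
    return LATENCY.get(ins.op, 1)


def depends(later, earlier):
    """True if `later` must stay after `earlier`."""
    raw = earlier.dest is not None and earlier.dest in later.srcs
    war = later.dest is not None and later.dest in earlier.srcs
    waw = later.dest is not None and later.dest == earlier.dest
    # Conservative memory ordering: never move a memory op across a store.
    mem = (later.op in MEMORY_OPS and earlier.op in MEMORY_OPS
           and "store" in (later.op, earlier.op))
    return raw or war or waw or mem


def count_stalls(instrs):
    """Simulate in-order, single-issue execution and count stall cycles."""
    ready = {}   # register -> first cycle its value is available
    cycle = 0    # earliest cycle the next instruction may issue
    stalls = 0
    for ins in instrs:
        start = max([cycle] + [ready.get(r, 0) for r in ins.srcs])
        stalls += start - cycle
        if ins.dest is not None:
            ready[ins.dest] = start + latency(ins)
        cycle = start + 1
    return stalls


def schedule(instrs):
    """Greedy reordering of one block that respects all dependencies."""
    n = len(instrs)
    preds = [{j for j in range(i) if depends(instrs[i], instrs[j])}
             for i in range(n)]
    done, order, ready, cycle = set(), [], {}, 0

    def start_time(i):
        return max([cycle] + [ready.get(r, 0) for r in instrs[i].srcs])

    while len(order) < n:
        candidates = [i for i in range(n)
                      if i not in done and preds[i] <= done]
        # Prefer the instruction that can issue soonest; break ties by
        # original program order.
        best = min(candidates, key=lambda i: (start_time(i), i))
        start = start_time(best)
        if instrs[best].dest is not None:
            ready[instrs[best].dest] = start + latency(instrs[best])
        cycle = start + 1
        done.add(best)
        order.append(best)
    return [instrs[i] for i in order]


block = [
    Instr("load", "r1", ("r10",)),      # r1 <- mem[r10]
    Instr("add",  "r2", ("r1", "r3")),  # needs r1 right away: stall
    Instr("load", "r4", ("r11",)),      # independent of the first pair
    Instr("add",  "r5", ("r4", "r6")),  # needs r4 right away: stall
]
reordered = schedule(block)
```

Under these assumptions the original order stalls twice, once after each load. The sketch moves the second load into the slot behind the first, so each add finds its operand ready and the reordered block runs without stalls.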

Because our instruction scheduler operates on small blocks of code, it does not belong directly in the instruction stream going to the processor. Instead, we would place the ARSe instruction scheduler between main memory and the processor's instruction cache, as illustrated:

This configuration is similar to the way modern processors such as Intel’s Willamette and Transmeta’s Crusoe use a “trace cache,” which holds decoded (or translated) copies of instructions for the processor to retrieve later. Like these trace caches, our chip is not involved in interpreting the instructions themselves. As a proof of concept, we are using a simple instruction set that still allows many of the interesting cases for instruction scheduling.