A superscalar CPU architecture implements a form of parallelism on a single chip, allowing the processor to execute more than one instruction per clock cycle and therefore to run faster than it otherwise could at a given clock speed. The term is a modification of scalar, which describes processors that execute one instruction per clock cycle; these were themselves a step up from earlier processors that took a variable number of cycles to complete any given operation.

In a superscalar CPU, several functional units of the same type are included, along with additional circuitry to dispatch instructions to the units. For instance, most superscalar designs include more than one integer unit (typically referred to as an arithmetic logic unit, or ALU). The dispatcher reads instructions from memory, decides which ones can be run in parallel, and dispatches them to the available units.

Performance of the dispatcher is key to the overall performance of a superscalar design, and its task is not a simple one. The instructions a = b + c; d = e + f can be run in parallel because neither result depends on the other calculation. However, the instructions a = b + c; d = a + f cannot simply be issued together: the second needs the value of a produced by the first, so whether the two can overlap at all depends on when that result becomes available as the instructions move through the units.
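
The dependency test described above can be sketched in a few lines of Python. This is only an illustration, not how any particular dispatcher is built: the Instr record and the can_issue_together function are hypothetical names, and the check is simplified to two cases, the second instruction reading the first one's result and both instructions writing the same location.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one destination, several sources."""
    dest: str
    srcs: tuple

def can_issue_together(first: Instr, second: Instr) -> bool:
    """Return True if `second` may execute in the same cycle as `first`.

    Simplified model (an assumption of this sketch): only two hazards
    are checked.
    """
    reads_result = first.dest in second.srcs   # second needs first's result
    same_target = first.dest == second.dest    # both write the same location
    return not (reads_result or same_target)

# a = b + c; d = e + f  (independent)
independent = can_issue_together(Instr("a", ("b", "c")), Instr("d", ("e", "f")))
# a = b + c; d = a + f  (second reads a)
dependent = can_issue_together(Instr("a", ("b", "c")), Instr("d", ("a", "f")))
print(independent, dependent)  # True False
```

Real dispatchers perform this comparison in hardware for every pair of candidate instructions, which is one reason the circuitry grows quickly as more units are added.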

Much of modern CPU design is dedicated to increasing the accuracy of the dispatcher system and allowing it to keep the multiple units in use at all times. This has become increasingly important as the number of units has increased. While early superscalar CPUs would have two ALUs and a single floating point unit (FPU), a modern design like the PowerPC 970 includes four ALUs, two FPUs and two SIMD units. If the dispatcher is ineffective in keeping all of these units fed with instructions, the performance of the system as a whole will suffer greatly.
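
The cost of idle units can be illustrated with a toy model. The sketch below assumes a deliberately simple machine that does not correspond to any real CPU: every instruction takes one cycle, instructions are issued strictly in program order, and up to width of them may issue per cycle until one depends on a result produced in that same cycle. The function name simulate_in_order and the (dest, srcs) tuple format are hypothetical.

```python
def simulate_in_order(program, width):
    """Count cycles and unit utilization for a toy in-order machine.

    program: list of (dest, srcs) tuples in program order.
    width:   number of identical units (instructions issued per cycle).
    Assumption of this sketch: every instruction takes exactly one cycle.
    """
    cycles = 0
    i = 0
    while i < len(program):
        written = set()   # results being produced in the current cycle
        issued = 0
        while i < len(program) and issued < width:
            dest, srcs = program[i]
            if written & set(srcs) or dest in written:
                break     # depends on this cycle's results: wait a cycle
            written.add(dest)
            issued += 1
            i += 1
        cycles += 1
    utilization = len(program) / (cycles * width) if cycles else 0.0
    return cycles, utilization

independent = [("a", ("b", "c")), ("d", ("e", "f")),
               ("g", ("h", "i")), ("j", ("k", "l"))]
chain = [("a", ("b", "c")), ("d", ("a", "f")),
         ("g", ("d", "i")), ("j", ("g", "l"))]

print(simulate_in_order(independent, 2))  # (2, 1.0)
print(simulate_in_order(chain, 2))        # (4, 0.5)
```

With two units, the four independent instructions finish in two cycles, while the dependent chain takes four cycles and leaves half of the unit slots empty.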

Superscalar systems were originally implemented on RISC CPUs. This was because the RISC approach results in a simple core, leaving room for several functional units to be built onto a single CPU. This was a major reason that RISC designs were faster than CISC designs through the 1980s and into the 1990s, but as chip manufacturing processes improved, even "complex" designs like the IA-32 were able to go superscalar.

Dramatic improvements in the quality of the control unit now appear unlikely, limiting future improvements in the speed of the basic superscalar design. One potential solution to this problem is to move the dispatcher logic out of the chip and into the compiler, which can spend considerably more time and effort on making the best decisions possible. This is the basic premise of very long instruction word (VLIW) CPU designs, an approach also known as static superscalar or compile-time scheduling.
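
As a rough illustration of compile-time scheduling, the sketch below packs instructions into fixed-width bundles ahead of time, the way a VLIW compiler might in principle. It is a greedy toy scheduler, not the algorithm of any real compiler. It assumes that every instruction takes one cycle, that each variable is written at most once, and that no variable is read before the instruction that writes it. The name pack_bundles and the (dest, srcs) tuple format are hypothetical.

```python
def pack_bundles(program, width):
    """Greedily pack instructions into bundles of at most `width` slots.

    program: list of (dest, srcs) tuples in program order.
    Returns a list of bundles, each a list of instruction indices.
    Assumptions of this sketch: one-cycle instructions, each variable
    written at most once, and no variable read before it is written.
    """
    bundles = []
    ready_at = {}  # variable -> index of first bundle allowed to read it
    for idx, (dest, srcs) in enumerate(program):
        # An instruction must come after the bundles producing its inputs.
        slot = max((ready_at.get(s, 0) for s in srcs), default=0)
        # Skip bundles that are already full.
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(idx)
        ready_at[dest] = slot + 1
    return bundles

program = [("a", ("b", "c")),   # 0: a = b + c
           ("d", ("a", "f")),   # 1: d = a + f
           ("g", ("h", "i")),   # 2: g = h + i
           ("j", ("g", "k"))]   # 3: j = g + k

print(pack_bundles(program, 2))  # [[0, 2], [1, 3]]
```

Because the scheduler sees the whole program at once, it can move instruction 2 ahead of instruction 1 and fill both slots of each bundle, a reordering that a simple hardware dispatcher looking only at neighbouring instructions would miss.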

See also: Super-threading, Simultaneous multithreading