Coding for Reuse Course - Module 3 Lesson 5
Hardware Multiplication

In this lesson we will study how to perform multiplication, and different hardware multiplication algorithms

License Terms  

The source and/or object code (subsequently referred to as code) presented in this and subsequent lessons is the property of and owner who retains copyright protection. This code in whole or in part is licensed only for single copy per user, non-commercial use; except in the case of educational institutions, where a license of single copy per student, only as part of course curricula, is permitted.

Donald Gerard Polak (owner)


So how do we multiply? Think back to elementary school. Suppose we want to multiply 29 by 27.


So what is happening here. Let's take a closer look.

   multiplicand --> 29 
   multiplier    -->  27 
cross product --> 63 
      product -->  783  

Each cross product is the product of one symbol from the multiplicand by one symbol from the multiplier, shifted left by the sum of the symbol positions from the multiplicand and the multiplier. Symbol positions are numbered from the right, starting at zero. The product is the sum of the cross products.

The maximum width of each cross product is the sum of the symbol width from the multiplicand plus the symbol width of the multiplier. The maximum width of the product is the width of the multiplicand plus the width of the multiplier. These distinctions may not be readily apparent in base 10 math, but are very important in binary math.

Using distribution theorem, we can see that the cross products do not have to be stored. Instead, if the product is initially set to zero, each cross product is simply added to the product.

Binary Multiplication  

In its simplest form, binary multiplication simply consists of shift and add operations. The LSb of the multiplier is first checked. If is one, the multiplicand is added to the product, which was initially set to zero. the multiplicand is then shifted left one place, and the multiplier is shifted right one place. This operation is repeated once for each bit of the multiplier. In the case of our S8085D processor, this means 8 x1 clock cycles or 4 sysclk cycles. This is well within our constraints.

A Basic Multiplier

Horner's Method  

Although the basic multiplier is almost optimal in its use of resources, we can actually make it a little smaller by using Horner's method of polynomial evaluation. The basic multiplier uses a sixteen-bit shift register for the multiplicand, plus a sixteen-bit product register. Horner's method reduces the sixteen-bit multiplicand register to an eight-bit static register. In Horner's method, the product is first cleared, the multiplicand is moved to an eight-bit static register, and the multiplier moved to an eight-bit left shift register. The MSb of the multiplier is first tested. If it is one. the multiplicand is added to the product. The product and the multiplier are then shifted left and the process repeated until all bits of the multiplier are used. Horner's method still consumes the same amount of clock cycles, but uses fewer resources. We will use Horner's method for our multiplier.

Booth Multiplier  

The Booth multiplier is popular for signed multiplications, and is frequently used for internal multipliers in FPGAs that feature multipliers. It relies on bit pairs from the multiplier to decide whether to add or subtract the multiplicand from the product. It is rather logic intensive, and there are several good articles published on this technique. Since the S8085D multiplies are unsigned, we will not discuss it in detail here.

Faster Multipliers  

Our discussion, so far, has focused on 1x16 symbol widths. That is, we are using one bit from the multiplier, and 16 bits from either the multiplicand or product. Note here, the symbol widths may be asymmetric. That is, the symbol width of the multiplicand and the symbol width of the multiplier do not have to match.

In order to multiply faster, we need to increase our symbol widths. For example, an 8x8 multiplier could complete in a single clock cycle. However, the impact on logic utilization is exponential in nature. An 8x8 multiplier could consume over 65,000 logic cells. This logic utilization could be diminished somewhat, by using parallel adders, but this would also have an impact on maximum clock speed due to ripple carry delays.

A 4x4 multiplier could complete the multiply in 4 clk_x1 clock cycles. This would require 16 UDPs, plus 8 first level enhanced multiplexers, and one OR gate for each of the 8 bits of the product. In short 200 logic gates. The control and shift logic would add a handful of gates. The LSb of the product can be shared. but this implementation would still come in at approximately 150 logic cells.

Using asymmetric symbol widths, we can obtain the same execution time by using an 8x2 multiplier. This requires an additional 16-bit adder to form the cross product. This resembles a Booth multiplier without using the sign. The issue here is that it will have an impact on the maximum clock frequency of the entire device, due to the additional adder delays, unless it is mitigated.


The 8x2 multiplier impact on overall performance can easily be mitigated by registering the cross product after the initial add, and then using the registered value to add to the product. This process is known as pipelining. In pipelining, one microstate output is first registered, and then passed on to the next microstate on the next clock cycle. In our example, we use a pipeline width of one. In other words, we add one clk_x1 cycle to our instruction. A multiply instruction would still consume only six sysclk cycles. but we will still add one sixteen-bit adder, one 10-bit register, and some additional control logic.

Coding and Simulation  

Coding our multiplier required modification of the source select logic, the microcode state machine, the microcode state expander, the flag enables, the alu top level, and the flag logic; as well as coding the multiplier itself. we also discovered an error during initial simulation, and using the techniques discussed in module 3, lesson 1 I tracked it down to the carry into the third nibble of the adder and corrected it.

The multiplier operates as follows:

  • When mule is asserted, the nf,sf,and of flags are cleared and the uf flag is set. At the same time the contents of b are moved to the multiplicand register in the multiplier and the contents of the c register are moved to the multiplier register in the multiplier. Simultaneously, the b and c registers are cleared, the mpye register is set, and a 3-bit down counter is loaded to an all ones value.

  • On the next clock cycle, the microcode state machine de-asserts mule and requests a six cycle bus delay. When rack is received, it will wait until mpye is de-asserted before requesting the next instruction fetch.

  • Meanwhile, as long as mpye is set, the bc register pair is updated on every clock cycle, the adder is enabled, addend1 is set to the bc register pair shifted left by one place, and the multiplicand, which is the contents of the multiplicand register enabled by bit 7 of the multiplier is set as the source. On each clock cycle that occurs with mpye set, the multiplier is rotated left one bit, and the down counter is decremented. Once the down counter reaches zero, it is halted, and mpye is de-asserted on the next clock.

The modified code looks like this :

Project with MUL instruction implemented.

Of course we need to add a couple of instructions to our test bench to exercise the multiply instruction :

Updated test bench for exercising the MUL instruction

When we run our simulation we get the following :

GTKWAVE Display of the MUL Instruction

  1. What is a cross product?

  2. What is a reason that the Booth multiplier is popular.

  3. How does Horner's method differ from the basic multiplier?

  4. What do we need to do to multiply faster?

  5. Do symbol widths have to be symetric?

  6. What is the width of each cross product? What is the width of the product?