License Terms


The source and/or object code (subsequently referred to as code)
presented in this and subsequent lessons is the property of
saxelec.com and owner who retains copyright protection. This code in
whole or in part is licensed only for single copy per user,
noncommercial use; except in the case of educational institutions,
where a license of single copy per student, only as part of course
curricula, is permitted.
Donald Gerard Polak (owner)

Basics


So how do we multiply? Think back to elementary school. Suppose we
want to multiply 29 by 27.
29
27

63
14
18
4

783


So what is happening here. Let's take a closer look.
multiplicand > 29
multiplier >
27

cross product > 63
14
18
4

product >
783


Each cross product is the product of one symbol from the
multiplicand by one symbol from the multiplier, shifted left
by the sum of the symbol positions from the multiplicand and
the multiplier. Symbol positions are numbered from the right,
starting at zero. The product is the sum of the cross products.
The maximum width of each cross product is the sum of the symbol width
from the multiplicand plus the symbol width of the multiplier. The
maximum width of the product is the width of the multiplicand plus
the width of the multiplier. These distinctions may not be readily
apparent in base 10 math, but are very important in binary math.
Using distribution theorem, we can see that the cross products
do not have to be stored. Instead, if the product is initially set to
zero, each cross product is simply added to the product.

Binary Multiplication


In its simplest form, binary multiplication simply consists of shift
and add operations. The LSb of the multiplier is first checked. If
is one, the multiplicand is added to the product, which was initially
set to zero. the multiplicand is then shifted left one place, and
the multiplier is shifted right one place. This operation is
repeated once for each bit of the multiplier. In the case of our
S8085D processor, this means 8 x1 clock cycles or 4 sysclk cycles.
This is well within our constraints.
A Basic Multiplier

Horner's Method


Although the basic multiplier is almost optimal in its use of
resources, we can actually make it a little smaller by using
Horner's method of polynomial evaluation. The basic multiplier
uses a sixteenbit shift register for the multiplicand, plus
a sixteenbit product register. Horner's method reduces the
sixteenbit multiplicand register to an eightbit static register.
In Horner's method, the product is first cleared, the multiplicand is
moved to an eightbit static register, and the multiplier moved to
an eightbit left shift register. The MSb of the multiplier is first
tested. If it is one. the multiplicand is added to the product. The
product and the multiplier are then shifted left and the process
repeated until all bits of the multiplier are used. Horner's
method still consumes the same amount of clock cycles, but uses
fewer resources. We will use Horner's method for our multiplier.

Booth Multiplier


The Booth multiplier is popular for signed multiplications, and is
frequently used for internal multipliers in FPGAs that feature
multipliers. It relies on bit pairs from the multiplier to
decide whether to add or subtract the multiplicand from the
product. It is rather logic intensive, and there are several good
articles published on this technique. Since the S8085D multiplies
are unsigned, we will not discuss it in detail here.

Faster Multipliers


Our discussion, so far, has focused on 1x16 symbol widths. That is,
we are using one bit from the multiplier, and 16 bits from
either the multiplicand or product. Note here, the symbol widths
may be asymmetric. That is, the symbol width of the multiplicand and
the symbol width of the multiplier do not have to match.
In order to multiply faster, we need to increase our symbol widths.
For example, an 8x8 multiplier could complete in a single clock
cycle. However, the impact on logic utilization is exponential in
nature. An 8x8 multiplier could consume over 65,000 logic cells. This
logic utilization could be diminished somewhat, by using parallel
adders, but this would also have an impact on maximum clock speed due
to ripple carry delays.
A 4x4 multiplier could complete the multiply in 4 clk_x1 clock cycles.
This would require 16 UDPs, plus 8 first level enhanced multiplexers,
and one OR gate for each of the 8 bits of the product. In short
200 logic gates. The control and shift logic would add a handful of
gates. The LSb of the product can be shared. but this implementation
would still come in at approximately 150 logic cells.
Using asymmetric symbol widths, we can obtain the same execution time
by using an 8x2 multiplier. This requires an additional 16bit
adder to form the cross product. This resembles a Booth multiplier
without using the sign. The issue here is that it will have
an impact on the maximum clock frequency of the entire device,
due to the additional adder delays, unless it is mitigated.

Pipelining


The 8x2 multiplier impact on overall performance can easily be
mitigated by registering the cross product after the initial
add, and then using the registered value to add to the product.
This process is known as pipelining. In pipelining, one microstate
output is first registered, and then passed on to the next microstate
on the next clock cycle. In our example, we use a pipeline width
of one. In other words, we add one clk_x1 cycle to our instruction.
A multiply instruction would still consume only six sysclk cycles.
but we will still add one sixteenbit adder, one 10bit register,
and some additional control logic.

Coding and Simulation


Coding our multiplier required modification of the source select
logic, the microcode state machine, the microcode state expander,
the flag enables, the alu top level, and the flag logic; as well as
coding the multiplier itself. we also discovered an error during
initial simulation, and using the techniques discussed in
module 3, lesson 1 I tracked it down
to the carry into the third nibble of the adder and corrected it.
The multiplier operates as follows:

When mule is asserted, the nf,sf,and of flags are cleared and
the uf flag is set. At the same time the contents of b are
moved to the multiplicand register in the multiplier and the
contents of the c register are moved to the multiplier
register in the multiplier. Simultaneously, the b and c registers
are cleared, the mpye register is set, and a 3bit down counter is
loaded to an all ones value.

On the next clock cycle, the microcode state machine deasserts
mule and requests a six cycle bus delay. When rack is received,
it will wait until mpye is deasserted before requesting the
next instruction fetch.

Meanwhile, as long as mpye is set, the bc register pair is updated
on every clock cycle, the adder is enabled, addend1 is set to
the bc register pair shifted left by one place, and the
multiplicand, which is the contents of the multiplicand register
enabled by bit 7 of the multiplier is set as the source. On each
clock cycle that occurs with mpye set, the multiplier is rotated
left one bit, and the down counter is decremented. Once the
down counter reaches zero, it is halted, and mpye is
deasserted on the next clock.
The modified code looks like this :
Project with MUL instruction implemented.
Of course we need to add a couple of instructions to our test
bench to exercise the multiply instruction :
Updated test bench for exercising the MUL instruction
When we run our simulation we get the following :
GTKWAVE Display of the MUL Instruction

Exercise



What is a cross product?

What is a reason that the Booth multiplier is popular.

How does Horner's method differ from the basic multiplier?

What do we need to do to multiply faster?

Do symbol widths have to be symetric?

What is the width of each cross product? What is the width of the
product?
