The source and/or object code (subsequently referred to as code)
presented in this and subsequent lessons is the property of
saxelec.com and owner who retains copyright protection. This code in
whole or in part is licensed only for single copy per user,
non-commercial use; except in the case of educational institutions,
where a license of single copy per student, only as part of course
curricula, is permitted.
Donald Gerard Polak (owner)
There are various methods that you can use to verify an Intellectual
Property with simulation. Remember, some instructions are both data
and instruction sensitive, which means that every combination of
instruction data has the potential to cause errors if you are not
careful. The selection of a particular strategy largely depends on
where you are at in the debug and commissioning process.
In the lite strategy, we first separate the instructions into
groups that use similar execution sequences. We then completely
verify those common blocks shared by all groups, such as
clock, source selection, and destination selection. Finally,
we select a sample instruction from each group to simulate the
execution path. In this method you must make sure that you cover
all 'corner cases' (i.e. situations in which the result can
vary from the norm). For instance, the INXr instruction usually
does not set the UF or VF flag, unless the data in the register
is 0xFFFF. Cases such as these must be included in a lite
The lite strategy is generally used for initial simulations, and
may be enough if all the corner cases are covered properly.
Brute Force Strategy
In its purest form, the brute force strategy means that you
simulate very instruction with every combination of data. For
obvious reasons this is never done. Generally, the brute force
method just uses the actual target code and examines the
results at periodic intervals to detect where the processor state
begins to vary from the expected. The period between where the
processor was in in expected state, and where variance was
detected is gradually narrowed until the error is isolated
to the failing instruction.
Obviously this method is extremely time consuming. It helps
if you have an actual target processor to compare with, but
not that much. The brute force strategy is only used during final
debug and commissioning when all else fails.
I will wake you up after this discussion.
If you have to test every instruction with every input combination
you will generate a lot of data. Not just the results of the
instruction, but also the processor flags have to be checked.
It would just about take a career to check all that data manually.
A cyclic redundancy check (CRC) offers a way to check large blocks
of data very quickly. A CRC on the generating side is the one's
complement of the remainder after division (module 2) of the data
by the generating polynomial. On the testing side the CRC is the
remainder after division (modulo 2) of the data, raised to the
exponent of the CRC width, plus the generating side CRC.
In the absence of any deviations between the generating side
data and the testing side data, the receiving side CRC will
be a fixed value.
So, for instance, to check a 1024-byte block of data with a 16
bit CRC, you would only have to divide an 8192-bit number by
a 16-bit number and get the remainder, without a divide function.
Simple right? Not to mention the fact that the amount of
data generated would exceed the memory addressing capability of
the 8085x processors.
Before you give this up as a pure theoretical discussion with no
merit, I will let you know about some hardware methods to generate
and check CRCs simply.
The first method is known as a linear feedback shift register
(LFSR). The generating polynomials (Gx) always has
roots one greater than the MSb of the CRC and a '+ 1' root.
This defines the feedback path. Essentially the MSb of the
CRC is exclusive OR'ed with the next bit of incoming data. The
results of the exclusive OR are placed in the LSb of the CRC
and exclusive OR'ed with all other roots. For instance, the linear
feedback shift register for a CRC16 (Gx = x16
+ x12 + x5 + 1) appears as follows:
LFSR for CRC16
The LFSR is initially seeded with an all ones value. Data is then
shifted in from LSb to MSb. When checking, the CRC is shifted in
from MSb to LSb. With no errors the result at the testing side
will be an all zeroes value.
The problem with the LFSR method is that it requires eight clock
cycle for each byte. However, French mathematician Galois
described polynomial fields. By examining our LFSR we can see that
each new CRC value is dependent upon the uppermost eight bytes of
the previous CRC value, and the next byte of data. There are
two methods where these fields can be used.
In software, we can setup a table containing 256 entries of
values to exclusive OR our CRC and data with. This method can
calculate CRCs very quickly.
In Verilog, we use a register for the CRC and an equation
for each bit in the register. The new register value is calculated
in a single clock cycle. We normally perform the bit calculations
in a Verilog function. A Verilog function is contained within
a Verilog module that performs a task. It uses as many inputs as
you like to develop a single output. The output of the function
is not declared, as the function name is used as the output
data. When calling a function, we normally use ordered
A simple way to derive the equations for each CRC bit is to start
with some index cards, one index card for each CRC bit. Label each
card with cn, where n represents the respective bit
position in the CRC. Then lay them out in an ordered row, with the
highest value of n in the leftmost position. Viola, you have just
created a virtual LFSR. Now take the leftmost card and write
D0< on it. Then place it in the rightmost position.
proceed to the index cards that occupied the other generation
polynomial roots at the start of the process a write the data
from the card you just moved on them also. Repeat this process
eight times, each time increasing the subscript of Dn.
If a particular Dn or Cn term appears twice
on the same index card after a move, cross it out. At the end
of the eighth move, you are left with the terms for each bit
in the new CRC value that you exclusive OR together to form
each bit. The bit positions are in order with the MSb leftmost.
We can now build a CRC checker I/O device for our test bench. This
I/O device requires a reset address, to preload our CRC register
to the all ones value and a read/write address to send data to
the CRC checker and retrieve the results. If you don't want to
mirror the known good CRC value in your code, then you must also
have a third address in your I/O device to accomplish this
(remember, data is shifted in LSb to MSb whereas the CRC is
shifted in MSb to LSb).
A CRC checker I/O device for CRC16 (ITU)
Of course, we need a known good CRC for this to work. This is
done with a software program that emulates the instruction and
data that you are trying to check, generates the appropriate
flags, and generates a CRC for the sequence. This is the known
good CRC that you can use for your test.
The CRC method can be used for final certification of an IP. Other
than that, this has been a purely theoretical discussion with
no merit. You can wake up now.
Debugging an IP with Simulation
I don't know about you, but I, like most designers, cannot seem to
develop a new architecture, write 9,000+ lines of code for it, and
have it work perfectly the first time. On the other hand, I expect
most of the code to work, and focus on removing errors incrementally
as I add or change things. This is where simulation hits its stride.
But, before you begin your simulation, you must (as much as possible)
be familiar enough with your architecture so that you can locate
the source of any errors you encounter. Simulation is an iterative
process. Most times you will need to remove one or two errors
and re-simulate, as a single error can cause multiple effects or
screen other problems.
In order to simulate some of the added instructions from
Module 2, Lesson 6 we will need to update
our test bench again. Up until now, we were exclusively using read-only
memory. But some of our additional blocks will require read/write
memory. Therefore, we will require a RAM block.
RAM blocks are fairly simple to encode, but most vendors require
a precise coding format to recognize (infer) a RAM block. Fortunatly
this format is consistent amongst the major programmable logic and
synthesis tool vendors. A simple, single port RAM block has been
added to our test bench.
A Verilog function has been added to our test bench to automatically
calculate the address width required to support the RAM depth. This
function is known as a constant function, that is evaluated by the
pre-processor. Constant functions, introduced in Verilog 2001, can
act on parameters. Unfortunately, constant functions are not
uniformly supported by all tools. Quartus II supports constant
functions, IVERILOG does not. To work around this issue, a `define
has been added to require an additional parameter when IVERILOG
RAM block depth should always be an Entier (integer) log2 value.
We probably would encounter difficulties achieving this objective
if we made our RAM depth plus our ROM depth match the addressing
range of the 8085. At the very least it would require a much more
sophisticated address decoder, and an address translator block.
Therefore, we have set our RAM block to the full addressing range of
the 8085x for now and ignore the lowermost 2K.
Therefore, we come up with the following test bench :
New Test Bench
When we run the simulation we come up with the following errors:
GTKWAVE Display of MVIA Error
We can see from this display that the bus cycles are correct, but
the accumulator is not updated. When we examine our code, we see that
the update happens when the microcode is in state MFETCH1W, the
instruction latch contains the instruction MVIA, and edck is present.
Let us add those signals to our display:
GTKWAVE Display of MVIA Error
From this display we can see that the instruction latch is changing
values when it is not supposed to. We examine the equation for
instruction latch enable in expander_r and see that it will need
additional qualifiers to omit some of the wait states.
This error is a major show stopper, so we must correct it immediately
and try again. Therefore, we have this new skeleton to run the simulation
Project Skeleton with
instruction latch change corrected
When we rerun the simulation we come up with the following:
GTKWAVE Display of MVIA Correction
We can see that the instruction error has been cleared. Now we can
look for any subsequent errors. We can see that the LXISP instruction
is loading both sides of the SP at once:
GTKWAVE Display of LXISP Error
This is coming from a flaw in our register file. Because data is
transferred one byte at a time from memory to the SP, we need to break
the SP into two halves with separate enables. This error also pertains
to the PC.
Moving forward we can see that the PUSHB instruction does not execute
GTKWAVE Display of PUSHB Error
Closer examination of the microcode state machine shows
that we failed to code the MWRSP1 state.
Let us remove these errors and also move addend1 generation to
source_sel_r in accordance with
Module 1, Lesson, Rule 3.11. This
activity creates the following code:
Project Skeleton with
initial corrections for LXISP and PUSHB
When we rerun our simulation, we can see that the LXISP instruction is
now operating correctly:
But when we check the PUSHB instruction we get the following:
The bus cycle does not have enough T states, and during the bus cycle
the address, SP data, and the data being written are not correct.
The address is problematic. The 8085 PUSH instructions are decrement
first and then use as far as the stack pointer is concerned. However,
if we simply change MWRSP1 operation to DECNF, it would match the
MWRSP2 instruction. Rather than adding differentiator bits, we
take advantage of the fact there are three clock cycle in an
instruction fetch before we need to post the next bus request.
Therefore, we will make MWRSP1D the landing state and set the
operation of MWRSP1 to NOOP. MWRSP1 mow waits for rack and handles
the diverge. MWRSP1D simply decrements the stack pointer.
The short bus cycle means that we have to add a delay bus cycle to
the stack instructions. Therefore the DELAY2 state code must be
altered to check the instruction latch and diverge if necessary. The
IFETCH3 state now makes all PUSH and POP instructions enter DELAY2
before going to their next landing states.
We can track the data problem to the data register in the bus
state machine. Write data is being asserted on the same clock
cycle as it is being asserted. We can fix this by making the data
register a transparent latch.
I also used similar techniques to track down address issues with
the STA, SHLD, and STAR instructons (note I had to change the
definition file to compile for the S8085C). This leaves us
with the following code :
Project Skeleton with
all current errors corrected :
We rerun our simulation and focus in on the PUSHB instruction. It is
now operating correctly:
What is a CRC?
What hardware can be used for a polynomial divide?
Name the simulation strategies.
What is a constant function?
What should RAM depth always be?
Pick some other instructions and check them. You may add additional
instructions to the test bench if you like.