Coding for Reuse Course - Module 3 Lesson 1
Simulation and debug.

In this lesson learn about different simulation strategies and when and how to use them. We will discuss Verilog functions, update our test bench and simulate the added instructions from Module 2, Lesson 6. We will also study how to use our simulation to track down and correct errors.

License Terms  

The source and/or object code (subsequently referred to as code) presented in this and subsequent lessons is the property of and owner who retains copyright protection. This code in whole or in part is licensed only for single copy per user, non-commercial use; except in the case of educational institutions, where a license of single copy per student, only as part of course curricula, is permitted.

Donald Gerard Polak (owner)

Simulation Strategies  

There are various methods that you can use to verify an Intellectual Property with simulation. Remember, some instructions are both data and instruction sensitive, which means that every combination of instruction data has the potential to cause errors if you are not careful. The selection of a particular strategy largely depends on where you are at in the debug and commissioning process.

  1. Lite Strategy

    In the lite strategy, we first separate the instructions into groups that use similar execution sequences. We then completely verify those common blocks shared by all groups, such as clock, source selection, and destination selection. Finally, we select a sample instruction from each group to simulate the execution path. In this method you must make sure that you cover all 'corner cases' (i.e. situations in which the result can vary from the norm). For instance, the INXr instruction usually does not set the UF or VF flag, unless the data in the register is 0xFFFF. Cases such as these must be included in a lite simulation.

    The lite strategy is generally used for initial simulations, and may be enough if all the corner cases are covered properly.

  2. Brute Force Strategy

    In its purest form, the brute force strategy means that you simulate very instruction with every combination of data. For obvious reasons this is never done. Generally, the brute force method just uses the actual target code and examines the results at periodic intervals to detect where the processor state begins to vary from the expected. The period between where the processor was in in expected state, and where variance was detected is gradually narrowed until the error is isolated to the failing instruction.

    Obviously this method is extremely time consuming. It helps if you have an actual target processor to compare with, but not that much. The brute force strategy is only used during final debug and commissioning when all else fails.

  3. CRC Strategy

    I will wake you up after this discussion.

    If you have to test every instruction with every input combination you will generate a lot of data. Not just the results of the instruction, but also the processor flags have to be checked. It would just about take a career to check all that data manually.

    A cyclic redundancy check (CRC) offers a way to check large blocks of data very quickly. A CRC on the generating side is the one's complement of the remainder after division (module 2) of the data by the generating polynomial. On the testing side the CRC is the remainder after division (modulo 2) of the data, raised to the exponent of the CRC width, plus the generating side CRC. In the absence of any deviations between the generating side data and the testing side data, the receiving side CRC will be a fixed value.

    So, for instance, to check a 1024-byte block of data with a 16 bit CRC, you would only have to divide an 8192-bit number by a 16-bit number and get the remainder, without a divide function. Simple right? Not to mention the fact that the amount of data generated would exceed the memory addressing capability of the 8085x processors.

    Before you give this up as a pure theoretical discussion with no merit, I will let you know about some hardware methods to generate and check CRCs simply.

    The first method is known as a linear feedback shift register (LFSR). The generating polynomials (Gx) always has roots one greater than the MSb of the CRC and a '+ 1' root. This defines the feedback path. Essentially the MSb of the CRC is exclusive OR'ed with the next bit of incoming data. The results of the exclusive OR are placed in the LSb of the CRC and exclusive OR'ed with all other roots. For instance, the linear feedback shift register for a CRC16 (Gx = x16 + x12 + x5 + 1) appears as follows:

    LFSR for CRC16

    The LFSR is initially seeded with an all ones value. Data is then shifted in from LSb to MSb. When checking, the CRC is shifted in from MSb to LSb. With no errors the result at the testing side will be an all zeroes value.

    The problem with the LFSR method is that it requires eight clock cycle for each byte. However, French mathematician Galois described polynomial fields. By examining our LFSR we can see that each new CRC value is dependent upon the uppermost eight bytes of the previous CRC value, and the next byte of data. There are two methods where these fields can be used.

    In software, we can setup a table containing 256 entries of values to exclusive OR our CRC and data with. This method can calculate CRCs very quickly.

    In Verilog, we use a register for the CRC and an equation for each bit in the register. The new register value is calculated in a single clock cycle. We normally perform the bit calculations in a Verilog function. A Verilog function is contained within a Verilog module that performs a task. It uses as many inputs as you like to develop a single output. The output of the function is not declared, as the function name is used as the output data. When calling a function, we normally use ordered instantiation.

    A simple way to derive the equations for each CRC bit is to start with some index cards, one index card for each CRC bit. Label each card with cn, where n represents the respective bit position in the CRC. Then lay them out in an ordered row, with the highest value of n in the leftmost position. Viola, you have just created a virtual LFSR. Now take the leftmost card and write D0< on it. Then place it in the rightmost position. proceed to the index cards that occupied the other generation polynomial roots at the start of the process a write the data from the card you just moved on them also. Repeat this process eight times, each time increasing the subscript of Dn. If a particular Dn or Cn term appears twice on the same index card after a move, cross it out. At the end of the eighth move, you are left with the terms for each bit in the new CRC value that you exclusive OR together to form each bit. The bit positions are in order with the MSb leftmost.

    We can now build a CRC checker I/O device for our test bench. This I/O device requires a reset address, to preload our CRC register to the all ones value and a read/write address to send data to the CRC checker and retrieve the results. If you don't want to mirror the known good CRC value in your code, then you must also have a third address in your I/O device to accomplish this (remember, data is shifted in LSb to MSb whereas the CRC is shifted in MSb to LSb).

    A CRC checker I/O device for CRC16 (ITU)

    Of course, we need a known good CRC for this to work. This is done with a software program that emulates the instruction and data that you are trying to check, generates the appropriate flags, and generates a CRC for the sequence. This is the known good CRC that you can use for your test.

    The CRC method can be used for final certification of an IP. Other than that, this has been a purely theoretical discussion with no merit. You can wake up now.

Debugging an IP with Simulation  

I don't know about you, but I, like most designers, cannot seem to develop a new architecture, write 9,000+ lines of code for it, and have it work perfectly the first time. On the other hand, I expect most of the code to work, and focus on removing errors incrementally as I add or change things. This is where simulation hits its stride. But, before you begin your simulation, you must (as much as possible) be familiar enough with your architecture so that you can locate the source of any errors you encounter. Simulation is an iterative process. Most times you will need to remove one or two errors and re-simulate, as a single error can cause multiple effects or screen other problems.

In order to simulate some of the added instructions from Module 2, Lesson 6 we will need to update our test bench again. Up until now, we were exclusively using read-only memory. But some of our additional blocks will require read/write memory. Therefore, we will require a RAM block.

RAM blocks are fairly simple to encode, but most vendors require a precise coding format to recognize (infer) a RAM block. Fortunatly this format is consistent amongst the major programmable logic and synthesis tool vendors. A simple, single port RAM block has been added to our test bench.

A Verilog function has been added to our test bench to automatically calculate the address width required to support the RAM depth. This function is known as a constant function, that is evaluated by the pre-processor. Constant functions, introduced in Verilog 2001, can act on parameters. Unfortunately, constant functions are not uniformly supported by all tools. Quartus II supports constant functions, IVERILOG does not. To work around this issue, a `define has been added to require an additional parameter when IVERILOG is used.

RAM block depth should always be an Entier (integer) log2 value. We probably would encounter difficulties achieving this objective if we made our RAM depth plus our ROM depth match the addressing range of the 8085. At the very least it would require a much more sophisticated address decoder, and an address translator block. Therefore, we have set our RAM block to the full addressing range of the 8085x for now and ignore the lowermost 2K.

Therefore, we come up with the following test bench :

New Test Bench

When we run the simulation we come up with the following errors:

GTKWAVE Display of MVIA Error

We can see from this display that the bus cycles are correct, but the accumulator is not updated. When we examine our code, we see that the update happens when the microcode is in state MFETCH1W, the instruction latch contains the instruction MVIA, and edck is present. Let us add those signals to our display:

GTKWAVE Display of MVIA Error

From this display we can see that the instruction latch is changing values when it is not supposed to. We examine the equation for instruction latch enable in expander_r and see that it will need additional qualifiers to omit some of the wait states.

This error is a major show stopper, so we must correct it immediately and try again. Therefore, we have this new skeleton to run the simulation again.

Project Skeleton with instruction latch change corrected

When we rerun the simulation we come up with the following:

GTKWAVE Display of MVIA Correction

We can see that the instruction error has been cleared. Now we can look for any subsequent errors. We can see that the LXISP instruction is loading both sides of the SP at once:

GTKWAVE Display of LXISP Error

This is coming from a flaw in our register file. Because data is transferred one byte at a time from memory to the SP, we need to break the SP into two halves with separate enables. This error also pertains to the PC.

Moving forward we can see that the PUSHB instruction does not execute at all:

GTKWAVE Display of PUSHB Error

Closer examination of the microcode state machine shows that we failed to code the MWRSP1 state.

Let us remove these errors and also move addend1 generation to source_sel_r in accordance with Module 1, Lesson, Rule 3.11. This activity creates the following code:

Project Skeleton with initial corrections for LXISP and PUSHB

When we rerun our simulation, we can see that the LXISP instruction is now operating correctly:

But when we check the PUSHB instruction we get the following:

The bus cycle does not have enough T states, and during the bus cycle the address, SP data, and the data being written are not correct.

The address is problematic. The 8085 PUSH instructions are decrement first and then use as far as the stack pointer is concerned. However, if we simply change MWRSP1 operation to DECNF, it would match the MWRSP2 instruction. Rather than adding differentiator bits, we take advantage of the fact there are three clock cycle in an instruction fetch before we need to post the next bus request. Therefore, we will make MWRSP1D the landing state and set the operation of MWRSP1 to NOOP. MWRSP1 mow waits for rack and handles the diverge. MWRSP1D simply decrements the stack pointer. The short bus cycle means that we have to add a delay bus cycle to the stack instructions. Therefore the DELAY2 state code must be altered to check the instruction latch and diverge if necessary. The IFETCH3 state now makes all PUSH and POP instructions enter DELAY2 before going to their next landing states.

We can track the data problem to the data register in the bus state machine. Write data is being asserted on the same clock cycle as it is being asserted. We can fix this by making the data register a transparent latch.

I also used similar techniques to track down address issues with the STA, SHLD, and STAR instructons (note I had to change the definition file to compile for the S8085C). This leaves us with the following code :

Project Skeleton with all current errors corrected :

We rerun our simulation and focus in on the PUSHB instruction. It is now operating correctly:

  1. What is a CRC?

  2. What hardware can be used for a polynomial divide?

  3. Name the simulation strategies.

  4. What is a constant function?

  5. What should RAM depth always be?

  6. Pick some other instructions and check them. You may add additional instructions to the test bench if you like.