Computer Architectures

Experiments with Cache Simulator Dinero IV



A) Performance Evaluation of Memory Hierarchy

When designing a memory hierarchy, values of many parameters need to be set. The basic issue is the number of cache levels and their types (instruction, data, unified).
Then, it is necessary to choose the capacity, associativity and block size of individual caches. Choosing the write strategy and block replacement strategy is not trivial either.
These parameters significantly affect the performance of the entire computer, as well as its manufacturing costs.
For a quality design, it is necessary to establish a quantitative metric of the memory subsystem performance.

    There are two approaches in general use (for details, see lectures 8 and 9):

    1. Average Memory Access Time:  AMAT = HT + MR * MP
       where HT is the cache hit time, MR the miss rate and MP the miss penalty (in CPU cycles).

    2. Total CPU execution time:  TCPU = IC * (CPIexec + MAPI * MR * MP) * Tclk
       where IC is the instruction count, CPIexec the CPI assuming a perfect memory subsystem,
       MAPI the number of memory accesses per instruction and Tclk the clock period.

Note: The above equations are a first approximation useful for understanding basic trends.
For a more exact calculation for a given memory subsystem configuration, it is necessary to break the MR*MP and MAPI*MR*MP products down in detail into individual components. (Some calculation examples can be found in the lecture slides.)
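
As a rough worked example (illustrative numbers only, roughly matching the example run in section B below): with HT = 1 cycle, MR = 0.003 and MP = 82 cycles,

    AMAT = 1 + 0.003 * 82 = 1.25 cycles (approximately),

and for a program with IC = 6 000 000 instructions, MAPI = 1.39 and CPIexec = 1.11,

    TCPU = 6 000 000 * (1.11 + 1.39 * 0.003 * 82) = approximately 8.7 million clock cycles.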

B) Cache simulator Dinero IV

Dinero IV is a so-called trace-driven simulator that can be used to determine MR and MAPI for a given memory hierarchy configuration.
(Another category is formed by execution-driven simulators that use a real program in binary or text form as their input. Trace-driven simulators are simpler and faster. Execution-driven simulators are more accurate; some of them are able to simulate an entire computer including the operating system - such as Simplescalar, SimOS, Simics, RSIM.)

Dinero IV needs a trace file. It is essentially a log of the actual memory activity of a given program (instruction fetches, data reads, data writes, cache flushes). Several input formats exist for Dinero IV; they differ in how the accesses are encoded and in how accurately they describe them. Most of the time, we'll use the trace files of SPECint95 benchmarks in the Dinero III format.
(For a serious simulation, one should use a more extensive set of benchmarks as well as input data - on the order of billions of instructions).

The Dinero III format is a text file. Each line corresponds to one memory access:

    type address

    where type is:
    0 ... data read
    1 ... data write
    2 ... instruction fetch
    3 ... escape record (treated as unknown access type)
    4 ... escape record (causes a cache flush)

    and address is a hexadecimal number between 0 and ffffffff.

    Example:
    1 70f22c84
    1 70f22c80
    2 136a
    0 200
    1 280

The Dinero III format assumes that every memory access deals with a 32-bit word (instruction or data).
The more accurate Dinero IV format allows the data size to be specified for each access; simulations that use this format are therefore more realistic.
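
Because the trace is plain text, it is easy to inspect directly. For instance (an illustrative one-liner; tex.din stands for whatever trace file you are using), the number of accesses of each type can be counted with:

awk '{count[$1]++} END {for (t in count) print t, count[t]}' tex.din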
The actual configuration of the memory subsystem is specified by command-line options. The complete list of options can be displayed by running dineroIV with the -help option.

    The most important options are:

    -lN-Tbsize P    Sets the block size of the level N, type T cache to P bytes
    -lN-Tsize P     Sets the capacity of the level N, type T cache to P bytes
    -lN-Tassoc U    Sets the associativity of the level N, type T cache to U
    -lN-Trepl C     Sets the block replacement policy of the level N, type T cache to C (l = LRU, f = FIFO, r = random). Default is LRU.

N specifies the cache level (1 – 5)
T specifies the cache type (i ... instruction, d ... data, u ... unified)

Other useful options are:
    -stat-idcombine   Combines the statistics of the instruction and data caches
    -lN-Tccc          Splits the misses of the level N, type T cache into compulsory, capacity and conflict misses

Example command line:

dineroIV -l1-usize 16k -l1-ubsize 16 -l1-uassoc 4 -l1-uccc -informat d <tex.din >tex1.rpt

This simulates an L1 unified cache (l1-u) of size 16 KB, block size 16 B, 4-way set associative, with LRU replacement (default) and write-back policy (default),
using input format d (Dinero III). The trace file is tex.din and the result is stored in tex1.rpt.
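
A split instruction/data L1 cache is configured analogously using the i and d cache types, for example (an illustrative command; the report file name is just a placeholder):

dineroIV -l1-isize 16k -l1-ibsize 32 -l1-iassoc 2 -l1-dsize 16k -l1-dbsize 32 -l1-dassoc 2 -informat d <tex.din >tex_split.rpt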

After execution, the tex1.rpt file contains the following:

---Dinero IV cache simulator, version 7
--Written by Jan Edler and Mark D. Hill
--Copyright (C) 1997 NEC Research Institute, Inc. and Mark D. Hill.
--All rights reserved.
--Copyright (C) 1985, 1989 Mark D. Hill. All rights reserved.
--See -copyright option for details

---Summary of options (-help option gives usage information).

-l1-usize 16384
-l1-ubsize 16
-l1-usbsize 16
-l1-uassoc 4
-l1-urepl l
-l1-ufetch d
-l1-uwalloc a
-l1-uwback a
-l1-uccc
-skipcount 0
-flushcount 0
-maxcount 0
-stat-interval 0
-informat d
-on-trigger 0x0
-off-trigger 0x0

---Simulation begins.
---Simulation complete.

l1-ucache

Metrics  Total Instrn Data Read Write Misc
----------------- ------ ------ ------ ------ ------ ------
Demand Fetches 8324761 5973081 2351680 1306550 1045130 0
Fraction of total 1.0000 0.7175 0.2825 0.1569 0.1255 0.0000

Demand Misses 24661 872 23789 5076 18713 0
Demand miss rate 0.0030 0.0001 0.0101 0.0039 0.0179 0.0000
Compulsory misses 2438 52 2386 516 1870 0
Capacity misses 22196 820 21376 4533 16843 0
Conflict misses 27 0 27 27 0 0
Compulsory fraction 0.0989 0.0596 0.1003 0.1017 0.0999 0.0000
Capacity fraction 0.9000 0.9404 0.8986 0.8930 0.9001 0.0000
Conflict fraction 0.0011 0.0000 0.0011 0.0053 0.0000 0.0000

Multi-block refs 0
Bytes From Memory 394576
( / Demand Fetches) 0.0474
Bytes To Memory 299424
( / Demand Writes) 0.2865
Total Bytes r/w Mem 694000
( / Demand Fetches) 0.0834

---Execution complete.

In this listing, the most important data are in the “Demand miss rate” line. It gives the combined MR for data and instructions, as well as the separate MR for instruction fetches, data accesses, reads and writes. The “Fraction of total” line is also important because it lets you determine the number of Load and Store instructions. Notice that about 72% of the memory accesses are instruction fetches and 28% are data accesses. The ratio of the “Read” and “Write” accesses to the number of instruction accesses gives the percentage of “Load” and “Store” instructions in the given program. (The percentage of “Load” instructions is 1306550/5973081 = 22 %.)
Also notice that the misses are divided into compulsory, capacity, and conflict (or collision) misses.
The number of bytes transferred between the cache and main memory can be used to determine the load on the system bus.
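
When you run many configurations for the assignments below, it is convenient to extract just the miss rates from the report files, e.g. with (an illustrative command, assuming the reports are named *.rpt):

grep "Demand miss rate" *.rpt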


C) Running the DineroIV Simulator on the Computers in the Department

At this time, the program can be run on the dual.felk.cvut.cz (2x PIII@550 MHz) and hwlog.felk.cvut.cz (P4@3.2 GHz with HT) machines.

To run the program, follow this procedure:
1. Use Putty (Accessories-Communications-Putty) to log in to dual.felk.cvut.cz or hwlog.felk.cvut.cz in SSH mode.

To log in to hwlog.felk.cvut.cz, it is necessary to open the Connection/SSH menu in Putty and set the
"Preferred SSH protocol version" to 2!

2. Copy the contents of /opt/vlsi/aps/dinero_vzor into a work directory, e.g. as follows (note that there is a space between the star and the dot in the cp command):
mkdir dinero
cd dinero
cp -r /opt/vlsi/aps/dinero_vzor/*  .
 

3. Now it should be possible to edit the run_dinero script in a text editor of your choice.

run_dinero is a script that allows you to run DineroIV in various configurations.
Edit this file as needed. Take care to keep all the command-line options on a single line (watch out for line breaks in the editor!). A sketch of such a script is shown below.
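
For illustration only (the actual run_dinero supplied in dinero_vzor may look different), a script for assignment 1 could sweep the block sizes like this, assuming the assigned trace file is tex.din:

#!/bin/sh
# Illustrative sketch: sweep the block size of a 4 KB direct-mapped unified L1 cache
TRACE=tex.din                  # replace with the trace file assigned to you
for BSIZE in 16 32 64 128 256; do
    dineroIV -l1-usize 4k -l1-ubsize $BSIZE -l1-uassoc 1 -informat d <$TRACE >bsize_${BSIZE}.rpt
done
grep "Demand miss rate" bsize_*.rpt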

The following trace files are available for testing:

D) Assignments

Perform the following measurements on the benchmark assigned to you by your teacher. Your answers are due by the 12th week at the latest, in Excel format - mail them to your teacher. The file must contain your names!

1. Measure how MR and AMAT depend on the block size (other conditions being equal).
For the measurements, use a unified L1 cache of size 1KB, 4KB and 8KB, one-way set associative (direct mapped).
Block size is 16B, 32B, 64B, 128B, 256B.

- Graph MR = f(block size) for various cache sizes.
- Graph AMAT = f(block size) for various cache sizes (time is in CPU cycles);
  HT = 1, MP is given in the following table (MP = 80 + block size / 8B):

Block size    16B    32B    64B    128B    256B
MP            82     84     88     96      112


Is it possible to find an optimal block size for the given cache?

2. Measure how MR depends on associativity.
Perform the measurements on a unified L1 cache of size 4KB, 8KB, 16KB, 32KB;
block size is 32B (MP = 84), associativity s = 1, 2, 4, 8, fully associative.
- Graph MR = f(associativity) for various cache sizes
Is it true that a cache of size N and s=1 has the same MR as a cache of size N/2 and s=2 ?

- Graph TCPU = f(associativity) for various cache sizes (omit the fully associative cache).
  For IC and MAPI, see the output of Dinero; MP = 84, CPIint = 1.11.

Associativity    1    2       4       8
Tclk             1    1.36    1.44    1.52


(Assume HT = Tclk, i.e. a cache hit takes one clock cycle - e.g. a pipelined DLX processor.)
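
For example (a sketch of the substitution only; use the IC, MAPI and MR values measured for your benchmark), for the 2-way set associative configuration:

    TCPU = IC * (1.11 + MAPI * MR * 84) * 1.36

Higher associativity lowers MR but lengthens the clock period; TCPU shows which of the two effects prevails.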

3. Assuming that other parameters are constant, fill in “increases”, “decreases” or “does not change” in the following table.

Cache parameter                    Hit Time    Miss Rate    Miss Penalty
If cache capacity increases ...
If block size increases ...
If associativity increases ...

In some cases, the relationship is complex. In such a case, fill in “changes” in the table and describe the relationship verbally.

© Miloš Bečvář, 2005