Experiments with Cache Simulator Dinero IV

Computer Architectures

A) Performance Evaluation of Memory Hierarchy

When designing a memory hierarchy, values of many parameters need to be set. The basic issue is the number of cache levels and their types (instruction, data, unified).
Then, it is necessary to choose the capacity, associativity and block size of individual caches. Choosing the write strategy and block replacement strategy is not trivial either.
These parameters significantly affect the performance of the entire computer, as well as its manufacturing costs.
For a quality design, it is necessary to establish a quantitative metric of the memory subsystem performance.

There are two approaches in general use (for details, see lectures 8 and 9).

Average Memory Access Time

AMAT = HT + MR * MP

AMAT ... Average Memory Access Time
HT ... Hit Time = cache access time if the item is present (Cache Hit).
MR ... Miss Rate
MP ... Miss Penalty = access time if the item is not present in cache (Cache Miss)

The equation states that the Average Memory Access Time equals the cache access time (Hit Time) plus, for the fraction of accesses that miss, the access time of the next level of the memory hierarchy. In a two-level memory system, MP corresponds to the main memory access time.
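
As a quick illustration of the AMAT formula, here is a minimal Python sketch. The function name amat and the numbers used (HT = 1 cycle, MR = 2 %, MP = 100 cycles) are assumptions chosen for the example, not measured values.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: AMAT = HT + MR * MP.

    hit_time and miss_penalty are in CPU cycles, miss_rate is a
    fraction between 0 and 1.
    """
    return hit_time + miss_rate * miss_penalty


# Illustrative values only (not taken from any measurement).
print(f"AMAT = {amat(1, 0.02, 100):.2f} cycles")
```

Even with a miss rate of only 2 %, the misses contribute twice as many cycles as the hit time in this example, which is why MR and MP dominate the evaluation.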

CPU Performance Equation

T_CPU = IC * (CPI_int + MAPI * MR * MP) * T_clk

IC ... Instruction Count of the given program
CPI_int ... CPI assuming an ideal memory hierarchy (CPI_int = CPI_{pipeline_ideal} + Stalls Per Instruction)
MAPI ... Memory Accesses Per Instruction (MAPI = 1 + LSR)
The 1 accounts for fetching the instruction from memory; LSR (Load/Store Rate) gives the frequency of Load/Store instruction occurrence.
MR ... Miss Rate
MP ... Miss Penalty = access time if the item is not found in cache (Cache Miss)

Evaluation using T_CPU is more accurate since it corresponds to the program execution time. At first glance, the missing HT may seem confusing. However, the cache access time (Hit Time) is already included in T_clk, assuming that the cache access corresponds to a single pipeline stage (see e.g. IF and MEM). If the memory access takes several pipeline stages (see superpipelining), it is already accounted for in T_clk and CPI_int.
The value of CPI_int is obtained from simulation, e.g. in WinDLX, assuming an ideal cache. Further wait states accounting for cache misses (MAPI * MR * MP) are added to this value.
Both equations suggest that to minimize AMAT or T_CPU, one must minimize HT, MR and MP. However, these factors are not independent.
Moreover, HT and MP are technology-dependent parameters (and can be determined e.g. with the CACTI simulator).
MR and MAPI are technology-independent and can be determined with a Cache simulator, e.g. Dinero.
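
The CPU performance equation can be evaluated in the same way. The sketch below is only an illustration: the function name cpu_time and all input values are assumptions made up for the example, and T_clk is taken as 1 so that the result is in clock cycles.

```python
def cpu_time(ic, cpi_int, mapi, miss_rate, miss_penalty, t_clk):
    """T_CPU = IC * (CPI_int + MAPI * MR * MP) * T_clk."""
    # Average number of memory stall cycles added to every instruction.
    stall_cycles_per_instruction = mapi * miss_rate * miss_penalty
    return ic * (cpi_int + stall_cycles_per_instruction) * t_clk


# Illustrative values only: 1 million instructions, CPI_int = 1.11,
# MAPI = 1.3, MR = 1 %, MP = 84 cycles, T_clk = 1 (result in cycles).
t = cpu_time(ic=1_000_000, cpi_int=1.11, mapi=1.3,
             miss_rate=0.01, miss_penalty=84, t_clk=1)
print(f"T_CPU = {t:.0f} cycles")
```

With these numbers the memory stalls (1.3 * 0.01 * 84 = 1.092 cycles per instruction) are about as large as CPI_int itself.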

Note: The above equations are a first approximation, useful for understanding basic trends.
For a more exact calculation for a given memory subsystem configuration, it is necessary to break the MR*MP and MAPI*MR*MP products down in detail into individual components. (Some calculation examples can be found in the lecture slides.)

B) Cache simulator Dinero IV

Dinero IV is a so-called trace-driven simulator that can be used to determine MR and MAPI for a given memory hierarchy configuration.
(Another category is formed by execution-driven simulators, which take a real program in binary or text form as their input. Trace-driven simulators are simpler and faster. Execution-driven simulators are more accurate; some of them are able to simulate an entire computer including the operating system. Examples of execution-driven simulators are SimpleScalar, SimOS, Simics and RSIM.)

Dinero IV needs a trace file. It is essentially a log of actual memory activity of a given program (instruction fetching, data reads, data writes, cache flushes). Several input formats exist for Dinero IV. These formats differ by instruction encoding and their accuracy. Most of the time, we'll use the trace files of SPECint95 benchmarks in the Dinero III format.
(For a serious simulation, one should use a more extensive set of benchmarks as well as input data - on the order of billions of instructions).

The Dinero III format is a text file. Each line corresponds to one memory access:

type address

where type is
0 read data.
1 write data.
2 instruction fetch
3 escape record (treated as unknown access type).
4 escape record (causes cache flush).

Address is a hexadecimal number between 0 and ffffffff.

Example:
1 70f22c84
1 70f22c80
2 136a
0 200
1 280
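
To show how such a trace can be read, the following minimal Python sketch parses Dinero III lines and counts the accesses of each type. The helper names parse_din_line and count_accesses are made up for this illustration and are not part of Dinero IV; the sample lines are the five records from the example above.

```python
from collections import Counter

# Access type codes of the Dinero III format.
ACCESS_TYPES = {
    0: "read data",
    1: "write data",
    2: "instruction fetch",
    3: "escape record (unknown access type)",
    4: "escape record (cache flush)",
}


def parse_din_line(line):
    """Split one trace line into (access type, address as an integer)."""
    type_field, address_field = line.split()[:2]
    return int(type_field), int(address_field, 16)  # address is hexadecimal


def count_accesses(lines):
    """Count how many records of each access type the trace contains."""
    counts = Counter()
    for line in lines:
        if line.strip():  # skip empty lines
            access_type, _address = parse_din_line(line)
            counts[access_type] += 1
    return counts


sample = ["1 70f22c84", "1 70f22c80", "2 136a", "0 200", "1 280"]
counts = count_accesses(sample)
for access_type in sorted(counts):
    print(f"{ACCESS_TYPES[access_type]}: {counts[access_type]}")
```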

The Dinero III format assumes that every memory access deals with a 32b word (instruction or data).
The more accurate Dinero IV format allows specifying data size in individual accesses; therefore, simulations that use this format are more realistic.
The actual configuration of the memory subsystem is specified by command-line options. The complete list of options is printed by the -help option.

The most important options are:

-lN-Tbsize P Sets the block size of the level N, type T cache to P bytes
-lN-Tsize P Sets the capacity of the level N, type T cache to P bytes
-lN-Tassoc U Sets the associativity of the level N, type T cache to U
-lN-Trepl C Sets the block replacement policy of the level N, type T cache to C (l=LRU, f=FIFO, r=random). Default is LRU.

N specifies the cache level (1 – 5)
T specifies the cache type (i ... instruction, d... data, u ... unified)

Other useful options are
-stat-idcombine Combined statistics of instruction and data cache
-lN-Tccc Splits the misses into compulsory, capacity and conflict ones for level N type T cache

Example command line:

dineroIV -l1-usize 16k -l1-ubsize 16 -l1-uassoc 4 -l1-uccc -informat d <tex.din >tex1.rpt

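Simulates an L1 unified cache (l1-u) of size 16 KB with a block size of 16 B, 4-way set associative, with LRU replacement (default) and write-back (default); misses are classified into compulsory, capacity and conflict ones (-l1-uccc).
The input format is d (Dinero III). The trace file is tex.din. The result is stored in tex1.rpt.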
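
When many configurations have to be simulated, the command lines can be generated instead of typed. The Python sketch below only builds the command strings (it does not run the simulator), using the options described above. The function name dinero_command and the report file names tex_b16.rpt etc. are hypothetical.

```python
def dinero_command(size, block_size, assoc, trace_file, report_file):
    """Build a dineroIV command line for a unified L1 cache."""
    return (f"dineroIV -l1-usize {size} -l1-ubsize {block_size} "
            f"-l1-uassoc {assoc} -informat d <{trace_file} >{report_file}")


# One command per block size; the report names are made up for the example.
for block_size in (16, 32, 64):
    print(dinero_command("16k", block_size, 4, "tex.din",
                         f"tex_b{block_size}.rpt"))
```

Each printed line can be pasted into a shell or into a script; keep each command on a single line.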

After execution, the tex1.rpt file contains the following :

---Dinero IV cache simulator, version 7
---Written by Jan Edler and Mark D. Hill
---Copyright (C) 1997 NEC Research Institute, Inc. and Mark D. Hill.
---All rights reserved.
---Copyright (C) 1985, 1989 Mark D. Hill. All rights reserved.
---See -copyright option for details

---Summary of options (-help option gives usage information).

-l1-usize 16384
-l1-ubsize 16
-l1-usbsize 16
-l1-uassoc 4
-l1-urepl l
-l1-ufetch d
-l1-uwalloc a
-l1-uwback a
-l1-uccc
-skipcount 0
-flushcount 0
-maxcount 0
-stat-interval 0
-informat d
-on-trigger 0x0
-off-trigger 0x0

---Simulation begins.
---Simulation complete.

l1-ucache

Metrics                    Total    Instrn      Data      Read     Write      Misc
-----------------         ------    ------    ------    ------    ------    ------
Demand Fetches           8324761   5973081   2351680   1306550   1045130         0
Fraction of total         1.0000    0.7175    0.2825    0.1569    0.1255    0.0000

Demand Misses              24661       872     23789      5076     18713         0
Demand miss rate          0.0030    0.0001    0.0101    0.0039    0.0179    0.0000
Compulsory misses           2438        52      2386       516      1870         0
Capacity misses            22196       820     21376      4533     16843         0
Conflict misses               27         0        27        27         0         0
Compulsory fraction       0.0989    0.0596    0.1003    0.1017    0.0999    0.0000
Capacity fraction         0.9000    0.9404    0.8986    0.8930    0.9001    0.0000
Conflict fraction         0.0011    0.0000    0.0011    0.0053    0.0000    0.0000

Multi-block refs               0
Bytes From Memory         394576
( / Demand Fetches)       0.0474
Bytes To Memory           299424
( / Demand Writes)        0.2865
Total Bytes r/w Mem       694000
( / Demand Fetches)       0.0834

---Execution complete.

In this listing, the most important data are in the “Demand miss rate” line. It gives the combined MR for instructions and data as well as the separate MR for instructions, data, reads and writes. The “Fraction of total” line is also important because it lets you find out the share of Load and Store instructions. Notice that 72% of memory accesses are related to instructions and 28% are data accesses. The ratio of “Read” and “Write” accesses to the number of instruction accesses determines the percentage of “Load” and “Store” instructions in the given program. (The percentage of “Load” instructions is 1306550 / 5973081 = 22%.)
Also notice that the misses are divided into compulsory, capacity, and conflict (or collision) misses.
The number of bytes transferred between the processor and the memory can be used to determine the load on the system bus.
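
The quantities needed for the performance equations can also be extracted from the report automatically. The following Python sketch is one possible way to do it, assuming the report rows are whitespace-separated as in the listing; the helper name row_values is made up for this illustration. It takes the “Demand Fetches” row and computes MAPI and the Load/Store percentages.

```python
# Two rows copied from the report; values are whitespace-separated.
report = """\
Demand Fetches    8324761 5973081 2351680 1306550 1045130 0
Demand miss rate  0.0030 0.0001 0.0101 0.0039 0.0179 0.0000
"""

COLUMNS = ("total", "instrn", "data", "read", "write", "misc")


def row_values(report_text, label):
    """Return the six numbers of the row starting with label as a dict."""
    for line in report_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(label):
            fields = stripped[len(label):].split()
            return dict(zip(COLUMNS, (float(f) for f in fields)))
    raise ValueError(f"row {label!r} not found in the report")


fetches = row_values(report, "Demand Fetches")
miss_rate = row_values(report, "Demand miss rate")["total"]

# MAPI = 1 + LSR = all memory accesses / instruction fetches.
mapi = fetches["total"] / fetches["instrn"]
load_rate = fetches["read"] / fetches["instrn"]
store_rate = fetches["write"] / fetches["instrn"]

print(f"MAPI = {mapi:.3f}, MR = {miss_rate:.4f}")
print(f"Load = {load_rate:.1%}, Store = {store_rate:.1%}")
```

The number of instruction fetches in the “Instrn” column can serve as IC in the CPU performance equation.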

C) Running the DineroIV Simulator on the Computers in the Department

At this time, the program can be run on the dual.felk.cvut.cz (2x PIII@550 MHz) and hwlog.felk.cvut.cz (P4@3.2 GHz with HT) machines.

To run the program, follow this procedure:
1. Use PuTTY (Accessories-Communications-Putty) to log in to dual.felk.cvut.cz or hwlog.felk.cvut.cz in SSH mode.

To log in to hwlog.felk.cvut.cz, open the Connection/SSH menu in PuTTY and set
"Preferred SSH protocol version" to 2!

2. Copy the contents of /opt/vlsi/aps/dinero_vzor into a working directory, e.g. with the commands below. Note that there is a space between the star and the dot in the cp command.
mkdir dinero
cd dinero
cp -r /opt/vlsi/aps/dinero_vzor/* .

3. Now it should be possible to edit the run_dinero script:

vi run_dinero

Alternatively, copy the run_dinero script with WinSCP2 to a working directory on your own computer.

run_dinero is a script that allows you to run DineroIV in various configurations.
Edit this file as needed. Take care to keep all the command-line options on a single line (watch out for line breaks in the editor!).

The following trace files are available for testing:

spice.din – SPICE program – analog simulator of electric circuits (SPECfp95), DineroIII format
cc1.din – CC program - C preprocessor (SPECint95), DineroIII format
eon - EON program (SPECfp2000), SBC format

If you want to install DineroIV on your computer at home, use these instructions.

D) Assignments

Perform the following measurements on the benchmark assigned to you by your teacher. Your answers are due in the 12th week at the latest, in Excel format; mail them to your teacher. The file must contain your names!

1. Measure how MR and AMAT depend on the block size (other conditions being the same).
For the measurements, use a unified L1 cache of size 1KB, 4KB and 8KB, direct-mapped (one-way set associative).
The block sizes are 16B, 32B, 64B, 128B and 256B.

- Graph MR=f(block size) for various cache sizes.
- Graph AMAT = f(block size) for various cache sizes (time is in CPU cycles).
Use HT = 1 and MP from the table below (MP = 80 + block size / 8B):

Block size	16B	32B	64B	128B	256B
MP	82	84	88	96	112

Is it possible to find an optimal block size for the given cache?
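
The MP values in the table follow directly from the formula 80 + block size / 8B. As a starting point for the evaluation, the Python sketch below reproduces the table and provides a helper that turns measured miss rates into AMAT values. The function names miss_penalty and amat_by_block_size are made up for this illustration, and the miss rates must come from your own Dinero runs.

```python
def miss_penalty(block_bytes):
    """Miss penalty in CPU cycles: 80 + block size / 8 B."""
    return 80 + block_bytes // 8


def amat_by_block_size(miss_rates, hit_time=1):
    """Map {block size in bytes: measured MR} to {block size: AMAT}."""
    return {block: hit_time + mr * miss_penalty(block)
            for block, mr in miss_rates.items()}


# Reproduce the MP row of the table.
for block in (16, 32, 64, 128, 256):
    print(f"block size {block:>3} B: MP = {miss_penalty(block)} cycles")
```

Calling amat_by_block_size with a dictionary of your measured miss rates for one cache size gives the AMAT values to plot for that size.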

2. Measure how MR depends on associativity.
Perform the measurements on a unified L1 cache of size 4KB, 8KB, 16KB and 32KB.
The block size is 32B (MP=84); the associativity is s = 1, 2, 4, 8 and fully associative.
- Graph MR = f(associativity) for various cache sizes
Is it true that a cache of size N and s=1 has the same MR as a cache of size N/2 and s=2 ?

- Graph T_CPU = f (associativity) for various cache sizes (omit the fully associative cache).
For IC, MAPI see the output of Dinero, MP=84, CPI_int = 1.11

Associativity	1	2	4	8
T_clk	1	1.36	1.44	1.52

(Assume T_clk = HT - e.g. a pipelined DLX processor)

3. Assuming that other parameters are constant, fill in “increases”, “decreases” and “does not change” into the following table.

Cache parameter	Hit Time	Miss Rate	Miss Penalty
If Cache capacity increases ...
If block size increases ...
If associativity increases ...

In some cases, the relationship is complex. In such a case, fill in “changes” in the table and describe the relationship verbally.