Computer Architectures
Experiments with Cache Simulator Dinero IV
A) Performance Evaluation of Memory Hierarchy
Average Memory Access Time
AMAT = HT + MR * MP
AMAT ... Average Memory Access Time
HT ... Hit Time = cache access time if the item is present (Cache Hit).
MR ... Miss Rate
MP ... Miss Penalty = access time if the item is not present in cache (Cache Miss)
The equation states that every access costs the cache access time (HT).
In the fraction MR of accesses where the requested item is not found,
this value is increased by the access time of the next level of the
memory hierarchy (MP). In a two-level memory system, MP corresponds
to the main memory access time.
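As a quick numeric check of the formula, the following Python sketch evaluates AMAT for one set of values. The numbers used (HT of 1 cycle, MR of 2 %, MP of 100 cycles) are illustrative assumptions, not measurements.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: AMAT = HT + MR * MP (times in CPU cycles)."""
    return hit_time + miss_rate * miss_penalty


# Illustrative values only (assumed, not measured):
# HT = 1 cycle, MR = 2 %, MP = 100 cycles.
example = amat(hit_time=1, miss_rate=0.02, miss_penalty=100)
print(f"AMAT = {example:.2f} cycles")
```

With these assumed values, the miss term (2 cycles on average) is twice the hit time, which is why even a small miss rate matters.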
CPU Performance Equation
TCPU = IC * (CPIint + MAPI * MR * MP) * Tclk
IC ... Instruction Count of the given program
CPIint ... CPI assuming an ideal memory hierarchy
(CPIint = CPIpipeline_ideal + Stalls Per Instruction)
MAPI ... Memory Accesses Per Instruction (MAPI = 1 + LSR);
the 1 accounts for fetching the instruction from memory, and LSR
(Load/Store Rate) gives the frequency of occurrence of Load/Store instructions
MR ... Miss Rate
MP ... Miss Penalty = access time if the item is not found in cache (Cache Miss)
Evaluation using TCPU is more accurate since it corresponds
to the program execution time. At first glance, the missing HT may
seem confusing. However, the cache access time (Hit Time) is already
included in Tclk, assuming that the cache access corresponds
to a single pipeline stage (see e.g. IF and MEM). If the memory
access takes several pipeline stages (see superpipelining), it is
accounted for in Tclk and CPIint.
The value of CPIint is obtained from simulation, e.g. in
WinDLX, assuming an ideal cache. Further wait states accounting for
cache misses (MAPI * MR * MP) are added to this value.
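The CPU performance equation can be sketched in Python as follows. All input values are illustrative assumptions (not measured data); the function and variable names are chosen for this example only.

```python
def cpu_time(ic, cpi_int, mapi, miss_rate, miss_penalty, t_clk):
    """TCPU = IC * (CPIint + MAPI * MR * MP) * Tclk."""
    stall_cpi = mapi * miss_rate * miss_penalty  # memory stall cycles per instruction
    return ic * (cpi_int + stall_cpi) * t_clk


# Illustrative values only (assumed, not measured).
ic = 1_000_000        # instructions
cpi_int = 1.2         # CPI with an ideal memory hierarchy
mapi = 1.3            # 1 instruction fetch + LSR of 0.3
miss_rate = 0.02
miss_penalty = 100    # cycles
t_clk = 1.0           # clock period in arbitrary time units

total = cpu_time(ic, cpi_int, mapi, miss_rate, miss_penalty, t_clk)
print(f"stall CPI = {mapi * miss_rate * miss_penalty:.2f}")
print(f"TCPU      = {total:.0f} time units")
```

In this assumed configuration the memory stalls (2.6 cycles per instruction) outweigh CPIint (1.2), which illustrates why the stall term cannot be neglected.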
B) Cache simulator Dinero IV
Dinero IV is a so-called trace-driven simulator that can be
used to determine MR and MAPI
for a given memory hierarchy configuration.
(Another category is formed by execution-driven simulators that
use a real program in binary or text form as their input. Trace-driven
simulators are simpler and faster. Execution-driven
simulators are more accurate; some of them are able to simulate an
entire computer including the operating system
- such as SimpleScalar, SimOS, Simics, RSIM.)
Dinero IV needs a trace file. It is essentially a log of the actual
memory activity of a given program
(instruction fetching, data reads, data writes, cache flushes). Several
input formats exist for Dinero IV. These formats differ by instruction
encoding and their accuracy. Most of the time, we'll use the trace files
of SPECint95 benchmarks in the Dinero III format.
(For a serious simulation, one should use a more extensive
set of benchmarks as well as input data - on the order of billions of
instructions.)
The Dinero III format is a text file. Each line corresponds to one
memory access:
type address
where type is
0 read data.
1 write data.
2 instruction fetch
3 escape record (treated as unknown access type).
4 escape record (causes cache flush).
Address is a hexadecimal number between 0 and ffffffff
Example:
1 70f22c84
1 70f22c80
2 136a
0 200
1 280
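To make the record layout concrete, here is a minimal Python sketch that parses "type address" lines and counts the accesses by type. The function names are hypothetical helpers written for this example; they are not part of Dinero.

```python
from collections import Counter

# Access-type labels from the Dinero III format description.
ACCESS_TYPES = {
    0: "read data",
    1: "write data",
    2: "instruction fetch",
    3: "escape (unknown access type)",
    4: "escape (cache flush)",
}


def parse_din_line(line):
    """Parse one 'type address' record; return (type, address) as integers."""
    fields = line.split()
    access_type = int(fields[0])
    address = int(fields[1], 16)  # the address is hexadecimal
    if access_type not in ACCESS_TYPES:
        raise ValueError(f"unknown access type {access_type}")
    return access_type, address


def count_accesses(lines):
    """Count records per access type, skipping blank lines."""
    counts = Counter()
    for line in lines:
        if line.strip():
            access_type, _ = parse_din_line(line)
            counts[access_type] += 1
    return counts


# The five records from the example.
sample = ["1 70f22c84", "1 70f22c80", "2 136a", "0 200", "1 280"]
for access_type, n in sorted(count_accesses(sample).items()):
    print(f"{ACCESS_TYPES[access_type]}: {n}")
```

For the five example records this reports one data read, three data writes and one instruction fetch.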
The most important options have the form -lN-Tname
(for example -lN-Tsize, -lN-Tbsize and -lN-Tassoc), where:
N specifies the cache level (1 – 5)
T specifies the cache type (i ... instruction, d ... data, u ... unified)
Other useful options are:
-stat-idcombine   Combined statistics of instruction and data cache
-lN-Tccc          Splits the misses into compulsory, capacity and conflict ones for a level N, type T cache
Example command line:
dineroIV -l1-usize 16k -l1-ubsize 16 -l1-uassoc 4 -l1-uccc -informat d < tex.din > tex1.rpt
Simulates an L1 unified cache (l1-u) of size 16 KB,
block size 16B, 4-way set associative, with LRU replacement
(default) and write-back (default).
The input format is d (DineroIII). The trace file is tex.din.
The result is stored in tex1.rpt.
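When many configurations have to be simulated, it can help to generate the command line instead of typing it. The following Python sketch assembles the argument list from the options used in the example; dinero_args is a hypothetical helper, and it only builds the string without running the simulator.

```python
def dinero_args(level, ctype, size, bsize, assoc, ccc=False, informat="d"):
    """Build a dineroIV argument list from the options shown in the example."""
    prefix = f"-l{level}-{ctype}"
    args = [
        "dineroIV",
        f"{prefix}size", str(size),
        f"{prefix}bsize", str(bsize),
        f"{prefix}assoc", str(assoc),
    ]
    if ccc:
        args.append(f"{prefix}ccc")  # split misses into the three categories
    args += ["-informat", informat]
    return args


args = dinero_args(1, "u", "16k", 16, 4, ccc=True)
print(" ".join(args) + " < tex.din > tex1.rpt")
```

The printed line reproduces the example command and can be pasted into a shell script, with all options on a single line.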
After execution, the tex1.rpt file contains the following:
---Dinero IV cache simulator, version 7
--Written by Jan Edler and Mark D. Hill
--Copyright (C) 1997 NEC Research Institute, Inc. and Mark D. Hill.
--All rights reserved.
--Copyright (C) 1985, 1989 Mark D. Hill. All rights reserved.
--See -copyright option for details
---Summary of options (-help option gives usage information).
-l1-usize 16384
-l1-ubsize 16
-l1-usbsize 16
-l1-uassoc 4
-l1-urepl l
-l1-ufetch d
-l1-uwalloc a
-l1-uwback a
-l1-uccc
-skipcount 0
-flushcount 0
-maxcount 0
-stat-interval 0
-informat d
-on-trigger 0x0
-off-trigger 0x0
---Simulation begins.
---Simulation complete.
l1-ucache
Metrics                 Total     Instrn       Data       Read      Write       Misc
-----------------      ------     ------     ------     ------     ------     ------
Demand Fetches        8324761    5973081    2351680    1306550    1045130          0
Fraction of total      1.0000     0.7175     0.2825     0.1569     0.1255     0.0000
Demand Misses           24661        872      23789       5076      18713          0
Demand miss rate       0.0030     0.0001     0.0101     0.0039     0.0179     0.0000
Compulsory misses        2438         52       2386        516       1870          0
Capacity misses         22196        820      21376       4533      16843          0
Conflict misses            27          0         27         27          0          0
Compulsory fraction    0.0989     0.0596     0.1003     0.1017     0.0999     0.0000
Capacity fraction      0.9000     0.9404     0.8986     0.8930     0.9001     0.0000
Conflict fraction      0.0011     0.0000     0.0011     0.0053     0.0000     0.0000
Multi-block refs            0
Bytes From Memory      394576
( / Demand Fetches)    0.0474
Bytes To Memory        299424
( / Demand Writes)     0.2865
Total Bytes r/w Mem    694000
( / Demand Fetches)    0.0834
---Execution complete.
In this listing, the most important data are in the line “Demand miss rate”.
It specifies the combined MR for data and instructions as well as MR
for instructions, data, reads and writes.
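When many reports have to be processed, the miss rates can be extracted automatically. The Python sketch below assumes that each metric occupies one line of the report, with the metric name followed by six numeric columns (Total, Instrn, Data, Read, Write, Misc); metric_row is a hypothetical helper, not part of Dinero.

```python
import re

COLUMNS = ["Total", "Instrn", "Data", "Read", "Write", "Misc"]


def metric_row(report_text, name):
    """Return {column: value} for the first report line that starts with `name`."""
    for line in report_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(name):
            numbers = re.findall(r"\d+(?:\.\d+)?", stripped[len(name):])
            if len(numbers) == len(COLUMNS):
                return dict(zip(COLUMNS, map(float, numbers)))
    raise KeyError(name)


# Three lines copied from the sample listing.
report = """\
Demand Fetches 8324761 5973081 2351680 1306550 1045130 0
Demand Misses 24661 872 23789 5076 18713 0
Demand miss rate 0.0030 0.0001 0.0101 0.0039 0.0179 0.0000
"""

rates = metric_row(report, "Demand miss rate")
print(rates["Total"], rates["Instrn"], rates["Data"])
```

In practice the report text would be read from the .rpt file. The miss rate is printed with only four decimal places, so for higher precision it can be recomputed as Demand Misses divided by Demand Fetches.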
The line “Fraction of total” is also important because it lets you
find out the number of Load and Store instructions.
Notice that 72% of memory accesses are instruction fetches and 28% are
data accesses. The ratio of “Read” and “Write” accesses to the number of
instruction accesses determines the percentage of “Load” and “Store”
instructions in the given program.
(The percentage of “Load” instructions is 1306550 / 5973081 = 22 %.)
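The same calculation can be extended to the Store percentage, LSR and MAPI. The Python sketch below uses the demand fetch counts from the sample listing and assumes that every instruction is fetched exactly once and that every Load or Store makes exactly one data access.

```python
# Demand fetch counts taken from the sample listing.
instr_fetches = 5973081   # "Instrn" column, taken as the instruction count IC
data_reads = 1306550      # "Read" column, taken as the number of Loads
data_writes = 1045130     # "Write" column, taken as the number of Stores

load_fraction = data_reads / instr_fetches
store_fraction = data_writes / instr_fetches
lsr = load_fraction + store_fraction   # Load/Store Rate
mapi = 1 + lsr                         # Memory Accesses Per Instruction

print(f"Load  fraction = {load_fraction:.3f}")
print(f"Store fraction = {store_fraction:.3f}")
print(f"LSR            = {lsr:.3f}")
print(f"MAPI           = {mapi:.3f}")
```

Under these assumptions MAPI equals the total number of demand fetches divided by the number of instruction fetches.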
Also notice that the misses are divided into compulsory, capacity, and
conflict (or collision) misses.
The number of bytes transferred between the processor and the memory
can be used to determine the load of the system bus.
C) Running the DineroIV Simulator on the Computers in the Department
At this time, the program can be run on the dual.felk.cvut.cz
(2x PIII@550 MHz) and hwlog.felk.cvut.cz (P4@3.2 GHz with HT) machines.
To run the program, follow this procedure:
1. Use PuTTY (Accessories-Communications-Putty) to log
in to dual.felk.cvut.cz or hwlog.felk.cvut.cz in SSH mode.
To log in to hwlog.felk.cvut.cz, open the Connection/SSH menu in PuTTY
and set "Preferred SSH protocol version" to 2.
2. Copy the contents of /opt/vlsi/aps/dinero_vzor into a
work directory, e.g.:
mkdir dinero
cd dinero
cp -r /opt/vlsi/aps/dinero_vzor/* .
Note that there is a space between the star and the dot in the cp command.
3. Now it should be possible to edit the run_dinero script in a text editor.
Run_dinero is a script that allows you to run DineroIV in various
configurations.
Edit this file as needed. Take care to keep all the command line
options on a single line (watch out for line breaks in the editor!).
The following trace files are available for testing:
spice.din – SPICE program – analog simulator of electric circuits (SPECfp95), DineroIII format
cc1.din – CC program – C preprocessor (SPECint95), DineroIII format
eon – EON program (SPECfp2000), SBC format
If you want to install DineroIV on your computer at home, use these
instructions.
D) Assignments
Perform the following measurements on the benchmark assigned to you
by your teacher. Your answers are due in the 12th week at the latest,
in Excel format; mail them to your teacher. The file must contain your names!
1. Measure how MR and AMAT depend on the block size (other
conditions are the same).
For the measurements, use a unified L1 cache, size 1KB, 4KB and 8KB,
direct-mapped (one-way set associative).
Block size is 16B, 32B, 64B, 128B, 256B.
- Graph MR=f(block size) for various cache sizes.
- Graph AMAT= f(block size) for various cache sizes
(time is in CPU cycles)
HT = 1, MP as given in the table (MP = 80 + block size / 8B):
Block size | 16B | 32B | 64B | 128B | 256B |
---|---|---|---|---|---|
MP | 82 | 84 | 88 | 96 | 112 |
Is it possible to find an optimal block size for the given cache?
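Once the miss rates have been measured, the AMAT values for the graph can be computed with a few lines of Python. The sketch below uses HT and the MP formula from the assignment; the miss rates in it are placeholders invented for illustration and must be replaced by the values reported by Dinero.

```python
HIT_TIME = 1  # cycles, as given in the assignment


def miss_penalty(block_size):
    """MP in cycles for a block size in bytes: 80 + block_size / 8."""
    return 80 + block_size // 8


def amat(miss_rate, block_size):
    """AMAT = HT + MR * MP for the given block size."""
    return HIT_TIME + miss_rate * miss_penalty(block_size)


# Placeholder miss rates (assumed for illustration, NOT measured);
# replace them with the "Demand miss rate" values reported by Dinero.
measured_mr = {16: 0.050, 32: 0.040, 64: 0.035, 128: 0.036, 256: 0.045}

for bs, mr in measured_mr.items():
    print(f"{bs:4d}B  MP = {miss_penalty(bs):3d}  AMAT = {amat(mr, bs):.3f}")

best = min(measured_mr, key=lambda bs: amat(measured_mr[bs], bs))
print("lowest AMAT at block size", best)
```

Because MP grows with the block size, the block size with the lowest MR is not necessarily the one with the lowest AMAT.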
2. Measure how MR depends on associativity.
Perform the measurements on a unified L1 cache, size 4KB, 8KB, 16KB, 32KB;
block size is 32B (MP=84); associativity s = 1, 2, 4, 8 and fully associative.
- Graph MR = f(associativity) for various cache sizes
Is it true that a cache of size N and s=1 has the same MR as a cache of
size N/2 and s=2 ?
- Graph TCPU = f(associativity) for various cache sizes (omit the fully
associative cache).
For IC and MAPI, see the output of Dinero; MP = 84, CPIint = 1.11.
Associativity | 1 | 2 | 4 | 8 |
---|---|---|---|---|
Tclk | 1 | 1.36 | 1.44 | 1.52 |
(Assume Tclk = HT - e.g. a pipelined DLX processor)
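The TCPU values for the graph can be tabulated with a short Python sketch like the one below. Tclk, CPIint and MP come from the assignment; IC, MAPI and the miss rates are placeholders invented for illustration and must be replaced by the values obtained from Dinero.

```python
# Clock period per associativity, from the table above.
T_CLK = {1: 1.0, 2: 1.36, 4: 1.44, 8: 1.52}
CPI_INT = 1.11
MP = 84  # cycles, for a block size of 32B


def cpu_time(ic, mapi, miss_rate, assoc):
    """TCPU = IC * (CPIint + MAPI * MR * MP) * Tclk for a given associativity."""
    return ic * (CPI_INT + mapi * miss_rate * MP) * T_CLK[assoc]


# Placeholder inputs (assumed, NOT measured): take IC, MAPI and MR from Dinero.
ic = 1_000_000
mapi = 1.4
miss_rates = {1: 0.040, 2: 0.030, 4: 0.027, 8: 0.026}

for assoc, mr in miss_rates.items():
    print(f"s={assoc}: TCPU = {cpu_time(ic, mapi, mr, assoc):,.0f}")
```

The sketch makes the trade-off visible: a higher associativity lowers MR but lengthens Tclk, so TCPU can get worse even when the miss rate improves.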
3. Assuming that other parameters are constant, fill in “increases”,
“decreases” and “does not change” into the following table.
Cache parameter | Hit Time | Miss Rate | Miss Penalty |
---|---|---|---|
If cache capacity increases ... | | | |
If block size increases ... | | | |
If associativity increases ... | | | |
In some cases, the relationship is complex. In such a case, fill in
“changes” in the table and describe the relationship verbally.
© Miloš Bečvář, 2005