Towards cache-optimal address allocation:
How fast could your code have run if you had known where to allocate memory?

Alexander Miller & Mario Preishuber

( a.miller@cs.uni-salzburg.at & m.preishuber@cs.uni-salzburg.at )

Computational Systems Group Department of Computer Sciences University of Salzburg

Problem Statements

Alexander:
Given a trace T of load and store instructions
find an algorithm that changes the memory addresses in the instructions such that
the resulting trace T′ is semantically equivalent to T but
performs equal or better than T for a given cache model C.
Mario:
Given a trace T of load and store instructions
find metrics that characterize the trace performance for a given cache model C and
implement an execution engine for computing their quantities.

Hardware Model

hardware model
The cache is a memory that can be accessed faster but is smaller in size than main memory. It is transparent to the CPU. The smallest unit of data in the cache are cache lines.

Performance

Memory Access Type

Costs in Cycles

cache load 1
cache store 1
memory load 5
memory store 5

Temporal Performance
(cycles per access [CPA])

T ... trace and C ... cache
$$ \text{CPA}(T,C) = \frac{\text{Sum of cycles}(T, C)}{\text{Sum of accesses}(T)} $$

Memory Access Trace

// sum three numbers, sum = x + y + z
void main() {
  int sum, x, y, z;
  sum = 0;
  x = 1;
  y = 2;
  sum = sum + x;
  z = 3;
  sum = sum + y;
  sum = sum + z;
}



// memory access trace (length 13)
store &sum  // sum = 0
store &x    // x = 1
store &y    // y = 2
load  &sum  // sum = sum + x
load  &x
store &sum
store &z    // z = 3
load  &sum  // sum = sum + y
load  &y
store &sum
load  &sum  // sum = sum + z
load  &z
store &sum
A memory access trace is a sequence of main memory accesses, recorded from the execution of a program. A memory access is a tuple consisting of access type, and the address accessed.

Liveness



definition liveness

An address is live beginning from the first store to it and ending in the last load before the next store. The second store marks the beginning of a new liveness interval.

Trace Transformation

original trace
Original Trace
$$ CPA(T_{original}, C_{LRU}) =$$
$$\frac{1*(5 + 4) + 5*(4 + 5)}{13} = 4.15$$
single assignment trace
Single Assignment Trace (SA)
$$ CPA(T_{sa}, C_{LRU}) =$$
$$\frac{1*(2 + 7) + 5*(4 + 7)}{13} = 4.92$$
compact trace
Compact Trace
$$ CPA(T_{compact}, C_{LRU}) =$$
$$\frac{1*(2 + 6) + 5*(4 + 4)}{13} = 3.69$$
Assume a LRU cache with 2 cache lines, each cache line fits exactly one variable (e.g., A, B, C, ...).
cache loads cache stores memory loads memory stores

Experiments

cpa: omnetpp benchmark results
cpa: omnetpp gobmk results
cpa: xalancbmk benchmark results
cache loads cache stores memory loads memory stores
cache size=32KB, cache line size=64byte