Constructing Instruction Traces from Cache-filtered Address Traces (CITCAT)

a research proposal by Charlton D. Rose
January 16, 1998

Introduction and Tutorial

Every second, a typical processor in a desktop computer completes hundreds of millions of simple operations called instructions. Because instructions are the basic elements of computer programs, researchers are interested in ways to make them execute faster and more efficiently.

An instruction trace is a "journal" of all the instructions completed by a processor, between time a and time b, as it performs a task. Instruction traces are valuable to researchers because they provide statistical information about the demands placed on a processor while it performs various tasks. This data makes it possible to identify deficiencies in the system, and guides researchers as they revise the processor, the hardware that supports it, or the programs that run on it.

Short instruction traces have limited value because they reflect the processor's behavior in a limited context. Long instruction traces, on the other hand, are more useful but difficult to obtain. Unfortunately, the storage requirements for even 10 seconds-worth of instruction trace data are enormous. Even if the storage device being used has the capacity to hold a very long trace, it probably cannot record the data as fast as it is produced. This is one of the reasons why long, accurate instruction traces are so difficult to obtain.

Some trace collection techniques simply record data as fast as possible, and when they get behind, they allow some data to be skipped until they are caught up. Other trace collection techniques actually pause the processor when they get behind, and then restart it when they are ready for more. Both techniques, however, fail to produce long, accurate instruction traces because they either lose data or interfere with the processor's normal execution.

For each instruction executed by a processor, the processor must make one or more references to its memory. This would lead one to believe that a trace of the processor's memory accesses is even more difficult to obtain than an instruction trace. However, it is not. Most processors are equipped with a special device called a "cache," an intermediate storage device that lies between the processor and its main memory. Since a processor's memory requests are usually localized and repetitive, its cache serves as the processor's "short term memory," effectively shielding the main memory banks from most of the processor's requests. On occasion, when the processor needs data that isn't in the cache, the cache behaves like a proxy, retrieving the data from the main memory banks and then returning it to the processor. The next time the processor asks for that same data, however, the cache will be prepared to respond instantly.

A cache-filtered address trace is a record of a processor's memory requests that were not immediately satisfiable by its cache. Unlike instruction traces, long, accurate cache-filtered address traces are relatively easy to collect.

Proposal

I believe that, under certain conditions, cache-filtered address traces contain enough information to generate long, accurate instruction traces. To pursue this idea, I will develop CITCAT (Constructing Instruction Traces from Cache-filtered Address Traces), a technique which I believe is capable of performing this conversion. CITCAT will combine the best features of several trace collection techniques, including instruction inlining, hardware monitoring, and processor simulation, to produce long, statistically accurate instruction traces, with negligible effects on the processor being traced. A brief introduction to the CITCAT algorithm follows:

Processors are finite state machines. This means that the next state of a processor can always be predicted if its present state and external stimuli are known. By determining the state of the processor at a given point in time, and then capturing a record of all the events that affect it afterwards, one can use this information to simulate, in software, the processor's exact behavior. The simulated environment, in turn, makes it easy to record the processor's activities without altering its behavior or skipping valuable data. If the simulator is accurate, then the simulated processor will perform exactly the same operations as the original processor, so that a trace collected from the simulated processor will be identical to a trace which might have been collected from the real processor (were it possible)! Because both the present state data required to initialize the simulation and the external stimuli data required to drive the simulation can be obtained from a carefully prepared, cache-filtered address trace, it is possible to convert cache-filtered address traces into instruction traces. This is the goal of the CITCAT algorithm.

An interesting side-effect of CITCAT is that because the instruction traces will be computed, rather than stored, CITCAT may result in extremely efficient trace data compression. To distribute a trace to other researchers, for example, all I would have to do is bundle the simulator software, an initial state record, and an event record into a single package, and then allow the recipients to generate the trace by executing the simulation on their own machines. If CITCAT is successful, it is speculated that it may exceed lossless compression levels beyond 500 to 1. This, of course, may turn into a fresh area for future research.

Conclusion

Instruction traces contain data that is valuable to researchers who evaluate and revise computer systems. An informal survey of three major conference proceedings has revealed that:

Approximately 23% of . . . accepted papers at . . . ISCA 24, ASPLOS VII, and MICRO-30 dealing with computer system architecture and performance use trace-driven simulation to acquire their experimental results. Most of the trace data used was representative of scientific workloads such as the integer and floating point SPEC benchmarks. [From this,] two observations can be made. . . . First, there is certainly a need for trace data since nearly a quarter of all the work found worthy of publication at these prestigious conferences use it. Second, there is a real need for trace data representative of other workloads such as transaction processing, real time applications, multimedia applications, teleconferencing, video streams, and entertainment software such as games.

Unfortunately, traditional trace collection techniques have failed to produce long, accurate traces that contain enough information to be called "statistically accurate." As a result, many researchers have been forced to settle for less, hoping (with some uncertainty) that the results of their research are still accurate.

In an effort to remedy this problem, I will develop and explore a new trace collection technique, called "CITCAT," which has the potential to convert cache-filtered address traces — which are easy to collect — into long, accurate instruction traces — which are nearly impossible to collect.

If CITCAT is successful in producing long, accurate instruction traces, it is no understatement to say that researchers everywhere will be "pounding down the doors" to get a hold of these traces. I will explore the strengths and weaknesses of CITCAT and publish the results of my work. If the algorithm proves successful, it will have a dramatic impact on the way traces are collected and distributed in the future.