On x86 and amd64, an easy and cheap way of getting access to a cycle counter is the rdtsc instruction, which returns a 64-bit timestamp count. On ARMv7 machines (read: mostly the Cortex-A series), you can get a similar 32-bit cycle count from the performance monitoring unit (PMU).
Unfortunately, that cycle count is restricted from user mode by default on ARM/Linux, and trying to touch it results in an illegal instruction error. There is various information floating around the net about enabling PMU access, most of which I got from this SO post. However, I never found a comprehensive solution that I could use easily.
My testbed here consists of a Samsung Chromebook (dual core Cortex-A15) running Ubuntu 13.04, and an ODROID-U2 (quad core Cortex-A9) running an Ubuntu/Linaro 12.10 variant.
Preface: compile woes and whatnot
While attempting to compile patdiff on my Chromebook, I ran into a compilation error in core_extended because it used rdtsc in a C file, without an #ifdef guard in sight! Obviously the assembler threw a fit and the compilation failed.
So of course, I set off to fix that on my ARM machines.
Enabling the cycle counter in kernel mode
User-mode access to the cycle counter has to be enabled through a kernel module. Ideally the kernel module should tear things down when it’s unloaded too.
Flipping the bits
I was happy to find that enabling the needed bits on the processor was not very difficult, and the documentation was nicely explanatory.
Essentially, you need to set a flag in a PMU register that is accessed through the system control coprocessor, CP15. The register in question is the User Enable Register (PMUSERENR). The instruction is very long-winded but pretty easy to look up in the ARM manual:
mcr p15, 0, <Rt>, c9, c14, 0
where <Rt> is a register containing either 1, which enables user-mode access, or 0, which disables user-mode access. mcr stands for “Move to coprocessor from ARM register.”
But besides enabling user-mode access to the counters, we also need to enable the counters themselves. One of these is the cycle counter (there are various others, like cache hits and branch predictor stats). We can also set options, but we’ll just use the default options and switch the counters on. In GNU C parlance, we can do this all with:
#define PERF_DEF_OPTS (1 | 16)
...
static void
enable_cpu_counters(void* data)
{
  /* Enable user-mode access to counters (PMUSERENR). */
  asm volatile("mcr p15, 0, %0, c9, c14, 0" :: "r"(1));
  /* Program the PMU control register (PMCR) with the default options. */
  asm volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(PERF_DEF_OPTS));
  /* Enable the cycle counter (bit 31) and event counters 0-3 (PMCNTENSET). */
  asm volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(0x8000000f));
}
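To make the magic numbers above less opaque, here is a small sketch that rebuilds them from named bits. The bit positions follow my reading of the ARMv7 PMU register layout; the macro and function names are made up for illustration and are not part of the module.

```c
#include <stdint.h>

/* PMCR (c9, c12, 0) bits used by PERF_DEF_OPTS. Names are hypothetical. */
#define PMCR_E (1u << 0)   /* enable the counters */
#define PMCR_X (1u << 4)   /* export events to external monitoring */

/* PMCNTENSET (c9, c12, 1): bit 31 is the cycle counter,
   bits 0..n-1 select event counters 0..n-1. */
#define PMCNTEN_CYCLE (1u << 31)

/* Same value as PERF_DEF_OPTS, i.e. (1 | 16). */
uint32_t pmcr_default_opts(void)
{
    return PMCR_E | PMCR_X;
}

/* Enable mask for the cycle counter plus the first n event counters. */
uint32_t pmcnten_mask(unsigned n_event_counters)
{
    uint32_t events = (n_event_counters >= 31)
        ? 0x7fffffffu
        : ((1u << n_event_counters) - 1u);
    return PMCNTEN_CYCLE | events;
}
```

With four event counters, pmcnten_mask(4) gives the 0x8000000f written by the module.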
The function to disable CPU counters is quite similar, just changing some of the indexes into CP15 around.
Doing it across every processor
One important detail to keep in mind when doing all this is that on an SMP system, the kernel threads which run your module code will be scheduled across the various CPUs on your machine.
However, user-mode access to the cycle counter is a per-CPU configuration, because every CPU has its own set of registers, including the PMU ones. This means we have to enable access on every CPU. If we don’t do this, it’s possible that the kernel module’s init function enabling the counters will run on CPU A, and your program accessing the counter will be scheduled to run on CPU B.
In practice this means you’ll confusingly get illegal instruction errors about half the time you run your programs on a dual core machine (although you can set the CPU affinity manually using taskset(1)).
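If you want to pin from inside the program instead of using taskset(1), a rough sketch using sched_setaffinity(2) might look like this. The helper names are hypothetical, and this only works around the problem; it does not replace enabling the counter on every CPU.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Lowest-numbered CPU present in `set`, or -1 if the set is empty. */
int first_cpu_in_set(const cpu_set_t *set)
{
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, set))
            return cpu;
    return -1;
}

/* Pin the calling thread to a single CPU. Returns 0 on success, -1 on error. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Pin to the first CPU we are currently allowed to run on.
   Returns that CPU number, or -1 on failure. */
int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    int cpu = first_cpu_in_set(&allowed);
    if (cpu < 0 || pin_to_cpu(cpu) != 0)
        return -1;
    return cpu;
}
```

Calling pin_to_first_allowed_cpu() at startup keeps every counter read on one CPU, which also avoids comparing counts taken on two different cores.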
Luckily it’s easy to do this in our module: just pass a function pointer to on_each_cpu:
static int __init
init(void)
{
  /* Run on every CPU and wait for completion (last argument). */
  on_each_cpu(enable_cpu_counters, NULL, 1);
  printk(KERN_INFO "[" DRVR_NAME "] initialized\n");
  return 0;
}

static void __exit
fini(void)
{
  on_each_cpu(disable_cpu_counters, NULL, 1);
  printk(KERN_INFO "[" DRVR_NAME "] unloaded\n");
}
Using the cycle counter from user space
Now that we have the cycle counter, we can use it to implement something basically like x86’s rdtsc operation:
static inline uint32_t
rdtsc32(void)
{
#if defined(__GNUC__) && defined(__ARM_ARCH_7A__)
  uint32_t r = 0;
  asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(r) );
  return r;
#else
#error Unsupported architecture/compiler!
#endif
}
While in the kernel module we used the mcr instruction to move from ARM -> Coprocessor register, here we’re moving from Coprocessor -> ARM register via mrc (the cycle count register is also accessed through CP15).
There’s unfortunately no 64-bit cycle counter available from what I could immediately see, but I didn’t look hard. Anyway, after doing this, you can just do your typical dance to count cycles:
uint32_t start_time = 0;
uint32_t end_time = 0;
start_time = rdtsc32();
// ... do expensive thing ...
end_time = rdtsc32();
printf("cycle delta = %u\n", end_time - start_time);
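One thing worth spelling out: because the counter is only 32 bits wide, it wraps around fairly quickly (every 2^32 cycles, so a couple of seconds at GHz clock speeds). The unsigned subtraction above still gives the right answer across a single wrap. Here is a minimal sketch of that, plus a hypothetical helper that accumulates a 64-bit total in software, assuming you sample the counter more often than it wraps.

```c
#include <stdint.h>

/* Cycles elapsed between two 32-bit readings. Unsigned subtraction is
   modulo 2^32, so this stays correct across one wraparound. */
uint32_t cycle_delta32(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Hypothetical software-extended counter: running 64-bit total plus the
   last raw 32-bit reading. */
struct cycles64 {
    uint64_t total;
    uint32_t last;
};

void cycles64_init(struct cycles64 *c, uint32_t now)
{
    c->total = 0;
    c->last = now;
}

/* Feed in a fresh raw reading; returns the accumulated 64-bit count.
   Must be called at least once per wrap period to stay accurate. */
uint64_t cycles64_update(struct cycles64 *c, uint32_t now)
{
    c->total += cycle_delta32(c->last, now);
    c->last = now;
    return c->total;
}
```

In real use you would pass rdtsc32() as the `now` argument.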
Other notes
The PMU on ARM also has two extra things you can toggle. For one, you can reset the cycle counter to zero to get a more accurate measurement.
Another trick is that the PMU can be put into ‘divider’ mode, where the cycle counter will increase once every 64 cycles instead. This allows you to monitor a much larger number of cycles, at the expense of some fine-grained accuracy.
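As a rough sketch of how these two toggles fit into the control register value: as far as I can tell from the ARMv7 manual, the cycle counter reset is bit 2 of PMCR and the divider is bit 3. The names below are made up for illustration.

```c
#include <stdint.h>

/* PMCR bits (hypothetical names; positions per the ARMv7 manual). */
#define PMCR_ENABLE      (1u << 0)  /* counters enabled */
#define PMCR_RESET_CYCLE (1u << 2)  /* write 1 to zero the cycle counter */
#define PMCR_DIV64       (1u << 3)  /* cycle counter ticks every 64 cycles */

/* Build the value to write to PMCR for the chosen options. */
uint32_t pmcr_with_options(int reset_cycles, int div64)
{
    uint32_t v = PMCR_ENABLE;
    if (reset_cycles)
        v |= PMCR_RESET_CYCLE;
    if (div64)
        v |= PMCR_DIV64;
    return v;
}

/* Convert a raw counter reading back into (approximate) cycles. */
uint64_t ticks_to_cycles(uint32_t ticks, int div64)
{
    return div64 ? (uint64_t)ticks * 64u : (uint64_t)ticks;
}
```

With the divider on, the 32-bit counter covers 2^38 cycles before wrapping, but every reading is only accurate to within 64 cycles.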
Doing it the easy way
But really, there’s actually a way of doing this that’s about a billion times easier. Do you know what it is? Use the Linux perf infrastructure. Specifically, the perf_event_open syscall allows you to read the hardware cycle counter in a portable, sane fashion, with no extra kernel module needed.
I did this by using GNU C’s __attribute__((constructor)) and __attribute__((destructor)) routines. The constructor invokes the system call, which returns a file descriptor. We can later read from the file descriptor to get the cycle count from the processor.
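#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

static int fddev = -1;

__attribute__((constructor)) static void
init(void)
{
  /* static, so every field we don't set is zero */
  static struct perf_event_attr attr;
  attr.type = PERF_TYPE_HARDWARE;
  attr.config = PERF_COUNT_HW_CPU_CYCLES;
  /* pid = 0 (this process), cpu = -1 (any CPU), no group, no flags */
  fddev = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

__attribute__((destructor)) static void
fini(void)
{
  close(fddev);
}

static inline long long
cpucycles(void)
{
  long long result = 0;
  /* Compare as signed: read() returns -1 on error, which an unsigned
     comparison against sizeof would silently miss. */
  if (read(fddev, &result, sizeof(result)) != (ssize_t)sizeof(result)) return 0;
  return result;
}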
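For something a bit more defensive, here is a sketch of the same idea with explicit error reporting and the attr.size field filled in. It also sets exclude_kernel, which asks perf to count user-mode cycles only; that is my own addition, not something the code above does. The function names are hypothetical.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Open a hardware cycle counter for the calling thread on any CPU.
   Returns a file descriptor, or -1 if perf is unavailable or not permitted. */
int open_cycle_counter(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.exclude_kernel = 1;  /* assumption: we only care about user-mode cycles */
    attr.exclude_hv = 1;
    /* pid = 0 (this thread), cpu = -1 (any), group_fd = -1, flags = 0 */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Read the current count into *out. Returns 0 on success, -1 on failure. */
int read_cycle_counter(int fd, long long *out)
{
    long long value = 0;
    ssize_t n = read(fd, &value, sizeof(value));
    if (n != (ssize_t)sizeof(value))
        return -1;
    *out = value;
    return 0;
}
```

Returning an error code instead of 0 lets the caller tell "the counter is broken" apart from "zero cycles elapsed".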
There’s some more documentation about perf_event_open here.
However, for small segments of code with few cycles, there is a large difference in accuracy between the two approaches: in the tests included in my code, it is the difference between 300 cycles reported and 4000 reported. The perf_event_open approach involves a system call, which takes a noticeable amount of time by itself, and the count includes the transitions between user and kernel space.
At least, this is my guess. But this is only a relatively constant overhead, and for bigger chunks of code it’s probably not as big a deal. You really need to be running your benchmarks more than once anyway, too.
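If the overhead really is roughly constant, one way to deal with it is to estimate it from back-to-back reads and subtract it, taking the minimum over several runs. The sketch below assumes exactly that; all names are hypothetical, and fake_now/fake_work are stand-ins so it runs anywhere. In practice you would pass a wrapper around rdtsc32() or cpucycles() instead.

```c
#include <stdint.h>

typedef uint64_t (*counter_fn)(void);
typedef void (*work_fn)(void);

/* Smallest delta between two back-to-back reads: an estimate of the
   fixed cost of reading the counter. Requires runs > 0. */
uint64_t read_overhead(counter_fn now, int runs)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = now();
        uint64_t t1 = now();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Minimum cost of work() over `runs` runs, with the read overhead removed. */
uint64_t measure_min(counter_fn now, work_fn work, int runs)
{
    uint64_t overhead = read_overhead(now, runs);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = now();
        work();
        uint64_t t1 = now();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best > overhead ? best - overhead : 0;
}

/* Stand-in counter: every read "costs" 100 cycles. */
static uint64_t fake_clock;
uint64_t fake_now(void) { fake_clock += 100; return fake_clock; }

/* Stand-in workload that "takes" 250 cycles. */
void fake_work(void) { fake_clock += 250; }
```

Taking the minimum rather than the mean throws away runs that were disturbed by interrupts or scheduling, which is usually what you want for short code segments.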
Conclusion
tl;dr just use perf_event_open and save yourself some sanity (hopefully I can get a patch to core_extended using this approach). You can also avoid rogue kernel modules. But if in some insane world you’re writing ARM/Windows or ARM/OSX drivers and need PMU support, this might help (but you’ll still need a driver).
The code for this is all on github.