## Friday, May 3, 2013

### User-mode performance counters for ARM/Linux

On x86 and amd64, an easy and cheap way of getting access to a 64-bit cycle counter is the rdtsc instruction. On ARMv7 machines (read: mostly the Cortex-A series), you can get a similar cycle count from the performance monitoring unit (PMU).

Unfortunately, that cycle counter is off-limits to user mode by default on ARM/Linux, and trying to touch it results in an illegal instruction error. There’s plenty of information floating around the net about enabling PMU access, most of which I got from this SO post. However, I never found a comprehensive solution that I could use easily.

My testbed here consists of a Samsung Chromebook (dual core Cortex-A15) running Ubuntu 13.04, and an ODROID-U2 (quad core Cortex-A9) running an Ubuntu/Linaro 12.10 variant.

# Preface: compile woes and whatnot

While attempting to compile patdiff on my Chromebook, I ran into a compilation error in core_extended due to the fact it used rdtsc in a C file - and without an #ifdef guard in sight! Obviously the assembler threw a fit and the compilation failed.

So of course, I set off to fix that on my ARM machines.

# Enabling the cycle counter in kernel mode

User-mode access to the cycle counter has to be enabled through a kernel module. Ideally the kernel module should tear things down when it’s unloaded too.

## Flipping the bits

I was happy to find out that enabling the needed bits on the processor was not very difficult, and the documentation was nicely explanatory.

Essentially, you need to set a flag in one of the PMU’s registers, which are accessed through the system control coprocessor, CP15. The instruction is long-winded but pretty easy to look up in the ARM manual:

mcr p15, 0, <Rt>, c9, c14, 0

where <Rt> is a register containing either 1, which enables user-mode access, or 0, which disables user-mode access. mcr stands for “Move to coprocessor register from ARM register.”

But besides enabling user-mode access to the counters, we also need to specify which counters we can access. One of these is the cycle counter (there are various other ones, like cache hits and branch predictor stats.) We can also set options, but we’ll just use the default options to enable all counters. In GNU C parlance, we can do this all with:

#define PERF_DEF_OPTS (1 | 16)
...
static void
enable_cpu_counters(void* data)
{
asm volatile("mcr p15, 0, %0, c9, c14, 0" :: "r"(1));
/* Program PMU and enable all counters */
asm volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(PERF_DEF_OPTS));
asm volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(0x8000000f));
}

The function to disable CPU counters is quite similar, just changing some of the indexes into CP15 around.
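
For reference, a teardown along those lines might look like the sketch below. It reflects my reading of which registers to touch rather than a verbatim copy of the module: clear the control register, write the same counter mask to the count-enable-clear register (c9, c12, 2), then turn user-mode access back off. The `PMU_COUNTER_MASK` name is made up for illustration, and the `__arm__` guard only exists so the snippet builds as a no-op on other architectures.

```c
/* Hypothetical name: cycle counter (bit 31) plus event counters 0-3 */
#define PMU_COUNTER_MASK 0x8000000fu

static void
disable_cpu_counters(void *data)
{
    (void)data; /* unused; on_each_cpu() just passes it through */
#if defined(__GNUC__) && defined(__arm__)
    /* Stop the PMU by clearing the control register, enable bit included */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(0));
    /* Disable the counters via the count-enable-clear register */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 2" :: "r"(PMU_COUNTER_MASK));
    /* Turn user-mode access back off */
    __asm__ volatile("mcr p15, 0, %0, c9, c14, 0" :: "r"(0));
#endif
}
```

Like the enable function, this only affects the CPU it happens to run on.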

## Doing it across every processor

One important detail to keep in mind when doing all this is that on an SMP system, kernel threads which run your module code will be scheduled across the various CPUs on your machine.

However, user-mode access to the cycle counter is a per-cpu configuration, because every CPU has a set of registers to keep track of, including debug ones. This means we have to enable access on every CPU. If we don’t do this, it’s possible that the kernel module’s init function enabling the counters will run on CPU A, and your program accessing the counter will be scheduled to run on CPU B.

In practice this means you’ll confusingly get illegal instruction errors about half the time you run your programs (although you can set the CPU affinity manually using taskset(1).)

Luckily it’s easy to do this in our module: just pass a function pointer to on_each_cpu [1]:

static int __init
init(void)
{
on_each_cpu(enable_cpu_counters, NULL, 1);
printk(KERN_INFO "[" DRVR_NAME "] initialized\n");
return 0;
}

static void __exit
fini(void)
{
on_each_cpu(disable_cpu_counters, NULL, 1);
}

# Using the cycle counter from user space

Now that we have the cycle counter, we can use it to implement something basically like x86’s rdtsc operation:

static inline uint32_t
rdtsc32(void)
{
#if defined(__GNUC__) && defined(__ARM_ARCH_7A__)
uint32_t r = 0;
asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(r) );
return r;
#else
#error Unsupported architecture/compiler!
#endif
}

While in the kernel module we used the mcr instruction to move from ARM -> Coprocessor register, here we’re moving from Coprocessor -> ARM register via mrc (the cycle count is also contained in CP15.)

There’s unfortunately no 64bit cycle counter available from what I could immediately see. But I didn’t look hard. Anyway, after doing this, you can just do your typical dance to count cycles:

uint32_t start_time = 0;
uint32_t end_time = 0;

start_time = rdtsc32();
// ... do expensive thing ...
end_time = rdtsc32();

printf("cycle delta = %u\n", end_time - start_time);
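
Since the counter is only 32 bits wide it wraps fairly quickly (at 1 GHz, roughly every 4.3 seconds). A minimal sketch of coping with that, assuming at most one wrap between two consecutive reads: unsigned subtraction already yields the right delta, and the deltas can be accumulated into a 64-bit total in software. The `cycles64` type and the function names are hypothetical and not part of the code on github.

```c
#include <stdint.h>

/* Delta between two 32-bit reads; correct across a single wraparound
 * because unsigned arithmetic is modulo 2^32. */
static inline uint32_t
cycle_delta32(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Software-extended 64-bit count (hypothetical helper type) */
struct cycles64 {
    uint32_t last;  /* most recent raw 32-bit reading */
    uint64_t total; /* cycles accumulated since cycles64_init() */
};

static inline void
cycles64_init(struct cycles64 *c, uint32_t now)
{
    c->last = now;
    c->total = 0;
}

/* Feed in a fresh raw reading. This has to be called at least once
 * per wrap period, otherwise whole wraps go unnoticed. */
static inline uint64_t
cycles64_update(struct cycles64 *c, uint32_t now)
{
    c->total += cycle_delta32(c->last, now);
    c->last = now;
    return c->total;
}
```

In a long-running measurement you would call `cycles64_update(&c, rdtsc32())` periodically and read `c.total` at the end.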

## Other notes

The PMU on ARM also has two extra things you can toggle. For one, you can reset the cycle counter to zero to get a more accurate measurement.

Another trick is that the PMU can be put into ‘divider’ mode, where the cycle counter will increase once every 64 cycles instead. This allows you to monitor a much larger number of cycles, at the expense of some fine-grained accuracy.
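
Both toggles live in the same control register that PERF_DEF_OPTS is written to (c9, c12, 0). As a rough sketch, going by my reading of the ARMv7 manual's bit layout for that register (so double-check against the documentation for your core), the value could be built like this. The macro and function names are mine, for illustration only.

```c
#include <stdint.h>

/* Control register bits as I understand them from the ARMv7-A manual */
#define PMCR_ENABLE       (1u << 0) /* E: enable all counters */
#define PMCR_RESET_EVENTS (1u << 1) /* P: reset the event counters */
#define PMCR_RESET_CYCLES (1u << 2) /* C: reset the cycle counter to zero */
#define PMCR_DIVIDER      (1u << 3) /* D: count once every 64 cycles */
#define PMCR_EXPORT       (1u << 4) /* X: export events */

/* Build a control value: always E|X (the same as PERF_DEF_OPTS),
 * plus the optional reset and divider bits. */
static inline uint32_t
pmcr_value(int reset_cycles, int divide_by_64)
{
    uint32_t v = PMCR_ENABLE | PMCR_EXPORT;
    if (reset_cycles)
        v |= PMCR_RESET_CYCLES;
    if (divide_by_64)
        v |= PMCR_DIVIDER;
    return v;
}

/* In divider mode each tick stands for 64 CPU cycles. */
static inline uint64_t
divided_ticks_to_cycles(uint32_t ticks)
{
    return (uint64_t)ticks * 64u;
}
```

The result would be written with the same mcr instruction that writes PERF_DEF_OPTS in the kernel module.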

# Doing it the easy way

But really, there’s actually a way of doing this that’s about a billion times easier. Do you know what it is? Use the Linux perf infrastructure. Specifically, the perf_event_open syscall allows you to read the hardware cycle counter in a portable, sane fashion, with no extra kernel module needed.

I did this by using GNU C’s __attribute__((constructor)) and __attribute__((destructor)) routines. The constructor invokes the system call which returns a file descriptor. We can later read from the file descriptor to get the cycle count from the processor.

static int fddev = -1;
__attribute__((constructor)) static void
init(void)
{
static struct perf_event_attr attr;
attr.type = PERF_TYPE_HARDWARE;
attr.config = PERF_COUNT_HW_CPU_CYCLES;
fddev = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

__attribute__((destructor)) static void
fini(void)
{
close(fddev);
}

static inline long long
cpucycles(void)
{
long long result = 0;
/* compare as signed: a plain "< sizeof" would miss read() returning -1 */
if (read(fddev, &result, sizeof(result)) != (ssize_t)sizeof(result)) return 0;
return result;
}
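
A slightly more defensive variant is sketched below, since the syscall can return -1 when the hardware counters aren't available. It zeroes the attribute struct, sets `attr.size`, asks for user-mode cycles only via `exclude_kernel` and `exclude_hv`, and checks the result of `read` as a signed value. I haven't measured whether excluding kernel cycles noticeably changes the numbers; the function names `open_cycle_counter` and `read_cycles` are hypothetical.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Open a cycle counter for the calling thread.
 * Returns a file descriptor, or -1 on failure (errno is set). */
static int
open_cycle_counter(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.exclude_kernel = 1; /* don't count cycles spent in the kernel */
    attr.exclude_hv = 1;     /* ...or in a hypervisor */

    /* pid = 0 (calling thread), cpu = -1 (any CPU), no group, no flags */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Read the current count into *out.
 * Returns 0 on success, -1 on error or short read. */
static int
read_cycles(int fd, uint64_t *out)
{
    ssize_t n = read(fd, out, sizeof(*out));
    return n == (ssize_t)sizeof(*out) ? 0 : -1;
}
```

Callers would check for a negative descriptor up front and either report the error or fall back to another timing source.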

There’s some more documentation about perf_event_open here.

However, for small segments of code with few cycles there is a large difference in accuracy between the two approaches: in the tests included in my code, this makes the difference between 300 cycles reported and 4000 reported. The perf_event_open approach involves a system call, which takes a noticeable amount of time by itself, and the transitions between user and kernel space get counted in the overall time.

At least, this is my guess. But this is only a relatively constant overhead and for bigger amounts of code it’s probably not as much of a deal. You really need to be running your benchmarks more than once anyway, too.
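
One common way to act on that is to take several samples, keep the smallest one (on the assumption that noise such as interrupts and cache misses only ever adds cycles), and subtract a baseline measured by timing an empty region. A minimal sketch with hypothetical helper names:

```c
#include <stddef.h>
#include <stdint.h>

/* Smallest of n samples; returns 0 when n is 0. */
static uint64_t
min_cycles(const uint64_t *samples, size_t n)
{
    uint64_t best = n ? samples[0] : 0;
    for (size_t i = 1; i < n; i++)
        if (samples[i] < best)
            best = samples[i];
    return best;
}

/* Remove a fixed measurement overhead, clamping at zero. */
static uint64_t
subtract_overhead(uint64_t measured, uint64_t overhead)
{
    return measured > overhead ? measured - overhead : 0;
}
```

Here `overhead` would itself be the minimum over a handful of back-to-back counter reads with nothing in between.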

# Conclusion

tl;dr just use perf_event_open and save yourself some sanity (hopefully I can get a patch to core_extended using this approach.) You can also avoid rogue kernel modules. But if in some insane world you’re writing ARM/Windows or ARM/OSX drivers and need PMU support, this might help (but you’ll still need a driver.)

The code for this is all on github.

1. This is quite different from smp_call_function which I originally tried, since importantly it also runs on the CPU you call it on. See here.

1. FYI, but you'll want to check that the syscall isn't returning -1. On some platforms (e.g. Linaro on a Pandaboard) the hardware counters that these rely on are not enabled due to known bugs.

2. I'm following your mrc approach. First I got compile error:
/tmp/ccsE2gGW.s: Assembler messages:
/tmp/ccsE2gGW.s:4046: Error: selected processor does not support Thumb mode `mrc p15,0,r2,c9,c13,0'
After I added the line "LOCAL_ARM_MODE := arm" in my Android.mk, compile passed.
When I ran it, I got "Illegal instruction" error. Tested on Samsung S5.

3. Anyone get any device hardware events on Android with perf_event_open? I've tried a few devices and get "0" for almost everything that would be useful :(

5. Very useful information!
Now I'm wondering if it is possible to flush cache using user-level application?
The instruction to flush cache is "mcr p15, 0, r0, c7, c14, 1"
Is there a corresponding instruction to enable user-mode issuing of this instruction? Thank you!

6. Hi,

This module is exactly what I need for my performance measurement on my Jetson TK1 (ARM Quad-Core Cortex-A15 + NVIDIA GPU, OS= Ubuntu 14.04)

When I run: sudo make runtests, I get the following error:
make: *** [all] Error 2

Has anyone an idea why I get this error or what is wrong?
$sudo apt-get install module-assistant That's it; you can now compile kernel modules..." -ref: http://www.linuxdevcenter.com/pub/a/linux/2007/07/05/devhelloworld-a-simple-introduction-to-device-drivers-under-linux.html?page=1

7. The main difference between the two approaches is that using perf_event_open you measure only cycles when the process/thread really occupies the CPU - i.e. something like "cpu time". With the rdtsc approach you just read a continuously running counter and therefore you get something like "wall time". AFAIK it's not possible to get a "wall time"-like cycle count using perf_event_open - check e.g. "profiling sleep times" on perf's wiki.

8. Hello, I am getting the following error when I try to run it on a Jetson TK1 board. Can anybody please help?
sudo make runtests
KMOD ko/enable_arm_pmu.ko
make[3]: *** [/home/ubuntu/Downloads/enable_arm_pmu-master/ko/enable_arm_pmu.o] Error 1
make[2]: *** [_module_/home/ubuntu/Downloads/enable_arm_pmu-master/ko] Error 2
make[1]: *** [all] Error 2
make: *** [ko/enable_arm_pmu.ko] Error 2

    1. hey, try my solution here, I had approximately the same problem: https://github.com/thoughtpolice/enable_arm_pmu/issues/4

    2. Thank you very much for your help

9. Hello, I cloned and ran "make runtests" on my SAMA5D3 development board (ARM-A5) running a debian build, and get the error below. Any idea how to fix this? Thanks!
xplained@SAMA5D3-Xplained:~/EE382N21_Project/enable_arm_pmu$ make runtests