On x86 and amd64, an easy and cheap way of getting access to a 64-bit cycle counter is the rdtsc instruction. On ARMv7 machines (read: mostly the Cortex-A series), you can get a similar cycle count from the performance monitoring unit (PMU).
Unfortunately, that cycle count is restricted from user mode by default on ARM/Linux, and trying to touch it results in an illegal instruction fault. There are various bits of information floating around the net about enabling PMU access, most of which I got from this SO post. However, I never found a comprehensive solution that I could use easily.
My testbed here consists of a Samsung Chromebook (dual core Cortex-A15) running Ubuntu 13.04, and an ODROID-U2 (quad core Cortex-A9) running an Ubuntu/Linaro 12.10 variant.
Preface: compile woes and whatnot
While attempting to compile patdiff on my Chromebook, I ran into a compilation error in core_extended due to the fact it used rdtsc in a C file - and without an #ifdef guard in sight! Obviously the assembler threw a fit and the compilation failed.
So of course, I set off to fix that on my ARM machines.
Enabling the cycle counter in kernel mode
User-mode access to the cycle counter has to be enabled through a kernel module. Ideally the kernel module should tear things down when it’s unloaded too.
Flipping the bits
I was happy to find out that enabling the needed bits on the processor was not very difficult, and the documentation was nicely explanatory.
Essentially, you need to set a flag in one of the performance monitor registers, which live in the system control coprocessor, CP15. The register in particular is the User Enable Register (PMUSERENR). The instruction is very long-winded but pretty easy to look up in the ARM manual:
mcr p15, 0, <Rt>, c9, c14, 0
where <Rt> is a register containing either 1, which enables user-mode access, or 0, which disables user-mode access. mcr stands for “Move to coprocessor register from ARM register.”
But besides enabling user-mode access to the counters, we also need to switch on the counters themselves. One of these is the cycle counter (there are various other ones, like cache hits and branch predictor stats). We can also set options, but we’ll just use the default options and enable all counters. In GNU C parlance, we can do this all with:
#define PERF_DEF_OPTS (1 | 16)  /* PMCR: enable counters | export events */
...
static void
enable_cpu_counters(void* data)
{
    /* Enable user-mode access to counters. */
    asm volatile("mcr p15, 0, %0, c9, c14, 0" :: "r"(1));
    /* Program PMU and enable all counters */
    asm volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(PERF_DEF_OPTS));
    /* Bit 31 is the cycle counter, bits 0-3 are event counters. */
    asm volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(0x8000000f));
}
The function to disable CPU counters is quite similar, just changing some of the indexes into CP15 around.
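To make the magic numbers above a little less opaque, here is a small sketch that spells out the bits they set. The macro and function names are made up for illustration and are not part of the module; the bit positions are the ARMv7 PMU ones as I read them in the manual (bit 0 and bit 4 of the control register, and bit 31 of the count enable register for the cycle counter).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration; bit positions per the ARMv7 PMU. */
#define PMCR_E       (1u << 0)   /* enable all counters */
#define PMCR_X       (1u << 4)   /* export events to an external monitor */
#define PMCNTEN_CCNT (1u << 31)  /* cycle counter enable bit */

/* Value the module writes to the control register (c9, c12, 0). */
static inline uint32_t pmcr_default_opts(void)
{
    return PMCR_E | PMCR_X;
}

/* Value for the count enable set register (c9, c12, 1): the cycle
 * counter plus the first n_events event counters. */
static inline uint32_t pmcntenset_mask(unsigned n_events)
{
    assert(n_events <= 31);  /* bit 31 is reserved for the cycle counter */
    return PMCNTEN_CCNT | ((1u << n_events) - 1u);
}
```

With these, pmcr_default_opts() is the same value as PERF_DEF_OPTS, and pmcntenset_mask(4) gives the 0x8000000f used above.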
Doing it across every processor
One important detail to keep in mind when doing all this is that for an SMP system, kernel threads which run your module code will be scheduled across the various CPUs on your machine.
However, user-mode access to the cycle counter is a per-cpu configuration, because every CPU has a set of registers to keep track of, including debug ones. This means we have to enable access on every CPU. If we don’t do this, it’s possible that the kernel module’s init function enabling the counters will run on CPU A, and your program accessing the counter will be scheduled to run on CPU B.
In practice this means you’ll confusingly get illegal instruction errors about half the time you run your programs (although you can set the CPU affinity manually using taskset(1)).
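If you would rather do the taskset(1) trick from inside the program, something along these lines should work on Linux with glibc. This is only a sketch: pin_to_first_allowed_cpu is a name I made up, and it simply picks the first CPU the process is already allowed to run on.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* for cpu_set_t and the CPU_* macros */
#endif
#include <sched.h>

/* Pin the calling thread to the first CPU it is currently allowed to
 * run on. Returns the CPU index on success, or -1 on failure. */
static int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            if (sched_setaffinity(0, sizeof(one), &one) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;
}
```

Pinning only sidesteps the problem for one process, and only if the counters happen to be enabled on the CPU you land on. Enabling access on every CPU is still the proper fix.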
Luckily it’s easy to enable access on every CPU from our module: just pass a function pointer to on_each_cpu:
static int __init
init(void)
{
    /* Run on every CPU and wait for all of them to finish. */
    on_each_cpu(enable_cpu_counters, NULL, 1);
    printk(KERN_INFO "[" DRVR_NAME "] initialized\n");
    return 0;
}

static void __exit
fini(void)
{
    on_each_cpu(disable_cpu_counters, NULL, 1);
    printk(KERN_INFO "[" DRVR_NAME "] unloaded\n");
}
Using the cycle counter from user space
Now that we have the cycle counter, we can use it to implement something basically like x86’s rdtsc operation:
#include <stdint.h>

static inline uint32_t
rdtsc32(void)
{
#if defined(__GNUC__) && defined(__ARM_ARCH_7A__)
    uint32_t r = 0;
    /* Read the 32-bit cycle count register from CP15. */
    asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(r) );
    return r;
#else
#error Unsupported architecture/compiler!
#endif
}
While in the kernel module we used the mcr instruction to move from ARM -> Coprocessor register, here we’re moving from Coprocessor -> ARM register via mrc (the cycle count is also contained in CP15).
There’s unfortunately no 64bit cycle counter available from what I could immediately see. But I didn’t look hard. Anyway, after doing this, you can just do your typical dance to count cycles:
uint32_t start_time = 0;
uint32_t end_time = 0;
start_time = rdtsc32();
// ... do expensive thing ...
end_time = rdtsc32();
printf("cycle delta = %u\n", end_time - start_time);
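One thing worth keeping in mind with a 32-bit counter is wraparound. Because the subtraction above is done on unsigned 32-bit values, the delta is still correct if the counter wraps once between the two reads; it only goes wrong if the measured region runs for more than 2^32 cycles. A minimal sketch of that reasoning (the helper names are hypothetical, and the clock rate is whatever your board actually runs at):

```c
#include <stdint.h>

/* Elapsed cycles between two reads of a 32-bit counter.
 * Unsigned arithmetic is modulo 2^32, so a single wrap is handled. */
static inline uint32_t cycle_delta(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Seconds until a 32-bit cycle counter wraps at the given clock rate. */
static inline double seconds_until_wrap(double hz)
{
    return 4294967296.0 / hz;
}
```

For example, cycle_delta(0xfffffff0, 0x10) is 0x20 even though the counter wrapped in between. At 1 GHz the counter wraps roughly every 4.3 seconds, so anything longer than that needs a different approach.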
Other notes
The PMU on ARM also has two extra things you can toggle. For one, you can reset the cycle counter to zero to get a more accurate measurement.
Another trick is that the PMU can be put into ‘divider’ mode, where the cycle counter will increase once every 64 cycles instead. This allows you to monitor a much larger number of cycles before the counter wraps, at the expense of some fine-grained accuracy.
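Both toggles are bits in the same PMU control register that the module already writes with mcr p15, 0, <Rt>, c9, c12, 0: as far as I can tell from the manual, bit 2 resets the cycle counter and bit 3 turns on the divide-by-64 mode. Here is a sketch of how the option word could be built, and how a count read in divider mode would be scaled back up. The names are my own and not from the module.

```c
#include <stdint.h>

/* Hypothetical names; bit positions per the ARMv7 PMU control register. */
#define PMCR_E (1u << 0)  /* enable counters */
#define PMCR_C (1u << 2)  /* reset the cycle counter to zero */
#define PMCR_D (1u << 3)  /* cycle counter ticks once every 64 cycles */
#define PMCR_X (1u << 4)  /* export events */

/* Build the value to write to the control register. */
static inline uint32_t pmcr_opts(int reset, int divider)
{
    uint32_t v = PMCR_E | PMCR_X;  /* same base as PERF_DEF_OPTS */
    if (reset)
        v |= PMCR_C;
    if (divider)
        v |= PMCR_D;
    return v;
}

/* Convert a counter delta back into (approximate) CPU cycles. */
static inline uint64_t ticks_to_cycles(uint32_t ticks, int divider)
{
    return divider ? (uint64_t)ticks * 64u : (uint64_t)ticks;
}
```

In divider mode the result is only accurate to within 64 cycles, which is the trade-off mentioned above.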
Doing it the easy way
But really, there’s actually a way of doing this that’s about a billion times easier. Do you know what it is? Use the Linux perf infrastructure. Specifically, the perf_event_open syscall allows you to read the hardware cycle counter in a portable, sane fashion, with no extra kernel module needed.
I did this by using GNU C’s __attribute__((constructor)) and __attribute__((destructor)) routines. The constructor invokes the system call which returns a file descriptor. We can later read from the file descriptor to get the cycle count from the processor.
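#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

static int fddev = -1;

__attribute__((constructor)) static void
init(void)
{
    static struct perf_event_attr attr;
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    /* pid 0, cpu -1: this process, on any CPU. Returns -1 on failure. */
    fddev = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

__attribute__((destructor)) static void
fini(void)
{
    if (fddev >= 0) close(fddev);
}

static inline long long
cpucycles(void)
{
    long long result = 0;
    /* Cast so a -1 error return is not compared as a huge unsigned value. */
    if (read(fddev, &result, sizeof(result)) != (ssize_t)sizeof(result)) return 0;
    return result;
}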
There’s some more documentation about perf_event_open here.
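The attr struct has more knobs than the two fields set above. For instance, exclude_kernel and exclude_hv ask the kernel to count only cycles spent in user space. The following is a sketch under that assumption, not what my code does; user_cycles_attr and open_user_cycles are hypothetical names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* for the syscall() declaration */
#endif
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Build an attr that counts CPU cycles in user space only. */
static struct perf_event_attr user_cycles_attr(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.exclude_kernel = 1;  /* skip cycles spent in the kernel */
    attr.exclude_hv = 1;      /* skip cycles spent in a hypervisor */
    return attr;
}

/* Open the counter for the calling process on any CPU.
 * Returns a file descriptor, or -1 with errno set. */
static int open_user_cycles(void)
{
    struct perf_event_attr attr = user_cycles_attr();
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}
```

Whatever options you pick, check the returned descriptor before reading from it, since the call can fail on systems where the hardware counters are unavailable or access is restricted.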
However, for small segments of code with few cycles there is a large difference in accuracy between the two approaches - in the tests included in my code, this makes the difference between 300 cycles reported and 4000 reported. The perf_event_open approach involves a system call which takes a noticeable amount of time by itself, and it will clock the transitions between user/kernel space in the overall time.
At least, this is my guess. But this is only a relatively constant overhead and for bigger amounts of code it’s probably not as much of a deal. You really need to be running your benchmarks more than once anyway, too.
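If the overhead really is roughly constant, one common way to deal with it is to measure it separately and subtract it, taking the minimum over several runs. The sketch below works with any 32-bit counter passed in as a function pointer (rdtsc32 would fit). All names are hypothetical, and fake_counter and fake_work are deterministic stand-ins so the arithmetic is easy to follow.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t (*counter_fn)(void);

/* Smallest delta between two back-to-back reads over `runs` attempts:
 * an estimate of the fixed cost of reading the counter. */
static uint32_t read_overhead(counter_fn now, int runs)
{
    assert(runs > 0);
    uint32_t best = UINT32_MAX;
    for (int i = 0; i < runs; i++) {
        uint32_t start = now();
        uint32_t delta = now() - start;
        if (delta < best)
            best = delta;
    }
    return best;
}

/* Smallest overhead-corrected delta for `work` over `runs` attempts. */
static uint32_t min_cycles(counter_fn now, void (*work)(void), int runs)
{
    uint32_t overhead = read_overhead(now, runs);
    uint32_t best = UINT32_MAX;
    for (int i = 0; i < runs; i++) {
        uint32_t start = now();
        work();
        uint32_t delta = now() - start;
        if (delta < best)
            best = delta;
    }
    return best > overhead ? best - overhead : 0;
}

/* Stand-ins for illustration: each read "costs" 7 ticks, work costs 100. */
static uint32_t fake_ticks;
static uint32_t fake_counter(void) { fake_ticks += 7; return fake_ticks; }
static void fake_work(void) { fake_ticks += 100; }
```

With the stand-ins, read_overhead(fake_counter, 5) comes out as 7 and min_cycles(fake_counter, fake_work, 5) as 100. On real hardware the numbers will jitter, which is why the minimum is taken rather than a single sample.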
Conclusion
tl;dr just use perf_event_open and save yourself some sanity (hopefully I can get a patch to core_extended using this approach). You can also avoid rogue kernel modules. But if in some insane world you’re writing ARM/Windows or ARM/OSX drivers and need PMU support, this might help (but you’ll still need a driver).
The code for this is all on github.
FYI, but you'll want to check that the syscall isn't returning -1. On some platforms (e.g. Linaro on a Pandaboard) the hardware counters that these rely on are not enabled due to known bugs.
I'm following your mrc approach. First I got compile error:
/tmp/ccsE2gGW.s: Assembler messages:
/tmp/ccsE2gGW.s:4046: Error: selected processor does not support Thumb mode `mrc p15,0,r2,c9,c13,0'
After I added the line "LOCAL_ARM_MODE := arm" in my Android.mk, compile passed.
When I ran it, I got "Illegal instruction" error. Tested on Samsung S5.
Anyone get any device hardware events on Android with perf_event_open? I've tried a few devices and get "0" for almost everything that would be useful :(
Very useful information!
Now I'm wondering if it is possible to flush cache using user-level application?
The instruction to flush cache is "mcr p15, 0, r0, c7, c14, 1"
Is there a corresponding instruction to enable user-mode issuing of this instruction? Thank you!
Hi,
This module is exactly what I need for my performance measurement on my Jetson TK1 (ARM Quad-Core Cortex-A15 + NVIDIA GPU, OS= Ubuntu 14.04)
When I run: sudo make runtests, I get the following error:
make -C /lib/modules/3.10.40-ged4f697/build SUBDIRS=/home/ubuntu/Downloads/enable_arm_pmu-master/ko modules
make[1]: Entering directory `/usr/src/linux-headers-3.10.40-ged4f697'
CC [M] /home/ubuntu/Downloads/enable_arm_pmu-master/ko/enable_arm_pmu.o
/usr/src/linux-headers-3.10.40-ged4f697/scripts/recordmcount: 1: /usr/src/linux-headers-3.10.40-ged4f697/scripts/recordmcount: Syntax error: "(" unexpected
make[2]: *** [/home/ubuntu/Downloads/enable_arm_pmu-master/ko/enable_arm_pmu.o] Error 2
make[1]: *** [_module_/home/ubuntu/Downloads/enable_arm_pmu-master/ko] Error 2
make[1]: Leaving directory `/usr/src/linux-headers-3.10.40-ged4f697'
make: *** [all] Error 2
Has anyone an idea why I get this error or what is wrong?
Thx in advance,
Kind regards
Problem solved:
This did the job:
"The module-assistant package for Debian installs packages and configures the system to build out-of-kernel modules. Install it with:
$ sudo apt-get install module-assistant
That's it; you can now compile kernel modules..."
-ref: http://www.linuxdevcenter.com/pub/a/linux/2007/07/05/devhelloworld-a-simple-introduction-to-device-drivers-under-linux.html?page=1
The main difference between the two approaches is that using perf_event_open you measure only cycles when the process/thread really occupies the CPU - i.e. something like "cpu time".
With the rdtsc approach you just read a continuously running counter and therefore you get something like "wall time".
AFAIK it's not possible to get "wall time" like cycle count using perf_event_open - check e.g. "profiling sleep times" on perf's wiki.
Hello,
I am getting the following error when I try to run it on Jetson tk1 board. Can anybody please help?
sudo make runtests
KMOD ko/enable_arm_pmu.ko
make[3]: *** [/home/ubuntu/Downloads/enable_arm_pmu-master/ko/enable_arm_pmu.o] Error 1
make[2]: *** [_module_/home/ubuntu/Downloads/enable_arm_pmu-master/ko] Error 2
make[1]: *** [all] Error 2
make: *** [ko/enable_arm_pmu.ko] Error 2
hey,
try my solution here, I had approximately the same problem : https://github.com/thoughtpolice/enable_arm_pmu/issues/4
Thank you very much for your help
Hello, I cloned and ran "make runtests" on my SAMA5D3 development board (ARM-A5) running a debian build, and get the error below. Any idea how to fix this? Thanks!
xplained@SAMA5D3-Xplained:~/EE382N21_Project/enable_arm_pmu$ make runtests
KMOD ko/enable_arm_pmu.ko
make: *** /lib/modules/3.10.0+/build: No such file or directory. Stop.
make[1]: *** [all] Error 2
make: *** [ko/enable_arm_pmu.ko] Error 2
Hi, with AArch64 - A53 64 bit - bare metal (boot to EL3) - is it possible to just enable the perf counters and read them, or do I need to set up a user mode etc ? I've got some boot code running and I'd like to profile, but dont have access to H/W trace etc or DS5.
I want to use it on my Raspberry Pi 4, which is an ARMv8-A architecture with a Cortex-A72 processor. Can you help me with how I can use it? Thanks in advance
ReplyDeleteSemua akan membaik..
ReplyDeleteSatoshiV3
mmorpg oyunlar
ReplyDeleteınstagram takipci satin al
TİKTOK JETON HİLESİ
tiktok jeton hilesi
antalya saç ekimi
referans kimliği nedir
İnstagram Takipçi Satın Al
Instagram Takipçi
metin2 pvp serverlar
Perde Modelleri
ReplyDeleteSms Onay
Mobil ödeme bozdurma
Nft nasil alinir
ANKARA EVDEN EVE NAKLİYAT
TRAFİK SİGORTASI
dedektör
SİTE KURMAK
aşk kitapları
SMM PANEL
ReplyDeleteSMM PANEL
İS İLANLARİ BLOG
instagram takipçi satın al
hirdavatciburada.com
WWW.BEYAZESYATEKNİKSERVİSİ.COM.TR
servis
tiktok para hilesi indir