## Sunday, June 2, 2013

### Control.Category: now with kind polymorphism

Just a quick update.

The other day, I pushed a patch to the Haskell base library that will enable kind polymorphism for the Category type class. That change will come to the public when GHC 7.8.1 is released.

This is a simple change that makes Category far more general, while not breaking any code already out there. (The change was originally proposed by none other than Edward Kmett, although he’s not the only one wanting it.)

# Out with the old, in with the new

The typical definition of Category looks like:

class Category cat where
id  :: cat a a
(.) :: cat b c -> cat a b -> cat a c

Which is all great. But there’s a problem with this. By default, the type variable cat has a kind something like (cat :: * -> * -> *) - meaning a Category can only be created from types where a and b are of kind *.

This isn’t very rich or expressive. In fact, there are many legal Category instances which cannot be created with the above definition, as the parameters (a and b) are of a more exotic kind - for example, not of kind * but of kind (*,*) - the kind of promoted tuples.

Luckily for us, GHC has kind polymorphism. So we just needed to enable it. This is a completely backwards compatible change as well.

With that little addition of LANGUAGE PolyKinds, we can now host much richer instances of Category. With -XPolyKinds, kind generalization occurs automatically. So at pretty much no cost, the change turned:

class Category (cat :: * -> * -> *) where

into:

class Category (cat :: k -> k -> *) where

where the k stands for a kind variable, and can be instantiated to various new kinds, not just *.

In GHC 7.8, if you ask for the kind of Category, you will get:

Prelude> import Control.Category
Prelude Control.Category> :i Category
class Category (k :: BOX) (cat :: k -> k -> *) where
Control.Category.id :: cat a a
(Control.Category..) :: cat b c -> cat a b -> cat a c
-- Defined in ‛Control.Category’
instance Category * (->) -- Defined in ‛Control.Category’
Prelude Control.Category> :k Category
Category :: (k -> k -> *) -> Constraint
Prelude Control.Category>

(For the curious: the (k :: BOX) annotation cannot be given explicitly, and is a form introduced as a result of PolyKinds, where kind parameters are introduced ‘implicitly’ in their definition.)

# Logical entailment

There’s a simple instance of Category which we already know and love: the instance for function types. Let the type variable cat become (->) and we get:

id :: cat a a
= { given the equality, cat = (->) }
id :: (->) a a
= { by syntactic equality }
id :: a -> a

(.) :: cat b c -> cat a b -> cat a c
= { given the equality, cat = (->) }
(.) :: ((->) b c) -> ((->) a b) -> ((->) a c)
= { by syntactic equality }
(.) :: (b -> c) -> (a -> b) -> (a -> c)

And these are the regular types of Prelude.id and (Prelude..), so:

instance Category (->) where
id  = Prelude.id
(.) = (Prelude..)

Another example Edward brought up though, is logical entailment (or logical implication.) This is an entertaining example because it’s easy and looks very similar to the above types. With modern GHC features, we can formulate this using the new Constraint kind.

Let Dict be the ultimate reified dictionary, using a GADT:

data Dict :: Constraint -> * where
Dict :: ctx => Dict ctx

Given a term of type Dict x, we can witness the constraint x. Max Bolingbroke wrote some stuff about this trick when he implemented ConstraintKinds.

Now, let the type a |- b read ‘a entails b’, implemented as:

newtype (|-) a b = Sub (a => Dict b)

I’ll get to the explanation in a moment. If we peer through the looking glass we can see the term Sub Dict as a kind of evidence that some statement a |- b holds. So now there are some obvious statements we can make about entailment. For one, there is the simple property of reflexivity: “if a, then a”, or in other words a |- a:

reflexive :: a |- a
reflexive = Sub Dict

Another property is transitivity. Given the implication a |- b, and b |- c, we can derive the statement a |- c.

First, we must introduce a helper. Given a term r which requires context b, we may derive a term r requiring context a, iff a |- b.

In other words, we may substitute the constraint under which a given term occurs:

(\\) :: a => (b => r) -> (a |- b) -> r
r \\ Sub Dict = r

Remember the definition of (|-) which is Sub (a => Dict b). Given a dictionary of b, we automatically derive the constituent a for the constraint. Dict will imply the constraint b, but Dict is a term, and thus we can ‘peel it off’ by abandoning the Dict like the above, leaving only a.

Now we can define the property of transitivity:

transitive :: (b |- c) -> (a |- b) -> (a |- c)
transitive f g = Sub (Dict \\ f \\ g)

We can view the transitive function as taking two pieces of evidence and returning a third. We use the helper (\\) to ‘build’ the final evidence piecewise: first we get Dict \\ f which replaces c for b, and then we replace b for the constraint a. This then ‘trivially’ witnesses the fact we can replace c for a.

But these definitions are the exact instances of Category for (|-), and they directly mirror the instances for (->) too:

id        :: a -> a
reflexive :: a |- a

(.)        :: (b -> c) -> (a -> b) -> (a -> c)
transitive :: (b |- c) -> (a |- b) -> (a |- c)

And therefore:

instance Category (|-) where
id  = reflexive
(.) = transitive

Note that the kind of entailment is: (|-) :: Constraint -> Constraint -> *

With a kind-polymorphic Category, the type (|-) can now be a valid instance where it couldn’t be before.

By the way, the above definitions are all available as part of Edward’s constraints package.

# Conclusion

This change was proposed a while ago, and could have been implemented as early as GHC 7.4 technically. But it’s a nice addition to top off what will already be an awesome 7.8 release.

Full code can be found here.

## Friday, May 3, 2013

### User-mode performance counters for ARM/Linux

On x86 and amd64, an easy and cheap way of getting access to a 32bit cycle counter is using the rdtsc instruction. On ARMv7 machines (read: mostly the Cortex-A series,) you can get a similar cycle count from the performance monitoring unit (PMU.)

Unfortunately, that cycle count is restricted from user-mode by default on ARM/Linux, and trying to touch it results in an illegal instruction violation. There’s various information floating around the net about enabling PMU access, most of which I got from this SO post. However, I never found any comprehensive solution to this which I could use easily.

My testbed here consists of a Samsung Chromebook (dual core Cortex-A15) running Ubuntu 13.04, and an ODROID-U2 (quad core Cortex-A9) running a Ubuntu/Linaro 12.10 variant.

# Preface: compile woes and whatnot

While attempting to compile patdiff on my Chromebook, I ran into a compilation error in core_extended due to the fact it used rdtsc in a C file - and without an #ifdef guard in sight! Obviously the assembler threw a fit and the compilation failed.

So of course, I set off to fix that on my ARM machines.

# Enabling the cycle counter in kernel mode

User-mode access to the cycle counter has to be enabled through a kernel module. Ideally the kernel module should tear things down when it’s unloaded too.

## Flipping the bits

I was happy to find out that enabling the needed bits on the processor was not very difficult, and the documentation was nicely explanatory .

Essentially, you need to set some flags on a coprocessor debug register. The register in particular is CP15. The instruction is very long winded but pretty easy to look up in the ARM manual:

mcr p15, 0, <Rt>, c14, 0

where <Rt> is a register containing either 1, which enables user-mode access, or 0, which disables user-mode access. mcr stands for “Move to coprocessor register from ARM register.”

But besides enabling user-mode access to the counters, we also need to specify which counters we can access. One of these is the cycle counter (there are various other ones, like cache hits and branch predictor stats.) We can also set options, but we’ll just use the default options to enable all counters. In GNU C parlance, we can do this all with:

#define PERF_DEF_OPTS (1 | 16)
...
static void
enable_cpu_counters(void* data)
{
asm volatile("mcr p15, 0, %0, c9, c14, 0" :: "r"(1));
/* Program PMU and enable all counters */
asm volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(PERF_DEF_OPTS));
asm volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(0x8000000f));
}

The function to disable CPU counters is quite similar, just changing some of the indexes into CP15 around.

## Doing it across every processor

One important detail to keep in mind when doing all this is that for an SMP system, kernel threads which run your module code will be scheduled across various the various CPUs on your machine.

However, user-mode access to the cycle counter is a per-cpu configuration, because every CPU has a set of registers to keep track of, including debug ones. This means we have to enable access on every CPU. If we don’t do this, it’s possible that the kernel module’s init function enabling the counters will run on CPU A, and your program accessing the counter will be scheduled to run on CPU B.

In practice this means you’ll confusingly get illegal instruction errors about half the time you run your programs (although you can set the CPU affinity manually using taskset(1).)

Luckily it’s easy to do this in our module: just pass a function pointer to on_each_cpu1:

static int __init
init(void)
{
on_each_cpu(enable_cpu_counters, NULL, 1);
printk(KERN_INFO "[" DRVR_NAME "] initialized");
return 0;
}

static void __exit
fini(void)
{
on_each_cpu(disable_cpu_counters, NULL, 1);
}

# Using the cycle counter from user space

Now that we have the cycle counter, we can use it to implement something basically like x86’s rdtsc operation:

static inline uint32_t
rdtsc32(void)
{
#if defined(__GNUC__) && defined(__ARM_ARCH_7A__)
uint32_t r = 0;
asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(r) );
return r;
#else
#error Unsupported architecture/compiler!
#endif
}

While in the kernel module we used the mcr instruction to move from ARM -> Coprocessor register, here we’re moving from Coprocessor -> ARM register via mrc (the cycle count is also contained in CP15.)

There’s unfortunately no 64bit cycle counter available from what I could immediately see. But I didn’t look hard. Anyway, after doing this, you can just do your typical dance to count cycles:

uint32_t start_time = 0;
uint32_t end_time = 0;

start_time = rdtsc32();
// ... do expensive thing ...
end_time = rdtsc32();

printf("cycle delta = %u\n", end_time - start_time);

## Other notes

The PMU on ARM also has two extra things you can toggle. For one, you can reset the cycle counter to zero to get a more accurate measurement.

Another trick is that the PMU can be put into ‘divider’ mode, where the cycle counter will increase once every 64 cycles instead. This allows you to monitor a much larger amount of cycles, at the expense of some small-term accuracy.

# Doing it the easy way

But really, there’s actually a way of doing this that’s about a billion times easier. Do you know what it is? Use the Linux perf infrastructure. Specifically, the perf_event_open syscall allows you to read the hardware cycle counter in a portable, sane fashion, with no extra kernel module needed.

I did this by using GNU C’s __attribute__((constructor)) and __attribute__((destructor)) routines. The constructor invokes the system call which returns a file descriptor. We can later read from the file descriptor to get the cycle count from the processor.

static int fddev = -1;
__attribute__((constructor)) static void
init(void)
{
static struct perf_event_attr attr;
attr.type = PERF_TYPE_HARDWARE;
attr.config = PERF_COUNT_HW_CPU_CYCLES;
fddev = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

__attribute__((destructor)) static void
fini(void)
{
close(fddev);
}

static inline long long
cpucycles(void)
{
long long result = 0;
if (read(fddev, &result, sizeof(result)) < sizeof(result)) return 0;
return result;
}

There’s some more documentation about perf_event_open here.

However, for small segments of code with few cycles there is a large difference in accuracy in the two approaches - in the tests included in my code, this makes the difference between 300 cycles reported and 4000 reported. The perf_event_open approach involves a system call which takes a noticeable amount of time by itself, and it will clock the transitions between user/kernel space in the overall time.

At least, this is my guess. But this is only a relatively constant overhead and for bigger amounts of code it’s probably not as much of a deal. You really need to be running your benchmarks more than once anyway, too.

# Conclusion

tl;dr just use perf_event_open and save yourself some sanity (hopefully I can get a patch to core_extended using this approach.) You can also avoid rogue kernel modules. But if in some insane world you’re writing ARM/Windows or ARM/OSX drivers and need PMU support, this might help (but you’ll still need a driver.)

The code for this is all on github.

1. This is quite different from smp_call_function which I originally tried, since importantly it also runs on the CPU you call it on. See here.

## Wednesday, March 6, 2013

### cryptol-mode for Emacs released

I’ve finally released my Emacs mode for Cryptol, the domain specific language for Cryptography, from Galois. For those who aren’t aware, you basically write a cryptographic algorithm in Cryptol, and render it to C or VHDL to be synthesized on something (like my Papilio One!)

At the moment it’s not really useful. It just has syntax highlighting and a REPL launcher. But I’m an Emacs Lisp novice, so this is my great achievement!

It’s available here on GitHub. It supports both marmalade and MELPA for installation, using package.el. If you’re using Emacs 24 or above and have configured your repository, it’s as easy as M-x package-install RET cryptol-mode.

I’d recommend just using MELPA, which will do hourly builds from the git repository. There’s not really anything there that will ‘break’ from using it, so you might as well get the latest features regularly.

I’ll be writing a blog post soon about using Cryptol a little more in depth for a specification of OCB (when I get the time to write it…)

## Sunday, March 3, 2013

### Retrospective: Corelan Live

Several weeks ago, I had the extraordinary pleasure of being a part of Corelan Live Training, which offers an intense, multi-day Win32 exploit development class. Many of the Metasploit, Metasploit Pro and Nexpose engineers in Austin, TX at Rapid7 underwent the training in Feburary. (On this note, I must give mad props to my office-mates @_sinn3r and @thelightcosine - Corelan members - for setting this up.)

The class was taught by Peter “corelanc0d3r” van Eeckhoutte, a well known metasploit contributor and exploit developer. It was 2 days of intense, 10hrs+ training each day.

We developed exploits for traditional stack buffer overflows, SEH overwrites, and integer arithmetic bugs. We went all the way from simple stack smashing, to full ASLR/DEP bypass with ROP chains, to heap spraying modern browsers (including exclusive coverage of Peter’s DEPS technique for heap spraying Firefox and IE10.) We hacked custom applications, media software, FTP clients, HTTP servers, and wrote our very own Metasploit modules.1

The class was incredible and well thought out. Peter took the time to demonstrate applications that show the key ideas behind pulling off a certain kind exploit, but also offered many real world challenges and applications to test us. It was often a competition in the class to see who could get a shell first.

This class was quite fantastic and I really had a blast developing exploits. Since then, I’ve independently written 2 more exploits against publicly known vulnerabilities (with known exploits already existing.) And I’m working on 2 more for Opera 12.12 and Safari 4.x (with no public exploits, but the vulnerabilities are fixed.) Expect metasploit modules soon!

Peter himself was great too; incredibly enthusiastic and engaged with all of his students. I’d recommend you take his class if you ever get a chance. It’s a lot of fun!

## Sunday, February 24, 2013

### My Homebrew tap

For anyone who’s curious, I’ve taken time over the past few months to accumulate software I like in my own homebrew tap. For those OS X users who aren’t aware, homebrew is pretty much the best package manager there is IMO. For those who use brew and don’t know what taps are, they’re basically tiny extra repositories you can tack on and have brew manage.

The following software is included in there:

• Caml-light, which is essentially the predecessor to OCaml. Good for teaching and good as an example of implementing a PL.

• Coccinelle, the semantic patching tool for C code.

• Compcert, the verifying C compiler.

• MetaOCaml, which is a version of OCaml 4.00.1 featuring multi-stage programming facilities. (Specifically, this is BER MetaOCaml version N100.)

• Metis, an automated theorem prover for first order logic.

• Nix, the purely functional package manager (featuring launchctl integration.)

• OchaCaml, which is essentially Caml-light extended with delimited continuations and answer type modification.

• Ott, a tool for writing programming language definitions.

• Tengine, a fork of nginx.

• Z3, a super-awesome SMT solver from Microsoft Research.

If you’re a brew user, you can get my tap by just saying:

$brew tap thoughtpolice/personal$ brew update

And you’re done. Now go ahead and install all of my formula!

## Sunday, February 10, 2013

### Public service announcement: patdiff is awesome

This is just a word of note: patdiff is an amazing diff tool from Jane Street that’s totally blowing my mind. I’m very, very thankful to Jake McArthur (a Jane St. employee) for showing me this tool. He was missing it at home too!

It uses Bram Cohen’s patience diff algorithm for computing differences between two files. But the real benefit of patdiff is the beautiful output besides that: only relevant, changed terms are highlighted, and insertion/deletion/modification markers only appear on the side.

It’s been difficult training myself to the new output of patdiff since it’s quite different from most other formulations of patience diff. Git itself has patience diff (but it wouldn’t really make a difference in the picture, ha ha!1) So does bzr, and probably hg. However, I really like this tool a lot already, even though I’ve only used it for a day or two.

I think the most important aspect of patience diff is that it matches lines which actually change much more accurately, like in Bram’s example in his post I linked to. As a programmer, you often make changes to code like this:

 void func1() {
x += 1
}

+void functhreehalves() {
+    x += 1.5
+}
+
void func2() {
x += 2
}

Which is completely reasonable. But a lot of tools interpret this as:

 void func1() {
x += 1
+}
+
+void functhreehalves() {
+    x += 1.5
}

void func2() {
x += 2
}
 But I love this color on my 80x24terminals! Source.

Which is completely annoying.

So, patience diff is a lot more accurate. patdiff takes this to 11, and is very particular about what it highlights on the line too - this is the real thing that sets it apart from other tools.

Notice in the picture that it only highlights the exact words that were inserted or deleted from my .cabal file. At a glance, you might think “well, that makes it hard to visually distinguish those two inserted and deleted blocks.” That is, it’s hard to see the difference between the ‘bulk’ changes without looking to the left-hand side. And yes, you’re right - git with color, on the other hand, makes the contrast between insertion blocks and deletion blocks very clear. But the contrast is clear much in the same way it’s clear on your friends green and black PT cruiser.

Considering the extra accuracy, I think retraining my eyes on reading diffs a little is worth it.

## Getting patdiff

Briefly, you need to install opam. If you’re on OS X and use Homebrew, that’s as easy as 1-2-3:

$brew update && brew install opam If you’re on Linux, you can use the installer script to install it into $HOME. You’ll need the PCRE development libraries and the OCaml compiler, however. On Ubuntu, you can get these by doing:

$sudo apt-get install libpcre3-dev$ sudo apt-get install ocaml

Then, install opam using the automated installer:

$wget http://www.ocamlpro.com/pub/opam_installer.sh$ sh ./opam_installer.sh $HOME/bin After you install it, initialize the opam environment: $ opam init

Finally, add the following line to your $HOME/.profile: $ (which opam > /dev/null) && eval $(opam config -env) And run it in any existing, open shells. Finally, just update and install: $ opam update
$opam install patdiff ## Adding a git patdiff alias You’ll notice in the picture above I configured git to have a patdiff alias that shows things all beautiful and ocamlized. It’s pretty easy, luckily: $ git config --global alias.patdiff 'difftool -y -x patdiff'

And now you use git patdiff to show differences, with the syntax:

$git patdiff OLD NEW Note that because of this syntax between old and new, if you want to see the diff of the latest commit, you must do: $ git patdiff HEAD~ HEAD

and not do:

$git patdiff HEAD HEAD~ The latter will show you a ‘reverse’ diff instead. I’m looking into making patdiff the default for all traditional diff operations. It seems that git difftool does support this, but only for a handful of extra diff tools. It’s likely doable with some shell script nastiness. I also need to see if I can get it working with mercurial, which is what we use at work. ## Update: Using patdiff with git diff You can use patdiff with regular git diff too! Do this: $ cat > $HOME/bin/git-external_patdiff.sh #!/bin/sh patdiff$2 $5 -alt-old a/$5 -alt-new b/$5 | cat ^D$ chmod +x $HOME/bin/git-external_patdiff.sh That will put a simple git wrapper for patdiff in your $HOME/bin. Next, configure git to use it:

$git config --global diff.external$HOME/bin/git-external_patdiff.sh

Then git diff will use patdiff as well.

## Update: meta blog diffs

Here’s another example of using patdiff while editing this very blog post on my OSX machine. Isn’t that diff just so much better?

 Americans: do not confuse wibbles and tribbles. They’re unrelated.

1. Before you ask: no, I do not do stand up comedy.

### Controlling inlining phases in GHC

Recently on StackOverflow there was a question about controlling inlining phases in GHC. I set out to answer that question, and I decided that it was interesting enough to post here. Below is my answer slightly modified to fit the context of this blog.

## Introduction

You may have seen phase control and inlining annotations in libraries like vector, repa and dph before, but how do they work? It’s nice to see a cut-down and concrete example of where using phase control in combination with RULES/INLINE is beneficial.1 You don’t see them beyond heavily optimized libraries which are often complex, so case studies are great.

Here is an example I implemented recently, using recursion schemes. We will illustrate this using the concept of catamorphisms. You don’t need to know what those are in detail, just that they characterize ‘fold’ operators. (Really, do not focus too much on the abstract concepts here. This is just the simplest example I have, where you can have a nice speed-up.)

## Quick intro to catamorphisms

We begin with Mu, the fix-point type, and a definition of Algebra which is just a fancy synonym for a function which “deconstructs” a value of f a to return an a.

newtype Mu f = Mu { muF :: f (Mu f) }

type Algebra f a = f a -> a

We may now define two operators, ffold and fbuild, which are highly-generic versions of the traditional foldr and build operators for lists:

ffold :: Functor f => Algebra f a -> Mu f -> a
ffold h = go h
where go g = g . fmap (go g) . muF
{-# INLINE ffold #-}

fbuild :: Functor f => (forall b. Algebra f b -> b) -> Mu f
fbuild g = g Mu
{-# INLINE fbuild #-}

Roughly speaking, ffold destroys a structure defined by an Algebra f a and yields an a. fbuild instead creates a structure defined by its Algebra f a and yields a Mu value. That Mu value corresponds to whatever recursive data type you’re talking about. Just like regular foldr and build: we deconstruct a list using its cons, and we build a list using its cons, too. The idea is we’ve just generalized these classic operators, so they can work over any recursive data type (like lists, or trees!)

Finally, there is a law that accompanies these two operators, which will guide our overall RULE:

forall f g. ffold f (build g) = g f

This rule essentially generalizes the optimization of deforestation/fusion - the removal of the intermediate structure. (I suppose the proof of correctness of said law is left as an exercise to the reader. Should be rather easy via equational reasoning.)

We may now use these two combinators, along with Mu, to represent recursive data types like a list. And we can write operations over that list.

data ListF a f = Nil | Cons a f
deriving (Eq, Show, Functor)
type List a = Mu (ListF a)

instance Eq a => Eq (List a) where
(Mu f) == (Mu g) = f == g

lengthL :: List a -> Int
lengthL = ffold g
where g Nil = 0
g (Cons _ f) = 1 + f
{-# INLINE lengthL #-}

And we can define a map function as well:

mapL :: (a -> b) -> List a -> List b
mapL f = ffold g
where g Nil = Mu Nil
g (Cons a x) = Mu (Cons (f a) x)
{-# INLINE mapL #-}

## Inlining FTW

We now have a means of writing terms over these recursive types we defined. However, if we were to write a term like

lengthL . mapL (+1) $xs Then if we expand the definitions, we essentially get the composition of two ffold operators: ffold g1 . ffold g2$ ...

And that means we’re actually destroying the structure, then rebuilding it and destroying again. That’s really wasteful. Also, we can re-define mapL in terms of fbuild, so it will hopefully fuse with other functions.

Well, we already have our law, so a RULE is in order. Let’s codify that:

{-# RULES
-- Builder rule for catamorphisms
"ffold/fbuild" forall f (g :: forall b. Algebra f b -> b).
ffold f (fbuild g) = g f
#-}

Next, we’ll redefine mapL in terms of fbuild for fusion purposes:

mapL2 :: (a -> b) -> List a -> List b
mapL2 f xs = fbuild (\h -> ffold (h . g) xs)
where g Nil = Nil
g (Cons a x) = Cons (f a) x
{-# INLINE mapL2 #-}

Aaaaaand we’re done, right? Wrong!

## Phases for fun and profit

The problem is there are zero constraints on when inlining occurs, which will completely mess this up. Consider the case earlier that we wanted to optimize:

lengthL . mapL2 (+1) \$ xs

We would like the definitions of lengthL and mapL2 to be inlined, so that the ffold/fbuild rule may fire afterwords, over the body. So we want to go to:

ffold f1 . fbuild g1 ...

via inlining, and after that go to:

g1 f1

via our RULE.

Well, that’s not guaranteed. Essentially, in one phase of the simplifier, GHC may not only inline the definitions of lengthL and mapL, but it may also inline the definitions of ffold and fbuild at their use sites. This means that the RULE will never get a chance to fire, as the phase ‘gobbled up’ all of the relevant identifiers, and inlined them into nothing.

The observation is that we would like to inline ffold and fbuild as late as possible. By inlining their definitions as late as possible, we will try to expose as many possible opportunities for our RULE to fire. And if it doesn’t, then the body will get inlined, and GHC will still give it’s best. But ultimately, we want it to inline late; the RULE will save us more efficiency than any clever compiler optimization.

So the fix here is to annotate ffold and fbuild and specify they should only fire at phase 1:

ffold g = ...
{-# INLINE[1] ffold #-}

fbuild g = ...
{-# INLINE[1] fbuild #-}

Now, mapL and friends will be inlined very early, but these will come very late. GHC begins from some phase number N, and the phase numbers decrease to zero. Phase 1 is the last phase. It would also be possible to inline fbuild/ffold sooner than Phase 1, but this would essentially mean you need to start increasing the number of phases to make up for it, or start making sure the RULE always fires in some earlier stages.

## Conclusion

You can find all of this and more in a gist of mine2, with all the mentioned definitions and examples here. It also comes with a criterion benchmark of our example: with our phase annotations, GHC is able to cut the runtime of lengthL . mapL2 in half compared to lengthL . mapL1, when the RULE fires.

If you would like to see this yourself, you can compile the code with the -ddump-simpl-stats, and see that the ffold/fbuild rule fired during the compilation pipeline.

Finally, most of the same principles apply to libraries like vector or bytestring. The trick is that you may have multiple levels of inlining here, and a lot more rules. This is because techniques like stream/array fusion tend to effectively fuse loops and reuse arrays - as opposed to here, where we just do classical deforestation, by removing an intermediate data structure. Depending on the traditional ‘pattern’ of code generated (say, due to a vectorized, parallel list comprehension) it may very much be worth it to interleave or specifically phase optimizations in a way that obvious deficiencies are eliminated earlier on. Or, optimize for cases where a RULE in combination with an INLINE will give rise to more RULEs (hence the staggered phases you see sometimes - this basically interleaves a phase of inlining.) For these reasons, you can also control the phases in which a RULE fires.

So, while RULEs with phases can save us a lot of runtime, they can take a lot of time to get right too. This is why you often see them only in the most ‘high performance’, heavily optimized libraries.

## Footnotes

1. The original question was “what kinds of functions benefit from phase control” which to me sounds like asking “which functions benefit from constant subexpression elimination.” I am not sure how to accurately answer this, if it’s even possible! This is more of a compiler-realm thing, than any theoretical result about how functions or programs behave - even with mathematical laws, not all ‘optimizations’ have the results you expect. As a result, the answer is effectively “you’ll probably know when you write and benchmark it.”

2. You can safely ignore a lot of other stuff in the file; it was mostly a playground, but may be interesting to you too. There are other examples like naturals and binary trees in there - you may find it worthwhile to try exploiting various other fusion opportunities, using them.