Recently on StackOverflow there was a question about controlling inlining phases in GHC. I set out to answer that question, and I decided that it was interesting enough to post here. Below is my answer slightly modified to fit the context of this blog.

## Introduction

You may have seen phase control and inlining annotations in libraries like `vector`

, `repa`

and `dph`

before, but how do they work? It’s nice to see a cut-down and concrete example of where using phase control in combination with `RULES`

/`INLINE`

is beneficial.^{1} You don’t see them beyond heavily optimized libraries which are often complex, so case studies are great.

Here is an example I implemented recently, using recursion schemes. We will illustrate this using the concept of catamorphisms. You don’t need to know what those are in detail, just that they characterize ‘fold’ operators. (Really, do not focus too much on the abstract concepts here. This is just the simplest example I have, where you can have a nice speed-up.)

## Quick intro to catamorphisms

We begin with `Mu`

, the fix-point type, and a definition of `Algebra`

which is just a fancy synonym for a function which “deconstructs” a value of `f a`

to return an `a`

.

```
newtype Mu f = Mu { muF :: f (Mu f) }
type Algebra f a = f a -> a
```

We may now define two operators, `ffold`

and `fbuild`

, which are highly-generic versions of the traditional `foldr`

and `build`

operators for lists:

```
ffold :: Functor f => Algebra f a -> Mu f -> a
ffold h = go h
where go g = g . fmap (go g) . muF
{-# INLINE ffold #-}
fbuild :: Functor f => (forall b. Algebra f b -> b) -> Mu f
fbuild g = g Mu
{-# INLINE fbuild #-}
```

Roughly speaking, `ffold`

*destroys* a structure defined by an `Algebra f a`

and yields an `a`

. `fbuild`

instead *creates* a structure defined by its `Algebra f a`

and yields a `Mu`

value. That `Mu`

value corresponds to whatever recursive data type you’re talking about. Just like regular `foldr`

and `build`

: we deconstruct a list using its cons, and we build a list using its cons, too. The idea is we’ve just generalized these classic operators, so they can work over *any* recursive data type (like lists, or trees!)

Finally, there is a law that accompanies these two operators, which will guide our overall `RULE`

:

`forall f g. ffold f (build g) = g f`

This rule essentially generalizes the optimization of deforestation/fusion - the removal of the intermediate structure. (I suppose the proof of correctness of said law is left as an exercise to the reader. Should be rather easy via equational reasoning.)

We may now use these two combinators, along with `Mu`

, to represent recursive data types like a list. And we can write operations over that list.

```
data ListF a f = Nil | Cons a f
deriving (Eq, Show, Functor)
type List a = Mu (ListF a)
instance Eq a => Eq (List a) where
(Mu f) == (Mu g) = f == g
lengthL :: List a -> Int
lengthL = ffold g
where g Nil = 0
g (Cons _ f) = 1 + f
{-# INLINE lengthL #-}
```

And we can define a `map`

function as well:

```
mapL :: (a -> b) -> List a -> List b
mapL f = ffold g
where g Nil = Mu Nil
g (Cons a x) = Mu (Cons (f a) x)
{-# INLINE mapL #-}
```

## Inlining FTW

We now have a means of writing terms over these recursive types we defined. However, if we were to write a term like

`lengthL . mapL (+1) $ xs`

Then if we expand the definitions, we essentially get the composition of two `ffold`

operators:

`ffold g1 . ffold g2 $ ...`

And that means we’re actually destroying the structure, then rebuilding it and destroying *again*. That’s really wasteful. Also, we can re-define `mapL`

in terms of `fbuild`

, so it will hopefully fuse with other functions.

Well, we already have our law, so a `RULE`

is in order. Let’s codify that:

```
{-# RULES
-- Builder rule for catamorphisms
"ffold/fbuild" forall f (g :: forall b. Algebra f b -> b).
ffold f (fbuild g) = g f
#-}
```

Next, we’ll redefine `mapL`

in terms of `fbuild`

for fusion purposes:

```
mapL2 :: (a -> b) -> List a -> List b
mapL2 f xs = fbuild (\h -> ffold (h . g) xs)
where g Nil = Nil
g (Cons a x) = Cons (f a) x
{-# INLINE mapL2 #-}
```

Aaaaaand we’re done, right? Wrong!

## Phases for fun and profit

The problem is there are zero constraints on when inlining occurs, which will completely mess this up. Consider the case earlier that we wanted to optimize:

`lengthL . mapL2 (+1) $ xs`

We would like the definitions of `lengthL`

and `mapL2`

to be inlined, so that the `ffold/fbuild`

rule may fire afterwords, over the body. So we want to go to:

`ffold f1 . fbuild g1 ...`

via inlining, and after that go to:

`g1 f1`

via our `RULE`

.

Well, that’s not guaranteed. Essentially, in one phase of the simplifier, GHC may not *only* inline the definitions of `lengthL`

and `mapL`

, but it may also inline the definitions of `ffold`

and `fbuild`

at their use sites. This means that the RULE will never get a chance to fire, as the phase ‘gobbled up’ all of the relevant identifiers, and inlined them into nothing.

The observation is that we would like to inline `ffold`

and `fbuild`

*as late as possible*. By inlining their definitions as late as possible, we will try to expose as many possible opportunities for our RULE to fire. And if it doesn’t, then the body will get inlined, and GHC will still give it’s best. But ultimately, we want it to inline late; the `RULE`

will save us more efficiency than any clever compiler optimization.

So the fix here is to annotate `ffold`

and `fbuild`

and specify they should only fire at phase 1:

```
ffold g = ...
{-# INLINE[1] ffold #-}
fbuild g = ...
{-# INLINE[1] fbuild #-}
```

Now, `mapL`

and friends will be inlined very early, but these will come very late. GHC begins from some phase number N, and the phase numbers decrease to zero. Phase 1 is the last phase. It would also be possible to inline `fbuild/ffold`

sooner than Phase 1, but this would essentially mean you need to start increasing the number of phases to make up for it, or start making sure the RULE always fires in some earlier stages.

## Conclusion

You can find all of this and more in a gist of mine^{2}, with all the mentioned definitions and examples here. It also comes with a criterion benchmark of our example: with our phase annotations, GHC is able to cut the runtime of `lengthL . mapL2`

in half compared to `lengthL . mapL1`

, when the `RULE`

fires.

If you would like to see this yourself, you can compile the code with the `-ddump-simpl-stats`

, and see that the `ffold/fbuild`

rule fired during the compilation pipeline.

Finally, most of the same principles apply to libraries like vector or bytestring. The trick is that you may have multiple levels of inlining here, and a *lot* more rules. This is because techniques like stream/array fusion tend to effectively fuse loops and reuse arrays - as opposed to here, where we just do classical deforestation, by removing an intermediate data structure. Depending on the traditional ‘pattern’ of code generated (say, due to a vectorized, parallel list comprehension) it may very much be worth it to interleave or specifically phase optimizations in a way that obvious deficiencies are eliminated earlier on. Or, optimize for cases where a `RULE`

in combination with an `INLINE`

will give rise to more `RULE`

s (hence the staggered phases you see sometimes - this basically interleaves a phase of inlining.) For these reasons, you can also control the phases in which a `RULE`

fires.

So, while `RULE`

s with phases can save us a lot of runtime, they can take a lot of time to get right too. This is why you often see them only in the most ‘high performance’, heavily optimized libraries.

## Footnotes

The original question was “what kinds of functions benefit from phase control” which to me sounds like asking “which functions benefit from constant subexpression elimination.” I am not sure how to accurately answer this, if it’s even possible! This is more of a compiler-realm thing, than any theoretical result about how functions or programs behave - even

*with*mathematical laws, not all ‘optimizations’ have the results you expect. As a result, the answer is effectively “you’ll probably know when you write and benchmark it.”↩You can safely ignore a lot of other stuff in the file; it was mostly a playground, but may be interesting to you too. There are other examples like naturals and binary trees in there - you may find it worthwhile to try exploiting various other fusion opportunities, using them.↩

Nice post with awesome points! Can’t wait for the next one.

ReplyDeleteAcer - 14" Laptop - 6GB Memory - 500GB Hard Drive - Hazy Purple

Acer - 15.6" Touch-Screen Laptop - 6GB Memory - 500GB Hard Drive - Silver