pietro / posts / canonical-forms

[TODO create the repo]

Introduction

In math, we have equivalent relations. An equivalence relation ~ lets us write x ~ y to say that x and y are interchangeable for all practical purposes, even though they may look different. Consider fractions, for example, where we have that a/b ~ c/d whenever ad = bc. So, for example, 1/3 ~ 2/6 ~ 100/300, and as expected, when doing arithmetic, we can swap any of these forms for any other equivalent form and the result would be the same. So equivalence relations say that while the expressions might be syntactically different¹, they are semantically equivalent². While in math we can simply wave our hands and freely switch between different representations of the same object, when programming, we need to have a concrete representation of the objects. Consider the following:

@dataclass
class Frac:
  num: int
  den: int

a = Frac(1, 3)
b = Frac(2, 6)

assert(a == b) # fail!

³ would obviously fail, since a and b have different numbers for each of their fields.

Solution V1

One way is to add an eq method to check whether two expressions are equivalent.

def eq(a: Frac, b: Frac) -> bool:
    return a.num * b.den == a.den * b.num

This would work, since eq encodes precisely our definition of semantic equivalence. But this doesn’t solve all the problems. For example, if we wanted to add fractions to a hash set, we would need a hash function satisfying eq(a, b) => hash(a) == hash(b). With our current representation, there’s no obvious way to compute such a hash from the raw num and den fields alone, as the equivalence check is inherently pairwise, but hashing needs to produce a single consistent value per equivalence class.

Solution V2

Ideally, we would like for all expressions in the same equivalence class to have a common representation. Then, whenever we generate a new object, we can simply convert it to this shared representation, and use that instead. Now a == b would work out of the box, since a and b, upon creation, would have been converted to the same (syntactically equal) canonical form!⁴

[TODO ADD IMAGE HERE]

We call this shared representation the canonical form of the expression.⁵

The canonical form reduces semantic equivalence to syntactic equality, which is much easier to work with.

In our case, we can have the following canonical form:

def canonical(a: Frac) -> Frac:
  g = gcd(a.num, a.den)
  if a.den < 0:
    g = -g
  return Frac(a.num // g, a.den // g)

Let’s verify that this is a valid canonical form:

a/b ~ c/d means a*d = b*c, i.e. a = k*c and b = k*d for some integer k. Dividing both by gcd(a,b) = k*gcd(c,d) gives a/gcd(a,b) = c/gcd(c,d) and b/gcd(a,b) = d/gcd(c,d) — so both fractions reduce to the same thing.
Why the if statement? We need to canonicalize the sign: otherwise -1/2 and 1/-2 would produce different representations, even though they’re equivalent. Negating g when the denominator is negative ensures the denominator is always positive, with the sign carried by the numerator.[^]

Then, whenever we generate a new object, we simply call canonical on it.

a = canonical(Frac(1, 3))
b = canonical(Frac(2, 6))

assert(a == b) # success!!!

A side note

Eagle-eyed readers might point out that there are no safety checks against forgetting to call canonical and consequently having a rampant non-canonical object running amok in our program. To prevent this, we would embed the canonicalization into the constructor of the object, ensuring that whenever we create a new object, it is always in canonical form. But this doesn’t stop all the problems. For example:

def double(a: Frac) -> Frac:
  b = a.copy()
  b.num *= 2
  return b
a = canonical(Frac(1, 2))
b = double(a)
print(b) # Frac(2, 2) oops! should be Frac(1, 1)

If we wanted to ensure that the object is always in canonical form, we have two options:

Forbid direct modification of the object (b.num *= 2), by making all the objects immutable, thus forcing all instances to use the constructor (and thus go through the canonicalizer).
Guarantee that the output (and side effects) of any function we write always produces a canonical form (so composition will also preserve canonicality).

Awesome! We can now use our dear == guilt-free!

A harder example: Series-Parallel Structures

For those who are still interested, we can consider a more complex example:

Definition

Series Parallel Expressions (SPE). can be defined recursively as follows. A SPE is either:

Two SPEs in series: S(SPE, SPE)
Two SPEs in parallel: P(SPE, SPE)
An integer: int

[ADD IMAGE HERE, has 2 examples]

From the picture, you might be able to infer what properties we define the SPEs to have:

Series is associative
Parallel is associative and commutative Due to these properties, there are many different ways of representing the same SPE:

[ADD OTHER IMAGE, fully series graphm fully parallel, and fully mixed, with below all the equivalent expressions].

Can we find a canonical form for these expressions? Lets see!

First Try

In this representation, we have different ways of representing the same expression: P(P(1, 2), 3), P(3, P(1, 2)), ... We want to canonicalize them. Let’s see what rewriting rules are needed to do this:

Associativity: P(1, P(2, 3)) == P(P(1, 2), 3). To canonicalize this, we can use the following rewrite rule:

P(a, P(b, c)) -> P(P(a, b), c)

(Note that a, b, c don’t have to be ints; they can be any sub-expression.) Series composition is also associative, so we can do the same: S(a, S(b, c)) -> S(S(a, b), c)

Commutativity: P(1, 2) == P(2, 1). To canonicalize this, we can use the following rewrite rule:

P(b, a) -> P(a, b) whenever a < b. This is nice and easy if we have a and b as ints, but what if we have a and b as sub-expressions? We have to impose a total order ⁶ across all possible expressions. Let’s call it P_<(a, b), which is true if a < b and false otherwise. One possible order is as follows: if a, b are different types: ints < series < parallel if a, b are the same type: ints: a < b if (well) a < b series: S(a1, b1) < S(a2, b2) lexicographically, i.e. compare a1 vs a2 first, and if equal, compare b1 vs b2. parallel: same lexicographic comparison, but since P(a, b) = P(b, a), we first normalize each side by putting the smaller element first, then compare lexicographically.

So does this work? NO! Not only does this system fail to produce unique normal forms — it doesn’t even terminate! Consider P(3, P(1, 2)):

P(3, P(1, 2)) →[assoc] P(P(3, 1), 2) →[comm: 2 < P(3,1)] P(2, P(3, 1)) →[assoc] P(P(2, 3), 1) →[comm: 1 < P(2,3)] P(1, P(2, 3)) →[assoc] P(P(1, 2), 3) →[comm: 3 < P(1,2)] P(3, P(1, 2)) → …

We’re back where we started! Associativity pushes things left, but commutativity (since int < parallel) keeps swapping them back right. The two rules fight each other forever. Let’s try a second approach!

Second Try

We see that a lot of issues arise from the fact that we are representing S and P as binary operators. But since they are associative, they should really be n-ary operators (so operating over a vector). Furthermore, P is commutative, so it should really be operating over a multiset ⁷.

Lets start with our updated representation:

SP = NSeries([SP, …]) | NParallel({SP, …}) | int | None

Where […]: represents a vector of SP expressions, and {…}: represents a multiset of SP expressions.

This introduces another set of tricky canonicalization issues: P(P({1})) = P(1) = P(1, S({}))⁸

How do we know it works?

But how do we know that, given two expressions which we know to be semantically equivalent, both are reduced to the same canonical form? We have to prove two things:

Termination
Local confluence

Termination:

Local confluence:

This, by the […] theorem (why does it work?), gives us convergence! Meaning that …

Conclusion

Canonicalization is harder than one might think!

HOw much in the strucutre, how much in the behavior?

Use the type as much as possible to help you enforce canonicalization. For example, in the n-ary case we could have used a vector instead of a multiset for NParallel, but this would have required additional canonicalization rules to enforce commutativity between all the terms (something analogous to our P_<(a, b) order, where we swap 2 consecutive elements if the first one is greater than the second one. Doing this a bunch of times effectively sorts the arguments of P). Analogously, we can enforce that S and P have more than 1 argument by using something like NonEmpty, where our new vector, instead of having Series = [SP], we have Series = SP SP [SP], i.e. SP has 2 elements and then some more (possibly 0). Note that the type forces us to have at least 2 elements, we cannot possibly have a Series object with less than 2 elements. Thank you compiler, very cool! For more complex representations, you cna always ramp this up to the extreme, and have the type system enforce ALL the constrains that you care about (e.g. with something like Lean), but it’s a balacing act after all.

If you want to know more about this, look up Make Illegal States Unrepresentable and Parse Don’t Validate.

[^] One other option is to represent the fraction as (sign, num, den), and have num and den be positive.

syntactically, a.k.a. they are written out the same way. $\frac{1}{2}$ and $\frac{2}{4}$ are syntactically different, since they are not the same string. ↩
semantically, a.k.a. they have the same meaning under some predefined interpretation. $\frac{1}{2}$ and $\frac{2}{4}$ are semantically equivalent, since they represent the same number. ↩
Why not write a = 1/3 using exact division? Floats don’t do exact division (can’t blame them, they are just approximations of the real numbers after all), so (try it at home!) 1/3 = 0.3333333333333333, while 0.1/0.3 = 0.33333333333333337. In general, this is the reason why we would want to represent a fraction as a pair of integers in the first place. ↩
In case we wanted to still keep the original representation, we can keep the object the same and simply convert on the fly to canonical form whenever needed (so for example, instead of a == b, we would have canonical(a) == canonical(b), or instead of a in S, we would have canonical(a) in S etc…). ↩
Canonical Form: unique representative of the equivalence class. Normal Form: expression that cannot be reduced any further under given some transformation rules. As you’ll see later, we want to find a way to transform a given expression such that, all equivalent expressions, are reduced to the same normal form (and thus the normal form is also the canonical form). ↩
Total order: i.e. I can always compare two expressions and say which one is smaller. ↩
A multiset is a set where elements can appear multiple times. ↩
Why do we need a None now? Because now a valid SP expression can be the S({}), which is just an empty graph. We didn’t have this problem before because Series and Parallel always took exactly 2 operators, but it’s a problem now. ↩