May | 2022 | weaselhat

Program synthesis is an appealing notion: I give you a pretty good (but incomplete) description of a program, and you give me a working program back out. Flash Fill—program synthesis by example that ships in Excel—is the standard example that PL people trot out, and it is an incredible success. (Check out Sumit Gulwani’s POPL 2015 keynote (video in supplemental material) if you’ve never heard of it.)

Fiat Cryptography is another great example of program synthesis success. Where Flash Fill is programming by example, Fiat Cryptography is deductive synthesis: here, an expert manually refines a specification of arithmetic operations used in elliptic curve cryptography to achieve fast and verified implementations of the necessary primitives. Fiat Cryptography is used in BoringSSL, a part of Chrome and Android (check out the paper).

These two approaches are quite different: Flash Fill offers nearly instantaneous feedback as part of an existing interface; Fiat Cryptography is effectively a set of Coq lemmas and tactics, only approachable by experts. (You might wonder why they’re both called program synthesis! For that, I’d follow Sam Tobin-Hochstadt’s second-hand quote: “a synthesizer is just a compiler that doesn’t work”, that is, these tools are both synthesis because they take specifications as input and produce valid implementations as output, when they produce output… but they don’t always succeed in producing output.)

Flash Fill and Fiat Cryptography are tech-transfer success stories working on starkly different timescales: Flash Fill is an end-user synthesis that takes examples and produces Excel formulae in a fraction of a second; Fiat Cryptography is a tool for experts to generate extremely specialized code in a more interactive, involved way, possibly taking days to generate something acceptable. Given these remarkable differences, I wonder… when is program synthesis worthwhile?

A theory of effort

Suppose you’re trying to write some program, and you could (a) use program synthesis or (b) write the program by hand. Let’s say it takes E time to come up with an appropriate specification, S time to synthesize a solution, and I time to inspect and approve a solution. Conversely, suppose it takes P time to write the program and T time to test and approve it.

In general, synthesis is only worthwhile when E + S + I < P + T. Considering that most programmers are more familiar with conventional programming tasks and users are often reluctant to install and use a complex new tool, it’s likely that synthesis is only reasonable when E + S + I <<< P + T (we read <<< as “is much less than”). Let’s take this theory for a spin.

Flash Fill

Flash Fill is example based, where simple string transformations are canonical examples. Writing up some example string transformations might take a few seconds, i.e., E ~ 10. Synthesis is very fast—say, S ~ 1. Inspection might take longer—from a few seconds to a minute or so. Let’s be careful and say I ~ 30 (though your typical Excel user may be a little more YOLO about things).

Conversely, programming string manipulations in Excel are fiddly; I’d certainly have to spend a minute to find the right functions, figure out the annoying indexing bits (another minute—or more), and then go through and verify that I got it right. A user with less programming expertise might take longer. Let’s say P ~ 120 and T ~ 20—though many users will give up before figuring it out at all.

Here synthesis is much faster than writing the program by hand: we have 31 ~ 140. It’s a huge savings in time for an experienced programmer to use Flash Fill, never mind a user less familiar with programming.

Fiat Cryptography

Fiat Cryptography bakes in a set of techniques for deriving very efficient implementations of cryptographic arithmetic primitives. In this case E amounts to choosing an appropriate prime and its representation and encoding it in Coq. I pored over their commit logs, but I couldn’t find an example of new primes being introduced. Let’s say it takes a day, though I suspect it’s much less. (For example, it takes a few days for them to implement new optimizations—a new prime amounts to simply picking a representation.) S is the time it takes for Coq to compile the code—I haven’t tried it, but it takes about 1.5hrs to compile in their CI. Inspection is… free, or rather, the low, low price of trusting their trusted computing base (TCB)! Okay, okay: their validation in CI takes another 1.5hrs. ~~Trust but verify.~~ Verify but test.

On the other hand, what does it take to implement and verify operations for a new prime by hand? Months! And at the end of those months, one has less confidence in the final product compared to an unverified one; verifying the handwritten code will take months more.

Again, synthesis is substantially faster: a day or two, compared to many months. As a bonus, the output of synthesis is substantially more trustworthy than what you’d get rolling your own.

RLIBM

Another example of well situated, worthwhile synthesis is RLIBM, a collection of fast and correct elementary floating point functions. RLIBM doesn’t advertise itself as synthesis, but that’s exactly what it is: given a specification (e.g., log2f) and a correct oracle, they generate appropriate polynomials for approximation. A number of the talks are available online.

RLIBM is less of a ‘drop in’ replacement than Fiat Cryptography, but they’re already seeing plenty of success in getting their code into LLVM. Synthesis is slow—it can take quite some time for them to generate and validate polynomials (i.e., E, S, and I are on the order of days judging from their interactions with LLVM devs). But the value proposition is huge: their approach yields more efficient code, offers a single polynomial for multiple rounding modes and bits of precision, and is more correct than the state of the art. What’s more, the start of the art is also very expensive: Intel has a whole team of mathematical experts working on this problem (P is very, very high); for float32, I = T, since both cases use exhaustive checking.

New frontiers

Synthesis is worthwhile when its costs (coming up with a specification/examples E, running the synthesizer S, and inspecting the solution I) are substantially less than the cost of programming by hand (programming the solution P and testing it T). That is, synthesis is worthwhile when E + S + I <<< P + T.

My estimate is conservative: the framing assumes that all we get from synthesis is a replacement for programming, no better than what we’d produce ourselves. But for users without programming experience, Flash Fill doesn’t just speed things up, it makes new things possible. Similarly, Fiat Cryptography doesn’t just produce a fast C implementation of arithmetic primitives… its proofs add confidence that the result is correct! RLIBM is faster and more precise than existing solutions.

Performance changes how users use software. Fabian Giesen is quoted to similar effect in Dan Luu’s post on productivity and velocity; Shriram Krishnamurthi has made similar comments, too (Thanks to Sam Tobin-Hochstadt and Dan Luu for the links!) Such a maxim certainly applies here: synthesis opens new doors. Non-programmers get help writing simple programs; programmers save substantial time and effort. Either way, synthesis boosts productivity.

When is synthesis not worthwhile?

We’ve seen three examples of program synthesis that have clearly provided value in the real world. Flash Fill, Fiat Cryptography, and RLIBM are clearly worthwhile. When is synthesis not worthwhile?

Let’s consider the case of general-purpose programming. A variety of program synthesizers can generate code in functional or imperative languages. A common example in these settings is linked list reversal. A quadratic solution in a functional language is fairly simple if you know about append or snoc, but harder without it; efficient, linear versions in functional and imperative languages are a bit tougher.

Many academic tools are example based, in which case it’d be reasonable to say that E ~ T. For synthesis to be worthwhile, then, we would need S + I <<< P. That is, the time to run the synthesizer and inspect the resulting program must be substantially less than the time to write the program itself.

Writing a linear-time linked list reversal in OCaml takes me about 30s; less in Haskell. Imperative linked list reversal is a bit slower—say, 2.5min. Testing either of these won’t take long—a few seconds suffices in an FP REPL, or as much as a minute to write the JUnit in Java.

Existing synthesis tools can handily beat my programming time, so… what gives? Why aren’t program synthesis tools everywhere? I think there are a few reasons:

In fact, E > T. Looking at synthesis benchmarks, the tools that beat my time take many more examples than I would bother with. Tail-recursive reversal takes 14 examples in an older tool I know of… and that’s to generate the two-argument version, not the outer function that one might care about. If I know to generate the accumulator-passing type of the inner function, why do I need help writing list reverse?
I is large. I looked for research on how I and P relate, i.e., how long it takes to write a function as compared to understanding and agreeing with someone else’s implementation, but I couldn’t find much concrete information. I suspect that if the standard is ‘believing a program to be correct’, then I is on the same order as P when T is small, i.e., programs that can be easily tested are as hard to check as they are to write. As T increases, then I is on the order of T. (But who cares what I think? It’s a testable hypothesis!)
S should include the cost of invoking the tool… and maybe the amortized cost of acquiring the tool. Flash Fill just comes with Excel and is quite discoverable in the UI, but existing tools need to be downloaded, compiled, and installed, and they don’t have much in the way of editor integration.

Alternatively, consider GitHub’s Copilot. You write a little prose explanation and a function header, and it does the rest. Copilot seems to set E quite low, but I is comparatively high. What’s worse, E != T when you give Copilot textual descriptions rather than input/output examples. Copilot addresses HCI issues nicely—it’s easy to install and easy to use—but it produces particularly low-confidence code. For Copilot code, I’d say I = T, i.e., I’d want to thoroughly test anything that came out of Copilot. (And more importantly, for licensing reasons, I wouldn’t want to use anything that came out of Copilot.)

To sum up: synthesizing code for general purpose languages has E + S + I less than P + T in some cases, but not so much less than P + T that it seems worthwhile.

My examples of worthwhile synthesis offer two primary ways for synthesis to be worthwhile: either focus on fast-turnarounds and excellent usability (very low E, S, and I), or specialize in a domain where the state of the art is excruciatingly hard (P and T are huge and you have a story—verification, validation, etc.—for keeping I low). Flash Fill’s successful tech transfer depended heavily on HCI improvements. Fiat Cryptography’s and RLIBM’s success did not—but they target specialized domains.

Don’t be discouraged

Program synthesis offers a novel relationship between ~~programmers~~ people and computers. I am not at all saying to give up on forms of synthesis that don’t meet my criteria of E + S + I <<< P + T. Some forms of synthesis may not be worthwhile yet, but we don’t know what changes and opportunities the future holds!

Program synthesis is an exciting new area, and there’s lots of PL work on it. It’s unreasonable to expect every academic paper to be immediately applicable, so I don’t expect every proposed synthesizer to meet my criteria. It is important, however, to be realistic about where we stand. If you’re working on synthesis, think about how E, S, and I compare to P and T for you. If your approach doesn’t yet hit my criteria, what needs to change to make that happen?

Postscript

There’s some great discussion on the Twitter thread announcing this post.

Kleene algebra with tests is a beautiful, powerful framework for reasoning about programs. You can easily encode conventional While programs into KAT, and KAT enjoys decidable equality. Reasoning with KAT feels like you’re cheating Alan Turing himself: here we are, deciding nontrivial properties of programs!

The gist of KAT is that you write programs using a regular expression like notation: + for parallel composition, ; for sequential, and * for iteration. So you might encode:

while x > 0:
  y += 1
  x -= 1

As (xGt0; incY; decX)*; ¬xGt0, where xGt0 is a ‘test’ and incY and decX are ‘actions’. KAT’s equivalence decision procedure can prove that this program is equivalent to any finite unrolling of itself… neat!

NetKAT is the most impactful application of KAT: it’s an influential and successful academic project, and its ideas can already be found in numerous real, production systems. In light of NetKAT’s remarkable success… why don’t we apply KAT more often?

What’s hard about KAT?

On its own, KAT proves plenty of nice theorems, but none of them reason about particular program behaviors. In the code snippet above, xGt0, incY, and decX are uninterpreted—there’s no relationship between, say xGt0 and decX. That is, you might expect that ¬xGt0;decX;¬xGt0 is equivalent to ¬xGt0;decX, because decrementing a number less than or equal to 0 will yield a number that is also less than or equal to 0. The names of our tests and actions are suggestive, but KAT treats them absractly. If you want to reason about the semantics of your tests and actions, you need to build a custom, concrete KAT. NetKAT reasons about fields on packets, and doing so means building a particular, concrete KAT with particular actions. The original paper spends quite a bit of effort proving this new, custom KAT has a sound, complete, and decidable equivalence checking.

Worse still, KAT’s metatheory is very challenging. To create NetKAT, Nate Foster and company worked through closely related ideas for a few years before Nate joined Cornell and started working with Dexter Kozen, KAT’s progenitor. Only then did they realize that KAT would be a good fit, and they got to work on developing a concrete KAT—NetKAT. Unfortunately, “collaborate with Dexter” is an approach that doesn’t scale.

How to cook a KAT

In an upcoming PLDI 2022 paper, “Kleene Algebra Modulo Theories: A Framework for Concrete KATs”, Ryan Beckett, Eric Campbell, and I show how to generate a KAT over a given theory, i.e., a set of tests, actions, and their equational theory. We call the approach Kleene algebra modulo theories, or KMT. The paper covers quite a few examples:

booleans and bit vectors
monotonic natural numbers
unbounded sets and maps
NetKAT

What’s more, our approach allows for higher-order theories, like taking the product of two theories or using finite-time LTL to reason about another theory. (Our approach abstracts and generalizes Temporal NetKAT, which is just a concrete instance of our more general method.)

To build a KMT, you provide primitive tests and actions, along with weakest preconditions relating each pair of test and action. There’s an ordering requirement: a test must be no smaller than its preconditions. With these in hand, we’re able to automatically derive a KAT with good properties in a pay-as-you-go fashion:

If your theory is sound, the KAT is sound.
If your theory is complete, the KAT is complete.
If your theory’s satisfiability checking is decidable, we can derive a decision procedure for equivalence.

I’m particularly excited that our framework is prototype-ready: our code is implemented as an OCaml library, where you define theories as functors. Please try it out—mess around and write your own theories, following our examples. We hope that KMT will significantly lower the bar for entry, making it easier for more people to play around with KAT’s powerful equivalence checking.

What’s the catch?

There’s more than one way to cook a KAT. KMT generates KATs with tracing semantics, i.e., the exact trace of actions matters. In KAT+B! or NetKAT, later updates override earlier ones, e.g., x:=false; x:=true ? x:=true… but KMT will treat these terms differently, because they have different traces. KAT+B! deliberately avoids tracing; NetKAT only traces at predefined points, by means of their dup primitive, which marks the current state as historically salient. There’s no deep reason for KMT to use tracing, and we believe KMT can be generalized to support dup-like controls for tracing.

The ordering constraint on weakest preconditions is a strong one. Our natural numbers, sets, and maps must be monotonic: they may grow or shrink, but not both. They cannot be compared, e.g., two natural-valued variables x and y can be compared to constants but not each other.

KMT is also just a prototype. It’s fast for small programs, but it takes dedicated work to make a KAT’s decision procedure efficient enough on more serious examples.

Why are you talking about cooking KATs?

The greatest POPL paper of all time is Manna and Pnueli 1983, “How to cook a temporal proof system for your pet language”. Why? Just take a look a the first page:

The header of the paper offsets the author names to the right. A line drawing dominates the top: a dog wags its tail, tongue dripping eagerly in front of a kabob marked with "ADA" and "shared variable" and "CSP". — *I rest my case.*

Update

KMT won a distinguished paper award at PLDI!

weaselhat

Month: May 2022

When is program synthesis worthwhile?