Answer set programming (ASP) is the powerhouse technology you’ve never heard of

The first person to explain answer set programming (ASP, pronounced ‘ay ess pee’) to me (Joe Osborn) told me about the three line implementation of graph coloring and the termination guarantees. Like any PL person, I recoiled in horror: you have an always-terminating language where I can solve NP-complete problems in three lines? So every time I write a program of three lines or more, I have to wonder… how exponential is this? No thank you!

I was wrong. Answer set programming is fast and easy to use; it’s a cousin to SAT that offers different tradeoffs than SMT, and I think PL folks have a lot to gain from ASP. As some evidence in that direction, let me tell you about an upcoming paper of mine appearing at POPL 2023, with Aaron Bembenek and Steve Chong.

From SMT to ASP: Solver-Based Approaches to Solving Datalog Synthesis-as-Rule-Selection Problems
Aaron Bembenek, Michael Greenberg, and Stephen Chong (POPL 2023) (artifact

Datalog synthesis-as-rule-selection

Our paper is, on the face of it, about a program synthesis problem for Datalog: given a collection of candidate rules and an input/output example, select the candidate rules that transform the input to the output. The previous state-of-the-art (ProSynth) used a CEGIS loop: use Z3 to guess some rules, try it in Soufflé, use why- and why-not provenance to feed information back to Z3 to make a better guess. Our work gives three new solutions to the problem:

  • Datalog as a monotonic theory in SMT. Monotonic theories get a big performance boost, and modern solvers like Z3 and CVC4 support them. And Datalog is the monotonic theory ne plus ultra: we [read: Aaron] wrote Z3 and CVC4 plugins that turn any Datalog program into a monotonic theory. You can use this to do the CEGIS loop with a single call to SAT (but many calls to Datalog).
  • Using loop formulae to force SMT to find a least stable model. Every Datalog program has a logical denotation as the least model satisfying its Clark completion. Borrowing from ASP implementation techniques, we can use the Clark completion extended with loop formulae to rule out the hallucinatory models SMT is prone to finding. You can use this approach to do the CEGIS loop with not so many calls to Datalog, but possibly many calls to SAT.
  • Just encode it in ASP. The conventional ASP implementation strategy (grounder and solver) found in tools like clingo admits a direct encoding of the synthesis problem; our translator is just 187 SLOC. ASP lets you name that tune in just one note: you call the grounder, then you call the solver, and then you’re done.

The gist of it is that the ASP solution is not only vastly simpler than the others, it outstrips them in performance, showing a ~9x geomean speedup compared to the state of the art. (I wrote previously about how to summarize early versions of these numbers.) Practically speaking, the ASP synthesizer always returns in under a second, while every other solution shows slowdowns on some of the benchmarks, taking tens of seconds or even timing out at ten minutes. There’s lots more detail in the paper.

I should add that the existing benchmarks are pretty easy: the examples are small, with example facts and candidate rules numbering in the low hundreds. In this setting, anything more than a few seconds isn’t worthwhile. We have some more criticism of the problem setting in the paper, but I can sum up my feeling as: when confronted with the benchmark challenge of implementing strongly connected components (SCC) in Datalog, who has the technical capacity to (a) write up an input/output example for SCC for a non-trivial graph, (b) generate several hundred candidate rules, (c) install and run a program synthesis tool, and (d) verify that the resulting program generalizes but (e) lacks the technical capacity to simply implement SCC in Datalog? Transitive closure is the “Hello World” of Datalog; strongly connected components is the “Fahrenheit-to-Celsius” of Datalog. The benchmarks don’t characterize a problem that anyone needs to solve.

Answer Set Programming

While I’m proud of the contributions outlined above, I’m most excited about introducing ASP to a broader audience. It’s a historical accident that ASP is more or less confined to the AI subfield, where a lot of great SAT and SMT research continues apace—not to mention the program synthesis work published at venues like AAAI and IJCAI under the umbrella of ‘inductive logic programming’. (Any PL person working on synthesis should look closely at Popper and ILASP.)

Our paper goes into more detail, but I like to contrast ASP with SMT by talking about justification. ASP will find solutions that are in some sense well justified by your problem statement (formally, it will find stable models); SMT has no such guarantee.

SMT solvers do an incredible job of finding witnesses for existentials: just (declare-const x Foo) and SMT will find your Foo, no problem. Well, maybe a problem: SMT’s approach to finding an x of type Foo is to go into the desert on a vision quest—it’ll keep hallucinating new values for x until it finds a satisfying one. On the plus side, that means SMT can find an x that doesn’t appear anywhere in the constraints you gave it. On the minus side, that means SMT is straight up hallucinating values—and you’re the one controlling set and setting to make sure SMT stays this side of sanity.

ASP never hallucinates. If something is in an answer set, it’s there for a reason. ASP’s non-monotonic semantics lets you freely mix negation and recursion (unlike Datalog, which forces negation to be stratified); the stable model semantics guarantees you won’t get nonsensical, circular answers that justify their premises with their conclusions.

Put simply: ASP is a SAT-like discipline that lets you work efficiently and correctly with inference rules. SMT is a SAT-like discipline that lets you work efficiently and correctly with existentials over equality and theory predicates. PL work needs both styles of reasoning; as we’ve shown, ASP can bring simple solutions and startling efficiency over complex appoaches using SMT. Maybe it’ll work for your problem, too?

What’s hard about grad school?

Part of what’s hard about grad school is that things are undirected. In undergrad, you have well delimited homework assignments, maybe a project or two. But in grad school everything is open ended, and you’re lost at sea!

And that broad scope—confronting the unknown and feeling stupid—really is part of what’s hard. But in this lecture about writing, Larry McEnerney makes the point that there’s something even harder about grad school than its broad scope. Grad school is hard because it’s about adopting new values.


Undergrad is really all about skills. You’re paying money to get a guided tour through certain skills, with an experienced master paying attention to you, i.e., observing, helping, grading, mentoring, and so on. As Larry says (with perhaps a little too much glee), you’re paying for faculty to pay attention—your work product is not something they want, but rather they are paid to evaluate it and help you improve.

Graduate school teaches you skills, too. In a more intense way—you’re one of a few apprentices at the foot of a master. At least in CS, or at least the way I’ve seen it work best.

Beyond skills

But graduate school isn’t just skills. Graduate school is also the more complex process of joining a community—getting your “union card”. To be successful, grad students must learn and adopt the unwritten codes of their new community. It is hard work to discover and understand what the community values, and early grad students often confuse “novel and difficult application of a tool or technique” for “valuable contribution to the community”.

For example, in programming languages (PL) it’s common for early grad students to revel in complicated notation, subtle formalism, and the details of proofs. Many choose grad school for the novelty—to be the first to know something! And, to be sure, formal notation and novel proofs are fun things. And they are important things! If you get these things wrong, you’re sunk. But caring about these things most is missing the point, and will lead to a frustrating experience writing the paper, reading reviews, writing the response, figuring out what to do next, and so on.

Larry’s point of view is a realist one. Disturbingly realist: he says he’s accused of ‘fascism’ for asking you to identify powerful stakeholders in the community and to cater to them. But he is right: you can only publish papers that reviewers think are valuable, and you will have to cater to their needs to get them to feel that way.

For a PL-specific take, Laurie Tratt’s essay on “What Makes a Good Research Proposal?” uses a value-oriented framework:

…a good proposal must address three fundamental questions:

1. What is the problem being tackled?
2. Why is the problem worth being tackled?
3. What is the insight that makes tackling the problem plausible?

Although the third of these is by far the most important, the first two set up the necessary context to understand the third.

First, note that (2) asks, “Why is it valuable?”. I would go further and say that (2) and (3) are equally important. A plausible approach on a pointless problem doesn’t get you very far.

As someone who recoiled in disgust at even the notion of applied research in grad school, I’ve come completely around to Larry McEnerney’s point of view. None of the formalism matters if it doesn’t do something. Proofs aren’t worth the wildly overengineered LaTeX macros they’re written in if the properties they ensure don’t matter to anyone. I hated this kind of talk as a graduate student, and yet here I am. (In related news, I tell my undergraduate students to take notes… and I didn’t take a single note in class in undergrad. If I had a lawn, I would be asking people to get off it.)

Good news

The good news is that it is within your power to move the field. You have to accommodate the community where it is, but you have a say in where it’s going. In 2017, I told Benjamin Pierce that I was planning to work on the POSIX shell. He laughed, saying he thought it was “irredeemably bad”. At an OBT talk at POPL that year, Phil Wadler said most of my planned work wasn’t even PL. In 2020, I published Smoosh at POPL; in 2021, we had a paper and a panel at HotOS on the shell; in 2022, we have a paper at OSDI.

You don’t have to give up your vision or your voice. In fact, I think my own voice is a key asset in moving the community. Compelling writing and presentations are persuasive! And writing is exactly where Larry picks up.

an early Moog modular synthesizer

When is program synthesis worthwhile?

Program synthesis is an appealing notion: I give you a pretty good (but incomplete) description of a program, and you give me a working program back out. Flash Fill—program synthesis by example that ships in Excel—is the standard example that PL people trot out, and it is an incredible success. (Check out Sumit Gulwani’s POPL 2015 keynote (video in supplemental material) if you’ve never heard of it.)

Fiat Cryptography is another great example of program synthesis success. Where Flash Fill is programming by example, Fiat Cryptography is deductive synthesis: here, an expert manually refines a specification of arithmetic operations used in elliptic curve cryptography to achieve fast and verified implementations of the necessary primitives. Fiat Cryptography is used in BoringSSL, a part of Chrome and Android (check out the paper).

These two approaches are quite different: Flash Fill offers nearly instantaneous feedback as part of an existing interface; Fiat Cryptography is effectively a set of Coq lemmas and tactics, only approachable by experts. (You might wonder why they’re both called program synthesis! For that, I’d follow Sam Tobin-Hochstadt’s second-hand quote: “a synthesizer is just a compiler that doesn’t work”, that is, these tools are both synthesis because they take specifications as input and produce valid implementations as output, when they produce output… but they don’t always succeed in producing output.)

Flash Fill and Fiat Cryptography are tech-transfer success stories working on starkly different timescales: Flash Fill is an end-user synthesis that takes examples and produces Excel formulae in a fraction of a second; Fiat Cryptography is a tool for experts to generate extremely specialized code in a more interactive, involved way, possibly taking days to generate something acceptable. Given these remarkable differences, I wonder… when is program synthesis worthwhile?

A theory of effort

Suppose you’re trying to write some program, and you could (a) use program synthesis or (b) write the program by hand. Let’s say it takes E time to come up with an appropriate specification, S time to synthesize a solution, and I time to inspect and approve a solution. Conversely, suppose it takes P time to write the program and T time to test and approve it.

In general, synthesis is only worthwhile when E + S + I < P + T. Considering that most programmers are more familiar with conventional programming tasks and users are often reluctant to install and use a complex new tool, it’s likely that synthesis is only reasonable when E + S + I <<< P + T (we read <<< as “is much less than”). Let’s take this theory for a spin.

Flash Fill

Flash Fill is example based, where simple string transformations are canonical examples. Writing up some example string transformations might take a few seconds, i.e., E ~ 10. Synthesis is very fast—say, S ~ 1. Inspection might take longer—from a few seconds to a minute or so. Let’s be careful and say I ~ 30 (though your typical Excel user may be a little more YOLO about things).

Conversely, programming string manipulations in Excel are fiddly; I’d certainly have to spend a minute to find the right functions, figure out the annoying indexing bits (another minute—or more), and then go through and verify that I got it right. A user with less programming expertise might take longer. Let’s say P ~ 120 and T ~ 20—though many users will give up before figuring it out at all.

Here synthesis is much faster than writing the program by hand: we have 31 ~ 140. It’s a huge savings in time for an experienced programmer to use Flash Fill, never mind a user less familiar with programming.

Fiat Cryptography

Fiat Cryptography bakes in a set of techniques for deriving very efficient implementations of cryptographic arithmetic primitives. In this case E amounts to choosing an appropriate prime and its representation and encoding it in Coq. I pored over their commit logs, but I couldn’t find an example of new primes being introduced. Let’s say it takes a day, though I suspect it’s much less. (For example, it takes a few days for them to implement new optimizations—a new prime amounts to simply picking a representation.) S is the time it takes for Coq to compile the code—I haven’t tried it, but it takes about 1.5hrs to compile in their CI. Inspection is… free, or rather, the low, low price of trusting their trusted computing base (TCB)! Okay, okay: their validation in CI takes another 1.5hrs. Trust but verify. Verify but test.

On the other hand, what does it take to implement and verify operations for a new prime by hand? Months! And at the end of those months, one has less confidence in the final product compared to an unverified one; verifying the handwritten code will take months more.

Again, synthesis is substantially faster: a day or two, compared to many months. As a bonus, the output of synthesis is substantially more trustworthy than what you’d get rolling your own.


Another example of well situated, worthwhile synthesis is RLIBM, a collection of fast and correct elementary floating point functions. RLIBM doesn’t advertise itself as synthesis, but that’s exactly what it is: given a specification (e.g., log2f) and a correct oracle, they generate appropriate polynomials for approximation. A number of the talks are available online.

RLIBM is less of a ‘drop in’ replacement than Fiat Cryptography, but they’re already seeing plenty of success in getting their code into LLVM. Synthesis is slow—it can take quite some time for them to generate and validate polynomials (i.e., E, S, and I are on the order of days judging from their interactions with LLVM devs). But the value proposition is huge: their approach yields more efficient code, offers a single polynomial for multiple rounding modes and bits of precision, and is more correct than the state of the art. What’s more, the start of the art is also very expensive: Intel has a whole team of mathematical experts working on this problem (P is very, very high); for float32, I = T, since both cases use exhaustive checking.

New frontiers

Synthesis is worthwhile when its costs (coming up with a specification/examples E, running the synthesizer S, and inspecting the solution I) are substantially less than the cost of programming by hand (programming the solution P and testing it T). That is, synthesis is worthwhile when E + S + I <<< P + T.

My estimate is conservative: the framing assumes that all we get from synthesis is a replacement for programming, no better than what we’d produce ourselves. But for users without programming experience, Flash Fill doesn’t just speed things up, it makes new things possible. Similarly, Fiat Cryptography doesn’t just produce a fast C implementation of arithmetic primitives… its proofs add confidence that the result is correct! RLIBM is faster and more precise than existing solutions.

Performance changes how users use software. Fabian Giesen is quoted to similar effect in Dan Luu’s post on productivity and velocity; Shriram Krishnamurthi has made similar comments, too (Thanks to Sam Tobin-Hochstadt and Dan Luu for the links!) Such a maxim certainly applies here: synthesis opens new doors. Non-programmers get help writing simple programs; programmers save substantial time and effort. Either way, synthesis boosts productivity.

When is synthesis not worthwhile?

We’ve seen three examples of program synthesis that have clearly provided value in the real world. Flash Fill, Fiat Cryptography, and RLIBM are clearly worthwhile. When is synthesis not worthwhile?

Let’s consider the case of general-purpose programming. A variety of program synthesizers can generate code in functional or imperative languages. A common example in these settings is linked list reversal. A quadratic solution in a functional language is fairly simple if you know about append or snoc, but harder without it; efficient, linear versions in functional and imperative languages are a bit tougher.

Many academic tools are example based, in which case it’d be reasonable to say that E ~ T. For synthesis to be worthwhile, then, we would need S + I <<< P. That is, the time to run the synthesizer and inspect the resulting program must be substantially less than the time to write the program itself.

Writing a linear-time linked list reversal in OCaml takes me about 30s; less in Haskell. Imperative linked list reversal is a bit slower—say, 2.5min. Testing either of these won’t take long—a few seconds suffices in an FP REPL, or as much as a minute to write the JUnit in Java.

Existing synthesis tools can handily beat my programming time, so… what gives? Why aren’t program synthesis tools everywhere? I think there are a few reasons:

  • In fact, E > T. Looking at synthesis benchmarks, the tools that beat my time take many more examples than I would bother with. Tail-recursive reversal takes 14 examples in an older tool I know of… and that’s to generate the two-argument version, not the outer function that one might care about. If I know to generate the accumulator-passing type of the inner function, why do I need help writing list reverse?
  • I is large. I looked for research on how I and P relate, i.e., how long it takes to write a function as compared to understanding and agreeing with someone else’s implementation, but I couldn’t find much concrete information. I suspect that if the standard is ‘believing a program to be correct’, then I is on the same order as P when T is small, i.e., programs that can be easily tested are as hard to check as they are to write. As T increases, then I is on the order of T. (But who cares what I think? It’s a testable hypothesis!)
  • S should include the cost of invoking the tool… and maybe the amortized cost of acquiring the tool. Flash Fill just comes with Excel and is quite discoverable in the UI, but existing tools need to be downloaded, compiled, and installed, and they don’t have much in the way of editor integration.

Alternatively, consider GitHub’s Copilot. You write a little prose explanation and a function header, and it does the rest. Copilot seems to set E quite low, but I is comparatively high. What’s worse, E != T when you give Copilot textual descriptions rather than input/output examples. Copilot addresses HCI issues nicely—it’s easy to install and easy to use—but it produces particularly low-confidence code. For Copilot code, I’d say I = T, i.e., I’d want to thoroughly test anything that came out of Copilot. (And more importantly, for licensing reasons, I wouldn’t want to use anything that came out of Copilot.)

To sum up: synthesizing code for general purpose languages has E + S + I less than P + T in some cases, but not so much less than P + T that it seems worthwhile.

My examples of worthwhile synthesis offer two primary ways for synthesis to be worthwhile: either focus on fast-turnarounds and excellent usability (very low E, S, and I), or specialize in a domain where the state of the art is excruciatingly hard (P and T are huge and you have a story—verification, validation, etc.—for keeping I low). Flash Fill’s successful tech transfer depended heavily on HCI improvements. Fiat Cryptography’s and RLIBM’s success did not—but they target specialized domains.

Don’t be discouraged

Program synthesis offers a novel relationship between programmers people and computers. I am not at all saying to give up on forms of synthesis that don’t meet my criteria of E + S + I <<< P + T. Some forms of synthesis may not be worthwhile yet, but we don’t know what changes and opportunities the future holds!

Program synthesis is an exciting new area, and there’s lots of PL work on it. It’s unreasonable to expect every academic paper to be immediately applicable, so I don’t expect every proposed synthesizer to meet my criteria. It is important, however, to be realistic about where we stand. If you’re working on synthesis, think about how E, S, and I compare to P and T for you. If your approach doesn’t yet hit my criteria, what needs to change to make that happen?


There’s some great discussion on the Twitter thread announcing this post.

How to cook a KAT for your pet theory

Kleene algebra with tests is a beautiful, powerful framework for reasoning about programs. You can easily encode conventional While programs into KAT, and KAT enjoys decidable equality. Reasoning with KAT feels like you’re cheating Alan Turing himself: here we are, deciding nontrivial properties of programs!

The gist of KAT is that you write programs using a regular expression like notation: + for parallel composition, ; for sequential, and * for iteration. So you might encode:

while x > 0:
  y += 1
  x -= 1

As (xGt0; incY; decX)*; ¬xGt0, where xGt0 is a ‘test’ and incY and decX are ‘actions’. KAT’s equivalence decision procedure can prove that this program is equivalent to any finite unrolling of itself… neat!

NetKAT is the most impactful application of KAT: it’s an influential and successful academic project, and its ideas can already be found in numerous real, production systems. In light of NetKAT’s remarkable success… why don’t we apply KAT more often?

What’s hard about KAT?

On its own, KAT proves plenty of nice theorems, but none of them reason about particular program behaviors. In the code snippet above, xGt0, incY, and decX are uninterpreted—there’s no relationship between, say xGt0 and decX. That is, you might expect that ¬xGt0;decX;¬xGt0 is equivalent to ¬xGt0;decX, because decrementing a number less than or equal to 0 will yield a number that is also less than or equal to 0. The names of our tests and actions are suggestive, but KAT treats them absractly. If you want to reason about the semantics of your tests and actions, you need to build a custom, concrete KAT. NetKAT reasons about fields on packets, and doing so means building a particular, concrete KAT with particular actions. The original paper spends quite a bit of effort proving this new, custom KAT has a sound, complete, and decidable equivalence checking.

Worse still, KAT’s metatheory is very challenging. To create NetKAT, Nate Foster and company worked through closely related ideas for a few years before Nate joined Cornell and started working with Dexter Kozen, KAT’s progenitor. Only then did they realize that KAT would be a good fit, and they got to work on developing a concrete KAT—NetKAT. Unfortunately, “collaborate with Dexter” is an approach that doesn’t scale.

How to cook a KAT

In an upcoming PLDI 2022 paper, “Kleene Algebra Modulo Theories: A Framework for Concrete KATs”, Ryan Beckett, Eric Campbell, and I show how to generate a KAT over a given theory, i.e., a set of tests, actions, and their equational theory. We call the approach Kleene algebra modulo theories, or KMT. The paper covers quite a few examples:

  • booleans and bit vectors
  • monotonic natural numbers
  • unbounded sets and maps
  • NetKAT

What’s more, our approach allows for higher-order theories, like taking the product of two theories or using finite-time LTL to reason about another theory. (Our approach abstracts and generalizes Temporal NetKAT, which is just a concrete instance of our more general method.)

To build a KMT, you provide primitive tests and actions, along with weakest preconditions relating each pair of test and action. There’s an ordering requirement: a test must be no smaller than its preconditions. With these in hand, we’re able to automatically derive a KAT with good properties in a pay-as-you-go fashion:

  • If your theory is sound, the KAT is sound.
  • If your theory is complete, the KAT is complete.
  • If your theory’s satisfiability checking is decidable, we can derive a decision procedure for equivalence.

I’m particularly excited that our framework is prototype-ready: our code is implemented as an OCaml library, where you define theories as functors. Please try it out—mess around and write your own theories, following our examples. We hope that KMT will significantly lower the bar for entry, making it easier for more people to play around with KAT’s powerful equivalence checking.

What’s the catch?

There’s more than one way to cook a KAT. KMT generates KATs with tracing semantics, i.e., the exact trace of actions matters. In KAT+B! or NetKAT, later updates override earlier ones, e.g., x:=false; x:=true ? x:=true… but KMT will treat these terms differently, because they have different traces. KAT+B! deliberately avoids tracing; NetKAT only traces at predefined points, by means of their dup primitive, which marks the current state as historically salient. There’s no deep reason for KMT to use tracing, and we believe KMT can be generalized to support dup-like controls for tracing.

The ordering constraint on weakest preconditions is a strong one. Our natural numbers, sets, and maps must be monotonic: they may grow or shrink, but not both. They cannot be compared, e.g., two natural-valued variables x and y can be compared to constants but not each other.

KMT is also just a prototype. It’s fast for small programs, but it takes dedicated work to make a KAT’s decision procedure efficient enough on more serious examples.

Why are you talking about cooking KATs?

The greatest POPL paper of all time is Manna and Pnueli 1983, “How to cook a temporal proof system for your pet language”. Why? Just take a look a the first page:

The header of the paper offsets the author names to the right. A line drawing dominates the top: a dog wags its tail, tongue dripping eagerly in front of a kabob marked with "ADA" and "shared variable" and "CSP".
I rest my case.


KMT won a distinguished paper award at PLDI!

Summarizing performance numbers

How should we summarize performance numbers? In a recent benchmark run, I had some interesting speedup numbers that I wasn’t certain how to report. While it’s easy to make charts that are illuminating, I’m not certain what I should say in, e.g., an abstract.

Here’s the raw data (also available as a spreadsheet), noting that I’ve made everything as abstract as I can:

In the data, I’ve recorded the runtime of 2 tools (tool1 and tool2) on 40 tests. The tests are lettered by theme, with a number to distinguish tests that are somehow related. Each runtime in the table is in seconds, and is the arithmetic mean of three runs exhibiting nominal variation. I run tool1 in two configurations: tool1 simply solves the problem, while tool1.min tries to solve the problem “minimally” in some sense. I run tool2 in only one configuration. In the spreadsheet, I’ve calculated a few summary statistics for each column. Here are the summary statistics for tool1 vs. tool2:

Arithmetic mean156.84
Geometric mean12.64
Harmonic mean4.49
Summary statistics of tool1’s speedup compared to tool2

Doing some cursory analysis in R, it’s easy to generate charts that give a pretty good feel for the data. (It’s all in mgree/summarizing-perf on GitHub.) Here’s a box plot of times:

boxplots showing runtimes for tool1, tool1.min, and tool2. tool1 is the tightest, lowest box; tool1.min is a little higher but has the same median; tool2 is substantially higher (worse) than the other two.

And here’s a violin plot of times:

a violin plot shows that tool1 and tool1.min are chonkiest around 0.05s, while tool2 has wide variation

I would summarize these charts in text as, “tool1 is an order of magnitude faster than tool2; minimization closes some of the gap, but tool1.min is still substantially faster than tool2”. A bar chart tells the same story:

a bar chart comparing tool1, tool1.min, and tool2 across all tests. tool2 is only rarely competitive with tool1 (i.e., withing half an order of magnitude). tool1.min does worse than tool1, but still typically beats tool2.

With the bar chart, it’s possible to see that sometimes tool2 is in the same league as tool1, but not usually. We have tool2 beating tool1.min only once (test w); it never beats tool1, and typically loses by 1-2 orders of magnitude. Some cases are catastrophically bad.

Plotting speedup lets us justify some other comments. Here’s a scatter plot:

a scatter plot showing speedups for tool1 and tool1.min. a single point for tool1 is on the 1; all others are above. a single point for tool1.min is below the 1; all others are above.

And here’s a boxplot of speedups in aggregate:

a boxplot summarizing speedups of tool1 and tool1.min compared to tool2. the whisker for tool1 stops at 1; the whisker for tool2 goes just below. the medians and quartiles are more or less comparable, with tool1 doing a touch better than tool1.min

Looking at these speedups, I’d feel comfortable saying that “tool1 is typically an order of magnitude faster than tool2, never slower, and sometimes much faster; tool1.min behaves similarly, though it can sometimes be slower”.

This post comes out of a Twitter thread, which is a goldmine of insight and resources. Please check it out, and chime here or in the thread with your thoughts!

Special thanks to Noam Ross for help with some of the fancier log-scale stuff on the plots.

Bridging the gradual typing gap at OOPSLA 2021

I want to believe in a future where the lion will lie down with the lamb; we’ll beat our swords into plowshares; and developers will migrate dynamic prototypes to robust static systems with confidence. But these Aquarian visions are elusive. Having a map of the road to paradise in theory doesn’t mean we know how to get there in practice. Let me tell you about two papers at OOPSLA that shuffle us a few steps forward on this long pilgrim’s trail.

A vintage poster of "Hair", the American Tribal Love Rock Musical, with a trippy inverted head. This poster advertises a performance at the Aquarius Theatre in Los angeles.

Migrating programs

How do you actually get a program from Scheme into ML? Or from JavaScript into TypeScript? The theory of gradual typing goes far beyond these pedestrian questions. In principle, we know how to reconcile dynamism with much more complex systems, like information flow or refinement types or effect systems. But there’s very little tooling to support moving any particular Scheme program into ML. (If your program is a Racket program, then you’re in some luck.)

People have studied program migration before, under a variety of names. Papers go back at least to 2009, arguably even earlier. There are lots of different approaches, and most comprise some form of type inference and custom constraint solving—complex! Worse still, there’s been no consensus on how to evaluate these systems. Luna Phipps-Costin, Carolyn Jane Anderson, me, and Arjun Guha dug into program migration. Our paper, “Solver-based Gradual Type Migration”, tries to build a map of the known territory so far:

  1. There are competing desiderata: maximal type precision, compatibility with code at different types, and preserving the existing semantics of your program, i.e., safety.
  2. We evaluate a variety of past techniques on prior benchmarks, and we devise a novel set of “challenge” problems. Our evaluation framework is robust, and you could plug in other approaches to type migration and evaluate them easily.
  3. We introduce a new, very simple approach to type migration, which we call TypeWhich. TypeWhich uses an off-the-shelf SMT solver. You can choose how compatible/precise you want it to be, but it’ll always be safe.

I’m excited about each of these contributions, each for its own reason.

For (1), I’m excited to formally explain that what you’re actually trying to do with your code matters. “Gradual typing” sensu lato is pretty latus indeed. Are you migrating a closed system, module by module? Or are you coming up with type annotations for a library that might well be called by untyped clients? These are very different scenarios, and you probably want your type migration algorithm to do different things! Bringing in these competing concerns—precision, compatibility, and safety—gives researchers a way to contextualize their approaches to type migration. (All that said, to me, safety is paramount. I’m not at all interested in a type migration that takes a dynamic program that runs correctly on some input and produces a statically typed program that fails on the same input… or won’t even compile! That doesn’t sound very gradual to me.)

For (2), I’m excited to be building a platform for other researchers. To be clear, there’s a long way to go. Our challenge problems are tiny toys. There’s a lot more to do here.

For (3), I’m excited to have an opportunity to simplify things. The TypeWhich constraint generator is simple, classic PL; the constraints it generates for SMT are straightforward; the models that SMT generates are easy to understand. It’s a cool approach!

One tiny final note: Luna has done a tremendous amount of incredibly high quality work on this project, both in code and concept. She’s just now starting her third-year of undergraduate study. So: watch out! You ain’t ready.

Typed functional programming isn’t about functions

If there’s a single defining ‘killer’ feature of typed functional programming, it isn’t first-class functions at all: it’s algebraic datatypes. Algebraic datatypes help make illegal states unrepresentable and ASTs easy to work with. They’re a powerful tool, and their uptake in a variety of new-hotness languages (Kotlin, Rust, Swift) speaks to their broad appeal.

Moving Scheme code to ML is an old goal, and it’s the bread and butter of the introductory sections of gradual typing papers. But are we any closer than we were fifteen years ago? (I’d say “yes”, and point at Typed Racket, or “nobody knows what’s happening anyway” and point at Idris’s Chez Scheme runtime.)

Stefan Malewski, me, and Éric Tanter tried to figure out how algebraic datatypes play with dynamic features. Our paper, “Gradually Structured Data“, uses AGT to ‘compute’ static and dynamic semantics for a language with possibly open algebraic datatypes and the unknown type in a few flavors (?, the unknown type; a new ground type for “datatype”, the same way int and bool and ?->? are ground; and a new type for “any open datatype”). The features gel in a nice way, letting us express some cool behaviors (see Section 2 for how one might evolve a simple JSON API) and sit in a novel space (see Section 5 for a thorough comparison to related features).

I’m particularly pleased that we’ve found a new place in the design spectrum (per our feature chart in Section 5) that seems to support incremental program migration (per our examples in Section 2)—and it’s formally grounded (by using AGT in the middle, formal sections).

This paper came out of conversations with Éric after my screed about gradual typing’s two lineages at SNAPL (see also my followup blogpost, “What to Define When You’re Defining Gradual Type Systems”). There’s plenty more to do: what about separate compilation? What are the right representation choices? How should runtime checks really go, and how can programmers control the costs?

I remember a question I was asked after giving the talk for “Contracts Made Manifest” at POPL 2010 with some panic fondly. That paper compares the latent approach to contracts in Racket-then-Scheme (well structured runtime checks at module boundaries) to the manifest approach (runtime checks are a form of type coercion, occurring anywhere) in the emerging refinement types literature (Sage, Liquid Types, etc.). I had shown that the two aren’t equivalent in the presence of dependency, and I concluded by talking about how the two implementation approaches differed. So: somebody asked, “Which approach should you use?” To be honest, I had hardly even thought about it.

So, suppose you wanted use algebraic datatypes and dynamic features today: which language should you use? I’ve thought about it, and the answer, sadly, is, “It depends”. OCaml’s polymorphic variants get you a long way; Haskell’s Dynamic could work great, but it’s badly in need of usable surface syntax. (I’ve tried to get Richard Eisenberg to help me with the fancy work to make that happen, but he’s justifiably worried that the Haskell community would run him out of town.) Scala, Haskell, and OCaml are your best bets if you want true algebraic datatypes. If you’re more relaxed about things, Typed Racket or TypeScript could work well for you. If what you’re looking for is a type system expressive enough to capture interesting dynamic idioms, then I think there’s a clear choice: CDuce. Ever since un bel recensore anonimo at SNAPL 2019 showed me that CDuce can type flatten, I’ve been impressed. Check this out:

let flatten ( Any -> [ (Any\[Any*])* ] )  (* returns a list of non-lists ???? *)
  | [] -> []                              (* nil *)
  | (h,t) -> (flatten h)@(flatten t)      (* cons *)
  | x -> [x]                              (* anything else *)

Look at that type! In just a few lines of CDuce, we can show that flatten produces not just a list of elements, but a list of things that are not themselves lists. The price here is that CDuce’s types are set-theoretic, which means things are a touch different from what people are used to in OCaml or Haskell. But if you’re okay with that, CDuce is a serious contender!

Coda: see you at OOPSLA?

I’m planning on going to OOPSLA 2021 in Chicago, given the twoopsla and the opportunity to present a paper from OOPSLA 2020, “Formulog: Datalog for SMT-based static analysis”, with Aaron Bembenek and Steve Chong. I’ve already blogged about it, but I’m excited to get to give an in-person version of the talk, too. You can still watch Aaron’s excellent recorded talk on YouTube and enjoy the cabin vibes. There won’t be cabin vibes at my OOPSLA 2020 talk, but there will be terrible jokes. So: think about it. Will I see you at OOPSLA? I hope so!

I’m looking for PhD students!

I’m looking for PhD students in the Fall 2021 application cycle, to start in Fall 2022. Come work with me at Stevens CS in Hoboken, NJ!

I work in Gateway South (the left-hand side of this photo). You could, too! (Photo credit: Stevens Alumni.)

What will we work on?

I’m interested in applying formalism — all those pretty Greek letters in program semantics, type systems, and static analysis — directly to real systems — all that nitty gritty code that makes these beautiful, horrible machines do their thing. I’m working on a few projects that variously emphasize theoretical or practical aspects. My main goal these days is to provide better support for the POSIX shell and its ecosystem, but here’s a sampling from recent papers:

  • Smoosh (POPL 2020): I’m interested in improving and supporting the shell. Smoosh is a formal model of the POSIX shell that can be executed and passes the POSIX test suite. Continuing work on Smoosh means hacking in Lem, OCaml, and Coq (and maybe Rust or C or JS or Elm), and thinking about virtualization, symbolic execution, fuzzing, and how specifications and implementations interact. Or maybe it just means building cool tools for the POSIX world!
  • Formulog (OOPSLA 2020): Datalog, functional programming, and SMT combine to let you write down and run things that look a lot like your formal spec. Continuing work in this line means hacking in Rust (and maybe C++ or Java), and thinking about SMT and how we can be confident that the formalism we write is the code that we run—and that our code is efficient.
  • Gradual types (OOPSLA 2021) and type migration (OOPSLA 2021): People have been trying to combine the benefits of dynamic and static types for years. Work in this line will mean hacking in Rust (and maybe JS or TS or Haskell) and doing classic PL stuff like type soundness, type inference, and proofs of contextual equivalence (by logical relations or bisimulation, on paper or in Coq).
A slide from a Keynote deck. The title is "semantics engineering". The left-hand side illustrates systems challenges:

 - a "C" monster
 - complicated specs
 - a dog in front of a laptop (programming is hard!)

The right-hand side illustrates PL formalism: inference rules, helper functions, grammars, etc.
I’ve been calling it this combination of executable systems and PL formalism “semantics engineering”, with inspiration from the PLT folks (though I don’t really use Redex).

You can check out a list of all my papers. Are any of these papers the sort of thing you’d like to write? Come join me for a PhD!

Who will you work with?

Stevens has about thirty research faculty, and we’re growing fast. We have a great group of people interested in PL, security, and systems: Eduardo Bonelli, Tegan Brennan, Dominic Duggan, Eric Koskinen, Philippe Meunier, David Naumann, Georgios Portokalidis, Susanne Wetzel, and Jun Xu. And there are of course many other fantastic researchers in other topics to learn from in class and collaborate with on research. And beyond all that, I got a lot out of my internships (AT&T Shannon Labs; MSR Cambridge), and I encourage my students to find stimulating opportunities.

Where is Hoboken, again?

Hoboken, NJ is directly across the Hudson River from Manhattan a/k/a New York City. There’s 24-hour train service and frequent ferries to and from New York. Hoboken is a vision zero city, where it’s safe and comfortable to bike and walk. There are other cool cities nearby, like Jersey City.

How do you apply?

You can learn more about the CS PhD program at Stevens and apply online. If you have questions, please don’t hesitate to get in touch.

Heaven, Hell, or Hoboken!

After six years at Pomona College, I’ve moved to Stevens Institute of Technology as an assistant professor in the computer science department. I miss my lovely Pomona colleagues—they’re hiring!—but I’m excited to be on the East Coast and to be doing more research with a new set of lovely colleagues.

A photo of my office nameplate. The Stevens logo in red, with the following text:

Michael Greenberg
Assistant Professor
Department of Computer Science

447 (in Braille)

I’ve got a new webpage, but the old webpage should stay up.

We’ll be spinning up the Stevens PL/systems/security seminar soon, and I’m hopeful we can involve lots of interesting people, as speakers and attendees. If you’re in the New York area, come by and say hi!

Also… I’ll be looking to hire PhD students for the coming year! More info on that soon.

Pomona College is hiring!

Pomona College’s computer science department is hiring Fall of 2021 for Fall 2022. I used to work at Pomona, and there is a lot to recommend it. Pomona College is a small liberal arts college (SLAC) in LA County, 35mi/45-240min outside DTLA. It’s a 2:2 teaching load.

Steps on campus, with a view of the mountains behind.

First and foremost, you’ll have excellent colleagues. They are friendly, collegial, supportive, and hardworking. There’s a sense of shared purpose and responsibility. Disagreements are resolved amicably, because everyone is on the same team. Nobody shirks. They’re great people!

Second, the students are bright. They’re motivated, broad-minded, and often interested in social justice. Pomona’s student body overall is quite diverse, along a variety of axes (income, ethnicity, national origin), and the CS enjoys that diversity. Pomona is a very wealthy institution, and it’s putting its wealth to work helping many students who have very little money.

Third, Pomona offers a great deal of research freedom. I felt zero pressure to get grants when I was there, which allowed me to pursue whatever research interests felt worthwhile.

I’ve written in the past about what I loved (and didn’t) about Pomona College. I’ve left that document up, since it provides more detail on what I think is really good about working at a SLAC in general and Pomona in particular.

Joining Pomona’s CS department will let you join a community of lovely colleagues. You’ll have the opportunity to shape the culture and trajectory of a department, work closely with smart and interesting students, do the research you want… all while enjoying the mountains, high desert, city, and coast. It could be right for you — if you’re not sure, feel free to get in touch and we can chat about it.

A student asked why STLC programs always terminate… so I showed them! Pomona offers the opportunity to teach interesting things to interested students. That student and I later worked through the Homotopy Type Theory book together.

Processing Semi-Structured Data in the Unix Shell

The Unix shell is incredibly powerful. I use it routinely for simple tasks (moving files around), routine work (grading scripts), and in my development process (building, deploying, etc.). When I’m working with text, the shell and its ecosystem is excellent: patching together cat, find, grep, sed, tr, and cut with shell pipelines and redirections is a convenient, expressive, and fast way to inspect and edit files.

But my shell toolchain is much less helpful when working with semi-structured data, like JSON and YAML. Folks have made wonderful contributions to the shell ecosystem to help—tools like jq and gron. These two tools provide new languages for manipulating JSON. It may be embarrassing to admit for a programming languages researcher, but… I’m kind of maxed out on new languages.

So I built a new tool that lets you use your usual shell tools to work with modern file formats: ffs, the file filesystem.

A GIF showing the following shell interaction, editing JSON in place.

~/ffs/demo $ echo '{}' >demo.json
~/ffs/demo $ ffs -i demo.json &
[1] 56827
~/ffs/demo $ cd demo
~/ffs/demo/demo $ echo 47 >favorite_number
~/ffs/demo/demo $ mkdir likes
~/ffs/demo/demo $ echo true >likes/dogs
~/ffs/demo/demo $ echo false >likes/cats
~/ffs/demo/demo $ touch mistakes
~/ffs/demo/demo $ echo Michael Greenberg >name
~/ffs/demo/demo $ echo >website
~/ffs/demo/demo $ cd ..
~/ffs/demo $ umount demo
~/ffs/demo $ 
[1]+  Done                    ffs -i demo.json
~/ffs/demo $ cat demo.json 
{"favorite_number":47,"likes":{"cats":false,"dogs":true},"mistakes":null,"name":"Michael Greenberg","website":""}~/ffs/demo $ 
~/ffs/demo $
Editing JSON in place using ffs.

ffs lets you mount semi-structured data as a filesystem: objects and lists correspond to directories, while other types correspond to regular files. You can mount a file in one format, edit the filesystem, and write it back in another.

All you need to run ffs is FUSE, a kernel module that supports userspace filesystem. You’ll want libfuse on Linux, or macFUSE on macOS. Download a binary and play around!