Formulog: ML + Datalog + SMT

If you read a description of a static analysis in a paper, what might you find? There’ll be some cute model of a language. Maybe some inference rules describing the analysis itself, but those rules probably rely on a variety of helper functions. These days, the analysis likely involves some logical reasoning: about the terms in the language, the branches conditionals might take, and so on.

What makes a language good for implementing such an analysis? You’d want a variety of features:

  • Algebraic data types to model the language AST.
  • Logic programming for cleanly specifying inference rules.
  • Pure functional code for writing the helper functions.
  • An SMT solver for answering logical queries.

Aaron Bembenek, Steve Chong, and I have developed a design that hits the sweet spot of those four points: given Datalog as a core, you add constructors, pure ML, and a type-safe interface to SMT. If you set things up just right, the system is a powerful and ergonomic way to write static analyses.

Formulog is our prototype implementation of this design; the paper on Formulog and its design was just conditionally accepted to OOPSLA 2020. To give a sense of why I’m excited, let me excerpt from our simple liquid type checker. Weighing in at under 400 very short lines, it’s a nice showcase of how expressive Formulog is. (Our paper discusses substantially more complex examples.)

type base =
  | base_bool

type typ = 
  | typ_tvar(tvar)
  | typ_fun(var, typ, typ)
  | typ_forall(tvar, typ)
  | typ_ref(var, base, exp)

and exp = 
  | exp_var(var)
  | exp_bool(bool)
  | exp_op(op)
  | exp_lam(var, typ, exp)
  | exp_tlam(tvar, exp)
  | exp_app(exp, exp)
  | exp_tapp(exp, typ)

ADTs let you define your AST in a straightforward way. Here, bool is our only base type, but we could add more. Let’s look at some of the inference rules:

(* subtyping *)
output sub(ctx, typ, typ)

(* bidirectional typing rules *)
output synth(ctx, exp, typ)
output check(ctx, exp, typ)

(* subtyping between refinement types is implication *)
sub(G, typ_ref(X, B, E1), typ_ref(Y, B, E2)) :-
  wf_ctx(G),
  exp_subst(Y, exp_var(X), E2) = E2prime,
  encode_ctx(G, PhiG),
  encode_exp(E1, Phi1),
  encode_exp(E2prime, Phi2),
  is_valid(`PhiG /\ Phi1 ==> Phi2`).

(* lambda and application synth rules *)
synth(G, exp_lam(X, T1, E), T) :-
  wf_typ(G, T1),
  synth(ctx_var(G, X, T1), E, T2),
  typ_fun(X, T1, T2) = T.

synth(G, exp_app(E1, E2), T) :-
  synth(G, E1, typ_fun(X, T1, T2)),
  check(G, E2, T1),
  typ_subst(X, E2, T2) = T.

(* the only checking rule *)
check(G, E, T) :-
  synth(G, E, Tprime),
  sub(G, Tprime, T).

First, we declare our relations, i.e., the typed judgments that our inference rules will define. We show the most interesting case of subtyping: refinement implication. Several helper relations (wf_ctx and the encode_* relations) and helper functions (exp_subst) stitch things together. The typing rules themselves follow a similar pattern, mixing the synth and check bidirectional typing relations with calls to helper functions like typ_subst.

fun exp_subst(X: var, E : exp, Etgt : exp) : exp =
  match Etgt with
  | exp_var(Y) => if X = Y then E else Etgt
  | exp_bool(_) => Etgt
  | exp_op(_) => Etgt
  | exp_lam(Y, Tlam, Elam) =>
    let Yfresh = 
      fresh_for(Y, X::append(typ_freevars(Tlam), exp_freevars(Elam)))
    in
    let Elamfresh = 
      if Y = Yfresh
      then Elam
      else exp_subst(Y, exp_var(Yfresh), Elam)
    in
    exp_lam(Yfresh,
            typ_subst(X, E, Tlam),
            exp_subst(X, E, Elamfresh))
  | exp_tlam(A, Etlam) =>
    exp_tlam(A, exp_subst(X, E, Etlam))
  | exp_app(E1, E2) => 
    exp_app(exp_subst(X, E, E1), exp_subst(X, E, E2))
  | exp_tapp(Etapp, T) => 
    exp_tapp(exp_subst(X, E, Etapp), typ_subst(X, E, T))
  end

Expression substitution might be boring, but it shows the ML fragment well enough. It’s more or less the usual ML, though functions need to have pure interfaces, and we have a few restrictions in place to keep typing simple in our prototype.

There’s lots of fun stuff that doesn’t make it into this example: not only can relations call functions, but functions can examine relations (so long as everything is stratified). Hiding inside fresh_for is a clever approach to name generation that guarantees freshness… but is also deterministic and won’t interfere with parallel execution. The draft paper has more substantial examples.
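
To give the flavor of the idea (in Python rather than Formulog, and only as an illustration; this isn’t necessarily how fresh_for is actually implemented): one way to get deterministic freshness is to make the fresh name a pure function of the variable and the set of names to avoid.

# Illustrative sketch only -- not Formulog's fresh_for. The result depends
# only on the inputs, so there's no global counter and parallel runs agree.
def fresh_for(x, avoid):
    if x not in avoid:
        return x
    i = 0
    while x + "_" + str(i) in avoid:
        i += 1
    return x + "_" + str(i)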

We’re not the first to combine logic programming and SMT. What makes our design a sweet spot is that it doesn’t let SMT get in the way of Datalog’s straightforward and powerful execution model. Datalog execution is readily parallelizable; the magic sets transformation can turn Datalog’s exhaustive, bottom-up search into a goal-directed one. It’s not news that Datalog can turn these tricks—Yiannis Smaragdakis has been saying it for years!—but integrating Datalog cleanly with ML functions and SMT is new. Check out the draft paper for a detailed related work comparison. While our design is, in the end, not so complicated, getting there was hard.

Relatedly, we also have an extended abstract at ICLP 2020, detailing some experiments in using incremental SMT solving modes from Formulog. You might worry that Datalog’s BFS (or heuristic) search strategy wouldn’t mesh with an SMT solver’s push/pop (i.e., DFS) assertion stack, but a few implementation tricks and check-sat-assuming do indeed yield speedups.

Flapjax on PL Perspectives

Shriram Krishnamurthi, Arjun Guha, Leo Meyerovich, and I wrote a post about Flapjax on PL Perspectives, the SIGPLAN blog. (Thanks to Mike Hicks for helping us edit the post!)

Flapjax won the OOPSLA MIP award for 2009 (though the SIGPLAN website isn’t yet up to date). Our blog post is about the slightly unconventional way we worked: most of the Flapjax work happened in 2006 and 2007, but we didn’t even try to write the paper until several years later (Leo and I were in grad school). Rather than recapitulate those ideas, go read the post!

smoosh v0.1

I’ve been building an executable formalization of the POSIX shell semantics, which I’ve been calling smoosh (the Symbolic, Mechanized, Observable, Operational SHell).

I’m pleased to announce an important early milestone: smoosh passes the POSIX test suite (modulo locales, which smoosh doesn’t currently support). I’ve accordingly tagged this ‘morally correct’ version as smoosh v0.1. You can play around with the latest version via the shtepper, a web-based frontend to a symbolic stepper. I think the shtepper will be particularly useful for those trying to learn the shell or to debug shell scripts.

There’s still plenty to do, of course: there are bugs to fix, quirks to normalize, features to implement, and tests to run. But please: play around with the smoosh shell and the shtepper, file bug reports, and submit patches!

New paper: Word expansion supports POSIX shell interactivity

I’ve been thinking about and working on the POSIX shell for a little bit over a year now. I wrote a paper for OBT 2017, titled Understanding the POSIX Shell as a Programming Language, outlining why I think the shell is worthy of study.

For some time I’ve had the conviction that word expansion—the process that includes globbing with * but also things like command substitution with backticks—is somehow central to the shell’s interactivity. I’m pleased to have finally expressed my conviction in more detail: Word expansion supports POSIX shell interactivity will appear at PX 2018. Here’s the abstract:

The POSIX shell is the standard tool to deploy, control, and maintain systems of all kinds; the shell is used on a sliding scale from one-off commands in an interactive mode all the way to complex scripts managing, e.g., system boot sequences. For all of its utility, the POSIX shell is feared and maligned as a programming language: the shell is feared because of its incredible power, where a single command can destroy not just local but also remote systems; the shell is maligned because its semantics are non-standard, using word expansion where other languages would use evaluation.

I conjecture that word expansion is in fact an essential piece of the POSIX shell’s interactivity; word expansion is well adapted to the shell’s use cases and contributes critically to the shell’s interactive feel.

See you in Nice?

Installing ctypes and ctypes-foreign on OS X with brew and OPAM

I recently had some trouble getting ctypes working, so I thought I’d share my solution.

I found the Real World OCaml chapter on FFIs, and I tried following their advice first. They suggest:

brew install libffi
opam install ctypes

But it doesn’t quite work. For one, you also need to install ctypes-foreign. For two, brew doesn’t link its libffi install globally (it’s keg-only), so ctypes can’t find it.

Here’s the right incantation:

# install brew's libffi
brew install libffi
# back up the system libffi and symlink brew's copy into /usr/lib
cd /usr/lib/
sudo mv libffi.dylib libffi.dylib.orig
sudo ln -s /usr/local/opt/libffi/lib/libffi.dylib
# point the build at the ffi headers, then install both packages
LIBFFI_CFLAGS=-I/usr/include/ffi CFLAGS=-I/usr/include/ffi opam install ctypes ctypes-foreign

Now, this is merely a good start. Once I had this much working, I still couldn’t get OCaml to call C functions! It came down to a bear of a linking issue… no combination of LD_PATH and LD_LIBRARY_PATH and LD_PRELOAD would help, nor would -lflags -cclib,-L. I just couldn’t get ctypes to see my shared library! I finally found an option to ld that does the trick: -force_load. Here’s my build line:

ocamlbuild -no-hygiene -pkg ctypes.foreign -lflags "-cclib -force_load /path/to/libfoo.a" bar.native

I imagine you can get away without -no-hygiene; I couldn’t get corebuild to work at all.

In any case, special thanks to Arjun Narayan for helping me debug this, and to Spiros Eliopolous for mockery and absolutely nothing else.

Twitter bots and OAuth

I’m working on a Twitter bot, and I ran up against something very annoying: apps need to be on an account with a mobile phone number. I have just one mobile phone, and it’s already tied to my real Twitter account. Rather than finding a way to get another mobile number, I had the bot authorize my app using Twitter’s OAuth API. Here’s how to do it.

Step one: collect your API keys. There’s a consumer key and a consumer secret, both available from your app’s page at apps.twitter.com. Never commit these to a repository or post them anywhere. I put them in a special file that I tell git to ignore, keys.py.
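
For concreteness, keys.py is nothing fancy, just a few Python assignments; something like this sketch, with your real values filled in and the file listed in .gitignore:

# keys.py -- never commit this file (add it to .gitignore)
consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
# later steps will also add oauth_token, oauth_token_secret, pin, and the
# final access tokens here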

Step two: collect OAuth tokens. The key trick here is to collect tokens that are permanently good, using the out-of-band (OOB) PIN method. To actually send the requests, I used the requests-oauthlib library for Python. Here’s how to do it:

from requests_oauthlib import OAuth1Session
from keys import *

twitter = OAuth1Session(consumer_key, consumer_secret) # loaded from keys.py!
twitter.params['oauth_callback'] = 'oob' # make sure we're in PIN mode
r = twitter.post('https://api.twitter.com/oauth/request_token')

At this point, r.text will give you, query-string style, your oauth_token and oauth_token_secret. Add these to your keys.py file and copy the oauth_token to the clipboard.
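
If you’d rather not eyeball r.text, the response is easy to parse; here’s a quick sketch (Python 3, picking up r from the snippet above and assuming the standard oauth_token/oauth_token_secret parameter names):

from urllib.parse import parse_qs

creds = parse_qs(r.text) # e.g. 'oauth_token=...&oauth_token_secret=...&oauth_callback_confirmed=true'
oauth_token = creds['oauth_token'][0]
oauth_token_secret = creds['oauth_token_secret'][0]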

Step three: authorize the app. Fire up a web browser and log in to the bot’s account. Go to https://api.twitter.com/oauth/authorize?oauth_token=[whatever your oauth_token was]. You’ll get a PIN. Add it to your keys.py file.

Step four: finalize the authorization.

from requests_oauthlib import OAuth1Session
from keys import *
twitter = OAuth1Session(consumer_key, consumer_secret)
twitter.params['oauth_verifier'] = pin
r = twitter.get('https://api.twitter.com/oauth/access_token?oauth_token=' + oauth_token)

Now r.text will give you a new oauth_token and oauth_token_secret. Save these—they’re OAuth access tokens, which are how you can actually make API calls.

Step five: check that it worked. Log in to the bot account, go to settings, and check the apps subheading—your app should appear. You can now tweet programmatically:

from requests_oauthlib import OAuth1Session
from keys import *

twitter = OAuth1Session(consumer_key, consumer_secret, oauth_access_token, oauth_token_secret) # the access token and secret saved in step four
twitter.post('https://api.twitter.com/1.1/statuses/update.json?status=Testing.')

I would be lying if I said I was 100% confident that everything from here on out is easy, but it seems like these access tokens are all you need after PIN-based authorization.

PHPEnkoder 1.13

I’ve resolved some E_NOTICE-level messages that were showing up when people set WP_DEBUG to true. Thanks to Rootside for pointing out this problem on the WordPress forums. As always, please let me know on the forums or by email if you run into any problems.

Scheduling the discussion order at PC meetings

I recently wrote a bit of code for scheduling the discussion order at PC meetings so as to minimize traffic in and out of the room due to conflicts of interest. Given conflict information that HotCRP happily exports, the code produces a schedule, which can then be turned into a handout and slides showing the current paper’s conflicts and the next two papers.
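
The core idea: each paper’s conflict set is the set of PC members who have to leave the room, so we want consecutive papers to have similar conflict sets. Here’s a rough greedy heuristic in Python (illustrative only, and not necessarily the algorithm mgree/conflict actually uses):

def schedule(papers, conflicts):
    # papers: list of paper ids; conflicts: paper id -> set of conflicted PC members
    order = [papers[0]]
    remaining = set(papers[1:])
    while remaining:
        current = conflicts[order[-1]]
        # choose the paper whose conflict set differs least from the current
        # one, so the fewest people have to enter or leave the room
        nxt = min(remaining, key=lambda p: len(conflicts[p] ^ current))
        remaining.remove(nxt)
        order.append(nxt)
    return order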

If you’re interested in using or contributing, I’ve put the code up at mgree/conflict on github.

PHPEnkoder 1.12.1: “PHP is laughably bad” edition

I’ve released a bugfix for PHPEnkoder version 1.12. Get this: before version 5.5, PHP didn’t support arbitrary array index expressions. The problem was a line:

$ord = unpack("N",$c)[1];

Which I changed to:

$bs = unpack("N",$c);
$ord = $bs[1];

This is really ridiculous. Like, serious amateur hour ridiculous. Like, if your final project in a compilers class had this kind of syntactic limitation, you would not get a good grade. PHP is a fractal of bad design. Ugh. Just: ugh.

And while I’m complaining about these sorts of things, where are the structured content management systems? PHPEnkoder’s email and mailto: detection is a giant, horrible kludge of regular expressions. Where’s the CMS where the filters get passed a structured, syntactic representation of the page to be rendered? This is a serious question, and if you know of one, please email me.