Processing Semi-Structured Data in the Unix Shell

The Unix shell is incredibly powerful. I use it routinely for simple tasks (moving files around), routine work (grading scripts), and in my development process (building, deploying, etc.). When I’m working with text, the shell and its ecosystem are excellent: patching together cat, find, grep, sed, tr, and cut with shell pipelines and redirections is a convenient, expressive, and fast way to inspect and edit files.

But my shell toolchain is much less helpful when working with semi-structured data, like JSON and YAML. Folks have made wonderful contributions to the shell ecosystem to help—tools like jq and gron. These two tools provide new languages for manipulating JSON. It may be embarrassing to admit for a programming languages researcher, but… I’m kind of maxed out on new languages.

So I built a new tool that lets you use your usual shell tools to work with modern file formats: ffs, the file filesystem.

A GIF showing the following shell interaction, editing JSON in place.

~/ffs/demo $ echo '{}' >demo.json
~/ffs/demo $ ffs -i demo.json &
[1] 56827
~/ffs/demo $ cd demo
~/ffs/demo/demo $ echo 47 >favorite_number
~/ffs/demo/demo $ mkdir likes
~/ffs/demo/demo $ echo true >likes/dogs
~/ffs/demo/demo $ echo false >likes/cats
~/ffs/demo/demo $ touch mistakes
~/ffs/demo/demo $ echo Michael Greenberg >name
~/ffs/demo/demo $ echo https://mgree.github.io >website
~/ffs/demo/demo $ cd ..
~/ffs/demo $ umount demo
~/ffs/demo $ 
[1]+  Done                    ffs -i demo.json
~/ffs/demo $ cat demo.json 
{"favorite_number":47,"likes":{"cats":false,"dogs":true},"mistakes":null,"name":"Michael Greenberg","website":"https://mgree.github.io"}~/ffs/demo $ 
~/ffs/demo $
Editing JSON in place using ffs.

ffs lets you mount semi-structured data as a filesystem: objects and lists correspond to directories, while other types correspond to regular files. You can mount a file in one format, edit the filesystem, and write it back in another.

All you need to run ffs is FUSE, a kernel module that supports userspace filesystems. You’ll want libfuse on Linux, or macFUSE on macOS. Download a binary and play around!

SIGPLAN Blog: Making PL Ideas Accessible

I have a new post up on the SIGPLAN blog: “Making PL Ideas Accessible: An Open-Source, Open-Access, Interactive Journal.” Inspired by Distill, I propose an open-access, open-source, interactive journal for disseminating clear presentations of current ideas and methods in programming languages.

It’s a particularly good moment to consider our research’s reach and impact: CORE has just downgraded many PL conferences in its rankings. Just because you don’t take an interest in rankings doesn’t mean rankings won’t take an interest in you. Let this spur a new wave of beautiful and enlightening explanations of PL ideas that can reach a broad audience.

POPL Cocktail Hour

POPL 2021 is open for business on Clowdr! The synchronous band is in the afternoon and evening in Central European Time (CET = UTC+1). I live outside LA, which is UTC-8… so the POPL happy hours at 10:30am start a little early even for me.

So far as I know, this is the first POPL with a paper named after a cocktail. Accordingly, I’ve decided to host a POPL Cocktail Hour on Wednesday, January 20th at 5pm Pacific Time (PT = UTC-8). We’ll be meeting in the Clowdr break room. (You need to be registered to attend, but it’s not too late!)

I’ll be making the official POPL Cocktail, “Nordic Summer”. I got the recipe from the Moody Mixologist, but here it is:

  • 2oz aquavit
  • 1oz Aperol
  • 1oz lime juice (fresh, natch)

Add ice to a shaker, shake the ingredients until it’s quite cold (i.e., it hurts to hold a metal shaker), and then strain into a chilled coupe. Skål!

I think it’d be great with a variety of substitutions—Cappelletti or Campari or even Punt e Mes would do well instead of Aperol, and you can sub lemon for lime. I bet it’d be good long (i.e., with soda on top).

Here’s another one I came up with, which I’m calling the “Copenhagen Sour”:

  • 3/4oz lemon juice
  • 1 egg white
  • 1 1/4oz aquavit (okay, mine is made in Pasadena)
  • 1/4oz Cherry Heering (made in Copenhagen)
  • 3/4oz simple syrup

Dry shake the juice and egg white (i.e., no ice). Add ice, aquavit, Cherry Heering, and simple syrup and shake hard. Double strain into a frosty coupe.

Please join me for a tipple if you can—and bring your own recipes to share!

Cast notation: a case study

I recently wrote on the SIGPLAN blog about how PL notation is a barrier to entry. While the arguments are mine, I acknowledge in the post the many folks who helped me write it. Ideas from an interesting conversation with Neel Krishnaswami came up again in a comment from Jeremy Gibbons. The universe has spoken: here’s what I think about cast notation.

First, here’s what Neel said:

I think that the argument in this note is a bit under theorized. As I see it, there are three fundamental forces at work:

1. Infix operators are hard to parse without extra side information (eg, precedence).

2. Center-embedding is hard to parse for psycholinguistic reasons.

3. Semantics is about relating things, and it is easier for people to see connections when the notational differences between the two things are small.

So when you compare cast(e, S, T) to e <S => T>, the functional notation wins in point 1 and loses on point 2. This is because cast(cast(e, S, T), T, U) has nested structure in a way that e <S => T> <T => U> does not—the second one can parse as exp cast*.

I don’t know the work on gradual typing well enough to say with any confidence, but I would have expected the second notation to be a bit better for 3. The semantics of the term e is a map Γ -> S, and if the meaning of a cast is an embedding function S -> T, then [[ e <S => T> ]] = [[e]]; [[<S => T>]] — i.e., the parts of the term can be interpreted using composition in a diagrammatic order without changing the relative position of the subterms.

My guess in this last part is also an example of the dynamic that leads to bad notation — we pick our notation at the outset, based on a guess about which semantic properties are important, and we can get stuck with bad notation if the properties that actually are important differ from the ones we guessed when designing the notation. (Well, people can also just have bad taste, I guess.)

—Neel Krishnaswami, personal communication.

Both Neel and Jeremy zeroed in on a feature I really like about the e <S => T> or e :: S => T notations: they’re neatly diagrammatic. Jeremy goes further, noting that these notations suggest that casts might compose, as in e <S=>T> <T=>U> ~= e <S=>U>.

If I were forced to choose a notation, I do think these two are the best… with a slight preference for e :: S => T. (I think Phil Wadler introduced this notation, but I’m too tired to run it down just now. Let me know in the comments? Edit: thanks to James McKinna for pointing out the source—“Blame for all” by Ahmed et al.—in a comment, below!)

So why do I prefer a function? In short: the notations suggest identities which don’t actually hold.

Casts don’t compose

Whether you’re casting between simple gradual types or dependent refinements… casts don’t actually compose! Consider a term f : int -> ?, i.e., a function f that takes an int and returns… something.

We can cast f to be purely dynamic, writing f' = f :: (int -> ?) => (? -> ?). These types are compatible, i.e., they differ only by having or not having a ?. Now eventually f' may flow out of the dynamic part of our code and arrive at some context that thinks f' ought to be a bool -> bool function, casting it. So we get:

f' :: (? -> ?) => (bool -> bool) =
(f :: (int -> ?) => (? -> ?)) :: (? -> ?) => (bool -> bool) =
f :: (int -> ?) => (? -> ?) => (bool -> bool) 

Now, composition would say that we ought to be able to convert the above to f :: (int -> ?) => (bool -> bool), but such a cast is statically forbidden—these types aren’t compatible because their domains are incompatible, i.e., you can never cast from int to bool! (If you’re surprised that compatibility isn’t transitive in this way, a whole literature awaits. I recommend Abstracting Gradual Typing by Garcia, Clark, and Tanter as a starting point: transitivity can fail.)
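
If you want to see the non-transitivity concretely, here’s a tiny Haskell model I cooked up for this post—my own toy definition of compatibility, not any particular paper’s, with made-up names like TDyn and compat:

-- A toy model of gradual type compatibility (often written S ~ T).
data Ty = TInt | TBool | TDyn | TFun Ty Ty
  deriving (Eq, Show)

-- ? is compatible with everything; functions are compatible pointwise.
compat :: Ty -> Ty -> Bool
compat TDyn _    = True
compat _    TDyn = True
compat TInt  TInt  = True
compat TBool TBool = True
compat (TFun s1 s2) (TFun t1 t2) = compat s1 t1 && compat s2 t2
compat _ _ = False

-- compat (TFun TInt TDyn) (TFun TDyn TDyn)   == True   (first cast is fine)
-- compat (TFun TDyn TDyn) (TFun TBool TBool) == True   (second cast is fine)
-- compat (TFun TInt TDyn) (TFun TBool TBool) == False  (int and bool never mix)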

In the contracts world, e :: S => T => U ~= e :: S => U is just as bad. What if S = U = {x:Int|true}, but T = {x:Int|x>0}? By eliminating the cast in the middle, we’ve forgotten to check that e is positive! Such a forgetful semantics comes up as a possible space-efficient semantics, though you can do better.
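
Here’s the same forgetfulness in a first-order Haskell sketch (the names and the use of error as “blame” are just for illustration): composing the two casts enforces positivity, but the would-be composite S => U checks nothing.

-- A refinement {x:Int | p x} as a runtime check: pass the value through
-- if the predicate holds, otherwise raise blame.
check :: String -> (Int -> Bool) -> Int -> Int
check lbl p x = if p x then x else error ("blame " ++ lbl)

sToT, tToU, sToU :: Int -> Int
sToT = check "T" (> 0)          -- {x:Int|true} => {x:Int|x>0}
tToU = check "U" (const True)   -- {x:Int|x>0}  => {x:Int|true}
sToU = check "U" (const True)   -- the "composed" cast: checks nothing!

-- (tToU . sToT) (-5)  raises "blame T"
-- sToU (-5)           returns -5, silently skipping the positivity check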

Cast congruence

Where Jeremy talked about composition of casts, Neel talked about compositional semantics: that is, the postfix notation directly suggested a diagrammatic denotation, as in [[ e :: S => T ]] = [[ e ]] ; [[ S => T]]. My experience with casts suggests that this intuition isn’t a helpful one, for two reasons.

First: the key ingredient for space-efficient evaluation is not using conventional composition for casts, but rather treating casts in your continuation specially. That’s no ordinary semi-colon! A “cast congruence” lemma lets you recover conventional reasoning, but it takes quite some work to get there.

Second, treating casts as first class (i.e., utterable without being directly applied) forces you to think about very strange terms, like <S1->S2 => T1->T2> <S1 => S2>. (Just… don’t. For why?) Just as for primitive operations, it’s simplest to force casts to be fully applied.

Use coercions

I don’t like these notations for casts because they offer bad suggestions. A textual notation is modestly more cumbersome here, but it’s worth it for clarity to newcomers. It’s particularly worth skipping fancy notation in this setting, because casts are merely a technical device for a core calculus, not a part of the surface language.

But the real truth is: if you’re interested in higher-order runtime checks, you really should be using Henglein’s coercions anyway. (And check out Henglein and Rehof’s Scheme implemented on those principles while you’re at it.) Coercions are much clearer than casts, compose naturally, and are the source of space efficiency and fast implementations. What’s more, blame safety for coercions is straightforward… it’s syntactic!
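
To give a taste of why coercions compose so pleasantly, here’s a rough Haskell sketch—my own simplification for this post, not Henglein’s actual calculus: sequencing is just a constructor, and a few normalization rules fall out.

-- Tags for base and function values hiding inside the dynamic type.
data Tycon = TyInt | TyBool | TyFun
  deriving (Eq, Show)

-- Coercions: identity, tagging/untagging dynamic values, a function
-- coercion (contravariant in its argument), sequencing, and failure.
data Coercion
  = CId
  | CTag Tycon          -- T => ?
  | CUntag Tycon        -- ? => T   (may fail at runtime)
  | CFun Coercion Coercion
  | CSeq Coercion Coercion
  | CFail
  deriving (Eq, Show)

-- Composition is just CSeq, plus some simplification.
comp :: Coercion -> Coercion -> Coercion
comp CId c = c
comp c CId = c
comp CFail _ = CFail
comp _ CFail = CFail
comp (CTag t) (CUntag t')
  | t == t'   = CId     -- tag then untag the same constructor: a no-op
  | otherwise = CFail   -- tag then untag different constructors: always fails
comp (CFun a1 r1) (CFun a2 r2) = CFun (comp a2 a1) (comp r1 r2)
comp c d = CSeq c d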

Postscript: the beam in my eye

You might say, “Well, Michael, it seems like all your papers use the <S => T> e notation. Who are you to judge?”

I used that notation first for POPL 2010, when we (me, Benjamin Pierce, and Stephanie Weirich) tried to figure out whether or not contracts and hybrid types were the same. (They are at simple types. They aren’t if you have dependency—they treat “abusive”, self-contradictory contracts differently.) I just stole Cormac Flanagan’s notation from Hybrid Type Checking, changing an \rhd to a \Rightarrow. (Ugh, why change it at all? My guess: I could draw \Rightarrow better on the board.)

I’ve stuck with that notation… and that’s the heart of the problem. People get used to a board notation, and then they decide that their preferred shorthand is what should go in the paper. What’s good for you when you’re doing the work may not be helpful or even clear to new folks.

I like Neel’s final diagnosis. Don’t invent a notation in the first paper. Use whatever you want on the board, but publish with text first. Like a good stew, your notation will be better on the second day, once the flavors have had time to marry. Later on, if you decide you do need a notation, you’ll know exactly which identities you want your notation to suggest… and which it should not suggest!

Racist Bullshit in Mathematics

Robin Gandy’s “On the Axiom of Extensionality–Part 1”, Journal of Symbolic Logic, Vol. 21, No. 1 (Mar., 1956) quotes Alan Turing using a racist phrase.

A screengrab of the bottom matter of the first page of Gandy's paper.

Received July 24, 1955.
1 Indeed A. M. Turing once told me that he had done this, and that the proof was fairly difficult. I have found among his manuscripts two versions of the proof: one is rather short and contains a fallacy which could not, I think, easily be put right; the other (perhaps a second draft) is unfinished and only a beginning. He may therefore have discovered and surmounted the fallacy. On the other hand, he always spoke of the axiom of extensionality as being 'the nigger in the woodpile', which suggests that he did not think his consistency proof was transcendental enough to accord with Gödel's theorem; but, by the results of this paper, it would have to be just that.
[Turing] always spoke of the axiom of extensionality as being ‘the nigger in the woodpile’, which suggests that he did not think his consistency proof was transcendental enough to accord with Gödel’s theorem.

Yikes. Those unfamiliar with this particular racist phrase will be disappointed to learn that it’s still current enough in the UK to be used “totally unintentional[ly]”… whatever that means.

Gandy’s paper isn’t the first time I’ve been pulled out of my mathematical/logical/philosophical reverie by racist bullshit. When I was reading Ronald Clark’s The Life of Bertrand Russell, I posted a Twitter thread on Russell’s many racist utterances: a selection of three racist Bertrand Russell quotes, plus two anti-racist quotes that repudiate his earlier statements and offer some modest redemption.

I found these episodes of casual, by-the-way racism jarring: they pulled me out of my investment with the material and my ability or even desire to identify with the author.

What’s galling is that Turing “always” spoke of the axiom of extensionality this way; Gandy thought the phrase worth repeating verbatim; the reviewers and editors and publishers thought the phrase acceptable; and those who cite the paper don’t seem to find this footnote worth remarking on. Gandy’s paper is important and widely cited—a foundational resource on extensionality in general, and functional extensionality in particular—but if I refer someone to it, I’m going to let them know to expect a disappointingly racist quote from Turing.

What to Define When You’re Defining Gradual Type Systems

So you want to define a gradual type system, like all the cool kids? My SNAPL 2019 paper imagines three possible motivations:

  1. Expressiveness. You have nothing to lose but your static chains!
  2. Interoperation. Gradual typing seamlessly weaves the dynamic and static worlds into a single fabric.
  3. Typing itself. Static typing offers myriad benefits: enjoy them today!

You don’t have to pick just one. Or maybe you have a different motivation—I’d love to hear it. If you’re motivated by one of these goals but aren’t sure what to do, the paper offers a variety of challenge problems in Section 3.

Now, what do you have to do to define your gradual type system? You have to come up with a type system that has a question mark in it, of course. (You can also write any or dyn or Dynamic or *—whatever works for you.) But what else?

  • A surface language. Since Siek and Taha’s seminal 2006 paper, gradual types have commonly been expressed via elaboration: a source language (with nonexistent or optional or partial type annotations) is translated to a core language that makes all dynamism explicit. What is your source language? Even if you don’t define your source language formally, give an intuition about how programmers will experience it. Can programmers control what’s dynamic and what’s static? Do you ever reject source programs? Which? (GTLC rejects true 5—even in dead code—but different source languages could do different things.) Why is your line the right one to draw?
  • Concrete examples. Ideally drawing from real-world examples, what might be good about gradual types in your context? What new programs do you allow? What problems do you avoid? What guarantees do you gain? Make your example programs good! As Alan Perlis said, “A program without a loop and a structured variable isn’t worth writing”. Examples from the SNAPL paper include: the flatten function, JSON processing, or the “attach-to-the-request” idiom in middleware.
  • Operations. What can we do with base types? Having real operations around will force you to answer hard questions about your source and core languages. How does equality work, i.e., what can be compared and what are the answers? Does your dynamic language reject 5 + "hit"? What about 5 + ((λx.x) "hit")? If you truly have a dynamic type, what operations can you do on it? Which can fail? Is there a way to check at runtime whether casting to a static type will succeed before you commit to such reckless behavior?
  • Control. Include conditionals or some other nontrivial notion of control flow. The first published rules for gradual typing that used a notion of ‘meet’ came in 2012! The way you treat join points in control says a lot about the ergonomics of your system. Church encodings do not cut the mustard.
  • Type semantics. Are your types worth the pixels they’re written on? What do they mean? If I have a value of a given type, what guarantees do I have? You don’t need to give a formal type semantics, but it’s important to know what to expect. If I write a function λx:T. e, what can I actually assume about x in e? If T is int, do I know x is an int, or could it blow up? What about ref int… can reading fail? Writing? What about list int? Does pattern matching on it cause conversions, or possible failure? What about…

The SNAPL 2019 paper argues that there are two ‘lineages’ of gradual typing: one which starts from statically typed languages and relaxes or modifies the type system to include dynamic features, and one which starts from dynamic languages and tries to develop a static type system that can accommodate your ‘preexisting conditions’—legacy code. Whichever lineage you’re working in, each item above is worth carefully considering.

I want to conclude by calling out a paper that more people ought to know about; it does a good job on most of these points. It came out the same year as Alanis Morissette’s acclaimed international debut album, Jagged Little Pill.

(The official ACM version is less complete than the technical report—alas.) They are clear about their surface language (Scheme—with argument lists and call/cc, but not arbitrary set!). They have an entire section of concrete examples, with good demonstrations of how conditionals work with their coercion parameters. They even draw on examples from the literature, citing Mike Fagan’s thesis (which is a goldmine of examples). They don’t give a formal type semantics, but they do explain (with positive and negative examples) how type coercion parameters and polymorphism interact to achieve in their elaborated ML the ad hoc polymorphism necessary to implement their source Scheme.

I also want to highlight this paper because it’s one that I’ve never heard folks actually talk about, though it seems to be cited well enough. I urge anyone who is interested in gradual types to read it. Just like Alanis’s cri de coeur against the shallow world of pop, some things from 1995 are worth revisiting.

Ron Garcia gave helpful feedback on a draft of this post. Thanks, Ron!

Formulog: ML + Datalog + SMT

If you read a description of a static analysis in a paper, what might you find? There’ll be some cute model of a language. Maybe some inference rules describing the analysis itself, but those rules probably rely on a variety of helper functions. These days, the analysis likely involves some logical reasoning: about the terms in the language, the branches conditionals might take, and so on.

What makes a language good for implementing such an analysis? You’d want a variety of features:

  • Algebraic data types to model the language AST.
  • Logic programming for cleanly specifying inference rules.
  • Pure functional code for writing the helper functions.
  • An SMT solver for answering logical queries.

Aaron Bembenek, Steve Chong, and I have developed a design that hits the sweet spot of those four points: given Datalog as a core, you add constructors, pure ML, and a type-safe interface to SMT. If you set things up just right, the system is a powerful and ergonomic way to write static analyses.

Formulog is our prototype implementation of our design; our paper on Formulog and its design was just conditionally accepted to OOPSLA 2020. To give a sense of why I’m excited, let me excerpt from our simple liquid type checker. Weighing in under 400 very short lines, it’s a nice showcase of how expressive Formulog is. (Our paper discusses substantially more complex examples.)

type base =
  | base_bool

type typ = 
  | typ_tvar(tvar)
  | typ_fun(var, typ, typ)
  | typ_forall(tvar, typ)
  | typ_ref(var, base, exp)

and exp = 
  | exp_var(var)
  | exp_bool(bool)
  | exp_op(op)
  | exp_lam(var, typ, exp)
  | exp_tlam(tvar, exp)
  | exp_app(exp, exp)
  | exp_tapp(exp, typ)

ADTs let you define your AST in a straightforward way. Here, bool is our only base type, but we could add more. Let’s look at some of the inference rules:

(* subtyping *)
output sub(ctx, typ, typ)

(* bidirectional typing rules *)
output synth(ctx, exp, typ)
output check(ctx, exp, typ)

(* subtyping between refinement types is implication *)
sub(G, typ_ref(X, B, E1), typ_ref(Y, B, E2)) :-
  wf_ctx(G),
  exp_subst(Y, exp_var(X), E2) = E2prime,
  encode_ctx(G, PhiG),
  encode_exp(E1, Phi1),
  encode_exp(E2prime, Phi2),
  is_valid(`PhiG /\ Phi1 ==> Phi2`).

(* lambda and application synth rules *)
synth(G, exp_lam(X, T1, E), T) :-
  wf_typ(G, T1),
  synth(ctx_var(G, X, T1), E, T2),
  typ_fun(X, T1, T2) = T.

synth(G, exp_app(E1, E2), T) :-
  synth(G, E1, typ_fun(X, T1, T2)),
  check(G, E2, T1),
  typ_subst(X, E2, T2) = T.

(* the only checking rule *)
check(G, E, T) :-
  synth(G, E, Tprime),
  sub(G, Tprime, T).

First, we declare our relations—that is, the (typed) inference rules we’ll be using. We show the most interesting case of subtyping: refinement implication. Several helper relations (wf_ctx, encode_*) and helper functions (exp_subst) patch things together. The typing rules below follow a similar pattern, mixing the synth and check bidirectional typing relations with calls to helper functions like typ_subst.

fun exp_subst(X: var, E : exp, Etgt : exp) : exp =
  match Etgt with
  | exp_var(Y) => if X = Y then E else Etgt
  | exp_bool(_) => Etgt
  | exp_op(_) => Etgt
  | exp_lam(Y, Tlam, Elam) =>
    let Yfresh = 
      fresh_for(Y, X::append(typ_freevars(Tlam), exp_freevars(Elam)))
    in
    let Elamfresh = 
      if Y = Yfresh
      then Elam
      else exp_subst(Y, exp_var(Yfresh), Elam)
    in
    exp_lam(Yfresh,
            typ_subst(X, E, Tlam),
            Elamfresh)
  | exp_tlam(A, Etlam) =>
    exp_tlam(A, exp_subst(X, E, Etlam))
  | exp_app(E1, E2) => 
    exp_app(exp_subst(X, E, E1), exp_subst(X, E, E2))
  | exp_tapp(Etapp, T) => 
    exp_tapp(exp_subst(X, E, Etapp), typ_subst(X, E, T))
  end

Expression substitution might be boring, but it shows the ML fragment well enough. It’s more or less the usual ML, though functions need to have pure interfaces, and we have a few restrictions in place to keep typing simple in our prototype.

There’s lots of fun stuff that doesn’t make it into this example: not only can relations call functions, but functions can examine relations (so long as everything is stratified). Hiding inside fresh_for is a clever approach to name generation that guarantees freshness… but is also deterministic and won’t interfere with parallel execution. The draft paper has more substantial examples.
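
To give a sense of what deterministic freshness can look like—this is my own guess at the flavor in Haskell, not Formulog’s actual fresh_for—here’s a version where the fresh name depends only on the old name and the avoid list, so the same inputs always produce the same output:

-- Deterministically rename x away from a list of names to avoid:
-- try x, x_0, x_1, ... and return the first name not in the avoid list.
freshFor :: String -> [String] -> String
freshFor x avoid = head [n | n <- candidates, n `notElem` avoid]
  where candidates = x : [x ++ "_" ++ show i | i <- [(0 :: Int) ..]]

-- freshFor "y" ["x", "y", "y_0"]  ==  "y_1", every time it's called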

We’re not the first to combine logic programming and SMT. What makes our design a sweet spot is that it doesn’t let SMT get in the way of Datalog’s straightforward and powerful execution model. Datalog execution is readily parallelizable; the magic sets transformation can turn Datalog’s exhaustive, bottom-up search into a goal-directed one. It’s not news that Datalog can turn these tricks—Yiannis Smaragdakis has been saying it for years!—but integrating Datalog cleanly with ML functions and SMT is new. Check out the draft paper for a detailed related work comparison. While our design is, in the end, not so complicated, getting there was hard.

Relatedly, we also have an extended abstract at ICLP 2020, detailing some experiments in using incremental solving modes from Formulog. You might worry that Datalog’s BFS (or heuristic) strategy wouldn’t work with an SMT solver’s push/pop (i.e., DFS) assertion stack—but a few implementation tricks and check-sat-assuming indeed provide speedups.

The Big Brzozowski

When I mentioned on Twitter how important smart constructors are for Brzozowski derivatives, Prabhakar Ragde raised an interesting question:


The question is subtle: it’s not about how many possible derivatives there are, but rather a question about size in the worst case. If I start parsing a string s with a regular expression e, how big might the nth derivative of e be?

I livecoded an empirical answer. TL;DR: the worst-case derivatives are exponentially large!

I hacked up a driver that would (a) enumerate all regular expressions of a certain size and (b) calculate their 1st through nth derivatives, while (c) logging the largest size seen at each step. (You can get the raw CSV from the repo.)

Data in hand, I did some simple analysis in R. First, I plotted the data on a logarithmic graph with linear fit: et voilà, it’s exponential.

a clear linear fit on a logarithmic graph

With that graph, I granted myself permission to fit an exponential curve:

nice fits on a conventional graph, though the smart constructor algorithm is flatter compared to the sharply spiking plain one; R^2 for the fit is in the high .90s for both

So: as you take Brzozowski derivatives, your regular expression will—in the worst case—grow exponentially. That sucks. Unsurprisingly, smart constructors don’t give asymptotic savings, even though they save you a huge amount of overhead.

What do you mean, ‘big’?

In my empirical analysis, I used one measure of size: the number of nodes, counting empty regexes as being of size 0. But there are others! In particular, there’s a height measure on regular expressions, too.
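
Concretely, here’s how I’d write the two measures in Haskell—a sketch over the Regex type from the Regularity library (shown in the next post, below; strictness annotations and deriving omitted), with empty counted as 0 in both, matching the convention discussed at the end:

-- The same Regex type as in the next post, repeated for reference.
data Regex
  = Empty
  | Epsilon
  | Char Char
  | Seq Regex Regex
  | Alt Regex Regex
  | Star Regex

-- Number of nodes, counting Empty as 0.
size :: Regex -> Int
size Empty         = 0
size Epsilon       = 1
size (Char _)      = 1
size (Seq re1 re2) = 1 + size re1 + size re2
size (Alt re1 re2) = 1 + size re1 + size re2
size (Star re)     = 1 + size re

-- Height, again counting Empty as 0.
height :: Regex -> Int
height Empty         = 0
height Epsilon       = 1
height (Char _)      = 1
height (Seq re1 re2) = 1 + max (height re1) (height re2)
height (Alt re1 re2) = 1 + max (height re1) (height re2)
height (Star re)     = 1 + height re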

In the worst case, the size of a regular expression is exponential in its height. Can I characterize how the height of a regular expression grows with the Brzozowski derivative?

I built a small model in Coq and proved that:

  • The height of the derivative of a regular expression E is no greater than twice the height of E itself (deriv_height). I doubt that this is a tight upper bound.
  • There is no constant k that bounds the height increase of the derivative (deriv_height_constant_general). The general thrust of the proof is as follows. Consider e = seq e1' empty, where e1' = alt epsilon e1 and e1 is a regex with maximal derivative height, i.e., h (d c e1) = k + h e1 for some c. Note that e1' is nullable and also has maximal derivative height, i.e., h (d c e1') = k + h e1'. We have that h (d c e) = 2 + h (d c e1') by definition; by the assumed bound, h (d c e) <= k + h e = k + 1 + h e1'. But then how can it be that 2 + k + h e1' <= k + 1 + h e1'?

With these two proofs combined, we can be confident that the exponential growth here is real, and not an artifact of small models. It would be interesting to determine what the multiplier actually is.

In these proofs, I set s empty = h empty = 0. It felt right at the time, but maybe it’s a little weird. So I redid the proofs setting them to 1. The dude abides.

Where next?

I have several remaining questions. Can we characterize which terms have large derivatives? Do they have equivalent forms with smaller derivatives? I suspect right-associating a seq will be better than left-associating it. Are there rewrites that ensure non-nullability (or minimize nullability) of the left-hand sides of seqs?

Smart constructors are smarter than you think

CS 181-N Advanced Functional Programming is a project-based elective I’m teaching this semester. The first half of the course teaches the students basic Haskell programming, from purity to FAM and IO, by way of parsing. The second half of the course has the students build a small project using a pure FP (Haskell and Elm, this go round). The projects were kicking off just as we pivoted to video.

I’ve been streaming my own project as a way of showing students different software engineering practices: how to use CI, how to write good tests and generators, how to run benchmarks, how to file bug reports, and so on. It’s been fun.

My ‘project’ is the Regularity, a library for working with regular languages and automata. We get a lot of A-plot Haskell mixing with B-plot theory of computation.

After implementing a few kinds of NFA (and seeing that they’re much more efficient than naive “find all matches” regex matching), I turned to Brzozowski derivatives. Most undergraduates aren’t exposed to the idea of derivatives outside of a calculus class, so I thought it’d be fun to show. Here’s the implementation, from Regex.hs:

data Regex =
    Empty
  | Epsilon
  | Char Char
  | Seq !Regex !Regex
  | Alt !Regex !Regex
  | Star !Regex
  deriving (Eq, Ord)

dMatches :: Regex -> Text -> Bool
dMatches re t = nullable (T.foldl (flip deriv) re t)

-- re `matches` c:s iff deriv c re `matches` s
deriv :: Char -> Regex -> Regex
deriv _c Empty         = empty
deriv _c Epsilon       = empty
deriv  c (Char c')     = if c == c' then epsilon else empty
deriv  c (Seq re1 re2) = 
  alt (seq (deriv c re1) re2) (if nullable re1 then deriv c re2 else empty)
deriv  c (Alt re1 re2) = alt (deriv c re1) (deriv c re2)
deriv  c (Star re)     = seq (deriv c re) (star re)

-- `nullable re` returns true iff re accepts the empty string
nullable :: Regex -> Bool
nullable Empty         = False
nullable Epsilon       = True
nullable (Char _)      = False
nullable (Seq re1 re2) = nullable re1 && nullable re2
nullable (Alt re1 re2) = nullable re1 || nullable re2
nullable (Star _re)    = True

It only took a few minutes to implement the derivatives, and performance was… fine. Thinking it would be nice to show the concept, I did smart constructors next:

empty :: Regex
empty = Empty

epsilon :: Regex
epsilon = Epsilon

char :: Char -> Regex
char = Char

-- ε is a unit for sequencing; ∅ annihilates it
seq :: Regex -> Regex -> Regex
seq Epsilon re2     = re2
seq re1     Epsilon = re1
seq Empty   _re2    = Empty
seq _re1    Empty   = Empty
seq re1     re2     = Seq re1 re2

-- ∅ is a unit for alternation; drop an ε next to a nullable regex; r|r = r
alt :: Regex -> Regex -> Regex
alt Empty   re2     = re2
alt re1     Empty   = re1
alt Epsilon re2     = if nullable re2 then re2 else Alt Epsilon re2
alt re1     Epsilon = if nullable re1 then re1 else Alt re1 Epsilon
alt re1     re2     = if re1 == re2 then re1 else Alt re1 re2

-- ∅* = ε* = ε; collapse nested stars
star :: Regex -> Regex
star Empty     = Epsilon
star Epsilon   = Epsilon
star (Star re) = Star re
star re        = Star re

But then something surprised me: with smart constructors, the Brzozowski-based matching algorithm was substantially faster than NFAs, with or without epsilon transitions!

How much faster? Here are the times as reported by criterion.

Matching (a|a)*                              …on a⁵⁰     …on a¹⁰⁰
Naive, all-matches                           71.4 µs     255  µs
NFA with epsilon transitions                 31.6 µs     55.0 µs
NFA                                           4.88 µs     9.29 µs
Brzozowski derivative, dumb constructors     61.8 µs     262  µs
Brzozowski derivative, smart constructors     1.88 µs     3.73 µs
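
For the curious, timings like these come from a small criterion harness along the following lines—a sketch, with the module name and exports assumed, since the real benchmarks live in the Regularity repo:

module Main (main) where

import Criterion.Main
import qualified Data.Text as T

-- Assumed module name and exports for the Regex code shown above.
import Regex (char, alt, star, dMatches)

main :: IO ()
main = do
  let re      = star (alt (char 'a') (char 'a'))   -- (a|a)*
      input n = T.pack (replicate n 'a')
  defaultMain
    [ bgroup "Brzozowski, smart constructors"
        [ bench "a^50"  $ whnf (dMatches re) (input 50)
        , bench "a^100" $ whnf (dMatches re) (input 100)
        ]
    ]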

I knew smart constructors were an optimization, but I was blown away! If you didn’t know about smart constructors, you might shrug off Brzozowski derivatives as a curiosity for math dweebs. But with just a tiny bit of optimization from smart constructors, Brzozowski derivatives offer the best performance and the simplest implementation, at least on these meager tests.

It would be interesting to dig into why the dumb constructors are so slow. It doesn’t seem to be laziness, so is it just allocation churn? (We’re not even using hash-consing, which can provide another performance boost.) Is it large, redundant regexes? Something else?

Regular-expression derivatives re-examined by Owens, Reppy, and Turon is closely related and worth a read! Relatedly, Neel Krishnaswami has some lovely theoretical discussion, too.

Flapjax on PL Perspectives

Shriram Krishnamurthi, Arjun Guha, Leo Meyerovich, and I wrote a post about Flapjax on PL Perspectives, the SIGPLAN blog. (Thanks to Mike Hicks for helping us edit the post!)

Flapjax won the OOPSLA MIP award for 2009 (though the SIGPLAN website isn’t yet up to date). Our blog post is about the slightly unconventional way we worked: most of the Flapjax work happened in 2006 and 2007, but we didn’t even try to write the paper until several years later (Leo and I were in grad school). Rather than recapitulate those ideas, go read the post!