When I mentioned how important smart constructors are for Brzozowski derivatives on Twitter, when Prabhakar Ragde raised an interesting question:
I saw Brzozowski in the hall once and asked him if there was any bound on the growth of regex size. He didn’t know of any work on the subject!
â€” Prabhakar Ragde (@plragde) May 7, 2020
The question is subtle: it’s not about how many possible derivatives there, but rather a question about size in the worst case. If I start parsing a string s with a regular expression e, how big might the nth derivative of e be?
I livecoded an empirical answer. TL;DR: the worst-case derivatives are exponentially large!
I hacked up a driver that would (a) enumerate all regular expressions of a certain size and (b) calculate their 1st through nth derivatives, while (c) logging the largest size seen at each step. (You can get the raw CSV from the repo.)
Data in hand, I did some simple analysis in R. First, I plotted the data on a logarithmic graph with linear fit: et voilà , it’s exponential.
With that graph, I granted myself permission to fit an exponential curve:
So: as you take Brzozowski derivatives, your regular expression will—in the worst case—grow exponentially. That sucks. Unsurprisingly smart constructors don’t give asymptotic savings, even though they save you a huge amount of overhead.
What do you mean, ‘big’?
In my empirical analysis, I used one measure of size: the number of nodes, counting empty regexes as being of size 0. But there are others! In particular, there’s a height measure on regular expressions, too.
In the worst case, the size of a regular expression is exponential in its height. Can I characterize how the height of a regular expression grows with the Brzozowski derivative?
I built a small model in Coq and proved that:
- The height of the derivative of a regular expression E is no greater than twice the height of E itself (deriv_height). I doubt that this is a tight upper bound.
- There is no constant k that bounds the size increase of the derivative (deriv_height_constant_general). The general thrust of the proof is as follows. Consider e = seq (alt epsilon e1) empty, where e1 is a regex with maximal derivative height, i.e., h (d c e1) = k + h e1 for some c. Note that alt epsilon e1 is nullable and also has maximal derivative height. We have that h (d c e) = 2 + h (d c e1') by definition; by assumption, h (d c e) <= k + h e = k + 1 + h e1'. By assumption again, h (d c e1') = k + h e1'. But then how can it be that 2 + k + h e1' <= k + 1 + h e1'?
With these two proofs combined, we can be confident that the exponential growth here is real, and not an artifact of small models. It would be interesting to determine what the multiplier actually is.
In these proofs, I set s empty = h empty = 0. It felt right at the time, but maybe it’s a little weird. So I redid the proofs setting them to 1. The dude abides.
I have several remaining questions. Can we characterize which terms have large derivatives? Do they have equivalent forms with smaller derivatives? I suspect right-associating a seq will be better than left-associating it. Are there rewrites that ensure non-nullability (or minimize nullability) of the left-hand sides of seqs?
A natural idea to try would be to run a minimization algorithm on the original term and after each derivation step, and look at the growth of those minimized terms. It should still be exponential. The problem is that, if I understand correctly, there are no good/fast known minimization algorithms for NFAs (only for DFAs).
Given that we expect the exponential blowup to occur due to duplicated occurrences of parts of the regexp in an alternative (when deriving `(e1|e2)`), another idea would be to be smarter than your smart constructor for alternatives: when producing `(e1|e2)`, instead of only checking whether one side is nullable, check if one side is subsumed by the other: if `e1` accepts all the words of `e2`, then `(e1|e2)` is just `e1`. I suspect this can be implemented reasonably. (A quick search shows algorithms to compute the simulation relation between states of an NFA, which is related but slightly different.)
Thinking about this a bit more, I think this is no surprise that the growth is exponential. (What I actually mean is: “we were surprised, but once we are told the result we can see that it is natural”.) Indeed, the set of iterated derivatives of an expression L corresponds to a set of states for a minimal DFA that recognizes L. We know that in general the NFA->DFA transformation can lead to an exponential blowup (even when using the minimal DFA), and there is an NFA of roughly the same size as the original regexp. By a counting argument, if some NFAs have an exponential number of distinct iterated derivatives, then some of them must have exponential size in the original expression.
Yes, minimization is an interesting place to go! I agree that there should be no free lunch here—we’re simply in exponential territory. Owens, Reppy, and Turon’s paper explores the space of smart constructors a bit—I’m curious about how the different choices here affect proxy measures (regex size and number of states) and direct measures (performance of matching and equivalence checking and explicit enumeration).
The worst blowup doesn’t come from e1|e2—it comes from nullable sequences, where d c (seq e1 e2) = alt (seq (d c e1) e2) (d c e2). Notice that we have both derivatives _and_ the original e2.
I morally agree with you about the post facto inevitability, but I’m not totally following the counting argument. There are exponentially many regular expressions of a given size (e.g., 151 950 syntactically distinct regular expressions of size 6 and 1 468 590 of size 7). It seems to me that, in principle, you could have exponentially many iterated derivatives of a given expression without needing to grow so large. On some level, there shouldn’t be room at sizes n+1 through n+k for exponentially many derivatives from size n and exponentially many new, distinct regular expressions, and their derivatives, and, and… but it’s not obvious to me that it’s impossible to have room.
(You are of course right, the counting argument is flawed.)