Comments on: The Big Brzozowski
http://www.weaselhat.com/2020/05/08/the-big-brzozowski/

By: gasche (Thu, 28 May 2020 16:07:27 +0000)
http://www.weaselhat.com/2020/05/08/the-big-brzozowski/comment-page-1/#comment-92026
In reply to Michael Greenberg.

(You are of course right, the counting argument is flawed.)

By: Michael Greenberg (Sat, 09 May 2020 14:56:58 +0000)
http://www.weaselhat.com/2020/05/08/the-big-brzozowski/comment-page-1/#comment-87845
In reply to gasche.

Yes, minimization is an interesting place to go! I agree that there should be no free lunch here—we’re simply in exponential territory. Owens, Reppy, and Turon’s paper explores the space of smart constructors a bit—I’m curious about how the different choices here affect proxy measures (regex size and number of states) and direct measures (performance of matching and equivalence checking and explicit enumeration).
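
For concreteness, here is a minimal OCaml sketch of the kind of smart constructors at issue; the `re` type and the particular simplification rules are illustrative assumptions, not the exact set from Owens, Reppy, and Turon's paper:

```ocaml
type re =
  | Empty                  (* rejects everything *)
  | Eps                    (* accepts only the empty string *)
  | Chr of char
  | Seq of re * re
  | Alt of re * re
  | Star of re

(* alt: drop Empty, collapse syntactic duplicates *)
let alt e1 e2 =
  match e1, e2 with
  | Empty, e | e, Empty -> e
  | _ when e1 = e2 -> e1
  | _ -> Alt (e1, e2)

(* seq: Empty annihilates, Eps is the unit *)
let seq e1 e2 =
  match e1, e2 with
  | Empty, _ | _, Empty -> Empty
  | Eps, e | e, Eps -> e
  | _ -> Seq (e1, e2)

(* star: collapse nested stars and trivial bodies *)
let star e =
  match e with
  | Empty | Eps -> Eps
  | Star _ -> e
  | _ -> Star e
```

Different rule sets (e.g., also normalizing up to associativity and commutativity of `Alt`) trade constructor cost against how many derivatives get identified, which is exactly the design space that should move those proxy and direct measures.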

The worst blowup doesn’t come from `(e1|e2)`; it comes from sequences with a nullable left side: when `e1` is nullable, `d c (seq e1 e2) = alt (seq (d c e1) e2) (d c e2)`. Notice that the result contains both derivatives _and_ the original e2.
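
A sketch of the derivative itself, reusing the `re` type and smart constructors above, with the problematic `Seq` case marked; this follows the standard Brzozowski rules rather than any particular implementation from the post:

```ocaml
let rec nullable = function
  | Empty | Chr _ -> false
  | Eps | Star _ -> true
  | Seq (e1, e2) -> nullable e1 && nullable e2
  | Alt (e1, e2) -> nullable e1 || nullable e2

let rec d c = function
  | Empty | Eps -> Empty
  | Chr c' -> if c' = c then Eps else Empty
  | Alt (e1, e2) -> alt (d c e1) (d c e2)
  | Star e -> seq (d c e) (star e)
  | Seq (e1, e2) ->
    if nullable e1
    (* the blowup case: both d c e1 and d c e2 appear,
       alongside the original e2, so terms can grow at every step *)
    then alt (seq (d c e1) e2) (d c e2)
    else seq (d c e1) e2
```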

I morally agree with you about the post facto inevitability, but I’m not totally following the counting argument. There are exponentially many regular expressions of a given size (e.g., 151 950 syntactically distinct regular expressions of size 6 and 1 468 590 of size 7). It seems to me that, in principle, you could have exponentially many iterated derivatives of a given expression without needing to grow so large. On some level, there shouldn’t be room at sizes n+1 through n+k for exponentially many derivatives from size n and exponentially many new, distinct regular expressions, and their derivatives, and, and… but it’s not obvious to me that it’s impossible to have room.
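
As a hypothetical illustration of one counting convention, here is a memoized counter for syntactically distinct ASTs, where size is taken to be the number of AST nodes and `k` is the alphabet size; both choices are assumptions, so its totals need not match the figures quoted above:

```ocaml
(* Hypothetical convention: size = number of AST nodes,
   over an alphabet of k letters. *)
let count ~k n =
  let memo = Hashtbl.create 64 in
  let rec go n =
    match Hashtbl.find_opt memo n with
    | Some v -> v
    | None ->
      let v =
        if n <= 0 then 0
        else begin
          (* size-1 leaves: Empty, Eps, and one Chr per letter *)
          let leaves = if n = 1 then 2 + k else 0 in
          (* Star wraps a body using the remaining n-1 nodes *)
          let stars = go (n - 1) in
          (* Seq and Alt split the remaining n-1 nodes two ways *)
          let binops = ref 0 in
          for i = 1 to n - 2 do
            binops := !binops + 2 * go i * go (n - 1 - i)
          done;
          leaves + stars + !binops
        end
      in
      Hashtbl.add memo n v;
      v
  in
  go n

(* e.g. count ~k:2 6 *)
```

Whatever the convention, the count grows exponentially in the size bound, which is what the "is there room?" question turns on.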

By: gasche (Sat, 09 May 2020 06:56:35 +0000)
http://www.weaselhat.com/2020/05/08/the-big-brzozowski/comment-page-1/#comment-87745

A natural idea to try would be to run a minimization algorithm on the original term and on the result of each derivation step, and to look at the growth of those minimized terms. It should still be exponential. The problem is that, if I understand correctly, there are no good/fast known minimization algorithms for NFAs (only for DFAs).
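
A sketch of that experiment, reusing the `re` type and `d` from the sketches above; `minimize` is left as an identity placeholder precisely because no good minimizer is assumed to exist:

```ocaml
let rec size = function
  | Empty | Eps | Chr _ -> 1
  | Star e -> 1 + size e
  | Seq (e1, e2) | Alt (e1, e2) -> 1 + size e1 + size e2

(* derive one character at a time, reporting the (minimized) size
   after each step *)
let trace_sizes e word =
  let minimize e = e in  (* placeholder: plug in a minimizer here *)
  let step e c =
    let e' = minimize (d c e) in
    Printf.printf "after %c: size %d\n" c (size e');
    e'
  in
  ignore (String.fold_left step (minimize e) word)
```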

Given that we expect the exponential blowup to occur due to duplicated occurrences of parts of the regexp in an alternative (when deriving `(e1|e2)`), another idea would be to be smarter than your smart constructor for alternatives: when producing `(e1|e2)`, instead of only checking whether one side is nullable, check if one side is subsumed by the other: if `e1` accepts all the words of `e2`, then `(e1|e2)` is just `e1`. I suspect this can be implemented reasonably. (A quick search shows algorithms to compute the simulation relation between states of an NFA, which is related but slightly different.)
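
A sketch of what such a subsumption check could look like, reusing `nullable`, `d`, and `alt` from above: it decides whether L(e2) ⊆ L(e1) by exploring pairs of iterated derivatives. The `alphabet` parameter is an assumption, and the exploration only terminates when the smart constructors keep the set of iterated derivatives finite (e.g., identifying terms up to associativity/commutativity/idempotence), so treat this as an illustration rather than a decision procedure:

```ocaml
(* subsumed alphabet e2 e1 = true iff every word of e2 is a word of e1,
   checked coinductively on pairs of derivatives *)
let subsumed alphabet e2 e1 =
  let seen = Hashtbl.create 64 in
  let rec go e2 e1 =
    Hashtbl.mem seen (e2, e1)
    || begin
         Hashtbl.add seen (e2, e1) ();
         (not (nullable e2) || nullable e1)
         && List.for_all (fun c -> go (d c e2) (d c e1)) alphabet
       end
  in
  go e2 e1

(* a smarter alternative constructor: drop a branch its sibling subsumes *)
let alt' alphabet e1 e2 =
  if subsumed alphabet e2 e1 then e1
  else if subsumed alphabet e1 e2 then e2
  else alt e1 e2
```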

Thinking about this a bit more, I think it is no surprise that the growth is exponential. (What I actually mean is: “we were surprised, but once we are told the result we can see that it is natural”.) Indeed, the set of iterated derivatives of an expression L corresponds to a set of states for a minimal DFA that recognizes L. We know that in general the NFA->DFA transformation can lead to an exponential blowup (even when using the minimal DFA), and there is an NFA of roughly the same size as the original regexp. By a counting argument, if some NFAs have an exponential number of distinct iterated derivatives, then some of those derivatives must have size exponential in the original expression.
