FSA algorithms, Part 1: Regex parsing

I've been working on a game on and off for a while now, which involves building a DFA or NFA that matches a given regular expression. I worked on something similar as a college project, but the end product wasn't great: it was written in OCaml, which has middling options for UI, and we spent most of the project's effort just getting the core functionality working.

In this series of posts I hope to go through the implementation of the relevant algorithms for this game in Python, which will make them a lot easier to understand (and also a lot easier to write up as I go).

Here's the general plan for the series, which will follow the general algorithms needed for the base game functionality:

  • Regular expression parsing. This is used for parsing the inputs to levels, so that they can be specified in a typical regex format rather than via some custom format that may be easier to parse but may be more difficult to use in the long run.
  • Regex to NFA conversion
  • NFA minimization. The algorithm for converting regular expressions to NFAs can be inefficient, so minimizing the resulting NFA can be a good optimization before proceeding to the DFA conversion.
  • Efficient NFA to DFA conversion. It's common knowledge that converting an NFA to a DFA can result in exponentially many states, as the DFA has to contain the logic for the superposition of states that an NFA can handle. But there are ways to do this conversion slightly more efficiently than the naive subset method.
  • DFA minimization. We'll need to minimize DFAs in order to ensure that they are compared accurately. It will also help us score the user's submission by determining whether fewer states could have been used.
  • DFA comparison. Once we have the minimized DFAs, we need to compare them. But it's not a simple matter of dfa1 == dfa2 - the names of the states may differ. We'll use some form of DFS here.

This will be a long series, and I hope I can finish it in some reasonable timeframe. For now, here's regex parsing.

Regex classes

First, we need to get some types going. In OCaml, this was pretty easy - the language is made for that kind of stuff. In Python, we'll have to be a bit more mundane and use classes.

class Regex:
    """Base class for our regex AST nodes."""
    def __repr__(self):
        return str(self)


class Union(Regex):
    """Alternation: matches either the left or the right sub-expression."""
    def __init__(self, left, right):
        self.left = left
        self.right = right
    def __str__(self):
        return 'Union({}, {})'.format(str(self.left), str(self.right))


class Concat(Regex):
    """Concatenation: matches the left sub-expression followed by the right."""
    def __init__(self, left, right):
        self.left = left
        self.right = right
    def __str__(self):
        return 'Concat({}, {})'.format(str(self.left), str(self.right))


class Star(Regex):
    """Kleene star: matches zero or more repetitions of the sub-expression."""
    def __init__(self, exp):
        self.exp = exp
    def __str__(self):
        return 'Star({})'.format(str(self.exp))


class Letter(Regex):
    """A single literal character."""
    def __init__(self, val):
        self.val = val
    def __str__(self):
        return 'Letter({})'.format(self.val)

These classes are basically just dummy AST nodes for the actual regular expression operations, along with some pretty printing functionality so we can check our results nicely.
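
To get a feel for these, here's the tree for the regex a(b+c)* built by hand (using + for union, as the parser later in this post does):

ast = Concat(Letter('a'), Star(Union(Letter('b'), Letter('c'))))
print(ast)  # Concat(Letter(a), Star(Union(Letter(b), Letter(c))))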

Parsing regex strings algorithmically

There are a couple of options for parsing the regex strings. One might reach for a parser framework like PLY or Lark. But for a language as simple as regular expressions, that won't be worth the hassle.

As you may have noticed from the classes above, we're not doing anything complicated here as far as regexes go - just the bare minimum (the usual subset taught in courses covering this material): Kleene star, concatenation, and union. This subset is small enough that we can hand-roll the parsing with a simple algorithm and save ourselves a lot of overhead.

The algorithm we're going to use is the Shunting-yard algorithm; the Wikipedia article on it gives a good overview.

This algorithm converts an expression in infix notation into one in postfix notation (RPN, which you may be familiar with from old calculators and programs like dc). This stack-based form of notation is much easier for us to parse manually than infix notation.
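
For example, the arithmetic expression 1 + 2 * 3 becomes 1 2 3 * + in postfix, while (1 + 2) * 3 becomes 1 2 + 3 * - the parentheses disappear entirely, because the order of operations is now encoded in the order of the tokens.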

So, our goal here is as follows:

  • Convert the regular expression string passed into postfix notation
  • Parse through the stack of this postfix notation and build up a Regex class which will serve as our AST

In future sections, we'll parse this AST to get an NFA.

Shunting-yard implementation

First, we need a way to get the precedence of operators. A simple function like this will work:

def compare_precedence(op1, op2):
    # op1 is the operator on top of the stack, op2 is the incoming token.
    # Return True if op1 binds strictly tighter than op2, i.e. op1 should be
    # popped to the output queue before op2 is pushed onto the stack.
    if op1 in ['(', ')']:
        return False
    if op2 == '*':
        return False
    if op1 == '*':
        return True
    if op1 == '+':
        return False
    if op1 == '&' and op2 == '&':
        return False
    if op1 == '&' and op2 == '+':
        return True
    raise SyntaxError('unexpected operator pair: {} {}'.format(op1, op2))
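
To make the convention concrete, here's how a few calls behave (just a sanity check, not part of the parser - True means the operator on the stack gets popped first):

compare_precedence('*', '&')  # True: a star on the stack outranks an incoming concat
compare_precedence('&', '+')  # True: concat outranks an incoming union
compare_precedence('+', '&')  # False: union is the loosest-binding operator
compare_precedence('(', '+')  # False: open parens are never popped by precedence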

Out of the symbols above, the ampersand (&) may be confusing. This is our proxy for concatenation, since concatenation normally doesn't show up as an explicit symbol in regular expressions. For instance, the regex ab is really a•b, but we omit the • as it's just implied. This implicitness becomes annoying when we want to actually parse the expressions, so we add these ampersands in like this:

def add_concat_ampersands(s):
    string_with_concat = '&'.join(list(s))
    def remove_bad_concat(char_list):
        match char_list:
            case []:
                return []
            case ['*', '&', *rest]:
                return ['*'] + remove_bad_concat(rest)
            case ['(', '&', *rest]:
                return ['('] + remove_bad_concat(rest)
            case ['&', ')', *rest]:
                return [')'] + remove_bad_concat(rest)
            case ['+', '&', *rest]:
                return ['+'] + remove_bad_concat(rest)
            case ['&', '+', '&', *rest]:
                return ['+'] + remove_bad_concat(rest)
            case ['&', '+', *rest]:
                return ['+'] + remove_bad_concat(rest)
            case [head, *rest]:
                return [head] + remove_bad_concat(rest)
    
    return ''.join(remove_bad_concat(list(string_with_concat)))

Basically, we just stick ampersands everywhere, and then remove the ones that don't have any business being there (like the ones just inside parentheses, or between operators and their arguments).

You may notice something else a bit strange here, which is that we're removing the ampersands between the stars (or asterisks, or Kleene stars - henceforth referred to as "stars") and their arguments in a way that would imply that the stars act on arguments to the right of them rather than the usual left. This is due to a limitation of the algorithm we're using to convert to RPN - it's much easier to work with unary operators like unary minus, which appear on the left side of their arguments. We'll get back to this later!
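
Here's what the ampersand insertion looks like on a couple of small inputs (note that the stars are already in prefix position, as just discussed):

print(add_concat_ampersands('*ab'))      # *a&b
print(add_concat_ampersands('(*ab)+c'))  # (*a&b)+c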

For now, let's continue with the RPN-ifying journey.

def regex_of_string_reversed_stars(s):
    string_without_bad_concat = add_concat_ampersands(s)

    output_queue = []
    op_stack = []
    
    def stack_top(stack):
        try:
            return stack[-1]
        except IndexError:
            return None
    
    ...

After adding our concat ampersands, we initialize our output queue and operator stack, which will be used in the algorithm. We also define a quick helper to peek at the top of our stack (which gives us None if the stack is empty).

    ...

    def parse_token(token):
        if token.isspace():
            return
        if token == '(':
            return op_stack.append(token)
        if token in ['*', '+', '&']:
            while len(op_stack) > 0 and compare_precedence(stack_top(op_stack), token):
                output_queue.append(op_stack.pop())
            return op_stack.append(token)
        if token == ')':
            while op_stack[-1] != '(':
                output_queue.append(op_stack.pop())
            return op_stack.pop()
        return output_queue.append(token)

    for token in string_without_bad_concat:
        parse_token(token)

    ...

Here, we define (and call) another helper function, which parses a given token (character in this case) according to some simple rules:

  • We ignore any whitespace
  • We stick any open parens on the stack
  • If the token is an operator, we use our precedence rules and stack to pop off as many tokens as necessary into our queue
  • If the token is a close paren, we pop from our stack onto our queue until we see an open paren
  • If the token is something else (not an operator, paren, or whitespace) we just stick the token directly into the queue

For a better explanation on why/how this works, see the above Wikipedia link on the algorithm.

    ...

    while len(op_stack) > 0:
        output_queue.append(op_stack.pop())

    output_stack = []

    ...

To prepare for the second part of the process - converting the RPN into an AST - we drain the remaining operators from the stack into the queue, and initialize an output stack. (We could have reused the operator stack, but output_stack is a more accurate name for what it holds now.)
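
To keep a running example going: for an input like a*(b+c) (the prefix-star form of a(b+c)*), the ampersand pass gives us a&*(b+c), and after the loop and this final drain the output queue holds a, b, c, +, *, & - exactly the postfix form we wanted.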

    ...

    def inv_rpn(token):
        if token == '&':
            a, b = output_stack.pop(), output_stack.pop()
            return output_stack.append(Concat(b, a))
        if token == '+':
            a, b = output_stack.pop(), output_stack.pop()
            return output_stack.append(Union(b, a))
        if token == '*':
            return output_stack.append(Star(output_stack.pop()))
        return output_stack.append(Letter(token))

    for token in output_queue:
        inv_rpn(token)

    return output_stack.pop()

Finally, we define a helper function to parse the RPN into an AST using the classes we created earlier.

  • If the token is a dyadic operator (two args), we pop two items from the stack and create the relevant object using those items as arguments. (These operators are Concat and Union.)
  • If the token is a monadic operator (single arg), we pop one item from the stack and create the relevant object using that item as an argument. (Here, the only monadic operator is Star.)
  • If the token is not an operator, we just append it to the output stack wrapped in a Letter (i.e. as the character itself).

After we define the helper function, we iterate through the RPN'd tokens in the output queue we got from the first part and process each token. When we're done, the only remaining item on the stack should be the final result of the parsing process.
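
Continuing the running example, the postfix tokens a b c + * & fold up like this:

  • a, b, and c each get pushed as Letters, so the stack holds Letter(a), Letter(b), Letter(c)
  • + pops Letter(c) and Letter(b) and pushes Union(Letter(b), Letter(c))
  • * pops that and pushes Star(Union(Letter(b), Letter(c)))
  • & pops the Star and Letter(a) and pushes Concat(Letter(a), Star(Union(Letter(b), Letter(c)))), which is our finished AST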

Now we just have one task left: moving the stars around so that the user can write them in the usual postfix position (argument on the left, as in a*), while our parsing function gets them in the prefix position (argument on the right, as in *a) that it expects.

Reversing the star associativity

It turns out that there's not really a super nice way to do this either. We've gotta use another stack and queue, because we're dealing with parentheses.

For instance, if we supply c(a(b*a)*)* as our regex string, we want to end up with c*(a*(*ba)) - which would be difficult if we didn't account for nested parens correctly.

Here's the function, in all its gross glory:

def reverse_star_locations(s):
    stack = []
    queue = []
    for i in range(len(s)):
        if s[i] == '(':
            to_stack = '('
        elif s[i] == ')':
            while stack[-1] != '(':
                popped = stack.pop()
                queue.insert(0, popped)
            stack.pop()
            to_stack = '(' + ''.join([str(q) for q in queue]) + ')'
            queue = []
        elif s[i] == '*':
            popped = stack.pop()
            to_stack = '*' + popped
        else:
            to_stack = s[i]
        stack.append(to_stack)
    return ''.join(stack)

What's going on here? We initialize our stack and queue, and process the characters similarly to before. We're not using a helper function this time, and we're using a temp variable (to_stack) to hold what we'll push until the end of each loop iteration, so it looks a bit messier.

  • If the token we see is an open paren, we stick it on the stack
  • If the token is a close paren, then we pop until we hit an open paren, sticking each of the tokens along the way into the queue. Then we combine the entire queue into one parenthesized string, shove it onto the stack, and empty the queue again
  • If the token is a star, we pop from the stack, put the star in front of what we popped, and shove it back into the stack
  • Otherwise, we just plop the token on the stack and continue

When we're done, we've got a stack with some strings on it. We join them together and return the final result.
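
A couple of quick checks:

print(reverse_star_locations('ab*'))          # a*b
print(reverse_star_locations('c(a(b*a)*)*'))  # c*(a*(*ba))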

Now that we're done with all that, we can finally create a single function to give us the AST from a nice regex string.

def regex_of_string(s):
    stars_reversed = reverse_star_locations(s)
    output = regex_of_string_reversed_stars(stars_reversed)
    return output
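
And an end-to-end check (the formatting of the output comes from the __str__ methods on our classes):

print(regex_of_string('a(b+c)*'))
# Concat(Letter(a), Star(Union(Letter(b), Letter(c))))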

Hope you enjoyed! If you liked this check out the other stuff on my website. See you next time!