Parser combinators

Posted on 2019-01-03

In the last post I mentioned how I’ve gotten back into Haskell over the last months. Come December and time for another round of The Advent of Code. I haven’t been participating in this programming challenge series the last few year. However, since this is an excellent chance to apply Haskell to new problems I decided to start follow the challenges. I have thoroughly enjoyed the ones I have attempted thus far, even though I’ve not been able to keep up with the daily challenges. They often end up taking a surprising amount of time, perhaps also due to me using Haskell.

One common denominator with all the challenges in the Advent of Code is the requirement to parse input data. Parsers may be a “solved problem”, but there are many ways to approach parsing when faced with challenges like these, ranging from pragmatic uses of regular expressions and ad-hoc string manipulation, to fully-specified grammars and parser generators.

Coincidentally, parsing is one of the areas in which Haskell truly shines, especially in the domain of parser combinators. There are multiple parser combinator libraries for Haskell, yet I’m not going to focus on any of them here. For the Advent of Code, I didn’t want to stray too far away from the default base package, and to my joy I found that there actually is a quite capable parser combinator module included in base: Text.ParserCombinators.ReadP.

Co-co-co-combinators!

A parser combinator is a higher-order function which accepts parsers as input, returning a new parser as output. A combinator is a higher-order function that uses only function application and earlier defined combinators to define a result from its arguments1.

So, in short, a parser combinator library lets you construct complex parsers by composing smaller, “atomic” parsers. Here’s an example before we get into the details:

This is an example of a ReadP parser which parses and yields a String containing only digits. many1 is an example of a combinator:

λ: :t many1
many1 :: ReadP a -> ReadP [a]
λ: :t satisfy
satisfy :: (Char -> Bool) -> ReadP Char

many1 repeats a given parser one or more times, yielding a list of the result, while satisfy yields a single Char if it passes a predicate function (isDigit in our example).

Demystifying some types

ReadP is our parser type. Its actual internals are out of scope of this post, but it can help our intuition by imagining it encapsulates stateful “stuff” like the input along with the cursor location within the input, backtracking/lookahead information, and so on. We’ll treat it as an opaque data type, but here’s what GHCi has to say about it (with some details removed):

λ: :i ReadP
newtype ReadP a
  = Text.ParserCombinators.ReadP.R (...)
instance Applicative ReadP
instance Functor ReadP
instance Monad ReadP

We don’t get to look at what happens within ReadP, but we need a way to actually run our parser on a piece of input. To do this, we can use the aptly named readP_to_Ssigh

λ: :i readP_to_S
readP_to_S :: ReadP a -> ReadS a

We don’t get much wiser from its type signature, so let’s recurse yet another turtle2:

λ: :i ReadS
type ReadS a = String -> [(a, String)]

Aha! ReadS is just a type alias for a function accepting a String and returning a list of matches along with the remaining, unmatched input. We can expand ReadS in the type signature of readP_to_S to illustrate this a bit better:

The reason the return value is a list, is because an input could result in multiple successful parses, and with varying lengths of matching input. This interface can be unexpected if you’re used to parsers either failing or succeeding with a single result.

Let’s give it a spin with the atomic get parser, consuming a single character from the input:

λ: :t get
get :: ReadP Char
λ: readP_to_S get "foo"
[('f',"oo")]

Since our simple parser only consumes a single character, there’s only one successful parse, and the unprocessed input is returned as the second tuple element. Let’s try out parseDigits:

λ: readP_to_S parseDigits "foo"
[]
λ: readP_to_S parseDigits "123"
[("1","23"),("12","3"),("123","")]
λ: readP_to_S parseDigits "123foo"
[("1","23foo"),("12","3foo"),("123","foo")]

We’re getting multiple successful parses with many1 since we’re not restricting our parser (parseDigits in particular) to parse until the end of input. Skipping ahead, here’s how we can use the Applicative instance of ReadP and the <* operator to combine parseDigits with the eof parser:

λ: readP_to_S (parseDigits <* eof) "123"
[("123","")]

Let’s get Funky!

We’ve already seen glimpses of how we combine predefined parsers from ReadP into more complex parsers. Several of the ReadP parsers accept other parsers as input, hence the combinator-part. parseDigits uses only function application to construct a parser of digits from many1 and satisfy with a predicate. There’s one major flaw with parseDigits though, it doesn’t actually yield us a number!

One way to go at this, is to do our parse, then map a constructor accepting String over the parse result:

λ: map (readInt . fst) $ readP_to_S parseDigits "123"
[1,12,123]

Unfortunately this is both clunky, and won’t scale well when we want to expand our parser to construct more complex data structures. However, recalling the type of readP_to_S we can see that indeed, it accepts parsers which are polymorphic in their return values:

This means we can create parsers which directly yield the data types we desire. In order to do that though, we need to familiarize ourselves with some of the ReadP typeclass instances, mainly: Functor and Applicative. There are also instances for Alternative and Monad.

Since ReadP has a Functor instance, our intuition should tell us that it should be quite possible to fmap read over our parser to convert our parser result type:

and in GHCi:

λ: :t readP_to_S parseInt "123"
readP_to_S parseInt "123" :: [(Int, String)]
λ: readP_to_S parseInt "123"
[(1,"23"),(12,"3"),(123,"")]

If we’re concerned about the (un)safety of read, we can choose to use readMaybe from Text.Read instead:

which gives us:

λ: readP_to_S parseMaybeInt "123"
[(Just 1,"23"),(Just 12,"3"),(Just 123,"")]
λ: readP_to_S (parseMaybeInt <* eof) "123"
[(Just 123,"")]

Applicative parsing

2D coordinates are a frequent source of input in the Advent of Code challenges. The way we like to represent a 2D coordinate in code is using a Tuple of two elements, and the elements being the x and y position of the coordinate, or point. Points are often serialized using surrounding parenthesis, a comma separating the two parts, and optional whitespace:

(1,2)
(-10, 100)
(  5, -42)

We do not yet have the tools we need in order to create parsers for types which are constructed from multiple arguments, like (,) (the Tuple data constructor). Functor only allows us to map over a (single) parse result to yield another type. Granted, we could give up on type safety and parse our input into substrings, which we again validate piece by piece to construct our values. But we don’t accept compromises like these, do we? No we don’t.

Applicative extends our toolbox with the ability to lift n-ary data constructors and functions into the world of ReadP, allowing us to construct more complex types:

λ: readP_to_S ((,) <$> parseInt <*> parseInt) "123"
[((1,2),"3"),((1,23),""),((12,3),"")]

We’re now getting tuples out of our parser, although we’re not parsing actual tuples yet. Also note that our parser is ambiguous. The result of the parser should be a tuple, but it’s equally valid to create a tuple of the first and second digit, as is splitting x and y between the second and third digits. The input format (and thus our parser) must change to specify where one coordinate component ends and the other begins. We use static delimiters “(”, “)”, and “,” for this.

To match static input ReadP provides char and string:

λ: :t char
char :: Char -> ReadP Char
λ: :t string
string :: String -> ReadP String

Both of these parsers accept a character (or string), resulting in a parser which yields the same value if the input matched successfully:

λ: readP_to_S (char 'a') "foo"
[]
λ: readP_to_S (char 'f') "foo"
[('f',"oo")]
λ: readP_to_S (string "bar") "foo"
[]
λ: readP_to_S (string "foo") "foo"
[("foo","")]

We can combine parseInt with char using Applicative to parse tuples (in a strict manner, without any whitespace):

and applied to some input:

λ: readP_to_S parseCoord "(1,2)"
[((1,2),"")]

The <* and *> operators just discard the result of the parsers on the right and left hand side, respectively. The arrows “point” at the part of the sequence whose value will be returned.

Negative space

There are two issues with our coordinate parser: it’s quite strict in the way it supports no whitespace, and it does not support negative values for the x and y component.

To add whitespace support, we can use the operators from Applicative in combination with the provided skipSpaces parser. Let’s update parseInt to consume whitespace surrounding a number:

parseCoord should be updated to use parserIntSpaces:

Not much change required at all, besides renaming a function reference! We’re now able to parse coordinates with whitespace:

λ: readP_to_S parseCoord "(  1  ,  2  )"
[((1,2),"")]

Finally, in order to support negative numbers prefixed with “-” we need to change parseInt yet again. Using the provided option parser we can add support of an optional prefix of “-”:

Note that we have to fmap the list cons operator to prepend the sign to the resulting list of digits. read for Int also supports whitespace around the number, which means option can yield a blank space character if there is no “-” in front of the number.

option takes, along with a parser, a default value which it yields if the parse is not successful:

λ: :t option
option :: a -> ReadP a -> ReadP a

Now we can parse negative numbers too!

λ: readP_to_S (parseSignedInt <* eof) "-123"
[(-123,"")]

Monadic parsing

We’ve already mentioned that ReadP has a Monad instance, giving us access to do notation. This allows a bit more flexibility and write parsers in a somewhat more imperative style. Let’s do a rewrite of our tuple parser:

Note that we can use the Monadic sequencing operator >> and Applicative sequencing operator *> interchangeably. While in a do notation it might be more consistent to stick with the monadic operators.

One of the benefits of using the Monad instance for ReadP is that it simplifies sequencing parsers where later parts of the parser depends on earlier matched input. For instance, we might want to parse a piece of input which starts with a number of elements to parse, followed by the elements themselves:

We use the count parser to repeat a given parser n times:

λ: :t count
count :: Int -> ReadP a -> ReadP [a]

This parser first reads a number n stating how many elements to parse, then proceeds to parse n numbers separated by newline:

λ: readP_to_S parseCountInts "2\n1\n2\n3\n4\n5"
[([1,2],"3\n4\n5")]
λ: readP_to_S parseCountInts "3\n1\n2\n3\n4\n5"
[([1,2,3],"4\n5")]
λ: readP_to_S parseCountInts "4\n1\n2\n3\n4\n5"
[([1,2,3,4],"5")]

The parser does not proceed to process input beyond the number of elements we specify.

Summary

Functional programming in Haskell centers around breaking down problems into smaller, independent pieces. Then using various means of composition to combine these pieces into a working solution. Parser combinators are yet another example of how we can achieve proper reusability in Haskell. Recall how we reused parseInt (and its derivatives) to create more complex parsers, which again could be composed to create even larger parsers.

I really encourage you to have a go at using either ReadP or the many parser combinator libraries available. Some focus on performance and speed, others on diagnostics and error reporting.

Footnotes


  1. https://en.wikipedia.org/wiki/Combinatory_logic

  2. https://en.wikipedia.org/wiki/Turtles_all_the_way_down