Booknotes 4.
I’m writing a guide to building your own database server. Here’s what I’ve been doing:
Progress
This week I’ve continued with my sample implementation, which has helped me go back and improve the first four chapters.
A new chapter appears
In my last booknotes I said that I’d restarted my sample implementation so I could go through the test suite as someone following the guide for the first time would. This has been a great way to evaluate the existing chapters and make sure they have a clear focus and the right amount of complexity.
It was very satisfying to find there was a whole chapter’s worth of material on parsing that was previously muddled into three other chapters. This week I’ve written that chapter, which I’ve called ‘Resilient parsing’; it goes into detail about mixed capitalization, extra whitespace and identifiers that overlap with keywords, and suggests using tokenization to help manage this complexity.
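To make that concrete, here are the kinds of input I mean (my own examples rather than ones from the guide); all three should tokenize to exactly the same thing:

SELECT selected_value FROM table_b;
select     selected_value   from table_b;
SeLeCt selected_value FrOm table_b;

Note that selected_value begins with the keyword select, which is exactly the kind of overlap the chapter deals with.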
Simpler tokenization
Focusing on parsing also made me realise that I had overcomplicated the regular expressions in my tokenizer. Previously I broke the input up with a series of regular expressions like this:
# String#[] with a regexp returns the matched text, or nil when
# there's no match.
if keyword = input[KEYWORD_PATTERN]
  Token.new(:keyword, keyword.upcase)
elsif identifier = input[IDENTIFIER_PATTERN]
  Token.new(:identifier, identifier)
elsif symbol = input[SYMBOL_PATTERN]
  Token.new(:symbol, symbol)
elsif integer = input[INTEGER_PATTERN]
  Token.new(:integer, integer)
end
The problem was that keywords (such as SELECT, insert, WHERE) and identifiers (such as my_column, TABLE_B or selected_value) look very similar, and the only way to distinguish between them is to use a look-ahead in the regular expression to make sure a keyword isn’t followed by another letter, digit or underscore. I also didn’t like how changing the order of the conditions would break the tokenizer, because the identifier regular expression didn’t exclude keywords. Put another way, it was hard to make all my regular expressions mutually exclusive.
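For illustration, here’s the sort of look-ahead I mean; the keyword list and anchoring are my own simplification rather than the guide’s exact pattern:

# Match SELECT/INSERT/WHERE in any case, but only when the next
# character isn't a letter, digit or underscore. Without the (?!\w)
# look-ahead, "selected_value" would match as the keyword SELECT
# followed by the leftover input "ed_value".
KEYWORD_PATTERN = /\A(SELECT|INSERT|WHERE)(?!\w)/i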
I found I could avoid this by matching on a single pattern that matches both keywords and identifiers, then checking whether the match is in a list of keywords:
if word = input[WORD_PATTERN]
  if KEYWORDS.include?(word.upcase)
    Token.new(:keyword, word.upcase)
  else
    Token.new(:identifier, word)
  end
elsif symbol = input[SYMBOL_PATTERN]
  Token.new(:symbol, symbol)
elsif integer = input[INTEGER_PATTERN]
  Token.new(:integer, integer)
end
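For completeness, here’s a minimal sketch of the supporting definitions this version needs; the patterns and keyword list are illustrative rather than the guide’s exact ones:

WORD_PATTERN    = /\A[A-Za-z_]\w*/
SYMBOL_PATTERN  = /\A[(),;*=]/
INTEGER_PATTERN = /\A\d+/
KEYWORDS        = %w[SELECT INSERT WHERE FROM].freeze

# "selected_value" matches WORD_PATTERN in full, fails the KEYWORDS
# check and becomes an identifier, while "select" matches the same
# pattern and becomes a keyword. No look-aheads required.

Because every word is matched by the same pattern and classified afterwards, reordering the branches can no longer change how a word tokenizes.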
Enumerators without next
I was looking at using enumerators in my sample implementation and found that certain kinds of enumerators in Ruby don’t have next and related methods available.
I submitted a bug to Ruby for the problem. There is likely some reason that product and chain enumerators have been implemented differently to regular enumerators, but I can’t work out why. I’m hoping that they could be made more similar or at least the documentation improved to explain the difference.
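As a rough illustration of the difference (my own repro sketch; the exact behaviour depends on your Ruby version, and Enumerator::Product needs Ruby 3.2 or later):

plain = [1, 2, 3].to_enum
plain.next #=> 1
plain.next #=> 2
plain.peek #=> 3

# Chain and product enumerators are built from the same pieces, but
# external iteration didn't work for me here:
chain = Enumerator::Chain.new([1, 2], [3, 4])
chain.next # raises instead of returning 1

product = Enumerator::Product.new([1, 2], [:a, :b])
product.next # raises instead of returning [1, :a]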
Talk
I’d still like to submit a lightning talk for February’s LRUG, but I’m away for some of next week so I don’t think I’ll have the time. Either way, I’ll be there to watch the other talks.