A regex puzzle to start your weekend


Regular expressions are surely one of the greatest pleasures, puzzles and pains any programmer has to deal with.

So, here’s one for you to figure out: I have already solved it, but will be intrigued if someone comes up with a better version than mine.

BASIC syntax includes the IF ... THEN construct, e.g. IF X > 5 THEN GOTO 150.

Now, for BINSIC, the BASIC-like domain-specific language I am building using Groovy, I have to parse these structures, putting brackets round the IF clause and so on. So, I have to be able to pull out the conditional. But BASIC might also have code like this: IF X > 5 THEN IF Y < 10 THEN GOTO 150. (NB: I know this can be replicated with a single conditional using boolean operators, but that's not the point: the language allows THEN to be followed by any other valid BASIC statement, and that must include another IF ... THEN clause.)

So, what regex would you use to pick out the first conditional statement but not any subsequent statement (these can be passed recursively into the parser)?

You can cheat by looking at the BINSIC code on GitHub, but what would be the point? I’ll post an/my answer sometime later this weekend…

Confused about GNU regex documentation


The documentation for the regex (regular expression) functionality of the GNU C library says this:

When regexec matches parenthetical subexpressions of pattern, it records which parts of string they match. It returns that information by storing the offsets into an array whose elements are structures of type regmatch_t. The first element of the array (index 0) records the part of the string that matched the entire regular expression. Each other element of the array records the beginning and end of the part that matched a single parenthetical subexpression.

It then goes on to say this:

When you call regexec, you specify how long the matchptr array is, with the nmatch argument. This tells regexec how many elements to store. If the actual regular expression has more than nmatch subexpressions, then you won’t get offset information about the rest of them. But this doesn’t alter whether the pattern matches a particular string or not.

So does this mean that nmatch should contain the number of subexpressions, with the array one element bigger (as element 0 records the match for the entire regular expression), or that nmatch should simply be equal to the size of the array?
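A small experiment is one way to probe this. The sketch below assumes the second reading, which I believe is also how POSIX describes regexec: nmatch is the number of elements in the array, so recording N subexpressions needs N + 1 elements and nmatch set to N + 1. The helper name print_matches is my own invention, not part of the C library; re_nsub is the regex_t field that regcomp fills in with the number of parenthetical subexpressions.

```c
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not part of glibc): compile `pattern` as a POSIX
 * extended regex, match it against `text`, and print every offset pair
 * that regexec records. Returns the number of array elements used
 * (re_nsub + 1), or -1 on a compile error, allocation failure or no match. */
int print_matches(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* Assumption under test: nmatch is the array length, i.e. one slot
     * for the whole match plus one per parenthetical subexpression. */
    size_t nmatch = re.re_nsub + 1;
    regmatch_t *pmatch = malloc(nmatch * sizeof *pmatch);
    if (pmatch == NULL) {
        regfree(&re);
        return -1;
    }

    int result = -1;
    if (regexec(&re, text, nmatch, pmatch, 0) == 0) {
        for (size_t i = 0; i < nmatch; i++)
            printf("pmatch[%zu]: %d..%d\n", i,
                   (int)pmatch[i].rm_so, (int)pmatch[i].rm_eo);
        result = (int)nmatch;
    }

    free(pmatch);
    regfree(&re);
    return result;
}
```

Calling print_matches("([a-z]+)=([0-9]+)", "width=42") should report three elements: 0..8 for the whole match, 0..5 for the first subexpression and 6..8 for the second. Passing a smaller nmatch to regexec would simply leave the later subexpressions unrecorded, which fits the second quoted paragraph.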