Regex answer


The answer to the little puzzle I posted on Friday is now sort of lost – as I updated the regex I was using to handle a full IF ... THEN ... ELSE clause.

But that one is…(at least this is what works for me)

IF\s((.(?!THEN))+)\sTHEN\s((.(?!ELSE))+)?!(\sELSE(.+))IF\s((.(?!THEN))+)\sTHEN\s((.(?!ELSE))+)\sELSE(.+)

And a tougher regular expression problem


So, here’s a tougher regular expressionproblem – one I have not yet worked out myself.

An NFA Graph created to describe a regular exp...
An NFA Graph created to describe a regular expression (Photo credit: Wikipedia)

BASIC‘s essential looping structure is a FOR (STEP) ... NEXT loop. In ZX80/ZX81 BASIC (the dialect I am aiming to emulate with the BINSIC DSL) this is of the form FOR v = num TO num STEP num... NEXT v, where v is a single letter counter variable and num a valid numerical expression.

Groovy does not implement a FOR loop as such but it is not difficult to cover the simple and common case of a STEP 1 loop (though other values might be more difficult), but the tricky bit is that BASIC FOR ... NEXT loops can be nested. So what regular expression will grab the insides of a loop?

Update: The best thing about blogging is that sometimes it helps to clarify the problem. And it has done so here even as I wrote this. I don’t need to worry about the nested loops at all – all I need to do is substitute some code for the FOR statement and replace the NEXT with a }. This has the added advantage of saving poor BASIC coders from disordering their NEXT statements.

A regex puzzle to start your weekend


Regular expressions are surely one of the greatest pleasures, puzzles and pains any programmer has to deal with.

So, here’s one for you to figure out: I have already solved it, but will be intrigued if someone comes up with a better version than mine.

BASIC syntax includes the IF ... THEN construct e.g. IF X > 5 THEN GOTO 150 .

Now, for BINSIC, the BASIC-like domain specific language I am building using Groovy, I have to parse these structures, putting brackets round the if clause and so on. So, I have to be able to pull out the conditional. But BASIC might also have code like this IF X > 5 THEN IF Y < 10 THEN GOTO 150 (NB: I know this can be replicated with a single conditional using boolean operators, but that’s not the point: the language allows THEN to be followed by any other valid BASIC statement and that must include another IF ... THEN clause.)

So, what regex would you use to pick out the first conditional statement but not any subsequent statement (these can be passed recursively into the parser)?

You can cheat by looking at the BINSIC code on GitHub, but what would be the point? I’ll post an/my answer sometime later this weekend…

Some RegEx stuff


Regular Expression NFA
Image by jl_2 via Flickr

So, I was trying to match results from /proc/pid/stat which gives a line that looks like this:

2321 (squid) S 1 2321 2321 0 -1 4202752 4841 0 0 0 530 577 0 0 20 0 1 0 24716 36311040 4417 18446744073709551615 1 1 0 0 0 0 0 4096 85571 18446744073709551615 0 0 17 0 0 0 1945 0 0

And where the 10th entry is the number of minor faults and the 12th entry is the number of major faults (as you can see here, Squid has had 4841 minor faults and 0 major faults since it was restarted when I changed IP address).

So a RegEx seemed to be the way to go and my first attempt looked like this:

(\\S)+\\s(\\S)+\\s(\\S)+\\s(\\S)+\\s(\\S)+\\s(\\S)+\\s(\\S)+\\s(\\S)+\\s(\\S)+\\s(\\S)+\\s(\\S)+\\s(\\S)+

But this did not work … well, it matched the line but it gave me bad results. [Note: \s matches whitespace, \S matches non-whitespace.]

I am sure you are all cleverer than me and saw the flaw straight away – but it took me some time to figure it out: (\\S)+ would treat only the first character of 4841 as a match –  what I needed to use was (\\S+) which matched the group and not just the character.

 

And… further to my querying of the poorly written GNU RegEx documentation, the nmatch parameter should be one bigger than the number of groups expected to be matched – in the above case that means 13.

Confused about GNU regex documentation


The documentation for the regex (regular expression) functionality of the GNU C library says this:

When regexec matches parenthetical subexpressions of pattern, it records which parts of string they match. It returns that information by storing the offsets into an array whose elements are structures of type regmatch_t. The first element of the array (index 0) records the part of the string that matched the entire regular expression. Each other element of the array records the beginning and end of the part that matched a single parenthetical subexpression.

It then goes on to say this:

When you call regexec, you specify how long the matchptr array is, with the nmatch argument. This tells regexec how many elements to store. If the actual regular expression has more than nmatch subexpressions, then you won’t get offset information about the rest of them. But this doesn’t alter whether the pattern matches a particular string or not.

 

So does this mean that nmatch should contain the number of subexpressions and the array should be one bigger (as element 0 matches the whole string) or that nmatch should be equal to the size of the array?

The joy of regex


Deterministic Finite Automaton recognizing the...
Image via Wikipedia

A regex is, as any fule kno, a regular expression: a pretty fundamental concept in computing (the picture here shows an evaluation mechanism for one, albeit for a limited language) – computers, being deterministic, rely on “regular languages” for programs and so regexes are generally speaking widely used in parsing the languages.

But what has motivated me to write this blog (apart from the need to demote a story with the word “sexy” in the headline) is that they are the sort of thing that ought to be taught more widely than the low-quality stuff that seems to pass for ICT education in schools.

Because they link a computing fundamental to something that would be useful in everyday life. Think of it this way – you have a document which logs 20,000 library loans  with a record tag and a 24 hr clock. You know about 500 of them are of interest to you because they record when, say, the system recorded “DVD borrowed” or “CD borrowed” (ie something like “DVD borrowed:    17:03:19”) – how can you extract just that?

A regex could do that in seconds – and while they can look complex, they are not that difficult to learn – at the risk of getting it all wrong (as I haven’t tested it) then the one for this could be (depending on the precise format):

(DVD|CD) borrowed:\t([0-1]?[0-9]|2[0-3])(:([0-5][0-9])){2}/)

Of course you have to have something with a regex engine available to crack that – though it’s actually quite a widely supported option in much free and proprietary software, it’s just nobody teaches you this hugely useful skill.