I now have some code that is meant to parse an XML file of approximately 5 billion lines.
Unfortunately it fails, every time (it seems), on line 4,295,025,275.
This is something of a nightmare to debug – but it looks like an overflow bug (in the xerces-c parser) of some sort.
Do I try to find it by inspection or by repeating the runs (it takes about 4 – 5 hours to get to the bug point)?
Inspection is probably quite difficult but simple to organise. Repeating the runs is (relatively) easier, since I would just step through the code, but is perhaps impossible to organise: how many weeks of wall clock time in a debugger before we get to that instance?
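A quick sanity check on the overflow theory, sketched in Python. The assumption here (mine, not something confirmed from the xerces-c source) is that the suspect is a 32-bit unsigned counter, which would wrap at 2^32. The sketch only shows how far past that point the failing line sits.

```python
# Line on which the parse fails, as reported above.
failing_line = 4_295_025_275

# Assumption: the suspect counter is 32-bit unsigned, so it wraps at 2**32.
wrap_point = 2 ** 32

# How many lines past the wrap point the failure occurs.
lines_past_wrap = failing_line - wrap_point

# What a wrapped 32-bit counter would hold at the failing line.
wrapped_value = failing_line % wrap_point

print(f"2**32           = {wrap_point:,}")
print(f"lines past wrap = {lines_past_wrap:,}")
print(f"wrapped counter = {wrapped_value:,}")
```

The failure lands only a few tens of thousands of lines beyond 2^32 in a file of roughly 5 billion, which fits a 32-bit overflow but does not prove it: the crash is not on the wrap itself, so something would have to go wrong a little later, once the wrapped value is actually used.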
Regular expressions are surely one of the greatest pleasures, puzzles and pains any programmer has to deal with.
So, here’s one for you to figure out: I have already solved it, but will be intrigued if someone comes up with a better version than mine.
BASIC syntax includes the IF ... THEN construct, e.g. IF X > 5 THEN GOTO 150
Now, for BINSIC, the BASIC-like domain-specific language I am building using Groovy, I have to parse these structures, putting brackets round the IF clause and so on. So, I have to be able to pull out the conditional. But BASIC might also have code like this: IF X > 5 THEN IF Y < 10 THEN GOTO 150 (NB: I know this can be replicated with a single conditional using boolean operators, but that’s not the point: the language allows THEN to be followed by any other valid BASIC statement, and that must include another IF ... THEN clause.)
So, what regex would you use to pick out the first conditional statement but not any subsequent statement (these can be passed recursively into the parser)?
You can cheat by looking at the BINSIC code on GitHub, but what would be the point? I’ll post an/my answer sometime later this weekend…
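In the meantime, here is one candidate to argue with. It is a sketch in Python rather than Groovy, and it is not the BINSIC solution. The name `split_if` is hypothetical, and I am assuming upper-case keywords separated from the rest of the line by whitespace. The idea is a reluctant (lazy) quantifier, so that the condition stops at the first THEN instead of the last.

```python
import re

# Lazy (.+?) makes the condition end at the FIRST " THEN ";
# the greedy (.*) then takes everything that follows.
# Assumption: keywords are upper case and surrounded by whitespace.
IF_THEN = re.compile(r"^\s*IF\s+(.+?)\s+THEN\s+(.*)$")


def split_if(line):
    """Return (condition, remainder) for an IF ... THEN line, else None."""
    match = IF_THEN.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


nested = split_if("IF X > 5 THEN IF Y < 10 THEN GOTO 150")
simple = split_if("IF X > 5 THEN GOTO 150")
not_an_if = split_if("PRINT X")
print(nested)
print(simple)
print(not_an_if)
```

The remainder can be handed back to the parser recursively, which deals with the nested IF. A known weakness: a string literal in the condition that contains THEN with whitespace on both sides would split the line in the wrong place, so a regex alone may not be enough for every valid BASIC line.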