I now have some code that is meant to parse an XML file of approximately 5 billion lines.
Unfortunately it fails, every time (it seems), on line 4,295,025,275.
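This is something of a nightmare to debug, but it looks like an overflow bug of some sort in the xerces-c parser: the failing line is only just past 2^32 (4,294,967,296), the point at which an unsigned 32-bit counter wraps.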
Do I try to find it by inspection or by repeating the runs (it takes about 4–5 hours to get to the bug point)?
Inspection is probably quite difficult but simple to organise. Repeating the runs is (relatively) easier, since I would just step through the code, but is perhaps impossible to organise: how many weeks of wall-clock time in a debugger before we get to that line?
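The arithmetic behind the overflow suspicion is easy to check. The sketch below only compares the reported line number with the unsigned 32-bit limit. It says nothing about which counter inside the parser might be overflowing, or whether it is a line counter at all.

```python
# Failing line number as reported above.
FAIL_LINE = 4_295_025_275

# Number of distinct values an unsigned 32-bit integer can hold.
LIMIT_32 = 2 ** 32

# How far past the 32-bit limit the failure occurs.
overshoot = FAIL_LINE - LIMIT_32

# What an unsigned 32-bit counter would hold after counting FAIL_LINE items
# (it wraps around to zero once it passes LIMIT_32 - 1).
wrapped = FAIL_LINE & 0xFFFFFFFF

print(f"2^32      = {LIMIT_32:,}")
print(f"overshoot = {overshoot:,}")
print(f"wrapped   = {wrapped:,}")
```

The failure lands fewer than 60,000 lines past the limit, in a file of roughly 5 billion lines, which is at least consistent with something 32-bit wrapping shortly beforehand.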
3 responses to “Going on a bug(?) hunt”
Can you “quickly” write a script to delete a designated number of lines above the failure point from the XML file (subject to the restriction that syntactically complete units are deleted, leaving a valid but smaller file behind)? That would enable a bisection search to find an indigestible chunk, should that be a problem.
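A rough sketch of such a trimming script, in Python rather than sed. It assumes, purely for illustration, that the file is a long run of elements with a known tag (the name `record` is hypothetical), that each one starts on its own line, and that they are not nested inside one another. The real file may well be laid out differently, in which case the matching logic would need to change.

```python
from typing import Iterable, Iterator

# Characters that may legitimately follow the tag name in an opening tag.
_TAG_END = (" ", ">", "/", "\t", "\n", "")


def drop_records(lines: Iterable[str], tag: str, skip: int) -> Iterator[str]:
    """Copy lines through, omitting the first `skip` complete <tag> elements.

    Only whole elements are removed, so the output stays well-formed
    provided the input matches the layout assumptions described above.
    """
    open_prefix = f"<{tag}"
    close_marker = f"</{tag}>"
    dropped = 0
    inside_dropped = False

    for line in lines:
        if inside_dropped:
            # Still inside an element being removed: discard until it closes.
            if close_marker in line:
                inside_dropped = False
                dropped += 1
            continue

        stripped = line.strip()
        next_char = stripped[len(open_prefix):len(open_prefix) + 1]
        starts_record = stripped.startswith(open_prefix) and next_char in _TAG_END

        if dropped < skip and starts_record:
            if close_marker in line or stripped.endswith("/>"):
                dropped += 1           # element opens and closes on one line
            else:
                inside_dropped = True  # multi-line element: skip to its close
            continue

        yield line
```

Because it streams line by line, it can be fed an open file and written straight to another one, for example `out.writelines(drop_records(src, "record", 1_000_000))` with `src` and `out` as open file objects. For a bisection search, vary `skip` between runs.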
Is xerces-c open source? If so, you could add code to monitor memory use.
I am certain there is nothing wrong with the XML it is parsing … though I suppose it would not hurt to do as you suggest – a few lines of sed should do it. Xerces-c is open source (Apache licence) so I can instrument it in some way, but it might take weeks for the instrumented code to get to line 4.3 billion!
If it is an overflow bug, hacking out some of the earlier lines should cause the same fault to occur deeper into the XML file. That would at least confirm your suspicion.
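The two hypotheses make different predictions about where the trimmed file should fail, which is what makes the experiment informative. A minimal sketch follows, assuming that every removed line sits before the original failure point, and that an overflow would depend on the number of lines consumed. If the overflowing quantity is something else, such as a byte offset, the fault would still move deeper into the content but not by exactly this amount. The function name and hypothesis labels are illustrative only.

```python
def predicted_failure_line(original_fail: int, removed: int, hypothesis: str) -> int:
    """Line number, counted in the TRIMMED file, where the failure is expected."""
    if hypothesis == "overflow":
        # The fault depends on how much has been read, not on what was read:
        # same line count in the new file, so the content is `removed` lines deeper.
        return original_fail
    if hypothesis == "bad_data":
        # The fault is tied to a specific piece of content, which has moved
        # `removed` lines closer to the start of the file.
        return original_fail - removed
    raise ValueError(f"unknown hypothesis: {hypothesis!r}")


ORIGINAL_FAIL = 4_295_025_275   # failing line reported in the post
REMOVED = 1_000_000             # arbitrary example trim size

for name in ("overflow", "bad_data"):
    print(name, f"{predicted_failure_line(ORIGINAL_FAIL, REMOVED, name):,}")
```

If the trimmed run fails at the same line number as before, that points towards an overflow. If it fails exactly `REMOVED` lines earlier, the content is the more likely culprit.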
As a faster (?) alternative to instrumenting, there’s the old school approach — hack xerces-c to print progress reports (“entered routine X”, “exited routine X”, “pausing to gasp for wind”, …) every so often. You can at least run that at standard speed and look at the tail of the resulting log file.
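The shape of that kind of throttled progress reporting can be sketched as below. This is a language-neutral illustration written in Python. The real change would be made in C++ inside the xerces-c sources, and the class and method names here are invented for the example. Reporting only every N-th event keeps the overhead and the log size manageable over billions of lines, and a small ring buffer stands in for looking at the tail of the log.

```python
from collections import deque


class ProgressLog:
    """Record every `every`-th event, keeping only the most recent `keep` reports."""

    def __init__(self, every: int, keep: int = 5) -> None:
        self.every = every
        self.count = 0
        # A bounded deque discards the oldest report once it is full,
        # much like reading only the tail of a log file.
        self.tail = deque(maxlen=keep)

    def event(self, label: str) -> None:
        self.count += 1
        if self.count % self.every == 0:
            self.tail.append(f"{self.count}: {label}")


log = ProgressLog(every=3, keep=2)
for i in range(10):
    log.event(f"entered routine {i}")

for report in log.tail:
    print(report)
```

In a real run the interval would be far larger, perhaps one report per million lines, so that the last report before the crash narrows down where to look without slowing the parse noticeably.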
I feel for you on this. Ran into a problem once that I think turned out to be the symbol table of a commercial FORTRAN compiler overflowing. Royal PITA to diagnose, and that was on a small program.