The joy of .CXX

I have a problem.

The rise of the C guru. (LOL =])
The rise of the C guru. (LOL =]) (Photo credit: Dzhus)

I need to parse 220GB of XML to find an OPT reference string (i.e. to profile the memory references of a running application to create a page replacement algorithm that will know which is the most efficient page replacement decision.)

I had one algorithm in mind for this and I wrote the Groovy code but it kept exhausting the memory pool of the VM (as it generated a lot of anonymous memory). Even whacking the VM up in size quite considerably didn’t seem to fix the problem.

So I used another algorithm – and it works. After a fashion. Because even with 18 threads running in parallel on the University of York’s big iron compute server I think it might take around two years to complete. And I don’t have two years.

So I need to find something better – something where I have much more control over the memory allocation code and where I can also be sure of performance.

C is the obvious answer. Or, in this case, C with an interface to some C++ code I wrote a few years ago to build a red-black tree.

The strange thing is that, despite Groovy’s claim that it removes the really tedious bits of code writing by doing away with boiler plate, this weekend I have found writing all that getter/setter code quite therapeutic: perhaps this is a programmer’s comfort blanket?

Enhanced by Zemanta

3 responses to “The joy of .CXX”

  1. Choice of language is rarely the bottleneck for a problem like this…. Hard drive access is. Random access of any type on a file of this size is a performance killer. *Avoid random access* of the data like the plague…. It could add days or weeks to the time! (Just compare streaming speeds of continuous data from a file to the time it takes to move the physical head of the hard drive…. Although a few head movements are impossible for a human to notice, calculate how long a 100 million moves would take).

    Many unix tools were made for this type of problem (perl, grep, awk, etc). Not knowing more of what you are looking for, I can’t be exact in my answer, but what I can say definitely for sure- stream the data through and do your calculations line by line. Often, I add a perl line to split elements of interest one per line (ie- perl -pe “s/<anElement/\n /dev/null

    (or if you need to sort, add it)

    For a file that size, expect sizes up to about half an hour…. If you write your code properly, it won’t add much to this (perhaps as much as a doubling, perhaps it will be negligable).

    If you add a sort, it could take up to an hour or two…. Still not so bad.

    (The unix command “sort” avoids random access quite well itself by streaming the data and sorting in memory sized chunks, then doing a merge sort on the intermediate sorts).

  2. Thanks for the comment. But the issue isn’t essentially one about I/O (though the fact that the alternative is I/O bound is what makes it so slow) but that I have so little control over memory in a VM based language like Groovy (or Java).
    I have a perfectly good algorithm but it fails for memory reasons.
    A sort would be of minimal benefit.

%d bloggers like this: