So, dear reader, I wrote a Gauss-Jordan solver

Number of steps in the Euclidean algorithm for...
Number of steps in the Euclidean algorithm for GCD(x,y). Red points indicate relatively few steps (quick), whereas yellow, green and blue points indicate successively more steps (slow). The largest blue area follows the line y = Φx, where Φ represents the Golden ratio. (Photo credit: Wikipedia)

Been a while…

…Anyway: I am now designing (PhD related) a simulation of a NoC system and I need to have it run some code – so I thought I’d look at code to solve a system of linear equations and sought to build the same using Gauss-Jordan elimination.

It proved to be a lot more difficult than I had hoped: not that the maths was too complex, but that producing something I could later simplify from C++ into something that looked like synthetic assembly has and is proving very difficult indeed.

My system needs to execute a simple synthetic assembly because what I am modelling is the way the memory interactions work – so I have to get close to the metal and not hide behind a high-level language (no, not even C will do). Obviously – or at least it feels like that to me – I don’t want to have to mess about with floating point implementations, so I wanted integer arithmetic.

And that is where the first problem comes – to implement a Gauss-Jordan solver (or even to solve linear equations in the way you used to at school: the Gauss-Jordan algorithm is merely a scaling up of that), you have to divide numbers and sadly (for this purpose) not all numbers are multiples of one another – so the only way to preserve integer maths is to use rationals.

That, of course, is not such a problem. The C++ pair utility even makes it easy to model a rational out of two integers (writing a separate rational class would be overkill for something as small as this.)

But then, quickly, another problem becomes apparent. Even when using long long types to represent the numerator and denominator of our rational fraction repeated multiplications to enable fractional subtraction breaks the integer bounds after little more than a handful of row operations (and I want to model a large system – with literally thousands of equations in each system).

Even when adding a repeatedly applied gcd function to apply Euclid’s algorithm to the rational fractions, the numbers go out of range quickly.

So I have been forced to rely on an arbitrary precision library – GNU’s GMP. The code now works well (you can see at its Github repro and the solver code is below:

#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <vector>
#include <cstdlib>
#include <utility>
#include <gmpxx.h>

//Gauss-Jordan elimination

using namespace std;

void gcd(pair<mpz_class, mpz_class>& input)
	mpz_class first = abs(input.first);
	mpz_class second = abs(input.second);
	if (first > second) {
		first = second;
		second = abs(input.first);
	while (second != 0) {
		mpz_class temp = second;
		second = first%second;
		first = temp;
	input.first /= first;
	input.second /= first;
	if (input.second < 0) {
		input.first *= -1;
		input.second *= -1;

int main()
	string path("./variables.csv");
	ifstream inputFile(path);

	vector<mpz_class> answers;
	//answer line
	string rawAnswer;
	getline(inputFile, rawAnswer);
	istringstream stringy(rawAnswer);
	string number;
	while(getline(stringy, number, ',')) {

	//now read in the system
	vector<vector<pair<mpz_class, mpz_class> > > lines;
	while(getline(inputFile, rawAnswer)) {
		istringstream stringy(rawAnswer);
		vector<pair<mpz_class, mpz_class> > innerLine;
		while (getline(stringy, number, ',')) {
			pair<mpz_class, mpz_class>
				addPair(atol(number.c_str()), 1);

	for (int i = 0; i < lines.size(); i++) {
		pair<mpz_class, mpz_class> pivot(lines[i][i].second,
		if (lines[i][i].first != 0) {
			lines[i][i].first = 1;
			lines[i][i].second = 1;
		} else {
		for (int j = i + 1; j <= lines.size(); j++) {
			lines[i][j].first *= pivot.first;
			lines[i][j].second *= pivot.second;
		for (int j = i + 1; j < lines.size(); j++) {
			pair<mpz_class, mpz_class> multiple(lines[j][i].first,
			lines[j][i] = pair<mpz_class, mpz_class>(0, 1);
			for (int k = i + 1; k <= lines.size(); k++) {
				pair<mpz_class, mpz_class>
					multiple.first * lines[i][k].first,
					multiple.second * lines[i][k].second);
				lines[j][k] = pair<mpz_class, mpz_class>
					(lines[j][k].first * factor.second -
					factor.first * lines[j][k].second,
					lines[j][k].second * factor.second);
	// now eliminate upper half
	for (int i = lines.size() - 1; i > 0; i--) {
		if (lines[i][i].first == 0) {
		pair<mpz_class, mpz_class> answer = lines[i][lines.size()];
		for (int j = i - 1; j >= 0; j--) {
			pair<mpz_class, mpz_class> multiple = lines[j][i];
			if (multiple.first == 0) {
			lines[j][i] = pair<mpz_class, mpz_class>(0, 1);
			lines[j][lines.size()].first =
				lines[j][lines.size()].first * multiple.second
				- answer.first * multiple.first * 
			lines[j][lines.size()].second =
				lines[j][lines.size()].second *
				answer.second * multiple.second;
	cout << "DIAGONAL FORM" << endl;
	for (int i = 0; i < lines.size(); i++) {
		for (int j = 0; j < lines.size(); j++) {
			if (lines[i][j].first == 0) {
				cout << "0 , ";
			} else {
				cout << lines[i][j].first << "/" << lines[i][j].second << " , ";
		cout << " == " << lines[i][lines.size()].first << " / " << lines[i][lines.size()].second << endl;


(NB: I know that GMP supports rationals “out of the box” but I am trying to model something that I can translate to my synthetic assembly and so hiding the details of the implementation is not what I want to do.)

How on Earth, though, am I going to model arbitrary precision code in my “assembly” code?


Die of an Intel 80486DX2 microprocessor (actua...
Die of an Intel 80486DX2 microprocessor (actual size: 12×6.75 mm) in its packaging. (Photo credit: Wikipedia)

Been a while since I’ve written here – been avoiding writing about politics, which has obviously not been so great for me in the last couple of weeks… but now I have something else to ruminate on.

I have reached a milestone, or perhaps basecamp, in my PhD research: having a model for memory management that needs further exploration. (Hopefully there may even be a paper on this soon.)

Some of that exploration will have to be in hardware, and that’s really beyond me but I can and should build a software model to test how a many core system built using this new model might operate.

So far I have been testing or proving concepts with OVPSim but it does not allow me to build a true asynchronous multi-core model, so I need to do that in software myself.

But where to begin – I have a list of classes that you might want to have in C++:

  • Computer – which would aggregate…
    • DRAM
    • Storage
    • NoC – which would aggregate…
      • Mesh
      • Tiles – which would aggregate…
        • CPU
        • Cache
        • Ports (to the Mesh)

I hope you can see how quickly this becomes complex – and all we are talking about here is a simple software framework to allow me to do the research (ie., delivering the software, complex as it is, is only the very start.)

I am struggling to know where to begin – C++ seems like the logical choice for this, but it’s not proving to be much fun. Particularly because my CPU class has to be able to “execute” some code – I thought about using a DSL but may stick to the XML output I got from my hacked Valgrind Lackey – as at least I can then use existing XML engines.

Should I build from the XML up – eg., get a CPU class that can hack the XML and pass the requests up the chain (eg via the cache up to the Mesh and up to the DRAM etc), or what?

Pointers versus references

English: Image for teaching pointers in comput...
English: Image for teaching pointers in computer programming Česky: Obrázek pro výuku ukazatelů v programování (Photo credit: Wikipedia)

Some people don’t like pointers – and for that reason, I think, we have references in C++. But as a confirmed pointer person, I find references very hard going.

I had a piece of C++ code that did this:

PartialPage& DoubleTree::oldestPage()
	PartialPage& pageToKill = pageTree.begin()->second);
	long timeToKill = pageTree.begin()->second.getTime();
	map<long, PartialPage&>::iterator itOld;
	for (itOld = pageTree.begin(); itOld != pageTree.end(); itOld++) {
		if (itOld->second.getTime() < timeToKill) {
			timeToKill = itOld->second.getTime();
			pageToKill = itOld->second;
	return pageToKill;

This produced rubbish results – because re-assigning the reference didn’t make it refer to a new element of the map. Essentially you cannot mutate a reference in C++ at all.

Switching to pointers fixed the problem though.

PartialPage* DoubleTree::oldestPage()
	PartialPage* pageToKill = &(pageTree.begin()->second);
	long timeToKill = pageTree.begin()->second.getTime();
	map<long, PartialPage>::iterator itOld;
	for (itOld = pageTree.begin(); itOld != pageTree.end(); itOld++) {
		if (itOld->second.getTime() < timeToKill) {
			timeToKill = itOld->second.getTime();
			pageToKill = &(itOld->second);
	return pageToKill;


Racey code can damage your (mental) health

Running the Hackney half marathonI’ve had a tough week. Ran a half marathon (quite badly, but I finished – see the picture) on Sunday, hobbled around York University (blister on my foot) on Monday before returning to London and coping with the added stress of a new job.

All the while the failure of my latest piece of code to work was annoying me too – eating away at me and making me wonder if the whole PhD thing was a bit of Quixotic nonsense or even Panzan stupidity.

Anyone who has chased down a bug for days will know the feeling – the root cause must be out there somewhere, even if you have been through your code dozens of times and can see nothing wrong.

Sitting on a long commute to my place of (temporary) work this morning I realised that my problem could only be down to one thing – a race condition.

I am emulating an LRU/2 type page replacement policy – with two page queues – a “high” and a “low”: pages are initially assigned to low, but can be pushed into high on a second reference. Pages only leave via low (they can be pushed out of high into low if we run out of space in high and so on).

With one thread running there was no problem, but when a second thread came on board the application fell over – and I realised it was because the manipulation of the LRU queues was not atomic.

And, a few dozen code changes later (done on the reverse journey back to London) – and some accounting for C++ template strangeness (it seems that an iterator across a map, means the map is not const – that took me a while to work out) and the problem has gone away and suddenly academic endeavour feels worth it again.

C, C, glorious C

gdb icon, created for the Open Icon Library
gdb icon, created for the Open Icon Library (Photo credit: Wikipedia)

This blog – The Unreasonable Effectiveness of C – is really very good, and makes a lot of points that are very difficult to argue with.

Though, lately, I have (re-)discovered the joys of C++ – even if I do write code in that language like a C programmer.

In the last six months I have written a lot of code designed to process a lot of data. I started off with Groovy – my data was XML and Groovy comes with tools that are specifically designed to make a programmer’s life a lot easier when handling XML.

But, in the end, Groovy just was not up to the task – because for my needs: processing 200 GB of XML, Groovy (or, rather, the Java VM on which it runs) is just not fast or flexible enough.

My response to that was to go “route 1” and reach for C. I know and knew I could build what I wanted in C and that it would be as fast as I could reasonably hope to get it (going down to Assembly was just not a serious option).

However, the code I am actually using is in C++. I found that the abstractions offered by the STL were worth the price in terms of speed of execution and code that is somewhat opaque to the GDB debugger.

It’s a compromise, of course, but I suspect if I had tried to write the equivalent of the STL map and set classes in C, I’d still be debugging my implementation many weeks later – after all I found that my C++ red-black tree code was actually broken despite using it (successfully) for a number of years.

Real programmers have to make these compromises – and C++ struck the right note between language expressiveness and language power for me.

Enhanced by Zemanta

Passing by reference in C++ – a downside

Code (Photo credit: Riebart)

Once you realise what is happening this is obvious, but it took me a while…

I wanted to write some longs out as character strings in C++, so wrote some code like this:

void writeLongToFile(ofstream& xmlFile, long& value)
stringstream stringy;
stringy << value;
xmlFile << stringy.rdbuf();

But compile time error after compile timer error followed as I was told that I could not use this code with unsigned long or with const long and so on … but surely these are simple conversions I thought…

…Just me thinking like a C programmer again. Passing by reference is just that – there is no copying or conversion available and so long and unsigned long are fundamentally incompatible.

The solution? Just pass by value – afterall the saving in passing a more or less primative type like long by reference must be pretty minimal – passing the stream by reference is the real saving here (actually, it’s also practical – as we want the output to go to the real stream and not a copy: in C we’d use a pointer here).

Enhanced by Zemanta

Which bit of a 4k page does your program use?

This follows on from the previous post – here are the plots.

These are based on a run of the PARSEC benchmark suite x264 program – encoding 32 frames of video using 18 threads – 16 worker threads: the plots show how often each 16 byte “line” is used – whether as an instruction is read or the memory is used for read-write storage. Sixteen bytes is both the size of a typical cache line and a read from a DDR memory.

all code memonly


The code plot might suggest there is some pattern – between about segments 100 (offset 0x640 inside the page) and 200 (offset 0xC80) there is an increased hit rate) but my guess is that is an artefact of the particular code being used here, rather a general issue (possible explanations would be a particular library function being very heavily used): though conceivably it could be an issue with the GCC compiler or linker.

That might be worth further exploration, but not for me. From visual inspection I am concluding that the distribution of accesses inside a 4k page doesn’t justify trying to pre-cache particular 16 byte “lines”.

Enhanced by Zemanta