Traffic generation options

English: Read Only Memory
English: Read Only Memory (Photo credit: Wikipedia)

This is a blog post where I am going to write about things as a way of clarifying, in my own mind, what the best way of tackling a problem is.

So far, in research for my PhD, I have concentrated on establishing some base points for potential performance of Network-on-Chip systems running multithreaded code.

Nearly nine months ago I hacked at Valgrind‘s Lackey tool to ensure it produced XML output recording every memory reference made by a piece of code running under it. This was really basic stuff – Lackey recognises four primatives – code for code-in-execution, and load, store and modify (a combined load and store) for read-write memory. So typically you get blocks of code followed by some read-write records and then some more code. I don’t know what the operands are, just the primative type, the address and the size of the piece of memory being used (whether for code or read-write operations).

I then used that to record the output of one of the Parsec parallel benchmarks – a 16 thread (actually it executes 18 threads) piece of video transcoding. In the real world this ran in seconds, under Lackey it took a few days and output about 200GB of XML.

That XML can then be broken down into thread-specific strands of execution – 18 of these in all, of various sizes, but all of the order of several GB at least.

These are then plugged in to some sort of simulator. The basic hardware model being simulated has remained the same throughout (mostly, I did fiddle with it a bit a while back but decided that wasn’t worth it). So we have 16 cores sharing a flat 512kB memory space (this is very loosely based on the Parallella system, but it is not meant to be any sort of simulation of it). There is no caching and no sense that any part of the memory is further from one core than another.

What does alter is the page replacement algorithm used. I first tried FIFO and the code ran for many weeks and completed in about 60 billion simulated ticks – if a memory reference is to a page in the 512kB then it is deemed to take 1 tick to complete, if the reference is to a page (a 4k page size has been used thus far), it takes 100 ticks per 16 byte line to load (25600 ticks for a whole 4k page) – and plainly we have to decide what page gets evicted if our 512k store is already full.

Messing about with various LRU models showed that a two queue LRU did give a little better performance than a simple single LRU queue, and that completed in around 50 billion ticks (and two weeks or so of running).

I then built – more or less starting from scratch – a version of the simulator that modelled Belady’s OPT. That required some very large binary trees to be used – along with 250GB of RAM – and completed the code in about 22 billion ticks (and about three weeks in wall clock time).

All these models showed one piece of common behaviour – thrashing, as demonstrated by the fact that adding additional cores to the execution did not increase the amount of code being executed: instead each individual core had to handle more page faults as the cores competed for the small pool of memory.

I now have two pieces of code running which aim to measure the (in)efficiency of these “traditional” paging approaches – come back in a few weeks to see what they show.

So, while they run I need to get on to the next stage, which is testing some alternative approaches. But I have  a problem – I cannot wait three weeks for each experiment to run. There simply is not any time for that.

The alternatives boil down to chopping up sections of my current XML from the benchmark, or writing a traffic generator.

The traffic generator idea has a lot to be said for it – my supervisor certainly is in favour – but it is not without weaknesses: the degree of locality between the different threads executing is really quite important – a lot of locality and the fault count falls and code gets executed fast – poor locality and the fault count rockets.

But how do I model that: I am not sure there is any literature out there that discusses this problem in much detail – multithreading is hard and for that reason rational people avoid it!

But using the chopped up XML is not without issues either – it’s inflexible, it elevates one instance of executing code to be a model for all executing code and so is just as vulnerable to the question of locality.

How slow is a fast computer?

Valgrind (Photo credit: Wikipedia)

I am working on a simulation environment for a NoC. The idea is that we can take a memory reference string – generated by Valgrind – and then test how long it will take to execute with different memory models for the NoC. It’s at an early stage and still quite crude.

The data set it is working with is of the order of 200GB, though that covers 18 threads of execution and so, very roughly speaking, it is 18 data sets of just over 10GB each. I have written some parallel Groovy/Java code to handle it and the code seems to work, though there is a lot of work to be done.

I am running it on the University of York’s compute server – a beast with 32 cores and a lot of memory. But it is slow, slow, slow. My current estimate is that it would take about 10 weeks to crunch a whole dataset. The code is slow because we have to synchronise the threads to model the inherent parallelism of the NoC. The whole thing is a demonstration – with a vengeance – of Amdahl’s Law.

Even in as long a project as a PhD I don’t have 10 weeks per dataset going free, so this is a problem!

And in XML…

<!DOCTYPE lackeyml [
<!ELEMENT lackeyml (application?, (thread)*)>;
<!ATTLIST lackeyml version CDATA #FIXED "0.2">
<!ATTLIST lackeyml xmlns CDATA #FIXED "\">
<!ELEMENT application EMPTY>
<!ATTLIST application command CDATA #REQUIRED>
<!ELEMENT thread (instruction|store|load|modify)* >
<!ELEMENT instruction EMPTY>
<!ATTLIST instruction address CDATA #REQUIRED>
<!ATTLIST instruction size CDATA #REQUIRED>
<!ATTLIST modify address CDATA #REQUIRED>
<!ATTLIST store address CDATA #REQUIRED>

Incidentally, if you don’t know what any of this means, you do not need to worry. But if you are really interested then it is a DTD (Document Type Definition) for the XMLised output of Valgrind’s Lackey tool, as patched by me to also output thread (Posix pthread) data. The original – version 0.1 – DTD (which does not account for thread data) can be found in the lackey_xml git repository.

New lackeyml DTD

I cannot imagine that this matters to anyone other than me right now, but for the record (this is taken from a hacked up version of Valgrind‘s Lackey tool):

        VG_(printf)("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        VG_(printf)("<!DOCTYPE lackeyml [\n");
        VG_(printf)("<!ELEMENT lackeyml (application?, (thread)*)>\n");
        VG_(printf)("<!ATTLIST lackeyml version CDATA #FIXED \"0.2\">\n");
        VG_(printf)("<!ATTLIST lackeyml xmlns CDATA #FIXED");
        VG_(printf)(" \"\">\n");
        VG_(printf)("<!ELEMENT application EMPTY>\n");
        VG_(printf)("<!ATTLIST application command CDATA #REQUIRED>\n");
        VG_(printf)("<!ELEMENT thread (instruction|store|load|modify)* >\n");
        VG_(printf)("<!ATTLIST thread tid CDATA #REQUIRED>\n");
        VG_(printf)("<!ELEMENT instruction EMPTY>\n");
        VG_(printf)("<!ATTLIST instruction address CDATA #REQUIRED>\n");
        VG_(printf)("<!ATTLIST instruction size CDATA #REQUIRED>\n");
        VG_(printf)("<!ELEMENT modify EMPTY>\n");
        VG_(printf)("<!ATTLIST modify address CDATA #REQUIRED>\n");
        VG_(printf)("<!ATTLIST modify size CDATA #REQUIRED>\n");
        VG_(printf)("<!ELEMENT store EMPTY>\n");
        VG_(printf)("<!ATTLIST store address CDATA #REQUIRED>\n");
        VG_(printf)("<!ATTLIST store size CDATA #REQUIRED>\n");
        VG_(printf)("<!ELEMENT load EMPTY>\n");
        VG_(printf)("<!ATTLIST load address CDATA #REQUIRED>\n");
        VG_(printf)("<!ATTLIST load size CDATA #REQUIRED>\n");
        VG_(printf)(" xmlns=\"\">\n");

Update: As originally published this code had an error in the name of the attribute for thread – I have fixed that now.

Using XSLT to manipulate an SVG file

I am generating a lot of the graphics for my project using Scalable Vector Graphics (SVG) – an XML format.

The advantages of SVG are obvious – it is human readable, it preserves some of the data in the output (eg in the relative placing of the dots on a graph), Groovy has good support for it and, in theory at least XML Stylesheet Transformations (XSLT) means that it can be manipulated outside of a formal programming environment using another bit of XML (an XSL stylesheet) and freely available XSLT tools – eg xsltproc on Linux.

But XSLT is one of those Cinderellas of the computing world – widely relied on but not generally understood or widely discussed. As a result I have struggled over the last 24 hours to work out how to do what ought to be a simple thing: removing a set of dots of one colour from an SVG. I still have not got to the bottom of all the issues – specifically xltproc seems to have an issue with XML namespace declarations and/or DTD declarations – but I have fixed it for all practical purposes, so thought I should document it for future users…

Before I describe the problem more fully as well as the solution, I have to recommend this book – XSLT: Mastering XML Transformations. For Beginners and Advanced Users – which gave me the pointers to a solution I just could not find online.

So here is the graph (in PNG format because WordPress does not support SVG) I want to manipulate.

Firefox memory mapThis is memory map – via Valgrind – of Firefox (7) opening and closing (I created a page that, once Firefox was configured, would close the browser.)

The different types of memory accesses – for instructions (red), loads (blue), stores (yellow) and modifies (green) are all on superimposed and which colour you see depends on the order they are written if they access the same page.

So the answer, with an SVG, is obvious, just look at the colour you want to see.

Ought to be easy, right? Well, I suppose when you know how to do it, it is easy. But it’s not completely simple!

Each dot on the graph is drawn with an XML element like this:

<circle cx='102' cy='699' r='1' fill='none' stroke='red' stroke-width='1' />

The trick is to create a template for all the elements but inside that template only copy out the elements that match the correct stroke attribute. (Attributes are not children of node elements either, so you have to copy them out explicitly.)

Here’s a stylesheet that does it:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" indent="yes"
<xsl:param name="colour">yellow</xsl:param>
<xsl:template match="/">
	<xsl:apply-templates select="svg"/>
<xsl:template match="svg">
		<xsl:for-each select="@*">
		<xsl:apply-templates select="rect"/>
		<xsl:apply-templates select="line"/>
		<xsl:apply-templates select="text"/>
		<xsl:apply-templates select="circle"/>
<xsl:template match="line">
		<xsl:for-each select="@*">
<xsl:template match="rect">
		<xsl:for-each select="@*">
<xsl:template match="text">
		<xsl:for-each select="@*|node()">
<xsl:template match="circle">
	<xsl:if test="@stroke=$colour">
			<xsl:for-each select="@*">

(This code be shortened to just copy-of the line, rect and text elements.)

And here’s an example:

xsltproc  -o inst.svg --stringparam colour "red" pickcolour.xsl master.svg

Instructions executed by Firefox 7

Writing more code to avoid writing any of the report?

The C Programming Language
Image by mrbill via Flickr

I have managed to churn out 160 lines of working C today – which I think is quite good going, though, according to this, maybe I could have churned out 400 of C++ or even 960 of Perl (I love Perl but the mind boggles).

My program will tell you how pages pages are present or how many have been swapped (it has the ability to do a bit more too, but I have not exploited that even though it is essentially there) – you can fetch it from github here: valext (the name reflects the idea that this was going to be an extension to Valgrind but then I discovered/realised that was never going to work).

Anyway, what I now have to face up to is whether I am churning out code to avoid actually writing any of my MSc project report – I have a feeling that some of that is going on and think I need to start rereading Writing for Computer Science – which helped me greatly with the proposal.

Patterns of locality in xterm

So, I fixed the problem I was having with ranges by reducing the number of times the range was being accessed by something like 1000 orders of magnitude by pushing the calculation to much later on in the program (ie instead of checking the range on every calculation, check it only at the end when some other filtering has already happened).

And so I can now show the world a graph that illustrates how xterm‘s initialisation does indeed show different phases of locality – ie that in a small enough time frame spacial locality is shown in memory accesses, but over a bigger time space that locality will shift – this is important because it indicates that a good virtual memory manager would be able to flush the pages from the old locality out of the system:

locality in xterm memory accessesPlainly I need to do a bit more work on the graphing software – but the yellow lines are the axis and the green dots indicate where the memory is being accessed – time (or rather its analogue in terms of instructions executed) runs from left to right and (process) memory address gets greater as you move from the bottom up.

In fact this graph is only looking at the bottom 10% of the range of memory accessed by the application (and we are ignoring the instructions in favour of explicit heap loads, stores and modifications) – but the problem is that for most of that range nothing is accessed – again I need to adjust the Groovy script to make it better.

None of this is new – this experiment was first conducted in the early 1970s – though I am not aware of anyone having done it for Linux – but I suspect they have.