Now I work as a software engineer and, because of the covid-19 lockdown, I am spending even more time in front of computers at home, so I thought that – finally – this was the moment to make proper use of the NSLU2 NAS device I bought more than a decade ago.
But it turns out that it supports only a now-dead version of Samba/CIFS, so isn’t much use as-is. So, I thought, flash the thing and use one of the various FOSS firmware options – but these seem to be either dead or close-to-dead these days too.
So is there any way I can get decent use out of this device?
This is a post about my PhD research: in fact it is a sort of public rumination, an attempt to clarify my thoughts in writing before I take the next step.
It’s also possibly an exercise in procrastination: a decision to write about what I might do next, rather than to get on with doing it, but I am going to suppress that thought for now.
I am looking for ways to make “network on chip” systems more viable as general use (or any use, I suppose) computing platforms. These systems are a hardware response to the hardware problem that is causing such difficulties for big software and hardware manufacturers alike: namely that we cannot seem to make faster computers any more.
The problem we have is that while we can still get more transistors on a chip (i.e., “Moore’s Law” still applies), we cannot keep operating them at faster speeds (i.e., “Dennard Scaling” has broken down) as they get too hot.
In response we can either build better small devices (mobile phones, tablets) or try to build faster parallel computing devices (so instead of one very fast chip we have several moderately fast chips and try to have better software that makes good use of their ability to compute things in parallel).
Network-on-chip (NoC) processors are a major step along the road of having parallel processors – we put more processing units on a single piece of silicon rather than have them co-operate via external hardware. But the software has not caught up and we just cannot keep these chips busy enough to get the benefit their parallelism might offer.
That is where I hope to make a difference, even if just at the margins. Can I find a way to make the NoC chips busier, specifically by keeping them fed with data and code from the computer memory fast enough?
I have tried the obvious and simple methods – essentially adaptations of methods that have been used for most of the last 50 years in conventional serial computing devices – and the answer is ‘no’, if that is all that is on offer.
Messing about with the standard algorithms used to feed code and data to the processors hits a solid brick wall: the chips have a limited amount of ‘fast’ local memory and the time it takes to keep that refreshed with up-to-date code and data places a fundamental limit on performance.
Many computer science students will be familiar with “Amdahl’s Law”, which stipulates that, for parallel code, the elements that have to be run in serial (even if just setting up the parallel section) place a fundamental limit on how much extra performance we can squeeze out by throwing more and more parallel processors at the problem. Here we have a different, if related, problem: we can apply more and more parallel processors, but performance stays constant, because even though we are running parallel code we are limited by memory performance.
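To make Amdahl’s Law concrete, here is a minimal sketch of the formula (the function name is mine, purely for illustration):

```python
def amdahl_speedup(p, n):
    """Amdahl's Law: the speedup from n processors when a fraction p
    of the work can be parallelised and the rest (1 - p) stays serial."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the code parallelised, 1024 processors deliver
# less than a 20x speedup - the serial 5% dominates.
print(amdahl_speedup(0.95, 1024))  # ~19.6
```

The point for NoCs is that the ceiling here comes from serial code; the ceiling I am describing comes from memory bandwidth, and it bites even when the code itself parallelises well.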
This limit – which implies that as we use more processors they become individually less efficient – even hits the so-called “clairvoyant” or optimal (OPT) memory management/page replacement algorithm: OPT knows which memory page it is most efficient to replace but is still limited by the fundamental barrier of limited on-chip memory.
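For readers who have not met it, OPT (Belady’s algorithm) is the unrealisable benchmark: on a fault it evicts the page whose next use lies furthest in the future, which requires knowing the future. A small illustrative sketch:

```python
def opt_faults(refs, frames):
    """Count page faults under Belady's OPT ("clairvoyant") replacement:
    on a fault with all frames full, evict the resident page whose next
    use lies furthest in the future (or is never used again)."""
    resident, faults = set(), 0
    for i, page in enumerate(refs):
        if page in resident:
            continue
        faults += 1
        if len(resident) < frames:
            resident.add(page)
            continue

        def next_use(p):
            # Position of p's next reference after i; infinity if none.
            try:
                return refs.index(p, i + 1)
            except ValueError:
                return float('inf')

        resident.remove(max(resident, key=next_use))
        resident.add(page)
    return faults
```

Even this perfect-knowledge policy hits the memory barrier described above: no eviction choice, however good, can shrink the time spent refilling the limited on-chip memory.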
The limit is manifest in the straight lines we can see in the plot here – the steeper slope of OPT means it runs faster, but after the first few processors are brought to bear on the problem (the number of processors in use climbs for the first few billion instructions) the rate of instructions executed per ‘tick’ (an analogue of time) is constant.
Getting NoCs to run faster and so releasing the benefits from the potentially massive parallelism they could bring, depends on beating this memory barrier (and lots of other things too, but one at a time!). So, what are the options?
Well, one thing I can rule out is trying to cache a particular piece of a memory page (in traditional operating systems memory is shifted about the system in blocks called pages – typically 4096 bytes long). The caches modelled here store memory in 16-byte “lines” – the hardware reads from the backing memory store in 16-byte blocks – and so I tested to see whether there was a pattern in which 16-byte line was most likely to be used (see previous blog post). My instinct from looking at the plot is that this will not work.
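The test itself is simple to reproduce: map each memory reference to its line position within its page and histogram the positions. A sketch (the function names are mine):

```python
from collections import Counter

PAGE_SIZE = 4096
LINE_SIZE = 16
LINES_PER_PAGE = PAGE_SIZE // LINE_SIZE  # 256 lines per page

def line_in_page(addr):
    """Which 16-byte line within its 4096-byte page does this address hit?"""
    return (addr % PAGE_SIZE) // LINE_SIZE

def line_histogram(addresses):
    """Histogram of line positions across a reference trace.
    A strong skew would suggest caching only part of each page could
    pay off; a flat histogram suggests it will not."""
    return Counter(line_in_page(a) for a in addresses)
```

Run over a real reference trace, a flat result from `line_histogram` is exactly the “no pattern” outcome described above.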
Similarly, a look at which pages were being used doesn’t reveal any immediately obvious pattern – some pages are used heavily by code, some are not – nothing surprising there.
So, the easy things do not work. Now I need to look at the hard things.
I think I need to escape from the page paradigm – one thing to look at is the size of the memory objects that are accessed. 4k pages are simple to handle – load a block in, push it out: but they could be (probably are) very inefficient. Might it be better to base our memory caching system on object sizes? That’s what I plan to check.
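To see why page granularity could be wasteful, here is a deliberately crude model of the surplus bytes moved when whole 4096-byte pages are fetched to serve small objects (it assumes each object is fetched once and that objects do not share pages – both simplifications):

```python
PAGE = 4096

def page_overhead(object_sizes):
    """Surplus bytes a page-granularity cache moves, versus the bytes
    actually wanted, for one fetch of each object. Illustrative only:
    assumes one fetch per object and no page sharing between objects."""
    moved = wanted = 0
    for size in object_sizes:
        pages = -(-size // PAGE)  # ceiling division: pages touched
        moved += pages * PAGE
        wanted += size
    return moved - wanted

# A thousand 64-byte objects drag in ~4MB of pages to deliver 64KB.
print(page_overhead([64] * 1000))
```

Real access patterns will share pages and refetch objects, so the true picture needs measuring against a trace – which is the check planned above.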
When Intel faced up to the inevitability that they could no longer keep producing ever-faster processors, they released the “Pentium D” – a dual-Pentium CPU.
The two Pentiums were not on the same die though, and the whole thing gobbled energy, heated up quickly and generally performed pretty poorly in comparison to the following generation of two-CPUs-on-the-same-piece-of-silicon designs.
In 2005, though, having found a new job after a brief post-election period of enforced idleness, I bought a Pentium D based motherboard from the now late and lamented Morgan Computers on New Oxford Street. Morgans specialised in selling cheap end of line machines and I was both happy with the performance (earlier in the year I was building machines out of Pentium 90s!) and the price.
But the heat was a perennial problem. As the box clogged up with dust, as it inevitably did, or the heatsink’s efficiency declined, the machine would first slow (as the cores were throttled) and then just peg out at a random time.
Latterly it wasn’t even lasting 10 minutes with X Windows running – even though I’d invested in a much bigger heatsink.
So, tonight it was dismembered – the disk being fed into a standalone mounter so I could extract what I needed, the NVidia card removed (and the RAM) and the heavy frame chucked out. I would guess it will be gone by the morning, so efficient are the metal gatherers round here.
My three additional Raspberry Pis are here and one thing is immediately obvious – compute power is now much cheaper than disk storage. The three devices together give me about 2100 MIPS (based on the BogoMIPS calibration on my existing system) and cost around £90 for the lot.
I am not sure what the storage is going to cost me, but with three SD cards and three USB disks I will be lucky to get the lot for £150.
Here is a complete graph of the memory use and fault count for a run of ls -lia.
As you can see there are only soft/minor faults here – as one would expect (this was a machine with a lot of memory): the ls binary and the C library it links against will already be in memory (and presumably the filesystem information was in memory too).
But there are a lot of soft faults – and these too have a cost, even if nothing like the cost of a hard fault. For a start each soft page fault almost certainly indicates a miss in the processor cache.
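On Unix-like systems you can get a rough feel for this cost from `getrusage`, which reports minor (`ru_minflt`) and major (`ru_majflt`) fault counts. A small sketch (the wrapper function is mine; note the `resource` module is Unix-only):

```python
import resource

def minor_faults_during(fn, *args):
    """Approximate count of soft (minor) page faults a call incurs.
    ru_minflt counts faults serviced without disk I/O; ru_majflt
    counts the far more expensive hard faults."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    result = fn(*args)
    after = resource.getrusage(resource.RUSAGE_SELF)
    return result, after.ru_minflt - before.ru_minflt

# Touching a fresh block of memory typically generates soft faults
# as pages are mapped in on first write.
_, faults = minor_faults_during(lambda: bytearray(10 * 1024 * 1024))
```

The counts are approximate (the allocator may reuse already-mapped pages), but they make the point that “soft” does not mean free.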
The paper linked here also gives a lot of information about Linux’s generation of soft/minor faults and their performance impact – it seems that the kernel is designed to deliver system wide flexibility at the expense of performance.
I suppose I should begin by saying I don’t “hate Apple” – in fact I think their products are quite nice, if hideously over-priced. But I must admit to a bit of contempt for those people who tell me they like Apple because “it’s so easy” – somewhat like the sadly continuing fashion among arts and humanities graduates of boasting how little they know of maths or science (scientists do not boast back).
Apple is “easy” because you are restricted to buying their over-priced hardware, it is as simple as that. If they control the hardware they control the drivers and so you have to do what they say.
OK, you may feel that the time it would take you to learn how to install your own hardware is worth every penny. But your lack of knowledge is hardly something to boast of as though you were superior to those of us who do know how to do it, is it?
(Though the problems people sometimes have with hardware on their Windows machines are down to the broken business model that too many hardware manufacturers operate under. The real choices are: grip the operating system and the hardware so the two move in absolute tandem, as with Apple – or free the hardware specification so that FOSS drivers appear, for free. Being a hardware manufacturer forced to chase after Microsoft’s Windows ABI is surely the worst of both worlds. On this point I do think Eric S. Raymond is right.)
This week a ten-or-so-year-old NVidia FX5500 died on me (looking at it, things I take to be capacitors – they look like water towers – blew up: the metal casings on all of them are cracked and one seems to have leaked something).
Second NVidia card to die on me this year, but the first was after some dodgy software settings were applied by yours truly and, well, ten years is not a bad innings.
Trying desperately to replace it I found an even older PCI card to use:
00:0a.0 VGA compatible controller: 3Dfx Interactive, Inc. Voodoo Banshee (rev 03) (prog-if 00 [VGA controller])
Subsystem: Creative Labs 3D Blaster Banshee PCI (CT6760)
Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 5
Region 0: Memory at e0000000 (32-bit, non-prefetchable) [size=32M]
Region 1: Memory at e2000000 (32-bit, prefetchable) [size=32M]
Region 2: I/O ports at 9000 [size=256]
[virtual] Expansion ROM at 60000000 [disabled] [size=64K]
Capabilities:  Power Management version 1
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Kernel modules: tdfxfb
But as this is a server and I don’t need sophisticated graphics, this card, which is 1998’s technology, is likely to do just fine.
Finally got to the bottom of my issue with power saving and scheduling on my Pentium D machine (essentially a dual core Pentium 4).
It seems that, apart from lowering heat output (the Pentium D is a notoriously hot-running processor), the “ondemand” frequency governor is not likely to save power in the real world on this chip, and support for it via p4-clockmod has been deliberately broken by the kernel maintainers:
p4-clockmod is NOT true CPU frequency scaling, it just forces the CPU to idle on a periodic duty cycle and has no effect on CPU frequency. The clock modulation feature is basically just engaging the same mechanism the CPU uses to reduce heat output when it gets too hot, and which is not meant as a power saving mechanism. When engaged, it does reduce heat output and power usage, but not as much as it reduces system performance, and means the system will simply take longer to return to idle. In short, using p4-clockmod can only increase power usage in any real workload.