If writing the MSc project proposal is this hard…


…what is the project itself going to be like?

I started work last night on writing up a very early draft of my project proposal. For several reasons it was a lot more difficult than I had expected.

Firstly, and typically, I let the technology of the writing tool get in the way of the actual writing. I spent much more time fiddling with LyX and various templates than writing anything. Should I just write plain text in a word processor and then copy it into the LaTeX tool, or soldier on with LyX (after all, even the project proposal will have to include mathematical notation)?

Secondly, while I had hoped to use Christmas, and the time off work, to get on with this, working without distraction means staying up until 3am, as the house is only quiet after 1am or later at Christmas. Not really sustainable.

Thirdly, and this is the most difficult one: all my mental images of what I was going to write (state the hypothesis, currently framed as "a more thorough-going application of the working set concept in the Linux kernel will improve performance", and let everything else flow from that) simply melted away.

Indeed I am sort of working on the idea that I really need to explain some of the core ideas and then present a hypothesis to be tested.


What if P = NP?


Update (5 March): read a better version here.

I admit I am now going slightly out of my depth, but I will try to explain what this is about and why it is interesting.

It is known that computers can solve some problems in what is called "polynomial time": that is to say, in a number of steps bounded by a polynomial in the size of the input. The key thing is that these problems are computable using mechanical steps (ie algorithms) in a way that is (sometimes euphemistically) described as "quick".

These can be simple problems – like what is the sum of the first ten integers – or more complex ones, such as creating self-balancing trees, sorting database records alphabetically and so on.

These are the so-called "P" (for polynomial time) problems. Here's a definition:

A problem is assigned to the P (polynomial time) class if there exists at least one algorithm to solve that problem, such that the number of steps of the algorithm is bounded by a polynomial in n, where n is the length of the input.

Then there is another class of problems which seem fiendishly difficult to solve but for which it is relatively simple to check that any solution offered is correct. These are the NP problems, and NP stands for "nondeterministic polynomial time" (not, as it is sometimes glossed, "not in class P"). A proposed solution to an NP problem can be verified by an algorithm in polynomial time; equivalently, such a problem could be solved in polynomial time by a nondeterministic Turing machine, a theoretical machine that can, in effect, guess its way to an answer. These problems are still computable by an ordinary Turing machine (a simple model of a computer), so an algorithmic solution always exists; the catch is that nobody knows a polynomial time algorithm that will find a solution, and in the worst case the only approaches known take time that grows exponentially, which is essentially not much better than an exhaustive search of the possibilities, in other words a form of "brute force". Note that every P problem is also in NP; the open question is whether the reverse holds.
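To make the "hard to solve, easy to check" distinction concrete, here is a small illustration of my own (a Python sketch with made-up numbers, nothing authoritative) using the subset sum problem, which is known to be NP-complete: finding a subset of numbers that adds up to a target appears, in the worst case, to require searching through exponentially many subsets, but checking a proposed answer takes only a handful of arithmetic operations.

from itertools import combinations
from collections import Counter

def solve_subset_sum(numbers, target):
    # Brute force: in the worst case this tries all 2^n subsets.
    for r in range(len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == target:
                return subset
    return None

def verify_subset_sum(numbers, target, candidate):
    # Verification: roughly linear in the size of the input, ie polynomial.
    is_subset = not (Counter(candidate) - Counter(numbers))
    return is_subset and sum(candidate) == target

numbers = [14, 7, 3, 22, 5, 11]
print(solve_subset_sum(numbers, 25))             # exhaustive search
print(verify_subset_sum(numbers, 25, (14, 11)))  # quick check: True

Nobody has shown that this kind of exhaustive search is unavoidable; that, in essence, is the P versus NP question.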

Now most mathematicians think that P does not equal NP, and that may or may not be a good thing: much of our internet commerce relies on encryption whose cracking (for example, factoring the very large numbers used in public-key schemes) is an NP problem that is believed not to be solvable in polynomial time.

(In Afghanistan in 2001 US cryptanalysts seemingly brute forced a Taliban Windows NT box but it used much weaker encryption than most electronic commerce.)

But what if it were the case that all NP problems were actually P problems? There are a lot of people studying this question, but according to the New Scientist (their Christmas edition, the first in my subscription and delivered this morning, inspired me to write this) we should expect to wait at least until 2024 for an answer (by which point the problem, first formulated in 1971, will have reached the age by which half of all major mathematical problems have historically been solved).

Some problems once thought to lie outside P have already been shown to be in P, and there was a big fuss earlier in 2010 when a draft proof that P != NP was published (the proof was flawed). And unlike, say, Fermat's last theorem, proving P = NP would be likely to have more or less immediate effects on the lives of many of us: how would you feel if you knew that it was possible, even if not likely, that someone had developed a way to quickly crack open all your internet credit card transactions?

Of course, proving that P = NP would not necessarily mean we had found polynomial time solutions for all the current NP problems out there. But I would expect it to lead quickly to the solution of a multitude of them.

And, actually, I think the benefits to humanity would be enormous. The most likely immediate effects would be improvements in computing and operating system efficiency: faster computers for less money and/or less energy consumption would be a huge benefit. From that alone many other benefits would flow in terms of information availability and universality.

530 lines of Javascript


I have just written that amount of code in what I persist in thinking of as a toy language (it was actually somewhat longer until I refactored the code to group some common functions together).

I had to do this for a coursework exercise – a lot of effort to be honest for what is at stake – processing a rather large XML file with some XSL. The Javascript essentially manages the parameters.

At the end my conclusion is that I don’t really see why anybody would want to write that much client code if they could possibly help it. Of course it transfers the computational burden to the client – but at the cost of hundreds of lines of interpreted code which is essentially under the control of the people who write the engines in the Firefox and IE browsers. In the real world that points towards a support nightmare.

Having written a fair bit of Perl (and AJAX) stuff in the past, the whole thing felt unnatural: dozens of lines, much of it designed to handle the differences between the browser engines, doing work that could have been handled simply on the server side.

One thing that I was convinced of was the potential power of XSLT, though I was not quite prepared for the revelation that it is Turing complete (ie it would be possible to write some XSL to carry out any algorithm/task solvable through a finite number of mechanical steps). Though I shudder to think how big a stylesheet would be required to handle all but the smallest of tasks.

But the potential power of XSLT is not the same as being able to think of many practical uses for it!

Algorithms matter … a lot


One of the things that most annoyed me about the article on graphics (discussed in "Some advice you should ignore" below) was the author's subsequent attempts, once he came under criticism, to justify bad technological advice with the excuse that falling hardware prices meant his advice to use inappropriately optimised formats did not matter.

For the truth is that cheaper hardware expands the use of technology at both the “fat” end (ie your 4GB quad core desktop) and the “thin” end – your mobile phone and lower still. As computing becomes ubiquitous then appropriate optimisation will be as important as it ever was.

And appropriate optimisation means efficient and effective algorithms.

In fact, suggests a report quoted here, improvement in algorithm efficiency has contributed more – by an order of magnitude – to advances in some branches of computing than falling hardware prices (I am using price here as an analogue for the general Moore’s Law advance).

The report only appears to cite two specific examples of how algorithmic improvement has far outstripped the silicon, so I would be wary of the general claim that algorithms contributed a 30 to 43 times greater improvement to computing efficiency than hardware did in the 15 years between 1988 and 2003 (something that did not bother the slashdot summariser, so it is now likely to become the received myth). It is a remarkable finding nonetheless.

The book I want to write


The Design of the Unix Operating System is one of the best computing books I have read in a long time.

It is clear and easy to understand. (It's also very cheap if you buy a second-hand edition.) But it is more than that too: as soon as you begin reading it you realise that it has been fundamental in fixing the culture of the Unix/Linux kernel development world. Indeed I wish I had read it long before now, as it would have helped me get a better fix on both Linux's antecedents and on the mental world of the average kernel hacker.

That said, it is old and somebody ought to rewrite it for Linux. Maybe I will be able to do that some day.

Some advice you should ignore


I have an interest in graphics formats: I wrote the Perl package Image::Pngslimmer a while back when I was hacking on some Perl database code that delivered (via Ajax) graphs and photographs.

I built the whole website for fun and learning purposes, and so used PNG graphics when JPEGs would have been more appropriate (for the photographs at least).

So an article that begins:

Virtually every developer will use bitmaps at times in their programming. Or if not in their programming, then in a website, blog, or family photos. Yet many of us don’t know the trade-offs between a GIF, JPEG, or PNG file – and there are some major differences there. This is a short post on the basics which will be sufficient for most, and a good start for the rest. Most of this I learned as a game developer (inc. Enemy Nations) where you do need a deep understanding of graphics.

has my attention.

But what a load of rubbish it turns out to be. The author plainly doesn’t know what the “trade-offs” are either as he states:

PNG – 32-bit (or less), lossless compression, small file sizes – what’s not to like. Older versions of some browsers (like Internet Explorer) would display the transparent pixels with an off-white color but the newer versions handle it properly. Use this (in 32-bit mode using 8 bits for transparency) for everything.

…and…

JPEG – 24-bit only (i.e. no transparency), lossy (can be lossless but compressions drops a lot), small file sizes. There is no reason to use this format unless you have significant numbers of people using old browsers. It’s not a bad format, but it is inferior to PNG with no advantages.

This is seriously bad advice and it is also technically wrong.

First, a PNG is not necessarily a lossless copy of the original image. PNG's compression is itself lossless, and many PNGs are indeed faithful to their source, but they do not have to be. A substantial part of the code in the Image::Pngslimmer package is about using the median cut algorithm to produce a lossy PNG from a lossless one: it builds a best-match set of colours for the image and then converts each pixel to the colour in the set closest to it in RGB space (the median cut algorithm ensures that the set has more members in the parts of RGB space where the original image has more colours).
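Image::Pngslimmer itself is written in Perl; purely to illustrate the idea just described (this is a rough Python sketch of median cut colour quantisation, not the package's actual code), the approach looks something like this: keep splitting the box of colours along its widest channel at the median, then map each pixel to the average colour of its box.

def median_cut(pixels, n_colours):
    # Rough sketch only: reduce a list of (r, g, b) tuples to at most
    # n_colours representative colours.
    boxes = [list(pixels)]
    while len(boxes) < n_colours:
        # Pick the box whose colours are most spread out on any channel.
        def spread(box):
            return max(max(c[i] for c in box) - min(c[i] for c in box) for i in range(3))
        box = max(boxes, key=spread)
        if len(box) < 2:
            break
        # Cut that box at the median of its widest channel, so crowded parts
        # of RGB space end up with more representative colours.
        channel = max(range(3), key=lambda i: max(c[i] for c in box) - min(c[i] for c in box))
        box.sort(key=lambda c: c[channel])
        mid = len(box) // 2
        boxes.remove(box)
        boxes += [box[:mid], box[mid:]]
    # Represent each box by its average colour, then map every pixel to the
    # nearest of those colours in RGB space.
    palette = [tuple(sum(c[i] for c in box) // len(box) for i in range(3)) for box in boxes]
    def nearest(p):
        return min(palette, key=lambda q: sum((p[i] - q[i]) ** 2 for i in range(3)))
    return [nearest(p) for p in pixels]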

Second, JPEG remains by far the best way of representing photographs on the real-world internet, where the trade-off between size and quality matters. Look closely at the colour boundaries in a JPEG, especially a highly compressed one, and you can see how the edges are blurred and colours bleed across the boundaries: that is an artefact of the compression and indeed represents a loss (once compressed in this way the quality of the original image cannot be recovered). But it is also a highly efficient algorithm and can produce much smaller image files than the lossless PNG equivalent without any noticeable diminution in quality for the average viewer.

As a parallel, think of MP3s and FLACs for audio. Like FLACs, PNGs can be lossless and compressed, delivering a much smaller file than the source WAV or similar. But MP3s can be much smaller again without any noticeable diminution in quality for most people on most equipment.

Yes, more and more people are connecting at 25Mb/s, but then more and more people are also connecting using GPRS, EDGE and low bandwidth 3G connections. Size still matters.

PNG is absolutely superior for representing line drawings (including artefacts such as graphs) as no one wants to see the lines start to fuzz. But that is a pretty limited subset of images on the internet.

So here is some proper advice: use JPEGs (saved at about 85% quality) for almost all photographs you want to put on the internet. If you have a line drawing or similar, prefer PNG. If you have a truly lossless image (remember someone may have compressed it before it got to you) and you need to keep it that way, then use the PNG format, but expect your file size to be very large.
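For what that looks like in practice: assuming the Pillow imaging library is available (the file names here are placeholders, not from any real project), saving a photograph both ways makes the size difference obvious.

from PIL import Image
import os

img = Image.open("photo.png")             # a hypothetical lossless source image

# JPEG at roughly 85% quality: for photographs, usually a fraction of the PNG size.
# convert("RGB") drops any alpha channel, which JPEG cannot store.
img.convert("RGB").save("photo.jpg", quality=85)

# The same image as a PNG: lossless, but expect a much larger file.
img.save("photo_copy.png", optimize=True)

for name in ("photo.jpg", "photo_copy.png"):
    print(name, os.path.getsize(name), "bytes")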

Dealing with 0x80600001 errors


Maybe you have just seen a message like this:

Error: uncaught exception: [Exception… “Component returned failure code: 0x80600001 [nsIXSLTProcessor.importStylesheet]”  nsresult: “0x80600001 (<unknown>)”  location: “JS frame :: file:///home/adrian/webtech/cia.html :: fulltable :: line 47”  data: no]

If you have, then the chances are you are working on some XSL/XSLT (the above comes from a piece of coursework I am working on which manipulates an XML representation of data from the CIA World Fact Book).

The error indicates that your XSL is broken and non-compliant. The problem is that Firefox/Mozilla is much stricter about what it will accept than your command-line XSLT processor is likely to be: the piece of XSL which generated the above message sailed through xsltproc on my Linux box.

The best way to fix this is to take lines out of your XSL, one by one, until you find the one that breaks the transformation. To avoid being inadvertently tied up in some issue of plagiarism later on, I cannot post the XSL I was working on when this came up, but I had a line like this:

<xsl:apply-templates match="//item"/>

That is bad XSL: the attribute should be select, not match (ie <xsl:apply-templates select="//item"/>). But because xsltproc happily covered up my mistake and generated the XHTML I was looking for, I could not understand why Firefox was flagging an error on the Javascript line xsltproc.importStylesheet(xmlxsl).

It also took me a while to find an online explanation, so I wrote this.

XML: any use?


I stumbled across the site XMLSucks.com just now while reading a comment on slashdot about the claim that there was an FBI-mandated "backdoor" in OpenBSD.

Right now I am working on some coursework with XML and so the site has my sympathy. For sure, XML has its uses – SVG seems like a pretty good idea to me and I have used it recently to generate graphics to represent the processes running on a Linux box.

But freely mixing it with HTML on the web? I am inclined to (mostly) agree with the statement on the site:

XML is bloated. XML is fugly. XML is only "human-readable" if you're willing to stretch the definition of "human-readable." The same goes for the proposed bloatware of HTML5. Anyone looking at the spec must be shaking their heads. Sure, it's better than the now-abandoned xhtml 2.0, but that's not saying much.

Rediscovering enthusiasm


This is the first "normal" weekend (not abroad or just back, not jet-lagged and so on) I've been able to have at home in a month, and it has also been the first time in that period when I have been able to devote some time to looking further at my proposed MSc project: extending working set heuristics in the Linux kernel.

The good news is that I am once more convinced of the utility of the idea, and enthusiastic about implementing it. At the risk of looking very naive in six months' (or six weeks') time, even in my own eyes, here is the core idea:

Peter Denning’s 1968 and 1970 papers on the working set and virtual memory made some bold claims – calling global page replacement algorithms “in general sub-optimal” and asserting that the working set method is the best practical guarantee against thrashing.

Windows NT and its derivatives (XP, Vista, 7 etc) reflect their heritage from VMS in using a working set based replacement policy.

In contrast, Linux (and the Unix family generally) uses global replacement policies: indeed a fairly simple clock algorithm stands at the centre of Linux's page replacement. Kernel developers say the policy works well in practice and that, in effect, the active "least recently used" list of cached pages, against which the clock algorithm runs, is a list of the pages in the working sets of running processes.

My essential idea is to trim the active list on a process-by-process basis when the system is under high memory load (the long delay in execution caused by a page fault should make it worthwhile to execute the extra code if it reduces the number of future faults). Pages on the active list that are owned by the processes with the biggest memory footprints would be dropped into the inactive list, making it more likely that they will eventually be swapped out.
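This is still just an idea, not kernel code, but to pin it down here is a very rough user-space sketch of the policy in Python; every name, structure and threshold is invented for illustration and bears no relation to the kernel's actual data structures.

from dataclasses import dataclass, field

@dataclass
class Proc:
    pid: int
    active_pages: list = field(default_factory=list)    # pages on the active list
    inactive_pages: list = field(default_factory=list)  # pages eligible for reclaim

def trim_active_lists(processes, memory_pressure, high_watermark=0.9, trim_fraction=0.1):
    # Only pay the cost of the extra work when the system is struggling.
    if memory_pressure < high_watermark:
        return
    # Visit the processes with the biggest memory footprint first.
    for proc in sorted(processes, key=lambda p: len(p.active_pages), reverse=True):
        n_to_move = int(len(proc.active_pages) * trim_fraction)
        for _ in range(n_to_move):
            page = proc.active_pages.pop()       # take a page off the active list...
            proc.inactive_pages.append(page)     # ...making it a candidate for swap-out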

The second aspect of applying a working set heuristic would be to alter the scheduling priorities of processes depending on their memory footprint. There are a few options here and I have not looked at this closely enough yet, but things to test (loosely sketched after the list below) could include:

  • Increasing the priority of the smallest processes – on the basis these might reach the end of execution more quickly and so release memory back to the pool
  • Radically lowering the priorities of the processes whose pages are being swapped out – on the basis that they do not have a working set of resources available and so, as Denning argued forty years ago, should not be able to run
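Again purely as an illustration (the footprints, thresholds and nice values below are invented, and the sketch deliberately stops short of changing anything), those two experiments might look something like this in user space:

def proposed_niceness(footprints, swapped_out, boost=-2, penalty=19):
    # footprints maps pid -> resident set size (eg in kB, as read from /proc);
    # swapped_out is the set of pids whose pages are being pushed to the inactive list.
    smallest = sorted(footprints, key=footprints.get)[:max(1, len(footprints) // 10)]
    plan = {}
    for pid in footprints:
        if pid in swapped_out:
            plan[pid] = penalty   # no working set available, so it should barely run
        elif pid in smallest:
            plan[pid] = boost     # help small processes finish and release memory
    return plan

# On Linux the plan could be applied with os.setpriority(os.PRIO_PROCESS, pid, value),
# but here we only print what would be done.
for pid, nice in proposed_niceness({101: 1200, 102: 540000, 103: 88000}, {102}).items():
    print(pid, "->", nice)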

In practical terms I am still some way off writing any kernel code. I have, though, written some user tools (they still need polishing) to display the memory footprint of Linux processes as a red-black tree (the representation used internally by the kernel). Following Eric S Raymond (on Unix programming, not politics!), the tools are partitioned into single applications that each do one thing, but run together they can generate graphics such as the one below:

Processes on a Linux box

So, on we go…

Making the easy look difficult, or is it the other way round?


The one thing that my MSc course has taught me is that things I thought were simple – such as SQL – are in fact fiendishly difficult in the hands of an examiner and with a clock ticking against you.

Simple queries like SELECT Name FROM ListOfNames just are not where it is at.

Here is one exercise I found on the web that I was not able to solve (I gave up after half an hour).

DESCRIPTION

The database of naval ships that took part in World War II is under consideration. The database has the following relations:
Classes(class, type, country, numGuns, bore, displacement)
Ships(name, class, launched)
Battles(name, date)
Outcomes(ship, battle, result)
All ships in a class are built to the same design. A class is normally assigned the name of the first ship built to that design (the head ship); otherwise, the class name does not coincide with any ship name in the database.
The Classes relation includes the class name, type (bb for a battleship, or bc for a battle cruiser), the country where the ships were built, the number of main guns, the gun calibre (diameter of the gun barrel, in inches), and the displacement (weight in tons). The Ships relation includes the ship name, its class name, and launch year. The Battles relation gives the name and date of each battle the ships participated in, while the result of their participation in the battle (sunk, damaged, or unharmed, ie OK) is in the Outcomes relation. Note: the Outcomes relation may include ships not included in the Ships relation.

PROBLEM

Point out the battles in which at least three ships from the same country took part.