A puzzle from Donald Knuth

Recently I had to write some code to generate a pseudorandom number in a system with very limited sources of entropy. So, of course I turned to Donald Knuth and, in particular, Volume 2 – Seminumerical Algorithms – in the magisterial The Art of Computer Programming.

Reading through the questions/exercises I then came across this one:

Prove that the middle-square method using 2n-digit numbers to the base b has the following disadvantage: if the sequence includes any number whose most significant n digits are zero, the succeeding numbers will get smaller and smaller until zero occurs repeatedly.

(Knuth rates this question as ’14’ and, using his scale of difficulty which places a ’10’ as a minute to solve and a 20 as twenty minutes, probably means this should take about 7 or 8 minutes but I’ve spent much, much longer on it than that!)

A quick explanation: the middle-square method is a naive (though actually first suggested by none other than John von Neumann) random number generation method where we take a number n-digits long, square it and take the middle n-digits as our next seed or random number.

At first I thought I’d found an example where Knuth’s proposition appears to be false.

Let b=10 and the seed number be N_0 =60 then every subsequent number in the sequence is also 60 (obviously that’s as useless as repeated zeros for a random number generator.) But the problem with that is, that although it demonstrates a weakness in the middle square method, it doesn’t fit Knuth’s definition of the problem. What is n here? If n = 2, 2n=4, 4n=8 (n=2 is the minimum for middle values), then $N_o = 0060 $ and N_0^2 = 00003600 and so N_1 = 0036, N_2=0012, N_3=0001, N_4=0 (thanks to Hagen von Eitzen for clarifying this for me.)

So let’s look at the general case. (I also this explanation to Hagen von Eitzen, as compared to my very long-winded first attempt – though any errors that follow are mine not his.)

So we have a number x which we think of as a 2n digit number – though the first n digits are 0. Then x^2 < b^{2n} (as the largest x can be is b^n -1.)

Thus as the largest x can be is b^n - 1 then:

\frac{x^2}{b^n} \leqslant \frac{x(b^n - 1)}{b^n}

And \frac{x(b^n - 1)}{b^n} = x-\frac{x}{b^n}

If we have a 2n digit number x then the biggest number of digits we can get from the out put is 4n, but in our case we only have n digits to worry about so the biggest size x^2 can be is 2n digits, as again the leading 2n digits will be 0.

So, to apply the middle square method we need to lose the lower n digits – i.e., take \lfloor\frac{x^2}{b^n}\rfloor.

From the above:

\lfloor\frac{x^2}{b^n}\rfloor \leqslant \lfloor x-\frac{x}{b^n} \rfloor

If x > 0 then \lfloor x-\frac{x}{b^n}\rfloor will always < x, so the sequence decreases until x = 0.

The Reingold-Tilford algorithm revisited

hw5_ivmooc_rt_tree (Photo credit: marianocecowski)

A while ago, as part of early research into what became my MSc project, I wrote code to create and then draw red-black trees using C++.

To draw the trees I used the venerable Reingold-Tilford algorithm, which is more or less the standard approach. I wrote some blogs about it and pages here seem to come pretty high up in Google searches for the algorithm, so I get passing traffic regularly as a result.

But idly chasing these links has led me to a chapter from the forthcoming Handbook of Graph Drawing and Visualization edited by Roberto Tamassia which has a chapter on tree drawing by Adrian Rusu, which contains bad news for us Reingold-Tilford fan boys, as this summary from the book of an experiment comparing algorithmic performance shows (emphasis added):

• The performance of a drawing algorithm on a tree-type is not a good predictor of the performance of the same algorithm on other tree-types: some of the algorithms perform best on a tree-type, and worst on other tree-types.
Reingold-Tilford algorithm [RT81] scores worse in comparison to the other chosen algorithms for almost all ten aesthetics considered.
• The intuition that low average edge length and area go together is contradicted in only one case.
• The intuitions that average edge length and maximum edge length, uniform edge length and total edge length, and short maximum edge length and close farthest leaf go together are contradicted for unbalanced binary trees.
• With regards to area, of the four algorithms studied, three perform best on different types of trees.
• With regards to aspect ratio, of the four algorithms studied, three perform well on trees of different types and sizes.
• Not all algorithms studied perform best on complete binary trees even though they have one of the simplest tree structures.
The level-based algorithm of Reingold-Tilford [RT81] produces much worse aspect ratios than algorithms designed using other approaches.
• The path-based algorithm of Chan et al. [CGKT02] tends to construct drawings with better area at the expense of worse aspect ratio.

Even if P=NP we might see no benefit

A system of linear inequalities defines a poly...
A system of linear inequalities defines a polytope as a feasible region. The simplex algorithm begins at a starting vertex and moves along the edges of the polytope until it reaches the vertex of the optimum solution. (Photo credit: Wikipedia)

Inspired by an article in the New Scientist I am returning to a favourite subject – whether P = NP and what the implications would be in the (unlikely) case that this were so.

Here’s a crude but quick explanation of P and NP: P problems are those that can be solve in a known time based on a polynomial (hence P) of the problem’s complexity – ie., we know in advance how to solve the problem. NP (N standing for non-deterministic) problems are those for which we can quickly (ie in P) verify that a solution is correct but for which we don’t have an algorithmic solution to hand – in other words we have to try all the possible algorithmic solutions in the hope of hitting the right one. Reversing one-way functions (used to encrypt internet commerce) is an NP problem – hence, it is thought/hoped that internet commerce is secure. On the other hand drawing up a school timetable is also an NP problem so solving that would be a bonus. There are a set of problems, known as NP-complete, which if any one was shown to be, in reality a P problem would mean that P = NP – in other words there would be no NP problems as such (we are ignoring NP-hard problems).

If it was shown we lived in a world where P=NP then we would inhabit ‘algorithmica’ – a land where computers could solve complex problems with, it is said, relative ease.

But what if, actually, we have polynomial solutions to P class problems but there were too complex to be of much use? The New Scientist article – which examines the theoretical problems faced by users of the ‘simplex algorithm’ points to just such a case.

The simplex algorithm aims to optimise a multiple variable problem using linear programming – as in an example they suggest, how do you get bananas from 5 distribution centres with varying numbers of supplies to 200 shops with varying levels of demand – a 1000 dimensional problem.

The simplex algorithm involves seeking the optimal vertex in the geometrical representation of this problem. This was thought to be rendered as a problem in P via the ‘Hirsch conjecture‘ – that the maximum number of edges we must traverse to get between any two corners on a polyhedron is never greater than the number of faces of the polyhedron minus the number of dimensions in the problem.

While this is true in the three dimensional world a paper presented in 2010 and published last month in the Annals of MathematicsA counterexample to the Hirsch Conjecture by Francisco Santos has knocked down its universal applicability. Santos found a 43 dimensional shape with 86 faces. If the Hirsch conjecture was valid then the maximum distance between two corners would be 43 steps, but he found a pair at least 44 steps apart.

That leaves another limit – devised by Gil Kalai of the Hebrew University of Jerusalem and Daniel Kleitman of MIT, but this, says the New Scientist is “too big, in fact, to guarantee a reasonable running time for the simplex method“. Their two page paper can be read here. They suggest the diameter (maximal number of steps) is n^{log(d+2)} where n is the number of faces and d the dimensions. (The Hirsch conjecture is instead n-d .)

So for Santos’s shape we would have a maximal diameter of \approx 10488 (this is the upper limit, rather than the actual diameter). A much bigger figure even for a small dimensional problem, the paper also refers to a linear programming method that would require, in this case, a maximum of n^{4\sqrt d}\approx 10^{50} steps. Not a practical proposition if the dimension count starts to rise. (NB I am not suggesting these are the real limits for Santos’s shape, I am merely using the figures as an illustration of the many orders of magnitude difference they suggest might apply).

I think these figures suggest that proving P = NP might not be enough even if it were possible. We might have algorithms in P, but the time required would be such that quicker, if somewhat less accurate, approximations (as often used today) would still be preferred.

Caveat: Some/much of the above is outside my maths comfort zone, so if you spot an error shout it out.

Convolutional coding

Convolutional coding is form of error-correction widely used in mobile communications and is another area of life where discrete mathematics is hard at work for us, even though we are unlikely ever to think about it.

The basic idea in convolutional coding are that a stream of bits y of length L_y is converted to a different string x of length L_x based on the last L_z input symbols (L_z is the constraint length of the code and \frac{L_y}{L_x} is the code rate.

A widely used form of convolution coding is the Viterbi algorithm.

The simple idea behind the Viterbi algorithm is that given a string of observed bits, it returns the most likely input stream of bits (these bits having been transformed before transmission).

I am going to base my explanation of the Viterbi algorithm on the example used on wikipedia (though the example used there is wrong at time of writing – 14 July 2012 – as it seems to have the order of the states back to front, ie the wording does not agree with the diagram).

In this example a doctor must determine if a patient is ill with fever based on the patient’s report that she either feels normal, cold or dizzy.

From wikipedia:

states = ('Healthy', 'Fever')

observations = ('normal', 'cold', 'dizzy')

start_probability = {'Healthy': 0.6, 'Fever': 0.4}

transition_probability = {
   'Healthy' : {'Healthy': 0.7, 'Fever': 0.3},
   'Fever' : {'Healthy': 0.4, 'Fever': 0.6},

emission_probability = {
   'Healthy' : {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},
   'Fever' : {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6},

Ie, 70% of healthy patients remain healthy between observations, and 50% of healthy patients report they feel ‘normal’ and so on…

So, a patient presents on three successive days and reports they feel normal, then cold, then dizzy. What should the doctor regard as the most likely course of underlying (hidden – this is a ‘Hidden Markov Model’) events?

The trellis diagram below shows the transitions and their probabilities…

Trellis diagram

To find the most likely sequence is to compute the path with the highest probability (generally referred to as the shortest path). In this case this can be seen to be healthy, healthy, fever.

The Veterbi algorithm works on this same principle – data is received and as the convolution algorithm is known then the most likely input pattern can be determined.

Have we reached “peak silicon” and what can we do about it?

Moore’s Law states that the number of transistors that can be squeezed into a given slice of silicon doubles every two years (or 18 months) – something I wrote about recently and where I declared “More transistors means greater speed, more and cheaper memory and so on … ”

Except, maybe not. As the graph below, shamelessly grabbed from Herb Stutter’s “The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software“, shows, while Moore’s Law (the green graph) holds true, the other associated improvements that we have come to expect to parallel it, such as a similar increase in efficiency per watt (royal blue graph) and clock speed (navy blue) have not. In short, we can build cheaper chips but they are not necessarily much faster.

Herb Sutter's graph of CPU performanceAnd, as this article recounts, we are now talking about “dark silcon” – bits of chips that have to remain unpowered while other parts are in use so as to ensure the whole chip does not fry or fail due to too high power consumption.

So, if we have reached the point of “peak silicon” what can we do about it?

The chip manufacturers have responded by packing more and more cores into their devices and that works up to a point – we do not even need to have very parallel coding outside the operating system to take some advantage of that on even a typical multitasking desktop machine. But none of us are doubling the number of video renderers, MP3 decoders, database queries and spreadsheet calculations we run in parallel every 18 months, so the “Moore paradigm” of computing power doubling in that time will be lost.

A more fundamental alternative is to rewrite our software so that it becomes inherently designed to take advantage of multicore machines. Writing effective parallel software is not easy, but it can be done for lots of tasks. But even then there are limits – “Amdahl’s law” reminds us that parallelisation will only speed the parts of a program that can be run in parallel: if say we had a piece of code that must be run in serial and takes 5 seconds, and some code that currently takes 55 seconds but could be made perfectly parallel, then if we had 2 processors it takes 5 seconds (serial time), plus 27.5 seconds for the parallel code, doubling the processors but not quite halving the time, with a 46% saving. Doubling the number of processors again (to 4) cuts total computing time to 18.75 seconds but the proportional saving has dropped to 42%. In other words, the “Moore paradigm” also disappears.

The third thing we can do is look for better algorithms: the recent announcement of a vastly improved fast fourier transform (FFT) algorithm shows what can be done here – algorithmic improvement can vastly outstrip hardware speedup. But currently for many problems (those in NP space) there is no prior known algorithm available and computing power can be simply dedicated to going through all the possible algorithms looking for the one that works (we do not know what algorithms solves an NP problem but once a solution is found we can verify it ‘easily’). Assuming, as most mathematicians are said to do, that P does not equal NP (ie there is no yet to be discovered algorithm that cracks NP problems) this at least means that “peak silicon” will keep internet commerce safe for the foreseeable future but it is bad news in lots of other ways.

There is a fourth option, of course, which is to get a better physics – either for silcon fabrication, quantum computing or some other physics based innovation. Right now, though, these are probably still the least likely options but as the links below show, lots of people are working .

Better algorithms, not better hardware are what make your computer faster, faster

Moore's Law, The Fifth Paradigm.
Image via Wikipedia

Many people have heard of Moore’s Law – which states that the number of transistors that can be packed into a piece of silicon doubles every two years (or 18 months in another formulation). More transistors means greater speed, more and cheaper memory and so on … so in the thirty two years since I first saw a real “micro-computer” raw processing power has increased by over 30000 times on even the conservative formulation of the law.

The contrast with other engineering improvements is enormous: typical lightbulbs have become six or seven times more energy efficient in that period – something that LED technologies might double again in the next few years. Great, but nothing like computing.

But this contribution from the electronics engineers, while of course impressive, is minuscule in comparison to some of the improvements won by the mathematicians and the computer scientists who have perfected better, faster, methods to solve the common problems. These improvements in algorithms are not universal, unlike Moore’s Law, but can be much, more bigger – often in the factor of millions of times faster.

We are on the verge of having another such improvement revealed – to the “fast fourier transform” (FFT) algorithm which allows signals to be broken down into their components for compression and other purposes.

The FFT improvement is said to be (the paper is yet to be published and only the abstract is in the public domain) around a factor of 10 – which could have big implications for any application where fat data has to be pushed over a thin line and some loss of quality is permissible (mobile telephony is one obvious example but home sensors could be another).

The FFT image compression algorithm was patented some years ago, but as a pure mathematical procedure such a patient has no application in UK (or EU) law (and surely that is right – how can it be possible to patent something which already exists?) – it is not clear what the legal position of this innovation will be.

Once again, though, all this shows the fundamental importance of maths and science education to us all – and the importance of pure mathematical and scientific research. A little more emphasis on this and a little less on the “needs of industry” (very few employers are interested in research as opposed to getting their products out of the door a bit faster or cheaper) would be a welcome step forward for policy makers.

“Let’s enhance that” might work after all

This video has amused and fascinated many because it displays a fundamental ignorance of a basic rule of the – universe – a consequence of the second law of thermodynamics – that one cannot extract more information out of a picture than went into it at the time of creation:

But, according to the “Communications of the ACM” (subscription required), researchers at Carnegie Mellon University may have developed better ways to extract the information that is in an image.

That seems to be one of the consequences of new algorithms they have developed to deliver faster solutions to certain classes of linear systems (linear equations with many – perhaps billions – of variables).

To quote a small bit of the article (my emphasis added):

The researchers say the algorithm, which applies to a class of problems known as symmetric and diagonally dominant (SDD) systems, not only has practical potential, but also is so fast it might soon be possible for a desktop PC to solve systems with one billion variables in just seconds. “The main point of the new algorithm is that it is guaranteed to work and to work quickly,” says Gary L. Miller, a professor of computer science at CMU and a member of the three-person team that developed the
new algorithm.
SDD systems, characterized by system matrices in which each diagonal element is larger than the sum of the absolute values of all the other elements in the corresponding row, are used for a wide range of purposes, from online recommendation systems to industrial simulations, materials modeling, and image processing. Algorithms used for this type of linear system fall into two broad classes: direct solvers, such as Gaussian elimination, and iterative solvers. In contrast to direct solvers, which compute exact solutions, iterative solvers produce a series of approximate solutions. Direct methods are usually memory-hungry, a limitation that makes iterative solvers, such as the kind developed by the CMU team, more effective for the large data sets generated by today’s applications.

The article also has some real world examples of how the algorithm is being used to improve medical software.

Cannot immediately point you to a good book on this, though Amazon have an expensive DVD.

Progress on the MSc project

The various process states, displayed in a sta...
Image via Wikipedia

Had a letter from Birkbeck today telling me I had passed all my exams – so the issue now is finishing the MSc project and so getting the degree (I suppose, in theory at least as I have yet to be formally awarded it, I could now claim to have reached post graduate diploma level).

Right now, on the project, I am testing small kernel patches to see if a localised page replacement policy makes any noticeable difference for systems that are, or are in danger of, thrashing (ie spending nearly all their time waiting for paging to and from disk as opposed to doing any real computing).

The first patch I tried, forcing the largest and a large (not necessarily second-largest) process to write any dirty pages to disk had no noticeable effect – having thought about it a bit more, I now realise is unlikely that file backed pages are going to be dirty in this way for many processes, so actually all this code is likely to have done little but degrade performance.

What I really need is an aggressive rifle through the biggest processes page stack, essentially accelerating the hands of the CLOCK (of the CLOCK page replacement algorithm)  for this process (or, diminishing the handspread, to use a term often seen in this context) compared to those in general. So that’s next.

A further thought on MD5

Shows a typical cryptographic hash function (S...
Image via Wikipedia

The main use of MD5 – at least if my computer is any guide – is to check that a file you have downloaded from the internet or elsewhere is what it says it is.

In fact in this general use MD5 is not being used to encrypt anything – instead it produces a “message digest” – a 128 bit number that is a hash function of the supplied file. The problem with collisions in this case is that it means two different files could give the same hashed value (ie MD5 digest) and you could be left thinking you had the genuine file when you did not.

But that 128 bit hashed value plainly is not going to give you back the file – unlike CSI:Miami and everywhere where you see a “let’s enhance that” computer graphics gimmick, in the real world you cannot get more information out than you put in: so a 128 bit number will not magically transform into a 5 MB file even if you can reverse the hashing.

But that was not the issue with the Sun – they appeared to be using MD5 to hash a short password and in that case, at least in theory, being able to crack MD5 could give the original information back.

More proof that algorithms matter

Diagram of deleting a node from a singly linke...
Image via Wikipedia

In my piece on Programming Pearls a few days ago, I said I had been inspired to try a different, and hopefully faster, algorithm to manage a linked list.

Well, I did, and then I improved on it a second time by not deleting the array between scans of the memory map of a program being tested – simply reusing the allocated space and using a guard entry to mark the current ‘high tide’ of the chain in the array.

The resulting code is still very slow on the Pentium III box I am testing it on: but there is slow and there is slow: this particular slow is about 100 times faster than the simple/naive algorithm I started off with.