The Crimea, Urdu and MD5 hashes: a partial reply to Professor Rubin


The charge of the gallant three hundred, the Heavy Brigade!
Down the hill, down the hill, thousands of Russians,
Thousands of horsemen, drew to the valley–and stay’d;
For Scarlett and Scarlett’s three hundred were riding by
When the points of the Russian lances arose in the sky;
And he call’d, ‘Left wheel into line!’ and they wheel’d and obey’d.
Then he look’d at the host that had halted he knew not why,
And he turn’d half round, and he bade his trumpeter sound
To the charge, and he rode on ahead, as he waved his blade
To the gallant three hundred whose glory will never die–
‘Follow,’ and up the hill, up the hill, up the hill,
Follow’d the Heavy Brigade.

is the first stanza of Tennyson’s Charge of the Heavy Brigade.


بہادر تین سو کے انچارج، ہیوی بریگیڈ!
پہاڑ کے نیچے، پہاڑی کے نیچے، ہزاروں روسی
سوار ہزاروں کی تعداد میں متوجہ وادی اور رہے؛
کے لئے سرخ اور سرخ کے تین سو کی طرف سے سوار تھے
روسی بالا کے پوائنٹس جب آسمان میں پیدا؛
اور اس نے فون کیا، “لائن میں بائیں پہیا! ‘اور وہ پہیوں اور اطاعت ہے.
پھر وہ میزبان ہے کہ روک دیا تھا اور وہ جانتا تھا کہ کیوں نہیں دیکھا،
اور وہ نصف دور کر دیا، اور اس نے اپنی بگل کھلاڑی آواز بڑے
چارج کرنے کے لئے، اور وہ پر آگے سوار، کے طور پر اس نے اس کے بلیڈ لہرایا
بہادر تین سو جن کے جلال کبھی نہیں گا مرنا
‘، عمل کریں اور پہاڑی کے اوپر اوپر پہاڑی، پہاڑ اپ،
بھاری بریگیڈ کے بعد.

is an undoubtedly very poor Urdu translation (via Google Translate) of the same.

While this:


is the MD5 hash of the English verse.

So which – the Urdu or the MD5 hash – is more similar to the English verse? You might possibly answer neither – particularly if you can read any Urdu. But I think most would agree that the Urdu lines are more similar to the verse than the hexadecimal string is.

Therefore, asserts Professor Paul Rubin, in response to my account yesterday of Professor Ulrike Hahne’s seminar on difference and similarity, the argument that difference is related to the Kolmogorov complexity of converting one object/stimulus to another falls flat on its face, since computing an MD5 hash is computationally much easier than translating a longish piece of verse. It’s a persuasive argument and I am not going to challenge it directly.

Yet there is another aspect to this. For the MD5 hash is undoubtedly more different from the English verse than the Urdu is, and it is also more difficult to convert the hash back into the verse. Indeed, there is no practical way to computationally invert such a hash. Arguably the verse is one solution to the reversal, but there are (presumably – I am not really familiar with MD5 hashing beyond the most basic idea) an infinite number of such solutions. To ‘computationally’ go from the hash to the verse we would have to write a program that wrote out the characters of the poem and discarded the hash altogether.
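The one-sidedness is easy to see in code. As a minimal sketch using Python’s standard hashlib (and just the first line of the verse, for brevity), the forward direction is a single cheap call – while no corresponding library call exists for the reverse:

```python
import hashlib

# Hashing is cheap and one-way: the input (here just the poem's first
# line) reduces to a fixed 128-bit digest almost instantly.
line = "The charge of the gallant three hundred, the Heavy Brigade!"
digest = hashlib.md5(line.encode("utf-8")).hexdigest()

print(digest)       # 32 hexadecimal characters
print(len(digest))  # always 32, however long the input
```

However long the original text, the digest is always the same fixed size – which is precisely why infinitely many inputs must collide on any given hash.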

This relies on an important aspect of difference – it is not, in general, commutative, i.e., if we take Δ to be the ‘difference’ operator then A Δ B ≠ B Δ A in general. Or, if you want a thought experiment that makes the same point: in the minds of most people, “a line at 85 degrees to the horizontal is almost vertical, but a vertical line is not almost at 85 degrees to the horizontal”.
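One way to make the asymmetry concrete – an illustrative sketch only, using zlib’s compressed size as a crude, computable stand-in for Kolmogorov complexity (which is itself uncomputable) – is to measure how much extra compressed output one string adds once the other is already known. The two directions need not agree:

```python
import random
import zlib

def cost(src: bytes, dst: bytes) -> int:
    """Rough cost of producing dst when src is already known:
    the extra compressed bytes dst adds on top of src."""
    return len(zlib.compress(src + dst)) - len(zlib.compress(src))

rng = random.Random(0)
a = bytes(rng.getrandbits(8) for _ in range(2000))  # incompressible-ish data
b = a[:100]                                         # b is just the start of a

print(cost(a, b))  # small: everything in b is already present in a
print(cost(b, a))  # large: most of a is new information given only b
```

Going from the long string to its own opening is nearly free; going the other way means supplying almost all of the long string from scratch – the ‘difference’ depends on which way round you ask.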

So, the Kolmogorov complexity theory still has some legs to stand on.

Similarity, difference and compression


I am in York this week, being a student and preparing for the literature review seminar I am due to give on Friday – the first staging post on the PhD route, at which I have to persuade the department I have been serious about reading around my subject.

Today I went to a departmental seminar, presented by Professor Ulrike Hahne of Birkbeck College (and formerly of Cardiff University). She spoke on the nature of “similarity” – as is the nature of these things it was a quick rattle through a complex subject, and if the summary that follows is inaccurate, the blame is mine and not Professor Hahne’s.

Professor Hahne is a psychologist but she has worked with computer scientists and so her seminar did cut into computer science issues. She began by stating that it was fair to say all things are equally the same (or different) – in the sense that one can find an infinite number of ways in which any two things can be categorised as the same (object A weighs less than 1 kg, object B weighs less than 1 kg, they both weigh less than 2 kg, and so on). I am not sure I accept this argument in its entirety – in what way is an object different from itself? But that’s a side issue, because her real point was that similarity and difference are products of human cognition, which I can broadly accept.

So how do we measure similarity and difference? Well, the “simplest” way is to measure the “distance” between two stimuli in the standard geometric way – this is how we measure the difference between colours in a colour space (about which more later), i.e., the square root of the sum of the squares of the differences along each dimension. This concept has even been developed into the “universal law of generalisation”. The idea has achieved much but has major deficiencies.
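As a sketch of that “simplest” measure – assuming, purely for illustration, that colours are plain (r, g, b) tuples:

```python
import math

def colour_distance(c1, c2):
    """Euclidean distance between two RGB colours: the square root of
    the sum of squared differences along each channel."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

print(colour_distance((255, 0, 0), (0, 0, 0)))  # 255.0 - pure red vs black
print(colour_distance((0, 0, 0), (3, 4, 0)))    # 5.0  - the 3-4-5 triangle
```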

Professor Hahne outlined some of the alternatives before describing her interest in (and defence of) the idea that the key to difference is the number of mental transformations required to change one thing into another – for instance, how different is a square from a triangle? Two transformations are required: first to think of the triangle and then to replace the square with the triangle.

In a more sophisticated way, the issue is the Kolmogorov complexity of the transformation. The shorter the program we can write to make the transformation, the more similar the objects are.
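A popular computable approximation of this idea is the normalised compression distance, which substitutes a real compressor (here zlib, purely as an illustration – Kolmogorov complexity itself cannot be computed) for the ideal shortest program:

```python
import random
import zlib

def C(data: bytes) -> int:
    """Compressed size - a crude stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance: near 0 for very similar
    inputs, around 1 for unrelated ones."""
    return (C(x + y) - min(C(x), C(y))) / max(C(x), C(y))

x = b"the gallant three hundred rode up the hill and up the hill again " * 6
y = x.replace(b"hill", b"dale")                       # a near-copy of x
rng = random.Random(0)
z = bytes(rng.getrandbits(8) for _ in range(len(x)))  # unrelated noise

print(round(ncd(x, y), 2))  # small: x and y share almost all their structure
print(round(ncd(x, z), 2))  # close to 1: nothing shared
```

The more of one object a compressor can reuse when describing the other, the shorter the joint description – and, on this account, the more similar the objects.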

This, it strikes me, has an important application in computer science, or at least it could have. To go back to the colour space issue: when I wrote the Perl module Image::Pngslimmer I had to write a lot of code that computed geometric distances between colour points – a task at which Perl is very poor, as maths there is slow. This was to implement the so-called “median cut” algorithm (I am pleased to say that the Wikipedia article on the median cut algorithm cites my code as an example, and it wasn’t even me who edited it to say that, at least as far as I can remember!), where colours are quantised to those at the centre of “median cut boxes” in the colour space. Perhaps there is a much simpler way to make this transformation and so compress the PNG more quickly?
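For readers who have not met it, the heart of median cut can be sketched in a few lines – a toy illustration, not the Image::Pngslimmer code; here each box’s representative colour is taken as the per-channel mean of its members, one common choice:

```python
def median_cut(colours, depth):
    """Recursively split a list of (r, g, b) tuples into 2**depth boxes,
    cutting each box at the median of its widest channel, and return one
    representative (mean) colour per box."""
    if depth == 0 or len(colours) <= 1:
        n = len(colours)
        return [tuple(sum(c[i] for c in colours) // n for i in range(3))]
    # find the channel with the largest range in this box
    widest = max(range(3),
                 key=lambda i: max(c[i] for c in colours) - min(c[i] for c in colours))
    colours = sorted(colours, key=lambda c: c[widest])
    mid = len(colours) // 2
    return median_cut(colours[:mid], depth - 1) + median_cut(colours[mid:], depth - 1)

# two reddish and two greenish colours quantised down to a 2-colour palette
palette = median_cut([(250, 10, 10), (240, 20, 5), (10, 250, 10), (5, 240, 20)], 1)
print(palette)  # [(7, 245, 15), (245, 15, 7)]
```

The expensive part in practice is exactly the distance computation the post mentions: mapping every pixel to its nearest palette entry.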

I asked Professor Hahne about this and she confirmed that her collaborator Professor Nick Chater of Warwick University is interested in this very question. When I have got this week out of the way I may have a look at his published papers and see if there is anything interesting there.