I am in York this week, being a student and preparing for the literature review seminar I am due to give on Friday – the first staging post on the PhD route, at which I have to persuade the department I have been serious about reading around my subject.

Today I went to a departmental seminar, presented by Professor Ulrike Hahne of Birkbeck College (and latterly of Cardiff University). She spoke on the nature of “similarity” – as is the way with these things, it was a quick rattle through a complex subject, and if the summary that follows is inaccurate, the blame is mine and not Professor Hahne’s.

Professor Hahne is a psychologist but she has worked with computer scientists and so her seminar did cut into computer science issues. She began by stating that it was fair to say that all things are equally the same (or different) – in the sense that one can find an infinite number of properties by which two things can be categorised in the same way (object A weighs less than 1kg, object B weighs less than 1kg, they both weigh less than 2kg and so on). I am not sure I accept this argument in its entirety – in what way is an object different from itself? But that’s a side issue, because her real point was that similarity and difference are products of human cognition, which I can broadly accept.

So how do we measure similarity and difference? Well, the “simplest” way is to measure the “distance” between two stimuli in the standard geometric way – this is how we measure the difference between colours in a colour space (about which more later), i.e., the square root of the sum of the squares of the differences along each dimension. This concept has even been developed into the “universal law of generalisation”. This idea has achieved much but has major deficiencies.
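For concreteness, here is what that geometric measure looks like for colours treated as points in RGB space – a minimal sketch in Python rather than anything from the seminar itself:

```python
import math

def colour_distance(c1, c2):
    """Euclidean distance between two colours as points in RGB space:
    the square root of the sum of squared differences along each axis."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

red, dark_red, blue = (255, 0, 0), (128, 0, 0), (0, 0, 255)

print(colour_distance(red, dark_red))  # 127.0 -- "close" in the space
print(colour_distance(red, blue))      # ~360.6 -- "far" in the space
```

(Plain RGB Euclidean distance famously matches human colour perception rather poorly, which is one illustration of the “major deficiencies” mentioned above.)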

Professor Hahne outlined some of the alternatives before describing her interest in (and defence of) the idea that the key to difference is the number of mental transformations required to change one thing into another – for instance, how different is a square from a triangle? On this account two transformations are required: first to think of the triangle, then to replace the square with it, and so on.

In a more sophisticated way, the issue is the Kolmogorov complexity of the transformation. The shorter the program we can write to make the transformation, the more similar the objects are.
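Kolmogorov complexity itself is uncomputable, but off-the-shelf compressors give a rough, practical proxy: the “normalized compression distance” idea (due to Cilibrasi and Vitányi) treats compressed length as an approximation of description length. A minimal sketch, using Python’s zlib purely for illustration:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: a computable stand-in for the
    (uncomputable) Kolmogorov distance. Near 0 = similar, near 1 = different."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

a  = b"the quick brown fox jumps over the lazy dog" * 20
b_ = b"the quick brown fox jumps over the lazy cat" * 20  # one-word change
c  = bytes(range(256)) * 4  # unrelated binary data

print(ncd(a, b_))  # small: a short "program" (one substitution) maps a to b_
print(ncd(a, c))   # larger: neither has a short description in terms of the other
```

The intuition matches the seminar’s claim: the shorter the transformation program, the smaller the distance.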

This, it strikes me, has an important application in computer science, or at least it could have. To go back to the colour space issue again – when I wrote the Perl module Image::Pngslimmer I had to write a lot of code that computed geometric distances between colour points – a task Perl is poor at, as its arithmetic is slow. This was to implement the so-called “median cut” algorithm (pleased to say that the Wikipedia article on the median cut algorithm cites my code as an example, and it wasn’t even me who edited it to that, at least as far as I can remember!) where colours are quantised to those at the centre of “median cut boxes” in the colour space. Perhaps there is a much simpler way to make this transformation and so more quickly compress the PNG?
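For readers who haven’t met it, here is a minimal sketch of the median-cut idea in Python – an illustration of the algorithm, not the Image::Pngslimmer code itself:

```python
def median_cut(colours, n_boxes):
    """Median-cut quantisation sketch: repeatedly split the box with the
    widest channel range at its median, then return one representative
    colour (the channel-wise mean) per box."""
    def spread(box):
        # widest range across the three channels of this box
        return max(max(c[ch] for c in box) - min(c[ch] for c in box)
                   for ch in range(3))

    boxes = [list(colours)]
    while len(boxes) < n_boxes:
        box = max(boxes, key=spread)
        if spread(box) == 0:
            break  # every remaining box holds a single colour
        # split along the channel with the greatest range, at the median
        ch = max(range(3), key=lambda k: max(c[k] for c in box) -
                                         min(c[k] for c in box))
        box.sort(key=lambda c: c[ch])
        mid = len(box) // 2
        boxes.remove(box)
        boxes += [box[:mid], box[mid:]]
    return [tuple(sum(c[ch] for c in box) // len(box) for ch in range(3))
            for box in boxes]

palette = median_cut([(250, 10, 10), (240, 20, 5),
                      (10, 10, 250), (5, 30, 240)], 2)
print(palette)  # two representatives: one reddish, one bluish
```

Even this toy version shows where the geometry lives: every split needs per-channel ranges and sorts, which is exactly the arithmetic that was slow in Perl.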

I asked Professor Hahne about this and she confirmed that her collaborator Professor Nick Chater of Warwick University is interested in this very question. When I have got this week out of the way I may have a look at his published papers and see if there is anything interesting there.


The distance-measuring tradition is a fascinating one, but its arithmetic simplicity means it can behave very strangely with some forms of data. If you’re interested, you might also want to look at some of the theoretical literature from systematics on all this – both the phenetic, numerical taxonomy tradition represented by Sokal & Sneath & their successors (arguably including Felsenstein) and the cladistic tradition represented by Hennig, and two rival schools of successors – on the one hand the “pattern cladism” of people like Nelson & Platnick, Olivier Rieppel (who wrote a great paper on the concept of similarity) and more recently Ebach & Williams, and on the other the more mainstream “phylogenetic cladist” interpretation of people like JS Farris. Systematists are very prone to factionalism, but it’s worth looking at the work of the many schools – pattern cladism, parsimony, likelihood, the now-dominant Bayesian approach, etc. – and looking too at both the criteria these different approaches use for choice between hypotheses of similarity, and the problems to which they are known to be prone (e.g. parsimony and long branch attraction).

Obviously systematics is a bit different as you’d expect the classification of living organisms to have a hierarchic structure, but some of the interesting epistemic debates from that period are (i) the extent to which that assumption is necessary, appropriate or helpful, (ii) how you can test for the presence of hierarchic structure in a dataset (e.g. permutation tail probability test), and (iii) the polythetic groups problem, first spotted by the French systematist Adanson in his 1763 book on the familial classification of the angiosperms (which includes the fantastic line ‘we must search for the system in nature herself, if indeed there is one’), but known to a wider intellectual audience through Wittgenstein’s reformulation of it as a “family” issue (he used ‘games’ as an example) as published in Philosophical Investigations.

I spent most of a happy summer at university reading back issues of Systematic Zoology from the late 1970s, when the debates were at their fiercest – if you have JSTOR access and are interested, hunt for articles in Syst Zool by Michener about the classification of bees, and by Nelson and Platnick on ‘Philosophy and the transformation of cladistics’. There’s also a paper by Rod Page on component analysis and the information content of cladograms, from c.1986, which is well worth reading (unfortunately I have entirely forgotten the reference): together they are as good a place to start as any.

Anyway, not sure if that’s wholly relevant, but might well be interesting!

Wow, thanks Tim. Lots to look at there.

Relating “similarity” to Kolmogorov complexity is interesting, but I’m a bit skeptical. Which pair do you think is more similar – a real vector and its norm, or a prime number and the next higher prime? The program for the first seems much easier (and shorter) to me, but I would be inclined to say that consecutive primes are more similar than a vector and its (scalar) norm.

Paul, interesting points. But I wonder whether, as a mathematician, you are a representative sample of humanity 🙂 Would most people agree with you? I am not sure I do, possibly because the primes seem somewhat mystical and vectors and their norms do not.

Granted, I am better looking than most.🙂 How about this example? Object A is a text document (Tennyson’s much overlooked “Charge of the Heavy Brigade”). Object B is the MD5 hash of object A. Object C is a translation of object A into grammatically correct (if not necessarily fluent or rhythmic) Urdu. I’d be inclined to say that C is more similar to A than B is to A, but the A to B conversion program should be shorter than the A to C conversion program.
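Paul’s counterexample is easy to demonstrate. A minimal sketch, using stand-in texts (not the actual Tennyson poem or an Urdu translation) and Python’s difflib surface-similarity ratio as a crude stand-in for perceived similarity:

```python
import difflib
import hashlib

# Hypothetical stand-in for object A (the comment's example was a Tennyson poem)
text_a = ("Half a league onward rode the brigade through the valley, "
          "guns to the left of them, guns to the right of them.")
# Stand-in for object C: a paraphrase, keeping some surface overlap the way
# a real translation might keep structure and loanwords
text_c = ("Onward through the valley rode the brigade half a league, "
          "guns to the right of them, guns to the left of them.")
# Object B: the MD5 hash -- produced by a one-line "program"
text_b = hashlib.md5(text_a.encode()).hexdigest()

sim_ab = difflib.SequenceMatcher(None, text_a, text_b).ratio()
sim_ac = difflib.SequenceMatcher(None, text_a, text_c).ratio()

print(f"A vs B (hash):       {sim_ab:.2f}")
print(f"A vs C (paraphrase): {sim_ac:.2f}")
# The hash is far cheaper to program, yet far less similar by any surface
# measure -- the tension with a naive Kolmogorov account of similarity.
```

The A-to-B “program” is one library call, yet the hash shares essentially nothing with the original; the translation program would be vastly longer while its output stays recognisably close.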

OK, I’ll give you that one, as I am sure no one would suggest the hash was more similar than the translation even with the Urdu script.

Pingback: The Crimea, Urdu and MD5 hashes: a partial reply to Professor Rubin « cartesian product