Give yourself a Christmas present: learn sed


A Shebang, also Hashbang or Sharp bang. (Photo credit: Wikipedia)

Text is at the core of The Unix Way – and all True Unix Hackers work from the command line. This much you know.

(If you don’t, get a copy of The Art of Unix Programming – there is an awful lot of rubbish in that book, but it does do one thing well: explain the deep connection between text and Unix.)

In a practical sense this means that to get the best from your Unix system (and this includes you if you are a Mac OS X user) you need to boost your command line skills. The first thing to do is, of course, become familiar with a text editor – either vi or emacs (I am a vi user, but refuse to engage in a religious war on this matter.)

Then, perhaps not the next thing, but one of the next things you should do is learn sed – the stream editor – one of the many gifts to the world (including Unix itself, of course) from Bell Labs (I recently read The Idea Factory: Bell Labs and the Great Age of American Innovation and I suppose I really ought to get around to writing a review of that).

Sed comes from the 1970s but, as so often in computing, it feels to me that its time has come again – in the era of big data, a program that lets you edit a file one line at a time, rather than trying to read as much of the file as possible into your computer’s memory, is exactly what is needed.

If you are sufficiently long in the tooth to have messed about with Microsoft’s edlin or any other line editor you might be forgiven for giving a hollow laugh at this point – but sed is a tool that genuinely repays the effort you have to make to learn it.

In the last few weeks I have been messing about with 220GB XML files, and even the University of York’s big iron compute server cannot handle a buffered edit of a file that size – sed is the only realistic alternative. (Actually I thought about using my own hex editor – hexxed – which is also essentially a line editor, but a hex editor is really for messing about with binary files and I wouldn’t recommend it.)

Sed has allowed me to fix errors deep inside very large files with just a few commands – eg:

LANG=C sed '51815253s@^.*$@<instruction address='\''004cf024'\'' size='\''03'\'' />@' infile.xml > outfile.xml

Fixes line 51,815,253 in my file (the line identified by an XML fatal error). Earlier I had executed another line of sed to see what was wrong with that line:

LANG=C sed -n '51815253p' infile.xml

(The LANG=C prefix is because the breakage involved an alien locale seemingly being injected into my file.)
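
The same print command, pointed at the output file, makes a handy sanity check that the substitution did what was intended (a quick look rather than a full XML validation):

LANG=C sed -n '51815253p' outfile.xml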

Sed allows you to do much more – for instance anything you can identify through a pattern can be altered. Let’s say you have (text) documents with your old email address – me@oldaddress.com – and you want to change that to your new address – me@newaddress.com …

sed 's/me@oldaddress\.com/me@newaddress\.com/g' mytext.txt > newtext.txt

Then check newtext.txt for correctness before using mv to replace the original.
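
One way of doing that check – just a sketch, any diff-and-rename routine will do – is:

diff mytext.txt newtext.txt   # review exactly what changed
mv newtext.txt mytext.txt     # then overwrite the original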

But there is much, much more you can do with it.
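
For a flavour, here are a few more one-liners (these assume GNU sed, the version found on most Linux systems; somefile.txt is just a placeholder):

sed -n '10,20p' somefile.txt            # print only lines 10 to 20
sed '/^#/d' somefile.txt                # delete comment lines
sed -i.bak 's/foo/bar/g' somefile.txt   # edit in place, keeping a .bak backup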

Plus you get real cred as a Unix hacker if you know it.

Now, too many programs these days – especially anything from Redmond – go out of their way to suppress text formats. Text, after all, is resistant to the “embrace and extend” methodology – text wants to be free. But there is plenty of it out there still.

Books that teach you about sed are not so plentiful – I have been reading an old edition of sed & awk – which seems to be out of print – though you can buy a second-hand copy for less than a quid, excluding postage costs. Well worth the investment, I’d say.

Big data to save the planet? A Question To Which The Answer Might Not Be No


British readers of a certain age may remember a groundbreaking TV series from the autumn of 1979 – The Mighty Micro – in which Christopher Evans discussed the impact of the coming microchip revolution. (The series was broadcast after Evans had died, aged just 48).

In many ways the programmes – from what I can remember (and I was an avid viewer) – rather underestimated the impact of what was to follow. But the last programme did – and still does – stick in the memory because of what seemed, and seems, like a hyper-optimistic prediction: that microchips could save us from war.

Essentially Evans’s view was that, by hugely increasing computational power, micro-powered computing would allow us to predict the outcome of military conflict accurately and so prevent it (why start a war when you know you are bound to lose, or when, even if you win, you and your domestic critics know it will devastate your society?).

There are a lot of flaws in this argument. One only has to think of the jihadist claim to “love death” to recognise that the certainty of defeat might not be deterrent enough, and the 2008 financial crisis also demonstrates that increased computing power might just create new ways to mess things up, not to solve them.

But, but, but… maybe there is something to it after all. This week’s New Scientist reports on the release of the “Global Database of Events, Language and Tone” (GDELT) data set and the way it has been successfully used by Jay Yonamine, then a PhD student at Penn State, to model the spread of conflict in Afghanistan.

Yonamine was able to successfully model how the conflict would spread through Afghanistan using GDELT, which geolocates major news stories and uses natural language processing to store a very short summary of each.

Modelling how the conflict spread is not, of course, the same as predicting where the next jihadist-inspired conflict will take place, but it may be the first step towards being able to draw out the undercurrents of news stories and issue early warnings. The key question is whether it can be an effective leading indicator.

Maybe the idea has promise. At the very beginning of my memories of the world are the events of August 1969, when the British Army was drafted on to Northern Ireland’s streets to avoid a bloodbath. Just six months before, no one would have predicted that would happen – even if the tempo of civil disputation had been increasing – and certainly no one expected the troops to stay on the streets, as they did, for the next 30 years. And more importantly, perhaps, nobody – beyond some zealots on either side – would have wanted either outcome.

Again, think of the 2007 – 2008 financial crisis. Could it have been foreseen as early as 2004? Certainly some politicians claim that it could – but how could you tell whether they were any good at prediction? Mitt Romney, a pretty serious person after all, really believed that he would be president even on the night of election day – does that mean everything he says is nonsense, or just some things?

Using Google Trends to invest

Big data might help sort some of that out too. Another piece of research highlighted by New Scientist, and undertaken by Tobias Preis at the University of Warwick, Helen Susannah Moat and H. Eugene Stanley of Boston University, suggests that an investment strategy based on analysis of Google Trends could have made substantial sums over the 2004 – 2011 period (see graph).

Their abstract states:

Crises in financial markets affect humans worldwide. Detailed market data on trading decisions reflect some of the complex human behavior that has led to these crises. We suggest that massive new data sources resulting from human interaction with the Internet may offer a new perspective on the behavior of market participants in periods of large market movements. By analyzing changes in Google query volumes for search terms related to finance, we find patterns that may be interpreted as “early warning signs” of stock market moves. Our results illustrate the potential that combining extensive behavioral data sets offers for a better understanding of collective human behavior.

So, risking the wrath of John Rentoul, this could be A Question To Which The Answer Might Not Be No.

“Big data” suggests online poker “relatively benign”


PokerTracker 3 includes a HUD overlay (image via Wikipedia)

This week’s New Scientist reports (currently only available to subscribers) that the “Big Data” revolution has now encompassed online poker – with data collected on four million online players between September 2009 and March 2010.

I have never played online poker – I do like the game but I am hopeless at it and losing money brings me no thrill! But all the same the results are fascinating to me and I think they are also important in public policy terms.

The UK relatively recently liberalised its laws on gambling – but in the face of much controversy and moral panic the legislation was never implemented in full. Horror stories of gambling addiction abounded. But is addiction a big problem?

Not for most players, it seems. Kahlil Philander (what a great name!) from the University of Nevada, Las Vegas (not the person who collected the data – that was Ingo Fiedler at the University of Hamburg) says “online poker is a relatively benign activity for 95 to 99 per cent of users”. The remaining 1 – 5% are a mixture of “pathological gamblers” and professionals. Are there other policy areas where we would let the issues facing perhaps less than 1% of the population block what is benign for the others? I am not convinced.

The US, despite draconian laws on online gambling and what look like attempts to enforce the law extra-territorially, provided 23.7% of all players; next came Germany (where it is also supposedly illegal but there is no real enforcement) with 9.6%, followed by (fully legal) gamblers in France (7.4%), Russia (6.7%), Canada (5.7%) and the UK (4.5%).

Half of online players played for less than an hour a month, while 6% played for more than 100 hours. And about 94% of players pay poker sites less than $500 in a six-month period (in fact about a quarter of players pay the sites less than a dollar in six months, and more than half pay in around $2.40 a month or less).

So, the games are not a threat to most people – but are they a realistic way of making money? The answer very clearly is no, and what money professionals make is down to very hard work, according to a sidebar on the main article.

Essentially it is much easier to lose money at poker than it is to win it, and it is also the case that for most of us luck (or rather random processes), not skill, will dominate our rate of loss or return. Typical (losing) players have to play 1,560 hands (I am sure that is more than I will manage in a lifetime) before skill predominates over luck, and for professionals the number rockets to 35,450.