A conundrum for a sed wizard


I am hoping to get maximum exposure for this problem, so that someone can point me to a solution.

I have a large XML file with one corrupt line – line 35,185,222

A very simple sed script prints out the broken line – as a single line (this is important!):

sed -n '35185222p' infile.xml

Gives this as output:

<load address='11c1�����ze='08' />

But if I change my sed script (sticking it in a file to avoid having to escape the quotes) like so:

35185222s@^.*$@<load address='11c1385b' size='08' />@p

sed -n -f seddy.sed infile.xml

The script fails to match – because sed's .* stops at the first corrupt character, so $ never lines up with the real end of line 35185222.

I know this because a script like:

35185222s@^.*@XXX@p

will output

XXX�����ze='08' />

So how do I fix this?


Update: I have been reading sed & awk but have only just got to the c command – which I could have used to replace the whole line without a regex. But I was also interested in a real fix – and thanks to Hugh and others in the comments I now know that it comes from specifying the locale with LANG=C, so that sed treats the input as plain bytes rather than as multibyte characters.
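
For anyone who wants to try both fixes on something smaller than a 35-million-line file, here is a rough sketch. The three-line sample file, the byte values \377\376\375 and the variable names are my own stand-ins, not the real data, and I have used LC_ALL=C rather than LANG=C on the assumption that LC_ALL may already be set in your environment, in which case it takes precedence over LANG.

```shell
# Stand-in for the damaged XML: line 2 holds bytes that are not valid UTF-8.
tmp=$(mktemp)
printf "<a />\n<load address='11c1\377\376\375ze='08' />\n<b />\n" > "$tmp"

# The same substitution as in the post, kept in a script file to avoid
# escaping the quotes (the address is 2 here instead of 35185222).
sscript=$(mktemp)
cat > "$sscript" <<'EOF'
2s@^.*$@<load address='11c1385b' size='08' />@p
EOF

# In the C locale every byte counts as one character, so .* spans the bad bytes.
fixed=$(LC_ALL=C sed -n -f "$sscript" "$tmp")
printf '%s\n' "$fixed"

# The c command replaces the addressed line without using a regex at all.
cscript=$(mktemp)
cat > "$cscript" <<'EOF'
2c\
<load address='11c1385b' size='08' />
EOF
changed=$(sed -f "$cscript" "$tmp" | sed -n '2p')
printf '%s\n' "$changed"

rm -f "$tmp" "$sscript" "$cscript"
```

On the real file the same idea would be LC_ALL=C sed -f seddy.sed infile.xml > fixed.xml, with the trailing p dropped from the script so that every line is written out exactly once.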