I am hoping to get maximum exposure for this problem, so I get a solution.
I have a large XML file with one corrupt line – line 35,185,222.
A very simple sed script prints out the broken line – as a single line (this is important!):
sed -n '35185222p' infile.xml
Gives this as output:
<load address=’11c1�����ze=’08’ />
Next I change my sed script (sticking it in a file, seddy.sed, to avoid having to escape the quotes) like so:
35185222s@^.*$@<load address=’11c1385b’ size=’08’ />@p
and run it with:
sed -n -f seddy.sed infile.xml
The script fails to match – because sed sees line 35185222 ending with the first corrupt character.
I know this because a script like:
35185222s@^.*@XXX@p
will output
XXX�����ze=’08’ />
So how do I fix this?
Update: I have been reading sed & awk but have only just got to the c command – which I could have used. But I was also interested in a real fix – and thanks to Hugh and others in the comments I now know that comes from specifying the locale with LANG=C.
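To make the locale fix concrete, here is a minimal sketch on a throwaway file rather than the real 35-million-line one. The file contents, the mktemp scratch file and the plain ASCII quotes in the replacement text are assumptions for illustration. It uses LC_ALL=C rather than LANG=C because LC_ALL overrides any LC_CTYPE or LANG setting already in the environment. In the C locale every byte counts as a character, so ^.*$ can span the corrupt bytes.

```shell
# Scratch file (hypothetical data): line 2 contains bytes that are not valid UTF-8.
sample=$(mktemp)
printf 'line one\n<load address=11c1\377\376\375ze=08 />\nline three\n' > "$sample"

# In the C locale "." matches any single byte, so the whole line is replaced.
fixed=$(LC_ALL=C sed -n '2s@^.*$@<load address="11c1385b" size="08" />@p' "$sample")
printf '%s\n' "$fixed"

rm -f "$sample"
```

For the real file the equivalent would presumably be LC_ALL=C sed -f seddy.sed infile.xml > outfile.xml, with -n left off and the p flag dropped from the script so that every line is written out once.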
12 responses to “A conundrum for a sed wizard”
Your pattern matches a line with exactly one character. The target line has more than one character.
You probably want the pattern ^.*$
That matches any whole line — a bit of a blunt instrument.
Sorry, I should have put that in in the first place – that pattern ^.*$ is the one that fails to match. I will edit the post to make this clear.
Sorry, I misread the original message (in my text-only mail reader).
Use the sed “c” command?
Why: because sed’s line recognizer seems to be different from the regular expression’s line recognizer.
Things might be clearer if you turn Unicode handling off. Either run export LANG=C as a shell command before running sed, or put LANG=C as a prefix on the command invoking sed:
LANG=C sed -n '35185222p' infile.xml
You might have to pipe the result to od -c to see the odd characters.
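A small sketch of that inspection step, run against a made-up three-line file (the file and its contents are assumptions, not the original data). od -c prints each byte that is not a printable character as a three-digit octal value, which makes the corrupt bytes easy to spot.

```shell
# Scratch file (hypothetical data): line 2 carries two bytes that are not valid UTF-8.
sample=$(mktemp)
printf 'line one\nbad \377\376 bytes\nline three\n' > "$sample"

# Print only line 2 in the C locale and dump it byte by byte.
dump=$(LC_ALL=C sed -n '2p' "$sample" | od -c)
printf '%s\n' "$dump"

rm -f "$sample"
```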
I agree with what Hugh suggested:
LANG=C sed -e "35185222c" infile.xml >outfile.xml
Please make sure that “wordpress” hasn’t messed up my quoting here. This worked sanely for me when I tried it on my own command-line. I even tested with a few oddball characters on the line-to-be-fixed.
Using the LANG=C prefix does indeed seem to fix the problem. Many thanks!
I would just use head to print the first N lines, echo to print the replacement line, and then tail to print the rest.
In lieu of a proper answer to why sed isn’t working:
head -n 35185221 infile.xml > tmp.xml
echo "" >> tmp.xml
tail -n +35185223 infile.xml >> tmp.xml
The echo command in my previous answer is obviously echoing the line you want line 35185222 to be, but it has been stripped out (probably to avoid XSS problems).
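The same head/echo/tail idea, sketched on a five-line scratch file so it can be tried safely. The file contents, the variable names and the replacement text "new line" are all assumptions for illustration. printf is used in place of echo only to avoid echo's varying treatment of backslashes.

```shell
# Scratch input (hypothetical data): line 3 is the corrupt one.
sample=$(mktemp)
out=$(mktemp)
printf 'line one\nline two\nbroken \377\376 line\nline four\nline five\n' > "$sample"

n=3                                        # number of the line to replace
head -n $((n - 1)) "$sample" > "$out"      # everything before line n
printf '%s\n' 'new line' >> "$out"         # the replacement line
tail -n +$((n + 1)) "$sample" >> "$out"    # everything after line n

result=$(cat "$out")
printf '%s\n' "$result"

rm -f "$sample" "$out"
```

Neither head nor tail interprets the bytes inside a line, so the corrupt characters never come into play.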
Create a script that replaces line 35185222 with the correct content, instead of trying to do regex matching. Here is an example with a much shorter file where I’m replacing line 3:
kgrossjo@marcie:~$ cat x
line one
line two
line three
line four
line five
kgrossjo@marcie:~$ cat x.sed
3c\
new line
kgrossjo@marcie:~$ sed -f x.sed x
line one
line two
new line
line four
line five
kgrossjo@marcie:~$
It’s almost certainly a locale issue. Run sed as `LC_ALL=C sed`.
If that doesn’t work, post the output of `locale` and an octal or hex dump of the line so we can see what it is (I prefer `hexdump -c`).
Why don’t you delete the bad line and insert a good line rather than trying to fix the broken line?
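One possible reading of this suggestion, sketched with sed's i (insert) and d (delete) commands addressed by line number. The scratch file, the script file and the text "new line" are assumptions for illustration. Because nothing here uses a regular expression, the corrupt bytes are never matched against a pattern.

```shell
# Scratch input (hypothetical data): line 3 is the corrupt one.
sample=$(mktemp)
script=$(mktemp)
printf 'line one\nline two\nbroken \377\376 line\nline four\nline five\n' > "$sample"

# sed script: write the good line before line 3, then delete line 3 itself.
cat > "$script" <<'EOF'
3i\
new line
3d
EOF

result=$(sed -f "$script" "$sample")
printf '%s\n' "$result"

rm -f "$sample" "$script"
```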