A conundrum for a sed wizard

I am hoping to get maximum exposure for this problem, so I get a solution.

I have a large XML file with one corrupt line – line 35,185,222.

A very simple sed script prints out the broken line – as a single line (this is important!):

sed -n '35185222p' infile.xml

Gives this as output:

<load address='11c1�����ze='08' />

But if I change my sed script (sticking it in a file to avoid having to escape the quotes) like so:

35185222s@^.*$@<load address='11c1385b' size='08' />@p

sed -n -f seddy.sed infile.xml

The script fails to match – because sed sees line 35185222 ending with the first corrupt character.

I know this because a script like:

35185222s@^.*@XXX@p

will output

XXX�����ze='08' />

So how do I fix this?

 

Update: I have been reading sed & awk but have only just got to the c command – which I could have used. But I was also interested in a real fix – and thanks to Hugh and others in the comments I now know that comes from specifying the locale with LANG=C.
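
For anyone who wants to try this without a 35-million-line file, here is a minimal sketch of the fix. As far as I understand it, in a UTF-8 locale the `.` in a regex will not match bytes that do not form a valid character, so `.*` stops at the first corrupt byte; in the C locale every byte counts as a character. The sample file, its contents and the bad bytes (octal 377 and 376) are made up for illustration, and I have used double quotes inside the XML purely to keep the shell quoting simple. I force the locale with LC_ALL rather than LANG because LC_ALL overrides the other locale variables.

```shell
# Build a three-line sample whose second line contains bytes that are not
# valid UTF-8 (hypothetical data, not the real file).
tmp=$(mktemp)
printf 'line one\n<load address="11c1\377\376ze="08" />\nline three\n' > "$tmp"

# In the C locale every byte is a character, so ^.*$ spans the whole line
# and the substitution replaces it completely.
fixed=$(LC_ALL=C sed -n '2s@^.*$@<load address="11c1385b" size="08" />@p' "$tmp")
printf '%s\n' "$fixed"

rm -f "$tmp"
```

Running the same sed command without the LC_ALL=C prefix in a UTF-8 locale is a quick way to see whether your sed shows the original problem.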


12 responses to “A conundrum for a sed wizard”

  1. Your pattern matches a line with exactly one character. The target line has more than one character.
    You probably want the pattern ^.*$
    That matches any whole line — a bit of a blunt instrument.

    1. Sorry, I should have put that in in the first place – that pattern ^.*$ is the one that fails to match. I will edit the post to make this clear.

  2. Sorry, I misread the original message (in my text-only mail reader).

    Use the sed “c” command?
    Why: because sed’s line recognizer seems to be different from the regular expression’s line recognizer.

    Things might be clearer if you turn Unicode handling off: run export LANG=C as a shell command before running sed, or put LANG=C as a prefix on the command invoking sed:
    LANG=C sed -n '35185222p' infile.xml

    You might have to pipe the result to od -c to see the odd characters.

    1. I agree with what Hugh suggested:

      LANG=C sed -e "35185222c" infile.xml >outfile.xml

      Please make sure that “wordpress” hasn’t messed up my quoting here. This worked sanely for me when I tried it on my own command-line. I even tested with a few oddball characters on the line-to-be-fixed.

    2. Using the LANG=c prefix does indeed seem to fix the problem. Many thanks!

  3. I would just use head to print the first N lines, echo to print the replacement line, and then tail to print the rest

  4. In lieu of a proper answer to why sed isn’t working:-

    head -n 35185221 infile.xml > tmp.xml
    echo "" >> tmp.xml
    tail -n +35185223 infile.xml >> tmp.xml

    1. The echo command in my previous answer is obviously echoing the line you want line 35185222 to be, but it has been stripped out (probably to avoid XSS problems).
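
The head/echo/tail idea can be sketched on a five-line file like this. The file contents, the line number n and the replacement text "new line" are placeholders for illustration; on the real file n would be 35185222 and the replacement would be the repaired XML line. I use printf instead of echo so that a replacement containing backslashes is written out unchanged.

```shell
# Hypothetical sample data standing in for the big XML file.
src=$(mktemp)
out=$(mktemp)
printf 'line one\nline two\nbroken\nline four\nline five\n' > "$src"

n=3                                       # number of the line to replace
head -n "$((n - 1))" "$src" > "$out"      # everything before line n
printf '%s\n' 'new line' >> "$out"        # the replacement line
tail -n "+$((n + 1))" "$src" >> "$out"    # everything after line n

result=$(cat "$out")
printf '%s\n' "$result"

rm -f "$src" "$out"
```

Since head and tail only count newlines and never apply a regex to the line contents, the corrupt bytes should not matter to this approach.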

  5. Create a script that replaces line 35185222 with the correct content, instead of trying to do regex matching. Here is an example with a much shorter file where I’m replacing line 3:

    kgrossjo@marcie:~$ cat x
    line one
    line two
    line three
    line four
    line five
    kgrossjo@marcie:~$ cat x.sed
    3c\
    new line
    kgrossjo@marcie:~$ sed -f x.sed x
    line one
    line two
    new line
    line four
    line five
    kgrossjo@marcie:~$

  6. It’s almost certainly a locale issue. Run sed as `LC_ALL=C sed`

    If that doesn’t work: Post the output of `locale`, and post an octal dump/hex dump of the line, so we can see what it is (I prefer `hexdump -c`)
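
A rough sketch of the kind of dump being asked for, using od -c as suggested earlier in the thread. The two-line sample file and the stray byte (octal 377) are invented for the example; on the real file the line number would be 35185222.

```shell
# Hypothetical sample: the second line contains one byte that is not valid UTF-8.
sample=$(mktemp)
printf 'ok\nbad\377line\n' > "$sample"

# Print only line 2 and show it byte by byte; od -c displays bytes that are
# not printable characters as three-digit octal numbers.
dump=$(LC_ALL=C sed -n '2p' "$sample" | LC_ALL=C od -c)
printf '%s\n' "$dump"

rm -f "$sample"
```

The odd bytes show up as octal numbers in the middle of the otherwise readable text, which makes it easy to see where the corruption starts and ends.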

  7. Why don’t you delete the bad line and insert a good line rather than trying to fix the broken line?
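
One way the delete-and-insert suggestion could look in sed, as a sketch on a made-up five-line file: the i command writes the good line ahead of line 3, and d then drops the bad line without ever running a regex over it. The line number and the text "new line" are placeholders.

```shell
# Hypothetical sample data; line 3 plays the part of the corrupt line.
src=$(mktemp)
printf 'line one\nline two\nbroken\nline four\nline five\n' > "$src"

# 3i\ inserts the replacement text before line 3, then 3d deletes line 3 itself.
result=$(LC_ALL=C sed '3i\
new line
3d' "$src")
printf '%s\n' "$result"

rm -f "$src"
```

The end result is the same as the c command shown in the earlier comment, just spelled out as two steps.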
