How to use sed to fix an xml issue

I have an xml with the following (invalid) structure
<tag1>text1<tag2>text2</tag1><tag3>text3</tag3><tag1></tag2>text4</tag1>
I want to use sed to change it into
<tag1>text1<tag2>text2<tag3>text3</tag3></tag2>text4</tag1>
i.e. I want to remove </tag1>...<tag1> (and move everything in between under the enclosing tag1), if I encounter an invalid xml substring as <tag1></*
I have tried using sed without success (one such attempt is below)
sed -e 's/<\/tag1>\(.*\)<tag1><\//\1<\//g'
It works with the example above, but if the same condition occurs twice, it removes only the first </tag1> and the last <tag1> instead of performing the replacement twice:
echo '<tag1>text1<tag2>text2</tag1><tag3>text3</tag3><tag1></tag2>text4</tag1><tag1>text5<tag4>text6</tag1><tag3>text7</tag3><tag1></tag4>text8</tag1>' | sed -e 's/<\/tag1>\(.*\)<tag1><\//\1<\//g'
outputs
<tag1>text1<tag2>text2<tag3>text3</tag3><tag1></tag2>text4</tag1><tag1>text5<tag4>text6</tag1><tag3>text7</tag3></tag4>text8</tag1>
I think sed expands the RE to cover the largest possible selection, but what should I do if I do not want it to do that?

You want non-greedy matching, but to the best of my knowledge, sed doesn't support it. Can you use perl or do you have to use sed?
Try: perl -p -e 's/<\/tag1>(.*?)<tag1>(<\/.+?<\/tag1>)/$1$2/g'
I think the issue is that the regex has to match through to the end of the actual closing tag, or else that closing tag becomes the beginning of the next match.

sed 's|</tag1><tag3>|<tag3>|;s|</tag3><tag1>|</tag3>|' file.xml
Output:
<tag1>text1<tag2>text2<tag3>text3</tag3></tag2>text4</tag1>

This might work for you (GNU sed):
sed -r 's/<tag1>/\n/g;s/<\/tag1>(<tag3>[^\n]*)\n/\1/g;s/\n/<tag1>/g' file
Reduce <tag1> to a unique character i.e \n then use the negated character class [^\n] to obtain non-greedy matching. Following the changes reverse the initial substitution.
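As a rough check, the marker trick can be run against the two-occurrence input from the question. This is only a sketch, assuming GNU sed (for -r and for \n in the pattern, the bracket expression and the replacement); the variable names input and fixed are illustrative.

```shell
# Two broken regions on one line, as in the question.
input='<tag1>text1<tag2>text2</tag1><tag3>text3</tag3><tag1></tag2>text4</tag1><tag1>text5<tag4>text6</tag1><tag3>text7</tag3><tag1></tag4>text8</tag1>'

# 1. Replace every <tag1> with a newline, which cannot otherwise occur inside a line.
# 2. [^\n]* cannot cross a marker, so each </tag1>...<tag1> pair is matched separately.
# 3. Turn the remaining markers back into <tag1>.
fixed=$(printf '%s\n' "$input" |
  sed -r 's/<tag1>/\n/g;s/<\/tag1>(<tag3>[^\n]*)\n/\1/g;s/\n/<tag1>/g')

printf '%s\n' "$fixed"
```

Because the middle substitution still names <tag3> explicitly, it only repairs regions whose misplaced content starts with that tag.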

GNU sed
sed '\,<tag1></,{ s,</tag1>,,; s,<tag1>,,2; }' <<END
<tag1>text1<tag2>text2</tag1><tag3>text3</tag3><tag1></tag2>text4</tag1> <!-- error case -->
<tag1><tag2 /></tag1><tag1><tag3 /></tag1> <!-- should not change -->
END
<tag1>text1<tag2>text2<tag3>text3</tag3></tag2>text4</tag1> <!-- error case -->
<tag1><tag2 /></tag1><tag1><tag3 /></tag1> <!-- should not change -->
If the string <tag1></ is seen, then remove the first </tag1> and the second <tag1>

Related

How to use grep/sed/awk, to remove a pattern from beginning of a text file

I have a text file with the following pattern written to it:
TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"
I would like to discard the first part of each line containing
TIME[32.468ms] -(3)-.............
To test the regular expression I've tried the following:
cat myfile.txt | egrep "^TIME\[.*\]\s\s\-\(3\)\-\.+"
This identifies correctly the lines I want. Now, to delete the pattern I've tried:
cat myfile.txt | sed s/"^TIME\[.*\]\s\s\-\(3\)\-\.+"//
but it just seems to be doing the cat, since it shows the content of the complete file and no substitution happens.
What am I doing wrong?
OS: CentOS 7

With your shown samples, please try following grep command. Written and tested with GNU grep.
grep -oP '^TIME\[\d+\.\d+ms\]\s+-\(\d+\)-\.+\K.*' Input_file
Explanation: Adding detailed explanation for above code.
^TIME\[ ##Matching string TIME from starting of value here.
\d+\.\d+ms\] ##Matching digits(1 or more occurrences) followed by dot digits(1 or more occurrences) followed by ms ] here.
\s+-\(\d+\)-\.+ ##Matching spaces (1 or more occurrences) followed by - digits (1 or more occurrences) - and 1 or more dots.
\K ##Using \K option of GNU grep to make sure previous match is found in line but don't consider it in printing, print next matched regex part only.
.* ##to match till end of the value.
2nd solution: Adding awk program here.
awk 'match($0,/^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+/){print substr($0,RSTART+RLENGTH)}' Input_file
Explanation: using match function of awk, to match regex ^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+ which will catch text which we actually want to remove from lines. Then printing rest of the text apart from matched one which is actually required by OP.

This awk uses its sub() function:
awk 'sub(/^TIME[[][^]]*].*\.+/,"")' file
"TEXT I WANT TO KEEP"
If a replacement is made, sub() returns true (nonzero), so awk prints the modified line.

$ cut -d'"' -f2 file
TEXT I WANT TO KEEP

You may use:
s='TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"'
sed -E 's/^TIME\[[^]]*].*\.+//' <<< "$s"
"TEXT I WANT TO KEEP"

The \s regex extension may not be supported by your sed.
In BRE syntax (which is what sed speaks out of the box) you do not backslash round parentheses - doing that turns them into regex metacharacters which do not match themselves, somewhat unintuitively. Also, + is just a regular character in BRE, not a repetition operator (though you can turn it into one by similarly backslashing it: \+).
You can try adding an -E option to switch from BRE syntax to the perhaps more familiar ERE syntax, but that still won't enable Perl regex extensions, which are not part of ERE syntax, either.
sed 's/^TIME\[[^][]*\][[:space:]][[:space:]]-(3)-\.*//' myfile.txt
should work on any reasonably POSIX sed. (Notice also how the minus character does not need to be backslash-escaped, though doing so is harmless per se. Furthermore, I tightened up the regex for the square brackets, to prevent the "match anything" regex you had .* from "escaping" past the closing square bracket. In some more detail, [^][] is a negated character class which matches any character which isn't (a newline or) ] or [; they have to be specified exactly in this order to avoid ambiguity in the character class definition. Finally, notice also how the entire sed script should normally be quoted in single quotes, unless you have specific reasons to use different quoting.)
If you have sed -E or sed -r you can use + instead of * but then this complicates the overall regex, so I won't suggest that here.
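To make the BRE/ERE difference concrete, here is a small sketch that runs the same substitution both ways on one sample line. The sample line (with two spaces after the closing bracket, which is what the question's \s\s implies) and the variable names are assumptions for illustration.

```shell
line='TIME[32.468ms]  -(3)-............."TEXT I WANT TO KEEP"'

# BRE (default): ( and ) are literal, and * is the only repetition operator used.
bre=$(printf '%s\n' "$line" |
  sed 's/^TIME\[[^][]*\][[:space:]][[:space:]]-(3)-\.*//')

# ERE (-E): + works unescaped, but literal parentheses must now be backslashed.
ere=$(printf '%s\n' "$line" |
  sed -E 's/^TIME\[[^][]*\][[:space:]]+-\(3\)-\.+//')

printf '%s\n%s\n' "$bre" "$ere"
```

Both variables should end up holding only the quoted text, which shows that the two dialects differ in escaping rather than in what they can match here.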

A simpler one for sed:
sed 's/^[^"]*//' myfile.txt

If the text you want to keep is always surrounded by double quotes like this, and those are the only quotes on the lines starting with "TIME...", then:
sed -n '/^TIME/p' file | awk -F'"' '{print $2}'
should get the line starting with "TIME..." and print the text within the quotes.
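The two-stage pipeline can probably be collapsed into a single awk call, since awk can filter on a pattern itself. A sketch under the same assumption (the wanted text is the first double-quoted string on a TIME line); the sample lines and the variable name kept are made up for illustration.

```shell
# /^TIME/ selects the lines; with " as the field separator,
# field 2 is the text between the first pair of quotes.
kept=$(printf '%s\n' \
  'TIME[32.468ms]  -(3)-............."TEXT I WANT TO KEEP"' \
  'some other line with "quoted text"' |
  awk -F'"' '/^TIME/ {print $2}')

printf '%s\n' "$kept"
```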

Thanks all, for your help.
By the end, I've found a way to make it work:
echo 'TIME[32.468ms] -(3)-.............TEXT I WANT TO KEEP' | grep TIME | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
More generally,
grep TIME myfile.txt | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
Cheers,
Pedro

How to replace text with comma [Linux on Windows]

We get these automated emails from our client that have this rough format:
VP##0-X1-#####-#[Revision #:Document title]
VP##0-X2-#####-#[Revision #:Document title]
VP##0-X3-#####-#[Revision #:Document title]
What I want to do:
replace [Revision with a comma
replace : with a comma
delete ]
So that I can convert this into a CSV and then use some excel magic to fill in our tracking sheet.
I've tried to use sed with this general format:
sed -i 's,[Revision ,\,,g' <FILE>
but I don't know how to get a comma in for this case.
This is what I want to get in the end:
VP##0-X1-#####-#,#,Document title
VP##0-X2-#####-#,#,Document title
VP##0-X3-#####-#,#,Document title
Any and all insight is appreciated.
I'm using Ubuntu on Windows.

sed 's/\[Revision /,/;s/:/,/;s/]//' inputfile
VP##0-X1-#####-#,#,Document title
VP##0-X2-#####-#,#,Document title
VP##0-X3-#####-#,#,Document title
There is no need for back-references or for multiple sed invocations. You can issue several substitution commands within a single sed command:
Syntax:
sed 's/a/A/' file
sed 's/b/B/' file
sed 's/c/C/' file
Can be combined into one command:
sed 's/a/A/;s/b/B/;s/c/C/' file #note the semicolon separating multiple replace operations.
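A quick way to check the combined command is to run it on one sample line. The sketch below uses a made-up document number and a title that itself contains a colon: because none of the substitutions carries the g flag, only the first colon on the line becomes a comma.

```shell
# Hypothetical sample line; the real emails follow the VP##0-X1-#####-# pattern.
line='VP110-X1-12345-1[Revision 2:Pump spec: part two]'

# Three substitutions in one sed script, separated by semicolons.
csv=$(printf '%s\n' "$line" | sed 's/\[Revision /,/;s/:/,/;s/]//')

printf '%s\n' "$csv"
```

If titles may contain commas as well, the resulting CSV would need quoting, which this sketch does not attempt.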

You can use:
sed -Ei 's/(.*)(\[Revision)(.*)(:)(.*)(])/\1,\3,\5/' <FILE>
Testing it with one line and an echo:
$ echo "[VP##0-X1-#####-#[Revision #:Document title]" | sed -E 's/(.*)(\[Revision)(.*)(:)(.*)(])/\1,\3,\5/'
[VP##0-X1-#####-#, #,Document title
Explanation:
'(.*)(\[Revision)(.*)(:)(.*)(])
The regular expression in the first half of the sed command is divided into 6 groups defined by ().
Group 2 (\[Revision) will match "[Revision" and group 4 (:) will match ":", the parts of the string you want to replace.
/\1,\3,\5/'
In the second part of the command, the same groups can be used as the replacement text, so I used group 1 (\1) to preserve everything before "[Revision", then use a comma ',', then use group 3 (\3) (everything between "[Revision" and ":"), a comma ",", and finally group 5 (\5). Group 6 will match the final ']', so it is not used as you wanted to remove it.

The [ must be escaped since it is a special character in regular expressions. Also, it may be better to use a character other than , as the separator in the sed command, since the replacement text is itself a comma. This should do the trick:
sed -i 's/\[Revision /,/g' <FILE>

With sed, / is a pretty common separator. Also, square brackets are special characters and need to be escaped.
replace [Revision with a comma
sed -i 's/\[Revision /,/g' <FILE>
replace : with a comma
sed -i 's/:/,/g' <FILE>
delete ]
sed -i 's/\]//g' <FILE>

Sed or Awk or Perl substitution in a sentence

I need to make a substitution using sed or another program. The patterns <ehh> <mmm> <mhh> are repeated at the beginning of sentences, and I need to replace them with nothing.
I am trying this:
echo "$line" | sed 's/<[a-zA-z]+>//g'
But I get the same result, nothing changes. Can anyone help?
Thank you!

For me, for the test file
<ahh> test
<mmm>test 1
the following
sed 's/^<[a-zA-Z]\+>//g' testfile
produces
test
test 1
which seems to be what you want. Note that for basic regular expressions, you use \+ whereas for extended regular expressions, you use + (and need to use the -r switch for sed).
NB: I added a ^ to the pattern since you said the patterns occur at the beginning of the line.

echo '<ehh> <mmm> <mhh>blabla bla' | \
sed 's/^\([[:space:]]*<[a-zA-Z]\{3\}>\)\{1,\}//'
This removes every leading occurrence of your pattern (including leading whitespace).
The angle brackets are left unescaped so that they match literally: in GNU sed, \< and \> are word-boundary anchors rather than literal characters.
I don't use g, because there is only one beginning of line (^) for the pattern to anchor to; instead, the group \(...\) is repeated with the interval \{1,\}.
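The same idea, one anchored group repeated instead of the g flag, can be sketched in ERE syntax. This assumes a sed that accepts -E (GNU and BSD sed both do) and that the markers are literal angle brackets around letters; the sample lines are illustrative.

```shell
# ^ anchors the match; the group (optional whitespace, then one <word>)
# repeats one or more times, so all leading markers go in a single substitution.
cleaned=$(printf '%s\n' '<ehh> <mmm> <mhh>blabla bla' |
  sed -E 's/^([[:space:]]*<[a-zA-Z]+>)+//')

# A marker in the middle of the line is not touched, because of the ^ anchor.
partial=$(printf '%s\n' '<ehh>keep <mmm> this' |
  sed -E 's/^([[:space:]]*<[a-zA-Z]+>)+//')

printf '%s\n%s\n' "$cleaned" "$partial"
```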

If the goal is to get the last parameter from lines like this:
<ahh> test
<mmm>test 1
You can do:
awk -F'>' '/^<[[:alpha:]]+>/ {print $NF}' <<< "$line"
 test
test 1
It searches for the pattern <[[:alpha:]]+> at the start of the line and prints the last field on the line, using > as the field separator (any space after the > is kept).

regex: not match a group rather than single characters

echo test.a.wav|sed 's/[^(.wav)]*//g'
.a.wav
What I want is to remove every character until it reaches the whole group .wav(that is, I want the result to be .wav), but it seems that sed would remove every character until it reaches any of the four characters. How to do the trick?

Groups do not work inside [], so the dot is part of the character class, as are the parentheses.
How about:
echo test.a.wav|sed 's/.*\(\.wav\)/\1/g'
Note, there may be other valid solutions, but you provide no context on what you are trying to do to determine what may be the best solution.

The feature you're requesting wouldn't be supported by sed (negative lookahead) but Perl does the trick.
$ echo 'test.a.wav' | perl -pe 's/^(?:(?!\.wav).)*//g'
.wav

Instead of regex, you can use awk like this:
echo test.a.wav.more | awk -F".wav" '{print FS$2}'
.wav.more
It splits the data with your pattern, then print pattern and the rest of the data.

This might work for you (GNU sed):
sed ':a;/^\.wav/!s/.//;ta;/./!d' file
or:
sed 's/\.wav/\n&/;s/^[^\n]*\n//;/./!d' file
N.B. This deletes the line if it is empty. If this is not wanted just remove /./!d from the above commands.
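The difference between the greedy approach and the marker approach shows up when .wav occurs more than once on a line. A sketch assuming GNU sed (for \n in the replacement and in the bracket expression); the name a.wav.b.wav is made up.

```shell
name='a.wav.b.wav'

# Greedy .* runs to the LAST .wav, so only the final ".wav" survives.
last=$(printf '%s\n' "$name" | sed 's/.*\(\.wav\)/\1/')

# Marking the FIRST .wav with a newline, then deleting up to the marker,
# keeps everything from the first occurrence onward.
first=$(printf '%s\n' "$name" | sed 's/\.wav/\n&/;s/^[^\n]*\n//')

printf '%s\n%s\n' "$last" "$first"
```

Which of the two behaviours is wanted depends on the data, which the question does not specify.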

How to use sed and regex?

I need to use sed to look for all lines in a file with pattern "[whatever]|[whatever]" so I'm using the following regex:
sed '/\"[a-zA-Z0-9]+\|[a-zA-Z0-9]+\"/p' test2.txt
But it's not working because in this file is returning something when it shouldn't
RTV0031605951US|3160595|20/03/2013|0|"Laurie Graham"|"401"
Does anybody know with regex should I use? Thanks in advance

I see three problems with your regular expression:
+ is not a metacharacter in basic regular expressions (sed's default), so you need to escape it (\+) to get its special meaning.
The pipe has the opposite problem: unescaped, it is an ordinary character, so don't escape it if you want to match it literally.
Sed prints every line by default, so add -n to suppress that, since you already use /p to print the matching lines. Otherwise the matching lines appear twice in the output.
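Putting the three points together, a corrected command might look like the sketch below. It assumes the goal is to print lines that contain a quoted pair such as "abc|123", and it assumes GNU sed for \+ in BRE; the second sample line is invented for illustration.

```shell
# BRE: \+ is the repetition operator, the bare | is a literal pipe,
# and -n with /p prints only the matching lines.
matches=$(printf '%s\n' \
  'RTV0031605951US|3160595|20/03/2013|0|"Laurie Graham"|"401"' \
  'RTV0031605952US|"abc|123"|0' |
  sed -n '/"[a-zA-Z0-9]\+|[a-zA-Z0-9]\+"/p')

# Same regex in ERE: + is unescaped, and the literal pipe is backslashed instead.
matches_ere=$(printf '%s\n' \
  'RTV0031605951US|3160595|20/03/2013|0|"Laurie Graham"|"401"' \
  'RTV0031605952US|"abc|123"|0' |
  sed -nE '/"[a-zA-Z0-9]+\|[a-zA-Z0-9]+"/p')

printf '%s\n%s\n' "$matches" "$matches_ere"
```

The line from the question is no longer printed, because none of its quoted strings contains a pipe between two alphanumeric runs.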

sed will output anything that is a partial match.
To match only whole lines that match your regex, add ^ and $ to the start/end:
sed -n '/^"[a-zA-Z0-9]\+|[a-zA-Z0-9]\+"$/p' test2.txt

sed '/\B\"[ [:alnum:]]\+\"|\"[ [:alnum:]]\+\"\B/!d' file
If you use this in a sed script, do not escape double quotes.
