sed replace between two strings wildcard - regex

I am trying to flag everything inside a color tag and replace it with something else, such as:
I have a [color=blue]dog[/color] and a [color=blue]cat[/color] in my house.
to
I have a [color=blue][b]foobar[/b][/color] and a [color=blue][b]foobar[/b][/color] in my house.
Here is what I've tried:
sample='I have a [color=blue]dog[/color] and a [color=blue]cat[/color] in my house.'
replace='foobar'
sample=$(echo $sample| sed "s/\[color=blue\].*\[\/color\]/\[color=blue\]\[b\]$replace\[\/b\]\[\/color\]/g")
Which gets me:
I have a [color=blue][b]foobar[/b][/color] in my house.
Any idea on how to make sed nongreedy in this case?

Just replace your .* with [^[]* (any character other than left bracket). That is:
"s/\[color=blue\][^[]*\[\/color\]/\[color=blue\]\[b\]$replace\[\/b\]\[\/color\]/g"

sed is always greedy. You can work around it by selecting the regex carefully. The example below is identical to yours except that .* has been replaced with [^[]* (which means everything except [):
$ echo $sample| sed "s/\[color=blue\][^[]*\[\/color\]/\[color=blue\]\[b\]$replace\[\/b\]\[\/color\]/g"
I have a [color=blue][b]foobar[/b][/color] and a [color=blue][b]foobar[/b][/color] in my house.
For truly non-greedy regular expressions, try perl or python.
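To see the two behaviours side by side, here is a small sketch that runs both patterns on the sample sentence from the question. The variable names greedy and bounded are only illustrative.

```shell
sample='I have a [color=blue]dog[/color] and a [color=blue]cat[/color] in my house.'
replace='foobar'

# Greedy: .* runs to the LAST [/color], so both tags collapse into one match.
greedy=$(printf '%s\n' "$sample" |
  sed "s/\[color=blue\].*\[\/color\]/[color=blue][b]${replace}[\/b][\/color]/g")

# Bounded: [^[]* cannot cross a '[', so each tag is matched on its own.
bounded=$(printf '%s\n' "$sample" |
  sed "s/\[color=blue\][^[]*\[\/color\]/[color=blue][b]${replace}[\/b][\/color]/g")

printf '%s\n' "$greedy" "$bounded"
```

The bounded form only works because the text between the tags never contains a [ itself; if it could, a tool with real lazy quantifiers would be the safer choice.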

sed 's#\(\[color=[[:alpha:]]*\]\)[[:alnum:]]*\(\[/color\)#\1[b]foobar[/b]\2#g'
example
echo 'I have a [color=blue]dog[/color] and a [color=blue]cat[/color] in my house.'|sed 's#\(\[color=[[:alpha:]]*\]\)[[:alnum:]]*\(\[/color\)#\1[b]foobar[/b]\2#g'
output
I have a [color=blue][b]foobar[/b][/color] and a [color=blue][b]foobar[/b][/color] in my house.

sed -r 's/(\[color=[a-z]*\])[a-z]*(\[\/color\])/\1[b]foobar[\/b]\2/g' File
or
sed 's/\(\[color=[a-z]*\]\)[a-z]*\(\[\/color\]\)/\1[b]foobar[\/b]\2/g' File
Explanation:
Here we look for three parts: 1. [color=any sequence of lowercase letters], followed by 2. any sequence of lowercase letters, followed by 3. [/color]. Parts 1 and 3 are captured as groups, using ( and ) with -r or \( and \) without it. In the substitution we keep both groups (as \1 and \2) and replace whatever sits between them with [b]foobar[/b].
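Because the opening tag is captured rather than retyped, the same command should keep whatever color name it finds. A minimal sketch, using a made-up input line with two different colors:

```shell
line='A [color=red]dog[/color] and a [color=green]cat[/color].'

# \1 is the opening tag (whatever the color), \2 is the closing tag.
result=$(printf '%s\n' "$line" |
  sed 's/\(\[color=[a-z]*\]\)[a-z]*\(\[\/color\]\)/\1[b]foobar[\/b]\2/g')

printf '%s\n' "$result"
```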

sed will always be greedy. You can use perl if you strictly want non-greedy variant:
$ echo $test
I have a [color=blue]dog[/color] and a [color=blue]cat[/color] in my house.
$ perl -pne 's/(\[color=[a-zA-Z]*\])(.*?)(\[\/color\])/$1\[b\]foobar\[\/b\]$3/g' <<< "$test"
I have a [color=blue][b]foobar[/b][/color] and a [color=blue][b]foobar[/b][/color] in my house.
I guess, you can interpret most of the regex here, except for the tiny syntax change:
(.*?) in place of (.*) dictates that the match is supposed to be non-greedy.
If you skip ? after .*, here is the output you must be getting currently:
$ perl -pne 's/(\[color=[a-zA-Z]*\])(.*)(\[\/color\])/$1\[b\]foobar\[\/b\]$3/g' <<< "$test"
I have a [color=blue][b]foobar[/b][/color] in my house.

As others have stated, you need to emulate non-greedy matching by matching only characters that cannot end the match.
Using a caret inside brackets, [^ABC], means any character except the ones listed.
Combining this with the asterisk * matches only up to the next occurrence of such a character.
For example
[^[]*
Will match everything up to the next [ bracket
Also, everyone is backslash-escaping the brackets in the replacement, which is not needed: the replacement is not a regex, so [ and ] are literal there.
Anyway, here is a command that should work.
sed 's/\(\[color[^]]*\]\)[^[]*\(\[\/color\]\)/\1[b]foobar[\/b]\2/g'
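The point about the replacement side can be checked directly. In this sketch (the variable names are illustrative) the brackets are written unescaped, and only the delimiter and & need a backslash:

```shell
# '[' and ']' are ordinary text in the replacement; the '/' delimiter still needs escaping.
plain=$(printf '%s\n' 'x' | sed 's/x/[b]y[\/b]/')

# '&' stands for the whole match unless it is escaped.
amp=$(printf '%s\n' 'x' | sed 's/x/[&]/')
lit=$(printf '%s\n' 'x' | sed 's/x/[\&]/')

printf '%s\n' "$plain" "$amp" "$lit"
```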

Related

sed and Perl regexp replaces once, with multiple replacements flag

I have the string:
lopy,lopy1,sym,lopy,lopy1,sym
I want the line to be:
lopy,lopy1,sym,lady,lady1,sym
Which means that all "lopy" after the string sym should be replaced. So I ran:
echo "lopy,lopy1,sym,lopy,lopy1,sym" | sed -r 's/(.*sym.*?)lopy/\1lad/g'
I get:
lopy,lopy1,sym,lopy,lad1,sym
Using Perl is not really better:
echo "lopy,lopy1,sym,lopy,lopy1,sym" | perl -pe 's/(.*sym.+?)lopy/${1}lad/g'
yields
lopy,lopy1,sym,lad,lopy1,sym
Not all "lopy" are replaced. What am I doing wrong?
The (.*sym.*?)lopy / (.*sym.+?)lopy patterns are almost the same, .+? matches one or more chars other than line break chars, but as few as possible, and .*? matches zero or more such chars. Mind that sed does not support lazy quantifiers, *? is the same as * in sed. However, the main problem with the regexps you used is that they match sym, then any text after it and then lopy, so when you added g, it just means you want to find more cases of lopy after sym....lopy. And there is only one such occurrence in your string.
You want to replace all lopy after sym, so you can use
perl -pe 's/(?:\G(?!^)|sym).*?\Klopy/lad/g'
See the regex demo. Details:
(?:\G(?!^)|sym) - sym or end of the previous match (\G(?!^))
.*? - any zero or more chars other than line break chars, as few as possible
\K - match reset operator that discards all text matched so far
lopy - a lopy string.
See the online demo:
#!/bin/bash
echo "lopy,lopy1,sym,lopy,lopy1,sym" | perl -pe 's/(?:\G(?!^)|sym).*?\Klopy/lad/g'
# => lopy,lopy1,sym,lad,lad1,sym
If the values are always comma separated, you may replace .*? with ,: (?:\G(?!^)|sym),\Klopy (see this regex demo).
Since the OP mentioned sed, I am adding an awk program here, which could be a better choice than sed. With the shown samples, please try the following awk program.
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
awk -F',sym,' '
{
first=$1
$1=""
sub(/^[[:space:]]+/,"")
gsub(/lop/,"lad")
$0=first FS $0
}
1
'
Explanation: Adding detailed explanation for above.
echo "lopy,lopy1,sym,lopy,lopy1,sym" | ##Printing values and sending as standard output to awk program as an input.
awk -F',sym,' ' ##Making ,sym, as a field separator here.
{
first=$1 ##Creating first which has $1 of current line in it.
$1="" ##Nullifying $1 here.
sub(/^[[:space:]]+/,"") ##Substituting initial space in current line here.
gsub(/lop/,"lad") ##Globally substituting lop with lad in rest of line.
$0=first FS $0 ##Adding first FS to rest of edited line here.
}
1 ##Printing edited/non-edited line value here.
'
The problem is that the lopy(s) to replace are after sym, with a pattern like sym.*?lopy, so a global replacement looks for yet more of the whole sym+lopy-after-sym (not just for all lopys after that one sym).†
To replace all lopys (after the first sym, followed by another sym) we can capture the substring between syms and in the replacement side run code, in which a regex replaces all lopys
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe's{ sym,\K (.+?) (?=sym) }{ $1 =~ s/lop/lad/gr }ex'
To isolate the substring between syms I use \K after the first sym, which drops matches prior to it, and a positive lookahead for the sym after the substring, which doesn't consume anything. The /e modifier makes the replacement side be evaluated as code. In the replacement side's regex we need /r since $1 can't change, and we want the regex to return anyway. See perlretut.
† To match all of abbbb we can't say /ab/g, nor /(a)b/g nor /a(b)/g, because that would look for all repetitions of the whole ab in the string (and find only ab in the beginning).
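The footnote can be reproduced with sed as well. A small sketch, with illustrative variable names:

```shell
# /g repeats the WHOLE pattern "ab"; it does not repeat just the "b" part.
whole=$(printf '%s\n' 'abbbb' | sed 's/ab/X/g')

# To consume every b after the a, the repetition has to be inside the pattern.
inside=$(printf '%s\n' 'abbbb' | sed 's/ab*/X/g')

printf '%s\n' "$whole" "$inside"
```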
sed does not support non-greedy wildcards at all. But your Perl script also fails for other reasons; you are saying "match all occurrences of this" but then you specify a regex which can only match once.
A common simple solution is to split the string, and then replace only after the match:
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe 'if (@x = /^(.*?sym,)(.*)/) { $x[1] =~ s/lop/lad/g; s/.*/$x[0]$x[1]/ }'
If you want to be fancy, you can use a lookbehind to only replace the lop occurrences after the first sym.
echo "lopy,lopy1,sym,lopy,lopy1,sym" |
perl -pe 's/(?<=sym.{0,200})lop/lad/g'
The variable-length lookbehind generates a warning and is only supported in Perl 5.30+ (you can turn the warning off with no warnings qw(experimental::vlb);).
Since you have shown an attempted sed command and used sed tag, here is a sed loop based solution:
sed -E -e ':a' -e 's~(sym,.*)lopy~\1lady~g; ta' file
lopy,lopy1,sym,lady,lady1,sym
Explanation:
:a sets a label a before matching sym,.* pattern
ta jumps pattern matching back to label a after making a substitution
The loop stops when the s command has nothing left to match, i.e. there is no lopy substring after sym,
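To watch what the :a ... ta loop does, the sketch below applies the same substitution by hand, one pass at a time, and prints every intermediate state. It uses basic regex syntax and a plain shell loop purely for illustration; the sed loop does the same work internally.

```shell
line='lopy,lopy1,sym,lopy,lopy1,sym'

# Each pass replaces the LAST lopy after "sym," because .* is greedy.
while :; do
  next=$(printf '%s\n' "$line" | sed 's/\(sym,.*\)lopy/\1lady/')
  [ "$next" = "$line" ] && break   # nothing changed: same condition that ends the ta loop
  line=$next
  printf '%s\n' "$line"
done
```

The first pass turns lopy1 into lady1, the second turns the remaining lopy into lady, and the third finds nothing to replace.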

How to use grep/sed/awk, to remove a pattern from beginning of a text file

I have a text file with the following pattern written to it:
TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"
I would like to discard the first part of each line containing
TIME[32.468ms] -(3)-.............
To test the regular expression I've tried the following:
cat myfile.txt | egrep "^TIME\[.*\]\s\s\-\(3\)\-\.+"
This identifies correctly the lines I want. Now, to delete the pattern I've tried:
cat myfile.txt | sed s/"^TIME\[.*\]\s\s\-\(3\)\-\.+"//
but it just seems to be doing the cat, since it shows the content of the complete file and no substitution happens.
What am I doing wrong?
OS: CentOS 7
With your shown samples, please try following grep command. Written and tested with GNU grep.
grep -oP '^TIME\[\d+\.\d+ms\]\s+-\(\d+\)-\.+\K.*' Input_file
Explanation: Adding detailed explanation for above code.
^TIME\[ ##Matching string TIME from starting of value here.
\d+\.\d+ms\] ##Matching digits(1 or more occurrences) followed by dot digits(1 or more occurrences) followed by ms ] here.
\s+-\(\d+\)-\.+ ##Matching spaces (1 or more occurrences) followed by - digits (1 or more occurrences) - and 1 or more dots.
\K ##Using \K option of GNU grep to make sure previous match is found in line but don't consider it in printing, print next matched regex part only.
.* ##to match till end of the value.
2nd solution: Adding awk program here.
awk 'match($0,/^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+/){print substr($0,RSTART+RLENGTH)}' Input_file
Explanation: using match function of awk, to match regex ^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+ which will catch text which we actually want to remove from lines. Then printing rest of the text apart from matched one which is actually required by OP.
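As a rough illustration of what match() leaves behind, this sketch matches only the TIME[...] prefix of the sample line (a shorter regex than the one above, chosen just to keep the numbers easy to check) and prints RSTART and RLENGTH before using them in substr():

```shell
line='TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"'

# match() sets RSTART (1-based start) and RLENGTH (length) of the matched text.
pos=$(printf '%s\n' "$line" |
  awk 'match($0, /^TIME\[[0-9.]+ms\]/) { print RSTART, RLENGTH }')

# substr() starting at RSTART+RLENGTH is everything after the match.
rest=$(printf '%s\n' "$line" |
  awk 'match($0, /^TIME\[[0-9.]+ms\]/) { print substr($0, RSTART + RLENGTH) }')

printf '%s\n' "$pos" "$rest"
```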
This awk using its sub() function:
awk 'sub(/^TIME[[][^]]*].*\.+/,"")' file
"TEXT I WANT TO KEEP"
If there is replacement, sub() returns true.
$ cut -d'"' -f2 file
TEXT I WANT TO KEEP
You may use:
s='TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"'
sed -E 's/^TIME\[[^]]*].*\.+//' <<< "$s"
"TEXT I WANT TO KEEP"
The \s regex extension may not be supported by your sed.
In BRE syntax (which is what sed speaks out of the box) you do not backslash round parentheses - doing that turns them into regex metacharacters which do not match themselves, somewhat unintuitively. Also, + is just a regular character in BRE, not a repetition operator (though you can turn it into one by similarly backslashing it: \+).
You can try adding an -E option to switch from BRE syntax to the perhaps more familiar ERE syntax, but that still won't enable Perl regex extensions, which are not part of ERE syntax, either.
sed 's/^TIME\[[^][]*\][[:space:]][[:space:]]-(3)-\.*//' myfile.txt
should work on any reasonably POSIX sed. (Notice also how the minus character does not need to be backslash-escaped, though doing so is harmless per se. Furthermore, I tightened up the regex for the square brackets, to prevent the "match anything" regex you had .* from "escaping" past the closing square bracket. In some more detail, [^][] is a negated character class which matches any character which isn't (a newline or) ] or [; they have to be specified exactly in this order to avoid ambiguity in the character class definition. Finally, notice also how the entire sed script should normally be quoted in single quotes, unless you have specific reasons to use different quoting.)
If you have sed -E or sed -r you can use + instead of * but then this complicates the overall regex, so I won't suggest that here.
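The BRE/ERE differences described above are easy to try on tiny inputs. A sketch, assuming a sed that accepts -E (both GNU and BSD sed do); the variable names are illustrative:

```shell
# BRE: a bare + is a literal plus sign, so only the text "a+b" matches.
bre_plus=$(printf '%s\n' 'aab a+b' | sed 's/a+b/X/g')

# ERE (-E): + means "one or more", so "aab" matches and the literal "a+b" does not.
ere_plus=$(printf '%s\n' 'aab a+b' | sed -E 's/a+b/X/g')

# BRE: bare ( ) match themselves ...
bre_literal=$(printf '%s\n' '-(3)-' | sed 's/(3)/X/')

# ... while \( \) form a group, so only the 3 is matched and the parentheses stay.
bre_group=$(printf '%s\n' '-(3)-' | sed 's/\(3\)/X/')

printf '%s\n' "$bre_plus" "$ere_plus" "$bre_literal" "$bre_group"
```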
A simpler one for sed:
sed 's/^[^"]*//' myfile.txt
If the "text you want to keep" is always surrounded by quotes like this, and it is the only quoted text on the lines starting with "TIME...", then:
sed -n '/^TIME/p' file | awk -F'"' '{print $2}'
should get the line starting with "TIME..." and print the text within the quotes.
Thanks all, for your help.
By the end, I've found a way to make it work:
echo 'TIME[32.468ms] -(3)-.............TEXT I WANT TO KEEP' | grep TIME | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
More generally,
grep TIME myfile.txt | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
Cheers,
Pedro

sed replace exact string that include brackets

I'm trying to replace an exact string that includes brackets. Let's say:
a[aa] to bbb, just for giving an example.
I had used the following regex:
sed 's|\<a\[aa]\>|bbb|g' testfile
but it doesn't seem to work. This could be something really basic, but I have not been able to make it work, so I would appreciate any help.
You need to remove the trailing word boundary: \> only matches at the end of a word, i.e. right after a letter, digit or _, so it can never match immediately after the ] char.
sed 's|\<a\[aa]|bbb|g' file
See the online sed demo:
s="say: a[aa] to bbb, not ba[aa]"
sed 's|\<a\[aa]|bbb|g' <<< "$s"
# => say: bbb to bbb, not ba[aa]
You may also require a non-word char with a capturing group and replace with a backreference:
sed -E 's~([^_[:alnum:]]|^)a\[aa]([^_[:alnum:]]|$)~\1bbb\2~g' file
Here, ([^_[:alnum:]]|^) captures any non-word char or start of string into Group 1 and ([^_[:alnum:]]|$) matches and captures into Group 2 any char other than _, digit or letter (or end of string), and the \1 and \2 placeholders restore these values in the result. This, however, does not allow consecutive matches, so you may still use \< before a to play it safe: sed -E 's~\<a\[aa]([^_[:alnum:]]|$)~bbb\1~g' file.
See this online demo.
To enforce whitespace boundaries you may use
sed -E 's~([[:space:]]|^)a\[aa]([[:space:]]|$)~\1bbb\2~g' file
Or, in your case, just a trailing whitespace boundary seems to be enough:
sed -E 's~\<a\[aa]([[:space:]]|$)~bbb\1~g' file
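The remark that the two-group version "does not allow consecutive matches" can be seen on a made-up input where the same token appears twice. With a single space between them, the space is consumed as Group 2 of the first match and is no longer available as Group 1 of the second:

```shell
# One separator: it is used up by the first match, so the second a[aa] is skipped.
one_space=$(printf '%s\n' 'a[aa] a[aa]' |
  sed -E 's~([^_[:alnum:]]|^)a\[aa]([^_[:alnum:]]|$)~\1bbb\2~g')

# Two separators: each match gets its own boundary character, so both are replaced.
two_spaces=$(printf '%s\n' 'a[aa]  a[aa]' |
  sed -E 's~([^_[:alnum:]]|^)a\[aa]([^_[:alnum:]]|$)~\1bbb\2~g')

printf '%s\n' "$one_space" "$two_spaces"
```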

sed does not match the regex

I've written this regex:
/_([^_+\n][\w]+)_/g
and I wanted to test it out on my terminal with
echo "HELLO ___ _HELO_WORLD_" | sed "/_([^_+\n][\w]+)_/g"
However, it outputs
HELLO ___ _HELO_WORLD_
which means sed does not match anything.
The result needs to be :
_HELLO_WORLD_
I am using OS X, and I tried both -E and -e as suggested by other posts, but that didn't change anything. What am I doing wrong here?
sed is not particularly well suited for this task: it is really good at applying patterns to lines, less so to words, which makes the regexes overly complicated.
word-oriented solution
anyhow, here's an attempt, using two replacement patterns:
sed -e 's|\<[^_][^\> ]*[^_]\> *||g' -e 's|\<_*\> *||g'
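the first expression replaces any word that neither starts nor ends with an underscore (and any trailing whitespace) with nothing. \< indicates the beginning of a word, and \> the ending; so \<[^_][^\> ]*[^_]\> translates to "at the beginning of a word \< there is a character that is not an underscore [^_], followed by any number of characters that do not end the word [^\> ]*, followed by a character that is not an underscore [^_] right before the word ends \>". (Strictly speaking, \> has no special meaning inside brackets; the class [^\> ] simply excludes backslash, > and space, and it is the space that stops the match at the end of the word.)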
the second expression is simpler and replaces any word solely consisting of underscores with nought.
line oriented processing
if you can arrange for your data to be one expression per line you can use something like the following
$ cat data.txt
HELLO
___
_HELO_WORLD_
$ cat data.txt | sed -n -e '/_[^_+\s]\w*_/p'
_HELO_WORLD_
$
The sed term is almost the one you gave (though sed doesn't accept the + here, since + is an ordinary character in basic regular expressions, so I use a workaround with * instead).
The basic trick is to use the -n flag to disable the default printing of lines and to use the p command to explicitly print matching lines.
I am still not sure what you are asking, so I answer what I guess you are asking. My guess is, that you want to find strings surrounded by underscores with Sed. The short answer is: no. The longer is: you can not find overlapping string parts with Sed, because it does not support lookahead.
If you take this string _HELLO_WORLD_ and the following pattern _[^_]*_, the pattern will match _HELLO_ and the remaining string is WORLD_, which will not match, because the leading underscore has already been consumed.
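This is easy to confirm. In the sketch below the replacement wraps each match in < > (using & for the matched text) so that it is visible which parts were actually matched:

```shell
# The first match consumes "_HELLO_" including the shared underscore,
# so "WORLD_" is left without a leading underscore and cannot match.
out=$(printf '%s\n' '_HELLO_WORLD_' | sed 's/_[^_]*_/<&>/g')

printf '%s\n' "$out"
```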
Sed is the wrong tool for this. Use Perl instead. This prints all strings surrounded by underscores:
$ echo "HELLO ___ _HELO_WORLD_" | perl -ne 's/_([A-Z]+)(?=_)/print $1/ge'
HELOWORLD
Update reflecting your last comment:
If you want to find strings starting and ending with an underscore at word boundaries, use this one:
$ echo "HELLO ___ _HELO_WORLD_" | perl -ne 's/\b_([A-Z]+[_A-Z]*[A-Z]*)_\b/print $1/ge'
HELO_WORLD
There are multiple problems:
your sed command is only a condition (an address). It should be an action, such as s/pattern/replacement/flags, or the condition should be followed by an action, e.g. /_([^_+\n][\w]+)_/p to print the line.
with sed, you need to either escape your parentheses and + or use the -r (--regexp-extended) flag (-E on OS X)
[\w]: \w is already a character class by itself, so there is no need to wrap it in brackets
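The first two points can be sketched with a deliberately simplified pattern (uppercase letters only, an assumption made to keep the example portable). An address plus p selects lines, while an s command edits them, and in basic syntax the group needs \( and \):

```shell
# Condition + action: with -n, only lines matching the address are printed.
selected=$(printf '%s\n' 'HELLO' '___' '_HELO_WORLD_' | sed -n '/_[A-Z]/p')

# Action: s rewrites the matched part of the line; \( \) captures the word for \1.
changed=$(printf '%s\n' '_HELO_WORLD_' | sed 's/_\([A-Z][A-Z]*\)_/[\1]/')

printf '%s\n' "$selected" "$changed"
```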
Finally, a shot at what I think you want with GNU grep :
grep -P -o "_[^_+\n\s]\w+_"
$ echo "HELLO ___ _HELO_WORLD_" | grep -P -o "_[^_+\n\s]\w+_"
_HELO_WORLD_
Using grep is enough and easier if you only need to match.
-o lets you retrieve only the matched part rather than the whole line
-P uses perl regexes so that you can use shorthand classes such as \n and \s
I added \s to the negated class, because previously it could match the space before what you want to match, since \w can match the underscore.
If you can't use GNU grep, then it's back to sed, which is already answered by ceving.
As many answers and the downvotes suggest, sed doesn't look like the right tool to use for this question, so I ended up using Python, which worked out really well, so I will just post it here for anyone in the future who might have same problem.
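import re
text = "HELLO ___ _HELO_WORLD_"  # sample input from the question
p = re.compile(r'_([^_+\n][\w ]+)_')  # raw string, so \w reaches the regex engine unchanged
result = p.findall(text)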

Printing a matched regexp with sed

So I'm trying to match a regexp with any string in the middle of it and then print out just that string. The syntax is sort of like this...
sed -n 's/<title>.*</title>/"what do I put here"/p' input.file
and I just want to print out whatever .* is where I typed "what do I put here". I'm not very comfortable with sed at this point so this is likely a very simple answer and I'm having trouble finding one in any of the other questions. Thanks in advance!
Capture the pattern you want to extract within \(...\), and then you can refer to it as \1 in the replacement string:
sed -n 's/<title>\(.*\)<\/title>/\1/p' input.file
You can have multiple \(...\) expressions, and refer to them with \1, \2, \3, and so on.
If you have the GNU version of sed, or gsed, then you could simplify a bit:
sed -rn 's/<title>(.*)<\/title>/\1/p' input.file
With the -r flag, sed can use "extended regular expressions", which in practice lets you write (...) instead of \(...\), + instead of \+, and other goodies.
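A self-contained sketch of the capture-and-print idea, run on a made-up four-line document instead of input.file. It adds .* on both sides of the tags, an extra not in the commands above, so that any text around the title on the same line is dropped too, and it shows that a different delimiter avoids escaping the / in the closing tag:

```shell
html='<html>
<title>My page</title>
<p>body</p>
</html>'

# -n suppresses default output; p prints only lines where the substitution happened.
title=$(printf '%s\n' "$html" | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p')

# Same command with | as the delimiter, so </title> needs no backslash.
title_alt=$(printf '%s\n' "$html" | sed -n 's|.*<title>\(.*\)</title>.*|\1|p')

printf '%s\n' "$title" "$title_alt"
```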