sed does not match the regex

I wrote this regex:
/_([^_+\n][\w]+)_/g
and I wanted to test it out on my terminal with
echo "HELLO ___ _HELO_WORLD_" | sed "/_([^_+\n][\w]+)_/g"
However, it outputs
HELLO ___ _HELO_WORLD_
which means sed does not match anything.
The result needs to be:
_HELO_WORLD_
I am using OS X, and I tried both -E and -e as suggested by other posts, but that didn't change anything. What am I doing wrong here?

sed is not particularly well suited to this task: it is good at applying patterns to lines, less so to words, which makes the regexes overly complicated.
Word-oriented solution
Anyhow, here's an attempt using two replacement patterns:
sed -e 's|\<[^_][^\> ]*[^_]\> *||g' -e 's|\<_*\> *||g'
The first expression replaces any word that neither starts nor ends with an underscore (together with any trailing whitespace) with nothing. \< marks the beginning of a word and \> the end, so \<[^_][^\> ]*[^_]\> reads: at the beginning of a word (\<) there is a character that is not an underscore ([^_]), followed by any number of characters that do not end the word ([^\> ]*), followed by a character that is not an underscore ([^_]) right before the word ends (\>). Strictly speaking, \> is not a word boundary inside a bracket expression: [^\> ] simply excludes backslash, > and space, and it is the excluded space that keeps the match inside one word.
The second expression is simpler and replaces any word consisting solely of underscores with nothing.
Line-oriented processing
If you can arrange for your data to be one expression per line, you can use something like the following:
$ cat data.txt
HELLO
___
_HELO_WORLD_
$ cat data.txt | sed -n -e '/_[^_+\s]\w*_/p'
_HELO_WORLD_
$
The sed expression is almost the one you gave. In sed's default basic regular expression syntax an unescaped + is a literal character rather than a quantifier, so I use * instead.
The basic trick is to use the -n flag to disable the default printing of lines and the p command to explicitly print matching lines.
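For comparison, the same print-only-matching-lines idea can be sketched in Python. This is only an illustration, assuming the three sample lines from data.txt above. The helper name matching_lines is made up for this example, and Python reads \s as whitespace, which may differ from how a given sed treats \s inside brackets.

```python
import re

# Same pattern as the sed address above; Python's re understands \s and \w directly.
PATTERN = re.compile(r'_[^_+\s]\w*_')

def matching_lines(lines):
    """Return only the lines that contain a match, like sed -n '/.../p'."""
    return [line for line in lines if PATTERN.search(line)]

sample = ["HELLO", "___", "_HELO_WORLD_"]  # contents of data.txt from the example
print(matching_lines(sample))
```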

I am still not sure what you are asking, so I will answer what I guess you are asking. My guess is that you want to find strings surrounded by underscores with sed. The short answer is: no. The longer answer is: you cannot find overlapping parts of a string with sed, because it does not support lookahead.
If you take this string _HELLO_WORLD_ and the following pattern _[^_]*_, the pattern will match _HELLO_ and the remaining string is WORLD_, which will not match, because the leading underscore has already been consumed.
Sed is the wrong tool for this. Use Perl instead. This prints all strings surrounded by underscores:
$ echo "HELLO ___ _HELO_WORLD_" | perl -ne 's/_([A-Z]+)(?=_)/print $1/ge'
HELOWORLD
Update reflecting your last comment:
If you want to find strings starting and ending with an underscore at word boundaries, use this one:
$ echo "HELLO ___ _HELO_WORLD_" | perl -ne 's/\b_([A-Z]+[_A-Z]*[A-Z]*)_\b/print $1/ge'
HELO_WORLD
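The consumed-underscore effect described above is easy to reproduce in Python's re module, which does support lookahead. A small sketch using the sample strings from this answer; it is an illustration, not a sed replacement.

```python
import re

# Without lookahead the closing underscore is consumed, so WORLD_ has no opening underscore left.
consumed = re.findall(r'_[^_]*_', '_HELLO_WORLD_')
print(consumed)

text = "HELLO ___ _HELO_WORLD_"

# A lookahead checks for the trailing underscore without consuming it.
overlapping = re.findall(r'_([A-Z]+)(?=_)', text)
print(overlapping)

# Anchored at word boundaries, the whole underscore-delimited word is captured once.
whole = re.findall(r'\b_([A-Z]+[_A-Z]*[A-Z]*)_\b', text)
print(whole)
```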

There are multiple problems:
Your sed command is only a condition (an address). It should be an action such as s/pattern/replacement/flags, or the condition should be followed by an action, e.g. /_([^_+\n][\w]+)_/p to print the line.
With sed, you either need to escape your parentheses and + or use the -r / --regexp-extended flag (GNU sed; BSD and macOS sed spell it -E).
[\w]: \w is already a character class by itself, so there is no need to wrap it in brackets.
Finally, a shot at what I think you want with GNU grep :
grep -P -o "_[^_+\n\s]\w+_"
$ echo "HELLO ___ _HELO_WORLD_" | grep -P -o "_[^_+\n\s]\w+_"
_HELO_WORLD_
Using grep is enough and easier if you only need to match.
-o lets you retrieve only the matched part rather than the whole line.
-P uses Perl regexes, so you can use shorthand classes such as \n and \s.
I added \s to the negated class, because previously it could match the space before what you want to match, since \w can match the underscore.
If you can't use GNU grep, then it's back to sed, which is already answered by ceving.
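To see why \s was added to the negated class, the two variants can be compared in Python, whose re syntax is close to what grep -P accepts for this pattern. A minimal sketch using the sample line from the question:

```python
import re

text = "HELLO ___ _HELO_WORLD_"

# Without \s, the last underscore of "___" can open the match and the space fills the negated class.
without_space_guard = re.findall(r'_[^_+\n]\w+_', text)

# With \s excluded, the match has to start at the underscore directly before HELO.
with_space_guard = re.findall(r'_[^_+\n\s]\w+_', text)

print(without_space_guard)
print(with_space_guard)
```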

As many answers and the downvotes suggest, sed doesn't look like the right tool for this question, so I ended up using Python, which worked out really well. I will post it here for anyone who might have the same problem in the future.
import re
text = "HELLO ___ _HELO_WORLD_"  # sample input from the question
p = re.compile(r'_([^_+\n][\w ]+)_')  # raw string keeps \w and \n for the regex engine
result = p.findall(text)

Related

How to use grep/sed/awk, to remove a pattern from beginning of a text file

I have a text file with the following pattern written to it:
TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"
I would like to discard the first part of each line containing
TIME[32.468ms] -(3)-.............
To test the regular expression I've tried the following:
cat myfile.txt | egrep "^TIME\[.*\]\s\s\-\(3\)\-\.+"
This identifies correctly the lines I want. Now, to delete the pattern I've tried:
cat myfile.txt | sed s/"^TIME\[.*\]\s\s\-\(3\)\-\.+"//
but it just seems to be doing the cat, since it shows the content of the complete file and no substitution happens.
What am I doing wrong?
OS: CentOS 7

With your shown samples, please try the following grep command, written and tested with GNU grep.
grep -oP '^TIME\[\d+\.\d+ms\]\s+-\(\d+\)-\.+\K.*' Input_file
Explanation of the above command:
^TIME\[ ##Matches the string TIME[ at the start of the line.
\d+\.\d+ms\] ##Matches digits (1 or more occurrences), followed by a dot and digits (1 or more occurrences), followed by ms].
\s+-\(\d+\)-\.+ ##Matches spaces (1 or more occurrences), followed by -( digits (1 or more occurrences) )- and 1 or more dots.
\K ##\K (available with -P) requires everything before it to match but drops it from the reported match, so only the part that follows is printed.
.* ##Matches the rest of the line.
2nd solution: Adding awk program here.
awk 'match($0,/^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+/){print substr($0,RSTART+RLENGTH)}' Input_file
Explanation: using match function of awk, to match regex ^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+ which will catch text which we actually want to remove from lines. Then printing rest of the text apart from matched one which is actually required by OP.
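Python's re module has no \K, but the same effect (match the prefix, keep only the rest) can be sketched with a substitution. This assumes the line layout shown in the question; strip_prefix is a hypothetical helper name.

```python
import re

# Prefix to discard: TIME[<float>ms], whitespace, -(<digits>)-, then a run of dots.
PREFIX = re.compile(r'^TIME\[\d+\.\d+ms\]\s+-\(\d+\)-\.+')

def strip_prefix(line):
    """Remove the TIME[...] -(n)-..... prefix; lines without it are returned unchanged."""
    return PREFIX.sub('', line, count=1)

line = 'TIME[32.468ms]  -(3)-............."TEXT I WANT TO KEEP"'
print(strip_prefix(line))
```

Unlike grep -o, this sketch passes non-matching lines through untouched, so filter them separately if they should be dropped.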

This awk using its sub() function:
awk 'sub(/^TIME[[][^]]*].*\.+/,"")' file
"TEXT I WANT TO KEEP"
sub() returns the number of substitutions it made, so the pattern is true, and the line is printed, only when a replacement actually happened.

$ cut -d'"' -f2 file
TEXT I WANT TO KEEP

You may use:
s='TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"'
printf '%s\n' "$s" | sed -E 's/^TIME\[[^]]*].*\.+//'
"TEXT I WANT TO KEEP"

The \s regex extension may not be supported by your sed.
In BRE syntax (which is what sed speaks out of the box) you do not backslash round parentheses - doing that turns them into regex metacharacters which do not match themselves, somewhat unintuitively. Also, + is just a regular character in BRE, not a repetition operator (though you can turn it into one by similarly backslashing it: \+).
You can try adding an -E option to switch from BRE syntax to the perhaps more familiar ERE syntax, but that still won't enable Perl regex extensions, which are not part of ERE syntax, either.
sed 's/^TIME\[[^][]*\][[:space:]][[:space:]]-(3)-\.*//' myfile.txt
should work on any reasonably POSIX-compliant sed. (Notice also how the minus character does not need to be backslash-escaped, though doing so is harmless per se. Furthermore, I tightened up the regex for the square brackets, to prevent the "match anything" regex you had, .*, from "escaping" past the closing square bracket. In some more detail, [^][] is a negated character class which matches any character which isn't (a newline or) ] or [; they have to be specified exactly in this order to avoid ambiguity in the character class definition. Finally, notice also how the entire sed script should normally be quoted in single quotes, unless you have specific reasons to use different quoting.)
If you have sed -E or sed -r you can use + instead of * but then this complicates the overall regex, so I won't suggest that here.
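The point about .* escaping past the closing square bracket can be illustrated in Python. The sample line below is invented for the demonstration (it has a second ] later on the line), and the negated class is written with escaped brackets here rather than in the POSIX [^][] form.

```python
import re

line = 'TIME[32.468ms] note [x] tail'  # invented sample with a second ] later on the line

# Greedy .* runs to the last ] on the line.
greedy = re.match(r'TIME\[.*\]', line).group()

# A negated class cannot cross a bracket, so the match stops at the first ].
bounded = re.match(r'TIME\[[^\]\[]*\]', line).group()

print(greedy)
print(bounded)
```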

A simpler one for sed:
sed 's/^[^"]*//' myfile.txt

If the text you want to keep is always surrounded by double quotes like this, and it is the only quoted text on the lines starting with "TIME...", then:
sed -n '/^TIME/p' file | awk -F'"' '{print $2}'
should select the lines starting with "TIME..." and print the text within the quotes.

Thanks all, for your help.
In the end, I found a way to make it work:
echo 'TIME[32.468ms] -(3)-.............TEXT I WANT TO KEEP' | grep TIME | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
More generally,
grep TIME myfile.txt | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
Cheers,
Pedro

How to replace with one sed command first n letter to uppercase

I would like to convert the first n letters to uppercase with one sed command.
Example 'madrid' to 'MADrid'. (n=3)
I know how to change first letter to uppercase with this command:
sed -e "s/\b\(.\)/\U\1/g"
but I don't know how to change this command for my problem.
I tried to change
sed -e "s/\b\(.\)/\U\1/g"
to
sed -e "s/\b\(.\)/\U\3/g"
but this didn't work. I also googled and searched this site, but I couldn't find an exact answer to my problem.
Thank you.

I infer from your use of \U that you're using GNU sed:
n=3
echo 'madrid' | sed -r 's/\<(.{'"$n"'})/\U\1/g' # -> 'MADrid'
I've omitted the unnecessary -e option
I have added -r to enable support for extended regular expressions, which have more familiar syntax and also offer more features.
I'm using a single-quoted sed script with a shell-variable value spliced in so as to avoid confusion between what the shell expands up front and what is interpreted by sed itself.
\< is used instead of \b, because unlike the latter it only matches at the start of a word. (Thanks, Casimir et Hippolyte.)
The above replaces any 3 characters at the start of a word, however.
To limit it to at most $n letters:
sed -r 's/\<([[:alpha:]]{1,'"$n"'})/\U\1/g'
As for what you've tried:
The \3 in your attempt sed -e "s/\b\(.\)/\U\3/g" refers to the 3rd capture group (parenthesized subexpression, (...)) in the regex (which doesn't exist), it does not refer to 3 repetitions.
Instead, you have to make sure that your one and only capture group (which you can reference as \1 in the substitution) itself captures as many characters as desired - which is what the {<n>} quantifier is for; the related {<m>,<n>} construct matches a range of repetitions.
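The same idea (one capture group whose quantifier controls how many characters are taken) can be sketched in Python, where a replacement function plays the role of \U. upper_first_n is a hypothetical helper, and [A-Za-z] is assumed here as a stand-in for [[:alpha:]].

```python
import re

def upper_first_n(text, n):
    """Uppercase up to the first n ASCII letters of every word."""
    # {1,n} on the single capture group decides how many letters are captured.
    pattern = re.compile(r'\b([A-Za-z]{1,%d})' % n)
    return pattern.sub(lambda m: m.group(1).upper(), text)

print(upper_first_n("madrid", 3))
```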

This might work for you (GNU sed):
sed -r 's/[a-z]/&\n/'"$n"';s/^([^\n]*)\n/\U\1/' file
Here $n is the number of letters to convert. Putting the question of word boundaries aside, this converts the first n letters in a-z on each line, consecutive or not, to upper case (A-Z).
N.B. this is two sed commands not one!

Grep Regex Exclusion Special Character

I am having a difficult time trying to search for a phrase but exclude the phrase if it is directly followed by a colon-space.
I am looking for Delet! (i.e. "Delet.*" in regex syntax) but I do not want anything returned that is "Deleted: " (includes a space after the colon). However, I would like anything returned that is "Deleted" followed by anything other than a colon-space.
I have tried the following expressions
grep -ri 'delet.*[^:]'
grep -ri 'delet[a-zA-Z0-9\;\".....]{0,10}'
(including all special characters in the range preceded by escapes)

Using a lookahead expression:
grep -Pi 'Delet(?!ed: )'
Note the modification of the parameters of grep: -P enables the use of lookahead expressions.
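What the negative lookahead accepts and rejects can be checked quickly in Python, which uses the same (?!...) syntax. The sample lines are made up for illustration.

```python
import re

# "delet" not immediately followed by "ed: " (case-insensitive, like grep -i).
PATTERN = re.compile(r'delet(?!ed: )', re.IGNORECASE)

lines = [
    "Deleted: old entry",     # rejected: "Deleted" followed by colon-space
    "Deleted the old entry",  # accepted
    "deleting files",         # accepted
    "Deleted:no space",       # accepted: colon but no space after it
]

kept = [line for line in lines if PATTERN.search(line)]
print(kept)
```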

Try this. The ? after the * makes the quantifier lazy, so it selects as few non-space characters as possible, followed by any one character that is not a colon, followed by a space. Lazy quantifiers are a Perl-regex feature, so grep needs -P for this:
grep -rPi 'delet[^ ]*?[^:] '

If I understood you correctly, you want anything starting with delet but not starting with "deleted: ":
grep -Ei '^delet((([^e]|e$)|e([^d]|d$)|ed([^:]|:$)|ed:[^ ]).*)?$'
This basically says:
Match [start]deletX[anything][end] or [start]delete[end] where X is not e
Match [start]deleteX[anything][end] or [start]deleted[end] where X is not d
Match [start]deletedX[anything][end] or [start]deleted:[end] where X is not :
Match [start]deleted:X[anything][end] where X is not space.
It would be far easier to use a pipe and a second, negated grep, if that is applicable:
grep -i ^delet | grep -vi '^deleted: '

It sounds like all you need is:
awk -v IGNORECASE=1 '/delet/ && !/deleted: /' file
The above uses GNU awk for IGNORECASE, other awks would use tolower().
The benefit of awk over grep is that awk tests for conditions, not just regexps, so you can create compound conditions using && and || out of tests for regexps which makes it MUCH simpler and clearer to just code the condition you want to test - that the line contains delet and (&&) not (!) deleted:.
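The compound condition is not quite the same test as a lookahead: it rejects a line that contains "deleted: " anywhere, even if the line also has another acceptable "delet". A Python sketch of the difference, with an invented sample line and hypothetical helper names:

```python
import re

def compound(line):
    """Mirror of the awk test: contains 'delet' and does not contain 'deleted: '."""
    lowered = line.lower()
    return 'delet' in lowered and 'deleted: ' not in lowered

def lookahead(line):
    """True if some 'delet' is not directly followed by 'ed: '."""
    return re.search(r'delet(?!ed: )', line, re.IGNORECASE) is not None

mixed = "Deleted: 3 files, deleting more"
print(compound(mixed), lookahead(mixed))
```

Which behavior is wanted depends on whether a single "Deleted: " should disqualify the whole line.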

sed replace between two strings wildcard

I am trying to flag everything inside a color tag and replace it with something else, such as:
I have a [color=blue]dog[/color] and a [color=blue]cat[/color] in my house.
to
I have a [color=blue][b]foobar[/b][/color] and a [color=blue][b]foobar[/b][/color] in my house.
Here is what I've tried:
sample='I have a [color=blue]dog[/color] and a [color=blue]cat[/color] in my house.'
replace='foobar'
sample=$(echo $sample| sed "s/\[color=blue\].*\[\/color\]/\[color=blue\]\[b\]$replace\[\/b\]\[\/color\]/g")
Which gets me:
I have a [color=blue][b]foobar[/b][/color] in my house.
Any idea on how to make sed nongreedy in this case?

Just replace your .* with [^[]* (any character other than left bracket). That is:
"s/\[color=blue\][^[]*\[\/color\]/\[color=blue\]\[b\]$replace\[\/b\]\[\/color\]/g"

sed is always greedy. You can work around it by selecting the regex carefully. The example below is identical to yours except that .* has been replaced with [^[]* (which means everything except [):
$ echo $sample| sed "s/\[color=blue\][^[]*\[\/color\]/\[color=blue\]\[b\]$replace\[\/b\]\[\/color\]/g"
I have a [color=blue][b]foobar[/b][/color] and a [color=blue][b]foobar[/b][/color] in my house.
For truly non-greedy regular expressions, try perl or python.
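As a sketch of what "truly non-greedy" looks like, here are the greedy pattern, the negated-class workaround, and a lazy quantifier side by side in Python, using the sample sentence from the question.

```python
import re

sample = "I have a [color=blue]dog[/color] and a [color=blue]cat[/color] in my house."
repl = "[color=blue][b]foobar[/b][/color]"

# Greedy: .* stretches from the first opening tag to the last closing tag.
greedy = re.sub(r'\[color=blue\].*\[/color\]', repl, sample)

# Workaround that also works in sed: forbid [ inside the tag body.
negated = re.sub(r'\[color=blue\][^\[]*\[/color\]', repl, sample)

# Lazy quantifier: .*? stops at the first closing tag.
lazy = re.sub(r'\[color=blue\].*?\[/color\]', repl, sample)

print(greedy)
print(negated)
print(lazy)
```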

sed 's#\(\[color=[[:alpha:]]*\]\)[[:alnum:]]*\(\[/color\)#\1[b]foobar[/b]\2#g'
example
echo 'I have a [color=blue]dog[/color] and a [color=blue]cat[/color] in my house.'|sed 's#\(\[color=[[:alpha:]]*\]\)[[:alnum:]]*\(\[/color\)#\1[b]foobar[/b]\2#g'
output
I have a [color=blue][b]foobar[/b][/color] and a [color=blue][b]foobar[/b][/color] in my house.

sed -r 's/(\[color=[a-z]*\])[a-z]*(\[\/color\])/\1[b]foobar[\/b]\2/g' File
or
sed 's/\(\[color=[a-z]*\]\)[a-z]*\(\[\/color\]\)/\1[b]foobar[\/b]\2/g' File
Explanation:
Here, we look for the patterns 1. [color=any small letter sequence] followed by 2. any small letter sequence followed by 3. [/color] and group patterns 1 and 3 using ( and ). Then we do the substitutions. We keep the 1st and 2nd groups (using \1 and \2), but replace the contents between the first and second group with [b]foobar[/b].

sed will always be greedy. You can use perl if you strictly want non-greedy variant:
$ echo $test
I have a [color=blue]dog[/color] and a [color=blue]cat[/color] in my house.
$ perl -pne 's/(\[color=[a-zA-Z]*\])(.*?)(\[\/color\])/$1\[b\]foobar\[\/b\]$3/g' <<< "$test"
I have a [color=blue][b]foobar[/b][/color] and a [color=blue][b]foobar[/b][/color] in my house.
I guess, you can interpret most of the regex here, except for the tiny syntax change:
(.*?) in place of (.*) dictates that the match is supposed to be non-greedy.
If you skip ? after .*, here is the output you must be getting currently:
$ perl -pne 's/(\[color=[a-zA-Z]*\])(.*)(\[\/color\])/$1\[b\]foobar\[\/b\]$3/g' <<< "$test"
I have a [color=blue][b]foobar[/b][/color] in my house.

As others have stated, you can emulate non-greedy matching by matching only characters that cannot end the match.
Using a caret inside brackets, [^ABC], effectively means "none of the characters that follow".
So using this with the asterisk * will match only up to the next occurrence of one of those characters.
For example
[^[]*
Will match everything up to the next [ bracket
Also, everyone is backslash-escaping the brackets in the replacement, which is not needed, because the replacement text is not parsed as a regex.
Anyway, here is a command that should work.
sed 's/\(\[color[^]]*\]\)[^[]*\(\[\/color\]\)/\1[b]foobar[\/b]\2/g'

replace number in a string

I am trying to match this string
'12.34.5.6',#### OR
'12.34.5.6', #### (Note the space after the comma)
in a series of files and replace #### with 2222.
I started small and this command successfully changed 1234 to 2222
sed -i 's/'12.34.5.6\''\,1234/'12.34.5.6\''\, 2222/g' file.txt
so I moved on to replacing 1234 with a regex. Below are some of the commands I've tried, but they do not work.
sed -i 's/'12.34.5.6\''\,\(\s?[0-9]{4,5}\)/'12.34.5.6\''\, 2222/g' file.txt
sed -i 's/'12.34.5.6\''\,[0-9][0-9][0-9][0-9][0-9]?/'12.34.5.6\''\, 2222/g' file.txt
Can someone help me out with this or give some pointers?

sed -r "s/('12[.]34[.]5[.]6',[ ]?)[0-9]{4}/\\12222/g"

This might do the trick:
sed -E "s/('12.34.5.6',\s?)[0-9]{4,5}/\12222/g"
Examples:
$ echo "'12.34.5.6', 2134" | sed -E "s/('12.34.5.6',\s?)[0-9]{4,5}/\12222/g"
'12.34.5.6', 2222
$ echo "'12.34.5.6',9230" | sed -E "s/('12.34.5.6',\s?)[0-9]{4,5}/\12222/g"
'12.34.5.6',2222
Explanation:
With -E we ask sed to use extended regex (but this is mainly a matter of taste). The beginning of the regex is fairly simple: '12.34.5.6', matches that same string (each unescaped . actually matches any character; write \. to match only a literal dot). We then add \s followed by a ? to indicate that the whitespace is optional. This first part is enclosed in parentheses so that it can be reused in the replacement pattern.
Then, we add the #'s to the pattern. I assumed you used #'s in place of numbers based on your attempts with [0-9]{4,5} and [0-9][0-9][0-9][0-9][0-9].
Finally, in the replacement pattern we insert what the first pair of parentheses matched with \1, and add our 2's: \12222 (which replaces the numbers (#'s), discarded in the process because they are not enclosed in the parentheses).
PS. Next time please format your question for better readability.
PPS. I think the real issue here is not the regex but the quote escaping in your regex. Maybe take look at [this question].
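One detail worth knowing when porting this to another regex flavor: in Python the replacement \12222 would not be read as group 1 followed by 2222, so the unambiguous \g<1> form is needed. A sketch, with the dots escaped so they match only literal dots; replace_number is a hypothetical name.

```python
import re

# Group 1 keeps the quoted address, the comma and an optional space; the digits after it are replaced.
PATTERN = re.compile(r"('12\.34\.5\.6',\s?)[0-9]{4,5}")

def replace_number(line, new_value="2222"):
    # \g<1> is the unambiguous way to write "group 1" directly before more digits.
    return PATTERN.sub(r"\g<1>" + new_value, line)

print(replace_number("'12.34.5.6', 2134"))
print(replace_number("'12.34.5.6',9230"))
```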
