Hi, I'm looking for a regular expression that matches a line of text that does not end with a certain word, say "abcd".
At first I tried with
.*[^abcd]$
That one doesn't work of course. It matches a line that doesn't end with any of the letters a,b,c or d.
So, in Advanced Grep Topics, I found this expression, but couldn't get it to work:
^(?>.*)(?<=abcd)
->
grep -e "^(?>.*)(?<=abcd)$"
Any idea for the expression I need?
Have a look at grep's -v option
grep -v 'abcd$'
If you really meant a word rather than just a "sequence of characters", then use
grep -v '\babcd$'
\b meaning "word-boundary"
Give this a shot:
grep -v "\<abcd\>$"
Proof of Concept
$ printf "%s\n" "foo abcd bar baz" "foo bar baz abcd" "foo bar bazabcd" | grep -v "\<abcd\>$"
foo abcd bar baz
foo bar bazabcd
Note: This matches whole words only, which is why the 3rd line was returned even though its last 4 letters are abcd.
grep supports PCRE regular expressions when using -P flag.
One of the reasons grep -e "^(?>.*)(?<=abcd)$" does not work is that the lookaround you are using is positive, which is the exact opposite of what is required. (?<= is the syntax for positive lookbehind, which tells the regex engine to search for lines that end with abcd.
To search for lines that do not end with a certain string, you need to use negative lookbehind. The syntax for negative lookbehind is (?<!. And because negative lookbehind includes an exclamation mark, which bash will try to interpret as a history event, one cannot use double quotes to supply the regex to grep.
I used following regex to search for the lines that do not end with log.
grep -P '(?<!log)$' < <inputfile>
Similarly, you can use the above command and replace log with whatever ending you want to exclude.
This regex can also be used with other programs that do not support inverse matching the way grep's -v option does.
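As a quick check, the negative lookbehind and the -v form can be compared on the same input. This is only a minimal sketch, assuming GNU grep built with PCRE support (-P); the function names and sample lines are made up for illustration.

```shell
# Hypothetical helpers: both read lines on stdin and print those NOT ending in "abcd".
not_ending_lookbehind() {
    # (?<!abcd)$ : an end of line that is not immediately preceded by "abcd"
    grep -P '(?<!abcd)$'
}

not_ending_inverted() {
    # -v inverts the match: drop every line that does end in "abcd"
    grep -v 'abcd$'
}

# Made-up sample input
sample() {
    printf '%s\n' 'foo abcd bar' 'foo bar abcd' 'plain line'
}

sample | not_ending_lookbehind
sample | not_ending_inverted
```

Both calls should print the same two lines; single quotes keep bash from treating the ! as history expansion.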
Related
I have a text file with the following pattern written to it:
TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"
I would like to discard the first part of each line containing
TIME[32.468ms] -(3)-.............
To test the regular expression I've tried the following:
cat myfile.txt | egrep "^TIME\[.*\]\s\s\-\(3\)\-\.+"
This identifies correctly the lines I want. Now, to delete the pattern I've tried:
cat myfile.txt | sed s/"^TIME\[.*\]\s\s\-\(3\)\-\.+"//
but it just seems to be doing the cat, since it shows the content of the complete file and no substitution happens.
What am I doing wrong?
OS: CentOS 7
With your shown samples, please try following grep command. Written and tested with GNU grep.
grep -oP '^TIME\[\d+\.\d+ms\]\s+-\(\d+\)-\.+\K.*' Input_file
Explanation: Adding detailed explanation for above code.
^TIME\[ ##Matching string TIME from starting of value here.
\d+\.\d+ms\] ##Matching digits(1 or more occurrences) followed by dot digits(1 or more occurrences) followed by ms ] here.
\s+-\(\d+\)-\.+ ##Matching spaces (1 or more occurrences) followed by - digits (1 or more occurrences) - and 1 or more dots.
\K ##Using PCRE's \K (available through grep -P) to require that the preceding part matches but exclude it from the printed result; only the part matched after it is printed.
.* ##to match till end of the value.
2nd solution: Adding awk program here.
awk 'match($0,/^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+/){print substr($0,RSTART+RLENGTH)}' Input_file
Explanation: using match function of awk, to match regex ^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+ which will catch text which we actually want to remove from lines. Then printing rest of the text apart from matched one which is actually required by OP.
This awk using its sub() function:
awk 'sub(/^TIME[[][^]]*].*\.+/,"")' file
"TEXT I WANT TO KEEP"
sub() returns the number of substitutions made (0 or 1), so the line is printed only if a replacement took place.
$ cut -d'"' -f2 file
TEXT I WANT TO KEEP
You may use:
s='TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"'
sed -E 's/^TIME\[[^]]*].*\.+//' <<< "$s"
"TEXT I WANT TO KEEP"
The \s regex extension may not be supported by your sed.
In BRE syntax (which is what sed speaks out of the box) you do not backslash round parentheses - doing that turns them into regex metacharacters which do not match themselves, somewhat unintuitively. Also, + is just a regular character in BRE, not a repetition operator (though you can turn it into one by similarly backslashing it: \+).
You can try adding an -E option to switch from BRE syntax to the perhaps more familiar ERE syntax, but that still won't enable Perl regex extensions, which are not part of ERE syntax, either.
sed 's/^TIME\[[^][]*\][[:space:]][[:space:]]-(3)-\.*//' myfile.txt
should work on any reasonably POSIX sed. (Notice also how the minus character does not need to be backslash-escaped, though doing so is harmless per se. Furthermore, I tightened up the regex for the square brackets, to prevent the "match anything" regex you had .* from "escaping" past the closing square bracket. In some more detail, [^][] is a negated character class which matches any character which isn't (a newline or) ] or [; they have to be specified exactly in this order to avoid ambiguity in the character class definition. Finally, notice also how the entire sed script should normally be quoted in single quotes, unless you have specific reasons to use different quoting.)
If you have sed -E or sed -r you can use + instead of * but then this complicates the overall regex, so I won't suggest that here.
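To make the BRE/ERE difference concrete, here is a small side-by-side sketch. It assumes a sed that accepts -E (GNU and BSD sed both do) and two spaces after the closing bracket, as in the question's pattern; the function names are hypothetical.

```shell
# Sample line, assuming two spaces after "]" as in the question's regex
line='TIME[32.468ms]  -(3)-............."TEXT I WANT TO KEEP"'

# BRE (default): plain ( ) are literal, and \{1,\} is the portable "one or more"
strip_bre() {
    sed 's/^TIME\[[^][]*\][[:space:]]\{1,\}-(3)-\.\{1,\}//'
}

# ERE (-E): ( ) would group, so literal parentheses need backslashes; + repeats
strip_ere() {
    sed -E 's/^TIME\[[^][]*\][[:space:]]+-\(3\)-\.+//'
}

printf '%s\n' "$line" | strip_bre
printf '%s\n' "$line" | strip_ere
```

Both variants should leave only the quoted text; note how the escaping of the parentheses flips between the two syntaxes.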
A simpler one for sed:
sed 's/^[^"]*//' myfile.txt
If the "text you want to keep" is always surrounded by quotes like this, and those are the only quotes in the lines starting with "TIME...", then:
sed -n '/^TIME/p' file | awk -F'"' '{print $2}'
should get the line starting with "TIME..." and print the text within the quotes.
Thanks all, for your help.
By the end, I've found a way to make it work:
echo 'TIME[32.468ms] -(3)-.............TEXT I WANT TO KEEP' | grep TIME | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
More generally,
grep TIME myfile.txt | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
Cheers,
Pedro
Assume the following is in file.txt:
---------
foo bar
more foo bar
---------
when I execute grep -P '(?<=-$)(?s:.)*(?=^-)' file.txt, I expect only the middle two lines to be matched, but this expression matches nothing. What's wrong?
I also tried grep -P '(?s)(?<=-$).*(?=^-)' file.txt but same result.
Your pattern does not work because:
The -P option alone only makes grep match using the PCRE regex engine.
Since you have no other options, grep outputs whole matched lines; you need to add the -o option to output only the matched text(s) and -z to slurp the file into a single text.
Your regex has ^ and $ anchors that, by default, match the start/end of the whole string, not of each line. You need the m flag together with the s flag (which makes . match any char including line break chars).
So, you may use your regex with m and -oz:
grep -Poz '(?ms)(?<=-$).*(?=^-)' file.txt
Or,
grep -Poz '(?s)-\R\K.*(?=\R-)' file.txt
where \R matches any line break sequence and \K omits the text matched so far from the overall match.
See the regex demo.
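One practical detail when trying this: with -z, grep also terminates each output record with a NUL byte instead of a newline. The following is a small sketch of the second command, assuming GNU grep with PCRE support; the sample function stands in for file.txt and the helper name is made up.

```shell
# Stand-in for the contents of file.txt
sample() {
    printf '%s\n' '---------' 'foo bar' 'more foo bar' '---------'
}

between_dashes() {
    # -z: treat the whole input as one record; -o: print only the match.
    # tr strips the NUL byte that -z appends to each printed match.
    grep -Poz '(?s)-\R\K.*(?=\R-)' | tr -d '\0'
}

sample | between_dashes
echo    # the match itself carries no trailing newline
```

Piping through tr is just one way to get rid of the NUL; it keeps the result usable in later line-based tools.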
In my project, I'm trying to match files that contain "/baz". So, I do the following:
grep -rnw . -e "\/baz"
The output correctly matches the following instance:
import { baz } from './baz';
But it does not seem to match on this line:
import { baz } from './foo/bar/baz';
It does match if I grep on "bar\/baz", however. What's going on?
Ditch the -w argument from your grep call. From grep help:
-w, --word-regexp force PATTERN to match only whole words
It will therefore match your pattern only if the match is a whole word, that is, preceded and followed by a non-word character (anything other than a letter, digit or underscore) or by the beginning/end of the line.
You need
grep -rn . -e "\/baz"
without the word flag. With -w, the match must be preceded by a non-word character; in the second example /baz is preceded by the r of bar, so the match is rejected.
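The effect of -w can be seen by counting matches on the two import lines from the question. A minimal sketch (the sample function is only a stand-in for the project files):

```shell
# The two lines from the question
sample() {
    printf '%s\n' "import { baz } from './baz';" "import { baz } from './foo/bar/baz';"
}

# With -w the character before "/baz" must be a non-word character:
# "." qualifies, but the "r" of "bar" does not, so only one line is counted.
sample | grep -c -w -e '/baz'

# Without -w both lines are counted.
sample | grep -c -e '/baz'
```

The slash needs no backslash here, since / is not special in grep patterns.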
I wrote this regex:
/_([^_+\n][\w]+)_/g
and I wanted to test it out on my terminal with
echo "HELLO ___ _HELO_WORLD_" | sed "/_([^_+\n][\w]+)_/g"
However, it outputs
HELLO ___ _HELO_WORLD_
which means sed does not match anything.
The result needs to be:
_HELO_WORLD_
I am using OS X, and I tried both -E and -e as suggested by other posts, but that didn't change anything. What am I doing wrong here?
sed is not particularly well suited for this task, as it really is good at applying patterns to lines, less so to words, which makes the regexes overly complicated.
word-oriented solution
anyhow, here's an attempt, using two replacement patterns:
sed -e 's|\<[^_][^\> ]*[^_]\> *||g' -e 's|\<_*\> *||g'
the first expression replaces any word that neither starts nor ends with an underscore (and any trailing whitespace) by nought. \< indicates the beginning of a word, and \> the ending; so \<[^_][^\> ]*[^_]\> translates to "at the beginning \< there is no underscore [^_], followed by any number of characters that do not end the word [^\> ]*, followed by a character that is not an underscore [^_] right before the word ends \>". (strictly speaking, inside a bracket expression \> is not a word boundary: [^\> ] just excludes a backslash, > and a space, which is enough here because the words are separated by spaces.)
the second expression is simpler and replaces any word solely consisting of underscores with nought.
line oriented processing
if you can arrange for your data to be one expression per line you can use something like the following
$ cat data.txt
HELLO
___
_HELO_WORLD_
$ cat data.txt | sed -n -e '/_[^_+\s]\w*_/p'
_HELO_WORLD_
$
The sed term is almost the one you gave (though sed doesn't like the bare +, since in its default BRE syntax + is an ordinary character, so I use a workaround with * instead).
The basic trick is to use the -n flag to disable the default printing of lines and to use the p command to explicitly print matching lines.
I am still not sure what you are asking, so I answer what I guess you are asking. My guess is, that you want to find strings surrounded by underscores with Sed. The short answer is: no. The longer is: you can not find overlapping string parts with Sed, because it does not support lookahead.
If you take this string _HELLO_WORLD_ and the following pattern _[^_]*_, the pattern will match _HELLO_ and the remaining string is WORLD_, which will not match, because the leading underscore has already been consumed.
Sed is the wrong tool for this. Use Perl instead. This prints all strings surrounded by underscores:
$ echo "HELLO ___ _HELO_WORLD_" | perl -ne 's/_([A-Z]+)(?=_)/print $1/ge'
HELOWORLD
Update reflecting your last comment:
If you want to find strings starting and ending with an underscore at word boundaries, use this one:
$ echo "HELLO ___ _HELO_WORLD_" | perl -ne 's/\b_([A-Z]+[_A-Z]*[A-Z]*)_\b/print $1/ge'
HELO_WORLD
There are multiple problems:
your sed command is a condition. It should be an action, such as s/pattern/replacement/flags, or the condition could be followed by an action, i.e. /_([^_+\n][\w]+)_/p to print the line.
with sed, you either need to escape your parentheses and + or use the -r (--regexp-extended) flag of GNU sed (-E on OS X)
[\w] : \w is already a character class by itself, no need to encase it in a class
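The "condition followed by an action" form can be sketched as follows. This assumes a sed that accepts -E and swaps \w for the POSIX class [[:alnum:]_] so it does not depend on GNU extensions; the function name is hypothetical.

```shell
line='HELLO ___ _HELO_WORLD_'

# -n suppresses the default output; the address /.../ is the condition
# and the trailing p is the action that prints a matching line.
print_if_match() {
    sed -n -E '/_[^_+ ][[:alnum:]_]+_/p'
}

printf '%s\n' "$line" | print_if_match
```

This prints the whole line when the pattern occurs somewhere in it, not just the matched part.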
Finally, a shot at what I think you want with GNU grep :
grep -P -o "_[^_+\n\s]\w+_"
$ echo "HELLO ___ _HELO_WORLD_" | grep -P -o "_[^_+\n\s]\w+_"
_HELO_WORLD_
Using grep is enough and easier if you only need to match.
-o lets you retrieve only the matched part rather than the whole line
-P uses Perl regexes so that you can use shorthands such as \n and \s
I added \s to the negated class, because previously it could match the space before what you want to match, since \w can match the underscore.
If you can't use GNU grep, then it's back to sed, which is already answered by ceving.
As many answers and the downvotes suggest, sed doesn't look like the right tool to use for this question, so I ended up using Python, which worked out really well, so I will just post it here for anyone in the future who might have same problem.
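import re

text = "HELLO ___ _HELO_WORLD_"  # sample input from the question
# Raw string so that \n and \w reach the regex engine unchanged
p = re.compile(r'_([^_+\n][\w ]+)_')
result = p.findall(text)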
I am having a difficult time trying to search for a phrase but exclude the phrase if it is directly followed by a colon-space.
I am looking for Delet! (i.e. "Delet.*" in regex syntax) but I do not want anything returned that is "Deleted: " (includes a space after the colon). However, I would like anything returned that is "Deleted" followed by anything other than a colon-space.
I have tried the following expressions
grep -ri 'delet.*[^:]'
grep -ri 'delet[a-zA-Z0-9\;\".....]{0,10}'
(including all special characters in the range preceded by escapes)
Using a lookahead expression:
grep -Pi 'Delet(?!ed: )'
Note the modification of the parameters of grep: -P enables the use of lookahead expressions.
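A short sketch of the lookahead on a few made-up lines, assuming GNU grep with PCRE support (-P); the sample lines are hypothetical and not from the original data:

```shell
# Made-up sample lines
sample() {
    printf '%s\n' 'Deleted: old entry' 'Deleted file foo' 'deleting bar' 'Delete:nospace'
}

# (?!ed: ) rejects a "Delet" that is immediately followed by "ed: "
sample | grep -Pi 'Delet(?!ed: )'
```

Only the first sample line should be filtered out, since it is the only one where "Delet" continues with "ed: ".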
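Try this. The ? after the * instructs it to select as few non-space characters as possible, followed by any one character that is not a colon, followed by a space. Note that the lazy *? is a Perl-style quantifier, so it needs -P; in grep's default syntax the ? would be taken as a literal character.
grep -rPi 'delet[^ ]*?[^:] '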
If I got you correctly, you want any line starting with delet but not starting with "deleted: ":
grep -Ei '^delet((([^e]|e$)|e([^d]|d$)|ed([^:]|:$)|ed:[^ ]).*)?$'
This basically says:
Match [start]deletX[anything][end] or [start]delete[end] where X is not e
Match [start]deleteX[anything][end] or [start]deleted[end] where X is not d
Match [start]deletedX[anything][end] or [start]deleted:[end] where X is not :
Match [start]deleted:X[anything][end] where X is not space.
It would have been far easier to use pipe and second negative grep if that is applicable:
grep -i ^delet | grep -vi '^deleted: '
It sounds like all you need is:
awk -v IGNORECASE=1 '/delet/ && !/deleted: /' file
The above uses GNU awk for IGNORECASE, other awks would use tolower().
The benefit of awk over grep is that awk tests for conditions, not just regexps, so you can create compound conditions using && and || out of tests for regexps which makes it MUCH simpler and clearer to just code the condition you want to test - that the line contains delet and (&&) not (!) deleted:.
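For awks without IGNORECASE, the same compound condition can be written with tolower(). A minimal sketch; the function name and sample lines are made up for illustration:

```shell
# Made-up sample lines
sample() {
    printf '%s\n' 'Deleted: old entry' 'Deleted file foo' 'deleting bar' 'nothing here'
}

# Lowercase the line once, then test both conditions against that copy;
# the original line is what gets printed.
filter_portable() {
    awk '{ l = tolower($0) } l ~ /delet/ && l !~ /deleted: /'
}

sample | filter_portable
```

Lowercasing into a separate variable keeps the printed output in its original case.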