Why is my regex not matching a C and the next letter?

My testfile is:
PolicyChain:ComplementaryUser Caught
PolicyChain:SourceIP Caught
My regex is:
cat testfile | grep -E -o '[^PolicyChain:].+?'
It matches:
mplementaryUser Caught
SourceIP Caught
I'm ultimately just trying to match the string after the colon but before the space. Please help??

[^PolicyChain:] is a character class that matches one character that is NOT (as indicated by the ^) among P,o,l,i,c,y,C,h,a,i,n or :.
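Then .+? is meant to match one or more characters lazily. Note that POSIX extended regexes (grep -E) have no lazy quantifiers, so the ? does not stop the match early, which is why your output runs to the end of each line.
Since the regex has to start by matching a character outside that set (the first token), it cannot start matching at the C or the o of ComplementaryUser; the first character it can start on is the m.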
I suggest that your decision to use a character class is an error, and you want a positive lookbehind instead, such as (?<=^PolicyChain:): http://www.regular-expressions.info/refadv.html
A positive lookbehind means, 'look behind my current position and attempt to match this lookbehind regex. If it does match, we can continue with the rest of the main regex. If it does not match, we cannot continue.'
However, note that lookaheads and lookbehinds are not POSIX-compliant, so you need a Perl-compatible regex engine (PCRE, for example grep -P where it is available) to use them. (Or .NET, Python, Java, Ruby...)
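To make the difference concrete, here is a small sketch using the two sample lines from the question. It assumes a grep built with PCRE support for the -P flag (typically GNU grep; that is an assumption about your system), and the function names are made up for illustration.

```shell
#!/bin/sh
# The two sample lines from the question.
sample='PolicyChain:ComplementaryUser Caught
PolicyChain:SourceIP Caught'

# Negated character class: the match can only begin at a character that is
# NOT one of P, o, l, i, c, y, C, h, a, n or ":".
with_char_class() {
    printf '%s\n' "$sample" | grep -oE '[^PolicyChain:].*'
}

# Positive lookbehind (PCRE only): require "PolicyChain:" at the start of the
# line immediately before the match, then take everything up to the first space.
with_lookbehind() {
    printf '%s\n' "$sample" | grep -oP '(?<=^PolicyChain:)[^ ]+'
}

with_char_class
with_lookbehind
```

The first function reproduces the truncated output from the question; the second should print only ComplementaryUser and SourceIP.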

Try this instead.
cat testfile | sed -e "s/.*:\([^ ][^ ]*\).*/\1/"

You can simply use cut:
echo "PolicyChain:ComplementaryUser Caught" | cut -d: -f 2 | cut -d' ' -f 1

Related

How to grep an exact string with slash in it?

I'm running macOS.
There are the following strings:
/superman
/superman1
/superman/batman
/superman2/batman
/superman/wonderwoman
/superman3/wonderwoman
/batman/superman
/batman/superman1
/wonderwoman/superman
/wonderwoman/superman2
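I want to grep only the lines in which superman is a complete path component (bolded in the original post): /superman, /superman/batman, /superman/wonderwoman, /batman/superman and /wonderwoman/superman.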
I figured doing grep -wr 'superman/|/superman' would yield all of them, but it only yields /superman.
Any idea how to go about this?

You may use
grep -E '(^|/)superman($|/)' file
See the online demo:
s="/superman
/superman1
/superman/batman
/superman2/batman
/superman/wonderwoman
/superman3/wonderwoman
/batman/superman
/batman/superman1
/wonderwoman/superman
/wonderwoman/superman2"
grep -E '(^|/)superman($|/)' <<< "$s"
Output:
/superman
/superman/batman
/superman/wonderwoman
/batman/superman
/wonderwoman/superman
The pattern matches
(^|/) - start of string or a slash
superman - a word
($|/) - end of string or a slash.

grep '/superman\>'
\> is the "end of word marker", and for "superman3", the end of word is not following "man"
The problems with your -w solution:
| is not special in a basic regex. You either need to escape it or use grep -E
read the man page about how -w works:
The test is that the
matching substring must either be at the beginning of the line, or preceded by a non-word
constituent character. Similarly, it must be either at the end of the line or followed by a
non-word constituent character
In the case where the line is /batman/superman,
the pattern superman/ does not appear
the pattern /superman is:
at the end of the line, which is OK, but
is preceded by the character "n" which is a word constituent character.
grep -w superman will give you better results, or if you need to have superman preceded by a slash, then my original answer works.
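A quick sketch to compare the two suggestions on the list from the question. The \> operator is an extension rather than part of POSIX, so its availability outside GNU grep is an assumption here; the function names are invented for this example.

```shell
#!/bin/sh
# The sample paths from the question.
paths='/superman
/superman1
/superman/batman
/superman2/batman
/superman/wonderwoman
/superman3/wonderwoman
/batman/superman
/batman/superman1
/wonderwoman/superman
/wonderwoman/superman2'

# Slash in front, end-of-word marker behind.
by_word_end() {
    printf '%s\n' "$paths" | grep '/superman\>'
}

# Whole-word match on the bare word; the slashes count as non-word characters.
by_word_flag() {
    printf '%s\n' "$paths" | grep -w 'superman'
}

by_word_end
by_word_flag
```

On this input both functions should select the same five lines, skipping every line where superman is followed by a digit.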

Using EGREP to find a substring repeated 3 or more times in a string

I'm trying to find any string that repeats any 4-character substring 3 times or more, with no overlapping (the substrings can't overlap each other).
Something like this:
grep -E '([A-Za-z]{4})\1\1' test.txt
I know that this is wrong but I'm not really sure what I'm doing wrong or how to use the string repeating feature.
I'm specifically interested in doing this using EGREP, not other ways.
Some examples:
fourfourfour would be okay
fourfourfourfour would not be okay
none of the substrings can overlap, so if I was searching for "hehe" in hehehehe it would return false as there is only two non overlapping matches.

If it's a four-character string then you could try the grep command below.
grep -oP '^(?:(?!\1).)*\K(.{4})(?=(?:(?!\1).)*\1(?:(?!\1).)*\1(?:(?!\1).)*$)' file

Try this:
grep -P '^(.*?(....))(?=(.*?\2){2})(?!(.*\2){3}).*'
The key here is using reluctant quantifiers to consume as little as possible before each occurrence of the captured four-character group, then a negative lookahead to disallow more than 3.

Since you are requesting specifically a grep -E solution, and all of the earlier answers seem hellbent on using grep -P, here's one more.
grep -E '(....)\1\1' file
This looks for a group of four arbitrary characters (including spaces) repeated three times, adjacent to each other.
If you want to restrict to non-whitespace characters, try this instead.
grep -E '([^[:space:]]{4})\1\1' file
This looks more complicated, but really isn't: we use [^[:space:]] instead of . and specify four repetitions with {4} just because it would really suck to have to write [^[:space:]][^[:space:]][^[:space:]][^[:space:]].
If you want to relax the adjacency requirement and look for a four-character string occurring three times on the same input line with some other characters in between, try this instead.
grep -E '(....).*\1.*\1' file
The parentheses perform grouping, but also capturing; whatever the first set of parentheses matched will be available as \1. You cannot just say (....){3} because that simply says four characters, followed by any other four, followed by any other four.
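Here is a minimal sketch of that last point. Backreferences in -E patterns are not required by POSIX, so this assumes a grep that accepts them (GNU grep does); the helper names are made up for the example.

```shell
#!/bin/sh
# has_triple STRING: succeed if some four-character group occurs three times in a row.
has_triple() {
    printf '%s\n' "$1" | grep -qE '(....)\1\1'
}

# has_twelve STRING: succeed if the string merely contains twelve consecutive characters.
has_twelve() {
    printf '%s\n' "$1" | grep -qE '(....){3}'
}

for s in fourfourfour hehehehe abcdefghijkl; do
    if has_triple "$s"; then
        echo "$s: repeated group found"
    else
        echo "$s: no repeated group"
    fi
done

# The {3} form accepts twelve unrelated characters, which is not what was asked.
has_twelve abcdefghijkl && echo "abcdefghijkl: matched by (....){3} anyway"
```

Only fourfourfour should be reported as having a repeated group; hehehehe is too short to hold three non-overlapping copies of a four-character string.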

Here is a non-regex solution using awk:
awk '{s=substr($0, 1, 4); print ($0 == s s s)?"match":"no match"}'
Testing:
echo "fourfourfour" | awk '{s=substr($0, 1, 4); print ($0 == s s s)?"match":"no match"}'
match
echo "hehehehe" | awk '{s=substr($0, 1, 4); print ($0 == s s s)?"match":"no match"}'
no match

I have to take back an earlier statement. Getting EXACTLY x-number of matches does
appear to work in Perl (and possibly PCRE types too).
It does this because in Perl, variables can exist as multiple types, and as such each
has a control state. One of the states is defined or not.
So capture buffers can be referenced before they are actually defined.
This might not apply to command line grep (even in Perl mode), but it might be worth a try.
Adding to @AvinashRaj's regex, it can be done like this. I tested it in Perl, works there:
# ^(?:(?!\1).)*(.{4})(?:(?!\1).)*\1(?:(?!\1).)*\1(?:(?!\1).)*$
^
(?:
    (?! \1 )
    .
)*
( .{4} )        # (1)
(?:
    (?! \1 )
    .
)*
\1
(?:
    (?! \1 )
    .
)*
\1
(?:
    (?! \1 )
    .
)*
$

How to grep for this pattern in Unix

I want to grep for this particular pattern. The pattern is as follows
**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887
inside the file test.txt which has the following data
NNN**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887_20140628.csv
I tried using grep "**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887" test.txt but it's not returning anything. Please advise.
EDIT:
Hi, basically I'm inside a loop and only sometimes do I get files with this pattern. So currently I'm using grep "$i" test.txt, which works in all cases except when I encounter such patterns.
And I'm actually grepping for the exact file number and file sequence. So if it says 123_29887, it will be 123_29887. Thanks.

You could use:
grep -P "(?i)\*\*[a-z\d]+\*\*[a-z]+_\d+_\d+" somepath
(?i) turns on case-insensitive mode
\*\* matches the two opening stars
[a-z\d]+ matches letters and digits
\*\* matches two more stars
[a-z]+ matches letters
_\d+_\d+ matches underscore, digits, underscore, digits
If you need to be more specific (for instance, you know that a group of digits always has three digits), you can replace parts of the expression: for instance, \d+ becomes \d{3}
Matching a Literal but Yet Unknown Pattern: \Q and \E
If you receive literal patterns that you need to match, such as **xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887, the issue is that special regex characters such as * need to be escaped. If the whole string is a literal, we do this by escaping the whole string between \Q and \E:
grep -P "\Q**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887\E" somepath
And in a loop, of course, you can build that regex programmatically by concatenating \Q and \E on both sides.
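If PCRE is not available, a different technique from the \Q...\E one above is grep -F, which treats the whole pattern as a fixed string. The sketch below assumes the loop variable holds the literal text to find; the second data line and the find_literal helper are hypothetical and only there for illustration.

```shell
#!/bin/sh
# Sample line from the question plus a hypothetical second line without stars.
data='NNN**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887_20140628.csv
NNNxMT456xABCxxxx_456_11111_20140628.csv'

# find_literal PATTERN: print the input lines that contain PATTERN verbatim.
# -F turns off regex interpretation; -e stops a pattern that starts with "-"
# from being read as an option.
find_literal() {
    grep -F -e "$1"
}

for i in '**xMT123xMT123x**ABCxxxxxxxxxxxxxxxxxx_123_29887' 'xMT456xABCxxxx_456_11111'; do
    printf '%s\n' "$data" | find_literal "$i"
done
```

Because nothing in the pattern is special under -F, the same call works for names with and without stars, so the loop needs no special case.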

Why sed doesn't print an optional group?

I have two strings, say foo_bar and foo_abc_bar. I would like to match both of them, and if the first one is matched I would like to emphasize it with = sign. So, my guess was:
echo 'foo_abc_bar' | sed -r 's/(foo).*(abc)?.*(bar)/\1=\2=\3/g'
> foo==bar
or
echo 'foo_abc_bar' | sed -r 's/(foo).*((abc)?).*(bar)/\1=\2=\3/g'
> foo==
But as output above shows none of them work.
How can I specify an optional group that will match if the string contains it or just skip if not?

The solution:
echo 'foo_abc_bar' | sed -r 's/(foo)_((abc)_)?(bar)/\1=\3=\4/g'
Why your previous attempts didn't work:
.* is greedy, so for the regex (foo).*(abc)?.*(bar) attempting to match 'foo_abc_bar' the (foo) will match 'foo', and then the .* will initially match the rest of the string ('_abc_bar'). The regex will continue until it reaches the required (bar) group and this will fail, at which point the regex will backtrack by giving up characters that had been matched by the .*. This will happen until the first .* is only matching '_abc_', at which point the final group can match 'bar'. So instead of the 'abc' in your string being matched in the capture group it is matched in the non-capturing .*.
Explanation of my solution:
The first and most important thing is to replace the .* with _, there is no need to match any arbitrary string if you know what the separator will be. The next thing we need to do is figure out exactly which portion of the string is optional. If the strings 'foo_abc_bar' and 'foo_bar' are both valid, then the 'abc_' in the middle is optional. We can put this in an optional group using (abc_)?. The last step is to make sure that we still have the string 'abc' in a capturing group, which we can do by wrapping that portion in an additional group, so we end up with ((abc)_)?. We then need to adjust the replacement because there is an extra group, so instead of \1=\2=\3 we use \1=\3=\4, \2 would be the string 'abc_' (if it matched). Note that in most regex implementations you could also have used a non-capturing group and continued to use \1=\2=\3, but sed does not support non-capturing groups.
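As a quick check of the explanation above, this sketch runs the same substitution on both inputs. It uses sed -E, which I am assuming your sed accepts as the equivalent of -r (GNU and BSD sed both do); the emphasize helper is just a name made up for the example.

```shell
#!/bin/sh
# emphasize STRING: rewrite foo_bar or foo_abc_bar with = as the separator.
# Group 2 is the optional "abc_" part, group 3 is the "abc" inside it.
emphasize() {
    printf '%s\n' "$1" | sed -E 's/(foo)_((abc)_)?(bar)/\1=\3=\4/'
}

emphasize 'foo_abc_bar'
emphasize 'foo_bar'
```

The first call should print foo=abc=bar and the second foo==bar, since the optional group is simply empty when abc_ is absent.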
An alternative:
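I think the regex above is your best bet because it is most explicit (it will only match the exact strings you are interested in). In regex flavors that support it (Perl, PCRE), you could also avoid the issue described above by using lazy repetition (matches as few characters as possible) instead of greedy repetition (matches as many characters as possible), by changing the .* to .*?. Be aware, though, that sed's POSIX regexes have no lazy quantifiers: GNU sed accepts .*? but still treats it as greedy. The expression below does produce foo=abc=bar, but only because (abc) is no longer optional, which also means it will not match foo_bar: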
echo 'foo_abc_bar' | sed -r 's/(foo).*?(abc).*?(bar)/\1=\2=\3/g'

Maybe you could simply use:
echo 'foo_abc_bar' | sed -r 's/(foo|bar|abc)_?/\1=/g'
echo 'foo_bar' | sed -r 's/(foo|bar|abc)_?/\1=/g'
> foo=abc=bar=
> foo=bar=
This avoids the foo==bar you get with foo_bar and I found it a bit weird to show emphasis by putting = sometimes before the match, sometimes after the match.

Unable to figure out regex bash or sed or awk

I wanted to split the following jdk-1.6.0_30-fcs.x86_64 to just jdk-1.6.0_30. I tried the following sed 's/\([a-z][^fcs]*\).*/\1/' but I end up with jdk-1.6.0_30-. I think I am approaching it the wrong way. Is there a way to start from the end of the word and traverse backwards until I encounter -?

Not exactly, but you can anchor the pattern to the end of the string with $. Then you just need to make sure that the characters you repeat may not include hyphens:
echo jdk-1.6.0_30-fcs.x86_64 | sed 's/-[^-]*$//'
This will match from a - to the end of the string, but all characters in between must be different from - (so that it does not match for the first hyphen already).
A slightly more detailed explanation. The engine tries to match the literal - first. That will first work at the first - in the string (obviously). Then [^-]* matches as many non-- characters as possible, so it will consume 1.6.0_30 (because the next character is in fact a hyphen). Now the engine will try to match $, but that does not work because we are not at the end of the string. Some backtracking occurs, but we can ignore that here. In the end the engine will abandon matching the first - and continue through the string. Then the engine will match the literal - with the second -. Now [^-]* will consume fcs.x86_64. Now we are actually at the end of the string and $ will match, so the full match (which will be removed) is -fcs.x86_64.
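The walk-through above can be tried with the short sketch below. The second command swaps in a different technique, shell parameter expansion with the shortest-suffix operator %, which should give the same result without sed; the strip_last_part name is made up for the example.

```shell
#!/bin/sh
name='jdk-1.6.0_30-fcs.x86_64'

# strip_last_part STRING: remove the last hyphen and everything after it.
strip_last_part() {
    printf '%s\n' "$1" | sed 's/-[^-]*$//'
}

strip_last_part "$name"

# Same idea without sed: % removes the shortest suffix matching the glob "-*",
# which starts at the last hyphen.
printf '%s\n' "${name%-*}"
```

Both lines of output should read jdk-1.6.0_30.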

Use cut:
echo 'jdk-1.6.0_30-fcs.x86_64' | cut -d- -f-2

Try this:
echo 'jdk-1.6.0_30-fcs.x86_64' | sed 's/-fcs.*//'
If you are using bash, sh or ash, you can do:
var=jdk-1.6.0_30-fcs.x86_64
echo "${var%%-fcs*}"
jdk-1.6.0_30
The latter solution uses parameter expansion; tested on Linux and Minix3.

We Keep Coding
