I try to make a regex (important that ist a regex because i need it for fail2ban) to match when
the receiver and the sender are the same person:
echo "from=<test#test.ch> to=<test#test.ch>" | grep -E -o '([^=]*\s)[ ]*\1'
What am I doing wrong ?
You might use a pattern to match the format of the string between the brackets with a backreference to that capture.
from(=<[^\s#<>]+#[^\s#<>]+>)\s*to\1
Explanation
from Match literally
( Capture group 1
=< Match literally
[^\s#<>]+ Match 1+ times any char except a whitespace char or # < >
# Match literally
[^\s#<>]+ Again match 1+ times any char except a whitespace char or # < >
> Match literally
) Close group 1
\s*to\1 Match 0+ whitespace chars, to and the backreference to group 1
Regex demo | Bash demo
Use grep -P instead of -E for Perl compatible regular expressions.
For example
echo "from=<test#test.ch> to=<test#test.ch>" | grep -oP 'from(=<[^\s#<>]+#[^\s#<>]+>)\s*to\1'
A bit broader match could be capturing what is between the brackets
[^=\s]+(=<[^<>]+>)\s*[^=\s]+\1
Regex demo
Related
In a Bash script I'm writing, I need to capture the /path/to/my/file.c and 93 in this line:
0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93).
0xffffffc0006e0584 is in another_function(char *arg1, int arg2) (/path/to/my/other_file.c:94).
With the help of regex101.com, I've managed to create this Perl regex:
^(?:\S+\s){1,5}\((\S+):(\d+)\)
but I hear that Bash doesn't understand \d or ?:, so I came up with this:
^([:alpha:]+[:space:]){1,5}\(([:alpha:]+):([0-9]+)\)
But when I try it out:
line1="0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93)."
regex="^([:alpha:]+[:space:]){1,5}\(([:alpha:]+):([0-9]+)\)"
[[ $line1 =~ $regex ]]
echo ${BASH_REMATCH[0]}
I don't get any match. What am I doing wrong? How can I write a Bash-compatible regex to do this?
You are right, Bash uses POSIX ERE and does not support \d shorthand character class, nor does it support non-capturing groups. See more regex features unsupported in POSIX ERE/BRE in this post.
Use
.*\((.+):([0-9]+)\)
Or even (if you need to grab the first (...) substring in a string):
\(([^()]+):([0-9]+)\)
Details
.* - any 0+ chars, as many as possible (may be omitted, only necessary if there are other (...) substrings and you only need to grab the last one)
\( - a ( char
(.+) - Group 1 (${BASH_REMATCH[1]}): any 1+ chars as many as possible
: - a colon
([0-9]+) - Group 2 (${BASH_REMATCH[2]}): 1+ digits
\) - a ) char.
See the Bash demo (or this one):
test='0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93).'
reg='.*\((.+):([0-9]+)\)'
# reg='\(([^()]+):([0-9]+)\)' # This also works for the current scenario
if [[ $test =~ $reg ]]; then
echo ${BASH_REMATCH[1]};
echo ${BASH_REMATCH[2]};
fi
Output:
/path/to/my/file.c
93
In the first pattern you use \S+ which matches a non whitespace char. That is a broad match and will also match for example / which is not taken into account in the second pattern.
The pattern starts with [:alpha:] but the first char is a 0. You could use [:alnum:] instead. Since the repetition should also match _ that could be added as well.
Note that when using a quantifier for a capturing group, the group captures the last value of the iteration. So when using {1,5} you use that quantifier only for the repetition. Its value would be some_function
You might use:
^([[:alnum:]_]+[[:space:]]){1,5}\(((/[[:alpha:]]+)+\.[[:alpha:]]):([[:digit:]]+)\)\.$
Regex demo | Bash demo
Your code could look like
line1="0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93)."
regex="^([[:alnum:]_]+[[:space:]]){1,5}\(((/[[:alpha:]]+)+\.[[:alpha:]]):([[:digit:]]+)\)\.$"
[[ $line1 =~ $regex ]]
echo ${BASH_REMATCH[2]}
echo ${BASH_REMATCH[4]}
Result
/path/to/my/file.c
93
Or a bit shorter version using \S and the values are in group 2 and 3
^([[:alnum:]_]+[[:space:]]){1,5}\((\S+\.[[:alpha:]]):([[:digit:]]+)\)\.$
Explanation
^ Start of string
([[:alnum:]_]+[[:space:]]){1,5} Repeat 1-5 times what is captured in group 1
\( match (
(\S+\.[[:alpha:]]) Capture group 2 Match 1+ non whitespace chars, . and an alphabetic character
: Match :
([[:digit:]]+) Capture group 3 Match 1+ digits
\)\. Match ).
$ End of string
See this page about bracket expressions
Regex demo
I am trying to extract some custom properties from a PDF with regex (I will use grep).
PDF custom properties are a key-value stored in this format:
<</key1(value1)/key2(value2)/key3(value3)>>
Parenthesis inside values are escaped:
/key4(outside \(inside\) outside)
I did the following regex to extract the value of a key:
grep -Po '(?<=key4\().*?(?=\))' "sample.txt"
However when applying it to the key4 (with parenthesis) it yields:
outside \(inside\
Because it stops in the first ) (the one that is escaped) and not in the unescaped one.
How can I ignore in my regex the escaped parenthesis?
Thank you in advance.
PD: I am open to suggestions in sed or awk.
You can do it like this
(?<=key4\()[^\\()]*(?:\\[\S\s][^\\()]*)*(?=\))
https://regex101.com/r/B4qKdh/1
Expanded:
(?<= key4\( )
[^\\()]*
(?: \\ [\S\s] [^\\()]* )*
(?= \) )
You may use a sed solution like
sed 's/.*key4(\([^\()]*\(\\.[^\()]*\)*\)).*/\1/'
sed -E 's/.*key4\(([^\()]*(\\.[^\()]*)*)\).*/\1/'
See the online sed demo.
POSIX ERE pattern details
.* - any 0+ chars
key4\( - key( literal string
\( - a(` char
([^\()]*(\\.[^\()]*)*) - Group 1:
[^\()]* - 0 or more chars other than \, ( and )
(\\.[^\()]*)* - 0 or more repetitions of
\\. - a \ followed with any 1 char
[^\()]* - 0 or more chars other than \, ( and )
\) - a ) char
.* - any 0+ chars
Note that POSIX BRE pattern just has literal and capturing parentheses escaping swapped (( in POSIX BRE matches a literal ( char, it is not start of a capturing group).
The \1 in the replacement part is the Group 1 placeholder and replaces the whole match with that group value.
With any awk in any shell on any UNIX box:
$ awk '
{ gsub(/\\[(]/,"\n1"); gsub(/\\)/,"\n2") }
match($0,/[/]key4[(][^)]+/) {
$0 = substr($0,RSTART+6,RLENGTH-6)
gsub(/\n1/,"\\("); gsub(/\n2/,"\\)")
print
}
' file
outside \(inside\) outside
With GNU awk for the 3rd arg to match():
$ awk '
{ gsub(/\\[(]/,"\n1"); gsub(/\\)/,"\n2") }
match($0,/[/]key4[(]([^)]+)/,a) {
$0 = a[1]
gsub(/\n1/,"\\("); gsub(/\n2/,"\\)")
print
}
' file
outside \(inside\) outside
The above just replace \( and \) with strings that contain newlines (which cannot exist with newline separated records) \n1 and \n2, then finds the match for key4, then puts the replacement strings back to their original values before printing.
Requirements: only grep/cut/join/regex.
I have data like this:
798 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
15386 /usr/bin/nautilus --gapplication-service
16051 /usr/bin/zeitgeist-daemon
I want to extract rows data from the number to second ending space, like
798 /usr/bin/dbus-daemon
using only grep/cut/join with or without regex.
I have tried
grep -oe "[^ ][^ ]* *[a-zA-Z\]*$"
but the result isn't as expected.
You may use
# With GNU grep:
grep -oP '^\s*\K\S+\s+\S+' <<< "$s"
# With a POSIX ERE pattern:
grep -oE '[0-9][^ ]* +[^ ]+' <<< "$s"
See the online demo
o - match output mode (not line)
P - PCRE regex engine is used to parse the pattern
The PCRE pattern details:
^ - start of line
\s* - 0+ whitespaces
\K - match reset operator discarding the whole text matched so far
\S+ - 1+ non-whitespace chars
\s+\S+ - 1+ whitespaces and 1+ non-whitespace chars.
The POSIX ERE pattern matches
[0-9] - a digit
[^ ]* - 0+ chars other than space
+ - 1 or more spaces
[^ ]+ - 1+ chars other than a space.
I'm trying to extract any numerical values between a pattern match in a text file.
Parsed Log File Text
> GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z.nmmb_2p5km.f25.conus.grib2
I want to pull the 25 from f25 in nmmb_2p5km.f25.conus.grib2
Attempted Code
sed -e 's/nmmb_2p5km\(.*\)grib2/\1/'
You may use
log="GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z.nmmb_2p5km.f25.conus.grib2"
sed 's/.*nmmb_2p5km[^0-9]*\([0-9]*\)[^0-9]*grib2.*/\1/' <<< "$log"
The .*nmmb_2p5km[^0-9]*\([0-9]*\)[^0-9]*grib2.* pattern matches
.* - any 0+ chars
nmmb_2p5km - a literal substring
[^0-9]* - 0+ non-digit chars
\([0-9]*\) - Capturing group 1 (later referred to with \1 from the replacement pattern): 0+ digits
[^0-9]* - 0+ non-digit chars
grib2.* - grib2 and any 0+ chars.
Alternatively, you may use grep with a PCRE pattern like
grep -Po 'nmmb_2p5km\D*\K\d+' <<< "$log"
Details
nmmb_2p5km - a literal substring
\D* - 0+ non-digit chars
\K - match reset oeprator discarding all text matched so far
\d+ - 1+ digits.
See the online sed and grep demo.
Using perl one-liner
> export log="GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z.nmmb_2p5km.f25.conus.grib2"
> perl -ne ' BEGIN { $x=$ENV{log};$x=~s/(.+?)(\d+)\.conus\.(.+)/\2/g; print "$x\n"; exit } '
25
>
Why this doesn't match the capturing group?
grep -rPo 'ServerMethod\(me\.[a-zA-Z]*\.([a-zA-Z]*)\)'
it returns :
test.js:ServerMethod(me.obProcedures.SaveProcess)
test.js:ServerMethod(me.obProcedures.Commit)
but I need just:
SaveProcess
Commit
cygwin version:
2.5.2(0.297/5/3)
It happens so because grep does not return capture group contents, only the whole matches.
You may use \K match reset operator and and a positive lookahead instead:
grep -Po 'ServerMethod\(me\.[a-zA-Z]*\.\K[a-zA-Z]+(?=\))'
See the online demo
Details:
ServerMethod\(me\. - matches a literal string ServerMethod(me.
[a-zA-Z]* - 0 or more ASCII letters
\. - a literal dot
\K - omits the text matched so far from the match
[a-zA-Z]+ - 1 or more ASCII letters
(?=\)) - a positive lookahead that requires a ) immediately to the right of the current location, but does not add it to the match (as it is a non-consuming pattern).
Alternatively, as a PCRE grep option is not always available, use sed with grep:
grep 'ServerMethod(me\.' | sed 's/.*ServerMethod(me\.[a-zA-Z]*\.\([a-zA-Z]*\)).*/\1/'
See another demo.
Here, the patterns are POSIX BRE compliant:
ServerMethod(me\. - matches a literal ServerMethod(me. text, grep gets the lines with this text
.*ServerMethod(me\.[a-zA-Z]*\.\([a-zA-Z]*\)).* - matches a line that has
.* - any 0+ chars as many as possible
ServerMethod(me\. - a literal ServerMethod(me. text
[a-zA-Z]* - 0+ ASCII letters
\. - a literal dot
\([a-zA-Z]*\) - Capturing group 1 (referred to via \1): 0+ ASCII letters
) - a literal )
.* - any 0+ chars as many as possible