Extract Values Between Pattern Match - regex

I'm trying to extract any numerical values between a pattern match in a text file.
Parsed Log File Text
> GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z.nmmb_2p5km.f25.conus.grib2
I want to pull the 25 from f25 in nmmb_2p5km.f25.conus.grib2
Attempted Code
sed -e 's/nmmb_2p5km\(.*\)grib2/\1/'

You may use
log="GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z.nmmb_2p5km.f25.conus.grib2"
sed 's/.*nmmb_2p5km[^0-9]*\([0-9]*\)[^0-9]*grib2.*/\1/' <<< "$log"
The .*nmmb_2p5km[^0-9]*\([0-9]*\)[^0-9]*grib2.* pattern matches
.* - any 0+ chars
nmmb_2p5km - a literal substring
[^0-9]* - 0+ non-digit chars
\([0-9]*\) - Capturing group 1 (later referred to with \1 from the replacement pattern): 0+ digits
[^0-9]* - 0+ non-digit chars
grib2.* - grib2 and any 0+ chars.
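One caveat with the sed command above: when the pattern does not match, sed prints the input line unchanged. A minimal sketch of a variant that prints only on a successful match, using -n together with the p flag and requiring at least one digit. The helper name forecast_hour is hypothetical and not part of the original answer:

```shell
# forecast_hour: hypothetical helper; prints the number after nmmb_2p5km, or nothing.
forecast_hour() {
  # -n suppresses default output; the trailing p prints only substituted lines.
  # [0-9][0-9]* requires at least one digit (portable BRE spelling of [0-9]+).
  printf '%s\n' "$1" |
    sed -n 's/.*nmmb_2p5km[^0-9]*\([0-9][0-9]*\)[^0-9]*grib2.*/\1/p'
}

log="GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z.nmmb_2p5km.f25.conus.grib2"
forecast_hour "$log"              # prints 25
forecast_hour "GET /index.html"   # prints nothing
```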
Alternatively, you may use grep with a PCRE pattern like
grep -Po 'nmmb_2p5km\D*\K\d+' <<< "$log"
Details
nmmb_2p5km - a literal substring
\D* - 0+ non-digit chars
\K - match reset operator discarding all text matched so far
\d+ - 1+ digits.
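grep -P needs a GNU grep built with PCRE support. Where that is not available, roughly the same extraction can be sketched in POSIX awk. This is an assumed alternative, not part of the original answer:

```shell
log="GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z.nmmb_2p5km.f25.conus.grib2"

# match() sets RSTART/RLENGTH for: nmmb_2p5km, any non-digits, then 1+ digits.
hour=$(printf '%s\n' "$log" | awk '
  match($0, /nmmb_2p5km[^0-9]*[0-9]+/) {
    m = substr($0, RSTART, RLENGTH)   # here: nmmb_2p5km.f25
    sub(/.*[^0-9]/, "", m)            # drop everything up to the last non-digit
    print m
  }')
echo "$hour"   # prints 25
```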

Using a Perl one-liner:
> export log="GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z.nmmb_2p5km.f25.conus.grib2"
> perl -e '$x = $ENV{log}; $x =~ s/(.+?)(\d+)\.conus\.(.+)/$2/; print "$x\n"'
25

Related

Bash regex for same sender and receiver with backreference

I am trying to write a regex (it has to be a regex because I need it for fail2ban) that matches when
the receiver and the sender are the same person:
echo "from=<test#test.ch> to=<test#test.ch>" | grep -E -o '([^=]*\s)[ ]*\1'
What am I doing wrong?
You might use a pattern to match the format of the string between the brackets with a backreference to that capture.
from(=<[^\s#<>]+#[^\s#<>]+>)\s*to\1
Explanation
from Match literally
( Capture group 1
=< Match literally
[^\s#<>]+ Match 1+ times any char except a whitespace char or # < >
# Match literally
[^\s#<>]+ Again match 1+ times any char except a whitespace char or # < >
> Match literally
) Close group 1
\s*to\1 Match 0+ whitespace chars, to and the backreference to group 1
Use grep -P instead of -E for Perl compatible regular expressions.
For example
echo "from=<test#test.ch> to=<test#test.ch>" | grep -oP 'from(=<[^\s#<>]+#[^\s#<>]+>)\s*to\1'
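If grep -P is unavailable, back-references are also part of POSIX BRE, so plain grep can express the same check. A sketch under that assumption, with [[:space:]] standing in for \s and \{1,\} for +. The helper name same_party is hypothetical:

```shell
# same_party: hypothetical helper; succeeds when from and to carry the same address.
same_party() {
  printf '%s\n' "$1" |
    grep -q 'from\(=<[^[:space:]#<>]\{1,\}#[^[:space:]#<>]\{1,\}>\)[[:space:]]*to\1'
}

same_party "from=<test#test.ch> to=<test#test.ch>" && echo "same"
same_party "from=<a#test.ch> to=<b#test.ch>" || echo "different"
```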
A bit broader match could be capturing what is between the brackets
[^=\s]+(=<[^<>]+>)\s*[^=\s]+\1

Removing leading and trailing white spaces as well as leading zeroes

I am trying to remove the leading zeroes from a value like 0011223344, as well as any leading and trailing white space:
/^0+(?=[0-9])/
s/^\s+|\s+$//
How can I combine the two to get the following output?
11223344
You may use this regex in perl:
s/^\h*0*(?=\d)|\h+$//g
RegEx Details:
\h matches a horizontal whitespace
^\h*0*(?=\d): At the start it will match 0 or more leading whitespaces followed by 0 or more leading zeroes as long as there is at least one digit ahead
| OR
\h+$: At the end it will match 1+ horizontal whitespaces
Examples:
perl -pe 's/^\h*0+(?=\d)|\h+$//g' <<< ' 001 '
1
perl -pe 's/^\h*0+(?=\d)|\h+$//g' <<< ' 000 '
0
perl -pe 's/^\h*0+(?=\d)|\h+$//g' <<< ' 0000000123 '
123
perl -pe 's/^\h*0*(?=\d)|\h+$//g' <<< ' 123 '
123
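Where Perl is not at hand, the same clean-up can be approximated without a lookahead. A minimal sketch with sed -E (supported by GNU and BSD sed), assuming one value per line: it trims the whitespace first and then drops leading zeroes while keeping the last digit. The helper name strip_num is hypothetical:

```shell
# strip_num: hypothetical helper; trims whitespace, then strips leading zeroes.
strip_num() {
  printf '%s\n' "$1" |
    sed -E 's/^[[:space:]]+//; s/[[:space:]]+$//; s/^0+([0-9])/\1/'
}

strip_num ' 001 '          # prints 1
strip_num ' 000 '          # prints 0
strip_num ' 0011223344 '   # prints 11223344
strip_num ' 123 '          # prints 123
```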
You may use
s/^[\s0]+|\s+$//g
Or, for a corner case like ' 0000 ' where you would still like to keep one zero:
s/^(?:\s*(0)+\s*$|[\s0]+)|\s+$/$1/g
Details
^[\s0]+ - matches one or more zeros or whitespace at the start of the string
^(?:\s*(0)+\s*$|[\s0]+) - matches
^ - start of string
(?:\s*(0)+\s*$|[\s0]+) - either of
\s*(0)+\s*$ - 0+ whitespaces, 1 or more zeros each time captured into Group 1, and then 0+ whitespaces till end of string
| - or
[\s0]+ - 1 or more whitespaces or zeros
| - or
\s+$ - matches one or more whitespace chars at the end of string.

How to ignore escaped parenthesis in a regex

I am trying to extract some custom properties from a PDF with regex (I will use grep).
PDF custom properties are a key-value stored in this format:
<</key1(value1)/key2(value2)/key3(value3)>>
Parentheses inside values are escaped:
/key4(outside \(inside\) outside)
I did the following regex to extract the value of a key:
grep -Po '(?<=key4\().*?(?=\))' "sample.txt"
However when applying it to the key4 (with parenthesis) it yields:
outside \(inside\
Because it stops in the first ) (the one that is escaped) and not in the unescaped one.
How can I ignore in my regex the escaped parenthesis?
Thank you in advance.
P.S.: I am open to suggestions in sed or awk.
You can do it like this
(?<=key4\()[^\\()]*(?:\\[\S\s][^\\()]*)*(?=\))
https://regex101.com/r/B4qKdh/1
Expanded:
(?<= key4\( )
[^\\()]*
(?: \\ [\S\s] [^\\()]* )*
(?= \) )
You may use a sed solution like
sed 's/.*key4(\([^\()]*\(\\.[^\()]*\)*\)).*/\1/'
sed -E 's/.*key4\(([^\()]*(\\.[^\()]*)*)\).*/\1/'
POSIX ERE pattern details
.* - any 0+ chars
key4\( - a literal key4( string
([^\()]*(\\.[^\()]*)*) - Group 1:
[^\()]* - 0 or more chars other than \, ( and )
(\\.[^\()]*)* - 0 or more repetitions of
\\. - a \ followed with any 1 char
[^\()]* - 0 or more chars other than \, ( and )
\) - a ) char
.* - any 0+ chars
Note that the POSIX BRE pattern simply swaps the escaping of literal and capturing parentheses: in BRE an unescaped ( matches a literal ( char, while \( starts a capturing group.
The \1 in the replacement part is the Group 1 placeholder and replaces the whole match with that group value.
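The sed -E command can be wrapped so that the key becomes a parameter. A sketch under the assumption that key names contain no regex metacharacters. The helper name pdf_value is hypothetical, and -n with the p flag makes it print nothing when the key is absent:

```shell
# pdf_value: hypothetical helper; $1 is the key name, properties come from stdin.
pdf_value() {
  # The key is spliced between two single-quoted halves of the sed script.
  sed -nE 's/.*\/'"$1"'\(([^\()]*(\\.[^\()]*)*)\).*/\1/p'
}

props='<</key1(value1)/key2(value2)/key4(outside \(inside\) outside)>>'
printf '%s\n' "$props" | pdf_value key4   # prints outside \(inside\) outside
printf '%s\n' "$props" | pdf_value key1   # prints value1
```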
With any awk in any shell on any UNIX box:
$ awk '
{ gsub(/\\[(]/,"\n1"); gsub(/\\[)]/,"\n2") }
match($0,/[/]key4[(][^)]+/) {
$0 = substr($0,RSTART+6,RLENGTH-6)
gsub(/\n1/,"\\("); gsub(/\n2/,"\\)")
print
}
' file
outside \(inside\) outside
With GNU awk for the 3rd arg to match():
$ awk '
{ gsub(/\\[(]/,"\n1"); gsub(/\\[)]/,"\n2") }
match($0,/[/]key4[(]([^)]+)/,a) {
$0 = a[1]
gsub(/\n1/,"\\("); gsub(/\n2/,"\\)")
print
}
' file
outside \(inside\) outside
The above just replaces \( and \) with the placeholder strings \n1 and \n2 (which contain newlines and so cannot occur in newline-separated records), then finds the match for key4, and finally restores the placeholders to their original values before printing.

How to write a regex for this?

Requirements: only grep/cut/join/regex.
I have data like this:
798 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
15386 /usr/bin/nautilus --gapplication-service
16051 /usr/bin/zeitgeist-daemon
From each row I want to extract everything from the number up to the second space, like
798 /usr/bin/dbus-daemon
using only grep/cut/join with or without regex.
I have tried
grep -oe "[^ ][^ ]* *[a-zA-Z\]*$"
but the result isn't as expected.
You may use
# With GNU grep:
grep -oP '^\s*\K\S+\s+\S+' <<< "$s"
# With a POSIX ERE pattern:
grep -oE '[0-9][^ ]* +[^ ]+' <<< "$s"
-o - output only the matched text rather than the whole line
-P - use the PCRE regex engine to parse the pattern
The PCRE pattern details:
^ - start of line
\s* - 0+ whitespaces
\K - match reset operator discarding the whole text matched so far
\S+ - 1+ non-whitespace chars
\s+\S+ - 1+ whitespaces and 1+ non-whitespace chars.
The POSIX ERE pattern matches
[0-9] - a digit
[^ ]* - 0+ chars other than space
(space)+ - 1 or more spaces
[^ ]+ - 1+ chars other than a space.
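Since cut is among the allowed tools, a sketch with it is also possible, assuming the first two fields are separated by exactly one space and the lines have no leading whitespace:

```shell
s='798 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
15386 /usr/bin/nautilus --gapplication-service
16051 /usr/bin/zeitgeist-daemon'

# -d' ' splits on single spaces; -f1,2 keeps the first two fields.
out=$(printf '%s\n' "$s" | cut -d' ' -f1,2)
printf '%s\n' "$out"
```

Unlike the grep patterns, cut treats every single space as a separator, so runs of several spaces would produce empty fields.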

Grep regex capturing issue

Why doesn't this return just the capturing group?
grep -rPo 'ServerMethod\(me\.[a-zA-Z]*\.([a-zA-Z]*)\)'
It returns:
test.js:ServerMethod(me.obProcedures.SaveProcess)
test.js:ServerMethod(me.obProcedures.Commit)
but I need just:
SaveProcess
Commit
cygwin version:
2.5.2(0.297/5/3)
This happens because grep does not return capture group contents, only whole matches.
You may use the \K match reset operator and a positive lookahead instead:
grep -Po 'ServerMethod\(me\.[a-zA-Z]*\.\K[a-zA-Z]+(?=\))'
Details:
ServerMethod\(me\. - matches a literal string ServerMethod(me.
[a-zA-Z]* - 0 or more ASCII letters
\. - a literal dot
\K - omits the text matched so far from the match
[a-zA-Z]+ - 1 or more ASCII letters
(?=\)) - a positive lookahead that requires a ) immediately to the right of the current location, but does not add it to the match (as it is a non-consuming pattern).
Alternatively, as a PCRE grep option is not always available, use sed with grep:
grep 'ServerMethod(me\.' | sed 's/.*ServerMethod(me\.[a-zA-Z]*\.\([a-zA-Z]*\)).*/\1/'
Here, the patterns are POSIX BRE compliant:
ServerMethod(me\. - matches a literal ServerMethod(me. text, grep gets the lines with this text
.*ServerMethod(me\.[a-zA-Z]*\.\([a-zA-Z]*\)).* - matches a line that has
.* - any 0+ chars as many as possible
ServerMethod(me\. - a literal ServerMethod(me. text
[a-zA-Z]* - 0+ ASCII letters
\. - a literal dot
\([a-zA-Z]*\) - Capturing group 1 (referred to via \1): 0+ ASCII letters
) - a literal )
.* - any 0+ chars as many as possible
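The grep step can also be folded into sed itself: with -n and the p flag, sed prints only the lines where the substitution succeeded. A minimal sketch on inline sample input. The helper name method_names and the sample lines are made up for illustration:

```shell
# method_names: hypothetical helper; reads source lines from stdin.
method_names() {
  # -n plus the p flag: non-matching lines are dropped instead of echoed.
  sed -n 's/.*ServerMethod(me\.[a-zA-Z]*\.\([a-zA-Z]*\)).*/\1/p'
}

printf '%s\n' \
  'x = ServerMethod(me.obProcedures.SaveProcess);' \
  'var unrelated = 1;' \
  'y = ServerMethod(me.obProcedures.Commit);' | method_names
# prints:
# SaveProcess
# Commit
```

Unlike grep -o, this yields at most one result per input line, because the leading .* is greedy and the substitution replaces the whole line.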