How to write a regex for this? - regex

Requirements: only grep/cut/join/regex.
I have data like this:
798 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
15386 /usr/bin/nautilus --gapplication-service
16051 /usr/bin/zeitgeist-daemon
I want to extract, from each row, the text from the number up to the second space, like
798 /usr/bin/dbus-daemon
using only grep/cut/join with or without regex.
I have tried
grep -oe "[^ ][^ ]* *[a-zA-Z\]*$"
but the result isn't as expected.

You may use
# With GNU grep:
grep -oP '^\s*\K\S+\s+\S+' <<< "$s"
# With a POSIX ERE pattern:
grep -oE '[0-9][^ ]* +[^ ]+' <<< "$s"
See the online demo
-o - output only the matched part of each line, not the whole line
-P - the PCRE regex engine is used to parse the pattern
The PCRE pattern details:
^ - start of line
\s* - 0+ whitespaces
\K - match reset operator discarding the whole text matched so far
\S+ - 1+ non-whitespace chars
\s+\S+ - 1+ whitespaces and 1+ non-whitespace chars.
The POSIX ERE pattern matches
[0-9] - a digit
[^ ]* - 0+ chars other than space
' +' (a space followed by +) - 1 or more spaces
[^ ]+ - 1+ chars other than a space.
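As a quick check against the sample rows, here is a minimal sketch that runs the POSIX ERE pattern above and, for comparison, a plain cut call (cut is one of the allowed tools). The variable names are illustrative only, and the cut variant assumes the fields are separated by exactly one space with no leading blanks.

```shell
# Sample rows from the question (illustrative variable name)
s='798 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
15386 /usr/bin/nautilus --gapplication-service
16051 /usr/bin/zeitgeist-daemon'

# POSIX ERE: a token starting with a digit, 1+ spaces, then the next token
ere_out=$(printf '%s\n' "$s" | grep -oE '[0-9][^ ]* +[^ ]+')

# cut: keep the first two space-delimited fields
cut_out=$(printf '%s\n' "$s" | cut -d' ' -f1,2)

printf '%s\n' "$ere_out"
```

Both commands give the same three rows for this input. The regex variant also tolerates leading blanks and repeated spaces, but with -o it would print a second match if a later token on the same line started with a digit.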

Related

How to choose the right words in regular expressions?

How can I use grep to get numbers (not strings) that do not contain 3 or 7?
I tried
grep -o '[[:digit:]^37]*' test
but it does not work.
If you have a GNU grep, you can use
grep -oP '\b[^\D37]+\b' file
The grep -oP '\b[^[:^digit:]37]+\b' command is equivalent.
Details:
\b - a word boundary (may be replaced with (?<!\d) if you simply want to make sure there are no other digits immediately on the left)
[^ - start of a negated bracket expression that matches chars other than:
\D - any non-digit char
37 - 3 and 7
]+ - end of the bracket expression, repeat one or more times
\b - a word boundary (may be replaced with (?!\d) if you simply want to make sure there are no other digits immediately on the right).
See the online demo:
s='123 456 857 112 i21.'
grep -oP '\b[^\D37]+\b' <<< "$s"
Output:
456
112
To use the same approach for letters, replace \D with \P{L} or [:^digit:] with [:^alpha:].
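If PCRE support is not available, the same result can probably be obtained by listing the allowed digits explicitly instead of negating \D. The sketch below assumes a grep that supports the -o and -w options (GNU and BSD grep do; they are not required by POSIX), and the variable names are illustrative.

```shell
s='123 456 857 112 i21.'

# -o: print only the matches; -w: accept only whole-word matches
# [0-24-689] lists every digit except 3 and 7
out=$(printf '%s\n' "$s" | grep -owE '[0-24-689]+')

printf '%s\n' "$out"
```

Here -w plays the role of the two \b word boundaries, so 123 and 857 are rejected as a whole and i21 is skipped because the digits are glued to a letter.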

Bash regex for same sender and receiver with backreference

I am trying to write a regex (it has to be a regex because I need it for fail2ban) that matches when
the receiver and the sender are the same person:
echo "from=<test#test.ch> to=<test#test.ch>" | grep -E -o '([^=]*\s)[ ]*\1'
What am I doing wrong?
You might use a pattern to match the format of the string between the brackets with a backreference to that capture.
from(=<[^\s#<>]+#[^\s#<>]+>)\s*to\1
Explanation
from Match literally
( Capture group 1
=< Match literally
[^\s#<>]+ Match 1+ times any char except a whitespace char or # < >
# Match literally
[^\s#<>]+ Again match 1+ times any char except a whitespace char or # < >
> Match literally
) Close group 1
\s*to\1 Match 0+ whitespace chars, to and the backreference to group 1
Use grep -P instead of -E for Perl compatible regular expressions.
For example
echo "from=<test#test.ch> to=<test#test.ch>" | grep -oP 'from(=<[^\s#<>]+#[^\s#<>]+>)\s*to\1'
A bit broader match could be capturing what is between the brackets
[^=\s]+(=<[^<>]+>)\s*[^=\s]+\1
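To see the backreference accept and reject input, here is a small sketch that rewrites the first pattern as a POSIX BRE, where backreferences are standard (POSIX ERE does not define them, although GNU grep -E accepts them). The check function, the variable names and the second sample address are made up for illustration, and \s is replaced by a literal space.

```shell
# BRE: \( \) capture, \{1,\} means "1 or more", \1 repeats the captured text
pattern='from\(=<[^ #<>]\{1,\}#[^ #<>]\{1,\}>\) *to\1'

# Hypothetical helper: reports whether sender and receiver are identical
check() {
  if printf '%s\n' "$1" | grep -q "$pattern"; then
    echo same
  else
    echo different
  fi
}

r1=$(check 'from=<test#test.ch> to=<test#test.ch>')
r2=$(check 'from=<alice#test.ch> to=<bob#test.ch>')
echo "$r1 $r2"
```

The second line fails because \1 must repeat the exact text captured after from, not merely something of the same shape.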

How to ignore escaped parenthesis in a regex

I am trying to extract some custom properties from a PDF with regex (I will use grep).
PDF custom properties are a key-value stored in this format:
<</key1(value1)/key2(value2)/key3(value3)>>
Parentheses inside values are escaped:
/key4(outside \(inside\) outside)
I did the following regex to extract the value of a key:
grep -Po '(?<=key4\().*?(?=\))' "sample.txt"
However, when applying it to key4 (whose value contains parentheses), it yields:
outside \(inside\
because it stops at the first ) (the escaped one) and not at the unescaped one.
How can I make my regex ignore the escaped parentheses?
Thank you in advance.
PS: I am open to suggestions in sed or awk.
You can do it like this
(?<=key4\()[^\\()]*(?:\\[\S\s][^\\()]*)*(?=\))
https://regex101.com/r/B4qKdh/1
Expanded:
(?<= key4\( )
[^\\()]*
(?: \\ [\S\s] [^\\()]* )*
(?= \) )
You may use a sed solution like
sed 's/.*key4(\([^\()]*\(\\.[^\()]*\)*\)).*/\1/'
sed -E 's/.*key4\(([^\()]*(\\.[^\()]*)*)\).*/\1/'
See the online sed demo.
POSIX ERE pattern details
.* - any 0+ chars
key4 - a literal key4 string
\( - a literal ( char
([^\()]*(\\.[^\()]*)*) - Group 1:
[^\()]* - 0 or more chars other than \, ( and )
(\\.[^\()]*)* - 0 or more repetitions of
\\. - a \ followed with any 1 char
[^\()]* - 0 or more chars other than \, ( and )
\) - a literal ) char
.* - any 0+ chars
Note that in a POSIX BRE pattern the escaping of literal and capturing parentheses is swapped: an unescaped ( matches a literal ( char, while \( starts a capturing group.
The \1 in the replacement part is the Group 1 placeholder and replaces the whole match with that group value.
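As a sanity check, the ERE variant can be run against a line shaped like the one in the question. This is only a sketch: the sample line is assembled from the fragments shown in the question, and the variable names are illustrative.

```shell
# A line in the key(value) format, with escaped parentheses inside key4
line='<</key1(value1)/key4(outside \(inside\) outside)/key3(value3)>>'

# Group 1 collects plain chars and backslash-escaped pairs up to the
# first unescaped ), and \1 replaces the whole line with that group
value=$(printf '%s\n' "$line" | sed -E 's/.*key4\(([^\()]*(\\.[^\()]*)*)\).*/\1/')

printf '%s\n' "$value"
```

Because each \\. consumes the backslash together with the character after it, an escaped \) can never be taken for the closing parenthesis.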
With any awk in any shell on any UNIX box:
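$ awk '
    # hide escaped parens behind placeholders that cannot occur in a record
    { gsub(/\\[(]/,"\n1"); gsub(/\\[)]/,"\n2") }
    match($0,/[/]key4[(][^)]+/) {
        $0 = substr($0,RSTART+6,RLENGTH-6)    # drop the leading "/key4("
        gsub(/\n1/,"\\("); gsub(/\n2/,"\\)")  # restore the escaped parens
        print
    }
' file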
outside \(inside\) outside
With GNU awk for the 3rd arg to match():
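$ awk '
    # hide escaped parens behind placeholders that cannot occur in a record
    { gsub(/\\[(]/,"\n1"); gsub(/\\[)]/,"\n2") }
    match($0,/[/]key4[(]([^)]+)/,a) {
        $0 = a[1]                             # a[1] holds the captured value
        gsub(/\n1/,"\\("); gsub(/\n2/,"\\)")  # restore the escaped parens
        print
    }
' file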
outside \(inside\) outside
The above just replaces \( and \) with the placeholder strings \n1 and \n2 (they contain newlines, so they cannot already occur in newline-separated records), then finds the match for key4, and finally puts the original values back in place of the placeholders before printing.

Extract Values Between Pattern Match

I'm trying to extract any numerical values between a pattern match in a text file.
Parsed Log File Text
> GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z.nmmb_2p5km.f25.conus.grib2
I want to pull the 25 from f25 in nmmb_2p5km.f25.conus.grib2
Attempted Code
sed -e 's/nmmb_2p5km\(.*\)grib2/\1/'
You may use
log="GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z.nmmb_2p5km.f25.conus.grib2"
sed 's/.*nmmb_2p5km[^0-9]*\([0-9]*\)[^0-9]*grib2.*/\1/' <<< "$log"
The .*nmmb_2p5km[^0-9]*\([0-9]*\)[^0-9]*grib2.* pattern matches
.* - any 0+ chars
nmmb_2p5km - a literal substring
[^0-9]* - 0+ non-digit chars
\([0-9]*\) - Capturing group 1 (later referred to with \1 from the replacement pattern): 0+ digits
[^0-9]* - 0+ non-digit chars
grib2.* - grib2 and any 0+ chars.
Alternatively, you may use grep with a PCRE pattern like
grep -Po 'nmmb_2p5km\D*\K\d+' <<< "$log"
Details
nmmb_2p5km - a literal substring
\D* - 0+ non-digit chars
\K - match reset operator discarding all text matched so far
\d+ - 1+ digits.
See the online sed and grep demo.
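To make the difference from the attempted command concrete, the sketch below runs both sed commands on the same log line. The variable names are illustrative.

```shell
log='GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z.nmmb_2p5km.f25.conus.grib2'

# Attempt from the question: only the matched part is replaced, so the
# text before nmmb_2p5km and the non-digit chars around 25 survive
attempt=$(printf '%s\n' "$log" | sed -e 's/nmmb_2p5km\(.*\)grib2/\1/')

# Fix: .* on both sides consumes the rest of the line, and [^0-9]* keeps
# the non-digit chars outside the capturing group
fixed=$(printf '%s\n' "$log" | sed 's/.*nmmb_2p5km[^0-9]*\([0-9]*\)[^0-9]*grib2.*/\1/')

printf '%s\n' "$attempt" "$fixed"
```

The first command leaves GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z..f25.conus. while the second prints just 25.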
Using a Perl one-liner:
> export log="GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z.nmmb_2p5km.f25.conus.grib2"
> perl -ne ' BEGIN { $x=$ENV{log};$x=~s/(.+?)(\d+)\.conus\.(.+)/$2/g; print "$x\n"; exit } '
25

Grep regex capturing issue

Why doesn't this return only the capturing group?
grep -rPo 'ServerMethod\(me\.[a-zA-Z]*\.([a-zA-Z]*)\)'
It returns:
test.js:ServerMethod(me.obProcedures.SaveProcess)
test.js:ServerMethod(me.obProcedures.Commit)
but I need just:
SaveProcess
Commit
cygwin version:
2.5.2(0.297/5/3)
This happens because grep -o prints whole matches only; it does not return capture group contents.
You may use the \K match reset operator and a positive lookahead instead:
grep -Po 'ServerMethod\(me\.[a-zA-Z]*\.\K[a-zA-Z]+(?=\))'
See the online demo
Details:
ServerMethod\(me\. - matches a literal string ServerMethod(me.
[a-zA-Z]* - 0 or more ASCII letters
\. - a literal dot
\K - omits the text matched so far from the match
[a-zA-Z]+ - 1 or more ASCII letters
(?=\)) - a positive lookahead that requires a ) immediately to the right of the current location, but does not add it to the match (as it is a non-consuming pattern).
Alternatively, as a PCRE grep option is not always available, use sed with grep:
grep 'ServerMethod(me\.' | sed 's/.*ServerMethod(me\.[a-zA-Z]*\.\([a-zA-Z]*\)).*/\1/'
See another demo.
Here, the patterns are POSIX BRE compliant:
ServerMethod(me\. - matches a literal ServerMethod(me. text, grep gets the lines with this text
.*ServerMethod(me\.[a-zA-Z]*\.\([a-zA-Z]*\)).* - matches a line that has
.* - any 0+ chars as many as possible
ServerMethod(me\. - a literal ServerMethod(me. text
[a-zA-Z]* - 0+ ASCII letters
\. - a literal dot
\([a-zA-Z]*\) - Capturing group 1 (referred to via \1): 0+ ASCII letters
) - a literal )
.* - any 0+ chars as many as possible
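The grep stage is there only to drop lines that do not contain the text. As a possible simplification, sed can do the filtering itself with -n and the p flag, which prints a line only when the substitution succeeded. A minimal sketch with made-up input lines and illustrative variable names:

```shell
# Hypothetical file contents: two matching lines and one unrelated line
src='foo.ServerMethod(me.obProcedures.SaveProcess);
unrelated line
bar = ServerMethod(me.obProcedures.Commit)'

# -n suppresses the default output; the trailing p prints only the
# lines where the s command actually replaced something
names=$(printf '%s\n' "$src" | sed -n 's/.*ServerMethod(me\.[a-zA-Z]*\.\([a-zA-Z]*\)).*/\1/p')

printf '%s\n' "$names"
```

For this input it prints SaveProcess and Commit, one per line, and the unrelated line is skipped without a separate grep.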