How to ignore escaped parentheses in a regex

I am trying to extract some custom properties from a PDF with a regex (I will use grep).
PDF custom properties are key-value pairs stored in this format:
<</key1(value1)/key2(value2)/key3(value3)>>
Parentheses inside values are escaped:
/key4(outside \(inside\) outside)
I wrote the following regex to extract the value of a key:
grep -Po '(?<=key4\().*?(?=\))' "sample.txt"
However, when applied to key4 (whose value contains parentheses), it yields:
outside \(inside\
This is because it stops at the first ) (the escaped one) and not at the unescaped one.
How can I make my regex ignore the escaped parentheses?
Thank you in advance.
PS: I am open to suggestions in sed or awk.

You can do it like this
(?<=key4\()[^\\()]*(?:\\[\S\s][^\\()]*)*(?=\))
https://regex101.com/r/B4qKdh/1
Expanded:
(?<= key4\( )
[^\\()]*
(?: \\ [\S\s] [^\\()]* )*
(?= \) )
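As a quick check, the pattern can be run through grep -P against a line in the question's format. This is only a sketch: it assumes GNU grep built with PCRE support, and the sample line (including key3 and key5) is made up for illustration.

```shell
# Hypothetical sample line in the question's format.
line='<</key3(value3)/key4(outside \(inside\) outside)/key5(value5)>>'

# Unrolled pattern: plain chars first, then any number of
# "backslash + any char" pairs, each followed by more plain chars.
value=$(printf '%s\n' "$line" |
  grep -Po '(?<=key4\()[^\\()]*(?:\\[\S\s][^\\()]*)*(?=\))')

printf '%s\n' "$value"   # prints: outside \(inside\) outside
```

The negated class consumes ordinary characters, and each \\[\S\s] step consumes a backslash together with whatever follows it, so an escaped \) is never taken for the closing parenthesis.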

You may use a sed solution like
sed 's/.*key4(\([^\()]*\(\\.[^\()]*\)*\)).*/\1/'
sed -E 's/.*key4\(([^\()]*(\\.[^\()]*)*)\).*/\1/'
POSIX ERE pattern details
.* - any 0+ chars
key4 - a literal key4 string
\( - a ( char
([^\()]*(\\.[^\()]*)*) - Group 1:
[^\()]* - 0 or more chars other than \, ( and )
(\\.[^\()]*)* - 0 or more repetitions of
\\. - a \ followed with any 1 char
[^\()]* - 0 or more chars other than \, ( and )
\) - a ) char
.* - any 0+ chars
Note that the POSIX BRE pattern simply swaps the escaping of literal and capturing parentheses: in POSIX BRE, ( matches a literal ( char and \( starts a capturing group.
The \1 in the replacement part is the Group 1 placeholder and replaces the whole match with that group value.
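A short run of the ERE variant, as a sketch rather than a definitive recipe. The sample line is hypothetical, and the -n option with the p flag is an addition of this example so that lines without key4 print nothing.

```shell
# Hypothetical sample line in the question's format.
line='<</key3(value3)/key4(outside \(inside\) outside)/key5(value5)>>'

# -n plus the p flag prints only lines where the substitution succeeded.
value=$(printf '%s\n' "$line" |
  sed -nE 's/.*key4\(([^\()]*(\\.[^\()]*)*)\).*/\1/p')

printf '%s\n' "$value"   # prints: outside \(inside\) outside
```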

With any awk in any shell on any UNIX box:
$ awk '
{ gsub(/\\[(]/,"\n1"); gsub(/\\)/,"\n2") }
match($0,/[/]key4[(][^)]+/) {
$0 = substr($0,RSTART+6,RLENGTH-6)
gsub(/\n1/,"\\("); gsub(/\n2/,"\\)")
print
}
' file
outside \(inside\) outside
With GNU awk for the 3rd arg to match():
$ awk '
{ gsub(/\\[(]/,"\n1"); gsub(/\\)/,"\n2") }
match($0,/[/]key4[(]([^)]+)/,a) {
$0 = a[1]
gsub(/\n1/,"\\("); gsub(/\n2/,"\\)")
print
}
' file
outside \(inside\) outside
The above just replaces \( and \) with the strings \n1 and \n2, which contain newlines and so cannot already occur in newline-separated records. It then finds the match for key4 and restores the original \( and \) in the result before printing.
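The same placeholder idea can be wrapped so that the key name is a parameter. The sketch below is an illustration only: get_value is a hypothetical helper, it reads from stdin, and it assumes the key consists of plain alphanumerics (the key is spliced into a dynamic regex, so regex metacharacters in it would be interpreted).

```shell
# get_value is a hypothetical helper: $1 is the key name, input comes from stdin.
get_value() {
  awk -v key="$1" '
    # Hide escaped parentheses behind placeholders that cannot occur in a record.
    { gsub(/\\[(]/, "\n1"); gsub(/\\[)]/, "\n2") }
    # Dynamic regex: "/" key "(" followed by everything up to the next ")".
    match($0, "/" key "[(][^)]+") {
      skip = length(key) + 2            # length of "/" key "("
      val = substr($0, RSTART + skip, RLENGTH - skip)
      gsub(/\n1/, "\\(", val); gsub(/\n2/, "\\)", val)
      print val
    }'
}

printf '%s\n' '<</key4(outside \(inside\) outside)/key5(value5)>>' | get_value key4
# prints: outside \(inside\) outside
```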

Related

Bash regex for same sender and receiver with backreference

I am trying to write a regex (it is important that it is a regex, because I need it for fail2ban) that matches when
the receiver and the sender are the same person:
echo "from=<test#test.ch> to=<test#test.ch>" | grep -E -o '([^=]*\s)[ ]*\1'
What am I doing wrong?
You might use a pattern to match the format of the string between the brackets with a backreference to that capture.
from(=<[^\s#<>]+#[^\s#<>]+>)\s*to\1
Explanation
from Match literally
( Capture group 1
=< Match literally
[^\s#<>]+ Match 1+ times any char except a whitespace char or # < >
# Match literally
[^\s#<>]+ Again match 1+ times any char except a whitespace char or # < >
> Match literally
) Close group 1
\s*to\1 Match 0+ whitespace chars, to and the backreference to group 1
Use grep -P instead of -E for Perl compatible regular expressions.
For example
echo "from=<test#test.ch> to=<test#test.ch>" | grep -oP 'from(=<[^\s#<>]+#[^\s#<>]+>)\s*to\1'
A slightly broader pattern captures whatever is between the angle brackets:
[^=\s]+(=<[^<>]+>)\s*[^=\s]+\1
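To see the backreference doing its job, it helps to test one line where sender and receiver match and one where they differ. A minimal sketch, assuming GNU grep with -P support; check is a hypothetical helper and the alice/bob addresses are invented.

```shell
# Backreference pattern from the answer; needs grep -P (PCRE).
pat='from(=<[^\s#<>]+#[^\s#<>]+>)\s*to\1'

# check is a hypothetical helper that reports whether a log line matches.
check() {
  if printf '%s\n' "$1" | grep -qP "$pat"; then
    echo match
  else
    echo no-match
  fi
}

check 'from=<test#test.ch> to=<test#test.ch>'    # prints: match
check 'from=<alice#test.ch> to=<bob#test.ch>'    # prints: no-match
```

The second line fails because \1 must repeat exactly the text captured after from, and the to address is different.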

Bash regex matching "0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93)."

In a Bash script I'm writing, I need to capture the /path/to/my/file.c and 93 in this line:
0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93).
0xffffffc0006e0584 is in another_function(char *arg1, int arg2) (/path/to/my/other_file.c:94).
With the help of regex101.com, I've managed to create this Perl regex:
^(?:\S+\s){1,5}\((\S+):(\d+)\)
but I hear that Bash doesn't understand \d or ?:, so I came up with this:
^([:alpha:]+[:space:]){1,5}\(([:alpha:]+):([0-9]+)\)
But when I try it out:
line1="0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93)."
regex="^([:alpha:]+[:space:]){1,5}\(([:alpha:]+):([0-9]+)\)"
[[ $line1 =~ $regex ]]
echo ${BASH_REMATCH[0]}
I don't get any match. What am I doing wrong? How can I write a Bash-compatible regex to do this?
You are right: Bash uses POSIX ERE, which supports neither the \d shorthand character class nor non-capturing groups.
Use
.*\((.+):([0-9]+)\)
Or even (if you need to grab the first (...) substring in a string):
\(([^()]+):([0-9]+)\)
Details
.* - any 0+ chars, as many as possible (may be omitted, only necessary if there are other (...) substrings and you only need to grab the last one)
\( - a ( char
(.+) - Group 1 (${BASH_REMATCH[1]}): any 1+ chars as many as possible
: - a colon
([0-9]+) - Group 2 (${BASH_REMATCH[2]}): 1+ digits
\) - a ) char.
Bash demo:
test='0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93).'
reg='.*\((.+):([0-9]+)\)'
# reg='\(([^()]+):([0-9]+)\)' # This also works for the current scenario
if [[ $test =~ $reg ]]; then
echo ${BASH_REMATCH[1]};
echo ${BASH_REMATCH[2]};
fi
Output:
/path/to/my/file.c
93
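The same POSIX ERE also works outside [[ ... =~ ... ]], for example with sed -E. The sketch below is an alternative, not part of the answer above: it runs the first pattern against the second sample line from the question, where the function signature adds an extra pair of parentheses. The trailing .* is an addition of this sketch so that the substitution replaces the whole line.

```shell
line='0xffffffc0006e0584 is in another_function(char *arg1, int arg2) (/path/to/my/other_file.c:94).'

# The leading .* is greedy, so \( lands on the last "(" that still lets the rest match.
re='.*\((.+):([0-9]+)\).*'

path=$(printf '%s\n' "$line" | sed -nE "s/$re/\1/p")
lineno=$(printf '%s\n' "$line" | sed -nE "s/$re/\2/p")

printf '%s\n' "$path"     # prints: /path/to/my/other_file.c
printf '%s\n' "$lineno"   # prints: 94
```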
In the first pattern you use \S+, which matches 1+ non-whitespace chars. That is a broad match and will also match, for example, /, which the second pattern does not account for.
The second pattern starts with [:alpha:], but the first char is a 0. You could use [:alnum:] instead, and since the repetition should also match _, that could be added as well. Note also that POSIX classes are only valid inside a bracket expression, so they have to be written as [[:alnum:]], not [:alnum:].
Note that when a quantifier is applied to a capturing group, the group holds the value of the last iteration. So with {1,5} the group is used only for the repetition; its final value would be some_function followed by a space.
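The "last iteration" behaviour is easy to observe in isolation. This sketch uses sed -E as a stand-in for Bash's =~ (both take POSIX ERE); the input string is made up.

```shell
# Three "word + space" iterations; the group is overwritten on each one.
last=$(printf '%s\n' 'one two three ' |
  sed -E 's/^([[:alnum:]_]+[[:space:]]){1,5}$/\1/')

printf '[%s]\n' "$last"   # prints: [three ]
```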
You might use:
^([[:alnum:]_]+[[:space:]]){1,5}\(((/[[:alpha:]]+)+\.[[:alpha:]]):([[:digit:]]+)\)\.$
Your code could look like
line1="0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93)."
regex="^([[:alnum:]_]+[[:space:]]){1,5}\(((/[[:alpha:]]+)+\.[[:alpha:]]):([[:digit:]]+)\)\.$"
[[ $line1 =~ $regex ]]
echo ${BASH_REMATCH[2]}
echo ${BASH_REMATCH[4]}
Result
/path/to/my/file.c
93
Or a slightly shorter version using \S, with the values in groups 2 and 3 (\S is a GNU extension rather than POSIX ERE, so it depends on the system's regex library):
^([[:alnum:]_]+[[:space:]]){1,5}\((\S+\.[[:alpha:]]):([[:digit:]]+)\)\.$
Explanation
^ Start of string
([[:alnum:]_]+[[:space:]]){1,5} Repeat 1-5 times what is captured in group 1
\( match (
(\S+\.[[:alpha:]]) Capture group 2 Match 1+ non whitespace chars, . and an alphabetic character
: Match :
([[:digit:]]+) Capture group 3 Match 1+ digits
\)\. Match ).
$ End of string

Match and Replace ![foo](/bar/) with Regex in SED

I'm trying to write a RegEx for SED to make it match and replace the following MarkDown text:
![something](/uploads/somethingelse)
with:
![something](uploads/somethingelse)
Now, in PCRE the matching pattern would be:
([\!]|^)(\[.*\])(\(\/bar[\/])
as tested on Regex101, but in sed it's invalid.
I've tried a lot of combinations before asking, but I'm going crazy since I'm not a RegEx expert.
Which is the right SED regex to match and split that string in order to make the replacement with sed as described here?
The sed command you need should be run with the -E option as your regex is POSIX ERE compliant. That is, the capturing parentheses should be unescaped, and literal parentheses must be escaped (as in PCRE).
You may use
sed -E 's;(!\[.*])(\(/uploads/);\1(uploads/;g'
Details
(!\[.*]) - Capturing group 1:
! - a ! char (if you use "...", you need to escape it)
\[.*] - a [, then any 0+ chars and then ]
(\(/uploads/) - Capturing group 2:
\( - a ( char
/uploads/ - an /uploads/ substring.
The POSIX BRE compliant pattern (the actual "quick fix" of your current pattern) will look like
sed 's;\(!\|^\)\(\[.*](\)/\(uploads/\);\1\2\3;g'
Note that \(...\) defines a capturing group, ( matches a literal (, and \| is the alternation operator (a GNU extension in BRE).
Details
\(!\|^\) - Capturing group 1: ! or start of string
\(\[.*](\) - Capturing group 2: a [, then 0+ chars, and then (
/ - a / char
\(uploads/\) - Capturing group 3: uploads/ substring
The ; regex delimiter removes the need to escape / chars with \ and makes the pattern more readable.
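Both commands can be checked side by side against the sample from the question. A small sketch; the BRE variant relies on \| and therefore assumes GNU sed.

```shell
md='![something](/uploads/somethingelse)'

# POSIX ERE variant (-E): unescaped ( ) capture, \( is a literal parenthesis.
ere=$(printf '%s\n' "$md" | sed -E 's;(!\[.*])(\(/uploads/);\1(uploads/;g')

# BRE variant: \( \) capture, ( is literal; \| assumes GNU sed.
bre=$(printf '%s\n' "$md" | sed 's;\(!\|^\)\(\[.*](\)/\(uploads/\);\1\2\3;g')

printf '%s\n' "$ere"   # prints: ![something](uploads/somethingelse)
printf '%s\n' "$bre"   # prints: ![something](uploads/somethingelse)
```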

Extract Values Between Pattern Match

I'm trying to extract any numerical values between a pattern match in a text file.
Parsed Log File Text
> GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z.nmmb_2p5km.f25.conus.grib2
I want to pull the 25 from f25 in nmmb_2p5km.f25.conus.grib2
Attempted Code
sed -e 's/nmmb_2p5km\(.*\)grib2/\1/'
You may use
log="GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z.nmmb_2p5km.f25.conus.grib2"
sed 's/.*nmmb_2p5km[^0-9]*\([0-9]*\)[^0-9]*grib2.*/\1/' <<< "$log"
The .*nmmb_2p5km[^0-9]*\([0-9]*\)[^0-9]*grib2.* pattern matches
.* - any 0+ chars
nmmb_2p5km - a literal substring
[^0-9]* - 0+ non-digit chars
\([0-9]*\) - Capturing group 1 (later referred to with \1 from the replacement pattern): 0+ digits
[^0-9]* - 0+ non-digit chars
grib2.* - grib2 and any 0+ chars.
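The reason for the leading and trailing .* is that s/// replaces only the text the pattern matched and leaves the rest of the line in place. The sketch below contrasts the attempted command with the corrected one; it pipes the line in with printf instead of a here-string so it also runs in plain sh.

```shell
log='GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z.nmmb_2p5km.f25.conus.grib2'

# Original attempt: only "nmmb_2p5km...grib2" is replaced, the prefix survives.
attempt=$(printf '%s\n' "$log" | sed -e 's/nmmb_2p5km\(.*\)grib2/\1/')

# Corrected: .* on both sides makes the match span the whole line.
fixed=$(printf '%s\n' "$log" |
  sed 's/.*nmmb_2p5km[^0-9]*\([0-9]*\)[^0-9]*grib2.*/\1/')

printf '%s\n' "$attempt"  # prints: GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z..f25.conus.
printf '%s\n' "$fixed"    # prints: 25
```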
Alternatively, you may use grep with a PCRE pattern like
grep -Po 'nmmb_2p5km\D*\K\d+' <<< "$log"
Details
nmmb_2p5km - a literal substring
\D* - 0+ non-digit chars
\K - match reset operator discarding all text matched so far
\d+ - 1+ digits.
Using a Perl one-liner:
> export log="GET /pub/data/nccf/com/hiresw/prod/hiresw.20180921/hiresw.t00z.nmmb_2p5km.f25.conus.grib2"
> perl -ne ' BEGIN { $x=$ENV{log};$x=~s/(.+?)(\d+)\.conus\.(.+)/$2/g; print "$x\n"; exit } '
25

Grep regex capturing issue

Why doesn't this return only the capturing group?
grep -rPo 'ServerMethod\(me\.[a-zA-Z]*\.([a-zA-Z]*)\)'
It returns:
test.js:ServerMethod(me.obProcedures.SaveProcess)
test.js:ServerMethod(me.obProcedures.Commit)
but I need just:
SaveProcess
Commit
cygwin version:
2.5.2(0.297/5/3)
This happens because grep does not return capture group contents, only whole matches.
You may use the \K match reset operator and a positive lookahead instead:
grep -Po 'ServerMethod\(me\.[a-zA-Z]*\.\K[a-zA-Z]+(?=\))'
Details:
ServerMethod\(me\. - matches a literal string ServerMethod(me.
[a-zA-Z]* - 0 or more ASCII letters
\. - a literal dot
\K - omits the text matched so far from the match
[a-zA-Z]+ - 1 or more ASCII letters
(?=\)) - a positive lookahead that requires a ) immediately to the right of the current location, but does not add it to the match (as it is a non-consuming pattern).
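A self-contained run of the \K pattern, as a sketch assuming GNU grep with -P support. The two input lines are hypothetical stand-ins for the contents of test.js.

```shell
# Hypothetical lines standing in for the contents of test.js.
names=$(printf '%s\n' \
  'x = ServerMethod(me.obProcedures.SaveProcess);' \
  'y = ServerMethod(me.obProcedures.Commit);' |
  grep -Po 'ServerMethod\(me\.[a-zA-Z]*\.\K[a-zA-Z]+(?=\))')

printf '%s\n' "$names"
# prints:
# SaveProcess
# Commit
```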
Alternatively, as a PCRE grep option is not always available, use sed with grep:
grep 'ServerMethod(me\.' | sed 's/.*ServerMethod(me\.[a-zA-Z]*\.\([a-zA-Z]*\)).*/\1/'
Here, the patterns are POSIX BRE compliant:
ServerMethod(me\. - matches a literal ServerMethod(me. text, grep gets the lines with this text
.*ServerMethod(me\.[a-zA-Z]*\.\([a-zA-Z]*\)).* - matches a line that has
.* - any 0+ chars as many as possible
ServerMethod(me\. - a literal ServerMethod(me. text
[a-zA-Z]* - 0+ ASCII letters
\. - a literal dot
\([a-zA-Z]*\) - Capturing group 1 (referred to via \1): 0+ ASCII letters
) - a literal )
.* - any 0+ chars as many as possible
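As a variation on the grep-plus-sed pipeline, the filtering step can be folded into sed itself with -n and the p flag, which prints only lines where the substitution succeeded. This is a sketch of that alternative, not the answer's own command; the input lines are made up.

```shell
# Hypothetical input: two matching lines and one that should be skipped.
names=$(printf '%s\n' \
  'ServerMethod(me.obProcedures.SaveProcess)' \
  'var unrelated = 1;' \
  'ServerMethod(me.obProcedures.Commit)' |
  sed -n 's/.*ServerMethod(me\.[a-zA-Z]*\.\([a-zA-Z]*\)).*/\1/p')

printf '%s\n' "$names"
# prints:
# SaveProcess
# Commit
```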