I'm trying to write a RegEx for SED to make it match and replace the following MarkDown text:
![something](/uploads/somethingelse)
with:
![something](uploads/somethingelse)
Now, in PCRE the matching pattern would be:
([\!]|^)(\[.*\])(\(\/bar[\/])
as tested on Regex101, but in sed it's invalid.
I've tried a lot of combinations before asking, but I'm going crazy since I'm not a RegEx expert.
Which is the right SED regex to match and split that string in order to make the replacement with sed as described here?
The sed command you need should be run with the -E option as your regex is POSIX ERE compliant. That is, the capturing parentheses should be unescaped, and literal parentheses must be escaped (as in PCRE).
You may use
sed -E 's;(!\[.*])(\(/uploads/);\1(uploads/;g'
Details
(!\[.*]) - Capturing group 1:
! - a ! char (if you use "...", you need to escape it)
\[.*] - a [, then any 0+ chars and then ]
(\(/uploads/) - Capturing group 2:
\( - a ( char
/uploads/ - an /uploads/ substring.
The POSIX BRE compliant pattern (the actual "quick fix" of your current pattern) will look like
sed 's;\(!\|^\)\(\[.*](\)/\(uploads/\);\1\2\3;g'
Note that in POSIX BRE \(...\) defines a capturing group, an unescaped ( matches a literal (, and \| is an alternation operator (a GNU sed extension, not part of standard BRE).
Details
\(!\|^\) - Capturing group 1: ! or start of string
\(\[.*](\) - Capturing group 2: a [, then any 0+ chars, then ] and (
/ - a / char
\(uploads/\) - Capturing group 3: uploads/ substring
See the online sed demo
Using ; as the regex delimiter avoids having to escape every / with a \ and makes the pattern more readable.
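As a quick check of the ERE command above, here is a minimal sketch that runs it on two sample lines. The second input line and the variable names are made up for illustration; only the sed expression comes from the answer.

```shell
#!/usr/bin/env bash
# Two sample Markdown image links; only the first has the leading slash to strip.
# The second line is a hypothetical "already relative" link that must stay untouched.
input='![something](/uploads/somethingelse)
![other](uploads/already-relative)'

# Group 1 keeps "![...]"; the "(/uploads/" part is rewritten as "(uploads/".
result=$(printf '%s\n' "$input" | sed -E 's;(!\[.*])(\(/uploads/);\1(uploads/;g')

printf '%s\n' "$result"
```

Lines that do not contain (/uploads/ pass through unchanged, so the command should be safe to run over a whole file.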
In a Bash script I'm writing, I need to capture the /path/to/my/file.c and 93 in this line:
0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93).
0xffffffc0006e0584 is in another_function(char *arg1, int arg2) (/path/to/my/other_file.c:94).
With the help of regex101.com, I've managed to create this Perl regex:
^(?:\S+\s){1,5}\((\S+):(\d+)\)
but I hear that Bash doesn't understand \d or ?:, so I came up with this:
^([:alpha:]+[:space:]){1,5}\(([:alpha:]+):([0-9]+)\)
But when I try it out:
line1="0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93)."
regex="^([:alpha:]+[:space:]){1,5}\(([:alpha:]+):([0-9]+)\)"
[[ $line1 =~ $regex ]]
echo ${BASH_REMATCH[0]}
I don't get any match. What am I doing wrong? How can I write a Bash-compatible regex to do this?
You are right, Bash uses POSIX ERE and does not support \d shorthand character class, nor does it support non-capturing groups. See more regex features unsupported in POSIX ERE/BRE in this post.
Use
.*\((.+):([0-9]+)\)
Or even (if you need to grab the first (...) substring in a string):
\(([^()]+):([0-9]+)\)
Details
.* - any 0+ chars, as many as possible (may be omitted, only necessary if there are other (...) substrings and you only need to grab the last one)
\( - a ( char
(.+) - Group 1 (${BASH_REMATCH[1]}): any 1+ chars as many as possible
: - a colon
([0-9]+) - Group 2 (${BASH_REMATCH[2]}): 1+ digits
\) - a ) char.
See the Bash demo (or this one):
test='0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93).'
reg='.*\((.+):([0-9]+)\)'
# reg='\(([^()]+):([0-9]+)\)' # This also works for the current scenario
if [[ $test =~ $reg ]]; then
echo ${BASH_REMATCH[1]};
echo ${BASH_REMATCH[2]};
fi
Output:
/path/to/my/file.c
93
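The question also lists a second line that contains an extra (...) pair before the path. The sketch below runs both suggested patterns against that line; parse_location is a hypothetical helper name, not something from the answer.

```shell
#!/usr/bin/env bash
# Second sample line from the question: an argument list in (...) precedes the path.
line2='0xffffffc0006e0584 is in another_function(char *arg1, int arg2) (/path/to/my/other_file.c:94).'

# parse_location (hypothetical helper): prints "<file> <line>" when the regex matches.
parse_location() {
  local reg=$1 text=$2
  if [[ $text =~ $reg ]]; then
    printf '%s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
  else
    return 1
  fi
}

# Pattern 1: the leading .* pushes the match to the last "(" in the string.
via_greedy=$(parse_location '.*\((.+):([0-9]+)\)' "$line2")
# Pattern 2: [^()]+ cannot cross parentheses, so the argument list is skipped.
via_negated=$(parse_location '\(([^()]+):([0-9]+)\)' "$line2")

printf '%s\n%s\n' "$via_greedy" "$via_negated"
```

Keeping the regex in a variable and leaving it unquoted on the right-hand side of =~ is what makes Bash treat it as a regex rather than a literal string.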
In the first pattern you use \S+, which matches one or more non-whitespace chars. That is a broad match and will also match, for example, /, which the second pattern does not account for.
The second pattern starts with [:alpha:], but the first char is a 0. You could use [:alnum:] instead. Note that POSIX classes are only valid inside a bracket expression, so they must be written as [[:alnum:]], not [:alnum:]. Since the repetition should also match _, that can be added as well: [[:alnum:]_].
Note that when a quantifier is applied to a capturing group, the group captures only the value of the last iteration. So with {1,5} the group serves only for the repetition; its value would be "some_function " (including the trailing whitespace char).
You might use:
^([[:alnum:]_]+[[:space:]]){1,5}\(((/[[:alpha:]]+)+\.[[:alpha:]]):([[:digit:]]+)\)\.$
Regex demo | Bash demo
Your code could look like
line1="0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93)."
regex="^([[:alnum:]_]+[[:space:]]){1,5}\(((/[[:alpha:]]+)+\.[[:alpha:]]):([[:digit:]]+)\)\.$"
[[ $line1 =~ $regex ]]
echo ${BASH_REMATCH[2]}
echo ${BASH_REMATCH[4]}
Result
/path/to/my/file.c
93
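To see why the values end up in groups 2 and 4, this small sketch prints the numbered groups for the same line and regex. The groups array name is an assumption added for illustration.

```shell
#!/usr/bin/env bash
line1='0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93).'
regex='^([[:alnum:]_]+[[:space:]]){1,5}\(((/[[:alpha:]]+)+\.[[:alpha:]]):([[:digit:]]+)\)\.$'

groups=()
if [[ $line1 =~ $regex ]]; then
  # Copy BASH_REMATCH, since the next =~ would overwrite it.
  groups=("${BASH_REMATCH[@]}")
  for i in 1 2 3 4; do
    # Brackets make trailing whitespace in a group visible.
    printf 'group %d: [%s]\n' "$i" "${groups[i]}"
  done
fi
```

Groups 1 and 3 are the quantified ones, so they only hold their last iteration; the full path is in group 2 and the line number in group 4.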
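Or a slightly shorter version using \S, where the values are in groups 2 and 3. Note that \S is not part of POSIX ERE; in Bash it only works when the system regex library supports it as an extension, as glibc does.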
^([[:alnum:]_]+[[:space:]]){1,5}\((\S+\.[[:alpha:]]):([[:digit:]]+)\)\.$
Explanation
^ Start of string
([[:alnum:]_]+[[:space:]]){1,5} Repeat 1-5 times what is captured in group 1
\( match (
(\S+\.[[:alpha:]]) Capture group 2 Match 1+ non whitespace chars, . and an alphabetic character
: Match :
([[:digit:]]+) Capture group 3 Match 1+ digits
\)\. Match ).
$ End of string
See this page about bracket expressions
I am trying to extract some custom properties from a PDF with regex (I will use grep).
PDF custom properties are a key-value stored in this format:
<</key1(value1)/key2(value2)/key3(value3)>>
Parentheses inside values are escaped:
/key4(outside \(inside\) outside)
I wrote the following regex to extract the value of a key:
grep -Po '(?<=key4\().*?(?=\))' "sample.txt"
However, when I apply it to key4 (whose value contains parentheses), it yields:
outside \(inside\
This is because it stops at the first ) (the escaped one) rather than at the unescaped one.
How can I ignore in my regex the escaped parenthesis?
Thank you in advance.
PS: I am open to suggestions in sed or awk.
You can do it like this
(?<=key4\()[^\\()]*(?:\\[\S\s][^\\()]*)*(?=\))
https://regex101.com/r/B4qKdh/1
Expanded:
(?<= key4\( )
[^\\()]*
(?: \\ [\S\s] [^\\()]* )*
(?= \) )
You may use a sed solution like one of these (the first is POSIX BRE, the second is POSIX ERE):
sed 's/.*key4(\([^\()]*\(\\.[^\()]*\)*\)).*/\1/'
sed -E 's/.*key4\(([^\()]*(\\.[^\()]*)*)\).*/\1/'
See the online sed demo.
POSIX ERE pattern details
.* - any 0+ chars
key4 - a key4 literal string
\( - a ( char
([^\()]*(\\.[^\()]*)*) - Group 1:
[^\()]* - 0 or more chars other than \, ( and )
(\\.[^\()]*)* - 0 or more repetitions of
\\. - a \ followed with any 1 char
[^\()]* - 0 or more chars other than \, ( and )
\) - a ) char
.* - any 0+ chars
Note that the POSIX BRE pattern just has the escaping of literal and capturing parentheses swapped: in POSIX BRE, ( matches a literal ( char and \( starts a capturing group.
The \1 in the replacement part is the Group 1 placeholder and replaces the whole match with that group value.
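A minimal sketch of the ERE command on a one-line sample follows. The sample string, including the trailing key5 entry, is an assumption built from the format shown in the question.

```shell
#!/usr/bin/env bash
# Hypothetical sample in the <</key(value)/...>> format described in the question.
sample='<</key1(value1)/key4(outside \(inside\) outside)/key5(value5)>>'

# \\. consumes each backslash-escaped char as a unit, so "\)" never ends the value.
value=$(printf '%s\n' "$sample" | sed -E 's/.*key4\(([^\()]*(\\.[^\()]*)*)\).*/\1/')

printf '%s\n' "$value"
```

Because the pattern starts and ends with .*, the whole line is replaced by the captured value, which is why no -n or p flag is needed for a single-line input.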
With any awk in any shell on any UNIX box:
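$ awk '
    # Hide escaped parentheses behind placeholders that contain a newline.
    { gsub(/\\[(]/,"\n1"); gsub(/\\[)]/,"\n2") }
    match($0,/[/]key4[(][^)]+/) {
        # Skip the 6 chars of "/key4(" and keep the value.
        $0 = substr($0,RSTART+6,RLENGTH-6)
        # Restore the escaped parentheses.
        gsub(/\n1/,"\\("); gsub(/\n2/,"\\)")
        print
    }
' file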
outside \(inside\) outside
With GNU awk for the 3rd arg to match():
$ awk '
    { gsub(/\\[(]/,"\n1"); gsub(/\\[)]/,"\n2") }
    match($0,/[/]key4[(]([^)]+)/,a) {
        $0 = a[1]
        gsub(/\n1/,"\\("); gsub(/\n2/,"\\)")
        print
    }
' file
outside \(inside\) outside
The above just replaces \( and \) with the placeholder strings \n1 and \n2 (they contain newlines, which cannot occur inside newline-separated records), then finds the match for key4, and finally restores the placeholders to their original values before printing.
I have a string and I want to remove all zeros between the characters -s and the first number.
1v-s001v => 1v-s1v
2v-s030r => 2v-s30r
3v-s021v => 3v-s21v
I'm trying with:
\w+-s0*(\d)
but it does not match the subject string.
You may use
(-s)0+(\d)
and replace with $1$2. You may replace \d with [0-9] in case the \d is not supported by your regex flavor.
See the regex demo
Details
(-s) - Capturing group 1 (later referred to with $1 placeholder/replacement backreference from the replacement pattern): a -s substring
0+ - one or more 0 chars
(\d) - Capturing group 2 (later referred to with $2 placeholder/replacement backreference from the replacement pattern): any one digit
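The question does not name a tool, so as one possible way to apply the pattern, here is a sketch using sed -E (an assumption), where the replacement is written \1\2 instead of $1$2 and \d becomes [0-9]. The strip_zeros function name is hypothetical.

```shell
#!/usr/bin/env bash
# strip_zeros (hypothetical helper): removes zeros between "-s" and the next digit.
strip_zeros() {
  printf '%s\n' "$1" | sed -E 's/(-s)0+([0-9])/\1\2/'
}

for s in 1v-s001v 2v-s030r 3v-s021v; do
  printf '%s => %s\n' "$s" "$(strip_zeros "$s")"
done
```

Since 0+ requires at least one zero right after -s, strings such as 4v-s100v are left unchanged.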
Why doesn't this return just the capturing group?
grep -rPo 'ServerMethod\(me\.[a-zA-Z]*\.([a-zA-Z]*)\)'
it returns :
test.js:ServerMethod(me.obProcedures.SaveProcess)
test.js:ServerMethod(me.obProcedures.Commit)
but I need just:
SaveProcess
Commit
cygwin version:
2.5.2(0.297/5/3)
It happens so because grep does not return capture group contents, only the whole matches.
You may use the \K match reset operator and a positive lookahead instead:
grep -Po 'ServerMethod\(me\.[a-zA-Z]*\.\K[a-zA-Z]+(?=\))'
See the online demo
Details:
ServerMethod\(me\. - matches a literal string ServerMethod(me.
[a-zA-Z]* - 0 or more ASCII letters
\. - a literal dot
\K - omits the text matched so far from the match
[a-zA-Z]+ - 1 or more ASCII letters
(?=\)) - a positive lookahead that requires a ) immediately to the right of the current location, but does not add it to the match (as it is a non-consuming pattern).
Alternatively, as a PCRE grep option is not always available, use sed with grep:
grep 'ServerMethod(me\.' | sed 's/.*ServerMethod(me\.[a-zA-Z]*\.\([a-zA-Z]*\)).*/\1/'
See another demo.
Here, the patterns are POSIX BRE compliant:
ServerMethod(me\. - matches a literal ServerMethod(me. text, grep gets the lines with this text
.*ServerMethod(me\.[a-zA-Z]*\.\([a-zA-Z]*\)).* - matches a line that has
.* - any 0+ chars as many as possible
ServerMethod(me\. - a literal ServerMethod(me. text
[a-zA-Z]* - 0+ ASCII letters
\. - a literal dot
\([a-zA-Z]*\) - Capturing group 1 (referred to via \1): 0+ ASCII letters
) - a literal )
.* - any 0+ chars as many as possible
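The grep and sed pipeline above can be tried on inline input, as in this minimal sketch. The sample lines are invented to resemble the test.js output shown in the question.

```shell
#!/usr/bin/env bash
# Hypothetical JavaScript-like input; only two lines contain ServerMethod(me.
sample='callA = ServerMethod(me.obProcedures.SaveProcess);
unrelated line
callB = ServerMethod(me.obProcedures.Commit);'

# grep keeps the relevant lines, sed replaces each line with capturing group 1.
names=$(printf '%s\n' "$sample" \
  | grep 'ServerMethod(me\.' \
  | sed 's/.*ServerMethod(me\.[a-zA-Z]*\.\([a-zA-Z]*\)).*/\1/')

printf '%s\n' "$names"
```

The grep step matters because sed prints non-matching lines unchanged; sed -n with the p flag on the substitution would be an alternative way to drop them.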
I am trying to get hold of regular expressions in Perl. Can anyone please provide any examples of what matches and what doesn't for the below regular expression?
$sentence =~m/.+\/(.+)/s
=~ is the binding operator; it makes the regex match be performed on $sentence instead of the default $_. m is the match operator; it is optional (e.g. $foo =~ /bar/) when the regex is delimited by / characters but required if you want to use a different delimiter.
s is a regex flag that makes . in the regex match any characters; by default . does not match newlines.
The actual regex is .+\/(.+); this will match one or more characters, then a literal / character, then one or more other characters. Because the initial .+ consumes as much as possible while still allowing the regex to succeed, it will match up to the last / in the string that has at least one character after it; then the (.+) will capture the characters that follow that / and make them available as $1.
So it is essentially capturing the final component of a filepath. From foo/bar it will capture bar; from foo/bar/ it will capture bar/. Strings with only one component, like /foo or bar/ or baz, will not match.
It matches any string, including multi-line strings, that contains a slash character somewhere in the middle of the string.
Matches:
foo/bar
asdf\nwrqwer/wrqwerqw # /s modifier allows '.' to match newlines
Doesn't match:
asdfasfdasf # no slash character
/asdfasdf # no characters before the slash
asdfasf/ # no characters after the slash
In addition, the entire substring that follows the last slash that has at least one character after it will be captured and assigned to the variable $1.
Breakdown:
$sentence =~ — match $sentence with
m/ — the pattern consisting of
. — any character
+ — one or more times
\/ — then a forward-slash
( — and, saving in the $1 capture group,
.+ — any character one or more times
)
/s — allowing . to match newlines
See perldoc perlop for information about operators such as =~ and quote-like operators such as m//, and perldoc perlre about regular expressions and their options such as /s.
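The matching behaviour described above can be checked with a rough Bash analogue. This is an assumption-laden sketch: it uses Bash's [[ =~ ]] with the POSIX ERE .+/(.+) rather than Perl itself, and last_component is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# last_component (hypothetical helper): mimics m/.+\/(.+)/ with a Bash ERE match.
last_component() {
  local re='.+/(.+)'
  if [[ $1 =~ $re ]]; then
    # BASH_REMATCH[1] plays the role of $1 in Perl.
    printf '%s\n' "${BASH_REMATCH[1]}"
  else
    printf 'no match\n'
  fi
}

for s in foo/bar foo/bar/ a/b/c /foo bar/ baz; do
  printf '%-10s -> %s\n' "$s" "$(last_component "$s")"
done
```

The inputs /foo, bar/ and baz fail for the reasons listed above: each lacks either a character before the slash, a character after it, or the slash itself.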