I'm trying to write a regex that matches strings like the following:
translate("some text here")
and
translate('some text here')
So far I have this:
preg_match ('/translate\("(.*?)"\)*/', $line, $m)
but how do I extend it to handle single quotes instead of double quotes? It should match with either single or double quotes.
You could go for:
translate\( # translate( literally
(['"]) # capture a single/double quote to group 1
.+? # match anything except a newline lazily
\1 # up to the formerly captured quote
\) # and a closing parenthesis
See a demo for this approach on regex101.com.
In PHP this would be:
<?php
$regex = '~
translate\( # translate( literally
([\'"]) # capture a single/double quote to group 1
.+? # match anything except a newline lazily
\1 # up to the formerly captured quote
\) # and a closing parenthesis
~x';
if (preg_match($regex, $string)) {
// do sth. here
}
?>
Note that you do not need to escape both of the quotes in square brackets ([]); I have only done it for the Stack Overflow prettifier.
Bear in mind, though, that this is rather error-prone (what about whitespace or escaped quotes?).
In the comments, the objection came up that you cannot say "anything BUT the first captured group". Well, yes, you can (thanks to Obama here): the technique is called a tempered greedy token, and it can be achieved via lookarounds. Consider the following code:
translate\(
(['"])
(?:(?!\1).)*
\1
\)
It opens a non-capturing group that consumes one character at a time, with a negative lookahead that checks before each character that it is not the formerly captured group (a quote in this example).
This rules out matches like translate("a"b"c"d") (see a demo here).
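To make the difference between the lazy dot and the tempered greedy token concrete, here is a minimal sketch in Python rather than PHP. The names LAZY and TEMPERED are invented for this illustration, and a second capture group is added around the body so the quoted text can be read back:

```python
import re

# Lazy version from the first pattern: .+? may step over inner quotes
# as long as a later quote followed by ")" lets the overall match succeed.
LAZY = re.compile(r"""translate\((['"]).+?\1\)""")

# Tempered greedy token: (?!\1) is checked before every single character,
# so the body can never contain the quote captured in group 1.
TEMPERED = re.compile(r"""translate\((['"])((?:(?!\1).)*)\1\)""")

for line in ['translate("some text here")',
             "translate('some text here')",
             'translate("a"b"c"d")']:
    print(line, bool(LAZY.search(line)), bool(TEMPERED.search(line)))
```

The third input is accepted by the lazy pattern but rejected by the tempered one, because the body would have to contain an unescaped double quote.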
The final expression to match all given examples is:
translate\(
(['"])
(?:
.*?(?=\1\))
)
\1
\)
#translate\(
([\'"]) # capture quote char
((?:
(?!\1). # not a quote
| # or
\\\1 # escaped one
)* #
[^\\\\]?)\1 # match unescaped last quote char
\)#gx
Fiddle:
ok: translate("some text here")
ok: translate('some text here')
ok: translate('"some text here..."')
ok: translate("a\"b\"c\"d")
ok: translate("")
no: translate("a\"b"c\"d")
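For comparison, a commonly used alternative formulation for escaped quotes consumes either a backslash plus any character, or any character that is neither a backslash nor the opening quote. This is not the pattern above but a different technique, sketched in Python under the assumption that a backslash always escapes the following character. The name ESCAPED is invented here:

```python
import re

# Body: an escape sequence (\ + any char), or any char that is
# neither a backslash nor the quote captured in group 1.
ESCAPED = re.compile(r"""translate\((['"])((?:\\.|(?!\1)[^\\])*)\1\)""")

cases = [
    'translate("some text here")',
    "translate('some text here')",
    "translate('\"some text here...\"')",
    r'translate("a\"b\"c\"d")',
    'translate("")',
    r'translate("a\"b"c\"d")',
]

for case in cases:
    # fullmatch requires the whole string to be one translate(...) call
    print("ok:" if ESCAPED.fullmatch(case) else "no:", case)
```

With these six inputs, only the last one (which contains an unescaped quote in the middle) is rejected.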
You can alternate expression components using the pipe (|) like this:
preg_match ('/translate(\("(.*?)"\)|\(\'(.*?)\'\))/', $line, $m)
Edit: the previous version also matched translate("some text here'). This one should work, but you will have to escape the quotes in some languages.
I'm using regular expressions in a custom text editor to in effect whitelist certain modules (assert and crypto). I'm close to what I need but not quite there. Here it is:
/require\s*\(\s*'(?!(\bassert\b|\bcrypto\b)).*'\s*\)/
I want the regular expression to match any line with require('foo'); where foo is anything except 'assert' or 'crypto'. The case that fails is require('assert ');, which my regex does not match, whereas require(' assert'); is correctly matched.
https://regexr.com/4i6ot
If you don't want to match assert or crypto between ', you could change the lookahead to assert exactly that. You can omit the word boundaries matching the words right after the '.
If what follows should match until the first occurrence of ', you could use a negated character class [^'\r\n]* to match any char except ' or a newline.
require\s*\(\s*'(?!(assert|crypto)')[^'\r\n]*'\s*\)
                                  ^
Regex demo
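As a quick check of that pattern against the cases from the question, here is a small Python sketch. The name BLOCKED is invented for this example, and the inner group is made non-capturing, which does not change what is matched:

```python
import re

# Matches require('...') unless the quoted name is exactly assert or crypto.
# The ' inside the lookahead is what pins the module name to the closing quote.
BLOCKED = re.compile(r"require\s*\(\s*'(?!(?:assert|crypto)')[^'\r\n]*'\s*\)")

for line in ["require('foo');",
             "require('assert ');",
             "require(' assert');",
             "require('assert');",
             "require('crypto');"]:
    print(repr(line), "flagged" if BLOCKED.search(line) else "allowed")
```

Only the exact names 'assert' and 'crypto' are allowed; the variants with a leading or trailing space are flagged like any other module.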
You can use: require\s*\(\s*'(?!(\bassert'|\bcrypto')).*'\s*\)
Online demo
The difference is that I replaced the word boundary \b with ' at the end of the module names. With \b, a module name of 'assert ' was still caught by the negative lookahead, because \b matches right after the t (between the t and the space). In the new version, we require ' immediately after the name of the module.
EDIT
As Cary Swoveland advised, the leading \b is not required:
require\s*\(\s*'(?!(assert'|crypto')).*'\s*\)
Demo
I assume from the flawed regex that if there is a match the string between "('" and "')" is to be captured. One way to do that follows.
r = /
require # match word
\ * # match zero or more spaces (note escaped space)
\( # match a left paren
(?! # begin a negative lookahead
' # match a single quote
(?:assert|crypto) # match either word
' # match a single quote
(?=\)) # match a right paren in a forward lookahead
) # end negative lookahead
' # match a single quote
(.*?) # match any number of characters lazily in a capture group 1
' # match a single quote
\) # match a right paren
/x # free-spacing regex definition mode
As the capture group is followed by a single quote, matching characters in the capture group lazily ensures that a single quote is not matched in the capture group. I could have instead written ([^']*). In conventional form this regex is written as follows:
r = /require *\((?!'(?:assert|crypto)'(?=\)))'(.*?)'\)/
Note that in free-spacing regex definition mode spaces will be removed unless they are escaped, put in a character class ([ ]), replaced with \p{Space} and so on.
"require ('victory')" =~ r #=> 0
$1 #=> "victory"
"require (' assert')" =~ r #=> 0
$1 #=> " assert"
"require ('assert ')" =~ r #=> 0
$1 #=> "assert "
"require ('crypto')" =~ r #=> nil
"require ('assert')" =~ r #=> nil
"require\n('victory')" =~ r #=> nil
Notice that had I replaced the space character in the regex with "\s", the last example would have given:
"require\n('victory')" =~ r #=> 0
$1 #=> "victory"
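The same idea can be sketched in Python, whose re.VERBOSE flag also drops unescaped whitespace from the pattern. This is an illustration under that assumption, not a translation the original answer provides; the name R is kept from the Ruby code, and the space is written as [ ] so that it survives verbose mode:

```python
import re

R = re.compile(r"""
    require                          # match the word
    [ ]*                             # zero or more spaces (a bare space would be ignored)
    \(                               # a left paren
    (?!'(?:assert|crypto)'(?=\)))    # reject exactly 'assert' or 'crypto'
    '(.*?)'                          # capture the module name lazily
    \)                               # a right paren
""", re.VERBOSE)

for s in ["require ('victory')", "require (' assert')", "require ('assert ')",
          "require ('crypto')", "require ('assert')", "require\n('victory')"]:
    m = R.search(s)
    print(repr(s), m.group(1) if m else None)

# In verbose mode a bare space vanishes, an escaped one does not.
print(bool(re.compile(r"a b", re.VERBOSE).fullmatch("ab")))
print(bool(re.compile(r"a\ b", re.VERBOSE).fullmatch("a b")))
```

As in the Ruby examples, the input with a newline between require and the parenthesis is not matched, because [ ] accepts only a space.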
I don't think you need anything remotely that complicated; this simple pattern will work just fine:
require\((?!'crypto'|'assert')'.*'\);
regex101 demo
Given the following code:
use strict;
use warnings;
my $text = "asdf(blablabla)";
$text =~ s/(.*?)\((.*)\)/$2/;
print "\nfirst match: $1";
print "\nsecond match: $2";
I expected that $2 would catch my last bracket, yet my output is:
If .* is greedy by default, why did it stop at the bracket?
The .* is a greedy subpattern, but it does not account for grouping. Grouping is defined with a pair of unescaped parentheses (see Use Parentheses for Grouping and Capturing).
See where your group boundaries are:
s/(.*?)\((.*)\)/$2/
  | G1|  |G2|
So, the \( and \) matching ( and ) are outside the groups, and will be part of neither $1 nor $2.
If you need the ) to be part of $2, use
s/(.*?)\((.*\))/$2/
            ^^
A regex engine is processing both the string and the pattern from left to right. The first (.*?) is handled first, and it matches up to the first literal ( symbol as it is lazy (matches as few chars as possible before it can return a valid match), and the whole part before the ( is placed into Group 1 stack. Then, the ( is matched, but not captured, then (.*) matches any 0+ characters other than a newline up to the last ) symbol, and places the capture into Group 2. Then, the ) is just matched. The point is that .* grabs the whole string up to the end, but then backtracking happens since the engine tries to accommodate for the final ) in the pattern. The ) must be matched, but not captured in your pattern, thus, it is not part of Group 2 due to the group boundary placement. You can see the regex debugger at this regex demo page to see how the pattern matches your string.
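The group boundaries can be checked with a short sketch. It uses Python's re module instead of Perl, on the assumption that greedy and lazy quantifiers behave the same way for these patterns; the variable names are illustrative only:

```python
import re

text = "asdf(blablabla)"

# \( and \) sit outside both groups, so neither capture contains a parenthesis.
m = re.search(r"(.*?)\((.*)\)", text)
print(m.group(1), m.group(2))

# Moving \) inside group 2 makes the closing parenthesis part of the capture.
m_inner = re.search(r"(.*?)\((.*\))", text)
print(m_inner.group(2))

# Greediness is still visible when there are several ")" to choose from:
# .* backtracks only as far as the last one.
m_greedy = re.search(r"(.*?)\((.*)\)", "a(b)c(d)")
print(m_greedy.group(2))
```

In the last call, group 2 runs from the first ( to the last ), which shows that .* was greedy all along; it simply cannot include a character that lies outside its own parentheses.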
Please help me create a regular expression that matches the "|" character everywhere except inside parentheses.
example|example (example(example))|example|example|example(example|example|example(example|example))|example
The match should select the 5 "|" characters that are outside the parentheses. I want to note that the contents within the brackets should remain unchanged, including the "|" characters within them.
Considering you want to match pipes that are outside any set of parentheses, with nested sets, here's the pattern to achieve what you want:
Regex:
(?x) # Allow comments in regex (ignore whitespace)
(?: # Repeat *
[^(|)]*+ # Match every char except ( ) or |
( # 1. Group 1
\( # Opening paren
(?: # chars inside:
[^()]++ # a. everything inside parens except nested parens
| # or
(?1) # b. nested parens (recurse group 1)
)*+ # repeated any number of times
\) # Until closing paren.
)?+ # (end of group 1)
)*+ #
\K # Keep text out of match
\| # Match a pipe
regex101 Demo
One-liner:
(?:[^(|)]*+(\((?:[^()]++|(?1))*+\))?+)*+\K\|
regex101 Demo
This pattern uses some advanced features:
Possessive quantifiers
Recursion
Resetting the match start
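Python's standard re module supports neither recursion ((?1)) nor \K, so the pattern cannot be used there as-is. As a swapped-in technique rather than a port of the regex, the same result can be sketched with a plain depth counter. The function name top_level_pipes is hypothetical, and unbalanced closing parentheses are assumed to be ignored:

```python
def top_level_pipes(text):
    """Return the indexes of every "|" that is not inside parentheses."""
    depth = 0
    positions = []
    for index, char in enumerate(text):
        if char == "(":
            depth += 1
        elif char == ")":
            # Assumption: a stray ")" does not push the depth below zero.
            depth = max(depth - 1, 0)
        elif char == "|" and depth == 0:
            positions.append(index)
    return positions


sample = ("example|example (example(example))|example|example|"
          "example(example|example|example(example|example))|example")

cuts = top_level_pipes(sample)
print(len(cuts))

# Split on the top-level pipes only; parenthesized content stays intact.
pieces = [sample[a + 1:b] for a, b in zip([-1] + cuts, cuts + [len(sample)])]
print(pieces)
```

On the sample string from the question this finds the 5 pipes that lie outside any parentheses and yields 6 pieces.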
I would like to remove some strings from filenames.
I want to remove every string in brackets, unless it contains the string "remix" or "Remix" or "REMIX".
So far I have:
sed "s/\s*\(\s?[A-z0-9. ]*\)//g"
but how do I exclude the cases where the string contains remix?
You can use a capture group:
sed 's/\(\s*([^)]*remix[^)]*)\)\|\s*(\s\?[a-z0-9. ]*)/\1/gi'
When the "remix branch" doesn't match, the capture group is not defined and the matched part is replaced with an empty string.
When the "remix branch" succeeds, the matched part is replaced by the content of the capture group, so by itself.
Note: if that helps to avoid false positives, you can add word boundaries around "remix": \bremix\b
pattern details:
\( # open the capture group 1
\s* # zero or more white-spaces
( # a literal parenthesis
[^)]* # zero or more characters that are not a closing parenthesis
remix
[^)]*
)
\) # close the capture group 1
\| # OR
# something else between parenthesis
\s* # note that it is essential that the two branches are able to
# start at the same position. If you remove \s* in the first
# branch, the second branch will always win when there's a space
# before the opening parenthesis.
(\s\?[a-z0-9. ]*)
\1 is the reference to the capture group 1
i makes the pattern case-insensitive
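The "keep what group 1 captured, drop everything else" trick is not specific to sed. Here is a minimal sketch of it with Python's re.sub, using ERE-style syntax instead of GNU BRE; PATTERN and strip_brackets are names invented for this example, and the file names are made up:

```python
import re

# Branch 1 (captured): a bracketed part containing "remix".
# Branch 2 (not captured): any other simple bracketed part.
PATTERN = re.compile(
    r"(\s*\([^)]*remix[^)]*\))|\s*\(\s?[a-z0-9. ]*\)",
    re.IGNORECASE,
)


def strip_brackets(name):
    # When branch 2 matched, group 1 is None and the match is removed;
    # when branch 1 matched, the match is replaced by itself.
    return PATTERN.sub(lambda m: m.group(1) or "", name)


print(strip_brackets("Artist - Song (Live) (Club Remix).mp3"))
print(strip_brackets("Track (2009).mp3"))
print(strip_brackets("Track (REMIX)"))
```

Both branches start with \s*, so they compete at the same position, which is the point made in the pattern details above.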
[EDIT]
If you want to do it in a POSIX-compliant way, you must use a different approach because several GNU features are not available, in particular the alternation \| (but also the i modifier, the \s character class, and the optional quantifier \?).
This other approach consists of matching any characters that are not an opening parenthesis and any substrings enclosed in parentheses with "remix" inside, followed by optional white-space and an optional substring enclosed in parentheses.
As you can see, everything is optional and the pattern can match an empty string, but that isn't a problem.
Everything before the parenthesized part to remove is captured in group 1.
sed 's/\(\([^(]*([^)]*[Rr][Ee][Mm][Ii][Xx][^)]*)[^ \t(]*\([ \t]\{1,\}[^ \t(]\{1,\}\)*\)*\)\([ \t]*([^)]*)\)\{0,1\}/\1/g;'
pattern details:
\( # open the capture group 1
\(
[^(]* # all that is not an opening parenthesis
# substring enclosed between parentheses with "remix" inside
( [^)]* [Rr][Ee][Mm][Ii][Xx] [^)]* )
# Let's reach the next parenthesis without matching the white-space
# before it (otherwise the leading white-space is not removed)
[^ \t(]* # all that is not a white-space or an opening parenthesis
# optional groups of white-space followed by characters that are
# neither white-space nor opening parentheses
\( [ \t]\{1,\} [^ \t(]\{1,\} \)*
\)*
\) # close the capture group 1
\(
[ \t]* # leading white-spaces
([^)]*) # parenthesis
\)\{0,1\} # makes this part optional (this avoids removing a "remix" part
# alone at the end of the string)
Word boundaries aren't available in this mode either. So the only way to emulate them is to list the four possibilities:
([Rr][Ee][Mm][Ii][Xx]) # poss1
([Rr][Ee][Mm][Ii][Xx][^a-zA-Z][^)]*) # poss2
([^)]*[^a-zA-Z][Rr][Ee][Mm][Ii][Xx]) # poss3
([^)]*[^a-zA-Z][Rr][Ee][Mm][Ii][Xx][^a-zA-Z][^)]*) # poss4
and to replace ([^)]*[Rr][Ee][Mm][Ii][Xx][^)]*) with:
\(poss1\)\{0,\}\(poss2\)\{0,\}\(poss3\)\{0,\}\(poss4\)\{0,\}
Just skip the lines matching "remix":
sed '/([^)]*[Rr][Ee][Mm][Ii][Xx][^)]*)/! s/([^)]*)//g'
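The same "skip the line, otherwise strip" logic can be sketched in Python to make its behaviour easy to test. HAS_REMIX, BRACKETED, and clean are hypothetical names, and re.IGNORECASE stands in for the [Rr][Ee][Mm][Ii][Xx] character classes:

```python
import re

HAS_REMIX = re.compile(r"\([^)]*remix[^)]*\)", re.IGNORECASE)
BRACKETED = re.compile(r"\([^)]*\)")


def clean(line):
    # Like sed's "/pattern/! s/.../.../g": lines with a bracketed "remix"
    # are left completely untouched, all others lose every bracketed part.
    if HAS_REMIX.search(line):
        return line
    return BRACKETED.sub("", line)


print(repr(clean("Song (Live).mp3")))
print(repr(clean("Song (Live) (Remix).mp3")))
```

Two consequences are visible here: the space before a removed bracket stays in place, and a line that contains a remix bracket keeps all of its other brackets as well.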
Where the brackets are [] (US usage):
sed '/remix\|REMIX\|Remix/ !s/\[[^]]*]//g'
Where the brackets are () (rest of the world):
sed '/remix\|REMIX\|Remix/ !s/([^)]*)//g'
Assuming:
- there are no nested brackets
- other capitalizations of remix (ReMix, ...) are not recognized, so the brackets on those lines are removed
- remix may appear anywhere in the title (i love remix), not only inside brackets [if needed, specify which brackets to keep and which to remove]