Extracting inner groups with regex

I have the following string
([Valor][Corr][Fat]: 6M UC x Viz. Lógicos IN('3','6')) AND (((SUM_RevisionAnomalia_UltRevision_1M = 1) AND (CANT_ConsumoFact_UltRevision_1M > 1)) OR ((SUM_RevisionNoAnomalia_UltRevision_1M + 1) AND (CANT_ConsumoFact_UltRevision_1M BETWEEN 1 - 2))) OR (SUM_RevisionNoAnomalia_UltRevision_1M <= 1)
and I am trying to extract all inner groups, so my answer should contain
([Valor][Corr][Fat]: 6M UC x Viz. Lógicos IN('3','6'))
(SUM_RevisionAnomalia_UltRevision_1M = 1)
(CANT_ConsumoFact_UltRevision_1M > 1)
(SUM_RevisionNoAnomalia_UltRevision_1M + 1)
(CANT_ConsumoFact_UltRevision_1M BETWEEN 1 - 2)
(SUM_RevisionNoAnomalia_UltRevision_1M <= 1)
It is quite easy to extract this when there is only 1 set of those strings inside parentheses, but when given the example above my regex captures the whole string.
The regex I am using is
/(\([a-zA-Z0-9\[\]:_+=-\s\.\(\),'óáéíúüçãôàäê><]+\))/g

It seems you want to match each ( ... ) group whose content contains no ( or ), except for inner (...) pairs that are immediately preceded by a word character (as in IN('3','6')).
You can use
\((?:[^()]|\b\([^()]*\))*\)
See the regex demo
The regex breakdown:
\( - matching a literal (
(?:[^()]|\b\([^()]*\))* - zero or more sequences of:
[^()] - any character other than ( and )
| - or...
\b\([^()]*\) - a word boundary (i.e. before that position, there must be a word character) followed with ( followed with zero or more characters other than ( and ), and then a closing )
\) - a closing )
An alternative pattern can be an unrolled one (more efficient with longer inputs):
\([^()]*(?:\b\([^()]*\)[^()]*)*\)
See another demo
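
To check the unrolled pattern outside a regex tester, here is a minimal Python sketch. It assumes the standard re module behaves like the PCRE demo for this pattern (only \b, negated classes and non-capturing groups are used); the names INNER_GROUP, text and groups are illustrative and not from the original answer.

```python
import re

# Sample input copied from the question.
text = ("([Valor][Corr][Fat]: 6M UC x Viz. Lógicos IN('3','6')) AND "
        "(((SUM_RevisionAnomalia_UltRevision_1M = 1) AND "
        "(CANT_ConsumoFact_UltRevision_1M > 1)) OR "
        "((SUM_RevisionNoAnomalia_UltRevision_1M + 1) AND "
        "(CANT_ConsumoFact_UltRevision_1M BETWEEN 1 - 2))) OR "
        "(SUM_RevisionNoAnomalia_UltRevision_1M <= 1)")

# Unrolled pattern from the answer: an innermost (...) group that may
# contain (...) pairs only when they directly follow a word character.
INNER_GROUP = re.compile(r"\([^()]*(?:\b\([^()]*\)[^()]*)*\)")

# The pattern has no capturing groups, so findall returns whole matches.
groups = INNER_GROUP.findall(text)
for group in groups:
    print(group)
```

The six printed lines should correspond to the six groups listed in the question.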

Related

How to extract the operands on both sides of "==" using regex?

Language and package
python3.8, regex
Description
The inputs and wanted outputs are listed as following:
if (programWorkflowState.getTerminal(1, 2) == Boolean.TRUE) {
Want: programWorkflowState.getTerminal(1, 2) and Boolean.TRUE
boolean ignore = !_isInStatic.isEmpty() && (_isInStatic.peek() == 3) && isAnonymous;
Want: _isInStatic.peek() and 3
boolean b = (num1 * ( 2 + num2)) == value;
Want: (num1 * ( 2 + num2)) and value
My current regex
((?:\((?:[^\(\)]|(?R))*\)|[\w\.])+)\s*==\s*((?:\((?:[^\(\)]|(?R))*\)|[\w\.])+)
This pattern is meant to match \((?:[^\(\)]|(?R))*\) or [\w\.] on both sides of "=="
Result on regex101.com
Problem: It failed to match the recursive part (num1 * ( 2 + num2)).
The explanation of the recursive pattern \((?:m|(?R))*\) is here
But if I use only the recursive pattern on its own, it does match (num1 * ( 2 + num2)), as the image shows.
What's the right regex to achieve my purpose?

The \((?:m|(?R))*\) pattern contains a (?R) construct (equal to (?0) subroutine) that recurses the entire pattern.
You need to wrap the pattern you need to recurse with a group and use a subroutine instead of (?R) recursion construct, e.g. (?P<aux>\((?:m|(?&aux))*\)) to recurse a pattern inside a longer one.
You can use
((?:(?P<aux1>\((?:[^()]++|(?&aux1))*\))|[\w.])++)\s*[!=]=\s*((?:(?&aux1)|[\w.])+)
See this regex demo (it takes just 6875 steps to match the string provided, yours takes 13680)
Details
((?:(?P<aux1>\((?:[^()]++|(?&aux1))*\))|[\w.])++) - Group 1, matches one or more occurrences (possessively, due to ++, not allowing backtracking into the pattern so that the regex engine could not re-try matching a string in another way if the subsequent patterns fail to match)
(?P<aux1>\((?:[^()]++|(?&aux1))*\)) - an auxiliary group "aux1" that matches (, then zero or more occurrences of either 1+ chars other than ( and ) or the whole Group "aux1" pattern, and then a )
| - or
[\w.] - a letter, digit, underscore or .
\s*[!=]=\s* - != or == with zero or more whitespace on both ends
((?:(?&aux1)|[\w.])+) - Group 2: one or more occurrences of the Group "aux1" pattern or a letter, digit, underscore or a dot.
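
Python's built-in re module has no recursion or subroutine calls (the pattern above needs the third-party regex package), so as a cross-check the following sketch swaps in a different technique: a small hand-written scanner that walks outward from the first "==" over word characters, dots and balanced parentheses. The function names are hypothetical, and it only handles "==", not "!=".

```python
def _match_back(s, close_idx):
    """Return the index of the '(' that balances s[close_idx] == ')', or -1."""
    depth = 0
    for j in range(close_idx, -1, -1):
        if s[j] == ")":
            depth += 1
        elif s[j] == "(":
            depth -= 1
            if depth == 0:
                return j
    return -1


def _match_forward(s, open_idx):
    """Return the index of the ')' that balances s[open_idx] == '(', or -1."""
    depth = 0
    for j in range(open_idx, len(s)):
        if s[j] == "(":
            depth += 1
        elif s[j] == ")":
            depth -= 1
            if depth == 0:
                return j
    return -1


def operands(line):
    """Return (left, right) around the first '==' in line, or None."""
    pos = line.find("==")
    if pos < 0:
        return None
    # Left operand: skip whitespace, then extend leftwards.
    end = pos
    while end > 0 and line[end - 1].isspace():
        end -= 1
    start = end
    while start > 0:
        ch = line[start - 1]
        if ch.isalnum() or ch in "_.":
            start -= 1
        elif ch == ")":
            j = _match_back(line, start - 1)
            if j < 0:
                break
            start = j
        else:
            break
    # Right operand: skip whitespace, then extend rightwards.
    begin = pos + 2
    while begin < len(line) and line[begin].isspace():
        begin += 1
    stop = begin
    while stop < len(line):
        ch = line[stop]
        if ch.isalnum() or ch in "_.":
            stop += 1
        elif ch == "(":
            j = _match_forward(line, stop)
            if j < 0:
                break
            stop = j + 1
        else:
            break
    return line[start:end], line[begin:stop]


print(operands("boolean b = (num1 * ( 2 + num2)) == value;"))
```

An unmatched parenthesis ends the operand, which is what keeps the enclosing "if (" and the trailing ") {" out of the result.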

Stripping comments from Forth source code using regular expressions

I am trying to match all content between parentheses, including parentheses in a non-greedy way. There should be a space before and after the opening parentheses (or the start of a line before the opening parentheses) and a space before and after the closing parentheses. Take the following text:
( )
( This is a comment )
1 2 +
\ a
: square dup * ;
( foo bar
baz )
(quux)
( ( )
(
( )
The first line should be matched, the second line including its content should be matched, the second last line should not be matched (or raise an error) and the last line should be matched. The two lines foo bar baz should be matched, but (quux) should not as it doesn't contain a space before and after the parentheses. The line with the extra opening parentheses inside should be matched.
I tried a few conventional regexes for matching content between parentheses but without much success. The regex engine is that of Go's.

re := regexp.MustCompile(`(?s)\(( | .*? )\)`)
s = re.ReplaceAllString(s, "")
Playground: https://play.golang.org/p/t93tc_hWAG

Regular expressions "can't count" (that's over-simplified, but bear with me), so you can't match on an unbounded amount of parenthesis nesting. I guess you're mostly concerned about matching only a single level in this case, so you would need to use something like:
foo := regexp.MustCompile(`^ *\( ([^)]*? )?\)$`)
This does require the comment to be the very last thing on a line, so it may be better to add "match zero or more spaces" there. This does NOT match the string "( ( ) )" or try to cater for arbitrary nesting, as that's well outside the counting that regular expressions can do.
What they can do in terms of counting is "count a specific number of times", they can't "count how many blah, then make sure there's the same number of floobs" (that requires going from a regular expression to a context-free grammar).
Playground

Here is a way to match all the 3 lines in question:
(?m)^[\t\p{Zs}]*\([\p{Zs}\t](?:[^()\n]*[\p{Zs}\t])?\)[\p{Zs}\t]*$
See the Go regex demo at the new regex101.com
Details:
(?m) - multiline mode on
^ - due to the above, the start of a line
[\t\p{Zs}]* - 0+ horizontal whitespaces
\( - a (
[\p{Zs}\t] - exactly 1 horizontal whitespace
(?:[^()\n]*[\p{Zs}\t])? - an optional sequence matching:
[^()\n]* - a negated character class matching 0+ characters other than (, ) and a newline
[\p{Zs}\t] - horizontal whitespace
\) - a literal )
[\p{Zs}\t]* - 0+ horizontal whitespaces
$ - due to (?m), the end of a line.
Go playground demo:
package main
import (
	"fmt"
	"regexp"
)
func main() {
	// One-line "( ... )" comment with horizontal whitespace after "(" and before ")".
	var re = regexp.MustCompile(`(?m)^[\t\p{Zs}]*\([\p{Zs}\t](?:[^()\n]*[\p{Zs}\t])?\)[\p{Zs}\t]*$`)
	var str = ` ( )
( This is a comment )
1 2 +
\ a
: square dup * ;
( foo bar
baz )
(quux)
( ( )
(
( )`
	// Print every matched comment line together with its position in the result list.
	for i, match := range re.FindAllString(str, -1) {
		fmt.Println("'", match, "' (found at index", i, ")")
	}
}
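
For readers without a Go toolchain, the same idea can be sketched in Python. Python's re module does not support \p{Zs}, so this version assumes that a plain space or tab is enough as "horizontal whitespace"; that substitution, and the names ONE_LINE_COMMENT, source and matches, are assumptions of the sketch rather than part of the answer.

```python
import re

# Sample Forth source from the question; the backslash line is kept
# in a raw string so "\ a" is not treated as an escape.
source = "\n".join([
    "( )",
    "( This is a comment )",
    "1 2 +",
    r"\ a",
    ": square dup * ;",
    "( foo bar",
    "baz )",
    "(quux)",
    "( ( )",
    "(",
    "( )",
])

# [ \t] stands in for [\p{Zs}\t]; otherwise the pattern mirrors the Go one.
ONE_LINE_COMMENT = re.compile(r"(?m)^[ \t]*\([ \t](?:[^()\n]*[ \t])?\)[ \t]*$")

matches = ONE_LINE_COMMENT.findall(source)
print(matches)
```

Only the three single-line comments should be reported; the two-line "( foo bar / baz )" comment is deliberately left out because [^()\n] refuses to cross a newline.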

R regex: how to remove "*" only in between a group of variables

I have a group of variables, var:
> var
[1] "a1" "a2" "a3" "a4"
here is what I want to achieve: using regex and change strings such as this:
3*a1 + a1*a2 + 4*a3*a4 + a1*a3
to
3a1 + a1*a2 + 4a3*a4 + a1*a3
Basically, I want to trim "*" that is not in between any values in var. Thank you in advance

Can do find (?<![\da-z])(\d+)\* replace $1
(?<! [\da-z] )
( \d+ ) # (1)
\*
Or, ((?:[^\da-z]|^)\d+)\* for the assertion impaired engines
( # (1 start)
(?: [^\da-z] | ^ )
\d+
) # (1 end)
\*
Leading assertions are bad anyways.
Benchmark
Regex1: (?<![\da-z])(\d+)\*
Options: < none >
Completed iterations: 100 / 100 ( x 1000 )
Matches found per iteration: 2
Elapsed Time: 1.09 s, 1087.84 ms, 1087844 µs
Regex2: ((?:[^\da-z]|^)\d+)\*
Options: < none >
Completed iterations: 100 / 100 ( x 1000 )
Matches found per iteration: 2
Elapsed Time: 0.77 s, 767.04 ms, 767042 µs

You can create a dynamic regex out of the var to match and capture *s that are inside your variables, and reinsert them back with a backreference in gsub, and remove all other asterisks:
var <- c("a1","a2","a3","a4")
s = "3*a1 + a1*a2 + 4*a3*a4 + a1*a3"
block = paste(var, collapse="|")
pat = paste0("\\b((?:", block, ")\\*)(?=\\b(?:", block, ")\\b)|\\*")
gsub(pat, "\\1", s, perl=T)
## "3a1 + a1*a2 + 4a3*a4 + a1*a3"
See the IDEONE demo
Here is the regex:
\b((?:a1|a2|a3|a4)\*)(?=\b(?:a1|a2|a3|a4)\b)|\*
Details:
\b - leading word boundary
((?:a1|a2|a3|a4)\*) - Group 1 matching
(?:a1|a2|a3|a4) - either one of your variables
\* - asterisk
(?=\b(?:a1|a2|a3|a4)\b) - a lookahead check that there must be one of your variables (otherwise, no match is returned, the * is matched with the second branch of the alternation)
| - or
\* - a "wild" literal asterisk to be removed.
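
The same "capture what you keep, drop the rest" idea can be sketched in Python. This is an illustration under the assumption that Python's re treats \b and lookaheads like PCRE does here; the names variables, block, pattern and result are made up for the example, and re.escape is added defensively in case a variable name ever contains a regex metacharacter.

```python
import re

variables = ["a1", "a2", "a3", "a4"]
s = "3*a1 + a1*a2 + 4*a3*a4 + a1*a3"

# Build the alternation a1|a2|a3|a4 from the variable list.
block = "|".join(re.escape(v) for v in variables)

# Branch 1 captures "<var>*" when another variable follows;
# branch 2 matches any other asterisk without capturing it.
pattern = re.compile(rf"\b((?:{block})\*)(?=\b(?:{block})\b)|\*")

# Keep group 1 when it took part in the match, otherwise emit nothing.
result = pattern.sub(lambda m: m.group(1) or "", s)
print(result)
```

The callback plays the role of the "\\1" replacement in gsub: asterisks matched by the second branch have no group 1 and are therefore removed.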

Taking the equation as a string, one option is
gsub('((?:^| )\\d)\\*(\\w)', '\\1\\2', '3*a1 + a1*a2 + 4*a3*a4 + a1*a3')
# [1] "3a1 + a1*a2 + 4a3*a4 + a1*a3"
which looks for
a captured group of characters, ( ... )
containing a non-capturing group, (?: ... )
containing the beginning of the line ^
or, |
a space (or \\s)
followed by a digit 0-9, \\d.
The capturing group is followed by an asterisk, \\*,
followed by another capturing group ( ... )
containing an alphanumeric character \\w.
It replaces the above with
the first captured group, \\1,
followed by the second captured group, \\2.
Adjust as necessary.

Thanks to @alistaire for offering a solution with a non-capturing group. However, that solution relies on there being a space between the coefficient and the "+" in front of it. Here's my modified solution based on his suggestion:
> ss <- "3*a1 + a1*a2+4*a3*a4 +2*a1*a3+ 4*a2*a3"
# my modified version
> gsub('((?:^|\\s|\\+|\\-)\\d)\\*(\\w)', '\\1\\2', ss)
[1] "3a1 + a1*a2+4a3*a4 +2a1*a3+ 4a2*a3"
# alistaire's
> gsub('((?:^| )\\d)\\*(\\w)', '\\1\\2', ss)
[1] "3a1 + a1*a2+4*a3*a4 +2*a1*a3+ 4a2*a3"
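
The difference between the two patterns can be reproduced with a short Python sketch. It assumes re.sub with \1\2 behaves like R's gsub for these patterns; the variable names are illustrative. Note that both patterns use a single \d, so they only cover one-digit coefficients.

```python
import re

ss = "3*a1 + a1*a2+4*a3*a4 +2*a1*a3+ 4*a2*a3"

# Original pattern: the coefficient must follow the line start or a space.
space_only = re.sub(r"((?:^| )\d)\*(\w)", r"\1\2", ss)

# Modified pattern: the coefficient may also follow whitespace, "+" or "-".
with_signs = re.sub(r"((?:^|\s|\+|-)\d)\*(\w)", r"\1\2", ss)

print(space_only)
print(with_signs)
```

The first call leaves "+4*a3" and "+2*a1" untouched because no space precedes the digit, which is the gap the modified pattern closes.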

PCRE regex for multiple decimal coordinates using [lon,lat] format

I am trying to create a regex for [lon,lat] coordinates.
The code first checks if the input starts with '['.
If it does we check the validity of the coordinates via a regex
/([\[][-+]?(180(\.0{1,15})?|((1[0-7]\d)|([1-9]?\d))(\.\d{1,15})?),[-+]?([1-8]?\d(\.\d{1,15})?|90(\.0{1,15})?)[\]][\;]?)+/gm
The regex tests for [lon,lat] with 15 decimals [+- 180degrees, +-90degrees]
it should match :
single coordinates :
[120,80];
[120,80]
multiple coordinates
[180,90];[180,67];
[180,90];[180,67]
with newlines
[123,34];[-32,21];
[12,-67]
it should not match:
semicolon separator missing - single
[25,67][76,23];
semicolon separator missing - multiple
[25,67]
[76,23][12,90];
I currently have problems with the ; between coordinates (see 4 & 5)
jsfiddle equivalent here : http://regex101.com/r/vQ4fE0/4

You can try with this (human readable) pattern:
$pattern = <<<'EOD'
~
(?(DEFINE)
    (?<lon> [+-]?
        (?:
            180 (?:\.0{1,15})?
          |
            # 0-179: the [89] branch covers 18 and 19
            (?: 1(?:[0-7][0-9]?|[89])? | [2-9][0-9]? | 0 )
            (?:\.[0-9]{1,15})?
        )
    )
    (?<lat> [+-]?
        (?:
            90 (?:\.0{1,15})?
          |
            # 0-89: [09] covers the single digits 0 and 9
            (?: [1-8][0-9]? | [09] )
            (?:\.[0-9]{1,15})?
        )
    )
)
\A
\[ \g<lon> , \g<lat> ] (?: ; \n? \[ \g<lon> , \g<lat> ] )* ;?
\z
~x
EOD;
Explanations:
When you have to deal with a long pattern in which the same subpatterns are repeated several times, you can use several features to make it more readable.
The best known is free-spacing mode (the x modifier), which lets you indent the pattern as you want (all spaces are ignored) and optionally add comments.
The second is to define subpatterns in a definition section (?(DEFINE)...), where you can declare named subpatterns to be used later in the main pattern.
Since I don't want to repeat the large subpatterns that describe the longitude number and the latitude number, I have created two named patterns, "lon" and "lat", in the definition section. To use them in the main pattern, I only need to write \g<lon> and \g<lat>.
javascript version:
// Longitude 0-180 (the [89] branch covers 18 and 19), latitude 0-90 ([09] covers 0 and 9).
var lon_sp = '(?:[+-]?(?:180(?:\\.0{1,15})?|(?:1(?:[0-7][0-9]?|[89])?|[2-9][0-9]?|0)(?:\\.[0-9]{1,15})?))';
var lat_sp = '(?:[+-]?(?:90(?:\\.0{1,15})?|(?:[1-8][0-9]?|[09])(?:\\.[0-9]{1,15})?))';
var coo_sp = '\\[' + lon_sp + ',' + lat_sp + '\\]';
var regex = new RegExp('^' + coo_sp + '(?:;\\n?' + coo_sp + ')*;?$');
var coordinates = new Array('[120,80];',
                            '[120,80]',
                            '[180,90];[180,67];',
                            '[123,34];[-32,21];\n[12,-67]',
                            '[25,67][76,23];',
                            '[25,67]\n[76,23]');
for (var i = 0; i<coordinates.length; i++) {
    console.log("\ntest "+(i+1)+": " + regex.test(coordinates[i]));
}
fiddle
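
The "build the big pattern from named pieces" approach can also be sketched in Python, which has no (?(DEFINE)...) in its standard re module, so plain string composition is used instead, as in the JavaScript version. The names LON, LAT, COORD, COORD_LIST and is_valid are illustrative assumptions. In this sketch the alternations are written so that the two-digit longitudes 18 and 19 and a latitude of 0 are accepted.

```python
import re

# Longitude: 180 (optionally .000...) or 0-179 with up to 15 decimals.
LON = (r"[+-]?(?:180(?:\.0{1,15})?"
       r"|(?:1(?:[0-7][0-9]?|[89])?|[2-9][0-9]?|0)(?:\.[0-9]{1,15})?)")

# Latitude: 90 (optionally .000...) or 0-89 with up to 15 decimals.
LAT = (r"[+-]?(?:90(?:\.0{1,15})?"
       r"|(?:[1-8][0-9]?|[09])(?:\.[0-9]{1,15})?)")

# One [lon,lat] pair, then any number of ";"-separated pairs
# (an optional newline may follow the semicolon), then an optional ";".
COORD = rf"\[{LON},{LAT}\]"
COORD_LIST = re.compile(rf"{COORD}(?:;\n?{COORD})*;?")


def is_valid(text):
    """Return True when the whole string is a valid coordinate list."""
    return COORD_LIST.fullmatch(text) is not None


for sample in ["[120,80];", "[180,90];[180,67]", "[25,67][76,23];"]:
    print(repr(sample), is_valid(sample))
```

Using fullmatch takes the place of the \A ... \z anchors in the PHP pattern.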

Try this out:
^(\[([+-]?(?!(180\.|18[1-9]|19\d{1}))\d{1,3}(\.\d{1,15})?,[+-]?(?!(90\.|9[1-9]))\d{1,2}(\.\d{1,15})?(\];$|\]$|\];\[)){1,})
Demo: http://regex101.com/r/vQ4fE0/7
Explanation
^(\[
Must start with a bracket
[+-]?
May or may not contain +- in front of the number
(?!(180\.|18[1-9]|19\d{1}))
Should not contain 180., 181-189 nor 19x
\d{1,3}(\.\d{1,15})?
Otherwise, any number containing 1 to 3 digits, with or without decimals (up to 15), is allowed
(?!(90\.|9[1-9]))
The 90 check is similar, but here we are not allowing 90. nor 91-99
\d{1,2}(\.\d{1,15})?
Otherwise, any number containing 1 or 2 digits, with or without decimals (up to 15) are allowed
(\];$|\]$|\];\[)
The ending of a bracket body must have a ; separating two bracket bodies, otherwise it must be the end of the line.
{1,}
The brackets can exist 1 or multiple times
Hope this was helpful.

This might work. Note that you have a lot of capture groups, none of which
will give you good information because of recursive quantifiers.
# /^(\[[-+]?(180(\.0{1,15})?|((1[0-7]\d)|([1-9]?\d))(\.\d{1,15})?),[-+]?([1-8]?\d(\.\d{1,15})?|90(\.0{1,15})?)\](?:;\n?|$))+$/
^
(                               # (1 start)
    \[
    [-+]?
    (                           # (2 start)
        180
        ( \. 0{1,15} )?         # (3)
      |
        (                       # (4 start)
            ( 1 [0-7] \d )      # (5)
          |
            ( [1-9]? \d )       # (6)
        )                       # (4 end)
        ( \. \d{1,15} )?        # (7)
    )                           # (2 end)
    ,
    [-+]?
    (                           # (8 start)
        [1-8]? \d
        ( \. \d{1,15} )?        # (9)
      |
        90
        ( \. 0{1,15} )?         # (10)
    )                           # (8 end)
    \]
    (?: ; \n? | $ )
)+                              # (1 end)
$

Try a function approach, where the function can do some of the splitting for you, as well as delegating the number comparisons away from the regex. I tested it here: http://repl.it/YyG/3
//represents regex necessary to capture one coordinate, which
// looks like 123 or 123.13532
// the decimal part is a non-capture group ?:
var oneCoord = '(-?\\d+(?:\\.\\d+)?)';
//console.log("oneCoord is: "+oneCoord+"\n");
//one coordinate pair is represented by [x,x]
// check start/end with ^, $
var coordPair = '^\\['+oneCoord+','+oneCoord+'\\]$';
//console.log("coordPair is: "+coordPair+"\n");
//the full regex string consists of one or more coordinate pairs,
// but we'll do the splitting in the function
var myRegex = new RegExp(coordPair);
//console.log("my regex is: "+myRegex+"\n");
function isPlusMinus180(x)
{
    return -180.0<=x && x<=180.0;
}
function isPlusMinus90(y)
{
    return -90.0<=y && y<=90.0;
}
function isValid(s)
{
    //if there's a trailing semicolon, remove it
    if(s.slice(-1)==';')
    {
        s = s.slice(0,-1);
    }
    //remove all newlines and split by semicolon
    var all = s.replace(/\n/g,'').split(';');
    //console.log(all);
    for(var k=0; k<all.length; ++k)
    {
        var match = myRegex.exec(all[k]);
        if(match===null)
            return false;
        console.log(" match[1]: "+match[1]);
        console.log(" match[2]: "+match[2]);
        //break out if one pair is bad
        if(! (isPlusMinus180(match[1]) && isPlusMinus90(match[2])) )
        {
            console.log(" one of matches out of bounds");
            return false;
        }
    }
    return true;
}
var coords = new Array('[120,80];',
                       '[120.33,80]',
                       '[180,90];[180,67];',
                       '[123,34];[-32,21];\n[12,-67]',
                       '[25,67][76,23];',
                       '[25,67]\n[76,23]',
                       '[190,33.33]',
                       '[180.33,33]',
                       '[179.87,90]',
                       '[179.87,91]');
var s;
for (var i = 0; i<coords.length; i++) {
    s = coords[i];
    console.log((i+1)+". ==== testing "+s+" ====");
    console.log(" isValid? => " + isValid(s));
}

Replacement of strings within 2 strings in regex

I have a string:
dkj a * & &*(&(*(
//#HELLO
^%#&UJNWDUK()C*(v 8*J DK*9
//#HE#$^&&(akls#$98akdjl ak##sjdkja
//
%^&*(//#HELLO//#BYE<><>
//#BYE
^%#&UJNWDUK()C*(v 8*J DK*90K )
//#HELLO
&*^J$XUK 8j8 j jk kk8(&*(
//#BYE
and I need to find 2 groups, where each group starts with //#HELLO on its own line, may contain any text (.*), and ends with //#BYE on its own line:
1)
//#HELLO
^%#&UJNWDUK()C*(v 8*J DK*9
//#HE#$^&&(akls#$98akdjl ak##sjdkja
//
%^&*(//#HELLO//#BYE<><>
//#BYE
2)
//#HELLO
&*^J$XUK 8j8 j jk kk8(&*(
//#BYE
and replaces the original string to this: (basically adding // to each line of each group)
dkj a * & &*(&(*(
////#HELLO
//^%#&UJNWDUK()C*(v 8*J DK*9
////#HE#$^&&(akls#$98akdjl ak##sjdkja
////
//%^&*(//#HELLO//#BYE<><>
////#BYE
^%#&UJNWDUK()C*(v 8*J DK*90K )
////#HELLO
//&*^J$XUK 8j8 j jk kk8(&*(
////#BYE
Here is my current progress:
I have
\/\/#HELLO\n.*?\/\/#BYE[\n$]
However, I'm not sure how to go about the replacement. I'm thinking of separating each line per group using \G after the //#HELLO and ending with //#BYE

It's a bit complex, but this will do it:
Search: (?m)(//#HELLO[\r\n]+|\G(?://#BYE|(?=(?:[^#]|#(?!HELLO[\r\n]+))*#BYE)[^\r\n]*[\r\n]*))
Replace: //$1
In Groovy:
String resultString = subjectString.replaceAll(/(?m)(\/\/#HELLO[\r\n]+|\G(?:\/\/#BYE|(?=(?:[^#]|#(?!HELLO[\r\n]+))*#BYE)[^\r\n]*[\r\n]*))/, '//$1');
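
Python's re module has no \G anchor, so the following sketch swaps in a two-step technique rather than reproducing the pattern above: first match each //#HELLO ... //#BYE block as a whole, then prefix every line of the block inside a replacement callback. It assumes \n line endings and that the markers sit alone on their lines; BLOCK, comment_out and the other names are illustrative.

```python
import re

# Sample input from the question, one list entry per line.
lines = [
    "dkj a * & &*(&(*(",
    "//#HELLO",
    "^%#&UJNWDUK()C*(v 8*J DK*9",
    "//#HE#$^&&(akls#$98akdjl ak##sjdkja",
    "//",
    "%^&*(//#HELLO//#BYE<><>",
    "//#BYE",
    "^%#&UJNWDUK()C*(v 8*J DK*90K )",
    "//#HELLO",
    "&*^J$XUK 8j8 j jk kk8(&*(",
    "//#BYE",
]
text = "\n".join(lines)

# A block runs from a line that is exactly //#HELLO to the nearest
# following line that is exactly //#BYE (re.S lets "." cross newlines).
BLOCK = re.compile(r"^//#HELLO$.*?^//#BYE$", re.M | re.S)


def comment_out(match):
    """Prefix every line of the matched block with //."""
    return "\n".join("//" + line for line in match.group(0).split("\n"))


result = BLOCK.sub(comment_out, text)
print(result)
```

Because the markers are anchored with ^ and $, the inline "//#HELLO//#BYE" on the sixth line neither starts nor ends a block.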

For grouping into separate lines use the following regex:
//#HELLO\r(.*[\n\r]+)*//#BYE\r?
\r - Carriage return (a line-break character)
[\n\r] - A line feed or a carriage return
*? - Non-greedy match
?- Match 1 or 0 times
You can take out the ? at the end if it always ends with a newline.
You can then use the group (The value inside the brackets) to search and replace.
