Regular expression for contents of parenthesis in Racket - regex

How can I get the contents of parentheses in Racket? The contents may contain more parentheses. I tried:
(regexp-match #rx"((.*))" "(check)")
But the output has "(check)" three times rather than one:
'("(check)" "(check)" "(check)")
And I want only "check" and not "(check)".
Edit: for nested parentheses, everything inside the outermost pair should be returned. Hence (a (1 2) c) should return "a (1 2) c".

Unescaped parentheses in a regular expression capture; they do not match literal parenthesis characters. So #rx"((.*))" makes two captures of everything. Thus:
(regexp-match #rx"((.*))" "any text")
; ==> ("any text" "any text" "any text")
The resulting list has the whole match first, then the outer set of capturing parentheses, and then the inner set. If you want to match literal parentheses you need to escape them:
(regexp-match #rx"\\((.*)\\)" "any text")
; ==> #f
(regexp-match #rx"\\((.*)\\)" "(a (1 2) c)")
; ==> ("(a (1 2) c)" "a (1 2) c")
Now the first element is the whole match: the match can start at any location in the search string and, because .* is greedy, it extends as far as possible. The second element is the single capture.
This will fail if the string has additional sets of parentheses, e.g.
(regexp-match #rx"\\((.*)\\)" "(1 2 3) (a (1 2) c)")
; ==> ("(1 2 3) (a (1 2) c)" "1 2 3) (a (1 2) c")
That's because the expression isn't nesting-aware. To make it nesting-aware you need recursive regular expressions, like those in Perl with the (?R) syntax and friends, but Racket doesn't have these (yet?).
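Without recursive regular expressions, one option is to skip regex and count nesting depth by hand. The following is a minimal sketch in Python rather than Racket, and the function name first_paren_contents is made up for illustration. It returns the text inside the first balanced group, so the two-group input above yields "1 2 3" instead of running on to the last closing parenthesis.

```python
def first_paren_contents(text):
    """Return the text inside the first balanced (...) group, or None."""
    start = text.find("(")
    if start < 0:
        return None  # no opening parenthesis at all
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                # The parenthesis opened at `start` closes here.
                return text[start + 1:i]
    return None  # the opening parenthesis is never closed


print(first_paren_contents("(a (1 2) c)"))
print(first_paren_contents("(1 2 3) (a (1 2) c)"))
```

The same loop translates directly to a Racket for loop over the string's characters.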

Related

Removing everything between nested parentheses

To remove everything between parentheses, I currently use:
SELECT
REGEXP_REPLACE('(aaa) bbb (ccc (ddd) / eee)', "\\([^()]*\\)", "");
This is incorrect: it gives bbb (ccc / eee), because it removes only the innermost parentheses.
How can I remove everything between nested parentheses? The expected result for this example is bbb.
In Google BigQuery this is only possible if you know the maximum nesting depth, because BigQuery uses the RE2 library, which doesn't support regex recursion. For example, the following pattern (shown in JavaScript) handles one level of nesting:
let r = /\((?:(?:\((?:[^()])*\))|(?:[^()]))*\)/g
let s = "(aaa) bbb (ccc (ddd) / eee)"
console.log(s.replace(r, ""))
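If the maximum depth is known in advance, the pattern can be generated instead of written by hand. This is a rough sketch in Python; nested_paren_pattern is a hypothetical helper, and the pattern it builds uses only constructs that non-recursive engines generally accept (character classes, non-capturing groups, alternation).

```python
import re


def nested_paren_pattern(max_depth):
    """Build a pattern matching parenthesized groups nested up to max_depth levels.

    max_depth=1 matches only flat groups such as (aaa);
    max_depth=2 also matches one nested level such as (ccc (ddd) / eee).
    """
    pattern = r"\([^()]*\)"
    for _ in range(max_depth - 1):
        # Wrap the previous pattern so it may appear inside another pair.
        pattern = r"\((?:[^()]|" + pattern + r")*\)"
    return pattern


s = "(aaa) bbb (ccc (ddd) / eee)"
print(repr(re.sub(nested_paren_pattern(2), "", s)))
```

The generated string can then be pasted into REGEXP_REPLACE. Input nested deeper than max_depth is only partially removed.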
If you can iterate on the regular expression operation until you reach a fixed point you can do it like this:
repeat {
  old_string := string
  string := remove_non_nested_parens_using_regex(string)
} until (string == old_string)
For instance if we have
((a(b)) (c))x)
on the first iteration we remove (b) and (c): sequences which begin with (, end with ) and do not contain parentheses, matched by \([^()]*\). We end up with:
((a) )x)
Then on the next iteration, (a) is gone:
( )x)
and after one more iteration, ( ) is gone:
x)
When we try removing more parentheses, there is no more change, and so the algorithm terminates with x).
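The fixed-point loop above can be sketched in Python as follows. The function name strip_nested_parens is chosen here for illustration; the regex is the \([^()]*\) pattern from the walkthrough.

```python
import re

# Matches "(", then any run of non-parenthesis characters, then ")".
INNERMOST = re.compile(r"\([^()]*\)")


def strip_nested_parens(s):
    """Repeatedly delete innermost (...) groups until nothing changes."""
    while True:
        reduced = INNERMOST.sub("", s)
        if reduced == s:
            return s  # fixed point reached
        s = reduced


print(strip_nested_parens("((a(b)) (c))x)"))
```

Each pass removes one nesting level, so the number of passes is the maximum depth plus one final pass that detects no change.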

How to extract the operands on both sides of "==" using regex?

Language and package
python3.8, regex
Description
The inputs and desired outputs are as follows:
if (programWorkflowState.getTerminal(1, 2) == Boolean.TRUE) {
Want: programWorkflowState.getTerminal(1, 2) and Boolean.TRUE
boolean ignore = !_isInStatic.isEmpty() && (_isInStatic.peek() == 3) && isAnonymous;
Want: _isInStatic.peek() and 3
boolean b = (num1 * ( 2 + num2)) == value;
Want: (num1 * ( 2 + num2)) and value
My current regex
((?:\((?:[^\(\)]|(?R))*\)|[\w\.])+)\s*==\s*((?:\((?:[^\(\)]|(?R))*\)|[\w\.])+)
This pattern is meant to match \((?:[^\(\)]|(?R))*\) or [\w\.] on both sides of "==".
Result on regex101.com
Problem: it fails to match the recursive part (num1 * ( 2 + num2)).
The explanation of the recursive pattern \((?:m|(?R))*\) is here
But if I use only the recursive pattern, it succeeds in matching (num1 * ( 2 + num2)), as the image shows.
What's the right regex to achieve my purpose?
The \((?:m|(?R))*\) pattern contains a (?R) construct (equal to (?0) subroutine) that recurses the entire pattern.
You need to wrap the pattern you need to recurse with a group and use a subroutine instead of (?R) recursion construct, e.g. (?P<aux>\((?:m|(?&aux))*\)) to recurse a pattern inside a longer one.
You can use
((?:(?P<aux1>\((?:[^()]++|(?&aux1))*\))|[\w.])++)\s*[!=]=\s*((?:(?&aux1)|[\w.])+)
See this regex demo (it takes just 6875 steps to match the string provided, yours takes 13680)
Details
((?:(?P<aux1>\((?:[^()]++|(?&aux1))*\))|[\w.])++) - Group 1, matches one or more occurrences (possessively, due to ++, not allowing backtracking into the pattern so that the regex engine could not re-try matching a string in another way if the subsequent patterns fail to match)
(?P<aux1>\((?:[^()]++|(?&aux1))*\)) - an auxiliary group "aux1" that matches (, then zero or more occurrences of either 1+ chars other than ( and ) or the whole Group "aux1" pattern, and then a )
| - or
[\w.] - a letter, digit, underscore or .
\s*[!=]=\s* - != or == with zero or more whitespace on both ends
((?:(?&aux1)|[\w.])+) - Group 2: one or more occurrences of the Group "aux1" pattern or a letter, digit, underscore or a dot.
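The subroutine syntax above needs the third-party regex module; the standard-library re module has no recursion. As an alternative technique (not the answer's regex), here is a sketch that finds the first "==" and walks outward, balancing parentheses by hand. All function names are hypothetical, it handles only "==", and operand characters are assumed to be letters, digits, underscore and dot, mirroring [\w.].

```python
def _match_back(s, close_idx):
    """Index of the '(' matching the ')' at close_idx, or -1."""
    depth = 0
    for j in range(close_idx, -1, -1):
        if s[j] == ")":
            depth += 1
        elif s[j] == "(":
            depth -= 1
            if depth == 0:
                return j
    return -1


def _match_forward(s, open_idx):
    """Index of the ')' matching the '(' at open_idx, or -1."""
    depth = 0
    for j in range(open_idx, len(s)):
        if s[j] == "(":
            depth += 1
        elif s[j] == ")":
            depth -= 1
            if depth == 0:
                return j
    return -1


def operands_around_eq(line):
    """Return (left, right) operands of the first '==', or None."""
    eq = line.find("==")
    if eq < 0:
        return None
    # Left operand: skip whitespace, then walk backwards.
    left_end = eq
    while left_end > 0 and line[left_end - 1].isspace():
        left_end -= 1
    i = left_end
    while i > 0:
        ch = line[i - 1]
        if ch == ")":
            j = _match_back(line, i - 1)
            if j < 0:
                break
            i = j  # jump over the whole balanced group
        elif ch.isalnum() or ch in "_.":
            i -= 1
        else:
            break
    # Right operand: skip whitespace, then walk forwards.
    right_start = eq + 2
    while right_start < len(line) and line[right_start].isspace():
        right_start += 1
    k = right_start
    while k < len(line):
        ch = line[k]
        if ch == "(":
            j = _match_forward(line, k)
            if j < 0:
                break
            k = j + 1
        elif ch.isalnum() or ch in "_.":
            k += 1
        else:
            break
    return line[i:left_end], line[right_start:k]


print(operands_around_eq("boolean b = (num1 * ( 2 + num2)) == value;"))
```

An unmatched opening parenthesis to the left of the operand, such as the one after "if", stops the backward walk, which is why the first sample line yields programWorkflowState.getTerminal(1, 2) without the leading "(".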

How can I split by commas while ignoring any comma that's inside quotes? [duplicate]

I have a TypeScript file that takes a CSV file and splits it using the following code:
var cells = rows[i].split(",");
I now need to fix this so that any comma that's inside quotes does not result in a split. For example, The,"quick, brown fox", jumped should split into The, quick, brown fox, and jumped instead of also splitting quick and brown fox. What is the proper way to do this?
Update:
I think the final version in a line should be:
var cells = (rows[i] + ',').split(/(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/).slice(1).reduce((a, b) => (a.length > 0 && a[a.length - 1].length < 4) ? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]] : [...a, [b]], []).map(e => e.reduce((a, b) => a !== undefined ? a : b, undefined))
or put it more beautifully:
var cells = (rows[i] + ',')
.split(/(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/)
.slice(1)
.reduce(
(a, b) => (a.length > 0 && a[a.length - 1].length < 4)
? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]]
: [...a, [b]],
[],
)
.map(
e => e.reduce(
(a, b) => a !== undefined ? a : b, undefined,
),
)
;
This is rather long, but still looks purely functional. Let me explain it:
First, the regular expression part. Basically, a segment you want may fall into 3 possibilities:
The first is  *?([^",]+?) *?, (note that the pattern starts with a space): a string containing neither " nor a comma, optionally surrounded by spaces and followed by a comma.
The second is " *?(.+?)" *?,: a string enclosed in a pair of quotes, with any number of spaces just after the opening quote and after the closing quote, followed by a comma.
The third is ( *?),: any number of spaces, followed by a comma.
So splitting by a non-capturing group of a union of these three will basically get us to the answer.
Recall that when splitting with a regular expression, the resulting array consists of:
Strings separated by the separator (the regular expression)
All the capturing groups in the separator
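This split behaviour can be seen in a tiny example. The sketch below uses Python's re.split, which behaves analogously to JavaScript's String.prototype.split here, except that an unmatched group shows up as None rather than undefined. The pattern and input are made up for illustration.

```python
import re

# Two capturing groups in the separator: one for "-", one for "+".
parts = re.split(r"(-)|(\+)", "a-b+c")

# Each separator contributes one entry per capturing group,
# with None for the group that did not take part in the match.
print(parts)
```

Here the separated strings are a, b and c, and every separator adds two entries, one per capturing group.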
In our case, the separators fill the whole string, so the strings separated are all empty strings, except that last desired part, which is left out because there is no , following it. Thus the resulting array should be like:
An empty string
Three strings, representing the three capturing groups of the first separator matched
An empty string
Three strings, representing the three capturing groups of the second separator matched
...
An empty string
The last desired part, left alone
So why not simply add a , at the end so that we get a perfect pattern? This is how (rows[i] + ',') comes about.
In this case the resulting array becomes capturing groups separated by empty strings. After removing the first empty string, they appear in groups of 4 as [ 1st capturing group, 2nd capturing group, 3rd capturing group, empty string ].
What the reduce block does is exactly grouping them into groups of 4:
.reduce(
(a, b) => (a.length > 0 && a[a.length - 1].length < 4)
? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]]
: [...a, [b]],
[],
)
Finally, take the first non-undefined element of each group; these are precisely the desired parts. An unmatched capturing group appears as undefined, and our three patterns are mutually exclusive (no two of them can match at the same time), so exactly one of the three capturing groups is defined in each group of 4:
.map(
e => e.reduce(
(a, b) => a !== undefined ? a : b, undefined,
),
)
This completes the solution.
I think the following should suffice:
var cells = rows[i].split(/([^",]+?|".+?") *, */).filter(e => e)
or if you don't want the quotes:
var cells = rows[i].split(/(?:([^",]+?)|"(.+?)") *, */).filter(e => e)
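For comparison, the same splitting rule can be written without a regex as a small state machine that tracks whether the scanner is inside quotes. This is a sketch in Python rather than TypeScript; split_csv_line is a hypothetical name, and it makes two simplifying assumptions: cells are trimmed of surrounding whitespace, and escaped quotes ("" inside a quoted cell) are not handled.

```python
def split_csv_line(line):
    """Split on commas that are outside double quotes; drop the quotes."""
    cells = []
    current = []
    in_quotes = False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes  # toggle state, do not keep the quote
        elif ch == "," and not in_quotes:
            cells.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    cells.append("".join(current).strip())  # last cell has no trailing comma
    return cells


print(split_csv_line('The,"quick, brown fox", jumped'))
```

The loop ports almost line for line to TypeScript, and unlike the filter(e => e) variants it keeps empty cells.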

How to use Groovy's replaceFirst with closure?

I'm new to Groovy and have a question about replaceFirst with a closure.
The groovy-jdk API doc gives me examples of...
assert "hellO world" == "hello world".replaceFirst("(o)") { it[0].toUpperCase() } // first match
assert "hellO wOrld" == "hello world".replaceAll("(o)") { it[0].toUpperCase() } // all matches
assert '1-FISH, two fish' == "one fish, two fish".replaceFirst(/([a-z]{3})\s([a-z]{4})/) { [one:1, two:2][it[1]] + '-' + it[2].toUpperCase() }
assert '1-FISH, 2-FISH' == "one fish, two fish".replaceAll(/([a-z]{3})\s([a-z]{4})/) { [one:1, two:2][it[1]] + '-' + it[2].toUpperCase() }
The first two examples are quite straightforward, but I can't understand the remaining ones.
First, what does [one:1, two:2] mean?
I don't even know what it is called, so I can't search for it.
Second, why is there a list of "it"?
The doc says replaceFirst()
Replaces the first occurrence of a captured group by the result of a closure call on that text.
Doesn't "it" refer to "the first occurrence of a captured group"?
I would appreciate any tips and comments!
First, [one:1, two:2] is a Map:
assert [one:1, two:2] instanceof java.util.Map
assert 1 == [one:1, two:2]['one']
assert 2 == [one:1, two:2]['two']
assert 1 == [one:1, two:2].get('one')
assert 2 == [one:1, two:2].get('two')
So, basically, the code inside the closure uses that map as a lookup table to replace one with 1 and two with 2.
Second, let's see how the regex matcher works:
To find out how many groups are present in the expression, call the
groupCount method on a matcher object. The groupCount method returns
an int showing the number of capturing groups present in the matcher's
pattern. In this example, groupCount would return the number 4,
showing that the pattern contains 4 capturing groups.
There is also a special group, group 0, which always represents the entire expression. This group is not included in the total reported by
groupCount. Groups beginning with (? are pure, non-capturing groups
that do not capture text and do not count towards the group total.
Drilling down to the regex stuff:
def m = 'one fish, two fish' =~ /([a-z]{3})\s([a-z]{4})/
assert m instanceof java.util.regex.Matcher
m.each { group ->
println group
}
This yields:
[one fish, one, fish] // only this first match for "replaceFirst"
[two fish, two, fish]
So we can rewrite the code more clearly by renaming it to group (it is simply the default name of the argument in a single-argument closure):
assert '1-FISH, two fish' == "one fish, two fish".replaceFirst(/([a-z]{3})\s([a-z]{4})/) { group ->
[one:1, two:2][group[1]] + '-' + group[2].toUpperCase()
}
assert '1-FISH, 2-FISH' == "one fish, two fish".replaceAll(/([a-z]{3})\s([a-z]{4})/) { group ->
[one:1, two:2][group[1]] + '-' + group[2].toUpperCase()
}
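For readers more familiar with other languages, the same idea (a callback receiving the whole match plus its groups) can be sketched in Python, where re.sub accepts a function and count=1 plays the role of replaceFirst. This is only an analogue, not Groovy code; the names LOOKUP and to_label are made up.

```python
import re

LOOKUP = {"one": 1, "two": 2}  # plays the role of the Groovy map [one:1, two:2]
PATTERN = re.compile(r"([a-z]{3})\s([a-z]{4})")


def to_label(match):
    # match.group(0) is the whole match, like group[0] in the Groovy closure;
    # group(1) and group(2) are the two capturing groups.
    return f"{LOOKUP[match.group(1)]}-{match.group(2).upper()}"


first_only = PATTERN.sub(to_label, "one fish, two fish", count=1)
all_matches = PATTERN.sub(to_label, "one fish, two fish")

print(PATTERN.groups)  # number of capturing groups; group 0 is not counted
print(first_only)
print(all_matches)
```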

Regular Expression: how to match least characters (non-greedy) but no less than N

I want a method leastMatch(String input, String regExp, int n).
input is the input text, regExp is the regular expression and n is the minimum number of characters the result must have.
The return value is the shortest prefix of the input text that matches the regular expression and contains at least n characters.
e.g.
leastMatch("abababab", "(ab)+", 3) //returns "abab"
leastMatch("abababab", "(ab)+", 0) //returns ""
leastMatch("abababab", "(ab)+", 4) //returns "abab"
Use this syntax:
(ab){3,} --> 3 or more
(ab){0,2} --> less than 3
(ab){2,5} --> between 2 and 5
(ab){3} --> 3
You can easily work out the number of repetitions of ab to allow from n and the length of ab (2 in this example): divide n by that length and round up.
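As a rough sketch of that calculation, assuming the expression is a single fixed string repeated (as in "(ab)+") rather than an arbitrary regex: compute k = ceil(n / length) and use a lazy {k,}? quantifier. The Python function least_match below is hypothetical and takes the repeated unit instead of a full regular expression.

```python
import math
import re


def least_match(text, unit, n):
    """Shortest prefix of text made of repeated `unit` with at least n characters.

    Assumes the pattern is just `unit` repeated; returns None if text
    does not start with enough repetitions.
    """
    k = math.ceil(n / len(unit))  # minimum number of repetitions needed
    # Lazy quantifier: take at least k repetitions and no more than necessary.
    pattern = "(?:" + re.escape(unit) + "){" + str(k) + ",}?"
    m = re.match(pattern, text)
    return m.group(0) if m else None


print(least_match("abababab", "ab", 3))
print(repr(least_match("abababab", "ab", 0)))
print(least_match("abababab", "ab", 4))
```

For an arbitrary regular expression this arithmetic no longer applies, since the length of one repetition is not fixed.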