Regex - locate first ')' after matching string

I am doing a mass method replace in my C# codebase. I have lines of code that look like the following:
Assert.That(Edit.FundsTable.GetCellByIndexes(0, 2).Text.Contains("Employer Request IPM A"));
The problem is that initially when the GetCellByIndexes call was made, we had another method that basically did the same thing, leaving us doing the exact same task 2 ways. The more standard way that we are changing it to is the following:
Assert.That(Edit.FundsTable.Cells[0, 2].Text.Contains("Employer Request IPM A"));
I am trying to use Visual Studio's Replace All to convert GetCellByIndexes calls to Cells calls. The issue is with the right paren. I can do a replace all from
GetCellByIndexes(
to
Cells[
very easily. The problem is changing the right paren of the method call to a square bracket. Does anyone know how to identify the first right paren after the "GetCellByIndexes" string using regex?

Use
GetCellByIndexes\(([^()]+)\)
Replace with Cells[$1]. See proof.
Code                Explanation
GetCellByIndexes    'GetCellByIndexes'
\(                  '('
(                   group and capture to $1:
  [^()]+              any character except: '(', ')' (1 or more times (matching the most amount possible))
)                   end of $1
\)                  ')'
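A quick way to sanity-check the replacement outside Visual Studio is a short Python sketch. It assumes Python's re module, which writes the backreference as \1 where Visual Studio uses $1; the helper name to_cells is made up for illustration.

```python
import re

# Pattern from the answer: the literal method name, "(", a run of
# non-parenthesis characters captured as group 1, then the first ")".
PATTERN = re.compile(r"GetCellByIndexes\(([^()]+)\)")


def to_cells(line: str) -> str:
    """Rewrite GetCellByIndexes(...) calls as Cells[...] indexers."""
    return PATTERN.sub(r"Cells[\1]", line)


sample = 'Assert.That(Edit.FundsTable.GetCellByIndexes(0, 2).Text.Contains("Employer Request IPM A"));'
print(to_cells(sample))
```

Because [^()] cannot cross a parenthesis, a call whose arguments contain their own parentheses, such as GetCellByIndexes(f(1), 2), is left untouched rather than rewritten incorrectly.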

In general search for:
GetCellByIndexes\(\s*(\d+)\s*,\s*(\d+)\s*\)
replace with
Cells[$1, $2]
This works both in Visual Studio's search-and-replace and in C# code.
Note that this will only work if the indexes are numbers... If they are something more complex (variables, or functions) then it becomes more interesting (and complex).
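The numeric-only limitation can be seen in a small Python sketch. The helper name convert_numeric is hypothetical, and Python's \1 and \2 stand in for $1 and $2.

```python
import re

# Numeric-only pattern from the answer: two integer indexes with
# optional whitespace around each of them.
NUMERIC = re.compile(r"GetCellByIndexes\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def convert_numeric(line: str) -> str:
    """Rewrite calls with literal integer indexes and normalise the spacing."""
    return NUMERIC.sub(r"Cells[\1, \2]", line)


print(convert_numeric("t.GetCellByIndexes( 0 ,2 ).Text"))    # rewritten
print(convert_numeric("t.GetCellByIndexes(row, col).Text"))  # left as-is
```

The second call comes back unchanged because row and col are not digits, which is the case the note above warns about.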

Related

Regex that does not match ( and ), but matches '(' and ')'

I am trying to fix a regular expression used in tokenization so that it matches everything, including '(' and ')', but does not match ( and ) unless they are surrounded by apostrophes.
Use case examples which should be matched:
'('AN')'
'('AN
AN')'
...and every other possibility involving '(' or ')' combined or not with any string
Currently, it looks like this:
[^\)\(]+
The most successful result I have obtained so far is:
[^\)\(]+|\'.*?\'
This manages to correctly match expressions like: '('AN')' , '(' , ')' , AN , '('')' , '()'.
But it fails for: AN'(' , AN')' , '('AN , ')'AN.
NOTE: I have done some research, and found that the regex engine involved is quite old (around 1980s) and is called PCLNT (I am not 100% sure about its name). I mention this because in some other situations when I dealt with regular expressions, the regex engines available online showed the correct result, but in my application it did not even compile.
Any help would be great, also if anyone knows anything about this possible engine and its documentation please guide me.

This regex will match a sequence of any combination of characters other than parentheses or anything between apostrophes. It then optionally matches a single apostrophe followed by any sequence of unspecial characters, in order to catch unpaired apostrophes:
([^()']*|'[^']*')*('[^'()]*)?
I know nothing about the regex library you are using, but I don't think there's anything out of the ordinary in that regex.

I think what you're looking for is any string containing '(' OR any string containing ')'
.*(\'\(\').*|.*(\'\)\').*
Example here

It seems you want to match anything except a bracket without an adjacent apostrophe:
^('[()]|[()]'|[^()])+$
See live demo.
Note that you don’t have to escape brackets in a character class.
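A minimal Python sketch for trying that pattern against the cases from the question. It assumes Python's re module rather than the asker's engine, whose syntax support is unknown, and is_valid is an illustrative name.

```python
import re

# A bracket is accepted only next to an apostrophe; every other
# character is accepted as it is.
TOKEN = re.compile(r"^('[()]|[()]'|[^()])+$")


def is_valid(text: str) -> bool:
    """True when every bracket in text has an adjacent apostrophe."""
    return TOKEN.match(text) is not None


for case in ["'('AN')'", "AN'('", "AN')'", "'('AN", "')'AN", "(AN)", "AN("]:
    print(case, is_valid(case))
```

The first five inputs come from the question and should be accepted; the last two contain bare brackets and should be rejected.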

Capturing what's inside a nested structure in a regex or grammar token

I'd like to capture the interior of a nested structure.
my $str = "(a)";
say $str ~~ /"(" ~ ")" (\w) /;
say $str ~~ /"(" ~ ")" <(\w)> /;
say $str ~~ /"(" <(~)> ")" \w /;
say $str ~~ /"(" <(~ ")" \w /;
The first one works; the last one works but also captures the closing parenthesis. The other two fail, so it's not possible to use capture markers in this case. But the problem is more complicated in the context of a grammar, since capturing groups do not seem to work either, like here:
# Please paste this together with the code above so that it compiles.
grammar G {
token TOP {
'(' ~ ')' $<content> = .+?
}
}
grammar H {
token TOP {
'(' ~ ')' (.+?)
}
}
grammar I {
token TOP {
'(' ~ ')' <( .+? )>
}
}
$str = "(one of us)";
for G,H,I -> $grammar {
say $grammar.parse( $str );
}
Neither capturing groups nor capture markers seem to work; the only thing that does is assigning the match, on the fly, to a variable. This, however, creates an additional token I'd really like to avoid.
So there are two questions
What is the right way to make capture markers work in nested structures?
Is there a way to use either capturing groups or capturing markers in tokens to get the interior of a nested structure?

One solution to two issues
Per ugexe's comment, the [...] grouping construct works for all your use cases.
The <( and )> capture markers are not grouping constructs so they don't work with the regex ~ operation unless they're grouped.
The (...) capture/grouping construct clamps frugal matching to its minimum match when ratchet is in effect. A pattern like :r (.+?) never matches more than one character.
The behaviors described in the last two bullet points above aren't obvious, aren't in the docs, may not be per the design docs, may be holes in roast, may be figments of my imagination, etc. The rest of this answer explains what I've found out about the above three cases, and discusses some things that could be done.
Glib explanation, as if it's all perfectly cromulent
<( and )> are capture markers.
They behave as zero width assertions. Each asserts "this marks where I want capturing to start/end for the regex that contains this marker".
Per the doc for the regex ~ operator:
it mostly ignores the left argument, and operates on the next two [arguments]
(The doc says "atoms" where I've written "arguments". In reality it operates on the next two atoms or groups.)
In the regex pattern "(" ~ ")" <(\w)>:
")" is the first atom/group after ~.
<( is the second atom/group after ~.
~ ignores \w)>.
The solution is to use [...]:
say '(a)' ~~ / '(' ~ ')' [ <( \w )> ] /; # ｢a｣
Similarly, in a grammar:
token TOP { '(' ~ ')' [ <( .+? )> ] }
(...) grouping isn't what you want for two reasons:
It couldn't be what you want. It would create an additional token capture. And you wrote you'd like to avoid that.
Even if you wanted the additional capture, using (...) when ratchet is in effect clamps frugal matching within the parens.
What could be done about capture markers "not working"?
I think a doc update is the likely best thing to do. But imo whoever thinks of filing an issue about one, or preparing a PR, would be well advised to make use of the following.
Is it known to be intended behavior or a bug?
Searches of GH repos for "capture markers":
raku/old-design-docs
raku/roast
raku/old-issue-tracker and rakudo/rakudo
raku/docs
The term "capture markers" comes from the doc, not the old design docs which just say:
A <( token indicates the start of the match's overall capture, while the corresponding )> token indicates its endpoint. When matched, these behave as assertions that are always true, but have the side effect of setting the .from and .to attributes of the match object.
(Maybe you can figure out from that what strings to search for among issues etc...)
At the time of writing, all GH searches for <( or )> draw blanks but that's due to a weakness of the current built in GH search, not because there aren't any in those repos, eg this.
I was curious and tried this:
my $str = "aaa";
say $str ~~ / <(...)>* /;
It infinitely loops. The * is acting on just the )>. This corroborates the sense that capture markers are treated as atoms.
The regex ~ operator works for [...] and some other grouped atom constructions. Parsing any of them has a start and end within a regex pattern.
The capture markers are different in that they aren't necessarily paired -- the start or end can be implicit.
Perhaps this makes treating them as we might wish unreasonably difficult for Raku given that start (/ or {) and end (/ or }) occur at a slang boundary and Raku is a single-pass parsing braid?
I think that a doc fix is probably the appropriate response to this capture marker aspect of your SO.
If regex ~ were the only regex construct that cared that left and right capture markers are each an individual atom then perhaps the best place to mention this wrinkle would be in the regex ~ section.
But given that multiple regex constructs care (quantifiers do per the above infinite loop example), then perhaps the best place is the capture markers section.
Or perhaps it would be best if it's mentioned in both. (Though that's a slippery slope...)
What could be done about :r (.*?) "not working"?
I think a doc update is the likely best thing to do. But imo whoever thinks of filing an issue about one, or preparing a PR, would be well advised to make use of the following.
Is it known to be intended behavior or a bug?
Searches of GH repos for ratchet frugal:
raku/old-design-docs
raku/roast
raku/old-issue-tracker and rakudo/rakudo
raku/docs
The terms "ratchet" and "frugal" both come from the old design docs and are still used in the latest doc and don't seem to have aliases. So searches for them should hopefully match all relevant mentions.
The above searches are for both words. Searching for one at a time may reveal important relevant mentions that happen to not mention the other.
At the time of writing, all GH searches for .*? or similar draw blanks but that's due to a weakness of the current built in GH search, not because there aren't any in those repos.
Perhaps the issue here is broader than the combination of ratchet, frugal, and capture?
Perhaps file an issue using the words "ratchet", "frugal" and "capture"?

RegEx - is recursive substitution possible using only a RegEx engine? Conditional search replace

I'm editing some data, and my end goal is to conditionally substitute , (comma) chars with .(dot). I have a crude solution working now, so this question is strictly for suggestions on better methods in practice, and determining what is possible with a regex engine outside of an enhanced programming environment.
I gave it a good college try, but 6 hours is enough mental grind for a Saturday, and I'm throwing in the towel. :)
I've been through about 40 SO posts on regex recursion, substitution, etc, the wiki.org on the definitions and history of regex and regular language, and a few other tutorial sites. The majority is centered around Python and PHP.
The working, crude regex (facilitating loops / search and replace by hand):
(^.*)(?<=\()(.*?)(,)(.*)(?=\))(.*$)
A snip of the input:
room_ass=01:macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*,3,5,7,),
room_ass=01:macro_id=02: name=Right, pgm_audio=1, usb=1, list=(2*,4,6,8,),
room_ass=01:macro_id=03: name=All, pgm_audio=1, list=(1,2*,3,4,5,6,7,8,),
And the desired output:
room_ass=01: macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*.3.5.7.),
room_ass=01: macro_id=02: name=Right, pgm_audio=1, usb=1, list=(2*.4.6.8.),
room_ass=01: macro_id=03: name=All, pgm_audio=1, list=(1.2*.3.4.5.6.7.8.),
That's all. Just replace the , with ., but only inside ( ).
This is one conceptual (not working) method I'd like to see, where the middle group<3> would loop recursively:
(^.*)(?<=\()([^,]*)([,|\d|\*]\3.*)(?=\))(.*$)
( ^ )
..where each recursive iteration would shift across the data, either 1 char or 1 comma at a time:
room_ass=01:macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*,3,5,7,),
iter 1-| ^ |
2-| ^ |
3-| ^ |
4-| ^|
or
A much simpler approach would be to just tell it to mask/select all , between the (), but I struck out on figuring that one out.
I use text editors a lot for little data editing tasks like this, so I'd like to verify that SublimeText can't do it before I dig into Python.
All suggestions and criticisms welcome. Be gentle. <--#n00b
Thanks in advance!
-B

Not much magic needed. Just check, if there's a closing ) ahead, without any ( in between.
,(?=[^)(]*\))
See this demo at regex101
However, it does not check for an opening (. It's a common approach and probably a duplicate.
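To try the lookahead outside an editor, here is a small Python sketch using re.sub on one of the sample lines. The function name is hypothetical. The last call illustrates the caveat above: a comma followed by a stray ) is replaced even though no ( precedes it.

```python
import re

# A comma qualifies when a ")" lies ahead with no "(" or ")" in between.
COMMA_BEFORE_CLOSE = re.compile(r",(?=[^)(]*\))")


def dots_inside_parens(line: str) -> str:
    """Replace each qualifying comma with a dot."""
    return COMMA_BEFORE_CLOSE.sub(".", line)


line = "room_ass=01:macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*,3,5,7,),"
print(dots_inside_parens(line))
print(dots_inside_parens("a,b) c"))  # no opening "(", still replaced
```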

This is a complete guess because I don't use SublimeText; the assumption here is that SublimeText uses PCRE regular expressions.
Note that you mention "recursive". I don't believe you mean regular expression recursion, which doesn't fit the problem here.
Something like this might work...
You'll need to test to make sure this isn't matching other things in your document and to see if SublimeText even supports this...
This is based on using the \K operator to "keep" what comes before it out of the match. You can find other uses of it as a PCRE alternative (workaround) to variable-length look-behinds, which PCRE does not support.
Regular Expression
\((?:(?:[^,\)]+),)*?(?:[^,\)]+)\K,
Regex Description
Match the opening parenthesis character \(
Match the regular expression below (?:(?:[^,\)]+),)*?
  Between zero and unlimited times, as few times as possible, expanding as needed (lazy) *?
  Match the regular expression below (?:[^,\)]+)
    Match any single character NOT present in the list below [^,\)]+
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
      The literal character “,” ,
      The closing parenthesis character \)
  Match the character “,” literally ,
Match the regular expression below (?:[^,\)]+)
  Match any single character NOT present in the list below [^,\)]+
    Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
    The literal character “,” ,
    The closing parenthesis character \)
Keep the text matched so far out of the overall regex match \K
Match the character “,” literally ,
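As far as I can tell, each match of the pattern above has to begin at a (, so a single replace-all pass would reach only the first comma of each group and may need repeating. Where \K is unavailable (Python's built-in re module does not support it), a different technique can be sketched instead: match each parenthesised group as a whole and rewrite its commas in a callback. The names below are hypothetical.

```python
import re

# One parenthesised group with no nested parentheses inside it.
PAREN_GROUP = re.compile(r"\(([^()]*)\)")


def dots_in_groups(line: str) -> str:
    """Replace commas with dots, but only inside ( ... ) groups."""
    return PAREN_GROUP.sub(
        lambda m: "(" + m.group(1).replace(",", ".") + ")",
        line,
    )


line = "room_ass=01:macro_id=01: name=Left, pgm_audio=0, usb=0, list=(1*,3,5,7,),"
print(dots_in_groups(line))
```

Unlike a lookahead-only pattern, this version requires both the opening and the closing parenthesis, so a stray ) does not cause commas before it to be rewritten.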

Regular Expressions: Non-Greedy with Stack?

I have to do a lot of regex work within LaTeX and HTML files, and I often find myself in the following situation:
I want something like \mbox{\sqrt{2}} + \sqrt{4} to be stripped to \sqrt{2} + \sqrt{4}.
In words: "replace every occurrence of \mbox{...} by its content".
So, how do I do that?
The greedy version \mbox{(.*)} gets me \sqrt{2}} + \sqrt{4 in $1 and the
non-greedy version \mbox{(.*?)} gets me \sqrt{2 in $1.
Neither is what I want.
What I need is for the regex engine to somehow keep a stack of the characters that sit before and after (.*), namely { and }. When a new { is encountered in .*, it should be pushed onto the stack. When a } is encountered, the last { should be popped off the stack. When the stack is empty, .* is done.
Similar cases occur with nested HTML Tags.
So, since most regex engines create an FSA for each regex, a stack should be feasible, or am I missing something? Is there some rare modifier that I'm not aware of? I am wondering why there is no solution for this.
Of course I could code something myself in Java, Python, Perl or whatever, but I'd like to have it integrated in regex :)
Regards, Gilbert
(ps: I omitted to project + \sqrt{4} to keep the example small, \ should be escaped too)

It depends on your regex engine but it is possible with the .Net regex engine as follows...
\\mbox{(
    (?>
        [^{}]+
      | { (?<number>)
      | } (?<-number>)
    )*
    (?(number)(?!))
)
}
Assuming you are using RegexOptions.IgnorePatternWhitespace,
you can then do regex.Replace(sourceText, "$1") to perform the conversion you wished.

Here's another regex that works in perl http://codepad.org/fcVz9Bky :
s/
\\mbox{
(
(?:
[^{}]+ #either match any number of non-braces
| #or
\{[^{}]+} #braces surrounding non-braces
)*
)
}
/$1/x;
Note: It only works for one level of nesting

Another trick you may be able to use is a recursive regex (which should be supported by PCRE and a few other flavors):
\\mbox(\{([^{}]|(?1)+)*+\})
Not too much to explain, if you're in the right state of mind.
Here's a similar one, but a little more flexible (for example, easier to add [] and (), or other balanced constructs):
\\mbox\{([^{}]|\{(?1)*\})*\}
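In engines without recursion or balancing groups (Python's built-in re module is one), the stack the question describes can be kept by hand. The following is a rough sketch, not a LaTeX parser: strip_macro is a hypothetical helper, a depth counter stands in for the stack because only one kind of bracket is tracked, and escaped braces such as \{ are assumed not to occur.

```python
def strip_macro(text: str, macro: str = "\\mbox") -> str:
    """Replace every macro{...} with its content, honouring nested braces."""
    opener = macro + "{"
    out = []
    i = 0
    while i < len(text):
        if text.startswith(opener, i):
            depth = 1                    # the brace that opened the macro
            j = i + len(opener)
            while j < len(text) and depth:
                if text[j] == "{":
                    depth += 1           # push
                elif text[j] == "}":
                    depth -= 1           # pop
                j += 1
            if depth == 0:
                # text[j - 1] is the matching "}"; recurse so that a macro
                # nested inside the content is stripped as well.
                out.append(strip_macro(text[i + len(opener):j - 1], macro))
                i = j
                continue
        # Not a macro start, or its braces never balance: copy one character.
        out.append(text[i])
        i += 1
    return "".join(out)


print(strip_macro(r"\mbox{\sqrt{2}} + \sqrt{4}"))
```

Input whose braces never balance is copied through unchanged rather than raising an error, which seemed the least surprising choice for a search-and-replace helper.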

How to edit "Full Windows Folder Path Regular Expression"

Hi, this regular expression works fine for a full Windows folder path:
^([A-Za-z]:|\\{2}([-\w]+|((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))\\(([^"*/:?|<>\\,;[\]+=.\x00-\x20]|\.[.\x20]*[^"*/:?|<>\\,;[\]+=.\x00-\x20])([^"*/:?|<>\\,;[\]+=\x00-\x1F]*[^"*/:?|<>\\,;[\]+=\x00-\x20])?))\\([^"*/:?|<>\\.\x00-\x20]([^"*/:?|<>\\\x00-\x1F]*[^"*/:?|<>\\.\x00-\x20])?\\)*$
Matches
d:\, \\Dpk\T c\, E:\reference\h101\, \\be\projects$\Wield\Rff\, \\70.60.44.88\T d\SPC2\
Non-Matches
j:ohn\, \\Dpk\, G:\GD, \\cae\.. ..\, \\70.60.44\T d\SPC2\
Problem:
This expression requires a "\" at the end of the path.
How can I edit this expression so the user can enter paths like
C:\Folder1, C:\Folder 1\Sub Folder

There are two ways to approach this problem:
Understand the regex (way harder than necessary) and fix it to your specification (may be buggy)
Who cares how the regex does its thing (it seems to do what you need) and modify your input to conform to what you think the regex does
The second approach means that you just check whether the input string ends with \. If it doesn't, add one, then let the regex do its magic.
I normally wouldn't recommend this ignorant alternative, but this may be an exception.
Blackboxing
Here's how I'm "solving" this problem:
There's a magic box, who knows how it works but it does 99% of the time
We want it to work 100% of the time
It's simpler to fix the 1% so it works with the magic box rather than fixing the magic box itself (because this would require understanding of how the magic box works)
Then just fix the 1% manually and leave the magic box alone
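The blackbox steps above can be sketched in Python, assuming the pattern behaves the same under Python's re module as in the asker's engine. The pattern is copied unchanged from the question (only split across several string literals); is_folder_path is a hypothetical wrapper that adds the missing trailing \ before matching.

```python
import re

# The question's pattern, verbatim, split over adjacent raw string literals.
FOLDER_PATH = re.compile(
    r'^([A-Za-z]:|\\{2}([-\w]+|((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}'
    r'(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))'
    r'\\(([^"*/:?|<>\\,;[\]+=.\x00-\x20]|\.[.\x20]*[^"*/:?|<>\\,;[\]+=.\x00-\x20])'
    r'([^"*/:?|<>\\,;[\]+=\x00-\x1F]*[^"*/:?|<>\\,;[\]+=\x00-\x20])?))'
    r'\\([^"*/:?|<>\\.\x00-\x20]([^"*/:?|<>\\\x00-\x1F]*[^"*/:?|<>\\.\x00-\x20])?\\)*$'
)


def is_folder_path(path: str) -> bool:
    """Validate a folder path whether or not it ends with a backslash."""
    # Blackbox step: supply the trailing backslash the pattern insists on.
    if not path.endswith("\\"):
        path += "\\"
    return FOLDER_PATH.match(path) is not None


for candidate in ["C:\\Folder1", "C:\\Folder 1\\Sub Folder", "d:\\", "j:ohn\\"]:
    print(candidate, is_folder_path(candidate))
```

One side effect worth knowing about: a bare drive such as C: is also accepted, because after normalisation it becomes C:\.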
Deciphering the black magic
That said, we can certainly try to take a look at the regex. Here's the same pattern but reformatted in free-spacing/comments mode, i.e. (?x) in e.g. Java.
^
(   [A-Za-z]:
|   \\{2} (   [-\w]+
          |   (
                 (25[0-5]
                 |2[0-4][0-9]
                 |[01]?[0-9][0-9]?
                 )\.
              ){3}
              (25[0-5]
              |2[0-4][0-9]
              |[01]?[0-9][0-9]?
              )
          )
    \\ (
          (   [^"*/:?|<>\\,;[\]+=.\x00-\x20]
          |   \.[.\x20]* [^"*/:?|<>\\,;[\]+=.\x00-\x20]
          )
          (   [^"*/:?|<>\\,;[\]+=\x00-\x1F]*
              [^"*/:?|<>\\,;[\]+=\x00-\x20]
          )?
       )
)
\\ (
      [^"*/:?|<>\\.\x00-\x20]
      (
         [^"*/:?|<>\\\x00-\x1F]*
         [^"*/:?|<>\\.\x00-\x20]
      )?
      \\
   )*
$
The main skeleton of the pattern is as follows:
^
(head)
\\ (
      bodypart
      \\
   )*
$
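Based on this higher-level view, it looks like an optional trailing \ can be supported by adding ? to the two \\ following the (head) part:
^
(head)
\\?(
      bodypart
      \\?
   )*
$
Note that this also makes the separator right after (head) optional, so input such as j:ohn\ (a non-match for the original pattern) is now accepted.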
References
regular-expressions.info/Question Mark for Optional
Note on catastrophic backtracking
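You should generally be very wary of nesting repetition modifiers (a ? inside a * in this case). In the original pattern every bodypart is terminated by a mandatory \, which bodypart itself cannot match, so there is only one way to split the input. Once the separators are optional, adjacent bodyparts are no longer delimited and a long segment can be split in many ways, so an input that ultimately fails to match can trigger heavy backtracking; test the modified pattern against such inputs before relying on it.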
References
regular-expressions.info/Catastrophic Backtracking

I don't understand your regular expression at all. But I bet all you need to do is find the bit or bits that match the trailing "\", and add a single question mark after that bit or those bits.

The regex you provided seems to mismatch "C:\?tmp" which is an invalid windows path.
I have figured out one solution but works in windows only. You may have a try with this one:
"^[A-Za-z]:(?:\\\\(?![\"*/:?|<>\\\\,;[\\]+=.\\x00-\\x20])[^\"*/:?|<>\\\\[\\]]+){0,}(?:\\\\)?$"
This regex ignores the last "\" which hinders you.
I've tested with pcre.lib(5.5) in VS2005.
Hope it helps!

I know this question is roughly 4 years old, but the following may be sufficient:
string validWindowsOrUncPath = @"^(?:(?:[a-z]:)|(?:\\\\[^\\*\?\:;\0]*))(?:\\[^\\*\?\:;\0]*)+$";
(to be used with IgnoreCase option).
Edit:
I even came to this one, which can extract the root and each part in named groups:
string validWindowsOrUncPath = @"^(?<Root>(?:[a-z]:)|(?:\\\\[^\\*\?\:;\0]*))(?:\\(?<Part>[^\\*\?\:;\0]*))+$";
