RegEx repeating subcapture returns only last match - regex

I have a regexp that tries to find multiple sub captures within matches:
https://regex101.com/r/nk1Q5J/2/
=\s?.*(?:((?i)Fields\("(\w+)"\)\.Value\)*?))
I've had simpler versions with equivalent results but this is the last iteration.
The trick here is that the first group looks for a sequence that begins with '=' (to identify database field reads in VB.Net).
The single sub capture cases work:
Match 1. [comparison + single parameter read call]
= False And IsNull(Fields("lmpRoundedEndTime").Value)
=> G2: lmpRoundedEndTime
Match 3. [read into string] oRs.Open "select lmpEmployeeID,lmpShiftID,lmpTimecardDate,lmpProjectID from Timecards where lmpTimecardID = " + App.Convert.NumberToSql(Fields("lmlTimecardID").Value),Connection,adOpenStatic,adLockReadOnly,adCmdText
=> G2: lmlTimecardID
Match 4. [assignment] Fields("lmlEmployeeID").Value = oRs.Fields("lmpEmployeeID").Value
Where I am failing is a match with multiple sub-captures. My regexp returns only the last (intended) sub-capture:
Match 2. [read multiple input parameters] Fields("lmpPayrollHours").Value = App.Ax("TimecardFunctions").CalculateHours(Fields("lmpShiftID").Value,Fields("lmpRoundedStartTime").Value,Fields("lmpRoundedEndTime").Value)
=> G2: lmpRoundedEndTime
'''''''''''' ^ must capture: lmpShiftID , lmpRoundedStartTime , lmpRoundedEndTime
I've read up on lazy quantifiers etc, but can't wrap my head around where this goes wrong.
References:
https://www.regular-expressions.info/refrepeat.html
http://www.rexegg.com/regex-quantifiers.html
Related:
Regular expression: repeating groups only getting last group
What's the difference between "groups" and "captures" in .NET regular expressions?
BTW, I could quantify the sub capture as {1,5} safely for efficiency, but that's not the focus.
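To illustrate the behavior described above: in most regex engines a quantified capture group keeps only the value of its last iteration, so a common workaround is to write the pattern for a single occurrence and collect every match. A minimal sketch in Python (not the VBA setup used in the question; the sample line is taken from Match 2, the simplified patterns and variable names are made up for illustration):

```python
import re

line = ('Fields("lmpPayrollHours").Value = App.Ax("TimecardFunctions").CalculateHours('
        'Fields("lmpShiftID").Value,Fields("lmpRoundedStartTime").Value,'
        'Fields("lmpRoundedEndTime").Value)')

# A repeated capture group is overwritten on every iteration,
# so only the value from the last iteration is reported.
repeated = re.search(r'(?:Fields\("(\w+)"\)\.Value,?)+\)', line)
last_only = repeated.group(1)

# Matching one occurrence at a time and collecting all matches keeps every name.
# Splitting on the first "=" is a simplification to look only at the right-hand side.
right_side = line.split("=", 1)[1]
all_fields = re.findall(r'Fields\("(\w+)"\)\.Value', right_side)

print(last_only)
print(all_fields)
```

The same idea carries over to VBScript.RegExp by setting Global = True and iterating over the matches returned by Execute.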
EDIT:
By using negative lookahead to exclude the left side of comparisons, this got me much closer (match 2 above now works):
(?:Fields\("(\w+)"\)\.Value)(?!\)?\s{0,2}[=|<])
but in the following block of code, only the first two are captured:
If oRs.EOF = False Then
If CInt(Fields("lmlTimecardType").Value) = 1 Then
If Trim(oRs.Fields("lmeDirectExpenseID").Value) <> "" Then
Fields("lmlExpenseID").Value = oRs.Fields("lmeDirectExpenseID").Value
End If
Else
If Trim(oRs.Fields("lmeIndirectExpenseID").Value) <> "" Then
Fields("lmlExpenseID").Value = oRs.Fields("lmeIndirectExpenseID").Value
End If
End If
If CInt(Fields("lmlTimecardType").Value) = 2 Then
If Trim(oRs.FIelds("lmeDefaultWorkCenterID").Value) <> "" Then
Fields("lmlWorkCenterID").Value = oRs.FIelds("lmeDefaultWorkCenterID").Value
End If
End If
End If
Capture1:
Fields("lmlExpenseID").Value = oRs.Fields("lmeDirectExpenseID").Value
Capture2:
Fields("lmlExpenseID").Value = oRs.Fields("lmeIndirectExpenseID").Value
Capture3 (failed):
Fields("lmlWorkCenterID").Value = oRs.FIelds("lmeDefaultWorkCenterID").Value

Actually, (?:Fields\("(\w+)"\)\.Value)(?!\)?\s{0,2}[=|<]) does work in my Excel sheet, just not in the regex101 test. Probably a slightly different standard used there.
https://regex101.com/r/p6zFqy/1/
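One possible explanation for the difference, offered as a guess rather than something the poster confirmed: the last assignment spells the identifier FIelds with a capital I, and the pattern is case-sensitive unless the engine is told otherwise. If the Excel code sets IgnoreCase = True on the RegExp object (an assumption), it would match where a regex101 test without the i flag does not. A small Python sketch of that effect, using the last part of the block above:

```python
import re

# Last part of the code block from the question, copied with its original casing.
code = '''If CInt(Fields("lmlTimecardType").Value) = 2 Then
If Trim(oRs.FIelds("lmeDefaultWorkCenterID").Value) <> "" Then
Fields("lmlWorkCenterID").Value = oRs.FIelds("lmeDefaultWorkCenterID").Value
End If
End If'''

pattern = r'(?:Fields\("(\w+)"\)\.Value)(?!\)?\s{0,2}[=|<])'

# Case-sensitive: "FIelds" does not match "Fields", so nothing is found.
case_sensitive = re.findall(pattern, code)

# Case-insensitive (assumed to mirror IgnoreCase = True in VBScript.RegExp).
case_insensitive = re.findall(pattern, code, flags=re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```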

Related

Regular expression to match page number groups

I need a regular expression to match page numbers as found in common programs.
These usually take the form 1-5,3,5,1-9 for example.
I have a regular expression (\d+-\d+)?,(\d+-\d+?)* which I need help to refine.
As can be seen on regex101, I am matching commas and missing numbers entirely.
What I need is to match 1-5 as group 1, 3 as group 2, 5 as group 3 and 1-9 as group 4 without matching any commas.
Any help is appreciated. I will be using this in VBA.
This worked for me - am I missing something?
Sub Pages()
Dim re As Object, allMatches, m, rv, sep, c As Range, i As Long
Set re = CreateObject("VBScript.RegExp")
re.Pattern = "(\d+(-\d+)?)"
re.ignorecase = True
re.MultiLine = True
re.Global = True
For Each c In Range("B5:B20").Cells 'for example
c.Offset(0, 1).Resize(1, 10).ClearContents 'clear output cells
i = 0
If re.test(c.Value) Then
Set allMatches = re.Execute(c.Value)
For Each m In allMatches
i = i + 1
c.Offset(0, i).Value = m
Next m
End If
Next c
End Sub
If I recall correctly, capturing a dynamic number of groups will not work. You can pre-specify the format / number of groups to be matched, or you can catch the repeated groups as one and split them afterwards.
If you know the format, just do
(\d+(?:-\d+)?)(?:,)(\d+(?:-\d+)?)(?:,)(\d+(?:-\d+)?)(?:,)(\d+(?:-\d+)?)
which of course is not very neat.
If you want the flexible structure, match the first group and all the rest as a second and then split the latter by the delimiter ',' in whichever language.
(\d+(?:-\d+)?)((?:(?:,)(\d+(?:-\d+)?))*)
You need to make the -\d+ part optional, since you don't always have ranges. And the comma between each range should be part of the second group with the * quantifier, so you can match a single range with no comma after it.
\d+(-\d+)?(,\d+(-\d+)?)*
This will match the string that contains all the ranges. To get an array of individual ranges without the commas, do a second match in this string:
\d+(-\d+)?
Use the VBA function for getting an array of all matches of a regexp (sorry, I don't know VBA, so can't provide the specific syntax).
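The two-step approach described above (match the whole list, then run a second match for the individual ranges) can be sketched as follows. This is in Python rather than VBA, and the function name split_page_ranges is made up for illustration:

```python
import re

def split_page_ranges(spec):
    """Return the individual page ranges in spec, or [] if spec is malformed."""
    # Step 1: validate the whole list (single pages or ranges separated by commas).
    if not re.fullmatch(r'\d+(-\d+)?(,\d+(-\d+)?)*', spec):
        return []
    # Step 2: pull out each range. The group is non-capturing so that
    # findall returns the full text of every match.
    return re.findall(r'\d+(?:-\d+)?', spec)

pages = split_page_ranges("1-5,3,5,1-9")
print(pages)
```

In VBA the second step corresponds to a RegExp with Global = True and a loop over the collection returned by Execute, as in the Pages sub earlier in this thread.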

Regex - capture multiple groups and combine them multiple times in one string

I need to combine some text using regex, but I'm having a bit of trouble when trying to capture and substitute my string. For example - I need to capture digits from the start, and add them in a substitution to every section closed between ||
I have:
||10||a||ab||abc||
I want:
||10||a10||ab10||abc10||
So I need '10' in capture group 1 and 'a|ab|abc' in capture group 2
I've tried something like this, but it doesn't work for me (captures only one [a-z] group)
(?=.*\|\|(\d+)\|\|)(?=.*\b([a-z]+\b))
I would achieve this without a complex regular expression. For example, you could do this:
input = "||10||a||ab||abc||"
parts = input.scan(/\w+/) # => ["10", "a", "ab", "abc"]
parts[1..-1].each { |part| part << parts[0] } # => ["a10", "ab10", "abc10"]
"||#{parts.join('||')}||"
str = "||10||a||ab||abc||"
first = nil
str.gsub(/(?<=\|\|)[^\|]+/) { |s| first.nil? ? (first = s) : s + first }
#=> "||10||a10||ab10||abc10||"
The regular expression reads, "match one or more characters other than a pipe, immediately following two pipes" ((?<=\|\|) being a positive lookbehind).
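For comparison, the same lookbehind technique can be written in Python with a replacement function. This is only a sketch; the helper name append_first is hypothetical:

```python
import re

def append_first(text):
    """Append the first ||-delimited section to every later section."""
    first = None

    def repl(match):
        nonlocal first
        if first is None:
            # The first section is remembered and left unchanged.
            first = match.group(0)
            return first
        return match.group(0) + first

    # One or more non-pipe characters immediately preceded by two pipes.
    return re.sub(r'(?<=\|\|)[^|]+', repl, text)

result = append_first("||10||a||ab||abc||")
print(result)
```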

Overlapping matches in Regex - Scala

I'm trying to extract all possible combinations of 3 letters from a String following the pattern XYX.
val text = "abaca dedfd ghgig"
val p = """([a-z])(?!\1)[a-z]\1""".r
p.findAllIn(text).toArray
When I run the script I get:
aba, ded, ghg
And it should be:
aba, aca, ded, dfd, ghg, gig
It does not detect overlapped combinations.
The way to do this is to enclose the whole pattern in a lookahead, so that the match itself consumes no characters and the pattern is tried at every position:
val p = """(?=(([a-z])(?!\2)[a-z]\2))""".r
p.findAllIn(text).matchData foreach {
m => println(m.group(1))
}
The lookahead is only an assertion (a test) for the current position and the pattern inside doesn't consume characters. The result you are looking for is in the first capture group (that is needed to get the result since the whole match is empty).
You need to capture the whole pattern and put it inside a positive lookahead. The code in Scala will be the following:
object Main extends App {
val text = "abaca dedfd ghgig"
val p = """(?=(([a-z])(?!\2)[a-z]\2))""".r
val allMatches = p.findAllMatchIn(text).map(_.group(1))
println(allMatches.mkString(", "))
// => aba, aca, ded, dfd, ghg, gig
}
Note that the backreference will turn to \2 as the group to check will have ID = 2 and Group 1 will contain the value you need to collect.
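The lookahead trick is not specific to Scala. As a hedged illustration, the same pattern behaves the same way with Python's re module (variable names are arbitrary):

```python
import re

text = "abaca dedfd ghgig"

# The whole pattern sits inside a lookahead, so each match is empty and the
# engine retries at the next position; group 1 holds the XYX triple.
pattern = re.compile(r'(?=(([a-z])(?!\2)[a-z]\2))')
triples = [m.group(1) for m in pattern.finditer(text)]

print(", ".join(triples))
```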

Optional replacement in a Regular Expression

I am creating a regular expression in VBA, which uses the JS flavor of RegEx. Here is the issue I have run into:
Current RegEx:
(^6)(?:a|ab)?
I have a 6 followed by either nothing, an 'a' or 'ab'.
In the case of a 6 followed by nothing I want to return just the 6 using $1
In the case of a 6 followed by an 'a' or 'ab' I want to return 6B
So I need that 'B' to be optional, contingent on there being an 'a' or 'ab'.
Something to the effect of : $1B?
That of course does not work. I only want the B if the 'a' or 'ab' is present, otherwise just the $1.
Is this possible to do in a single regex pattern? I could just have 2 separate patterns, one looking for only a 6 and the other for 6'a'or'ab'... but my actual regex patterns are much more complicated and I might need several patterns to cover some of them...
Thanks for looking.
I don't think your question is clearly defined--for example, I don't know why you need a replace--but from what I can infer, something like the following may work for you:
target = "6ab"
result = ""
With New RegExp
.Pattern = "^(6)(?:a(b?))?"
Set matches = .Execute(target)
If matches.Count > 0 Then ' Execute always returns a collection, so test its Count
Set mat = matches(0)
result = mat.SubMatches(0)
If mat.SubMatches.Count > 1 Then
result = result & UCase(mat.SubMatches(1))
End If
End If
Debug.Print result
End With
You basically inspect the capture groups to determine whether or not there was a hit on the b capture. Whereas you used a|ab, I think an optional b (b?) is more to the point. It's probably stylistic more than anything.
As I mentioned in my comment, there is no way to tell a regex engine to choose between literal alternatives in the replacement string. Thus, all you can do is to access Submatches to check for values that you get there, and return appropriate values.
Note that the regex you use should have 2 capturing groups, or at least one capturing group around the part whose exact text you do not know in advance (the (ab?)).
Here is my idea in code:
Function RxCondReplace(ByVal str As String) As String
Dim objRegExp As Object, objMatches As Object, objMatch As Object
RxCondReplace = ""
Set objRegExp = CreateObject("VBScript.RegExp")
objRegExp.Pattern = "^6(ab?)?"
Set objMatches = objRegExp.Execute(str)
If objMatches.Count = 0 Then Exit Function ' no match: return ""
Set objMatch = objMatches.Item(0) ' Only 1 match as .Global=False
' The pattern has a single capturing group, so only SubMatches.Item(0) exists
If objMatch.SubMatches.Item(0) = "a" Or _
objMatch.SubMatches.Item(0) = "ab" Then ' check if 1st group equals "a" or "ab"
RxCondReplace = "6B"
ElseIf objMatch.SubMatches.Item(0) = "" Then ' check if 1st group is empty
RxCondReplace = "6"
End If
End Function
' Calling the function above
Sub CallConditionalReplace()
Debug.Print RxCondReplace("6") ' => 6
Debug.Print RxCondReplace("6a") ' => 6B
Debug.Print RxCondReplace("6ab") ' => 6B
End Sub
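In engines that accept a function as the replacement, the same "inspect the group, then decide" logic fits in a single call. A minimal Python sketch of that idea (not available in VBScript.RegExp as such; the function name cond_replace is hypothetical):

```python
import re

def cond_replace(s):
    """Replace a leading 6, 6a or 6ab: emit 6B when the optional group matched."""
    # group(1) is None when neither "a" nor "ab" follows the 6.
    return re.sub(r'^6(ab?)?', lambda m: '6B' if m.group(1) else '6', s)

for sample in ("6", "6a", "6ab"):
    print(sample, "=>", cond_replace(sample))
```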

Parsing Excel reference with regular expression?

Excel returns a reference of the form
=Sheet1!R14C1R22C71junk
("junk" won't normally be there, but I want to be sure that there's no extraneous text.)
I would like to 'split' this into a VB array, where
a(0)="Sheet1"
a(1)="14"
a(2)="1"
a(3)="22"
a(4)="71"
a(5)="junk"
I'm sure it can be done easily with a regular expression, but I just can't get the hang of it.
Is there a kind soul who could help me?
Thanks
=([^!]+)!R(\d+)C(\d+)R(\d+)C(\d+)(.*)
should work.
[^!]+ matches a sequence of non-exclamation-point characters.
\d+ matches a sequence of digits.
.* matches anything.
So, in VB.NET:
Dim a As Match
a = Regex.Match(SubjectString, "=([^!]+)!R(\d+)C(\d+)R(\d+)C(\d+)(.*)")
If a.Success Then
' matched text: a.Value
' backreference n text: a.Groups(n).Value
Else
' Match attempt failed
End If
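As a quick check of what each group captures, here is the same pattern in a small Python sketch that builds the array asked for in the question (the function name parse_reference is made up for illustration):

```python
import re

def parse_reference(ref):
    """Split an R1C1-style reference into sheet, rows, columns and trailing text."""
    m = re.match(r'=([^!]+)!R(\d+)C(\d+)R(\d+)C(\d+)(.*)', ref)
    # groups() returns the six captures in order; None signals no match.
    return list(m.groups()) if m else None

a = parse_reference("=Sheet1!R14C1R22C71junk")
print(a)
```

An empty last element means there was no extraneous text after the reference.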
A straightforward String.Split would work, provided the "junk" text wasn't there:
Dim input As String = "=Sheet1!R14C1R22C71"
Dim result = input.Split(New Char() { "="c, "!"c, "R"c, "C"c }, StringSplitOptions.RemoveEmptyEntries)
For Each item As String In result
Console.WriteLine(item)
Next
The regex gets a little tricky since you will need to go through the Groups and Captures of the nested portions to get the proper order.
EDIT: here's my regex solution. It accepts multiple occurrences of R's and C's.
Dim input As String = "=Sheet1!R14C1R22C71junk"
Dim pattern As String = "=(?<Sheet>Sheet\d+)!(?:R(?<R>\d+)C(?<C>\d+))+"
Dim m As Match = Regex.Match(input, pattern)
If m.Success Then
Console.WriteLine(m.Groups("Sheet").Value)
For i = 0 To m.Groups("R").Captures.Count - 1
Console.WriteLine(m.Groups("R").Captures(i).Value)
Console.WriteLine(m.Groups("C").Captures(i).Value)
Next
End If
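The Captures collection used in the VB.NET code above is a .NET feature. In engines that keep only the last value of a repeated group, a comparable result can be had in two steps: capture the whole repeated R/C run, then scan it for the pairs. A hedged Python sketch (the group names sheet and cells and the function name parse_rc are assumptions for this example):

```python
import re

def parse_rc(ref):
    """Return (sheet, [(row, col), ...]) for a reference with repeated RxxCxx parts."""
    # Step 1: capture the sheet name and the whole run of RxxCxx parts.
    m = re.match(r'=(?P<sheet>Sheet\d+)!(?P<cells>(?:R\d+C\d+)+)', ref)
    if not m:
        return None
    # Step 2: extract every row/column pair from the captured run.
    pairs = re.findall(r'R(\d+)C(\d+)', m.group('cells'))
    return m.group('sheet'), pairs

parsed = parse_rc("=Sheet1!R14C1R22C71junk")
print(parsed)
```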
Pattern explanation:
"=(?<Sheet>Sheet\d+)" : matches an = sign followed by "Sheet" and digits. Uses a named group called "Sheet".
"!(?:R(?<R>\d+)C(?<C>\d+))+" : matches the exclamation mark followed by at least one occurrence of the RxxCxx portion of the text. Named groups "R" and "C" are used.
"(?:...)+" : this part of the pattern above groups the inner pattern (i.e., the R/C part) without capturing it as a whole. This avoids an unnecessary capture, since the named groups already capture the values we need.
More general regexes for R1C1 style:
^=(?:(?<Sheet>[^!]+)!)?(?:R((?<RAbs>\d+)|(?<RRel>\[-?\d+\]))C((?<CAbs>\d+)|(?<CRel>\[-?\d+\]))){1,2}$
And A1 style:
^=(?:(?<Sheet>[^!]+)!)?(?:(?<Col1>\$?[a-z]+)(?<Row1>\$?\d+))(?:\:(?<Col2>\$?[a-z]+)(?<Row2>\$?\d+))?$
It doesn't match external references like =[Book1]Sheet1!A1 though.