Regular expression to match page number groups - regex

I need a regular expression to match page numbers as found in common programs.
These usually take the form 1-5,3,5,1-9 for example.
I have a regular expression (\d+-\d+)?,(\d+-\d+?)* which I need help to refine.
As can be seen on regex101, I am matching commas and missing the numbers entirely.
What I need is to match 1-5 as group 1, 3 as group 2, 5 as group 3 and 1-9 as group 4 without matching any commas.
Any help is appreciated. I will be using this in VBA.

This worked for me - am I missing something?
Sub Pages()
    Dim re As Object, allMatches, m, rv, sep, c As Range, i As Long
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "(\d+(-\d+)?)"
    re.ignorecase = True
    re.MultiLine = True
    re.Global = True
    For Each c In Range("B5:B20").Cells 'for example
        c.Offset(0, 1).Resize(1, 10).ClearContents 'clear output cells
        i = 0
        If re.test(c.Value) Then
            Set allMatches = re.Execute(c.Value)
            For Each m In allMatches
                i = i + 1
                c.Offset(0, i).Value = m
            Next m
        End If
    Next c
End Sub

If I recall correctly, capturing a dynamic number of groups will not work. You can pre-specify the format / number of groups to be matched, or you can catch the repeated groups as one and split them afterwards.
If you know the format, just do
(\d+(?:-\d+)?)(?:,)(\d+(?:-\d+)?)(?:,)(\d+(?:-\d+)?)(?:,)(\d+(?:-\d+)?)
which of course is not very neat.
If you want the flexible structure, match the first group and all the rest as a second and then split the latter by the delimiter ',' in whichever language.
(\d+(?:-\d+)?)((?:(?:,)(\d+(?:-\d+)?))*)
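A rough sketch of the "match the first item, capture the rest, then split" idea. It is written in Python only because the behaviour is easy to check there; the function name split_page_groups is made up for illustration, and the pattern itself is the one given above.

```python
import re

# Pattern from the answer above: group 1 is the first item,
# group 2 is everything after it (commas included).
PATTERN = re.compile(r"(\d+(?:-\d+)?)((?:(?:,)(\d+(?:-\d+)?))*)")

def split_page_groups(text):
    """Return the individual page items, or [] if text is not a page list."""
    m = PATTERN.fullmatch(text)
    if m is None:
        return []
    first, rest = m.group(1), m.group(2)
    # rest looks like ",3,5,1-9"; drop the leading comma before splitting.
    tail = rest[1:].split(",") if rest else []
    return [first] + tail

print(split_page_groups("1-5,3,5,1-9"))
```

The split is done in ordinary string code because a repeated group only keeps its last iteration, so the regex alone cannot hand back a variable number of items.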

You need to make the -\d+ part optional, since you don't always have ranges. And the comma between each range should be part of the second group with the * quantifier, so you can match a single range with no comma after it.
\d+(-\d+)?(,\d+(-\d+)?)*
This will match the string that contains all the ranges. To get an array of individual ranges without the commas, do a second match in this string:
\d+(-\d+)?
Use the VBA function for getting an array of all matches of a regexp (sorry, I don't know VBA, so can't provide the specific syntax).
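For what it's worth, the two-step approach can be sketched like this in Python (not VBA; the names WHOLE, ITEM and page_ranges are illustrative only). The inner group of the second pattern is made non-capturing so that findall returns the whole match for each range.

```python
import re

# Step 1: the whole string must be a comma-separated list of ranges.
WHOLE = re.compile(r"\d+(-\d+)?(,\d+(-\d+)?)*")
# Step 2: pick out each individual number or range.
ITEM = re.compile(r"\d+(?:-\d+)?")

def page_ranges(text):
    """Return every range in text, or [] if text is not a valid list."""
    if WHOLE.fullmatch(text) is None:
        return []
    return ITEM.findall(text)

print(page_ranges("1-5,3,5,1-9"))
```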

Related

Access multiple captures of one capture group in substitution string

Suppose I have the regex (\d)+.
In .Net I can access all captures of this capture group using the match.Groups[1].Captures.
Can I also access these captures in a substitution string?
So for example for the input string 523, I need to use 5, 2 and 3 in my substitution string (and not just 3, which is $1).
If you intend to capture each digit in its own capturing group, then you need to actually write a separate capturing group for every digit, like this:
(\d)(\d)(\d)
NOTE: This does not scale very well, and you could not match numbers of any length other than 3 digits. In other words, no match on either 23 or 345667!
A good page with a long and detailed explanation of why this can't be done as (\d)+ can be found here:
https://www.regular-expressions.info/captureall.html
So if this is indeed what you want then you need to craft your own loop that searches the string for every digit separately.
If, on the other hand, you need to capture the whole number and not the individual digits, then you simply put the + sign in the wrong position. I think you should write:
(\d+)
I think the OP wants to get every single digit match separately.
Perhaps this will help you then:
' Needs, at the top of the file:
'   Imports System.Collections.Specialized
'   Imports System.Text.RegularExpressions
' Create a list to put the resulting matches in
Dim ResultList As StringCollection = New StringCollection()
Dim RegexObj As New Regex("(\d)")
Dim MatchResult As Match = RegexObj.Match(strName)
While MatchResult.Success
    ResultList.Add(MatchResult.Groups(1).Value)
    ' Console.WriteLine(MatchResult.Groups(1).Value)
    MatchResult = MatchResult.NextMatch()
End While
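The same point can be checked quickly in Python (used here purely as an illustration; joining the digits with "-" is a made-up stand-in for whatever the substitution should produce): a repeated group keeps only its last iteration, while matching one digit at a time yields all of them.

```python
import re

text = "523"

# A repeated capturing group keeps only its last iteration.
last_only = re.fullmatch(r"(\d)+", text).group(1)

# Matching one digit per match yields every digit.
digits = re.findall(r"\d", text)

# Hypothetical use: build the replacement text by hand from all digits.
replacement = "-".join(digits)

print(last_only, digits, replacement)
```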

How to split a string in VBA to array by Split function delimited by Regular Expression

I am writing an Excel Add In to read a text file, extract values and write them to an Excel file. I need to split a line, delimited by one or more white spaces and store it in the form of array, from which I want to extract desired values.
I am trying to implement something like this:
arrStr = Split(line, "/^\s*/")
But the editor is throwing an error while compiling.
How can I do what I want?
If you are looking for the Regular Expressions route, then you could do something like this:
Dim line As String, arrStr, i As Long
line = "This is a test"
With New RegExp
    .Pattern = "\S+"
    .Global = True
    If .test(line) Then
        With .Execute(line)
            ReDim arrStr(.Count - 1)
            For i = 0 To .Count - 1
                arrStr(i) = .Item(i)
            Next
        End With
    End If
End With
IMPORTANT: You will need to create a reference to:
Microsoft VBScript Regular Expressions 5.5 in Tools > References
Otherwise, you can see Late Binding below
Your original pattern \^S*\$ had some issues:
S* was actually matching a literal uppercase S, not the whitespace character you were looking for - because it was not escaped.
Even if it was escaped, you would have matched every string that you used because of your quantifier: * means to match zero or more of \S. You were probably looking for the + quantifier (one or more of).
You were good for making it greedy (not using *?) since you were wanting to consume as much as possible.
The pattern I used, \S+, matches every run of one or more (+) characters that are NOT white space (\S).
I also used the .Global so you will continue matching after the first match.
Once you have captured all your words, you can then loop through the match collection and place them into an array.
Late Binding:
Dim line As String, arrStr, i As Long
line = "This is a test"
With CreateObject("VBScript.RegExp")
    .Pattern = "\S+"
    .Global = True
    If .test(line) Then
        With .Execute(line)
            ReDim arrStr(.Count - 1)
            For i = 0 To .Count - 1
                arrStr(i) = .Item(i)
            Next
        End With
    End If
End With
Miscellaneous Notes
I would have advised just to use Split(), but you stated that there were cases where more than one consecutive space may have been an issue. If this wasn't the case, you wouldn't need regex at all, something like:
arrStr = Split(line)
would have split on every occurrence of a single space.
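To illustrate why consecutive spaces matter, here is a small Python comparison (an illustration only, not VBA; the sample line is made up): collecting \S+ matches skips runs of whitespace, whereas splitting on a single space leaves empty strings behind.

```python
import re

line = "This  is   a test"  # note the repeated spaces

# Collect every run of non-whitespace characters.
tokens = re.findall(r"\S+", line)

# Splitting on a single space yields empty entries between repeated spaces.
naive = line.split(" ")

print(tokens)
print(naive)
```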

Excel VBA Regex Check For Repeated Strings

I have some user input that I want to validate for correctness. The user should input 1 or more sets of characters, separated by commas.
So these are valid input
COM1
COM1,COM2,1234
these are invalid
COM -- only 3 characters
COM1,123 -- one set is only 3 characters
COM1.1234,abcd -- a dot separator not comma
I googled for a regex pattern for this and found a possible pattern that tested for a recurring instance of any 3 characters, and I modified it like so
/^(.{4,}).*\1$/
but this is not finding matches.
I can manage the last comma that may or may not be there before passing to the test so that it is always there.
Preferably, I would like to test for letters (any case) and numbers only, but I can live with any characters.
I know I could easily do this in straight VBA by splitting the input on a comma delimiter and looping through each character of each array element, but regex seems more efficient, and I will have more cases that have slightly different patterns, so parameterising the regex for that would be better design.
TIA
I believe this does what you want:
^([A-Z|a-z|0-9]{4},)*[A-Z|a-z|0-9]{4}$
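It's a line beginning followed by zero or more groups of four letters or numbers, each ending with a comma, followed by one group of four letters or numbers, followed by an end-of-line. Note that inside a character class | is a literal pipe rather than alternation, so this class also accepts |; [A-Za-z0-9] is the stricter form.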
You can play around with it here: https://regex101.com/r/Hdv65h/1
The regular expression
"^[\w]{4}(,[\w]{4})*$"
should work.
You can try this to see whether it works for all your cases using the following function. Assuming your test strings are in cells A1 thru A5 on the spreadsheet:
Sub findPattern()
    ' Requires a reference to Microsoft VBScript Regular Expressions 5.5
    Dim regEx As New RegExp
    Dim mat As MatchCollection
    Dim i As Integer
    Dim cellText As String
    regEx.Global = True
    regEx.IgnoreCase = True
    regEx.Pattern = "^[\w]{4}(,[\w]{4})*$"
    For i = 1 To 5
        cellText = Trim(Cells(i, 1).Value)
        Set mat = regEx.Execute(cellText)
        If mat.Count = 0 Then
            MsgBox ("No match found for " & cellText)
        Else
            MsgBox ("Match found for " & cellText)
        End If
    Next
End Sub
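As a quick check against the sample inputs from the question, here is a Python sketch (illustrative only; is_valid is a made-up name) using the stricter letters-and-digits class, since \w would also accept an underscore.

```python
import re

# Four letters or digits, then zero or more ",xxxx" groups.
VALID = re.compile(r"[A-Za-z0-9]{4}(,[A-Za-z0-9]{4})*")

def is_valid(text):
    """True if text is one or more 4-character sets separated by commas."""
    return VALID.fullmatch(text) is not None

for sample in ["COM1", "COM1,COM2,1234", "COM", "COM1,123", "COM1.1234,abcd"]:
    print(sample, is_valid(sample))
```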

Regex to determine if a string is a name of a range or a cell's address

I'm struggling to come up with a regular expression pattern that can help me determine if a string is a cell's address or if it is a cell's name.
Here are some examples of cell addresses:
"E5"
"AA55:E5"
"DD5555:DDD55555, E5, F5:AA55"
"$F7:$G$7"
Here are some examples of cell names:
"bis_document_id"
"PCR1MM_YPCVolume"
"sheet_error7"
"blahE5"
"training_A1"
"myNameIsGeorgeJR"
Is there a regex pattern you guys can come up with that will match all of either group and none of the other?
I have been able to think of a couple of ways to determine what a string is not:
If it has a "$" or ":" in it, I know it is not a cell's name and is most likely a cell's address.
If it has more than three consecutive numbers, it is most likely not a cell's address.
A cell's address is extremely unlikely to have more than 2 letters preceding a number, 99.9% of the cell addresses will be in columns A to ZZ.
Alas, these three small tests can hardly prove what this string is.
Thanks for the help!
OK, this one's fun:
^\$?[A-Z]+\$?\d+(?::\$?[A-Z]+\$?\d+)?(?:,\s*(?:\$?[A-Z]+\$?\d+(?::\$?[A-Z]+\$?\d+)?))*$
Let's break it down, because it's rather nasty. The magic subpattern, really, is this:
\$?[A-Z]+\$?\d+
This little thing will match any single valid cell address, with optional absolute-reference $ signs. The next bit,
(?::\$?[A-Z]+\$?\d+)?
will match the same thing optionally (the ? quantifier at the end), but preceded by a colon (:). That lets us get ranges. The next bit,
(?:,\s*(?:\$?[A-Z]+\$?\d+(?::\$?[A-Z]+\$?\d+)?))*
matches the same thing as the first, but zero or more times (using the * quantifier), and preceded by a comma and optional spaces using the special \s token (which means "any whitespace").
Demo on Regex101
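The non-recursive pattern can be tried against the examples from the question with a short Python sketch (Python is used only for illustration; is_address is a made-up helper name).

```python
import re

# The first pattern from the answer, split over two lines for readability.
ADDRESS = re.compile(
    r"^\$?[A-Z]+\$?\d+(?::\$?[A-Z]+\$?\d+)?"
    r"(?:,\s*(?:\$?[A-Z]+\$?\d+(?::\$?[A-Z]+\$?\d+)?))*$"
)

addresses = ["E5", "AA55:E5", "DD5555:DDD55555, E5, F5:AA55", "$F7:$G$7"]
names = ["bis_document_id", "PCR1MM_YPCVolume", "sheet_error7",
         "blahE5", "training_A1", "myNameIsGeorgeJR"]

def is_address(text):
    """True if text looks like a cell address or a list of addresses."""
    return ADDRESS.match(text) is not None

for text in addresses + names:
    print(text, is_address(text))
```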
If we want to get really fancy (and, mind you, I have no idea whether Excel's regex engine supports this; I just wrote it for fun), we can use recursion to accomplish the same thing:
^((\$?[A-Z]+\$?\d+)(?::(?2))?)(?:,\s*(?1))*$
In this case, the magic \$?[A-Z]+\$?\d+ is inside the second capturing group, which is used recursively by the (?2) token. The entire subpattern for a single address or range of them is contained within the first capture group, and is then used to match additional addresses or ranges in a list.
Demo on Regex101
So here's a regex for VBA which will find any cell reference irrespective where it is.
NOTE: I've assumed you're performing this on a Formula object, so the reference does not need to be at the start or end of the string; you can have a string with both cell references and cell names, and it will only pick up the cell references, as below:
(?:\W|^)(\$?[A-Z]{1,3}\$?[0-9]{1,7}(:\$?[A-Z]{1,3}\$?[0-9]{1,7})?)(?!\w)
(?:\W|^) is at the start and ensures that there is either a non-word character or the start of the string before the reference (remove |^ if there is always a = at the start, as in Formula objects). I found out that VBA regrettably does not have a functioning negative lookbehind.
(\$?[A-Z]{1,3}\$?[0-9]{1,7}(:\$?[A-Z]{1,3}\$?[0-9]{1,7})?) finds the actual cell reference and is broken down below:
\$?[A-Z]{1,3}\$?[0-9]{1,7} matches an optional $, one to three capital letters, another optional $ and one to seven digits (the limits of Excel's current column and row ranges);
(:\$?[A-Z]{1,3}\$?[0-9]{1,7})? is the same as above except that it adds the option of a second cell reference after a colon; the ? makes it optional.
(?!\w) is a negative lookahead and says that the character after it must not be a word character (presumably in functions the only things you can have around a cell reference are parentheses and operators).
I wrote a VBA function in Excel and it returned the following with the above RegEx:
NB: Obviously it doesn't check whether the reference is actually valid: $AZO113:A4 is returned despite being impossible.
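A small Python sketch of how this pattern behaves on a formula string (illustrative only; the formula and the helper name find_references are made up). Group 1 holds the reference itself, without the leading non-word character.

```python
import re

CELL_REF = re.compile(
    r"(?:\W|^)(\$?[A-Z]{1,3}\$?[0-9]{1,7}(:\$?[A-Z]{1,3}\$?[0-9]{1,7})?)(?!\w)"
)

def find_references(formula):
    """Return every cell reference found in formula (group 1 of each match)."""
    return [m.group(1) for m in CELL_REF.finditer(formula)]

# training_A1 is a name, so it should be skipped.
print(find_references("=SUM($F$10:$F$11)+training_A1*B2"))
```

The name training_A1 is skipped because the underscore before A1 is a word character, so neither \W nor ^ can match in front of it.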
After trying several solutions I had to modify a regex so it works for me. My version only supports non-named ranges.
((?![\=,\(\);])(\w+!)|('.+'!))?((\$?[A-Z]{1,3}\$?[0-9]{1,7}(:\$?[A-Z]{1,3}\$?[0-9]{1,7})?)|(\$?[A-Z]{1,3}(:\$?[A-Z]{1,3}\$?)))
It will capture ranges in all of the following situations
=FUNCTION(F:F)
=FUNCTION($B22,G$5)
=SUM($F$10:$F$11)
=$J10-$K10
=SUMMARY!D4
I created the following function for RegEx, but first tick the reference to "Microsoft VBScript Regular Expressions 5.5" under Tools > References.
Function RegExp(ByVal sText As String, ByVal sPattern, Optional bGlobal As Boolean = True, Optional bIgnoreCase As Boolean = False, Optional bArray As Boolean = False) As Variant
    Dim objRegex As New RegExp
    Dim Matches As MatchCollection
    Dim Match As Match
    Dim i As Integer
    objRegex.IgnoreCase = bIgnoreCase
    objRegex.Global = bGlobal
    objRegex.Pattern = sPattern
    If objRegex.test(sText) Then
        Set Matches = objRegex.Execute(sText)
        If Matches.count <> 0 Then
            If bArray Then ' if we want to return array instead of MatchCollection
                ReDim aMatches(Matches.count - 1) As Variant
                For Each Match In Matches
                    aMatches(i) = Match.value
                    i = i + 1
                Next
                RegExp = aMatches
            Else
                Set RegExp = Matches
            End If
        End If
    End If
End Function

Parsing Excel reference with regular expression?

Excel returns a reference of the form
=Sheet1!R14C1R22C71junk
("junk" won't normally be there, but I want to be sure that there's no extraneous text.)
I would like to 'split' this into a VB array, where
a(0)="Sheet1"
a(1)="14"
a(2)="1"
a(3)="22"
a(4)="71"
a(5)="junk"
I'm sure it can be done easily with a regular expression, but I just can't get the hang of it.
Is there a kind soul who could help me?
Thanks
=([^!]+)!R(\d+)C(\d+)R(\d+)C(\d+)(.*)
should work.
[^!]+ matches a sequence of non-exclamation-point characters.
\d+ matches a sequence of digits.
.* matches anything.
So, in VB.NET:
Dim a As Match
a = Regex.Match(SubjectString, "=([^!]+)!R(\d+)C(\d+)R(\d+)C(\d+)(.*)")
If a.Success Then
    ' matched text: a.Value
    ' backreference n text: a.Groups(n).Value
Else
    ' Match attempt failed
End If
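For comparison, the same pattern can be exercised in Python (a sketch for illustration; split_reference is a made-up name). The six groups come back in exactly the order the question asks for.

```python
import re

REF = re.compile(r"=([^!]+)!R(\d+)C(\d+)R(\d+)C(\d+)(.*)")

def split_reference(ref):
    """Return [sheet, row1, col1, row2, col2, trailing text], or [] on no match."""
    m = REF.fullmatch(ref)
    return list(m.groups()) if m else []

print(split_reference("=Sheet1!R14C1R22C71junk"))
```

An empty last element means there was no extraneous text after the reference.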
A straightforward String.Split would work, provided the "junk" text wasn't there:
Dim input As String = "=Sheet1!R14C1R22C71"
Dim result = input.Split(New Char() { "="c, "!"c, "R"c, "C"c }, StringSplitOptions.RemoveEmptyEntries)
For Each item As String In result
    Console.WriteLine(item)
Next
The regex gets a little tricky since you will need to go through the Groups and Captures of the nested portions to get the proper order.
EDIT: here's my regex solution. It accepts multiple occurrences of R's and C's.
Dim input As String = "=Sheet1!R14C1R22C71junk"
Dim pattern As String = "=(?<Sheet>Sheet\d+)!(?:R(?<R>\d+)C(?<C>\d+))+"
Dim m As Match = Regex.Match(input, pattern)
If m.Success Then
    Console.WriteLine(m.Groups("Sheet").Value)
    For i = 0 To m.Groups("R").Captures.Count - 1
        Console.WriteLine(m.Groups("R").Captures(i).Value)
        Console.WriteLine(m.Groups("C").Captures(i).Value)
    Next
End If
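Engines without .NET's Captures collection need a different route to the repeated R/C pairs. As a hedged sketch of one alternative (Python, with the made-up name parse_r1c1): first match the whole R/C run, then scan that run for each pair. This swaps the Captures technique for a second pass and is not what the VB.NET code above does.

```python
import re

def parse_r1c1(ref):
    """Return (sheet, [(row, col), ...]) or None if ref does not match."""
    # First pass: sheet name plus the whole run of RxxCxx pairs.
    head = re.match(r"=(Sheet\d+)!((?:R\d+C\d+)+)", ref)
    if head is None:
        return None
    sheet, cells = head.group(1), head.group(2)
    # Second pass: pull each row/column pair out of the run.
    pairs = [(int(r), int(c)) for r, c in re.findall(r"R(\d+)C(\d+)", cells)]
    return sheet, pairs

print(parse_r1c1("=Sheet1!R14C1R22C71junk"))
```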
Pattern explanation:
"=(?<Sheet>Sheet\d+)" : matches an = sign followed by "Sheet" and digits. Uses a named group called "Sheet".
"!(?:R(?<R>\d+)C(?<C>\d+))+" : matches the exclamation mark followed by at least one occurrence of the RxxCxx portion of the text. Named groups "R" and "C" are used.
"(?:...)+" : this part of the previous pattern groups the R/C portion so that it can repeat, without capturing it. That avoids capturing the text a second time, since the named groups inside already capture it.
More general regexes for R1C1 style:
^=(?:(?<Sheet>[^!]+)!)?(?:R((?<RAbs>\d+)|(?<RRel>\[-?\d+\]))C((?<CAbs>\d+)|(?<CRel>\[-?\d+\]))){1,2}$
And A1 style:
^=(?:(?<Sheet>[^!]+)!)?(?:(?<Col1>\$?[a-z]+)(?<Row1>\$?\d+))(?:\:(?<Col2>\$?[a-z]+)(?<Row2>\$?\d+))?$
It doesn't match external references like =[Book1]Sheet1!A1 though.