Manipulate string to extract address - regex

I'm currently doing some work with a very large data source on city addresses where the data looks something like this.
137 is the correct address but it belongs in a building that takes up 135-138A on the street.
source:
137 9/F 135-138A KING STREET 135-138A KING STREET TOR
i've used a function which removes the duplicates shown on extendoffice.
the second column has become this:
137 9/F 135-138A KING STREET TOR
what I want to do now is
find address number and add it in front of the street name
remove the numbers that are connected to the dash - ):
9/F 137 KING STREET TOR
Would the the best way to accomplish this?
The main problem I'm having with this is there are many inconsistent spaces in address names ex. "van dyke rd".
Is there anyway I can locate in an array the "-" and set variables for the 2 numbers on either side of the dash and replace it with the correct address number located at the front
Function RemoveDupes2(txt As String, Optional delim As String = " ") As String
Dim x
With CreateObject("Scripting.Dictionary")
.CompareMode = vbTextCompare
For Each x In Split(txt, delim)
If Trim(x) <> "" And Not .exists(Trim(x)) Then .Add Trim(x), Nothing
Next
If .Count > 0 Then RemoveDupes2 = Join(.keys, delim)
End With
End Function
Thanks

Regular Expressions are a way to (amongst other things) search for a feature in a string.
It looks like the feature you are looking for is: number:maybe some spaces : dash : maybe some spaces : number
In regex notation this would be expressed as:
([0-9]*)[ ]*-[ ]*([0-9]*)
Which translates to: Find a sequential group of digits followed by zero or more spaces, then a dash, then zero or more spaces, then some more digits.
The parenthesis indicate the elements that will be returned. So you could assign variables to the be the first number or the second number.
You might need to tweak this if a dash can potentially occur elsewhere in the address.
Further information on actually implementing that is available here: How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops

This meets the case you want, it captures the address range as two separate matches (if you want to process further).
The current code simple removes this range altogether.
What logic is there to move the 9/F to front?
See regex here
Function StripString(strIn As String) As String
Dim objRegex As Object
Set objRegex = CreateObject("vbscript.regexp")
With objRegex
.Pattern = "(\d+[A-C]?)-(\d+[A-C]?)"
If .test(strIn) Then
StripString = .Replace(strIn, vbullstring)
Else
StripString = "No match"
End If
End With
End Function

I'd just:
swap 1st and 2nd substrings
erase the substring with "-" in it
Function RemoveDupes2(txt As String, Optional delim As String = " ") As String
Dim x As Variant, arr As Variant, temp As Variant
Dim iArr As Long
With CreateObject("Scripting.Dictionary")
.CompareMode = vbTextCompare
For Each x In Split(txt, delim)
If Trim(x) <> "" And Not .exists(Trim(x)) Then .Add Trim(x), Nothing
Next
If .count > 0 Then
arr = .keys
temp = arr(0)
arr(0) = arr(1)
arr(1) = temp
For iArr = LBound(arr) To UBound(arr)
If InStr(arr(iArr), "-") <> 0 Then arr(iArr) = ""
Next
RemoveDupes2 = Join(arr, delim)
End If
End With
End Function

Related

Check string has a date in it and extract part of the string

I have thousands of lines of text that I need to work through and the lines I am interested with lines that look like the following:
01/04/2019 09:35:41 - Test user (Additional Comments)
I am currently using this code to filter out all the other rows:
If InStr(FullCell(i), " - ") <> 0 And InStr(FullCell(i), ":") <> 0 And InStr(FullCell(i), "(") <> 0 Then
FullCell is the array that I am working through.
which I know is not the best way to do it. Is there a way to check that there is a date at the beginning of the string in the format dd/mm/yyyy and then extract the user name inbetween the '-' and the '(' symbol.
I had a play with regex to see if that could help but i'm limited in skills to be able to pull off both VBA and regex in the same code.
Whats the best way to do this.
Assuming Fullcell(i) contains the string,
If Left(Fullcell(i), 10) Like "##/##/####"
Will return True if you have a date (note that it will not differentiate between dd/mm/yyyy and mm/dd/yyyy.
And
Mid(Fullcell(i), InStr(Fullcell(i), " - ") + 2, InStr(Fullcell(i), " (") - InStr(Fullcell(i), " - ") - 2)
Will return the username
I'm sure there is a more efficient way to do this, but I've used the following solution quite a few times:
This will select the date:
x = 1
Do While Mid(FullCell,1,x) <> " "
x = x + 1
Loop
strDate = Left(FullCell,x)
This will find the character number of the hyphen, the username starts 2 characters after.
x = 1
Do While Mid(FullCell,x,1) <> "-"
x = x + 1
Loop
Then we will find the end of the username
y = x + 2
Do While Mid(FullCell,y,1) <> " "
y = y + 1
Loop
The username should now be characters (x+2 to y-1)
strUsername = Mid(FullCell, x + 2, y - (x + 2) - 1)
Here's how I would do it
Dim your variables
Dim ring as Range
Dim dat as variant
Dim FullCell() as string
Dim User as string
Dim I as long
Set your range
Set rng = ` any way you choose
Dat = rng.value2
Loop dat
For i = 1 to UBound(dat, 1)
Split the data
FullCell = Trim(Split(FullCell, "-"))
Test if it split
If UBound(FullCell) > 0 Then
Test if it matches
If IsDate(FullCell(0)) Then
i = Instr(FullCell(1), "(")-1)
If i then
User = left$(FullCell(1), i)
' Found a user
End If
End If
End If
Next
Abstraction is your friend, it's always helpful to break these into their own private functions whenever you can. You could put your code in a function and call it something like ExtractUsername.
Below I did an example of this, and I decided to go with the RegExp approach (late binding), but you could use string functions like the examples above as well.
This function returns the username if it finds the pattern you mentioned above, otherwise, it returns an empty string.
Private Function ExtractUsername(ByVal SourceString As String) As String
Dim RegEx As Object
Set RegEx = CreateObject("vbscript.regexp")
'(FIRST GROUP FINDS THE DATE FORMATTED AS DD/MM/YYY, AS WELL AS THE FORWARD SLASH)
'(SECOND GROUP FINDS THE USERNAME) THIS WILL BE SUBMATCH 1
With RegEx
.Pattern = "(^\d{2}\/\d{2}\/\d{4}.*-)(.+)(\()"
.Global = True
End With
Dim Match As Object
Set Match = RegEx.Execute(SourceString)
'ONLY RETURN IF A MATCH WAS FOUND
If Match.Count > 0 Then
ExtractUsername = Trim(Match(0).SubMatches(1))
End If
Set RegEx = Nothing
End Function
The regex pattern is grouped into three parts, the date (and slash), username, and opening parentheses. What you are interested in is the username, which in the SubMatch would be number 1.
Regexr is a helpful site for practicing regular expressions and can show you a bit more of what the pattern I went with is doing.
Please note that using regular expressions might give you performance issues and you should test it against regular string functions to see what works best for your situation.

Extract largest numeric sequence from string (regex, or?)

I have strings similar to the following:
4123499-TESCO45-123
every99999994_54
And I want to extract the largest numeric sequence in each string, respectively:
4123499
99999994
I have previously tried regex (I am using VB6)
Set rx = New RegExp
rx.Pattern = "[^\d]"
rx.Global = True
StringText = rx.Replace(StringText, "")
Which gets me partway there, but it only removes the non-numeric values, and I end up with the first string looking like:
412349945123
Can I find a regex that will give me what I require, or will I have to try another method? Essentially, my pattern would have to be anything that isn't the longest numeric sequence. But I'm not actually sure if that is even a reasonable pattern. Could anyone with a better handle of regex tell me if I am going down a rabbit hole? I appreciate any help!
You cannot get the result by just a regex. You will have to extract all numeric chunks and get the longest one using other programming means.
Here is an example:
Dim strPattern As String: strPattern = "\d+"
Dim str As String: str = "4123499-TESCO45-123"
Dim regEx As New RegExp
Dim matches As MatchCollection
Dim match As Match
Dim result As String
With regEx
.Global = True
.MultiLine = False
.IgnoreCase = False
.Pattern = strPattern
End With
Set matches = regEx.Execute(str)
For Each m In matches
If result < Len(m.Value) Then result = m.Value
Next
Debug.Print result
The \d+ with RegExp.Global=True will find all digit chunks and then only the longest will be printed after all matches are processed in a loop.
That's not solvable with an RE on its own.
Instead you can simply walk along the string tracking the longest consecutive digit group:
For i = 1 To Len(StringText)
If IsNumeric(Mid$(StringText, i, 1)) Then
a = a & Mid$(StringText, i, 1)
Else
a = ""
End If
If Len(a) > Len(longest) Then longest = a
Next
MsgBox longest
(first result wins a tie)
If the two examples you gave, are of a standard where:
<long_number>-<some_other_data>-<short_number>
<text><long_number>_<short_number>
Are the two formats that the strings come in, there are some solutions.
However, if you are searching any string in any format for the longest number, these will not work.
Solution 1
([0-9]+)[_-].*
See the demo
In the first capture group, you should have the longest number for those 2 formats.
Note: This assumes that the longest number will be the first number it encounters with an underscore or a hyphen next to it, matching those two examples given.
Solution 2
\d{6,}
See the demo
Note: This assumes that the shortest number will never exceed 5 characters in length, and the longest number will never be shorter than 6 characters in length
Please, try.
Pure VB. No external libs or objects.
No brain-breaking regexp's patterns.
No string manipulations, so - speed. Superspeed. ~30 times faster than regexp :)
Easy transform on variouse needs.
For example, concatenate all digits from the source string to a single string.
Moreover, if target string is only intermediate step,
so it's possible to manipulate with numbers only.
Public Sub sb_BigNmb()
Dim sSrc$, sTgt$
Dim taSrc() As Byte, taTgt() As Byte, tLB As Byte, tUB As Byte
Dim s As Byte, t As Byte, tLenMin As Byte
tLenMin = 4
sSrc = "every99999994_54"
sTgt = vbNullString
taSrc = StrConv(sSrc, vbFromUnicode)
tLB = LBound(taSrc)
tUB = UBound(taSrc)
ReDim taTgt(tLB To tUB)
t = 0
For s = tLB To tUB
Select Case taSrc(s)
Case 48 To 57
taTgt(t) = taSrc(s)
t = t + 1
Case Else
If CBool(t) Then Exit For ' *** EXIT FOR ***
End Select
Next
If (t > tLenMin) Then
ReDim Preserve taTgt(tLB To (t - 1))
sTgt = StrConv(taTgt, vbUnicode)
End If
Debug.Print "'" & sTgt & "'"
Stop
End Sub
How to handle sSrc = "ev_1_ery99999994_54", please, make by yourself :)
.

Slight adaptation of a User Defined Function

I would like to extract a combination of text and numbers from a larger string located within a column within excel.
The constants I have to work with is that each Text string will
•either start with a A, C or S, and
•will always be 7 Characters long
•the position of he string I would like to extract varies
The code I have been using which has been working efficiently is;
Public Function Xtractor(r As Range) As String
Dim a, ary
ary = Split(r.Text, " ")
For Each a In ary
If Len(a) = 7 And a Like "[SAC]*" Then
Xtractor = a
Exit Function
End If
Next a
Xtractor = ""
End Function
However today I have learnt that sometimes my data may include scenarios like this;
What I would like is to adapt my code so If the 8th character is "Underscore" and the 1st character of the 7 characters is either S, A or C please extract up until the "Underscore"
Secondly I would like to exclude commons words like "Support" & "Collect" from being extracted.
Finally the 7th letter should be a number
Any ideas around this would be much appreciated.
Thanks
try this
ary = Split(Replace(r.Text, "_", " "))
or
ary = Split(Replace(r.Text, "_", " ")," ")
result will be same for both variants
test
update
Do you know how I could leave the result blank if the 7th character returned a letter?
Public Function Xtractor(r As Range) As String
Dim a, ary
ary = Split(Replace(r.Text, "_", " "))
For Each a In ary
If Len(a) = 7 And a Like "[SAC]*" And IsNumeric(Mid(a, 7, 1)) Then
Xtractor = a
Exit Function
End If
Next a
Xtractor = ""
End Function
test
Add Microsoft VBScript Regular Expressions 5.5 to project references. Use the following code to test matching and extracting with Xtractor:
Public Function Xtractor(ByVal p_val As String) As String
Xtractor = ""
Dim ary As String, v_re As New VBScript_RegExp_55.RegExp, Matches
v_re.Pattern = "^([SAC][^_]{1,6})_?"
Set Matches = v_re.Execute(p_val)
If Matches.Count > 0 Then Xtractor = Matches(0).SubMatches(0) Else Xtractor = ""
End Function
Sub test_Xtractor(p_cur As Range, p_val As String, p_expected As String)
Dim v_cur As Range, v_res As Range
p_cur.Value = p_val
Set v_cur = p_cur.Offset(columnOffset:=1)
v_cur.FormulaR1C1 = "='" & ThisWorkbook.Name & "'!Xtractor(RC[-1])"
Set v_res = v_cur.Offset(columnOffset:=1)
v_res.FormulaR1C1 = "=RC[-1]=""" & p_expected & """"
Debug.Print p_val; "->"; v_cur.Value; ":"; v_res.Value
End Sub
Sub test()
test_Xtractor ActiveCell, "A612002_MDC_308", "A612002"
test_Xtractor ActiveCell.Offset(1), "B612002_MDC_308", ""
test_Xtractor ActiveCell.Offset(2), "SUTP038_MDC_3", "SUTP038"
test_Xtractor ActiveCell.Offset(3), "KUTP038_MDC_3", ""
End Sub
Choose the workbook and cell for writing test fixture, then run test from the VBA Editor.
Output in the Immediate window (Ctrl+G):
A612002_MDC_308->A612002:True
B612002_MDC_308->:True
SUTP038_MDC_3->SUTP038:True
KUTP038_MDC_3->:True
UPD
Isit possible to ammend this code so if the 7th character is a letter to return blank?
Replace line with assign to v_re by the following:
v_re.Pattern = "^([SAC](?![^_]{5}[A-Z]_?)[^_]{1,6})_?"
v_re.IgnoreCase = True
And add to the test suite:
test_Xtractor ActiveCell.Offset(4), "SUTP03A_MDC_3", ""
Output:
A612002_MDC_308->A612002:True
B612002_MDC_308->:True
SUTP038_MDC_3->SUTP038:True
KUTP038_MDC_3->:True
SUTP03A_MDC_3->:True
I inserted negative lookahead subrule (?![^_]{5}[A-Z]_?) to reject SUTP03A_MDC_3. But pay attention: the rejecting rule is applied exactly to the 7th character. Now v_re.IgnoreCase set to True, but if only capitalized characters are allowed, set it to False. See also Regular Expression Syntax on MSDN.

VBA and RegEx matching arbitrary strings in Excel 2010

I need to extract adress and potentially zip code as separate entites from the same line. The address line may or may not contain a zip code, and may or may not contain other unwanted strings. This is due to a bug in a web form, which is fixed, but the damage is already done to a set of elements.
Possible forms and results:
Address: Some address 251, 99302 Something Telephone: 555 6798 8473 -- Return "some address 251" and "99302 something" in separate strings. Comma may or may not be trailed by whitespace.
Address: Some address 251 -- Return "some address 251"
Address: Some address 251, 99302 -- Return "some address 251" and "99302". Again, comma may or may not be trailed by whitespace.
I have a basic understanding of how this could be done programatically in VBA by iterating over the string and checking individual characters and substrings, but I feel like it will be time-consuming and not very robust afterwards. Or if it's robust, it would end up being huge because of all the possible variations.
I am struggling the most with how to form the regular expression(s) and possibly the conditionals to get the desired results.
This is part of a larger project, so I won't paste all the various code, but I am pulling mailitems from Outlook to analyze and dump relevant info into an Excel sheet. I have both the Outlook and Excel code working, but the logic that extracts information is a bit flawed.
Here are the new snippets I've been working on:
Function regexp(str As String, regP As String)
Dim rExp As Object, rMatch As Object
Set rExp = CreateObject("vbscript.regexp")
With rExp
.Global = False
.MultiLine = False
.IgnoreCase = True
.Pattern = regP
End With
Set rMatch = rExp.Execute(str)
If rMatch.Count > 0 Then
regexp = rMatch(0)
Else
RegEx = vbNullString
Debug.Print "No match found!"
End If
End Function
Sub regexpAddress(str As String)
Dim result As String
Dim pattern As String
If InStr(str, "Telephone:") Then pattern = "/.+?(?=Telephone:)/"
result = regexp(str, pattern)
End Sub
I'm not sure how to form the regexps here. The one outlined should pull the right information (in 1 string instead 2, but that's still an improvement) - but only when the line contains the string "Telephone:", and I have a lot of cases where it won't contain that.
This is the current and somewhat flawed logic, which for some reason doesn't always yield the results I want:
For Each objMail In olFolder.Items
name = ""
address = ""
telephone = ""
email = ""
vIterations = vIterations + 1
arrBody = Split(objMail.body, Chr(10)) ' Split mail body when linebreak is encountered, throwing each line into its own array position
For i = 0 To UBound(arrBody)
arrLine = Split(arrBody(i), ": ") ' For each element (line), make new array, and if text search matches then write the 2nd half of the element to variable
If InStr(arrBody(i), "Name:") > 0 Then ' L2
name = arrLine(1) ' Reference 2nd column in array after the split
ElseIf InStr(arrBody(i), "Address:") > 0 Then
address = arrLine(1)
ElseIf InStr(arrBody(i), "Telephone:") > 0 Then
telephone = CLng(arrLine(1))
ElseIf InStr(arrBody(i), "Email:") > 0 Then
email = arrLine(1)
End If ' L2
Next
Next ' Next/end-for
This logic accepts and formats input of the following type:
Name: Joe
Address: Road
Telephone: 55555555555555
Email: joe#road.com
and returns joe, road, 55555 and joe#road.com to some defined Excel cells. This works fine when the mailitems are ordered as expected.
Problem: A bug lead to not my webform not inserting a linebreak after the address in some cases. The script still worked for the most part, but the mailitem contents sometimes ended up looking like this:
Name: Joe
Address: Road Telephone: 55555555555555
Email: joe#road.com
The address field was contaminated when it reached Excel ("Road Telephone" instead of just "Road"), but there was no loss of information. Which was acceptable, as it's easy to remove the surpluss string.
But in the following case (no email is entered), the phone number is not only lost but is actually replaced by a phone number from some other, arbitrary mailitem and I can't FOR THE LIFE OF ME figure out (1) why it won't get the correct number, (2) why it jumps to a new mail item to find the phone number or (3) how it selects this other mailitem:
Name: Joe
Address: Road Telephone: 5555555555555
Email:
In Excel:
Name: Joe
Address: Road Telephone
Telephone: 8877445511
Email:
So, TL;DR: my selection logic is flawed, and being that it is so hastily hacked together, not to mention how it yields false information and I am unable to figure out how and why, I would like to do a better operation using some other solution (like regexp?) instead for a more robust code.
Not so long ago I had a similar problem.
Code may not be very professional, but it can be helpful :)
Could you check if this code work for you correctly?
Function regexp(str As String, regP As String)
Dim rExp As Object, rMatch As Object
Set rExp = CreateObject("vbscript.regexp")
With rExp
.Global = False
.MultiLine = False
.IgnoreCase = True
.pattern = regP
End With
Set rMatch = rExp.Execute(str)
If rMatch.Count > 0 Then
regexp = rMatch(0)
Else
RegEx = vbNullString
Debug.Print "No match found!"
End If
End Function
Function for_vsoraas()
For Each objMail In olFolder.Items
vIterations = vIterations + 1
objMail_ = Replace(objMail.body, Chr(10), " ")
Dim StringToSearch(3) As String
StringToSearch(0) = "Name:"
StringToSearch(1) = "Address:"
StringToSearch(2) = "Telephone:"
StringToSearch(3) = "Email:"
Dim ArrResults(4) As String 'name,address,telephone,email, zipcode
For i = 0 To UBound(StringToSearch)
ResultString = ""
StartString = InStr(objMail_, StringToSearch(i))
If StartString > 0 Then
If i = UBound(StringToSearch) Then 'last string to search, dont search EndString
ResultString = Right(objMail_, Len(objMail_) + Len(StringToSearch(i)))
Else
EndString = 0
j = i
While (EndString = 0) 'prevent case no existing EndString
EndString = InStr(objMail_, StringToSearch(j + 1))
j = j + 1
If j = UBound(StringToSearch) And EndString = 0 Then
EndString = Len(objMail_) + 1
End If
Wend
ResultString = Mid(objMail_, StartString + Len(StringToSearch(i)) + 1, EndString - 1 - StartString - Len(StringToSearch(i)))
End If
ArrResults(i) = ResultString
End If
Next i
'search zipcode and address
ArrResults(4) = regexp(ArrResults(1), "\b(\d{5})\b")
ArrResults(1) = regexp(ArrResults(1), "([a-z ]{2,}\s{0,1}\d{0,3})")
'your varabile
Name = ArrResults(0)
Address = ArrResults(1)
Telephone = ArrResults(2)
Email = ArrResults(3)
ZipCode = ArrResults(4)
Next ' Next/end-for
End Function
I don't know if it was dumb luck or if I actually managed to learn some regex, but these patterns turn out to do exactly what I need.
' regex patterns - use flag /i
adrPattern = "([a-z ]{2,}\s{0,1}\d{0,3})" ' Select from a-z or space, case insensitive and at least 2 characters long, followed by optional space, ending with 0-3 digits
adrZipcode = "\b(\d{4})\b" ' Exactly 4 digits surrounded on both sides by either space, text or non-word character like comma
Edit: "Fixed" the telephone problem too. After spending 2 hours trying to write it in regex, and failing miserably, it dawned on me that solving the problem as a matter of faulty creation of the array had to be so much easier than treating it as a computational problem. And it was:
mailHolder = Replace(objMail.body, "Telephone:", Chr(10) + "Telephone:")
arrBody = Split(mailHolder, Chr(10))

Is this the RegEx for matching any cell reference in an Excel formula?

I have been trying to create a regular expressions pattern that matches any reference in any Excel formula, including absolute, relative, and external references. I need to return the entire reference, including the worksheet and workbook name.
I haven't been able to find exhaustive documentation about Excel A1-notation, but with a lot of testing I have determined the following:
Formulas are preceded with an equal sign "="
Strings within formulas are enclosed in double quotes and need to be removed before looking for real references, otherwise =A1&"A1" would break regex
Worksheet names can be up to 31 characters long, excluding \ / ? * [ ] :
Worksheet names in external references must be succeeded with bang =Sheet1!A1
Workbook names in external references must be enclosed in square brackets =[Book1.xlsx]Sheet1!A1
Workbook paths, which Excel adds if a reference is to a range in a closed workbook, are always enclosed in single quotes and to the left of the brackets for the workbook name 'C:\[Book1.xlsx]Sheet1'!A1
Some characters (non-breaking space, for example) cause Excel to enclose the workbook and worksheet name in an external reference in single quotes, but I don't know specifically which characters ='[Book 1.xlsx]Sheet 1'!A1
Even if R1C1-notation is enabled, Range.Formula still returns references in A1-notation. Range.FormulaR1C1 returns references in R1C1 notation.
3D reference style allows a range of sheet names on one workbook =SUM([Book5]Sheet1:Sheet3!A1)
Named ranges can be specified in formulas:
The first character of a name must be a letter, an underscore character (_), or a backslash (\). Remaining characters in the name can be letters, numbers, periods, and underscore characters.
You cannot use the uppercase and lowercase characters "C", "c", "R", or "r" as a defined name, because they are all used as a shorthand for selecting a row or column for the currently selected cell when you enter them in a Name or Go To text box.
Names cannot be the same as a cell reference, such as Z$100 or R1C1.
Spaces are not allowed as part of a name.
A name can be up to 255 characters in length.
Names can contain uppercase and lowercase letters. Excel does not distinguish between uppercase and lowercase characters in names.
Here is what I came up with wrapped in a VBA procedure for testing. I updated the code to handle names as well:
Sub ReturnFormulaReferences()
Dim objRegExp As New VBScript_RegExp_55.RegExp
Dim objCell As Range
Dim objStringMatches As Object
Dim objReferenceMatches As Object
Dim objMatch As Object
Dim intReferenceCount As Integer
Dim intIndex As Integer
Dim booIsReference As Boolean
Dim objName As Name
Dim booNameFound As Boolean
With objRegExp
.MultiLine = True
.Global = True
.IgnoreCase = True
End With
For Each objCell In Selection.Cells
If Left(objCell.Formula, 1) = "=" Then
objRegExp.Pattern = "\"".*\"""
Set objStringMatches = objRegExp.Execute(objCell.Formula)
objRegExp.Pattern = "(\'.*(\[.*\])?([^\:\\\/\?\*\[\]]{1,31}\:)?[^\:\\\/\?\*\[\]]{1,31}\'\!" _
& "|(\[.*\])?([^\:\\\/\?\*\[\]]{1,31}\:)?[^\:\\\/\?\*\[\]]{1,31}\!)?" _
& "(\$?[a-z]{1,3}\$?[0-9]{1,7}(\:\$?[a-z]{1,3}\$?[0-9]{1,7})?" _
& "|\$[a-z]{1,3}\:\$[a-z]{1,3}" _
& "|[a-z]{1,3}\:[a-z]{1,3}" _
& "|\$[0-9]{1,7}\:\$[0-9]{1,7}" _
& "|[0-9]{1,7}\:[0-9]{1,7}" _
& "|[a-z_\\][a-z0-9_\.]{0,254})"
Set objReferenceMatches = objRegExp.Execute(objCell.Formula)
intReferenceCount = 0
For Each objMatch In objReferenceMatches
intReferenceCount = intReferenceCount + 1
Next
Debug.Print objCell.Formula
For intIndex = intReferenceCount - 1 To 0 Step -1
booIsReference = True
For Each objMatch In objStringMatches
If objReferenceMatches(intIndex).FirstIndex > objMatch.FirstIndex _
And objReferenceMatches(intIndex).FirstIndex < objMatch.FirstIndex + objMatch.Length Then
booIsReference = False
Exit For
End If
Next
If booIsReference Then
objRegExp.Pattern = "(\'.*(\[.*\])?([^\:\\\/\?\*\[\]]{1,31}\:)?[^\:\\\/\?\*\[\]]{1,31}\'\!" _
& "|(\[.*\])?([^\:\\\/\?\*\[\]]{1,31}\:)?[^\:\\\/\?\*\[\]]{1,31}\!)?" _
& "(\$?[a-z]{1,3}\$?[0-9]{1,7}(\:\$?[a-z]{1,3}\$?[0-9]{1,7})?" _
& "|\$[a-z]{1,3}\:\$[a-z]{1,3}" _
& "|[a-z]{1,3}\:[a-z]{1,3}" _
& "|\$[0-9]{1,7}\:\$[0-9]{1,7}" _
& "|[0-9]{1,7}\:[0-9]{1,7})"
If Not objRegExp.Test(objReferenceMatches(intIndex).Value) Then 'reference is not A1
objRegExp.Pattern = "^(\'.*(\[.*\])?([^\:\\\/\?\*\[\]]{1,31}\:)?[^\:\\\/\?\*\[\]]{1,31}\'\!" _
& "|(\[.*\])?([^\:\\\/\?\*\[\]]{1,31}\:)?[^\:\\\/\?\*\[\]]{1,31}\!)" _
& "[a-z_\\][a-z0-9_\.]{0,254}$"
If Not objRegExp.Test(objReferenceMatches(intIndex).Value) Then 'name is not external
booNameFound = False
For Each objName In objCell.Worksheet.Parent.Names
If objReferenceMatches(intIndex).Value = objName.Name Then
booNameFound = True
Exit For
End If
Next
If Not booNameFound Then
objRegExp.Pattern = "^(\'.*(\[.*\])?([^\:\\\/\?\*\[\]]{1,31}\:)?[^\:\\\/\?\*\[\]]{1,31}\'\!" _
& "|(\[.*\])?([^\:\\\/\?\*\[\]]{1,31}\:)?[^\:\\\/\?\*\[\]]{1,31}\!)"
For Each objName In objCell.Worksheet.Names
If objReferenceMatches(intIndex).Value = objRegExp.Replace(objName.Name, "") Then
booNameFound = True
Exit For
End If
Next
End If
booIsReference = booNameFound
End If
End If
End If
If booIsReference Then
Debug.Print " " & objReferenceMatches(intIndex).Value _
& " (" & objReferenceMatches(intIndex).FirstIndex & ", " _
& objReferenceMatches(intIndex).Length & ")"
End If
Next intIndex
Debug.Print
End If
Next
Set objRegExp = Nothing
Set objStringMatches = Nothing
Set objReferenceMatches = Nothing
Set objMatch = Nothing
Set objCell = Nothing
Set objName = Nothing
End Sub
Can anyone break or improve this? Without exhaustive documentation on Excel's formula syntax it is difficult to know if this is correct.
Thanks!
jtolle steered me in the right direction. As far as I can tell, this is what I was trying to do. I've been testing and it seems to work.
stringOriginFormula = rangeOrigin.Formula
rangeOrigin.Cut rangeDestination
rangeOrigin.Formula = stringOriginFormula
Thanks jtolle!
I'm a few years late here, but I was looking for something similar and so dug into this. The main pattern you use is this:
objRegExp.Pattern = "(\'.*(\[.*\])?([^\:\\\/\?\*\[\]]{1,31}\:)?[^\:\\\/\?\*\[\]]{1,31}\'\!" _
& "|(\[.*\])?([^\:\\\/\?\*\[\]]{1,31}\:)?[^\:\\\/\?\*\[\]]{1,31}\!)?" _
& "(\$?[a-z]{1,3}\$?[0-9]{1,7}(\:\$?[a-z]{1,3}\$?[0-9]{1,7})?" _
& "|\$[a-z]{1,3}\:\$[a-z]{1,3}" _
& "|[a-z]{1,3}\:[a-z]{1,3}" _
& "|\$[0-9]{1,7}\:\$[0-9]{1,7}" _
& "|[0-9]{1,7}\:[0-9]{1,7}" _
& "|[a-z_\\][a-z0-9_\.]{0,254})"
Basically you have six alternatives for a range reference (lines 3-8), any of which will produce a match by itself, with two alternatives for an optional filename/sheet name prefix (lines 1-2).
For the two prefix alternatives, the only difference is that the first is wrapped in single quotes, with an extra dot star after the initial quote. These single quotes occur mainly when there is a space in a sheet name. The purpose of the dot star, matching unconstrained text after an initial single quote, is unclear and it appears to create problems. I'll discuss those problems below. Besides that the two alternative prefixes are the same, and I'll refer to them collectively as the Optional External Prefix (OEP).
The OEP has its own two optional prefixes (the same in either alternative). The first is for the workbook name, an open-ended dot star in brackets.
(\[.*\])?
The second is for a "3D" cell reference, with two sheet names separated by a colon; it is the initial sheet name including the colon. The pattern here is a negated character class allowing up to 31 characters of anything except forward slash, back slash, question mark, asterisk, brackets, or colon, followed by a colon:
([^\:\\\/\?\*\[\]]{1,31}\:)?
Finally for the OEP is its only required part: a sheet name, same as the optional sheet name but with no colon. The effect is (if these all worked correctly) that the required sheet name will match if it can, and then only if there is a 3d reference or additional prior bracketed text will its optional prefixes also match.
Issues with the Workbook/Sheet name prefix: First, the dot star at the beginning of the first line is over-inclusive. Similarly, the negated character class for the sheet name appears to need additional characters including parens, comma, plus, minus, equals, and bang. Otherwise, extra material is interpreted as part of the sheet name. On my testing, this overinclusion happened with any of these:
=SUM(Sheet1!A1,Sheet2!A2)
=Sheet1!A1+Sheet2!A2
=Sheet1!A1-Sheet2!A2
Sheet names can include some of these characters, so accounting for that would require some additional measure. For instance, a sheet could be named "(Sheet1)", giving an odd formula like:
=SUM('(Sheet1)'!A1:A2)
You'd like to get the inner parens with the sheet name there, but not the outer paren. Excel puts the single quotes on that one, as it would with a space in the sheet name. You could then exclude parens in the non-single quote version since within the single quote it's ok. But then beware Excel seems to even allow single quotes in sheet names. Taking these naming quirks to the extreme, I just successfully named a sheet "Hi'Sheet1'SUM('Sheet2'!A1,A2)!". That's absurd but it points to what could happen. I learned in doing this that if I include a single quote in a sheet name, formulas escape the single quote with a second single quote. So a SUM(A1:A2) referring to the sheet I just created ends up looking like this:
=SUM('Hi''Sheet1''SUM(''Sheet2''!A1,A2)!'!A1:A2)
That actually does give some insight into the Excel parser itself. I suspect to adequately deal with this you may want separately (outside the regex) to compare the potential sheet names or workbook names to the actual sheet names, as you have done with the named ranges.
This leads to the six forms of cell references allowed in the regex (any one of which, if met, will produce a match):
1.) A one-cell or multi-cell range with rows and columns
"(\$?[a-z]{1,3}\$?[0-9]{1,7}(\:\$?[a-z]{1,3}\$?[0-9]{1,7})?"
The open paren here is closed at the end of the 6 options. Otherwise, this line allows a basic cell reference of the type "$A$1", "A1", "$A1", "A$1", or any combination of these in a multi-cell range ("$A1:A$2", etc.).
2.) A full-column or multi-column range with absolute references only
"|\$[a-z]{1,3}\:\$[a-z]{1,3}"
This one allows a cell reference of the type "$A:$B" with a dollar sign on both. Note a dollar sign on only one side will not match.
3.) A full-column or multi-column range with relative references only
"|[a-z]{1,3}\:[a-z]{1,3}"
This line is like the last, but matches only with no dollar signs. Note a dollar sign on only one side will not match here either.
4.) A full-row or multi-row range with absolute references only
"|\$[0-9]{1,7}\:\$[0-9]{1,7}"
This line allows a cell reference of the type "$1:$2" with a dollar sign on both.
5.) A full-row or multi-row range with relative references only
"|[0-9]{1,7}\:[0-9]{1,7}"
This version is like the last, but matches only with no dollar signs.
6.) Other text that could be a named range
"|[a-z_\\][a-z0-9_\.]{0,254})"
Finally, the sixth option allows text. This text is compared to actual named ranges later in sub.
The main omission that I see here is ranges that have both absolute and relative references, of the type "A:$A" or "1:$1". While $A:A is captured because it includes "A:A", "A:$A" is not captured. You could address this and simplify the regex by combining 2 and 3 and combining 4 and 5 with optional dollar signs:
objRegExp.Pattern = "(\'.*(\[.*\])?([^\:\\\/\?\*\[\]]{1,31}\:)?[^\:\\\/\?\*\[\]]{1,31}\'\!" _
& "|(\[.*\])?([^\:\\\/\?\*\[\]]{1,31}\:)?[^\:\\\/\?\*\[\]]{1,31}\!)?" _
& "(\$?[a-z]{1,3}\$?[0-9]{1,7}(\:\$?[a-z]{1,3}\$?[0-9]{1,7})?" _
& "|\$?[a-z]{1,3}\:\$?[a-z]{1,3}" _
& "|\$?[0-9]{1,7}\:\$?[0-9]{1,7}" _
& "|[a-z_\\][a-z0-9_\.]{0,254})"
Combining these further would seem to come up against the everything-is-optional problem.
One other issue is in the initial regex pattern for matching strings, which you use to expunge potential ranges that fall inside a quoted string:
objRegExp.Pattern = "\"".*\"""
When I test this on a formula with a string at the beginning and end of a formula, the greediness of the dot star captures everything from the initial quote to the final quote (in other words it interprets the entire formula as one big quoted string, even though there is non-string material in the middle). It appears you can fix this by making the dot star lazy (adding a question mark after it). That raises questions about quotes within quotes, but they may not be a problem. For instance, I tested this formula:
="John loves his A1 steak sauce, but said the ""good A1 steak sauce price"" is $" & A2+A3 & " less than the ""bad price"" of $" & A4 & "."
With cell values plugged in, this formula evaluates to:
John loves his A1 steak sauce, but said the "good A1 steak sauce
price" is $5 less than the "bad price" of $8.
With the lazy modifier added to your string pattern, both versions of "A1" above were recognized as occurring within a string and so were expunged, while A2, A3 and A4 were recognized as cell references.
I'm sure there are some technical issues with some of my language above, but hopefully the analysis is still useful.
Thanks Ben (I'm new to post here, even though Stackoverflow has caught my attention for years for high quality technical stuff, so I'm not sure if I read this page correctly for the author J)
I tried the posted solutions (testing, testing updated, as well as the one using range.precendents (which as correctly pointed, does not cover references to other sheets or other workbooks) and found a minor flaw: the external sheet name is enclosed in 'single quotation marks' only if it is a number; if it contains space (and possibly other characters as Ben (?) listed in the orginal post. with a simple addition to the regEx (opening [) this can be corrected (added "[", see code below). In addition, for my own purpose I converted the sub to a function that will return a comma-separated list with duplicates removed (note, this removes just identical reference notation, not cells that are included in multiple ranges):
Public Function CellReflist(Optional r As Range) ' single cell
Dim result As Object: Dim testExpression As String: Dim objRegEx As Object
If r Is Nothing Then Set r = ActiveCell ' Cells(1, 2) ' INPUT THE CELL HERE , e.g. RANGE("A1")
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.IgnoreCase = True: objRegEx.Global = True: objRegEx.Pattern = """.*?""" ' remove expressions
testExpression = CStr(r.Formula)
testExpression = objRegEx.Replace(testExpression, "")
'objRegEx.Pattern = "(([A-Z])+(\d)+)" 'grab the address
objRegEx.Pattern = "(['\[].*?['!])?([[A-Z0-9_]+[!])?(\$?[A-Z]+\$?(\d)+(:\$?[A-Z]+\$?(\d)+)?|\$?[A-Z]+:\$?[A-Z]+|(\$?[A-Z]+\$?(\d)+))"
If objRegEx.Test(testExpression) Then
Set result = objRegEx.Execute(testExpression)
If result.Count > 0 Then CellReflist = result(0).Value
If result.Count > 1 Then
For i = 1 To result.Count - 1 'Each Match In result
dbl = False ' poistetaan tuplaesiintymiset
For j = 0 To i - 1
If result(i).Value = result(j).Value Then dbl = True
Next j
If Not dbl Then CellReflist = CellReflist & "," & result(i).Value 'Match.Value
Next i 'Match
End If
End If
End Function
I resolved a similar problem in Google Sheets.
The following adds/subtract row references from a formula. Because I just needed to update row references, rather than extracting the formula I just extracted and updated the row reference with this /((?<=[A-Za-z\$:\!])\d+(?![A-Za-z\(!]))|(\d+(?=[:]))/
String.prototype.replaceAt = function(index, replacement, diff = 0) {
let end = this.substr(index + replacement.length + diff)
if((this.length - 1) === index) end = ""
return this.substr(0, index) + replacement + end;
}
// Ref: https://stackoverflow.com/a/1431113/2319414
/**
* #param row - positive integer to add, negative to subtract rows.
*/
function updateRowReference(formula, row){
let masked = formula
const mask = "#"
// masking double quotes in string literals
let exp = /""/g
let result;
while((result = exp.exec(masked)) !== null){
masked = masked.replaceAt(result.index, new Array(result[0].length).fill(mask).join(""))
}
// masking string literals
exp = /\"([^\\\"]|\\.)*\"/g
// Ref: https://stackoverflow.com/a/9260547
while((result = exp.exec(masked)) !== null){
masked = masked.replaceAt(result.index, new Array(result[0].length).fill(mask).join(""))
}
// updating row references
const sRow = row.toString()
// The magic is happening here
// Just matching a number which is part of range address
exp = /((?<=[A-Za-z\$:\!])\d+(?![A-Za-z\(!]))|(\d+(?=[:]))/g
while((result = exp.exec(masked)) !== null){
const oldRow = Number(result[0])
// adding/subtracting rows
const newRow = (row + oldRow).toString()
// preserving formula string length integrity if number of digits of new row is different than old row
const diff = result[0].length - newRow.length
masked = masked.replaceAt(result.index, newRow, diff)
formula = formula.replaceAt(result.index, newRow, diff)
exp.lastIndex -= diff
}
let updated = masked;
// revert mask
const array = formula.split("")
while((result = updated.search(mask)) !== -1){
updated = updated.replaceAt(result, array[result])
}
return updated
}
function test(){
const cases = [
"=$A$1",
"=A1",
"=$A1",
"=A$1",
"=$A1:B$1",
"=1:1",
"=Sheet1!1:1",
"=Sheet1!$A1:B$1",
"=Sheet1!A$1",
'=IF(AND($C6 <> ""; NOT(ISBLANK(B$6))); IF(SUM(FILTER($F$6:$F$7;$C$6:$C$7 = $C6)) < $G6; 1; IF($E6 = 0; 1; 0)); 0)',
"=$A$111", "=A111", "=$A111", "=A$111", "=$A111:B$111",
"=111:111",
"=Sheet1!111:111",
"=Sheet1!$A111:B$111",
"=Sheet1!A$111",
'=IF(AND($C111 <> ""; NOT(ISBLANK(B$111))); IF(SUM(FILTER($F$111:$F$112;$C$111:$C$112 = $C111)) < $G111; 1; IF($E111 = 0; 1; 0)); 0)',
// if string literals have addresses they shouldn't be affected
'=IF(AND($C111 <> "A1 $A1 $A1:B$1";$C111 <> "Sheet1!1:1";$C111 <> "Sheet1!$A1:B$1"); 1 , 0)'
]
const expectedAdd = [
'=$A$16',
'=A16',
'=$A16',
'=A$16',
'=$A16:B$16',
'=16:16',
'=Sheet1!16:16',
'=Sheet1!$A16:B$16',
'=Sheet1!A$16',
'=IF(AND($C21 <> ""; NOT(ISBLANK(B$21))); IF(SUM(FILTER($F$21:$F$22;$C$21:$C$22 = $C21)) < $G21; 1; IF($E21 = 0; 1; 0)); 0)',
'=$A$126',
'=A126',
'=$A126',
'=A$126',
'=$A126:B$126',
'=126:126',
'=Sheet1!126:126',
'=Sheet1!$A126:B$126',
'=Sheet1!A$126',
'=IF(AND($C126 <> ""; NOT(ISBLANK(B$126))); IF(SUM(FILTER($F$126:$F$127;$C$126:$C$127 = $C126)) < $G126; 1; IF($E126 = 0; 1; 0)); 0)',
'=IF(AND($C126 <> "A1 $A1 $A1:B$1";$C126 <> "Sheet1!1:1";$C126 <> "Sheet1!$A1:B$1"); 1 , 0)'
]
let results = cases.map(_case => updateRowReference(_case, 15))
console.log('Test Add')
console.log(results.every((result, i) => result === expectedAdd[i]))
console.log('Test Subtract')
results = results.map(_case => updateRowReference(_case, -15))
console.log(results.every((result, i) => result === cases[i]))
}
test()
'INDIRECT' function with addresses as strings will not be updated