Regex to extract inconsistent postal codes from string - regex

Using the solution posted here, I'm looking to extract postal codes from a list of irregular data in Excel.
Below is a sample of what my data looks like:
Brampton L6P 2G9 ON Canada
M5B2R3 Toronto ON
Toronto M5J 0A6 ON Canada
M1H1T7 Canada
Toronto M4P1T8 ON Canada
MISSISUAGABRAMPTON L5M6S6 ON Canada
333 Sea Ray Inisfil l4e2y6 ON Canada
To call the function, I'm using the following formula
=RegexExtract(A1,"^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$")
However the function is not working for me. I think I need to tweak my regex expression in some way but I don't know what I'm missing.

google-spreadsheet
Try,
=REGEXEXTRACT(upper(A2), "[A-X]\d[A-Z] ?\d[A-Z]\d")
'alternate
=left(REGEXEXTRACT(upper(A2), "[A-X]\d[A-Z] ?\d[A-Z]\d"), 3)&" "&right(REGEXEXTRACT(upper(A2), "[A-X]\d[A-Z] ?\d[A-Z]\d"), 3)

You have 2 issues.
First, the expression - if you need to extract the postal code, you can't anchor your regex with ^ and $. The first means "match must occur at the start of the string" and the second means "match must end at the end of the string". This is only useful if you are validating a postal code, but it obviously can't be used to extract one from your examples because they all contain things other than the postal code.
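The other problem with the regex is the negative look-ahead assertion (?!.*[DFIOQU]). Anchored at the start and followed by .*, it means "nothing anywhere in the string may be D, F, I, O, Q, or U", so any of those letters in the rest of the cell (the O in "ON" or "Toronto", for example) rejects the whole string. VBScript regex does support look-ahead (it is look-behind that is missing), but here it is simpler to leave those letters out of the character classes.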
That gives you the slightly more pedantic expression:
[ABCEGHJKLMNPRSTVXY]\d[ABCEGHJKLMNPRSTVWXYZ][ -]?\d[ABCEGHJKLMNPRSTVWXYZ]\d
I took the liberty of optionally allowing a - between the FSA and LDU because I see that a lot, particularly from non-Canadians.
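The effect of the anchors can be checked outside Excel. The following is a minimal Python sketch (Python's re module, not the VBScript engine; the helper name find_code is made up for illustration) that runs the original anchored pattern and an unanchored one against sample rows. The first character class includes Y, since Yukon codes start with that letter.

```python
import re

# The pattern from the question: anchored at both ends, with the look-ahead.
ORIGINAL = r"^(?!.*[DFIOQU])[A-VXY][0-9][A-Z] ?[0-9][A-Z][0-9]$"

# Unanchored pattern with the unused letters left out of each class.
UNANCHORED = (
    r"[ABCEGHJKLMNPRSTVXY]\d[ABCEGHJKLMNPRSTVWXYZ]"
    r"[ -]?\d[ABCEGHJKLMNPRSTVWXYZ]\d"
)

def find_code(pattern, text):
    """Return the first match of pattern in text, or None (hypothetical helper)."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# The anchored pattern only accepts a cell that is nothing but a postal code.
print(find_code(ORIGINAL, "M1H1T7"))           # M1H1T7
print(find_code(ORIGINAL, "M1H1T7 Canada"))    # None
# The unanchored pattern finds the code wherever it sits in the cell.
print(find_code(UNANCHORED, "Toronto M5J 0A6 ON Canada"))  # M5J 0A6
```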
Second, the function that you're calling (copied below from the linked answer):
Function RegexExtract(ByVal text As String, _
                      ByVal extract_what As String, _
                      Optional separator As String = ", ") As String
    Dim allMatches As Object
    Dim RE As Object
    Set RE = CreateObject("vbscript.regexp")
    Dim i As Long, j As Long
    Dim result As String
    RE.pattern = extract_what
    RE.Global = True
    Set allMatches = RE.Execute(text)
    For i = 0 To allMatches.count - 1
        For j = 0 To allMatches.Item(i).submatches.count - 1
            result = result & (separator & allMatches.Item(i).submatches.Item(j))
        Next
    Next
    If Len(result) <> 0 Then
        result = Right$(result, Len(result) - Len(separator))
    End If
    RegexExtract = result
End Function
The first problem is that it is case sensitive. It is also tailored to extracting submatches, which you don't care about - your examples are looking for a single match.
I'd go with this much simpler option that also correctly formats the output:
Public Function ExtractCanadianPostalCode(inputText As String) As String
    With CreateObject("vbscript.regexp")
        'First letter may also be Y (Yukon); D, F, I, O, Q and U are never used.
        .Pattern = "[ABCEGHJKLMNPRSTVXY]\d[ABCEGHJKLMNPRSTVWXYZ][ -]?\d[ABCEGHJKLMNPRSTVWXYZ]\d"
        .IgnoreCase = True
        If .Test(inputText) Then
            Dim matches As Object
            Set matches = .Execute(inputText)
            'Normalize to upper case with a single space between the two halves.
            ExtractCanadianPostalCode = UCase$(Left$(matches(0), 3) & " " & Right$(matches(0), 3))
        End If
    End With
End Function
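For comparison, here is a rough Python sketch of the same idea: a case-insensitive search followed by normalizing the output to "A1A 1A1". It uses Python's re module rather than VBScript, the function name extract_postal_code is only illustrative, and the first character class includes Y for Yukon codes.

```python
import re

# Case-insensitive, unanchored; optional space or hyphen between the halves.
POSTAL_CODE = re.compile(
    r"[ABCEGHJKLMNPRSTVXY]\d[ABCEGHJKLMNPRSTVWXYZ]"
    r"[ -]?\d[ABCEGHJKLMNPRSTVWXYZ]\d",
    re.IGNORECASE,
)

def extract_postal_code(text):
    """Return the first postal code in text as 'A1A 1A1', or '' if none is found."""
    m = POSTAL_CODE.search(text)
    if m is None:
        return ""
    code = m.group(0).upper()
    # First three and last three characters, whatever separator was used.
    return code[:3] + " " + code[-3:]

for row in ("Brampton L6P 2G9 ON Canada",
            "M5B2R3 Toronto ON",
            "333 Sea Ray Inisfil l4e2y6 ON Canada"):
    print(extract_postal_code(row))
```

Taking the first and last three characters sidesteps the question of whether the source had a space, a hyphen, or nothing between the two halves.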

Related

Regular expression to match page number groups

I need a regular expression to match page numbers as found in common programs.
These usually take the form 1-5,3,5,1-9 for example.
I have a regular expression (\d+-\d+)?,(\d+-\d+?)* which I need help to refine.
As can be seen here regex101 I am matching commas and missing numbers entirely.
What I need is to match 1-5 as group 1, 3 as group 2, 5 as group 3 and 1-9 as group 4 without matching any commas.
Any help is appreciated. I will be using this in VBA.
This worked for me - am I missing something?
Sub Pages()
    Dim re As Object, allMatches, m, rv, sep, c As Range, i As Long
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "(\d+(-\d+)?)"
    re.ignorecase = True
    re.MultiLine = True
    re.Global = True
    For Each c In Range("B5:B20").Cells 'for example
        c.Offset(0, 1).Resize(1, 10).ClearContents 'clear output cells
        i = 0
        If re.test(c.Value) Then
            Set allMatches = re.Execute(c.Value)
            For Each m In allMatches
                i = i + 1
                c.Offset(0, i).Value = m
            Next m
        End If
    Next c
End Sub
If I recall correctly, capturing a dynamic number of groups will not work. You can pre-specify the format / number of groups to be matched, or you can catch the repeated groups as one and split them afterwards.
If you know the format, just do
(\d+(?:-\d+)?)(?:,)(\d+(?:-\d+)?)(?:,)(\d+(?:-\d+)?)(?:,)(\d+(?:-\d+)?)
which of course is not very neat.
If you want the flexible structure, match the first group and all the rest as a second and then split the latter by the delimiter ',' in whichever language.
(\d+(?:-\d+)?)((?:(?:,)(\d+(?:-\d+)?))*)
You need to make the -\d+ part optional, since you don't always have ranges. And the comma between each range should be part of the second group with the * quantifier, so you can match a single range with no comma after it.
\d+(-\d+)?(,\d+(-\d+)?)*
This will match the string that contains all the ranges. To get an array of individual ranges without the commas, do a second match in this string:
\d+(-\d+)?
Use the VBA function for getting an array of all matches of a regexp (sorry, I don't know VBA, so can't provide the specific syntax).
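The two-step approach (validate the whole list, then collect every single range) can be sketched as follows. This is Python rather than VBA, and the name split_page_spec is made up for the example; in VBA the second step corresponds to running Execute with Global = True and looping over the matches.

```python
import re

# Step 1: the whole string must be ranges or single pages separated by commas.
WHOLE_LIST = re.compile(r"\d+(-\d+)?(,\d+(-\d+)?)*")
# Step 2: one page or one range; non-capturing so findall returns whole matches.
ONE_RANGE = re.compile(r"\d+(?:-\d+)?")

def split_page_spec(spec):
    """Return the individual pages/ranges, or [] if spec is not a valid list."""
    if not WHOLE_LIST.fullmatch(spec):
        return []
    return ONE_RANGE.findall(spec)

print(split_page_spec("1-5,3,5,1-9"))  # ['1-5', '3', '5', '1-9']
```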

Extract number and a character from string using regex

I am trying to extract the numbers along with the 'x' from strings such as:
1. "KAWAN (FRZ) LACHA FLACKEY PARATHA 8X25X80 GM" or
2. "G.G. HOT SEV 20X285GM"
using the function below, but it returns only the last number with "x". The expected output is 8X25X or 20X. Also, is it possible to store the string without the extracted value using the same function?
Public Function getNumber(strInput As String) As Variant
    Dim regex As New RegExp
    Dim matches As Object
    regex.Pattern = "(\d??[x|X])"
    regex.Global = False
    Set matches = regex.Execute(strInput)
    If matches.Count = 0 Then
        getNumber = CVErr(xlErrNA)
    Else
        getNumber = matches(0).Value
    End If
End Function
Try the following pattern for your regular expression...
regex.Pattern = "((\d{1,2}[xX])+)"
By the way, since you're using early binding, you can declare matches as MatchCollection instead of Object.
Dim matches As MatchCollection
Try changing your regular expression to (\d*)[xX]. This captures zero or more digits followed by an X and puts the digits in a group. Applied to your first example, KAWAN (FRZ) LACHA FLACKEY PARATHA 8X25X80 GM, it matches 8X and 25X (with Global set to True), and each match has 8 and 25 as its group, respectively.
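The second part of the question, keeping the string without the extracted value, can be handled from the match position. Below is a small Python sketch of that idea (not VBA; split_quantity is a hypothetical name). It uses the repeated-group pattern with the groups made non-capturing. In VBA the same positions are available as FirstIndex and Length on the Match object.

```python
import re

# One or more "digits followed by x/X" units in a row, e.g. 8X25X or 20X.
QTY = re.compile(r"(?:\d{1,2}[xX])+")

def split_quantity(text):
    """Return (extracted, remainder); extracted is None when nothing matches."""
    m = QTY.search(text)
    if m is None:
        return None, text
    # Cut the matched span out of the original string.
    remainder = text[:m.start()] + text[m.end():]
    return m.group(0), remainder

print(split_quantity("KAWAN (FRZ) LACHA FLACKEY PARATHA 8X25X80 GM"))
print(split_quantity("G.G. HOT SEV 20X285GM"))
```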

Remove tweet regular expressions from string of text

I have an Excel sheet filled with tweets. Several entries contain #blah-type strings, among other things. I need to keep the rest of the text and remove the #blah part. For example, "#villos hey dude" needs to be transformed into "hey dude". This is what I've done so far.
Sub Macro1()
'
' Macro1 Macro
'
    Dim counter As Integer
    Dim strIN As String
    Dim newstring As String
    For counter = 1 To 46
        Cells(counter, "E").Select
        ActiveCell.FormulaR1C1 = strIN
        StripChars (strIN)
        newstring = StripChars(strIN)
        ActiveCell.FormulaR1C1 = StripChars(strIN)
    Next counter
End Sub
Function StripChars(strIN As String) As String
    Dim objRegex As Object
    Set objRegex = CreateObject("vbscript.regexp")
    With objRegex
        .Pattern = "^#?(\w){1,15}$"
        .ignorecase = True
        StripChars = .Replace(strIN, vbNullString)
    End With
End Function
Moreover there are also entries like this one: Ÿ³é‡ï¼Ÿã€€åˆã‚ã¦çŸ¥ã‚Šã¾ã—ãŸã€‚ shiftã—ãªãŒã‚‰ã‚¨ã‚¯ã‚¹ãƒ
I need them gone too! Ideas?
For every line in the spreadsheet run the following regex on it: ^(#.+?)\s+?(.*)$
If the line matches the regex, the information you will be interested in will be in the second capturing group. (Usually zero indexed but position 0 will contain the entire match). The first capturing group will contain the twitter handle if you need that too.
However, this will not match tweets that are not replies (starting with #). In this situation the only way to distinguish between regular tweets and the junk you are not interested in is to restrict the tweet to alphanumerics - but this may mean some tweets are missed if they contain any non-alphanumerical characters. The following regex will work if that is not an issue for you:
^(?:(#.+?)\s+?)?([\w\t ]+)$
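As a rough illustration of the stricter regex, here is a Python sketch (not VBA; clean_tweet is an invented name). One assumption worth flagging: Python's \w matches Unicode letters by default, so the ASCII flag is set to approximate the ASCII-only \w of the VBScript engine. Without it the garbled entries would slip through.

```python
import re

# Optional leading "#handle ", then text limited to word characters, tabs and spaces.
# re.ASCII restricts \w to [A-Za-z0-9_].
STRICT = re.compile(r"^(?:(#.+?)\s+?)?([\w\t ]+)$", re.ASCII)

def clean_tweet(text):
    """Return the tweet without its leading #handle, or None for rejected entries."""
    m = STRICT.match(text)
    return m.group(2) if m else None

print(clean_tweet("#villos hey dude"))  # hey dude
print(clean_tweet("hey dude"))          # hey dude
```

As the answer warns, a tweet containing punctuation is also rejected by this pattern, so the character class may need widening for real data.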

How to test for specific characters with regex in VBA

I need to test for a string variable to ensure it matches a specific format:
XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
...where x can be any alphanumerical character (a - z, 0 - 9).
I've tried the following, but it doesn't seem to work (test values constantly fail)
If val Like "^([A-Za-z0-9_]{8})([-]{1})([A-Za-z0-9_]{4})([-]{1})([A-Za-z0-9_]{4})([-]{1})([A-Za-z0-9_]{4})([-]{1})([A-Za-z0-9_]{12})" Then
    MsgBox "OK"
Else
    MsgBox "FAIL"
End If
fnCheckSubscriptionID "fdda752d-32de-474e-959e-4b5bf7574436"
Any pointers? I don't mind if this can be achieved in vba or with a formula.
You are already using the ^ beginning-of-string anchor, which is terrific. You also need the $ end-of-string anchor, otherwise in the last group of digits, the regex engine is able to match the first 12 digits of a longer group of digits (e.g. 15 digits).
I rewrote your regex in a more compact way:
^[A-Z0-9]{8}-(?:[A-Z0-9]{4}-){3}[A-Z0-9]{12}$
Note these few tweaks:
[-]{1} can just be expressed with -
I removed the underscores as you say you only want letters and digits. If you do want underscores, instead of [A-Z0-9]{8} (for instance), you can just write \w{8} as \w matches letters, digits and underscores.
Removed the lowercase letters. If you do want to allow lowercase letters, we'll turn on case-insensitive mode in the code (see line 3 of the sample code below).
No need for (capturing groups), so removed the parentheses
We have three groups of four letters and a dash, so wrote (?:[A-Z0-9]{4}-) with a {3}
Sample code
Dim myRegExp, FoundMatch
Set myRegExp = New RegExp
myRegExp.IgnoreCase = True
myRegExp.Pattern = "^[A-Z0-9]{8}-(?:[A-Z0-9]{4}-){3}[A-Z0-9]{12}$"
FoundMatch = myRegExp.Test(SubjectString)
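To make the point about the $ anchor concrete, the sketch below (Python's re module rather than VBScript; the names ID_FORMAT, NO_END_ANCHOR and is_valid_id are illustrative only) compares the full pattern with a copy that lacks the end anchor.

```python
import re

# Full pattern: anchored at both ends, case-insensitive.
ID_FORMAT = re.compile(
    r"^[A-Z0-9]{8}-(?:[A-Z0-9]{4}-){3}[A-Z0-9]{12}$", re.IGNORECASE
)
# Same pattern without the end anchor, for comparison only.
NO_END_ANCHOR = re.compile(
    r"^[A-Z0-9]{8}-(?:[A-Z0-9]{4}-){3}[A-Z0-9]{12}", re.IGNORECASE
)

def is_valid_id(value):
    """True when value has the 8-4-4-4-12 alphanumeric layout and nothing else."""
    return ID_FORMAT.search(value) is not None

good = "fdda752d-32de-474e-959e-4b5bf7574436"
too_long = good + "abc"
print(is_valid_id(good))                           # True
print(is_valid_id(too_long))                       # False
print(NO_END_ANCHOR.search(too_long) is not None)  # True: trailing junk accepted
```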
You can do this either with a regular expression, or with just native VBA. I am assuming from your code that the underscore character is also valid in the string.
To do this with native VBA, you need to build up the LIKE string since quantifiers are not included. Also using Option Compare Text makes the "like" action case insensitive.
Option Explicit
Option Compare Text
Function TestFormat(S As String) As Boolean
    'Sections
    Dim S1 As String, S2_4 As String, S5 As String
    Dim sLike As String
    With WorksheetFunction
        S1 = .Rept("[A-Z0-9_]", 8)
        S2_4 = .Rept("[A-Z0-9_]", 4)
        S5 = .Rept("[A-Z0-9_]", 12)
        sLike = S1 & .Rept("-" & S2_4, 3) & "-" & S5
    End With
    TestFormat = S Like sLike
End Function
With regular expressions, the pattern is simpler to build, but the execution time may be longer, and that may make a difference if you are processing very large amounts of data.
Function TestFormatRegex(S As String) As Boolean
    Dim RE As Object
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .MultiLine = True
        .Pattern = "^\w{8}(?:-\w{4}){3}-\w{12}$"
        TestFormatRegex = .test(S)
    End With
End Function
Sub Test()
    MsgBox fnCheckSubscriptionID("XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX")
End Sub
Function fnCheckSubscriptionID(strCont)
    ' Tools - References - add "Microsoft VBScript Regular Expressions 5.5"
    With New RegExp
        .Pattern = "^\w{8}-\w{4}-\w{4}-\w{4}-\w{12}$"
        .Global = True
        .MultiLine = True
        fnCheckSubscriptionID = .Test(strCont)
    End With
End Function
In case of any problems with early binding you can use late binding With CreateObject("VBScript.RegExp") instead of With New RegExp.

How to change case of matching letter with a VBA regex Replace?

I have a column of lists of codes like the following.
2.A.B, 1.C.D, A.21.C.D, 1.C.D.11.C.D
6.A.A.5.F.A, 2.B.C.H.1
8.ABC.B, A.B.C.D
12.E.A, 3.NO.T
A.3.B.C.x, 1.N.N.9.J.K
I want to find all instances of two single upper-case letters separated by a period, but only those that follow a number less than 6. I want to remove the period between the letters and convert the second letter to lower case. Desired output:
2.Ab, 1.Cd, A.21.C.D, 1.Cd.11.C.D
6.A.A.5.Fa, 2.Bc.H.1
8.ABC.B, A.B.C.D
12.E.A, 3.NO.T
A.3.Bc.x, 1.Nn.9.J.K
I have the following code in VBA.
Sub fixBlah()
    Dim re As VBScript_RegExp_55.RegExp
    Set re = New VBScript_RegExp_55.RegExp
    re.Global = True
    re.Pattern = "\b([1-5]\.[A-Z])\.([A-Z])\b"
    For Each c In Selection.Cells
        c.Value = re.Replace(c.Value, "$1$2")
    Next c
End Sub
This removes the period, but doesn't handle the lower-case requirement. I know in other flavors of regular expressions, I can use something like
re.Replace("$1\L$2\E")
but this does not have the desired effect in VBA. I tried googling for this functionality, but I wasn't able to find anything. Is there a way to do this with a simple re.Replace() statement in VBA?
If not, how would I go about achieving this otherwise? The pattern matching is complex enough that I don't even want to think about doing this without regular expressions.
[I have a solution I worked up, posted below, but I'm hoping someone can come up with something simpler.]
Here is a workaround that uses the properties of each individual regex match to make the VBA Replace() function replace only the text from the match and nothing else.
Sub fixBlah2()
    Dim re As VBScript_RegExp_55.RegExp, Matches As VBScript_RegExp_55.MatchCollection
    Dim M As VBScript_RegExp_55.Match
    Dim tmpChr As String, pre As String, i As Integer
    Set re = New VBScript_RegExp_55.RegExp
    re.Global = True
    re.Pattern = "\b([1-5]\.[A-Z])\.([A-Z])\b"
    For Each c In Selection.Cells
        'Count of number of replacements made. This is used to adjust M.FirstIndex
        ' so that it still matches correct substring even after substitutions.
        i = 0
        Set Matches = re.Execute(c.Value)
        For Each M In Matches
            tmpChr = LCase(M.SubMatches.Item(1))
            If M.FirstIndex > 0 Then
                pre = Left(c.Value, M.FirstIndex - i)
            Else
                pre = ""
            End If
            c.Value = pre & Replace(c.Value, M.Value, M.SubMatches.Item(0) & tmpChr, _
                                    M.FirstIndex + 1 - i, 1)
            i = i + 1
        Next M
    Next c
End Sub
For reasons I don't quite understand, if you specify a start index in Replace(), the output starts at that index as well, so the pre variable is used to capture the first part of the string that gets clipped off by the Replace function.
So this question is old, but I do have another workaround. I use a double regex so to speak, where the first engine looks for the match as an execute, then I loop through each of those items and replace with a lowercase version. For example:
Sub fixBlah()
    Dim re As VBScript_RegExp_55.RegExp
    Dim ToReplace As Object
    Dim c As Range
    Dim LcaseVersion As String
    Dim ItemCt As Integer
    Set re = New VBScript_RegExp_55.RegExp
    re.Global = True
    For Each c In Selection.Cells
        re.Pattern = "\b([1-5]\.[A-Z])\.([A-Z])\b"
        Set ToReplace = re.Execute(c.Value)
        'This generates a list of items that match. Now drop the period and lowercase the second letter
        For ItemCt = 0 To ToReplace.Count - 1
            With ToReplace.Item(ItemCt)
                LcaseVersion = .SubMatches(0) & LCase(.SubMatches(1))
                'Look for that specific item (periods escaped, word boundaries kept) and replace it
                re.Pattern = "\b" & Replace(.Value, ".", "\.") & "\b"
            End With
            c.Value = re.Replace(c.Value, LcaseVersion)
        Next ItemCt
    Next c
End Sub
I hope this helps!
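For readers comparing engines: in flavors that accept a function as the replacement, the per-match transformation that both workarounds emulate is a one-liner. The Python sketch below (Python's re.sub, not the VBScript Replace method; fix_codes is a hypothetical name) applies the same pattern to the sample rows from the question.

```python
import re

# Same pattern as in the question: digit 1-5, period, letter, period, letter.
PATTERN = re.compile(r"\b([1-5]\.[A-Z])\.([A-Z])\b")

def fix_codes(text):
    """Drop the period between the two letters and lowercase the second letter."""
    return PATTERN.sub(lambda m: m.group(1) + m.group(2).lower(), text)

for row in ("2.A.B, 1.C.D, A.21.C.D, 1.C.D.11.C.D",
            "6.A.A.5.F.A, 2.B.C.H.1",
            "A.3.B.C.x, 1.N.N.9.J.K"):
    print(fix_codes(row))
```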