Regex Excel Remove Middle Part of String - regex

I have some address data that needs to be corrected. Each entry is intended to show a range of addresses, but this will not work for geocoding. What is an effective way to remove everything between the hyphen and the first space using regex in Excel? Example:
29-45 SICKLES ST
31-39 SHERMAN AV
36-44 ARDEN ST
118-22 NAGLE AV
Becomes
29 SICKLES ST
31 SHERMAN AV
36 ARDEN ST
118 NAGLE AV

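Since you have tagged RegEx, you could use it like this within Excel's VBA:
Sub Test()
    Dim arr As Variant, x As Long
    arr = Array("29-45 SICKLES ST", "31-39 SHERMAN AV", "36-44 ARDEN ST", "118-22 NAGLE AV")
    With CreateObject("VBScript.RegExp")
        ' Match the hyphen, the digits after it and any trailing whitespace
        .Pattern = "-\d*\s*"
        For x = LBound(arr) To UBound(arr)
            ' Put back a single space so the number and street name stay separated
            arr(x) = .Replace(arr(x), " ")
            Debug.Print arr(x)
        Next
    End With
End Sub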

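Alternatively, if your spreadsheet provides a REGEXREPLACE function (Google Sheets does, as do recent Microsoft 365 builds of Excel), create a new column with =REGEXREPLACE(current_column,"-\d+",""). The pattern removes only the hyphen and the digits after it, so the space before the street name is kept.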
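For checking the pattern outside Excel, here is a minimal Python sketch of the same idea (not VBA, and not part of either answer). The function name strip_range_end is made up for illustration, and it assumes the range always sits at the start of the address.

```python
import re

def strip_range_end(address: str) -> str:
    """Drop the '-NN' part of a leading house-number range."""
    # ^(\d+) keeps the first number; -\d+ is the part that gets removed.
    return re.sub(r"^(\d+)-\d+", r"\1", address)

for sample in ("29-45 SICKLES ST", "118-22 NAGLE AV"):
    print(strip_range_end(sample))
```

Anchoring at the start of the string means a hyphen later in the street name is left alone, which the unanchored patterns above would not guarantee.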

Related

Match n amount of words separated by commas after base text

I would like to match an infinite amount of words separated by commas and whitespaces.
Is there a better solution than just repeating the search parameter?
Sample:
"2_i Art des Problems:\s*(.[^,\s]+)[,]\s*(.[^,\s]+)[,]\s*(.[^,\s]+)"
2_i Art des Problems: Elektrisch, Schweißausrüstung, Burgenland
View on regex101: https://regex101.com/r/yP7PPO/1
Full code for this operation:
With Reg1
.Pattern = "2_i Art des Problems:+\s*([^\r\n]*\S)"
.Global = False
End With
If Reg1.Test(olMail.Body) Then
Set M1 = Reg1.Execute(olMail.Body)
End If
For Each M In M1
With xExcelApp
Select Case M.SubMatches
Case Software
Range("D6").Value = 1
Case Mechanisch
Range("E6").Value = 1
Case Elektrisch
Range("F6").Value = 1
Case Roboter
Range("G6").Value = 1
Case Schweißausrüstung
Range("H6").Value = 1
Case Anwendung
Range("I6").Value = 1
Case Ersatzteil
Range("J6").Value = 1
Case Else
Range("K6").Value = 1
End Select
End With
Next M
Does it really need to be a RegEx?
I think this is overcomplicating things, as this can easily be solved with Split():
Option Explicit
Public Sub Example()
Const TestString As String = "2_i Art des Problems: Elektrisch, Schweißausrüstung, Burgenland"
Const ConstantPart As String = "2_i Art des Problems: "
If Left$(TestString, Len(ConstantPart)) = ConstantPart Then
Dim Parts() As String
Parts = Split(Mid$(TestString, Len(ConstantPart) + 1), ", ")
Dim Part As Variant
For Each Part In Parts
Debug.Print Part
Next Part
End If
End Sub
Output is:
Elektrisch
Schweißausrüstung
Burgenland
If you really need to use a regexp, then set the global flag and use, for example, this pattern:
(.[^,\s]+)(,|$)
With regEx
    .Global = True
End With
Use .SubMatches to get the captured groups' values.
EDIT:
As one of the comments pointed out: "Then you still need to Trim the matches because they will include the spaces." (Pᴇʜ)
You can still use a regexp; move the leading character out of the capture group so that the space before each word is not captured:
.([^,\s]+)(,|$)
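The "global match" idea can be sketched in Python as follows (Python, not VBA; the function name problem_types is hypothetical). It first isolates the text after the fixed prefix using the pattern from the question, then collects every run of characters that is neither a comma nor whitespace. It assumes each item is a single word.

```python
import re

PREFIX = "2_i Art des Problems:"

def problem_types(body: str) -> list[str]:
    """Return the comma-separated words that follow PREFIX on the same line."""
    # Same shape as the question's pattern: prefix, optional spaces, rest of line.
    line = re.search(re.escape(PREFIX) + r"\s*([^\r\n]*\S)", body)
    if line is None:
        return []
    # findall plays the role of .Global = True: it returns every match.
    return re.findall(r"[^,\s]+", line.group(1))

print(problem_types("2_i Art des Problems: Elektrisch, Schweißausrüstung, Burgenland"))
```

Because the item pattern excludes whitespace, no trimming is needed afterwards, which addresses the concern raised in the comment.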

How can I extract a 6 digit number from the text in excel? Examples shown below

I want to extract a pincode from the address. For example, I want to extract 751003 from below address:
Siksha O Annushandhan University, Extension of Sum Hospital,Khandagiri,K-8,Bhubaneswar-751003,Odisha
and another example, I want to extract 799001 from below address:
Saha Drug Distributors; Santipara,Maszid Road,Agartala-799 001,Tripura
Assuming that your data starts in cell A2, try below formula.
=LOOKUP(1,1/MID(SUBSTITUTE(A2," ",""),ROW($A$1:$A$199),6),MID(SUBSTITUTE(A2," ",""),ROW($A$1:$A$199),6))
Note for OP: SO expects users to attempt a solution and to include a description of the attempt and the difficulty faced.
Edit: See below edit.
=LOOKUP(1,1/MID(SUBSTITUTE(A2," ","")&"a",ROW($A$1:$A$199),6),MID(SUBSTITUTE(A2," ",""),ROW($A$1:$A$199),6))
Try using regular expressions:
Sub ExtractCode()
Set regex = CreateObject("VBScript.RegExp")
' pattern explanation: \d{6} - match 6 digits
regex.Pattern = "\d{6}"
' Get address from cell A1 and remove all spaces
testString = Replace(Cells(1, 1), " ", "")
MsgBox regex.Execute(testString)(0).Value
' Get address from cell A2 and remove all spaces
testString = Replace(Cells(2, 1), " ", "")
MsgBox regex.Execute(testString)(0).Value
End Sub
Example I used
Saha Drug Distributors; Santipara,Maszid Road,Agartala-799 001,Tripura
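The same approach (remove the spaces, then take the first run of six digits) looks like this as a small Python sketch. It is Python and not VBA, and extract_pincode is a name invented here. One assumption to be aware of: removing spaces can join unrelated digit groups, so this only suits addresses shaped like the two examples.

```python
import re

def extract_pincode(address: str):
    """Return the first 6-digit run after removing spaces, or None."""
    compact = address.replace(" ", "")  # turns "799 001" into "799001"
    match = re.search(r"\d{6}", compact)
    return match.group(0) if match else None

print(extract_pincode("Saha Drug Distributors; Santipara,Maszid Road,Agartala-799 001,Tripura"))
```

Returning None when nothing matches avoids the runtime error the VBA version raises when it indexes item (0) of an empty match collection.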
You can use the following formulas to extract the pin code:
=MID(TRIM(A1),FIND(CHAR(1),SUBSTITUTE(TRIM(A1)," ",CHAR(1),LEN(TRIM(A1))-LEN(SUBSTITUTE(TRIM(A1)," ",""))))-3,7)
Or
=MID(A2,FIND("-",A2)+1,7)
Here is a rather easy approach considering the following data:
Sub GetVals()
Dim rng As Range, cl As Range
Dim lr As Long
With Sheet1 'Change accordingly
lr = .Cells(.Rows.Count, 1).End(xlUp).Row
Set rng = .Range("A1:A" & lr)
For Each cl In rng
cl.Offset(0, 1) = Val(Mid(cl, InStrRev(cl, "-") + 1, Len(cl)))
Next cl
End With
End Sub
Val returns the numeric value of the text that follows the last "-" (located with InStrRev). Because Val ignores spaces, "799 001" becomes 799001, which matches the wanted result in the question.

Manipulate string to extract address

I'm currently doing some work with a very large data source on city addresses where the data looks something like this.
137 is the correct address but it belongs in a building that takes up 135-138A on the street.
source:
137 9/F 135-138A KING STREET 135-138A KING STREET TOR
I've used a function that removes the duplicates (shown on ExtendOffice).
The second column has become this:
137 9/F 135-138A KING STREET TOR
What I want to do now is:
find the address number and put it in front of the street name
remove the numbers that are connected by the dash (135-138A), giving:
9/F 137 KING STREET TOR
What would be the best way to accomplish this?
The main problem I'm having is that many street names contain a varying number of spaces, e.g. "van dyke rd".
Is there any way I can locate the "-" in an array, set variables for the two numbers on either side of the dash, and replace that range with the correct address number located at the front?
Function RemoveDupes2(txt As String, Optional delim As String = " ") As String
Dim x
With CreateObject("Scripting.Dictionary")
.CompareMode = vbTextCompare
For Each x In Split(txt, delim)
If Trim(x) <> "" And Not .exists(Trim(x)) Then .Add Trim(x), Nothing
Next
If .Count > 0 Then RemoveDupes2 = Join(.keys, delim)
End With
End Function
Thanks
Regular Expressions are a way to (amongst other things) search for a feature in a string.
It looks like the feature you are looking for is: number:maybe some spaces : dash : maybe some spaces : number
In regex notation this would be expressed as:
([0-9]+)[ ]*-[ ]*([0-9]+)
Which translates to: Find a sequential group of one or more digits followed by zero or more spaces, then a dash, then zero or more spaces, then one or more digits. (With * in place of + the pattern would also match a dash that has no digits around it.)
The parentheses indicate the elements that will be returned, so you could assign variables to be the first number or the second number.
You might need to tweak this if a dash can potentially occur elsewhere in the address.
Further information on actually implementing that is available here: How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops
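To see what the two capture groups return, here is a short Python sketch (Python, not VBA). The name find_range is hypothetical, and the optional trailing letter ([A-Z]?) is an assumption added so that "138A" from the question's sample is captured whole.

```python
import re

# digits (+ optional letter), optional spaces, dash, optional spaces, digits (+ optional letter)
RANGE = re.compile(r"(\d+[A-Z]?)\s*-\s*(\d+[A-Z]?)")

def find_range(address: str):
    """Return (low, high) for the first number range, or None if absent."""
    match = RANGE.search(address)
    return match.groups() if match else None

print(find_range("137 9/F 135-138A KING STREET TOR"))
```

The two returned strings correspond to the first and second parenthesised groups, which in VBA would be read from SubMatches(0) and SubMatches(1).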
This handles the case you describe: it captures the two ends of the address range as separate groups, in case you want to process them further.
The current code simply removes the range altogether.
What is the logic for moving the 9/F to the front?
Function StripString(strIn As String) As String
Dim objRegex As Object
Set objRegex = CreateObject("vbscript.regexp")
With objRegex
.Pattern = "(\d+[A-C]?)-(\d+[A-C]?)"
If .test(strIn) Then
StripString = .Replace(strIn, vbNullString)
Else
StripString = "No match"
End If
End With
End Function
I'd just:
swap 1st and 2nd substrings
erase the substring with "-" in it
Function RemoveDupes2(txt As String, Optional delim As String = " ") As String
Dim x As Variant, arr As Variant, temp As Variant
Dim iArr As Long
With CreateObject("Scripting.Dictionary")
.CompareMode = vbTextCompare
For Each x In Split(txt, delim)
If Trim(x) <> "" And Not .exists(Trim(x)) Then .Add Trim(x), Nothing
Next
If .count > 0 Then
arr = .keys
temp = arr(0)
arr(0) = arr(1)
arr(1) = temp
For iArr = LBound(arr) To UBound(arr)
If InStr(arr(iArr), "-") <> 0 Then arr(iArr) = ""
Next
RemoveDupes2 = Join(arr, delim)
End If
End With
End Function
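The three steps of this answer (drop duplicates, swap the first two parts, erase the part containing "-") can be sketched in Python like this. It is Python and not VBA, tidy_address is an invented name, and it assumes the correct number is always the first token and the floor marker the second, as in the sample.

```python
def tidy_address(text: str) -> str:
    """De-duplicate tokens, swap the first two, and drop the number range."""
    seen = set()
    tokens = []
    for token in text.split():
        key = token.lower()  # case-insensitive, like vbTextCompare
        if key not in seen:
            seen.add(key)
            tokens.append(token)
    if len(tokens) >= 2:
        tokens[0], tokens[1] = tokens[1], tokens[0]
    # Filtering the range out avoids leaving an empty slot in the joined result.
    return " ".join(t for t in tokens if "-" not in t)

print(tidy_address("137 9/F 135-138A KING STREET 135-138A KING STREET TOR"))
```

One difference from the VBA above: setting an array element to "" and then calling Join leaves two consecutive spaces where the range used to be, whereas filtering the token out does not.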

VBA and RegEx matching arbitrary strings in Excel 2010

I need to extract the address and, potentially, the zip code as separate entities from the same line. The address line may or may not contain a zip code, and may or may not contain other unwanted strings. This is due to a bug in a web form, which is fixed, but the damage is already done to a set of elements.
Possible forms and results:
Address: Some address 251, 99302 Something Telephone: 555 6798 8473 -- Return "some address 251" and "99302 something" in separate strings. Comma may or may not be trailed by whitespace.
Address: Some address 251 -- Return "some address 251"
Address: Some address 251, 99302 -- Return "some address 251" and "99302". Again, comma may or may not be trailed by whitespace.
I have a basic understanding of how this could be done programmatically in VBA by iterating over the string and checking individual characters and substrings, but I feel like it will be time-consuming and not very robust afterwards. Or if it's robust, it would end up being huge because of all the possible variations.
I am struggling the most with how to form the regular expression(s) and possibly the conditionals to get the desired results.
This is part of a larger project, so I won't paste all the various code, but I am pulling mailitems from Outlook to analyze and dump relevant info into an Excel sheet. I have both the Outlook and Excel code working, but the logic that extracts information is a bit flawed.
Here are the new snippets I've been working on:
Function regexp(str As String, regP As String)
Dim rExp As Object, rMatch As Object
Set rExp = CreateObject("vbscript.regexp")
With rExp
.Global = False
.MultiLine = False
.IgnoreCase = True
.Pattern = regP
End With
Set rMatch = rExp.Execute(str)
If rMatch.Count > 0 Then
regexp = rMatch(0)
Else
regexp = vbNullString
Debug.Print "No match found!"
End If
End Function
Sub regexpAddress(str As String)
Dim result As String
Dim pattern As String
If InStr(str, "Telephone:") Then pattern = "/.+?(?=Telephone:)/"
result = regexp(str, pattern)
End Sub
I'm not sure how to form the regexps here. The one outlined should pull the right information (in 1 string instead of 2, but that's still an improvement), but only when the line contains the string "Telephone:", and I have a lot of cases where it won't contain that.
This is the current and somewhat flawed logic, which for some reason doesn't always yield the results I want:
For Each objMail In olFolder.Items
name = ""
address = ""
telephone = ""
email = ""
vIterations = vIterations + 1
arrBody = Split(objMail.body, Chr(10)) ' Split mail body when linebreak is encountered, throwing each line into its own array position
For i = 0 To UBound(arrBody)
arrLine = Split(arrBody(i), ": ") ' For each element (line), make new array, and if text search matches then write the 2nd half of the element to variable
If InStr(arrBody(i), "Name:") > 0 Then ' L2
name = arrLine(1) ' Reference 2nd column in array after the split
ElseIf InStr(arrBody(i), "Address:") > 0 Then
address = arrLine(1)
ElseIf InStr(arrBody(i), "Telephone:") > 0 Then
telephone = CLng(arrLine(1))
ElseIf InStr(arrBody(i), "Email:") > 0 Then
email = arrLine(1)
End If ' L2
Next
Next ' Next/end-for
This logic accepts and formats input of the following type:
Name: Joe
Address: Road
Telephone: 55555555555555
Email: joe#road.com
and returns joe, road, 55555 and joe#road.com to some defined Excel cells. This works fine when the mailitems are ordered as expected.
Problem: A bug led to my webform not inserting a line break after the address in some cases. The script still worked for the most part, but the mailitem contents sometimes ended up looking like this:
Name: Joe
Address: Road Telephone: 55555555555555
Email: joe#road.com
The address field was contaminated when it reached Excel ("Road Telephone" instead of just "Road"), but there was no loss of information. Which was acceptable, as it's easy to remove the surplus string.
But in the following case (no email is entered), the phone number is not only lost but is actually replaced by a phone number from some other, arbitrary mailitem and I can't FOR THE LIFE OF ME figure out (1) why it won't get the correct number, (2) why it jumps to a new mail item to find the phone number or (3) how it selects this other mailitem:
Name: Joe
Address: Road Telephone: 5555555555555
Email:
In Excel:
Name: Joe
Address: Road Telephone
Telephone: 8877445511
Email:
So, TL;DR: my selection logic is flawed, and being that it is so hastily hacked together, not to mention how it yields false information and I am unable to figure out how and why, I would like to do a better operation using some other solution (like regexp?) instead for a more robust code.
Not so long ago I had a similar problem.
Code may not be very professional, but it can be helpful :)
Could you check whether this code works correctly for you?
Function regexp(str As String, regP As String)
Dim rExp As Object, rMatch As Object
Set rExp = CreateObject("vbscript.regexp")
With rExp
.Global = False
.MultiLine = False
.IgnoreCase = True
.pattern = regP
End With
Set rMatch = rExp.Execute(str)
If rMatch.Count > 0 Then
regexp = rMatch(0)
Else
regexp = vbNullString
Debug.Print "No match found!"
End If
End Function
Function for_vsoraas()
For Each objMail In olFolder.Items
vIterations = vIterations + 1
objMail_ = Replace(objMail.body, Chr(10), " ")
Dim StringToSearch(3) As String
StringToSearch(0) = "Name:"
StringToSearch(1) = "Address:"
StringToSearch(2) = "Telephone:"
StringToSearch(3) = "Email:"
Dim ArrResults(4) As String 'name,address,telephone,email, zipcode
For i = 0 To UBound(StringToSearch)
ResultString = ""
StartString = InStr(objMail_, StringToSearch(i))
If StartString > 0 Then
If i = UBound(StringToSearch) Then 'last string to search, dont search EndString
ResultString = Mid(objMail_, StartString + Len(StringToSearch(i)) + 1) ' everything after the last label and the space that follows it
Else
EndString = 0
j = i
While (EndString = 0) 'prevent case no existing EndString
EndString = InStr(objMail_, StringToSearch(j + 1))
j = j + 1
If j = UBound(StringToSearch) And EndString = 0 Then
EndString = Len(objMail_) + 1
End If
Wend
ResultString = Mid(objMail_, StartString + Len(StringToSearch(i)) + 1, EndString - 1 - StartString - Len(StringToSearch(i)))
End If
ArrResults(i) = ResultString
End If
Next i
'search zipcode and address
ArrResults(4) = regexp(ArrResults(1), "\b(\d{5})\b")
ArrResults(1) = regexp(ArrResults(1), "([a-z ]{2,}\s{0,1}\d{0,3})")
'your variables
Name = ArrResults(0)
Address = ArrResults(1)
Telephone = ArrResults(2)
Email = ArrResults(3)
ZipCode = ArrResults(4)
Next ' Next/end-for
End Function
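The two patterns applied at the end of the function above can be tried in isolation. Below is a small Python sketch (Python, not VBA; split_address is an invented name) that runs them against the sample lines from the question. It assumes the "Address:" label has already been stripped, as the VBA does before calling regexp().

```python
import re

# Same patterns as in the VBA above; IGNORECASE mirrors .IgnoreCase = True.
ADR_PATTERN = re.compile(r"([a-z ]{2,}\s{0,1}\d{0,3})", re.IGNORECASE)
ZIP_PATTERN = re.compile(r"\b(\d{5})\b")

def split_address(field: str):
    """Return (street, zipcode); either part is '' when it is not found."""
    street = ADR_PATTERN.search(field)
    zipcode = ZIP_PATTERN.search(field)
    return (street.group(1) if street else "",
            zipcode.group(1) if zipcode else "")

print(split_address("Some address 251, 99302 Something Telephone: 555 6798 8473"))
print(split_address("Some address 251"))
```

Note that the zip pattern is tied to a fixed number of digits; a 4-digit variant would also match a 4-digit group inside the telephone number whenever the zip code is missing.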
I don't know if it was dumb luck or if I actually managed to learn some regex, but these patterns turn out to do exactly what I need.
' regex patterns - use flag /i
adrPattern = "([a-z ]{2,}\s{0,1}\d{0,3})" ' Select from a-z or space, case insensitive and at least 2 characters long, followed by optional space, ending with 0-3 digits
adrZipcode = "\b(\d{4})\b" ' Exactly 4 digits with a word boundary on each side, i.e. not directly next to another digit, a letter or an underscore
Edit: "Fixed" the telephone problem too. After spending 2 hours trying to write it in regex, and failing miserably, it dawned on me that solving the problem as a matter of faulty creation of the array had to be so much easier than treating it as a computational problem. And it was:
mailHolder = Replace(objMail.body, "Telephone:", Chr(10) + "Telephone:")
arrBody = Split(mailHolder, Chr(10))
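The trick above (force a line break in front of a label, then split on line breaks) generalises to all four labels. Here is a minimal Python sketch of that idea; it is Python and not VBA, parse_fields and LABELS are names made up for illustration, and it assumes a label's text never occurs inside a field value.

```python
LABELS = ("Name", "Address", "Telephone", "Email")

def parse_fields(body: str) -> dict:
    """Map each known label to its value, even when line breaks are missing."""
    # Put every known label on its own line, like the Replace() call above.
    for label in LABELS:
        body = body.replace(label + ":", "\n" + label + ":")
    fields = {}
    for line in body.splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip() in LABELS:
            fields[label.strip()] = value.strip()
    return fields

print(parse_fields("Name: Joe\nAddress: Road Telephone: 5555555555555\nEmail:"))
```

Building a fresh dictionary per mail item also means a field that is missing from one mail cannot silently keep a value from an earlier one.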

What is the RegExp Pattern to Extract Bullet Points Between Two Group Words using VBA in Word?

I can't seem to figure out the RegExp to extract the bullet points between two groups of words in a Word document.
For example:
Risk Assessment:
Test 1
Test 2
Test 3
Internal Audit
In this case I want to extract the bullet points between "Risk Assessment" and "Internal Audit", one bullet at a time, and assign each bullet to an Excel cell. As shown in the code below, I have pretty much everything done, except I can't figure out the correct Regex pattern. Any help would be great. Thanks in advance!
Sub PopulateExcelTable()
Dim fd As Office.FileDialog
Set fd = Application.FileDialog(msoFileDialogFilePicker)
With fd
.AllowMultiSelect = False
.Title = "Please select the file."
.Filters.Clear
.Filters.Add "Word 2007-2013", "*.docx"
If .Show = True Then
txtFileName = .SelectedItems(1)
End If
End With
Dim WordApp As Word.Application
Set WordApp = CreateObject("Word.Application")
Dim WordDoc As Word.Document
Set WordDoc = WordApp.Documents.Open(txtFileName)
Dim str As String: str = WordDoc.Content.Text ' Assign entire document content to string
Dim rex As New RegExp
rex.Pattern = "\b[^Risk Assessment\s].*[^Internal Audit\s]"
Dim i As long : i = 1
rex.Global = True
For Each mtch In rex.Execute(str)
Debug.Print mtch
Range("A" & i).Value = mtch
i = i + 1
Next mtch
WordDoc.Close
WordApp.Quit
End Sub
This is probably a long way around the problem but it works.
Steps I'm taking:
Find bullet list items using keywords before and after list in regexp.
(Group) regexp pattern so that you can extract everything in-between words.
Store listed items group into a string.
Split string by new line character into a new array.
Output each array item to excel.
Loop again since there may be more than one list in document.
Note: I don't see your code for a link to Excel workbook. I'll assume this part is working.
Dim rex As New RegExp
rex.Pattern = "(\bRisk Assessment\s)(.*)(Internal\sAudit\s)"
rex.Global = True
rex.MultiLine = True
rex.IgnoreCase = True
Dim lineArray() As String
Dim myMatches As Object
Set myMatches = rex.Execute(str)
For Each mtch In myMatches
'Debug.Print mtch.SubMatches(1)
lineArray = Split(mtch.SubMatches(1), vbLf)
For x = LBound(lineArray) To UBound(lineArray)
'Debug.Print lineArray(x)
Range("A" & i).Value = lineArray(x)
i = i + 1
Next
Next mtch
My test page looks like this:
Results from inner Debug.Print line return this:
Item 1
Item 2
Item 3
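The steps listed in this answer (match the text between the two keywords, take the middle group, split it into lines) can be sketched in Python as follows. This is Python and not VBA, bullets_between is an invented name, and the sample text assumes paragraphs are separated by carriage returns; the split accepts both \r and \n so either separator works.

```python
import re

def bullets_between(text: str, start: str = "Risk Assessment", end: str = "Internal Audit"):
    """Return the non-empty lines found between each start/end keyword pair."""
    # (.*?) is lazy so each list stops at the nearest end keyword;
    # DOTALL lets '.' run across line breaks.
    pattern = re.escape(start) + r":?\s*(.*?)\s*" + re.escape(end)
    items = []
    for block in re.findall(pattern, text, flags=re.DOTALL | re.IGNORECASE):
        items.extend(line.strip() for line in re.split(r"[\r\n]+", block) if line.strip())
    return items

print(bullets_between("Risk Assessment:\rTest 1\rTest 2\rTest 3\rInternal Audit\r"))
```

Each returned item could then be written to the next row, in the same way the inner loop above fills Range("A" & i).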