Excel Macro Unable to Separate String Address - regex

Software: MS Excel 2016
Update 1
Please note there can be any number of digits before West, i.e.
123124234234West18th Street
2West 14th Avenue
12324West
Please assist with general solution
Original Question
There is address, 31West 52nd Street I am trying to split the 31 and West so output will be
31 West 52nd Street
Tried this Macro statement but it won't work, please guide
Selection.Replace What:="?#West ", Replacement:=" West " _
, LookAt:=xlPart, SearchOrder:=xlByRows, MatchCase:=False, SearchFormat _
:=False, ReplaceFormat:=False

This is a sample of code, that would check for the first few chars. If they are digits, if would split them with a space from the rest:
Option Explicit
Public Sub TestMe()
Debug.Print fnStrStripMyNumber("31West 52nd Street")
Debug.Print fnStrStripMyNumber("123Vityata Shampion")
End Sub
Public Function fnStrStripMyNumber(strStr As String) As String
Dim lngCountDigits As Long
Dim lngCounter As Long
strStr = Trim(strStr)
For lngCounter = 1 To Len(strStr)
If IsNumeric(Mid(strStr, lngCounter, 1)) Then
lngCountDigits = lngCountDigits + 1
Else
Exit For
End If
Next lngCounter
strStr = Left(strStr, lngCountDigits) & " " & Right(strStr, Len(strStr) - lngCountDigits)
fnStrStripMyNumber = Trim(strStr)
End Function
Thus, from input:
"31West 52nd Street"
"123Vityata Shampion"
We get output:
31 West 52nd Street
123 Vityata Shampion

You can try this excel formula as well,
=LEFT(A1,FIND("West",A1)-1)&" "&RIGHT(A1,LEN(A1)-FIND("West",A1)+1)
Or if you want a macro only,
Sub rep()
Range("B1") = Replace(Range("A1"), "West", " West")
End Sub

Related

Regex extract string based on String match

I have this data with some messy addresses inside which contains sometimes not in order a Province, District, and ward :
Name ADDRESS
Store1 453, Duy Tan, Phuong Nguyen Nghiem, Thanh pho Quang Ngai
Store2 13 DUNG SY THANH KHE, P. THANH KHE TAY
Store3 98 Phan Xich Long- P. 2
Store4 306 B4, NGUYENVAN LINH, Ward - 5
Store5 22, Ngo 421/16, Tran Duy Hung, To 42, Phuong Trung Hoa, Quan Cau Giay
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
//Replace each \ with \\ so that C# doesn't treat \ as escape character
//Pattern: Start of string, any integers, 0 or 1 letter, end of word
string sPattern = "^[0-9]+([A-Za-z]\\b)?";
string sString = Row.ADDRESS ?? ""; //Coalesce to empty string if NULL
//Find any matches of the pattern in the string
Match match = Regex.Match(sString, sPattern, RegexOptions.IgnoreCase);
//If a match is found
if (match.Success)
//Return the first match into the new
//HouseNumber field
Row.ward= match.Groups[0].Value;
else
//If not found, leave the HouseNumber blank
Row.ward= "";
}
}
I would like to modify my regex formula to return the data like this in the column Ward. (you can see the synonyms in my addresses (Phuong,P.,ward,etc).
Name ADDRESS ward
Store1 453, Duy Tan, Phuong Nguyen Nghiem, Quang Ngai Phuong Nguyen Nghiem
Store2 13 DUNG SY THANH KHE, P. THANH KHE TAY Phuong THANH KHE TAY
Store3 98 Phan Xich Long- P. 2 Phuong 2
Store4 306 B4, NGUYENVAN LINH, Ward - 5 Phuong 5
Store5 22, Ngo 421/16,--. To 42, Phuong Trung Hoa, Quan Cau Giay Phuong Trung Hoa
I use that regex expression to extract the civic number, but is there a way with REGEX i can modifiu return the data in my column ward like in the example above?
The groups in this regex, as tested in https://regex101.com/, match the data in your column ward, as in your example. However, you may need to better define the patterns where each will appear since this regex only matches them as they appear in your example data. However, it may be enough for you to extrapolate and get the regex that you really need.
(Phuong.*),|P\.(.*$)|Ward - (.*$)
The group in option 1 matches from Phuong (inclusive) until the first comma.
The group in option 2 matches anything that comes after P. until the end of the string.
The group in option 3 matches anything that comes after Ward - until the end of the string.
This one is a bit more advanced, but it only matches what you mentioned in your examples, no groups:
Phuong.*(?=,)|(?<=P\.).*$|(?<=Ward - ).*$
Test it in https://regex101.com to see how it works and to see what each part means.
Finally, you may want to exclude Phuong from the match in option 1 on so that your program can always print Phuong and then the match.

Regex Excel Remove Middle Part of String

I have some address data that needs to be corrected. It is intending to show a range of addresses, but this will not work for geocoding. What is an effective way to to remove everything between the hyphen and the first space using regex in excel? Example:
29-45 SICKLES ST
31-39 SHERMAN AV
36-44 ARDEN ST
118-22 NAGLE AV
Becomes
29 SICKLES ST
31 SHERMAN AV
36 ARDEN ST
118 NAGLE AV
Since you have tagged RegEx you could use it like so within Excel's VBA:
Sub Test()
Dim arr As Variant: arr = Array("29-45 SICKLES ST", "31-39 SHERMAN AV", "36-44 ARDEN ST", "118-22 NAGLE AV")
With CreateObject("VBScript.RegExp")
.Pattern = "-\d*\s*"
For x = LBound(arr) To UBound(arr)
arr(x) = .Replace(arr(x), " ")
Next
End With
End Sub
You will create a new column by = REGEXREPLACE(current_column,"-\d+ ","")

How can I extract a 6 digit number from the text in excel? Examples shown below

I want to extract a pincode from the address. For example, I want to extract 751003 from below address:
Siksha O Annushandhan University, Extension of Sum Hospital,Khandagiri,K-8,Bhubaneswar-751003,Odisha
and another example, I want to extract 799001 from below address:
Saha Drug Distributors; Santipara,Maszid Road,Agartala-799 001,Tripura
Assuming that your data starts in cell A2, try below formula.
=LOOKUP(1,1/MID(SUBSTITUTE(A2," ",""),ROW($A$1:$A$199),6),MID(SUBSTITUTE(A2," ",""),ROW($A$1:$A$199),6))
Note for OP: SO Expects user to attempt a solution and include description of attempt and difficulty faced.
Edit: See below edit.
=LOOKUP(1,1/MID(SUBSTITUTE(A2," ","")&"a",ROW($A$1:$A$199),6),MID(SUBSTITUTE(A2," ",""),ROW($A$1:$A$199),6))
Try using regular expressions:
Sub ExtractCode()
Set regex = CreateObject("VBScript.RegExp")
' pattern explanation: \d{6} - match 6 digits
regex.Pattern = "\d{6}"
' Get address from cell A1 and remove all spaces
testString = Replace(Cells(1, 1), " ", "")
MsgBox regex.Execute(testString)(0).Value
' Get address from cell A2 and remove all spaces
testString = Replace(Cells(2, 1), " ", "")
MsgBox regex.Execute(testString)(0).Value
End Sub
Example I used
Saha Drug Distributors; Santipara,Maszid Road,Agartala-799 001,Tripura
U can use the following formulas to extract Pin Code:
=MID(TRIM(A1),FIND(CHAR(1),SUBSTITUTE(TRIM(A1)," ",CHAR(1),LEN(TRIM(A1))-LEN(SUBSTITUTE(TRIM(A1)," ",""))))-3,7)
Or
=MID(A2,FIND("-",A2)+1,7)
Here is a rather easy approach considering the following data:
Sub GetVals()
Dim rng As Range, cl As Range
Dim lr As Long
With Sheet1 'Change accordingly
lr = .Cells(.Rows.Count, 1).End(xlUp).Row
Set rng = .Range("A1:A" & lr)
For Each cl In rng
cl.Offset(0, 1) = Val(Mid(cl, InStrRev(cl, "-") + 1, Len(cl)))
Next cl
End With
End Sub
Val will get the numberic values from the position of the last "-" (found through InstrRev without the space (as your wanted result in the question shows).

Extracting specific words from a single cell containing text string

Basically I have a very long text containing multiple spaces, special characters, etc. in one cell in an excel file and I need to extract only specific words from it, each one to a seperate cell in another column.
What I'm looing for:
symbols that are always 9 characters in lenght, and always contain at least one number (up to 9).
So for an example in A1 I have:
euhe: djj33 dkdakofja. kaowdk ---------- jffjbrjjjj j jrjj 08/01/2222 999ABC123
fjfjfj 321XXX888 .... ........ 123456789AA
And in the end I want to have:
999ABC123 in B1
and
321XXX888 in B2.
Right now I'm doing this by using Text to columns feature and then just looking for specific words manually but sometimes the volume is so big it takes too much time and would be cool to automate this.
Can anyone help with this? Thank you!
EDIT:
More examples:
INPUT: '10/01/2016 1,060X 8.999%!!! 1.33 0.666 928888XE0'
OUTPUT: '928888XE0'
INPUT: 'ABCDEBATX ..... ,,00,001% 20///^^ addcA7 7777a 123456789 djaoij8888888 0.000001 12#'
OUTPUT: '123456789'
INPUT: 'FAR687465 B22222222 __ djj^66 20/20/20/20 1:'
OUTPUT: 'FAR687465' in B1 'B22222222' in B2
INPUT: 'fil476 .00 20/.. BUT AAAAAAAAA k98776 000.0001'
OUTPUT: 'blank'
To clarify: the 9 character string can be anywhere, there is no rule what is before or after them, they can be next to each other, or just at the beginning and end of this wall of text, no rules here, the text is random, taken out of some system, can contain dates, etc anything... The symbols are always 9 characters long and they are not the only 9 character symbols in the text. I call them symbols but they should only consist of numbers and letters. Can be only numbers, but never only letters. A1 cell can contain multiple spaces/tabs between words/symbols.
Also if possible to do this not only for A1, but the whole column A until it finds the first blank cell.
Try this code
Sub Test()
Dim r As Range
Dim i As Long
Dim m As Long
With CreateObject("VBScript.RegExp")
.Global = True
.Pattern = "\b[a-zA-Z\d]{9}\b"
For Each r In Range("A1", Range("A" & Rows.Count).End(xlUp))
If .Test(r.Value) Then
For i = 0 To .Execute(r.Value).Count - 1
If CBool(.Execute(r.Value)(i) Like "*[0-9]*") Then
m = IIf(Cells(1, 2).Value = "", 1, Cells(Rows.Count, 2).End(xlUp).Row + 1)
Cells(m, 2).Value = .Execute(r.Value)(i)
End If
Next i
End If
Next r
End With
End Sub
This bit of code is almost it... just need to check the strings... but excel crashes on the Str line of code
Sub Test()
Dim Outputs, i As Integer, LastRow As Long, Prueba, Prueba2
Outputs = Split(Range("A1"), " ")
For i = 0 To UBound(Outputs)
If Len(Outputs(i)) = 9 Then
Prueba = 0
Prueba2 = 0
On Error Resume Next
Prueba = Val(Outputs(i))
Prueba2 = Str(Outputs(i))
On Error GoTo 0
If Prueba <> 0 And Prueba2 <> 0 Then
LastRow = Range("B10000").End(xlUp).Row + 1
Cells(LastRow, 2) = Outputs(i)
End If
End If
Next i
End Sub
If someone could help to set the string check.. that would do the thing I guess.

Extract pattern from column

I am struggling with a huge Excel sheet (with 200K rows), where I need to extract from a certain column (B) list of all email addresses present in the string.
What I want to achieve:
Extract the email from string
convert (at) to # and (dot) to .
Save name and email in separate columns
Example of column B:
Shubhomoy Biswas <biswas_shubhomoy777(at)yahoo(dot)com>
Puneet Arora <ar.puneetarora(at)gmail(dot)com>
Anand Upadhyay <001.anand(at)gmail(dot)com>
Rajat Gupta <rajatgupta0889(at)gmail(dot)com>
Sarvesh Sonawane <sarvesh.s(at)suruninfocoresystems.
Although I want to be able to do it on Excel any other Windows-based utility suggestion would be helpful.
this can be done assuming they are all in the same format and only 1 email add per cell
=SUBSTITUTE(SUBSTITUTE(MID(B1,FIND("<",B1)+1,LEN(B1)-FIND("<",B1)-1),"(at)","#"),"(dot)",".")
Give this a try:
Sub splitter()
Dim r As Range, v As String
For Each r In Intersect(Range("B:B"), ActiveSheet.UsedRange)
v = r.Text
If v <> "" Then
ary = Split(v, " <")
r.Offset(0, 1).Value = ary(0)
r.Offset(0, 2).Value = Replace(Replace(Replace(ary(1), ">", ""), "(at)", "#"), "(dot)", ".")
End If
Next r
End Sub
This sub uses columns C and D for the output. Modify the code to suite your needs.
To extract the name, try =TRIM(LEFT(B1,FIND("<",B1)-1)). user3005775's answer works for the email.
You can also do this easily a regular expression (you'll need to add a reference to Microsoft VBScript Regular Expressions):
Private Sub ExtractEmailInfo(value As String)
Dim expr As New RegExp
Dim result As Object
Dim user As String
Dim addr As String
expr.Pattern = "(.+)(<.+>)"
Set result = expr.Execute(value)
If result.Count > 0 Then
user = result(0).SubMatches(0)
addr = result(0).SubMatches(1)
'Strip the < and >
addr = Mid$(addr, 2, Len(addr) - 2)
addr = Replace$(addr, "(at)", "#")
addr = Replace$(addr, "(dot)", ".")
End If
Debug.Print user
Debug.Print addr
End Sub
Replace the Debug.Print calls with whatever you need to do to place them in cells.
This does it for 200 K rows in less than 15 seconds:
Option Explicit
Sub extractPattern()
Dim ws As Worksheet, ur As Range, rng As Range, t As Double
Dim fr As Long, fc As Long, lr As Long, lc As Long
Set ws = Application.ThisWorkbook.Worksheets("Sheet1")
Set ur = ws.UsedRange
fr = 1
fc = 1
lr = ws.Cells(ur.Row + ur.Rows.Count + 1, fc).End(xlUp).Row
lc = ws.Cells(fr, ur.Column + ur.Columns.Count + 1).End(xlToLeft).Column
Set rng = ws.Range(ws.Cells(fr, fc), ws.Cells(lr, fc))
enableXL False
t = Timer
rng.TextToColumns Destination:=ws.Cells(fr, lc + 1), DataType:=xlDelimited, _
TextQualifier:=xlDoubleQuote, ConsecutiveDelimiter:=True, _
Space:=True
With ws.Columns(lc + 3)
.Replace What:="(at)", Replacement:="#", LookAt:=xlPart
.Replace What:="(dot)", Replacement:=".", LookAt:=xlPart
.Replace What:="<", Replacement:=vbNullString, LookAt:=xlPart
.Replace What:=">", Replacement:=vbNullString, LookAt:=xlPart
End With
ws.Range(ws.Cells(fr, lc + 1), ws.Cells(fr, lc + 3)).EntireColumn.AutoFit
Debug.Print "Total rows: " & lr & ", Duration: " & Timer - t & " seconds"
enableXL 'Total rows: 200,000, Duration: 14.4296875 seconds
End Sub
Private Sub enableXL(Optional ByVal opt As Boolean = True)
Application.ScreenUpdating = opt
Application.EnableEvents = opt
Application.Calculation = IIf(opt, xlCalculationAutomatic, xlCalculationManual)
End Sub
It places the new data in the first unused column at the end (splits the names as well)