Extract pattern from column - regex

I am struggling with a huge Excel sheet (with 200K rows), where I need to extract from a certain column (B) list of all email addresses present in the string.
What I want to achieve:
Extract the email from string
convert (at) to # and (dot) to .
Save name and email in separate columns
Example of column B:
Shubhomoy Biswas <biswas_shubhomoy777(at)yahoo(dot)com>
Puneet Arora <ar.puneetarora(at)gmail(dot)com>
Anand Upadhyay <001.anand(at)gmail(dot)com>
Rajat Gupta <rajatgupta0889(at)gmail(dot)com>
Sarvesh Sonawane <sarvesh.s(at)suruninfocoresystems.
Although I want to be able to do it on Excel any other Windows-based utility suggestion would be helpful.

this can be done assuming they are all in the same format and only 1 email add per cell
=SUBSTITUTE(SUBSTITUTE(MID(B1,FIND("<",B1)+1,LEN(B1)-FIND("<",B1)-1),"(at)","#"),"(dot)",".")

Give this a try:
Sub splitter()
Dim r As Range, v As String
For Each r In Intersect(Range("B:B"), ActiveSheet.UsedRange)
v = r.Text
If v <> "" Then
ary = Split(v, " <")
r.Offset(0, 1).Value = ary(0)
r.Offset(0, 2).Value = Replace(Replace(Replace(ary(1), ">", ""), "(at)", "#"), "(dot)", ".")
End If
Next r
End Sub
This sub uses columns C and D for the output. Modify the code to suite your needs.

To extract the name, try =TRIM(LEFT(B1,FIND("<",B1)-1)). user3005775's answer works for the email.

You can also do this easily a regular expression (you'll need to add a reference to Microsoft VBScript Regular Expressions):
Private Sub ExtractEmailInfo(value As String)
Dim expr As New RegExp
Dim result As Object
Dim user As String
Dim addr As String
expr.Pattern = "(.+)(<.+>)"
Set result = expr.Execute(value)
If result.Count > 0 Then
user = result(0).SubMatches(0)
addr = result(0).SubMatches(1)
'Strip the < and >
addr = Mid$(addr, 2, Len(addr) - 2)
addr = Replace$(addr, "(at)", "#")
addr = Replace$(addr, "(dot)", ".")
End If
Debug.Print user
Debug.Print addr
End Sub
Replace the Debug.Print calls with whatever you need to do to place them in cells.

This does it for 200 K rows in less than 15 seconds:
Option Explicit
Sub extractPattern()
Dim ws As Worksheet, ur As Range, rng As Range, t As Double
Dim fr As Long, fc As Long, lr As Long, lc As Long
Set ws = Application.ThisWorkbook.Worksheets("Sheet1")
Set ur = ws.UsedRange
fr = 1
fc = 1
lr = ws.Cells(ur.Row + ur.Rows.Count + 1, fc).End(xlUp).Row
lc = ws.Cells(fr, ur.Column + ur.Columns.Count + 1).End(xlToLeft).Column
Set rng = ws.Range(ws.Cells(fr, fc), ws.Cells(lr, fc))
enableXL False
t = Timer
rng.TextToColumns Destination:=ws.Cells(fr, lc + 1), DataType:=xlDelimited, _
TextQualifier:=xlDoubleQuote, ConsecutiveDelimiter:=True, _
Space:=True
With ws.Columns(lc + 3)
.Replace What:="(at)", Replacement:="#", LookAt:=xlPart
.Replace What:="(dot)", Replacement:=".", LookAt:=xlPart
.Replace What:="<", Replacement:=vbNullString, LookAt:=xlPart
.Replace What:=">", Replacement:=vbNullString, LookAt:=xlPart
End With
ws.Range(ws.Cells(fr, lc + 1), ws.Cells(fr, lc + 3)).EntireColumn.AutoFit
Debug.Print "Total rows: " & lr & ", Duration: " & Timer - t & " seconds"
enableXL 'Total rows: 200,000, Duration: 14.4296875 seconds
End Sub
Private Sub enableXL(Optional ByVal opt As Boolean = True)
Application.ScreenUpdating = opt
Application.EnableEvents = opt
Application.Calculation = IIf(opt, xlCalculationAutomatic, xlCalculationManual)
End Sub
It places the new data in the first unused column at the end (splits the names as well)

Related

How to extract phone number and split to column in Google sheet?

I am trying to extract phone numbers in this format (123) 456-7890 and put each number in a separate column with format 1234567890 with no space, dash or parenthesis.
I am able to achieve this in Excel using VBA code below from another StackOverflow question, but not able to get it working on Google sheet
Sub ewqre()
Dim str As String, n As Long, rw As Long
Dim rgx As Object, cmat As Object, ws As Worksheet
Set rgx = CreateObject("VBScript.RegExp")
Set ws = Worksheets("Sheet4")
With rgx
.Global = True
.MultiLine = True
'phone number pattern is: ###-###-####
.Pattern = "/^(\d{3})(\d{3})(\d{4})$/"
For rw = 2 To ws.Cells(Rows.Count, "A").End(xlUp).Row
str = ws.Cells(rw, "A").Value2
If .Test(str) Then
Set cmat = .Execute(str)
'populate the worksheet with the matches
For n = 0 To cmat.Count - 1
ws.Cells(rw, Columns.Count).End(xlToLeft).Offset(0, 1) = cmat.Item(n)
Next n
End If
Next rw
End With
Set rgx = Nothing: Set ws = Nothing
End Sub
=ARRAYFORMULA(IFERROR(REGEXEXTRACT(TO_TEXT(SPLIT(
REGEXREPLACE(A2:A, "\(|\)|\-| ", ""), CHAR(10))), "\d+")))
or:
=ARRAYFORMULA(REGEXREPLACE(REGEXEXTRACT(SPLIT(A2, CHAR(10)),
"\((.*)\(.T"),"\)|\s|\-", ""))
true array formula would be:
=ARRAYFORMULA(IFERROR(REGEXREPLACE(REGEXEXTRACT(SPLIT(A2:A, CHAR(10)),
"\((.*)\(.T"),"\)|\s|\-", "")))

Excel VBA RegEx that extracts numbers from price values in range (has commas, $ and -)

I have a field data extracted from a database which represents a range of values, but it's coming in Excel as a String format $86,000 - $162,000.
I need to extract the minimum value and the maximum value from each cell, so I need to extract the numeric portion of it, and ignore the $, - and the ,.
I've attached an image of the data I have, and the values I want to extract from it.
This is the closest pattern I got with RegEx, but I'ts not what I'm looking for.
Pattern = (\d+)(?:\.(\d{1,2}))?
Can anyone assist ?
Just wondering why Regex?
Function GetParts(priceRange As String) As Double()
Dim arr() As String
Dim parts() As Double
If InStr(1, priceRange, "-") > 0 Then
arr = Split(priceRange, "-")
ReDim parts(0 To UBound(arr))
Dim i As Long
For i = 0 To UBound(arr)
parts(i) = CDbl(Replace$(Replace$(Trim$(arr(i)), "$", ""), ",", ""))
Next i
End If
GetParts = parts
End Function
Sub test()
MsgBox GetParts("$14,000 - $1,234,567")(0) 'Minimum
End Sub
EDIT
Yet you could do this with regex to match the data string into the parts:
Function GetPartsRegEx(priceRange As String) As Variant
Dim arr() As Double
Dim pricePattern As String
pricePattern = "(\$?\d+[\,\.\d]*)"
'START EDIT
Static re As RegExp
If re Is Nothing Then
Set re = New RegExp
re.IgnoreCase = True
re.Global = True
re.Pattern = pricePattern & "\s*[\-]\s*" & pricePattern 'look for the pattern first
End If
Static nums As RegExp
If nums Is Nothing Then
Set nums = New RegExp
'to remove all non digits, except decimal point in case you have pennies
nums.Pattern = "[^0-9.]"
nums.Global = True
End If
'END EDIT
If re.test(priceRange) Then
ReDim arr(0 To 1) ' fill return array
arr(0) = CDbl(nums.Replace(re.Replace(priceRange, "$1"), ""))
arr(1) = CDbl(nums.Replace(re.Replace(priceRange, "$2"), ""))
Else
'do some error handling here
Exit Function
End If 'maybe throw error if no +ve test or
GetPartsRegEx = arr
End Function
Sub test()
MsgBox GetPartsRegEx("$1,005.45 - $1,234,567.88")(1)
End Sub
Here is quick Example Demo https://regex101.com/r/RTNlVF/1
Pattern "^\$(\d+\,\d+)\s\-\s\$(\d+\,\d+)"
Option Explicit
Private Sub Example()
Dim RegExp As New RegExp
Dim Pattern As String
Dim CelValue As String
Dim rng As Range
Dim Cel As Range
Set rng = ActiveWorkbook.Sheets("Sheet1" _
).Range("A2", Range("A9999" _
).End(xlUp))
For Each Cel In rng
DoEvents
Pattern = "^\$(\d+\,\d+)\s\-\s\$(\d+\,\d+)"
If Pattern <> "" Then
With RegExp
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = Pattern
End With
If RegExp.Test(Cel.Value) Then
' Debug.Print Cel.Value
Debug.Print RegExp.Replace(CStr(Cel), "$1")
Debug.Print RegExp.Replace(CStr(Cel), "$2")
End If
End If
Next
End Sub
Without a loop (but still no regex):
Sub Split()
With Columns("B:B")
.Replace What:="$", Replacement:=""
Application.CutCopyMode = False
.TextToColumns Destination:=Range("B1"), DataType:=xlDelimited, Other:=True, OtherChar:="-", FieldInfo:=Array(Array(1, 1), Array(2, 1))
End With
Columns("B:C").Insert Shift:=xlToRight
Columns("D:E").NumberFormat = "0"
Range("D1").FormulaR1C1 = "Min Value"
Range("E1").FormulaR1C1 = "Max Value"
With Range("D1:E1").Interior
.Pattern = xlSolid
.PatternColorIndex = xlAutomatic
.Color = 12611584
End With
With Range("D1:E1").Font
.ThemeColor = xlThemeColorDark1
.TintAndShade = 0
End With
End Sub
I made this function:
Hope it helps.
Code:
Function ExtractNumber(ByVal TextInput As String, _
Optional ByVal Position As Byte = 0, _
Optional ByVal Delimiter As String = "-") As Variant
' You can use this function in a subprocess that
' writes the values in the cells you want, or
' you can use it directly in the ouput cells
' Variables
Dim RemoveItems(2) As String
Dim Aux As Variant
' The variable RemoveItems is an array
' containing the characters you want to remove
RemoveItems(0) = "."
RemoveItems(1) = ","
RemoveItems(2) = " "
' STEP 1 - The variable Aux will store the text
' given as input
Aux = TextInput
' STEP 2 - Characters stored in the variable
' RemoveItems will be removed from Aux
For i = 0 To UBound(RemoveItems)
Aux = Replace(Aux, RemoveItems(i), "")
Next i
' STEP 3 - Once Aux is "clean", it will be
' transformed into an array containing the
' values separated by the delimiter
' As you can see at the function's header,
' Delimiter default value is "-". You can change
' it depending on the situation
Aux = Split(Aux, Delimiter)
' STEP 4 - The result of this function will be
' a numeric value. So, if the value of the
' selected position in Aux is not numeric it will
' remove the first character assuming it is a
' currency symbol.
' If something fails in the process the function
' will return "ERROR", so you can know you may
' verify the inputs or adjust this code for
' your needs.
On Error GoTo ErrHndl
If Not IsNumeric(Aux(Position)) Then
ExtractNumber = CLng(Mid(Aux(Position), 2))
Else
ExtractNumber = CLng(Aux(Position))
End If
Exit Function
ErrHndl:
ExtractNumber = "ERROR"
End Function
You can even do this with just worksheet formulas. Under certain circumstances, Excel will ignore the $ and ,. The double unary converts the returned string to a numeric value.
First Value: =--LEFT(A1,FIND("-",A1)-1)
Second Value: =--MID(A1,FIND("-",A1)+1,99)

Replace all characters in a String, unless they are within double quotes

I am sadly unfamiliar with regular expressions since I'm not a programmer, but I would guess this problem is easily solvable using regex (I am definitely open to other suggestions, though)
I want to use the split function to split the value of a cell and spread it out over multiple cells. The delimiter is a comma. The problem though is that some users use commas in comments for example, which the Split function uses to split the string mid-comment.
for example a cell containing the value:
0001,"name","address","likes apples, oranges
and plums"
needs to be split into multiple cells saying 0001 "name" "address" and "likes apples, oranges and plums".
my code splits the comment as well, and I want it to ignore the comment or everything else withing double quotes. here is a sample:
Sub SplittingStrings()
Dim wb As Workbook
Dim ws As Worksheet
Dim strInput As String
Dim counter As Integer
Dim cell As Variant
Dim splitCount As Integer
Dim splitString() As String
Dim category As Variant
Set wb = ActiveWorkbook
Set ws = wb.ActiveSheet
counter = 1
For Each cell In Range("A1", "A2000")
If cell.Value <> "" Then
strInput = cell.Value
splitCount = 2
splitString = Split(strInput, ",")
For Each category In splitString
Cells(counter, splitCount).Value = category
splitCount = splitCount + 1
Next category
End If
counter = counter + 1
Next cell
End Sub
how do I exclude stuff withing the double quotes from being considered by the split function?
Please give this a try and see if you get the desired output.
Tweak the variables if required.
Sub SplittingStringsUsingRegEx()
Dim lr As Long, c As Long
Dim Rng As Range, cell As Range
Dim RE, Match, Matches
Application.ScreenUpdating = False
lr = Cells(Rows.Count, 1).End(xlUp).Row
Set Rng = Range("A1:A" & lr)
Set RE = CreateObject("VBScript.RegExp")
With RE
.Global = True
.Pattern = "\d+|"".+?"""
End With
c = 2
For Each cell In Rng
If RE.test(cell.Value) Then
Set Matches = RE.Execute(cell.Value)
For Each Match In Matches
Cells(cell.Row, c) = Replace(Match, """", "")
c = c + 1
Next Match
End If
c = 2
Next cell
Application.ScreenUpdating = True
End Sub
Without Regex:
We need to "protect" commas that are encapsulated with double quotes:
Sub ProtectStuff()
Dim i As Long, N As Long, v As String, v2 As String
Dim ProtectMode As Boolean, DQ As String, rep As String
Dim CH As String, arr
DQ = """"
rep = Chr(1)
N = Cells(Rows.Count, "A").End(xlUp).Row
For i = 1 To N
v = Cells(i, "A").Value
If v <> "" Then
ProtectMode = False
v2 = ""
For j = 1 To Len(v)
CH = Mid(v, j, 1)
If CH = DQ Then ProtectMode = Not ProtectMode
If CH = "," And ProtectMode Then CH = rep
v2 = v2 & CH
Next j
End If
arr = Split(v2, ",")
j = 2
For Each a In arr
Cells(i, j) = Replace(a, rep, ",")
j = j + 1
Next a
Next i
End Sub
Text to Columns will do what you want, differently than the split function.

Trying to find " " in an array with regular expressions in VBA (office 2010)

I'm looking in an array, created from Split() of an email body, for spaces (" "), to ignore them.
I've tried array(i) = " " and srtComp (array(i), " ", vbTextCompare).
In debug mode I see that the array element is indeed " ", but it doesn't recognize it. Now I'm trying wiht regular expressions and my code looks like this:
Set reg = New RegExp
With reg
.IgnoreCase = True
.Global = True
.Pattern = " +"
End With
'~~> Write to excel
With oXLws
'
strBody = Split(olMail.Body)
For i = 0 To UBound(strBody)
If strBody(i) Like "*Hora:*" Then
i = i + 1
Set MyMatches = reg.Execute(strBody(i))
While MyMatches.Count <> 0
i = i + 1
Set MyMatches = reg.Execute(strBody(i))
Wend
.Range("B" & lRow).Value = strBody(i + 1)
i = i + 1
End If
Next i
'
End With
The MyMatches.Count <> 0 condition is not working either, it never enters the cicle.
Can anybody see what I'm doing wrong? Thanks in advance.
Unprintable characters aren't always spaces (ASCII 32). If it comes from an HTML email, it's probably ASCII 160. Try this sub to see what your single character array elements are, then you'll know what to look for. You'll have to figure out how to set olMail, but I assume you're doing that already.
Sub TestWhiteSpace()
Dim olMail As Outlook.MailItem
Dim vaBody As Variant
Dim i As Long
vaBody = Split(olMail.Body, Space(1))
For i = LBound(vaBody) To UBound(vaBody)
If Len(vaBody(i)) = 1 Then
Debug.Print i, Chr$(vaBody(i)), vaBody(i)
End If
Next i
End Sub

Separating strings from numbers with Excel VBA

I need to
a) separate strings from numbers for a selection of cells
and
b) place the separated strings and numbers into different columns.
For example , Excel sheet is as follows:
A1 B1
100CASH etc.etc.
The result should be:
A1 B1 C1
100 CASH etc.etc.
Utilization of regular expressions will be useful, as there may be different cell formats,such as 100-CASH, 100/CASH, 100%CASH. Once the procedure is set up it won't be hard to use regular expressions for different variations.
I came across a UDF for extracting numbers from a cell. This can easily be modified to extract string or other types of data from cells simply changing the regular expression.
But what I need is not just a UDF but a sub procedure to split cells using regular expressions and place the separated data into separate columns.
I've also found a similar question in SU, however it isn't VBA.
See if this will work for you:
UPDATED 11/30:
Sub test()
Dim RegEx As Object
Dim strTest As String
Dim ThisCell As Range
Dim Matches As Object
Dim strNumber As String
Dim strText As String
Dim i As Integer
Dim CurrCol As Integer
Set RegEx = CreateObject("VBScript.RegExp")
' may need to be tweaked
RegEx.Pattern = "-?\d+"
' Get the current column
CurrCol = ActiveCell.Column
Dim lngLastRow As Long
lngLastRow = Cells(1, CurrCol).End(xlDown).Row
' add a new column & shift column 2 to the right
Columns(CurrCol + 1).Insert Shift:=xlToRight
For i = 1 To lngLastRow ' change to number of rows to search
Set ThisCell = ActiveSheet.Cells(i, CurrCol)
strTest = ThisCell.Value
If RegEx.test(strTest) Then
Set Matches = RegEx.Execute(strTest)
strNumber = CStr(Matches(0))
strText = Mid(strTest, Len(strNumber) + 1)
' replace original cell with number only portion
ThisCell.Value = strNumber
' replace cell to the right with string portion
ThisCell.Offset(0, 1).Value = strText
End If
Next
Set RegEx = Nothing
End Sub
How about:
Sub UpdateCells()
Dim rng As Range
Dim c As Range
Dim l As Long
Dim s As String, a As String, b As String
''Working with sheet1 and column C
With Sheet1
l = .Range("C" & .Rows.Count).End(xlUp).Row
Set rng = .Range("C1:C" & l)
End With
''Working with selected range from above
For Each c In rng.Cells
If c <> vbNullString Then
s = FirstNonNumeric(c.Value)
''Split the string into numeric and non-numeric, based
''on the position of first non-numeric, obtained above.
a = Mid(c.Value, 1, InStr(c.Value, s) - 1)
b = Mid(c.Value, InStr(c.Value, s))
''Put the two values on the sheet in positions one and two
''columns further along than the test column. The offset
''can be any suitable value.
c.Offset(0, 1) = a
c.Offset(0, 2) = b
End If
Next
End Sub
Function FirstNonNumeric(txt As String) As String
With CreateObject("VBScript.RegExp")
.Pattern = "[^0-9]"
FirstNonNumeric = .Execute(txt)(0)
End With
End Function