How do I extract only 5-digit strings from cells in Excel? - regex

I have a bunch of data that contains any number of 5-digit strings in completely inconsistent formats, and I want to extract these 5-digit strings. I am not interested in strings containing fewer or more than 5 digits. As an example, this is the kind of data I have in my file:
Cell A1: "1. 76589 - wholesale activities. 2. 33476 - general"
Cell A2: "WHOLESALE ACTIVITIES (76589). SHIPPING (12235). REAL ESTATE ACTIVITIES (67333)"
Cell A3: "1. 33476 General. 658709 annual road. Unknown 563"
I've tried the usual SEARCH/FIND, MIN, LEFT/RIGHT/MID functions, but am not sure how to get them to produce the result I need, and even Text to Columns wasn't giving me a clean result.
Thanks in advance.

Here is a macro that will split your line into the columns as you requested.
The range being processed is whatever you have selected.
The results are written into the adjacent columns on the same row.
Depending on your worksheet setup, you may want to "clear out" the rows where the results are going before executing the extraction code.
You can also write code to select the data to be processed automatically. Plenty of examples on this forum.
Option Explicit

Sub Extract5Digits()
    Dim R As Range, C As Range
    Dim RE As Object, MC As Object, M As Object
    Dim I As Long

    Set R = Selection
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True              'find every match, not just the first
        .Pattern = "\b\d{5}\b"      'exactly five digits between word boundaries
        For Each C In R
            If .test(C.Text) = True Then
                I = 0
                Set MC = .Execute(C.Text)
                For Each M In MC
                    I = I + 1
                    C.Offset(0, I) = M  'write the Ith match I columns to the right
                Next M
            End If
        Next C
    End With
End Sub
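To sanity-check the pattern outside Excel, here is a minimal Python sketch that applies the same `\b\d{5}\b` expression to the three sample cells from the question. Python's `re` module is used only as a stand-in for `VBScript.RegExp`, and the helper name `extract_five_digit` is made up for illustration.

```python
import re

# Same pattern as in the macro: exactly five digits between word boundaries.
FIVE_DIGITS = re.compile(r"\b\d{5}\b")


def extract_five_digit(text):
    """Return every standalone 5-digit string found in text, in order."""
    return FIVE_DIGITS.findall(text)


# Sample cells copied from the question.
cells = [
    "1. 76589 - wholesale activities. 2. 33476 - general",
    "WHOLESALE ACTIVITIES (76589). SHIPPING (12235). REAL ESTATE ACTIVITIES (67333)",
    "1. 33476 General. 658709 annual road. Unknown 563",
]

for cell in cells:
    print(extract_five_digit(cell))
```

The word boundaries are what keep the 6-digit 658709 and the 3-digit 563 out of the results.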

With Excel's built-in functions alone this is impractical.
The best way is to use the Microsoft VBScript Regular Expressions 5.5 library in VBA.
Let's consider this example:
+---+--------------------------------------------------------------+
| | A |
+---+--------------------------------------------------------------+
| 1 | Cell A3: "1. 33476 General. 658709 annual road. Unknown 563" |
| 2 | 33476 |
+---+--------------------------------------------------------------+
From the Excel file, press Alt + F11, then go to Tools > References and select "Microsoft VBScript Regular Expressions 5.5".
Then you can use the following function definition:
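Public Function Get5DigitsNumer(search_str As String) As String
    Dim regEx As New VBScript_RegExp_55.RegExp
    Dim matches As VBScript_RegExp_55.MatchCollection

    Get5DigitsNumer = ""
    'word boundaries skip 5-digit runs inside longer numbers such as 658709
    regEx.Pattern = "\b[0-9]{5}\b"
    regEx.Global = True
    If regEx.test(search_str) Then
        Set matches = regEx.Execute(search_str)
        'the pattern has no capture group, so read the whole match via .Value
        Get5DigitsNumer = matches(0).Value
    End If
End Function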
You can then call it like this:
Sub PatternExtractor()
    Range("A2").Value = Get5DigitsNumer(Range("A1"))
End Sub
This takes the value of cell A1, extracts the first 5-digit number, and saves the result in cell A2.
At the moment I don't know how to make this work when the same cell contains more than one 5-digit number, as in "1. 76589 - wholesale activities. 2. 33476 - general" from your example.
I suggest you have a look at this answer. The pattern is different, but the question is very similar to yours.
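For the case of several 5-digit numbers in one cell, one option is to collect every match and join them into a single string. The sketch below shows that idea in Python rather than VBA, purely as an illustration; `all_five_digit` and its `sep` parameter are hypothetical names. It also shows why the pattern should carry word boundaries: without them, a 6-digit number yields a false 5-digit match.

```python
import re

BOUNDED = r"\b[0-9]{5}\b"   # five digits that stand alone
UNBOUNDED = r"[0-9]{5}"     # any run of five digits, even inside a longer number


def all_five_digit(text, sep=", "):
    """Join every standalone 5-digit number in text into one string."""
    return sep.join(re.findall(BOUNDED, text))


a1 = "1. 76589 - wholesale activities. 2. 33476 - general"
a3 = "1. 33476 General. 658709 annual road. Unknown 563"

print(all_five_digit(a1))          # both numbers from A1, comma separated
print(re.findall(UNBOUNDED, a3))   # includes a false match cut out of 658709
print(re.findall(BOUNDED, a3))     # only the genuine 5-digit number
```

In VBA the equivalent would be to loop over the whole match collection instead of reading only its first item.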

The only way to do this is by writing a regex in VBA. I would recommend looking at this question.

Related

Optional parts of regex pattern in vba

I am trying to build a regex pattern for text like this:
numb contra: 1.29151306 number mafo: 66662308
numb contra 1.30789668number mafo 60.046483
numb contra/ 1.29154056 number mafo: 666692638
numb contra 137459625
mafo: 666692638
mafo: 666692638 numb contra/ 1.29154056
Here's the pattern I could build
contra?.\s+?(\d+\.?\d+)(.+mafo.?\s+(\d+\.?\d+))?
It works fine for all the lines except the last one. How can I cover all the possibilities, including the last line?
Please have a look at this link:
https://regex101.com/r/pSThAU/1
Everything works for contra but not for mafo.
I think the key here is to make your regexp do less and your VBA do more. What I see here is either the word 'mafo' or 'contra' with a number following. I don't know in what order they appear, whether each is present, or how many times. So you can scan each of your strings for ALL occurrences with a regexp like this:
(?:^|[^A-Z])(?:(mafo)|(contra))[^A-Z]\s*(\d*\.?\d+)
Then process it with some VBA code like this that I created in Excel:
Sub BreakItUp()
    Dim rg As RegExp, scanned As MatchCollection, eachMatch As Match, i As Long, col As Long
    Set rg = New RegExp
    rg.Pattern = "(?:^|[^A-Z])(?:(mafo)|(contra))[^A-Z]\s*(\d*\.?\d+)"
    rg.IgnoreCase = True
    rg.Global = True
    i = 1
    Do While (Not IsEmpty(ActiveSheet.Cells(i, 1).Value))
        Set scanned = rg.Execute(ActiveSheet.Cells(i, 1).Value)
        col = 2
        For Each eachMatch In scanned
            ActiveSheet.Cells(i, col).Value = eachMatch.SubMatches(0) & eachMatch.SubMatches(1)
            ActiveSheet.Cells(i, col + 1).Value2 = "'" & eachMatch.SubMatches(2)
            col = col + 2
        Next eachMatch
        i = i + 1
    Loop
End Sub
That MatchCollection object will get one item for each match that occurs, and the SubMatches array contains each capturing group. You should be able to write your own logic within this processing loop to interpret what was extracted. When I ran it on your data it created all the fields in blue:
Notice I added a line to your data that had two contra entries and one mafo and it found all the occurrences. You should be able to modify this to interpret the meanings.
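As a quick way to see what the pattern captures, the following Python sketch runs the same expression, case-insensitively, over sample lines from the question and returns (label, number) pairs. It is only an illustration of the "scan for all occurrences, then interpret in code" idea; `label_value_pairs` is a made-up helper name and Python's `re` stands in for the VBA `RegExp` object.

```python
import re

# Same expression as in the VBA code; group 1 is 'mafo', group 2 is 'contra',
# group 3 is the number that follows.
PATTERN = re.compile(
    r"(?:^|[^A-Z])(?:(mafo)|(contra))[^A-Z]\s*(\d*\.?\d+)",
    re.IGNORECASE,
)


def label_value_pairs(line):
    """Return a list of (label, number_text) tuples in order of appearance."""
    return [(m.group(1) or m.group(2), m.group(3)) for m in PATTERN.finditer(line)]


lines = [
    "numb contra: 1.29151306 number mafo: 66662308",
    "numb contra 1.30789668number mafo 60.046483",
    "numb contra 137459625",
    "mafo: 666692638 numb contra/ 1.29154056",
]

for line in lines:
    print(label_value_pairs(line))
```

Because each match carries its own label, the order of mafo and contra in the line no longer matters.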

VBA and regexp - identify numbers stored as text

I have some data that looks like this (more than 400 columns) :
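+------+----+-----------+-----------+---------+
| year | ID | fake_num1 | fake_num2 | text1   |
+------+----+-----------+-----------+---------+
| 2019 | 11 | 36 000    | 10'000    | text, 1 |
| 2020 | 12 | -1 275    | 1 000,00  | text 2  |
+------+----+-----------+-----------+---------+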
Columns fake_num1 and fake_num2 are stored as text. What I'm trying to achieve is:
1. Identify those fake-number columns
2. Clean the data (e.g. remove spaces, columns, replace commas by points) with a for loop
I need some help with step 1. I have to identify columns fake_num1 and fake_num2, while avoiding columns like text1. I was thinking of going with regexp but maybe there is another solution.
I used part of the code here: SO regexp, however I am not sure how to proceed from there.
Dim strPattern as String: strPattern = "^[0-9]$"
will find anything that starts and ends with a number, and only has numbers (if my comprehension is correct). What's the best way to manage the cases listed in the table above ?
Please try the following code. It treats as "fake number columns" those where replacing the offending characters turns the string into a number:
Sub testMakeNumbers()
    Dim sh As Worksheet, lastR As Long, lastCol As Long, i As Long, rngCol As Range
    Set sh = ActiveSheet 'you can use here the necessary sheet
    lastR = sh.Range("A" & sh.Rows.Count).End(xlUp).Row
    lastCol = sh.Cells(1, sh.Columns.Count).End(xlToLeft).Column
    'determine the problematic columns (judged by the first data row):
    For i = 1 To lastCol
        If Not IsNumeric(sh.Cells(2, i).Value) And _
           IsNumeric(Replace(Replace(Replace(sh.Cells(2, i).Value, " ", ""), "'", ""), ",", ".")) Then
            If rngCol Is Nothing Then
                Set rngCol = sh.Cells(2, i)
            Else
                Set rngCol = Union(rngCol, sh.Cells(2, i))
            End If
        End If
    Next
    'nothing to clean if no such column was found:
    If rngCol Is Nothing Then Exit Sub
    'replace the characters so that the strings become numbers:
    With Intersect(rngCol.EntireColumn, sh.Range("A2", sh.Cells(lastR, lastCol)))
        .Replace ",", "."
        .Replace Chr(160), ""
        .Replace " ", ""
        .Replace "'", ""
    End With
End Sub
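The detection rule used above (not numeric as stored, but numeric once spaces, apostrophes and decimal commas are normalised) can be sketched in a few lines of Python. This is an illustration only: `float()` is a rough stand-in for VBA's `IsNumeric` and the two do not accept exactly the same inputs, and the function names are hypothetical.

```python
def is_number(text):
    """Rough stand-in for IsNumeric: True if float() can parse the text."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def clean_number_text(text):
    """Drop spaces, non-breaking spaces and apostrophes; turn ',' into '.'."""
    for ch in (" ", "\u00a0", "'"):
        text = text.replace(ch, "")
    return text.replace(",", ".")


def is_fake_number(text):
    """True for text that only becomes a number after cleaning."""
    return not is_number(text) and is_number(clean_number_text(text))


# Values from the sample table in the question.
for value in ["2019", "36 000", "10'000", "-1 275", "1 000,00", "text, 1", "text 2"]:
    print(repr(value), is_fake_number(value))
```

As a side note on the pattern in the question: `^[0-9]$` matches a string consisting of exactly one digit; a digits-only string of any length would need `^[0-9]+$`.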

Reverse string search in Excel

I'm trying to get Column F/VENDOR # to show the vendor number only. The vendor numbers are highlighted. My strategy is: from the right, find the third "_" and substitute it with a "|". Then anything right of the pipe is populated in column D.
However the ones with more than three "_" are not following the logic. What am I doing wrong?
Column D formula =IF(ISERROR(FIND("_",C2)),"",RIGHT(C2,LEN(C2)-FIND("|",SUBSTITUTE(C2,"_","|",LEN(C2)-LEN(SUBSTITUTE(C2,"_","",3))))))
Column F/Vendor# formula =IF(ISERROR(LEFT(D2,FIND("_",D2)-1)),"",LEFT(D2,FIND("_",D2)-1))
The issue is in the column D formula - you have:
...LEN(C2)-LEN(SUBSTITUTE(C2,"_","",3))...
It should be:
...LEN(C2)-LEN(SUBSTITUTE(C2,"_",""))-2...
Giving a full formula for column D of:
=IF(ISERROR(FIND("_",A17)),"",RIGHT(A17,LEN(A17)-FIND("|",SUBSTITUTE(A17,"_","|",LEN(A17)-LEN(SUBSTITUTE(A17,"_",""))-2))))
The reason is that this part of the formula is really being used to calculate the instance number passed to the outer SUBSTITUTE function. With an unknown number of underscores in the string you need a relative offset: the total number of underscores minus 2 is the instance number of the third underscore from the right.
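The index arithmetic may be easier to follow in code. The Python sketch below mirrors the two formulas: count the separators, subtract 2 to get the instance number of the third one from the right, cut there, then keep everything up to the next separator. The sample strings are invented for illustration (the original data was only shown in a picture), and the function names are hypothetical.

```python
def right_of_nth_sep_from_right(text, sep="_", n=3):
    """Return the part of text to the right of the n-th sep counted from the right."""
    total = text.count(sep)
    instance = total - (n - 1)      # for n = 3 this is total - 2
    if instance < 1:                # fewer than n separators in the string
        return ""
    pos = -1
    for _ in range(instance):       # walk to the instance-th separator from the left
        pos = text.find(sep, pos + 1)
    return text[pos + 1:]


def vendor_number(text):
    """Column D step followed by the column F step: keep the first field of the tail."""
    tail = right_of_nth_sep_from_right(text)
    return tail.split("_", 1)[0]


print(vendor_number("ACME_CORP_INVOICE_12345_2020_Q1"))  # five underscores
print(vendor_number("ACME_777_2020_Q1"))                 # exactly three underscores
```

Using a fixed instance number of 3 would only work for strings with exactly five underscores, which is why the relative offset is needed.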
If you can use VBA, then you should look at using a UDF with regular expressions, as I feel this is slightly less complex than the double-formula method, which is not trivial to step through. The UDF could simply be this:
Option Explicit

Function GetVendorNumber(rng As Range) As String
    Dim objRegex As Object
    Dim objMatches As Object

    GetVendorNumber = ""
    Set objRegex = CreateObject("VBScript.RegExp")
    With objRegex
        .Pattern = "\D+_(\d+)_.+"
        Set objMatches = .Execute(rng.Text)
        If objMatches.Count = 1 Then
            GetVendorNumber = objMatches(0).SubMatches(0)
        End If
    End With
End Function
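To try the UDF's pattern without opening Excel, here is a small Python sketch using the same expression. The input strings are invented for illustration, `get_vendor_number` is a hypothetical name, and Python's `re.search` stands in for a non-global `Execute` call.

```python
import re

# Same pattern as in the UDF: a non-digit prefix, '_', the captured digits, '_', the rest.
VENDOR = re.compile(r"\D+_(\d+)_.+")


def get_vendor_number(text):
    """Return the first all-digit field that follows a non-digit prefix, or ''."""
    match = VENDOR.search(text)
    return match.group(1) if match else ""


print(get_vendor_number("ACME_CORP_INVOICE_12345_2020_Q1"))
print(get_vendor_number("ACME_777_2020_Q1"))
print(get_vendor_number("no vendor here"))
```

Note that this picks the first all-digit field after a non-digit prefix rather than counting underscores from the right, so the two approaches can disagree if the name part itself contains an all-digit field.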

Visual Basic Multiple Regex inserting into DataGridView

Below is part of my code for a small webscraping project I'm working on. I'm having an issue inserting the values of the second Regex into the datagrid and have them line up with the values of the first regex. Each page will have a list of ten items, each with a unique ID along with a format and a rating. When the code is run, it only adds the formatID data to the first ten rows.
Dim r As New System.Text.RegularExpressions.Regex("<div id=""R.*"" class=""a-section review"">")
Dim Matches As MatchCollection = r.Matches(SourceCode)
For Each ItemID As Match In Matches
    DataGridView1.Rows.Add("", Split(ItemID.Value, """").GetValue(1), AsinTextBox.Text, "", "", "", "")
Next
Dim R2 As New System.Text.RegularExpressions.Regex("<a class=""a-size-mini a-link-normal a-color-secondary"" href=""/.*/product-reviews/.*/ref=.*"">.*</a>")
Dim Matches2 As MatchCollection = R2.Matches(SourceCode)
Dim Z2
Dim i As Integer = 0
For Each FormatID As Match In Matches2
    i = i + 1
    Z2 = Split(FormatID.Value, ">").GetValue(1)
    Z2 = Split(Z2, "<").GetValue(0)
    DataGridView1.Rows(i).Cells(5).Value = (Z2)
Next
I figured it out. Instead of doing multiple Regex searches, I did one larger one that incorporated all three that I was trying to do previously. The three regex searches from before were all part of one large string of HTML, and after I had searched for it I used the Split function to get the correct value for Z2.
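The idea of one combined search can be sketched as follows. This is not the poster's actual code: it is a Python illustration that uses capture groups instead of Split, and the HTML fragment is invented, so the IDs, link targets and format text are all hypothetical. The point is that each match yields the ID and the format together, so the values stay aligned row by row.

```python
import re

# Hypothetical HTML fragment, invented for illustration only.
SOURCE = (
    '<div id="R1AAA" class="a-section review">'
    '<a class="a-size-mini a-link-normal a-color-secondary" '
    'href="/item/product-reviews/B000TEST/ref=x">Format: Paperback</a></div>'
    '<div id="R2BBB" class="a-section review">'
    '<a class="a-size-mini a-link-normal a-color-secondary" '
    'href="/item/product-reviews/B000TEST/ref=y">Format: Kindle</a></div>'
)

# One pattern for the whole review block: group 1 is the review ID,
# group 2 is the text of the format link that follows it.
REVIEW = re.compile(
    r'<div id="(R[^"]*)" class="a-section review">'
    r'.*?<a class="a-size-mini a-link-normal a-color-secondary" href="[^"]*">'
    r'([^<]*)</a>',
    re.DOTALL,
)

rows = REVIEW.findall(SOURCE)   # one (review_id, format_text) tuple per review
for review_id, format_text in rows:
    print(review_id, format_text)
```

One caveat of this sketch: if a review block has no format link, the lazy `.*?` would run on into the next block, so real pages may need a stricter pattern or an HTML parser.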

Split one table's column into two columns based on value

I have a table with many rows. Its structure is like this picture:
As you can see, I have "or" and "and" between the names in column A. How can I split this column into two parts? In that case I will have David, Tylor, Fred, Jessi, Roland in the first column and Peter, Mark, Alfered, Hovard and David in the second.
Note: Please pay attention to rows 2 and 5. In these rows I have two "or" or two "and".
Edit: I prefer to do that in Excel.
What I Have Tried
As one possible solution, I have this function in VBA.
Function udfRegEx(CellLocation As Range, RegPattern As String) As String
    Dim RegEx As Object, RegMatchCollection As Object, RegMatch As Object
    Dim OutPutStr As String

    Set RegEx = CreateObject("vbscript.regexp")
    With RegEx
        .Global = True
        .Pattern = RegPattern
    End With

    'concatenate every match found in the cell
    OutPutStr = ""
    Set RegMatchCollection = RegEx.Execute(CellLocation.Value)
    For Each RegMatch In RegMatchCollection
        OutPutStr = OutPutStr & RegMatch
    Next
    udfRegEx = OutPutStr

    Set RegMatchCollection = Nothing
    Set RegEx = Nothing
End Function
This function uses regex, but I don't know how to use it.
As I mentioned, you do not need VBA for this. An Excel formula will also do what you need.
My Assumptions
Col A has the data
You want the output in Col B and Col C
Paste this formula in Cell B1 and copy it down
=IF(ISERROR(SEARCH(" or ",A1,1))=TRUE,IF(ISERROR(SEARCH(" and ",A1,1))=TRUE,"",LEFT(A1,SEARCH(" and ",A1,1))),LEFT(A1,SEARCH(" or ",A1,1)))
and this in Cell C1 and copy it down
=IF(ISERROR(SEARCH(" or ",A1,1))=TRUE,IF(ISERROR(SEARCH(" and ",A1,1))=TRUE,"",MID(A1,SEARCH(" and ",A1,1)+5,LEN(A1)-SEARCH(" and ",A1,1))),MID(A1,SEARCH(" or ",A1,1)+4,LEN(A1)-SEARCH(" or ",A1,1)))
(\w)+(( or | and ){0,1}(\w)+)*
It's not a coding solution, but since you did not ask for code (and because it's not necessary in this case), simply do a find/replace on the words "and" and "or" to replace them with some delimiter (e.g. a comma). Then in Excel, you can select the data and split it into different columns using Excel's "Text to Columns" feature (on the Data tab in Excel 2007).
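The replace-then-split idea can be sketched in Python as below. The sample rows are made up from the names mentioned in the question (the real table was only shown in a picture), and `split_names` is a hypothetical helper. The connector is matched only when surrounded by whitespace, so that names containing "or" or "and", such as Tylor and Roland, are left intact.

```python
import re

# ' or ' / ' and ' as separate words, in any letter case.
CONNECTOR = re.compile(r"\s+(?:and|or)\s+", re.IGNORECASE)


def split_names(cell, delimiter=","):
    """Replace each connector with the delimiter, then split on it."""
    delimited = CONNECTOR.sub(delimiter, cell.strip())
    return delimited.split(delimiter)


# Hypothetical rows built from names in the question.
rows = ["David or Peter", "Tylor or Mark or Jessi", "Roland And David"]
for row in rows:
    print(split_names(row))
```

The same caution applies to Excel's Find and Replace: searching for " or " and " and " with the surrounding spaces avoids changing letters inside names.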