Cleaning bad data in excel, splitting words by capital letters - regex

I'm using Excel 2011 on Mac OS X. I have a data set with about 3000 entries. In the fields that contain names, many of the names are not separated. First and last names are separated by a space, but consecutive full names are run together with nothing between them.
Here's what I have, (one cell):
Grant MorrisonSholly FischBen OliverCarlos Alberto Fernandez UrbanoBen OliverCarlos Alberto Fernandez UrbanoBen OliverBen Oliver
Here's what I want to accomplish, (one cell, comma separated with one space after comma):
Grant Morrison, Sholly Fisch, Ben Oliver, Carlos Alberto, Fernandez Urbano, Ben Oliver, Carlos Alberto, Fernandez Urbano, Ben Oliver, Ben Oliver
I have found a few VBA scripts that will split words by capital letters, but the ones I've tried will add spaces where I don't need them like this one...
Function splitbycaps(inputstr As String) As String
Dim i As Long
Dim temp As String
If inputstr = vbNullString Then
splitbycaps = temp
Exit Function
Else
temp = inputstr
For i = 1 To Len(temp)
If Mid(temp, i, 1) = UCase(Mid(temp, i, 1)) Then
If i <> 1 Then
temp = Left(temp, i - 1) + " " + Right(temp, Len(temp) - i + 1)
i = i + 1
End If
End If
Next i
splitbycaps = temp
End If
End Function
There was another one that I found here that used RegEx (forgive me, I'm just learning all of this, so I may sound a little dumb), but when I tried that one it wouldn't work at all. My research pointed me to adding a reference to the library that provides the necessary tools. Unfortunately, I cannot, for the life of me, find how to add a reference to that library in my Mac version of Excel. I may be doing something wrong, but this is the answer that I could not get to work:
Function SplitCaps(strIn As String) As String
Dim objRegex As Object
Set objRegex = CreateObject("vbscript.regexp")
With objRegex
.Global = True
.Pattern = "([a-z])([A-Z])"
SplitCaps = .Replace(strIn, "$1 $2")
End With
End Function
I am basically brand new at adding custom functions via VBA through excel, and there may even be a better way to do this, but it seems like every answer that I come to just doesn't quite get the data right. Thanks for any answers!

My function from Split Uppercase words in Excel needs updating for your additional string matching.
You would use this function in cell B1 for text in A1 as follows: =SplitCaps(A1)
One assumption your cleansing does make is people have only two names, so
Ben OliverCarlos Alberto
is broken to
Ben Oliver
Carlos Alberto
is that actually what should happen? (needs a minor tweak if so)
code
Function SplitCaps(strIn As String) As String
    Dim objRegex As Object
    Set objRegex = CreateObject("vbscript.regexp")
    With objRegex
        .Global = True
        .Pattern = "([a-z])([A-Z])"
        SplitCaps = Replace(.Replace(strIn, "$1, $2"), "<br>", ", ")
    End With
End Function
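Since the question mentions Excel 2011 for Mac, note that "vbscript.regexp" is a Windows library, so CreateObject will most likely fail there. Here is a minimal sketch of the same lowercase-then-uppercase rule using only built-in string functions. The name SplitCapsNoRegex is my own, and it assumes the module uses the default Option Compare Binary so that Like is case-sensitive. It does not handle the "two names per person" split discussed above.

```vba
' Hypothetical helper (not from the original answer): inserts ", " wherever
' a lowercase letter is immediately followed by an uppercase letter.
' Assumes the default Option Compare Binary, so Like "[a-z]" is case-sensitive.
Function SplitCapsNoRegex(strIn As String) As String
    Dim i As Long
    Dim ch As String, nextCh As String
    Dim result As String

    For i = 1 To Len(strIn)
        ch = Mid$(strIn, i, 1)
        result = result & ch
        If i < Len(strIn) Then
            nextCh = Mid$(strIn, i + 1, 1)
            ' Boundary between two run-together names, e.g. "n" then "S"
            If ch Like "[a-z]" And nextCh Like "[A-Z]" Then
                result = result & ", "
            End If
        End If
    Next i

    SplitCapsNoRegex = result
End Function
```

Used as =SplitCapsNoRegex(A1), the text "Grant MorrisonSholly FischBen Oliver" should come back as "Grant Morrison, Sholly Fisch, Ben Oliver".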

Related

Changing formulas on the fly with VBA RegEx

I'm trying to change formulas in Excel; specifically, I need to change the row number in the formulas.
I'm trying to use a regex replace to do this. I use a loop to iterate through the rows of the sheet and need to change the formula for the row being processed at the time. Here is an example of the code:
For i = 2 To rows_aux
DoEvents
Formula_string= "=IFS(N19='Z001';'xxxxxx';N19='Z007';'xxxxxx';0=0;'xxxxxxx')"
Formula_string_new = regEx.Replace(Formula_string, "$1" & i)
wb.Cells(i, 33) = ""
wb.Cells(i, 33).Formula = Formula_string_new
.
.
.
Next i
I need to replace the row references, but not the ones inside single or double quotes. Example:
If i = 2, I want the new string to be this:
"=IFS(N2='Z001';'xxxxxx';N2='Z007';'xxxxxx';0=0;'xxxxxxx')"
I'm trying to use this regex:
([a-zA-Z]+)(\d+)
But it's changing everything in quotes too, like this:
If i = 2:
"=IFS(N2='Z2';'xxxxxx';N2='Z2';'xxxxxx';0=0;'xxxxxxx')"
If anyone can help me, I will be very grateful!
Thanks in advance.
As others have written, there are probably better ways to write this code. But for a regex that will capture just the Column letter in capturing group #1, try:
\$?\b(XF[A-D]|X[A-E][A-Z]|[A-W][A-Z]{2}|[A-Z]{2}|[A-Z])\$?(?:104857[0-6]|10485[0-6]\d|1048[0-4]\d{2}|104[0-7]\d{3}|10[0-3]\d{4}|[1-9]\d{1,5}|[1-9])d?
Note that it will NOT include the $ absolute addressing token, but could be altered if that were necessary.
Note that you can avoid the loop completely with:
Formula_string = "=IFS(N19=""Z001"",""xxxxxx"",N$19=""Z007"",""xxxxxx"",0=0,""xxxxxxx"")"
Formula_string_new = regEx.Replace(Formula_string, "$1" & firstRow)
With Range(wb.Cells(firstRow, 33), wb.Cells(lastRow, 33))
.Clear
.Formula = Formula_string_new
End With
When we write a formula to a range like this, the references will automatically adjust the way you were doing in your loop.
Depending on unstated factors, you may want to use the FormulaLocal property instead of the Formula property.
Edit:
To make this a little more robust, in case there happens to be, within the quote marks, a string that exactly mimics a valid address, you can try checking to be certain that a quote (single or double) neither precedes nor follows the target.
Pattern: ([^"'])\$?\b(XF[A-D]|X[A-E][A-Z]|[A-W][A-Z]{2}|[A-Z]{2}|[A-Z])\$?(?:104857[0-6]|10485[0-6]\d|1048[0-4]\d{2}|104[0-7]\d{3}|10[0-3]\d{4}|[1-9]\d{1,5}|[1-9])d?\b(?!['"])
Replace: "$1$2" & i
However, this is not "bulletproof" as various combinations of included data might match. If it is a problem, let me know and I'll come up with something more robust.
If you can identify some unique features around the reference, such as (in the example) a preceding bracket ( or semicolon ; and a trailing equals sign =, then this might work:
Sub test()
    Dim s As String, sNew As String, i As Long
    Dim Regex As Object
    Set Regex = CreateObject("vbscript.regexp")
    With Regex
        .Global = True
        .MultiLine = False
        .IgnoreCase = True
        .Pattern = "([(;][a-zA-Z]{1,3})(\d+)="
    End With
    i = 1
    s = "=IFS(NANA19='Z001';'xxxxxx';NA19='Z007';'xxxxxx';0=0;'xxxxxxx')"
    sNew = Regex.Replace(s, "$1" & i & "=")
    Debug.Print s & vbCr & sNew
End Sub
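Another way to leave quoted text alone, sketched here as an assumption rather than a tested solution, is to split the formula on the quote character and run the replacement only on the pieces that fall outside quotes. The function name ReplaceRowsOutsideQuotes is hypothetical. It assumes single quotes as in the question, balanced quotes, and a formula that starts outside a quote. It will also rewrite function names that look like cell addresses, such as LOG10.

```vba
' Hypothetical helper: change row numbers only in the parts of the formula
' that are outside single-quoted strings.
Function ReplaceRowsOutsideQuotes(formulaText As String, newRow As Long) As String
    Dim re As Object
    Dim parts() As String
    Dim k As Long

    Set re = CreateObject("vbscript.regexp")
    re.Global = True
    re.Pattern = "\b([A-Z]{1,3})\d+\b"   ' column letters, then the row digits

    ' After splitting on the quote, pieces 0, 2, 4, ... lie outside quotes
    parts = Split(formulaText, "'")
    For k = LBound(parts) To UBound(parts) Step 2
        parts(k) = re.Replace(parts(k), "$1" & newRow)
    Next k

    ReplaceRowsOutsideQuotes = Join(parts, "'")
End Function
```

With the formula from the question and newRow = 2, both N19 references should become N2 while 'Z001' and 'Z007' stay unchanged.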

excel vba - use regex to return information between indicators

I have an app which returns data in the form of a table copied to the clipboard.
The table takes the form of:
table name
other info
-------------------------------
|heading 1|heading 2|heading 3|
-------------------------------
|data|date|other Data|
|data|date|other Data|
-------------------------------
time stamp
etc
I'm looking to pull back only the heading and data rows, minus the horizontal rows which are represented by dashes (---) in my data.
I need the pipes (|) as they are used to split the rows for passing back to excel.
I've used the following regex attempts
strPattern = "(?<=\|)[^|]++(?=\|)"
strPattern = "(\|[^|]++(\|)"
strPattern = "(^\s\|[\d\D]+?\|\s$)"
strPattern = "(^\s\|[\d\D]*\|\s$)"
strReplace = "$1"
My thinking was that these use the pipes as bookends and return any digit or non-digit character between them. None of them work; at best one returns the entire string. (I know I don't have anything removing the dashes yet.)
looking for:
|heading 1|heading 2|heading 3|
|data|date|other Data|
|data|date|other Data|
Thanks in advance for any help
To answer your question, for a regex that will take your text as a block (multi-line variable) and only return the desired lines, try:
^(?:(?:(?:(?=-).)+)|(?:[^|]+))\n?
There may be better ways to accomplish your overall goal, but this accomplishes what you requested.
Option Explicit
Function PipedLines(S As String)
    Dim RE As Object
    Const sPat As String = "^(?:(?:(?:(?=-).)+)|(?:[^|]+))\n?"
    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .MultiLine = True
        .Pattern = sPat
        PipedLines = .Replace(S, "")
    End With
End Function
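If a regex is not a hard requirement, the same filtering can be sketched with plain string functions: split the text into lines and keep only those that start with a pipe. The name KeepPipedLines is my own, and the sketch assumes lines are separated by vbLf or vbCrLf.

```vba
' Hypothetical helper: returns only the lines that begin with "|",
' joined with line feeds. Dash rows and header/footer text are dropped.
Function KeepPipedLines(S As String) As String
    Dim lines() As String
    Dim ln As Variant
    Dim out As String

    ' Normalise Windows line endings, then split into individual lines
    lines = Split(Replace(S, vbCrLf, vbLf), vbLf)
    For Each ln In lines
        If Left$(Trim$(ln), 1) = "|" Then
            If Len(out) > 0 Then out = out & vbLf
            out = out & Trim$(ln)
        End If
    Next ln

    KeepPipedLines = out
End Function
```

Each returned line can then be passed to Split(line, "|") to get the individual cells.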
Hi @tsuimark, have you tried pasting the clipboard data directly into Excel?
I tried that (screenshot attached) and then removed the unwanted rows in the sheet.
Thanks.

Reverse string search in Excel

I'm trying to get Column F/VENDOR # to show the vendor number only. The vendor numbers are highlighted. My strategy is, working from the right, to find the third "_" and substitute it with a "|". Then anything to the right of the pipe is placed in column D.
However the ones with more than three "_" are not following the logic. What am I doing wrong?
Column D formula =IF(ISERROR(FIND("_",C2)),"",RIGHT(C2,LEN(C2)-FIND("|",SUBSTITUTE(C2,"_","|",LEN(C2)-LEN(SUBSTITUTE(C2,"_","",3))))))
Column F/Vendor# formula =IF(ISERROR(LEFT(D2,FIND("_",D2)-1)),"",LEFT(D2,FIND("_",D2)-1))
The issue is in the column D formula - you have:
...LEN(C2)-LEN(SUBSTITUTE(C2,"_","",3))...
It should be:
...LEN(C2)-LEN(SUBSTITUTE(C2,"_",""))-2...
Giving a full formula for column D of:
=IF(ISERROR(FIND("_",A17)),"",RIGHT(A17,LEN(A17)-FIND("|",SUBSTITUTE(A17,"_","|",LEN(A17)-LEN(SUBSTITUTE(A17,"_",""))-2))))
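The reason is that this part of the formula calculates the instance number passed to the outer SUBSTITUTE function. Your version, SUBSTITUTE(C2,"_","",3), removes only the third underscore from the left, so the length difference is always 1 (or 0 if there are fewer than three underscores). LEN(C2)-LEN(SUBSTITUTE(C2,"_","")) is the total number of underscores, so subtracting 2 gives the instance number of the third underscore from the right, which is what you need when the string contains an unknown number of underscores.
If you can use VBA then you should look at using a UDF with regular expressions, as I feel this is slightly less complex than the double-formula method, which is not trivial to step through. The UDF could simply be this: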
Option Explicit
Function GetVendorNumber(rng As Range) As String
    Dim objRegex As Object
    Dim objMatches As Object
    GetVendorNumber = ""
    Set objRegex = CreateObject("VBScript.RegExp")
    With objRegex
        .Pattern = "\D+_(\d+)_.+"
        Set objMatches = .Execute(rng.Text)
        If objMatches.Count = 1 Then
            GetVendorNumber = objMatches(0).SubMatches(0)
        End If
    End With
End Function
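The "third underscore from the right" idea behind the column D formula can also be written without a regex. This is a rough sketch using InStrRev; the function name TextAfterNthFromRight and its parameters are assumptions for illustration. It returns an empty string when the text has fewer than n delimiters.

```vba
' Hypothetical helper: returns the text to the right of the n-th
' occurrence of delim, counting from the right-hand end of s.
Function TextAfterNthFromRight(s As String, delim As String, n As Long) As String
    Dim pos As Long
    Dim k As Long

    pos = Len(s) + 1
    For k = 1 To n
        If pos <= 1 Then Exit Function            ' nothing left to search
        pos = InStrRev(s, delim, pos - 1)          ' search leftwards from pos - 1
        If pos = 0 Then Exit Function              ' fewer than n delimiters
    Next k

    TextAfterNthFromRight = Mid$(s, pos + Len(delim))
End Function
```

For example, =TextAfterNthFromRight("a_b_123_x_y","_",3) should give "123_x_y", which the column F formula then trims to the part before the first underscore.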

Why does Find/Replace zRngResult.Find work fine, but RegEx myRegExp.Execute(zRngResult) mess up the range.Start?

I wish to select and add comments after certain words, e.g. “not”, “never”, “don’t”, in sentences in a Word document with VBA. Find/Replace with wildcards works fine, but “Use wildcards” cannot be combined with “Match case”. RegEx can use “IgnoreCase=True”, but the selection of the word is not reliable when there is more than one comment in a sentence. The Range.Start seems to be getting modified in a way that I cannot understand.
A similar question was asked in June 2010. https://social.msdn.microsoft.com/Forums/office/en-US/f73ca32d-0af9-47cf-81fe-ce93b13ebc4d/regex-selecting-a-match-within-the-document?forum=worddev
Is there a new/different way of solving this problem?
Any suggestion will be appreciated.
The code using RegEx follows:
Function zRegExCommentor(zPhrase As String, tComment As String) As Long
Dim sTheseSentences As Sentences
Dim rThisSentenceToSearch As Word.Range, rThisSentenceResult As Word.Range
Dim myRegExp As RegExp
Dim myMatches As MatchCollection
Options.CommentsColor = wdByAuthor
Set myRegExp = New RegExp
With myRegExp
.IgnoreCase = True
.Global = False
.Pattern = zPhrase
End With
Set sTheseSentences = ActiveDocument.Sentences
For Each rThisSentenceToSearch In sTheseSentences
Set rThisSentenceResult = rThisSentenceToSearch.Duplicate
rThisSentenceResult.Select
Do
DoEvents
Set myMatches = myRegExp.Execute(rThisSentenceResult)
If myMatches.Count > 0 Then
rThisSentenceResult.Start = rThisSentenceResult.Start + myMatches(0).FirstIndex
rThisSentenceResult.End = rThisSentenceResult.Start + myMatches(0).Length
rThisSentenceResult.Select
Selection.Comments.Add Range:=Selection.Range
Selection.TypeText Text:=tComment & "{" & zPhrase & "}"
rThisSentenceResult.Start = rThisSentenceResult.Start + 1 'so as not to find the same phrase again and again
rThisSentenceResult.End = rThisSentenceToSearch.End
rThisSentenceResult.Select
End If 'If myMatches.Count > 0 Then
Loop While myMatches.Count > 0
Next 'For Each rThisSentenceToSearch In sTheseSentences
End Function
Relying on Range.Start or Range.End for position in a Word document is not reliable due to how Word stores non-printing information in the text flow. For some kinds of things you can work around it using Range.TextRetrievalMode, but the non-printing characters inserted by Comments aren't affected by these settings.
I must admit I don't understand why Word's built-in Find with wildcards won't work for you - no case matching shouldn't be a problem. For instance, based on the example: "Never has there been, never, NEVER, a total drought.":
FindText:="[nN][eE][vV][eE][rR]"
This will find all instances of n-e-v-e-r regardless of capitalization. Each pair of brackets defines a set of acceptable characters, in this case the lower- and upper-case forms of one letter of the search term. (Don't put commas inside the brackets; a comma there would itself be treated as one of the acceptable characters.)
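As a rough sketch of how that wildcard search could drive the commenting, the loop below lets Find move the range to each hit, attaches a comment to it, and then collapses the range past the hit. The Sub name and its parameters are hypothetical. Because the pattern is not anchored to word boundaries, "never" would also be found inside "nevertheless".

```vba
' Hypothetical sketch: add a comment to every wildcard match in the document.
Sub CommentOnMatches(findPattern As String, commentText As String)
    Dim rng As Word.Range
    Set rng = ActiveDocument.Content

    With rng.Find
        .ClearFormatting
        .Text = findPattern            ' e.g. "[nN][eE][vV][eE][rR]"
        .MatchWildcards = True
        .Forward = True
        .Wrap = wdFindStop             ' stop at the end of the document
        Do While .Execute
            ' rng now covers the matched text, so no offset arithmetic is needed
            ActiveDocument.Comments.Add Range:=rng, Text:=commentText
            rng.Collapse Direction:=wdCollapseEnd
        Loop
    End With
End Sub
```

It could be called as CommentOnMatches "[nN][eE][vV][eE][rR]", "Check this negation".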
The workarounds described in my MSDN post you link to are pretty much all you can do if you insist on RegEx:
Using the Office Open XML (or possibly Word 2003 XML) file format will let you use RegEx and standard XML processing tools to find the information, add comment "tags" into the Word XML, close it all up... And when the user sees the document it will all be there.
If you need to be doing this in the Word UI, a slightly different approach should work (assuming you're targeting Word 2003 or later): Work through the document on a range-by-range basis (by paragraph, perhaps). Read the XML representation of the text into memory using the Range.WordOpenXML property, perform the RegEx search, add comments as WordOpenXML, then write the WordOpenXML back into the document using the InsertXML method, replacing the original range (paragraph). Since you'd be working with the Paragraph object, Range.Start won't be a factor.

Split one table's column into two columns based on value

I have a table with a large number of rows. Its structure is like this picture:
As you can see, I have "or" and "and" between the names in column A. How can I split this column into two parts? In that case I would have David, Tylor, Fred, Jessi, Roland in the first column and Peter, Mark, Alfered, Hovard and David in the second.
Note: Please pay attention to row 2 and 5. in these rows i have 2 "or" or two "and".
Edit: I prefer to do that in Excel
What I Have Tried
As one possible solution, i have this function in vba.
Function udfRegEx(CellLocation As Range, RegPattern As String)
Dim RegEx As Object, RegMatchCollection As Object, RegMatch As Object
Dim OutPutStr As String
Dim i As Integer
i = ActiveWorkbook.Worksheets(ActiveWorksheet.Name).UsedRange.rows.Count
Set RegEx = CreateObject("vbscript.regexp")
With RegEx
.Global = True
.Pattern = RegPattern
End With
OutPutStr = ""
Set RegMatchCollection = RegEx.Execute(CellLocation.Value)
For Each RegMatch In RegMatchCollection
OutPutStr = OutPutStr & RegMatch
Next
udfRegEx = OutPutStr
Set RegMatchCollection = Nothing
Set RegEx = Nothing
Set Myrange = Nothing
End Function
This function uses RegEx, but I don't know how to use it.
As I mentioned, you do not need VBA for this. An Excel formula will also do what you need.
My Assumptions
Col A has the data
You want the output in Col B and Col C
Paste this formula in Cell B1 and copy it down
=IF(ISERROR(SEARCH(" or ",A1,1))=TRUE,IF(ISERROR(SEARCH(" and ",A1,1))=TRUE,"",LEFT(A1,SEARCH(" and ",A1,1))),LEFT(A1,SEARCH(" or ",A1,1)))
and this in Cell C1 and copy it down
=IF(ISERROR(SEARCH(" or ",A1,1))=TRUE,IF(ISERROR(SEARCH(" and ",A1,1))=TRUE,"",MID(A1,SEARCH(" and ",A1,1)+5,LEN(A1)-SEARCH(" and ",A1,1))),MID(A1,SEARCH(" or ",A1,1)+4,LEN(A1)-SEARCH(" or ",A1,1)))
(\w)+(( or | and ){0,1}(\w)+)*
It's not a coding solution, but since you did not ask for code (and because it's not necessary in this case), simply do a find/replace on the words "and" and "or" to replace them with some delimiter (e.g. a comma). Then, in Excel, you can select the data and split it into different columns using Excel's "Text to Columns" feature (on the Data tab in Excel 2007).
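If you would rather have this as a worksheet function than do it by hand, the same replace-then-split idea can be sketched as a small UDF. The name SplitOnAndOr and its part argument are assumptions for illustration, and the sketch assumes the "|" character never appears in the data.

```vba
' Hypothetical UDF: replaces " and " / " or " with a delimiter, splits on it,
' and returns the requested piece (1 = first name, 2 = second, ...).
Function SplitOnAndOr(ByVal txt As String, ByVal part As Long) As String
    Dim pieces() As String

    ' Case-insensitive replacement, so "And" and "OR" are handled too
    txt = Replace(txt, " and ", "|", , , vbTextCompare)
    txt = Replace(txt, " or ", "|", , , vbTextCompare)
    pieces = Split(txt, "|")

    ' Return an empty string when the requested piece does not exist
    If part >= 1 And part <= UBound(pieces) + 1 Then
        SplitOnAndOr = Trim$(pieces(part - 1))
    End If
End Function
```

With a name pair in A1, =SplitOnAndOr(A1,1) in column B and =SplitOnAndOr(A1,2) in column C would give the two halves. Rows with two separators yield a third piece, available as =SplitOnAndOr(A1,3).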