Whole word replacements using Regular Expression - regex

I have a list of original words paired with replacement words, and I want to replace each occurrence of an original word in some sentences with its replacement word.
For example my list:
theabove the above
myaddress my address
So the sentence "This is theabove." will become "This is the above."
I am using Regular Expression in VB like this:
Dim strPattern As String
Dim regex As New RegExp
regex.Global = True
If Not IsEmpty(myReplacementList) Then
For intRow = 0 To UBound(myReplacementList, 2)
strReplaceWith = IIf(IsNull(myReplacementList(COL_REPLACEMENTWORD, intRow)), " ", varReplacements(COL_REPLACEMENTWORD, intRow))
strPattern = "\b" & myReplacementList(COL_ORIGINALWORD, intRow) & "\b"
regex.Pattern = strPattern
TextToCleanUp = regex.Replace(TextToReplace, strReplaceWith)
Next
End If
I loop over all entries in my list myReplacementList against the text TextToReplace I want to process, and each replacement has to match a whole word, so I put the "\b" token around the original word.
It works well but I have a problem when the original words contain some special characters for example
overla) overlay
I try to escape the ) in the pattern but it does not work:
\boverla\)\\b
I can't turn the sentence "This word is overla) with that word." into "This word is overlay with that word."
Not sure what is missing. Is a regular expression the right approach for the above scenario?

I'd use string.replace().
That way you don't have to escape special characters, only double quotes ("").
See here for examples: http://www.dotnetperls.com/replace-vbnet
Regex is good if you're looking for patterns. Or renaming your mp3 collection ;-) and much, much more. But in your case, I'd use string.replace().
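For what it's worth, the likely reason the escaped pattern fails is that \b only matches between a word character and a non-word character. In "overla) with", the ")" and the following space are both non-word characters, so the trailing \b never matches. Below is a minimal sketch in Python (not VB) of one way around that: escape the original word, capture the character before it with (^|\W), and require "no word character next" with a lookahead. The function name replace_whole_words and the list-of-pairs input are my own assumptions, not part of the question.

```python
import re

def replace_whole_words(text, replacements):
    """Replace each original word with its replacement, whole words only.

    `replacements` is a hypothetical list of (original, replacement) pairs.
    """
    for original, replacement in replacements:
        # (^|\W) captures the start of the string or one non-word character;
        # (?!\w) requires that no word character follows the match.
        pattern = r"(^|\W)" + re.escape(original) + r"(?!\w)"
        # Put the captured leading character back in front of the replacement.
        text = re.sub(pattern, lambda m: m.group(1) + replacement, text)
    return text

pairs = [("theabove", "the above"), ("overla)", "overlay")]
print(replace_whole_words("This is theabove.", pairs))
print(replace_whole_words("This word is overla) with that word.", pairs))
```

Note that a plain string replace would also rewrite "theabove" inside a longer word, which is why the boundary checks are kept here.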

Related

How to split a string in VBA to array by Split function delimited by Regular Expression

I am writing an Excel Add In to read a text file, extract values and write them to an Excel file. I need to split a line, delimited by one or more white spaces and store it in the form of array, from which I want to extract desired values.
I am trying to implement something like this:
arrStr = Split(line, "/^\s*/")
But the editor is throwing an error while compiling.
How can I do what I want?
If you are looking for the Regular Expressions route, then you could do something like this:
Dim line As String, arrStr, i As Long
line = "This is a test"
With New RegExp
.Pattern = "\S+"
.Global = True
If .test(line) Then
With .Execute(line)
ReDim arrStr(.Count - 1)
For i = 0 To .Count - 1
arrStr(i) = .Item(i)
Next
End With
End If
End With
IMPORTANT: You will need to create a reference to:
Microsoft VBScript Regular Expressions 5.5 in Tools > References
Otherwise, you can see Late Binding below
Your original implementation of your original pattern \^S*\$ had some issues:
S* was actually matching a literal uppercase S, not the whitespace character you were looking for - because it was not escaped.
Even if it was escaped, you would have matched every string that you used because of your quantifier: * means to match zero or more of \S. You were probably looking for the + quantifier (one or more of).
You were good for making it greedy (not using *?) since you were wanting to consume as much as possible.
The pattern I used, \S+, matches each run of one or more (+) characters that are NOT white space (\S).
I also used the .Global so you will continue matching after the first match.
Once you have captured all your words, you can then loop through the match collection and place them into an array.
Late Binding:
Dim line As String, arrStr, i As Long
line = "This is a test"
With CreateObject("VBScript.RegExp")
.Pattern = "\S+"
.Global = True
If .test(line) Then
With .Execute(line)
ReDim arrStr(.Count - 1)
For i = 0 To .Count - 1
arrStr(i) = .Item(i)
Next
End With
End If
End With
Miscellaneous Notes
I would have advised just to use Split(), but you stated that there were cases where more than one consecutive space may have been an issue. If this wasn't the case, you wouldn't need regex at all, something like:
arrStr = Split(line)
That would have split on every occurrence of a single space.
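To make the difference concrete, here is a small sketch in Python (not VBA) that compares matching runs of non-whitespace with \S+ against splitting on a single space. The function name tokenize and the sample line are assumptions for illustration only.

```python
import re

def tokenize(line):
    """Return every run of non-whitespace characters in `line`."""
    return re.findall(r"\S+", line)

line = "This  is   a test"      # note the repeated spaces
print(tokenize(line))           # only the words
print(line.split(" "))          # empty strings appear between repeated spaces
```

Splitting on a single space leaves one empty element for every extra space, which is the situation the \S+ approach avoids.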

How to mimic regular Expression negative lookbehind?

What I'm trying to accomplish
I'm trying to create a function to use string interpolation within VBA. The issue I'm having is that I'm not sure how to replace "\n" with a vbNewLine, as long as it does not have the escape character "\" before it?
What I have found and tried
As far as I can tell from my research, VBScript does not support negative lookbehind.
Below are two patterns that I have already tried:
Private Sub testingInjectFunction()
Dim dict As New Scripting.Dictionary
dict("test") = "Line"
Debug.Print Inject("${test}1\n${test}2 & link: C:\\notes.txt", dict)
End Sub
Public Function Inject(ByVal source As String, dict As Scripting.Dictionary) As String
Inject = source
Dim regEx As Object
Set regEx = CreateObject("VBScript.RegExp")
regEx.Global = True
' PATTERN # 1 REPLACES ALL '\n'
'regEx.Pattern = "\\n"
' PATTERN # 2 REPLACES EXTRA CHARACTER AS LONG AS IT IS NOT '\'
regEx.Pattern = "[^\\]\\n"
' REGEX REPLACE
Inject = regEx.Replace(Inject, vbNewLine)
' REPLACE ALL '${dICT.KEYS(index)}' WITH 'dICT.ITEMS(index)' VALUES
Dim index As Integer
For index = 0 To dict.Count - 1
Inject = Replace(Inject, "${" & dict.Keys(index) & "}", dict.Items(index))
Next index
End Function
Desired result
Line1
Line2 & link: C:\notes.txt
Result for Pattern # 1: (Replaces when not wanted)
Line1
Line2 & link: C:\
otes.txt
Result for Pattern # 2: (Replaces the 1 in 'Line1')
Line
Line2 & link: C:\\notes.txt
Summary question
I can easily write code that doesn't use Regular Expressions that can achieve my desired goal but want to see if there is a way with Regular Expressions in VBA.
How can I use Regular Expressions in VBA to Replace "\n" with a vbNewLine, as long as it does not have the escape character "\" before it?
Yes, you may use a regex here. Since the backslash is not used to escape itself in these strings, you may modify your solution like this:
regEx.Pattern = "(^|[^\\])\\n"
S = regEx.Replace(S, "$1" & vbNewLine)
It will match and capture any char but \ before \n and then will put it back with the $1 placeholder. As there is a chance that \n appears at the start of the string, ^ - the start of string anchor - is added as an alternative into the capturing group.
Pattern details
(^|[^\\]) - Capturing group 1: start of string (^) or (|) any char but a backslash ([^\\])
\\ - a backslash
n - an n char.
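The same capture-and-put-back idea can be sketched in Python (not VBA) to check the behavior on the sample input from the question. The function name expand_newlines is an assumption; the pattern is the one given above.

```python
import re

# Group 1 is the start of the string or any character except a backslash;
# it is followed by a literal backslash and the letter n.
UNESCAPED_NEWLINE = re.compile(r"(^|[^\\])\\n")

def expand_newlines(source):
    """Replace \\n with a real newline unless a backslash precedes it."""
    # Put the captured character (group 1) back in front of the newline.
    return UNESCAPED_NEWLINE.sub(lambda m: m.group(1) + "\n", source)

print(expand_newlines(r"Line1\nLine2 & link: C:\\notes.txt"))
```

One limitation worth knowing: because the character before \n is consumed by the match, two escapes in a row (\n\n) only get the first one replaced. If that case matters, running the replacement a second time is one possible workaround.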

Using regex to find paragraphs in VBA Excel

I am trying to use regex to 'extract' paragraphs in a document. Each paragraph is preceded and followed by a '-' on separate line and each paragraph starts with a number.
For example
-
1. This is a paragraph
It may go over multiple lines
-
Ideally, I would like to not include the '-', but it doesn't really matter as I will be placing it in a string and running another regex against it (One that I know works)
The code I am trying to use is basically as follows
Dim matchPara as Object
Dim regex as Object
Dim theMatch as Object
Dim matches as Object
Dim fileName as String
Dim fileNo as Integer
Dim document as String
matchPara = "-?(\d.*?)?-"
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = matchPara
regex.Global = True
regex.Multiline = True
fileName = "C:\file.txt"
fileNo = FreeFile
Open fileName For Input As #fileNo
document = Input$(LOF(fileNo), fileNo)
set matches = regex.Execute(document)
For Each theMatch in matches
MsgBox(theMatch.Value)
Next theMatch
Close #fileNo
I have tested this regex on regex101 and it appeared to do what I wanted. I have also tested it without the grouping
-?\d.*?-
However when I run the code the theMatch.Value only ever contains a single '-'. After some messing around with the regex I got it to display the first line of text but never any more than the first line.
I have checked the length of theMatch.Value with:
MsgBox(len(theMatch.Value))
and placed the contents of theMatch.Value in a cell on the worksheet to see if it was being cut off in the message box, but both theories were proved wrong.
I am at a complete loss now and I am beginning to suspect it is possibly a VBA thing and not a regex thing. There is no requirement to use regex, I just assumed it would be the easiest thing to do.
The paragraphs contain data that I am trying to extract. The idea was to pull each paragraph out with a regex, place it in a string, and then run other regexes to get the information that I need. Some paragraphs won't contain the data that I need, so the plan was to loop through each individual paragraph and handle errors gracefully if the data wasn't in that paragraph (i.e. get what I can and drop the rest with an error message).
This simple approach does not use Regex. It assumes the data is in column A and the paragraphs are placed in column B:
Sub paragraph_no_regex()
Dim s As String
Dim ary
With Application.WorksheetFunction
s = .TextJoin(" ", False, Columns(1).SpecialCells(2))
End With
ary = Split(s, "-")
i = 1
For Each a In ary
Cells(i, 2) = a
i = i + 1
Next a
End Sub
Sub F()
Dim re As New RegExp
Dim sMatch As String
Dim document As String
re.Pattern = "-\n((.|\n)+?)\n-"
'Getting document
document = ...
sMatch = re.Execute(document)(0).SubMatches(0)
End Sub
If you need the dashes (-), just include them in the capture group (the outer parentheses).
This RegEx matches your description and successfully extracts paragraphs (as tested on regex101.com):
matchPara = "-\n\d+\.\s*((?:.|\n)+?)\s*\n-"
It needs the 'global' flag but not the 'multiline' flag. Instead, the end-of-line token is matched in the regex. The main point is that the innermost matching group will match any character including the end-of-line (given as an alternative) but does so in a non-greedy way ("+?"). It doesn't care about word boundaries as this is not necessary here. Also, "-" is not a special character where used in the regex so it doesn't have to be escaped.
As an added benefit, leading and trailing whitespace is cut off ("\s*" outside the group).
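Here is a minimal Python sketch (not VBA) of that last pattern applied to a document shaped like the example in the question. The function name extract_paragraphs is an assumption, and the sketch assumes the text uses plain \n line endings; a file with \r\n endings would need those normalized first, since "-\n" would not match "-\r\n".

```python
import re

# A dash line, a number and a dot, then the paragraph body (lazy, may span
# lines), then the closing dash line. Only the body is captured.
PARAGRAPH = re.compile(r"-\n\d+\.\s*((?:.|\n)+?)\s*\n-")

def extract_paragraphs(document):
    """Return the body of every dash-delimited, numbered paragraph."""
    return PARAGRAPH.findall(document)

document = "-\n1. This is a paragraph\nIt may go over multiple lines\n-\n"
for body in extract_paragraphs(document):
    print(body)
```

Because the closing dash is consumed by each match, this only finds consecutive paragraphs when each one has its own opening and closing dash line.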

VBA Regular Expressions - Run-Time Error 91 when trying to replace characters in string

I am doing this task as part of a larger sub in order to massively reduce the workload for a different team.
I am trying to read in a string and use Regular Expressions to replace one or more consecutive spaces with a single space (or another character). At the moment I am using a local string, but in the main sub this data will come from an external .txt file. The number of spaces between elements in this .txt can vary depending on the row.
I am using the code below, replacing the spaces with a dash. I have tried different variations and different logic, but I always get "Run-time error '91': Object variable or With block variable not set" on the line "c = re.Replace(s, replacement)".
After using breakpoints, I have found out that my RegularExpression (re) is empty, but I can't quite figure out how to progress from here. How do I replace my spaces with dashes? I have been at this problem for hours and spent most of that time on Google to see if someone has had a similar issue.
Sub testWC()
Dim s As String
Dim c As String
Dim re As RegExp
s = "hello World"
Dim pattern As String
pattern = "\s+"
Dim replacement As String
replacement = "-"
c = re.Replace(s, replacement)
Debug.Print (c)
End Sub
Extra information: Using Excel 2010. I have successfully linked all my references ("Microsoft VBScript Regular Expressions 5.5"). I was able to replace the spaces using the vanilla "Replace" function; however, as the number of spaces between elements varies, I am unable to use that to solve my issue.
Edit: My .txt file is not fixed-width either; the rows have different lengths, so I am unable to use the MID function in Excel to dissect the string.
Please help
Thanks,
J.H.
You're not setting up the RegExp object correctly.
Dim pattern As String
pattern = "\s+" ' pattern is just a local string, not bound to the RegExp object!
You need to do this:
Dim re As RegExp
Set re = New RegExp
re.Pattern = "\s+" ' Now the pattern is bound to the RegExp object
re.Global = True ' Assuming you want to replace *all* matches
s = "hello World"
Dim replacement As String
replacement = "-"
c = re.Replace(s, replacement)
Try setting the pattern inside your Regex object. Right now, re is just a regex with no real pattern assigned to it. Try adding in re.Pattern = pattern after you initialize your pattern string.
You initialized the pattern but didn't actually hook it into the Regex. When you ended up calling replace it didn't know what it was looking for pattern wise, and threw the error.
Try also setting the re as a New RegExp.
Sub testWC()
Dim s As String
Dim c As String
Dim re As RegExp
Set re = New RegExp
s = "hello World"
Dim pattern As String
pattern = "\s+"
re.Pattern = pattern
Dim replacement As String
replacement = "-"
c = re.Replace(s, replacement)
Debug.Print (c)
End Sub
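For comparison, the same replacement can be sketched in Python (not VBA), where the pattern is passed directly to the call so there is no separate object to set up. The function name collapse_whitespace is an assumption. Python's re.sub replaces every match by default, which corresponds to setting Global = True on the VBA RegExp object.

```python
import re

def collapse_whitespace(text, replacement="-"):
    """Replace each run of one or more whitespace characters with `replacement`."""
    return re.sub(r"\s+", replacement, text)

print(collapse_whitespace("hello    World"))
print(collapse_whitespace("a \t b\n c"))
```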

Split string on several words, and track which word split it where

I am trying to split a long string based on an array of words. For Example:
Words: trying, long, array
Sentence: "I am trying to split a long string based on an array of words."
Resulting string array:
I am
trying
to split a
long
string based on an
array
of words
Multiple instances of the same word are likely, so the same word (trying or array, say) will probably cause more than one split.
Is there an easy way to do this in .NET?
The easiest way to keep the delimiters in the result is to use the Regex.Split method and construct a pattern using alternation in a group. The group is key to including the delimiters as part of the result, otherwise it will drop them. The pattern would look like (word1|word2|wordN) and the parentheses are for grouping. Also, you should always escape each word, using the Regex.Escape method, to avoid having them incorrectly interpreted as regex metacharacters.
I also recommend reading my answer (and answers of others) to a similar question for further details: How do I split a string by strings and include the delimiters using .NET?
Since I answered that question in C#, here's a VB.NET version:
Dim input As String = "I am trying to split a long string based on an array of words."
Dim words As String() = { "trying", "long", "array" }
If (words.Length > 0)
Dim pattern As String = "(" + String.Join("|", words.Select(Function(s) Regex.Escape(s)).ToArray()) + ")"
Dim result As String() = Regex.Split(input, pattern)
For Each s As String in result
Console.WriteLine(s)
Next
Else
' nothing to split '
Console.WriteLine(input)
End If
If you need to trim the spaces around each word being split you can prefix and suffix \s* to the pattern to match surrounding whitespace:
Dim pattern As String = "\s*(" + String.Join("|", words.Select(Function(s) Regex.Escape(s)).ToArray()) + ")\s*"
If you're using .NET 4.0 you can drop the ToArray() call inside the String.Join method.
EDIT: BTW, you need to decide up front how you want the split to work. Should it match individual words or words that are a substring of other words? For example, if your input had the word "belong" in it, the above solution would split on "long", resulting in {"be", "long"}. Is that desired? If not, then a minor change to the pattern will ensure the split matches complete words. This is accomplished by surrounding the pattern with a word-boundary \b metacharacter:
Dim pattern As String = "\s*\b(" + String.Join("|", words.Select(Function(s) Regex.Escape(s)).ToArray()) + ")\b\s*"
The \s* is optional per my earlier mention about trimming.
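The same technique carries over to other regex engines. As a rough illustration, here is a Python sketch (not VB.NET) where re.split also keeps the delimiters when the alternation sits in a capturing group. The function name split_on_words is an assumption.

```python
import re

def split_on_words(text, words):
    """Split `text` on whole-word matches of `words`, keeping the words."""
    if not words:
        return [text]  # nothing to split on
    alternation = "|".join(re.escape(word) for word in words)
    # The capturing group keeps each delimiter in the result; \b restricts
    # matches to whole words and \s* trims the surrounding whitespace.
    pattern = r"\s*\b(" + alternation + r")\b\s*"
    return re.split(pattern, text)

sentence = "I am trying to split a long string based on an array of words."
for part in split_on_words(sentence, ["trying", "long", "array"]):
    print(part)
```

With the \b anchors in place, a word like "belong" is left alone rather than being split on "long".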
You could use a regular expression.
(.*?)((?:trying)|(?:long)|(?:array))(.*)
will give you three groups if it matches:
1) The bit before the first instance of any of the split words.
2) The split word itself.
3) The rest of the string.
You can keep matching on (3) until you run out of matches.
I've played around with this but I can't get a single regex that will split on all instances of the target words. Maybe someone with more regex-fu can explain how.
I've assumed that VB has regex support. If not, I'd recommend using a different language. Certainly C# has regexes.
You can split on " ",
and then go through the words and see which ones are contained in the "splitting words" array.
Dim testS As String = "I am trying to split a long string based on an array of words."
Dim splitON() As String = New String() {"trying", "long", "array"}
Dim newA() As String = testS.Split(splitON, StringSplitOptions.RemoveEmptyEntries)
Something like this
Dim testS As String = "I am trying to split a long string based on a long array of words."
Dim splitON() As String = New String() {"long", "trying", "array"}
Dim result As New List(Of String)
result.Add(testS)
For Each spltr As String In splitON
Dim NewResult As New List(Of String)
For Each s As String In result
Dim a() As String = Strings.Split(s, spltr)
If a.Length <> 0 Then
For z As Integer = 0 To a.Length - 1
If a(z).Trim <> "" Then NewResult.Add(a(z).Trim)
NewResult.Add(spltr)
Next
NewResult.RemoveAt(NewResult.Count - 1)
End If
Next
result = New List(Of String)
result.AddRange(NewResult)
Next
Peter, I hope the below would be suitable for Split string by array of words using Regex
// Input
String input = "insert into tbl1 inserttbl2 insert into tbl2 update into tbl3 " +
"updatededle into tbl4 update into tbl5";
// Regex: split on whitespace that is followed by a whole keyword
String[] arrResult = Regex.Split(input, @"\s+(?=(?:insert|update|delete)\s+)",
RegexOptions.IgnoreCase);
//Output
[0]: "insert into tbl1 inserttbl2"
[1]: "insert into tbl2"
[2]: "update into tbl3 updatededle into tbl4"
[3]: "update into tbl5"
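The lookahead here means only the whitespace is consumed, so each keyword stays at the start of its piece. A Python sketch of the same idea (not C#), with split_before_keywords as an assumed name:

```python
import re

def split_before_keywords(text, keywords):
    """Split on whitespace that is followed by a keyword and more whitespace."""
    alternation = "|".join(re.escape(keyword) for keyword in keywords)
    # Only the whitespace is consumed; the lookahead (?=...) checks for the
    # keyword without removing it from the result.
    pattern = r"\s+(?=(?:" + alternation + r")\s+)"
    return re.split(pattern, text, flags=re.IGNORECASE)

text = ("insert into tbl1 inserttbl2 insert into tbl2 update into tbl3 "
        "updatededle into tbl4 update into tbl5")
for piece in split_before_keywords(text, ["insert", "update", "delete"]):
    print(piece)
```

Words such as "inserttbl2" are not split on because the keyword must be followed by whitespace.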