VBA Word Wildcards - finding shortest possible set of characters - regex

I have been struggling to find a working solution for a couple of hours now. I hope you can help me.
My problem:
I need to find and select in Word a whole sentence after providing the starting and ending strings of particular sentence.
For example, when my starting string is "People" and ending string is "apples." I expect Word to select the whole "People like red apples." sentence in my document. (If such a sentence exists)
For this purpose I prepared a macro which works almost like I want. The only problem is that it doesn't select the smallest possible set of characters (which I want it to do). To make it clear let's assume I have this text in my document: People like smoking. People like red apples.
Now, when I provide the starting and ending strings to the macro respectively as "People" and "apples.", it selects all the text, which contains 2 sentences mentioned above. That is my problem: I wanted it to select only the second sentence (People like red apples.), not both of them, even though they start with the same word. So, basically, I always want to select the shortest possible set of characters (which in this case is only the last sentence).
Here is a part of my macro in VBA:
text_str = startStr & "*" & endStr
With Application.Selection.Find
.ClearFormatting
.Forward = True
.Wrap = wdFindContinue
.Text = text_str
.MatchWildcards = True
.MatchCase = True
.Execute
End With
I know the problem is with the Wildcards (or very limited set of regular expressions), so I also tried something like this as the search string:
text_str = "(" & startStr & "*){1}" & endStr
It also didn't help. I'm stuck here. :/
Thanks for any suggestions!

Selection.Find has something similar to regular expressions,
but in this case you must use real regular expressions.
The pattern (in this particular case) should be:
People[^.]+apples\.
I wrote an example macro, which:
1. Selects the whole text in the document and assigns it to the src variable (the text searched by the regex).
2. Sets the cursor at the beginning of the document.
3. Checks whether the pattern can be matched (regEx.Test).
4. Executes the regex.
5. Assigns the matched string to the ret variable.
6. Displays it in a message box.
Below you have a complete macro. Probably you should change it to
select (find) the text matched (instead of the message box).
' Requires a reference to "Microsoft VBScript Regular Expressions 5.5" (Tools > References).
Sub Re()
    Dim startStr As String: startStr = "People"
    Dim endStr As String: endStr = "apples"
    ' [^.]+ cannot cross a full stop, so the match stays within one sentence
    Dim pattern As String: pattern = startStr & "[^.]+" & endStr & "\."
    Dim regEx As New RegExp
    Dim src As String
    Dim ret As String
    Dim colMatches As MatchCollection
    ActiveDocument.Range.Select
    src = ActiveDocument.Range.Text
    Selection.StartOf
    regEx.pattern = pattern
    If (regEx.Test(src)) Then
        Set colMatches = regEx.Execute(src)
        ret = "Match: " & colMatches(0).Value
    Else
        ret = "Matching Failed"
    End If
    MsgBox ret, vbOKOnly, "Result"
End Sub
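To select the match instead of showing it in a message box, one option is to turn the match position into a document range. The following is only a sketch: the macro name SelectShortestMatch is made up for illustration, it uses late binding so no library reference is needed, and it assumes a plain-text document in which string offsets line up with Range positions (fields, tables, and similar content may shift them).

```vba
Sub SelectShortestMatch()
    ' Hypothetical name; late binding, so no reference to the RegExp library is needed.
    Dim regEx As Object
    Dim matches As Object
    Dim startPos As Long

    Set regEx = CreateObject("VBScript.RegExp")
    regEx.Pattern = "People[^.]+apples\."

    Set matches = regEx.Execute(ActiveDocument.Range.Text)
    If matches.Count > 0 Then
        ' FirstIndex is zero-based, like document character positions.
        ' Assumption: the document holds plain text only, so offsets in the
        ' string correspond to positions in the document.
        startPos = matches(0).FirstIndex
        ActiveDocument.Range(startPos, startPos + matches(0).Length).Select
    Else
        MsgBox "No match found"
    End If
End Sub
```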


Using regex to find paragraphs in VBA Excel

I am trying to use regex to 'extract' paragraphs in a document. Each paragraph is preceded and followed by a '-' on separate line and each paragraph starts with a number.
For example
-
1. This is a paragraph
It may go over multiple lines
-
Ideally, I would like to not include the '-', but it doesn't really matter as I will be placing it in a string and running another regex against it (One that I know works)
The code I am trying to use is basically as follows
Dim matchPara As String
Dim regex as Object
Dim theMatch as Object
Dim matches as Object
Dim fileName as String
Dim fileNo as Integer
Dim document as String
matchPara = "-?(\d.*?)?-"
Set regex = CreateObject("VBScript.RegExp")
regex.Pattern = matchPara
regex.Global = True
regex.Multiline = True
fileName = "C:\file.txt"
fileNo = FreeFile
Open fileName For Input As #fileNo
document = Input$(LOF(fileNo), fileNo)
set matches = regex.Execute(document)
For Each theMatch in matches
MsgBox(theMatch.Value)
Next theMatch
Close #fileNo
I have tested this regex on regex101 and it appeared to do what I wanted. I have also tested it without the grouping
-?\d.*?-
However when I run the code the theMatch.Value only ever contains a single '-'. After some messing around with the regex I got it to display the first line of text but never any more than the first line.
I have checked the length of theMatch.Value with:
MsgBox(len(theMatch.Value))
and placed the contents of theMatch.Value in a cell on the worksheet to see if It was cutting off in the message box, but both theories were proved wrong.
I am at a complete loss now and I am beginning to suspect it is possibly a VBA thing and not a regex thing. There is no requirement to use regex, I just assumed it would be the easiest thing to do.
The paragraphs contain data that I am trying to extract. So the idea was to regex each paragraph out place that in a string then run other regex to get the information that I need. Some paragraphs wont contain the data that I need so the idea was to loop through each individual paragraph and then error handle better if the data I need wasn't in that paragraph (ie get what I can and drop the rest with an error message)
This simple approach does not use Regex. It assumes the data is in column A and the paragraphs are placed in column B:
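Sub paragraph_no_regex()
    Dim s As String
    Dim ary As Variant
    Dim a As Variant
    Dim i As Long
    With Application.WorksheetFunction
        ' Join all constant cells in column A into one string (2 = xlCellTypeConstants)
        s = .TextJoin(" ", False, Columns(1).SpecialCells(2))
    End With
    ' Each "-" separates two paragraphs
    ary = Split(s, "-")
    i = 1
    For Each a In ary
        Cells(i, 2) = a
        i = i + 1
    Next a
End Sub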
Sub F()
Dim re As New RegExp
Dim sMatch As String
Dim document As String
re.Pattern = "-\n((.|\n)+?)\n-"
'Getting document
document = ...
sMatch = re.Execute(document)(0).SubMatches(0)
End Sub
If you need the dashes (-), just include them in the capture group (the outer parentheses).
This RegEx matches your description and successfully extracts paragraphs (as tested on regex101.com):
matchPara = "-\n\d+\.\s*((?:.|\n)+?)\s*\n-"
It needs the 'global' flag but not the 'multiline' flag. Instead, the end-of-line token is matched in the regex. The main point is that the innermost matching group will match any character including the end-of-line (given as an alternative) but does so in a non-greedy way ("+?"). It doesn't care about word boundaries as this is not necessary here. Also, "-" is not a special character where used in the regex so it doesn't have to be escaped.
As added benefit leading and trailing whitespace is cut off ("\s*" outside the group).
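As a rough illustration of how this pattern could be used from VBA, the sketch below runs it against a hard-coded sample and prints the captured paragraph text. The procedure name ExtractParagraphs and the sample string are assumptions made for the example, and the sample uses bare line feeds (vbLf); a file with CR+LF line endings may need the pattern adjusted.

```vba
Sub ExtractParagraphs()
    ' Hypothetical name; sample text stands in for the file contents.
    Dim regex As Object
    Dim m As Object
    Dim sample As String

    sample = "-" & vbLf & "1. This is a paragraph" & vbLf & _
             "It may go over multiple lines" & vbLf & "-"

    Set regex = CreateObject("VBScript.RegExp")
    regex.Global = True   ' find every paragraph, not just the first
    regex.Pattern = "-\n\d+\.\s*((?:.|\n)+?)\s*\n-"

    For Each m In regex.Execute(sample)
        ' SubMatches(0) is the paragraph body without number and dashes
        Debug.Print m.SubMatches(0)
    Next m
End Sub
```

One thing worth checking against real data: each match consumes its closing dash, so if consecutive paragraphs share a single "-" line, every second paragraph could be skipped.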

Extract text using word VBA regex then save to a variable as string

I am trying to create code in Word VBA that will automatically save (as PDF) and name a document based on its content, which is in text and not fields. Luckily the formatting is standardized and I already know how to save it. I tested my regex elsewhere to make sure it pulls what I am looking for. The trouble is I need to extract the matched statement, convert it to a string, and save it to an object (so I have something to pass on to the code where it names the document).
The part of the document I need to match is below, from the start of "Program" through the end of the line and looks like:
Program: Program Name (abr)
and the regex I worked out for this is "Program:[^\n]"
The code I have so far is below, but I don't know how to execute the regex in the active document, convert the output to a string and save to an object:
Sub RegExProgram()
Dim regEx
Dim pattern As String
Set regEx = CreateObject("VBScript.RegExp")
regEx.IgnoreCase = True
regEx.Global = False
regEx.pattern = "Program\:[^\n]"
(missing code here)
End Sub
Any ideas are welcome, and I am sorry if this is simple and I am just overlooking something obvious. This is my first VBA project, and most of the resources I can find suggest replacing using regex, not saving extracted text as string. Thank you!
Try this:
You can find documentation for the RegExp class here.
Dim regEx as Object
Dim matchCollection As Object
Dim extractedString As String
Set regEx = CreateObject("VBScript.RegExp")
With regEx
.IgnoreCase = True
.Global = False ' Only look for 1 match; False is actually the default.
.Pattern = "Program: ([^\r]+)" ' Word separates lines with CR (\r)
End With
' Pass the text of your document as the text to search through to regEx.Execute().
' For a quick test of this statement, pass "Program: Program Name (abr)"
set matchCollection = regEx.Execute(ActiveDocument.Content.Text)
' Extract the first submatch's (capture group's) value -
' e.g., "Program Name (abr)" - and assign it to variable extractedString.
extractedString = matchCollection(0).SubMatches(0)
I've modified your regex based on the assumption that you want to capture everything after Program: through the end of the line; your original regex would only have matched Program: plus the single character that follows it (the space).
Enclosing [^\r]+ (all chars. through the end of the line) in (...) defines a so-called subexpression (a.k.a. capture group), which allows selective extraction of only the substring of interest from what the overall pattern captures.
The .Execute() method, to which you pass the string to search in, always returns a collection of matches (Match objects).
Since the .Global property is set to False in your code, the output collection has (at most) 1 entry (at index 0) in this case.
If the regular expression has subexpressions (1 in our case), then each entry of the match collection has a nonempty .SubMatches collection, with one entry for each subexpression, but note that the .SubMatches entries are strings, not Match objects.
Match objects have properties .FirstIndex, .Length, and .Value (the captured string). Since the .Value property is the default property, it is sufficient to access the object itself, without needing to reference the .Value property: instead of the more verbose matchCollection(0).Value to access the captured string (in full), you can use the shortcut matchCollection(0). Again, by contrast, .SubMatches entries are strings only.
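Indexing matchCollection(0) fails at run time when nothing matched, so it may be worth checking .Count first. The function below is a minimal sketch of that idea; the name ExtractProgramName is invented for illustration and the pattern is the one from the answer above.

```vba
Function ExtractProgramName(ByVal source As String) As String
    ' Hypothetical helper: returns "" when the pattern does not match.
    Dim regEx As Object
    Dim matches As Object

    Set regEx = CreateObject("VBScript.RegExp")
    regEx.IgnoreCase = True
    regEx.Pattern = "Program: ([^\r]+)"

    Set matches = regEx.Execute(source)
    If matches.Count > 0 Then
        ' First (and only) capture group of the first match
        ExtractProgramName = matches(0).SubMatches(0)
    End If
End Function
```

In Word it could be called as ExtractProgramName(ActiveDocument.Content.Text), with an empty result treated as "not found".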
If you're just looking for a string that starts with "Program:" and want to go to the end of the line from there, you don't need a regular expression:
Public Sub ReadDocument()
Dim aLine As Paragraph
Dim aLineText As String
Dim start As Long
Dim my_str As String ' holds the matched text
For Each aLine In ActiveDocument.Paragraphs
aLineText = aLine.Range.Text
start = InStr(aLineText, "Program:")
If start > 0 Then
my_str = Mid(aLineText, start)
End If
Next aLine
End Sub
This reads through the document paragraph by paragraph, and stores your match in the variable "my_str" when it encounters a paragraph that contains the match. Note that the stored text still ends with the paragraph mark.
Lazier version:
a = Split(ActiveDocument.Range.Text, "Program:")
If UBound(a) > 0 Then
extractedString = Trim(Split(a(1), vbCr)(0))
End If
If I remember correctly, paragraphs in Word end with vbCr ( \r not \n )

Remove tweet regular expressions from string of text

I have an Excel sheet filled with tweets. Several entries contain #blah-type strings, among others. I need to keep the rest of the text and remove the #blah part. For example: "#villos hey dude" needs to be transformed into "hey dude". This is what I've done so far.
Sub Macro1()
'
' Macro1 Macro
'
Dim counter As Integer
Dim strIN As String
Dim newstring As String
For counter = 1 To 46
Cells(counter, "E").Select
ActiveCell.FormulaR1C1 = strIN
StripChars (strIN)
newstring = StripChars(strIN)
ActiveCell.FormulaR1C1 = StripChars(strIN)
Next counter
End Sub
Function StripChars(strIN As String) As String
Dim objRegex As Object
Set objRegex = CreateObject("vbscript.regexp")
With objRegex
.Pattern = "^#?(\w){1,15}$"
.ignorecase = True
StripChars = .Replace(strIN, vbNullString)
End With
End Function
Moreover there are also entries like this one: Ÿ³é‡ï¼Ÿã€€åˆã‚ã¦çŸ¥ã‚Šã¾ã—ãŸã€‚ shiftã—ãªãŒã‚‰ã‚¨ã‚¯ã‚¹ãƒ
I need them gone too! Ideas?
For every line in the spreadsheet run the following regex on it: ^(#.+?)\s+?(.*)$
If the line matches the regex, the text you are interested in will be in the second capturing group. (Groups are usually zero-indexed, but position 0 contains the entire match.) The first capturing group will contain the Twitter handle, if you need that too.
However, this will not match tweets that are not replies (starting with #). In this situation the only way to distinguish between regular tweets and the junk you are not interested in is to restrict the tweet to alphanumerics - but this may mean some tweets are missed if they contain any non-alphanumerical characters. The following regex will work if that is not an issue for you:
^(?:(#.+?)\s+?)?([\w\t ]+)$
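In VBA, the first regex could be applied with Replace and a back-reference to the second group. This is only a sketch: the function name StripLeadingTag is made up, and it handles just the case of a tag at the very start of the text. Strings that do not match are returned unchanged.

```vba
Function StripLeadingTag(ByVal tweet As String) As String
    ' Hypothetical helper: drops a leading #tag and the whitespace after it.
    Dim re As Object

    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "^(#.+?)\s+?(.*)$"

    ' $2 refers to the second capture group (the text after the tag)
    StripLeadingTag = re.Replace(tweet, "$2")
End Function
```

For example, StripLeadingTag("#villos hey dude") returns "hey dude".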

Replace matched pattern with different font

I am using Outlook 2010, and I am trying to write a macro to replace the font of text with a different one, if it matches a pattern.
The logic I am trying to apply is simple - in the user selected text, check for a pattern, and on match, change the font for the matched text.
So far I have been able to split the text and apply/check regex, but the replacement is something that I am not clear on how to do.
Dim objOL As Application
Dim objDoc As Object
Dim objSel As Object
Dim regEx As RegExp
Dim matches As MatchCollection
Dim m As Match
Dim lines As Variant
Dim ms As String
Set objOL = Application
Set objDoc = objOL.ActiveInspector.WordEditor
Set objSel = objDoc.Windows(1).Selection
lines = Split(objSel, Chr(13))
For i = 0 To UBound(lines) Step 1
Set regEx = New RegExp
With regEx
.Pattern = "\[(ok|edit|error)\](\[.*\])?" ' <-- this is just one regex, I want to be able to check more regexes
.Global = True
End With
If regEx.Test(lines(i)) Then
Set matches = regEx.Execute(lines(i))
For Each m In matches
ms = m.SubMatches(1)
' ms.Font.Italic = True
' <-- here is where I am not sure how to replace! :( -->
Next
End If
Next i
P.S there seems to be text-search (objSel.Find.Text)and replace (objSel.Find.Replacement.Text) methods in Selection object, but not pattern-search ! (or I am missing it)
--EDIT--
Adding a sample text
user#host> show some data
..<few lines of data>.. <-- these lines as-is (but monospaced)
[ok][2014-11-26 11:05:02]
user#host> edit some other data
[edit data]
user#host(data)% some other command
1. I want to convert the whole block to a monospaced font (like Courier New, or Consolas)
2. And change the part that begins with something#somewhere.. and till > or % to dimmer color (i.e in this example user#host> and user#host(data)% dimmer/grey)
3. The rest in that line to bold (show some data et al)
4. And, all the bracketed text followed by time-stamps (or without timestamps) similar to 2. (i.e, dim/grey)
This is getting closer to being done. The framework is here to make all sorts of changes now. Just need to get some of the regex patterns down to make the changes.
Sub FormatSelection()
Dim objMailItem As Outlook.MailItem
Dim objInspector As Outlook.Inspector: Set objInspector = Application.ActiveInspector
Dim objHtmlEditor As Object
Dim objWord As Object
Dim Range As Word.Selection
Dim objSavedSelection As Word.Selection
Dim objFoundText As Object
' Verify a mail object is in focus.
If objInspector.CurrentItem.Class = olMail Then
' Get the mail object.
Set objMailItem = objInspector.CurrentItem
If objInspector.EditorType = olEditorWord Then
' We are using a Word editor. Get the selected text.
Set objHtmlEditor = objMailItem.GetInspector.WordEditor
Set objWord = objHtmlEditor.Application
Set Range = objWord.Selection
Debug.Print Range.Range
' Set defaults for the selection
With Range.Font
.Name = "Courier"
.ColorIndex = wdAuto
End With
' Stylize the bracketed text
Call FormatTextWithRegex(Range, 2, "\[(.+?)\]")
' Prompt style text.
Call FormatTextWithRegex(Range, 2, "(\w+?#.+?)(?=[\>\%])")
' Text following the prompt.
Call FormatTextWithRegex(Range, 3, "(\w+?#.+?[\>\%])(.+)")
End If
End If
Set objInspector = Nothing
Set Range = Nothing
Set objHtmlEditor = Nothing
Set objMailItem = Nothing
End Sub
Private Sub FormatTextWithRegex(ByRef pRange As Word.Selection, pActionIndex As Integer, pPattern As String)
' This routine runs pPattern against the text in pRange and restyles each match
' according to the pActionIndex passed.
Const intLightColourIndex = 15
Dim objRegex As RegExp: Set objRegex = New RegExp
Dim objSingleMatch As Object
Dim objMatches As Object
' Configure Regex object.
With objRegex
.IgnoreCase = True
.MultiLine = False
.Pattern = pPattern ' Example "\[(ok|edit|error)\](\[.+?\])?"
.Global = True
End With
' Locate all matches if any.
Set objMatches = objRegex.Execute(pRange.Text)
' Find
If (objMatches.Count > 0) Then
Debug.Print objMatches.Count & " Match(es) Found"
For Each objSingleMatch In objMatches
' Locate the text associated to this match in the selection so we can replace it.
Debug.Print "Match Found: '" & objSingleMatch & "'"
With pRange.Find
'.ClearFormatting
.Text = objSingleMatch.Value
.ClearFormatting
Select Case pActionIndex
Case 1 ' Italicize text
.Replacement.Text = objSingleMatch.Value
.Replacement.Font.Bold = False
.Replacement.Font.Italic = True
.Replacement.Font.ColorIndex = wdAuto
.Execute Replace:=wdReplaceAll
Case 2 ' Dim the colour
.Replacement.Text = objSingleMatch.Value
.Replacement.Font.Bold = False
.Replacement.Font.Italic = False
.Replacement.Font.ColorIndex = intLightColourIndex
.Execute Replace:=wdReplaceAll
Case 3 ' Bold that text!
.Replacement.Text = objSingleMatch.Value
.Replacement.Font.Bold = True
.Replacement.Font.Italic = False
.Replacement.Font.ColorIndex = wdAuto
.Execute Replace:=wdReplaceAll
End Select
End With
Next
Else
Debug.Print "No matches found for pattern: " & pPattern
End If
Set objRegex = Nothing
Set objSingleMatch = Nothing
Set objMatches = Nothing
End Sub
So we take what the user has selected and execute the macro. I have my Outlook configured to use Word as the editor, so the code tests for that. It takes the selected text, runs the regex against it, and saves the matches.
The issue you had was what to do with a match once you found it. In my case, since we have the actual text that matched, we can run it through a find and replace on the selection once again: the text is replaced with itself, styled as directed.
Caveats
My testing text was the following:
asdfadsfadsf [ok][Test]dsfadsfasdf asdfadsfadsfasdfasdfadsfadsf [ok][Test]dsfadsfasdf asdfadsfadsfasdf
I had to change your regex in your sample to be less greedy since it was matching both [ok][Test] sections. I don't know what kind of text you are working with so my logic might not apply to your situation. Test with caution.
You also had a comment that you needed to test multiple regexes... regexies.... I don't know what the plural is. Wouldn't be hard to create another function that calls this one for several patterns. Assuming this logic works repeating it should not be a big deal. I would like to make this work for you so if something is wrong let me know.
Code Update
I have changed the code so that the regex replacement is in a sub. Right now the code changes the selected text to Courier and italicizes text based on a regex. With this setup you can use the subroutine FormatTextWithRegex to make changes: just update the pattern and the action index, which selects the style to apply. I will be updating this again soon with more information. Right now all that exists is the structure that I think you need.
I am still having issues with the bolding, but you can see the grey part is working correctly. Also, since this relies on highlighting, the multiple calls to the function are causing an issue. I am just not sure what it is.

Excluding delimiters with Search in MS Word

Say I have the following string:
"Hello how are you."
Since MS Word allows for regular expressions, I can use "*" to find the complete string. But what if I want to exclude the delimiters (the quotes)? I'm afraid that MS Word doesn't support either of the two methods explained here. My question is: would there be any way to do this in one search query?
Thanks in advance.
There are different ways to achieve what you want. Here is one way to find text in Word VBA without the delimiters, using regex. Let's say you have the following text in a Word document (do not copy and paste it from here, as the website distorts the double quotes):
This is a sample
"This is another Sample"
"Wake me up before you go go"
"War of the worlds"
The code to return text using Regex between two quotes is as follows
Sub FindText()
Dim regEx, Match, Matches
Set regEx = New RegExp
regEx.Pattern = "([^“]*)(?=\”)"
regEx.IgnoreCase = False
regEx.Global = True
Set Matches = regEx.Execute(ActiveDocument.Range.Text)
For Each Match In Matches
Debug.Print Match.Value
Next
End Sub
and if you want to say find "Wake me up before you go go" without quotes then you can use this as well
Sub FindText()
Dim regEx, Match, Matches
Dim searchText As String
searchText = "Wake me up before you go go"
Set regEx = New RegExp
regEx.Pattern = "([^“]*)(?=\”)"
regEx.IgnoreCase = False
regEx.Global = True
Set Matches = regEx.Execute(ActiveDocument.Range.Text)
For Each Match In Matches
If Trim(Match.Value) = (searchText) Then
Debug.Print "Found"
End If
Next
End Sub
NOTE: The website distorts the actual double quote so I am posting screenshots.
FOLLOWUP
For the sample file that you posted, use this code
Sub FindText()
Dim regEx, Match, Matches
Set regEx = New RegExp
regEx.Pattern = """([^""]*)"""
regEx.IgnoreCase = False
regEx.Global = True
Set Matches = regEx.Execute(ActiveDocument.Range.Text)
For Each Match In Matches
Debug.Print Match.SubMatches(0)
Next
End Sub
HTH
Sid
You are wrong. Word does support some wildcards: ? for a single character and * for a series of characters.
These are not regular expressions, which means no lookbehind and no lookahead.
Ms-Word will never have every feature you want, such as here, where you want to find one thing but select only part of it. However, you can always program a macro to accomplish the task.
Add the following VBA code to your document. You can add a custom button on the toolbar to call it.
Sub FindSpecial()
FindSpecialA
End Sub
Private Sub FindSpecialA(Optional text As String)
Dim ToFind As String
ToFind = InputBox("Enter the text you want to find in double-quotes (without double-quotes):" & vbCrLf & vbCrLf & "(Enter * to match anything within double-quotes)", "Find", text)
If ToFind = "" Then Exit Sub
Selection.Find.ClearFormatting
With Selection.Find
.text = """" & ToFind & """"
.Replacement.text = ""
.Forward = True
.Wrap = wdFindContinue
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = False
.MatchSoundsLike = False
.MatchAllWordForms = False
If ToFind = "*" Then
.text = "[“""]*[”""]"
.MatchWildcards = True
End If
End With
Selection.Find.Execute
If Selection.Find.Found Then
Selection.MoveStart unit:=wdCharacter, Count:=1
Selection.MoveEnd unit:=wdCharacter, Count:=-1
FindSpecialA ToFind
Else
MsgBox "Not found!"
End If
End Sub
EDIT:
Updated the code to handle wildcard * matches.
Some versions of MS Word support regex-style groups with their "search with wildcards" option, meaning that if you can create a search expression between two quotes -- the one that works for me is "?#" -- you can change it to "(?#)" and enter \1 for the replace text. This will replace the text that was found with just the text that matches the expression between the parentheses, getting rid of your quote marks. (MS Word's ?# is equivalent to .* (non-greedy) in common regex.)
This works for me in Word 2008 for Mac, but I don't have a guide to which versions of Office support this syntax.
Beware! In this search form, Word does not equate the straight quotes on your keyboard with the curly quotes it inserts in order to look pretty. You will need to either turn off "smart quotes" for this document, or construct your search phrase by cutting and pasting the opening and closing quote characters from your document.
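The straight-versus-curly quote problem also affects code that is pasted from a web page. One way around it in VBA is to build the quote characters from their character codes. The sketch below assumes the regex approach from the earlier answer; the procedure name FindQuotedText is made up, and it treats a straight quote (code 34) and the curly quotes (codes 8220 and 8221) as interchangeable delimiters.

```vba
Sub FindQuotedText()
    ' Hypothetical name; prints the text between quotes, without the quotes.
    Dim regEx As Object
    Dim m As Object
    Dim openQ As String
    Dim closeQ As String
    Dim anyQ As String

    ' Character classes built from codes, so no literal quote needs pasting
    openQ = "[" & Chr(34) & ChrW(8220) & "]"
    closeQ = "[" & Chr(34) & ChrW(8221) & "]"
    anyQ = Chr(34) & ChrW(8220) & ChrW(8221)

    Set regEx = CreateObject("VBScript.RegExp")
    regEx.Global = True
    regEx.Pattern = openQ & "([^" & anyQ & "]*)" & closeQ

    For Each m In regEx.Execute(ActiveDocument.Range.Text)
        Debug.Print m.SubMatches(0)
    Next m
End Sub
```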
I have had luck using color (Format > Font in the Search dialog box, with the cursor in Find What) to solve problems like this. First, search for all content including its delimiters (“*” with wildcards checked, in this example) and replace it with a non-black color such as blue. Next, search and replace the delimiters (in this case the quote marks), changing their color from blue back to black. Then make your changes to the content in blue. Finally, select all and change it to black. If this comes up often, I would suggest macros on a toolbar for step one (blue, take blue off the delimiters) and step two (change all to black).