Use VBA to replace RegEx strings with hyperlinks in Word - regex

I'm trying to search through a word document and replace regex matches with a series of static hyperlinks.
For example, if I get a regex match for "A-1", I want to replace the "A-1" string with a hyperlink with Anchor = "A-1" and Address = "https://www.my_website.com/A-1". My RegEx matches could be "A-1", "A-2", "A-3", etc.
I'm familiar with RegEx but I'm very new to VBA. What I have so far:
Sub FindAndHyperlink()
Dim RegEx As Object
Set RegEx = CreateObject("VBScript.RegExp")
RegEx.Global = True
RegEx.Pattern = "([A][\-])([0-9])"
Set Matches = RegEx.Execute(ActiveDocument.Range.Text)
For Each Match In Matches
ActiveDocument.Range.Text = RegEx.Replace(ActiveDocument.Range.Text, (ActiveDocument.Hyperlinks.Add Anchor:=Match, Address:="https://www.my_website.com/" & Match))
Next
End Sub
This doesn't compile because it's expecting a ) after ActiveDocument.Hyperlinks.Add.
I think the problem is that the RegEx.Replace() method is expecting (String, String) arguments rather than (String, Hyperlink object), but I'm not sure the best way to get around that.
Any help would be appreciated.

Try:
Sub FindAndHyperlink()
Application.ScreenUpdating = False
Const HLnk As String = "https://www.my_website.com/"
With ActiveDocument.Range
With .Find
.ClearFormatting
.Replacement.ClearFormatting
.Text = "A-[0-9]{1,}"
.Replacement.Text = ""
.Forward = True
.Wrap = wdFindStop
.Format = False
.MatchWildcards = True
.Execute
End With
Do While .Find.Found
.Hyperlinks.Add Anchor:=.Duplicate, Address:=HLnk & .Text, TextToDisplay:=.Text
.Start = .Hyperlinks(1).Range.End
.Find.Execute
Loop
End With
Application.ScreenUpdating = True
End Sub

I just tested this and it worked fine for me.
([a-z _:/.A-Z-1-9]+)
Notice the UPPERCASE and lowercase, plus the 1-9 (one through nine, inclusive), as well as the characters such as slash and period.

Related

Find and highlight non-Latin code-page characters in a Word document

I have looked for this on the net but unfortunately no joy. I can find a lot of examples of find and replace but not much on find and highlight with regex.
What I want to do is look for non-Latin code-page characters in a MS Word document and highlight them. I thought the best approach would be to use regex (open to suggestions if that's not the case). Below code high lights the whole document:
Sub Highlight_Words()
Dim oRE As New RegExp: oRE.Pattern = "[^a-zA-Z0-9:]"
Dim oM As Match
Application.ScreenUpdating = False
Options.DefaultHighlightColorIndex = wdRed
With ActiveDocument.Content.Find
.ClearFormatting
.Text = oRE.Pattern
'.Text = "[^a-zA-Z0-9\s:]"
'.Text = "[a-zA-Z\d\s:]"
'.Text = " "
With .Replacement
.Text = "^&"
.ClearFormatting
.Highlight = True
End With
.Forward = True
.Wrap = wdFindContinue
.Format = True
.MatchWildcards = True
.Execute Replace:=wdReplaceAll
End With
Application.ScreenUpdating = True
End Sub
Any help would be much appreciated, thanks
P.S. I am working on Windows 7 (64 bit) and Word 2013
Update1:
Below is a sample text :
This is just a sample text to test highlighting of non-alphanumeric
characters (i.e. characters that are not English language characters
(i.e. À) and not numbers). There are exceptions to this rule like
apostrophe (“2’”) or colon (“:”) or hyphen (“-“). But I can add these
exceptions once I have the main pattern to search for non-alphanumeric
characters
So from the above sample text, À should be highlighted (I did have other characters in the text but unfortunately they are not displaying on the site)
You can't use RegEx inside a Word document; RegEx can only run on a string you'd extract from the document (assign to a String variable). But location in the document and any formatting information is lost. Any search must be done using Word's built-in wildcard functionality which is similar to RegEx but not the same and not as "powerful".
Basically, the requirement appears to be anything in the "standard" ANSI codes through ANSI 126, given a "Latin" code page. (You can see the characters and their codes using Word's Insert Symbol (Insert/Symbol, More Symbols) dialog box.
Testing with the provided sample text, this search string works: [!^011-^0126]. This searches for anything not in the character range ANSI 11 (Word's new line character) through ANSI 126 (lower case z). Additional characters with higher ANSI codes can be appended after 126, as literal characters.
Put into the code in the question:
Sub Highlight_Words()
Dim Pattern As String
Pattern = "[!^011-^0126]"
Application.ScreenUpdating = False
Options.DefaultHighlightColorIndex = wdRed
With ActiveDocument.Content.Find
.ClearFormatting
.Text = Pattern
With .Replacement
.Text = "^&"
.ClearFormatting
.Highlight = True
End With
.Forward = True
.Wrap = wdFindContinue
.Format = True
.MatchWildcards = True
.Execute Replace:=wdReplaceAll
End With
Application.ScreenUpdating = True
End Sub

Regex with hyperlinks ActiveDocument.Range and Format

I have seen several other StackOverflow posts such as this one:
How to Use/Enable (RegExp object) Regular Expression using VBA (MACRO) in word
on using regular expressions in Microsoft Word with VBA using the Microsoft VB script Regular Expressions 5.5 Reference.
That helped me prepare the following, which I use in Word to highlight US Dollar amounts:
Sub dollarHighlighter()
Set regExp = New regExp
Dim objMatch As Match
Dim colMatches As MatchCollection
Dim offsetStart As Long
offsetStart = Selection.Start
regExp.Pattern = "\$([\,\d{1,3}]*(?:\.\d{2})?)"
regExp.Global = True
Set colMatches = regExp.Execute(Selection.Text) ' Execute search.
For Each objMatch In colMatches ' Iterate Matches collection.
Set myRange = ActiveDocument.Range(objMatch.FirstIndex + offsetStart,
End:=offsetStart + objMatch.FirstIndex + objMatch.Length)
myRange.FormattedText.HighlightColorIndex = wdYellow
Next
End Sub
While this works as expected on a list of dollar amounts within text (for the most part - among its imperfections the regex is intentionally a bit loose) it does not work as anticipated when there are hyperlinks present in the Word document.
In that instance, there appears to be a shift in offset of the highlighted characters in a somewhat unpredictable manner. I assume this is because there is a lot of new xml/css in the document.xml source file.
Ultimately, my overarching questions is, can I use regex to highlight word document contents even if it contains hyperlinks? Is it an offset question or should I run the regex on the compressed xml, re compress and reopen for better results? As when I test various regex variations on the source code, I get the anticipated results, but not when formatting what would be the Word range.
I have also asked this here: https://social.msdn.microsoft.com/Forums/en-US/3a95c5e4-9e0c-4da9-970f-e0bf801c3170/macro-for-a-regexp-search-replace?forum=isvvba&prof=required but realize it was an ancient post...
Per question below, here are some possibly helpful links:
an example document
http://www.smithany.com/test.docx
step 1
http://www.smithany.com/wordusd1.jpg
Step 2
http://www.smithany.com/wordhighlighterrun.jpg
and what happens
http://www.smithany.com/whatactuallyhappens.jpg
Temporary Workaround: As suggested below Word's Wildcard find is fast if you do not stack the loops. try this:
Sub Macro2()
Selection.Find.ClearFormatting
Selection.Find.Replacement.ClearFormatting
Selection.Find.Replacement.highlight = True
With Selection.Find
.Text = "$[0-9,]{1,}"
.Replacement.Text = ""
.Forward = True
.Wrap = wdFindContinue
.Format = True
.MatchCase = False
.MatchWholeWord = False
.MatchAllWordForms = False
.MatchSoundsLike = False
.MatchWildcards = True
End With
Selection.Find.Execute Replace:=wdReplaceAll
Selection.Find.ClearFormatting
Selection.Find.Replacement.ClearFormatting
Selection.Find.Replacement.highlight = True
With Selection.Find
.Text = "$[0-9,]{1,}.[0-9]{2,3}"
.Replacement.Text = ""
.Forward = True
.Wrap = wdFindContinue
.Format = True
.MatchCase = False
.MatchWholeWord = False
.MatchAllWordForms = False
.MatchSoundsLike = False
.MatchWildcards = True
End With
Selection.Find.Execute Replace:=wdReplaceAll
End Sub
which basically gets all the dollar amounts highlighted. That said, complex expressions like matching a variety of date formats could get messy, but i suppose it is all possible doing them one step at a time.
I had not touched VBA for years but I guess it's like bicycling.
Anyways, here is a sub that should help you out. It's based on Cindy Meister sound recommendation and fills the gap between Regex and Wildcard Find using a collection of match patterns for optional parts.
First, the Wildcard matches: $[0-9,]{1,} and $[0-9,]{1,}.[0-9]{2}
It's not that different after all, isn't it? However, to take the optional fraction part into account I have to use two patterns.
And here's the routine:
Sub WildcardsHighlightWords()
Dim Word As Range
Dim WildcardCollection(2) As String
Dim Words As Variant
WildcardCollection(0) = "$[0-9,]{1,}"
WildcardCollection(1) = "$[0-9,]{1,}.[0-9]{2}"
Options.DefaultHighlightColorIndex = wdYellow
'Clear existing formatting and settings in Find
Selection.Find.ClearFormatting
Selection.Find.Replacement.ClearFormatting
'Set highlight to replace setting.
Selection.Find.Replacement.Highlight = True
'Cycle through document and find wildcards patterns, highlight words when found
For Each Word In ActiveDocument.Words
For Each WildcardsPattern In WildcardCollection
With Selection.Find
.Text = WildcardsPattern
.Replacement.Text = ""
.Forward = True
.Wrap = wdFindContinue
.Format = True
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = True
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
Selection.Find.Execute Replace:=wdReplaceAll
Next
Next
End Sub
It should be easy to extend or modify this approach if needed.
This highlithts the Dollar amounts as desired on my end:
Note: The separator in the quantifier {n,m} is not the same in all localizations, e.g. it's {n;m} in the German version.

Creating a RegEx to find a sentence with a parenthetical acronym (VBasic Word)

I'm writing a script that scrubs a document to find acronyms in the format (USA). As a processing tool I need to grab the entire sentence in which that parenthetical acronym appears. Right now my code for finding the acronym is:
With oRange.Find
.Text = "\([A-Z]{2,}\)"
.Forward = True
.Wrap = wdFindStop
.Format = False
.MatchCase = True
.MatchWildcards = True
Combining this with a Do While .Execute I can comb the doc and find the acronyms, then using a string function I take the acronym out of the parentheses and put it in a table. Is there a RegEx that I could use which would find any sentence an (USA) type acronym is in? As an input you could use this paragraph.
Thank you very much.
edit: I found the following Regex to try and make it work:
.Text = "[^.]*\([A-Z]{2,}\)[^.]*\."
But this is giving me an error, saying that the carrot can't be used in the Find function.
I didn't managed to use that regular expression directly in the .Find method, so using Regex directly :
Sub AcronymFinder()
Dim Para As Paragraph
Set Para = ThisDocument.Paragraphs.First
Dim ParaNext As Paragraph
Dim oRange As Range
Set oRange = Para.Range
Dim regEx As New RegExp
Dim ACrO As String
With regEx
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = ".*[^.]*([\(][A-Z]{2,}[\)])[^.]*[\.]"
End With
Do While Not Para Is Nothing
Set ParaNext = Para.Next
Set oRange = Para.Range
'Debug.Print oRange.Text
If regEx.test(oRange.Text) Then
ACrO = CStr(regEx.Execute(oRange.Text)(0))
'Debug.Print ACrO
With oRange.Find
.Text = ACrO
.Forward = True
.Wrap = wdFindStop
.Format = False
.MatchCase = True
.MatchWildcards = False
.Execute
End With
Else
End If
Set Para = ParaNext
Loop
End Sub
to use it, remember to add the reference :
Description: Microsoft VBScript Regular Expressions 5.5
FullPath: C:\windows\SysWOW64\vbscript.dll\3
Major.Minor: 5.5
Name: VBScript_RegExp_55
GUID: {3F4DACA7-160D-11D2-A8E9-00104B365C9F}
This regex
[^046][!^046]*\([A-Z]{2,10}\)[!^046]*[^046]
when used in the Find dialog will find a sentence (bounded by full stops ^046).
Note that this regex returns a string with full stops on both ends, e.g.,
. A three-letter acronym (TLA) was used.
Also note that I limited acronym length to 10 chars [A-Z]{2,10}; change the upper limit as needed.
Finally I observed that this DOES NOT find acronyms at the end of a sentence, e.g.
I used a three-letter acronym (TLA).
The [!^046]* part of the regex does not appear to match a zero length string. To catch those cases you would need to do a second pass search with this:
[^046][!^046]*\([A-Z]{2,10}\)[^046]
Hope that helps

Merge Three Regexes into One (or Two)

I would like to merge my three regexes which clean text (empty lines, leading and trailing spaces etc.) into, if possible, one regex, or if it is not possible - into two.
My first regex is [ \t]+. It does this sort of cleaning.
My second regex is ^(?:[\t ]*(?:\r?\n|\r))+ Not image included since it won't catch anything if the previous regex has not run.
The third regex is ^[\s\xA0]+|[\s\xA0]+$. It does this sort of cleaning.
EDIT: I have forgotten to mention that in each case I replace match with nothing "".
EDIT 2: I use the following code in Word:
With selection
Dim RegEx As Object
Set RegEx = CreateObject("VBScript.RegExp")
RegEx.Global = True
RegEx.MultiLine = True
' clean selection
RegEx.Pattern = "[ \t]+"
.Text = RegEx.Replace(.Text, " ")
RegEx.Pattern = "^(?:[\t ]*(?:\r?\n|\r))+"
.Text = RegEx.Replace(.Text, "")
' the following is from http://stackoverflow.com/a/24049145/2657875
RegEx.Pattern = "^[\s\xA0]+|[\s\xA0]+$"
.Text = RegEx.Replace(.Text, "")
End With
The last regexps can be merged as
RegEx.Pattern = "^(?:[\t ]*(?:\r?\n|\r)?)*|[ \t]+$"
I do not think there can be a chance to merge all 3 in VBA since you are using two different replacement patterns.
If i am not wrong, you want all your lines/spaces/tabs/white lines to be matched and removed, so you could merge the input strings. Well, that's easy and can be done if you do use the following regex in your replace program/script/command:
/([\s\t]{0,50}\r?\n)+|\s+/s
The regex should work well on windows as well as linux based files.
Not pro but I use multiple regex one after another. If you are not familiar with below code than you should try.
Set regEx_ = new regExp
With regEx_
.Global = True
.MultiLine = True
.IgnoreCase = True
.Pattern = "Pattern 1"
TextLine = regEx_.replace(TextLine, "")
.Pattern = "Pattern 2"
TextLine = regEx_.replace(TextLine, "")
'and so on
End With

Find and replace from given word to right parenthesis

I have just taken up the VBA route to automate a few day today tasks so pls excuse if I sound very naive
I'm trying to open a word document & then searching for a expression to highlight(Bold) it,however Im getting error "User defined type not defined"
I'm able to open the word document but unable to perform the pattern search.I have gathered bits & peices of code from internet, however its not working
I'm using Office 2013 & have added the Microsoft VBscript Reg Ex 5.5 in references.
The pattern Im searching is starting from "Dear" till ) is encountered.
Cheers #GoingMad#
Sub Pattern_Replace()
Dim regEx, Match, Matches
Dim rngRange As Range
Dim pathh As String, i As Integer
pathh = "D:\Docs\Macro.docx"
Dim pathhi As String
Dim from_text As String, to_text As String
Dim WA As Object, WD As Object
Set WA = CreateObject("Word.Application")
WA.Documents.Open (pathh)
WA.Visible = True
Set regEx = New RegExp
regEx.Pattern = "Dear[^0-9<>]+)"
regEx.IgnoreCase = False
regEx.Global = True
Set Matches = regEx.Execute(ActiveDocument.Range.Text)
For Each Match In Matches
ActiveDocument.Range(Match.FirstIndex, Match.FirstIndex + Len(Match.Value)).Bold = True
Next
End Sub
You need to escape the bracket ")" within the regex, using a back-slash:
regex.Pattern = "Dear[^0-9<>]+\)"
This is because it has a particular meaning within a regex expression.
I would personally also split the reference to the Word-Range across a few lines:
Set rngRange = ActiveDocument.Range
rngRange.Expand Unit:=wdStory
Set Matches = regex.Execute(rngRange.Text)
although this isn't necessary.
Consider the following text
Dear aunt sally ) I have gone to school.
Your regex pattern would be "Dear[^)]+"
Find the word Dear
Match Any character that is not ")"
Repeat
Refiddle here
This one will include the parenthesis. Dear[\w\s]+\)
Find the word Dear
Match Any Character or whitespace
Repeat as needed
Until a right parenthesis is found
You don't need regex for this - a wildcard Find/Replace in Word will do the job far more efficiently:
With WA.ActiveDocument.Range.Find
.ClearFormatting
.Text = "Dear[!\)]#\)"
.Replacement.ClearFormatting
.Replacement.Font.Bold = True
.Replacement.Text = "^&"
.Format = True
.Forward = True
.Wrap = wdFindContinue
.MatchWildcards = True
.Execute Replace:=wdReplaceAll
End With