Regex with hyperlinks ActiveDocument.Range and Format - regex

I have seen several other StackOverflow posts such as this one:
How to Use/Enable (RegExp object) Regular Expression using VBA (MACRO) in word
on using regular expressions in Microsoft Word with VBA using the Microsoft VB script Regular Expressions 5.5 Reference.
That helped me prepare the following, which I use in Word to highlight US Dollar amounts:
Sub dollarHighlighter()
Set regExp = New regExp
Dim objMatch As Match
Dim colMatches As MatchCollection
Dim offsetStart As Long
offsetStart = Selection.Start
regExp.Pattern = "\$([\,\d{1,3}]*(?:\.\d{2})?)"
regExp.Global = True
Set colMatches = regExp.Execute(Selection.Text) ' Execute search.
For Each objMatch In colMatches ' Iterate Matches collection.
Set myRange = ActiveDocument.Range(objMatch.FirstIndex + offsetStart,
End:=offsetStart + objMatch.FirstIndex + objMatch.Length)
myRange.FormattedText.HighlightColorIndex = wdYellow
Next
End Sub
While this works as expected on a list of dollar amounts within text (for the most part - among its imperfections the regex is intentionally a bit loose) it does not work as anticipated when there are hyperlinks present in the Word document.
In that instance, there appears to be a shift in offset of the highlighted characters in a somewhat unpredictable manner. I assume this is because there is a lot of new xml/css in the document.xml source file.
Ultimately, my overarching questions is, can I use regex to highlight word document contents even if it contains hyperlinks? Is it an offset question or should I run the regex on the compressed xml, re compress and reopen for better results? As when I test various regex variations on the source code, I get the anticipated results, but not when formatting what would be the Word range.
I have also asked this here: https://social.msdn.microsoft.com/Forums/en-US/3a95c5e4-9e0c-4da9-970f-e0bf801c3170/macro-for-a-regexp-search-replace?forum=isvvba&prof=required but realize it was an ancient post...
Per question below, here are some possibly helpful links:
an example document
http://www.smithany.com/test.docx
step 1
http://www.smithany.com/wordusd1.jpg
Step 2
http://www.smithany.com/wordhighlighterrun.jpg
and what happens
http://www.smithany.com/whatactuallyhappens.jpg
Temporary Workaround: As suggested below Word's Wildcard find is fast if you do not stack the loops. try this:
Sub Macro2()
Selection.Find.ClearFormatting
Selection.Find.Replacement.ClearFormatting
Selection.Find.Replacement.highlight = True
With Selection.Find
.Text = "$[0-9,]{1,}"
.Replacement.Text = ""
.Forward = True
.Wrap = wdFindContinue
.Format = True
.MatchCase = False
.MatchWholeWord = False
.MatchAllWordForms = False
.MatchSoundsLike = False
.MatchWildcards = True
End With
Selection.Find.Execute Replace:=wdReplaceAll
Selection.Find.ClearFormatting
Selection.Find.Replacement.ClearFormatting
Selection.Find.Replacement.highlight = True
With Selection.Find
.Text = "$[0-9,]{1,}.[0-9]{2,3}"
.Replacement.Text = ""
.Forward = True
.Wrap = wdFindContinue
.Format = True
.MatchCase = False
.MatchWholeWord = False
.MatchAllWordForms = False
.MatchSoundsLike = False
.MatchWildcards = True
End With
Selection.Find.Execute Replace:=wdReplaceAll
End Sub
which basically gets all the dollar amounts highlighted. That said, complex expressions like matching a variety of date formats could get messy, but i suppose it is all possible doing them one step at a time.

I had not touched VBA for years but I guess it's like bicycling.
Anyways, here is a sub that should help you out. It's based on Cindy Meister sound recommendation and fills the gap between Regex and Wildcard Find using a collection of match patterns for optional parts.
First, the Wildcard matches: $[0-9,]{1,} and $[0-9,]{1,}.[0-9]{2}
It's not that different after all, isn't it? However, to take the optional fraction part into account I have to use two patterns.
And here's the routine:
Sub WildcardsHighlightWords()
Dim Word As Range
Dim WildcardCollection(2) As String
Dim Words As Variant
WildcardCollection(0) = "$[0-9,]{1,}"
WildcardCollection(1) = "$[0-9,]{1,}.[0-9]{2}"
Options.DefaultHighlightColorIndex = wdYellow
'Clear existing formatting and settings in Find
Selection.Find.ClearFormatting
Selection.Find.Replacement.ClearFormatting
'Set highlight to replace setting.
Selection.Find.Replacement.Highlight = True
'Cycle through document and find wildcards patterns, highlight words when found
For Each Word In ActiveDocument.Words
For Each WildcardsPattern In WildcardCollection
With Selection.Find
.Text = WildcardsPattern
.Replacement.Text = ""
.Forward = True
.Wrap = wdFindContinue
.Format = True
.MatchCase = False
.MatchWholeWord = False
.MatchWildcards = True
.MatchSoundsLike = False
.MatchAllWordForms = False
End With
Selection.Find.Execute Replace:=wdReplaceAll
Next
Next
End Sub
It should be easy to extend or modify this approach if needed.
This highlithts the Dollar amounts as desired on my end:
Note: The separator in the quantifier {n,m} is not the same in all localizations, e.g. it's {n;m} in the German version.

Related

Find and highlight non-Latin code-page characters in a Word document

I have looked for this on the net but unfortunately no joy. I can find a lot of examples of find and replace but not much on find and highlight with regex.
What I want to do is look for non-Latin code-page characters in a MS Word document and highlight them. I thought the best approach would be to use regex (open to suggestions if that's not the case). Below code high lights the whole document:
Sub Highlight_Words()
Dim oRE As New RegExp: oRE.Pattern = "[^a-zA-Z0-9:]"
Dim oM As Match
Application.ScreenUpdating = False
Options.DefaultHighlightColorIndex = wdRed
With ActiveDocument.Content.Find
.ClearFormatting
.Text = oRE.Pattern
'.Text = "[^a-zA-Z0-9\s:]"
'.Text = "[a-zA-Z\d\s:]"
'.Text = " "
With .Replacement
.Text = "^&"
.ClearFormatting
.Highlight = True
End With
.Forward = True
.Wrap = wdFindContinue
.Format = True
.MatchWildcards = True
.Execute Replace:=wdReplaceAll
End With
Application.ScreenUpdating = True
End Sub
Any help would be much appreciated, thanks
P.S. I am working on Windows 7 (64 bit) and Word 2013
Update1:
Below is a sample text :
This is just a sample text to test highlighting of non-alphanumeric
characters (i.e. characters that are not English language characters
(i.e. À) and not numbers). There are exceptions to this rule like
apostrophe (“2’”) or colon (“:”) or hyphen (“-“). But I can add these
exceptions once I have the main pattern to search for non-alphanumeric
characters
So from the above sample text, À should be highlighted (I did have other characters in the text but unfortunately they are not displaying on the site)
You can't use RegEx inside a Word document; RegEx can only run on a string you'd extract from the document (assign to a String variable). But location in the document and any formatting information is lost. Any search must be done using Word's built-in wildcard functionality which is similar to RegEx but not the same and not as "powerful".
Basically, the requirement appears to be anything in the "standard" ANSI codes through ANSI 126, given a "Latin" code page. (You can see the characters and their codes using Word's Insert Symbol (Insert/Symbol, More Symbols) dialog box.
Testing with the provided sample text, this search string works: [!^011-^0126]. This searches for anything not in the character range ANSI 11 (Word's new line character) through ANSI 126 (lower case z). Additional characters with higher ANSI codes can be appended after 126, as literal characters.
Put into the code in the question:
Sub Highlight_Words()
Dim Pattern As String
Pattern = "[!^011-^0126]"
Application.ScreenUpdating = False
Options.DefaultHighlightColorIndex = wdRed
With ActiveDocument.Content.Find
.ClearFormatting
.Text = Pattern
With .Replacement
.Text = "^&"
.ClearFormatting
.Highlight = True
End With
.Forward = True
.Wrap = wdFindContinue
.Format = True
.MatchWildcards = True
.Execute Replace:=wdReplaceAll
End With
Application.ScreenUpdating = True
End Sub

Use VBA to replace RegEx strings with hyperlinks in Word

I'm trying to search through a word document and replace regex matches with a series of static hyperlinks.
For example, if I get a regex match for "A-1", I want to replace the "A-1" string with a hyperlink with Anchor = "A-1" and Address = "https://www.my_website.com/A-1". My RegEx matches could be "A-1", "A-2", "A-3", etc.
I'm familiar with RegEx but I'm very new to VBA. What I have so far:
Sub FindAndHyperlink()
Dim RegEx As Object
Set RegEx = CreateObject("VBScript.RegExp")
RegEx.Global = True
RegEx.Pattern = "([A][\-])([0-9])"
Set Matches = RegEx.Execute(ActiveDocument.Range.Text)
For Each Match In Matches
ActiveDocument.Range.Text = RegEx.Replace(ActiveDocument.Range.Text, (ActiveDocument.Hyperlinks.Add Anchor:=Match, Address:="https://www.my_website.com/" & Match))
Next
End Sub
This doesn't compile because it's expecting a ) after ActiveDocument.Hyperlinks.Add.
I think the problem is that the RegEx.Replace() method is expecting (String, String) arguments rather than (String, Hyperlink object), but I'm not sure the best way to get around that.
Any help would be appreciated.
Try:
Sub FindAndHyperlink()
Application.ScreenUpdating = False
Const HLnk As String = "https://www.my_website.com/"
With ActiveDocument.Range
With .Find
.ClearFormatting
.Replacement.ClearFormatting
.Text = "A-[0-9]{1,}"
.Replacement.Text = ""
.Forward = True
.Wrap = wdFindStop
.Format = False
.MatchWildcards = True
.Execute
End With
Do While .Find.Found
.Hyperlinks.Add Anchor:=.Duplicate, Address:=HLnk & .Text, TextToDisplay:=.Text
.Start = .Hyperlinks(1).Range.End
.Find.Execute
Loop
End With
Application.ScreenUpdating = True
End Sub
I just tested this and it worked fine for me.
([a-z _:/.A-Z-1-9]+)
Notice the UPPERCASE and lowercase, plus the 1-9 (one through nine, inclusive), as well as the characters such as slash and period.

Creating a RegEx to find a sentence with a parenthetical acronym (VBasic Word)

I'm writing a script that scrubs a document to find acronyms in the format (USA). As a processing tool I need to grab the entire sentence in which that parenthetical acronym appears. Right now my code for finding the acronym is:
With oRange.Find
.Text = "\([A-Z]{2,}\)"
.Forward = True
.Wrap = wdFindStop
.Format = False
.MatchCase = True
.MatchWildcards = True
Combining this with a Do While .Execute I can comb the doc and find the acronyms, then using a string function I take the acronym out of the parentheses and put it in a table. Is there a RegEx that I could use which would find any sentence an (USA) type acronym is in? As an input you could use this paragraph.
Thank you very much.
edit: I found the following Regex to try and make it work:
.Text = "[^.]*\([A-Z]{2,}\)[^.]*\."
But this is giving me an error, saying that the carrot can't be used in the Find function.
I didn't managed to use that regular expression directly in the .Find method, so using Regex directly :
Sub AcronymFinder()
Dim Para As Paragraph
Set Para = ThisDocument.Paragraphs.First
Dim ParaNext As Paragraph
Dim oRange As Range
Set oRange = Para.Range
Dim regEx As New RegExp
Dim ACrO As String
With regEx
.Global = True
.MultiLine = True
.IgnoreCase = False
.Pattern = ".*[^.]*([\(][A-Z]{2,}[\)])[^.]*[\.]"
End With
Do While Not Para Is Nothing
Set ParaNext = Para.Next
Set oRange = Para.Range
'Debug.Print oRange.Text
If regEx.test(oRange.Text) Then
ACrO = CStr(regEx.Execute(oRange.Text)(0))
'Debug.Print ACrO
With oRange.Find
.Text = ACrO
.Forward = True
.Wrap = wdFindStop
.Format = False
.MatchCase = True
.MatchWildcards = False
.Execute
End With
Else
End If
Set Para = ParaNext
Loop
End Sub
to use it, remember to add the reference :
Description: Microsoft VBScript Regular Expressions 5.5
FullPath: C:\windows\SysWOW64\vbscript.dll\3
Major.Minor: 5.5
Name: VBScript_RegExp_55
GUID: {3F4DACA7-160D-11D2-A8E9-00104B365C9F}
This regex
[^046][!^046]*\([A-Z]{2,10}\)[!^046]*[^046]
when used in the Find dialog will find a sentence (bounded by full stops ^046).
Note that this regex returns a string with full stops on both ends, e.g.,
. A three-letter acronym (TLA) was used.
Also note that I limited acronym length to 10 chars [A-Z]{2,10}; change the upper limit as needed.
Finally I observed that this DOES NOT find acronyms at the end of a sentence, e.g.
I used a three-letter acronym (TLA).
The [!^046]* part of the regex does not appear to match a zero length string. To catch those cases you would need to do a second pass search with this:
[^046][!^046]*\([A-Z]{2,10}\)[^046]
Hope that helps

MS Word VBA macro to search and replace (Regex)

Suppose a word file contains
ab{cdefg{hij{k
And I want { to be moved one place to the right like
abc{defgh{ijk{
I need to make an array with all characters then run a loop with Regex search and replace
search:
({)(array[index])
replace:
$2$1
Plain Regex without loop won't work because I'm dealing with Indic text which have complex characters. I've done this on JavaScript and ExtendScript in inDesign, but I've no clue about VB. Can anyone please help?
This can be done using a Word wildcard search-and-replace:
With ActiveDocument.Range.Find
.ClearFormatting
.Replacement.ClearFormatting
.ClearAllFuzzyOptions
.Text = "(\{)(?)" ' find opening brace followed by a single character
.Replacement.Text = "\2\1" ' swap positions
.Forward = True
.Wrap = wdFindContinue
.Format = False
.MatchCase = False
.MatchWholeWord = False
.MatchByte = False
.MatchAllWordForms = False
.MatchSoundsLike = False
.MatchFuzzy = False
.MatchWildcards = True
End With
ActiveDocument.Range.Find.Execute Replace:=wdReplaceAll

MS word find and replace: Change how to replace based on captured values in Find's RegEx

I am working on bilingual documents (Arabic-English), The two languages have different directions (RTL and LTR respectively) which makes working with it a bit more challenging.
I am writing a macro to change numbers of the form (x.x) to (x,x), a change in the comma.
now here is the problem, sometimes the two numbers are switched when I do the replace, for example: x.y becomes y,x.
after some debugging it turned out that sometimes this (.) in the first form is an Arabic character and when it is replaced with (,) which is an English one it is causing this change in order.
so I want to do is the following, but I don't know how to translate it to VBA:
1- look for a match to the expression ([0-9]{1, }).([0-9]{1, }) 'two numbers with a dot in between
2- if the dot in between is English, replace as follows \1,\2 'no change
3- else if it is Arabic, replace as follows \2,\1
Thanks for your help
EDIT
here is the current version of the macro, it is a recorded macro. I just added the if else statement.
Sub fixComma()
'
' fixComma Macro
'
'
If (Selection.Start <> Selection.End) Then
Selection.Find.ClearFormatting
Selection.Find.Replacement.ClearFormatting
Selection.Find.Replacement.LanguageID = wdEnglishUS
With Selection.Find
.Text = "([0-9]).([0-9])"
.Replacement.Text = "\1,\2"
.Forward = True
.Wrap = wdFindStop
.Format = True
.MatchCase = False
.MatchWholeWord = False
.MatchKashida = False
.MatchDiacritics = False
.MatchAlefHamza = False
.MatchControl = False
.MatchByte = False
.MatchAllWordForms = False
.MatchSoundsLike = False
.MatchWildcards = True
End With
Selection.Find.Execute Replace:=wdReplaceAll
Else
MsgBox "Nothing is selected, Macro terminated"
End If
End Sub
EDIT 2
Maybe I am taking the wrong approach here and I should go through the paragraphs myself and check for a match for the RegEx instead of using Find and Replace ?
EDIT 3
Maybe I am using the wrong tool here and VBA can't help me with this ? if so is there anything else I can use ? maybe a C# add in for example.