MS Word + VBA + RegExp: Get Page Number of Match

MS Word + VBA + RegExp: Get Page Number of Match - regex

Is that possible? Probably not? How can I then find all exact occurrences of a match and the according page numbers?
EDIT:
I have the regex working properly. What I need is for each match to get all the pages it appears on.
Example:
regex = \b\d{3}\b
123 appears on page 1,4,20
243 appear on page 3,5,7
523 appears on page 9
How can I get that information (all the pages a match occurs on?)
This is for creating some kind of index automatically.
EDIT 2:
I got a basic working version, snippet:
Set Matches = regExp.Execute(ActiveDocument.range.Text)
For Each Match In Matches
Set range = ActiveDocument.range(Match.FirstIndex, Match.FirstIndex + Len(Match.Value))
page = range.Information(wdActiveEndAdjustedPageNumber)
The problem is that Match.FirstIndex does not always point to the first character of the match in ActiveDocument.range. Word tables mess this up as ActiveDocument.range.Text contains characters that are not on the text put represent something in the table.

I think this probably fits better in SuperUser.
The answer to the question is "yes."
Selection.Information(wdActiveEndAdjustedPageNumber)
The above property in VBA will get you the page number of a selection.
Also, VBA can do some regular expression work.

This turned out to be rather complex and I can't say if my solution works for any document. The main issue is as indicated in the Question, that RegexMatch.FirstIndex can not be used to determine were the actually Match is within the MS Word Document. This is due to the fact that regex matching is done on range.Text property (String) and that string just contains different amount of characters than the range object does and hence Indexes don't match.
So my solution is for each match, I do a Find in the whole document for that match. the find methods gives a Range object from which the correct page can be determined.
In my special case a match could be the same thing also different value. Example: 343in my case would be the same as Prefix-343. A second issue was that the matches must be sorted eg 123before 324regardless which one occurs first in the document.
If you require the Sort Functionality you will also need the following to "modules":
SortDictionary Function:
http://www.cpearson.com/excel/CollectionsAndDictionaries.htm
Module "modQSortInPlace":
http://www.cpearson.com/Zips/modQSortInPlace.zip
If no sort is needed you don't need them but you need to remove the according function call SortDictionary Dict, Truefrom my code.
Now to my code. Soem parts you can remove, especially the formatting one. This is specific to my case. Also if your match is "unique", eg. not prefix or so you can simplify the code too. You will need to reference the "Microsoft Scripting Library".
Option Explicit
Sub ExtractRNumbers()
Dim Dict As Scripting.Dictionary
Set Dict = CreateObject("Scripting.dictionary")
Dim regExp, Match, Matches
Dim rNumber As String
Dim range As range
Set regExp = CreateObject("VBScript.RegExp")
regExp.Pattern = "\b(R-)?\d{2}-\d{4,5}(-\d)?\b"
regExp.IgnoreCase = False
regExp.Global = True
' determine main section, only extract R-Numbers from main section
' and not the Table of contents as example
' main section = section with most characters
Dim section As section
Dim maxSectionSize As Long
Dim sectionSize As Long
Dim sectionIndex As Integer
Dim currentIndex As Integer
maxSectionSize = 0
currentIndex = 1
For Each section In ActiveDocument.Sections
sectionSize = Len(section.range.text)
If sectionSize > maxSectionSize Then
maxSectionSize = sectionSize
sectionIndex = currentIndex
End If
currentIndex = currentIndex + 1
Next
Set Matches = regExp.Execute(ActiveDocument.Sections(sectionIndex).range.text)
For Each Match In Matches
' If the Document contains Tables, ActiveDocument.range.Text will contain
' BEL charachters (chr(7)) that probably define the table structure. The issue
' is that then Match.FirstIndex does not point to the actual first charachter
' of a Match in the Document.
' Also there are other things (unknwon) that lead to the same issue, eg.
' Match.FirstIndex can not be used to find the actual "matching word" within the
' document. Because of that below commented apporach does not work on a generic document
' Set range = ActiveDocument.range(Match.FirstIndex, Match.FirstIndex + Len(Match.Value))
' page = range.Information(wdActiveEndAdjustedPageNumber)
' Maybe there is a simpler solution but this works more or less
' the exception beign tables again. see http://support.microsoft.com/kb/274003
' After a match is found the whole document is searched using the find method.
' For each find result the page number is put into an array (if it is not in the array yet)
' Then the match is formatted properly.
' After formatting, it is checked if the match was previously already found
'
' If not, we add a new entry to the dictionary (key = formatted match, value = array of page numbers)
'
' If match was already found before (but potentially in a different format! eg R-87-1000 vs 87-1000 as example),
' all additional pages are added to the already found pages.
Set range = ActiveDocument.Sections(sectionIndex).range
With range.Find
.text = Match.Value
.MatchWholeWord = True
.MatchCase = True
.Wrap = wdFindStop
End With
Dim page As Variant
Dim pages() As Integer
Dim index As Integer
index = 0
ReDim pages(0)
Do While range.Find.Execute() = True
page = range.Information(wdActiveEndAdjustedPageNumber)
If Not IsInArray(page, pages) Then
ReDim Preserve pages(index)
pages(index) = page
index = index + 1
End If
Loop
' FORMAT TO PROPER R-NUMBER: This is specific to my case
rNumber = Match.Value
If Not rNumber Like "R-*" Then
rNumber = "R-" & rNumber
End If
' remove possible batch number as r-number
If Len(rNumber) > 11 Then
rNumber = Left(rNumber, Len(rNumber) - 2)
End If
' END FORMAT
If Not Dict.Exists(rNumber) Then
Dict.Add rNumber, pages
Else
Dim existingPages() As Integer
existingPages = Dict(rNumber)
For Each page In pages
If Not IsInArray(page, existingPages) Then
' add additonal pages. this means that the previous match
' was formatted different, eg R-87-1000 vs 87-1000 as example
ReDim Preserve existingPages(UBound(existingPages) + 1)
existingPages(UBound(existingPages)) = page
Dict(rNumber) = existingPages
End If
Next
End If
Next
'sort dictionary by key (R-Number)
SortDictionary Dict, True
Dim fso
Set fso = CreateObject("Scripting.FileSystemObject")
Dim stream
' Create a TextStream.
Set stream = fso.CreateTextFile(ActiveDocument.Path & "\" & ActiveDocument.Name & "-rNumbers.txt", True)
Dim key As Variant
Dim output As String
Dim i As Integer
For Each key In Dict.Keys()
output = key & vbTab
pages = Dict(key)
For i = LBound(pages) To UBound(pages)
output = output & pages(i) & ", "
Next
output = Left(output, Len(output) - 2)
stream.WriteLine output
Next
Set Dict = Nothing
stream.Close
End Sub
Private Function IsInArray(page As Variant, pages As Variant) As Boolean
Dim i As Integer
IsInArray = False
For i = LBound(pages) To UBound(pages)
If pages(i) = page Then
IsInArray = True
Exit For
End If
Next
End Function

Related

How to find and format multiple matching words with Regex in Word document using VB script

I have a word document in which I have to do the formatting of the words using VB script. The text can be as follows :
hello <bu ABC bu>, We are pleased to confirm our offer of employment to you. The terms and conditions that will apply to your employment with are set forth in this letter and Exhibit A attached hereto and incorporated herein by reference together, the “Agreement”
You have been offered and accepted the position of , presently reporting to . Your start date is expected to be
The words which are written inside tag needs to be bold and underlined. Currently I have written a VBscript which will find the text given as argument and make it bold and underline as required.
But to make the solution/script more dynamic, I want the script to match Regular Expression pattern which I have written : (?<=(<bu))[a-zA-Z0-9 -:/\[]()]+(?=(bu>))
The script I have written :
Option Explicit
Function Macro1()
Dim strFilePath
strFilePath = "C:\Users\<UserID>\Documents\OfferLetterTemplate.docx"
Dim strTextToReplace
strTextToReplace = "<bu XYZ bu>"
Dim Word, objDoc, objSelection
Set Word = CreateObject("Word.Application")
Word.Visible = True
Dim wordfile
Set wordfile = Word.Documents.Open(strFilePath)
Set objDoc = Word.ActiveDocument
Set objSelection = Word.Selection
objSelection.Find.Forward = True
objSelection.Find.MatchWholeWord = False
objSelection.Find.ClearFormatting
objSelection.Find.Replacement.ClearFormatting
objSelection.Find.Replacement.Font.Bold = True
objSelection.Find.Replacement.Font.Underline = True
objSelection.Find.Text = strTextToReplace
objSelection.Find.Replacement.Text = ""
objSelection.Find.Execute , , , , , , , 0, , , 2
wordfile.save
Word.Quit
End Function
call Macro1
Can someone help me how I can search for the RegEx which I have given above and format all the matching occurrences at once?

Optional parts of regex pattern in vba

I am trying to build regex pattern for the text like that
numb contra: 1.29151306 number mafo: 66662308
numb contra 1.30789668number mafo 60.046483
numb contra/ 1.29154056 number mafo: 666692638
numb contra 137459625
mafo: 666692638
mafo: 666692638 numb contra/ 1.29154056
Here's the pattern I could build
contra?.\s+?(\d+\.?\d+)(.+mafo.?\s+(\d+\.?\d+))?
It works fine for all the lines except the last one. How can I implement all the possibilities to include the last line too?
Please have a look at this link
https://regex101.com/r/pSThAU/1
All is OK as for contra but not as for mafo

I think the key here is to make your regexp do less and your vba do more. What I think I see here is either the word 'mafo' or 'contra' and a number following. Don't know what order or whether each is present or how many times. So you can scan each of your strings for ALL occurrences with a regexp like this:
(?:^|[^A-Z])(?:(mafo)|(contra))[^A-Z]\s*(\d*\.?\d+)
Then process it with some VBA code like this that I created in Excel:
Sub BreakItUp()
Dim rg As RegExp, scanned As MatchCollection, eachMatch As Match, i As Long, col As Long
Set rg = New RegExp
rg.Pattern = "(?:^|[^A-Z])(?:(mafo)|(contra))[^A-Z]\s*(\d*\.?\d+)"
rg.IgnoreCase = True
rg.Global = True
i = 1
Do While (Not IsEmpty(ActiveSheet.Cells(i, 1).Value))
Set scanned = rg.Execute(ActiveSheet.Cells(i, 1).Value)
col = 2
For Each eachMatch In scanned
ActiveSheet.Cells(i, col).Value = eachMatch.SubMatches(0) & eachMatch.SubMatches(1)
ActiveSheet.Cells(i, col + 1).Value2 = "'" & eachMatch.SubMatches(2)
col = col + 2
Next eachMatch
i = i + 1
Loop
End Sub
That MatchCollection object will get one item for each Match that occurs and the subMatches array contains each capturing group. You should be able write your own logic within this processing loop to interpret what was extracted. When I ran it on your data it created all the fields in blue:
Notice I added a line to your data that had two contra entries and one mafo and it found all the occurrences. You should be able to modify this to interpret the meanings.

Changing formulas on the fly with VBA RegEx

i'm trying to change formulas in excel, i need to change the row number of the formulas.
I'm trying do use replace regex to do this. I use an loop to iterate through the rows of the excel and need to change the formula for the row that is iterating at the time. Here is an exemple of the code:
For i = 2 To rows_aux
DoEvents
Formula_string= "=IFS(N19='Z001';'xxxxxx';N19='Z007';'xxxxxx';0=0;'xxxxxxx')"
Formula_string_new = regEx.Replace(Formula_string, "$1" & i)
wb.Cells(i, 33) = ""
wb.Cells(i, 33).Formula = Formula_string_new
.
.
.
Next i
I need to replace rows references but not the ones in quotes or double quotes. Example:
If i = 2 i want the new string to be this:
"=IFS(N2='Z001';'xxxxxx';N2='Z007';'xxxxxx';0=0;'xxxxxxx')"
I'm trying to use this regex:
([a-zA-Z]+)(\d+)
But its changing everything in quotes too. Like this:
If i = 2:
"=IFS(N2='Z2';'xxxxxx';N2='Z2';'xxxxxx';0=0;'xxxxxxx')"
If anyone can help me i will be very grateful!
Thanks in advance.

As others have written, there are probably better ways to write this code. But for a regex that will capture just the Column letter in capturing group #1, try:
\$?\b(XF[A-D]|X[A-E][A-Z]|[A-W][A-Z]{2}|[A-Z]{2}|[A-Z])\$?(?:104857[0-6]|10485[0-6]\d|1048[0-4]\d{2}|104[0-7]\d{3}|10[0-3]\d{4}|[1-9]\d{1,5}|[1-9])d?
Note that is will NOT include the $ absolute addressing token, but could be altered if that were necessary.
Note that you can avoid the loop completely with:
Formula_string = "=IFS(N19=""Z001"",""xxxxxx"",N$19=""Z007"",""xxxxxx"",0=0,""xxxxxxx"")"
Formula_string_new = regEx.Replace(Formula_string, "$1" & firstRow)
With Range(wb.Cells(firstRow, 33), wb.Cells(lastRow, 33))
.Clear
.Formula = Formula_string_new
End With
When we write a formula to a range like this, the references will automatically adjust the way you were doing in your loop.
Depending on unstated factors, you may want to use the FormulaLocal property vice the Formula property.
Edit:
To make this a little more robust, in case there happens to be, within the quote marks, a string that exactly mimics a valid address, you can try checking to be certain that a quote (single or double) neither precedes nor follows the target.
Pattern: ([^"'])\$?\b(XF[A-D]|X[A-E][A-Z]|[A-W][A-Z]{2}|[A-Z]{2}|[A-Z])\$?(?:104857[0-6]|10485[0-6]\d|1048[0-4]\d{2}|104[0-7]\d{3}|10[0-3]\d{4}|[1-9]\d{1,5}|[1-9])d?\b(?!['"])
Replace: "$1$2" & i
However, this is not "bulletproof" as various combinations of included data might match. If it is a problem, let me know and I'll come up with something more robust.

If you can identify some unique features like in the example preceding bracket ( or colon ; and trailing equal = then this might work
Sub test()
Dim s As String, sNew As String, i As Long
Dim Regex As Object
Set Regex = CreateObject("vbscript.regexp")
With Regex
.Global = True
.MultiLine = False
.IgnoreCase = True
.Pattern = "([(;][a-zA-Z]{1,3})(\d+)="
End With
i = 1
s = "=IFS(NANA19='Z001';'xxxxxx';NA19='Z007';'xxxxxx';0=0;'xxxxxxx')"
sNew = Regex.Replace(s, "$1" & i & "=")
Debug.Print s & vbCr & sNew
End Sub

Excel VBA using RegEx for Conditional Formating

I have an Excel 2010 VBA macro that does some conditional formatting over a select area of a spreadsheet. As an example the following snippet searches for a text pattern then colors the cell:
Selection.FormatConditions.Add Type:=xlTextString, String:="TextToMatch", _
TextOperator:=xlContains Selection.FormatConditions(Selection.FormatConditions.Count).SetFirstPriority
With Selection.FormatConditions(1).Interior
.PatternColorIndex = xlAutomatic
.ColorIndex = 36
.TintAndShade = 0
End With
Selection.FormatConditions(1).StopIfTrue = False
What I would like to add is to match against a regular expression TN[0-9]. A simple match of the string TN followed by a digit.
I have created the RegExp obect:
Dim regEx As Object
Set regEx = CreateObject("VBScript.RegExp")
With regEx
.Pattern = "TN[0-9]"
End With
However I have not figured out how to apply this to the Selection.
As always, thank you for your assistance.

I would recommend using a Static type object for your VBScript.RegExp object.
Cut the range passed into the function down to the Worksheet.UsedRange property. This allows a selection of full columns without calculating empty rows/columns.
Option Explicit
Sub createCFR()
With Selection
'cut Selection down to the .UsedRange so that full row or full
'column references do not use undue calculation
With Intersect(.Cells, .Cells.Parent.UsedRange)
.FormatConditions.Delete
With .FormatConditions.Add(Type:=xlExpression, Formula1:="=myCFR(" & .Cells(1).Address(0, 0) & ")")
.SetFirstPriority
With .Interior
.PatternColorIndex = xlAutomatic
.ColorIndex = 36
.TintAndShade = 0
End With
.StopIfTrue = False
End With
End With
End With
End Sub
Function myCFR(rng As Range)
Static rgx As Object
'with rgx as static, it only has to be created once
'this is beneficial when filling a long column with this UDF
If rgx Is Nothing Then
Set rgx = CreateObject("VBScript.RegExp")
End If
'make sure rng is a single cell
Set rng = rng.Cells(1, 1)
With rgx
.Global = True
.MultiLine = True
.Pattern = "TN[0-9]"
myCFR = .Test(rng.Value2)
End With
End Function
Depending on your Selection, you may need to modify the parameters of the Range.Address property used to create the CFR; e.g. $A1 would be .Address(1, 0).
In the following image, B2:B7 contain =myCFR(A2) filled down to proof the UDF.

How to capture value between two strings in VB.NET

i'm trying to capture a value between two strings using VB.NET
Each line from the file i'm reading in from can contain many different parameters, in any order, and I'd like to store the values of these parameters in their own variables. Two sample lines would be:
identifier="121" messagecount="112358" timestamp="11:31:41.622" column="5" row="98" colour="ORANGE" value="Hello"
or it could be:
identifier="1121" messagecount="1123488" timestamp="19:14:41.568" valid="true" state="running"
Also, this may not be the sole text in the string, there may be other values before and after (and in between) the parameters i would like to capture.
So essentially i'd need to store everything between 'identifier="' and it's closing '"' into an identifier variable, and so on... As the order of these parameters within each line can change, i can't simply stick the first value in one variable each time, I have to refer to them specifically by what their name is (identifier, messagecount) etc.
Can anyone help? Thanks. I guess it would be via a regular expression, but i'm not too hot on those. I'd prefer to have each expression for each paramater within it's own statement, rather than being all in one, thanks.

Here is a sample how you can go about that. It converts one line into a dictionary.
This will capture any string consisting of a-z-characters (case-insensitive) as the attribute name, and then catch any character other than " in the value string. (If " can occur in the string as "" you need to add some treatment for that.)
Imports System.Text.RegularExpressions
[...]
Dim s As String =
"identifier=""121"" messagecount=""112358"" " &
"timestamp=""11:31:41.622"" column=""5"" row=""98"" " &
"colour=""ORANGE"" value=""Hello"""
Dim d As New Dictionary(Of String, String)
Dim rx As New Regex("([a-z]+)=""(.*?)""", RegexOptions.IgnoreCase)
Dim rxM As MatchCollection = rx.Matches(s)
For Each M As Match In rxM
d.Add(M.Groups(1).Value, M.Groups(2).Value)
Next
' Dictionary is ready
' test output
For Each k As String In d.Keys
MsgBox(String.Format("{0} => {1}", k, d(k)))
Next

You just need to split the data into manageable clumps, and then go through it. Something like this to start you off.
Private Sub ProcessMyData(LineOfData As String)
' NOTE! This assumes all your 'names' have no spaces in!
Dim vElements = LineOfData.Split({" "c}, StringSplitOptions.RemoveEmptyEntries)
For Each vElement In vElements
Dim vPair = vElement.Split({"="c})
Dim vResult = vPair(1).Trim(Convert.ToChar(34))
Select Case vPair(0).ToLower
Case "identifier"
MyIDVariable = CInt(vResult)
Case "colour"
MyColourVariable = vResult
' etc., etc.
End Select
Next
End Sub
You can define the variables you want locally in the sub [function], and then return a list/dictionary/custom class of the things you're interested in.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

MS Word + VBA + RegExp: Get Page Number of Match - regex

I think this probably fits better in SuperUser. The answer to the question is "yes." Selection.Information(wdActiveEndAdjustedPageNumber) The above property in VBA will get you the page number of a selection. Also, VBA can do some regular expression work.

Related

How to find and format multiple matching words with Regex in Word document using VB script

Optional parts of regex pattern in vba

Changing formulas on the fly with VBA RegEx

Excel VBA using RegEx for Conditional Formating

How to capture value between two strings in VB.NET

Categories

Resources