vb.net regex parse paragraph from report

I am given a report in plain text from which a coworker typically has to manually remove various headers. I know the top and bottom lines of each header; they do not change throughout the document, but the lines of text between them do.
Formatting looks like this:
BEGIN REPORT FOR CLIENT XXYYZZ
RANDOM BODY TEXT
RANDOM BODY TEXT
RANDOM BODY TEXT
RANDOM BODY TEXT
RANDOM BODY TEXT
FINAL REPORT
I am attempting to use regular expressions to highlight this text within a rich text box. If I use the below code I can highlight every occurrence of the top line without issue:
Dim mystring As String = "(BEGIN)(.+?)(XXYYZZ)"
Dim regHeader As New Regex(mystring)
Dim regMatch As Match = regHeader.Match(rtbMain.Text)
While regMatch.Success
    rtbMain.Select(regMatch.Index, regMatch.Length)
    rtbMain.SelectionColor = Color.Blue
    regMatch = regMatch.NextMatch()
End While
However, once I change the code to find the entire paragraph, it no longer highlights anything. Below is what I expected to work, but it does not seem to like it for whatever reason:
Dim mystring As String = "(BEGIN REPORT FOR CLIENT XXYYZZ)(.+?)(FINAL REPORT)"
Dim regHeader As New Regex(mystring)
Dim regMatch As Match = regHeader.Match(rtbMain.Text)
While regMatch.Success
    rtbMain.Select(regMatch.Index, regMatch.Length)
    rtbMain.SelectionColor = Color.Blue
    regMatch = regMatch.NextMatch()
End While
Any help would be greatly appreciated.

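What you need is singleline mode. By default, . matches any character except a newline, so (.+?) cannot span the lines between the header's first and last line. Singleline mode lets . match newlines as well.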
Try this:
Dim mystring As String = "(BEGIN REPORT FOR CLIENT XXYYZZ)(.+?)(FINAL REPORT)"
Dim regHeader As New Regex(mystring, RegexOptions.Singleline)
Dim regMatch As Match = regHeader.Match(rtbMain.Text)
While regMatch.Success
    rtbMain.Select(regMatch.Index, regMatch.Length)
    rtbMain.SelectionColor = Color.Blue
    regMatch = regMatch.NextMatch()
End While
Notice RegexOptions.Singleline.
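To see the difference in isolation, here is a minimal console sketch. It assumes a short sample string in place of rtbMain.Text; the module name and the sample text are illustrative and not part of the original code.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SinglelineDemo
    Sub Main()
        ' Hypothetical sample text mirroring the report layout in the question.
        Dim report As String =
            "BEGIN REPORT FOR CLIENT XXYYZZ" & vbLf &
            "RANDOM BODY TEXT" & vbLf &
            "RANDOM BODY TEXT" & vbLf &
            "FINAL REPORT" & vbLf

        Dim pattern As String = "(BEGIN REPORT FOR CLIENT XXYYZZ)(.+?)(FINAL REPORT)"

        ' Default options: . stops at the newline, so the multi-line header cannot match.
        Console.WriteLine(Regex.IsMatch(report, pattern))

        ' Singleline: . also matches the newline, so the whole header matches.
        Console.WriteLine(Regex.IsMatch(report, pattern, RegexOptions.Singleline))
    End Sub
End Module
```

The first call should print False and the second True. Because the quantifier is lazy (.+?), each match ends at the nearest FINAL REPORT rather than running on to the last one in the document.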

Related

How to find and format multiple matching words with Regex in Word document using VB script

I have a Word document in which I need to format certain words using VBScript. The text can look like this:
hello <bu ABC bu>, We are pleased to confirm our offer of employment to you. The terms and conditions that will apply to your employment with are set forth in this letter and Exhibit A attached hereto and incorporated herein by reference together, the “Agreement”
You have been offered and accepted the position of , presently reporting to . Your start date is expected to be
The words written inside the tag need to be bold and underlined. Currently I have a VBScript that finds the text given as an argument and makes it bold and underlined as required.
But to make the script more dynamic, I want it to match a regular expression pattern I have written: (?<=(<bu))[a-zA-Z0-9 -:/\[]()]+(?=(bu>))
The script I have written:
Option Explicit
Function Macro1()
    Dim strFilePath
    strFilePath = "C:\Users\<UserID>\Documents\OfferLetterTemplate.docx"
    Dim strTextToReplace
    strTextToReplace = "<bu XYZ bu>"
    Dim Word, objDoc, objSelection
    Set Word = CreateObject("Word.Application")
    Word.Visible = True
    Dim wordfile
    Set wordfile = Word.Documents.Open(strFilePath)
    Set objDoc = Word.ActiveDocument
    Set objSelection = Word.Selection
    objSelection.Find.Forward = True
    objSelection.Find.MatchWholeWord = False
    objSelection.Find.ClearFormatting
    objSelection.Find.Replacement.ClearFormatting
    objSelection.Find.Replacement.Font.Bold = True
    objSelection.Find.Replacement.Font.Underline = True
    objSelection.Find.Text = strTextToReplace
    objSelection.Find.Replacement.Text = ""
    objSelection.Find.Execute , , , , , , , 0, , , 2
    wordfile.save
    Word.Quit
End Function
call Macro1
Can someone help me search for the regex given above and format all the matching occurrences at once?

Why does Find/Replace zRngResult.Find work fine, but RegEx myRegExp.Execute(zRngResult) mess up the range.Start?

I wish to select certain words, e.g. “not”, “never”, “don’t”, in the sentences of a Word document and add comments after them with VBA. Find/Replace with wildcards works fine, but “Use wildcards” cannot be selected together with “Match case”. RegEx can set “IgnoreCase=True”, but the selection of the word is not reliable when there is more than one comment in a sentence. Range.Start seems to be modified in a way that I cannot understand.
A similar question was asked in June 2010. https://social.msdn.microsoft.com/Forums/office/en-US/f73ca32d-0af9-47cf-81fe-ce93b13ebc4d/regex-selecting-a-match-within-the-document?forum=worddev
Is there a new/different way of solving this problem?
Any suggestion will be appreciated.
The code using RegEx follows:
Function zRegExCommentor(zPhrase As String, tComment As String) As Long
    Dim sTheseSentences As Sentences
    Dim rThisSentenceToSearch As Word.Range, rThisSentenceResult As Word.Range
    Dim myRegExp As RegExp
    Dim myMatches As MatchCollection
    Options.CommentsColor = wdByAuthor
    Set myRegExp = New RegExp
    With myRegExp
        .IgnoreCase = True
        .Global = False
        .Pattern = zPhrase
    End With
    Set sTheseSentences = ActiveDocument.Sentences
    For Each rThisSentenceToSearch In sTheseSentences
        Set rThisSentenceResult = rThisSentenceToSearch.Duplicate
        rThisSentenceResult.Select
        Do
            DoEvents
            Set myMatches = myRegExp.Execute(rThisSentenceResult)
            If myMatches.Count > 0 Then
                rThisSentenceResult.Start = rThisSentenceResult.Start + myMatches(0).FirstIndex
                rThisSentenceResult.End = rThisSentenceResult.Start + myMatches(0).Length
                rThisSentenceResult.Select
                Selection.Comments.Add Range:=Selection.Range
                Selection.TypeText Text:=tComment & "{" & zPhrase & "}"
                rThisSentenceResult.Start = rThisSentenceResult.Start + 1 'so as not to find the same phrase again and again
                rThisSentenceResult.End = rThisSentenceToSearch.End
                rThisSentenceResult.Select
            End If 'If myMatches.Count > 0 Then
        Loop While myMatches.Count > 0
    Next 'For Each rThisSentenceToSearch In sTheseSentences
End Function

Relying on Range.Start or Range.End for position in a Word document is not reliable due to how Word stores non-printing information in the text flow. For some kinds of things you can work around it using Range.TextRetrievalMode, but the non-printing characters inserted by Comments aren't affected by these settings.
I must admit I don't understand why Word's built-in Find with wildcards won't work for you - no case matching shouldn't be a problem. For instance, based on the example: "Never has there been, never, NEVER, a total drought.":
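FindText:="[nN][eE][vV][eE][rR]"
This will find all instances of n-e-v-e-r regardless of capitalization. The brackets define a set of characters, any one of which may match at that position; here each set holds the lower- and upper-case form of one letter in the search term. (No commas are needed inside the brackets; a comma there would itself be treated as a matchable character.)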
The workarounds described in my MSDN post you link to are pretty much all you can do if you insist on RegEx:
Using the Office Open XML (or possibly Word 2003 XML) file format will let you use RegEx and standard XML processing tools to find the information, add comment "tags" into the Word XML, close it all up... And when the user sees the document it will all be there.
If you need to do this in the Word UI, a slightly different approach should work (assuming you're targeting Word 2003 or later): work through the document range by range (by paragraph, perhaps). Read the XML representation of the text into memory using the Range.WordOpenXML property, perform the RegEx search, add the comments as WordOpenXML, then write the WordOpenXML back into the document using the InsertXML method, replacing the original range (paragraph). Since you'd be working with the Paragraph object, Range.Start won't be a factor.

How to remove/replace any line of text inside double quotes with an empty string and leave URLs only, with RegEx in VB.NET?

I have a list like this
"Boring makes sense!"
"http://www.someurl.com/listsolo.php?username=fgt&id=46229&code="
"http://www.someurl2.com/members/listearn.php?username=mprogram&id=465301"
"All is there?"
"http://www.someurl.com/listsolo.php?username=loopa&id=46228&code="
"http://www.someurl3.com/members/mem.php?&mprogram"
"http://someurl4.com/members/mem.php?&loop"
I need to remove every line that contains plain text, including its double quotes, with RegEx in VB.NET:
Dim fileName As String = "C:\Downloads\Links.txt"
Dim sr As New StreamReader(fileName)
While Not sr.EndOfStream
    Dim re As String = sr.ReadLine()
    If Not re.StartsWith("http") Then
        re = Regex.Replace(re, "(^[A-Za-z]+)", "", RegexOptions.Multiline)
        lblTest.Text += re.ToString()
    End If
End While
sr.Close()
How can I do this in a simple way?

Using LINQ: read from the file, filter, and write the result back to it:
File.WriteAllLines("some path", From line In File.ReadAllLines("some path")
                                Where line.StartsWith("http"))

I figured it out :-). This regex
.[A-Za-z]\w+ .*
removes the whole line of text, including the double quotes. I tested the regex here. Anyway, thanks for the help.
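If the double quotes are literally present in the file, as the sample list suggests, a line begins with " rather than http, so a plain StartsWith("http") test would not match it. The sketch below assumes that reading of the input and strips the quotes before filtering; the module name and the in-memory array standing in for Links.txt are hypothetical.

```vbnet
Imports System
Imports System.Linq

Module UrlFilterDemo
    Sub Main()
        ' Hypothetical in-memory stand-in for the lines of Links.txt (quotes included).
        Dim lines As String() = {
            """Boring makes sense!""",
            """http://www.someurl.com/listsolo.php?username=fgt&id=46229&code=""",
            """All is there?""",
            """http://someurl4.com/members/mem.php?&loop"""
        }

        ' Strip the surrounding double quotes from every line.
        Dim unquoted = lines.Select(Function(l) l.Trim(""""c))

        ' Keep only the lines that now start with "http".
        Dim urls = unquoted.Where(Function(l) l.StartsWith("http")).ToList()

        For Each url In urls
            Console.WriteLine(url)
        Next
    End Sub
End Module
```

With this sample, only the two URL lines are printed, without their quotes. If the quotes should be kept in the output, filter on the trimmed value but write the original line instead.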

Find word with RegExp and bold

I have a Word document in which I want to find all the words that have the following layout: ABC-12:123456 DEF. Wherever this is found in the document, the word should be selected and made bold. (Later I'll add a hyperlink instead of bold.) I have successfully found the words and put them in a MatchCollection just to try RegExp. It looks like this:
Sub searchDocument()
    Set matchPattern = New RegExp
    matchPattern.Pattern = "ABC-\d{2}:\d{6} DEF"
    matchPattern.Global = True
    Dim matchPatternWords As MatchCollection
    Set matchPatternWords = matchPattern.Execute(ActiveDocument.Range)
    For Each matchPatternWord In matchPatternWords
        MsgBox (matchPatternWord)
    Next matchPatternWord
End Sub

You need to go from the regexp match to the range object representing the match.
Set matchRange = ActiveDocument.Range(matchPatternWord.FirstIndex, matchPatternWord.FirstIndex + matchPatternWord.Length)
would be the obvious invocation.
However, this post indicates that there might be issues with this approach, because formatting can throw off the character count. It's from 2010, though, so the issue might be handled better now.
If the above doesn't work, or if you don't trust it, you can use Find instead; after a successful Execute, the range is redefined to the found text:
Set matchRange = ActiveDocument.Range
matchRange.Find.Execute FindText:=matchPatternWord.Value
The latter needs a bit more handling if multiple occurrences of the same word are a possibility.
Once you have the range, it's straightforward:
matchRange.Bold = True

Regex regular expression to remove lines which start with certain text

I know this may be quite easy for you.
I have a text that contains 40 lines, and I want to remove the lines that start with a constant text.
Please check the data below.
When I use (?mn)[\+CMGL:].*($) it removes the whole text; when I use (?mn)[\+CMGL:].*(\r), it leaves only the first line.
+CMGL: 0,1,,159
07910201956905F0440B910201532762F20008709021225282808
+CMGL: 1,1,,159
07910201956905F0240B910201915589F7000860013222244480
+CMGL: 2,1,,151
07910201956905F0240B910201851177F6000850218122415
+CMGL: 3,1,,159
07910201956905F0440B910201532762F200087090311
+CMGL: 4,1,,159
07910221020020F0440B910221741514F40008802041120481808C050
I want to remove all lines that start with +CMGL and leave only the other lines.
Thanks...

Why do you need Regex for this? String.StartsWith was created for this purpose.
Dim result = lines.Where(Function(l) Not l.StartsWith("+CMGL")).ToList()
Edit: If you don't have "lines" but a text which contains NewLine-characters:
Dim result = text.Split({ControlChars.CrLf, ControlChars.Lf}, StringSplitOptions.None).
    Where(Function(l) Not l.StartsWith("+CMGL")).ToList()
If you want it to be converted back to a string:
Dim text = String.Join(Environment.NewLine, result)
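If a regex is still preferred, note that [\+CMGL:] in the question's patterns is a character class: it matches any single one of the characters +, C, M, G, L or :, not the literal prefix +CMGL:. A hedged sketch of a regex-based alternative follows; the module name and the shortened sample text are assumptions made for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RemoveCmglLines
    Sub Main()
        ' Hypothetical sample, shortened from the data in the question.
        Dim text As String =
            "+CMGL: 0,1,,159" & vbCrLf &
            "07910201956905F0440B910201532762F20008709021225282808" & vbCrLf &
            "+CMGL: 1,1,,159" & vbCrLf &
            "07910201956905F0240B910201915589F7000860013222244480"

        ' ^ anchors at the start of each line (Multiline); \+CMGL: is the literal prefix;
        ' .* consumes the rest of the line (including a trailing \r, which . matches in .NET),
        ' and \r?\n? consumes the line break so no empty line is left behind.
        Dim cleaned As String = Regex.Replace(text, "^\+CMGL:.*\r?\n?", "", RegexOptions.Multiline)

        Console.WriteLine(cleaned)
    End Sub
End Module
```

With this sample, only the two data lines remain. The StartsWith approach above is arguably easier to read; the regex version is mainly useful when the text is already held as one string.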
