RichTextBox search'n'replace results are staggered - regex

I am currently trying to generate colored results after a search containing keywords. My code displays a RichTextBox containing a text that was successfully hit by the search engine.
Now I want to highlight the keywords in the text by making them bold and red. I have my list of words in a string array, which I browse this way (rtb is my RichTextBox, plainText is the only Run from rtb, containing its entire text):
rtb.SelectAll();
string allText = rtb.Selection.Text;
string expression = "";
foreach (string word in words)
{
    expression = Regex.Escape(word);
    Regex regExp = new Regex(expression);
    foreach (Match match in regExp.Matches(allText))
    {
        TextPointer start = plainText.ContentStart.GetPositionAtOffset(match.Index, LogicalDirection.Forward);
        TextPointer end = plainText.ContentStart.GetPositionAtOffset(match.Index + match.Length, LogicalDirection.Forward);
        rtb.Selection.Select(start, end);
        rtb.Selection.ApplyPropertyValue(Run.FontWeightProperty, FontWeights.Bold);
        rtb.Selection.ApplyPropertyValue(Run.ForegroundProperty, "red");
    }
}
Now I thought this would do the trick. But somehow, only the first word gets highlighted correctly. The highlight for the second occurrence starts too early: the correct number of letters is highlighted, but a few characters before the actual word. For the third occurrence, it starts even more characters too early, and so on.
Have you got any idea what is causing this behavior?
EDIT (01/07/2013): I still cannot figure out why these results are staggered. So far I have noticed that if I create a variable set to zero right before the second foreach statement, add it to each TextPointer's position, and increment it by 4 (no idea why) at the end of each loop, the results are colored correctly. Nevertheless, if I search for two or more keywords (it doesn't matter whether they're the same size), every occurrence of the first keyword gets colored correctly, but only the first occurrence of each other keyword is colored correctly (the others are staggered again). Here's the edited code:
rtb.SelectAll();
string allText = rtb.Selection.Text;
string expression = "";
foreach (string word in words)
{
    expression = Regex.Escape(word);
    Regex regExp = new Regex(expression);
    int i = 0;
    foreach (Match match in regExp.Matches(allText))
    {
        TextPointer start = plainText.ContentStart.GetPositionAtOffset(match.Index + i, LogicalDirection.Forward);
        TextPointer end = plainText.ContentStart.GetPositionAtOffset(match.Index + match.Length + i, LogicalDirection.Forward);
        rtb.Selection.Select(start, end);
        rtb.Selection.ApplyPropertyValue(Run.FontWeightProperty, FontWeights.Bold);
        rtb.Selection.ApplyPropertyValue(Run.ForegroundProperty, "red");
        i += 4; // number found out from trials
    }
}

Alright! So I learned by reading this question that every time I modify the style, it adds 4 characters to the text, which is what was throwing off my offsets.
To fix this, since I may have multiple keywords and they do not necessarily appear in the text in the order in which they were typed in the search box, I first had to browse my text to locate each occurrence of each keyword without modifying the text. For each occurrence, I store the start position, end position and desired color in a custom list.
When this collection is done, I order my occurrence list by the start attribute of each member. I can now be sure that each occurrence I browse in my foreach loop is the next one in the text, regardless of its content or length. And I know which color I want it to appear in, so I can distinguish different keywords.
Then, finally, I can browse each member of my ordered list and modify the style of my text, knowing that the next word will appear later in the text, so I must add 4 characters to my index at the end of each loop.
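The two passes described above can be sketched in plain JavaScript, leaving out the WPF calls. This is only an illustration of the bookkeeping: the function names and the keyword/color objects are hypothetical, and the shift of 4 per styled occurrence is the value reported in this thread, not something the sketch derives.

```javascript
// Escape regex metacharacters (a rough JavaScript counterpart of Regex.Escape).
function escapeRegExp(word) {
  return word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Pass 1: locate every occurrence of every keyword without modifying the text.
function collectOccurrences(text, keywords) {
  const occurrences = [];
  for (const { word, color } of keywords) {
    const re = new RegExp(escapeRegExp(word), 'g');
    for (const match of text.matchAll(re)) {
      occurrences.push({
        start: match.index,
        end: match.index + match[0].length,
        color,
      });
    }
  }
  // Order by position so that each occurrence lies after the previous one.
  return occurrences.sort((a, b) => a.start - b.start);
}

// Pass 2: shift each occurrence by the symbols added when the earlier ones
// were styled (assumed to be 4 per styled occurrence, as observed above).
function withShiftedOffsets(occurrences, symbolsPerStyledSpan = 4) {
  return occurrences.map((occ, i) => ({
    ...occ,
    start: occ.start + i * symbolsPerStyledSpan,
    end: occ.end + i * symbolsPerStyledSpan,
  }));
}

// Usage: two keywords, each with its own color.
const sample = 'red fox, blue fox, red hen';
const found = collectOccurrences(sample, [
  { word: 'red', color: 'red' },
  { word: 'fox', color: 'blue' },
]);
console.log(withShiftedOffsets(found));
```

Sorting before shifting is what makes a single running correction sufficient: every styled span then sits before all the spans that still have to be located.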

Related

Word VBA - find a text string where a one word (not all words in string) have a particular style or format

I was trying to construct some code to search for text where one word within the text is a particular format or style. For example, I would like to search for the text "Hello world, all is good" but only hit instances where the word "all" is in bold.
I thought about searching for the first few words "Hello world, "; collapsing the selection, searching the next three characters forward for the word "all" in bold; collapsing the selection (if true) then searching the next bit for the words " is good". This would result in identifying the whole phrase with the bold word but it seems really inefficient and not very flexible. Also, to then select the whole sentence, I have to write code to move the selection back to the start and extend the selection forward. Then I need to reset the search to continue forward from that position.
Is there some easy/easier/more elegant way to search for a string where only one word within the string has specific properties like bold? I specifically want the search to ignore instances of the phrase where the relevant word is not in bold.
I have spent a few hours searching Google and Stack Overflow and can't find anything on this.
I haven't posted code because I am not very good at writing the code, and I really want to understand if there is a flexible/elegant way of doing what I want. The route I've explained above is so inflexible that I'm reluctant to bother coding it.
Thanks
Jeremy
The method I would use is to search for the string and, if found, then search the string for the word. Here is an example.
Sub Demo()
    Dim StringRange As Range
    Dim MatchFound As Boolean
    With ActiveDocument.Range.Find
        ' The string to find
        .Text = "Hello world, all is good"
        ' Search the document
        Do While .Execute
            ' Capture the string
            Set StringRange = .Parent.Duplicate
            With .Parent.Duplicate.Find
                ' The word to find
                .Text = "all"
                .Font.Bold = True
                ' Search the string
                If .Execute Then
                    MatchFound = True
                    StringRange.Select
                    If MsgBox("Match found. Continue searching?", vbQuestion + vbYesNo) = vbNo Then
                        Exit Sub
                    End If
                End If
            End With
        Loop
        If MatchFound Then
            MsgBox "Finished searching document", vbInformation
        Else
            MsgBox "No match found", vbInformation
        End If
    End With
End Sub

How do you select specific range in a string using RegEx

So I have a string. Let's say for the argument it is this one:
1234567891113SomeTextExample
I want to have two regular expressions:
Select from beginning to, say, 6th position;
Select from 8th position to 12th position.
I know how to select everything AFTER specific position, e.g.:
(?<=.{6})(.*)$
would select everything after the first 6 characters.
I am using Sublime Text editor and need to cleanup some logs and these two expressions would save a whole lot of time.
use ^ to get your regex to start at the beginning.
Beginning to 6th position : ^(.{6})
var str = 'xdcfvgbhdsds';
var regex = /^(.{6})/;
console.log(regex.exec(str)[1]);
8th to 12th position : ^.{7}(.{5})
var str = 'xdcfvgbhddsfsffsds';
var regex = /^.{7}(.{5})/;
console.log(regex.exec(str)[1]);
Beginning to 6th position:
^(.{6}).*$
Characters 8 to 12, inclusive on both ends:
^.{7}(.{5}).*$
I am assuming here that you want to capture these specific ranges for some sort of use.
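As a quick check against the sample string from the question, here is a small JavaScript sketch of both approaches: capturing the range in a group, and matching only the range by putting the skipped prefix in a lookbehind. The helper names are made up for illustration, positions are 1-based and inclusive, and whether a given editor's regex engine accepts the lookbehind form is not something this sketch verifies.

```javascript
// Capture the first n characters in group 1.
function firstChars(s, n) {
  const m = new RegExp('^(.{' + n + '})').exec(s);
  return m ? m[1] : null;
}

// Capture characters from..to (1-based, inclusive) in group 1.
function charRange(s, from, to) {
  const re = new RegExp('^.{' + (from - 1) + '}(.{' + (to - from + 1) + '})');
  const m = re.exec(s);
  return m ? m[1] : null;
}

// Same range, but the whole match is the range itself: the skipped
// prefix sits in a lookbehind, so it is not part of the selection.
function charRangeLookbehind(s, from, to) {
  const re = new RegExp('(?<=^.{' + (from - 1) + '}).{' + (to - from + 1) + '}');
  const m = re.exec(s);
  return m ? m[0] : null;
}

const line = '1234567891113SomeTextExample';
console.log(firstChars(line, 6));              // 123456
console.log(charRange(line, 8, 12));           // 89111
console.log(charRangeLookbehind(line, 8, 12)); // 89111
```

The lookbehind variant matters when the tool selects the whole match rather than a capture group.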
Finally I found it out.
First one - Select from beginning to, say, 6th position:
^(.{6})
Thanks Zenoo for this.
And select from 8th position to 12th position:
^(.{8})|(?<=.{12})(.*)$
Well, at least this one works in Sublime Text. I am sure there are lots and lots of editors/applications which are fine with Zenoo's approach (^.{7}(.{5})).

Regular expression for custom syntax in text input

I'm supposed to enforce a certain search-syntax in a text input, and after watching several RegEx videos and tutorials, I'm still having difficulties creating a regex for my purpose.
The expression structure should be something like that:
$earch://site.com?y=7, app=app1, wu>7, cnt<>8, url=http://kuku.com?adasd=343 , p=8
may start with a free text search that may contain any character other than the delimiter, which is ,. (free text must be first, and the string may be ONLY free text search).
after the free text come comma-separated parts, each made of a field name consisting only of letters ([a-zA-Z]), followed by an operator (=|<|>|<>), followed by a field search value that may be anything but ,.
between the commas that separate the parts there may be spaces (\s*).
The free text part or at least one field=value must appear in order for the string to be valid.
Did anyone understand the question? :)
^[^,]*(?:,\s*[a-zA-Z]+(?:[=><]|<>)[^,]+)*$? – Rawing
Thanks, that seems to work. Why did you use non-capturing groups?
He did it most probably because he didn't assume that the groups are to be captured (you didn't specify that).
Plus - if I start out the string with a comma, it is valid, whereas I want it to not be valid (if there's no free text at the beginning).
That can be accomplished by changing the first * to a +, i. e. ^[^,]+…
I'm using javascript. I want to be able to separate each key=value pair (including the possible free text as a group), and within that group I would like to be able to capture key, operator, and value as separate entities (or groups)
That's not doable with only one RegExp invocation, see e. g. How to capture an arbitrary number of groups in JavaScript Regexp? Here's an example solution:
s = '$earch://site.com?y=7, app=app1, wu>7, cnt<>8, url=http://kuku.com?adasd=343 , p=8'
part = /,\s*([a-zA-Z]+)(<>|[=><])([^,]+)/
re = RegExp('^([^,]+)('+part.source+')*$')
whole = re.exec(s) // null when s does not follow the syntax
freetext = whole && whole[1] // validate s and take free text as 1st capture group of re
if (freetext)
{ document.write('free text:', freetext, '<p>')
  parts = RegExp(part.source, 'g')
  m = s.slice(freetext.length).match(parts) // now get all key=value pairs into m[]
  if (m)
  { field = []
    for (i = 0; i < m.length; ++i)
    { f = m[i].match(part) // and now capture key, operator and value from m[i]
      field[i] = { key:f[1], operator:f[2], value:f[3] }
      for (property in field[i]) // display them
        document.write(property, ':', field[i][property], '; ')
      document.write('<p>')
    }
    document.write(field.length, ' key/value pairs total<p>')
  }
}
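For use outside a browser page, the same two-step idea (validate the whole string first, then walk the parts) might be packaged as a function that returns data instead of writing to the document. This is a sketch under a few assumptions: parseSearch is a hypothetical name, values are trimmed of surrounding whitespace (the original keeps them as-is), and String.prototype.matchAll is available.

```javascript
// One ", key<op>value" part; "<>" is tried before the one-character operators.
const PART = /,\s*([a-zA-Z]+)(<>|[=><])([^,]+)/;
// Whole-string validator: mandatory free text, then any number of parts.
const WHOLE = new RegExp('^([^,]+)(?:' + PART.source + ')*$');

function parseSearch(input) {
  const whole = WHOLE.exec(input);
  if (!whole) return null; // input does not follow the syntax
  const freeText = whole[1];
  const fields = [];
  // Second step: iterate over the parts that follow the free text.
  const rest = input.slice(freeText.length);
  for (const m of rest.matchAll(new RegExp(PART.source, 'g'))) {
    fields.push({ key: m[1], operator: m[2], value: m[3].trim() });
  }
  return { freeText, fields };
}

const query = '$earch://site.com?y=7, app=app1, wu>7, cnt<>8, url=http://kuku.com?adasd=343 , p=8';
console.log(parseSearch(query));
console.log(parseSearch(', app=app1')); // null: no free text before the first comma
```

Returning null for invalid input keeps validation and extraction in one call, so the caller cannot extract fields from a string that failed validation.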

google doc script to capitalize sentences

i am writing the google doc script below to capitalize the sentences in a document.
function cap6() {
  var body = DocumentApp.getActiveDocument().getBody();
  var text = body.editAsText();
  var str1 = text.getText();
  Logger.log(str1);
  // define function "replacement" to change the matched pattern to uppercase
  function replacement(match) { return match.toUpperCase(); }
  // period followed by any number of blank spaces (1,2,3, etc.)
  var reg = /\.(\s*\s)[a-z]/g;
  // capitalize sentence
  var str2 = str1.replace(reg, replacement);
  Logger.log(str2);
  // replace string str1 by string str2
  text.replaceText(str1, str2);
}
the code almost worked in the sense that the correct result is shown in the log file as follows:
[15-10-22 22:37:03:562 EDT] capitalize sentences. this is one example with ONE blank space after the period. here is another example with TWO blank spaces after the period. this is yet another example with MORE THAN THREE blank spaces.
[15-10-22 22:37:03:562 EDT] capitalize sentences. This is one example with ONE blank space after the period. Here is another example with TWO blank spaces after the period. This is yet another example with MORE THAN THREE blank spaces.
the 1st line above is the original paragraph without capitalized sentences; the 2nd line below is the transformed paragraph with capitalized sentences, regardless of the number of blank spaces after the period.
the problem was that i could not replace the original paragraph in the google doc with the transformed paragraph using the code:
// replace string str1 by string str2
text.replaceText(str1, str2);
i suspect that i made an error in the arguments of the method "replaceText".
any help to point out my errors would be appreciated. thank you.
in a flash of inspiration, i ALMOST solved the problem using the following code:
text.replaceText(".*", str2);
my inspiration actually came from reading about the method "replaceText".
the above code worked when i had only ONE paragraph in the google doc.
but when i had two paragraphs in the google doc, then the code gave a duplicate of the document, i.e., a 2nd exact copy of the two paragraphs just below the original two paragraphs (with correct capitalization of the sentences, including the beginning of the 2nd paragraph, but not the beginning of the 1st paragraph).
when i had 3 paragraphs, then i had 3 copies of these 3 paragraphs, such as shown below:
capitalize sentences. this is one example with ONE blank space after the period. here is another example with TWO blank spaces after the period. this is yet another example with MORE THAN THREE blank spaces.
capitalize sentences. this is one example with ONE blank space after the period. here is another example with TWO blank spaces after the period. this is yet another example with MORE THAN THREE blank spaces.
capitalize sentences. this is one example with ONE blank space after the period. here is another example with TWO blank spaces after the period. this is yet another example with MORE THAN THREE blank spaces.
then after running the script, i got 3 copies of these 3 paragraphs (with correct capitalization of the sentences, including the beginning of the 2nd and 3rd paragraphs), as shown below:
capitalize sentences. This is one example with ONE blank space after the period. Here is another example with TWO blank spaces after the period. This is yet another example with MORE THAN THREE blank spaces.
Capitalize sentences. This is one example with ONE blank space after the period. Here is another example with TWO blank spaces after the period. This is yet another example with MORE THAN THREE blank spaces.
Capitalize sentences. This is one example with ONE blank space after the period. Here is another example with TWO blank spaces after the period. This is yet another example with MORE THAN THREE blank spaces.
capitalize sentences. This is one example with ONE blank space after the period. Here is another example with TWO blank spaces after the period. This is yet another example with MORE THAN THREE blank spaces.
Capitalize sentences. This is one example with ONE blank space after the period. Here is another example with TWO blank spaces after the period. This is yet another example with MORE THAN THREE blank spaces.
Capitalize sentences. This is one example with ONE blank space after the period. Here is another example with TWO blank spaces after the period. This is yet another example with MORE THAN THREE blank spaces.
capitalize sentences. This is one example with ONE blank space after the period. Here is another example with TWO blank spaces after the period. This is yet another example with MORE THAN THREE blank spaces.
Capitalize sentences. This is one example with ONE blank space after the period. Here is another example with TWO blank spaces after the period. This is yet another example with MORE THAN THREE blank spaces.
Capitalize sentences. This is one example with ONE blank space after the period. Here is another example with TWO blank spaces after the period. This is yet another example with MORE THAN THREE blank spaces.
so there is still something wrong in the new code... which almost worked if i could get rid of the extra copies of the document.
returning to the original code
text.replaceText(str1, str2);
i suspect that there was something wrong with using the variable "str1" in the 1st argument of method "replaceText". it is hoped some experts could explain the error in my original code.
i combine the above answers from Washington Guedes and from Robin Gertenbach here that led to the following working script:
function cap6() {
  var body = DocumentApp.getActiveDocument().getBody();
  var text = body.editAsText();
  // define variable str1
  var str1 = text.getText();
  // define function "replacement" to change the matched pattern to uppercase
  function replacement(match) { return match.toUpperCase(); }
  // period followed by any number of blank spaces (1,2,3, etc.)
  // var reg = /\.(\s*\s)[a-z]/g;
  // or replace \s*\s by \s+
  var reg = /\.(\s+)[a-z]/g;
  // capitalize sentence
  var str2 = str1.replace(reg, replacement);
  // replace the entire text by string str2
  text.setText(str2);
}
on the other hand, the above script would wipe out all existing formatting such as links, boldface, italics, underline in a google doc.
so my next question would be how could i modify the script so it would run on a selected (highlighted) paragraph instead of the whole google doc to avoid the script to wipe out existing formatting.
The duplication issue comes from the line breaks, which the dot operator does not match in RE2 (Google's regular expression engine) unless you include the s flag.
You therefore get a number of matches equal to the number of paragraphs.
You don't need a resource-intensive replace method, though; you can just use text.setText(str2); instead of text.replaceText(".*", str2);
to change the script so that it would run only on the selected text (e.g., a paragraph) to avoid wiping out the existing formatting in other paragraphs in a google doc, i was inspired by the code i found in Class Range.
i also improve on the regular expression in the variable "reg" so that the beginning of a line or paragraph would also be capitalized:
var reg = /(^|\.)(\s*)[a-z]/g;
below is a script that would capitalize the sentences in a selected text (just run the script cap7, which calls the script cap8):
function cap7() {
  // script to capitalize the beginning of a paragraph and the sentences within.
  // highlight a number of paragraphs, then run cap7, which calls cap8.
  // get the selected text inside a google doc
  var selection = DocumentApp.getActiveDocument().getSelection();
  if (selection) {
    var elements = selection.getRangeElements();
    for (var i = 0; i < elements.length; i++) {
      var element = elements[i];
      // Only modify elements that can be edited as text; skip images and other non-text elements.
      if (element.getElement().editAsText) {
        var text = element.getElement().editAsText();
        // capitalize the sentences inside the selected text
        cap8(text);
      }
    }
  }
}
function cap8(text) {
  // define variable str1
  var str1 = text.getText();
  // Logger.log(str1);
  // define function "replacement" to change the matched pattern to uppercase
  function replacement(match) { return match.toUpperCase(); }
  // beginning of a line or period, followed by zero or more blank spaces
  var reg = /(^|\.)(\s*)[a-z]/g;
  // capitalize sentence; replace regular expression "reg" by the output of function "replacement"
  var str2 = str1.replace(reg, replacement);
  // Logger.log(str2);
  // replace whole text by str2
  text.setText(str2); // WORKING
  return text;
}
see also my question in the post google doc script, capitalize sentences without removing other attributes.
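The regular expression can be tried on plain strings before running it against a document. The sketch below is ordinary JavaScript with no DocumentApp calls; the function names are hypothetical, and splitting on line breaks is an assumption standing in for the per-element loop in cap7.

```javascript
// Capitalize the first letter of the string and the first letter after
// each period, whatever amount of whitespace follows the period.
function capitalizeSentences(paragraph) {
  return paragraph.replace(/(^|\.)(\s*)[a-z]/g, (match) => match.toUpperCase());
}

// Without the m flag, ^ matches only at the very start of the string,
// so each paragraph is processed separately.
function capitalizeParagraphs(text) {
  return text.split('\n').map(capitalizeSentences).join('\n');
}

console.log(capitalizeSentences('capitalize sentences. this is one.  here is two.'));
// Capitalize sentences. This is one.  Here is two.
console.log(capitalizeParagraphs('one. two\nthree'));
console.log(capitalizeSentences('e.g. x'));
// E.G. X
```

Because \s* also allows zero spaces, letters that directly follow a period inside an abbreviation are capitalized too, as the last call shows; requiring at least one space (\s+) after a period would avoid that case at the cost of a slightly longer pattern.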

GetPositionAtOffset() doesn't return the correct position

I use a RichTextBox in WPF (4.0), and I use the GetPositionAtOffset() method to get a text range between two positions in the content of the RichTextBox.
1) I initialize the text pointer "position" from MyRichTextBox.Document.ContentStart:
TextPointer position = RTBEditor.Document.ContentStart;
2) I get the text from my RichTextBox like this:
var textRun = new TextRange(RTBEditor.Document.ContentStart, RTBEditor.Document.ContentEnd).Text;
3) With a Regex, I find the string that I want in textRun and get the start and end indexes of the match (I search for text between "/*" and "*/"):
Regex regex = new Regex(@"/\*([^\*/])*\*/");
var match = regex.Match(textRun);
TextPointer start = position.GetPositionAtOffset(match.Index, LogicalDirection.Forward);
TextPointer end = position.GetPositionAtOffset(match.Index + match.Length, LogicalDirection.Backward);
But when I use these pointers in a TextRange and colorize the text inside, the text that gets colorized in my RichTextBox is not the text matched by my regex (even though the indexes are correct).
Why doesn't the GetPositionAtOffset() method give the position at the specified index? Is the problem in this method, or is it somewhere else?
Thanks for any reply; I am stuck in my development.
According to this, https://msdn.microsoft.com/en-us/library/ms598662%28v=vs.110%29.aspx
GetPositionAtOffset returns a TextPointer to the position indicated by the specified offset, in symbols, from the beginning of the current TextPointer.
Any of the following is considered to be a symbol:
An opening or closing tag for the TextElement element.
A UIElement element contained in an InlineUIContainer or BlockUIContainer. Note that such a UIElement is always counted as exactly one symbol; any additional content or elements contained by the UIElement are not counted as symbols.
A 16-bit Unicode character inside of a text Run element.
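To see why a character index and a symbol offset can disagree, here is a toy model in JavaScript. It is not the WPF API: the document is reduced to a list of Run texts, each opening or closing Run tag counts as one symbol, and offsets are measured from the start of the first Run's content. The function name is hypothetical.

```javascript
// Return the symbol offset of the character at charIndex, counting one
// symbol per character and one per Run tag crossed along the way.
function symbolOffsetOfChar(runs, charIndex) {
  let offset = 0;          // symbols passed so far
  let remaining = charIndex;
  for (const text of runs) {
    if (remaining < text.length) return offset + remaining;
    remaining -= text.length;
    offset += text.length + 2; // this run's closing tag + the next opening tag
  }
  return null; // charIndex lies past the end of the text
}

// One Run: the character index and the symbol offset agree.
console.log(symbolOffsetOfChar(['abc def ghi'], 8));          // 8
// After "def" is styled, the single Run becomes three Runs, and the same
// character sits 4 symbols further along (two closing and two opening tags).
console.log(symbolOffsetOfChar(['abc ', 'def', ' ghi'], 8));  // 12
```

In a real FlowDocument there are further enclosing elements such as Paragraph, so offsets taken from Document.ContentStart include their tags as well; the model only shows the counting principle.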
Sorry to bother you, the problem was somewhere else.
I initialized the text of my RichTextBox with AppendText() method and not with a paragraph that I added in the blocks. So now it works fine !