google doc script to capitalize sentences - regex

i am writing the google doc script below to capitalize the sentences in a document.
function cap6() {
  var body = DocumentApp.getActiveDocument().getBody();
  var text = body.editAsText();
  var str1 = text.getText();
  Logger.log(str1);
  // define function "replacement" to change the matched pattern to uppercase
  function replacement(match) { return match.toUpperCase(); }
  // period followed by any number of blank spaces (1,2,3, etc.)
  var reg = /\.(\s*\s)[a-z]/g;
  // capitalize sentence
  var str2 = str1.replace(reg, replacement);
  Logger.log(str2);
  // replace string str1 by string str2
  text.replaceText(str1, str2);
}
the code almost worked in the sense that the correct result is shown in the log file as follows:
[15-10-22 22:37:03:562 EDT] capitalize sentences. this is one example with ONE blank space after the period. here is another example with TWO blank spaces after the period. this is yet another example with MORE THAN THREE blank spaces.
[15-10-22 22:37:03:562 EDT] capitalize sentences. This is one example with ONE blank space after the period. Here is another example with TWO blank spaces after the period. This is yet another example with MORE THAN THREE blank spaces.
the 1st line above is the original paragraph without capitalized sentences; the 2nd line is the transformed paragraph with capitalized sentences, regardless of the number of blank spaces after the period.
the problem was that i could not replace the original paragraph in the google doc with the transformed paragraph using the code:
// replace string str1 by string str2
text.replaceText(str1, str2);
i suspect that i made an error in the arguments of the method "replaceText".
any help to point out my errors would be appreciated. thank you.

in a flash of inspiration, i ALMOST solved the problem using the following code:
text.replaceText(".*", str2);
my inspiration actually came from reading about the method "replaceText".
the above code worked when i had only ONE paragraph in the google doc.
but when i had two paragraphs in the google doc, then the code gave a duplicate of the document, i.e., a 2nd exact copy of the two paragraphs just below the original two paragraphs (with correct capitalization of the sentences, including the beginning of the 2nd paragraph, but not the beginning of the 1st paragraph).
when i had 3 paragraphs, then i had 3 copies of these 3 paragraphs, such as shown below:
capitalize sentences. this is one example with ONE blank space after the period. here is another example with TWO blank spaces after the period. this is yet another example with MORE THAN THREE blank spaces.
capitalize sentences. this is one example with ONE blank space after the period. here is another example with TWO blank spaces after the period. this is yet another example with MORE THAN THREE blank spaces.
capitalize sentences. this is one example with ONE blank space after the period. here is another example with TWO blank spaces after the period. this is yet another example with MORE THAN THREE blank spaces.
then after running the script, i got 3 copies of these 3 paragraphs (with correct capitalization of the sentences, including the beginning of the 2nd and 3rd paragraphs), as shown below:
capitalize sentences. This is one example with ONE blank space after the period. Here is another example with TWO blank spaces after the period. This is yet another example with MORE THAN THREE blank spaces.
Capitalize sentences. This is one example with ONE blank space after the period. Here is another example with TWO blank spaces after the period. This is yet another example with MORE THAN THREE blank spaces.
Capitalize sentences. This is one example with ONE blank space after the period. Here is another example with TWO blank spaces after the period. This is yet another example with MORE THAN THREE blank spaces.
capitalize sentences. This is one example with ONE blank space after the period. Here is another example with TWO blank spaces after the period. This is yet another example with MORE THAN THREE blank spaces.
Capitalize sentences. This is one example with ONE blank space after the period. Here is another example with TWO blank spaces after the period. This is yet another example with MORE THAN THREE blank spaces.
Capitalize sentences. This is one example with ONE blank space after the period. Here is another example with TWO blank spaces after the period. This is yet another example with MORE THAN THREE blank spaces.
capitalize sentences. This is one example with ONE blank space after the period. Here is another example with TWO blank spaces after the period. This is yet another example with MORE THAN THREE blank spaces.
Capitalize sentences. This is one example with ONE blank space after the period. Here is another example with TWO blank spaces after the period. This is yet another example with MORE THAN THREE blank spaces.
Capitalize sentences. This is one example with ONE blank space after the period. Here is another example with TWO blank spaces after the period. This is yet another example with MORE THAN THREE blank spaces.
so there is still something wrong in the new code... which almost worked if i could get rid of the extra copies of the document.
returning to the original code
text.replaceText(str1, str2);
i suspect that there was something wrong with using the variable "str1" in the 1st argument of method "replaceText". it is hoped some experts could explain the error in my original code.

i combined the answers from Washington Guedes and from Robin Gertenbach, which led to the following working script:
function cap6() {
  var body = DocumentApp.getActiveDocument().getBody();
  var text = body.editAsText();
  // define variable str1
  var str1 = text.getText();
  // define function "replacement" to change the matched pattern to uppercase
  function replacement(match) { return match.toUpperCase(); }
  // period followed by any number of blank spaces (1,2,3, etc.)
  // var reg = /\.(\s*\s)[a-z]/g;
  // or replace \s*\s by \s+
  var reg = /\.(\s+)[a-z]/g;
  // capitalize sentence
  var str2 = str1.replace(reg, replacement);
  // replace the entire text by string str2
  text.setText(str2);
}
on the other hand, the above script wipes out all existing formatting such as links, boldface, italics and underline in a google doc.
so my next question is: how could i modify the script so that it runs on a selected (highlighted) paragraph instead of the whole google doc, so that it does not wipe out existing formatting elsewhere?

The duplication issue you have comes from the line breaks, which are not matched by the dot operator in RE2 (Google's regular expression engine) unless you set the s flag.
You therefore get a number of matches equal to the number of paragraphs, and each match is replaced by the whole of str2.
You don't need a resource-intensive replace method, though: you can just use text.setText(str2); instead of text.replaceText(".*", str2);
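The multiplication of paragraphs can be reproduced with a small model in plain JavaScript. This is only a sketch of the behaviour described in this thread, not the Apps Script API: it assumes that ".*" matches once per paragraph and that every match is replaced by the complete replacement string. The helper name simulatePerParagraphReplace is made up for the illustration.

```javascript
// Model (assumption): the document is a list of paragraphs, and ".*"
// matches each paragraph separately because "." does not cross line breaks.
function simulatePerParagraphReplace(paragraphs, replacement) {
  // Every paragraph is swapped for the full replacement text.
  return paragraphs.map(function () { return replacement; }).join("\n");
}

var paragraphs = ["first paragraph.", "second paragraph.", "third paragraph."];
var str2 = paragraphs.join("\n"); // stands in for the capitalized full text

var result = simulatePerParagraphReplace(paragraphs, str2);
console.log(result.split("\n").length); // 9 lines: 3 paragraphs x 3 copies
```

With a single paragraph the model returns the text once, which matches the observation that the ".*" version worked for a one-paragraph document.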

to change the script so that it would run only on the selected text (e.g., a paragraph) to avoid wiping out the existing formatting in other paragraphs in a google doc, i was inspired by the code i found in Class Range.
i also improve on the regular expression in the variable "reg" so that the beginning of a line or paragraph would also be capitalized:
var reg = /(^|\.)(\s*)[a-z]/g;
below is a script that would capitalize the sentences in a selected text (just run the script cap7, which calls the script cap8):
function cap7() {
  // script to capitalize the beginning of a paragraph and the sentences within.
  // highlight a number of paragraphs, then run cap7, which calls cap8.
  // get the selected text inside a google doc
  var selection = DocumentApp.getActiveDocument().getSelection();
  if (selection) {
    var elements = selection.getRangeElements();
    for (var i = 0; i < elements.length; i++) {
      var element = elements[i];
      // Only modify elements that can be edited as text; skip images and other non-text elements.
      if (element.getElement().editAsText) {
        var text = element.getElement().editAsText();
        // capitalize the sentences inside the selected text
        cap8(text);
      }
    }
  }
}
function cap8(text) {
  // define variable str1
  var str1 = text.getText();
  // Logger.log(str1);
  // define function "replacement" to change the matched pattern to uppercase
  function replacement(match) { return match.toUpperCase(); }
  // beginning of a line or period, followed by zero or more blank spaces
  var reg = /(^|\.)(\s*)[a-z]/g;
  // capitalize sentence; replace regular expression "reg" by the output of function "replacement"
  var str2 = str1.replace(reg, replacement);
  // Logger.log(str2);
  // replace whole text by str2
  text.setText(str2); // WORKING
  return text;
}
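the effect of the two regular expressions can be compared outside google docs. the following is a minimal sketch in plain JavaScript (no DocumentApp calls); the function names capAfterPeriod and capWithStart and the sample string are made up for the comparison.

```javascript
// Uppercase whatever the pattern matched (the period, the spaces and the letter).
function replacement(match) { return match.toUpperCase(); }

// Earlier pattern: a period, one or more whitespace characters, a lowercase letter.
function capAfterPeriod(str) {
  return str.replace(/\.(\s+)[a-z]/g, replacement);
}

// Later pattern: start of the string or a period, zero or more whitespace
// characters, a lowercase letter.
function capWithStart(str) {
  return str.replace(/(^|\.)(\s*)[a-z]/g, replacement);
}

var sample = "capitalize sentences. this is one example.  here is another.";
console.log(capAfterPeriod(sample));
// capitalize sentences. This is one example.  Here is another.
console.log(capWithStart(sample));
// Capitalize sentences. This is one example.  Here is another.
```

one side effect of allowing zero spaces after the period: an abbreviation such as "e.g. foo" comes out as "E.G. Foo" with the second pattern, so text with abbreviations may need a stricter pattern.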
see also my question in the post google doc script, capitalize sentences without removing other attributes.

Related

How can I separate a string by underscore (_) in google spreadsheets using regex?

I need to create some columns from a cell that contains text separated by "_".
The input would be:
campaign1_attribute1_whatever_yes_123421
And the output has to be in different columns (one per field), with no "_" and excluding the final number, as it follows:
campaign1 attribute1 whatever yes
It must be done using a regex formula!
help!
Thanks in advance (and sorry for my english)
=REGEXEXTRACT("campaign1_attribute1_whatever_yes_123421","(("&REGEXREPLACE("campaign1_attribute1_whatever_yes_123421","((_)|(\d+$))",")$1(")&"))")
What this does is replace all the _ with parentheses to create capture groups, while also excluding the digit string at the end, then surround the whole string with parentheses.
We then use REGEXEXTRACT to actually pull the pieces out; the groups automatically push them to their own cells/columns.
To solve this you can use the SPLIT and REGEXREPLACE functions
Solution:
Text - A1 = "campaign1_attribute1_whatever_yes_123421"
Formula - A3 = =SPLIT(REGEXREPLACE(A1,"_+\d*$",""), "_", TRUE)
Explanation:
In cell A3 we use SPLIT(text, delimiter, [split_by_each]). The text in this case is first cleaned with the regex formula =REGEXREPLACE(A1,"_+\d*$","") to remove 123421, which then gives you one column for each word delimited by "_".
A1 = "campaign1_attribute1_whatever_yes_123421"
A2 = REGEXREPLACE(A1,"_+\d*$","") //This gives you: campaign1_attribute1_whatever_yes
A3 = SPLIT(A2, "_", TRUE) //This gives you: campaign1 attribute1 whatever yes, each in a separate column.
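The same two steps can be checked outside Sheets. This is a sketch in plain JavaScript rather than a Sheets formula; the function name splitCampaign is made up, and the regular expression is the one from the formula.

```javascript
function splitCampaign(value) {
  // Step 1 (the REGEXREPLACE part): drop the trailing underscore(s) and digits.
  var trimmed = value.replace(/_+\d*$/, "");
  // Step 2 (the SPLIT part): cut at every underscore.
  return trimmed.split("_");
}

var fields = splitCampaign("campaign1_attribute1_whatever_yes_123421");
console.log(fields.join(" | ")); // campaign1 | attribute1 | whatever | yes
```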
I finally figured it out yesterday in stackoverflow (spanish): https://es.stackoverflow.com/questions/55362/c%C3%B3mo-separo-texto-por-guiones-bajos-de-una-celda-en...
It was simple enough after all...
The reason I asked for a regex-only solution for Google Sheets was that I need to use it in Google Data Studio (which has the same regex functions as spreadsheets)
To get each column just use this regex extract function:
1st column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){0}([^_]*)_')
2nd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){1}([^_]*)_')
3rd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){2}([^_]*)_')
etc...
The only thing that has to be changed in the formula to switch columns is the number inside {}, which is (column number - 1).
If you do not have the final number, just don't put the last "_".
Lastly, remember to do all the calculated fields again, because (for example) it gets an error with CPC, CTR and other Adwords metrics that are calculated automatically.
Hope it helps!
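To see how the number inside {} selects the column, here is a small sketch in plain JavaScript (not Data Studio). The helper name extractField is made up; the pattern is the one from the formulas above, built for a 1-based column number.

```javascript
function extractField(value, column) {
  // Skip (column - 1) "field_" groups, then capture the next field,
  // which must itself be followed by an underscore.
  var pattern = new RegExp("^(?:[^_]*_){" + (column - 1) + "}([^_]*)_");
  var match = pattern.exec(value);
  return match ? match[1] : null;
}

var campaign = "campaign1_attribute1_whatever_yes_123421";
console.log(extractField(campaign, 1)); // campaign1
console.log(extractField(campaign, 3)); // whatever
console.log(extractField(campaign, 4)); // yes
```

Asking for column 5 returns null here, because the final number is not followed by an underscore; that is how the pattern excludes it.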

vim: search, capture & replace on different lines using regex

Relatively new linux/vim/regex user here. I want to use regex to search for a numerical pattern, capture it, and then use the captured value to append a string to the previous line. In other words...I have a file of format:
title: description_id
text: {en: '2. text description'}
I want to capture the values from the text field and append them to the beginning of the title field...to yield something like this:
title: q2_description_id
text: {en: '2. text description'}
I feel like I've come across a way to reference other lines in a search & replace but am having trouble finding that now. Or maybe a macro would be suitable. Any help would be appreciated...thanks!
Perhaps something like:
:%s/\(title: \)\(.*\n\)\(text: \D*\)\(\d*\)/\1q\4_\2\3\4/
Where we are searching for 4 parts:
1) "title: "
2) rest of line and \n
3) "text: " and everything until next digit in line
4) first string of consecutive digits in line
and spitting them back out, with 4) inserted between 1) and 2).
EDIT: Shorter solution by Peter in the comments:
:%s/title: \zs\ze\_.\{-}text: \D*\(\d*\)/q\1_/
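For readers who want to try the four-group idea outside vim, here is a rough equivalent in plain JavaScript. It is a sketch, not a vim command: the function name prefixTitles is made up, and it assumes the title and text lines are directly adjacent.

```javascript
function prefixTitles(text) {
  // $1 = "title: ", $2 = rest of the title line including the newline,
  // $3 = "text: " and everything up to the first digit, $4 = the first run of digits
  return text.replace(/(title: )(.*\n)(text: \D*)(\d*)/g, "$1q$4_$2$3$4");
}

var input = "title: description_id\ntext: {en: '2. text description'}";
console.log(prefixTitles(input));
// title: q2_description_id
// text: {en: '2. text description'}
```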
Use \n for the new lines (and ^v+enter for new lines on the substitute line): A quick and not very elegant example:
:%s/title: description_id\n\ntext: {en: '\(\i*\)\(.*\)/title: q\1_description_id^Mtext: {en: '\1\2/

Regexextract over multiple lines within one cell

In Google Sheets, I have this in one cell:
Random stuff blah blah 123456789
<Surname, Name><123456><A><100><B><200>
<Surname2, Name2><456789><A><300><B><400>
Some more random stuff
And would like to match the strings within <> brackets. With = REGEXEXTRACT(A4, "<(.*)>") I got thus far:
Surname, Name><123456><A><100><B><200
which is nice, but it is only the first line. The desired output would be this (maybe including the <> at the beginning/end, it doesn't really matter):
Surname, Name><123456><A><100><B><200>
<Surname2, Name2><456789><A><300><B><400
or simply:
Surname, Name><123456><A><100><B><200><Surname2, Name2><456789><A><300><B><400
How to get there?
Please try:
=SUBSTITUTE(regexextract(substitute(A4,char(10)," "),"<(.*)>"),"> <",">"&char(10)&"<")
Starting in the middle, the inner substitute replaces line breaks (char(10)) with spaces. This gives regexextract the complete (i.e. multi-line) string to work on, with the same pattern as already familiar to OP. The outer SUBSTITUTE then turns the relevant space (identified as being immediately surrounded by > and <) back into a line break.
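The flatten, extract and restore steps can be mirrored in plain JavaScript to check the logic. This is a sketch, not a Sheets formula; the variable names are made up and the cell content is the example from the question.

```javascript
var cell = [
  "Random stuff blah blah 123456789",
  "<Surname, Name><123456><A><100><B><200>",
  "<Surname2, Name2><456789><A><300><B><400>",
  "Some more random stuff"
].join("\n");

// 1) replace line breaks with spaces so the pattern sees one long line
var flattened = cell.replace(/\n/g, " ");
// 2) greedy match from the first "<" to the last ">"
var inner = flattened.match(/<(.*)>/)[1];
// 3) turn the space between ">" and "<" back into a line break
var restored = inner.replace(/> </g, ">\n<");

console.log(restored);
// Surname, Name><123456><A><100><B><200>
// <Surname2, Name2><456789><A><300><B><400
```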
Google sheets uses RE2 syntax. You can set the multi-line and s flags in order to match multiple lines. The following will match all characters over multiple lines in cell A2.
=REGEXEXTRACT(A2, "(?ms)^(.*)$")
REGEXEXTRACT(A1,"text1(?ms)(.*)text2")
So, in this case:
REGEXEXTRACT(A1,"<(?ms)(.*)>")
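The effect of the s flag can be seen in a small plain-JavaScript sketch. Note the assumption: JavaScript writes the flag after the pattern (/.../s) instead of inline as (?s), so this illustrates the idea rather than the exact RE2 syntax used in Sheets.

```javascript
var cell = [
  "Random stuff blah blah 123456789",
  "<Surname, Name><123456><A><100><B><200>",
  "<Surname2, Name2><456789><A><300><B><400>",
  "Some more random stuff"
].join("\n");

// Without the s flag, "." stops at the line break: only the first line is captured.
var firstLineOnly = cell.match(/<(.*)>/)[1];
// With the s flag, "." also matches line breaks: the capture spans both lines.
var acrossLines = cell.match(/<(.*)>/s)[1];

console.log(firstLineOnly);
// Surname, Name><123456><A><100><B><200
console.log(acrossLines);
// Surname, Name><123456><A><100><B><200>
// <Surname2, Name2><456789><A><300><B><400
```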

RichTextBox search'n'replace results are staggered

I am currently trying to generate colored results after a search containing keywords. My code displays a richtextbox containing a text that was successfully hit by the search engine.
Now I want to highlight the keywords in the text, by making them bold and colored in red. I have my list of words in a nice string table, which I browse this way (rtb is my RichTextBox, plainText is the only Run from rtb, containing the entire text of it) :
rtb.SelectAll();
string allText = rtb.Selection.Text;
string expression = "";
foreach (string word in words)
{
    expression = Regex.Escape(word);
    Regex regExp = new Regex(expression);
    foreach (Match match in regExp.Matches(allText))
    {
        TextPointer start = plainText.ContentStart.GetPositionAtOffset(match.Index, LogicalDirection.Forward);
        TextPointer end = plainText.ContentStart.GetPositionAtOffset(match.Index + match.Length, LogicalDirection.Forward);
        rtb.Selection.Select(start, end);
        rtb.Selection.ApplyPropertyValue(Run.FontWeightProperty, FontWeights.Bold);
        rtb.Selection.ApplyPropertyValue(Run.ForegroundProperty, "red");
    }
}
Now I thought this would do the trick. But somehow, only the first word gets highlighted correctly. Then the second highlight starts too early: the correct number of letters gets highlighted, but a few characters before the actual word. For the third occurrence it starts even more characters earlier, etc.
Have you got any idea what is causing this behavior?
EDIT (01/07/2013): I still cannot figure out why these results are staggered... So far I noticed that if I create a variable set to zero right before the second foreach statement, add it to each TextPointer's position and increment it by 4 (no idea why) at the end of each loop, the results are colored correctly. Nevertheless, if I search for two keywords or more (it doesn't matter if they're the same size), each occurrence of the first keyword gets colored correctly, but only the first occurrences of the other keywords are colored correctly (the others are staggered again). Here's the edited code:
rtb.SelectAll();
string allText = rtb.Selection.Text;
string expression = "";
foreach (string word in words)
{
    expression = Regex.Escape(word);
    Regex regExp = new Regex(expression);
    int i = 0;
    foreach (Match match in regExp.Matches(allText))
    {
        TextPointer start = plainText.ContentStart.GetPositionAtOffset(match.Index + i, LogicalDirection.Forward);
        TextPointer end = plainText.ContentStart.GetPositionAtOffset(match.Index + match.Length + i, LogicalDirection.Forward);
        rtb.Selection.Select(start, end);
        rtb.Selection.ApplyPropertyValue(Run.FontWeightProperty, FontWeights.Bold);
        rtb.Selection.ApplyPropertyValue(Run.ForegroundProperty, "red");
        i += 4; // number found out from trials
    }
}
Alright! So I learned by reading this question that every time I modify the style, it adds 4 characters to the text, which is what was messing up my positions.
To fix this, since I may have multiple keywords that do not appear in the text in the order in which they were typed in the search box, I first had to browse my text to locate each occurrence of each keyword without modifying the text. For each occurrence, I store in a custom list the start position, end position and desired color.
When this pass is done, I order my occurrence list by the start attribute of each member. I can now be sure that each occurrence I browse in my foreach loop is the next one in the text, regardless of its content or length. And I know which color I want it to appear in, so I can distinguish different keywords.
Then, finally, I can browse each member of my ordered list and modify the style of my text, knowing that the next word will appear later in the text, so I must add 4 characters to my index at the end of each loop.
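The collect, sort and shift steps can be sketched independently of the RichTextBox. The following is a plain-JavaScript model, not WPF code: the function names are made up, and the shift of 4 positions per styled occurrence is the value found by trial in this thread, passed in as a parameter rather than assumed to be universal.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(word) {
  return word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Pass 1 and 2: locate every occurrence of every keyword without touching
// the text, then order the occurrences by their start position.
function collectOccurrences(text, keywords) {
  var occurrences = [];
  keywords.forEach(function (word) {
    if (!word) { return; } // an empty keyword would match everywhere
    var pattern = new RegExp(escapeRegExp(word), "g");
    var match;
    while ((match = pattern.exec(text)) !== null) {
      occurrences.push({ word: word, start: match.index, end: match.index + match[0].length });
    }
  });
  occurrences.sort(function (a, b) { return a.start - b.start; });
  return occurrences;
}

// Pass 3: shift each occurrence by what the earlier styling steps added.
function withStylingOffset(occurrences, shiftPerStyle) {
  return occurrences.map(function (occ, i) {
    return { word: occ.word, start: occ.start + i * shiftPerStyle, end: occ.end + i * shiftPerStyle };
  });
}

var found = collectOccurrences("red fish blue fish red", ["fish", "red"]);
var shifted = withStylingOffset(found, 4);
console.log(found.map(function (o) { return o.start; }).join(","));   // 0,4,14,19
console.log(shifted.map(function (o) { return o.start; }).join(",")); // 0,8,22,31
```

Keeping the keyword in each entry is what allows a different color per keyword to be looked up when the style is applied.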

Regex regular expression to remove lines which start with certain text

I know it may be quite easy for you.
I have a text which contains 40 lines; I want to remove the lines which start with a constant text.
Please check below data.
When I used (?mn)[\+CMGL:].*($) it removes the whole text , when I use (?mn)[\+CMGL:].*(\r) , it only leaves the first line.
+CMGL: 0,1,,159
07910201956905F0440B910201532762F20008709021225282808
+CMGL: 1,1,,159
07910201956905F0240B910201915589F7000860013222244480
+CMGL: 2,1,,151
07910201956905F0240B910201851177F6000850218122415
+CMGL: 3,1,,159
07910201956905F0440B910201532762F200087090311
+CMGL: 4,1,,159
07910221020020F0440B910221741514F40008802041120481808C050
I want to remove all lines that start with +CMGL and leave only the other lines.
Thanks...
Why do you need Regex for this? String.StartsWith was created for this purpose.
Dim result = lines.Where(Function(l) Not l.StartsWith("+CMGL")).ToList()
Edit: If you don't have "lines" but a text which contains NewLine-characters:
Dim result = text.Split({ControlChars.CrLf, ControlChars.Lf}, StringSplitOptions.None).
Where(Function(l) Not l.StartsWith("+CMGL")).ToList()
If you want it to be converted back to a string:
Dim text = String.Join(Environment.NewLine, result)
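For comparison, the same filtering idea and a regex variant can be sketched in plain JavaScript (not VB.NET; the function names are made up). One point about the patterns in the question: [\+CMGL:] is a character class, so it matches any single one of those characters rather than the literal prefix "+CMGL:". The regex below anchors the literal prefix to the start of each line instead.

```javascript
// Filtering approach: split into lines, drop those starting with "+CMGL".
function removeCmglLines(text) {
  return text
    .split(/\r?\n/)
    .filter(function (line) { return !line.startsWith("+CMGL"); })
    .join("\n");
}

// Regex approach: with the m flag, ^ matches at the start of every line;
// the whole line and its line break are replaced by nothing.
function removeCmglLinesRegex(text) {
  return text.replace(/^\+CMGL.*\r?\n?/gm, "");
}

var sampleText = "+CMGL: 0,1,,159\n0791\n+CMGL: 1,1,,159\n0792";
console.log(removeCmglLines(sampleText));      // the two lines 0791 and 0792
console.log(removeCmglLinesRegex(sampleText)); // same result
```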