Replacing Regex expression that is not supported with Google Script - regex

A short background on what I am trying to achieve: I have a Google Doc and a Google Sheet.
The Google Doc contains text, and the Google Sheet contains two columns: a word and its translation.
The function gets the body of the Google Doc and is supposed to go over the "words" column, find every occurrence of each word in the body, and replace it with its translation, but only where the occurrence is a whole word and an exact match.
What I want is easier to explain with an example:
Let's say I have the word "pop" and it is translated to "pretty". I want the function to replace the word except in cases like:
pop's
allpop
popping
etc.
So, as mentioned, it should be replaced only if it is an exact match and a whole word.
This is the function. The regex itself works fine; the problem is that it is not supported by Google Script. I couldn't come up with a replacement regex that works there and still meets my requirements.
I attach the code so that, in case something is unclear, you can see what I mean if you're familiar with regex.
function replaceText(body, words, origin, translated) {
  for (var i = 0; i < words.length; i++) {
    var word = words[i][origin - 1];
    var translation = words[i][translated - 1]; // defined before its first use below
    var regex = RegExp("(?:\\b)" + word + "\\b(?!\\')", 'gi');
    Logger.log(body.getText().match(regex));
    Logger.log(body.replaceText(regex, translation));
    var foundElement = body.replaceText(regex, translation);
  }
  return body;
}
Also, if you're interested, here is the list of regex syntax supported by Google Script:
https://github.com/google/re2/wiki/Syntax

First, (?:\\b) should just be \\b. The word boundary is zero-width anyway, so wrapping it in a non-capturing group adds nothing.
Second, I understand that your issue is specifically with replaceText. The line body.getText().match(regex); uses a regular JavaScript string method, which supports the usual regexes. The issue is that you need replaceText, and that one is different.
Third, replaceText does not take a regular expression object as a parameter: its arguments are strings. Check the docs again.
Finally, since we don't want to treat ' as a word boundary and don't have lookahead support, a solution is to escape ' by replacing it with a weird enough alphanumeric string that won't occur naturally. At the end, replace back.
function translate() {
  var body = DocumentApp.getActiveDocument().getBody();
  var escape = "uJKiy5hzXNUWFDl7k2pSZoDZ8ipv6LR1ArTi6gXu"; // from https://www.random.org/strings/?num=2&len=20&digits=on&upperalpha=on&loweralpha=on&unique=on&format=html&rnd=new
  body.replaceText("'", escape);
  // the loop would begin here
  var word = "pop";
  body.replaceText("(?i)\\b" + word + "\\b", "translation");
  // loop would end here.
  body.replaceText(escape, "'");
}
Note that the case-insensitive flag is (?i), and that replacement in replaceText is always global.
And watch out for curly apostrophes: if they need special treatment too, escape them similarly but using some other random string.
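To connect this back to the loop in the question, here is a minimal sketch of how the pieces might fit together. The helper names (escapeRegexLiteral, wholeWordPattern, replaceAllWords) and the placeholder parameter are made up for illustration, and words is assumed to be a 2D array of the kind returned by a sheet range. Escaping the word is an extra precaution not mentioned above: it only matters if a sheet entry contains regex metacharacters.

```javascript
// Hypothetical helper: backslash-escape regex metacharacters so that a word
// from the sheet is matched literally.
function escapeRegexLiteral(word) {
  return word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build the pattern string handed to replaceText.
// (?i) makes it case-insensitive; \b on both sides requires a whole word.
function wholeWordPattern(word) {
  return '(?i)\\b' + escapeRegexLiteral(word) + '\\b';
}

// Assumption: `words` is a 2D array (one row per word), `origin` and
// `translated` are 1-based column numbers, and `placeholder` is a long
// alphanumeric string that does not occur in the document.
function replaceAllWords(body, words, origin, translated, placeholder) {
  body.replaceText("'", placeholder);   // hide apostrophes so "pop's" is no longer "pop" + boundary
  for (var i = 0; i < words.length; i++) {
    var word = String(words[i][origin - 1]);
    var translation = String(words[i][translated - 1]);
    if (word) {
      body.replaceText(wholeWordPattern(word), translation);
    }
  }
  body.replaceText(placeholder, "'");   // restore apostrophes
  return body;
}
```

In a script bound to the document, body would come from DocumentApp.getActiveDocument().getBody(), as in translate() above.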

Related

Regex to match text from multiple links

How to extract links which contain a certain word?
For e.g.:
https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text
How to search "word" from below regex?
((https:).*?(###))
The result should be like this
https://www.test.com/text/word/2
https://www.test.com/text/text/word/3
https://www.test.com/word/3/text/text
Let's try to build such regex. First we need to find the beginning of url:
/(https?:\/\//
We add ? after https for http urls.
Then we need to find any text except ###, so we need to add:
(?:(?!###).)*
which means - any amount of characters not starting a ### sequence.
Also we need to add word itself and previous sub-expression again, since word can be surrounded by any text:
word(?:(?!###).)*
Finally, to make it explicit that the match must end right before ### or at the end of the string, we add one more thing:
.(?=###|$)
which means - any character followed by ### or end of string. The final expression will look like:
/(https?:\/\/(?:(?!###).)*word(?:(?!###).)*.(?=###|$))/g
But I believe it's better to just split the text by ### and then check for the needed word with String.prototype.includes.
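As a quick check, the expression can be run against the sample string from the question. This is only a sketch: the variable names are arbitrary, the outer capture group is dropped because String.prototype.match with the g flag returns whole matches anyway, and https? is used so that http URLs would match too.

```javascript
const input = 'https://www.test.com/text/1###https://www.test.com/text/word/2###' +
  'https://www.test.com/text/text/word/3###https://www.test.com/3/text###' +
  'https://www.test.com/word/3/text/text';

// (?:(?!###).)* consumes characters without ever stepping onto a ### separator,
// so a single match cannot span two URLs.
const pattern = /https?:\/\/(?:(?!###).)*word(?:(?!###).)*.(?=###|$)/g;

const matches = input.match(pattern) || [];
console.log(matches);
// [ 'https://www.test.com/text/word/2',
//   'https://www.test.com/text/text/word/3',
//   'https://www.test.com/word/3/text/text' ]
```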
If the word has to be a part of the pathname, you might use filter in combination with URL and check if the parts of the pathname contain word.
let str = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
let filteredUrls = str.split("###")
  .filter(s =>
    new URL(s).pathname
      .split('/')
      .includes('word')
  );
console.log(filteredUrls);
If you want to use regex only and possessive quantifiers are supported (The javascript tag has been removed) you might use:
https?://[^#w]*(?:#(?!##)|w(?!ord)|[^#w]*)++word.*?(?=###|$)
Previous answer
You are probably looking for this regular expression:
https://www.test.com/(text/)*word/\d+(/text)*
Here is how you can use it in a JavaScript context (every slash / is escaped with a backslash: \/):
var str = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
var urls = str.match(/https:\/\/www.test.com\/(text\/)*word\/\d+(\/text)*/g);
console.log(urls);
In the array you get exactly the elements you wanted.
Update, after the question was edited and the author added a comment
If you need to extract the words themselves from your example string, you have to use a slightly more complex regular expression:
var str = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
var urls = str.match(/(?<=\/)\w+(?=\/\d+\/\w)|(?<=(\w\/\w+\/))\w+(?=\/\d)/g);
console.log(urls);
Explanation
Here the regular expression /(?<=\/)\w+(?=\/\d+\/\w)|(?<=(\w\/\w+\/))\w+(?=\/\d)/g is delimited by /.../ and carries the g flag, which makes the search return every occurrence rather than only the first.
The regular expression has two alternatives ...|...
The first one (?<=\/)\w+(?=\/\d+\/\w) captures cases when the searched word is directly behind the slash (?<=\/) and before more words behind the number (?=\/\d+\/\w).
https://www.test.com/word/3/text/text
The second alternative (?<=(\w\/\w+\/))\w+(?=\/\d) captures cases where the word is preceded by other words following the domain (?<=(\w\/\w+\/)) (in fact two slashes separated by alphanumeric characters) and the searched word is immediately before the slash followed by the number (?=\/\d).
https://www.test.com/text/word/2
https://www.test.com/text/text/word/3
All slashes must be escaped: \/.
The construction (?<=...) means lookbehind in regular expressions and (?=...) means lookahead in regular expressions.
Note 1. At the time of writing, the above example only works in the Chrome browser, because:
(...) now lookbehind is part of the ECMAScript 2018 specification. As of this writing (late 2018), Google's Chrome browser is the only popular JavaScript implementation that supports lookbehind. So if cross-browser compatibility matters, you can't use lookbehind in JavaScript.
Note 2. Even where lookbehind is supported, most regular expression engines require it to contain a fixed-length pattern. The example above does not respect that restriction, but it is still valid in engines that allow variable-length lookbehind, such as Google Chrome's JavaScript engine, the JGsoft engine and the .NET framework RegEx classes.
Note 3. The lookbehind syntax, or its weaker substitute \K, is supported by many regular expression engines across a large group of programming languages.
More explanation about regular expressions which I used you can find for example here.
You may first split by ### then check whether /word/ exists in each element:
var s = 'https://www.test.com/text/1###https://www.test.com/text/word/2###https://www.test.com/text/text/word/3###https://www.test.com/3/text###https://www.test.com/word/3/text/text';
var result = [];
s.split(/###/).forEach(function(el) {
  if (el.includes('/word/'))
    result.push(el);
});
// or else by using filter
// result = s.split(/###/).filter(el => el.includes('/word/'))
console.log(result);

How to get the string that start after the last > by regular expression?

I am writing C# code that reads a web page and extracts content from it.
I spent a lot of time working out how to get at the content, and now I am stuck on this:
<i class="icon"></i><a href="https://www.nytimes.com/2017/09/12/us/irma-storm-updates.html">Latest Updates: 90 Percent of Houses in Florida Keys Are Damaged
I want to get only "Latest Updates: 90 Percent of Houses in Florida Keys Are Damaged".
I have used "(?<=\">)(.*)" to extract some content successfully, but it does not work for all of it.
So how can I use a regular expression to say that I want the text that starts after the last '>'?
Thank you.
If the substring that you want to match appears after the last > then the main thing you know about it is that it does not contain a >. This is matched with [^>]. If the string must contain at least one character then you'll want to use + as the quantifier; if it's allowed to be empty then you'll want to use * to allow for zero matches. Finally, you need to match the full remainder of the text, up to the end of the line, which you do with a $.
So the full expression is [^>]*$ (or [^>]+$ if it can't be zero length).
If you also want to require that the preceding text does contain a >, you can make it a bit more complicated by using a positive lookbehind, (?<=\>). This says to find a > but not include it in the match. (The backslash in \> is optional, since > is not a regex metacharacter, but it is harmless.) The final expression would then be (?<=\>)[^>]*$. C# strings also use \ for escaping, so in a regular string literal you have to double the backslash before passing the pattern to the Regex constructor. So it becomes new Regex("(?<=\\>)[^>]*$").
The simpler version, [^>]*$, is probably sufficient for your needs.
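To see the simpler pattern in action, here is a small sketch. It is written in JavaScript only to keep it short and runnable; [^>]*$ behaves the same way in .NET when RegexOptions.Multiline is not set. The function name textAfterLastGt is made up for this example.

```javascript
// Returns everything after the last '>' in s, or the whole string if there is no '>'.
function textAfterLastGt(s) {
  // [^>]* cannot contain '>', and $ forces the match to reach the end of the
  // input, so the leftmost possible match starts right after the last '>'.
  const m = s.match(/[^>]*$/);
  return m ? m[0] : '';
}

const html = '<i class="icon"></i><a href="https://www.nytimes.com/2017/09/12/us/irma-storm-updates.html">' +
  'Latest Updates: 90 Percent of Houses in Florida Keys Are Damaged';

console.log(textAfterLastGt(html));
// Latest Updates: 90 Percent of Houses in Florida Keys Are Damaged
```

If the input ends with >, the result is an empty string, which is the case where [^>]+$ would report no match instead.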
Finally, I would add that parsing XML or HTML with regular expressions is usually the wrong thing to do because there are lots of edge cases, and you will have to make assumptions about the formatting. For example, based on your example text, I assumed you are searching up to the end of the input text. It's usually better to parse XML with an XML parser, which won't have these problems.
This is the regex you need (there is a working example on RegexStorm.net):
>([^<>]+)
This says: find a closing angle bracket followed by text that doesn't include angle brackets. The [^<>] matches any character (letters, numbers, whitespace and so on) that is NOT an opening or closing angle bracket. The parentheses around [^<>]+ capture that text as a separate group. The + means one or more such characters.
Here is a C# example that uses it. The text you want is in the first capture group, Groups[1] (Groups[0] is the whole match, including the >).
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string text = "<i class=\"icon\"></i><a href=\"https://www.nytimes.com/2017/09/12/us/irma-storm-updates.html\">Latest Updates: 90 Percent of Houses in Florida Keys Are Damaged";
        Regex regex = new Regex(">([^<>]+)");
        // Matches never returns null, so no null check is needed.
        foreach (Match m in regex.Matches(text))
        {
            // Groups[0] is the whole match; Groups[1] is the captured text.
            Console.WriteLine(m.Groups[1].Value);
        }
    }
}
RegexStorm.net is a good .Net test site. Regex101.com is a good site to learn different Regex tools.

Regex wrapping word

Regex example
How can I exclude the first space in every match?
The same regex: (?:^|\W)#(\w+)(?!\w)
Is this what you're looking for?
http://regexr.com/3ca98
From the information you gave us until now, this regex should also be sufficient: #(\w+)(?!\w).
But maybe there's more to it than we know. What did you want to achieve with the (?:^|\W)?
Edit: Thinking about what you probably want to achieve, it occurred to me that you might want to match your pattern only if it's not in the middle of another word (e.g. test#case). You probably don't want to match this.
To exclude such cases, you have to ensure that there is either a whitespace character or nothing at all in front of it.
I assume you use JavaScript, because regexr.com does, and sadly there is no regex lookbehind available in JavaScript's regex implementation. So there is no direct way to assert that only whitespace or nothing precedes your pattern.
One solution would be to work with capture groups. Take this regex:
(?:^|\s+)(#\w+)
It looks for one or more whitespace characters, or the start of a line, in front of your pattern, but doesn't use a capture group for that. Your pattern comes next, and it is the first capture group in the whole expression.
To use this in JavaScript, you need to create a RegExp object, call its exec function until there are no more matches, and save the first capture group of each match to a result array.
JS code:
var txt = text.innerHTML;
var re = /(?:^|\s+)(#\w+)/g;
var res = [];
var tmpresult = [];
while ((tmpresult = re.exec(txt)) !== null) {
  res.push(tmpresult[1]); // push first capture group to result stack
}
result.innerHTML = JSON.stringify(res, null, 2);
JSFiddle: https://jsfiddle.net/j41tw4hm/1/
Updated regexr.com: http://regexr.com/3ca9n

Regular Expression for phrases starting with TO

I am pretty new to regular expressions. I want to write a regular expression that gets TO followed by the rest of the phrase, even when it continues after a new line. I tried this, but it doesn't work properly:
^TO\n?\s?[A-Za-z0-9]\n?[A-Za-z0-9]
It only highlights TO W11 properly, where everything is on one line. For the first item it highlights only TO, and for the third it highlights only the first line. Basically, it doesn't read across the new lines.
Some of my data looks like this:
TO
EXTERNAL
TRAVERSE
TO W11
TO CONTROL
TRAVERSE
I would appreciate if anybody can help me.
Make sure you use a multiline regex:
var options = RegexOptions.Multiline;
foreach (Match match in Regex.Matches(input, pattern, options))
...
More at: http://msdn.microsoft.com/en-us/library/yd1hzczs(v=vs.110).aspx
It looks like your pattern isn't matching because the start of the string is really a space and not the T character. Also, [A-Za-z0-9] matches only one character, and you want the whole word. I used the + to denote that I want one or more matches of those characters.
(TO\n?\s?[A-Za-z0-9]+)
This regex matches "TO EXTERNAL", "TO W11" and "TO CONTROL". Be sure to use the global modifier so that you get all matches, not just the first one.
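A small sketch of that pattern at work on the sample data, shown in JavaScript for brevity (the pattern itself is unchanged, and the g flag plays the role of the global modifier). The sample lines are typed here without leading spaces, which is an assumption about the real data. Whitespace inside each match is collapsed afterwards so that a phrase split across two lines prints on one.

```javascript
const data = ['TO', 'EXTERNAL', 'TRAVERSE', 'TO W11', 'TO CONTROL', 'TRAVERSE'].join('\n');

// TO, then an optional newline, an optional whitespace character,
// and one or more alphanumeric characters (the whole next word).
const pattern = /TO\n?\s?[A-Za-z0-9]+/g;

const phrases = (data.match(pattern) || [])
  .map(p => p.replace(/\s+/g, ' ')); // turn "TO\nEXTERNAL" into "TO EXTERNAL"

console.log(phrases);
// [ 'TO EXTERNAL', 'TO W11', 'TO CONTROL' ]
```

Note that this picks up only the first word after TO; the trailing TRAVERSE lines are not part of any match.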

Regular expression to find specific text within a string enclosed in two strings, but not the entire string

I have this type of text:
string1_dog_bit_johny_bit_string2
string1_cat_bit_johny_bit_string2
string1_crocodile_bit_johny_bit_string2
string3_crocodile_bit_johny_bit_string4
string4_crocodile_bit_johny_bit_string5
I want to find all occurrences of “bit” that occur only between string1 and string2. How do I do this with regex?
I found the question Regex Match all characters between two strings, but the regex there matches the entire string between string1 and string2, whereas I want to match just parts of that string.
I am doing a global replacement in Notepad++. I just need regex, code will not work.
Thank you in advance.
Roman
If I understand correctly, here is some code that does what you want:
var input = new List<string>
{
    "string1_dog_bit_johny_bit_string2",
    "string1_cat_bit_johny_bit_string2",
    "string1_crocodile_bit_johny_bit_string2",
    "string3_crocodile_bit_johny_bit_string4",
    "string4_crocodile_bit_johny_bit_string5"
};
// @"..." is a verbatim string literal, so backslashes would not need doubling.
Regex regex = new Regex(@"(?<bitGroup>bit)");
var allMatches = new List<string>();
foreach (var str in input)
{
    // Only look for "bit" in lines enclosed by string1 ... string2.
    if (str.StartsWith("string1") && str.EndsWith("string2"))
    {
        var matchCollection = regex.Matches(str);
        allMatches.AddRange(matchCollection.Cast<Match>().Select(match => match.Groups["bitGroup"].Value));
    }
}
Console.WriteLine("All matches {0}", allMatches.Count);
This regex will do the job:
^string1_(?:.*(bit))+.*_string2$
^ means the start of the text (or line if you use the m option like so: /<regex>/m )
$ means the end of the text
. means any character
* means the previous character/expression is repeated 0 or more times
(?:<stuff>) means a non-capturing group (<stuff> won't be captured as a result of the matching)
You could use ^string1_(.*(bit).*)*_string2$ if you don't care about performance or don't have large/many strings to check. The outer parentheses allow multiple occurrences of "bit".
If you provide us with the language you want to use, we could give more specific solutions.
edit: As you added that you're trying a replacement in Notepad++ I propose the following:
Use (?<=string1_)(.*)bit(.*)(?=_string2) as the regex and $1xyz$2 as the replacement pattern (replace xyz with your string). Then perform a "replace all" operation repeatedly until N++ doesn't find any more matches. The problem here is that this regex will only match one bit per line per iteration, and therefore needs to be applied repeatedly.
Btw., even if a regexp matches the whole line, you can still replace only parts of it by using capturing groups.
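The repeated "replace all" can be simulated outside Notepad++ to see why several passes are needed. The following JavaScript sketch is an illustration only: the function name replaceBetween and the pass limit are made up, and it assumes an engine with lookbehind support and a replacement that contains neither "bit" nor "$".

```javascript
// Replace every "bit" that sits between "string1_" and "_string2" on a line.
function replaceBetween(text, replacement) {
  const pattern = /(?<=string1_)(.*)bit(.*)(?=_string2)/g;
  let previous;
  let passes = 0;
  do {
    previous = text;
    // Each pass rewrites only the last remaining "bit" on every matching line,
    // because the first (.*) is greedy.
    text = text.replace(pattern, '$1' + replacement + '$2');
    passes++;
  } while (text !== previous && passes < 100); // the limit guards against a replacement containing "bit"
  return text;
}

const sample = [
  'string1_dog_bit_johny_bit_string2',
  'string1_cat_bit_johny_bit_string2',
  'string1_crocodile_bit_johny_bit_string2',
  'string3_crocodile_bit_johny_bit_string4',
  'string4_crocodile_bit_johny_bit_string5'
].join('\n');

console.log(replaceBetween(sample, 'xyz'));
```

The first three lines end up with both occurrences replaced after two passes, while the last two lines are left untouched.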
You can use the regex:
(?:string1|\G)(?:(?!string2).)*?\Kbit
regex101 demo. I tried it in Notepad++ as well and it works.
There is a description on the demo site, but if you want more explanation, let me know and I'll elaborate!